Groq¶

What it is¶

Groq is an AI infrastructure company that developed the Language Processing Unit (LPU), a processor designed specifically for fast LLM inference.

What problem it solves¶

Groq addresses slow LLM inference, a common bottleneck in LLM applications. Its low response latency makes real-time applications and highly interactive agents practical.

Where it fits in the stack¶

Inference Provider / Infrastructure. It provides a high-speed API for popular open-weights models (Llama, Mixtral, Gemma).

Typical use cases¶

Real-time Agents: Voice assistants or interactive chatbots that require sub-second response times.
High-Volume Processing: Summarizing or analyzing large quantities of text at hundreds of tokens per second.
Interactive Coding: Powering coding assistants where immediate, fluid feedback is essential.

Getting started¶

Install the SDK:

pip install groq

Basic API call (Python):

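from groq import Groq

# With no arguments, the client reads the API key from the GROQ_API_KEY environment variable.
client = Groq()

# Send a single-turn chat request and wait for the full response.
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain the importance of low latency in AI.",
        }
    ],
    # Model IDs are retired over time; check Groq's model list if this one is rejected.
    model="llama3-70b-8192",
)

# The generated text is in the first choice's message.
print(chat_completion.choices[0].message.content)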

Technical examples¶

High-Speed Streaming¶

Because tokens are generated quickly, streamed output appears almost continuously. Set stream=True and print each piece of text as it arrives.

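from groq import Groq

# Reads the API key from the GROQ_API_KEY environment variable.
client = Groq()

# stream=True returns an iterator of incremental chunks instead of one full response.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Write a 500-word story about a fast computer."}],
    model="llama3-70b-8192",
    stream=True,
)

for chunk in stream:
    # Some chunks (for example the final one) carry no text, so skip empty deltas.
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)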

LPU Performance Context¶

LLMs generate text one token at a time, and each token depends on the ones before it. This sequential step is the primary bottleneck in LLM inference, and the LPU is designed around it.

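Tokens per Second (TPS): Reported throughput is 800+ TPS on Llama 3 8B and 250+ TPS on Llama 3 70B; actual figures vary with model, prompt length, and load.
LPU vs GPU: GPUs are built for massively parallel workloads such as graphics, whereas the LPU is optimized for the serial nature of token-by-token text generation. Its design targets the "memory wall" (time spent waiting on memory rather than computing) that limits generation speed on conventional hardware.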
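The throughput and latency figures above can be checked against your own workload. The following is a minimal sketch, not an official benchmark: `measure_stream` and `StreamStats` are hypothetical helpers, and the code counts streamed text pieces, which only approximates tokens because a streamed chunk is not guaranteed to be exactly one token.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamStats(NamedTuple):
    ttft_s: float        # seconds until the first non-empty piece arrived
    total_s: float       # seconds until the stream ended
    pieces: int          # number of non-empty text pieces received
    pieces_per_s: float  # pieces / total_s, a rough stand-in for tokens/sec


def measure_stream(
    pieces: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamStats:
    """Consume a stream of text pieces and time it (hypothetical helper)."""
    start = clock()
    first = None
    count = 0
    for piece in pieces:
        if not piece:
            continue  # ignore empty deltas
        if first is None:
            first = clock()  # first visible text: time-to-first-token
        count += 1
    total = clock() - start
    ttft = (first - start) if first is not None else total
    rate = count / total if total > 0 else 0.0
    return StreamStats(ttft, total, count, rate)
```

To use it with the streaming example above, pass a generator over the deltas, for example `measure_stream(chunk.choices[0].delta.content or "" for chunk in stream)`. Start the timer before the request is sent if you want the measurement to include network and queueing time.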

Strengths¶

Extreme Speed: Often 10x+ faster than traditional GPU-based providers, with throughput in the hundreds of tokens per second (roughly 250 to 800+, depending on model size).
Open Model Support: Focuses on leading open-weights models such as Llama 3 and Mixtral.
Low Latency: Very low time-to-first-token (TTFT), in addition to high overall throughput.
Pricing Tiers: Provides a generous Free tier for development and prototyping, alongside competitive usage-based On-Demand pricing.

Limitations¶

Model Selection: Limited to the open models they have specifically optimized for their LPU hardware.
Context Window: Historically offered smaller context windows than the major cloud providers, though this is expanding rapidly.

When to use it¶

When response speed is the absolute top priority.
For "agentic" workflows where an agent makes many sequential, recursive LLM calls.
When using Llama or Mistral models and looking for the fastest possible user experience.
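The agentic case is where generation speed compounds, because each call must finish before the next one starts. A rough back-of-the-envelope sketch, assuming every call generates the same number of tokens: `chain_latency_s` is a hypothetical helper, 250 tokens/sec is the Llama 3 70B figure quoted above, and 25 tokens/sec is an illustrative 10x-slower baseline rather than a measurement of any provider.

```python
def chain_latency_s(
    calls: int,
    tokens_per_call: int,
    tokens_per_s: float,
    ttft_s: float = 0.0,
) -> float:
    """Estimated wall-clock seconds for LLM calls run strictly one after another."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    # Each call costs its time-to-first-token plus the time to generate its output.
    per_call = ttft_s + tokens_per_call / tokens_per_s
    return calls * per_call


# An agent that makes 10 sequential calls, each generating 300 tokens.
fast = chain_latency_s(10, 300, tokens_per_s=250)  # about 12 seconds
slow = chain_latency_s(10, 300, tokens_per_s=25)   # about 120 seconds
print(f"fast: {fast:.0f}s, slow: {slow:.0f}s")
```

The estimate ignores tool execution and network time, so it is a lower bound. Its purpose is to show that a per-call speedup multiplies across the whole chain.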

When not to use it¶

If you need proprietary models like GPT-4 or Claude 3.5.
For extremely large context tasks (e.g., 200k+ tokens) that may exceed current LPU memory limits.
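A quick pre-flight check can flag prompts that are unlikely to fit before a request is sent. The following is a minimal sketch under stated assumptions: `fits_context` is a hypothetical helper, the 4-characters-per-token ratio is a common rule of thumb for English text rather than the model's real tokenizer, and the 8192 limit is assumed from the `-8192` suffix of the model ID used earlier.

```python
def fits_context(
    prompt: str,
    context_window: int,
    reserve_for_output: int = 512,
    chars_per_token: float = 4.0,
) -> bool:
    """Rough check that a prompt plus the expected output fits the context window."""
    # Heuristic estimate only; use the model's tokenizer for an exact count.
    estimated_prompt_tokens = int(len(prompt) / chars_per_token) + 1
    return estimated_prompt_tokens + reserve_for_output <= context_window


short_doc = "word " * 800      # about 4,000 characters
long_doc = "word " * 8000      # about 40,000 characters
print(fits_context(short_doc, context_window=8192))
print(fits_context(long_doc, context_window=8192))
```

When the check fails, the usual options are to split the input into chunks, summarize it first, or choose a model with a larger window.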
