Benchmarking Amazon Bedrock LLM Latency: A Multi-Model Comparison

Disclaimer: This post focuses strictly on inference latency, not model quality, accuracy, or reasoning performance. A model that responds in 400ms is not “better” than one that responds in 2 seconds. It is simply faster to start streaming tokens. The right model for your use case depends on many factors beyond latency. This benchmark measures one thing: how quickly each model delivers its first token and completes a response through the Bedrock ConverseStream API.

When building real-time AI applications, latency matters. A chatbot that takes 5 seconds to start responding feels broken. One that responds in 400ms feels instant. But which Amazon Bedrock model actually delivers the lowest inference latency? And do features like Cross-Region Inference (CRIS) and prompt caching make a measurable difference?

I built a benchmarking tool to find out, testing six models across multiple runs with streaming responses. Here is what I learned.

The Setup

I tested six models available on Amazon Bedrock, each with different infrastructure characteristics:

Model              Model ID                                     CRIS  Prompt Caching
Amazon Nova Micro  us.amazon.nova-micro-v1:0                    ✅    ✅ (min 1,024 tokens)
Claude Haiku 4.5   us.anthropic.claude-haiku-4-5-20251001-v1:0  ✅    ✅ (min 4,096 tokens)
DeepSeek V3.2      deepseek.v3.2                                ❌    ❌
MiniMax M2.5       minimax.minimax-m2.5                         ❌    ❌
Qwen3 Next 80B     qwen.qwen3-next-80b-a3b                      ❌    ❌
GPT OSS 20B        openai.gpt-oss-20b-1:0                       ❌    ❌

CRIS (Cross-Region Inference) routes requests across multiple AWS regions for better availability. Prompt caching stores static prompt prefixes (with a 5-minute TTL) to reduce latency and cost on repeated calls.

I discovered the model IDs and CRIS profiles programmatically using the AWS CLI:

# Find foundation models
aws bedrock list-foundation-models \
  --query "modelSummaries[?contains(modelId, 'nova-micro')]"

# Find CRIS inference profiles
aws bedrock list-inference-profiles \
  --query "inferenceProfileSummaries[?contains(inferenceProfileId, 'nova-micro')]"

Nova Micro’s CRIS profile (us.amazon.nova-micro-v1:0) routes across us-east-1, us-west-2, and us-east-2.

Measuring What Matters: TTFT and Total Time

This is a latency benchmark, not a model evaluation. We are not comparing how well these models answer questions. We are comparing how fast the Bedrock infrastructure delivers tokens to the client.

For streaming applications, the two key latency metrics are:

  • Time to First Token (TTFT): how long until the user sees the first word appear
  • Total Time: end-to-end response completion time

I used the ConverseStream API to measure both:

import time
import boto3
from botocore.config import Config

session = boto3.Session(profile_name="default")
client = session.client(
    "bedrock-runtime",
    region_name="us-west-2",
    config=Config(tcp_keepalive=True, retries={"max_attempts": 2}),
)

start = time.perf_counter()
response = client.converse_stream(
    modelId="us.amazon.nova-micro-v1:0",
    system=[{"text": "You are a helpful assistant."}],
    messages=[{"role": "user", "content": [{"text": "Your prompt here"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.1, "topP": 0.9},
)

ttft = None
for event in response["stream"]:
    if "contentBlockDelta" in event:
        if ttft is None:
            ttft = (time.perf_counter() - start) * 1000
        text = event["contentBlockDelta"]["delta"].get("text", "")
        print(text, end="", flush=True)
    elif "metadata" in event:
        usage = event["metadata"].get("usage", {})  # token counts and cache read/write stats

total_time = (time.perf_counter() - start) * 1000
print(f"\nTTFT: {ttft:.0f}ms | Total: {total_time:.0f}ms")

The key detail: time.perf_counter() gives sub-millisecond precision, and we capture TTFT at the exact moment the first contentBlockDelta event arrives.
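Collapsing the per-run samples into the reported summary stats is straightforward. Here is a sketch; `summarize` is a hypothetical helper, not part of the benchmark tool itself:

```python
from statistics import mean

def summarize(runs_ms: list[float]) -> dict:
    """Collapse per-run TTFT samples (in ms) into summary stats."""
    return {"avg": round(mean(runs_ms)), "min": min(runs_ms), "max": max(runs_ms)}

# Nova Micro's three measured TTFT runs:
print(summarize([549, 334, 339]))  # {'avg': 407, 'min': 334, 'max': 549}
```

Feeding Nova Micro's three runs through this reproduces the 407ms average TTFT reported in the results.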

The Results

  • Region: us-west-2
  • Runs per model: 3
  • System prompt: ~672 tokens

Latency Summary

Model             Avg TTFT  Avg Total  CRIS  Cache
Nova Micro        407ms     616ms      Y     Y
GPT OSS 20B       475ms     1,297ms    N     N
MiniMax M2.5      572ms     1,955ms    N     N
Qwen3 Next 80B    606ms     2,080ms    N     N
Claude Haiku 4.5  1,553ms   1,970ms    Y     N*
DeepSeek V3.2     5,038ms   11,232ms   N     N

*Claude Haiku 4.5 requires a minimum of 4,096 tokens per cache checkpoint. Our ~672-token prompt was below that threshold, so caching never activated.

Per-Run TTFT Breakdown

Model             Run 1 TTFT  Run 2 TTFT  Run 3 TTFT
Nova Micro        549ms       334ms       339ms
GPT OSS 20B       429ms       540ms       456ms
MiniMax M2.5      526ms       500ms       982ms
Qwen3 Next 80B    1,112ms     575ms       586ms
Claude Haiku 4.5  1,971ms     916ms       4,057ms
DeepSeek V3.2     17,653ms    1,223ms     4,983ms

The variance tells a story. Let’s dig in.

Finding 1: Nova Micro Has the Lowest Latency

Nova Micro delivered the lowest average TTFT (407ms) and total time (616ms). On warm runs with cache hits, it dropped to 334ms, the fastest we observed from any model.

This is expected from a latency perspective: it is the smallest model in the test, it has CRIS for cross-region routing, and prompt caching was actively working (672 tokens cached and read back). Smaller models have less computation to perform before producing the first token.

Finding 2: Cold Start Variance Is the Real Latency Problem

The most striking finding was not average latency but variance. Look at DeepSeek V3.2: Run 1 took 17.6 seconds, Run 2 took 1.2 seconds. That is a 14x difference.

MiniMax M2.5 showed similar behavior in earlier tests (7.3s to 613ms, a 12x swing).
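The spread is easy to quantify from the per-run numbers above (a quick sketch over the reported figures):

```python
# Per-run TTFT samples (ms) from the breakdown table
runs = {
    "DeepSeek V3.2": [17653, 1223, 4983],
    "Nova Micro": [549, 334, 339],
}
for model, ttfts in runs.items():
    # max/min ratio captures worst-case vs best-case latency per model
    print(f"{model}: {max(ttfts) / min(ttfts):.1f}x spread")
# DeepSeek V3.2: 14.4x spread
# Nova Micro: 1.6x spread
```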

Models with CRIS support (Nova Micro, Claude Haiku) showed much more consistent latency across runs. CRIS routes requests to whichever region has available capacity, smoothing out the cold start problem.

For models without CRIS (DeepSeek, MiniMax, Qwen, GPT OSS), you are at the mercy of on-demand scheduling. If GPU capacity is not immediately available, you wait.
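One mitigation worth considering (my own suggestion, not something this benchmark tested): measure TTFT against a budget and retry once when an attempt comes back slow. A minimal sketch with an injectable `invoke` function standing in for the Bedrock stream:

```python
import time

def stream_with_ttft_budget(invoke, budget_ms=2000, max_attempts=2):
    """Yield chunks from invoke(), retrying if the first chunk arrives
    over budget. Note: this measures TTFT only after the blocking call
    returns; true pre-emption would need a thread or async timeout."""
    for attempt in range(max_attempts):
        start = time.perf_counter()
        stream = invoke()  # e.g. a generator over converse_stream text deltas
        first = next(stream, None)
        ttft_ms = (time.perf_counter() - start) * 1000
        last_attempt = attempt == max_attempts - 1
        if first is not None and (ttft_ms <= budget_ms or last_attempt):
            yield first
            yield from stream
            return
    raise RuntimeError("no tokens received on any attempt")

# Usage with a stand-in stream:
print("".join(stream_with_ttft_budget(lambda: iter(["Hello", " world"]))))
```

The retry only helps when cold starts are transient scheduling delays, which matches the run-to-run pattern we saw; it doubles cost on slow attempts.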

Finding 3: GPT OSS 20B Has Surprisingly Consistent Latency

Despite having no CRIS or caching support, GPT OSS 20B delivered remarkably stable TTFT: 429ms, 540ms, 456ms across three runs. No cold start spikes. At 475ms average TTFT, it was the second fastest model in our latency tests.

Finding 4: Prompt Caching Requires Planning

Prompt caching in the Converse API works by placing cachePoint markers in your system prompt or messages:

system = [
    {"text": "Your long system prompt here..."},
    {"cachePoint": {"type": "default"}},  # Cache everything above
]

messages = [
    {
        "role": "user",
        "content": [
            {"text": "User question"},
            {"cachePoint": {"type": "default"}},  # Cache this too
        ],
    }
]

response = client.converse_stream(
    modelId="us.amazon.nova-micro-v1:0",
    system=system,
    messages=messages,
    inferenceConfig={"maxTokens": 256, "temperature": 0.1},
)

The response metadata tells you what happened:

for event in response["stream"]:
    if "metadata" in event:
        usage = event["metadata"]["usage"]
        print(f"Cache read:  {usage.get('cacheReadInputTokens', 0)}")
        print(f"Cache write: {usage.get('cacheWriteInputTokens', 0)}")

But there is a catch: each model has a minimum token threshold per cache checkpoint.

Model             Minimum Tokens per Checkpoint
Nova Micro        1,024
Claude Haiku 4.5  4,096

Our ~672-token system prompt was enough for Nova Micro (which cached successfully) but fell short for Claude Haiku 4.5 (which silently skipped caching). The inference still succeeds; you just do not get the caching benefit.
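Because the skip is silent, it can help to sanity-check prompt length before relying on a cache hit. A rough sketch; the ~4 characters-per-token ratio is my own heuristic, and the authoritative counts come back in the response usage metadata:

```python
# Per-checkpoint minimums for the two caching-capable models in this test
CACHE_MIN_TOKENS = {
    "us.amazon.nova-micro-v1:0": 1024,
    "us.anthropic.claude-haiku-4-5-20251001-v1:0": 4096,
}

def likely_to_cache(model_id: str, prompt: str) -> bool:
    """Rough pre-check: does the prompt clear the model's cache minimum?"""
    minimum = CACHE_MIN_TOKENS.get(model_id)
    if minimum is None:
        return False  # model has no prompt caching support
    approx_tokens = len(prompt) // 4  # crude ~4 chars/token heuristic
    return approx_tokens >= minimum

# A ~2,700-character (~675-token) prompt falls short of Haiku's 4,096 minimum:
print(likely_to_cache("us.anthropic.claude-haiku-4-5-20251001-v1:0", "x" * 2700))  # False
```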

Another gotcha: CRIS can route consecutive requests to different regions, and the cache is per-region. In our tests, Nova Micro Run 3 sometimes got a cache miss because CRIS routed it to a different region than Run 2.

Finding 5: Connection Reuse Cuts 25% Off TTFT

A separate benchmark isolating connection overhead showed significant impact:

Connection State      TTFT
Cold (first request)  451ms
Warm (reused)         340ms

That is a 25% reduction just from reusing the TCP connection. The fix is simple:

from botocore.config import Config

client = session.client(
    "bedrock-runtime",
    region_name="us-west-2",
    config=Config(tcp_keepalive=True),
)

In production, keep your boto3 client alive across requests. Do not create a new client per invocation.
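One way to guarantee reuse inside a request handler (e.g. a Lambda invoked per message) is to memoize the client per region. This is a sketch of the pattern only; the placeholder object stands in for the real boto3 client so the example stays self-contained:

```python
import functools

@functools.lru_cache(maxsize=None)
def get_client(region: str):
    """Create the client once per region, then reuse it on every call.
    In real code the body would be:
        boto3.Session().client("bedrock-runtime", region_name=region,
                               config=Config(tcp_keepalive=True))
    """
    return object()  # placeholder standing in for the boto3 client

# Same region -> same object -> same warm TCP connection pool
assert get_client("us-west-2") is get_client("us-west-2")
```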

Why ~400ms Is the Latency Floor for On-Demand

Even with every optimization applied (warm connection, prompt cache hit, CRIS), Nova Micro bottomed out at ~340ms TTFT. That ~340-400ms floor comes from four layers that are inherent to on-demand inference:

  1. Network round-trip (~50-100ms): TLS handshake, TCP, HTTP/2 negotiation between your client and the Bedrock endpoint
  2. CRIS routing (~50ms): the cross-region routing decision and inter-region hop
  3. On-demand GPU scheduling (~100-150ms): shared compute pool means your request waits for available capacity
  4. Model prefill (~100ms): processing input tokens before generating the first output token
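Summing the low and high ends of those four layer estimates reproduces the observed floor:

```python
# (low_ms, high_ms) estimates for each layer listed above
layers = {
    "network round-trip": (50, 100),
    "CRIS routing": (50, 50),
    "GPU scheduling": (100, 150),
    "model prefill": (100, 100),
}
low = sum(lo for lo, _ in layers.values())
high = sum(hi for _, hi in layers.values())
print(f"floor: ~{low}-{high}ms")  # floor: ~300-400ms
```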

AWS offers two paths to break below this floor:

  • Latency Optimized Inference (preview): set performanceConfig.latency = "optimized" for supported models (currently Nova Pro, Claude 3.5 Haiku, Llama 3.1 70B/405B)
  • Provisioned Throughput: dedicated GPU capacity eliminates scheduling delay entirely

Neither Nova Micro nor the other models in our benchmark currently support latency optimized inference.
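For the models that do support it, opting in is a single extra field on the request. A sketch only; the Claude 3.5 Haiku model ID here is my assumption, so verify availability in your region before using it:

```python
# Request body for a latency-optimized call; pass to client.converse_stream(**request)
request = {
    "modelId": "us.anthropic.claude-3-5-haiku-20241022-v1:0",  # assumed ID, verify in your region
    "messages": [{"role": "user", "content": [{"text": "Hello"}]}],
    "performanceConfig": {"latency": "optimized"},  # default is "standard"
}
```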

TTFT Optimization Cheat Sheet

Strategy                          Impact                                 Effort
Connection reuse (tcp_keepalive)  ~25% TTFT reduction                    One line of config
Use CRIS inference profiles       Consistent latency, fewer cold starts  Change model ID
Prompt caching                    Reduced cost + modest latency gain     Add cachePoint markers
Region proximity                  Lower network RTT                      Deploy in same region
Latency Optimized Inference       Significant TTFT reduction             Add performanceConfig (limited models)
Provisioned Throughput            Eliminates cold starts                 Higher cost commitment

Conclusion

To be clear one more time: this benchmark measures inference latency only. It says nothing about which model produces better answers, handles complex reasoning, or follows instructions more accurately. Those are separate evaluations entirely.

With that said, if you need the lowest possible TTFT on Amazon Bedrock today, Nova Micro with CRIS and prompt caching delivered the best latency numbers at ~400ms average. GPT OSS 20B came in second at ~475ms with impressive consistency despite no CRIS support.

The biggest takeaway is not about any single model. It is that cold start variance on on-demand inference is the dominant factor in perceived latency. A model that averages 500ms but occasionally spikes to 17 seconds (like DeepSeek V3.2) will feel worse to users than one that consistently delivers 600ms. CRIS support is the single most impactful feature for taming that variance.

For production workloads where sub-300ms TTFT is required, on-demand inference will not get you there regardless of model choice. You will need Provisioned Throughput or Latency Optimized Inference when it becomes generally available for more models.

This post is licensed under CC BY 4.0 by the author.