Benchmarking Amazon Bedrock LLM Latency: A Multi-Model Comparison

Disclaimer: This post focuses strictly on inference latency, not model quality, accuracy, or reasoning performance. A model that responds in 400ms is not “better” than one that responds in 2 seconds. It is simply faster to start streaming tokens. The right model for your use case depends on many factors beyond latency. This benchmark measures one thing: how quickly each model delivers its first token and completes a response through the Bedrock ConverseStream API.

When building real-time AI applications, latency matters. A chatbot that takes 5 seconds to start responding feels broken. One that responds in 400ms feels instant. But which Amazon Bedrock model actually delivers the lowest inference latency? And do features like Cross-Region Inference (CRIS) and prompt caching make a measurable difference?

I built a benchmarking tool to find out, testing six models across multiple runs with streaming responses. Here is what I learned.

The Setup

I tested six models available on Amazon Bedrock, each with different infrastructure characteristics:

Model              Model ID                                     CRIS  Prompt Caching
Amazon Nova Micro  us.amazon.nova-micro-v1:0                    ✅    ✅ (min 1,024 tokens)
Claude Haiku 4.5   us.anthropic.claude-haiku-4-5-20251001-v1:0  ✅    ✅ (min 4,096 tokens)
DeepSeek V3.2      deepseek.v3.2                                ❌    ❌
MiniMax M2.5       minimax.minimax-m2.5                         ❌    ❌
Qwen3 Next 80B     qwen.qwen3-next-80b-a3b                      ❌    ❌
GPT OSS 20B        openai.gpt-oss-20b-1:0                       ❌    ❌

CRIS (Cross-Region Inference) routes requests across multiple AWS regions for better availability. Prompt caching stores static prompt prefixes (with a 5-minute TTL) to reduce latency and cost on repeated calls.

I discovered the model IDs and CRIS profiles programmatically using the AWS CLI:

# Find foundation models
aws bedrock list-foundation-models \
  --query "modelSummaries[?contains(modelId, 'nova-micro')]"

# Find CRIS inference profiles
aws bedrock list-inference-profiles \
  --query "inferenceProfileSummaries[?contains(inferenceProfileId, 'nova-micro')]"

Nova Micro’s CRIS profile (us.amazon.nova-micro-v1:0) routes across us-east-1, us-west-2, and us-east-2.

Measuring What Matters: TTFT and Total Time

This is a latency benchmark, not a model evaluation. We are not comparing how well these models answer questions. We are comparing how fast the Bedrock infrastructure delivers tokens to the client.

For streaming applications, the two key latency metrics are:

  • Time to First Token (TTFT): how long until the user sees the first word appear
  • Total Time: end-to-end response completion time

I used the ConverseStream API to measure both:

import time
import boto3
from botocore.config import Config

session = boto3.Session(profile_name="default")
client = session.client(
    "bedrock-runtime",
    region_name="us-west-2",
    config=Config(tcp_keepalive=True, retries={"max_attempts": 2}),
)

start = time.perf_counter()
response = client.converse_stream(
    modelId="us.amazon.nova-micro-v1:0",
    system=[{"text": "You are a helpful assistant."}],
    messages=[{"role": "user", "content": [{"text": "Your prompt here"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.1, "topP": 0.9},
)

ttft = None
for event in response["stream"]:
    if "contentBlockDelta" in event:
        if ttft is None:
            ttft = (time.perf_counter() - start) * 1000
        text = event["contentBlockDelta"]["delta"].get("text", "")
        print(text, end="", flush=True)
    elif "metadata" in event:
        usage = event["metadata"].get("usage", {})  # token counts and cache read/write stats

total_time = (time.perf_counter() - start) * 1000
print(f"\nTTFT: {ttft:.0f}ms | Total: {total_time:.0f}ms")

The key detail: time.perf_counter() gives sub-millisecond precision, and we capture TTFT at the exact moment the first contentBlockDelta event arrives.
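Collapsing the per-run samples into the reported summary stats is straightforward. Here is a sketch; `summarize` is a hypothetical helper, not part of the benchmark tool itself:

```python
from statistics import mean

def summarize(runs_ms: list[float]) -> dict:
    """Collapse per-run TTFT samples (in ms) into summary stats."""
    return {"avg": round(mean(runs_ms)), "min": min(runs_ms), "max": max(runs_ms)}

# Nova Micro's three measured TTFT runs:
print(summarize([549, 334, 339]))  # {'avg': 407, 'min': 334, 'max': 549}
```

Feeding Nova Micro's three runs through this reproduces the 407ms average TTFT reported in the results.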

The Results

  • Region: us-west-2
  • Runs per model: 3
  • System prompt: ~672 tokens

Latency Summary

Model             Avg TTFT  Avg Total  CRIS  Cache
Nova Micro        407ms     616ms      Y     Y
GPT OSS 20B       475ms     1,297ms    N     N
MiniMax M2.5      572ms     1,955ms    N     N
Qwen3 Next 80B    606ms     2,080ms    N     N
Claude Haiku 4.5  1,553ms   1,970ms    Y     N*
DeepSeek V3.2     5,038ms   11,232ms   N     N

*Claude Haiku 4.5 requires a minimum of 4,096 tokens per cache checkpoint. Our ~672-token prompt was below that threshold, so caching never activated.

Per-Run TTFT Breakdown

Model             Run 1 TTFT  Run 2 TTFT  Run 3 TTFT
Nova Micro        549ms       334ms       339ms
GPT OSS 20B       429ms       540ms       456ms
MiniMax M2.5      526ms       500ms       982ms
Qwen3 Next 80B    1,112ms     575ms       586ms
Claude Haiku 4.5  1,971ms     916ms       4,057ms
DeepSeek V3.2     17,653ms    1,223ms     4,983ms

The variance tells a story. Let’s dig in.

Finding 1: Nova Micro Has the Lowest Latency

Nova Micro delivered the lowest average TTFT (407ms) and total time (616ms). On warm runs with cache hits, it dropped to 334ms, the fastest we observed from any model.

This is expected from a latency perspective: it is the smallest model in the test, it has CRIS for cross-region routing, and prompt caching was actively working (672 tokens cached and read back). Smaller models have less computation to perform before producing the first token.

Finding 2: Cold Start Variance Is the Real Latency Problem

The most striking finding was not average latency but variance. Look at DeepSeek V3.2: Run 1 took 17.6 seconds, Run 2 took 1.2 seconds. That is a 14x difference.

MiniMax M2.5 showed similar behavior in earlier tests (7.3s to 613ms, a 12x swing).
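The spread is easy to quantify from the per-run numbers above (a quick sketch over the reported figures):

```python
# Per-run TTFT samples (ms) from the breakdown table
runs = {
    "DeepSeek V3.2": [17653, 1223, 4983],
    "Nova Micro": [549, 334, 339],
}
for model, ttfts in runs.items():
    # max/min ratio captures worst-case vs best-case latency per model
    print(f"{model}: {max(ttfts) / min(ttfts):.1f}x spread")
# DeepSeek V3.2: 14.4x spread
# Nova Micro: 1.6x spread
```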

Models with CRIS support (Nova Micro, Claude Haiku) showed much more consistent latency across runs. CRIS routes requests to whichever region has available capacity, smoothing out the cold start problem.

For models without CRIS (DeepSeek, MiniMax, Qwen, GPT OSS), you are at the mercy of on-demand scheduling. If GPU capacity is not immediately available, you wait.
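One mitigation worth considering (my own suggestion, not something this benchmark tested): measure TTFT against a budget and retry once when an attempt comes back slow. A minimal sketch with an injectable `invoke` function standing in for the Bedrock stream:

```python
import time

def stream_with_ttft_budget(invoke, budget_ms=2000, max_attempts=2):
    """Yield chunks from invoke(), retrying if the first chunk arrives
    over budget. Note: this measures TTFT only after the blocking call
    returns; true pre-emption would need a thread or async timeout."""
    for attempt in range(max_attempts):
        start = time.perf_counter()
        stream = invoke()  # e.g. a generator over converse_stream text deltas
        first = next(stream, None)
        ttft_ms = (time.perf_counter() - start) * 1000
        last_attempt = attempt == max_attempts - 1
        if first is not None and (ttft_ms <= budget_ms or last_attempt):
            yield first
            yield from stream
            return
    raise RuntimeError("no tokens received on any attempt")

# Usage with a stand-in stream:
print("".join(stream_with_ttft_budget(lambda: iter(["Hello", " world"]))))
```

The retry only helps when cold starts are transient scheduling delays, which matches the run-to-run pattern we saw; it doubles cost on slow attempts.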

Finding 3: GPT OSS 20B Has Surprisingly Consistent Latency

Despite having no CRIS or caching support, GPT OSS 20B delivered remarkably stable TTFT: 429ms, 540ms, 456ms across three runs. No cold start spikes. At 475ms average TTFT, it was the second fastest model in our latency tests.

Finding 4: Prompt Caching Requires Planning

Prompt caching in the Converse API works by placing cachePoint markers in your system prompt or messages:

system = [
    {"text": "Your long system prompt here..."},
    {"cachePoint": {"type": "default"}},  # Cache everything above
]

messages = [
    {
        "role": "user",
        "content": [
            {"text": "User question"},
            {"cachePoint": {"type": "default"}},  # Cache this too
        ],
    }
]

response = client.converse_stream(
    modelId="us.amazon.nova-micro-v1:0",
    system=system,
    messages=messages,
    inferenceConfig={"maxTokens": 256, "temperature": 0.1},
)

The response metadata tells you what happened:

for event in response["stream"]:
    if "metadata" in event:
        usage = event["metadata"]["usage"]
        print(f"Cache read:  {usage.get('cacheReadInputTokens', 0)}")
        print(f"Cache write: {usage.get('cacheWriteInputTokens', 0)}")

But there is a catch: each model has a minimum token threshold per cache checkpoint.

Model             Minimum Tokens per Checkpoint
Nova Micro        1,024
Claude Haiku 4.5  4,096

Our ~672-token system prompt was enough for Nova Micro (which cached successfully) but fell short for Claude Haiku 4.5 (which silently skipped caching). The inference still succeeds; you just do not get the caching benefit.
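Because the skip is silent, it can help to sanity-check prompt length before relying on a cache hit. A rough sketch; the ~4 characters-per-token ratio is my own heuristic, and the authoritative counts come back in the response usage metadata:

```python
# Per-checkpoint minimums for the two caching-capable models in this test
CACHE_MIN_TOKENS = {
    "us.amazon.nova-micro-v1:0": 1024,
    "us.anthropic.claude-haiku-4-5-20251001-v1:0": 4096,
}

def likely_to_cache(model_id: str, prompt: str) -> bool:
    """Rough pre-check: does the prompt clear the model's cache minimum?"""
    minimum = CACHE_MIN_TOKENS.get(model_id)
    if minimum is None:
        return False  # model has no prompt caching support
    approx_tokens = len(prompt) // 4  # crude ~4 chars/token heuristic
    return approx_tokens >= minimum

# A ~2,700-character (~675-token) prompt falls short of Haiku's 4,096 minimum:
print(likely_to_cache("us.anthropic.claude-haiku-4-5-20251001-v1:0", "x" * 2700))  # False
```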

Another gotcha: CRIS can route consecutive requests to different regions, and the cache is per-region. In our tests, Nova Micro Run 3 sometimes got a cache miss because CRIS routed it to a different region than Run 2.

Finding 5: Connection Reuse Cuts 25% Off TTFT

A separate benchmark isolating connection overhead showed significant impact:

Connection State      TTFT
Cold (first request)  451ms
Warm (reused)         340ms

That is a 25% reduction just from reusing the TCP connection. The fix is simple:

from botocore.config import Config

client = session.client(
    "bedrock-runtime",
    region_name="us-west-2",
    config=Config(tcp_keepalive=True),
)

In production, keep your boto3 client alive across requests. Do not create a new client per invocation.
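One way to guarantee reuse inside a request handler (e.g. a Lambda invoked per message) is to memoize the client per region. This is a sketch of the pattern only; the placeholder object stands in for the real boto3 client so the example stays self-contained:

```python
import functools

@functools.lru_cache(maxsize=None)
def get_client(region: str):
    """Create the client once per region, then reuse it on every call.
    In real code the body would be:
        boto3.Session().client("bedrock-runtime", region_name=region,
                               config=Config(tcp_keepalive=True))
    """
    return object()  # placeholder standing in for the boto3 client

# Same region -> same object -> same warm TCP connection pool
assert get_client("us-west-2") is get_client("us-west-2")
```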

Why ~400ms Is the Latency Floor for On-Demand

Even with every optimization applied (warm connection, prompt cache hit, CRIS), Nova Micro bottomed out at ~340ms TTFT. That ~340-400ms floor comes from four layers that are inherent to on-demand inference:

  1. Network round-trip (~50-100ms): TLS handshake, TCP, HTTP/2 negotiation between your client and the Bedrock endpoint
  2. CRIS routing (~50ms): the cross-region routing decision and inter-region hop
  3. On-demand GPU scheduling (~100-150ms): shared compute pool means your request waits for available capacity
  4. Model prefill (~100ms): processing input tokens before generating the first output token
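Summing the low and high ends of those four layer estimates reproduces the observed floor:

```python
# (low_ms, high_ms) estimates for each layer listed above
layers = {
    "network round-trip": (50, 100),
    "CRIS routing": (50, 50),
    "GPU scheduling": (100, 150),
    "model prefill": (100, 100),
}
low = sum(lo for lo, _ in layers.values())
high = sum(hi for _, hi in layers.values())
print(f"floor: ~{low}-{high}ms")  # floor: ~300-400ms
```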

AWS offers two paths to break below this floor:

  • Latency Optimized Inference (preview): set performanceConfig.latency = "optimized" for supported models (currently Nova Pro, Claude 3.5 Haiku, Llama 3.1 70B/405B)
  • Provisioned Throughput: dedicated GPU capacity eliminates scheduling delay entirely

Neither Nova Micro nor the other models in our benchmark currently support latency optimized inference.
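For the models that do support it, opting in is a single extra field on the request. A sketch only; the Claude 3.5 Haiku model ID here is my assumption, so verify availability in your region before using it:

```python
# Request body for a latency-optimized call; pass to client.converse_stream(**request)
request = {
    "modelId": "us.anthropic.claude-3-5-haiku-20241022-v1:0",  # assumed ID, verify in your region
    "messages": [{"role": "user", "content": [{"text": "Hello"}]}],
    "performanceConfig": {"latency": "optimized"},  # default is "standard"
}
```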

TTFT Optimization Cheat Sheet

Strategy                          Impact                                 Effort
Connection reuse (tcp_keepalive)  ~25% TTFT reduction                    One line of config
Use CRIS inference profiles       Consistent latency, fewer cold starts  Change model ID
Prompt caching                    Reduced cost + modest latency gain     Add cachePoint markers
Region proximity                  Lower network RTT                      Deploy in same region
Latency Optimized Inference       Significant TTFT reduction             Add performanceConfig (limited models)
Provisioned Throughput            Eliminates cold starts                 Higher cost commitment

Conclusion

To be clear one more time: this benchmark measures inference latency only. It says nothing about which model produces better answers, handles complex reasoning, or follows instructions more accurately. Those are separate evaluations entirely.

With that said, if you need the lowest possible TTFT on Amazon Bedrock today, Nova Micro with CRIS and prompt caching delivered the best latency numbers at ~400ms average. GPT OSS 20B came in second at ~475ms with impressive consistency despite no CRIS support.

The biggest takeaway is not about any single model. It is that cold start variance on on-demand inference is the dominant factor in perceived latency. A model that averages 500ms but occasionally spikes to 17 seconds (like DeepSeek V3.2) will feel worse to users than one that consistently delivers 600ms. CRIS support is the single most impactful feature for taming that variance.

For production workloads where sub-300ms TTFT is required, on-demand inference will not get you there regardless of model choice. You will need Provisioned Throughput or Latency Optimized Inference when it becomes generally available for more models.

This post is licensed under CC BY 4.0 by the author.