# Benchmarking Amazon Bedrock LLM Latency: A Multi-Model Comparison
**Disclaimer:** This post focuses strictly on inference latency, not model quality, accuracy, or reasoning performance. A model that responds in 400ms is not “better” than one that responds in 2 seconds. It is simply faster to start streaming tokens. The right model for your use case depends on many factors beyond latency. This benchmark measures one thing: how quickly each model delivers its first token and completes a response through the Bedrock ConverseStream API.
When building real-time AI applications, latency matters. A chatbot that takes 5 seconds to start responding feels broken. One that responds in 400ms feels instant. But which Amazon Bedrock model actually delivers the lowest inference latency? And do features like Cross-Region Inference (CRIS) and prompt caching make a measurable difference?
I built a benchmarking tool to find out, testing six models across multiple runs with streaming responses. Here is what I learned.
## The Setup
I tested six models available on Amazon Bedrock, each with different infrastructure characteristics:
| Model | Model ID | CRIS | Prompt Caching |
|---|---|---|---|
| Amazon Nova Micro | `us.amazon.nova-micro-v1:0` | ✅ | ✅ (min 1K tokens) |
| Claude Haiku 4.5 | `us.anthropic.claude-haiku-4-5-20251001-v1:0` | ✅ | ✅ (min 4,096 tokens) |
| DeepSeek V3.2 | `deepseek.v3.2` | ❌ | ❌ |
| MiniMax M2.5 | `minimax.minimax-m2.5` | ❌ | ❌ |
| Qwen3 Next 80B | `qwen.qwen3-next-80b-a3b` | ❌ | ❌ |
| GPT OSS 20B | `openai.gpt-oss-20b-1:0` | ❌ | ❌ |
CRIS (Cross-Region Inference) routes requests across multiple AWS regions for better availability. Prompt caching stores static prompt prefixes (5-minute TTL) to reduce latency and cost on repeated calls.
I discovered the model IDs and CRIS profiles programmatically using the AWS CLI:
```bash
# Find foundation models
aws bedrock list-foundation-models \
  --query "modelSummaries[?contains(modelId, 'nova-micro')]"

# Find CRIS inference profiles
aws bedrock list-inference-profiles \
  --query "inferenceProfileSummaries[?contains(inferenceProfileId, 'nova-micro')]"
```
Nova Micro’s CRIS profile (us.amazon.nova-micro-v1:0) routes across us-east-1, us-west-2, and us-east-2.
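If you want to enumerate a profile's regions yourself, each entry in the `list-inference-profiles` response carries one foundation-model ARN per region, and the region sits in the fourth colon-separated field. A small parser sketch (the helper name is mine; it assumes the standard `arn:aws:bedrock:<region>::foundation-model/<id>` layout):

```python
def region_from_model_arn(arn: str) -> str:
    """Return the region field of a Bedrock foundation-model ARN.

    Assumes the standard layout: arn:aws:bedrock:<region>::foundation-model/<id>
    """
    parts = arn.split(":")
    if len(parts) < 4 or parts[2] != "bedrock":
        raise ValueError(f"not a Bedrock ARN: {arn!r}")
    return parts[3]


# Example ARN shaped like the entries under "models" in the profile summary
arn = "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-micro-v1:0"
print(region_from_model_arn(arn))  # us-east-1
```

Mapping this over a profile's `models` list reproduces the region set quoted above.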
## Measuring What Matters: TTFT and Total Time
This is a latency benchmark, not a model evaluation. We are not comparing how well these models answer questions. We are comparing how fast the Bedrock infrastructure delivers tokens to the client.
For streaming applications, the two key latency metrics are:
- Time to First Token (TTFT): how long until the user sees the first word appear
- Total Time: end-to-end response completion time
I used the ConverseStream API to measure both:
```python
import time

import boto3
from botocore.config import Config

session = boto3.Session(profile_name="default")
client = session.client(
    "bedrock-runtime",
    region_name="us-west-2",
    config=Config(tcp_keepalive=True, retries={"max_attempts": 2}),
)

start = time.perf_counter()
response = client.converse_stream(
    modelId="us.amazon.nova-micro-v1:0",
    system=[{"text": "You are a helpful assistant."}],
    messages=[{"role": "user", "content": [{"text": "Your prompt here"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.1, "topP": 0.9},
)

ttft = None
for event in response["stream"]:
    if "contentBlockDelta" in event:
        if ttft is None:
            ttft = (time.perf_counter() - start) * 1000  # first token arrived
        text = event["contentBlockDelta"]["delta"].get("text", "")
        print(text, end="", flush=True)
    elif "metadata" in event:
        usage = event["metadata"].get("usage", {})
        total_time = (time.perf_counter() - start) * 1000
        print(f"\nTTFT: {ttft:.0f}ms | Total: {total_time:.0f}ms")
```
The key detail: `time.perf_counter()` gives sub-millisecond precision, and we capture TTFT at the exact moment the first `contentBlockDelta` event arrives.
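To make the timing logic unit-testable across many runs, it helps to factor the event walk into a helper that accepts any iterable of Bedrock-shaped event dicts. A sketch (the function name is mine, not part of the SDK):

```python
import time


def measure_stream(events, start):
    """Consume ConverseStream-shaped events and return (ttft_ms, total_ms).

    Accepts any iterable of event dicts, so it can be tested with canned
    events instead of a live Bedrock call.
    """
    ttft = None
    for event in events:
        if "contentBlockDelta" in event and ttft is None:
            ttft = (time.perf_counter() - start) * 1000
    total = (time.perf_counter() - start) * 1000
    return ttft, total


# Canned events standing in for a real stream
fake_stream = [
    {"messageStart": {}},
    {"contentBlockDelta": {"delta": {"text": "Hello"}}},
    {"metadata": {"usage": {}}},
]
ttft_ms, total_ms = measure_stream(fake_stream, time.perf_counter())
assert ttft_ms is not None and ttft_ms <= total_ms
```

The same helper can wrap the real `response["stream"]` iterator unchanged.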
## The Results

Region: us-west-2 | 3 runs per model | ~672 token system prompt

### Latency Summary
| Model | Avg TTFT | Avg Total | CRIS | Cache |
|---|---|---|---|---|
| Nova Micro | 407ms | 616ms | Y | Y |
| GPT OSS 20B | 475ms | 1,297ms | N | N |
| MiniMax M2.5 | 572ms | 1,955ms | N | N |
| Qwen3 Next 80B | 606ms | 2,080ms | N | N |
| Claude Haiku 4.5 | 1,553ms | 1,970ms | Y | N* |
| DeepSeek V3.2 | 5,038ms | 11,232ms | N | N |
*Claude Haiku 4.5 requires 4,096 minimum tokens per cache checkpoint. Our ~672 token prompt was below the threshold, so caching never activated.
### Per-Run TTFT Breakdown
| Model | Run 1 TTFT | Run 2 TTFT | Run 3 TTFT |
|---|---|---|---|
| Nova Micro | 549ms | 334ms | 339ms |
| GPT OSS 20B | 429ms | 540ms | 456ms |
| MiniMax M2.5 | 526ms | 500ms | 982ms |
| Qwen3 Next 80B | 1,112ms | 575ms | 586ms |
| Claude Haiku 4.5 | 1,971ms | 916ms | 4,057ms |
| DeepSeek V3.2 | 17,653ms | 1,223ms | 4,983ms |
The variance tells a story. Let’s dig in.
## Finding 1: Nova Micro Has the Lowest Latency
Nova Micro delivered the lowest average TTFT (407ms) and total time (616ms). On warm runs with cache hits, it dropped to 334ms, the fastest we observed from any model.
This is expected from a latency perspective: it is the smallest model in the test, it has CRIS for cross-region routing, and prompt caching was actively working (672 tokens cached and read back). Smaller models have less computation to perform before producing the first token.
## Finding 2: Cold Start Variance Is the Real Latency Problem
The most striking finding was not average latency but variance. Look at DeepSeek V3.2: Run 1 took 17.6 seconds, Run 2 took 1.2 seconds. That is a 14x difference.
MiniMax M2.5 showed similar behavior in earlier tests (7.3s to 613ms, a 12x swing).
Models with CRIS support (Nova Micro, Claude Haiku) showed much more consistent latency across runs. CRIS routes requests to whichever region has available capacity, smoothing out the cold start problem.
For models without CRIS (DeepSeek, MiniMax, Qwen, GPT OSS), you are at the mercy of on-demand scheduling. If GPU capacity is not immediately available, you wait.
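One quick way to quantify this is a max/min spread ratio per model, computed straight from the per-run breakdown table (the numbers below are copied from it; the helper is mine):

```python
# Per-run TTFT values from the breakdown table above (ms)
runs_ttft_ms = {
    "Nova Micro": [549, 334, 339],
    "Claude Haiku 4.5": [1971, 916, 4057],
    "DeepSeek V3.2": [17653, 1223, 4983],
}


def spread(runs):
    """Ratio of slowest to fastest run: 1.0 means perfectly stable latency."""
    return max(runs) / min(runs)


for model, runs in runs_ttft_ms.items():
    print(f"{model}: {spread(runs):.1f}x spread")
# DeepSeek V3.2 comes out around 14x, matching the swing described above
```

CRIS-backed Nova Micro stays under a 2x spread; the on-demand models do not.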
## Finding 3: GPT OSS 20B Has Surprisingly Consistent Latency
Despite having no CRIS or caching support, GPT OSS 20B delivered remarkably stable TTFT: 429ms, 540ms, 456ms across three runs. No cold start spikes. At 475ms average TTFT, it was the second fastest model in our latency tests.
## Finding 4: Prompt Caching Requires Planning

Prompt caching in the Converse API works by placing `cachePoint` markers in your system prompt or messages:
```python
system = [
    {"text": "Your long system prompt here..."},
    {"cachePoint": {"type": "default"}},  # cache everything above this marker
]

messages = [
    {
        "role": "user",
        "content": [
            {"text": "User question"},
            {"cachePoint": {"type": "default"}},  # cache the user turn too
        ],
    }
]

response = client.converse_stream(
    modelId="us.amazon.nova-micro-v1:0",
    system=system,
    messages=messages,
    inferenceConfig={"maxTokens": 256, "temperature": 0.1},
)
```
The response metadata tells you what happened:
```python
for event in response["stream"]:
    if "metadata" in event:
        usage = event["metadata"]["usage"]
        print(f"Cache read: {usage.get('cacheReadInputTokens', 0)}")
        print(f"Cache write: {usage.get('cacheWriteInputTokens', 0)}")
```
But there is a catch: each model has a minimum token threshold per cache checkpoint.
| Model | Minimum Tokens |
|---|---|
| Nova Micro | 1,024 |
| Claude Haiku 4.5 | 4,096 |
Our ~672 token system prompt was enough for Nova Micro (which cached successfully) but fell short of Claude Haiku 4.5's threshold (which silently skipped caching). The inference still succeeds; you just do not get the caching benefit.
Another gotcha: CRIS can route consecutive requests to different regions, and the cache is per-region. In our tests, Nova Micro Run 3 sometimes got a cache miss because CRIS routed it to a different region than Run 2.
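When debugging this, it is easier to classify each response's usage block than to eyeball raw counters. A small helper along these lines (my own naming, not an SDK API) works on the same `usage` dict shown earlier:

```python
def cache_status(usage):
    """Classify a Converse usage block as a cache 'hit', 'write', or 'miss'."""
    if usage.get("cacheReadInputTokens", 0) > 0:
        return "hit"
    if usage.get("cacheWriteInputTokens", 0) > 0:
        return "write"
    return "miss"


print(cache_status({"cacheReadInputTokens": 672}))   # hit
print(cache_status({"cacheWriteInputTokens": 672}))  # write
print(cache_status({}))                              # miss: below threshold, expired, or routed elsewhere
```

A "miss" on a repeated prompt is the signal to check the token threshold, the 5-minute TTL, and CRIS routing.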
## Finding 5: Connection Reuse Cuts 25% Off TTFT
A separate benchmark isolating connection overhead showed significant impact:
| Connection State | TTFT |
|---|---|
| Cold (first request) | 451ms |
| Warm (reused) | 340ms |
That is a 25% reduction just from reusing the TCP connection. The fix is simple:
```python
from botocore.config import Config

client = session.client(
    "bedrock-runtime",
    region_name="us-west-2",
    config=Config(tcp_keepalive=True),
)
```
In production, keep your boto3 client alive across requests. Do not create a new client per invocation.
## Why ~400ms Is the Latency Floor for On-Demand
Even with every optimization applied (warm connection, prompt cache hit, CRIS), Nova Micro bottomed out at ~340ms TTFT. That ~340-400ms floor comes from four layers that are inherent to on-demand inference:
- Network round-trip (~50-100ms): TLS handshake, TCP, HTTP/2 negotiation between your client and the Bedrock endpoint
- CRIS routing (~50ms): the cross-region routing decision and inter-region hop
- On-demand GPU scheduling (~100-150ms): shared compute pool means your request waits for available capacity
- Model prefill (~100ms): processing input tokens before generating the first output token
AWS offers two paths to break below this floor:
- Latency Optimized Inference (preview): set `performanceConfig.latency = "optimized"` for supported models (currently Nova Pro, Claude 3.5 Haiku, Llama 3.1 70B/405B)
- Provisioned Throughput: dedicated GPU capacity eliminates scheduling delay entirely
Neither Nova Micro nor the other models in our benchmark currently support latency optimized inference.
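If you do target a supported model, the flag slots into the same request shape used throughout this post. A sketch (the `build_request` helper and the model choice are illustrative, not an SDK API; check the Bedrock docs for the current supported list):

```python
def build_request(model_id, prompt, optimized=False):
    """Assemble kwargs for client.converse_stream(**kwargs)."""
    kwargs = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.1},
    }
    if optimized:
        # Only models with Latency Optimized Inference accept this field
        kwargs["performanceConfig"] = {"latency": "optimized"}
    return kwargs


req = build_request("us.anthropic.claude-3-5-haiku-20241022-v1:0", "Hi", optimized=True)
print(req["performanceConfig"])  # {'latency': 'optimized'}
```

Sending `performanceConfig` to an unsupported model is rejected, which is why the helper only adds the field on request.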
## TTFT Optimization Cheat Sheet
| Strategy | Impact | Effort |
|---|---|---|
| Connection reuse (`tcp_keepalive`) | ~25% TTFT reduction | One line of config |
| Use CRIS inference profiles | Consistent latency, fewer cold starts | Change model ID |
| Prompt caching | Reduced cost + modest latency gain | Add cachePoint markers |
| Region proximity | Lower network RTT | Deploy in same region |
| Latency Optimized Inference | Significant TTFT reduction | Add performanceConfig (limited models) |
| Provisioned Throughput | Eliminates cold starts | Higher cost commitment |
## Conclusion
To be clear one more time: this benchmark measures inference latency only. It says nothing about which model produces better answers, handles complex reasoning, or follows instructions more accurately. Those are separate evaluations entirely.
With that said, if you need the lowest possible TTFT on Amazon Bedrock today, Nova Micro with CRIS and prompt caching delivered the best latency numbers at ~400ms average. GPT OSS 20B came in second at ~475ms with impressive consistency despite no CRIS support.
The biggest takeaway is not about any single model. It is that cold start variance on on-demand inference is the dominant factor in perceived latency. A model that averages 500ms but occasionally spikes to 17 seconds (like DeepSeek V3.2) will feel worse to users than one that consistently delivers 600ms. CRIS support is the single most impactful feature for taming that variance.
For production workloads where sub-300ms TTFT is required, on-demand inference will not get you there regardless of model choice. You will need Provisioned Throughput or Latency Optimized Inference when it becomes generally available for more models.
