API Cost Optimization Strategies
After spending $200+ on GPT-4 API calls last month, I went on a mission to optimize. Here are the strategies that actually work.
The Problem
LLM APIs are priced per token. For context:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $5 | $15 |
| Claude 3.5 | $3 | $15 |
| Gemini 1.5 | $1.25 | $5 |
A typical conversation might use 10k input tokens and 2k output tokens. At GPT-4o rates that is $0.05 + $0.03 = $0.08 per conversation. Doesn't sound like much, but it adds up.
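To make the arithmetic concrete, here is a minimal sketch that prices one conversation using the rates from the table. `PRICES` and `conversation_cost` are illustrative names, the model keys are just labels, and the rates are only the ones quoted in this post, so check current pricing before relying on them.

```python
# USD per 1M tokens as (input, output), copied from the table in this post.
# Assumption: these rates may be out of date.
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "claude-3.5": (3.00, 15.00),
    "gemini-1.5": (1.25, 5.00),
}


def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one conversation for the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    for name in PRICES:
        print(f"{name}: ${conversation_cost(name, 10_000, 2_000):.4f}")
```

For 10k input and 2k output tokens this works out to $0.08 on GPT-4o, $0.06 on Claude 3.5, and about $0.02 on Gemini 1.5.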
Strategy 1: Model Selection
Use the right model for the task:
- Simple classification → GPT-3.5-turbo ($0.50/1M input tokens)
- Draft generation → Gemini 1.5 Flash ($0.035/1M input tokens)
- Complex reasoning → GPT-4o ($5/1M input tokens)
- Code generation → Claude 3.5 Sonnet ($3/1M input tokens)
I built a router that classifies the task and routes accordingly. Simple tasks go to cheap models.
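The post doesn't show the router itself, so the following is only a rough sketch of the idea, not the actual implementation. It assumes a keyword heuristic for classification (a real router might use a small model instead), and the task labels, keyword lists, and model strings are all hypothetical placeholders rather than exact API identifiers.

```python
# Hypothetical task labels mapped to the models named in this section.
MODEL_FOR_TASK = {
    "classification": "gpt-3.5-turbo",
    "draft": "gemini-1.5-flash",
    "reasoning": "gpt-4o",
    "code": "claude-3.5-sonnet",
}

# Crude keyword heuristics (assumption); checked in this order.
KEYWORDS = {
    "code": ("function", "bug", "refactor", "stack trace"),
    "classification": ("classify", "categorize", "label", "sentiment"),
    "draft": ("draft", "write an email", "blog post", "rewrite"),
}


def classify_task(prompt: str) -> str:
    """Guess the task type from keywords; unknown prompts count as reasoning."""
    text = prompt.lower()
    for task, words in KEYWORDS.items():
        if any(word in text for word in words):
            return task
    return "reasoning"


def route(prompt: str) -> str:
    """Return the name of the model a prompt should be sent to."""
    return MODEL_FOR_TASK[classify_task(prompt)]
```

Defaulting unknown prompts to the strongest model is a deliberate choice here: a misrouted hard task costs quality, while a misrouted easy task only costs a few cents.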
Strategy 2: Prompt Compression
Remove redundancy from prompts:
// Before: 500 tokens
const verbosePrompt = `
You are a helpful assistant. Your goal is to assist users
with their questions. Please provide accurate and helpful
responses. Remember to be polite and professional...
[continues for 200 more words]
`;
// After: 100 tokens
const compactPrompt = `Helpful assistant. Answer accurately and concisely.`;
Tools I use:
- LLMLingua — Prompt compression library
- Semantic context extraction — Only keep relevant context
Strategy 3: Caching
Cache frequent patterns:
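import hashlib

import redis

# Assumes a Redis server on localhost; decode_responses=True returns str, not bytes.
cache = redis.Redis(decode_responses=True)

def get_cached_response(prompt: str, model: str) -> str:
    """Return the cached response if present; otherwise call the API and cache it."""
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached is not None:
        return cached
    response = call_api(prompt, model)  # call_api is your own API wrapper
    cache.setex(cache_key, 3600, response)  # 1 hour TTL
    return response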
My use case: customer support FAQs. A 40% hit rate means 40% of requests never reach the API, so roughly 40% savings.
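Before wiring up Redis, it can help to measure what hit rate your traffic would actually get. The class below is a hypothetical in-memory stand-in that uses the same key scheme and counts hits and misses. It has no TTL and no persistence, and `fetch` is a placeholder for whatever function calls the API.

```python
import hashlib
from typing import Callable


class CountingCache:
    """In-memory cache that tracks its own hit rate (sketch, no TTL)."""

    def __init__(self, fetch: Callable[[str, str], str]) -> None:
        self._fetch = fetch  # hypothetical function that calls the API
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, prompt: str, model: str) -> str:
        """Return a cached response, calling fetch only on a miss."""
        key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._fetch(prompt, model)
        self._store[key] = response
        return response

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups served from the cache."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Replaying a day of logged prompts through this gives a quick estimate of the savings to expect before committing to the infrastructure.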
Strategy 4: Batch Processing
Don't make one request at a time:
# Instead of:
for item in items:
    result = call_api(item)  # N requests
# Do:
results = batch_call_api(items)  # 1 request with items array
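OpenAI and Anthropic both offer batch APIs at a 50% discount. They run asynchronously (results come back within 24 hours), so they fit offline jobs rather than live chat.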
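For OpenAI's Batch API, the input is a JSONL file with one request per line. Here is a minimal sketch of building that file. The helper names, the `item-N` ID scheme, and the default model are assumptions made for illustration, and the snippet only prepares the file without submitting anything.

```python
import json


def build_batch_lines(items: list[str], model: str = "gpt-4o") -> list[str]:
    """Return one JSON line per item in the Batch API input format."""
    lines = []
    for index, item in enumerate(items):
        request = {
            "custom_id": f"item-{index}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": item}],
            },
        }
        lines.append(json.dumps(request))
    return lines


def write_batch_file(items: list[str], path: str) -> None:
    """Write the requests to a .jsonl file ready to upload."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_batch_lines(items)) + "\n")
```

The file is then uploaded with purpose `batch` and submitted as a batch job with a 24-hour completion window. Results arrive in an output file keyed by `custom_id`, not necessarily in input order.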
Strategy 5: Output Length Control
Add response length constraints:
// Instead of open-ended:
"Explain photosynthesis"
// Constrained:
"Explain photosynthesis in 2-3 sentences, suitable for a 10-year-old"
This prevents verbose responses that burn tokens.
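A prompt instruction is only a request, so it can be worth pairing it with a hard cap. The sketch below builds an OpenAI-style chat request that does both. `build_request` and its defaults are hypothetical, and it assumes your API accepts a `max_tokens` field (newer endpoints may name it differently).

```python
def build_request(question: str, max_sentences: int = 3, max_tokens: int = 150) -> dict:
    """Build a chat request that limits length in the prompt and via a hard cap."""
    prompt = f"{question} Answer in at most {max_sentences} sentences."
    return {
        "model": "gpt-4o",  # illustrative model name
        "messages": [{"role": "user", "content": prompt}],
        # Hard ceiling on billed output tokens; the reply is cut off if it hits this.
        "max_tokens": max_tokens,
    }
```

Since output tokens cost about 3x as much as input tokens in the pricing table above, trimming the response usually saves more than trimming the prompt by the same amount.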
Strategy 6: Local Fallbacks
Route simple tasks to local models:
routing:
  - condition: "task == 'classify' AND complexity == 'low'"
    model: "local/qwen-9b"
  - condition: "task == 'reason' AND complexity == 'high'"
    model: "openai/gpt-4o"
My setup handles 70% of requests locally.
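One way such rules could be evaluated is sketched below. It mirrors the two routing rules as plain data instead of parsing the condition strings, and the `Rule` class, `pick_model`, and the default of falling back to the hosted model are assumptions, not the actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    task: str
    complexity: str
    model: str


# Mirrors the two routing rules in the config; first match wins.
RULES = [
    Rule(task="classify", complexity="low", model="local/qwen-9b"),
    Rule(task="reason", complexity="high", model="openai/gpt-4o"),
]

# Assumption: anything unmatched goes to the hosted model.
DEFAULT_MODEL = "openai/gpt-4o"


def pick_model(task: str, complexity: str) -> str:
    """Return the model of the first matching rule, or the default."""
    for rule in RULES:
        if rule.task == task and rule.complexity == complexity:
            return rule.model
    return DEFAULT_MODEL
```

Sending unmatched requests to the hosted model keeps quality safe. Flipping the default to the local model would push more traffic locally at the risk of weaker answers.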
Strategy 7: Smart Context Management
Don't include full history when unnecessary:
# Keep only relevant context
messages = [
    {"role": "system", "content": system_prompt},  # Always
    *recent_messages[-5:],  # Last 5 messages, each already a role/content dict
]
Also: summarize older turns and replace them with the summary.
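Both ideas can be combined in one helper. This is a minimal sketch assuming OpenAI-style message dicts. `build_context` is a hypothetical name, and `summarize` is a placeholder for whatever produces the summary (typically a call to a cheap model). Some APIs take the system prompt as a separate parameter, in which case the summary would need to go there instead.

```python
from typing import Callable

Message = dict[str, str]


def build_context(
    system_prompt: str,
    history: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_last: int = 5,
) -> list[Message]:
    """Return system prompt + summary of older turns + the last keep_last turns."""
    split = max(len(history) - keep_last, 0)
    older, recent = history[:split], history[split:]
    messages = [{"role": "system", "content": system_prompt}]
    if older:
        # One short summary message replaces every older turn.
        summary = summarize(older)
        messages.append(
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
        )
    messages.extend(recent)
    return messages
```

The summary itself costs tokens to produce, so it pays off mainly in long conversations where the same old turns would otherwise be resent on every request.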
Results
After implementing all strategies:
| Month | Spend | Strategy |
|---|---|---|
| January | $234 | Baseline |
| February | $189 | Model routing |
| March | $87 | All strategies |
| April | $52 | + Local fallbacks |
That's a 78% reduction.
Tools & Resources
- OpenRouter: Unified API with automatic model comparison
- Helicone: API observability and cost tracking
- LangSmith: Tracing, evaluation, and prompt iteration
- LocalAI: Self-hosted model serving
Conclusion
You don't need to sacrifice quality to save money. Going from $234 to $52 a month means I can run about 4.5x as many requests for the same budget. Start with model routing: it's the highest-impact, lowest-effort change.