API Cost Optimization Strategies

After spending $200+ on GPT-4 API calls last month, I went on a mission to optimize. Here are the strategies that actually work.

The Problem

LLM APIs are priced per token. For context:

Model	Input (per 1M tokens)	Output (per 1M tokens)
GPT-4o	$5	$15
Claude 3.5 Sonnet	$3	$15
Gemini 1.5 Pro	$1.25	$5

A typical conversation might use 10k input tokens and 2k output tokens. At GPT-4o rates that is $0.05 + $0.03 = $0.08 per conversation. That doesn't sound like much, but it adds up.
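To make the arithmetic concrete, here is a small sketch that computes the per-conversation cost for each model in the table. The `PRICES` dict, its keys, and `conversation_cost` are illustrative names, and the prices are simply the ones quoted in this post, so check current pricing before relying on them.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
# The dictionary keys are labels for this sketch, not official model IDs.
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "claude-3.5": (3.00, 15.00),
    "gemini-1.5": (1.25, 5.00),
}


def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one conversation at the listed prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Same workload as in the text: 10k input tokens, 2k output tokens.
    for model in PRICES:
        cost = conversation_cost(model, 10_000, 2_000)
        print(f"{model}: ${cost:.4f} per conversation")
```

Multiplying the result by your daily conversation count gives a quick estimate of where the monthly bill is heading.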

Strategy 1: Model Selection

Use the right model for the task (prices shown are for input tokens):

Simple classification → GPT-3.5-turbo ($0.50/1M)
Draft generation → Gemini 1.5 Flash ($0.035/1M)
Complex reasoning → GPT-4o ($5/1M)
Code generation → Claude 3.5 Sonnet ($3/1M)

I built a router that classifies the task and routes accordingly. Simple tasks go to cheap models.
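A minimal sketch of what such a router could look like. This is not the router described above: the keyword lists are a crude stand-in for a real task classifier, and `MODEL_FOR_TASK`, `KEYWORDS`, `classify_task`, and `route` are all hypothetical names chosen for illustration.

```python
# Task-to-model table mirroring the list above (model names are labels only).
MODEL_FOR_TASK = {
    "classification": "gpt-3.5-turbo",
    "draft": "gemini-1.5-flash",
    "reasoning": "gpt-4o",
    "code": "claude-3.5-sonnet",
}

# Assumption: simple keyword hints stand in for a proper classifier.
KEYWORDS = {
    "classification": ("classify", "categorize", "label", "sentiment"),
    "code": ("function", "bug", "refactor", "implement", "code"),
    "draft": ("draft", "write an email", "blog post", "rewrite"),
}


def classify_task(prompt: str) -> str:
    """Return the first task whose keyword hints appear in the prompt."""
    text = prompt.lower()
    for task, hints in KEYWORDS.items():
        if any(hint in text for hint in hints):
            return task
    # When nothing matches, assume the task is hard and use the strongest model.
    return "reasoning"


def route(prompt: str) -> str:
    """Pick a model name for the prompt."""
    return MODEL_FOR_TASK[classify_task(prompt)]
```

Defaulting to the most capable model when the classifier is unsure trades some cost for safety: a misrouted hard task is usually more expensive than an overpaid easy one.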

Strategy 2: Prompt Compression

Remove redundancy from prompts:

// Before: 500 tokens
const verbosePrompt = `
You are a helpful assistant. Your goal is to assist users
with their questions. Please provide accurate and helpful
responses. Remember to be polite and professional...
[continues for 200 more words]
`;

// After: roughly 10 tokens
const compressedPrompt = `Helpful assistant. Answer accurately and concisely.`;

Tools I use:

LLMLingua — Prompt compression library
Semantic context extraction — Only keep relevant context
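To illustrate the "only keep relevant context" idea, here is a deliberately simple sketch that ranks context chunks by word overlap with the question. This is a lexical stand-in for real semantic extraction, which would normally use embeddings; `select_relevant` and its parameters are hypothetical names, and it does not use LLMLingua.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Keep up to top_k chunks that share the most words with the question."""
    query = _words(question)
    scored = [
        (len(query & _words(chunk)), index, chunk)
        for index, chunk in enumerate(chunks)
    ]
    # Drop chunks with no overlap, then rank by overlap (ties keep input order).
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    # Restore the original order so the kept context still reads naturally.
    keep = sorted(ranked[:top_k], key=lambda s: s[1])
    return [chunk for _, _, chunk in keep]
```

Only the selected chunks go into the prompt, so the input token count scales with `top_k` instead of with the size of the knowledge base.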

Strategy 3: Caching

Cache responses to prompts that repeat:

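import hashlib

import redis

# Assumes a Redis server on localhost. decode_responses=True makes get() return str, not bytes.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_cached_response(prompt: str, model: str) -> str:
    """Return the cached response if present; otherwise call the API and cache the result."""
    cache_key = hashlib.sha256(
        f"{model}:{prompt}".encode()
    ).hexdigest()

    cached = cache.get(cache_key)
    if cached is not None:
        return cached

    # call_api is the app's own wrapper around the provider SDK
    response = call_api(prompt, model)
    cache.setex(cache_key, 3600, response)  # 1 hour TTL
    return response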

My use case: customer support FAQs. A 40% hit rate means 40% fewer API calls, which is roughly 40% savings on that traffic.
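An exact-match cache only hits when the prompt is byte-for-byte identical, so trivial differences in spacing or capitalization count as misses. One possible tweak, sketched here under the assumption that your prompts are not case-sensitive, is to normalize the prompt before hashing. `normalize_prompt` and `cache_key` are hypothetical helper names.

```python
import hashlib
import re


def normalize_prompt(prompt: str) -> str:
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def cache_key(prompt: str, model: str) -> str:
    """Build a cache key that ignores whitespace and case differences."""
    normalized = normalize_prompt(prompt)
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()
```

Skip the lowercasing step if case carries meaning in your prompts, for example when they contain code or identifiers.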

Strategy 4: Batch Processing

Don't make one request at a time:

# Instead of:
for item in items:
    result = call_api(item)  # N requests
    
# Do:
results = batch_call_api(items)  # 1 request with items array

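OpenAI and Anthropic both offer asynchronous batch APIs at a 50% discount. Results come back within 24 hours rather than immediately, so this fits offline jobs (bulk classification, evals, backfills), not interactive chat.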
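As a rough sketch of what preparing a batch looks like, the snippet below builds the JSONL input that the OpenAI Batch API expects, one request object per line. It covers only the input file: uploading it and creating the batch job are left out. `build_batch_lines` and the `item-N` ID scheme are names made up for this example.

```python
import json


def build_batch_lines(items: list[str], model: str = "gpt-4o") -> list[str]:
    """Return one JSON string per item, ready to be written as a .jsonl file."""
    lines = []
    for index, item in enumerate(items):
        request = {
            # custom_id lets you match each result back to its input later.
            "custom_id": f"item-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": item}],
            },
        }
        lines.append(json.dumps(request))
    return lines


if __name__ == "__main__":
    for line in build_batch_lines(["Summarize document A", "Summarize document B"]):
        print(line)
```

Results are not guaranteed to come back in input order, which is why each request carries its own `custom_id`.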

Strategy 5: Output Length Control

Add response length constraints:

// Instead of open-ended:
"Explain photosynthesis"

// Constrained:
"Explain photosynthesis in 2-3 sentences, suitable for a 10-year-old"

This prevents verbose responses that burn tokens.
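A prompt instruction is only a request, so it can help to pair it with a hard cap. The sketch below builds a chat-completions style request body that does both, assuming an API that accepts a `max_tokens` field; `build_request` and its default values are illustrative, not recommendations.

```python
def build_request(question: str, max_sentences: int = 3, max_tokens: int = 150) -> dict:
    """Combine a length instruction in the prompt with a hard output token cap."""
    prompt = f"{question.rstrip()} Answer in at most {max_sentences} sentences."
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
        # Hard ceiling on output tokens: the response is cut off if it runs over.
        "max_tokens": max_tokens,
    }
```

Set the cap comfortably above the length you asked for, since a truncated answer is usually worse than a slightly long one.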

Strategy 6: Local Fallbacks

Route simple tasks to local models:

routing:
  - condition: "task == 'classify' AND complexity == 'low'"
    model: "local/qwen-9b"
    
  - condition: "task == 'reason' AND complexity == 'high'"
    model: "openai/gpt-4o"

My setup handles 70% of requests locally.
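One way to evaluate rules like the ones in that config, sketched in Python. The `Rule` class, `pick_model`, and especially `DEFAULT_MODEL` are assumptions: the config above does not say what happens when no rule matches, so this sketch sends unmatched requests to the hosted model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    task: str
    complexity: str
    model: str


# Mirrors the two routing entries shown above.
RULES = [
    Rule(task="classify", complexity="low", model="local/qwen-9b"),
    Rule(task="reason", complexity="high", model="openai/gpt-4o"),
]

# Assumption: anything that matches no rule goes to the hosted model.
DEFAULT_MODEL = "openai/gpt-4o"


def pick_model(task: str, complexity: str) -> str:
    """Return the model of the first matching rule, or the default."""
    for rule in RULES:
        if rule.task == task and rule.complexity == complexity:
            return rule.model
    return DEFAULT_MODEL
```

Keeping the default on the hosted side means a gap in the rules costs money rather than quality.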

Strategy 7: Smart Context Management

Don't include full history when unnecessary:

# Keep only relevant context
messages = [
    {"role": "system", "content": system_prompt},  # Always
    *recent_messages[-5:],  # Last 5 messages, each already a {"role", "content"} dict
]

Also: summarize older turns and send the summary in place of the full history.
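A sketch of how the summary approach might fit together. The `summarize` argument is a placeholder for whatever produces the summary (typically a call to a cheap model); `compact_history`, `Message`, and the wording of the summary message are all hypothetical.

```python
from typing import Callable

Message = dict[str, str]


def compact_history(
    system_prompt: str,
    history: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_last: int = 5,
) -> list[Message]:
    """Replace all but the last keep_last messages with one summary message."""
    split = max(len(history) - keep_last, 0)
    older, recent = history[:split], history[split:]

    messages = [{"role": "system", "content": system_prompt}]
    if older:
        # Only pay for a summary when there is something to summarize.
        summary = summarize(older)
        messages.append(
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
        )
    return messages + recent
```

The summary call costs tokens too, so in practice you would cache the summary and refresh it every few turns, not on every request.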

Results

After implementing all strategies:

Month	Spend	Strategy
January	$234	Baseline
February	$189	Model routing
March	$87	All except local fallbacks
April	$52	+ Local fallbacks

That's a 78% reduction from the January baseline ($234 down to $52).

Tools & Resources

OpenRouter — Unified API with automatic model comparison
Helicone — API observability and cost tracking
LangSmith — Prompt optimization and caching
LocalAI — Self-hosted model serving

Conclusion

You don't need to sacrifice quality to save money. The strategies above cut my monthly spend from $234 to $52, which works out to roughly 4.5x as many requests for the same budget. Start with model routing: it's the lowest-effort change and it pays off immediately.