Section 5: Model Evaluation
Decision frameworks, benchmark guides, interactive algorithm selector, and cost calculators.
5.1 LLM Evaluation Framework
How to evaluate and compare LLMs for your specific use case.
🎯 Accuracy: Correct answers on domain-specific test cases you create
🧠 Reasoning: Multi-step problem solving (math, logic, policy interpretation)
📋 Instruction Following: Does it output the exact format you need (JSON, markdown)?
📏 Context Length: Does it correctly use information from very long inputs?
🛡️ Safety/Alignment: Appropriate refusals and no hallucinations on critical information
💰 Cost-Performance: Quality per dollar; Gemini Flash often wins here
5.2 Key Benchmarks
Standard evaluations and what they actually measure.
| Benchmark | Measures | Size | Limitations |
|---|---|---|---|
| MMLU | Knowledge across 57 subjects | 14K | Multiple choice only, no reasoning |
| HumanEval | Python code generation | 164 | Small scale, outdated problems |
| GSM8K | Grade school math | 8.5K | Limited difficulty range |
| MATH | Competition mathematics | 12.5K | Narrow domain |
| HellaSwag | Common sense reasoning | 70K | Easy for modern models |
| BIG-Bench | Diverse tasks | 204 tasks | Too broad; task quality and difficulty vary widely |
Benchmark Caveat
Public benchmarks are gamed: models are increasingly trained on benchmark data (contamination). The best evaluation is always a custom test set built from your own real tasks.
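In that spirit, a small harness sketch for comparing models on a custom test set; `call_model` is again a hypothetical API wrapper, and the JSONL layout (fields `prompt` and `expected`) and the model identifiers in the comments are illustrative assumptions.

```python
import json
from pathlib import Path

def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around your API client of choice."""
    raise NotImplementedError

def load_test_set(path: str) -> list[dict]:
    """One JSON object per line, e.g. {"prompt": "...", "expected": "..."}."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]

def evaluate(model_name: str, cases: list[dict]) -> float:
    """Fraction of cases where the expected answer appears in the response."""
    hits = 0
    for case in cases:
        response = call_model(model_name, case["prompt"])
        hits += case["expected"].lower() in response.lower()
    return hits / len(cases)

# cases = load_test_set("my_real_tasks.jsonl")
# for model in ["gpt-4o", "claude-3-5-sonnet", "gemini-2.5-flash"]:
#     print(model, evaluate(model, cases))
```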
5.3 Interactive Algorithm Selector
Answer questions about your problem and get a personalized algorithm recommendation.
The selector's first question: What type of output do you need?
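A hypothetical sketch of the kind of decision logic behind such a selector, starting from that first question; the answer options and the algorithm families they map to are illustrative assumptions, not the guide's actual decision tree.

```python
def recommend_algorithm(output_type: str) -> str:
    """Map the answer to 'What type of output do you need?' to an algorithm family.

    Illustrative mapping only; a real selector would ask follow-up questions
    (data size, interpretability needs, labeled vs. unlabeled data, ...).
    """
    mapping = {
        "category": "Classification (e.g. logistic regression, gradient-boosted trees)",
        "number": "Regression (e.g. linear regression, gradient-boosted trees)",
        "groups": "Clustering (e.g. k-means, HDBSCAN)",
        "free text": "Generative LLM (pick via the evaluation framework in 5.1)",
    }
    return mapping.get(output_type.lower(), "Answer not recognized; try: " + ", ".join(mapping))

# Example: recommend_algorithm("category")
```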
5.4 Token Cost Calculator
Estimate API costs across different models based on your usage patterns.
| Model | Per Request | Monthly | Annual |
|---|---|---|---|
| Gemini 2.5 Flash | $0.0004 | $0.45 | $5 |
| Claude 3.5 Haiku | $0.0028 | $2.80 | $34 |
| Gemini 2.5 Pro | $0.0088 | $8.75 | $105 |
| Claude 3.5 Sonnet | $0.0105 | $10.50 | $126 |
| GPT-4o | $0.0125 | $12.50 | $150 |
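The monthly and annual figures are consistent with roughly 1,000 requests per month (annual = 12 × monthly). A sketch of the underlying arithmetic, with per-million-token prices and token counts passed in as parameters; the prices and token counts in the usage example are placeholders, not current list prices.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a single request, with prices quoted per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

def usage_costs(per_request: float, requests_per_month: int = 1_000) -> tuple[float, float]:
    """(monthly, annual) cost for a given per-request cost and request volume."""
    monthly = per_request * requests_per_month
    return monthly, monthly * 12

# Placeholder prices and token counts, purely for illustration:
# per_req = request_cost(input_tokens=2_000, output_tokens=500,
#                        input_price_per_m=2.50, output_price_per_m=10.00)
# print(usage_costs(per_req))  # -> (monthly, annual)
```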