Methodology
How we collect, aggregate, and rank model data
Pricing Data
Token prices are aggregated from multiple independent sources and cross-checked for accuracy. We require agreement from at least three sources (within $0.01 tolerance) before publishing a price.
Sources
- models.dev — Community-maintained model pricing catalog
- LiteLLM — Open-source proxy pricing database
- genai-prices — Pydantic's pricing dataset
- OpenRouter — Multi-provider routing API
- Artificial Analysis — Independent AI benchmarking platform
How it works
Each source is fetched independently and normalized to a common format (cost per 1M tokens). Prices are then compared across sources. A price is accepted when at least three sources agree within a $0.01 tolerance. This consensus approach filters out stale or incorrect data from any single source.
Benchmark Data
Performance benchmarks are sourced from independent evaluation platforms. Each dimension captures a different aspect of model capability.
Intelligence Index
Artificial Analysis — A composite score (0–100) aggregating results across reasoning, knowledge, and language understanding benchmarks including MMLU-Pro, GPQA, HLE, and IFBench.
Coding Index
Artificial Analysis — A composite score (0–100) measuring code generation and understanding, based on LiveCodeBench, SciCode, and TerminalBench.
Math Index
Artificial Analysis — A composite score (0–100) evaluating mathematical reasoning, derived from MATH-500 and AIME benchmarks.
Speed
Artificial Analysis — Output tokens per second, measured under standardized conditions across providers.
Arena ELO
Chatbot Arena — Elo rating from blind human preference voting at arena.ai. Only models with 500+ votes are included to ensure statistical confidence.
Model Parameters
HuggingFace Hub — Parameter count extracted from model safetensors metadata. Only available for open-weight models.
Percentile Rankings
Raw benchmark scores use different scales — intelligence is 0–100, speed is measured in tokens per second, and Arena ELO uses a rating system around 1000–1400. To make these comparable, we convert each dimension to a percentile rank.
How percentiles are computed
For each dimension, we collect all non-null raw values across active models, sort them, and compute the percentile rank:

percentile = ⌊100 × rank / N⌋

where N is the number of active models with a value in that dimension and rank is the number of models with a strictly lower score. This means:
- P0 — the lowest-scoring model in that dimension
- P50 — scores higher than half of all models
- P95 — scores higher than 95% of models
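The computation above can be sketched as follows. The floor keeps the displayed range at P0–P99; the exact rounding used by the site is an assumption here, since only the rank definition is stated.

```python
import math


def percentile_ranks(scores: dict[str, float]) -> dict[str, int]:
    """Percentile ranks for one dimension (a sketch).

    For each model, rank = number of models with a strictly lower
    score, and percentile = floor(100 * rank / N), where N is the
    number of models with data in this dimension.
    """
    values = list(scores.values())
    n = len(values)
    return {
        model: math.floor(100 * sum(1 for v in values if v < score) / n)
        for model, score in scores.items()
    }
```

With four models scoring 10, 20, 30, and 40, the lowest lands at P0 and the highest at P75, matching the "strictly lower" rank definition.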
Percentile ranks are recomputed globally after every data sync. A model's percentile can change even if its raw score doesn't — because other models may have been added or updated.
Where percentiles appear
The Pricing table, Compare view, and Value Index all display percentile ranks (shown as P0–P99). Hover over any percentile to see the original raw value in a tooltip.
Value Index
The Value Index ranks models by how much capability you get per dollar. For each dimension, it divides the percentile score by the blended price per 1M tokens:

value = percentile / blended price

Blended price is the simple average of input and output costs per 1M tokens, i.e. (input + output) / 2. Models are then sorted by value score in descending order — highest value for money first.
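A per-dimension value ranking can be sketched like this. The field names (`percentile`, `input_price`, `output_price`) are illustrative assumptions, not the site's actual schema.

```python
def value_scores(models: list[dict]) -> list[dict]:
    """Rank models by capability per dollar for one dimension (a sketch).

    Each model dict carries a percentile (0-100) for the dimension
    plus input/output prices per 1M tokens.
    """
    ranked = []
    for m in models:
        # Blended price: simple average of input and output costs.
        blended = (m["input_price"] + m["output_price"]) / 2
        ranked.append({**m, "value": m["percentile"] / blended})
    # Highest value for money first.
    return sorted(ranked, key=lambda m: m["value"], reverse=True)
```

A cheap mid-tier model can outrank a stronger but far pricier one: P80 at a $4 blended price scores 20, while P90 at $20 scores only 4.5.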
Benchmark Propagation
Many models are available from multiple providers (e.g., a model may be served directly by its creator and also through resellers). When a reseller offers the same model but its listing hasn't been evaluated independently, we propagate benchmark data from the canonical provider's listing. This keeps all instances of the same model comparable on performance, even when benchmark labs only test one provider's version.
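The propagation step can be sketched as a pass over provider listings. The shape of each listing (`model_id`, `benchmarks`, `propagated`) is an assumption for illustration.

```python
def propagate_benchmarks(listings: list[dict]) -> list[dict]:
    """Copy benchmark data from an evaluated listing of a model to
    unevaluated listings of the same model (a sketch).
    """
    # Index benchmark payloads by model id, taken from listings that
    # were evaluated directly (the canonical provider's version).
    canonical = {
        l["model_id"]: l["benchmarks"]
        for l in listings
        if l.get("benchmarks") is not None
    }
    out = []
    for l in listings:
        if l.get("benchmarks") is None and l["model_id"] in canonical:
            # Reseller listing: inherit the canonical benchmarks and
            # flag the row so the UI can mark it as propagated.
            l = {**l, "benchmarks": canonical[l["model_id"]], "propagated": True}
        out.append(l)
    return out
```

Propagated rows keep a flag so inherited scores can be distinguished from directly measured ones downstream.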
Update Frequency
- Pricing — refreshed every 4 hours from all sources
- Benchmarks — refreshed every 6 hours
- Percentile ranks — recomputed after every benchmark refresh