Methodology
How we collect, aggregate, and rank model data
Pricing Data
Token prices are aggregated from multiple independent sources and cross-checked for accuracy. We require agreement from at least three sources (within $0.01 tolerance) before publishing a price.
Sources
- models.dev — Community-maintained model pricing catalog
- LiteLLM — Open-source proxy pricing database
- genai-prices — Pydantic's pricing dataset
- OpenRouter — Multi-provider routing API
- Artificial Analysis — Independent AI benchmarking platform
How it works
Each source is fetched independently and normalized to a common format (cost per 1M tokens). Prices are then compared across sources. A price is accepted when at least three sources agree within a $0.01 tolerance. This consensus approach filters out stale or incorrect data from any single source.
Benchmark Data
Performance benchmarks are sourced from independent evaluation platforms. Each dimension captures a different aspect of model capability.
Intelligence Index
Artificial Analysis — A composite score (0–100) aggregating results across reasoning, knowledge, and language understanding benchmarks including MMLU-Pro, GPQA, HLE, and IFBench.
Coding Index
Artificial Analysis — A composite score (0–100) measuring code generation and understanding, based on LiveCodeBench, SciCode, and TerminalBench.
Math Index
Artificial Analysis — A composite score (0–100) evaluating mathematical reasoning, derived from MATH-500 and AIME benchmarks.
Speed
Artificial Analysis — Output tokens per second, measured under standardized conditions across providers.
Arena ELO
Chatbot Arena — Elo rating from blind human preference voting at arena.ai. Only models with 500+ votes are included to ensure statistical confidence.
Model Parameters
HuggingFace Hub — Parameter count extracted from model safetensors metadata. Only available for open-weight models.
Percentile Rankings
Raw benchmark scores use different scales — intelligence is 0–100, speed is measured in tokens per second, and Arena ELO uses a rating system around 1000–1400. To make these comparable, we convert each dimension to a percentile rank.
How percentiles are computed
For each dimension, we collect all non-null raw values across active models, sort them, and compute the percentile rank:

percentile = ⌊100 × rank / N⌋

where N is the number of active models with a value in that dimension and rank is the number of models with a strictly lower score. This means:
- P0 — the lowest-scoring model in that dimension
- P50 — scores higher than half of all models
- P95 — scores higher than 95% of models
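The computation above can be sketched as follows. The floor keeps the displayed range at P0–P99; the exact rounding used by the site is an assumption here, since only the rank definition is stated.

```python
import math


def percentile_ranks(scores: dict[str, float]) -> dict[str, int]:
    """Percentile ranks for one dimension (a sketch).

    For each model, rank = number of models with a strictly lower
    score, and percentile = floor(100 * rank / N), where N is the
    number of models with data in this dimension.
    """
    values = list(scores.values())
    n = len(values)
    return {
        model: math.floor(100 * sum(1 for v in values if v < score) / n)
        for model, score in scores.items()
    }
```

With four models scoring 10, 20, 30, and 40, the lowest lands at P0 and the highest at P75, matching the "strictly lower" rank definition.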
Percentile ranks are recomputed globally after every data sync. A model's percentile can change even if its raw score doesn't — because other models may have been added or updated.
Where percentiles appear
The Pricing table, Compare view, and Value Index all display percentile ranks (shown as P0–P99). Hover over any percentile to see the original raw value in a tooltip.
Value Index
The Value Index ranks models by how much capability you get per dollar. For each dimension, it divides the percentile score by the blended price per 1M tokens:

value = percentile / blended price

Blended price is the simple average of input and output costs per 1M tokens, i.e. (input + output) / 2. Models are then sorted by value score in descending order — highest value for money first.
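A per-dimension value ranking can be sketched like this. The field names (`percentile`, `input_price`, `output_price`) are illustrative assumptions, not the site's actual schema.

```python
def value_scores(models: list[dict]) -> list[dict]:
    """Rank models by capability per dollar for one dimension (a sketch).

    Each model dict carries a percentile (0-100) for the dimension
    plus input/output prices per 1M tokens.
    """
    ranked = []
    for m in models:
        # Blended price: simple average of input and output costs.
        blended = (m["input_price"] + m["output_price"]) / 2
        ranked.append({**m, "value": m["percentile"] / blended})
    # Highest value for money first.
    return sorted(ranked, key=lambda m: m["value"], reverse=True)
```

A cheap mid-tier model can outrank a stronger but far pricier one: P80 at a $4 blended price scores 20, while P90 at $20 scores only 4.5.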
Benchmark Propagation
Many models are available from multiple providers (e.g., a model may be served directly by its creator and also through resellers). When a reseller offers the same model but its listing hasn't been evaluated independently, we propagate benchmark data from the canonical provider's listing. This keeps all instances of the same model comparable on performance, even when benchmark labs only test one provider's version.
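The propagation step can be sketched as a pass over provider listings. The shape of each listing (`model_id`, `benchmarks`, `propagated`) is an assumption for illustration.

```python
def propagate_benchmarks(listings: list[dict]) -> list[dict]:
    """Copy benchmark data from an evaluated listing of a model to
    unevaluated listings of the same model (a sketch).
    """
    # Index benchmark payloads by model id, taken from listings that
    # were evaluated directly (the canonical provider's version).
    canonical = {
        l["model_id"]: l["benchmarks"]
        for l in listings
        if l.get("benchmarks") is not None
    }
    out = []
    for l in listings:
        if l.get("benchmarks") is None and l["model_id"] in canonical:
            # Reseller listing: inherit the canonical benchmarks and
            # flag the row so the UI can mark it as propagated.
            l = {**l, "benchmarks": canonical[l["model_id"]], "propagated": True}
        out.append(l)
    return out
```

Propagated rows keep a flag so inherited scores can be distinguished from directly measured ones downstream.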
Update Frequency
- Pricing — refreshed every 4 hours from all sources
- Benchmarks — refreshed every 6 hours
- Percentile ranks — recomputed after every benchmark refresh