Benchmarks
Model Lens evaluates local AI models across 11 standard benchmarks plus DevBench — a developer-realistic evaluation harness for TypeScript, NestJS, and React.
Scoring system
Score components
Each benchmark response is evaluated on five weighted components:
| Component | Weight | Description |
|---|---|---|
| Correctness | 40% | Is the answer actually right? |
| Instruction compliance | 20% | Does it follow formatting constraints? |
| Reasoning quality | 20% | Is the reasoning structured and thoughtful? |
| Code executability | 15% | Does generated code actually run? |
| Type safety | 5% | Does TypeScript code pass `tsc --noEmit`? |
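As a rough illustration of how the weights in the table could combine, here is a minimal sketch. The dictionary keys and the `weighted_score` helper are hypothetical names, and the sketch assumes every component has already been scored on a 0 to 1 scale; how `scoring.py` handles components that do not apply to a prompt is not covered here.

```python
# Weights copied from the table above; the key names are illustrative only.
WEIGHTS = {
    "correctness": 0.40,
    "instruction_compliance": 0.20,
    "reasoning_quality": 0.20,
    "code_executability": 0.15,
    "type_safety": 0.05,
}


def weighted_score(components: dict[str, float]) -> float:
    """Combine per-component scores in [0, 1] into a single weighted score."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Under these assumptions, a response that is fully correct but scores zero on everything else would receive 0.40.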
Statistical rigor
Every prompt is run 5 times to capture variance:
- Mean ± std: average score with its standard deviation
- 95% confidence intervals: Student's t-distribution (via scipy) or a normal approximation
- Outlier detection: IQR and z-score methods
- Reliability threshold: a score counts as “reliable” when its coefficient of variation (std / mean) is below 0.1
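The statistics listed above can be sketched for a single prompt's repeated runs as follows. This is an illustration rather than the code in `scoring.py`: the `summarize_runs` name, the returned keys, the 1.5 × IQR fence, and the "inclusive" quartile method are all assumptions, and only the IQR outlier rule is shown.

```python
import math
import statistics

try:
    from scipy import stats as scipy_stats  # optional dependency
except ImportError:
    scipy_stats = None


def summarize_runs(scores, confidence=0.95, cv_threshold=0.1):
    """Summarize repeated scores for one prompt (hypothetical helper)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)

    # Critical value: Student's t when scipy is importable, else normal.
    upper_tail = (1 + confidence) / 2
    if scipy_stats is not None:
        crit = float(scipy_stats.t.ppf(upper_tail, df=n - 1))
    else:
        crit = statistics.NormalDist().inv_cdf(upper_tail)
    half_width = crit * std / math.sqrt(n)

    # Coefficient of variation; undefined for a zero mean, so treat as infinite.
    cv = std / mean if mean else math.inf

    # IQR fences; the 1.5 multiplier is the conventional choice, assumed here.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    outliers = [s for s in scores if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]

    return {
        "mean": mean,
        "std": std,
        "ci": (mean - half_width, mean + half_width),
        "cv": cv,
        "reliable": cv < cv_threshold,
        "outliers": outliers,
    }
```

With only five runs the t-based interval is noticeably wider than the normal approximation, which is one reason to prefer it when scipy is installed.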
Failure taxonomy
Failures are classified into 13 types (see `scoring.py`):
| Failure type | Example |
|---|---|
| `hallucinated_api` | Made-up library functions |
| `wrong_async_usage` | `async` without `await` |
| `incorrect_json_schema` | JSON doesn’t match expected shape |
| `syntax_error` | Code that won’t parse |
| `type_error` | TypeScript type mismatch |
| `stale_closure` | `useEffect` with `[]` capturing stale values |
| `race_condition` | Missing `Promise.all` in parallel async ops |
| `missing_import` | Unimported symbol |
| `incorrect_di` | NestJS `@Injectable()` without constructor |
| `oververbose_output` | Response too long |
| `missed_constraint` | Failed formatting requirement |
| `logic_error` | Code runs but produces wrong result |
| `other` | Unclassified |
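One way to represent the taxonomy above in code is an enumeration, paired with simple pattern checks. The sketch below is hypothetical: the `FailureType` class and `detect_failures` function are illustrative names, the two heuristics are deliberately naive, and the `max_chars` limit is an invented default rather than a value taken from `scoring.py`.

```python
import re
from enum import Enum


class FailureType(str, Enum):
    HALLUCINATED_API = "hallucinated_api"
    WRONG_ASYNC_USAGE = "wrong_async_usage"
    INCORRECT_JSON_SCHEMA = "incorrect_json_schema"
    SYNTAX_ERROR = "syntax_error"
    TYPE_ERROR = "type_error"
    STALE_CLOSURE = "stale_closure"
    RACE_CONDITION = "race_condition"
    MISSING_IMPORT = "missing_import"
    INCORRECT_DI = "incorrect_di"
    OVERVERBOSE_OUTPUT = "oververbose_output"
    MISSED_CONSTRAINT = "missed_constraint"
    LOGIC_ERROR = "logic_error"
    OTHER = "other"


def detect_failures(code: str, max_chars: int = 4000) -> list[FailureType]:
    """Flag two failure types with naive text heuristics (illustrative only)."""
    found = []
    # An `async` keyword with no `await` anywhere in the snippet.
    if re.search(r"\basync\b", code) and not re.search(r"\bawait\b", code):
        found.append(FailureType.WRONG_ASYNC_USAGE)
    # Length cap; the default limit is an assumption for this sketch.
    if len(code) > max_chars:
        found.append(FailureType.OVERVERBOSE_OUTPUT)
    return found
```

Text heuristics like these produce false positives (for example, an `async` function that legitimately returns a promise without awaiting), so a real classifier would likely combine them with compiler and linter output.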
Benchmark categories
Knowledge & reasoning
| Benchmark | Description | Metric |
|---|---|---|
| MMLU-Pro | 12,000+ multiple-choice questions across 14 disciplines | Accuracy (%) |
| GSM8K | Grade-school math word problems | Numerical match (±tolerance) |
| AIME | Competition-level mathematics | Numerical match (±tolerance) |
| IFEval | Instruction-following constraints | Constraint adherence (%) |
Code generation
| Benchmark | Description | Metric |
|---|---|---|
| HumanEval | 164 Python programming problems | Pass@k |
| SWE-bench Lite | Real GitHub issues (simplified) | Patch correctness |
| BFCL | Function-calling accuracy | Tool choice accuracy |
| Coding | TypeScript/NestJS/React code tasks | Multi-component score |
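For reference, Pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: given n samples per problem of which c pass, pass@k = 1 − C(n − c, k) / C(n, k). The sketch below implements that formula; whether Model Lens uses this exact estimator is an assumption, and `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c passing samples."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With five samples per problem and two passing, this gives pass@1 = 0.4 and pass@3 = 0.9.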
Performance
| Benchmark | Description | Metric |
|---|---|---|
| Speed/Latency | Tokens/sec, time to first token (TTFT), throughput | ms, tokens/sec |
| Memory | RAM/VRAM usage during inference | MB (mean/peak) |
| Needle-in-Haystack | Long-context retrieval | Accuracy vs. position |
Quality
| Benchmark | Description | Metric |
|---|---|---|
| Creativity | Open-ended generation quality | Multi-dimensional score |
DevBench v2
A separate evaluation harness focused on real developer workloads:
- TypeScript execution: compiled with `tsc --noEmit`, executed with `ts-node`
- ESLint validation: code quality via linting rules
- Developer realism score: correctness (40%) + debugging (30%) + instruction following (20%) + latency (10%)
- Tokenization-aware metrics: normalized by output length
- Steady-state speed: ignores warmup tokens
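To make the steady-state speed idea concrete, the sketch below computes tokens per second after discarding the first few tokens. The function name, the per-token timestamp input, and the default of 8 warmup tokens are all assumptions chosen for illustration; the harness may measure this differently.

```python
def steady_state_tokens_per_sec(
    token_times: list[float], warmup_tokens: int = 8
) -> float:
    """Throughput over tokens that arrive after the warmup window.

    token_times holds one arrival timestamp (in seconds) per output token,
    in increasing order.
    """
    steady = token_times[warmup_tokens:]
    if len(steady) < 2:
        raise ValueError("not enough tokens after warmup to measure speed")
    elapsed = steady[-1] - steady[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # N timestamps span N - 1 inter-token intervals.
    return (len(steady) - 1) / elapsed
```

Dropping the warmup window keeps prompt processing and cache warm-up from dragging down the reported generation speed.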
DevBench prompt categories
| Category | Focus |
|---|---|
| Code | Implementing features, fixing bugs |
| Frontend | React/Next.js component work |
| Reasoning | Architecture decisions, tradeoffs |
| Math | Numerical computation, data analysis |
| Instruction | Schema compliance, format constraints |
Execution-grounded scoring
For code tasks, scoring is grounded in actual execution:
- Extract code block from model response
- Compile with `tsc --noEmit` (type safety)
- Execute with `ts-node` (runtime correctness)
- Lint with ESLint (code quality)
- Detect failure patterns (stale closures, race conditions, DI issues)
All tooling (ts-node, tsc, eslint) is optional — scoring degrades gracefully if unavailable.
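The steps above, including the graceful fallback when a tool is missing, could look roughly like the following. This is a sketch under assumptions, not the DevBench implementation: `extract_code_block`, `run_tool`, and `check_typescript` are hypothetical names, the result uses `None` to mean "tool unavailable", and failure-pattern detection is left out.

```python
from __future__ import annotations

import re
import shutil
import subprocess
import tempfile
from pathlib import Path

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block intact
CODE_BLOCK_RE = re.compile(
    FENCE + r"(?:ts|typescript|tsx)?[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def extract_code_block(response: str) -> str | None:
    """Return the first fenced code block in a model response, if any."""
    match = CODE_BLOCK_RE.search(response)
    return match.group(1).strip() if match else None


def run_tool(command: list[str], timeout: float = 30.0) -> bool | None:
    """Return True/False for pass/fail, or None when the tool is not installed."""
    if shutil.which(command[0]) is None:
        return None  # degrade gracefully: the caller skips this signal
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def check_typescript(response: str) -> dict[str, bool | None]:
    """Run the compile, execute, and lint steps on the extracted snippet."""
    code = extract_code_block(response)
    if code is None:
        return {"extracted": False, "compiles": None, "runs": None, "lints": None}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "snippet.ts"
        path.write_text(code, encoding="utf-8")
        return {
            "extracted": True,
            "compiles": run_tool(["tsc", "--noEmit", str(path)]),
            "runs": run_tool(["ts-node", str(path)]),
            # ESLint needs a project config; without one it will report False.
            "lints": run_tool(["eslint", str(path)]),
        }
```

Keeping "unavailable" distinct from "failed" lets a scorer drop the missing signal and reweight the remaining ones instead of penalizing the model for the host's setup.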
Evaluator plugins
Built-in evaluators (see `evaluators.py`):
| Evaluator | Purpose |
|---|---|
| `JSONSchemaEvaluator` | Validate JSON against schema |
| `RegexConstraintEvaluator` | Check formatting, forbidden patterns, counts |
| `KeywordMatchEvaluator` | Verify key concepts are present |
| `NumericalAnswerEvaluator` | Compare numerical answers with tolerance |
| `CodeExecutionEvaluator` | Execute TypeScript and verify output |
| `CompositeEvaluator` | Combine multiple evaluators with weights |
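To illustrate how such evaluators might compose, the sketch below models three of them as plain functions that map a response to a score between 0 and 1. The function names and signatures are hypothetical and do not reflect the actual classes in `evaluators.py`; in particular, taking the last number in the response as the answer is an assumption.

```python
from __future__ import annotations

import re
from typing import Callable

Evaluator = Callable[[str], float]  # response text -> score in [0, 1]


def numerical_answer(expected: float, tolerance: float = 1e-6) -> Evaluator:
    """Score 1.0 when the last number in the response is within tolerance."""
    def evaluate(response: str) -> float:
        # Strip thousands separators, then collect integers and decimals.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
        if not numbers:
            return 0.0
        return 1.0 if abs(float(numbers[-1]) - expected) <= tolerance else 0.0
    return evaluate


def keyword_match(keywords: list[str]) -> Evaluator:
    """Score the fraction of keywords found (case-insensitive)."""
    def evaluate(response: str) -> float:
        text = response.lower()
        hits = sum(1 for keyword in keywords if keyword.lower() in text)
        return hits / len(keywords) if keywords else 1.0
    return evaluate


def composite(parts: list[tuple[Evaluator, float]]) -> Evaluator:
    """Weighted average of several evaluators; weights need not sum to 1."""
    total = sum(weight for _, weight in parts)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def evaluate(response: str) -> float:
        return sum(ev(response) * weight for ev, weight in parts) / total
    return evaluate
```

For example, `composite([(numerical_answer(42), 3), (keyword_match(["total"]), 1)])` gives a response with the right number but no matching keyword a score of 0.75.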
See also
- Architecture: Benchmark section
- Prompt Packs: how prompts are organized