Benchmarks

Model Lens evaluates local AI models across 11 standard benchmarks plus DevBench — a developer-realistic evaluation harness for TypeScript, NestJS, and React.

Each benchmark response is evaluated on five weighted components:

| Component | Weight | Description |
|---|---|---|
| Correctness | 40% | Is the answer actually right? |
| Instruction compliance | 20% | Does it follow formatting constraints? |
| Reasoning quality | 20% | Is the reasoning structured and thoughtful? |
| Code executability | 15% | Does generated code actually run? |
| Type safety | 5% | Does TypeScript code pass tsc --noEmit? |
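
As a rough illustration of how the five weights combine, the sketch below computes a single weighted score from per-component scores in the range 0 to 1. The dictionary keys and the `weighted_score` function are illustrative names, not taken from scoring.py; only the weights come from the table above.

```python
# Weights copied from the component table; the key names are assumptions.
WEIGHTS = {
    "correctness": 0.40,
    "instruction_compliance": 0.20,
    "reasoning_quality": 0.20,
    "code_executability": 0.15,
    "type_safety": 0.05,
}


def weighted_score(components: dict) -> float:
    """Combine per-component scores (each in [0, 1]) into one weighted score."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


example = {
    "correctness": 1.0,
    "instruction_compliance": 0.5,
    "reasoning_quality": 0.75,
    "code_executability": 1.0,
    "type_safety": 0.0,
}
print(round(weighted_score(example), 3))  # prints 0.8
```

Because correctness carries 40% of the weight, a response that fails type checking but is otherwise correct still scores well, while an incorrect answer cannot exceed 0.6.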

Every prompt is run 5 times to capture variance:

  • Mean ± std — Average score with standard deviation
  • 95% confidence intervals — Student’s t-distribution (scipy) or normal approximation
  • Outlier detection — IQR and z-score methods
  • Reliability threshold — Coefficient of variation < 0.1 for “reliable” scores
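
The statistics listed above can be sketched for a single prompt's five scores as follows. This is a minimal sketch, not the code in scoring.py: the function name `summarize_runs`, the 1.5 × IQR fence, and the choice to fall back to the normal approximation when scipy is not installed are assumptions. The z-score outlier method is omitted for brevity.

```python
import math
import statistics


def summarize_runs(scores, cv_threshold=0.1):
    """Summarize repeated scores for one prompt (hypothetical helper)."""
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)

    # 95% CI: Student's t critical value if scipy is present, else z = 1.96.
    try:
        from scipy import stats
        critical = float(stats.t.ppf(0.975, df=n - 1))
    except ImportError:
        critical = 1.96
    half_width = critical * std / math.sqrt(n)

    # IQR outlier detection with the conventional 1.5 * IQR fences (assumed).
    q1, _, q3 = statistics.quantiles(scores, n=4)
    fence = 1.5 * (q3 - q1)
    outliers = [s for s in scores if s < q1 - fence or s > q3 + fence]

    # Coefficient of variation: relative spread of the scores.
    cv = std / mean if mean else math.inf
    return {
        "mean": mean,
        "std": std,
        "ci95": (mean - half_width, mean + half_width),
        "outliers": outliers,
        "cv": cv,
        "reliable": cv < cv_threshold,
    }


stable = summarize_runs([0.80, 0.82, 0.79, 0.81, 0.80])
noisy = summarize_runs([0.2, 0.9, 0.5, 0.8, 0.3])
print(stable["reliable"], noisy["reliable"])
```

With only five runs, the t-distribution interval is noticeably wider than the normal approximation, so the two options in the list above are not interchangeable at this sample size.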

Failures are classified into 13 types (see scoring.py):

| Failure type | Example |
|---|---|
| hallucinated_api | Made-up library functions |
| wrong_async_usage | async without await |
| incorrect_json_schema | JSON doesn't match expected shape |
| syntax_error | Code that won't parse |
| type_error | TypeScript type mismatch |
| stale_closure | useEffect with [] capturing stale values |
| race_condition | Missing Promise.all in parallel async ops |
| missing_import | Unimported symbol |
| incorrect_di | NestJS @Injectable() without constructor |
| oververbose_output | Response too long |
| missed_constraint | Failed formatting requirement |
| logic_error | Code runs but produces wrong result |
| other | Unclassified |
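
To give a feel for how pattern-based failure detection might work, here is a deliberately crude sketch covering two of the thirteen types. The regular expressions and the `detect_failures` function are assumptions made for illustration; the actual rules in scoring.py are not shown in this page and are likely more careful.

```python
import re

# Heuristic only: an empty dependency array is not always a bug,
# so a match here is a hint rather than proof of a stale closure.
STALE_CLOSURE = re.compile(r"useEffect\s*\(.*?,\s*\[\s*\]\s*\)", re.DOTALL)
ASYNC_KEYWORD = re.compile(r"\basync\b")
AWAIT_KEYWORD = re.compile(r"\bawait\b")


def detect_failures(code: str) -> list:
    """Return failure-type labels suggested by simple text patterns."""
    found = []
    if STALE_CLOSURE.search(code):
        found.append("stale_closure")
    # "async without await": the snippet declares async but never awaits.
    if ASYNC_KEYWORD.search(code) and not AWAIT_KEYWORD.search(code):
        found.append("wrong_async_usage")
    return found


snippet = "useEffect(() => { setInterval(() => setCount(count + 1), 1000); }, []);"
print(detect_failures(snippet))
```

Text patterns like these produce false positives, which is one reason a catch-all type such as other is useful alongside the specific labels.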

| Benchmark | Description | Metric |
|---|---|---|
| MMLU-Pro | 12,000+ multiple-choice questions across 14 subjects | Accuracy (%) |
| GSM8K | Grade-school math word problems | Numerical match (±tolerance) |
| AIME | Competition-level mathematics | Numerical match (±tolerance) |
| IFEval | Instruction-following constraints | Constraint adherence (%) |

| Benchmark | Description | Metric |
|---|---|---|
| HumanEval | 164 Python programming problems | Pass@k |
| SWE-bench Lite | Real GitHub issues (simplified) | Patch correctness |
| BFCL | Function-calling accuracy | Tool choice accuracy |
| Coding | TypeScript/NestJS/React code tasks | Multi-component score |
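
For readers unfamiliar with the Pass@k metric used for HumanEval, the sketch below implements the unbiased estimator introduced with the HumanEval benchmark: given n samples of which c pass, pass@k = 1 − C(n−c, k) / C(n, k). Whether Model Lens uses this exact estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes,
    given that c of the n samples pass."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 5 runs per prompt and 2 passing runs:
print(pass_at_k(5, 2, 1))  # prints 0.4
print(pass_at_k(5, 2, 3))  # prints 0.9
```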

| Benchmark | Description | Metric |
|---|---|---|
| Speed/Latency | Tokens/sec, TTFT, throughput | ms, tokens/sec |
| Memory | RAM/VRAM usage during inference | MB (mean/peak) |
| Needle-in-Haystack | Long-context retrieval | Accuracy vs. position |

| Benchmark | Description | Metric |
|---|---|---|
| Creativity | Open-ended generation quality | Multi-dimensional score |

DevBench is a separate evaluation harness focused on real developer workloads:

  • TypeScript execution — Compiled with tsc --noEmit, executed with ts-node
  • ESLint validation — Code quality via linting rules
  • Developer realism score — correctness (40%) + debugging (30%) + instruction following (20%) + latency (10%)
  • Tokenization-aware metrics — Normalized by output length
  • Steady-state speed — Ignores warmup tokens
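
The steady-state speed metric in the list above can be sketched as follows, assuming the harness records an arrival timestamp for every output token. The function name `steady_state_tps` and the default of five warmup tokens are hypothetical; the page does not state how many tokens DevBench treats as warmup.

```python
def steady_state_tps(token_times, warmup_tokens=5):
    """Tokens per second after discarding the first `warmup_tokens` arrivals.

    token_times: monotonically increasing arrival times in seconds,
    one entry per generated token.
    """
    steady = token_times[warmup_tokens:]
    if len(steady) < 2:
        raise ValueError("need at least two tokens after warmup")
    elapsed = steady[-1] - steady[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # len(steady) - 1 intervals separate len(steady) token arrivals.
    return (len(steady) - 1) / elapsed


# The first three tokens arrive slowly (prompt processing, cache warmup).
times = [0.0, 0.5, 1.0, 1.1, 1.2, 1.3, 1.4]
print(round(steady_state_tps(times, warmup_tokens=3), 2))
```

Excluding warmup keeps time-to-first-token effects from dragging down the throughput figure, so the two can be reported separately.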

| Category | Focus |
|---|---|
| Code | Implementing features, fixing bugs |
| Frontend | React/Next.js component work |
| Reasoning | Architecture decisions, tradeoffs |
| Math | Numerical computation, data analysis |
| Instruction | Schema compliance, format constraints |

For code tasks, scoring is grounded in actual execution:

  1. Extract code block from model response
  2. Compile with tsc --noEmit (type safety)
  3. Execute with ts-node (runtime correctness)
  4. Lint with ESLint (code quality)
  5. Detect failure patterns (stale closures, race conditions, DI issues)

All tooling (ts-node, tsc, eslint) is optional — scoring degrades gracefully if unavailable.
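
The extraction, compile, execute, and lint steps, together with the graceful degradation described above, might look roughly like this. It is a sketch under assumptions: the helper names (`extract_code`, `run_tool`, `check_typescript`), the 30-second timeout, and the convention of returning `None` for an unavailable tool are all hypothetical. Only the tool invocations (`tsc --noEmit`, `ts-node`, `eslint`) come from the text.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional

# Built with string repetition so this sketch can sit inside a fenced block.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:typescript|tsx|ts)?[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Step 1: return the first fenced code block, or None if absent."""
    match = FENCE.search(response)
    return match.group(1) if match else None


def run_tool(cmd, timeout: float = 30.0) -> Optional[bool]:
    """Run a CLI tool. True/False means pass/fail; None means not installed."""
    executable = shutil.which(cmd[0])
    if executable is None:
        return None  # degrade gracefully: this check is simply skipped
    try:
        proc = subprocess.run(
            [executable, *cmd[1:]], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


def check_typescript(response: str) -> dict:
    """Steps 1-4 for one model response (failure-pattern detection omitted)."""
    code = extract_code(response)
    if code is None:
        return {"extracted": False}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "snippet.ts"
        path.write_text(code, encoding="utf-8")
        return {
            "extracted": True,
            "type_check": run_tool(["tsc", "--noEmit", str(path)]),
            "runtime": run_tool(["ts-node", str(path)]),
            "lint": run_tool(["eslint", str(path)]),
        }
```

Distinguishing `None` from `False` lets a scorer drop the unavailable component from the weighting instead of counting a missing tool as a failed check.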

Built-in evaluators (see evaluators.py):

| Evaluator | Purpose |
|---|---|
| JSONSchemaEvaluator | Validate JSON against schema |
| RegexConstraintEvaluator | Check formatting, forbidden patterns, counts |
| KeywordMatchEvaluator | Verify key concepts are present |
| NumericalAnswerEvaluator | Compare numerical answers with tolerance |
| CodeExecutionEvaluator | Execute TypeScript and verify output |
| CompositeEvaluator | Combine multiple evaluators with weights |
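
To illustrate how weighted composition of evaluators could work, the sketch below defines simplified stand-ins for three of the evaluators. The class names (`NumericalSketch`, `KeywordSketch`, `CompositeSketch`), the `evaluate(response) -> float` interface, and the rule of taking the last number in the response as the answer are all assumptions; the real signatures in evaluators.py may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class NumericalSketch:
    """Stand-in for a numerical evaluator: last number within tolerance."""
    expected: float
    tolerance: float = 1e-6

    def evaluate(self, response: str) -> float:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        if not numbers:
            return 0.0
        return 1.0 if abs(float(numbers[-1]) - self.expected) <= self.tolerance else 0.0


@dataclass
class KeywordSketch:
    """Stand-in for a keyword evaluator: fraction of keywords present."""
    keywords: list

    def evaluate(self, response: str) -> float:
        if not self.keywords:
            return 0.0
        text = response.lower()
        hits = sum(1 for kw in self.keywords if kw.lower() in text)
        return hits / len(self.keywords)


@dataclass
class CompositeSketch:
    """Stand-in for a composite evaluator: weighted average of parts."""
    parts: list  # list of (evaluator, weight) pairs

    def evaluate(self, response: str) -> float:
        total = sum(weight for _, weight in self.parts)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return sum(ev.evaluate(response) * w for ev, w in self.parts) / total


composite = CompositeSketch([
    (NumericalSketch(expected=42, tolerance=0.5), 0.7),
    (KeywordSketch(keywords=["answer", "because"]), 0.3),
])
print(round(composite.evaluate("The answer is 42"), 2))  # prints 0.85
```

Normalizing by the total weight means the weights do not have to sum to 1, which makes it easier to add or remove an evaluator without rebalancing the rest.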