Run Schema

The canonical data model for a single model evaluation run. Every benchmark run, trace capture, workload evaluation, and comparison is a Run. This is the central entity in Model Lens — everything else is derived from it.

Run
├── id
├── model
├── provider
├── workload ← What was evaluated (benchmark, prompt pack, project)
├── trace ← Execution timeline (token-level events)
├── metrics ← Numeric measurements (latency, throughput, scores)
├── artifacts ← Raw outputs (response text, logs, errors)
└── config ← Snapshot of evaluation parameters
{
  "run": {
    "id": "run_20240601_123456_qwen-3.5-9b",
    "version": "1.0.0",
    "model": {
      "id": "qwen-3.5-9b-coder",
      "provider": "lm-studio",
      "parameters": "9b",
      "quantization": "Q4_K_M",
      "size_bytes": 5830000000
    },
    "provider": {
      "name": "lm-studio",
      "endpoint": "http://localhost:1234/v1",
      "version": "0.3.10"
    },
    "workload": {
      "type": "benchmark",
      "name": "mmlu_pro",
      "category": "reasoning",
      "version": "1.0.0"
    },
    "trace": {
      "trace_id": "trace_abc123def456",
      "started_at": "2024-06-01T12:34:56.000Z",
      "completed_at": "2024-06-01T12:35:02.000Z",
      "events": [
        {
          "id": "e1",
          "type": "prompt",
          "label": "Prompt Sent",
          "detail": "What is the capital of France?",
          "timing_ms": 0,
          "status": "success"
        },
        {
          "id": "e2",
          "type": "token",
          "label": "Token 1",
          "detail": "Paris",
          "timing_ms": 150,
          "status": "success"
        }
      ],
      "metrics": {
        "ttft_ms": 150,
        "tokens_per_second": 45.2,
        "total_tokens": 15,
        "prompt_tokens": 8,
        "completion_tokens": 7,
        "total_latency_ms": 320
      },
      "artifacts": {
        "response": "Paris is the capital of France.",
        "logs": [],
        "errors": []
      }
    },
    "metrics": {
      "scores": {
        "correctness": 0.95,
        "completeness": 1.0,
        "code_quality": 0.85,
        "style_match": 0.90,
        "efficiency": 0.75
      },
      "performance": {
        "tokens_per_sec": 45.2,
        "ttft_ms": 150,
        "total_latency_ms": 320
      },
      "stats": {
        "mean": 0.89,
        "std": 0.06,
        "min": 0.78,
        "max": 0.95,
        "runs": 5,
        "confidence_95": [0.85, 0.93]
      }
    },
    "config": {
      "prompt_version": "v1",
      "seed": 42,
      "num_runs": 5,
      "hardware": {
        "platform": "macOS-14.5-arm64",
        "processor": "arm",
        "memory_gb": 18
      }
    },
    "timestamp": "2024-06-01T12:34:56.000Z",
    "git_sha": "a1b2c3d"
  }
}
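
As a rough illustration of how a consumer might read such a document, the sketch below parses a trimmed copy of the example and pulls out a few headline values. The helper name `summarize_run` is hypothetical and not part of Model Lens; the key paths are taken from the example above.

```python
import json

# Trimmed copy of the example document; a real file would hold every field.
RUN_JSON = """
{
  "run": {
    "id": "run_20240601_123456_qwen-3.5-9b",
    "version": "1.0.0",
    "model": {"id": "qwen-3.5-9b-coder", "quantization": "Q4_K_M"},
    "trace": {
      "events": [
        {"id": "e1", "type": "prompt", "timing_ms": 0, "status": "success"},
        {"id": "e2", "type": "token", "timing_ms": 150, "status": "success"}
      ]
    },
    "metrics": {
      "scores": {"correctness": 0.95, "completeness": 1.0},
      "performance": {"tokens_per_sec": 45.2, "ttft_ms": 150, "total_latency_ms": 320}
    }
  }
}
"""


def summarize_run(document: dict) -> dict:
    """Hypothetical helper: collect a few headline values from a Run document."""
    run = document["run"]  # the payload sits under a top-level "run" key
    performance = run["metrics"]["performance"]
    events = run["trace"]["events"]
    return {
        "run_id": run["id"],
        "model_id": run["model"]["id"],
        "ttft_ms": performance["ttft_ms"],
        "tokens_per_sec": performance["tokens_per_sec"],
        "event_types": [event["type"] for event in events],
    }


summary = summarize_run(json.loads(RUN_JSON))
print(summary)
```

Indexing with `[]` rather than `.get()` is deliberate in this sketch: a missing key raises `KeyError` immediately instead of producing a partially filled summary.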
Run fields:

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | | Globally unique run identifier |
| version | string | | Schema version (semver) |
| model | Model | | The model being evaluated |
| provider | Provider | | The provider serving the model |
| workload | Workload | | What was evaluated |
| trace | Trace | | Execution timeline |
| metrics | Metrics | | Evaluation scores and performance |
| artifacts | Artifacts | | Raw outputs |
| config | Config | | Evaluation parameters snapshot |
| timestamp | string | | ISO 8601 timestamp |
| git_sha | string | | Git commit SHA |

Trace event fields:

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | | Event identifier (unique within trace) |
| type | enum | | system, prompt, token, tool_call, reasoning, response, error |
| label | string | | Human-readable label |
| detail | string | | Extended description |
| timing_ms | number | | Duration of this step |
| status | enum | | success, failure, pending |
| tool | string | | Tool name (for tool_call events) |
| input | string | | Tool input |
| output | string | | Tool output |
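
The event fields can be mirrored in a small typed structure. The sketch below is one possible shape, not the contents of `packages/core/trace_schema.py`. It assumes that `tool`, `input`, and `output` are optional (they are described only for `tool_call` events) and that every other field is required; the page as extracted here does not confirm either.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values, copied from the event field descriptions.
EVENT_TYPES = {"system", "prompt", "token", "tool_call", "reasoning", "response", "error"}
EVENT_STATUSES = {"success", "failure", "pending"}


@dataclass
class TraceEvent:
    """Illustrative model of one trace event (names follow the JSON keys)."""

    id: str
    type: str
    label: str
    detail: str
    timing_ms: float
    status: str
    # Assumed optional: only meaningful for tool_call events.
    tool: Optional[str] = None
    input: Optional[str] = None
    output: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the two enums as soon as the event is built.
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type!r}")
        if self.status not in EVENT_STATUSES:
            raise ValueError(f"unknown event status: {self.status!r}")


raw_event = {
    "id": "e2",
    "type": "token",
    "label": "Token 1",
    "detail": "Paris",
    "timing_ms": 150,
    "status": "success",
}
event = TraceEvent(**raw_event)
print(event)
```

Because the attribute names match the JSON keys, an event dictionary can be unpacked directly into the constructor, and an unexpected key fails with a `TypeError` rather than being silently dropped.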

Runs are serialized as JSON files:

results/
  models/
    qwen-3.5-9b/
      20240601_123456_run.json
  traces/
    trace_abc123.json
  runs_index.json
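
A minimal sketch of how a run file path could be derived from this layout. It assumes the run file sits under `results/models/<model directory>/` and is named `<YYYYMMDD_HHMMSS>_run.json`, both inferred from the single example above. The function `run_file_path` is hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def run_file_path(model_dir: str, started_at: datetime, root: str = "results") -> PurePosixPath:
    """Hypothetical helper: build the path of a run file from its model and start time."""
    # Assumed naming pattern: date and time joined by an underscore, then "_run.json".
    stamp = started_at.strftime("%Y%m%d_%H%M%S")
    return PurePosixPath(root) / "models" / model_dir / f"{stamp}_run.json"


path = run_file_path(
    "qwen-3.5-9b",
    datetime(2024, 6, 1, 12, 34, 56, tzinfo=timezone.utc),
)
print(path)  # results/models/qwen-3.5-9b/20240601_123456_run.json
```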

The Run schema uses semantic versioning:

  • Major: Breaking changes to required fields or types
  • Minor: New optional fields, backward-compatible additions
  • Patch: Documentation fixes

Current version: 1.0.0
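
The versioning rules above suggest a simple reader-side check. The sketch below treats a run as readable when its major version matches the reader's, on the assumption that minor and patch releases never break existing fields. The names `SUPPORTED_VERSION`, `parse_version`, and `can_read` are hypothetical, and a real reader may apply a stricter policy.

```python
SUPPORTED_VERSION = "1.0.0"  # current schema version stated above


def parse_version(version: str) -> tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def can_read(run_version: str, reader_version: str = SUPPORTED_VERSION) -> bool:
    """Return True when only minor or patch components differ."""
    run_major, _, _ = parse_version(run_version)
    reader_major, _, _ = parse_version(reader_version)
    # A major bump signals breaking changes to required fields or types.
    return run_major == reader_major


print(can_read("1.0.0"))  # True
print(can_read("1.2.3"))  # True
print(can_read("2.0.0"))  # False
```

This sketch does not handle pre-release or build suffixes such as `1.0.0-beta`; `parse_version` raises `ValueError` on them.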

| Implementation | File |
|---|---|
| Python BenchmarkResult | apps/cli/results_schema.py |
| Python Trace | packages/core/trace_schema.py |
| TypeScript RunIndex | apps/dashboard/src/lib/loadResults.ts |