Vision

Benchmarking is a feature. Observability is the product. Serving is infrastructure. Understanding is the workflow.

Most LLM benchmarks give you a single score. They don’t tell you:

  • Why a model failed on your prompt
  • Why it’s slower on your hardware
  • Why it consumes more memory on your workload
  • What changed between model versions
  • What actually happened during execution

Model Lens exists to answer these questions.

We are not building another leaderboard, chat interface, model server, or agent framework. Excellent projects already exist for those use cases.

Model Lens focuses on understanding model behavior after execution. Think:

  • Chrome DevTools for local AI
  • Datadog for local AI
  • OpenTelemetry for local AI
  • GitHub Actions replay for local AI

Four principles guide the project:

  1. Observability over scores: A single number explains nothing on its own. Traces, metrics, and replays show what happened and why.
  2. Real hardware, real workloads: Benchmarks should reflect how developers actually use models on their machines.
  3. Local-first: No cloud dependency. Everything runs on your hardware.
  4. Extensible by design: Prompt packs, skills, and providers should be community-extendable.

Model Lens is built around five capabilities:

  1. Observability: Capture traces, metrics, and execution details
  2. Replay: Record and replay model execution sessions with playback controls
  3. Workload Evaluation: Test models on real-world projects, not just benchmarks
  4. Benchmarking: Run standardized benchmarks against local models
  5. Community Packs: Shareable, versioned prompt collections

Brand: Model Lens
Theme: Precision optics — clinical, reliable, utilitarian.
Concepts: focus, zoom, exposure, snapshots, replay, timelines

Layer             Tools
Model Serving     Ollama, llama.cpp, vLLM, LM Studio
User Interfaces   Open WebUI, Jan, LibreChat
Benchmarking      OpenBench, lm-evaluation-harness
Observability     Model Lens

Model Lens is not:

  • A cloud-hosted SaaS platform
  • A general-purpose AI agent
  • A model training or fine-tuning tool
  • A replacement for academic benchmarks (MMLU, HumanEval): we integrate them rather than compete with them

A developer should be able to ask:

“Why is Qwen better than Gemma for my codebase?”

Model Lens should answer with benchmark evidence, execution traces, replay sessions, latency metrics, memory metrics, and workload comparisons, not a single score.
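
To make the contrast with a single score concrete, here is a minimal Python sketch of a multi-metric comparison. Everything in it is hypothetical: the `RunTrace` schema, the `summarize` and `failures_unique_to` helpers, and the sample values are illustrative assumptions, not Model Lens's actual data format or API.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class RunTrace:
    """One recorded model execution (hypothetical schema for illustration)."""
    model: str
    prompt_id: str
    passed: bool            # did the output satisfy the workload's check?
    latency_ms: float       # wall-clock time for the full response
    peak_memory_mb: float   # peak memory observed during the run


def summarize(traces: list[RunTrace]) -> dict[str, dict[str, float]]:
    """Group traces by model and report several metrics instead of one score."""
    by_model: dict[str, list[RunTrace]] = {}
    for trace in traces:
        by_model.setdefault(trace.model, []).append(trace)
    return {
        model: {
            "pass_rate": mean([1.0 if t.passed else 0.0 for t in runs]),
            "median_latency_ms": median([t.latency_ms for t in runs]),
            "max_peak_memory_mb": max(t.peak_memory_mb for t in runs),
        }
        for model, runs in by_model.items()
    }


def failures_unique_to(traces: list[RunTrace], model: str) -> list[str]:
    """Prompt IDs that `model` failed but at least one other model passed."""
    passed_elsewhere = {
        t.prompt_id for t in traces if t.model != model and t.passed
    }
    return sorted(
        t.prompt_id
        for t in traces
        if t.model == model and not t.passed and t.prompt_id in passed_elsewhere
    )


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    sample = [
        RunTrace("model-a", "refactor-01", True, 1200.0, 5100.0),
        RunTrace("model-a", "refactor-02", False, 1900.0, 5300.0),
        RunTrace("model-b", "refactor-01", True, 2100.0, 4200.0),
        RunTrace("model-b", "refactor-02", True, 2600.0, 4400.0),
    ]
    for model, metrics in summarize(sample).items():
        print(model, metrics)
    print("model-a unique failures:", failures_unique_to(sample, "model-a"))
```

The per-prompt failure list is what turns a comparison into an explanation: it points at the specific executions whose traces and replays are worth opening.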