Skills
Skills are pure function tools that benchmark logic can invoke. They are versioned, sandboxed, and validated against a lockfile for deterministic, reproducible evaluation.
Philosophy
Benchmark logic should be extensible without bloating core. Skills provide:
- Determinism — Same input always produces same output
- Sandboxing — No filesystem/network access without explicit permission
- Versioning — Changes require version bumps + lockfile updates
- Immutability — Cannot be modified at runtime once registered
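
The sandboxing guarantee can be pictured as a path check that rejects anything resolving outside a skill's working directory. The following is a minimal sketch under that assumption, not Model Lens's actual implementation: `resolve_in_sandbox` and `SandboxViolation` are hypothetical names introduced for illustration.

```python
from pathlib import Path


class SandboxViolation(Exception):
    """Hypothetical error: the requested path escapes the sandbox root."""


def resolve_in_sandbox(root: str, requested: str) -> Path:
    """Return the absolute path for `requested`, or raise if it leaves `root`."""
    root_path = Path(root).resolve()
    # Joining and then resolving collapses ".." segments and follows symlinks.
    # An absolute `requested` replaces the root entirely, so it is caught below.
    candidate = (root_path / requested).resolve()
    if not candidate.is_relative_to(root_path):  # requires Python 3.9+
        raise SandboxViolation(f"{requested!r} escapes sandbox {root_path}")
    return candidate
```

Resolving before comparing matters here: a plain string-prefix check on the unresolved path would accept `data/../../secrets.txt`.
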
Built-in skills
Four built-in skills ship with Model Lens (see packages/skills/builtins/):
| Skill | Description | Input | Output |
|---|---|---|---|
| read_file | Read file contents within sandbox | path | File text or error |
| write_file | Write content to file within sandbox | path, content | Success/error |
| json_parse | Parse and validate JSON | content, schema | Parsed data or validation errors |
| diff | Compute text diff between two strings | original, modified | Unified diff output |
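
To make the input/output columns concrete, here is a rough stand-in for what a skill like `diff` computes, using Python's standard `difflib`. It is an illustration only: the function name `unified_diff_text` is hypothetical, and the built-in may be implemented differently.

```python
import difflib


def unified_diff_text(original: str, modified: str) -> str:
    """Return a unified diff between two strings, or "" when they are equal."""
    diff_lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        modified.splitlines(keepends=True),
        fromfile="original",
        tofile="modified",
    )
    return "".join(diff_lines)


if __name__ == "__main__":
    print(unified_diff_text("alpha\nbeta\n", "alpha\ngamma\n"))
```

Because the result depends only on the two input strings, a function of this shape is deterministic in the sense described under Philosophy.
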
Lockfile system
The modellens.lock file ensures reproducibility across machines:
```json
{
  "skills": {
    "read_file": {
      "version": "1.0.0",
      "checksum": "a1b2c3d4e5f6..."
    },
    "write_file": {
      "version": "1.0.0",
      "checksum": "b2c3d4e5f6a1..."
    }
  },
  "mode": "strict"
}
```

Lock modes
| Mode | Behavior |
|---|---|
| strict | Fail on any version mismatch (required for benchmarks) |
| warn | Log warning but continue (dev only) |
| ignore | Skip lock checking entirely (NOT for benchmarks) |
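
The three modes can be read as a small decision rule applied to each locked skill. The sketch below is one interpretation, not the project's code: `check_lock`, `LockMismatch`, and the `installed` mapping are hypothetical names, only versions are compared (a checksum comparison would follow the same pattern), and defaulting to `strict` when `mode` is absent is an assumption.

```python
import logging

logger = logging.getLogger("lockcheck")

VALID_MODES = ("strict", "warn", "ignore")


class LockMismatch(Exception):
    """Hypothetical error: an installed skill differs from the lockfile."""


def check_lock(lock: dict, installed: dict[str, str]) -> list[str]:
    """Compare installed skill versions against a parsed lockfile.

    Returns the mismatch messages (empty when everything matches).
    """
    mode = lock.get("mode", "strict")  # assumed default
    if mode not in VALID_MODES:
        raise ValueError(f"unknown lock mode: {mode!r}")
    if mode == "ignore":
        return []  # skip lock checking entirely

    mismatches = []
    for name, entry in lock.get("skills", {}).items():
        found = installed.get(name)  # None when the skill is not installed
        if found != entry["version"]:
            mismatches.append(f"{name}: locked {entry['version']}, found {found}")

    if mismatches and mode == "strict":
        raise LockMismatch("; ".join(mismatches))
    for message in mismatches:  # only reached in warn mode
        logger.warning(message)
    return mismatches
```

Parsing the lockfile with `json.load` and passing the result as `lock` would be enough to try this out against the example file shown earlier.
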
Skill interface
Every skill extends the Skill abstract base class:
```python
from skills.types import Skill, SkillManifest, SkillInput, SkillOutput, SkillContext


class MySkill(Skill):
    def _create_manifest(self) -> SkillManifest:
        return SkillManifest(
            name="my_skill",
            version="1.0.0",
            description="What this skill does",
            input_schema={
                "type": "object",
                "required": ["input_field"],
                "properties": {
                    "input_field": {"type": "string"},
                },
            },
            tags=["utility"],
        )

    async def run(self, input_data: SkillInput, ctx: SkillContext) -> SkillOutput:
        value = input_data.get("input_field")
        return SkillOutput(success=True, data={"result": value.upper()})
```

Key types
| Type | Purpose |
|---|---|
| SkillManifest | Static metadata: name, version, schemas, tags |
| SkillInput | Validated input with JSON schema checking |
| SkillOutput | Success/failure with data or error |
| SkillContext | Sandboxed environment (working directory, run ID) |
| Action | A single tool invocation in agentic sequences |
| AgenticResponse | Ordered list of actions for agentic benchmarks |
| AgenticScore | Multi-dimensional agentic evaluation score |
Agentic scoring
For agentic/tool-use benchmarks, models are scored on:
| Dimension | Weight | Description |
|---|---|---|
| Validity | 25% | Well-formed JSON? Schema-compliant? Real skills? |
| Planning | 25% | Tool sequence correct and minimal? |
| Skill correctness | 30% | Correct tool chosen? Correct parameters? |
| Constraint adherence | 20% | Respects skill allowlist? No hallucinations? |
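
The weights sum to 100%, so one natural reading is a weighted sum of per-dimension scores. The sketch below assumes exactly that, with each dimension scored in the range 0.0 to 1.0. The function name, the dictionary keys, and the combination rule itself are assumptions for illustration, since the table specifies only the weights.

```python
# Weights taken from the table above; the key names are hypothetical.
WEIGHTS = {
    "validity": 0.25,
    "planning": 0.25,
    "skill_correctness": 0.30,
    "constraint_adherence": 0.20,
}


def overall_score(dimensions: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each expected in [0.0, 1.0]."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= dimensions[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {dimensions[name]}")
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
```

Under this reading, a response scoring 1.0 on validity, 0.5 on planning, 0.8 on skill correctness, and 1.0 on constraint adherence comes out at 0.25 + 0.125 + 0.24 + 0.20 = 0.815.
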
Registry
```python
from skills.registry import create_registry

# Auto-loads built-ins and validates against lockfile
registry = create_registry(lockfile_path="modellens.lock")

# List registered skills
print(registry.list_names())  # ['diff', 'json_parse', 'read_file', 'write_file']

# Get a skill
skill = registry.get("json_parse")

# Skills CANNOT be registered after finalize()
registry.finalize()
```

See also
- Provider Contract — formal interface specification
- Contributing — how to add new skills