Skills

Skills are pure function tools that benchmark logic can invoke. They are versioned, sandboxed, and validated against a lockfile for deterministic, reproducible evaluation.

Benchmark logic should be extensible without bloating core. Skills provide:

  • Determinism — Same input always produces same output
  • Sandboxing — No filesystem/network access without explicit permission
  • Versioning — Changes require version bumps + lockfile updates
  • Immutability — Cannot be modified at runtime once registered
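
The determinism property can be spot-checked by running the same input several times and comparing a hash of the canonicalized output. The sketch below is illustrative only: `output_fingerprint`, `is_deterministic`, and `shout` are hypothetical names, not part of Model Lens, and it assumes outputs are JSON-serializable.

```python
import hashlib
import json
from typing import Any, Callable


def output_fingerprint(output: Any) -> str:
    """Hash a JSON-serializable output independently of dict key order."""
    canonical = json.dumps(output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(fn: Callable[[dict], Any], payload: dict, runs: int = 3) -> bool:
    """Return True if repeated calls on copies of the payload give identical output."""
    fingerprints = {output_fingerprint(fn(dict(payload))) for _ in range(runs)}
    return len(fingerprints) == 1


def shout(payload: dict) -> dict:
    # Hypothetical pure function: the output depends only on the input.
    return {"result": payload["input_field"].upper()}


if __name__ == "__main__":
    print(is_deterministic(shout, {"input_field": "hello"}))
```

A passing check of this kind cannot prove determinism; it only catches functions that visibly vary between runs, for example ones that read the clock or a random source.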

Four built-in skills ship with Model Lens (see packages/skills/builtins/):

| Skill | Description | Input | Output |
| --- | --- | --- | --- |
| read_file | Read file contents within sandbox | path | File text or error |
| write_file | Write content to file within sandbox | path, content | Success/error |
| json_parse | Parse and validate JSON | content, schema | Parsed data or validation errors |
| diff | Compute text diff between two strings | original, modified | Unified diff output |
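
To illustrate what "within sandbox" could mean for a file skill, here is a minimal sketch of path containment using only the standard library. It is not the shipped `read_file` implementation; `SandboxViolation`, `resolve_in_sandbox`, and the dictionary return shape are assumptions made for the example.

```python
from pathlib import Path


class SandboxViolation(Exception):
    """Raised when a requested path would escape the sandbox root."""


def resolve_in_sandbox(sandbox_root: Path, requested: str) -> Path:
    """Resolve `requested` against the sandbox root and reject escapes."""
    root = sandbox_root.resolve()
    candidate = (root / requested).resolve()
    # resolve() collapses ".." and follows symlinks, so the containment
    # check runs against the real target rather than the raw string.
    if candidate != root and root not in candidate.parents:
        raise SandboxViolation(f"{requested!r} is outside the sandbox")
    return candidate


def read_file(sandbox_root: Path, path: str) -> dict:
    """Return the file text on success, or an error message on failure."""
    try:
        target = resolve_in_sandbox(sandbox_root, path)
        return {"success": True, "data": target.read_text(encoding="utf-8")}
    except (SandboxViolation, OSError, UnicodeDecodeError) as exc:
        return {"success": False, "error": str(exc)}
```

Returning an error value instead of raising mirrors the "File text or error" output described in the table; absolute paths and `..` segments both fail the containment check.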

The modellens.lock file ensures reproducibility across machines:

{
  "skills": {
    "read_file": { "version": "1.0.0", "checksum": "a1b2c3d4e5f6..." },
    "write_file": { "version": "1.0.0", "checksum": "b2c3d4e5f6a1..." }
  },
  "mode": "strict"
}

| Mode | Behavior |
| --- | --- |
| strict | Fail on any version mismatch (required for benchmarks) |
| warn | Log warning but continue (dev only) |
| ignore | Skip lock checking entirely (NOT for benchmarks) |
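
The three modes can be sketched as a single check applied to each skill. This is a hedged illustration, not the actual Model Lens validator: `InstalledSkill`, `check_against_lock`, and `LockMismatch` are hypothetical names, and the sketch assumes the checksum is a SHA-256 hex digest of the skill's source bytes, which the page does not confirm. It also treats a checksum mismatch or a missing lock entry the same way as a version mismatch.

```python
import hashlib
import logging
from dataclasses import dataclass

logger = logging.getLogger("lockcheck")


class LockMismatch(Exception):
    """Raised in strict mode when a skill does not match its lock entry."""


@dataclass(frozen=True)
class InstalledSkill:
    name: str
    version: str
    source: bytes  # assumption: the checksum covers the skill's source bytes


def checksum(source: bytes) -> str:
    # Assumption: SHA-256 hex digest; the real algorithm may differ.
    return hashlib.sha256(source).hexdigest()


def check_against_lock(skill: InstalledSkill, lock: dict) -> bool:
    """Return True if no mismatch was found or checking was skipped."""
    mode = lock["mode"]
    if mode == "ignore":
        return True  # lock checking is skipped entirely
    if mode not in ("strict", "warn"):
        raise ValueError(f"unknown lock mode: {mode!r}")

    entry = lock.get("skills", {}).get(skill.name)
    problems = []
    if entry is None:
        problems.append("not listed in lockfile")
    else:
        if entry["version"] != skill.version:
            problems.append(f"version {skill.version} != locked {entry['version']}")
        if entry["checksum"] != checksum(skill.source):
            problems.append("checksum differs from lockfile")

    if not problems:
        return True
    message = f"{skill.name}: " + "; ".join(problems)
    if mode == "strict":
        raise LockMismatch(message)  # strict: fail immediately
    logger.warning(message)  # warn: log and continue
    return False
```

Raising in `strict` mode keeps a benchmark from running against skills that differ from the locked set, while `warn` leaves the decision to the developer.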

Every skill extends the Skill abstract base class:

from skills.types import Skill, SkillManifest, SkillInput, SkillOutput, SkillContext


class MySkill(Skill):
    def _create_manifest(self) -> SkillManifest:
        return SkillManifest(
            name="my_skill",
            version="1.0.0",
            description="What this skill does",
            input_schema={
                "type": "object",
                "required": ["input_field"],
                "properties": {
                    "input_field": {"type": "string"},
                },
            },
            tags=["utility"],
        )

    async def run(self, input_data: SkillInput, ctx: SkillContext) -> SkillOutput:
        value = input_data.get("input_field")
        return SkillOutput(success=True, data={"result": value.upper()})

| Type | Purpose |
| --- | --- |
| SkillManifest | Static metadata: name, version, schemas, tags |
| SkillInput | Validated input with JSON schema checking |
| SkillOutput | Success/failure with data or error |
| SkillContext | Sandboxed environment (working directory, run ID) |
| Action | A single tool invocation in agentic sequences |
| AgenticResponse | Ordered list of actions for agentic benchmarks |
| AgenticScore | Multi-dimensional agentic evaluation score |
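
To show how validated input and a success/failure output might fit together, the sketch below runs the `MySkill` logic against a stand-in for `SkillOutput`. Everything here is an assumption for illustration: the dataclass is not the real `skills.types.SkillOutput`, and `validation_errors` checks only `required` fields and string types rather than full JSON Schema.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SkillOutput:
    # Hypothetical stand-in for skills.types.SkillOutput.
    success: bool
    data: Optional[dict] = None
    error: Optional[str] = None


INPUT_SCHEMA = {
    "type": "object",
    "required": ["input_field"],
    "properties": {"input_field": {"type": "string"}},
}


def validation_errors(payload: dict, schema: dict) -> list:
    """Check required fields and string-typed properties only."""
    errors = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in payload
    ]
    for name, spec in schema.get("properties", {}).items():
        if name in payload and spec.get("type") == "string":
            if not isinstance(payload[name], str):
                errors.append(f"{name} must be a string")
    return errors


async def run_my_skill(payload: dict) -> SkillOutput:
    """Validate the payload, then apply the same logic as MySkill.run."""
    errors = validation_errors(payload, INPUT_SCHEMA)
    if errors:
        return SkillOutput(success=False, error="; ".join(errors))
    return SkillOutput(success=True, data={"result": payload["input_field"].upper()})


if __name__ == "__main__":
    print(asyncio.run(run_my_skill({"input_field": "hello"})))
    print(asyncio.run(run_my_skill({})))
```

Validating before the skill body runs means `value.upper()` is never reached with a missing or non-string field; a real setup would more likely delegate this to a full JSON Schema validator.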

For agentic/tool-use benchmarks, models are scored on:

| Dimension | Weight | Description |
| --- | --- | --- |
| Validity | 25% | Well-formed JSON? Schema-compliant? Real skills? |
| Planning | 25% | Tool sequence correct and minimal? |
| Skill correctness | 30% | Correct tool chosen? Correct parameters? |
| Constraint adherence | 20% | Respects skill allowlist? No hallucinations? |
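
One plausible way to combine the four dimensions is a weighted sum. The sketch below assumes each dimension is scored in the range 0.0 to 1.0 and that the overall score is the weighted sum; the page gives only the weights, so both the scale and the `overall_score` helper are assumptions.

```python
WEIGHTS = {
    "validity": 0.25,
    "planning": 0.25,
    "skill_correctness": 0.30,
    "constraint_adherence": 0.20,
}


def overall_score(dimension_scores: dict) -> float:
    """Weighted sum of per-dimension scores, each expected in [0.0, 1.0]."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= dimension_scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "validity": 1.0,
        "planning": 0.5,
        "skill_correctness": 0.8,
        "constraint_adherence": 1.0,
    }
    print(round(overall_score(example), 3))
```

Under these assumptions the example works out to 0.25 + 0.125 + 0.24 + 0.20, or about 0.815.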

from skills.registry import create_registry

# Auto-loads built-ins and validates against lockfile
registry = create_registry(lockfile_path="modellens.lock")

# List registered skills
print(registry.list_names())  # ['diff', 'json_parse', 'read_file', 'write_file']

# Get a skill
skill = registry.get("json_parse")

# Skills CANNOT be registered after finalize()
registry.finalize()
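
The register-then-finalize behavior can be pictured with a small stand-in class. This is a sketch of the idea only, not the real registry: `SketchRegistry` and `RegistryFinalizedError` are hypothetical, and the actual error type raised after `finalize()` is not documented here.

```python
class RegistryFinalizedError(RuntimeError):
    """Raised when registration is attempted after finalize()."""


class SketchRegistry:
    """Hypothetical registry that becomes read-only once finalized."""

    def __init__(self) -> None:
        self._skills = {}
        self._finalized = False

    def register(self, name: str, skill: object) -> None:
        if self._finalized:
            raise RegistryFinalizedError(f"cannot register {name!r} after finalize()")
        if name in self._skills:
            raise ValueError(f"skill {name!r} is already registered")
        self._skills[name] = skill

    def finalize(self) -> None:
        # After this call the set of skills can no longer change.
        self._finalized = True

    def get(self, name: str) -> object:
        return self._skills[name]

    def list_names(self) -> list:
        # Sorted so the listing does not depend on registration order.
        return sorted(self._skills)
```

Freezing the registry before a run starts means the set of callable skills is fixed for the whole evaluation, which is what the immutability property requires.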