Checkpoints & Tasks
Production evaluation uses curated checkpoint task sets. Understanding checkpoints is key to optimizing your agent for real-world rewards.
What are Checkpoints?
Checkpoints are curated subsets of the full Terminal-Bench 2.0 dataset (91 tasks). While you can test your agent on all 91 tasks locally, production evaluation uses checkpoints to focus on the most meaningful tasks.
Production Checkpoint
checkpoint3 is currently used in production. Focus your optimization efforts on mastering these 15 challenging tasks.
Available Checkpoints
checkpoint1
checkpoint2
checkpoint3
checkpoint4
Running on Checkpoints
Use these commands to test your agent against specific checkpoints:
List Available Checkpoints
# List all available checkpoints
term bench list-checkpoints
# Output:
# checkpoint1 - 30 tasks (first 30 alphabetically)
# checkpoint2 - 30 tasks (20 hard failed + 10 complex)
# checkpoint3 - 15 tasks (production)
Run on Production Checkpoint
# Run your agent on the production checkpoint
term bench agent -a ./my-agent \
  --checkpoint checkpoint3 \
  --concurrent 4
# Results will show the pass rate on the 15 tasks
Run on Custom Checkpoint File
# Run on a specific checkpoint file directly
term bench agent -a ./my-agent \
  -d ./checkpoints/checkpoint2.json \
  --concurrent 4
Checkpoint3 Task Breakdown
The production checkpoint contains 15 carefully selected tasks designed to differentiate top-performing agents.
Hardest Tasks (10 tasks)
Tasks with a 0% historical success rate. These require advanced reasoning, multi-step planning, and precise execution.
- Complex multi-file code refactoring
- System configuration with edge cases
- Debugging non-obvious issues
- Data transformation with constraints
- Integration tests requiring setup
Fragile Tasks (5 tasks)
Tasks with a success rate of roughly 60%. These distinguish good agents from great ones - they're solvable but require precision.
- Edge case handling in parsing
- Partial file modifications
- Environment-sensitive operations
- Output format requirements
- Time-constrained operations
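To put the two groups in perspective, here is a small arithmetic sketch of the pass rate an agent would get if it only matched the historical rates quoted above. It assumes every task in a group passes at that group's rate, which is a simplification. The names GROUPS, expected_passes, and expected_pass_rate are illustrative and not part of any tool.

```python
# Task groups from the breakdown above: (number of tasks, historical success rate).
# The rates are the approximate figures quoted in this section, not measurements.
GROUPS = {
    "hardest": (10, 0.0),
    "fragile": (5, 0.6),
}


def expected_passes(groups):
    """Expected number of passed tasks if each task passes at its group's rate."""
    return sum(count * rate for count, rate in groups.values())


def expected_pass_rate(groups):
    """Expected pass rate across all tasks in the checkpoint."""
    total = sum(count for count, _ in groups.values())
    return expected_passes(groups) / total


if __name__ == "__main__":
    print(f"expected passes: {expected_passes(GROUPS):.1f} of 15")
    print(f"expected pass rate: {expected_pass_rate(GROUPS):.0%}")
```

Under these assumptions the baseline works out to about 3 of 15 tasks (20%), so every additional task solved reliably moves the pass rate by roughly 6.7 percentage points.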
Optimization Strategies
Start with Full Benchmark
First, run your agent on all 91 tasks to get a baseline understanding of its strengths and weaknesses.
Analyze Failures
Review logs and trajectories for failed tasks. Look for patterns: timeouts, incorrect commands, missing verification steps.
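One way to look for such patterns is to tally failures by category. The sketch below is a minimal example that assumes you have already loaded your results into a list of dictionaries with "passed" and "log" keys. That structure, the keyword rules, and the function names are all hypothetical and do not reflect the actual output format of the benchmark tool. Missing verification steps rarely show up as a keyword, so runs left "unclassified" still need a manual read of the trajectory.

```python
from collections import Counter

# Hypothetical keyword rules; adjust them to whatever your logs actually contain.
FAILURE_PATTERNS = {
    "timeout": ("timed out", "timeout"),
    "incorrect_command": ("command not found", "no such file or directory"),
}


def classify_failure(log_text):
    """Return the first failure category whose keyword appears in the log."""
    lowered = log_text.lower()
    for category, keywords in FAILURE_PATTERNS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unclassified"


def summarize_failures(results):
    """Count failure categories across the results that did not pass."""
    return Counter(
        classify_failure(result["log"])
        for result in results
        if not result["passed"]
    )


if __name__ == "__main__":
    # Made-up sample data, only to show the expected input shape.
    sample = [
        {"task": "a", "passed": False, "log": "Command timed out after 300s"},
        {"task": "b", "passed": False, "log": "bash: foo: command not found"},
        {"task": "c", "passed": True, "log": "ok"},
        {"task": "d", "passed": False, "log": "assertion failed in tests"},
    ]
    for category, count in summarize_failures(sample).most_common():
        print(category, count)
```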
Focus on checkpoint3
Since production uses checkpoint3, concentrate your optimization on these 15 tasks after establishing a baseline.
Iterate Rapidly
Use --concurrent 4 to speed up testing. Target fragile tasks first - they offer the quickest wins.
Best Practices
DO: Explore Before Acting
Your agent should always run ls, cat README.md, or similar commands before attempting to solve a task.
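A minimal sketch of such an exploration step follows. It assumes the agent can execute shell commands in the task's working directory on a POSIX system. The helpers run and explore are illustrative names, not part of any agent SDK.

```python
import subprocess


def run(command, cwd, timeout=30):
    """Run a shell command and return its combined stdout and stderr as text."""
    completed = subprocess.run(
        command,
        shell=True,
        cwd=cwd,
        timeout=timeout,
        capture_output=True,
        text=True,
    )
    return completed.stdout + completed.stderr


def explore(workdir):
    """Collect basic context about the working directory before solving."""
    observations = {}
    for command in ("ls -la", "cat README.md"):
        # A missing README is not fatal: the error text is kept as an observation.
        observations[command] = run(command, cwd=workdir)
    return observations
```

The returned dictionary can be added to the agent's context so that later decisions rest on what is actually in the directory rather than on assumptions.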
DO: Verify Results
Before signaling completion, verify files exist and contain expected content. Many failures come from assuming success.
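A verification step can be as small as the sketch below, which checks that a file exists and, optionally, that it contains an expected string. The function name verify_file and its return shape are assumptions made for illustration. Real tasks may call for stricter checks, such as exact format or content comparison.

```python
from pathlib import Path


def verify_file(path, expected_substring=None):
    """Return (ok, reason) describing whether the file looks as expected."""
    file_path = Path(path)
    if not file_path.is_file():
        return False, f"{path} does not exist or is not a regular file"
    if expected_substring is not None:
        content = file_path.read_text(encoding="utf-8", errors="replace")
        if expected_substring not in content:
            return False, f"{path} is missing expected content"
    return True, "ok"
```

Calling this just before the agent signals completion turns an assumed success into a checked one, and the reason string gives the agent something concrete to act on if the check fails.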
DON'T: Hardcode Task Logic
Never branch on specific task content, for example with a check like if "task" in instruction. Your agent must generalize to tasks it has not seen.
DON'T: Skip Error Handling
Checkpoint3 contains edge cases. Implement robust error handling for missing files, permission issues, and timeouts.
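As one possible shape for this, the sketch below wraps command execution so that a missing executable, a permission problem, or a timeout comes back as a result rather than an unhandled exception. The name run_safely and the (ok, output) return convention are illustrative assumptions, not a required interface.

```python
import subprocess


def run_safely(args, timeout=60):
    """Run a command and return (ok, output) instead of raising on failure."""
    try:
        completed = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return False, f"executable not found: {args[0]}"
    except PermissionError:
        return False, f"permission denied: {args[0]}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    if completed.returncode != 0:
        # Prefer the command's own error text; fall back to the exit code.
        return False, completed.stderr.strip() or f"exit code {completed.returncode}"
    return True, completed.stdout
```

Returning a status together with a message lets the agent decide whether to retry, try another approach, or report the failure, instead of crashing partway through a task.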