Agent Evaluation Framework
This document defines how to evaluate agent performance and when to re-think an approach across your MyWorkspace projects.
Evaluation Criteria
Primary Metrics
| Metric | Threshold | Action |
|---|---|---|
| Progress Rate | < 10% per 30 min | Re-evaluate approach |
| Same Error Pattern | > 3 failures | Investigate root cause |
| Test Harness | Time per iteration | Track convergence speed |
| File Changes | No meaningful changes | Agent stuck or unclear task |
| Time Elapsed | > 2x estimate | Re-think strongly advised |
| Time Elapsed | > 3x estimate | Re-think required |
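The thresholds in the table above can be checked mechanically. The sketch below is one possible encoding, not part of any tool named in this document: `Snapshot` and `primary_actions` are hypothetical names, and it assumes progress is already expressed as a percentage gained over the last 30 minutes. The Test Harness row is left out because it tracks a trend and has no fixed threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical container for one checkpoint's measurements."""
    elapsed_min: float            # minutes spent on the task so far
    estimate_min: float           # original estimate in minutes
    progress_gain_pct: float      # progress gained over the last 30 minutes
    same_error_count: int         # failures showing the same error pattern
    meaningful_file_changes: bool


def primary_actions(s: Snapshot) -> List[str]:
    """Return the actions from the Primary Metrics table that apply."""
    if s.estimate_min <= 0:
        raise ValueError("estimate_min must be positive")
    actions = []
    if s.progress_gain_pct < 10:
        actions.append("Re-evaluate approach")
    if s.same_error_count > 3:
        actions.append("Investigate root cause")
    if not s.meaningful_file_changes:
        actions.append("Agent stuck or unclear task")
    ratio = s.elapsed_min / s.estimate_min
    if ratio > 3:
        actions.append("Re-think required")
    elif ratio > 2:
        actions.append("Re-think strongly advised")
    return actions
```

An empty list means no threshold has been crossed and the agent can keep running.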
Secondary Metrics
- Context Usage: Monitor token usage in chatlog
- Git Commits: Track meaningful changes
- Test Pass Rate: Monitor improvement over iterations
- API Call Success: For browser automation tasks
Re-think Decision Tree
Task running with agent
│
├─ Time > 50% of estimate?
│    YES → Check progress: still on track?
│            YES → Continue to next checkpoint
│            NO  → Review blockers
│    NO  → Keep working; re-check at the next threshold
│
├─ Time > 90% of estimate?
│    YES → Near completion: keep going → Complete
│    NO  → Keep working; re-check at the next threshold
│
├─ Time > 2x estimate?
│    YES → Re-evaluate
│            - Check task
│            - Review AGENTS.md
│            - Adjust approach
│    NO  → Keep working; re-check at the next threshold
│
└─ Time > 3x estimate?
     YES → Strong re-think
             - Clear task
             - New brief
             - New tool
     NO  → Keep working
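One reading of the decision tree, written as a function, is sketched below. The function name, its parameters, and the order in which the thresholds are tested (largest first) are assumptions made for illustration; the tree itself does not prescribe an implementation.

```python
def rethink_decision(elapsed_min: float, estimate_min: float, on_track: bool) -> str:
    """Walk the time thresholds from the tree, largest first."""
    if estimate_min <= 0:
        raise ValueError("estimate_min must be positive")
    ratio = elapsed_min / estimate_min
    if ratio > 3:
        return "Strong re-think"      # clear task, new brief, new tool
    if ratio > 2:
        return "Re-evaluate"          # check task, review AGENTS.md, adjust approach
    if ratio > 0.9:
        return "Near completion, keep going"
    if ratio > 0.5:
        # Past the halfway mark the progress check decides the branch.
        return "Continue to checkpoint" if on_track else "Review blockers"
    return "Continue"
```

For a 60-minute estimate, `rethink_decision(55, 60, True)` falls in the 90% branch, while `rethink_decision(130, 60, False)` lands in the 2x branch.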
Agent-Specific Evaluation
OpenCode Evaluation
Expected Behavior:
- Reads AGENTS.md for context
- Writes files directly to project
- Runs tests repeatedly
- Reports blockers clearly
Good Signs:
- Multiple git commits per session
- Test failure patterns changing
- Iteration time decreasing
- Clear progress indicators
Bad Signs:
- Repeating same error
- Only small/pointless changes
- Session time increasing
- Agent "thinking" with no output
Actions:
- Minor stall: Wait 5-10 min
- Repeated errors: Update AGENTS.md, clarify task
- No progress: Pause, re-evaluate task brief
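Spotting the "repeating same error" sign can be partly automated. The following is a minimal sketch, assuming error messages have already been collected as plain strings (for example from test output); `repeated_error_patterns` is a hypothetical helper, and the default threshold of 3 mirrors the "> 3 failures" row in Primary Metrics.

```python
import re
from collections import Counter
from typing import List


def repeated_error_patterns(errors: List[str], threshold: int = 3) -> List[str]:
    """Return normalised error patterns seen more than `threshold` times."""

    def normalise(message: str) -> str:
        # Lower-case and replace digit runs so "after 3000ms" and
        # "after 5000ms" count as the same pattern.
        return re.sub(r"\d+", "N", message.strip().lower())

    counts = Counter(normalise(e) for e in errors)
    return [pattern for pattern, n in counts.items() if n > threshold]
```

Normalising digits is a deliberately crude choice; messages that differ in file paths or identifiers would need a project-specific rule.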
Aider Evaluation
Expected Behavior:
- CLI-based, simple interactions
- Works well for single-file changes
- Requires model configuration
Good Signs:
- Quick response times
- Clean diff output
- Minimal context needed
Bad Signs:
- Repeated file overwrites
- Model timeout errors
- Large context required
Playwright Evaluation
Expected Behavior:
- Test files in tests/ folder
- HTML report output
- Screenshot on failure
Good Signs:
- Tests running successfully
- Reports capturing issues
- Network interception working
Bad Signs:
- Browser not launching
- API calls timing out
- Element not found errors
Task Progress Tracking
For Each Task
Create/Update: <project-root>/tasks.md
# Task: Increase Test Coverage for LinkdingSync
## Start Time
2026-05-09 08:00
## Estimated Duration
45 minutes
## Current Progress
25% - Test structure created
## Current Blockers
None
## Next Steps
1. Implement auth test
2. Implement API call test
3. Run full suite
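Because tasks.md follows a fixed heading layout, it can be read back by a script at each checkpoint. The parser below is a sketch under that assumption; `parse_tasks_md` is a hypothetical helper and only understands the `# Task:` and `## ` headings shown in the template above.

```python
from typing import Dict, List


def parse_tasks_md(text: str) -> Dict[str, str]:
    """Split a tasks.md file into {heading: body} using its headings."""
    sections: Dict[str, List[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            # A second-level heading opens a new section.
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("# Task:"):
            # The title line is stored under the key "Task".
            sections["Task"] = [line[len("# Task:"):].strip()]
            current = None
        elif current is not None and line:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}
```

With the template above, `parse_tasks_md(text)["Estimated Duration"]` yields the string `45 minutes`, which a caller would still need to convert to a number.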
Checkpoint Questions
At 50% time:
- Is the agent still making progress?
- Are tests converging or regressing?
- Have blockers been identified?
At 90% time:
- Should be near completion
- Review remaining work
- Decide: continue or adjust
After 2x time:
- Review AGENTS.md for missing context
- Check task brief clarity
- Consider tool change
After 3x time:
- Strong evidence of stuck loop
- Re-think required
- New approach or tool needed
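The four checkpoints can be turned into clock times once a start time and an estimate are known. This is a small sketch using only the standard library; `CHECKPOINTS` and `checkpoint_times` are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict

# Checkpoint labels mapped to multiples of the estimated duration.
CHECKPOINTS = {"50%": 0.5, "90%": 0.9, "2x": 2.0, "3x": 3.0}


def checkpoint_times(start: datetime, estimate_min: float) -> Dict[str, datetime]:
    """Return the wall-clock time at which each checkpoint falls due."""
    return {
        label: start + timedelta(minutes=estimate_min * factor)
        for label, factor in CHECKPOINTS.items()
    }
```

For the tasks.md example (start 2026-05-09 08:00, 45 minutes estimated), the 50% checkpoint falls at 08:22:30, the 2x checkpoint at 09:30, and the 3x checkpoint at 10:15.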
Tool Evaluation
When to Switch Tools
| Current Tool | Switch If... | To... |
|---|---|---|
| OpenCode | Simple one-off | Aider |
| OpenCode | Very complex refactoring | Consider re-scoping |
| Aider | Complex iterative task | OpenCode |
| Playwright | Test runner errors | Fix config, continue |
| Any | 3x time with no progress | Re-evaluate approach |
Cross-Project Patterns
Document in docs/tools.md:
- What worked well
- What didn't work
- Tool preferences by project type
- Configuration lessons learned
Documentation Requirements
AGENTS.md (Per Project)
# AGENTS.md
## Project Overview
[What this project does]
## Setup Commands
```bash
npm install
npm run dev
npm test
```
## Architecture
[Brief notes]
## Testing
- Unit tests: npm test
- E2E tests: npx playwright test
- Coverage target: 80%
## Conventions
- Use TypeScript strict mode
- Error handling with try/catch
- API calls must timeout
## Known Issues
- [List if any]
## Project Tools
- Playwright for browser tests
- OpenCode for iteration
- API: https://api.linkding.com
task-brief.md (Per Task)
# Task Brief
## Context
[Why this task]
## Goal
[What needs done]
## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2
## Constraints
- [ ] Constraint 1
## Related Files
- File 1
- File 2
Example Evaluation Log
# Evaluation Log: LinkdingSync Test Harness
## Session 1 (2026-05-09)
### Agent: OpenCode
### Task: Add Playwright tests
### Progress
- [x] Test structure created
- [x] First test implemented
- [ ] Tests converging
### Time Elapsed
30 min (of 60 estimated)
### Issues
- API calls timing out intermittently
### Decision
Continue - tests improving
---
## Session 2 (2026-05-09)
### Time Elapsed
55 min
### Progress
- [x] Tests converging
- [ ] 2 of 3 scenarios passing
### Issues
- Resolved API timeout with retry logic
### Decision
Continue - approaching completion
---
## Final Summary
### Time Actual: 75 min
### Time Estimated: 60 min
### Deviation: +25%
### Outcome
SUCCESS - All acceptance criteria met
### Lessons
- API retry logic needed upfront
- Playwright config requires specific timeout values
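The Deviation line in the final summary is the difference between actual and estimated time, relative to the estimate. A short sketch of that arithmetic follows; both function names are hypothetical.

```python
def deviation_pct(actual_min: float, estimated_min: float) -> float:
    """Relative deviation of actual from estimated time, in percent."""
    if estimated_min <= 0:
        raise ValueError("estimated_min must be positive")
    return (actual_min - estimated_min) / estimated_min * 100


def format_deviation(actual_min: float, estimated_min: float) -> str:
    """Render the deviation with an explicit sign, e.g. for a summary line."""
    return f"{deviation_pct(actual_min, estimated_min):+.0f}%"
```

With the values from the log above, (75 - 60) / 60 gives the +25% shown in the summary.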
Integration with Chat Logs
Automatic Logging
Chat logs are automatically written to:
<project-root>/chatlog.md
Key Information to Capture
At task start:
- Task brief summary
- AGENTS.md reference
- Estimated time
At checkpoints:
- Current progress
- Issues encountered
- Decision made
At completion:
- Time actual vs estimated
- Lessons learned
- Recommendations
Re-think Workflow
When re-thinking is triggered:
- Stop agent (if running in terminal)
- Review chatlog.md for session history
- Check tasks.md for progress notes
- Review AGENTS.md for missing context
- Document in tasks.md:
  - What went wrong
  - What has changed
  - New estimates
- Clarify or update the task brief
- Resume or restart agent
Escalation Path
Agent Struggling → Check AGENTS.md → Update context
→ Continue → Still stuck → Re-evaluate approach
→ Clear approach → Time > 2x → Re-think
↓
Time > 3x or No Progress
↓
Re-think Required:
- New task brief
- Different tool
- New approach
Quick Reference Commands
OpenCode
# Start new task
opencode --task task-brief.md
# Stop (Ctrl+C in terminal)
Aider
# Start
aider
# Stop
Ctrl+C
Playwright
# Run tests
npx playwright test
# With specific project
npx playwright test --project=chromium
Git for Verification
# Check recent commits
git log --oneline -10
# Check what changed
git diff HEAD~5..HEAD
# Check for stuck state (no recent commits and no uncommitted changes)
git log --since="30 minutes ago" --oneline
git status
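The "no new commits" check can also be scripted. The sketch below is an assumption-laden example rather than part of any tool listed here: it shells out to `git log -1 --format=%ct` to read the last commit's Unix timestamp, and the helper names and the 30-minute window are hypothetical choices.

```python
import subprocess
import time
from typing import Optional


def last_commit_epoch(repo_dir: str = ".") -> int:
    """Unix timestamp of the most recent commit in repo_dir."""
    result = subprocess.run(
        ["git", "log", "-1", "--format=%ct"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def minutes_since(commit_epoch: int, now: Optional[float] = None) -> float:
    """Minutes elapsed between a commit timestamp and now."""
    current = time.time() if now is None else now
    return (current - commit_epoch) / 60


def looks_stuck(commit_epoch: int, window_min: float = 30,
                now: Optional[float] = None) -> bool:
    """True when no commit has landed within the window."""
    return minutes_since(commit_epoch, now) > window_min
```

Calling `looks_stuck(last_commit_epoch())` from the project root gives a quick yes/no signal. It says nothing about uncommitted work, so `git status` remains worth checking alongside it.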