# Agent Evaluation Framework

This document defines how to evaluate agent performance and make re-thinking decisions across your MyWorkspace projects.

## Evaluation Criteria

### Primary Metrics

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Progress Rate** | < 10% per 30 min | Re-evaluate approach |
| **Same Error Pattern** | > 3 failures | Investigate root cause |
| **Test Harness** | Time per iteration | Track convergence speed |
| **File Changes** | No meaningful changes | Agent stuck or unclear task |
| **Time Elapsed** | > 2x estimate | Re-think strongly advised |
| **Time Elapsed** | > 3x estimate | Re-think required |

### Secondary Metrics

- **Context Usage**: Monitor token usage in chatlog
- **Git Commits**: Track meaningful changes
- **Test Pass Rate**: Monitor improvement over iterations
- **API Call Success**: For browser automation tasks

## Re-think Decision Tree

```
┌─────────────────────────────────────┐
│       Task Running with Agent       │
└─────────────────────────────────────┘
                   ↓
    ┌─────────────────────────────┐
    │ Is time > 50% of estimate?  │
    └─────────────────────────────┘
         │                   │
     YES │                   │ NO
         ↓                   │
┌──────────────────┐
│ Check progress   │
│ Still on track?  │
└──────────────────┘
         │
     YES │ NO
      ↓     ↓
 Continue  ┌──────────────┐
checkpoint │ Review       │
           │ blockers     │
           └──────────────┘
                  ↓
                  │
         ┌──────────────────┐
         │ Time > 90%?      │
         └──────────────────┘
                  │
              YES │ NO
             ↓             │
    ┌──────────────────┐   │
    │ Near completion  │   │
    │ Keep going       │   │
    └──────────────────┘   │
             ↓             │
          Complete  ┌──────────────┐
                    │ Time > 2x?   │
                    └──────────────┘
                           │
                       YES │ NO
                        ↓     │
      ┌────────────────────┐
      │ Re-evaluate        │
      │ - Check task       │
      │ - Review AGENTS.md │
      │ - Adjust approach  │
      └────────────────────┘
                ↓
         ┌──────────────┐
         │ Time > 3x?   │
         └──────────────┘
                │
            YES │ NO
             ↓     │
  ┌──────────────┐
  │ Strong       │
  │ Re-think     │
  │ - Clear task │
  │ - New brief  │
  │ - New tool   │
  └──────────────┘
```

## Agent-Specific Evaluation

### OpenCode Evaluation

**Expected Behavior:**
- Reads AGENTS.md for context
- Writes files directly to project
- Runs tests repeatedly
- Reports blockers clearly

**Good Signs:**
- Multiple git commits per session
- Test failure patterns changing
- Iteration time decreasing
- Clear progress indicators

**Bad Signs:**
- Repeating same error
- Only small/pointless changes
- Session time increasing
- Agent "thinking" with no output

**Actions:**
- **Minor stall**: Wait 5-10 min
- **Repeated errors**: Update AGENTS.md, clarify task
- **No progress**: Pause, re-evaluate task brief

### Aider Evaluation

**Expected Behavior:**
- CLI-based, simple interactions
- Works well for single-file changes
- Requires model configuration

**Good Signs:**
- Quick response times
- Clean diff output
- Minimal context needed

**Bad Signs:**
- Repeated file overwrites
- Model timeout errors
- Large context required

### Playwright Evaluation

**Expected Behavior:**
- Test files in `tests/` folder
- HTML report output
- Screenshot on failure

**Good Signs:**
- Tests running successfully
- Reports capturing issues
- Network interception working

**Bad Signs:**
- Browser not launching
- API calls timing out
- Element not found errors

## Task Progress Tracking

### For Each Task

Create/Update: `/tasks.md`

```markdown
# Task: Increase Test Coverage for LinkdingSync

## Start Time
2026-05-09 08:00

## Estimated Duration
45 minutes

## Current Progress
25% - Test structure created

## Current Blockers
None

## Next Steps
1. Implement auth test
2. Implement API call test
3. Run full suite
```

### Checkpoint Questions

**At 50% time:**
1. Is the agent still making progress?
2. Are tests converging or regressing?
3. Have blockers been identified?

**At 90% time:**
1. Should be near completion
2. Review remaining work
3. Decide: continue or adjust

**After 2x time:**
1. Review AGENTS.md for missing context
2. Check task brief clarity
3. Consider tool change

**After 3x time:**
1. Strong evidence of stuck loop
2. Re-think required
3. New approach or tool needed

## Tool Evaluation

### When to Switch Tools

| Current Tool | Switch If... | To... |
|--------------|---------------|-------|
| OpenCode | Simple one-off | Aider |
| OpenCode | Very complex refactoring | Consider re-scoping |
| Aider | Complex iterative task | OpenCode |
| Playwright | Test runner errors | Fix config, continue |
| Any | 3x time with no progress | Re-evaluate approach |

### Cross-Project Patterns

**Document in `docs/tools.md`:**
- What worked well
- What didn't work
- Tool preferences by project type
- Configuration lessons learned

## Documentation Requirements

### AGENTS.md (Per Project)

````markdown
# AGENTS.md

## Project Overview
[What this project does]

## Setup Commands
```bash
npm install
npm run dev
npm test
```

## Architecture
[Brief notes]

## Testing
- Unit tests: `npm test`
- E2E tests: `npx playwright test`
- Coverage target: 80%

## Conventions
- Use TypeScript strict mode
- Error handling with try/catch
- API calls must timeout

## Known Issues
- [List if any]

## Project Tools
- Playwright for browser tests
- OpenCode for iteration
- API: `https://api.linkding.com`
````

### task-brief.md (Per Task)

```markdown
# Task Brief

## Context
[Why this task]

## Goal
[What needs done]

## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Constraints
- [ ] Constraint 1

## Related Files
- File 1
- File 2
```

## Example Evaluation Log

```markdown
# Evaluation Log: LinkdingSync Test Harness

## Session 1 (2026-05-09)

### Agent: OpenCode
### Task: Add Playwright tests

### Progress
- [x] Test structure created
- [x] First test implemented
- [ ] Tests converging

### Time Elapsed
30 min (of 60 estimated)

### Issues
- API calls timing out intermittently

### Decision
Continue - tests improving

---

## Session 2 (2026-05-09)

### Time Elapsed
55 min

### Progress
- [x] Tests converging
- [ ] 2 of 3 scenarios passing

### Issues
- Resolved API timeout with retry logic

### Decision
Continue - approaching completion

---

## Final Summary

### Time Actual: 75 min
### Time Estimated: 60 min
### Deviation: +25%

### Outcome
SUCCESS - All acceptance criteria met

### Lessons
- API retry logic needed upfront
- Playwright config requires specific timeout values
```

## Integration with Chat Logs

### Automatic Logging

Chat logs are automatically written to:
- `/chatlog.md`

### Key Information to Capture

**At task start:**
- Task brief summary
- AGENTS.md reference
- Estimated time

**At checkpoints:**
- Current progress
- Issues encountered
- Decision made

**At completion:**
- Time actual vs estimated
- Lessons learned
- Recommendations

## Re-think Workflow

When re-thinking is triggered:

1. **Stop agent** (if running in terminal)
2. **Review chatlog.md** for session history
3. **Check tasks.md** for progress notes
4. **Review AGENTS.md** for missing context
5. **Document in tasks.md**:
   - What went wrong
   - What's changed
   - New estimates
6. **Clear task brief** or update
7. **Resume or restart** agent

## Escalation Path

```
Agent Struggling → Check AGENTS.md → Update context → Continue
                 → Still stuck → Re-evaluate approach → Clear approach
                 → Time > 2x → Re-think
                                   ↓
                      Time > 3x or No Progress
                                   ↓
                      Re-think Required:
                      - New task brief
                      - Different tool
                      - New approach
```

## Quick Reference Commands

### OpenCode

```bash
# Start new task
opencode --task task-brief.md

# Stop (Ctrl+C in terminal)
```

### Aider

```bash
# Start
aider

# Stop: Ctrl+C
```

### Playwright

```bash
# Run tests
npx playwright test

# With specific project
npx playwright test --project=chromium
```

### Git for Verification

```bash
# Check recent commits
git log --oneline -10

# Check what changed
git diff HEAD~5..HEAD

# Check for stuck state (no new commits)
git status
```
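
The "no new commits" check in the Git commands above could be automated with a small helper. The following is a minimal sketch, not part of the framework itself: the function names `is_stale` and `check_repo` and the 30-minute default threshold are assumptions chosen for illustration, and it treats the age of the last commit as a rough proxy for a stalled session.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the 30-minute default are assumptions.

# Succeeds (exit 0) when the gap between two epoch timestamps is greater
# than a threshold given in minutes.
# Usage: is_stale LAST_COMMIT_EPOCH NOW_EPOCH THRESHOLD_MINUTES
is_stale() {
  local last_commit=$1 now=$2 threshold_min=$3
  local age_min=$(( (now - last_commit) / 60 ))
  [ "$age_min" -gt "$threshold_min" ]
}

# Prints a checkpoint hint for the repository in the current directory.
# Usage: check_repo [THRESHOLD_MINUTES]
check_repo() {
  local threshold_min=${1:-30}
  local last_commit
  # %ct is the committer date as a Unix timestamp; fails if there are no commits.
  last_commit=$(git log -1 --format=%ct 2>/dev/null) || return 2
  if is_stale "$last_commit" "$(date +%s)" "$threshold_min"; then
    echo "No commit in over ${threshold_min} min: run a checkpoint review"
  else
    echo "Recent commit found: continue"
  fi
}
```

After sourcing the script, running `check_repo 45` from a project root would apply a 45-minute threshold. A stale result is only a prompt to review `git status` and the chat log, since an agent may be making uncommitted progress.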