# Agent Evaluation Framework

This document defines how to evaluate agent performance and make re-thinking decisions across your MyWorkspace projects.

## Evaluation Criteria

### Primary Metrics

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Progress Rate** | < 10% per 30 min | Re-evaluate approach |
| **Same Error Pattern** | > 3 failures | Investigate root cause |
| **Test Harness** | Time per iteration | Track convergence speed |
| **File Changes** | No meaningful changes | Agent stuck or unclear task |
| **Time Elapsed** | > 2x estimate | Re-think strongly advised |
| **Time Elapsed** | > 3x estimate | Re-think required |

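
The thresholds in the Primary Metrics table can be checked mechanically at each checkpoint. The sketch below is one possible encoding, not part of any tool named in this document: the `Snapshot` class, its field names, and `recommended_actions` are hypothetical, and it assumes you record these measurements yourself.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical checkpoint measurements; field names are illustrative."""
    progress_gain_pct: float       # progress gained over the last 30 minutes
    same_error_failures: int       # failures showing the same error pattern
    meaningful_file_changes: bool  # did the agent change anything that matters?
    elapsed_min: float
    estimate_min: float


def recommended_actions(s: Snapshot) -> list[str]:
    """Return every action from the Primary Metrics table whose threshold is crossed."""
    actions = []
    if s.progress_gain_pct < 10:
        actions.append("Re-evaluate approach")
    if s.same_error_failures > 3:
        actions.append("Investigate root cause")
    if not s.meaningful_file_changes:
        actions.append("Agent stuck or unclear task")
    ratio = s.elapsed_min / s.estimate_min
    if ratio > 3:
        actions.append("Re-think required")
    elif ratio > 2:
        actions.append("Re-think strongly advised")
    return actions


# 100 minutes spent on a 45-minute estimate, slow progress, a repeating error
print(recommended_actions(Snapshot(5, 4, True, 100, 45)))
```

The Test Harness row is left out because the table gives it no numeric threshold; iteration time is something to track over several runs rather than test once.
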
### Secondary Metrics

- **Context Usage**: Monitor token usage in chatlog
- **Git Commits**: Track meaningful changes
- **Test Pass Rate**: Monitor improvement over iterations
- **API Call Success**: For browser automation tasks

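
For the Git Commits metric, a small script can report whether the agent has committed anything recently. This is a minimal sketch that assumes `git` is on the PATH and that a 30-minute window is a sensible default; the helper names (`recent_commits`, `parse_oneline_log`, `looks_stalled`) are made up for illustration.

```python
import subprocess


def parse_oneline_log(output: str) -> list[str]:
    """Split `git log --oneline` output into one entry per commit."""
    return [line for line in output.splitlines() if line.strip()]


def recent_commits(repo_dir: str, since: str = "30 minutes ago") -> list[str]:
    """Return one-line summaries of commits newer than `since`."""
    result = subprocess.run(
        ["git", "log", "--oneline", f"--since={since}"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return parse_oneline_log(result.stdout)


def looks_stalled(commits: list[str]) -> bool:
    """No commits in the window is a hint of a stall, not proof of one."""
    return len(commits) == 0
```

Calling `looks_stalled(recent_commits("."))` from the project root gives a quick yes/no signal. Commit count alone says nothing about whether the changes are meaningful, so it is worth reading the diff as well.
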
## Re-think Decision Tree

```
┌─────────────────────────────────────┐
│       Task Running with Agent       │
└─────────────────────────────────────┘
                   ↓
    ┌─────────────────────────────┐
    │ Is time > 50% of estimate?  │── NO ──→ Continue
    └─────────────────────────────┘
                   │ YES
                   ↓
          ┌──────────────────┐          ┌──────────────┐
          │ Check progress   │── NO ──→ │ Review       │
          │ Still on track?  │          │ blockers     │
          └──────────────────┘          └──────────────┘
                   │ YES                       │
                   ↓                           │
         Continue to checkpoint ←──────────────┘
                   ↓
          ┌──────────────────┐          ┌──────────────────┐
          │ Time > 90%?      │── YES ─→ │ Near completion  │──→ Complete
          └──────────────────┘          │ Keep going       │
                   │ NO                 └──────────────────┘
                   ↓
            ┌──────────────┐
            │ Time > 2x?   │── NO ──→ Continue
            └──────────────┘
                   │ YES
                   ↓
         ┌────────────────────┐
         │ Re-evaluate        │
         │ - Check task       │
         │ - Review AGENTS.md │
         │ - Adjust approach  │
         └────────────────────┘
                   ↓
            ┌──────────────┐
            │ Time > 3x?   │── NO ──→ Continue
            └──────────────┘
                   │ YES
                   ↓
            ┌──────────────┐
            │ Strong       │
            │ Re-think     │
            │ - Clear task │
            │ - New brief  │
            │ - New tool   │
            └──────────────┘
```

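
The decision tree can also be read as a function of the elapsed-to-estimated time ratio. The sketch below is one interpretation of the diagram rather than a definitive encoding: `checkpoint_decision` is a hypothetical helper, and it treats any ratio between 90% and 2x as the "near completion" band.

```python
def checkpoint_decision(elapsed_min: float, estimate_min: float, on_track: bool) -> str:
    """Map elapsed time and a progress judgment to a node of the decision tree."""
    ratio = elapsed_min / estimate_min
    if ratio > 3:
        return "Strong re-think: clear task, new brief, new tool"
    if ratio > 2:
        return "Re-evaluate: check task, review AGENTS.md, adjust approach"
    if ratio > 0.9:
        return "Near completion: keep going"
    if ratio > 0.5:
        # Past the halfway mark the human judgment of progress decides
        return "Continue to checkpoint" if on_track else "Review blockers"
    return "Continue"


print(checkpoint_decision(35, 60, on_track=False))  # Review blockers
print(checkpoint_decision(130, 60, on_track=True))  # Re-evaluate: ...
```

The largest thresholds are checked first so that a task at 3x never falls into the milder 2x branch.
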
## Agent-Specific Evaluation

### OpenCode Evaluation

**Expected Behavior:**
- Reads AGENTS.md for context
- Writes files directly to project
- Runs tests repeatedly
- Reports blockers clearly

**Good Signs:**
- Multiple git commits per session
- Test failure patterns changing
- Iteration time decreasing
- Clear progress indicators

**Bad Signs:**
- Repeating same error
- Only small/pointless changes
- Session time increasing
- Agent "thinking" with no output

**Actions:**
- **Minor stall**: Wait 5-10 min
- **Repeated errors**: Update AGENTS.md, clarify task
- **No progress**: Pause, re-evaluate task brief

### Aider Evaluation

**Expected Behavior:**
- CLI-based, simple interactions
- Works well for single-file changes
- Requires model configuration

**Good Signs:**
- Quick response times
- Clean diff output
- Minimal context needed

**Bad Signs:**
- Repeated file overwrites
- Model timeout errors
- Large context required

### Playwright Evaluation

**Expected Behavior:**
- Test files in `tests/` folder
- HTML report output
- Screenshot on failure

**Good Signs:**
- Tests running successfully
- Reports capturing issues
- Network interception working

**Bad Signs:**
- Browser not launching
- API calls timing out
- Element not found errors

## Task Progress Tracking

### For Each Task

Create/Update: `<project-root>/tasks.md`

```markdown
# Task: Increase Test Coverage for LinkdingSync

## Start Time
2026-05-09 08:00

## Estimated Duration
45 minutes

## Current Progress
25% - Test structure created

## Current Blockers
None

## Next Steps
1. Implement auth test
2. Implement API call test
3. Run full suite
```

### Checkpoint Questions

**At 50% time:**
1. Is the agent still making progress?
2. Are tests converging or regressing?
3. Have blockers been identified?

**At 90% time:**
1. The task should be near completion
2. Review remaining work
3. Decide: continue or adjust

**After 2x time:**
1. Review AGENTS.md for missing context
2. Check task brief clarity
3. Consider tool change

**After 3x time:**
1. Treat this as strong evidence of a stuck loop
2. Re-think required
3. New approach or tool needed

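
To know when each checkpoint falls on the clock, multiply the estimate by the checkpoint factor and add it to the start time. The sketch below uses the start time and estimate from the sample `tasks.md` above; `CHECKPOINTS` and `checkpoint_times` are illustrative names, not part of any existing tool.

```python
from datetime import datetime, timedelta

# Checkpoint labels mapped to multiples of the estimated duration
CHECKPOINTS = {"50%": 0.5, "90%": 0.9, "2x": 2.0, "3x": 3.0}


def checkpoint_times(start: datetime, estimate_min: float) -> dict[str, datetime]:
    """Return the wall-clock time of each checkpoint for a task."""
    return {
        label: start + timedelta(minutes=estimate_min * factor)
        for label, factor in CHECKPOINTS.items()
    }


times = checkpoint_times(datetime(2026, 5, 9, 8, 0), 45)
for label, when in times.items():
    print(label, when.strftime("%H:%M"))
# 50% 08:22
# 90% 08:40
# 2x 09:30
# 3x 10:15
```
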
## Tool Evaluation

### When to Switch Tools

| Current Tool | Switch If... | To... |
|--------------|---------------|-------|
| OpenCode | Simple one-off | Aider |
| OpenCode | Very complex refactoring | Consider re-scoping |
| Aider | Complex iterative task | OpenCode |
| Playwright | Test runner errors | Fix config, continue |
| Any | 3x time with no progress | Re-evaluate approach |

### Cross-Project Patterns

**Document in `docs/tools.md`:**
- What worked well
- What didn't work
- Tool preferences by project type
- Configuration lessons learned

## Documentation Requirements

### AGENTS.md (Per Project)

````markdown
# AGENTS.md

## Project Overview
[What this project does]

## Setup Commands
```bash
npm install
npm run dev
npm test
```

## Architecture
[Brief notes]

## Testing
- Unit tests: `npm test`
- E2E tests: `npx playwright test`
- Coverage target: 80%

## Conventions
- Use TypeScript strict mode
- Error handling with try/catch
- API calls must timeout

## Known Issues
- [List if any]

## Project Tools
- Playwright for browser tests
- OpenCode for iteration
- API: `https://api.linkding.com`
````

### task-brief.md (Per Task)

```markdown
# Task Brief

## Context
[Why this task]

## Goal
[What needs done]

## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Constraints
- [ ] Constraint 1

## Related Files
- File 1
- File 2
```

## Example Evaluation Log

```markdown
# Evaluation Log: LinkdingSync Test Harness

## Session 1 (2026-05-09)

### Agent: OpenCode
### Task: Add Playwright tests

### Progress
- [x] Test structure created
- [x] First test implemented
- [ ] Tests converging

### Time Elapsed
30 min (of 60 estimated)

### Issues
- API calls timing out intermittently

### Decision
Continue - tests improving

---

## Session 2 (2026-05-09)

### Time Elapsed
55 min

### Progress
- [x] Tests converging
- [ ] 2 of 3 scenarios passing

### Issues
- Resolved API timeout with retry logic

### Decision
Continue - approaching completion

---

## Final Summary

### Time Actual: 75 min
### Time Estimated: 60 min
### Deviation: +25%

### Outcome
SUCCESS - All acceptance criteria met

### Lessons
- API retry logic needed upfront
- Playwright config requires specific timeout values
```

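
The deviation in the final summary is the overrun relative to the estimate: (75 - 60) / 60 = +25%. A minimal sketch of that arithmetic, with `deviation_pct` as an illustrative helper name:

```python
def deviation_pct(actual_min: float, estimated_min: float) -> float:
    """Signed deviation of actual time from the estimate, in percent."""
    return (actual_min - estimated_min) / estimated_min * 100


print(f"{deviation_pct(75, 60):+.0f}%")  # +25%
```

A task that ran 75 minutes against a 60-minute estimate is at 1.25x, well below the 2x threshold where a re-think is advised.
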
## Integration with Chat Logs

### Automatic Logging

Chat logs are automatically written to:
- `<project-root>/chatlog.md`

### Key Information to Capture

**At task start:**
- Task brief summary
- AGENTS.md reference
- Estimated time

**At checkpoints:**
- Current progress
- Issues encountered
- Decision made

**At completion:**
- Time actual vs estimated
- Lessons learned
- Recommendations

## Re-think Workflow

When re-thinking is triggered:

1. **Stop agent** (if running in terminal)
2. **Review chatlog.md** for session history
3. **Check tasks.md** for progress notes
4. **Review AGENTS.md** for missing context
5. **Document in tasks.md**:
   - What went wrong
   - What's changed
   - New estimates
6. **Clarify or update the task brief**
7. **Resume or restart** agent

## Escalation Path

```
Agent Struggling → Check AGENTS.md → Update context
                 → Continue → Still stuck → Re-evaluate approach
                 → Clear approach → Time > 2x → Re-think
                                        ↓
                             Time > 3x or No Progress
                                        ↓
                             Re-think Required:
                             - New task brief
                             - Different tool
                             - New approach
```

## Quick Reference Commands

### OpenCode
```bash
# Start new task
opencode --task task-brief.md

# Stop (Ctrl+C in terminal)
```

### Aider
```bash
# Start
aider

# Stop (Ctrl+C in terminal)
```

### Playwright
```bash
# Run tests
npx playwright test

# With specific project
npx playwright test --project=chromium
```

### Git for Verification
```bash
# Check recent commits
git log --oneline -10

# Check what changed
git diff HEAD~5..HEAD

# Check for stuck state (no new commits and a clean working tree)
git status