# Agent Evaluation Framework

This document defines how to evaluate agent performance and decide when to re-think an approach across your MyWorkspace projects.

## Evaluation Criteria

### Primary Metrics

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Progress Rate** | < 10% per 30 min | Re-evaluate approach |
| **Same Error Pattern** | > 3 failures | Investigate root cause |
| **Test Harness** | Time per iteration not decreasing | Check convergence speed |
| **File Changes** | No meaningful changes | Treat agent as stuck or task as unclear |
| **Time Elapsed** | > 2x estimate | Re-think strongly advised |
| **Time Elapsed** | > 3x estimate | Re-think required |
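
The progress-rate and repeated-error rows above can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the function names and the idea of passing progress as percentage points and errors as a list of message strings are assumptions made for illustration.

```python
from collections import Counter


def progress_rate_per_30_min(progress_start: float, progress_now: float,
                             minutes_elapsed: float) -> float:
    """Percentage points of progress gained per 30 minutes."""
    if minutes_elapsed <= 0:
        raise ValueError("minutes_elapsed must be positive")
    return (progress_now - progress_start) / minutes_elapsed * 30


def primary_metric_flags(progress_start: float, progress_now: float,
                         minutes_elapsed: float,
                         error_messages: list[str]) -> list[str]:
    """Return the table's actions for every threshold that is crossed."""
    flags = []
    # Progress Rate: < 10% per 30 min
    if progress_rate_per_30_min(progress_start, progress_now, minutes_elapsed) < 10:
        flags.append("Re-evaluate approach")
    # Same Error Pattern: > 3 failures with the same message
    counts = Counter(error_messages)
    if counts and max(counts.values()) > 3:
        flags.append("Investigate root cause")
    return flags
```

Comparing error messages by exact string is the simplest choice here; real error output often needs normalising (stripping timestamps or line numbers) before identical failures will match.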

### Secondary Metrics

- **Context Usage**: Monitor token usage in chatlog
- **Git Commits**: Track meaningful changes
- **Test Pass Rate**: Monitor improvement over iterations
- **API Call Success**: For browser automation tasks
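
As an illustration of tracking the Git Commits metric, the sketch below flags a session as stalled when no commit has landed recently. The 30-minute window and both function names are assumptions, not part of this framework; `git log --format=%ct` prints one committer timestamp (Unix seconds) per commit.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def commit_times(repo_path: str = ".") -> list[datetime]:
    """Return commit timestamps (newest first) read from `git log`."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ct"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [datetime.fromtimestamp(int(line), tz=timezone.utc)
            for line in out.split()]


def is_stalled(times: list[datetime], now: datetime,
               window: timedelta = timedelta(minutes=30)) -> bool:
    """True when no commit falls within `window` before `now`."""
    return not any(now - t <= window for t in times)
```

A typical call would be `is_stalled(commit_times("."), datetime.now(timezone.utc))`. A commit count alone says nothing about whether the changes are meaningful, so this only complements a manual review of the diff.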

## Re-think Decision Tree

```
Task running with agent
          │
          ▼
Time > 50% of estimate? ──NO──► Continue
          │ YES
          ▼
Progress still on track? ──NO──► Review blockers
          │ YES
          ▼
Continue to next checkpoint
          │
          ▼
Time > 90% of estimate? ──NO──► Continue
          │ YES
          ▼
Near completion? ──YES──► Keep going ──► Complete
          │ NO
          ▼
Time > 2x estimate? ──NO──► Continue
          │ YES
          ▼
Re-evaluate:
  - Check task
  - Review AGENTS.md
  - Adjust approach
          │
          ▼
Time > 3x estimate? ──NO──► Continue
          │ YES
          ▼
Strong re-think:
  - Clear task
  - New brief
  - New tool
```
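
One possible reading of the tree as code is sketched below. It is a simplification under stated assumptions: the function name, the returned labels, and treating every unspecified branch as "continue" are illustrative choices, and `on_track` stands in for the human judgement made at each checkpoint.

```python
def rethink_decision(elapsed_min: float, estimate_min: float,
                     on_track: bool) -> str:
    """Map elapsed time and a progress judgement to a next action."""
    if estimate_min <= 0:
        raise ValueError("estimate_min must be positive")
    ratio = elapsed_min / estimate_min
    if ratio > 3:
        # Time > 3x: clear task, new brief, new tool
        return "re-think required"
    if ratio > 2:
        # Time > 2x: check task, review AGENTS.md, adjust approach
        return "re-evaluate approach"
    if ratio > 0.5 and not on_track:
        # Past the 50% checkpoint and no longer on track
        return "review blockers"
    return "continue"
```

The largest threshold is checked first so that a task at 3.5x its estimate is not reported as merely needing a blocker review.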

## Agent-Specific Evaluation

### OpenCode Evaluation

**Expected Behavior:**
- Reads AGENTS.md for context
- Writes files directly to project
- Runs tests repeatedly
- Reports blockers clearly

**Good Signs:**
- Multiple git commits per session
- Test failure patterns changing
- Iteration time decreasing
- Clear progress indicators

**Bad Signs:**
- Repeating same error
- Only small/pointless changes
- Iteration time increasing
- Agent "thinking" with no output

**Actions:**
- **Minor stall**: Wait 5-10 min
- **Repeated errors**: Update AGENTS.md, clarify task
- **No progress**: Pause, re-evaluate task brief

### Aider Evaluation

**Expected Behavior:**
- CLI-based, simple interactions
- Works well for single-file changes
- Requires model configuration

**Good Signs:**
- Quick response times
- Clean diff output
- Minimal context needed

**Bad Signs:**
- Repeated file overwrites
- Model timeout errors
- Large context required

### Playwright Evaluation

**Expected Behavior:**
- Test files in `tests/` folder
- HTML report output
- Screenshot on failure

**Good Signs:**
- Tests running successfully
- Reports capturing issues
- Network interception working

**Bad Signs:**
- Browser not launching
- API calls timing out
- Element not found errors

## Task Progress Tracking

### For Each Task

Create/Update: `<project-root>/tasks.md`

```markdown
# Task: Increase Test Coverage for LinkdingSync

## Start Time
2026-05-09 08:00

## Estimated Duration
45 minutes

## Current Progress
25% - Test structure created

## Current Blockers
None

## Next Steps
1. Implement auth test
2. Implement API call test
3. Run full suite
```
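
Because the template uses fixed `##` headings, the elapsed-time ratio can be computed straight from the file. This is a minimal sketch that assumes the exact heading names and the `YYYY-MM-DD HH:MM` and `N minutes` formats shown above; `parse_sections` and `elapsed_ratio` are hypothetical helper names.

```python
import re
from datetime import datetime

SAMPLE_TASKS_MD = """\
# Task: Increase Test Coverage for LinkdingSync

## Start Time
2026-05-09 08:00

## Estimated Duration
45 minutes
"""


def parse_sections(text: str) -> dict[str, str]:
    """Map each '## Heading' to the text beneath it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def elapsed_ratio(text: str, now: datetime) -> float:
    """Elapsed time divided by the estimate (1.0 means exactly on estimate)."""
    sections = parse_sections(text)
    start = datetime.strptime(sections["Start Time"], "%Y-%m-%d %H:%M")
    match = re.match(r"(\d+)\s*minutes?", sections["Estimated Duration"])
    if match is None:
        raise ValueError("Estimated Duration must look like '45 minutes'")
    return (now - start).total_seconds() / 60 / int(match.group(1))
```

For the sample, `elapsed_ratio(SAMPLE_TASKS_MD, datetime(2026, 5, 9, 9, 30))` gives 2.0, which is the point at which the 2x review applies.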

### Checkpoint Questions

**At 50% time:**
1. Is the agent still making progress?
2. Are tests converging or regressing?
3. Have blockers been identified?

**At 90% time:**
1. Is the task near completion?
2. What work remains?
3. Should the agent continue, or should the approach be adjusted?

**After 2x time:**
1. Review AGENTS.md for missing context
2. Check task brief clarity
3. Consider tool change

**After 3x time:**
1. Strong evidence of stuck loop
2. Re-think required
3. New approach or tool needed
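
The four checkpoints can be turned into wall-clock times as soon as a task starts. The sketch below assumes a start time and an estimate in minutes; the `CHECKPOINTS` mapping and the function name are illustrative.

```python
from datetime import datetime, timedelta

# Checkpoint label -> fraction (or multiple) of the estimated duration
CHECKPOINTS = {"50%": 0.5, "90%": 0.9, "2x": 2.0, "3x": 3.0}


def checkpoint_times(start: datetime, estimate_min: float) -> dict[str, datetime]:
    """Return the wall-clock time of each checkpoint."""
    return {label: start + timedelta(minutes=estimate_min * factor)
            for label, factor in CHECKPOINTS.items()}
```

For a 60-minute task starting at 08:00, this yields checkpoints at 08:30, 08:54, 10:00, and 11:00.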

## Tool Evaluation

### When to Switch Tools

| Current Tool | Switch If... | To... |
|--------------|---------------|-------|
| OpenCode | Task is a simple one-off change | Aider |
| OpenCode | Refactoring is very complex | Consider re-scoping |
| Aider | Complex iterative task | OpenCode |
| Playwright | Test runner errors | Fix config, continue |
| Any | 3x time with no progress | Re-evaluate approach |

### Cross-Project Patterns

**Document in `docs/tools.md`:**
- What worked well
- What didn't work
- Tool preferences by project type
- Configuration lessons learned

## Documentation Requirements

### AGENTS.md (Per Project)

````markdown
# AGENTS.md

## Project Overview
[What this project does]

## Setup Commands
```bash
npm install
npm run dev
npm test
```

## Architecture
[Brief notes]

## Testing
- Unit tests: `npm test`
- E2E tests: `npx playwright test`
- Coverage target: 80%

## Conventions
- Use TypeScript strict mode
- Error handling with try/catch
- API calls must timeout

## Known Issues
- [List if any]

## Project Tools
- Playwright for browser tests
- OpenCode for iteration
- API: `https://api.linkding.com`
````

### task-brief.md (Per Task)

```markdown
# Task Brief

## Context
[Why this task]

## Goal
[What needs done]

## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Constraints
- [ ] Constraint 1

## Related Files
- File 1
- File 2
```

## Example Evaluation Log

```markdown
# Evaluation Log: LinkdingSync Test Harness

## Session 1 (2026-05-09)

### Agent: OpenCode
### Task: Add Playwright tests

### Progress
- [x] Test structure created
- [x] First test implemented
- [ ] Tests converging

### Time Elapsed
30 min (of 60 estimated)

### Issues
- API calls timing out intermittently

### Decision
Continue - tests improving

---

## Session 2 (2026-05-09)

### Time Elapsed
55 min

### Progress
- [x] Tests converging
- [ ] All scenarios passing (2 of 3 so far)

### Issues
- Resolved API timeout with retry logic

### Decision
Continue - approaching completion

---

## Final Summary

### Time Actual: 75 min
### Time Estimated: 60 min
### Deviation: +25%

### Outcome
SUCCESS - All acceptance criteria met

### Lessons
- API retry logic needed upfront
- Playwright config requires specific timeout values
```
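
The deviation in the final summary is the overrun relative to the estimate. A short worked example, with a hypothetical helper name:

```python
def deviation_percent(actual_min: float, estimated_min: float) -> float:
    """Signed overrun as a percentage of the estimate."""
    return (actual_min - estimated_min) / estimated_min * 100


# 75 minutes actual against 60 estimated: (75 - 60) / 60 = 0.25
print(f"{deviation_percent(75, 60):+.0f}%")  # prints "+25%"
```

The same session expressed as a ratio is 75 / 60 = 1.25x, well below the 2x threshold at which a re-think is advised.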

## Integration with Chat Logs

### Automatic Logging

Chat logs are automatically written to:
- `<project-root>/chatlog.md`

### Key Information to Capture

**At task start:**
- Task brief summary
- AGENTS.md reference
- Estimated time

**At checkpoints:**
- Current progress
- Issues encountered
- Decision made

**At completion:**
- Time actual vs estimated
- Lessons learned
- Recommendations

## Re-think Workflow

When re-thinking is triggered:

1. **Stop agent** (if running in terminal)
2. **Review chatlog.md** for session history
3. **Check tasks.md** for progress notes
4. **Review AGENTS.md** for missing context
5. **Document in tasks.md**:
   - What went wrong
   - What's changed
   - New estimates
6. **Rewrite the task brief** from scratch, or update it
7. **Resume or restart** agent

## Escalation Path

```
Agent Struggling → Check AGENTS.md → Update context
                 → Continue → Still stuck → Re-evaluate approach
                 → Clear approach → Time > 2x → Re-think
                                    ↓
                            Time > 3x or No Progress
                                    ↓
                            Re-think Required:
                            - New task brief
                            - Different tool
                            - New approach
```

## Quick Reference Commands

### OpenCode
```bash
# Start new task
# Note: confirm the --task flag against `opencode --help` for your installed version
opencode --task task-brief.md

# Stop (Ctrl+C in terminal)
```

### Aider
```bash
# Start
aider

# Stop: press Ctrl+C in the terminal
```

### Playwright
```bash
# Run tests
npx playwright test

# With specific project
npx playwright test --project=chromium
```

### Git for Verification
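```bash
# Check recent commits
git log --oneline -10

# Check what changed over the last 5 commits (requires at least 6 commits)
git diff HEAD~5..HEAD

# Check for stuck state: no output means no new commits in the last 30 minutes
git log --oneline --since="30 minutes ago"

# Check for uncommitted work in progress
git status
```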