Agent Evaluation Framework

This document defines how to evaluate agent performance and when to re-think an approach across your MyWorkspace projects.

Evaluation Criteria

Primary Metrics

| Metric | Threshold | Action |
|---|---|---|
| Progress Rate | < 10% per 30 min | Re-evaluate approach |
| Same Error Pattern | > 3 failures | Investigate root cause |
| Test Harness Time | Per iteration | Track convergence speed |
| File Changes | No meaningful changes | Agent stuck or unclear task |
| Time Elapsed | > 2x estimate | Re-think strongly advised |
| Time Elapsed | > 3x estimate | Re-think required |
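
The "Same Error Pattern" row can be checked mechanically. The following is a minimal sketch, assuming the test harness output has been saved to a log in which every failure line contains the text `Error:`. Both that log format and the function name `repeated_errors` are illustrative assumptions, not part of any tool named in this document.

```bash
# Print each error message that occurs more than <threshold> times,
# prefixed by its count. Reads the log from standard input.
# Assumption: failure lines contain "Error:" followed by the message.
repeated_errors() {
  awk -v t="${1:-3}" '
    /Error:/ {
      msg = $0
      sub(/^.*Error:[ \t]*/, "", msg)   # keep only the text after "Error:"
      count[msg]++
    }
    END {
      for (m in count)
        if (count[m] > t) print count[m], m
    }
  '
}
```

Running `repeated_errors 3 < test-output.log` lists only the messages that crossed the "> 3 failures" threshold, so an empty result means no pattern has repeated often enough to investigate.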

Secondary Metrics

  • Context Usage: Monitor token usage in chatlog
  • Git Commits: Track meaningful changes
  • Test Pass Rate: Monitor improvement over iterations
  • API Call Success: For browser automation tasks

Re-think Decision Tree

┌─────────────────────────────────────┐
│   Task Running with Agent           │
└─────────────────────────────────────┘
                ↓
    ┌─────────────────────────────┐
    │ Is time > 50% of estimate?   │
    └─────────────────────────────┘
            │          │
         YES │         NO
            ↓          │
    ┌──────────────────┐
    │ Check progress    │
    │ Still on track?   │
    └──────────────────┘
            │
       YES │ NO
            ↓          ↓
   Continue    ┌──────────────┐
   checkpoint  │ Review       │
               │ blockers     │
               └──────────────┘
            ↓          │
    ┌──────────────────┐
    │ Time > 90%?      │
    └──────────────────┘
            │
       YES │ NO
            ↓          │
    ┌──────────────────┐
    │ Near completion  │
    │ Keep going       │
    └──────────────────┘
            ↓          │
   Complete      ┌──────────────┐
                 │ Time > 2x?   │
                 └──────────────┘
                         │
                    YES │ NO
                         ↓          │
                    ┌────────────────────┐
                    │ Re-evaluate        │
                    │ - Check task       │
                    │ - Review AGENTS.md │
                    │ - Adjust approach  │
                    └────────────────────┘
                         ↓
                    ┌──────────────┐
                    │ Time > 3x?   │
                    └──────────────┘
                             │
                        YES │ NO
                             ↓          │
                        ┌──────────────┐
                        │ Strong       │
                        │ Re-think     │
                        │ - Clear task │
                        │ - New brief  │
                        │ - New tool   │
                        └──────────────┘
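
The time checks in the tree above can be expressed as a small helper. This is a sketch under stated assumptions: inputs are whole minutes, and the stage labels (`keep-running`, `progress-check`, and so on) are hypothetical names chosen for illustration. The "still on track?" judgment itself stays with the reviewer.

```bash
# Map elapsed and estimated minutes to the decision-tree stage.
# Assumption: both arguments are whole minutes; labels are illustrative.
checkpoint_stage() {
  local elapsed="$1" estimate="$2"
  if [ "$elapsed" -gt $(( 3 * estimate )) ]; then
    echo "strong-rethink"            # time > 3x estimate
  elif [ "$elapsed" -gt $(( 2 * estimate )) ]; then
    echo "re-evaluate"               # time > 2x estimate
  elif [ $(( 10 * elapsed )) -gt $(( 9 * estimate )) ]; then
    echo "near-completion-check"     # time > 90% of estimate
  elif [ $(( 2 * elapsed )) -gt "$estimate" ]; then
    echo "progress-check"            # time > 50% of estimate
  else
    echo "keep-running"
  fi
}
```

For example, `checkpoint_stage 130 60` prints `re-evaluate`, because 130 minutes is more than twice the 60-minute estimate but not yet three times it.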

Agent-Specific Evaluation

OpenCode Evaluation

Expected Behavior:

  • Reads AGENTS.md for context
  • Writes files directly to project
  • Runs tests repeatedly
  • Reports blockers clearly

Good Signs:

  • Multiple git commits per session
  • Test failure patterns changing
  • Iteration time decreasing
  • Clear progress indicators

Bad Signs:

  • Repeating same error
  • Only small/pointless changes
  • Session time increasing
  • Agent "thinking" with no output

Actions:

  • Minor stall: Wait 5-10 min
  • Repeated errors: Update AGENTS.md, clarify task
  • No progress: Pause, re-evaluate task brief

Aider Evaluation

Expected Behavior:

  • CLI-based, simple interactions
  • Works well for single-file changes
  • Requires model configuration

Good Signs:

  • Quick response times
  • Clean diff output
  • Minimal context needed

Bad Signs:

  • Repeated file overwrites
  • Model timeout errors
  • Large context required

Playwright Evaluation

Expected Behavior:

  • Test files in tests/ folder
  • HTML report output
  • Screenshot on failure

Good Signs:

  • Tests running successfully
  • Reports capturing issues
  • Network interception working

Bad Signs:

  • Browser not launching
  • API calls timing out
  • Element not found errors

Task Progress Tracking

For Each Task

Create/Update: <project-root>/tasks.md

# Task: Increase Test Coverage for LinkdingSync

## Start Time
2026-05-09 08:00

## Estimated Duration
45 minutes

## Current Progress
25% - Test structure created

## Current Blockers
None

## Next Steps
1. Implement auth test
2. Implement API call test
3. Run full suite

Checkpoint Questions

At 50% time:

  1. Is the agent still making progress?
  2. Are tests converging or regressing?
  3. Have blockers been identified?

At 90% time:

  1. Is the task near completion?
  2. What work remains?
  3. Should the agent continue as planned, or does the approach need adjusting?

After 2x time:

  1. Review AGENTS.md for missing context
  2. Check task brief clarity
  3. Consider tool change

After 3x time:

  1. Strong evidence of stuck loop
  2. Re-think required
  3. New approach or tool needed
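
To know in advance when each set of questions is due, the checkpoints can be converted to minute offsets from the start time. A minimal sketch, assuming the estimate is given in whole minutes; the function name `checkpoint_offsets` is illustrative, and integer division rounds fractional minutes down.

```bash
# Print the minute offset of each checkpoint for a given estimate.
# Assumption: the estimate is a whole number of minutes.
checkpoint_offsets() {
  local estimate="$1"
  printf '50%%: %d min\n' $(( estimate / 2 ))
  printf '90%%: %d min\n' $(( estimate * 9 / 10 ))
  printf '2x: %d min\n'   $(( estimate * 2 ))
  printf '3x: %d min\n'   $(( estimate * 3 ))
}
```

With the 45-minute estimate from the sample tasks.md, `checkpoint_offsets 45` reports the checkpoints at 22, 40, 90, and 135 minutes after the start.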

Tool Evaluation

When to Switch Tools

| Current Tool | Switch If... | To... |
|---|---|---|
| OpenCode | Simple one-off | Aider |
| OpenCode | Very complex refactoring | Consider re-scoping |
| Aider | Complex iterative task | OpenCode |
| Playwright | Test runner errors | Fix config, continue |
| Any | 3x time with no progress | Re-evaluate approach |

Cross-Project Patterns

Document in docs/tools.md:

  • What worked well
  • What didn't work
  • Tool preferences by project type
  • Configuration lessons learned

Documentation Requirements

AGENTS.md (Per Project)

# AGENTS.md

## Project Overview
[What this project does]

## Setup Commands
```bash
npm install
npm run dev
npm test
```

## Architecture
[Brief notes]

## Testing
- Unit tests: npm test
- E2E tests: npx playwright test
- Coverage target: 80%

## Conventions
- Use TypeScript strict mode
- Error handling with try/catch
- API calls must timeout

## Known Issues
- [List if any]

## Project Tools
- Playwright for browser tests
- OpenCode for iteration
- API: https://api.linkding.com

task-brief.md (Per Task)

# Task Brief

## Context
[Why this task]

## Goal
[What needs done]

## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Constraints
- [ ] Constraint 1

## Related Files
- File 1
- File 2

Example Evaluation Log

# Evaluation Log: LinkdingSync Test Harness

## Session 1 (2026-05-09)

### Agent: OpenCode
### Task: Add Playwright tests

### Progress
- [x] Test structure created
- [x] First test implemented
- [ ] Tests converging

### Time Elapsed
30 min (of 60 estimated)

### Issues
- API calls timing out intermittently

### Decision
Continue - tests improving

---

## Session 2 (2026-05-09)

### Time Elapsed
55 min

### Progress
- [x] Tests converging
- [ ] 2 of 3 scenarios passing

### Issues
- Resolved API timeout with retry logic

### Decision
Continue - approaching completion

---

## Final Summary

### Time Actual: 75 min
### Time Estimated: 60 min
### Deviation: +25%

### Outcome
SUCCESS - All acceptance criteria met

### Lessons
- API retry logic needed upfront
- Playwright config requires specific timeout values

Integration with Chat Logs

Automatic Logging

Chat logs are automatically written to:

  • <project-root>/chatlog.md

Key Information to Capture

At task start:

  • Task brief summary
  • AGENTS.md reference
  • Estimated time

At checkpoints:

  • Current progress
  • Issues encountered
  • Decision made

At completion:

  • Time actual vs estimated
  • Lessons learned
  • Recommendations

Re-think Workflow

When re-thinking is triggered:

  1. Stop agent (if running in terminal)
  2. Review chatlog.md for session history
  3. Check tasks.md for progress notes
  4. Review AGENTS.md for missing context
  5. Document in tasks.md:
    • What went wrong
    • What's changed
    • New estimates
  6. Clarify or update the task brief
  7. Resume or restart agent
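
Step 5 is easy to skip under time pressure, so it may help to script it. The sketch below appends a dated note to a task file. The section heading `## Re-think` and the function name `log_rethink` are assumptions made for illustration; adapt them to whatever layout your tasks.md already uses.

```bash
# Append a re-think note to a task file (step 5 of the workflow).
# Assumption: the "## Re-think" heading and field labels are illustrative.
log_rethink() {
  local file="$1" wrong="$2" changed="$3" estimate="$4"
  {
    printf '\n## Re-think (%s)\n' "$(date '+%Y-%m-%d %H:%M')"
    printf '%s\n' "- What went wrong: $wrong"
    printf '%s\n' "- What's changed: $changed"
    printf '%s\n' "- New estimate: $estimate"
  } >> "$file"
}
```

A call might look like `log_rethink tasks.md "Same timeout error repeated" "Added retry requirement to brief" "30 minutes"`. Appending rather than overwriting keeps the earlier progress notes intact for the next review.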

Escalation Path

Agent Struggling → Check AGENTS.md → Update context
                 → Continue → Still stuck → Re-evaluate approach
                 → Clear approach → Time > 2x → Re-think
                                    ↓
                            Time > 3x or No Progress
                                    ↓
                            Re-think Required:
                            - New task brief
                            - Different tool
                            - New approach

Quick Reference Commands

OpenCode

# Start new task
opencode --task task-brief.md

# Stop (Ctrl+C in terminal)

Aider

# Start
aider

# Stop
Ctrl+C

Playwright

# Run tests
npx playwright test

# With specific project
npx playwright test --project=chromium

Git for Verification

# Check recent commits
git log --oneline -10

# Check what changed
git diff HEAD~5..HEAD

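# Check for stuck state (no commits in the last 30 minutes)
git log --since="30 minutes ago" --oneline

# Check for uncommitted work in progress
git status --short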