Initial commit: LinkSyncServer and LinkSyncExtension projects with complete documentation, models, API endpoints, tests, and extension implementation
docs/agent-evaluation-framework.md
# Agent Evaluation Framework

This document defines how to evaluate agent performance and decide when to re-think an approach across your MyWorkspace projects.

## Evaluation Criteria

### Primary Metrics

| Metric | Threshold or Signal | Action |
|--------|---------------------|--------|
| **Progress Rate** | < 10% per 30 min | Re-evaluate approach |
| **Same Error Pattern** | > 3 failures | Investigate root cause |
| **Test Harness** | Time per iteration | Track convergence speed |
| **File Changes** | No meaningful changes | Check whether the agent is stuck or the task is unclear |
| **Time Elapsed** | > 2x estimate | Re-think strongly advised |
| **Time Elapsed** | > 3x estimate | Re-think required |
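
The thresholds in the table can be checked mechanically. The following is a minimal Python sketch, not part of the framework itself: the `Snapshot` fields and the function name are hypothetical, and it assumes the caller already measures progress over the most recent 30-minute window.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical container for one evaluation checkpoint."""
    elapsed_min: float           # minutes since the task started
    estimate_min: float          # original time estimate in minutes
    progress_last_30_min: float  # percentage points gained in the last 30 min
    same_error_failures: int     # failures showing the same error pattern


def primary_metric_actions(s: Snapshot) -> List[str]:
    """Return the actions from the Primary Metrics table that currently apply."""
    actions = []
    if s.progress_last_30_min < 10:
        actions.append("Re-evaluate approach")
    if s.same_error_failures > 3:
        actions.append("Investigate root cause")
    ratio = s.elapsed_min / s.estimate_min
    if ratio > 3:
        actions.append("Re-think required")
    elif ratio > 2:
        actions.append("Re-think strongly advised")
    return actions


# Example: 130 minutes into a 60-minute task, little progress, 4 repeated failures.
print(primary_metric_actions(Snapshot(130, 60, 5, 4)))
```

The "Test Harness" and "File Changes" rows are left out because the table gives no numeric threshold for them.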

### Secondary Metrics

- **Context Usage**: Monitor token usage in the chat log
- **Git Commits**: Track meaningful changes
- **Test Pass Rate**: Monitor improvement over iterations
- **API Call Success**: Track the success rate of API calls (for browser automation tasks)
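
For the "Git Commits" metric, one way to get a number is to count commits in a recent time window. This is a sketch under assumptions: the helper names and the 30-minute default are illustrative, and it only counts commits, so judging whether they are "meaningful" still needs a human look at the diff.

```python
import subprocess


def count_commits(log_output: str) -> int:
    """Count commits in `git log --oneline` output (one commit per non-empty line)."""
    return sum(1 for line in log_output.splitlines() if line.strip())


def recent_commit_count(repo_dir: str, minutes: int = 30) -> int:
    """Run `git log` in repo_dir and count commits made in the last `minutes` minutes."""
    result = subprocess.run(
        ["git", "log", "--oneline", f"--since={minutes} minutes ago"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return count_commits(result.stdout)
```

A call such as `recent_commit_count(".", 30)` returning `0` lines up with the "No meaningful changes" signal in the primary metrics.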

## Re-think Decision Tree

```
┌─────────────────────────────────────┐
│       Task Running with Agent       │
└─────────────────────────────────────┘
                   ↓
    ┌────────────────────────────┐
    │ Is time > 50% of estimate? │
    └────────────────────────────┘
           │                │
          YES               │ NO
           ↓                │
  ┌──────────────────┐
  │ Check progress   │
  │ Still on track?  │
  └──────────────────┘
           │
     YES   │       NO
      ↓            ↓
  Continue   ┌──────────────┐
  checkpoint │ Review       │
             │ blockers     │
             └──────────────┘
      ↓             │
  ┌──────────────────┐
  │ Time > 90%?      │
  └──────────────────┘
           │
     YES   │            NO
      ↓                 │
  ┌──────────────────┐  │
  │ Near completion  │  │
  │ Keep going       │  │
  └──────────────────┘  │
      ↓                 │
  Complete       ┌──────────────┐
                 │ Time > 2x?   │
                 └──────────────┘
                        │
                  YES   │        NO
                   ↓             │
         ┌────────────────────┐
         │ Re-evaluate        │
         │ - Check task       │
         │ - Review AGENTS.md │
         │ - Adjust approach  │
         └────────────────────┘
                   ↓
            ┌──────────────┐
            │ Time > 3x?   │
            └──────────────┘
                   │
             YES   │      NO
              ↓           │
       ┌──────────────┐
       │ Strong       │
       │ Re-think     │
       │ - Clear task │
       │ - New brief  │
       │ - New tool   │
       └──────────────┘
```
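
The tree can also be read as a sequence of time-ratio checks. The sketch below is one possible linearization, not the authoritative logic: the function name is hypothetical, and the "Keep running" result for tasks under 50% of the estimate is an assumption, since the tree does not label that branch.

```python
def checkpoint_decision(elapsed_min: float, estimate_min: float, on_track: bool) -> str:
    """Map elapsed time and a progress judgement to a decision from the tree."""
    ratio = elapsed_min / estimate_min
    if ratio > 3:
        return "Strong re-think: clear task, new brief, new tool"
    if ratio > 2:
        return "Re-evaluate: check task, review AGENTS.md, adjust approach"
    if ratio > 0.9:
        return "Near completion: keep going"
    if ratio > 0.5:
        return "Continue to next checkpoint" if on_track else "Review blockers"
    # Assumption: below 50% of the estimate no check is needed yet.
    return "Keep running"


# Example: 35 minutes into a 60-minute task, and the agent is not on track.
print(checkpoint_decision(35, 60, on_track=False))
```

Checking the largest multiples first keeps the conditions from overlapping, so a task at 3.5x the estimate is never reported as merely "near completion".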

## Agent-Specific Evaluation

### OpenCode Evaluation

**Expected Behavior:**
- Reads AGENTS.md for context
- Writes files directly to project
- Runs tests repeatedly
- Reports blockers clearly

**Good Signs:**
- Multiple git commits per session
- Test failures changing between iterations (not the same error repeating)
- Iteration time decreasing
- Clear progress indicators

**Bad Signs:**
- Repeating same error
- Only trivial or pointless changes
- Iteration time increasing
- Agent "thinking" with no output

**Actions:**
- **Minor stall**: Wait 5-10 min
- **Repeated errors**: Update AGENTS.md, clarify task
- **No progress**: Pause, re-evaluate task brief
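
The "Repeating same error" sign, together with the "> 3 failures" threshold from the primary metrics, can be detected by grouping error messages by a normalized signature. This is an illustrative sketch only: the normalization rule (lowercasing and masking numbers) is an assumption about what counts as the "same" error, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def normalize_error(message: str) -> str:
    """Mask hex ids and numbers so the same error with different values compares equal."""
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "N", message.strip().lower())


def repeated_errors(messages: Iterable[str], threshold: int = 3) -> Dict[str, int]:
    """Return error signatures that occurred more than `threshold` times."""
    counts = Counter(normalize_error(m) for m in messages)
    return {sig: n for sig, n in counts.items() if n > threshold}


errors = [
    "Timeout after 5000ms",
    "timeout after 3000ms",
    "Timeout after 100ms",
    "Timeout after 7ms",
    "Element not found",
]
print(repeated_errors(errors))
```

Any signature returned here is a candidate for the "Repeated errors" action: update AGENTS.md and clarify the task.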

### Aider Evaluation

**Expected Behavior:**
- CLI-based, simple interactions
- Works well for single-file changes
- Requires model configuration

**Good Signs:**
- Quick response times
- Clean diff output
- Minimal context needed

**Bad Signs:**
- Repeated file overwrites
- Model timeout errors
- Large context required

### Playwright Evaluation

**Expected Behavior:**
- Test files in `tests/` folder
- HTML report output
- Screenshot on failure

**Good Signs:**
- Tests running successfully
- Reports capturing issues
- Network interception working

**Bad Signs:**
- Browser not launching
- API calls timing out
- Element not found errors

## Task Progress Tracking

### For Each Task

Create/Update: `<project-root>/tasks.md`

```markdown
# Task: Increase Test Coverage for LinkdingSync

## Start Time
2026-05-09 08:00

## Estimated Duration
45 minutes

## Current Progress
25% - Test structure created

## Current Blockers
None

## Next Steps
1. Implement auth test
2. Implement API call test
3. Run full suite
```
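
Because `tasks.md` records the start time and the estimate, the elapsed-time ratio used by the checkpoints can be derived from the file. The following is a rough sketch, assuming the exact heading names and the `YYYY-MM-DD HH:MM` / `N minutes` formats shown in the template; the function names are hypothetical.

```python
import re
from datetime import datetime
from typing import Dict


def parse_tasks_md(text: str) -> Dict[str, str]:
    """Map each '## ' heading to the first non-empty line beneath it."""
    fields: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
        elif current and line.strip() and current not in fields:
            fields[current] = line.strip()
    return fields


def elapsed_ratio(text: str, now: datetime) -> float:
    """Return elapsed time divided by the estimated duration."""
    fields = parse_tasks_md(text)
    start = datetime.strptime(fields["Start Time"], "%Y-%m-%d %H:%M")
    estimate_min = int(re.match(r"\d+", fields["Estimated Duration"]).group())
    return (now - start).total_seconds() / 60 / estimate_min


SAMPLE = """# Task: Increase Test Coverage for LinkdingSync

## Start Time
2026-05-09 08:00

## Estimated Duration
45 minutes
"""

# 90 minutes elapsed against a 45-minute estimate.
print(elapsed_ratio(SAMPLE, datetime(2026, 5, 9, 9, 30)))  # 2.0
```

A ratio above 2.0 or 3.0 corresponds to the "After 2x time" and "After 3x time" checkpoints.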

### Checkpoint Questions

**At 50% time:**
1. Is the agent still making progress?
2. Are tests converging or regressing?
3. Have blockers been identified?

**At 90% time:**
1. Is the task near completion?
2. What work remains?
3. Should the agent continue, or does the approach need adjusting?

**After 2x time:**
1. Review AGENTS.md for missing context
2. Check task brief clarity
3. Consider tool change

**After 3x time:**
1. Treat this as strong evidence of a stuck loop
2. Re-think required
3. New approach or tool needed

## Tool Evaluation

### When to Switch Tools

| Current Tool | Switch If... | To... |
|--------------|--------------|-------|
| OpenCode | Task is a simple one-off change | Aider |
| OpenCode | Task is a very complex refactoring | Consider re-scoping |
| Aider | Task is complex and iterative | OpenCode |
| Playwright | Test runner errors | Fix config, continue |
| Any | 3x time with no progress | Re-evaluate approach |

### Cross-Project Patterns

**Document in `docs/tools.md`:**
- What worked well
- What didn't work
- Tool preferences by project type
- Configuration lessons learned

## Documentation Requirements

### AGENTS.md (Per Project)

````markdown
# AGENTS.md

## Project Overview
[What this project does]

## Setup Commands
```bash
npm install
npm run dev
npm test
```

## Architecture
[Brief notes]

## Testing
- Unit tests: `npm test`
- E2E tests: `npx playwright test`
- Coverage target: 80%

## Conventions
- Use TypeScript strict mode
- Error handling with try/catch
- API calls must set a timeout

## Known Issues
- [List if any]

## Project Tools
- Playwright for browser tests
- OpenCode for iteration
- API: `https://api.linkding.com`
````

### task-brief.md (Per Task)

```markdown
# Task Brief

## Context
[Why this task]

## Goal
[What needs to be done]

## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Constraints
- [ ] Constraint 1

## Related Files
- File 1
- File 2
```

## Example Evaluation Log

```markdown
# Evaluation Log: LinkdingSync Test Harness

## Session 1 (2026-05-09)

### Agent: OpenCode
### Task: Add Playwright tests

### Progress
- [x] Test structure created
- [x] First test implemented
- [ ] Tests converging

### Time Elapsed
30 min (of 60 estimated)

### Issues
- API calls timing out intermittently

### Decision
Continue - tests improving

---

## Session 2 (2026-05-09)

### Time Elapsed
55 min (of 60 estimated)

### Progress
- [x] Tests converging
- [ ] All scenarios passing (2 of 3 so far)

### Issues
- Resolved API timeout with retry logic

### Decision
Continue - approaching completion

---

## Final Summary

### Time Actual: 75 min
### Time Estimated: 60 min
### Deviation: +25%

### Outcome
SUCCESS - All acceptance criteria met

### Lessons
- API retry logic needed upfront
- Playwright config requires specific timeout values
```
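
The "Deviation" figure in the summary is the overrun relative to the estimate: (75 - 60) / 60 = +25%. A small sketch of that arithmetic, with hypothetical function names:

```python
def deviation_percent(actual_min: float, estimated_min: float) -> float:
    """Overrun (positive) or underrun (negative) as a percentage of the estimate."""
    return (actual_min - estimated_min) / estimated_min * 100


def format_deviation(actual_min: float, estimated_min: float) -> str:
    """Format the deviation with an explicit sign, as in the log's Final Summary."""
    return f"{deviation_percent(actual_min, estimated_min):+.0f}%"


print(format_deviation(75, 60))  # +25%
```

At 75 minutes the task ran at 1.25x its estimate, well under the 2x threshold, which is consistent with the "Continue" decisions recorded in both sessions.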

## Integration with Chat Logs

### Automatic Logging

Chat logs are automatically written to:
- `<project-root>/chatlog.md`

### Key Information to Capture

**At task start:**
- Task brief summary
- AGENTS.md reference
- Estimated time

**At checkpoints:**
- Current progress
- Issues encountered
- Decision made

**At completion:**
- Time actual vs estimated
- Lessons learned
- Recommendations

## Re-think Workflow

When re-thinking is triggered:

1. **Stop agent** (if running in terminal)
2. **Review chatlog.md** for session history
3. **Check tasks.md** for progress notes
4. **Review AGENTS.md** for missing context
5. **Document in tasks.md**:
   - What went wrong
   - What's changed
   - New estimates
6. **Clarify or update the task brief**
7. **Resume or restart** agent

## Escalation Path

```
Agent Struggling → Check AGENTS.md → Update context
                 → Continue → Still stuck → Re-evaluate approach
                 → Clear approach → Time > 2x → Re-think
                                       ↓
                           Time > 3x or No Progress
                                       ↓
                              Re-think Required:
                              - New task brief
                              - Different tool
                              - New approach
```

## Quick Reference Commands

### OpenCode
```bash
# Start new task (flag as written in the source; confirm with `opencode --help`)
opencode --task task-brief.md

# Stop (Ctrl+C in terminal)
```

### Aider
```bash
# Start
aider

# Stop: press Ctrl+C in the terminal
```

### Playwright
```bash
# Run tests
npx playwright test

# With specific project
npx playwright test --project=chromium
```

### Git for Verification
```bash
# Check recent commits
git log --oneline -10

# Check what changed
git diff HEAD~5..HEAD

# Check for stuck state: how long ago was the last commit?
git log -1 --format=%cr

# Check for uncommitted changes
git status
```