DavidSaylor aed69afdfd Initial commit: LinkSyncServer and LinkSyncExtension projects with complete documentation, models, API endpoints, tests, and extension implementation

2026-05-11 17:37:10 -05:00

9.7 KiB

Agent Evaluation Framework

This document defines how to evaluate agent performance and decide when to re-think an approach across your MyWorkspace projects.

Evaluation Criteria

Primary Metrics

Metric	Threshold or Signal	Action
Progress Rate	< 10% per 30 min	Re-evaluate approach
Same Error Pattern	> 3 failures	Investigate root cause
Test Harness	Time per iteration	Track convergence speed
File Changes	No meaningful changes	Agent stuck or unclear task
Time Elapsed	> 2x estimate	Re-think strongly advised
Time Elapsed	> 3x estimate	Re-think required
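
The progress-rate and repeated-error rows can be checked mechanically. The sketch below is one possible reading of the table, not part of any tool: the function names are hypothetical, and treating identical error strings as the "same error pattern" is an assumption.

```python
from collections import Counter


def progress_too_slow(progress_gain_pct: float, minutes: float,
                      min_gain_per_30_min: float = 10.0) -> bool:
    """True when progress falls below the table's 10%-per-30-minutes threshold."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    # Normalise the observed gain to a 30-minute window before comparing.
    gain_per_30_min = progress_gain_pct * 30.0 / minutes
    return gain_per_30_min < min_gain_per_30_min


def repeated_errors(error_messages: list[str], max_failures: int = 3) -> list[str]:
    """Return messages seen more than max_failures times (assumed: exact-match grouping)."""
    counts = Counter(error_messages)
    return [message for message, count in counts.items() if count > max_failures]
```

A session that gained 5% in 30 minutes would be flagged by `progress_too_slow(5, 30)`. In practice error messages often differ in details such as line numbers, so some normalisation before counting may be needed.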

Secondary Metrics

Context Usage: Monitor token usage in chatlog
Git Commits: Track meaningful changes
Test Pass Rate: Monitor improvement over iterations
API Call Success: For browser automation tasks

Re-think Decision Tree

┌─────────────────────────────────────┐
│   Task Running with Agent           │
└─────────────────────────────────────┘
                ↓
    ┌─────────────────────────────┐
    │ Is time > 50% of estimate?   │
    └─────────────────────────────┘
            │          │
         YES │         NO
            ↓          │
    ┌──────────────────┐
    │ Check progress    │
    │ Still on track?   │
    └──────────────────┘
            │
       YES │ NO
            ↓          ↓
   Continue    ┌──────────────┐
   checkpoint  │ Review       │
               │ blockers     │
               └──────────────┘
            ↓          │
    ┌──────────────────┐
    │ Time > 90%?      │
    └──────────────────┘
            │
       YES │ NO
            ↓          │
    ┌──────────────────┐
    │ Near completion  │
    │ Keep going       │
    └──────────────────┘
            ↓          │
   Complete      ┌──────────────┐
                 │ Time > 2x?   │
                 └──────────────┘
                         │
                    YES │ NO
                         ↓          │
                    ┌────────────────────┐
                    │ Re-evaluate        │
                    │ - Check task       │
                    │ - Review AGENTS.md │
                    │ - Adjust approach  │
                    └────────────────────┘
                         ↓
                    ┌──────────────┐
                    │ Time > 3x?   │
                    └──────────────┘
                             │
                        YES │ NO
                             ↓          │
                        ┌──────────────┐
                        │ Strong       │
                        │ Re-think     │
                        │ - Clear task │
                        │ - New brief  │
                        │ - New tool   │
                        └──────────────┘

Agent-Specific Evaluation

OpenCode Evaluation

Expected Behavior:

Reads AGENTS.md for context
Writes files directly to project
Runs tests repeatedly
Reports blockers clearly

Good Signs:

Multiple git commits per session
Test failure patterns changing
Iteration time decreasing
Clear progress indicators

Bad Signs:

Repeating same error
Only small/pointless changes
Session time increasing
Agent "thinking" with no output

Actions:

Minor stall: Wait 5-10 min
Repeated errors: Update AGENTS.md, clarify task
No progress: Pause, re-evaluate task brief

Aider Evaluation

Expected Behavior:

CLI-based, simple interactions
Works well for single-file changes
Requires model configuration

Good Signs:

Quick response times
Clean diff output
Minimal context needed

Bad Signs:

Repeated file overwrites
Model timeout errors
Large context required

Playwright Evaluation

Expected Behavior:

Test files in tests/ folder
HTML report output
Screenshot on failure

Good Signs:

Tests running successfully
Reports capturing issues
Network interception working

Bad Signs:

Browser not launching
API calls timing out
Element not found errors

Task Progress Tracking

For Each Task

Create/Update: <project-root>/tasks.md

# Task: Increase Test Coverage for LinkdingSync

## Start Time
2026-05-09 08:00

## Estimated Duration
45 minutes

## Current Progress
25% - Test structure created

## Current Blockers
None

## Next Steps
1. Implement auth test
2. Implement API call test
3. Run full suite
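
A tasks.md file in this shape is simple enough to read programmatically. The following is a minimal sketch, assuming the headings and value formats match the example above exactly ("## Start Time" as `YYYY-MM-DD HH:MM`, "## Estimated Duration" as `N minutes`); `parse_sections` and `elapsed_ratio` are hypothetical helper names.

```python
import re
from datetime import datetime


def parse_sections(markdown: str) -> dict[str, str]:
    """Map each '## Heading' in tasks.md to the text beneath it."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {name: body.strip() for name, body in sections.items()}


def elapsed_ratio(markdown: str, now: datetime) -> float:
    """Elapsed time divided by the estimate (1.0 means the estimate is used up)."""
    sections = parse_sections(markdown)
    start = datetime.strptime(sections["Start Time"], "%Y-%m-%d %H:%M")
    match = re.match(r"(\d+)\s*minutes?", sections["Estimated Duration"])
    if match is None:
        raise ValueError("Estimated Duration must look like '45 minutes'")
    estimate_min = int(match.group(1))
    elapsed_min = (now - start).total_seconds() / 60
    return elapsed_min / estimate_min
```

The returned ratio can be compared directly against the 50%, 90%, 2x, and 3x thresholds used elsewhere in this document.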

Checkpoint Questions

At 50% time:

Is the agent still making progress?
Are tests converging or regressing?
Have blockers been identified?

At 90% time:

Should be near completion
Review remaining work
Decide: continue or adjust

After 2x time:

Review AGENTS.md for missing context
Check task brief clarity
Consider tool change

After 3x time:

Strong evidence of stuck loop
Re-think required
New approach or tool needed
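
The four checkpoints above can be expressed as a small classifier. This is a sketch only: the stage labels are made up for illustration, and the choice of `>=` for the 50% and 90% checkpoints versus `>` for 2x and 3x is an assumption based on how the thresholds are worded in this document.

```python
def checkpoint_stage(elapsed_min: float, estimate_min: float) -> str:
    """Return the most recent checkpoint passed, given elapsed and estimated minutes."""
    if estimate_min <= 0:
        raise ValueError("estimate_min must be positive")
    ratio = elapsed_min / estimate_min
    if ratio > 3.0:
        return "re-think"        # after 3x: re-think required
    if ratio > 2.0:
        return "re-evaluate"     # after 2x: review AGENTS.md, task brief, tool choice
    if ratio >= 0.9:
        return "checkpoint-90"   # should be near completion; continue or adjust
    if ratio >= 0.5:
        return "checkpoint-50"   # ask the progress, convergence, and blocker questions
    return "continue"            # before the first checkpoint
```

For instance, 30 minutes into a 60-minute estimate lands on `"checkpoint-50"`, while 130 minutes lands on `"re-evaluate"`.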

Tool Evaluation

When to Switch Tools

Current Tool	Switch If...	To...
OpenCode	Simple one-off	Aider
OpenCode	Very complex refactoring	Consider re-scoping
Aider	Complex iterative task	OpenCode
Playwright	Test runner errors	Fix config, continue
Any	3x time with no progress	Re-evaluate approach

Cross-Project Patterns

Document in docs/tools.md:

What worked well
What didn't work
Tool preferences by project type
Configuration lessons learned

Documentation Requirements

AGENTS.md (Per Project)

# AGENTS.md

## Project Overview
[What this project does]

## Setup Commands
```bash
npm install
npm run dev
npm test
```

## Architecture
[Brief notes]

## Testing
- Unit tests: npm test
- E2E tests: npx playwright test
- Coverage target: 80%

## Conventions
- Use TypeScript strict mode
- Error handling with try/catch
- API calls must time out

## Known Issues
[List if any]

## Project Tools
- Playwright for browser tests
- OpenCode for iteration
- API: https://api.linkding.com

task-brief.md (Per Task)

# Task Brief

## Context
[Why this task]

## Goal
[What needs done]

## Acceptance Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Constraints
- [ ] Constraint 1

## Related Files
- File 1
- File 2

Example Evaluation Log

# Evaluation Log: LinkdingSync Test Harness

## Session 1 (2026-05-09)

### Agent: OpenCode
### Task: Add Playwright tests

### Progress
- [x] Test structure created
- [x] First test implemented
- [ ] Tests converging

### Time Elapsed
30 min (of 60 estimated)

### Issues
- API calls timing out intermittently

### Decision
Continue - tests improving

---

## Session 2 (2026-05-09)

### Time Elapsed
55 min

### Progress
- [x] Tests converging
- [ ] 2 of 3 scenarios passing

### Issues
- Resolved API timeout with retry logic

### Decision
Continue - approaching completion

---

## Final Summary

### Time Actual: 75 min
### Time Estimated: 60 min
### Deviation: +25%

### Outcome
SUCCESS - All acceptance criteria met

### Lessons
- API retry logic needed upfront
- Playwright config requires specific timeout values
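
The deviation figure in the final summary is the overrun relative to the estimate. A minimal sketch of that arithmetic, with `deviation_pct` as a hypothetical helper name:

```python
def deviation_pct(actual_min: float, estimated_min: float) -> float:
    """Signed overrun as a percentage of the estimate."""
    if estimated_min <= 0:
        raise ValueError("estimated_min must be positive")
    return (actual_min - estimated_min) / estimated_min * 100


# Figures from the log above: 75 min actual against 60 min estimated.
print(f"{deviation_pct(75, 60):+.0f}%")  # prints "+25%"
```

A negative result means the task finished under its estimate.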

Integration with Chat Logs

Automatic Logging

Chat logs are automatically written to:

<project-root>/chatlog.md

Key Information to Capture

At task start:

Task brief summary
AGENTS.md reference
Estimated time

At checkpoints:

Current progress
Issues encountered
Decision made

At completion:

Time actual vs estimated
Lessons learned
Recommendations

Re-think Workflow

When re-thinking is triggered:

Stop agent (if running in terminal)
Review chatlog.md for session history
Check tasks.md for progress notes
Review AGENTS.md for missing context
Document in tasks.md:
- What went wrong
- What's changed
- New estimates
Rewrite the task brief from scratch, or update it
Resume or restart agent
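
The documentation step in this workflow can be kept consistent with a small helper. The sketch below is illustrative only: the "## Re-think Notes" heading and both function names are assumptions, not an established convention of tasks.md.

```python
def rethink_note(went_wrong: str, changed: str, new_estimate_min: int) -> str:
    """Build a markdown section covering the three items the workflow asks for."""
    return (
        "## Re-think Notes\n"
        f"- What went wrong: {went_wrong}\n"
        f"- What's changed: {changed}\n"
        f"- New estimate: {new_estimate_min} minutes\n"
    )


def append_note(tasks_path: str, note: str) -> None:
    """Append the note to tasks.md, separated from earlier content by a blank line."""
    with open(tasks_path, "a", encoding="utf-8") as handle:
        handle.write("\n" + note)
```

Appending, instead of overwriting, keeps the earlier progress notes available when the chat log is reviewed later.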

Escalation Path

Agent Struggling → Check AGENTS.md → Update context
                 → Continue → Still stuck → Re-evaluate approach
                 → Clear approach → Time > 2x → Re-think
                                    ↓
                            Time > 3x or No Progress
                                    ↓
                            Re-think Required:
                            - New task brief
                            - Different tool
                            - New approach

Quick Reference Commands

OpenCode

# Start new task (flag as written here; confirm with `opencode --help` for your installed version)
opencode --task task-brief.md

# Stop (Ctrl+C in terminal)

Aider

# Start
aider

# Stop (Ctrl+C in terminal, or type /exit at the prompt)

Playwright

# Run tests
npx playwright test

# With specific project
npx playwright test --project=chromium

Git for Verification

# Check recent commits
git log --oneline -10

# Check what changed
git diff HEAD~5..HEAD

# Check for uncommitted changes (a clean tree with no new commits in git log suggests a stall)
git status
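
The "no new commits" signal can also be measured as idle time. The following is a rough sketch, assuming `git` is on the PATH and the directory is a repository with at least one commit; the function names and the 30-minute default are assumptions chosen to echo the 30-minute window in the primary metrics.

```python
import subprocess
import time


def minutes_since(commit_epoch: int, now_epoch: float) -> float:
    """Minutes between a commit timestamp and 'now', both in Unix seconds."""
    return (now_epoch - commit_epoch) / 60


def last_commit_epoch(repo_dir: str = ".") -> int:
    """Committer timestamp of the latest commit, via `git log -1 --format=%ct`."""
    result = subprocess.run(
        ["git", "log", "-1", "--format=%ct"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def looks_stalled(repo_dir: str = ".", max_idle_min: float = 30.0) -> bool:
    """True when the latest commit is older than max_idle_min minutes."""
    return minutes_since(last_commit_epoch(repo_dir), time.time()) > max_idle_min
```

Commit age alone does not prove a stall, since an agent may be mid-edit, so it is best read alongside `git status` and the test results.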
