Implement smarter context refresh conditional on worker availability

The current `/new` command mechanism (`ResetContextJob`) has three problems:

1. **Broken**: Sends the wrong message format (`{command: "/new"}`, but the bridge expects `{type: "input", data: "..."}`)
2. **Not smart**: Fires every hour at :08 regardless of worker state, so it can interrupt workers mid-task
3. **Zombie worker problem**: Workers that crash, error out, or forget to mark themselves `idle` stay `busy` forever

## Current State (Broken)

**File**: `/rails/app/jobs/reset_context_job.rb`
```ruby
# Sends /new at :08 every hour to ALL active agents
Agent.where(status: :active).find_each do |agent|
  ActionCable.server.broadcast(
    "agent_#{agent.agent_type}_#{agent.project_id}",
    { command: "/new" }  # ❌ WRONG FORMAT
  )
end
```

**File**: `/rails/agent-bridge.go`
- Only handles `{type: "input", data: "..."}`
- Silently ignores anything else
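
For comparison, a payload in the shape the bridge is described as accepting could be built as in the sketch below. This is plain Ruby for illustration only: the helper name `bridge_input_message` is hypothetical, and whether the bridge needs a trailing newline in `data` is not stated in this ticket, so none is added.

```ruby
require "json"

# Hypothetical helper: wraps raw terminal input in the only envelope
# the bridge is described as handling: {type: "input", data: "..."}.
def bridge_input_message(text)
  { type: "input", data: text }
end

broken  = { command: "/new" }            # what ResetContextJob sends today
correct = bridge_input_message("/new")   # what the bridge would act on

puts JSON.generate(broken)   # {"command":"/new"}
puts JSON.generate(correct)  # {"type":"input","data":"/new"}
```

The ticket suggests `AgentCommander.send_message` already produces this envelope, so in practice the fix is likely to route through that service rather than build the hash by hand.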

## Proposed Solution

### Option A: Fix + Smart Filtering + Timeout (Keep Scheduled Refresh)
- Fix message format to use `AgentCommander.send_message`
- Only send to `idle` workers OR workers with stale activity
- Add timeout: auto-reset `busy` → `idle` after X minutes of inactivity
- Keep schedule at :08 (within quiet period :00-:10)

### Option B: Remove Scheduled Refresh, Add Heartbeat (Event-Driven)
- Workers already check for work via `list_tasks` MCP tool
- Add heartbeat/last_activity_at tracking
- Auto-idle workers with stale heartbeat
- Delete `ResetContextJob`

### Option C: Conditional Refresh by Orchestrator + Zombie Detection
- Orchestrator decides when to refresh context
- Only sends refresh when:
  - Worker has been idle for X minutes
  - Worker's last ticket was completed/failed
  - **Worker is `busy` but stale (zombie detection)**
- Remove hardcoded schedule
- Add zombie worker cleanup

### Option D: Agent-Side Context Management + Auto-Idle
- Workers manage their own context via MCP tools
- Add `clear_context` or `reset_session` MCP tool
- **Auto-idle timeout**: workers reset to `idle` if no activity for X minutes
- No external triggering needed

## Recommended Approach: Option C (Orchestrator-Driven Conditional + Zombie Detection)

**Why**: Orchestrator already knows:
- Which workers are idle vs busy
- What tickets are available
- Which workers haven't had work in a while
- **Which workers are stale/zombies (busy but no recent activity)**

**Implementation**:
1. Remove `ResetContextJob` and its `recurring.yml` entry
2. Add logic to `OrchestratorPingJob` or a new service:
   - Check for idle workers with stale `last_activity_at` → safe to refresh
   - **Check for `busy` workers with stale activity → zombie, force to `idle`**
   - Only send a refresh if the worker is idle AND there are no tasks in the queue
   - Send via the proper `AgentCommander.send_message` format
3. Add heartbeat tracking:
   - Update `last_activity_at` on every MCP call
   - Update `last_activity_at` on ticket state transition
4. Add MCP tool `refresh_worker_context(worker_id)` for explicit control
5. Optional: Let workers request refresh via MCP tool
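
The decision in step 2 can be sketched as a pure function, independent of Rails. Everything here is an assumption for illustration: the `Worker` struct stands in for the `Agent` record, `refresh_action` is a hypothetical name, the 30-minute default comes from the zombie threshold proposed in this ticket, and treating a worker with no recorded activity as stale is a choice the ticket does not make.

```ruby
# Stand-in for the Agent record; field names follow the ticket.
Worker = Struct.new(:id, :availability_status, :last_activity_at, keyword_init: true)

STALE_AFTER_SECONDS = 30 * 60 # assumed default threshold (30 minutes)

# Returns the action the orchestrator would take for one worker:
#   :force_idle - busy but stale (zombie), reset to idle
#   :refresh    - idle, stale, and nothing queued, so a refresh is safe
#   :none       - leave the worker alone
def refresh_action(worker, now:, queue_empty:, stale_after: STALE_AFTER_SECONDS)
  # Assumption: a worker with no recorded activity counts as stale.
  stale = worker.last_activity_at.nil? ||
          (now - worker.last_activity_at) > stale_after

  return :none unless stale
  return :force_idle if worker.availability_status == :busy
  return :refresh if worker.availability_status == :idle && queue_empty

  :none
end

now    = Time.now
zombie = Worker.new(id: 1, availability_status: :busy, last_activity_at: now - 45 * 60)
rested = Worker.new(id: 2, availability_status: :idle, last_activity_at: now - 45 * 60)
active = Worker.new(id: 3, availability_status: :busy, last_activity_at: now - 60)

p refresh_action(zombie, now: now, queue_empty: true)   # => :force_idle
p refresh_action(rested, now: now, queue_empty: true)   # => :refresh
p refresh_action(rested, now: now, queue_empty: false)  # => :none
p refresh_action(active, now: now, queue_empty: true)   # => :none
```

Keeping the decision separate from the side effects (updating the record, sending the message) makes the conditional logic straightforward to cover in tests.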

## Zombie Worker Detection Strategy

**Define "stale"**: Worker marked `busy` but no activity for X minutes (configurable, default: 30)

**Detection logic**:
```ruby
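# Configurable threshold; 30 minutes is the proposed default.
stale_minutes = 30

# Note: rows with a NULL last_activity_at do not match this comparison.
stale_workers = Agent.where(status: :active, availability_status: :busy)
  .where("last_activity_at < ?", stale_minutes.minutes.ago)

# find_each loads records in batches instead of all at once.
stale_workers.find_each do |zombie|
  # Log only when the reset was actually persisted.
  if zombie.update(availability_status: :idle)
    Rails.logger.warn "[ZombieWorker] Force-idle agent #{zombie.id} - no activity for #{stale_minutes} minutes"
    # Optionally: send context refresh to clean up
  end
end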
```

**Heartbeat sources**:
- MCP tool calls
- Ticket transitions (start_work, complete, fail_audit, etc.)
- Agent messages via WebSocket
- Terminal activity (via agent-bridge.go)
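
One way to funnel these sources into a single timestamp is sketched below. It is an in-memory illustration, not the proposed implementation: `HeartbeatLog`, its `touch` method, and the source symbols are hypothetical names, and in the real system the timestamp would live in the `last_activity_at` column.

```ruby
# In-memory stand-in for the last_activity_at column, keyed by agent id.
class HeartbeatLog
  # Hypothetical labels for the four heartbeat sources listed above.
  SOURCES = %i[mcp_call ticket_transition websocket_message terminal_activity].freeze

  def initialize
    @last_activity_at = {}
  end

  # Records activity for an agent. Unknown sources are rejected so a typo
  # cannot silently keep a worker looking alive.
  def touch(agent_id, source:, at: Time.now)
    raise ArgumentError, "unknown heartbeat source: #{source}" unless SOURCES.include?(source)

    previous = @last_activity_at[agent_id]
    # Never move the timestamp backwards if events arrive out of order.
    @last_activity_at[agent_id] = at if previous.nil? || at > previous
    @last_activity_at[agent_id]
  end

  def last_activity_at(agent_id)
    @last_activity_at[agent_id]
  end
end

log = HeartbeatLog.new
log.touch(42, source: :mcp_call)
log.touch(42, source: :ticket_transition)
puts log.last_activity_at(42)
```

Routing every source through one entry point keeps the staleness check dependent on a single field, whichever channel the worker last used.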

## Acceptance Criteria
- [ ] `ResetContextJob` is removed (or fixed if Option A)
- [ ] Scheduled "/new" no longer fires blindly
- [ ] Context refresh only happens when worker is `idle` OR stale/zombie
- [ ] **Zombie workers are auto-reset to `idle` after X minutes of inactivity**
- [ ] Refresh respects:
  - Worker not currently working on a ticket
  - Worker has been idle for X minutes (configurable)
  - No urgent work pending
  - **Worker is stale (last_activity_at older than threshold)**
- [ ] Message format uses proper `AgentCommander.send_message`
- [ ] MCP tool `refresh_worker_context` exists for manual control
- [ ] `last_activity_at` is updated on all relevant worker activities
- [ ] Tests cover conditional logic and zombie detection
- [ ] Quiet period (:00-:10) still respected

## Technical Notes

**Agent availability_status**:
- `idle` (0) - available for work
- `busy` (1) - actively working

**Zombie detection**:
- `busy` + stale `last_activity_at` → force to `idle`
- Configurable timeout (default: 30 minutes)
- Log warnings for zombie detection

**Current schedule coordination**:
- :00-:10: Quiet period (no orchestrator pings)
- :08: ResetContextJob fires (TO BE REMOVED/FIXED)
- :10+: OrchestratorPingJob every 3 minutes
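
If the refresh logic moves into the orchestrator, a guard for the quiet period could look like the sketch below. The helper name `quiet_period?` is hypothetical, and the boundary is an assumption: since pings resume at :10, minute 10 is treated as outside the quiet period.

```ruby
# Minutes :00 through :09 of every hour (assumed half-open interval).
QUIET_MINUTES = (0...10).freeze

# True when the given time falls inside the hourly quiet period.
def quiet_period?(time)
  QUIET_MINUTES.cover?(time.min)
end

p quiet_period?(Time.new(2024, 1, 1, 9, 8, 0))   # => true
p quiet_period?(Time.new(2024, 1, 1, 9, 10, 0))  # => false
p quiet_period?(Time.new(2024, 1, 1, 9, 45, 0))  # => false
```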

**Files to modify**:
- `/rails/config/recurring.yml` - remove the `ResetContextJob` schedule entry (or adjust it if Option A)
- `/rails/app/jobs/reset_context_job.rb` - delete or refactor
- `/rails/app/jobs/orchestrator_ping_job.rb` - add conditional logic + zombie detection
- `/rails/app/services/agent_commander.rb` - ensure proper format used
- `/rails/app/models/agent.rb` - add `stale` scope, `stale?` predicate, `mark_idle_if_stale!` method
- `/rails/agent-bridge.go` - optionally add heartbeat updates