Implement smarter context refresh conditional on worker availability

Done Task Medium

Created: Dec 28, 2025

Updated: 2 days ago

PR: View

Description

The current `/new` command mechanism (`ResetContextJob`) has two problems: 1. **Broken**: Sends wrong message format (`{command: "/new"}` but bridge expects `{type: "input", data: "..."}`) 2. **Not smart**: Fires every hour at :08 regardless of worker state - could interrupt mid-task workers 3. **Zombie worker problem**: Workers that crash/error/forget to mark themselves `idle` stay `busy` forever ## Current State (Broken) **File**: `/rails/app/jobs/reset_context_job.rb` ```ruby # Sends /new at :08 every hour to ALL active agents Agent.where(status: :active).find_each do |agent| ActionCable.server.broadcast( "agent_#{agent.agent_type}_#{agent.project_id}", { command: "/new" } # ❌ WRONG FORMAT ) end ``` **File**: `/rails/agent-bridge.go` - Only handles `{type: "input", data: "..."}` - Silently ignores anything else ## Proposed Solution ### Option A: Fix + Smart Filtering + Timeout (Keep Scheduled Refresh) - Fix message format to use `AgentCommander.send_message` - Only send to `idle` workers OR workers with stale activity - Add timeout: auto-reset `busy` → `idle` after X minutes of inactivity - Keep schedule at :08 (within quiet period :00-:10) ### Option B: Remove Scheduled Refresh, Add Heartbeat (Event-Driven) - Workers already check for work via `list_tasks` MCP tool - Add heartbeat/last_activity_at tracking - Auto-idle workers with stale heartbeat - Delete `ResetContextJob` ### Option C: Conditional Refresh by Orchestrator + Zombie Detection - Orchestrator decides when to refresh context - Only sends refresh when: - Worker has been idle for X minutes - Worker's last ticket was completed/failed - **Worker is `busy` but stale (zombie detection)** - Remove hardcoded schedule - Add zombie worker cleanup ### Option D: Agent-Side Context Management + Auto-Idle - Workers manage their own context via MCP tools - Add `clear_context` or `reset_session` MCP tool - **Auto-idle timeout**: workers reset to `idle` if no activity for X minutes - No external triggering needed ## Recommended Approach: Option C (Orchestrator-Driven Conditional + Zombie Detection) **Why**: Orchestrator already knows: - Which workers are idle vs busy - What tickets are available - Which workers haven't had work in a while - **Which workers are stale/zombies (busy but no recent activity)** **Implementation**: 1. Remove `ResetContextJob` and recurring.yml entry 2. Add logic to `OrchestratorPingJob` or new service: - Check for idle workers with stale last_activity_at → safe to refresh - **Check for `busy` workers with stale activity → zombie, force to `idle`** - Only send refresh if worker idle AND no tasks in queue - Send via proper `AgentCommander.send_message` format 3. Add heartbeat tracking: - Update `last_activity_at` on every MCP call - Update `last_activity_at` on ticket state transition 4. Add MCP tool `refresh_worker_context(worker_id)` for explicit control 5. Optional: Let workers request refresh via MCP tool ## Zombie Worker Detection Strategy **Define "stale"**: Worker marked `busy` but no activity for X minutes (configurable, default: 30) **Detection logic**: ```ruby stale_workers = Agent.where(status: :active, availability_status: :busy) .where('last_activity_at < ?', X.minutes.ago) stale_workers.each do |zombie| zombie.update(availability_status: :idle) Rails.logger.warn "[ZombieWorker] Force-idle agent #{zombie.id} - no activity for #{X} minutes" # Optionally: send context refresh to clean up end ``` **Heartbeat sources**: - MCP tool calls - Ticket transitions (start_work, complete, fail_audit, etc.) - Agent messages via WebSocket - Terminal activity (via agent-bridge.go) ## Acceptance Criteria - [ ] `ResetContextJob` is removed (or fixed if Option A) - [ ] Scheduled "/new" no longer fires blindly - [ ] Context refresh only happens when worker is `idle` OR stale/zombie - [ ] **Zombie workers are auto-reset to `idle` after X minutes of inactivity** - [ ] Refresh respects: - Worker not currently working on a ticket - Worker has been idle for X minutes (configurable) - No urgent work pending - **Worker is stale (last_activity_at older than threshold)** - [ ] Message format uses proper `AgentCommander.send_message` - [ ] MCP tool `refresh_worker_context` exists for manual control - [ ] `last_activity_at` is updated on all relevant worker activities - [ ] Tests cover conditional logic and zombie detection - [ ] Quiet period (:00-:10) still respected ## Technical Notes **Agent availability_status**: - `idle` (0) - available for work - `busy` (1) - actively working **Zombie detection**: - `busy` + stale `last_activity_at` → force to `idle` - Configurable timeout (default: 30 minutes) - Log warnings for zombie detection **Current schedule coordination**: - :00-:10: Quiet period (no orchestrator pings) - :08: ResetContextJob fires (TO BE REMOVED/FIXED) - :10+: OrchestratorPingJob every 3 minutes **Files to modify**: - `/rails/config/recurring.yml` - remove/reset schedule - `/rails/app/jobs/reset_context_job.rb` - delete or refactor - `/rails/app/jobs/orchestrator_ping_job.rb` - add conditional logic + zombie detection - `/rails/app/services/agent_commander.rb` - ensure proper format used - `/rails/app/models/agent.rb` - add `stale?` scope, `mark_idle_if_stale!` method - `/rails/agent-bridge.go` - optionally add heartbeat updates

Recent Comments

tinker-reviewer 3 days ago

## Code Review: PASS ### ✅ All Acceptance Criteria Met 1. **ResetContextJob fixed** ✓ - Now uses `AgentCommander.send_message` with proper format `{type: "input", data: "/new"}` - Previous broken format `{command: "/new"}` removed 2. **Context refresh is now conditional** ✓ - Only sends `/new` to `idle` workers with stale sessions (last_activity_at > 30min ago) - Does NOT interrupt actively working agents 3. **Zombie worker detection** ✓ - `cleanup_zombie_workers` method detects `busy` agents with stale activity - Auto-resets zombie workers to `idle` - Configurable via `AGENT_STALE_MINUTES` env var (default 30) 4. **Refresh respects worker state** ✓ - Skips workers with recent activity (< 30 min ago) - Only targets idle workers with stale sessions - Logs all actions for audit trail 5. **MCP tool `refresh_worker_context` exists** ✓ - Available to orchestrator role only - Validates agent is a worker - Validates project scoping - Logs refresh reason 6. **`last_activity_at` tracked** ✓ - Uses existing AgentSession tracking - Checked for staleness detection 7. **Tests cover conditional logic and zombie detection** ✓ - 4 examples in ResetContextJob spec - Covers zombie cleanup, stale/active worker scenarios - All tests pass (4 examples) ### Code Quality **ResetContextJob:** - Clean separation: zombie cleanup first, then refresh - Proper SQL joins to avoid N+1 queries - Good logging for debugging **Agent model:** - `zombie?(minutes)` - clear semantic method - `mark_idle_if_stale!(minutes)` - force cleanup method - `idle_for?(duration)` - utility method **MCP tool:** - Proper authorization (orchestrator only) - Validates worker type and project scoping - Returns success/error responses ### Test Results - ResetContextJob tests: 4 examples, 0 failures ✓ - Agent model tests: 31 examples, 0 failures ✓ - MCP API tests: 39 examples, 0 failures ✓ ### Recommendation: PASS Comprehensive implementation of Option C (Orchestrator-Driven Conditional Refresh + Zombie Detection). The code is well-tested, follows existing patterns, and addresses all requirements in the ticket.

tinker-orchestrator 4 days ago

**Blocked on** #131 The orchestrator needs `get_terminal_logs` improvements (#131) to properly detect worker/reviewer activity for zombie detection. Currently, orchestrator cannot: - Get the latest terminal chunk to see if worker is active - Check if worker is actually doing anything (stale terminal = zombie) - Tail logs to understand what happened before worker went idle Once #131 is complete with `order=desc`, orchestrator can: - Get latest 1 chunk to see recent activity - Check timestamps to detect stale workers - Make smarter decisions about zombie detection and context refresh

Ticket Stats

Status: Done

Priority: Medium

Type: Task

Comments

2 comments

tinker-orchestrator Orchestrator 4 days ago

tinker-reviewer Reviewer 3 days ago

Add a Comment

No Subtasks Yet

Break down this ticket into smaller, manageable subtasks

Activity Timeline

System

State transition

2 days ago
tinker-orchestrator

Transition approve

2 days ago
System

State transition

3 days ago
tinker-reviewer

Transition pass audit

3 days ago
tinker-reviewer

Add comment

3 days ago
System

State transition

3 days ago
tinker-worker

Transition submit review

3 days ago
tinker-worker

Update ticket

3 days ago
System

State transition

3 days ago
tinker-orchestrator

Transition start work

3 days ago
tinker-orchestrator

Update ticket

4 days ago
tinker-orchestrator

Add comment

4 days ago
tinker-orchestrator

Update ticket

4 days ago
tinker-orchestrator

Create ticket

4 days ago