The current `/new` command mechanism (`ResetContextJob`) has three problems:
1. **Broken**: Sends wrong message format (`{command: "/new"}` but bridge expects `{type: "input", data: "..."}`)
2. **Not smart**: Fires every hour at :08 regardless of worker state - could interrupt mid-task workers
3. **Zombie worker problem**: Workers that crash/error/forget to mark themselves `idle` stay `busy` forever
## Current State (Broken)
**File**: `/rails/app/jobs/reset_context_job.rb`
```ruby
# Sends /new at :08 every hour to ALL active agents
Agent.where(status: :active).find_each do |agent|
  ActionCable.server.broadcast(
    "agent_#{agent.agent_type}_#{agent.project_id}",
    { command: "/new" } # ❌ WRONG FORMAT
  )
end
```
**File**: `/rails/agent-bridge.go`
- Only handles `{type: "input", data: "..."}`
- Silently ignores anything else
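
To make the format mismatch concrete, here is a minimal plain-Ruby sketch. The helper names `bridge_input_payload` and `bridge_accepts?` are hypothetical, and `bridge_accepts?` is only a stand-in for the bridge behaviour described above, not the actual Go code. Whether the bridge needs a trailing newline in `data` is not stated here, so the sketch leaves it out.

```ruby
require "json"

# Hypothetical helper: wraps terminal input in the envelope the bridge handles.
def bridge_input_payload(text)
  { type: "input", data: text }
end

# Stand-in for the bridge rule described above: only messages with
# type == "input" and a data field are handled; everything else is ignored.
def bridge_accepts?(payload)
  parsed = JSON.parse(JSON.generate(payload))
  parsed["type"] == "input" && parsed.key?("data")
end

old_payload = { command: "/new" }            # what ResetContextJob sends today
new_payload = bridge_input_payload("/new")   # the shape the bridge expects

puts "old payload handled: #{bridge_accepts?(old_payload)}"
puts "new payload handled: #{bridge_accepts?(new_payload)}"
```

In the real application the payload would presumably be built inside `AgentCommander.send_message` rather than in the job itself.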
## Proposed Solution
### Option A: Fix + Smart Filtering + Timeout (Keep Scheduled Refresh)
- Fix message format to use `AgentCommander.send_message`
- Only send to `idle` workers OR workers with stale activity
- Add timeout: auto-reset `busy` → `idle` after X minutes of inactivity
- Keep schedule at :08 (within quiet period :00-:10)
### Option B: Remove Scheduled Refresh, Add Heartbeat (Event-Driven)
- Workers already check for work via `list_tasks` MCP tool
- Add heartbeat/last_activity_at tracking
- Auto-idle workers with stale heartbeat
- Delete `ResetContextJob`
### Option C: Conditional Refresh by Orchestrator + Zombie Detection
- Orchestrator decides when to refresh context
- Only sends refresh when:
- Worker has been idle for X minutes
- Worker's last ticket was completed/failed
- **Worker is `busy` but stale (zombie detection)**
- Remove hardcoded schedule
- Add zombie worker cleanup
### Option D: Agent-Side Context Management + Auto-Idle
- Workers manage their own context via MCP tools
- Add `clear_context` or `reset_session` MCP tool
- **Auto-idle timeout**: workers reset to `idle` if no activity for X minutes
- No external triggering needed
## Recommended Approach: Option C (Orchestrator-Driven Conditional + Zombie Detection)
**Why**: Orchestrator already knows:
- Which workers are idle vs busy
- What tickets are available
- Which workers haven't had work in a while
- **Which workers are stale/zombies (busy but no recent activity)**
**Implementation**:
1. Remove `ResetContextJob` and recurring.yml entry
2. Add logic to `OrchestratorPingJob` or new service:
- Check for idle workers with stale last_activity_at → safe to refresh
- **Check for `busy` workers with stale activity → zombie, force to `idle`**
- Only send refresh if worker idle AND no tasks in queue
- Send via proper `AgentCommander.send_message` format
3. Add heartbeat tracking:
- Update `last_activity_at` on every MCP call
- Update `last_activity_at` on ticket state transition
4. Add MCP tool `refresh_worker_context(worker_id)` for explicit control
5. Optional: Let workers request refresh via MCP tool
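
The decision in step 2 can be sketched as a pure function, independent of ActiveRecord. This is a sketch under assumptions: `Worker` is a plain `Struct` standing in for the `Agent` model, `refresh_decision` and `STALE_AFTER_SECONDS` are hypothetical names, and treating a missing `last_activity_at` as stale is a guess that may not suit newly created workers.

```ruby
# Plain-Ruby stand-in for the Agent record; field names follow this document.
Worker = Struct.new(:id, :availability_status, :last_activity_at, keyword_init: true)

STALE_AFTER_SECONDS = 30 * 60 # assumed default of 30 minutes

# Returns :force_idle, :refresh, or :skip for a single worker.
def refresh_decision(worker, queue_empty:, now: Time.now)
  # Assumption: a worker that has never reported activity counts as stale.
  stale = worker.last_activity_at.nil? ||
          (now - worker.last_activity_at) > STALE_AFTER_SECONDS
  return :skip unless stale

  # Busy but stale: treat as a zombie and force it back to idle.
  return :force_idle if worker.availability_status == :busy

  # Idle and stale: refresh only when no tasks are waiting in the queue.
  queue_empty ? :refresh : :skip
end
```

Keeping the rule free of database calls makes the conditional logic straightforward to unit test; `OrchestratorPingJob` (or a new service) would load the workers and act on the returned symbol.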
## Zombie Worker Detection Strategy
**Define "stale"**: Worker marked `busy` but no activity for X minutes (configurable, default: 30)
**Detection logic**:
```ruby
# Threshold is configurable; 30 minutes is the proposed default.
stale_minutes = 30

stale_workers = Agent.where(status: :active, availability_status: :busy)
                     .where("last_activity_at < ?", stale_minutes.minutes.ago)

stale_workers.find_each do |zombie|
  zombie.update(availability_status: :idle)
  Rails.logger.warn "[ZombieWorker] Force-idle agent #{zombie.id} - no activity for #{stale_minutes} minutes"
  # Optionally: send context refresh to clean up
end
```
**Heartbeat sources**:
- MCP tool calls
- Ticket transitions (start_work, complete, fail_audit, etc.)
- Agent messages via WebSocket
- Terminal activity (via agent-bridge.go)
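
One possible shape for funnelling all four heartbeat sources through a single entry point is sketched below. Everything here is an assumption: `record_heartbeat`, `HEARTBEAT_SOURCES`, and the `WorkerState` struct are hypothetical names, and a real implementation would update the `Agent` row rather than an in-memory struct.

```ruby
# Symbols for the heartbeat sources listed above (names are illustrative).
HEARTBEAT_SOURCES = %i[mcp_call ticket_transition websocket_message terminal_activity].freeze

# In-memory stand-in for the Agent record.
WorkerState = Struct.new(:id, :last_activity_at, keyword_init: true)

# Records activity for a worker, rejecting unknown sources.
def record_heartbeat(worker, source:, now: Time.now)
  unless HEARTBEAT_SOURCES.include?(source)
    raise ArgumentError, "unknown heartbeat source: #{source}"
  end

  # Never move the timestamp backwards if events arrive out of order.
  if worker.last_activity_at.nil? || now > worker.last_activity_at
    worker.last_activity_at = now
  end
  worker
end
```

A single entry point keeps the list of sources explicit, so adding the optional `agent-bridge.go` heartbeat later would only mean calling the same method from a new place.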
## Acceptance Criteria
- [ ] `ResetContextJob` is removed (or fixed if Option A)
- [ ] Scheduled "/new" no longer fires blindly
- [ ] Context refresh only happens when worker is `idle` OR stale/zombie
- [ ] **Zombie workers are auto-reset to `idle` after X minutes of inactivity**
- [ ] Refresh respects:
- Worker not currently working on a ticket
- Worker has been idle for X minutes (configurable)
- No urgent work pending
- **Worker is stale (last_activity_at older than threshold)**
- [ ] Message format uses proper `AgentCommander.send_message`
- [ ] MCP tool `refresh_worker_context` exists for manual control
- [ ] `last_activity_at` is updated on all relevant worker activities
- [ ] Tests cover conditional logic and zombie detection
- [ ] Quiet period (:00-:10) still respected
## Technical Notes
**Agent availability_status**:
- `idle` (0) - available for work
- `busy` (1) - actively working
**Zombie detection**:
- `busy` + stale `last_activity_at` → force to `idle`
- Configurable timeout (default: 30 minutes)
- Log warnings for zombie detection
**Current schedule coordination**:
- :00-:10: Quiet period (no orchestrator pings)
- :08: ResetContextJob fires (TO BE REMOVED/FIXED)
- :10+: OrchestratorPingJob every 3 minutes
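
For the "quiet period still respected" criterion, a small guard like the following could be checked before sending any refresh. This is a sketch: `quiet_period?` and `QUIET_MINUTES` are hypothetical names, and treating :10 itself as outside the quiet period is inferred from the ":10+" line above.

```ruby
# Minutes :00 through :09 of every hour; :10 is when orchestrator pings resume.
QUIET_MINUTES = (0...10).freeze

# True when the given time falls inside the hourly quiet period.
def quiet_period?(time = Time.now)
  QUIET_MINUTES.cover?(time.min)
end

puts quiet_period?(Time.utc(2024, 1, 1, 12, 8))  # inside the quiet period
puts quiet_period?(Time.utc(2024, 1, 1, 12, 10)) # pings resume at :10
```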
**Files to modify**:
- `/rails/config/recurring.yml` - remove/reset schedule
- `/rails/app/jobs/reset_context_job.rb` - delete or refactor
- `/rails/app/jobs/orchestrator_ping_job.rb` - add conditional logic + zombie detection
- `/rails/app/services/agent_commander.rb` - ensure proper format used
- `/rails/app/models/agent.rb` - add `stale` scope, `stale?` predicate, `mark_idle_if_stale!` method
- `/rails/agent-bridge.go` - optionally add heartbeat updates