# Autonomous Health Monitor & Permission Audit
**Date:** 2026-04-27
**Duration:** ~45 minutes
**Repos touched:** autonomous-health (new), trading-agent, knowledgeBase, auto-shorts, centralDiscord, pm-interview-practice
## Context & Motivation
User noticed the trading agent’s research sub-agent was “silently failing” and “mentioning needing webfetch permissions in the void” with no way to respond. The core question: how do we surface permission blocks and failures across ALL autonomous processes (cron runners, Discord bot dispatches, server routes) so they can be handled?
## Decisions Made
### 1. Root Cause: Missing `--dangerously-skip-permissions` in subprocess calls
- **Decision:** Add `--dangerously-skip-permissions` to every Claude CLI invocation that runs headlessly (cron, subprocess, server route)
- **Alternatives considered:** (a) Using `--allowedTools` to whitelist specific tools (more restrictive but still needs the bypass flag for non-interactive contexts), (b) Leaving text-only calls without the flag (decided against for consistency)
- **Rationale:** Any Claude CLI call running without a TTY will silently block on permission prompts. Even text-only calls could unexpectedly try to use a tool.
- **Trade-offs:** `--dangerously-skip-permissions` gives full tool access. Mitigated by the fact these are all internal, trusted contexts.
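
A minimal sketch of what a headless call with the flag could look like from Python. The helper names (`build_claude_command`, `run_headless`) and the timeout value are hypothetical and not taken from any of the fixed files; the flag spellings are as I understand the CLI and should be checked against `claude --help`.

```python
import subprocess

SKIP_FLAG = "--dangerously-skip-permissions"


def build_claude_command(prompt, allowed_tools=None):
    """Build an argv list for a non-interactive Claude CLI call (hypothetical helper)."""
    cmd = ["claude", "--print", prompt, SKIP_FLAG]
    if allowed_tools:
        # Optional whitelist; per the decision above, the bypass flag stays.
        # Assumption: the CLI accepts a comma-separated tool list here.
        cmd += ["--allowedTools", ",".join(allowed_tools)]
    return cmd


def run_headless(prompt, timeout_s=600):
    """Run the CLI without a TTY and fail loudly instead of hanging."""
    result = subprocess.run(
        build_claude_command(prompt),
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,  # nothing can answer a prompt anyway
        timeout=timeout_s,         # raises subprocess.TimeoutExpired on a stall
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"claude exited {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

The timeout is a second line of defence: if a call is ever added without the flag, it fails with an exception rather than blocking silently.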
### 2. New standalone repo for health monitoring
- **Decision:** Create `autonomous-health` as its own private repo rather than adding to an existing repo
- **Alternatives considered:** (a) Adding to `scripts/` repo, (b) Adding to `autonomousDev`
- **Rationale:** User explicitly requested its own repo. It monitors all other repos, so it shouldn’t live inside any one of them.
### 3. Schedule-aware staleness detection
- **Decision:** Config includes `active_hours_utc` and `active_days` per runner; the staleness check skips runners that are outside their active window
- **Rationale:** Without this, overnight-only runners (autonomousDev, fix-checker) would always alert during daytime.
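
A rough sketch of the window check, under assumptions: `active_hours_utc` is taken to be a `[start_hour, end_hour)` pair, `active_days` a list of weekday integers (Monday = 0), and `interval_minutes` is a hypothetical field name for the expected run interval. The actual config schema may differ.

```python
from datetime import datetime, timedelta, timezone


def in_active_window(runner, now):
    """True if `now` (UTC) falls inside the runner's configured window."""
    days = runner.get("active_days")        # assumed: weekday ints, Monday == 0
    if days is not None and now.weekday() not in days:
        return False
    hours = runner.get("active_hours_utc")  # assumed: [start_hour, end_hour)
    if hours is None:
        return True
    start, end = hours
    if start <= end:
        return start <= now.hour < end
    # Window wraps past midnight, e.g. [22, 6] for an overnight runner.
    return now.hour >= start or now.hour < end


def is_stale(runner, last_run, now, factor=3):
    """Alert only when overdue by `factor` x the interval, inside the window."""
    if not in_active_window(runner, now):
        return False
    limit = timedelta(minutes=runner["interval_minutes"] * factor)
    return now - last_run > limit
```

One known gap in this sketch: right after a window opens, the last run may legitimately be many hours old, so a fuller version would probably measure overdue time from the window start rather than from the last run.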
## What Was Built / Changed
### New: autonomous-health repo
Created `~/repos/autonomous-health/` (GitHub: npezarro/autonomous-health, private).
**5 health checks:**
1. **NDJSON log parsing**: Scans the latest run log per runner for permission keywords, AskUserQuestion calls, and tool errors
2. **Schedule-aware staleness**: Alerts only when a runner is overdue during its active window (threshold: 3x the expected interval)
3. **Failure rate**: Flags runners with more than 3 failures in the last 20 log lines
4. **VM PM2 health**: Connects to pezant-vm over SSH and checks for errored/stopped processes and elevated restart counts (threshold: 20)
5. **Missing permission flags**: Scans all repos for Claude subprocess calls lacking `--dangerously-skip-permissions`
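
As an illustration of the first check, a minimal NDJSON scan could look like the sketch below. The keyword list, the `"is_error": true` marker, and the function name `scan_ndjson` are assumptions for the example; the real log schema and keywords live in the monitor script.

```python
import json

# Assumed keyword list; the monitor's real list may differ.
PERMISSION_KEYWORDS = ("permission", "not allowed", "requires approval")


def scan_ndjson(lines):
    """Return (line_number, reason) findings for one run log."""
    findings = []
    for number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate partial lines from a run that is still writing
        # Re-serialise so nested fields are searched too.
        text = json.dumps(event).lower()
        if "askuserquestion" in text:
            findings.append((number, "AskUserQuestion call"))
        elif any(keyword in text for keyword in PERMISSION_KEYWORDS):
            findings.append((number, "permission keyword"))
        elif '"is_error": true' in text:  # assumed error marker
            findings.append((number, "tool error"))
    return findings
```

Plain substring matching is deliberately crude: it will over-report (any mention of "permission" counts), which seems acceptable for a monitor whose job is to surface things for a human to look at.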
**Infrastructure:**
- Cron: every 15 minutes (`*/15 * * * *`)
- Discord: `#autonomous-health` channel (1498417022856466653) in logs category
- Behavior: posts only on alert, silent when healthy
- First live alert detected finance-tracker with 48 restarts
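
For reference, the PM2 check behind that first alert can be sketched roughly as follows. It assumes `pm2 jlist` returns a JSON array whose entries carry `name`, `pm2_env.status`, and `pm2_env.restart_time`, and that "elevated" means strictly more than the threshold; the function names are hypothetical.

```python
import json
import subprocess

RESTART_THRESHOLD = 20
BAD_STATUSES = {"errored", "stopped"}


def fetch_jlist(host="pezant-vm"):
    """Fetch the PM2 process list over SSH as raw JSON text."""
    result = subprocess.run(
        ["ssh", host, "pm2", "jlist"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return result.stdout


def check_pm2(jlist_output, restart_threshold=RESTART_THRESHOLD):
    """Return alert strings for unhealthy or frequently restarting processes."""
    alerts = []
    for proc in json.loads(jlist_output):
        env = proc.get("pm2_env", {})
        name = proc.get("name", "?")
        status = env.get("status")
        restarts = env.get("restart_time", 0)
        if status in BAD_STATUSES:
            alerts.append(f"{name}: status {status}")
        if restarts > restart_threshold:
            alerts.append(f"{name}: {restarts} restarts")
    return alerts
```

An empty list means nothing to post, which matches the alert-only behavior above.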
### Permission flag fixes (6 files across 5 repos)
| Repo | File | Commit | Issue |
|------|------|--------|-------|
| trading-agent | `collector/researcher.py` | ce3745f | Research sub-agent couldn’t use WebSearch/WebFetch |
| knowledgeBase | `scripts/promote.sh` | 413a045 | Wiki promotion couldn’t use tools |
| auto-shorts | `lib/shorts-routes.js` | ef77842 | Suggestions route ran headless |
| centralDiscord | `src/bot/errorMonitor.js` | b2fedd4 | Error monitor ran headless (text-only but for consistency) |
| pm-interview-practice | `lib/claude.js` | 40f1eed | Interview server route ran headless |
### Full audit results (confirmed OK)
These were verified to already have the flag or not need it:
- All `autonomousDev*` runners (have the flag)
- `centralDiscord/executor.js` (defaults to bypassPermissions)
- `centralDiscord/claudeReply.js`, `parallelTeam.js` (delegate to executor)
- `claudeNet/claudenet-worker.js` (has the flag)
- `assortedLLMTasks/job_pipeline/discover_custom.py` and `generate.py` (have the flag)
- `student-transcript/route.ts` (text-only Haiku, no tools)
- `claude-bakeoff` scripts (use `--print`, read-only)
- `browser-agent/agent-server.js` (HTTP relay, no direct Claude calls)
## Architecture & Design
```
            autonomous-health (cron */15)
                        |
       +----------------+----------------+
       |                |                |
 [Log Scanner]    [PM2 Checker]    [Flag Scanner]
       |                |                |
  Parse NDJSON      SSH to VM      grep repos for
  for permission    pm2 jlist      missing flag
  keywords
       |                |                |
       +----------------+----------------+
                        |
          Discord #autonomous-health
          (alert only, silent on OK)
```
Config-driven: `config.json` defines all runners (paths, schedules, active hours) and VM processes. Adding a new runner = adding a JSON entry.
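
To make "adding a JSON entry" concrete, here is a purely illustrative config shape with a small loader. Only `active_hours_utc` and `active_days` are field names from this session; every other key, the runner name, the path, and the values are placeholders and not the real `config.json` schema.

```python
import json

# Hypothetical example entry; not copied from the real config.json.
EXAMPLE_CONFIG = """
{
  "runners": [
    {
      "name": "example-runner",
      "log_dir": "~/repos/example-runner/logs",
      "interval_minutes": 60,
      "active_hours_utc": [6, 14],
      "active_days": [0, 1, 2, 3, 4, 5, 6]
    }
  ],
  "vm_processes": ["finance-tracker"]
}
"""

# Assumed minimum set of keys a runner entry needs.
REQUIRED_RUNNER_KEYS = {"name", "log_dir", "interval_minutes"}


def load_config(text):
    """Parse the config and reject runner entries with missing keys."""
    config = json.loads(text)
    for runner in config.get("runners", []):
        missing = REQUIRED_RUNNER_KEYS - runner.keys()
        if missing:
            label = runner.get("name", "?")
            raise ValueError(f"runner {label} is missing {sorted(missing)}")
    return config
```

Validating at load time means a malformed entry fails the run immediately instead of silently dropping a runner from monitoring.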
## Learnings Captured
1. **`--dangerously-skip-permissions` is required for ALL headless Claude CLI calls**, including text-only ones (for consistency). The main runner script having the flag doesn’t protect subprocess calls made within the session.
2. **The autonomous-health monitor’s `check_permission_flags` function** will catch new scripts that get added without the flag going forward.
3. Memory file created: `project_autonomous_health.md`
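
The idea behind the flag check can be sketched in Python as below. This is not the actual `check_permission_flags` implementation: the call-detection regex, the five-line lookahead, and the scanned file suffixes are all assumptions chosen for illustration.

```python
import re
from pathlib import Path

FLAG = "--dangerously-skip-permissions"
# Assumed heuristic: a quoted "claude" argv element, or `claude` followed by an option.
CALL_PATTERN = re.compile(r"""["']claude["']|\bclaude\s+-""")
SCANNED_SUFFIXES = {".py", ".js", ".ts", ".sh"}


def find_missing_flags(text):
    """Return line numbers that look like Claude CLI calls without the flag nearby."""
    lines = text.splitlines()
    missing = []
    for index, line in enumerate(lines):
        if not CALL_PATTERN.search(line):
            continue
        # Argument lists often span several lines, so look a little ahead.
        window = "\n".join(lines[index:index + 5])
        if FLAG not in window:
            missing.append(index + 1)
    return missing


def scan_repo(root):
    """Map file path -> suspicious line numbers for one repo checkout."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in SCANNED_SUFFIXES:
            hits = find_missing_flags(path.read_text(errors="ignore"))
            if hits:
                results[str(path)] = hits
    return results
```

A heuristic like this will produce false positives (for example, scripts that intentionally omit the flag), so its output is a list of candidates to review rather than a verdict.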
## Open Items & Follow-ups
- **finance-tracker has 48 restarts**: First real alert from the health monitor. Needs investigation.
- **centralDiscord deploy**: The errorMonitor.js fix was pushed to GitHub but needs `git pull` + `pm2 restart` on the VM to take effect.
- **knowledgeBase promote.sh**: Pushed to branch `claude/learnings-350`, needs merge to main.
- **Suppress known-OK false positives**: The permission flag scanner may flag scripts that intentionally don’t have the flag. Consider adding an allowlist to config.json.
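
One possible shape for the allowlist follow-up, assuming the scanner yields repo-relative paths and the allowlist holds glob patterns. The function name and the `claude-bakeoff/run.sh` path in the usage note are hypothetical.

```python
from fnmatch import fnmatch


def apply_allowlist(findings, allowlist):
    """Drop findings whose path matches any allowlisted glob pattern."""
    return [
        path for path in findings
        if not any(fnmatch(path, pattern) for pattern in allowlist)
    ]
```

With an allowlist such as `["student-transcript/route.ts", "claude-bakeoff/*"]`, a finding for `claude-bakeoff/run.sh` would be suppressed while `trading-agent/collector/researcher.py` would still be reported.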
## Key Files
- `~/repos/autonomous-health/run.sh`: Main monitor script
- `~/repos/autonomous-health/config.json`: All runner and PM2 process definitions
- `~/repos/trading-agent/collector/researcher.py`: The original broken file (now fixed)