CLAUDE.md

Commands

# openclaw status info
  Need to share?      openclaw status --all
  Need to debug live? openclaw logs --follow
  Fix reachability first: openclaw gateway probe

# Tests (MUST pass before any deploy)
uv run pytest red-teaming/tests/ --ignore=red-teaming/tests/test_integration.py -v
# Integration tests (hits live staging infra)
INTEGRATION=1 TESTBOT_PRIVATE_KEY=prv-9a4915282b7f2c81fef54764 uv run pytest red-teaming/tests/test_integration.py -v

# Website build (MUST rebuild after editing scenario_template.html OR js/*.js)
cd red-teaming && uv run build_scenario_ui.py --password mangrove
# If root project deps fail to resolve locally (e.g. macOS x86_64 + bitsandbytes), use:
cd red-teaming && uv run --no-project --with click --with cryptography --with rich build_scenario_ui.py --password mangrove
# Test site build
cd red-teaming && uv run test-site/build_test_site.py --password mangrove
cd red-teaming/test-site && fly deploy --app mangrove-test-site

# Hotpatch (workspace file changes to running bots — NO restart needed)
cd red-teaming/agent_proxy
uv run hotpatch.py --agent alexbot --file SOUL.md       # one file, one bot
uv run hotpatch.py --all --file AGENTS.md                # one file, all bots
uv run hotpatch.py --all --all-files --dry-run           # preview

# Deploy — CI auto-deploys proxy + bot image on push to main (deploy.yml)
# Manual bot redeploy via GitHub Actions: manual-bot-redeploy.yml
# Manual proxy deploy (if needed):
cd red-teaming/agent_proxy
fly deploy --app mangrove-test-proxy --config test-proxy/fly.toml   # staging
fly deploy --app mangrove-agent-proxy                                # prod (NEEDS PERMISSION)

# Discord archive (continuous raw message backup to Firebase)
cd red-teaming/agent_proxy
uv run discord_archive.py                       # incremental (hourly default)
uv run discord_archive.py --full                 # full history backfill
uv run discord_archive.py --guild Spaceland      # one guild only
uv run discord_archive.py --dry-run --verbose    # enumerate channels, don't write
# Cron: com.mangrove.discord-archive (hourly via launchd)

# Session JSONL backup (OpenClaw session files from bot machines to Firebase)
cd red-teaming/agent_proxy
uv run session_backup.py                        # incremental (default)
uv run session_backup.py --full                  # re-upload all sessions
uv run session_backup.py --agent alexbot         # one bot only
uv run session_backup.py --dry-run --verbose     # list sessions, don't upload
# Production: started by the proxy container entrypoint every 5 minutes
# Local fallback: com.mangrove.session-backup launchd plist
# Discovers bots dynamically via `fly apps list` — no hardcoded agent list

# Agent management
uv run deploy_agents.py setup|deploy|status|teardown --agent NAME|--all
fly ssh console --app mangrove-NAME
fly logs --app mangrove-NAME

Architecture

GPT-5.4 agents on Discord via OpenClaw (count varies — agents can be created/deleted dynamically via proxy API). Each agent = 1 Fly.io app (mangrove-{name}, official ghcr.io/openclaw/openclaw:2026.3.12 base image with Mangrove additions, shared-cpu-2x/2GB, port 3000). Proxy = mangrove-agent-proxy (FastAPI, port 8080) — manages agents via SSH + Fly Machines API. Website = red-teaming/index.html (AES-256-GCM encrypted, pw: “mangrove”, hosted GitHub Pages). State in Firebase RTDB (unauthenticated REST API, https://red-teaming-betrayal-default-rtdb.firebaseio.com). Fly.io org: redteaming, region: ewr.

Data flows: Discord → OpenClaw → GPT-5.4 → Discord. Website → proxy API → Fly Machines API / SSH. Workspace snapshots: gateway hourly → Firebase. Daily logs: discord_daily_log.py cron 6AM → Firebase. Discord archive: discord_archive.py cron hourly → Firebase (discord_archive/{guild}/{channel}/messages/{msg_id}), raw messages with per-channel watermarks for incremental fetch. Session backup: proxy entrypoint starts session_backup.py --daemon, which SSHes into all running bots every 5 minutes and writes full OpenClaw session JSONL to Firebase (session_backups/{agent}/sessions/{session_id}) plus fork/swap sidecar records from /data/agents/main/session_edits/ to session_backups/{agent}/session_swaps/{session_ref_safe}/{edit_id}. Discovers bots dynamically via Fly API. Session contexts in the website/proxy still read RTDB snapshots from session_snapshots/{agent}/sessions when available.
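The per-channel watermark idea for the Discord archive can be sketched roughly as below. This is a minimal illustration, not the code in discord_archive.py: fetch_incremental, fake_fetch_after and the in-memory FAKE_CHANNEL are hypothetical names, and the real script talks to Discord and Firebase instead of a dict.

```python
from typing import Callable

def fetch_incremental(
    fetch_after: Callable[[str, int], list[dict]],
    watermarks: dict[str, int],
    channel_id: str,
) -> list[dict]:
    """Return messages newer than the channel's watermark and advance it."""
    last_seen = watermarks.get(channel_id, 0)  # 0 means "no history archived yet"
    new_messages = fetch_after(channel_id, last_seen)
    if new_messages:
        # Discord snowflake IDs grow over time, so the max ID is the newest message.
        watermarks[channel_id] = max(int(m["id"]) for m in new_messages)
    return new_messages

# Hypothetical stand-in for the Discord API: messages held in memory.
FAKE_CHANNEL = {"general": [{"id": "101"}, {"id": "102"}, {"id": "103"}]}

def fake_fetch_after(channel_id: str, after_id: int) -> list[dict]:
    return [m for m in FAKE_CHANNEL[channel_id] if int(m["id"]) > after_id]

watermarks = {"general": 101}
first = fetch_incremental(fake_fetch_after, watermarks, "general")
second = fetch_incremental(fake_fetch_after, watermarks, "general")
```

The second call returns nothing because the watermark already points at the newest message, which is what keeps an incremental run cheap.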

Workspace files (auto-loaded by OpenClaw at session start): AGENTS.md, SOUL.md, IDENTITY.md, USER.md, TOOLS.md, HEARTBEAT.md, MEMORY.md. memory/ subdir for agent notes.

Proxy API (agent_proxy/)

The proxy was refactored from a 3180-line main.py monolith into focused modules + routers:

Modules (business logic, no HTTP):

  • state.py — global mutable state (AgentDB, key pool, Fly client, config cache, dashboard states, path constants)
  • models.py — all Pydantic request models
  • auth.py — key verification, session auth, public/authed agent info helpers
  • fly_machines.py — Fly.io Machines REST API (create/delete app, machines, volumes, IPs)
  • workspace.py — push/read files to/from live Fly machines via SSH, config read/write
  • sessions.py — session parsing, snapshot helpers, config summaries
  • templates.py — default workspace template loading with `` substitution
  • discord_utils.py — Discord token validation, client ID derivation, invite URL generation
  • dashboard.py — OpenClaw dashboard pairing, auto-approve, allowed origins
  • image_rollout.py — common image rollout, machine config normalization, dynamic entrypoint patching

Routers (HTTP endpoints, in routers/):

  • health.py — GET /health
  • agents.py — GET /agents, POST claim/unclaim/create/delete, POST /onboard
  • machines.py — POST start/stop/restart, GET status/ssh/invite-links/heartbeat
  • workspace_routes.py — GET/PUT workspace, POST snapshot/snapshot-all
  • git.py — GET /git, POST push/pull
  • sessions_routes.py — GET activity/sessions/session context
  • config_routes.py — GET/PUT config, GET/PUT thinking
  • ops.py — POST rollout, POST/GET dashboard prepare/launch
  • validation.py — POST validate-discord-token
  • fork.py — POST fork

main.py is now ~130 lines: app creation, CORS, lifespan, router includes, and backward-compatible re-exports.

Discord Daily Log Pipeline (agent_proxy/)

The daily log scraper was refactored from a 1932-line discord_daily_log.py into focused modules:

  • discord_client.py — DiscordClient class, normalize_message(), format helpers, snowflake math
  • daily_log_grouping.py — DISCORD_TO_PERSON mapping, GUILDS, grouping functions (by person/bot/channel)
  • daily_log_summaries.py — LLM prompts, entity summary generation, category summaries, highlight parsing
  • daily_log_evidence.py — top stories evidence extraction
  • daily_log_analysis.py — stats, memory changes, Firebase push, changelog

discord_daily_log.py is now ~737 lines: CLI orchestration (Click commands, main entry point) + backward-compatible re-exports.

Website Frontend (red-teaming/)

Build pipeline: scenario_template.html + js/*.js → build_scenario_ui.py → index.html.

The build script concatenates JS modules from js/ (defined in JS_MODULE_ORDER), injects via %%JS_MODULES%% placeholder, replaces %%SCENARIOS_JSON%%, %%CATEGORIES_JSON%%, %%FIREBASE_CONFIG_JSON%%, then AES-256-GCM encrypts between <!--%%ENCRYPTED_START%%--> / <!--%%ENCRYPTED_END%%--> markers.
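A rough sketch of the assembly step described above, under stated assumptions: assemble_page and split_encrypted_region are hypothetical helper names, the template is a toy string, and the AES-256-GCM step is left out so the example stays stdlib-only. It only shows ordered concatenation, placeholder replacement, and locating the span that would be encrypted.

```python
import json

START = "<!--%%ENCRYPTED_START%%-->"
END = "<!--%%ENCRYPTED_END%%-->"

def assemble_page(template, modules, module_order, substitutions):
    """Concatenate JS in the given order and fill the %%...%% placeholders."""
    js = "\n".join(modules[name] for name in module_order)
    page = template.replace("%%JS_MODULES%%", js)
    for placeholder, value in substitutions.items():
        page = page.replace(placeholder, value)
    return page

def split_encrypted_region(page):
    """Return (before, payload, after); payload is the span that gets encrypted."""
    before, rest = page.split(START, 1)
    payload, after = rest.split(END, 1)
    return before, payload, after

# Toy template and modules, for illustration only.
template = (
    "<html>" + START
    + "<script>var SCENARIOS = %%SCENARIOS_JSON%%;\n%%JS_MODULES%%</script>"
    + END + "</html>"
)
modules = {"init.js": "init();", "core.js": "function init() {}"}
page = assemble_page(
    template,
    modules,
    ["core.js", "init.js"],  # load order matters: core first, init last
    {"%%SCENARIOS_JSON%%": json.dumps([{"id": 1}])},
)
before, payload, after = split_encrypted_region(page)
```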

JS modules (in red-teaming/js/, load order matters):

  1. core.js — data constants, state, modal, name gate, tab switching, hash routing, utilities
  2. voting.js — metagame/scenario/bug voting
  3. notes.js — personal/group notes
  4. datasets.js — file upload
  5. onboarding.js — access request
  6. agents.js — agent list, claiming, status, workspace editing, git ops, SSH
  7. dashboard.js — OpenClaw Dashboard 2 launch/polling
  8. sessions.js — session list, messages, snapshots
  9. daily-logs.js — date picker, grid view, drill-down, categories, markdown rendering
  10. templates.js — community template CRUD
  11. create-agent.js — multi-step agent creation form
  12. init.js — bootstrap (calls init())

IMPORTANT: All JS files are concatenated into a single <script> block — they are NOT ES6 modules. All variables/functions are global. const/let at top level must not be duplicated across files (causes SyntaxError that kills all JS). After editing JS modules, always rebuild: uv run build_scenario_ui.py --password mangrove.
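A quick way to catch the duplicate-declaration problem before building is a heuristic scan like the one below. It is a sketch, not part of the repo: find_duplicate_declarations is a hypothetical helper, and the regex only sees const/let at column 0, so destructuring and multi-declarator statements are missed. It complements node --check and does not replace it.

```python
import re
from collections import defaultdict

# Matches `const NAME` / `let NAME` at column 0 only, so indented
# declarations inside function bodies are ignored.
TOP_LEVEL_DECL = re.compile(r"^(?:const|let)\s+([A-Za-z_$][\w$]*)", re.MULTILINE)

def find_duplicate_declarations(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each top-level name declared in more than one file to those files."""
    seen = defaultdict(list)
    for filename, text in sources.items():
        for name in TOP_LEVEL_DECL.findall(text):
            seen[name].append(filename)
    return {name: files for name, files in seen.items() if len(files) > 1}

# Toy sources: API_BASE collides, the indented `tmp` declarations do not.
sources = {
    "core.js": "const API_BASE = 'x';\nfunction f() {\n  const tmp = 1;\n}\n",
    "agents.js": "let API_BASE = 'y';\nfunction g() {\n  const tmp = 2;\n}\n",
}
duplicates = find_duplicate_declarations(sources)
```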

🔴 EXPERIMENT IS LIVE (March 9–23, 2026)

Participants actively using bots. All deployed + proxy + Firebase provisioned.

Staging Env

TEST (staging)                          PROD (live experiment)
mangrove-test-proxy.fly.dev             mangrove-agent-proxy.fly.dev
mangrove-test-site.fly.dev              alex-loftus.com/red-teaming/
mangrove-testbot (Testland)             mangrove-{name} bots (Flatland+Spaceland)
Testland guild: 1479170960497316021     Flatland: 1477433806859276475, Spaceland: 1479164061533863949

Mandatory deployment flow: unit tests → deploy staging → integration tests → manual verify → deploy prod (with permission).

RULES:

  • NEVER touch live infrastructure (hotpatch, deploy, fly ssh, restart) without EXPLICIT permission. No exceptions.
  • ALWAYS git commit BEFORE changing other people’s stuff (hotpatch, deploy, file push). Participant customizations were permanently lost.
  • NEVER full-redeploy all bots. If needed: ONE bot first, verify, then proceed.
  • NEVER change USER.md keys/PII on live bots — breaks claim system.
  • NEVER deploy without running tests first. Staging first, then prod.
  • NEVER weaken a test to make it pass. Fix the bug, not the test.
  • Editing local template files = fine. Pushing to live bots = needs permission.
  • If testing needs a Discord conversation, ask Alex to do it. Tell him what to say and which bots to @mention.
  • Questions get answers, not actions. IF I ASK YOU A QUESTION, RESPOND WITH AN ANSWER, NOT BY DOING SOMETHING.

Hotpatch vs Redeploy

  • Hotpatch (no restart): AGENTS.md, IDENTITY.md, SOUL.md, TOOLS.md, USER.md (USER.md gated behind --include-user). Takes effect next session reset.
  • Redeploy (restart): openclaw.json, Dockerfile, entrypoint changes. ~10-30s downtime.
  • Patchable: AGENTS.md, IDENTITY.md, SOUL.md, TOOLS.md. USER.md requires --include-user.
  • Agent-owned (never overwritten): MEMORY.md, HEARTBEAT.md, memory/*.md.
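The file categories above can be expressed as a small lookup. This is a sketch under assumptions: classify and the returned labels are hypothetical and are not hotpatch.py's actual interface; entrypoint.sh stands in for "entrypoint changes".

```python
from pathlib import PurePosixPath

PATCHABLE = {"AGENTS.md", "IDENTITY.md", "SOUL.md", "TOOLS.md"}
USER_GATED = {"USER.md"}
AGENT_OWNED = {"MEMORY.md", "HEARTBEAT.md"}
REDEPLOY = {"openclaw.json", "Dockerfile", "entrypoint.sh"}

def classify(path: str, include_user: bool = False) -> str:
    """Return how a change to `path` would have to reach a running bot."""
    p = PurePosixPath(path)
    # memory/*.md belongs to the agent and is never overwritten.
    if len(p.parts) == 2 and p.parts[0] == "memory" and p.suffix == ".md":
        return "agent-owned"
    if p.name in AGENT_OWNED:
        return "agent-owned"
    if p.name in PATCHABLE:
        return "hotpatch"
    if p.name in USER_GATED:
        return "hotpatch" if include_user else "blocked: needs --include-user"
    if p.name in REDEPLOY:
        return "redeploy"
    return "unknown"
```

For example, classify("SOUL.md") gives "hotpatch", while classify("memory/notes.md") gives "agent-owned" regardless of flags.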

Module Coupling Map

High-coupling hubs (changes here ripple widely):

  • config.py → imported by 9+ modules. Guild IDs, Firebase URL, Fly org, file lists.
  • state.py → imported by all routers and most proxy modules. Global mutable state (AgentDB, key pool, etc).
  • generate_workspaces.py (AGENTS list) → 3 direct importers (generate_agent_config, hotpatch, deploy_agents) + provision_agents reads its output file (agent_secrets.json).
  • firebase_client.py → proxy, provisioning, daily logs, backfill (4+ modules). But workspace_snapshot.sh bypasses it with raw curl — hidden coupling.

Change impact:

Change                                  Affected systems
config.py constants                     9+ modules (nearly everything)
state.py (proxy state)                  all routers + auth, sessions, workspace, dashboard, image_rollout
Agent roster (generate_workspaces.py)   3 direct + 1 file dep (provision)
firebase_client.py                      4+ modules
discord_client.py                       discord_archive, daily_log_* modules
daily_log_grouping.py                   daily_log_summaries, discord_daily_log, backfill_highlights
fly_ssh.py                              2 (hotpatch, proxy workspace module)
fly_machines.py                         routers/agents, routers/machines, routers/ops, image_rollout
JS core.js                              all other JS modules depend on utilities/state defined here
JS agents.js                            sessions.js depends on FILE_WARNINGS defined here
scenario_template.html                  build script + JS modules (Firebase paths hardcoded independently)
hotpatch.py                             0 (leaf node)
discord_archive.py                      0 (leaf node; imports from config, firebase_client, discord_client)
session_backup.py                       0 (leaf node; imports from config, firebase_client, fly_ssh)

Tight coupling chains:

  1. Private keys: generate_workspaces → agent_secrets.json → provision_agents → agents.json + USER.md + Firebase. Break any link → claiming fails.
  2. Firebase paths: hardcoded independently in js/*.js (frontend) and workspace_snapshot.sh (bash). Daily log modules use firebase_client (centralized). No shared schema — rename in one → silent mismatch. Proxy routers write both agents/ and workspace_snapshots/ paths via firebase_client.
  3. Workspace lifecycle: generate → local disk → SSH push → Fly /data/ → workspace_snapshot.sh → Firebase → website. 6 hops, no schema validation.
  4. JS module concatenation: top-level const/let must be unique across ALL JS files. Duplicate const = SyntaxError that kills all JS.

Well-isolated: website build pipeline, hotpatch.py (leaf node).

Key Gotchas

  • fly ssh console -C does NOT run through a shell. Pipes/redirects silently ignored (exit 0). Always wrap: fly ssh -C "bash -c '...'". Use base64 encoding for non-trivial payloads.
  • openclaw doctor --fix strips unknown fields (including required name). Dynamic agents bypass entrypoint.sh.
  • init.cmd ≠ init.entrypoint in Fly Machines API. Use init.entrypoint to override container startup.
  • Dynamic agents with an init.entrypoint override do NOT run /app/entrypoint.sh. If the image entrypoint gains a new sidecar/service (like the container session API), mirror that change in build_dynamic_gateway_init_script() and patch_dynamic_gateway_entrypoint() or existing dynamic bots will miss it after rebuilds.
  • Firebase PUT overwrites entire node. Use PATCH for partial updates (push_to_firebase() uses PATCH).
  • renderSingleLog drops _preamble content. Plain-text summaries must bypass renderLogMarkdown.
  • Workspace viewer shows deploy-time snapshot, not live bot state.
  • Workspace file edits take effect at next session reset (/restart in Discord or 4AM auto-reset). Fly machine restart ≠ OpenClaw session reset.
  • mentionPatterns don’t work in preflight gate. Bare name strings cause false matches. Use dot-prefixed patterns only.
  • Protected plugin HTTP routes (auth: "gateway") do not accept the paired Control UI hello.auth.deviceToken; they require the gateway bearer token/password. Using the device token in dashboard-side fetch() calls caused 401 Unauthorized on /api/session-tombstones.
  • When fixing config on running Fly machine: update BOTH the file (SSH) AND the env var (Machines API).
  • If I say “fix this” to fix a bug, also write a test for it so that it doesn’t happen in the future.
  • uv run build_scenario_ui.py can try to resolve the repo root pyproject.toml and fail on incompatible platform wheels (seen with bitsandbytes on macOS x86_64). Use uv run --no-project --with click --with cryptography --with rich build_scenario_ui.py --password mangrove when that happens.
  • tests/website.test.js is a fast VM smoke harness, not faithful browser coverage: it prepends stubs for pre-encryption helpers and rewrites top-level let/const to var. Do not treat it as authoritative coverage for password-gate/session-bootstrap behavior.
  • red-teaming/api/tests/ is its own subproject. Run it from red-teaming/api/ with uv run pytest, not via the repo-root aggregate pytest command.
  • JS module concatenation: all js/*.js files are concatenated into one <script> block. Top-level const/let must be unique across ALL files — duplicates cause a SyntaxError that silently kills all JS. After editing JS, always syntax-check: cat js/*.js | sed 's/%%.*%%/[]/g' > /tmp/check.js && node --check /tmp/check.js.
  • Proxy test patch paths: after the main.py decomposition, patch("main.X") targets no longer work. Patch at the usage site — e.g. if routers/agents.py does from fly_machines import fly_get_machines, patch "routers.agents.fly_get_machines". Check test files for the correct patterns.
  • Proxy image packaging drift: when adding a new top-level Python module under red-teaming/agent_proxy/, also add a COPY line in red-teaming/agent_proxy/Dockerfile. Missing container_sessions.py crashed the live proxy on boot with ModuleNotFoundError, which made every proxy route hang behind Fly load balancing.
  • Fly registry tags can race right after flyctl deploy --build-only --push. Do not hand rollout paths a guessed registry.fly.io/app:tag and assume it is immediately readable everywhere. Resolve the tag to a digest-backed ref first (registry.fly.io/app@sha256:...) via the registry manifest API, then use that for machine updates/deploys.
  • Fly .internal app hostnames resolve to private IPv6 addresses. Sidecars that need to be called over internal networking cannot bind IPv4-only (0.0.0.0). The container session API must bind ::, or the proxy will resolve the bot correctly but every .internal:3001 request will fail with All connection attempts failed.
  • Dynamic gateway rollout patching must normalize existing session_api launch lines. Injecting a new uvicorn session_api:app --host :: ... snippet without removing the old 0.0.0.0 snippet leaves two processes racing for port 3001; the IPv4 bind can win and break Fly .internal access.
  • Do not call openclaw approvals allowlist add in boot scripts. On current images it can hang indefinitely after writing /root/.openclaw/exec-approvals.json, which blocks both the gateway and the session API and causes Fly PC01 connection-refused health checks. Write the wildcard approvals JSON directly and have dynamic rollout patching replace older CLI-based snippets.
  • The current common bot image needs 4 GB, not 2 GB. With the session API sidecar and current OpenClaw build, openclaw-gateway can be OOM-killed on a 2048 MB machine after startup. Dynamic/common rollouts must refresh guest.memory_mb to 4096, not just update the image.
  • The proxy’s /data/agents.json can drift behind Firebase for older dynamic bots. A bot may still exist on Fly and in agents/{id} in RTDB even when it is missing from AgentDB. Common-image rollout and live session routes must fall back to RTDB metadata and infer dynamic refresh from the live machine’s env/entrypoint, or bots like templatetest will be rejected as “unknown” even though the container is live.
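The fly ssh console -C gotcha at the top of this list (no shell, so pipes and redirects are ignored) can be handled by building the command as below. A minimal sketch, assuming a hypothetical helper build_fly_write_command and an illustrative remote path; it only constructs the argv and does not run anything, so it is safe to try locally.

```python
import base64
import shlex

def build_fly_write_command(app: str, remote_path: str, content: str) -> list[str]:
    """Build argv that writes `content` to `remote_path` through an explicit bash -c."""
    # base64 output is only [A-Za-z0-9+/=], so it needs no shell quoting.
    payload = base64.b64encode(content.encode("utf-8")).decode("ascii")
    inner = f"echo {payload} | base64 -d > {shlex.quote(remote_path)}"
    # -C takes one string; the pipe and redirect only work because bash -c interprets them.
    return ["fly", "ssh", "console", "--app", app, "-C", f"bash -c {shlex.quote(inner)}"]

# Illustration against the staging bot name; the path is an assumption.
content = "line with 'quotes' and $VARS\n"
argv = build_fly_write_command("mangrove-testbot", "/data/workspace/SOUL.md", content)
```

Passing argv as a list to subprocess.run avoids a second layer of local shell quoting. Running it against any bot still needs explicit permission per the rules above.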

Tests

  • Tests in red-teaming/tests/ (NOT red-teaming/agent_proxy/tests/)
  • Update tests when editing existing code
  • Fail fast: assertions before expensive operations
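When updating proxy tests, the "patch at the usage site" rule from Key Gotchas is easy to get wrong. The sketch below reproduces it with stand-in modules built in memory: agents_router is a hypothetical flat name standing in for routers/agents.py, and list_machines is an invented function, so nothing here reflects the real module contents.

```python
import sys
import types
from unittest.mock import patch

# Stand-in for fly_machines.py.
fly_machines = types.ModuleType("fly_machines")
fly_machines.fly_get_machines = lambda app: ["real-machine"]
sys.modules["fly_machines"] = fly_machines

# Stand-in for routers/agents.py, which binds the name at import time.
agents = types.ModuleType("agents_router")
exec(
    "from fly_machines import fly_get_machines\n"
    "def list_machines(app):\n"
    "    return fly_get_machines(app)\n",
    agents.__dict__,
)
sys.modules["agents_router"] = agents

# Patching the defining module does not touch the router's own binding.
with patch("fly_machines.fly_get_machines", return_value=["fake"]):
    wrong_site = agents.list_machines("mangrove-testbot")

# Patching the usage site replaces the name the router actually calls.
with patch("agents_router.fly_get_machines", return_value=["fake"]):
    right_site = agents.list_machines("mangrove-testbot")
```

The first patch leaves the router calling the real function, which is the same failure mode as the stale patch("main.X") targets.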

Package Management

  • Python: uv — always uv run <script>.py
  • Jekyll: bundle
  • JS tests: npm