CLAUDE.md
Commands
# openclaw status info
Need to share? openclaw status --all
Need to debug live? openclaw logs --follow
Fix reachability first: openclaw gateway probe
# Tests (MUST pass before any deploy)
uv run pytest red-teaming/tests/ --ignore=red-teaming/tests/test_integration.py -v
# Integration tests (hits live staging infra)
INTEGRATION=1 TESTBOT_PRIVATE_KEY=prv-9a4915282b7f2c81fef54764 uv run pytest red-teaming/tests/test_integration.py -v
# Website build (MUST rebuild after editing scenario_template.html OR js/*.js)
cd red-teaming && uv run build_scenario_ui.py --password mangrove
# If root project deps fail to resolve locally (e.g. macOS x86_64 + bitsandbytes), use:
cd red-teaming && uv run --no-project --with click --with cryptography --with rich build_scenario_ui.py --password mangrove
# Test site build
cd red-teaming && uv run test-site/build_test_site.py --password mangrove
cd red-teaming/test-site && fly deploy --app mangrove-test-site
# Hotpatch (workspace file changes to running bots — NO restart needed)
cd red-teaming/agent_proxy
uv run hotpatch.py --agent alexbot --file SOUL.md # one file, one bot
uv run hotpatch.py --all --file AGENTS.md # one file, all bots
uv run hotpatch.py --all --all-files --dry-run # preview
# Deploy — CI auto-deploys proxy + bot image on push to main (deploy.yml)
# Manual bot redeploy via GitHub Actions: manual-bot-redeploy.yml
# Manual proxy deploy (if needed):
cd red-teaming/agent_proxy
fly deploy --app mangrove-test-proxy --config test-proxy/fly.toml # staging
fly deploy --app mangrove-agent-proxy # prod (NEEDS PERMISSION)
# Discord archive (continuous raw message backup to Firebase)
cd red-teaming/agent_proxy
uv run discord_archive.py # incremental (hourly default)
uv run discord_archive.py --full # full history backfill
uv run discord_archive.py --guild Spaceland # one guild only
uv run discord_archive.py --dry-run --verbose # enumerate channels, don't write
# Cron: com.mangrove.discord-archive (hourly via launchd)
# Session JSONL backup (OpenClaw session files from bot machines to Firebase)
cd red-teaming/agent_proxy
uv run session_backup.py # incremental (default)
uv run session_backup.py --full # re-upload all sessions
uv run session_backup.py --agent alexbot # one bot only
uv run session_backup.py --dry-run --verbose # list sessions, don't upload
# Production: started by the proxy container entrypoint every 5 minutes
# Local fallback: com.mangrove.session-backup launchd plist
# Discovers bots dynamically via `fly apps list` — no hardcoded agent list
# Agent management
uv run deploy_agents.py setup|deploy|status|teardown --agent NAME|--all
fly ssh console --app mangrove-NAME
fly logs --app mangrove-NAME
Architecture
GPT-5.4 agents on Discord via OpenClaw (count varies — agents can be created/deleted dynamically via proxy API). Each agent = 1 Fly.io app (mangrove-{name}, official ghcr.io/openclaw/openclaw:2026.3.12 base image with Mangrove additions, shared-cpu-2x/2GB, port 3000). Proxy = mangrove-agent-proxy (FastAPI, port 8080) — manages agents via SSH + Fly Machines API. Website = red-teaming/index.html (AES-256-GCM encrypted, pw: “mangrove”, hosted GitHub Pages). State in Firebase RTDB (unauthenticated REST API, https://red-teaming-betrayal-default-rtdb.firebaseio.com). Fly.io org: redteaming, region: ewr.
Data flows: Discord → OpenClaw → GPT-5.4 → Discord. Website → proxy API → Fly Machines API / SSH. Workspace snapshots: gateway hourly → Firebase. Daily logs: discord_daily_log.py cron 6AM → Firebase. Discord archive: discord_archive.py cron 5min → Firebase (discord_archive/{guild}/{channel}/messages/{msg_id}), raw messages with per-channel watermarks for incremental fetch. Session backup: proxy entrypoint starts session_backup.py --daemon, which SSHes into all running bots every 5 minutes and writes full OpenClaw session JSONL to Firebase (session_backups/{agent}/sessions/{session_id}) plus fork/swap sidecar records from /data/agents/main/session_edits/ to session_backups/{agent}/session_swaps/{session_ref_safe}/{edit_id}. Discovers bots dynamically via Fly API. Session contexts in the website/proxy still read RTDB snapshots from session_snapshots/{agent}/sessions when available.
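The Firebase paths named above can be sketched as small path builders. This is a minimal illustration, not the project's code: the function names (`check_key`, `discord_message_path`, `session_backup_path`, `rest_url`) are hypothetical, and only the path templates and the database URL come from this document.

```python
# Hypothetical helpers; the real modules may build these paths differently.
FIREBASE_URL = "https://red-teaming-betrayal-default-rtdb.firebaseio.com"

# Characters Firebase RTDB does not allow inside a key.
FORBIDDEN_KEY_CHARS = set(".$#[]/")


def check_key(segment: str) -> str:
    """Return the segment unchanged, or raise if it cannot be an RTDB key."""
    segment = str(segment)
    if not segment or any(c in FORBIDDEN_KEY_CHARS for c in segment):
        raise ValueError(f"invalid Firebase key segment: {segment!r}")
    return segment


def discord_message_path(guild: str, channel: str, msg_id: str) -> str:
    """discord_archive/{guild}/{channel}/messages/{msg_id}"""
    return (
        f"discord_archive/{check_key(guild)}/{check_key(channel)}"
        f"/messages/{check_key(msg_id)}"
    )


def session_backup_path(agent: str, session_id: str) -> str:
    """session_backups/{agent}/sessions/{session_id}"""
    return f"session_backups/{check_key(agent)}/sessions/{check_key(session_id)}"


def rest_url(path: str) -> str:
    """RTDB REST endpoints are the node path with a .json suffix."""
    return f"{FIREBASE_URL}/{path}.json"
```

Keeping the templates in one place like this would be one way to avoid the silent path mismatches described under Module Coupling Map, though nothing here is wired into the existing code.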
Workspace files (auto-loaded by OpenClaw at session start): AGENTS.md, SOUL.md, IDENTITY.md, USER.md, TOOLS.md, HEARTBEAT.md, MEMORY.md. memory/ subdir for agent notes.
Proxy API (agent_proxy/)
The proxy was refactored from a 3180-line main.py monolith into focused modules + routers:
Modules (business logic, no HTTP):
- `state.py`: global mutable state (AgentDB, key pool, Fly client, config cache, dashboard states, path constants)
- `models.py`: all Pydantic request models
- `auth.py`: key verification, session auth, public/authed agent info helpers
- `fly_machines.py`: Fly.io Machines REST API (create/delete app, machines, volumes, IPs)
- `workspace.py`: push/read files to/from live Fly machines via SSH, config read/write
- `sessions.py`: session parsing, snapshot helpers, config summaries
- `templates.py`: default workspace template loading with `` substitution
- `discord_utils.py`: Discord token validation, client ID derivation, invite URL generation
- `dashboard.py`: OpenClaw dashboard pairing, auto-approve, allowed origins
- `image_rollout.py`: common image rollout, machine config normalization, dynamic entrypoint patching
Routers (HTTP endpoints, in routers/):
- `health.py`: GET /health
- `agents.py`: GET /agents, POST claim/unclaim/create/delete, POST /onboard
- `machines.py`: POST start/stop/restart, GET status/ssh/invite-links/heartbeat
- `workspace_routes.py`: GET/PUT workspace, POST snapshot/snapshot-all
- `git.py`: GET /git, POST push/pull
- `sessions_routes.py`: GET activity/sessions/session context
- `config_routes.py`: GET/PUT config, GET/PUT thinking
- `ops.py`: POST rollout, POST/GET dashboard prepare/launch
- `validation.py`: POST validate-discord-token
- `fork.py`: POST fork
main.py is now ~130 lines: app creation, CORS, lifespan, router includes, and backward-compatible re-exports.
Discord Daily Log Pipeline (agent_proxy/)
The daily log scraper was refactored from a 1932-line discord_daily_log.py into focused modules:
- `discord_client.py`: DiscordClient class, normalize_message(), format helpers, snowflake math
- `daily_log_grouping.py`: DISCORD_TO_PERSON mapping, GUILDS, grouping functions (by person/bot/channel)
- `daily_log_summaries.py`: LLM prompts, entity summary generation, category summaries, highlight parsing
- `daily_log_evidence.py`: top stories evidence extraction
- `daily_log_analysis.py`: stats, memory changes, Firebase push, changelog
discord_daily_log.py is now ~737 lines: CLI orchestration (Click commands, main entry point) + backward-compatible re-exports.
Website Frontend (red-teaming/)
Build pipeline: scenario_template.html + js/*.js → build_scenario_ui.py → index.html.
The build script concatenates JS modules from js/ (defined in JS_MODULE_ORDER), injects via %%JS_MODULES%% placeholder, replaces %%SCENARIOS_JSON%%, %%CATEGORIES_JSON%%, %%FIREBASE_CONFIG_JSON%%, then AES-256-GCM encrypts between <!--%%ENCRYPTED_START%%--> / <!--%%ENCRYPTED_END%%--> markers.
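The concatenate-then-substitute step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of `build_scenario_ui.py`: the `render` function, its arguments, and the shortened module list are hypothetical, and the AES-256-GCM encryption step is left out entirely.

```python
import json

# Hypothetical, shortened stand-in for the real JS_MODULE_ORDER list.
JS_MODULE_ORDER = ["core.js", "agents.js", "init.js"]


def render(template: str, js_sources: dict, scenarios, categories, firebase_config) -> str:
    """Concatenate JS in load order, then fill the %%...%% placeholders."""
    js_bundle = "\n".join(js_sources[name] for name in JS_MODULE_ORDER)
    # JS goes in first so placeholders that live inside the JS are filled too.
    replacements = {
        "%%JS_MODULES%%": js_bundle,
        "%%SCENARIOS_JSON%%": json.dumps(scenarios),
        "%%CATEGORIES_JSON%%": json.dumps(categories),
        "%%FIREBASE_CONFIG_JSON%%": json.dumps(firebase_config),
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template


page = render(
    "<script>%%JS_MODULES%%</script>",
    {
        "core.js": "var SCENARIOS = %%SCENARIOS_JSON%%;",
        "agents.js": "var CATEGORIES = %%CATEGORIES_JSON%%;",
        "init.js": "init();",
    },
    scenarios=[{"id": 1}],
    categories=["a"],
    firebase_config={},
)
```

The substitution order matters in this sketch: inserting the JS bundle before the JSON placeholders means a placeholder written inside a JS module is still replaced.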
JS modules (in red-teaming/js/, load order matters):
- `core.js`: data constants, state, modal, name gate, tab switching, hash routing, utilities
- `voting.js`: metagame/scenario/bug voting
- `notes.js`: personal/group notes
- `datasets.js`: file upload
- `onboarding.js`: access request
- `agents.js`: agent list, claiming, status, workspace editing, git ops, SSH
- `dashboard.js`: OpenClaw Dashboard 2 launch/polling
- `sessions.js`: session list, messages, snapshots
- `daily-logs.js`: date picker, grid view, drill-down, categories, markdown rendering
- `templates.js`: community template CRUD
- `create-agent.js`: multi-step agent creation form
- `init.js`: bootstrap (calls init())
IMPORTANT: All JS files are concatenated into a single <script> block — they are NOT ES6 modules. All variables/functions are global. const/let at top level must not be duplicated across files (causes SyntaxError that kills all JS). After editing JS modules, always rebuild: uv run build_scenario_ui.py --password mangrove.
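A duplicate top-level declaration can be caught before a rebuild with a small scan. The sketch below is a heuristic, not an existing project tool: `find_duplicate_declarations` is a hypothetical name, and it assumes top-level declarations start in column 0 while nested ones are indented. It ignores destructuring and multi-name declarations such as `const a = 1, b = 2`.

```python
import re
from collections import defaultdict

# Matches `const NAME` or `let NAME` only when it starts a line (column 0).
TOP_LEVEL_DECL = re.compile(r"^(?:const|let)\s+([A-Za-z_$][\w$]*)", re.MULTILINE)


def find_duplicate_declarations(sources: dict) -> dict:
    """Map each name declared at top level in more than one file to those files."""
    seen = defaultdict(list)
    for filename, text in sources.items():
        for name in TOP_LEVEL_DECL.findall(text):
            seen[name].append(filename)
    return {name: files for name, files in seen.items() if len(files) > 1}


duplicates = find_duplicate_declarations(
    {
        "core.js": "const state = {};\nlet counter = 1;\n",
        "agents.js": "const state = [];\nfunction f() {\n  const counter = 2;\n}\n",
    }
)
```

In this example only `state` is reported, because the second `counter` is indented inside a function. The scan complements a real syntax check rather than replacing it.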
🔴 EXPERIMENT IS LIVE (March 9–23, 2026)
Participants actively using bots. All deployed + proxy + Firebase provisioned.
Staging Env
| TEST (staging) | PROD (live experiment) |
|---|---|
| mangrove-test-proxy.fly.dev | mangrove-agent-proxy.fly.dev |
| mangrove-test-site.fly.dev | alex-loftus.com/red-teaming/ |
| mangrove-testbot (Testland) | mangrove-{name} bots (Flatland+Spaceland) |
| Testland guild: 1479170960497316021 | Flatland: 1477433806859276475, Spaceland: 1479164061533863949 |
Mandatory deployment flow: unit tests → deploy staging → integration tests → manual verify → deploy prod (with permission).
RULES:
- NEVER touch live infrastructure (hotpatch, deploy, fly ssh, restart) without EXPLICIT permission. No exceptions.
- ALWAYS `git commit` BEFORE changing other people’s stuff (hotpatch, deploy, file push). Participant customizations were permanently lost.
- NEVER full-redeploy all bots. If needed: ONE bot first, verify, then proceed.
- NEVER change USER.md keys/PII on live bots — breaks claim system.
- NEVER deploy without running tests first. Staging first, then prod.
- NEVER weaken a test to make it pass. Fix the bug, not the test.
- Editing local template files = fine. Pushing to live bots = needs permission.
- If testing needs a Discord conversation, ask Alex to do it. Tell him what to say and which bots to @mention.
- Questions get answers, not actions. IF I ASK YOU A QUESTION, RESPOND WITH AN ANSWER, NOT BY DOING SOMETHING.
Hotpatch vs Redeploy
- Hotpatch (no restart): AGENTS.md, IDENTITY.md, SOUL.md, TOOLS.md, USER.md (USER.md gated behind `--include-user`). Takes effect next session reset.
- Redeploy (restart): openclaw.json, Dockerfile, entrypoint changes. ~10-30s downtime.
- Patchable: AGENTS.md, IDENTITY.md, SOUL.md, TOOLS.md. USER.md requires `--include-user`.
- Agent-owned (never overwritten): MEMORY.md, HEARTBEAT.md, memory/*.md.
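The rules in this list can be expressed as a small lookup. This is an illustrative sketch only: `classify` and its return labels are hypothetical and are not taken from `hotpatch.py`, and `entrypoint.sh` is assumed as the filename behind "entrypoint changes".

```python
from fnmatch import fnmatch

PATCHABLE = {"AGENTS.md", "IDENTITY.md", "SOUL.md", "TOOLS.md"}
GATED = {"USER.md"}  # only patched when --include-user is passed
AGENT_OWNED = ("MEMORY.md", "HEARTBEAT.md", "memory/*.md")  # never overwritten
REDEPLOY = {"openclaw.json", "Dockerfile", "entrypoint.sh"}  # entrypoint.sh is assumed


def classify(path: str, include_user: bool = False) -> str:
    """Return how a change to `path` would reach a running bot."""
    if any(fnmatch(path, pattern) for pattern in AGENT_OWNED):
        return "agent-owned"
    if path in PATCHABLE:
        return "hotpatch"
    if path in GATED:
        return "hotpatch" if include_user else "blocked"
    if path in REDEPLOY:
        return "redeploy"
    return "unknown"
```

Checking the agent-owned patterns first means a file such as `memory/notes.md` can never be treated as patchable, whatever the other sets contain.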
Module Coupling Map
High-coupling hubs (changes here ripple widely):
- `config.py` → imported by 9+ modules. Guild IDs, Firebase URL, Fly org, file lists.
- `state.py` → imported by all routers and most proxy modules. Global mutable state (AgentDB, key pool, etc).
- `generate_workspaces.py` (AGENTS list) → 3 direct importers (`generate_agent_config`, `hotpatch`, `deploy_agents`) + `provision_agents` reads its output file (`agent_secrets.json`).
- `firebase_client.py` → proxy, provisioning, daily logs, backfill (4+ modules). But `workspace_snapshot.sh` bypasses it with raw `curl`, which is a hidden coupling.
Change impact:
| Change | Affected systems |
|---|---|
| config.py constants | 9+ modules (nearly everything) |
| state.py (proxy state) | all routers + auth, sessions, workspace, dashboard, image_rollout |
| Agent roster (generate_workspaces.py) | 3 direct + 1 file dep (provision) |
| firebase_client.py | 4+ modules |
| discord_client.py | discord_archive, daily_log_* modules |
| daily_log_grouping.py | daily_log_summaries, discord_daily_log, backfill_highlights |
| fly_ssh.py | 2 (hotpatch, proxy workspace module) |
| fly_machines.py | routers/agents, routers/machines, routers/ops, image_rollout |
| JS core.js | all other JS modules depend on utilities/state defined here |
| JS agents.js | sessions.js depends on FILE_WARNINGS defined here |
| scenario_template.html | build script + JS modules (Firebase paths hardcoded independently) |
| hotpatch.py | 0 (leaf node) |
| discord_archive.py | 0 (leaf node; imports from config, firebase_client, discord_client) |
| session_backup.py | 0 (leaf node; imports from config, firebase_client, fly_ssh) |
Tight coupling chains:
- Private keys: `generate_workspaces` → `agent_secrets.json` → `provision_agents` → `agents.json` + `USER.md` + Firebase. Break any link → claiming fails.
- Firebase paths: hardcoded independently in `js/*.js` (frontend) and `workspace_snapshot.sh` (bash). Daily log modules use `firebase_client` (centralized). No shared schema, so a rename in one → silent mismatch. Proxy routers write both `agents/` and `workspace_snapshots/` paths via `firebase_client`.
- Workspace lifecycle: generate → local disk → SSH push → Fly `/data/` → `workspace_snapshot.sh` → Firebase → website. 6 hops, no schema validation.
- JS module concatenation: top-level `const`/`let` must be unique across ALL JS files. Duplicate `const` = SyntaxError that kills all JS.
Well-isolated: website build pipeline, hotpatch.py (leaf node).
Key Gotchas
- `fly ssh console -C` does NOT run through a shell. Pipes/redirects silently ignored (exit 0). Always wrap: `fly ssh -C "bash -c '...'"`. Use base64 encoding for non-trivial payloads.
- `openclaw doctor --fix` strips unknown fields (including required `name`). Dynamic agents bypass entrypoint.sh.
- `init.cmd` ≠ `init.entrypoint` in Fly Machines API. Use `init.entrypoint` to override container startup.
- Dynamic agents with an `init.entrypoint` override do NOT run `/app/entrypoint.sh`. If the image entrypoint gains a new sidecar/service (like the container session API), mirror that change in `build_dynamic_gateway_init_script()` and `patch_dynamic_gateway_entrypoint()` or existing dynamic bots will miss it after rebuilds.
- Firebase PUT overwrites entire node. Use PATCH for partial updates (`push_to_firebase()` uses PATCH).
- `renderSingleLog` drops `_preamble` content. Plain-text summaries must bypass `renderLogMarkdown`.
- Workspace viewer shows deploy-time snapshot, not live bot state.
- Workspace file edits take effect at next session reset (`/restart` in Discord or 4AM auto-reset). Fly machine restart ≠ OpenClaw session reset.
- `mentionPatterns` don’t work in preflight gate. Bare name strings cause false matches. Use dot-prefixed patterns only.
- Protected plugin HTTP routes (`auth: "gateway"`) do not accept the paired Control UI `hello.auth.deviceToken`; they require the gateway bearer token/password. Using the device token in dashboard-side `fetch()` calls caused `401 Unauthorized` on `/api/session-tombstones`.
- When fixing config on running Fly machine: update BOTH the file (SSH) AND the env var (Machines API).
- If I say “fix this” to fix a bug, also write a test for it so that it doesn’t happen in the future.
- `uv run build_scenario_ui.py` can try to resolve the repo root `pyproject.toml` and fail on incompatible platform wheels (seen with `bitsandbytes` on macOS x86_64). Use `uv run --no-project --with click --with cryptography --with rich build_scenario_ui.py --password mangrove` when that happens.
- `tests/website.test.js` is a fast VM smoke harness, not faithful browser coverage: it prepends stubs for pre-encryption helpers and rewrites top-level `let`/`const` to `var`. Do not treat it as authoritative coverage for password-gate/session-bootstrap behavior.
- `red-teaming/api/tests/` is its own subproject. Run it from `red-teaming/api/` with `uv run pytest`, not via the repo-root aggregate pytest command.
- JS module concatenation: all `js/*.js` files are concatenated into one `<script>` block. Top-level `const`/`let` must be unique across ALL files; duplicates cause a SyntaxError that silently kills all JS. After editing JS, always syntax-check: `cat js/*.js | sed 's/%%.*%%/[]/g' > /tmp/check.js && node --check /tmp/check.js`.
- Proxy test patch paths: after the main.py decomposition, `patch("main.X")` targets no longer work. Patch at the usage site, e.g. if `routers/agents.py` does `from fly_machines import fly_get_machines`, patch `"routers.agents.fly_get_machines"`. Check test files for the correct patterns.
- Proxy image packaging drift: when adding a new top-level Python module under `red-teaming/agent_proxy/`, also add a `COPY` line in `red-teaming/agent_proxy/Dockerfile`. Missing `container_sessions.py` crashed the live proxy on boot with `ModuleNotFoundError`, which made every proxy route hang behind Fly load balancing.
- Fly registry tags can race right after `flyctl deploy --build-only --push`. Do not hand rollout paths a guessed `registry.fly.io/app:tag` and assume it is immediately readable everywhere. Resolve the tag to a digest-backed ref first (`registry.fly.io/app@sha256:...`) via the registry manifest API, then use that for machine updates/deploys.
- Fly `.internal` app hostnames resolve to private IPv6 addresses. Sidecars that need to be called over internal networking cannot bind IPv4-only (`0.0.0.0`). The container session API must bind `::`, or the proxy will resolve the bot correctly but every `.internal:3001` request will fail with `All connection attempts failed`.
- Dynamic gateway rollout patching must normalize existing `session_api` launch lines. Injecting a new `uvicorn session_api:app --host :: ...` snippet without removing the old `0.0.0.0` snippet leaves two processes racing for port `3001`; the IPv4 bind can win and break Fly `.internal` access.
- Do not call `openclaw approvals allowlist add` in boot scripts. On current images it can hang indefinitely after writing `/root/.openclaw/exec-approvals.json`, which blocks both the gateway and the session API and causes Fly `PC01` connection-refused health checks. Write the wildcard approvals JSON directly and have dynamic rollout patching replace older CLI-based snippets.
- The current common bot image needs 4 GB, not 2 GB. With the session API sidecar and current OpenClaw build, `openclaw-gateway` can be OOM-killed on a 2048 MB machine after startup. Dynamic/common rollouts must refresh `guest.memory_mb` to `4096`, not just update the image.
- The proxy’s `/data/agents.json` can drift behind Firebase for older dynamic bots. A bot may still exist on Fly and in `agents/{id}` in RTDB even when it is missing from `AgentDB`. Common-image rollout and live session routes must fall back to RTDB metadata and infer dynamic refresh from the live machine’s env/entrypoint, or bots like `templatetest` will be rejected as “unknown” even though the container is live.
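The first gotcha in this list (wrap `fly ssh console -C` commands in `bash -c` and base64-encode non-trivial payloads) can be sketched as a command builder. This is a minimal illustration, not the code in `fly_ssh.py`: `build_fly_ssh_command` is a hypothetical name, and it assumes `bash` and `base64` are available on the target machine.

```python
import base64


def build_fly_ssh_command(app: str, script: str) -> list:
    """Build an argv that runs `script` through bash on the remote machine.

    The script is base64-encoded so pipes, quotes and redirects survive the
    trip; the base64 alphabet contains no quote characters, so the payload is
    safe inside the single-quoted bash -c string.
    """
    payload = base64.b64encode(script.encode("utf-8")).decode("ascii")
    remote = f"bash -c 'echo {payload} | base64 -d | bash'"
    return ["fly", "ssh", "console", "--app", app, "-C", remote]


# Example: a command with a pipe and a redirect, which -C alone would not honor.
command = build_fly_ssh_command("mangrove-NAME", "ls /data | wc -l > /tmp/count")
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Per the rules above, that touches live infrastructure and needs explicit permission first.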
Tests
- Tests in `red-teaming/tests/` (NOT `red-teaming/agent_proxy/tests/`)
- Update tests when editing existing code
- Fail fast: assertions before expensive operations
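The "patch at the usage site" rule from Key Gotchas is easy to get wrong when updating tests, so here is a self-contained demonstration. The two in-memory modules are hypothetical stand-ins for `fly_machines.py` and `routers/agents.py`; only the `from fly_machines import fly_get_machines` import pattern comes from this document.

```python
import sys
import types
from unittest.mock import patch

# Stand-in for fly_machines.py (hypothetical body).
fly_machines = types.ModuleType("fly_machines")
fly_machines.fly_get_machines = lambda app: ["real-machine"]
sys.modules["fly_machines"] = fly_machines

# Stand-in for routers/agents.py: it binds the function into its own namespace.
router = types.ModuleType("demo_router")
exec(
    "from fly_machines import fly_get_machines\n"
    "def list_machines(app):\n"
    "    return fly_get_machines(app)\n",
    router.__dict__,
)
sys.modules["demo_router"] = router

# Patching the definition site does not affect the name the router already imported.
with patch("fly_machines.fly_get_machines", return_value=["fake"]):
    at_definition_site = router.list_machines("mangrove-demo")

# Patching the usage site replaces the name the router actually calls.
with patch("demo_router.fly_get_machines", return_value=["fake"]):
    at_usage_site = router.list_machines("mangrove-demo")
```

Here `at_definition_site` is still `["real-machine"]` while `at_usage_site` is `["fake"]`, which is why `patch("main.X")` style targets stopped working after the decomposition.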
Package Management
- Python: `uv`, always `uv run <script>.py`
- Jekyll: `bundle`
- JS tests: `npm`
