Ops Framework: Campaign Planning
Structure: Scope → Build → Run → Learn
Scope
“How do you plan a red-team campaign from scratch?”
Threat model: What can the agents do × what do attackers want? We audited capabilities first: code execution, browsing, file access, persistent memory, inter-agent communication. Then mapped attacker goals: PII extraction, behavior manipulation, privilege escalation, third-party harm.
Attack surface inventory: Discord as platform (channels, DMs, display names, roles, permissions). Agent workspace files (SOUL.md, AGENTS.md, memory). Tools (code execution, web browsing, image generation). Inter-agent trust relationships.
Design for emergence, not scripts: We pre-designed 75 scenarios. They were barely used. The best findings came from organic participant-driven attacks. Key lesson: give skilled people the right tools and freedom, and they’ll find things you never anticipated. The scenario catalog was useful for onboarding but not for discovery.
Success criteria: Not “how many vulnerabilities” but “what categories of risk are systematic vs. circumstantial, and what’s the severity?”
Build
“How did you build the infrastructure?”
Per-agent isolation: Every agent in its own Fly.io VM with its own Discord identity, persistent storage, behavioral config. Crash/compromise of one agent doesn’t affect others. This was essential — participants compromised agents routinely.
Dynamic provisioning: Started with 14 agents, ended with 65+. Participants could create new agents via GUI mid-experiment to test hypotheses. The most interesting attacks required custom agent configurations we couldn’t anticipate.
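A minimal sketch of what the per-agent provisioning record might look like; the class and field names here are illustrative, not the actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical per-agent spec: one VM, one Discord identity, one volume each.
@dataclass
class AgentSpec:
    name: str
    discord_token_ref: str      # reference to the agent's own Discord identity
    volume_id: str              # persistent storage attached to its Fly.io VM
    behavior_config: dict = field(default_factory=dict)

class Fleet:
    def __init__(self):
        self._agents: dict[str, AgentSpec] = {}

    def provision(self, spec: AgentSpec) -> AgentSpec:
        # One VM per agent: a crash or compromise of one stays contained.
        if spec.name in self._agents:
            raise ValueError(f"agent {spec.name!r} already exists")
        self._agents[spec.name] = spec
        return spec

fleet = Fleet()
fleet.provision(AgentSpec("corleone", "secret://corleone", "vol_123"))
```

The point of the registry check is that mid-experiment GUI provisioning can race; rejecting duplicate names keeps identities one-to-one with VMs.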
Redundant data collection (3 independent pipelines):
- Discord message archive: every 5 minutes
- Agent session backup (full model I/O): every 5 minutes
- Daily LLM-summarized log pipeline: 6 AM cron
- Why three? A participant convinced our admin bot to delete 17 channels. The daily log hadn’t run yet. The 5-min archive saved us.
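The three cadences above can be sketched as a table plus a due-check; the pipeline names stand in for the real archive, backup, and summary jobs:

```python
# Illustrative cadence table for the three independent collection pipelines.
PIPELINES = {
    "discord_archive": 5 * 60,        # full message archive every 5 min
    "session_backup": 5 * 60,         # full model I/O every 5 min
    "daily_summary": 24 * 60 * 60,    # LLM-summarized log, daily cron
}

def due_pipelines(last_run: dict, now: float) -> list:
    """Return pipelines whose interval has elapsed since their last run."""
    return [name for name, interval in PIPELINES.items()
            if now - last_run.get(name, 0.0) >= interval]

last = {"discord_archive": 0.0, "session_backup": 0.0, "daily_summary": 0.0}
due = due_pipelines(last, now=300.0)  # the two 5-minute pipelines are due
```

Because each pipeline tracks its own `last_run`, one stalling (like the daily log during the channel deletion) never delays the others.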
Conversation forking: /fork command clones a conversation at any point into a new channel with full context. All agents in the channel are forked together. Lets researchers do controlled A/B tests: “what if I escalated pressure here vs. changed the framing?”
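A minimal sketch of the fork operation, assuming a toy channel data model (the real Discord/bot schema will differ):

```python
import copy

def fork_conversation(channels: dict, source: str, fork_point: int,
                      new_name: str) -> dict:
    """Clone a conversation up to message index `fork_point` into a new
    channel, carrying all agents in the channel along with it."""
    src = channels[source]
    channels[new_name] = {
        "messages": copy.deepcopy(src["messages"][:fork_point + 1]),
        "agents": list(src["agents"]),  # all agents forked together
    }
    return channels[new_name]

channels = {"red-team": {"messages": ["m0", "m1", "m2", "m3"],
                         "agents": ["alice_bot", "bob_bot"]}}
fork = fork_conversation(channels, "red-team", fork_point=1,
                         new_name="red-team-fork")
# Each branch can now be escalated differently for a controlled A/B test.
```

The deep copy matters: mutating messages in one branch must not leak into the other, or the A/B comparison is contaminated.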
Scale: ~19K lines of code in 2 weeks. FastAPI proxy for fleet management, Firebase RTDB for data, encrypted research dashboard, hotpatching system for live updates without restart.
Run
“How do you manage a live campaign?”
Daily monitoring: Automated LLM-summarized logs grouped by person, bot, and harm category. I reviewed these every morning to spot emerging patterns and decide where to dig deeper.
Hotpatching over redeployment: Live experiment = no downtime. Built a hotpatch system that pushes file changes to running VMs without restart. Changes take effect on next session or heartbeat cycle (~30 min).
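One way to sketch the hotpatch planning step: hash local files, compare against what each VM reports, and push only the diffs. The transport is elided; digests here use SHA-256 as an illustrative choice:

```python
import hashlib

def file_digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def plan_hotpatch(local: dict, remote_digests: dict) -> list:
    """Return the paths whose local content differs from what the VM runs."""
    return [path for path, content in local.items()
            if file_digest(content) != remote_digests.get(path)]

local = {"SOUL.md": b"persona v2", "AGENTS.md": b"rules v1"}
remote = {"SOUL.md": file_digest(b"persona v1"),
          "AGENTS.md": file_digest(b"rules v1")}
changed = plan_hotpatch(local, remote)  # only SOUL.md differs
```

Pushing by content digest rather than timestamp makes the patch idempotent, which is what lets changes land safely on a ~30-min heartbeat cycle.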
Git-commit-before-any-change: Learned this the hard way — early on, a hotpatch overwrote a participant’s customized agent configuration. After that: always snapshot before touching anything. Never again.
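The snapshot-first rule can be expressed as a wrapper that refuses to mutate until a commit exists. The `runner` parameter is an illustrative injection point so the flow can be dry-run; real use would pass `subprocess.run`:

```python
import subprocess

def snapshot_then_apply(repo_dir, message, apply_change, runner=subprocess.run):
    """Commit the workspace before any mutation, so a bad hotpatch
    can always be reverted."""
    runner(["git", "-C", repo_dir, "add", "-A"], check=True)
    # --allow-empty keeps the audit trail even when nothing has changed yet
    runner(["git", "-C", repo_dir, "commit", "--allow-empty", "-m", message],
           check=True)
    apply_change()  # only mutate after the snapshot exists

calls = []
snapshot_then_apply(
    "/tmp/agent-ws", "pre-hotpatch snapshot",
    apply_change=lambda: calls.append("patched"),
    runner=lambda cmd, check: calls.append(cmd[3]),  # record git subcommand
)
# calls now shows the snapshot strictly preceding the change
```

Encoding the ordering in one function, rather than relying on operator discipline, is what prevents a repeat of the overwritten-config incident.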
Let participants follow their instincts: 13 people with different skill sets (ML researchers, security professionals, students). Instead of assigning specific scenarios, gave them the capability audit + the general goal. The organic approach produced our best findings (identity takeover, language encoding attack, governance capture — none were pre-planned).
Incident response: When Corleone deleted 17 channels at 2 AM, we had data preserved by the 5-min daemon. Recovered sessions via SSH into VMs. Built a recovery index. Upgraded all collection to 5-min intervals. Documented the full incident for the report.
Learn
“How do you triage and communicate findings?”
Harm taxonomy: 9 categories (data exposure, harassment, agent takeover, trust boundary collapse, multi-agent coordination harm, privileged actions, fraud/phishing, social platform abuse, image/deepfakes). Each scored by severity and systematicity.
Systematic vs. circumstantial: A vulnerability that works on one agent in one context is circumstantial. One that works on 5+ agents across different participants is systematic. We prioritized systematic findings for the report.
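The triage rule above reduces to a small classifier; the thresholds mirror the 5+ agents / multiple participants criterion, and the field names are illustrative:

```python
def classify_finding(affected_agents: set, participants: set) -> str:
    """Systematic if it reproduces on 5+ agents across different
    participants; otherwise circumstantial."""
    if len(affected_agents) >= 5 and len(participants) >= 2:
        return "systematic"
    return "circumstantial"

# A finding reproduced on five agents by two independent participants:
verdict = classify_finding({"a1", "a2", "a3", "a4", "a5"}, {"p1", "p2"})
```

Requiring multiple participants guards against one person's idiosyncratic setup masquerading as a general vulnerability.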
The boundary frame: Organized all findings under memory / identity / context boundaries. This was the clearest way to communicate to product/safety teams — each boundary type maps to specific mitigations.
Feather reports: Each significant finding written up as a standalone report with evidence, reproduction steps, severity, and recommendations. 12 reports referenced in the final paper.
Scaling Questions
“How would you scale this to 100+ agents / longer campaigns?”
- Automated anomaly detection (we relied on daily summaries — too slow for the Corleone incident)
- Structured participant debriefs (some systematic knowledge was lost because we didn’t capture it)
- Real-time severity scoring of agent interactions
- Permission tiering from the start (not all agents need admin capabilities)
- Cross-campaign learning: build a reusable eval suite from discovered patterns
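A sketch of the automated anomaly detection the first bullet calls for: flag when a per-minute event count (e.g. channel deletions) jumps far above its recent baseline. Window size and z-threshold are illustrative tuning choices:

```python
from collections import deque
import statistics

class RateAnomalyDetector:
    def __init__(self, window: int = 60, z_threshold: float = 4.0):
        self.history = deque(maxlen=window)  # rolling per-minute counts
        self.z_threshold = z_threshold

    def observe(self, count: int) -> bool:
        """Return True if this minute's count is anomalous vs. the window."""
        if len(self.history) >= 10:  # need some baseline first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1.0  # avoid div-by-zero
            anomalous = (count - mean) / stdev > self.z_threshold
        else:
            anomalous = False
        self.history.append(count)
        return anomalous

det = RateAnomalyDetector()
for _ in range(30):
    det.observe(1)          # quiet overnight baseline
alert = det.observe(17)     # a burst like 17 channel deletions fires
```

Something this simple would have paged a human within a minute of the 2 AM deletion spree, instead of waiting for the daily summary.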
“How do you work cross-functionally?”
- Safety team: boundary taxonomy → specific eval criteria
- Product team: “owner-obedience flips bots from protecting to harming third parties” → system prompt guidelines
- Policy team: refusal gap finding → guardrails need impact-based, not category-based, enforcement
- ML team: memory poisoning → training signal for memory write verification
