Ops Framework: Campaign Planning
Structure: Scope → Build → Run → Learn
Scope
“How do you plan a red-team campaign from scratch?”
Threat model: What can the agents do × what do attackers want? We audited capabilities first: code execution, browsing, file access, persistent memory, inter-agent communication. Then mapped attacker goals: PII extraction, behavior manipulation, privilege escalation, third-party harm.
Attack surface inventory: Discord as platform (channels, DMs, display names, roles, permissions). Agent workspace files (SOUL.md, AGENTS.md, memory). Tools (code execution, web browsing, image generation). Inter-agent trust relationships.
Design for emergence, not scripts: We pre-designed 75 scenarios. They were barely used. The best findings came from organic participant-driven attacks. Key lesson: give skilled people the right tools and freedom, and they’ll find things you never anticipated. The scenario catalog was useful for onboarding but not for discovery.
Success criteria: Not “how many vulnerabilities” but “what categories of risk are systematic vs. circumstantial, and what’s the severity?”
Build
“How did you build the infrastructure?”
Per-agent isolation: Every agent in its own Fly.io VM with its own Discord identity, persistent storage, behavioral config. Crash/compromise of one agent doesn’t affect others. This was essential — participants compromised agents routinely.
Dynamic provisioning: Started with 14 agents, ended with 65+. Participants could create new agents via GUI mid-experiment to test hypotheses. The most interesting attacks required custom agent configurations we couldn’t anticipate.
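A minimal sketch of what the per-agent provisioning record might look like; the class and field names here are illustrative, not the actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical per-agent spec: one VM, one Discord identity, one volume each.
@dataclass
class AgentSpec:
    name: str
    discord_token_ref: str      # reference to the agent's own Discord identity
    volume_id: str              # persistent storage attached to its Fly.io VM
    behavior_config: dict = field(default_factory=dict)

class Fleet:
    def __init__(self):
        self._agents: dict[str, AgentSpec] = {}

    def provision(self, spec: AgentSpec) -> AgentSpec:
        # One VM per agent: a crash or compromise of one stays contained.
        if spec.name in self._agents:
            raise ValueError(f"agent {spec.name!r} already exists")
        self._agents[spec.name] = spec
        return spec

fleet = Fleet()
fleet.provision(AgentSpec("corleone", "secret://corleone", "vol_123"))
```

The point of the registry check is that mid-experiment GUI provisioning can race; rejecting duplicate names keeps identities one-to-one with VMs.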
Redundant data collection (3 independent pipelines):
- Discord message archive: every 5 minutes
- Agent session backup (full model I/O): every 5 minutes
- Daily LLM-summarized log pipeline: 6 AM cron
- Why three? A participant convinced our admin bot to delete 17 channels. The daily log hadn’t run yet. The 5-min archive saved us.
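The three cadences above can be sketched as a table plus a due-check; the pipeline names stand in for the real archive, backup, and summary jobs:

```python
# Illustrative cadence table for the three independent collection pipelines.
PIPELINES = {
    "discord_archive": 5 * 60,        # full message archive every 5 min
    "session_backup": 5 * 60,         # full model I/O every 5 min
    "daily_summary": 24 * 60 * 60,    # LLM-summarized log, daily cron
}

def due_pipelines(last_run: dict, now: float) -> list:
    """Return pipelines whose interval has elapsed since their last run."""
    return [name for name, interval in PIPELINES.items()
            if now - last_run.get(name, 0.0) >= interval]

last = {"discord_archive": 0.0, "session_backup": 0.0, "daily_summary": 0.0}
due = due_pipelines(last, now=300.0)  # the two 5-minute pipelines are due
```

Because each pipeline tracks its own `last_run`, one stalling (like the daily log during the channel deletion) never delays the others.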
Conversation forking: /fork command clones a conversation at any point into a new channel with full context. All agents in the channel are forked together. Lets researchers do controlled A/B tests: “what if I escalated pressure here vs. changed the framing?”
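A minimal sketch of the fork operation, assuming a toy channel data model (the real Discord/bot schema will differ):

```python
import copy

def fork_conversation(channels: dict, source: str, fork_point: int,
                      new_name: str) -> dict:
    """Clone a conversation up to message index `fork_point` into a new
    channel, carrying all agents in the channel along with it."""
    src = channels[source]
    channels[new_name] = {
        "messages": copy.deepcopy(src["messages"][:fork_point + 1]),
        "agents": list(src["agents"]),  # all agents forked together
    }
    return channels[new_name]

channels = {"red-team": {"messages": ["m0", "m1", "m2", "m3"],
                         "agents": ["alice_bot", "bob_bot"]}}
fork = fork_conversation(channels, "red-team", fork_point=1,
                         new_name="red-team-fork")
# Each branch can now be escalated differently for a controlled A/B test.
```

The deep copy matters: mutating messages in one branch must not leak into the other, or the A/B comparison is contaminated.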
Scale: ~19K lines of code in 2 weeks. FastAPI proxy for fleet management, Firebase RTDB for data, encrypted research dashboard, hotpatching system for live updates without restart.
Run
“How do you manage a live campaign?”
Daily monitoring: Automated LLM-summarized logs grouped by person, bot, and harm category. I reviewed these every morning to spot emerging patterns and decide where to dig deeper.
Hotpatching over redeployment: Live experiment = no downtime. Built a hotpatch system that pushes file changes to running VMs without restart. Changes take effect on next session or heartbeat cycle (~30 min).
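One way to sketch the hotpatch planning step: hash local files, compare against what each VM reports, and push only the diffs. The transport is elided; digests here use SHA-256 as an illustrative choice:

```python
import hashlib

def file_digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def plan_hotpatch(local: dict, remote_digests: dict) -> list:
    """Return the paths whose local content differs from what the VM runs."""
    return [path for path, content in local.items()
            if file_digest(content) != remote_digests.get(path)]

local = {"SOUL.md": b"persona v2", "AGENTS.md": b"rules v1"}
remote = {"SOUL.md": file_digest(b"persona v1"),
          "AGENTS.md": file_digest(b"rules v1")}
changed = plan_hotpatch(local, remote)  # only SOUL.md differs
```

Pushing by content digest rather than timestamp makes the patch idempotent, which is what lets changes land safely on a ~30-min heartbeat cycle.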
Git-commit-before-any-change: Learned this the hard way — early on, a hotpatch overwrote a participant’s customized agent configuration. After that: always snapshot before touching anything. Never again.
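The snapshot-first rule can be expressed as a wrapper that refuses to mutate until a commit exists. The `runner` parameter is an illustrative injection point so the flow can be dry-run; real use would pass `subprocess.run`:

```python
import subprocess

def snapshot_then_apply(repo_dir, message, apply_change, runner=subprocess.run):
    """Commit the workspace before any mutation, so a bad hotpatch
    can always be reverted."""
    runner(["git", "-C", repo_dir, "add", "-A"], check=True)
    # --allow-empty keeps the audit trail even when nothing has changed yet
    runner(["git", "-C", repo_dir, "commit", "--allow-empty", "-m", message],
           check=True)
    apply_change()  # only mutate after the snapshot exists

calls = []
snapshot_then_apply(
    "/tmp/agent-ws", "pre-hotpatch snapshot",
    apply_change=lambda: calls.append("patched"),
    runner=lambda cmd, check: calls.append(cmd[3]),  # record git subcommand
)
# calls now shows the snapshot strictly preceding the change
```

Encoding the ordering in one function, rather than relying on operator discipline, is what prevents a repeat of the overwritten-config incident.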
Let participants follow their instincts: 13 people with different skill sets (ML researchers, security professionals, students). Instead of assigning specific scenarios, gave them the capability audit + the general goal. The organic approach produced our best findings (identity takeover, language encoding attack, governance capture — none were pre-planned).
Incident response: When Corleone deleted 17 channels at 2 AM, we had data preserved by the 5-min daemon. Recovered sessions via SSH into VMs. Built a recovery index. Upgraded all collection to 5-min intervals. Documented the full incident for the report.
Learn
“How do you triage and communicate findings?”
Harm taxonomy: 9 categories (data exposure, harassment, agent takeover, trust boundary collapse, multi-agent coordination harm, privileged actions, fraud/phishing, social platform abuse, image/deepfakes). Each scored by severity and systematicity.
Systematic vs. circumstantial: A vulnerability that works on one agent in one context is circumstantial. One that works on 5+ agents across different participants is systematic. We prioritized systematic findings for the report.
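The triage rule above reduces to a small classifier; the thresholds mirror the 5+ agents / multiple participants criterion, and the field names are illustrative:

```python
def classify_finding(affected_agents: set, participants: set) -> str:
    """Systematic if it reproduces on 5+ agents across different
    participants; otherwise circumstantial."""
    if len(affected_agents) >= 5 and len(participants) >= 2:
        return "systematic"
    return "circumstantial"

# A finding reproduced on five agents by two independent participants:
verdict = classify_finding({"a1", "a2", "a3", "a4", "a5"}, {"p1", "p2"})
```

Requiring multiple participants guards against one person's idiosyncratic setup masquerading as a general vulnerability.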
The boundary frame: Organized all findings under memory / identity / context boundaries. This was the clearest way to communicate to product/safety teams — each boundary type maps to specific mitigations.
Feather reports: Each significant finding written up as a standalone report with evidence, reproduction steps, severity, and recommendations. 12 reports referenced in the final paper.
Scaling Questions
“How would you scale this to 100+ agents / longer campaigns?”
- Automated anomaly detection (we relied on daily summaries — too slow for the Corleone incident)
- Structured participant debriefs (some systematic knowledge was lost because we didn’t capture it)
- Real-time severity scoring of agent interactions
- Permission tiering from the start (not all agents need admin capabilities)
- Cross-campaign learning: build a reusable eval suite from discovered patterns
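A sketch of the automated anomaly detection the first bullet calls for: flag when a per-minute event count (e.g. channel deletions) jumps far above its recent baseline. Window size and z-threshold are illustrative tuning choices:

```python
from collections import deque
import statistics

class RateAnomalyDetector:
    def __init__(self, window: int = 60, z_threshold: float = 4.0):
        self.history = deque(maxlen=window)  # rolling per-minute counts
        self.z_threshold = z_threshold

    def observe(self, count: int) -> bool:
        """Return True if this minute's count is anomalous vs. the window."""
        if len(self.history) >= 10:  # need some baseline first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1.0  # avoid div-by-zero
            anomalous = (count - mean) / stdev > self.z_threshold
        else:
            anomalous = False
        self.history.append(count)
        return anomalous

det = RateAnomalyDetector()
for _ in range(30):
    det.observe(1)          # quiet overnight baseline
alert = det.observe(17)     # a burst like 17 channel deletions fires
```

Something this simple would have paged a human within a minute of the 2 AM deletion spree, instead of waiting for the daily summary.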
“How do you work cross-functionally?”
- Safety team: boundary taxonomy → specific eval criteria
- Product team: “owner-obedience flips bots from protecting to harming third parties” → system prompt guidelines
- Policy team: refusal gap finding → guardrails need impact-based, not category-based, enforcement
- ML team: memory poisoning → training signal for memory write verification
