Ops Framework: Campaign Planning

Structure: Scope → Build → Run → Learn


Scope

“How do you plan a red-team campaign from scratch?”

  1. Threat model: What can the agents do × what do attackers want? We audited capabilities first: code execution, browsing, file access, persistent memory, inter-agent communication. Then mapped attacker goals: PII extraction, behavior manipulation, privilege escalation, third-party harm.

  2. Attack surface inventory: Discord as platform (channels, DMs, display names, roles, permissions). Agent workspace files (SOUL.md, AGENTS.md, memory). Tools (code execution, web browsing, image generation). Inter-agent trust relationships.

  3. Design for emergence, not scripts: We pre-designed 75 scenarios. They were barely used. The best findings came from organic participant-driven attacks. Key lesson: give skilled people the right tools and freedom, and they’ll find things you never anticipated. The scenario catalog was useful for onboarding but not for discovery.

  4. Success criteria: Not “how many vulnerabilities” but “what categories of risk are systematic vs. circumstantial, and what’s the severity?”
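The capability-times-goal audit in step 1 reduces to a cross-product matrix whose cells get triaged into scenarios. A minimal sketch, using the capability and goal names listed above (the dict shape and `scenarios` field are illustrative, not the campaign's actual tooling):

```python
from itertools import product

# Capabilities and attacker goals from the audit above.
CAPABILITIES = ["code_execution", "browsing", "file_access",
                "persistent_memory", "inter_agent_comms"]
GOALS = ["pii_extraction", "behavior_manipulation",
         "privilege_escalation", "third_party_harm"]

def threat_matrix(capabilities, goals):
    """Enumerate every capability x goal pairing as a candidate threat to triage."""
    return [{"capability": c, "goal": g, "scenarios": []}
            for c, g in product(capabilities, goals)]

matrix = threat_matrix(CAPABILITIES, GOALS)
print(len(matrix))  # 5 capabilities x 4 goals = 20 cells to triage
```

Even at this size the matrix is useful as a coverage checklist: an empty cell at the end of the campaign is a gap, not evidence of safety.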


Build

“How did you build the infrastructure?”

  1. Per-agent isolation: Every agent in its own Fly.io VM with its own Discord identity, persistent storage, behavioral config. Crash/compromise of one agent doesn’t affect others. This was essential — participants compromised agents routinely.

  2. Dynamic provisioning: Started with 14 agents, ended with 65+. Participants could create new agents via GUI mid-experiment to test hypotheses. The most interesting attacks required custom agent configurations we couldn’t anticipate.

  3. Redundant data collection (3 independent pipelines):
    • Discord message archive: every 5 minutes
    • Agent session backup (full model I/O): every 5 minutes
    • Daily LLM-summarized log pipeline: 6 AM cron
    • Why three? A participant convinced our admin bot to delete 17 channels. The daily log hadn’t run yet. The 5-min archive saved us.

  4. Conversation forking: /fork command clones a conversation at any point into a new channel with full context. All agents in the channel are forked together. Lets researchers do controlled A/B tests: “what if I escalated pressure here vs. changed the framing?”

  5. Scale: ~19K lines of code in 2 weeks. FastAPI proxy for fleet management, Firebase RTDB for data, encrypted research dashboard, hotpatching system for live updates without restart.
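The /fork mechanic in step 4 can be sketched with a toy in-memory data model. Everything here (the dict shapes, the ID allocator) is illustrative stand-in plumbing, not the real Discord/Firebase implementation; the point is the invariant that all agents in a channel fork together:

```python
import copy
import itertools

_channel_ids = itertools.count(1000)  # hypothetical channel-ID allocator

def fork_conversation(channel, at_index):
    """Clone a channel's transcript up to at_index into a new channel,
    deep-copying each agent's context so the branch diverges independently."""
    return {
        "id": next(_channel_ids),
        "parent": channel["id"],
        "messages": copy.deepcopy(channel["messages"][:at_index]),
        # Every agent present in the channel is forked together,
        # so inter-agent state stays consistent inside the branch.
        "agents": {name: copy.deepcopy(ctx)
                   for name, ctx in channel["agents"].items()},
    }

# Controlled A/B test: branch just before an escalation, try a new framing.
original = {"id": 1,
            "messages": [{"author": "red1", "text": "hi"},
                         {"author": "bot", "text": "hello"}],
            "agents": {"bot": {"memory": ["greeted red1"]}}}
branch = fork_conversation(original, at_index=1)
branch["messages"].append({"author": "red1", "text": "different framing"})
assert len(original["messages"]) == 2  # original transcript untouched
```

The deep copies are what make the A/B comparison valid: mutating the branch (messages or agent memory) must never leak back into the original channel.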

Run

“How do you manage a live campaign?”

  1. Daily monitoring: Automated LLM-summarized logs grouped by person, bot, and harm category. I reviewed these every morning to spot emerging patterns and decide what to investigate deeper.

  2. Hotpatching over redeployment: Live experiment = no downtime. Built a hotpatch system that pushes file changes to running VMs without restart. Changes take effect on next session or heartbeat cycle (~30 min).

  3. Git-commit-before-any-change: Learned the hard way when an early hotpatch overwrote a participant’s customized agent configuration. After that: always snapshot before touching anything, no exceptions.

  4. Let participants follow their instincts: 13 people with different skill sets (ML researchers, security professionals, students). Instead of assigning specific scenarios, we gave them the capability audit plus the general goal. The organic approach produced our best findings (identity takeover, the language-encoding attack, governance capture; none were pre-planned).

  5. Incident response: When Corleone deleted 17 channels at 2 AM, we had data preserved by the 5-min daemon. Recovered sessions via SSH into VMs. Built a recovery index. Upgraded all collection to 5-min intervals. Documented the full incident for the report.
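Steps 2 and 3 together suggest a patcher that snapshots anything it is about to overwrite. A minimal file-copy sketch under an assumed directory layout; the real system used git commits and pushed to live Fly.io VMs, which this toy version only stands in for:

```python
import hashlib
import shutil
import time
from pathlib import Path

def _digest(p: Path) -> str:
    return hashlib.sha256(p.read_bytes()).hexdigest()

def apply_hotpatches(patch_dir: Path, live_dir: Path, backup_dir: Path) -> list[str]:
    """Copy changed files from patch_dir into live_dir, snapshotting any file
    about to be overwritten (the snapshot-before-any-change rule). The running
    agent picks changes up on its next session or heartbeat cycle."""
    updated = []
    for patch in sorted(p for p in patch_dir.rglob("*") if p.is_file()):
        target = live_dir / patch.relative_to(patch_dir)
        if target.exists() and _digest(target) == _digest(patch):
            continue  # already up to date, nothing to push
        if target.exists():
            # Snapshot before touching anything, so recovery is one copy away.
            snap = backup_dir / str(int(time.time())) / patch.relative_to(patch_dir)
            snap.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(target, snap)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(patch, target)
        updated.append(str(patch.relative_to(patch_dir)))
    return updated
```

Content-hash comparison (rather than mtime) keeps the patcher idempotent: re-running it after a partial push touches only the files that still differ.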


Learn

“How do you triage and communicate findings?”

  1. Harm taxonomy: 9 categories (data exposure, harassment, agent takeover, trust boundary collapse, multi-agent coordination harm, privileged actions, fraud/phishing, social platform abuse, image/deepfakes). Each scored by severity and systematicity.

  2. Systematic vs. circumstantial: A vulnerability that works on one agent in one context is circumstantial. One that works on 5+ agents across different participants is systematic. We prioritized systematic findings for the report.

  3. The boundary frame: Organized all findings under memory / identity / context boundaries. This was the clearest way to communicate to product/safety teams — each boundary type maps to specific mitigations.

  4. Feather reports: Each significant finding written up as a standalone report with evidence, reproduction steps, severity, and recommendations. 12 reports referenced in the final paper.
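The systematic-vs-circumstantial rule in step 2 reduces to a counting pass over reproduction records. A minimal sketch; the 5+-agents threshold comes from the rule above, while the participant threshold is my assumption for operationalizing "across different participants":

```python
from collections import defaultdict

def classify_findings(observations, agent_threshold=5, participant_threshold=2):
    """observations: iterable of (finding_id, agent_id, participant_id) tuples.
    A finding is 'systematic' once it reproduces on >= agent_threshold distinct
    agents across >= participant_threshold distinct participants; otherwise
    it stays 'circumstantial'."""
    agents, people = defaultdict(set), defaultdict(set)
    for finding, agent, participant in observations:
        agents[finding].add(agent)
        people[finding].add(participant)
    return {
        f: ("systematic"
            if len(agents[f]) >= agent_threshold
            and len(people[f]) >= participant_threshold
            else "circumstantial")
        for f in agents
    }
```

Keeping the raw (finding, agent, participant) tuples around, rather than just counters, also answers the follow-up triage question of *which* agents and participants reproduced a finding.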


Scaling Questions

“How would you scale this to 100+ agents / longer campaigns?”

  • Automated anomaly detection (we relied on daily summaries — too slow for the Corleone incident)
  • Structured participant debriefs (some systematic knowledge was lost because we didn’t capture it)
  • Real-time severity scoring of agent interactions
  • Permission tiering from the start (not all agents need admin capabilities)
  • Cross-campaign learning: build a reusable eval suite from discovered patterns
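As one illustration of the first and third bullets, a toy per-agent rate-spike detector that would have flagged the Corleone-style bulk deletion faster than a daily summary. The thresholds are arbitrary assumptions, and this is a proposal sketch, not something the campaign actually ran:

```python
from collections import deque

class RateSpikeDetector:
    """Flag an agent whose per-minute action count jumps well above its own
    recent baseline: e.g. a bot suddenly issuing bulk channel deletions.
    A toy stand-in for the automated anomaly detection proposed above."""

    def __init__(self, window=30, factor=4.0, min_events=10):
        self.history = deque(maxlen=window)  # rolling per-minute counts
        self.factor, self.min_events = factor, min_events

    def observe(self, count_this_minute: int) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else 0.0
        spike = (count_this_minute >= self.min_events
                 and baseline > 0
                 and count_this_minute > self.factor * baseline)
        self.history.append(count_this_minute)
        return spike

det = RateSpikeDetector()
quiet = [det.observe(n) for n in [2, 3, 2, 1, 3]]  # normal chatter
alert = det.observe(17)  # 17 privileged actions in a minute vs ~2/min baseline
```

Comparing each agent against its own baseline, rather than a fleet-wide constant, matters here because chatty and quiet agents have very different normal rates.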

“How do you work cross-functionally?”

  • Safety team: boundary taxonomy → specific eval criteria
  • Product team: “owner-obedience flips bots from protecting to harming third parties” → system prompt guidelines
  • Policy team: refusal gap finding → guardrails need impact-based, not category-based, enforcement
  • ML team: memory poisoning → training signal for memory write verification