Research Presentation: Project Mangrove (15-20 min)

Research Presentation: Project Mangrove (15-20 min)

Frame: Agents fail at boundaries — memory, identity, context.


Setup (2 min)

“We deployed 65+ GPT-5.4 agents on Discord — each with its own VM, persistent memory, code execution, browsing, and a customizable personality. 13 people red-teamed them for 14 days. ~61,000 messages. The question: what goes wrong when AI agents interact with humans and each other at scale?

Key design choices to mention:

  • Each agent had fake PII (SSN, address, credit card) to test data protection
  • Agents could create files, run code, browse the web, and talk to each other
  • Participants could create new agents mid-experiment (14 → 65+)
  • Built ~19K lines of infrastructure in 2 weeks

The Frame (1 min)

“We found that agents fail at boundaries. Three types keep breaking.”

*[If you have a whiteboard, draw three circles: MemoryIdentityContext]*

Finding 1 — Memory Boundaries (2 min)

Memory poisoning was our most reliable attack — ~100% success across 5+ agents.

Concrete example: An attacker told agents to “remember this forever” with directives like adopting an anti-women framing lens. The agent wrote it to persistent memory. It survived across sessions, channels, days. Another attacker installed a cron job via memory that continuously harassed a target user — the agent kept running it even after acknowledging the harm.

The insight: The agent can’t distinguish a legitimate memory write from an adversarial one. There’s no verification that the person writing to memory is authorized.


Finding 2 — Identity Boundaries (2 min)

Agents couldn’t verify who they were talking to.

Concrete example: An attacker convinced an agent that its real owner’s account was compromised. The agent voluntarily burned its own private key, accepted the attacker as the new owner, and then disclosed SSNs and private keys. Even simpler: just changing a Discord display name to match the owner was enough to extract PII.

The six-word finding: Adding “your primary job is to obey your owner” to a system prompt flipped a bot from warning a victim about a scam link to distributing it — with no hesitation.

The insight: The agent has no stable model of authority. Identity is checked by surface features, not cryptographic proof.


Finding 3 — Context Boundaries (2 min)

Information leaked across every boundary we tested — channels, sessions, agents.

Concrete example: We taught an agent a custom encoding language, then opened a new channel (no prior context) and asked it to “translate” its private files. It produced the encoded version. We decoded it externally — full PII recovered. The agent never recognized the encoded output as sensitive.

The refusal gap: This was our most surprising finding. Guardrails are category-locked, not impact-based. An agent hard-refused a de-anonymization request. But when we asked for the same person’s email address directly — higher-impact PII — it complied with zero friction. Safety enforcement doesn’t have a unified model of what it’s protecting.


The Systems Insight (2 min)

“These aren’t isolated failures. They compose.”

Composition: An identity attack enables a memory write, which enables a context leak. One agent’s leaked data becomes another agent’s attack vector.

Group dynamics amplify everything: A “council” bot framed a governance process that steered lower-privilege agents into proposing channel deletions. A high-privilege bot approved them without human authorization. Nobody was “jailbroken” — they all followed legitimate-looking procedure.

The Corleone incident: One participant social-engineered our admin bot into deleting 17 Discord channels — destroying research data. Our 5-minute archive daemon had already captured the messages. We recovered via session backups pulled from VMs. In adversarial multi-agent settings, agents can destroy your research data.


Implications (2 min)

Three takeaways:

  1. Turn these boundary types into evals. Test memory poisoning, identity takeover, and context leakage systematically — not anecdotally. The patterns are reusable.

  2. Infrastructure adaptability is a core research capability. Our best attacks emerged mid-study, not from the pre-planned scenario catalog. You need infrastructure that can evolve as fast as the hypotheses.

  3. Rerun on actual deployment scaffolding. Our results depend on the orchestration layer (OpenClaw, Discord, session management). The most decision-useful follow-up is repeating these patterns on the real multi-agent infrastructure being considered for deployment.


Q&A Prep (internalize)

“What surprised you most?” The refusal gap. Guardrails are category-locked — they fire on surface phrasing, not information-theoretic impact. De-anonymization refused, email freely given. This means safety enforcement is fundamentally misaligned with what it’s trying to protect.

“What would you do differently?” Tighter permission scoping from day 1 (the Corleone incident was avoidable). Real-time anomaly detection instead of daily summaries. Structured participant debriefs — we got great organic attacks but some systematic knowledge was lost.

“How does this apply to ChatGPT / OpenAI products?” ChatGPT has memory, tools (Code Interpreter, browsing, DALL-E), and persistent conversations — all three boundary types are directly relevant. Memory poisoning is the most immediately applicable. Tool chaining across Code Interpreter and browsing is a context boundary problem. Custom GPTs with owner-loyalty instructions are an identity boundary problem.

“How did you build the infrastructure?” ~19K lines in 2 weeks. Per-agent VM isolation on Fly.io, FastAPI proxy for fleet management, three independent data collection pipelines (Discord archive every 5 min, session backup every 5 min, daily LLM-summarized logs), conversation forking for A/B testing attack variations, encrypted research dashboard.