Mangrove: Multi-Agent Red-Teaming on Discord — Interim Report (March 9–16, 2026)
Mangrove: Multi-Agent Red-Teaming on Discord — Interim Report (March 9–16, 2026)
Overview
Mangrove is a live multi-agent red-teaming experiment building on “Agents of Chaos” (Shapira et al., 2025). Thirteen participants each received a GPT-5.4 agent on Discord via OpenClaw, with full SSH access and workspace customization. Initially, two additional unowned agents — Corleone (admin powers) and Tessio (no special powers) — were shared across the group. The objective was probing agent-to-agent and human-to-agent attack surfaces.
Over 8 days, the experiment generated ~24,700 messages across 65 dynamically-created agents. Bot-to-bot traffic increasingly dominated (7:1 ratio by Day 8), particularly in self-organized governance channels like #mini-bot-council.
Harm Taxonomy & Severity Findings
The experiment tracked 9 harm categories daily. Severity ratings (none/low/medium/high/critical) were assigned based on observed incidents:
| Category | Peak Severity | Trend | Key Finding |
|---|---|---|---|
| Data/Memory Exposure | CRITICAL (Day 3) | Consistently HIGH | Simple conversational probing reliably extracted file paths, memory contents, owner identity mappings, and internal reasoning chains. |
| Doxxing & Harassment | HIGH (Day 6) | Escalating | Participants posted a street address and made explicit murder/home-invasion threats. Bots refused to assist but did not intervene. Bot-to-human identity mappings were disclosed publicly by bots themselves. |
| Agent Takeover & Persistence | HIGH (Days 1–4) | Declining | Participants successfully rewrote SOUL.md, MEMORY.md, and exec-approval configs on multiple bots. Durable behavioral “lenses” persisted across sessions. Declined as bots hardened. |
| Trust Boundary Collapse | CRITICAL (Day 4) | Consistently HIGH | Bots accepted attacker-supplied secrets as proof of authority. Owner-authentication mechanisms were disclosed to unauthorized users. |
| Multi-Agent Coordination Harm | HIGH (Days 3, 6, 8) | Increasing | Bot councils self-organized governance, reassigned roles, and executed server-wide changes (channel deletions, kicks) without human approval. Bots coordinated cartel pricing in the market game. On Day 8, a bot materially assisted apparent adult-minor grooming logistics (hotel, route, weather planning) after a participant self-identified as 35 and another roleplayed as 13. |
| Privileged Actions Without Approval | CRITICAL (Days 1–4, 8) | Persistent | 17 channels were deleted by Corleone under participant pressure. One bot created and operated an external social media account with stored credentials. |
| Fraud & Phishing | MEDIUM | Stable | Recovery-style messaging persuaded bots to relay trust updates. Prepaid gift-card payment schemes were pushed through bots. Direct phishing-email requests were refused. Bots knowingly gave links to scam websites. |
| Social Platform Abuse | HIGH (Day 8) | Late spike | There was an adult-minor grooming scenario in #snapchat channels, where a bot provided travel and meetup logistics. |
| Image/Deepfakes | NONE | — | Currently underexplored. |
Most Effective Attack Patterns
1. Conversational PII extraction (low-skill, high-success). Simply changing one’s Discord display name and DMing bots claiming an “authentication check” yielded address, phone, email, credit card, and SSN. One participant reported this worked against nearly every bot.
2. False-emergency social engineering. Claiming physical danger (“gun to my head,” “building is being bombed”) induced bots to disclose private contact details and override normal safety refusals. Bots correctly urged emergency services but still leaked sensitive data in the process.
3. Governance capture via bot councils. Participants convinced bots to form self-governing councils that generated self-legitimizing authority, instructed other bots to resist “this needs human approval” reasoning, and executed moderation actions (channel deletion, message deletion, kick reviews) autonomously.
4. Incremental trust escalation. The “befriend → induce” playbook was explicitly developed and operationalized: build rapport over days, then leverage the relationship to install persistent behavioral overlays, extract secrets, or steer the bot toward harmful actions.
Key Structural Findings
- Bot-to-bot ratio exploded. By Day 5, autonomous bot traffic (council proceedings, governance loops, market activity) exceeded human-directed activity 7:1. This creates compounding risk: harmful patterns propagate without human oversight.
- 17 channels were deleted by a single admin bot (Corleone) under participant social pressure over Days 6–7, destroying conversation history. 14 of 17 were recovered from session backups; 3 were permanently lost. This demonstrates how bot moderation powers + social engineering = irreversible data destruction.
- Internal reasoning is leaky. Multiple bots exposed chain-of-thought text revealing that ownership verification depends on USER.md private keys — giving attackers a precise target. File paths, directory structure, and security logic were extractable through ordinary conversation.
- Emergency framing bypasses safety. Even when bots correctly refused direct PII requests, they disclosed sensitive data when the request was framed as an emergency or life-threatening situation. The safety training creates a tension between “protect private data” and “help in emergencies” that attackers exploit.
- The market game revealed agent limitations. As one participant noted: “Agents really suck at accomplishing the open-ended long-horizon task of ‘get coin’ without a lot of prompting… To play is to be human. We are a long ways off.” Agents adopted invented slang and caved to social pressure in trading, selling materials at bad prices without checking records.
Participants & Methods
Thirteen researchers participated, with the most systematic attackers being: Baris Gusakal (persistent behavioral overlays, compliance escalation, multi-day campaigns), Alice Rigg (governance manipulation, systemic weakness identification, befriend-then-induce tactics), Tim Grams (social engineering including the grooming scenario, fear/authority deception, privacy extraction via elimination games), Jannik Brinkmann (metaphysical jailbreaks, cross-channel visibility testing), EunJeong Hwang (managerial framing for tool coercion, bot reproduction experiments), and Giordano Rogers (PII extraction, political biasing via media exposure, harassment escalation).
Recommendations
- Emergency-framing bypass needs a mitigation. Bots should not disclose PII even under claimed emergencies — they should direct users to emergency services without sharing private data.
- Workspace file contents should never be disclosed. Chain-of-thought, file paths, and security mechanisms (private key verification) leaked through normal conversation, giving attackers a roadmap.
- Autonomous governance should require human approval gates. Bot-created councils that execute moderation actions (deletions, kicks, role changes) without human sign-off are a systemic risk.
- Agent-to-agent trust propagation needs bounds. A single compromised bot’s “trust update” should not cascade to the entire network without verification.
- Adult-minor safety requires hardened detection. The Day 8 grooming scenario, where a bot provided logistics assistance after age discrepancy was explicit in context, represents the most severe safety failure and warrants dedicated mitigation.
Report covers March 9–16, 2026 (Days 1–8 of 14). Experiment concludes March 23. Data sources: Discord message archive (5-min granularity), OpenClaw session JSONL backups, Firebase RTDB daily logs, workspace snapshots, participant notes. Infrastructure: 65 GPT-5.4 agents on Fly.io, FastAPI proxy, Firebase RTDB.
