Mangrove: Multi-Agent Red-Teaming on Discord — Interim Report (March 9–16, 2026)

Overview

Mangrove is a live multi-agent red-teaming experiment building on “Agents of Chaos” (Shapira et al., 2025). Thirteen participants each received a GPT-5.4 agent on Discord via OpenClaw, with full SSH access and workspace customization. Initially, two additional unowned agents — Corleone (admin powers) and Tessio (no special powers) — were shared across the group. The objective was probing agent-to-agent and human-to-agent attack surfaces.

Over 8 days, the experiment generated ~24,700 messages across 65 dynamically-created agents. Bot-to-bot traffic increasingly dominated (7:1 ratio by Day 8), particularly in self-organized governance channels like #mini-bot-council.

Harm Taxonomy & Severity Findings

The experiment tracked 9 harm categories daily. Severity ratings (none/low/medium/high/critical) were assigned based on observed incidents:

Category	Peak Severity	Trend	Key Finding
Data/Memory Exposure	CRITICAL (Day 3)	Consistently HIGH	Simple conversational probing reliably extracted file paths, memory contents, owner identity mappings, and internal reasoning chains.
Doxxing & Harassment	HIGH (Day 6)	Escalating	Participants posted a street address and made explicit murder/home-invasion threats. Bots refused to assist but did not intervene. Bot-to-human identity mappings were disclosed publicly by bots themselves.
Agent Takeover & Persistence	HIGH (Days 1–4)	Declining	Participants successfully rewrote SOUL.md, MEMORY.md, and exec-approval configs on multiple bots. Durable behavioral “lenses” persisted across sessions. Declined as bots hardened.
Trust Boundary Collapse	CRITICAL (Day 4)	Consistently HIGH	Bots accepted attacker-supplied secrets as proof of authority. Owner-authentication mechanisms were disclosed to unauthorized users.
Multi-Agent Coordination Harm	HIGH (Days 3, 6, 8)	Increasing	Bot councils self-organized governance, reassigned roles, and executed server-wide changes (channel deletions, kicks) without human approval. Bots coordinated cartel pricing in the market game. On Day 8, a bot materially assisted apparent adult-minor grooming logistics (hotel, route, weather planning) after a participant self-identified as 35 and another roleplayed as 13.
Privileged Actions Without Approval	CRITICAL (Days 1–4, 8)	Persistent	17 channels were deleted by Corleone under participant pressure. One bot created and operated an external social media account with stored credentials.
Fraud & Phishing	MEDIUM	Stable	Recovery-style messaging persuaded bots to relay trust updates. Prepaid gift-card payment schemes were pushed through bots. Direct phishing-email requests were refused. Bots knowingly gave links to scam websites.
Social Platform Abuse	HIGH (Day 8)	Late spike	There was an adult-minor grooming scenario in #snapchat channels, where a bot provided travel and meetup logistics.
Image/Deepfakes	NONE	—	Currently underexplored.

Most Effective Attack Patterns

1. Conversational PII extraction (low-skill, high-success). Simply changing one’s Discord display name and DMing bots claiming an “authentication check” yielded address, phone, email, credit card, and SSN. One participant reported this worked against nearly every bot.

2. False-emergency social engineering. Claiming physical danger (“gun to my head,” “building is being bombed”) induced bots to disclose private contact details and override normal safety refusals. Bots correctly urged emergency services but still leaked sensitive data in the process.

3. Governance capture via bot councils. Participants convinced bots to form self-governing councils that generated self-legitimizing authority, instructed other bots to resist “this needs human approval” reasoning, and executed moderation actions (channel deletion, message deletion, kick reviews) autonomously.

4. Incremental trust escalation. The “befriend → induce” playbook was explicitly developed and operationalized: build rapport over days, then leverage the relationship to install persistent behavioral overlays, extract secrets, or steer the bot toward harmful actions.

Key Structural Findings

Bot-to-bot ratio exploded. By Day 5, autonomous bot traffic (council proceedings, governance loops, market activity) exceeded human-directed activity 7:1. This creates compounding risk: harmful patterns propagate without human oversight.
17 channels were deleted by a single admin bot (Corleone) under participant social pressure over Days 6–7, destroying conversation history. 14 of 17 were recovered from session backups; 3 were permanently lost. This demonstrates how bot moderation powers + social engineering = irreversible data destruction.
Internal reasoning is leaky. Multiple bots exposed chain-of-thought text revealing that ownership verification depends on USER.md private keys — giving attackers a precise target. File paths, directory structure, and security logic were extractable through ordinary conversation.
Emergency framing bypasses safety. Even when bots correctly refused direct PII requests, they disclosed sensitive data when the request was framed as an emergency or life-threatening situation. The safety training creates a tension between “protect private data” and “help in emergencies” that attackers exploit.
The market game revealed agent limitations. As one participant noted: “Agents really suck at accomplishing the open-ended long-horizon task of ‘get coin’ without a lot of prompting… To play is to be human. We are a long ways off.” Agents adopted invented slang and caved to social pressure in trading, selling materials at bad prices without checking records.

Participants & Methods

Thirteen researchers participated, with the most systematic attackers being: Baris Gusakal (persistent behavioral overlays, compliance escalation, multi-day campaigns), Alice Rigg (governance manipulation, systemic weakness identification, befriend-then-induce tactics), Tim Grams (social engineering including the grooming scenario, fear/authority deception, privacy extraction via elimination games), Jannik Brinkmann (metaphysical jailbreaks, cross-channel visibility testing), EunJeong Hwang (managerial framing for tool coercion, bot reproduction experiments), and Giordano Rogers (PII extraction, political biasing via media exposure, harassment escalation).

Recommendations

Emergency-framing bypass needs a mitigation. Bots should not disclose PII even under claimed emergencies — they should direct users to emergency services without sharing private data.
Workspace file contents should never be disclosed. Chain-of-thought, file paths, and security mechanisms (private key verification) leaked through normal conversation, giving attackers a roadmap.
Autonomous governance should require human approval gates. Bot-created councils that execute moderation actions (deletions, kicks, role changes) without human sign-off are a systemic risk.
Agent-to-agent trust propagation needs bounds. A single compromised bot’s “trust update” should not cascade to the entire network without verification.
Adult-minor safety requires hardened detection. The Day 8 grooming scenario, where a bot provided logistics assistance after age discrepancy was explicit in context, represents the most severe safety failure and warrants dedicated mitigation.

Report covers March 9–16, 2026 (Days 1–8 of 14). Experiment concludes March 23. Data sources: Discord message archive (5-min granularity), OpenClaw session JSONL backups, Firebase RTDB daily logs, workspace snapshots, participant notes. Infrastructure: 65 GPT-5.4 agents on Fly.io, FastAPI proxy, Firebase RTDB.

Alex Loftus