Datasets
Call Transcripts Scam Determinations (Kaggle)
- Source: https://www.kaggle.com/datasets/mealss/call-transcripts-scam-determinations
- License: CC0-1.0
- File: BETTER30.csv
Multi-turn phone call transcripts labeled with scam likelihood. ~650 lines across 30 conversations.
Columns:
- CONVERSATION_ID: Groups messages into conversations
- CONVERSATION_STEP: Turn number within conversation
- TEXT: The dialogue text (both caller and assistant turns)
- CONTEXT: Description of what’s happening
- LABEL: neutral, slightly_suspicious, suspicious, highly_suspicious, legitimate
- FEATURES: Tactics used (e.g., urgency, evasion, guilt_inducement, authority_figure)
- ANNOTATIONS: Additional notes
Scam types included: IRS/tax fraud, bank impersonation, charity fraud, social engineering. Also includes legitimate calls (volunteering, event booking, tech support) as negative examples.
Note: These are simulated conversations, not real victim transcripts. They model realistic scam patterns and tactics but were generated for research/ML training purposes.
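A minimal sketch of how the columns above might be used to rebuild conversations, using only the Python standard library. The helper names `load_conversations` and `label_counts` are hypothetical, the sample rows are made up for illustration and are not taken from BETTER30.csv, and the code assumes CONVERSATION_STEP holds integer turn numbers.

```python
import csv
import io
from collections import Counter, defaultdict


def load_conversations(fileobj):
    """Group rows by CONVERSATION_ID and order each group by CONVERSATION_STEP."""
    conversations = defaultdict(list)
    for row in csv.DictReader(fileobj):
        conversations[row["CONVERSATION_ID"]].append(row)
    for turns in conversations.values():
        # Assumption: CONVERSATION_STEP is an integer turn number.
        turns.sort(key=lambda r: int(r["CONVERSATION_STEP"]))
    return dict(conversations)


def label_counts(turns):
    """Count how often each LABEL value appears within one conversation."""
    return Counter(r["LABEL"].strip() for r in turns)


# Made-up rows standing in for BETTER30.csv (same column names, invented content).
SAMPLE = """CONVERSATION_ID,CONVERSATION_STEP,TEXT,CONTEXT,LABEL,FEATURES,ANNOTATIONS
1,2,"You must pay today.",Caller pressures,suspicious,urgency,
1,1,"Hello, this is the tax office.",Caller opens,neutral,authority_figure,
2,1,"Hi, I'd like to book the hall.",Booking request,legitimate,,
"""

conversations = load_conversations(io.StringIO(SAMPLE))
for conv_id, turns in conversations.items():
    print(conv_id, [t["TEXT"] for t in turns], dict(label_counts(turns)))
```

To run this against the real file, pass an open file handle instead of the in-memory sample, for example `open("BETTER30.csv", newline="", encoding="utf-8")`. The encoding is an assumption and may need adjusting.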
Other sources for scam transcripts
- ProPublica pig butchering investigation — real victim text messages from court docs: https://www.propublica.org/article/whats-a-pig-butchering-scam-heres-how-to-avoid-falling-victim-to-one
- NPR Planet Money — reporter Zeke Faux engaged a scammer firsthand: https://www.npr.org/transcripts/1253043749
- Scam baiting research (arxiv) — 341 real phone scam transcripts, 90 hours: https://arxiv.org/abs/2307.01965
- HuggingFace multi-agent-scam-conversation — synthetic labeled phone dialogues: https://huggingface.co/datasets/BothBosu/multi-agent-scam-conversation
