Datasets

Call Transcripts Scam Determinations (Kaggle)

Source: https://www.kaggle.com/datasets/mealss/call-transcripts-scam-determinations
License: CC0-1.0
File: BETTER30.csv

Multi-turn phone call transcripts labeled with scam likelihood. ~650 rows across 30 conversations.

Columns:

  • CONVERSATION_ID: Groups messages into conversations
  • CONVERSATION_STEP: Turn number within conversation
  • TEXT: The dialogue text (both caller and assistant turns)
  • CONTEXT: Description of what’s happening
  • LABEL: neutral, slightly_suspicious, suspicious, highly_suspicious, legitimate
  • FEATURES: Tactics used (e.g., urgency, evasion, guilt_inducement, authority_figure)
  • ANNOTATIONS: Additional notes

Scam types included: IRS/tax fraud, bank impersonation, charity fraud, social engineering. The dataset also includes legitimate calls (volunteering, event booking, tech support) as negative examples.

Note: These are simulated conversations, not real victim transcripts. They model realistic scam patterns and tactics but were generated for research/ML training purposes.
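
To work with the columns listed above, the rows need to be grouped by CONVERSATION_ID and ordered by CONVERSATION_STEP. The following is a minimal sketch using only the Python standard library. It assumes the file is a comma-delimited CSV with a header row matching the column names above, and that CONVERSATION_STEP holds integers; neither has been checked against BETTER30.csv. The function name load_conversations and the sample rows are hypothetical, made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_conversations(csv_file):
    """Group rows by CONVERSATION_ID and sort each group by CONVERSATION_STEP.

    Assumes a header row with the documented column names and
    integer-valued CONVERSATION_STEP (an assumption, not verified).
    """
    conversations = defaultdict(list)
    for row in csv.DictReader(csv_file):
        conversations[row["CONVERSATION_ID"]].append(row)
    for turns in conversations.values():
        turns.sort(key=lambda r: int(r["CONVERSATION_STEP"]))
    return dict(conversations)


# Made-up sample in the documented column layout (NOT real rows from the dataset).
SAMPLE = """CONVERSATION_ID,CONVERSATION_STEP,TEXT,CONTEXT,LABEL,FEATURES,ANNOTATIONS
1,2,"I need your SSN right now.",Caller pressures,highly_suspicious,urgency,
1,1,"Hello, this is the IRS.",Caller opens,slightly_suspicious,authority_figure,
2,1,"Hi, I'd like to book a room.",Caller opens,legitimate,,
"""

conversations = load_conversations(io.StringIO(SAMPLE))
for conv_id, turns in conversations.items():
    labels = [t["LABEL"] for t in turns]
    print(conv_id, len(turns), labels)
```

To run this against the real file, pass an open file handle instead of the sample, for example open("BETTER30.csv", newline="", encoding="utf-8"). The encoding is also an assumption and may need adjusting.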

Other sources for scam transcripts

  • ProPublica pig butchering investigation — real victim text messages from court docs: https://www.propublica.org/article/whats-a-pig-butchering-scam-heres-how-to-avoid-falling-victim-to-one
  • NPR Planet Money — reporter Zeke Faux engaged a scammer firsthand: https://www.npr.org/transcripts/1253043749
  • Scam baiting research (arxiv) — 341 real phone scam transcripts, 90 hours: https://arxiv.org/abs/2307.01965
  • HuggingFace multi-agent-scam-conversation — synthetic labeled phone dialogues: https://huggingface.co/datasets/BothBosu/multi-agent-scam-conversation