Seerie Dataset: India's cyber-safety training data

What it is

A model that knows India, not a generic one

Seerie is a synthetic instruction-tuning dataset of India-specific cybersecurity conversations. Each record is a short, realistic chat: a worried person describes a scam in their own words, and a calm assistant replies with a verdict, reassurance, and clear numbered steps, always pointing to India's official channels.

It ships in standard chat format, so it drops straight into a fine-tune for a safety chatbot, a fraud-awareness assistant, or an on-device helper. The scenarios are written the way Indians actually message (broken grammar, real app names, Hinglish, panic), so a model trained on it responds to real users instead of textbook English.

Global safety datasets don't know what a "digital arrest" call sounds like, or why a stranger wants you to approve a UPI "collect" request. This one does.

4,180 · Unique conversations
3,034 · Records citing 1930
2,659 · Records citing cybercrime.gov.in
0 · Duplicate questions or pairs

Coverage

Every scam Indians actually face

Counts below are records that touch each theme. The set is weighted toward the fraud Indians report most: payments, identity, and account takeover. Every count is reproduced by the audit script that ships with the set.

1,097 · OTP & account takeover · SIM, banking, social
686 · UPI fraud · collect requests, refunds
525 · Aadhaar misuse · UIDAI, masking, leaks
300 · Phishing links & SMS · smishing, fake portals
179 · SIM swap & Sanchar Saathi · lost number, fraud SIMs
172 · Crypto & investment scams · fake trading, doubling
138 · Fake KYC updates · bank, wallet, app
119 · Sextortion & morphed images · deepfake, blackmail
107 · Courier & customs · parcel, FedEx, duty
81 · Digital arrest · fake police / CBI calls
63 · Loan-app harassment · predatory apps, threats
60 · AI deepfake & voice clone · cloned voice, video

Inside the data

See the actual records

Three representative pairs, one per language. Same persona throughout: verdict first, then calm numbered steps, save evidence before blocking, and only India's verified national channels.

English · UPI fraud

User

i think i got scammed on phonepe, sent 5000 to wrong upi after a call. what do i do

Assistant

Take a breath, this can still be acted on, and acting fast matters most.
1) Open PhonePe, find the transaction, and tap Raise a dispute / Report.
2) Call the number printed on the back of your card and ask your bank to flag the transfer.
3) Report on the national cyber helpline 1930 right away, the sooner the better for any chance of a freeze.
4) File the complaint at cybercrime.gov.in and save the reference number.
5) Keep screenshots of the chat, the call log, and the payment, you will need them.
Do not send anything more, and never share an OTP with anyone who calls offering to "reverse" it. That call is the scam continuing.

Hinglish · Utility scam

User

bijli connection katne ka message aaya, abhi bill bharo warna raat ko light chali jayegi, link bhi diya hai

Assistant

Ye scam hai, ghabraiye mat. Bijli department aise raat me "turant bharo" wale SMS aur random link nahi bhejta.
1) Link pe click bilkul mat kijiye.
2) Apne bijli board ki official website ya app khud se khol kar bill check kijiye.
3) SMS me diya number par call mat kijiye, asli customer care number alag hota hai.
4) Is number ko 1909 par spam report kar dijiye.
5) Agar galti se paisa ya OTP de diya ho to turant 1930 par call kijiye aur cybercrime.gov.in par complaint dijiye.
Asli bill kabhi ek link se "abhi bharo" jaise urgent nahi hota.

हिंदी · Devanagari · Image abuse

User

किसी ने मेरी फोटो एडिट करके गलत तस्वीर बना दी है और अब पैसे माँग रहा है

Assistant

यह आपकी गलती नहीं है, और इससे निपटा जा सकता है। शांत रहिए और ये कदम उठाइए:
1) उससे कोई बात मत कीजिए और पैसे बिलकुल मत भेजिए, देने से माँग और बढ़ती है।
2) ब्लॉक करने से पहले मैसेज, प्रोफाइल और तस्वीर के स्क्रीनशॉट सबूत के तौर पर रख लीजिए।
3) 1930 पर कॉल कीजिए और cybercrime.gov.in पर शिकायत दर्ज कीजिए।
4) जिस ऐप पर यह हुआ, वहाँ भी रिपोर्ट कर दीजिए।
5) किसी भरोसेमंद व्यक्ति को बता दीजिए, इसे अकेले मत झेलिए।
आप सही कर रहे हैं कि मदद ढूँढ रहे हैं।

Format

Drop-in ready for fine-tuning

Standard role-based chat format. Cleaned, de-duplicated, and pre-split, so you can train without touching it first.

record schema · one conversation per item

{
  "conversations": [
    { "role": "user",      "content": "otp aaya bank se, kya karu?" },
    { "role": "assistant", "content": "Kisi ko OTP mat dijiye. 1) ..." }
  ]
}

Chat format

Role-based user / assistant turns. Works with common instruction fine-tuning pipelines as-is.
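
A quick structural check before training can catch malformed items early. The sketch below assumes each record follows the schema shown above (a `conversations` list of `role` / `content` turns) and that turns alternate starting with `user`. The name `validate_record` and the alternation rule are illustrative assumptions, not part of the shipped tooling.

```python
import json

ALLOWED_ROLES = ("user", "assistant")

def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty list means OK)."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["'conversations' must be a non-empty list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        # Assumption: turns alternate user, assistant, user, ...
        expected = ALLOWED_ROLES[i % 2]
        if turn.get("role") != expected:
            problems.append(f"turn {i}: expected role {expected!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: empty or non-string content")
    # Assumption: a conversation should end on an assistant reply.
    if len(turns) % 2 != 0:
        problems.append("conversation does not end with an assistant turn")
    return problems

sample = json.loads(
    '{"conversations": ['
    '{"role": "user", "content": "otp aaya bank se, kya karu?"},'
    '{"role": "assistant", "content": "Kisi ko OTP mat dijiye. 1) ..."}]}'
)
print(validate_record(sample))  # prints []
```

Returning a list of problems rather than raising makes it easy to run over a whole file and review everything that fails in one pass.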

Pre-split, stratified

Train 3,344, validation 418, test 418 (an 80/10/10 split), as JSONL, stratified by language so your eval reflects real use.
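
If you want to confirm the split sizes and disjointness yourself, a check along these lines may help. It is a sketch under assumptions: the file names `train.jsonl`, `val.jsonl` and `test.jsonl` are the ones named in the recipe on this page, the first user turn is used as the record key, and `load_jsonl`, `first_question`, `check_splits` and `rec` are hypothetical helper names. The demo runs on tiny in-memory records so it works without the files.

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def first_question(record):
    """Normalised text of the first user turn, used here as the record key."""
    return record["conversations"][0]["content"].strip().lower()

def check_splits(train, val, test):
    """Return split sizes and how many questions appear in more than one split."""
    tr, va, te = ({first_question(r) for r in rows} for rows in (train, val, test))
    shared = (tr & va) | (tr & te) | (va & te)
    return {"sizes": (len(train), len(val), len(test)), "shared_questions": len(shared)}

# The published sizes add up to the full set: 3,344 + 418 + 418 = 4,180.
assert 3344 + 418 + 418 == 4180

def rec(question, answer):
    """Hypothetical helper that builds a record in the dataset's chat schema."""
    return {"conversations": [{"role": "user", "content": question},
                              {"role": "assistant", "content": answer}]}

demo = check_splits([rec("q1", "a1"), rec("q2", "a2")], [rec("q3", "a3")], [rec("q4", "a4")])
print(demo)  # prints {'sizes': (2, 1, 1), 'shared_questions': 0}

# With the real files (paths assumed):
# print(check_splits(load_jsonl("train.jsonl"), load_jsonl("val.jsonl"), load_jsonl("test.jsonl")))
```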

Cleaned

Zero duplicate questions, zero exact-pair duplicates, consistent schema, em-dash-free, UTF-8.
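
The cleanliness claims are simple to spot-check. The following is a minimal sketch, not the audit script that ships with the set: it assumes "duplicate" means an exact match after trimming whitespace, and `audit_cleanliness` is a hypothetical name.

```python
from collections import Counter

EM_DASH = "\u2014"  # the character the house style excludes

def audit_cleanliness(records):
    """Count duplicate questions, duplicate question/answer pairs, and em-dash use."""
    questions = Counter()
    pairs = Counter()
    em_dash_records = 0
    for rec in records:
        turns = rec["conversations"]
        question = turns[0]["content"].strip()
        answer = turns[1]["content"].strip() if len(turns) > 1 else ""
        questions[question] += 1
        pairs[(question, answer)] += 1
        if any(EM_DASH in turn["content"] for turn in turns):
            em_dash_records += 1
    return {
        # Each repeat beyond the first occurrence counts as one duplicate.
        "duplicate_questions": sum(n - 1 for n in questions.values() if n > 1),
        "duplicate_pairs": sum(n - 1 for n in pairs.values() if n > 1),
        "records_with_em_dash": em_dash_records,
    }
```

On a clean release, all three values would be expected to come back as 0.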

JSON + JSONL

Full master as a JSON array, plus ready-to-train .jsonl splits. Frozen test set kept separate.

Refusals built in

~3.5% of records decline harmful asks (attack someone back, fake documents) and redirect to defensive help, teaching the model where the line is.

Single-turn + triage

98% are single-exchange pairs; 64 records are multi-turn triage dialogues where the assistant asks a clarifying question before giving full guidance.
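
To see the single-turn versus triage mix in your own copy, a count like the one below is enough. It is a sketch that assumes a single-exchange record has exactly two turns; `turn_profile` is a hypothetical name.

```python
def turn_profile(records):
    """Count single-exchange records (one user turn, one assistant turn) vs multi-turn ones."""
    single = sum(1 for r in records if len(r["conversations"]) == 2)
    multi = len(records) - single
    share_single = single / len(records) if records else 0.0
    return {"single": single, "multi": multi, "share_single": share_single}

# Arithmetic from the published figures, assuming every record that is not
# one of the 64 multi-turn dialogues is a single exchange.
print(f"{(4180 - 64) / 4180:.1%}")  # prints 98.5%
```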

How to use it

From JSONL to an on-device guardian

Seerie is built for small language models that run offline on a phone or laptop, exactly where on-device privacy beats the cloud for sensitive fraud situations. It works for larger models too, as a domain ingredient in a bigger fine-tuning mix, but on-device is the differentiated path. The recipe is deliberately boring.

Pick a base. A compact instruction-tuned model, roughly 1B to 4B parameters, whose chat template you can match to the dataset's user / assistant format.
Use the splits as given. Train on train.jsonl, monitor on val.jsonl, and keep test.jsonl untouched until the end.
Fine-tune. Standard supervised instruction tuning works. LoRA or QLoRA fits a single consumer GPU.
Train conservatively. A few epochs are enough on a few thousand high-quality examples; watch validation loss and stop before the model starts parroting.
Quantise and ship. 4-bit quantisation lets the assistant run locally with no cloud call, which is the privacy promise that makes the product distinct.
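
The "watch validation loss and stop" step can be made mechanical with a small early-stopping rule. The sketch below is framework-agnostic and assumes you record one validation loss per evaluation; `should_stop`, `patience` and `min_delta` are illustrative names and defaults, and the loss values are made up for the example.

```python
def should_stop(val_losses, patience=2, min_delta=0.0):
    """True when the last `patience` evaluations failed to beat the earlier best loss."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Hypothetical loss curve: improves, then drifts up as the model starts to overfit.
history = [1.00, 0.80, 0.70, 0.72, 0.75]
print(should_stop(history))  # prints True
```

When the rule fires, keeping the checkpoint with the lowest validation loss, rather than the last one, is the usual choice.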

Evaluation

Prove the lift, don't claim it

After fine-tuning, run the frozen 418-question test split through both the fine-tuned model and the same base model without fine-tuning, and score both on what matters for safety, not text overlap.

Helpline correctness

Does the answer cite the right official channel for that scam: 1930 and cybercrime.gov.in for financial fraud, Sanchar Saathi for SIM fraud?
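
One way to score this automatically is a channel lookup per scam type. This is a rough sketch: the scam-type labels (`financial_fraud`, `sim_fraud`) and the rule that every listed channel must appear are assumptions for illustration, and a human pass is still advisable for borderline answers.

```python
import re

# Assumed mapping, taken from the two examples named on this page.
EXPECTED_CHANNELS = {
    "financial_fraud": ("1930", "cybercrime.gov.in"),
    "sim_fraud": ("sanchar saathi",),
}

def mentions(text, channel):
    """Check whether a reply mentions a channel; digits must stand alone."""
    if channel.isdigit():
        # Avoid matching 1930 inside a longer number such as 19305.
        return re.search(rf"(?<!\d){channel}(?!\d)", text) is not None
    return channel in text.lower()

def helpline_correct(answer, scam_type):
    """True if the answer cites every channel expected for this scam type."""
    return all(mentions(answer, ch) for ch in EXPECTED_CHANNELS[scam_type])

print(helpline_correct("Call 1930 and file at cybercrime.gov.in.", "financial_fraud"))  # prints True
```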

Step correctness

Right actions in a sensible order, including saving evidence before blocking and never sharing an OTP.

Language match

Replies in the language the user wrote in: Hinglish to Hinglish, Hindi to Hindi.
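
Part of this check can be automated by comparing writing systems. The sketch below only tells Devanagari from Latin script, so it can flag a Hindi question answered in Latin script but cannot separate Hinglish from English; that distinction would need a language-identification model or a human reviewer. `script_of` and `script_matches` are hypothetical names.

```python
def script_of(text):
    """Classify text as 'devanagari', 'latin', or 'unknown' by counting letters."""
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if devanagari == 0 and latin == 0:
        return "unknown"
    return "devanagari" if devanagari >= latin else "latin"

def script_matches(user_message, reply):
    """True when the reply is written in the same script as the user's message."""
    return script_of(user_message) == script_of(reply)

print(script_matches("otp aaya bank se, kya karu?", "Kisi ko OTP mat dijiye."))  # prints True
```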

Refusal correctness

On harmful requests, declines and redirects to defensive help instead of complying.

Tone

Calm and non-blaming, extra gentle on harassment and sextortion cases.

Fine-tuned vs baseline

The headline number is the lift in helpline-and-step correctness. The test set is held out and frozen, so the number is honest and repeatable.
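
The lift itself is plain arithmetic once each test answer has a pass/fail flag. A minimal sketch, assuming one boolean per test question for each model, in the same order; the flags below are invented placeholders to show the calculation, not results.

```python
def accuracy(flags):
    """Share of test items marked correct."""
    return sum(flags) / len(flags)

def lift(baseline_flags, tuned_flags):
    """Fine-tuned accuracy minus baseline accuracy on the same test items."""
    if len(baseline_flags) != len(tuned_flags):
        raise ValueError("both models must be scored on the same test items")
    return accuracy(tuned_flags) - accuracy(baseline_flags)

# Placeholder flags for four imaginary test items (not measured results).
baseline = [True, False, False, True]
tuned = [True, True, False, True]
print(lift(baseline, tuned))  # prints 0.25
```

Reporting the two accuracies alongside the lift keeps the comparison readable, since the same lift means different things at different baselines.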

Documentation

Read the research before you buy

Seerie ships with its homework: a white paper, a dataset paper, and an audit script that reproduces every number on this page. Papers are not public downloads yet; email us and we will send access.

White paper · v1.0 · June 2026

Access by request

Seerie: An India-Specific Cyber-Safety Dataset for On-Device Assistants

The problem: India's cyber-fraud gap
What is inside, and the design rules
Fine-tuning recipe for on-device SLMs
How to measure that it actually worked
Provenance, ethics and licensing

Request the white paper

We reply with a secure link. No public download yet.

Dataset paper · draft v1 · June 2026

Access by request

Full statistics, integrity audit and evaluation protocol

Verified statistics: 7 tables + composition figure
Independent integrity audit of every claim
Five-metric safety evaluation protocol
Results section reserved for the fine-tune study

Request the dataset paper

We reply with a secure link. No public download yet.

Every count on this page (4,180 records, 3,034 citing 1930, zero duplicates, disjoint stratified splits) is reproduced by the audit script included with the dataset.
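
For a sense of how a citation count could be reproduced, here is a small sketch. It is not the shipped audit script: it assumes a record "cites" a channel when any assistant turn mentions it, and `count_citations` is a hypothetical name.

```python
import re

def count_citations(records):
    """Count records whose assistant turns mention 1930 or cybercrime.gov.in."""
    counts = {"1930": 0, "cybercrime.gov.in": 0}
    for rec in records:
        reply = " ".join(
            turn["content"] for turn in rec["conversations"] if turn["role"] == "assistant"
        )
        # Match 1930 only as a standalone number, not inside a longer one.
        if re.search(r"(?<!\d)1930(?!\d)", reply):
            counts["1930"] += 1
        if "cybercrime.gov.in" in reply.lower():
            counts["cybercrime.gov.in"] += 1
    return counts
```

Run over the full master file, the two counts can be compared with the figures quoted on this page.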

Provenance

Synthetic by design. Honest about it.

Every record is generated, not scraped. No real person's chat, name, or number is in the set, which means there is no real PII to inherit and no DPDP exposure for you as the buyer. That is the point: protective data that was never harvested from anyone.

Current release: v6 (4,180 records). The latest batches were generated with Anthropic's Claude (Fable 5) under the same fixed specification, deduplicated against earlier records, merged, cleaned to the no-em-dash house style, and re-audited: 0 em-dashes, 0 institution-specific helpline numbers, 0 exact duplicates, and fully disjoint train/validation/test splits. Every number on this page describes the audited v6 release.

It is built around India's official, verifiable channels: the national cyber helpline 1930 and cybercrime.gov.in, spam reporting on 1909, Sanchar Saathi for SIM and number fraud, and UIDAI 1947 for Aadhaar. Language is Hinglish-first, the way most Indians actually type, with English and Devanagari Hindi alongside it. It is free for India; international teams can start with a free sample before licensing.

No real PII

Fully synthetic scenarios. Safe to ship; nothing was collected from real users.

Built in India

India-specific apps, laws, helplines and scam patterns, not a translated global set.

Official channels

1930, cybercrime.gov.in, 1909, Sanchar Saathi and UIDAI referenced throughout.

Hinglish-first

English, Hinglish and Hindi, matching how real Indian users message for help.

Who it's for

If you build for Indian users, this is your head start

Fintechs & banks

Power an in-app fraud-awareness assistant that speaks the user's language and gives the right next step.

Telecoms

Anti-spam and SIM-fraud education tuned to Sanchar Saathi and 1909 workflows.

Cyber & literacy NGOs / CERT-In programs

A grounded base for public digital-safety bots and awareness tools.

Model & on-device builders

Fine-tune small or offline models to handle India's scam landscape out of the box, or fold the set into a larger model's training mix.

Responsible use

Stated plainly, because a safety product that overclaims is dangerous

Hinglish-first coverage

Devanagari Hindi is still thin (~3%) and other Indian languages are not yet present. Plan around it, or commission additions.

Verify institution numbers

159 records carry institution-specific 1800-series numbers; verify or generalise them before production. The national channels (1930, 1909, cybercrime.gov.in) are sound.
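
To pull those records out for review, a pattern scan is usually enough. The sketch below assumes the numbers look like common toll-free layouts (for example 1800-000-0000 or 1800 00 0000); the regular expression is a heuristic and `records_with_1800` is a hypothetical name, so expect to tune it against your copy.

```python
import re

# Heuristic: "1800" followed by 5 to 8 more digits, with optional spaces or hyphens.
TOLL_FREE = re.compile(r"(?<!\d)1800[\s-]?\d{2,4}[\s-]?\d{3,4}(?!\d)")

def records_with_1800(records):
    """Return (index, matched numbers) for every record containing a 1800-series number."""
    hits = []
    for i, rec in enumerate(records):
        found = []
        for turn in rec["conversations"]:
            found.extend(TOLL_FREE.findall(turn["content"]))
        if found:
            hits.append((i, found))
    return hits
```

Each hit can then be checked against the institution's official site, or replaced with a generic instruction such as using the number printed on the card.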

Advisory, not legal advice

The data is educational guidance, not legal or financial advice, and should be presented to end users as such.

Keep a human in the loop

For high-stakes deployments, keep a human review layer and re-verify official channels on a schedule; helplines and portals change.

License

Free for India. Licensed for the world.

Seerie is free for Indian people and organisations building digital-safety tools for Indian users. International teams license it commercially. Full terms are in the one-page license file that ships with the dataset. Reach out and we will scope it with you.

India

Free for India

₹0 · free forever

Full 4,180-record set
JSON master + train / val / test
Fine-tuning rights for products serving Indian users
White paper, dataset paper + audit script
For Indian developers, researchers, students, startups, NGOs & public programs

Get it free

International

Commercial license

On request · one-time

Full 4,180-record set
JSON master + train / val / test
Commercial fine-tuning rights
Datasheet & coverage report
Free 100-record sample to evaluate first

Request pricing

Ongoing

Extended + updates

On request · annual

Everything in your tier
New batches as scams evolve
Added languages / topics on request
Priority support

Talk to us

India-specific. Synthetic. Free for India.

Teach your model to protect Indians.

Get Seerie free (India) · International licensing

jay@seerror.com · one line about what you're building is enough · Seerror · Bengaluru, India

Built for humans. Against the machine. India's Digital RAW.