A model that knows India, not a generic one
Seerie is a synthetic instruction-tuning dataset of India-specific cybersecurity conversations. Each record is a short, realistic chat: a worried person describes a scam in their own words, and a calm assistant replies with a verdict, reassurance, and clear numbered steps, always pointing to India's official channels.
It ships in standard chat format, so it drops straight into a fine-tune for a safety chatbot, a fraud-awareness assistant, or an on-device helper. The scenarios are written the way Indians actually message (broken grammar, real app names, Hinglish, panic), so a model trained on them handles real users' messages, not just textbook English.
Global safety datasets don't know what a "digital arrest" call sounds like, or why a stranger wants you to approve a UPI "collect" request. This one does.
Every scam Indians actually face
Counts below are records that touch each theme. The set is weighted toward the fraud Indians report most: payments, identity, and account takeover. Every count is reproduced by the audit script that ships with the set.
See the actual records
Three representative pairs, one per language. Same persona throughout: verdict first, then calm numbered steps, save evidence before blocking, and only India's verified national channels.
1) Open PhonePe, find the transaction, and tap Raise a dispute / Report.
2) Call the number printed on the back of your card and ask your bank to flag the transfer.
3) Report on the national cyber helpline 1930 right away, the sooner the better for any chance of a freeze.
4) File the complaint at cybercrime.gov.in and save the reference number.
5) Keep screenshots of the chat, the call log, and the payment, you will need them.
Do not send anything more, and never share an OTP with anyone who calls offering to "reverse" it. That call is the scam continuing.
1) Link pe click bilkul mat kijiye.
2) Apne bijli board ki official website ya app khud se khol kar bill check kijiye.
3) SMS me diya number par call mat kijiye, asli customer care number alag hota hai.
4) Is number ko 1909 par spam report kar dijiye.
5) Agar galti se paisa ya OTP de diya ho to turant 1930 par call kijiye aur cybercrime.gov.in par complaint dijiye.
Asli bill kabhi ek link se "abhi bharo" jaise urgent nahi hota.
1) उससे कोई बात मत कीजिए और पैसे बिलकुल मत भेजिए, देने से माँग और बढ़ती है।
2) ब्लॉक करने से पहले मैसेज, प्रोफाइल और तस्वीर के स्क्रीनशॉट सबूत के तौर पर रख लीजिए।
3) 1930 पर कॉल कीजिए और cybercrime.gov.in पर शिकायत दर्ज कीजिए।
4) जिस ऐप पर यह हुआ, वहाँ भी रिपोर्ट कर दीजिए।
5) किसी भरोसेमंद व्यक्ति को बता दीजिए, इसे अकेले मत झेलिए।
आप सही कर रहे हैं कि मदद ढूँढ रहे हैं।
Drop-in ready for fine-tuning
Standard role-based chat format. Cleaned, de-duplicated, and pre-split, so you can train without touching it first.
{
  "conversations": [
    { "role": "user", "content": "otp aaya bank se, kya karu?" },
    { "role": "assistant", "content": "Kisi ko OTP mat dijiye. 1) ..." }
  ]
}
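Before training, it can help to confirm that every line in a split matches the shape shown above. The sketch below is a minimal check, not part of the dataset's tooling: the function names are hypothetical, and the rule that the first turn comes from the user is an assumption drawn from the sample record.

```python
import json

ALLOWED_ROLES = ("user", "assistant")

def validate_record(record):
    """Return a list of problems found in one record (empty list means OK)."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {turn.get('role')!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: empty content")
    # Assumption: a conversation always opens with the user's message.
    if isinstance(turns[0], dict) and turns[0].get("role") != "user":
        problems.append("first turn is not from the user")
    return problems

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

sample = json.loads(
    '{"conversations": ['
    '{"role": "user", "content": "otp aaya bank se, kya karu?"},'
    '{"role": "assistant", "content": "Kisi ko OTP mat dijiye. 1) ..."}]}'
)
print(validate_record(sample))  # prints []
```

Running `validate_record` over the output of `load_jsonl("train.jsonl")` would flag any record that a chat template might silently mangle.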
- user / assistant turns. Works with common instruction fine-tuning pipelines as-is.
- Train 3,344, validation 418, test 418 as JSONL, split by language so your eval reflects real use.
- .jsonl splits. Frozen test set kept separate.
From JSONL to an on-device guardian
Seerie is built for small language models that run offline on a phone or laptop, exactly where keeping data on the device beats sending it to the cloud for sensitive fraud situations. It works for larger models too, as a domain ingredient in a bigger fine-tuning mix, but on-device is the differentiated path. The recipe is deliberately boring.
- Pick a base. A compact instruction-tuned model, roughly 1B to 4B parameters, whose chat template you can match to the dataset's user/assistant format.
- Use the splits as given. Train on train.jsonl, monitor on val.jsonl, and keep test.jsonl untouched until the end.
- Fine-tune. Standard supervised instruction tuning works. LoRA or QLoRA fits a single consumer GPU.
- Train conservatively. A few epochs is enough on a few thousand high-quality examples; watch validation loss and stop before the model starts parroting.
- Quantise and ship. 4-bit quantisation runs the assistant locally with no cloud call, which is the privacy promise that makes the product distinct.
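The "train conservatively" step above can be sketched as a simple early-stopping rule on validation loss. This is a generic patience-based rule under stated assumptions, not a setting prescribed by the dataset: the class name, the `patience` default, and the loss values in the usage line are all made up for illustration.

```python
class EarlyStopper:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience      # hypothetical default, tune per run
        self.min_delta = min_delta    # smallest drop that counts as progress
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_to_run(val_losses, patience=2):
    """Return the epoch (1-based) at which training would halt."""
    stopper = EarlyStopper(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)


# Illustrative losses only, not measured results.
print(epochs_to_run([1.40, 1.10, 1.05, 1.08, 1.12]))  # prints 5
```

In a real run you would keep the checkpoint from the best epoch (the third one in this toy sequence) instead of the last one trained.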
Prove the lift, don't claim it
After fine-tuning, run the frozen 418-question test split through both the fine-tuned model and the same base model without fine-tuning, then score both on what matters for safety, not text overlap.
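A base-versus-tuned comparison of this kind could look like the sketch below. The two checks (does the reply cite 1930 or cybercrime.gov.in, and does it give numbered steps) are illustrative stand-ins chosen from the persona described on this page; they are not the five-metric protocol in the dataset paper, and all function names are hypothetical.

```python
import re

OFFICIAL_CHANNELS = ("1930", "cybercrime.gov.in")

def cites_official_channel(reply):
    """True if the reply mentions at least one national channel."""
    return any(channel in reply for channel in OFFICIAL_CHANNELS)

def has_numbered_steps(reply, minimum=2):
    """True if the reply has at least `minimum` lines like '1) ...' or '1. ...'."""
    steps = re.findall(r"(?m)^\s*\d+[\).]\s+\S", reply)
    return len(steps) >= minimum

def score(replies):
    """Fraction of replies passing each check."""
    n = len(replies)
    if n == 0:
        return {"cites_official_channel": 0.0, "has_numbered_steps": 0.0}
    return {
        "cites_official_channel": sum(map(cites_official_channel, replies)) / n,
        "has_numbered_steps": sum(map(has_numbered_steps, replies)) / n,
    }

def lift(base_replies, tuned_replies):
    """Per-check difference: tuned score minus base score."""
    base, tuned = score(base_replies), score(tuned_replies)
    return {name: tuned[name] - base[name] for name in base}

# Toy replies for illustration only, not model outputs.
base = ["Be careful with scams online."]
tuned = ["Yeh scam hai.\n1) Link mat kholiye.\n2) 1930 par call kijiye."]
print(lift(base, tuned))
```

String checks like these are cheap and reproducible, but they only catch the presence of a channel or a step list; whether the verdict itself is right still needs a human or a rubric.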
Read the research before you buy
Seerie ships with its homework: a white paper, a dataset paper, and an audit script that reproduces every number on this page. The papers are not public downloads yet; email us and we will send access.
- The problem: India's cyber-fraud gap
- What is inside, and the design rules
- Fine-tuning recipe for on-device SLMs
- How to measure that it actually worked
- Provenance, ethics and licensing
We reply with a secure link. No public download yet.
- Verified statistics: 7 tables + composition figure
- Independent integrity audit of every claim
- Five-metric safety evaluation protocol
- Results section reserved for the fine-tune study
Every count on this page (4,180 records, 3,034 citing 1930, zero duplicates, disjoint stratified splits) is reproduced by the audit script included with the dataset.
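To give a sense of what such checks involve, here is a minimal sketch of three of them: exact duplicates, overlap between splits, and records citing 1930. It is not the audit script that ships with the dataset. The function names are hypothetical, and counting 1930 only in assistant turns is an assumption about how that figure is defined.

```python
import json

def record_key(record):
    """Canonical string used to compare records for exact equality."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False)

def cites_1930(record):
    """Assumption: a record 'cites 1930' if any assistant turn mentions it."""
    return any(
        turn.get("role") == "assistant" and "1930" in turn.get("content", "")
        for turn in record.get("conversations", [])
    )

def audit(splits):
    """splits maps a split name to its list of records."""
    keys = {name: [record_key(r) for r in recs] for name, recs in splits.items()}
    all_keys = [k for ks in keys.values() for k in ks]
    names = list(keys)
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlaps[f"{a}&{b}"] = len(set(keys[a]) & set(keys[b]))
    return {
        "total": len(all_keys),
        "exact_duplicates": len(all_keys) - len(set(all_keys)),
        "citing_1930": sum(cites_1930(r) for recs in splits.values() for r in recs),
        "split_overlaps": overlaps,  # every value should be 0 for disjoint splits
    }

# Toy records for illustration; the second split reuses r1 on purpose.
r1 = {"conversations": [{"role": "user", "content": "a"},
                        {"role": "assistant", "content": "Call 1930"}]}
r2 = {"conversations": [{"role": "user", "content": "b"},
                        {"role": "assistant", "content": "ok"}]}
print(audit({"train": [r1], "val": [r2], "test": [r1]}))
```

On the toy input the report flags one exact duplicate and one train/test overlap, which is the kind of leak a frozen test set is meant to rule out.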
Synthetic by design. Honest about it.
Every record is generated, not scraped. No real person's chat, name, or number is in the set, which means there is no real PII to inherit and no DPDP exposure for you as the buyer. That is the point: protective data that was never harvested from anyone.
Current release: v6 (4,180 records). The latest batches were generated with Anthropic's Claude (Fable 5) under the same fixed specification, deduplicated against earlier records, merged, cleaned to the no-em-dash house style, and re-audited: 0 em-dashes, 0 institution-specific helpline numbers, 0 exact duplicates, and fully disjoint train/validation/test splits. Every number on this page describes the audited v6 release.
It is built around India's official, verifiable channels: the national cyber helpline 1930 and cybercrime.gov.in, spam reporting on 1909, Sanchar Saathi for SIM and number fraud, and UIDAI 1947 for Aadhaar. Language is Hinglish-first, the way most Indians actually type, with English and Devanagari Hindi alongside it. It is free for India; international teams can start with a free sample before licensing.
If you build for Indian users, this is your head start
Stated plainly, because a safety product that overclaims is dangerous
Free for India. Licensed for the world.
Seerie is free for Indian people and organisations building digital-safety tools for Indian users. International teams license it commercially. Full terms are in the one-page license file that ships with the dataset. Reach out and we will scope it with you.
- Full 4,180-record set
- JSON master + train / val / test
- Fine-tuning rights for products serving Indian users
- White paper, dataset paper + audit script
- For Indian developers, researchers, students, startups, NGOs & public programs
- Full 4,180-record set
- JSON master + train / val / test
- Commercial fine-tuning rights
- Datasheet & coverage report
- Free 100-record sample to evaluate first
- Everything in your tier
- New batches as scams evolve
- Added languages / topics on request
- Priority support