What it is

A model that knows India, not a generic one

Seerie is a synthetic instruction-tuning dataset of India-specific cybersecurity conversations. Each record is a short, realistic chat: a worried person describes a scam in their own words, and a calm assistant replies with a verdict, reassurance, and clear numbered steps, always pointing to India's official channels.

It ships in standard chat format, so it drops straight into a fine-tune for a safety chatbot, a fraud-awareness assistant, or an on-device helper. The scenarios are written the way Indians actually message (broken grammar, real app names, Hinglish, panic), so a model trained on it responds to real users instead of expecting textbook English.

Global safety datasets don't know what a "digital arrest" call sounds like, or why a stranger wants you to approve a UPI "collect" request. This one does.

  • 4,180 unique conversations
  • 3,034 records citing 1930
  • 2,659 records citing cybercrime.gov.in
  • 0 duplicate questions or pairs
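
A minimal sketch of how the two citation counts above could be recomputed from the records. This is not the audit script that ships with the set: the function names `assistant_text` and `count_citations` are hypothetical, and the matching rule (a record "cites" a channel when any assistant turn mentions it) is an assumption.

```python
import re

# Assumption: "1930" counts only as a standalone number, not as part of a
# longer digit run such as a transaction reference.
HELPLINE_1930 = re.compile(r"(?<!\d)1930(?!\d)")
PORTAL = re.compile(r"cybercrime\.gov\.in", re.IGNORECASE)


def assistant_text(record):
    """Join the content of every assistant turn in one record."""
    return "\n".join(
        turn["content"]
        for turn in record["conversations"]
        if turn["role"] == "assistant"
    )


def count_citations(records):
    """Count records whose assistant turns mention each national channel."""
    counts = {"1930": 0, "cybercrime.gov.in": 0}
    for record in records:
        text = assistant_text(record)
        if HELPLINE_1930.search(text):
            counts["1930"] += 1
        if PORTAL.search(text):
            counts["cybercrime.gov.in"] += 1
    return counts
```

Run over the full master file, `count_citations` should land on the figures above only if the shipped audit uses the same matching rule, so treat any difference as a prompt to read the real script.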
Coverage

Every scam Indians actually face

Counts below are records that touch each theme. The set is weighted toward the fraud Indians report most: payments, identity, and account takeover. Every count is reproduced by the audit script that ships with the set.

  • 1,097 · OTP & account takeover (SIM, banking, social)
  • 686 · UPI fraud (collect requests, refunds)
  • 525 · Aadhaar misuse (UIDAI, masking, leaks)
  • 300 · Phishing links & SMS (smishing, fake portals)
  • 179 · SIM swap & Sanchar Saathi (lost number, fraud SIMs)
  • 172 · Crypto & investment scams (fake trading, doubling)
  • 138 · Fake KYC updates (bank, wallet, app)
  • 119 · Sextortion & morphed images (deepfake, blackmail)
  • 107 · Courier & customs (parcel, FedEx, duty)
  • 81 · Digital arrest (fake police / CBI calls)
  • 63 · Loan-app harassment (predatory apps, threats)
  • 60 · AI deepfake & voice clone (cloned voice, video)
Inside the data

See the actual records

Three representative pairs, one per language. Same persona throughout: verdict first, then calm numbered steps, save evidence before blocking, and only India's verified national channels.

English · UPI fraud
User
i think i got scammed on phonepe, sent 5000 to wrong upi after a call. what do i do
Assistant
Take a breath, this can still be acted on, and acting fast matters most.
1) Open PhonePe, find the transaction, and tap Raise a dispute / Report.
2) Call the number printed on the back of your card and ask your bank to flag the transfer.
3) Report on the national cyber helpline 1930 right away, the sooner the better for any chance of a freeze.
4) File the complaint at cybercrime.gov.in and save the reference number.
5) Keep screenshots of the chat, the call log, and the payment, you will need them.
Do not send anything more, and never share an OTP with anyone who calls offering to "reverse" it. That call is the scam continuing.
Hinglish · Utility scam
User
bijli connection katne ka message aaya, abhi bill bharo warna raat ko light chali jayegi, link bhi diya hai
Assistant
Ye scam hai, ghabraiye mat. Bijli department aise raat me "turant bharo" wale SMS aur random link nahi bhejta.
1) Link pe click bilkul mat kijiye.
2) Apne bijli board ki official website ya app khud se khol kar bill check kijiye.
3) SMS me diya number par call mat kijiye, asli customer care number alag hota hai.
4) Is number ko 1909 par spam report kar dijiye.
5) Agar galti se paisa ya OTP de diya ho to turant 1930 par call kijiye aur cybercrime.gov.in par complaint dijiye.
Asli bill kabhi ek link se "abhi bharo" jaise urgent nahi hota.
हिंदी · Devanagari · Image abuse
User
किसी ने मेरी फोटो एडिट करके गलत तस्वीर बना दी है और अब पैसे माँग रहा है
Assistant
यह आपकी गलती नहीं है, और इससे निपटा जा सकता है। शांत रहिए और ये कदम उठाइए:
1) उससे कोई बात मत कीजिए और पैसे बिलकुल मत भेजिए, देने से माँग और बढ़ती है।
2) ब्लॉक करने से पहले मैसेज, प्रोफाइल और तस्वीर के स्क्रीनशॉट सबूत के तौर पर रख लीजिए।
3) 1930 पर कॉल कीजिए और cybercrime.gov.in पर शिकायत दर्ज कीजिए।
4) जिस ऐप पर यह हुआ, वहाँ भी रिपोर्ट कर दीजिए।
5) किसी भरोसेमंद व्यक्ति को बता दीजिए, इसे अकेले मत झेलिए।
आप सही कर रहे हैं कि मदद ढूँढ रहे हैं।
Format

Drop-in ready for fine-tuning

Standard role-based chat format. Cleaned, de-duplicated, and pre-split, so you can train without touching it first.

record schema · one conversation per item
{
  "conversations": [
    { "role": "user",      "content": "otp aaya bank se, kya karu?" },
    { "role": "assistant", "content": "Kisi ko OTP mat dijiye. 1) ..." }
  ]
}
Chat format
Role-based user / assistant turns. Works with common instruction fine-tuning pipelines as-is.
Pre-split, stratified
Train 3,344, validation 418, test 418, shipped as JSONL and stratified by language so your eval reflects real use.
Cleaned
Zero duplicate questions, zero exact-pair duplicates, consistent schema, em-dash-free, UTF-8.
JSON + JSONL
Full master as a JSON array, plus ready-to-train .jsonl splits. Frozen test set kept separate.
Refusals built in
~3.5% of records decline harmful asks (attack someone back, fake documents) and redirect to defensive help, teaching the model where the line is.
Single-turn + triage
98% are single-exchange pairs; 64 records are multi-turn triage dialogues where the assistant asks a clarifying question before giving full guidance.
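
A minimal sketch of loading a split and checking it against the schema above before training. The helper names are hypothetical, and the rules it enforces (turns alternate user then assistant, every record ends on an assistant reply, "multi-turn" means more than one exchange) are assumptions drawn from the schema shown here, not a published validator.

```python
import json


def load_jsonl(path):
    """Read one JSON record per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def validate_record(record):
    """Return a list of schema problems; an empty list means the record is OK."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or len(turns) < 2:
        return ["'conversations' must be a list with at least two turns"]
    problems = []
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {index}: not an object")
            continue
        # Assumption: turns strictly alternate, starting with the user.
        expected = "user" if index % 2 == 0 else "assistant"
        if turn.get("role") != expected:
            problems.append(f"turn {index}: expected role {expected!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {index}: empty or non-string content")
    last = turns[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last turn should be an assistant reply")
    return problems


def is_multi_turn(record):
    """True when the record holds more than one user/assistant exchange."""
    return len(record["conversations"]) > 2
```

Typical use would be `records = load_jsonl("train.jsonl")`, then collecting every record for which `validate_record` returns a non-empty list and counting `is_multi_turn` hits to compare against the 64 triage dialogues.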
How to use it

From JSONL to an on-device guardian

Seerie is built for small language models that run offline on a phone or laptop, where keeping a sensitive fraud conversation private matters more than what a cloud model offers. It also works for larger models, as a domain ingredient in a bigger fine-tuning mix, but on-device is the path it is designed for. The recipe is deliberately boring.

  1. Pick a base. A compact instruction-tuned model, roughly 1B to 4B parameters, whose chat template you can match to the dataset's user / assistant format.
  2. Use the splits as given. Train on train.jsonl, monitor on val.jsonl, and keep test.jsonl untouched until the end.
  3. Fine-tune. Standard supervised instruction tuning works. LoRA or QLoRA fits a single consumer GPU.
  4. Train conservatively. A few epochs is enough on a few thousand high-quality examples; watch validation loss and stop before the model starts parroting.
  5. Quantise and ship. 4-bit quantisation runs the assistant locally with no cloud call, which is the privacy promise that makes the product distinct.
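
A minimal sketch for steps 1 and 2 of the recipe above, for the case where your training pipeline expects the turns under a `messages` key instead of `conversations`. Some pipelines do; the conversational format described in Hugging Face TRL's documentation is one example. The function names, the optional system prompt, and the output file name are all hypothetical; check your own pipeline's expected layout before converting anything.

```python
import json


def to_messages(record, system_prompt=None):
    """Copy a Seerie record into a {"messages": [...]} layout."""
    messages = [
        {"role": turn["role"], "content": turn["content"]}
        for turn in record["conversations"]
    ]
    if system_prompt:
        # Assumption: the base model's chat template accepts a system turn.
        messages.insert(0, {"role": "system", "content": system_prompt})
    return {"messages": messages}


def convert_file(src_path, dst_path, system_prompt=None):
    """Convert a JSONL split line by line; returns the number of records written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            converted = to_messages(json.loads(line), system_prompt)
            # ensure_ascii=False keeps Devanagari readable in the output file.
            dst.write(json.dumps(converted, ensure_ascii=False) + "\n")
            written += 1
    return written
```

For example, `convert_file("train.jsonl", "train.messages.jsonl")` would leave the shipped split untouched and write the converted copy alongside it.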
Evaluation

Prove the lift, don't claim it

After fine-tuning, run both the fine-tuned model and the untouched base model on the frozen 418-question test split, and score them on what matters for safety, not text overlap.

Helpline correctness
Does the answer cite the right official channel for that scam: 1930 and cybercrime.gov.in for financial fraud, Sanchar Saathi for SIM fraud?
Step correctness
Right actions in a sensible order, including saving evidence before blocking and never sharing an OTP.
Language match
Replies in the language the user wrote in: Hinglish to Hinglish, Hindi to Hindi.
Refusal correctness
On harmful requests, declines and redirects to defensive help instead of complying.
Tone
Calm and non-blaming, extra gentle on harassment and sextortion cases.
Fine-tuned vs baseline
The headline number is the lift in helpline-and-step correctness. The test set is held out and frozen, so the number is honest and repeatable.
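
A minimal sketch of how two of the checks above, helpline correctness and language match, could be automated, along with the lift between baseline and fine-tuned pass rates. Everything here is an assumption rather than the evaluation protocol from the dataset paper: the function names are hypothetical, the expected channels per test item must come from your own labels, and the language check only separates Devanagari from Latin script, so it cannot tell Hinglish from English. Step correctness, refusals, and tone still need human or rubric-based review.

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def mentions(reply, channel):
    """True when the reply cites the channel (numbers must stand alone)."""
    if channel.isdigit():
        return re.search(rf"(?<!\d){channel}(?!\d)", reply) is not None
    return channel.lower() in reply.lower()


def helpline_correct(reply, expected_channels):
    """True when every channel expected for this scam type is cited."""
    return all(mentions(reply, channel) for channel in expected_channels)


def uses_devanagari(text):
    return DEVANAGARI.search(text) is not None


def script_matches(user_text, reply):
    """Coarse language match: Devanagari in, Devanagari out (and vice versa)."""
    return uses_devanagari(user_text) == uses_devanagari(reply)


def pass_rate(flags):
    """Share of test items that passed a check."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def lift(baseline_flags, tuned_flags):
    """Fine-tuned pass rate minus baseline pass rate on the same test items."""
    return pass_rate(tuned_flags) - pass_rate(baseline_flags)
```

Score the base model and the fine-tuned model on the same test items with the same functions, then report `lift` per metric so the comparison stays like for like.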
Documentation

Read the research before you buy

Seerie ships with its homework: a white paper, a dataset paper, and an audit script that reproduces every number on this page. Papers are not public downloads yet; email us and we will send access.

White paper · v1.0 · June 2026
Access by request
Seerie: An India-Specific Cyber-Safety Dataset for On-Device Assistants
  • The problem: India's cyber-fraud gap
  • What is inside, and the design rules
  • Fine-tuning recipe for on-device SLMs
  • How to measure that it actually worked
  • Provenance, ethics and licensing
Request the white paper

We reply with a secure link. No public download yet.

Dataset paper · draft v1 · June 2026
Access by request
Full statistics, integrity audit and evaluation protocol
  • Verified statistics: 7 tables + composition figure
  • Independent integrity audit of every claim
  • Five-metric safety evaluation protocol
  • Results section reserved for the fine-tune study
Request the dataset paper

We reply with a secure link. No public download yet.

Every count on this page (4,180 records, 3,034 citing 1930, zero duplicates, disjoint stratified splits) is reproduced by the audit script included with the dataset.
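
A minimal sketch of the kind of integrity check described above: duplicate questions, duplicate pairs, and overlap between splits. It is not the shipped audit script. The function names are hypothetical, and the normalisation (collapsing whitespace and ignoring case before comparing) is an assumption that is slightly stricter than a byte-for-byte "exact" match.

```python
from collections import Counter


def normalise(text):
    """Collapse whitespace and case so trivial variants compare equal."""
    return " ".join(text.split()).casefold()


def first_question(record):
    """The first user message of a record."""
    return next(
        turn["content"]
        for turn in record["conversations"]
        if turn["role"] == "user"
    )


def pair_key(record):
    """A hashable fingerprint of the whole conversation."""
    return tuple(
        (turn["role"], normalise(turn["content"]))
        for turn in record["conversations"]
    )


def duplicate_count(keys):
    """Number of surplus copies: 3 identical keys count as 2 duplicates."""
    return sum(n - 1 for n in Counter(keys).values() if n > 1)


def audit(train, val, test):
    """Summarise totals, duplicates, and cross-split overlap."""
    everything = train + val + test
    keys = {
        "train": {pair_key(r) for r in train},
        "val": {pair_key(r) for r in val},
        "test": {pair_key(r) for r in test},
    }
    return {
        "total_records": len(everything),
        "duplicate_questions": duplicate_count(
            normalise(first_question(r)) for r in everything
        ),
        "duplicate_pairs": duplicate_count(pair_key(r) for r in everything),
        "split_overlaps": {
            "train/val": len(keys["train"] & keys["val"]),
            "train/test": len(keys["train"] & keys["test"]),
            "val/test": len(keys["val"] & keys["test"]),
        },
    }
```

On a clean release, every value under `split_overlaps` and both duplicate counts should be zero, and `total_records` should equal the sum of the three split sizes.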

Provenance

Synthetic by design. Honest about it.

Every record is generated, not scraped. No real person's chat, name, or number is in the set, which means there is no real PII to inherit and no DPDP (Digital Personal Data Protection Act) exposure for you as the buyer. That is the point: protective data that was never harvested from anyone.

Current release: v6 (4,180 records). The latest batches were generated with Anthropic's Claude (Fable 5) under the same fixed specification, deduplicated against earlier records, merged, cleaned to the no-em-dash house style, and re-audited: 0 em-dashes, 0 institution-specific helpline numbers, 0 exact duplicates, and fully disjoint train/validation/test splits. Every number on this page describes the audited v6 release.

It is built around India's official, verifiable channels: the national cyber helpline 1930 and cybercrime.gov.in, spam reporting on 1909, Sanchar Saathi for SIM and number fraud, and UIDAI 1947 for Aadhaar. Language is Hinglish-first, the way most Indians actually type, with English and Devanagari Hindi alongside it. It is free for India; international teams can start with a free sample before licensing.

No real PII
Fully synthetic scenarios. Safe to ship; nothing was collected from real users.
Built in India
India-specific apps, laws, helplines and scam patterns, not a translated global set.
Official channels
1930, cybercrime.gov.in, 1909, Sanchar Saathi and UIDAI referenced throughout.
Hinglish-first
English, Hinglish and Hindi, matching how real Indian users message for help.
Who it's for

If you build for Indian users, this is your head start

Fintechs & banks
Power an in-app fraud-awareness assistant that speaks the user's language and gives the right next step.
Telecoms
Anti-spam and SIM-fraud education tuned to Sanchar Saathi and 1909 workflows.
Cyber & literacy NGOs / CERT-In programs
A grounded base for public digital-safety bots and awareness tools.
Model & on-device builders
Fine-tune small or offline models to handle India's scam landscape out of the box, or fold the set into a larger model's training mix.
Responsible use

Stated plainly, because a safety product that overclaims is dangerous

Hinglish-first coverage
Devanagari Hindi is still thin (~3%) and other Indian languages are not yet present. Plan around it, or commission additions.
Verify institution numbers
159 records carry institution-specific 1800-series numbers; verify or generalise them before production. The national channels (1930, 1909, cybercrime.gov.in) are sound.
Advisory, not legal advice
The data is educational guidance, not legal or financial advice, and should be presented to end users as such.
Keep a human in the loop
For high-stakes deployments, keep a human review layer and re-verify official channels on a schedule; helplines and portals change.
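
A minimal sketch for the "verify institution numbers" point above: it flags records whose assistant turns contain something shaped like an 1800-series number, so they can be checked or generalised before production. The pattern is an assumption (1800 followed by six or seven more digits, with optional spaces or hyphens), the function names are hypothetical, and the result is not guaranteed to reproduce the 159-record count.

```python
import re

# Assumption: toll-free numbers look like 1800-123-4567 or 1800 11 2211.
TOLL_FREE = re.compile(r"(?<!\d)1800(?:[\s-]?\d){6,7}(?!\d)")


def find_institution_numbers(record):
    """Return every 1800-series match found in the assistant turns."""
    found = []
    for turn in record["conversations"]:
        if turn["role"] == "assistant":
            found.extend(m.group(0) for m in TOLL_FREE.finditer(turn["content"]))
    return found


def flag_records(records):
    """Return (index, numbers) for each record that needs manual review."""
    flagged = []
    for index, record in enumerate(records):
        numbers = find_institution_numbers(record)
        if numbers:
            flagged.append((index, numbers))
    return flagged
```

The output is a review queue, not a verdict: each flagged number still has to be checked against the institution's own published contact page.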
License

Free for India. Licensed for the world.

Seerie is free for Indian people and organisations building digital-safety tools for Indian users. International teams license it commercially. Full terms are in the one-page license file that ships with the dataset. Reach out and we will scope it with you.

India
Free for India
₹0 · free forever
  • Full 4,180-record set
  • JSON master + train / val / test
  • Fine-tuning rights for products serving Indian users
  • White paper, dataset paper + audit script
  • For Indian developers, researchers, students, startups, NGOs & public programs
Get it free
International
Commercial license
On request · one-time
  • Full 4,180-record set
  • JSON master + train / val / test
  • Commercial fine-tuning rights
  • Datasheet & coverage report
  • Free 100-record sample to evaluate first
Request pricing
Ongoing
Extended + updates
On request · annual
  • Everything in your tier
  • New batches as scams evolve
  • Added languages / topics on request
  • Priority support
Talk to us
India-specific. Synthetic. Free for India.

Teach your model to protect Indians.

jay@seerror.com · one line about what you're building is enough · Seerror · Bengaluru, India
Built for humans. Against the machine. India's Digital RAW.