Why We Built Dataset.ET on Telegram: A Lesson in Community-First Data Collection
When we started Dataset.ET, we ran straight into a problem that doesn't show up in NLP papers: how do you actually get speakers of 80+ Ethiopian languages to sit down and contribute training data?
The naive answer is "build a website and they will come." The honest answer is they won't. We know — we considered it.
So we did something that felt counterintuitive at the time: we built our contribution platform inside Telegram, on top of a bot. No app to download. No account to create. No dashboard to learn. That single decision is the main reason Dataset.ET exists today instead of being a half-finished side project.
The Problem We're Actually Solving
Before Dataset.ET, building AI for Ethiopian languages was a brick wall. No speech recognition. No reasonable machine translation. No conversational models that didn't sound like a foreign tourist with a phrasebook. The reason is mundane: there isn't enough public training data, and the data that exists is scattered across academic archives, news sites, and individual researchers' hard drives.
You can't fine-tune a language model on a vibe. You need millions of clean sentences, ideally across registers — news, conversation, poetry, technical writing — and you need them at a quality bar that's actually usable.
The traditional way to build that corpus is to hire linguists, build internal tooling, run quality workflows, and spend a year or two. For a low-resource language with a small budget, that math doesn't work. We needed a path that scaled with the community instead of with our payroll.
Why Not a Web App?
We sketched the web app version first. Login, contribution dashboard, gamification, leaderboards, validation queue. We had wireframes. Here's roughly how that story plays out for most volunteer-driven platforms:
A few dozen excited people sign up in week one. Most of them never come back, because contributing requires finding the site again, remembering a password, navigating a UI they don't recognize, and feeling like they're still on the clock. The drop-off between "wanting to help" and "actually contributing" is brutal — and every step you add makes it worse.
For a single-use volunteer task, the marginal cost of asking someone to install or sign up is almost always higher than the marginal value of the contribution. So they don't.
The Telegram Insight
Two facts changed our minds:
- Ethiopia's Telegram community is enormous. Millions of daily users across age groups, urban and rural. It's already where conversation happens.
- Telegram has a real bot platform. Not a chat widget, but a full API with custom keyboards, inline buttons, file uploads, and a chat thread that persists between visits. You can build something that feels like an app, without the app.
That combination changed the question. Instead of asking "how do we convince people to come to our platform," we asked "how do we put a contribution flow inside the platform they're already on?"
The friction collapses:
- Click a t.me/dataset_et_bot link, you're in.
- Send a message, you've contributed.
- Close the chat, you're done.
- Notifications, history, sharing, and group dynamics all come for free.
That's it. That's the whole product.
How It Works
The contribution flow is intentionally boring:
    User opens the bot
        ↓
    Bot: "Help preserve Ethiopian languages. Pick one:"
         [Amharic] [Afaan Oromoo] [Somali] [Tigrinya] ...
        ↓
    User picks a language
        ↓
    Bot: "Send a sentence in {language}"
        ↓
    User types or pastes a sentence
        ↓
    Bot: "✓ Recorded. Send another or /done"

No forms. No accounts. No "save your progress." The state lives in the chat, same as every other Telegram conversation the user is already having.
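To make the flow concrete, here is a minimal sketch of it as a plain state machine, with no Telegram library involved. The `Chat` class, the `step` function, the four-item language list, and the `/done` reply text are all assumptions made for illustration; the real bot's wording and structure may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical subset of languages, taken from the flow above.
LANGUAGES = ("Amharic", "Afaan Oromoo", "Somali", "Tigrinya")


@dataclass
class Chat:
    """Per-chat state: the chosen language and the sentences sent so far."""
    language: Optional[str] = None
    sentences: List[str] = field(default_factory=list)


def step(chat: Chat, message: str) -> str:
    """Advance the conversation by one user message and return the bot's reply."""
    text = message.strip()

    # /start, or anything that is not a language while none is chosen,
    # resets the choice and shows the language menu.
    if text == "/start" or (chat.language is None and text not in LANGUAGES):
        chat.language = None
        return "Help preserve Ethiopian languages. Pick one: " + ", ".join(LANGUAGES)

    # A valid language name while none is chosen selects it.
    if chat.language is None:
        chat.language = text
        return f"Send a sentence in {chat.language}"

    # Placeholder closing message; the post does not show the real one.
    if text == "/done":
        return f"Thanks! You sent {len(chat.sentences)} sentence(s)."

    # Anything else is treated as a contributed sentence.
    chat.sentences.append(text)
    return "✓ Recorded. Send another or /done"
```

The whole conversation fits in one small object per chat, which is why there is nothing for the user to save or resume.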
The Numbers
Three months in:
| Language | Sentences | Contributors |
|---|---|---|
| Amharic | 500,000+ | 2,400 |
| Afaan Oromoo | 450,000+ | 1,800 |
| Somali | 200,000+ | 890 |
| Tigrinya | 200,000+ | 750 |
| Total | 1.35M+ | 5,800+ |
The whole thing runs on a single $5/month Hetzner box. Bot, database, queue, review tooling — all of it, on the base tier. That's not a humble-brag; it's the point. The platform decision did more for our cost structure than any optimization we could have done downstream. We aren't paying to acquire contributors. We aren't paying to retain them. We aren't paying for an app store presence. We're paying for one small VM and the electricity to run it.
Why Telegram Beat the Alternatives
We considered (and rejected) several others. The short version:
- WhatsApp is too personal. A phone number is a closer identifier than people want for a public dataset contribution. The Business API is also a different product than what we needed.
- Discord has a real community model but assumes server membership, roles, and channel navigation. It's a heavier mental model than "open a chat with a bot."
- Web app is what we'd already ruled out.
Telegram bots sit at a specific intersection: lightweight install (none), rich UI primitives (buttons, keyboards), persistent chat history, and a community already used to interacting with bots. We haven't found another platform that hits all four for our context.
The Implementation, Stripped Down
The full bot is a few hundred lines, but the shape of the message handler is the part worth showing:
    from telegram import Update
    from telegram.ext import ContextTypes

    # `db` is the project's async storage layer (not shown in this post).


    async def handle_sentence(update: Update, context: ContextTypes.DEFAULT_TYPE):
        user = update.effective_user
        text = update.message.text.strip()
        language = context.user_data.get("language")

        # The language is picked earlier in the chat and kept in user_data.
        if not language:
            await update.message.reply_text("Pick a language first with /start")
            return

        # Cheap sanity check on length, in characters.
        if not (5 <= len(text) <= 500):
            await update.message.reply_text("Sentence is too short or too long — try another.")
            return

        # Reject exact duplicates within the same language.
        if await db.is_duplicate(language=language, text=text):
            await update.message.reply_text("We already have that one — try another.")
            return

        # Everything else is stored for later review.
        await db.save_contribution(
            user_id=user.id,
            language=language,
            text=text,
            status="pending_review",
        )

        # Every fifth contribution gets a thank-you with the running count.
        count = await db.user_count(user.id)
        msg = "✓ Recorded. Send another or /done"
        if count and count % 5 == 0:
            msg = f"✓ Recorded. You've contributed {count} sentences. 🙏"
        await update.message.reply_text(msg)

The interesting choices aren't in the code; they're in what's not there. No login. No session token. No CAPTCHA. No "draft" state. The bot trusts the platform's auth (every message comes from a Telegram user ID we can attribute), and quality is handled downstream rather than upstream. That trade-off (accept more raw input, review later) is what lets the contribution flow stay fast enough for people to actually use.
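The handler calls three methods on `db` that this post doesn't show. As a rough sketch of what such a layer could look like, here is a version backed by the standard library's `sqlite3`. The `ContributionStore` class, the table schema, and the `normalize` helper are assumptions for illustration, not the project's actual storage code.

```python
import asyncio
import sqlite3
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC and collapse whitespace so trivially different copies compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


class ContributionStore:
    """Hypothetical stand-in for the `db` object used by the handler.

    Methods are declared async only to match the handler's `await` calls;
    the sqlite3 calls inside them block, so a real deployment would likely
    use an async driver or a worker thread instead.
    """

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            """CREATE TABLE IF NOT EXISTS contributions (
                   id INTEGER PRIMARY KEY,
                   user_id INTEGER NOT NULL,
                   language TEXT NOT NULL,
                   text TEXT NOT NULL,
                   status TEXT NOT NULL,
                   UNIQUE (language, text)
               )"""
        )

    async def is_duplicate(self, language: str, text: str) -> bool:
        row = self._conn.execute(
            "SELECT 1 FROM contributions WHERE language = ? AND text = ?",
            (language, normalize(text)),
        ).fetchone()
        return row is not None

    async def save_contribution(
        self, user_id: int, language: str, text: str, status: str
    ) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO contributions (user_id, language, text, status) "
                "VALUES (?, ?, ?, ?)",
                (user_id, language, normalize(text), status),
            )

    async def user_count(self, user_id: int) -> int:
        (count,) = self._conn.execute(
            "SELECT COUNT(*) FROM contributions WHERE user_id = ?", (user_id,)
        ).fetchone()
        return count
```

Normalizing before both the lookup and the insert means a sentence with trailing spaces or a different Unicode composition is still caught as an exact duplicate.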
Quality Control: Cheap at Submission, Strict at Ingest
Beyond a few cheap checks in the bot, we don't try to gate quality at the moment of contribution. That's a friction trap. Instead, we layer it in stages:
- Bot-level basics. Length bounds, exact-duplicate rejection, obvious-spam filtering.
- Automated semantic deduplication. Embed each sentence, drop near-duplicates within a language.
- Community flagging. Contributors and reviewers can mark items for human review.
- Linguist review of flagged items. Accept or reject with a reason; rejection feedback feeds the auto-filter.
The result is that contributors never feel judged in the moment, and the dataset still meets a usable quality bar by the time it ships. Most submissions pass cleanly; the long tail of edge cases gets the human attention.
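The near-duplicate step can be illustrated without an embedding model. The sketch below swaps in character-trigram Jaccard overlap as a stand-in for embedding similarity, so it shows the shape of the filter rather than the project's actual method. The function names and the 0.7 threshold are assumptions.

```python
from typing import Iterable, List, Set


def trigrams(text: str) -> Set[str]:
    """Character trigrams of the lowercased, whitespace-normalized, padded text."""
    padded = f"  {' '.join(text.lower().split())}  "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(sentences: Iterable[str], threshold: float = 0.7) -> List[str]:
    """Keep a sentence only if it is below the threshold against every kept one."""
    kept: List[str] = []
    kept_grams: List[Set[str]] = []
    for sentence in sentences:
        grams = trigrams(sentence)
        if all(jaccard(grams, seen) < threshold for seen in kept_grams):
            kept.append(sentence)
            kept_grams.append(grams)
    return kept
```

This pairwise comparison is quadratic in the number of kept sentences, which is fine for a sketch; at corpus scale an approximate nearest-neighbor index or hashing scheme would be the usual replacement.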
What Else Telegram Quietly Did For Us
Some of the benefits we didn't plan for:
- Community formed on its own. Language-specific Telegram groups grew alongside the bot. Contributors started coaching each other on what counts as a good sentence — better than any onboarding doc we could have written.
- Distribution is built in. Every contributor can forward the bot link to a friend in one tap. That single mechanic drives most of our organic growth.
- Trust comes pre-installed. People understand they're contributing to preserve their language. They're not being asked to install a stranger's app or hand over their email. The mission and the channel are aligned.
- Dropout is low because cost is low. A contribution is 30 seconds. People come back because there's almost nothing to come back to — no profile to maintain, no streak to lose.
What the Dataset Actually Contains
It's not just "random sentences from the internet." The mix we're collecting:
- News and current events in native languages
- Conversational text that reflects how people actually speak
- Literary and poetic forms important for cultural and stylistic coverage
- Technical and educational writing to expand vocabulary into modern domains
- Historical and archival material that's being digitized for the first time
This diversity is what makes the corpus useful for downstream model training rather than just for benchmarking.
Open by Default
Everything Dataset.ET collects is open under CC-BY-4.0. Every entry carries the contributor's attribution and the contribution date. Researchers can pull the corpus directly. Builders can train on it. We don't hold the data behind a license wall — that would defeat the entire reason the community is contributing in the first place.
You can explore and download the corpus at dataset.et.
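For a sense of what an attributed entry might look like on disk, here is a small sketch that serializes records as JSON Lines. The `Entry` fields, their names, and the JSON Lines format itself are assumptions for illustration; the published corpus may be structured differently.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class Entry:
    """One contributed sentence with its attribution (hypothetical field names)."""
    text: str
    language: str
    contributor: str
    contributed_on: date
    license: str = "CC-BY-4.0"


def to_jsonl(entries: Iterable[Entry]) -> str:
    """Serialize entries as one JSON object per line."""
    lines = []
    for entry in entries:
        record = asdict(entry)
        # Dates are not JSON-serializable, so store them as ISO 8601 strings.
        record["contributed_on"] = entry.contributed_on.isoformat()
        # ensure_ascii=False keeps Ge'ez and other scripts readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

Carrying the license and attribution on every record means a downstream user who takes only a slice of the corpus still has what CC-BY-4.0 asks them to preserve.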
Lessons Worth Generalizing
If you're building any volunteer-driven contribution system — datasets, citizen science, open mapping, translation — here's what we'd carry into the next one:
- Go where your community already is. Don't build a destination; build inside the platform they open every day. The right platform halves your problem before you've written a line of code.
- Target zero-friction contribution. Measure your funnel by seconds, not steps. Every signup field, every install prompt, every "verify your email" is a leak.
- Defer quality control. Gate at ingest, not at submission. Make contributing feel weightless; make the dataset rigorous behind the scenes.
- Make impact legible. Show contributors their counts, the languages they're moving forward, the models being trained on their work. The motivation is intrinsic — your job is to keep it visible.
- Don't gamify a mission. People aren't here for badges. They're here because they care about the language. Treat that seriously and you don't need rewards.
What's Next for Dataset.ET
The roadmap is more of what's working:
- Voice contributions — short recorded clips to seed speech recognition and TTS.
- Validator tooling — a lightweight review interface for linguists, also delivered through Telegram.
- Reference models — open-weight baselines trained on the corpus, so contributors can see what their work makes possible.
- Downstream applications — translation, transcription, and conversational models built on the data, all open.
Telegram stays the contribution surface. It's the part of the system we have the least reason to change.
Get Involved
Dataset.ET is a community project. The dataset belongs to everyone who contributes.
- Explore the corpus: dataset.et
- Contribute via Telegram: t.me/dataset_et_bot
- Join the community: t.me/dataset_et
One sentence at a time, we're building the data layer that the next generation of Ethiopian-language AI will rest on. The hardest part wasn't the bot. It was deciding that meeting people where they already are mattered more than building something that looked impressive on a slide.
