About Resource Abundance Notation
The term “low-resource” is used constantly in NLP but means almost nothing precise. A paper might call a language low-resource whether it has 500 parallel sentences or 500,000, whether there are 10 fluent speakers or 10 million. Existing classifications give a single coarse class number; RAN gives a short, reproducible, multi-dimensional score that a reader can decode at a glance.
The notation
A RAN score is written as:
S / M / L1-B1 / L2-B2 / …
- S = ⌊log10(fluent speakers)⌋
- M = ⌊log10(monolingual sentences)⌋
- Each Li-Bi is a bilingual partner language Li paired with Bi = ⌊log10(parallel sentences)⌋; pairs are listed in descending order of Bi.
Because each integer is an order of magnitude, they are easy to read:
3 = thousands, 6 = millions, 9 = billions.
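The buckets can be computed directly from raw counts. A minimal sketch in Python (the function names here are illustrative, not part of any RAN tooling):

```python
import math

def magnitude(count: int) -> int:
    """Order-of-magnitude bucket: floor(log10(count)) for count >= 1."""
    if count < 1:
        raise ValueError("count must be >= 1")
    return math.floor(math.log10(count))

def ran_string(speakers: int, mono_sentences: int, parallel: dict[str, int]) -> str:
    """Build a RAN string from raw counts.

    `parallel` maps partner-language codes to parallel sentence counts;
    partners are listed in descending order of their bucket.
    """
    parts = [str(magnitude(speakers)), str(magnitude(mono_sentences))]
    pairs = sorted(parallel.items(), key=lambda kv: -magnitude(kv[1]))
    parts += [f"{lang}-{magnitude(n)}" for lang, n in pairs]
    return "/".join(parts)

# Cherokee-like counts: 2,000 speakers, 300 sentences, 40,000 en parallel
# ran_string(2_000, 300, {"en": 40_000})  ->  "3/2/en-4"
```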
Examples
| Language | RAN | Reading |
|---|---|---|
| Spanish | 8/9/en-9/fr-8/pt-8 | Hundreds of millions of speakers, billions of monolingual sentences, billions of en–es parallel. |
| Swahili | 8/4/en-7/fr-7 | Hundreds of millions of speakers, tens of thousands of monolingual sentences, tens of millions of en–sw parallel. |
| Quechua | 6/0/en-6/es-3 | Millions of speakers, single-digit monolingual sentences, millions of en–qu parallel. |
| Cherokee | 3/2/en-4 | Thousands of speakers, hundreds of monolingual sentences, tens of thousands of en–chr parallel. |
| Owens Valley Paiute | 1/1/ | Tens of speakers, tens of monolingual sentences, no parallel corpus. |
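Going the other way, a RAN string decodes mechanically. A hypothetical parser sketch (the trailing-slash convention for "no parallel corpus" is taken from the Owens Valley Paiute row):

```python
def parse_ran(ran: str) -> tuple[int, int, list[tuple[str, int]]]:
    """Parse a RAN string into (S, M, [(partner_lang, B), ...]).

    A trailing '/' (as in '1/1/') signals an empty partner list.
    """
    parts = ran.split("/")
    s, m = int(parts[0]), int(parts[1])
    pairs = []
    for part in parts[2:]:
        if not part:          # empty segment from a trailing slash
            continue
        lang, b = part.rsplit("-", 1)
        pairs.append((lang, int(b)))
    return s, m, pairs

# parse_ran("6/0/en-6/es-3")  ->  (6, 0, [("en", 6), ("es", 3)])
```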
Example sources
Any reputable, citable source can back a RAN value. Commonly used references include:
- S: Ethnologue (L1+L2 speaker totals), Wikipedia language pages, national census data.
- M: OSCAR deduplicated word counts, CC-100, mC4, native-language Wikipedia, government or academic corpora.
- Bi: OPUS parallel-corpus sentence counts, NLLB bitext releases, dataset-specific parallel releases (FLORES, JW300, etc.).
What matters for review is that the source is reproducible (a snapshot URL, DOI, or canonical dump version) and reports the claimed counts in the right denomination (deduplicated sentences, not raw bytes). Sources may be submitted as BibTeX, URL, or DOI.
How this database works
Anyone can submit a revision: an ISO 639-3 code, one or
more RAN components, and a source per value (BibTeX preferred). Admins from the
ran-admins GitHub organization review each submission against the cited
sources before it is applied.
Updates are max-overwrite: resource abundance only grows over time, so the database stores the largest verified value for each field and rejects strictly smaller submissions. This means any RAN score published in a paper remains a valid lower bound on today's value, even as new corpora appear.
Every accepted change writes a row to an append-only revision history
(old_value to new_value, source, reviewer, timestamp). You can
see the full history for any language from its detail page; citations can pin to a
specific revision to freeze the snapshot a paper relied on.
Correcting mistakes
Under max-overwrite, a wrong but inflated value would otherwise stick around forever. To handle corrections without rewriting history, anyone can submit an invalidation request for a past submission (ID + reason). Once an admin approves, a new entry is appended to the invalidation log, and the current values are recomputed as the maximum across all revisions whose submission has not been invalidated. Nothing is ever deleted: the original submission and its revisions stay in the database, flagged as invalidated. The submitter (or anyone else) can then file a new update with the correct value.
Using RAN in your own work
- Cite a language's current profile by its RAN string: for instance, “Our evaluation includes 6/0/en-6/es-3 Quechua.”
- Scripts can fetch the full current state from `/api/export` (JSON).
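Consuming the export might look like the sketch below. Only the `/api/export` path is documented here; the per-language object shape (`S`, `M`, `B` keys, ISO 639-3 codes) is an assumption for illustration.

```python
import json

# Assumed shape of the /api/export payload (schema is hypothetical):
sample = json.loads("""
{
  "chr": {"S": 3, "M": 2, "B": {"en": 4}},
  "que": {"S": 6, "M": 0, "B": {"en": 6, "es": 3}}
}
""")

def ran_of(entry: dict) -> str:
    """Render one export entry back into a RAN string."""
    pairs = sorted(entry["B"].items(), key=lambda kv: -kv[1])
    return "/".join([str(entry["S"]), str(entry["M"])] +
                    [f"{lang}-{b}" for lang, b in pairs])

# ran_of(sample["que"])  ->  "6/0/en-6/es-3"
```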