Robust frequency lists

Nothing in Nature is random… A thing appears random only through the incompleteness of our knowledge. —Spinoza, Ethics I

1 Know thy corpus!

Frequency lists determine the probability estimates for words in language models. They are also widely used in lexicography, corpus linguistics and language education studies. However, word frequency lists from different corpora differ considerably even when the corpora differ only slightly in composition. Some words become very frequent in a small number of texts specific to one corpus, so their frequency counts are unreliable estimates of their expected frequency. Adam Kilgarriff called such words whelks: whelk is a rare word, but it becomes topical, and therefore frequent, in a text about whelks. The whelks of the British National Corpus (BNC) include medical terms, as seen in the following two extracts from the BNC frequency list:

  1. foods, lighting, sciences, anglo, emerge, contacts, gastric, desirable, 1950s, gender, poland, picking, suggestions, enjoying, laughter
  2. incidentally, sticking, angrily, speeds, drum, spine, realm, mucosa, heather, allegedly, rested, builders, lid, invention, blowing

where gastric is ranked higher than desirable and mucosa is ranked higher than builders. Only a small proportion of BNC texts are taken from the Journal of Gastroenterology and Hepatology (713 thousand tokens, less than 0.8% of the BNC). However, the frequency bursts of words from this domain propel them into the core lexicon.

No corpus is immune to whelks. For example, the frequency of Texas in the LDC Gigaword corpus is greater than that of such common words as long or car. In the Wikipedia dump, which was used to train BERT, pomeranian is more frequent than hypothetical or centrally.

Robust lists for a number of languages and corpora are listed in the Frequency lists section below.

2 Robust methods

The tools in this repository produce robust frequency lists by applying methods from robust statistics to document-level measures in order to filter out frequency bursts. The method found most useful in this study is based on the huberM and S_n estimators of expected values.

In short, for each word we compute a robust estimate of how many times it is likely to occur in a document of this corpus. If the word's frequency in a document exceeds this estimate, as happens in a frequency burst, we clip (Winsorise) the frequency to the estimate. This helps in describing the frequency distributions from different corpora by making more reliable predictions of how common the words and their constructions are. It also helps in inferring significant differences between the lexicons of different text collections, e.g., detecting problems in a given corpus or finding how a Web crawl differs from the BNC. For the rationale and the methodology, see:

@InProceedings{sharoff20,
  author =   {Serge Sharoff},
  title =    {Know thy corpus! Robust methods for digital curation of Web corpora},
  booktitle = {Proc LREC},
  year =     2020,
  month =    {May},
  address =      {Marseilles}}

https://aclanthology.org/2020.lrec-1.298/

3 Scripts

The starting point for building a robust frequency list is a document level frequency list, e.g.,

correct 1 8309
correct 1 20116
gastric 1 62338
gastric 17 59681

In this artificial example, correct occurs once in each of two documents (with lengths of 8309 and 20116 words), while gastric occurs once in one document (62338 words) and 17 times in another (59681 words). The latter is an outlying observation, which can be detected using methods from robust statistics.
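The clipping idea can be illustrated with a minimal sketch. It is not the huberM / S_n procedure used by the scripts in this repository: as a stand-in it uses the median and the median absolute deviation (MAD) of per-document relative frequencies, and the function name `winsorise`, the multiplier `k` and the sample data are all assumptions made for illustration.

```python
from statistics import median

def winsorise(observations, k=3.0):
    """Clip bursts for one word.

    observations: list of (count, doc_length) pairs, one per document.
    Returns (adjusted_total, number_of_clipped_documents).
    Assumption: median + k * 1.4826 * MAD stands in for the robust
    location and scale estimators named in the text.
    """
    rates = [count / length for count, length in observations]
    centre = median(rates)
    mad = median([abs(r - centre) for r in rates])
    limit = centre + k * 1.4826 * mad  # highest rate treated as normal
    adjusted, clipped = 0.0, 0
    for count, length in observations:
        cap = limit * length  # largest count allowed in this document
        if count > cap:
            adjusted += cap
            clipped += 1
        else:
            adjusted += count
    return adjusted, clipped

# Made-up data: five documents of equal length, one with a burst of 17.
obs = [(1, 10000), (2, 10000), (3, 10000), (2, 10000), (17, 10000)]
adjusted, clipped = winsorise(obs)
print(f"raw={sum(c for c, _ in obs)} adjusted={adjusted:.1f} clipped={clipped}")
```

Only the burst document is clipped here, while the four ordinary documents keep their raw counts. With very few documents, or when most documents have identical rates, the MAD can be zero and this simple rule becomes too aggressive, which is one reason to prefer the estimators discussed in the paper.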

A list of this kind can be produced by taking a corpus in the form of a single file with one document per line, e.g.

... invariably , even when we have needed to correct or update details in our reports ,
... diffuse into the gastric lumen , so the presence of any iron in fasting gastric juice

and running:

frq-line.py corpus.ol | sort >corpus-doc.num
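For orientation, here is a minimal sketch of what a document-level counter of this kind could look like. It is not the code of frq-line.py: whitespace tokenisation, lowercasing and the function name `doc_frequencies` are assumptions, and the two sample documents are shortened, illustrative strings.

```python
from collections import Counter

def doc_frequencies(lines):
    """Yield (word, count, doc_length) for each distinct word of each document.

    Assumptions: one document per line, tokens separated by whitespace,
    words lowercased before counting.
    """
    for line in lines:
        tokens = line.lower().split()
        for word, count in Counter(tokens).items():
            yield word, count, len(tokens)

docs = [
    "we have needed to correct or update details",
    "the gastric lumen and the fasting gastric juice",
]
rows = list(doc_frequencies(docs))
for word, count, length in sorted(rows):
    print(word, count, length)
```

Each output line has the same three fields as the example list above: the word, its count in one document, and the length of that document.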

If you have access to a computer cluster with many computing nodes, this list can be produced for a large corpus much faster by running:

split -l XX corpus.ol

first to split the corpus into multiple files with a fixed number of documents (XX), so that the document level frequency lists for each file can be computed in parallel. After that the separate frequency lists can be combined by running

sort x*.num >corpus-doc.num

Finally, robustpython.py can be used to compute the robust frequency list:

python3 robustpython.py 5 <corpus-doc.num | sort -nsrk3,3 >corpus-robust.num

The parameter is the document level frequency threshold, i.e., the words for the frequency list need to occur in at least 5 documents in this example.
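The effect of the threshold can be sketched as follows. This is an illustration only, not the code of robustpython.py: the function name `group_by_word` and the whitespace-separated input format are assumptions based on the example list shown earlier.

```python
from collections import defaultdict

def group_by_word(rows, min_docs=5):
    """Collect (count, doc_length) pairs per word and drop rare words.

    rows: lines of the form "word count doc_length" (assumed format).
    A word is kept only if it occurs in at least min_docs documents.
    """
    per_word = defaultdict(list)
    for line in rows:
        word, count, length = line.split()
        per_word[word].append((int(count), int(length)))
    return {w: obs for w, obs in per_word.items() if len(obs) >= min_docs}

sample = [
    "correct 1 8309",
    "correct 1 20116",
    "gastric 1 62338",
    "gastric 17 59681",
]
kept = group_by_word(sample, min_docs=2)
print(sorted(kept))
```

With a threshold of 2 both words from the artificial example survive, while a threshold of 3 or more would remove them, since each occurs in only two documents.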

For each word (or another object of counting, such as a lemma or an n-gram), robustpython.py outputs in tab-separated format: 1. the word itself, 2. the raw frequency, 3. the adjusted robust frequency value counted for all documents with Winsorisation, 4. the number of documents subject to Winsorisation, and 5. the document frequency. For example, a sample of frequencies from the BNC lists looks like:

Word       Raw   Adjusted  Clipped  Total Doc
correct    6706  5500      263      1925
desirable  2084  1858      125      975
gastric    2085  154       16       70

The Raw column counts the raw number of occurrences, Adjusted is the same count with outlying observations clipped, Clipped is the number of documents in which clipping happened, and Total Doc is the overall number of documents in which this word occurred in this corpus. By clipping outlying observations, the Adjusted frequency gives a better estimate of the rank of gastric.
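The change in ranking can be checked on the three sample rows above with a short sketch. The tuple layout simply mirrors the five output columns; the variable names are chosen for illustration.

```python
# (word, raw, adjusted, clipped, total_doc), copied from the BNC sample above
rows = [
    ("correct", 6706, 5500, 263, 1925),
    ("desirable", 2084, 1858, 125, 975),
    ("gastric", 2085, 154, 16, 70),
]

# Rank once by the raw frequency and once by the adjusted frequency.
by_raw = [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]
by_adjusted = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]

print("by raw:     ", by_raw)
print("by adjusted:", by_adjusted)
```

By raw frequency gastric is just ahead of desirable (2085 against 2084); by adjusted frequency it falls well behind it (154 against 1858).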

The most significant changes in the frequency list before and after robust estimation (produced by the Perl script compare_fq_lists.pl in the repository) from the BNC are as follows:

Word           Raw    Robust  LL-score
hon            10709  378     2890
lifespan       3854   110     1139
darlington     5606   426     875
inc            6584   794     527
taped          4151   460     389
athelstan      1061   15      385
gastric        2085   154     335
theda          838    9       320
robyn          1206   46      313
middlesbrough  3620   488     227
infinitive     721    22      208
jenna          668    19      198
minton         760    29      197
ronni          538    8       193
corbett        1541   144     188
colonic        830    42      183

mucosa         1041   133     74

For example, Athelstan, Darlington and Theda are personal names in some of the BNC texts (e.g., Darlington comes from "The Remains of the Day"), while the frequency burst of Hon (which takes it into the top 1000 most frequent words in the BNC) is due to its frequent repetition in contexts like: The Princess Margaret was represented by the Hon Mrs Wills at the Memorial Service for Colonel the Hon Sir Gordon Palmer.

For information about the log-likelihood score in keyword detection see http://ucrel.lancs.ac.uk/llwizard.html
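As a reminder of how the score works, here is a minimal sketch of the standard two-corpus log-likelihood calculation for a single word. It is not taken from compare_fq_lists.pl, which may normalise its inputs differently, so it should not be expected to reproduce the LL-score column above. The function name and the example numbers are assumptions made for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood score for one word observed in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in each corpus.
    """
    total = freq_a + freq_b
    # Expected counts if the word were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

# Hypothetical counts in two corpora of one million tokens each.
print(log_likelihood(120, 1_000_000, 30, 1_000_000))
```

The score is zero when the word has the same relative frequency in both corpora and grows as the two frequencies diverge.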

4 Frequency lists

4.1 English

4.1.1 Lists of word forms:

  1. BNC. This is from the classic British National Corpus.
  2. ukWac. A corpus from the Wacky family.
  3. Wikipedia.
  4. OpenWebText. This is a clone of OpenAI's corpus used for training GPT-2, collected from upvoted links from Reddit, see OpenWebText.
  5. CCNET. This is the English corpus from the Common Crawl cleaned for XLM-R, see their paper.

4.2 Russian

Lists of lemmas with POS codes:

  1. Russian National Corpus. You can compare this to the raw RNC frequencies in the classic list of Lyashevskaya and Sharoff, 2009.
  2. ruTenTen. A popular corpus from the SketchEngine.
  3. ruWac. A corpus from the Wacky family.
  4. GICR. This is the news component of the General Internet Corpus of Russian.
  5. Aranea Maximus. A large Aranea Web crawl for Russian, see the paper describing its properties.
  6. CCNet-Russian. The Russian part of the Common Crawl cleaned for XLM-R, see their paper.

The POS codes in the Russian National Corpus have not been unified with the remaining corpora. For example, _s is the code for nouns in the RNC while it is _n in other corpora.

4.3 Wikipedia lists

Robust frequency filtering helps in removing various artifacts of Wikipedia processing, e.g., unreasonably frequent pomeranian, montane, spurred, substrates, encompassed, italianate, prelate, attaining in the BERT BPE lexicon.

ccnet-en-200-clean2-biwt.tsv.xz

openwebtext-robust-R.tsv.xz

Author: Serge Sharoff

Created: 2024-01-15 Mon 11:52
