Difficulty resources

Vikidia: Wikipedia for 8-13 year old kids

Entries	Entries >50 words	Words	Language
4,489	2,232	671,897	en
40,750	38,064	15,411,238	fr
515	155	28,451	ca
6,324	3,790	1,016,066	es
9,482	3,390	839,184	it
265	87	15,555	ru

To train tools that distinguish Vikidia texts from standard Wikipedia texts, I've created a set of files that match the corresponding entries in the standard Wikipedias as closely as possible in topic and formatting. This also means selecting a subset of each Wikipedia entry, so that the models cannot rely on the heuristic that short articles always come from Vikidia.
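One way to picture the subset selection is the sketch below. It is only an illustration, not necessarily how the files were built: the function name select_subset, the regex-based sentence splitting and the rule of keeping leading sentences until the Vikidia word count is reached are all assumptions.

```python
import re


def select_subset(wiki_text: str, vikidia_text: str) -> str:
    """Keep leading sentences of wiki_text until its word count
    reaches the word count of the matching Vikidia text.

    Hypothetical helper: the actual selection procedure is not
    described in detail, so this is just one plausible approach.
    """
    budget = len(vikidia_text.split())
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", wiki_text.strip())
    kept, count = [], 0
    for sentence in sentences:
        if count >= budget:
            break
        kept.append(sentence)
        count += len(sentence.split())
    return " ".join(kept)
```

Because whole sentences are kept, the result can run slightly over the Vikidia length rather than cutting a sentence in the middle.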

This is the set of test files:

Language	Wiki: texts	Median length (words)	Vikidia: texts	Median length (words)
en	1109	185	1913	139
ca	125	104	125	77
es	2739	145	2738	136
fr	2000	143	2000	140
it	2457	177	2456	135
ru	104	75	104	54

The format of the dataset is the same for all languages. Each line holds a label (vikidia or wiki) followed by the title and text of one entry:

vikidia	Mountain A mountain is a rise in the earth's surface. The definition of how tall a mountain is varies, but ..
wiki	Nervous system Living arthropods have paired main nerve cords running along their bodies below the gut …
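A minimal sketch for reading such a file and recomputing the per-label counts and median lengths in words. It assumes the label and the text are separated by a single tab and that the file is xz-compressed UTF-8; the helper names read_dataset and summarise are made up for this example and are not part of the training scripts.

```python
import lzma
from statistics import median


def read_dataset(path):
    """Yield (label, text) pairs from an xz-compressed one-text-per-line file.

    Assumes each line is "<label><TAB><text>"; blank lines are skipped.
    """
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            label, _, text = line.partition("\t")
            yield label, text


def summarise(pairs):
    """Return {label: (number of texts, median length in words)}."""
    lengths = {}
    for label, text in pairs:
        lengths.setdefault(label, []).append(len(text.split()))
    return {label: (len(v), median(v)) for label, v in lengths.items()}
```

For example, summarise(read_dataset("wv-en-train.dat.xz")) would return one (count, median) pair for each of the vikidia and wiki labels.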

A multilingual BERT model can be trained with ./bert-train.py by passing it the training file: python3 bert-train.py wv-en-train.dat.xz

After that, the accuracy can be tested across languages with ./bert-test.py: python3 bert-test.py MODEL.pth TEST.dat.xz

You can also use these scripts with the Cambridge Readability dataset, which has been converted to the same one-line format: Readability_cup-snc4.ol.xz