Difficulty resources
Vikidia: Wikipedia for 8-13yr old kids
| Entries | >50 words | Words | Languages |
|---|---|---|---|
| 4,489 | 2232 | 671,897 | en |
| 40,750 | 38064 | 15,411,238 | fr |
| 515 | 155 | 28,451 | ca |
| 6,324 | 3790 | 1,016,066 | es |
| 9,482 | 3390 | 839,184 | it |
| 265 | 87 | 15,555 | ru |
For training the tools for recognising Vikidia from the standard Wikipedia texts I've created a set of files which match the respective entries in the standard Wikipedias as closely as possible in their topics and formatting. This also means selecting subsets of each Wikipedia entry, so that the models cannot rely on the heuristics that short articles are always from Vikidia.
This is the set of test files:
| Languages | Wiki: texts | Median word length | Vikidia: texts | Median word length | |
|---|---|---|---|---|---|
| en | 1109 | 185 | 1913 | 139 | |
| ca | 125 | 104 | 125 | 77 | |
| es | 2739 | 145 | 2738 | 136 | |
| fr | 2000 | 143 | 2000 | 140 | |
| it | 2457 | 177 | 2456 | 135 | |
| ru | 104 | 75 | 104 | 54 |
The format of the dataset is the same for all languages:
| vikidia | Mountain A mountain is a rise in the earth's surface. The definition of how tall a mountain is varies, but .. |
| wiki | Nervous system Living arthropods have paired main nerve cords running along their bodies below the gut … |
A multilingual BERT model can be trained using ./bert-train.py and the training file as:
python3 bert-train.py wv-en-train.dat.xz
After that the accuracy can be tested using ./bert-test.py across languages as
python3 bert-test.py MODEL.pth TEST.dat.xz
You can also use these scripts with the Cambridge Readability dataset, which has been converted to the same one-line format, so that it can be also used with the same training script: Readability_cup-snc4.ol.xz