Difficulty resources

Vikidia: a Wikipedia for children aged 8-13

Entries   >50 words   Words        Language
4,489     2,232       671,897      en
40,750    38,064      15,411,238   fr
515       155         28,451       ca
6,324     3,790       1,016,066    es
9,482     3,390       839,184      it
265       87          15,555       ru

To train tools that distinguish Vikidia texts from standard Wikipedia texts, I have created a set of files whose Vikidia entries match the respective entries in the standard Wikipedias as closely as possible in topic and formatting. This also means selecting a subset of each Wikipedia entry, so that the models cannot rely on the heuristic that short articles always come from Vikidia.

This is the set of test files:

Language   Wiki: texts   Wiki: median word length   Vikidia: texts   Vikidia: median word length
en         1109          185                        1913             139
ca         125           104                        125              77
es         2739          145                        2738             136
fr         2000          143                        2000             140
it         2457          177                        2456             135
ru         104           75                         104              54
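
The subset selection mentioned above is not specified in detail here. As a minimal sketch of one way to do it, assuming the Wikipedia entry is already split into paragraphs and the target is the word count of the matching Vikidia entry, leading paragraphs can be kept until the budget is reached. The function name trim_to_word_budget and the paragraph-level granularity are illustrative assumptions, not the procedure actually used to build the files.

```python
from typing import List


def trim_to_word_budget(paragraphs: List[str], target_words: int) -> List[str]:
    """Keep leading paragraphs until adding another would exceed target_words.

    Hypothetical helper: the first paragraph is always kept, so a non-empty
    entry never becomes empty even if it is longer than the budget.
    """
    kept: List[str] = []
    total = 0
    for para in paragraphs:
        n_words = len(para.split())  # whitespace tokenisation, an assumption
        if kept and total + n_words > target_words:
            break
        kept.append(para)
        total += n_words
    return kept
```

Trimming whole paragraphs rather than cutting at an exact word count keeps the texts well-formed, at the cost of only approximately matching the target length.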

The format of the dataset is the same for all languages:

vikidia Mountain A mountain is a rise in the earth's surface. The definition of how tall a mountain is varies, but ..
wiki Nervous system Living arthropods have paired main nerve cords running along their bodies below the gut …
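
A minimal sketch of reading this one-line format in Python. It assumes that each line holds three tab-separated fields (label, title, text) and that the files are UTF-8 text compressed with xz; the separator is not stated on this page, so it is exposed as a parameter. The names Record, parse_line and read_records are illustrative and are not taken from bert-train.py.

```python
import lzma
from typing import Iterator, NamedTuple


class Record(NamedTuple):
    label: str  # "vikidia" or "wiki"
    title: str  # entry title, e.g. "Mountain"
    text: str   # entry text on a single line


def parse_line(line: str, sep: str = "\t") -> Record:
    """Split one dataset line into (label, title, text).

    The tab separator is an assumption. Splitting at most twice keeps any
    further separators inside the text field; a line with fewer than three
    fields raises ValueError.
    """
    label, title, text = line.rstrip("\n").split(sep, 2)
    return Record(label, title, text)


def read_records(path: str, sep: str = "\t") -> Iterator[Record]:
    """Yield records from an xz-compressed file, skipping blank lines."""
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield parse_line(line, sep)
```

For example, list(read_records("wv-en-train.dat.xz")) would load the English training file into memory, provided the separator assumption holds.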

A multilingual BERT model can be trained by running ./bert-train.py on the training file: python3 bert-train.py wv-en-train.dat.xz

After that, the accuracy can be tested across languages with ./bert-test.py: python3 bert-test.py MODEL.pth TEST.dat.xz
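
How bert-test.py computes its score is not described on this page. Assuming it is the usual classification accuracy, that is, the fraction of texts whose predicted label (vikidia or wiki) equals the gold label, the measure can be sketched as follows. The function name accuracy is illustrative.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

Because the test sets are roughly balanced between the two labels for most languages, plain accuracy is a reasonable summary; for the English set, where the label counts differ, a per-label breakdown may be more informative.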

You can also use these scripts with the Cambridge Readability dataset, which has been converted to the same one-line format: Readability_cup-snc4.ol.xz

Author: Serge Sharoff

Created: 2024-07-01 Mon 22:10
