Tools for processing corpora

This is a fairly antiquated page describing the tools I developed for corpus mining and processing. They were used to create the corpora described in:

Sharoff, S. (2006) Open-source corpora: using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4), 435-462 (prepublication draft available).

The newer tools are collected on my GitHub page: https://github.com/ssharoff/.

On this page I have collected several tools I developed for processing corpora. Some of them are fairly obvious, but they are included for the sake of completeness. Many of the tools assume the use of the IMS Corpus Workbench (CWB).

Corpus mining tools

  1. Russian POS tagger, lemmatiser, syntactic parser and corpora are available from a separate page
  2. Chinese tokenisation and tagging tools are available from a separate page
  3. Italian parser developed by Marilena di Bari on the basis of Paisa
  4. Kannada tools — tagger and lemmatiser for Kannada (one of the 30 most spoken languages in the world), developed primarily by Siva Reddy. It is described in our paper: Siva Reddy, Serge Sharoff. Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources. In Proceedings of CLIA 2011 at IJCNLP 2011.
  5. Georgian tagger and lemmatiser developed in collaboration with Sofia Daraselia and Marina Beridze. It is described in our paper: Daraselia S. and Sharoff S. (2015) Error Analyses in Part-of-Speech Tagging in Georgian. International Conference - Language and Modern Technologies IV, Tbilisi, Georgia, 10-15 September, 2015.
  6. CSAR — CGI interface to Corpus Workbench (supports concordances and collocation lists with filters)
  7. flist2utf8.pl — concatenates all html pages and converts them to utf8 (it relies on enca for Chinese and Russian)
  8. html2text.pl — uses the BTE algorithm to extract the body of continuous text from html pages (a heavily modified version of Marco Baroni's tool)
  9. PotaModule.pm — a perl library used by flist2utf8 and html2text
  10. cleanwords.pl — filters the list of retrieved pages by good/bad keywords and URLs
  11. dedupes.pl — finds duplicate texts by detecting duplicate sentences (based on an idea from the Wac toolkit by Niels Ott); a minimal sketch of the idea is given after this list
  12. make-frequency-list.pl — generates frequency lists for CWB corpora (it relies on the CWB frequency table)
  13. make-subfrequency-list.pl — generates frequency lists for subcorpora of CWB-encoded corpora (since it computes the frequency data for the subcorpus directly, it can also produce n-grams and the like, but it is slower than the previous script)
  14. mwedetect.pl — a tool for detecting multi-word expressions in CWB-encoded corpora; for details, please refer to: What is at Stake: a Case Study of Russian Expressions Starting with a Preposition. Proceedings of the 2nd ACL Workshop on Multiword Expressions: Integrating Processing (MWE-2004)
  15. getsimple.pl and getsimple-cwb.pl — tools for ranking Chinese texts (plain text or CWB) against a list of characters the student knows (see the second sketch below); learn-vocab.pl is a version for languages with Latin-based scripts (it works with a list of words)
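
The idea behind dedupes.pl can be illustrated with a minimal sketch (this is not the actual script): each sentence is hashed, and a document is flagged as a likely duplicate when a large proportion of its sentences have already been seen in other documents. The one-sentence-per-line input format and the 0.5 threshold are assumptions made only for this illustration.

    #!/usr/bin/perl
    # A minimal sketch of sentence-based duplicate detection (NOT dedupes.pl).
    # It assumes one document per file, one sentence per line, UTF-8 encoded.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use Encode qw(encode_utf8);

    my %seen;    # sentence hash => first file it occurred in
    my %shared;  # file => number of sentences already seen in other files
    my %total;   # file => total number of sentences

    for my $file (@ARGV) {
        open my $fh, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";
        while (my $sent = <$fh>) {
            chomp $sent;
            next unless $sent =~ /\S/;
            my $h = md5_hex(encode_utf8(lc $sent));
            $total{$file}++;
            if (exists $seen{$h} and $seen{$h} ne $file) {
                $shared{$file}++;
            } else {
                $seen{$h} = $file;
            }
        }
        close $fh;
    }

    # Report files in which more than half of the sentences were seen elsewhere;
    # the 0.5 threshold is arbitrary and serves only the illustration.
    for my $file (sort keys %total) {
        my $ratio = ($shared{$file} || 0) / $total{$file};
        printf "%s\t%.2f\n", $file, $ratio if $ratio > 0.5;
    }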
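
Similarly, the ranking performed by getsimple.pl can be sketched as follows (again, this is not the actual script): for each text, count the proportion of Chinese characters that appear in the student's list of known characters and sort the texts by that proportion. The input formats below are assumptions made for the sake of the example.

    #!/usr/bin/perl
    # A rough sketch of ranking texts by the proportion of known characters
    # (NOT getsimple.pl). First argument: a file listing the characters the
    # student knows; remaining arguments: plain-text files to rank.
    use strict;
    use warnings;
    use open qw(:std :encoding(UTF-8));

    my $known_file = shift @ARGV or die "Usage: $0 known-chars.txt text ...\n";
    my %known;
    open my $kf, '<', $known_file or die "Cannot open $known_file: $!";
    while (my $line = <$kf>) {
        chomp $line;
        $known{$_} = 1 for split //, $line;
    }
    close $kf;

    my %score;
    for my $file (@ARGV) {
        my ($hits, $total) = (0, 0);
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            for my $ch (split //, $line) {
                next unless $ch =~ /\p{Han}/;   # count only CJK characters
                $total++;
                $hits++ if $known{$ch};
            }
        }
        close $fh;
        $score{$file} = $total ? $hits / $total : 0;
    }

    # Texts with the highest proportion of known characters come first
    printf "%.3f\t%s\n", $score{$_}, $_
        for sort { $score{$b} <=> $score{$a} } keys %score;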

Text classification tools

  1. make-arff.pl — makes an ARFF file of selected features to be used by Weka
  2. arff2sparse.pl — converts ARFF files to the sparse matrix format used, inter alia, by CLUTO
  3. arff2cw.pl — converts ARFF files to the graph format, used by Chinese Whispers
  4. make-keywords-cwb.pl — finds words more specific to individual documents in a corpus (useful for building keyword lists for new languages)
  5. compare-fq-lists.pl — finds words more specific to individual frequency lists (log-likelihood and log-odds scores are supported at the moment; the output can be a tab-separated file or a LaTeX table); a sketch of the log-likelihood computation is given after this list
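
The log-likelihood comparison follows the usual G2 ("log-likelihood") keyword statistic: for each word, its observed frequencies in the two lists are compared with the frequencies expected if the word were spread evenly over both. The sketch below only illustrates the computation; it is not the actual compare-fq-lists.pl, and the frequency-then-word tab-separated input format is an assumption.

    #!/usr/bin/perl
    # A sketch of comparing two frequency lists with the log-likelihood (G2)
    # statistic (NOT compare-fq-lists.pl). Each input file is assumed to
    # contain tab-separated "frequency<TAB>word" lines.
    use strict;
    use warnings;
    use open qw(:std :encoding(UTF-8));

    sub read_list {
        my ($file) = @_;
        my (%fq, $total);
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($f, $w) = split /\t/, $line;
            next unless defined $w and $f =~ /^\d+$/ and $f > 0;
            $fq{$w} += $f;
            $total  += $f;
        }
        close $fh;
        return (\%fq, $total);
    }

    my ($fq1, $n1) = read_list($ARGV[0]);
    my ($fq2, $n2) = read_list($ARGV[1]);

    # For every word in the first list, compare its observed frequencies with
    # the frequencies expected if it were spread evenly over both lists.
    my %ll;
    for my $w (keys %$fq1) {
        my $a  = $fq1->{$w};
        my $b  = $fq2->{$w} || 0;
        my $e1 = $n1 * ($a + $b) / ($n1 + $n2);
        my $e2 = $n2 * ($a + $b) / ($n1 + $n2);
        my $g2 = $a * log($a / $e1);
        $g2   += $b * log($b / $e2) if $b > 0;
        $ll{$w} = 2 * $g2;
    }

    # Print the 20 words most specific to the first list
    my @top = sort { $ll{$b} <=> $ll{$a} } keys %ll;
    splice @top, 20 if @top > 20;
    printf "%s\t%.2f\n", $_, $ll{$_} for @top;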

Miscellaneous tools

  1. cedict2dictd.pl — converts CEDICT (a free Chinese-English dictionary) to the DICTD format and adds information from a frequency list
  2. make-lcmc.pl — makes a CWB file from LCMC files
  3. smallutils.pm — a collection of small functions for opening UTF8 files, working with frequency lists, transliterating Cyrillic characters, etc.
  4. utf8-tokenize.pl — a tokeniser for utf8 files; it is part of TreeTagger, but useful for other purposes as well (a rough illustration is given after this list).
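
For completeness, here is a very rough illustration of what a regular-expression tokeniser for UTF-8 text does. This is not the TreeTagger utf8-tokenize.pl, which also handles clitics, abbreviations and much more; it merely separates punctuation from word characters and prints one token per line.

    #!/usr/bin/perl
    # A very rough illustration of regular-expression tokenisation of UTF-8
    # text read from standard input (NOT the TreeTagger utf8-tokenize.pl,
    # which also handles clitics, abbreviations, URLs, etc.).
    use strict;
    use warnings;
    use open qw(:std :encoding(UTF-8));

    while (my $line = <STDIN>) {
        chomp $line;
        # put spaces around every character that is neither whitespace
        # nor a word character, then split on whitespace
        $line =~ s/([^\s\w])/ $1 /g;
        print "$_\n" for grep { length } split /\s+/, $line;
    }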

Copyright notice

The software downloadable from this page may be freely distributed and modified under the terms of the GNU General Public License or the Perl Artistic License; as per the license terms, the copyright notice at the top of each file must be retained.

The software is provided in the hope that it will be useful, but ABSOLUTELY NO warranty is given, in particular with respect to its suitability for your specific purposes. Please contact me if you would like to obtain this software under another license.

The resources have been developed by Serge Sharoff (Centre for Translation Studies, University of Leeds). Get in touch with me if you have any suggestions.



Serge Sharoff 2011-11-24