Serge Sharoff's homepage

I am Serge Sharoff, Professor of Language Technology and Digital Humanities at the University of Leeds, UK.

Official homepage: https://ahc.leeds.ac.uk/languages/staff/1137/dr-serge-sharoff
Google Scholar: https://scholar.google.co.uk/citations?user=qcnf4QsAAAAJ
Semantic Scholar: https://www.semanticscholar.org/author/S.-Sharoff/2506104

My research interests

Artificial Intelligence, and more specifically Large Language Models such as ChatGPT, have recently made a profound impact on how we interact with computers by providing the ability to produce new texts in response to prompts. Fundamental research in this area is at the core of my expertise: I have been working on it since my own PhD in the 1990s. This pre-dated LLMs, but the idea of linking language to meanings remains the same. One of my recent papers on the diversity of texts on the Web has been cited by some of the GPT creators at OpenAI. See also my collection of curious cases of testing ChatGPT in a range of scenarios.

My research interests are related to three domains: linguistics, computer science and cognitive science.

Probably the most interesting part of my recent research is the digital curation of corpora from the web; see the set of available large corpora and the procedure described at http://corpus.leeds.ac.uk/internet.html, as well as the full paper. The current set of resources includes multi-million-word corpora for Chinese, English, French, German, Italian, Polish, Portuguese, Russian and Spanish.

Web corpora can be curated in terms of domains and genres, and also via automatic annotation of their linguistic properties, such as parts of speech, syntactic relations or named entities. For many languages the resources for developing statistical models are relatively modest, so I research methods for bootstrapping them from related languages, for example, from Russian to Ukrainian.
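
As an illustration of this kind of annotation, here is a minimal sketch using spaCy; the pipeline name is an assumption, and any model with a tagger, parser and named-entity recogniser would serve:

    # A minimal annotation sketch: parts of speech, syntactic relations
    # and named entities, as mentioned above. Requires the (assumed)
    # pipeline: python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The University of Leeds hosts large web corpora.")

    for token in doc:
        # part of speech and syntactic relation to the head word
        print(token.text, token.pos_, token.dep_, token.head.text)

    for ent in doc.ents:
        # named entities with their types
        print(ent.text, ent.label_)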

Another aspect of Web corpus curation concerns developing frequency lists. While it is easy to count how many times each word occurs in a corpus, raw counts are vulnerable to frequency bursts, when a word is used very often in a small number of files. Please check my reliable frequency lists obtained for several corpora and several languages. See also the pedagogical frequency lists produced as the outcome of the Kelly project.
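
As an illustration, here is a minimal sketch of burst-resistant counting: raw frequencies are damped by how widely a word is dispersed across the corpus files. The weighting used here is a deliberately crude stand-in, not the measure behind the lists linked above:

    # Damp raw counts by document dispersion: a word used heavily in a
    # few files scores lower than one spread evenly across the corpus.
    from collections import Counter

    def adjusted_frequencies(documents):
        total = Counter()     # corpus-wide raw counts
        doc_freq = Counter()  # number of files containing each word
        for doc in documents:
            tokens = doc.lower().split()
            total.update(tokens)
            doc_freq.update(set(tokens))
        n_docs = len(documents)
        # weight each raw count by the share of files the word occurs in
        return {w: total[w] * doc_freq[w] / n_docs for w in total}

    docs = ["the cat sat", "the dog ran", "cat cat cat cat"]
    print(sorted(adjusted_frequencies(docs).items(), key=lambda x: -x[1]))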

Another example is ASSIST, a joint project with Lancaster University, which developed an automatic procedure for finding translation equivalents using large comparable corpora (consisting of texts which are not translations of each other). See the series of BUCC workshops and a recent book on the topic. Its introduction is available from my list of publications.
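
One common family of methods works at the word level: map embeddings trained on each monolingual corpus into a shared space and take nearest neighbours as candidate equivalents. A minimal sketch, with toy vectors standing in for properly trained and aligned embeddings (real work would use an alignment method such as VecMap or MUSE):

    import numpy as np

    def nearest_equivalent(word, src_vecs, tgt_vecs):
        # cosine similarity between the source word and every target word
        v = src_vecs[word] / np.linalg.norm(src_vecs[word])
        best, best_sim = None, -1.0
        for tgt_word, u in tgt_vecs.items():
            sim = float(v @ (u / np.linalg.norm(u)))
            if sim > best_sim:
                best, best_sim = tgt_word, sim
        return best, best_sim

    # toy 3-dimensional "embeddings" already in a shared space
    src = {"house": np.array([0.9, 0.1, 0.0])}
    tgt = {"maison": np.array([0.8, 0.2, 0.1]),
           "chien": np.array([0.0, 0.9, 0.4])}
    print(nearest_equivalent("house", src, tgt))  # -> ('maison', ...)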

My approach in linguistics rests on the assumption that language is a resource for exchanging meanings. My interests in linguistics stretch from contrastive semantics (how to study words that are used to mean things in different ways in different languages) to corpus linguistics (how to study real uses of words in their contexts) to computational linguistics (how to design computational models for natural language understanding and generation). See also the page with my tools for corpus collection and processing.

My interests in communication studies focus on the social practices of communities of language speakers, which result in the creation and maintenance of meanings in the intersubjective space of the people communicating. This is directly relevant to understanding when and how large language models are likely to fail through biases and hallucinations: by pre-training on very large corpora, LLMs acquire a very good command of the linguistic resources used for communication without understanding the conditions under which the corresponding texts were produced.

The most convenient access to the list of my publications is via Google Scholar.

See my formal CV as well as my academic genealogy (the chain of my supervisors can be traced back to Leibniz, Poisson and Gauss).

PhD projects

I am happy to consider applications from prospective PhD students in the areas of my expertise. The following general topics are preferred:

Automatic Text Classification for Translation

Setting up a translation project usually involves assessing the amount of time required to translate a text and selecting the most suitable translator. Modern approaches in Language Technology can do wonders with text processing, but it is not clear how helpful they can be in translation settings. For example, can they help determine the genre of a text, its difficulty, or its suitability for particular translators? Similar text classification tools can also be used for tasks related to learning foreign languages.
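
As a starting point, a genre classifier of the kind such a project might build on can be sketched with TF-IDF features and a linear model; the texts and labels below are toy placeholders:

    # A minimal genre-classification sketch with scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["The parties agree to the following terms...",
             "Breaking news: markets fell sharply today...",
             "Mix the flour and sugar, then bake for 20 minutes..."]
    genres = ["legal", "news", "instructions"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, genres)
    print(clf.predict(["Whisk the eggs and fold in the cream."]))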

Background references:

Language adaptation for improving models of lesser-resourced languages

A translation model needs to be applicable to a large number of languages, while training resources and linguistic models are often well developed for only some of them. Language adaptation can be designed in a way similar to domain adaptation, improving the models of lesser-resourced languages by taking into account the resources available for closely related languages, e.g., transferring from French to Romanian. This can be applied in a range of training scenarios, such as part-of-speech tagging, text classification and translation quality prediction.
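
A minimal sketch of the transfer recipe, assuming a shared multilingual encoder such as XLM-R: fine-tune on the related high-resource language, then apply the model unchanged to the lesser-resourced one:

    # Zero-shot transfer sketch: the shared subword vocabulary and
    # multilingual pretraining let a model fine-tuned on French be
    # applied directly to Romanian. Training loop omitted.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "xlm-roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)

    # 1. fine-tune `model` on labelled French data (not shown)
    # 2. evaluate directly on the related lesser-resourced language
    inputs = tokenizer("Acesta este un text în limba română.",
                       return_tensors="pt")
    print(model(**inputs).logits)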

Background references:

Non-parallel resources for translation

Modern Machine Translation is based on “plagiarising” large amounts of existing translations, which usually come from institutions such as the United Nations or the European Parliament. This is not enough for many language directions or for specific domains, such as biomedicine. What are productive methods for mining information about translations from non-parallel texts, such as Wikipedia articles on the same topic or newswire streams in different languages?
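
One widely used approach embeds sentences from both collections with a multilingual encoder and keeps high-similarity pairs as candidate translations. A minimal sketch, assuming the LaBSE encoder and an illustrative similarity threshold:

    # Mine candidate translation pairs from non-parallel text by
    # nearest-neighbour search over multilingual sentence embeddings.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/LaBSE")
    en = ["The parliament adopted the budget.",
          "It is raining in Leeds."]
    fr = ["Le parlement a adopté le budget.",
          "Le chat dort sur le canapé."]

    sims = util.cos_sim(model.encode(en), model.encode(fr))
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if float(row[j]) > 0.7:  # keep only confident pairs
            print(en[i], "<->", fr[j], float(row[j]))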

Background references:

I have also prepared a textbook on Comparable Corpora, published in the Synthesis Lectures series. The introduction to the book is available.