Lithuanian-Latvian, Latvian-Lithuanian Parallel Corpus (LILA)

Authors

  • Erika Rimkutė VDU
  • Andrius Utka VDU
  • Kristīne Levāne-Petrova

DOI:

https://doi.org/10.5755/j01.sal.0.23.4582

Keywords:

parallel corpus, Lithuanian language, Latvian language, Baltic languages, low-resource languages

Abstract

The paper presents a new linguistic resource, LILA, a Lithuanian-Latvian, Latvian-Lithuanian parallel corpus aligned at the paragraph and sentence levels. The total size of the LILA corpus is 9 million words. So far it is a unique resource for this language pair. The corpus contains metadata with bibliographical information (title, author, year of publication, etc.) and structural annotation, which marks the boundaries of aligned segments, paragraphs, and sentences. Paragraphs and sentences were aligned with the semi-automatic alignment tool Aligner 2.0.6.7. The corpus was compiled during 2011-2012 by researchers of the Vytautas Magnus University’s Centre of Computational Linguistics (VMU CCL) and the Laboratory of Artificial Intelligence of the Latvian University’s Mathematical and Informatics Institute (LU MII). The paper describes the problems and challenges that need to be solved when a parallel corpus for two small languages is created. The limited choice of appropriate parallel material is the most difficult obstacle, as it makes it hard to compile a corpus of the desired size. The paper presents the conception and structure of the LILA corpus, the phases of its compilation, the alignment tool, the query system, and examples of usage. The corpus is especially useful for teaching and learning languages, for comparing languages, for compiling dictionaries, and for developing language technology tools (e.g. statistical machine translation systems).

Published

2013-12-18

Section

COMPUTATIONAL LINGUISTICS