Lithuanian-Latvian, Latvian-Lithuanian Parallel Corpus (LILA)

Erika Rimkutė, Andrius Utka, Kristīne Levāne-Petrova

Abstract


The paper presents a new linguistic resource, LILA, a Lithuanian-Latvian-Lithuanian parallel corpus aligned at the paragraph and sentence level. The total size of the LILA corpus is 9 million words. So far it is a unique resource for this language pair. The corpus contains metadata with bibliographical information (title, author, year of publication, etc.) and structural annotation, which marks the boundaries of aligned segments, paragraphs, and sentences. Paragraphs and sentences were aligned with the semi-automatic alignment tool Aligner 2.0.6.7. The corpus was compiled during 2011-2012 by scientists of the Vytautas Magnus University's Centre of Computational Linguistics (VMU CCL) and the Latvian University's Mathematical and Informatics Institute's Laboratory of Artificial Intelligence (LU MII). The paper describes the problems and challenges that must be solved when a parallel corpus for two small languages is created. The limited choice of suitable parallel material is the greatest obstacle, because it makes it difficult to compile a corpus of the desired size. The paper presents the conception and structure of the LILA corpus, the phases of its compilation, the alignment tool, the query system, and examples of usage. The corpus is especially useful for teaching and learning languages, for comparing languages, for compiling dictionaries, and for developing language technology tools (e.g. statistical machine translation systems).

DOI: http://dx.doi.org/10.5755/j01.sal.0.23.4582


Keywords


parallel corpus; Lithuanian; Latvian; Baltic languages; under-resourced languages

Print ISSN: 1648-2824
Online ISSN: 2029-7203