Lithuanian-Latvian, Latvian-Lithuanian Parallel Corpus (LILA)
Keywords: parallel corpus, Lithuanian, Latvian, Baltic languages, low-resource languages
Abstract: The paper presents a new linguistic resource, LILA, a Lithuanian-Latvian, Latvian-Lithuanian parallel corpus aligned at the paragraph and sentence level. The total size of the LILA corpus is 9 million words. So far, it is a unique resource for this language pair. The corpus contains metadata with bibliographical information (title, author, year of publication, etc.) and structural annotation that marks the boundaries of aligned segments, paragraphs, and sentences. Paragraphs and sentences were aligned with the semi-automatic alignment tool Aligner 188.8.131.52. The corpus was compiled in 2011-2012 by researchers of Vytautas Magnus University's Centre of Computational Linguistics (VMU CCL) and the Laboratory of Artificial Intelligence at the Latvian University's Mathematical and Informatics Institute (LU MII). The paper describes the problems and challenges that must be solved when a parallel corpus is created for two small languages. The limited choice of suitable parallel material is the most difficult obstacle, because it makes it hard to compile a corpus of the desired size. The paper presents the conception and structure of the LILA corpus, the phases of its compilation, the alignment tool, the query system, and examples of usage. The corpus is especially useful for teaching and learning languages, for comparing languages, for compiling dictionaries, and for developing language technology tools (e.g. statistical machine translation systems).
The copyright for the articles in this journal is retained by the author(s) with the first publication right granted to the journal. The journal is licensed under the Creative Commons Attribution License 4.0 (CC BY 4.0).