Corpus methodology
Standards and milestones in corpus compilation
Aims and parameters
ParCorOE, an aligned parallel corpus of Old English prose, is an ongoing project that aims to compile 300,000 words in the source language, plus the parallel version in the target language. With this general aim, the parameters of ParCorOE can be set as follows.
Compilation standards
The following standards, which serve the general aim of increasing the searchability and recoverability of information, guide the design and compilation of ParCorOE.
Standard 1: Alignment
An aligned Old English-English parallel corpus consists of a parallel text, that is to say, an Old English text placed alongside its translation into Present-Day English, with alignment at text, sentence and word level, in such a way that each source language segment is paired with a target language segment. Word, sentence and text alignment requires tokenisation at these three structural levels. Alignment pairings should be marked by highlighting the source and the target segment.
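As a rough illustration of word-level pairing with highlighting, the sketch below stores a fragment as a list of source-target word pairs. The class names, the `highlight` method and the sample sentence are assumptions made for this example (the sentence is invented, not taken from the corpus), and strict one-to-one pairing in source order is a simplification, since real alignments may be one-to-many or reorder words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordPair:
    source: str  # Old English word
    target: str  # Present-Day English gloss


@dataclass
class AlignedFragment:
    text_id: str      # hypothetical text identifier
    fragment_no: int  # fragment number within the text
    pairs: list       # WordPair items in source-text order

    def highlight(self, index):
        """Return (source, target) strings with the pair at `index` bracketed."""
        def mark(words):
            return " ".join(
                f"[{w}]" if i == index else w for i, w in enumerate(words)
            )
        return (
            mark([p.source for p in self.pairs]),
            mark([p.target for p in self.pairs]),
        )


# Invented illustrative sentence, not a fragment of ParCorOE.
fragment = AlignedFragment(
    "demo",
    1,
    [
        WordPair("se", "the"),
        WordPair("cyning", "king"),
        WordPair("lufode", "loved"),
        WordPair("þæt", "the"),
        WordPair("folc", "people"),
    ],
)

source, target = fragment.highlight(1)
print(source)  # se [cyning] lufode þæt folc
print(target)  # the [king] loved the people
```

Selecting a pair by its index marks the same segment on both sides, which is the behaviour the standard asks of the highlighting.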
Standard 2: Annotation
Three types of annotation must be distinguished: markup at text level, and syntactic annotation and morphological tagging at sentence and word level. Fragments (tokens) comprise at least one sentence or one syntactically independent period and are identified by means of a text number.
Standard 3: Lemmatisation
Standard 4: Automation
Within the limits imposed by the available written standards and the variation that they present, the annotation of the parallel corpus must be automatic. This includes not only syntactic annotation and morphological tagging, but also the necessary lemmatisation. Lemmas and inflections must be listed dynamically.
Standard 5: Feeding
The corpus must be fed with the information available from The Knowledge Base of Old English (OEKB). The parallel corpus may retrieve information from the relational databases in OEKB in order to maximise the automation of the tasks of tagging, annotation and lemmatisation.
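One way to picture how a relational lookup could drive automatic lemmatisation and tagging is sketched below. The table name `inflection`, its columns, the tag labels and the sample rows are all hypothetical; the actual structure of the OEKB databases is not described here. An in-memory SQLite database stands in for the knowledge base.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for a form-to-lemma table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inflection (form TEXT, lemma TEXT, tag TEXT)")
    conn.executemany(
        "INSERT INTO inflection VALUES (?, ?, ?)",
        [
            ("se", "se", "DET"),
            ("þæt", "se", "DET"),     # neuter form of the demonstrative
            ("þæt", "þæt", "CONJ"),   # the same spelling as a conjunction
            ("cyning", "cyning", "NOUN"),
            ("lufode", "lufian", "VERB"),
            ("folc", "folc", "NOUN"),
        ],
    )
    return conn


def lemmatise(conn, form):
    """Return every (lemma, tag) candidate recorded for an attested form."""
    query = (
        "SELECT lemma, tag FROM inflection "
        "WHERE form = ? ORDER BY lemma, tag"
    )
    return conn.execute(query, (form.lower(),)).fetchall()


conn = build_demo_db()
for word in ["lufode", "þæt", "wundor"]:
    print(word, lemmatise(conn, word))
```

A form with one candidate can be tagged automatically; a form with several candidates, or with none, marks the limit of automation and would need disambiguation or manual review.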
Standard 6: Searchability
The corpus must be searchable by text, fragment and word, as well as by morphological tag and syntactic annotation. Combined searches by inflectional form and lemma are also required. The corpus must be based on a concordance and an index, so that the main layouts are interconnected.
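The search requirements above can be sketched with a flat token table plus a keyword-in-context view. Everything named here (`Token`, `search`, `kwic`, the tag labels and the sample tokens) is an assumption for illustration rather than the ParCorOE implementation, and token positions are assumed to be 0-based and contiguous within a fragment.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text_id: str      # hypothetical text identifier
    fragment_no: int  # fragment number within the text
    position: int     # 0-based position within the fragment
    form: str         # inflectional form as attested
    lemma: str
    tag: str          # hypothetical morphological tag


def search(tokens, **criteria):
    """Return tokens matching every given field, e.g. lemma='se', form='þæt'."""
    return [
        t for t in tokens
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


def kwic(tokens, hit, window=2):
    """Build a keyword-in-context line for `hit` from its own fragment."""
    fragment = sorted(
        (t for t in tokens
         if t.text_id == hit.text_id and t.fragment_no == hit.fragment_no),
        key=lambda t: t.position,
    )
    forms = [t.form for t in fragment]
    i = hit.position
    left = " ".join(forms[max(0, i - window):i])
    right = " ".join(forms[i + 1:i + 1 + window])
    return f"{left} [{hit.form}] {right}".strip()


# Invented illustrative tokens, not corpus data.
tokens = [
    Token("demo", 1, 0, "se", "se", "DET"),
    Token("demo", 1, 1, "cyning", "cyning", "NOUN"),
    Token("demo", 1, 2, "lufode", "lufian", "VERB"),
    Token("demo", 1, 3, "þæt", "se", "DET"),
    Token("demo", 1, 4, "folc", "folc", "NOUN"),
]

by_lemma = search(tokens, lemma="se")                 # both forms of the lemma
combined = search(tokens, lemma="se", form="þæt")     # lemma plus inflectional form
print([t.form for t in by_lemma])
print(kwic(tokens, combined[0]))
```

Because every hit carries its text, fragment and position, a result in the index can be traced back to its concordance line, which is one way to keep the layouts interconnected.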
Standard 7: Dissemination
The corpus must be available online in open access and must be searchable with an Internet browser. Users should not need training or previous experience with database software in order to search ParCorOE.
ParCorOEv1: The Pilot Corpus
In order to fix design inadequacies and compilation shortcomings, a ten-thousand-word pilot corpus was compiled and annotated. The selection of texts comprised the major genres of historical prose, religious prose and translations from Latin: The Anglo-Saxon Chronicle, Orosius, Ælfric's Lives of Saints, Cura Pastoralis, and Bede's Ecclesiastical History. The Old English texts, as well as their translations into Present-Day English, were extracted from Fernández Cuesta et al. (1997). The pilot corpus had two building blocks: the concordance to the texts (including a word index) and the parallel corpus layouts. Two layouts were distinguished: the static presentation, which offered the running Old English and Present-Day English texts aligned by fragment and word, with a word-for-word gloss as well as a fragment translation; and the dynamic presentation, which was aligned at word level in such a way that each word was highlighted in the source and in the target text. Full tagging and annotation were imported from the relational databases, including the information on lemma, alternative spellings, lexical category, morphological class, inflectional paradigm, derivational paradigm, meaning definition, and the references of the secondary sources that discuss the lemma or the inflectional form in question.
About us
RGFGs, Nerthus Project
Department of Modern Languages, University of La Rioja.
Nerthus Project - Universidad de La Rioja © 2023