Old Norse

Old Norse was a North Germanic language spoken by the inhabitants of Scandinavia and of their overseas settlements from about the 9th to the 13th centuries. The Proto-Norse language developed into Old Norse by the 8th century, and Old Norse began to develop into the modern North Germanic languages in the mid- to late 14th century, ending the language phase known as Old Norse. These dates are not absolute, however, since written Old Norse is found well into the 15th century. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with old_norse_) to discover available Old Norse corpora.

>>> from cltk.corpus.utils.importer import CorpusImporter

>>> corpus_importer = CorpusImporter("old_norse")

>>> corpus_importer.list_corpora
['old_norse_text_perseus', 'old_norse_models_cltk']

Stopword Filtering

To use the CLTK's built-in stopword list, filter a tokenized sentence against STOPS_LIST. The example below uses a sentence from Eiríks saga rauða:

>>> from nltk.tokenize.punkt import PunktLanguageVars
>>> from cltk.stop.old_norse.stops import STOPS_LIST
>>> sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'
>>> p = PunktLanguageVars()
>>> tokens = p.word_tokenize(sentence.lower())
>>> [w for w in tokens if w not in STOPS_LIST]
['var',
 'einn',
 'morgin',
 ',',
 'karlsefni',
 'rjóðrit',
 'flekk',
 'nökkurn',
 ',',
 'glitraði']
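
Note that the punctuation tokens survive the filter above. As a minimal sketch of the same idea that also drops punctuation, the following uses only the standard library. The SAMPLE_STOPS set and the filter_stopwords helper are hypothetical names introduced for illustration; the set is a hand-picked subset, not the contents of the CLTK's STOPS_LIST, and the regular-expression tokenizer is a stand-in for PunktLanguageVars.

```python
import re
import unicodedata

# Hypothetical, hand-picked stopwords for illustration only; the real list
# is cltk.stop.old_norse.stops.STOPS_LIST.
SAMPLE_STOPS = {"þat", "er", "þeir", "sá", "fyrir", "ofan", "sem", "við", "þeim"}


def filter_stopwords(sentence, stops=SAMPLE_STOPS):
    """Lowercase a sentence, split it into word tokens, and drop stopwords."""
    # Normalize to NFC so accented letters are single code points.
    text = unicodedata.normalize("NFC", sentence).lower()
    # \w+ matches runs of Unicode letters and digits, so punctuation is discarded.
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t not in stops]


sentence = ('Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit '
            'flekk nökkurn, sem glitraði við þeim')
content_words = filter_stopwords(sentence)
print(content_words)
```

Passing STOPS_LIST as the stops argument should give the CLTK's own filtering without the punctuation tokens, assuming the list contains lowercase entries as the example above suggests.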

POS tagging

You can get the POS tags of Old Norse texts using the CLTK's wrapper around the NLTK's TnT tagger. First, download the model by importing the old_norse_models_cltk corpus. The tagger was trained on annotated data from the Icelandic Parsed Historical Corpus (version 0.9, license: LGPL).
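
The download step can be sketched as follows, assuming the same CorpusImporter module path shown in the Corpora section and network access to the CLTK GitHub organization. The wrapper function download_old_norse_models is a hypothetical name introduced here; it is not part of the CLTK.

```python
def download_old_norse_models():
    """Fetch the old_norse_models_cltk corpus (requires the CLTK and network access)."""
    # Imported inside the function so that defining it does not require the CLTK.
    from cltk.corpus.utils.importer import CorpusImporter

    corpus_importer = CorpusImporter("old_norse")
    # Downloads the corpus into the local CLTK data directory.
    corpus_importer.import_corpus("old_norse_models_cltk")
```

Call download_old_norse_models() once before creating the tagger; later sessions can reuse the downloaded files.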

TnT tagger

The following sentence is from the first verse of Völuspá (a poem describing the destiny of the gods of Asgard).

>>> from cltk.tag.pos import POSTag

>>> tagger = POSTag('old_norse')

>>> sent = 'Hlióðs bið ek allar.'

>>> tagger.tag_tnt(sent)
[('Hlióðs', 'Unk'),
 ('bið', 'VBPI'),
 ('ek', 'PRO-N'),
 ('allar', 'Q-A'),
 ('.', '.')]
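
The tagger returns a list of (token, tag) pairs, so its output can be inspected with ordinary Python. The sketch below reuses the result shown above as literal data rather than calling the tagger, and assumes that the 'Unk' tag marks a token the model did not recognize.

```python
from collections import Counter

# Output of tagger.tag_tnt(sent) copied from the example above.
tagged = [('Hlióðs', 'Unk'),
          ('bið', 'VBPI'),
          ('ek', 'PRO-N'),
          ('allar', 'Q-A'),
          ('.', '.')]

# Tokens the tagger could not label (assumption: 'Unk' means unknown).
unknown = [word for word, tag in tagged if tag == 'Unk']

# Frequency of each tag in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

print(unknown)
print(tag_counts.most_common())
```

A high share of 'Unk' tags on a longer text may indicate spelling conventions that differ from the training data.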