Old Norse was a North Germanic language that was spoken by inhabitants of Scandinavia and inhabitants of their overseas settlements during about the 9th to 13th centuries. The Proto-Norse language developed into Old Norse by the 8th century, and Old Norse began to develop into the modern North Germanic languages in the mid- to late-14th century, ending the language phase known as Old Norse. These dates, however, are not absolute, since written Old Norse is found well into the 15th century. (Source: Wikipedia)
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with old_norse_) to discover available Old Norse corpora.
>>> from cltk.corpus.utils.importer import CorpusImporter
>>> corpus_importer = CorpusImporter("old_norse")
>>> corpus_importer.list_corpora
['old_norse_text_perseus', 'old_norse_models_cltk']
To use the CLTK's built-in stopwords list for Old Norse, import STOPS_LIST and filter your tokens against it. The example below uses a sentence from Eiríks saga rauða:
>>> from nltk.tokenize.punkt import PunktLanguageVars
>>> from cltk.stop.old_norse.stops import STOPS_LIST
>>> sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'
>>> p = PunktLanguageVars()
>>> tokens = p.word_tokenize(sentence.lower())
>>> [w for w in tokens if w not in STOPS_LIST]
['var', 'einn', 'morgin', ',', 'karlsefni', 'rjóðrit', 'flekk', 'nökkurn', ',', 'glitraði']
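The filtering step itself is only a membership test. The following is a minimal sketch that runs without the CLTK or the NLTK: SAMPLE_STOPS is a small hypothetical stop set chosen for illustration (it is not the CLTK's STOPS_LIST), and simple_tokenize is an assumed regex-based stand-in for the Punkt tokenizer.

```python
import re

# Hypothetical stop set for illustration only; it is not the CLTK's STOPS_LIST.
SAMPLE_STOPS = {"þat", "er", "þeir", "sá", "fyrir", "ofan", "sem", "við", "þeim"}


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w is Unicode-aware in Python 3, so þ, ð, ö and accented vowels count as letters.
    return re.findall(r"\w+|[^\w\s]", text)


def remove_stops(text, stops=SAMPLE_STOPS):
    """Lowercase the text, tokenize it, and drop every token found in stops."""
    return [tok for tok in simple_tokenize(text.lower()) if tok not in stops]


sentence = "Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim"
filtered = remove_stops(sentence)
print(filtered)
```

Lowercasing before the lookup matters because the stop entries are lowercase, while sentence-initial words such as Þat are capitalised.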
The corpus module has a class for generating a Swadesh list for Old Norse.
>>> from cltk.corpus.swadesh import Swadesh
>>> swadesh = Swadesh('old_norse')
>>> swadesh.words()[:10]
['ek', 'þú', 'hann', 'vér', 'þér', 'þeir', 'sjá, þessi', 'sá', 'hér', 'þar']
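Some entries hold several variants in one string, such as 'sjá, þessi'. A minimal sketch of flattening such entries into individual word forms, using only the ten entries shown above as literal data (the helper name flatten_variants is hypothetical and not part of the CLTK):

```python
# The first ten entries, copied from the output above.
first_ten = ['ek', 'þú', 'hann', 'vér', 'þér', 'þeir', 'sjá, þessi', 'sá', 'hér', 'þar']


def flatten_variants(entries):
    """Split comma-separated variants so each word form is its own item."""
    forms = []
    for entry in entries:
        forms.extend(part.strip() for part in entry.split(","))
    return forms


forms = flatten_variants(first_ten)
print(forms)
```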
A very simple tokenizer is available for Old Norse. For now, it does not handle constructions specific to Old Norse, such as conjugated verbs merged with þú or with sik. Here is a sentence from Gylfaginning:
>>> from cltk.tokenize.word import WordTokenizer
>>> word_tokenizer = WordTokenizer('old_norse')
>>> sentence = "Gylfi konungr var maðr vitr ok fjölkunnigr."
>>> result = word_tokenizer.tokenize(sentence)
>>> result
['Gylfi', 'konungr', 'var', 'maðr', 'vitr', 'ok', 'fjölkunnigr', '.']
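To illustrate the limitation mentioned above, here is a rough sketch of a punctuation-splitting tokenizer written with the standard library. It is an assumption about what a very simple tokenizer does, not the CLTK's implementation, and naive_tokenize is a hypothetical name. The second sentence is an invented illustrative phrase, not a quotation from Gylfaginning.

```python
import re


def naive_tokenize(text):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


print(naive_tokenize("Gylfi konungr var maðr vitr ok fjölkunnigr."))

# The merged form "skaltu" (skalt + þú) is not split into verb and pronoun;
# it comes back as a single token.
print(naive_tokenize("Nú skaltu fara."))
```

Splitting such merged forms would need a list of known verb and pronoun combinations or a morphological rule, which a purely punctuation-based approach does not have.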
You can get the POS tags of Old Norse texts using the CLTK's wrapper around the NLTK's TnT tagger. First, download the model by importing the old_norse_models_cltk corpus. This TnT tagger was trained on annotated data from the Icelandic Parsed Historical Corpus (version 0.9, license: LGPL).
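A sketch of the download step, reusing the CorpusImporter class shown earlier. It assumes the CLTK is installed and a network connection is available. The wrapper name download_old_norse_models is hypothetical, and the import sits inside the function so that defining it has no side effects.

```python
def download_old_norse_models():
    """Fetch the Old Norse models corpus, which contains the POS tagger model."""
    # Imported here so that defining the function does not require the CLTK.
    from cltk.corpus.utils.importer import CorpusImporter

    corpus_importer = CorpusImporter("old_norse")
    # Downloads the corpus into the local CLTK data directory.
    corpus_importer.import_corpus("old_norse_models_cltk")
```

Calling download_old_norse_models() once before creating the tagger should be enough; later runs can reuse the downloaded files.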
The following sentence is from the first verse of Völuspá (a poem describing the destiny of the gods of Asgard):
>>> from cltk.tag.pos import POSTag
>>> tagger = POSTag('old_norse')
>>> sent = 'Hlióðs bið ek allar.'
>>> tagger.tag_tnt(sent)
[('Hlióðs', 'Unk'), ('bið', 'VBPI'), ('ek', 'PRO-N'), ('allar', 'Q-A'), ('.', '.')]
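The tagger returns a list of (token, tag) pairs, which is easy to post-process. The sketch below uses the output shown above as literal data, so it runs without the model. Treating 'Unk' as the marker for words the tagger did not recognise is an assumption based on that output.

```python
tagged = [('Hlióðs', 'Unk'), ('bið', 'VBPI'), ('ek', 'PRO-N'), ('allar', 'Q-A'), ('.', '.')]

# Words the tagger could not label (assumed to be marked 'Unk').
unknown = [token for token, tag in tagged if tag == 'Unk']

# Map each tag to the tokens that received it.
by_tag = {}
for token, tag in tagged:
    by_tag.setdefault(tag, []).append(token)

print(unknown)
print(by_tag['VBPI'])
```

Collecting the unknown tokens this way gives a quick view of which words fall outside the tagger's training vocabulary.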