Old Norse was a North Germanic language that was spoken by inhabitants of Scandinavia and inhabitants of their overseas settlements during about the 9th to 13th centuries. The Proto-Norse language developed into Old Norse by the 8th century, and Old Norse began to develop into the modern North Germanic languages in the mid- to late-14th century, ending the language phase known as Old Norse. These dates, however, are not absolute, since written Old Norse is found well into the 15th century. (Source: Wikipedia)
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with
old_norse_) to discover available Old Norse corpora.

.. code-block:: python

   >>> from cltk.corpus.utils.importer import CorpusImporter
   >>> corpus_importer = CorpusImporter("old_norse")
   >>> corpus_importer.list_corpora
   ['old_norse_text_perseus', 'old_norse_models_cltk']

To use the CLTK's built-in stopwords list, import STOPS_LIST and filter the tokens of a text against it. We use an example from Eiríks saga rauða:

.. code-block:: python

   >>> from nltk.tokenize.punkt import PunktLanguageVars
   >>> from cltk.stop.old_norse.stops import STOPS_LIST
   >>> sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'
   >>> p = PunktLanguageVars()
   >>> tokens = p.word_tokenize(sentence.lower())
   >>> [w for w in tokens if w not in STOPS_LIST]
   ['var', 'einn', 'morgin', ',', 'karlsefni', 'rjóðrit', 'flekk', 'nökkurn', ',', 'glitraði']

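The filtering step itself does not depend on a particular tokenizer or stop list. The following is a minimal sketch under two stated assumptions: ``SAMPLE_STOPS`` is a hypothetical, hand-picked subset used only for illustration (it is not the CLTK's ``STOPS_LIST``), and a simple regular expression stands in for ``PunktLanguageVars``.

```python
import re

# Hypothetical, hand-picked subset for illustration; NOT the CLTK's STOPS_LIST.
SAMPLE_STOPS = {"þat", "er", "þeir", "sá", "fyrir", "ofan", "sem", "við", "þeim"}


def remove_stops(text, stops):
    """Lowercase the text, split it into word and punctuation tokens, drop stopwords."""
    # \w+ matches runs of word characters (Unicode-aware in Python 3),
    # [^\w\s] matches a single punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [tok for tok in tokens if tok not in stops]


sentence = ("Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit "
            "flekk nökkurn, sem glitraði við þeim")
filtered = remove_stops(sentence, SAMPLE_STOPS)
print(filtered)
```

Storing the stopwords in a set keeps each membership check constant-time, which can matter when filtering long texts.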
You can get the POS tags of Old Norse texts using the CLTK's wrapper around the NLTK's TnT tagger. First, download the model by importing the
old_norse_models_cltk corpus. This TnT tagger was trained on annotated data from the Icelandic Parsed Historical Corpus (version 0.9, license: LGPL).
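The download step can be sketched as follows. The helper name ``download_old_norse_models`` is hypothetical; the sketch assumes the same ``CorpusImporter`` class shown earlier and its ``import_corpus()`` method, and it needs network access when called.

```python
def download_old_norse_models():
    """Fetch the Old Norse models corpus that the POS tagger relies on."""
    # Imported inside the function so that defining this helper
    # does not itself require the CLTK to be installed.
    from cltk.corpus.utils.importer import CorpusImporter

    corpus_importer = CorpusImporter("old_norse")
    # Downloads the named corpus; the name comes from list_corpora.
    corpus_importer.import_corpus("old_norse_models_cltk")
```

Calling ``download_old_norse_models()`` once before tagging should be enough; later runs can reuse the downloaded files.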
The following sentence is from the first verse of Völuspá (a poem describing the destiny of the gods of Asgard):

.. code-block:: python

   >>> from cltk.tag.pos import POSTag
   >>> tagger = POSTag('old_norse')
   >>> sent = 'Hlióðs bið ek allar.'
   >>> tagger.tag_tnt(sent)
   [('Hlióðs', 'Unk'), ('bið', 'VBPI'), ('ek', 'PRO-N'), ('allar', 'Q-A'), ('.', '.')]
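
The tagger returns a list of (word, tag) pairs, which is easy to post-process. As a minimal sketch, the code below takes the output shown above as literal data and collects the words per tag; the helper names ``unknown_tokens`` and ``group_by_tag`` are hypothetical, and the reading of ``'Unk'`` as "word not seen in training" is an assumption based on how the NLTK's TnT tagger labels unknown words by default.

```python
from collections import defaultdict

# Tagger output copied from the example above; in practice this list
# would come from tagger.tag_tnt(sent).
tagged = [("Hlióðs", "Unk"), ("bið", "VBPI"), ("ek", "PRO-N"),
          ("allar", "Q-A"), (".", ".")]


def unknown_tokens(tagged_tokens, unknown_tag="Unk"):
    """Return the words that carry the unknown-word tag (assumed to be 'Unk')."""
    return [word for word, tag in tagged_tokens if tag == unknown_tag]


def group_by_tag(tagged_tokens):
    """Map each tag to the list of words that received it, in text order."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)


unknown = unknown_tokens(tagged)
by_tag = group_by_tag(tagged)
print(unknown)
print(by_tag)
```

Listing the unknown words in this way gives a quick indication of how much of a text falls outside the tagger's training vocabulary.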