8.1.8. cltk.lemmatize package

Init for cltk.lemmatize.

8.1.8.1. Submodules

8.1.8.2. cltk.lemmatize.ang module

class cltk.lemmatize.ang.OldEnglishDictionaryLemmatizer[source]

Bases: DictionaryRegexLemmatizer

Naive lemmatizer for Old English.

TODO: Add silent and non-interactive options to this class

>>> lemmatizer = OldEnglishDictionaryLemmatizer()
>>> lemmatizer.lemmatize_token('ġesāƿen')
'geseon'
>>> lemmatizer.lemmatize_token('ġesāƿen', return_frequencies=True)
('geseon', -6.519245611523386)
>>> lemmatizer.lemmatize_token('ġesāƿen', return_frequencies=True, best_guess=False)
[('geseon', -6.519245611523386), ('gesaƿan', 0), ('saƿan', 0)]
>>> lemmatizer.lemmatize(['Same', 'men', 'cweþaþ', 'on', 'Englisc', 'þæt', 'hit', 'sie', 'feaxede', 'steorra', 'forþæm', 'þær', 'stent', 'lang', 'leoma', 'of', 'hwilum', 'on', 'ane', 'healfe', 'hwilum', 'on', 'ælce', 'healfe'], return_frequencies=True, best_guess=False)
[[('same', -8.534148632065651), ('sum', -5.166852802079177)], [('mann', -6.829400539827225)], [('cweþan', -9.227295812625597)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('englisc', -8.128683523957486)], [('þæt', -2.365584472144866), ('se', -2.9011463394704973)], [('hit', -4.300042127468392)], [('wesan', -7.435536343397541)], [('feaxede', -9.227295812625597)], [('steorra', -8.534148632065651)], [('forðam', -6.282856833459156)], [('þær', -3.964605623720711)], [('standan', -7.617857900191496)], [('lang', -6.829400539827225)], [('leoma', -7.841001451505705)], [('of', -3.9440920838876075)], [('hwilum', -6.282856833459156)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('an', -5.02260319323463)], [('healf', -7.841001451505705)], [('hwilum', -6.282856833459156)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('ælc', -7.841001451505705)], [('healf', -7.841001451505705)]]
_load_forms_and_lemmas()[source]

Load the dictionary of lemmas and forms from the OE models repository.

_load_unigram_counts()[source]

Load the table of frequency counts of word forms.

8.1.8.3. cltk.lemmatize.backoff module

Lemmatization module. Includes several classes for different lemmatizing approaches: based on training data, regex pattern matching, etc. These can be chained together using the backoff parameter. Also includes a pre-built chain that uses models in cltk_data.

The logic behind the backoff lemmatizer is based on backoff POS-tagging in NLTK and repurposes several of the tagging classes for lemmatization tasks. See here for more info on sequential backoff tagging in NLTK: http://www.nltk.org/_modules/nltk/tag/sequential.html
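
For illustration, here is a minimal backoff chain built from the classes documented below; unmatched tokens fall through the DictLemmatizer to the DefaultLemmatizer. This is a sketch with a toy dictionary; real chains use models from cltk_data.

>>> from cltk.lemmatize.backoff import DictLemmatizer, DefaultLemmatizer
>>> default = DefaultLemmatizer('UNK')
>>> chain = DictLemmatizer(lemmas={'arma': 'arma', 'cano': 'cano'}, backoff=default)
>>> list(chain.lemmatize('arma virumque cano'.split()))
[('arma', 'arma'), ('virumque', 'UNK'), ('cano', 'cano')]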

PJB: The Latin lemmatizer modules were completed as part of Google Summer of Code 2016. I have written up a detailed report of the summer work here: https://gist.github.com/diyclassics/fc80024d65cc237f185a9a061c5d4824.

class cltk.lemmatize.backoff.SequentialBackoffLemmatizer(backoff, verbose=False)[source]

Bases: SequentialBackoffTagger

Abstract base class for lemmatizers created as a subclass of NLTK’s SequentialBackoffTagger. Lemmatizers in this class “[tag] words sequentially, left to right. Tagging of individual words is performed by the choose_tag() method, which should be defined by subclasses. If a tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.”

See: https://www.nltk.org/_modules/nltk/tag/sequential.html#SequentialBackoffTagger

Variables:
  • _taggers – A list of all the taggers in the backoff chain, inc. self.

  • _repr – An instance of Repr() from reprlib to handle list and dict length in subclass __repr__’s

tag(tokens)[source]

Docs (mostly) inherited from TaggerI; cf. https://www.nltk.org/_modules/nltk/tag/api.html#TaggerI.tag

Two tweaks:
  • Properly handle ‘verbose’ listing of the current tagger in the case of None (i.e. the if tag: branch).

  • Keep track of the taggers used and change the return value depending on the ‘verbose’ flag.

Return type:

list

Parameters:
  • tokens (list[str]) – List of tokens to tag.

tag_one(tokens, index, history)[source]

Determine an appropriate tag for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.

Return type:

tuple

Parameters:
  • tokens (list[str]) – The list of words that are being tagged.

  • index (int) – The index of the word whose tag should be returned.

  • history (list[str]) – A list of the tags for all words before index.

lemmatize(tokens)[source]

Wrapper around the tag method, adapted for lemmatizing tasks; cf. the tag method above.

Return type:

list[str]

class cltk.lemmatize.backoff.DefaultLemmatizer(lemma=None, backoff=None, verbose=False)[source]

Bases: SequentialBackoffLemmatizer

Lemmatizer that assigns the same lemma to every token. Useful as the final tagger in a chain, e.g. to assign ‘UNK’ to all remaining unlemmatized tokens.

Parameters:
  • lemma (str) – Lemma to assign to each token.

>>> default_lemmatizer = DefaultLemmatizer('UNK')
>>> list(default_lemmatizer.lemmatize('arma virumque cano'.split()))
[('arma', 'UNK'), ('virumque', 'UNK'), ('cano', 'UNK')]
choose_tag(tokens, index, history)[source]

Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.

Return type:

str

Parameters:
  • tokens (list[str]) – The list of words that are being tagged.

  • index (int) – The index of the word whose tag should be returned.

  • history (list[str]) – A list of the tags for all words before index.

class cltk.lemmatize.backoff.IdentityLemmatizer(backoff=None, verbose=False)[source]

Bases: SequentialBackoffLemmatizer

Lemmatizer that returns a given token as its lemma. Like DefaultLemmatizer, useful as the final tagger in a chain, e.g. to assign a possible form to all remaining unlemmatized tokens, increasing the chance of a successful match.

>>> identity_lemmatizer = IdentityLemmatizer()
>>> list(identity_lemmatizer.lemmatize('arma virumque cano'.split()))
[('arma', 'arma'), ('virumque', 'virumque'), ('cano', 'cano')]
choose_tag(tokens, index, history)[source]

Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.

Return type:

str

Parameters:
  • tokens (list[str]) – The list of words that are being tagged.

  • index (int) – The index of the word whose tag should be returned.

  • history (list[str]) – A list of the tags for all words before index.

class cltk.lemmatize.backoff.DictLemmatizer(lemmas, backoff=None, source=None, verbose=False)[source]

Bases: SequentialBackoffLemmatizer

Standalone version of the ‘model’ function found in UnigramTagger; by defining it as its own class, it is clearer that this lemmatizer is based on dictionary lookup and does not use training data.
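
A minimal usage sketch with a hand-built lemma dictionary (real use would pass a model loaded from cltk_data):

>>> lemmatizer = DictLemmatizer(lemmas={'arma': 'arma', 'virum': 'vir'})
>>> list(lemmatizer.lemmatize('arma virum'.split()))
[('arma', 'arma'), ('virum', 'vir')]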

choose_tag(tokens, index, history)[source]

Looks up the token in the lemmas dict and returns the corresponding value as the lemma.

Return type:

str

Parameters:
  • tokens (list[str]) – List of tokens to be lemmatized.

  • index (int) – Index of the current token.

  • history (list[str]) – List of tokens that have already been lemmatized; NOT USED.

class cltk.lemmatize.backoff.UnigramLemmatizer(train=None, model=None, backoff=None, source=None, cutoff=0, verbose=False)[source]

Bases: SequentialBackoffLemmatizer, UnigramTagger

Standalone version of the ‘train’ function found in UnigramTagger; by defining it as its own class, it is clearer that this lemmatizer is based on training data and not on a dictionary.
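
A minimal usage sketch, assuming NLTK-style training data, i.e. a list of sentences where each sentence is a list of (form, lemma) pairs:

>>> train = [[('arma', 'arma'), ('virumque', 'vir')], [('arma', 'arma')]]
>>> lemmatizer = UnigramLemmatizer(train=train)
>>> list(lemmatizer.lemmatize('arma virumque'.split()))
[('arma', 'arma'), ('virumque', 'vir')]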

class cltk.lemmatize.backoff.RegexpLemmatizer(regexps=None, source=None, backoff=None, verbose=False)[source]

Bases: SequentialBackoffLemmatizer, RegexpTagger

Regular expression tagger, inheriting from SequentialBackoffLemmatizer and RegexpTagger.
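
A minimal usage sketch with a single illustrative ending rule (not one of the shipped rule sets): the pattern keeps the base as group 1, and the replacement appends the new ending.

>>> lemmatizer = RegexpLemmatizer(regexps=[(r'(\w*)is$', r'\1us')])
>>> list(lemmatizer.lemmatize(['dominis']))
[('dominis', 'dominus')]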

choose_tag(tokens, index, history)[source]

Use regular expressions for rules-based lemmatizing based on word endings; tokens are matched against patterns, with the base kept as a group, and a word-ending replacement is appended to the (base) group.

Return type:

str

Parameters:
  • tokens (list[str]) – List of tokens to be lemmatized.

  • index (int) – Index of the current token.

  • history (list[str]) – List of tokens that have already been lemmatized; NOT USED.

8.1.8.4. cltk.lemmatize.fro module

Lemmatizer for Old French. Rules are based on Brunot & Bruneau (1949).

class cltk.lemmatize.fro.OldFrenchDictionaryLemmatizer[source]

Bases: DictionaryRegexLemmatizer

Naive lemmatizer for Old French.

>>> lemmatizer = OldFrenchDictionaryLemmatizer()
>>> lemmatizer.lemmatize_token('corant')
'corant'
>>> lemmatizer.lemmatize_token('corant', return_frequencies=True)
('corant', -9.319508628976836)
>>> lemmatizer.lemmatize_token('corant', return_frequencies=True, best_guess=False)
[('corir', 0), ('corant', -9.319508628976836)]
>>> lemmatizer.lemmatize(['corant', '.', 'vult', 'premir'], return_frequencies=True, best_guess=False)
[[('corir', 0), ('corant', -9.319508628976836)], [('PUNK', 0)], [('vout', -7.527749159748781)], [('premir', 0)]]
_load_forms_and_lemmas()[source]

Load the dictionary of lemmas and forms from the fro data repository.

_load_unigram_counts()[source]

Load the table of frequency counts of word forms.

8.1.8.5. cltk.lemmatize.grc module

Module for lemmatizing Ancient Greek.

class cltk.lemmatize.grc.GreekBackoffLemmatizer(train=None, seed=3, verbose=False)[source]

Bases: object

Suggested backoff chain; includes at least one of each major type of sequential backoff class from backoff.py.

lemmatize(tokens)[source]

Lemmatize a list of words.

>>> lemmatizer = GreekBackoffLemmatizer()
>>> from cltk.alphabet.text_normalization import cltk_normalize
>>> word = cltk_normalize('διοτρεφές')
>>> lemmatizer.lemmatize([word])
[('διοτρεφές', 'διοτρεφής')]
>>> republic = cltk_normalize("κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος")
>>> lemmatizer.lemmatize(republic.split())
[('κατέβην', 'καταβαίνω'), ('χθὲς', 'χθές'), ('εἰς', 'εἰς'), ('Πειραιᾶ', 'Πειραιεύς'), ('μετὰ', 'μετά'), ('Γλαύκωνος', 'Γλαύκων'), ('τοῦ', 'ὁ'), ('Ἀρίστωνος', 'Ἀρίστων')]
evaluate()[source]

8.1.8.6. cltk.lemmatize.lat module

Module for lemmatizing Latin.

class cltk.lemmatize.lat.RomanNumeralLemmatizer(default=None, backoff=None)[source]

Bases: RegexpLemmatizer

Lemmatizer for identifying Roman numerals in Latin text based on regex.

>>> lemmatizer = RomanNumeralLemmatizer()
>>> lemmatizer.lemmatize("i ii iii iv v vi vii vii ix x xx xxx xl l lx c cc".split())
[('i', 'NUM'), ('ii', 'NUM'), ('iii', 'NUM'), ('iv', 'NUM'), ('v', 'NUM'), ('vi', 'NUM'), ('vii', 'NUM'), ('vii', 'NUM'), ('ix', 'NUM'), ('x', 'NUM'), ('xx', 'NUM'), ('xxx', 'NUM'), ('xl', 'NUM'), ('l', 'NUM'), ('lx', 'NUM'), ('c', 'NUM'), ('cc', 'NUM')]
>>> lemmatizer = RomanNumeralLemmatizer(default="RN")
>>> lemmatizer.lemmatize('i ii iii'.split())
[('i', 'RN'), ('ii', 'RN'), ('iii', 'RN')]
choose_tag(tokens, index, history)[source]

Use regular expressions for rules-based lemmatizing based on word endings; tokens are matched against patterns, with the base kept as a group, and a word-ending replacement is appended to the (base) group.

Return type:

str

Parameters:
  • tokens (list[str]) – List of tokens to be lemmatized.

  • index (int) – Index of the current token.

  • history (list[str]) – List of tokens that have already been lemmatized; NOT USED.

class cltk.lemmatize.lat.LatinBackoffLemmatizer(train=None, seed=3, verbose=False)[source]

Bases: object

Suggested backoff chain; includes at least one of each major type of sequential backoff class from backoff.py.

Putting it all together: this is the BETA version of the backoff lemmatizer, a.k.a. BackoffLatinLemmatizer. For comparison, there is also a TrainLemmatizer that replicates the original Latin lemmatizer from cltk.stem.
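
A minimal usage sketch (the lemmas returned depend on the lat models installed in cltk_data, so output is omitted here):

>>> lemmatizer = LatinBackoffLemmatizer()
>>> pairs = lemmatizer.lemmatize('arma virumque cano'.split())  # list of (token, lemma) pairs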

lemmatize(tokens)[source]
evaluate()[source]

8.1.8.7. cltk.lemmatize.naive_lemmatizer module

class cltk.lemmatize.naive_lemmatizer.DictionaryRegexLemmatizer[source]

Bases: ABC

Implementation of a lemmatizer based on a dictionary of lemmas and forms, backing off to regex rules. Since a given form may map to multiple lemmas, a corpus-based frequency disambiguator is employed.

Subclasses must provide methods to load dictionary and corpora, and to specify regular expressions.
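
A toy subclass sketch: the two loader hooks are the ones documented below, while _specify_regex_rules is a hypothetical name for the regex hook, since this reference does not list its exact signature.

>>> from cltk.lemmatize.naive_lemmatizer import DictionaryRegexLemmatizer
>>> class ToyLemmatizer(DictionaryRegexLemmatizer):
...     def _load_forms_and_lemmas(self):
...         return {'wulfas': ['wulf']}  # form -> list of candidate lemmas
...     def _load_unigram_counts(self):
...         return {'wulf': 10}  # corpus frequency table
...     def _specify_regex_rules(self):  # hypothetical name for the regex hook
...         return [(r'(\w+)as$', r'\1')]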

_relative_frequency(word)[source]

Computes the log relative frequency for a word form.

Return type:

float
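
This is presumably log(count(word) / total_token_count) over the loaded unigram table, which would explain the negative values in the doctests above; a sketch under that assumption:

>>> import math
>>> round(math.log(10 / 7500), 4)  # hypothetical counts: 10 occurrences out of 7500 tokens
-6.6201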

_apply_regex(token)[source]

Looks for a match between the token and the regex rules. If one is found, applies the replacement part of the rule to the token and returns the result; otherwise returns the token unchanged.

lemmatize_token(token, best_guess=True, return_frequencies=False)[source]

Lemmatize a single token. If best_guess is true, then take the most frequent lemma when a form has multiple possible lemmatizations; if the form is not found, just return it. If best_guess is false, then always return the full set of possible lemmas, or the empty list if none are found. If return_frequencies is true, then also return the relative frequency of the lemma in a corpus.

>>> from cltk.lemmatize.ang import OldEnglishDictionaryLemmatizer
>>> lemmatizer = OldEnglishDictionaryLemmatizer()
>>> lemmatizer.lemmatize_token('fōrestepeþ')
'foresteppan'
>>> lemmatizer.lemmatize_token('Caesar', return_frequencies=True, best_guess=True)
('Caesar', 0)
Return type:

Union[str, list[Union[str, tuple[str, float]]]]

lemmatize(tokens, best_guess=True, return_frequencies=False)[source]

Lemmatize tokens in a list of strings.

>>> from cltk.lemmatize.ang import OldEnglishDictionaryLemmatizer
>>> lemmatizer = OldEnglishDictionaryLemmatizer()
>>> lemmatizer.lemmatize(['eotenas','ond','ylfe','ond','orcneas'], return_frequencies=True, best_guess=True)
[('eoten', -9.227295812625597), ('and', -2.8869365088978443), ('ylfe', -9.227295812625597), ('and', -2.8869365088978443), ('orcneas', -9.227295812625597)]
Return type:

Union[str, list[Union[str, tuple[str, float]]]]

8.1.8.8. cltk.lemmatize.processes module

Processes for lemmatization.

class cltk.lemmatize.processes.LemmatizationProcess(language=None)[source]

Bases: Process

To be inherited for each language’s lemmatization declarations.

Example: LemmatizationProcess -> LatinLemmatizationProcess

>>> from cltk.lemmatize.processes import LemmatizationProcess
>>> from cltk.core.data_types import Process
>>> issubclass(LemmatizationProcess, Process)
True
run(input_doc)[source]
Return type:

Doc

class cltk.lemmatize.processes.GreekLemmatizationProcess(language=None)[source]

Bases: LemmatizationProcess

The default Ancient Greek lemmatization algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers import MultilingualTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Greek pipeline", processes=[MultilingualTokenizationProcess, GreekLemmatizationProcess], language=get_lang("grc"))
>>> nlp = NLP(language='grc', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("grc")).lemmata[30:40]
['ἔλεγον.', 'καίτοι', 'ἀληθές', 'γε', 'ὡς', 'ἔπος', 'εἰπεῖν', 'οὐδὲν', 'εἰρήκασιν.', 'μάλιστα']
description = 'Lemmatization process for Ancient Greek'
algorithm
class cltk.lemmatize.processes.LatinLemmatizationProcess(language=None)[source]

Bases: LemmatizationProcess

The default Latin lemmatization algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers import LatinTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess, LatinLemmatizationProcess], language=get_lang("lat"))
>>> nlp = NLP(language='lat', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("lat")).lemmata[30:40]
['institutis', ',', 'legibus', 'inter', 'se', 'differunt', '.', 'Gallos', 'ab', 'Aquitanis']
description = 'Lemmatization process for Latin'
algorithm
class cltk.lemmatize.processes.OldEnglishLemmatizationProcess(language=None)[source]

Bases: LemmatizationProcess

The default Old English lemmatization algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers import MultilingualTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old English pipeline", processes=[MultilingualTokenizationProcess, OldEnglishLemmatizationProcess], language=get_lang("ang"))
>>> nlp = NLP(language='ang', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("ang")).lemmata[30:40]
['siððan', 'ær', 'weorþan', 'feasceaft', 'findan', ',', 'he', 'se', 'frofre', 'gebidan']
description = 'Lemmatization process for Old English'
algorithm
class cltk.lemmatize.processes.OldFrenchLemmatizationProcess(language=None)[source]

Bases: LemmatizationProcess

The default Old French lemmatization algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers import MultilingualTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old French pipeline", processes=[MultilingualTokenizationProcess, OldFrenchLemmatizationProcess], language=get_lang("fro"))
>>> nlp = NLP(language='fro', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("fro")).lemmata[30:40]
['avenir', 'jadis', 'en', 'bretaingne', 'avoir', '.I.', 'molt', 'riche', 'chevalier', 'PUNK']
description = 'Lemmatization process for Old French'
algorithm