8.1.5. cltk.dependency package

Init for cltk.dependency.

8.1.5.1. Submodules

8.1.5.2. cltk.dependency.processes module

Process classes for dependency parsing via the Stanza and spaCy projects.

class cltk.dependency.processes.StanzaProcess(language=None)[source]

Bases: Process

A Process type to capture everything that the stanza project can do for a given language.

Note

stanza has only partial functionality available for some languages.

>>> from cltk.core.data_types import Doc
>>> from cltk.dependency.processes import StanzaProcess
>>> from cltk.languages.example_texts import get_example_text
>>> process_stanza = StanzaProcess(language="lat")
>>> isinstance(process_stanza, StanzaProcess)
True
>>> from stanza.models.common.doc import Document
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> isinstance(output_doc.stanza_doc, Document)
True
language: str = None
algorithm
run(input_doc)[source]
Return type:

Doc

static stanza_to_cltk_word_type(stanza_doc)[source]

Take an entire stanza document, extract each word, and encode it in the way expected by the CLTK’s Word type.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.dependency.processes import StanzaProcess
>>> from cltk.languages.example_texts import get_example_text
>>> process_stanza = StanzaProcess(language="lat")
>>> cltk_words = process_stanza.run(Doc(raw=get_example_text("lat"))).words
>>> isinstance(cltk_words, list)
True
>>> isinstance(cltk_words[0], Word)
True
>>> cltk_words[0]
 Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='Gallia', pos=noun, lemma='Gallia', stem=None, scansion=None, xpos='A1|grn1|casA|gen2', upos='NOUN', dependency_relation='nsubj', governor=1, features={Case: [nominative], Gender: [feminine], InflClass: [ind_eur_a], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)
class cltk.dependency.processes.GreekStanzaProcess(language='grc', description='Default process for Stanza for the Ancient Greek language.', authorship_info='``LatinSpacyProcess`` using Stanza model by Stanford University from https://stanfordnlp.github.io/stanza/ . Please cite: https://arxiv.org/abs/2003.07082')[source]

Bases: StanzaProcess

Stanza processor for Ancient Greek.

language: str = 'grc'
description: str = 'Default process for Stanza for the Ancient Greek language.'
authorship_info: str = '``LatinSpacyProcess`` using Stanza model by Stanford University from https://stanfordnlp.github.io/stanza/ . Please cite: https://arxiv.org/abs/2003.07082'
class cltk.dependency.processes.LatinStanzaProcess(language='lat', description='Default process for Stanza for the Latin language.')[source]

Bases: StanzaProcess

Stanza processor for Latin.

language: str = 'lat'
description: str = 'Default process for Stanza for the Latin language.'
class cltk.dependency.processes.OCSStanzaProcess(language='chu', description='Default process for Stanza for the Old Church Slavonic language.')[source]

Bases: StanzaProcess

Stanza processor for Old Church Slavonic.

language: str = 'chu'
description: str = 'Default process for Stanza for the Old Church Slavonic language.'
class cltk.dependency.processes.OldFrenchStanzaProcess(language='fro', description='Default process for Stanza for the Old French language.')[source]

Bases: StanzaProcess

Stanza processor for Old French.

language: str = 'fro'
description: str = 'Default process for Stanza for the Old French language.'
class cltk.dependency.processes.GothicStanzaProcess(language='got', description='Default process for Stanza for the Gothic language.')[source]

Bases: StanzaProcess

Stanza processor for Gothic.

language: str = 'got'
description: str = 'Default process for Stanza for the Gothic language.'
class cltk.dependency.processes.CopticStanzaProcess(language='cop', description='Default process for Stanza for the Coptic language.')[source]

Bases: StanzaProcess

Stanza processor for Coptic.

language: str = 'cop'
description: str = 'Default process for Stanza for the Coptic language.'
class cltk.dependency.processes.ChineseStanzaProcess(language='lzh', description='Default process for Stanza for the Classical Chinese language.')[source]

Bases: StanzaProcess

Stanza processor for Classical Chinese.

language: str = 'lzh'
description: str = 'Default process for Stanza for the Classical Chinese language.'
class cltk.dependency.processes.TreeBuilderProcess(language=None)[source]

Bases: Process

A Process that takes a doc containing sentences of CLTK words and returns a dependency tree for each sentence.

TODO: JS help to make this work, illustrate better.

>>> from cltk import NLP
>>> nlp = NLP(language="got", suppress_banner=True)
>>> from cltk.dependency.processes import TreeBuilderProcess
>>> nlp.pipeline.add_process(TreeBuilderProcess)  
>>> from cltk.languages.example_texts import get_example_text  
>>> doc = nlp.analyze(text=get_example_text("got"))  
>>> len(doc.trees)  
4
algorithm(doc)[source]
class cltk.dependency.processes.SpacyProcess(language=None)[source]

Bases: Process

A Process type to capture everything that the spaCy project can do for a given language.

Note

spacy has only partial functionality available for some languages.

>>> from cltk.languages.example_texts import get_example_text
>>> process_spacy = SpacyProcess(language="lat")
>>> isinstance(process_spacy, SpacyProcess)
True

# >>> from spacy.tokens.doc import Doc as SpacyDoc
# >>> output_doc = process_spacy.run(Doc(raw=get_example_text("lat")))
# >>> isinstance(output_doc.spacy_doc, SpacyDoc)
# True

algorithm
run(input_doc)[source]
Return type:

Doc

static spacy_to_cltk_word_type(spacy_doc)[source]

Take an entire spacy document, extract each word, and encode it in the way expected by the CLTK’s Word type.

It works only if sentence boundaries have been set by the loaded model.

See the note in the code about starting the word token index at 1.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.dependency.processes import SpacyProcess
>>> from cltk.languages.example_texts import get_example_text
>>> process_spacy = SpacyProcess(language="lat")
>>> cltk_words = process_spacy.run(Doc(raw=get_example_text("lat"))).words
>>> isinstance(cltk_words, list)
True
>>> isinstance(cltk_words[0], Word)
True
>>> cltk_words[0]
Word(index_char_start=0, index_char_stop=6, index_token=0, index_sentence=0, string='Gallia', pos=None, lemma='Gallia', stem=None, scansion=None, xpos='proper_noun', upos='PROPN', dependency_relation='nsubj', governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)
class cltk.dependency.processes.LatinSpacyProcess(language='lat', description="Process for Spacy for Patrick Burn's Latin model.", authorship_info='``LatinSpacyProcess`` using LatinCy model by Patrick Burns from https://arxiv.org/abs/2305.04365 . Please cite: https://arxiv.org/abs/2305.04365')[source]

Bases: SpacyProcess

Run a Spacy model.

See: https://huggingface.co/latincy

language: Literal['lat'] = 'lat'
description: str = "Process for Spacy for Patrick Burn's Latin model."
authorship_info: str = '``LatinSpacyProcess`` using LatinCy model by Patrick Burns from https://arxiv.org/abs/2305.04365 . Please cite: https://arxiv.org/abs/2305.04365'

8.1.5.3. cltk.dependency.spacy_wrapper module

Wrapper for spaCy NLP software and models.

class cltk.dependency.spacy_wrapper.SpacyWrapper(language, nlp=None, interactive=True, silent=False)[source]

Bases: object

SpacyWrapper is an interface between spaCy and the CLTK.

nlps: dict[str, cltk.dependency.spacy_wrapper.SpacyWrapper] = {}
parse(text)[source]
>>> from cltk.languages.example_texts import get_example_text
>>> from spacy.tokens.doc import Doc as SpacyDoc
>>> spacy_wrapper: SpacyWrapper = SpacyWrapper(language="lat")
>>> latin_spacy_doc: SpacyDoc = spacy_wrapper.parse(get_example_text("lat"))
Parameters:

text (str) – Text to analyze.

Return type:

Doc

Returns:

classmethod get_nlp(language)[source]
Parameters:

language (str) – Language parameter to retrieve an already-loaded model or the default model.

Return type:

SpacyWrapper

Returns:

A saved instance of SpacyWrapper.

is_wrapper_available()[source]

Maps an ISO 639-3 language id (e.g., lat for Latin) to the id used by spacy (la); confirms that the language is one the CLTK supports (i.e., that it is pre-modern).

>>> spacy_wrapper: SpacyWrapper = SpacyWrapper(language='lat', interactive=False, silent=True)
>>> spacy_wrapper.is_wrapper_available()
True
Return type:

bool

_get_spacy_code()[source]

Get Spacy abbreviation from the ISO standard name.

Return type:

str
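
The docs above attest only the lat → la correspondence; a hedged sketch of what such a lookup might look like (the dict name and its contents beyond "lat" are assumptions, not the real CLTK table):

```python
# Only "lat" -> "la" is attested above; any further entries would be assumptions.
MAP_CLTK_TO_SPACY = {"lat": "la"}

def get_spacy_code(iso_code: str) -> str:
    """Translate an ISO 639-3 id into the abbreviation spaCy expects."""
    try:
        return MAP_CLTK_TO_SPACY[iso_code]
    except KeyError:
        raise KeyError(f"Language '{iso_code}' is not supported by the spaCy wrapper.")

assert get_spacy_code("lat") == "la"
```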

_load_model()[source]

Load model into memory.

Return type:

Language

8.1.5.4. cltk.dependency.stanza_wrapper module

Wrapper for the Python Stanza package. About: https://github.com/stanfordnlp/stanza.

class cltk.dependency.stanza_wrapper.StanzaWrapper(language, treebank=None, stanza_debug_level='ERROR', interactive=True, silent=False)[source]

Bases: object

CLTK’s wrapper for the Stanza project.

nlps: dict[str, cltk.dependency.stanza_wrapper.StanzaWrapper] = {}
parse(text)[source]

Run all available stanza parsing on input text.

>>> from cltk.languages.example_texts import get_example_text
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> greek_nlp = stanza_wrapper.parse(get_example_text("grc"))
>>> from stanza.models.common.doc import Document, Token
>>> isinstance(greek_nlp, Document)
True
>>> nlp_greek_first_sent = greek_nlp.sentences[0]
>>> isinstance(nlp_greek_first_sent.tokens[0], Token)
True
>>> nlp_greek_first_sent.tokens[0].text
'ὅτι'
>>> nlp_greek_first_sent.tokens[0].words
[{
  "id": 1,
  "text": "ὅτι",
  "lemma": "ὅτι",
  "upos": "ADV",
  "xpos": "Df",
  "head": 13,
  "deprel": "advmod",
  "start_char": 0,
  "end_char": 3
}]
>>> nlp_greek_first_sent.tokens[0].start_char
0
>>> nlp_greek_first_sent.tokens[0].end_char
3
>>> nlp_greek_first_sent.tokens[0].misc
>>> nlp_greek_first_sent.tokens[0].pretty_print()
'<Token id=1;words=[<Word id=1;text=ὅτι;lemma=ὅτι;upos=ADV;xpos=Df;head=13;deprel=advmod>]>'
>>> nlp_greek_first_sent.tokens[0].to_dict()
[{'id': 1, 'text': 'ὅτι', 'lemma': 'ὅτι', 'upos': 'ADV', 'xpos': 'Df', 'head': 13, 'deprel': 'advmod', 'start_char': 0, 'end_char': 3}]
>>> first_word = nlp_greek_first_sent.tokens[0].words[0]
>>> first_word.id
1
>>> first_word.text
'ὅτι'
>>> first_word.lemma
'ὅτι'
>>> first_word.upos
'ADV'
>>> first_word.xpos
'Df'
>>> first_word.feats
>>> first_word.head
13
>>> first_word.parent
[
  {
    "id": 1,
    "text": "ὅτι",
    "lemma": "ὅτι",
    "upos": "ADV",
    "xpos": "Df",
    "head": 13,
    "deprel": "advmod",
    "start_char": 0,
    "end_char": 3
  }
]
>>> first_word.misc
>>> first_word.deprel
'advmod'
>>> first_word.pos
'ADV'
Return type:

Document

_load_pipeline()[source]

Instantiate stanza.Pipeline().

TODO: Make sure that logging captures what it should from the default stanza printout.

TODO: Make note that full lemmatization is not possible for Old French.

>>> import stanza
>>> from cltk.utils import suppress_stdout
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> with suppress_stdout():
...     nlp_obj = stanza_wrapper._load_pipeline()
>>> isinstance(nlp_obj, stanza.pipeline.core.Pipeline)
True
>>> stanza_wrapper = StanzaWrapper(language='fro', stanza_debug_level="INFO", interactive=False, silent=True)
>>> with suppress_stdout():
...     nlp_obj = stanza_wrapper._load_pipeline()
>>> isinstance(nlp_obj, stanza.pipeline.core.Pipeline)
True
Return type:

Pipeline

_is_model_present()[source]

Checks if the model is already downloaded.

>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._is_model_present()
True
Return type:

bool

_download_model()[source]

Interface with the stanza model downloader.

Return type:

None

_get_default_treebank()[source]

Return description of a language’s default treebank if none supplied.

>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._get_default_treebank()
'proiel'
Return type:

str

_is_valid_treebank()[source]

Check whether the optional treebank value is valid for the chosen language.

>>> stanza_wrapper = StanzaWrapper(language='grc', treebank='proiel', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._is_valid_treebank()
True
Return type:

bool

is_wrapper_available()[source]

Maps an ISO 639-3 language id (e.g., lat for Latin) to the id used by stanza (la); confirms that the language is one the CLTK supports (i.e., that it is pre-modern).

>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper.is_wrapper_available()
True
Return type:

bool

_get_stanza_code()[source]

Using known-supported language, use the CLTK’s internal code to look up the code used by Stanza.

>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._get_stanza_code()
'grc'
>>> stanza_wrapper.language = "xxx"
>>> stanza_wrapper._get_stanza_code()
Traceback (most recent call last):
  ...
KeyError: 'Somehow ``StanzaWrapper.language`` got renamed to something invalid. This should never happen.'
Return type:

str

classmethod get_nlp(language, treebank=None)[source]
Return type:

StanzaWrapper

8.1.5.5. cltk.dependency.tree module

A data structure for representing dependency tree graphs.

class cltk.dependency.tree.Form(form, form_id=0)[source]

Bases: Element

Represents a word (i.e., a node) of a dependency tree and its attributes. Inherits from the Element class of Python’s xml.etree library.

>>> desc_form = Form('described')
>>> desc_form
described_0
>>> desc_form.set('Tense', 'Past')
>>> desc_form
described_0
>>> desc_form / 'VBN'
described_0/VBN
>>> desc_form.full_str()
'described_0 [Tense=Past,pos=VBN]'
get_dependencies(relation)[source]

Extract dependents of this form for the specified dependency relation.

>>> john = Form('John', 1) / 'NNP'
>>> loves = Form('loves', 2) / 'VRB'
>>> mary = Form('Mary', 3) / 'NNP'
>>> loves >> john | 'subj'
subj(loves_2/VRB, John_1/NNP)
>>> loves >> mary | 'obj'
obj(loves_2/VRB, Mary_3/NNP)
>>> loves.get_dependencies('subj')
[subj(loves_2/VRB, John_1/NNP)]
>>> loves.get_dependencies('obj')
[obj(loves_2/VRB, Mary_3/NNP)]
Return type:

list[Dependency]
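
The /, >>, and | operators in the doctests above come from operator overloading on Form and Dependency. A stripped-down sketch of that pattern (the class names here are illustrative, not the real implementation):

```python
from typing import Optional

class MiniForm:
    """Toy node: ``form / 'NNP'`` tags it; ``head >> dep`` builds a dependency."""

    def __init__(self, text: str, form_id: int = 0) -> None:
        self.text, self.form_id, self.pos = text, form_id, None

    def __truediv__(self, pos: str) -> "MiniForm":
        self.pos = pos  # form / 'NNP' attaches a part-of-speech tag
        return self

    def __rshift__(self, dep: "MiniForm") -> "MiniDependency":
        return MiniDependency(self, dep)  # head >> dependent (untyped)

class MiniDependency:
    def __init__(self, head: MiniForm, dep: MiniForm,
                 relation: Optional[str] = None) -> None:
        self.head, self.dep, self.relation = head, dep, relation

    def __or__(self, relation: str) -> "MiniDependency":
        self.relation = relation  # dependency | 'subj' types the relation
        return self

john = MiniForm("John", 1) / "NNP"
loves = MiniForm("loves", 2) / "VRB"
dep = loves >> john | "subj"
```

Note that >> binds tighter than |, so the last line reads as (loves >> john) | "subj", matching the doctest semantics above.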

full_str(include_relation=True)[source]

Returns a string containing all features of the Form. The ID is attached to the text, and the relation is optionally suppressed.

>>> loves = Form('loves', 2) / 'VRB'
>>> loves.full_str()
'loves_2 [pos=VRB]'
>>> john = Form('John', 1) / 'NNP'
>>> loves >> john | 'subj'
subj(loves_2/VRB, John_1/NNP)
>>> john.full_str(True)
'John_1 [pos=NNP,relation=subj]'
Return type:

str

static to_form(word)[source]

Converts a CLTK Word object to a Form.

TODO: The Form info that prints is incomplete/ugly; correct the str repr of Form.

TODO: Fix these doctests; it’s ugly to import so many Forms, but is this required?

>>> from cltk.morphology.universal_dependencies_features import Case, Gender, Number, POS
>>> noun = POS.noun
>>> nominative = Case.nominative
>>> feminine = Gender.feminine
>>> singular = Number.singular
>>> cltk_word = Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='Gallia', pos=noun, lemma='Gallia', stem=None, scansion=None, xpos='A1|grn1|casA|gen2', upos='NOUN', dependency_relation='nsubj', governor=1, features={Case: [nominative], Gender: [feminine], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, stop=False, named_entity='LOCATION', syllables=None, phonetic_transcription=None, definition='')  
>>> cltk_word.features[Case] = Case.nominative  
>>> cltk_word.features[Gender] = Gender.feminine  
>>> cltk_word.features[Number] = Number.singular  
>>> f = Form.to_form(cltk_word)  
>>> f.full_str()  
'Gallia_0 [lemma=Gallia,pos=NOUN,upos=NOUN,xpos=A1|grn1|casA|gen2,Case=nominative,Gender=feminine,Number=singular]'
Return type:

Form

class cltk.dependency.tree.Dependency(head, dep, relation=None)[source]

Bases: object

The asymmetric binary relationship (or edge) between a governing Form (the “head”) and a subordinate Form (the “dependent”).

In principle the relationship could capture any form-to-form relation that the system deems of interest, be it syntactic, semantic, or discursive.

If the relation attribute is not specified, then the dependency simply states that there is some asymmetric relationship between the head and the dependent. This is an untyped dependency.

For a typed dependency, a string value is supplied for the relation attribute.

class cltk.dependency.tree.DependencyTree(root)[source]

Bases: ElementTree

The hierarchical tree representing the entirety of a parse.

get_dependencies()[source]

Returns a list of all the dependency relations in the tree, generated by depth-first search.

>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.dependency.processes import StanzaProcess
>>> process_stanza = StanzaProcess(language="lat")
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> a_sentence = output_doc.sentences[0]
>>> t = DependencyTree.to_tree(a_sentence)
>>> len(t.get_dependencies())
34
Return type:

list[Dependency]

print_tree(all_features=False)[source]

Prints a pretty-printed (indented) representation of the dependency tree. If all_features is True, then each node is printed with its complete feature bundles.
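
Since DependencyTree inherits from ElementTree, the indentation scheme can be pictured with a small depth-first walk over plain xml.etree elements. This is a sketch of the idea, not the real print_tree:

```python
import xml.etree.ElementTree as ET

def tree_lines(node: ET.Element, depth: int = 0) -> list:
    """Collect one indented line per node, depth-first."""
    lines = ["  " * depth + node.tag]
    for child in node:  # children in document order
        lines.extend(tree_lines(child, depth + 1))
    return lines

# Toy tree: a root verb governing two child nodes.
root = ET.Element("divisa")
ET.SubElement(root, "Gallia")
ET.SubElement(root, "est")
print("\n".join(tree_lines(root)))
```

Each node is indented two spaces per level of depth, so the root prints flush left and its children one step in.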

static to_tree(sentence)[source]

Factory method to create trees from sentence parses, i.e., lists of words.

>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.dependency.processes import StanzaProcess
>>> process_stanza = StanzaProcess(language="lat")
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> a_sentence = output_doc.sentences[0]
>>> t = DependencyTree.to_tree(a_sentence)
>>> t.findall(".")
[divisa_3/adjective]
Return type:

DependencyTree

8.1.5.6. cltk.dependency.utils module

Misc helper functions for extracting dependency info from CLTK data structures.

cltk.dependency.utils.get_governor_word(word, sentence)[source]

Given a Word and its sentence (a list of Word), return the governing word.

Return type:

Optional[Word]

cltk.dependency.utils.get_governor_word2(word, sentence_words)[source]

Given a Word and its sentence (a list of Word), return the governing word.

Return type:

Optional[Word]

cltk.dependency.utils.get_governor_relationship(word, sentence)[source]

Get the dependency relationship of a dependent to its governor.

Return type:

Optional[Any]
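
The helpers above resolve a word's governor against its sentence; the Word repr shown earlier (governor=1) suggests that governor holds the token index of the head. A sketch under that assumption, with a hypothetical minimal stand-in for the CLTK's Word type:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MiniWord:
    """Stand-in for cltk.core.data_types.Word with only the fields needed here."""
    index_token: int
    string: str
    governor: Optional[int] = None
    dependency_relation: Optional[str] = None

def get_governor_word(word: MiniWord, sentence: List[MiniWord]) -> Optional[MiniWord]:
    # Assumed semantics: ``governor`` is the head's token index within the sentence.
    if word.governor is None:
        return None
    for candidate in sentence:
        if candidate.index_token == word.governor:
            return candidate
    return None

sent = [
    MiniWord(0, "Gallia", governor=1, dependency_relation="nsubj"),
    MiniWord(1, "est", governor=None),
]
assert get_governor_word(sent[0], sent).string == "est"
assert get_governor_word(sent[1], sent) is None  # the root has no governor
```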