Latin

For most of the following operations, you must first import the CLTK Latin linguistic data (named latin_models_cltk).

Note that for most of the following operations, the j/i and v/u replacer JVReplacer() and .lower() should be used on the input string first, if necessary.

Clausulae Analysis

Clausulae analysis is an integral part of Latin prosimetrics. The clausulae analysis module analyzes prose rhythm data generated by the prosody module to produce a dictionary of common rhythm types and their frequencies.

The list of rhythms which the module tallies is drawn from John Ramsey's list of common Ciceronian clausulae. See Ramsey, John. Cicero: The Philippics I-II. Cambridge: Cambridge University, 2003: 22 for more.

In [1]: from cltk.prosody.latin.scanner import Scansion

In [2]: from cltk.prosody.latin.clausulae_analysis import Clausulae

In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'

In [4]: s = Scansion()

In [5]: c = Clausulae()

In [6]: prosody = s.scan_text(text)
Out[6]: ['¯˘¯˘¯¯˘˘˘¯˘˘˘¯˘¯¯x', '¯˘¯˘¯˘˘¯˘˘¯¯¯¯x']

In [7]: c.clausulae_analysis(prosody)
Out[7]: {'1st paeon + trochee': 0, 'molossus + iamb': 0, '1st paeon + anapest': 0, '4th paeon + trochee': 0, 'choriamb + double trochee': 0, 'molossus + cretic': 0, 'double spondee': 1, 'molossus + double trochee': 0, 'substituted cretic + trochee': 0, 'cretic + iamb': 0, 'cretic + trochee': 1, 'double trochee': 0, 'heroic': 0, 'cretic + double trochee': 0, 'cretic + double spondee': 0, '4th paeon + cretic': 0, 'double cretic': 0, 'dactyl + double trochee': 0}
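The tallying step can be pictured as matching the end of each scanned sentence against a table of rhythm patterns. The following is only a minimal sketch of that idea, not the module's implementation: the function name tally_clausulae and the three pattern encodings are assumptions made for illustration, using the same long, short, and anceps marks as the scanner output.

```python
# Illustrative encodings only (assumed, not taken from the CLTK source):
# '¯' is a long syllable, '˘' a short one, 'x' the anceps final syllable.
CLAUSULAE = {
    'cretic + trochee': '¯˘¯¯x',
    'double spondee': '¯¯¯x',
    'double trochee': '¯˘¯x',
}


def tally_clausulae(scanned_sentences, patterns=CLAUSULAE):
    """Count how many scanned sentences end in each named rhythm."""
    counts = {name: 0 for name in patterns}
    for sentence in scanned_sentences:
        for name, pattern in patterns.items():
            if sentence.endswith(pattern):
                counts[name] += 1
    return counts


prosody = ['¯˘¯˘¯¯˘˘˘¯˘˘˘¯˘¯¯x', '¯˘¯˘¯˘˘¯˘˘¯¯¯¯x']
counts = tally_clausulae(prosody)
print(counts)
# {'cretic + trochee': 1, 'double spondee': 1, 'double trochee': 0}
```

In this naive form a sentence may count toward more than one rhythm when one pattern is a suffix of another, so a fuller version would need a rule for resolving such overlaps.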

Converting J to I, V to U

In [1]: from cltk.stem.latin.j_v import JVReplacer

In [2]: j = JVReplacer()

In [3]: j.replace('vem jam')
Out[3]: 'uem iam'
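For readers who want to see what this normalization amounts to, here is a small self-contained sketch that combines the j/i and v/u replacement with lowercasing. It does not use the CLTK; the name normalize_latin and its lowercase parameter are hypothetical.

```python
# Character mapping for the j/i and v/u normalization, in both cases.
JV_TABLE = str.maketrans({'j': 'i', 'J': 'I', 'v': 'u', 'V': 'U'})


def normalize_latin(text, lowercase=True):
    """Replace j with i and v with u, then optionally lowercase (hypothetical helper)."""
    text = text.translate(JV_TABLE)
    return text.lower() if lowercase else text


print(normalize_latin('Vem jam'))                   # uem iam
print(normalize_latin('Venus', lowercase=False))    # Uenus
```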

Converting PHI texts with TLGU

Note

  1. Update this section with new post-TLGU processors in formatter.py

The TLGU is C-language software which does an excellent job at converting the TLG and PHI corpora into various forms of human-readable Unicode plaintext. The CLTK has an automated downloader and installer, as well as a wrapper which facilitates its use. Download and installation are handled in the background. When TLGU() is instantiated, it checks the local OS for a functioning version of the software. If none is found, it is installed.

Most users will want to do a bulk conversion of the entirety of a corpus without any text markup (such as chapter or line numbers).

In [1]: from cltk.corpus.greek.tlgu import TLGU

In [2]: t = TLGU()

In [3]: t.convert_corpus(corpus='phi5')  # ~/cltk_data/latin/text/phi5/plaintext/ #! This isn't working!

You can also divide the texts into a file for each individual work.

In [4]: t.divide_works('phi5')  # ~/cltk_data/latin/text/phi5/individual_works/

Information Retrieval

See Multilingual Information Retrieval for Latin–specific search options.

Declining

The CollatinusDecliner() attempts to retrieve all possible forms of a lemma. This may be useful if you want to search for all forms of a word across a repository of non-lemmatized texts. This class is based on lexical and linguistic data built by the Collatinus Team. Data corrections and additions can be contributed back to the Collatinus project (in particular, into bin/data).

Example use, assuming you have already imported the latin_models_cltk:

In [1]: from cltk.stem.latin.declension import CollatinusDecliner

In [2]: decliner = CollatinusDecliner()

In [3]: print(decliner.decline("via"))
Out[3]: [
     ('via', '--s----n-'), ('via', '--s----v-'), ('viam', '--s----a-'), ('viae', '--s----g-'),
     ('viae', '--s----d-'), ('via', '--s----b-'), ('viae', '--p----n-'), ('viae', '--p----v-'),
     ('vias', '--p----a-'), ('viarum', '--p----g-'), ('viis', '--p----d-'), ('viis', '--p----b-')
 ]

In [4]: print(decliner.decline("via", flatten=True))
Out[4]: ['via', 'via', 'viam', 'viae', 'viae', 'via', 'viae', 'viae', 'vias', 'viarum', 'viis', 'viis']

Lemmatization

Tip

For ambiguous forms, which could belong to several headwords, the current lemmatizer chooses the more commonly occurring headword (code here). For any errors that you spot, please open a ticket.

The CLTK's lemmatizer is based on a key-value store, whose code is available at the CLTK's Latin lemma/POS repository.

The lemmatizer offers several input and output options. For text input, it can take a string or a list of tokens (which, by the way, need every j and v replaced first). Here is an example of the lemmatizer taking a string:

In [1]: from cltk.stem.lemma import LemmaReplacer

In [2]: from cltk.stem.latin.j_v import JVReplacer

In [3]: sentence = 'Aeneadum genetrix, hominum divomque voluptas, alma Venus, caeli subter labentia signa quae mare navigerum, quae terras frugiferentis concelebras, per te quoniam genus omne animantum concipitur visitque exortum lumina solis.'

In [6]: sentence = sentence.lower()

In [7]: lemmatizer = LemmaReplacer('latin')

In [8]: lemmatizer.lemmatize(sentence)
Out[8]:
['aeneadum',
 'genetrix',
 ',',
 'homo',
 'divus',
 'voluptas',
 ',',
 'almus',
 ...]

And here taking a list:

In [9]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'])
Out[9]: ['qui1', 'terra', 'frugiferens', 'concelebro']

The lemmatizer takes several optional arguments for controlling output: return_raw=True and return_string=True. return_raw returns the original inflection along with its headword:

In [10]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_raw=True)
Out[10]:
['quae/qui1',
 'terras/terra',
 'frugiferentis/frugiferens',
 'concelebras/concelebro']

And return_string wraps the list in ' '.join():

In [11]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_string=True)
Out[11]: 'qui1 terra frugiferens concelebro'

These two arguments can be combined, as well.
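To make the interaction of the two options concrete, here is a minimal sketch of a key-value lemmatizer written from scratch. It is not the LemmaReplacer code: the function lookup_lemmatize, the four-entry LEMMATA table (taken from the outputs above), and the choice to pass unknown tokens through unchanged are all assumptions for illustration.

```python
# Tiny stand-in for the key-value store; the real data is far larger.
LEMMATA = {
    'quae': 'qui1',
    'terras': 'terra',
    'frugiferentis': 'frugiferens',
    'concelebras': 'concelebro',
}


def lookup_lemmatize(tokens, lemmata=LEMMATA, return_raw=False, return_string=False):
    """Look each token up in a dict; both output options may be combined."""
    lemmas = []
    for token in tokens:
        # Assumption: a token missing from the table is returned unchanged.
        headword = lemmata.get(token, token)
        lemmas.append(f'{token}/{headword}' if return_raw else headword)
    return ' '.join(lemmas) if return_string else lemmas


tokens = ['quae', 'terras', 'frugiferentis', 'concelebras']
combined = lookup_lemmatize(tokens, return_raw=True, return_string=True)
print(combined)
# quae/qui1 terras/terra frugiferentis/frugiferens concelebras/concelebro
```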

Lemmatization, backoff method

The CLTK offers a series of lemmatizers that can be combined in a backoff sequence: if one lemmatizer is unable to return a headword for a token, the token is passed on to the next lemmatizer, until either a headword is returned or the sequence ends.

There is a generic version of the backoff Latin lemmatizer which requires data from the CLTK Latin models, found here: <https://github.com/cltk/latin_models_cltk/tree/master/lemmata/backoff>. The lemmatizer expects this model to be stored in a folder called cltk_data in the user's home directory.

The backoff module offers DefaultLemmatizer which returns the same "lemma" for all tokens:

In [1]: from cltk.lemmatize.latin.backoff import DefaultLemmatizer

In [2]: lemmatizer = DefaultLemmatizer()

In [3]: tokens = ['Quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']

In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('Quo', None), ('usque', None), ('tandem', None), ('abutere', None), (',', None), ('Catilina', None), (',', None), ('patientia', None), ('nostra', None), ('?', None)]

DefaultLemmatizer can take as a parameter what "lemma" should be returned:

In [5]: lemmatizer = DefaultLemmatizer('UNK')

In [6]: lemmatizer.lemmatize(tokens)
Out[6]: [('Quo', 'UNK'), ('usque', 'UNK'), ('tandem', 'UNK'), ('abutere', 'UNK'), (',', 'UNK'), ('Catilina', 'UNK'), (',', 'UNK'), ('patientia', 'UNK'), ('nostra', 'UNK'), ('?', 'UNK')]

The backoff module also offers IdentityLemmatizer which returns the given token as the lemma:

In [7]: from cltk.lemmatize.latin.backoff import IdentityLemmatizer

In [8]: lemmatizer = IdentityLemmatizer()

In [9]: lemmatizer.lemmatize(tokens)

Out[9]: [('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutere'), (',', ','), ('Catilina', 'Catilina'), (',', ','), ('patientia', 'patientia'), ('nostra', 'nostra'), ('?', '?')]

NB: Documentation is still being written for the remaining backoff lemmatizers, i.e. TrainLemmatizer, ContextLemmatizer, RegexpLemmatizer, and ContextPOSLemmatizer.
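The backoff idea itself is independent of any particular lemmatizer. As a rough sketch, assuming each lemmatizer is simply a function that returns a headword or None, a chain can be walked as follows. The names backoff_lemmatize, dictionary_lemmatizer, and identity_lemmatizer, and the two-entry TOY_LEMMATA table, are hypothetical and do not come from the CLTK.

```python
# Toy lookup table standing in for trained or dictionary-based data.
TOY_LEMMATA = {'tandem': 'tandem', 'nostra': 'noster'}


def dictionary_lemmatizer(token):
    """Return a headword from the toy table, or None if the token is unknown."""
    return TOY_LEMMATA.get(token.lower())


def identity_lemmatizer(token):
    """Last resort: return the token itself as its lemma."""
    return token


def backoff_lemmatize(tokens, chain):
    """Try each lemmatizer in order; stop at the first non-None headword."""
    result = []
    for token in tokens:
        lemma = None
        for lemmatizer in chain:
            lemma = lemmatizer(token)
            if lemma is not None:
                break
        result.append((token, lemma))
    return result


tokens = ['Quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']
pairs = backoff_lemmatize(tokens, [dictionary_lemmatizer, identity_lemmatizer])
print(pairs)
```

With only dictionary_lemmatizer in the chain, unknown tokens come back paired with None, which mirrors the shape of the DefaultLemmatizer output shown earlier.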

Macronizer

Automatically mark long Latin vowels with a macron. The algorithm used in this module is largely based on Johan Winge's, which is detailed in his thesis.

Note that the macronizer's accuracy varies depending on which tagger is used. Currently, the macronizer supports the following taggers: tag_ngram_123_backoff, tag_tnt, and tag_crf. The tagger is selected when calling the class, as seen on line 2. Be sure to first import the data models from latin_models_cltk, via the corpus importer, since both the taggers and macronizer rely on them.

The macronizer can either macronize text, as seen on line 4 below, or return a list of tagged tokens containing the macronized form, as on line 5.

In [1]: from cltk.prosody.latin.macronizer import Macronizer

In [2]: macronizer = Macronizer('tag_ngram_123_backoff')

In [3]: text = 'Quo usque tandem, O Catilina, abutere nostra patientia?'

In [4]: macronizer.macronize_text(text)
Out[4]: 'quō usque tandem , ō catilīnā , abūtēre nostrā patientia ?'

In [5]: macronizer.macronize_tags(text)
Out[5]: [('quo', 'd--------', 'quō'), ('usque', 'd--------', 'usque'), ('tandem', 'd--------', 'tandem'), (',', 'u--------', ','), ('o', 'e--------', 'ō'), ('catilina', 'n-s---mb-', 'catilīnā'), (',', 'u--------', ','), ('abutere', 'v2sfip---', 'abūtēre'), ('nostra', 'a-s---fb-', 'nostrā'), ('patientia', 'n-s---fn-', 'patientia'), ('?', None, '?')]

Making POS training sets

Warning

POS tagging is a work in progress. A new tagging dictionary has been created, though a tagger has not yet been written.

First, obtain the Latin POS tagging files. The important file here is cltk_latin_pos_dict.txt, which is saved at ~/cltk_data/compiled/pos_latin. This file is a Python dict which aims to give all possible parts of speech for any given form, though it is based on the incomplete Perseus latin-analyses.txt. Thus, there may be gaps in (i) the inflected forms defined and (ii) the comprehensiveness of the analyses of any given form. cltk_latin_pos_dict.txt looks like:

{'-nam': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                    'gloss': '',
                                    'type': 'conj'}}]},
 '-namque': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                       'gloss': '',
                                       'type': 'conj'}}]},
 '-sed': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                    'gloss': '',
                                    'type': 'conj'}}]},
 'Aaron': {'perseus_pos': [{'pos0': {'case': 'nom',
                                     'gender': 'masc',
                                     'gloss': 'Aaron',
                                     'number': 'sg',
                                     'type': 'substantive'}}]},
}

If you wish to edit the POS dictionary creator, see cltk_latin_pos_dict.txt. For more, see the pos_latin repository: <https://github.com/cltk/latin_pos_lemmata_cltk>.

Named Entity Recognition

Tip

NER is new functionality. Please report any errors you observe.

A simple interface to a list of Latin proper nouns is available. By default, tag_ner() takes a string input and returns a list of tuples. However, it can also take pre-tokenized forms and return a string.

In [1]: from cltk.tag import ner

In [2]: from cltk.stem.latin.j_v import JVReplacer

In [3]: text_str = """ut Venus, ut Sirius, ut Spica, ut aliae quae primae dicuntur esse mangitudinis."""

In [4]: jv_replacer = JVReplacer()

In [5]: text_str_iu = jv_replacer.replace(text_str)

In [7]: ner.tag_ner('latin', input_text=text_str_iu, output_type=list)
Out[7]:
[('ut',),
 ('Uenus', 'Entity'),
 (',',),
 ('ut',),
 ('Sirius', 'Entity'),
 (',',),
 ('ut',),
 ('Spica', 'Entity'),
 (',',),
 ('ut',),
 ('aliae',),
 ('quae',),
 ('primae',),
 ('dicuntur',),
 ('esse',),
 ('mangitudinis',),
 ('.',)]

PHI Indices

Located at cltk/corpus/latin/phi5_index.py of the source are indices for the PHI5, one of just id and name (PHI5_INDEX) and another also containing information on the authors' works (PHI5_WORKS_INDEX).

In [1]: from cltk.corpus.latin.phi5_index import PHI5_INDEX

In [2]: PHI5_INDEX
Out[2]:
{'LAT1050': 'Lucius Verginius Rufus',
 'LAT2335': 'Anonymi de Differentiis [Fronto]',
 'LAT1345': 'Silius Italicus',
 ... }

In [3]: from cltk.corpus.latin.phi5_index import PHI5_WORKS_INDEX

In [4]: PHI5_WORKS_INDEX
Out[4]:
{'LAT2335': {'works': ['001'], 'name': 'Anonymi de Differentiis [Fronto]'},
 'LAT1345': {'works': ['001'], 'name': 'Silius Italicus'},
 'LAT1351': {'works': ['001', '002', '003', '004', '005'],
  'name': 'Cornelius Tacitus'},
 'LAT2349': {'works': ['001', '002', '003', '004', '005', '006', '007'],
  'name': 'Maurus Servius Honoratus, Servius'},
  ...}

In addition to these indices, there are several helper functions which will build filepaths for your particular computer. Note that you will need to have run convert_corpus(corpus='phi5') and divide_works('phi5') from the TLGU() class, respectively, for the following two functions.

In [1]: from cltk.corpus.utils.formatter import assemble_phi5_author_filepaths

In [2]: assemble_phi5_author_filepaths()
Out[2]:
['/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0636.TXT',
 '/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0658.TXT',
 '/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0827.TXT',
 ...]

In [3]: from cltk.corpus.utils.formatter import assemble_phi5_works_filepaths

In [4]: assemble_phi5_works_filepaths()
Out[4]:
['/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0636.TXT-001.txt',
 '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0902.TXT-001.txt',
 '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-001.txt',
 '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-002.txt',
 ...]

These two functions are useful when, for example, needing to process all authors of the PHI5 corpus, all works of the corpus, or all works of one particular author.
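As one possible way to narrow such a list to a single author, the file names can be filtered by their author id. This is a sketch that works on any list of paths shaped like the output above; the helper works_for_author is hypothetical and not part of the CLTK.

```python
import os


def works_for_author(filepaths, author_id):
    """Keep only the paths whose file name starts with the given author id."""
    prefix = author_id + '.TXT-'
    return [p for p in filepaths if os.path.basename(p).startswith(prefix)]


# Sample paths copied from the output above; in practice, pass the list
# returned by assemble_phi5_works_filepaths().
sample_paths = [
    '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0636.TXT-001.txt',
    '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0902.TXT-001.txt',
    '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-001.txt',
    '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-002.txt',
]

lat0472_works = works_for_author(sample_paths, 'LAT0472')
for path in lat0472_works:
    print(os.path.basename(path))
```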

POS tagging

These taggers were built with the assistance of the NLTK. The backoff tagger is Bayesian and the TnT tagger is HMM-based. To obtain the models, first import the latin_models_cltk corpus.

1–2–3–gram backoff tagger

In [1]: from cltk.tag.pos import POSTag

In [2]: tagger = POSTag('latin')

In [3]: tagger.tag_ngram_123_backoff('Gallia est omnis divisa in partes tres')
Out[3]:
[('Gallia', None),
 ('est', 'V3SPIA---'),
 ('omnis', 'A-S---MN-'),
 ('divisa', 'T-PRPPNN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]

TnT tagger

In [4]: tagger.tag_tnt('Gallia est omnis divisa in partes tres')
Out[4]:
[('Gallia', 'Unk'),
 ('est', 'V3SPIA---'),
 ('omnis', 'N-S---MN-'),
 ('divisa', 'T-SRPPFN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]

CRF tagger

Warning

This tagger's accuracy has not yet been evaluated.

We use the NLTK's CRF tagger. For information on it, see the NLTK docs.

In [5]: tagger.tag_crf('Gallia est omnis divisa in partes tres')
Out[5]:
[('Gallia', 'A-P---NA-'),
 ('est', 'V3SPIA---'),
 ('omnis', 'A-S---FN-'),
 ('divisa', 'N-S---FN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]

Lapos tagger

Note

The Lapos tagger is available in its own repo, with the master branch for Linux and the apple branch for Mac. See directions there on how to use it.

Prosody Scanning

A prosody scanner is available for text which already has had its natural lengths marked with macrons. It returns a list of strings of long and short marks for each sentence, with an anceps marking the last syllable of each sentence.

In [1]: from cltk.prosody.latin.scanner import Scansion

In [2]: scanner = Scansion()

In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'

In [4]: scanner.scan_text(text)
Out[4]: ['¯˘¯˘¯¯˘˘˘¯˘˘˘¯˘¯¯x', '¯˘¯˘¯˘˘¯˘˘¯¯¯¯x']

Sentence Tokenization

The sentence tokenizer takes a string input into tokenize_sentences() and returns a list of strings. For more on the tokenizer, or to make your own, see the CLTK's Latin sentence tokenizer training set repository.

In [1]: from cltk.tokenize.sentence import TokenizeSentence

In [2]: tokenizer = TokenizeSentence('latin')

In [3]: untokenized_text = 'Itaque cum M. Aurelio et P. Minidio et Cn. Cornelio ad apparationem balistarum et scorpionem reliquorumque tormentorum refectionem fui praesto et cum eis commoda accepi, quae cum primo mihi tribuisiti recognitionem, per sorosis commendationem servasti. Cum ergo eo beneficio essem obligatus, ut ad exitum vitae non haberem inopiae timorem, haec tibi scribere coepi, quod animadverti multa te aedificavisse et nunc aedificare, reliquo quoque tempore et publicorum et privatorum aedificiorum, pro amplitudine rerum gestarum ut posteris memoriae traderentur curam habiturum.'

In [4]: tokenizer.tokenize_sentences(untokenized_text)
Out[4]:
['Itaque cum M. Aurelio et P. Minidio et Cn. Cornelio ad apparationem balistarum et scorpionem reliquorumque tormentorum refectionem fui praesto et cum eis commoda accepi, quae cum primo mihi tribuisiti recognitionem, per sorosis commendationem servasti.',
 'Cum ergo eo beneficio essem obligatus, ut ad exitum vitae non haberem inopiae timorem, haec tibi scribere coepi, quod animadverti multa te aedificavisse et nunc aedificare, reliquo quoque tempore et publicorum et privatorum aedificiorum, pro amplitudine rerum gestarum ut posteris memoriae traderentur curam habiturum.']

Stemming

The stemmer strips suffixes via an algorithm. It is much faster than the lemmatizer, which uses a replacement list.

In [1]: from cltk.stem.latin.stem import Stemmer

In [2]: sentence = 'Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiuerunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem ciuem existimarint foeneratorem quam furem, hinc licet existimare. Et uirum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, uerum, ut supra dixi, periculosum et calamitosum. At ex agricolis et uiri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque inuidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit.'

In [3]: stemmer = Stemmer()

In [4]: stemmer.stem(sentence.lower())
Out[4]: 'est interd praestar mercatur r quaerere, nisi tam periculos sit, et it foenerari, si tam honestum. maior nostr sic habueru et ita in leg posiuerunt: fur dupl condemnari, foenerator quadrupli. quant peior ciu existimari foenerator quam furem, hinc lice existimare. et uir bon quo laudabant, ita laudabant: bon agricol bon colonum; amplissim laudar existimaba qui ita laudabatur. mercator autem strenu studios re quaerend existimo, uerum, ut supr dixi, periculos et calamitosum. at ex agricol et uir fortissim et milit strenuissim gignuntur, maxim p quaest stabilissim consequi minim inuidiosus, minim mal cogitant su qui in e studi occupat sunt. nunc, ut ad r redeam, quod promis institut principi hoc erit. '
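To give a feel for what algorithmic suffix stripping means, here is a deliberately tiny sketch. It is not the CLTK stemmer's algorithm: the suffix list, the minimum stem length, and the name toy_stem are all assumptions chosen for illustration.

```python
# A handful of common Latin endings; the list is illustrative, not exhaustive.
SUFFIXES = ['ibus', 'orum', 'arum', 'am', 'em', 'um', 'is', 'us',
            'ae', 'as', 'os', 'es', 'a', 'e', 'i', 'o']


def toy_stem(word, min_stem=2):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


for word in ['agricolam', 'legibus', 'furem', 'et']:
    print(word, '->', toy_stem(word))
```

The CLTK stemmer applies more rules than this. For instance, the output above leaves ita unchanged, whereas this sketch would shorten it to it.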

Stopword Filtering

To use the CLTK's built-in stopwords list:

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from cltk.stop.latin.stops import STOPS_LIST

In [3]: sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(sentence.lower())

In [6]: [w for w in tokens if w not in STOPS_LIST]
Out[6]:
['usque',
 'tandem',
 'abutere',
 ',',
 'catilina',
 ',',
 'patientia',
 'nostra',
 '?']

Syllabifier

The syllabifier splits a given input Latin word into a list of syllables based on an algorithm and set of syllable specifications for Latin.

In [1]: from cltk.stem.latin.syllabifier import Syllabifier

In [2]: word = 'sidere'

In [3]: syllabifier = Syllabifier()

In [4]: syllabifier.syllabify(word)
Out[4]: ['si', 'de', 're']

Text Cleanup

Intended for use on the PHI5 after processing by TLGU().

In [1]: from cltk.corpus.utils.formatter import phi5_plaintext_cleanup

In [2]: import os

In [3]: file = os.path.expanduser('~/cltk_data/latin/text/phi5/individual_works/LAT0031.TXT-001.txt')

In [4]: with open(file) as f:
...:     r = f.read()
...:

In [5]: r[:500]
Out[5]: '\nDices pulchrum esse inimicos \nulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uide-\ntur, sed si liceat re publica salua ea persequi. sed quatenus id fieri non  \npotest, multo tempore multisque partibus inimici nostri non peribunt \natque, uti nunc sunt, erunt potius quam res publica profligetur atque \npereat. \n    Verbis conceptis deierare ausim, praeterquam qui \nTiberium Gracchum necarunt, neminem inimicum tantum molestiae \ntantumque laboris, quantum te ob has res, mihi tradidis'

In [6]: phi5_plaintext_cleanup(r, rm_punctuation=True, rm_periods=False)[:500]
Out[6]: ' Dices pulchrum esse inimicos ulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uidetur sed si liceat re publica salua ea persequi. sed quatenus id fieri non potest multo tempore multisque partibus inimici nostri non peribunt atque uti nunc sunt erunt potius quam res publica profligetur atque pereat. Verbis conceptis deierare ausim praeterquam qui Tiberium Gracchum necarunt neminem inimicum tantum molestiae tantumque laboris quantum te ob has res mihi tradidisse quem oportebat omni'

If you have a text in a Latin-alphabet language which contains a lot of junk, remove_non_ascii() might be of use.

In [1]: from cltk.corpus.utils.formatter import remove_non_ascii

In [2]: text =  'Dices ἐστιν ἐμός pulchrum esse inimicos ulcisci.'

In [3]: remove_non_ascii(text)
Out[3]: 'Dices   pulchrum esse inimicos ulcisci.'

Transliteration

The CLTK provides IPA phonetic transliteration for the Latin language. Currently, the only available dialect is Classical as reconstructed by W. Sidney Allen (taken from Vox Latina, 85-103). Example:

In [1]: from cltk.phonology.latin.transcription import Transcriber

In [2]: transcriber = Transcriber(dialect="Classical", reconstruction="Allen")

In [3]: transcriber.transcribe("Quo usque tandem, O Catilina, abutere nostra patientia?")
Out[3]: "['kʷoː 'ʊs.kʷɛ 't̪an̪.d̪ẽː 'oː ka.t̪ɪ.'liː.n̪aː a.buː.'t̪eː.rɛ 'n̪ɔs.t̪raː pa.t̪ɪ̣.'jɛn̪.t̪ɪ̣.ja]"

Word Tokenization

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('latin')

In [3]: text = 'atque haec abuterque puerve paterne nihil'

In [4]: word_tokenizer.tokenize(text)
Out[4]: ['atque', 'haec', 'abuter', 'que', 'puer', 've', 'pater', 'ne', 'nihil']

Word2Vec

Note

The Word2Vec models have not been fully vetted and are offered in the spirit of a beta. The CLTK's API for it will be revised.

Note

You will need to install Gensim to use these features.

Word2Vec is a vector space model especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words which appear in similar contexts (something akin to synonyms; think of them as lexical clusters).

The CLTK repository contains pre-trained Word2Vec models for Latin (import as latin_word2vec_cltk), one lemmatized and the other not. They were trained on the PHI5 corpus. To train your own, see the README at the Latin Word2Vec repository.

One of the most common uses of Word2Vec is as a keyword expander. Keyword expansion takes a query term, finds its synonyms, and searches for those, too. Here's an example of its use:

In [1]: from cltk.ir.query import search_corpus

In [2]: for x in search_corpus('amicitia', 'phi5', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.25):
    print(x)
   ...:
The following similar terms will be added to the 'amicitia' query: '['societate', 'praesentia', 'uita', 'sententia', 'promptu', 'beneuolentia', 'dignitate', 'monumentis', 'somnis', 'philosophia']'.
('L. Iunius Moderatus Columella', 'hospitem, nisi ex *amicitia* domini, quam raris-\nsime recipiat.')
('L. Iunius Moderatus Columella', ' \n    Xenophon Atheniensis eo libro, Publi Siluine, qui Oeconomicus \ninscribitur, prodidit maritale coniugium sic comparatum esse \nnatura, ut non solum iucundissima, uerum etiam utilissima uitae \nsocietas iniretur: nam primum, quod etiam Cicero ait, ne genus \nhumanum temporis longinquitate occideret, propter \nhoc marem cum femina esse coniunctum, deinde, ut ex \nhac eadem *societate* mortalibus adiutoria senectutis nec \nminus propugnacula praeparentur.')
('L. Iunius Moderatus Columella', 'ac ne ista quidem \npraesidia, ut diximus, non adsiduus labor et experientia \nuilici, non facultates ac uoluntas inpendendi tantum pollent \nquantum uel una *praesentia* domini, quae nisi frequens \noperibus interuenerit, ut in exercitu, cum abest imperator, \ncuncta cessant officia.')
…

threshold is the closeness of the query term to its neighboring words. Note that when expand_keyword=True, the search term will be stripped of any regular expression syntax.

The keyword expander leverages get_sims() (which in turn leverages functionality of the Gensim package) to find similar terms. Some examples of it in action:

In [3]: from cltk.vector.word2vec import get_sims

In [4]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.7)
Matches found, but below the threshold of 'threshold=0.7'. Lower it to see these results.
Out[4]: []

In [5]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.2)
Out[5]:
['lictor',
 'extemplo',
 'cena',
 'nuntio',
 'aduenio',
 'iniussus2',
 'forum',
 'dictator',
 'fabium',
 'caesarem']

In [6]: get_sims('iube', 'latin', lemmatized=True, threshold=0.7)
"word 'iube' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['iubet”', 'iubet', 'iubilo', 'iubĕ', 'iubar', 'iubes', 'iubatus', 'iuba1', 'iubeo']'.

In [7]: get_sims('dictator', 'latin', lemmatized=False, threshold=0.7)
Out[7]:
['consul',
 'caesar',
 'seruilius',
 'praefectus',
 'flaccus',
 'manlius',
 'sp',
 'fuluius',
 'fabio',
 'ualerius']

To add and subtract vectors, you need to load the models yourself with Gensim.
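The arithmetic itself is simple to picture. The sketch below uses hand-made three-dimensional toy vectors, not values from the CLTK models, and the function names cosine and analogy are hypothetical. With a model loaded in Gensim, the comparable operation is, to the best of our knowledge, the most_similar() method with its positive and negative arguments.

```python
import math

# Made-up toy vectors for illustration only; real models have many more
# dimensions and are learned from a corpus.
TOY_VECTORS = {
    'rex': (1.0, 1.0, 0.0),
    'vir': (1.0, 0.0, 0.0),
    'femina': (0.0, 0.0, 1.0),
    'regina': (0.0, 1.0, 1.0),
    'miles': (1.0, 0.2, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(positive, negative, vectors=TOY_VECTORS):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)
    candidates = ((cosine(target, vec), word)
                  for word, vec in vectors.items() if word not in exclude)
    return max(candidates)[1]


nearest = analogy(positive=['rex', 'femina'], negative=['vir'])
print(nearest)  # regina
```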