Latin is a classical language belonging to the Italic branch of the Indo-European languages. The Latin alphabet is derived from the Etruscan and Greek alphabets, and ultimately from the Phoenician alphabet. Latin was originally spoken in Latium, in the Italian Peninsula. Through the power of the Roman Republic, it became the dominant language, initially in Italy and subsequently throughout the Roman Empire. Vulgar Latin developed into the Romance languages, such as Italian, Portuguese, Spanish, French, and Romanian. (Source: Wikipedia)


For most of the following operations, you must first import the CLTK Latin linguistic data (named latin_models_cltk).


Note that for most of the following operations, the j/i and v/u replacer JVReplacer() and .lower() should be used on the input string first, if necessary.

Clausulae Analysis

Clausulae analysis is an integral part of Latin prosimetrics. The clausulae analysis module analyzes prose rhythm data generated by the prosody module to produce a dictionary of common rhythm types and their frequencies.

The list of rhythms which the module tallies is drawn from John Ramsey's list of common Ciceronian clausulae. See Ramsey, John. Cicero: The Philippics I-II. Cambridge: Cambridge University, 2003: 22 for more.

In [1]: from cltk.prosody.latin.scanner import Scansion

In [2]: from cltk.prosody.latin.clausulae_analysis import Clausulae

In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'

In [4]: s = Scansion()

In [5]: c = Clausulae()

In [6]: prosody = s.scan_text(text)
Out[6]: ['¯˘¯˘¯¯˘˘˘¯˘˘˘¯˘¯¯x', '¯˘¯˘¯˘˘¯˘˘¯¯¯¯x']

In [7]: c.clausulae_analysis(prosody)
Out[7]: {'1st paeon + trochee': 0, 'molossus + iamb': 0, '1st paeon + anapest': 0, '4th paeon + trochee': 0, 'choriamb + double trochee': 0, 'molossus + cretic': 0, 'double spondee': 1, 'molossus + double trochee': 0, 'substituted cretic + trochee': 0, 'cretic + iamb': 0, 'cretic + trochee': 1, 'double trochee': 0, 'heroic': 0, 'cretic + double trochee': 0, 'cretic + double spondee': 0, '4th paeon + cretic': 0, 'double cretic': 0, 'dactyl + double trochee': 0}

Converting J to I, V to U

In [1]: from cltk.stem.latin.j_v import JVReplacer

In [2]: j = JVReplacer()

In [3]: j.replace('vem jam')
Out[3]: 'uem iam'

Converting PHI texts with TLGU


  1. Update this section with new post-TLGU processors in

The TLGU is C-language software which does an excellent job at converting the TLG and PHI corpora into various forms of human-readable Unicode plaintext. The CLTK has an automated downloader and installer, as well as a wrapper which facilitates its use. Download and installation is handled in the background. When TLGU() is instantiated, it checks the local OS for a functioning version of the software. If not found it is installed.

Most users will want to do a bulk conversion of the entirety of a corpus without any text markup (such as chapter or line numbers).

In [1]: from cltk.corpus.greek.tlgu import TLGU

In [2]: t = TLGU()

In [3]: t.convert_corpus(corpus='phi5')  # ~/cltk_data/latin/text/phi5/plaintext/

You can also divide the texts into a file for each individual work.

In [4]: t.divide_works('phi5')  # ~/cltk_data/latin/text/phi5/individual_works/

Once these files are created, see PHI Indices below for accessing these newly created files.

See also Text Cleanup for removing extraneous non-textual characters from these files.

Information Retrieval

See Multilingual Information Retrieval for Latin–specific search options.


The CollatinusDecliner() attempts to retrieve all possible form of a lemma. This may be useful if you want to search for all forms of a word across a repository of non-lemmatized texts. This class is based on lexical and linguistic data built by the Collatinus Team. Data corrections and additions can be contributed back to the Collatinus project (in particular, into bin/data).

Example use, assuming you have already imported the latin_models_cltk:

In [1]: from cltk.stem.latin.declension import CollatinusDecliner

In [2]: decliner = CollatinusDecliner()

In [3]: print(decliner.decline("via"))
Out[3]: [
     ('via', '--s----n-'), ('via', '--s----v-'), ('viam', '--s----a-'), ('viae', '--s----g-'),
     ('viae', '--s----d-'), ('via', '--s----b-'), ('viae', '--p----n-'), ('viae', '--p----v-'),
     ('vias', '--p----a-'), ('viarum', '--p----g-'), ('viis', '--p----d-'), ('viis', '--p----b-')

 In [4]: print(decliner.decline("via", flatten=True))
 Out[4]: ['via', 'via', 'viam', 'viae', 'viae', 'via', 'viae', 'viae', 'vias', 'viarum', 'viis', 'viis']



For ambiguous forms, which could belong to several headwords, the current lemmatizer chooses the more commonly occurring headword (code here). For any errors that you spot, please open a ticket.

The CLTK's lemmatizer is based on a key-value store, whose code is available at the CLTK's Latin lemma/POS repository.

The lemmatizer offers several input and output options. For text input, it can take a string or a list of tokens (which, by the way, need j and v replaced first). Here is an example of the lemmatizer taking a string:

In [1]: from cltk.stem.lemma import LemmaReplacer

In [2]: from cltk.stem.latin.j_v import JVReplacer

In [3]: sentence = 'Aeneadum genetrix, hominum divomque voluptas, alma Venus, caeli subter labentia signa quae mare navigerum, quae terras frugiferentis concelebras, per te quoniam genus omne animantum concipitur visitque exortum lumina solis.'

In [6]: sentence = sentence.lower()

In [7]: lemmatizer = LemmaReplacer('latin')

In [8]: lemmatizer.lemmatize(sentence)

And here taking a list:

In [9]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'])
Out[9]: ['qui1', 'terra', 'frugiferens', 'concelebro']

The lemmatizer takes several optional arguments for controlling output: return_raw=True and return_string=True. return_raw returns the original inflection along with its headword:

In [10]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_raw=True)

And return string wraps the list in ' '.join():

In [11]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_string=True)
Out[11]: 'qui1 terra frugiferens concelebro'

These two arguments can be combined, as well.

Lemmatization, backoff method

The CLTK offers a series of lemmatizers that can be combined in a backoff sequence, i.e. if one lemmatizer is unable to return a headword for a token, this token can be passed onto another lemmatizer until either a headword is returned or the sequence ends.

There is a generic version of the backoff latin lemmatizer which requires data from the CLTK latin models data found here. The lemmatizer expects this model to be stored in a folder called cltk_data in the user's home directory.

The backoff module offers DefaultLemmatizer which returns the same "lemma" for all tokens:

In [1]: from cltk.lemmatize.latin.backoff import DefaultLemmatizer

In [2]: lemmatizer = DefaultLemmatizer()

In [3]: tokens = ['Quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']

In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('Quo', None), ('usque', None), ('tandem', None), ('abutere', None), (',', None), ('Catilina', None), (',', None), ('patientia', None), ('nostra', None), ('?', None)]

DefaultLemmatizer can take as a parameter what "lemma" should be returned:

In [5]: lemmatizer = DefaultLemmatizer('UNK')

In [6]: lemmatizer.lemmatize(tokens)
Out[6]: [('Quo', 'UNK'), ('usque', 'UNK'), ('tandem', 'UNK'), ('abutere', 'UNK'), (',', 'UNK'), ('Catilina', 'UNK'), (',', 'UNK'), ('patientia', 'UNK'), ('nostra', 'UNK'), ('?', 'UNK')]

The backoff module also offers IdentityLemmatizer which returns the given token as the lemma:

In [7]: from cltk.lemmatize.latin.backoff import IdentityLemmatizer

In [8]: lemmatizer = IdentityLemmatizer()

In [9]: lemmatizer.lemmatize(tokens)

Out[9]: [('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutere'), (',', ','), ('Catilina', 'Catilina'), (',', ','), ('patientia', 'patientia'), ('nostra', 'nostra'), ('?', '?')]

With the TrainLemmatizer, the backoff module allows you to provide a dictionary of the form {'TOKEN1': 'LEMMA1', 'TOKEN2': 'LEMMA2'} for lemmatization.

In [10]: tokens = ['arma', 'uirum', '-que', 'cano', ',', 'troiae', 'qui', 'primus', 'ab', 'oris']

In [11]: dict = {'arma': 'arma', 'uirum': 'uir', 'troiae': 'troia', 'oris': 'ora'}

In [12]: from cltk.lemmatize.latin.backoff import TrainLemmatizer

In [13]: lemmatizer = TrainLemmatizer(dict)

In [14]: lemmatizer.lemmatize(tokens)
Out[14]: [('arma', 'arma'), ('uirum', 'uir'), ('-que', None), ('cano', None), (',', None), ('troiae', 'troia'), ('qui', None), ('primus', None), ('ab', None), ('oris', 'ora')]

The TrainLemmatizer—like all of the lemmatizers in this module—can take a second lemmatizer (or backoff lemmatizer) for any of the tokens that return 'None'. This is done with a 'backoff' parameter:

In [15]: default = DefaultLemmatizer('UNK')

In [16]: lemmatizer = TrainLemmatizer(dict, backoff=default)

In [17]: lemmatizer.lemmatize(tokens)
Out[17]: [('arma', 'arma'), ('uirum', 'uir'), ('-que', 'UNK'), ('cano', 'UNK'), (',', 'UNK'), ('troiae', 'troia'), ('qui', 'UNK'), ('primus', 'UNK'), ('ab', 'UNK'), ('oris', 'ora')]

With the ContextLemmatizer, the backoff module allows you to provide a list of lists of sentences of the form [[('TOKEN1', 'LEMMA1'), ('TOKEN2', 'LEMMA2')], [('TOKEN3', 'LEMMA3'), ('TOKEN4', 'LEMMA4')], ... ] for lemmatization. The lemmatizer returns the the lemma that has the highest frequency based on the provided context (i.e. unigram, bigram, etc.). So, for example, with unigram context and the token 'est', if the tuple ('est', 'sum') appears in the training sents 99 times and ('est', 'comedo') appears 1 time, the lemmatizer would return the lemma 'sum'. The ContextLemmatizer and its subclasses can take a 'backoff' parameter. (There is a model available in CLTK Data that can be used for this purpose with slight modification: ~/cltk_data/latin/model/latin_models_cltk/lemmata/backoff/latin_pos_lemmatized_sents.pickle. This model has the form [[('TOKEN1', 'LEMMA1', 'POS1'), ('TOKEN2', 'LEMMA2', 'POS2')], ... ]. A list comprehension can get you the model you need for the ContextLemmatizer, e.g. [[(item[0], item[1]) for item in sent] for sent in train_data])

There are subclasses included in the backoff lemmatizer for unigram and bigram context. Here is an example of the UnigramLemmatizer():

In [18]: train_data = [[('cum', 'cum2'), ('esset', 'sum'), ('caesar', 'caesar'), ('in', 'in'), ('citeriore', 'citer'), ('gallia', 'gallia'), ('in', 'in'), ('hibernis', 'hibernus'), (',', 'punc'), ('ita', 'ita'), ('uti', 'ut'), ('supra', 'supra'), ('demonstrauimus', 'demonstro'), (',', 'punc'), ('crebri', 'creber'), ('ad', 'ad'), ('eum', 'is'), ('rumores', 'rumor'), ('adferebantur', 'affero'), ('litteris', 'littera'), ('-que', '-que'), ('item', 'item'), ('labieni', 'labienus'), ('certior', 'certus'), ('fiebat', 'fio'), ('omnes', 'omnis'), ('belgas', 'belgae'), (',', 'punc'), ('quam', 'qui'), ('tertiam', 'tertius'), ('esse', 'sum'), ('galliae', 'gallia'), ('partem', 'pars'), ('dixeramus', 'dico'), (',', 'punc'), ('contra', 'contra'), ('populum', 'populus'), ('romanum', 'romanus'), ('coniurare', 'coniuro'), ('obsides', 'obses'), ('-que', '-que'), ('inter', 'inter'), ('se', 'sui'), ('dare', 'do'), ('.', 'punc')], [('coniurandi', 'coniuro'), ('has', 'hic'), ('esse', 'sum'), ('causas', 'causa'), ('primum', 'primus'), ('quod', 'quod'), ('uererentur', 'uereor'), ('ne', 'ne'), (',', 'punc'), ('omni', 'omnis'), ('pacata', 'paco'), ('gallia', 'gallia'), (',', 'punc'), ('ad', 'ad'), ('eos', 'is'), ('exercitus', 'exercitus'), ('noster', 'noster'), ('adduceretur', 'adduco'), (';', 'punc')]]

In [19]: default = DefaultLemmatizer('UNK')

In [20]: lemmatizer = UnigramLemmatizer(train_sents, backoff=default)
In [21]: lemmatizer.lemmatize(tokens)

Out[21]: [('arma', 'UNK'), ('uirum', 'UNK'), ('-que', '-que'), ('cano', 'UNK'), (',', 'punc'), ('troiae', 'UNK'), ('qui', 'UNK'), ('primus', 'UNK'), ('ab', 'UNK'), ('oris', 'UNK')]

NB: Documentation is still be written for the remaining backoff lemmatizers, i.e. RegexpLemmatizer(), and ContextPOSLemmatizer().

Line Tokenization

The line tokenizer takes a string input into tokenize() and returns a list of strings.

In [1]: from cltk.tokenize.line import LineTokenizer

In [2]: tokenizer = LineTokenizer('latin')

In [3]: untokenized_text = """49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""

In [4]: tokenizer.tokenize(untokenized_text)

Out[4]: ['49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']

The line tokenizer by default removes multiple line breaks. If you wish to retain blank lines in the returned list, set the include_blanks to True.

In [5]: untokenized_text = """48. Cum tibi contigerit studio cognoscere multa,\nFac discas multa, vita nil discere velle.\n\n49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""

In [6]: tokenizer.tokenize(untokenized_text, include_blanks=True)

Out[6]: ['48. Cum tibi contigerit studio cognoscere multa,','Fac discas multa, vita nil discere velle.','','49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']


Automatically mark long Latin vowels with a macron. The algorithm used in this module is largely based on Johan Winge's, which is detailed in his thesis found.

Note that the macronizer's accuracy varies depending on which tagger is used. Currently, the macronizer supports the following taggers: tag_ngram_123_backoff, tag_tnt, and tag_crf. The tagger is selected when calling the class, as seen on line 2. Be sure to first import the data models from latin_models_cltk, via the corpus importer, since both the taggers and macronizer rely on them.

The macronizer can either macronize text, as seen at line 4 below, or return a list of tagged tokens containing the macronized form like on line 5.

In [1]: from cltk.prosody.latin.macronizer import Macronizer

In [2]: macronizer = Macronizer('tag_ngram_123_backoff')

In [3]: text = 'Quo usque tandem, O Catilina, abutere nostra patientia?'

In [4]: macronizer.macronize_text(text)
Out[4]: 'quō usque tandem , ō catilīnā , abūtēre nostrā patientia ?

In [5]: macronizer.macronize_tags(text)
Out[5]: [('quo', 'd--------', 'quō'), ('usque', 'd--------', 'usque'), ('tandem', 'd--------', 'tandem'), (',', 'u--------', ','), ('o', 'e--------', 'ō'), ('catilina', 'n-s---mb-', 'catilīnā'), (',', 'u--------', ','), ('abutere', 'v2sfip---', 'abūtēre'), ('nostra', 'a-s---fb-', 'nostrā'), ('patientia', 'n-s---fn-', 'patientia'), ('?', None, '?')]

Making POS training sets


POS tagging is a work in progress. A new tagging dictionary has been created, though a tagger has not yet been written.

First, obtain the Latin POS tagging files. The important file here is cltk_latin_pos_dict.txt, which is saved at ~/cltk_data/compiled/pos_latin. This file is a Python dict type which aims to give all possible parts-of-speech for any given form, though this is based off the incomplete Perseus latin-analyses.txt. Thus, there may be gaps in (i) the inflected forms defined and (ii) the comprehensiveness of the analyses of any given form. cltk_latin_pos_dict.txt looks like:

{'-nam': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                    'gloss': '',
                                    'type': 'conj'}}]},
 '-namque': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                       'gloss': '',
                                       'type': 'conj'}}]},
 '-sed': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                    'gloss': '',
                                    'type': 'conj'}}]},
 'Aaron': {'perseus_pos': [{'pos0': {'case': 'nom',
                                     'gender': 'masc',
                                     'gloss': 'Aaron',
                                     'number': 'sg',
                                     'type': 'substantive'}}]},

If you wish to edit the POS dictionary creator, see cltk_latin_pos_dict.txt.For more, see the [pos_latin]( repository.

Named Entity Recognition


NER is new functionality. Please report any errors you observe.

There is available a simple interface to a list of Latin proper nouns. By default tag_ner() takes a string input and returns a list of tuples. However it can also take pre-tokenized forms and return a string.

In [1]: from cltk.tag import ner

In [2]: from cltk.stem.latin.j_v import JVReplacer

In [3]: text_str = """ut Venus, ut Sirius, ut Spica, ut aliae quae primae dicuntur esse mangitudinis."""

In [4]: jv_replacer = JVReplacer()

In [5]: text_str_iu = jv_replacer.replace(text_str)

In [7]: ner.tag_ner('latin', input_text=text_str_iu, output_type=list)
 ('Uenus', 'Entity'),
 ('Sirius', 'Entity'),
 ('Spica', 'Entity'),

PHI Indices

Located at cltk/corpus/latin/ of the source are indices for the PHI5, one of just id and name (PHI5_INDEX) and another also containing information on the authors' works (PHI5_WORKS_INDEX).

In [1]: from cltk.corpus.latin.phi5_index import PHI5_INDEX

In [2]: PHI5_INDEX
{'LAT1050': 'Lucius Verginius Rufus',
 'LAT2335': 'Anonymi de Differentiis [Fronto]',
 'LAT1345': 'Silius Italicus',
 ... }

In [3]: from cltk.corpus.latin.phi5_index import PHI5_WORKS_INDEX

Out [4]:
{'LAT2335': {'works': ['001'], 'name': 'Anonymi de Differentiis [Fronto]'},
 'LAT1345': {'works': ['001'], 'name': 'Silius Italicus'},
 'LAT1351': {'works': ['001', '002', '003', '004', '005'],
  'name': 'Cornelius Tacitus'},
 'LAT2349': {'works': ['001', '002', '003', '004', '005', '006', '007'],
  'name': 'Maurus Servius Honoratus, Servius'},

In addition to these indices there are several helper functions which will build filepaths for your particular computer. Not that you will need to have run convert_corpus(corpus='phi5') and divide_works('phi5') from the TLGU() class, respectively, for the following two functions.

In [1]: from cltk.corpus.utils.formatter import assemble_phi5_author_filepaths

In [2]: assemble_phi5_author_filepaths()

In [3]: from cltk.corpus.utils.formatter import assemble_phi5_works_filepaths

In [4]: assemble_phi5_works_filepaths()

These two functions are useful when, for example, needing to process all authors of the PHI5 corpus, all works of the corpus, or all works of one particular author.

POS tagging

These taggers were built with the assistance of the NLTK. The backoff tagger is Bayseian and the TnT is HMM. To obtain the models, first import the latin_models_cltk corpus.

1–2–3–gram backoff tagger

In [1]: from cltk.tag.pos import POSTag

In [2]: tagger = POSTag('latin')

In [3]: tagger.tag_ngram_123_backoff('Gallia est omnis divisa in partes tres')
[('Gallia', None),
 ('est', 'V3SPIA---'),
 ('omnis', 'A-S---MN-'),
 ('divisa', 'T-PRPPNN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]

TnT tagger

In [4]: tagger.tag_tnt('Gallia est omnis divisa in partes tres')
[('Gallia', 'Unk'),
 ('est', 'V3SPIA---'),
 ('omnis', 'N-S---MN-'),
 ('divisa', 'T-SRPPFN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]

CRF tagger


This tagger's accuracy has not yet been evaluated.

We use the NLTK's CRF tagger. For information on it, see the NLTK docs.

In [5]: tagger.tag_crf('Gallia est omnis divisa in partes tres')
[('Gallia', 'A-P---NA-'),
 ('est', 'V3SPIA---'),
 ('omnis', 'A-S---FN-'),
 ('divisa', 'N-S---FN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]

Lapos tagger


The Lapos tagger is available in its own repo, with with the master branch for Linux and apple branch for Mac. See directions there on how to use it.

Prosody Scanning

A prosody scanner is available for text which already has had its natural lengths marked with macrons. It returns a list of strings of long and short marks for each sentence, with an anceps marking the last syllable of each sentence.

In [1]: from cltk.prosody.latin.scanner import Scansion

In [2]: scanner = Scansion()

In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'

In [4]: scanner.scan_text(text)
Out[4]: ['¯˘¯˘¯¯˘˘˘¯˘˘˘¯˘¯¯x', '¯˘¯˘¯˘˘¯˘˘¯¯¯¯x']

Scansion of Poetry

About the use of macrons in poetry

Most Latin poetry arrives to us without macrons. Some lines of Latin poetry can be scanned and fit a poetic meter without any macrons at all, due to the rules of meter and positional accentuation.

Automatically macronizing every word in a line of Latin poetry does not mean that it will automatically scan correctly. Poets often diverge from standard usage: regularly long vowels can appear short; the verb nesciō in poetry scans the final personal ending as a short o; and regularly short vowels can appear as long; e.g. Lucretius regularly writes rēligiō which scans, instead of the usual religiō; and there is a prosody device: diastole - the short final vowel of a word is lengthened to fit the meter; e.g. tibī in Lucretius I.104 and III.899, etc.

However, some macrons are necessary for scansion: Lucretius I.12 begins with "aeriae" which will not scan in hexameter unless one substitutes its macronized form "āeriae".


The HexameterScanner class scans lines of Latin hexameter (with or without macrons) and determines if the line is a valid hexameter and what its scansion pattern is.

If the line is not properly macronized to scan, the scanner tries to determine whether the line:

  1. Scans merely by position.
  2. Syllabifies according to the common rules.
  3. Is complete (e.g. some hexameter lines are partial).

The scanner also determines which syllables would have to be made long to make the line scan as a valid hexameter. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The HexameterScanner's scan method returns a Hexameter class object.

In [1]: from cltk.prosody.latin import HexameterScanner

In [2]: scanner = HexameterScanner()

In [3]: print(scanner.scan("impulerit. Tantaene animis caelestibus irae?"))
Out[3]: [Hexameter( original='impulerit. Tantaene animis caelestibus irae?', scansion='-  U U -    -   -   U U -    - -  U U  -  - ', valid=True, syllable_count=15, accented='īmpulerīt. Tāntaene animīs caelēstibus īrae?', scansion_notes=['Valid by positional stresses.'], syllables = ['īm, pu, le, rīt, Tān, taen, a, ni, mīs, cae, lēs, ti, bus, i, rae'])]


The Hexameter class object returned by the HexameterScanner provides slots for:

  1. original - original line of verse
  2. scansion - the scansion pattern
  3. valid - whether or not the hexameter is valid
  4. syllable_count - number of syllables according to common syllabification rules
  5. accented - if the hexameter is valid, a version of the line with accented vowels
  6. scansion_notes - a list recording characteristics and transformations made to the original line
  7. syllables - a list of syllables of which the line is divided into

The Scansion notes are defined in a NOTE_MAP dictionary object contained in the ScansionConstants class.


The ScansionConstants class is a configuration class for specifying scansion constants. This class also allows users to customizing scansion constants and scanner behavior, for example, a user may alter the symbols used for stressed and unstressed syllables:

In [1]: from cltk.prosody.latin import ScansionConstants

In [2]: constants = ScansionConstants(unstressed="U",stressed= "-", optional_terminal_ending="X")

In [3]: print(constants.DACTYL)
Out[3]: ['-UU']

In [4]: smaller_constants = ScansionConstants(unstressed="˘",stressed= "¯", optional_terminal_ending="x")

In [5]: print(smaller_constants.DACTYL)
Out[5]: ['¯˘˘']

Constants containing strings have characters in upper and lower case since they will often be used in regular expressions, and used to preserve/a verse's original case.


The Syllabifier class is a Latin language syllabifier. It parses a Latin word or a space separated list of words into a list of syllables. Consonantal I is transformed into a J at the start of a word as necessary. Tuned for poetry and verse, this class is tolerant of isolated single character consonants that may appear due to elision.

In [1]: from cltk.prosody.latin import Syllabifier

In [1]: syllabifier = Syllabifier()

In [2]: print(syllabifier.syllabify("libri"))
Out[2]: ['li', 'bri']

In [3]: print(syllabifier.syllabify("contra"))
Out[3]: ['con', 'tra']

Metrical Validator

The MetricalValidator class is a utility class for validating scansion patterns. Users may configure the scansion symbols internally via passing a customized ScansionConstants via a constructor argument:

In [1]: from cltk.prosody.latin import MetricalValidator

In [2]: print(MetricalValidator().is_valid_hexameter("-UU---UU---UU-U"))
Out[2]: ['True']


The ScansionFormatter class is a utility class for formatting scansion patterns.

In [1]: from cltk.prosody.latin import ScansionFormatter

In [2]: print(ScansionFormatter().hexameter("-UU-UU-UU---UU--"))
Out[2]: ['-UU|-UU|-UU|--|-UU|--']

In [3]: constants = ScansionConstants(unstressed="˘", stressed= "¯", optional_terminal_ending="x")

In [4]: formatter = ScansionFormatter(constants)

In [5]: print(formatter.hexameter( "¯˘˘¯˘˘¯˘˘¯¯¯˘˘¯¯"))
Out[5]: ['¯˘˘|¯˘˘|¯˘˘|¯¯|¯˘˘|¯¯']

StringUtils module

The StringUtils module contains utility methods for processing scansion and text. Such as punctuation_for_spaces_dict() returns a dictionary object that maps unicode punctuation to blanks spaces, which are essential for scansion to keep stress patterns in alignment with original vowel positions in the verse.

In [1]: from cltk.prosody.latin import StringUtils

In [2]: print("I'm ok! Oh #%&*()[]{}!? Fine!".translate(punctuation_for_spaces_dict()).strip())
Out[2]: ['I m ok  Oh              Fine']

Sentence Tokenization

The sentence tokenizer takes a string input into tokenize_sentences() and returns a list of strings. For more on the tokenizer, or to make your own, see the CLTK's Latin sentence tokenizer training set repository.

In [1]: from cltk.tokenize.sentence import TokenizeSentence

In [2]: tokenizer = TokenizeSentence('latin')

In [3]: untokenized_text = 'Itaque cum M. Aurelio et P. Minidio et Cn. Cornelio ad apparationem balistarum et scorpionem reliquorumque tormentorum refectionem fui praesto et cum eis commoda accepi, quae cum primo mihi tribuisiti recognitionem, per sorosis commendationem servasti. Cum ergo eo beneficio essem obligatus, ut ad exitum vitae non haberem inopiae timorem, haec tibi scribere coepi, quod animadverti multa te aedificavisse et nunc aedificare, reliquo quoque tempore et publicorum et privatorum aedificiorum, pro amplitudine rerum gestarum ut posteris memoriae traderentur curam habiturum.'

In [4]: tokenizer.tokenize_sentences(untokenized_text)
['Itaque cum M. Aurelio et P. Minidio et Cn. Cornelio ad apparationem balistarum et scorpionem reliquorumque tormentorum refectionem fui praesto et cum eis commoda accepi, quae cum primo mihi tribuisiti recognitionem, per sorosis commendationem servasti.',
 'Cum ergo eo beneficio essem obligatus, ut ad exitum vitae non haberem inopiae timorem, haec tibi scribere coepi, quod animadverti multa te aedificavisse et nunc aedificare, reliquo quoque tempore et publicorum et privatorum aedificiorum, pro amplitudine rerum gestarum ut posteris memoriae traderentur curam habiturum.']


The stemmer strips suffixes via an algorithm. It is much faster than the lemmatizer, which uses a replacement list.

In [1]: from cltk.stem.latin.stem import Stemmer

In [2]: sentence = 'Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiuerunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem ciuem existimarint foeneratorem quam furem, hinc licet existimare. Et uirum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, uerum, ut supra dixi, periculosum et calamitosum. At ex agricolis et uiri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque inuidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit.'

In [3]: stemmer = Stemmer()

In [4]: stemmer.stem(sentence.lower())
Out[4]: 'est interd praestar mercatur r quaerere, nisi tam periculos sit, et it foenerari, si tam honestum. maior nostr sic habueru et ita in leg posiuerunt: fur dupl condemnari, foenerator quadrupli. quant peior ciu existimari foenerator quam furem, hinc lice existimare. et uir bon quo laudabant, ita laudabant: bon agricol bon colonum; amplissim laudar existimaba qui ita laudabatur. mercator autem strenu studios re quaerend existimo, uerum, ut supr dixi, periculos et calamitosum. at ex agricol et uir fortissim et milit strenuissim gignuntur, maxim p quaest stabilissim consequi minim inuidiosus, minim mal cogitant su qui in e studi occupat sunt. nunc, ut ad r redeam, quod promis institut principi hoc erit. '

Stopword Filtering

To use the CLTK's built-in stopwords list:

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from cltk.stop.latin.stops import STOPS_LIST

In [3]: sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(sentence.lower())

In [6]: [w for w in tokens if not w in STOPS_LIST]


The syllabifier splits a given input Latin word into a list of syllables based on an algorithm and set of syllable specifications for Latin.

In [1]: from cltk.stem.latin.syllabifier import Syllabifier

In [2]: word = 'sidere'

In [3]: syllabifier = Syllabifier()

In [4]: syllabifier.syllabify(word)
Out[4]: ['si', 'de', 're']

Text Cleanup

Intended for use on the TLG after processing by TLGU().

In [1]: from cltk.corpus.utils.formatter import phi5_plaintext_cleanup

In [2]: import os

In [3]: file = os.path.expanduser('~/cltk_data/latin/text/phi5/individual_works/LAT0031.TXT-001.txt')

In [4]: with open(file) as f:
...:     r =

In [5]: r[:500]
Out[5]: '\nDices pulchrum esse inimicos \nulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uide-\ntur, sed si liceat re publica salua ea persequi. sed quatenus id fieri non  \npotest, multo tempore multisque partibus inimici nostri non peribunt \natque, uti nunc sunt, erunt potius quam res publica profligetur atque \npereat. \n    Verbis conceptis deierare ausim, praeterquam qui \nTiberium Gracchum necarunt, neminem inimicum tantum molestiae \ntantumque laboris, quantum te ob has res, mihi tradidis'

In [6]: phi5_plaintext_cleanup(r, rm_punctuation=True, rm_periods=False)[:500]
Out[7]: ' Dices pulchrum esse inimicos ulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uidetur sed si liceat re publica salua ea persequi. sed quatenus id fieri non potest multo tempore multisque partibus inimici nostri non peribunt atque uti nunc sunt erunt potius quam res publica profligetur atque pereat. Verbis conceptis deierare ausim praeterquam qui Tiberium Gracchum necarunt neminem inimicum tantum molestiae tantumque laboris quantum te ob has res mihi tradidisse quem oportebat omni'

If you have a text of a language in Latin characters which contain a lot of junk, remove_non_ascii() and remove_non_latin() might be of use.

In [1]: from cltk.corpus.utils.formatter import remove_non_ascii

In [2]: text =  'Dices ἐστιν ἐμός pulchrum esse inimicos ulcisci.'

In [3]: remove_non_ascii(text)
Out[3]: 'Dices   pulchrum esse inimicos ulcisci.

In [4]: from cltk.corpus.utils.formatter import remove_non_latin

In [5]: remove_non_latin(text)
Out[5]: ' Dices   pulchrum esse inimicos ulcisci'

In [6]: remove_non_latin(text, also_keep=['.', ','])
Out[6]: ' Dices   pulchrum esse inimicos ulcisci.'


The CLTK provides IPA phonetic transliteration for the Latin language. Currently, the only available dialect is Classical as reconstructed by W. Sidney Allen (taken from Vox Latina, 85-103). Example:

In [1]: from cltk.phonology.latin.transcription import Transcriber

In [2]: transcriber = Transcriber(dialect="Classical", reconstruction="Allen")

In [3]: transcriber.transcribe("Quo usque tandem, O Catilina, abutere nostra patientia?")
Out[3]: "['kʷoː 'ʊs.kʷɛ 't̪an̪.d̪ẽː 'oː ka.t̪ɪ.'liː.n̪aː a.buː.'t̪eː.rɛ 'n̪ɔs.t̪raː pa.t̪ɪ̣.'jɛn̪.t̪ɪ̣.ja]"

Word Tokenization

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('latin')

In [3]: text = 'atque haec abuterque puerve paterne nihil'

In [4]: word_tokenizer.tokenize(text)
Out[4]: ['atque', 'haec', 'abuter', 'que', 'puer', 've', 'pater', 'ne', 'nihil']



The Word2Vec models have not been fully vetted and are offered in the spirit of a beta. The CLTK's API for it will be revised.


You will need to install Gensim to use these features.

Word2Vec is a Vector space model especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words which appear in similar contexts (something akin to synonyms; think of them as lexical clusters).

The CLTK repository contains pre-trained Word2Vec models for Latin (import as latin_word2vec_cltk), one lemmatized and the other not. They were trained on the PHI5 corpus. To train your own, see the README at the Latin Word2Vec repository.

One of the most common uses of Word2Vec is as a keyword expander. Keyword expansion is the taking of a query term, finding synonyms, and searching for those, too. Here's an example of its use:

In [1]: from import search_corpus

In [2]: for x in search_corpus('amicitia', 'phi5', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.25):
Out[2]: The following similar terms will be added to the 'amicitia' query: '['societate', 'praesentia', 'uita', 'sententia', 'promptu', 'beneuolentia', 'dignitate', 'monumentis', 'somnis', 'philosophia']'.
('L. Iunius Moderatus Columella', 'hospitem, nisi ex *amicitia* domini, quam raris-\nsime recipiat.')
('L. Iunius Moderatus Columella', ' \n    Xenophon Atheniensis eo libro, Publi Siluine, qui Oeconomicus \ninscribitur, prodidit maritale coniugium sic comparatum esse \nnatura, ut non solum iucundissima, uerum etiam utilissima uitae \nsocietas iniretur: nam primum, quod etiam Cicero ait, ne genus \nhumanum temporis longinquitate occideret, propter \nhoc marem cum femina esse coniunctum, deinde, ut ex \nhac eadem *societate* mortalibus adiutoria senectutis nec \nminus propugnacula praeparentur.')
('L. Iunius Moderatus Columella', 'ac ne ista quidem \npraesidia, ut diximus, non adsiduus labor et experientia \nuilici, non facultates ac uoluntas inpendendi tantum pollent \nquantum uel una *praesentia* domini, quae nisi frequens \noperibus interuenerit, ut in exercitu, cum abest imperator, \ncuncta cessant officia.')

threshold is the closeness of the query term to its neighboring words. Note that when expand_keyword=True, the search term will be stripped of any regular expression syntax.

The keyword expander leverages get_sims() (which in turn leverages functionality of the Gensim package) to find similar terms. Some examples of it in action:

In [3]: from cltk.vector.word2vec import get_sims

In [4]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.7)
Matches found, but below the threshold of 'threshold=0.7'. Lower it to see these results.
Out[4]: []

In [5]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.2)

In [6]: get_sims('iube', 'latin', lemmatized=True, threshold=0.7)
Out[6]: "word 'iube' not in vocabulary"
['The following terms in the Word2Vec model you may be looking for: '['iubet”', 'iubet', 'iubilo', 'iubĕ', 'iubar', 'iubes', 'iubatus', 'iuba1', 'iubeo']'.]'

In [7]: get_sims('dictator', 'latin', lemmatized=False, threshold=0.7)

To add and subtract vectors, you need to load the models yourself with Gensim.