8.1.1. cltk.alphabet package

Modules for accessing the alphabets and character sets of in-scope CLTK languages.

8.1.1.1. Subpackages

8.1.1.2. Submodules

8.1.1.3. cltk.alphabet.ang module

The Old English alphabet.

>>> from cltk.alphabet import ang
>>> ang.DIGITS[:5]
['ān', 'tƿeġen', 'þrēo', 'fēoƿer', 'fīf']
>>> ang.DIPHTHONGS[:5]
['ea', 'eo', 'ie']

8.1.1.4. cltk.alphabet.arb module

The Arabic alphabet. Sources:

>>> from cltk.alphabet import arb
>>> arb.LETTERS[:5]
('ا', 'ب', 'ت', 'ة', 'ث')
>>> arb.PUNCTUATION_MARKS
['،', '؛', '؟']
>>> arb.ALEF
'ا'
>>> arb.WEAK
('ا', 'و', 'ي', 'ى')

8.1.1.5. cltk.alphabet.arc module

The Imperial Aramaic alphabet, plus a simple script that converts a Hebrew (square-script) transcription of an Imperial Aramaic text to the Imperial Aramaic Unicode block.

TODO: Add Hebrew-to-Aramaic converter

cltk.alphabet.arc.square_to_imperial(square_script)[source]

Convert a Hebrew (square-script) transcription of an Imperial Aramaic text to the Imperial Aramaic Unicode block.

Return type:

str
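
The conversion described above can be illustrated with a code-point table. This is only a sketch of the idea, not the library's implementation: the name `hebrew_to_imperial_sketch` is hypothetical, and folding Hebrew final letter forms onto their base letters is an assumption of this example.

```python
# Hypothetical sketch: map Hebrew square-script letters (U+05D0-U+05EA)
# onto the Imperial Aramaic block (U+10840-U+10855).

# Hebrew final forms (kaf, mem, nun, pe, tsadi); each sits one code point
# before its base letter in the Hebrew block.
FINAL_FORMS = {0x05DA, 0x05DD, 0x05DF, 0x05E3, 0x05E5}

# The 22 base letters in alphabetical order.
BASE_LETTERS = [cp for cp in range(0x05D0, 0x05EB) if cp not in FINAL_FORMS]

IMPERIAL_ARAMAIC_START = 0x10840

# Translation table: Hebrew code point -> Imperial Aramaic character.
TABLE = {cp: chr(IMPERIAL_ARAMAIC_START + i) for i, cp in enumerate(BASE_LETTERS)}
for final in FINAL_FORMS:
    # Assumption: a final form maps to the same letter as its base form.
    TABLE[final] = TABLE[final + 1]


def hebrew_to_imperial_sketch(square_script: str) -> str:
    """Return the text with Hebrew letters replaced; other characters are kept."""
    return square_script.translate(TABLE)
```

Characters outside the Hebrew letter range (spaces, punctuation, vowel points) pass through unchanged in this sketch.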

8.1.1.6. cltk.alphabet.ave module

The Avestan alphabet. Sources:

8.1.1.7. cltk.alphabet.ben module

The Bengali alphabet.

>>> from cltk.alphabet import ben
>>> ben.VOWELS[:5]
['অ', 'আ', 'ই', 'ঈ', 'উ']
>>> ben.DEPENDENT_VOWELS[:5]
['◌া', 'ি', '◌ী', '◌ু', '◌ূ']
>>> ben.CONSONANTS[:5]
['ক', 'খ', 'গ', 'ঘ ', 'ঙ']

8.1.1.8. cltk.alphabet.egy module

Convert MdC transliterated text to Unicode.

cltk.alphabet.egy.mdc_unicode(string, q_kopf=True)[source]

The transliterated text is passed to the function as string, and the corresponding characters are searched for and replaced. If q_kopf is False, ‘q’ is replaced with ‘ḳ’.

Parameters:
  • string (str) – MdC-transliterated text

  • q_kopf (bool) – if False, replace ‘q’ with ‘ḳ’

Return type:

str

Returns:

the text converted to Unicode
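
The search-and-replace step might look roughly like the following. The function name `mdc_unicode_sketch` and the table are assumptions for illustration: the table is a small subset of common Manuel de Codage correspondences, and the mapping the library actually applies may be larger or differ in detail.

```python
# Hypothetical subset of MdC -> Unicode transliteration correspondences.
MDC_SKETCH = {
    "A": "\ua723",  # ꜣ (aleph)
    "a": "\ua725",  # ꜥ (ayin)
    "H": "\u1e25",  # ḥ
    "x": "\u1e2b",  # ḫ
    "X": "\u1e96",  # ẖ
    "S": "\u0161",  # š
    "T": "\u1e6f",  # ṯ
    "D": "\u1e0f",  # ḏ
}


def mdc_unicode_sketch(string: str, q_kopf: bool = True) -> str:
    """Replace MdC ASCII codes with Unicode transliteration characters."""
    mapping = dict(MDC_SKETCH)
    if not q_kopf:
        # As described above: with q_kopf=False, 'q' becomes 'ḳ'.
        mapping["q"] = "\u1e33"
    # str.translate works in a single pass, so replacements cannot cascade.
    return string.translate(str.maketrans(mapping))
```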

8.1.1.9. cltk.alphabet.enm module

The Middle English alphabet. Sources:

The consonant sounds of Middle English are categorized as follows:

  • Stops: ⟨/b/, /p/, /d/, /t/, /g/, /k/⟩

  • Affricates and Fricatives: ⟨/ǰ/, /č/, /v/, /f/, /ð/, /θ/, /z/, /s/, /ž/, /š/, /c̹/, /x/, /h/⟩

  • Nasals: ⟨/m/, /n/, /ɳ/⟩

  • Lateral Resonants: ⟨/l/⟩

  • Medial Resonants: ⟨/r/, /y/, /w/⟩

Thorn (þ) was gradually replaced by the digraph “th”, while Eth (ð), which had already fallen out of use by the 14th century, was later replaced by “d”.

Wynn (ƿ) is the predecessor of “w”. Modern transliterations usually replace it with “w” to avoid confusion with the strikingly similar “p”.

The vowel sounds in Middle English are divided into:

  • Long Vowels: ⟨/a:/, /e/, /e̜/, /i/ , /ɔ:/, /o/ , /u/⟩

  • Short Vowels: ⟨/a/, /ɛ/, /I/, /ɔ/, /U/, /ə/⟩

As established rules for ME orthography were effectively nonexistent, compiling a definite list of diphthongs is non-trivial. The following aims to compile a list of the most commonly-used diphthongs.

>>> from cltk.alphabet import enm
>>> enm.ALPHABET[:5]
['a', 'b', 'c', 'd', 'e']
>>> enm.CONSONANTS[:5]
['b', 'c', 'd', 'f', 'g']
cltk.alphabet.enm.normalize_middle_english(text, to_lower=True, alpha_conv=True, punct=True)[source]

Normalizes Middle English text string and returns normalized string.

Parameters:
  • text (str) – text to be normalized

  • to_lower (bool) – convert text to lowercase

  • alpha_conv (bool) – convert text to canonical form: æ -> ae, þ -> th, ð -> th, ȝ -> y at the beginning of a word, gh otherwise

  • punct (bool) – remove punctuation

>>> normalize_middle_english('Whan Phebus in the CraBbe had neRe hys cours ronne', to_lower = True)
'whan phebus in the crabbe had nere hys cours ronne'
>>> normalize_middle_english('I pray ȝow þat ȝe woll', alpha_conv = True)
'i pray yow that ye woll'
>>> normalize_middle_english("furst, to begynne:...", punct = True)
'furst to begynne'
Return type:

str
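
The alpha_conv rule can be sketched with a word-boundary regex. This is an illustration of the rule as documented above, not the library's code; the name `alpha_conv_sketch` is hypothetical, and only lowercase letters are handled here.

```python
import re


def alpha_conv_sketch(text: str) -> str:
    """Apply the documented letter conversions to lowercase text."""
    # Plain one-to-many substitutions.
    text = text.replace("æ", "ae").replace("þ", "th").replace("ð", "th")
    # Yogh at the start of a word becomes 'y' ...
    text = re.sub(r"\bȝ", "y", text)
    # ... and any remaining yogh becomes 'gh'.
    return text.replace("ȝ", "gh")
```

The order matters: the word-initial case has to be handled before the catch-all replacement, otherwise every yogh would become “gh”.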

8.1.1.10. cltk.alphabet.fro module

The normalizer aims to maximally reduce the variation between the orthography of texts written in the Anglo-Norman dialect to bring it in line with “orthographe commune”. It is heavily inspired by Pope (1956). Spelling variation is not consistent enough to ensure the highest accuracy; the normalizer in its current format should therefore be used as a last resort.

The normalizer, word tokenizer, stemmer, lemmatizer, and list of stopwords for OF/MF were developed as part of Google Summer of Code 2017. A full write-up of this work can be found at: https://gist.github.com/nat1881/6f134617805e2efbe5d275770e26d350

References: Pope, M.K. 1956. From Latin to Modern French with Especial Consideration of Anglo-Norman. Manchester: MUP.

Anglo-French spelling variants normalized to “orthographe commune”, from M. K. Pope (1956):

  • word-final d - e.g. vertud vs vertu

  • use of <u> over <ou>

  • <eaus> for <eus>, <ceaus> for <ceus>

  • triphthongs:
    • <iu> for <ieu>

    • <u> for <eu>

    • <ie> for <iee>

    • <ue> for <uee>

    • <ure> for <eure>

  • “epenthetic vowels” - e.g. averai for avrai

  • <eo> for <o>

  • <iw>, <ew> for <ieux>

  • final <a> for <e>

cltk.alphabet.fro.build_match_and_apply_functions(pattern, replace)[source]

Assemble regex patterns.

cltk.alphabet.fro.normalize_fr(tokens)[source]

Normalize Old and Middle French tokens.

TODO: Make work again with a tokenizer.

Return type:

list[str]
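
A rule-based normalizer of this kind is often built from pairs of closures, one that tests a pattern and one that applies the replacement. The sketch below shows that pattern under stated assumptions: the names `build_rule`, `RULES_SKETCH` and `normalize_sketch` are hypothetical, and the two rules are simplified illustrations of the variants listed above, not the library's rule set.

```python
import re
from typing import Callable


def build_rule(pattern: str, replace: str) -> tuple[Callable[[str], bool], Callable[[str], str]]:
    """Return a (matches, apply) pair of closures for one regex rule."""

    def matches_rule(word: str) -> bool:
        return re.search(pattern, word) is not None

    def apply_rule(word: str) -> str:
        return re.sub(pattern, replace, word)

    return matches_rule, apply_rule


# Illustrative rules only: word-final -ud -> -u, and -eaus -> -eus.
RULES_SKETCH = [build_rule(p, r) for p, r in [(r"ud$", "u"), (r"eaus$", "eus")]]


def normalize_sketch(tokens: list[str]) -> list[str]:
    """Apply the first matching rule to each token; leave others unchanged."""
    normalized = []
    for token in tokens:
        for matches, apply in RULES_SKETCH:
            if matches(token):
                token = apply(token)
                break
        normalized.append(token)
    return normalized
```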

8.1.1.11. cltk.alphabet.gmh module

The alphabet for Middle High German. Source:

The consonants of Middle High German are categorized as:

  • Stops: ⟨p t k/c/q b d g⟩

  • Affricates: ⟨pf/ph tz/z⟩

  • Fricatives: ⟨v f s ȥ sch ch h⟩

  • Nasals: ⟨m n⟩

  • Liquids: ⟨l r⟩

  • Semivowels: ⟨w j⟩

Misc. notes:

  • c is used at the beginning of only loanwords and is pronounced the same as k (e.g. calant, cappitain)

  • Double consonants are pronounced the same way as their corresponding letters in Modern Standard German (e.g. pp/p)

  • schl, schm, schn, schw are written in MHG as sl, sm, sn, sw

  • æ (also seen as ae), œ (also seen as oe) and iu denote the use of Umlaut over â, ô and û respectively

  • ȥ or ʒ is used in modern handbooks and grammars to indicate the s or s-like sound which arose from Germanic t in the High German consonant shift.

>>> from cltk.alphabet import gmh
>>> gmh.CONSONANTS[:5]
['b', 'd', 'g', 'h', 'f']
>>> gmh.VOWELS[:5]
['a', 'ë', 'e', 'i', 'o']
cltk.alphabet.gmh.normalize_middle_high_german(text, to_lower_all=True, to_lower_beginning=False, alpha_conv=True, punct=True, ascii=False)[source]

Normalize input string.

>>> from cltk.alphabet import gmh
>>> from cltk.languages.example_texts import get_example_text
>>> gmh.normalize_middle_high_german(get_example_text("gmh"))[:50]
'uns ist in alten\nmæren wunders vil geseit\nvon hele'
Parameters:
  • text (str) –

  • to_lower_beginning (bool) –

  • to_lower_all (bool) – convert whole text to lowercase

  • alpha_conv (bool) – convert alphabet to canonical form

  • punct (bool) – remove punctuation

  • ascii (bool) – returns ascii form

Return type:

str

Returns:

normalized text

8.1.1.12. cltk.alphabet.guj module

The Gujarati alphabet.

>>> from cltk.alphabet import guj
>>> guj.VOWELS[:5]
['અ', 'આ', 'ઇ', 'ઈ', 'ઉ']
>>> guj.CONSONANTS[:5]
['ક', 'ખ', 'ગ', 'ઘ', 'ચ']

8.1.1.13. cltk.alphabet.hin module

The Hindi alphabet.

>>> from cltk.alphabet import hin
>>> hin.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> hin.CONSONANTS[:5]
['क', 'ख', 'ग', 'घ', 'ङ']
>>> hin.SONORANT_CONSONANTS
['य', 'र', 'ल', 'व']

8.1.1.14. cltk.alphabet.kan module

The Kannada alphabet. The characters can be divided into 3 categories:

  1. Swaras (Vowels) : 13 in modern Kannada and 14 in Classical

  2. Vynjanas (Consonants) : They are further divided into 2 categories:

    1. Structured Consonants : 25

    2. Unstructured Consonants : 9 in modern Kannada and 11 in Classical

  3. Yogavaahakas (part vowel, part consonant) : 2

Each Swara and Yogavaahaka has a corresponding symbol. Thus Consonant + Vowel Symbol = Kagunita.

>>> from cltk.alphabet import kan
>>> kan.VOWELS[:5]
['ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ']
>>> kan.STRUCTURED_CONSONANTS[:5]
['ಕ', 'ಖ', 'ಗ', 'ಘ', 'ಙಚ']
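
The “Consonant + Vowel Symbol = Kagunita” rule amounts to string concatenation of a consonant with a dependent vowel sign. The sketch below is self-contained and does not use the module's constants; the name `kagunita_sketch` and the selection of five vowel signs are assumptions for illustration.

```python
KA = "\u0c95"  # ಕ, KANNADA LETTER KA

# Five dependent vowel signs: aa, i, ii, u, uu.
VOWEL_SIGNS_SKETCH = ["\u0cbe", "\u0cbf", "\u0cc0", "\u0cc1", "\u0cc2"]


def kagunita_sketch(consonant: str, vowel_signs: list[str]) -> list[str]:
    """Combine one consonant with each vowel sign."""
    return [consonant + sign for sign in vowel_signs]


for syllable in kagunita_sketch(KA, VOWEL_SIGNS_SKETCH):
    print(syllable)
```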

8.1.1.15. cltk.alphabet.lat module

Alphabet and text normalization for Latin.

Guidelines:

  • […] Square brackets, or in recent editions wavy brackets "{…}", enclose words etc. that an editor thinks should be deleted (see "del.") or marked as out of place (see "secl.").

  • […] Square brackets in a papyrus text, or in an inscription, enclose places where words have been lost through physical damage.
    • If this happens in mid-line, editors use "[…]".

    • If only the end of the line is missing, they use a single bracket "[…"

    • If the line's beginning is missing, they use "…]"

    • Within the brackets, often each dot represents one missing letter.

  • [[…]] Double brackets enclose letters or words deleted by the medieval copyist himself.

  • (…) Round brackets are used to supplement words abbreviated by the original copyist; e.g. in an inscription: "trib(unus) mil(itum) leg(ionis) III"

  • <…> Diamond ( = elbow = angular) brackets enclose words etc. that an editor has added (see "suppl.")

  • † An obelus (pl. obeli) means that the word(s etc.) is very plainly corrupt, but the editor cannot see how to emend.
    • If only one word is corrupt, there is only one obelus, which precedes the word; if two or more words are corrupt, two obeli enclose them. (Such at least is the rule, but that rule is often broken, especially in older editions, which sometimes dagger several words using only one obelus.) To dagger words in this way is to "obelize" them.

class cltk.alphabet.lat.JVReplacer[source]

Bases: object

Replace J/V with I/U. The Latin alphabet does not distinguish between J/j and I/i or between V/v and U/u; yet many texts bear the influence of later editors and the predilections of other languages.

In practical terms, the JV substitution is recommended for all Latin text preprocessing; it helps to collapse the search space.

>>> replacer = JVReplacer()
>>> replacer.replace("Julius Caesar")
'Iulius Caesar'
>>> replacer.replace("In vino veritas.")
'In uino ueritas.'
replace(text)[source]

Do j/v replacement
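
The substitution itself is a fixed character mapping. A minimal sketch with `str.translate` follows; the name `jv_replace_sketch` is hypothetical and this is not necessarily how the class is implemented.

```python
# One-to-one character table: J -> I, j -> i, V -> U, v -> u.
JV_TABLE = str.maketrans("JjVv", "IiUu")


def jv_replace_sketch(text: str) -> str:
    """Return text with J/j and V/v replaced by I/i and U/u."""
    return text.translate(JV_TABLE)
```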

class cltk.alphabet.lat.LigatureReplacer[source]

Bases: object

Replace the ligatures ‘œ’ and ‘æ’ with ‘oe’ and ‘ae’ (and ‘Œ’ and ‘Æ’ with ‘OE’ and ‘AE’).

Classical Latin wrote the o and e separately (as has today again become the general practice), but the ligature was used in medieval and early modern writings, in part because the diphthongal sound had, by Late Latin, merged into the sound [e]. See: https://en.wikipedia.org/wiki/%C5%92

Æ (minuscule: æ) is a grapheme named æsc or ash, formed from the letters a and e, originally a ligature representing the Latin diphthong ae. It has been promoted to the full status of a letter in the alphabets of some languages, including Danish, Norwegian, Icelandic, and Faroese. See: https://en.wikipedia.org/wiki/%C3%86

>>> replacer = LigatureReplacer()
>>> replacer.replace("mæd")
'maed'
>>> replacer.replace("prœil")
'proeil'
replace(text)[source]

Do character replacement.

cltk.alphabet.lat.dehyphenate(text)[source]

Remove hyphens from text; used on texts that have line breaks with hyphens that may creep into the text. Use with caution elsewhere.

Parameters:

text (str) –

Return type:

str

>>> dehyphenate('quid re-tundo hier')
'quid retundo hier'
cltk.alphabet.lat.swallow(text, pattern_matcher)[source]

Utility function internal to this module

Parameters:
  • text (str) – text to clean

  • pattern_matcher (Pattern) – pattern to match

Return type:

str

Returns:

the text without the matched pattern; spaces are not substituted
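
A helper with this contract can be very small. The sketch below assumes the pattern is a compiled `re.Pattern`; the names `swallow_sketch` and `BRACES_SKETCH` are hypothetical and the brace pattern is only an example.

```python
import re

# Example pattern: a pair of braces and everything between them.
BRACES_SKETCH = re.compile(r"\{[^}]*\}")


def swallow_sketch(text: str, pattern_matcher: re.Pattern) -> str:
    """Delete every match of the pattern; no space is put in its place."""
    return pattern_matcher.sub("", text)
```

Because nothing is substituted for the removed span, the spaces on either side of a match remain, which is why doubled spaces can appear in the output.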

cltk.alphabet.lat.swallow_braces(text)[source]

Remove Text within braces, and drop the braces.

Parameters:

text (str) – Text with braces

Return type:

str

Returns:

Text with the braces and any text inside removed

>>> swallow_braces("{PRO P. QVINCTIO ORATIO} Quae res in civitate {etc}... ")
'Quae res in civitate ...'
cltk.alphabet.lat.drop_latin_punctuation(text)[source]

Drop all Latin punctuation except the hyphen and obelization markers, replacing the punctuation with a space. Please collapse hyphenated words and remove obelization marks separately beforehand.

The hyphen is important in Latin tokenization, as the enclitic particle -ne differs from the interjection ne.

Parameters:

text (str) – Text to clean

Return type:

str

Returns:

cleaned text

>>> drop_latin_punctuation('quid est ueritas?')
'quid est ueritas '
>>> drop_latin_punctuation("vides -ne , quod , planus est ")
'vides -ne   quod   planus est '
>>> drop_latin_punctuation("here is some trash, punct \/':;,!\?\._『@#\$%^&\*okay").replace("  ", " ")
'here is some trash punct okay'
cltk.alphabet.lat.remove_accents(text)[source]

Remove accents; note: AE replacement and macron replacement should happen elsewhere, if desired.

Parameters:

text (str) – text with undesired accents

Return type:

str

Returns:

clean text

>>> remove_accents('suspensám')
'suspensam'
>>> remove_accents('quăm')
'quam'
>>> remove_accents('aegérrume')
'aegerrume'
>>> remove_accents('ĭndignu')
'indignu'
>>> remove_accents('îs')
'is'
>>> remove_accents('óccidentem')
'occidentem'
>>> remove_accents('frúges')
'fruges'
cltk.alphabet.lat.remove_macrons(text)[source]

Remove macrons above vowels.

Parameters:

text (str) – text with macronized vowels

Return type:

str

Returns:

clean text

>>> remove_macrons("canō")
'cano'
>>> remove_macrons("Īuliī")
'Iulii'
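
One general way to strip a single kind of diacritic is to decompose the text, drop the combining mark, and recompose. The sketch below uses that Unicode-normalization technique; it is an alternative illustration, not a description of how the library does it, and `remove_macrons_sketch` is a hypothetical name.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def remove_macrons_sketch(text: str) -> str:
    """Strip macrons while leaving other diacritics in place."""
    # NFD splits e.g. 'ō' into 'o' + U+0304.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = decomposed.replace(COMBINING_MACRON, "")
    # NFC recombines the remaining marks (acute, breve, ...) with their letters.
    return unicodedata.normalize("NFC", stripped)
```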
cltk.alphabet.lat.swallow_angle_brackets(text)[source]

Disappear text in and surrounding an angle bracket.

>>> text = " <O> mea dext<e>ra illa CICERO RUFO Quo<quo>. modo proficiscendum <in> tuis. deesse HS <c> quae metu <exagitatus>, furore <es>set consilium "
>>> swallow_angle_brackets(text)
'mea illa CICERO RUFO modo proficiscendum tuis. deesse HS quae metu furore consilium'

Return type:

str

cltk.alphabet.lat.disappear_angle_brackets(text)[source]

Remove all angle brackets, keeping the surrounding text; no spaces are inserted.

Parameters:

text (str) – text with angle bracket

Return type:

str

Returns:

text without angle brackets

cltk.alphabet.lat.swallow_square_brackets(text)[source]

Swallow text inside square brackets, without substituting a space.

Parameters:

text (str) – text to clean

Return type:

str

Returns:

text without square brackets and text inside removed

>>> swallow_square_brackets("qui aliquod institui[t] exemplum")
'qui aliquod institui exemplum'
>>> swallow_square_brackets("posthac tamen cum haec [tamen] quaeremus,")
'posthac tamen cum haec  quaeremus,'
cltk.alphabet.lat.swallow_obelized_words(text)[source]

Swallow obelized words; handles words enclosed by markers and words flagged on the left. Considers plus signs and daggers as obelization markers.

Parameters:

text (str) – Text with obelized words

Return type:

str

Returns:

clean text

>>> swallow_obelized_words("tu Fauonium †asinium† dicas")
'tu Fauonium  dicas'
>>> swallow_obelized_words("tu Fauonium †asinium dicas")
'tu Fauonium dicas'
>>> swallow_obelized_words("meam +similitudinem+")
'meam'
>>> swallow_obelized_words("mea +ratio non habet" )
'mea non habet'
cltk.alphabet.lat.disappear_round_brackets(text)[source]

Remove round brackets and keep the text intact.

Parameters:

text (str) – Text with round brackets.

Return type:

str

Returns:

Clean text.

>>> disappear_round_brackets("trib(unus) mil(itum) leg(ionis) III")
'tribunus militum legionis III'
cltk.alphabet.lat.swallow_editorial(text)[source]

Swallow common editorial marks.

Parameters:

text (str) – Text with editorial marks

Return type:

str

Returns:

Clean text.

>>> swallow_editorial("{PRO P. QVINCTIO ORATIO} Quae res in civitate trib(unus) mil(itum) leg(ionis) III tu Fauonium †asinium† dicas meam +similitudinem+  mea +ratio non habet ...     ")
'{PRO P. QVINCTIO ORATIO} Quae res in civitate tribunus militum legionis III tu Fauonium  dicas meam   mea non habet ...'
cltk.alphabet.lat.accept_editorial(text)[source]

Accept common editorial suggestions.

Parameters:

text (str) – Text with editorial suggestions

Return type:

str

Returns:

clean text

>>> accept_editorial("{PRO P. QVINCTIO ORATIO} Quae res in civitate trib(unus) mil(itum) leg(ionis) III tu Fauonium †asinium† dicas meam +similitudinem+  mea +ratio non habet ...     ")
'Quae res in civitate tribunus militum legionis III tu Fauonium  dicas meam   mea non habet  '
cltk.alphabet.lat.truecase(word, case_counter)[source]

Truecase a word using a Truecase dictionary

Parameters:
  • word (str) – a word

  • case_counter (dict[str, int]) – A counter; a dictionary of words/tokens and their relative frequency counts

Returns:

the truecased word

>>> case_counts ={"caesar": 1, "Caesar": 99}
>>> truecase('CAESAR', case_counts)
'Caesar'
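
Truecasing against a counter can be sketched as picking the most frequent casing variant of the word. The function name `truecase_sketch` is hypothetical, and returning the input unchanged when no variant is known is an assumption of this example rather than documented behavior.

```python
def truecase_sketch(word: str, case_counter: dict[str, int]) -> str:
    """Return the most frequent casing of word found in case_counter."""
    # Collect every key that differs from the word only by case.
    variants = {
        form: count
        for form, count in case_counter.items()
        if form.lower() == word.lower()
    }
    if not variants:
        # Assumption: unseen words are returned as given.
        return word
    return max(variants, key=variants.get)
```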
cltk.alphabet.lat.normalize_lat(text, drop_accents=False, drop_macrons=False, jv_replacement=False, ligature_replacement=False)[source]

The function for all default Latin normalization.

>>> text = "canō Īuliī suspensám quăm aegérrume ĭndignu îs óccidentem frúges Julius Caesar. In vino veritas. mæd prœil"
>>> normalize_lat(text)
'canō Īuliī suspensám quăm aegérrume ĭndignu îs óccidentem frúges Julius Caesar. In vino veritas. mæd prœil'
>>> normalize_lat(text, drop_accents=True)
'canō Īuliī suspensam quăm aegerrume ĭndignu is óccidentem frúges Julius Caesar. In vino veritas. mæd prœil'
>>> normalize_lat(text, drop_accents=True, drop_macrons=True)
'cano Iulii suspensam quăm aegerrume ĭndignu is óccidentem frúges Julius Caesar. In vino veritas. mæd prœil'
>>> normalize_lat(text, drop_accents=True, drop_macrons=True, jv_replacement=True)
'cano Iulii suspensam quăm aegerrume ĭndignu is óccidentem frúges Iulius Caesar. In uino ueritas. mæd prœil'
>>> normalize_lat(text, drop_accents=True, drop_macrons=True, jv_replacement=True, ligature_replacement=True)
'cano Iulii suspensam quăm aegerrume ĭndignu is óccidentem frúges Iulius Caesar. In uino ueritas. maed proeil'
Return type:

str

8.1.1.16. cltk.alphabet.non module

Old Norse runes, Unicode block: 16A0–16FF. Source: Viking Language 1, Jessie L. Byock

TODO: Document and test better.

class cltk.alphabet.non.AutoName(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

class cltk.alphabet.non.RunicAlphabetName(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: AutoName

elder_futhark = 'elder_futhark'
younger_futhark = 'younger_futhark'
short_twig_younger_futhark = 'short_twig_younger_futhark'
class cltk.alphabet.non.Rune(runic_alphabet, form, sound, transcription, name)[source]

Bases: object

>>> Rune(RunicAlphabetName.elder_futhark, "ᚺ", "h", "h", "haglaz")

>>> Rune.display_runes(ELDER_FUTHARK)
['ᚠ', 'ᚢ', 'ᚦ', 'ᚨ', 'ᚱ', 'ᚲ', 'ᚷ', 'ᚹ', 'ᚺ', 'ᚾ', 'ᛁ', 'ᛃ', 'ᛇ', 'ᛈ', 'ᛉ', 'ᛊ', 'ᛏ', 'ᛒ', 'ᛖ', 'ᛗ', 'ᛚ', 'ᛜ', 'ᛟ', 'ᛞ']
static display_runes(runic_alphabet)[source]

Displays the given runic alphabet.

Parameters:

runic_alphabet (list) – list

Returns:

list

static from_form_to_transcription(form, runic_alphabet)[source]
Parameters:
  • form (str) –

  • runic_alphabet (list) –

Returns:

conventional transcription of the rune

class cltk.alphabet.non.Transcriber[source]

Bases: object

>>> little_jelling_stone = "᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬"
>>> Transcriber.transcribe(little_jelling_stone, YOUNGER_FUTHARK)
'᛫kurmR᛫kunukR᛫k(ar)þi᛫kubl᛫þusi᛫a(ft)᛫þurui᛫kunu᛫sina᛫tanmarkaR᛫but᛫'
static from_form_to_transcription(runic_alphabet)[source]

Make a dictionary whose keys are forms of runes and values their transcriptions. Used by the transcribe method.

Parameters:

runic_alphabet (list) –

Returns:

dict

static transcribe(rune_sentence, runic_alphabet)[source]

From a runic inscription, the transcribe method gives a conventional transcription.

Parameters:
  • rune_sentence (str) – str, elements of this are from runic_alphabet or are punctuations

  • runic_alphabet (list) – list
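
The two static methods above describe a lookup-table approach: build a form-to-transcription dictionary, then map each character of the inscription through it. The sketch below reproduces that idea with stand-in types. `RuneSketch`, `SAMPLE_RUNES` and both function names are hypothetical, the five runes are a small sample rather than a full alphabet, and passing unknown characters through unchanged is an assumption.

```python
from typing import NamedTuple


class RuneSketch(NamedTuple):
    """Stand-in for a rune: its written form and conventional transcription."""
    form: str
    transcription: str


# A small sample of Younger Futhark runes, for illustration only.
SAMPLE_RUNES = [
    RuneSketch("ᚴ", "k"),
    RuneSketch("ᚢ", "u"),
    RuneSketch("ᚱ", "r"),
    RuneSketch("ᛘ", "m"),
    RuneSketch("ᛦ", "R"),
]


def form_to_transcription_sketch(runic_alphabet: list) -> dict:
    """Build a dictionary from rune forms to their transcriptions."""
    return {rune.form: rune.transcription for rune in runic_alphabet}


def transcribe_sketch(rune_sentence: str, runic_alphabet: list) -> str:
    """Transcribe character by character; unknown characters are kept."""
    table = form_to_transcription_sketch(runic_alphabet)
    return "".join(table.get(char, char) for char in rune_sentence)
```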

8.1.1.17. cltk.alphabet.omr module

The alphabet for Marathi.

# Using the International Alphabet of Sanskrit Transliteration (IAST), these vowels are represented thus

>>> from cltk.alphabet import omr
>>> omr.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> omr.IAST_VOWELS[:5]
['a', 'ā', 'i', 'ī', 'u']
>>> list(zip(omr.SEMI_VOWELS, omr.IAST_SEMI_VOWELS))
[('य', 'y'), ('र', 'r'), ('ल', 'l'), ('व', 'w')]

8.1.1.18. cltk.alphabet.ory module

The Odia alphabet.

>>> from cltk.alphabet import ory
>>> ory.VOWELS["0B05"]
'ଅ'
>>> ory.STRUCTURED_CONSONANTS["0B15"]
'କ'
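
The keys in the example above look like hexadecimal Unicode code points (U+0B05 and U+0B15 are the Odia letters A and KA). Assuming that is the convention, a key can be turned into its character without the module; `char_from_hex_key` is a hypothetical helper name.

```python
def char_from_hex_key(key: str) -> str:
    """Interpret a key such as '0B05' as a hexadecimal code point."""
    return chr(int(key, 16))


print(char_from_hex_key("0B05"))
print(char_from_hex_key("0B15"))
```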

8.1.1.19. cltk.alphabet.osc module

The Oscan alphabet. Sources:

  • <https://www.unicode.org/charts/PDF/U10300.pdf>

  • Buck, C. A Grammar of Oscan and Umbrian.

8.1.1.20. cltk.alphabet.ota module

Ottoman alphabet

Misc. notes:

  • Based off Persian Alphabet Transliteration in CLTK by Iman Nazar

  • Uses UTF-8 Encoding for Ottoman/Persian Letters

  • When printing Arabic letters, they appear in the console from left to right and inconsistently linked, but correctly link and flow right to left when inputted into a word processor. The problems only exist in the terminal.

TODO: Add tests

8.1.1.21. cltk.alphabet.oty module

Alphabet for Old Tamil. GRANTHA_CONSONANTS are from the Grantha script, which was used between the 6th and 20th centuries to write Sanskrit and the classical language Manipravalam.

TODO: Add tests

8.1.1.22. cltk.alphabet.peo module

The Old Persian Cuneiform. Sources:

8.1.1.23. cltk.alphabet.pes module

The Persian alphabet.

TODO: Write tests.

cltk.alphabet.pes.mk_replacement_regex()[source]
cltk.alphabet.pes.normalize_text(text)[source]

8.1.1.24. cltk.alphabet.pli module

The Pali alphabet.

TODO: Add tests.

8.1.1.25. cltk.alphabet.processes module

This module holds the Process for normalizing text strings, usually before the text is sent to other processes.

class cltk.alphabet.processes.NormalizeProcess(language=None)[source]

Bases: Process

Generic process for text normalization.

language: str = None
algorithm
run(input_doc)[source]

This ideally returns an algorithm that takes and returns a string.

Return type:

Doc

class cltk.alphabet.processes.GreekNormalizeProcess(language=None)[source]

Bases: NormalizeProcess

Text normalization for Ancient Greek.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "grc"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = GreekNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
False
language: str = 'grc'
class cltk.alphabet.processes.LatinNormalizeProcess(language=None)[source]

Bases: NormalizeProcess

Text normalization for Latin.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "lat"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = LatinNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
False
language: str = 'lat'

8.1.1.26. cltk.alphabet.san module

Data module for the Sanskrit language's alphabet and related characters.

8.1.1.27. cltk.alphabet.tel module

Telugu alphabet

TODO: Add tests.

8.1.1.28. cltk.alphabet.text_normalization module

Functions for preprocessing texts. Not language-specific.

cltk.alphabet.text_normalization.cltk_normalize(text, compatibility=True)[source]
cltk.alphabet.text_normalization.remove_non_ascii(input_string)[source]

Remove non-ASCII characters. Source: http://stackoverflow.com/a/1342373
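
Filtering by code point is one common way to do this. The sketch below keeps only characters below 128; `remove_non_ascii_sketch` is a hypothetical name, and the library's function may differ in detail.

```python
def remove_non_ascii_sketch(input_string: str) -> str:
    """Keep only characters in the 7-bit ASCII range."""
    return "".join(char for char in input_string if ord(char) < 128)
```

Note that this drops accented letters entirely rather than replacing them with a base letter.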

cltk.alphabet.text_normalization.remove_non_latin(input_string, also_keep=None)[source]

Remove non-Latin characters. also_keep should be a list which will add chars (e.g. punctuation) that will not be filtered.

cltk.alphabet.text_normalization.split_trailing_punct(text, punctuation=None)[source]

Some tokenizers, including that in Stanza, do not always handle punctuation properly. For example, a trailing colon ("οἶδα:") is not split into an extra punctuation token. This function does such splitting on raw text before being sent to such a tokenizer.

Parameters:
  • text (str) – Input text string.

  • punctuation (Optional[list[str]]) – List of punctuation that should be split when trailing a word.

Return type:

str

Returns:

Text string with trailing punctuation separated by a whitespace character.

>>> raw_text = "κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
>>> split_trailing_punct(text=raw_text)
'κατηγόρων ’, οὐκ οἶδα : ἐγὼ δ᾽ οὖν'
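
The splitting can be sketched with a single regex substitution that inserts a space between a word character and a following punctuation mark. The name `split_trailing_sketch` and the default list of marks (colon and closing curly quotes) are assumptions chosen for this example, not the library's defaults.

```python
import re
from typing import Optional


def split_trailing_sketch(text: str, punctuation: Optional[list[str]] = None) -> str:
    """Insert a space before listed punctuation that directly follows a word."""
    # Hypothetical default: colon, right single quote, right double quote.
    marks = punctuation if punctuation is not None else [":", "\u2019", "\u201d"]
    if not marks:
        return text
    char_class = "".join(re.escape(mark) for mark in marks)
    # (\w) is the last character of the word, the class is the trailing mark.
    return re.sub(rf"(\w)([{char_class}])", r"\1 \2", text)
```

Marks that are not in the list, such as a comma or the elision apostrophe, are left attached to the preceding word.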
cltk.alphabet.text_normalization.split_leading_punct(text, punctuation=None)[source]

Some tokenizers, including that in Stanza, do not always handle punctuation properly. For example, an open curly quote ("‘κατηγόρων’") is not split into an extra punctuation token. This function does such splitting on raw text before being sent to such a tokenizer.

Parameters:
  • text (str) – Input text string.

  • punctuation (Optional[list[str]]) – List of punctuation that should be split when before a word.

Return type:

str

Returns:

Text string with leading punctuation separated by a whitespace character.

>>> raw_text = "‘κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
>>> split_leading_punct(text=raw_text)
'‘ κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν'
cltk.alphabet.text_normalization.remove_odd_punct(text, punctuation=None)[source]

Remove certain characters that downstream processes do not handle well. It would be better to use split_leading_punct() and split_trailing_punct(), however the default models out of Stanza make very strange mistakes when, e.g., "‘" is made its own token.

What to do about the apostrophe following an elision (e.g., "δ᾽")?

>>> raw_text = "‘κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
>>> remove_odd_punct(raw_text)
'κατηγόρων, οὐκ οἶδα ἐγὼ δ᾽ οὖν'
Return type:

str

8.1.1.29. cltk.alphabet.urd module

Urdu alphabet

TODO: Add tests.

8.1.1.30. cltk.alphabet.xlc module

The Lycian alphabet. Sources:

  • <https://www.unicode.org/charts/PDF/U10280.pdf>

8.1.1.31. cltk.alphabet.xld module

The Lydian alphabet. Sources: