8.1.12. cltk.phonology package

The phonology module aims to provide tools that:

  • phonetically/phonologically transcribe words of a given language,

  • syllabify words.

Some languages also have additional tools, for example a word stresser (i.e. a function that identifies which syllable is stressed).

These tasks are interesting in themselves for historical linguists or teachers. They are also essential for higher-level tasks such as prosody analysis.

As with all CLTK modules, the phonology module can be extended and improved if its features do not suit your needs, either because they are insufficient or because they do not follow the rules you want to test (agreement on the phonology of extinct languages is often weak).

8.1.12.1. Subpackages

8.1.12.2. Submodules

8.1.12.3. cltk.phonology.akk module

Functions and classes for Akkadian phonology.

cltk.phonology.akk.get_cv_pattern(word, pprint=False)[source]

Return a patterned string representing the consonants and vowels of the input word.

>>> word = 'iparras'
>>> get_cv_pattern(word)
[('V', 1, 'i'), ('C', 1, 'p'), ('V', 2, 'a'), ('C', 2, 'r'), ('C', 2, 'r'), ('V', 2, 'a'), ('C', 3, 's')]
>>> get_cv_pattern(word, True)
'V₁C₁V₂C₂C₂V₂C₃'
Return type

Union[List[Tuple[str, int, str]], str]
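
The numbering used in the pattern can be reproduced in a few lines of plain Python. The following is only a sketch of the idea, not the CLTK implementation: the function name cv_pattern_sketch and the vowel inventory AKKADIAN_VOWELS are assumptions made for illustration.

```python
# Hypothetical vowel inventory; the real module may use a different one.
AKKADIAN_VOWELS = "aeiuāēīūâêîû"
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def cv_pattern_sketch(word, pprint=False):
    """Label each letter as C or V, numbered by order of first appearance."""
    seen = {"V": [], "C": []}
    pattern = []
    for letter in word:
        kind = "V" if letter in AKKADIAN_VOWELS else "C"
        if letter not in seen[kind]:
            seen[kind].append(letter)
        # The index is the rank of the letter among the distinct vowels
        # (or consonants) seen so far, starting at 1.
        pattern.append((kind, seen[kind].index(letter) + 1, letter))
    if pprint:
        return "".join(f"{kind}{index}".translate(SUBSCRIPTS) for kind, index, _ in pattern)
    return pattern
```

Repeated letters share an index, which is why both occurrences of 'r' and of 'a' in 'iparras' carry the number 2.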

cltk.phonology.akk.syllabify(word)[source]

Split Akkadian words into a list of syllables.

>>> syllabify("napištašunu")
['na', 'piš', 'ta', 'šu', 'nu']

>>> syllabify("epištašu")
['e', 'piš', 'ta', 'šu']
Return type

List[str]

cltk.phonology.akk.find_stress(word)[source]

Find the stressed syllable in a word. The general logic follows Huehnergard, 3rd edition (pp. 3-4).

Syllables are classified by weight:

  • (a) Light: ending in a short vowel, e.g. -a, -ba.

  • (b) Heavy: ending in a long vowel marked with a macron, or in a short vowel plus a consonant, e.g. -ā, -bā, -ak, -bak.

  • (c) Ultraheavy: ending in a long vowel marked with a circumflex, or in any long vowel plus a consonant, e.g. -â, -bâ, -āk, -bāk, -âk, -bâk.

Stress is then assigned as follows:

  • (a) If the last syllable is ultraheavy, it bears the stress.

  • (b) Otherwise, stress falls on the last non-final heavy or ultraheavy syllable.

  • (c) Words that contain no non-final heavy or ultraheavy syllables have the stress fall on the first syllable.

>>> find_stress("napištašunu")
['na', '[piš]', 'ta', 'šu', 'nu']
Return type

List[str]
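
The weight classes and stress rules described for find_stress can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library code: the names syllable_weight and mark_stress and the three vowel sets are hypothetical, and the input is assumed to be already syllabified.

```python
# Hypothetical vowel sets used only for this sketch.
SHORT = set("aeiu")
MACRON = set("āēīū")
CIRCUMFLEX = set("âêîû")
VOWELS = SHORT | MACRON | CIRCUMFLEX


def syllable_weight(syllable):
    """Classify a syllable as 'light', 'heavy' or 'ultraheavy'."""
    closed = syllable[-1] not in VOWELS
    # The nucleus is taken to be the last vowel of the syllable.
    nucleus = next((ch for ch in reversed(syllable) if ch in VOWELS), "")
    if nucleus in CIRCUMFLEX:
        return "ultraheavy"
    if nucleus in MACRON:
        return "ultraheavy" if closed else "heavy"
    return "heavy" if closed else "light"


def mark_stress(syllables):
    """Wrap the stressed syllable in square brackets."""
    weights = [syllable_weight(s) for s in syllables]
    stressed = 0  # rule (c): fall back to the first syllable
    if weights[-1] == "ultraheavy":
        stressed = len(syllables) - 1  # rule (a)
    else:
        # Rule (b): scan the non-final syllables from right to left.
        for i in range(len(syllables) - 2, -1, -1):
            if weights[i] != "light":
                stressed = i
                break
    return [f"[{s}]" if i == stressed else s for i, s in enumerate(syllables)]
```

Under these assumptions, mark_stress(['na', 'piš', 'ta', 'šu', 'nu']) marks 'piš', the only non-final heavy syllable.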

class cltk.phonology.akk.AkkadianSyllabifier[source]

Bases: object

syllabify(word)[source]

8.1.12.4. cltk.phonology.orthophonology module

A module for representing the orthophonology of a language: the mapping from orthographic representations to IPA symbols.

Pre-modern languages are characterized by their non-standardized writing rules. Writers attempt to follow rules that fit morphology (words of the same family tend to have similar spellings) and phonology (words with similar pronunciations are written the same way). As languages evolve, their phonology changes faster than their writing rules. This module aims to unify writing rules with phonological rules by borrowing the representation of sound changes used by historical linguistics.

Based on many ideas in cltk.phonology.non.utils by Clément Besnier <clem@clementbesnier.fr>.

class cltk.phonology.orthophonology.PhonologicalFeature(value)[source]

Bases: cltk.utils.utils.CLTKEnum

An enumeration.

matches(other)[source]
class cltk.phonology.orthophonology.Consonantal(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

neg = 1
pos = 2
class cltk.phonology.orthophonology.Voiced(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

neg = 1
pos = 2
class cltk.phonology.orthophonology.Aspirated(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

neg = 1
pos = 2
class cltk.phonology.orthophonology.Geminate(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

neg = 1
pos = 2
class cltk.phonology.orthophonology.Roundedness(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

neg = 1
pos = 2
class cltk.phonology.orthophonology.Length(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

short = 1
long = 2
overlong = 3
class cltk.phonology.orthophonology.Height(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

close = 1
near_close = 2
close_mid = 3
mid = 4
open_mid = 5
near_open = 6
open = 7
class cltk.phonology.orthophonology.Backness(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

front = 1
central = 2
back = 3
class cltk.phonology.orthophonology.Manner(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

stop = 1
fricative = 2
affricate = 3
nasal = 4
lateral = 5
trill = 6
spirant = 7
approximant = 8
class cltk.phonology.orthophonology.Place(value)[source]

Bases: cltk.phonology.orthophonology.PhonologicalFeature

An enumeration.

bilabial = 1
labio_dental = 2
dental = 3
alveolar = 4
post_alveolar = 5
retroflex = 6
palatal = 7
velar = 8
uvular = 9
glottal = 10
class cltk.phonology.orthophonology.AbstractPhoneme(features=None, ipa=None)[source]

Bases: object

An abstract phoneme is just a bundle of phonological features.

is_vowel()[source]
merge(other)[source]

Returns a copy of this phoneme, with the features of other merged into this feature bundle. Other can be a list of phonemes, in which case the list is returned (for technical reasons). Other may also be a single feature value or a list of feature values.

is_equal(other)[source]

Phonemes are equal if they share the same features. Note that the IPA symbol is not taken into account.

matches(other)[source]

This phoneme matches other if other contains all the features of this phoneme, i.e. if this phoneme has an improper subset of other’s. If other is a disjunctive list, then a match is sought for any of the list. If other is a feature value or list of feature values, it is promoted to a phoneme first.
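
The behaviour of merge, is_equal and matches can be illustrated with a small stand-alone model. This is a sketch only: FeatureBundle is a hypothetical stand-in for AbstractPhoneme, and the three enumerations are cut-down copies of the ones documented above.

```python
from enum import Enum


class Voiced(Enum):
    neg = 1
    pos = 2


class Manner(Enum):
    stop = 1
    fricative = 2


class Place(Enum):
    bilabial = 1
    alveolar = 4


class FeatureBundle:
    """Hypothetical stand-in for AbstractPhoneme: one value per feature type."""

    def __init__(self, *values, ipa=None):
        self.features = {type(value): value for value in values}
        self.ipa = ipa

    def merge(self, *values):
        # Return a copy in which the given feature values override the old ones.
        merged = dict(self.features)
        merged.update({type(value): value for value in values})
        return FeatureBundle(*merged.values(), ipa=self.ipa)

    def is_equal(self, other):
        # Only the features are compared; the IPA symbol is ignored.
        return self.features == other.features

    def matches(self, other):
        # True when every feature of self is also present in other.
        return all(other.features.get(kind) == value for kind, value in self.features.items())


p = FeatureBundle(Place.bilabial, Manner.stop, Voiced.neg, ipa="p")
b = FeatureBundle(Place.bilabial, Manner.stop, Voiced.pos, ipa="b")
any_stop = FeatureBundle(Manner.stop)
```

In this model any_stop.matches(p) holds because p carries every feature of any_stop, while p.matches(any_stop) does not, so matching is not symmetric. An underspecified bundle therefore acts as a pattern over fully specified phonemes.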

cltk.phonology.orthophonology.make_phoneme(*feature_values)[source]

Creates an abstract phoneme made of the feature specifications given in the vararg.

Return type

AbstractPhoneme

cltk.phonology.orthophonology.PositionedPhoneme(phoneme, word_initial=False, word_final=False, syllable_initial=False, syllable_final=False, env_start=False, env_end=False)[source]

A decorator for phonemes, used in applying rules over words. Returns a copy of the input phoneme, with additional attributes, specifying whether the phoneme occurs at a word or syllable boundary, or its position in an environment.

class cltk.phonology.orthophonology.AlwaysMatchingPseudoPhoneme[source]

Bases: cltk.phonology.orthophonology.AbstractPhoneme

A pseudo-phoneme that matches all other phonemes.

matches(other)[source]

This phoneme matches other if other contains all the features of this phoneme, i.e. if this phoneme has an improper subset of other’s. If other is a disjunctive list, then a match is sought for any of the list. If other is a feature value or list of feature values, it is promoted to a phoneme first.

Return type

bool

class cltk.phonology.orthophonology.WordBoundaryPseudoPhoneme[source]

Bases: cltk.phonology.orthophonology.AbstractPhoneme

A pseudo-phoneme that only matches at the start or end of a word.

matches(other)[source]

This phoneme matches other if other contains all the features of this phoneme, i.e. if this phoneme has an improper subset of other’s. If other is a disjunctive list, then a match is sought for any of the list. If other is a feature value or list of feature values, it is promoted to a phoneme first.

Return type

bool

is_equal(other)[source]

Phonemes are equal if they share the same features. Note that the IPA symbol is not taken into account.

Return type

bool

class cltk.phonology.orthophonology.SyllableBoundaryPseudoPhoneme[source]

Bases: cltk.phonology.orthophonology.AbstractPhoneme

A pseudo-phoneme that matches at word boundaries and matches positioned phonemes that are at syllable boundaries.

matches(other)[source]

This phoneme matches other if other contains all the features of this phoneme, i.e. if this phoneme has an improper subset of other’s. If other is a disjunctive list, then a match is sought for any of the list. If other is a feature value or list of feature values, it is promoted to a phoneme first.

Return type

bool

class cltk.phonology.orthophonology.PhonemeDisjunction(*phonemes)[source]

Bases: list

A list of phonemes, with special properties for disjunctive (“or”) matching.

matches(other)[source]

A disjunctive list matches a phoneme if any of its members matches the phoneme. If other is also a disjunctive list, any match between this list and the other returns true.

Return type

bool
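
Disjunctive matching amounts to an "any" over the members of the list. The sketch below models phonemes as frozensets of feature names, which is an assumption made to keep the example short. The class Disjunction is a hypothetical stand-in, not the CLTK class.

```python
class Disjunction(list):
    """Hypothetical stand-in for PhonemeDisjunction; members are feature sets."""

    def matches(self, other):
        others = other if isinstance(other, Disjunction) else [other]
        # A member matches a candidate when all of its features are present in it.
        return any(member <= candidate for member in self for candidate in others)


voiceless_stop = frozenset({"stop", "voiceless"})
nasal = frozenset({"nasal"})
p = frozenset({"bilabial", "stop", "voiceless"})
z = frozenset({"alveolar", "fricative", "voiced"})

stop_or_nasal = Disjunction([voiceless_stop, nasal])
```

Here stop_or_nasal.matches(p) is true, stop_or_nasal.matches(z) is false, and matching against Disjunction([z, p]) is true because one pairing succeeds.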

class cltk.phonology.orthophonology.Consonant(place, manner, voiced, ipa, geminate=neg, aspirated=neg)[source]

Bases: cltk.phonology.orthophonology.AbstractPhoneme

Based on cltk.phonology.utils by @clemsciences. A consonant is a phoneme that is specified for the features listed in the IPA chart for consonants: Place, Manner, Voicing. These may be read directly off the IPA chart, which also gives the IPA symbol. The Consonantal feature is set to positive, and the Aspirated feature defaults to negative. See http://www.ipachart.com/

is_more_sonorous(other)[source]

Compares this phoneme with another for sonority. Used for Sonority Sequencing Principle (SSP) considerations.

Return type

bool

merge(other)[source]

Returns a copy of this phoneme, with the features of other merged into this feature bundle. Other can be a list of phonemes, in which case the list is returned (for technical reasons). Other may also be a single feature value or a list of feature values.

geminate()[source]

Returns a new Consonant with its Geminate pos, and “ː” appended to its IPA symbol.

class cltk.phonology.orthophonology.Vowel(height, backness, rounded, length, ipa)[source]

Bases: cltk.phonology.orthophonology.AbstractPhoneme

The representation of a vowel by its features, as given in the IPA chart for vowels. See http://www.ipachart.com/

lengthen()[source]

Returns a new Vowel with its Length lengthened, and “ː” appended to its IPA symbol.

is_more_sonorous(other)[source]

Compares this phoneme with another for sonority. Used for Sonority Sequencing Principle (SSP) considerations.

Return type

bool

merge(other)[source]

Returns a copy of this phoneme, with the features of other merged into this feature bundle. Other can be a list of phonemes, in which case the list is returned (for technical reasons). Other may also be a single feature value or a list of feature values.

class cltk.phonology.orthophonology.BasePhonologicalRule(condition, action)[source]

Bases: object

Base class for conditional phonological rules. A phonological rule relates an item (a phoneme) to its environment to define a transformation. Specifically, a rule specifies a condition and an action.

  • The condition characterizes the phonological environment of a phoneme in terms of the characteristics of the phoneme before it (if any), and after it (if any). In general it is a function taking three arguments: before, target, after, the phonemes in the environment, and returning a boolean for whether the rule should fire.

  • The action defines a transformation of the target phoneme, e.g. its vocalization. It is a function taking only the target, which returns the replacement phoneme OR a list of phonemes.
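
A rule of this shape can be sketched with phonemes modelled as plain IPA strings. The class Rule, the vowel set and the voicing rule are all hypothetical, and the sketch assumes that the action receives the target phoneme.

```python
class Rule:
    """Hypothetical stand-in for a phonological rule; phonemes are IPA strings."""

    def __init__(self, condition, action):
        self.condition = condition
        self.action = action

    def check_environment(self, phonemes, pos):
        # Neighbours are None at the edges of the word.
        before = phonemes[pos - 1] if pos > 0 else None
        after = phonemes[pos + 1] if pos + 1 < len(phonemes) else None
        return self.condition(before, phonemes[pos], after)

    def perform_action(self, phonemes, pos):
        return self.action(phonemes[pos])


VOWELS = set("aeiou")

# Intervocalic voicing: /s/ becomes /z/ between two vowels.
intervocalic_voicing = Rule(
    condition=lambda before, target, after: target == "s" and before in VOWELS and after in VOWELS,
    action=lambda target: "z",
)

word = list("rosa")
result = [
    intervocalic_voicing.perform_action(word, i) if intervocalic_voicing.check_environment(word, i) else p
    for i, p in enumerate(word)
]
```

Keeping the condition separate from the action means the same environment test can be reused with different transformations.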

perform_action(phonemes, pos)[source]
class cltk.phonology.orthophonology.PhonologicalRule(condition, action)[source]

Bases: cltk.phonology.orthophonology.BasePhonologicalRule

The most general phonological rule can apply anywhere in the word. before and after phonemes may therefore be null when calling the condition.

check_environment(phonemes, pos)[source]
exception cltk.phonology.orthophonology.PhonemeNotFound(phoneme)[source]

Bases: Exception

Exception raised when a search for a phoneme in the inventory fails.

exception cltk.phonology.orthophonology.LetterNotFound(letter)[source]

Bases: Exception

Exception raised when a search for a letter in the alphabet fails.

class cltk.phonology.orthophonology.Orthophonology(sound_inventory, alphabet, diphthongs, digraphs, to_modern={'m': 'm', 'n': 'n', 'n̥': 'ng', 'ŋ': 'ng', 'p': 'p', 'b': 'b', 't': 't', 'd': 'd', 'k': 'k', 'g': 'g', 't͡ʃ': 'ch', 'd͡ʒ': 'ge', 'f': 'f', 'v': 'v', 'θ': 'th', 'ð': 'th', 's': 's', 'z': 'z', 'ʃ': 'sh', 'ç': 'ch', 'x': 'ch', 'y': 'y', 'h': 'h', 'l': 'l', 'l̥': 'l', 'j': 'y', 'w': 'w', 'r': 'r', 'r̥': 'r', 'i': 'i', 'i:': 'ee', 'y:': 'y', 'u': 'u', 'u:': 'oo', 'e': 'e', 'e:': 'ee', 'ø': 'e', 'ø:': 'ee', 'o': 'o', 'o:': 'oo', 'æ': 'a', 'æ:': 'aa', 'ɑ': 'o', 'ɑ:': 'oo', 'æɑ': 'ao', 'æ:ɑ': 'ao', 'eo': 'eo', 'e:o': 'eeo', 'iu': 'iu', 'i:u': 'iiu'}, ['(^|(?<= ))hw', 'wh', 'oo(.)(^|(?= ))', 'o\\1e'])[source]

Bases: object

The ortho-phonology of a language is described by:

  • The inventory of all the phonemes of the language.

  • A mapping of orthographic symbols to phonemes.

  • mappings of orthographic symbols pairs to:

    • diphthongs

    • phonemes (i.e. digraphs)

  • phonological rules for the contextual transformation of phonological representations.

The class is very clearly aimed at alphabetic orthographies. Its usefulness for e.g. pictographic orthographies is questionable.

add_rule(rule)[source]

Adds a rule to the orthophonology. The order in which rules are added is critical, since the first rule that matches fires.

is_syllable_initial(phonemes, pos)[source]
Return type

bool

is_syllable_final(phonemes, pos)[source]
Return type

bool

_position_phonemes(phonemes)[source]

Mark syllable boundaries, and, in future, other positional/suprasegmental features?

transcribe_word(word)[source]

The heart of the transcription process. Similar to the system in cltk.phonology.utils, the algorithm:

  1) Applies digraphs and diphthongs to the text of the word.

  2) Carries out a naive (“greedy”, per @clemsciences) substitution of letters to phonemes, according to the alphabet.

  3) Applies the conditions of the rules to the environment of each phoneme in turn. The first rule matched fires. There is no restart and later rules are not tested. Also, if a rule returns multiple phonemes, these are never re-tested by the rule set.
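
The three steps of transcribe_word can be sketched on a toy orthography. Everything in this example is hypothetical (the tables DIGRAPHS and ALPHABET, the two rules, and the function name). It only illustrates the control flow described above, with phonemes modelled as IPA strings.

```python
# Hypothetical toy orthography; none of these tables come from CLTK.
DIGRAPHS = {"th": "θ", "sc": "ʃ"}
ALPHABET = {"a": "ɑ", "e": "e", "i": "i", "o": "o", "u": "u",
            "f": "f", "s": "s", "r": "r", "p": "p"}
VOWELS = set("ɑeiou")

# Each rule is a pair (condition(before, target, after), action(target)).
RULES = [
    (lambda b, t, a: t == "f" and b in VOWELS and a in VOWELS, lambda t: "v"),
    (lambda b, t, a: t == "s" and b in VOWELS and a in VOWELS, lambda t: "z"),
]


def transcribe_word_sketch(word):
    # Steps 1 and 2: consume digraphs first, then fall back to single letters.
    phonemes, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            phonemes.append(DIGRAPHS[word[i:i + 2]])
            i += 2
        else:
            phonemes.append(ALPHABET[word[i]])
            i += 1
    # Step 3: for each phoneme, the first rule whose condition holds fires.
    output = []
    for pos, target in enumerate(phonemes):
        before = phonemes[pos - 1] if pos > 0 else None
        after = phonemes[pos + 1] if pos + 1 < len(phonemes) else None
        for condition, action in RULES:
            if condition(before, target, after):
                target = action(target)
                break  # later rules are not tested for this phoneme
        output.append(target)
    return "".join(output)
```

Conditions are evaluated against the untransformed phoneme list, so the output of one rule never feeds another rule in this sketch.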

transcribe(text, as_phonemes=False)[source]

Transcribes a text, which is first tokenized for words, then each word is transcribed. If as_phonemes is true, returns a list of lists of phoneme objects, else returns a string concatenation of the IPA symbols of the phonemes.

Return type

Union[str, list]

transcribe_to_modern(text)[source]

A very first attempt at transcribing from IPA to some modern orthography. The method is intended to provide the student with clues to the pronunciation of old orthographies.

Return type

str

voice(consonant)[source]

Voices a consonant, by searching the sound inventory for a consonant having the same features as the argument, but +voice.

Return type

Consonant
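
Searching the inventory for a voiced counterpart can be sketched as below. ToyConsonant, INVENTORY and voice_sketch are hypothetical names, features are modelled as plain strings and booleans, and LookupError stands in for whatever exception the library raises when no match exists.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyConsonant:
    """Hypothetical stand-in for Consonant with simplified features."""
    place: str
    manner: str
    voiced: bool
    ipa: str


INVENTORY = [
    ToyConsonant("bilabial", "stop", False, "p"),
    ToyConsonant("bilabial", "stop", True, "b"),
    ToyConsonant("alveolar", "fricative", False, "s"),
    ToyConsonant("alveolar", "fricative", True, "z"),
]


def voice_sketch(consonant, inventory=INVENTORY):
    """Return the inventory member with the same features but voiced."""
    wanted = replace(consonant, voiced=True)
    for candidate in inventory:
        # The IPA symbol is ignored; only the features are compared.
        if (candidate.place, candidate.manner, candidate.voiced) == (
            wanted.place, wanted.manner, wanted.voiced
        ):
            return candidate
    raise LookupError(f"no voiced counterpart of {consonant.ipa!r} in the inventory")
```

Looking the result up in the inventory, rather than constructing a new object, guarantees that the returned consonant carries the IPA symbol the language actually uses.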

aspirate(consonant)[source]

Aspirates a consonant, by searching the sound inventory for a consonant having the same features as the argument, but +aspirated.

Return type

Consonant

geminate(consonant)[source]
Parameters

consonant (Consonant) –

Return type

Consonant

static lengthen(vowel)[source]

Returns a lengthened copy of the vowel argument.

Return type

Vowel

8.1.12.5. cltk.phonology.processes module

Processes for phonology.

8.1.12.6. cltk.phonology.syllabifier_processes module

This module implements syllabification processes for several languages. You may extend SyllabificationProcess; see the predefined processes below for examples.

class cltk.phonology.syllabifier_processes.SyllabificationProcess(language: str = None)[source]

Bases: cltk.core.data_types.Process

This is the class to extend if you want to code your own syllabification process in the CLTK-style.

run(input_doc)[source]
Return type

Doc

class cltk.phonology.syllabifier_processes.GreekSyllabificationProcess(language: str = None)[source]

Bases: cltk.phonology.syllabifier_processes.SyllabificationProcess

Syllabification Process for Ancient Greek.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import GreekTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk import NLP
>>> a_pipeline = Pipeline(description="A custom Greek pipeline", processes=[GreekTokenizationProcess, DefaultPunctuationRemovalProcess, GreekSyllabificationProcess], language=get_lang("grc"))
>>> nlp = NLP(language='grc', custom_pipeline=a_pipeline, suppress_banner=True)
>>> text = get_example_text("grc")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['ὅτι'], ['μὲν'], ['ὑμ', 'εῖς'], ['ὦ'], ['ἄν', 'δρ', 'ες']]
description = 'The default Latin Syllabification process'
algorithm
class cltk.phonology.syllabifier_processes.LatinSyllabificationProcess(language: str = None)[source]

Bases: cltk.phonology.syllabifier_processes.SyllabificationProcess

Syllabification Process for Latin.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import LatinTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk import NLP
>>> a_pipeline = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess, DefaultPunctuationRemovalProcess, LatinSyllabificationProcess], language=get_lang("lat"))
>>> nlp = NLP(language='lat', custom_pipeline=a_pipeline, suppress_banner=True)
>>> text = get_example_text("lat")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['gal', 'li', 'a'], ['est'], ['om', 'nis'], ['di', 'vi', 'sa'], ['in']]
description = 'The default Latin Syllabification process'
algorithm
class cltk.phonology.syllabifier_processes.MiddleEnglishSyllabificationProcess(language: str = None)[source]

Bases: cltk.phonology.syllabifier_processes.SyllabificationProcess

Syllabification Process for Middle English.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleEnglishTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Middle English pipeline",     processes=[MiddleEnglishTokenizationProcess, DefaultPunctuationRemovalProcess, MiddleEnglishSyllabificationProcess],     language=get_lang("enm"))
>>> nlp = NLP(language='enm', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("enm").replace('\n', ' ')
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['whi', 'lom'], ['as'], ['ol', 'de'], ['sto', 'ries'], ['tellen']]
description = 'The default Middle English Syllabification process'
algorithm
class cltk.phonology.syllabifier_processes.MiddleHighGermanSyllabificationProcess(language: str = None)[source]

Bases: cltk.phonology.syllabifier_processes.SyllabificationProcess

Syllabification Process for Middle High German.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleHighGermanTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Middle High German pipeline",     processes=[MiddleHighGermanTokenizationProcess, DefaultPunctuationRemovalProcess,     MiddleHighGermanSyllabificationProcess], language=get_lang("gmh"))
>>> nlp = NLP(language='gmh', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("gmh")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['uns'], ['ist'], ['in'], ['al', 'ten'], ['mæ', 'ren']]
description = 'The default Middle High German syllabification process'
algorithm
class cltk.phonology.syllabifier_processes.OldEnglishSyllabificationProcess(language: str = None)[source]

Bases: cltk.phonology.syllabifier_processes.SyllabificationProcess

Syllabification Process for Old English.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleEnglishTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old English pipeline",     processes=[MiddleEnglishTokenizationProcess, DefaultPunctuationRemovalProcess, OldEnglishSyllabificationProcess],     language=get_lang("ang"))
>>> nlp = NLP(language='ang', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("ang")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['hwæt'], ['we'], ['gar', 'den', 'a'], ['in'], ['gear', 'da', 'gum']]
description = 'The default Old English syllabification process'
algorithm
class cltk.phonology.syllabifier_processes.OldNorseSyllabificationProcess(language: str = None)[source]

Bases: cltk.phonology.syllabifier_processes.SyllabificationProcess

Syllabification Process for Old Norse.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import OldNorseTokenizationProcess
>>> from cltk.text.processes import OldNorsePunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old Norse pipeline",     processes=[OldNorseTokenizationProcess, OldNorsePunctuationRemovalProcess, OldNorseSyllabificationProcess],     language=get_lang("non"))
>>> nlp = NLP(language='non', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("non")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['gyl', 'fi'], ['ko', 'nungr'], ['réð'], ['þar'], ['lön', 'dum']]
description = 'The default Old Norse syllabification process'
algorithm

8.1.12.7. cltk.phonology.syllabify module

The syllabify module implements two main classes:

  • Syllabifier

  • Syllable

Syllabifier implements two general syllabification algorithms:

  • the Maximum Onset Principle,

  • the Sonority Sequencing Principle.

They are both based on phonetic principles.
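
As a rough illustration of sonority-based syllabification, the sketch below places each syllable boundary before the least sonorous consonant between two vowel groups. It is a simplified reading of the principle, not the algorithm implemented by Syllabifier. The function name and the hierarchy are assumptions, and every letter of the word must appear in the hierarchy.

```python
def syllabify_ssp_sketch(word, hierarchy, vowels):
    """Split a word before the least sonorous consonant between two nuclei.

    hierarchy lists groups of letters from most to least sonorous.
    """
    sonority = {ch: len(hierarchy) - rank for rank, group in enumerate(hierarchy) for ch in group}
    # Locate nuclei as maximal runs of vowels, stored as (start, end) pairs.
    nuclei, i = [], 0
    while i < len(word):
        if word[i] in vowels:
            j = i
            while j < len(word) and word[j] in vowels:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    # Between two nuclei, break before the sonority minimum of the cluster.
    # Ties go to the later consonant, so geminates are split in the middle.
    boundaries = []
    for (_, end), (start, _) in zip(nuclei, nuclei[1:]):
        cluster = range(end, start)
        boundaries.append(min(cluster, key=lambda k: (sonority[word[k]], -k)))
    pieces, previous = [], 0
    for boundary in boundaries:
        pieces.append(word[previous:boundary])
        previous = boundary
    pieces.append(word[previous:])
    return pieces
```

With the hypothetical hierarchy [["a", "e", "i", "u"], ["r"], ["m", "n"], ["f"]] and vowels "aeiu", the word 'feminarum' is split into 'fe', 'mi', 'na', 'rum', and 'infra' into 'in', 'fra'.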

The Syllable class provides a way to linguistically represent a syllable.

cltk.phonology.syllabify.get_onsets(text, vowels='aeiou', threshold=0.0002)[source]

Source: Resonances in Middle High German: New Methodologies in Prosody, 2017, C. L. Hench

Parameters
  • text – str list: text to be analysed

  • vowels – str: valid vowels constituting the syllable

  • threshold – minimum frequency for a valid onset. C. Hench noted that the algorithm produces the best result for an untagged word set of MHG when retaining onsets that appear in at least 0.02% of the words

Let’s test it on the opening lines of the Nibelungenlied:

>>> text = ['uns', 'ist', 'in', 'alten', 'mæren', 'wunders', 'vil', 'geseit', 'von', 'helden', 'lobebæren', 'von', 'grôzer', 'arebeit', 'von', 'fröuden', 'hôchgezîten', 'von', 'weinen', 'und', 'von', 'klagen', 'von', 'küener', 'recken', 'strîten', 'muget', 'ir', 'nu', 'wunder', 'hœren', 'sagen']
>>> vowels = "aeiouæœôîöü"
>>> get_onsets(text, vowels=vowels)
['lt', 'm', 'r', 'w', 'nd', 'v', 'g', 's', 'h', 'ld', 'l', 'b', 'gr', 'z', 'fr', 'd', 'chg', 't', 'n', 'kl', 'k', 'ck', 'str']

Of course, this is an insignificant sample, but we can see how modifying the threshold affects the returned onsets:

>>> get_onsets(text, threshold = 0.05, vowels=vowels)
['m', 'r', 'w', 'nd', 'v', 'g', 's', 'h', 'b', 'z', 't', 'n']
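
The idea of keeping only frequent onsets can be sketched with a regular expression and a counter. This is one plausible reading of the description, not the CLTK code. The function name is hypothetical, and the sketch assumes that the threshold is a fraction of the words and that each cluster is counted once per word.

```python
import re
from collections import Counter


def get_onsets_sketch(words, vowels="aeiou", threshold=0.0002):
    """Keep consonant clusters that precede a vowel in enough of the words."""
    vowel_class = re.escape(vowels)
    cluster = re.compile(f"[^{vowel_class}]+(?=[{vowel_class}])")
    counts = Counter()
    for word in words:
        # Count each cluster at most once per word.
        counts.update(set(cluster.findall(word)))
    # Sorted alphabetically here; the library may order its output differently.
    return sorted(onset for onset, n in counts.items() if n / len(words) >= threshold)
```

Raising the threshold removes rare clusters first, which is the effect shown in the second doctest above.
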
class cltk.phonology.syllabify.Syllabifier(low_vowels=None, mid_vowels=None, high_vowels=None, flaps=None, laterals=None, nasals=None, fricatives=None, plosives=None, language=None, break_geminants=False, variant=None, sep=None)[source]

Bases: object

Provides two main methods that syllabify words according to the phonology of their language.

set_invalid_onsets(invalid_onsets)[source]
set_invalid_ultima(invalid_ultima)[source]
set_hierarchy(hierarchy)[source]

Sets an alternative sonority hierarchy. Note that you also need to specify the vowel set with set_vowels so that the module can correctly identify each nucleus.

The phonemes are listed in order of decreasing consonantality.

>>> s = Syllabifier()
>>> s.set_hierarchy([['i', 'u'], ['e'], ['a'], ['r'], ['m', 'n'], ['f']])
>>> s.set_vowels(['i', 'u', 'e', 'a'])
>>> s.syllabify('feminarum')
['fe', 'mi', 'na', 'rum']
set_vowels(vowels)[source]

Define the vowel set of the syllabifier module

>>> s = Syllabifier()
>>> s.set_vowels(['i', 'u', 'e', 'a'])
>>> s.vowels
['i', 'u', 'e', 'a']
syllabify(word, mode='SSP')[source]
Parameters
  • word (str) – word to syllabify

  • mode – syllabification algorithm: SSP (Sonority Sequencing Principle) or MOP (Maximum Onset Principle)

Return type

Union[List[str], str]

Returns

syllabified word

syllabify_ssp(word)[source]

Syllabifies a word according to the Sonority Sequencing Principle

Parameters

word (str) – Word to be syllabified

Return type

List[str]

Returns

List consisting of syllables

First, define the manners of articulation:

>>> high_vowels = ['a']
>>> mid_vowels = ['e']
>>> low_vowels = ['i', 'u']
>>> flaps = ['r']
>>> nasals = ['m', 'n']
>>> fricatives = ['f']
>>> s = Syllabifier(high_vowels=high_vowels, mid_vowels=mid_vowels, low_vowels=low_vowels, flaps=flaps, nasals=nasals, fricatives=fricatives)
>>> s.syllabify("feminarum")
['fe', 'mi', 'na', 'rum']

Not specifying your alphabet results in an error:

>>> s.syllabify("foemina")
Traceback (most recent call last):
    ...
cltk.core.exceptions.CLTKException

Additionally, you can use the language parameter:

>>> s = Syllabifier(language='gmh')
>>> s.syllabify('lobebæren')
['lo', 'be', 'bæ', 'ren']
>>> s = Syllabifier(language='enm')
>>> s.syllabify("huntyng")
['hun', 'tyng']
>>> s = Syllabifier(language='ang')
>>> s.syllabify("arcebiscop")
['ar', 'ce', 'bis', 'cop']

The break_geminants parameter ensures a breakpoint is placed between geminants:

>>> geminant_s = Syllabifier(break_geminants=True)
>>> hierarchy = [["a", "á", "æ", "e", "é", "i", "í", "o", "ǫ", "ø", "ö", "œ", "ó", "u", "ú", "y", "ý"], ["j"], ["m"], ["n"], ["p", "b", "d", "g", "t", "k"], ["c", "f", "s", "h", "v", "x", "þ", "ð"], ["r"], ["l"]]
>>> geminant_s.set_hierarchy(hierarchy)
>>> geminant_s.set_vowels(hierarchy[0])
>>> geminant_s.syllabify("ennitungl")
['en', 'ni', 'tungl']

onset_maximization(syllables)[source]

Applies the onset maximisation principle to syllables.

Parameters

syllables (List[str]) – list of syllables

Return type

List[str]

legal_onsets(syllables)[source]

Filters syllables according to the legality principle.

Parameters

syllables (List[str]) – list of syllables

The method scans for invalid syllable onsets:

>>> s = Syllabifier(["i", "u", "y"], ["o", "ø", "e"], ["a"], ["r"], ["l"], ["m", "n"], ["f", "v", "s", "h"], ["k", "g", "b", "p", "t", "d"])
>>> s.set_invalid_onsets(['lm'])
>>> s.legal_onsets(['a', 'lma', 'tigr'])
['al', 'ma', 'tigr']

You can also define invalid syllable ultima:

>>> s.set_invalid_ultima(['gr'])
>>> s.legal_onsets(['al', 'ma', 'ti', 'gr'])
['al', 'ma', 'tigr']
Return type

List[str]
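
The onset repair shown in the first example can be sketched as follows. This covers only invalid onsets, not invalid ultima, and is not the library implementation. The function name is hypothetical.

```python
def repair_onsets_sketch(syllables, invalid_onsets):
    """Move the first letter of an invalid onset to the end of the previous syllable."""
    repaired = list(syllables)
    for i in range(1, len(repaired)):
        if any(repaired[i].startswith(onset) for onset in invalid_onsets):
            # Shift one letter leftwards across the syllable boundary.
            repaired[i - 1] += repaired[i][0]
            repaired[i] = repaired[i][1:]
    return repaired
```

Called with ['a', 'lma', 'tigr'] and the invalid onset 'lm', the sketch returns ['al', 'ma', 'tigr'], in line with the first doctest above.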

syllabify_mop(word)[source]
>>> from cltk.phonology.gmh.syllabifier import DIPHTHONGS, TRIPHTHONGS, SHORT_VOWELS, LONG_VOWELS, CONSONANTS
>>> gmh_syllabifier = Syllabifier()
>>> gmh_syllabifier.set_short_vowels(SHORT_VOWELS)
>>> gmh_syllabifier.set_vowels(SHORT_VOWELS+LONG_VOWELS)
>>> gmh_syllabifier.set_diphthongs(DIPHTHONGS)
>>> gmh_syllabifier.set_triphthongs(TRIPHTHONGS)
>>> gmh_syllabifier.set_consonants(CONSONANTS)
>>> gmh_syllabifier.syllabify_mop('entslâfen')
['ent', 'slâ', 'fen']
>>> gmh_syllabifier.syllabify_mop('fröude')
['fröu', 'de']
>>> gmh_syllabifier.syllabify_mop('füerest')
['füe', 'rest']
>>> from cltk.phonology.enm.syllabifier import DIPHTHONGS, TRIPHTHONGS, SHORT_VOWELS, LONG_VOWELS
>>> enm_syllabifier = Syllabifier()
>>> enm_syllabifier.set_short_vowels(SHORT_VOWELS)
>>> enm_syllabifier.set_vowels(SHORT_VOWELS+LONG_VOWELS)
>>> enm_syllabifier.set_diphthongs(DIPHTHONGS)
>>> enm_syllabifier.set_triphthongs(TRIPHTHONGS)
>>> enm_syllabifier.syllabify_mop('heldis')
['hel', 'dis']
>>> enm_syllabifier.syllabify_mop('greef')
['greef']

Once you syllabify the word, the result will be saved as a class variable

>>> enm_syllabifier.syllabify_mop('commaundyd')
['com', 'mau', 'ndyd']
Parameters

word (str) – word to syllabify

Return type

List[str]

Returns

syllabified word

set_short_vowels(short_vowels)[source]
set_diphthongs(diphthongs)[source]
set_triphthongs(triphthongs)[source]
set_consonants(consonants)[source]
syllabify_ipa(word)[source]

Parses IPA string

Parameters

word (str) – word to be syllabified

Return type

List[str]

syllabify_phonemes(phonological_word)[source]

Syllabifies a word given as a list of phonemes.

Parameters

phonological_word (List[Union[Vowel, Consonant]]) – result of Transcriber().text_to_phonemes in cltk.phonology.non.utils

Return type

List[List[Union[Vowel, Consonant]]]

class cltk.phonology.syllabify.Syllable(text, vowels, consonants)[source]

Bases: object

A syllable has three main constituents:

  • onset

  • nucleus

  • coda

Source: https://en.wikipedia.org/wiki/Syllable

_compute_syllable(text)[source]
>>> sylla1 = Syllable("armr", ["a"], ["r", "m"])
>>> sylla1.onset
[]
>>> sylla1.nucleus
['a']
>>> sylla1.coda
['r', 'm', 'r']
>>> sylla2 = Syllable("gangr", ["a"], ["g", "n", "r"])
>>> sylla2.onset
['g']
>>> sylla2.nucleus
['a']
>>> sylla2.coda
['n', 'g', 'r']
>>> sylla3 = Syllable("aurr", ["a", "u"], ["r"])
>>> sylla3.nucleus
['a', 'u']
>>> sylla3.coda
['r', 'r']
Parameters

text – a syllable
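
The decomposition into onset, nucleus and coda can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that the nucleus is the first unbroken run of vowels in the syllable.

```python
def split_syllable_sketch(text, vowels):
    """Return (onset, nucleus, coda) as lists of letters."""
    # The onset is everything before the first vowel.
    start = next((i for i, ch in enumerate(text) if ch in vowels), len(text))
    # The nucleus is the run of vowels starting there; the coda is the rest.
    end = start
    while end < len(text) and text[end] in vowels:
        end += 1
    return list(text[:start]), list(text[start:end]), list(text[end:])
```

For "gangr" with the vowel list ['a'], this gives the onset ['g'], the nucleus ['a'] and the coda ['n', 'g', 'r'], as in the doctest above.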

8.1.12.8. cltk.phonology.transcription_processes module

This module provides phonological/phonetic transcribers for several languages. PhonologicalTranscriptionProcess is the parent-class for all other custom transcription processes.

class cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess(language: str = None)[source]

Bases: cltk.core.data_types.Process

General phonological transcription Process.

run(input_doc)[source]
Return type

Doc

class cltk.phonology.transcription_processes.GothicPhonologicalTranscriberProcess(language: str = None)[source]

Bases: cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess

Phonological transcription Process for Gothic.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import OldNorseTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Gothic pipeline",     processes=[OldNorseTokenizationProcess, DefaultPunctuationRemovalProcess,     GothicPhonologicalTranscriberProcess], language=get_lang("got"))
>>> nlp = NLP(language='got', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("got")
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['swa', 'liuhtjɛ', 'liuhaθ', 'jzwar', 'jn']
description = 'The default Gothic transcription process'
algorithm
class cltk.phonology.transcription_processes.GreekPhonologicalTranscriberProcess(language: str = None)[source]

Bases: cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess

Phonological transcription Process for Ancient Greek.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import GreekTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Greek pipeline", processes=[GreekTokenizationProcess, DefaultPunctuationRemovalProcess, GreekPhonologicalTranscriberProcess], language=get_lang("grc"))
>>> nlp = NLP(language='grc', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("grc")
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['hó.ti', 'men', 'hy.mệːs', 'ɔ̂ː', 'ɑ́n.dres']
description = 'The default Greek transcription process'
algorithm
class cltk.phonology.transcription_processes.LatinPhonologicalTranscriberProcess(language: str = None)[source]

Bases: cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess

Phonological transcription Process for Latin.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import LatinTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk import NLP
>>> a_pipeline = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess, DefaultPunctuationRemovalProcess, LatinPhonologicalTranscriberProcess], language=get_lang("lat"))
>>> nlp = NLP(language="lat", custom_pipeline=a_pipeline, suppress_banner=True)
>>> text = get_example_text("lat")
>>> cltk_doc = nlp.analyze(text)
>>> [word.phonetic_transcription for word in cltk_doc.words][:5]
['[gaɫlɪ̣ja]', '[ɛst̪]', '[ɔmn̪ɪs]', '[d̪ɪwɪsa]', '[ɪn̪]']
description = 'The default Latin transcription process'
algorithm
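The Latin example above returns each transcription wrapped in square brackets, unlike the examples for the other languages. If bare strings are needed, for instance to compare output across languages, a small helper can strip them. This is a plain-Python sketch; strip_brackets is a hypothetical name and not part of CLTK.

```python
def strip_brackets(transcription: str) -> str:
    """Remove one leading '[' and one trailing ']' if both are present."""
    if transcription.startswith("[") and transcription.endswith("]"):
        return transcription[1:-1]
    # Leave strings without a full bracket pair unchanged.
    return transcription


# Output copied from the Latin doctest above.
latin_output = ['[gaɫlɪ̣ja]', '[ɛst̪]', '[ɔmn̪ɪs]', '[d̪ɪwɪsa]', '[ɪn̪]']
bare = [strip_brackets(t) for t in latin_output]
```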
class cltk.phonology.transcription_processes.MiddleHighGermanPhonologicalTranscriberProcess(language: str = None)[source]

Bases: cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess

Phonological transcription Process for Middle High German.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleHighGermanTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Middle High German pipeline", processes=[MiddleHighGermanTokenizationProcess, DefaultPunctuationRemovalProcess, MiddleHighGermanPhonologicalTranscriberProcess], language=get_lang("gmh"))
>>> nlp = NLP(language='gmh', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("gmh")
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['ʊns', 'ɪst', 'ɪn', 'alten', 'mɛren']

description = 'The default Middle High German transcription process'
algorithm
class cltk.phonology.transcription_processes.OldEnglishPhonologicalTranscriberProcess(language: str = None)[source]

Bases: cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess

Phonological transcription Process for Old English.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleEnglishTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old English pipeline", processes=[MiddleEnglishTokenizationProcess, DefaultPunctuationRemovalProcess, OldEnglishPhonologicalTranscriberProcess], language=get_lang("ang"))
>>> nlp = NLP(language='ang', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("ang")
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['ʍæt', 'we', 'gɑrˠdenɑ', 'in', 'gæːɑrˠdɑgum']

description = 'The default Old English transcription process'
algorithm
class cltk.phonology.transcription_processes.OldNorsePhonologicalTranscriberProcess(language: str = None)[source]

Bases: cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess

Phonological transcription Process for Old Norse.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import OldNorseTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old Norse pipeline", processes=[OldNorseTokenizationProcess, DefaultPunctuationRemovalProcess, OldNorsePhonologicalTranscriberProcess], language=get_lang("non"))
>>> nlp = NLP(language='non', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("non")
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['gylvi', 'kɔnunɣr', 'reːð', 'θar', 'lœndum']
description = 'The default Old Norse poetry process'
algorithm
class cltk.phonology.transcription_processes.OldSwedishPhonologicalTranscriberProcess(language: str = None)[source]

Bases: cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess

Phonological transcription Process for Old Swedish.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import OldNorseTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old Swedish pipeline", processes=[OldNorseTokenizationProcess, DefaultPunctuationRemovalProcess, OldSwedishPhonologicalTranscriberProcess], language=get_lang("non"))
>>> nlp = NLP(language='non', custom_pipeline=pipe, suppress_banner=True)
>>> text = "Far man kunu oc dör han för en hun far barn. oc sigher hun oc hænnæ frændær."
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['far', 'man', 'kunu', 'ok', 'dør']
description = 'The default Old Swedish transcription process'
algorithm