8.1.7. cltk.languages package

Init for cltk.languages.

8.1.7.1. Submodules

8.1.7.2. cltk.languages.example_texts module

Example paragraphs of text to be reused within the codebase for testing or demonstrating code.

TODO: Get longer Akkadian text

>>> from cltk.languages.example_texts import get_example_text
>>> get_example_text("grc")[:66]
'ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ τῶν ἐμῶν κατηγόρων'
>>> get_example_text("lat")[:67]
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae'
>>> get_example_text("non")[:50]
'Gylfi konungr réð þar löndum er nú heitir Svíþjóð.'
cltk.languages.example_texts.get_example_text(iso_code)[source]

Take an ISO 639-3 language code and return the example text for that language.

>>> from cltk.languages.example_texts import get_example_text
>>> get_example_text("got")[:25]
'swa liuhtjai liuhaþ izwar'
>>> get_example_text("zkz")
Traceback (most recent call last):
  ...
cltk.core.exceptions.UnimplementedAlgorithmError: Example text unavailable for ISO 639-3 code 'zkz'.
>>> get_example_text("xxx")
Traceback (most recent call last):
  ...
cltk.core.exceptions.UnknownLanguageError: Unknown ISO language code 'xxx'.
Return type

str
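
The doctests above show two distinct failure modes: a code that is not recognized at all, and a code that is recognized but has no example text. The following is a minimal sketch of that distinction, not the library's implementation. The exception classes are local stand-ins for those in cltk.core.exceptions, and KNOWN_CODES, EXAMPLE_TEXTS, and lookup_example_text are hypothetical names with a tiny hand-made data set.

```python
# Local stand-ins; cltk defines its own versions in cltk.core.exceptions.
class UnknownLanguageError(Exception):
    """The ISO 639-3 code is not recognized at all."""


class UnimplementedAlgorithmError(Exception):
    """The code is recognized, but no example text is stored for it."""


# Hypothetical miniature data set, for illustration only.
KNOWN_CODES = {"lat", "zkz"}
EXAMPLE_TEXTS = {"lat": "Gallia est omnis divisa in partes tres"}


def lookup_example_text(iso_code: str) -> str:
    """Return the example text for a code, or raise one of the two errors."""
    if iso_code not in KNOWN_CODES:
        raise UnknownLanguageError(f"Unknown ISO language code '{iso_code}'.")
    if iso_code not in EXAMPLE_TEXTS:
        raise UnimplementedAlgorithmError(
            f"Example text unavailable for ISO 639-3 code '{iso_code}'."
        )
    return EXAMPLE_TEXTS[iso_code]


def describe(iso_code: str) -> str:
    """Show how a caller might treat the two errors differently."""
    try:
        return lookup_example_text(iso_code)[:10]
    except UnimplementedAlgorithmError:
        return "known language, no text yet"
    except UnknownLanguageError:
        return "not a known language code"
```

Checking the unknown-code case first means a caller can tell a typo in the code apart from a gap in the stored texts.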

8.1.7.3. cltk.languages.glottolog module

Module for mapping ISO 639-3 to Glottolog languages and language names. The key is the ISO code and the value, being a Language object, contains information from both the Glottolog and ISO data sets. The contents of this module were generated by scripts/make_glottolog_languages.py.

ISO 639-3 is an international standard for language codes that aims to cover all known natural languages. The extended language coverage was based primarily on the language codes published by SIL International, which is now the registration authority for ISO 639-3. About: https://iso639-3.sil.org/.

Glottolog is a project run by the Max Planck Institute for the Science of Human History. The website contains codes for languages as well as reconstructions of language families. About: http://glottolog.org/. Data of Glottolog 4.0 is published under the following license: https://creativecommons.org/licenses/by/4.0/.

Haspelmath, Martin & Forkel, Robert & Hammarström, Harald. 2019. Glottolog 4.0. Jena: Max Planck Institute for the Science of Human History. (Available online at http://glottolog.org, Accessed on 2019-10-02.)

>>> from cltk.languages.utils import get_lang
>>> akkadian = get_lang("akk")
>>> akkadian
Language(name='Akkadian', glottolog_id='akka1240', latitude=33.1, longitude=44.1, dates=[], family_id='afro1255', parent_id='east2678', level='language', iso_639_3_code='akk', type='a')
>>> akkadian.name
'Akkadian'
>>> akkadian.glottolog_id
'akka1240'
>>> akkadian.latitude
33.1
>>> akkadian.longitude
44.1
>>> akkadian.family_id
'afro1255'
>>> akkadian.parent_id
'east2678'
>>> from cltk.languages.glottolog import LANGUAGES
>>> len(LANGUAGES)
219
cltk.languages.glottolog._resort_languages_list(languages_list)[source]

Take the LANGUAGES global and return it alphabetized by each language’s common name.

>>> from cltk.languages.glottolog import LANGUAGES, _resort_languages_list
>>> iso_dict_keys = _resort_languages_list(LANGUAGES)
>>> list(iso_dict_keys)[:10]
['xae', 'xag', 'akk', 'xln', 'grc', 'hbo', 'xlg', 'xmk', 'xna', 'xzp']
Return type

OrderedDict[str, Language]
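
As an illustration of the idea, the sketch below sorts a mapping of ISO codes to language records by the record's name and returns an OrderedDict. It is an assumption about the general technique, not the function's actual source. Lang and resort_by_name are hypothetical names, and Lang carries only two of the fields that the real Language object has.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lang:
    """Hypothetical, reduced stand-in for cltk's Language object."""

    name: str
    iso_639_3_code: str


def resort_by_name(languages: Dict[str, Lang]) -> "OrderedDict[str, Lang]":
    """Return the same key/value pairs, ordered by each value's name."""
    return OrderedDict(sorted(languages.items(), key=lambda item: item[1].name))


# Keys are ISO 639-3 codes, so key order and name order can differ.
languages = {
    "lat": Lang("Latin", "lat"),
    "akk": Lang("Akkadian", "akk"),
    "grc": Lang("Ancient Greek", "grc"),
}
ordered = resort_by_name(languages)
print(list(ordered))  # ['akk', 'grc', 'lat']
```

Because the sort key is the name rather than the ISO code, the resulting key order need not be alphabetical by code.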

8.1.7.4. cltk.languages.pipelines module

Default processing pipelines for languages. The purpose of these dataclasses is to represent:

  1. the types of NLP processes that the CLTK can do

  2. the order in which processes are to be executed

  3. the downstream features that a particular implemented process requires
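
The three points above can be sketched with plain dataclasses. This is a simplified illustration under stated assumptions, not the CLTK's own data model: Process, Tokenize, Lemmatize, SketchPipeline, and check_order are hypothetical names, and the requires/provides attributes are invented here to make point 3 concrete.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Type


class Process:
    """Hypothetical stand-in for one NLP processing step."""

    requires: Tuple[str, ...] = ()  # features that must already exist
    provides: str = ""  # feature this step adds


class Tokenize(Process):
    provides = "tokens"


class Lemmatize(Process):
    requires = ("tokens",)
    provides = "lemmata"


@dataclass
class SketchPipeline:
    """An ordered list of process classes, plus a description."""

    description: str
    # default_factory gives each instance its own list.
    processes: List[Type[Process]] = field(default_factory=list)


def check_order(pipeline: SketchPipeline) -> bool:
    """Return True if each process's requirements are met by earlier ones."""
    available = set()
    for process in pipeline.processes:
        if not set(process.requires) <= available:
            return False
        available.add(process.provides)
    return True


good = SketchPipeline("tokenize, then lemmatize", [Tokenize, Lemmatize])
bad = SketchPipeline("lemmatize first", [Lemmatize, Tokenize])
print(check_order(good), check_order(bad))  # True False
```

Storing process classes rather than instances mirrors the List[Type[...]] annotation seen in the signatures below, and the list order doubles as the execution order.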

class cltk.languages.pipelines.AkkadianPipeline(description: str = 'Pipeline for the Akkadian language.', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Akkadian', glottolog_id='akka1240', latitude=33.1, longitude=44.1, dates=[], family_id='afro1255', parent_id='east2678', level='language', iso_639_3_code='akk', type='a'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Akkadian.

>>> from cltk.languages.pipelines import AkkadianPipeline
>>> a_pipeline = AkkadianPipeline()
>>> a_pipeline.description
'Pipeline for the Akkadian language.'
>>> a_pipeline.language
Language(name='Akkadian', glottolog_id='akka1240', latitude=33.1, longitude=44.1, dates=[], family_id='afro1255', parent_id='east2678', level='language', iso_639_3_code='akk', type='a')
>>> a_pipeline.language.name
'Akkadian'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.AkkadianTokenizationProcess'>
description: str = 'Pipeline for the Akkadian language.'
language: cltk.core.data_types.Language = Language(name='Akkadian', glottolog_id='akka1240', latitude=33.1, longitude=44.1, dates=[], family_id='afro1255', parent_id='east2678', level='language', iso_639_3_code='akk', type='a')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.ArabicPipeline(description: str = 'Pipeline for the Arabic language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Standard Arabic', glottolog_id='stan1318', latitude=27.9625, longitude=43.8525, dates=[], family_id='afro1255', parent_id='arab1395', level='language', iso_639_3_code='arb', type=''))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Arabic.

>>> from cltk.languages.pipelines import ArabicPipeline
>>> a_pipeline = ArabicPipeline()
>>> a_pipeline.description
'Pipeline for the Arabic language'
>>> a_pipeline.language
Language(name='Standard Arabic', glottolog_id='stan1318', latitude=27.9625, longitude=43.8525, dates=[], family_id='afro1255', parent_id='arab1395', level='language', iso_639_3_code='arb', type='')
>>> a_pipeline.language.name
'Standard Arabic'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.ArabicTokenizationProcess'>
description: str = 'Pipeline for the Arabic language'
language: cltk.core.data_types.Language = Language(name='Standard Arabic', glottolog_id='stan1318', latitude=27.9625, longitude=43.8525, dates=[], family_id='afro1255', parent_id='arab1395', level='language', iso_639_3_code='arb', type='')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.AramaicPipeline(description: str = 'Pipeline for the Aramaic language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Official Aramaic (700-300 BCE)', glottolog_id='', latitude=0.0, longitude=0.0, dates=[], family_id='', parent_id='', level='', iso_639_3_code='arc', type='a'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Aramaic.

TODO: Confirm with specialist what encodings should be expected.
TODO: Replace ArabicTokenizationProcess with a multilingual one or an Aramaic-specific one.

>>> from cltk.languages.pipelines import AramaicPipeline
>>> a_pipeline = AramaicPipeline()
>>> a_pipeline.description
'Pipeline for the Aramaic language'
>>> a_pipeline.language
Language(name='Official Aramaic (700-300 BCE)', glottolog_id='', latitude=0.0, longitude=0.0, dates=[], family_id='', parent_id='', level='', iso_639_3_code='arc', type='a')
>>> a_pipeline.language.name
'Official Aramaic (700-300 BCE)'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.ArabicTokenizationProcess'>
description: str = 'Pipeline for the Aramaic language'
language: cltk.core.data_types.Language = Language(name='Official Aramaic (700-300 BCE)', glottolog_id='', latitude=0.0, longitude=0.0, dates=[], family_id='', parent_id='', level='', iso_639_3_code='arc', type='a')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.ChinesePipeline(description: str = 'Pipeline for the Classical Chinese language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Literary Chinese', glottolog_id='lite1248', latitude=0.0, longitude=0.0, dates=[], family_id='sino1245', parent_id='clas1255', level='language', iso_639_3_code='lzh', type='h'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Classical Chinese.

>>> from cltk.languages.pipelines import ChinesePipeline
>>> a_pipeline = ChinesePipeline()
>>> a_pipeline.description
'Pipeline for the Classical Chinese language'
>>> a_pipeline.language
Language(name='Literary Chinese', glottolog_id='lite1248', latitude=0.0, longitude=0.0, dates=[], family_id='sino1245', parent_id='clas1255', level='language', iso_639_3_code='lzh', type='h')
>>> a_pipeline.language.name
'Literary Chinese'
>>> a_pipeline.processes[0]
<class 'cltk.dependency.processes.ChineseStanzaProcess'>
description: str = 'Pipeline for the Classical Chinese language'
language: cltk.core.data_types.Language = Language(name='Literary Chinese', glottolog_id='lite1248', latitude=0.0, longitude=0.0, dates=[], family_id='sino1245', parent_id='clas1255', level='language', iso_639_3_code='lzh', type='h')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.CopticPipeline(description: str = 'Pipeline for the Coptic language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Coptic', glottolog_id='copt1239', latitude=29.472, longitude=31.2053, dates=[], family_id='afro1255', parent_id='egyp1245', level='language', iso_639_3_code='cop', type=''))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Coptic.

>>> from cltk.languages.pipelines import CopticPipeline
>>> a_pipeline = CopticPipeline()
>>> a_pipeline.description
'Pipeline for the Coptic language'
>>> a_pipeline.language
Language(name='Coptic', glottolog_id='copt1239', latitude=29.472, longitude=31.2053, dates=[], family_id='afro1255', parent_id='egyp1245', level='language', iso_639_3_code='cop', type='')
>>> a_pipeline.language.name
'Coptic'
>>> a_pipeline.processes[0]
<class 'cltk.dependency.processes.CopticStanzaProcess'>
description: str = 'Pipeline for the Coptic language'
language: cltk.core.data_types.Language = Language(name='Coptic', glottolog_id='copt1239', latitude=29.472, longitude=31.2053, dates=[], family_id='afro1255', parent_id='egyp1245', level='language', iso_639_3_code='cop', type='')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.GothicPipeline(description: str = 'Pipeline for the Gothic language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Gothic', glottolog_id='goth1244', latitude=46.9304, longitude=29.9786, dates=[], family_id='indo1319', parent_id='east2805', level='language', iso_639_3_code='got', type='a'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Gothic.

>>> from cltk.languages.pipelines import GothicPipeline
>>> a_pipeline = GothicPipeline()
>>> a_pipeline.description
'Pipeline for the Gothic language'
>>> a_pipeline.language
Language(name='Gothic', glottolog_id='goth1244', latitude=46.9304, longitude=29.9786, dates=[], family_id='indo1319', parent_id='east2805', level='language', iso_639_3_code='got', type='a')
>>> a_pipeline.language.name
'Gothic'
>>> a_pipeline.processes[0]
<class 'cltk.dependency.processes.GothicStanzaProcess'>
>>> a_pipeline.processes[1]
<class 'cltk.embeddings.processes.GothicEmbeddingsProcess'>
description: str = 'Pipeline for the Gothic language'
language: cltk.core.data_types.Language = Language(name='Gothic', glottolog_id='goth1244', latitude=46.9304, longitude=29.9786, dates=[], family_id='indo1319', parent_id='east2805', level='language', iso_639_3_code='got', type='a')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.GreekPipeline(description: str = 'Pipeline for the Greek language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Ancient Greek', glottolog_id='anci1242', latitude=39.8155, longitude=21.9129, dates=[], family_id='indo1319', parent_id='east2798', level='language', iso_639_3_code='grc', type='h'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Ancient Greek.

>>> from cltk.languages.pipelines import GreekPipeline
>>> a_pipeline = GreekPipeline()
>>> a_pipeline.description
'Pipeline for the Greek language'
>>> a_pipeline.language
Language(name='Ancient Greek', glottolog_id='anci1242', latitude=39.8155, longitude=21.9129, dates=[], family_id='indo1319', parent_id='east2798', level='language', iso_639_3_code='grc', type='h')
>>> a_pipeline.language.name
'Ancient Greek'
>>> a_pipeline.processes[0]
<class 'cltk.alphabet.processes.GreekNormalizeProcess'>
description: str = 'Pipeline for the Greek language'
language: cltk.core.data_types.Language = Language(name='Ancient Greek', glottolog_id='anci1242', latitude=39.8155, longitude=21.9129, dates=[], family_id='indo1319', parent_id='east2798', level='language', iso_639_3_code='grc', type='h')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.HindiPipeline(description: str = 'Pipeline for the Hindi language.', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Hindi', glottolog_id='hind1269', latitude=25.0, longitude=77.0, dates=[], family_id='indo1319', parent_id='hind1270', level='language', iso_639_3_code='hin', type=''))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Hindi.

>>> from cltk.languages.pipelines import HindiPipeline
>>> a_pipeline = HindiPipeline()
>>> a_pipeline.description
'Pipeline for the Hindi language.'
>>> a_pipeline.language
Language(name='Hindi', glottolog_id='hind1269', latitude=25.0, longitude=77.0, dates=[], family_id='indo1319', parent_id='hind1270', level='language', iso_639_3_code='hin', type='')
>>> a_pipeline.language.name
'Hindi'
>>> a_pipeline.processes[1]
<class 'cltk.stops.processes.StopsProcess'>
description: str = 'Pipeline for the Hindi language.'
language: cltk.core.data_types.Language = Language(name='Hindi', glottolog_id='hind1269', latitude=25.0, longitude=77.0, dates=[], family_id='indo1319', parent_id='hind1270', level='language', iso_639_3_code='hin', type='')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.LatinPipeline(description: str = 'Pipeline for the Latin language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Latin', glottolog_id='lati1261', latitude=41.9026, longitude=12.4502, dates=[], family_id='indo1319', parent_id='impe1234', level='language', iso_639_3_code='lat', type='a'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Latin.

TODO: Add stopword annotation for all relevant pipelines.

>>> from cltk.languages.pipelines import LatinPipeline
>>> a_pipeline = LatinPipeline()
>>> a_pipeline.description
'Pipeline for the Latin language'
>>> a_pipeline.language
Language(name='Latin', glottolog_id='lati1261', latitude=41.9026, longitude=12.4502, dates=[], family_id='indo1319', parent_id='impe1234', level='language', iso_639_3_code='lat', type='a')
>>> a_pipeline.language.name
'Latin'
>>> a_pipeline.processes[0]
<class 'cltk.alphabet.processes.LatinNormalizeProcess'>
description: str = 'Pipeline for the Latin language'
language: cltk.core.data_types.Language = Language(name='Latin', glottolog_id='lati1261', latitude=41.9026, longitude=12.4502, dates=[], family_id='indo1319', parent_id='impe1234', level='language', iso_639_3_code='lat', type='a')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.MiddleHighGermanPipeline(description: str = 'Pipeline for the Middle High German language.', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Middle High German', glottolog_id='midd1343', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='midd1349', level='language', iso_639_3_code='gmh', type='h'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Middle High German.

>>> from cltk.languages.pipelines import MiddleHighGermanPipeline
>>> a_pipeline = MiddleHighGermanPipeline()
>>> a_pipeline.description
'Pipeline for the Middle High German language.'
>>> a_pipeline.language
Language(name='Middle High German', glottolog_id='midd1343', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='midd1349', level='language', iso_639_3_code='gmh', type='h')
>>> a_pipeline.language.name
'Middle High German'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.MiddleHighGermanTokenizationProcess'>
description: str = 'Pipeline for the Middle High German language.'
language: cltk.core.data_types.Language = Language(name='Middle High German', glottolog_id='midd1343', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='midd1349', level='language', iso_639_3_code='gmh', type='h')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.MiddleEnglishPipeline(description: str = 'Pipeline for the Middle English language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Middle English', glottolog_id='midd1317', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='merc1242', level='language', iso_639_3_code='enm', type='h'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Middle English.

TODO: Figure out whether the dedicated tokenizer is good enough or necessary; we have stanza for Old English, which might be able to tokenize fine.

>>> from cltk.languages.pipelines import MiddleEnglishPipeline
>>> a_pipeline = MiddleEnglishPipeline()
>>> a_pipeline.description
'Pipeline for the Middle English language'
>>> a_pipeline.language
Language(name='Middle English', glottolog_id='midd1317', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='merc1242', level='language', iso_639_3_code='enm', type='h')
>>> a_pipeline.language.name
'Middle English'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.MiddleEnglishTokenizationProcess'>
description: str = 'Pipeline for the Middle English language'
language: cltk.core.data_types.Language = Language(name='Middle English', glottolog_id='midd1317', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='merc1242', level='language', iso_639_3_code='enm', type='h')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.MiddleFrenchPipeline(description: str = 'Pipeline for the Middle French language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Middle French', glottolog_id='midd1316', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='stan1290', level='dialect', iso_639_3_code='frm', type='h'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Middle French.

TODO: Figure out whether the dedicated tokenizer is good enough or necessary; we have stanza for Old French, which might be able to tokenize fine.

>>> from cltk.languages.pipelines import MiddleFrenchPipeline
>>> a_pipeline = MiddleFrenchPipeline()
>>> a_pipeline.description
'Pipeline for the Middle French language'
>>> a_pipeline.language
Language(name='Middle French', glottolog_id='midd1316', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='stan1290', level='dialect', iso_639_3_code='frm', type='h')
>>> a_pipeline.language.name
'Middle French'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.MiddleFrenchTokenizationProcess'>
description: str = 'Pipeline for the Middle French language'
language: cltk.core.data_types.Language = Language(name='Middle French', glottolog_id='midd1316', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='stan1290', level='dialect', iso_639_3_code='frm', type='h')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.OCSPipeline(description: str = 'Pipeline for the Old Church Slavonic language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Church Slavic', glottolog_id='chur1257', latitude=43.7171, longitude=22.8442, dates=[], family_id='indo1319', parent_id='east2269', level='language', iso_639_3_code='chu', type='a'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Old Church Slavonic.

>>> from cltk.languages.pipelines import OCSPipeline
>>> a_pipeline = OCSPipeline()
>>> a_pipeline.description
'Pipeline for the Old Church Slavonic language'
>>> a_pipeline.language
Language(name='Church Slavic', glottolog_id='chur1257', latitude=43.7171, longitude=22.8442, dates=[], family_id='indo1319', parent_id='east2269', level='language', iso_639_3_code='chu', type='a')
>>> a_pipeline.language.name
'Church Slavic'
>>> a_pipeline.processes[0]
<class 'cltk.dependency.processes.OCSStanzaProcess'>
description: str = 'Pipeline for the Old Church Slavonic language'
language: cltk.core.data_types.Language = Language(name='Church Slavic', glottolog_id='chur1257', latitude=43.7171, longitude=22.8442, dates=[], family_id='indo1319', parent_id='east2269', level='language', iso_639_3_code='chu', type='a')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.OldEnglishPipeline(description: str = 'Pipeline for the Old English language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Old English (ca. 450-1100)', glottolog_id='olde1238', latitude=51.06, longitude=-1.31, dates=[], family_id='indo1319', parent_id='angl1265', level='language', iso_639_3_code='ang', type='h'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Old English.

>>> from cltk.languages.pipelines import OldEnglishPipeline
>>> a_pipeline = OldEnglishPipeline()
>>> a_pipeline.description
'Pipeline for the Old English language'
>>> a_pipeline.language
Language(name='Old English (ca. 450-1100)', glottolog_id='olde1238', latitude=51.06, longitude=-1.31, dates=[], family_id='indo1319', parent_id='angl1265', level='language', iso_639_3_code='ang', type='h')
>>> a_pipeline.language.name
'Old English (ca. 450-1100)'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.MultilingualTokenizationProcess'>
description: str = 'Pipeline for the Old English language'
language: cltk.core.data_types.Language = Language(name='Old English (ca. 450-1100)', glottolog_id='olde1238', latitude=51.06, longitude=-1.31, dates=[], family_id='indo1319', parent_id='angl1265', level='language', iso_639_3_code='ang', type='h')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.OldFrenchPipeline(description: str = 'Pipeline for the Old French language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Old French (842-ca. 1400)', glottolog_id='oldf1239', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='oila1234', level='language', iso_639_3_code='fro', type='h'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Old French.

>>> from cltk.languages.pipelines import OldFrenchPipeline
>>> a_pipeline = OldFrenchPipeline()
>>> a_pipeline.description
'Pipeline for the Old French language'
>>> a_pipeline.language
Language(name='Old French (842-ca. 1400)', glottolog_id='oldf1239', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='oila1234', level='language', iso_639_3_code='fro', type='h')
>>> a_pipeline.language.name
'Old French (842-ca. 1400)'
>>> a_pipeline.processes[0]
<class 'cltk.dependency.processes.OldFrenchStanzaProcess'>
description: str = 'Pipeline for the Old French language'
language: cltk.core.data_types.Language = Language(name='Old French (842-ca. 1400)', glottolog_id='oldf1239', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='oila1234', level='language', iso_639_3_code='fro', type='h')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.OldNorsePipeline(description: str = 'Pipeline for the Old Norse language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Old Norse', glottolog_id='oldn1244', latitude=63.42, longitude=10.38, dates=[], family_id='indo1319', parent_id='west2805', level='language', iso_639_3_code='non', type='h'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Old Norse.

>>> from cltk.languages.pipelines import OldNorsePipeline
>>> a_pipeline = OldNorsePipeline()
>>> a_pipeline.description
'Pipeline for the Old Norse language'
>>> a_pipeline.language
Language(name='Old Norse', glottolog_id='oldn1244', latitude=63.42, longitude=10.38, dates=[], family_id='indo1319', parent_id='west2805', level='language', iso_639_3_code='non', type='h')
>>> a_pipeline.language.name
'Old Norse'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.OldNorseTokenizationProcess'>
description: str = 'Pipeline for the Old Norse language'
language: cltk.core.data_types.Language = Language(name='Old Norse', glottolog_id='oldn1244', latitude=63.42, longitude=10.38, dates=[], family_id='indo1319', parent_id='west2805', level='language', iso_639_3_code='non', type='h')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.PaliPipeline(description: str = 'Pipeline for the Pali language', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Pali', glottolog_id='pali1273', latitude=24.5271, longitude=82.251, dates=[], family_id='indo1319', parent_id='biha1245', level='language', iso_639_3_code='pli', type='a'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Pali.

TODO: Make better tokenizer for Pali.

>>> from cltk.languages.pipelines import PaliPipeline
>>> a_pipeline = PaliPipeline()
>>> a_pipeline.description
'Pipeline for the Pali language'
>>> a_pipeline.language
Language(name='Pali', glottolog_id='pali1273', latitude=24.5271, longitude=82.251, dates=[], family_id='indo1319', parent_id='biha1245', level='language', iso_639_3_code='pli', type='a')
>>> a_pipeline.language.name
'Pali'
>>> a_pipeline.processes[0]
<class 'cltk.tokenizers.processes.MultilingualTokenizationProcess'>
description: str = 'Pipeline for the Pali language'
language: cltk.core.data_types.Language = Language(name='Pali', glottolog_id='pali1273', latitude=24.5271, longitude=82.251, dates=[], family_id='indo1319', parent_id='biha1245', level='language', iso_639_3_code='pli', type='a')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.PanjabiPipeline(description: str = 'Pipeline for the Panjabi language.', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Eastern Panjabi', glottolog_id='panj125', latitude=30.0368, longitude=75.6702, dates=[], family_id='indo1319', parent_id='east2727', level='language', iso_639_3_code='pan', type=''))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Panjabi.

>>> from cltk.languages.pipelines import PanjabiPipeline
>>> a_pipeline = PanjabiPipeline()
>>> a_pipeline.description
'Pipeline for the Panjabi language.'
>>> a_pipeline.language
Language(name='Eastern Panjabi', glottolog_id='panj125', latitude=30.0368, longitude=75.6702, dates=[], family_id='indo1319', parent_id='east2727', level='language', iso_639_3_code='pan', type='')
>>> a_pipeline.language.name
'Eastern Panjabi'
>>> a_pipeline.processes[1]
<class 'cltk.stops.processes.StopsProcess'>
description: str = 'Pipeline for the Panjabi language.'
language: cltk.core.data_types.Language = Language(name='Eastern Panjabi', glottolog_id='panj125', latitude=30.0368, longitude=75.6702, dates=[], family_id='indo1319', parent_id='east2727', level='language', iso_639_3_code='pan', type='')
processes: List[Type[cltk.core.data_types.Process]]
class cltk.languages.pipelines.SanskritPipeline(description: str = 'Pipeline for the Sanskrit language.', processes: List[Type[cltk.core.data_types.Process]] = <factory>, language: cltk.core.data_types.Language = Language(name='Sanskrit', glottolog_id='sans1269', latitude=20.0, longitude=77.0, dates=[], family_id='indo1319', parent_id='indo1321', level='language', iso_639_3_code='san', type='a'))[source]

Bases: cltk.core.data_types.Pipeline

Default Pipeline for Sanskrit.

TODO: Make better tokenizer for Sanskrit.

>>> from cltk.languages.pipelines import SanskritPipeline
>>> a_pipeline = SanskritPipeline()
>>> a_pipeline.description
'Pipeline for the Sanskrit language.'
>>> a_pipeline.language
Language(name='Sanskrit', glottolog_id='sans1269', latitude=20.0, longitude=77.0, dates=[], family_id='indo1319', parent_id='indo1321', level='language', iso_639_3_code='san', type='a')
>>> a_pipeline.language.name
'Sanskrit'
>>> a_pipeline.processes[1]
<class 'cltk.embeddings.processes.SanskritEmbeddingsProcess'>
description: str = 'Pipeline for the Sanskrit language.'
language: cltk.core.data_types.Language = Language(name='Sanskrit', glottolog_id='sans1269', latitude=20.0, longitude=77.0, dates=[], family_id='indo1319', parent_id='indo1321', level='language', iso_639_3_code='san', type='a')
processes: List[Type[cltk.core.data_types.Process]]

8.1.7.5. cltk.languages.utils module

cltk.languages.utils.get_lang(iso_code)[source]

Take ISO 639-3 code and return Language object for language.

TODO: Split this into another function, check_language(), which is how it is usually used now.

>>> from cltk.languages.utils import get_lang
>>> get_lang("akk")
Language(name='Akkadian', glottolog_id='akka1240', latitude=33.1, longitude=44.1, dates=[], family_id='afro1255', parent_id='east2678', level='language', iso_639_3_code='akk', type='a')
>>> from cltk.core.exceptions import UnknownLanguageError
>>> get_lang("xxx")
Traceback (most recent call last):
  ...
cltk.core.exceptions.UnknownLanguageError: Unknown ISO language code 'xxx'.
Return type

Language

cltk.languages.utils.find_iso_name(common_name)[source]

Find the ISO 639-3 language code (e.g., lat) by inputting the common name (Latin). This function just does simple substring matching, with some normalization of case, on the name field of the Language object.

>>> from cltk.languages.utils import find_iso_name
>>> find_iso_name(common_name="Latin")
['lat']
>>> find_iso_name(common_name="lat")
['xga', 'lat']
>>> find_iso_name(common_name="slav")
['chu']
>>> find_iso_name(common_name="xxx")
[]
Return type

List[str]
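
The case-normalized substring matching described above can be sketched as follows. This is an illustration of the technique rather than the function's source: NAMES is a small hand-picked sample, and find_codes is a hypothetical name.

```python
from typing import Dict, List

# Hand-picked sample of ISO 639-3 codes and common names, for illustration.
NAMES: Dict[str, str] = {
    "xga": "Galatian",
    "lat": "Latin",
    "chu": "Church Slavic",
    "grc": "Ancient Greek",
}


def find_codes(common_name: str, names: Dict[str, str] = NAMES) -> List[str]:
    """Return every code whose name contains the search term, ignoring case."""
    needle = common_name.lower()
    return [code for code, name in names.items() if needle in name.lower()]


print(find_codes("Latin"))  # ['lat']
print(find_codes("lat"))  # ['xga', 'lat']
print(find_codes("slav"))  # ['chu']
print(find_codes("xxx"))  # []
```

Plain substring matching explains why a short term such as "lat" also matches "Galatian": the search does not anchor at the start of the name.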