8.1.14. cltk.sentence package

8.1.14.1. Submodules

8.1.14.2. cltk.sentence.grc module

Code for sentence tokenization: Greek.

Sentence tokenization for Ancient Greek is provided by a regular-expression-based tokenizer.

>>> from cltk.sentence.grc import GreekRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = GreekRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("grc"))
>>> sentences[:2]
['ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ τῶν ἐμῶν κατηγόρων, οὐκ οἶδα: ἐγὼ δ᾽ οὖν καὶ αὐτὸς ὑπ᾽ αὐτῶν ὀλίγου ἐμαυτοῦ ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον.', 'καίτοι ἀληθές γε ὡς ἔπος εἰπεῖν οὐδὲν εἰρήκασιν.']
>>> len(sentences)
9

class cltk.sentence.grc.GreekRegexSentenceTokenizer

Bases: cltk.sentence.sentence.RegexSentenceTokenizer

RegexSentenceTokenizer for Ancient Greek.

8.1.14.3. cltk.sentence.lat module

Code for sentence tokenization: Latin.

>>> from cltk.sentence.lat import LatinPunktSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = LatinPunktSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("lat"))
>>> sentences[2]
'Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.'
>>> len(sentences)
8

class cltk.sentence.lat.LatinLanguageVars

Bases: nltk.tokenize.punkt.PunktLanguageVars

class cltk.sentence.lat.LatinPunktSentenceTokenizer(strict=False)

Bases: cltk.sentence.sentence.PunktSentenceTokenizer

Sentence tokenizer for Latin. Inherits from NLTK’s PunktSentenceTokenizer.

8.1.14.4. cltk.sentence.san module

Sentence tokenization for Sanskrit.

>>> from cltk.sentence.san import SanskritRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = SanskritRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("san"))
>>> sentences[1]
'तेन त्यक्तेन भुञ्जीथा मा गृधः कस्य स्विद्धनम् ॥'
>>> len(sentences)
12

class cltk.sentence.san.SanskritLanguageVars

Bases: nltk.tokenize.punkt.PunktLanguageVars

sent_end_chars = ['।', '॥', '\\|', '\\|\\|']

class cltk.sentence.san.SanskritRegexSentenceTokenizer

Bases: cltk.sentence.sentence.RegexSentenceTokenizer

RegexSentenceTokenizer for Sanskrit.
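
The sent_end_chars list documented above can be turned into a splitting pattern in several ways. The following is a minimal sketch of one of them, not the cltk implementation: the function name split_on_dandas and the choice to place the entries inside a single character class are assumptions made for illustration. Note that the ASCII pipe entries are already regex-escaped, so they are used as-is rather than passed through re.escape.

```python
import re

# Copied from the documented SanskritLanguageVars.sent_end_chars.
# The ASCII pipe entries are already regex-escaped ("\\|" is the two
# characters backslash + pipe).
SENT_END_CHARS = ["।", "॥", "\\|", "\\|\\|"]

# Assumption for this sketch: put every entry into one character class.
# Python's re module requires a fixed-width lookbehind, so an alternation
# mixing "|" and "||" would be rejected; a character class avoids that,
# and "||" is still covered because its last character is "|".
END_CLASS = "[" + "".join(SENT_END_CHARS) + "]"
PATTERN = re.compile(rf"(?<={END_CLASS})\s+")


def split_on_dandas(text):
    """Split text at whitespace that follows a danda, double danda, or pipe."""
    return [s for s in PATTERN.split(text.strip()) if s]


print(split_on_dandas("रामः वनं गच्छति । सीता अपि गच्छति ॥"))
```

Because the split happens on the whitespace after the mark, each sentence keeps its closing danda, as in the doctest output shown earlier in this module.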

8.1.14.5. cltk.sentence.sentence module

Tokenize sentences.

class cltk.sentence.sentence.SentenceTokenizer(language=None)

Bases: abc.ABC

Base class for sentence tokenization.

tokenize(text, model=None)

Tokenize text into sentences with a pretrained Punkt model; language-specific tokenizers can override this method.

Return type

list

Parameters
  • text (str) – text to be tokenized into sentences

  • model (object) – tokenizer object to use (the source docstring asks whether this belongs in __init__)
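
The override relationship described above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: SketchSentenceTokenizer and SketchFullStopTokenizer are invented names, and the sketch declares tokenize() abstract for simplicity, whereas the documented base class provides a Punkt-based default. Only the signatures (language=None, and tokenize(text, model=None) returning a list) follow the documentation.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Optional


class SketchSentenceTokenizer(ABC):
    """Hypothetical stand-in for a sentence-tokenizer base class."""

    def __init__(self, language: Optional[str] = None):
        self.language = language

    @abstractmethod
    def tokenize(self, text: str, model: object = None) -> List[str]:
        """Return the sentences found in ``text``."""


class SketchFullStopTokenizer(SketchSentenceTokenizer):
    """Hypothetical language-specific tokenizer that overrides tokenize()."""

    def tokenize(self, text: str, model: object = None) -> List[str]:
        # Split on whitespace that follows a full stop; ``model`` is ignored.
        return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


tokenizer = SketchFullStopTokenizer(language="lat")
print(tokenizer.tokenize(
    "Gallia est omnis divisa in partes tres. Hi omnes lingua inter se differunt."
))
```

Callers depend only on the tokenize(text, model=None) contract, so a language-specific subclass can change the splitting strategy without changing how it is used.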

class cltk.sentence.sentence.PunktSentenceTokenizer(language=None, lang_vars=None)

Bases: cltk.sentence.sentence.SentenceTokenizer

Base class for Punkt sentence tokenization.

missing_models_message = 'PunktSentenceTokenizer requires a language model.'

class cltk.sentence.sentence.RegexSentenceTokenizer(language=None, sent_end_chars=None)

Bases: cltk.sentence.sentence.SentenceTokenizer

Base class for regex sentence tokenization.

tokenize(text, model=None)

Method for tokenizing sentences with regular expressions.

Return type

list

Parameters

text (str) – text to be tokenized into sentences
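
As a rough illustration of how a tokenizer configured with sent_end_chars might behave, here is a minimal sketch under stated assumptions. SketchRegexSentenceTokenizer is a hypothetical class, not the cltk one; the lookbehind pattern, the ValueError on a missing character list, and the Greek end characters passed in the usage lines are all assumptions. Only the constructor and tokenize() signatures mirror the documentation.

```python
import re
from typing import List, Optional


class SketchRegexSentenceTokenizer:
    """Illustrative stand-in; not the cltk class."""

    def __init__(self, language: Optional[str] = None,
                 sent_end_chars: Optional[List[str]] = None):
        # Assumption: a regex tokenizer is unusable without end characters.
        if not sent_end_chars:
            raise ValueError("sent_end_chars must list at least one character")
        self.language = language
        self.sent_end_chars = sent_end_chars
        # Escape each plain character and collect them in one character class.
        char_class = "".join(re.escape(c) for c in sent_end_chars)
        # Split on whitespace preceded by any sentence-final character.
        self.pattern = re.compile(rf"(?<=[{char_class}])\s+")

    def tokenize(self, text: str, model: object = None) -> List[str]:
        # ``model`` only mirrors the documented signature; it is unused here.
        return [s for s in self.pattern.split(text.strip()) if s]


# Assumed Greek sentence-final marks: full stop, ";" (question mark), "·".
greek = SketchRegexSentenceTokenizer(language="grc", sent_end_chars=[".", ";", "·"])
print(greek.tokenize(
    "οὕτω πιθανῶς ἔλεγον. καίτοι ἀληθές γε ὡς ἔπος εἰπεῖν οὐδὲν εἰρήκασιν."
))
```

Keeping the end characters as constructor data means a language-specific subclass only has to supply its own list, which matches the shape of the Greek and Sanskrit tokenizers documented above.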