8.1.3.1.1.1.1. cltk.corpora.grc.tlg package

8.1.3.1.1.1.1.1. Submodules

8.1.3.1.1.1.1.2. cltk.corpora.grc.tlg.author_date module

8.1.3.1.1.1.1.3. cltk.corpora.grc.tlg.author_epithet module

8.1.3.1.1.1.1.4. cltk.corpora.grc.tlg.author_female module

8.1.3.1.1.1.1.5. cltk.corpora.grc.tlg.author_geo module

8.1.3.1.1.1.1.6. cltk.corpora.grc.tlg.file_utils module

Higher-level (i.e., user-friendly) functions for quickly reading TLG data after it has been processed by TLGU().

cltk.corpora.grc.tlg.file_utils.tlg_plaintext_cleanup(text, rm_punctuation=False, rm_periods=False)[source]

Post-process Greek TLG text by removing and substituting unwanted characters. TODO: There is surely more junk to pull out. Please submit bug reports!

Return type:

str

cltk.corpora.grc.tlg.file_utils.assemble_tlg_author_filepaths()[source]

Read the TLG index and build a list of absolute filepaths to the author files.

Return type:

list[str]

cltk.corpora.grc.tlg.file_utils.assemble_tlg_works_filepaths()[source]

Read the TLG index and build a list of absolute filepaths to the works files.

Return type:

list[str]

8.1.3.1.1.1.1.7. cltk.corpora.grc.tlg.id_author module

8.1.3.1.1.1.1.8. cltk.corpora.grc.tlg.index_lists module

8.1.3.1.1.1.1.9. cltk.corpora.grc.tlg.parse_tlg_indices module

Functions for loading the TLG .json index files, searching them, and retrieving author ids.

cltk.corpora.grc.tlg.parse_tlg_indices.get_female_authors()[source]

Open the female authors index and return a set of author ids.

Return type:

set[str]

cltk.corpora.grc.tlg.parse_tlg_indices.get_epithet_index()[source]

Return a dict mapping each epithet (key) to the set of all author ids with that epithet (value).

Return type:

dict[str, set[str]]

cltk.corpora.grc.tlg.parse_tlg_indices.get_epithets()[source]

Return a list of all the epithet labels.

Return type:

list[str]

cltk.corpora.grc.tlg.parse_tlg_indices.select_authors_by_epithet(query)[source]

Pass the exact name (case-insensitive) of an epithet; return the set of author ids with that epithet.

Return type:

set[str]

cltk.corpora.grc.tlg.parse_tlg_indices.get_epithet_of_author(_id)[source]

Pass author id and return the name of its associated epithet.

Return type:

str
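The two lookup directions described for the epithet functions can be illustrated with a minimal standalone sketch. It does not import CLTK: the index below is invented toy data shaped like the documented return type of get_epithet_index() (dict[str, set[str]]), and the toy_* function names are hypothetical stand-ins, not the library's implementation.

```python
# Toy stand-in for the structure documented for get_epithet_index():
# epithet label -> set of author ids. All entries are invented for illustration.
TOY_EPITHET_INDEX = {
    "Historici/-ae": {"9001", "9002"},
    "Lyrici/-ae": {"9003"},
}


def toy_select_authors_by_epithet(query):
    """Return the author ids whose epithet label equals query, ignoring case."""
    for label, author_ids in TOY_EPITHET_INDEX.items():
        if label.casefold() == query.casefold():
            return set(author_ids)
    return set()


def toy_get_epithet_of_author(author_id):
    """Return the epithet label that lists author_id, or None if it is absent."""
    for label, author_ids in TOY_EPITHET_INDEX.items():
        if author_id in author_ids:
            return label
    return None
```

The same pattern would apply to the geography index, which is documented with the same dict[str, set[str]] shape.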

cltk.corpora.grc.tlg.parse_tlg_indices.get_geo_index()[source]

Get entire index of geographic name (key) and set of associated authors (value).

Return type:

dict[str, set[str]]

cltk.corpora.grc.tlg.parse_tlg_indices.get_geographies()[source]

Return a list of all the geography labels.

Return type:

list[str]

cltk.corpora.grc.tlg.parse_tlg_indices.select_authors_by_geo(query)[source]

Pass the exact name (case-insensitive) of a geography; return the set of author ids associated with it.

Return type:

set[str]

cltk.corpora.grc.tlg.parse_tlg_indices.get_geo_of_author(_id)[source]

Pass author id and return the name of its associated geography.

Return type:

str

cltk.corpora.grc.tlg.parse_tlg_indices.get_lists()[source]

Return all of the TLG’s indices.

Return type:

dict[str, dict[str, str]]

cltk.corpora.grc.tlg.parse_tlg_indices.get_id_author()[source]

Return the entire id-author TLG index.

Return type:

dict[str, str]

cltk.corpora.grc.tlg.parse_tlg_indices.select_id_by_name(query)[source]

Do a case-insensitive regex match on author name; return a list of tuples identifying the matching TLG ids.

Return type:

list[tuple[str, str]]
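A case-insensitive regex lookup of this kind can be sketched without CLTK as follows. This is an illustration under assumptions: the mapping is invented toy data, it is assumed to run from id to author name, the result tuples are assumed to be (id, name) pairs, and re.search is used so the pattern may match anywhere in the name. The real function may differ on any of these points.

```python
import re

# Toy id -> author-name mapping, shaped like the documented dict[str, str].
# The ids and names are invented for illustration.
TOY_ID_AUTHOR = {
    "9001": "Example Historicus",
    "9002": "Example Lyricus",
    "9003": "Another Writer",
}


def toy_select_id_by_name(query):
    """Return (id, name) pairs whose name matches query, ignoring case."""
    pattern = re.compile(query, flags=re.IGNORECASE)
    return [
        (author_id, name)
        for author_id, name in TOY_ID_AUTHOR.items()
        if pattern.search(name)
    ]
```

Because the query is treated as a regular expression, a call such as toy_select_id_by_name("lyric|writer") would match on either alternative.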

cltk.corpora.grc.tlg.parse_tlg_indices.open_json(_file)[source]

Loads the json file as a dictionary and returns it.

cltk.corpora.grc.tlg.parse_tlg_indices.get_works_by_id(_id)[source]

Pass author id and return a dictionary of its works.

cltk.corpora.grc.tlg.parse_tlg_indices.check_id(_id)[source]

Pass author id and return a string with the author label.

cltk.corpora.grc.tlg.parse_tlg_indices.get_date_author()[source]

Return the entire date-author index.

Return type:

dict[str, list[str]]

cltk.corpora.grc.tlg.parse_tlg_indices.get_dates()[source]

Return a list of all the date labels.

cltk.corpora.grc.tlg.parse_tlg_indices.get_date_of_author(_id)[source]

Pass author id and return the name of its associated date.

cltk.corpora.grc.tlg.parse_tlg_indices._get_epoch(_str)[source]

Take an incoming string and return its epoch, if any.

Return type:

Optional[str]

cltk.corpora.grc.tlg.parse_tlg_indices._check_number(_str)[source]

Check if the string contains only a number followed by ‘?’.

Return type:

bool
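One possible reading of this check, sketched as a standalone function: the whole string must be one or more digits followed by a literal question mark. The function name is hypothetical and the pattern is an assumption about the intended behavior, not the library's code.

```python
import re


def toy_check_number(text):
    """Return True if text is only digits followed by a single '?'."""
    return re.fullmatch(r"\d+\?", text) is not None
```

Under this reading, a plain number without the trailing question mark is rejected, as is any string with extra characters around the digits.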

cltk.corpora.grc.tlg.parse_tlg_indices._handle_splits(_str)[source]

Check if the incoming date contains a ‘-’ or ‘/’ and, if so, split it into its parts.

Return type:

dict[str, Optional[str]]

cltk.corpora.grc.tlg.parse_tlg_indices.normalize_dates()[source]

Experiment to make sense of TLG dates. TODO: start here, parse everything with pass

8.1.3.1.1.1.1.10. cltk.corpora.grc.tlg.tlg_index module

Indices for the TLG.

Note: TLG_MASTER_INDEX is the result of failed IDT parsing.

TODO: Add work names to TLG_WORKS_INDEX.

TODO: Add all TLG index data.

8.1.3.1.1.1.1.11. cltk.corpora.grc.tlg.tlgu module

Wrapper for tlgu command line utility.

Original software at: http://tlgu.carmen.gr/.

TLGU software written by Dimitri Marinakis and available at http://tlgu.carmen.gr/ under GPLv2 license.

TODO: the arguments to convert_corpus() need some rationalization, and divide_works() should be incorporated into it.

class cltk.corpora.grc.tlg.tlgu.TLGU(interactive=True)[source]

Bases: object

Check, install, and call TLGU.

_check_and_download_tlgu_source()[source]

Check whether the tlgu source has been downloaded; if not, download it.

Return type:

None

_check_install()[source]

Check whether tlgu is installed; if not, install it.

Return type:

None

static convert(input_path=None, output_path=None, markup=None, rm_newlines=False, divide_works=False, lat=False, extra_args=None)[source]

Run tlgu to convert the file at input_path and write the result to output_path.

Parameters:
  • input_path (Optional[str]) – TLG filepath to convert.

  • output_path (Optional[str]) – filepath of new converted text.

  • markup (Optional[str]) – Specificity of inline markup. Default None removes all numerical markup; ‘full’ gives most detailed, with reference numbers included before each text line.

  • rm_newlines (bool) – No spaces; removes line ends and hyphens before an ID code; hyphens and spaces before page and column ends are retained.

  • divide_works (bool) – Each work (book) is output as a separate file in the form output_file-xxx.txt; if an output file is not specified, this option has no effect.

  • lat (bool) – Primarily Latin text (PHI). Some TLG texts, notably doccan1.txt and doccan2.txt, are mostly roman texts lacking explicit language change codes. Setting this option will force a change to Latin text after each citation block is encountered.

  • extra_args (Optional[list[str]]) – Any other tlgu args to be passed, in list form and without dashes, e.g.: ['p', 'b', 'B'].

Return type:

None
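As a rough illustration of how the parameters above fit together, the sketch below assembles keyword arguments matching the documented convert() signature. build_convert_kwargs and run_conversion are hypothetical helpers, not part of CLTK, and the paths in the usage note are placeholders. Stripping leading dashes from extra_args is a convenience added here to enforce the "without dashes" form.

```python
from typing import Optional


def build_convert_kwargs(
    input_path: str,
    output_path: str,
    full_markup: bool = False,
    extra_args: Optional[list] = None,
) -> dict:
    """Hypothetical helper: assemble keyword arguments for TLGU.convert()."""
    # convert() documents extra_args as a list without dashes, so strip any.
    cleaned = [arg.lstrip("-") for arg in (extra_args or [])]
    return {
        "input_path": input_path,
        "output_path": output_path,
        # None removes numerical markup; 'full' is the most detailed setting.
        "markup": "full" if full_markup else None,
        "rm_newlines": False,
        "divide_works": False,
        "lat": False,
        "extra_args": cleaned or None,
    }


def run_conversion(kwargs: dict) -> None:
    """Call the documented static method; requires CLTK and tlgu."""
    # Imported lazily so build_convert_kwargs works without CLTK installed.
    from cltk.corpora.grc.tlg.tlgu import TLGU

    TLGU.convert(**kwargs)
```

A call might look like run_conversion(build_convert_kwargs("/path/to/input.TXT", "/path/to/output.txt", extra_args=["p", "b"])), with both paths replaced by real locations.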

convert_corpus(corpus, markup=None, lat=None)[source]

Look for imported TLG or PHI files and convert them all to ~/cltk_data/grc/text/tlg/<plaintext>.

TODO: Add markup options to input.

TODO: Add rm_newlines, divide_works, and extra_args.

Return type:

None

divide_works(corpus)[source]

Use the work-breaking option.

TODO: Maybe incorporate this into convert_corpus().

TODO: Write test for this.

Return type:

None

8.1.3.1.1.1.1.12. cltk.corpora.grc.tlg.work_numbers module