8.1.4. cltk.data package

Init for data modules.

Submodules

cltk.data.fetch module

Import CLTK corpora.

TODO (?): Fix so import_corpus() can take a relative path.

TODO (?): Add https://github.com/cltk/pos_latin

TODO: Consider renaming all “import” to “clone”

class cltk.data.fetch.ProgressPrinter[source]

Bases: git.util.RemoteProgress

Class that implements progress reporting.

update(op_code, cur_count, max_count=None, message='')[source]

Called whenever the progress changes.

  • op_code

    Integer that can be compared against Operation IDs and stage IDs.

    Stage IDs are BEGIN and END. BEGIN and END will each be set only once per Operation ID. Both may be set at once if only one progress message was emitted because the operation completed quickly. Between BEGIN and END, neither flag is set.

    Operation IDs are all held within the OP_MASK. Only one Operation ID will be active per call.

  • cur_count – Current absolute count of items

  • max_count – The maximum count of items we expect. It may be None in case there is no maximum number of items or if it is (yet) unknown.

  • message – For the ‘WRITING’ operation, it contains the number of bytes transferred. It may be used for other purposes as well.

You may read the contents of the current line in self._cur_line.
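The op_code decoding described above can be sketched as follows. This is a minimal, self-contained illustration rather than the CLTK implementation: the constants are local stand-ins assumed to mirror the flag layout of git.util.RemoteProgress (real code should read BEGIN, END and OP_MASK from that class), and decode_op_code and format_progress are hypothetical helper names.

```python
# Local stand-ins, assumed to mirror the flag layout of git.util.RemoteProgress.
# In real code, use RemoteProgress.BEGIN, RemoteProgress.END, etc. instead.
BEGIN = 1 << 0
END = 1 << 1
COUNTING = 1 << 2
COMPRESSING = 1 << 3
WRITING = 1 << 4
STAGE_MASK = BEGIN | END
OP_MASK = ~STAGE_MASK

# Names for the operation IDs defined above (illustrative subset only).
OP_NAMES = {COUNTING: "COUNTING", COMPRESSING: "COMPRESSING", WRITING: "WRITING"}


def decode_op_code(op_code):
    """Split op_code into (operation_id, is_begin, is_end)."""
    operation_id = op_code & OP_MASK  # strip the stage flags
    is_begin = bool(op_code & BEGIN)
    is_end = bool(op_code & END)
    return operation_id, is_begin, is_end


def format_progress(op_code, cur_count, max_count=None, message=""):
    """Build a one-line progress string from the arguments passed to update()."""
    operation_id, is_begin, is_end = decode_op_code(op_code)
    name = OP_NAMES.get(operation_id, "OP_%d" % operation_id)
    # max_count may be None when the total is unknown.
    total = "?" if max_count is None else str(int(max_count))
    parts = ["%s %d/%s" % (name, int(cur_count), total)]
    if message:
        parts.append(message)
    if is_begin:
        parts.append("[begin]")
    if is_end:
        parts.append("[end]")
    return " ".join(parts)
```

A subclass of RemoteProgress could call a helper like format_progress from its own update() method and print the result.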

class cltk.data.fetch.FetchCorpus(language, testing=False)[source]

Bases: object

Import CLTK corpora.


Check CLTK_DATA_DIR + ‘/distributed_corpora.yaml’ for any custom, distributed corpora that the user wants to load locally.


Pull from LANGUAGE_CORPORA and return corpora for given language.

property list_corpora

Show corpora available for the CLTK to download.

static _copy_dir_recursive(src_rel, dst_rel)[source]

Copy contents of one directory to another. The dst_rel directory must not already exist. Source: http://stackoverflow.com/a/1994840

TODO: Move this to file_operations.py module.

  • src_rel (str) – Directory to be copied.

  • dst_rel (str) – Directory to be created with contents of src_rel.
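The copy behaviour described above can be sketched with the standard library. This is an illustration in the spirit of the cited Stack Overflow answer, not the CLTK source: copy_dir_recursive is a hypothetical name, and expanding ~ in both paths is an assumption.

```python
import errno
import os
import shutil


def copy_dir_recursive(src_rel, dst_rel):
    """Copy src_rel to dst_rel; dst_rel must not exist yet."""
    # Assumption: paths may start with "~", so expand them first.
    src = os.path.expanduser(src_rel)
    dst = os.path.expanduser(dst_rel)
    try:
        # copytree creates dst itself and fails if dst already exists.
        shutil.copytree(src, dst)
    except OSError as exc:
        # If src is a single file rather than a directory, copy it directly.
        if exc.errno in (errno.ENOTDIR, errno.EINVAL):
            shutil.copy(src, dst)
        else:
            raise
```

Because shutil.copytree refuses to write into an existing directory by default, a second call with the same destination raises FileExistsError, which matches the requirement that the destination must not exist.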


Check whether a corpus is available for import.

  • corpus_name (str) – Name of available corpus.

Return type: str

_git_user_defined_corpus(corpus_name, corpus_type, uri, branch='master')[source]

Clone or update a Git repository defined by the user.

TODO: This code is very redundant with what’s in import_corpus() and could be refactored.

import_corpus(corpus_name, local_path=None, branch='master')[source]

Download a remote corpus, or load a local one, into the ~/cltk_data directory.

TODO: Maybe add from git import RemoteProgress.

TODO: Refactor this; it’s getting kinda long.

  • corpus_name (str) – The name of an available corpus.

  • local_path (Optional[str]) – A filepath, required when importing local corpora.

  • branch (str) – The Git branch to clone.
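A typical call sequence, using only the constructor, property and method documented above, might look like the sketch below. The wrapper fetch_corpus is a hypothetical convenience function, and the import is done inside it so the snippet can be loaded without CLTK installed; actually running it requires CLTK and, for remote corpora, network access.

```python
def fetch_corpus(language, corpus_name, local_path=None, branch="master"):
    """Import one corpus and return whatever list_corpora reports."""
    # Imported lazily so this sketch can be loaded without CLTK installed.
    from cltk.data.fetch import FetchCorpus

    fetcher = FetchCorpus(language=language)
    # list_corpora is documented as a property, so it takes no parentheses.
    available = fetcher.list_corpora
    # local_path is only needed when importing a local corpus.
    fetcher.import_corpus(corpus_name, local_path=local_path, branch=branch)
    return available
```

For example, fetch_corpus("lat", "lat_models_cltk") would request a Latin corpus, assuming that language code and corpus name are valid for the installed CLTK version.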