8.1.4. cltk.data package¶
Init for data modules.
8.1.4.1. Submodules¶
8.1.4.2. cltk.data.fetch module¶
Import CLTK corpora.
TODO: ? Fix so import_corpora() can take relative path.
TODO: ? Add https://github.com/cltk/pos_latin
TODO: Consider renaming all “import” to “clone”
- class cltk.data.fetch.ProgressPrinter[source]¶
Bases: RemoteProgress
Class that implements progress reporting.
- update(op_code, cur_count, max_count=None, message='')[source]¶
Called whenever the progress changes.
- Parameters:
op_code – Integer that can be compared against Operation IDs and stage IDs. Stage IDs are BEGIN and END. BEGIN will only be set once for each Operation ID, and the same holds for END. BEGIN and END may be set at once if only one progress message was emitted due to the speed of the operation. Between BEGIN and END, none of these flags will be set. Operation IDs are all held within the OP_MASK. Only one Operation ID will be active per call.
cur_count – Current absolute count of items.
max_count – The maximum count of items we expect. It may be None if there is no maximum number of items or if it is (yet) unknown.
message – In case of the WRITING operation, it contains the number of bytes transferred. It may be used for other purposes as well.
- Note:
You may read the contents of the current line in self._cur_line.
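The relationship between stage IDs and Operation IDs described above can be sketched with plain bit masks. This is a minimal illustration, not the ProgressPrinter implementation: the flag values below are stand-ins chosen for the example (real code would take them from the RemoteProgress base class), and `describe_op_code` is a hypothetical helper.

```python
# Stand-in flag values for illustration only (assumption); in real code
# these constants come from the RemoteProgress base class.
BEGIN = 1
END = 2
COUNTING = 4
WRITING = 16

STAGE_MASK = BEGIN | END   # bits that mark the stage
OP_MASK = ~STAGE_MASK      # every other bit identifies the operation


def describe_op_code(op_code):
    """Split an op_code into its Operation ID and the stage flags set on it.

    Hypothetical helper, not part of CLTK.
    """
    stages = []
    if op_code & BEGIN:
        stages.append("BEGIN")
    if op_code & END:
        stages.append("END")
    operation_id = op_code & OP_MASK
    return operation_id, stages


# A very fast operation may report BEGIN and END in a single call.
fast = describe_op_code(COUNTING | BEGIN | END)
# A mid-transfer update carries no stage flag at all.
middle = describe_op_code(WRITING)
```

An `update()` override could apply the same masking to its `op_code` argument to decide, for example, when to start and finish a progress line.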
- error_lines: List[str]¶
- other_lines: List[str]¶
- class cltk.data.fetch.FetchCorpus(language, testing=False)[source]¶
Bases: object
Import CLTK corpora.
- _get_user_defined_corpora()[source]¶
Check CLTK_DATA_DIR + ‘/distributed_corpora.yaml’ for any custom, distributed corpora that the user wants to load locally.
- _get_library_defined_corpora()[source]¶
Pull from LANGUAGE_CORPORA and return corpora for given language.
- property list_corpora¶
Show corpora available for the CLTK to download.
- static _copy_dir_recursive(src_rel, dst_rel)[source]¶
Copy contents of one directory to another. The dst_rel directory must not already exist. Source: http://stackoverflow.com/a/1994840
TODO: Move this to file_operations.py module.
- Parameters:
src_rel (str) – Directory to be copied.
dst_rel (str) – Directory to be created with contents of src_rel.
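A recursive copy of this kind can be sketched with the standard library alone. The function below is an illustrative stand-in in the spirit of the Stack Overflow answer cited above, not the CLTK source; the name `copy_dir_recursive` and the single-file fallback are assumptions made for the example.

```python
import errno
import shutil


def copy_dir_recursive(src, dst):
    """Copy the directory tree at src to a new directory dst.

    Illustrative sketch only. dst must not already exist, because
    shutil.copytree() creates it and raises FileExistsError otherwise.
    """
    try:
        shutil.copytree(src, dst)
    except OSError as exc:
        if exc.errno == errno.ENOTDIR:
            # src turned out to be a single file, so copy just that file.
            shutil.copy(src, dst)
        else:
            raise
```

Letting `FileExistsError` propagate keeps the "destination cannot exist" rule visible to the caller instead of silently merging into an existing directory.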
- _get_corpus_properties(corpus_name)[source]¶
Check whether a corpus is available for import.
- Parameters:
corpus_name (str) – Name of available corpus.
- Return type:
str
- _git_user_defined_corpus(corpus_name, corpus_type, uri, branch='master')[source]¶
Clone or update a git repo defined by user. TODO: This code is very redundant with what’s in import_corpus(), could be refactored.
- import_corpus(corpus_name, local_path=None, branch='master')[source]¶
Download a remote corpus, or load a local one, into the directory ~/cltk_data.
TODO: maybe add from git import RemoteProgress
TODO: refactor this, it’s getting kinda long
- Parameters:
corpus_name (str) – The name of an available corpus.
local_path (str) – A filepath, required when importing local corpora.
branch (str) – What Git branch to clone.
- Return type:
None
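A call sequence using the signatures documented above might look like the following sketch. The wrapper `fetch_corpus` is a hypothetical convenience function, not part of CLTK, and the language code and corpus name passed to it are left to the caller because valid values are not listed in this section.

```python
def fetch_corpus(language, corpus_name, local_path=None, branch="master"):
    """Import one corpus for the given language (hypothetical wrapper).

    The import is done lazily so that defining this function does not
    require CLTK to be installed; calling it does.
    """
    from cltk.data.fetch import FetchCorpus

    fetcher = FetchCorpus(language=language)
    # local_path is only needed for corpora loaded from the local filesystem.
    fetcher.import_corpus(corpus_name, local_path=local_path, branch=branch)
    return fetcher
```

Before calling it, the `list_corpora` property of a `FetchCorpus` instance can be inspected to find a corpus name that is actually available for the chosen language.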