6. Pipelines, Processes, Docs, and Words
See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20data%20types.ipynb for a detailed walkthrough of CLTK data types.
The CLTK contains four important, native data types:
cltk.core.data_types.Word: Contains all processed information for each word token, stored as attributes on the object. A Process adds data to each Word. See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb for a full demonstration of what kind of information is stored in a Word.
sentence_embeddings: a weighted average of the word embeddings of the sentence.
cltk.core.data_types.Doc: Contains Doc.raw, which is the original input string, and Doc.words, which is a list of Word objects. A Doc is the input and output of each Process and the final output of NLP(). See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb for a full demonstration of what kind of information is stored in a Doc.
cltk.core.data_types.Process: Takes and returns a Doc. Each Process works on the information within the Doc, then annotates each Word in Doc.words with its results.
cltk.core.data_types.Pipeline: Has a list of Process objects at Pipeline.processes. Predefined pipelines exist for some languages (see Languages), and custom pipelines may be created for these or other languages. See notebook https://github.com/cltk/cltk/blob/master/notebooks/Make%20custom%20Process%20and%20add%20to%20Pipeline.ipynb for an example of creating a new Process and adding it to a custom Pipeline. For an illustration of how Process objects inherit from one another, see figure Inheritance of Pipeline class.
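
The way these four types fit together can be sketched in plain Python. This is a minimal illustration of the data flow described above, not the CLTK implementation: the classes below are stand-ins written for this example, and the names run, analyze, WhitespaceTokenizer, LowercaseLemmatizer, and the fields string and lemma are assumptions made for the sketch. The real classes in cltk.core.data_types carry more information and may differ in their signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    """Stand-in for a word token; each Process fills in more fields."""
    string: str                  # hypothetical field: the surface form
    lemma: Optional[str] = None  # hypothetical field: set by a later Process


@dataclass
class Doc:
    """Stand-in for a document: the raw input plus its annotated words."""
    raw: str
    words: List[Word] = field(default_factory=list)


class Process:
    """Base stand-in: every Process takes a Doc and returns a Doc."""

    def run(self, doc: Doc) -> Doc:
        raise NotImplementedError


class WhitespaceTokenizer(Process):
    """Toy Process: creates one Word per whitespace-separated token."""

    def run(self, doc: Doc) -> Doc:
        doc.words = [Word(string=token) for token in doc.raw.split()]
        return doc


class LowercaseLemmatizer(Process):
    """Toy Process: 'lemmatizes' by lowercasing and stripping punctuation."""

    def run(self, doc: Doc) -> Doc:
        for word in doc.words:
            word.lemma = word.string.lower().strip(".,;:!?")
        return doc


@dataclass
class Pipeline:
    """Stand-in for a pipeline: an ordered list of processes."""
    processes: List[Process]


def analyze(text: str, pipeline: Pipeline) -> Doc:
    """Pass a fresh Doc through each Process in order and return the result."""
    doc = Doc(raw=text)
    for process in pipeline.processes:
        doc = process.run(doc)
    return doc


pipeline = Pipeline(processes=[WhitespaceTokenizer(), LowercaseLemmatizer()])
doc = analyze("Gallia est omnis divisa.", pipeline)
print([(word.string, word.lemma) for word in doc.words])
```

The order of Pipeline.processes matters in this sketch: the toy lemmatizer reads Doc.words, so it only has something to annotate once the tokenizer has run. The same Doc object is threaded through every step, which is why a single Doc can serve as both the input and the output of each Process.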