CLTK data types
This page summarizes the core data structures you will work with when using CLTK. The first section covers objects returned by NLP().analyze(). The second section covers internal types that can be used to override defaults.
Objects returned by NLP().analyze()
Doc
The Doc object is the top-level container returned by NLP().analyze().
Key attributes:
- language: Resolved Language metadata.
- words: List of Word objects (token-level annotations).
- raw: Original input text.
- normalized_text: Normalized version of raw when a normalizer runs.
- sentence_boundaries: List of (start, stop) character offsets.
- sentence_embeddings: Optional embeddings keyed by sentence index.
- sentence_translations: dict[int, Translation] keyed by sentence index.
- translation: Optional aggregated translation string.
- translations: List of Translation objects (usually per sentence).
- pipeline: The Pipeline instance that produced the doc.
- backend and model: Backend/model identifiers used for processing.
- metadata: Free-form metadata populated by processes.
Helper properties:
- sentence_strings: Returns sentence substrings computed from normalized_text and sentence_boundaries.
- sentences: Returns Sentence objects grouped by Word.index_sentence and ordered by Word.index_token. Each Sentence includes the per-sentence translation and embedding when present.
Notes:
- The translation field is an optional string. Structured translations live in sentence_translations and translations.
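The sentence_strings helper described above amounts to slicing the normalized text once per boundary pair. The following is a minimal sketch using plain Python values rather than a real Doc. It assumes each (start, stop) pair uses an exclusive stop offset, as in Python slicing; the source does not state which convention CLTK uses, and the function name slice_sentences is illustrative only.

```python
# Stand-in values; with CLTK these would be read from a Doc instance.
normalized_text = "Veni. Vidi. Vici."
# Assumption: stop offsets are exclusive, as in Python slices.
sentence_boundaries = [(0, 5), (6, 11), (12, 17)]


def slice_sentences(text: str, boundaries: list[tuple[int, int]]) -> list[str]:
    """Return one substring per (start, stop) character-offset pair."""
    return [text[start:stop] for start, stop in boundaries]


sentence_strings = slice_sentences(normalized_text, sentence_boundaries)
print(sentence_strings)
```

Because the offsets index into normalized_text, they only line up with raw when no normalizer has changed the text length.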
Word
The Word object contains per-token annotations.
Key attributes:
- index_char_start, index_char_stop: Character offsets.
- index_token: Token index within its sentence.
- index_sentence: Sentence index within the doc.
- string: Token surface form.
- lemma, stem: Lemma/stem if produced.
- upos: A UDPartOfSpeechTag holding the POS tag (e.g., noun, verb), conformant to the Universal Dependencies standard.
- features: A UDFeatureTagSet holding the full morphosyntactic tags, conformant to the Universal Dependencies standard.
- dependency_relation: A UDDeprelTag holding the dependency relation label, conformant to the Universal Dependencies standard.
- governor: An int pointing to the dependency head in the parse tree.
- embedding: Optional vector embedding.
- stop: Stopword flag.
- named_entity: Named entity tag if available.
- enrichment: WordEnrichment bundle (glosses, translations, IPA, etc.).
- annotation_sources: Provenance per annotation type.
- confidence: Confidence scores per annotation type.
Notes:
- An internal _doc reference may be attached to allow back-references to the parent Doc.
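To illustrate how an integer governor can be followed to a head word, here is a small sketch built on a hypothetical TokenStub class, not the real Word. It assumes that governor holds the index_token of the head within the same sentence and that None marks the root; the source only says the value is an int pointing to the head, so verify the actual convention before relying on this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenStub:
    """Hypothetical stand-in carrying three of the Word attributes listed above."""

    index_token: int
    string: str
    # Assumption: governor is the index_token of the head word in the same
    # sentence, and None marks the root.
    governor: Optional[int]


def head_of(token: TokenStub, sentence: list[TokenStub]) -> Optional[TokenStub]:
    """Return the token that `token` depends on, or None for the root."""
    if token.governor is None:
        return None
    by_index = {t.index_token: t for t in sentence}
    return by_index.get(token.governor)


# "puella cantat": the subject depends on the verb, which is the root.
sentence = [
    TokenStub(0, "puella", governor=1),
    TokenStub(1, "cantat", governor=None),
]
for token in sentence:
    head = head_of(token, sentence)
    print(token.string, "<-", head.string if head else "ROOT")
```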
UDPartOfSpeechTag
Universal Dependencies part-of-speech tag enum used for Word.upos.
Notes:
- Values follow the UD POS tagset; see src/cltk/morphosyntax/ud_pos.py for the full list.
- See the UD website for full definitions: https://universaldependencies.org/u/pos/index.html
UDFeatureTagSet
Bundle of Universal Dependencies morphosyntactic features used for Word.features. Consists of one or more UDFeatureTag objects. Example:
>>> from cltk import NLP
>>> nlp = NLP("lat", suppress_banner=True)
>>> doc = nlp.analyze("accipe")
>>> doc.words[0].features
UDFeatureTagSet([UDFeatureTag(Aspect=Imperfective), UDFeatureTag(Mood=Imperative), UDFeatureTag(Number=Singular), UDFeatureTag(Person=Second person), UDFeatureTag(Tense=Present), UDFeatureTag(VerbForm=Finite), UDFeatureTag(Voice=Active)])
Notes:
- Exposes UD feature keys and values as a structured tag set; see src/cltk/morphosyntax/ud_features.py.
UDDeprelTag
Universal Dependencies dependency relation tag used for Word.dependency_relation.
Notes:
- Values follow the UD dependency relation set; see src/cltk/morphosyntax/ud_deprels.py.
- See the UD website for full definitions: https://universaldependencies.org/u/dep/index.html
Sentence
The Sentence object groups words and optional metadata for a sentence.
Key attributes:
- words: List of Word objects for the sentence.
- index: Sentence index.
- embedding: Optional sentence embedding.
- translation: Optional Translation for the sentence.
- annotation_sources: Provenance for sentence-level annotations.
Notes:
- Sentence objects are produced by Doc.sentences and are derived from Word.index_sentence and Word.index_token.
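The grouping rule above (group by index_sentence, order by index_token) can be sketched in a few lines. WordStub and group_words are hypothetical names used only for this illustration; the real Doc.sentences also attaches translations and embeddings, which are omitted here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WordStub:
    """Hypothetical stand-in with only the three fields used for grouping."""

    string: str
    index_sentence: int
    index_token: int


def group_words(words: list[WordStub]) -> list[list[WordStub]]:
    """Group by index_sentence, then order each group by index_token."""
    grouped: dict[int, list[WordStub]] = defaultdict(list)
    for word in words:
        grouped[word.index_sentence].append(word)
    return [
        sorted(grouped[idx], key=lambda w: w.index_token) for idx in sorted(grouped)
    ]


# Deliberately shuffled input to show that the indices drive the result.
words = [
    WordStub("cano", 0, 2),
    WordStub("vidi", 1, 1),
    WordStub("arma", 0, 0),
    WordStub("veni", 1, 0),
    WordStub("virumque", 0, 1),
]
grouped_strings = [[w.string for w in sent] for sent in group_words(words)]
print(grouped_strings)
```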
Translation
Structured translation metadata.
Key attributes:
- source_lang_id: Optional source language ID.
- target_lang_id: Optional target language ID.
- text: Translation text.
- notes: Optional notes from the translation process.
- confidence: Optional confidence value in [0, 1].
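As a rough illustration of how per-sentence translations keyed by sentence index could be combined into a single aggregated string, consider the sketch below. TranslationStub mirrors the attributes listed above but is a hypothetical stand-in, the "eng" language ID is a placeholder, and the joining rule (index order, single space) is an assumption; the source does not describe how CLTK builds the aggregated translation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranslationStub:
    """Hypothetical stand-in mirroring the attributes listed above."""

    text: str
    source_lang_id: Optional[str] = None
    target_lang_id: Optional[str] = None
    notes: Optional[str] = None
    confidence: Optional[float] = None

    def __post_init__(self) -> None:
        # The documented range for confidence is [0, 1].
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")


def aggregate(sentence_translations: dict[int, TranslationStub]) -> str:
    """Join per-sentence texts in sentence-index order (assumed joining rule)."""
    return " ".join(
        sentence_translations[idx].text for idx in sorted(sentence_translations)
    )


sentence_translations = {
    1: TranslationStub("I saw.", target_lang_id="eng", confidence=0.9),
    0: TranslationStub("I came.", target_lang_id="eng", confidence=0.95),
}
aggregated = aggregate(sentence_translations)
print(aggregated)
```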
Gloss
Contextual and dictionary gloss information.
Key attributes:
- dictionary: Dictionary gloss (if provided).
- context: Contextual gloss (if provided).
- alternatives: List of ScoredText alternatives with optional probabilities.
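Since the probabilities on alternatives are optional, code that ranks them has to cope with missing scores. The sketch below shows one way to do that. ScoredTextStub is a hypothetical stand-in, and its field names text and probability are assumptions, not confirmed attributes of CLTK's ScoredText.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredTextStub:
    """Hypothetical stand-in; the field names text and probability are assumed."""

    text: str
    probability: Optional[float] = None


def best_alternative(
    alternatives: list[ScoredTextStub],
) -> Optional[ScoredTextStub]:
    """Return the highest-probability alternative.

    Unscored alternatives are only returned when nothing is scored, in which
    case the first alternative (or None for an empty list) is used.
    """
    scored = [a for a in alternatives if a.probability is not None]
    if not scored:
        return alternatives[0] if alternatives else None
    return max(scored, key=lambda a: a.probability)


alternatives = [
    ScoredTextStub("tools", 0.2),
    ScoredTextStub("weapons", 0.7),
    ScoredTextStub("arms"),  # no probability supplied
]
best = best_alternative(alternatives)
print(best.text)
```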
Internal types
These types may be used to override CLTK's defaults.
Language
Language metadata used for resolution and display.
Key attributes:
- name: Human-readable language name.
- glottolog_id: Glottolog ID (used for pipeline selection).
- identifiers, iso, iso_set: Identifier metadata.
- classification, family_id, parent_id: Family and lineage info.
- scripts, orthographies, alt_names: Orthography and naming data.
- dialects: List of Dialect records.
Notes:
- For user-defined languages, supply a glottolog_id and name at minimum.
Dialect
Dialect metadata associated with a parent Language.
Key attributes:
- glottolog_id: Dialect Glottolog ID.
- language_code: Optional dialect code alias.
- name: Dialect name.
- status, alt_names, scripts, orthographies: Descriptive metadata.
Pipeline
A pipeline is an ordered set of processes that transforms a Doc.
Key attributes:
- processes: List of Process classes or instances (order matters).
- description: Human-readable description.
- language, dialect, glottolog_id: Optional language metadata.
- spec: Optional declarative pipeline spec.
Helper methods:
- add_process(process): Append a process to the pipeline.
- describe(): Return a human-friendly process list (uses registry for spec).
Notes:
- If glottolog_id is set, the pipeline can auto-resolve language/dialect.
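To make the "order matters" point concrete, the sketch below runs the same two steps in both orders. Everything in it is a hypothetical stand-in (DocStub, PipelineStub, the two step classes, and the run_all helper); in particular, the source does not say how CLTK executes a pipeline, so calling each step's run() in list order is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DocStub:
    """Hypothetical stand-in for Doc with only the fields this sketch touches."""

    raw: str
    normalized_text: str = ""
    tokens: list[str] = field(default_factory=list)


class LowercaseStep:
    process_id = "lowercase"

    def run(self, input_doc: DocStub) -> DocStub:
        input_doc.normalized_text = input_doc.raw.lower()
        return input_doc


class WhitespaceTokenizeStep:
    process_id = "whitespace_tokenize"

    def run(self, input_doc: DocStub) -> DocStub:
        # Uses the normalized text only if an earlier step has produced it.
        text = input_doc.normalized_text or input_doc.raw
        input_doc.tokens = text.split()
        return input_doc


@dataclass
class PipelineStub:
    """Hypothetical stand-in: an ordered list of steps plus a description."""

    description: str
    processes: list = field(default_factory=list)

    def add_process(self, process) -> None:
        self.processes.append(process)

    def describe(self) -> list[str]:
        return [p.process_id for p in self.processes]


def run_all(pipeline: PipelineStub, doc: DocStub) -> DocStub:
    """Assumed execution model: call each step's run() in list order."""
    for process in pipeline.processes:
        doc = process.run(doc)
    return doc


normalize_first = PipelineStub("lowercase, then tokenize")
normalize_first.add_process(LowercaseStep())
normalize_first.add_process(WhitespaceTokenizeStep())

tokenize_first = PipelineStub("tokenize, then lowercase")
tokenize_first.add_process(WhitespaceTokenizeStep())
tokenize_first.add_process(LowercaseStep())

print(normalize_first.describe())
print(run_all(normalize_first, DocStub("Arma Virumque")).tokens)
print(run_all(tokenize_first, DocStub("Arma Virumque")).tokens)
```

In the second pipeline the tokenizer runs before any normalized text exists, so its tokens keep the original casing.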
Process
Base class for a pipeline step. Each process accepts, transforms, and finally returns a Doc.
Key attributes:
- process_id: Stable identifier used by registries/specs.
- glottolog_id: Optional language code for language-specific logic.
Required method:
- run(input_doc) -> Doc: Apply the process and return the modified doc.
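The contract above (a process_id, an optional glottolog_id, and a required run method that returns the doc) can be sketched with an abstract base class. ProcessStub, DocStub, and CollapseWhitespace are hypothetical names for this illustration; a real custom step would subclass CLTK's own Process and operate on a real Doc.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocStub:
    """Hypothetical stand-in for Doc with two of its documented fields."""

    raw: str
    normalized_text: Optional[str] = None


class ProcessStub(ABC):
    """Hypothetical stand-in for the Process base class."""

    process_id: str = ""
    glottolog_id: Optional[str] = None

    @abstractmethod
    def run(self, input_doc: DocStub) -> DocStub:
        """Apply the step and return the modified doc."""


class CollapseWhitespace(ProcessStub):
    """Example step: write a whitespace-normalized copy of raw."""

    process_id = "collapse_whitespace"

    def run(self, input_doc: DocStub) -> DocStub:
        input_doc.normalized_text = " ".join(input_doc.raw.split())
        return input_doc


doc = CollapseWhitespace().run(DocStub("arma   virumque\ncano"))
print(repr(doc.normalized_text))
```

Marking run as abstract means a subclass that forgets to implement it fails at instantiation rather than partway through a pipeline.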
CLTKConfig
Bundled configuration for initializing NLP().
Key attributes:
- language_code: Language identifier string (Glottolog, ISO, or name).
- language: Optional Language object (used when defining custom languages).
- backend: Backend selection (stanza, openai, ollama, etc.).
- model: Optional model name.
- custom_pipeline: Optional Pipeline instance to override defaults.
- suppress_banner: Toggle console banner output.
Backend-specific config blocks:
- stanza, openai, mistral, ollama: Optional backend config blocks.
- active_backend_config: Returns the config block for the selected backend.
Notes:
- Provide either language_code or language. If only language is supplied, its glottolog_id is used as the language code.
- Only one backend config block can be provided at a time.
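The two rules in the notes above can be expressed as validation logic. The sketch below uses hypothetical ConfigStub and LanguageStub dataclasses, not CLTKConfig itself. The default backend, the use of plain dicts for config blocks, the error messages, and the placeholder ID "exam1234" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageStub:
    """Hypothetical stand-in with the two minimum fields for a custom language."""

    name: str
    glottolog_id: str


@dataclass
class ConfigStub:
    """Hypothetical stand-in mirroring some of the CLTKConfig attributes."""

    language_code: Optional[str] = None
    language: Optional[LanguageStub] = None
    backend: str = "stanza"  # assumption: default chosen only for this sketch
    stanza: Optional[dict] = None
    openai: Optional[dict] = None
    mistral: Optional[dict] = None
    ollama: Optional[dict] = None

    def __post_init__(self) -> None:
        if self.language_code is None and self.language is None:
            raise ValueError("provide language_code or language")
        if self.language_code is None:
            # Only language was supplied: fall back to its glottolog_id.
            self.language_code = self.language.glottolog_id
        blocks = [self.stanza, self.openai, self.mistral, self.ollama]
        if sum(block is not None for block in blocks) > 1:
            raise ValueError("only one backend config block may be provided")

    @property
    def active_backend_config(self) -> Optional[dict]:
        """Return the config block named after the selected backend, if any."""
        return getattr(self, self.backend, None)


# "exam1234" is a made-up placeholder, not a real Glottolog ID.
custom = LanguageStub(name="Example Language", glottolog_id="exam1234")
config = ConfigStub(language=custom, backend="ollama", ollama={"model": "placeholder"})
print(config.language_code, config.active_backend_config)
```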