Output Documents
This page documents the helper functions that turn a processed Doc into human‑readable or machine‑readable output. Each function is pure (no file I/O) and returns its result in memory: a string for the text formats, or a pyarrow.Table for doc_to_feature_table. The caller decides how to persist it.
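Because the formatters only return values, persistence is left to the caller. The following is a minimal sketch of that pattern, not part of the library: `save_text` is a hypothetical helper, and `rendered` is a stand-in for whatever string a formatter returns.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def save_text(text: str, path: Path) -> Path:
    """Write an already-rendered string to disk as UTF-8 and return the path."""
    # Create missing parent directories so the caller can pass a nested path.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Stand-in for the string returned by one of the formatters on this page.
rendered = "# Reader's Guide\n\n## Sentence 1\n\n> ὅτι δὲ τὸν τρόπον τοῦτον\n"

# A temporary directory keeps the sketch free of side effects on real paths.
out_dir = Path(tempfile.mkdtemp())
saved = save_text(rendered, out_dir / "guides" / "readers_guide.md")
```

Passing `encoding="utf-8"` explicitly matters here, since the output contains polytonic Greek and the platform default encoding may differ.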
format_readers_guide(doc: Doc, *, include_provenance: bool = False, include_confidence: bool = False) -> str
Render a Markdown “reader’s guide” aimed at students and scholars. The output is UTF‑8 Markdown with headings, blockquotes, and details blocks.
Structure
- H1 title from doc.metadata["title"] or doc.metadata["reference"], else “Reader’s Guide”.
- Optional pronunciation line when all enriched words share the same IPA mode.
- Per sentence:
  - ## Sentence N
  - Blockquote of the sentence surface text (tokens joined with spaces).
  - ### Word-by-word
  - For each word:
    - ### <surface>
    - Italic POS name and bold gloss when available.
    - Bullets: Lemma, Gloss, Dictionary Gloss (if distinct), Dependency Role (name + code), Governor (1-based index), IPA (with mode), Syllables.
    - Optional <details> blocks for phonology trace and pedagogical notes.
Example (truncated)
# Reader's Guide
**Pronunciation mode:** attic_5c_bce
## Sentence 1
> ὅτι δὲ τὸν τρόπον τοῦτον
### Word-by-word
### ὅτι
*subordinating conjunction* · **that / because (subordinating conjunction)**
- **Lemma:** ὅτι
- **Gloss:** that / because (subordinating conjunction)
- **Dependency Role:** marker (`mark`)
- **Governor:** token 8
- **IPA (attic_5c_bce):** `/ˈho.ti/`
- **Syllables:** ὅ-τι
<details>
<summary>Phonology</summary>
- initial rough breathing realised as /h-/ in Attic
- vowel qualities preserved; no contraction
- stress on first syllable (acute)
</details>
Use this when you need a study‑friendly breakdown with enrichment fields (gloss, IPA, orthography, pedagogy).
Options
- include_provenance=True adds a short provenance section near the top.
- include_confidence=True adds per-token confidence summaries when present.
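The title fallback and the pronunciation-line rule from the Structure list can be sketched as follows. This is an illustration under assumptions, not the library's implementation: `pick_title`, `shared_ipa_mode`, and `render_header` are hypothetical names, metadata is modelled as a plain dict, and words without enrichment are represented by `None` and ignored.

```python
from __future__ import annotations

from typing import Optional


def pick_title(metadata: dict) -> str:
    """Prefer "title", then "reference", then the generic fallback."""
    return metadata.get("title") or metadata.get("reference") or "Reader's Guide"


def shared_ipa_mode(modes: list[Optional[str]]) -> Optional[str]:
    """Return the IPA mode only if every enriched word reports the same one."""
    # Assumption: None marks a word without enrichment and is skipped.
    seen = {mode for mode in modes if mode is not None}
    return seen.pop() if len(seen) == 1 else None


def render_header(metadata: dict, modes: list[Optional[str]]) -> str:
    """Build the H1 line plus the optional pronunciation line."""
    lines = [f"# {pick_title(metadata)}"]
    mode = shared_ipa_mode(modes)
    if mode:
        lines.append(f"**Pronunciation mode:** {mode}")
    return "\n".join(lines)


header = render_header({}, ["attic_5c_bce", None, "attic_5c_bce"])
mixed = render_header({"reference": "Example reference"}, ["mode_a", "mode_b"])
```

With mixed modes the pronunciation line is omitted entirely, which matches the "when all enriched words share the same IPA mode" condition.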
doc_to_feature_table(doc: Doc, *, include_provenance: bool = False, include_confidence: bool = False) -> pa.Table
Convert a Doc into a tidy pyarrow.Table with one row per token, combining morphosyntax, dependencies, UD features, and selected metadata.
Columns
- Sentence index, global token index, token index in sentence
- FORM, LEMMA, UPOS, HEAD, DEPREL
- Metadata columns (if present on words): e.g., translation, definitions
- UD feature columns: each UD feature key becomes its own column
- Dependency extras (if present): governor sentence index, etc.
Example
from cltk.utils.file_outputs import doc_to_feature_table
table = doc_to_feature_table(doc)
print(table.schema)
print(table.to_pandas().head())
Useful for analytics, export to Parquet/CSV, or downstream ML pipelines.
When include_provenance=True or include_confidence=True, the table gains
prov_* and conf_* columns for key fields (lemma, upos, feats, head, deprel).
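The "one column per UD feature key" layout described under Columns can be illustrated without pyarrow. The sketch below is an assumption-laden stand-in: `widen_features` is a hypothetical helper, tokens are plain dicts, and the real function's column names and ordering may differ.

```python
from __future__ import annotations

BASE_COLUMNS = ["FORM", "LEMMA", "UPOS", "HEAD", "DEPREL"]


def widen_features(rows: list[dict]) -> tuple[list[str], list[dict]]:
    """Spread each token's UD feature dict into one column per feature key."""
    # Collect every feature key seen on any token; sort for a stable order.
    feat_keys = sorted({key for row in rows for key in row.get("feats", {})})
    wide_rows = []
    for row in rows:
        wide = {col: row.get(col) for col in BASE_COLUMNS}
        for key in feat_keys:
            # Tokens lacking a feature get None, so all rows share one schema.
            wide[key] = row.get("feats", {}).get(key)
        wide_rows.append(wide)
    return BASE_COLUMNS + feat_keys, wide_rows


tokens = [
    {"FORM": "ὅτι", "LEMMA": "ὅτι", "UPOS": "SCONJ", "HEAD": 8, "DEPREL": "mark"},
    {
        "FORM": "τὸν", "LEMMA": "ὁ", "UPOS": "DET", "HEAD": 4, "DEPREL": "det",
        "feats": {"Case": "Acc", "Gender": "Masc", "Number": "Sing"},
    },
]
columns, wide = widen_features(tokens)
```

Filling absent features with `None` keeps every row the same width, which is what makes the result "tidy" for Parquet or CSV export.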
doc_to_conllu(doc: Doc, *, include_provenance: bool = False, include_confidence: bool = False) -> str
Render a Doc as CoNLL‑U v2 text. One sentence per block, 10 tab‑separated fields per token.
Behavior
- Uses doc.sentences ordering; falls back to doc.words if sentences are absent.
- Writes ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
- Preserves existing lemmas/UPOS/FEATS/DEPREL/governor; leaves blanks (_) when data is missing.
- HEAD is 1‑based; 0 for roots.
Example
# sent_id = 1
# text = ὅτι δὲ τὸν τρόπον τοῦτον
1	ὅτι	ὅτι	SCONJ	_	_	8	mark	_	_
2	δὲ	δέ	PART	_	_	8	discourse	_	_
...
Use this to round‑trip with UD tools, validators, or treebanks. The function is deterministic and does no I/O; write the returned string to disk if needed.
When include_provenance=True, the output adds comment lines that embed
provenance records (# cltk_provenance_default=..., # cltk_prov.<id>=...).
When include_confidence=True, token confidences appear in the MISC column.
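The field layout described under Behavior (ten tab-separated fields, `_` for missing data, HEAD 0 for roots) can be sketched in a few lines. This is an illustrative stand-in rather than the function's actual code: `token_line` and `sentence_block` are hypothetical names, and tokens are modelled as plain dicts.

```python
from __future__ import annotations

FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def token_line(token: dict) -> str:
    """Render one token as ten tab-separated fields, '_' where data is missing."""
    cells = []
    for name in FIELDS:
        value = token.get(name)
        # Only None and "" count as missing; a HEAD of 0 (root) must survive.
        cells.append("_" if value is None or value == "" else str(value))
    return "\t".join(cells)


def sentence_block(sent_id: int, text: str, tokens: list[dict]) -> str:
    """Render comment lines, token lines, and the blank line ending a sentence."""
    lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
    lines.extend(token_line(tok) for tok in tokens)
    return "\n".join(lines) + "\n\n"


first = token_line(
    {"ID": 1, "FORM": "ὅτι", "LEMMA": "ὅτι", "UPOS": "SCONJ", "HEAD": 8, "DEPREL": "mark"}
)
root = token_line({"ID": 8, "FORM": "x", "HEAD": 0, "DEPREL": "root"})
```

Testing for `None` and the empty string explicitly, rather than general falsiness, is what keeps a root's HEAD of 0 from being replaced by `_`.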
doc_to_igt_latex(...) -> str and doc_to_igt_html(...) -> str
Render token-level interlinear glossed text (IGT) for each sentence. Both helpers are pure (string return) and keep deterministic ordering and IDs.
Behavior
- One table per sentence, with a token row and a gloss/lemma/morph row.
- Gloss selection priority: word.enrichment.gloss → first lemma_translations → lemma → token string.
- Optional translation row per sentence if sentence_translations are present.
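The gloss selection priority above amounts to a fallback chain. A minimal sketch, assuming a simplified token shape: the `Token` dataclass and `pick_gloss` are hypothetical, and `enrichment_gloss` flattens what the list writes as word.enrichment.gloss.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    """Hypothetical stand-in for a word; not the library's own class."""
    string: str
    lemma: Optional[str] = None
    enrichment_gloss: Optional[str] = None
    lemma_translations: list[str] = field(default_factory=list)


def pick_gloss(tok: Token) -> str:
    """Apply the priority: enrichment gloss, first translation, lemma, token."""
    if tok.enrichment_gloss:
        return tok.enrichment_gloss
    if tok.lemma_translations:
        return tok.lemma_translations[0]
    return tok.lemma or tok.string


glosses = [
    pick_gloss(Token("ὅτι", lemma="ὅτι", enrichment_gloss="that / because")),
    pick_gloss(Token("τρόπον", lemma="τρόπος", lemma_translations=["way", "manner"])),
    pick_gloss(Token("δὲ", lemma="δέ")),
    pick_gloss(Token("τοῦτον")),
]
```

The final fallback to the token string means a gloss row never has an empty cell, even for completely unannotated tokens.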
Example
from cltk.exports import doc_to_igt_latex, doc_to_igt_html
latex = doc_to_igt_latex(doc, include_gloss=True)
html = doc_to_igt_html(doc, include_morph=True)
doc_to_tei_xml(...) -> str
Produce a TEI‑ish XML document with token annotations in <w> elements and dependency arcs in a <standOff> section.
Behavior
- Tokens are annotated with lemma, pos, and msd (UD features) when available.
- Dependencies are encoded as <relation> entries with stable token IDs.
- Root dependencies point to a sentence-level root anchor.
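The encoding described above can be approximated with the standard library's ElementTree. Everything beyond the element and attribute names given in the list is an assumption made for illustration: the `div` and `listRelation` wrappers, the ID scheme (`s1_w1`, `s1_root`), the use of `name`/`active`/`passive` on `<relation>`, and the direction of the arc. The real output of doc_to_tei_xml may differ.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Clark-notation name for xml:id; ElementTree serialises it as "xml:id".
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def sentence_to_tei(tokens: list[dict], sent_id: str = "s1") -> str:
    """Sketch: <w> elements for tokens, <relation> entries for dependencies."""
    root = ET.Element("div")
    sent = ET.SubElement(root, "s", {XML_ID: sent_id})
    stand_off = ET.SubElement(root, "standOff")
    relations = ET.SubElement(stand_off, "listRelation")

    for tok in tokens:
        tok_id = f"{sent_id}_w{tok['id']}"
        attrs = {XML_ID: tok_id}
        # Annotation attributes are written only when a value is available.
        for key in ("lemma", "pos", "msd"):
            if tok.get(key):
                attrs[key] = tok[key]
        ET.SubElement(sent, "w", attrs).text = tok["form"]

        head = tok.get("head")
        if head is None:
            continue
        # Assumption: head 0 points at a sentence-level root anchor ID.
        target = f"{sent_id}_root" if head == 0 else f"{sent_id}_w{head}"
        ET.SubElement(relations, "relation", {
            "name": tok.get("deprel", "dep"),
            "active": f"#{target}",
            "passive": f"#{tok_id}",
        })
    return ET.tostring(root, encoding="unicode")


xml_text = sentence_to_tei([
    {"id": 1, "form": "τὸν", "lemma": "ὁ", "pos": "DET",
     "msd": "Case=Acc|Gender=Masc|Number=Sing", "head": 2, "deprel": "det"},
    {"id": 2, "form": "τρόπον", "lemma": "τρόπος", "pos": "NOUN",
     "head": 0, "deprel": "root"},
])
```

Keeping the arcs in a separate stand-off section means the token stream stays readable on its own, while the IDs tie each relation back to its `<w>` element.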
Example
from cltk.exports import doc_to_tei_xml
xml = doc_to_tei_xml(doc, include_morph=True, include_translation=True)
doc_to_readers_guide_html(...) -> str
Render a self‑contained HTML reader’s guide with collapsible token cards, tooltips, and inline styling.
Behavior
- Each sentence has a surface line, optional translation, and a token strip with hover tooltips.
- Token cards expose lemma, POS, morphology, gloss, IPA, and dependencies.
- Output is a single HTML document (no external assets).
Example