User-Defined Pipelines
Warning: This is an alpha feature. APIs and behaviors may change; use in production with care.
This page describes the declarative pipeline system for CLTK.
Process IDs
Each process has a stable process_id used in configs and presets. Examples:
- normalize
- sentence_split
- morphosyntax.genai
- dependency.genai
- translation.genai
- enrichment.genai (legacy)
- enrichment.lexicon, enrichment.phonology, enrichment.idioms, enrichment.pedagogy
You can discover registered processes in code via ProcessRegistry.list_processes().
Example
Below is a minimal, working example that registers a Greek-only process to scan iambic trimeter and wires it into a TOML pipeline. The scanner is intentionally naive (syllable counts + basic long-vowel/diphthong heuristics), but it shows how to add a custom process without changing CLTK code.
scripts/pipeline_greek_iambic.toml:
language = "anci1242"
steps = ["prosody.iambic_trimeter"]
[step.prosody.iambic_trimeter]
enabled = true
output_key = "scansion"
scripts/scan_iambic_trimeter.py:
import re
import unicodedata
from typing import ClassVar
from cltk import NLP
from cltk.core.data_types import Doc, Pipeline, Process
from cltk.core.process_registry import register_process
_IAMBIC_TEMPLATE = ["x", "-", "u", "-", "x", "-", "u", "-", "x", "-", "u", "-"]
_VOWEL_RE = re.compile(r"[αεηιουω]+")
_DIPHTHONGS = ("αι", "ει", "οι", "υι", "αυ", "ευ", "ου", "ηυ", "ωυ")
_GREEK_GLOTTOLOG_IDS = {"anci1242"}
def _strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def _normalize_greek(text: str) -> str:
    text = _strip_diacritics(text).lower()
    text = re.sub(r"[0-9]", "", text)
    return re.sub(r"[^\w\s]", "", text)


def _scan_line(line: str) -> str:
    cleaned = _normalize_greek(line)
    syllables = _VOWEL_RE.findall(cleaned)
    quantities: list[str] = []
    for syl in syllables:
        long_by_vowel = "η" in syl or "ω" in syl
        long_by_diphthong = any(diph in syl for diph in _DIPHTHONGS)
        quantities.append("-" if long_by_vowel or long_by_diphthong else "u")
    quantities = quantities[: len(_IAMBIC_TEMPLATE)]
    pattern = [
        quantities[i] if i < len(quantities) else _IAMBIC_TEMPLATE[i]
        for i in range(len(_IAMBIC_TEMPLATE))
    ]
    return " ".join(pattern)


@register_process
class GreekIambicTrimeterProcess(Process):
    process_id: ClassVar[str] = "prosody.iambic_trimeter"
    glottolog_id: str | None = None
    output_key: str = "scansion"

    def run(self, input_doc: Doc) -> Doc:
        lang = input_doc.language
        iso = getattr(lang, "iso", None) or getattr(lang, "iso_set", {}).get("639-3")
        if lang.glottolog_id not in _GREEK_GLOTTOLOG_IDS and iso != "grc":
            raise ValueError("Iambic trimeter scanner is Ancient Greek only.")
        lines = [line.strip() for line in (input_doc.raw or "").splitlines() if line]
        scans = [_scan_line(line) for line in lines]
        input_doc.metadata.setdefault(self.output_key, scans)
        return input_doc
TEXT = """Εἴθ᾽ ὤφελ᾽ Ἀργοῦς μὴ διαπτάσθαι σκάφος
Κόλχων ἐς αἶαν κυανέας Συμπληγάδας,
μηδ᾽ ἐν νάπαισι Πηλίου πεσεῖν ποτε
τμηθεῖσα πεύκη, μηδ᾽ ἐρετμῶσαι χέρας
5ἀνδρῶν ἀριστέων οἳ τὸ πάγχρυσον δέρος
Πελίᾳ μετῆλθον."""
pipeline = Pipeline.from_toml("scripts/pipeline_greek_iambic.toml")
nlp = NLP(language_code="anci1242", custom_pipeline=pipeline)
doc = nlp.analyze(TEXT)
for line, scan in zip(TEXT.splitlines(), doc.metadata["scansion"]):
    print(line)
    print(scan)
Running python scripts/scan_iambic_trimeter.py prints:
% python scripts/scan_iambic_trimeter.py
𐤀 CLTK version '2.3.10'. When using the CLTK in research, please cite: https://aclanthology.org/2021.acl-demo.3/
Selected language "Ionic-Attic Ancient Greek" ("anci1242") without dialect.
Pipeline for `NLP("anci1242", backend="stanza")`: ['GreekIambicTrimeterProcess']
⸖ Process name: GreekIambicTrimeterProcess
Εἴθ᾽ ὤφελ᾽ Ἀργοῦς μὴ διαπτάσθαι σκάφος
- - u u - - u u - u u -
Κόλχων ἐς αἶαν κυανέας Συμπληγάδας,
u - u - u u u - u u u -
μηδ᾽ ἐν νάπαισι Πηλίου πεσεῖν ποτε
- u u - u - - u - u u -
τμηθεῖσα πεύκη, μηδ᾽ ἐρετμῶσαι χέρας
- - u - - - u u - - u u
5ἀνδρῶν ἀριστέων οἳ τὸ πάγχρυσον δέρος
u - u u - - u u u u u u
Πελίᾳ μετῆλθον.
u u u - u - u - x - u -
These examples are available in the scripts directory of the CLTK repository: https://github.com/cltk/cltk/tree/master/scripts
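To see what the heuristic does to part of a line, the normalization and vowel-run steps can be replayed with the standard library alone. The sketch below copies the vowel regex and diphthong list from the script above. The helper name `quantities` is illustrative and not part of CLTK.

```python
import re
import unicodedata

VOWEL_RE = re.compile(r"[αεηιουω]+")
DIPHTHONGS = ("αι", "ει", "οι", "υι", "αυ", "ευ", "ου", "ηυ", "ωυ")


def quantities(line: str) -> list[str]:
    """Return one naive quantity mark per vowel run (illustrative helper)."""
    # NFD splits accents and breathings into combining marks (category Mn),
    # which are dropped so that only bare letters remain.
    decomposed = unicodedata.normalize("NFD", line)
    bare = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).lower()
    marks = []
    for run in VOWEL_RE.findall(bare):
        # A run is marked long if it contains eta, omega, or a listed diphthong.
        is_long = "η" in run or "ω" in run or any(d in run for d in DIPHTHONGS)
        marks.append("-" if is_long else "u")
    return marks


print(quantities("μὴ διαπτάσθαι σκάφος"))  # ['-', 'u', 'u', '-', 'u', 'u']
```

The vowel run ια in διαπτάσθαι is counted as a single syllable here, which is one of the ways the scanner is intentionally naive.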
Process Registry
ProcessRegistry is importable from cltk.core.process_registry and lets you
list, fetch, and register pipeline processes at runtime.
from cltk.core.data_types import Classification, Doc, Language, Process
from cltk.core.process_registry import ProcessRegistry, register_process
language = Language(
    name="Latin",
    glottolog_id="lati1261",
    level="language",
    classification=Classification(level="language"),
)
doc = Doc(language=language, raw="Gallia est omnis divisa in partes tres.")
# Discover and run a built-in process.
processes = ProcessRegistry.list_processes()
normalize_cls = ProcessRegistry.get_process("normalize")
normalized_doc = normalize_cls().run(doc)
# Register a custom process and fetch it by ID.
@register_process
class CustomEchoProcess(Process):
    process_id = "custom.echo"

    def run(self, input_doc: Doc) -> Doc:
        input_doc.raw = input_doc.raw or ""
        return input_doc
assert ProcessRegistry.get_process("custom.echo") is CustomEchoProcess
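For readers unfamiliar with the pattern, a decorator-based registry can be sketched in plain Python. This is not CLTK's implementation: the names `_REGISTRY`, `register`, and `get` are hypothetical stand-ins, and rejecting duplicate IDs is a design choice of this sketch rather than documented CLTK behavior.

```python
# Hypothetical stand-in registry: maps process_id strings to classes.
_REGISTRY: dict[str, type] = {}


def register(cls: type) -> type:
    """Class decorator that records cls under its process_id."""
    process_id = getattr(cls, "process_id", None)
    if not process_id:
        raise ValueError(f"{cls.__name__} must define a process_id")
    if process_id in _REGISTRY:
        # Assumption of this sketch: duplicate IDs are an error.
        raise ValueError(f"duplicate process_id: {process_id}")
    _REGISTRY[process_id] = cls
    return cls  # returning the class leaves it usable as normal


def get(process_id: str) -> type:
    """Look up a registered class by its ID."""
    return _REGISTRY[process_id]


@register
class Echo:
    process_id = "custom.echo"

    def run(self, text: str) -> str:
        return text


print(sorted(_REGISTRY))  # ['custom.echo']
```

Because the decorator returns the class unchanged, registration is a side effect of importing the module that defines the class, which is why a custom process can be added without editing library code.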
Pipeline TOML
A pipeline spec is a TOML file with optional presets and per-step config. Steps can be
listed explicitly or inherited from a preset. Step configuration lives under the
[step.*] tables.
Key points:
- If steps is present, its order is used.
- If steps is omitted, preset defines the order.
- You can enable or disable steps with enabled = true/false.
- Both [step.dependency.genai] (nested) and [step."dependency.genai"] (quoted) are supported.
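The ordering rules above can be sketched as a small resolver over an already-parsed spec. This is an illustration, not CLTK's loader: `resolve_steps` and the `demo.preset` table are hypothetical, and treating a step without an `enabled` key as enabled is an assumption.

```python
# Hypothetical preset table; the name and contents are made up for this sketch.
PRESETS = {"demo.preset": ["normalize", "sentence_split", "translation.genai"]}


def resolve_steps(spec: dict, presets: dict[str, list[str]] = PRESETS) -> list[str]:
    """Return enabled step IDs in execution order (illustrative only)."""
    # An explicit `steps` list fixes the order; otherwise the preset does.
    order = spec["steps"] if "steps" in spec else presets[spec["preset"]]
    step_cfg = spec.get("step", {})
    # Assumption: a step with no `enabled` key counts as enabled.
    return [s for s in order if step_cfg.get(s, {}).get("enabled", True)]


from_preset = {"preset": "demo.preset", "step": {"sentence_split": {"enabled": False}}}
explicit = {"preset": "demo.preset", "steps": ["translation.genai", "normalize"]}

print(resolve_steps(from_preset))  # ['normalize', 'translation.genai']
print(resolve_steps(explicit))     # ['translation.genai', 'normalize']
```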
Example (scripts/pipeline_latin_genai.toml):
language = "lati1261"
preset = "latin.genai.default"
# If steps is omitted, preset provides ordering.
# If steps is provided, it overrides the preset ordering.
# steps = ["normalize", "sentence_split", "morphosyntax.genai", "dependency.genai", "enrichment.lexicon", "enrichment.phonology", "translation.genai"]
[step.normalize]
enabled = true
[step.sentence_split]
enabled = true
[step.morphosyntax.genai]
enabled = true
prompt_profile = "latin_ud_strict"
[step.dependency.genai]
enabled = true
prompt_profile = "latin_ud_strict"
# Phase 4 will replace enrichment.genai with composable enrichers; show future shape now:
[step.enrichment.lexicon]
enabled = true
source = "perseus"
[step.enrichment.phonology]
enabled = false
mode = "rule_trace"
[step.translation.genai]
enabled = true
prompt_profile = "student_friendly"
# Also support quoted keys:
[step."dependency.genai"]
# (same content as above if desired)
Presets
Presets are named pipeline specs that provide an ordered set of steps and default configuration. Current presets include:
- latin.genai.default
- latin.genai.student_friendly
- latin.genai.epigraphy_conservative
Use them with preset = "..." in your TOML.
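One way to picture how preset defaults and per-step settings could combine is a key-by-key merge in which the TOML values win. This is only a sketch under that assumption; whether CLTK merges per key or replaces a step's table wholesale is not stated here. The helper `merge_step_config` and the profile name `"default"` are hypothetical.

```python
import copy


def merge_step_config(
    defaults: dict[str, dict], overrides: dict[str, dict]
) -> dict[str, dict]:
    """Overlay per-step overrides on preset defaults (illustrative only)."""
    merged = copy.deepcopy(defaults)  # never mutate the preset's own defaults
    for step_id, cfg in overrides.items():
        merged.setdefault(step_id, {}).update(cfg)
    return merged


# Hypothetical preset defaults and user overrides for one step.
preset_defaults = {"translation.genai": {"enabled": True, "prompt_profile": "default"}}
user_overrides = {"translation.genai": {"prompt_profile": "student_friendly"}}

print(merge_step_config(preset_defaults, user_overrides))
# {'translation.genai': {'enabled': True, 'prompt_profile': 'student_friendly'}}
```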
Prompt Profiles
Prompt profiles let you select prompt bundles without subclassing processes. A profile can define templates per process and version.
Example usage in TOML:
[step.morphosyntax.genai]
prompt_profile = "latin_ud_strict"
[step.translation.genai]
prompt_profile = "student_friendly"
Use A TOML Pipeline With NLP
Load a TOML spec, pass it to NLP() as custom_pipeline, then call analyze():
from cltk import NLP
from cltk.core.data_types import Pipeline
pipeline = Pipeline.from_toml("scripts/pipeline_latin_genai.toml")
nlp = NLP(language_code="lati1261", backend="openai", custom_pipeline=pipeline)
doc = nlp.analyze("Gallia est omnis divisa in partes tres.")
This keeps the normal NLP flow while letting you control step order, enable/disable steps, and prompt profiles through the TOML.
CLI
Use the CLI to inspect or validate specs: