processes

Sentence splitting processes.

This module exposes a lightweight, language‑aware sentence splitter built around per‑language regular expressions, with each language identified by its Glottolog code. It defines a generic SentenceSplittingProcess and many concrete subclasses, one per language or language stage.

SentenceSplittingProcess

Bases: Process

Base class for sentence splitting processes.

Subclasses set glottolog_id and inherit the default algorithm, which delegates to a multi‑language regex splitter.

Attributes:

  • glottolog_id (Optional[str]) –

    Target language Glottolog code used to choose punctuation rules for sentence boundaries.
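The subclassing pattern described above can be sketched without importing CLTK. The classes below are hypothetical stand-ins (`BaseSplitter`, `LydianSplitter`, and the `describe` helper are not part of the library); they only illustrate how a concrete subclass overrides `glottolog_id` and inherits everything else.

```python
from typing import Optional


class BaseSplitter:
    """Hypothetical stand-in for SentenceSplittingProcess (not the CLTK class)."""

    process_id: str = "sentence_split"
    glottolog_id: Optional[str] = None

    def describe(self) -> str:
        # Same guard as in run(): a concrete language code is required.
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        return f"{self.process_id}:{self.glottolog_id}"


class LydianSplitter(BaseSplitter):
    """A concrete subclass only overrides the language code."""

    glottolog_id: Optional[str] = "lydi1241"


print(LydianSplitter().describe())  # sentence_split:lydi1241
```

Using the base class directly, with `glottolog_id` left as `None`, raises `ValueError` in this sketch, which mirrors the check performed in `run`.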

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
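To make the callable's contract concrete, here is a minimal sketch of a function with the same `(text, glottolog_id) -> list[(start, stop)]` shape. It is not CLTK's splitter: the `SENTENCE_END` table and its punctuation rules are assumptions chosen for illustration, and `split_spans` is a hypothetical name.

```python
import re

# Hypothetical punctuation table keyed by Glottolog code.
# The real per-language rules live in CLTK and may differ.
SENTENCE_END = {
    "akka1240": r"[.!?]",
    "clas1256": r"[\u0589]",  # Armenian full stop
}


def split_spans(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Return (start, stop) offsets, with stop exclusive, for each sentence."""
    if glottolog_id not in SENTENCE_END:
        raise ValueError(f"Unsupported glottolog_id: {glottolog_id}")
    pattern = re.compile(SENTENCE_END[glottolog_id])

    spans: list[tuple[int, int]] = []
    start = 0
    for match in pattern.finditer(text):
        stop = match.end()
        # Skip whitespace between the previous sentence and this one.
        while start < stop and text[start].isspace():
            start += 1
        if start < stop:
            spans.append((start, stop))
        start = stop

    # Keep trailing text that has no terminal punctuation.
    while start < len(text) and text[start].isspace():
        start += 1
    if start < len(text):
        spans.append((start, len(text)))
    return spans


print(split_spans("Ab. Cd! Ef", "akka1240"))  # [(0, 3), (4, 7), (8, 10)]
```

Returning offsets rather than substrings keeps the original text untouched, so later stages can map annotations back to exact character positions.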

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = None

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.
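The copy-then-annotate behaviour can be illustrated with a small stand-in. `MiniDoc` and `with_boundaries` below are hypothetical and only model the two fields discussed here; the sketch also assumes that `stop` is exclusive, so each pair can be used directly as a slice.

```python
from copy import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniDoc:
    """Hypothetical stand-in for Doc with only the fields used here."""

    normalized_text: Optional[str] = None
    sentence_boundaries: Optional[list[tuple[int, int]]] = None


def with_boundaries(doc: MiniDoc, boundaries: list[tuple[int, int]]) -> MiniDoc:
    if not doc.normalized_text:
        raise ValueError("Doc must have `normalized_text`.")
    out = copy(doc)  # shallow copy: rebinding an attribute leaves `doc` untouched
    out.sentence_boundaries = boundaries
    return out


original = MiniDoc(normalized_text="Ab. Cd! Ef")
updated = with_boundaries(original, [(0, 3), (4, 7), (8, 10)])

# Recover the sentence strings by slicing with each (start, stop) pair.
sentences = [updated.normalized_text[a:b] for a, b in updated.sentence_boundaries]
print(sentences)                     # ['Ab.', 'Cd!', 'Ef']
print(original.sentence_boundaries)  # None
```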

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
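The final loop in `run` records, for every sentence index, which provenance record produced its span. A side-effect-free sketch of that bookkeeping is shown below; `tag_sentence_sources` is a hypothetical helper, not a CLTK function. One point worth noting: if `copy` here is `copy.copy`, a `sentence_annotation_sources` dict that already exists on the input document would be shared with the shallow copy, so the sketch copies the mapping before updating it.

```python
from typing import Optional


def tag_sentence_sources(
    sources: Optional[dict[int, dict[str, str]]],
    sentence_count: int,
    prov_id: str,
) -> dict[int, dict[str, str]]:
    """Return a new mapping where every sentence index records its span source."""
    tagged = dict(sources or {})
    for idx in range(sentence_count):
        # Copy the per-sentence entry so the caller's dicts are not mutated.
        entry = dict(tagged.get(idx, {}))
        entry["span"] = prov_id
        tagged[idx] = entry
    return tagged


existing = {0: {"tokens": "prov-1"}}
result = tag_sentence_sources(existing, sentence_count=2, prov_id="prov-2")
print(result)
# {0: {'tokens': 'prov-1', 'span': 'prov-2'}, 1: {'span': 'prov-2'}}
```

Entries that already carry other keys keep them; only the `"span"` key is set or overwritten.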

LycianASentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Lycian A.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'lyci1241'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

LydianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Lydian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'lydi1241'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

PalaicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Palaic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'pala1331'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

CarianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Carian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'cari1274'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

CuneiformLuwianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Cuneiform Luwian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'cune1239'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

HieroglyphicLuwianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Hieroglyphic Luwian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'hier1240'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

ClassicalArmenianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Classical Armenian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'clas1256'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MiddleArmenianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle Armenian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midd1364'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

AkkadianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Akkadian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'akka1240'
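
Each concrete process differs from the base class only in its `glottolog_id`. The pattern can be sketched with stand-in classes as below. `ToySplittingProcess`, `ToyAkkadianSplittingProcess`, `_stub_splitter`, and `_SUPPORTED` are hypothetical names, not part of CLTK, and the stub treats the whole text as a single sentence.

```python
from functools import cached_property
from typing import Callable, Optional

Boundaries = list[tuple[int, int]]

# Hypothetical set of codes the stub accepts.
_SUPPORTED = {"akka1240"}


def _stub_splitter(text: str, glottolog_id: str) -> Boundaries:
    """Placeholder algorithm: treat the whole text as one sentence."""
    return [(0, len(text))] if text else []


class ToySplittingProcess:
    """Stand-in for the base class; not the CLTK implementation."""

    process_id: str = "sentence_split"
    glottolog_id: Optional[str] = None

    @cached_property
    def algorithm(self) -> Callable[[str, str], Boundaries]:
        # Reject codes the stub does not know about, including None.
        if self.glottolog_id not in _SUPPORTED:
            raise ValueError(f"Unsupported glottolog_id: {self.glottolog_id}")
        return _stub_splitter


class ToyAkkadianSplittingProcess(ToySplittingProcess):
    """Only the class attribute changes; the algorithm is inherited."""

    glottolog_id: Optional[str] = "akka1240"


text = "šarrum dannum"
proc = ToyAkkadianSplittingProcess()
boundaries = proc.algorithm(text, proc.glottolog_id)
print(boundaries)
```

Because `algorithm` is a cached property, the lookup runs once per process instance and the same callable is reused on later accesses.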

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

AncientGreekSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Ancient Greek.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'anci1242'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
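
To illustrate the callable contract described above, here is a toy function with the same `(text, glottolog_id)` signature that returns `(start, stop)` offsets. It is not CLTK's splitter: `toy_split` and the punctuation table `_TOY_TERMINATORS` are assumptions made for this sketch, and the real per-language rules may differ.

```python
import re
from typing import Callable

# Hypothetical punctuation table; the real rules live in CLTK's splitter.
_TOY_TERMINATORS: dict[str, str] = {
    "lati1261": ".?!",
    "anci1242": ".;\u00b7",  # full stop, Greek question mark (;), middle dot
}


def toy_split(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Return (start, stop) character offsets, one pair per sentence."""
    if glottolog_id not in _TOY_TERMINATORS:
        raise ValueError(f"Unsupported glottolog_id: {glottolog_id}")
    terminators = re.escape(_TOY_TERMINATORS[glottolog_id])
    # A sentence is a run of non-terminators plus any trailing terminators.
    pattern = re.compile(rf"[^{terminators}]+[{terminators}]*")
    spans: list[tuple[int, int]] = []
    for match in pattern.finditer(text):
        chunk = match.group()
        # Trim surrounding whitespace while keeping offsets into ``text``.
        start = match.start() + (len(chunk) - len(chunk.lstrip()))
        stop = match.end() - (len(chunk) - len(chunk.rstrip()))
        if start < stop:
            spans.append((start, stop))
    return spans


algorithm: Callable[[str, str], list[tuple[int, int]]] = toy_split

text = "τί ἐστιν ἀλήθεια; ἐν ἀρχῇ ἦν ὁ λόγος."
spans = algorithm(text, "anci1242")
# Offsets index into the original string, so slicing recovers each sentence.
sentences = [text[start:stop] for start, stop in spans]
print(sentences)
```

Returning offsets instead of substrings lets later stages map each sentence back to its position in `normalized_text`.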

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

AncientHebrewSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Ancient Hebrew.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'anci1244'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
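
Before building the provenance record, `run` prefers the dialect's Glottolog code and falls back to the language's. One way to sketch that fallback with explicit `None` checks is shown below. `ToyLang`, `ToyDoc`, `resolve_lang_id`, and the code `"dial0000"` are hypothetical stand-ins. The source above instead wraps the lookup in a broad `try`/`except`, which also tolerates missing attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyLang:
    """Stand-in for a language or dialect object."""

    glottolog_id: Optional[str] = None


@dataclass
class ToyDoc:
    """Stand-in for a document carrying language metadata."""

    language: Optional[ToyLang] = None
    dialect: Optional[ToyLang] = None


def resolve_lang_id(doc: ToyDoc) -> Optional[str]:
    """Prefer the dialect's code; fall back to the language's; else None."""
    dialect = doc.dialect
    if dialect is not None and dialect.glottolog_id:
        return dialect.glottolog_id
    language = doc.language
    return language.glottolog_id if language is not None else None


# "dial0000" is a placeholder, not a real Glottolog code.
with_dialect = ToyDoc(language=ToyLang("anci1244"), dialect=ToyLang("dial0000"))
language_only = ToyDoc(language=ToyLang("anci1244"))
print(resolve_lang_id(with_dialect), resolve_lang_id(language_only))
```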

ClassicalSyriacSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Classical Syriac.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'clas1252'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

ClassicalTibetanSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Classical Tibetan.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'clas1254'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

CopticSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Coptic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'copt1239'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

LatinSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Latin.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'lati1261'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OfficialAramaicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Official Aramaic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'impe1235'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldEnglishSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old English.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'olde1238'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldNorseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Norse.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldn1244'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
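
Because `run` stores character offsets rather than strings, a caller has to slice the text to recover the sentences. The helper below is a minimal sketch of that step. The name `slice_sentences` is hypothetical, and the code assumes that `stop` is exclusive, as in Python slicing, which the listing above does not state explicitly.

```python
def slice_sentences(text: str, boundaries: list[tuple[int, int]]) -> list[str]:
    """Turn (start, stop) offsets into sentence strings.

    Assumes ``stop`` is exclusive, matching Python slice semantics.
    """
    sentences: list[str] = []
    for start, stop in boundaries:
        # Reject offsets that fall outside the text or run backwards.
        if not 0 <= start <= stop <= len(text):
            raise ValueError(f"Invalid span ({start}, {stop}) for text of length {len(text)}")
        sentences.append(text[start:stop])
    return sentences
```

With a processed document, this would be called as `slice_sentences(doc.normalized_text, doc.sentence_boundaries)`.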

PaliSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Pali.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'pali1273'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
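
The final loop in `run` records, for every sentence index, which provenance record produced its span. The sketch below reproduces that bookkeeping on plain dictionaries. The function name `tag_sentence_spans` is an assumption for illustration, and unlike the listing above it returns a new mapping instead of mutating the document's attribute.

```python
from typing import Optional


def tag_sentence_spans(
    sources: Optional[dict[int, dict[str, str]]],
    sentence_count: int,
    prov_id: str,
) -> dict[int, dict[str, str]]:
    """Return a mapping where each sentence index has its "span" source set."""
    # Copy the outer mapping so the caller's dictionary is left untouched.
    tagged: dict[int, dict[str, str]] = dict(sources or {})
    for idx in range(sentence_count):
        # Copy the entry too, preserving keys written by other processes.
        entry = dict(tagged.get(idx, {}))
        entry["span"] = prov_id
        tagged[idx] = entry
    return tagged
```

Existing keys on an entry survive, so several processes can each record their own source for the same sentence.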

ClassicalSanskritSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Classical Sanskrit.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'clas1258'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
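
The docstring says `run` returns a shallow copy, which matters for mutable attributes. The sketch below uses a stand-in dataclass, `FakeDoc` (hypothetical, not CLTK's `Doc`), and assumes that `copy` in the listing is `copy.copy` from the standard library. It shows that rebinding an attribute on the copy leaves the original alone, while mutating a shared dictionary in place is visible from both objects.

```python
from copy import copy
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a document; only the fields needed for the demo."""

    normalized_text: str
    sentence_boundaries: list = field(default_factory=list)
    sentence_annotation_sources: dict = field(default_factory=dict)


original = FakeDoc("A. B.")
original.sentence_annotation_sources[0] = {"tokens": "prov-1"}

duplicate = copy(original)

# Rebinding an attribute affects only the copy.
duplicate.sentence_boundaries = [(0, 2), (3, 5)]

# Mutating a shared container in place affects both objects.
duplicate.sentence_annotation_sources[0]["span"] = "prov-2"
```

If isolation between input and output documents is required, `copy.deepcopy` is the standard-library alternative, at the cost of copying every nested object.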

VedicSanskritSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Vedic Sanskrit.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'vedi1234'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
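
The language identifier written to the provenance record prefers the dialect's Glottolog code and falls back to the language's code, yielding `None` if neither is available. A minimal sketch of that fallback follows, using hypothetical stand-in classes (`FakeLanguage`, `FakeDoc`) and `getattr` defaults in place of the broad `try`/`except` used in the listing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeLanguage:
    """Stand-in for a language or dialect object."""

    glottolog_id: Optional[str] = None


@dataclass
class FakeDoc:
    """Stand-in for a document carrying language metadata."""

    language: Optional[FakeLanguage] = None
    dialect: Optional[FakeLanguage] = None


def resolve_lang_id(doc: object) -> Optional[str]:
    """Prefer the dialect's Glottolog code, then the language's, else None."""
    dialect_id = getattr(getattr(doc, "dialect", None), "glottolog_id", None)
    if dialect_id:
        return dialect_id
    return getattr(getattr(doc, "language", None), "glottolog_id", None)
```

Using `getattr` with a default handles a missing attribute explicitly, whereas catching `Exception` would also hide unrelated errors.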

ClassicalArabicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Classical Arabic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'clas1259'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

ChurchSlavonicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Church Slavonic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'chur1257'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MiddleEnglishSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle English.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midd1317'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
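
Each language class on this page differs from the base class only in its `glottolog_id`; `algorithm` and `run` are inherited unchanged. The sketch below mimics that pattern with stand-in classes. `BaseSplitter`, `MiddleEnglishSplitter`, and `whole_text_span` are hypothetical names, and the placeholder algorithm simply returns one span covering the whole text.

```python
from functools import cached_property
from typing import Callable, Optional


def whole_text_span(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Placeholder algorithm: treat the entire text as one sentence."""
    return [(0, len(text))]


class BaseSplitter:
    """Stand-in base class holding all shared behaviour."""

    process_id: str = "sentence_split"
    glottolog_id: Optional[str] = None

    @cached_property
    def algorithm(self) -> Callable[[str, str], list[tuple[int, int]]]:
        # Computed once per instance, then cached on the instance.
        return whole_text_span

    def run(self, text: str) -> list[tuple[int, int]]:
        if not text:
            raise ValueError("Text must not be empty.")
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        return self.algorithm(text, self.glottolog_id)


class MiddleEnglishSplitter(BaseSplitter):
    """A language subclass only needs to set its Glottolog code."""

    glottolog_id: Optional[str] = "midd1317"
```

Calling `run` on the bare base class raises `ValueError` because `glottolog_id` is still `None`, which mirrors the check in the listing above.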

MiddleFrenchSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle French.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midd1316'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MiddlePersianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle Persian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'pahl1241'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldFrenchSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old French.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldf1239'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
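
Once `run` has populated `sentence_boundaries`, the offsets can be turned back into sentence strings by slicing. The helper below is a minimal sketch, assuming the offsets are half-open (`stop` exclusive, as in Python slices); `sentences_from_boundaries` is a hypothetical name and the boundaries in the usage lines are hand-written, not output from the library.

```python
def sentences_from_boundaries(
    text: str, boundaries: list[tuple[int, int]]
) -> list[str]:
    """Slice ``text`` by (start, stop) offsets, assuming ``stop`` is exclusive."""
    sentences: list[str] = []
    for start, stop in boundaries:
        # Python slicing silently clamps bad indices, so check them explicitly.
        if not 0 <= start < stop <= len(text):
            raise ValueError(f"Span {(start, stop)} is outside the text")
        sentences.append(text[start:stop])
    return sentences


text = "Arma virumque cano. Troiae qui primus ab oris."
boundaries = [(0, 19), (20, 46)]  # hand-written offsets for this example
print(sentences_from_boundaries(text, boundaries))
```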

MiddleHighGermanSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle High German.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midd1343'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
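
The final loop in `run` records, for every sentence index, which provenance record produced its span. The sketch below mirrors that loop on a plain dictionary so the resulting shape is easy to see. `tag_sentence_sources` is a hypothetical helper, and the provenance ids are made-up strings; unlike the source above, it copies the mapping instead of mutating it in place.

```python
from typing import Optional


def tag_sentence_sources(
    sources: Optional[dict[int, dict[str, str]]],
    sentence_count: int,
    prov_id: str,
) -> dict[int, dict[str, str]]:
    """Return a mapping where every sentence index has ``"span"`` set to ``prov_id``."""
    tagged = {idx: dict(entry) for idx, entry in (sources or {}).items()}
    for idx in range(sentence_count):
        entry = tagged.get(idx, {})
        # Only the "span" key is overwritten; other keys on the entry survive.
        entry["span"] = prov_id
        tagged[idx] = entry
    return tagged


existing = {0: {"token": "prov-a"}}
print(tag_sentence_sources(existing, sentence_count=2, prov_id="prov-b"))
```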

OldHighGermanSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old High German.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldh1241'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
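
Because `run` works on a shallow copy, it helps to be clear about what a shallow copy does and does not isolate. The sketch below uses a hypothetical `ToyDoc` dataclass as a stand-in for `Doc`; whether the real `Doc` behaves identically depends on how that class implements copying, so treat this as an illustration of `copy.copy` in general.

```python
from copy import copy
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for Doc; not the library's class."""

    normalized_text: str
    sentence_boundaries: list[tuple[int, int]] = field(default_factory=list)
    sentence_annotation_sources: dict[int, dict[str, str]] = field(
        default_factory=dict
    )


original = ToyDoc(
    "Ein satz. Noch einer.",
    sentence_annotation_sources={0: {"token": "p1"}},
)
duplicate = copy(original)

# Rebinding an attribute on the copy leaves the original untouched.
duplicate.sentence_boundaries = [(0, 9), (10, 21)]

# Mutating a container that both objects reference is visible through both.
duplicate.sentence_annotation_sources[0]["span"] = "p2"

print(original.sentence_boundaries)
print(original.sentence_annotation_sources)
```

If the input document already carries a non-empty `sentence_annotation_sources`, a caller who needs the input left untouched may want to copy that dictionary before running the process.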

GothicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Gothic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'goth1244'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
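
The per-language classes on this page differ only in their `glottolog_id`; the guard clauses in `run` are inherited unchanged. The toy classes below sketch that pattern without importing the library. `ToySplitProcess` and `ToyGothicProcess` are hypothetical names, and the single-span return value is a placeholder rather than real sentence splitting.

```python
from typing import Optional


class ToySplitProcess:
    """Hypothetical base class mirroring the two guard clauses in ``run``."""

    glottolog_id: Optional[str] = None

    def run(self, normalized_text: str) -> list[tuple[int, int]]:
        # Same order as the source above: text first, then language code.
        if not normalized_text:
            raise ValueError("Doc must have `normalized_text`.")
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        # Placeholder result: treat the whole text as one sentence.
        return [(0, len(normalized_text))]


class ToyGothicProcess(ToySplitProcess):
    """A subclass only needs to pin the language code."""

    glottolog_id = "goth1244"


print(ToyGothicProcess().run("jah qath"))
```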

HindiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Hindi.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'hind1269'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

KhariBoliSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Khari Boli (Hindi dialect).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'khad1239'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
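
For a dialect such as Khari Boli, the provenance record in `run` prefers the dialect's Glottolog code and falls back to the language's code. The sketch below reproduces that lookup order with explicit `None` checks instead of the broad `try`/`except` used in the source. `ToyLang` and `resolve_lang_id` are hypothetical stand-ins, not library objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyLang:
    """Hypothetical stand-in for a language or dialect record."""

    glottolog_id: Optional[str] = None


def resolve_lang_id(
    language: Optional[ToyLang], dialect: Optional[ToyLang]
) -> Optional[str]:
    """Prefer the dialect's code, then the language's code, else None."""
    if dialect is not None and dialect.glottolog_id:
        return dialect.glottolog_id
    if language is not None:
        return language.glottolog_id
    return None


hindi = ToyLang("hind1269")
khari_boli = ToyLang("khad1239")
print(resolve_lang_id(hindi, khari_boli))
print(resolve_lang_id(hindi, None))
```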

BrajSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Braj Bhasha.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'braj1242'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

AwadhiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Awadhi.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'awad1243'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

UrduSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Urdu.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'urdu1245'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

LiteraryChineseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Literary (Classical) Chinese.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'lite1248'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
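
Chinese text has no spaces between sentences, so adjacent spans can touch: the stop of one sentence equals the start of the next. The sketch below shows this with full-width terminators. The terminator set and the name `cjk_spans` are assumptions for illustration, not the library's rule for `lite1248`.

```python
import re

# Assumed terminators: ideographic full stop, full-width ! and ?.
_CJK_SENTENCE = re.compile(r"[^。！？]+[。！？]*")


def cjk_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, stop) offsets for runs ending in a full-width terminator."""
    return [
        (match.start(), match.end())
        for match in _CJK_SENTENCE.finditer(text)
        if match.group().strip()  # skip whitespace-only runs
    ]


print(cjk_spans("學而時習之。不亦說乎？"))
```

Text with no sentence punctuation at all comes back as a single span covering the whole string, which is worth keeping in mind for unpunctuated editions.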

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
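Since run works on a shallow copy, it may help to see what a shallow copy does and does not isolate. The sketch below uses `StubDoc`, a hypothetical stand-in carrying only the fields involved. It assumes the real Doc behaves like a plain Python object under `copy.copy`, which is not verified here.

```python
from copy import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubDoc:
    """Hypothetical stand-in for Doc; only the fields used below."""

    normalized_text: Optional[str] = None
    sentence_boundaries: list[tuple[int, int]] = field(default_factory=list)
    sentence_annotation_sources: dict[int, dict[str, str]] = field(default_factory=dict)


original = StubDoc(
    normalized_text="甲。乙。",
    sentence_annotation_sources={0: {"span": "prov-old"}},
)
output = copy(original)

# Rebinding an attribute on the copy leaves the original untouched.
output.sentence_boundaries = [(0, 2), (2, 4)]

# Mutating a container that both objects share is visible through both.
output.sentence_annotation_sources[0]["span"] = "prov-new"
```

Under that assumption, a caller who passes in a document that already has a non-empty sentence_annotation_sources mapping could see it updated in place, while sentence_boundaries on the input stays as it was.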

OldChineseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Chinese.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldc1244'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MiddleChineseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle Chinese.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midd1344'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
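The provenance record in the source above takes its language from the document, not from the process: the dialect's Glottolog code is preferred, the language's code is the fallback, and any lookup failure yields None. A minimal sketch of that fallback follows; `resolve_lang_id` is a hypothetical helper name, and `SimpleNamespace` objects stand in for real documents.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_lang_id(doc: Any) -> Optional[str]:
    """Hypothetical helper mirroring the dialect-then-language fallback."""
    try:
        if doc.dialect and doc.dialect.glottolog_id:
            return doc.dialect.glottolog_id
        return doc.language.glottolog_id
    except Exception:
        # Missing attributes (or a None language) fall back to no language.
        return None


with_dialect = SimpleNamespace(
    dialect=SimpleNamespace(glottolog_id="midd1344"),
    language=SimpleNamespace(glottolog_id="lite1248"),
)
language_only = SimpleNamespace(
    dialect=None,
    language=SimpleNamespace(glottolog_id="lite1248"),
)
empty = SimpleNamespace()

resolved = [resolve_lang_id(d) for d in (with_dialect, language_only, empty)]
```

The practical consequence is that the language recorded in provenance can differ from the process's own glottolog_id when the document was created for a different language or dialect.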

BaihuaChineseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Early Vernacular Chinese (Baihua).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'clas1255'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

PanjabiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Panjabi.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'panj1256'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

ParthianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Parthian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'part1239'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.
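The two documented failure modes can be exercised with a small stand-in. `StubDoc` and `StubSplitter` below are hypothetical and reproduce only the validation order described above (text first, then glottolog_id). The single whole-text boundary is a placeholder, not real splitting.

```python
from copy import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubDoc:
    """Hypothetical stand-in for Doc."""

    normalized_text: Optional[str] = None
    sentence_boundaries: list[tuple[int, int]] = field(default_factory=list)


class StubSplitter:
    """Hypothetical process that mirrors only the documented checks."""

    glottolog_id: Optional[str] = None

    def run(self, input_doc: StubDoc) -> StubDoc:
        output_doc = copy(input_doc)
        if not output_doc.normalized_text:
            raise ValueError("Doc must have `normalized_text`.")
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        # Placeholder: treat the whole text as one sentence.
        output_doc.sentence_boundaries = [(0, len(output_doc.normalized_text))]
        return output_doc


class StubParthianSplitter(StubSplitter):
    glottolog_id = "part1239"


errors: list[str] = []
for process, doc in (
    (StubParthianSplitter(), StubDoc()),           # no normalized_text
    (StubSplitter(), StubDoc(normalized_text="x")),  # no glottolog_id
):
    try:
        process.run(doc)
    except ValueError as exc:
        errors.append(str(exc))

ok = StubParthianSplitter().run(StubDoc(normalized_text="abc"))
```

Both errors are plain ValueError instances, so a caller who needs to tell them apart has to inspect the message.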

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

DemoticSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Demotic Egyptian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'demo1234'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

BengaliSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Bengali.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'beng1280'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
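For a script such as Bengali, a splitter with this signature would plausibly key on the danda (।) and double danda (॥) alongside ? and !. The sketch below is an assumption-laden illustration: `split_on_danda` and its terminator set are hypothetical, not the library's actual Bengali rules. It also trims surrounding whitespace from each span, which the real splitter may or may not do.

```python
import re

# Assumed terminators for illustration; the library's rule set may differ.
_TERMINATORS = "।॥?!"
_SENTENCE = re.compile(rf"[^{_TERMINATORS}]*[{_TERMINATORS}]+|[^{_TERMINATORS}]+$")


def split_on_danda(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Hypothetical splitter returning whitespace-trimmed (start, stop) spans."""
    if glottolog_id != "beng1280":
        raise ValueError(f"Unsupported glottolog_id: {glottolog_id}")
    spans: list[tuple[int, int]] = []
    for match in _SENTENCE.finditer(text):
        start, stop = match.span()
        # Shrink the span so it covers only the sentence itself.
        while start < stop and text[start].isspace():
            start += 1
        while stop > start and text[stop - 1].isspace():
            stop -= 1
        if start < stop:
            spans.append((start, stop))
    return spans


text = "আমি ভাত খাই। তুমি কী খাও?"
boundaries = split_on_danda(text, "beng1280")
sentences = [text[start:stop] for start, stop in boundaries]
```

Offsets here count Python string code points, so combining vowel signs each occupy their own index. Any consumer that slices the text should use the same string representation.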

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OdiaSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Odia (Oriya).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oriy1255'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

AssameseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Assamese.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'assa1263'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
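
The (start, stop) pairs stored in sentence_boundaries can be turned back into sentence strings by slicing the normalized text. The sketch below assumes that stop is exclusive, as in ordinary Python slices, which the documentation above does not state explicitly. The helper name sentences_from_boundaries and the hand-written offsets are illustrative only and are not part of CLTK.

```python
def sentences_from_boundaries(
    text: str, boundaries: list[tuple[int, int]]
) -> list[str]:
    """Slice ``text`` into sentence strings using (start, stop) offsets.

    Assumption: ``stop`` is exclusive, matching Python slice semantics.
    """
    return [text[start:stop] for start, stop in boundaries]


# Hand-written offsets for illustration; after a real run they would come
# from ``output_doc.sentence_boundaries``.
text = "Arma virumque cano. Troiae qui primus ab oris."
boundaries = [(0, 19), (20, 46)]
sentences = sentences_from_boundaries(text, boundaries)
```

Keeping offsets rather than substrings lets later stages map annotations back onto the original normalized_text.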

GujaratiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Gujarati.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'guja1252'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
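
To make the documented signature concrete, here is a minimal sketch of a callable that takes (text, glottolog_id) and returns (start, stop) offsets. It is not CLTK's splitter: the name toy_split, the SENTENCE_END table, and the punctuation chosen for guja1252 are assumptions made for illustration.

```python
import re

# Hypothetical punctuation table; the real per-language rules live in CLTK
# and may differ from this.
SENTENCE_END = {
    "guja1252": "।.?!",
}


def toy_split(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Return (start, stop) offsets, splitting after sentence-final marks."""
    if glottolog_id not in SENTENCE_END:
        raise ValueError(f"Unsupported glottolog_id: {glottolog_id}")
    marks = re.escape(SENTENCE_END[glottolog_id])
    # One sentence: a run of non-terminators, then any trailing terminators.
    pattern = re.compile(rf"[^{marks}]+[{marks}]*")
    spans: list[tuple[int, int]] = []
    for match in pattern.finditer(text):
        chunk = match.group()
        # Trim surrounding whitespace while keeping offsets into ``text``.
        start = match.start() + (len(chunk) - len(chunk.lstrip()))
        stop = match.end() - (len(chunk) - len(chunk.rstrip()))
        if start < stop:
            spans.append((start, stop))
    return spans


spans = toy_split("A b. C d? E", "guja1252")
```

An unsupported code raises ValueError, mirroring the behaviour documented for the algorithm property.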

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MarathiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Marathi.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'mara1378'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
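
The last step of run records, for every sentence index, which provenance record produced its span. The sketch below reproduces that bookkeeping on a plain dictionary. The function name tag_sentence_sources, the "tokens" key, and the ID strings are hypothetical. Unlike the source above, this version copies its input instead of mutating it in place.

```python
from typing import Optional


def tag_sentence_sources(
    sources: Optional[dict[int, dict[str, str]]],
    sentence_count: int,
    prov_id: str,
) -> dict[int, dict[str, str]]:
    """Record ``prov_id`` under the "span" key for every sentence index."""
    tagged = dict(sources or {})
    for idx in range(sentence_count):
        # Copy the entry so the caller's dictionary is left untouched.
        entry = dict(tagged.get(idx, {}))
        entry["span"] = prov_id
        tagged[idx] = entry
    return tagged


# Hypothetical pre-existing annotation on sentence 0.
existing = {0: {"tokens": "prov-1"}}
result = tag_sentence_sources(existing, 2, "prov-2")
```

Entries for other annotation layers are preserved; only the "span" key is set or overwritten.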

BagriSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Bagri (Rajasthani).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'bagr1243'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

SinhalaSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Sinhala.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'sinh1246'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
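
The language recorded in the provenance entry comes from the document, not from the process: the dialect's Glottolog code is preferred, with the language's code as the fallback. A minimal sketch of that lookup follows, using SimpleNamespace objects as stand-ins for Doc. The helper name resolve_lang_id and the dialect code "dial0000" are made up for illustration, and the sketch catches only AttributeError where the source catches Exception.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_lang_id(doc) -> Optional[str]:
    """Prefer the dialect's Glottolog code, then fall back to the language's."""
    try:
        if doc.dialect and doc.dialect.glottolog_id:
            return doc.dialect.glottolog_id
        return doc.language.glottolog_id
    except AttributeError:
        # Missing attributes (or a ``None`` language) yield no language ID.
        return None


language = SimpleNamespace(glottolog_id="sinh1246")
with_dialect = SimpleNamespace(
    dialect=SimpleNamespace(glottolog_id="dial0000"),  # made-up code
    language=language,
)
without_dialect = SimpleNamespace(dialect=None, language=language)
bare = SimpleNamespace(dialect=None, language=None)
```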

SindhiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Sindhi.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'sind1272'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

KashmiriSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Kashmiri.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'kash1277'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldBurmeseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Burmese.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldb1235'

process_id class-attribute

process_id: str = 'sentence_split'
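
Each concrete process differs from the base class only in its glottolog_id, while process_id is inherited unchanged. The pattern can be sketched with stand-in classes. ToySplittingProcess and ToyOldBurmeseProcess are hypothetical names, not CLTK classes, and require_glottolog_id is an illustrative helper that mirrors the check performed inside run.

```python
from typing import Optional


class ToySplittingProcess:
    """Stand-in for the documented base class (not CLTK's implementation)."""

    process_id: str = "sentence_split"
    glottolog_id: Optional[str] = None

    def require_glottolog_id(self) -> str:
        """Return the configured code or fail the way ``run`` does."""
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        return self.glottolog_id


class ToyOldBurmeseProcess(ToySplittingProcess):
    """Subclass that only overrides the class-level Glottolog code."""

    glottolog_id: Optional[str] = "oldb1235"


code = ToyOldBurmeseProcess().require_glottolog_id()
```

Using the bare base class directly fails the check, which is why run raises ValueError when glottolog_id is not set.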

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

ClassicalBurmeseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Classical Burmese.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'nucl1310'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

TangutSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Tangut.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'tang1334'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
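
Because algorithm is a cached property, the boundary function is resolved once per process instance and reused on later accesses. The sketch below shows that behaviour with functools.cached_property. Whether CLTK uses this exact decorator is an assumption, and ToyProcess, toy_splitter, and the lookups counter are illustrative names only.

```python
from functools import cached_property
from typing import Callable


def toy_splitter(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Placeholder boundary function: treat the whole text as one sentence."""
    return [(0, len(text))] if text else []


class ToyProcess:
    glottolog_id = "tang1334"
    lookups = 0  # counts how often the property body actually runs

    @cached_property
    def algorithm(self) -> Callable[[str, str], list[tuple[int, int]]]:
        type(self).lookups += 1
        return toy_splitter


process = ToyProcess()
first = process.algorithm("abc", process.glottolog_id)
second = process.algorithm("", process.glottolog_id)
```

After both calls the property body has run only once, since the second access reads the value cached on the instance.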

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

NewarSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Newar (Classical Nepal Bhasa).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'newa1246'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
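
To make the callable's contract concrete, here is a minimal sketch of a function with the same `(text, glottolog_id) -> list[tuple[int, int]]` shape. It is not the library's splitter: `toy_split` and `_TERMINATORS` are hypothetical names, and the punctuation listed for `newa1246` is an assumption made for illustration only.

```python
import re
from typing import Callable

# Hypothetical per-language terminator table; the real rules live in the
# library's multi-language splitter and may differ.
_TERMINATORS: dict[str, str] = {
    "newa1246": "।॥.?!",  # assumption: danda, double danda, Latin marks
}


def toy_split(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Return (start, stop) offsets, ending each span after its terminators."""
    try:
        marks = _TERMINATORS[glottolog_id]
    except KeyError:
        raise ValueError(f"Unsupported glottolog_id: {glottolog_id}") from None
    # A sentence is a run of non-terminators followed by optional terminators.
    pattern = re.compile(rf"[^{re.escape(marks)}]+[{re.escape(marks)}]*")
    spans = []
    for match in pattern.finditer(text):
        chunk = match.group()
        # Trim surrounding whitespace while keeping offsets accurate.
        start = match.start() + (len(chunk) - len(chunk.lstrip()))
        stop = match.end() - (len(chunk) - len(chunk.rstrip()))
        if start < stop:
            spans.append((start, stop))
    return spans


# The function satisfies the documented callable type.
splitter: Callable[[str, str], list[tuple[int, int]]] = toy_split
```

Returning offsets rather than substrings lets downstream code map each sentence back to its position in `normalized_text`.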

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
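
The final loop in `run` records, for each sentence index, which provenance record produced its span. A minimal sketch of that bookkeeping with plain dictionaries follows. The provenance id strings and the pre-existing `"translation"` entry are invented for illustration, and the real `Doc` and provenance helpers are not used.

```python
sentence_boundaries = [(0, 10), (11, 22), (23, 40)]
prov_id = "prov-0001"  # hypothetical id; the real format is not shown in the source

# An earlier process may already have annotated some sentences (made-up entry).
sentence_annotation_sources: dict[int, dict[str, str]] = {
    1: {"translation": "prov-0000"},
}

for idx in range(len(sentence_boundaries)):
    # Reuse an existing entry so other keys survive; only "span" is set.
    entry = sentence_annotation_sources.get(idx, {})
    entry["span"] = prov_id
    sentence_annotation_sources[idx] = entry
```

Because existing entries are fetched and updated rather than replaced, annotations written under other keys are preserved.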

MeiteiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Meitei (Classical Manipuri).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'mani1292'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
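
Since `run` works on `copy(input_doc)` (presumably the standard library's `copy.copy`), the result is a shallow copy: rebinding an attribute on the output leaves the input untouched, but mutable objects reachable from both documents are shared. The sketch below illustrates this with a stand-in dataclass. `FakeDoc` is hypothetical and far simpler than the library's `Doc`.

```python
from copy import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeDoc:
    normalized_text: str
    sentence_boundaries: Optional[list[tuple[int, int]]] = None
    notes: dict[str, str] = field(default_factory=dict)


original = FakeDoc(normalized_text="A. B.")
result = copy(original)

# Rebinding an attribute affects only the copy.
result.sentence_boundaries = [(0, 2), (3, 5)]

# Mutating a shared object in place is visible through both references.
result.notes["splitter"] = "regex"
```

After this runs, `original.sentence_boundaries` is still `None`, while `original.notes` contains the new key because both objects point at the same dictionary.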

SgawKarenSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Sgaw Karen.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'sgaw1245'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
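
The language recorded in the provenance entry comes from the document's dialect when it carries a Glottolog code, and from the document's language otherwise; any failure during lookup yields `None`. A sketch of that fallback using `SimpleNamespace` stand-ins follows. The helper name `resolve_lang_id` and the codes `dial0001` and `lang0001` are made up for illustration.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_lang_id(doc: Any) -> Optional[str]:
    """Prefer the dialect's Glottolog code, then the language's, else None."""
    try:
        if doc.dialect and doc.dialect.glottolog_id:
            return doc.dialect.glottolog_id
        return doc.language.glottolog_id
    except Exception:
        # Mirrors the broad except in `run`: missing attributes give None.
        return None


with_dialect = SimpleNamespace(
    dialect=SimpleNamespace(glottolog_id="dial0001"),
    language=SimpleNamespace(glottolog_id="lang0001"),
)
no_dialect = SimpleNamespace(
    dialect=None,
    language=SimpleNamespace(glottolog_id="lang0001"),
)
no_language = SimpleNamespace(dialect=None, language=None)
```

Note that this value is independent of the process's own `glottolog_id`, which is what selects the splitting rules.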

MiddleMongolSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle Mongol.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'mong1329'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
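
Each concrete class on this page differs from the base only in its `glottolog_id`. The pattern can be sketched with stand-in classes. `BaseSplitter`, `MiddleMongolSplitter`, `boundaries`, and the trivial `whole_text` algorithm are all hypothetical and do not use the library's `Process` machinery.

```python
from functools import cached_property
from typing import Callable, Optional


def whole_text(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Placeholder algorithm: treat the entire text as one sentence."""
    return [(0, len(text))] if text else []


class BaseSplitter:
    glottolog_id: Optional[str] = None

    @cached_property
    def algorithm(self) -> Callable[[str, str], list[tuple[int, int]]]:
        # Computed once per instance, then cached on the instance.
        return whole_text

    def boundaries(self, text: str) -> list[tuple[int, int]]:
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        # Positional call, matching the Callable[[str, str], ...] signature.
        return self.algorithm(text, self.glottolog_id)


class MiddleMongolSplitter(BaseSplitter):
    # The only thing a concrete subclass needs to override.
    glottolog_id = "mong1329"
```

Keeping the language code as a class attribute means a new language can be added with a short subclass, provided the underlying splitter supports its code.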

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

ClassicalMongolianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Classical Mongolian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'mong1331'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MogholiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Mogholi (Moghol).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'mogh1245'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
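
Code that consumes the returned offsets may want a quick sanity check before slicing. The source does not state whether spans are guaranteed to be sorted, non-overlapping, or half-open, so the sketch below encodes those properties as assumptions. `check_boundaries` is a hypothetical helper, not part of the library.

```python
def check_boundaries(text: str, boundaries: list[tuple[int, int]]) -> bool:
    """Return True if spans are in range, non-empty, ordered, and non-overlapping."""
    previous_stop = 0
    for start, stop in boundaries:
        # Each span must begin at or after the previous one ended,
        # contain at least one character, and end within the text.
        if not (previous_stop <= start < stop <= len(text)):
            return False
        previous_stop = stop
    return True
```

A check like this is cheap and can catch mismatches between the text that was split and the text being sliced.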

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
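
The first guard in `run` tests `normalized_text` for truthiness, so an empty string is rejected in the same way as `None`. A small sketch of that behavior in isolation follows. `require_text` is a hypothetical helper written for this illustration.

```python
from typing import Optional


def require_text(normalized_text: Optional[str]) -> str:
    """Raise ValueError for None or empty text, as the truthiness check does."""
    if not normalized_text:
        raise ValueError("Doc must have `normalized_text`.")
    return normalized_text
```

Whitespace-only text is truthy and therefore passes this guard; what the splitter then returns for such input is not specified here.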

NumidianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Numidian (Ancient Berber).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'numi1241'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

TaitaSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Cushitic Taita.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'tait1247'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

HausaSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Hausa.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'haus1257'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
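
The `(start, stop)` pairs are plain character offsets, so sentence strings can be recovered by slicing. A minimal sketch follows, assuming the offsets are half-open (`stop` exclusive) in the way Python slices are, which the source does not state explicitly. The text and offsets below are written by hand for illustration rather than produced by the library.

```python
# Hypothetical normalized text and boundaries; in practice these would come
# from a processed Doc's `normalized_text` and `sentence_boundaries`.
text = "Ina kwana? Lafiya lau."
boundaries = [(0, 10), (11, 22)]

# Assumes half-open offsets, i.e. text[start:stop] yields the sentence.
sentences = [text[start:stop] for start, stop in boundaries]
```

Under that assumption, `sentences` holds the two strings `"Ina kwana?"` and `"Lafiya lau."`, with the separating space excluded from both spans.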

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldJurchenSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Jurchen.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'jurc1239'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
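
The `(start, stop)` offsets can be used to slice the original text. The following is a minimal sketch, not the library's splitter: `split_on_terminators` is a hypothetical stand-in that only matches the documented `(text, glottolog_id)` signature and breaks after `.`, `!`, or `?` regardless of language.

```python
import re


def split_on_terminators(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Hypothetical stand-in splitter: break after '.', '!' or '?'."""
    # glottolog_id is accepted only to match the documented signature;
    # this sketch applies the same punctuation rule to every language.
    boundaries: list[tuple[int, int]] = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        start, stop = match.span()
        # Trim surrounding whitespace so each span covers the sentence only.
        while start < stop and text[start].isspace():
            start += 1
        while stop > start and text[stop - 1].isspace():
            stop -= 1
        if start < stop:
            boundaries.append((start, stop))
    return boundaries


text = "One two. Three four!"
spans = split_on_terminators(text, "jurc1239")
# Offsets are half-open, so ordinary slicing recovers each sentence.
sentences = [text[start:stop] for start, stop in spans]
```

Because the spans index into the unmodified text, the sentence strings never need to be stored separately.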

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
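
The final loop of `run` records the provenance id under the `"span"` key for every sentence index, keeping any keys an entry already has. A standalone sketch of that bookkeeping on plain dictionaries follows. `tag_sentence_sources` is a hypothetical helper, and unlike the method above it returns a new mapping instead of updating the document in place.

```python
from typing import Optional


def tag_sentence_sources(
    sources: Optional[dict[int, dict[str, str]]],
    sentence_count: int,
    prov_id: str,
) -> dict[int, dict[str, str]]:
    """Record prov_id under the "span" key for every sentence index."""
    tagged = dict(sources) if sources else {}
    for idx in range(sentence_count):
        # Copy the entry so the caller's dictionaries are not mutated.
        entry = dict(tagged.get(idx, {}))
        entry["span"] = prov_id
        tagged[idx] = entry
    return tagged


# Sentence 0 already has a source recorded for another annotation layer.
existing = {0: {"tokens": "prov-1"}}
updated = tag_sentence_sources(existing, 2, "prov-2")
```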

OldJapaneseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Japanese.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'japo1237'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.
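
The two preconditions and the shallow-copy behaviour can be sketched without the library. `StubDoc` and `run_sketch` are hypothetical stand-ins for `Doc` and `run`, and the placeholder algorithm treats the whole text as one sentence.

```python
from copy import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubDoc:
    """Hypothetical stand-in for Doc with only the fields used here."""

    normalized_text: Optional[str] = None
    sentence_boundaries: Optional[list[tuple[int, int]]] = None


def run_sketch(doc: StubDoc, glottolog_id: Optional[str]) -> StubDoc:
    output_doc = copy(doc)  # shallow copy; the input object is left untouched
    if not output_doc.normalized_text:
        raise ValueError("Doc must have `normalized_text`.")
    if glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Placeholder algorithm: the whole text is a single sentence.
    output_doc.sentence_boundaries = [(0, len(output_doc.normalized_text))]
    return output_doc


original = StubDoc(normalized_text="abc")
result = run_sketch(original, "japo1237")
```

The text check comes first, as in the source listing, so a document with no text fails with the `normalized_text` message even when `glottolog_id` is also unset.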

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldHungarianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Hungarian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldh1242'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
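
This page does not show how `algorithm` decides whether a code is supported. The sketch below assumes a simple membership check. `SUPPORTED`, `SplitterBase`, `OldHungarianLike`, and `_split_whole` are all hypothetical names; only the use of a cached property and the `ValueError` come from the documentation above.

```python
from functools import cached_property
from typing import Callable, Optional

# Hypothetical registry of supported codes (an assumption for this sketch).
SUPPORTED = {"oldh1242"}


def _split_whole(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Placeholder algorithm: one span covering the whole text."""
    return [(0, len(text))] if text else []


class SplitterBase:
    """Hypothetical stand-in for SentenceSplittingProcess."""

    process_id: str = "sentence_split"
    glottolog_id: Optional[str] = None

    @cached_property
    def algorithm(self) -> Callable[[str, str], list[tuple[int, int]]]:
        if self.glottolog_id not in SUPPORTED:
            raise ValueError(f"Unsupported glottolog_id: {self.glottolog_id!r}")
        return _split_whole


class OldHungarianLike(SplitterBase):
    # A subclass only overrides the class attribute.
    glottolog_id: Optional[str] = "oldh1242"


proc = OldHungarianLike()
spans = proc.algorithm("ab", proc.glottolog_id)
```

With `functools.cached_property`, the callable is resolved once per instance and reused on later accesses. If the lookup raises, nothing is cached and the next access raises again.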

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

ChagataiSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Chagatai.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'chag1247'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldTurkicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Turkic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldu1238'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldTamilSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Tamil.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldt1248'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

HittiteSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Hittite (hitt1242).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'hitt1242'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

TocharianASentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Tocharian A (toch1238).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'toch1238'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

TocharianBSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Tocharian B (toch1237).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'toch1237'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

AvestanSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Avestan (aves1237).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'aves1237'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
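The final loop of `run` records which provenance record produced each sentence span. The sketch below mirrors that loop on plain dictionaries. `tag_sentences`, the `"tokens"` key, and the `"prov-…"` identifiers are hypothetical names chosen for illustration; only the `"span"` key comes from the source above.

```python
def tag_sentences(sources, sentence_count, prov_id):
    """Mirror the end of run(): set entry["span"] for every sentence index."""
    if not sources:
        sources = {}
    for idx in range(sentence_count):
        # Keep whatever is already recorded for this sentence; only "span" changes.
        entry = sources.get(idx, {})
        entry["span"] = prov_id
        sources[idx] = entry
    return sources


# Sentence 0 already carries an unrelated (hypothetical) annotation source.
existing = {0: {"tokens": "prov-0"}}
result = tag_sentences(existing, 2, "prov-1")
print(result)
# {0: {'tokens': 'prov-0', 'span': 'prov-1'}, 1: {'span': 'prov-1'}}
```

Existing keys for a sentence are preserved; only the `"span"` entry is overwritten on each run.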

BactrianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Bactrian (bact1239).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'bact1239'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
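Since `run` returns a shallow copy, it helps to see what `copy.copy` does and does not duplicate. The sketch below uses a hypothetical `MiniDoc` dataclass, not the real `Doc`, and assumes ordinary `copy.copy` semantics. Whether `Doc` customizes copying is not shown in this listing.

```python
from copy import copy
from dataclasses import dataclass, field


@dataclass
class MiniDoc:
    """Hypothetical stand-in for Doc with only the fields used here."""

    normalized_text: str = ""
    sentence_boundaries: list = field(default_factory=list)
    sentence_annotation_sources: dict = field(default_factory=dict)


original = MiniDoc(
    normalized_text="A. B.",
    sentence_annotation_sources={0: {"tokens": "p0"}},
)
shallow = copy(original)

# Rebinding an attribute on the copy leaves the original untouched.
shallow.sentence_boundaries = [(0, 2), (3, 5)]

# Mutating a container in place is visible through both objects,
# because the shallow copy shares the same dict.
shallow.sentence_annotation_sources[1] = {"span": "p1"}

print(original.sentence_boundaries)                 # []
print(sorted(original.sentence_annotation_sources)) # [0, 1]
```

Under these assumptions, assigning `sentence_boundaries` is safe for the caller's document, while in-place updates to an already populated `sentence_annotation_sources` would also appear on the input.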

SogdianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Sogdian (sogd1245).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'sogd1245'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
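The provenance record takes its language code from the dialect when one is set, falls back to the language, and ends up as `None` if either lookup fails. A minimal sketch of that fallback follows, using `SimpleNamespace` objects in place of real documents. `resolve_lang_id` and the dialect code `"dial0000"` are hypothetical.

```python
from types import SimpleNamespace


def resolve_lang_id(doc):
    """Prefer the dialect's Glottolog code, then the language's, else None."""
    try:
        if doc.dialect and doc.dialect.glottolog_id:
            return doc.dialect.glottolog_id
        return doc.language.glottolog_id
    except Exception:
        # Mirrors the broad except in run(): any lookup failure yields None.
        return None


language = SimpleNamespace(glottolog_id="sogd1245")
dialect = SimpleNamespace(glottolog_id="dial0000")  # hypothetical code

with_dialect = SimpleNamespace(dialect=dialect, language=language)
without_dialect = SimpleNamespace(dialect=None, language=language)
empty = SimpleNamespace(dialect=None, language=None)

print(resolve_lang_id(with_dialect))     # dial0000
print(resolve_lang_id(without_dialect))  # sogd1245
print(resolve_lang_id(empty))            # None
```

Note that this code is independent of the process's own `glottolog_id`, which selects the splitting rules; the value resolved here is only written into the provenance record.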

KhotaneseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Khotanese (khot1251).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'khot1251'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
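The two guards at the top of `run` fire in a fixed order, and the first one uses a truthiness check, so an empty string counts as missing text. The sketch below isolates just those checks. `check_preconditions` is a hypothetical helper, while the error messages are copied from the listing above.

```python
def check_preconditions(normalized_text, glottolog_id):
    """Reproduce the order of the two ValueError guards in run()."""
    if not normalized_text:  # None and "" are both rejected
        raise ValueError("Doc must have `normalized_text`.")
    if glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")


def error_for(normalized_text, glottolog_id):
    """Return the error message raised, or None when both guards pass."""
    try:
        check_preconditions(normalized_text, glottolog_id)
    except ValueError as exc:
        return str(exc)
    return None


print(error_for("", None))          # Doc must have `normalized_text`.
print(error_for("text", None))      # glottolog_id must be set for sentence splitting
print(error_for("text", "khot1251"))  # None
```

When both problems are present, only the text error is reported, because that check runs first.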

TumshuqeseSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Tumshuqese (tums1237).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'tums1237'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldPersianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Old Persian (oldp1254).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldp1254'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

EarlyIrishSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Old Irish (oldi1245).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'oldi1245'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

UgariticSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Ugaritic (ugar1238).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'ugar1238'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

PhoenicianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Phoenician (phoe1239).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'phoe1239'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

GeezSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Geez (geez1241).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'geez1241'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
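
Because run starts from copy(input_doc), the returned document is a shallow copy. The sketch below uses a hypothetical ToyDoc dataclass (not the library's Doc) to show what that generally means in Python: rebinding an attribute on the copy leaves the original alone, while a nested container that already exists is shared between the two objects.

```python
from copy import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDoc:
    """Hypothetical stand-in for Doc with only the fields used here."""

    normalized_text: str
    sentence_boundaries: Optional[list[tuple[int, int]]] = None
    sentence_annotation_sources: dict[int, dict[str, str]] = field(default_factory=dict)


original = ToyDoc(normalized_text="One. Two.")
original.sentence_annotation_sources[0] = {"span": "old"}

result = copy(original)

# Rebinding an attribute affects only the copy.
result.sentence_boundaries = [(0, 4), (5, 9)]
rebinding_is_isolated = original.sentence_boundaries is None

# Mutating a nested container is visible through both objects, because a
# shallow copy shares the same dict instance.
result.sentence_annotation_sources[0]["span"] = "new"
nested_is_shared = original.sentence_annotation_sources[0]["span"] == "new"
```

If the input document must stay untouched, including its nested containers, copying it with copy.deepcopy before calling run is one option, at the cost of duplicating the whole object graph.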

MiddleEgyptianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Middle Egyptian (midd1369).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midd1369'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
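
The closing loop of run records the provenance id under the "span" key for every sentence index, keeping any keys an entry already holds. A minimal sketch of that bookkeeping with plain dictionaries follows; tag_sentences, the "translation" key, and the id strings are hypothetical names chosen for illustration.

```python
def tag_sentences(
    sources: dict[int, dict[str, str]],
    sentence_count: int,
    prov_id: str,
) -> dict[int, dict[str, str]]:
    """Record prov_id under the "span" key for every sentence index."""
    for idx in range(sentence_count):
        entry = sources.get(idx, {})  # keep any keys already recorded
        entry["span"] = prov_id
        sources[idx] = entry
    return sources


# Sentence 0 already carries an unrelated (hypothetical) annotation source.
existing = {0: {"translation": "prov-a"}}
tagged = tag_sentences(existing, sentence_count=2, prov_id="prov-b")
```

After the call, sentence 0 holds both its earlier key and the new "span" key, while sentence 1 holds only "span".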

OldEgyptianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Old Egyptian (olde1242).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'olde1242'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

LateEgyptianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitter for Late Egyptian (late1256).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'late1256'

process_id class-attribute

process_id: str = 'sentence_split'
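
Each concrete process differs from the base class only in its glottolog_id; process_id and the algorithm property are inherited. The sketch below mirrors that pattern with hypothetical classes (ToySplittingProcess and ToyLateEgyptianProcess) and a placeholder splitter. It is an illustration of the subclassing idea, not the library's implementation.

```python
from functools import cached_property
from typing import Callable, Optional

Splitter = Callable[[str, str], list[tuple[int, int]]]


def _whole_text(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Placeholder splitter: treats the entire text as one sentence."""
    return [(0, len(text))] if text else []


class ToySplittingProcess:
    """Hypothetical base class holding the shared defaults."""

    process_id: str = "sentence_split"
    glottolog_id: Optional[str] = None

    @cached_property
    def algorithm(self) -> Splitter:
        # Computed on first access, then stored on the instance.
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        return _whole_text


class ToyLateEgyptianProcess(ToySplittingProcess):
    glottolog_id = "late1256"  # the only thing the subclass declares


process = ToyLateEgyptianProcess()
spans = process.algorithm("text", process.glottolog_id)
```

Keeping the language code as a class attribute means a new language or stage needs only a short subclass.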

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldMiddleWelshSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle Welsh (midd1254).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midd1254'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MiddleBretonSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle Breton (midb1244).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midb1244'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.
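
The two failure conditions above can be sketched as a standalone guard. check_inputs and describe are hypothetical helpers; the messages and the order of the checks follow the run listing for this class. Note that the truthiness test treats an empty string the same as a missing text.

```python
from typing import Optional


def check_inputs(normalized_text: Optional[str], glottolog_id: Optional[str]) -> None:
    """Raise ValueError for the two conditions listed above."""
    if not normalized_text:  # None and "" both count as missing
        raise ValueError("Doc must have `normalized_text`.")
    if glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")


def describe(normalized_text: Optional[str], glottolog_id: Optional[str]) -> str:
    """Return "ok" or the error message, so callers can report failures."""
    try:
        check_inputs(normalized_text, glottolog_id)
    except ValueError as err:
        return str(err)
    return "ok"
```

Catching ValueError at the call site, as describe does, lets a batch job log the problem for one document and continue with the next.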

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MiddleCornishSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Cornish (corn1251).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'corn1251'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
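
When building the provenance record, run prefers the dialect's Glottolog code and falls back to the language's, settling on None if neither can be read. The sketch below isolates that lookup in a hypothetical helper, resolve_language_id, using SimpleNamespace objects in place of real documents. Two assumptions are made for illustration: the helper catches only AttributeError where the listing catches Exception, and the codes "dial0000" and "lang0000" are invented placeholders.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_language_id(doc: Any) -> Optional[str]:
    """Prefer the dialect's Glottolog code, then the language's, else None."""
    try:
        if doc.dialect and doc.dialect.glottolog_id:
            return doc.dialect.glottolog_id
        return doc.language.glottolog_id
    except AttributeError:
        # Narrower than the listing, which catches Exception.
        return None


# Placeholder codes, not real Glottolog entries.
with_dialect = SimpleNamespace(
    dialect=SimpleNamespace(glottolog_id="dial0000"),
    language=SimpleNamespace(glottolog_id="lang0000"),
)
language_only = SimpleNamespace(
    dialect=None,
    language=SimpleNamespace(glottolog_id="lang0000"),
)
neither = SimpleNamespace(dialect=None, language=None)

resolved = [resolve_language_id(d) for d in (with_dialect, language_only, neither)]
```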

OldPrussianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Prussian (prus1238).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'prus1238'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

LithuanianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Lithuanian (lith1251).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'lith1251'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

LatvianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Latvian (latv1249).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'latv1249'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
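
Since sentence_boundaries holds (start, stop) character indices into normalized_text, the sentences themselves can be recovered by slicing. The sketch below uses a plain string and hand-written offsets in place of a real Doc; the helper name `slice_sentences` is an assumption for illustration.

```python
def slice_sentences(text: str, boundaries: list[tuple[int, int]]) -> list[str]:
    """Return the substring for each (start, stop) pair, with stop exclusive."""
    return [text[start:stop] for start, stop in boundaries]


# Hand-written offsets standing in for ``output_doc.sentence_boundaries``.
sentences = slice_sentences("Po. Jo!", [(0, 3), (4, 7)])
```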

AlbanianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Albanian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'gheg1238'
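
As the listing shows, a concrete subclass differs from the base class only in its glottolog_id. The pattern can be sketched with a stand-in base class; `ToySplittingProcess`, `ToyAlbanianProcess`, and `require_language` are hypothetical names, not CLTK classes or methods.

```python
from typing import Optional


class ToySplittingProcess:
    """Stand-in for the base class; hypothetical, not CLTK's implementation."""

    process_id: str = "sentence_split"
    glottolog_id: Optional[str] = None

    def require_language(self) -> str:
        # Mirrors the guard in run(): a process without a code cannot split.
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        return self.glottolog_id


class ToyAlbanianProcess(ToySplittingProcess):
    """Only the language code changes; everything else is inherited."""

    glottolog_id = "gheg1238"
```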

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
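
The closing loop of run tags every sentence index with the provenance record ID under the "span" key, keeping any other keys already stored for that index. A sketch of that bookkeeping with plain dictionaries follows; the function name `tag_sentence_spans` and the sample IDs are assumptions, and unlike the source, which updates the document in place, this version returns a copy.

```python
from typing import Optional


def tag_sentence_spans(
    sources: Optional[dict[int, dict[str, str]]],
    sentence_count: int,
    prov_id: str,
) -> dict[int, dict[str, str]]:
    """Return a copy of ``sources`` with ``"span"`` set for each sentence index."""
    tagged = dict(sources or {})
    for idx in range(sentence_count):
        # Copy the existing entry so other annotation keys are preserved.
        entry = dict(tagged.get(idx, {}))
        entry["span"] = prov_id
        tagged[idx] = entry
    return tagged
```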

SauraseniPrakritSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Sauraseni Prakrit.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'saur1252'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
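
The language recorded in the provenance entry is resolved by preferring the document's dialect code and falling back to its language code, with a failed lookup yielding None. The sketch below reproduces that fallback with `SimpleNamespace` stand-ins for a Doc. The helper name `resolve_lang_id` is hypothetical, and it catches only AttributeError, whereas the source catches Exception.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_lang_id(doc: Any) -> Optional[str]:
    """Prefer the dialect's Glottolog code, then the language's, else None."""
    try:
        if doc.dialect and doc.dialect.glottolog_id:
            return doc.dialect.glottolog_id
        return doc.language.glottolog_id
    except AttributeError:
        # Narrower than the source, which swallows any Exception.
        return None


# Stand-in documents; the attribute layout is assumed from the source listing.
with_dialect = SimpleNamespace(
    dialect=SimpleNamespace(glottolog_id="saur1252"),
    language=SimpleNamespace(glottolog_id="maha1305"),
)
without_dialect = SimpleNamespace(
    dialect=None, language=SimpleNamespace(glottolog_id="maha1305")
)
no_language = SimpleNamespace(dialect=None, language=None)
```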

MaharastriPrakritSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Maharastri Prakrit.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'maha1305'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MagadhiPrakritSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Magadhi Prakrit.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'maga1260'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

GandhariSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Gandhari.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'gand1259'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

MoabiteSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Moabite.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'moab1234'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

AmmoniteSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Ammonite.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'ammo1234'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

EdomiteSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Edomite.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'edom1234'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc

OldAramaicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Aramaic (up to 700 BCE).

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'olda1246'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

Source code in cltk/sentence/processes.py
def run(self, input_doc: Doc) -> Doc:
    """Compute sentence boundaries and return an updated document.

    Args:
      input_doc: Document whose ``normalized_text`` will be segmented.

    Returns:
      A shallow copy of ``input_doc`` with ``sentence_boundaries`` set to
      a list of ``(start, stop)`` character indices.

    Raises:
      ValueError: If ``normalized_text`` is missing or if ``glottolog_id``
        is not set on the process.

    """
    output_doc = copy(input_doc)
    log = bind_from_doc(output_doc)
    if not output_doc.normalized_text:
        msg: str = "Doc must have `normalized_text`."
        log.error(msg)
        raise ValueError(msg)
    log.debug(
        f"Sentence splitter passed to split_sentences_multilang: {self.glottolog_id}"
    )
    # Ensure required attributes are present
    if self.glottolog_id is None:
        raise ValueError("glottolog_id must be set for sentence splitting")
    # Callable typing does not retain keyword names; pass positionally
    output_doc.sentence_boundaries = self.algorithm(
        output_doc.normalized_text,
        self.glottolog_id,
    )
    lang_id = None
    try:
        if output_doc.dialect and output_doc.dialect.glottolog_id:
            lang_id = output_doc.dialect.glottolog_id
        else:
            lang_id = output_doc.language.glottolog_id
    except Exception:
        lang_id = None
    prov_record = build_provenance_record(
        language=lang_id,
        backend=output_doc.backend,
        process=self.__class__.__name__,
        model=str(output_doc.model) if output_doc.model else None,
        provider=str(output_doc.backend) if output_doc.backend else None,
        notes={"sentence_count": len(output_doc.sentence_boundaries)},
    )
    prov_id = add_provenance_record(
        output_doc,
        prov_record,
        set_default=output_doc.default_provenance_id is None,
    )
    if prov_id:
        if not output_doc.sentence_annotation_sources:
            output_doc.sentence_annotation_sources = {}
        for idx in range(len(output_doc.sentence_boundaries)):
            entry = output_doc.sentence_annotation_sources.get(idx, {})
            entry["span"] = prov_id
            output_doc.sentence_annotation_sources[idx] = entry
    return output_doc
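
The final loop in `run` records, for each sentence index, which provenance record produced its span, and it keeps any other keys already stored for that index. The bookkeeping can be sketched in isolation with plain dictionaries. This is only an illustration: `tag_sentence_spans` and the sample identifiers are hypothetical names, and unlike `run` the sketch returns a new mapping instead of updating the document in place.

```python
from typing import Optional


def tag_sentence_spans(
    sources: Optional[dict[int, dict[str, str]]],
    sentence_count: int,
    prov_id: str,
) -> dict[int, dict[str, str]]:
    """Mark every sentence index as having its span produced by ``prov_id``."""
    # Start from an empty mapping when nothing has been recorded yet.
    tagged = dict(sources) if sources else {}
    for idx in range(sentence_count):
        # Copy the existing entry so other annotation keys are preserved.
        entry = dict(tagged.get(idx, {}))
        entry["span"] = prov_id
        tagged[idx] = entry
    return tagged


# Hypothetical sample data: one earlier annotation for sentence 0.
existing = {0: {"pos": "prov-a"}}
result = tag_sentence_spans(existing, 2, "prov-b")
print(result)
```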

OldAramaicSamalianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Old Aramaic–Samʾalian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'olda1245'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.
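
To make the callable's contract concrete, here is a minimal sketch of a function with the same shape: it takes `(text, glottolog_id)`, returns `(start, stop)` offsets, and raises `ValueError` for an unknown code. The punctuation table and the name `split_spans` are assumptions made for illustration; the library's actual per-language rules are not reproduced here.

```python
import re

# Hypothetical punctuation table, keyed by Glottolog code.
_TERMINATORS: dict[str, str] = {
    "olda1245": ".!?",
}


def split_spans(text: str, glottolog_id: str) -> list[tuple[int, int]]:
    """Return (start, stop) offsets of sentences, with stop exclusive."""
    if glottolog_id not in _TERMINATORS:
        raise ValueError(f"Unsupported glottolog_id: {glottolog_id}")
    chars = re.escape(_TERMINATORS[glottolog_id])
    # A sentence is a run of non-terminators followed by any terminators.
    pattern = re.compile(rf"[^{chars}]+[{chars}]*")
    spans: list[tuple[int, int]] = []
    for match in pattern.finditer(text):
        segment = match.group()
        stripped = segment.strip()
        if not stripped:
            continue  # Skip whitespace-only segments.
        # Shift the start past leading whitespace so offsets hug the sentence.
        start = match.start() + (len(segment) - len(segment.lstrip()))
        spans.append((start, start + len(stripped)))
    return spans


spans = split_spans("abc def. ghi!", "olda1245")
print(spans)
```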

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.
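
Because `sentence_boundaries` holds character offsets rather than strings, a caller has to slice the text to get the sentences back. A short sketch, assuming `stop` is exclusive as in Python slicing (the documentation above does not state this explicitly); `sentences_from_boundaries` and the sample values are illustrative only:

```python
def sentences_from_boundaries(
    text: str, boundaries: list[tuple[int, int]]
) -> list[str]:
    """Slice ``text`` into sentences using (start, stop) offsets."""
    return [text[start:stop] for start, stop in boundaries]


# Hand-written sample offsets, not output from the library.
text = "Alpha beta. Gamma delta."
boundaries = [(0, 11), (12, 24)]
sentences = sentences_from_boundaries(text, boundaries)
print(sentences)  # ['Alpha beta.', 'Gamma delta.']
```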

MiddleAramaicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Middle Aramaic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'midd1366'
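
Each concrete process differs from the base class only in this class attribute. The pattern can be sketched with stand-in classes; `SplitterBase`, `MiddleAramaicSplitter`, and `require_language` are hypothetical names used for illustration and are not the library's classes:

```python
from typing import Optional


class SplitterBase:
    """Stand-in for a base process whose language is unset by default."""

    process_id: str = "sentence_split"
    glottolog_id: Optional[str] = None

    def require_language(self) -> str:
        """Return the Glottolog code, or fail if the subclass did not set one."""
        if self.glottolog_id is None:
            raise ValueError("glottolog_id must be set for sentence splitting")
        return self.glottolog_id


class MiddleAramaicSplitter(SplitterBase):
    """Stand-in subclass: only the language code is overridden."""

    glottolog_id: Optional[str] = "midd1366"


print(MiddleAramaicSplitter().require_language())
```

Keeping the language code as a class attribute means a subclass needs no constructor or method overrides to select its rules.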

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

ClassicalMandaicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Classical Mandaic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'clas1253'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

HatranSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Hatran.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'hatr1234'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.
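
The return contract has two practical consequences: the input document is not modified, and a document without text is rejected before any splitting happens. A minimal sketch of both, using a stand-in dataclass; `MiniDoc` and `run_sketch` are hypothetical and model only the two fields involved:

```python
from copy import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MiniDoc:
    """Stand-in document with only the fields this sketch needs."""

    normalized_text: Optional[str] = None
    sentence_boundaries: list[tuple[int, int]] = field(default_factory=list)


def run_sketch(input_doc: MiniDoc, boundaries: list[tuple[int, int]]) -> MiniDoc:
    """Return a shallow copy carrying ``boundaries``; reject empty text."""
    output_doc = copy(input_doc)
    if not output_doc.normalized_text:
        raise ValueError("Doc must have `normalized_text`.")
    # Rebinding the attribute on the copy leaves the original untouched.
    output_doc.sentence_boundaries = boundaries
    return output_doc


original = MiniDoc(normalized_text="Alpha. Beta.")
updated = run_sketch(original, [(0, 6), (7, 12)])
print(original.sentence_boundaries, updated.sentence_boundaries)
```

Since the copy is shallow, attributes that are not reassigned still refer to the same objects as the input document.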

JewishBabylonianAramaicSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Jewish Babylonian Aramaic.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'jewi1240'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.

SamalianSentenceSplittingProcess

Bases: SentenceSplittingProcess

Sentence splitting process for Samʾalian.

glottolog_id class-attribute instance-attribute

glottolog_id: Optional[str] = 'sama1234'

process_id class-attribute

process_id: str = 'sentence_split'

algorithm cached property

algorithm: Callable[[str, str], list[tuple[int, int]]]

Return the language‑appropriate sentence boundary function.

The returned callable takes (text, glottolog_id) and returns a list of (start, stop) character offsets for each sentence.

Returns:

  • Callable[[str, str], list[tuple[int, int]]]

    A callable implementing sentence boundary detection.

Raises:

  • ValueError

    If the glottolog_id is not supported.

run

run(input_doc: Doc) -> Doc

Compute sentence boundaries and return an updated document.

Parameters:

  • input_doc (Doc) –

    Document whose normalized_text will be segmented.

Returns:

  • Doc

    A shallow copy of input_doc with sentence_boundaries set to a list of (start, stop) character indices.

Raises:

  • ValueError

    If normalized_text is missing or if glottolog_id is not set on the process.
