8.1.16. cltk.stops package

Submodules

cltk.stops.akk module

This stopword list was compiled by M. Willis Monroe.

cltk.stops.ang module

Old English: ‘Sourav Singh <ssouravsingh12@gmail.com>’. Adapted from the Introduction to Old English website at https://lrc.la.utexas.edu/eieol/engol.

cltk.stops.arb module

This list is inspired by the Arabic Stop Words Project: https://github.com/linuxscout/arabicstopwords

cltk.stops.cop module

This list is adapted from https://github.com/computationalstylistics/tidystopwords, which in turn is based on the Universal Dependencies treebanks.

cltk.stops.enm module

Middle English sources:

- people.stanford.edu/widner/content/text-mining-middle-ages (slide 13)
- textifier.com/resources/common-english-words.txt
- en.wikipedia.org/wiki/Middle_English
- en.wiktionary.org/wiki/Category:Middle_English_prepositions
- en.wiktionary.org/wiki/Category:Middle_English_determiners
- en.wiktionary.org/wiki/Category:Middle_English_conjunctions

cltk.stops.fro module

This list was compiled from the 100 most frequently occurring words in the french_text corpus, with content words removed. It also includes forms of auxiliary verbs taken from Anglade (1931), retrieved from https://fr.wikisource.org/wiki/Grammaire_élémentaire_de_l’ancien_français (available under an Attribution-ShareAlike 3.0 Creative Commons license).

Code used to determine most frequent words in the corpus:

import os

from nltk.probability import FreqDist
from cltk.tokenize.word import WordTokenizer

# Determines the 100 most common words (and their counts) in the French
# corpus, ignoring punctuation and case.
file_content = open(os.path.expanduser("~/cltk/cltk/stop/french/frenchtexts.txt")).read()
# (n.b.: this file has been moved to fro_models_cltk)

word_tokenizer = WordTokenizer('french')
words = [w.lower() for w in word_tokenizer.tokenize(file_content)]
fdist = FreqDist(words)

# 100 most common words and their numbers of occurrences
common_words = fdist.most_common(100)
cw_list = [x[0] for x in common_words]

# Outputs the 100 most common words to a .txt file
with open('french_prov_stops.txt', 'a') as f:
    for item in cw_list:
        print(item, file=f)

cltk.stops.gmh module

Middle High German: “Eleftheria Chatziargyriou <ele.hatzy@gmail.com>”, using a TF-IDF method. Sources of texts: http://www.gutenberg.org/files/22636/22636-h/22636-h.htm and http://texte.mediaevum.de/12mhd.htm

cltk.stops.grc module

Greek: ‘Kyle P. Johnson <kyle@kyle-p-johnson.com>’, from the Perseus Hopper source [http://sourceforge.net/projects/perseus-hopper], found at “/sgml/reading/build/stoplists”. That source contained only forms with an acute accent on the ultima; for each of these, the corresponding form with a grave accent on the ultima has been added. The Perseus source is made available under the Mozilla Public License 1.1 (MPL 1.1) [http://www.mozilla.org/MPL/1.1/].

cltk.stops.hin module

Classical Hindi stopwords. This list is composed of the 100 most frequently occurring words in the classical_hindi corpus <https://github.com/cltk/hindi_text_ltrc> in CLTK. Source code: <https://gist.github.com/inishchith/ad4bc0da200110de638f5408c64bb14c>

cltk.stops.lat module

Latin: from the Perseus Hopper source at /sgml/reading/build/stoplists. Source at http://sourceforge.net/projects/perseus-hopper/. Perseus data licensed under the Mozilla Public License 1.1 (MPL 1.1, http://www.mozilla.org/MPL/1.1/).

cltk.stops.non module

Old Norse: “Clément Besnier <clem@clementbesnier.fr>”. Stopwords were selected from Altnordisches Elementarbuch by Ranke and Hofmann, A New Introduction to Old Norse by Barnes, and Viking Language 1 by Byock (this last book provides a list of the most frequent words in the sagas, sorted by part of speech).

cltk.stops.omr module

Marathi: from the 100 most frequently occurring words in the Marathi corpus in CLTK.

cltk.stops.pan module

Panjabi: ‘Nimit Bhardwaj <nimitbhardwaj@gmail.com>’. These words are the most frequent words in the Guru Granth Sahib, taken from Sahib Singh's text at http://gurbanifiles.org/gurmukhi/index.htm.

Note: This list is in the Gurmukhi alphabet.

cltk.stops.processes module

class cltk.stops.processes.StopsProcess(language: str = None)[source]

Bases: cltk.core.data_types.Process

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.stops.processes import StopsProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "lat"
>>> words = [Word(string=token) for token in split_punct_ws(get_example_text(lang))]
>>> stops_process = StopsProcess(language=lang)
>>> output_doc = stops_process.run(Doc(raw=get_example_text(lang), words=words))
>>> output_doc.words[1].string
'est'
>>> output_doc.words[1].stop
True

Note: this marks a word as a stop if there is a match on either the inflected form (Word.string) or the lemma (Word.lemma).
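The either-or matching rule described above can be sketched as follows. The `Word` dataclass here is a minimal stand-in for cltk.core.data_types.Word, and `mark_stops` with its tiny sample stoplist is illustrative, not the CLTK implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Word:
    """Minimal stand-in for cltk.core.data_types.Word (illustrative only)."""
    string: str                  # inflected surface form
    lemma: Optional[str] = None  # dictionary headword, if known
    stop: Optional[bool] = None  # filled in by the marking pass

def mark_stops(words: List[Word], stoplist: List[str]) -> List[Word]:
    """Flag a word as a stop on a match of either its form or its lemma."""
    stops = set(stoplist)
    for word in words:
        word.stop = word.string in stops or word.lemma in stops
    return words

words = mark_stops(
    [Word(string="est", lemma="sum"), Word(string="Gallia", lemma="Gallia")],
    stoplist=["sum", "et", "in"],
)
# "est" is caught via its lemma "sum"; "Gallia" matches neither way.
```

The lemma check matters for inflected languages: a surface form such as "est" would slip past a stoplist that records only headwords.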

Return type

Doc

cltk.stops.san module

Sanskrit: ‘Akhilesh S. Chobey <akhileshchobey03@gmail.com>’. Further explanations at: https://gist.github.com/Akhilesh28/b012159a10a642ed5c34e551db76f236

cltk.stops.words module

Stopwords for languages.

Stopwords are high-frequency, low-information words (articles, particles, prepositions, pronouns, and the like) that are commonly filtered out of a text before analysis.

class cltk.stops.words.Stops(iso_code)[source]

Bases: object

Class for filtering stopwords.

>>> from cltk.stops.words import Stops
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> stops_obj = Stops(iso_code="lat")
>>> tokens = split_punct_ws(get_example_text("lat"))
>>> len(tokens)
>>> tokens[25:30]
['legibus', 'inter', 'se', 'differunt', 'Gallos']
>>> tokens_filtered = stops_obj.remove_stopwords(tokens=tokens)
>>> len(tokens_filtered)
>>> tokens_filtered[22:26]
['legibus', 'se', 'differunt', 'Gallos']

Take language code, return list of stopwords.

Return type


remove_stopwords(tokens, extra_stops=None)[source]

Take list of strings and remove stopwords.
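The behavior described above can be sketched as a standalone function; this is an assumption based on the signature and docstring, not the CLTK source, and the `stops` parameter stands in for the stoplist the class loads from its ISO code:

```python
from typing import List, Optional

def remove_stopwords(
    tokens: List[str],
    stops: List[str],
    extra_stops: Optional[List[str]] = None,
) -> List[str]:
    """Drop stopwords from a token list.

    `stops` is the language's stoplist; `extra_stops` lets a caller
    remove additional, corpus-specific words in the same pass.
    """
    to_remove = set(stops)
    if extra_stops:
        to_remove.update(extra_stops)
    return [token for token in tokens if token not in to_remove]

tokens = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres"]
filtered = remove_stopwords(tokens, stops=["est", "in"], extra_stops=["tres"])
# → ['Gallia', 'omnis', 'divisa', 'partes']
```

Building one set from both lists keeps the pass over the tokens a single O(1)-per-token membership test.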

Return type