8.1.12.1.2.1.1.1.1. cltk.phonology.arb.utils.pyarabic package

8.1.12.1.2.1.1.1.1.1. Submodules

8.1.12.1.2.1.1.1.1.2. cltk.phonology.arb.utils.pyarabic.araby module

Arabic module

8.1.12.1.2.1.1.1.1.2.1. Features:

  • Arabic letters classification

  • Text tokenization

  • Strip Harakat (all, except Shadda, tatweel, last_haraka)

  • Separate and join Letters and Harakat

  • Reduce tashkeel

  • Measure tashkeel similarity (Harakat, fully or partially vocalized, similarity with a template)

  • Letters normalization (Ligatures and Hamza)

Includes code written by ‘Arabtechies’, ‘Arabeyes’, ‘Taha Zerrouki’.

Todo

Remove, rewrite, and/or refactor this due to GPL.

cltk.phonology.arb.utils.pyarabic.araby.is_sukun(archar)[source]

Checks for the Arabic Sukun mark.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Sukun mark
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_shadda(archar)[source]

Checks for the Arabic Shadda mark.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Shadda mark
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_tatweel(archar)[source]

Checks for the Arabic Tatweel letter modifier.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Tatweel
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_tanwin(archar)[source]

Checks for Arabic Tanwin marks (FATHATAN, DAMMATAN, KASRATAN).

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Tanwin mark
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_tashkeel(archar)[source]

Checks for Arabic Tashkeel Marks:

  • FATHA, DAMMA, KASRA, SUKUN,

  • SHADDA,

  • FATHATAN, DAMMATAN, KASRATAN.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Tashkeel mark
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_haraka(archar)[source]

Checks for Arabic Harakat marks (FATHA, DAMMA, KASRA, SUKUN, TANWIN).

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Haraka
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_shortharaka(archar)[source]

Checks for Arabic short Harakat marks (FATHA, DAMMA, KASRA, SUKUN).

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a short Haraka
@rtype: Boolean
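The mark categories described above nest inside each other, which can be sketched with plain sets of Unicode code points. This is a minimal illustration and not the module's own implementation; the `_sketch` function names are hypothetical, and only the code points (U+064B to U+0652) are standard Unicode facts.

```python
# Standard Unicode code points for the Arabic diacritics named above.
FATHATAN, DAMMATAN, KASRATAN = "\u064b", "\u064c", "\u064d"
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"

# Each category is built from the previous one, mirroring the docstrings.
TANWIN = frozenset((FATHATAN, DAMMATAN, KASRATAN))
SHORT_HARAKAT = frozenset((FATHA, DAMMA, KASRA, SUKUN))
HARAKAT = SHORT_HARAKAT | TANWIN      # harakat exclude Shadda
TASHKEEL = HARAKAT | {SHADDA}         # tashkeel include Shadda


def is_shortharaka_sketch(archar):
    """Hypothetical helper: True for FATHA, DAMMA, KASRA or SUKUN."""
    return archar in SHORT_HARAKAT


def is_haraka_sketch(archar):
    """Hypothetical helper: True for short harakat and tanwin."""
    return archar in HARAKAT


def is_tashkeel_sketch(archar):
    """Hypothetical helper: True for any haraka or Shadda."""
    return archar in TASHKEEL
```

Under these assumptions Shadda counts as tashkeel but not as a haraka, and a tanwin mark counts as a haraka but not as a short haraka.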

cltk.phonology.arb.utils.pyarabic.araby.is_ligature(archar)[source]

Checks for Arabic ligatures such as LamAlef (LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE).

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a ligature
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_hamza(archar)[source]

Checks for Arabic Hamza forms. The HAMZAT are (HAMZA, WAW_HAMZA, YEH_HAMZA, HAMZA_ABOVE, HAMZA_BELOW, ALEF_HAMZA_BELOW, ALEF_HAMZA_ABOVE).

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Hamza form
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_alef(archar)[source]

Checks for Arabic Alef forms. ALEFAT = (ALEF, ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, ALEF_WASLA, ALEF_MAKSURA).

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is an Alef form
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_yehlike(archar)[source]

Checks for Arabic Yeh forms. Yeh forms: YEH, YEH_HAMZA, SMALL_YEH, ALEF_MAKSURA.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Yeh form
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_wawlike(archar)[source]

Checks for Arabic Waw-like forms. Waw forms: WAW, WAW_HAMZA, SMALL_WAW.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Waw form
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_teh(archar)[source]

Checks for Arabic Teh forms. Teh forms: TEH, TEH_MARBUTA.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Teh form
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_small(archar)[source]

Checks for Arabic Small letters. Small letters: SMALL ALEF, SMALL WAW, SMALL YEH.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Small letter
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_weak(archar)[source]

Checks for Arabic Weak letters. Weak letters: ALEF, WAW, YEH, ALEF_MAKSURA.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Weak letter
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_moon(archar)[source]

Checks for Arabic Moon letters.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Moon letter
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_sun(archar)[source]

Checks for Arabic Sun letters.

@param archar: Arabic Unicode character
@type archar: unicode
@return: True if archar is a Sun letter
@rtype: Boolean
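The docstrings above do not list the Sun or Moon letters. As a rough sketch, the check below uses the fourteen Sun letters of traditional Arabic grammar (the consonants that assimilate the lam of the definite article). The set and the function name are assumptions made for illustration; the module's own sets may be defined differently, for example in how Hamza and Alef variants are treated.

```python
# Assumption: the fourteen traditional Sun letters, given by code point:
# TEH, THEH, DAL, THAL, REH, ZAIN, SEEN, SHEEN, SAD, DAD, TAH, ZAH, LAM, NOON.
SUN_LETTERS = frozenset(
    "\u062a\u062b\u062f\u0630\u0631\u0632\u0633"
    "\u0634\u0635\u0636\u0637\u0638\u0644\u0646"
)


def is_sun_sketch(archar):
    """Hypothetical helper: True if archar is one of the Sun letters above."""
    return archar in SUN_LETTERS
```

With this set, SHEEN (U+0634) is reported as a Sun letter and QAF (U+0642) is not.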

cltk.phonology.arb.utils.pyarabic.araby.order(archar)[source]

Returns the Arabic letter order, between 1 and 29. Alef is 1, Yeh is 28, Hamza is 29. Teh Marbuta has the same order as Teh, 3.

@param archar: Arabic Unicode character
@type archar: unicode
@return: Arabic order
@rtype: integer

cltk.phonology.arb.utils.pyarabic.araby.name(archar)[source]

Returns the Arabic letter name in Arabic.

@param archar: Arabic Unicode character
@type archar: unicode
@return: Arabic name
@rtype: unicode

cltk.phonology.arb.utils.pyarabic.araby.arabicrange()[source]

Returns a list of Arabic characters, from ، to ْ.

@return: list of Arabic characters
@rtype: unicode

cltk.phonology.arb.utils.pyarabic.araby.has_shadda(word)[source]

Checks if the Arabic word contains a Shadda.

@param word: Arabic word
@type word: unicode
@return: True if a Shadda exists
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_vocalized(word)[source]

Checks if the Arabic word is vocalized. The word must not contain any spaces or punctuation.

@param word: Arabic word
@type word: unicode
@return: True if the word is vocalized
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_vocalizedtext(text)[source]

Checks if the Arabic text is vocalized. The text can contain many words and spaces.

@param text: Arabic text
@type text: unicode
@return: True if the text is vocalized
@rtype: Boolean
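One simple way to read "vocalized" is "contains at least one tashkeel mark". The sketch below takes that reading as an assumption; the module's own criterion may be stricter, and the function names are hypothetical.

```python
# FATHATAN..SUKUN (U+064B..U+0652), the tashkeel marks listed for is_tashkeel.
TASHKEEL = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")


def is_vocalized_sketch(word):
    """Hypothetical helper: True if the word carries any tashkeel mark."""
    return any(ch in TASHKEEL for ch in word)


def is_vocalizedtext_sketch(text):
    """Hypothetical helper: True if any whitespace-separated word is vocalized."""
    return any(is_vocalized_sketch(word) for word in text.split())
```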

cltk.phonology.arb.utils.pyarabic.araby.is_arabicstring(text)[source]

Checks that all characters belong to the standard Arabic Unicode block. An Arabic string can contain spaces, digits and punctuation, but only standard Arabic characters, not extended Arabic.

@param text: input text
@type text: unicode
@return: True if all characters are in the Arabic block
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_arabicrange(text)[source]

Checks that all characters belong to the Arabic Unicode block.

@param text: input text
@type text: unicode
@return: True if all characters are in the Arabic block
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.is_arabicword(word)[source]

Checks for a valid Arabic word. An Arabic word contains no spaces, digits or punctuation. To avoid some spelling errors, TEH_MARBUTA must be at the end.

@param word: input word
@type word: unicode
@return: True if the word is a valid Arabic word
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.first_char(word)[source]

Returns the first character.

@param word: given word
@type word: unicode
@return: the first character
@rtype: unicode char

cltk.phonology.arb.utils.pyarabic.araby.second_char(word)[source]

Returns the second character.

@param word: given word
@type word: unicode
@return: the second character
@rtype: unicode char

cltk.phonology.arb.utils.pyarabic.araby.last_char(word)[source]

Returns the last letter. Example: in zerrouki, ‘i’ is the last letter.

@param word: given word
@type word: unicode
@return: the last letter
@rtype: unicode char

cltk.phonology.arb.utils.pyarabic.araby.secondlast_char(word)[source]

Returns the second-to-last letter. Example: in zerrouki, ‘k’ is the second-to-last letter.

@param word: given word
@type word: unicode
@return: the second-to-last letter
@rtype: unicode char

cltk.phonology.arb.utils.pyarabic.araby.strip_harakat(text)[source]

Strips Harakat from an Arabic word, except Shadda. The stripped marks are:

  • FATHA, DAMMA, KASRA

  • SUKUN

  • FATHATAN, DAMMATAN, KASRATAN

@param text: Arabic text
@type text: unicode
@return: the stripped text
@rtype: unicode

cltk.phonology.arb.utils.pyarabic.araby.strip_lastharaka(text)[source]

Strips the last Haraka from an Arabic word, except Shadda. The stripped marks are:

  • FATHA, DAMMA, KASRA

  • SUKUN

  • FATHATAN, DAMMATAN, KASRATAN

@param text: Arabic text
@type text: unicode
@return: the stripped text
@rtype: unicode

cltk.phonology.arb.utils.pyarabic.araby.strip_tashkeel(text)[source]

Strips vowels from a text, including Shadda. The stripped marks are:

  • FATHA, DAMMA, KASRA

  • SUKUN

  • SHADDA

  • FATHATAN, DAMMATAN, KASRATAN

@param text: Arabic text
@type text: unicode
@return: the stripped text
@rtype: unicode

cltk.phonology.arb.utils.pyarabic.araby.strip_tatweel(text)[source]

Strips tatweel from a text and returns the resulting text.

@param text: Arabic text
@type text: unicode
@return: the stripped text
@rtype: unicode

cltk.phonology.arb.utils.pyarabic.araby.strip_shadda(text)[source]

Strips Shadda from a text and returns the resulting text.

@param text: Arabic text
@type text: unicode
@return: the stripped text
@rtype: unicode
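The strip functions above differ only in which marks they delete. A minimal sketch of that idea, using `str.translate` and `str.replace` on the standard code points, might look as follows. The `_sketch` names are hypothetical and this is not the module's own code; in particular, the last-haraka variant here only looks at the final character, which is an assumption.

```python
HARAKAT = "\u064b\u064c\u064d\u064e\u064f\u0650\u0652"  # all harakat, no Shadda
SHADDA = "\u0651"
TATWEEL = "\u0640"

# Translation tables mapping each unwanted mark to None (i.e. delete it).
_DROP_HARAKAT = {ord(ch): None for ch in HARAKAT}
_DROP_TASHKEEL = {ord(ch): None for ch in HARAKAT + SHADDA}


def strip_harakat_sketch(text):
    """Delete harakat but keep Shadda."""
    return text.translate(_DROP_HARAKAT)


def strip_tashkeel_sketch(text):
    """Delete harakat and Shadda."""
    return text.translate(_DROP_TASHKEEL)


def strip_lastharaka_sketch(text):
    """Delete a haraka only if it is the final character (assumption)."""
    if text and text[-1] in HARAKAT:
        return text[:-1]
    return text


def strip_tatweel_sketch(text):
    """Delete every tatweel (kashida) character."""
    return text.replace(TATWEEL, "")


def strip_shadda_sketch(text):
    """Delete every Shadda, leaving the harakat in place."""
    return text.replace(SHADDA, "")
```

For a word such as MEEM + FATHA + DAL + SHADDA + FATHA, the harakat variant keeps the Shadda while the tashkeel variant returns the two bare letters.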

cltk.phonology.arb.utils.pyarabic.araby.normalize_ligature(text)[source]

Normalizes Lam Alef ligatures into two letters (LAM and ALEF) and returns the resulting text. Some systems represent the LamAlef ligature as a single letter; this function converts it into two letters. The ligatures converted into LAM and ALEF are: LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE.

@param text: Arabic text
@type text: unicode
@return: the converted text
@rtype: unicode
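As a sketch of the conversion described above, each of the four ligature code points can be replaced by the two-letter sequence LAM + ALEF. The code points below are the isolated presentation forms from the Unicode Arabic Presentation Forms-B block; mapping all four to a bare ALEF follows the wording of the description, and whether the module handles other forms is not stated here. The function name is hypothetical.

```python
LAM, ALEF = "\u0644", "\u0627"

# Isolated-form ligatures: LAM_ALEF, LAM_ALEF_HAMZA_ABOVE,
# LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE.
LAM_ALEF_LIGATURES = ("\ufefb", "\ufef7", "\ufef9", "\ufef5")


def normalize_ligature_sketch(text):
    """Replace each single-character ligature with LAM followed by ALEF."""
    for ligature in LAM_ALEF_LIGATURES:
        text = text.replace(ligature, LAM + ALEF)
    return text
```

Text without any of these ligatures passes through unchanged.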

cltk.phonology.arb.utils.pyarabic.araby.normalize_hamza(word)[source]

Standardizes the Hamzat into one form of hamza, replaces Madda by hamza and alef, and replaces the LamAlefs by simplified letters.

@param word: Arabic text
@type word: unicode
@return: the converted text
@rtype: unicode

cltk.phonology.arb.utils.pyarabic.araby.separate(word, extract_shadda=False)[source]

Separates the letters from the vowels in an Arabic word. If a letter has no haraka, the "not defined" haraka is assigned to it. Returns (letters, vowels).

@param word: the input word
@type word: unicode
@param extract_shadda: extract shadda as separate text
@type extract_shadda: Boolean
@return: (letters, vowels)
@rtype: pair of unicode

cltk.phonology.arb.utils.pyarabic.araby.joint(letters, marks)[source]

Joins the letters with the marks. The lengths of letters and marks must be equal. Returns the word.

@param letters: the word letters
@type letters: unicode
@param marks: the word marks
@type marks: unicode
@return: word
@rtype: unicode
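The separate/joint pair described above can be illustrated with a round trip. In this sketch the placeholder for a letter without a haraka is tatweel, which is an assumption, as are the function names. Shadda is left in the letter string with a placeholder mark, a simplification that ignores the `extract_shadda` option.

```python
HARAKAT = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0652")
NOT_DEF_HARAKA = "\u0640"  # assumption: tatweel marks "no haraka"


def separate_sketch(word):
    """Split a word into (letters, marks) strings of equal length."""
    letters, marks = [], []
    for ch in word:
        if ch in HARAKAT:
            if marks:              # attach the haraka to the previous letter
                marks[-1] = ch
        else:
            letters.append(ch)
            marks.append(NOT_DEF_HARAKA)
    return "".join(letters), "".join(marks)


def joint_sketch(letters, marks):
    """Interleave letters and marks, skipping the placeholder."""
    if len(letters) != len(marks):
        raise ValueError("letters and marks must have the same length")
    out = []
    for letter, mark in zip(letters, marks):
        out.append(letter)
        if mark != NOT_DEF_HARAKA:
            out.append(mark)
    return "".join(out)
```

Joining the output of the separation step reproduces the input word as long as each letter carries at most one haraka.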

cltk.phonology.arb.utils.pyarabic.araby.vocalizedlike(word1, word2)[source]

If the two words have the same letters and the same harakat, this function returns True. The two words can be fully or partially vocalized.

@param word1: first word
@type word1: unicode
@param word2: second word
@type word2: unicode
@return: True if the two words have similar vocalization
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.waznlike(word1, wazn)[source]

Checks whether word1 matches a wazn (pattern). The letters must be equal; the wazn uses the letters FEH, AIN and LAM as generic letters. The two words can be fully or partially vocalized.

@param word1: input word
@type word1: unicode
@param wazn: given word template وزن
@type wazn: unicode
@return: True if the two words have similar vocalization
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.shaddalike(partial, fully)[source]

If the two words have the same letters and the same harakat, this function returns True. The first word is partially vocalized, the second fully vocalized. If the partially vocalized word contains a shadda, it must be at the same place in the fully vocalized word.

@param partial: the partially vocalized word
@type partial: unicode
@param fully: the fully vocalized word
@type fully: unicode
@return: True if the words match, including the position of the shadda
@rtype: Boolean

cltk.phonology.arb.utils.pyarabic.araby.reduce_tashkeel(text)[source]

Reduces the Tashkeel by deleting evident cases.

@param text: the fully vocalized input text
@type text: unicode
@return: partially vocalized text
@rtype: unicode

cltk.phonology.arb.utils.pyarabic.araby.vocalized_similarity(word1, word2)[source]

If the two words have the same letters and the same harakat, this function returns True. The two words can be fully or partially vocalized.

@param word1: first word
@type word1: unicode
@param word2: second word
@type word2: unicode
@return: True if the words are similar, otherwise the negative number of errors
@rtype: Boolean / int

cltk.phonology.arb.utils.pyarabic.araby.tokenize(text='')[source]

Tokenizes text into words.

@param text: the input text
@type text: unicode
@return: list of words
@rtype: list
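A rough sketch of word tokenization for vocalized Arabic text follows. Python's `\w` matches Arabic letters but not the combining tashkeel marks, so the character class adds the range U+064B to U+0652 explicitly. Dropping punctuation entirely is an assumption of this sketch and may differ from what the module returns; the function name is hypothetical.

```python
import re

# Word characters plus the tashkeel marks FATHATAN..SUKUN, which \w
# alone would treat as separators.
_TOKEN = re.compile(r"[\w\u064b-\u0652]+")


def tokenize_sketch(text=""):
    """Return the runs of letters, digits and tashkeel found in text."""
    return _TOKEN.findall(text)
```

Without the extra range, a vocalized word would be split at every haraka.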

8.1.12.1.2.1.1.1.1.3. cltk.phonology.arb.utils.pyarabic.stack module

Stack module

Includes code written by ‘Arabtechies’, ‘Arabeyes’, ‘Taha Zerrouki’.

class cltk.phonology.arb.utils.pyarabic.stack.Stack(text='')[source]

Bases: object

Stack class

push(item)[source]

Pushes an item onto the stack.

@param item: pushed item
@type item: mixed
@return: None
@rtype: None

pop()[source]

Pops an item from the stack.

@return: popped item
@rtype: mixed

is_empty()[source]

Tests if the stack is empty.

@return: True or False
@rtype: boolean
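The three methods above describe an ordinary last-in, first-out stack. A minimal stand-in is sketched below; seeding the stack with the characters of `text` and returning None when popping an empty stack are both assumptions, and `StackSketch` is a hypothetical name rather than the documented class.

```python
class StackSketch:
    """Hypothetical stand-in for a character stack built from a string."""

    def __init__(self, text=""):
        # Assumption: the initial text is stored one character per item.
        self.items = list(text)

    def push(self, item):
        """Put an item on top of the stack."""
        self.items.append(item)

    def pop(self):
        """Remove and return the top item, or None if the stack is empty."""
        if self.items:
            return self.items.pop()
        return None

    def is_empty(self):
        """Return True when no items remain."""
        return not self.items
```

Popping until `is_empty()` returns True walks a word from its last character to its first, which suits processing marks that follow the letter they modify.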