Punjabi is an Indo-Aryan language and the native language of the Punjabi people, who inhabit the historical Punjab region of Pakistan and India. Punjabi developed from Sanskrit through the Prakrit languages and later Apabhraṃśa. It emerged as an Apabhraṃśa, a degenerated form of Prakrit, in the 7th century A.D. and became stable by the 10th century. By the 10th century, many Nath poets were associated with earlier Punjabi works. Arabic and Persian influence in the historical Punjab region began with the late first-millennium Muslim conquests on the Indian subcontinent. (Source: Wikipedia)
Use CorpusImporter or browse the CLTK GitHub repository (anything beginning with punjabi_) to discover available Punjabi corpora.
    In : from cltk.corpus.utils.importer import CorpusImporter
    In : c = CorpusImporter('punjabi')
    In : c.list_corpora
    Out: ['punjabi_text_gurban']
Now import any corpus you like from the list of available corpora.
The Punjabi digits, vowels, consonants, and symbols are placed in cltk/corpus/punjabi/alphabet.py. It is fully commented, so look there for more information about the language's phonology.
To use Punjabi's independent vowels, for example:
    In : from cltk.corpus.punjabi.alphabet import INDEPENDENT_VOWELS
    In : print(INDEPENDENT_VOWELS)
    Out: ['ਆ', 'ਇ', 'ਈ', 'ਉ', 'ਊ', 'ਏ', 'ਐ', 'ਓ', 'ਔ']
These are the independent vowels (INDEPENDENT_VOWELS). They do not need to be attached to a consonant and are printed just as they are. They represent the sounds "aa", "i", "iii", "u", "uuu", "a", "oo", "o" and "ou", respectively.
Similarly, there are lists for BINDI_CONSONANTS (nasal pronunciation) and some OTHER_SYMBOLS (mostly for pronunciation).
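The pairing of independent vowels with their approximate sounds, as described above, can be sketched as a small lookup table. This is only an illustration: the vowel list is copied from the printed output rather than imported, so the snippet runs without CLTK installed, and the names SOUNDS and VOWEL_SOUNDS are hypothetical and not part of alphabet.py.

```python
# Independent vowels, copied from the output shown above (not imported),
# so that this sketch runs without CLTK installed.
INDEPENDENT_VOWELS = ['ਆ', 'ਇ', 'ਈ', 'ਉ', 'ਊ', 'ਏ', 'ਐ', 'ਓ', 'ਔ']

# Approximate sounds, in the same order as they are listed in the text.
# SOUNDS is a hypothetical name used only for this illustration.
SOUNDS = ["aa", "i", "iii", "u", "uuu", "a", "oo", "o", "ou"]

# Pair each vowel with its sound; VOWEL_SOUNDS is also a hypothetical name.
VOWEL_SOUNDS = dict(zip(INDEPENDENT_VOWELS, SOUNDS))

for vowel, sound in VOWEL_SOUNDS.items():
    print(f"{vowel} -> {sound}")
```

A dictionary keeps the vowel-to-sound association explicit, which is easier to check by eye than two parallel lists.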
The functions in cltk.corpus.punjabi.numerifier convert English numbers into Punjabi and vice versa.
    In: from cltk.corpus.punjabi.numerifier import punToEnglish_number
    In: from cltk.corpus.punjabi.numerifier import englishToPun_number
    In: c = punToEnglish_number('੧੨੩੪੫੬੭੮੯੦')
    In: print(c)
    Out: 1234567890
    In: c = englishToPun_number(1234567890)
    In: print(c)
    Out: ੧੨੩੪੫੬੭੮੯੦
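To show the idea behind this kind of digit conversion, here is a minimal standalone sketch built on Python's str.translate. It is not the CLTK implementation, which may work differently; the function names gurmukhi_to_int and int_to_gurmukhi are hypothetical and chosen only for this example.

```python
# Gurmukhi digits zero through nine occupy U+0A66 to U+0A6F.
GURMUKHI_DIGITS = "".join(chr(0x0A66 + i) for i in range(10))
ASCII_DIGITS = "0123456789"

# Translation tables mapping each digit character to its counterpart.
_TO_ASCII = str.maketrans(GURMUKHI_DIGITS, ASCII_DIGITS)
_TO_GURMUKHI = str.maketrans(ASCII_DIGITS, GURMUKHI_DIGITS)


def gurmukhi_to_int(text: str) -> int:
    """Convert a string of Gurmukhi digits to a Python int (hypothetical helper)."""
    return int(text.translate(_TO_ASCII))


def int_to_gurmukhi(number: int) -> str:
    """Convert a Python int to a string of Gurmukhi digits (hypothetical helper)."""
    return str(number).translate(_TO_GURMUKHI)


print(gurmukhi_to_int('੧੨੩੪੫੬੭੮੯੦'))
print(int_to_gurmukhi(1234567890))
```

Building the digit string from code points avoids typing errors in the table, and a translation table converts the whole string in one pass.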
To use the CLTK's built-in stopwords list:
    In: from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex
    In: from cltk.stop.punjabi.stops import STOPS_LIST
    In: sample = "ਪੰਜਾਬੀ ਪੰਜਾਬ ਦੀ ਮੁਖੱ ਬੋੋਲਣ ਜਾਣ ਵਾਲੀ ਭਾਸ਼ਾ ਹੈ।"
    In: x = indian_punctuation_tokenize_regex(sample)
    In: print(x)
    Out: ['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਦੀ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਵਾਲੀ', 'ਭਾਸ਼ਾ', 'ਹੈ', '।']
    In: lis = [w for w in x if w not in STOPS_LIST]
    In: print(lis)
    Out: ['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਭਾਸ਼ਾ', '।']
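The same filtering step can be tried without CLTK installed. The sketch below is only an approximation: simple_tokenize is a hypothetical stand-in for the CLTK tokenizer that merely splits off the danda, and SAMPLE_STOPS is an assumed three-word stop list inferred from the words removed in the example above, not the full STOPS_LIST.

```python
import re

# Assumed stop list: only the three words removed in the example above.
# The real STOPS_LIST in CLTK is larger.
SAMPLE_STOPS = {'ਦੀ', 'ਵਾਲੀ', 'ਹੈ'}


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: isolate danda marks, then split on whitespace."""
    # U+0964 is the danda and U+0965 the double danda.
    return re.sub(r"([\u0964\u0965])", r" \1 ", text).split()


def remove_stops(tokens, stops):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in stops]


sample = "ਪੰਜਾਬੀ ਪੰਜਾਬ ਦੀ ਮੁਖੱ ਬੋੋਲਣ ਜਾਣ ਵਾਲੀ ਭਾਸ਼ਾ ਹੈ।"
tokens = simple_tokenize(sample)
filtered = remove_stops(tokens, SAMPLE_STOPS)
print(filtered)
```

Storing the stop words in a set makes each membership test constant time, which matters more as the stop list and the text grow.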