Lab 7 Description
Lab 7 is similar to ICE 7: I read a file that contains some sentences and used the NLTK package to get tokens, POS tags, named entities, and other information, and developed a summary by removing repeated words and unhelpful articles like "a", "an", and "the". The screenshot below will give more insight into it. I have placed the text in a file called "my_data.txt".
The operations that I have performed on this text are:
1. Removing the unnecessary stop words (including the articles) and tokenizing the text:
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenize the text and keep only the words that are not English stop words.
# (The original also re-appended the filtered words in a second loop, which
# duplicated every entry; one pass is enough.)
removableWords = set(stopwords.words('english'))
words_split = word_tokenize(s)
usefulWords = [w for w in words_split if w not in removableWords]
```
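The snippet above assumes `s` already holds the raw text; a minimal way to load it from the file mentioned earlier would be:

```python
# Read the whole lab text into a single string (filename from the description above)
with open("my_data.txt") as f:
    s = f.read()
```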
2. Lemmatizing the words to get the base verb form of each:
```python
from nltk.stem import WordNetLemmatizer

# Create the lemmatizer once instead of once per word
lemmatizer = WordNetLemmatizer()
list_values1 = []
for i in usefulWords:
    # pos='v' reduces inflected verbs to their base form
    list_values1.append(lemmatizer.lemmatize(i, pos='v'))
print("lemmatized words:")
print("****************************")
print(list_values1)
```
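For example, with `pos='v'` the lemmatizer maps inflected verbs to their base form (illustrative inputs, not taken from the lab text):

```python
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("was", pos='v'))      # be
```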
3. POS-tagging them to show which part-of-speech category each word belongs to:
```python
from nltk import pos_tag

o = pos_tag(usefulWords)
print("POS tagging:")
print("*******************")
print(o)

# Drop verbs (all VB* tags) and punctuation, keeping everything else
verb_tags = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', ',', '.'}
list_values2 = [m for (m, n) in o if n not in verb_tags]
print("removing verbs:")
print("***********************")
print(list_values2)
```
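The named entities mentioned at the top build on tags like these; the original snippet isn't shown here, but a minimal sketch with NLTK's `ne_chunk` (which requires the `maxent_ne_chunker` and `words` downloads) would look like:

```python
from nltk import ne_chunk

# ne_chunk groups POS-tagged tokens into labeled entity subtrees
tree = ne_chunk(o)
for subtree in tree.subtrees():
    if subtree.label() in ('PERSON', 'ORGANIZATION', 'GPE'):
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), entity)
```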
4. Calculating the word frequency of the remaining useful words:
```python
import collections

# Count how many times each remaining word occurs
counter = collections.Counter(list_values2)
print("count of words:")
print("***********************")
print(counter)
```
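The keyword list used in the summary step below looks hand-picked from this count; a sketch of how it could be derived automatically (hypothetical, not part of the original code):

```python
# Take the five most frequent non-verb words as summary keywords
keywords = [word for word, count in counter.most_common(5)]
print(keywords)
```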
5. Summarizing by printing the lines that contain selected useful words:
print("Summary:") print ("***********************") for l in open("my_data"): if (("clumsy" in l) or ("deduction" in l) or ("process" in l) or ("careless" in l) or ("lately" in l)): print(l)
Output screenshot: