ICP_5 - Girees737/KDM_Projects GitHub Wiki
Summary: In this lesson, I learned how feature extraction techniques such as CountVectorizer, the TF, IDF, and TF-IDF vectorizers, and N-grams can be implemented in PySpark, and understood and compared the similar functionality in Python's scikit-learn library.
Tools:
Colab and Jupyter Notebook
Libraries:
Python, PySpark, Sklearn, nltk, pandas
Implementation:
- Read text files from different domains and perform the following tasks:
  a. Find the top 10 TF-IDF words for the raw input.
  b. Find the top 10 TF-IDF words for the lemmatized input.
  c. Find the top 10 TF-IDF words for the n-gram based input.
- Write a simple Spark program to read a dataset and find the Word2Vec similar words (words with the highest cosine similarity) for the top 10 TF-IDF words:
  a. Without NLP preprocessing
  b. With lemmatization
  c. With n-grams
Task-1:
- Created the Spark session as below.
- Created text files for the news, sports, shopping, entertainment, and science_tech domains.
- Loaded the data from the above text files and created a DataFrame from it.
- Extracted features from the above DataFrame with the TF-IDF vectorizer, without any preprocessing, and took the top 10 terms by TF-IDF score.
- Applied lemmatization to the raw text, fitted the TF-IDF vectorizer on the preprocessed text, and took the top 10 terms by TF-IDF score.
- Passed an n-gram range as a parameter to the TF-IDF vectorizer, extracted vectors for the n-grams, and took the top 10 by score.
Top 10 TF-IDF vectors on lemmatized text
Top 10 TF-IDF vectors on n-grams
Task-2:
- Created the data, built a Spark DataFrame from it, and visualized it as below.