ICP_5 - Girees737/KDM_Projects GitHub Wiki

Summary: In this lesson, I learned how feature extraction techniques such as CountVectorizer, TF, IDF, and TF-IDF vectorizers, and N-grams can be implemented in PySpark, and compared them with the similar functionality in Python's sklearn library.

Tools:

Colab and Jupyter Notebook

Libraries:

Python, PySpark, Sklearn, nltk, pandas

Implementation:

  1. Read text files from different domains and perform the following tasks:

     a. Find the top 10 TF-IDF words for the raw input.
     b. Find the top 10 TF-IDF words for the lemmatized input.
     c. Find the top 10 TF-IDF words for the n-gram based input.

  2. Write a simple Spark program to read a dataset and find the Word2Vec similar words (words with higher cosine similarity) for the top 10 TF-IDF words:

     a. Without NLP preprocessing.
     b. With lemmatization.
     c. With n-grams.

Task-1:

  1. Created the Spark session as below.

  2. Created text files for the news, sports, shopping, entertainment, and science_tech domains.

  3. Loaded the above text files and created a DataFrame from them.

  4. Extracted features from the DataFrame using the TF-IDF vectorizer without preprocessing and took the top 10 terms by TF-IDF score.

  5. Applied lemmatization to the raw text, fitted the TF-IDF vectorizer on the preprocessed text, and took the top 10 terms by TF-IDF score.

  6. Passed an n-gram range as a parameter to the TF-IDF vectorizer, extracted vectors for the n-grams, and took the top 10 by score.

Top 10 TF-IDF vectors on lemmatized text

Top 10 TF-IDF vectors on n-grams

Task-2:

  1. Created the data, built a Spark DataFrame from it, and visualized it as below.