Dataflair project

Got the AFINN dictionary from this URL: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

All we need to do is combine sentiment analysis using MapReduce (https://acadgild.com/blog/mapreduce-use-case-sentiment-analysis-twitter-data/)

with sentiment analysis using Hive (https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/).

How to load data from a text file to Hive table http://www.learn4master.com/learn-how-to/how-to-load-data-from-a-text-file-to-hive-table

Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale http://www.informit.com/articles/article.aspx?p=2756471&seqNum=4

Sentiment analysis commands used on the Hive side:

create external table load_tweets(id BIGINT,text STRING) ROW FORMAT delimited fields terminated by '\t' stored as textfile;

hive> describe load_tweets;
OK
id                      bigint
text                    string
Time taken: 0.069 seconds, Fetched: 2 row(s)

load data inpath '/user/guru/tmp5/part-r-00000' into table load_tweets;

create table split_words as select id as id,split(text,' ') as words from load_tweets;

hive> describe split_words;
OK
id                      bigint
words                   array<string>

Next, let's split each word inside the array into a new row. For this we need to use a UDTF (User-Defined Table-Generating Function). Hive has a built-in UDTF called explode, which extracts each element from an array and creates a new row for each element.

create table tweet_word as select id as id,word from split_words LATERAL VIEW explode(words) w as word;

hive> describe tweet_word;
OK
id                      bigint
word                    string
Time taken: 0.068 seconds, Fetched: 2 row(s)

hive> select * from tweet_word;

933155602364354561    workers
933155602364354561    -
933155602364354561    https://t.co/wV8xYd7ZtP
933155602364354561    https://t.co/rcz5QXcKkR
933155602502529024    Throw
933155602502529024    Knife
933155602502529024    Games
933155602502529024    Sports
933155602502529024    |
933155602502529024    Mac
933155602502529024    App
933155602502529024    |1288174905|
933155602502529024    ****

Let's use a dictionary called AFINN to calculate the sentiments. AFINN is a dictionary of roughly 2,500 words, each rated from -5 to +5 depending on its meaning.

We will create a table to hold the contents of the AFINN dictionary (AFINN-111.txt, downloaded from the URL at the top of this page):

create table dictionary(word string,rating int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Now, let’s load the AFINN dictionary into the table by using the following command:

hive> LOAD DATA INPATH '/user/guru/AFINN-111.txt' into TABLE dictionary;
Loading data to table default.dictionary
Table default.dictionary stats: [numFiles=1, numRows=0, totalSize=28093, rawDataSize=0]
OK
Time taken: 0.334 seconds

Now we will join the tweet_word table with the dictionary table so that each word is paired with its rating.

create table word_join as select tweet_word.id,tweet_word.word,dictionary.rating from tweet_word LEFT OUTER JOIN dictionary ON(tweet_word.word =dictionary.word);

hive> describe word_join;
OK
id                      bigint
word                    string
rating                  int
Time taken: 0.06 seconds, Fetched: 3 row(s)

hive> select * from word_join;

933155602502529024    App                        NULL
933155602502529024    |1288174905|               NULL
933155602502529024    ****                       NULL
933155602502529024    $2.29                      NULL
933155602502529024    ->                         NULL
933155602502529024    FREE                       NULL
933155602502529024    #Sports                    NULL
933155602502529024    4+                         NULL
933155602502529024    #Mac                       NULL
933155602502529024    #App                       NULL
933155602502529024    #iOS…                      NULL
933155602502529024    https://t.co/83SIxwvo1i    NULL
933155602649567232    RT                         NULL
933155602649567232    @CricketAus:               NULL
933155602649567232    Maxwell                    NULL
933155602649567232    added                      NULL
933155590326628355    clearly                    1
933155589898821632    fire                       -2
933155587226877952    thankful                   2
933155587226877952    awarded                    3

Time taken: 0.112 seconds, Fetched: 3368 row(s)

Now we will group by the tweet id so that all the words of one tweet come together, and then take the average of the ratings of those words so that the average rating of each tweet can be found.

select id,AVG(rating) as rating from word_join GROUP BY word_join.id order by rating DESC;

The above command calculates the average rating of each tweet from the ratings of its words and arranges the tweets in descending order of rating.

Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1511921498436_0020, Tracking URL = http://localhost:8088/proxy/application_1511921498436_0020/
Kill Command = /home/guru/hadoop_training/hadoop-2.8.2/bin/hadoop job -kill job_1511921498436_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-11-29 10:41:50,418 Stage-1 map = 0%, reduce = 0%
2017-11-29 10:42:04,933 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.17 sec
2017-11-29 10:42:18,621 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.39 sec
MapReduce Total cumulative CPU time: 6 seconds 390 msec
Ended Job = job_1511921498436_0020
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1511921498436_0021, Tracking URL = http://localhost:8088/proxy/application_1511921498436_0021/
Kill Command = /home/guru/hadoop_training/hadoop-2.8.2/bin/hadoop job -kill job_1511921498436_0021
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-11-29 10:42:41,068 Stage-2 map = 0%, reduce = 0%
2017-11-29 10:42:51,667 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.83 sec
2017-11-29 10:43:01,050 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 5.94 sec
MapReduce Total cumulative CPU time: 5 seconds 940 msec
Ended Job = job_1511921498436_0021
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 6.39 sec  HDFS Read: 97940  HDFS Write: 5883  SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1  Cumulative CPU: 5.94 sec  HDFS Read: 6246  HDFS Write: 4529  SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 330 msec
OK
933155483736854528    4.0
933155571808833536    4.0
933155564020002817    4.0
933155574627184640    4.0
933155523620306945    3.5
933155447804219392    3.0
933155579056545792    3.0
933155597557657602    3.0
933155519728058374    3.0
933155536916201474    2.5
933155463172116480    2.5
933155587226877952    2.5
933155583116525568    2.0
933155580163821568    2.0
933155538321408000    2.0
933155504741928960    2.0
933155493316411393    2.0
933155461200732160    2.0
933155506671235073    2.0
933155533535793152    2.0
933155572412829696    2.0
933155481438138368    2.0
933155599445045248    2.0
933155573926731776    2.0
933155557837627392    2.0
933155600531308545    2.0
933155578397958144    1.0
933155590326628355    1.0
933155477965496325    1.0
933155513445027840    1.0
933155594537598976    1.0
933155450840866816    1.0
933155586891431937    1.0
933155442699550720    1.0
933155568474136578    1.0
933155553387393025    1.0
933155554482147328    1.0
933155541303681025    0.6666666666666666
933155500559966208    0.5
933155490586009600    0.25
933155556914749440    0.0
933155570693021697    -0.5
933155569044676608    -1.0
933155456066785280    -1.0
933155574291816448    -1.0
933155458872946688    -1.0
933155488824512513    -1.0
933155450882809857    -1.0
933155494994247686    -1.0
933155565022441472    -2.0
933155581543796738    -2.0
933155589898821632    -2.0
933155554217877504    -2.0
933155583770836992    -2.0
933155448433213440    -2.0
933155511372955648    -3.0
933155579043942400    -3.0
933155529400242176    -3.0
933155462048026626    -3.0
933155498777563136    -3.0
933155453646782464    -3.5
933155492133617664    -4.0
933155551441293312    -4.0
933155602649567232    NULL
933155602502529024    NULL
933155602364354561    NULL
933155602234085376    NULL
933155601991065600    NULL
933155601709981697    NULL
933155600703221760    NULL
933155599939964928    NULL
933155598753034240    NULL
933155594680328192    NULL
933155594122301440    NULL
933155593807806474    NULL
933155592864022528    NULL
933155589290676225    NULL
933155588778979328    NULL
933155587700948992    NULL
933155586170130432    NULL
933155585784086528    NULL
933155584555278336    NULL
933155583116550144    NULL
933155583108186113    NULL
933155582500048896    NULL
933155582109876225    NULL
933155579354218496    NULL
933155578909732864    NULL
933155577965842432    NULL
933155577466884096    NULL
933155577177432064    NULL
933155577059880961    NULL
933155575671570432    NULL
933155575017467905    NULL
933155573503287296    NULL
933155573226463232    NULL
933155571775111168    NULL
933155569174753281    NULL
933155566486036480    NULL
933155564439375872    NULL
933155563466260481    NULL
933155563042680835    NULL
933155560651919360    NULL
933155560119140352    NULL
933155557871104000    NULL
933155557552394241    NULL
933155554289008640    NULL
933155553399808000    NULL
933155552116457472    NULL
933155551873249280    NULL
933155550220685312    NULL
933155550040395776    NULL
933155549444694016    NULL
933155547771260928    NULL
933155546357628928    NULL
933155546223599616    NULL
933155545325932544    NULL
933155543098703872    NULL
933155542440095744    NULL
933155537600040961    NULL
933155537533001728    NULL
933155535377092608    NULL
933155527990829056    NULL
933155527655174144    NULL
933155526032191488    NULL
933155525709119488    NULL
933155523397959680    NULL
933155521363955712    NULL
933155518314663936    NULL
933155517949759488    NULL
933155517631008768    NULL
933155516477349888    NULL
933155515592527872    NULL
933155513549967360    NULL
933155513302265857    NULL
933155513105227777    NULL
933155510592856064    NULL
933155510307536898    NULL
933155509707857920    NULL
933155509452115968    NULL
933155509078765578    NULL
933155508776615936    NULL
933155508508295169    NULL
933155507841454080    NULL
933155507434590208    NULL
933155507031953410    NULL
933155503227789312    NULL
933155501575196672    NULL
933155501256388608    NULL
933155497858940929    NULL
933155497187991552    NULL
933155497129271296    NULL
933155495787008000    NULL
933155494600105985    NULL
933155491638804481    NULL
933155490644738048    NULL
933155490552516610    NULL
933155488023408640    NULL
933155487255793667    NULL
933155487205306369    NULL
933155486697795584    NULL
933155482763780096    NULL
933155482717396993    NULL
933155481140584448    NULL
933155480075145216    NULL
933155479282479104    NULL
933155478552567809    NULL
933155477114023937    NULL
933155476975452161    NULL
933155475968937984    NULL
933155474907598848    NULL
933155474899210243    NULL
933155474207252480    NULL
933155473515102210    NULL
933155471631859713    NULL
933155471489486848    NULL
933155471120306176    NULL
933155470990118912    NULL
933155470050705408    NULL
933155469857828864    NULL
933155469669158913    NULL
933155468654120965    NULL
933155468473757702    NULL
933155465550118913    NULL
933155464757628928    NULL
933155462782095360    NULL
933155462312116230    NULL
933155460374519808    NULL
933155460164841472    NULL
933155456121483264    NULL
933155454607228928    NULL
933155450740248577    NULL
933155450631147521    NULL
933155448496168960    NULL
933155445748809728    NULL
933155443756666881    NULL
933155443462914048    NULL
933155443245043713    NULL
933155442880012289    NULL
933155441923821569    NULL
933155440979951617    NULL
NULL    NULL
Time taken: 97.825 seconds, Fetched: 202 row(s)

In the output above, you can see each tweet_id and its average rating; a NULL rating means none of the tweet's words appears in the dictionary.

MapReduce on Avro data files:
https://dzone.com/articles/mapreduce-avro-data-files
https://github.com/miguno/avro-cli-examples#avro-to-json
http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_avro_usage.html

Refer to Hadoop Real-World Solutions Cookbook - Second Edition:
  a. Writing the MapReduce program in Java to analyze web log data
  b. Performing reduce-side joins using MapReduce

This tutorial also has some examples: http://hadooptutorial.info/twitter-data-analysis-using-hadoop-flume/

How-to: Analyze Twitter Data with Apache Hadoop http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

Sentiment analysis using MapReduce: https://acadgild.com/blog/mapreduce-use-case-sentiment-analysis-twitter-data/

Sentiment analysis using Hive: https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/

Hadoop Real-World Solutions Cookbook - Second Edition (on Safari; good book) covers Twitter sentiment analysis using Hive.

Refer to the Hive UDF samples here: https://dzone.com/articles/writing-custom-hive-udf-andudaf

Multiple Input Files In MapReduce: The Easy Way http://dailyhadoopsoup.blogspot.in/2014/01/mutiple-input-files-in-mapreduce-easy.html

Use MapReduce for the MovieLens project:
https://www.youtube.com/watch?v=DjdkYKNWDxI (Frank Kane) - conceptual

https://www.youtube.com/watch?v=JG4PvCNmDyc - MapReduce code for MovieLens.

MovieLens project analysis

Data present in each file

(movies)  : MovieID::Title::Genres ==> 1::Toy Story (1995)::Animation|Children's|Comedy
(ratings) : UserID::MovieID::Rating::Timestamp ==> 1::1193::5::978300760
(users)   : UserID::Gender::Age::Occupation::Zip-code ==> 1::F::1::10::48067
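
Since the fields are separated by `::`, a plain `split("::")` is enough to pull each record apart. A minimal parsing sketch in Java (class and method names are my own; only the field order comes from the formats above):

```java
// Minimal parsing sketch for the MovieLens files described above.
// Only the field order is taken from the dataset description; everything else is illustrative.
public class MovieLensParser {

    // movies.dat  : MovieID::Title::Genres
    public static String[] parseMovie(String line) {
        String[] f = line.split("::");
        // f[0] = MovieID, f[1] = Title, f[2] = Genres (pipe-separated)
        return f;
    }

    // ratings.dat : UserID::MovieID::Rating::Timestamp
    public static int[] parseRating(String line) {
        String[] f = line.split("::");
        return new int[] {
            Integer.parseInt(f[0]),   // UserID
            Integer.parseInt(f[1]),   // MovieID
            Integer.parseInt(f[2])    // Rating (1-5)
        };
    }

    // users.dat   : UserID::Gender::Age::Occupation::Zip-code
    public static String[] parseUser(String line) {
        return line.split("::");
    }

    public static void main(String[] args) {
        String[] movie = parseMovie("1::Toy Story (1995)::Animation|Children's|Comedy");
        System.out.println(movie[1] + " -> " + movie[2]);
    }
}
```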

Troubleshooting:
https://stackoverflow.com/questions/30198717/how-do-i-output-custom-classes-having-lists-etc-data-structure-from-a-map-progra
https://stackoverflow.com/questions/30228764/how-can-we-pass-listtext-as-mapper-output

Develop MapReduce programs to solve the following KPIs:


  1. Top ten most-viewed movies with their movie names (ascending or descending order) - see the sketch after this list
  2. Top twenty rated movies (condition: each movie should be rated/viewed by at least 40 users)
  3. We wish to know how the genres rank by average rating for each profession and age group. The age groups to be considered are 18-35, 36-50 and 50+.
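
A minimal sketch for KPI 1, assuming a "view" is simply a rating record in ratings.dat and that movie IDs are translated to titles afterwards (for example with the map-side join linked below). Run it with a single reducer so that one TreeMap sees every movie; all class names are illustrative:

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopViewedMovies {

    // ratings.dat line: UserID::MovieID::Rating::Timestamp  ->  emit (MovieID, 1)
    public static class ViewCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text movieId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("::");
            if (fields.length >= 2) {
                movieId.set(fields[1]);
                context.write(movieId, ONE);
            }
        }
    }

    // Sums the view count per movie and keeps only the 10 largest counts
    // (ties on the count overwrite each other in this simplified sketch).
    public static class TopTenReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final TreeMap<Integer, String> topMovies = new TreeMap<>();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
            int views = 0;
            for (IntWritable v : values) {
                views += v.get();
            }
            topMovies.put(views, key.toString());
            if (topMovies.size() > 10) {
                topMovies.remove(topMovies.firstKey());   // drop the current smallest count
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit in descending order of view count.
            for (Map.Entry<Integer, String> e : topMovies.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
            }
        }
    }
}
```

The same shape should cover KPI 2 if the reducer also averages the ratings and skips movies rated by fewer than 40 users.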

dataset joins http://codingjunkie.net/mapside-joins/

Hadoop Map Reduce – Joining Data sets – JOINS https://teqreference.wordpress.com/2015/02/26/hadoop-map-reduce-joining-data-sets-joins/

This URL has the solution for joining multiple files using a map-side join: http://www.javamakeuse.com/2016/03/mapreduce-map-side-join-example-hadoop.html (currently using this); combine it with this top-N example for the top 10 most viewed: http://timepasstechies.com/mapreduce-topn/

Hadoop multiple inputs https://stackoverflow.com/questions/27349743/hadoop-multiple-inputs

MapReduce reduce-side join and top-N records pattern with a real-world example: http://timepasstechies.com/mapreduce-reduce-side-join-top-n-records-pattern-real-world-example/ - combine it with this for the top 10 most viewed: http://timepasstechies.com/mapreduce-topn/

Problem to solve: we wish to know how the genres rank by average rating for each profession and age group. The age groups to be considered are 18-35, 36-50 and 50+. (A sketch of the averaging step follows the link below.)

http://timepasstechies.com/mapreduce-replicatereduce-side-joinaverage-pattern-real-world-example/
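
For the averaging step itself, here is a hedged sketch that assumes a previous join job (reduce-side or replicated, as in the link above) has already produced tab-separated lines of the form genre, occupation, age, rating; the input layout and every name here are assumptions, not something taken from the linked posts:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GenreAverageRating {

    // Buckets required by the problem statement: 18-35, 36-50 and 50+.
    // users.dat stores age as a coded bucket (1, 18, 25, 35, 45, 50, 56), so the mapping
    // of those codes onto these ranges is approximate; code 1 ("under 18") is skipped.
    static String ageGroup(int age) {
        if (age < 18) return null;
        if (age <= 35) return "18-35";
        if (age <= 50) return "36-50";
        return "50+";
    }

    // Assumed input (output of a prior join): genre \t occupation \t age \t rating
    public static class CompositeKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split("\t");
            if (f.length == 4) {
                String group = ageGroup(Integer.parseInt(f[2]));
                if (group != null) {
                    // Composite key: genre|occupation|ageGroup
                    context.write(new Text(f[0] + "|" + f[1] + "|" + group),
                                  new IntWritable(Integer.parseInt(f[3])));
                }
            }
        }
    }

    // Averages the ratings collected for each (genre, occupation, age group) key.
    public static class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long count = 0;
            for (IntWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(key, new DoubleWritable((double) sum / count));
        }
    }
}
```

Ranking the genres inside each profession/age group is then a small second pass (or a sort) over this job's output.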

Map-side join example - Java code for joining two datasets, one large (TSV format) and one with lookup data (text) made available through the DistributedCache:
https://gist.github.com/airawat/6587341
http://rajkrrsingh.blogspot.in/2013/10/hadoop-joining-two-datasets-using-map.html
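
A minimal sketch of that pattern applied to the MovieLens files, assuming the driver ships movies.dat through the distributed cache (for example job.addCacheFile(new URI("/user/guru/movies.dat#movies"))) while ratings.dat is the regular map input; the path and class names are illustrative:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: movies.dat is replicated to every mapper via the distributed cache,
// while ratings.dat streams through as the normal (large) map input.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, String> movieTitles = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            // Added with the "#movies" fragment, so the file is symlinked as "movies"
            // in the task's working directory.
            try (BufferedReader reader = new BufferedReader(new FileReader("movies"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] f = line.split("::");        // MovieID::Title::Genres
                    if (f.length >= 2) {
                        movieTitles.put(f[0], f[1]);
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("::");        // UserID::MovieID::Rating::Timestamp
        if (f.length >= 3) {
            String title = movieTitles.get(f[1]);
            if (title != null) {
                // Joined record: (movie title, rating) - a reducer can then count views or average ratings.
                context.write(new Text(title), new IntWritable(Integer.parseInt(f[2])));
            }
        }
    }
}
```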

Hadoop Project on NCDC ( National Climate Data Center – NOAA ) Dataset

https://www.eduonix.com/blog/bigdata-and-hadoop/hadoop-project-on-ncdc-national-climate-data-center-noaa-dataset/

Useful MapReduce Tutorial https://hadoop.apache.org/docs/r2.8.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Big Data Tutorial 1: MapReduce (NYU HPC) https://wikis.nyu.edu/display/NYUHPC/Big+Data+Tutorial+1%3A+MapReduce

Project tips

Joining Two Files Using MultipleInput In Hadoop MapReduce - MapSide Join http://unmeshasreeveni.blogspot.in/2014/12/joining-two-files-using-multipleinput.html

https://stackoverflow.com/questions/46313945/joining-of-multiple-files-using-map-reduce

============

Sentiment analysis is the analysis of people's opinions, sentiments, evaluations, appraisals, attitudes and emotions toward entities such as individuals, products, events, services, organizations and topics, by classifying the expressions as negative or positive opinions.

http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

Collect Twitter data using twitter4j and Flume: https://acadgild.com/blog/streaming-twitter-data-using-flume/ (continue from Step 9), then use MapReduce for sentiment analysis: https://acadgild.com/blog/mapreduce-use-case-sentiment-analysis-twitter-data/

Use this link to install Flume: https://data-flair.training/blogs/apache-flume-installation-tutorial/

Hive was installed from the link below (you can install all the software from here): http://www.bogotobogo.com/Hadoop/BigData_hadoop_CDH5_Hive_Upgrade_2.php

Download tarballs from http://archive.cloudera.com/tarballs/

Twitter app: vidyan_sentiment

Access Token: 3038785524-tmffW4bba5cznrwm4p7DrOpT5n12oB2PTfFwvlQ
Access Token Secret: cIqT7Dqzav6VywaPUxAuQDYJ7uYvA4v36vXmRJ9Q0ldGU
Consumer Key (API Key): m3q61xnjSRa65eRhe5TzeB33N
Consumer Secret (API Secret): xZIj4hmMHpfiQ8CmNiesZa3VJLgJ9mjfF9qqYwJqflGqJdD7l9

Good Hadoop books

  1. Hadoop Real-World Solutions Cookbook - chapter 5 covers Twitter sentiment analysis using Hive; also setting up Hadoop in AWS, setting up the balancer, etc., with lots of real-world examples.
  2. Hadoop Operations - covers troubleshooting and security.
  3. Hadoop MapReduce v2 Cookbook - Second Edition - explore the Hadoop MapReduce v2 ecosystem to gain insights from very large datasets; using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment; joining two datasets in MapReduce; benchmarking HDFS, MapReduce, etc.
  4. Deep Learning with Hadoop - for software engineers who already have knowledge of big data, deep learning and statistical modeling, but want to rapidly learn how deep learning can be used for big data and vice versa.
  5. Hadoop Blueprints - real world. It has very complete and useful real-world use cases of Hadoop and its ecosystem. Chapters explain big data technology trends like IoT and data lakes and how Hadoop and its ecosystem fit in to solve those problems. Analyze Sensor Data Using Hadoop, Building a Data Lake, Building a Fraud Detection System and Churn Detection are my favorite chapters, where the authors walk through the steps of using the Hadoop ecosystem with examples. In summary, if you want to learn Hadoop with examples, this is the right book for you.

  1. Learn benchmarking HDFS, MapReduce, etc., and using Apache Whirr to deploy MapReduce to the cloud, from Hadoop MapReduce v2 Cookbook - Second Edition
  2. Set up an AWS Hadoop cluster from Hadoop Real-World Solutions Cookbook