lda results (mllib) - derlin/bda-lsa-project GitHub Wiki
Creating the model
We used the whole Wikipedia dataset (57.5 GB, see datasets).
Parameters:
- number of documents: 4'703'192
- vocabulary size: 20'000
- number of topics: k = 1'000
For the maxIterations parameter, we first tried 1'000 and killed the process after 48 hours. We then tried both 20 and 50.
Execution, creating the model:
We ran the class bda.lsa.lda.mllib.RunLDA on the daplab, using Spark 2.0.0. Here is the complete command:
spark-submit \
  --class bda.lsa.lda.mllib.RunLDA \
  --master yarn --deploy-mode client \
  --num-executors 10 \
  --executor-memory 15g \
  --driver-memory 20G \
  bda-lsa-project.jar 1000 20 -1 -1
With maxIter = 20, it took 2h14 to execute. With maxIter = 50, it took 5h. The results were saved in HDFS (/shared/wikipedia/docIds/lda-model-XX).
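The two timings above give a rough per-iteration cost, which helps explain why the run with 1'000 iterations was still going after 48 hours. The sketch below is only a back-of-the-envelope estimate: it assumes the run time grows linearly with the number of iterations and ignores fixed costs such as loading the data. The object and method names are made up for this illustration.

```scala
object IterationCost {
  // Average wall-clock minutes per iteration for a finished run.
  def minutesPerIteration(totalMinutes: Double, iterations: Int): Double =
    totalMinutes / iterations

  // Projected duration in hours, ASSUMING the cost per iteration stays constant.
  def projectedHours(minPerIter: Double, iterations: Int): Double =
    minPerIter * iterations / 60.0

  def main(args: Array[String]): Unit = {
    val run20 = minutesPerIteration(2 * 60 + 14, 20) // 2h14 with maxIter = 20
    val run50 = minutesPerIteration(5 * 60, 50)      // 5h with maxIter = 50
    println(f"maxIter = 20: $run20%.1f min per iteration")
    println(f"maxIter = 50: $run50%.1f min per iteration")
    // Extrapolate both rates to the abandoned 1'000-iteration run.
    println(f"maxIter = 1000: ${projectedHours(run50, 1000)}%.0f to ${projectedHours(run20, 1000)}%.0f hours")
  }
}
```

Under that assumption, each iteration costs between 6 and 7 minutes, so 1'000 iterations would have needed more than 100 hours, about twice the 48 hours we waited.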
Interactive queries:
The model was then loaded in a spark-shell for interactive queries. Once again, we took advantage of the daplab and used the same parameters as above (number of executors, etc.).
Spark-shell Setup
As a reminder, here is how to load an LDA model in the spark-shell, assuming the jar is available and the config.properties is correct:
val data = bda.lsa.getData(spark)
val model = bda.lsa.lda.mllib.RunLDA.loadModel(spark)
val q = new bda.lsa.lda.mllib.LDAQueryEngine(model, data)
Results with maxIter = 20
The topics
First, we had a look at the topics using q.describeTopicsWithWords(5). The result is available in others/topics-1000-lda-maxIter20.csv.
Many topics make sense:
2, dam, reservoir, currency, canyon, water
6, release, album, ucl, song, band
8, route, expressway, bridge, highway, parkway
16, nba, basketball, season, team, coach
20, drug, heroin, addiction, housemate, opioid
while others do not:
0, pune, spence, grafton, bale, pollard
5, stan, capitol, kenton, melrose, shearer
864, jung, korean, kim, korea, meta
865, turtle, denton, hartley, butte, window
Random queries
Top documents for topic 119 (password, security, guess, attacker, authentication):
4279490 - Enpass,0.9416792976995423
3102960 - Intuitive Password,0.9316564369480546
916615 - Munged password,0.9273983641659067
927308 - Password policy,0.9238324774965654
2083888 - KYPS,0.9213816659468554
2131919 - Password management,0.9150316995419442
1694732 - Password psychology,0.9105991312670398
1234181 - Password fatigue,0.8949505437194555
927710 - RainbowCrack,0.8811882695987457
1317043 - Crack (password software),0.874134809930893
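The scores in the list above lie between 0 and 1, which is consistent with each score being the weight of topic 119 in the document's topic distribution. Here is a minimal sketch of that ranking on toy data. It is not the code of LDAQueryEngine, which is not shown on this page: the object name, the method and the three documents are invented for the illustration. On the real model, MLlib's DistributedLDAModel offers topDocumentsPerTopic for the same purpose.

```scala
object TopDocuments {
  // docTopics maps a document title to its topic distribution
  // (one probability per topic, summing to 1).
  def topDocumentsForTopic(
      docTopics: Map[String, Array[Double]],
      topic: Int,
      n: Int): Seq[(String, Double)] =
    docTopics.toSeq
      .map { case (title, dist) => (title, dist(topic)) } // keep only the weight of `topic`
      .sortBy { case (_, weight) => -weight }             // highest weight first
      .take(n)

  def main(args: Array[String]): Unit = {
    // Toy distributions over 2 topics (hypothetical values).
    val docs = Map(
      "doc A" -> Array(0.1, 0.9),
      "doc B" -> Array(0.7, 0.3),
      "doc C" -> Array(0.4, 0.6))
    topDocumentsForTopic(docs, topic = 1, n = 2).foreach {
      case (title, weight) => println(s"$title,$weight")
    }
  }
}
```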
Top topics for term "mickey":
949- disney, walt, edgar, allan, jungle, 52166.45979201152
290- mickey, kaye, episode, hollis, series, 42522.10445176097
782- norris, ness, douglass, cantonment, tenement, 6735.903447694659
890- dolly, crouch, parton, mays, horizons, 6692.934610955127
317- lowell, jackal, sprout, neville, bevan, 4774.749101984056
374- hardy, bunny, looney, minnie, kabir, 2630.832433260876
269- zulu, sargent, firefly, goliath, colchester, 1543.0724606917881
358- baseball, league, game, yes, pitch, 1366.6868230685839
396- episode, rover, mars, finn, series, 1097.7335044525305
219- weaver, pony, weave, calvin, badger, 843.6324184337855
504- championship, tournament, tour, win, open, 450.9374796177704
957- plato, pluto, curtiss, raman, whitaker, 422.1355742816467
Top topics for term "Batman":
460 - batman, woodward, général, comics, joker,155246.80856591873
79 - suit, batman, anime, gundam, cyclone,123800.0027488167
308 - superman, chinese, qing, lois, ethnic,4644.167085636017
161 - rowing, lego, julien, cox, judy,4210.495372757982
247 - cassie, walsall, joker, episode, gay,1140.2446390329778
436 - württemberg, castle, lex, luthor, century,470.9644377379244
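Unlike the document scores, the scores of the term queries are not probabilities: they range from a few hundred to more than 100'000, which suggests unnormalised term weights taken from the term-by-topic matrix. The sketch below shows one way such a ranking could be computed, assuming a matrix with one row per term and one column per topic (the layout of MLlib's topicsMatrix). The object name and the toy weights are hypothetical; the (weight, topic id) order of the result mirrors the pairs returned by q.topTopicsForTerm.

```scala
object TopTopics {
  // termTopicWeights(t)(k) is the weight of term t in topic k
  // (vocabulary-size rows, one column per topic).
  def topTopicsForTerm(
      termTopicWeights: Array[Array[Double]],
      termId: Int,
      n: Int): Seq[(Double, Int)] =
    termTopicWeights(termId).toSeq.zipWithIndex // (weight, topic id)
      .sortBy { case (weight, _) => -weight }   // heaviest topic first
      .take(n)

  def main(args: Array[String]): Unit = {
    // Toy matrix: 2 terms, 3 topics (hypothetical weights).
    val weights = Array(
      Array(5.0, 0.5, 120.0),
      Array(1.0, 2.0, 3.0))
    topTopicsForTerm(weights, termId = 0, n = 2).foreach {
      case (weight, topic) => println(s"$topic - $weight")
    }
  }
}
```

Because the weights are not normalised, a frequent term can score high in a topic where it is not among the top words, which may explain why unrelated topics show up further down the lists.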
Results with maxIter = 50
The topics
Again, we looked at the topics using q.describeTopicsWithWords(5). The result is available in others/topics-1000-lda-maxIter50.csv.
The topics are different, but the conclusions are the same. Some of them make sense:
6, coast, coastal, continental, designate, inland
9, hip, hop, rap, rapper, mixtape
10, tank, german, nazi, hitler, resistance
But others don't even seem to share a common theme:
3, phase, expect, gap, completion, pratt
8, thousand, friendship, olive, nationalism, pony
15, annual, nichols, domino, year, clover
Here are some other stats:
scala> model.logLikelihood
res6: Double = -1.8797753255155785E10
scala> model.topicConcentration
res8: Double = 1.1
scala> model.logPrior
res10: Double = -129954.18344277807
scala> model.docConcentration
res7: org.apache.spark.mllib.linalg.Vector = [1.05,1.05,1.05,1.05,...
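The two concentration values above are what one would expect if they were left unset: for its EM optimizer, MLlib documents a default docConcentration of 50/k + 1 and a default topicConcentration of 0.1 + 1. The short check below redoes that arithmetic for k = 1'000. That RunLDA used the EM optimizer with default priors is an assumption here, since its source is not shown on this page, and the object name is made up.

```scala
object EmDefaults {
  // Default symmetric document-topic prior documented for MLlib's EM optimizer.
  def defaultDocConcentration(k: Int): Double = 50.0 / k + 1.0

  // Default topic-term prior documented for MLlib's EM optimizer.
  val defaultTopicConcentration: Double = 0.1 + 1.0

  def main(args: Array[String]): Unit = {
    println(f"docConcentration for k = 1000: ${defaultDocConcentration(1000)}%.2f")
    println(f"topicConcentration: $defaultTopicConcentration%.1f")
  }
}
```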
Random queries
Some other queries:
Top documents for topic 93 (market, stock, insurance, cash, investor):
6244216 - Bridle Insurance, 0.9999804797933601
4935496 - MS&AD Insurance Group, 0.9999804797933601
405715 - Etana Insurance, 0.9999739914188229
3433514 - Pseudoniphargus, 0.9999731303990821
2097985 - Jæger Dokk, 0.9999610420959162
3471437 - Asperton, 0.9999610420959162
3778496 - Dinghao Market, 0.9997881027115952
3773835 - Hailong Market, 0.9997881027115952
3534602 - Gudrun Stock, 0.999745845527948
3325840 - Beenham Stocks, 0.999745845527948
Top topics for the term "mustard" (q.topTopicsForTerm(data.termIds.indexOf("mustard")).map(e => e._2 + " : (" + topics(e._2) + ")" + " - " + e._1).mkString("\n")):
982 : (asia, asian, southeast, laos, lao) - 32504.288213833122
923 : (cook, eat, dish, sweet, cuisine) - 6478.6351545295765
299 : (animal, nature, wild, wildlife, sanctuary) - 0.6102807234657309
194 : (nuclear, accident, missile, weapon, viaduct) - 0.14984823165033154
552 : (strike, iraq, syria, syrian, rebel) - 0.10190013066249212
9 : (hip, hop, rap, rapper, mixtape) - 0.07677137684555978
351 : (location, restaurant, chain, fast, chef) - 0.05801084437788332
809 : (mars, crater, impact, volcano, dome) - 0.05350715625815306
216 : (chemical, reaction, compound, organic, oxygen) - 0.05313917415990251
22 : (islamic, palestine, destruction, iraq, baghdad) - 0.05250130133972568
701 : (food, crop, farming, sole, nicholson) - 0.03820582595433827
Top topics for the term "window" (q.topTopicsForTerm(data.termIds.indexOf("window")).map(e => e._2 + " : (" + topics(e._2) + ")" + " - " + e._1).mkString("\n")):
88 : (house, window, arch, gable, storey) - 409327.6430767544
45 : (cemetery, church, burial, nave, chapel) - 82249.37952221054
896 : (front, generation, available, rear, manual) - 15570.393982462794
528 : (software, user, web, windows, microsoft) - 2007.5440479475542
725 : (side, wall, square, circle, interior) - 133.15149226414317
466 : (design, door, remove, timber, slide) - 23.41339646786743
937 : (glass, candy, nut, butter, bead) - 4.26710385452073
799 : (building, room, floor, historic, contribute) - 0.3192841216368202
29 : (dedicate, altar, baroque, feast, pilgrim) - 0.14530693638599756
651 : (architecture, heritage, architect, architectural, building) - 0.10753047449716888
837 : (light, camera, lens, laser, optical) - 0.10566305553882346