svd results - derlin/bda-lsa-project GitHub Wiki
Creating the model
We used the whole Wikipedia dataset (57.5 GB, see datasets).
Parameters:
- number of documents: 4'703'192
- vocabulary size: 20'000
- number of topics: k = 1'000
Execution, creating the model:
We ran the class bda.lsa.svd.RunSVD on the daplab, using Spark 2.0.0. Here is the complete command:
spark-submit \
--class bda.lsa.svd.RunSVD \
--master yarn --deploy-mode client \
--num-executors 10 \
--executor-memory 15g \
--driver-memory 20G \
bda-lsa-project.jar 1000
It took 48 minutes to execute. The results were saved in HDFS (/shared/wikipedia/docIds/svd-model).
Interactive queries:
The model was then loaded in a spark-shell for interactive queries. Once again, we took advantage of the daplab and used the same parameters as above (number of executors, etc.).
Spark-shell Setup
As a reminder, here is how to load an SVD model in the spark-shell, assuming the jar is available and the config.properties is correct:
val data = bda.lsa.getData(spark)
val model = bda.lsa.svd.RunSVD.loadModel(spark)
val q = new bda.lsa.svd.SVDQueryEngine(model, data)
Some results
The topics
First, we had a look at the topics using q.describeTopicsWithWords(5). The result is available in others/topics-1000-svd.csv.
Some topics make perfect sense:
2, game, season, album, team, win
3, war, force, use, army, government
8, city, station, street, park, church
10, album, band, bar, song, party
15, gsm, operational, lte, unknown, telecom
19, regiment, court, battalion, brigade, division
20, bgcolor, style, rowspan, right, center
While others do not at first sight:
11, mathbf, film, station, party, frac
13, race, car, clt, bri, rch
22, usa, ale, brewing, left, commend
50, airport, party, island, jewish, king
997, kelly, snake, ira, hat, teen
998, shelley, bosnia, risk, kannada, cycle
999, khmer, association, academy, kashmir, section
Some random queries
Getting the top documents for topic 2 (game, season, album, team, win) yields:
Shimmer Volumes, 0.016377116561342497
Big Four (tennis), 0.016082646702247655
National Hockey League lore, 0.014167659295226441
Texas Longhorns men's basketball, 0.013612633137050686
History of Tipperary GAA, 0.013553995723004796
National Academy of Video Game Trade Reviewers, 0.013452756781359636
Characters of Supernatural, 0.012822426717493209
National Football League on television, 0.012805309925083705
Last-minute goal, 0.012713888605821618
Andy Murray, 0.012705334539146292
The number after each document is its entry in U, the matrix of left singular vectors, for this document/topic pair. It is hard to interpret as-is, but the higher, the better.
Those results are not the most interesting, but they show one thing: SVD does not handle polysemy. Indeed, the word "season" is conflated across films, sports and such.
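For reference, such a topic query boils down to reading one column of the distributed U matrix: each document's score is its weight on that topic. A minimal sketch with Spark MLlib types (the method name, and the assumption that U's row order matches the document IDs, are ours for illustration, not the project's verified code):

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch only: score every document by its entry in one column of U
// (its weight on the given topic), then keep the top `howMany`.
// Assumes `u` is the distributed RowMatrix of left singular vectors and
// that row order matches data.docIds -- both are assumptions.
def topDocsForTopic(u: RowMatrix, topicId: Int, howMany: Int = 10): Array[(Long, Double)] =
  u.rows.zipWithUniqueId()
    .map { case (row, docIndex) => (docIndex, row(topicId)) }
    .top(howMany)(Ordering.by[(Long, Double), Double](_._2))
```

Because this touches the distributed U, it launches a cluster job, which explains why document queries are slower than term-only queries.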
Top terms for term "church" (id 95):
Note: the topTermsForTerm query is very quick, since svd.V is a local Matrix.
95 - church, 0.999999999999998
7159 - churches, 0.8084311707102206
9911 - congregational, 0.758998302713722
4148 - pastor, 0.7580429799405338
15933 - parishioner, 0.7273759700991641
11873 - pulpit, 0.7184415222951229
4377 - presbyterian, 0.7073768179622689
5592 - ecclesiastical, 0.7024518028916056
6733 - reformed, 0.7020500962856182
7896 - communion, 0.6880205095535681
Not bad at all! Here, SVD clearly found terms within the same theme.
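Since the whole V matrix (20'000 terms x k) fits in memory, this kind of term query can even be reproduced without Spark. Here is a minimal sketch of the standard LSA recipe in plain Scala (an illustration, not the project's actual implementation): each term's vector is its row of V scaled by the singular values, and similarity is the cosine between those vectors.

```scala
// Sketch of term-term similarity in LSA. `v` stands in for the local
// (numTerms x k) matrix of right singular vectors, `s` for the k singular
// values. This is the textbook recipe, not the project's exact code.
object TermSimilaritySketch {
  // cosine similarity between two dense vectors
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na = math.sqrt(a.map(x => x * x).sum)
    val nb = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }

  // a term's vector: its row of V, scaled component-wise by the singular values
  def termVector(row: Array[Double], s: Array[Double]): Array[Double] =
    row.zip(s).map { case (x, sv) => x * sv }

  // rank all terms by cosine similarity to the query term
  def topTermsForTerm(v: Array[Array[Double]], s: Array[Double],
                      termId: Int, howMany: Int): Seq[(Int, Double)] = {
    val query = termVector(v(termId), s)
    v.indices
      .map(i => (i, cosine(termVector(v(i), s), query)))
      .sortBy(t => -t._2)
      .take(howMany)
  }
}
```

The query term always comes back first with a similarity of ~1.0, exactly as in the listings above.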
Top Documents for term "Mickey" (id 5372):
397872 - Mickey Mouse (film series),1090.0034456722924
227210 - Mickey Mouse,895.9204323021379
261863 - Murinae,793.2678233955537
538938 - Mickey Mouse universe,627.886966462873
593550 - Donald Duck,596.4097853089412
1883329 - Sigmodontinae,521.0913514588701
230048 - Computer mouse,437.748848018837
421477 - Clarence Nash,419.8787555549701
456620 - Mickey Mouse Works,419.17828260483174
426126 - Pete (Disney),383.44043405965283
What about Batman?
Top docs for document 148046, "Batman Forever":
148046 - Batman Forever, 0.9999999999999993
147967 - Batman & Robin (film), 0.9585472055819237
475230 - Batman Begins, 0.9520343364283416
1344054 - Batman: Mask of the Phantasm, 0.9377236191322167
3269692 - Batman in film, 0.9359881344577601
1590886 - Son of Batman, 0.9353289346105655
147888 - Batman Returns, 0.933892566651004
523195 - Vicki Vale, 0.932607097736682
876544 - Alfred Pennyworth, 0.9256316075604525
2189050 - Hush (comics), 0.9251785516882967
We first get all the Batman films, then some Batman characters, but also other comics of the same type.
Top terms for term "batman" (id 3759):
3759 - batman, 0.9999999999999977
10954 - gotham, 0.9876397967261396
10250 - joker, 0.9760900376355605
16923 - arkham, 0.965984320884384
19152 - bane, 0.9598151994534613
12918 - grayson, 0.9415267047078937
18321 - hush, 0.9301561009586975
11807 - damian, 0.9244694559640239
11633 - dent, 0.9039045900765331
10075 - harley, 0.8234428214565112
Here, SVD successfully found most of the top terms/concepts of the Batman Universe.
Top docs for term "dent" (id 11633):
1869738 - James Gordon (comics), 104.82266471097371
131219 - Batman, 99.87136889524454
40086 - Batman: The Animated Series, 72.72400851127851
499138 - Batsuit, 63.9792368213549
397336 - Two-Face, 60.490591264337176
709998 - Batman franchise media, 59.01292256836966
1361987 - Scarecrow (DC Comics), 58.732113480754386
2303929 - Alternative versions of Batman, 56.64582249665118
1393518 - Batman (comic book), 56.56590152526399
3269692 - Batman in film, 54.92013757689848
Harvey Dent has been properly detected!
And what about computers?
Top docs for term "computer" (id 797):
2007500 - History of IBM, 972.2639960959705
2720411 - Computer, 688.9248390345646
19914 - IBM Personal Computer, 653.2726454515649
381262 - Personal computer, 630.6109897223239
1555241 - Timeline of DOS operating systems, 558.3993685383201
867136 - History of personal computers, 543.0675624558836
3399183 - Home computer, 504.48153430270634
216226 - History of computing hardware, 440.53158420756466
166137 - Computer science, 425.29516903358757
1276174 - Computer graphics, 423.80976595270744
Top docs for doc "Computer" (id 2720411):
2720411 - Computer,1.0000000000000007
216226 - History of computing hardware,0.8861129808073265
1949533 - Computer hardware,0.8627631271909316
78465 - Microcomputer,0.8514913078098942
2589439 - Computer Pioneer Award,0.847858380883771
4548006 - Mechanical computer,0.8428850752395042
2762341 - History of computer science,0.841274835480451
185650 - Computer program,0.8397384215617181
185522 - Stored-program computer,0.8393918187228524
381262 - Personal computer,0.8337466280456745
Top terms for term "computer" (id 797):
797 - computer,0.9999999999999996
19993 - computers,0.8701206407641349
4585 - computing,0.8557692068290813
3796 - hardware,0.6643669759275386
13168 - calculator,0.5944571833279778
12045 - amiga,0.589699146996869
10604 - hacker,0.5634863002398985
17683 - workstation,0.5578494254388993
11801 - rom,0.555955118016264
11055 - emulate,0.5555349933563621
Top Documents for term query ("Computer", "Church"):
Prussian Union of churches,8800.577637270315
St. Mary's Church,7319.0827506476335
Eastern Orthodox Church,5713.369027323474
Landeskirche,4971.575577130236
Churches in Norway,4544.355292719535
History of IBM,4519.547194291468
Methodism,4208.247542733009
United Methodist Church,3741.087182759663
Homosexuality and The Church of Jesus Christ of Latter-day Saints,3653.5284742695644
Presbyterian Church in America,3558.1198708541424
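A multi-term query like this one follows the usual LSA recipe: build a query vector over the vocabulary, project it into topic space through V and the singular values, then score every document against it via the distributed U. A hedged sketch with Spark MLlib types (the weighting and the method name are assumptions, not the project's exact code):

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch of a multi-term query:
//  1. sum the V-rows of the query terms, weighted by the singular values,
//     to get a k-dimensional vector in topic space;
//  2. multiply the distributed U by that vector to get one score per document;
//  3. keep the top `howMany`.
// Assumes `u` is U as a RowMatrix, `v` is the local V matrix (numTerms x k),
// `s` the singular values, and that U's row order matches data.docIds.
def topDocsForTermQuery(u: RowMatrix, v: Matrix, s: Vector,
                        termIds: Seq[Int], howMany: Int = 10): Array[(Long, Double)] = {
  val k = s.size
  val topicVec = new Array[Double](k)
  for (t <- termIds; j <- 0 until k)
    topicVec(j) += v(t, j) * s(j)
  val query = Matrices.dense(k, 1, topicVec)
  u.multiply(query).rows.zipWithUniqueId()
    .map { case (row, docIndex) => (docIndex, row(0)) }
    .top(howMany)(Ordering.by[(Long, Double), Double](_._2))
}
```

The large raw scores in the listing above are consistent with this kind of unnormalized projection.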
About the execution times
When we only work with terms, the results are almost instantaneous, because svd.V is a local matrix. When we need to query documents as well, it takes some time: svd.U is a distributed row matrix, so Spark needs to launch a job on the cluster and collect the results. Still, no query took more than 3 minutes, which is very good.
Another thing: in the current implementation, data.docIds is an RDD with a lookup method to get a document title from its ID. A single lookup takes about 5 seconds. To speed things up, it is better to collect the RDD into a local map; a map of about 4 million entries is not a problem on the Daplab.
Notes and conclusion
The results are pretty amazing: we could build a good recommendation system on top of this model.
The best results are the ones linking terms and documents: using the topics directly is not really interesting.
Also, it is sometimes difficult to understand what exactly a topic is based only on its most relevant terms. Most of the time, it is clearer when we look at the top documents for that topic.