mllda results with k=200 (ml) - derlin/bda-lsa-project GitHub Wiki

Creating the model

We used the whole Wikipedia dataset (57.5 GB; see datasets). The model creation and parameters are the same as described here, except that this time we used ml instead of mllib. In summary:

Parameters:

  • number of documents: 4'703'192
  • vocabulary size: 20'000
  • number of topics: k = 200
  • maxIter = 10, 50, 80
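The model creation with these parameters can be sketched as follows (a minimal sketch, not the project's exact code; the preprocessing step, the `docs` DataFrame, and the column names are assumptions):

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer

// Assumed: `docs` is a DataFrame with a "tokens" column of preprocessed words.
val vectorizer = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .setVocabSize(20000) // vocabulary size above

val countVectors = vectorizer.fit(docs).transform(docs)

val lda = new LDA()
  .setK(200)      // number of topics
  .setMaxIter(50) // we tried 10, 50 and 80
val model = lda.fit(countVectors)
```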

With 50 iterations, it took 1h27 with ml and more than 5h with mllib.

==> ml is much faster than its old counterpart!

MaxIters

In a first run, we made a mistake and ran it with a maximum of 10 iterations. It was very quick (about 1h), but the results were really bad :(. Here is an extract of the topics found:

0, new, also, school, use, first
1, new, school, also, use, first
2, new, also, use, school, first
3, new, also, school, use, first
4, new, also, use, school, first
5, new, school, also, first, use
6, new, also, use, school, first
7, align, new, also, use, school
8, new, also, use, school, first
9, new, school, also, use, first
10, new, use, also, school, first
11, new, also, use, school, first
12, new, also, use, school, first
13, new, use, also, school, first
14, new, also, school, use, first
15, new, school, also, use, first
16, new, school, also, use, first
17, new, school, also, use, first
18, new, school, also, use, first
19, new, also, school, use, first
20, new, also, school, use, first
21, align, new, also, school, use
22, new, also, use, school, first
23, new, also...

We then reran it, this time with maxIter=80 (it took 2h13 to finish) and with maxIter=50 (it took 1h27).

Results

The topics with maxIter=80

With 80 iterations, the results seem interesting. Indeed, most of them make sense:

0, animal, bird, wild, male, egg
1, dance, earth, universe, ballet, robot
2, street, store, avenue, shop, location
3, smith, martin, page, wilson, allen
4, theatre, festival, opera, perform, performance
5, theory, idea, concept, philosophy, argue
6, manchester, birmingham, cap, yorkshire, leeds
7, god, spirit, religion, soul, buddhist
8, london, england, royal, memorial, cemetery
9, book, isbn, novel, write, story
10, brazil, brazilian, metropolitan, designer, barbara
11, engine, arm, use, design, type
12, space, sun, edge, star, graph
13, experience, suggest, human, finnish, relationship
14, edition, volume, print, copy, manuscript
15, york, new, columbia, frank, miller

The whole list is available in other/topics-200-mllda-maxIter80.csv.

The topics with maxIter=50

At first sight, the topics don't seem to match the ones found with mllib and the same parameters. On closer inspection, however, some topics are really close:

"software":

mllib: 91, datum, software, user, code, use
ml:    36, user, online, web, software, internet

"music":

mllib: 94,  music, producer, top, song, hot
mllib: 157, music, guitar, vocal, bass, piano
ml:    76,  music, orchestra, opera, piano, jazz
ml:    124, album, song, release, chart, music

In all these examples, the ml topics seem better at describing the underlying theme.

The whole list is available in other/topics-200-mllda-maxIter50.csv.

Some queries

As with mllib, let's get the top topics for term "batman":
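The call, assuming the same query helper `q` used for the other queries below (the exact method name `topTopicsForTerm` is an assumption, mirroring `topDocumentsForTopic`):

```scala
q.topTopicsForTerm("batman").map(t => s"${t._1}, ${topics(t._1)}, ${t._2}").mkString("\n")
```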

159 - network, launch, service, mobile, limited, 286735.949804223
115 - vol, row, comics, marvel, comic, 4410.9926879093655
73 - episode, series, character, voice, show, 0.17015225638865533
8 - kill, death, find, escape, dead, 0.11629134029379477

Here, the results are strange. The most pertinent topic is related to computers and technology, while the "superhero" topic comes second... This is worse than the mllib result!

top topics for term "computer":

151, card, scout, memory, computer, device, 302303.8753921035
72, system, information, datum, access, tool, 244062.2388044791
47, college, university, science, engineering, student, 165123.57174622754
36, user, online, web, software, internet, 158034.63035305514
166, space, set, element, ring, function, 13466.06624337648
189, game, player, release, character, version, 11495.376558072285
46, school, student, education, grade, teacher, 8989.334098034138
13, signal, edge, frequency, use, graph, 8467.577648244318
74, reference, source, object, google, traditional, 1041.674809011646
58, earth, moon, mass, space, planet, 141.7113848250631
93, claim, action, case, robot, copyright, 52.013918326916766
96, program, child, education, basic, skill, 3.1691860441246806

Here, the results are more coherent. We also see that many topics concern computer science and technology.
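Such term queries can be computed directly from the model's term-topic matrix; note that the weights shown (e.g. 302303.87) are unnormalized term-topic weights, which explains their large magnitudes. A minimal sketch, assuming `model` is the fitted ml LDAModel and `vocab` is the CountVectorizer vocabulary array:

```scala
// topicsMatrix: vocabSize x k matrix of (unnormalized) term-topic weights.
val matrix = model.topicsMatrix

def topTopicsForTerm(term: String, n: Int = 10): Seq[(Int, Double)] = {
  val termIdx = vocab.indexOf(term)
  (0 until matrix.numCols)
    .map(topic => (topic, matrix(termIdx, topic)))
    .sortBy(-_._2) // highest weight first
    .take(n)
}
```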

top documents for topic 189, game, player, release, character, version:

q.topDocumentsForTopic(189).map(t => s"${t._1}, ${t._2}, ${t._3}").mkString("\n")

3451369, Compilations in the Sonic series, 0.7610340780211949
186130, Mario Kart, 0.7587034745204893
375083, Super Mario All-Stars, 0.7218754819962181
6214148, Rayman Legends, 0.7157825809789193
249728, Super Mario Kart, 0.6978293334818717
22065, Mario Bros., 0.6960735103435863
1720079, Mario Kart DS, 0.6951581306576075
836057, Sonic Jam, 0.6947413966595805
3700711, Super Smash Bros. (video game), 0.6933008644887617
1020154, Game Critics Awards, 0.6929297296283337
70802, Dr. Mario, 0.6798284852826327
5344938, New Super Mario Bros. U, 0.6786167433608842
1187574, Bionicle Heroes, 0.6785534780085812
1186672, Sonic the Hedgehog, 0.6776289337360315
375596, Super Mario Land, 0.6751746448768772
143031, Super Smash Bros. Melee, 0.6749174713458214

We can see that Mario and Sonic are very stereotypical of the video games topic.

top topics for document 189, "Name Your Adventure" (it took 57 minutes!):

q.topTopicsForDocument(189).map(t => s"${t._1}, ${topics(t._1)}, ${t._2}").mkString("\n")

73, episode, series, character, voice, show, 0.11639690256946846
189, game, player, release, character, version, 0.0797946331436864
16, harris, cooper, ted, chuck, ruth, 0.065051452469731
167, channel, station, radio, broadcast, show, 0.05492641631362996
61, role, star, dance, actor, play, 0.04012115725881391
162, say, get, tell, want, would, 0.03227785442793559

It is correctly categorised as a TV show.
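Document queries go through model.transform, which adds a per-document topic-distribution vector; this is the transformed DataFrame mentioned in the conclusion. A minimal sketch, with the column names ("id", "topicDistribution") assumed:

```scala
import org.apache.spark.ml.linalg.Vector

// Assumed: `countVectors` is the vectorized corpus used to fit the model.
val transformed = model.transform(countVectors).cache()

def topTopicsForDocument(docId: Long, n: Int = 6): Seq[(Int, Double)] = {
  // "topicDistribution" holds one probability per topic for the document.
  val dist = transformed
    .filter($"id" === docId)
    .select("topicDistribution")
    .head.getAs[Vector](0)
  dist.toArray.zipWithIndex.map(_.swap).sortBy(-_._2).take(n)
}
```

The first such query triggers the transform over the whole corpus, which is consistent with the 57 minutes observed above; subsequent queries hit the cache.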

One more for the road: the top topics for document 30, "Braindead (film)":

8, kill, death, find, escape, dead, 0.17217683835149714
190, film, award, best, festival, director, 0.1349148434032038
189, game, player, release, character, version, 0.07281994326787491
123, real, argentina, peru, argentine, copa, 0.055911406036352654
183, russian, soviet, russia, moscow, ukraine, 0.0369994461127318
93, claim, action, case, robot, copyright, 0.0293998890768176
30, bill, frank, scott, jackson, wilson, 0.026259990700183054
107, new, zealand, pool, auckland, wellington, 0.02085614757465879
139, york, new, prison, gang, kennedy, 0.01960439098538654
130, magic, sky, fantasy, fiction, monster, 0.018513839520272735

Conclusion

The ml package is pretty amazing as well. Apart from the first use of transformed (which is cached afterwards), the queries are very quick, and the results with k=200 are good.

We don't get exactly the same results as with mllib, but it is difficult to decide which model is better. Both yield interesting and usable results.