lda results with k=200 (mllib) - derlin/bda-lsa-project GitHub Wiki

Creating the model

We used the whole Wikipedia dataset (57.5 GB, see datasets). The model creation and parameters are the same as described here with maxIter=50, except for the number of topics k, which is 200 here. In summary:

Parameters:

  • number of documents: 4'703'192
  • vocabulary size: 20'000
  • number of topics: k = 200
  • maxIter = 50

Execution took 2h12, versus more than 5h with k=1'000.

Results

The topics

All the topics, each described by 5 terms, are available in CSV format in other/topics-200-lda-maxIter50.csv.
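As a rough illustration of how a topic can be reduced to its 5 strongest terms and written as one CSV row, here is a minimal sketch in plain Scala. The object `TopicSummary`, its functions and the row layout (topic id followed by the terms) are hypothetical; the actual layout of the CSV file may differ.

```scala
object TopicSummary {
  /** Returns the `n` highest-weighted terms of one topic, best first.
    * `weights(i)` is the weight of `vocabulary(i)` in that topic (assumed layout). */
  def topTerms(weights: Array[Double], vocabulary: Array[String], n: Int): Seq[String] =
    weights.zipWithIndex
      .sortBy { case (w, _) => -w }          // highest weight first
      .take(n)
      .map { case (_, termId) => vocabulary(termId) }
      .toSeq

  /** Formats one CSV row: the topic id followed by its top terms (hypothetical layout). */
  def csvRow(topicId: Int, terms: Seq[String]): String =
    (topicId.toString +: terms).mkString(",")
}
```

For example, `TopicSummary.csvRow(140, TopicSummary.topTerms(weights, vocabulary, 5))` would produce one row for topic 140, given the weights of that topic and the vocabulary.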

At first sight, most of them make perfect sense.

Some queries

Top topics for the term "Batman":

q.topTopicsForTerm(data.termIds.indexOf("batman")).
   map(t => s"${t._2} - ${topics(t._2)}, ${t._1}").mkString("\n")

This gives:

140 - green, big, clark, baker, wayne, 291147.93959903804
130 - episode, series, voice, television, show, 0.04391746100853449
176 - dublin, irish, marvel, hero, cork, 0.03481843129685808
136 - comic, dead, moon, hope, adventure, 0.02365134447826069
73 - dark, magic, eye, dragon, ghost, 0.02253632359989645
29 - game, character, player, release, version, 0.022398469953310376

Here, only the first topic has a meaningful score; the others have very small scores. LDA has effectively attributed the term to a single topic. It is not obvious what green has to do with Batman, but remember that we have only 200 topics. It is therefore possible that topic 140 is about superheroes in general, with green standing for Hulk and Green Lantern.

To validate this assumption, let's look at the top 100 terms for topic 140:

green, big, clark, baker, wayne, batman, kent, lock, guy, superman, bennett, flash, giant, wonder, blake, martha, quinn, lantern, slalom, hal, lois, pipe, comics, brave, justice, joker, atkinson, save, gotham, titans, wally, metropolis, lex, alec, watt, liz, lindsey, mitch, robin, shelton, bruce, unlimited, freeze, infinite, appear, granville, luthor, arkham, red, housemate, jeremiah, maguire, lana, cyborg, villain, guardians, pryor, dyson, crossover, heard, new, chill, evict, ucl, rouse, lobo, vol, toro, crisis, doomsday, finest, teen, barefoot, city, issue, adventures, eviction, comic, files, dent, feature, power, super, combo, shrink, returns, superhero, arrow, evanston, phantom, world, conduit, league, version, bane, scarecrow, later, amp, dick, ring

Here, many words are linked to the superhero universe: comics, superman, flash, gotham, titans, luthor, superhero, etc. Topic 140 is therefore better described as the superhero universe.
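The term queries in this section rank topics by their weight for a single term. The sketch below shows one way such a ranking could work in plain Scala; it is not the project's `topTopicsForTerm`. The object `TermQuery`, the function `rankTopicsForTerm` and the matrix layout (one row per term, one column per topic) are assumptions made for illustration. It also guards against a term that is missing from the vocabulary, since `indexOf` on Scala collections returns -1 in that case.

```scala
object TermQuery {
  /** Ranks topics for one term.
    * `topicsMatrix(termId)(topicId)` is the weight of the term in the topic (assumed layout).
    * Returns (score, topicId) pairs, highest score first; empty if `termId` is out of range. */
  def rankTopicsForTerm(
      topicsMatrix: Array[Array[Double]],
      termId: Int,
      n: Int = 10): Seq[(Double, Int)] =
    if (termId < 0 || termId >= topicsMatrix.length) Seq.empty
    else
      topicsMatrix(termId).zipWithIndex      // (score, topicId)
        .sortBy { case (score, _) => -score } // highest score first
        .take(n)
        .toSeq
}
```

The returned pairs use the same order as the queries on this page: `_1` is the score and `_2` is the topic id.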

Let's try with the term "computer":

q.topTopicsForTerm(data.termIds.indexOf("computer")).map(t => s"${t._2} - ${topics(t._2)}, ${t._1}").mkString("\n")

This gives:

160 - technology, engineering, system, computer, machine, 771403.7501327193
91 - datum, software, user, code, use, 117399.18665560146
29 - game, character, player, release, version, 17830.645242408933
81 - level, basic, state, funding, grade, 6354.31041145056
162 - frac, function, mathbf, partial, equation, 202.39702419920675
12 - space, sun, edge, star, graph, 0.24748523969173558
37 - power, device, current, electric, signal, 0.14261456826008373
172 - school, student, high, education, teacher, 0.1327119993701473
68 - university, college, professor, study, degree, 0.09796352163515479
146 - company, network, channel, radio, product, 0.09449358564089928
149 - measure, wave, mass, scale, particle, 0.08866394325713459

Here, more than one topic has a high score. The first topic, 160, seems to be about computers in general and hardware. The second one, 91, is more about code, software and users. Computers are also a huge part of gaming, hence topic 29. Less prominent but still notable is the place of computers in mathematics (topic 162). The topics listed after 162 have very small scores (below 1).
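The raw scores for "computer" span several orders of magnitude, which makes them hard to compare at a glance. One possible way to read them is to divide each score by the sum of the listed scores. The sketch below does this in plain Scala; the object `ScoreShares` is hypothetical, the scores are copied (rounded) from the listing for "computer", and the shares only cover the topics shown in that listing, not all 200.

```scala
object ScoreShares {
  /** Divides each score by the sum of all scores, so that the shares add up to 1. */
  def shares(scores: Seq[(Int, Double)]): Seq[(Int, Double)] = {
    val total = scores.map(_._2).sum
    scores.map { case (topicId, score) => (topicId, score / total) }
  }

  // topic id -> score for "computer", rounded from the listing on this page
  val computer: Seq[(Int, Double)] = Seq(
    160 -> 771403.75, 91 -> 117399.19, 29 -> 17830.65, 81 -> 6354.31, 162 -> 202.40,
    12 -> 0.25, 37 -> 0.14, 172 -> 0.13, 68 -> 0.10, 146 -> 0.09, 149 -> 0.09)
}
```

With these numbers, `ScoreShares.shares(ScoreShares.computer)` gives topic 160 roughly 84% of the listed weight, topic 91 roughly 13% and topic 29 roughly 2%.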

Once again, describing the topics with more than 5 terms seems to confirm our assumptions.

100 terms for topic 160:

technology, engineering, system, computer, machine, memory, technical, systems, core, lab, design, information, crater, instruction, electronic, laboratory, advanced, processing, electronics, chip, develop, processor, computing, ibm, simulation, clock, intel, artificial, program, controller, visual, monitor, ieee, graphic, mit, model, mechanical, research, electrical, control, integrate, communication, cache, development, use, computational, technological, ghz, hardware, owens, cpu, process, atlantis, rim, application, technologies, technician, integrated, learning, modeling, pages, digital, micro, card, science, simulator, rom, laptop, industrial, automation, display, architecture, mhz, performance, base, labs, amd, recognition, doi, ram, graphics, turing, acm, slot, sp...

100 terms for topic 91:

datum, software, user, code, use, web, application, system, windows, information, server, support, database, tool, google, version, search, microsoft, access, interface, file, allow, data, device, available, standard, update, bit, provide, virtual, client, key, linux, content, address, format, source, protocol, apple, disk, internet, program, implementation, message, type, process, document, operating, include, feature, programming, computer, java, package, object, hardware, implement, specification, developer, create, directory, storage, link, release, browser, multiple, base, network, email, add, display, require, api, management, store, platform, method, example, enable, run, also, framework, function, security, desktop, page, model, remote, android, pdf, stack, numbe...

100 terms for topic 29:

game, character, player, release, version, mode, video, playstation, nintendo, enemy, bulgarian, feature, xbox, arcade, gameplay, quest, console, puzzle, games, sofia, level, super, gaming, use, boss, wii, interactive, series, sega, original, mega, title, button, pokémon, develop, sequel, item, weapon, graphic, praise, atari, developer, also, battle, review, online, entertainment, scenario, base, control, include, combat, allow, different, create, action, ability, screen, multiplayer, development, play, attack, ign, amiga, one, new, must, world, give, rpg, available, bulgaria, reviewer, playable, adventure, unlock, boy, add, vita, capcom, system, nes, design, pack, expansion, final, call, spy, well, fight, set, receive, zelda, time, map, windows, psp, make, story, team

Getting the top documents for topic 29 (game, character, player, release, version):

q.topDocumentsForTopic(29, numDocs=30).map(t => s"${t._1} - ${t._2._2}, ${t._2._1}").mkString("\n")

This gives:

1534804 - GlowTag, 0.9999983177598708
870371 - Shigureden, 0.9999981556747017
395612 - Line Attack Heroes, 0.999997249705353
1281822 - Coffeetime Crosswords, 0.9999961601422818
5837014 - Xebian, 0.9999961601422818
1370577 - Galactic Protector, 0.9999946665534605
...
1925873 - Sofia Papadopoulou, 0.9982668754573092
2673735 - Sofia, Hîncești, 0.9982668754573092
2569750 - Pegasus (nightclub), 0.9978843943721741
1167525 - Pi Pegasi, 0.9978843943721741
5549700 - Pegasus (Pilz), 0.9978843943721741
2820001 - Pheidole pegasus, 0.9978843943721741

Here, most results are games. More interestingly, they all have a score above 0.99. This suggests that the topic is really about games and that LDA correctly classified the articles about games. As a further test, we queried the top 300 documents for the same topic. The 300th entry is:

6036100 - Gravonaut, 0.9121662723668448

So at least 300 entries are relevant to this topic, in contrast with the results we got for k = 100.
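The document query used above can be pictured as sorting documents by the weight of one topic in their topic distribution. Here is a minimal sketch in plain Scala, not the project's `topDocumentsForTopic`. The object `DocQuery`, its functions and the input layout (document id mapped to a title and a topic distribution) are assumptions made for illustration. The second function counts how many documents pass a score threshold, which is another way to check how many entries belong to a topic.

```scala
object DocQuery {
  /** Ranks documents by the weight of `topicId` in their topic distribution.
    * `docTopics` maps a document id to (title, topic distribution); this layout is assumed.
    * Returns (docId, (score, title)) tuples, highest score first. */
  def rankDocumentsForTopic(
      docTopics: Map[Long, (String, Array[Double])],
      topicId: Int,
      numDocs: Int): Seq[(Long, (Double, String))] =
    docTopics.toSeq
      .map { case (docId, (title, dist)) => (docId, (dist(topicId), title)) }
      .sortBy { case (_, (score, _)) => -score } // highest score first
      .take(numDocs)

  /** Counts the documents whose weight for `topicId` is at least `threshold`. */
  def countAbove(
      docTopics: Map[Long, (String, Array[Double])],
      topicId: Int,
      threshold: Double): Int =
    docTopics.values.count { case (_, dist) => dist(topicId) >= threshold }
}
```

The tuples follow the same shape as the query on this page: `_1` is the document id, `_2._1` the score and `_2._2` the title.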