Keyword analysis - robhogg/twive GitHub Wiki
I'm planning to add some intelligence here (e.g. stemming, and experimenting with Pearson distance to group tweets and tweeters). At the moment, though, it's just a simple keyword cloud of the 100 most frequent words, after excluding a list of stop-words.
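As a rough illustration of the current approach (not twive's actual code, which isn't shown on this page), here's a minimal Python sketch of counting the most frequent words after dropping stop-words. The names `top_keywords`, `WORD_RE` and `STOP_WORDS` are made up for the example, the tokenising regex is an assumption, and the stop-word set is only a small sample of the real list.

```python
import re
from collections import Counter

# Hypothetical sample of the stop-word list, for illustration only.
STOP_WORDS = {"about", "and", "for", "the", "this", "with", "you"}

# Assumed tokeniser: runs of letters, optionally with an internal
# apostrophe so that words like "don't" stay in one piece.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def top_keywords(tweets, stop_words=STOP_WORDS, limit=100):
    """Return up to `limit` (word, count) pairs, most frequent first."""
    counts = Counter()
    for text in tweets:
        # Lower-case first so "The" and "the" count as the same word.
        for word in WORD_RE.findall(text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = ["The cat and the dog", "Cat videos for you", "dog cat"]
    print(top_keywords(sample, limit=2))
```

Because the text is lower-cased before matching, the stop-words need to be stored in lower case too (so `i'm` rather than `I'm`).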
The stop-word list is currently rather ad hoc, and could do with some refinement. It might be worth trying a pre-prepared list, such as one of the ones here, though they looked a little broad.
List as of 9 Feb 2013:
about | from | much | today |
all | get | my | too |
also | going | no | very |
amp | got | not | via |
and | had | only | was |
any | has | our | what |
anyone | have | out | when |
are | here | over | where |
but | how | que | who |
can | I'm | should | why |
could | I've | sobre | will |
day | into | some | with |
does | it's | than | would |
doing | its | that | yes |
don't | just | the | you |
even | last | there | your |
everyone | many | they | |
for | more | this | |
amp is a slightly odd keyword, I know. The regex that extracts keywords needs a bit of attention, as it treats the alphabetic characters in HTML entities (such as &amp;amp;) as words.
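One possible fix, sketched in Python as an assumption rather than as what twive does, is to decode HTML entities before running the word regex, so that `&amp;` becomes a plain `&` and never reaches the tokeniser as "amp". `html.unescape` is part of the Python standard library; `extract_words` and `WORD_RE` are hypothetical names for the example.

```python
import html
import re

# Assumed tokeniser: runs of letters, optionally with an internal apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def extract_words(text):
    """Decode HTML entities, then pull out lower-cased words."""
    decoded = html.unescape(text)  # "&amp;" -> "&", "&lt;" -> "<", etc.
    return WORD_RE.findall(decoded.lower())


if __name__ == "__main__":
    print(extract_words("Fish &amp; chips"))
```

With the decoding step in place, amp would no longer need to be in the stop-word list at all.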