Crawler poster

  • Uses the Twitter REST API
  • Twitter limit: 150 queries per hour; the crawler is designed to minimize queries.
  • We also use proxies to get around the limit, at the cost of slower crawling (see the proxy-rotation sketch after this list).
  • Data is fetched as XML and processed with SAX, which is fast and memory efficient (a SAX handler sketch follows the list).
  • Designed to run in multiple threads or across multiple machines.
  • How does it work? (The loop is sketched in code after this list.)
    1. Start with some users, put them in a queue.
    2. Take the first user off the queue and crawl their latest tweets and friends.
    3. Parse the tweets, looking for mentions and hashtags.
    4. Add the friends, the mentioned users, and the user itself back to the queue.
    5. Send the data to the Solr and Ranker modules.
    6. GOTO 2
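
The proxy rotation might look like the minimal round-robin pool below, built on standard `java.net.Proxy` objects so that each proxy's IP stays under the per-hour cap. This is a hedged sketch, not the actual TweetRank code; the class name and any proxy hosts are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;
import java.util.List;

// Sketch: rotate requests across a proxy pool so each proxy's IP
// stays under Twitter's per-IP rate limit. Hypothetical class name.
public class ProxyRotator {
    private final List<Proxy> proxies;
    private int next = 0;

    public ProxyRotator(List<Proxy> proxies) {
        this.proxies = proxies;
    }

    /** Opens a connection to url through the next proxy, round-robin. */
    public synchronized HttpURLConnection open(URL url) throws IOException {
        Proxy proxy = proxies.get(next);
        next = (next + 1) % proxies.size();
        return (HttpURLConnection) url.openConnection(proxy);
    }
}
```

A pool entry is built with e.g. `new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy.example.com", 8080))`; the host here is a placeholder.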
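The SAX-based parsing could look roughly like the handler below. It assumes the old Twitter v1 XML layout, where each tweet's body sits in a `<text>` element inside a `<status>`; the real handler presumably extracts more fields (author, id, friends).

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of a streaming tweet parser: SAX fires events per element,
// so the whole XML response never has to be held in memory at once.
public class TweetTextHandler extends DefaultHandler {
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder buf = new StringBuilder();
    private boolean inText = false;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("text".equals(qName)) {   // assumed element name (v1 XML)
            inText = true;
            buf.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) buf.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("text".equals(qName)) {
            inText = false;
            texts.add(buf.toString());
        }
    }

    public List<String> getTexts() { return texts; }
}
```

Feeding a response into it is a single call: `SAXParserFactory.newInstance().newSAXParser().parse(stream, handler)`.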
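Steps 1-6 can be sketched as the loop below; `fetchTweets`, `fetchFriends`, and `pushToSolrAndRanker` are hypothetical stubs standing in for the real Twitter REST API calls and the Solr/Ranker modules.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the crawl loop; the stubbed methods are placeholders.
public class CrawlLoop {
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    public void run(Queue<String> queue) {
        // The loop never drains: users are re-queued, matching "GOTO 2".
        while (!queue.isEmpty()) {
            String user = queue.poll();                  // step 2
            List<String> tweets = fetchTweets(user);
            List<String> friends = fetchFriends(user);

            List<String> hashtags = new ArrayList<>();
            for (String tweet : tweets) {                // step 3
                Matcher m = MENTION.matcher(tweet);
                while (m.find()) queue.add(m.group(1));  // mentioned users
                Matcher h = HASHTAG.matcher(tweet);
                while (h.find()) hashtags.add(h.group(1));
            }
            queue.addAll(friends);                       // step 4: friends...
            queue.add(user);                             // ...and the user again
            pushToSolrAndRanker(user, tweets, hashtags); // step 5
        }
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        queue.add("seedUser");                           // step 1 (hypothetical seed)
        new CrawlLoop().run(queue);
    }

    // Hypothetical stubs for the real REST API and downstream modules.
    private List<String> fetchTweets(String user) { return Collections.emptyList(); }
    private List<String> fetchFriends(String user) { return Collections.emptyList(); }
    private void pushToSolrAndRanker(String user, List<String> tweets,
                                     List<String> hashtags) { }
}
```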