2015 Current Tasks and Notes
Open issues in GitHub:
https://github.com/UTMediaCAT/Voyage/issues
#Move server
Dec 3
- Turns out that WARC'ing is going faster than we thought: approximately 800 URLs in 5 hours; 4 simultaneous processes take 1.7 GB of RAM on 2 processors.
Nov 26
- Meet after the 1 pm meeting to go through the ideal backlink export format for priority queuing
- Priority queuing using Google searches of the form "keyword site:domain"
- Team to meet on Tuesday for diagrams and testing
- testing over the summer showed about 1 to 1.5 crawler instances per CPU core
- in future, parallel crawls per site (e.g. a couple of nytimes instances) should still stay around 1-1.5 instances per core
- Jai got his placement because of the work he did for UTMediaCAT!
Nov 19
- have documentation and diagrams ready for all processes
- test the new database and see how much RAM/CPU space is required
- get email wording for permission to crawl sites
- set up a meeting in mid-December; the server may need more than 60 GB of RAM, depending on how many sites run in parallel (swap space is slow if all sites are being accessed and all of memory is in use); an instance for 3 sites uses about 4 GB of memory; hosting with IITS will cost a few hundred dollars
#Database:
Dec 11:
- The 3-site crawl paused when it ran out of space but resumed afterwards. Of the 60 GB on Digital Ocean, most is taken up by logs and text files kept to inspect error codes; the logs are not rotated, so a cron job to delete old logs is needed (a cleanup sketch follows below). About 14 GB is used for 3 sites over 2 weeks, including RAM and swap space. Moving to a database would be the same or slightly better in terms of site timing (marginally more optimized), but the database may be a little bigger on disk, though not by much.
- William is working on the database; it is not ready to use yet and still needs debugging, performance checks, and a look at alternate ways to fix the ~14 GB usage. The timeline is that this will happen after the break.
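A minimal sketch of the log-cleanup cron job mentioned above, assuming a hypothetical /var/log/mediacat directory and a 14-day retention; the real paths and retention period would need to match the Digital Ocean setup.

```python
#!/usr/bin/env python
"""Delete crawler log files older than a cutoff (sketch only)."""
import os
import time

LOG_DIR = "/var/log/mediacat"   # placeholder: wherever the crawler logs live
MAX_AGE_DAYS = 14               # placeholder: keep roughly two weeks of logs

cutoff = time.time() - MAX_AGE_DAYS * 24 * 60 * 60

for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    # Remove only plain files that have not been modified since the cutoff.
    if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
        os.remove(path)
```

Scheduled daily with a crontab line such as `0 3 * * * python /opt/mediacat/clean_logs.py` (path again a placeholder), this keeps the 60 GB volume from filling up with old logs.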
Dec 3:
- Met to try to do the MySQL migration: we couldn't easily transfer the SQLite database to MySQL, so in the end it might be easiest to have Bot Discovery (Plan B) use MySQL for its RAM database and keep the result database in SQLite (a settings sketch follows below). This will help with testing to ensure stability. Try to do this for next week.
- Still need to incorporate the saving of source URLs and implement the changes to the UI.
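A sketch of what the two-database split could look like in Django settings; every name, path, and credential below is a placeholder rather than the project's real configuration.

```python
# settings.py (sketch): Django can talk to both databases at once.
DATABASES = {
    # Existing result database stays in SQLite.
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': '/opt/mediacat/results.sqlite3',   # placeholder path
    },
    # New MySQL database used by Bot Discovery as its working ("RAM") store.
    'discovery': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mediacat_discovery',              # placeholder schema name
        'USER': 'mediacat',
        'PASSWORD': 'change-me',
        'HOST': '127.0.0.1',
        'PORT': '3306',
    },
}
```

Queries can then be pointed at the MySQL store with `.using('discovery')`, or a database router can send the Bot Discovery models there automatically, while the result models keep using SQLite.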
Nov 26:
MySQL update: implementation not yet started.
CSS selectors: enter them through the web interface. Proposed testing solution: get all selectors from the database for a given site and evaluate a given page against every selector in place, rather than pasting them in every time. Create an account for Vinnie, let him log in, and test the SSH connection to the database.
Alternate option: a Django page to run and test selectors instead.
Source articles (Jai): in progress and now working for new results, but it also has to be done retroactively, so a script is needed to go and get the source sites for existing results.
Nov 19:
- meeting yesterday for the database selection
- the thinking on how to optimize is to go with MySQL, implemented through Django
- Django can have multiple databases: keep the existing SQLite one and implement the new MySQL database
- add the archiving of source articles that are found, and have this reflected in the interface - deferred
Nov 5:
- add the archiving of source articles that are found, and have this reflected in the interface
#Weekly sweeps
Nov 26
- Newspaper sweeps are fine; referring sites still to be added by Alejandro
Nov 19
- daily and weekly sweeps are fine for Newspaper
- need to add all the necessary referring sites - deferred for Alejandro to add
Nov 5:
- started without a hitch
- need to add all the necessary referring sites
#WARC
Dec 3:
- For easy viewing, we thought we would include a PDF version of each webpage as a link. However, the PhantomJS process for generating the PDF is taking a long time (the call in question is sketched below). We will wait to see if Roger has a solution; if not, perhaps we will look into adding a WARC viewer.
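For reference, this is roughly the call being timed; it assumes PhantomJS is installed and uses its bundled examples/rasterize.js script. The script path, paper size, and timeout are placeholders (and the timeout argument requires Python 3.3+).

```python
import subprocess

RASTERIZE_JS = "/opt/phantomjs/examples/rasterize.js"  # placeholder install path

def render_pdf(url, out_path, timeout_seconds=120):
    """Render one crawled page to PDF with PhantomJS (sketch only)."""
    # Usage: phantomjs rasterize.js <url> <output file> [paper size]
    subprocess.check_call(
        ["phantomjs", RASTERIZE_JS, url, out_path, "A4"],
        timeout=timeout_seconds,  # give up on pages that render too slowly
    )

# render_pdf("http://www.example.com/article", "article.pdf")
```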
Nov 12
- WARC'ing is stable; we can do a maximum of 2 WARCs at a time, but this can be increased
- PDF viewer: this is nearly ready; the code just needs to be merged
- there was only one bad WARC
Nov 5
- WARC'ing is not done: there was a memory crash
- WARC: use a dynamic number of processes based on how much memory is available (see the sketch after this list)
- WARCs have been checked
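A minimal sketch of the dynamic-process idea from the Nov 5 notes, assuming roughly 0.5 GB per WARC process (in line with the "4 processes take 1.7 GB" figure above) and using the third-party psutil package to read available memory; the per-worker footprint and the worker function are placeholders.

```python
import multiprocessing

import psutil  # third-party: pip install psutil

BYTES_PER_WORKER = 512 * 1024 * 1024  # assumed footprint of one WARC process

def warc_one_url(url):
    # Placeholder for the real WARC'ing routine.
    print("archiving", url)

def archive_all(urls):
    """Start only as many WARC workers as free memory allows (sketch)."""
    available = psutil.virtual_memory().available
    workers = max(1, min(len(urls), available // BYTES_PER_WORKER))
    pool = multiprocessing.Pool(processes=workers)
    try:
        pool.map(warc_one_url, urls)
    finally:
        pool.close()
        pool.join()
```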
#Crawler (Plan B): test on NYT, CNN, BBC
Dec 11:
- nytimes and bbc have only been running for 2 weeks since being restarted and are still finding unique URLs: nytimes ~1,000,000, bbc ~900,000, cnn ~470,000. CNN has the most hits so far, likely because of transcripts.
- Google is useful for keyword mentions, not for backlinks; Ahrefs is better because you can filter the results (they have some ability to do so), but you then pay for those links. Using that would require a program, and using the API costs more; the API commands are not helpful.
Dec 3:
- New terminology: the Plan B crawler is now "Bot Discovery", and the Newspaper crawler is "RSS Discovery"
- test Bot Discovery alongside the WARC'ing processes
Nov 26:
CSS selector update: made a tool for testing CSS selectors. The current solution does it in the browser, but browser CSS selector implementations may not match the XML-based one - it depends on which browser you use. A Python script should use the same XML library the crawler uses, so the tests are accurate. It needs SSH access to the server and access to the site's database.
- display the text and the matched keyword, with the keyword highlighted
- costly to do for every article; it takes too much time
- shows all text and highlights all matching keywords (a naive sketch follows after this list)
- one crash in the past 2 weeks, but it was run again and hasn't stopped since
- anchor text (the text of links) is halfway there
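A naive sketch of the keyword highlighting described above: given stored article text and the matched keywords, return each match with some surrounding context and the keyword wrapped for the UI to highlight. The window size and the `<mark>` markup are arbitrary choices for the example.

```python
import re

def keyword_snippets(text, keywords, window=60):
    """Return each keyword occurrence with surrounding context (sketch)."""
    snippets = []
    for keyword in keywords:
        for match in re.finditer(re.escape(keyword), text, re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            before = text[start:match.start()]
            after = text[match.end():end]
            # Wrap the matched keyword so the UI can highlight it.
            snippets.append(before + "<mark>" + match.group(0) + "</mark>" + after)
    return snippets

# keyword_snippets(article_text, ["keyword one", "keyword two"])
```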
Nov 19:
- the crawler was not run this week; it will run next week
- implement storing the full text of each article as well, plus a naive implementation that shows the text around each found keyword based on the stored text and the matched keyword. The text is not currently stored in the database; once the full text is stored, the surrounding text can be shown and recreated in articles. It takes 5-10 seconds to load the text in "Matched Keywords". The stored text will also be used for comparing article text once it is separated out.
- anchor text is still required; William and Yuya are getting the link text
- lots of cn.nytimes.com URLs
Nov 12:
- no crash, but 2 processes are stuck; the cause might be external
- Next Tuesday: Yuya, Jai, and William will meet to talk about database
- no other problems as of right now.
- we'll attempt to separate out the text of the article in order to make it available for comparison on re-crawls
- finding about 20 hits a day, mostly on NYT
- CSS selectors: see the CSS Selectors section below
Nov 5:
- database/save state: not yet working; don't go with SQLite; use MySQL or another database
- we will increase server capacity to 40 GB of disk and 4 GB of RAM
- we are targeting December to finish implementing the database and to get ready to do a baseline crawl
#Optimizing the Plan B crawler
Dec 11:
- priority queuing: a few hundred backlinks can be found; doing the whole thing would be difficult, and the results are limited with many duplicates
Priority Queuing (Oct 15):
- search Google for all aliases and then queue those hits (query construction is sketched after this list)
- problem: Google doesn't search embedded links; that would need a source-code search, so Alejandro is going to write to Google
- looking at getting backlinks from a company.
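A sketch of how the "alias site:domain" queries could be built; only the query construction is shown, since how the searches are actually run (Google's web interface, a paid backlink service, etc.) is the open question above, and the example aliases and domains are placeholders.

```python
def priority_queries(aliases, domains):
    """Build one 'alias site:domain' query per alias/domain pair (sketch)."""
    queries = []
    for domain in domains:
        for alias in aliases:
            # e.g. '"Some Alias" site:nytimes.com'
            queries.append('"%s" site:%s' % (alias, domain))
    return queries

# URLs found for these queries would be pushed onto the crawler's queue
# ahead of ordinary frontier links.
# priority_queries(["Some Alias"], ["nytimes.com", "bbc.co.uk"])
```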
Multi-threading on a single site:
- one process downloads/queues pages and one process analyzes metadata (see the sketch after this list)
- leave until later; need to iron out bugs with the crawler first
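A bare outline of that split, with one downloader process feeding one metadata-analysis process through a queue; the fetch and analysis bodies are placeholders.

```python
import multiprocessing

def downloader(urls, queue):
    """Fetch pages and hand them to the analyzer (sketch)."""
    for url in urls:
        html = "<html>...</html>"  # placeholder for the real download
        queue.put((url, html))
    queue.put(None)  # sentinel: no more pages

def analyzer(queue):
    """Pull pages off the queue and extract metadata (sketch)."""
    while True:
        item = queue.get()
        if item is None:
            break
        url, html = item
        print("analyzed", url)  # placeholder for author/date/keyword extraction

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    urls = ["http://www.example.com/a", "http://www.example.com/b"]
    fetcher = multiprocessing.Process(target=downloader, args=(urls, queue))
    worker = multiprocessing.Process(target=analyzer, args=(queue,))
    fetcher.start()
    worker.start()
    fetcher.join()
    worker.join()
```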
Multi-threading for multiple sites:
Oct 15
- Implemented multi-threading
- Implemented log per site
Oct 8
- William will implement the Newspaper code for analysis, and Yuya will test it
- William/Yuya will also implement a log per web domain per crawl, instead of all logs together (a per-domain logging sketch follows after this list)
- fixes that were made to the crawler need to be applied to the multi-threaded crawler; this has to be done manually (William)
- keep an eye on this: the readability change implemented over the summer makes the crawl faster, but it also leads to greater instability, which was the source of the errors we saw this week
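A small sketch of the per-domain logging mentioned above: each crawled domain gets its own logger writing to its own file. The log directory and format are placeholders.

```python
import logging
import os

LOG_DIR = "logs"  # placeholder output directory

def get_domain_logger(domain):
    """Return a logger that writes to <LOG_DIR>/<domain>.log (sketch)."""
    logger = logging.getLogger("crawler.%s" % domain)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        if not os.path.isdir(LOG_DIR):
            os.makedirs(LOG_DIR)
        handler = logging.FileHandler(os.path.join(LOG_DIR, domain + ".log"))
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

# get_domain_logger("nytimes.com").info("queued %s", "http://www.nytimes.com/...")
```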
#CSS Selectors
Dec 3
- Code exists for testing CSS selectors, but it needs to be incorporated into the UI (a testing sketch follows below).
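A rough sketch of such a test: fetch a page and try each stored selector against it with lxml (the same XML machinery the crawler uses) rather than a browser's CSS engine. The URL, the selector list, and the helper name are illustrative; in the real tool the selectors would come from the database for that site, and lxml's cssselect support requires the cssselect package.

```python
import lxml.html
import requests

def test_selectors(url, selectors):
    """Report what each CSS selector matches on the given page (sketch)."""
    page = lxml.html.fromstring(requests.get(url, timeout=30).content)
    for css in selectors:
        matches = page.cssselect(css)
        print(css, "->", [el.text_content().strip() for el in matches])

# Hypothetical author/date selectors checked against one article:
# test_selectors("http://www.example.com/article",
#                ["span.byline-author", "time.dateline"])
```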
Nov 19
- UI for CSS selectors is on the site, for adding selectors: go to a site under Referring Sites, click on it, and CSS selectors can be added at the bottom of the page. This still needs to be hooked up to the crawler; right now only the date-modified and author selectors are hooked up.
- Will Alejandro want to add CSS selectors to the old database sites as well? Will he use the old database?
Nov 12
- code is implemented to find CSS selectors as part of Newspaper, but we need UI on the site so that a user can go in and plug in CSS selectors based on domain (a possible per-domain model is sketched below)
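One possible shape for the per-domain storage behind that UI, sketched as a Django model; the model and field names are guesses for illustration, not the project's actual schema.

```python
from django.db import models

class DomainSelectors(models.Model):
    """CSS selectors configured for one referring-site domain (sketch)."""
    domain = models.CharField(max_length=255, unique=True)
    # The two fields the crawler currently consumes, per the notes above.
    author_selector = models.CharField(max_length=255, blank=True)
    date_modified_selector = models.CharField(max_length=255, blank=True)

    def __str__(self):
        return self.domain
```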
Nov 5
- need some examples from Vinnie of CSS selectors that need regular-expression troubleshooting