Old Tasks and Notes
Move server
Dec 3
- Turns out that WARC'ing is going faster than we thought: approximately 800 URLs in 5 hours; 4 simultaneous processes use 1.7 GB of RAM with 2 processors.
Nov 26
- Meeting after 1 pm to go through the ideal backlink export format for priority queuing
- Priority queuing using Google searches of the form "keyword site:domain"
- Team to meet on Tuesday for diagrams and testing
- Testing over the summer suggests 1 to 1.5 instances per CPU core
- In the future we may also parallelize per site (e.g., a couple of nytimes.com instances), still at 1-1.5 instances per core
- Jai got his placement because of the work he did for UTMediaCAT!
Nov 19
- Have documentation and diagrams ready for all processes
- Test the new database and see how much RAM/CPU is required
- Get email wording for permission to crawl sites
- Set up a meeting in mid-December; we may need more than 60 GB of RAM depending on how many sites run in parallel. Swap space is slow if we are accessing all sites and all their memory. An instance crawling 3 sites currently uses about 4 GB of memory. The extra capacity will cost a few hundred dollars with IITS.
Database
Dec 3:
- Met to try to do the MySQL migration: we couldn't easily transfer the SQLite database to MySQL, so in the end it might be easiest to have Bot Discovery (Plan B) use MySQL for its working ("RAM") database and keep the results database in SQLite. This will help with testing to ensure stability. Try to do this for next week.
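A minimal sketch of what this split could look like in Django settings; the database names, paths, and credentials below are placeholders, not the project's actual configuration:

```python
# settings.py fragment (sketch) -- names, paths, and credentials are placeholders.
import os

BASE_DIR = os.path.dirname(os.path.abspath(__file__))

DATABASES = {
    # existing results database stays in SQLite
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'results.sqlite3'),
    },
    # working ("RAM") database for Bot Discovery moves to MySQL
    'bot_discovery': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mediacat_discovery',
        'USER': 'mediacat',
        'PASSWORD': 'change-me',
        'HOST': '127.0.0.1',
        'PORT': '3306',
    },
}

# Bot Discovery queries would then use .using('bot_discovery') or a database
# router, while the results models keep reading and writing SQLite.
```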
Nov 26:
- MySQL update: not yet working on the implementation
- CSS selectors: entering them through the web interface and testing them. Proposed solution: get all selectors from the database for a given site and evaluate a given page against every selector in place, rather than pasting them in each time. Create an account for Vinnie so she can log in, and test an SSH connection to the database.
- Alternate option: a Django page instead, to run and test selectors
- Source articles - Jai - in progress: now working for new results, but we now have to go and do it retroactively - make a script to go and get the source sites now
Nov 19:
- Meeting yesterday about the database selection
- Thought about how to optimize; going with MySQL, implemented through Django
- Django can have multiple databases: keep the existing SQLite database and implement a new MySQL database
- add the archiving of source articles that are found, and have this reflected in the interface - deferred
Nov 5:
- add the archiving of source articles that are found, and have this reflected in the interface
Weekly sweeps
Nov 26
- Newspaper sweeps are fine - Alejandro still to add the referring sites
Nov 19
- Daily/weekly sweeps are fine for Newspaper
- need to add all the necessary referring sites - deferred for Alejandro to add
Nov 5:
- started without a hitch
- need to add all the necessary referring sites
Oct 29:
- In a day, found 161 hits; not linear - the first hour found a lot, then fewer over the next 23
Oct 22:
- Jai/Roger will look into setting up an instance of UTMediaCAT to scan Twitter and domains that use Newspaper/RSS
Image and style sheet preservation
Oct 15
- Generally, WPull is working really well; it takes about 100 seconds to generate a WARC
- Occasionally there are strange issues, like a black background, but they don't affect the text and links
WARC
Dec 3:
- For easy viewing, we thought we would put a PDF version of the webpage as a link. However, the PhantomJS process for the PDF is taking a long time. We will wait to see if Roger has a solution; if not, perhaps we will look into adding a WARC viewer.
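A rough sketch of how the PDF step could be wrapped with a timeout so a slow render doesn't hold up the pipeline; this assumes the stock `rasterize.js` example script that ships with PhantomJS, and the timeout value is a placeholder:

```python
import subprocess

def render_pdf(url, out_path, timeout=120):
    """Render a page to PDF with PhantomJS; give up after `timeout` seconds."""
    cmd = ['phantomjs', 'rasterize.js', url, out_path, 'A4']
    try:
        subprocess.check_call(cmd, timeout=timeout)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Too slow or failed -- fall back to linking the WARC (or a viewer) instead.
        return False
```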
Nov 12
- WARC'ing is stable; we can do a maximum of 2 WARCs at a time, but this can be increased
- PDF-viewer: this is nearly ready, just need to have code merged
- there was only one bad WARC
Nov 5
- WARC'ing is not done: there was a memory crash
- WARC: dynamic number of processes based on how much memory is available
- WARCs have been checked
Oct 29
- WARC'ing can also take up a lot of memory, approximately 2 minutes for each hit; we need to implement a queue with a maximum number of simultaneous processes.
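A minimal sketch of the kind of bounded queue we mean, sizing the worker pool from available memory; the `psutil` dependency, the per-process memory estimate, and the placeholder `make_warc` command are assumptions, not the project's actual code:

```python
import hashlib
import subprocess
from multiprocessing import Pool

import psutil  # assumed dependency, used only to read available memory

PER_PROCESS_MB = 450  # rough estimate: ~1.7 GB observed for 4 processes

def make_warc(url):
    """Placeholder for the real WARC'ing step (e.g. a wpull invocation)."""
    name = hashlib.md5(url.encode('utf-8')).hexdigest()
    return subprocess.call(['wpull', url, '--warc-file', name])

def warc_all(urls, hard_cap=4):
    """WARC every URL, never running more simultaneous jobs than memory allows."""
    available_mb = psutil.virtual_memory().available // (1024 * 1024)
    workers = max(1, min(hard_cap, available_mb // PER_PROCESS_MB))
    with Pool(processes=workers) as pool:
        pool.map(make_warc, urls)
```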
Oct 8
- Roger and Jai managed to get PhantomJS/WPull to produce the WARC, but there's a problem with the speed
- Roger/Jai believe they can make it faster by figuring out which elements to ignore - here's what Roger writes: "we can manually force it to generate files in 2 mins for each url, and the results can be still good. However, as a result, it is likely that a few images will be missing (you can check the attachments to see a sample file generated by this strategy)"
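As a concrete illustration of "forcing it to generate files in about 2 minutes", something along these lines could be used; the flag names follow wpull's wget-style options but are written from memory and should be checked against `wpull --help`:

```python
import subprocess

def fast_warc(url, output_name, timeout=150):
    """Archive one URL to a WARC, letting PhantomJS run the JavaScript but
    skipping most images so each URL finishes in roughly 2 minutes."""
    cmd = [
        'wpull', url,
        '--warc-file', output_name,      # writes output_name.warc.gz
        '--phantomjs',                   # render JavaScript before capture
        '--page-requisites',             # fetch the CSS/JS the page needs
        '--reject', 'jpg,jpeg,png,gif',  # skip heavy images (a few will be missing)
    ]
    return subprocess.call(cmd, timeout=timeout)
```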
Oct 1:
- Roger figured out how to use PhantomJS & WPull to get an HTML output that runs JavaScript. This week Jai/Roger will look to see whether it's possible to go direct to WARC (preferable to exporting to WARC or converting from HTML). If so, that's what we'll use.
OLD:
- WARC vs. a Zotero-like ability to snapshot? Jai: a snapshot is possible. Kim writes about why WARC is far preferable. Problem: it's not easy to WARC a JavaScript-heavy page
- PhantomJS: produces a PDF of the webpage, but it's the "print-out" version and doesn't have selectable text (not good)
- wkhtmltopdf: renders most of the JavaScript; there is a Linux version
- A Chrome extension works, but it is very dependent on the browser; one possibility is an auto-clicker that runs through Chrome - this seems to work best, but it would need an instance of Chromium to handle it
- Selenium: another project - checking into it
- Chrome's JavaScript rendering is available, but it isn't trivial to automate. Question: does the DSU have WARC'ing functionality they can give us? Maybe talk to Anya before going to all the trouble.
- Yuya/Roger are also looking at the WARC issue
Crawler (Plan B) Test on NYT, CNN, BBC
Dec 3:
- New terminology: Plan B is now "Bot Discovery", and the Newspaper crawler is "RSS Discovery"
- Test Bot Discovery together with the WARC'ing processes
Nov 26:
- CSS selector update: made a tool for testing CSS selectors. The current solution is doing it in the browser, but browser CSS selectors might not use the same implementation as the XML one - it depends on which browser you use. A Python script should use the XML stack, the same one the crawler uses, so it tests accurately (see the sketch after this list). It needs SSH access to the server and access to the site's database.
- Display the matched keyword in context - highlight the keyword
- Costly to do for every article; it is taking too much time
- Shows the full text and highlights all matching keywords
- One crash in the past 2 weeks, but it was run again and hasn't stopped since
- Anchor text (the text of links) is halfway there
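A minimal sketch of this kind of selector tester, using lxml's `cssselect` (the same XML stack the crawler uses) rather than a browser; the table and column names for the stored selectors are assumptions:

```python
import sqlite3

import requests
from lxml import html  # lxml + cssselect: the same selector engine the crawler uses

def selectors_for_site(db_path, domain):
    """Fetch all stored CSS selectors for one site (table/column names assumed)."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT field, selector FROM css_selectors WHERE domain = ?", (domain,)
    ).fetchall()
    conn.close()
    return rows

def test_selectors(db_path, domain, page_url):
    """Evaluate every stored selector for a site against a single page."""
    doc = html.fromstring(requests.get(page_url, timeout=30).content)
    for field, selector in selectors_for_site(db_path, domain):
        matches = doc.cssselect(selector)
        preview = matches[0].text_content().strip() if matches else '(no match)'
        print('%s: %s -> %s' % (field, selector, preview[:80]))
```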
Nov 19:
- Crawler not run this week; it will run next week
- Implement storing the full text for each article as well; a naive implementation shows the text around the found keyword based on the stored text and the matched keyword (see the sketch after this list). The context is recreated from the article and not currently stored in the database; with the full text stored we can see the surrounding text. It takes 5-10 seconds to load the text in "Matched Keywords". This will also be used for comparing text when it's separated out.
- Anchor text is still required - William and Yuya are getting the link text
- Lots of hits from cn.nytimes.com
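A naive sketch of the "text around the found keyword" idea mentioned above; the function name and the 200-character window are illustrative choices:

```python
def keyword_context(full_text, keyword, window=200):
    """Return up to `window` characters of context on each side of the first
    case-insensitive occurrence of `keyword` in `full_text`."""
    index = full_text.lower().find(keyword.lower())
    if index == -1:
        return ''
    start = max(0, index - window)
    end = min(len(full_text), index + len(keyword) + window)
    return full_text[start:end]

# Example:
# keyword_context(article_text, "Gaza")  ->  "...text surrounding the match..."
```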
Nov 12:
- No crash, but 2 processes are stuck; the cause might be external
- Next Tuesday: Yuya, Jai, and William will meet to talk about database
- no other problems as of right now.
- we'll attempt to separate out the text of the article in order to make it available for comparison on re-crawls
- finding about 20 hits a day, mostly on NYT
- CSS selectors:
Nov 5:
- Database/save state: not yet working; don't go with SQLite - use MySQL or another database
- We will increase server capacity to 40 GB of storage & 4 GB of RAM
- We are targeting December to finish implementing the database and getting ready to do a baseline crawl.
Oct 29:
- Debugged the problem of not picking anything up
- Memory issue: even with memory logging at six-minute intervals, we're not able to see why memory usage goes up suddenly
- One possible solution is save state/a database: probably best to do this for a variety of reasons
- Another solution is hashing the URLs, perhaps with 64-bit hashes (see the sketch after this list)
- Another possible solution: paging, i.e., using the hard drive as extra RAM
- Consider whether we need more server capacity
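A sketch of the hashing idea: store a 64-bit hash of each visited URL instead of the full string, trading a tiny collision risk for a much smaller in-memory set. The hash choice and helper names are illustrative:

```python
import hashlib

def url_hash64(url):
    """Map a URL to a 64-bit integer (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(url.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big')

visited = set()

def seen_before(url):
    """True if a URL hashing to the same value was already visited."""
    h = url_hash64(url)
    if h in visited:
        return True
    visited.add(h)
    return False
```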
Oct 22:
- nothing picked up yet
- Freezes and memory issues: debugging (William); need to catch the state just before it crashes
Oct 15:
- Some problems with the NYT/BBC searches, probably not a Unicode problem; we'll try to see if this is solved by next week.
Oct 8
- Seems like the Unicode problem has been fixed, and the 3 instances have now been running for 12 hours without crashing
- Crawl rates: CNN is 5,000 pages/hr; NYT: 3,700 pages/hr; BBC: 1,700 pages/hr
- Keep doing the test to get a sense of the rates; we have gone through 150,000 pages in the last 12 hours with no hits
Oct 1
- NYT crashed, cnn.com is stuck on transcript.cnn.com, and BBC is going at about half the speed of cnn.com. Next week: figure out an average rate per hour or day.
- Three instances are running. We'll add an exception handler to deal with normal breakdowns so that the crawl continues.
Optimizing Plan B crawler:
Priority Queuing (Oct 15):
- Search Google for all aliases and then queue those hits (see the sketch after this list).
- Problem: Google doesn't search embedded links; we would need a source-code search for that, so Alejandro is going to write to Google.
- Looking at getting backlinks from a company.
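A sketch of how the priority queries could be generated from the alias list and the monitored domains before anything is queued. This only builds the query strings; actually automating the searches has terms-of-service implications (part of why Alejandro is writing to Google), and a backlink export from a provider would feed the same queue. Names are illustrative:

```python
def priority_queries(aliases, referring_domains):
    """Build 'keyword site:domain' search queries for every alias/domain pair.
    The resulting hit URLs would be pushed to the front of the crawl queue."""
    queries = []
    for domain in referring_domains:
        for alias in aliases:
            queries.append('"%s" site:%s' % (alias, domain))
    return queries

# Example:
# priority_queries(['Gaza flotilla'], ['nytimes.com'])
# -> ['"Gaza flotilla" site:nytimes.com']
```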
multi-threading on single site:
- One process downloads and queues pages while another analyzes metadata (see the sketch after this list)
- Leave this until later; we need to iron out bugs with the crawler first.
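A minimal sketch of the split we have in mind for later, with one worker downloading/queuing and one analyzing metadata; threads rather than processes are used here just to keep the example short, and the names are illustrative:

```python
import queue
import threading

import requests

def downloader(urls, page_queue):
    """Producer: fetch pages and queue the raw HTML for analysis."""
    for url in urls:
        try:
            page_queue.put((url, requests.get(url, timeout=30).text))
        except requests.RequestException:
            pass
    page_queue.put(None)  # sentinel: no more pages

def analyzer(page_queue, analyze):
    """Consumer: pull pages off the queue and extract metadata."""
    while True:
        item = page_queue.get()
        if item is None:
            break
        url, html_text = item
        analyze(url, html_text)

def run(urls, analyze):
    page_queue = queue.Queue(maxsize=50)  # bound memory use
    t1 = threading.Thread(target=downloader, args=(urls, page_queue))
    t2 = threading.Thread(target=analyzer, args=(page_queue, analyze))
    t1.start(); t2.start()
    t1.join(); t2.join()
```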
multi-threading for multiple sites:
Oct 15
- Implemented multi-threading
- Implemented log per site
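Roughly what "multi-threading for multiple sites" plus "log per site" amounts to: one worker per site, each writing to its own log file. This sketch uses one process per site; `crawl_site` stands in for the real crawler entry point, and the names are illustrative:

```python
import logging
from multiprocessing import Process

def site_logger(domain):
    """Create a logger that writes to <domain>.log only."""
    logger = logging.getLogger(domain)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler('%s.log' % domain)
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)
    return logger

def crawl_site(domain):
    """Stand-in for the real per-site crawl loop."""
    log = site_logger(domain)
    log.info('starting crawl of %s', domain)
    # ... crawl pages, log hits and errors here ...

if __name__ == '__main__':
    sites = ['nytimes.com', 'cnn.com', 'bbc.co.uk']
    workers = [Process(target=crawl_site, args=(s,)) for s in sites]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```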
Oct 8
- William will implement newspaper code for analysis, and Yuya will test it;
- William/Yuya will also implement a log per web domain per crawl (instead of all together)
- Fixes that were made to the crawler need to be applied to the multi-threaded crawler as well; this has to be done manually - William
- Keep an eye on this: the readability change implemented over the summer makes the crawl faster, but it also leads to greater instability, which was the source of the errors we saw this week.
Notes
Oct 1:
- Question regarding regular expressions: should there be some kind of automated interface to alert the user to URL patterns or subdomains that are not bringing up hits? We'll think about this.
- Remember what is not an article - not yet assigned
- A way to discover new articles, e.g. "this week's stories" - not yet assigned
- 3rd-party search - not yet assigned
- Dependent on upgrading Python, perhaps by Sep 17th, if not Sep 24th
- Unicode issue still needs to be tested, but the fix has been incorporated.
- The upgrade is done. "Crawler straying off" issue - the crawler seems to get stuck in a video or other path: Yuya will look for ways to apply a "regular expression" to URLs (white/black list). Yuya has implemented a filter so that the user can manually insert blacklist entries, either normal strings or regexes (see the sketch after this list).
- Problem: the crawler was treating the same URL as distinct URLs - William has a solution (Sep 24th) - fixed
- Code needs to be modified so it doesn't skip something - fixed
- Question: does anyone get mad at us and block our user agent?
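A sketch of the filter Yuya describes, accepting either plain substrings or regular expressions in the blacklist; the structure and names are illustrative, not the actual implementation:

```python
import re

def is_blacklisted(url, blacklist):
    """Return True if `url` matches any blacklist entry.

    Entries are either plain strings (substring match) or regular expressions,
    e.g. ['/video/', re.compile('/TRANSCRIPTS/')].
    """
    for entry in blacklist:
        if isinstance(entry, str):
            if entry in url:
                return True
        elif re.search(entry, url):
            return True
    return False
```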
CSS Selectors
Nov 19
- UI for CSS selectors on the site, for adding selectors: go to a site under Referring Sites and click on it; CSS selectors can be added at the bottom of the page. This still needs to be hooked up to the crawler - right now it is only hooked up to date modified and author.
- Will Alejandro want to add CSS selectors to the old database sites as well? Will he use the old database?
Nov 12
- code is implemented to find CSS selectors as part of Newspaper, but we need UI on the site, so that a user can go in and plug in CSS selectors based on domain
Nov 5
- Need some examples from Vinnie of CSS selectors that need regular-expression troubleshooting
Oct 29
- CSS selector interface needs to be implemented
- generally going well, and will go over with William next week
- some problems from last week persist.
Oct 22
- Mondoweiss: need regular expressions to prevent duplicates, because the comments are distinct
- Time zone handling: apparently not difficult to maintain the TZ, but different websites handle this differently
- Also an issue with the Washington Times; Vinnie will show it to us next week.
Oct 15
- Vinnie is getting through this, and next week she'll bring a spreadsheet
- Vinnie will also see if it's possible to keep both date posted and date modified
- Yuya will add a column to the database for "date posted" in cases where this is available
Oct 8
- Vinnie will start looking at the old article database, when available (Yuya setting up), in order to start comparing column entries with timestamp/date and Author and (1) generate a list of problems (like Eldiflor's) and (2) then start to find CSS Selectors.
Old
- Vinnie & Alejandro learned how to find css selectors, Vinnie will try to do a few for Sep 24th.
- Vinnie has done most of the sites, and had trouble with two sites
- In addition to the CSS selector, we need a regex to clean the metadata, e.g. if the author field says "by Name", we need a regex that can strip out "by" (see the sketch after this list). So Vinnie will eventually need to write the regex for each site where there's an issue - not for now.
- Alejandro will send Vinnie a list of sites to add, and Yuya will run a non-infinite instance to find which sites are not working for author/date/timestamp pickup - Sep 24
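For the author clean-up, the kind of regex meant here is something like the following sketch; the exact pattern would be tuned per site by Vinnie:

```python
import re

# Strip a leading "By ", "by: ", etc. from an extracted author string.
BYLINE_PREFIX = re.compile(r'^\s*by[:\s]+', re.IGNORECASE)

def clean_author(raw_author):
    return BYLINE_PREFIX.sub('', raw_author).strip()

# clean_author("By Jane Doe")    -> "Jane Doe"
# clean_author("by: John Smith") -> "John Smith"
```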
Alias/Tags
Nov 26
- Keyword aliases
- Aliases: check longer strings first and stop if there's a match (see the sketch after this list)
- Source sites having aliases - not through the keywords - aliases can be done
- How to see when an alias/keyword is used without a link
- Editorial decisions, e.g. attributing something to Haaretz but not providing a link
- Right now this is totally separate; it can be done through a manual query
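A sketch of "do longer strings and stop if there's a match": check the longest aliases first so, for example, a match on "The New York Times" stops the search before the shorter alias "Times" gets counted as well. The names are illustrative:

```python
def match_alias(text, aliases):
    """Return the first alias found in `text`, checking longer aliases first."""
    lowered = text.lower()
    for alias in sorted(aliases, key=len, reverse=True):
        if alias.lower() in lowered:
            return alias
    return None

# match_alias("... as reported by The New York Times ...",
#             ["Times", "The New York Times"])
# -> "The New York Times"
```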
Oct 15
- Old code is somewhat in conflict and needs to be pushed with updates - for Oct 22.
- Jai has written code for scope aliases, tags, and cache; it needs to be tested - Oct 1
Interface
- Roger suggests looking at the following to update the stats: http://ironsummitmedia.github.io/startbootstrap-sb-admin-2/pages/index.html
Upgrade Newspaper/Python
- Newspaper: after upgrading Python this will run; moved to DigitalOcean -- done
- Hope to have the IP address for the MediaCAT server -- Yuya emailed.
Team
- Alejandro will email Paul about the possible additional developer -- leave it for now
Location
- When the DSU takes over the MediaCAT web app, we will look into having a UTSC subdomain.