Old Tasks and Notes
Move server
Dec 3
- Turns out that WARC'ing is going faster than we thought: approximately 800 URLs in 5 hours; 4 simultaneous processes use 1.7 GB of RAM with 2 processors.
Nov 26
- Meeting after 1 pm to go through the ideal backlink export format for priority queuing
- Priority queuing using Google searches of the form "keyword site:domain"
- Team to meet on Tuesday for diagrams and testing
- Testing over the summer suggests 1 to 1.5 instances per CPU core
- In the future we may also parallelize per site (e.g., a couple of nytimes.com instances), still at 1-1.5 instances per core
- Jai got his placement because of the work he did for UTMediaCAT!
Nov 19
- Have documentation and diagrams ready for all processes
- Test the new database and see how much RAM/CPU is required
- Get email wording for permission to crawl sites
- Set up a meeting in mid-December; we may need more than 60 GB of RAM depending on how many sites run in parallel. Swap space is slow if we are accessing all sites and all their memory. An instance crawling 3 sites currently uses about 4 GB of memory. The extra capacity will cost a few hundred dollars with IITS.
Database
Dec 3:
- Met to try to do the MySQL migration: we couldn't easily transfer the SQLite database to MySQL, so in the end it might be easiest to have Bot Discovery (Plan B) use MySQL for its working ("RAM") database and keep the results database in SQLite. This will help with testing to ensure stability. Try to do this for next week.
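A minimal sketch of what this split could look like in Django settings; the database names, paths, and credentials below are placeholders, not the project's actual configuration:

```python
# settings.py fragment (sketch) -- names, paths, and credentials are placeholders.
import os

BASE_DIR = os.path.dirname(os.path.abspath(__file__))

DATABASES = {
    # existing results database stays in SQLite
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'results.sqlite3'),
    },
    # working ("RAM") database for Bot Discovery moves to MySQL
    'bot_discovery': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mediacat_discovery',
        'USER': 'mediacat',
        'PASSWORD': 'change-me',
        'HOST': '127.0.0.1',
        'PORT': '3306',
    },
}

# Bot Discovery queries would then use .using('bot_discovery') or a database
# router, while the results models keep reading and writing SQLite.
```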
Nov 26:
- MySQL update: not yet working on the implementation
- CSS selectors: entering them through the web interface and testing them. Proposed solution: get all selectors from the database for a given site and evaluate a given page against every selector in place, rather than pasting them in each time. Create an account for Vinnie so she can log in, and test an SSH connection to the database.
- Alternate option: a Django page instead, to run and test selectors
- Source articles - Jai - in progress: now working for new results, but we now have to go and do it retroactively - make a script to go and get the source sites now
Nov 19:
- Meeting yesterday about the database selection
- Thought about how to optimize; going with MySQL, implemented through Django
- Django can have multiple databases: keep the existing SQLite database and implement a new MySQL database
- add the archiving of source articles that are found, and have this reflected in the interface - deferred
Nov 5:
- add the archiving of source articles that are found, and have this reflected in the interface
Weekly sweeps
Nov 26
- Newspaper sweeps are fine - Alejandro still to add the referring sites
Nov 19
- Daily/weekly sweeps are fine for Newspaper
- need to add all the necessary referring sites - deferred for Alejandro to add
Nov 5:
- started without a hitch
- need to add all the necessary referring sites
Oct 29:
- In a day, found 161 hits; not linear - the first hour found a lot, then fewer over the next 23
Oct 22:
- Jai/Roger will look into setting up an instance of UTMediaCAT to scan Twitter and domains that use Newspaper/RSS
Image and style sheet preservation
Oct 15
- Generally, WPull is working really well; it takes about 100 seconds to generate a WARC
- Occasionally there are strange issues, like a black background, but they don't affect the text and links
WARC
Dec 3:
- For easy viewing, we thought we would put a PDF version of the webpage as a link. However, the PhantomJS process for the PDF is taking a long time. We will wait to see if Roger has a solution; if not, perhaps we will look into adding a WARC viewer.
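A rough sketch of how the PDF step could be wrapped with a timeout so a slow render doesn't hold up the pipeline; this assumes the stock `rasterize.js` example script that ships with PhantomJS, and the timeout value is a placeholder:

```python
import subprocess

def render_pdf(url, out_path, timeout=120):
    """Render a page to PDF with PhantomJS; give up after `timeout` seconds."""
    cmd = ['phantomjs', 'rasterize.js', url, out_path, 'A4']
    try:
        subprocess.check_call(cmd, timeout=timeout)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Too slow or failed -- fall back to linking the WARC (or a viewer) instead.
        return False
```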
Nov 12
- WARC'ing is stable; we can do a maximum of 2 WARCs at a time, but this can be increased
- PDF-viewer: this is nearly ready, just need to have code merged
- there was only one bad WARC
Nov 5
- WARC'ing is not done: there was a memory crash
- WARC: dynamic number of processes based on how much memory is available
- WARCs have been checked
Oct 29
- WARC'ing can also take up a lot of memory, approximately 2 minutes for each hit; we need to implement a queue with a maximum number of simultaneous processes.
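A minimal sketch of the kind of bounded queue we mean, sizing the worker pool from available memory; the `psutil` dependency, the per-process memory estimate, and the placeholder `make_warc` command are assumptions, not the project's actual code:

```python
import hashlib
import subprocess
from multiprocessing import Pool

import psutil  # assumed dependency, used only to read available memory

PER_PROCESS_MB = 450  # rough estimate: ~1.7 GB observed for 4 processes

def make_warc(url):
    """Placeholder for the real WARC'ing step (e.g. a wpull invocation)."""
    name = hashlib.md5(url.encode('utf-8')).hexdigest()
    return subprocess.call(['wpull', url, '--warc-file', name])

def warc_all(urls, hard_cap=4):
    """WARC every URL, never running more simultaneous jobs than memory allows."""
    available_mb = psutil.virtual_memory().available // (1024 * 1024)
    workers = max(1, min(hard_cap, available_mb // PER_PROCESS_MB))
    with Pool(processes=workers) as pool:
        pool.map(make_warc, urls)
```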
Oct 8
- Roger and Jai managed to get PhantomJS/WPull to produce the WARC, but there's a problem with the speed
- Roger/Jai believe they can make it faster by figuring out which elements to ignore - here's what Roger writes: "we can manually force it to generate files in 2 mins for each url, and the results can be still good. However, as a result, it is likely that a few images will be missing (you can check the attachments to see a sample file generated by this strategy)"
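As a concrete illustration of "forcing it to generate files in about 2 minutes", something along these lines could be used; the flag names follow wpull's wget-style options but are written from memory and should be checked against `wpull --help`:

```python
import subprocess

def fast_warc(url, output_name, timeout=150):
    """Archive one URL to a WARC, letting PhantomJS run the JavaScript but
    skipping most images so each URL finishes in roughly 2 minutes."""
    cmd = [
        'wpull', url,
        '--warc-file', output_name,      # writes output_name.warc.gz
        '--phantomjs',                   # render JavaScript before capture
        '--page-requisites',             # fetch the CSS/JS the page needs
        '--reject', 'jpg,jpeg,png,gif',  # skip heavy images (a few will be missing)
    ]
    return subprocess.call(cmd, timeout=timeout)
```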
Oct 1:
- Roger figured out how to use PhantomJS & WPull to get an HTML output that runs JavaScript. This week Jai/Roger will look to see whether it's possible to go direct to WARC (preferable to exporting to WARC or converting from HTML). If so, that's what we'll use.
OLD:
- WARC vs. a Zotero-like ability to snapshot? Jai: a snapshot is possible. Kim writes about why WARC is far preferable. Problem: it's not easy to WARC a JavaScript-heavy page
- PhantomJS: produces a PDF of the webpage, but it's the "print-out" version and doesn't have selectable text (not good)
- wkhtmltopdf: renders most of the JavaScript; there is a Linux version
- A Chrome extension works, but it is very dependent on the browser; one possibility is an auto-clicker that runs through Chrome - this seems to work best, but it would need an instance of Chromium to handle it
- Selenium: another project - checking into it
- Chrome's JavaScript rendering is available, but it isn't trivial to automate. Question: does the DSU have WARC'ing functionality they can give us? Maybe talk to Anya before going to all the trouble.
- Yuya/Roger are also looking at the WARC issue
Crawler (Plan B) Test on NYT, CNN, BBC
Dec 3:
- New terminology: Plan B is now "Bot Discovery", and the Newspaper crawler is "RSS Discovery"
- Test Bot Discovery together with the WARC'ing processes
Nov 26:
- CSS selector update: made a tool for testing CSS selectors. The current solution is doing it in the browser, but browser CSS selectors might not use the same implementation as the XML one - it depends on which browser you use. A Python script should use the XML stack, the same one the crawler uses, so it tests accurately (see the sketch after this list). It needs SSH access to the server and access to the site's database.
- Display the matched keyword in context - highlight the keyword
- Costly to do for every article; it is taking too much time
- Shows the full text and highlights all matching keywords
- One crash in the past 2 weeks, but it was run again and hasn't stopped since
- Anchor text (the text of links) is halfway there
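A minimal sketch of this kind of selector tester, using lxml's `cssselect` (the same XML stack the crawler uses) rather than a browser; the table and column names for the stored selectors are assumptions:

```python
import sqlite3

import requests
from lxml import html  # lxml + cssselect: the same selector engine the crawler uses

def selectors_for_site(db_path, domain):
    """Fetch all stored CSS selectors for one site (table/column names assumed)."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT field, selector FROM css_selectors WHERE domain = ?", (domain,)
    ).fetchall()
    conn.close()
    return rows

def test_selectors(db_path, domain, page_url):
    """Evaluate every stored selector for a site against a single page."""
    doc = html.fromstring(requests.get(page_url, timeout=30).content)
    for field, selector in selectors_for_site(db_path, domain):
        matches = doc.cssselect(selector)
        preview = matches[0].text_content().strip() if matches else '(no match)'
        print('%s: %s -> %s' % (field, selector, preview[:80]))
```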
Nov 19:
- Crawler not run this week; it will run next week
- Implement storing the full text for each article as well; a naive implementation shows the text around the found keyword based on the stored text and the matched keyword (see the sketch after this list). The context is recreated from the article and not currently stored in the database; with the full text stored we can see the surrounding text. It takes 5-10 seconds to load the text in "Matched Keywords". This will also be used for comparing text when it's separated out.
- Anchor text is still required - William and Yuya are getting the link text
- Lots of hits from cn.nytimes.com
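A naive sketch of the "text around the found keyword" idea mentioned above; the function name and the 200-character window are illustrative choices:

```python
def keyword_context(full_text, keyword, window=200):
    """Return up to `window` characters of context on each side of the first
    case-insensitive occurrence of `keyword` in `full_text`."""
    index = full_text.lower().find(keyword.lower())
    if index == -1:
        return ''
    start = max(0, index - window)
    end = min(len(full_text), index + len(keyword) + window)
    return full_text[start:end]

# Example:
# keyword_context(article_text, "Gaza")  ->  "...text surrounding the match..."
```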
Nov 12:
- No crash, but 2 processes are stuck; the cause might be external
- Next Tuesday: Yuya, Jai, and William will meet to talk about database
- no other problems as of right now.
- we'll attempt to separate out the text of the article in order to make it available for comparison on re-crawls
- finding about 20 hits a day, mostly on NYT
- CSS selectors:
Nov 5:
- Database/save state: not yet working; don't go with SQLite - use MySQL or another database
- We will increase server capacity to 40 GB of storage & 4 GB of RAM
- We are targeting December to finish implementing the database and getting ready to do a baseline crawl.
Oct 29:
- Debugged the problem of not picking anything up
- Memory issue: even with memory logging at six-minute intervals, we're not able to see why memory usage goes up suddenly
- One possible solution is save state/a database: probably best to do this for a variety of reasons
- Another solution is hashing the URLs, perhaps with 64-bit hashes (see the sketch after this list)
- Another possible solution: paging, i.e., using the hard drive as extra RAM
- Consider whether we need more server capacity
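A sketch of the hashing idea: store a 64-bit hash of each visited URL instead of the full string, trading a tiny collision risk for a much smaller in-memory set. The hash choice and helper names are illustrative:

```python
import hashlib

def url_hash64(url):
    """Map a URL to a 64-bit integer (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(url.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big')

visited = set()

def seen_before(url):
    """True if a URL hashing to the same value was already visited."""
    h = url_hash64(url)
    if h in visited:
        return True
    visited.add(h)
    return False
```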
Oct 22:
- nothing picked up yet
- Freezes and memory issues: debugging (William); need to catch the state just before it crashes
Oct 15:
- Some problems with the NYT/BBC searches, probably not a Unicode problem; we'll try to see if this is solved by next week.
Oct 8
- Seems like the Unicode problem has been fixed, and the 3 instances have now been running for 12 hours without crashing
- Crawl rates: CNN is 5,000 pages/hr; NYT: 3,700 pages/hr; BBC: 1,700 pages/hr
- Keep doing the test to get a sense of the rates; we have gone through 150,000 pages in the last 12 hours with no hits
Oct 1
- NYT crashed, cnn.com is stuck on transcript.cnn.com, and BBC is going at about half the speed of cnn.com. Next week: figure out an average rate per hour or day.
- Three instances are running. We'll add an exception handler to deal with normal breakdowns so that the crawl continues.
Optimizing Plan B crawler:
Priority Queuing (Oct 15):
- Search Google for all aliases and then queue those hits (see the sketch after this list).
- Problem: Google doesn't search embedded links; we would need a source-code search for that, so Alejandro is going to write to Google.
- Looking at getting backlinks from a company.
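A sketch of how the priority queries could be generated from the alias list and the monitored domains before anything is queued. This only builds the query strings; actually automating the searches has terms-of-service implications (part of why Alejandro is writing to Google), and a backlink export from a provider would feed the same queue. Names are illustrative:

```python
def priority_queries(aliases, referring_domains):
    """Build 'keyword site:domain' search queries for every alias/domain pair.
    The resulting hit URLs would be pushed to the front of the crawl queue."""
    queries = []
    for domain in referring_domains:
        for alias in aliases:
            queries.append('"%s" site:%s' % (alias, domain))
    return queries

# Example:
# priority_queries(['Gaza flotilla'], ['nytimes.com'])
# -> ['"Gaza flotilla" site:nytimes.com']
```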
multi-threading on single site:
- One process downloads and queues pages while another analyzes metadata (see the sketch after this list)
- Leave this until later; we need to iron out bugs with the crawler first.
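A minimal sketch of the split we have in mind for later, with one worker downloading/queuing and one analyzing metadata; threads rather than processes are used here just to keep the example short, and the names are illustrative:

```python
import queue
import threading

import requests

def downloader(urls, page_queue):
    """Producer: fetch pages and queue the raw HTML for analysis."""
    for url in urls:
        try:
            page_queue.put((url, requests.get(url, timeout=30).text))
        except requests.RequestException:
            pass
    page_queue.put(None)  # sentinel: no more pages

def analyzer(page_queue, analyze):
    """Consumer: pull pages off the queue and extract metadata."""
    while True:
        item = page_queue.get()
        if item is None:
            break
        url, html_text = item
        analyze(url, html_text)

def run(urls, analyze):
    page_queue = queue.Queue(maxsize=50)  # bound memory use
    t1 = threading.Thread(target=downloader, args=(urls, page_queue))
    t2 = threading.Thread(target=analyzer, args=(page_queue, analyze))
    t1.start(); t2.start()
    t1.join(); t2.join()
```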
multi-threading for multiple sites:
Oct 15
- Implemented multi-threading
- Implemented log per site
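Roughly what "multi-threading for multiple sites" plus "log per site" amounts to: one worker per site, each writing to its own log file. This sketch uses one process per site; `crawl_site` stands in for the real crawler entry point, and the names are illustrative:

```python
import logging
from multiprocessing import Process

def site_logger(domain):
    """Create a logger that writes to <domain>.log only."""
    logger = logging.getLogger(domain)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler('%s.log' % domain)
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)
    return logger

def crawl_site(domain):
    """Stand-in for the real per-site crawl loop."""
    log = site_logger(domain)
    log.info('starting crawl of %s', domain)
    # ... crawl pages, log hits and errors here ...

if __name__ == '__main__':
    sites = ['nytimes.com', 'cnn.com', 'bbc.co.uk']
    workers = [Process(target=crawl_site, args=(s,)) for s in sites]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```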
Oct 8
- William will implement newspaper code for analysis, and Yuya will test it;
- William/Yuya will also implement a log per web domain per crawl (instead of all together)
- Fixes that were made to the crawler need to be applied to the multi-threaded crawler as well; this has to be done manually - William
- Keep an eye on this: the readability change implemented over the summer makes the crawl faster, but it also leads to greater instability, which was the source of the errors we saw this week.
Notes
Oct 1:
- Question regarding regular expressions: should there be some kind of automated interface to alert the user to URL patterns or subdomains that are not bringing up hits? We'll think about this.
- Remember what is not an article - not yet assigned
- A way to discover new articles, e.g. "this week's stories" - not yet assigned
- 3rd-party search - not yet assigned
- Dependent on upgrading Python, perhaps by Sep 17th, if not Sep 24th
- Unicode issue still needs to be tested, but the fix has been incorporated.
- The upgrade is done. "Crawler straying off" issue - the crawler seems to get stuck in a video or other path: Yuya will look for ways to apply a "regular expression" to URLs (white/black list). Yuya has implemented a filter so that the user can manually insert blacklist entries, either normal strings or regexes (see the sketch after this list).
- Problem: the crawler was treating the same URL as distinct URLs - William has a solution (Sep 24th) - fixed
- Code needs to be modified so it doesn't skip something - fixed
- Question: does anyone get mad at us and block our user agent?
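A sketch of the filter Yuya describes, accepting either plain substrings or regular expressions in the blacklist; the structure and names are illustrative, not the actual implementation:

```python
import re

def is_blacklisted(url, blacklist):
    """Return True if `url` matches any blacklist entry.

    Entries are either plain strings (substring match) or regular expressions,
    e.g. ['/video/', re.compile('/TRANSCRIPTS/')].
    """
    for entry in blacklist:
        if isinstance(entry, str):
            if entry in url:
                return True
        elif re.search(entry, url):
            return True
    return False
```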
CSS Selectors
Nov 19
- UI for CSS selectors on the site, for adding selectors: go to a site under Referring Sites and click on it; CSS selectors can be added at the bottom of the page. This still needs to be hooked up to the crawler - right now it is only hooked up to date modified and author.
- Will Alejandro want to add CSS selectors to the old database sites as well? Will he use the old database?
Nov 12
- code is implemented to find CSS selectors as part of Newspaper, but we need UI on the site, so that a user can go in and plug in CSS selectors based on domain
Nov 5
- Need some examples from Vinnie of CSS selectors that need regular-expression troubleshooting
Oct 29
- CSS selector interface needs to be implemented
- generally going well, and will go over with William next week
- some problems from last week persist.
Oct 22
- Mondoweiss: need regular expressions to prevent duplicates, because the comments are distinct
- Time zone handling: apparently not difficult to maintain the TZ, but different websites handle this differently
- Also an issue with the Washington Times; Vinnie will show it to us next week.
Oct 15
- Vinnie is getting through this, and next week she'll bring a spreadsheet
- Vinnie will also see if it's possible to keep both date posted and date modified
- Yuya will add a column to the database for "date posted" in cases where this is available
Oct 8
- Vinnie will start looking at the old article database, when available (Yuya setting up), in order to start comparing column entries with timestamp/date and Author and (1) generate a list of problems (like Eldiflor's) and (2) then start to find CSS Selectors.
Old
- Vinnie & Alejandro learned how to find css selectors, Vinnie will try to do a few for Sep 24th.
- Vinnie has done most of the sites, and had trouble with two sites
- In addition to the CSS selector, we need a regex to clean the metadata, e.g. if the author field says "by Name", we need a regex that can strip out "by" (see the sketch after this list). So Vinnie will eventually need to write the regex for each site where there's an issue - not for now.
- Alejandro will send Vinnie a list of sites to add, and Yuya will run a non-infinite instance to find which sites are not working for author/date/timestamp pickup - Sep 24
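For the author clean-up, the kind of regex meant here is something like the following sketch; the exact pattern would be tuned per site by Vinnie:

```python
import re

# Strip a leading "By ", "by: ", etc. from an extracted author string.
BYLINE_PREFIX = re.compile(r'^\s*by[:\s]+', re.IGNORECASE)

def clean_author(raw_author):
    return BYLINE_PREFIX.sub('', raw_author).strip()

# clean_author("By Jane Doe")    -> "Jane Doe"
# clean_author("by: John Smith") -> "John Smith"
```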
Alias/Tags
Nov 26
- Keyword aliases
- Aliases: check longer strings first and stop if there's a match (see the sketch after this list)
- Source sites having aliases - not through the keywords - aliases can be done
- How to see when an alias/keyword is used without a link
- Editorial decisions, e.g. attributing something to Haaretz but not providing a link
- Right now this is totally separate; it can be done through a manual query
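A sketch of "do longer strings and stop if there's a match": check the longest aliases first so, for example, a match on "The New York Times" stops the search before the shorter alias "Times" gets counted as well. The names are illustrative:

```python
def match_alias(text, aliases):
    """Return the first alias found in `text`, checking longer aliases first."""
    lowered = text.lower()
    for alias in sorted(aliases, key=len, reverse=True):
        if alias.lower() in lowered:
            return alias
    return None

# match_alias("... as reported by The New York Times ...",
#             ["Times", "The New York Times"])
# -> "The New York Times"
```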
Oct 15
- Old code is somewhat in conflict and needs to be pushed with updates - for Oct 22.
- Jai has written code for scope aliases, tags, and cache; it needs to be tested - Oct 1
Interface
- Roger suggests looking at the following to update the stats: http://ironsummitmedia.github.io/startbootstrap-sb-admin-2/pages/index.html
Upgrade Newspaper/Python
- Newspaper: after upgrading Python this will run; moved to DigitalOcean -- done
- Hope to have the IP address for the MediaCAT server -- Yuya emailed.
Team
- Alejandro will email Paul about the possible additional developer -- leave it for now
Location
- When the DSU takes over the MediaCAT web app, we will look into having a UTSC subdomain.