March 30, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- pay top up
- Israeli news sites: finish setting up directory and start crawl
- keep working on postprocessor
Crawler Issues
- still playing with the numbers
- jewishjournal is returning errors: not 503s but 502s (Bad Gateway, i.e. an invalid response from an upstream server)
- try changing the user agent:
- it is part of the request header and identifies where the call is being made from
- can also make a curl request or a standalone puppeteer request to isolate the problem
- https://pptr.dev/api/puppeteer.browser.useragent
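One way to isolate the problem outside the crawler is a standalone request with an explicit user agent, comparing the status code across a few values. The sketch below is a rough Python stand-in for the curl check, not the crawler's own code; the URL and the user-agent strings are placeholders, and `build_request` and `probe` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url, user_agent, timeout=10.0):
    """Return the HTTP status code for url, including error codes such as 502."""
    req = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is what we want to compare.
        return err.code


if __name__ == "__main__":
    # Placeholder values: swap in the failing page and the crawler's real user agent.
    target = "https://example.com/"
    agents = [
        "Mozilla/5.0 (X11; Linux x86_64)",  # browser-like placeholder
        "python-urllib-test",               # obviously non-browser placeholder
    ]
    for agent in agents:
        print(agent, "->", probe(target, agent))
```

If the status code changes with the user agent, the site is probably filtering on that header; if every variant returns 502, the cause is more likely on the server side or in the network path.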
Twitter postprocessing
- managed to process a slightly larger csv
- processing completes but produces no output (possibly not enough tweets)
- earlier error: unclear whether the file size or the combining function was the problem
- new error with the URL expander: it never finishes expanding
- get error of malformed node or string
- wrap the function in a try-catch, log the ID of each failing item, and collect the failures in a separate set
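The try-catch idea could look roughly like the sketch below. It assumes the expander can be called one URL at a time and that each row carries a tweet ID; `expand_all` and the `expand_url` argument are hypothetical names, not the postprocessor's actual functions. As a side note, "malformed node or string" is the message Python's `ast.literal_eval` raises as a `ValueError`; whether that is the source of the error here is a guess.

```python
import logging

logger = logging.getLogger("url_expander")


def expand_all(rows, expand_url):
    """Expand every URL, keeping failures separate instead of stopping the run.

    rows: iterable of (tweet_id, url) pairs.
    expand_url: callable that takes a URL and returns its expanded form
                (hypothetical stand-in for the real expander).
    Returns (expanded, failed), both dicts keyed by tweet ID.
    """
    expanded = {}
    failed = {}
    for tweet_id, url in rows:
        try:
            expanded[tweet_id] = expand_url(url)
        except Exception as err:
            # Log the ID so the offending tweet can be inspected later.
            logger.warning("expansion failed for id=%s url=%r: %s", tweet_id, url, err)
            failed[tweet_id] = url
    return expanded, failed
```

Catching the broad `Exception` is a deliberate choice for troubleshooting: the aim is to find out which IDs fail and why, and the catch can be narrowed once the cause is known. The `failed` dict can then be written out and rerun separately.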
Israeli news sites crawl
- currently running
- initial plan was to copy the code from github and start crawling
- that gave errors
- instead copied an existing crawler already on the server and am using that
Action Items
- finish troubleshooting the URL expander: wrap the function in a try-catch, log the ID of each failing item, and collect the failures in a separate set
- crawler: make a curl request or a standalone puppeteer request to isolate the problem
- keep both crawls going
For new work studies:
- figure out why the last line of the output file in the Twitter crawl is being cut off
- issue of email notification when the crawler breaks