Google Scholar Data Collection

Dependencies

MongoDB

The scripts interface with a MongoDB database, so a local or hosted MongoDB instance is required. By default, the script assumes a localhost instance with default settings. For a custom configuration or a remote server, set the environment variable MONGOURL to the URI of the MongoDB database.

export MONGOURL='[url..]'

The URI format is: mongodb://username:password@host:port/database

For example

export MONGOURL='mongodb://user123:pass456@db.example.com:27017/scholarly'

A "source" collection is required, scholars. Optionally, create several scholars collections (to keep different groups separate, for example), following the pattern: scholars_[suffix] Then, point the script to the collection by setting the environmental variable collection_suffix Note: output collections are consolidated into one by default, even if you have multiple source collections. This prevents any double-scraping if content of collections might overlap. If desired to have separate output collections, add the collection_suffix argument to the relevant connect_mongo() functions in scrape_control.R

R Packages

renv is used for package management. On first run, renv will bootstrap itself and prompt you to install the required packages with renv::restore(). NOTE: if using conda, install the packages from conda instead; some packages will not install properly from within R.
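
From the project root, the first run typically looks like this:

renv::restore()   # install the packages recorded in renv.lock into the project library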

Operation

Basic/Suggested usage:

Complete the one-time configuration (link needed) to define an alias or macro command such as scrape

To run the scraper, open a terminal (Mac/Linux) or command prompt (Windows) and enter scrape to begin.

Advanced/Configurations:

Scraping of the Google Scholar profiles is handled by these scripts:

  • scrape_control.R : The top-level control script run by the user. Defines MongoDB connectivity and manages the scrape job
  • scrape_requirements.R : Loads packages and custom functions
  • /R/ : Collection of functions that facilitate various aspects of scraping. Refer to the individual script files for explanations

The scraper first collects the Google Scholar profile information, which includes basics such as name, affiliation, citations, and fields of study; this is inserted into the google_profiles collection. The profile page also exposes a list of publications. This list seeds the get_publication_details() function, which collects publication-level information from individual publication pages that is not exposed on the profile page; this information is inserted into the publications collection. The process then repeats for the next scholar.
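
Roughly, the per-scholar loop can be pictured as the sketch below; the helper and collection-handle names other than get_publication_details() are illustrative placeholders, not the project's exact internals:

# Illustrative sketch only (placeholder helpers), mirroring the flow described above
for (i in seq_len(nrow(scrape_job))) {
  scholar <- scrape_job[i, ]

  # Profile page: name, affiliation, citation metrics, fields of study
  profile <- fetch_profile(scholar$gsid)    # placeholder helper
  google_profiles$insert(profile)           # mongolite-style collection handle

  # The profile page also lists publications; that list seeds the
  # publication-level scrape of each individual publication page
  pubs <- get_publication_details(profile$publications)
  publications$insert(pubs)
}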

Use scrape_control.R to define the parameters of the job (a sketch follows the list below):

  • (optional) Specify custom collection suffix to scrape into
    • Default: no suffix
  • (optional) Specify inclusion/exclusion criteria to build scrape job
    • Default: Scrape all scholars who have a GSID but do not have a document in the google_profiles collection
  • (optional) Specify log / temp file directory as argument to scrape_looper(X, log_tmp_dir = [path])
    • Default: [project]/mongo_and_scrape/log
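
As a hedged illustration of the default criterion, the scrape job could be assembled directly with mongolite; the collection and database names follow the conventions earlier on this page, and scrape_looper() (the project's own function) is shown commented out:

library(mongolite)

url      <- Sys.getenv("MONGOURL", "mongodb://localhost")   # localhost is the documented default
scholars <- mongo("scholars",        db = "scholarly", url = url)  # "scholarly" db from the example URI above
profiles <- mongo("google_profiles", db = "scholarly", url = url)

# Scholars that have a GSID ...
candidates <- scholars$find('{"gsid": {"$exists": true, "$ne": ""}}',
                            fields = '{"_id": 1, "gsid": 1}')
# ... but no document in google_profiles yet
done       <- profiles$find('{}', fields = '{"gsid": 1, "_id": 0}')
scrape_job <- candidates[!candidates$gsid %in% done$gsid, ]

# scrape_looper(scrape_job, log_tmp_dir = "mongo_and_scrape/log")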

Running Scraper on Linux Server

This command starts scraping and ensures the process will not terminate with your session.

nohup Rscript Scholarly/scrape_control.R \
  </dev/null \
  >Scholarly/log/`date '+%m-%d-%y'`_consoleScrape.out \
  2>Scholarly/log/`date '+%m-%d-%y'`_ScrapeErrorMessages.err & disown

Script Status and Termination

Assumes proper configuration of .bash_profile (see the one-time configuration referenced above).

Check for active scrape jobs with:

status

To terminate a scrape job, note the PID from the status output above, then:

kill [PID#]

To peek at the tail of the message log and the error log respectively, use the checklog and checkerror commands.

Logging

Check the current progress of an ongoing or completed scrape by referring to the log file.

  • Scholarly/log/scraper_log_[date].log is the main log file, which records timestamped status updates from the script; it can be opened mid-scrape to check status
  • The nohup logs (.out and .err) are dumps of the console output and errors/warnings respectively, and are recommended for debugging as needed

Interrupts and Recovery

An error, a loss of power, being blocked by Google, etc., may cause a scrape job to finish incompletely (for example, only 50 of a scholar's 85 publications are scraped).

When the script is started again, it will continue where it was interrupted; no intervention should be required.

Adding more scholars to the main collection

If more scholars are added, or scholar documents that lacked a GSID are updated with one, simply run the script again to scrape them. By default, scholars who have already been scraped into the given year's collection will not be re-scraped.

Manually scraping a scholar profile or their publications

You can scrape from an XLSX file containing Google Scholar profile URLs using the Google Scholar Profile Collector app. From an R script: to scrape only one scholar, scholar_scrape() can be called directly.

The scrape can omit either the profile or the publications by changing the corresponding argument from its default: scholar_scrape(scholar_query, fetch_profile=TRUE, fetch_publications=TRUE)

The scholar_query argument expects a data frame or named list that contains the following (see the sketch below):

  • scholar_query$_id
  • scholar_query$gsid
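
A minimal sketch, assuming the project's functions have been loaded (e.g., via scrape_requirements.R) and using placeholder identifiers:

source("Scholarly/scrape_requirements.R")   # load packages and the scraping functions

# Placeholder values; by default the _id is the gsid itself (see the data dictionary below)
scholar_query <- list("_id" = "AbCdEfGhIjKl", gsid = "AbCdEfGhIjKl")

# Scrape the profile only, skipping the publications
scholar_scrape(scholar_query, fetch_profile = TRUE, fetch_publications = FALSE)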

googleProfile_ Data Dictionary

This collection is scraped from the Google Scholar "Profile Page." Example profile page

| Field name | Type | Prevalence | Source | Description |
| --- | --- | --- | --- | --- |
| _id | string | 100% | File | Internal identifier. Default is to use the gsid |
| gsid | string | 100% | File* | 12-character identifier generated by Google. Typically sourced from file by coders. This field is updated programmatically in the case of redirection (assignment of a new GSID by Google) |
| gsid_updated | boolean | 0-1% | Scrape | Indicates a redirection event and new gsid |
| queried_gsid | string | 0-1% | Scrape | In the case of a redirection event, the gsid listed in the import file is recorded here |
| affiliation | string | 100% | Scrape | University name / "Unknown" |
| coauthors | array | 50-60% | Scrape | List of associated coauthors. Scraped from the "coauthors" panel on the side of the profile. Only lists linked coauthors with Google profiles (not publication author bylines) |
| coauthors.coauth_gsid | string | ┣100% | Scrape | Coauthor GSID |
| coauthors.coauth_name | string | ┣100% | Scrape | Coauthor full name |
| coauthors.coauth_institution | string | ┗100% | Scrape | Coauthor university name / "Unknown" |
| fields | array | 90-100% | Scrape | Plain-text descriptor of the scholar's field(s) of study |
| h_index | integer | 100% | Scrape | As reported by Google |
| homepage | url | 50-60% | Scrape | Author profile/faculty page/website provided by the scholar |
| i10_index | integer | 100% | Scrape | As reported by Google |
| n_publications | integer | 100% | Calculated | When the profile is requested, the number of publication links on the page is tallied and recorded here |
| name | string | 100% | Scrape | Scholar full name |
| specs | string | 100% | Scrape | Verification badge |
| total_cites | integer | 100% | Scrape | As reported by Google |
| last_updated | date | 100% | Timestamp | Timestamp of the initial profile scrape event |
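
As a reading aid, a hedged mongolite sketch that pulls a few of these fields for one scholar (collection and database names follow the conventions above; the GSID is a placeholder):

library(mongolite)

profiles <- mongo("google_profiles",
                  db  = "scholarly",   # database name from the example URI above
                  url = Sys.getenv("MONGOURL", "mongodb://localhost"))

# Citation metrics and fields of study for a single (placeholder) GSID
profiles$find('{"gsid": "AbCdEfGhIjKl"}',
              fields = '{"name": 1, "affiliation": 1, "h_index": 1,
                         "i10_index": 1, "total_cites": 1, "fields": 1}')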

publications_ Data Dictionary

This collection is scraped from publication pages. Example publication page.

| Field name | Type | Prevalence | Source | Description |
| --- | --- | --- | --- | --- |
| _id | ObjectId | 100% | Generated | Internal arbitrary BSON identifier (random) |
| gsid | string | 100% | googleProfiles_ | 🔑 KEY variable |
| application_number | string | 1-2% | Scrape | (PATENT) Indicates a patent application number |
| authors | string | 99% | Scrape | Plain-text, comma-delimited list of publication authors |
| blank | boolean | 0-0.1% | Scrape | Indicates a truncated publication with no additional information beyond the profile-page stem (link goes to a "blank" page, but is not a 404) |
| book | string | 4-6% | Scrape | Name of the book in which the article/essay/etc. appeared |
| cid | string | 99-100% | Scrape | Unique identifier used by Google to link a publication to "citing publications" (can be used to find publications that cited this publication) |
| cites | integer | 100% | Scrape | Total number of citations (from the PROFILE page "Cited by" column) |
| conference | string | ~10% | Scrape | Name of the conference, for conference papers |
| description | string | 85-90% | Scrape | Paragraph; may be an abstract, description, synopsis, etc., depending on publication type |
| institution | string | 0-1% | Scrape | Name of a department/university/institution |
| inventors | string | 1-2% | Scrape | (PATENT) Equivalent to authors |
| Issue | string | 50% | Scrape | Issue number of a journal |
| journal | string | 60-70% | Scrape | Name of a journal |
| last_updated | date | 100% | Timestamp | Timestamp of the scrape event |
| number | string | 0-0.5% | Scrape | Various combinations of edition, volume, and page ranges |
| pages | string | 70-75% | Scrape | Page range of journal/book |
| patent number | string | 1-2% | Scrape | (PATENT) Patent number |
| patent office | string | 1-2% | Scrape | (PATENT) Issuing country code (99% 'US') |
| pub_cites | array | 50-60% | Scrape | Transcription of the citation graph at the bottom of the publication page |
| pub_cites.year | integer | ┣100% | Scrape | 4-digit year from the X axis of the graph |
| pub_cites.cites | integer | ┗100% | Scrape | Integer number of citations for that year |
| pubid | string | 100% | Scrape | Google-generated; unique only when combined with gsid |
| publication date | string | 90-95% | Scrape | May be identical to the "year" field or may additionally contain a month and/or day in various formats |
| publishedln | string | 0-1% | Scrape | Alternative field for name of journal/etc. |
| publisher | string | 60-75% | Scrape | Name of the publishing company that publishes the journal/book |
| report number | string | 0-0.1% | Scrape | Random/junk field? |
| source | string | 5% | Scrape | Alternative field for name of journal/etc. |
| title | string | 100% | Scrape | Title of publication (hyperlinked text at top of the Google profile page) |
| volume | string | 60-65% | Scrape | Volume number of journal/book |
| year | integer | 100% | Scrape | Publication year (from the PROFILE page "Year" column) |
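
Since gsid is the key linking this collection back to googleProfiles_, a hedged sketch of joining the two in R (same naming assumptions as above), comparing the number of scraped publications against the profile's n_publications tally:

library(mongolite)

url  <- Sys.getenv("MONGOURL", "mongodb://localhost")
pubs <- mongo("publications",    db = "scholarly", url = url)
prof <- mongo("google_profiles", db = "scholarly", url = url)

pub_rows   <- pubs$find('{}', fields = '{"gsid": 1, "pubid": 1, "_id": 0}')
pub_counts <- aggregate(pubid ~ gsid, data = pub_rows, FUN = length)   # publications scraped per scholar
profile_df <- prof$find('{}', fields = '{"gsid": 1, "name": 1, "n_publications": 1, "_id": 0}')

merge(profile_df, pub_counts, by = "gsid", all.x = TRUE)   # NA pubid count = no publications scraped yet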