Google Scholar Data Collection

Dependencies

MongoDB

The scripts interface with a MongoDB database, so a local or hosted MongoDB instance is required. By default, the script assumes a localhost instance with default settings. For a custom configuration or a remote server, set the environment variable MONGOURL to the URI of the MongoDB database.

export MONGOURL='[url..]'

The URI format is: mongodb://username:password@host:port/database

For example

export MONGOURL='mongodb://user123:pass456@db.example.com:27017/scholarly'

A "source" collection is required, scholars. Optionally, create several scholars collections (to keep different groups separate, for example), following the pattern: scholars_[suffix] Then, point the script to the collection by setting the environmental variable collection_suffix Note: output collections are consolidated into one by default, even if you have multiple source collections. This prevents any double-scraping if content of collections might overlap. If desired to have separate output collections, add the collection_suffix argument to the relevant connect_mongo() functions in scrape_control.R

R Packages

renv is used for package management. On first run, renv will bootstrap itself and prompt you to install the required packages with renv::restore(). NOTE: if using conda, install the packages from conda instead; some packages will not install properly from within R.
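
From the project root, the first run typically looks like this:

renv::restore()   # install the packages recorded in renv.lock into the project library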

Operation

Basic/Suggested usage:

Complete the one-time configuration (link needed) to define an alias or macro command such as scrape

To run the scraper, open a terminal (Mac/Linux) or command prompt (Windows) and enter scrape to begin.

Advanced/Configurations:

Scraping of the Google Scholar profiles is handled by these scripts:

  • scrape_control.R : The top-level control script run by the user. Defines MongoDB connectivity and manages the scrape job
  • scrape_requirements.R : Loads packages and custom functions
  • /R/ : Collection of functions that facilitate various aspects of scraping. Refer to the individual script files for explanations

The scraper first collects the Google Scholar profile information, which includes basics such as name, affiliation, citations, and fields of study; this is inserted into the google_profiles collection. The profile page also exposes a list of publications. This list seeds the get_publication_details() function, which collects publication-level information from individual publication pages that is not exposed on the profile page; this information is inserted into the publications collection. The process then repeats for the next scholar.
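
Roughly, the per-scholar loop can be pictured as the sketch below; the helper and collection-handle names other than get_publication_details() are illustrative placeholders, not the project's exact internals:

# Illustrative sketch only (placeholder helpers), mirroring the flow described above
for (i in seq_len(nrow(scrape_job))) {
  scholar <- scrape_job[i, ]

  # Profile page: name, affiliation, citation metrics, fields of study
  profile <- fetch_profile(scholar$gsid)    # placeholder helper
  google_profiles$insert(profile)           # mongolite-style collection handle

  # The profile page also lists publications; that list seeds the
  # publication-level scrape of each individual publication page
  pubs <- get_publication_details(profile$publications)
  publications$insert(pubs)
}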

Use scrape_control.R to define the parameters of the job (a sketch follows the list below):

  • (optional) Specify custom collection suffix to scrape into
    • Default: no suffix
  • (optional) Specify inclusion/exclusion criteria to build scrape job
    • Default: Scrape all scholars who have a GSID but do not have a document in the google_profiles collection
  • (optional) Specify log / temp file directory as argument to scrape_looper(X, log_tmp_dir = [path])
    • Default: [project]/mongo_and_scrape/log
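
As a hedged illustration of the default criterion, the scrape job could be assembled directly with mongolite; the collection and database names follow the conventions earlier on this page, and scrape_looper() (the project's own function) is shown commented out:

library(mongolite)

url      <- Sys.getenv("MONGOURL", "mongodb://localhost")   # localhost is the documented default
scholars <- mongo("scholars",        db = "scholarly", url = url)  # "scholarly" db from the example URI above
profiles <- mongo("google_profiles", db = "scholarly", url = url)

# Scholars that have a GSID ...
candidates <- scholars$find('{"gsid": {"$exists": true, "$ne": ""}}',
                            fields = '{"_id": 1, "gsid": 1}')
# ... but no document in google_profiles yet
done       <- profiles$find('{}', fields = '{"gsid": 1, "_id": 0}')
scrape_job <- candidates[!candidates$gsid %in% done$gsid, ]

# scrape_looper(scrape_job, log_tmp_dir = "mongo_and_scrape/log")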

Running Scraper on Linux Server

This command starts scraping and ensures the process will not terminate with your session.

nohup Rscript Scholarly/scrape_control.R \
  </dev/null \
  >Scholarly/log/`date '+%m-%d-%y'`_consoleScrape.out \
  2>Scholarly/log/`date '+%m-%d-%y'`_ScrapeErrorMessages.err & disown

Script Status and Termination

Assumes proper configuration of .bash_profile (see the one-time configuration referenced above).

Check for active scrape jobs with:

status

To terminate a scrape job, note the PID from the status output above, then:

kill [PID#]

To peek at the tail of the message log and the error log respectively, use the checklog and checkerror commands.

Logging

Check the current progress of an ongoing or completed scrape by referring to the log file.

  • Scholarly/log/scraper_log_[date].log is the main log file, which records timestamped status updates from the script; it can be opened mid-scrape to check status
  • The nohup logs (.out and .err) are dumps of the console output and errors/warnings respectively, and are recommended for debugging as needed

Interrupts and Recovery

An error, a loss of power, being blocked by Google, etc., may cause a scrape job to finish incompletely (for example, only 50 of a scholar's 85 publications are scraped).

When the script is started again, it will continue where it was interrupted; no intervention should be required.

Adding more scholars to the main collection

If more scholars are added, or scholar documents that lacked a GSID are updated with one, simply run the script again to scrape them. By default, scholars who have already been scraped into the given year's collection will not be re-scraped.

Manually scraping a scholar profile or their publications

You can scrape from an XLSX file containing Google Scholar profile URLs using the Google Scholar Profile Collector app. From an R script: to scrape only one scholar, scholar_scrape() can be called directly.

The scrape can omit either the profile or the publications by changing the corresponding argument from its default: scholar_scrape(scholar_query, fetch_profile=TRUE, fetch_publications=TRUE)

The scholar_query argument expects a data frame or named list that contains the following (see the sketch below):

  • scholar_query$_id
  • scholar_query$gsid
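
A minimal sketch, assuming the project's functions have been loaded (e.g., via scrape_requirements.R) and using placeholder identifiers:

source("Scholarly/scrape_requirements.R")   # load packages and the scraping functions

# Placeholder values; by default the _id is the gsid itself (see the data dictionary below)
scholar_query <- list("_id" = "AbCdEfGhIjKl", gsid = "AbCdEfGhIjKl")

# Scrape the profile only, skipping the publications
scholar_scrape(scholar_query, fetch_profile = TRUE, fetch_publications = FALSE)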

googleProfile_ Data Dictionary

This collection is scraped from the Google Scholar "Profile Page." Example profile page

| Field name | Type | Prevalence | Source | Description |
| --- | --- | --- | --- | --- |
| _id | string | 100% | File | Internal identifier. Default is to use the gsid |
| gsid | string | 100% | File* | 12-character identifier generated by Google. Typically sourced from file by coders. This field is updated programmatically in the case of redirection (assignment of a new GSID by Google) |
| gsid_updated | boolean | 0-1% | Scrape | Indicates a redirection event and new gsid |
| queried_gsid | string | 0-1% | Scrape | In the case of a redirection event, the gsid listed in the import file is recorded here |
| affiliation | string | 100% | Scrape | University name / "Unknown" |
| coauthors | array | 50-60% | Scrape | List of associated coauthors. Scraped from the "coauthors" panel on the side of the profile. Only lists linked coauthors with Google profiles (not publication author bylines) |
| coauthors.coauth_gsid | string | ┣100% | Scrape | Coauthor GSID |
| coauthors.coauth_name | string | ┣100% | Scrape | Coauthor full name |
| coauthors.coauth_institution | string | ┗100% | Scrape | Coauthor university name / "Unknown" |
| fields | array | 90-100% | Scrape | Plain-text descriptor of the scholar's field(s) of study |
| h_index | integer | 100% | Scrape | As reported by Google |
| homepage | url | 50-60% | Scrape | Author profile/faculty page/website provided by the scholar |
| i10_index | integer | 100% | Scrape | As reported by Google |
| n_publications | integer | 100% | Calculated | When the profile is requested, the number of publication links on the page is tallied and recorded here |
| name | string | 100% | Scrape | Scholar full name |
| specs | string | 100% | Scrape | Verification badge |
| total_cites | integer | 100% | Scrape | As reported by Google |
| last_updated | date | 100% | Timestamp | Timestamp of the initial profile scrape event |
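
As a reading aid, a hedged mongolite sketch that pulls a few of these fields for one scholar (collection and database names follow the conventions above; the GSID is a placeholder):

library(mongolite)

profiles <- mongo("google_profiles",
                  db  = "scholarly",   # database name from the example URI above
                  url = Sys.getenv("MONGOURL", "mongodb://localhost"))

# Citation metrics and fields of study for a single (placeholder) GSID
profiles$find('{"gsid": "AbCdEfGhIjKl"}',
              fields = '{"name": 1, "affiliation": 1, "h_index": 1,
                         "i10_index": 1, "total_cites": 1, "fields": 1}')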

publications_ Data Dictionary

This collection is scraped from publication pages. Example publication page.

| Field name | Type | Prevalence | Source | Description |
| --- | --- | --- | --- | --- |
| _id | ObjectId | 100% | Generated | Internal arbitrary BSON identifier (random) |
| gsid | string | 100% | googleProfiles_ | 🔑 KEY variable |
| application_number | string | 1-2% | Scrape | (PATENT) Indicates a patent application number |
| authors | string | 99% | Scrape | Plain-text, comma-delimited list of publication authors |
| blank | boolean | 0-0.1% | Scrape | Indicates a truncated publication with no additional information beyond the profile-page stem (link goes to a "blank" page, but is not a 404) |
| book | string | 4-6% | Scrape | Name of the book in which the article/essay/etc. appeared |
| cid | string | 99-100% | Scrape | Unique identifier used by Google to link a publication to "citing publications" (can be used to find publications that cited this publication) |
| cites | integer | 100% | Scrape | Total number of citations (from the PROFILE page "Cited by" column) |
| conference | string | ~10% | Scrape | Name of the conference, for conference papers |
| description | string | 85-90% | Scrape | Paragraph; may be an abstract, description, synopsis, etc., depending on publication type |
| institution | string | 0-1% | Scrape | Name of a department/university/institution |
| inventors | string | 1-2% | Scrape | (PATENT) Equivalent to authors |
| Issue | string | 50% | Scrape | Issue number of a journal |
| journal | string | 60-70% | Scrape | Name of a journal |
| last_updated | date | 100% | Timestamp | Timestamp of the scrape event |
| number | string | 0-0.5% | Scrape | Various combinations of edition, volume, and page ranges |
| pages | string | 70-75% | Scrape | Page range of journal/book |
| patent number | string | 1-2% | Scrape | (PATENT) Patent number |
| patent office | string | 1-2% | Scrape | (PATENT) Issuing country code (99% 'US') |
| pub_cites | array | 50-60% | Scrape | Transcription of the citation graph at the bottom of the publication page |
| pub_cites.year | integer | ┣100% | Scrape | 4-digit year from the X axis of the graph |
| pub_cites.cites | integer | ┗100% | Scrape | Integer number of citations for that year |
| pubid | string | 100% | Scrape | Google-generated; unique only when combined with gsid |
| publication date | string | 90-95% | Scrape | May be identical to the "year" field or may additionally contain a month and/or day in various formats |
| publishedln | string | 0-1% | Scrape | Alternative field for name of journal/etc. |
| publisher | string | 60-75% | Scrape | Name of the publishing company that publishes the journal/book |
| report number | string | 0-0.1% | Scrape | Random/junk field? |
| source | string | 5% | Scrape | Alternative field for name of journal/etc. |
| title | string | 100% | Scrape | Title of publication (hyperlinked text at top of the Google profile page) |
| volume | string | 60-65% | Scrape | Volume number of journal/book |
| year | integer | 100% | Scrape | Publication year (from the PROFILE page "Year" column) |
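
Since gsid is the key linking this collection back to googleProfiles_, a hedged sketch of joining the two in R (same naming assumptions as above), comparing the number of scraped publications against the profile's n_publications tally:

library(mongolite)

url  <- Sys.getenv("MONGOURL", "mongodb://localhost")
pubs <- mongo("publications",    db = "scholarly", url = url)
prof <- mongo("google_profiles", db = "scholarly", url = url)

pub_rows   <- pubs$find('{}', fields = '{"gsid": 1, "pubid": 1, "_id": 0}')
pub_counts <- aggregate(pubid ~ gsid, data = pub_rows, FUN = length)   # publications scraped per scholar
profile_df <- prof$find('{}', fields = '{"gsid": 1, "name": 1, "n_publications": 1, "_id": 0}')

merge(profile_df, pub_counts, by = "gsid", all.x = TRUE)   # NA pubid count = no publications scraped yet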