# Google Scholar Data Collection
Scripts are configured to interface with a MongoDB database, so a local or hosted MongoDB instance is required. The script assumes a localhost instance with default settings. For custom configurations or a remote server, set the environment variable `MONGOURL` to the URI of the MongoDB database:

```sh
export MONGOURL='[url]'
```
The URI format is:

```
mongodb://[username]:[password]@[host]:[port]/[database]
```
For example:

```sh
export MONGOURL='mongodb://user123:pass123@db.example.com:27017/scholarly'
```
A "source" collection is required: `scholars`. Optionally, create several scholars collections (to keep different groups separate, for example) following the pattern `scholars_[suffix]`. Then point the script to the desired collection by setting the environment variable `collection_suffix`.

Note: output collections are consolidated into one by default, even if you have multiple source collections. This prevents double-scraping when the contents of source collections overlap. If you want separate output collections, add the `collection_suffix` argument to the relevant `connect_mongo()` functions in `scrape_control.R`, as sketched below.
`renv` is used for package management. On first run, renv will bootstrap and prompt you to install the required packages via `renv::restore()`.

NOTE: if using conda, install packages from conda instead; some packages will not install properly from within R.
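From an R session in the project root:

```r
# Restore the project library from renv.lock (prompts on first run)
renv::restore()
```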
Complete the one-time configuration (link needed) to define an aliased or macro command such as `scrape`. To run the scraper, open a terminal (Mac/Linux) or command prompt (Windows) and simply enter `scrape` to begin.
Scraping of the Google Scholar profiles is handled by these scripts:

- `scrape_control.R`: the top-level control script activated by the user. Defines MongoDB connectivity and manages the scrape job.
- `scrape_requirements.R`: loads packages and custom functions.
- `/R/`: a collection of functions that facilitate various aspects of scraping. Refer to the individual script files for explanations.
The scraper first collects the Google Scholar profile information, which includes basics such as name, affiliation, citations, and fields of study. This is inserted into the `google_profiles` collection. The profile page also exposes a list of publications. This list seeds the `get_publication_details()` function, which collects publication-level information from individual publication pages that is not exposed on the scholar profile page. This information is inserted into the `publications` collection. The process then repeats for the next scholar.
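To make the flow concrete, here is an illustrative outline of the per-scholar loop. Apart from `get_publication_details()`, the names below are hypothetical; the real logic lives in `scrape_control.R` and `/R/`:

```r
# Illustrative sketch only, not the project's actual implementation.
for (i in seq_len(nrow(scrape_queue))) {
  scholar <- scrape_queue[i, ]
  profile <- fetch_profile_page(scholar$gsid)   # hypothetical helper
  profiles_con$insert(profile)                  # -> google_profiles
  pubs <- get_publication_details(profile)      # visits each publication page
  pubs_con$insert(pubs)                         # -> publications
}
```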
Use `scrape_control.R` to define the parameters of the job:

- (optional) Specify a custom collection suffix to scrape into
  - Default: no suffix
- (optional) Specify inclusion/exclusion criteria to build the scrape job (see the sketch after this list)
  - Default: scrape all scholars who have a GSID but do not have a document in the `google_profiles` collection
- (optional) Specify the log/temp file directory as an argument to `scrape_looper(X, log_tmp_dir = [path])`
  - Default: `[project]/mongo_and_scrape/log`
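As an illustration, the default inclusion criteria could be reproduced with `mongolite` (an assumption; the project uses its own `connect_mongo()` wrapper, and the database/collection names follow the examples above):

```r
library(mongolite)

# Hypothetical reproduction of the default criteria: scholars with a GSID
# that do not yet have a google_profiles document.
url <- Sys.getenv("MONGOURL", "mongodb://localhost:27017")
scholars_con <- mongo("scholars", db = "scholarly", url = url)
profiles_con <- mongo("google_profiles", db = "scholarly", url = url)

candidates <- scholars_con$find('{"gsid": {"$exists": true}}',
                                fields = '{"_id": 1, "gsid": 1}')
scraped <- profiles_con$distinct("gsid")
scrape_queue <- candidates[!candidates$gsid %in% scraped, ]
```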
This command starts scraping and ensures the process will not terminate with your session:

```sh
nohup Rscript Scholarly/scrape_control.R \
  </dev/null >Scholarly/log/`date '+%m-%d-%y'`_consoleScrape.out \
  2>Scholarly/log/`date '+%m-%d-%y'`_ScrapeErrorMessages.err & disown
```
Assumes proper configuration of `.bash_profile` (see the one-time configuration above).
Check for active scrape jobs with:

```sh
status
```

To terminate a scrape job, note the PID from the above `ps` command, then:

```sh
kill [PID#]
```
To peek at the tail of the message log and error log respectively, use:

```sh
checklog
checkerror
```
Check the current progress of an ongoing or completed scrape by referring to the log files:

- `Scholarly/log/scraper_log_[date].log` is the main log file, recording the timestamped status updates of the script; it can be opened mid-scrape to check status.
- The nohup logs are dumps of the console output and of errors/warnings respectively, and are recommended for debugging as necessary.
An error, a loss of power, being blocked by Google, etc., may cause a scrape job to finish incompletely (for example, only 50 of a scholar's 85 publications are scraped). When the script is started again, it will continue where it was interrupted; it should not require any intervention.

If more scholars are added, or scholar documents that lacked a GSID are updated with one, just run the script again to scrape them. By default, scholars who have already been scraped into the given year collection will not be re-scraped.
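As a hedged sketch of how such a resume check could work at the publication level (all names here are hypothetical; the actual mechanism is in the project scripts):

```r
# Hypothetical: skip pubids already stored for this scholar, so an
# interrupted job picks up at publication 51 of 85 rather than restarting.
done_pubids <- pubs_con$distinct("pubid",
                                 sprintf('{"gsid": "%s"}', scholar$gsid))
todo_pubids <- setdiff(profile_pubids, done_pubids)
```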
You can also scrape from an XLSX file containing Google Scholar profile URLs using the Google Scholar Profile Collector app.
From an R script: to scrape only one scholar, `scholar_scrape()` can be called directly (see the example after this list). The scrape can omit either the profile or the publications by changing the corresponding argument from its default:

```r
scholar_scrape(scholar_query, fetch_profile = TRUE, fetch_publications = TRUE)
```

The `scholar_query` argument expects a data frame or named list that contains:

- `scholar_query$_id`
- `scholar_query$gsid`
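For example, a hypothetical call that skips the profile and fetches only the publications (the 12-character IDs are placeholders):

```r
# _id defaults to the gsid in this project; both values here are made up.
one_scholar <- list(`_id` = "A1bC2dE3FgHI", gsid = "A1bC2dE3FgHI")
scholar_scrape(one_scholar, fetch_profile = FALSE, fetch_publications = TRUE)
```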
This collection (`google_profiles`) is scraped from the Google Scholar "Profile Page." Example profile page.
Field name | Type | Prevalence | Source | Description |
---|---|---|---|---|
_id | string | 100% | File | Internal identifier. Defaults to the gsid |
gsid | string | 100% | File* | 12 character identifier generated by Google. Typically sourced from file by coders. This field is updated programmatically in the case of redirection (assignment of a new GSID by Google) |
gsid_updated | boolean | 0-1% | Scrape | Indicates a redirection event and new gsid |
queried_gsid | string | 0-1% | Scrape | In the case of a redirection event, the gsid listed in the import file is listed here. |
affiliation | string | 100% | Scrape | University name / "Unknown" |
coauthors | array | 50-60% | Scrape | List of coauthors associated. Scraped from the "coauthors" panel on the side of the profile. Only lists linked coauthors with google profiles (not associated with publication author bylines) |
coauthors.coauth_gsid | string | ┣100% | Scrape | Coauthor GSID |
coauthors.coauth_name | string | ┣100% | Scrape | Coauthor Full Name |
coauthors.coauth_institution | string | ┗100% | Scrape | Coauthor University name / "Unknown" |
fields | array | 90-100% | Scrape | Plain text descriptor of the scholar's field(s) of study |
h_index | integer | 100% | Scrape | As reported by Google |
homepage | url | 50-60% | Scrape | Author profile/faculty page/website provided by scholar |
i10_index | integer | 100% | Scrape | As reported by Google |
n_publications | integer | 100% | Calculated | On requesting profile, the number of publication links on the page is tallied up and recorded here. |
name | string | 100% | Scrape | Scholar full name |
specs | string | 100% | Scrape | Verification badge |
total_cites | integer | 100% | Scrape | As reported by Google |
last_updated | date | 100% | Timestamp | Timestamp of initial profile scrape event. |
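To read from this collection, a minimal example using `mongolite` (an assumption; the project wraps its own client, and the gsid is a placeholder):

```r
library(mongolite)

# Fetch one scholar's profile metrics and the nested coauthors array.
profiles_con <- mongo("google_profiles", db = "scholarly",
                      url = Sys.getenv("MONGOURL", "mongodb://localhost:27017"))
profiles_con$find('{"gsid": "A1bC2dE3FgHI"}',
                  fields = '{"name": 1, "h_index": 1, "total_cites": 1, "coauthors": 1}')
```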
This collection (`publications`) is scraped from individual publication pages. Example publication page.
Field name | Type | Prevalence | Source | Description |
---|---|---|---|---|
_id | ObjectId | 100% | Generated | Internal arbitrary BSON identifier (random) |
gsid | string | 100% | google_profiles | 🔑KEY variable |
application_number | string | 1-2% | Scrape | (PATENT) Indicates a patent application number |
authors | string | 99% | Scrape | Plain text comma delimited list of publication authors. |
blank | boolean | 0-0.1% | Scrape | Indicates a truncated publication with no additional information beyond the profile page stem. (Link goes to a "blank" page, but is not 404) |
book | string | 4-6% | Scrape | Contains the name of a book in which the article/essay/etc appeared |
cid | string | 99-100% | Scrape | Unique identifier, key used by Google to link a publication to "citing publications" (can be used to find publications that cited this publication) |
cites | integer | 100% | Scrape | Total number of citations (from PROFILE page "Cited by" column) |
conference | string | ~10% | Scrape | Name of the conference, for conference papers |
description | string | 85-90% | Scrape | Paragraph; may be an abstract, description, synopsis, etc., depending on publication type |
institution | string | 0-1% | Scrape | Name of a department/university/institution |
inventors | string | 1-2% | Scrape | (PATENT) Equivalent to Authors |
Issue | string | 50% | Scrape | Issue number of a journal |
journal | string | 60-70% | Scrape | Name of a journal |
last_updated | date | 100% | Timestamp | Timestamp of scrape event. |
number | string | 0-0.5% | Scrape | Various combinations of edition, volume, page ranges |
pages | string | 70-75% | Scrape | Page range of journal/book |
patent number | string | 1-2% | Scrape | (PATENT) Patent Number |
patent office | string | 1-2% | Scrape | (PATENT) Issuing country code (99% 'US') |
pub_cites | array | 50-60% | Scrape | Transcription of the citation graph at the bottom of publication page |
pub_cites.year | integer | ┣100% | Scrape | 4 Digit Year from X axis of graph |
pub_cites.cites | integer | ┗100% | Scrape | Integer number of citations for that year |
pubid | string | 100% | Scrape | Google generated - unique only when combined with gsid |
publication date | string | 90-95% | Scrape | May be identical to "year" field or may additionally contain a month and/or day in various formats. |
publishedln | string | 0-1% | Scrape | Alternative field for name of journal/etc |
publisher | string | 60-75% | Scrape | Name of a publishing company that publishes the journal/book |
report number | string | 0-0.1% | Scrape | Random/Junk field? |
source | string | 5% | Scrape | Alternative field for name of journal/etc |
title | string | 100% | Scrape | Title of publication (hyperlinked text at top of the Google profile page) |
volume | string | 60-65% | Scrape | Volume number of journal/book |
year | integer | 100% | Scrape | Publication year (from PROFILE page "Year" column) |
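Since `gsid` is the key linking the two collections, a lookup like the following retrieves one scholar's publications (again assuming `mongolite`; the gsid is a placeholder):

```r
library(mongolite)

# All publications for one scholar, keyed by gsid, with a few fields projected.
pubs_con <- mongo("publications", db = "scholarly",
                  url = Sys.getenv("MONGOURL", "mongodb://localhost:27017"))
pubs <- pubs_con$find('{"gsid": "A1bC2dE3FgHI"}',
                      fields = '{"title": 1, "year": 1, "cites": 1}')
```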