EntrezDirect scripting
The "Digital Object Identifier” uniquely identifies a research paper (and recently it's being co-opted to reference associated datasets). There're interesting and troublesome exceptions, but in the vast majority of cases any paper published in at least the last 10 years or so will have one.
Although NCBI Pubmed does a great job of cataloguing biomedical literature, another site, doi.org, provides a consistent gateway to the original source of the paper. You only need to append the DOI to "dx.doi.org/" to generate a working redirection link.
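As a quick sketch (the `doi2url` name is mine, and some publishers reject the HEAD requests `curl -I` sends, so treat it as illustrative only):

```bash
# Ask dx.doi.org where a DOI redirects to, and print only the final URL
doi2url () { curl -sIL -o /dev/null -w '%{url_effective}\n' "https://dx.doi.org/$1"; }

doi2url "10.1000/182"   # the DOI Handbook's DOI, purely as an example
```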
Last week the NCBI posted a webinar detailing the inner workings of Entrez Direct, the command-line interface for Unix computers (GNU/Linux and Macs; Windows users can fake it with Cygwin). It revolves around a custom XML parser written in Perl (typical for bioinformaticians), with subtle 'switches' to tailor the output just as you would from the web service (albeit with a bit more of the inner workings on show).
I've pieced together a bit of a home pipeline: it has a function to generate citations from files listing basic bibliographic information, and as the final piece of the puzzle I now have a custom function (or several) that does its best to systematically find a single unique article matching the author, publication year, and title of a paper.
Entrez Direct has concise documentation, and this setup can also be used to access genetic disease (OMIM), protein, and other types of data.
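For instance, the same `esearch | efetch` pattern works against the other Entrez databases; a quick sketch against the protein database (the accession is just an illustration):

```bash
# Pull a single protein record as FASTA; swap in any accession of interest
esearch -db protein -query "NP_000537.3 [ACCN]" | efetch -format fasta
```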
When installing, watch out: the setup script added a `source` command to my .bashrc, 'sourcing' my .bash_profile, which was already in turn 'sourcing' my .bashrc, effectively putting every new terminal command prompt in an infinite loop. Keep an eye out for this if your terminals freeze then quit after installation!
The scripts below are available here; I'll update them on the GitHub Gist if I make amendments:
```bash
function cutf (){ cut -d $'\t' -f "$@"; }
function striptoalpha (){ for thisword in $(echo "$@" | tr -dc 'A-Za-z\n' | tr 'A-Z' 'a-z'); do echo "$thisword"; done; }
function pubmed (){ esearch -db pubmed -query "$@" | efetch -format docsum | xtract -pattern DocumentSummary -present Author -and Title -element Id -first "Author/Name" -element Title; }
function pubmeddocsum (){ esearch -db pubmed -query "$@" | efetch -format docsum; }
function pubmedextractdoi (){ pubmeddocsum "$@" | xtract -pattern DocumentSummary -element Id -first "Author/Name" -element Title SortPubDate -block ArticleId -match "IdType:doi" -element Value | awk '{split($0,a,"\t"); split(a[4],b,"/"); print a[1]"\t"a[2]"\t"a[3]"\t"a[5]"\t"b[1]}'; }
function pubmeddoi (){ pubmedextractdoi "$@" | cutf 4; }
function pubmeddoimulti (){
    xtracted=$(pubmedextractdoi "$@")
    if [[ $(echo "$xtracted" | cutf 4) == '' ]]
    then
        xtractedpmid=$(echo "$xtracted" | cutf 1)
        pmid2doirestful "$xtractedpmid"
    else
        echo "$xtracted" | cutf 4
    fi
}
function pmid2doi (){ curl -s www.pmid2doi.org/rest/json/doi/"$@" | awk '{split($0,a,",\"doi\":\"|\"}"); print a[2]}'; }
function pmid2doimulti (){
    curleddoi=$(pmid2doi "$@")
    if [[ $curleddoi == '' ]]
    then
        pmid2doincbi "$@"
    else
        echo "$curleddoi"
    fi
}
function pmid2doincbi (){
    xtracteddoi=$(pubmedextractdoi "$@")
    if [[ $xtracteddoi == '' ]]
    then
        echo "DOI NA"
    else
        echo "$xtracteddoi"
    fi
}
function AddPubTableDOIsSimple () {
    old_IFS=$IFS
    IFS=$'\n'
    for line in $(cat "$@"); do
        AddPubDOI "$line"
    done
    IFS=$old_IFS
}
# Came across NCBI rate throttling while trying to call AddPubDOI in parallel, so added a second attempt for "DOI NA".
# STDOUT output is also mirrored to STDERR: this function will be run on a file with STDOUT redirected away,
# so the STDERR copy lets you see progress through the lines, as in:
#   AddPubTableDOIs table.tsv > outputfile.tsv
# I wouldn't recommend overwriting the input file unless you're using version control.
function AddPubTableDOIs () {
    old_IFS=$IFS
    IFS=$'\n'
    for line in $(cat "$@"); do
        DOIresp=$(AddPubDOI "$line" 2>/dev/null)
        if [[ $DOIresp =~ 'DOI NA' ]]; then
            # try again in case it's just NCBI rate throttling, but just the once
            DOIresp2=$(AddPubDOI "$line" 2>/dev/null)
            if [[ $(echo "$DOIresp2" | awk 'BEGIN{FS="\t"};{print NF}' | uniq | wc -l) == '1' ]]; then
                echo "$DOIresp2"
                >&2 echo "$DOIresp"
            else
                DOIinput=$(echo "$line" | cutf 1-3)
                echo -e "$DOIinput\tDOI NA: Parse error"
                >&2 echo -e "$DOIinput\tDOI NA: Parse error"
            fi
        else
            if [[ $(echo "$DOIresp" | awk 'BEGIN{FS="\t"};{print NF}' | uniq | wc -l) == '1' ]]; then
                echo "$DOIresp"
                >&2 echo "$DOIresp"
            else
                DOIinput=$(echo "$line" | cutf 1-3)
                echo -e "$DOIinput\tDOI NA: Parse error"
                >&2 echo -e "$DOIinput\tDOI NA: Parse error"
            fi
        fi
    done
    IFS=$old_IFS
}
function AddPubDOI (){
    if [[ $(echo "$@" | cutf 4) != '' ]]; then
        # line already has a DOI in column 4: pass it through untouched
        echo "$@"
        return
    fi
    printf '%s\t' "$(echo "$@" | cutf 1-3)"
    thistitle=$(echo "$@" | cutf 3)
    if [[ $thistitle != 'Title' ]]; then
        thisauthor=$(echo "$@" | cutf 1)
        thisyear=$(echo "$@" | cutf 2)
        round1=$(pubmeddoimulti "$thistitle AND $thisauthor [AUTHOR]")
        round1hits=$(echo "$round1" | wc -l)
        if [[ "$round1hits" -gt '1' ]]; then
            round2=$(pubmeddoimulti "$thistitle AND $thisauthor [AUTHOR] AND ("$thisyear"[Date - Publication] : "$thisyear"[Date - Publication])")
            round2hits=$(echo "$round2" | wc -l)
            if [[ "$round2hits" -gt '1' ]]; then
                round3=$(
                    xtracted=$(pubmedextractdoi "$@")
                    xtractedtitles=$(echo "$xtracted" | cutf 3 | tr -dc "[A-Z][a-z]\n")
                    alphatitles=$(striptoalpha "$xtractedtitles")
                    thistitlealpha=$(striptoalpha "$thistitle")
                    presearchIFS=$IFS
                    IFS=$'\n'
                    titlecounter="0"   # incremented to the line number of the matching title in $xtracted
                    for searchtitle in $(echo "$alphatitles"); do
                        (( titlecounter++ ))
                        if [[ "$searchtitle" == *"$thistitlealpha"* ]]; then
                            echo "$xtracted" | sed $titlecounter'q;d' | cutf 4
                        fi
                    done
                    IFS=$presearchIFS
                )
                round3hits=$(echo "$round3" | wc -l)
                if [[ "$round3hits" -gt '1' ]]; then
                    echo "ERROR multiple DOIs after 3 attempts to reduce - "$round3
                else
                    echo $round3
                fi
            else
                echo $round2
            fi
        else
            echo $round1
        fi
    fi
}
function pmid2doirestful (){
    curleddoi=$(pmid2doi "$@")
    if [[ $curleddoi == '' ]]
    then
        echo "DOI NA"
    else
        echo "$curleddoi"
    fi
}
function mmrlit { cat ~/Dropbox/Y3/MMR/Essay/literature_table.tsv; }
function mmrlitedit { vim ~/Dropbox/Y3/MMR/Essay/literature_table.tsv; }
function mmrlitgrep (){ grep -i "$@" ~/Dropbox/Y3/MMR/Essay/literature_table_with_DOIs.tsv; }
function mmrlitdoi (){ mmrlitgrep "$@" | cut -d $'\t' -f 4 | tr -d '\n' | xclip -sel p; clipconfirm; }
function mmrlitdoicite (){ mmrlitgrep "$@" | cut -d $'\t' -f 4 | awk '{print "`r citet(\""$0"\")`"}' | tr -d '\n' | xclip -sel p; clipconfirm; }
```
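To show the shape of the input these expect (the column layout here is just what the `cutf` calls assume: author, year, title, then a DOI column to be filled in), a minimal sketch with placeholder values:

```bash
# A one-line example table (tab-separated): author, year, title
printf 'SomeAuthor\t2012\tSome paper title about mismatch repair\n' > literature_table.tsv

# Fill in the DOI column; progress is mirrored to stderr while stdout goes to the new file
AddPubTableDOIs literature_table.tsv > literature_table_with_DOIs.tsv
```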
The main functions in the script are `AddPubDOI` and `AddPubTableDOIs`, the former being executed for every line in the input (reading from a table). Weird bug/programming-language feature, who knows where: you can't use the traditional `while read line; do AddPubDOI "$line"; done < inputfile` construction to handle a file line by line, so I resorted to `cat` trickery. I blame Perl.
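My guess, for what it's worth, is that the EDirect calls inside the loop read from stdin themselves and swallow the rest of the file; if that's the culprit, reading the file on a separate file descriptor should sidestep it. A sketch only, not something wired into the functions above:

```bash
# Read the table on fd 3 so the commands inside the loop can't consume it
while IFS= read -r line <&3; do
    AddPubDOI "$line"
done 3< literature_table.tsv
```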
- `cutf` is my shorthand to tell the `cut` command I want a specific column in a tab-separated file or variable.
- `striptoalpha` is a function I made here to turn paper titles into all-lowercase, squished-together strings of letters (no dashes, commas etc. that might get in the way of text comparison), as a really crude way of checking one name against another (there's a short example of what it produces just after this list). This part of the script could easily be improved, but I was just sorting out one funny case: usually matching author and year and using a loose title match will be sufficient to find the matching Pubmed entry, for which a DOI can be found.
- `pubmed` chains together: `esearch` to search Pubmed for the query; `efetch` to get the document (i.e. article) summaries as XML; and `xtract` to get the basic info. I don't use this in my little pipeline setup; rather I kept my options open and chose to get more information, and match within blocks of the XML for the DOI. It's not so complicated to follow, and as well as my code there's this example on Biostars.
- `pubmeddocsum` just does the first two of the steps above, providing full unparsed XML 'docsums'.
- `pubmedextractdoi` gets date and DOI information as columns, then uses GNU awk to rearrange the columns in the output.
- `pubmeddoi` gives just the DOI column from said rearranged output.
- `pubmeddoimulti` has 'multiple' ways to try and get the DOI for an article matched from searching Pubmed: firstly from the DOI output, then attempting to use the pmid2doi service output.
- `pmid2doimulti` does as for `pubmeddoimulti` but from a provided PMID.
- `pmid2doi` handles the pmid2doi.org response, `pmid2doincbi` the Entrez Direct side; both feed into `pmid2doimulti`.
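To make the `striptoalpha` squashing concrete (the title here is made up):

```bash
striptoalpha "Mismatch repair: DNA's proofreading problem?"
# mismatchrepairdnasproofreadingproblem
```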
Rookie's disclaimer: I'm aware pipelines are supposed to contain more, um, pipes, but I can't quite figure out an easy way to make these functions 'pipe' to one another, so I'm sticking with passing the output of one as the input to the next (`"$@"` in bash script).
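If I ever do revisit that, one common trick is to let a function fall back to reading stdin when it isn't handed an argument, so it can sit in a pipe. A sketch only (`pubmeddoipipe` is a made-up name, not part of the Gist):

```bash
function pubmeddoipipe (){
    local query="${1:-$(cat)}"   # use the argument if given, otherwise read stdin
    pubmedextractdoi "$query" | cutf 4
}

# then either style works:
pubmeddoipipe "some title AND Smith [AUTHOR]"
echo "some title AND Smith [AUTHOR]" | pubmeddoipipe
```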