Home - Mcirino/pdd GitHub Wiki

Welcome to the PaleoDeepDive wiki!

DeepDive documentation is available here

What is PaleoDeepDive?

The PaleoDeepDive (PDD) statistical machine reading and learning system was developed to extract paleontological data from the scientific literature, i.e., dark data. Dark data includes journals, articles, scanned PDFs, and published compendia: documents in which the information the user wants to extract is surrounded by information that is not needed.

DeepDive is the base of PDD. DeepDive is a trained system, which means it improves with repeated iterations and more data. PDD is a variant of DeepDive that specializes in extracting data from paleontological literature and outputting useful fossil data, specifically fossil size data.

PDD makes use of custom paleoscience-focused inference rules to inform the DeepDive learning process. These rules include recognition of taxonomic hierarchy, species and genus names, and geological formations.

DeepDive takes into account natural language ambiguities, and computes calibrated probabilities for every assertion it makes.

By using image processing techniques, such as thresholding, PDD can measure fossils illustrated in the literature. Using natural language processing (NLP), PDD can output the actual size of fossils by extracting magnification information from figure captions. NLP is also used to identify other useful information for extraction, such as taxonomy.
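The measurement idea described above can be illustrated with a minimal sketch. It is not PDD's actual code: the function names, the caption format (a magnification written as `×2.5` or `x 10`), the darker-than-background assumption, and the `px_per_mm` scan resolution are all assumptions made for illustration. The sketch thresholds a grayscale image, measures the longer side of the resulting bounding box in pixels, and divides by the scan resolution and the magnification parsed from the caption.

```python
import re

def threshold(image, cutoff):
    """Binary mask: True where a pixel is darker than `cutoff`.

    Assumes a grayscale image given as a list of rows of 0-255 values,
    with the fossil darker than the page background (an assumption).
    """
    return [[pixel < cutoff for pixel in row] for row in image]

def longest_extent_px(mask):
    """Longer side, in pixels, of the bounding box around all True pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, on in enumerate(row) if on]
    if not rows:
        return 0  # nothing passed the threshold
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return max(height, width)

# Hypothetical caption convention: "x" or "×" followed by a number,
# not preceded by a letter (so "box 3" is not read as a magnification).
MAGNIFICATION = re.compile(r"(?<![A-Za-z])[x×]\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

def parse_magnification(caption):
    """Return the magnification stated in a caption, or None if absent."""
    match = MAGNIFICATION.search(caption)
    return float(match.group(1)) if match else None

def actual_size_mm(extent_px, px_per_mm, magnification):
    """Convert a measured pixel extent to real-world millimetres."""
    return extent_px / px_per_mm / magnification

if __name__ == "__main__":
    # Tiny made-up image: 255 is background, low values are the specimen.
    image = [
        [255, 255, 255, 255, 255, 255, 255],
        [255,  40,  30,  35,  20,  50, 255],
        [255, 255,  45,  25, 255, 255, 255],
        [255, 255, 255, 255, 255, 255, 255],
    ]
    extent = longest_extent_px(threshold(image, cutoff=128))
    magnification = parse_magnification("Figure 2. Dorsal view, ×2.5.")
    print(actual_size_mm(extent, px_per_mm=10.0, magnification=magnification))
```

A real pipeline would need a more robust segmentation step and would have to handle captions that give a scale bar instead of a magnification; the sketch only shows how the image measurement and the caption text combine into one size estimate.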

Once data has been extracted, it is output to a relational database. Compiling paleontological datasets with PDD is much quicker and more efficient than doing it manually. It is also important to extract data from sources that are not already covered by an existing database such as the Paleobiology Database (PaleoDB), whose size is limited by the human effort needed to extract and enter data manually.
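As a rough illustration of what storing extractions in a relational database could look like, the sketch below uses Python's built-in `sqlite3` module. The table name, the columns, and the sample rows are hypothetical and are not PDD's actual schema or data. Each row carries the probability assigned to the extraction, so that downstream analyses can keep only high-confidence records.

```python
import sqlite3

def build_db(rows):
    """Create an in-memory database holding extracted size records.

    The schema is a hypothetical example, not PDD's real output format.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE fossil_size (
            genus       TEXT NOT NULL,
            species     TEXT NOT NULL,
            length_mm   REAL NOT NULL,
            probability REAL NOT NULL CHECK (probability BETWEEN 0 AND 1),
            source      TEXT NOT NULL
        )
        """
    )
    conn.executemany("INSERT INTO fossil_size VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn

def confident_sizes(conn, min_probability=0.9):
    """Return (genus, species, length_mm) for sufficiently probable rows."""
    cursor = conn.execute(
        "SELECT genus, species, length_mm FROM fossil_size "
        "WHERE probability >= ? ORDER BY genus, species",
        (min_probability,),
    )
    return cursor.fetchall()

if __name__ == "__main__":
    # Placeholder records, invented purely to exercise the functions.
    example_rows = [
        ("Examplegenus", "alpha", 12.5, 0.97, "doc-001"),
        ("Examplegenus", "beta", 30.0, 0.55, "doc-002"),
        ("Othergenus", "gamma", 4.2, 0.92, "doc-003"),
    ]
    conn = build_db(example_rows)
    for record in confident_sizes(conn):
        print(record)
```

Keeping the probability alongside each value, rather than filtering at extraction time, lets different analyses choose their own confidence cutoff.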

We are using PDD to extract a body size dataset for marine fossil animals. The goal is to eventually build a species-level body size dataset for the past 550 million years of animal evolution. Moreover, once PDD is accurate and easily customizable, it can be adjusted to gather data not just on specimen size but possibly on other traits as well, such as location. In turn, these datasets could be used to identify other macroevolutionary trends over geologically significant periods of time.