Final Presentation Script - matt-bernhardt/datalore GitHub Wiki
MIT's cache of scholarship, dirty metadata, and the hidden ecosystem of a modern university
In 2009, the Massachusetts Institute of Technology’s faculty unanimously voted to collect all work published at the institute in an open access repository.
It was the first time in the 900-year history of university libraries that an entire university had agreed to share all of its scholarship openly with anyone interested.
[IMAGES: Data points -- should we group these, or scatter them between paragraphs, Effy?]
• 14,981 items (Jan. 17, 2015)
• 11,101 MIT authors
• 179 MIT departments, labs and research centers
• 21 years of research
MIT's open access policy eliminated barriers between the public and the university's creative work, but it had another, unintended effect: For the first time, MIT had data to track scholarly activity on campus.
The metadata – information about who publishes, with whom, how often, where, and with which rights – gives us a new perspective on the ecosystem of a modern university. We’ll show you:
[IMAGE? Like the data points above]
• Three stories from the data:
  • How attitudes toward sharing research have changed over time
  • Which MIT departments are the best collaborators
  • Whether a record of collaboration correlates with receiving NIH research grants [NEEDS WORK]
• An argument that Open Access is important, not just to share research, but to understand the ecosystem of scholarship;
• A list of stories the data could tell in the future; and
• Ways to improve MIT's OA data collection practices.
Question: How are attitudes toward open access changing over time?
MIT's open access database includes a range of rights and permissions for users. Some scholars share their work with no strings or requirements attached. Others add their work to the OA database but limit how others can use or remix their scholarship.
We wanted to see whether different departments had different sharing cultures, and we were also interested in whether attitudes toward openness and sharing changed over time. Ben Swanson and Effy Zhang built an interactive hive visualization with three axes: rights granted, departments, and time. Mousing over different axes lets users visualize which departments are the most open, and how rights in the entire OA database have changed over time.
[IMAGE: Hive visualization. Credit: Ben Swanson & Effy Zhang]
Question: Which departments are the best collaborators?
The OA metadata includes information about which departments, labs and research centers at MIT contributed to a paper.
Matt Bernhardt's chord diagram illustrates which of MIT's departments have the most collaborations, and also reveals whether their interdisciplinary research spans a wide range of departments or is focused on narrow but deep working relationships. (It's sorted based on the date of first entry in the OA database.)
[IMAGE: Chord diagram of collaborations. Credit: Matt Bernhardt]
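A chord diagram of this kind needs a count of how often each pair of departments appears on the same paper. The following is a minimal sketch of that counting step, not the code behind the diagram itself. It assumes each paper has already been reduced to a list of department names; the function name `count_collaborations` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_collaborations(papers):
    """Count how many papers each pair of departments shares."""
    pair_counts = Counter()
    for departments in papers:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(departments))
        for pair in combinations(unique, 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical department lists, one per paper (illustration only).
papers = [
    ["Physics", "Mathematics"],
    ["Mathematics", "Physics", "Biology"],
    ["Biology"],
]
counts = count_collaborations(papers)
print(counts.most_common())
```

Papers with a single department contribute no pairs, so they drop out of the collaboration counts without any special handling.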
Question: Does collaboration correlate with federal funding?
Once we were able to visualize interdisciplinary collaborations, we wondered whether they had any benefits.
While causation was beyond the scope of a three-day hackathon, we pulled five years of grant funding data from the National Institutes of Health to see whether there were any correlations between collaboration and that set of funding.
[IMAGE: Data viz correlating authorial collaboration to NIH grant fund recipients]
Other questions worth exploring: [COULD WE MAKE THIS A GRAPHIC?]
• Does the gender ratio of a department's faculty correlate with interdisciplinary collaboration?
• What's the culture of co-authorship at MIT? What are the lowest, highest and median numbers of co-authors?
• Does the size of a research center's graduate student pool correlate with faculty publishing?
Here's the thing about MIT's Open Access database and the way it's collected: the metadata is messy.
When a scholar, librarian, or publisher adds a paper to the repository, she is asked to fill out some basic information -- which varies depending on how she approaches the database. Author. Title. Publisher. Publication date. A description of the file. And more.
[IMAGE: screenshot of the data entry page]
A cataloguer could fill out 86 pieces of metadata for each paper that's added to the repository, with the option to add more fields and information -- but none of the fields are required.
When we began work with the data, the Open Access repository included 14,981 papers. We found that only seven pieces of metadata existed for all 14,981 entries.
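A count like this can be reproduced by intersecting the sets of populated fields across all records. The following is a minimal sketch, assuming each record is a dictionary that maps field names to values. The function name `fields_present_in_all` and the sample records are hypothetical and are not the team's actual scraping code.

```python
from typing import Iterable


def fields_present_in_all(records: Iterable[dict]) -> set:
    """Return the field names that hold a non-empty value in every record."""
    common = None
    for record in records:
        # Treat None, empty strings, and empty lists as missing.
        populated = {name for name, value in record.items() if value}
        common = populated if common is None else common & populated
    return common or set()


# Hypothetical sample records (illustration only).
sample = [
    {"dc.title": "Paper A", "dc.identifier.uri": "uri-1", "dc.subject": "optics"},
    {"dc.title": "", "dc.identifier.uri": "uri-2"},
    {"dc.title": "Paper C", "dc.identifier.uri": "uri-3", "dc.subject": None},
]

print(sorted(fields_present_in_all(sample)))
```

In this sample only the URI field survives, because one record has an empty title and two lack a subject.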
[IMAGE: .jpg of data on white board]
Incomplete data makes it difficult, sometimes impossible, to have confidence when using a database. Matt Bernhardt, the web developer who scraped the OA data for our team, scrubbed the data so that we could share the sets that were complete. But we would have preferred to have more.
[IMAGE: Thumbnail of Excel matrix with link to spreadsheet]
In addition to incomplete data, there are inconsistent data. An author could be identified as "Smith, Jane B." in one document, and "Smith, Jane Baker" in another -- making one author appear to be two Jane Smiths with a single credit each, rather than reflecting Jane Smith's real productivity as a scholar.
And let's not forget typos. Jane Smith or Jan Smith? Yikes.
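One rough way to find name variants is to reduce each name to a matching key and group the names that share a key. The following is a minimal sketch, assuming names arrive as "Surname, Given Middle" strings. The key design and the function names `author_key` and `group_variants` are hypothetical. This heuristic can merge distinct people who share a key, and it does not catch typos such as "Jan" for "Jane".

```python
from collections import defaultdict


def author_key(name: str) -> tuple:
    """Build a rough matching key from a "Surname, Given Middle" string."""
    surname, _, given = name.partition(",")
    tokens = given.replace(".", " ").split()
    first = tokens[0].lower() if tokens else ""
    # Reduce any middle names to initials so "B." and "Baker" compare equal.
    middle_initials = "".join(token[0].lower() for token in tokens[1:])
    return (surname.strip().lower(), first, middle_initials)


def group_variants(names):
    """Group name strings that share the same matching key."""
    groups = defaultdict(list)
    for name in names:
        groups[author_key(name)].append(name)
    return dict(groups)


names = ["Smith, Jane B.", "Smith, Jane Baker", "Smith, Jan"]
for key, variants in group_variants(names).items():
    print(key, variants)
```

Groups produced this way are candidates for human review, not confirmed matches, which is one reason a controlled vocabulary at entry time is more reliable than cleanup afterward.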
We propose a few ways to improve the Open Access database's metadata. For future additions, employ:
• Mandatory fields: A few basic pieces of metadata should be required for each item entered into the OA database: Author(s), Title, Abstract, Rights and permissions
• Controlled vocabulary for fields like MIT authors, MIT departments, rights
• Consistent format for dates
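The three proposals above could be checked when an item is submitted. The following is a minimal sketch of such a check, not a description of how DSpace@MIT validates submissions. The field names, the rights vocabulary, and the choice of a year-month-day date format are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical field names and vocabulary (illustration only).
REQUIRED_FIELDS = ("authors", "title", "abstract", "rights")
ALLOWED_RIGHTS = {"open-access-policy", "creative-commons", "publisher-policy"}


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    rights = record.get("rights")
    if rights and rights not in ALLOWED_RIGHTS:
        problems.append(f"rights value not in controlled vocabulary: {rights}")
    date_issued = record.get("date_issued")
    if date_issued:
        try:
            # Assumed house format: year-month-day, e.g. 2015-01-17.
            datetime.strptime(date_issued, "%Y-%m-%d")
        except ValueError:
            problems.append(f"date not in YYYY-MM-DD format: {date_issued}")
    return problems


record = {
    "authors": ["Smith, Jane B."],
    "title": "Paper A",
    "abstract": "",
    "rights": "all rights reserved",
    "date_issued": "Jan. 17, 2015",
}
for problem in validate_record(record):
    print(problem)
```

Returning a list of problems rather than stopping at the first one lets a cataloguer fix everything in a single pass.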
Open Access
Open Access repositories provide free, online access to digital content, and they promote access to scholarship by eliminating the fees that publishers charge libraries and other institutions for the privilege of looking at a journal’s contents. These fees -- which are negotiated independently between the publishers and institutions, and are treated as industry secrets -- create barriers to research. As a result, many institutions are unable to afford subscriptions to a collection of journals comprehensive enough to support the work of a large number of scholars working in myriad research fields. Open Access breaks this barrier, promoting the use of research in courses, text mining, accessibility applications, and other projects.
DSpace@MIT
DSpace@MIT is home to green OA -- a repository of open access files.[1] DSpace was developed collaboratively in 2002 by people at MIT and Hewlett Packard, and this open source product is now supported by DuraSpace, a not-for-profit organization dedicated to open technologies that house and provide access to digital content.[2] On campus, Curation and Preservation Services at the MIT Libraries manages DSpace@MIT.[3] While DSpace@MIT provides a home for content deposited as a result of the 2009 Faculty Open Access Policy, this does not represent all of the content in DSpace@MIT, which holds the Open Access collections of individuals and organizations across MIT. Additionally, DSpace@MIT serves as a repository for institution-wide digital content that is not Open Access, such as out-of-print MIT Press books and other research content.
[1] http://legacy.earlham.edu/~peters/fos/overview.htm
[2] http://duraspace.org/history
[3] http://libguides.mit.edu/c.php?g=176372&p=1158910
Metadata and DSpace@MIT
If you’re interested in more of the details about the metadata documenting the content of DSpace@MIT, some information about this metadata is available on the web, and therefore accessible to us in this two-day hack. (Some of it is not.) The metadata schema in use throughout DSpace@MIT is Dublin Core, a schema known for its flexibility.
This flexibility allows the database to represent all of the different types of content that could be submitted to DSpace@MIT (there are 86 fields available to those editing the metadata on the back end).
This also means that there are a lot of inconsistencies. For example, when authors submit articles directly into the DSpace@MIT web form, there is only one required element: Authorizing MIT Author (not to be confused with creator).
And that’s just one way to add things to the repository.
Communities (departments, labs, and research centers) at MIT can each set up their own workflow for this process. By working with library staff, each community can establish and train its own content reviewer, metadata editor, and coordinator. Like the 86 possible elements, this is helpful but idealistic: it creates many options for how content can be deposited, but at the cost of consistent metadata.
The problems become apparent when you dive into the data set. Of the 14,981 items in our data set, two are missing titles. Rather than use the creator element, creators are listed under <dc.contributor.author> and <dc.contributor.mitauthor>. Are there really some items that don’t have creators at all, and some that have authors but none associated with MIT – or was this metadata not entered correctly?
The lack of keywords is another (although less fundamental) problem. The full metadata includes a field called <dc.subject> which maps to “Keywords” in the simple item display. That element is only populated 371 times, yet it is one of the five ways that people can browse DSpace@MIT. (Remember, the full scope of the content of DSpace@MIT is not limited to Open Access content: browsing by all authors lists 95,726 items out of which only 14,981 are part of our Open Access data set.)
Why is this significant for Open Access content? Well, if users are discovering content through subject browses, Open Access content is severely underrepresented: of the 37,340 items in “Browsing by Subject,” only 371 are Open Access. If subject browses are front and center on the DSpace@MIT homepage, making the <dc.subject> field mandatory would greatly increase exposure to Open Access content.
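The scale of that underrepresentation can be worked out from the counts quoted above. The following is a small sketch of the arithmetic. The helper name `share` is hypothetical, and the comparison against the author browse is our own framing of the numbers already given.

```python
def share(part: int, whole: int) -> float:
    """Return part as a fraction of whole."""
    return part / whole


oa_items = 14_981          # items in the Open Access data set
all_author_items = 95_726  # items listed when browsing by all authors
subject_items = 37_340     # items listed in "Browsing by Subject"
oa_with_subject = 371      # Open Access items with a populated subject field

print(f"OA share of author browse:  {share(oa_items, all_author_items):.2%}")
print(f"OA share of subject browse: {share(oa_with_subject, subject_items):.2%}")
print(f"OA items with a subject:    {share(oa_with_subject, oa_items):.2%}")
```

Open Access items make up about 16 percent of what the author browse lists but only about 1 percent of the subject browse, and fewer than 3 percent of Open Access items carry a subject at all.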
Across nearly 15,000 item records, it appears that there are only seven mandatory elements: date accessioned, date available, eprint version (describing which version of the manuscript is included in the repository), MIT license, URI, dc.type, and dc.type.URI. As stated elsewhere on our site, it would be helpful if more of the fields were mandatory: contributor, title, and rights. Making publisher and date issued mandatory, while allowing a “null” value, would ensure that metadata about publication was included if relevant. Overall, the metadata schema is effective in that it enables access to a large quantity of Open Access material, but it could be more effective with additional tweaks.
• Link to full data set
• Link to relational data set
• Link to Excel file
• List/links to tools used in creating the project
• Bibliography (see Bibliography)
Copy by: Betsy O'Donovan and Emily Weirich
Data supplied by: Matt Bernhardt and MIT Libraries
Data management by: Matt Bernhardt, Ben Swanson, Effy Zhang
Data visualizations by: Matt Bernhardt, Ben Swanson, Effy Zhang
Research by: Emily Weirich
UX and site design by: Effy Zhang
Story design by: Betsy O'Donovan
Matt Bernhardt is a web developer at the MIT Libraries, with interests in discovery systems and the movement toward openness -- in education, access, software, and many other places.
Betsy O’Donovan, the editor and digital strategist for AIR and a 2013 Nieman fellow at Harvard, spends her time thinking about how tech and data can influence and improve journalism.
Ben Swanson plays with NLP and machine learning, working with geographical city data intersecting Twitter and the US Census.
Emily Weirich recently earned her MLIS from Simmons College and has studied art history, photography, visual communication, and metadata; she currently works in Access Services at the Harvard Fine Arts Library.
Effy Zhang is an interaction designer and an MFA candidate in interactive design at the School of Visual Arts in New York, and has worked as a design intern at Samsung.