Nicole Vasilevsky (Oregon Health & Science University) is a great speaker. By the third day of a conference, when you’ve gone madly from session to session, her presentation was a beautiful moment of intellectual repose because you never had to struggle to understand her point. Ahhhh.
My notes will not do her justice, but plenty of what she said resonated with things I learned about at the 2013 Research Data Alliance gathering.
She and her group had some research questions:
- How can we make science more reproducible?
- How can we educate researchers so that their data will be more reusable and reproducible?
- How can we use data to generate new hypotheses and make new connections?
Nicole explained that reproducible science involves providing good metadata about resources used in your lab experiments. Her analogy was cooking – you might copy the recipe of a famous chef, but if your ingredients weren’t of the same quality as the chef’s, your results may vary… Verifying results in science means using the exact same resources (antibodies, model organisms, etc.).
Another issue is the methodology for an experiment. In many scientific journals there are length restrictions on this part of an article, so even if a researcher intends to fully describe the methodology, they may be prevented by publishing practices.
She suggested we take a look at the comments at this Twitter hashtag:
In their study, they took 200 journal articles from the biomedical literature, across several domains and from journals with various impact factors. Across all those articles, only about 50% of resources (antibodies, cell lines, organisms, knockdown reagents, etc.) were identifiable, even when journal guidelines had stringent requirements for including this information. That pointed out that the guidelines weren’t being enforced.
Evidently they looked at lab notebooks too, which are often meticulous. Even where labs are doing a good job tracking the info (vendors, catalog numbers, stable unique identifiers, etc.), that info isn’t getting into the publications.
Tools to help researchers are emerging. Unique identifiers for resources are available in some places – e.g., biosharing.org. But in experiments, resources can also be software and tools. There needs to be more registry-like oversight of the identifiers and controlled vocabularies that are needed.
So, one of the projects her group is now working on is the Resource Identification Initiative, promoting unique RRIDs (Research Resource IDs). In line with FORCE11, RRIDs should be machine readable, free to generate and access, and used consistently across publishers and journals. To aid in discovery, RRIDs should be used in methods sections and as keywords in published articles. Even though this is a very recent project, RRIDs are getting used, and where they are being used, they are correctly used about 90% of the time.
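Because RRIDs follow a predictable, machine-readable syntax, even a simple script can pull them out of a methods section. Here is a minimal sketch in Python; the prefix list (AB_ for antibodies, SCR_ for software tools, CVCL_ for cell lines) covers only a few common resource types, and the specific identifiers in the example text are made up for illustration.

```python
import re

# Match RRID citations of the form RRID:<PREFIX>_<id>.
# Assumption: only three common prefixes are handled here;
# the real registries recognize more resource types.
RRID_PATTERN = re.compile(r"RRID:(?:AB|SCR|CVCL)_[0-9A-Za-z]+")

def find_rrids(text):
    """Return all RRID-like citations found in a block of text."""
    return RRID_PATTERN.findall(text)

# Hypothetical methods-section sentence with illustrative RRIDs.
methods = ("Cells were stained with an anti-GFP antibody "
           "(RRID:AB_123456) and imaged in SomeTool (RRID:SCR_654321).")
print(find_rrids(methods))  # ['RRID:AB_123456', 'RRID:SCR_654321']
```

This machine readability is the point: a publisher or aggregator can verify that every resource in a paper resolves to a registry entry, which is exactly the enforcement gap the study above identified.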
In their effort to help educate researchers, her boss entered a contest called the one-page challenge: What would you do with $1000 in order to…..
They won, and used the money to fund a Data Management Happy Hour to advertise their workshops, consultations, and other services, and to talk with researchers about their data. Seems like part of the reason it was a success (besides the wine) had to do with being very open about how everyone is learning how to do this better. They had a giveaway where people shared badly managed data sets or visualizations, got people laughing, and used the mistakes to make points about better practices and to establish themselves as useful consultants with relevant library services.
They also had a data wrangling open house for grad students, who are less immediately concerned with the use and re-use of data or reproducible science — they are really focused on getting through school and graduating. In order to do that, they need to be efficient and avoid mistakes in their data management practices, so Nicole and her colleagues involved grad students in organizing and promoting a data wrangling workshop.
Making new connections via data was the third part of the presentation. I learned, finally, the difference between an ontology and a controlled vocabulary – it’s not complicated, it just requires a clear explainer: a controlled vocabulary is an agreed-upon list of terms, while an ontology also defines the relationships among those terms. CTSA Connect is the project Nicole reviewed as making the connections alluded to in her third question. CTSA is explained on their own website, and it sounds like VIVO:
CTSAconnect aims to integrate information about research activities, clinical activities, and scientific resources by creating a semantic framework that will facilitate the production and consumption of Linked Open Data about investigators, physicians, biomedical research resources, services, and clinical activities. The goal is to enable software to consume data from multiple sources and allow the broadest possible representation of researchers’ and clinicians’ activities and research products. Current research tracking and networking systems rely largely on publications, but clinical encounters, reagents, techniques, specimens, model organisms, etc., are equally valuable for representing expertise. http://www.ctsaconnect.org/
Nicole and others have been working on the VIVO Integrated Semantic Framework (VIVO-ISF) ontology suite. The general idea as I understand it is to have a semantic framework for describing relationships among all the entities that are interesting to researchers trying to stay up-to-date in their fields. So there needs to be an ontology for resources as well as an ontology for people – a framework for revealing the relationships that are important about these kinds of entities.
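The Linked Open Data idea underneath this can be sketched very simply: facts are stored as subject–predicate–object triples, so software can link a researcher to reagents and publications in one uniform graph. Everything below is a toy illustration with made-up names and predicates, not the actual VIVO-ISF vocabulary.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# All URIs/predicates here are hypothetical, for illustration only.
triples = {
    ("ex:investigator/42", "ex:authored", "ex:article/981"),
    ("ex:investigator/42", "ex:usedResource", "ex:antibody/AB_123456"),
    ("ex:article/981", "ex:mentionsTechnique", "ex:technique/knockdown"),
}

def objects_of(subject, predicate):
    """Find everything a subject is linked to via a given predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

# Ask: which resources has investigator 42 used?
print(objects_of("ex:investigator/42", "ex:usedResource"))
# ['ex:antibody/AB_123456']
```

The payoff of the triple form is that publications, clinical encounters, and reagents all become the same kind of data, so "expertise" queries aren't limited to authorship.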
The website for the ontology group at OHSU is here: