Closing Plenary – Cliff Lynch
A record setting meeting – not just attendees, but number of session proposals was far in excess of what’s been seen before.
Security and privacy have become pervasive issues. Concerns about security interplay with notions of moving services onto the net and depending on remote organizations and facilities. Security and privacy are separate things, though interrelated. Privacy itself has become a multi-headed thing. It was traditionally privacy from the state or privacy from your neighbors; now there is a vast enterprise interested in data about you. There are people who think we have approached this wrong, that we should punish people for making nasty use of information rather than for failing to keep it secret. The Snowden revelations represent an enormous breach of security in organizations that are supposed to be the best and are well funded. Some of the revelations are about efforts to undermine security in the national and international networking infrastructure. That suggests that we have a lot to do in improving security – it is hard to believe in selective compromises that can only be open to the good guys.
The Snowden breach also is a highwater mark of a trend that raises issues for archives and libraries – an example of a large and untidy database of material. This is not a leaked memo, or even the Pentagon papers. Here we have this big dataset cached in various places. The government is still not comfortable with this to the point that it cannot be used as reference material in classes taken by government employees as they would be mishandling classified documents. How are we going to manage these important caches of source documents, and who is going to do it?
There are any number of other security and privacy problems you can read about in the press. The term “data breach” suggests a singular event. There is evidence that many systems are compromised for a long period of time – that’s an important distinction. Seeing a spectacular example at present with Sony, where it appears they may have lost control of their corporate IT infrastructure to the point where they may have to tear it down and build it again. The IETF is looking at design factors and algorithms – we need to do this in our community more systematically. There’s been good material coming out of the Internet2 and Educause joint security task force. Some of this is easy – why are we sending things in the clear when we don’t need to? There is an underlying assumption that it’s a benign world out there. The whole infrastructure around the metadata harvesting protocol is open – who would want to inject bad data about repositories? We’re using these as major sources of inventories of research data – would be good if they’re reasonably accurate. CNI will convene, probably in February, to start building a shopping list of things relevant to our community.
Two things that are harder and more painful to deal with: One is the sacrifices we need to make to get licenses for certain kinds of material, especially consumer market materials. Look at the compromises public libraries have made to license materials for their patrons – pretty uncomfortable with privacy choices. Need to reflect long and hard across the entire cultural memory sector. The other thing is levels of assurance – how rigorous do you want to be with evidence in trusting someone. Some identities are issued with an email address, others want to see your passport and your mother. It’s easier to do it right the first time, but sometimes if you do it right it can take forever. We’re building a whole new apparatus about factual biography and author identity – no agreement or even discussion about what our expectations are around this. Do we want to trust people’s assertions, or do we want it verifiable? Part of the problem is it’s hard to understand how big the problem is.
In the commercial sphere it is stunning how much we don’t know about how much personal information is passed around and reused.
Another clear trend: Research data management. We are still waiting eagerly (and with growing impatience) for policies and ground rules from the funding agencies about implementing OSTP directives. In broader context seeing focus on data, data sharing, big data. Phil Bourne was appointed as the first associate director for data science at the NIH – creation of that role underscores how important they see research data and data management. Seeing this in other agencies and in business. City governments are getting involved in big data and the emergence of centers in urban informatics. SHARE will be a backbone inventory and analytic tool for understanding research data responsibilities – has a clear idea where it’s heading and is starting to move along.
Still many things we’re not coping with well in this area – data around human subjects. Not sure we have a good conversation between those who are concerned about privacy and those who see what can be accomplished with information. A story that illustrates developments and fault lines: there’s a whole alternative social science emerging out there, with studies that could never have been done within universities but that are fairly respectful of privacy. Sometime earlier this year the Proceedings of the National Academy of Sciences published a paper jointly authored by researchers at Cornell and Facebook. Emotional contagion – if your circle of friends share depressing information then you will reflect depressing information back. Came up with idea to test this on Facebook. Need a lot of people in the experiment. They twiddled the Facebook feed algorithm to bias towards depressing items and then did sentiment analysis. Then looked at people on the sending end of those items. Around 60k people. They found there was a little truth in the theory. People started freaking out in various directions: academics (what IRB allowed this? where were the informed consent forms?); another group that said this seemed fairly harmless, can’t really be done with informed consent, and should be viewed as a clever experiment. This hasn’t been resolved. Some people are worried that things that are normally product optimizations could be reframed as human subject experiments. Some people are wondering whether we don’t need a little regulation in this area. There was a conference at MIT on digital experiments. Large enterprises are doing thousands of tests a year, with sophisticated statistics, to tweak their optimizations. The part of the Facebook thing that was surprising is there were a lot of unhappy Facebook users offended about the news feed algorithm being messed with – without understanding that there are hundreds of engineers at Facebook messing with it all the time.
Put a spotlight on how little people understand how much their interactions are shaped by algorithms in unpredictable ways.
People don’t realize how personalized news has become – we don’t all see the same NY Times pages. What does it mean to try and preserve things in this environment? Intellectually challenging problem that deserves attention as we think about what are the important points to stress in information literacy. What’s appropriate ethically? What about research reproducibility? What evidence can we be collecting to support future research?
There was a CLER workshop on Sunday about things we have in archives and special collections that need to be restricted in some way or are in ambiguous status. E.g. things collected before 1900 when collecting and research practices were different. That is some of the only research we will ever have on some things and places, and we have to talk about them no matter how awkward.
Software – we often make casual statements about software preservation and sustainability. Time to take a closer look. There is massive confusion about what sustainability and preservation mean and how they differ. Time for more nuance around this. Rates of obsolescence and change – in some sense desirable to keep everybody on current version, but the flip side is that vendors have enormous motivations to put people through painful frequent cycles of planned obsolescence. There is some evidence that there are better outcomes of backward compatibility with open source software. We need to understand the forces behind obsolescence cycles and what that implies in areas like digital humanities where there’s not a lot of money to rewrite things every year. We’re seeing new tools on virtualization technologies for software preservation.
Did an executive roundtable on supporting digital humanities at scale. Linked closely to digital scholarship centers – these are important mechanisms for diffusion of information on technology and methodology. One of the striking things is that we got a lot of people who are looking at the issue of, or planning for, scholarship centers. There’s interest in looking at a workshop for people planning such a center. There are lots of things that have the word “center” in them, with widely varying meanings. It might be a real help to summarize the points of disagreement and the different kinds of things parked under these headings.
If you ask the question how are we doing in terms of preserving and providing stewardship of cultural memory in our society (including but not limited to scholarly activity), nobody can answer. If you ask are we doing better this year than last we have no idea how to answer. How much would it cost to do 50% better than now? Can’t answer that either. There have been some point investigations – like the Keepers activity. Studies from Columbia and Cornell on what proportion of periodicals are archived. Other than copyright deposit at LC we have no mechanism to get recordings into institutions that care about the cultural record. We’re in a slow motion train wreck with the video and visual memory of the 20th century – it’s a big problem that, until recently, we couldn’t get a handle on. This will require a big infusion of funds in the interest of cultural memory. Indiana University took a systematic inventory of their problem and then was able to win a sizeable down payment from leadership to deal with it. NY Public have done a study and are just sharing results – their numbers are bigger and scarier than Indiana’s. Getting surveys done is getting a bit easier. There are probably horrible things waiting to be discovered in terms of preserving video games. Preserving the news is another area. Part of the difficulty is the massive restructuring in some of these industries. Helpful to think about this systematically in order to prioritize and measure our collective work.
Tags: #cni14f, biomedicine, NIH, research
The NIH Contribution to the Commons
Philip E Bourne, Associate Director for Data Science, NIH
There’s a realization that the future of biomedical research will be very different – the whole enterprise is becoming more analytical. How can we maximize rate of discovery in this new environment?
We have come a long way in just one researcher’s career. But there is much to do: too few drugs, not personalized, too long to get to market; rare diseases are ignored; clinical trials are too limited in patients, too expensive, and not retroactive; education and training do not match current market needs; research is not cost effective – not easily replicated, too slow to disseminate. How do we do better?
There is much promise: 100,000 genomes project – goal is to sequence 100k people and use it as a diagnostic tool. Comorbidity network for 6.2 million Danes over 14.9 years – the likelihood that if you have one disease you’ll get another. Incredibly powerful tool based on an entire population. We don’t have the facilities in this country to do that yet – need access and homogenization of data.
What is NIH doing?
Trying to create an ecosystem to support biomedical research and health care. Too much is lost in the system – grants are made, publications are issued, but much of what went into the publication is lost.
Elements of ecosystem: Community, Policy, Infrastructure. On top of that lay a virtuous research cycle – that’s the driver.
Policies – now & forthcoming
Data sharing: NIH takes this seriously, and now has mandates from government on how to move forward with sharing. Genomic data sharing is announced; Data sharing plans on all research awards; Data sharing plan enforcement: machine readable plan, repository requirements to include grant numbers. If you say you’re going to put data in repository x on date y, it should be easy to check that that has happened and then release the next funding without human intervention. NIH is actually looking at that.
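To make the enforcement idea concrete, here is a minimal sketch of what an automated check could look like. It is not an NIH system or format: the record types, field names, grant number and repository name are all hypothetical, and it assumes the repository can report deposits tagged with a grant number, as the notes describe.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SharingCommitment:
    """One line of a machine readable data sharing plan (hypothetical shape)."""
    grant_number: str
    repository: str
    due: date


@dataclass(frozen=True)
class Deposit:
    """A deposit record as a repository might report it (hypothetical shape)."""
    grant_number: str
    repository: str
    deposited_on: date


def commitment_met(commitment, deposits):
    """True if some deposit matches the grant and repository and is on time."""
    return any(
        d.grant_number == commitment.grant_number
        and d.repository == commitment.repository
        and d.deposited_on <= commitment.due
        for d in deposits
    )


def release_next_funding(commitments, deposits):
    """Release funding only when every commitment in the plan has been met."""
    return all(commitment_met(c, deposits) for c in commitments)


# Illustrative data only; the identifiers are made up.
plan = [SharingCommitment("GRANT-0001", "repository-x", date(2015, 6, 30))]
reported = [Deposit("GRANT-0001", "repository-x", date(2015, 6, 1))]
funding_ok = release_next_funding(plan, reported)
```

The point of the sketch is that once both the plan and the repository records carry the grant number, the check is a simple join that needs no human in the loop.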
Data Citation – elevate to be considered by NIH as a legitimate form of scholarship. Process: machine readable standard for data citation (done – a JATS extension; JATS is the XML ingested by PubMed); endorsement of data citation in NIH biosketch, grants, reports, etc.
Big Data to Knowledge initiative (BD2K). Funded 12 centers of data excellence – each associated with different types of data. Also funded data discovery index consortium, building means of indexing and finding that data. It’s very difficult to find datasets now, which slows the process down. Same can be said of software and standards.
The Commons – A conceptual framework for sharing and being FAIR: Finding, Accessing, Integrating, Reusing
Digital research objects with attribution – can be data, software, narrative, etc. The Commons is agnostic of computing platform.
Digital Objects (with UIDs); Search (indexed metadata)
Public cloud platforms, super computing platforms, other platforms
Research Object IDs under discussion by the community – BD2K centers, NCI cloud pilots (Google and AWS supported), large public data sets, MODs. Meeting in January in UK – could be DOIs or some other form.
Search – BD2K data and software discovery indices; Google search functions
Appropriate APIs being developed by the community, e.g. Global Alliance for Genomics and Health. I want to know what variation there is in chromosome 7, position x, across the human population. With the Commons more of those kinds of questions can be answered. Beacon is an app being tested – testing people’s willingness to share and people’s willingness to build tools.
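As a rough illustration of the kind of question described here, the sketch below models a beacon-style lookup in memory: each site answers only yes or no to "has anyone at your site been observed with this allele at this position?". This is an assumption-laden toy, not the Global Alliance API; the class, method names, site names and the position value are all hypothetical.

```python
from collections import defaultdict


class ToyBeacon:
    """In-memory stand-in for one site's beacon (hypothetical, not a real API)."""

    def __init__(self, name):
        self.name = name
        # Maps (chromosome, position) to the set of alleles observed there.
        self._alleles = defaultdict(set)

    def add_observation(self, chromosome, position, allele):
        self._alleles[(chromosome, position)].add(allele.upper())

    def has_allele(self, chromosome, position, allele):
        """Answer yes/no only; no individual-level data leaves the site."""
        observed = self._alleles.get((chromosome, position), set())
        return allele.upper() in observed


def query_all(beacons, chromosome, position, allele):
    """Ask every participating site the same question and collect the answers."""
    return {b.name: b.has_allele(chromosome, position, allele) for b in beacons}


# Illustrative data only; position 12345 is made up.
site_a = ToyBeacon("site-a")
site_a.add_observation("7", 12345, "a")
site_b = ToyBeacon("site-b")
answers = query_all([site_a, site_b], "7", 12345, "A")
```

Returning only a boolean per site is what makes this a low-risk test of willingness to share: a site can participate without exposing records.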
The Commons business model: What happens right now is people write a grant to NIH, with line items to manage data resources. If successful they get money – then what happens? Maybe some of it gets siphoned off to do something else. Or equipment gets bought where it’s not heavily utilized. As we move more and more to cloud resources, it’s easier to think of a business model based on credit. Instead of getting hard cash you’re given an amount of credit that you can spend in any Commons compliant service, where compliance means they’ve agreed to share. Could be that institution is part of Commons or it could be public cloud or some other kind of resource. Creates more of a supply and demand environment. Enables a public/private partnership. Want to test idea that more can be done with computing dollars. NIH doesn’t actually know how much they spend on computation and data activities – but undoubtedly over a billion dollars per year.
Community: Training initiatives. Build an OPEN digital framework for data science training: NIH data science workforce development center (call will go out soon). How do you create metadata around physical and virtual courses? Develop short-term training opportunities – e.g. supported workshop with gaming community. Develop the discipline of biomedical data science and support cross-training – OPEN courseware.
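The credit idea can be sketched as a small ledger: an award is a balance of credits that can only be spent with providers that have agreed to the sharing terms. This is purely illustrative and assumes nothing about how NIH would implement it; the class, provider names and amounts are hypothetical.

```python
from collections import defaultdict


class CreditAccount:
    """Credits awarded to a grantee, spendable only at compliant providers."""

    def __init__(self, holder, credits, compliant_providers):
        self.holder = holder
        self.balance = credits
        self.compliant_providers = set(compliant_providers)
        self.spent_by_provider = defaultdict(int)

    def spend(self, provider, amount):
        """Debit credits, refusing non-compliant providers and overspending."""
        if provider not in self.compliant_providers:
            raise ValueError(f"{provider} is not a Commons compliant service")
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self.balance:
            raise ValueError("insufficient credits")
        self.balance -= amount
        self.spent_by_provider[provider] += amount


# Illustrative data only.
account = CreditAccount("lab-1", 100, {"campus-cluster", "public-cloud"})
account.spend("public-cloud", 30)
account.spend("campus-cluster", 20)
```

Because every debit is recorded per provider, a ledger like this would also give the funder the usage figures that the notes say are currently unknown.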
What is needed? Some examples from across the ICs:
Homogenization of disparate large unstructured datasets; deriving structure from unstructured data; feature mapping and comparison from image data; visualization and analysis of multi-dimensional phenotypic datasets; causal modeling of large scale dynamic networks and subsequent discovery.
In process of standing up Commons with two or three projects – centers being funded from BD2K who are interested, working with one or two public cloud providers. Looking to pilot specific reference datasets – how to stimulate accessibility and quality. Having discussions with other federal agencies who are also playing around with these kinds of ideas. FDA looking to burst content out into cloud. In Europe ELIXIR is a set of countries standing up nodes to support biomedical research. Having discussions to see if that can work with Commons.
There’s a role for librarians, but it requires significant retraining in order to curate data. You have to understand the data in order to curate it. Being part of a collective that is curating data and working with faculty and students that are using that data is useful, but a cultural shift. The real question is what’s the business model? Where’s the gain for the institution?
We may have problems with the data we already have, but that’s only a tiny fraction of the data we will have. The emphasis on precision medicine will increase dramatically in the next while.
Our model right now is that all data is created equal, which clearly is not the case. But we don’t know which data is created more equal. If grant runs out and there is clear indication that data is still in use, perhaps there should be funding to continue maintenance.
Tags: #cni14f, RDF, VIVO
Evolution of VIVO Software
Layne Johnson, VIVO Project Director, DuraSpace
VIVO History – started at Cornell in 2003. 2009-12 NIH funded VIVO ($12 million) to evolve.
Problems – Researchers struggle to identify collaborators; most information and data are highly distributed, difficult to access, reuse, & share, and are not standardized for interop.
VIVO can facilitate collaborations and store disparate information in the VIVO-ISF ontology.
What is VIVO? An open source, semantic web application that enables management and discovery of research and scholarship across disciplines and institutions.
VIVO harvests data from authoritative sources thus reducing manual input and providing integrated data sources. Internal data from ERPs, external data from bibliographic sources, ejournals, patents, etc.
VIVO data stored as RDF.
Triple stores and linked open data: provide the ability to infer and reason; can be machine readable; link into the open data cloud; provide links into a wide variety of information sources from different interoperable ontologies; allow knowledge about research and researchers to be discovered.
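For readers new to RDF, the following sketch shows the basic idea of triples, pattern queries and a simple inference rule using plain Python tuples. It does not use VIVO, a real triple store, or VIVO-ISF terms: the `ex:` identifiers and the co-author rule are made up for illustration.

```python
# Each fact is a (subject, predicate, object) triple. All names are hypothetical.
EXAMPLE_TRIPLES = {
    ("ex:alice", "ex:authorOf", "ex:paper1"),
    ("ex:bob", "ex:authorOf", "ex:paper1"),
    ("ex:carol", "ex:authorOf", "ex:paper2"),
    ("ex:alice", "ex:memberOf", "ex:deptA"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def infer_coauthors(triples):
    """Toy rule: two distinct authors of the same work are co-authors."""
    authored = match(triples, p="ex:authorOf")
    inferred = set()
    for person_a, _, work_a in authored:
        for person_b, _, work_b in authored:
            if work_a == work_b and person_a != person_b:
                inferred.add((person_a, "ex:coAuthorWith", person_b))
    return inferred


# Who wrote paper1? Which new relationships follow from the stored facts?
paper1_authors = {s for s, _, _ in match(EXAMPLE_TRIPLES, p="ex:authorOf", o="ex:paper1")}
coauthor_triples = infer_coauthors(EXAMPLE_TRIPLES)
```

The co-author relationship is never stored; it is derived from the authorship facts, which is the sense in which a triple store can support discovery of collaborators.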
VIVO supports search & exploration – by individual, type, relationship, combinations and facets.
One of the larger implementations is USDA VIVO. Another interesting one is Find an Expert at the University of Melbourne. Scholars@Duke. Mountain West Research Consortium has a cross consortium search. The Deep Carbon Observatory data portal uses VIVO.
Installed base of VIVO implementations has remained somewhat level.
VIVO Evolution: from grant-funding to open source. In 2012-13 VIVO partnered with DuraSpace, who provide infrastructure and leadership – legal, tax, marketing communication, leadership. Sustained through a community membership model. VIVO project director hired May 1.
Charter process – Jonathan Markow & Steering group. Based on DuraSpace model for consistency across products. Charter finalized in late July, 2014.
VIVO Governance: Leadership group, steering group, management team. Four working groups: Development & Implementation; Applications & Tools; VIVO-ISF Ontology; Community Engagement & Outreach (undergoing reconstitution).
Four levels of membership – $2.5k, $5k, $10k, $20k.
VIVO strategic planning: 14 member strategy group created from leadership, steering, management teams and external members. Met December 1 & 2. Did a survey to determine current state of 41 VIVO leaders. Got 20 respondents. VIVO’s 3 strategic themes: community, sustainability, technology. 5 top goals for each theme selected; each strategy group member got to vote for 3 goals per theme.
Community: increase productivity; develop more transparent governance; increase engaged contributors; maintain a current and dynamic web presence; develop goals for partnerships (ORCID, CRIS, CASRAI, W3C, SciEnCV, CRediT, etc.)
Sustainability: create welcoming community; develop clear value proposition; increase adoption; promote the value of membership.
Technology: Develop democratic code processes; clarify core architecture and processes; develop VIVO search; improve/increase core modularity; team-based development processes.