CNI Fall 2014 Meeting – Closing Plenary

Closing Plenary – Cliff Lynch

A record-setting meeting – not just in attendance, but the number of session proposals far exceeded anything seen before.

Security and privacy have become pervasive issues. Concerns about security interplay with notions of moving services onto the net and depending on remote organizations and facilities. Security and privacy are separate things, though interrelated. Privacy itself has become a multi-headed thing: it was traditionally privacy from the state or privacy from your neighbors, but now there is a vast enterprise interested in data about you. There are people who think we have approached this wrong – that we should punish people for making nasty use of information rather than for failing to keep it secret. The Snowden revelations represent an enormous breach of security in organizations that are supposed to be the best and are well funded. Some of the revelations are about efforts to undermine security in the national and international networking infrastructure. That suggests that we have a lot to do in improving security – it is hard to believe in selective compromises that are only open to the good guys.

The Snowden breach is also a high-water mark of a trend that raises issues for archives and libraries – an example of a large and untidy database of material. This is not a leaked memo, or even the Pentagon Papers. Here we have a big dataset cached in various places. The government is still not comfortable with this, to the point that it cannot be used as reference material in classes taken by government employees, as they would be mishandling classified documents. How are we going to manage these important caches of source documents, and who is going to do it?

There are any number of other security and privacy problems you can read about in the press. The term “data breach” suggests a singular event, but there is evidence that many systems are compromised for a long period of time – that’s an important distinction. We are seeing a spectacular example at present with Sony, where it appears they may have lost control of their corporate IT infrastructure to the point where they may have to tear it down and build it again. The IETF is looking at design factors and algorithms – we need to do this in our community more systematically. There’s been good material coming out of the Internet2 and EDUCAUSE joint security task force. Some of this is easy – why are we sending things in the clear when we don’t need to? There is an underlying assumption that it’s a benign world out there. The whole infrastructure around the metadata harvesting protocol is open – who would want to inject bad data about repositories? We’re using these as major sources of inventories of research data – it would be good if they’re reasonably accurate. CNI will convene, probably in February, to start building a shopping list of things relevant to our community.

Two things that are harder and more painful to deal with. One is the sacrifices we need to make to get licenses for certain kinds of material, especially consumer-market materials. Look at the compromises public libraries have made to license materials for their patrons – pretty uncomfortable privacy choices. We need to reflect long and hard across the entire cultural memory sector. The other is levels of assurance – how rigorous you want to be with evidence in trusting someone. Some identities are issued with just an email address; others want to see your passport and your mother. It’s easier to do it right the first time, but sometimes if you do it right it can take forever. We’re building a whole new apparatus of factual biography and author identity – with no agreement or even discussion about what our expectations are around this. Do we want to trust people’s assertions, or do we want it verifiable? Part of the problem is that it’s hard to understand how big the problem is.

In the commercial sphere it is stunning how much we don’t know about how much personal information is passed around and reused.

Another clear trend: research data management. We are still waiting eagerly (and with growing impatience) for policies and ground rules from the funding agencies about implementing the OSTP directives. In the broader context we are seeing a focus on data, data sharing, and big data. Phil Bourne was appointed as the first associate director for data science at the NIH – the creation of that role underscores how important they see research data and data management to be. We are seeing this in other agencies and in business. City governments are getting involved in big data, and centers in urban informatics are emerging. SHARE will be a backbone inventory and analytic tool for understanding research data responsibilities – it has a clear idea where it’s heading and is starting to move along.

Still many things we’re not coping with well in this area – data around human subjects. Not sure we have a good conversation between those who are concerned about privacy and those who see what can be accomplished with information. A story that illustrates developments and fault lines: there’s a whole alternative social science emerging out there, with studies that could never have been done within universities but that are fairly respectful of privacy. Sometime earlier this year the Proceedings of the National Academy of Sciences published a paper jointly authored by researchers at Cornell and Facebook on emotional contagion – if your circle of friends share depressing information then you will reflect depressing information back. They came up with the idea to test this on Facebook, since you need a lot of people in the experiment. They twiddled the Facebook feed algorithm to bias toward depressing items and then did sentiment analysis, then looked at the people on the sending end of those items – around 60k people. They found there was a little truth in the theory. People started freaking out in various directions: academics (what IRB allowed this? where were the informed consent forms?); another group said this seemed fairly harmless, can’t really be done with informed consent, and should be viewed as a clever experiment. This hasn’t been resolved. Some people are worried that things that are normally product optimizations could be reframed as human subject experiments. Some people are wondering whether we don’t need a little regulation in this area. There was a conference at MIT on digital experiments. Large enterprises are doing thousands of tests a year, with sophisticated statistics, to tweak their optimizations. The part of the Facebook thing that was surprising is that there were a lot of unhappy Facebook users offended about the news feed algorithm being messed with – without understanding that there are hundreds of engineers at Facebook messing with it all the time. It put a spotlight on how little people understand how much their interactions are shaped by algorithms in unpredictable ways.

People don’t realize how personalized news has become – we don’t all see the same NY Times pages. What does it mean to try to preserve things in this environment? It’s an intellectually challenging problem that deserves attention as we think about what the important points are to stress in information literacy. What’s appropriate ethically? What about research reproducibility? What evidence can we be collecting to support future research?

There was a CLIR workshop on Sunday about things we have in archives and special collections that need to be restricted in some way or are in ambiguous status – e.g. things collected before 1900, when collecting and research practices were different. That is some of the only research we will ever have on some things and places, and we have to talk about them no matter how awkward.

Software – we often make casual statements about software preservation and sustainability. It is time to take a closer look. There is massive confusion about what sustainability means, the difference between sustainability and preservation, and what those terms mean; time for more nuance around this. Rates of obsolescence and change: in some sense it is desirable to keep everybody on the current version, but the flip side is that vendors have enormous motivations to put people through painful, frequent cycles of planned obsolescence. There is some evidence of better outcomes for backward compatibility with open source software. We need to understand the forces driving obsolescence cycles and what that implies in areas like digital humanities, where there’s not a lot of money to rewrite things every year. We’re seeing new tools based on virtualization technologies for software preservation.

CNI did an executive roundtable on supporting digital humanities at scale. It linked closely to digital scholarship centers – these are important mechanisms for diffusion of information on technology and methodology. One of the striking things is that it drew a lot of people who are looking at, or planning for, digital scholarship centers. There’s interest in a workshop for people planning such a center. There are lots of things that have the word “center” in them, with widely varying meanings. It might be a real help to summarize the points of disagreement and the different kinds of things parked under these headings.

If you ask the question how we are doing in terms of preserving and providing stewardship of cultural memory in our society (including but not limited to scholarly activity), nobody can answer. If you ask whether we are doing better this year than last, we have no idea how to answer. How much would it cost to do 50% better than now? Can’t answer that either. There have been some point investigations – like the Keepers activity, and studies from Columbia and Cornell on what proportion of periodicals are archived. Other than copyright deposit at LC we have no mechanism to get recordings into institutions that care about the cultural record. We’re in a slow-motion train wreck with the video and visual memory of the 20th century – it’s a big problem that, until recently, we couldn’t get a handle on. This will require a big infusion of funds in the interest of cultural memory. Indiana University took a systematic inventory of their problem and then was able to win a sizeable down payment from leadership to deal with it. The New York Public Library has done a study and is just sharing results – their numbers are bigger and scarier than Indiana’s. Getting surveys done is getting a bit easier. There are probably horrible things waiting to be discovered in terms of preserving video games. Preserving the news is another area. Part of the difficulty is the massive restructuring in some of these industries. It is helpful to think about this systematically in order to prioritize and measure our collective work.

CNI Fall 2014 Meeting: The NIH Contribution to the Commons

The NIH Contribution to the Commons
Philip E Bourne, Associate Director for Data Science, NIH

There’s a realization that the future of biomedical research will be very different – the whole enterprise is becoming more analytical. How can we maximize rate of discovery in this new environment?

We have come a long way in just one researcher’s career. But there is much to do: too few drugs, not personalized, too long to get to market; rare diseases are ignored; clinical trials are too limited in patients, too expensive, and not retroactive; education and training does not match current market needs; research is not cost effective – not easily replicated, too slow to disseminate. How do we do better?

There is much promise: the 100,000 Genomes Project – the goal is to sequence 100k people and use it as a diagnostic tool. A comorbidity network for 6.2 million Danes over 14.9 years – the likelihood that if you have one disease you’ll get another. An incredibly powerful tool based on an entire population. We don’t have the facilities in this country to do that yet – we need access and homogenization of data.

What is NIH doing?

Trying to create an ecosystem to support biomedical research and health care. Too much is lost in the system – grants are made, publications are issued, but much of what went into the publication is lost.

Elements of ecosystem: Community, Policy, Infrastructure. On top of that lay a virtuous research cycle – that’s the driver.

Policies – now & forthcoming
Data sharing: NIH takes it seriously, and now has mandates from government on how to move forward with sharing. Genomic data sharing is announced; data sharing plans on all research awards; data sharing plan enforcement: machine-readable plans, repository requirements to include grant numbers. If you say you’re going to put data in repository X on date Y, it should be easy to check that that has happened and then release the next funding without human intervention. They are actually looking at that.
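To make the enforcement idea concrete, here is a minimal sketch of the kind of automated check this implies, assuming a hypothetical machine-readable plan format and a hypothetical repository API queryable by grant number (neither is an actual NIH or repository specification):

```python
# Hypothetical sketch: automated enforcement of a machine-readable data sharing plan.
# The plan fields, repository endpoint, and response fields are illustrative assumptions.
import datetime
import requests

plan = {
    "grant_number": "R01-XX-000000",                             # placeholder grant identifier
    "repository_api": "https://repo.example.org/api/datasets",   # hypothetical endpoint
    "promised_date": "2015-06-30",
}

def plan_fulfilled(plan):
    """Return True if a dataset tagged with the grant number was deposited by the promised date."""
    resp = requests.get(plan["repository_api"],
                        params={"grant_number": plan["grant_number"]})
    resp.raise_for_status()
    records = resp.json()                                        # assumed: list of deposit records
    deadline = datetime.date.fromisoformat(plan["promised_date"])
    return any(datetime.date.fromisoformat(r["deposited"]) <= deadline for r in records)

if plan_fulfilled(plan):
    print("Data sharing plan satisfied; next funding increment can be released.")
```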

Data citation – elevate it to be considered by NIH as a legitimate form of scholarship. Process: a machine-readable standard for data citation (done – a JATS (XML ingested by PubMed) extension); endorsement of data citation in the NIH biosketch, grants, reports, etc.

Infrastructure -
Big Data to Knowledge (BD2K) initiative. Funded 12 centers of data excellence – each associated with different types of data. Also funded a data discovery index consortium, building the means of indexing and finding that data. It’s very difficult to find datasets now, which slows the process down. The same can be said of software and standards.

The Commons – A conceptual framework for sharing and being FAIR: Finding, Accessing, Integrating, Reusing
Digital research objects with attribution – can be data, software, narrative, etc. The Commons is agnostic of computing platform.

Digital Objects (with UIDs); Search (indexed metadata)

Public cloud platforms, super computing platforms, other platforms

Research object IDs are under discussion by the community – BD2K centers, NCI cloud pilots (Google and AWS supported), large public datasets, MODs. There is a meeting in January in the UK – could be DOIs or some other form.

Search – BD2K data and software discovery indices; Google search functions

Appropriate APIs are being developed by the community, e.g. the Global Alliance for Genomics and Health. I want to know what variation there is in chromosome 7, position x, across the human population. With the Commons, more of those kinds of questions can be answered. Beacon is an app being tested – testing people’s willingness to share and people’s willingness to build tools.
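A hedged sketch of the kind of Beacon-style “is this variant present?” query being described – the endpoint URL and parameter names here are illustrative assumptions, not the actual GA4GH interface:

```python
# Hedged sketch of a Beacon-style variant presence query.
# The base URL, parameter names, and response field are assumptions for illustration;
# consult the GA4GH Beacon documentation for the real interface.
import requests

def beacon_query(base_url, chromosome, position, allele):
    """Ask a beacon whether any dataset behind it contains the given variant."""
    resp = requests.get(f"{base_url}/query", params={
        "chromosome": chromosome,   # e.g. "7"
        "position": position,       # coordinate convention depends on the beacon
        "allele": allele,           # alternate base(s) observed
    })
    resp.raise_for_status()
    return resp.json().get("exists")

print(beacon_query("https://beacon.example.org", "7", 140453136, "T"))
```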

The Commons business model: what happens right now is people write a grant to NIH, with line items to manage data resources. If successful they get money – then what happens? Maybe some of it gets siphoned off to do something else, or equipment gets bought that is not heavily utilized. As we move more and more to cloud resources, it’s easier to think about a business model based on credit. Instead of getting hard cash you’re given an amount of credit that you can spend with any Commons-compliant service, where compliance means they’ve agreed to share. The institution could be part of the Commons, or it could be a public cloud or some other kind of resource. It creates more of a supply-and-demand environment and enables public/private partnership. They want to test the idea that more can be done with computing dollars. NIH doesn’t actually know how much it spends on computation and data activities – but it is undoubtedly over a billion dollars per year.

Community: training initiatives. Build an OPEN digital framework for data science training: an NIH data science workforce development center (a call will go out soon). How do you create metadata around physical and virtual courses? Develop short-term training opportunities – e.g. a supported workshop with the gaming community. Develop the discipline of biomedical data science and support cross-training – OPEN courseware.

What is needed? Some examples from across the ICs:
Homogenization of disparate large unstructured datasets; deriving structure from unstructured data; feature mapping and comparison from image data; visualization and analysis of multi-dimensional phenotypic datasets; causal modeling of large-scale dynamic networks and subsequent discovery.

NIH is in the process of standing up the Commons with two or three projects – centers funded from BD2K who are interested, working with one or two public cloud providers. Looking to pilot specific reference datasets – how to stimulate accessibility and quality. Having discussions with other federal agencies who are also playing around with these kinds of ideas. The FDA is looking to burst content out into the cloud. In Europe, ELIXIR is a set of countries standing up nodes to support biomedical research; there are discussions to see if that can work with the Commons.

There’s a role for librarians, but it requires significant retraining in order to curate data. You have to understand the data in order to curate it. Being part of a collective that is curating data and working with faculty and students who are using that data is useful, but it’s a cultural shift. The real question is: what’s the business model? Where’s the gain for the institution?

We may have problems with the data we already have, but that’s only a tiny fraction of the data we will have. The emphasis on precision medicine will increase dramatically in the next while.

Our model right now is that all data is created equal, which is clearly not the case. But we don’t know which data is created more equal. If a grant runs out and there is a clear indication that the data is still in use, perhaps there should be funding to continue maintenance.

CNI Fall 2014 Meeting – VIVO Evolution

Evolution of VIVO Software

Layne Johnson, VIVO Project Director, DuraSpace

VIVO History – started at Cornell in 2003. 2009-12 NIH funded VIVO ($12 million) to evolve.

Problems – researchers struggle to identify collaborators; most information and data are highly distributed, difficult to access, reuse, and share, and are not standardized for interoperability.

VIVO can facilitate collaborations and store disparate information in the VIVO-ISF ontology.

What is VIVO? An open-source, semantic web application that enables management and discovery of research and scholarship across disciplines and institutions.

VIVO harvests data from authoritative sources thus reducing manual input and providing integrated data sources. Internal data from ERPs, external data from bibliographic sources, ejournals, patents, etc.

VIVO data stored as RDF.

Triple stores and linked open data: they provide the ability to inference and reason; can be machine readable; link into the open data cloud; provide links into a wide variety of information sources from different interoperable ontologies; and allow knowledge about research and researchers to be discovered.
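As a small illustration of the triple-store idea (not VIVO code), the sketch below uses rdflib to assert a researcher-to-publication link as RDF and query it with SPARQL; the individual URIs and the property used are simplified stand-ins for VIVO-ISF ontology terms:

```python
# Illustrative sketch: researcher/publication links as RDF triples, queried with SPARQL.
# URIs and the authorship property are simplified assumptions, not the full VIVO-ISF model.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://vivo.example.edu/individual/")
VIVO = Namespace("http://vivoweb.org/ontology/core#")

g = Graph()
person = EX["n1234"]
article = EX["pub5678"]
g.add((person, RDF.type, URIRef("http://xmlns.com/foaf/0.1/Person")))
g.add((person, VIVO["relatedBy"], article))                      # simplified authorship link
g.add((article, URIRef("http://purl.org/dc/terms/title"),
       Literal("A paper on cross-institution collaboration")))

# Find everything a person is related to, with titles.
for row in g.query("""
    SELECT ?work ?title WHERE {
        ?person <http://vivoweb.org/ontology/core#relatedBy> ?work .
        ?work <http://purl.org/dc/terms/title> ?title .
    }"""):
    print(row.work, row.title)
```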

VIVO supports search & exploration – by individual, type, relationship, combinations and facets.

One of the larger implementations is USDA VIVO. Another interesting one is Find an Expert at the University of Melbourne. Scholars@Duke. Mountain West Research Consortium has a cross consortium search. The Deep Carbon Observatory data portal uses VIVO.

Installed base of VIVO implementations has remained somewhat level.

VIVO Evolution: from grant funding to open source. In 2012-13 VIVO partnered with DuraSpace, who provides infrastructure and leadership – legal, tax, marketing, communication. Sustained through a community membership model. A VIVO project director was hired May 1.

Charter process – Jonathan Markow & steering group. Based on the DuraSpace model for consistency across products. Charter finalized in late July 2014.

VIVO governance: leadership group, steering group, management team. Four working groups: Development & Implementation; Applications & Tools; VIVO-ISF Ontology; Community Engagement & Outreach (undergoing reconstitution).

Four levels of membership – $2.5k, $5k, $10k, $20k.

VIVO strategic planning: a 14-member strategy group created from the leadership, steering, and management teams plus external members. Met December 1 & 2. Did a survey of 41 VIVO leaders to determine the current state; got 20 respondents. VIVO’s 3 strategic themes: community, sustainability, technology. The top 5 goals for each theme were selected, and each strategy group member got to vote for 3 goals per theme.

Community: increase productivity; develop more transparent governance; increase engaged contributors; maintain a current and dynamic web presence; develop goals for partnerships (ORCID, CRIS, CASRAI, W3C, SciEnCV, CRediT, etc.)

Sustainability: create welcoming community; develop clear value proposition; increase adoption; promote the value of membership.

Technology: Develop democratic code processes; clarify core architecture and processes; develop VIVO search; improve/increase core modularity; team-based development processes.

CNI Fall 2014 Meeting: Fedora 4 early adopters

Fedora 4 Early Adopters

David Wilcox, Fedora Product Manager, DuraSpace

Fedora 4.0 released November 27. Built by 35 Fedora community developers. Native citizen of the semantic web – linked data platform service. Hydra and Islandora integration.

Beta pilots – Art Institute of Chicago, Penn State, Stanford, UCSD.

62 members in support of Fedora; funding increased dramatically (over $500k). Effort around building sustainability – more members at lower funding amounts. Governance model – leadership and steering groups.

Fedora 4 roadmap – short term (6 months) – 4.1 will support migrations from Fedora 3. Want to establish migration pilots, and prioritize 4.1 features.

Fedora 4.1 features – focus on migrations, but some new features – API partitioning, Web Access Control, Audit service, remote/asynch storage are candidates.

Fedora 4 training – 3 workshops held in October (DC, Australia, Colorado), more planned for 2015.

It is possible that Fedora 4 could be a back-end for VIVO.

If you want to go with Hydra at this point you should go to Fedora 4, not 3.

Declan Fleming, UCSD

Original goals – map UCSD’s deeply-nested metadata to simpler RDF vocabularies, taking advantage of Fedora 4’s RDF functionality. Ingest UCSD DAMS4’s 71k objects using different storage options to compare ingest performance, functionality, and repository performance. Synchronize content to disk and/or an external triple store.

Current status – initial mapping of metadata completed for pilot work. Ingested a sample dataset using multiple storage options: Modeshape, federated filesystem, and hybrid (Modeshape objects linked to federated filesystem files). Ingested the full UCSD DAMS4 dataset into Fedora 4 using Modeshape.

Ongoing work – continuing to refine the metadata mapping, as part of the broader Hydra community push toward interoperability and pluggable data models. Full-scale ingest with simultaneous indexing, full-scale ingest with hybrid storage (about ready to give up on that and embrace Modeshape), performance testing.

Over time, ingest of metadata slowed down – they use a lot of blank nodes, which adds to the complexity of the structure; that might be the reason.

File operations were very reliable. Didn’t test huge files rigorously.

Stefano Cossu – Art Institute

DAMS project goals – it will take over part of the current Collection Management System’s duties: 270k objects, 2/3 of which are digitized. Strong integration with existing systems, adoption of standards, a single source for institution-wide shared data. Meant to become a central hub of knowledge.

LAKE – Linked Asset and Knowledge Ecosystem. Integrates with CITI (collection management system) which is the front-end to Fedora (LAKE) which acts as the asset store.

Why Fedora? Great integration capabilities, very adaptable, built on modern standards, focus on data preservation. Makes no assumptions about front-end interface. REST APIs. Speaks RDF natively.

Key features for the AIC – Content modeling, federation, asynchronous automation, external indexing, flexible storage.

Content modeling: adding/removing functionality via mix-ins. Can define types and sub-types. Spending lots of time building a content model, which serves as a foundation for the ontology. Still debating whether JCR is the best model for building a content model. Additional content control is on their wish list.

Asynchronous automation: used Modeshape sequencers so far. The Camel framework offers more generic functionality and flexibility. Uses: extract metadata on ingestion, create/destroy derivatives based on node events, index content.

Filesystem federation to access external sources, custom database connector.

Indexing: multiple indexing engines – powerful search/query tools: triplestore, Solr, etc.

Tom Cramer – Stanford

Exercising Fedora as a linked data repository – introducing Triannon and Stanford’s Fedora 4 beta pilot.

Use case 1: digital manuscript annotations. Used the Open Annotation (W3C working group) approach to map annotations into RDF. Tens of thousands of annotations – where to store, manage, and retrieve them?

Use case 2: Linked Data for Libraries. Bibliographic data, person data, circulation and curation data. Build a virtual collection without enriching the core record, using linked data to index and visualize.

Need an RDF store; need to persist, manage, and index. Not the ILS nor the core repository – this is a fluid space while the repository is stable and reliable. All RDF / linked data.

Fedora was a good fit: Native RDF store, manage assets (bitstreams), built in service framework (versioning, indexing, APIs), easy to deploy.

Linked Data Platform (LDP): a W3C draft spec that enables read-write operations on linked data via HTTP. Developed at the same time as Fedora 4; Fedora 4 is one of a handful of current LDP implementations.

Stanford pilot: install, configure & deploy Fedora 4; exercise LDP API for storing annotations and associated text/binary objects; develop support for RDF references to external objects; test scale with millions of small objects; integrate with read/write apps and operations – annotation tools (e.g. Annotator), indexing and visualization (Solr and Blacklight)
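A minimal sketch of the LDP-style interaction in the pilot – creating an annotation resource by POSTing Turtle to a container over HTTP. It assumes a Fedora 4 instance at the default local REST endpoint and a container named “annotations”; adjust to the actual deployment:

```python
# Sketch of an LDP create operation against an assumed local Fedora 4 REST endpoint.
# The container path and target URI are placeholders for illustration.
import requests

FEDORA = "http://localhost:8080/rest/annotations"   # assumed LDP container URI

turtle = """
@prefix oa: <http://www.w3.org/ns/oa#> .
<> a oa:Annotation ;
   oa:hasTarget <http://example.org/manuscript/page1> .
"""

resp = requests.post(FEDORA, data=turtle.encode("utf-8"),
                     headers={"Content-Type": "text/turtle"})
resp.raise_for_status()
print("Created:", resp.headers.get("Location"))      # URI of the newly created resource
```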

Current: Annotator (Mirador) <- JSON-LD -> Triannon (a Rails engine for open annotations stored in Fedora 4) <-> LDP / Fedora 4.

Future: Blacklight and Solr.

Learned to date: Fedora 4 is approaching 100% LDP 1.0 compliance; Triannon is at alpha stage (can write, read & delete open annotations to/from Fedora 4). Still to come: updates to annotations, storage of binary blobs in Fedora, implementing authn/z, deploying against real annotation clients, populating with data at scale.

Looking at Fedora 4 as a general store for enriching digital objects and records through annotating, curating, tagging.

CNI Meeting Fall 2014: Managing Research Data

Managing Research Data: Some Ins and Outs
Joyce Ray, Johns Hopkins University
Geneva Henry, George Washington University
Michele Kimpton, DuraSpace
Melissa Levine, University of Michigan

Based on a book: Research Data Management – Practical Strategies for Information Professionals.

Overview of the volume:
– Policy context
– Planning
– Managing active data
– archiving and managing data long-term
– measuring success
– case studies
– what’s next

Common themes:
– planning is essential and ongoing
– essential infrastructure goes beyond software and your own institution – it includes tools, services, policies, and communities of practice
– collaboration internally and externally helps to maximize institutional investment
– value of managing research data is not yet proven

Geneva Henry

Data Curation for the Humanities (based on work done at Rice)

- Where’s the data in humanities research?

Digital content enables structure to be added to otherwise unstructured resources (metadata, OCR, text markup, georeferencing, 3d recreations of space)

Data enables new research never before possible (social network analysis to discover historic relationships, time and space analysis, closer inspection of historic sites, content analysis for frequency of terms across multiple works)

- Creating sustainable digital content and what it takes to curate it

Digital content is powerful, but enhancements must be maintained and reusable. There is a balance between extreme markup/annotation and minimal additions of interpretation, and a question of managing elements associated with a work as separate objects. Example of migrating content across newer versions of TEI.

- teams and infrastructure

Domain expertise and technical expertise needed for success – partnering between academic faculty and librarians and technologists is powerful. Don’t go in with the attitude that this is just a service provided – scholars should be involved in considering markup. Opportunities for learning new skills. Platforms that can handle varied content.

- Project case studies – Our Americas Archive Partnership; Travelers in the Middle East Archive; Shepherd School of Music collection; Rice ephemera archive; Houston Asian American Archive Oral Histories collection.

Michele Kimpton – Archiving research data in the cloud or in a local repository

Did a survey of people in the DuraSpace community around practices and use cases.

Common issues: Where can I put my data for long-term access? How do I make it discoverable, reusable, reproducible? What metadata, provenance, and identifiers should I use? (very much an emerging set of practices). What policies should be in place for archiving and preserving data? (multiple locations? associated costs?). How do I fund this?

Data management in DSpace – new features in DSpace 5.0 related to data management and archiving (coming out at the end of this month): DOI support via EZID, ORCID integration, linked open data support, integration with DuraCloud.

Data management in Fedora – last week Fedora 4 became available. Supports linked open data; content modeling; versioning; large files; fixity checking; external, asynchronous storage.

DataONE project – humanities, social sciences, earth science. 80% of files are Excel or comma-delimited – the long tail of data.

Commercial cloud solutions – they attract end users because they solve an immediate need without adding a ton of work for the end user: share, collaborate, or meet a mandate from a publisher or funding agency. But there are little to no preservation practices in place; no stated or unstated long-term data management practices; long-term access is at risk, reliant on investors’ interest and success in the market; and there is a lack of trust and control in the academic community.

Publishers are paying for storage of data in Figshare.

Community-based cloud solutions: DuraCloud (in partnership with DPN and Chronopolis); Center for Open Science; Dryad (UNC, based on DSpace); Zenodo.

Questions: Is it open source? Are the policies transparent? What is the governance? Are there policies to preserve the data?

POWRR study from IMLS – comprehensive study of archiving tools.

Melissa Levine

Availability-Usability Gap – Copyright, open data, and the availability-usability gap: challenges, opportunities, and approaches for libraries.

The big ideals: “data as the new gold” – we are building the mines, can we make them safe and stable?

Principles: the Denton Declaration, 2012. Research data, when repurposed, has accretive value; publicly funded research should be publicly available for the public good (issues with commercial providers); transparency in research is essential to sustain public trust; validation of research data by the peer community is essential to the function of responsible research; managing research data is the responsibility of a broad community of stakeholders including researchers, funders, institutions, libraries, archivists, and the public.

Still early times for open access/data: open access funding mandates (NIH 2008, NSF 2010, OSTP 2013). Simple things are complex: making sure authors have the rights they need to deposit – they can only pass on what they have to give. Trending toward proactive planning in required data plans (data considered at the front end, well before publication).

Issues: technology outpaces law; law is harder than technology because it’s about people; different countries differ; different disciplines differ; hiding data (squeezing the last bit of publication before sharing; disincentives to share); data citation – chain of title; public versus private interest; cost – good metadata is expensive.

Progress: Research Data Alliance; CODATA; DataCite; DataONE – best practices primer; Databib; ORCID; new roles for libraries as hubs of expertise – even if it’s to other parts of the enterprise.

CNI meeting Fall 2014: SHARE update

SHARE update
Tyler Walters, SHARE director and dean of libraries at Virginia Tech
Eric Celeste, SHARE technical director
Jeff Spies, Co-founder/CTO at Center for Open Science

SHARE is a higher education initiative to maximize research impact. (huh?)

Sponsored by ARL, AAU, and APLU.

Knowing what’s going on and keeping informed of what’s going on.

Four working groups addressing key tasks: repository, workflow, technical, communication

Received $1 million from IMLS and Sloan to generate a notification service.

SHARE is a response to the OSTP memo, but roots before that.

Infrastructure: Repositories, research network platforms, CRIS systems, standards and protocols, identifiers

Workflow – multiple silos = administrative burden

Policy – public access, open access, copyright, data management and sharing, internal policies.

Institutional context: US federal agencies join a growing trend to require public access to funded research; measurable proliferation of institutional and disciplinary repositories; a premium on impact and visibility in higher ed.

Research context – scholarly outcomes are contextualized by materials generated in the process and aftermath of scholarly inquiry. The research process generates materials covering methods employed, evidence used, and formative discussion.

Research libraries: collaboration among institutions is going up; a shift from collections as products to collections as components of the academy’s knowledge resources; the library supports, and is embedded within, the process of scholarship.

Notification Service: Knowing who is producing what, and under whose auspices, is critical to a wide range of stakeholders – funders, sponsored research offices, etc.

Researchers produce articles, preprints, presentations, datasets, and also administrative output like grant reports and data management plans. Research release events. Meant to be public.

Consumers of research release events: repositories, sponsored research offices, funders, the public. Interest in process as well as product. Today each entity must arrange with every other to learn what’s going on. The notification service shares metadata about research release events.

The Center for Open Science has partnered with SHARE to implement the notification service. http://bit.ly/sharegithub/

Looking for feedback on proposed metadata schema, though the system is schema agnostic.

API – a push API and content harvesters (pulling data in from various sources). Now have 24 providers and adding more; 16 use OAI-PMH while 8 use non-standard metadata formats.
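For the OAI-PMH side, a sketch of a simple pull harvest is below; the provider URL is a placeholder, while ListRecords and oai_dc are standard OAI-PMH verbs and prefixes (a real harvester would also handle resumption tokens, omitted here):

```python
# Sketch of a minimal OAI-PMH pull harvest of Dublin Core records from one provider.
import requests
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
          "dc": "http://purl.org/dc/elements/1.1/"}

def harvest(base_url):
    """Yield (identifier, title) pairs from one page of an OAI-PMH feed."""
    resp = requests.get(base_url, params={"verb": "ListRecords",
                                          "metadataPrefix": "oai_dc"})
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    for record in root.iterfind(".//oai:record", OAI_NS):
        ident = record.findtext(".//oai:identifier", namespaces=OAI_NS)
        title = record.findtext(".//dc:title", namespaces=OAI_NS)
        yield ident, title

for ident, title in harvest("https://provider.example.org/oai"):   # placeholder provider
    print(ident, title)
```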

Harvested data gets put into the Open Science Framework, which pushes out RSS/Atom, PubSubHubbub, etc. It sits on top of Elasticsearch. You can add a Lucene-format full-text search to a data request.
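A hedged sketch of attaching a Lucene-style query string to a search request against an Elasticsearch-backed index; the endpoint URL and parameter names are assumptions for illustration, not the documented SHARE API:

```python
# Hypothetical sketch: full-text search with a Lucene query string against an
# Elasticsearch-backed endpoint. URL and parameters are illustrative assumptions.
import requests

def search_events(endpoint, lucene_query, size=10):
    """Run a Lucene-syntax full-text query and return the JSON response."""
    resp = requests.get(endpoint, params={"q": lucene_query, "size": size})
    resp.raise_for_status()
    return resp.json()

results = search_events("https://share.example.org/api/search",      # placeholder endpoint
                        'title:"data management" AND source:arxiv')
print(results)
```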

250k research release events so far. arXiv and CrossRef are the largest providers. Averaging about 900 events per day, but averaging 2-3k per day over the last few days as new providers are added.

Developed push protocol for providers to push data rather than waiting for pull.

Public release: Early 2015 beta release, fall 2015 first full release.

Some early lessons: metadata rights issues – some sites are not sure about their right to, for example, share abstracts. Is there an explicit license for the metadata (e.g. CC Zero)?

Inclusion of identifiers – some key identifiers need to be available in order to create effective notifications. Most sources do not even collect email addresses of authors, much less ORCID or ISNI. Most sources make no effort to collect funding information or grant award numbers. Guidelines? See https://www.coar-repositories.org

Consistency across providers – reduce errors, simplify preparing for new providers. Required for push reporting.

Next layer: Reconciliation service – takes output of notification service to create enhanced and interrelated data set.

SHARE Discovery – searchable and friendly.

Phase 2 benefits – researchers can keep everyone informed by keeping anyone informed; institutions can assemble a more comprehensive record of impact; open access advocates can hold publishers accountable for promises; other systems can count on consistency of metadata from SHARE.

Relation to Chorus – when items get into Chorus that is a research release event; hopefully it will get into the notification service.

CNI 2014 Fall meeting – Opening Plenary

I’m in DC for the annual fall meeting of the Coalition for Networked Information. This time the opening plenary is a discussion moderated by Cliff Lynch, CNI’s Executive Director, and including Tom Cramer (Chief Technology Strategist, Stanford University Libraries), Michele Kimpton (Chief Executive Officer, DuraSpace), and James Hilton (Dean of Libraries & Vice Provost for Digital Educational Initiatives, University of Michigan).

Cliff – There have been notable successes in people launching community source projects over the last 10-15 years. But the landscape is changing: the economic model and speed of development are looking shaky, and there is an accelerated move to single- or multi-tenant arrangements run elsewhere. Where does this leave you when you come to the point in the lifecycle when you need to think about new systems? How do we engage with new opportunities in the MOOC and Unizin space?

Community source – what is it and does it really have a future?

James Hilton – Community source is not going away. Is community source the same as open source? Open source is often used synonymously with a developer-autonomy-centric model. Compare Kuali with Sakai. How do you organize the labor that produces an outcome? We have many more tools to tune development – different organizational models can work.

Michele Kimpton (DuraSpace) – Tuning of the community development model. If you want to collaborate and develop code together, that’s a community model. Code doesn’t advance if people are doing customization at their institutions themselves. Need to invest and be transparent to advance the code base.

Tom Cramer – Many forms of community – one form is to have a centralized organization, but just as many examples where the community is grass-roots driven from the edges, like Fedora.

Is there a trend taking us toward or away from the grassroots model, toward funding a central model?

Tom Cramer – Examples on both sides. A central organization can bring focus, but so can grassroots efforts – e.g. Blacklight, the faceted browser for Solr.

James H – as the scale of investment goes up, the pressure to organize and centralize goes up.

Is the presence of serious commercial players a factor in central vs. grass-roots?

Tom Cramer – if there’s an absence of commercial players, that can buy space and time for grass-roots organizing. A central authority can make missteps, whether community or commercial.

James – Unizin is trying to organize community effort around content and analytics standards. They made a decision to adopt commercial software – in part because they wanted the speed that came with that, contingent on contracts giving the control needed.

Michele Kimpton – Two models – when commercial entity makes product open-source that gives an exit strategy, but it’s not community controlled. Really serving core paying customers.

Tom Cramer – Community source projects have failed where they’ve been gated communities – fail to channel the interests outside the gates. Also true of vendor solutions – unless you can tap the bigger market it will be a problem.

James – The Unizin focus is on creating relays that will be as agnostic as possible. Community development is in building workflows using repositories, not in refining the LMS. It’s not about software, it’s about business and economic models.

Cliff – moving on to talk about the shift of software from local to redundant network hosting. There seems to be a big move in that direction. We are seeing what would have been community source before now taking on character as community service – like DPN, APT, etc. How does that change the landscape?

James – it makes you ask the question: what parts do I need to control, and what do I not need? Unizin is trying to figure out what parts need control. The LMS is core infrastructure – go for economy of scale. Focus control on building digital workflows, helping humanists and research scientists know where stuff goes.

Michele Kimpton – 1700 institutions are running DSpace, and it is difficult for them to upgrade to new releases. DuraSpace wanted to provide a pathway for smaller institutions to run the latest code. Cloud infrastructure will flip IT in academic environments on its head. It will be hard to justify building data centers when they can buy IT as a service and buy only what they need. Can keep the same governing process and openness.

Tom Cramer – Running a data center and installing and maintaining software is not the core competency. It’s higher up in the stack, providing value to the community. Where do you want to maintain control? Curation, discovery, preservation.

Cliff – A lot of this software is getting big, volatile, and complex enough (especially in the security environment) that doing maintenance and configuration management is getting to be troublesome. But if you’re out in the cloud you still need to do version control and validation – is that a worthwhile tradeoff?

James – if you’re committed to running everything in this compliance environment, that is all you will do. What do we value as academic institutions? What do we bring that’s unique?

Cliff – A barrier to innovation is everybody forking off code and doing local adaptations. The sense is that in the future, with networked software as a service, that area of variation really goes down. Innovations can diffuse faster.

Tom Cramer – Perhaps we are getting better at managing diversity. We have seen lots of good examples of different communities putting enhancements back into the code base. That is a separate question from running software as a service. The commercial world is looking at securing different layers and diffusing innovation at those layers.

Michele Kimpton – There has been a lot of customization of both DSpace and Fedora, and that leads to frustration in upgrading. But customizations are needed. Part of the beauty of more innovation is you can look at aggregations across instances in the cloud – e.g. how do we aggregate pushing content into DPLA or DPN? Easier to do from cloud to cloud.

Tom Cramer – It’s standardization that enables that, not just cloud. e.g. standardized APIs.

Cliff – standards – are the places where standards are most applicable changing? There used to be a notion of standards that allowed replacement of building blocks within a system. Now that you move into a world of aggregated things, standards don’t mean as much – it may work just fine to be expedient.

James – challenge is how do you move standards at the pace of technology?

Tom Cramer – there is a role for standards based on the size of the pool you want to swim in. There are important communities of practice around loose coupling, whether informal or formal. Look at the number of people using Solr for searching.

Cliff – A puzzle about how patterns of innovation change. Community source projects came from the grassroots, where there is considerable technical expertise at the participating institutions. If we think about collective service-based aggregations, do local technical experts become scarcer, and does that imply less diversity of innovation?

James – if we can move innovation up the stack life gets better.

Tom Cramer – you don’t need to know how to run a server to have technical expertise. Successful solutions will figure out way to tap innovation coming from the edges. Be the community you want to be.

Cliff – you can look back and see the evolution – there used to be many organizations that had huge knowledge of global networking, but now it’s held in fewer institutions.

Michele Kimpton – if the developer can focus on developing and not on setting up a server and talking to IT, it increases innovation. You can throw things up and see if they work. The capital costs of innovation are so much lower. That’s why in the commercial space you see cloud-based services spawning all over the place.

Discussion of contracting and procurement – the legal folks have the same challenge we do in figuring out where we really need to be unique flowers. We all have indemnification and state rules. We don’t need 50 different ways to say it.

