Archive for July 19th, 2011

Breakout session reports [ #rdlmw ]

    Secure research data

– Wanted to focus narrowly on where access to restricted datasets is important in research computing. In social sciences, researchers sometimes have to apply to analyze government data that is not public. Medical data is protected by regulation. Geospatial data research can use sensitive data on individuals. People working with industry sometimes have restrictions on data. Intellectual property has to be respected. Recommendations:

1. People who manage research computing environments want to know what federal standards need to be complied with – come up with a national working group on how to comply. There is a federal interagency working group on data which might be a good venue to communicate with.
2. A simple catalog of solutions from institutions on how to enable remote access to secure data. Use the Educause Cyberinfrastructure working group.
3. Catalog items for clinical translational study.

    Policy

Recommendations:
1. Develop a set of documentation (elevator speech, exec summary, and extensive report) to describe the need for policies and standards across disciplines as much as possible.
2. Develop workshop for university officers (VP of Research, Provost) to include them in discussions on how institutions can be involved.
3. Catalog of issues on data ownership and responsibility. Reduce mean time to discovery for researcher in how they should deal with their data.
4. Develop workshop for leaders of disciplinary communities.
5. Develop discipline-blind framework – what are the kinds of things a discipline needs to do to develop policies and standards?
6. University librarian is key in this role.
7. It’s time for the researchers to walk into the room with the librarians and say “we’re here”. – Brian Athey.

    Assessment and selection of research data

Is it really a goal to keep all data if possible? Good question.
Good practices with physical materials should be studied for guidance.
Expense of what it takes to manage data shouldn’t be primary consideration for what we keep.
Selection process has to be discipline specific.
What’s the cost of getting rid of something? Is reproduction of the data possible, and if so, what does that cost?
It’s easier to throw things away than to try to collect them after the fact. So collect and manage data before deciding to throw it away.
Researchers will have to provide at least core metadata.
Selection process is not yes/no but a continuum from minimal to full.

1. To make decision easier, develop a framework for making decisions. The researcher is a full partner in this.
2. Educate key audiences on the importance of curatorial concepts – researchers in all disciplines, and catch grad students now.
3. Encourage policy makers to rethink roles across the institution.

    Funding and operation

Recommendations for action:
1. Repository builders should collaborate – build with knowledge and forethought of others. Too many isolated repositories. Think federation.
2. Make data movable. Funding models will change over time. Should be movable from one caretaker to another.
3. Prepare for the hand-off. Anybody organizing a repository must put enough details in plan and budget to enable hand-off at the end of business cycle.
4. It would be useful to have a study of existing repository models.

    Partnering researchers, IT staff, librarians and archivists

30 people in this breakout!
1. Communication of what’s out there – what models exist? Portal that identifies workable solutions. What practices work for training – resources for cross-training?
2. Institute more training for grad students.
3. Substantial workshop report from here – task NSF with developing a generic framework that allows institutions to implement policies and appropriate procedures.
4. Hold a workshop to define best institutional practices in communicating between researchers and librarians.
5. Survey our campuses on data management practices.

    Standards for provenance, metadata, discoverability

Got into a discussion on “what is metadata” – anything that supports the core user needs for information. IFLA def – can you find it, can you identify it, can you select among resources, can you retain or reuse it? We want our metadata to be interoperable – move across repositories, workspaces, etc. We also want trustworthy and reliable data.
Core needs:
1. Common framework for data. Some are emerging, like METS.
2. Role of ontologies – domains recognizing standardized terminologies. Linked Data (semantic web) might be worth exploring for this.
3. Instrumented data – if numeric data is off, then data is useless. How do we know if the data is good? Huge gap in current data – need to work with instrument manufacturers. What captured this data? Usually entered manually.
4. Metadata needs to be captured at point of data creation.
5. Need standards of provenance – what’s the purpose of creating this data? Relationships between datasets are critical. Most scientists spend a long time exploring dimensions of the same set of problems.
Researchers want to develop their own metadata – treat it like any other data stream. Don’t worry about having to bring it into a structure.
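
Core needs 4 and 5 can be sketched together: record core metadata and provenance in a sidecar file at the moment a data file is written. This is only an illustration. The function name, the field names, and the `.meta.json` suffix are all assumptions made up for the example, not part of METS or any other standard discussed in the session.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def write_with_metadata(path, data, creator, instrument, purpose, derived_from=()):
    """Write data bytes to path, plus a JSON sidecar describing them.

    All field names below are hypothetical, chosen for illustration.
    """
    with open(path, "wb") as f:
        f.write(data)
    record = {
        "file": os.path.basename(path),
        "sha256": hashlib.sha256(data).hexdigest(),  # lets later users verify integrity
        "bytes": len(data),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "instrument": instrument,            # what captured this data
        "purpose": purpose,                  # why the data was created
        "derived_from": list(derived_from),  # relationships to other datasets
    }
    with open(path + ".meta.json", "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    return record


# Usage with made-up values, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "run-001.csv")
    meta = write_with_metadata(
        target,
        b"t,temp\n0,21.5\n",
        creator="example researcher",
        instrument="example thermometer",
        purpose="calibration run",
    )
    sidecar_exists = os.path.exists(target + ".meta.json")
```

Because the record is written by the same call that writes the data, nobody has to reconstruct the context months later.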

    Partnering funding agencies, research institutions, communities, and industrial and corporate partnerships

Recommendations:
1. Joint study of the feasibility of the “digital sheepskin”. Is there a model for a digital container that can be sustained through the ages, including metadata? We’ll probably have to invent some of the social context for this.
2. Conduct an aggregated study of TCO models using trusted party (academia) for storage for perpetuity or for ten years.
3. Identify the missing pieces of the research data software stack, and encourage collaborations between academia and industry.
4. A study on criteria for throwing data away, by discipline.
5. Continue to emphasize that data volume is growing much faster than our ability to move data around. Think about where we need to site data.
6. What are the possible models for joint activity with industrial partners?

Lightning Round! [ #rdlmw ]

John Fundine – USGS – the case for not keeping everything. They deal with observational records, now in the 5 petabyte range. They end up with orphans as programmatic sponsors change. Budgets are going down in government agencies – now looking at best case next year being the same as 2007-2008. What not to keep and who decides? Would advocate a formal appraisal process – as an archivist he owns the process but not the outcomes. They have a 30% disposal rate. Disposal is not the same as destruction – owners can find new homes.

Jim Myers – W3C Provenance Introduction. Group is in progress. Started from open source provenance group – data history tracking. Scope – want to talk about input objects being used to produce output objects. Also trying to add the distinction between document and file. Will be able to talk about how documents were funded, reports were derived from it, etc. Also talking about physical objects – the web of things.

Herbert J. Bernstein – We’re looking at too easy a problem. Should we be thinking about communicating with the future? Get out of the mode of thinking that what we do works – our processes produce errors. The data should be able to stand on its own legs without people, in a format that will be readable, unlike what we do now. Need to be much more conservative about our frameworks, and stop changing them.

Grace Agnew, Brian Womack – Spent about three years working with scientists on organizing data. Faculty didn’t remember context of data just months later – makes it hard to reuse data. Need to know what trial, who conducted it, etc. – entire provenance. Developed events-based data model that fits in a METS-based data model. Training a team internally of librarians – will support research data efforts going forward.

Jen Scheop – Woods Hole – Points from this morning – Not true that we don’t have to throw data out. Data are not like books – not vetted or standardized. Not true that large projects can find money for curation.

Scott Brant – Purdue – Data curation profiles (http://datacurationprofiles.org). Developed profile template. Got a grant to teach librarians how to interact with researchers. There is a toolkit available on the web site.

“DataONE (Observation Network for Earth): Enabling New Science by Supporting the Management of Data Throughout its Life Cycle” – Bill Michener, University Libraries, University of New Mexico [ #rdlmw ]

Defining the problem space –

Grand challenges are difficult. We’re using different languages and standards for how we deal with our domain data. Most scientists complain that they’re spending most of their time doing mundane data management and integration. Only a small part of their time is spent on the analysis.

The data deluge – lots of sensors.
Proliferation of citizen science programs. A whole new way of doing science.
Data silos. Lots of big repositories, tons of small ones, each using their own, non-interoperable data standards. Creates the long tail of orphan data – scattered worldwide.

Data entropy – most scientists are really familiar with their data just prior to publication. They may or may not document the intricacies of the data, and we lose the ability to use the data over time.

DataONE approaches
Community Engagement – funded for 1.5 years by NSF. Will be releasing infrastructure in December 2011. Started interacting with scientific community two years ago via interviews, surveys, etc. What are the challenges scientists are facing?

Recent study found (in earth sciences) >80% would be willing to share data, across a broad group of researchers.

Stakeholder needs – what are data management plans? How do I describe and preserve my data?

Brought an array of people into the room to look at continental bird migration. What do we need to answer this? 31 different data layers, including a single researcher in Utah with data in his desk. Data discovery is an issue. Needed lots of compute cycles, which was a shock. Took an initial 0.5 million hours on TeraGrid, and more later. Also needed visualization tools. One of the datasets used, eBird, is a citizen science data source. Produced State of the Birds report 2011.

Cyberinfrastructure support – goal is to enable new sciences through universal access to data about life on earth, the environment, plus access to key tools. Three precepts: 1. Build on existing cyberinfrastructure; 2. Create new cyberinfrastructure; 3. Support communities of practice – we’ve ignored this over time.

Member nodes – data repositories that already exist. Coordinating nodes – retain complete metadata catalog, indexing for search, network-wide services, ensure content availability (preservation), replication services (would like to see data in 3 or more repositories). Investigator toolkit – familiar to scientists, integrated with data resources.

First three member node prototypes – ORNL-DAAC, Dryad, KNB.
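
The replication goal (each dataset held in three or more repositories) is the kind of check a coordinating node could run over its catalog. Here is a minimal sketch, assuming a hypothetical catalog that maps dataset identifiers to the member nodes holding a copy. The structure, function name, and dataset identifiers are invented for illustration and are not DataONE's actual API.

```python
MIN_REPLICAS = 3  # target from the talk: data in 3 or more repositories


def under_replicated(catalog, min_replicas=MIN_REPLICAS):
    """Return {dataset_id: copies_missing} for datasets below the target.

    catalog maps a dataset identifier to the member nodes that hold a
    copy (hypothetical structure, for illustration only).
    """
    shortfalls = {}
    for dataset_id, nodes in catalog.items():
        copies = len(set(nodes))  # count distinct nodes, ignoring duplicates
        if copies < min_replicas:
            shortfalls[dataset_id] = min_replicas - copies
    return shortfalls


# Made-up dataset identifiers; node names are the three prototype member nodes.
catalog = {
    "dataset-a": ["ORNL-DAAC", "Dryad", "KNB"],
    "dataset-b": ["KNB"],
    "dataset-c": ["Dryad", "Dryad", "KNB"],
}
needs_copies = under_replicated(catalog)
```

In this example `dataset-b` needs two more copies and `dataset-c` needs one, since two entries on the same node count as a single replica.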

There’s beginning to be some evidence that when you share your data, citation rates to your publications go up.

Working with Microsoft Research to make Excel a more powerful tool.

Added (impending release) a Data Management Planning Tool (DMPTool) for building a data management plan – wizard driven.

DataONE includes a powerful data discovery tool.

Education and Training – there’s a lot to do! In DataONE, created DataONEpedia – best practices. Scientists want one pagers, not detailed manuals.

“Taking AIM at Data Lifecycle Management” – Jose-Marie Griffiths, Bryant University [ #rdlmw ]

Representing the point of view of chief research officers for this talk.

Most of the concerns relate to current economic conditions and uncertainties, particularly overhead costs. Also concerned about policies that turn into unfunded mandates. Concerned about roles and liabilities.

Size and scale issues – big universities can do things smaller ones can’t – need to find ways of federating so the smaller institutions can participate.

AIM – Access, Integrity, Mediation

Access – what goes in must be able to come out. Need to focus on users, defining “users” as widely as possible. Need metadata, which requires people. Also need to understand the costs of migrating data as technologies become obsolete.

ICPSR is a good model of an inter-institutional data consortium.

Interoperability – never easy. Referential integrity degrades over time. Decisions tend to get made on the fly.

Increased public access is a trend, supported by the government funding agencies. NSTC report expected next spring.

Integrity – We need to plan for preservation across the entire lifecycle. What are we going to share? Raw data? Processed, analyzed datasets? Instruments? Calibration? Analytical tools?

Mediation – needed at all stages of lifecycle. Where there is high intensity of interaction, it may make sense to have lots of replication and different mediation. Mediation may not always need to be formal, but for repositories and analysis it does need to be more formal. But must make sure that creating new repositories is not a solution in search of a problem.

Players and relationships among them are constantly shifting, vying for funding and attention. Issues about research directories. For data to be discovered, must have a shared overlay of connections. An ecosystem of multiple stakeholders.

Serge points out that large swaths of disciplines don’t have disciplinary repositories. Jose replies that there is a role for institutional repositories, but there are challenges – we don’t know enough about building a sustainable economic model. We don’t have good metrics about progress in cyberinfrastructure. All we have is number of high speed connections to institutions.

Serge Goldstein – DataSpace [ #rdlmw ]

I’ve reported on Serge’s experimental model at Princeton before, at http://blog.orenblog.org/2010/05/12/csg-spring-2010-storing-data-forever/

Funding and operational model for long-term preservation of research data. Piloting at Princeton.

Storing data forever.

What’s “forever”? We don’t usually tell people how long we keep stuff – like in libraries. We can treat data the same way as books – “indefinitely” – best effort to keep data around for a long time, which doesn’t have to be precisely defined.

Quotes Cliff Lynch – funding agencies don’t expect data to be kept forever. But Serge is uncomfortable with that.

The reality today is that we’re talking about an indefinite period of a “few years”.

Where do we store data? Your local web site; A disciplinary repository; At another university; in the cloud (Amazon, Google, Duracloud)

How to pay for storing data? Institution pays; grants pay – but they don’t go on forever; or – we don’t know (the most popular model). Most mechanisms require ongoing payment. That answers the “what should we store” question – by being willing to store whatever someone’s willing to pay for. Duracloud is charging $1800/year/TB. Not a reasonable charge for long-term preservation.

At Princeton they’re trying a Pay Once Store Endlessly approach. Based on a steadily declining cost of storage (as computed on a per-unit-of-storage basis). Turns out you can store the research data forever for about twice the original storage cost. At Princeton that turns out to be about $5 per gigabyte (including tape drives) to store forever.
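
The “about twice the original storage cost” figure is what a geometric series gives if the per-unit cost of storage halves with every hardware replacement cycle. The talk did not spell out the parameters, so the sketch below assumes that data is re-bought onto new media once per cycle and that each cycle costs a fixed fraction of the previous one. It also compares the one-time $5 per gigabyte figure with the $1800 per terabyte per year charge quoted earlier, assuming decimal terabytes.

```python
def pay_once_cost(initial_cost, decline_ratio, cycles=None):
    """Total cost of storage when each replacement cycle costs
    decline_ratio times the previous one.

    initial_cost  -- cost of the first copy (any currency unit)
    decline_ratio -- assumed cost ratio between consecutive cycles, 0 <= r < 1
    cycles        -- number of cycles to sum; None means "forever"
    """
    if not 0 <= decline_ratio < 1:
        raise ValueError("decline_ratio must be in [0, 1) for the sum to converge")
    if cycles is None:
        # Closed form of the infinite series C + C*r + C*r**2 + ...
        return initial_cost / (1 - decline_ratio)
    return sum(initial_cost * decline_ratio ** k for k in range(cycles))


# Assumption: cost halves every cycle, so "forever" costs twice the first copy.
forever = pay_once_cost(1.0, 0.5)
first_ten_cycles = pay_once_cost(1.0, 0.5, cycles=10)

# Break-even against an annual charge, using the two figures quoted in the talk.
POSE_PER_GB = 5.0        # one-time, dollars per gigabyte
ANNUAL_PER_TB = 1800.0   # dollars per terabyte per year
GB_PER_TB = 1000         # assumption: decimal terabytes
break_even_years = POSE_PER_GB * GB_PER_TB / ANNUAL_PER_TB
```

With a slower decline the multiplier grows: a ratio of 0.75 gives four times the initial cost. Under these assumptions the one-time payment equals a little under three years of the annual charge.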

Not including added services like curation or translation – just bit storage.

Serge looked at the data management plans for all grants submitted at Princeton since the mandate for a data management plan. 93 grants total. 27 (30%) have no data management plan. Most popular is on a web site or local disk (20%). Then DataSpace.

Brian Athey – Big Data 2011 [ #rdlmw ]

Brian Athey is a professor in the Medical School at the University of Michigan.

It’s difficult to incentivize researchers to share data.

Agile data integration is an engine that drives discovery.

Developing a personal health system requires combining data extracted from genomics with data extracted from the individual’s clinical record.

There’s a disconnect between classic IT’s “command and control” approach and what actually happens in research labs. We want to achieve a focused collaboration balancing high levels of focus and participation.

Next gen sequencing – turning out around 10 terabytes per day at Michigan, from 1500 users.

In 2006 there was a knee in the curve where it became more economical to generate the genomic data than to store it. We have to make decisions about what we store – we can’t save everything.

Brian is working on a Federated Enterprise Data Warehouse, that stores both clinical and research data. There’s an “honest broker” that mediates the data accessible to the research side.
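
One common reading of an “honest broker” is a trusted layer that strips direct identifiers and hands researchers a stable pseudonym instead. The talk did not describe how Michigan’s broker works, so treat the following as a generic sketch. The field names, the key handling, and the function are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical field names for direct identifiers in a clinical record.
DIRECT_IDENTIFIERS = {"name", "mrn", "date_of_birth"}


def release_to_research(record, secret_key):
    """Return a copy of a clinical record with direct identifiers removed
    and a keyed pseudonym in place of the medical record number."""
    # A keyed hash gives the same pseudonym for the same patient every time,
    # but cannot be recomputed by anyone who lacks the broker's key.
    digest = hmac.new(secret_key, record["mrn"].encode("utf-8"), hashlib.sha256)
    released = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    released["subject_id"] = digest.hexdigest()[:16]
    return released


# Usage with made-up data.
key = b"held only by the broker"
clinical = {"name": "Example Patient", "mrn": "000123",
            "date_of_birth": "1970-01-01", "diagnosis_code": "E11"}
research_view = release_to_research(clinical, key)
```

The research side can still link records for the same subject across extracts, because the pseudonym is deterministic for a given key.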

PCAST NITRD “Big Data” report from November. Has a list of recommendations.

We are all challenged by having to bring heterogeneous data together. Working with Johnson and Johnson on something called tranSMART – J&J have over 400 pharma research databases.

Clinicians have workflow – researchers don’t.

Discussion items:
IT doesn’t own the problem.
The rise of “architecture”
Data governance – who owns the data? bring them into the room. But there also has to be top down convenors.
Privacy, security, confidentiality – the idea of the “honest broker” could be a model.
Cost and value-centered models – if we remain just a cost center we’re cooked.

Question – why can’t we keep all the data? The “Best Buy conundrum” – why do you charge me so much for storage when I can get it elsewhere cheap. Takes money to curate and level out the chaos. Maybe we should let the researchers decide what stays and what goes. The questioner, dealing with crystallography data and working with people dealing with NASA data, says that they’ve learned that getting rid of raw data is a huge mistake. Vijay notes that now the cost of hardware is only 5% of the cost of storage – it’s people and facilities that cost.

Research Data Lifecycle Management Workshop – Princeton NJ [ #rdlmw ]

I’m in Princeton for the NSF-sponsored workshop on Research Data Lifecycle Management. I’m on the organizing committee, and it’s gratifying to see the room full of interesting people ready to spend the next day and a half discussing this timely topic. The participants are a really interesting mix of technologists, faculty members, and librarians (and of course those categories are not mutually exclusive).

The idea of the workshop is to try to come up with a set of actionable best practice recommendations that can be used to move the state of the art forward. I’ll try to keep up with activities here on the blog, and you can also follow along by watching the #rdlmw tag on Twitter, or by watching the live video stream at The web page for the event is at:

