Archive Page 2

Breakout session reports [ #rdlmw ]

    Secure research data

– Wanted to focus narrowly on where access to restricted datasets is important in research computing. In social sciences, sometimes researchers have to apply to analyze government data that is not public. Medical data is protected by regulation. Geospatial data research can use sensitive data on individuals. People working with industry sometimes have restrictions on data. Intellectual property has to be respected. Recommendations:

1. People who manage research computing environments want to know what federal standards need to be complied with – come up with a national working group on how to comply. There is a federal interagency working group on data which might be a good venue to communicate with.
2. A simple catalog of solutions from institutions on how to enable remote access to secure data. Use the Educause Cyberinfrastructure working group.
3. Catalog items for clinical translational study.

    Policy

Recommendations:
1. Develop a set of documentation (elevator speech, exec summary, and extensive report) to describe the need for policies and standards across disciplines as much as possible.
2. Develop workshop for university officers (VP of Research, Provost) to include them in discussions on how institutions can be involved.
3. Catalog of issues on data ownership and responsibility. Reduce mean time to discovery for researchers working out how they should deal with their data.
4. Develop workshop for leaders of disciplinary communities.
5. Develop discipline-blind framework – what are the kinds of things a discipline needs to do to develop policies and standards?
6. University librarian is key in this role.
7. It’s time for the researchers to walk into the room with the librarians and say “we’re here”. – Brian Athey.

    Assessment and selection of research data

Is it really a goal to keep all data if possible? Good question.
Good practices with physical materials should be studied for guidance.
Expense of what it takes to manage data shouldn’t be primary consideration for what we keep.
Selection process has to be discipline specific.
What’s the cost of getting rid of something? Is reproduction of the data possible, and if so, what does that cost?
It’s easier to throw things away than to try to collect them after the fact. So collect and manage data before deciding to throw it away.
Researchers will have to provide at least core metadata.
Selection process is not yes/no but a continuum from minimal to full.

1. To make decision easier, develop a framework for making decisions. The researcher is a full partner in this.
2. Educate key audiences on the importance of curatorial concepts: researchers in all disciplines, and catch grad students now.
3. Encourage policy makers to rethink roles across the institution.

    Funding and operation

Recommendations for action:
1. Repository builders should collaborate – build with knowledge and forethought of others. Too many isolated repositories. Think federation.
2. Make data movable. Funding models will change over time. Should be movable from one caretaker to another.
3. Prepare for the hand-off. Anybody organizing a repository must put enough details in plan and budget to enable hand-off at the end of business cycle.
4. It would be useful to have a study of existing repository models.

    Partnering researchers, IT staff, librarians and archivists

30 people in this breakout!
1. Communication of what’s out there – what models exist? Portal that identifies workable solutions. What practices work for training – resources for cross-training?
2. Institute more training for grad students.
3. Substantial workshop report from here – task NSF with developing a generic framework that allows institutions to implement policies and appropriate procedures.
4. Hold a workshop to define best institutional practices in communicating between researchers and librarians.
5. Survey our campuses on data management practices.

    Standards for provenance, metadata, discoverability

Got into a discussion on “what is metadata” – anything that supports the core user needs for information. IFLA def – can you find it, can you identify it, can you select among resources, can you retain or reuse it? We want our metadata to be interoperable – move across repositories, workspaces, etc. We also want trustworthy and reliable data.
Core needs:
1. Common framework for data. Some are emerging, like METS.
2. Role of ontologies – domains recognizing standardized terminologies. Linked Data (semantic web) might be worth exploring for this.
3. Instrumented data – if numeric data is off, then data is useless. How do we know if the data is good? Huge gap in current data – need to work with instrument manufacturers. What captured this data? Usually entered manually.
4. Metadata needs to be captured at point of data creation.
5. Need standards of provenance – what’s the purpose of creating this data? Relationships between datasets are critical. Most scientists spend a long time exploring dimensions of the same set of problems.
Researchers want to develop their own metadata – treat it like any other data stream. Don’t worry about having to bring it into a structure.

    Partnering funding agencies, research institutions, communities, and industrial and corporate partnerships

Recommendations:
1. Joint study of the feasibility of the “digital sheepskin”. Is there a model for a digital container that can be sustained through the ages, including metadata? We’ll probably have to invent some of the social context for this.
2. Conduct an aggregated study of TCO models using trusted party (academia) for storage for perpetuity or for ten years.
3. Identify the missing pieces of the research data software stack, and encourage collaborations between academia and industry.
4. A study on criteria for throwing data away, by discipline.
5. Continue to emphasize that data volume is growing much faster than our ability to move data around. Think about where we need to site data.
6. What are the possible models for joint activity with industrial partners?

Lightning Round! [ #rdlmw ]

John Fundine – USGS – the case for not keeping everything. They deal with observational records, now in the 5 petabyte range. They end up with orphans as programmatic sponsors change. Budgets are going down in government agencies – now looking at best case next year being the same as 2007-2008. What not to keep and who decides? Would advocate a formal appraisal process – as an archivist he owns the process but not the outcomes. They have a 30% disposal rate. Disposal is not the same as destruction – owners can find new homes.

Jim Myers – W3C Provenance Introduction. Group is in progress. Started from open source provenance group – data history tracking. Scope – want to talk about input objects being used to produce output objects. Also trying to add the distinction between document and file. Will be able to talk about how documents were funded, reports were derived from it, etc. Also talking about physical objects – the web of things.
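The "input objects used to produce output objects" idea can be illustrated with a very small data model. This is only a sketch under my own assumptions: the `Step` record, the `lineage` helper, and the file names are hypothetical and are not the vocabulary the W3C group is defining.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One processing step: named inputs were used to produce named outputs."""
    activity: str
    inputs: tuple
    outputs: tuple


def lineage(target, steps):
    """Return every object the target was derived from, directly or indirectly."""
    # Map each output object to the step that produced it.
    produced_by = {out: step for step in steps for out in step.outputs}
    seen, stack = set(), [target]
    while stack:
        step = produced_by.get(stack.pop())
        if step is None:
            continue  # a raw object with no recorded history
        for src in step.inputs:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


# Hypothetical history: raw readings are calibrated, then summarized in a report.
steps = [
    Step("calibrate", ("raw_readings.csv", "calibration.json"), ("clean.csv",)),
    Step("summarize", ("clean.csv",), ("report.pdf",)),
]
print(sorted(lineage("report.pdf", steps)))
```

Walking the records backwards from the report recovers the cleaned file, the raw readings, and the calibration file, which is the kind of question ("what was this derived from?") a provenance standard is meant to answer.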

Herbert J. Bernstein – We’re looking at too easy a problem. Should we be thinking about communicating with the future? Get out of the mode of thinking that what we do works – our processes produce errors. The data should be able to stand on its own legs without people, in a format that will be readable, unlike what we do now. Need to be much more conservative about our frameworks, and stop changing them.

Grace Agnew, Brian Womack – Spent about three years working with scientists on organizing data. Faculty didn’t remember context of data just months later – makes it hard to reuse data. Need to know what trial, who conducted it, etc. – entire provenance. Developed events-based data model that fits in a METS-based data model. Training a team internally of librarians – will support research data efforts going forward.

Jen Scheop – Woods Hole – Points from this morning – Not true that we don’t have to throw data out. Data are not like books – not vetted or standardized. Not true that large projects can find money for curation.

Scott Brant – Purdue – Data curation profiles (http://datacurationprofiles.org). Developed a profile template. Got a grant to teach librarians how to interact with researchers. There is a toolkit available on the web site.

“DataONE (Observation Network for Earth): Enabling New Science by Supporting the Management of Data Throughout its Life Cycle” – Bill Michener, University Libraries, University of New Mexico [ #rdlmw ]

Defining the problem space –

Grand challenges are difficult. We’re using different languages and standards for how we deal with our domain data. Most scientists complain that they’re spending most of their time doing mundane data management and integration. Only a small part of their time is spent on the analysis.

The data deluge – lots of sensors.
Proliferation of citizen science programs. A whole new way of doing science.
Data silos. Lots of big repositories, tons of small ones, each using their own, non-interoperable data standards. Creates the long tail of orphan data – scattered worldwide.

Data entropy – most scientists are really familiar with their data just prior to publication. They may or may not document the intricacies of the data, and we lose the ability to use the data over time.

DataONE approaches
Community Engagement – has been funded for 1.5 years by NSF. Will be releasing infrastructure in December 2011. Started interacting with the scientific community two years ago via interviews, surveys, etc. What are the challenges scientists are facing?

Recent study found (in earth sciences) >80% would be willing to share data, across a broad group of researchers.

Stakeholder needs – what are data management plans? How do I describe and preserve my data?

Brought an array of people into the room to look at continental bird migration. What do we need to answer this? 31 different data layers, including a single researcher in Utah with data in his desk. Data discovery is an issue. Needed lots of compute cycles, which was a shock. Took an initial 0.5 million hours on TeraGrid, and more later. Also needed visualization tools. One of the datasets used, eBird, is a citizen science data source. Produced the State of the Birds report 2011.

Cyberinfrastructure support – goal is to enable new sciences through universal access to data about life on earth, the environment, plus access to key tools. Three precepts: 1. Build on existing cyberinfrastructure; 2. Create new cyberinfrastructure; 3. Support communities of practice – we’ve ignored this over time.

Member nodes – data repositories that already exist. Coordinating nodes – retain complete metadata catalog, indexing for search, network-wide services, ensure content availability (preservation), replication services (would like to see data in 3 or more repositories). Investigator toolkit – familiar to scientists, integrated with data resources.

First three member node prototypes – ORNL-DAAC, Dryad, KNB.
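To make the member node / coordinating node split concrete, here is a toy sketch of a catalog that tracks which member nodes hold each dataset and flags anything below the "3 or more repositories" goal. The class, method names, and dataset ids are hypothetical illustrations, not DataONE's actual API.

```python
from collections import defaultdict

MIN_REPLICAS = 3  # the stated goal: data in 3 or more repositories


class CoordinatingNode:
    """Toy metadata catalog that tracks which member nodes hold each dataset."""

    def __init__(self):
        # dataset id -> set of member node names holding a copy
        self._holders = defaultdict(set)

    def register(self, dataset_id, member_node):
        """Record that a member node holds a copy of the dataset."""
        self._holders[dataset_id].add(member_node)

    def under_replicated(self, minimum=MIN_REPLICAS):
        """Return {dataset id: number of extra copies still needed}."""
        return {
            ds: minimum - len(nodes)
            for ds, nodes in sorted(self._holders.items())
            if len(nodes) < minimum
        }


catalog = CoordinatingNode()
for node in ("ORNL-DAAC", "Dryad", "KNB"):
    catalog.register("bird-survey", node)   # hypothetical dataset id
catalog.register("soil-cores", "KNB")       # hypothetical dataset id
print(catalog.under_replicated())
```

Here the dataset held on all three prototype nodes passes the check, while the one held only on a single node is reported as needing two more copies.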

There’s beginning to be some evidence that when you share your data, citation rates to your publications go up.

Working with Microsoft Research to make Excel a more powerful tool.

Added (impending release) a Data Management Planning Tool (DMPTool) for building a data management plan – wizard driven.

DataONE includes a powerful data discovery tool.

Education and Training – there’s a lot to do! In DataONE, created DataONEpedia – best practices. Scientists want one pagers, not detailed manuals.

“Taking AIM at Data Lifecycle Management” – Jose-Marie Griffiths, Bryant University [ #rdlmw ]

Representing the point of view of chief research officers for this talk.

Most of the concerns relate to current economic conditions and uncertainties, particularly overhead costs. Also concerned about policies that turn into unfunded mandates. Concerned about roles and liabilities.

Size and scale issues – big universities can do things smaller ones can’t – need to find ways of federating so the smaller institutions can participate.

AIM – Access, Integrity, Mediation

Access – what goes in must be able to come out. Need to focus on users, defining “users” as widely as possible. Need metadata, which requires people. Also need to understand the costs of migrating data as technologies become obsolete.

ICPSR is a good model of an inter-institutional data consortium.

Interoperability – never easy. Referential integrity degrades over time. Decisions tend to get made on the fly.

Increased public access is a trend, supported by the government funding agencies. NSTC report expected next spring.

Integrity – We need to plan for preservation across the entire lifecycle. What are we going to share? Raw data? Processed, analyzed datasets? Instruments? Calibration? Analytical tools?

Mediation – needed at all stages of lifecycle. Where there is high intensity of interaction, it may make sense to have lots of replication and different mediation. Mediation may not always need to be formal, but for repositories and analysis it does need to be more formal. But must make sure that creating new repositories is not a solution in search of a problem.

Players and relationships among them are constantly shifting, vying for funding and attention. Issues about research directories. For data to be discovered, must have a shared overlay of connections. An ecosystem of multiple stakeholders.

Serge points out that large swaths of disciplines don’t have disciplinary repositories. Jose replies that there is a role for institutional repositories, but there are challenges – we don’t know enough about building a sustainable economic model. We don’t have good metrics about progress in cyberinfrastructure. All we have is number of high speed connections to institutions.

Serge Goldstein – DataSpace [ #rdlmw ]

I’ve reported on Serge’s experimental model at Princeton before, at http://blog.orenblog.org/2010/05/12/csg-spring-2010-storing-data-forever/

Funding and operational model for long-term preservation of research data. Piloting at Princeton.

Storing data forever.

What’s “forever”? We don’t usually tell people how long we keep stuff – like in libraries. We can treat data the same way as books – “indefinitely” – best effort to keep data around for a long time, which doesn’t have to be precisely defined.

Quotes Cliff Lynch – funding agencies don’t expect data to be kept forever. But Serge is uncomfortable with that.

The reality today is that we’re talking about an indefinite period of a “few years”.

Where do we store data? Your local web site; a disciplinary repository; at another university; in the cloud (Amazon, Google, Duracloud).

How to pay for storing data? Institution pays; grants pay – but they don’t go on forever; or – we don’t know (the most popular model). Most mechanisms require ongoing payment. That answers the “what should we store” question – by being willing to store whatever someone’s willing to pay for. Duracloud is charging $1800/year/TB. Not a reasonable charge for long-term preservation.

At Princeton they’re trying a Pay Once Store Endlessly approach. Based on a steadily declining cost of storage (as computed on a per-unit-of-storage basis). Turns out you can store the research data forever for about twice the original storage cost. At Princeton that turns out to be about $5 per gigabyte (including tape drives) to store forever.

Not including added services like curation or translation – just bit storage.
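The "about twice the original storage cost" figure is what a geometric series gives when costs halve each time the storage is replaced. A minimal sketch, assuming (my assumption, not stated in the talk) a fixed replacement cycle and a cost that halves every cycle; the function names and the 1 TB = 1000 GB conversion are also illustrative choices:

```python
def pay_once_total(initial_cost, decline_ratio=0.5, cycles=None):
    """Total cost of buying storage now and re-buying it every replacement cycle.

    decline_ratio is the assumed cost of each replacement relative to the
    previous one (0.5 means the cost halves each cycle). With cycles=None
    the closed-form limit of the infinite geometric series is returned.
    """
    if not 0 <= decline_ratio < 1:
        raise ValueError("decline_ratio must be in [0, 1) for the sum to converge")
    if cycles is None:
        return initial_cost / (1 - decline_ratio)
    return sum(initial_cost * decline_ratio ** k for k in range(cycles))


def break_even_years(one_time_cost, annual_cost):
    """Years of annual payments that add up to the one-time fee."""
    return one_time_cost / annual_cost


# Figures quoted in the talk: about $5 per gigabyte forever at Princeton,
# $1800 per terabyte per year at Duracloud (1 TB taken as 1000 GB here).
forever_per_tb = 5 * 1000
print(pay_once_total(1.0))                     # limit of 1 + 1/2 + 1/4 + ...
print(break_even_years(forever_per_tb, 1800))  # years until annual fees exceed it
```

Under these assumptions the infinite sum is exactly twice the first purchase, and the quoted annual rate overtakes the one-time fee in a little under three years.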

Serge looked at the data management plans for all grants submitted at Princeton since the mandate for a data management plan. 93 grants total. 27 (30%) have no data management plan. Most popular is on a web site or local disk (20%). Then DataSpace.

Brian Athey – Big Data 2011 [ #rdlmw ]

Brian Athey is a professor in the Medical School at the University of Michigan.

It’s difficult to incentivize researchers to share data.

Agile data integration is an engine that drives discovery.

Developing a personal health system requires combining data extracted from genomics with data extracted from the individual’s clinical record.

There’s a disconnect between classic IT’s “command and control” approach and what actually happens in research labs. We want to achieve a focused collaboration balancing high levels of focus and participation.

Next gen sequencing – turning out around 10 terabytes per day at Michigan, from 1500 users.

In 2006 there was a knee in the curve where it became more economical to generate the genomic data than to store it. We have to make decisions about what we store – we can’t save everything.

Brian is working on a Federated Enterprise Data Warehouse that stores both clinical and research data. There’s an “honest broker” that mediates the data accessible to the research side.

PCAST NITRD “Big Data” report from November. Has a list of recommendations.

We are all challenged by having to bring heterogeneous data together. Working with Johnson and Johnson on something called tranSMART – J&J have over 400 pharma research databases.

Clinicians have workflow – researchers don’t.

Discussion items:
IT doesn’t own the problem.
The rise of “architecture”
Data governance – who owns the data? Bring them into the room. But there also have to be top-down convenors.
Privacy, security, confidentiality – the idea of the “honest broker” could be a model.
Cost and value-centered models – if we remain just a cost center we’re cooked.

Question – why can’t we keep all the data? The “Best Buy conundrum” – why do you charge me so much for storage when I can get it elsewhere cheap. Takes money to curate and level out the chaos. Maybe we should let the researchers decide what stays and what goes. The questioner, dealing with crystallography data and working with people dealing with NASA data, says that they’ve learned that getting rid of raw data is a huge mistake. Vijay notes that now the cost of hardware is only 5% of the cost of storage – it’s people and facilities that cost.

Research Data Lifecycle Management Workshop – Princeton NJ [ #rdlmw ]

I’m in Princeton for the NSF-sponsored workshop on Research Data Lifecycle Management. I’m on the organizing committee, and it’s gratifying to see the room full of interesting people ready to spend the next day and a half discussing this timely topic. The participants are a really interesting mix of technologists, faculty members, and librarians (and of course those categories are not mutually exclusive).

The idea of the workshop is to try to come up with a set of actionable best practice recommendations that can be used to move the state of the art forward. I’ll try to keep up with activities here on the blog, and you can also follow along by watching the #rdlmw tag on Twitter, or by watching the live video stream at The web page for the event is at:

[CSG Winter 2011] Higher ed from both sides now

Greg Jackson (Educause)

Collaboration – we don’t do it very well across our organization.
- We sign NDAs for No Benefit
- We let vendors pick us off
- We keep our cake (we hold on to resources we really should be sharing)

Battles – we fight those we can’t win. Prevalence will sometimes win out over quality.
- Google is going to win
- The CFO is going to win
- Verizon/AT&T/Sprint are going to win
- Oracle is going to win – not everything, but everything it cares about
We don’t engage very well if we characterize them as evil

Optimization
- Being different from peers isn’t the same as being ahead of peers. No competitive advantage to how we use IT at our institutions.
- Being ahead of peers isn’t the same as winning.
- Distinctiveness yields value, but it also consumes it
- It doesn’t matter what computer you use, because standardization has largely been achieved
- When standardization fails, idiosyncrasy accelerates

Tracy notes that we’re different because our environments demand that we be.
Greg – we don’t want to aspire to mediocrity. We shouldn’t innovate in different directions just for the sake of different directions.

Management
- We reject cost accounting
- We prefer tactics to strategies
- We send good money after bad
- We prefer right to timely
- We eat (or alienate) our seed corn
- We mistake users for customers

Association
- We squabble (especially in public)
- We waste too much time on governance
- We spread ourselves too thinly
- We obsess

[CSG Winter 2011] Time to de-localize?

This discussion, led by Sally Jackson (Illinois), was a lot more interesting than is captured here, but I’ll share what I’ve got:

Overall tendency of IT has been to amplify the ability of faculty and students to reach across great distances, socially, politically, and physically. Our support structures have not adjusted to this reality.

de-localize – invites an association with globalization, but that’s not entirely what she had in mind.

Services from different providers, virtual teams, support for people who rely on many people other than just us.

Shel – localization is no longer necessary for personalization – it’s easy to tailor environments that aren’t provided locally.

Most of us have divided support structures – large core at the center, surrounded by a community of IT professionals attached to labs, colleges, centers. At Illinois, about a third of support staff are in the center, two-thirds in the units. That’s true of all the CIC except Indiana.

All of our end users are now wandering horizontally. Every day is a sequence of small but irritating hurdles to jump. We’d like to be able to eliminate those little irritations. Extra credentials are a real problem – at Illinois they need all new credentials to report on conversations with vendors.

Kitty – individual units have sets of services that work great within their silos, but for people who want to engage outside that silo it gets confused.

Barbara – as a faculty member she has control of her desktop, but as a member of the provost’s office she has to use the locked down image.

Bill – there’s a lot of power in the local tribes across the institution. Greg – tribes are no longer geographically defined. Even within local physical communities, people interact with those they choose, not necessarily those that are in physical proximity.

Shel – any given solution will be an aggregation of pieces from multiple providers. “Central” doesn’t mean what it used to – it’s about being dynamic.

Good support gets attached as a node in a personal network. Great support helps to build this unbounded personal network.

Can we build a curriculum for training great support staff? Add a layer of socio-technical competence to the pure tech. competence.
- Treat people equally and involve them no matter what organization they’re part of.
- Collaborative problem based learning infused into all projects and studies.
- Problems requiring virtual teams.
- Network-building activities.
- New professional career tracks focused on connector skills.

Treat each faculty member as the center of an unbounded network of social and technical resources.

Shel – there’s also a product management role, which Sally characterizes as a level of context awareness.

[CSG] Unified Communications Workshop – part 3

Duke Telepresence -

[Image: Duke Fuqua telepresence classroom, view of the big screens at the front]

[Image: The presenter's view in the Duke Fuqua telepresence room]

Fuqua’s (Duke business school) been doing telepresence for over a decade. Challenge was to find a room that would accommodate 90 people. Room opened in 2008, seats 140, and there was some thinking about telepresence as it was designed. About 1/3 of the schools in the room have Cisco telepresence.

They wanted a 3-screen system and everyone in the room to be able to see the remote presenters well. Wanted the local presenter to be able to see the remote participants without turning around.

Camera system in the room – standard is the 3-camera mount under the big screen array – shoots the room in thirds. On either side are two pan-tilt-zoom cameras. There are 70-something microphones around the room – press and hold to talk. As you speak, the camera on your half of the room pans and tilts to show you, and one of the screens shows you. There’s also a camera that shows the local presenter – follows the actions of the person at the front of the room, replacing the image on the center screen.

People are getting used to using the room. Haven’t had any regular classes using the resources, so continuously getting faculty up to speed. People are learning, and will be able to use it themselves.

Harvard has 12 CTS 1200 and 1300 units on campus. Installed a 60-seat classroom in a local high school that’s connected to Harvard. Averaging 25-40 hours a week on the units, with peaks up to 60 hrs. Majority of calls are interop with H.323 conferences. Done mobile interop with Mobi (Tandberg). Added interactive presentation capabilities. Working on FaceTime and Skype tie-ins. Looking at backside integration with WebEx and other collaboration tools.

Duke is creating a smaller version of the three screen room to talk to a remote site for the School of the Environment.

Lots of high schools are adopting this technology. Smithsonian has four units that are now online.

Haven’t done any QoS on the network – works OK over the R&E networks across to China.

Harvard has integrated with Exchange. Users can use it with one button, unless they have to do H.323 interop, which requires some intervention.

Eventually you’ll see all the Tandberg gear integrated with Cisco’s call manager.
