Archive for May 12th, 2010

[CSG 2010] Curation, Preservation, & Information Lifecycle Management

Mairead from Penn State is talking about designing and implementing storage architectures and systems to support data curation and preservation needs. Who’s thinking about this, and what are they doing?

Drivers & Incentives – eScience/eResearch. NSF requirement for data management plans. Compliance – e-discovery, FERPA, HIPAA, Sarbanes-Oxley. Institutional record retention regulations and policies. Storage services for libraries, archives, cultural heritage entities. Great efficiencies.

Expectations (not supported) – storage is cheap; storage is smart; stuff on the internet is persistent; digital safer than analog; storage provider – curators and preservation experts; repositories take care of preservation; metadata will take care of it; libraries will take care of it; the cloud will take care of it.

The reality – new roles, new responsibilities, new collaborations, practices, workflows; intellectual capital requirements – digital preservation; cloud antithetical to preservation?; increased management requirements; scaling issues with preservation requirements.

Standards/Technologies
iRODS – From SDSC, integrated rule-based data system. Second generation of SRB.
Content addressable storage – fixed content storage, retrieval based on content rather than location
eXtensible Access Method (XAM)
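
A minimal Python sketch of the content-addressable idea listed above: the address of an object is a hash of its bytes, so retrieval depends on content rather than location. The `ContentStore` class and the choice of SHA-256 are illustrative assumptions, not a description of how any particular CAS product or XAM works.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself, not from a path.
        address = hashlib.sha256(data).hexdigest()
        # Identical content always maps to the same address, so it is stored once.
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        # Lookup is by content hash; there is no notion of a file location.
        return self._objects[address]


store = ContentStore()
addr = store.put(b"field notes, 2010-05-12")
same_addr = store.put(b"field notes, 2010-05-12")
```

Because the address changes whenever the bytes change, this style of storage suits fixed content: an edited object is simply a new object with a new address.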

Initiatives -
NSF DataNet – Data Conservancy Project – JHU lead with 23 institutions.
Chronopolis – SDSC, UCSD, UMIACS, NCAR – federated data grid using SRB/iRODS
LOCKSS (Lots of Copies Keep Things Safe) – replication of licensed journals and other content
MetaArchive – a private LOCKSS archive
Internet Archive
National Digital Information Infrastructure & Preservation Program (NDIIPP) – Library of Congress project.
California Digital Library
DuraSpace – DuraCloud project to implement a preservation-oriented cloud storage service
HathiTrust – Repository and storage infrastructure initiated for CIC Google book project
Sun Preservation and Archiving SIG (PASIG)
Storage Networking Industry Association

Penn State activities – Content Stewardship Program – strategic collaboration between Libraries and ITS. Goal – a suite of services to support the lifecycle of the digital object – creation, discovery, access, storage, preservation, and archiving. Hired Digital Library Architect and Digital Collections Curator; worked on governance.

Sally Jackson says that the Library School at Illinois now has a program in digital curation.

Cliff – decisions on what to curate, and what to keep, are less binary in digital formats than in print. E.g., Portico for scholarly journals, vs. “digital archaeology” status. It’s about risk management and resource allocation. Some of what we’re trying to understand in bit-management is really about risk and cost. How many redundant copies do you need? Failure modes are not well understood. Very scary data from physics labs about undetected bit flip errors. What does that cost in a preserved object? If it’s encrypted in clever ways it can cost a lot!
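
To make the redundant-copies question a little more concrete, here is a rough Python sketch of a fixity audit: a digest recorded at ingest is compared against each replica, so a silent bit flip shows up as a mismatch. The function names, the site names, and the use of SHA-256 are assumptions for illustration, not anything described in the session.

```python
import hashlib


def digest(data: bytes) -> str:
    """Checksum used as the fixity value for an object."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(recorded_digest: str, replicas: dict) -> list:
    """Return names of replicas whose bytes no longer match the recorded digest."""
    return [name for name, data in replicas.items() if digest(data) != recorded_digest]


original = b"detector run 0042"
recorded = digest(original)

# Simulate an undetected single-bit flip in one of three copies.
flipped = bytearray(original)
flipped[0] ^= 0b00000001
replicas = {"site_a": original, "site_b": bytes(flipped), "site_c": original}

damaged = audit_replicas(recorded, replicas)
```

With only one copy, the audit can detect damage but not repair it; with several, a damaged replica can be replaced from one that still matches, which is where the cost-versus-risk trade-off on the number of copies comes in.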

[CSG Spring 2010] Storage Futures – Cloud Options discussion

Shel Waggener – Link campus into cloud providers?
- Duraspace integration?
- UC Systemwide storage solution
- Purchase mass storage from commercial provider e.g. Amazon
- Let everybody do their own.

File Sharing through cloud: Institutional sharing?
- Eliminated Xythos (done)
- Common contract with Dropbox?

Student and faculty portfolios?
- Alumni offerings

Bernard – in context of move to Google, they’ve clarified policies around PHI, ITAR data, FERPA.

One institution reports that as far as their CISO is concerned, if it’s verifiably sufficiently encrypted, they’d regard it the same as shredded paper.

[CSG Spring 2010] Storage Strategies

Storage strategy survey results. Storage management is equally distributed between central IT, distributed, both, or not sure.

What’s provided centrally? All offer individual file space. Most offer backups for distributed servers and departmental file space. Half offer desktop backups.

Funding models – just about all have some variety of pay for what you use. Most have some common goods, and about half have base plus cost for extra.

About half do full cost recovery including staff time.

Challenges – data growth is top, tiered storage is next, along with centralizing and virtualization.

Biggest explicit challenges : Data growth, perception of cost, research storage.

Storage at Iowa
Central file storage: Base entitlement, individuals 1-5 GB, depts, 1 GB per FTE. 4 hour recovery objectives. 99.97% uptime. 89% participation. Enterprise level, high availability.

One price fits all network file storage, offered some lower-cost network storage, e.g. without replication or backup, now they’ve got lowest-cost bare server storage – lots of enthusiasm for that model.

http://its/uiowa.edu/spa/storage/

Low cost SAN for servers $0.36 – $1.68 per year, depending on service level. Cost recovery covers hardware and software only, with no staff time or data center charges.

Storage Census 2010

51% of storage being used by research. 35% Admin and Overhead (including email), 11% Teaching, 3% Public Service.

72% of storage is backup vs. online.

Next steps: identify and promote research solutions; build central backup service; build, promote archival solutions.

Storage @ U Virginia – Jim Jokl

Hierarchical Storage Manager Services: Storage for long-term research data (centrally funded but not well marketed); Library materials (funding via Library contributions to infrastructure); RESSCU (off-campus service for departmental disaster recovery backups).

Enterprise Storage – Based on NetApp clusters. NFS, CIFS for users, iSCSI, SAN internally. Works really well, highly reliable, replicated. Mostly used for central services. For departments it’s $3.20/GB/yr to $3.50 without backups. Lots of incidental sales to people who want a gigabyte or so for additional email quota. Doesn’t work for people who want a lot of storage.

New mid-tier storage service – focus on a reasonable and affordable storage service for departments and researchers.
Requirements: reliable, low cost, low overhead, self service. Unbundled services – optional remote replication and backups. Access via NFS and CIFS. Snapshots – users deal with their own restores. Offering Linux and Windows versions. Doing group files based on their groups infrastructure. Using RAIDKING disk arrays. Using BetterFS on Fedora, Windows server for the Windows side.

Cost model – 1 hour plus $0.34/GB/yr (RAID 5, but not replicated). Next year expect to drop price by 50%. Currently about 22 TB leased on NFS and only marginal Windows use to date. All of the complaints about the costs of central storage have gone away. Research groups interested in buying big chunks.

Shel Waggener – Berkeley Storage & Backup Strategy

Shel says scale matters and no matter who says they’re doing it better faster cheaper, without scale they’re not.

2003 – every department runs own storage – including seven within central IT.
2004 – data center move creates opportunity for common architecture
2006 – dedicated storage group formed. No further central storage purchases supported except through storage team.
2007 – Hitachi wins bakeoff. 250 TB. Email team works with storage group to move from direct-attached to SAN
2010 – over 500 hosts using pool – 1.25 PB expanding to 3 PB this year.

SAN-based approach. Lots of serial attached SCSI disk – moving away from fiber-channel.

Cheapest storage is now 25 cents per gigabyte per month. The most expensive tier (now $4.00/GB/month) bears the cost of the expensive infrastructure that the other tiers leverage.

Cheap disk has proven reliable in terms of failure rate, but recovery time is longer.

At the cost of storage, they don’t have quotas for email.

One advantage is paying for today’s storage today. Departments buy big arrays and use 5% in the first two years, which is much more expensive. But that’s what’s supported by NIH and NSF.

Backing up 338 users’ desktops (in IST) takes up 1.3 TB.

[CSG Spring 2010] Storing Data Forever

Serge from Princeton is talking about storing data. There’s a piece by MacKenzie Smith called Managing Research Data 101.

What do we mean by data? What about transcribing obsolete formats? Lot of metadata issues. Lots of issues.

What is “forever”? Serge thinks we’re talking about storing for as long as we possibly can, which can’t be precisely defined.

Why store data forever?
- because we have to – funding agencies want data “sharing” plans – e.g. NIH data sharing policy (2003). NIH says that applicants may request funds for data sharing and archiving.
Science Insider May 5 – Ed Seidel says NSF will require applicants to submit a data management plan. That could include saying “we will not retain data”.

- Because we need to encourage honesty – e.g. did Mendel cheat?
- Like open source, to help uncover mistakes or bugs.
- Open data and access movement – what about research data?

Michael Pickett asks who owns the data? At Brown, the institution claims to own the data.

Cliff Lynch notes that most of the time the data is not copyrightable, so that “ownership” comes down to “possession”.

There’s a great deal of variation by branch of science on what the release schedules look like – planetary research scientists get a couple of years to work their data before releasing to others, whereas in genomics the model is to pump out the data almost every night.

Current storage models
- Let someone else do it
– Government agency/lab/bureau e.g. NASA, NOAA
– Professional society

Dryad is an interesting model – if you publish in a given journal you can deposit your data there. That’s like GenBank.

Duraspace wants to promote a cloud storage model based on DSpace and Fedora.

There are a number of data repositories that are government sponsored that started in universities.

Shel says that researchers will be putting data in the cloud as part of the research process, but where does it migrate to?

Serge’s proposal – Pay once, store endlessly (Terry notes that it’s also called a Ponzi scheme).

Total cost of storage, where:
I = initial cost
d = rate at which storage costs decrease yearly, expressed as a fraction
r = how often, in years, storage is replaced
T = cost to store data forever

T = I + (1-d)^r * I + (1-d)^(2r) * I + … = I / (1 - (1-d)^r)

If d = 20% and r = 4, then (1-d)^r ≈ 0.41 and T ≈ 1.7 * I, which rounds up to about I * 2.

If you charge twice the cost of initial storage, you can store the data forever.
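
A quick way to check the arithmetic behind the pay-once model is to sum the series numerically. The sketch below is only an illustration of the formula as given in the talk: it assumes a constant yearly price decline d, a fixed replacement interval r, and ignores people costs, power, and inflation. The function names are made up for this example.

```python
def forever_cost_multiplier(d: float, r: int, replacements: int = 200) -> float:
    """Sum 1 + (1-d)^r + (1-d)^(2r) + ... over a finite number of replacements."""
    ratio = (1 - d) ** r  # cost of each replacement relative to the previous one
    return sum(ratio ** k for k in range(replacements + 1))


def closed_form_multiplier(d: float, r: int) -> float:
    """Closed form of the same geometric series, valid for 0 < d < 1."""
    return 1 / (1 - (1 - d) ** r)


# Values from the talk: prices fall 20% a year, storage is replaced every 4 years.
multiplier = forever_cost_multiplier(d=0.20, r=4)
```

With those values the multiplier comes out at roughly 1.7 times the initial cost, so charging twice the initial cost leaves some margin. The result is sensitive to d: if prices fall more slowly than assumed, the multiplier grows quickly.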

They’re trying to implement this model at Princeton, calling it DataSpace.

People costs (calculated per gigabyte managed) also go down over time.

Cliff – there was a task force funded by NSF, Mellon, and JISC on sustainable models for digital preservation – http://brtf.sdsc.edu

[CSG Spring 2010] Staffing for Research Computing

Greg Anderson from Chicago is talking about funding staff for research computing.

Most people in the room raise their hand when asked if they dedicate staff to research computing on campus.

At Illinois they have 175 people in NCSA, but it doesn’t report to CIO.

Shel notes that employees have gotten stretched into doing lots of other things besides just providing research support. They’re trying to rein that back in in their career classification structures by requiring people to classify themselves. Now there are 300 generalists classified as such.

At Princeton they’ve started a group of scientific sysadmins. The central folks are starting to help with technical supervision, creating some coherence across units. At Berkeley the central organization buys some time from some of the technical groups to make sure that they’re available to work with the central organization. Groups don’t get any design or consultation help unless they agree to put their computers in the data center.

At Columbia they have a central IT employee who works in the new center for (social sciences?) research computing – it’s a new model.

Greg asks how people know what the ratio of staff to research computing support should be and how do they make the case?

Shel asks whether anybody has surveyed grad students and postdocs about the sysadmin work they’re pressed into doing. He thinks that they’re seeing that work as more tangential to their research than they did a few years back.

Dave Lambert is talking about how the skill set for sysadmin has gotten sufficiently complex that the grad student or postdoc can’t hope to be successful at it. He cites the example of finding lots of insecure Oracle databases in research groups.

Klara asks why we always put funding at the start of the discussion of research support? Dave says it’s because of the funding model for research at our institutions. The domain scientists see any investment in this space by NSF as competing directly with the research funding. We need to think about how we build the political process to help lead on these issues.

[CSG Spring 2010] Research Computing Funding Issues

Alan Crosswell from Columbia kicks off the workshop on Research Computing Funding Issues. The goals of the session are: what works, what are best practices, what are barriers or enablers for best practices?

Agenda:
- Grants Jargon 101 – Alan
- Funding Infrastructure, primarily data centers – Alan
- Funding servers and storage – Curt
- Funding staff – Greg
- Funding storage and archival life cycle – Serge and Raj
- Summary and reports from related initiatives – Raj

Grants Jargon
- A21: Principles for determining costs applicable to grants, contracts and other agreements with educational institutions. What are allowed and unallowed costs.
- you can’t charge people different rates for the same service.
- direct costs – personnel, equipment, supplies, travel, consultants, tuition, central computer charges, core facility charges
- indirect costs a/k/a Facilities and Admin (F&A) – overhead costs such as heat, administrative salaries, etc.
- negotiated with federal government. Columbia’s rate is 61%. PIs see this as wasted money.
- modified direct costs – subtractions include equipment, participant support, GRA tuition, alteration or renovation, subcontracts > $25k.
Faculty want to know why everything they need isn’t included in the indirect cost. Faculty want to know why they can buy servers without paying overhead, but if they buy services from central IT they pay the overhead. Shel notes that CPU or storage as a service is the only logical direction, but how do we do that cost effectively under A21? Dave Lambert says that they negotiated a new agreement with HHS for their new data center. Dave Gift says that at Michigan State they let researchers buy nodes in a condo model, but some think that’s inefficient and not a good model for the future.
Alan asks whether other core shared facilities like gene sequencers are subject to indirect costs.
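
The faculty question about servers versus central IT services follows directly from the modified direct cost base. Here is a simplified Python sketch, using the 61% rate quoted above and entirely hypothetical budget figures; it leaves out the subcontract rule and other details of A21, so treat it as an illustration rather than a costing tool.

```python
FA_RATE = 0.61  # Columbia's negotiated rate as quoted in the notes

# Categories the notes list as subtracted from the modified direct cost base
# (subcontracts over $25k are left out of this sketch for simplicity).
EXCLUDED = {"equipment", "participant_support", "gra_tuition", "renovation"}


def indirect_cost(budget: dict, rate: float = FA_RATE) -> float:
    """Apply the F&A rate to direct costs, skipping excluded categories."""
    base = sum(amount for category, amount in budget.items() if category not in EXCLUDED)
    return rate * base


# Hypothetical budgets: the same $50,000 of computing, bought two different ways.
buy_servers = {"personnel": 100_000, "equipment": 50_000}
buy_service = {"personnel": 100_000, "central_computer_charges": 50_000}

overhead_servers = indirect_cost(buy_servers)
overhead_service = indirect_cost(buy_service)
```

In this made-up case the grant pays overhead on the service charge but not on the equipment purchase, so the same computing costs the PI $30,500 more in indirect costs when bought as a service.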

Campus Data Center Models
- Institutional core research facility – a number that grew out of former NSF supercomputer centers.
- Departmental closet clusters – sucking up lots of electricity that gets tossed back into the overhead.
- Shared data centers between administration and research – Columbia got some stimulus funding for some renovation around NIH research facilities.
- Multi-institution facilities (e.g. RENCI in North Carolina, recent announcement in Massachusetts)
- Cloud – faculty go out with credit card and buy cycles on Amazon
- Funding spans the gamut from fully institutionally funded to fully grant funded.

Funding pre-workshop survey results
- 19 of 22 have centrally run research data centers, mostly (15) centrally funded (9 counts of charge-back, 3 counts of grant funding)
- 18 of 22 respondents have departmentally run research data centers, mostly (14 counts) departmentally funded (3 counts of using charge back, 4 counts of grant funding)
- 14 have inventoried their research data centers
- 10 have gathered systematic data on research computing needs

Dave Lambert – had to create a cost allocation structure for the data centers for the rest of the institution to match what they charge grants for research use, in order to satisfy A21’s requirement to not charge different rates.

Kitty – as universities start revealing the costs of electricity to faculty, people will be encouraged to join the central facility. Dave notes that security often provides another incentive for people because of the visibility of incidents. At Georgetown they now have security office (in IT) review of research grants.

Curt Hillegas from Princeton is talking about Server and Short to Mid-Term Storage Funding
- Talking about working storage, not long-term archival storage.
- Some funding has to kick-start the process – either an individual faculty member or central funding. Gary Chapman notes that there’s an argument to be made for central interim funding to keep the resources going between grant cycles.

Bernard says that at Minnesota they’ve done a server inventory and found that servers are located in 225 rooms in 150 different buildings, but only 15% of those are devoted to research. Sally Jackson thinks the same is approximately true at Illinois. At Princeton about 50% of computing is research, and that’s expected to grow.

Stanford is looking at providing their core image as an Amazon Machine Image.

At UC Berkeley they have three supported computational models available and they fund design consulting with PIs before the grant.

Cornell has a fee-for-service model that is starting to work well. At Princeton that has never worked.

Life Cycle management – you gotta kill the thing, to make room for the new. Terry says we need a “cash for computer clunkers” program. You need to offer transition help for researchers.

