While leading a policy discussion on the mis-named Digital Rights Management technologies (aka copy protection).
“…these technologies have the shelf life of sushi.”
Technorati Tags: CSG-Winter-2007, DRM
We had a great workshop on Thursday on collaboration tools and how to approach them in higher education. I was part of the panel that led the presentation, so I wasn’t taking notes, but I’m sure the notes will be posted to the CSG web site after the meeting.
For my part in the presentation, I reiterated some of the points I made at last spring’s discussion of this topic. I went on to comment that what we’re now experiencing in the collaborative tools space is somewhat analogous to the Cambrian explosion: new species of software are appearing almost daily, and they are combining and evolving at a very rapid rate. That makes it very difficult to figure out which ones we should engage with at an enterprise level, or even how to construct a meaningful taxonomy of these applications.
Technorati Tags: collaboration, CSG-Winter-2007, social-software
Managing very large files in research computing at IU.
A task force two years ago on research cyberinfrastructure had recommendations concerning storage – continuing to deliver centralized facilities to support research computing, as well as dependable archival storage, was identified as important. Large file storage is just a piece of the storage strategy for IU.
They have about a petabyte of spinning disk available for researchers, as well as 4 petabytes of archival storage (the Massive Data Storage System). The “Data Capacitor” captures data from instrumentation.
Data Capacitor uses the Lustre file system.
MDSS designed to provide a deep store for large files. Runs HPSS. Interfaces include FTP, Samba, and tar. Radiology is one of the biggest users. Also working with digital library programming. They give the researchers 500 GB for free, and after that they want to discuss it.
Preservation, curation, and long term management of data is a big issue – need to link librarians, computing support staff, and IT professionals. Serge notes that finding ways of accomplishing persistent URIs for data is important.
Backup with mirroring has a serious problem – if you accidentally delete something or introduce bad data in a big data set, the mirror picks up the mistake too.
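To make the mirroring point concrete, here is a minimal sketch in Python. The dictionaries standing in for storage, the file names, and the `mirror` function are all hypothetical and have nothing to do with what IU actually runs. The sketch only illustrates why a mirror, unlike an older point-in-time copy, cannot undo an accidental delete or a bad write.

```python
import copy

def mirror(primary):
    """Return an exact replica of the primary's current state (hypothetical mirror)."""
    return copy.deepcopy(primary)

# Toy "file system": name -> contents. All names here are made up.
primary = {"run1.dat": "good data", "run2.dat": "good data"}

snapshot = copy.deepcopy(primary)  # point-in-time copy taken before the mistake
replica = mirror(primary)          # mirror is in sync with the primary

# An accidental delete and some bad data land on the primary.
del primary["run1.dat"]
primary["run2.dat"] = "bad data"

replica = mirror(primary)          # the next sync faithfully copies the mistake

print("run1.dat still on mirror:", "run1.dat" in replica)
print("run1.dat still in snapshot:", "run1.dat" in snapshot)
```

With very large data sets, keeping older point-in-time copies is expensive, which is presumably why this came up as a problem rather than a solved one.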
Technorati Tags: CSG-Winter-2007, cyber-infrastructure, storage
Project partnership with Google publicly announced in December 2004 – scanning 7 million print volumes over 4-6 years. Direct scanning costs are borne by Google.
UM receives a copy of all digital files, including OCR and metadata, which can be used to build services. UM can share, with some restrictions. Each volume page produces 2.01 files on average – will be about 2.2 billion files, 380 TB of data. Sustained rate of 3.16 MB per second for four years.
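As a rough sanity check on those figures, here is a small Python sketch of the arithmetic. The function name is my own, and the choice of a 365.25-day year and of decimal versus binary terabytes are assumptions, since the talk did not say which were used. Either way the result lands at roughly 3 MB per second, in the same range as the quoted rate.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average calendar year

def sustained_rate_mb_per_s(total_bytes, years):
    """Average transfer rate in MB/s (10**6 bytes) needed to move total_bytes in `years`."""
    return total_bytes / (years * SECONDS_PER_YEAR) / 1e6

total_tb = 380   # from the talk
years = 4        # from the talk
files = 2.2e9    # from the talk

decimal_rate = sustained_rate_mb_per_s(total_tb * 10**12, years)  # TB = 10**12 bytes
binary_rate = sustained_rate_mb_per_s(total_tb * 2**40, years)    # TB = 2**40 bytes
avg_file_kb = total_tb * 10**12 / files / 1e3                     # mean file size, decimal KB

print(f"decimal TB: {decimal_rate:.2f} MB/s")
print(f"binary TB:  {binary_rate:.2f} MB/s")
print(f"average file size: {avg_file_kb:.0f} KB")
```

The average file size that falls out is well under a megabyte, so this collection is also a lot of fairly small files.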
Data characteristics – well defined file formats – image files are TIFF or JPEG 2000, OCR files and metadata are UTF-8 text. Indefinite retention. Files are largely static. Much material is in copyright, so requires security practices.
Mbooks service – can search and look at books online.
There’s interest in using the OCR data for textual analysis research.
Technorati Tags: CSG-Winter-2007, google, higher-ed, digital-libraries, storage
Small files are normal for lots of people – people write apps using files as a database substitute – this comes from the desktop computing world. This problem has existed for years – but now people have discovered HPC, but they don’t want to rewrite their programs. Small files are deadly to most file systems – some more than others. Creates even more problems with clusters.
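One way to get a feel for the small-file cost is to time writing the same bytes as many small files versus one large file. This is a minimal Python sketch, not anything presented at the workshop: the function names, file names, and the 2000 x 1 KB workload are all made up for illustration. On a local disk with a warm cache the gap may be modest, and on a shared cluster file system, where every create is a metadata operation, it is likely to be much larger.

```python
import os
import tempfile
import time

def write_many_small(directory, count, size):
    """Write `count` separate files of `size` bytes each; return elapsed seconds."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        # Each open/close is a separate create plus metadata update.
        with open(os.path.join(directory, f"part{i:06d}.dat"), "wb") as f:
            f.write(payload)
    return time.perf_counter() - start

def write_one_large(directory, count, size):
    """Write the same total bytes into a single file; return elapsed seconds."""
    payload = b"x" * size
    start = time.perf_counter()
    with open(os.path.join(directory, "all.dat"), "wb") as f:
        for _ in range(count):
            f.write(payload)
    return time.perf_counter() - start

count, size = 2000, 1024  # hypothetical workload: 2000 files of 1 KB each
with tempfile.TemporaryDirectory() as small_dir, tempfile.TemporaryDirectory() as large_dir:
    t_small = write_many_small(small_dir, count, size)
    t_large = write_one_large(large_dir, count, size)
    n_small = len(os.listdir(small_dir))
    large_bytes = os.path.getsize(os.path.join(large_dir, "all.dat"))

print(f"{n_small} small files: {t_small:.3f} s")
print(f"1 file of {large_bytes} bytes: {t_large:.3f} s")
```

Timings will vary by machine and file system, so treat the numbers as indicative only.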
People are expecting cheap disk at commodity prices, but that’s not fast disk. Virtualization can be deadly as it adds overhead due to the levels of abstraction.
An example – an 1800 compute node cluster at USC. If they’re accessing small files, you have to have ways to coordinate file locking and synchronization across the nodes. 3-4 terabits of bandwidth capacity get slowed to nothing if there’s lots of small file access going on.
Right now the base file system is QFS (Sun). The directory metadata is on separate disks from the data itself, which is great on big files, but hard with small files because of the single metadata catalog. There are local parallel file systems on the nodes, which work better for small files. NFS has its own issues with small file access because of the overhead. They’ve set up “condo disk” as well as condo nodes, so they can have their own file space instead of a virtualized environment.
Some examples of small-file file systems -
Genomics Group – tens of thousands of files in a single directory.
Natural Language Group – 50-250k files in directory. Many nodes accessing the same dictionaries.
Backups are slower and harder – can’t keep the tape spinning if you’re doing lots of directory accesses – takes hours instead of minutes.
Ways to help -
- faster disk (helps metadata/directory space)
- distributed file access (qfs)
- no free lunch
Next generation -
- nfs 4 doesn’t cut it
- gpfs helps some
- 10 gbps hosts on data plane – nothing but jumbo frames, which might make it worse.
- ram disk for metadata? san diego does it – might help.
- storage management solutions – performance for small files is in question.
Technorati Tags: CSG-Winter-2007, higher-ed, research-computing, storage
The afternoon workshop, coordinated by Kitty, is on data storage.
There is, of course, a survey to present. Most of the schools are offering multiple kinds of file services with ever-increasing quotas. Only two schools are offering replication technologies (like Apple’s or Microsoft’s). The predominant technology is direct attached storage, but there is use of Fibre Channel SAN, and some use of iSCSI SAN. Most folks are using TSM for backup.
Most people said that the Library does not provide any data archiving services.
Unsolved problems include (of course) funding, smart data storage, multi-platform access, replacing current distributed file systems – what’s next?, virtualization and tiering, more-more-more – keeping up with demand.
Summary – growth in data is a huge problem and an unfunded mandate. Federal requirements for keeping and protecting data for longer periods and unmanaged data are huge issues. Inefficiency is a problem – we’re not aligning data with the right solutions. The technologies for storage don’t knit together well – there’s a duct-tape feeling to the solutions.
Ron Thielen from the University of Chicago is talking about storage.
SAN vs NAS is the wrong question – they’re converging anyway. The real question is what APIs do you want to use to provide access to data – files, blocks, objects.
A File System is really a metadata repository and related APIs. Once a vendor understands that it enables really interesting things to happen – Xythos is an example of someone who gets that. Typical storage growth figures are quoted as 39% annually – even more worrisome is the percentage of budget devoted to storage. At U Chicago, in the last few years they’ve seen a 96% compound annual growth rate.
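To see what those two growth rates mean in practice, here is a short Python sketch of the compound-growth arithmetic. The function names and the 100 TB starting capacity are made up for illustration, and only the 39% and 96% rates come from the talk. At 96% a year storage roughly doubles every year, and at 39% it doubles in a bit over two years.

```python
import math

def doubling_time_years(annual_growth):
    """Years for capacity to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

def capacity_after(start_tb, annual_growth, years):
    """Capacity in TB after `years` of compound growth from `start_tb`."""
    return start_tb * (1 + annual_growth) ** years

start_tb = 100  # hypothetical starting point, not a U Chicago figure
for rate in (0.39, 0.96):
    print(
        f"{rate:.0%} growth: doubles every {doubling_time_years(rate):.2f} years, "
        f"{start_tb} TB becomes {capacity_after(start_tb, rate, 5):.0f} TB in 5 years"
    )
```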
Gartner predicts “By 2008, nearly 50% of data centers worldwide will lack the necessary power and cooling capacity to support high-density equipment.”
What’s the storage buzz?
- SMI-S 1.2 (an ANSI standard for storage management) & Aperi (an open source storage management project – part of the Eclipse project).
- Continuous data protection – backs up files as they change.
- Virtualization – heterogeneous (the holy grail), switch-based (Cisco and Brocade – moving virtualization into the SAN itself), HBA (for VMWare or blade centers).
- Global Name Spaces (File Virtualization) – put something in front of a bunch of NAS devices that looks like a single name space. EMC and Brocade have purchased technologies in this area.
- Clustered File Systems and Storage (like Isilon)
- Archival file systems (Archivas, Permabit) – a specialized example of clustered file system.
- Database archiving
- Wide Area File Systems
- Object Based Storage Devices – when you’re storing data on storage devices, some metadata can be managed by the device not the storage system. (why would you want to do this?)
- TPM (Trusted Platform Module) in storage devices – TPM in devices and servers exchange certificates – storage devices can be made to not give up access if they’re not matched with the appropriate servers.
- Solid State & Hybrid
- Intelligent Storage Grids & Storage Autonomics – do self-provisioning based on access to policy rules.
Regulatory Effects on Storage -
New Federal Rules of Civil Procedure causing much FUD.
- “rules also mean that colleges that are in litigation or that suspect they may soon be in litigation cannot destroy electronic evidence they know would be relevant to a lawsuit.” (Chronicle of Higher Education)
- means universities will have to keep much better track of data.
Greg Jackson notes that this is a risk management issue where we need to be careful about going to great lengths to solve problems technologically instead of planning on some basic procedures that we might take when or if we have to perform under this law.
Use case – VoIP and Unified Messaging – talk about unstructured data!
Technorati Tags: CSG-Winter-2007, cyber-infrastructure, storage, research-computing
I’m in Los Angeles for the Winter meeting of the Common Solutions Group, at USC.
The first workshop is on building cyber-infrastructure for research. Bill Clebsch from Stanford frames the discussion by noting that this effort will make ERP implementations seem cheap and easy, and he’s told his provost that.
There was a survey of the CSG membership on context for research computing.
In the survey 78% of the membership see value in having governance/oversight for research computing, though only 26% have such a body.
The top issue is data center facilities, more than networking. Storage is a major short-term concern.
The predominant support model is raw hosting, with the data center only providing floor space, cooling, and power.
About half of the membership do some central staffing for research computing, but most of that is monitoring facilities and power. 45% of the respondents are doing some support for the technology portion of grant development. 68% offer options for system administration support.
Cost is the major factor influencing central data center use, especially when it’s not part of the indirect costs.
In a panel on key drivers and changes, Tim Gleason from Harvard notes that the data center they built two years ago is now full, and they’re about to start building another 10,000 sq. ft. data center right behind it.
Jim Pepin from USC is talking about hosting, co-location, and condo-ing – in condo-ing they put together the machine into the cluster, but the researcher has the exclusive use of it. They’re seeing lots more use of that (as opposed to traditional hosting or colo), because of the complexity involved in building the machines. About 60% of the machines in the cluster are centrally owned, 40% owned by the researchers. Researchers can also trade cycle futures with each other. There’s a faculty committee of senior faculty that allocates the central resources annually. They’ve never had to say no to any request in the six years they’ve been doing this kind of allocation.
Jim notes that the design of networks for high-end research is very important, and that there is some tension between that and the desires of campus security to build barriers into the network.
Pat Dreher talks about a physics project that will be amassing an exabyte (1,000 petabytes) of data over the next ten years.
Pat quotes Larry Smarr as saying that networking is becoming cheaper than storage, and storage is becoming cheaper than compute power – this is the first such major shift in a generation.
The folks from Penn State note that for planning purposes a kilowatt of power per square foot is a good number for the next few years.
Bill notes that at Stanford they believe that in fifteen years they won’t be hosting anything (they’ll be buying the services) so that the data center investment should be thought of in that time frame.
There’s a bunch of discussion about whether every institution needs to build a lot of data center capacity, or whether there are ways to collaborate across organizations. Kitty Bridges from UMich points out that we need to learn how to be nimble on our feet and agile in our own collaborations. Jim Phelps from Wisconsin proposes that if we can offer ways to support virtual organizations for cross-institutional research that might be a place to start.
It’s pointed out that despite the talk of virtual organizations, most research today is performed by single PIs working alone with a bunch of grad students within an institution.
After the break the discussion moves on to talking about sustainable funding models for research computing.
Kevin Morooney is talking about the history of research computing at Penn State – until 1988 research computing was in the Research organization, but in 1988 it moved into the Center for Academic Computing. They maintained three FTE for research support. In 1997 they created a new shop, which now has 15 FTE and a director for doing high performance computing and visualization. Kevin points out that cyber-infrastructure is not only happening in the central organization, but all over the campus. Looking ahead he sees another round of central IT investment coming, with campus coordination that goes beyond what happens in central IT, but that it’s important that the central IT work at understanding and providing for the needs of faculty researchers while coordinating all these other conversations.
Bill Clebsch is talking about how at Stanford the institution is charging schools and other units for power, which has changed the paradigm for research computing – schools have had to pay for power for research computing. This year, for the first time, schools have to pay for space. In the last six months these factors, plus faculty realizing that they could spend more time on research than on the “plumbing”, have come together. They’re now looking at building a new data center.
One of the unexpected side effects of coordinating this activity is groups wanting to co-locate research assistants, which they hope will build a community around computational research.
Jim Jokl from UVa is talking about their Linux clusters model – they contribute 20% of the cost. They charge $13.75/GB/yr for storage, but they provide a Hierarchical Storage Manager for archiving at no charge. As for everybody else, data center facility space is a large issue.
They hear a lot about getting people to support researchers. They’ve had a task force on computational science that’s recommended senior-level leadership, the need for grant development support, seed funding for promising programs, expert support for computational science (algorithms, data and security, visualization, etc).
Technorati Tags: cyber-infrastructure, higher-ed, CSG-Winter-2007, research-computing