CSG Winter 2015 – Integration Workshop

Beth Ann –
Integration getting more complex and it’s an area where central IT has always struggled. Demand is rising.
Survey results: 
Do you have an integration strategy? 11 do, 2 don’t.
What’s driving it – Innovation, security, new ERPs, BI/CRM, Other: diverse, distributed cloud systems
Where does integration responsibility reside? –  Central IT / Middleware – 6, distributed – 5, virtual group or hybrid – 2, shotgun – 1
Do you have an ESB? WSO2 – 4, Mulesoft – 1, Oracle – 2, None yet – 2.
Biggest app challenges – Workday, mainframe, ERPs, departmental apps, apps that require data transformation
Manage integration performance? Most don’t or have limited perf management
Security? Certs, “service” accounts, custom tools, policy, platform (Boomi, MuleSoft), Nginx, IDS, Kibana, encryption, firewalls.
Critical staff skill sets: Data architecture, advanced programming, basic programming, business analyst, config management
Jim Choate – UPenn – Riding the Enterprise Service Bus
Starting point: Myriad of point-to-point integrations, not well documented, fragile, difficult to be agile,
Drivers for change: New student system, SaaS solutions gaining traction: Concur, Canvas, SuccessFactors (used in health system and HR training), MIR3
New Approach – SOA, ESB
Benefits – Flexibility, Scalability, reasonable cost, availability, centralized management and monitoring
ESB selection criteria – Services: pub-sub, asynchronous messaging, synchronous messaging, transaction support, transformations, web service generation, guaranteed delivery
Deployment Environment: Scalability, availability, load balancing, clustering
Governance and Deployment: versioning, deployment, upgrades, support roles
Three finalists: WSO2, Mulesoft, Oracle
30 applications in production
Open Data Initiative – sparked by undergraduate assembly resolution to open up access to non-confidential data sets, implemented as RESTful APIs. Deployed APIs: map item filter service, map item filter parameters service, courses catalog search service, course section search service, course section search parameters service, dining service, directory search service, directory person details, news/events/map search service, transit data service.
Used by departmental web apps, student developed web apps, and (soon) student developed mobile app
Development and Alumni Relations – Key system is PeopleSoft app, real time sync of key biographic and contact data with iModules online community; high volume; fire and forget
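A minimal sketch of the fire-and-forget pattern mentioned for the alumni data sync, assuming an in-process queue and a background worker. The function names, the event fields, and the `deliver` callback are hypothetical stand-ins; the notes do not describe how Penn's ESB flow is actually built.

```python
import queue
import threading


def start_sync_worker(deliver):
    """Start a background worker and return (publish, shutdown) callables."""
    outbox = queue.Queue()

    def worker():
        while True:
            event = outbox.get()
            if event is None:  # sentinel value: stop the worker
                break
            try:
                deliver(event)
            except Exception:
                # Fire and forget: the publisher is never told about failures.
                pass

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    def publish(event):
        # Returns immediately; delivery happens later on the worker thread.
        outbox.put(event)

    def shutdown():
        outbox.put(None)
        thread.join()

    return publish, shutdown


# Hypothetical usage: the "remote system" here is just a list.
received = []
publish, shutdown = start_sync_worker(received.append)
publish({"person_id": "12345", "field": "email", "value": "alum@example.edu"})
publish({"person_id": "12345", "field": "city", "value": "Philadelphia"})
shutdown()
```

The trade-off is that the sender stays fast under high volume but never learns whether an update arrived, so this style suits data that will be corrected by a later sync.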
Canvas – Real time enrollments; very well received by students, faculty, and staff.
Lessons learned: Don’t get too far in front of vendors (every release of one product breaks APIs); Mulesoft – support generally good, but long wait times for some fixes; Mulesoft – upgrades more difficult than expected; Mulesoft – development environment wasn’t set up for how they liked to work, had to tune; Instructure – API issues… questionable design choices, bugs, throughput problems, documentation; Instructure – overburdened test environment – ended up paying for a dedicated test system
API Infrastructure – Scotty Logan – Stanford
In the beginning – standalone APIs, buildings/ Campus maps, Project Task Awards check – hard to find, moved around; Other random APIs – SOAP, RPC, rematch, RMI, RESTish
“SOA is Dead; Long Live Services” – Anne Thomas Manes, Burton Group, January 2009
CAP – Community Academic Profiles – Med School Faculty Profiles. Was Java + Oracle; Dept Drupal sites that wanted data – not Java, not Oracle, very common, need profile data. Common answer: RESTful API, JSON, OAuth 2.0.
API “Gang of Four” forms. On-campus Apigee workshop validated the approach. CAP API using Spring and MongoDB – generate JSON versions on profile changes; OAuth AS using Spring. API gateway using node.js. Minimal feature set
Early 2013 – CAP API design & implementation, client/server only, API gateway implementation.
Early 2014 Planning – API gateway enhancements, OAuth 2.0 AS enhancements (user (3 legged) tokens, SAML integration), Thin client + API for accounts system; developer portal
Mid 2014 – Reorg interrupted things but still moving forward
API Discussions: Real-time bus location; device compliance system; AED locations; environmental health & safety; staff & faculty training; directory data.
Common Concerns: opening up data, lack of desks, APIs and OAuth are new and scary; system load and support; infrastructure development
Still to do: Developer portal; “3 legged” OAuth; New gateway features: caching, token verification, rate limiting; UI for managing tokens
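As an illustration of the rate limiting listed among the planned gateway features, here is a minimal token-bucket sketch keyed by access token. The class names and parameters are hypothetical, and the notes do not say which algorithm the Stanford gateway will use; token bucket is just one common choice.

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, created_at):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = created_at

    def allow(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """One bucket per access token (the token strings are placeholders)."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.buckets = {}

    def allow(self, access_token, now):
        if access_token not in self.buckets:
            self.buckets[access_token] = TokenBucket(
                self.capacity, self.refill_per_second, created_at=now
            )
        return self.buckets[access_token].allow(now)


# Timestamps are passed in explicitly so the behaviour is easy to follow.
limiter = RateLimiter(capacity=2, refill_per_second=1.0)
burst = [limiter.allow("client-a", now=0.0) for _ in range(3)]
after_refill = limiter.allow("client-a", now=1.0)
other_client = limiter.allow("client-b", now=0.0)
```

A gateway would call `allow` with the current clock time before proxying each request and return an HTTP 429 response when it is refused.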
Susan Kelley – Yale
A lot of interest in SOA and ESB from a lot of people for a very long time.
Software solutions in varying degrees of use.
2013 IT Strategic Plan included a section on Integration – still needed to figure out implementation.
Hired a Director of Systems and Data Integration. Workday, opportunity to redo many integrations. CIO very interested in exposing APIs. Infrastructure becomes “exostructure”.
Systems and data integration – created a rationale, scope, and vision document. Central integration competency center – 2 people located with the DBAs, integration community of practice, training for all app teams on new tools.
Defined a three year time horizon in constraining discussion about tools. Had advocates for Fuse/ServiceMix, Dell Boomi, IBM Cast Iron, WebMethods. Determined that WebMethods would work well (they already had it) for short term. Focused on gaps and ended up selecting Layer7 for API management. Uses Talend for easy ETL work.
Merged IDM with systems and data development. Developers, platform admins.
Yale Data API service – being socialized by web team for reuse of data on sites. Data governance – quietly beginning to get some traction; Workday integrations.
Value, Business Drivers, and Challenges – Beth Ann Bergsmark – Georgetown
Current business and use cases for better integration – New ERPs, mobile experience, achieving the “connected campus” means breaking the ERP silos, engagement layer will create many more integrations; seamless intuitive user experiences – drivers for better application integration of extension tools.
Mobile experience – students expect realtime rich data. Use cases: registration, mobile access (ID cards) and identity, innovation.
Emerging and future use cases: increased importance of BI and CRM – engagement layer for experience, aggregation of data across multiple points, actionable data and outcomes; Internet of Things; Agility in service adoptions (cloud) means data integration must be agile too.
Challenges: Technology – systems that were never designed to share data; vendor lock-in issues, lack of APIs; fragile integrations that break; performance and monitoring. Business process: data elements defined in isolation. Greater complexity: no longer just point to point; new skills for staff; new situational awareness.
Path forward: Gain situational awareness by mapping and inventory of integrations across stacks, understand the gap between use cases and capabilities. Track or adopt iPaaS solution with standardized connectors; Build integration center of excellence, shift away from artisanal handcrafted integrations, unified voice with vendors to push for modern integration capabilities, integration capability needs to be a heavier influence on priorities for service/system modernization and product service selection; align with data governance.

CSG Winter 2015 – DevOps workshop

Why you should care – Bruce Vincent and Scotty Logan – Stanford
How do we reconcile: desire for continuous functional improvement; need for efficient deployment workflow; platform variations and desire for portability; expectation of zero service disruption? Can’t disrupt ongoing practices. Outage windows – “is never good for you?”
The problem – how to manage change. Streamlining deployment. Going right to live – scary thought?
State of the art 2011 – (cloud implementation)
Containers are a game changer: Application consistency; portability; rapid prototyping, testing, deployment; disposable servers. There were always problems in making sure that the environment is the same in dev and prod. Developers can’t deal with the complexity.
Version upgrades can be done discretely, tested, and staged. Orchestration builds entire environment automatically. Container OS is tiny and disposable, so almost no sysadmin or patching required. Very cost effective and no hypervisor overhead. Docker supported on AWS, Google Compute, OpenStack, and soon Azure.
Your whole stack as code: Programming professionals are driving DevOps as a new standard in software engineering practice. Continuous Integration; Blue-Green deployment; You get more productivity from your developers with DevOps; As a nice additional benefit, good developers want to work in your shop.
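To make the blue-green deployment idea concrete, here is a minimal sketch, assuming two environments and a caller-supplied health check. The class, method names, and version strings are hypothetical and not taken from the Stanford setup.

```python
class BlueGreenRouter:
    """Two environments; only one receives live traffic at a time."""

    def __init__(self, blue_version):
        self.environments = {"blue": blue_version, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def live_version(self):
        return self.environments[self.live]

    def deploy(self, version, health_check):
        """Install on the idle side; switch traffic only if it is healthy."""
        target = self.idle
        previous = self.environments[target]
        self.environments[target] = version
        if not health_check(version):
            # Put the idle side back so a later rollback stays safe.
            self.environments[target] = previous
            return False
        self.live = target
        return True

    def rollback(self):
        """Point traffic back at the previously live environment."""
        if self.environments[self.idle] is None:
            raise RuntimeError("no previous environment to roll back to")
        self.live = self.idle


router = BlueGreenRouter("1.0")
first = router.deploy("1.1", health_check=lambda version: True)
second = router.deploy("1.2", health_check=lambda version: False)
version_after_failed_deploy = router.live_version()
router.rollback()
version_after_rollback = router.live_version()
```

Because the old environment keeps running until the switch, a bad release never takes traffic and rollback is a pointer change rather than a rebuild.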
Using Terraform to script virtual data center at AWS.
Organizational Skills and Issues – Charlie Kneifel
Innovation doesn’t move fast enough – balance between the right amount of process and allow the innovation. At Duke have a group that meet on a weekly basis. At Duke made progress in automation and reaped some paybacks.
DevOps maturity model: Duke case study/demo – Mark McCahill
DevOps won’t happen overnight.
The basics – have had in place, virtualized compute, virtualized storage, puppet configuration management, SVN/Git repository, ticketing system
Standardization – lovingly hand-crafted systems created by artisan sysadmins fail. CVL and CM-manage illustrated that standard build processes work. Clockworks team: 2 devs + 3 team leads (Linux, Windows, Monitoring) + architect.
Clockworks – configure & provision custom VMs. ServiceNow ticket process to handle whatever we haven’t yet automated. Chaos control opportunities: TSM backup configuration; self-service Shib SP registration; self-service Comodo site cert signing (Locksmith).
Next Steps: Stevedore: Automate Drupal and WordPress via Docker container orchestration. Containers: data, MySQL, PHP, Apache; site cert creation and installation; Shib SP registration.
IDM in containers: Kerberos KDC container now in test. Continuous builds via Jenkins; automate testing; retain old container – we can rollback.
Antikythera – DevOps automation isn’t just for admin & web sites. Research computing provisioning proof of concept – compute, storage, apps/containers… and SDN as it is more widely deployed. Lets you have clear provenance of code and datasets for any specific job.
Summary – migrate ticket-driven artisan-crafted work and processes to self-service apps. Orchestrate automation via self-services app APIs; Automation dashboards for both research and admin computing.
Bill Allison – Berkeley: Moving to Continuous
Case study: The Berkeley Desktop
Fall 2011 – OMG v0.1: Everything is broken or breaking all the time. No time for staff to work on solutions. Compromised machines. Standard image doesn’t work on laptops. standard image too hard to change.
Imaging a machine took 4 hours of senior tech time. Varying hardware standards, no significant automation, manual work, no checklists.
Too busy to improve.
Split Desktop Design and Engineering from Ops and support.
Tackle the things that increase costs: Labor, productivity loss; change; variance
Now have 11-12k computers under management, around 5k with the full Berkeley desktop.
Artifacts are public in GitHub so others can use them.

CSG Winter 2015 – Software Defined Networking

We’re in Berkeley for the Winter 2015 CSG meeting.
The first workshop is on Software Defined Networking
Applications call SDN controller through RESTful API. Controller talks to network switches to configure private virtual network on the fly. Takes control of the network out of the hands of network admins and puts it into apps and services. Have to plan how to open up the network. Have to figure out how to roll out within the old network environment.
Survey results – Jim Jokl
24 responses.
Do you currently have a separate dedicated network for research? Almost half do.
Do you plan to make significant investment in your core network over the next 12-18 months? 2/3 do.
Are you aware of anyone on your campus holding a NSF award that involves connecting to AL2S? Almost half are.
Your State/Regional network’s support for AL2S – 2/3 do.
Expected level of integration between state network and AL2S – regionals ready to tunnel connection back to campus.
Do you have a SDN strategy? Most say in process.
Most campuses in early stages of figuring out what to do.
Barriers to adoption? Lack of staff time and expertise.
Summarize SDN goals – bypassing friction devices, connecting researchers to AL2S
Where do you see SDN used on campus over 3-5 years? Data center, campus bypass network.
Industry survey – 87% will have data center SDN in production in 2016.
Charlie Kneifel – Duke’s SDN journey
Part 1 – planning
Definitions – implementation of OpenFlow software controller that manages network traffic flow on a set of network devices. Focused on edge more than core. Primary goal is to improve speed, reliability and performance of the network used by researchers.
Current state: SDN switches deployed in production – hub and spoke model. Production controller – Ryu based. Production rule manager – SwitchBoard (Mark developed). perfSONAR nodes deployed across campus. In middle of upgrading to new version (Puppet’izing). Efforts led to redesign of Duke core network. Duke uses an MPLS core and can switch to a VRF easily – so routing is everywhere.
Infrastructure considerations: – Dedicated science network? converged/unified network? fiber infrastructure? Needs at the core? needs at the edge?
Lessons learned –
Test, test, test in controlled fashion (perfSONAR is your friend)
Oversubscription – accidental or intentional. Span ports, layer 2/3 domains
10G Cannons aimed at your network – can melt things in ways you never expected
Measurement of real bandwidth availability
Firewalls – what are the real limits – per stream/overall – how do they fall over / fail?
IPS – looking at traffic between VRFs – where is traffic inspected, white listing, how often?
IDS – Passive
It’s another network upgrade: But it’s not fully documented. Keep the core a significant multiplier of the edge
QOS is important if you have converged services.
General SDN model at Duke
Integrated – hosts connect to a network. Default path for traffic is to the production network. Controller allows a rule to bypass.
Fun with Ryu – Mark McCahill – Duke
Why SDN? campus network speed bumps (firewalls, IPS/IDS)
Have self-service bypass networks for researchers – should be simple web app
Also researchers want access to national nets.
Switchboard – Web app (Ruby on Rails) – who is authorized to enable a bypass/link, status of requests, update SDN controller based on approved requests (created fear and loathing in web developer). Ryu gives a REST interface to set up and tear down routes; want to be able to rollback/restore SDN controller state, auditability of state of network configuration. What’s hard is having web developers talking to network engineers – they have totally different languages.
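The three requirements listed for Switchboard (authorized approval, rollback/restore of controller state, and auditability) can be sketched as follows. This is a toy model in Python rather than the actual Rails app: the class, the rule format, and the names are hypothetical, and the controller is represented by a plain list instead of calls to the Ryu REST interface.

```python
import copy


class BypassBoard:
    """Toy stand-in for a self-service bypass request app."""

    def __init__(self, approvers):
        self.approvers = set(approvers)  # who may approve or roll back
        self.pending = {}                # request id -> requested rule
        self.active_rules = []           # stand-in for SDN controller state
        self.snapshots = []              # earlier states, newest last
        self.audit_log = []              # append-only record of every action
        self._next_id = 1

    def request(self, requester, src, dst):
        request_id = self._next_id
        self._next_id += 1
        self.pending[request_id] = {"src": src, "dst": dst}
        self.audit_log.append(("requested", requester, request_id))
        return request_id

    def approve(self, approver, request_id):
        if approver not in self.approvers:
            self.audit_log.append(("denied", approver, request_id))
            raise PermissionError(f"{approver} may not approve bypass requests")
        rule = self.pending.pop(request_id)
        # Snapshot the current state first so the change can be undone.
        self.snapshots.append(copy.deepcopy(self.active_rules))
        # A real deployment would push the rule to the controller here.
        self.active_rules.append(rule)
        self.audit_log.append(("approved", approver, request_id))

    def rollback(self, operator):
        if operator not in self.approvers:
            raise PermissionError(f"{operator} may not roll back state")
        if not self.snapshots:
            raise RuntimeError("no earlier state to restore")
        self.active_rules = self.snapshots.pop()
        self.audit_log.append(("rolled_back", operator, None))


board = BypassBoard(approvers={"neteng"})
rid = board.request("researcher", src="10.0.0.5", dst="10.0.9.9")
try:
    board.approve("researcher", rid)
    self_approval_blocked = False
except PermissionError:
    self_approval_blocked = True
board.approve("neteng", rid)
rules_after_approval = list(board.active_rules)
board.rollback("neteng")
```

Keeping the snapshot and the audit entry in the same code path as the change is what makes the state of the network reconstructible later.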
Deployment strategy – start with intermittent bandwidth intensive tasks: backups, bulk data moves, protected network segment data ingestion. Run building edge switches in hybrid mode – enable OpenFlow ports where needed.
Next steps for Ryu – modifying quite a bit. Limit DHCP responses to authoritative DHCP servers. VLAN tag flipping to support linking VLANs. Detect and throttle ARP flooding (and other DOS attacks on the SDN controller).
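Two of the next steps above, limiting DHCP responses to authoritative servers and throttling ARP floods, come down to simple per-packet decisions. The sketch below shows that logic in plain Python, assuming a sliding-window counter per source MAC; the names, thresholds, and addresses are hypothetical, and none of this is Ryu code or Duke's implementation.

```python
from collections import defaultdict, deque

# Hypothetical list of authoritative DHCP servers (documentation addresses).
AUTHORITATIVE_DHCP_SERVERS = {"192.0.2.10", "192.0.2.11"}


def allow_dhcp_reply(source_ip, authoritative=AUTHORITATIVE_DHCP_SERVERS):
    """Only replies from known DHCP servers are forwarded."""
    return source_ip in authoritative


class ArpFloodDetector:
    """Allows at most `max_requests` ARP packets per MAC per time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.seen = defaultdict(deque)  # source MAC -> recent accepted timestamps

    def allow(self, src_mac, now):
        times = self.seen[src_mac]
        # Forget packets that have aged out of the window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_requests:
            return False  # over the limit: drop; dropped packets are not counted
        times.append(now)
        return True


detector = ArpFloodDetector(max_requests=3, window_seconds=1.0)
mac = "00:00:5e:00:53:01"
flood = [detector.allow(mac, now=t) for t in (0.0, 0.25, 0.5, 0.75)]
later = detector.allow(mac, now=1.1)
rogue_dhcp = allow_dhcp_reply("198.51.100.7")
```

In a controller these checks would run in the packet-in handler, and a repeated offender would typically get a drop rule installed on the switch so the flood stops reaching the controller at all.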
Lessons: You can easily simulate an SDN network. Ryu REST router + Mininet + Open vSwitch. http://sdnhub.org/releases/sdn-starter-kit-ryu/ Open vSwitch is very capable. An SDN simulation on my VM was #1 with a bullet on the top-talker charts one afternoon – network engineers should be careful about which addresses they assign.
Lessons: The OpenFlow port you most care about is probably disabled. Bad connections: how polished is your glass? Fiber terminations that are good enough for 1 Gig aren’t necessarily good enough for 10 Gig. Hybrid mode switches in the wrong mode.
Endgame: Campus SDX (Software Defined Exchange) – Campus core bypass links for science DMZ, interconnects layer 2 services (AL2S, BEN, etc.). Start with a self-service app (Switchboard), then automate via the API.
Roy Hockett – University of Michigan SDN Research
Recent research area – Clean Slate project in 2007.
Research on networking and abstractions, research on SDN implementations, research that uses SDN as tool or component
Mobilab – global-scale live laboratory for supporting mobile computing science.
Atlas Great Lakes Tier 2 – LHC computing and muon collaboration. Leveraging SDN to build paths to transfer data.
SDN Vendorscape
Lots of activity in this space with many choices. Most vendors claim their hardware/software can be introduced gradually into the network without disruption. Most products or services are designed for data center optimization – with attention just now shifting to the WAN. ISPs are embracing network virtualization, optimization and automation features, so vendors are delivering products and services that will do this. SDx Central is a good source of information.
Eric Boyd – Internet2 and AL2S
Innovation story – Abundant Bandwidth – 100G for now, Network programmability – SDN, network virtualization, Friction-Free Science – Science DMZ
AL2S today – a reliable VLAN service. Would like to be able to offer a range of services, each in their own slice. Run your own network on the underlying I2 network to allow rapid prototyping of advanced applications and new network services. Private network capabilities with shared networked costs.
Technology behind network virtualization – Built a hypervisor called FlowSpace Firewall
Example – Prototype Multi-Domain Layer 2 Services
Making OpenFlow Networks Manageable – Steve Waldbusser – Stanford
“Northbound” OpenFlow architecture not yet defined (controller-controller, controller-app). Industry is based around SNMP but that doesn’t intersect yet. Most organizations don’t want to deploy technologies that aren’t based on SNMP. Need to fill gap.
OpenFlow shares a lot with previous architectures, but adds new concepts – new capabilities to manage, new gotchas to avoid. How do we manage these new capabilities? metrics, tools, processes.
OpenFlow can provide much faster provisioning – how fast? what are bottlenecks? How reliable? SLAs?
Engineers need tools and data to make decisions about defining policies into controllers.
OpenFlow Management Gateway – OMG – an OpenFlow controller, uses OpenFlow protocol to gather important metrics, translates info to SNMP (could do REST APIs too), works side-by-side with existing controller infrastructure.
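A rough sketch of the translation step described above: per-port byte counters of the kind OpenFlow port statistics report are mapped onto the standard IF-MIB `ifInOctets` and `ifOutOctets` columns. The choice of IF-MIB, the use of the port number as the interface index, and the function name are assumptions for illustration; the talk notes do not say which MIBs OMG actually exposes.

```python
# Standard IF-MIB column OIDs (ifTable); the row index is appended per port.
IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10"
IF_OUT_OCTETS = "1.3.6.1.2.1.2.2.1.16"

# ifInOctets and ifOutOctets are Counter32 objects, so values wrap at 2**32.
COUNTER32_MODULUS = 2 ** 32


def port_stats_to_oids(port_stats):
    """Translate per-port byte counters into an OID -> value table.

    `port_stats` is a list of dicts with `port_no`, `rx_bytes`, and
    `tx_bytes` keys, a simplified stand-in for an OpenFlow port-stats reply.
    Using the port number as the SNMP interface index is an assumption.
    """
    table = {}
    for stat in port_stats:
        index = stat["port_no"]
        table[f"{IF_IN_OCTETS}.{index}"] = stat["rx_bytes"] % COUNTER32_MODULUS
        table[f"{IF_OUT_OCTETS}.{index}"] = stat["tx_bytes"] % COUNTER32_MODULUS
    return table


sample_stats = [
    {"port_no": 1, "rx_bytes": 1500, "tx_bytes": 2 ** 32 + 7},
    {"port_no": 2, "rx_bytes": 0, "tx_bytes": 64},
]
oid_table = port_stats_to_oids(sample_stats)
```

An SNMP agent could then serve this table to existing monitoring tools, which is the side-by-side operation the gateway is meant to provide.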

CNI Fall 2014 Meeting – Closing Plenary

Closing Plenary – Cliff Lynch

A record setting meeting – not just attendees, but number of session proposals was far in excess of what’s been seen before.

Security and privacy have become pervasive issues. Concerns about security interplay with notions of moving services onto the net and depending on remote organizations and facilities. Security and privacy are separate things, though interrelated. Privacy itself has become a multi-headed thing. Was traditionally privacy from the state or privacy from your neighbors, now there is a vast enterprise interested in data about you. There are people who think we have approached this wrong, that we should punish people for making nasty use of information rather than for failure to keep it secret. Snowden revelations represent enormous breach of security in organizations that are supposed to be the best and are well funded. Some of the revelations are about efforts to undermine security in the national and international networking infrastructure. That suggests that we have a lot to do in improving security – hard to believe in selective compromises that can only be open to the good guys.

The Snowden breach also is a highwater mark of a trend that raises issues for archives and libraries – an example of a large and untidy database of material. This is not a leaked memo, or even the Pentagon papers. Here we have this big dataset cached in various places. The government is still not comfortable with this to the point that it cannot be used as reference material in classes taken by government employees as they would be mishandling classified documents. How are we going to manage these important caches of source documents, and who is going to do it?

There are any number of other security and privacy problems you can read about in the press. The term “data breach” suggests a singular event. There is evidence that many systems are compromised for a long period of time – that’s an important distinction. Seeing a spectacular example at present with Sony where it appears they may have lost control of their corporate IT infrastructure to a point where they may have to tear it down and build it again. The IETF is looking at design factors and algorithms – we need to do this in our community more systematically. There’s been good material coming out of Internet2 and Educause joint security task force. Some of this is easy – why are we sending things in the clear when we don’t need to? Underlying assumption that it’s a benign world out there. The whole infrastructure around the metadata harvesting protocol is open – who would want to inject bad data about repositories? We’re using these as major sources of inventories of research data – would be good if they’re reasonably accurate. CNI will convene, probably in February, to start building a shopping list of things relevant to our community.

Two things that are harder and more painful to deal with: One is the sacrifices we need to make to get licenses for certain kinds of material, especially consumer market materials. Look at the compromises public libraries have made to license materials for their patrons – pretty uncomfortable with privacy choices. Need to reflect long and hard across the entire cultural memory sector. The other thing is levels of assurance – how rigorous do you want to be with evidence in trusting someone. Some identities are issued with an email address, others want to see your passport and your mother. It’s easier to do it right the first time, but sometimes if you do it right it can take forever. We’re building a whole new apparatus about factual biography and author identity – no agreement or even discussion about what our expectations are around this. Do we want to trust people’s assertions, or do we want it verifiable? Part of the problem is it’s hard to understand how big the problem is.

In the commercial sphere it is stunning how much we don’t know about how much personal information is passed around and reused.

Another clear trend: Research data management. We are still waiting eagerly (and with growing impatience) for policies and ground rules from the funding agencies about implementing OSTP directives. In broader context seeing focus on data, data sharing, big data. Phil Bourne was appointed as the first associate director for data science at the NIH – creation of that role underscores how important they see research data and data management. Seeing this in other agencies and in business. City governments are getting involved in big data and the emergence of centers in urban informatics. SHARE will be a backbone inventory and analytic tool for understanding research data responsibilities – has a clear idea where it’s heading and is starting to move along.

Still many things we’re not coping with well in this area – data around human subjects. Not sure we have a good conversation between those who are concerned about privacy and those who see what can be accomplished with information. A story that illustrates developments and fault lines. There’s a whole alternative in social science emerging out there with studies that could never have been done within universities but fairly well respectful of privacy. Sometime earlier this year the Proceedings of the National Academy of Sciences published a paper jointly authored by researchers at Cornell and Facebook. Emotional contagion – if your circle of friends share depressing information then you will reflect depressing information back. Came up with idea to test this on Facebook. Need a lot of people in the experiment. They twiddled Facebook feed algorithm to bias towards depressing items and then did sentiment analysis. Then looked at people on sending end of those items. Around 60k people. They found there was a little truth in the theory. People started freaking out in various directions: academics (what IRB allowed this? where were the informed consent forms?); another group that said this seemed fairly harmless and can’t really be done with informed consent and should be viewed as a clever experiment. This hasn’t been resolved. Some people are worried that things that are product optimizations normally could be reframed as human subject experiments. Some people are wondering whether we don’t need a little regulation in this area. There was a conference at MIT on digital experiments. Large enterprises are doing thousands of tests a year, with sophisticated statistics, to tweak their optimizations. The part of the Facebook thing that was surprising is there were a lot of unhappy Facebook users offended about the news feed algorithm being messed with – without understanding that there are hundreds of engineers at Facebook messing with it all the time.
Put a spotlight on how little people understand how much their interactions are shaped by algorithms in unpredictable ways.

People don’t realize how personalized news has become – we don’t all see the same NY Times pages. What does it mean to try and preserve things in this environment? Intellectually challenging problem that deserves attention as we think about what are the important points to stress in information literacy. What’s appropriate ethically? What about research reproducibility? What evidence can we be collecting to support future research?

There was a CLIR workshop on Sunday about things we have in archives and special collections that need to be restricted in some way or are in ambiguous status. E.g. things collected before 1900 when collecting and research practices were different. That is some of the only research we will ever have on some things and places, and we have to talk about them no matter how awkward.

Software – we often make casual statements about software preservation and sustainability. Time to take a closer look. There is massive confusion about what sustainability means, the difference between sustainability and preservation, and what those terms mean. Time for more nuance around this. Rates of obsolescence and change – in some sense desirable to keep everybody on current version, but the flip side is that vendors have enormous motivations to put people through painful frequent cycles of planned obsolescence. There is some evidence that there are better outcomes of backward compatibility with open source software. We need to understand the forces behind obsolescence cycles and what that implies in areas like digital humanities where there’s not a lot of money to rewrite things every year. We’re seeing new tools on virtualization technologies for software preservation.

Did an executive roundtable on supporting digital humanities at scale. Linked closely to digital scholarship centers – these are important mechanisms for diffusion of information on technology and methodology. One of the striking things is that we got a lot of people who are looking at the issue of, or planning for, scholarship centers. There’s interest in looking at a workshop for people planning such a center. There are lots of things that have the word “center” in them, with widely varying meanings. It might be a real help to summarize the points of disagreement and the different kinds of things parked under these headings.

If you ask the question how are we doing in terms of preserving and providing stewardship of cultural memory in our society (including but not limited to scholarly activity), nobody can answer. If you ask are we doing better this year than last we have no idea how to answer. How much would it cost to do 50% better than now? Can’t answer that either. There have been some point investigations – like the Keepers activity. Studies from Columbia and Cornell on what proportion of periodicals are archived. Other than copyright deposit at LC we have no mechanism to get recordings into institutions that care about the cultural record. We’re in a slow motion train wreck with the video and visual memory of the 20th century – it’s a big problem that, until recently, we couldn’t get a handle on. This will require a big infusion of funds in the interest of cultural memory. Indiana University took a systematic inventory of their problem and then was able to win a sizeable down payment from leadership to deal with it. NY Public have done a study and are just sharing results – their numbers are bigger and scarier than Indiana’s. Getting surveys done is getting a bit easier. There are probably horrible things waiting to be discovered in terms of preserving video games. Preserving the news is another area. Part of the difficulty is the massive restructuring in some of these industries. Helpful to think about this systematically in order to prioritize and measure our collective work.

CNI Fall 2014 Meeting: The NIH Contribution to the Commons

The NIH Contribution to the Commons
Philip E Bourne, Associate Director for Data Science, NIH

There’s a realization that the future of biomedical research will be very different – the whole enterprise is becoming more analytical. How can we maximize rate of discovery in this new environment?

We have come a long way in just one researcher’s career. But there is much to do: too few drugs, not personalized, too long to get to market; rare diseases are ignored; clinical trials are too limited in patients, too expensive, and not retroactive; education and training does not match current market needs; research is not cost effective – not easily replicated, too slow to disseminate. How do we do better?

There is much promise: 100,000 genomes project – goal is to sequence 100k people and use it as a diagnostic tool. Comorbidity network for 6.2 million Danes over 14.9 years – the likelihood if you have one disease you’ll get another. Incredibly powerful tool based on an entire population. We don’t have the facilities in this country to do that yet – need access and homogenization of data.

What is NIH doing?

Trying to create an ecosystem to support biomedical research and health care. Too much is lost in the system – grants are made, publications are issued, but much of what went into the publication is lost.

Elements of ecosystem: Community, Policy, Infrastructure. On top of that lay a virtuous research cycle – that’s the driver.

Policies – now & forthcoming
Data sharing: NIH takes this seriously, and now has mandates from government on how to move forward with sharing. Genomic data sharing is announced; Data sharing plans on all research awards; Data sharing plan enforcement: machine readable plan, repository requirements to include grant numbers. If you say you’re going to put data in repository x on date y, it should be easy to check that that has happened and then release the next funding without human intervention. Actually looking at that.
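The enforcement idea above (a machine-readable plan checked against repository records that carry grant numbers) can be sketched as a small status function. The record fields, status labels, repository name, and grant number below are hypothetical placeholders, not an NIH format.

```python
from datetime import date


def check_plan(plan, deposits, today):
    """Return 'compliant', 'late', 'pending', or 'overdue' for one plan.

    `plan` names a repository, a grant number, and a due date. `deposits`
    are repository records that list the grant numbers they acknowledge.
    """
    matching_dates = [
        deposit["deposited_on"]
        for deposit in deposits
        if deposit["repository"] == plan["repository"]
        and plan["grant_number"] in deposit["grant_numbers"]
    ]
    if matching_dates:
        # A deposit exists; compare the earliest one with the promised date.
        return "compliant" if min(matching_dates) <= plan["due"] else "late"
    # Nothing deposited yet: fine until the promised date has passed.
    return "pending" if today <= plan["due"] else "overdue"


plan = {
    "grant_number": "R01-EXAMPLE-0001",  # placeholder, not a real award
    "repository": "example-repository",
    "due": date(2015, 6, 30),
}
deposits = [
    {
        "repository": "example-repository",
        "grant_numbers": ["R01-EXAMPLE-0001"],
        "deposited_on": date(2015, 5, 12),
    }
]
status_with_deposit = check_plan(plan, deposits, today=date(2015, 7, 1))
status_before_due = check_plan(plan, [], today=date(2015, 6, 1))
status_after_due = check_plan(plan, [], today=date(2015, 7, 1))
```

A funding system could release the next installment automatically on `compliant` and route only the other statuses to a person.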

Data Citation – elevate to be considered by NIH as a legitimate form of scholarship. Process: machine readable standard for data citation (done – JATS (xml ingested by PubMed) extension); endorsement of data citation in NIH biosketch, grants, reports, etc.

Infrastructure –
Big Data to Knowledge initiative (BD2K). Funded 12 centers of data excellence – each associated with different types of data. Also funded data discovery index consortium, building means of indexing and finding that data. It’s very difficult to find datasets now, which slows process down. Same can be said of software and standards.

The Commons – A conceptual framework for sharing and being FAIR: Finding, Accessing, Integrating, Reusing
Digital research objects with attribution – can be data, software, narrative, etc. The Commons is agnostic of computing platform.

Digital Objects (with UIDs); Search (indexed metadata); Search

Public cloud platforms, super computing platforms, other platforms

Research Object IDs under discussion by the community – BD2K centers, NCI cloud pilots (Google and AWS supported), large public data sets, MODs. Meeting in January in UK – could be DOIs or some other form.

Search – BD2K data and software discovery indices; Google search functions

Appropriate APIs being developed by the community, e.g. Global Alliance for Genomics and Health. I want to know what variation there is in chromosome 7, position x, across the human population. With the Commons more of those kinds of questions can be answered. Beacon is an app being tested – testing people’s willingness to share and people’s willingness to build tools.
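
To make the chromosome 7 question concrete, here is a rough sketch of a Beacon-style lookup: given a chromosome, position, and allele, answer only yes or no. This is not the Beacon API itself. The in-memory table, the function name, and the positions are all invented for illustration.

```python
# Hypothetical variant table: (chromosome, position, alternate allele).
# The positions are placeholders, not real genomic coordinates of interest.
KNOWN_VARIANTS: set[tuple[str, int, str]] = {
    ("7", 1000, "A"),
    ("7", 2000, "T"),
    ("17", 3000, "G"),
}


def beacon_query(
    chromosome: str,
    position: int,
    allele: str,
    variants: set[tuple[str, int, str]] = KNOWN_VARIANTS,
) -> bool:
    """Answer yes/no: has this allele been observed at this position?"""
    return (chromosome, position, allele) in variants


def alleles_at(
    chromosome: str,
    position: int,
    variants: set[tuple[str, int, str]] = KNOWN_VARIANTS,
) -> list[str]:
    """List the alleles recorded at one position, sorted for stable output."""
    return sorted(a for c, p, a in variants if c == chromosome and p == position)


print(beacon_query("7", 1000, "A"))
print(alleles_at("7", 2000))
```

A yes/no answer reveals very little about any individual, which is presumably why it works as a test of willingness to share.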

The Commons business model: What happens right now is people write grant to NIH, with line items to manage data resources. If successful they get money – then what happens? Maybe some of it gets siphoned off to do something else. Or equipment gets bought where it’s not heavily utilized. As we move more and more to cloud resources, it’s easier to think of a business model based on credit. Instead of getting hard cash you’re given an amount of credit that you can spend in any Commons compliant service, where compliance means they’ve agreed to share. Could be that institution is part of Commons or it could be public cloud or some other kind of resource. Creates more of a supply and demand environment. Enables a public/private partnership. Want to test idea that more can be done with computing dollars. NIH doesn’t actually know how much they spend on computation and data activities – but undoubtedly over a billion dollars per year.

Community: Training initiatives. Build an OPEN digital framework for data science training: NIH data science workforce development center (call will go out soon). How do you create metadata around physical and virtual courses? Develop short-term training opportunities – e.g. supported workshop with gaming community. Develop the discipline of biomedical data science and support cross-training – OPEN courseware.

What is needed? Some examples from across the ICs:
Homogenization of disparate large unstructured datasets; deriving structure from unstructured data; feature mapping and comparison from image data; visualization and analysis of multi-dimensional phenotypic datasets; causal modeling of large scale dynamic networks and subsequent discovery.

In process of standing up Commons with two or three projects – centers being funded from BD2K who are interested, working with one or two public cloud providers. Looking to pilot specific reference datasets – how to stimulate accessibility and quality. Having discussions with other federal agencies who are also playing around with these kinds of ideas. FDA looking to burst content out into cloud. In Europe ELIXIR is a set of countries standing up nodes to support biomedical research. Having discussions to see if that can work with Commons.

There’s a role for librarians, but it requires significant retraining in order to curate data. You have to understand the data in order to curate it. Being part of a collective that is curating data and working with faculty and students that are using that data is useful, but a cultural shift. The real question is what’s the business model? Where’s the gain for the institution?

We may have problems with the data we already have, but that’s only a tiny fraction of the data we will have. The emphasis on precision medicine will increase dramatically in the next while.

Our model right now is that all data is created equal, which clearly is not the case. But we don’t know which data is created more equal. If grant runs out and there is clear indication that data is still in use, perhaps there should be funding to continue maintenance.

CNI Fall 2014 Meeting – VIVO Evolution

Evolution of VIVO Software

Layne Johnson, VIVO Project Director, DuraSpace

VIVO History – started at Cornell in 2003. 2009-12 NIH funded VIVO ($12 million) to evolve.

Problems – Researchers struggle to identify collaborators, most information and data are highly distributed, difficult to access, reuse, & share and is not standardized for interop.

VIVO can facilitate collaborations and store disparate information in the VIVO-ISF ontology.

What is VIVO? An open source semantic web application that enables management and discovery of research and scholarship across disciplines and institutions.

VIVO harvests data from authoritative sources thus reducing manual input and providing integrated data sources. Internal data from ERPs, external data from bibliographic sources, ejournals, patents, etc.

VIVO data stored as RDF.

Triple stores and linked open data: provide ability to inference and reason; can be machine readable; links into the open data cloud; provide links into a wide variety of information sources from different interoperable ontologies; allow knowledge about research and researchers to be discovered.
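
As a rough illustration of why triples make researchers discoverable, here is a toy in-memory triple list with pattern matching. It is a sketch only: the `ex:` names are hypothetical placeholders, not VIVO-ISF ontology terms, and a real triple store would be queried with SPARQL rather than a Python loop.

```python
from typing import Optional

Triple = tuple[str, str, str]

# Hypothetical subject-predicate-object statements; the ex: terms are made up.
TRIPLES: list[Triple] = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:alice", "ex:authorOf", "ex:paper1"),
    ("ex:bob", "ex:authorOf", "ex:paper1"),
    ("ex:paper1", "rdf:type", "ex:Article"),
]


def match(
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
    triples: list[Triple] = TRIPLES,
) -> list[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def coauthors(person: str) -> list[str]:
    """Follow authorOf links out to papers and back to find co-authors."""
    papers = {o for _, _, o in match(s=person, p="ex:authorOf")}
    found = {
        s
        for paper in papers
        for s, _, _ in match(p="ex:authorOf", o=paper)
        if s != person
    }
    return sorted(found)


print(coauthors("ex:alice"))
```

The co-author relationship is never stored directly; it falls out of following links, which is the kind of discovery the notes describe.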

VIVO supports search & exploration – by individual, type, relationship, combinations and facets.

One of the larger implementations is USDA VIVO. Another interesting one is Find an Expert at the University of Melbourne. Scholars@Duke. Mountain West Research Consortium has a cross consortium search. The Deep Carbon Observatory data portal uses VIVO.

Installed base of VIVO implementations has remained somewhat level.

VIVO Evolution: from grant-funding to open source. In 2012-13 VIVO partnered with DuraSpace, who provide infrastructure and leadership – legal, tax, marketing communication, leadership. Sustained through a community membership model. VIVO project director hired May 1.

Charter process – Jonathan Markow & Steering group. Based on DuraSpace model for consistency across products. Charter finalized in late July, 2014.

VIVO Governance: Leadership group, steering group, management team. Four working groups: Development & Implementation; Applications & Tools; VIVO-ISF Ontology; Community Engagement & Outreach (undergoing reconstitution).

Four levels of membership – $2.5k, $5k, $10k, $20k.

VIVO strategic planning: 14 member strategy group created from leadership, steering, management teams and external members. Met December 1 & 2. Did a survey to determine current state of 41 VIVO leaders. Got 20 respondents. VIVO’s 3 strategic themes: community, sustainability, technology. 5 top goals for each theme selected, each strategy group member got to vote for 3 goals per theme.

Community: increase productivity; develop more transparent governance; increase engaged contributors; maintain a current and dynamic web presence; develop goals for partnerships (ORCID, CRIS, CASRAI, W3C, SciEnCV, CRediT, etc.)

Sustainability: create welcoming community; develop clear value proposition; increase adoption; promote the value of membership.

Technology: Develop democratic code processes; clarify core architecture and processes; develop VIVO search; improve/increase core modularity; team-based development processes.

CNI Fall 2014 Meeting: Fedora 4 early adopters

Fedora 4 Early Adopters

David Wilcox, Fedora Product Manager, DuraSpace

Fedora 4.0 released November 27. Built by 35 Fedora community developers. Native citizen of the semantic web – linked data platform service. Hydra and Islandora integration.

Beta pilots – Art Institute of Chicago, Penn State, Stanford, UCSD.

62 members in support of Fedora, funding increased dramatically (over $500k). Effort around building sustainability – more members at lower funding amounts. Governance model – Leadership and steering groups.

Fedora 4 roadmap – short term (6 months) – 4.1 will support migrations from Fedora 3. Want to establish migration pilots, and prioritize 4.1 features.

Fedora 4.1 features – focus on migrations, but some new features – API partitioning, Web Access Control, Audit service, remote/asynch storage are candidates.

Fedora 4 training- 3 workshops held in October (DC, Australia, Colorado), more planned for 2015.

It is possible that Fedora 4 could be a back-end for VIVO.

If you want to go with Hydra at this point you should go to Fedora 4, not 3.

Declan Fleming, UCSD

Original goals – map UCSD’s deeply-nested metadata to simpler RDF vocabularies, taking advantage of Fedora 4’s RDF functionality. Ingest UCSD DAMS4 71k objects using different storage options to compare ingest performance, functionality, and repository performance. Synchronize content to disk and/or an external triple store.

Current status – Initial mapping of metadata completed for pilot work. Ingested sample dataset using multiple storage options: Modeshape, federated filesystem, and hybrid (Modeshape objects linked to federated filesystem files). Ingested full UCSD DAMS4 dataset into Fedora 4 using Modeshape.

Ongoing work – continuing to refine metadata mapping, as part of the broader Hydra community push toward interoperability and pluggable data models. Full-scale ingest with simultaneous indexing, full-scale ingest with hybrid storage (about ready to give up on that and embrace Modeshape), performance testing.

Over time ingesting of metadata slowed down – they use a lot of blank nodes which adds to complexity of structure – might be the reason.

File operations were very reliable. Didn’t test huge files rigorously.

Stefano Cossu – Art Institute

DAMS project goals – will take over part of current Collection Management System duties – 270k objects, 2/3 of which are digitized. Strong integration with existing systems, adopt standards, single source for institution-wide shared data. Meant to become a central hub of knowledge.

LAKE – Linked Asset and Knowledge Ecosystem. Integrates with CITI (collection management system) which is the front-end to Fedora (LAKE) which acts as the asset store.

Why Fedora? Great integration capabilities, very adaptable, built on modern standards, focus on data preservation. Makes no assumptions about front-end interface. REST APIs. Speaks RDF natively.

Key features for the AIC – Content modeling, federation, asynchronous automation, external indexing, flexible storage.

Content modeling: adding/removing functionality via mix-ins. Can define type and sub-types. Spending lots of time building a content model. Serves as a foundation for ontology. Still debating whether JCR is best model for building content model. Additional content control is in their wish list.

Asynchronous Automation: Used Modeshape sequencers so far. Camel framework offers more generic functionality and flexibility. Uses: extract metadata on ingestion, create/destroy derivatives based on node events, index content.
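
The "create/destroy derivatives based on node events" idea can be sketched as plain event dispatch. This is an illustration only, not Modeshape sequencer or Camel code: the class, event names, and paths are hypothetical, and handlers here run synchronously for simplicity where the real setup would hand events to a message queue.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[str], None]


class NodeEventBus:
    """Tiny stand-in for a repository's event stream (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def on(self, event_type: str, handler: Handler) -> None:
        """Register a handler for one event type."""
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, node_path: str) -> None:
        """Call every handler registered for this event type, in order."""
        for handler in list(self._handlers[event_type]):
            handler(node_path)


# Derivatives are tracked as a set of paths; a real system would write files.
derivatives: set[str] = set()


def create_thumbnail(node_path: str) -> None:
    derivatives.add(node_path + "/thumbnail")


def destroy_thumbnail(node_path: str) -> None:
    derivatives.discard(node_path + "/thumbnail")


bus = NodeEventBus()
bus.on("node_created", create_thumbnail)
bus.on("node_removed", destroy_thumbnail)

bus.emit("node_created", "/assets/img-001")
print(sorted(derivatives))
bus.emit("node_removed", "/assets/img-001")
print(sorted(derivatives))
```

Metadata extraction and indexing would be further handlers on the same events, which is what makes a generic routing layer attractive.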

Filesystem federation to access external sources, custom database connector.

Indexing: multiple indexing engines – powerful search/query tools: triplestore, Solr, etc.

Tom Cramer – Stanford

Exercising Fedora as a linked data repository – introducing Triannon and Stanford’s Fedora 4 beta pilot

Use case 1: digital manuscript annotations. Used open annotation W3C working group approach to map annotation into RDF. Tens of thousands of annotations – where to store, manage, and retrieve?

Use case 2: Linked data for libraries. Bibliographic data, person data, circulation and curation data. Build a virtual collection without enriching the core record using linked data to index and visualize.

Need a RDF store, need to persist, manage, and index. Not the ILS nor core repository – this is a fluid space while the repository is stable and reliable. All RDF / linked data.

Fedora was a good fit: Native RDF store, manage assets (bitstreams), built in service framework (versioning, indexing, APIs), easy to deploy.

Linked Data Platform (LDP): W3C draft spec, enables read-write operations of linked data via HTTP, developed at the same time as Fedora 4, Fedora 4 one of a handful of current LDP implementations.
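
For a sense of what "read-write operations of linked data via HTTP" means in practice, here is a minimal sketch that builds (but does not send) an LDP-style create request: a POST of Turtle to a container, with a Slug header suggesting a name for the new resource. The container URL is a placeholder, not a confirmed Fedora 4 endpoint, and the sample Turtle is made up.

```python
from urllib.request import Request

# Made-up Turtle body; <> refers to the resource being created.
TURTLE = """@prefix dc: <http://purl.org/dc/elements/1.1/> .
<> dc:title "Sample annotation" .
"""


def build_ldp_create(container_url: str, turtle: str, slug: str) -> Request:
    """Build a POST that asks an LDP container to create a new RDF resource."""
    return Request(
        container_url,
        data=turtle.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "text/turtle",  # body is RDF serialized as Turtle
            "Slug": slug,  # a hint for the new resource's name
        },
    )


# Placeholder URL: adjust to wherever the repository actually lives.
req = build_ldp_create("http://localhost:8080/rest/annotations", TURTLE, "anno-1")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; a server that accepts the request is expected to report the new resource's URI in the Location response header.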

Stanford pilot: install, configure & deploy Fedora 4; exercise LDP API for storing annotations and associated text/binary objects; develop support for RDF references to external objects; test scale with millions of small objects; integrate with read/write apps and operations – annotation tools (e.g. Annotator), indexing and visualization (Solr and Blacklight)

Current: Annotator (Mirador) <- JSON-LD -> Triannon (Rails engine for open annotations stored in Fedora 4) <-> LDP – Fedora 4.

Future: Blacklight and Solr.

Learned to date: Fedora 4 approaching 100% LDP 1.0 compliance, Triannon at alpha stage (can write, read & delete open annotations to/from Fedora 4); Still to come: updates to annotations, storage of binary blobs in Fedora, implement authn/z, deploy against real annotation clients, populate with data at scale.

Looking at Fedora 4 as a general store for enriching digital objects and records through annotating, curating, tagging.

