Personalized news 

At the recent CNI meeting Cliff Lynch reflected on the difficulty of determining what the “official record” is when the news is personalized to each viewer’s individual context.

I was thinking about that as I read this story in the NY Times reporting on a comprehensive study on the costs of health care procedures in different US markets.

Reading through the article while sitting out on the deck of our rented vacation condo in Florida, I was struck by this passage in the midst of it:

Consider Fort Myers, Fla., our best guess for where you might be reading this article. Spending on Medicare patients is very high in this area. But, when it comes to private health insurance, spending is relatively low.

This was followed by some graphs comparing health care costs in Fort Myers with those in other places in the US.

It’s interesting that this is in the midst of an article, not in some special “interactive” feature. Welcome to the new normal!

New professional presence site

I’ve been blogging for over a dozen years now, posting over 1400 posts here in that time. While I’ve always had a link to my resume on the blog, recently I decided it’s really time I had a better professional presence on the web.

So I’ve now put up – take a look and let me know what you think!

CNI Fall 2015 Day 1

I’m at the fall meeting for the Coalition for Networked Information. For those who don’t know, CNI is a joint initiative of Educause and the Association of Research Libraries and was founded in 1990 to promote the use of digital information technology to advance scholarship and education. I was involved in the early days of CNI and I’m happy to have recently been appointed as a representative of Educause on the CNI Steering Committee.

Cliff Lynch is CNI’s Executive Director, and one of the highlights of the member meetings is his plenary address, where he generally surveys the landscape of digital information and pulls together interesting, intriguing, and sometimes troubling themes that he thinks are worth watching and working on.

In today’s plenary Cliff talked about the evolving landscape of federal mandates for public access to federally funded research results. It is only in 2016 that we will see the actual implementation of the plans the various federal agencies put forward in response to the directive that the Office of Science and Technology Policy issued in 2013. Cliff noted that the implementations across the multiple federal funding agencies are not coordinated, that some of them are not in sync with existing practices at institutions, and that there will be a lot of confusion.

Cliff also had some very interesting observations on the current set of issues surrounding security and privacy. He cited the recent IETF work on pervasive surveillance threat models, noting that if you can watch enough aggregate traffic patterns going to and from network locations you can infer a lot, even if you can’t see into the contents of encrypted traffic. And with the possible emergence of quantum computing that may be able to break current encryption technologies, security and privacy become much more difficult. Looking at the recent string of data breaches at Sony, the Office of Personnel Management, and several universities, you have to start asking whether we are capable of keeping things secure over time.

He then moved on to discussing privacy issues, noting that all sorts of data is being collected on people’s activities in ways that can be creepy – e-texts that tattle on you, e-companions for children or the elderly that broadcast information. CNI held a workshop in the spring on this topic, and the general consensus was that people should be able to have a reasonable expectation of privacy in their online activities, and they should be informed about use of their data. It’s generally clear that we’re doing a horrible job at this. NISO just issued work on distilling some principles. In our campuses people have different impressions of what’s happening in authorization handoffs between institutions and publishers – it’s confused enough that CNI will be fostering some work to gather some facts about this.

The greatest area of innovation right now that Cliff sees is where technology gets combined with other things (the internet of things) – like drones, autonomous vehicles, machine learning, robotics, etc. But there isn’t a lot of direct technical IT innovation happening, and what we’re seeing is a degree of planned obsolescence where we’re forced to spend lots of time and effort to upgrade software or hardware in ways that don’t get us any increased functionality or productivity. If that continues to be the case we’ll need to figure out how to “slow down the hamster wheel.”

Finally Cliff closed by talking about the complexity of preservation in a world where information is presented in ways increasingly tailored to the individual. How do we document the evolution of experiences that are mediated by changing algorithms? And this is not just a preservation problem but an accountability issue, given the pervasive use of personalized algorithms in important functions like credit ratings.




Internet2 Tech Exchange 2015 – High Volume Logging Using Open Source Software

James Harr, Univ. Nebraska

ELK stack – ElasticSearch, Logstash, Kibana, (+redis)

ElasticSearch indexes and analyzes JSON – no foreign keys or transactions, scalable, fast, I/O friendly. Needs lots of RAM

Kibana – web UI to query ElasticSearch and visualize data.

Logstash – “a unix pipe on steroids” – starts with an input and an output, but you can add conditional filters (e.g. regex), plus add-on tools like mutate and grok. Can have multiple inputs and outputs.

Grok – has a set of prebuilt regular expressions. Makes it easy to grab things and stuff them into fields. You have to do it on the way in, not after the fact (it’s a pipe tool). 306 built-in patterns.
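Grok patterns ultimately compile down to named regular expressions. A minimal stand-alone sketch of the same idea in plain Python (the log line format and field names here are hypothetical, just to show fields being broken out on the way in):

```python
import re

# Named capture groups play the role of grok's %{PATTERN:field} syntax.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3})\s+'
    r'(?P<method>[A-Z]+)\s+'
    r'(?P<path>\S+)\s+'
    r'(?P<status>\d{3})'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    m = LOG_PATTERN.search(line)
    return m.groupdict() if m else None

fields = parse_line("192.168.1.10 GET /index.html 200")
print(fields)
# {'client_ip': '192.168.1.10', 'method': 'GET', 'path': '/index.html', 'status': '200'}
```

Once the fields are broken out like this, they can be indexed and queried individually instead of as raw text.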

Grok GeoIP – includes a built-in database; breaks out geo data into fields.

Logstash – statsd – sums things up: give it keys and values; it adds the values and once a minute sends the totals on to another tool.

Graphite – a graphing tool, easy to use. Three pieces of info per line: key you want logged to, time, value. Will create a new metric if it’s not in the database.
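Graphite’s plaintext feed really is that simple: one line per data point, containing the metric key, the value, and a Unix timestamp, sent to the server (conventionally port 2003). A small sketch, with the metric name and host being assumptions for illustration:

```python
import socket
import time

def graphite_line(metric, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol: 'key value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (metric, value, timestamp)

def send_to_graphite(lines, host="localhost", port=2003):
    """Send formatted lines to a (hypothetical) Graphite server."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall("".join(lines).encode("ascii"))

line = graphite_line("deck.example.requests", 42, timestamp=1449000000)
print(line)
# deck.example.requests 42 1449000000
```

If the metric key hasn’t been seen before, Graphite creates it on the fly, which is what makes it so easy to feed from statsd or Logstash.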

Can listen to twitter data with LogStash.

Redis – Message queue server

Queue – like a mailbox: can have multiple senders and receivers, but each message goes to exactly one receiver. No receiver? Messages pile up.

Channel (pub/sub) – like the radio: each message goes to all subscribers. No subscriber? The message is lost, and the publisher is not held up. Useful for debugging.

Composing a log system: Logstash is not a single service – split up concerns, use queues to deal with bursts and errors, and use channels to troubleshoot.

General architecture – start simple:

Collector -> queue -> analyzer -> ElasticSearch -> Kibana

Keep collectors simple – reliability and speed are the goal. A single collector can listen to multiple things.

Queue goes into Redis. Most work is done in the analyzer – grokking, passing things to statsd, etc. Can run multiple instances.

Channels can be used to also send data to other receivers.

Composing a Log System – Archiving

collector -> queue -> analyzer -> archive -> archiver -> log file

JSON compresses very well. Do archiving after the analyzer so all the fields are broken out.
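JSON log streams compress well because the keys and most values repeat on every record. A quick demonstration with the standard library (the record shape is made up for illustration):

```python
import gzip
import json

# Hypothetical analyzer output: repetitive keys and values, typical of logs.
records = [
    {"host": "fw1.example.edu", "action": "accept", "port": 443, "seq": i}
    for i in range(1000)
]

raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

print(len(raw), len(compressed))
print("ratio: %.1fx" % (len(raw) / len(compressed)))
```

On repetitive data like this the compressed archive ends up a small fraction of the raw size, which is why archiving newline-delimited JSON after analysis is cheap.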

Split out indices so you can have different retention policies, dashboards, etc. – e.g. keeping firewall data separate from syslog data.

Can use logstash to read syslog data from a device, filter out what you want to send to Splunk to get your data volume down.

Lessons (technical): clear the query cache regularly (cron job every morning); more RAM is better, but the JVM doesn’t behave well above 32GB; split unrelated data into separate indices (e.g. syslog messages vs. firewall logs); keep it simple; use channels to try new things.

Lessons: It’s not about full text search, though that’s nice. It’s about having data analytics. ElasticSearch, LogStash, and Kibana are just tools in your toolbox. If you don’t have enough resources to keep everything, prune what you don’t need.

Internet2 Tech Exchange 2015 – RESTful APIs and Resource Definitions for Higher Ed

Keith Hazelton – UWisc

TIER work growing out of CIFER – Not just RESTful APIs. The goal is to make identity infrastructure developer and integrator friendly.

Considering use of RAML API designer and tools for API design and documentation.

Data structures – the win is to get a canonical representation that can be shared across vertical silos. Looking at messaging approaches. Want to make sure that messaging and API approaches are using the same representations. Looking at JSON space.

DSAWG – the TIER Data Structures and APIs Working Group – just forming, not yet officially launched. Will be openly announced.

Ben Oshrin, Spherical Cow

CIFER APIs – Quite a few proposed, some more mature than others.

More Mature: (Core schema – attributes that show up across multiple APIs); ID Match (creates a representation for asking “do I know this person already, and do I have an identifier?”); SOR to Registry (create a new role for a person); Authorization (standard ways of representing authorization queries).

Less mature: Registry extraction (a way to pull or push data from the registry – overlaps with provisioning); Credential management (do we really need to have multiple password reset apps?)

Not even itemized: Management APIs; Monitoring APIs. Have come up in TIER discussions.

Non-CIFER APIs / protocols of interest: CAS, LDAP, OAuth, OIDC, ORCID, SAML2, SCIM, VOOT2

Use cases:

  • Intra-component: e.g. person registry queries group registry for authorization; group registry receives person subject records from person registry.
  • Enterprise to component: System of Record provisions student or employee data in the Person Registry
  • Enterprise APIs: Home-grown person registry exposes person data to campus apps.


API Docs; Implementations

Internet2 2015 Tech Ex – The Age of Consent

Ken Klingenstein is talking about the work going on to enable informed consent in releasing identity attributes to services. I walked in a little late, so I’m missing a bunch of the detail at the beginning, but he invokes Kim Cameron’s 7 Laws of Identity.

Consent done properly is something that users will want to do only once for a relying party, being informed out of band when attributes are released. There is interesting research on whether users’ take on consent differs between offline and online contexts.

Some federations already support consent.

Rob Carter – Duke

Why do we need an architecture for consent? Use cases at Duke:

  • Student data – would like to release attributes about students that may be FERPA protected. If a student can “consent” to an attribute release, FERPA may not even be involved (according to Duke’s previous registrar). Consent has to be informed (what’s sufficient “informedness”?); has to be revocable (which means you need to store it so the person can review and change it); and needs non-repudiation and auditability (we know that the person gave the consent and when it was given).
  • Student devs – Trying to get students working on development. When a student wants to have another student share information with friends in an app, questions come up about release of information. Would like to have same kind of consent framework in OAuth as they have in other environments (e.g. Shibboleth).
  • Would be nice to have a single place for a user to manage their consents.

Nathan Dors – Washington

UW is entering the age of consent. Asks the audience where they are with consent frameworks – almost all are just entering the discussion. UW wants to move beyond uninformed consent (or not doing things at all because of the barrier of getting consent). Consent is already ubiquitous in the consumer world – Google, Facebook, etc. Help desks need to understand how to explain things to users. ID management needs to be able to help developers understand what they need to do to get consents. Need to figure out how to layer consent into a bunch of IdPs, including Azure AD, which has its own consent framework. Need to apply existing privacy policies and data governance to the consent context.

Larry Peterson (U of Arizona) – Give Your Data the Edge (Cloud Infrastructure for Big Data Research)

Data Management Challenge
     Distributed Set of Collaborators
     Existing Data Sets (sometimes curated, sometimes less)
     Taking advantage of commodity data storage
Researchers widely distributed, data widely distributed
Pre-Stage, then Write Back – read/write workload, widely distributed. These approaches tend to assume the researcher is a data-management expert.
Goal: Enable a scalable number of collaborators and apps to share access to data independent of where it’s stored: minimize operational burden on users; maximize use of commodity infrastructure; maximize aggregate I/O performance
Syndicate Solution – Add a CDN into the mix. Distribute big data in the same way Netflix distributes video, using caching.
Syndicate gateways sit between players and the caching transport (which uses HTTP instead of TCP).
Metadata service on the side which has to scale.
Result is all collaborators share a volume.
Gateways bridge application workflow and HTTP transport, e.g. IRODS, Hadoop drivers. Data gets acquired from existing repositories, or commodity storage (e.g. S3 or Box) which get treated as block storage. Gives ability to spread risk over multiple backing stores.
The metadata store manages data consistency and key distribution, using adaptive HTTP streaming. It also plays a security role, distributing credentials, which are delivered through the same CDN.
Shared volume is mounted just like Dropbox. Can auto mount volumes into multiple VMs like in EC2.
Service composition – Syndicate = CDN + Object Store + NoSQL DB
Value Add
Storage Service
CDN gives scalable read bandwidth (Akamai Hyper Cache and Request Router). Built a CDN for R&E content on Internet2’s footprint.
Object store – gives data durability. (S3, Glacier, DropBox, Box, Swift).
NoSQL DB (Google App Engine) for metadata service
Multi-tier cloud. Commodity cloud is one of the tiers; you could contribute private clouds into the project too. Internet2 backbone -> regional & campus (4 servers minimum) -> end users.
Caching hierarchy – some in the private cloud, less at the regional & campus side.
Request Router in the I2 backbone, which tells you which cache to get service from.
Value proposition: Cloud ready – allows users to mount shared volumes into cloud-hosted VMs with minimal operational overhead. Adapts to existing workflows – makes it easy to integrate existing user workflows; there are ways to build modular plugins on the user side or the data side. Sustainable design – “I’ve got a big data problem, I need to connect commodity disk to my workload.”
Will first be used by IPlant community. More info at
Scientific Data with Syndicate – John Hartman, Univ. of Arizona Computer Science
Use Hadoop and Syndicate to support “big data” science – meta-genomics research w/ Bonnie Hurwitz, Agriculture and Biosystems.
Meta-genomics – statistical analysis of DNA rather than sequencing entire genomes. Sequencing produces snippets of DNA (called reads) – requires very pure samples of DNA. Instead, look at samples in the environment, e.g. compare population of reads in one sample with reads in another sample. Tara Oceans Expedition; Bacterial infections; Colon cancer.
Tara is a ship collecting information about the oceans, taking samples to enable comparative analysis. Currently about 9 TB of data from ship.
Looking at bacterial infections with Dr. George Watts. Treatment depends on identifying characteristics of the bacteria, so ideal to perform meta-genomic analysis on an infection to classify and determine treatment.
Analysis Techniques – originally custom HPC applications with manual data staging; now- Hadoop application with manual data staging; future – Hadoop application with data access via Syndicate.
Hadoop: open-source MapReduce; includes the Hadoop Distributed File System (HDFS), so storage nodes double as computation nodes. Tasks are run on local data when possible – the Hadoop task scheduler knows data locations. Data must be manually staged into HDFS, and Hadoop jobs are managed by a central controller.
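The MapReduce model Hadoop implements can be sketched in a few lines of single-process Python: map each record to (key, value) pairs, shuffle the pairs by key, then reduce each key’s values. The “reads” here are toy stand-ins for the DNA snippets mentioned above; Hadoop’s contribution is doing this across a cluster, on the nodes that already hold the data.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, 1) pair for each snippet in each record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key (Hadoop does this across the network)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values -- here, a simple count."""
    return {key: sum(values) for key, values in grouped.items()}

# Toy stand-in for sequencing "reads": count occurrences of each snippet.
reads = ["ACGT TTGA", "ACGT ACGT", "TTGA"]
counts = reduce_phase(shuffle(map_phase(reads)))
print(counts)  # {'ACGT': 3, 'TTGA': 2}
```

Comparing populations of reads between samples is, at heart, this kind of counting done at terabyte scale, which is why the Hadoop model fits the meta-genomics workload.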
Trying to allow remote Hadoop data access: Storage in iRODS and HDFS; Transport by Syndicate and HTTP; enable federation between Hadoop clusters.
Storage-side functionality: delivers data sets to Syndicate via the SG; a publish/subscribe mechanism keeps datasets up to date via RabbitMQ. Integration of Syndicate and iRODS authentication mechanisms.
Working on federating Hadoop clusters via Syndicate, so clusters can pull data from each other.
Challenges – identity management; whole-file writes (need to write results back to storage, but Syndicate was designed for file reads and block writes), which means providing consistency at the dataset level; performance.
Biologists are thrilled when this works at all.
