Internet2 Tech Exchange 2015 – High Volume Logging Using Open Source Software

James Harr, Univ. Nebraska

ELK stack – ElasticSearch, Logstash, Kibana (+ Redis)

ElasticSearch indexes and analyzes JSON – no foreign keys or transactions; scalable, fast, I/O friendly. Needs lots of RAM.
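
A minimal sketch of what “indexes and analyzes JSON” means in practice (mine, not from the talk – the host, index, and field names are invented, and it assumes an ElasticSearch node on localhost:9200):

```python
# Index one JSON event, then search it back. ElasticSearch's REST API
# takes plain JSON documents; no schema or foreign keys required.
import requests

doc = {"timestamp": "2015-10-05T14:12:00Z", "host": "fw1",
       "message": "deny tcp 192.0.2.10 -> 198.51.100.5"}

# POSTing to /<index>/<type> lets ElasticSearch assign the document ID.
requests.post("http://localhost:9200/syslog-2015.10.05/event", json=doc)

# Full-text query against the indexed field.
r = requests.get("http://localhost:9200/syslog-2015.10.05/_search",
                 json={"query": {"match": {"message": "deny"}}})
print(r.json()["hits"]["total"])
```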

Kibana – web UI to query ElasticSearch and visualize data.

Logstash – “a unix pipe on steroids” – start with an input and an output, then add conditional filters (e.g. regex) and add-on tools like mutate and grok. Can have multiple inputs and outputs.
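
The pipeline shape is easier to see in code. This is a toy Python stand-in for the idea (Logstash itself is configured declaratively, not written like this): read events in, conditionally filter, write events out.

```python
# input -> conditional filter -> output: the shape Logstash generalizes.
import json
import re
import sys

for line in sys.stdin:                                   # input
    event = {"message": line.rstrip("\n")}
    if re.search(r"error", event["message"], re.I):      # conditional filter
        event["level"] = "error"                         # mutate-style field add
    sys.stdout.write(json.dumps(event) + "\n")           # output
```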

Grok – comes with a set of prebuilt regular expressions, which makes it easy to grab things and stuff them into fields. You have to do it on the way in, not after the fact (it’s a pipe tool). 306 built-in patterns.
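
A rough Python analogue of what a grok pattern does (the regex below is a simplified stand-in for grok’s built-in Apache log patterns, not the real pattern library):

```python
# Named capture groups play the role of grok's %{PATTERN:field} syntax:
# match once on the way in, and the line becomes a dict of fields.
import re

LINE = '192.0.2.10 - - [05/Oct/2015:14:12:00 +0000] "GET /index.html HTTP/1.1" 200 512'
PATTERN = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" (?P<response>\d+) (?P<bytes>\d+)'
)

m = PATTERN.match(LINE)
if m:
    print(m.groupdict())  # fields broken out, ready to ship as JSON
```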

GeoIP filter – includes a built-in database; breaks geo data out into fields.
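
The same idea, approximated with MaxMind’s geoip2 Python package (an analogue, not Logstash’s internals; the database path is an assumption – Logstash bundles its own copy):

```python
# Look an IP up in a local GeoIP database and break the result out
# into fields, the way the Logstash geoip filter enriches events.
import geoip2.database

reader = geoip2.database.Reader("/usr/share/GeoIP/GeoLite2-City.mmdb")
resp = reader.city("128.101.101.101")
fields = {
    "geoip.country": resp.country.iso_code,
    "geoip.city": resp.city.name,
    "geoip.location": [resp.location.longitude, resp.location.latitude],
}
print(fields)
```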

Logstash + statsd – statsd sums things up: give it keys and values; it adds the values up and once a minute sends the totals on to another tool.
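
Stripped to its essence, that’s all statsd is – an in-memory accumulator with a flush timer. A hedged sketch (names and the flush mechanics are illustrative, not statsd’s actual code):

```python
# Accumulate key/value pairs and periodically emit the sums downstream.
from collections import defaultdict

counters = defaultdict(float)

def record(key, value):
    counters[key] += value            # "give it keys and values, it adds them"

def flush():
    for key, total in counters.items():
        print(key, total)             # stand-in for forwarding to Graphite
    counters.clear()

record("logins", 1)
record("logins", 1)
record("bytes_out", 512)
flush()                               # statsd runs this on a timer, e.g. every 60s
```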

Graphite – a graphing tool, easy to use. Three pieces of info per line: the key (metric path) you want to log to, a value, and a timestamp. It will create a new metric if the key isn’t already in the database.
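
Graphite’s plaintext protocol really is one datapoint per line. A minimal sketch (the hostname and metric name are invented; port 2003 is the default carbon plaintext listener):

```python
# Send one "key value timestamp" line to carbon.
import socket
import time

sock = socket.create_connection(("graphite.example.edu", 2003))
sock.sendall(("deptfw.denies.count 42 %d\n" % int(time.time())).encode("ascii"))
sock.close()
```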

Can listen to twitter data with LogStash.

Redis – Message queue server

Queue – like a mailbox: there can be multiple senders and receivers, but each message goes to exactly one receiver. With no receiver, messages pile up.

Channel (pub/sub) – like the radio: each message goes to all subscribers. No subscriber? The message is lost, and the publisher is not held up. Useful for debugging.
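
Both patterns are a few lines in redis-py (queue and channel names are invented for illustration):

```python
import redis

r = redis.StrictRedis("localhost")

# Queue (mailbox): LPUSH from any number of senders...
r.lpush("logs:incoming", '{"message": "link down"}')
# ...and BRPOP from any number of workers; each message reaches exactly
# one of them, and unconsumed messages simply pile up in the list.
_, msg = r.brpop("logs:incoming")

# Channel (radio): every current subscriber sees the message; if nobody
# is subscribed, it just disappears and the publisher never blocks.
r.publish("logs:debug", msg)
```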

Composing a log system: Logstash is not a single service. Split up concerns, use queues to deal with bursts and errors, and use channels to troubleshoot.

General architecture – start simple:

Collector -> queue -> analyzer -> ElasticSearch -> Kibana

Keep collectors simple – reliability and speed are the goal. A single collector can listen to multiple things.
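
A hedged sketch of a minimal collector in that spirit (the port and queue name are assumptions): take syslog datagrams off the wire and push them onto the queue untouched, leaving all parsing to the analyzers.

```python
# Listen for syslog datagrams and enqueue them raw; speed over smarts.
import json
import socket

import redis

r = redis.StrictRedis("localhost")
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 5140))   # unprivileged syslog port, an assumption

while True:
    data, addr = sock.recvfrom(65535)
    r.lpush("logs:incoming", json.dumps(
        {"message": data.decode("utf-8", "replace"), "sender": addr[0]}))
```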

The queue goes into Redis. Most of the work is done in the analyzer – grokking, passing things to statsd, etc. Can run multiple instances.
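
One analyzer instance might look roughly like this (queue and index names are invented; it pairs with the collector sketch above). Because BRPOP hands each message to exactly one consumer, several copies can run side by side:

```python
# Pop raw events off the Redis queue, enrich them, index into ElasticSearch.
import json

import redis
import requests

r = redis.StrictRedis("localhost")

while True:
    _, raw = r.brpop("logs:incoming")        # blocks until work arrives
    event = json.loads(raw)
    event["tag"] = "analyzed"                # stand-in for grok/statsd work
    requests.post("http://localhost:9200/syslog-2015.10.05/event", json=event)
```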

Channels can be used to also send data to other receivers.

Composing a Log System – Archiving

collector -> queue -> analyzer -> archive queue -> archiver -> log file

JSON compresses very well. Do archiving after the analyzer so all the fields are broken out.
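
An archiver can then be tiny – drain the archive queue and append gzipped JSON lines (a sketch; the queue name and file path are assumptions):

```python
# Append already-analyzed events to a compressed, line-delimited archive.
import gzip

import redis

r = redis.StrictRedis("localhost")

with gzip.open("/var/log/archive/2015-10-05.json.gz", "at") as out:
    while True:
        _, raw = r.brpop("logs:archive")
        out.write(raw.decode("utf-8") + "\n")
```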

Split out indices so you can have different retention policies, dashboards, etc. – e.g. firewall data is different from syslog data.

Can use Logstash to read syslog data from a device and filter out just what you want to send to Splunk, to get your data volume down.

Lessons (technical): clear the query cache regularly (a cron job every morning); more RAM is better, but the JVM doesn’t behave well above 32GB of heap; split unrelated data into separate indices (e.g. syslog messages vs. firewall logs); start simple; use channels to try new things.
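
The cache-clearing cron job can be a one-liner; the endpoint below matched ElasticSearch releases of that era, so treat the exact URL as an assumption to check against your version:

```python
# Run from cron each morning to drop ElasticSearch's query caches.
import requests

requests.post("http://localhost:9200/_cache/clear")
```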

Lessons: It’s not about full text search, though that’s nice. It’s about having data analytics. ElasticSearch, LogStash, and Kibana are just tools in your toolbox. If you don’t have enough resources to keep everything, prune what you don’t need.

Internet2 Tech Exchange 2015 – RESTful APIs and Resource Definitions for Higher Ed

Keith Hazelton – UWisc

TIER work growing out of CIFER – Not just RESTful APIs. The goal is to make identity infrastructure developer and integrator friendly.

Considering use of RAML API designer and tools for API design and documentation.

Data structures – the win is to get a canonical representation that can be shared across vertical silos. Looking at messaging approaches, and making sure that the messaging and API approaches use the same representations. Looking at the JSON space.

DSAWG – the TIER Data Structures and APIs Working Group – just forming, not yet officially launched. Will be openly announced.

Ben Oshrin, Spherical Cow

CIFER APIs – Quite a few proposed, some more mature than others.

More mature: Core schema (attributes that show up across multiple APIs); ID Match (a representation for asking “do I know this person already, and do I have an identifier?”); SOR to Registry (create a new role for a person); Authorization (standard ways of representing authorization queries).
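
To make the ID Match idea concrete, here is a purely hypothetical exchange (the endpoint and field names are invented – not the CIFER wire format):

```python
# Ask the match service: is this an existing person, and what identifier
# should I use? Typical answers: exact match (use this ID), potential
# matches (send to human review), or no match (mint a new identifier).
import requests

candidate = {"given": "Pat", "family": "Harr", "dateOfBirth": "1990-01-01"}
resp = requests.post("https://registry.example.edu/idmatch/v1/people",
                     json=candidate)
print(resp.status_code, resp.json())
```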

Less mature: Registry extraction (a way to pull or push data from the registry – overlaps with provisioning); Credential management (do we really need to have multiple password reset apps?)

Not even itemized: management APIs; monitoring APIs. Both have come up in TIER discussions.

Non-CIFER APIs / protocols of interest: CAS, LDAP, OAuth, OIDC, ORCID, SAML2, SCIM, VOOT2

Use cases:

  • Intra-component: e.g. person registry queries group registry for authorization; group registry receives person subject records from person registry.
  • Enterprise to component: System of Record provisions student or employee data in the Person Registry.
  • Enterprise APIs: Home-grown person registry exposes person data to campus apps.


API Docs; Implementations

Internet2 2015 Tech Ex – The Age of Consent

Ken Klingenstein is talking about the work going on to enable informed consent in releasing identity attributes to services. I walked in a little late, so I’m missing a bunch of the detail at the beginning, but he invokes Kim Cameron’s 7 Laws of Identity.

Consent done properly is something users will only want to do once for a relying party, and they should be informed out of band when attributes are released. There is interesting research on whether users’ take on consent differs between offline and online.

Some federations already support consent.

Rob Carter – Duke

Why do we need an architecture for consent? Use cases at Duke:

  • Student data – would like to release attributes about students that may be FERPA-protected. If a student can “consent” to an attribute release, FERPA may not even be involved (according to Duke’s previous registrar). Consent has to be informed (what’s sufficient “informedness”?); it has to be revocable (which means you need to store it so the person can review and change it); and it needs non-repudiation and auditability (we know that the person gave the consent and when it was given). A sketch of such a record follows this list.
  • Student devs – Trying to get students working on development. When a student wants to have another student share information with friends in an app, questions come up about release of information. Would like to have same kind of consent framework in OAuth as they have in other environments (e.g. Shibboleth).
  • Would be nice to have a single place for a user to manage their consents.
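
A hypothetical sketch of a consent record that satisfies those three requirements (all field names are invented for illustration):

```python
# Informed: store what the user was shown. Revocable: keep the record so
# it can be reviewed and withdrawn. Auditable: who, what, and when.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple

@dataclass
class ConsentRecord:
    subject: str                          # who consented
    relying_party: str                    # who receives the attributes
    attributes: Tuple[str, ...]           # exactly what may be released
    notice_text: str                      # what the user was shown
    granted_at: datetime
    revoked_at: Optional[datetime] = None

    def revoke(self) -> None:
        self.revoked_at = datetime.now(timezone.utc)  # leaves an audit trail

rec = ConsentRecord("student123", "https://sp.example.edu",
                    ("displayName", "mail"),
                    "We will share your name and email with ...",
                    datetime.now(timezone.utc))
```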

Nathan Dors – Washington

UW is entering the age of consent. Asks the audience where they are with consent frameworks – almost all are just entering the discussion. UW wants to go from uninformed consent (or not doing things at all because of the barrier of getting consent) to informed consent. Consent is already ubiquitous in the consumer world – Google, Facebook, etc. Help desks need to understand how to explain things to users. ID management needs to be able to help developers understand what they need to do to get consents. Need to figure out how to layer consent into a bunch of IdPs, including Azure AD, which has its own consent framework. Need to apply existing privacy policies and data governance to the consent context.

Larry Peterson (U of Arizona) – Give Your Data the Edge (Cloud Infrastructure for Big Data Research)

Data Management Challenge:
  • Distributed set of collaborators
  • Existing data sets (sometimes curated, sometimes less)
  • Taking advantage of commodity data storage
Researchers widely distributed, data widely distributed
Pre-stage, then write back – a read/write workload, widely distributed. These approaches tend to assume the researcher is a data management expert.
Goal: Enable a scalable number of collaborators and apps to share access to data independent of where it’s stored: minimize operational burden on users; maximize use of commodity infrastructure; maximize aggregate I/O performance
Syndicate Solution – Add a CDN into the mix. Distribute big data in the same way Netflix distributes video, using caching.
Syndicate gateways sit between the players and the caching transport (which speaks HTTP rather than raw TCP).
Metadata service on the side which has to scale.
Result is all collaborators share a volume.
Gateways bridge application workflows and the HTTP transport, e.g. iRODS and Hadoop drivers. Data gets acquired from existing repositories or from commodity storage (e.g. S3 or Box), which gets treated as block storage. This gives the ability to spread risk over multiple backing stores.
The metadata store manages data consistency and key distribution, and uses adaptive HTTP streaming. It also plays a security role, distributing credentials, which are delivered through the same CDN.
The shared volume is mounted just like Dropbox. Volumes can be auto-mounted into multiple VMs, as in EC2.
Service composition – Syndicate = CDN + Object Store + NoSQL DB
Value add of each underlying storage service:
The CDN gives scalable read bandwidth (Akamai HyperCache and Request Router). They built a CDN for R&E content on Internet2’s footprint.
Object store – gives data durability (S3, Glacier, Dropbox, Box, Swift).
NoSQL DB (Google App Engine) for metadata service
Multi-tier cloud: the commodity cloud is one tier, and you could contribute private clouds into the project too. Internet2 backbone -> regional & campus (4 servers minimum) -> end users.
Caching hierarchy – some in the private cloud, less at the regional & campus side.
A Request Router in the I2 backbone tells you which cache to get service from.
Value proposition: cloud-ready (allows users to mount shared volumes into cloud-hosted VMs with minimal operational overhead); adapts to existing workflows (makes it easy to integrate existing user workflows – there are ways to build modular plugins on the user side or the data side); sustainable design (“I have a big data problem; I need to connect commodity disk to my workload”).
Will first be used by the iPlant community. More info at
Scientific Data with Syndicate – John Hartman, Univ. of Arizona Computer Science
Use Hadoop and Syndicate to support “big data” science – meta-genomics research w/ Bonnie Hurwitz, Agriculture and Biosystems.
Meta-genomics – statistical analysis of DNA rather than sequencing entire genomes. Sequencing produces snippets of DNA (called reads) and requires very pure samples of DNA. Instead, look at samples from the environment, e.g. compare the population of reads in one sample with the reads in another. Applications: the Tara Oceans Expedition; bacterial infections; colon cancer.
Tara is a ship collecting information about the oceans, taking samples to enable comparative analysis. There are currently about 9 TB of data from the ship.
Looking at bacterial infections with Dr. George Watts. Treatment depends on identifying characteristics of the bacteria, so ideal to perform meta-genomic analysis on an infection to classify and determine treatment.
Analysis techniques – originally, custom HPC applications with manual data staging; now, Hadoop applications with manual data staging; in the future, Hadoop applications with data access via Syndicate.
Hadoop: open-source MapReduce; includes the Hadoop Distributed File System (HDFS), so the storage nodes double as the computation nodes. Tasks are run on local data when possible – the Hadoop task scheduler knows where the data lives. Data must be manually staged into HDFS, and Hadoop jobs are managed by a central controller.
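
For reference, the MapReduce model in miniature – a word-count-style mapper as it might run under Hadoop Streaming (illustrative only; the reducer and job submission are omitted). The point of the architecture is that Hadoop ships this code to the nodes that already hold the data blocks:

```python
# mapper.py: read records from stdin, emit tab-separated key/value pairs;
# Hadoop sorts by key and feeds a reducer that sums the counts.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```
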
Trying to allow remote Hadoop data access: Storage in iRODS and HDFS; Transport by Syndicate and HTTP; enable federation between Hadoop clusters.
Storage-side functionality: delivers data sets to Syndicate via an SG (Syndicate Gateway); a publish/subscribe mechanism keeps datasets up to date via RabbitMQ; integrates the Syndicate and iRODS authentication mechanisms.
Working on federating Hadoop clusters via Syndicate, so clusters can pull data from each other.
Challenges – identity management; whole-file writes (need to write results back to storage, but Syndicate was designed for file reads and block writes, so they have to provide consistency at the dataset level); performance.
Biologists are thrilled when this works at all.

2015 Internet2 Technology Exchange

I’m in Cleveland for the next few days for the Internet2 Tech Exchange. This afternoon kicks off with a discussion of the future of cloud infrastructure and platform services. As I suppose is to be expected, the room is full of network and identity engineering professionals. I’m feeling a bit like a fish out of water as an application level guy, but it’s interesting to get this perspective on things.

One comment is that for Internet2 schools, bandwidth to cloud providers is no longer as much an issue as redundancy and latency. Shel observes that for non-I2 schools bandwidth is still a huge issue. The level of networking these schools don’t have is frightening.

The cloud conversation was followed by a TIER (Trusted Identity in Education and Research) investors meeting. There are 50 schools who have invested in TIER now. This is a multi-year project, with a goal of producing a permanently sustainable structure for keeping the work going. A lot of the work so far has been to gather requirements and build use cases. Internet2 has committed to keeping the funding for Shibboleth and Grouper going as part of the TIER funding. The technical work will be done by a dedicated team at Internet2.

The initial TIER release (baseline) will be a package that includes Shibboleth, Grouper, and COManage, brought together as a unified set, as well as setting the stage for Scalable Consent support. That release will happen in 2016. It will be a foundation for future (incremental) updates and enhancements. The API will be built for forward compatibility. Releases will be instrumented for continual feedback and improvement of the product.

CSG Spring 2015: Internet of Things

Opportunities: Better services; find new cost efficiencies (e.g. trash cans that let people know when they need emptying); Improved sustainability; Safety

Challenges: Network; Security; Privacy; Support

Networks: The Apple Watch doesn’t support WPA2 and is not a good 802.1X supplicant. Inexpensive data acquisition devices – a $5 wifi module. They’ll all connect to our network address space. This may be a driver to get serious about IPv6. BYOD is the stepping stone to the Internet of Things. Devices talking to each other – “Your refrigerator is talking to my car!”. Do you need directly addressable IP addresses, or can we keep NATing forever? UCSD is starting to roll out carrier-grade NAT on their wireless, with “sticky” addresses for a few hours. Will we need to have an “eduthing” schema like we have eduPerson? Many of the things won’t use WiFi because the power consumption is too high. There are other emerging (and conflicting) protocols.

Students will be doing data acquisition and wanting to do data analysis – we should be providing tools for managing and analyzing data.

90% of the world’s data has been created in the past two years. The concept of digital exhaust – have to analyze data as it flows, looking for trends and patterns, not saving it.

What can we do with this data? Could see, for instance, when everyone is fleeing a building. If we’re collecting sensor data and correlating it to other data, do we need to involve the IRB?

Who is the data custodian of the trash can data? How do we think about data governance for this kind of data? It’s not about the source of the data, but the attributes. There are regulatory and compliance concerns. Merging of data changes the concerns.

CSG Spring 2015: The Future IT Organization / Talent Management

What kind of people do we need?

  • Hire for speed (the best people you can find, then figure out where to fit them in).
  • More business and entrepreneurial skills – services moving more towards products.
  • Technical curiosity.
  • More willing to engage with the business partners.
  • Wisconsin had a hard time finding Peoplesoft developers so they set up a Peoplesoft Academy and selected a dozen people to train and then hired six of them.
  • Google used to talk about people who are like “stem cells” – can adapt to different environments.
  • Look for resiliency.

The next generation: willing to work hard, but less patient for delayed gratification – want life balance from the start. Very social, not tied to their employer. Watched their parents go through the great recession, so don’t trust employers. Learn fast and think they should be highly empowered from the start. Fueling a crowd-sourcing model, with lots of shifting around.

What do we have to change in how we recruit and employ?

  • Want to work from home, later in the day.
  • Want opportunities to explore – spend some time that might not be relevant yet.
  • Create environment for those type of employees to be successful.
  • University HR paradigms might not work for new IT employees.
  • Gartner’s work on bimodal IT is worth following – we have business applications that have to be solid and reliable, and new activities that can be innovative and constantly changing. There are employees of all ages that prefer to work in both modes and we need to find ways to accommodate them.
  • Millennials like to get groups together and fix things – they have low tolerance for things that are broken and take a long time to fix.

How do we recruit the best IT staff for the future?

  • Compete with mission and brand.
  • Can build and tear down stuff in the cloud without all the brittle process around it.
  • Looking for systems thinkers.
  • Instituting a formal internship program for undergrads and grads – good pipeline. If you hire a student when they graduate, even if they stay for just a couple of years you get great work.
  • Higher Ed IT is failure-averse. Best practice now is to fail early and fail cheap – put together an internship program and if it doesn’t work, shut it down. Penn State has a course on intelligent fast failure.
  • Cultural fit is important.

Do we have a talent management process?

  • Small things like recognition from the CIO with small spot gift cards can help.
  • Acknowledgement of colleagues from peers.
  • Be visible in the technical communities showing the quality of work we do, making it attractive.
  • In some kinds of jobs people are frustrated by the lack of career advancement opportunities.

How do we transition current employees to newer modes of work?

  • Employees are looking for more agile leadership.
  • Look at identifying individuals to give opportunities to move elsewhere on a temporary assignment to get a specific job done. It needs to be a project with urgency to make it concrete. Then they have a different perspective when they go back to their line organizations.
  • It’s like a college basketball team – the good ones come in and out quickly so we just need to keep the pipeline flowing.

How do we encourage diversity?

  • Georgetown has around 400 regular participants in a Women Who Code effort, mostly not from computer science.
  • Plug students in at more strategic levels – not just answering the help desk phone.
  • Campuses can connect with community coding groups.
  • Google got its idea for flex time and research from academia – we are the source of innovation and we need to reclaim that.
  • Very little of the new cool stuff from our research and education programs filter into our organizations. How do we short-circuit that?

