The SRE Low Frequency Reporting Platform

By

Grant Mitchell And Ashley Bullock


Hello, this is the first in a series of blog entries we intend to write about all things Site Reliability Engineering at Paddy Power Betfair. As this is our first ever blog post (so please be kind!), we thought we’d write about a straightforward project we ran recently: our low frequency reporting platform. We’ll discuss what we built, why we built it and some of the outcomes we saw. The project is quite simple, and likely not 100% transferable to your estate; however, we hope our thoughts on it may be useful even if the detailed “how we did it” is not :). The framework was delivered over a few sprints, and the services running on it are continually being developed.

What is a Low Frequency Reporting Platform?

During our day-to-day work, we have discovered the need to collect and process lots of small datasets at a relatively low frequency, such as daily or hourly collections. These include reports such as hardware health, app performance or information from ServiceNow. The typical way this is tackled in our environment would be to create a service or App – internally known as a TLA (we generally name them with a three-letter acronym). All TLA’s are deployed with an A or B appended to their hostname: a TLA currently running as “A” becomes “B” after re-deployment completes, the next deployment flips it back to A, and so on. Rather than develop individual TLA’s for each little project, we decided to create a framework to host “small” services.

The framework allows applications to be built via a pipeline. These basic applications can currently do two things: write data into Splunk for reporting, and/or expose a simple web API to query data from. The goal was to quickly (usually within a sprint) expose valuable information to the business – and so the TLA/App LRP was born.

[Figure: high-level view of the LRP architecture]

Some Notable LRP services

Cinder Volume Mapping Service

When VM’s get re-deployed in our OpenStack environment, the disk volumes used on the Pure storage change. When querying the storage devices, you are presented with a list of Cinder Volume IDs. We needed to help people translate them into something meaningful – like a hostname and a mount location. This service translates what the storage device tells you (“80f5b4ad-2997-4c4a-96be-763ea57365cc”) into something a developer or support engineer can use: “iex-fokpp01.ppbetfair-fokpp_kafka_logs_04-old”. This data is accessible via an API call to one of our LRP servers (http://XXXXX.betfair/api/cvm/v1.0/volumes) for other services to consume, or via a Splunk dashboard we created for our users.
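
For illustration, here is a minimal sketch of how another service might consume that endpoint in Python. The real hostname is redacted above and the exact response shape isn’t shown in the post, so the placeholder host and the assumed “volume ID → name” JSON mapping are ours:

```python
import requests

# Placeholder hostname - the real LRP hostname is redacted in the post.
LRP_HOST = "lrp.example.betfair"

def lookup_volume(volume_id):
    """Translate a Cinder volume ID into a human-readable host/mount name."""
    resp = requests.get(f"http://{LRP_HOST}/api/cvm/v1.0/volumes", timeout=10)
    resp.raise_for_status()
    volumes = resp.json()  # assumed shape: {"<cinder-volume-id>": "<host-mount-name>", ...}
    return volumes.get(volume_id, "unknown volume")

print(lookup_volume("80f5b4ad-2997-4c4a-96be-763ea57365cc"))
```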

Hardware ILO Checks

Recently an issue was reported about a slow-down on some of our services. The root-cause was eventually identified as several hypervisors where the batteries in the disk cache controllers were flat. Including one server, which had a failed disk in addition to the battery issues – which really ground to a halt! By pulling data from the ILOs we developed a Splunk dashboard to show faults detected on hypervisors. As this data is in Splunk we also cross-reference it with other data to provide enhanced information on the affected hypervisors and potentially affected guests. Using this dashboard DC staff can get an overview of all issues and Service owners can zoom in on their area to see problems affecting their hypervisors.
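
The post doesn’t detail exactly how the ILO data is collected, but as a hedged illustration, one common route on HPE kit is the iLO’s Redfish REST API. The endpoint path, credentials and field names below are assumptions rather than the actual LRP collector:

```python
import requests
from requests.auth import HTTPBasicAuth

def ilo_health(ilo_host, user, password):
    """Fetch the aggregate health status a Redfish-capable iLO reports for a system."""
    url = f"https://{ilo_host}/redfish/v1/Systems/1/"
    # verify=False only because many iLOs carry self-signed certificates.
    resp = requests.get(url, auth=HTTPBasicAuth(user, password), verify=False, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    return data.get("Status", {}).get("Health")  # e.g. "OK", "Warning", "Critical"

# Hypothetical usage:
# print(ilo_health("ilo-hypervisor01.example", "monitor", "secret"))
```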

[Figure: Sankey diagram of detected hardware faults]

ServiceNow Unregistered Apps Reporting

In line with an on-going business initiative to reduce MTTR (who doesn’t have one of those?), we were asked by the Service Management team to help them identify Apps which had VM’s deployed but no records in ServiceNow (meaning they did not know whom to engage if issues occurred). Using LRP, we created a dashboard to report on offenders. This dashboard also combines information from other sources to help the Service Management team track down the owners – we provide the team that owns the hypervisor the unregistered App was deployed to, and, from the ThoughtWorks Go logs, which users have run pipelines for the unregistered App. We also send them a daily Slack report showing any new unregistered App’s – so hopefully, things will become cleaner from now on!
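
At its core this check is a set difference between what is deployed and what ServiceNow knows about. A hedged sketch of that core (with made-up TLA names, and stand-ins for the real inventory and ServiceNow lookups) might look like this:

```python
def unregistered_apps(deployed_tlas, servicenow_tlas):
    """Return deployed Apps (TLAs) that have no ServiceNow record."""
    return sorted(set(deployed_tlas) - set(servicenow_tlas))

# Stand-in data: in LRP the first list would come from the VM inventory
# and the second from ServiceNow.
print(unregistered_apps(["abc", "def", "xyz"], ["abc", "def"]))  # -> ['xyz']
```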

Our Answer

The low frequency reporting platform consists of Red Hat Linux boxes deployed into our OpenStack environment. Everything is built by Ansible playbooks that get run via a pipeline in ThoughtWorks Go. The playbooks build the VM’s and deploy the required programs immutably. Some of the technologies we chose for LRP are described below.

Splunk?

We use Splunk Cloud for a lot of our monitoring and reporting. During our first demo of LRP to the wider team we were asked why we were using Splunk. The answer is simple: by putting information in Splunk we give our users a simple reporting mechanism to self-service their queries. Indeed, shortly after we unveiled this service, the first dashboards built by our users on the data we were providing started to appear. While we are happy to create dashboards for users, we found it better to put the control into the hands of those who deeply understand the value of the data, how it connects to other data and who use it daily in their job. It was when we did this that we found the real magic happened 😉

Redis?

We use Redis as a cache for our API calls. We can insert “infrequent” data in JSON format from various sources (with post-processing and correlation where appropriate), allowing clients to pull the data in a “frequent” manner without impacting or exposing the data sources. In this way, it’s running essentially as a proxy for exposing processed data.
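
As a rough sketch of that pattern (the key name and data shape here are illustrative, not the actual LRP schema), the infrequent writer and the frequent readers might look like this with redis-py:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def refresh_cache(volumes):
    """Infrequent writer (e.g. a cron job): store post-processed data as JSON."""
    r.set("cvm:volumes", json.dumps(volumes))

def read_cache():
    """Frequent readers hit Redis only, never the underlying data sources."""
    raw = r.get("cvm:volumes")
    return json.loads(raw) if raw else {}

refresh_cache({"80f5b4ad-...": "iex-fokpp01.ppbetfair-fokpp_kafka_logs_04-old"})
print(read_cache())
```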

NGINX/uWSGI?

This is our API gateway; we decided on a lightweight server for our API’s. As NGINX cannot, of course, serve dynamic content directly, we utilised uWSGI to link to our Flask backend.

Flask?

Our backend framework is Flask. Using Jinja2 templates, we automate the creation of the required Python files dynamically on deployment. This allows us to give colleagues a constrained way to interact, create routes and utilise canned routines for manipulating their Redis cache.
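
A minimal sketch of the kind of generated route is below. The real files are rendered from Jinja2 templates at deploy time; the cache key and route here simply mirror the CVM example rather than the actual generated code:

```python
import json

import redis
from flask import Flask, jsonify

app = Flask(__name__)
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

@app.route("/api/cvm/v1.0/volumes", methods=["GET"])
def get_all_volumes():
    """Serve the cached volume mapping; the cache itself is populated out of band."""
    raw = cache.get("cvm:volumes")
    return jsonify(json.loads(raw) if raw else {})

if __name__ == "__main__":
    app.run()  # in production this sits behind uWSGI and NGINX
```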

Why not containers?

About now, you might be thinking that this sounds like an ideal problem for containers… and you’d be right. Unfortunately, the Kubernetes pipelines for our environments are still in “beta”, so we decided to press on and devise this solution, but we believe that switching to containers in future would be a simple process.

What does an API service look like?

The services are defined in basic YAML files. A very simplistic endpoint might be configured like this:

[Figure: example YAML endpoint definition]
This simple example config defines an endpoint /api/cvm/v1.0/volumes, from which we create a simple application. We create a Python file with all the endpoints from this (and from other applications defined in other YAML); additionally, for this endpoint, we create getallvalues.py – containing the code specific to this endpoint, built using the information given. As you can see, this service is extremely simple, initially only allowing retrieval from the cache (the cache was populated by a good old-fashioned cron process 🙂 ). However, we can build other endpoints with put/post/delete methods, allowing users to create a more complete API.
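
To give a flavour of the generation step, here is a hedged sketch; the actual YAML schema and Jinja2 template aren’t reproduced in the post, so the field names below are assumptions:

```python
import yaml
from jinja2 import Template

# Assumed shape of an endpoint definition - the real schema lives in the LRP repo.
ENDPOINT_YAML = """
name: cvm
version: v1.0
route: volumes
methods: [GET]
handler: getallvalues
"""

# Assumed template fragment that renders one Flask route per endpoint.
ROUTE_TEMPLATE = Template('''\
@app.route("/api/{{ name }}/{{ version }}/{{ route }}", methods={{ methods }})
def {{ handler }}():
    return jsonify(read_cache("{{ name }}:{{ route }}"))
''')

spec = yaml.safe_load(ENDPOINT_YAML)
print(ROUTE_TEMPLATE.render(**spec))  # emits the Flask route for this endpoint
```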

What does a Splunk feed look like?

Our inputs to Splunk are also created via a simple Jinja2 template:

[Figure: Jinja2 template for a Splunk scripted input]

As you can see from the snippet above, we execute a simple shell script periodically to ingest data into Splunk. These scripts can utilise our API services on LRP, or make calls out directly to other services. Our latest input builds a JSON output from YAML configuration files polled from our git repository. Others call public endpoints like ServiceNow and PagerDuty. By using Splunk to do the heavy lifting and processing, we allow the service to be lightweight, deployed only on a system with two vCPUs.
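
As a hedged sketch of that latest input (the repository path, file layout and field names are assumptions), a collector that Splunk runs periodically might simply walk the checked-out YAML and emit one JSON event per file on stdout:

```python
#!/usr/bin/env python
import glob
import json
import sys

import yaml

def main(config_dir="/opt/lrp/repo/configs"):  # assumed local git checkout path
    for path in sorted(glob.glob(f"{config_dir}/*.yaml")):
        with open(path) as fh:
            doc = yaml.safe_load(fh) or {}
        # One JSON event per config file; Splunk indexes whatever the script prints.
        sys.stdout.write(json.dumps({"source_file": path, "config": doc}) + "\n")

if __name__ == "__main__":
    main()
```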

So, what outcomes did we see?

Users really do want to play with the data! One of our aims was to put the data in the hands of users who can really leverage it. Developers liked the data exposed via the APIs – but we have also seen users whose day jobs are not centred around coding utilise the data in Splunk. They are creating searches and dashboards and sharing with colleagues. It’s an obvious truth, but our data is more powerful when it’s timely, relevant, relatable to other data and reportable. By offering API’s and data in Splunk we allowed a broad spectrum of users to interact with the data, and as our projects deliver more sources, we have new ways to link our data and give us better insights.

[Figure: example user-created Splunk dashboard]

We also found that in developing a framework we created more opportunities for teams across locations to work together. In Paddy Power Betfair we have offices in London, UK; Cluj, Romania; Dublin, Ireland; and Porto, Portugal. When we started to demo this to our colleagues in other offices, ideas for more sources and uses came out – Dublin created some more insights into fragile assets where all of an App’s compute existed in one rack, and Porto are looking to use data from the platform to tag disk-related events in our TSDB with the information provided from the CVM API. We have a great team across our locations, so we all enjoy any excuse to get involved in working with them :).

This project will continue to grow, and we’re still looking at improvements around availability and scalability – so perhaps we will revisit this in future posts. In our next post, however, we’ll talk about a new project we’re starting soon – App Scorecards. We’re afraid it’s report card time, colleagues!

About the authors

Grant Mitchell is an SRE at Paddy Power Betfair. As well as a lifelong interest in computers and technology, he has a passion for beer and at the time of writing this article has logged 3,587 unique beers on Untappd – https://untappd.com/user/anachronism

Ashley Bullock is an SRE at Paddy Power Betfair. Fresh from university with a keen interest in programming, Ash also has a passion for music and is seeking a drummer to join a band within the company.

We’re hiring, come and join us!

Shameless Plug:

Think you can drum with Ash, outdrink Grant, beat Vadim at ping-pong, make some plans for Nigel or just want to work in a team passionate about improving our technology and how we use it? Well, we’re hiring! Reading this months after it was posted? Don’t worry – drop us a line anyway! While Ash may have a drummer by then, we are always on the lookout for good team-mates!

https://paddypowerbetfair.jobs/jobs/?search=sre


2 thoughts on “The SRE Low Frequency Reporting Platform”

  1. Your Personal Elementary School Teacher says:

    Stop putting apostrophes in every single plural of an acronym, it makes reading this unbearable.

    Note: APIs, VMs, TLAs.
    Note2: “I’m going to VM’s for a pint” – as in, “VM” is the owner of a pub.


    1. We use apostrophes to clearly delineate the acronyms from the plural intent. As we frequently use mixed-case acronyms in IT (search for PaaS and IaaS – platform/infrastructure as a service – as common examples of this), the use of the apostrophe aids in disambiguating the term. We did consider using full stops to split the acronyms – style guides for various publications allow for apostrophe usage in such cases. However, in our industry, which is full of acronyms, the use of full stops is frequently deprecated. Consider when you last saw common IT terms such as RAM, USB or WWW expressed with full stops.

      This usage allows you to simply run several “Plan Of Test” artefacts against our “Plain Old Telephony Service” infrastructure without confusion :). Apologies if you found this hard to read; my intent is to value communication without ambiguity (and I assume you understood the intent before pointing out it was wrong :). It is when I write code that I am concerned about syntactical rules.
