The Betfair Exchange is a truly 24/7 product as part of a global business, both in terms of the customers that bet on it and where the events they are betting on take place. Uptime and stability of the Exchange technology stack are of paramount importance to the business and its customers. Even a few minutes downtime could mean a huge loss in revenue and customer impact, as an important horse race runs without any in play bets being placed or a customer misses a crucial trade out opportunity as part of their carefully tuned betting strategy. Therefore, keeping the Exchange platform running and having an early indication of any problems are always at the forefront of our minds on the Exchange development team.
Focus on stability obviously starts at design and development time. All our components are built to withstand failure. Stateless microservices that serve data to our Web and Mobile channels and direct access API customers are deployed in clusters behind load balancers. This means that should any one node of the cluster become unhealthy, it can simply set its health check to fail so that the load balancer knows to stop sending it traffic.
We also have a number of stateful applications where one node (or a group of nodes if the data can be sharded) is doing all of the work, for example producers to our many Kafka streams. For these applications, we again deploy a cluster of nodes that can potentially assume responsibility for that task. Leadership elections take place, managed by Zookeeper, to choose which node is performing that task at any one time. Automated failure detection means the leader node can at any time mark itself as unhealthy, triggering a leadership election and a fail over routine that allows another node to take control with only a very brief pause in service.
Although we have designed a resilient architecture, we don’t want to have to make use of our failure handling very often! So automated regression test packs are built into every stage of our continuous delivery pipelines to ensure only high quality code is deployed into production. Staging and dark live environments give us further opportunities to spot potential problems before any new feature or change is put in front of actual customers.
But in an imperfect world we always have to recognise that failures in both software and hardware will happen and how we detect and respond to these failures is crucial in minimising impact to customers. To begin with, every node we deploy, either physical or virtual, is fitted with a suite of standard monitoring checks controlled by Sensu. These will continuously test things like service health checks, available disk space and CPU usage. These checks are set up to automatically page on-call support teams at any time to alert them to a failure using PagerDuty. The alerts will help tie any degradation down to specific services or nodes, ensuring the on-call engineer can quickly get to the root of the problem and follow steps to resolve it.
Beyond these basic checks we also want to record more granular, service level metrics to pick up errors and degradation not immediately obvious using system stats or web pings. Every application in the Exchange system is constantly recording data about everything it does. Trends in service request rates, error rates and response times are tracked and can be further broken down by operation or node to help isolate issues. Queue sizes and Kafka message rates can help indicate bottlenecks or traffic spikes and recording usage per customer can also help us track poor user behaviour that can be educated or blocked to ensure one badly behaving piece of customer software doesn’t impact the rest of the customer base. Even a small change in any of these metrics can give an early warning of a problem developing and allow us to take action before it spreads and causes customer impact.
All the metrics we record are stored into a time series database from where they can be queried, graphed and displayed to support staff using Grafana. This also ensures a detailed history is persisted that we can use to analyse how different areas of the system performed at peak times and help us plan for future big events. The constant year on year growth in demand that we continue to see for our platform means we are never able to stand still. So, the ability to see how our performance has trended over time is vital to how we plan for the future and schedule pieces of scalability and re-architecture work.
All developers working on the Exchange are regularly reminded to think of the customer and think like a customer and this applies to how we set up our monitoring systems too. We want to have a real-time idea of the kind of experience our customers are having at any time and understand which areas of a customer’s interaction with the Exchange might be affected by a certain production issue.
To gain visibility of this, we have built a suite of monitoring “probes” that interact directly with our API. These are Python scripts, designed to mimic real customer behaviour by submitting realistic request patterns into the Exchange and monitoring the response times experienced and any failures detected. We deploy these scripts into different AWS instances around the globe so that we can have an idea of the total latency experienced by customers entering our external network layer and how that latency might differ depending on where in the world they are betting from. This can help alert us to problems with specific network routes or geographical regions.
As with the other stats mentioned earlier, all the metrics recorded by these scripts are collected into our time series database where they can be graphed and used to automatically initiate on-call page outs. They are also displayed onto dashboards on monitors around the development team’s office space and on the big screens monitored by our first line support operators. They provide an extremely valuable reference point for anyone in the business looking to understand the current health of the Exchange as experienced by real customers and is often the first port of call whenever reports of production issues start to appear.
We realised that this data would also be of use to external customers and third parties so this monitoring system is also used to power our API status page, which is built on the Cachet framework. Here we display an aggregated view of what our monitoring is currently measuring, to show what we consider the health of different areas of the Exchange to be. It uses a traffic light system, with the values toggling automatically based on the performance we have recorded in the last minute.
This provides a valuable source of information for customers to know when the Exchange is suffering some degradation or a planned outage and has hugely reduced the load on our customer services operators during those times. With the introduction of the monitoring system above, we have seen occurrences where customer contact provides us the first indication of stability problems on the Exchange reduce to near zero and have all our uptime metrics trending in the right direction to provide a faster, more reliable customer experience than ever before.
As we move through a busy period of Football and the Six Nations, on into the big Spring Racing festivals at Cheltenham and Aintree for the Grand National, all these design decisions and alerting systems will be as important as ever in ensuring we are delivering a great experience to our customers.