I like analogies because they help explain things to people. In this case, for your everyday die-hard DevOps team, developers and other seasoned IT staff, a ‘bug’ is a well-known term. Capture the bug. How do you do that?
But what if you mention ‘we’ve got a bug’ to, for example, accountants, retail, HR or other staff? They would probably think it is about an ‘issue’ in the platform or application you run your business on. If you said ‘hey, there is a bug in your house’, they would probably think that some insect is crawling or flying around somewhere. And an insect is easily found and exterminated, right?
A bug in your business platform
But how do you do that on your business’s platform? The place where your organization creates and delivers the services for your customers? Services evolve, platforms change and the people who built the applications and services will probably not be around forever. The house stays the same (more or less), but IT platforms and services don’t. That is where the analogy with the bug in the house ends.
Services, software and releases follow each other rapidly, and capturing a bug can be really time-consuming, not to mention the revenue loss or customer experience issues it can cause. So the key to this problem is observability. The better you are at capturing bugs, finding problems and knowing how your services perform, the more you can improve the customer experience and make room for innovation.
Capture the bug with the observability stack in Kibana
During this three-hour event we worked with the Elastic observability components, such as overview, logs, metrics, APM, uptime and alerting, and tried to capture 10 bugs. Before we could start, we needed to sign in to a custom Slack workspace dedicated to this event. This Slack workspace was used to communicate your results back to the trainer but, more importantly, it was used to receive the alerts from the system, which were your starting point for each challenge.
Elastic set up a dummy website on which we could see the operations and performance of sales, services and infrastructure. This dummy website was monitored using Elastic APM and Uptime, running in a dedicated Elastic Cloud environment.
The ‘bugs’ were introduced via scripted handles. Without knowing the ‘house’ we could still find the bug using the observability components. Of course, knowing the ‘house’ helps, and when the different teams in your organization work together on the same set of (monitoring) information, you can identify and solve problems very quickly. Observability not only helps you to identify crises and potential revenue and customer loss, but also to uncover smaller, potentially hidden issues.
During those challenges we learned that setting up your monitoring is key to finding the root cause of a problem and resolving it. It sounds obvious, but the better the monitoring data, the better your observability and the faster your root cause analysis (RCA). To bring it back to the analogy of ‘the house’: when your partner mentions that there is a bug in the house, you will probably search each room to find it. If your partner mentions that the bug is in the living room, to the left of the couch, you will find it much more easily and quickly. So in fact the level of detail in the data is key.
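To make the ‘living room, left of the couch’ idea a bit more concrete, here is a minimal sketch in Python of the difference between a vague and a detailed log event. It is not what was used during the event. The helper `format_event` and all field names and values (`service`, `host`, `http_status`, `duration_ms`) are made up for illustration.

```python
import json
import logging


def format_event(message, **context):
    """Render a log event as one JSON line.

    Hypothetical helper: the context fields are illustrative only.
    """
    return json.dumps({"message": message, **context}, sort_keys=True)


# 'There is a bug in the house': you know something failed, but not where.
vague = format_event("checkout failed")

# 'The bug is in the living room, left of the couch': the event itself
# tells you which service, which host and what kind of failure to look at.
precise = format_event(
    "checkout failed",
    service="payment-api",  # hypothetical service name
    host="web-03",          # hypothetical host name
    http_status=502,
    duration_ms=30012,
)

logging.basicConfig(level=logging.INFO, format="%(message)s")
logging.info(vague)
logging.info(precise)
```

With the second kind of event you can filter on a service or host straight away instead of searching every ‘room’.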
Challenge and awards
The game element in the virtual event was that you individually needed to find the root cause of an issue (triggered by an alert in Slack). We got 10 minutes per case to find the answer. A correct answer earned you 2 points, and an extra bonus point could be won by giving the correct solution to fix the problem. Like the extra point for the fastest lap during an F1 Grand Prix, to keep using analogies.
For both of us, using these kinds of tools was a new experience. Nevertheless, we were able to find the root cause and could even propose some solutions. After each task the trainer explained what the correct solution was and showed us how it should have been found. After that, the system was brought back to a stable state so a new scenario could be introduced.
Around 24 people entered this capture the bug competition and Patrick ended up in 3rd place. Even though the capture the bug workshop took 3 hours, it never felt too long. The setup of the webinar was good. We were not just looking at theoretical slides, but were also digging in and performing hands-on tasks to find a solution. The game element makes it really fun, as of course you want to beat your colleague(s). If you want to know more about observability in general or with Elastic, please contact us.