How we monitor the health of our applications and infrastructure by Nic Ngoo, VP of Technology, Kaodim Group


How do you know when there is something wrong with your website or mobile apps? Is it when a customer complains through customer support? Or is it when you wake up one morning and see a nasty review left on the App Store? Or worse still, there is no news at all, so you go days or weeks without realising your application is not working as it should, and only find out when you look at the balance sheet at the end of the month and see that you’re bleeding money.

In this post, I will share what and how we monitor our applications at Kaodim as we uphold 2 of our Core Values: Obsess Over Every Detail Of The Customer Experience, and Hold Ourselves To The Highest Standards Of Quality & Performance.

Why do we need to invest in monitoring solutions?

At Kaodim, our business serves not only the end customers who use our web and mobile applications to conveniently search for and book a variety of services, but also the service providers (or vendors) who receive and complete those services offline. We have a presence in Malaysia, the Philippines, Singapore and Indonesia, serving thousands of transactions per day.

The Kaodim platforms are available on the web at kaodim.com, kaodim.sg, gawin.ph and beres.id, and as mobile apps for both Android and iOS: Kaodim User and Kaodim Vendor in Malaysia and Singapore, Gawin User and Gawin Vendor in the Philippines, and Beres User and Beres Vendor in Indonesia.

Our customers on the web interact with 2 applications: the landing page for searching and booking services, and the dashboard application for managing service requests, accessing the help centre and so on. Additionally, there are 4 backend services that have to be monitored and supported as well.

 

So in total, our Engineering team has to look after 4 backend services, 8 web applications and 12 mobile applications. The code base for each application is the same across all 4 countries except for localisation files, but we still monitor all 12 versions of our mobile applications because we have found crashes that occurred only in certain countries.

Any downtime and deviation from expected business flows are extremely damaging to the trust that our customers and service providers place in Kaodim, not to mention our ability to keep growing.

 

Table summarizing all of the Kaodim applications that need to be monitored

Country | Web | Android | iOS
Malaysia | kaodim.com, dashboard.kaodim.com | Kaodim User, Kaodim Vendor | Kaodim User, Kaodim Vendor
Singapore | kaodim.sg, dashboard.kaodim.sg | Kaodim User, Kaodim Vendor | Kaodim User, Kaodim Vendor
Philippines | gawin.ph, dashboard.gawin.ph | Gawin User, Gawin Vendor | Gawin User, Gawin Vendor
Indonesia | beres.id, dashboard.beres.id | Beres User, Beres Vendor | Beres User, Beres Vendor

Backend Services: Kaodim Main Backend Service, Recommendation Service, URL Shortener Service, Customer Support Service

 

How do we approach application monitoring and alerting?

Our engineers need to find out about issues before they happen and address them or, where that is not possible, find out as early as possible and contain them before widespread damage is done.

Here are some of the principles that we are introducing at Kaodim Engineering to ensure effective monitoring and actionable alerts, taking inspiration from 2 publications on this subject – Google Site Reliability Engineering Book and Best Practices for Setting SLOs and SLIs For Modern, Complex Systems.

  • We need to monitor for general availability (uptime) of all of our applications as well as issues that contribute to critical business functionality loss
  • Monitoring should allow us to see the trend as well as narrow in on specific issues for troubleshooting
  • Alerting should have as little noise as possible so we don’t fall into the trap of ignoring false alarms
  • Alerting should be automated and each alert’s message specific, so that engineers are alerted only when something is really wrong and do not need to spend a long time finding out exactly what is happening
  • We cannot possibly monitor every single event so it is important to truly understand what is critical for your business and only spend resources to monitor the critical events
  • We set warning thresholds and alert engineers so that they find out about impending issues before they actually cause business loss
  • We use dashboards and time series graphs to spot trends and also set outlier alerting for any spikes that are out of the ordinary
  • Every engineer has incident response and resolution as part of their yearly KPIs. We evaluate their ownership in actively addressing production issues.
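
The warning-threshold and outlier-alerting principles in this list can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the configuration used at Kaodim: the function names, the 65/80 thresholds and the 3-standard-deviation rule are all hypothetical values chosen for the example.

```python
from statistics import mean, stdev


def classify(value, warning, critical):
    """Map a metric reading to a severity level.

    Assumes higher values are worse (e.g. CPU % or error rate).
    """
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def is_outlier(history, latest, max_sigma=3.0):
    """Flag a reading that sits far outside the recent trend.

    Uses a simple z-score against recent history; monitoring tools
    usually offer more robust baselines than this.
    """
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > max_sigma


# Hypothetical readings, in percent
print(classify(70, warning=65, critical=80))   # warning
print(is_outlier([50, 52, 49, 51, 50], 90))    # True
```

Keeping the warning level below the critical level is what gives engineers time to react before business impact, while the outlier check catches spikes that never cross a fixed threshold.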

 

What are we monitoring and alerting on?

Prior to this, there were already error alerts and monitoring tools in place, but there were obvious gaps: some of the services were not being monitored, and we did not have common Service Level Indicators (SLIs) defined. So we took a step back and decided to revisit our monitoring and alerting strategy.

The first step for us was to determine all the metrics that are important for telling the health of our applications. We followed Google’s strategy of combining black-box and white-box monitoring. Black-box monitoring looks at the boundaries of our application as a whole, such as whether a system is down or not working correctly. White-box monitoring allows us to look deeper into the applications for imminent problems such as slow-running queries or logs showing repeated retries.

Google’s recommendation is that the Four Golden Signals (Latency, Traffic, Errors and Saturation) are the minimum of what needs to be monitored. But we wanted more metrics as a starting point, including ones specific to our web and mobile client-side platforms. Below are the key black-box metrics we looked at to give us a starting point.

Metric (unit) | Definition
Availability (%) | The fraction of time that a service is usable. Can customers access it?
Apdex (0 to 1) | A normalized measure of performance and customer satisfaction within a sample period. We define a response time threshold T per application, e.g. T = 1s. Then Apdex = (Satisfied + Tolerating / 2) / Total samples, where Satisfied is the number of responses ≤ T and Tolerating is the number of responses > T and ≤ 4T
Latency (ms) | The time it takes an API to service a request
Traffic / Throughput (rpm, requests per minute) | How much demand is placed on our applications. Depending on the application, this could be web HTTP requests or I/O connections for a database
Error Rate (%) | The rate of requests that fail. Depending on the application, this could be HTTP 5xx responses, JS errors or DB connection timeouts. APM tools can also report application-level errors, which we take into account
Saturation (%) | A measure of how ‘full’ our system is. Depending on the application, we look at CPU, memory, I/O, disk usage etc.
Page view load (ms) | Web only. Average page load time broken down into segments
First Contentful Paint (s) | Web only. The time from navigation to when the browser renders the first bit of content from the DOM
Speed Index (s) | Web only. Page load performance metric that shows how quickly the contents of a page are visibly populated. The lower the score, the better
Time to Interactive (s) | Web only. How long it takes a page to become interactive
Crash-free users (%) | Mobile only. Percentage of users that have not encountered a crash
Network Success Rate (%) | Mobile only. The percentage of HTTP/S requests made by the app that return a 2xx or 3xx response code
Payload size (bytes) | Mobile only. Size of the network payload downloaded and uploaded by the app
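
To make two of these definitions concrete, here is a minimal Python sketch of the standard Apdex calculation, (Satisfied + Tolerating / 2) / Total samples, alongside a simple error rate. The function names and sample data are assumptions made for illustration; in practice an APM tool computes these values for you.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("need at least one sample")
    # Satisfied: response time <= T
    satisfied = sum(1 for t in response_times if t <= threshold)
    # Tolerating: response time > T and <= 4T
    tolerating = sum(
        1 for t in response_times if threshold < t <= 4 * threshold
    )
    return (satisfied + tolerating / 2) / len(response_times)


def error_rate(status_codes):
    """Percentage of requests answered with an HTTP 5xx status."""
    if not status_codes:
        raise ValueError("need at least one request")
    failed = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100.0 * failed / len(status_codes)


# Hypothetical sample: response times in seconds, with T = 1s
times = [0.2, 0.4, 0.9, 1.5, 3.0, 5.0]
print(round(apdex(times, threshold=1.0), 2))    # 0.67
print(error_rate([200, 200, 302, 500, 503]))    # 40.0
```

In the sample, three responses are satisfied, two are tolerating and one (5.0s) counts for nothing, giving (3 + 1) / 6.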

 

Why do we use the tools that we use?

We use a combination of third-party open-source, free and paid tools to accomplish our objective. One of the decisions we had to make as a lean startup was build vs. buy. We try to be up and running as quickly as possible without taking valuable engineering time to build custom monitoring from open-source tools. For a startup in a hyper-growth stage, engineers’ time is best spent on building new features that benefit our ecosystem. But as a team, we also perform our due diligence and evaluate the tools available, as there are so many options to choose from.

Some tools, like Slack, are already used by the rest of the company, while AWS Cloudwatch is the easiest way to monitor our AWS resources, where the majority of our production workloads are hosted. Firebase is not only one of the best tools but also free, and its SDK integrates easily into our mobile builds. However, we made the decision to invest considerably in New Relic, one of the industry leaders in application monitoring SaaS. It is simple to set up and powerful, and its support for multiple platforms is one of the reasons for our decision. Having our backend, web and API monitoring under one tool keeps management easier.

Alerting channels

  • Slack is where all of our alerts go at the moment. All engineers have Slack on their mobile devices and are required to enable notifications for all critical alerting channels. We divide our Slack channels into critical and non-critical channels for historical auditing purposes
  • Email notification is considered a secondary alerting channel but is still active
  • Others. We are considering SMS alerts and paging tools like PagerDuty for highest-severity events in future, but so far Slack works thanks to the discipline of our engineers in responding.

Monitoring Tools

  • AWS Cloudwatch Metrics, Alarms and Dashboards for all of our AWS-hosted services
  • New Relic APM for application performance and error monitoring. It provides transaction traces for white-box monitoring and alerting, and pinpoints slow-running queries that we can continually improve on.
  • New Relic Synthetics for API performance monitoring and ping tests for uptime. Our API monitoring uses a test script to send a request; for a test to pass, an HTTP 200 response alone is not enough, as we have also configured New Relic to check for the correct response content.
  • New Relic Browser for web client performance monitoring and errors
  • Google PageSpeed Insights to analyze our landing page and dashboard web pages on demand
  • Firebase Crashlytics to report on mobile app crashes and errors
  • Firebase Performance to get insight into our mobile app performance
  • Raygun is a tool we use heavily for error reporting on our Ruby on Rails backend. We find that it provides better error alerting than New Relic APM, so we’re keeping it.
  • Nagios is an open-source monitoring tool that we have had in place since the early days to monitor host-level processes and network connectivity
  • Monit for monitoring and automated keep-alive of specific services such as Sidekiq job queue processing and Phusion Passenger. Alerts are sent if there are any errors
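
The API check described for New Relic Synthetics (pass only on HTTP 200 and the correct response content) follows a pattern that can be sketched with the Python standard library. This is an illustration of the idea, not a New Relic script: the function names and the expected text are assumptions.

```python
import urllib.request


def passes_check(status, body, expected_text):
    """A check passes only on HTTP 200 AND the expected content."""
    return status == 200 and expected_text in body


def check_endpoint(url, expected_text, timeout=10):
    """Fetch a URL and apply the same pass/fail rule.

    Network failures, timeouts and HTTP error statuses all raise
    OSError subclasses in urllib, so they count as a failed check.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return passes_check(resp.status, body, expected_text)
    except OSError:
        return False


# A 200 response with the wrong body still fails the check
print(passes_check(200, '{"status": "ok"}', '"status": "ok"'))          # True
print(passes_check(200, "<html>Maintenance</html>", '"status": "ok"'))  # False
```

Checking the body as well as the status code catches cases such as a maintenance page or an error payload served with a 200 status, which a plain ping test would report as healthy.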

 

Below are some example screenshots of our monitoring dashboards. Setting up the dashboards and alert notifications was probably the most time-consuming part, other than the first step of figuring out the important metrics to measure.

New Relic Synthetics monitor for API and Ping tests

 

New Relic APM monitoring of our main Backend service

 

New Relic Browser monitoring of our web pages

 

PostgreSQL RDS monitoring on AWS Cloudwatch Dashboard

 

Main Backend Service EC2 Instances monitoring on AWS Cloudwatch Dashboard

 

AWS Elasticache Redis monitoring on AWS Cloudwatch Dashboard

 

Firebase Crashlytics monitoring for mobile applications

These dashboards are displayed on a large TV in the Engineering standup area to ensure everyone has visibility and is aware of what’s happening with our applications. Every morning during standup, it is impossible to miss the graphs if there is a spike, something is showing red or crash-free users has dipped below our threshold.

The alerts are configured to send Slack messages to alerting-only channels. We use a combination of custom webhooks and native integrations to achieve this. Tools like New Relic, Firebase and Raygun offer native integration, while AWS and open-source solutions need a little more work. How exactly we configure these tests and alerting policies is a subject for future posts, but feel free to reach out to me if you’d like to learn more.

The example below shows our New Relic alerting Slack channel, where messages are posted when GET 5xx errors go above the critical threshold.

You can create a custom Lambda function to send Cloudwatch alarms to Slack. This guide from Slack shows you how. The example below is our ‘warning’ alert when our Elasticsearch JVMMemoryPressure goes above 65%, and engineers are supposed to pay attention before functionality is impaired.
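
As a rough idea of what such a Lambda function can look like, here is a minimal Python sketch. It assumes the alarm is delivered through an SNS topic that triggers the function and that a Slack incoming webhook URL is stored in an environment variable named SLACK_WEBHOOK_URL; that variable name and the message layout are assumptions for illustration, not the code from Slack’s guide or our production function.

```python
import json
import os
import urllib.request


def build_slack_message(alarm):
    """Turn a CloudWatch alarm notification into a Slack payload."""
    return {
        "text": (
            f"[{alarm['NewStateValue']}] {alarm['AlarmName']}: "
            f"{alarm['NewStateReason']}"
        )
    }


def lambda_handler(event, context):
    """Entry point: CloudWatch alarm -> SNS topic -> Lambda -> Slack."""
    # SLACK_WEBHOOK_URL is an assumed environment variable name
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    for record in event["Records"]:
        # SNS delivers the alarm as a JSON string in Sns.Message
        alarm = json.loads(record["Sns"]["Message"])
        request = urllib.request.Request(
            webhook_url,
            data=json.dumps(build_slack_message(alarm)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # urlopen raises on HTTP error statuses, failing the invocation
        with urllib.request.urlopen(request, timeout=10):
            pass
    return {"status": "sent"}
```

Using a separate webhook (and therefore a separate Slack channel) per severity is one way to keep warning alerts apart from critical ones.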

Final Words and Next Steps

As you can see, having visibility into these metrics gives us a starting point and a baseline to continually improve on. To quote Peter Drucker, the father of management consulting: “If you can’t measure it, you can’t improve it”.

Hopefully, this post gives you some ideas on how to use a combination of free and paid tools to monitor your applications, and on what to monitor. No matter what industry you’re in, the first step is always to understand what is important to your customers and list the metrics to monitor. Then perform an in-depth analysis of the tools you’ve shortlisted. All of these tools offer a free trial period so you can install their SDKs and JS snippets and set up alerting. This gives you a good sample period to play around and weigh the pros and cons.

The next step for us is to take the last 30 days of data on each metric to establish our Service Level Objectives (SLOs), which define what to expect from our applications. Having SLOs will help us keep raising the bar on performance with our customers in mind. We are also looking to evaluate New Relic Infrastructure and New Relic Mobile to see if they are worth the investment and whether we can consolidate everything under 1 tool. The nice thing about having everything in New Relic is the power of Insights, which allows you to build an all-in-one dashboard showing everything in a single pane of glass.
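
To show how 30 days of data relate to an availability SLO, here is a small Python sketch of the arithmetic. The downtime figures, the 99.9% target and the function names are hypothetical; this is one simple way to frame the numbers, not a description of how our SLOs will be set.

```python
def error_budget_minutes(slo_percent, days=30):
    """Minutes of allowed unavailability in the window for an SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def achieved_availability(daily_downtime_minutes):
    """Availability (%) over the window from per-day downtime."""
    total_minutes = len(daily_downtime_minutes) * 24 * 60
    downtime = sum(daily_downtime_minutes)
    return 100 * (total_minutes - downtime) / total_minutes


# Hypothetical 30-day history: two incidents of 12 and 9 minutes
history = [0] * 28 + [12, 9]
print(round(achieved_availability(history), 3))   # 99.951
print(round(error_budget_minutes(99.9), 1))       # 43.2
```

With this hypothetical history, a 99.9% target would have been met with about half of the 43.2-minute budget left over, which is the kind of baseline a first SLO can be anchored to.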

I am sure you have your own ways of monitoring and alerting and I’d love to learn how you’re doing it. If you have any suggestions or questions, I’d be happy to hear them so drop me a message at nic@kaodim.com.

Read the original article at https://bit.ly/2H1yikh

To contact us, email us at newsroom@kaodim.com