Observability, Monitoring, Alerting

Do you know what’s going on with your application?

Italo Santos
5 min read · Jan 25, 2023
Normal question to be asked when you're on-call

Observability has been a popular topic for several years, ever since Twitter’s engineers published a blog post called Observability at Twitter in September 2013. That was one of the first times the term was used in the context of software systems.

A few years later, Peter Bourgon attended the 2017 Distributed Tracing Summit, where the discussion turned to the definition and scope of tracing and how it contributes to observability. In his blog post Metrics, tracing, and logging, he describes his thoughts on how to map out the domain of instrumentation.

Peter Bourgon — Metrics, tracing, and logging

Several discussions followed, and by 2020 the industry had largely aligned on the idea that observability isn’t a feature you can install or a service you can subscribe to, nor is it simply the deployment and collection of three types of data.

But the concept of observability isn't new: the Hungarian-American engineer Rudolf E. Kálmán introduced it in control theory in 1960.[1][2]

"Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs."

Observability might mean different things to different people, but its core remains the same: bringing better visibility into systems.

Following Kálmán's definition, logs, metrics, and traces are external outputs from which a system’s internal state can be inferred. Together they are known as telemetry data, and they are considered the pillars of observability for modern systems.

Observability is data, which is produced, collected, and stored to produce insights that can lead to answers, but the process of knowing what information to expose and how to examine the evidence (aka observations) still requires a good understanding of the system’s domain, as well as a good sense of intuition.

mon-i-tor

/ˈmänədər/
verb
gerund or present participle: monitoring

observe and check the progress or quality of (something) over a period of time; keep under systematic review.

Although it looks very similar to observability, monitoring is the ability to collect, process, aggregate, and display real-time quantitative data about a system, mainly to understand the system’s health. While monitoring looks at systems from a more static perspective, observability can be more dynamic, providing much more information about unpredictable failure modes.

Observability from Stochastic Geometry

Observability Infrastructure

In the microservices world, a good monitoring platform is essential, but building one is a hard problem. Several tools, such as New Relic, Datadog, Dynatrace, and Lightstep, cover many of these needs in a single product, but they can cost a lot depending on how big your application ecosystem is.

While we were spraying our bike shed with some new colors at Loggi, we decided to adopt the Istio service mesh as one of our core infrastructure components. It lets you transparently add capabilities like observability, traffic management, and security without adding them to your code.

Istio Architecture Overview

Istio can generate detailed telemetry for all service communications within a mesh, providing observability of service behavior through the following types of telemetry:

  • Metrics → Istio generates a set of service metrics based on the four “golden signals” of monitoring (latency, traffic, errors, and saturation)
  • Distributed Traces → Istio generates distributed trace spans for each service
  • Logs → As traffic flows into a service within a mesh, Istio can generate a full record of each request, including source and destination metadata

Although it’s not possible to get all of an application's internal data using only Istio, we came across some libraries that can do this, like Micrometer for JVM apps and Django Prometheus. Together with OpenTelemetry, these libraries expose internal framework data and are easy to customize to generate other metrics we deem necessary, building an ecosystem of tools to help you observe your systems.
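As a rough illustration of what "other metrics we deem necessary" can look like, the sketch below keeps a labelled counter in memory and renders it in the Prometheus text exposition format. It deliberately uses only the Python standard library rather than the real Micrometer, Django Prometheus, or OpenTelemetry APIs; the class `LabelledCounter`, the metric name `orders_created_total`, and the label `channel` are all hypothetical.

```python
from collections import defaultdict


class LabelledCounter:
    """Hypothetical in-memory counter; real client libraries offer a richer API."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Exactly the declared labels must be supplied.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Render the counter as Prometheus text (label values are not escaped here)."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            pairs = ",".join(
                f'{name}="{value}"' for name, value in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]}")
        return "\n".join(lines) + "\n"


# Hypothetical business metric that a mesh sidecar could not see on its own.
orders_created = LabelledCounter(
    "orders_created_total", "Orders created, by sales channel.", ["channel"]
)
orders_created.inc(channel="app")
orders_created.inc(channel="app")
orders_created.inc(channel="web")

print(orders_created.render())
```

In a real service you would most likely let one of the libraries above manage registration and the scrape endpoint; the point here is only that application-level counts like this one live inside the process and have to be exposed by it.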

Observability Ecosystem Tools

Using the observability data

Observability can produce a huge amount of data, which can quickly become a mess, so it's important to understand how to use it.

Monitoring Hierarchy — Inspired by Alex King’s article

The pyramid above proposes how the collected data can be used, which I call the Monitoring Hierarchy.

The bottom of the pyramid represents the telemetry data from the various systems, and the middle layer offers insights into operating efficiency. The top of the pyramid contains the notifications that a human will read about a problem requiring immediate attention.

We can also divide the pyramid into two categories: metrics (data to be observed) and diagnostics (root cause analysis).

Metrics aim to give an overview of a system's health at the current moment, while diagnostics analyze past data, such as logs, to identify the reason for a failure.

For this reason, your monitoring system should address two major questions: “What’s broken, and Why?”
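One way to picture the two questions: a metric threshold answers "what's broken", and the logs from the same service and time window help answer "why". The sketch below is a toy, assuming hypothetical service names, hypothetical log records, and a 5% error-ratio threshold picked only for illustration.

```python
from collections import Counter

# Hypothetical aggregated metrics: service -> (total requests, failed requests).
request_counts = {
    "checkout": (1200, 96),
    "catalog": (5400, 12),
}

# Hypothetical structured log records collected over the same time window.
logs = [
    {"service": "checkout", "level": "ERROR", "message": "payment gateway timeout"},
    {"service": "checkout", "level": "ERROR", "message": "payment gateway timeout"},
    {"service": "checkout", "level": "ERROR", "message": "invalid coupon"},
    {"service": "catalog", "level": "INFO", "message": "cache refreshed"},
]

ERROR_RATIO_THRESHOLD = 0.05  # assumed value; tune per service


def whats_broken(counts, threshold):
    """Metrics side: return services whose error ratio exceeds the threshold."""
    return sorted(
        service
        for service, (total, failed) in counts.items()
        if total > 0 and failed / total > threshold
    )


def why(service, records, top=3):
    """Diagnostics side: most common error messages logged by one service."""
    messages = Counter(
        record["message"]
        for record in records
        if record["service"] == service and record["level"] == "ERROR"
    )
    return messages.most_common(top)


for service in whats_broken(request_counts, ERROR_RATIO_THRESHOLD):
    print(service, why(service, logs))
```

With this sample data only `checkout` crosses the threshold, and its most frequent error message points at the payment gateway as the first place to look.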

Again… There is no silver bullet!

But there are some methods that help you address these questions.

  • Four Golden Signals → User-centric: monitors the system from the user's perspective (latency, traffic, errors, and saturation)
  • RED → Focused on microservices: checks the system from the request perspective (rate, errors, duration), which helps answer which service is broken
  • USE → Focused on infrastructure: looks at utilization, saturation, and errors to identify performance issues and bottlenecks
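To make the RED method a little more concrete, the sketch below derives rate, error ratio, and duration per service from a list of request records. The record fields, the service names, the 60-second window, and the choice of a nearest-rank 95th percentile for duration are assumptions made for illustration; the method itself does not prescribe them.

```python
import math
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed observation window

# Hypothetical request records: (service, HTTP status, duration in milliseconds).
requests = [
    ("orders", 200, 35.0),
    ("orders", 200, 42.0),
    ("orders", 500, 910.0),
    ("orders", 200, 38.0),
    ("tracking", 200, 12.0),
    ("tracking", 200, 15.0),
]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(records, window_seconds):
    """Return {service: {"rate", "error_ratio", "p95_ms"}} for the window."""
    by_service = defaultdict(list)
    for service, status, duration_ms in records:
        by_service[service].append((status, duration_ms))

    summary = {}
    for service, items in by_service.items():
        # Only 5xx responses are counted as errors in this sketch.
        errors = sum(1 for status, _ in items if status >= 500)
        summary[service] = {
            "rate": len(items) / window_seconds,  # requests per second
            "error_ratio": errors / len(items),   # share of failed requests
            "p95_ms": percentile([d for _, d in items], 95),
        }
    return summary


for service, metrics in sorted(red_metrics(requests, WINDOW_SECONDS).items()):
    print(service, metrics)
```

In this sample, `orders` stands out with a 25% error ratio and a much higher p95 duration than `tracking`, which is the kind of per-service comparison RED is meant to surface.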

Conclusion

“Observability Isn’t a Panacea”

In Greek mythology, Panacea was a goddess of healing, and the term “panacea” is also widely used to mean a cure for all ills.

We can’t look at observability as purely an operational concern. An observable system isn’t achieved simply by having monitoring infrastructure in place or by having an observability team. Observability is a feature that needs to be built into a system from the design stage.

References & Good articles
