Cloud Native Observability

The term cloud native observability might sound like another buzzword that is used to sell new tools. And that might be true to the fact that we have a lot of new tools that emerged to solve the problems of monitoring container infrastructure.

Conventional monitoring for servers may include collecting basic metrics of the system like CPU and memory resource usage and logging of processes and the operating system.

Observability is often used synonymously with monitoring, but monitoring is only one of the subcategories of cloud native observability and does not do justice to its scope. The term observability is closely related to the control theory which deals with behavior of dynamic systems. In essence, the control theory describes how external outputs of systems can be measured to manipulate the behavior of the system. When we deal with container orchestration and microservices, the biggest challenge is keeping track of the systems, how they interact with each other and how they behave when under load or in an error state.

The higher goal of observability is to allow analysis of the collected data. This helps to get a better understanding of the system and react to error states. This more technological side of things is closely related to modern agile software development that also uses feedback loops in which you analyze the behavior of software and adapt it constantly based on the outcome.

The term telemetry has Greek roots and means remote or distance (tele) and measuring (metry). Measuring and collecting data points and then transferring it to another system is of course not exclusive to cloud native or even IT systems. A good example would be a weather station with a data-logger that measures the temperature, humidity, wind speed and more at a certain point and then transmits it to another system that can process and display the data.

In container systems, each and every application should have tools built in that generate information data, which is then collected and transferred in a centralized system. The data can be divided into three categories:

Logs
Metrics
Traces

To ship the logs, different methods can be used:

Node-level logging – The most efficient way to collect logs. An administrator configures a log shipping tool that collects logs and ships them to a central store.
Logging via sidecar container – The application has a sidecar container that collects the logs and ships them to a central store.
Application-level logging – The application pushes the logs directly to the central store. While this seems very convenient at first, it requires configuring the logging adapter in every application that runs in a cluster.

Prometheus is an open source monitoring system, originally developed at SoundCloud, which became the second CNCF hosted project in 2016. Over time, it became a very popular monitoring solution and is now a standard tool that integrates especially well in the Kubernetes and container ecosystem.

Prometheus can collect metrics that were emitted by applications and servers as time series data - these are very simple sets of data that include a timestamp, label and the measurement itself. The Prometheus data model provides four core metrics:

Counter: A value that increases, like a request or error count
Gauge: Values that increase or decrease, like memory size
Histogram: A sample of observations, like request duration or response size
Summary: Similar to a histogram, but also provides the total count of observations.

A trace describes the tracking of a request while it passes through the services. A trace consists of multiple units of work which represent the different events that occur while the request is passing the system. Each application can contribute a span to the trace, which can include information like start and finish time, name, tags or a log message.

These traces can be stored and analyzed in a tracing system like Jaeger.

While tracing was a new technology and method that was geared towards cloud native environments, there were again problems in the area of standardization. In 2019, the OpenTracing and OpenCensus projects merged to form the OpenTelemetry project, which is now also a CNCF project.

OpenTelemetry is a set of application programming interfaces (APIs), software development kits (SDKs) and tools that can be used to integrate telemetry such as metrics, protocols, but especially traces into applications and infrastructures.

Since cloud providers don’t offer their services "pro-bono", the key to cost optimization in the cloud is to analyze what is really needed and, if possible, automate the scheduling of the resources needed. There are several ways to do automatic and manual optimization:

Identify wasted and unused resources
Right-Sizing
Reserved Instances
Spot Instances

PreviousCloud Native Architecture NextCloud Native Application Delivery

Last updated 8 days ago