😎 Dev-first observability use cases

Learn how you can troubleshoot issues in production faster and improve dev experience and productivity with Helios' dev-first observability platform.

Leveraging OpenTelemetry and context propagation, Helios turns different operations in a distributed application into traces and correlates them with logs and metrics, enabling end-to-end app visibility and faster troubleshooting of distributed applications. You can visualize your entire app flows, including all services, APIs, message brokers, data pipelines, and databases. You can also search for specific errors or events to find the root cause and get the traces and data you need to fix it in minutes.
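
To make context propagation concrete, here is a minimal sketch of how a trace ID can travel between services in the W3C Trace Context `traceparent` header, which OpenTelemetry supports. It uses only the Python standard library, and the function names are hypothetical. The OpenTelemetry and Helios SDKs handle this for you, so treat it as an illustration of the mechanism rather than their implementation.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID, 8-byte span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Unparseable header: start a fresh trace rather than fail the request.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers: dict) -> dict:
    """Headers a service would attach to its downstream call."""
    incoming = incoming_headers.get("traceparent")
    header = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": header}


# Usage: the first service starts a trace, the second one continues it.
root = new_traceparent()
next_hop = outgoing_headers({"traceparent": root})["traceparent"]
```

Because every hop keeps the same trace ID and only changes the span ID, a backend can stitch the individual operations into one end-to-end trace.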

Visualizing E2E flows to provide the full context of each API call and error in a distributed application such as the Helios Sandbox

This doc covers real-world use cases in which Helios and OpenTelemetry distributed tracing help dev and ops teams investigate and solve issues faster:

  1. From alert to applicative flow in 1-click
  2. Easy and quick reproduction of production issues, locally
  3. Applied API observability
  4. Bottleneck analysis leveraging distributed tracing
  5. 3rd-party app integrations
  6. Root cause of failed tests and visibility into CI environment

From alert to applicative flow in 1-click

There are many channels through which you learn that something broke in your app: error monitoring, logs, Slack, alerts from Helios, and even internal exceptions. The tricky part is figuring out why and where things didn't work as expected. The power of Helios is in getting you the right data with the right context at the right time: when an error pops up, you can open the full E2E visualization of the erroneous flow in one click. For many issues, this alone will reduce MTTR to a few minutes and save precious time and frustration.

Helios log instrumentation allows developers to access in one click the full E2E trace visualization in Helios for the full context of what happened

Going the other way around, error logs are automatically collected by Helios and displayed within the context of the full E2E trace so that all the data needed for the root cause analysis is made available in a single location.
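
One common way to tie a log line to its trace is to stamp the active trace ID on every log record. The sketch below shows that idea with the Python standard library only. The `current_trace_id` context variable and the `checkout` logger name are hypothetical, and the actual mechanism used by Helios log instrumentation may differ.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled (hypothetical name).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Create a logger whose output includes the trace ID of each record."""
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def handle_request(logger: logging.Logger, trace_id: str) -> None:
    """Simulate a request handler that logs an error while a trace is active."""
    token = current_trace_id.set(trace_id)
    try:
        logger.error("payment provider returned 502")
    finally:
        current_trace_id.reset(token)


# Usage: the log line carries the trace ID needed to jump to the full trace.
buffer = io.StringIO()
logger = build_logger(buffer)
handle_request(logger, "4bf92f3577b34da6a3ce929d0e0e4736")
```

With the trace ID present in the log line, a log viewer or alert can link straight to the matching trace, and a tracing backend can attach the log to the right span.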

Error logs presented directly in Helios within the full context of the E2E trace visualization in the Helios Sandbox

Easy and quick reproduction of production issues, locally

No code is immune to bugs, and the key is to be able to find and troubleshoot issues - especially in production - quickly and confidently. The next step is to understand in retrospect what can be done to catch this type of issue earlier in the development cycle, by generating a test case or updating the local and pre-prod environments to resemble production more closely.

The Helios OpenTelemetry SDK can collect all payloads (HTTP request and response bodies, message queue contents, and DB queries and results). Using this data, it offers developers the ability to replay flows and reproduce calls in their distributed applications - in any environment.

Reproducing a flow using the code automatically generated by Helios based on instrumented data of an erroneous trace from the Helios Sandbox
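
Helios generates the reproduction code for you. Purely to illustrate the idea behind it, the sketch below rebuilds an HTTP request from payload data captured on a span and retargets it at a local environment. The span dictionary, its attribute names, and `build_replay_request` are hypothetical and are not the format or code that Helios produces.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit

# Hypothetical captured span; the attribute names are illustrative only.
captured_span = {
    "http.method": "POST",
    "http.url": "https://api.example.com/orders",
    "http.request.headers": {"Content-Type": "application/json"},
    "http.request.body": '{"sku": "A-100", "quantity": 2}',
}


def build_replay_request(span: dict, target_base: str = None) -> urllib.request.Request:
    """Rebuild the captured call, optionally pointing it at another environment."""
    url = span["http.url"]
    if target_base is not None:
        original = urlsplit(url)
        target = urlsplit(target_base)
        # Keep the original path and query, swap the scheme and host.
        url = urlunsplit((target.scheme, target.netloc, original.path, original.query, ""))
    body = span.get("http.request.body")
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url,
        data=data,
        headers=dict(span.get("http.request.headers", {})),
        method=span["http.method"],
    )


# Usage: build the request against a local service (nothing is sent here).
replay = build_replay_request(captured_span, target_base="http://localhost:8080")
# To actually send it: urllib.request.urlopen(replay)
```

Separating "build" from "send" keeps the sketch safe to run and makes it easy to inspect or tweak the request before replaying it.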

Applied API observability

With the rapid rise of API use - both internally and as products in their own right - API observability is becoming increasingly important for understanding how APIs are used and how they impact application performance. Latency and error rate issues in APIs can also affect customer experience. With Helios, API discovery, specification, monitoring and troubleshooting are based on the actual instrumentation of the microservices, instead of the documentation of the APIs. This actual data can be applied to identify and troubleshoot issues quickly, optimize performance, and improve both customer satisfaction and the overall developer experience.

The core pieces of dev-centric API observability include:

  1. Auto-generated API catalog
  2. API overview and (actual) OpenAPI specifications
  3. API troubleshooting

An example of the auto-generated OpenAPI spec of the /traces API in Helios
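
The idea of deriving a specification from instrumentation rather than from documentation can be sketched in a few lines. The example below folds a handful of observed calls into an OpenAPI-style `paths` object. The observed calls, the numeric-segment heuristic, and the function names are assumptions made for illustration; this is not how Helios builds its catalog, and the result is not a complete OpenAPI document.

```python
import re

# Hypothetical calls observed in traces: (method, path, status code).
observed_calls = [
    ("GET", "/traces/123", 200),
    ("GET", "/traces/456", 404),
    ("POST", "/traces", 201),
]


def template_path(path: str) -> str:
    """Treat purely numeric segments as path parameters (a simplifying assumption)."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", path)


def build_paths(calls) -> dict:
    """Group observed calls by templated path and method, collecting status codes."""
    paths = {}
    for method, path, status in calls:
        operations = paths.setdefault(template_path(path), {})
        operation = operations.setdefault(method.lower(), {"responses": {}})
        operation["responses"].setdefault(
            str(status), {"description": "observed in traffic"}
        )
    return paths


spec = {
    "openapi": "3.0.3",
    "info": {"title": "Observed API", "version": "0.0.0"},
    "paths": build_paths(observed_calls),
}
```

Because the structure comes from real traffic, it reflects what the API actually does, including error responses that hand-written documentation may omit.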

Bottleneck analysis leveraging distributed tracing

In distributed applications, bottlenecks happen in many different places at different times and for many possible reasons. There are many I/O operations flying around, and different processes of the same application are often allocated resources differently, which then often leads to a backlog building up over time. Without the ability to go through the program execution step-by-step, it quickly becomes unmanageable and very hard to understand what's going on.

By using distributed tracing solutions like OpenTelemetry and Helios in a developer's day-to-day work, it's possible to get visibility into bottlenecks in the application, solve them quickly, and ensure they do not occur again over time.
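
As a rough illustration of what trace data makes possible, the sketch below computes each span's self time (its duration minus the time spent in its direct children) and picks the largest one as the likely bottleneck. The span data is made up, and the calculation assumes children of a span run one after another without overlapping.

```python
from collections import defaultdict

# Hypothetical spans of a single trace, with timestamps in milliseconds.
spans = [
    {"id": "a", "parent": None, "name": "GET /checkout", "start_ms": 0, "end_ms": 900},
    {"id": "b", "parent": "a", "name": "SELECT cart", "start_ms": 20, "end_ms": 120},
    {"id": "c", "parent": "a", "name": "POST /payments", "start_ms": 130, "end_ms": 850},
    {"id": "d", "parent": "c", "name": "kafka publish", "start_ms": 140, "end_ms": 200},
]


def self_times(trace_spans) -> dict:
    """Return the time each span spent in its own work, keyed by span name."""
    child_total = defaultdict(int)
    for span in trace_spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["end_ms"] - span["start_ms"]
    return {
        span["name"]: (span["end_ms"] - span["start_ms"]) - child_total[span["id"]]
        for span in trace_spans
    }


times = self_times(spans)
bottleneck = max(times, key=times.get)
```

Here the root span lasts 900 ms, but almost all of that is time spent waiting on `POST /payments`, which is where the investigation should start.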

3rd-party app integrations

Almost any application requires some level of integration with a 3rd-party app. Unfortunately, the process isn't always smooth, as the first stab at an integration often includes inevitable errors. Unless developers catch and log the errors received, they have to debug their code and set a breakpoint on the interaction itself to really understand the root cause. This can be time-consuming and tedious, depending on the complexity of the implemented flow - and frustrating, depending on the maturity and stability of the 3rd-party app you're integrating with.

Observability over 3rd-party app integrations helps to streamline this process, providing E2E visibility into applicative flows as early as in your local development environment, so you can easily pinpoint errors in the process, reproduce them, investigate - and hopefully complete the task at hand much quicker and with less guesswork.

The E2E trace from a local environment used for troubleshooting, depicting exactly what was the issue in the flow

Root cause of failed tests and visibility into CI environment

One of the main challenges when building a distributed application is testing it, and more specifically debugging the tests. One day a test passes, the next day it stops working. In distributed environments, testing frameworks provide limited transparency into what failed. Just as application flows in microservices architectures are handled by multiple services and cloud entities, so are test flows. This makes it harder to understand what actually happened and when. Even if developers know where to look, the logs are not always accessible to them, and often the only indication of what went wrong is the failed assertion, which doesn't tell the whole story.

Taking the same approach as we do with applicative flows, instrumenting existing tests can also go a long way in providing visibility into E2E tests and helping get to the failure root cause quickly.
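
A minimal sketch of what instrumenting a test can mean: a decorator records one span per test, with its outcome, error, and duration. Everything here (`traced_test`, `recorded_spans`, the two example checks) is hypothetical and uses only the standard library; with OpenTelemetry the span would be created through its tracer API and exported to a backend instead of a list.

```python
import functools
import time

# Stand-in for a span exporter: finished test spans are appended here.
recorded_spans = []


def traced_test(func):
    """Wrap a test function so each run produces a span-like record."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = {"name": func.__name__, "status": "OK", "error": None}
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            span["status"] = "ERROR"
            span["error"] = f"{type(exc).__name__}: {exc}"
            raise  # keep the test failing as usual
        finally:
            span["duration_ms"] = (time.perf_counter() - started) * 1000
            recorded_spans.append(span)

    return wrapper


@traced_test
def check_order_total():
    assert 2 * 5 == 10


@traced_test
def check_inventory_sync():
    raise TimeoutError("inventory service did not respond")


def run_all(checks) -> None:
    """Tiny runner: failures are already captured on the spans."""
    for check in checks:
        try:
            check()
        except Exception:
            pass


run_all([check_order_total, check_inventory_sync])
```

The failing check still fails, but its span now says why, which is more than a bare failed assertion usually tells you.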

Leveraging OpenTelemetry to instrument existing tests and create an intuitive E2E flow that helps troubleshoot failures and get to the root cause quicker (example from [Helios Sandbox](https://sandbox.gethelios.dev/tests?relativeDateRange=now-10d&activeTest=d75441be-1374-4c61-b6d9-d022b1ab828f))

In addition, observability can be leveraged in the CI environment to help teams ship new versions faster.