Leveraging OpenTelemetry and context propagation, Helios turns the operations in a distributed application into traces and correlates them with logs and metrics, enabling end-to-end visibility and faster troubleshooting. You can visualize entire application flows, including all services, APIs, message brokers, data pipelines, and databases. You can also search for specific errors or events to find the root cause, and get the traces and data you need to fix them in minutes.
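To make "context propagation" concrete, here is a minimal stdlib-only sketch of the W3C Trace Context mechanism that OpenTelemetry uses under the hood (this is illustrative, not the Helios SDK): the trace ID travels between services in the `traceparent` HTTP header, so spans emitted on both sides join the same end-to-end trace.

```python
# Minimal sketch of W3C Trace Context propagation (illustrative only,
# not the real OpenTelemetry SDK): the trace ID travels between
# services in the "traceparent" header so their spans join one trace.
import secrets

def make_traceparent(trace_id=None):
    """Build a traceparent header, starting a new trace if needed."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01", trace_id, span_id

def parse_traceparent(header):
    """Extract (trace_id, span_id) from an incoming traceparent header."""
    version, trace_id, span_id, flags = header.split("-")
    return trace_id, span_id

# Service A starts a trace and sends the header downstream...
header, trace_id, span_id = make_traceparent()
# ...service B continues the same trace from the incoming header.
incoming_trace_id, parent_span_id = parse_traceparent(header)
assert incoming_trace_id == trace_id  # both services share one trace
```

In a real deployment the OpenTelemetry SDK injects and extracts this header automatically; the point is only that a shared trace ID is what stitches separate services into one visualizable flow.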
This doc covers real-world use cases of how Helios, leveraging OpenTelemetry distributed tracing, helps dev and ops teams investigate and solve issues faster. These use cases are:
- From alert to applicative flow in 1-click
- Easy and quick reproduction of production issues, locally
- Applied API observability
- Bottleneck analysis leveraging distributed tracing
- 3rd-party app integrations
- Root cause of failed tests and visibility into CI environment
There are many channels through which you learn that something broke in your app - error monitoring, logs, Slack, alerts from Helios, and even internal exceptions. The tricky part is figuring out why and where things didn't work as expected. The power of Helios is in getting you the right data, with the right context, at the right time - meaning that when an error pops up, you can access the full E2E visualization of the erroneous flow in one click. For many issues, this alone reduces MTTR to a few short minutes and saves precious time and endless frustration.
In the other direction, error logs are automatically collected by Helios and displayed within the context of the full E2E trace, so all the data needed for root cause analysis is available in a single location.
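The mechanics behind log-trace correlation are simple, as this stdlib-only sketch shows (illustrative, not the Helios implementation): stamp every log record with the active trace ID, and a backend can jump from an error log straight to its full E2E trace.

```python
# Illustrative sketch: stamping each log record with the active trace
# ID so an error log can be looked up alongside its full E2E trace.
import io
import logging

# In a real setup this would come from the active span's context.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.ERROR)

# Inject the trace ID via the `extra` mapping on every log call.
logger.error("charge failed: card declined", extra={"trace_id": trace_id})
print(stream.getvalue().strip())
```

An observability backend can then index `trace_id` and render the log line inside the trace timeline instead of in isolation.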
No code is immune to bugs, and the key is being able to find and troubleshoot issues - especially in production - quickly and confidently. The next step is to understand, in retrospect, what can be done to catch this type of issue earlier in the development cycle - by generating a test case, or by updating the local and pre-prod environments to better resemble production.
The Helios OpenTelemetry SDK can collect full payloads (HTTP request and response bodies, message queue content, and DB queries and results). Using this data, Helios lets developers replay flows and reproduce calls in their distributed applications - in any environment.
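The record-and-replay idea can be sketched in a few lines (a hypothetical illustration, not the Helios SDK API): capture the request exactly as it was sent, then re-issue the identical call later through any HTTP client to reproduce the flow.

```python
# Hypothetical sketch of payload capture and replay: record the full
# request (method, URL, headers, body), then re-issue the identical
# call later to reproduce the flow in any environment.
import json

recorded_calls = []

def record_call(method, url, headers, body):
    """Capture the payload as sent - what a tracing SDK with payload
    collection would store alongside the span."""
    recorded_calls.append({"method": method, "url": url,
                           "headers": dict(headers), "body": body})

def replay_call(call, send):
    """Re-issue a recorded call through any HTTP client function."""
    return send(call["method"], call["url"], call["headers"], call["body"])

record_call("POST", "https://api.example.com/orders",
            {"content-type": "application/json"},
            json.dumps({"sku": "A-42", "qty": 2}))

# A stub client stands in for a real HTTP library during the replay.
result = replay_call(recorded_calls[0],
                     lambda m, u, h, b: f"{m} {u} -> replayed {len(b)} bytes")
print(result)
```

Because the recorded payload is complete and self-describing, the same call can be replayed locally against a dev instance to reproduce a production issue.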
With the rapid rise of API use - both internally and as products in their own right - API observability is becoming increasingly important for understanding how APIs are being used and how they impact application performance. Latency and error-rate issues in APIs can also affect customer experience. With Helios, API discovery, specification, monitoring, and troubleshooting are based on the actual instrumentation of the microservices rather than on API documentation. This real data can be applied to identify and troubleshoot issues quickly, optimize performance, improve customer satisfaction, and enhance the overall developer experience.
The core pieces of dev-centric API observability include:
- Auto-generated API catalog
- API overview and (actual) OpenAPI specifications
- API troubleshooting
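As a hedged sketch of the auto-generated catalog idea (not the Helios implementation), an API catalog and its "actual" spec can be derived from observed traffic: each traced request contributes its method, normalized path, and observed status codes.

```python
# Illustrative sketch: deriving an API catalog from observed traffic
# rather than from documentation. Each traced request contributes its
# method, normalized route, and the status codes actually seen.
observed = [
    ("GET",  "/users/42", 200),
    ("GET",  "/users/43", 404),
    ("POST", "/users",    201),
]

def normalize(path):
    """Collapse numeric segments into a path parameter, as a spec would."""
    return "/".join("{id}" if seg.isdigit() else seg
                    for seg in path.split("/"))

def build_catalog(requests):
    """Group observed calls into route -> method -> {status codes}."""
    catalog = {}
    for method, path, status in requests:
        entry = catalog.setdefault(normalize(path), {})
        entry.setdefault(method, set()).add(status)
    return catalog

catalog = build_catalog(observed)
print(catalog)
# e.g. {'/users/{id}': {'GET': {200, 404}}, '/users': {'POST': {201}}}
```

A catalog built this way reflects what the API actually does - including the 404s that hand-written documentation tends to omit.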
In distributed applications, bottlenecks happen in many different places, at different times, and for many possible reasons. There are many I/O operations in flight, and different processes of the same application are often allocated resources differently, which can lead to a backlog building up over time. Without the ability to step through the program execution, it quickly becomes very hard to understand what's going on.
By using distributed tracing solutions like OpenTelemetry and Helios in a developer's day-to-day work, it's possible to get visibility into bottlenecks in the application, solve them quickly, and ensure they do not occur again over time.
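The core of trace-based bottleneck analysis reduces to span timing, as in this illustrative sketch (simplified - real traces are nested and would use self-time rather than raw duration): given per-span start and end timestamps, the bottleneck is the operation with the largest duration.

```python
# Illustrative sketch: with per-span timestamps from a trace, the
# bottleneck is the child operation with the largest duration.
# (Real analysis uses nested spans and self-time; this is simplified.)
spans = [
    {"name": "checkout",     "start": 0.00, "end": 1.30},  # root span
    {"name": "auth-service", "start": 0.05, "end": 0.15},
    {"name": "inventory-db", "start": 0.20, "end": 1.10},  # the slow one
    {"name": "email-queue",  "start": 1.15, "end": 1.25},
]

def bottleneck(spans):
    """Return the child span with the longest duration (skip the root)."""
    children = spans[1:]
    return max(children, key=lambda s: s["end"] - s["start"])

slow = bottleneck(spans)
print(f"{slow['name']} took {slow['end'] - slow['start']:.2f}s")
# → inventory-db took 0.90s
```

Seeing this breakdown per trace, over time, is what lets a developer confirm both that the bottleneck is fixed and that it stays fixed.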
Almost any application requires some level of integration with a 3rd-party app. Unfortunately, the process isn't always smooth: a first stab at an integration usually includes inevitable errors. Unless developers catch and log the errors received, they have to debug their code and set a breakpoint on the interaction itself to understand the root cause. This can be time-consuming and tedious, depending on the complexity of the implemented flow - and frustrating, depending on the maturity and stability of the 3rd-party app you're integrating with.
Observability over 3rd-party app integrations helps to streamline this process, providing E2E visibility into applicative flows as early as in your local development environment, so you can easily pinpoint errors in the process, reproduce them, investigate - and hopefully complete the task at hand much quicker and with less guesswork.
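The value of instrumenting the 3rd-party boundary can be sketched as follows (an illustrative pattern, not the Helios SDK): wrap the outbound call so that the outcome - including the real error - is captured automatically, instead of being lost unless someone remembered to log it.

```python
# Illustrative sketch: wrapping a 3rd-party call so the real outcome
# is captured automatically - roughly what auto-instrumentation does
# without requiring manual try/except logging at every call site.
def call_with_capture(operation, record):
    """Run a 3rd-party call and record the outcome either way."""
    try:
        result = operation()
        record({"status": "ok", "result": result})
        return result
    except Exception as exc:  # capture the actual root cause
        record({"status": "error", "error": repr(exc)})
        raise

captured = []

def flaky_provider():
    # A typical first-integration failure from a 3rd-party API.
    raise ValueError("invalid API key")

try:
    call_with_capture(flaky_provider, captured.append)
except ValueError:
    pass

print(captured[0]["error"])
```

With this in place, the failed interaction shows up in the trace with its error attached - no breakpoint or guesswork needed.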
One of the main challenges when building a distributed application is testing it - and, more specifically, debugging the tests. One day a test passes; the next day it stops working. In distributed environments, testing frameworks provide limited transparency into what failed. Just as application flows in microservices architectures are handled by multiple services and cloud entities, so too are test flows. This makes it harder to understand what actually happened and when. Even when developers know where to look, the logs are not always accessible to them, and often the only indication of what went wrong is the failed assertion, which doesn't tell the whole story.
In addition, observability can be leveraged in the CI environment to help teams ship new versions faster.
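One common technique for identifying test runs in a CI environment - sketched below with hypothetical attribute names and environment variables - is to attach CI metadata to the test's root span, so a failed assertion can be traced back to the exact E2E flow that produced it.

```python
# Hypothetical sketch: attaching CI metadata (run ID, branch, test
# name) as span attributes so a failed test maps to its exact trace.
# The attribute keys and env var names here are illustrative.
import os

def span_attributes_for_test(test_name):
    """Collect CI context to set on the current test's root span."""
    return {
        "test.name": test_name,
        "ci.run_id": os.environ.get("CI_RUN_ID", "local"),
        "ci.branch": os.environ.get("CI_BRANCH", "unknown"),
    }

attrs = span_attributes_for_test("test_checkout_flow")
print(attrs["test.name"], attrs["ci.run_id"])
```

Filtering traces by these attributes turns "the test failed again" into "here is the full flow from run 1234 that failed, and here is the span where it went wrong."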
- Read more about dev-first observability on the Helios website
- API monitoring vs. observability in microservices - Troubleshooting guide
- API latency in microservices – trace based troubleshooting
- Debugging and troubleshooting microservices in production - all you need to know