Distributed Tracing and Observability
TODO
How are distributed logging and tracing supposed to work for asynchronous systems, e.g. applications that use messaging or streaming products with strategies such as pub-sub?
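One common answer to this question is context propagation: the producer writes its trace identifiers into the message headers, and the consumer continues the same trace from them. The sketch below is only an illustration under stated assumptions. It assumes the W3C Trace Context `traceparent` header layout (version, trace id, parent span id, flags) and uses an in-memory `queue.Queue` as a stand-in for a real broker. `publish`, `consume`, and the dict-based spans are hypothetical helpers, not OpenTelemetry APIs.

```python
import queue
import secrets


def make_traceparent(trace_id: str, span_id: str) -> str:
    # Assumed layout: version-traceid-parentid-flags ("01" = sampled).
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header: str) -> tuple[str, str]:
    _version, trace_id, parent_id, _flags = header.split("-")
    return trace_id, parent_id


def publish(topic: queue.Queue, payload: str) -> dict:
    """Producer side: start a span and inject its context into the headers."""
    span = {
        "name": "publish",
        "trace_id": secrets.token_hex(16),  # 32 hex characters
        "span_id": secrets.token_hex(8),    # 16 hex characters
        "parent_id": None,
    }
    headers = {"traceparent": make_traceparent(span["trace_id"], span["span_id"])}
    topic.put({"headers": headers, "payload": payload})
    return span


def consume(topic: queue.Queue) -> dict:
    """Consumer side: extract the context and continue the same trace."""
    message = topic.get()
    trace_id, parent_id = parse_traceparent(message["headers"]["traceparent"])
    return {
        "name": "consume",
        "trace_id": trace_id,               # same trace as the producer
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,             # points back at the producer span
    }


topic = queue.Queue()  # hypothetical stand-in for a pub-sub topic
producer_span = publish(topic, "order-created")
consumer_span = consume(topic)
print(producer_span["trace_id"] == consumer_span["trace_id"])
```

Because the identifiers travel with the message rather than with a call stack, the consumer can run much later, or on another machine, and its span still joins the producer's trace.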
https://stackify.com/what-is-observability-everything-a-beginner-needs-to-know/
- How to set up notifications and observability alerts?
What Is Distributed Tracing?
What is observability?
https://opentelemetry.io/docs/what-is-opentelemetry/#what-is-observability
https://opentelemetry.io/docs/concepts/observability-primer/#what-is-observability
Observability is the ability to understand the internal state of a system by examining its outputs. In software, those outputs are telemetry data: traces, metrics, and logs.
To make a system observable, it must be instrumented. That is, the code must emit traces, metrics, or logs. The instrumented data must then be sent to an observability backend.
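As a rough illustration of what "instrumented" means, the sketch below wraps a function so that each call emits a log line, increments a metric, and records a timing entry. All names here (`InMemoryBackend`, `instrumented`, `checkout`) are hypothetical, and the in-memory object only stands in for a real observability backend. It is not the OpenTelemetry API.

```python
import functools
import time
from collections import Counter


class InMemoryBackend:
    """Hypothetical stand-in for an observability backend."""

    def __init__(self):
        self.logs = []            # log lines
        self.metrics = Counter()  # named counters
        self.traces = []          # timing records


backend = InMemoryBackend()


def instrumented(func):
    """Emit a log, a metric, and a timing record for every call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        backend.logs.append(f"start {func.__name__}")
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call succeeded or raised.
            backend.metrics[f"{func.__name__}.calls"] += 1
            backend.traces.append(
                {"name": func.__name__, "duration_s": time.perf_counter() - start}
            )

    return wrapper


@instrumented
def checkout(cart):
    return sum(cart)


total = checkout([5, 10])
print(total, backend.logs, dict(backend.metrics))
```

The business function stays unchanged. The wrapper is the instrumentation, and swapping the backend would not require touching `checkout`.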
OpenTelemetry
https://opentelemetry.io/docs/what-is-opentelemetry/
Traces: https://opentelemetry.io/docs/concepts/signals/traces/
As requests flow through distributed systems, it’s important to keep track of how they travel, as this can be useful for monitoring and troubleshooting.
Tracing allows you to track the journey of a request as it moves through different services in a distributed environment. It provides a way to understand the flow of operations across these services, making it easier to pinpoint performance issues or errors.
Using tracing, you can break an operation down into smaller parts by identifying what happened, where, when, and how, along with any other relevant information. This structured approach makes debugging more effective and efficient.
Tracing is a fundamental aspect of observability. A trace is a collection of spans, providing a high-level view of how a specific request or transaction moves through various services within a distributed environment. Imagine a trace as a comprehensive map that outlines the path a request takes through the system.
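To make the "trace is a collection of spans" idea concrete, the sketch below groups spans by a shared trace id and uses each span's parent id to rebuild the path of one request. The span dictionaries, service names, and `render_trace` are made up for illustration. Real spans carry more fields.

```python
from collections import defaultdict

# Hypothetical spans as a backend might store them, from two different requests.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "gateway", "name": "GET /checkout"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "inventory", "name": "reserve-stock"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "service": "payments", "name": "charge-card"},
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "service": "payments", "name": "fraud-check"},
    {"trace_id": "t2", "span_id": "e", "parent_id": None, "service": "gateway", "name": "GET /health"},
]


def render_trace(all_spans, trace_id):
    """Return the spans of one trace as indented lines, parents before children."""
    children = defaultdict(list)
    for span in all_spans:
        if span["trace_id"] == trace_id:
            children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f'{span["service"]}: {span["name"]}')
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


tree = render_trace(spans, "t1")
print("\n".join(tree))
```

The indented output is the "map" of the request: the gateway span is the root, and each level of indentation is work done on behalf of the span above it.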
Spans: https://signoz.io/blog/opentelemetry-spans/
Spans are useful for understanding performance issues within a single service, e.g. which functions are taking too long to complete.
An OpenTelemetry span represents a single unit of work within a system. It encapsulates information about a specific operation, including its start time, duration, associated attributes, and any events or errors during its execution.
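The fields listed above can be sketched as a small data structure. This is a simplified, hypothetical `Span` class written for these notes, not the OpenTelemetry SDK type. It records a start time, a duration, attributes, events, and an error status when the wrapped work raises.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified, hypothetical span: one unit of work."""

    name: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "UNSET"
    start_time: float = 0.0  # wall-clock start, seconds since the epoch
    duration: float = 0.0    # elapsed seconds, measured with a monotonic clock

    def add_event(self, name: str) -> None:
        self.events.append((time.time(), name))

    def __enter__(self):
        self.start_time = time.time()
        self._t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.duration = time.perf_counter() - self._t0
        if exc is not None:
            self.status = "ERROR"
            self.add_event(f"exception: {exc!r}")
        return False  # never swallow the exception


with Span("load-user", attributes={"user.id": 42}) as ok_span:
    ok_span.add_event("cache-miss")

failed_span = Span("charge-card")
try:
    with failed_span:
        raise ValueError("card declined")
except ValueError:
    pass

print(ok_span.status, failed_span.status, failed_span.events[0][1])
```

Using a context manager means the duration and error status are recorded even when the work fails, which is when that information is usually most needed.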
Instrumentation
https://opentelemetry.io/docs/concepts/instrumentation/
How to instrument tracing in an application using OpenTelemetry?
- In Java: https://opentelemetry.io/docs/languages/java/intro/
- In Go: https://opentelemetry.io/docs/languages/go/getting-started/
Are application profiling and distributed tracing related?
- How profiling and tracing work together: https://grafana.com/docs/grafana/latest/datasources/pyroscope/profiling-and-tracing/#how-profiling-and-tracing-work-together
- Profiling Vs Tracing in OpenTelemetry: What’s the Difference? https://www.apica.io/blog/profiling-vs-tracing-in-opentelemetry/
- Distributed tracing vs. APM: What’s the difference? https://chronosphere.io/learn/distributed-tracing-vs-apm-whats-the-difference/
- What is the difference between Logging, Tracing & Profiling? https://greeeg.com/en/issues/differences-between-logging-tracing-profiling
- Tracing and Profiling Techniques for Distributed Systems https://www.geeksforgeeks.org/system-design/tracing-and-profiling-techniques-for-distributed-systems/
Yes, application profiling and distributed tracing are closely related. Distributed tracing identifies which services are slow; profiling then digs into the code within those services to pinpoint the exact performance bottleneck. That makes them complementary tools for application performance monitoring (APM). Tracing provides a high-level, end-to-end view of a request’s journey across services, while profiling offers a detailed, intra-service view by measuring resource usage and function execution times.
How they work together:
- Distributed Tracing Identifies the Problem Area: A user request travels through multiple services in a distributed system. Distributed tracing captures the path of this request, showing how long each service takes and where delays occur.
- Profiling Pinpoints the Cause: When tracing reveals a service is slow, profiling is used to investigate that specific service. It analyzes the code within that service to find functions or operations that are consuming excessive CPU, blocking I/O, or otherwise causing performance issues.
- Integrated Tools: Modern APM tools often integrate tracing and profiling, allowing developers to move directly from a slow span in a trace to the relevant code segments in the profiler for efficient debugging.
Key Differences:
- Distributed Tracing: Focuses on the inter-service journey of a request, answering questions like “Which service is slow?”.
- Application Profiling: Focuses on the intra-service workings, answering “What specific code is causing the problem in this service?”.
In summary, tracing maps the journey and profiling explains the specific causes of slowness within it. Used together, they help diagnose performance issues in complex, distributed applications.
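The two-step workflow can be sketched with the standard library alone. In this hypothetical example, coarse timings play the role of trace spans and identify the slow operation, and `cProfile` then profiles only that operation. The function names and sleep durations are invented for illustration. In a real system the spans would come from a tracing backend and the operations would live in separate services.

```python
import cProfile
import pstats
import time


def parse_request():
    time.sleep(0.001)  # cheap step


def slow_helper():
    time.sleep(0.05)   # the hidden bottleneck


def query_inventory():
    slow_helper()


def timed(func):
    """Stand-in for a trace span: the operation name plus how long it took."""
    start = time.perf_counter()
    func()
    return {"name": func.__name__, "duration_s": time.perf_counter() - start}


operations = {"parse_request": parse_request, "query_inventory": query_inventory}

# Step 1 (tracing): find which operation is slow.
spans = [timed(func) for func in operations.values()]
slowest = max(spans, key=lambda span: span["duration_s"])

# Step 2 (profiling): look inside only that operation.
profiler = cProfile.Profile()
profiler.enable()
operations[slowest["name"]]()
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
# Keys of stats.stats are (filename, line number, function name) tuples.
profiled_functions = {func_name for (_file, _line, func_name) in stats.stats}

print("slowest operation:", slowest["name"])
stats.print_stats(5)
```

The timings say where the time went at the level of operations. The profile output then lists the functions called inside the slow operation, which is the level of detail needed to fix it.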
Spring Cloud Sleuth
TODO
https://www.baeldung.com/spring-cloud-sleuth-single-application
Zipkin
TODO
https://www.baeldung.com/tracing-services-with-zipkin