Endowus Tech

Implementing Distributed Tracing in a Polyglot Microservices Environment

April 14, 2025  ·  By Jonathan Hosea  ·  Filed under devops, tracing, observability

Distributed tracing is an extremely useful tool for improving the observability of distributed systems. By visualizing the entire journey of a request—from the moment a customer clicks a button on a Next.js front end or Flutter mobile app, through the Nest.js BFF and the Scala backend services built on Akka and Kafka—distributed tracing empowers developers to quickly diagnose performance issues, identify bottlenecks, and reduce mean time to resolution (MTTR). This article details our technical journey, explains the underlying concepts, walks through the challenges, and offers the Endowus point of view on implementing it.

Our Tech Stack Recap

We leverage Akka’s capabilities along with Kamon for metrics and observability. Kamon simplifies trace propagation between microservices. Yet, as you’ll see, some components require custom integration.

A Short TL;DR on Distributed Tracing

Taking a quote from the OpenTelemetry page regarding distributed tracing:

…trace requests and operations across multiple services and components in your application. This gives you end-to-end visibility into the performance and behavior of your application, allowing you to quickly identify and resolve issues. — CNCF blog on OpenTelemetry

Essentially, distributed tracing helps us visualize how an error occurs and how it can be reproduced across systems.

The diagram below shows a call propagating through multiple microservices, A to F, under a single trace, with each hop recorded as a different span.

Trace Propagation across services

The following diagram shows how the trace context is propagated even within a single service, across different layers of code and abstraction.

Tag and Span relationship with a Trace across services
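In code terms, a trace is simply the tree formed by spans that share a trace ID. A tiny illustration using Kamon, which we use on the backend (the operation names here are made up): a span started while another span is current is recorded as its child, under the same trace.

import kamon.Kamon

object TraceVsSpans extends App {
  Kamon.init()

  // "load-dashboard" is the root span of a new trace; "load-performance" starts while it
  // is current, so it is recorded as a child span under the same trace ID.
  Kamon.runWithSpan(Kamon.spanBuilder("load-dashboard").start()) {
    Kamon.runWithSpan(Kamon.spanBuilder("load-performance").start()) {
      // ... the work for this step would go here ...
    }
  }
}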

Defining What Counts as a ‘Trace’

Defining a trace from the front end can be a bit tricky. The simplest definition is that when a customer clicks a button and triggers an action in the backend, it should count as a trace. But what about loading a screen or tracking a customer’s journey through a multi-step wizard?

Traces give us the big picture of what happens when a request is made to an application. Whether your application is a monolith with a single database or a sophisticated mesh of services, traces are essential to understanding the full “path” a request takes in your application.
OpenTelemetry Docs

In our implementation, we opted to begin a trace from the frontend (even though it arguably might not always be necessary). For example, when a customer creates a goal in our system, one trace will capture that goal creation request. Other system calls—such as those loading the historical performance of the portfolio—will be recorded under separate trace IDs.

Standards to Use

There are multiple (and sometimes competing) standards for achieving distributed tracing—from Google’s early observability concepts to Zipkin, Jaeger, OpenTracing, and W3C.

We opted for the W3C standard because it is the newest of the lot and is a formal W3C Recommendation. (We like shiny things!)
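Concretely, the W3C Trace Context standard carries the trace between services in two HTTP headers, traceparent and tracestate. A traceparent value encodes the version, the 128-bit trace ID, the parent span ID, and the trace flags; the example below is taken from the specification:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Every service that supports the standard reads this header, records its own spans under the same trace ID, and forwards an updated header downstream.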

System Implementation

Frontend Implementation

In our mobile application, we initiate a trace for each UI flow and propagate the corresponding trace context to our backend-for-frontend layer and throughout the rest of the system.

Stream<ApiResponse> mapEventToState(HkDashboardEvent event) async* {
  // Start a new span for this UI flow before any backend work kicks off.
  diInstance<DistributedTracingRepository>().initSpan();

  await userPreferenceRepository.loadPreferences();

  final displayCcy = userPreferenceRepository.getPreferences?.displayCcy ??
      userRepository.defaultCcy;

  // The trace context started above travels with this BFF call.
  _startPartialLoading(
    futures,
    apiProvider.hkBffSvc.loadDashboardPerformance(displayCcy),
    transaction,
    PartialLoadingType.dashboardPerformance,
  );

  // ... remaining event handling elided ...
}
  

We don’t use the Dart OpenTelemetry package because we’re prioritizing customer-flow tracing over UI tracing.

Backend for Frontend Implementation

For our backend-for-frontend built in NestJS, we integrate the OpenTelemetry SDK using the W3CTraceContextPropagator and the AsyncLocalStorageContextManager (built on Node.js’s AsyncLocalStorage):

provider.register({  
    contextManager,  
    propagator: new W3CTraceContextPropagator(),  
});

We then register the instrumentations for HTTP, Express, and NestJS with the OpenTelemetry tracer provider:

registerInstrumentations({  
    tracerProvider: provider,  
    instrumentations: [  
        new HttpInstrumentation(),  
        new ExpressInstrumentation({  
            ignoreLayersType: [ExpressLayerType.REQUEST_HANDLER, ExpressLayerType.MIDDLEWARE],  
        }),  
        new NestInstrumentation(),  
    ],  
});

Backend Systems Implementation

Our backend microservices are primarily written in Scala using Akka, and we leverage Kamon, a third-party metrics and observability library. Kamon has done much of the heavy lifting for trace propagation between our microservices; we only needed to configure the trace context standard we wanted to use.
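Switching Kamon to the W3C standard is a configuration change. Below is a sketch of the relevant settings, assuming Kamon 2.x’s reference configuration (double-check the exact keys against the Kamon version you run):

kamon {
  trace {
    # W3C trace IDs are 128-bit, so use Kamon's 16-byte ("double") identifier scheme
    identifier-scheme = "double"
  }

  propagation.http.default.entries {
    # read and write the W3C traceparent/tracestate headers instead of Kamon's default codec
    incoming.span = "w3c"
    outgoing.span = "w3c"
  }
}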

1. Akka’s Private APIs

Using an opinionated framework means working with the tools provided and building logic around them. For our purposes, it has meant building the tracing code around Akka’s private APIs and, at times, extending those APIs to meet our needs.

2. CQRS Pattern Systems

Our use of Akka implies a CQRS-based backend. The asynchronous nature of CQRS raises the question: how do we pass the trace context between a write process and subsequent asynchronous operations?

Thanks to Kamon’s Akka instrumentation, which handles actor-based communication (including projections used extensively for asynchronous processing and Kafka publishing), much of this complexity is abstracted away.
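As a minimal illustration of what that instrumentation gives us (the actor, command, and operation names below are hypothetical), a span that is current when a message is sent becomes the parent of whatever spans the receiving actor creates:

import akka.actor.{Actor, ActorSystem, Props}
import kamon.Kamon

// Hypothetical command, purely for illustration.
final case class PlaceOrder(orderId: String)

class OrderActor extends Actor {
  def receive: Receive = {
    case PlaceOrder(_) =>
      // Any span started here is automatically a child of "place-order", because
      // kamon-akka carried the sender's context along with the message.
      Kamon.runWithSpan(Kamon.spanBuilder("persist-order").start()) {
        // ... handle the command ...
      }
  }
}

object TracingAcrossActors extends App {
  Kamon.init()

  val system = ActorSystem("orders")
  val orderActor = system.actorOf(Props[OrderActor](), "order-actor")

  // The span that is current when the message is sent becomes the parent of the
  // spans created inside the receiving actor.
  Kamon.runWithSpan(Kamon.spanBuilder("place-order").start()) {
    orderActor ! PlaceOrder("order-1")
  }
}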

There are four key areas to pay attention to in Akka’s trace propagation. The following diagram illustrates these areas:

How trace contexts are propagated within multiple services with CQRS and Projections
  1. Passing Trace Context Between Akka Actors
    This is handled by Kamon’s Akka instrumentation.
    More details: Kamon Akka Instrumentation

  2. Initiating a Trace Context for Akka Projections

    Unfortunately, Kamon's Akka integration doesn't include built-in support for Akka Projections. Projections in a CQRS architecture run independently of the event-persisting write side, so we chose to start each projection with its own root trace rather than embedding the original trace context in our event payloads; this keeps our read-side processing fully decoupled from the write-side persistence.

We start the parent trace within the projection handler with several tags, and finish it manually when the handler’s Future completes, as shown below:

...
try {
  // Kamon's `ContextStorage.runWithSpan` does not honour Future out of the box yet
  // See https://github.com/kamon-io/Kamon/issues/770
  //
  // Until that is supported, we need to manually finish the span when handler future completes
  // Using same approach as Kamon's `Tracing.span`
  val fut: Future[Done] = Kamon.runWithSpan(span, finishSpan = false) {
    handlerFn.andThen { f =>
      log.debug(s"[$projectionName] Processed event ${envelope.event.getClass.getSimpleName} for persistenceId ${envelope.persistenceId}")
      f
    }.applyOrElse[EventEnvelope[Event], Future[Done]](
      envelope,
      e => {
        log.debug(s"[$projectionName] Skipped event ${e.event.getClass.getSimpleName} for persistenceId ${e.persistenceId}")
        Future.successful(Done)
      }
    )
  }

  withMetrics(envelope, shardNumber) { errorCounterFunc =>
    fut.onComplete {
      case Failure(t) =>
        span.fail(t.getMessage, t).finish()
        errorCounterFunc()
      case Success(_) =>
        span.finish()
    }(CallingThreadExecutionContext)
    fut
  }
} catch {
  case NonFatal(t) =>
    // Kamon's `ContextStorage.runWithSpan` already marks span as failed
    span.finish()
    Future.failed(t)
}
...
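For reference, the `span` finished in the snippet above is created at the start of the projection handler. A minimal sketch of that step using Kamon’s SpanBuilder API (the operation name and tag keys are illustrative, not our exact ones):

// Sketch only: start a new span for this projection invocation and tag it with
// enough metadata to trace it back to the originating event. Tag keys are illustrative.
val span = Kamon
  .spanBuilder(s"projection.$projectionName")
  .tag("projection.name", projectionName)
  .tag("event.persistence-id", envelope.persistenceId)
  .tag("event.sequence-nr", envelope.sequenceNr)
  .start()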
  3. Passing Trace Context Between System Calls (HTTP)
    Kamon manages this as well; we simply enable the W3C trace context standard in Kamon’s configuration (see the configuration sketch earlier).
    More details: Kamon HTTP Instrumentation

  4. Passing Trace Context in Kafka Messages Between Systems
    This requires extra work. Although Kamon provides Kafka producer instrumentation out-of-the-box, it does not provide subscriber instrumentation to capture context from Kafka headers and continue the span. We implemented our own message interceptor to extract the trace context from Kafka headers:

def runWithKamonContext[In, Out](f: In => Out): Message[In] => Out = message => {
    val ctx = message
        .get(KafkaMetadataKeys.Headers)
        .flatMap(s => Try(KafkaInstrumentation.settings.propagator.read(s, Context.Empty)).toOption)
        .getOrElse(Context.Empty)
    Kamon.runWithContext(ctx) {
        f(message.payload)
    }
}

def propagateMessage[In, Out](f: Message[In] => Out): Message[In] => Message[Out] = message => {
    message.withPayload(f(message))
}
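To show how this composes with a handler (a sketch only: the event type and handler below are hypothetical, and Message is the same envelope type used above), we wrap the business handler so that it runs inside the extracted trace context before plugging it into the consumer flow:

import akka.Done

import scala.concurrent.Future

// Hypothetical event type and handler, purely for illustration.
final case class GoalCreated(goalId: String)

def handleGoalCreated(event: GoalCreated): Future[Done] =
  // When invoked through runWithKamonContext, any spans, logs, or downstream calls made
  // here are correlated with the producer's trace, because the context extracted from the
  // Kafka headers is current while this handler runs.
  Future.successful(Done)

// Wrap the handler with the interceptor above; the result consumes the raw message envelope.
val tracedHandler: Message[GoalCreated] => Future[Done] =
  runWithKamonContext(handleGoalCreated)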

How This Helps Us in Production Support

End-to-End Logs Correlation and Visibility

With distributed tracing, rather than relying on separate request IDs across different systems, we can quickly correlate logs end-to-end. This unified view helps us understand the complete journey of a call—from the frontend through all backend processes.

Improved Visibility on Performance

Having multiple system calls under the same trace allows us to deduce the time taken by each service call, identify unnecessary repetitions, and detect performance bottlenecks. Detailed timing information helps pinpoint latency hotspots and optimize performance.

Enhanced Issue Detection

Using Sentry for production alerts, we often face challenges in high-volume scenarios where related alerts from multiple systems appear as isolated incidents. With trace context visibility, we can distinguish between related alerts, quickly understand the root cause, and reduce debugging time.

Sentry alert example with trace_id included

Our Sentry alerts include the traceId, which makes the related logs easier to find in Kibana and the root cause easier to pinpoint. It also helps us connect multiple related Sentry alerts to one another.

Cost Optimization

Distributed tracing can identify redundant or inefficient service calls. For example, a developer might inadvertently call the same endpoint from the same service in a different layer of code instead of reusing existing data. Tracing makes it easier to spot these inefficiencies, leading to more effective code refactoring and reduced resource usage.

Looking Forward

Enabling distributed tracing has significantly improved our ability to handle production issues and analyze metrics efficiently. However, there’s still room to maximize its potential and refine our implementation.

Span Storage and Historical Data

Currently, we don’t require persistent span storage, but as our system evolves, retaining historical traces will become invaluable. It will provide a broader view of system behavior over time, aiding in issue detection, performance tuning, and long-term trend analysis.

AI-Enabled Metrics Enhancements

To further improve our monitoring and incident response, we can also leverage AI techniques to enhance observability and root cause analysis, with the goal of keeping MTTR low and resolving issues faster.

Noise Reduction in Alerts

By leveraging clustering techniques (e.g., DBSCAN) or NLP-based log grouping, we can automatically group similar alerts and prioritize critical issues, reducing noise and improving response efficiency.

Root Cause Analysis

Graph algorithms can help identify cascading failures across multiple systems, enabling faster and more precise root cause analysis.