Moving a large-scale metrics pipeline from StatsD to OpenTelemetry / Prometheus

A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus, achieving significant performance gains and cost efficiency.

Building a high-volume metrics pipeline with OpenTelemetry and vmagent

A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus.

By: Eugene Ma, Natasha Aleksandrova

When migrating to a new monitoring system, you’ll want to frontload the work to collect all your metrics. This exposes bottlenecks at full write scale and unblocks the migration of assets which require real data for validation, such as dashboards and alerts. Collecting all your metrics first means you can focus on major technical challenges — scale, correctness and performance — without worrying about how users will adopt your new tools.

But for our project, this approach wasn’t straightforward: most of our metrics were instrumented with StatsD libraries, OpenTelemetry was gaining traction, and our new storage system was based on Prometheus. We were left with a lot of open questions. Where do we fork the metrics? Should we adopt OpenTelemetry? Do our metrics work well with Prometheus? The task of collecting metrics required us to answer these questions and rethink our metrics infrastructure.

Instrumentation and collection

Our services originally used the StatsD protocol, with a Veneur sidecar collecting and forwarding metrics to our vendor.

For the migration away from this architecture, we defined a clear set of supported and preferred instrumentation protocols. OpenTelemetry Protocol (OTLP) is now recommended for internal services, while Prometheus is preferred for OSS workloads. StatsD (DogStatsD format) remains as a fallback for legacy paths or applications that are difficult to update / instrument.

This led us to adopt the vendor-neutral OpenTelemetry Collector (OTel Collector).

Dual-write migration approach

Roughly 40% of our services use a shared platform-maintained metrics library, which was the logical starting point for OTLP adoption and enabled broad rollout of OTLP emission across those services. By enabling this library to dual-emit metrics, StatsD for the existing pipeline and OTLP for the new OTel Collector, we made broad migration progress with low friction.

Benefits realized by moving to OTLP

Moving to OTLP offered significant benefits over remaining on StatsD:

JVM profiling data shows that CPU time spent on metrics processing dropped from 10% to under 1% of total CPU samples in production after migrating from StatsD to OTLP.
OTLP provides greater reliability than StatsD, which relies on UDP and is more vulnerable to packet loss under high throughput.
Emitting OTLP natively removed the need for an intermediate StatsD-to-OTLP translation step inside the OTel Collector.
Because our backend is Prometheus-based, remaining on StatsD would have restricted users to classic histograms. OTLP enabled us to adopt Prometheus-native capabilities such as exponential histograms with full fidelity.
OTLP is a CNCF-sponsored, vendor-neutral protocol and the de facto industry standard for telemetry. Moving to OTLP positions us to support future observability use cases without protocol barriers.

A challenge encountered with OTLP

While most services handled the dual-write OTLP + StatsD mode without issues, our highest-volume metric emitters experienced significant regressions in performance after enabling OTLP. In particular, these services began to suffer from memory pressure, increased garbage collection activity, and heap growth.

To mitigate this, we updated our common metrics library configuration to use delta temporality (AggregationTemporalitySelector.deltaPreferred()) for a select set of services. That change reduces the in-process memory burden, because delta temporality avoids retaining full state for all metric-label combinations between exports.

Streaming aggregation with vmagent

To keep costs under control, we needed a way to aggregate away instance-level labels. We chose vmagent from VictoriaMetrics because it supports streaming aggregation, sharding, and has a very efficient codebase.

Our architecture uses two layers: Routers (stateless sharding) and Aggregators (maintaining in-memory state for totals). This setup allowed us to scale to hundreds of aggregators, ingesting over 100 million samples per second while significantly reducing costs.

Source: Hacker News