OpenTelemetry: The Observability Standard
2025-06-22
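OpenTelemetry (OTel) is a CNCF observability framework that standardizes how the three main telemetry signals (traces, metrics, and logs) are collected. It grew out of the merger of OpenTracing and OpenCensus and is becoming the de facto standard for cloud-native observability.
The Three Signals
SDK Architecture
Traces
A trace is made up of multiple spans and represents the full lifecycle of a single request:
Manual instrumentation with the Go SDK: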
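import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("order-service")

// ProcessOrder wraps the whole operation in a span and the database lookup
// in a child span. db is assumed to be an application-level data-access
// object defined elsewhere.
func ProcessOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "ProcessOrder",
		trace.WithAttributes(
			attribute.String("order.id", orderID),
			attribute.String("order.type", "standard"),
		),
	)
	defer span.End()

	// Child span: keep its context separate so that later spans remain
	// children of ProcessOrder rather than of QueryDatabase.
	dbCtx, dbSpan := tracer.Start(ctx, "QueryDatabase")
	order, err := db.GetOrder(dbCtx, orderID)
	if err != nil {
		dbSpan.RecordError(err)
		dbSpan.SetStatus(codes.Error, err.Error())
		dbSpan.End()
		return err
	}
	dbSpan.End()

	// Attach an event to the parent span.
	span.AddEvent("order.validated", trace.WithAttributes(
		attribute.Float64("order.amount", order.Amount),
	))
	return nil
}
Metrics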
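OTel provides several metric instrument types. The example below covers four of them: Counter, Histogram, UpDownCounter, and (observable) Gauge: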
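import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("order-service")

// Package-level instruments so that request handlers can record measurements.
var (
	orderCounter     metric.Int64Counter
	latencyHistogram metric.Float64Histogram
	activeOrders     metric.Int64UpDownCounter
)

func initMetrics() error {
	var err error
	// Counter: a value that only ever increases.
	orderCounter, err = meter.Int64Counter("orders.total",
		metric.WithDescription("Total number of orders processed"),
		metric.WithUnit("{order}"),
	)
	if err != nil {
		return err
	}
	// Histogram: the distribution of recorded values.
	latencyHistogram, err = meter.Float64Histogram("orders.processing.duration",
		metric.WithDescription("Order processing duration"),
		metric.WithUnit("ms"),
		metric.WithExplicitBucketBoundaries(5, 10, 25, 50, 100, 250, 500, 1000),
	)
	if err != nil {
		return err
	}
	// UpDownCounter: a value that can rise and fall.
	activeOrders, err = meter.Int64UpDownCounter("orders.active",
		metric.WithDescription("Number of orders currently being processed"),
	)
	if err != nil {
		return err
	}
	// Gauge: an instantaneous value, reported through an observable callback.
	// getCPUUsage is assumed to be defined elsewhere in the application.
	_, err = meter.Float64ObservableGauge("system.cpu.usage",
		metric.WithDescription("CPU usage percentage"),
		metric.WithFloat64Callback(func(ctx context.Context, o metric.Float64Observer) error {
			o.Observe(getCPUUsage())
			return nil
		}),
	)
	return err
}
Logs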
OTel logs are structured and correlated with traces:
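import (
	"context"

	"go.opentelemetry.io/otel/log"
	"go.opentelemetry.io/otel/log/global"
)

func handleRequest(ctx context.Context) {
	// Obtain a logger from the globally registered LoggerProvider.
	logger := global.GetLoggerProvider().Logger("order-service")

	// log.Record is populated through its setter methods.
	var record log.Record
	record.SetSeverity(log.SeverityInfo)
	record.SetBody(log.StringValue("Order processed successfully"))
	record.AddAttributes(
		log.String("order.id", "ORD-12345"),
		log.Float64("processing.time.ms", 42.5),
	)

	// The trace ID and span ID are taken from ctx when the record is emitted.
	logger.Emit(ctx, record)
}
Auto-Instrumentation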
Auto-instrumentation collects telemetry without requiring changes to application code.
Java Agent
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.traces.sampler=parentbased_traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -jar app.jar
Kubernetes auto-injection
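Both the Java agent flags above and the Instrumentation resource in this section select the `parentbased_traceidratio` sampler. As a rough illustration of what that sampler decides, here is a minimal sketch in Go using only the standard library. It assumes the default parent-based behavior (follow the parent's sampled flag when a parent exists) and a ratio check modeled on the approach the Go SDK takes for trace-ID ratio sampling. The names `ParentContext`, `ShouldSample`, `ratioSampled`, and `mustTraceID` are hypothetical and are not part of any OTel API.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// ParentContext describes the incoming parent span, if there is one.
type ParentContext struct {
	Valid   bool // false when this request starts a new trace
	Sampled bool // the parent's sampled flag
}

// ratioSampled makes a deterministic decision from the trace ID, so every
// service that sees the same trace ID and ratio reaches the same answer.
func ratioSampled(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	upperBound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

// ShouldSample follows the parent's decision when a parent exists and
// falls back to the ratio check for root spans.
func ShouldSample(parent ParentContext, traceID [16]byte, ratio float64) bool {
	if parent.Valid {
		return parent.Sampled
	}
	return ratioSampled(traceID, ratio)
}

// mustTraceID converts a 32-character hex string into a trace ID.
func mustTraceID(s string) [16]byte {
	var id [16]byte
	b, err := hex.DecodeString(s)
	if err != nil || len(b) != 16 {
		panic("trace ID must be 32 hex characters")
	}
	copy(id[:], b)
	return id
}

func main() {
	id := mustTraceID("0af7651916cd43dd8448eb211c80319c")
	fmt.Println("root span, ratio 0.1:", ShouldSample(ParentContext{}, id, 0.1))
	fmt.Println("child of sampled parent:", ShouldSample(ParentContext{Valid: true, Sampled: true}, id, 0.1))
}
```

Because the decision for a root span depends only on the trace ID, downstream services that honor the parent's flag keep or drop the whole trace together.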
# Instrumentation resource used by the OpenTelemetry Operator for auto-injection
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.monitoring:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
---
# A pod annotation triggers auto-injection
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "auto-instrumentation"
    spec:
      containers:
        - name: app
          image: order-service:1.0
OTel Collector
The Collector is OTel's core data pipeline: it receives, processes, and exports telemetry.
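The receive, process, export flow can be pictured as a chain of functions. The following is a conceptual sketch only, not the Collector's actual implementation. The types `Span`, `Processor`, `Exporter`, and `Pipeline` and the helpers `dropByName` and `upsertAttr` are hypothetical stand-ins, loosely mirroring what filter and attributes processors do.

```go
package main

import "fmt"

// Span is a stripped-down stand-in for one piece of telemetry.
type Span struct {
	Name       string
	Attributes map[string]string
}

// Processor transforms a batch; returning fewer spans drops data.
type Processor func([]Span) []Span

// Exporter delivers a batch to a backend.
type Exporter func([]Span)

// dropByName removes spans with the given name, like a filter processor.
func dropByName(name string) Processor {
	return func(in []Span) []Span {
		out := make([]Span, 0, len(in))
		for _, s := range in {
			if s.Name != name {
				out = append(out, s)
			}
		}
		return out
	}
}

// upsertAttr sets key=value on every span, like an attributes processor.
func upsertAttr(key, value string) Processor {
	return func(in []Span) []Span {
		for i := range in {
			if in[i].Attributes == nil {
				in[i].Attributes = map[string]string{}
			}
			in[i].Attributes[key] = value
		}
		return in
	}
}

// Pipeline wires processors and exporters together in order.
type Pipeline struct {
	Processors []Processor
	Exporters  []Exporter
}

// Consume is what a receiver would call for each incoming batch.
func (p Pipeline) Consume(batch []Span) {
	for _, proc := range p.Processors {
		batch = proc(batch)
	}
	for _, exp := range p.Exporters {
		exp(batch)
	}
}

func main() {
	var exported []Span
	p := Pipeline{
		Processors: []Processor{
			dropByName("readiness-check"),
			upsertAttr("environment", "production"),
		},
		Exporters: []Exporter{
			func(b []Span) { exported = append(exported, b...) },
		},
	}
	p.Consume([]Span{{Name: "ProcessOrder"}, {Name: "readiness-check"}})
	fmt.Println(len(exported), exported[0].Name, exported[0].Attributes["environment"])
}
```

The order of the processors matters: anything dropped early never reaches the later stages or the exporters.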
Collector configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'k8s-pods'
          kubernetes_sd_configs:
            - role: pod
  jaeger:
    protocols:
      thrift_http:
        endpoint: 0.0.0.0:14268
processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 2048
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: db.password
        action: delete
  filter/traces:
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'name == "readiness-check"'
  tail_sampling:
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
    metric_expiration: 5m
  loki:
    endpoint: http://loki.monitoring:3100/loki/api/v1/push
service:
  # Processors run in the listed order: memory_limiter first, batch last
  # (after filtering and sampling have dropped unwanted data).
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, filter/traces, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
  extensions: [health_check, pprof, zpages]
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1888
  zpages:
    endpoint: 0.0.0.0:55679
Context Propagation
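Context propagation is the core of distributed tracing: it carries trace information from one service to the next.
W3C Trace Context (the default propagation format):
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
version(2 hex)-traceId(32 hex)-spanId(16 hex)-flags(2 hex)
Supported propagators:
- W3C TraceContext (recommended default)
- W3C Baggage
- B3 (Zipkin format)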
- Jaeger
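To make the `traceparent` layout shown earlier in this section concrete, the following sketch parses a header value with the Go standard library. It handles only the version `00` layout with four dash-separated fields, and the names `TraceParent` and `ParseTraceParent` are hypothetical. In a real service the OTel propagators do this work.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the four fields of a version-00 traceparent header.
type TraceParent struct {
	Version string
	TraceID string
	SpanID  string
	Sampled bool
}

// isLowerHex reports whether s is valid hex with no uppercase letters.
func isLowerHex(s string) bool {
	if s != strings.ToLower(s) {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// ParseTraceParent validates and splits a traceparent header value.
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	version, traceID, spanID, flags := parts[0], parts[1], parts[2], parts[3]
	if len(version) != 2 || len(traceID) != 32 || len(spanID) != 16 || len(flags) != 2 {
		return TraceParent{}, errors.New("traceparent: wrong field length")
	}
	for _, field := range parts {
		if !isLowerHex(field) {
			return TraceParent{}, errors.New("traceparent: fields must be lowercase hex")
		}
	}
	if version != "00" {
		return TraceParent{}, errors.New("traceparent: this sketch only handles version 00")
	}
	if traceID == strings.Repeat("0", 32) || spanID == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("traceparent: all-zero IDs are invalid")
	}
	flagByte, _ := hex.DecodeString(flags) // already validated above
	return TraceParent{
		Version: version,
		TraceID: traceID,
		SpanID:  spanID,
		Sampled: flagByte[0]&0x01 == 0x01, // lowest bit is the sampled flag
	}, nil
}

func main() {
	tp, err := ParseTraceParent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	fmt.Printf("trace=%s span=%s sampled=%t\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```

A receiving service uses the trace ID to join the existing trace and the span ID as the parent of the spans it creates.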
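The OTLP Protocol
OTLP (OpenTelemetry Protocol) is OTel's native transport protocol:
| Feature | OTLP/gRPC | OTLP/HTTP |
|---|---|---|
| Default port | 4317 | 4318 |
| Encoding | Protocol Buffers | JSON or Protocol Buffers |
| Connection | Long-lived HTTP/2 connections | HTTP/1.1 or HTTP/2 request/response |
| Best suited for | High-throughput service-to-service traffic | Browsers and restricted environments |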
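One practical difference between the two transports is how endpoints are addressed: OTLP/gRPC uses a bare host and port, while OTLP/HTTP sends each signal to its own path (`/v1/traces`, `/v1/metrics`, `/v1/logs`) under a base URL. The sketch below derives those per-signal URLs. The function name `SignalURL` is hypothetical, and the host `otel-collector` is reused from the earlier examples.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// SignalURL appends the default OTLP/HTTP path for a signal to a base URL.
func SignalURL(base, signal string) (string, error) {
	switch signal {
	case "traces", "metrics", "logs":
	default:
		return "", errors.New("unknown signal: " + signal)
	}
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", errors.New("base endpoint must include a scheme and host")
	}
	// Avoid a double slash when the base URL already ends with "/".
	u.Path = strings.TrimSuffix(u.Path, "/") + "/v1/" + signal
	return u.String(), nil
}

func main() {
	for _, signal := range []string{"traces", "metrics", "logs"} {
		endpoint, err := SignalURL("http://otel-collector:4318", signal)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(endpoint)
	}
}
```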
Kubernetes Deployment Architecture
# OTel Collector as a DaemonSet (agent mode)
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-agent
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch: {}
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp:
        endpoint: otel-gateway.monitoring:4317
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
---
# OTel Collector as a Deployment (gateway mode)
# Note: with several replicas, tail sampling only sees complete traces if
# all spans of a trace are routed to the same replica.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
spec:
  mode: deployment
  replicas: 3
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch: {}
      tail_sampling:
        policies:
          - name: errors
            type: status_code
            status_code: {status_codes: [ERROR]}
    exporters:
      otlp/tempo:
        endpoint: tempo:4317
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [tail_sampling, batch]
          exporters: [otlp/tempo]
Summary
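OpenTelemetry is unifying observability standards. Key takeaways:
- Prefer auto-instrumentation to lower the cost of adoption
- A two-tier Collector architecture (agent plus gateway) balances performance and flexibility
- Run tail sampling at the gateway tier so that each sampling decision covers a complete trace; with several gateway replicas, all spans of a trace must reach the same replica
- Use W3C TraceContext as the default context propagation format
- Use the Kubernetes Operator to simplify deployment and auto-injection
- OTLP is the preferred protocol and helps avoid vendor lock-in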
Last updated: 2026/3/14 13:09