构建全栈可观测性体系
本文介绍如何使用现代开源工具构建完整的可观测性平台,实现指标、日志和链路追踪的统一管理。
🎯 可观测性三大支柱
- 指标:数值型数据,用于监控和告警
- 日志:离散事件记录,用于故障诊断
- 链路追踪:请求路径追踪,用于性能分析
🏗️ 架构设计
整体架构
应用 → Prometheus Exporter → Prometheus → Alertmanager → Grafana
应用 → Log Agent → Loki → Grafana
应用 → Tracing Agent → Tempo → Grafana
📊 指标监控
Prometheus 配置
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- "alerts/*.yml"
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
告警规则
groups:
- name: application
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "高错误率告警"
description: " 错误率超过 5%"
📝 日志管理
Loki 配置
server:
http_listen_port: 3100
ingester:
lifecycler:
address: 127.0.0.1
ring:
store: consul
max_transfer_retries: 0
schema_config:
configs:
- from: 2020-10-24
store: boltdb
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
日志采集
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail-config
data:
promtail.yaml: |
server:
http_listen_port: 9080
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
🔗 链路追踪
Tempo 配置
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
http:
storage:
trace:
backend: local
local:
path: /tmp/tempo/traces
应用集成
import (
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/jaeger"
"go.opentelemetry.io/otel/sdk/trace"
)
func initTracer() {
exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
))
if err != nil {
log.Fatal(err)
}
tp := trace.NewTracerProvider(
trace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
}
📈 Grafana 仪表板
数据源配置
{
"name": "Prometheus",
"type": "prometheus",
"url": "http://prometheus:9090",
"access": "proxy",
"isDefault": true
}
仪表板示例
{
"dashboard": {
"title": "应用监控",
"panels": [
{
"title": "QPS",
"targets": [
{
"expr": "rate(http_requests_total[5m])"
}
]
},
{
"title": "错误率",
"targets": [
{
"expr": "rate(http_requests_total{status=~'5..'}[5m]) / rate(http_requests_total[5m])"
}
]
}
]
}
}
🎯 SLO/SLI 指标体系
SLI 定义
apiVersion: v1
kind: ConfigMap
metadata:
name: slo-config
data:
availability-sli.yaml: |
name: service_availability
description: 服务可用性
query: |
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
SLO 告警
- alert: SLOViolation
expr: service_availability < 0.99
for: 5m
labels:
severity: critical
annotations:
summary: "SLO 违规告警"
description: "服务可用性低于 99%"
🔍 故障排查
统一查询
# 查询指标
curl 'http://prometheus:9090/api/v1/query?query=up'
# 查询日志
curl 'http://loki:3100/loki/api/v1/query_range?query={app="myapp"}'
# 查询链路
curl 'http://tempo:3200/api/search?traceID=xxx'
关联分析
通过 Grafana 实现三大支柱的关联:
- 从告警跳转到相关日志
- 从日志跳转到相关链路
- 从链路跳转到相关指标
📚 最佳实践
- 统一标签:在所有系统中使用一致的标签体系
- 采样策略:对高流量链路进行采样
- 数据保留:根据业务需求设置合理的保留策略
- 性能优化:避免过度采集影响应用性能
- 告警降噪:设置合理的告警阈值和聚合规则
作者: Pizicaiman
发布时间: 2024-01-05
分类: 监控, 可观测性
标签: Prometheus, Grafana, Loki, Tempo, 监控, 可观测性