PromQL Query Language in Depth
2025-06-23
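PromQL (Prometheus Query Language) is the query language of Prometheus. It is used to query time series data, write alerting rules, and build dashboards. Mastering PromQL is key to using a Prometheus monitoring system effectively.
Data model
The Prometheus data model is built on time series. Each time series is uniquely identified by its metric name and its set of labels:
<metric_name>{<label_name>=<label_value>, ...}
Metric types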
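| Type | Description | Example |
|---|---|---|
| Counter | Cumulative value that only increases (it restarts from zero when the process restarts) | http_requests_total |
| Gauge | Value that can go up and down arbitrarily | node_memory_MemAvailable_bytes |
| Histogram | Distribution of sampled observations, counted in buckets | http_request_duration_seconds_bucket |
| Summary | Similar to Histogram, but quantiles are computed on the client side | go_gc_duration_seconds |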
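To make the Histogram row more concrete, here is a minimal Python sketch of how observations end up in cumulative buckets, each identified by an upper bound `le`, alongside a running sum and count. The function name `observe_all` and the bucket bounds are illustrative assumptions, not part of any Prometheus client library.

```python
from bisect import bisect_left

def observe_all(values, upper_bounds):
    """Return (cumulative bucket counts keyed by le, sum, count).

    Hypothetical helper for illustration only; real client libraries
    maintain these counters incrementally.
    """
    bounds = sorted(upper_bounds) + [float("inf")]  # the +Inf bucket always exists
    per_bucket = [0] * len(bounds)
    for v in values:
        # bisect_left finds the first bound >= v, i.e. the smallest le with v <= le
        per_bucket[bisect_left(bounds, v)] += 1
    cumulative, running = [], 0
    for n in per_bucket:
        running += n  # each bucket also counts everything below it
        cumulative.append(running)
    return dict(zip(bounds, cumulative)), sum(values), len(values)

buckets, total, count = observe_all([0.05, 0.2, 0.2, 0.7, 3.0], [0.1, 0.5, 1.0])
print(buckets, total, count)
```

Because the buckets are cumulative, the `+Inf` bucket always equals the total number of observations.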
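Selectors
Instant vector selectors
Return the most recent sample of each matching time series at the evaluation time:
# Exact match
http_requests_total{method="GET", status="200"}
# Regex match
http_requests_total{method=~"GET|POST"}
# Negative match
http_requests_total{status!="200"}
# Negative regex match
http_requests_total{path!~"/health.*"}
# Combined conditions
http_requests_total{method="GET", status=~"2..", instance=~"web-.*"}
Range vector selectors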
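Return all samples within a time range:
# All samples from the past 5 minutes
http_requests_total{method="GET"}[5m]
# Time units: s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)
node_cpu_seconds_total[1h]
# Offset (look at historical data)
http_requests_total{method="GET"}[5m] offset 1h
# @ modifier (evaluate at a specific Unix timestamp)
http_requests_total @ 1609459200
Common functions and operators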
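rate and irate
rate() computes the per-second average rate of increase of a Counter over the range (recommended for alerting and for slow-moving counters):
# Average QPS over the past 5 minutes
rate(http_requests_total[5m])
# Grouped by method and status
sum by (method, status) (rate(http_requests_total{job="api"}[5m]))
irate() computes the instantaneous rate of increase from the last two samples in the range (suited to graphs of fast-moving counters):
irate(http_requests_total[5m])
increase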
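Computes the increase of a Counter over the time range:
# Total number of requests in the past hour
increase(http_requests_total[1h])
# Equivalent to rate() multiplied by the window length in seconds
rate(http_requests_total[1h]) * 3600
Aggregation operators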
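Aggregation operators collapse many series into fewer by grouping on a chosen set of labels. As a rough illustration of what `sum(...) by (method)` does, the following Python sketch groups samples by the listed labels and adds their values. The name `sum_by` and the sample data are assumptions made for this example only.

```python
from collections import defaultdict

def sum_by(samples, by):
    """Group (labels, value) samples on the `by` labels and sum each group.

    Hypothetical helper: a missing label is treated like an empty value.
    """
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value  # all other labels are dropped from the result
    return dict(groups)

samples = [
    ({"method": "GET", "instance": "a"}, 1.5),
    ({"method": "GET", "instance": "b"}, 2.5),
    ({"method": "POST", "instance": "a"}, 0.5),
]
print(sum_by(samples, ["method"]))
```

The other aggregations (avg, max, min, count) follow the same grouping step and differ only in how each group is reduced.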
# sum
sum(rate(http_requests_total[5m])) by (method)
# avg - average
avg(node_cpu_seconds_total{mode="idle"}) by (instance)
# count - number of series
count(up == 1) by (job)
# max / min
max(container_memory_usage_bytes) by (pod)
# topk / bottomk
topk(10, rate(http_requests_total[5m]))
# quantile - quantile across series (here, of each series' average latency)
quantile(0.95, rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
# stddev - standard deviation
stddev(rate(http_requests_total[5m])) by (instance)
# count_values - count series grouped by sample value
count_values("version", build_info)
histogram_quantile
Computes quantiles from Histogram metrics:
# P99 latency
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
)
# P95 latency grouped by service
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# P50 (median)
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
Note: histogram_quantile requires the le label to be kept when aggregating, because le (less than or equal) defines the upper bound of each bucket.
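The note about keeping le can be illustrated with a simplified Python sketch: bucket counts from several series are first summed per le, and the quantile is then estimated by linear interpolation inside the bucket that contains the target rank. This is an approximation of the idea under stated assumptions (cumulative buckets sorted by le, a final +Inf bucket, 0 <= q <= 1), not the Prometheus implementation. The names `merge_buckets` and `estimate_quantile` are hypothetical.

```python
import math
from collections import defaultdict

def merge_buckets(series):
    """Sum cumulative bucket counts across series, keeping one entry per le."""
    merged = defaultdict(float)
    for buckets in series:
        for le, count in buckets:
            merged[le] += count
    return sorted(merged.items())

def estimate_quantile(q, buckets):
    """Estimate the q-quantile from [(le, cumulative_count), ...] sorted by le."""
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]  # the +Inf bucket holds the total count
    index = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[index]
    if math.isinf(upper):
        return buckets[-2][0]  # cannot interpolate into +Inf
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = cum - below
    if in_bucket == 0:
        return lower
    # assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - below) / in_bucket

inf = float("inf")
one_instance = [(0.1, 5), (0.5, 30), (1.0, 45), (inf, 50)]
merged = merge_buckets([one_instance, one_instance])
print(estimate_quantile(0.5, merged))
```

Without le there would be no bucket boundaries left to interpolate between, which is why aggregations feeding histogram_quantile keep it in the by clause.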
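predict_linear
Predicts a future value using linear regression:
# Will the disk be out of space 4 hours from now?
predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
# Predicted value 24 hours from now
predict_linear(node_memory_MemAvailable_bytes[12h], 24*3600)
Other useful functions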
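Both predict_linear and deriv are based on simple linear regression over the samples in the range. The following Python sketch shows the underlying least-squares fit: the slope corresponds to what deriv reports, and extending the line forward gives a predict_linear style forecast. It assumes at least two samples with distinct timestamps. The names `linear_fit` and `predict` are illustrative, not Prometheus internals.

```python
def linear_fit(samples):
    """Least-squares line through (timestamp_seconds, value) samples.

    Returns (slope_per_second, intercept). Hypothetical helper.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    return slope, mean_v - slope * mean_t

def predict(samples, now, seconds_ahead):
    """Value of the fitted line at now + seconds_ahead."""
    slope, intercept = linear_fit(samples)
    return slope * (now + seconds_ahead) + intercept

# free bytes shrinking by 2 per second
samples = [(0, 1000.0), (60, 880.0), (120, 760.0), (180, 640.0)]
print(linear_fit(samples)[0])           # per-second change, as deriv would report
print(predict(samples, 180, 400) < 0)   # shape of the "disk full" check
```

A longer range smooths out noise but reacts more slowly to a change in trend, which is the usual trade-off when choosing the window for these functions.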
# abs - absolute value
abs(delta(temperature_celsius[1h]))
# ceil / floor / round
ceil(rate(http_requests_total[5m]))
# clamp - limit values to a range
clamp(cpu_usage, 0, 100)
# changes - number of times the value changed
changes(process_start_time_seconds[1h])
# resets - number of Counter resets
resets(http_requests_total[1h])
# delta - difference of a Gauge over the range
delta(temperature_celsius[1h])
# deriv - derivative of a Gauge (per-second rate of change)
deriv(process_resident_memory_bytes[1h])
# label_replace - rewrite a label
label_replace(up, "short_instance", "$1", "instance", "(.*):.*")
# label_join - join label values
label_join(up, "full_name", "-", "job", "instance")
# absent - returns 1 when no matching series exists
absent(up{job="api"})
# vector - convert a scalar to a vector
vector(1)
# time - current Unix timestamp (here: process uptime in seconds)
time() - process_start_time_seconds
Binary operators
Arithmetic operators
# Memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk usage (%)
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
Comparison operators
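A comparison between an instant vector and a scalar acts as a filter by default: series that fail the comparison are dropped and the rest keep their values. With the bool modifier every series is kept and its value becomes 1 or 0. A minimal Python sketch of the two behaviours follows; the function name `compare_gt` and the sample values are assumptions for illustration.

```python
def compare_gt(vector, threshold, bool_modifier=False):
    """Apply `vector > threshold` to a dict of {label_tuple: value}.

    Hypothetical helper showing filter semantics versus the bool modifier.
    """
    if bool_modifier:
        # keep every series, replace the value with 1 (true) or 0 (false)
        return {k: (1.0 if v > threshold else 0.0) for k, v in vector.items()}
    # default: keep only the series that satisfy the comparison
    return {k: v for k, v in vector.items() if v > threshold}

vector = {("web-1",): 1200.0, ("web-2",): 800.0}
print(compare_gt(vector, 1000))
print(compare_gt(vector, 1000, bool_modifier=True))
```

The filtering form is what alerting expressions rely on: an alert fires for exactly the series that survive the comparison.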
# Filter: memory usage above 80%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.8
# bool modifier: return 0 or 1 instead of filtering
http_requests_total > bool 1000
Vector matching
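When both sides of a binary operator are instant vectors, samples are paired up by their labels. With on(...) only the listed labels are compared, and group_left allows several series on the left to share one partner on the right. The Python sketch below imitates a many-to-one division under those rules. The name `divide_group_left` and the sample numbers are illustrative assumptions, and it assumes non-zero denominators.

```python
def divide_group_left(many, one, on):
    """Divide each 'many' sample by the 'one' sample sharing its `on` labels.

    Both inputs are lists of (labels_dict, value). Hypothetical helper.
    """
    index = {}
    for labels, value in one:
        key = tuple(labels.get(name, "") for name in on)
        if key in index:
            # the "one" side must be unique per key, or the match is ambiguous
            raise ValueError(f"duplicate series on the 'one' side for {key}")
        index[key] = value
    result = []
    for labels, value in many:
        key = tuple(labels.get(name, "") for name in on)
        if key in index:  # series without a partner are dropped
            result.append((labels, value / index[key]))
    return result

errors = [
    ({"method": "get", "code": "500"}, 24.0),
    ({"method": "get", "code": "404"}, 30.0),
    ({"method": "put", "code": "501"}, 3.0),
]
requests = [({"method": "get"}, 600.0), ({"method": "post"}, 120.0)]
for labels, ratio in divide_group_left(errors, requests, on=["method"]):
    print(labels, ratio)
```

Results keep the labels of the "many" side, and the put series disappears because nothing on the right matches it.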
# One-to-one matching (the left side is limited to one code so each series has exactly one partner)
method_code:http_errors:rate5m{code="500"} / ignoring(code) method:http_requests:rate5m
# Many-to-one matching
method_code:http_errors:rate5m / on(method) group_left method:http_requests:rate5m
Recording Rules
Precompute complex queries to improve query performance:
# prometheus-rules.yaml
groups:
  - name: http_rules
    interval: 30s
    rules:
      # Naming convention: level:metric:operations
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
      - record: instance:node_cpu:ratio
        expr: |
          1 - avg by(instance) (
            irate(node_cpu_seconds_total{mode="idle"}[5m])
          )
      - record: instance:node_memory:usage_ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes
          )
Alert Expressions
Alerting rules are based on PromQL expressions:
groups:
  - name: critical_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          # $value is a ratio (0 to 1), so format it as a percentage
          description: "Error rate is {{ $value | humanizePercentage }} (>5%) for job {{ $labels.job }}"
      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 1s on {{ $labels.job }}"
      # Disk space running low
      - alert: DiskSpaceRunningLow
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk space on {{ $labels.instance }} predicted to run out in 24h"
      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
      # Pod restarting frequently
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
Query optimization tips
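- Use Recording Rules to precompute expensive queries
- Narrow label selectors to avoid scanning every series
- Make the rate() window at least 4 times the scrape interval (for example, [1m] for a 15s interval)
- Avoid high-cardinality labels (such as user_id or request_id)
- Prefer rate over irate for alerting (more stable)
- Compute histogram_quantile after aggregation to reduce the amount of computation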
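The two tips about rate() can be made concrete with a simplified Python sketch. It computes an average rate over all samples in a window and an irate style value from only the last two, treating any drop in the counter as a reset. Real rate() also extrapolates to the window boundaries, which this sketch deliberately leaves out. All function names here are hypothetical.

```python
def counter_increase(samples):
    """Total increase over (timestamp_seconds, value) samples, oldest first.

    A drop in value is treated as a counter reset (restart from zero).
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total

def simple_rate(samples):
    """Average per-second increase across the whole window (no extrapolation)."""
    if len(samples) < 2:
        return None  # one sample is not enough to compute a rate
    return counter_increase(samples) / (samples[-1][0] - samples[0][0])

def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    return simple_rate(samples[-2:])

def samples_in_window(window_seconds, scrape_interval_seconds):
    """Rough number of scrapes that fit in the range window."""
    return window_seconds // scrape_interval_seconds

samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 400.0)]
print(simple_rate(samples), simple_irate(samples))
print(samples_in_window(60, 15))
```

With a window of 4 scrape intervals a missed scrape or two still leaves the two samples a rate needs, and averaging over the whole window is what makes rate steadier than irate for alert thresholds.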
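Summary
PromQL is the core capability of the Prometheus ecosystem. Fluency with rate(), histogram_quantile(), aggregation operators, and vector matching is key to using Prometheus efficiently. With Recording Rules for precomputation and well-designed alerting rules, you can build an efficient and reliable monitoring system.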