Prometheus Study Notes: Querying (Basics)

Author: 卓丁
Published: January 9, 2021

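Translator's note: this article is a translation of the basics page of the querying section in the official documentation of Prometheus, the cloud-native monitoring component.

Original: https://prometheus.io/docs/prometheus/latest/querying/basics/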


Prometheus provides a functional query language called PromQL (Prometheus Query Language) that lets the user select and aggregate time series data in real time. The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus's expression browser, or consumed by external systems via the HTTP API.

Examples

This document is meant as a reference. For learning, it might be easier to start with a couple of examples.

Expression language data types

In Prometheus's expression language, an expression or sub-expression can evaluate to one of four types:

  • Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp

  • Range vector - a set of time series containing a range of data points over time for each time series

  • Scalar - a simple numeric floating point value

  • String - a simple string value; currently unused

Depending on the use-case (e.g. when graphing vs. displaying the output of an expression), only some of these types are legal as the result from a user-specified expression. For example, an expression that returns an instant vector is the only type that can be directly graphed.


Literals

String literals

Strings may be specified as literals in single quotes, double quotes or backticks.

PromQL follows the same escaping rules as Go. In single or double quotes a backslash begins an escape sequence, which may be followed by a, b, f, n, r, t, v or \. Specific characters can be provided using octal (\nnn) or hexadecimal (\xnn, \unnnn and \Unnnnnnnn).

No escaping is processed inside backticks. Unlike Go, Prometheus does not discard newlines inside backticks.

Translator's note: as I understand it, this means a newline inside a backtick-quoted string literal is kept as part of the string.


Example:

"this is a string"
'these are unescaped: \n \\ \t'
`these are not unescaped: \n ' " \t`

Float literals

Scalar float values can be written as literal integer or floating-point numbers in the format (whitespace only included for better readability):

[-+]?(
      [0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
    | 0[xX][0-9a-fA-F]+
    | [nN][aA][nN]
    | [iI][nN][fF]
)

Examples:

23
-2.43
3.4e-9
0x8f
-Inf
NaN
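To make the float literal grammar concrete, here is a minimal Python sketch that checks candidate strings against it. The function name is_float_literal is made up for illustration, and Python's re module stands in for whatever parser Prometheus actually uses, so treat it as a reading aid rather than the real implementation.

```python
import re

# The grammar quoted above, with the readability whitespace removed.
FLOAT_LITERAL = re.compile(
    r"[-+]?("
    r"[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?"
    r"|0[xX][0-9a-fA-F]+"
    r"|[nN][aA][nN]"
    r"|[iI][nN][fF]"
    r")"
)


def is_float_literal(text: str) -> bool:
    """Return True if the whole string matches the literal grammar."""
    return FLOAT_LITERAL.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["23", "-2.43", "3.4e-9", "0x8f", "-Inf", "NaN", "1.", "abc"]:
        print(candidate, is_float_literal(candidate))
```

The six examples from the text are accepted, while a string such as "1." is rejected because the grammar requires at least one digit after an optional decimal point.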

Time series Selectors

Instant vector selectors

Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp (instant): in the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time series that have this metric name.

This example selects all time series that have the http_requests_total metric name:

http_requests_total

It is possible to filter these time series further by appending a comma separated list of label matchers in curly braces ({}).

This example selects only those time series with the http_requests_total metric name that also have the job label set to prometheus and their group label set to canary:

http_requests_total{job="prometheus",group="canary"}

It is also possible to negatively match a label value, or to match label values against regular expressions. The following label matching operators exist:

  • =: Select labels that are exactly equal to the provided string.

  • !=: Select labels that are not equal to the provided string.

  • =~: Select labels that regex-match the provided string.

  • !~: Select labels that do not regex-match the provided string.

For example, this selects all http_requests_total time series for staging, testing, and development environments and HTTP methods other than GET.

http_requests_total{environment=~"staging|testing|development",method!="GET"}

Label matchers that match empty label values also select all time series that do not have the specific label set at all. Regex-matches are fully anchored, so the pattern must match the entire label value. It is possible to have multiple matchers for the same label name.
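The matcher rules above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: series are plain dicts of labels, the helper names matches and select are hypothetical, and Python's re module stands in for RE2, which differs in some syntax details.

```python
import re


def matches(labels: dict, name: str, op: str, value: str) -> bool:
    """Apply one label matcher to the label set of a series.

    A label that is absent is treated as the empty string, which is why
    matchers that accept "" also select series lacking the label.
    """
    actual = labels.get(name, "")
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op in ("=~", "!~"):
        # fullmatch models full anchoring: the pattern must cover the whole value.
        regex_hit = re.fullmatch(value, actual) is not None
        return regex_hit if op == "=~" else not regex_hit
    raise ValueError(f"unknown operator: {op}")


def select(series: list, matchers: list) -> list:
    """Keep the series for which every matcher holds."""
    return [s for s in series if all(matches(s, *m) for m in matchers)]


if __name__ == "__main__":
    series = [
        {"__name__": "http_requests_total", "environment": "staging", "method": "POST"},
        {"__name__": "http_requests_total", "environment": "staging", "method": "GET"},
        {"__name__": "http_requests_total", "environment": "production", "method": "POST"},
    ]
    wanted = [
        ("environment", "=~", "staging|testing|development"),
        ("method", "!=", "GET"),
    ]
    print(select(series, wanted))
```

Because the regex is anchored at both ends, the pattern staging does not match a value such as staging-eu; a prefix match would have to be written as staging.*.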


Vector selectors must either specify a name or at least one label matcher that does not match the empty string. The following expression is illegal:

{job=~".*"} # Bad!

Translator's note: why is it illegal? The regular expression .* matches zero or more characters, so it also matches the empty string, and a selector whose only matcher accepts the empty string is not allowed by the syntax.

In contrast, these expressions are valid as they both have a selector that does not match empty label values.

{job=~".+"}              # Good!
{job=~".*",method="get"} # Good!
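The validity rule can be stated as a small check: a selector is acceptable if at least one of its matchers rejects the empty string. The sketch below assumes matchers are given as (name, operator, value) tuples and uses hypothetical helper names; Python's re module again stands in for RE2.

```python
import re


def accepts_empty(op: str, value: str) -> bool:
    """Return True if a matcher with this operator and value accepts ""."""
    if op == "=":
        return value == ""
    if op == "!=":
        return value != ""
    if op in ("=~", "!~"):
        regex_hit = re.fullmatch(value, "") is not None
        return regex_hit if op == "=~" else not regex_hit
    raise ValueError(f"unknown operator: {op}")


def selector_is_valid(matchers: list) -> bool:
    """A selector needs at least one matcher that rejects the empty string."""
    return any(not accepts_empty(op, value) for _name, op, value in matchers)


if __name__ == "__main__":
    print(selector_is_valid([("job", "=~", ".*")]))
    print(selector_is_valid([("job", "=~", ".+")]))
    print(selector_is_valid([("job", "=~", ".*"), ("method", "=", "get")]))
```

A bare metric name passes this check as well, since it behaves like an equality matcher on the name with a non-empty value.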

Label matchers can also be applied to metric names by matching against the internal __name__ label. For example, the expression http_requests_total is equivalent to {__name__="http_requests_total"}. Matchers other than = (!=, =~, !~) may also be used. The following expression selects all metrics whose name starts with job: (including the colon):

{__name__=~"job:.*"}

The metric name must not be one of the keywords bool, on, ignoring, group_left and group_right. The following expression is illegal:

on{} # Bad!

A workaround for this restriction is to use the __name__ label:

{__name__="on"} # Good!

All regular expressions in Prometheus use RE2 syntax.

Range Vector Selectors

Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a time duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element.

In this example, we select all the values we have recorded within the last 5 minutes for all time series that have the metric name http_requests_total and a job label set to prometheus:

http_requests_total{job="prometheus"}[5m]

Time Durations

Time durations are specified as a number, followed immediately by one of the following units:

  • ms - milliseconds

  • s - seconds

  • m - minutes

  • h - hours

  • d - days - assuming a day has always 24h

  • w - weeks - assuming a week has always 7d

  • y - years - assuming a year has always 365d

Time durations can be combined, by concatenation. Units must be ordered from the longest to the shortest. A given unit must only appear once in a time duration.

Here are some examples of valid time durations:

5h
1h30m
5m
10s
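The duration rules (fixed unit lengths, longest unit first, each unit at most once) are easy to get wrong by hand, so here is a small Python sketch of a parser that enforces them. The function name parse_duration_ms is hypothetical and this is not how Prometheus itself parses durations; it only mirrors the rules as stated above.

```python
import re

# Milliseconds per unit, listed from longest to shortest.
UNIT_MS = {
    "y": 365 * 24 * 3600 * 1000,
    "w": 7 * 24 * 3600 * 1000,
    "d": 24 * 3600 * 1000,
    "h": 3600 * 1000,
    "m": 60 * 1000,
    "s": 1000,
    "ms": 1,
}
ORDER = list(UNIT_MS)  # dicts preserve insertion order
TOKEN = re.compile(r"([0-9]+)(ms|[ywdhms])")  # try "ms" before "m" and "s"


def parse_duration_ms(text: str) -> int:
    """Parse a duration such as '1h30m' into milliseconds."""
    if not text:
        raise ValueError("empty duration")
    pos, last_rank, total = 0, -1, 0
    while pos < len(text):
        token = TOKEN.match(text, pos)
        if token is None:
            raise ValueError(f"invalid duration: {text!r}")
        rank = ORDER.index(token.group(2))
        if rank <= last_rank:
            raise ValueError(f"units repeated or out of order: {text!r}")
        total += int(token.group(1)) * UNIT_MS[token.group(2)]
        last_rank, pos = rank, token.end()
    return total


if __name__ == "__main__":
    for example in ["5h", "1h30m", "5m", "10s"]:
        print(example, parse_duration_ms(example))
```

Inputs such as 30m1h (wrong order) or 1h1h (repeated unit) raise ValueError in this sketch, matching the two ordering rules in the text.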

Offset modifier

The offset modifier allows changing the time offset for individual instant and range vectors in a query.

For example, the following expression returns the value of http_requests_total 5 minutes in the past relative to the current query evaluation time:

http_requests_total offset 5m

Translator's note: for example, if the query is evaluated at 08:30:25, this expression returns, for each series, its value as of 08:25:25. It does not return every value recorded during the preceding 5 minutes; that would require a range vector selector.

Note that the offset modifier always needs to follow the selector immediately, i.e. the following would be correct:

sum(http_requests_total{method="GET"} offset 5m) # GOOD.

While the following would be incorrect:

sum(http_requests_total{method="GET"}) offset 5m # INVALID.

Translator's note: offset should come directly after the metric's label list, that is, after the closing } rather than after the closing ).

The same works for range vectors. This returns the 5-minute rate that http_requests_total had a week ago:

rate(http_requests_total[5m] offset 1w)
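How a range and an offset combine can be pictured as a time window that is first shifted back by the offset and then extended back by the range. The Python sketch below models this for one series; select_range is a hypothetical helper, timestamps are plain seconds, and treating the window start as exclusive and the end as inclusive is an assumption of this sketch, not something the text specifies.

```python
def select_range(samples, eval_time, range_s, offset_s=0):
    """Return the (timestamp, value) pairs that fall in the shifted window.

    The window ends at eval_time - offset_s and reaches range_s seconds
    further back. Assumption: start is exclusive, end is inclusive.
    """
    end = eval_time - offset_s
    start = end - range_s
    return [(ts, value) for ts, value in samples if start < ts <= end]


if __name__ == "__main__":
    # One sample per minute for 15 minutes; the value is a simple counter.
    samples = [(t, t // 60) for t in range(0, 901, 60)]
    print(select_range(samples, eval_time=900, range_s=300))
    print(select_range(samples, eval_time=900, range_s=300, offset_s=300))
```

With an offset of 300 seconds the second call returns the samples from the five minutes before that, which is the same idea as asking for a 5-minute window as it looked a week ago.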

Subquery

Subquery allows you to run an instant query for a given range and resolution. The result of a subquery is a range vector.

Syntax: <instant_query> '[' <range> ':' [<resolution>] ']' [ offset <duration> ]

  • <resolution> is optional. Default is the global evaluation interval (evaluation_interval in the global configuration).

Operators

Prometheus supports many binary and aggregation operators. These are described in detail in the expression language operators page.

Functions

Prometheus supports several functions to operate on data. These are described in detail in the expression language functions page.

Comments

PromQL supports line comments that start with #. Example:

    # This is a comment

Gotchas

Translator's note: this section covers gotchas, the pitfalls people commonly run into.

Staleness

When queries are run, timestamps at which to sample data are selected independently of the actual present time series data. This is mainly to support cases like aggregation (sum, avg, and so on), where multiple aggregated time series do not exactly align in time. Because of their independence, Prometheus needs to assign a value at those timestamps for each relevant time series. It does so by simply taking the newest sample before this timestamp.

Translator's note: this explains why the sampling timestamps are chosen by the query rather than taken from the data. Put simply, so that the series being aggregated line up on a common timestamp, each series contributes its most recent sample before that timestamp. Strictly speaking, the data returned by a query is therefore not the data at the exact query instant; this small error is accepted to avoid a much larger cost.


If a target scrape or rule evaluation no longer returns a sample for a time series that was previously present, that time series will be marked as stale. If a target is removed, its previously returned time series will be marked as stale soon afterwards.


If a query is evaluated at a sampling timestamp after a time series is marked stale, then no value is returned for that time series. If new samples are subsequently ingested for that time series, they will be returned as normal.


If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.

Staleness will not be marked for time series that have timestamps included in their scrapes. Only the 5 minute threshold will be applied in that case.
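The staleness rules can be summarised as a lookup: take the newest sample at or before the evaluation timestamp, then return nothing if that sample is a staleness marker or is older than the 5 minute threshold. The Python sketch below is a simplified model of that behaviour for one series; the STALE sentinel, the function name instant_value, and the exact boundary handling are assumptions made for illustration.

```python
import bisect

LOOKBACK_S = 5 * 60  # the default 5 minute threshold mentioned above
STALE = object()     # hypothetical stand-in for a staleness marker


def instant_value(samples, eval_time):
    """Return the value a series contributes at eval_time, or None.

    samples is a list of (timestamp_seconds, value) sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    idx = bisect.bisect_right(timestamps, eval_time) - 1
    if idx < 0:
        return None  # no sample at or before the evaluation time
    ts, value = samples[idx]
    if value is STALE:
        return None  # the series was marked stale
    if eval_time - ts > LOOKBACK_S:
        return None  # the newest sample is older than the threshold
    return value


if __name__ == "__main__":
    samples = [(0, 1.0), (60, 2.0), (120, STALE), (600, 5.0)]
    for t in (90, 130, 610, 1000):
        print(t, instant_value(samples, t))
```

In this model the series vanishes right after the marker at 120 seconds, reappears once a new sample arrives at 600 seconds, and vanishes again once that sample is more than 5 minutes old.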

Avoiding slow queries and overloads

If a query needs to operate on a very large amount of data, graphing it might time out or overload the server or browser. Thus, when constructing queries over unknown data, always start building the query in the tabular view of Prometheus's expression browser until the result set seems reasonable (hundreds, not thousands, of time series at most). Only when you have filtered or aggregated your data sufficiently, switch to graph mode. If the expression still takes too long to graph ad-hoc, pre-record it via a recording rule.

Translator's note: some background helps here; it is worth reading about Recording rules and Alerting rules in the Configuration section first.


This is especially relevant for Prometheus's query language, where a bare metric name selector like api_http_requests_total could expand to thousands of time series with different labels. Also keep in mind that expressions which aggregate over many time series will generate load on the server even if the output is only a small number of time series. This is similar to how it would be slow to sum all values of a column in a relational database, even if the output value is only a single number.

This documentation is open-source. Please help improve it by filing issues or pull requests.

