
Elasticsearch Log Monitoring Solutions

Author: Se7en
Published: 2 hours ago

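Many companies now store their application, middleware, and system logs in Elasticsearch, so detecting anomalies in those logs and sending timely alerts is essential. This article introduces two mainstream log monitoring solutions: ElastAlert, open-sourced by Yelp, and Watcher, a commercial feature from Elastic.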


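In the demo setup, the log source is an Nginx server. Filebeat is installed on that server to collect the Nginx logs and ship them to Elasticsearch. ElastAlert and Watcher are then each used to monitor those logs and send alerts.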


Deploying Nginx

Install dependencies

yum install -y gcc gcc-c++ autoconf pcre pcre-devel make automake wget httpd-tools vim tree zlib-devel

Download and extract the source package

wget http://nginx.org/download/nginx-1.14.0.tar.gz
tar -xzvf nginx-1.14.0.tar.gz

Compile and install

cd nginx-1.14.0
./configure
make && make install

Configure Nginx

Edit the configuration file /usr/local/nginx/conf/nginx.conf so that Nginx serves a static web page.


worker_processes  1;

events {
    worker_connections  1024;
}

http {
    server {
        listen       80;
        location / {
            root   html;
        }
    }
}


Start Nginx:


/usr/local/nginx/sbin/nginx


Open the Nginx server's address in a browser to check that it serves the default welcome page.


Deploying Filebeat

Download and install Filebeat:


curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.14.0-x86_64.rpm
sudo rpm -vi filebeat-7.14.0-x86_64.rpm


Edit /etc/filebeat/filebeat.yml so that Filebeat reads the Nginx log files and writes them to Elasticsearch indices named nginx- followed by the current date:


filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /usr/local/nginx/logs/*.log

output.elasticsearch:
  hosts: ["192.168.1.8:9200"]
  index: "nginx-%{+yyyy.MM.dd}"
  #username: "elastic"
  #password: "changeme"

setup.ilm.enabled: false
setup.template.name: "nginx"
setup.template.pattern: "nginx-*"
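Filebeat fills in %{+yyyy.MM.dd} from each event's timestamp, so a new index is created every day. The sketch below assumes that date is computed in UTC (the timezone of @timestamp), which means the index date can differ from the server's local date around midnight; daily_index is a hypothetical helper for illustration, not part of Filebeat.

```python
from datetime import datetime, timedelta, timezone

# China Standard Time (UTC+8), the time zone shown in this article's logs
CST = timezone(timedelta(hours=8))


def daily_index(prefix: str, event_time: datetime) -> str:
    """Hypothetical helper mimicking the "nginx-%{+yyyy.MM.dd}" index pattern.

    Assumption: the date comes from the event timestamp converted to UTC.
    """
    utc_time = event_time.astimezone(timezone.utc)
    return f"{prefix}-{utc_time:%Y.%m.%d}"


if __name__ == "__main__":
    # 15:34 CST on 16 Aug is 07:34 UTC on the same day
    print(daily_index("nginx", datetime(2021, 8, 16, 15, 34, tzinfo=CST)))
    # 01:00 CST on 17 Aug is still 16 Aug in UTC
    print(daily_index("nginx", datetime(2021, 8, 17, 1, 0, tzinfo=CST)))
```

Under that assumption, a log line written shortly after local midnight lands in the previous day's index; searches over nginx-* are unaffected because the wildcard covers every daily index.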


Start Filebeat:


systemctl start filebeat

ElastAlert

ElastAlert is an alerting framework for Elasticsearch, written in Python and open-sourced by Yelp. It queries Elasticsearch for data that matches its rules and sends alerts on the matches.


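ElastAlert has the following features:


  • Supports multiple match rule types (frequency, threshold, data change, blacklist and whitelist, rate of change, and more).

  • Supports multiple alert types (email, HTTP POST, custom scripts, and more).

  • Supports user-defined rule types and alert types.

  • Aggregates matches into summary alerts, suppresses repeated alerts, and retries failed alerts until they expire.

  • Reliable: state is saved to an Elasticsearch index, so ElastAlert can resume where it stopped after a restart.

  • Supports debugging and auditing.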

Deploying ElastAlert

Install Python

wget https://www.python.org/ftp/python/3.6.9/Python-3.6.9.tgz
tar -zxvf Python-3.6.9.tgz
cd Python-3.6.9
./configure
make && make install


Check the Python version:


python3 -V

Install dependencies

yum install gcc libffi-devel python3-devel openssl-devel -y
pip3 install -U pip
pip3 install "setuptools>=11.3"

Install ElastAlert

pip3 install elastalert

Configure ElastAlert

Clone the source code:


git clone https://github.com/Yelp/elastalert.git
cd elastalert


The root directory of the ElastAlert source contains a file named config.yaml.example; rename it to config.yaml:


mv config.yaml.example  config.yaml


Create a directory to hold the rules:


mkdir rules


Edit config.yaml to set the main configuration:


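# Directory where rules are stored
rules_folder: rules

# How often ElastAlert queries Elasticsearch
run_every:
  minutes: 1

# Each query covers this window back from the current time, so logs from sources that are not real-time are still picked up
buffer_time:
  minutes: 45

# Elasticsearch host
es_host: 192.168.1.8

# Elasticsearch port
es_port: 9200

# Elasticsearch username and password (optional)
#es_username: someusername
#es_password: somepassword

# Index where ElastAlert stores its metadata
writeback_index: elastalert_status

# If an alert fails for some reason, ElastAlert retries sending it until this period has elapsed
alert_time_limit:
  days: 2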
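The run_every and buffer_time settings work together: every run_every, ElastAlert searches a window reaching buffer_time back from the current time, so consecutive windows overlap and late-arriving logs are still picked up. The following is a simplified model under that assumption (it ignores ElastAlert's internal bookkeeping between runs); query_window is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Tuple

RUN_EVERY = timedelta(minutes=1)     # run_every: minutes: 1
BUFFER_TIME = timedelta(minutes=45)  # buffer_time: minutes: 45


def query_window(now: datetime, buffer_time: timedelta = BUFFER_TIME) -> Tuple[datetime, datetime]:
    """Simplified model: a run searches from `buffer_time` ago up to `now`.

    Assumption: ElastAlert's extra bookkeeping between runs is ignored.
    """
    return now - buffer_time, now


if __name__ == "__main__":
    now = datetime(2021, 8, 16, 15, 39)
    for _ in range(2):  # two consecutive runs, RUN_EVERY apart
        start, end = query_window(now)
        print(f"{start:%H:%M} -> {end:%H:%M}")
        now += RUN_EVERY
```

With these values the first run covers 14:54 to 15:39, the same window reported by the elastalert --verbose output later in this article. Because windows overlap, the same document can be returned by several runs; ElastAlert skips hits it has already processed, which is what "(0 already seen)" in that output refers to.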


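Create the rule file rules/nginx.yaml:


The rule fires when the message field of documents in the nginx-* indices matches error at least 5 times within 1 minute; the alert sends an HTTP POST request to the specified URL.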


# Alert when the rate of events exceeds a threshold

# (Required)
# Elasticsearch host
es_host: 192.168.1.8

# (Required)
# Elasticsearch port
es_port: 9200

# (Optional) Connect with SSL to Elasticsearch
#use_ssl: True

# (Optional) basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword

# (Required)
# Rule name, must be unique
name: nginx rule

# (Required)
# Type of alert.
# The frequency rule type alerts when num_events events occur within timeframe time
type: frequency

# (Required)
# Index to search, wildcard supported
index: nginx-*

# (Required, frequency specific)
# Alert when this many documents matching the query occur within a timeframe
num_events: 5

# (Required, frequency specific)
# num_events must occur within this amount of time to trigger an alert
timeframe:
  minutes: 1

# (Required)
# A list of Elasticsearch filters used to find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
- term:
    message: "error"

# (Required)
# The alert type used when a match is found
alert:
- "post"

http_post_url: "https://webhook.site/2f64f4b3-8b43-488c-b2df-695136079e36"
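The frequency rule type counts matching documents inside a sliding window of length timeframe and fires once num_events of them fall inside it. A minimal sketch of that logic under simplifying assumptions (a single rule, no query_key, and the count restarting after each alert); frequency_alerts is a hypothetical name, and ElastAlert's own implementation differs in detail.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Iterable, List


def frequency_alerts(timestamps: Iterable[datetime], num_events: int,
                     timeframe: timedelta) -> List[datetime]:
    """Return the event times at which a frequency-style rule would fire."""
    window: Deque[datetime] = deque()
    fired: List[datetime] = []
    for ts in sorted(timestamps):
        window.append(ts)
        # Evict events older than `timeframe` relative to the newest one
        while ts - window[0] > timeframe:
            window.popleft()
        if len(window) >= num_events:
            fired.append(ts)
            window.clear()  # assumption: counting restarts after an alert
    return fired


if __name__ == "__main__":
    start = datetime(2021, 8, 16, 15, 34, 22)
    burst = [start + timedelta(seconds=i) for i in range(5)]
    # num_events: 5 and timeframe: minutes: 1, as in rules/nginx.yaml
    print(frequency_alerts(burst, 5, timedelta(minutes=1)))
```

Five errors a second apart produce one alert at the fifth event, while five errors spread one minute apart never reach the threshold, because at most two of them are ever inside the same one-minute window.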


https://webhook.site provides test webhook endpoints, with a unique URL for each visitor; copy your URL into http_post_url.



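ElastAlert writes its execution records to an index, which makes auditing and debugging easier. Create that index with the following command; by default it is named elastalert_status.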


root@ydt-net-es-node1:/software # elastalert-create-index
Enter Elasticsearch host: 192.168.1.8
Enter Elasticsearch port: 9200
Use SSL? t/f: f
# If authentication is enabled, enter the username and password here
Enter optional basic-auth username (or leave blank):
Enter optional basic-auth password (or leave blank):
Enter optional Elasticsearch URL prefix (prepends a string to the URL of every request):
New index name? (Default elastalert_status)
New alias name? (Default elastalert_alerts)
Name of existing index to copy? (Default None)
Elastic Version: 7.9.3
Reading Elastic 6 index mappings:
Reading index mapping 'es_mappings/6/silence.json'
Reading index mapping 'es_mappings/6/elastalert_status.json'
Reading index mapping 'es_mappings/6/elastalert.json'
Reading index mapping 'es_mappings/6/past_elastalert.json'
Reading index mapping 'es_mappings/6/elastalert_error.json'
New index elastalert_status created
Done!


Send two requests: one valid request and one that returns an error.


> curl http://192.168.1.134 -I
HTTP/1.1 200 OK
Server: nginx/1.14.2
Date: Mon, 16 Aug 2021 07:28:42 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Wed, 16 Jun 2021 02:46:13 GMT
Connection: keep-alive
ETag: "60c965f5-264"
Accept-Ranges: bytes

> curl http://192.168.1.134/xxxxxx -I
HTTP/1.1 404 Not Found
Server: nginx/1.14.2
Date: Mon, 16 Aug 2021 07:28:43 GMT
Content-Type: text/html
Content-Length: 169
Connection: keep-alive


The Nginx logs are now visible in Kibana. The error request is written once to access.log and once to error.log, so three records appear in total.
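Of those records, only the error.log line contains the token error, which is consistent with the single hit reported by elastalert-test-rule below. Term queries are not analyzed, so the filter term message: "error" matches only documents whose analyzed message field contains exactly the token error. The sketch approximates this with a crude tokenizer (an assumption; Elasticsearch's standard analyzer is more sophisticated). The access.log line is hypothetical, while the error.log line has the same shape as the one shown later in this article.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Rough stand-in for the standard analyzer (assumption):
    lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())


def term_matches(message: str, term: str) -> bool:
    """A term query on an analyzed text field matches if the exact token is present."""
    return term in tokens(message)


# Hypothetical access.log line for the 404 request (combined log format)
ACCESS_404 = ('192.168.1.35 - - [16/Aug/2021:15:28:43 +0800] '
              '"HEAD /xxxxxx HTTP/1.1" 404 0 "-" "curl/7.29.0"')
# error.log line in the same shape as the one shown in this article
ERROR_404 = ('2021/08/16 15:34:22 [error] 4022#0: *40 open() "/usr/local/nginx/html/xxxxxx" '
             'failed (2: No such file or directory), client: 192.168.1.35, server: , '
             'request: "GET /xxxxxx HTTP/1.1", host: "192.168.1.134"')

if __name__ == "__main__":
    for line in (ACCESS_404, ERROR_404):
        print(term_matches(line, "error"), line[:40])
```

A side effect of term queries not being analyzed: a filter of message: "Error" would match nothing, because the indexed tokens are lowercase.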



Run elastalert-test-rule to check that the rule file is valid and to see how many times the rule matched. elastalert-test-rule does not actually send alerts.


> elastalert-test-rule rules/nginx.yaml
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
            To send them but remain verbose, use --verbose instead.
Didn't get any results.
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
                To send them but remain verbose, use --verbose instead.
1 rules loaded
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
# 1 hit
INFO:elastalert:Queried rule nginx rule from 2021-08-16 15:28 CST to 2021-08-16 15:29 CST: 1 / 1 hits

Would have written the following documents to writeback index (default is elastalert_status):

elastalert_status - {'rule_name': 'nginx rule', 'endtime': datetime.datetime(2021, 8, 16, 7, 29, 30, 422431, tzinfo=tzutc()), 'starttime': datetime.datetime(2021, 8, 16, 7, 28, 29, 822431, tzinfo=tzutc()), 'matches': 0, 'hits': 1, '@timestamp': datetime.datetime(2021, 8, 16, 7, 29, 30, 527080, tzinfo=tzutc()), 'time_taken': 0.02203655242919922}


Send 5 error requests within 1 minute to reach the alert threshold:


for i in {1..5}; do curl http://192.168.1.134/xxxxxx -I; done


This time the output shows the alert that would be sent:


> elastalert-test-rule rules/nginx.yaml
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
            To send them but remain verbose, use --verbose instead.
Didn't get any results.
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
                To send them but remain verbose, use --verbose instead.
1 rules loaded
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
INFO:elastalert:Queried rule nginx rule from 2021-08-16 15:33 CST to 2021-08-16 15:34 CST: 5 / 5 hits
INFO:elastalert:Alert for nginx rule at 2021-08-16T07:34:26.230Z:
INFO:elastalert:nginx rule

At least 5 events occurred between 2021-08-16 15:33 CST and 2021-08-16 15:34 CST

@timestamp: 2021-08-16T07:34:26.230Z
_id: 0CDiTXsBCANUjLffFM2O
_index: nginx-2021.08.16
_type: _doc
agent: {
    "ephemeral_id": "4ee4bd89-cb8e-43fb-9331-476c229a5480",
    "hostname": "nginx-plus1",
    "id": "629442a8-34ab-40db-80a8-16e4fda8dec7",
    "name": "nginx-plus1",
    "type": "filebeat",
    "version": "7.14.0"
}
ecs: {
    "version": "1.10.0"
}
host: {
    "name": "nginx-plus1"
}
input: {
    "type": "log"
}
log: {
    "file": {
        "path": "/usr/local/nginx/logs/error.log"
    },
    "offset": 16944
}
message: 2021/08/16 15:34:22 [error] 4022#0: *40 open() "/usr/local/nginx/html/xxxxxx" failed (2: No such file or directory), client: 192.168.1.35, server: , request: "GET /xxxxxx HTTP/1.1", host: "192.168.1.134"
num_hits: 5
num_matches: 1

Would have written the following documents to writeback index (default is elastalert_status):

silence - {'exponent': 0, 'rule_name': 'nginx rule', '@timestamp': datetime.datetime(2021, 8, 16, 7, 34, 42, 866184, tzinfo=tzutc()), 'until': datetime.datetime(2021, 8, 16, 7, 35, 42, 866174, tzinfo=tzutc())}
elastalert_status - {'rule_name': 'nginx rule', 'endtime': datetime.datetime(2021, 8, 16, 7, 34, 42, 810992, tzinfo=tzutc()), 'starttime': datetime.datetime(2021, 8, 16, 7, 33, 42, 210992, tzinfo=tzutc()), 'matches': 1, 'hits': 5, '@timestamp': datetime.datetime(2021, 8, 16, 7, 34, 42, 868045, tzinfo=tzutc()), 'time_taken': 0.015259981155395508}
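The silence document above is how ElastAlert suppresses repeat alerts for the same rule: its until value is one minute after its @timestamp, which matches ElastAlert's default realert period of 1 minute. The class below is a simplified model of that suppression (assumption: one silence per rule name, exponential_realert not modeled); Silencer is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Dict


class Silencer:
    """Toy model of per-rule realert suppression."""

    def __init__(self, realert: timedelta = timedelta(minutes=1)) -> None:
        self.realert = realert
        self.until: Dict[str, datetime] = {}  # rule name -> end of silence

    def should_alert(self, rule_name: str, now: datetime) -> bool:
        silenced_until = self.until.get(rule_name)
        if silenced_until is not None and now < silenced_until:
            return False  # still inside the silence: drop the repeat alert
        # Alert and record a silence, like the `silence` document above
        self.until[rule_name] = now + self.realert
        return True


if __name__ == "__main__":
    s = Silencer()
    t0 = datetime(2021, 8, 16, 7, 34, 42)
    for offset in (0, 30, 60):
        print(offset, s.should_alert("nginx rule", t0 + timedelta(seconds=offset)))
```

A match 30 seconds after an alert is dropped, and one a full minute later is alerted again; raising realert in the rule file lengthens that quiet period.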


Run ElastAlert with the following command; the output shows that the alert was sent:


> elastalert --verbose --rule rules/nginx.yaml
1 rules loaded
INFO:elastalert:Starting up
INFO:elastalert:Disabled rules are: []
INFO:elastalert:Sleeping for 59.999839 seconds
INFO:elastalert:Queried rule nginx rule from 2021-08-16 14:54 CST to 2021-08-16 15:39 CST: 7 / 7 hits
INFO:elastalert:HTTP Post alert sent.
INFO:elastalert:Ran nginx rule from 2021-08-16 14:54 CST to 2021-08-16 15:39 CST: 7 query hits (0 already seen), 1 matches, 1 alerts sent


Open https://webhook.site to see the HTTP POST request that ElastAlert sent.
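If you prefer not to send log data to a third-party site, a small local receiver can stand in for webhook.site. The sketch below uses only the Python standard library and assumes the alerter posts a JSON body, as ElastAlert's HTTP POST alerter does; WebhookHandler, start_receiver, post_alert, and port 8000 are hypothetical, and post_alert only imitates the alerter so the receiver can be tested without ElastAlert.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class WebhookHandler(BaseHTTPRequestHandler):
    """Minimal local stand-in for webhook.site: records and prints JSON POST bodies."""

    received = []  # every decoded body, kept for inspection

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        WebhookHandler.received.append(body)
        print("alert received:", body)
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request log line on stderr


def start_receiver(host="127.0.0.1", port=8000):
    """Serve WebhookHandler in a background thread and return the server."""
    server = HTTPServer((host, port), WebhookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_alert(url, payload):
    """Send a JSON POST the way an HTTP POST alerter would; return the HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    # Round trip on a free port: start, send one sample alert, stop
    server = start_receiver(port=0)
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    print(post_alert(url, {"message": "sample [error] line", "num_hits": 5}))
    server.shutdown()
    server.server_close()
```

To use it with ElastAlert, run HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever() in a long-running process and set http_post_url to http://<receiver-host>:8000/.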



Querying the elastalert_status index shows ElastAlert's execution record:


GET elastalert_status/_search

# Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "elastalert_status",
        "_type" : "_doc",
        "_id" : "1SDmTXsBCANUjLff0M1Q",
        "_score" : 1.0,
        "_source" : {
          "match_body" : {
            "input" : {
              "type" : "log"
            },
            "agent" : {
              "hostname" : "nginx-plus1",
              "name" : "nginx-plus1",
              "id" : "629442a8-34ab-40db-80a8-16e4fda8dec7",
              "ephemeral_id" : "4ee4bd89-cb8e-43fb-9331-476c229a5480",
              "type" : "filebeat",
              "version" : "7.14.0"
            },
            "@timestamp" : "2021-08-16T07:34:26.230Z",
            "ecs" : {
              "version" : "1.10.0"
            },
            "log" : {
              "file" : {
                "path" : "/usr/local/nginx/logs/error.log"
              },
              "offset" : 16740
            },
            "host" : {
              "name" : "nginx-plus1"
            },
            "message" : "2021/08/16 15:34:22 [error] 4022#0: *39 open() \"/usr/local/nginx/html/xxxxxx\" failed (2: No such file or directory), client: 192.168.1.35, server: , request: \"GET /xxxxxx HTTP/1.1\", host: \"192.168.1.134\"",
            "_id" : "zyDiTXsBCANUjLffFM2O",
            "_index" : "nginx-2021.08.16",
            "_type" : "_doc",
            "num_hits" : 7,
            "num_matches" : 1
          },
          "rule_name" : "nginx rule",
          "alert_info" : {
            "type" : "http_post",
            "http_post_webhook_url" : [
              "https://webhook.site/2f64f4b3-8b43-488c-b2df-695136079e36"
            ]
          },
          "alert_sent" : true,
          "alert_time" : "2021-08-16T07:39:35.185929Z",
          "match_time" : "2021-08-16T07:34:26.230Z",
          "@timestamp" : "2021-08-16T07:39:37.418536Z"
        }
      }
    ]
  }
}
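For scripted checks, the few fields that matter can be pulled out of a response shaped like the one above. A sketch assuming that shape; summarize_alerts is a hypothetical helper, and SAMPLE is a trimmed copy of the response shown above rather than a live query.

```python
import json
from typing import Any, Dict, List, Tuple


def summarize_alerts(response: Dict[str, Any]) -> List[Tuple[str, bool, str, int]]:
    """Extract (rule_name, alert_sent, alert_time, num_hits) from each hit."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        src = hit.get("_source", {})
        rows.append((
            src.get("rule_name", ""),
            bool(src.get("alert_sent", False)),
            src.get("alert_time", ""),
            src.get("match_body", {}).get("num_hits", 0),
        ))
    return rows


# Trimmed copy of the elastalert_status response shown above
SAMPLE = json.loads("""
{"hits": {"hits": [{"_source": {
  "rule_name": "nginx rule",
  "alert_sent": true,
  "alert_time": "2021-08-16T07:39:35.185929Z",
  "match_body": {"num_hits": 7, "num_matches": 1}}}]}}
""")

if __name__ == "__main__":
    for row in summarize_alerts(SAMPLE):
        print(row)
```

Checking alert_sent is a quick way to spot alerts that matched but failed to send, which ElastAlert keeps retrying until alert_time_limit expires.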

Watcher

Watcher is Elastic's official feature for monitoring data and sending alerts. It is a paid feature, but you can enable a 30-day trial in License Management.



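A watch consists of the following five parts:


  • trigger: defines when, or how often, the watch is triggered.

  • input: defines where the data comes from, such as an index or the result of an HTTP request. If no input is defined, the payload is empty.

  • condition: defines the condition under which the actions run. If no condition is defined, the actions always run.

  • transform (optional): modifies the watch payload.

  • actions: defines what to do, such as email, webhook, index, logging, or Slack.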
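To make the flow between these parts concrete, here is a toy model of one watch execution. It only illustrates the data flow described above, not how Watcher is implemented; Watch and run_once are hypothetical names, and the trigger is left to whatever scheduler calls run_once.

```python
from typing import Any, Callable, Dict, NamedTuple, Optional, Tuple

Payload = Dict[str, Any]


class Watch(NamedTuple):
    """Toy model of a watch; the trigger is whatever schedules run_once()."""
    input: Callable[[], Payload] = dict                          # no input -> empty payload
    condition: Callable[[Payload], bool] = lambda payload: True  # default: always run actions
    transform: Optional[Callable[[Payload], Payload]] = None     # optional payload rewrite
    actions: Tuple[Callable[[Payload], Any], ...] = ()


def run_once(watch: Watch) -> bool:
    """One triggered execution: input -> condition -> transform -> actions.

    Returns True if the actions were executed.
    """
    payload = watch.input()
    if not watch.condition(payload):
        return False
    if watch.transform is not None:
        payload = watch.transform(payload)
    for action in watch.actions:
        action(payload)
    return True


if __name__ == "__main__":
    messages = []
    nginx_watch = Watch(
        input=lambda: {"hits": {"total": 3}},        # stand-in for a search input
        condition=lambda p: p["hits"]["total"] > 0,  # like a compare condition
        actions=(lambda p: messages.append(
            "Number of Nginx Error: {}".format(p["hits"]["total"])),),
    )
    print(run_once(nginx_watch), messages)
```

The defaults mirror the list above: with no input the payload is empty, and with no condition the actions always run.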


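Create a watch with the following settings:


  • trigger: run once every minute.

  • input: search the indices matching nginx-* for the keyword error in the message field, covering only events from the past 5 minutes on each run.

  • condition: if the search returns at least one hit, trigger the action.

  • action: send an HTTP POST request to the specified URL, at most once every 2 minutes (throttle_period).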


PUT _watcher/watch/nginx-watcher
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          "nginx-*"
        ],
        "body": {
          "query": {
            "bool": {
              "must": {
                "match": {
                  "message": "error"
                }
              },
              "filter": {
                "range": {
                  "@timestamp": {
                    "from": "{{ctx.trigger.scheduled_time}}||-5m",
                    "to": "{{ctx.trigger.triggered_time}}"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 0
      }
    }
  },
  "actions": {
    "my_webhook": {
      "throttle_period": "2m",
      "webhook": {
        "method": "POST",
        "url": "https://webhook.site/2f64f4b3-8b43-488c-b2df-695136079e36",
        "body": "Number of Nginx Error: {{ctx.payload.hits.total}}"
      }
    }
  }
}
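The range filter resolves "{{ctx.trigger.scheduled_time}}||-5m" to five minutes before the scheduled time, so each run looks back 5 minutes while runs happen every minute; throttle_period then limits how often the webhook fires. The sketch simulates that interplay under stated assumptions (triggered_time equals scheduled_time, and an action is throttled while less than throttle_period has passed since it last fired); search_window and simulate are hypothetical helpers, not Watcher APIs.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Sequence, Tuple

LOOKBACK = timedelta(minutes=5)  # "{{ctx.trigger.scheduled_time}}||-5m"
THROTTLE = timedelta(minutes=2)  # "throttle_period": "2m"


def search_window(scheduled: datetime, triggered: datetime) -> Tuple[datetime, datetime]:
    """Resolve the range filter: scheduled_time minus 5 minutes up to triggered_time."""
    return scheduled - LOOKBACK, triggered


def simulate(error_times: Sequence[datetime], runs: Sequence[datetime],
             throttle: timedelta = THROTTLE) -> List[datetime]:
    """Return the run times at which the webhook action would fire."""
    fired: List[datetime] = []
    last_fired: Optional[datetime] = None
    for run in runs:
        start, end = search_window(run, run)  # assumption: triggered == scheduled
        # Range bounds are inclusive, like "from"/"to" in the range query
        hits = sum(1 for t in error_times if start <= t <= end)
        condition_met = hits > 0  # compare: ctx.payload.hits.total gt 0
        throttled = last_fired is not None and run - last_fired < throttle
        if condition_met and not throttled:
            fired.append(run)
            last_fired = run
    return fired


if __name__ == "__main__":
    burst = [datetime(2021, 8, 16, 15, 34, 22)]
    runs = [datetime(2021, 8, 16, 15, 35) + timedelta(minutes=i) for i in range(7)]
    for t in simulate(burst, runs):
        print(t.strftime("%H:%M"))
```

A single burst of errors stays inside the 5-minute window for five consecutive runs. Without throttling the webhook would fire on each of them; with a 2-minute throttle it fires at 15:35, 15:37, and 15:39 in this model.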


Check that the watch was created.



Send 5 error requests within 1 minute:


for i in {1..5}; do curl http://192.168.1.134/xxxxxx -I; done


Check the watch status: it shows that the action was triggered.



On https://webhook.site, the latest webhook event has arrived, and its Raw Content matches the body defined in the watch.



If the watch interval is long, Elasticsearch provides the _execute API to make testing easier: the following command runs the watch immediately.


PUT _watcher/watch/nginx-watcher/_execute
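The same API can be called from a script, for example to re-test a watch after editing it. A sketch using only the standard library, assuming security is disabled (add authentication otherwise) and using POST, the method shown in the Elasticsearch reference for this API; build_execute_request and execute_watch are hypothetical helpers. Setting ignore_condition to true in the request body runs the actions even when the condition is not met, which is handy for checking the webhook itself.

```python
import json
import urllib.request


def build_execute_request(es_url: str, watch_id: str,
                          ignore_condition: bool = False) -> urllib.request.Request:
    """Build a request for the execute watch API (POST _watcher/watch/<id>/_execute)."""
    body = json.dumps({"ignore_condition": ignore_condition}).encode("utf-8")
    return urllib.request.Request(
        f"{es_url.rstrip('/')}/_watcher/watch/{watch_id}/_execute",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def execute_watch(es_url: str, watch_id: str, ignore_condition: bool = False) -> dict:
    """Send the request and return the decoded JSON response (no authentication)."""
    req = build_execute_request(es_url, watch_id, ignore_condition)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For the setup in this article the call would be execute_watch("http://192.168.1.8:9200", "nginx-watcher"); the returned JSON contains a watch_record describing the execution.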

