• Spring Cloud: Integrating Prometheus with Alertmanager for Microservice Alerting (Part 23)


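    The previous two posts covered microservice monitoring with Prometheus + Grafana + Eureka, where Grafana provides the monitoring dashboards. One problem remains: when something goes wrong, nobody can watch the dashboards around the clock. Important alerts get missed, manual notification is slow, and, more importantly, watching thousands of servers by hand would take far too many people. Automatic, accurate alerting is therefore essential.

    Spring Cloud: Monitoring Microservices with Prometheus + Grafana (Part 21)

    Spring Cloud: Dynamic Microservice Monitoring with Prometheus + Grafana + Eureka (Part 22)

    In the official architecture diagram shown below, the blue box in the upper right corner marks the alerting module of Prometheus. This post implements the following flow: when a machine goes down, Prometheus sends the alert to Alertmanager, and Alertmanager forwards it to our own alerting application. That application can then notify the relevant business and development staff by email, SMS, or WeCom (Enterprise WeChat).

    Architecture diagram: Prometheus integrated with Alertmanager for alerting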

    1. Configuring Prometheus to use Alertmanager

    Set the Alertmanager address in prometheus.yml. 9093 is the default port Alertmanager listens on.

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]  

    2. Configuring alerting rules in Prometheus

    2.1 Create a rule file named first_rules.yml in the same directory as prometheus.yml.

    Contents of first_rules.yml:

    groups:
    - name: example
      rules:
      - alert:  InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Instance has been down for more than 5 minutes

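    This rule sends an alert when an instance goes offline: up == 0 is true for a target whose scrape failed, and for: 1m requires the condition to hold for one minute before the alert fires. Note that the summary text says 5 minutes although for is set to 1m; the text is shown unchanged because it appears verbatim in the alert payload in section 5.4.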
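
    How the for clause delays an alert can be sketched in plain Java. This is a simplified model under stated assumptions, not the actual Prometheus code: the class ForClauseModel, its State enum, and the evaluate method are hypothetical names introduced for illustration. The model assumes that an alert becomes pending on the first evaluation where the expression is true, fires once the expression has stayed true for at least the for duration, and resets as soon as the expression is false.

```java
import java.time.Duration;
import java.time.Instant;

/** Simplified, hypothetical model of one alert instance under a rule with a `for` duration. */
final class ForClauseModel {
    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;  // corresponds to `for: 1m` in first_rules.yml
    private Instant activeSince;     // time the expression first became true; null while inactive

    ForClauseModel(Duration holdFor) {
        this.holdFor = holdFor;
    }

    /** Feeds the result of one rule evaluation and returns the resulting alert state. */
    State evaluate(boolean exprIsTrue, Instant now) {
        if (!exprIsTrue) {
            activeSince = null;      // the condition cleared, so the timer starts over next time
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;       // first evaluation where `up == 0` held
        }
        boolean heldLongEnough = Duration.between(activeSince, now).compareTo(holdFor) >= 0;
        return heldLongEnough ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ForClauseModel alert = new ForClauseModel(Duration.ofMinutes(1));
        Instant t0 = Instant.parse("2021-03-17T00:26:55Z");
        System.out.println(alert.evaluate(true, t0));                  // PENDING
        System.out.println(alert.evaluate(true, t0.plusSeconds(30)));  // PENDING
        System.out.println(alert.evaluate(true, t0.plusSeconds(60)));  // FIRING
        System.out.println(alert.evaluate(false, t0.plusSeconds(75))); // INACTIVE
    }
}
```

    In this model only the FIRING state would be forwarded to Alertmanager, which is why a short scrape failure that recovers within a minute produces no notification.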

    2.2 Reference first_rules.yml in prometheus.yml. The entry is commented out by default; just uncomment it.

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "first_rules.yml"
      # - "second_rules.yml"

    3. Installing and configuring Alertmanager

    3.1 Download Alertmanager

    URL: https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.windows-amd64.tar.gz

    3.2 Edit alertmanager.yml

    After downloading, extract the archive and change receivers.webhook_configs.url so that it points to our own alerting application (the address of spring-cloud-alertmanager).

    global:
      resolve_timeout: 5m
    
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:5001/alertMessage/receive'
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
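
    The inhibit_rules entry above can be read as a predicate: a warning alert is muted while a critical alert with the same values for the labels under equal is firing. The following is a minimal sketch of that reading, not the Alertmanager implementation; InhibitSketch and isInhibited are hypothetical names, and alerts are reduced to plain label maps. It assumes that a label missing from an alert counts as an empty value, so two alerts that both lack dev still match on it.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the inhibit rule in alertmanager.yml, with alerts modeled as label maps. */
final class InhibitSketch {
    // The labels listed under `equal` in the rule.
    static final List<String> EQUAL_LABELS = List.of("alertname", "dev", "instance");

    /** Returns true if the target is a warning alert and a firing critical alert matches it on EQUAL_LABELS. */
    static boolean isInhibited(Map<String, String> target, List<Map<String, String>> firingAlerts) {
        if (!"warning".equals(target.get("severity"))) {
            return false; // target_match does not apply to this alert
        }
        for (Map<String, String> source : firingAlerts) {
            if ("critical".equals(source.get("severity")) && sameOnEqualLabels(source, target)) {
                return true;
            }
        }
        return false;
    }

    private static boolean sameOnEqualLabels(Map<String, String> a, Map<String, String> b) {
        for (String name : EQUAL_LABELS) {
            // Assumption: a missing label is compared as an empty string.
            if (!a.getOrDefault(name, "").equals(b.getOrDefault(name, ""))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> critical =
                Map.of("alertname", "InstanceDown", "instance", "127.0.0.1:5001", "severity", "critical");
        Map<String, String> warning =
                Map.of("alertname", "InstanceDown", "instance", "127.0.0.1:5001", "severity", "warning");
        System.out.println(isInhibited(warning, List.of(critical))); // true
    }
}
```

    With only the InstanceDown rule from section 2, every alert is critical, so this inhibit rule never mutes anything in the setup described here.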

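    4. Developing the alerting module spring-cloud-alertmanager

    4.1 Create a controller that receives the alert messages. Note that Alertmanager submits them with POST, and the request body is read here as a raw byte stream.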

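    import java.nio.charset.StandardCharsets;

    import lombok.extern.slf4j.Slf4j;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;

    /**
     * Receives the alert notifications that Alertmanager posts to the webhook URL.
     *
     * @author Leo
     */
    @RestController
    @RequestMapping("alertMessage")
    @Slf4j
    public class ReceiveAlertMessageController {
    
        @PostMapping("receive")
        public String receiveMsg(@RequestBody byte[] data) {
            // Decode the raw request body as UTF-8 text (the payload is JSON).
            String msg = new String(data, StandardCharsets.UTF_8);
            // The log prefix means "Received AlertManager alert message:".
            log.info("接收AlertManager预警消息:" + msg);
            return "success";
        }
    }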

    4.2 Create the application entry class

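    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.netflix.eureka.EnableEurekaClient;

    /**
     * Entry point of the spring-cloud-alertmanager application.
     *
     * @author Leo
     */
    @SpringBootApplication
    @EnableEurekaClient
    public class AlertManagerApplication {
    
        public static void main(String[] args) {
            SpringApplication.run(AlertManagerApplication.class, args);
        }
    
    }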

    5. Verifying the alerting flow

    5.1 Startup

    Start Eureka

    Start Prometheus: D:\soft\springcloud\prometheus-2.25.1\prometheus.exe

    Start Alertmanager: D:\soft\springcloud\alertmanager-0.21.0\alertmanager.exe

    Start spring-cloud-alertmanager

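    5.2 View the alerting rules

    Open http://localhost:9090/classic/rules in a browser to see the rule configured earlier in first_rules.yml.

    Click the Alerts menu to see that 3 instances are currently reported as down. (They are not really down: these applications have not been set up for Prometheus, so Prometheus can pull the application list from Eureka but cannot scrape metrics from the applications themselves.)

    5.3 View the Alertmanager console

    Open http://localhost:9093/ in a browser and click the Alerts menu. It shows 3 alerts, which confirms that Prometheus has pushed the alerts to Alertmanager.

    5.4 Check the spring-cloud-alertmanager log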

    2021-03-17 10:28:05.621  INFO 49924 --- [nio-5001-exec-5] c.x.a.c.ReceiveAlertMessageController    : 接收AlertManager预警消息:{"receiver":"web\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"InstanceDown","instance":"172.16.43.41:5001","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"69addef300b8a5b1"},{"status":"firing","labels":{"alertname":"InstanceDown","instance":"windows10.microdone.cn:apollo-adminservice:8090","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"bfde9dc4159405b2"},{"status":"firing","labels":{"alertname":"InstanceDown","instance":"windows10.microdone.cn:apollo-configservice:8080","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"80adda6540e0cfba"}],"groupLabels":{"alertname":"InstanceDown"},"commonLabels":{"alertname":"InstanceDown","job":"eureka","severity":"critical"},"commonAnnotations":{"summary":"Instance has been down for more than 5 minutes"},"externalURL":"http://DESKTOP-TK67BLR:9093","version":"4","groupKey":"{}:{alertname=\"InstanceDown\"}","truncatedAlerts":0}

    The log shows that the message pushed by Alertmanager arrived at the http://127.0.0.1:5001/alertMessage/receive endpoint. Formatted with a JSON tool, the payload looks like this:

    {
        "receiver":"web.hook",
        "status":"firing",
        "alerts":[
            {
                "status":"firing",
                "labels":{
                    "alertname":"InstanceDown",
                    "instance":"127.0.0.1:5001",
                    "job":"eureka",
                    "severity":"critical"
                },
                "annotations":{
                    "summary":"Instance has been down for more than 5 minutes"
                },
                "startsAt":"2021-03-17T00:27:55.050285364Z",
                "endsAt":"0001-01-01T00:00:00Z",
                "generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
                "fingerprint":"69addef300b8a5b1"
            },
            {
                "status":"firing",
                "labels":{
                    "alertname":"InstanceDown",
                    "instance":"windows10.microdone.cn:apollo-adminservice:8090",
                    "job":"eureka",
                    "severity":"critical"
                },
                "annotations":{
                    "summary":"Instance has been down for more than 5 minutes"
                },
                "startsAt":"2021-03-17T00:27:55.050285364Z",
                "endsAt":"0001-01-01T00:00:00Z",
                "generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
                "fingerprint":"bfde9dc4159405b2"
            },
            {
                "status":"firing",
                "labels":{
                    "alertname":"InstanceDown",
                    "instance":"windows10.microdone.cn:apollo-configservice:8080",
                    "job":"eureka",
                    "severity":"critical"
                },
                "annotations":{
                    "summary":"Instance has been down for more than 5 minutes"
                },
                "startsAt":"2021-03-17T00:27:55.050285364Z",
                "endsAt":"0001-01-01T00:00:00Z",
                "generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
                "fingerprint":"80adda6540e0cfba"
            }
        ],
        "groupLabels":{
            "alertname":"InstanceDown"
        },
        "commonLabels":{
            "alertname":"InstanceDown",
            "job":"eureka",
            "severity":"critical"
        },
        "commonAnnotations":{
            "summary":"Instance has been down for more than 5 minutes"
        },
        "externalURL":"http://DESKTOP-TK67BLR:9093",
        "version":"4",
        "groupKey":"{}:{alertname=\"InstanceDown\"}",
        "truncatedAlerts":0
    }
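
    Once the payload has been parsed, each entry of the alerts array carries enough information for a short notification line. The sketch below shows one possible way to build such a line from the fields seen above (status, labels, annotations, startsAt). AlertTextSketch and describe are hypothetical names, the output format is an assumption, and JSON parsing is left out so that the example needs only the JDK.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper that turns one entry of the `alerts` array into a notification line. */
final class AlertTextSketch {

    /** Builds a single line of text; labels or annotations that are absent are shown as "unknown". */
    static String describe(String status, Map<String, String> labels,
                           Map<String, String> annotations, String startsAt) {
        return String.format("[%s] %s on %s (severity=%s, since %s): %s",
                status.toUpperCase(Locale.ROOT),
                labels.getOrDefault("alertname", "unknown"),
                labels.getOrDefault("instance", "unknown"),
                labels.getOrDefault("severity", "unknown"),
                startsAt,
                annotations.getOrDefault("summary", "unknown"));
    }

    public static void main(String[] args) {
        // Values copied from the first alert in the formatted payload.
        Map<String, String> labels = Map.of(
                "alertname", "InstanceDown",
                "instance", "127.0.0.1:5001",
                "job", "eureka",
                "severity", "critical");
        Map<String, String> annotations =
                Map.of("summary", "Instance has been down for more than 5 minutes");
        System.out.println(describe("firing", labels, annotations, "2021-03-17T00:27:55.050285364Z"));
    }
}
```

    A line like this could serve as an email subject or SMS text; the groupLabels and commonLabels fields would allow one message per group instead of one per alert.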

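    This completes the integration of Prometheus with Alertmanager.

    Additional note: we do not let Alertmanager send email directly because production generates a large volume of alerts. Instead, spring-cloud-alertmanager can store the received alerts in a message queue or a database and then call the email or SMS service. This also makes the notification channels more flexible.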
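
    The buffering idea in the note above can be illustrated with an in-memory queue standing in for the MQ or database. This is only a sketch under that assumption: AlertBufferSketch, accept, and dispatchAll are hypothetical names, and a real deployment would use a durable store so that alerts survive a restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for the MQ or database between the webhook and the notifiers. */
final class AlertBufferSketch {
    private final BlockingQueue<String> queue;

    AlertBufferSketch(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Called from the webhook endpoint: enqueues without blocking; false means the buffer is full. */
    boolean accept(String rawPayload) {
        return queue.offer(rawPayload);
    }

    /** Called by a background worker: passes every buffered payload to the notifier, returns the count. */
    int dispatchAll(Consumer<String> notifier) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch);
        batch.forEach(notifier);
        return batch.size();
    }

    public static void main(String[] args) {
        AlertBufferSketch buffer = new AlertBufferSketch(100);
        buffer.accept("{\"status\":\"firing\"}");
        // Here the "notifier" just prints; an email or SMS client would go in its place.
        int sent = buffer.dispatchAll(payload -> System.out.println("notify: " + payload));
        System.out.println(sent + " payload(s) dispatched");
    }
}
```

    Separating accept from dispatchAll lets the webhook endpoint return quickly even when the email or SMS service is slow, which matters when many alerts arrive at once.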

  • Original article: https://www.cnblogs.com/shileibrave/p/14548069.html