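# Health Monitoring (Email Edition)
This directory provides `health_email_monitor.py`, which calls the `/health` endpoint and sends alert emails using the **mail configuration that already exists inside the container**.
## 1) Quick trial run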
```bash
cd /root/zsglpt
python3 scripts/health_email_monitor.py \
--to your-alert-address@example.com \
--container knowledge-automation-multiuser \
--url http://127.0.0.1:51232/health \
--dry-run
```
Remove `--dry-run` to actually send email.
## 2) Suggested cron entry (every minute)
```bash
# A crontab entry must stay on one line; cron does not join lines that end in a backslash.
* * * * * cd /root/zsglpt && /usr/bin/python3 scripts/health_email_monitor.py --to your-alert-address@example.com --container knowledge-automation-multiuser --url http://127.0.0.1:51232/health >> /root/zsglpt/logs/health_monitor.log 2>&1
```
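## 3) Supported rules
- `service_down`: the request to the health endpoint failed (alerts immediately)
- `health_fail`: the response reports an abnormal `ok`/`db_ok` value or an HTTP 5xx status (alerts immediately)
- `db_pool_exhausted`: the connection pool is exhausted (by default, alerts only after 3 consecutive occurrences)
- `queue_backlog_high`: the task backlog is too high (by default, `pending_total >= 50` for 5 consecutive checks)
The script supports recovery notifications: when a rule returns to normal, it sends a "recovered" email.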
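
The "consecutive N times" and recovery behaviour described above can be sketched as a small state machine. This is a minimal illustration, not the script's actual code: `RuleState` and `evaluate` are hypothetical names, and repeat reminders (`--remind-seconds`) are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleState:
    """Hypothetical per-rule state that would be persisted between runs."""
    streak: int = 0          # number of consecutive failing checks
    alerting: bool = False   # True while an alert is open for this rule


def evaluate(state: RuleState, failing: bool, required_streak: int) -> Optional[str]:
    """Update the state with one check result; return "alert", "recover", or None."""
    if failing:
        state.streak += 1
        # Open an alert only once the streak reaches the threshold.
        if not state.alerting and state.streak >= required_streak:
            state.alerting = True
            return "alert"
        return None
    # A healthy check resets the streak and closes any open alert.
    state.streak = 0
    if state.alerting:
        state.alerting = False
        return "recover"
    return None


# Example: a rule that needs 3 consecutive failures, like db_pool_exhausted.
state = RuleState()
events = [evaluate(state, failing, required_streak=3)
          for failing in [True, True, True, True, False]]
```

With `required_streak=1` the same function models the rules that alert immediately, such as `service_down`.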
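## 4) Common parameters
- `--to`: recipient address (required)
- `--container`: Docker container name (default `knowledge-automation-multiuser`)
- `--url`: health check URL (default `http://127.0.0.1:51232/health`)
- `--state-file`: path of the state file (default `/tmp/zsglpt_health_monitor_state.json`)
- `--remind-seconds`: interval between repeated alerts (default 3600 seconds)
- `--queue-threshold`: queue alert threshold (default 50)
- `--queue-streak`: number of consecutive checks required for the queue rule (default 5)
- `--db-pool-streak`: number of consecutive checks required for the connection pool rule (default 3)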
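
For reference, the flags and defaults listed above could be declared with `argparse` roughly as follows. This is a sketch under assumptions: `build_parser` is a hypothetical helper, the integer types are inferred from the documented defaults, and the real script may declare its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Health check email monitor (sketch)")
    parser.add_argument("--to", required=True, help="recipient address")
    parser.add_argument("--container", default="knowledge-automation-multiuser")
    parser.add_argument("--url", default="http://127.0.0.1:51232/health")
    parser.add_argument("--state-file", default="/tmp/zsglpt_health_monitor_state.json")
    parser.add_argument("--remind-seconds", type=int, default=3600)
    parser.add_argument("--queue-threshold", type=int, default=50)
    parser.add_argument("--queue-streak", type=int, default=5)
    parser.add_argument("--db-pool-streak", type=int, default=3)
    parser.add_argument("--dry-run", action="store_true",
                        help="run the checks without sending email")
    return parser


# Example: only --to is given, so every other option falls back to its default.
args = build_parser().parse_args(["--to", "ops@example.com", "--queue-threshold", "80"])
```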
## 5) Environment variables (optional)
Instead of command-line arguments, you can set environment variables:
- `MONITOR_EMAIL_TO`
- `MONITOR_DOCKER_CONTAINER`
- `HEALTH_URL`
- `MONITOR_STATE_FILE`
- `MONITOR_REMIND_SECONDS`
- `MONITOR_QUEUE_THRESHOLD`
- `MONITOR_QUEUE_STREAK`
- `MONITOR_DB_POOL_STREAK`
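
One way such variables might be resolved is shown below. It is only a sketch: `SETTINGS` and `load_settings` are hypothetical, the pairing of each variable with a setting is inferred from the names, and the defaults are the ones documented for the command-line flags. How the real script orders precedence between flags and variables is not stated here.

```python
import os
from typing import Mapping

# (environment variable, setting name, documented default, converter)
SETTINGS = [
    ("MONITOR_EMAIL_TO", "to", None, str),
    ("MONITOR_DOCKER_CONTAINER", "container", "knowledge-automation-multiuser", str),
    ("HEALTH_URL", "url", "http://127.0.0.1:51232/health", str),
    ("MONITOR_STATE_FILE", "state_file", "/tmp/zsglpt_health_monitor_state.json", str),
    ("MONITOR_REMIND_SECONDS", "remind_seconds", 3600, int),
    ("MONITOR_QUEUE_THRESHOLD", "queue_threshold", 50, int),
    ("MONITOR_QUEUE_STREAK", "queue_streak", 5, int),
    ("MONITOR_DB_POOL_STREAK", "db_pool_streak", 3, int),
]


def load_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Read each variable, falling back to the documented default when unset or empty."""
    settings = {}
    for env_name, key, default, convert in SETTINGS:
        raw = environ.get(env_name)
        settings[key] = convert(raw) if raw not in (None, "") else default
    return settings


# Example with an explicit mapping instead of the real process environment.
example = load_settings({"MONITOR_EMAIL_TO": "ops@example.com",
                         "MONITOR_QUEUE_THRESHOLD": "80"})
```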