Implement compression quota refunds and admin manual subscription
docs/observability.md
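# Observability Design (Logs / Metrics / Tracing) - ImageForge

Goal: make compression effectiveness, performance bottlenecks, queue health, billing correctness, and abuse risk observable and alertable, to support commercial operation.

---

## 1. Common Conventions

### 1.1 Request Identification

- Generate a `request_id` for every HTTP request (or pass through the one supplied by the gateway) and write it to:
  - Response header: `X-Request-Id`
  - Log field: `request_id`
  - Trace: `trace_id/span_id` (if OpenTelemetry is enabled)
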
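The request-ID rule above could be sketched as follows. This is a minimal, framework-independent illustration, not ImageForge's actual middleware: the function names, the plain-dict header representation, and the validation pattern for gateway-supplied IDs are all assumptions made for the example.

```python
import re
import uuid

# Hypothetical policy: accept only short, URL-safe IDs from the gateway.
_SAFE_ID = re.compile(r"^[A-Za-z0-9._-]{8,128}$")


def resolve_request_id(headers: dict[str, str]) -> str:
    """Return the gateway-supplied X-Request-Id if it looks safe, else a new UUID."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    candidate = lowered.get("x-request-id", "").strip()
    if _SAFE_ID.fullmatch(candidate):
        return candidate
    return uuid.uuid4().hex


def attach_request_id(request_headers: dict[str, str],
                      response_headers: dict[str, str]) -> str:
    """Resolve the ID once, echo it in the response, and return it for logging."""
    request_id = resolve_request_id(request_headers)
    response_headers["X-Request-Id"] = request_id
    return request_id
```

Validating the incoming value before reuse keeps arbitrary client-supplied strings out of the logs; whether that is needed depends on how far the gateway is trusted.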
### 1.2 Log Format

- Prefer structured (JSON) logs so they can be aggregated by Loki/ELK.
- Never log plaintext passwords, JWTs, API keys, or webhook secrets.

Recommended minimum fields:

- `timestamp`, `level`, `service` (api/worker), `request_id`
- `user_id` (nullable), `api_key_id` (nullable), `ip`, `user_agent`
- `route`, `method`, `status`, `latency_ms`
- `task_id`, `task_file_id` (compression pipeline)
- `bytes_in`, `bytes_out`, `format_in/out`, `compression_level`

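One way to combine the JSON format with the "never log" rule is a formatter that masks sensitive keys before serialising. The sketch below uses only Python's standard `logging` module; the deny-list contents, the extra `message` field, the `imageforge.<service>` logger name, and the convention of passing structured fields via `extra={"fields": {...}}` are assumptions for illustration. A call such as `build_logger("api").info("task created", extra={"fields": {"request_id": "r1", "latency_ms": 12}})` would emit one JSON object per line.

```python
import json
import logging

# Assumed deny-list derived from the "never log" rule; extend as needed.
SENSITIVE_KEYS = {"password", "jwt", "authorization", "api_key", "webhook_secret"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object and mask sensitive fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured fields arrive via logging's `extra={"fields": {...}}`.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload, ensure_ascii=False, default=str)


def build_logger(service: str) -> logging.Logger:
    """Create a logger for one service (api or worker) that writes JSON lines."""
    logger = logging.getLogger(f"imageforge.{service}")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```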
---

## 2. Metrics (Prometheus)

### 2.1 API Service Metrics

Requests:

- `http_requests_total{route,method,status}`
- `http_request_duration_seconds_bucket{route,method}`

Authentication and risk control:

- `auth_fail_total{reason}`
- `rate_limited_total{scope}` (anonymous/user/api_key)
- `quota_exceeded_total{plan}`

Billing pipeline:

- `billing_webhook_total{provider,event_type,result}`
- `subscription_state_total{state}`
- `invoice_total{status}`

### 2.2 Worker Metrics

Queue and throughput:

- `jobs_received_total`
- `jobs_inflight`
- `jobs_completed_total{result}`
- `job_duration_seconds_bucket{format,level}`

Compression effectiveness:

- `bytes_in_total`, `bytes_out_total`, `bytes_saved_total`
- `compression_ratio_bucket{format,level}`

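To make the compression-effectiveness metrics concrete, here is a dependency-free sketch of how one job's sizes could feed the three byte counters and the ratio histogram. It is not a Prometheus client; a real Worker would use one. The definition of the ratio as `bytes_out / bytes_in`, the bucket bounds, and clamping savings at zero when the output grows are assumptions, since the document does not specify them.

```python
from bisect import bisect_left
from collections import defaultdict

# Assumed bucket upper bounds for compression_ratio (bytes_out / bytes_in).
RATIO_BUCKETS = (0.1, 0.25, 0.5, 0.75, 0.9, 1.0, float("inf"))


class CompressionStats:
    """In-memory stand-in for the byte counters and the ratio histogram."""

    def __init__(self):
        self.bytes_in_total = 0
        self.bytes_out_total = 0
        self.bytes_saved_total = 0
        # (format, level) -> per-bucket counts, stored non-cumulatively.
        self.ratio_counts = defaultdict(lambda: [0] * len(RATIO_BUCKETS))

    def observe(self, fmt: str, level: str, bytes_in: int, bytes_out: int) -> None:
        if bytes_in <= 0:
            raise ValueError("bytes_in must be positive")
        self.bytes_in_total += bytes_in
        self.bytes_out_total += bytes_out
        # Counters must never decrease, so a larger output saves 0 bytes.
        self.bytes_saved_total += max(bytes_in - bytes_out, 0)
        ratio = bytes_out / bytes_in
        # First bucket whose upper bound is >= ratio ("le" semantics).
        index = bisect_left(RATIO_BUCKETS, ratio)
        self.ratio_counts[(fmt, level)][index] += 1

    def cumulative(self, fmt: str, level: str) -> list[int]:
        """Cumulative per-bucket counts, the form in which `_bucket` series are exposed."""
        total, out = 0, []
        for count in self.ratio_counts[(fmt, level)]:
            total += count
            out.append(total)
        return out
```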
Resources and errors:

- `decode_failed_total{reason}`
- `pixel_limit_hit_total`

### 2.3 Redis/Queue Metrics (Optional)

- Streams consumer lag, pending count, and dead-letter count (if implemented).

---

## 3. Tracing

Recommendation: instrument both the API and the Worker with OpenTelemetry so that traces connect across services:

- API: `create_task` span, `auth` span, `db` span, `redis` span
- Worker: `fetch_job` span, `download_input` span, `compress` span, `upload_output` span, `metering` span

Benefits:

- Reveals where time is concentrated (decode/encode/S3/DB).
- Helps locate reconciliation problems (usage events that failed to write or were written twice).

---

## 4. Dashboards and Alerts (Recommended)

### 4.1 SLO (Suggested Starting Point)

- API: P95 < 300 ms (excluding direct-return compression requests), error rate < 0.5%
- Worker: queue backlog < N (defined according to scale), failure rate < 1%

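As a rough illustration of the API targets above, the check below evaluates a window of raw samples against the P95 and error-rate limits. In production these values would normally be derived in Prometheus from the histogram and counter series (for example with `histogram_quantile`), not from raw samples. Treating only 5xx statuses as errors and using the nearest-rank percentile are assumptions of this sketch.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile; `samples` must be non-empty."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def check_api_slo(latencies_ms: list[float], statuses: list[int],
                  p95_limit_ms: float = 300.0, error_limit: float = 0.005) -> dict:
    """Compare one observation window against the suggested API SLO."""
    p95 = percentile(latencies_ms, 95)
    # Assumption: only 5xx responses count against the error budget.
    errors = sum(1 for status in statuses if status >= 500)
    error_rate = errors / len(statuses)
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "p95_ok": p95 < p95_limit_ms,
        "error_rate_ok": error_rate < error_limit,
    }
```

Requests excluded from the latency target, such as direct-return compression, would need to be filtered out before calling `check_api_slo`.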
### 4.2 Alerts

Availability:

- Surge in `http 5xx`
- `/health` probe failures

Queue health:

- pending/inflight counts keep rising
- Abnormal growth in per-task duration

Billing correctness:

- Webhook processing failures
- Abnormal subscription state transitions (e.g., regressing from active to incomplete)

Abuse risk:

- Sudden usage spike from a single key or single IP
- Abnormal format-detection failure rate

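The "sudden usage spike" alert can be approximated by comparing the current window with a recent baseline for the same key or IP. The following is a deliberately simple sketch; the multiplier, the minimum-volume floor, and the choice to flag a key with no history are all hypothetical and would need tuning against real traffic.

```python
from statistics import mean


def is_usage_spike(history: list[int], current: int,
                   factor: float = 5.0, min_requests: int = 100) -> bool:
    """Flag `current` if it exceeds `factor` times the mean of past windows."""
    if current < min_requests:
        # Ignore tiny absolute volumes to reduce alert noise.
        return False
    if not history:
        # No baseline yet: treat a large first window as suspicious (assumption).
        return True
    return current > factor * mean(history)
```

For example, with past windows of 100, 120, and 110 requests, a current window of 900 is flagged while 300 is not.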