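Deployment Architecture

This document describes the full deployment architecture of the B2B2B hotel platform (Hotel Platform): the evolution path from MVP to production, Docker Compose orchestration, the CI/CD pipeline, configuration management, log collection, the scaling strategy, and the database deployment plan.

1. Deployment Architecture Overview

1.1 MVP Stage (Single Node)

During the early (MVP) stage the platform runs on a single cloud server with Docker Compose, which keeps cost low while the business model is validated.

Hardware Requirements

Item	Minimum	Recommended	Notes
CPU	4 cores	8 cores	The query engine and BI module are CPU-intensive
Memory	8 GB	16 GB	ClickHouse and RabbitMQ are the largest memory consumers
System disk	50 GB SSD	100 GB SSD	OS and Docker image layers
Data disk	100 GB SSD	200 GB SSD	PostgreSQL + ClickHouse persistent data
Bandwidth	5 Mbps	10 Mbps	Outbound traffic to supplier APIs

Recommended Cloud Providers

Mainland China: Alibaba Cloud ECS / Tencent Cloud CVM / Huawei Cloud ECS
Overseas: AWS EC2 / GCP Compute Engine / DigitalOcean Droplets
Recommended image: Ubuntu 22.04 LTS or Debian 12

Single-Node Topology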

┌───────────────────────────────────────────────────┐
│               Cloud server (8C16G)                │
│                                                   │
│  ┌─────────────┐  ┌─────────────┐                 │
│  │ Nginx/Caddy │──│ API Gateway │                 │
│  │ (rev proxy) │  │ :8080       │                 │
│  └─────────────┘  └──────┬──────┘                 │
│                          │                        │
│  ┌───────────┐  ┌────────┴───────┐  ┌──────────┐  │
│  │ Admin Web │  │ Service layer  │  │ Redis    │  │
│  │ :3000     │  │                │  │ :6379    │  │
│  └───────────┘  │ • Query Engine │  └──────────┘  │
│                 │ • Order Svc    │                │
│  ┌───────────┐  │ • Supplier Adp │  ┌──────────┐  │
│  │ Prometheus│  │ • Matching Svc │  │ RabbitMQ │  │
│  │ :9090     │  │ • Pricing Svc  │  │ :5672    │  │
│  └─────┬─────┘  │ • Settlement   │  └──────────┘  │
│        │        │ • Inventory    │                │
│  ┌─────┴─────┐  │ • Risk Svc     │  ┌──────────┐  │
│  │  Grafana  │  │ • BI Service   │  │ClickHouse│  │
│  │ :3001     │  │                │  │ :8123    │  │
│  └───────────┘  └────────────────┘  └──────────┘  │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │         PostgreSQL 16 (Pigsty)              │  │
│  │         :5432                               │  │
│  └─────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────┘

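Single-Node Considerations

Backups: back up PostgreSQL and ClickHouse automatically every day and push the backup files to object storage (OSS/S3)
Monitoring and alerting: basic monitoring with Prometheus + Grafana, with alert thresholds for disk, memory, and CPU
Log cleanup: set max-size and max-file for Docker logs so the disk does not fill up
Security hardening: expose only ports 80, 443, and SSH; reach every management port through an SSH tunnel

1.2 Production Stage (Multi-Node)

Once volume grows past 10,000 orders per day, the platform moves to the production stage. The core idea is to separate applications from data and to scale each service as needed.

Node Plan

Node type	Count	Spec	Components
Application server	2~4	8C16G	API Gateway + all business microservices
Database server	1 primary + 1 replica	16C32G	PostgreSQL (primary/replica replication)
Cache server	1~3	4C8G	Redis Sentinel / Cluster
Message queue server	1~3	4C8G	RabbitMQ cluster
Analytics server	1~2	8C16G	ClickHouse (shards + replicas)
Monitoring server	1	4C8G	Prometheus + Grafana + Loki
Load balancer	1~2	2C4G	Nginx / Caddy / cloud LB

Production Topology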

                        ┌──────────────┐
                        │   DNS / CDN  │
                        └──────┬───────┘
                               │
                        ┌──────┴───────┐
                        │ Load balancer│
                        │  (Nginx/LB)  │
                        └──────┬───────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
       ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
       │  App Node 1 │ │  App Node 2 │ │  App Node N │
       │             │ │             │ │             │
       │ API Gateway │ │ API Gateway │ │ API Gateway │
       │ Services    │ │ Services    │ │ Services    │
       └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
              │                │                │
       ┌──────┴────────────────┴────────────────┴──────┐
       │             Infrastructure layer              │
       │                                                │
       │  ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
       │  │ PG Primary│ │ PG Replica│ │ Redis Sentinel │ │
       │  └──────────┘ └──────────┘ └────────────────┘ │
       │  ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
       │  │ RabbitMQ │ │ClickHouse│ │  Loki + Prom   │ │
       │  └──────────┘ └──────────┘ └────────────────┘ │
       └────────────────────────────────────────────────┘

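Optional Kubernetes Orchestration

Above 100,000 orders per day, migrating to Kubernetes is recommended:

Manage deployment templates with a Helm chart
Autoscale business services with HPA
Namespace isolation: hotel-platform (production) / hotel-staging (staging) / hotel-dev (development)
Replace the external Nginx with an Ingress controller

2. Docker Compose Deployment

2.1 Service List

The B2B2B hotel platform consists of the following core services: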

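services:
  api-gateway:         # API gateway: single entry point, routing, rate limiting, circuit breaking
  query-engine:        # Query engine: concurrent price queries across suppliers, price sorting
  order-service:       # Order service: order lifecycle management
  supplier-adapter:    # Supplier adapter service: integrates each supplier's API
  matching-service:    # Matching service: intelligent supplier-to-hotel matching
  pricing-service:     # Pricing service: price calculation, markup strategy
  settlement-service:  # Settlement service: reconciliation, settlement, invoice management
  inventory-service:   # Inventory service: room availability sync, inventory management
  risk-service:        # Risk service: transaction risk control, credit assessment
  bi-service:          # Data intelligence service: analytics, reports
  admin-web:           # Admin console: operations UI
  redis:               # Cache: hot data cache, distributed locks
  rabbitmq:            # Message queue: async decoupling, event-driven flows
  postgres:            # Primary database (PostgreSQL 16)
  clickhouse:          # Analytics database: BI data storage
  prometheus:          # Monitoring: metrics collection
  grafana:             # Dashboards: visualization
  loki:                # Log storage and query
  promtail:            # Log shipping agent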

2.2 Core docker-compose.yml Configuration

version: "3.9"

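x-common-env: &common-env
  NODE_ENV: production
  LOG_LEVEL: ${LOG_LEVEL:-info}
  # Connection strings are read from .env so they stay in sync with the passwords
  # given to the infrastructure services; the defaults are for local use only
  REDIS_URL: ${REDIS_URL:-redis://redis:6379/0}
  RABBITMQ_URL: ${RABBITMQ_URL:-amqp://guest:guest@rabbitmq:5672}
  DATABASE_URL: ${DATABASE_URL:-postgresql://hotel:hotel_pass@postgres:5432/hotel_db}
  CLICKHOUSE_URL: clickhouse://clickhouse:8123/hotel_analytics
  # No jaeger service is defined in this file; it has to be provided separately
  JAEGER_ENDPOINT: http://jaeger:14268/api/traces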

x-app-defaults: &app-defaults
  restart: unless-stopped
  logging:
    driver: json-file
    options:
      max-size: "50m"
      max-file: "5"
      tag: "{{.Name}}"
  deploy:
    resources:
      limits:
        memory: 512M
      reservations:
        memory: 128M

services:
  # ─── API gateway ──────────────────────────────────────────
  api-gateway:
    <<: *app-defaults
    build:
      context: ./src/api-gateway
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      <<: *common-env
      PORT: 8080
      RATE_LIMIT_WINDOW_MS: 60000
      RATE_LIMIT_MAX: 100
    depends_on:
      redis:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s
    networks:
      - frontend
      - backend

  # ─── Query engine ─────────────────────────────────────────
  query-engine:
    <<: *app-defaults
    build:
      context: ./src/query-engine
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      CONCURRENCY_LIMIT: 20
      QUERY_TIMEOUT_MS: 10000
    depends_on:
      redis:
        condition: service_healthy
      supplier-adapter:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 20s
    deploy:
      resources:
        limits:
          memory: 1024M
        reservations:
          memory: 256M
    networks:
      - backend

  # ─── Order service ────────────────────────────────────────
  order-service:
    <<: *app-defaults
    build:
      context: ./src/order-service
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      ORDER_EXPIRY_MINUTES: 30
    depends_on:
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - backend

  # ─── Supplier adapter service ─────────────────────────────
  supplier-adapter:
    <<: *app-defaults
    build:
      context: ./src/supplier-adapter
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      ADAPTER_POOL_SIZE: 10
      RETRY_MAX_ATTEMPTS: 3
      RETRY_BACKOFF_MS: 1000
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - backend
      - external-api

  # ─── Matching service ─────────────────────────────────────
  matching-service:
    <<: *app-defaults
    build:
      context: ./src/matching-service
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      MATCHING_ALGORITHM: weighted_score
    depends_on:
      postgres:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - backend

  # ─── Pricing service ──────────────────────────────────────
  pricing-service:
    <<: *app-defaults
    build:
      context: ./src/pricing-service
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      DEFAULT_CURRENCY: CNY
      EXCHANGE_RATE_CACHE_TTL: 3600
    depends_on:
      redis:
        condition: service_healthy
      postgres:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - backend

  # ─── Settlement service ───────────────────────────────────
  settlement-service:
    <<: *app-defaults
    build:
      context: ./src/settlement-service
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      SETTLE_CYCLE_DAYS: 7
    depends_on:
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - backend

  # ─── Inventory service ────────────────────────────────────
  inventory-service:
    <<: *app-defaults
    build:
      context: ./src/inventory-service
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      SYNC_INTERVAL_MS: 300000
    depends_on:
      redis:
        condition: service_healthy
      postgres:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - backend

  # ─── Risk service ─────────────────────────────────────────
  risk-service:
    <<: *app-defaults
    build:
      context: ./src/risk-service
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      RISK_SCORE_THRESHOLD: 80
    depends_on:
      redis:
        condition: service_healthy
      postgres:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - backend

  # ─── Data intelligence service (BI) ───────────────────────
  bi-service:
    <<: *app-defaults
    build:
      context: ./src/bi-service
      dockerfile: Dockerfile
    environment:
      <<: *common-env
      CLICKHOUSE_URL: clickhouse://clickhouse:8123/hotel_analytics
    depends_on:
      clickhouse:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 1024M
        reservations:
          memory: 256M
    networks:
      - backend

  # ─── Admin console ────────────────────────────────────────
  admin-web:
    <<: *app-defaults
    build:
      context: ./src/admin-web
      dockerfile: Dockerfile
    ports:
      - "3000:3000"
    environment:
      <<: *common-env
      NEXT_PUBLIC_API_URL: /api
    depends_on:
      api-gateway:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000"]
      interval: 20s
      timeout: 5s
      retries: 3
    networks:
      - frontend

  # ─── Infrastructure services ──────────────────────────────

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    command: >
      redis-server
      --maxmemory 512mb
      --maxmemory-policy allkeys-lru
      --appendonly yes
    volumes:
      - redis-data:/data
    ports:
      - "127.0.0.1:6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3
    networks:
      - backend

  rabbitmq:
    image: rabbitmq:3-management-alpine
    restart: unless-stopped
    environment:
      RABBITMQ_DEFAULT_USER: guest
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD:-guest}
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq
    ports:
      - "127.0.0.1:5672:5672"
      - "127.0.0.1:15672:15672"
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - backend

  postgres:
    image: ghcr.io/pgvector/pgvector:pg16
    restart: unless-stopped
    environment:
      POSTGRES_USER: hotel
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-hotel_pass}
      POSTGRES_DB: hotel_db
    volumes:
      - postgres-data:/var/lib/postgresql/data
    ports:
      - "127.0.0.1:5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U hotel -d hotel_db"]
      interval: 10s
      timeout: 3s
      retries: 5
    networks:
      - backend

  clickhouse:
    image: clickhouse/clickhouse-server:24-alpine
    restart: unless-stopped
    environment:
      CLICKHOUSE_USER: hotel
      CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD:-hotel_click}
      CLICKHOUSE_DB: hotel_analytics
    volumes:
      - clickhouse-data:/var/lib/clickhouse
    ports:
      - "127.0.0.1:8123:8123"
      - "127.0.0.1:9000:9000"
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8123/ping"]
      interval: 10s
      timeout: 3s
      retries: 3
    networks:
      - backend

  prometheus:
    image: prom/prometheus:v2.52.0
    restart: unless-stopped
    volumes:
      - ./deploy/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "127.0.0.1:9090:9090"
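    networks:
      - backend      # required to scrape the application services
      - monitoring   # internal-only network shared with Grafana and Loki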

  grafana:
    image: grafana/grafana:10.4.0
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./deploy/docker/grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "127.0.0.1:3001:3000"
    depends_on:
      - prometheus
    networks:
      - frontend
      - monitoring

  loki:
    image: grafana/loki:2.9.0
    restart: unless-stopped
    volumes:
      - ./deploy/docker/loki/local-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
    ports:
      - "127.0.0.1:3100:3100"
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:2.9.0
    restart: unless-stopped
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./deploy/docker/promtail/config.yml:/etc/promtail/config.yml
    depends_on:
      - loki
    networks:
      - monitoring

volumes:
  redis-data:
  rabbitmq-data:
  postgres-data:
  clickhouse-data:
  prometheus-data:
  grafana-data:
  loki-data:

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: false
  external-api:
    driver: bridge
    internal: false
  monitoring:
    driver: bridge
    internal: true

2.3 Environment Variable Management

.env File Structure

# .env.example: copy to .env and fill in real values

# ─── Application ─────────────────────────────
NODE_ENV=production
LOG_LEVEL=info
PORT=8080

# ─── Database ────────────────────────────────
POSTGRES_PASSWORD=your_secure_db_password
DATABASE_URL=postgresql://hotel:your_secure_db_password@postgres:5432/hotel_db
CLICKHOUSE_PASSWORD=your_secure_clickhouse_password

# ─── Redis ───────────────────────────────────
REDIS_PASSWORD=your_secure_redis_password
REDIS_URL=redis://:your_secure_redis_password@redis:6379/0

# ─── RabbitMQ ────────────────────────────────
RABBITMQ_PASSWORD=your_secure_rabbitmq_password
RABBITMQ_URL=amqp://guest:your_secure_rabbitmq_password@rabbitmq:5672

# ─── Monitoring ──────────────────────────────
GRAFANA_PASSWORD=your_secure_grafana_password

# ─── Supplier API keys (stored encrypted) ────
SUPPLIER_A_API_KEY=supplier_a_key
SUPPLIER_B_API_KEY=supplier_b_key
SUPPLIER_C_API_KEY=supplier_c_key

# ─── Encryption ──────────────────────────────
ENCRYPTION_SECRET=your_32_char_encryption_secret_key

# ─── JWT ─────────────────────────────────────
JWT_SECRET=your_jwt_secret_key_min_32_chars
JWT_EXPIRY_HOURS=24

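Secrets Management Strategy

Tier	Method	Example
Development	`.env` file (listed in .gitignore)	Local development settings
Staging	GitHub Secrets + `.env.staging`	Injected automatically by CI/CD
Production	Cloud key management + manual injection over SSH	Alibaba Cloud KMS / AWS Secrets Manager
Emergency access	Ops use Vault or a 1Password team vault	Database root password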
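
Whichever tier supplies the values, a service can refuse to start when a required secret is missing or still holds a template placeholder. The following is a minimal sketch under stated assumptions: the variable names come from the `.env` template, while `validateEnv`, the `Rule` type, and the specific checks (non-empty, no `your_` prefix, minimum length) are hypothetical and not part of the platform code.

```typescript
// Hypothetical startup check. Variable names follow the .env template;
// the validation rules themselves are assumptions made for illustration.
type Env = Record<string, string | undefined>;

interface Rule {
  name: string;
  minLength?: number;
}

const REQUIRED: Rule[] = [
  { name: "DATABASE_URL" },
  { name: "REDIS_URL" },
  { name: "RABBITMQ_URL" },
  { name: "ENCRYPTION_SECRET", minLength: 32 },
  { name: "JWT_SECRET", minLength: 32 },
];

/** Returns human-readable problems; an empty list means the environment is usable. */
function validateEnv(env: Env, rules: Rule[] = REQUIRED): string[] {
  const problems: string[] = [];
  for (const rule of rules) {
    const value = env[rule.name];
    if (value === undefined || value.trim() === "") {
      problems.push(`${rule.name} is missing`);
      continue;
    }
    if (value.indexOf("your_") === 0) {
      problems.push(`${rule.name} still holds the template placeholder`);
    }
    if (rule.minLength !== undefined && value.length < rule.minLength) {
      problems.push(`${rule.name} must be at least ${rule.minLength} characters`);
    }
  }
  return problems;
}
```

In a Node.js service this would typically be called once at boot as `validateEnv(process.env)`, exiting with a non-zero status when the returned list is not empty.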

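Multi-Environment Configuration

deploy/
├── .env.example          # Template (committed to Git)
├── .env.dev              # Development (not committed)
├── .env.staging          # Staging (not committed)
├── .env.production       # Production (not committed)
└── docker/
    ├── compose.base.yml      # Shared service definitions
    ├── compose.dev.yml       # docker compose -f compose.base.yml -f compose.dev.yml up
    ├── compose.staging.yml   # docker compose -f compose.base.yml -f compose.staging.yml up
    └── compose.production.yml

Each compose.*.yml file is layered on top of compose.base.yml and overrides only the settings that differ per environment (port mappings, replica counts, and so on):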

# compose.staging.yml example
services:
  api-gateway:
    environment:
      NODE_ENV: staging
      LOG_LEVEL: debug
    deploy:
      replicas: 1

3. CI/CD Pipeline

3.1 Git Repository Structure

b2b2b-hotel-platform/
├── src/                          # Source code
│   ├── api-gateway/
│   │   ├── src/
│   │   ├── tests/
│   │   ├── Dockerfile
│   │   └── package.json
│   ├── query-engine/
│   ├── order-service/
│   ├── supplier-adapter/
│   ├── matching-service/
│   ├── pricing-service/
│   ├── settlement-service/
│   ├── inventory-service/
│   ├── risk-service/
│   ├── bi-service/
│   └── admin-web/
├── deploy/                       # Deployment configuration
│   ├── docker/
│   │   ├── compose.base.yml
│   │   ├── compose.dev.yml
│   │   ├── compose.staging.yml
│   │   ├── compose.production.yml
│   │   ├── prometheus/
│   │   ├── grafana/
│   │   ├── loki/
│   │   └── promtail/
│   ├── k8s/                      # K8s manifests (future)
│   │   ├── helm/
│   │   │   └── hotel-platform/
│   │   │       ├── Chart.yaml
│   │   │       ├── values.yaml
│   │   │       ├── values-staging.yaml
│   │   │       └── templates/
│   │   └── base/
│   │       ├── namespace.yaml
│   │       └── network-policy.yaml
│   └── scripts/
│       ├── deploy.sh
│       ├── rollback.sh
│       ├── backup-db.sh
│       └── health-check.sh
├── infra/                        # Infrastructure
│   ├── pigsty/                   # PostgreSQL deployment
│   ├── monitoring/               # Monitoring configuration
│   └── terraform/                # IaC (future)
├── docs/                         # Documentation
│   └── infra/
│       ├── deploy.md             # This document
│       ├── pigsty-setup.md
│       └── monitoring.md
├── .github/
│   └── workflows/
│       ├── ci.yml                # PR pipeline
│       ├── cd-staging.yml        # Staging deployment
│       └── cd-production.yml     # Production deployment
├── docker-compose.yml            # Top-level orchestration entry point
├── Makefile
└── README.md

3.2 GitHub Actions Pipeline

PR Stage (ci.yml): Code Quality Gate

name: CI Pipeline
on:
  pull_request:
    branches: [main, develop]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service:
          - api-gateway
          - query-engine
          - order-service
          - supplier-adapter
          - matching-service
          - pricing-service
          - settlement-service
          - inventory-service
          - risk-service
          - bi-service
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
          # npm ci installs from the lockfile, so the cache is keyed on it
          cache-dependency-path: src/${{ matrix.service }}/package-lock.json

      - name: Install dependencies
        working-directory: src/${{ matrix.service }}
        run: npm ci

      - name: Lint
        working-directory: src/${{ matrix.service }}
        run: npm run lint

      - name: Type check
        working-directory: src/${{ matrix.service }}
        run: npm run typecheck

      - name: Unit tests
        working-directory: src/${{ matrix.service }}
        run: npm run test:unit -- --coverage

      - name: Integration tests (mock)
        working-directory: src/${{ matrix.service }}
        run: npm run test:integration
        env:
          DATABASE_URL: "postgresql://test:test@localhost:5432/test_db"

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Dependency audit
        run: |
          npm audit --audit-level=high --omit=dev || true
      - name: Trivy filesystem scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: "fs"
          scan-ref: "."

  build-check:
    runs-on: ubuntu-latest
    needs: [lint-and-test]
    strategy:
      matrix:
        service:
          - api-gateway
          - query-engine
          - order-service
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t ${{ matrix.service }}:test ./src/${{ matrix.service }}

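Merge Stage: Automatic Build and Push

name: Build & Push
on:
  push:
    branches: [main, develop]

jobs:
  # Sketch: lists the src/<service> directories touched by this push that contain a
  # Dockerfile. Assumes github.event.before is a valid commit (not a newly created branch).
  changed-services:
    runs-on: ubuntu-latest
    outputs:
      services: ${{ steps.detect.outputs.services }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - id: detect
        run: |
          services=$(git diff --name-only "${{ github.event.before }}" "${{ github.sha }}" \
            | awk -F/ '$1 == "src" && NF > 2 { print $2 }' | sort -u \
            | while read -r s; do if [ -f "src/$s/Dockerfile" ]; then echo "$s"; fi; done \
            | jq -R . | jq -sc .)
          echo "services=$services" >> "$GITHUB_OUTPUT"

  build-and-push:
    needs: [changed-services]
    # An empty matrix is an error, so skip the job when nothing changed
    if: needs.changed-services.outputs.services != '[]'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write        # required to push to ghcr.io with GITHUB_TOKEN
    strategy:
      matrix:
        service: ${{ fromJson(needs.changed-services.outputs.services) }}
    steps:
      - uses: actions/checkout@v4

      - name: Login to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build & Push
        uses: docker/build-push-action@v5
        with:
          context: ./src/${{ matrix.service }}
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/${{ matrix.service }}:${{ github.sha }}
            ghcr.io/${{ github.repository }}/${{ matrix.service }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max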

Tag Stage: Release and Deployment

name: Deploy Staging
on:
  push:
    tags: ["v*"]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to Staging
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.STAGING_HOST }}
          username: deploy
          key: ${{ secrets.STAGING_SSH_KEY }}
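          script: |
            set -e
            cd /opt/hotel-platform
            # Deploy the tagged commit rather than whatever main currently points to
            git fetch --tags origin
            git checkout ${{ github.ref_name }}
            export IMAGE_TAG=${{ github.ref_name }}
            docker compose -f deploy/docker/compose.base.yml -f deploy/docker/compose.staging.yml pull
            docker compose -f deploy/docker/compose.base.yml -f deploy/docker/compose.staging.yml up -d
            ./deploy/scripts/health-check.sh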

3.3 Release Process

Canary Release

                 ┌───────────────────────┐
                 │  Start release v1.2.0 │
                 └───────────┬───────────┘
                             │
                 ┌───────────▼───────────┐
                 │  Deploy canary (5%)   │
                 │  Monitor for 5 min    │
                 └───────────┬───────────┘
                             │
                    ┌────────┴────────┐
                    │ Error rate < 1%?│
                    │ P99 < 500 ms?   │
                    └───┬─────────┬───┘
                   Yes  │         │  No
                        │         │
        ┌───────────────▼───┐  ┌──▼────────────────┐
        │ 30% traffic       │  │ Automatic rollback│
        │ Monitor for 10 min│  │ Notify ops        │
        └───────────────┬───┘  └───────────────────┘
                        │
                 ┌──────┴──────┐
                 │ Checks pass │
                 └──────┬──────┘
                        │
              ┌─────────▼───┐
              │ 100% rollout│
              │ Release done│
              └─────────────┘
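
The gate in this flow can be sketched as a pure function that takes the current traffic share and the observed metrics and returns either the next traffic share or a rollback. The thresholds (error rate below 1%, P99 below 500 ms) and the 5% / 30% / 100% stages come from the flow; the names `evaluateCanary`, `CanaryMetrics`, and `CanaryDecision` are hypothetical, and applying the same thresholds at the 30% stage is an assumption.

```typescript
// Sketch of the canary promotion gate. Names are hypothetical; thresholds and
// traffic stages mirror the release flow described in this section.
interface CanaryMetrics {
  errorRate: number;    // fraction of failed requests, 0.01 means 1%
  p99LatencyMs: number; // 99th percentile latency in milliseconds
}

type CanaryDecision =
  | { action: "promote"; nextTrafficPct: number }
  | { action: "rollback"; reason: string };

const TRAFFIC_STAGES: number[] = [5, 30, 100];

function evaluateCanary(currentPct: number, metrics: CanaryMetrics): CanaryDecision {
  if (metrics.errorRate >= 0.01) {
    return { action: "rollback", reason: `error rate ${metrics.errorRate} is not below 1%` };
  }
  if (metrics.p99LatencyMs >= 500) {
    return { action: "rollback", reason: `P99 ${metrics.p99LatencyMs} ms is not below 500 ms` };
  }
  const index = TRAFFIC_STAGES.indexOf(currentPct);
  if (index === -1) {
    return { action: "rollback", reason: `unknown traffic stage ${currentPct}%` };
  }
  // At the last stage there is nothing left to promote to, so stay at 100%
  const next = TRAFFIC_STAGES[index + 1] ?? 100;
  return { action: "promote", nextTrafficPct: next };
}
```

Keeping the decision free of I/O makes it easy to unit test; the deployment script would feed it metrics queried from Prometheus after each monitoring window.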

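Rollback Strategy

Scenario	Trigger	Rollback method	Rollback time
Automatic rollback	Error rate > 1% for 2 minutes	Switch back to the previous version's containers	< 30s
Manual rollback	Ops judges the release abnormal	Redeploy the previous image tag, for example via `deploy/scripts/rollback.sh` (Docker Compose has no built-in rollback command)	< 1min
Database rollback	Migration script fails	Run the matching `down.sql` script	< 2min

Versioned Database Migrations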

src/db-migrations/
├── 001_create_suppliers.up.sql
├── 001_create_suppliers.down.sql
├── 002_create_hotels.up.sql
├── 002_create_hotels.down.sql
├── 003_create_orders.up.sql
├── 003_create_orders.down.sql
├── 004_add_settlement_fields.up.sql
├── 004_add_settlement_fields.down.sql
└── schema_migrations.sql       # Version tracking table

Recommended migration tools: node-pg-migrate or Flyway

# Makefile integration (recipe lines must be indented with a tab)
# node-pg-migrate reads the connection string from the DATABASE_URL environment variable
db-migrate-up:
	npx node-pg-migrate up -m src/db-migrations

db-migrate-down:
	npx node-pg-migrate down -m src/db-migrations

db-migrate-create:
	npx node-pg-migrate create $(name) -j sql -m src/db-migrations

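4. Configuration Management

4.1 Configuration Center Design

In the MVP stage the configuration center is backed by the database, so no extra middleware is needed: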

CREATE TABLE platform_config (
    id          SERIAL PRIMARY KEY,
    key         VARCHAR(255) NOT NULL,
    value       JSONB NOT NULL,
    env         VARCHAR(20) NOT NULL DEFAULT 'production',
    description TEXT,
    updated_at  TIMESTAMPTZ DEFAULT NOW(),
    updated_by  VARCHAR(100),
    -- A key is unique per environment; the constraint also creates the (env, key) index
    UNIQUE (env, key)
);

Configuration Example

{
  "key": "matching.algorithm.weights",
  "value": {
    "price_weight": 0.4,
    "rating_weight": 0.3,
    "availability_weight": 0.2,
    "supplier_reliability_weight": 0.1
  },
  "env": "production",
  "description": "Weights for the hotel matching algorithm"
}
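
To illustrate how such a weight record might be consumed, the sketch below combines four normalized signals into one score. It assumes each signal is already scaled to [0, 1] with higher meaning better; the `CandidateSignals` type and the `weightedScore` function are hypothetical and only show the arithmetic, not the platform's actual matching algorithm.

```typescript
// Hypothetical consumer of the matching.algorithm.weights record.
// Field names of MatchingWeights follow the JSON example; everything else is assumed.
interface MatchingWeights {
  price_weight: number;
  rating_weight: number;
  availability_weight: number;
  supplier_reliability_weight: number;
}

interface CandidateSignals {
  price: number;               // 1 = cheapest, 0 = most expensive (assumed normalization)
  rating: number;              // 0..1
  availability: number;        // 0..1
  supplierReliability: number; // 0..1
}

function weightedScore(w: MatchingWeights, s: CandidateSignals): number {
  const total =
    w.price_weight + w.rating_weight + w.availability_weight + w.supplier_reliability_weight;
  // Reject weight sets that do not sum to 1, allowing for floating-point error
  if (Math.abs(total - 1) > 1e-9) {
    throw new Error(`weights must sum to 1, got ${total}`);
  }
  return (
    w.price_weight * s.price +
    w.rating_weight * s.rating +
    w.availability_weight * s.availability +
    w.supplier_reliability_weight * s.supplierReliability
  );
}
```

With the example weights, a candidate scoring 0.5 on price, 1 on rating, 0 on availability, and 1 on reliability gets 0.4 × 0.5 + 0.3 × 1 + 0.2 × 0 + 0.1 × 1 = 0.6.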

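Future etcd Migration Path

Once the number of services grows (> 20 microservices) or configuration must be synchronized across clusters, migrate to etcd:

etcd cluster (3 nodes)
├── /hotel-platform/
│   ├── /config/
│   │   ├── /api-gateway/
│   │   ├── /query-engine/
│   │   └── /global/
│   ├── /features/          # Feature Flags
│   └── /secrets/           # References to encrypted secrets

4.2 Hot Reload Mechanism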

┌──────────────┐    Watch     ┌──────────────┐    Poll      ┌──────────┐
│ Config center│◄─────────────│ Config Client│─────────────►│  DB/etcd │
│  (Source)    │   Change     │  (SDK)       │  every 30s   │          │
└──────────────┘              └──────┬───────┘              └──────────┘
                                     │
                              ┌──────▼───────┐
                              │ In-memory    │
                              │ local cache  │
                              └──────┬───────┘
                                     │
                              ┌──────▼───────┐
                              │ Business     │
                              │ logic (reads)│
                              └──────────────┘

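Implementation notes:

Config Client SDK: every service embeds a configuration client that pulls the full configuration from the configuration center at startup
Periodic polling + local cache: a lightweight poll every 30 seconds refreshes the local cache when the configuration has changed
Watch notifications: based on PostgreSQL LISTEN/NOTIFY or the etcd Watch mechanism
No restart required: business logic always reads the latest value from the in-memory cache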

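// Config client sketch. `ConfigSource` is a hypothetical abstraction over the
// DB- or etcd-backed configuration center.
interface ConfigSource {
  fetchAll(): Promise<Record<string, unknown>>;
}

class ConfigClient {
  private cache = new Map<string, unknown>();
  private timer?: ReturnType<typeof setInterval>;

  constructor(private readonly source: ConfigSource) {}

  async start(intervalMs = 30_000): Promise<void> {
    await this.loadAll();                       // full load at startup
    this.timer = setInterval(() => {            // then poll every 30s
      this.loadAll().catch((err) => console.error("config reload failed", err));
    }, intervalMs);
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
  }

  private async loadAll(): Promise<void> {
    // Swap the whole map so readers never observe a half-updated cache
    this.cache = new Map(Object.entries(await this.source.fetchAll()));
  }

  get<T>(key: string, defaultValue: T): T {
    return (this.cache.get(key) as T | undefined) ?? defaultValue;
  }
}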
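
The polling step does not say how a change is detected. One possible approach, sketched below under stated assumptions, is to compare the previous and the freshly loaded snapshot and report only the keys that were added, removed, or modified, so listeners are not notified on every poll. `changedKeys` is a hypothetical helper; it compares values through `JSON.stringify`, which is sensitive to object key order and is therefore only suitable when the source serializes values consistently.

```typescript
// Hypothetical change detection between two configuration snapshots.
type Snapshot = Map<string, unknown>;

function changedKeys(prev: Snapshot, next: Snapshot): string[] {
  const changed: string[] = [];
  // Keys that are new or whose serialized value differs
  next.forEach((value, key) => {
    if (!prev.has(key) || JSON.stringify(prev.get(key)) !== JSON.stringify(value)) {
      changed.push(key);
    }
  });
  // Keys that disappeared from the new snapshot
  prev.forEach((_value, key) => {
    if (!next.has(key)) changed.push(key);
  });
  return changed.sort();
}
```

A client could call this after each reload and invoke registered callbacks only for the returned keys.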

4.3 Feature Flags

CREATE TABLE feature_flags (
    id          SERIAL PRIMARY KEY,
    name        VARCHAR(100) NOT NULL UNIQUE,
    enabled     BOOLEAN NOT NULL DEFAULT false,
    rollout_pct INTEGER NOT NULL DEFAULT 100,  -- 0~100
    whitelist   TEXT[],                        -- user/tenant whitelist
    description TEXT,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

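Use Cases

Flag name	Default	Description
`supplier.realtime.sync`	false	Real-time supplier sync (enabled gradually)
`risk.auto.reject`	false	Automatic order rejection by risk control
`matching.ai.boost`	false	AI-based matching recommendation weight
`settlement.auto.reconcile`	false	Automatic reconciliation
`admin.v2`	false	Admin console V2 UI

Usage in Business Code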

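import { createHash } from "node:crypto";

// Stable bucket in [0, 100) for a tenant/flag pair. SHA-256 stands in for the
// murmurhash call of the original snippet, which was never imported.
function bucketOf(tenantId: string, flagName: string): number {
  return createHash("sha256").update(tenantId + flagName).digest().readUInt32BE(0) % 100;
}

async function checkFeatureFlag(name: string, context: { tenantId: string }): Promise<boolean> {
  const flag = await configClient.getFeatureFlag(name);
  if (!flag?.enabled) return false;
  // Whitelisted tenants always get the feature, regardless of the rollout percentage
  if (flag.whitelist?.includes(context.tenantId)) return true;
  // Everyone else is admitted by percentage: 0 admits nobody, 100 admits everybody
  return bucketOf(context.tenantId, name) < flag.rollout_pct;
}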

5. Log Collection

5.1 Log Format

All services emit structured JSON logs in a common format:

{
  "timestamp": "2025-01-15T08:30:15.123Z",
  "level": "INFO",
  "service": "query-engine",
  "traceId": "abc123def456",
  "spanId": "789ghi",
  "message": "Query completed successfully",
  "context": {
    "supplierCount": 5,
    "resultCount": 12,
    "durationMs": 342
  }
}

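Log field specification:

Field	Type	Required	Description
`timestamp`	ISO 8601	✓	Log time (UTC)
`level`	string	✓	Log level
`service`	string	✓	Service name
`traceId`	string	✓	Trace ID
`spanId`	string	✗	Span ID (for distributed tracing)
`message`	string	✓	Log message
`context`	object	✗	Structured additional information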
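
A formatter that produces this shape can be sketched in a few lines. The field names follow the specification; `formatLog`, the `Level` ordering, and the `minLevel` parameter are hypothetical names introduced for illustration, and a real service would more likely configure an existing logging library to emit the same fields.

```typescript
// Hypothetical formatter for the structured log line described in this section.
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

const LEVEL_ORDER: Record<Level, number> = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };

interface LogFields {
  service: string;
  traceId: string;
  spanId?: string;
  message: string;
  context?: Record<string, unknown>;
}

/** Returns the JSON line, or null when the level is below the configured minimum. */
function formatLog(
  level: Level,
  fields: LogFields,
  minLevel: Level = "INFO",
  now: Date = new Date(),
): string | null {
  if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return null;
  const entry: Record<string, unknown> = {
    timestamp: now.toISOString(), // always UTC, ISO 8601
    level,
    service: fields.service,
    traceId: fields.traceId,
  };
  // Optional fields are omitted entirely rather than written as null
  if (fields.spanId !== undefined) entry.spanId = fields.spanId;
  entry.message = fields.message;
  if (fields.context !== undefined) entry.context = fields.context;
  return JSON.stringify(entry);
}
```

The caller writes the returned line to stdout, which is where the Docker logging driver picks it up.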

5.2 Collection Stack: Loki + Promtail

Loki + Promtail is the chosen log collection stack; it is lighter than ELK and uses fewer resources.

Architecture

┌──────────┐ ┌──────────┐     ┌──────────┐
│ Service A│ │ Service B│ ... │ Service N│
│ (stdout) │ │ (stdout) │     │ (stdout) │
└────┬─────┘ └────┬─────┘     └────┬─────┘
     │            │                │
     └────────────┼────────────────┘
                  │
        Docker logging driver
             (json-file)
                  │
        ┌─────────▼─────────┐
        │   Promtail agent  │
        │   (one per node)  │
        │   - reads logs    │
        │   - adds labels   │
        └─────────┬─────────┘
                  │
        ┌─────────▼─────────┐
        │       Loki        │
        │   - log storage   │
        │   - label index   │
        │   - query engine  │
        └─────────┬─────────┘
                  │
        ┌─────────▼─────────┐
        │      Grafana      │
        │   - LogQL queries │
        │   - metric links  │
        └───────────────────┘

Promtail Configuration

# deploy/docker/promtail/config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
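    relabel_configs:
      # Docker reports container names with a leading slash; strip it
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: "container"
      - source_labels: ["__meta_docker_container_log_stream"]
        target_label: "stream"
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            traceId: traceId
            message: message
      # Only low-cardinality fields become labels. traceId stays in the log line
      # and is filtered at query time, since one label value per trace would
      # create an unbounded number of streams.
      - labels:
          level:
          service: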

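5.3 Log Levels

Level	Purpose	Emitted in production	Examples
ERROR	Errors that need immediate attention and trigger alerts	✓	Database connection failure, third-party API timeout
WARN	Potential problems that may affect the business	✓	Second retry attempt, cache hit rate below 50%
INFO	Records of key business operations	✓	Order created, supplier price query completed, settlement completed
DEBUG	Detailed debugging output, enabled only while troubleshooting	✗	SQL statements, HTTP request details, internal state changes

In production the level is controlled through the LOG_LEVEL environment variable: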

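# Production
LOG_LEVEL=info

# Temporarily enable debug output while troubleshooting. Running a second process
# with `docker compose exec` would not change the level of the running service,
# so recreate the container with the override instead (assumes the compose file
# reads LOG_LEVEL from the environment, e.g. LOG_LEVEL: ${LOG_LEVEL:-info})
LOG_LEVEL=debug docker compose up -d --no-deps --force-recreate query-engine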

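5.4 Log Retention Policy

Tier	Retention	Storage	Cost optimization
Hot	30 days	Local disk / S3 Standard	Label index kept, directly queryable with LogQL
Warm	90 days	S3 Infrequent Access	Compressed storage, metadata index only
Cold	1 year	S3 Glacier / OSS Archive	Heavily compressed, restored on demand
Expired	Deleted automatically	-	Loki retention settings

Loki retention configuration: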

# loki/local-config.yaml
schema_config:
  configs:
    - from: 2025-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  filesystem:
    directory: /loki/chunks

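limits_config:
  retention_period: 720h          # 30 days of hot data
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

# With boltdb-shipper, retention_period is enforced by the compactor
compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  retention_enabled: true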

5.5 Trace ID Injection (End-to-End Tracing)

OpenTelemetry Integration

// src/shared/tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || "unknown",
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://jaeger:4317",
  }),
});

sdk.start();

链路追踪流程

客户端请求
    │
    ├── [X-Request-ID: abc123] ─────────────────────────────
    │                                                         │
┌───▼──────────────────────────────────────────────────────────┐
│ API Gateway                                                  │
│   TraceID: abc123 | Span: gateway.process                    │
│   └──→ calls query-engine                                    │
│         forwards header: traceparent: 00-abc123-...          │
└───┬──────────────────────────────────────────────────────────┘
    │
┌───▼──────────────────────────────────────────────────────────┐
│ Query Engine                                                 │
│   TraceID: abc123 | Span: query.search                       │
│   ├──→ calls supplier-adapter (supplier-A)                   │
│   │     Span: adapter.supplier-a.query                       │
│   ├──→ calls supplier-adapter (supplier-B)                   │
│   │     Span: adapter.supplier-b.query                       │
│   └──→ writes to Redis cache                                 │
│         Span: cache.write                                    │
└──────────────────────────────────────────────────────────────┘

In Grafana, searching for TraceID abc123 shows the complete call chain (abc123 stands in for a full 32-character hexadecimal trace ID).
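
The `traceparent` header passed between services follows the W3C Trace Context format: `version-traceid-spanid-flags`, with a 32-hex-character trace ID and a 16-hex-character span ID. The sketch below formats and parses version `00` headers. In practice the OpenTelemetry SDK does this automatically; the helper names here are hypothetical and exist only to show what travels on the wire.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// Version 00 layout: 00-<trace-id>-<parent-span-id>-<flags>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Example header taken from the W3C specification.
const example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const parsed = parseTraceparent(example);
```

Every service in the call chain keeps the same trace ID and replaces only the span ID with its own, which is what lets the backend stitch the spans into one trace.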

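6. Scaling Path

6.1 Single Node → Multiple Nodes

When and how to scale

Stage 1: Single node (0~10,000 orders/day)
   ├── All services and databases on the same server
   └── Estimated cost: ¥300~500/month

Stage 2: Data separation (10,000~50,000 orders/day)           ← first scaling step
   ├── Application server (8C16G) × 1
   ├── Database server (16C32G) × 1
   └── Estimated cost: ¥800~1,200/month

Stage 3: Horizontal app scaling (50,000~100,000 orders/day)   ← second scaling step
   ├── Application servers (8C16G) × 2~3
   ├── Database servers (16C32G) × 1 primary + 1 replica
   ├── Load balancer × 1
   └── Estimated cost: ¥2,000~3,500/month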
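
The stage thresholds above can be expressed as a small lookup, for example to drive a capacity-planning dashboard. This is only a sketch: the names are hypothetical, and because the stated ranges share their boundary values, it assumes each upper bound is inclusive.

```typescript
interface Stage {
  name: string;
  maxDailyOrders: number;           // inclusive upper bound (assumption)
  monthlyCostCny: [number, number]; // estimated range from the stage list
}

const STAGES: ReadonlyArray<Stage> = [
  { name: "single node", maxDailyOrders: 10000, monthlyCostCny: [300, 500] },
  { name: "data separation", maxDailyOrders: 50000, monthlyCostCny: [800, 1200] },
  { name: "horizontal app scaling", maxDailyOrders: 100000, monthlyCostCny: [2000, 3500] },
];

// Returns the first stage whose bound covers the load, or null when the load
// exceeds stage 3 and the Docker → K8s migration criteria apply instead.
function stageFor(dailyOrders: number): Stage | null {
  if (dailyOrders < 0) throw new RangeError("dailyOrders must be non-negative");
  for (const stage of STAGES) {
    if (dailyOrders <= stage.maxDailyOrders) return stage;
  }
  return null;
}

const current = stageFor(32000); // falls in the "data separation" stage
```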

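Steps for moving the database to its own server

# 1. Install PostgreSQL (Pigsty) on the database server
#    See docs/infra/pigsty-setup.md

# 2. Export the existing data
#    (stop the services that write to the database first so the dump stays consistent with the cutover)
docker exec postgres pg_dump -U hotel hotel_db > backup.sql

# 3. Import into the new server
psql -h db-server -U hotel -d hotel_db < backup.sql

# 4. Point the applications at the new server
# Edit .env: DATABASE_URL=postgresql://hotel:xxx@db-server:5432/hotel_db

# 5. Stop and remove the postgres container
#    (also delete the postgres service from the compose file so it is not recreated)
docker compose rm -s postgres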

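Horizontal scaling of application services

# Scale out with Docker Compose replicas
# Edit compose.production.yml
# (a service with replicas > 1 must not set container_name or publish a fixed host port)
services:
  query-engine:
    deploy:
      replicas: 3
  order-service:
    deploy:
      replicas: 2

# Nginx load balancer in front of the application nodes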
upstream api_backend {
    least_conn;
    server app-node-1:8080;
    server app-node-2:8080;
    server app-node-3:8080;
}

server {
    listen 80;
    server_name api.hotel-platform.com;

    location / {
        proxy_pass http://api_backend;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Request-ID $request_id;
    }
}
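
The `least_conn` directive in the upstream block sends each new request to the backend with the fewest active connections. The sketch below shows the core selection rule only. It is a simplification: Nginx also takes server weights into account and rotates among tied servers, whereas this version simply returns the first of the tied backends. `Backend` and `pickLeastConnected` are hypothetical names.

```typescript
interface Backend {
  address: string;
  activeConnections: number;
}

// Pick the backend with the fewest active connections; ties go to the earliest entry.
function pickLeastConnected(backends: ReadonlyArray<Backend>): Backend {
  if (backends.length === 0) throw new Error("no backends available");
  return backends.reduce((best, candidate) =>
    candidate.activeConnections < best.activeConnections ? candidate : best,
  );
}

const pool: Backend[] = [
  { address: "app-node-1:8080", activeConnections: 12 },
  { address: "app-node-2:8080", activeConnections: 4 },
  { address: "app-node-3:8080", activeConnections: 9 },
];
const chosen = pickLeastConnected(pool); // app-node-2:8080
```

Compared with plain round robin, this rule suits price queries whose duration varies widely by supplier, because a node stuck on slow requests stops receiving new ones.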

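6.2 Docker → K8s

When to migrate

Metric	Docker Compose range	K8s migration threshold
Daily orders	< 100,000	> 100,000
Business service instances	< 30	> 30
Deployment frequency	< 3 per day	> 3 per day
Ops headcount	1~2 people	≥ 3 people

Helm Chart design

deploy/k8s/helm/hotel-platform/
├── Chart.yaml
├── values.yaml                    # defaults
├── values-staging.yaml            # staging overrides
├── values-production.yaml         # production overrides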
└── templates/
    ├── _helpers.tpl
    ├── namespace.yaml
    ├── network-policy.yaml
    ├── api-gateway/
    │   ├── deployment.yaml
    │   ├── service.yaml
    │   ├── hpa.yaml
    │   └── podmonitor.yaml
    ├── query-engine/
    │   ├── deployment.yaml
    │   ├── service.yaml
    │   └── hpa.yaml
    ├── order-service/
    │   └── ...
    ├── redis/
    │   ├── statefulset.yaml
    │   └── service.yaml
    ├── postgres/
    │   └── statefulset.yaml
    ├── rabbitmq/
    │   └── statefulset.yaml
    ├── clickhouse/
    │   └── statefulset.yaml
    ├── prometheus/
    │   └── ...
    ├── grafana/
    │   └── ...
    └── ingress.yaml

HPA (Horizontal Pod Autoscaler) configuration

# templates/query-engine/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: query-engine
  namespace: hotel-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: query-engine
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
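    # Pods metrics come from the custom metrics API, so http_requests_per_second
    # must be exposed by an adapter such as prometheus-adapter
    - type: Pods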
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
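
To see how the targets above turn into a replica count, the sketch below applies the formula from the Kubernetes HPA documentation, `desired = ceil(current × currentMetric / targetMetric)`, takes the largest proposal across metrics, and clamps it to `minReplicas`/`maxReplicas`. It is a simplified model: the `behavior` policies (50% per 60 s up, 25% per 120 s down), stabilization windows, and handling of unready pods are deliberately left out, and the metric readings are made-up numbers.

```typescript
interface MetricReading {
  name: string;
  current: number; // observed average across pods
  target: number;  // target value from the HPA spec
}

// Default HPA tolerance: a ratio within 10% of 1.0 leaves the replica count unchanged.
const TOLERANCE = 0.1;

function proposalFor(currentReplicas: number, metric: MetricReading): number {
  const ratio = metric.current / metric.target;
  if (Math.abs(ratio - 1) <= TOLERANCE) return currentReplicas;
  return Math.ceil(currentReplicas * ratio);
}

function desiredReplicas(
  currentReplicas: number,
  metrics: MetricReading[],
  minReplicas: number,
  maxReplicas: number,
): number {
  if (metrics.length === 0) return currentReplicas;
  // With several metrics, the largest proposal wins.
  const largest = Math.max(...metrics.map((m) => proposalFor(currentReplicas, m)));
  return Math.min(maxReplicas, Math.max(minReplicas, largest));
}

// Hypothetical readings for 4 running query-engine pods.
const readings: MetricReading[] = [
  { name: "cpu", current: 90, target: 70 },                        // ceil(4 * 1.286) = 6
  { name: "memory", current: 60, target: 80 },                     // ceil(4 * 0.75) = 3
  { name: "http_requests_per_second", current: 250, target: 100 }, // ceil(4 * 2.5) = 10
];
const next = desiredReplicas(4, readings, 2, 10); // 10, driven by the request-rate metric
```

Because the largest proposal wins, a single hot metric is enough to trigger a scale-up, while a scale-down requires every metric to agree.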

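7. Database Deployment

7.1 PostgreSQL (Pigsty)

For detailed deployment steps, see pigsty-setup.md

Stage	Approach	Notes
MVP	Single instance in Docker	`pgvector/pgvector:pg16` container
Early production	Pigsty single node	PG 16 + Pigsty monitoring
Mid-stage production	Pigsty primary/replica replication	1 primary + 1~2 replicas, streaming replication
Large scale	Pigsty + Patroni	Automatic failover, 3 or more nodes

Key Pigsty configuration points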

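# Suggested pg_conf parameters (the values below fit a host with 16 GB of RAM)
shared_buffers: 4GB              # 25% of physical RAM
effective_cache_size: 12GB       # 75% of physical RAM
work_mem: 64MB                   # per sort/hash operation, so total use grows with concurrency
maintenance_work_mem: 512MB      # index builds and VACUUM
max_connections: 200             # put a connection pool in front (PgBouncer recommended)
random_page_cost: 1.1            # tuned for SSD storage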
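
The 25% / 75% rules of thumb above scale with the host's memory. The sketch below derives both values from the RAM size and also estimates how much memory `work_mem` could claim if every connection ran a single sort at once. Function names are hypothetical, and the estimate is a rough planning figure rather than a hard limit, since one query can use several `work_mem` allocations.

```typescript
interface PgMemorySettings {
  sharedBuffers: string;
  effectiveCacheSize: string;
}

// Applies the 25% / 75% rules of thumb; intended for whole-number RAM sizes of 4 GB or more.
function pgMemorySettings(ramGb: number): PgMemorySettings {
  if (ramGb < 4) throw new RangeError("ramGb must be at least 4");
  const gb = (fraction: number): string => `${Math.floor(ramGb * fraction)}GB`;
  return { sharedBuffers: gb(0.25), effectiveCacheSize: gb(0.75) };
}

// Memory claimed if every connection runs exactly one work_mem-sized operation.
function workMemIfOneSortPerConnectionMb(maxConnections: number, workMemMb: number): number {
  return maxConnections * workMemMb;
}

const sixteenGb = pgMemorySettings(16); // { sharedBuffers: "4GB", effectiveCacheSize: "12GB" }
const thirtyTwoGb = pgMemorySettings(32); // sizing for the 16C32G database server
const sortMemoryMb = workMemIfOneSortPerConnectionMb(200, 64); // 12800 MB
```

The last figure shows why a pooler matters: 200 direct connections at 64 MB each could claim about 12.5 GB for sorting alone, so keeping the number of active backends low is what makes a generous `work_mem` safe.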

7.2 Redis

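Stage	Approach	Data size limit	High availability
MVP	Single node in Docker	< 2 GB	None
Early production	Redis Sentinel (3 nodes)	< 16 GB	Automatic failover
Large scale	Redis Cluster (6+ nodes)	> 16 GB	Sharding + automatic failover

Sentinel mode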

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Redis    │  │ Redis    │  │ Redis    │
│ Master   │  │ Slave    │  │ Slave    │
│ :6379    │◄─┤ :6379    │  │ :6379    │
└────┬─────┘  └──────────┘  └──────────┘
     │
┌────▼─────┐  ┌──────────┐  ┌──────────┐
│Sentinel 1│  │Sentinel 2│  │Sentinel 3│
│ :26379   │  │ :26379   │  │ :26379   │
└──────────┘  └──────────┘  └──────────┘

# Sentinel mode in docker-compose
# (one replica and one sentinel are shown; the remaining nodes in the diagram follow the same pattern)
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru --appendonly yes

  redis-slave-1:
    image: redis:7-alpine
    command: redis-server --replicaof redis-master 6379 --maxmemory 4gb
    depends_on: [redis-master]

  redis-sentinel-1:
    image: redis:7-alpine
    command: >
      redis-sentinel /etc/redis/sentinel.conf
    volumes:
      # Sentinel rewrites sentinel.conf at runtime, so the mount must be writable
      - ./redis/sentinel.conf:/etc/redis/sentinel.conf
    depends_on: [redis-master]

7.3 ClickHouse

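Stage	Approach	Data volume	Notes
MVP	Single node in Docker	< 50 GB	Basic analytical queries
Early production	Single node on a physical machine	< 500 GB	More memory and SSD
Large scale	Sharded cluster (2~4 shards)	> 500 GB	Horizontal sharding + replicas

Sharded cluster architecture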

                 ┌────────────────┐
                 │    CH Proxy    │
                 │ (query router) │
                 └───────┬────────┘
                         │
            ┌────────────┼────────────┐
            │            │            │
     ┌──────▼──────┐┌───▼────┐┌──────▼──────┐
     │  Shard 1    ││ Shard 2││  Shard 3    │
     │ ┌─┐  ┌─┐   ││ ┌─┐┌─┐││ ┌─┐  ┌─┐   │
     │ │R│  │R│   ││ │R││R│││ │R│  │R│   │
     │ └─┘  └─┘   ││ └─┘└─┘││ └─┘  └─┘   │
     └─────────────┘└───────┘└─────────────┘
      Replica1 Repl2  Repl1   Repl1 Repl2

R = Replica

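Key ClickHouse configuration

<!-- config.xml: server-level settings -->
<clickhouse>
    <background_pool_size>8</background_pool_size>

    <query_log>
        <database>system</database>
        <table>query_log</table>
        <!-- The 30-day window is an example value; adjust it to your retention needs -->
        <ttl>event_date + INTERVAL 30 DAY DELETE</ttl>
    </query_log>

    <!-- MergeTree engine settings -->
    <merge_tree>
        <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
    </merge_tree>
</clickhouse>

<!-- users.xml: query-level limits belong in a settings profile, not in config.xml -->
<clickhouse>
    <profiles>
        <default>
            <max_memory_usage>8000000000</max_memory_usage>  <!-- 8 GB per query -->
            <max_threads>8</max_threads>
        </default>
    </profiles>
</clickhouse>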

7.4 RabbitMQ

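Stage	Approach	Message throughput	Notes
MVP	Single node in Docker	< 1K/s	Management plugin + durable queues
Early production	Mirrored queues (2 nodes)	< 5K/s	Queue mirroring provides high availability
Large scale	Cluster (3+ nodes) + Quorum	> 5K/s	Quorum Queues (Raft consensus)

Mirrored queue configuration

# Set a policy that mirrors every queue to all nodes
# (classic mirrored queues exist only in RabbitMQ 3.x; 4.0 removed them in favor of quorum queues)
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --apply-to queues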

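Performance tuning

# rabbitmq.conf
listeners.tcp.default = 5672
management.tcp.port = 15672

# Memory high watermark (threshold at which publishers are throttled)
vm_memory_high_watermark.relative = 0.6

# Free disk space limit
disk_free_limit.absolute = 5GB

# Queue placement: create new queues on the node hosting the fewest queue masters
queue_master_locator = min-masters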
Appendix

A. Common Operations Commands

# Start all services
docker compose up -d

# Check service status
docker compose ps

# View logs
docker compose logs -f api-gateway
docker compose logs --since 1h query-engine

# Restart a single service
docker compose restart order-service

# Force a rebuild (after a code update)
docker compose up -d --build query-engine

# Check resource usage
docker stats --no-stream

# Open a client inside a container for debugging
docker compose exec postgres psql -U hotel -d hotel_db
docker compose exec redis redis-cli

# Back up the database (-T disables TTY allocation so the piped dump stays clean)
docker compose exec -T postgres pg_dump -U hotel hotel_db | gzip > backup_$(date +%Y%m%d).sql.gz

# Health check
curl -s http://localhost:8080/health | jq .

B. Monitoring Alert Rules (Prometheus)

groups:
  - name: hotel-platform-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate above 1%"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s: {{ $labels.service }}"

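      - alert: DatabaseConnectionPoolExhausted
        # pg_stat_activity_count has one series per database and state, so sum it per instance first
        expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL connections above 90% of max_connections"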

      - alert: RabbitMQQueueBacklog
        expr: rabbitmq_queue_messages > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Message queue backlog: {{ $labels.queue }}"

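      - alert: DiskSpaceLow
        # Exclude in-memory and container overlay filesystems to avoid noisy alerts
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 20% disk space remaining"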

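C. Troubleshooting Checklist

Problem	Troubleshooting steps
Service fails to start	`docker compose logs <service>` → check environment variables → check for port conflicts
Database connection failure	Check PG status → `pg_isready` → check connection count → check DNS/network
Price query timeout	Check supplier APIs → Redis cache → inspect traces → check concurrency limits
Message backlog	Check consumer logs → check dead-letter queues → check the RabbitMQ management UI
Out of memory	`docker stats` → check ClickHouse queries → adjust `max_memory_usage`
Disk full	`du -sh /var/lib/docker/*` → clean up old logs → expand the disk

Document maintenance: this document is updated as the architecture evolves; every major change must record the change date and the owner.

Related documents: pigsty-setup.md | monitoring.md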