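This document describes the complete deployment architecture of the B2B2B hotel platform (Hotel Platform): the evolution path from MVP to production, Docker Compose orchestration, the CI/CD pipeline, configuration management, log collection, scaling strategy, and database deployment.
In the early phase of the project (MVP), the platform runs on a single cloud server with Docker Compose, so the business model can be validated quickly at the lowest cost.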
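| Item | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU | 4 cores | 8 cores | The query engine and the BI module are CPU-intensive |
| Memory | 8 GB | 16 GB | ClickHouse and RabbitMQ are the largest memory consumers |
| System disk | 50 GB SSD | 100 GB SSD | OS and Docker image layers |
| Data disk | 100 GB SSD | 200 GB SSD | Persistent data for PostgreSQL and ClickHouse |
| Bandwidth | 5 Mbps | 10 Mbps | Outbound traffic to supplier APIs |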
┌─────────────────────────────────────────────────┐
│ Cloud server (8C16G) │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Nginx/Caddy │──│ API Gateway │ │
│ │ (reverse proxy) │ │ :8080 │ │
│ └─────────────┘ └──────┬──────┘ │
│ │ │
│ ┌───────────┐ ┌────────┴───────┐ ┌──────────┐│
│ │ Admin Web │ │ Business services │ │ Redis ││
│ │ :3000 │ │ │ │ :6379 ││
│ └───────────┘ │ • Query Engine │ └──────────┘│
│ │ • Order Svc │ │
│ ┌───────────┐ │ • Supplier Adp │ ┌──────────┐│
│ │ Prometheus │ │ • Matching Svc │ │ RabbitMQ ││
│ │ :9090 │ │ • Pricing Svc │ │ :5672 ││
│ └─────┬─────┘ │ • Settlement │ └──────────┘│
│ │ │ • Inventory │ │
│ ┌─────┴─────┐ │ • Risk Svc │ ┌──────────┐│
│ │ Grafana │ │ • BI Service │ │ClickHouse││
│ │ :3001 │ │ │ │ :8123 ││
│ └───────────┘ └───────────────┘ └──────────┘│
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ PostgreSQL 16 (Pigsty) │ │
│ │ :5432 │ │
│ └─────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
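Set the logging options `max-size` and `max-file` so that container logs cannot fill the disk.

Once business volume grows (more than 10,000 orders per day), the platform enters the production stage. The core idea is to separate applications from data and to scale services on demand.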
| Node type | Count | Spec | Components |
|---|---|---|---|
| Application server | 2-4 | 8C16G | API Gateway + all business microservices |
| Database server | 1 primary + 1 replica | 16C32G | PostgreSQL (primary/replica replication) |
| Cache server | 1-3 | 4C8G | Redis Sentinel / Cluster |
| Message queue server | 1-3 | 4C8G | RabbitMQ cluster |
| Analytics server | 1-2 | 8C16G | ClickHouse (shards + replicas) |
| Monitoring server | 1 | 4C8G | Prometheus + Grafana + Loki |
| Load balancer | 1-2 | 2C4G | Nginx / Caddy / cloud LB |
┌──────────────┐
│ DNS / CDN │
└──────┬───────┘
│
┌──────┴───────┐
│ Load balancer │
│ (Nginx/LB) │
└──────┬───────┘
│
┌────────────────┼────────────────┐
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ App Node 1 │ │ App Node 2 │ │ App Node N │
│ │ │ │ │ │
│ API Gateway │ │ API Gateway │ │ API Gateway │
│ Microservices │ │ Microservices │ │ Microservices │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
┌──────┴────────────────┴────────────────┴──────┐
│ Infrastructure layer │
│ │
│ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │ PG Primary│ │ PG Replica│ │ Redis Sentinel │ │
│ └──────────┘ └──────────┘ └────────────────┘ │
│ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │ RabbitMQ │ │ClickHouse│ │ Loki + Prom │ │
│ └──────────┘ └──────────┘ └────────────────┘ │
└────────────────────────────────────────────────┘
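When daily orders exceed 100,000, migrating to Kubernetes is recommended:
hotel-platform (production) / hotel-staging (staging) / hotel-dev (development)

The B2B2B hotel platform consists of the following core services: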
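services:
  api-gateway:         # API gateway - single entry point, routing, rate limiting, circuit breaking
  query-engine:        # Query engine - concurrent multi-supplier price queries, price sorting
  order-service:       # Order service - order lifecycle management
  supplier-adapter:    # Supplier adapter service - integrates each supplier's API
  matching-service:    # Matching service - intelligent supplier-to-hotel matching
  pricing-service:     # Pricing service - price calculation, markup strategy
  settlement-service:  # Settlement service - reconciliation, settlement, invoice management
  inventory-service:   # Inventory service - room status sync, inventory management
  risk-service:        # Risk service - transaction risk control, credit assessment
  bi-service:          # Data intelligence service - analytics, reports
  admin-web:           # Admin console - operations management UI
  redis:               # Cache - hot data cache, distributed locks
  rabbitmq:            # Message queue - async decoupling, event-driven messaging
  clickhouse:          # Analytics database - BI data storage
  prometheus:          # Monitoring - metrics collection
  grafana:             # Dashboards - visualisation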
version: "3.9"
x-common-env: &common-env
NODE_ENV: production
LOG_LEVEL: info
REDIS_URL: redis://redis:6379/0
  RABBITMQ_URL: amqp://guest:${RABBITMQ_PASSWORD:-guest}@rabbitmq:5672
  DATABASE_URL: postgresql://hotel:${POSTGRES_PASSWORD:-hotel_pass}@postgres:5432/hotel_db
CLICKHOUSE_URL: clickhouse://clickhouse:8123/hotel_analytics
JAEGER_ENDPOINT: http://jaeger:14268/api/traces
x-app-defaults: &app-defaults
restart: unless-stopped
logging:
driver: json-file
options:
max-size: "50m"
max-file: "5"
tag: "{{.Name}}"
deploy:
resources:
limits:
memory: 512M
reservations:
memory: 128M
services:
  # ─── API gateway ──────────────────────────────────────────
api-gateway:
<<: *app-defaults
build:
context: ./src/api-gateway
dockerfile: Dockerfile
ports:
- "8080:8080"
environment:
<<: *common-env
PORT: 8080
RATE_LIMIT_WINDOW_MS: 60000
RATE_LIMIT_MAX: 100
depends_on:
redis:
condition: service_healthy
rabbitmq:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
networks:
- frontend
- backend
  # ─── Query engine ─────────────────────────────────────────
query-engine:
<<: *app-defaults
build:
context: ./src/query-engine
dockerfile: Dockerfile
environment:
<<: *common-env
CONCURRENCY_LIMIT: 20
QUERY_TIMEOUT_MS: 10000
depends_on:
redis:
condition: service_healthy
supplier-adapter:
condition: service_started
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 20s
deploy:
resources:
limits:
memory: 1024M
reservations:
memory: 256M
networks:
- backend
  # ─── Order service ────────────────────────────────────────
order-service:
<<: *app-defaults
build:
context: ./src/order-service
dockerfile: Dockerfile
environment:
<<: *common-env
ORDER_EXPIRY_MINUTES: 30
depends_on:
postgres:
condition: service_healthy
rabbitmq:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
networks:
- backend
  # ─── Supplier adapter service ─────────────────────────────
supplier-adapter:
<<: *app-defaults
build:
context: ./src/supplier-adapter
dockerfile: Dockerfile
environment:
<<: *common-env
ADAPTER_POOL_SIZE: 10
RETRY_MAX_ATTEMPTS: 3
RETRY_BACKOFF_MS: 1000
depends_on:
redis:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
networks:
- backend
- external-api
  # ─── Matching service ─────────────────────────────────────
matching-service:
<<: *app-defaults
build:
context: ./src/matching-service
dockerfile: Dockerfile
environment:
<<: *common-env
MATCHING_ALGORITHM: weighted_score
depends_on:
postgres:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
networks:
- backend
  # ─── Pricing service ──────────────────────────────────────
pricing-service:
<<: *app-defaults
build:
context: ./src/pricing-service
dockerfile: Dockerfile
environment:
<<: *common-env
DEFAULT_CURRENCY: CNY
EXCHANGE_RATE_CACHE_TTL: 3600
depends_on:
redis:
condition: service_healthy
postgres:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
networks:
- backend
  # ─── Settlement service ───────────────────────────────────
settlement-service:
<<: *app-defaults
build:
context: ./src/settlement-service
dockerfile: Dockerfile
environment:
<<: *common-env
SETTLE_CYCLE_DAYS: 7
depends_on:
postgres:
condition: service_healthy
rabbitmq:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
networks:
- backend
  # ─── Inventory service ────────────────────────────────────
inventory-service:
<<: *app-defaults
build:
context: ./src/inventory-service
dockerfile: Dockerfile
environment:
<<: *common-env
SYNC_INTERVAL_MS: 300000
depends_on:
redis:
condition: service_healthy
postgres:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
networks:
- backend
  # ─── Risk service ─────────────────────────────────────────
risk-service:
<<: *app-defaults
build:
context: ./src/risk-service
dockerfile: Dockerfile
environment:
<<: *common-env
RISK_SCORE_THRESHOLD: 80
depends_on:
redis:
condition: service_healthy
postgres:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
networks:
- backend
  # ─── Data intelligence service (BI) ───────────────────────
bi-service:
<<: *app-defaults
build:
context: ./src/bi-service
dockerfile: Dockerfile
environment:
<<: *common-env
CLICKHOUSE_URL: clickhouse://clickhouse:8123/hotel_analytics
depends_on:
clickhouse:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 3
deploy:
resources:
limits:
memory: 1024M
reservations:
memory: 256M
networks:
- backend
  # ─── Admin console ────────────────────────────────────────
admin-web:
<<: *app-defaults
build:
context: ./src/admin-web
dockerfile: Dockerfile
ports:
- "3000:3000"
environment:
<<: *common-env
NEXT_PUBLIC_API_URL: /api
depends_on:
api-gateway:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000"]
interval: 20s
timeout: 5s
retries: 3
networks:
- frontend
  # ─── Infrastructure services ──────────────────────────────
redis:
image: redis:7-alpine
restart: unless-stopped
command: >
redis-server
--maxmemory 512mb
--maxmemory-policy allkeys-lru
--appendonly yes
volumes:
- redis-data:/data
ports:
- "127.0.0.1:6379:6379"
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
networks:
- backend
rabbitmq:
image: rabbitmq:3-management-alpine
restart: unless-stopped
environment:
RABBITMQ_DEFAULT_USER: guest
RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD:-guest}
volumes:
- rabbitmq-data:/var/lib/rabbitmq
ports:
- "127.0.0.1:5672:5672"
- "127.0.0.1:15672:15672"
healthcheck:
test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
interval: 15s
timeout: 5s
retries: 3
networks:
- backend
postgres:
    image: pgvector/pgvector:pg16
restart: unless-stopped
environment:
POSTGRES_USER: hotel
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-hotel_pass}
POSTGRES_DB: hotel_db
volumes:
- postgres-data:/var/lib/postgresql/data
ports:
- "127.0.0.1:5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U hotel -d hotel_db"]
interval: 10s
timeout: 3s
retries: 5
networks:
- backend
clickhouse:
image: clickhouse/clickhouse-server:24-alpine
restart: unless-stopped
environment:
CLICKHOUSE_USER: hotel
CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD:-hotel_click}
CLICKHOUSE_DB: hotel_analytics
volumes:
- clickhouse-data:/var/lib/clickhouse
ports:
- "127.0.0.1:8123:8123"
- "127.0.0.1:9000:9000"
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8123/ping"]
interval: 10s
timeout: 3s
retries: 3
networks:
- backend
prometheus:
image: prom/prometheus:v2.52.0
restart: unless-stopped
volumes:
- ./deploy/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
ports:
- "127.0.0.1:9090:9090"
    networks:
      - monitoring
      - backend   # needed to scrape the application services; `monitoring` is an internal-only network
grafana:
image: grafana/grafana:10.4.0
restart: unless-stopped
environment:
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
GF_USERS_ALLOW_SIGN_UP: "false"
volumes:
- grafana-data:/var/lib/grafana
- ./deploy/docker/grafana/provisioning:/etc/grafana/provisioning
ports:
- "127.0.0.1:3001:3000"
depends_on:
- prometheus
networks:
- frontend
- monitoring
loki:
image: grafana/loki:2.9.0
restart: unless-stopped
volumes:
- ./deploy/docker/loki/local-config.yaml:/etc/loki/local-config.yaml
- loki-data:/loki
ports:
- "127.0.0.1:3100:3100"
networks:
- monitoring
promtail:
image: grafana/promtail:2.9.0
restart: unless-stopped
volumes:
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./deploy/docker/promtail/config.yml:/etc/promtail/config.yml
depends_on:
- loki
networks:
- monitoring
volumes:
redis-data:
rabbitmq-data:
postgres-data:
clickhouse-data:
prometheus-data:
grafana-data:
loki-data:
networks:
frontend:
driver: bridge
backend:
driver: bridge
internal: false
external-api:
driver: bridge
internal: false
monitoring:
driver: bridge
internal: true
# .env.example - copy to .env and fill in real values
# ─── Application ─────────────────────────────
NODE_ENV=production
LOG_LEVEL=info
PORT=8080
# ─── Databases ───────────────────────────────
POSTGRES_PASSWORD=your_secure_db_password
DATABASE_URL=postgresql://hotel:your_secure_db_password@postgres:5432/hotel_db
CLICKHOUSE_PASSWORD=your_secure_clickhouse_password
# ─── Redis ───────────────────────────────────
REDIS_PASSWORD=your_secure_redis_password
REDIS_URL=redis://:your_secure_redis_password@redis:6379/0
# ─── RabbitMQ ────────────────────────────────
RABBITMQ_PASSWORD=your_secure_rabbitmq_password
RABBITMQ_URL=amqp://guest:your_secure_rabbitmq_password@rabbitmq:5672
# ─── Monitoring ──────────────────────────────
GRAFANA_PASSWORD=your_secure_grafana_password
# ─── Supplier API keys (stored encrypted) ────
SUPPLIER_A_API_KEY=supplier_a_key
SUPPLIER_B_API_KEY=supplier_b_key
SUPPLIER_C_API_KEY=supplier_c_key
# ─── Encryption ──────────────────────────────
ENCRYPTION_SECRET=your_32_char_encryption_secret_key
# ─── JWT ─────────────────────────────────────
JWT_SECRET=your_jwt_secret_key_min_32_chars
JWT_EXPIRY_HOURS=24
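| Environment | Method | Example |
|---|---|---|
| Development | `.env` file (listed in `.gitignore`) | Local development settings |
| Staging | GitHub Secrets + `.env.staging` | Injected automatically by CI/CD |
| Production | Cloud secret management + manual injection over SSH | Alibaba Cloud KMS / AWS Secrets Manager |
| Emergency access | Operations staff use Vault or a 1Password team vault | Database root password |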
deploy/
├── .env.example              # Template (committed to Git)
├── .env.dev                  # Development (not committed)
├── .env.staging              # Staging (not committed)
├── .env.production           # Production (not committed)
└── docker/
├── compose.dev.yml # docker compose -f compose.dev.yml up
├── compose.staging.yml # docker compose -f compose.staging.yml up
└── compose.production.yml
Each environment's `compose.*.yml` overrides the environment-specific settings (such as port mappings and replica counts):
# compose.staging.yml example
services:
api-gateway:
environment:
NODE_ENV: staging
LOG_LEVEL: debug
deploy:
replicas: 1
b2b2b-hotel-platform/
├── src/                          # Source code
│ ├── api-gateway/
│ │ ├── src/
│ │ ├── tests/
│ │ ├── Dockerfile
│ │ └── package.json
│ ├── query-engine/
│ ├── order-service/
│ ├── supplier-adapter/
│ ├── matching-service/
│ ├── pricing-service/
│ ├── settlement-service/
│ ├── inventory-service/
│ ├── risk-service/
│ ├── bi-service/
│ └── admin-web/
├── deploy/                       # Deployment configuration
│ ├── docker/
│ │ ├── compose.base.yml
│ │ ├── compose.dev.yml
│ │ ├── compose.staging.yml
│ │ ├── compose.production.yml
│ │ ├── prometheus/
│ │ ├── grafana/
│ │ ├── loki/
│ │ └── promtail/
│   ├── k8s/                      # K8s manifests (future)
│ │ ├── helm/
│ │ │ └── hotel-platform/
│ │ │ ├── Chart.yaml
│ │ │ ├── values.yaml
│ │ │ ├── values-staging.yaml
│ │ │ └── templates/
│ │ └── base/
│ │ ├── namespace.yaml
│ │ └── network-policy.yaml
│ └── scripts/
│ ├── deploy.sh
│ ├── rollback.sh
│ ├── backup-db.sh
│ └── health-check.sh
├── infra/                        # Infrastructure
│   ├── pigsty/                   # PostgreSQL deployment
│   ├── monitoring/               # Monitoring configuration
│   └── terraform/                # IaC (future)
├── docs/                         # Documentation
│   └── infra/
│       ├── deploy.md             # This document
│ ├── pigsty-setup.md
│ └── monitoring.md
├── .github/
│ └── workflows/
│       ├── ci.yml                # PR pipeline
│       ├── cd-staging.yml        # Staging deployment
│       └── cd-production.yml     # Production deployment
├── docker-compose.yml            # Top-level orchestration entry point
├── Makefile
└── README.md
name: CI Pipeline
on:
pull_request:
branches: [main, develop]
jobs:
lint-and-test:
runs-on: ubuntu-latest
strategy:
matrix:
service:
- api-gateway
- query-engine
- order-service
- supplier-adapter
- matching-service
- pricing-service
- settlement-service
- inventory-service
- risk-service
- bi-service
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: "20"
cache: "npm"
          cache-dependency-path: src/${{ matrix.service }}/package-lock.json
- name: Install dependencies
working-directory: src/${{ matrix.service }}
run: npm ci
- name: Lint
working-directory: src/${{ matrix.service }}
run: npm run lint
- name: Type check
working-directory: src/${{ matrix.service }}
run: npm run typecheck
- name: Unit tests
working-directory: src/${{ matrix.service }}
run: npm run test:unit -- --coverage
- name: Integration tests (mock)
working-directory: src/${{ matrix.service }}
run: npm run test:integration
env:
DATABASE_URL: "postgresql://test:test@localhost:5432/test_db"
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Dependency audit
run: |
npm audit --audit-level=high --omit=dev || true
- name: Trivy filesystem scan
uses: aquasecurity/trivy-action@master
with:
scan-type: "fs"
scan-ref: "."
build-check:
runs-on: ubuntu-latest
needs: [lint-and-test]
strategy:
matrix:
service:
- api-gateway
- query-engine
- order-service
steps:
- uses: actions/checkout@v4
- name: Build Docker image
run: docker build -t ${{ matrix.service }}:test ./src/${{ matrix.service }}
name: Build & Push
on:
push:
branches: [main, develop]
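jobs:
  build-and-push:
    runs-on: ubuntu-latest
    # `changed-services` (not shown here) is assumed to be a job in this workflow
    # that outputs a JSON array of service names as `services`.
    needs: [changed-services]
    permissions:
      contents: read
      packages: write   # required to push to ghcr.io with GITHUB_TOKEN
    strategy:
      matrix:
        service: ${{ fromJson(needs.changed-services.outputs.services) }}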
steps:
- uses: actions/checkout@v4
- name: Login to Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build & Push
uses: docker/build-push-action@v5
with:
context: ./src/${{ matrix.service }}
push: true
tags: |
ghcr.io/${{ github.repository }}/${{ matrix.service }}:${{ github.sha }}
ghcr.io/${{ github.repository }}/${{ matrix.service }}:latest
cache-from: type=gha
cache-to: type=gha,mode=max
name: Deploy Staging
on:
push:
tags: ["v*"]
jobs:
deploy-staging:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Deploy to Staging
uses: appleboy/ssh-action@v1
with:
host: ${{ secrets.STAGING_HOST }}
username: deploy
key: ${{ secrets.STAGING_SSH_KEY }}
script: |
cd /opt/hotel-platform
git pull origin main
export IMAGE_TAG=${{ github.ref_name }}
            docker compose -f deploy/docker/compose.base.yml -f deploy/docker/compose.staging.yml pull
            docker compose -f deploy/docker/compose.base.yml -f deploy/docker/compose.staging.yml up -d
./deploy/scripts/health-check.sh
        ┌───────────────────────┐
        │ Start release v1.2.0  │
        └───────────┬───────────┘
                    │
        ┌───────────▼───────────┐
        │ Deploy canary (5%)    │
        │ Monitor 5 min         │
        └───────────┬───────────┘
                    │
           ┌────────┴────────┐
           │ Error rate < 1%?│
           │ P99 < 500 ms?   │
           └───┬─────────┬───┘
           Yes │         │ No
               │         │
 ┌─────────────▼───┐  ┌──▼────────────┐
 │ 30% traffic     │  │ Auto rollback │
 │ Monitor 10 min  │  │ Notify ops    │
 └─────────────┬───┘  └───────────────┘
               │
         ┌─────┴─────┐
         │ Checks OK │
         └─────┬─────┘
               │
 ┌─────────────▼───┐
 │ 100% rollout    │
 │ Release complete│
 └─────────────────┘
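The gate in the flow above can be sketched as a small decision function. This is a minimal illustration, not the platform's actual rollout controller: the thresholds (error rate below 1%, P99 below 500 ms) and the 5% / 30% / 100% stages come from the diagram, while the names `evaluateStage`, `StageMetrics`, and `Decision` are hypothetical.

```typescript
// Thresholds taken from the rollout flow above.
const MAX_ERROR_RATE = 0.01; // error rate must stay below 1%
const MAX_P99_MS = 500;      // P99 latency must stay below 500 ms

// Traffic percentages of the three stages: canary, partial, full.
const ROLLOUT_STAGES: readonly number[] = [5, 30, 100];

interface StageMetrics {
  errorRate: number; // fraction of failed requests, 0..1
  p99Ms: number;     // observed P99 latency in milliseconds
}

type Decision =
  | { action: "promote"; nextPercent: number }
  | { action: "complete" }
  | { action: "rollback"; reason: string };

// Decide what to do once the observation window of the current stage ends.
function evaluateStage(currentPercent: number, m: StageMetrics): Decision {
  if (m.errorRate >= MAX_ERROR_RATE) {
    return { action: "rollback", reason: `error rate ${m.errorRate} is not below ${MAX_ERROR_RATE}` };
  }
  if (m.p99Ms >= MAX_P99_MS) {
    return { action: "rollback", reason: `P99 ${m.p99Ms} ms is not below ${MAX_P99_MS} ms` };
  }
  const idx = ROLLOUT_STAGES.indexOf(currentPercent);
  if (idx === -1) {
    return { action: "rollback", reason: `unknown stage ${currentPercent}%` };
  }
  const next = ROLLOUT_STAGES[idx + 1];
  if (next === undefined) return { action: "complete" }; // already at 100%
  return { action: "promote", nextPercent: next };
}
```

A healthy canary at 5% is promoted to 30%, a healthy 30% stage to 100%, and any stage that misses either threshold returns a rollback decision with the reason attached.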
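| Scenario | Trigger | Rollback method | Rollback time |
|---|---|---|---|
| Automatic rollback | Error rate > 1% for 2 minutes | Switch back to the previous version's containers | < 30 s |
| Manual rollback | Operations staff judge the release abnormal | Run `deploy/scripts/rollback.sh` | < 1 min |
| Database rollback | Migration script fails | Run the matching `down.sql` script | < 2 min |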
src/db-migrations/
├── 001_create_suppliers.up.sql
├── 001_create_suppliers.down.sql
├── 002_create_hotels.up.sql
├── 002_create_hotels.down.sql
├── 003_create_orders.up.sql
├── 003_create_orders.down.sql
├── 004_add_settlement_fields.up.sql
├── 004_add_settlement_fields.down.sql
└── schema_migrations.sql # Version tracking table
Recommended migration tools: node-pg-migrate or Flyway
# Makefile integration
db-migrate-up:
npx node-pg-migrate up -m src/db-migrations -e production
db-migrate-down:
npx node-pg-migrate down -m src/db-migrations -e production
db-migrate-create:
npx node-pg-migrate create $(name) --sql-file -m src/db-migrations
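The MVP stage uses a database-backed configuration center, so no additional middleware is needed:
CREATE TABLE platform_config (
    id          SERIAL PRIMARY KEY,
    key         VARCHAR(255) NOT NULL,
    value       JSONB NOT NULL,
    env         VARCHAR(20) NOT NULL DEFAULT 'production',
    description TEXT,
    updated_at  TIMESTAMPTZ DEFAULT NOW(),
    updated_by  VARCHAR(100),
    -- The same key may exist once per environment; this constraint
    -- also creates the (env, key) index used for lookups.
    CONSTRAINT uq_config_env_key UNIQUE (env, key)
);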
{
"key": "matching.algorithm.weights",
"value": {
"price_weight": 0.4,
"rating_weight": 0.3,
"availability_weight": 0.2,
"supplier_reliability_weight": 0.1
},
"env": "production",
  "description": "Weights for the hotel matching algorithm"
}
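As an illustration of how the `matching.algorithm.weights` entry might be consumed, the sketch below computes a weighted score for one candidate. The weight names and values come from the JSON above; the `CandidateSignals` shape, the assumption that every signal is already normalised to 0..1, and the function names are hypothetical.

```typescript
interface MatchingWeights {
  price_weight: number;
  rating_weight: number;
  availability_weight: number;
  supplier_reliability_weight: number;
}

// Per-candidate signals; each one is assumed to be normalised to 0..1,
// where 1 is best (for example, the cheapest offer has price = 1).
interface CandidateSignals {
  price: number;
  rating: number;
  availability: number;
  supplierReliability: number;
}

// Values copied from the configuration entry above.
const productionWeights: MatchingWeights = {
  price_weight: 0.4,
  rating_weight: 0.3,
  availability_weight: 0.2,
  supplier_reliability_weight: 0.1,
};

// Weights should add up to 1 so that scores stay within 0..1.
function weightsSumToOne(w: MatchingWeights, epsilon = 1e-9): boolean {
  const sum = w.price_weight + w.rating_weight + w.availability_weight + w.supplier_reliability_weight;
  return Math.abs(sum - 1) < epsilon;
}

// Linear combination of the normalised signals.
function weightedScore(w: MatchingWeights, s: CandidateSignals): number {
  return (
    w.price_weight * s.price +
    w.rating_weight * s.rating +
    w.availability_weight * s.availability +
    w.supplier_reliability_weight * s.supplierReliability
  );
}
```

Validating the sum when the configuration is loaded catches a mistyped weight before it silently skews every ranking.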
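When the number of services grows (more than 20 microservices) or configuration must be synchronised across clusters, migrate to etcd:
etcd cluster (3 nodes)
├── /hotel-platform/
│   ├── /config/
│   │   ├── /api-gateway/
│   │   ├── /query-engine/
│   │   └── /global/
│   ├── /features/    # Feature flags
│   └── /secrets/     # References to encrypted secrets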
┌───────────────┐   Watch    ┌───────────────┐     Poll     ┌──────────┐
│ Config center │◄───────────│ Config client │─────────────►│ DB/etcd  │
│ (source)      │   Change   │ (SDK)         │  every 30s   │          │
└───────────────┘            └───────┬───────┘              └──────────┘
                                     │
                             ┌───────▼───────┐
                             │ Memory cache  │
                             │ (local cache) │
                             └───────┬───────┘
                                     │
                             ┌───────▼───────┐
                             │ Business logic│
                             │ (live reads)  │
                             └───────────────┘
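Implementation notes:
- Change notification: `LISTEN/NOTIFY` or the etcd Watch mechanism

// Config client sketch. `ConfigLoader` is a hypothetical function that reads
// every entry from the database or etcd and returns it as a map.
type ConfigLoader = () => Promise<Map<string, unknown>>;
class ConfigClient {
  private cache: Map<string, unknown> = new Map();
  constructor(private readonly loader: ConfigLoader) {}
  async start(): Promise<void> {
    await this.loadAll();        // full load at startup
    this.startPolling(30_000);   // then poll every 30 s
  }
  get<T>(key: string, defaultValue: T): T {
    return (this.cache.get(key) as T | undefined) ?? defaultValue;
  }
  private async loadAll(): Promise<void> {
    this.cache = await this.loader();   // swap in the whole snapshot at once
  }
  private startPolling(intervalMs: number): void {
    // Keep serving the previous snapshot if a refresh fails.
    setInterval(() => { this.loadAll().catch((err) => console.error("config refresh failed", err)); }, intervalMs);
  }
}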
CREATE TABLE feature_flags (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
enabled BOOLEAN NOT NULL DEFAULT false,
rollout_pct INTEGER NOT NULL DEFAULT 100, -- 0~100
    whitelist TEXT[], -- user/tenant whitelist
description TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
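| Flag name | Default | Description |
|---|---|---|
| `supplier.realtime.sync` | `false` | Real-time supplier sync (enabled gradually) |
| `risk.auto.reject` | `false` | Automatic order rejection by risk control |
| `matching.ai.boost` | `false` | AI matching recommendation weight |
| `settlement.auto.reconcile` | `false` | Automatic reconciliation |
| `admin.v2` | `false` | Admin console V2 UI |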
// `configClient` and `murmurhash` are assumed to be in scope (not shown here).
async function checkFeatureFlag(name: string, context: { tenantId: string }): Promise<boolean> {
  const flag = await configClient.getFeatureFlag(name);
  if (!flag?.enabled) return false;
  // Whitelisted tenants always get the feature, regardless of the rollout percentage.
  if (flag.whitelist?.includes(context.tenantId)) return true;
  // Everyone else is bucketed deterministically into 0..99.
  if (flag.rollout_pct < 100) {
    const hash = murmurhash(context.tenantId + name) % 100;
    return hash < flag.rollout_pct;
  }
  return true;
}
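The percentage rollout depends on a stable hash, so the same tenant always lands in the same bucket for a given flag. The sketch below shows that bucketing on its own. It swaps in a 32-bit FNV-1a hash (computed over UTF-16 code units) in place of `murmurhash` so the example has no dependencies; `fnv1a32`, `rolloutBucket`, and `inRollout` are hypothetical names.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units.
// Used here only as a dependency-free stand-in for murmurhash.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return hash >>> 0;
}

// Map a (tenant, flag) pair to a stable bucket in 0..99.
function rolloutBucket(tenantId: string, flagName: string): number {
  return fnv1a32(`${tenantId}:${flagName}`) % 100;
}

// A tenant is included when its bucket falls below the rollout percentage.
function inRollout(tenantId: string, flagName: string, rolloutPct: number): boolean {
  return rolloutBucket(tenantId, flagName) < rolloutPct;
}
```

Including the flag name in the hashed key gives each flag an independent split, so a tenant that falls into the first 10% for one flag is not automatically in the first 10% for every other flag.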
All services emit structured JSON logs in a common format:
{
"timestamp": "2025-01-15T08:30:15.123Z",
"level": "INFO",
"service": "query-engine",
"traceId": "abc123def456",
"spanId": "789ghi",
"message": "Query completed successfully",
"context": {
"supplierCount": 5,
"resultCount": 12,
"durationMs": 342
}
}
Log field specification:
| Field | Type | Required | Description |
|---|---|---|---|
| `timestamp` | ISO 8601 | ✓ | Log time (UTC) |
| `level` | string | ✓ | Log level |
| `service` | string | ✓ | Service name |
| `traceId` | string | ✓ | Trace ID |
| `spanId` | string | ✗ | Span ID (for distributed tracing) |
| `message` | string | ✓ | Log message |
| `context` | object | ✗ | Structured additional information |
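A minimal sketch of a formatter that produces one log line in this shape. The field names and the required/optional split follow the table; `LogEntry` and `formatLog` are hypothetical names, and a real service would more likely configure its logging library to emit the same fields.

```typescript
type LogLevel = "ERROR" | "WARN" | "INFO" | "DEBUG";

// Mirrors the field specification: spanId and context are optional.
interface LogEntry {
  timestamp: string; // ISO 8601, UTC
  level: LogLevel;
  service: string;
  traceId: string;
  spanId?: string;
  message: string;
  context?: Record<string, unknown>;
}

// Build one JSON log line. `now` is injectable so the output is testable.
function formatLog(fields: Omit<LogEntry, "timestamp">, now: Date = new Date()): string {
  const entry: LogEntry = { timestamp: now.toISOString(), ...fields };
  return JSON.stringify(entry);
}

// Typical use: write one line per event to stdout, where the Docker
// logging driver picks it up.
console.log(
  formatLog({
    level: "INFO",
    service: "query-engine",
    traceId: "abc123def456",
    message: "Query completed successfully",
    context: { supplierCount: 5, resultCount: 12, durationMs: 342 },
  }),
);
```

`Date.prototype.toISOString()` always returns UTC, which satisfies the `timestamp` requirement without extra time zone handling.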
Loki + Promtail is the chosen log collection stack (lighter than ELK, with lower resource usage).
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Service A│ │ Service B│ ... │ Service N│ │ Service N│
│ (stdout) │ │ (stdout) │ │ (stdout) │ │ (stdout) │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │ │
└────────────┴────────────────┴────────────────┘
│
Docker Logging Driver
(json-file)
│
┌─────────▼─────────┐
│ Promtail Agent │
│ (one per node) │
│ - reads container logs │
│ - adds labels │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ Loki │
│ - log storage │
│ - label index │
│ - query engine │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ Grafana │
│ - LogQL queries │
│ - correlation with metrics │
└───────────────────┘
# deploy/docker/promtail/config.yml
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
relabel_configs:
- source_labels: ["__meta_docker_container_name"]
target_label: "container"
- source_labels: ["__meta_docker_container_log_stream"]
target_label: "stream"
pipeline_stages:
- json:
expressions:
level: level
service: service
traceId: traceId
message: message
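      - labels:
          # Only low-cardinality fields become labels. traceId stays in the log
          # line and can be filtered with LogQL, for example: | json | traceId="..."
          level:
          service: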
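| Level | Purpose | Emitted in production | Examples |
|---|---|---|---|
| ERROR | Errors that need immediate attention and trigger alerts | ✓ | Database connection failure, third-party API timeout |
| WARN | Potential problems that may affect the business | ✓ | Second retry attempt, cache hit rate below 50% |
| INFO | Records of key business operations | ✓ | Order created, supplier price query completed, settlement completed |
| DEBUG | Detailed debugging information, enabled only while troubleshooting | ✗ | SQL statements, HTTP request details, internal state changes |
In production, the level is controlled through the `LOG_LEVEL` environment variable: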
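# Production
LOG_LEVEL=info
# To enable debug logging temporarily, set LOG_LEVEL=debug for the service
# (for example in a compose override file) and recreate its container:
docker compose up -d --force-recreate query-engine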
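| Tier | Retention | Storage | Cost optimisation |
|---|---|---|---|
| Hot | 30 days | Local disk / S3 Standard | Full index, full-text searchable |
| Warm | 90 days | S3 Infrequent Access | Compressed, metadata index only |
| Cold | 1 year | S3 Glacier / OSS archive | Heavily compressed, restored on demand |
| Expired | Deleted automatically | - | Loki retention settings |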
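The tiering rule can be written down as a lookup by log age. This sketch assumes the three periods are measured from the time the log was written (so a log is hot up to day 30, warm up to day 90, cold up to day 365); the table does not state whether the periods are cumulative, and `tierFor` and `TIER_LIMITS` are hypothetical names.

```typescript
type LogTier = "hot" | "warm" | "cold" | "expired";

// Upper age bound (in days) of each tier, from the retention table.
// Assumption: every bound is counted from the time the log was written.
const TIER_LIMITS: ReadonlyArray<{ tier: LogTier; maxAgeDays: number }> = [
  { tier: "hot", maxAgeDays: 30 },
  { tier: "warm", maxAgeDays: 90 },
  { tier: "cold", maxAgeDays: 365 },
];

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return the storage tier a log written at `writtenAt` belongs to at time `now`.
function tierFor(writtenAt: Date, now: Date = new Date()): LogTier {
  const ageDays = (now.getTime() - writtenAt.getTime()) / MS_PER_DAY;
  for (const { tier, maxAgeDays } of TIER_LIMITS) {
    if (ageDays <= maxAgeDays) return tier;
  }
  return "expired";
}
```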
Loki retention configuration:
# loki/local-config.yaml
schema_config:
configs:
- from: 2025-01-01
store: boltdb-shipper
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
storage_config:
filesystem:
directory: /loki/chunks
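limits_config:
  retention_period: 720h          # 30 days of hot data
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
# With boltdb-shipper, retention is applied by the compactor rather than the table manager.
compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  retention_enabled: true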
// src/shared/tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || "unknown",
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://jaeger:4317",
}),
});
sdk.start();
Client request
    │
    ├── [X-Request-ID: abc123]
    │
┌───▼──────────────────────────────────────────────────────────┐
│ API Gateway                                                  │
│ TraceID: abc123 | Span: gateway.process                      │
│   └──→ calls query-engine                                    │
│        propagates header: traceparent: 00-abc123-...         │
└───┬──────────────────────────────────────────────────────────┘
    │
┌───▼──────────────────────────────────────────────────────────┐
│ Query Engine                                                 │
│ TraceID: abc123 | Span: query.search                         │
│   ├──→ calls supplier-adapter (supplier-A)                   │
│   │      Span: adapter.supplier-a.query                      │
│   ├──→ calls supplier-adapter (supplier-B)                   │
│   │      Span: adapter.supplier-b.query                      │
│   └──→ writes to the Redis cache                             │
│          Span: cache.write                                   │
└──────────────────────────────────────────────────────────────┘
In Grafana, the full call chain can be viewed by searching for TraceID: abc123.
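The `traceparent` header in the diagram is abbreviated. In the W3C Trace Context format it has four dash-separated fields: version `00`, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and two hex digits of flags. The sketch below formats and parses that header by hand, purely to show what is propagated between services. In practice the OpenTelemetry SDK handles this, and `TraceContext`, `formatTraceparent`, and `parseTraceparent` are hypothetical names.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the flags field
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Returns null for anything that is not a valid version-00 header.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are defined as invalid by the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}
```

Each service keeps the incoming trace ID, creates a new span ID for its own work, and sends the updated header on every outgoing call, which is what lets the spans above be joined into one trace.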
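Stage 1: single node (0-10,000 orders/day)
├── All services and the database on one server
└── Estimated cost: ¥300-500/month
Stage 2: data separation (10,000-50,000 orders/day) ← first scaling step
├── Application server (8C16G) × 1
├── Database server (16C32G) × 1
└── Estimated cost: ¥800-1,200/month
Stage 3: horizontal application scaling (50,000-100,000 orders/day) ← second scaling step
├── Application servers (8C16G) × 2-3
├── Database servers (16C32G) × 1 primary + 1 replica
├── Load balancer × 1
└── Estimated cost: ¥2,000-3,500/month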
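For capacity planning scripts, the stage boundaries can be encoded as a simple lookup. This is a sketch under one assumption: the boundary values themselves (10,000, 50,000, and 100,000 orders per day) belong to the lower stage, which the document leaves open. `stageFor` and `DeploymentStage` are hypothetical names.

```typescript
type DeploymentStage =
  | "single-node"        // stage 1
  | "data-separation"    // stage 2
  | "horizontal-scaling" // stage 3
  | "kubernetes";        // above 100,000 orders/day

// Map a daily order volume to the deployment stage described above.
// Assumption: each boundary value is treated as part of the lower stage.
function stageFor(dailyOrders: number): DeploymentStage {
  if (!Number.isFinite(dailyOrders) || dailyOrders < 0) {
    throw new RangeError(`dailyOrders must be a non-negative number, got ${dailyOrders}`);
  }
  if (dailyOrders <= 10_000) return "single-node";
  if (dailyOrders <= 50_000) return "data-separation";
  if (dailyOrders <= 100_000) return "horizontal-scaling";
  return "kubernetes";
}
```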
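# 1. Install PostgreSQL (Pigsty) on the database server
#    See docs/infra/pigsty-setup.md
# 2. Export the existing data (-T disables the pseudo-TTY so the dump stays clean)
docker compose exec -T postgres pg_dump -U hotel hotel_db > backup.sql
# 3. Import into the new server
psql -h db-server -U hotel -d hotel_db < backup.sql
# 4. Update the application connection string
#    Edit .env: DATABASE_URL=postgresql://hotel:xxx@db-server:5432/hotel_db
# 5. Remove the postgres service from compose
docker compose rm -s postgres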
# Scale out with Docker Compose replicas
# Edit compose.production.yml
services:
query-engine:
deploy:
replicas: 3
order-service:
deploy:
replicas: 2
# Nginx load balancer in front of the application nodes
upstream api_backend {
least_conn;
server app-node-1:8080;
server app-node-2:8080;
server app-node-3:8080;
}
server {
listen 80;
server_name api.hotel-platform.com;
location / {
proxy_pass http://api_backend;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Request-ID $request_id;
}
}
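The `least_conn` directive sends each new request to the upstream with the fewest active connections. The sketch below shows that selection rule in isolation. It is a simplification: Nginx also takes server weights into account, which this example ignores, and `Upstream` and `pickLeastConn` are hypothetical names.

```typescript
interface Upstream {
  host: string;
  activeConnections: number;
}

// Pick the upstream with the fewest active connections.
// Ties go to the server listed first; weights are ignored in this sketch.
function pickLeastConn(upstreams: Upstream[]): Upstream {
  if (upstreams.length === 0) {
    throw new Error("no upstream servers configured");
  }
  return upstreams.reduce((best, candidate) =>
    candidate.activeConnections < best.activeConnections ? candidate : best,
  );
}
```

Compared with round-robin, this rule suits the query path, where a slow supplier call can hold a connection open far longer than an average request.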
| Metric | Docker Compose threshold | K8s migration threshold |
|---|---|---|
| Daily orders | < 100,000 | > 100,000 |
| Business service instances | < 30 | > 30 |
| Release frequency | < 3 per day | > 3 per day |
| Operations headcount | 1-2 | > 3 |
deploy/k8s/helm/hotel-platform/
├── Chart.yaml
├── values.yaml                 # Defaults
├── values-staging.yaml         # Staging overrides
├── values-production.yaml      # Production overrides
└── templates/
├── _helpers.tpl
├── namespace.yaml
├── network-policy.yaml
├── api-gateway/
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── hpa.yaml
│ └── podmonitor.yaml
├── query-engine/
│ ├── deployment.yaml
│ ├── service.yaml
│ └── hpa.yaml
├── order-service/
│ └── ...
├── redis/
│ ├── statefulset.yaml
│ └── service.yaml
├── postgres/
│ └── statefulset.yaml
├── rabbitmq/
│ └── statefulset.yaml
├── clickhouse/
│ └── statefulset.yaml
├── prometheus/
│ └── ...
├── grafana/
│ └── ...
└── ingress.yaml
# templates/query-engine/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: query-engine
namespace: hotel-platform
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: query-engine
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 120
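For each metric, the HPA controller computes `ceil(currentReplicas × currentValue / targetValue)`, takes the largest proposal across metrics, and clamps it to `minReplicas`..`maxReplicas`. The sketch below reproduces that core calculation with the bounds from the manifest above. It leaves out the controller's tolerance band and the `behavior` rate limits, and `desiredReplicas` and `MetricReading` are hypothetical names.

```typescript
interface MetricReading {
  name: string;
  current: number; // observed average value (utilisation % or per-pod value)
  target: number;  // target value from the HPA spec
}

// Core HPA calculation: per-metric proposal, max across metrics, then clamp.
// Tolerance and scale-up/scale-down behaviour policies are not modelled here.
function desiredReplicas(
  currentReplicas: number,
  metrics: MetricReading[],
  minReplicas = 2,
  maxReplicas = 10,
): number {
  const proposals = metrics.map((m) => Math.ceil(currentReplicas * (m.current / m.target)));
  // With no metrics available, keep the current replica count.
  const wanted = proposals.length > 0 ? Math.max(...proposals) : currentReplicas;
  return Math.min(maxReplicas, Math.max(minReplicas, wanted));
}
```

With 4 replicas, CPU at 90% against a 70% target proposes 6 replicas, memory at 40% against 80% proposes 2, and 150 requests per second against a target of 100 proposes 6, so the result is 6.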
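For detailed deployment steps, see pigsty-setup.md.
| Stage | Approach | Notes |
|---|---|---|
| MVP | Single instance + Docker | `pgvector/pgvector:pg16` container |
| Early production | Pigsty single node | PG 16 + Pigsty monitoring |
| Mid-stage production | Pigsty primary/replica replication | 1 primary + 1-2 replicas, streaming replication |
| Large scale | Pigsty + Patroni | Automatic failover, 3 or more nodes |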
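# Suggested pg_conf parameters (the values below assume a host with 16 GB RAM)
shared_buffers: 4GB            # 25% of physical RAM
effective_cache_size: 12GB     # 75% of physical RAM
work_mem: 64MB                 # memory per sort/hash operation
maintenance_work_mem: 512MB    # memory for index builds and VACUUM
max_connections: 200           # use a connection pooler (PgBouncer recommended)
random_page_cost: 1.1          # tuned for SSD storage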
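The two memory ratios above (25% and 75% of physical RAM) can be applied to any host size. A small sketch, with the hypothetical helper `pgMemorySettings`, shows the arithmetic: the 4 GB / 12 GB values correspond to a 16 GB host, while the 16C32G database server from the node table would get 8 GB / 24 GB.

```typescript
interface PgMemorySettings {
  shared_buffers: string;
  effective_cache_size: string;
}

// Apply the 25% / 75% rules of thumb from the parameter list above.
function pgMemorySettings(ramGb: number): PgMemorySettings {
  if (!(ramGb > 0)) {
    throw new RangeError(`ramGb must be positive, got ${ramGb}`);
  }
  return {
    shared_buffers: `${ramGb * 0.25}GB`,
    effective_cache_size: `${ramGb * 0.75}GB`,
  };
}
```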
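| Stage | Approach | Data volume limit | High availability |
|---|---|---|---|
| MVP | Single-node Docker | < 2 GB | None |
| Early production | Redis Sentinel (3 nodes) | < 16 GB | Automatic failover |
| Large scale | Redis Cluster (6+ nodes) | > 16 GB | Sharding + automatic failover |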
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Redis │ │ Redis │ │ Redis │
│ Master │ │ Slave │ │ Slave │
│ :6379 │◄─┤ :6379 │ │ :6379 │
└────┬─────┘ └──────────┘ └──────────┘
│
┌────▼─────┐ ┌──────────┐ ┌──────────┐
│Sentinel 1│ │Sentinel 2│ │Sentinel 3│
│ :26379 │ │ :26379 │ │ :26379 │
└──────────┘ └──────────┘ └──────────┘
# Sentinel mode in docker-compose
services:
redis-master:
image: redis:7-alpine
command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru --appendonly yes
redis-slave-1:
image: redis:7-alpine
command: redis-server --replicaof redis-master 6379 --maxmemory 4gb
depends_on: [redis-master]
redis-sentinel-1:
image: redis:7-alpine
command: >
redis-sentinel /etc/redis/sentinel.conf
volumes:
- ./redis/sentinel.conf:/etc/redis/sentinel.conf
depends_on: [redis-master]
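| Stage | Approach | Data volume | Notes |
|---|---|---|---|
| MVP | Single-node Docker | < 50 GB | Basic analytical queries |
| Early production | Single node on a physical machine | < 500 GB | More memory and SSD |
| Large scale | Sharded cluster (2-4 shards) | > 500 GB | Horizontal sharding + replicas |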
┌──────────────┐
│ CH Proxy │
│ (query routing) │
└──────┬───────┘
│
┌────────────┼────────────┐
│ │ │
┌──────▼──────┐┌───▼────┐┌──────▼──────┐
│ Shard 1 ││ Shard 2││ Shard 3 │
│ ┌─┐ ┌─┐ ││ ┌─┐┌─┐││ ┌─┐ ┌─┐ │
│ │R│ │R│ ││ │R││R│││ │R│ │R│ │
│ └─┘ └─┘ ││ └─┘└─┘││ └─┘ └─┘ │
└─────────────┘└───────┘└─────────────┘
Replica1 Repl2 Repl1 Repl1 Repl2
R = Replica
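<!-- config.xml: server-level settings -->
<clickhouse>
    <background_pool_size>8</background_pool_size>
    <query_log>
        <database>system</database>
        <table>query_log</table>
        <!-- Retention for system.query_log; 30 days is an example value -->
        <ttl>event_date + INTERVAL 30 DAY DELETE</ttl>
    </query_log>
    <!-- MergeTree engine settings -->
    <merge_tree>
        <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
    </merge_tree>
</clickhouse>
<!-- users.xml: per-query limits belong in a settings profile -->
<clickhouse>
    <profiles>
        <default>
            <max_memory_usage>8000000000</max_memory_usage> <!-- 8 GB per query -->
            <max_threads>8</max_threads>
        </default>
    </profiles>
</clickhouse>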
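| Stage | Approach | Message throughput | Notes |
|---|---|---|---|
| MVP | Single-node Docker | < 1K/s | Management plugin + durable queues |
| Early production | Mirrored queues (2 nodes) | < 5K/s | Queue mirroring provides high availability |
| Large scale | Cluster (3+ nodes) + Quorum | > 5K/s | Quorum Queues (Raft consensus) |
# Policy: mirror every queue to all nodes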
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --apply-to queues
# rabbitmq.conf
listeners.tcp.default = 5672
management.tcp.port = 15672
# Memory high watermark (threshold that triggers flow control)
vm_memory_high_watermark.relative = 0.6
# Free disk space limit
disk_free_limit.absolute = 5GB
# Queue leader placement
queue_master_locator = min-masters
# Start all services
docker compose up -d
# Show service status
docker compose ps
# View logs
docker compose logs -f api-gateway
docker compose logs --since 1h query-engine
# Restart a single service
docker compose restart order-service
# Force a rebuild (after a code update)
docker compose up -d --build query-engine
# Show resource usage
docker stats --no-stream
# Open a shell client inside a container for debugging
docker compose exec postgres psql -U hotel -d hotel_db
docker compose exec redis redis-cli
# Back up the database (-T disables the pseudo-TTY so the piped dump stays clean)
docker compose exec -T postgres pg_dump -U hotel hotel_db | gzip > backup_$(date +%Y%m%d).sql.gz
# Health check
curl -s http://localhost:8080/health | jq .
groups:
- name: hotel-platform-alerts
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.01
for: 2m
labels:
severity: critical
annotations:
          summary: "HTTP 5xx error rate above 1%"
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 2
for: 5m
labels:
severity: warning
annotations:
          summary: "P99 latency above 2s: {{ $labels.service }}"
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          sum by (instance) (pg_stat_activity_count)
            / on (instance) pg_settings_max_connections > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connections above 90% of max_connections"
- alert: RabbitMQQueueBacklog
expr: rabbitmq_queue_messages > 10000
for: 10m
labels:
severity: warning
annotations:
          summary: "Message queue backlog: {{ $labels.queue }}"
- alert: DiskSpaceLow
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
for: 5m
labels:
severity: critical
annotations:
          summary: "Less than 20% disk space remaining"
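| Problem | Troubleshooting steps |
|---|---|
| Service fails to start | `docker compose logs <service>` → check environment variables → check for port conflicts |
| Database connection fails | Check PG status → `pg_isready` → check connection count → check DNS/network |
| Price query timeout | Check supplier APIs → Redis cache → distributed traces → concurrency limits |
| Message backlog | Check consumer logs → dead-letter queues → RabbitMQ management UI |
| Out of memory | `docker stats` → check ClickHouse queries → adjust `max_memory_usage` |
| Disk full | `du -sh /var/lib/docker/*` → clean up old logs → expand the disk |
Document maintenance: this document is updated as the architecture evolves; every major change must record the change date and the owner.
Related documents: pigsty-setup.md | monitoring.md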