
Plan: Batch & Async Processing Feature

1. Current State

  • 40+ scheduled jobs across the payment, game, NFT, leaderboard, and maintenance domains.
  • 25+ JMS listeners with dedicated queues.
  • ProgrammaticEndpointRegistration automatically registers per-machine/setting queues.
  • JMSListenerRecoveryScheduler automatically recovers failed listeners.
  • 4 Firebase listeners for realtime events.
  • Gift aggregation pipeline with Redis buffering.

2. Known Issues

| # | Issue | Severity |
|---|-------|----------|
| 1 | Queue lag is not monitored | High |
| 2 | DLQ has no alerting | High |
| 3 | Firebase listener health cannot be measured | Medium |
| 4 | Job execution time has no metrics | Medium |
| 5 | Dynamic queues are not updated automatically when a machine is added at runtime | Medium |
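
One possible way to make listener health (issue 3) measurable is to record when each listener last received an event and flag the ones that have been silent for too long. The sketch below is a minimal illustration under that assumption; the class name `ListenerHeartbeats`, the listener names, and the silence threshold are all hypothetical and not taken from the existing codebase.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical helper: tracks the last event time per listener and reports
// listeners that have been silent longer than the allowed window.
final class ListenerHeartbeats {
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();
    private final Duration maxSilence;

    ListenerHeartbeats(Duration maxSilence) {
        this.maxSilence = maxSilence;
    }

    /** Called from the listener callback whenever an event (or heartbeat) arrives. */
    void recordEvent(String listenerName, Instant when) {
        lastSeen.put(listenerName, when);
    }

    /** Returns, sorted by name, the listeners whose silence strictly exceeds maxSilence. */
    List<String> staleListeners(Instant now) {
        return lastSeen.entrySet().stream()
                .filter(e -> Duration.between(e.getValue(), now).compareTo(maxSilence) > 0)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Note that a listener on a genuinely quiet path would also look stale with this approach, so it probably needs a periodic heartbeat write to be reliable.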

3. Improvement Plan

Phase 1: Observability (high priority)

  • Task 1-1: Integrate ActiveMQ metrics with Spring Boot Actuator + Prometheus
  • Task 1-2: Add DLQ alerting that fires when messages are present in a DLQ
  • Task 1-3: Scheduled health check for Firebase listeners
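
The alert rule in Task 1-2 can be sketched as a small evaluator that takes the current DLQ depths and returns the queues that should trigger an alert. This is only an illustration: `DlqAlertEvaluator` and the queue names used with it are hypothetical, and how the depths are read from the broker (for example through the metrics from Task 1-1) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: decides which dead-letter queues should raise an alert.
// Depths are passed in as a plain map; in a real service they would come
// from broker metrics, which this sketch does not cover.
final class DlqAlertEvaluator {
    private final long threshold; // alert when depth is strictly greater than this

    DlqAlertEvaluator(long threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.threshold = threshold;
    }

    /** Returns the names of DLQs whose depth exceeds the threshold, sorted by name. */
    List<String> queuesToAlert(Map<String, Long> dlqDepths) {
        List<String> alerts = new ArrayList<>();
        // TreeMap gives a stable, name-sorted iteration order for the alert list.
        for (Map.Entry<String, Long> e : new TreeMap<>(dlqDepths).entrySet()) {
            if (e.getValue() > threshold) {
                alerts.add(e.getKey());
            }
        }
        return alerts;
    }
}
```

With a threshold of 0 this matches the rule as written in the task (alert on any message in a DLQ); a higher threshold may be worth considering if some queues are expected to dead-letter occasionally.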

Phase 2: Stability

  • Task 2-1: Review and standardize the DLQ strategy for all queues
  • Task 2-2: Verify that JMSListenerRecoveryScheduler works correctly
  • Task 2-3: Update dynamic queues when an admin creates a new machine/setting
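
For Task 2-3, one approach is an idempotent registry that reacts to a "machine created" event and registers a listener only if that queue is not already covered. The sketch below assumes this design; `DynamicQueueRegistry`, the `machine.<id>.queue` naming scheme, and the callback standing in for the real JMS endpoint registration are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical registry: remembers which per-machine queues already have a
// listener and registers a new one when a machine is created at runtime.
final class DynamicQueueRegistry {
    private final Set<String> registered = ConcurrentHashMap.newKeySet();
    // Stand-in for the real registration call (e.g. a JMS endpoint registration).
    private final Consumer<String> registerListener;

    DynamicQueueRegistry(Consumer<String> registerListener) {
        this.registerListener = registerListener;
    }

    // Assumed naming scheme; the actual queue names are not given in this plan.
    static String queueNameFor(String machineId) {
        return "machine." + machineId + ".queue";
    }

    /** Registers a listener once per machine; returns false for duplicate events. */
    boolean onMachineCreated(String machineId) {
        String queue = queueNameFor(machineId);
        if (!registered.add(queue)) {
            return false; // already registered, nothing to do
        }
        try {
            registerListener.accept(queue);
        } catch (RuntimeException ex) {
            registered.remove(queue); // allow a later retry if registration failed
            throw ex;
        }
        return true;
    }

    int registeredCount() {
        return registered.size();
    }
}
```

Exposing `registeredCount()` also makes it straightforward to compare the number of registered queues against the number of machines.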

Phase 3: Performance

  • Task 3-1: Profile job execution times (Spring Batch metrics)
  • Task 3-2: Tune concurrency for each queue type based on production workload
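
As a rough illustration of the profiling in Task 3-1, a job body can be wrapped in a timer that records run count and total duration per job name. This is a plain in-memory sketch, not the Spring Batch or Actuator integration itself; `JobTimer` and the job names are hypothetical, and a real setup would export these values to the metrics backend rather than keep them in a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory timer for scheduled jobs.
final class JobTimer {
    private static final class Stats {
        final LongAdder runs = new LongAdder();
        final LongAdder totalNanos = new LongAdder();
    }

    private final Map<String, Stats> statsByJob = new ConcurrentHashMap<>();

    /** Runs the job and records its duration, even if the job throws. */
    void time(String jobName, Runnable job) {
        long start = System.nanoTime();
        try {
            job.run();
        } finally {
            Stats s = statsByJob.computeIfAbsent(jobName, k -> new Stats());
            s.runs.increment();
            s.totalNanos.add(System.nanoTime() - start);
        }
    }

    long runCount(String jobName) {
        Stats s = statsByJob.get(jobName);
        return s == null ? 0 : s.runs.sum();
    }

    /** Average duration in milliseconds, or 0.0 if the job has never run. */
    double averageMillis(String jobName) {
        Stats s = statsByJob.get(jobName);
        long runs = s == null ? 0 : s.runs.sum();
        return runs == 0 ? 0.0 : (s.totalNanos.sum() / 1_000_000.0) / runs;
    }
}
```

Averages alone can hide slow outliers, so percentiles or a max value are probably worth recording as well before tuning concurrency in Task 3-2.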

Phase 4: Testing

  • Task 4-1: Test dynamic queue registration (startup)
  • Task 4-2: Test listener recovery (kill listener → check recovery)
  • Task 4-3: Load test gift aggregation pipeline

4. Timeline

gantt
    title Batch & Async - Improvement Plan
    dateFormat YYYY-MM-DD
    section Phase 1 Observability
    Task 1-1 ActiveMQ metrics  :crit, t1_1, 2025-01-01, 2d
    Task 1-2 DLQ alerting      :crit, t1_2, after t1_1, 1d
    Task 1-3 Firebase health   :t1_3, after t1_1, 2d
    section Phase 2 Stability
    Task 2-1 DLQ strategy      :t2_1, after t1_2, 2d
    Task 2-2 Recovery test     :t2_2, after t2_1, 1d
    Task 2-3 Dynamic update    :t2_3, after t2_2, 3d
    section Phase 3 Performance
    Task 3-1 Job metrics       :t3_1, after t1_3, 2d
    Task 3-2 Concurrency tune  :t3_2, after t3_1, 2d
    section Phase 4 Testing
    Task 4-1 Dynamic queue     :t4_1, after t2_3, 2d
    Task 4-2 Recovery test     :t4_2, after t4_1, 1d
    Task 4-3 Gift load test    :t4_3, after t4_2, 3d

5. Risks

| Risk | Impact | Mitigation |
|------|--------|------------|
| Sudden queue backlog spike | Processing delay | Queue lag alert + scale consumers |
| Job overlap | Double processing | fixedDelay + distributed lock (ShedLock) |
| Firebase listener crash | Events lost | Health check + auto-reconnect |
| Not enough dynamic queues | Rooms stuck | Monitor queue count vs machine count |
| DLQ accumulation | Silent errors | DLQ monitoring + alert |
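
The "job overlap" mitigation rests on one idea: a run is skipped if the previous run still holds the lock. The sketch below shows that idea with an in-process `AtomicBoolean`; it is a simplified stand-in and not ShedLock itself, which holds the lock in a shared store so that it also works across application instances. The class name `NonOverlappingJob` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical single-JVM guard illustrating the skip-if-running idea.
// A distributed lock is still needed when several instances run the same job.
final class NonOverlappingJob {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final Runnable job;

    NonOverlappingJob(Runnable job) {
        this.job = job;
    }

    /** Runs the job unless a previous run is still in progress; returns whether it ran. */
    boolean tryRun() {
        if (!running.compareAndSet(false, true)) {
            return false; // previous run has not finished, skip this trigger
        }
        try {
            job.run();
            return true;
        } finally {
            running.set(false); // always release, even if the job throws
        }
    }
}
```

Skipping rather than queuing the overlapping trigger is a deliberate choice here: for periodic jobs the next scheduled run usually picks up the remaining work.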