Monitoring Metrics
Overview
This document lists the key monitoring metrics, describes each one, and gives the threshold at which it is considered stable (okay). Monitoring against these thresholds supports high system availability and performance.
Infrastructure Metrics
Metric | Description | Stable Threshold |
---|---|---|
CPU Utilization | Tracks server load to prevent overutilization | < 80% |
Memory Usage | Ensures sufficient memory availability | < 80% |
Disk I/O & Usage | Detects disk bottlenecks affecting transactions | Disk Read/Write < 80% |
Network Latency & Throughput | Measures response times and bandwidth | < 100ms latency |
Load Balancer Health | Monitors traffic distribution and failover readiness | 99.99% Uptime |
Database Query Performance | Detects slow queries and optimizes database efficiency | Query Execution < 200ms |
Cloud Run/GKE Health | Ensures containerized services are running efficiently | < 2% error rate |
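
As an illustration of how the thresholds in this table could be evaluated, here is a minimal sketch. The metric keys (`cpu_utilization_pct` and so on) and the function names are hypothetical, not taken from any monitoring tool, and the uptime threshold is assumed to mean "at least 99.99%".

```python
# Hypothetical threshold table mirroring the infrastructure metrics above.
# Each entry is (comparison, limit); the metric keys are illustrative names.
INFRA_THRESHOLDS = {
    "cpu_utilization_pct": ("lt", 80.0),
    "memory_usage_pct": ("lt", 80.0),
    "disk_io_pct": ("lt", 80.0),
    "network_latency_ms": ("lt", 100.0),
    "load_balancer_uptime_pct": ("ge", 99.99),  # assumption: "at least 99.99%"
    "db_query_time_ms": ("lt", 200.0),
    "container_error_rate_pct": ("lt", 2.0),
}


def is_stable(metric, value, thresholds=INFRA_THRESHOLDS):
    """Return True when the value is inside the stable range for the metric."""
    comparison, limit = thresholds[metric]
    if comparison == "lt":
        return value < limit
    return value >= limit


def unstable_metrics(sample, thresholds=INFRA_THRESHOLDS):
    """Return the sorted names of all metrics in the sample that breach their threshold."""
    return sorted(
        name for name, value in sample.items()
        if not is_stable(name, value, thresholds)
    )


sample = {
    "cpu_utilization_pct": 85.0,
    "memory_usage_pct": 60.0,
    "network_latency_ms": 100.0,  # not strictly below 100 ms, so flagged
}
print(unstable_metrics(sample))  # ['cpu_utilization_pct', 'network_latency_ms']
```

Keeping the thresholds in a single table like this makes it easier to keep the code and this document in sync when a limit changes.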
Application Metrics
Metric | Description | Stable Threshold |
---|---|---|
API Response Time | Measures latency (P50, P95, P99) of API calls | P95 < 300ms |
API Error Rate | Tracks 4xx and 5xx errors affecting user experience | < 1% |
Transactions Per Second (TPS) | Measures system capacity to handle payments | Based on system capacity |
Payment Gateway Response Time | Detects slow third-party integrations affecting transactions | < 500ms |
Service Availability | Tracks uptime % for SLA compliance | 99.99% uptime |
Service Latency | Measures the delay in internal service responses | < 200ms |
Requests Per Minute (RPM) | Tracks API request volume | Based on load |
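
The latency percentiles and the error rate above can be derived from raw request data. The sketch below is one possible approach: it assumes the nearest-rank percentile definition (monitoring tools may interpolate differently) and treats every 4xx and 5xx status code as an error. The function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("percentile requires at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate_pct(status_codes):
    """Share of responses with a 4xx or 5xx status code, as a percentage."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return 100.0 * errors / len(status_codes)


# Illustrative data: 100 latency samples of 1..100 ms and 100 status codes.
latencies_ms = list(range(1, 101))
statuses = [200] * 98 + [404, 500]

p95 = percentile(latencies_ms, 95)
print(p95, p95 < 300)            # 95 True  (P95 < 300ms holds)
print(error_rate_pct(statuses))  # 2.0      (breaches the < 1% threshold)
```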
Transaction Metrics
Metric | Description | Stable Threshold |
---|---|---|
Transaction Volume | Tracks total transactions over time | |
Transaction Failure Rate | Monitors failed payments to detect issues | < 2% |
Transaction Latency | Measures the time taken to process a transaction | < 2s |
Refund & Chargeback Rate | Identifies customer disputes and fraud trends | < 1% |
Payment Decline Rate | Analyzes unsuccessful transactions due to bank rejections | < 5% |
Approval Rate | Percentage of transactions successfully authorized by banks | > 95% |
Settlement Success Rate | Measures percentage of payments successfully settled | > 98% |
Transaction Success Rate | Percentage of successful payments over total attempts | > 98% |
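
The rate metrics in this table are ratios of transaction counts. The following sketch shows one way to compute three of them and compare them with the thresholds above. It assumes that every rate uses total attempts as its denominator, which this document states only for the Transaction Success Rate; the function and key names are hypothetical.

```python
def rate_pct(count, total):
    """Return count as a percentage of total, or 0.0 when there were no attempts."""
    return 100.0 * count / total if total else 0.0


def transaction_rates(attempted, succeeded, declined, failed):
    """Compute rates over total attempts (the shared denominator is an assumption)."""
    return {
        "success_rate_pct": rate_pct(succeeded, attempted),
        "decline_rate_pct": rate_pct(declined, attempted),
        "failure_rate_pct": rate_pct(failed, attempted),
    }


def within_thresholds(rates):
    """Check the rates against the stable thresholds listed in the table above."""
    return (
        rates["success_rate_pct"] > 98.0
        and rates["decline_rate_pct"] < 5.0
        and rates["failure_rate_pct"] < 2.0
    )


rates = transaction_rates(attempted=1000, succeeded=985, declined=10, failed=5)
print(rates["success_rate_pct"])  # 98.5
print(within_thresholds(rates))   # True
```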
Service Health Metrics
Metric | Description | Stable Threshold |
---|---|---|
System Uptime | Tracks overall system availability | > 99.99% |
Service Response Time | Measures end-to-end service latency | < 500ms |
Dependency Health Checks | Monitors third-party service dependencies | 99.99% Uptime |
Service Degradation Alerts | Detects slowdowns before failure | < 2% error rate |
Incident Recovery Time | Measures MTTR (Mean Time To Recovery) | < 30min |
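
To make the uptime and MTTR thresholds concrete, the sketch below converts an uptime target into a downtime budget and averages incident durations. The function names and the incident timestamps are illustrative assumptions, and MTTR is taken here as the mean time from detection to resolution.

```python
from datetime import datetime, timedelta


def allowed_downtime(uptime_target_pct, period):
    """Downtime budget implied by an uptime target over a given period."""
    return period * (1 - uptime_target_pct / 100)


def mttr(incidents):
    """Mean duration of incidents given as (detected_at, resolved_at) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# A 99.99% target over a 30-day month leaves roughly 4.3 minutes of downtime.
budget = allowed_downtime(99.99, timedelta(days=30))
print(round(budget.total_seconds() / 60, 2))  # 4.32

# Illustrative incidents lasting 20 and 40 minutes.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 20)),
    (datetime(2024, 1, 18, 14, 0), datetime(2024, 1, 18, 14, 40)),
]
mean_recovery = mttr(incidents)
print(mean_recovery)                          # 0:30:00
print(mean_recovery < timedelta(minutes=30))  # False (threshold is strictly < 30min)
```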