Monitoring Metrics

Overview

This document lists the key monitoring metrics, describes what each one tracks, and gives the threshold at which it is considered stable (okay). Keeping every metric within its threshold supports high system availability and performance.

  1. Infrastructure Metrics

| Metric | Description | Stable Threshold |
| --- | --- | --- |
| CPU Utilization | Tracks server load to prevent overutilization | < 80% |
| Memory Usage | Ensures sufficient memory availability | < 80% |
| Disk I/O & Usage | Detects disk bottlenecks affecting transactions | Disk Read/Write < 80% |
| Network Latency & Throughput | Measures response times and bandwidth | < 100ms latency |
| Load Balancer Health | Monitors traffic distribution and failover readiness | 99.99% Uptime |
| Database Query Performance | Detects slow queries and optimizes database efficiency | Query Execution < 200ms |
| Cloud Run/GKE Health | Ensures containerized services are running efficiently | < 2% error rate |
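
As an illustration of how the infrastructure thresholds above could be evaluated against a metrics sample, here is a minimal Python sketch. The metric key names and the `find_breaches` helper are hypothetical and not part of any monitoring tool; only the numeric limits come from the table. Load Balancer Health is left out because its threshold is a minimum (uptime) rather than an upper limit.

```python
# Upper limits taken from the Stable Threshold column.
# The key names are hypothetical; map them to whatever your collector emits.
INFRA_LIMITS = {
    "cpu_utilization_pct": 80.0,
    "memory_usage_pct": 80.0,
    "disk_io_pct": 80.0,
    "network_latency_ms": 100.0,
    "db_query_ms": 200.0,
    "container_error_rate_pct": 2.0,
}


def find_breaches(sample: dict[str, float]) -> list[str]:
    """Return the sorted names of metrics that are at or above their limit.

    The thresholds are strict ("< 80%"), so a value equal to the limit
    is treated as a breach. Unknown metric names are ignored.
    """
    return sorted(
        name
        for name, value in sample.items()
        if name in INFRA_LIMITS and value >= INFRA_LIMITS[name]
    )


sample = {
    "cpu_utilization_pct": 91.5,
    "memory_usage_pct": 62.0,
    "network_latency_ms": 100.0,
}
print(find_breaches(sample))  # ['cpu_utilization_pct', 'network_latency_ms']
```

Keeping the limits in a single mapping means a threshold change only needs to be made in one place.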

  2. Application Metrics

| Metric | Description | Stable Threshold |
| --- | --- | --- |
| API Response Time | Measures latency (P50, P95, P99) of API calls | P95 < 300ms |
| API Error Rate | Tracks 4xx and 5xx errors affecting user experience | < 1% |
| Transactions Per Second (TPS) | Measures system capacity to handle payments | Based on system capacity |
| Payment Gateway Response Time | Detects slow third-party integrations affecting transactions | < 500ms |
| Service Availability | Tracks uptime % for SLA compliance | 99.99% uptime |
| Service Latency | Measures the delay in internal service responses | < 200ms |
| Requests Per Minute (RPM) | Tracks API request volume | Based on load |
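
To make the P50/P95/P99 and error-rate definitions concrete, the following Python sketch computes them from raw samples. It uses the nearest-rank percentile method, which is one common convention; monitoring backends often interpolate or estimate from histograms instead, so their values may differ slightly. The function names and sample data are illustrative assumptions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def error_rate_pct(status_codes: list[int]) -> float:
    """Share of responses with a 4xx or 5xx status, as a percentage."""
    if not status_codes:
        raise ValueError("status_codes must not be empty")
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return 100 * errors / len(status_codes)


# Hypothetical data: 20 latency samples (100ms to 290ms) and 1000 responses.
latencies_ms = [float(v) for v in range(100, 300, 10)]
codes = [200] * 993 + [404] * 4 + [500] * 3

p95 = percentile(latencies_ms, 95)
print(p95, p95 < 300)                 # 280.0 True
print(error_rate_pct(codes) < 1.0)    # True (7 errors out of 1000)
```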

  3. Transaction Metrics

| Metric | Description | Stable Threshold |
| --- | --- | --- |
| Transaction Volume | Tracks total transactions over time | |
| Transaction Failure Rate | Monitors failed payments to detect issues | < 2% |
| Transaction Latency | Measures the time taken to process a transaction | < 2s |
| Refund & Chargeback Rate | Identifies customer disputes and fraud trends | < 1% |
| Payment Decline Rate | Analyzes unsuccessful transactions due to bank rejections | < 5% |
| Approval Rate | Percentage of transactions successfully authorized by banks | > 95% |
| Settlement Success Rate | Measures percentage of payments successfully settled | > 98% |
| Transaction Success Rate | Percentage of successful payments over total attempts | > 98% |
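
The rate metrics above are ratios over transaction counts. The sketch below shows one way they might be derived; the counts are made up, and the choice of denominator (total attempts) and the decision to count bank declines as failed attempts are assumptions, since this page does not define them.

```python
def rate_pct(part: int, total: int) -> float:
    """Return part/total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * part / total


# Hypothetical counts for one reporting window.
attempted = 10_000
succeeded = 9_850
declined_by_bank = 120  # assumed here to be a subset of the failed attempts

success_rate = rate_pct(succeeded, attempted)
failure_rate = rate_pct(attempted - succeeded, attempted)
decline_rate = rate_pct(declined_by_bank, attempted)

print(success_rate, success_rate > 98)  # 98.5 True
print(failure_rate, failure_rate < 2)   # 1.5 True
print(decline_rate < 5)                 # True
```

Under these assumptions the success and failure rates always sum to 100%, which is why the "> 98%" and "< 2%" thresholds describe the same boundary.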

 

  4. Service Health Metrics

| Metric | Description | Stable Threshold |
| --- | --- | --- |
| System Uptime | Tracks overall system availability | > 99.99% |
| Service Response Time | Measures end-to-end service latency | < 500ms |
| Dependency Health Checks | Monitors third-party service dependencies | 99.99% Uptime |
| Service Degradation Alerts | Detects slowdowns before failure | < 2% error rate |
| Incident Recovery Time | Measures MTTR (Mean Time To Repair) | < 30min |
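
To put the uptime and MTTR thresholds in concrete terms, this Python sketch converts an uptime target into an allowed-downtime budget and computes MTTR as the mean of incident repair durations. The measurement windows (30 and 365 days) and the incident durations are assumptions for illustration; this page does not state the window over which uptime is measured.

```python
def downtime_budget_minutes(uptime_target_pct: float, window_days: int) -> float:
    """Minutes of downtime allowed in the window while still meeting the target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - uptime_target_pct) / 100


def mttr_minutes(repair_durations_min: list[float]) -> float:
    """Mean time to repair: the average duration of the given incidents."""
    if not repair_durations_min:
        raise ValueError("at least one incident is required")
    return sum(repair_durations_min) / len(repair_durations_min)


print(round(downtime_budget_minutes(99.99, 30), 2))   # 4.32
print(round(downtime_budget_minutes(99.99, 365), 2))  # 52.56

# Hypothetical repair durations, in minutes, for three incidents.
incidents = [12.0, 25.0, 41.0]
print(mttr_minutes(incidents), mttr_minutes(incidents) < 30)  # 26.0 True
```

Note that a 99.99% target leaves roughly 4.3 minutes of downtime per 30 days, so a single incident that takes the full 30 minutes to repair would exceed a monthly budget and use more than half of a yearly one.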
