MONITORING PLAYBOOK | ATS / SWITCH
TABLE OF CONTENTS
Roles & Responsibilities while on duty
Escalation Rules
Mailing Etiquette
Must Watch:
KSS on DCIR Product
ROLES AND RESPONSIBILITIES WHILE ON DUTY
Hourly Report:
While on duty, an hourly report must be sent by mail to aptpaymonitoring@teamapt.com, copying success.imakuh@teamapt.com and chidera.molokwu@teamapt.com, commencing an hour from the start of the shift. This report details current incidents and any portal issues in the format below:
Hourly Report Example
Incident / Issue | Details | Resolution Status |
---|---|---|
Wema Bank DCIR failing | (Jira ticket ID and background to cause of failure) | Details on resolution steps, whether pending or already resolved |
Wema Bank portal | Settlement pending | Details on resolution steps, whether pending or already resolved |
N.B: Any issue that occurs during a shift but is not documented will count against the staff member at appraisal.
Incident Logging on Jira
All issues that occur during a shift must be logged at the time the issue is picked up, to allow accurate measurement of the time to resolution.
Steps to log issues on Jira
Staff should log in to Jira via the link: https://teamapt.atlassian.net/jira/servicedesk/projects/TDSD/queues/custom/533
Click on the Create button
Select the issue type as Incident
Fill in the required details as requested and shown in the images below
Assign the issue to yourself, or to a technical support staff member if one has been assigned to assist you in resolving the issue
N.B: THE "SUMMARIZE THE PROBLEM" FIELD SHOULD FOLLOW A FORMAT SIMILAR TO THE EMAIL SUBJECT FORMAT DESCRIBED LATER IN THIS DOCUMENT
Portal Checks
While on shift, periodic checks must be carried out on all DCIR portals to ensure no issue is affecting them. These checks should be done 3 times in a shift: at the start, in the middle and at the end. Any issue noticed should be included in the handover report. The checks to be carried out on the portals are detailed below:
LOGIN - Confirm login to portals is fine
BACKOFFICE DISPUTE - Confirm the last time a back office dispute was logged
BULK SETTLEMENT - Confirm settlement has run for the hours in the day and the status is completed
PARTICIPANT REPORT - Confirm the participant report for the previous day ran and is in completed state
TRANSACTION MIGRATION - Confirm that the last transactions viewable on the portal are within 5 minutes of the current time
SETTLEMENT REPORT - Confirm settlement reports are generating
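The TRANSACTION MIGRATION check above reduces to a timestamp comparison. The sketch below is only an illustration of that 5 minute rule; the function name `migration_is_current` is hypothetical, and the timestamp of the last transaction is assumed to be read off the portal by hand.

```python
from datetime import datetime, timedelta

# Threshold taken from the TRANSACTION MIGRATION check above.
MAX_MIGRATION_LAG = timedelta(minutes=5)


def migration_is_current(last_txn_time: datetime, now: datetime) -> bool:
    """Return True if the newest transaction on the portal is at most 5 minutes old."""
    return now - last_txn_time <= MAX_MIGRATION_LAG


if __name__ == "__main__":
    now = datetime(2024, 6, 21, 10, 0)
    print(migration_is_current(datetime(2024, 6, 21, 9, 57), now))  # 3 minutes behind
    print(migration_is_current(datetime(2024, 6, 21, 9, 50), now))  # 10 minutes behind
```

A False result would be treated as a portal issue and included in the handover report.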
Handover Report:
A handover report must be sent at the end of the shift, documenting incidents logged while on duty and clearly stating all open issues so the next person on shift can continue from the current resolution stage.
Please include:
Status of ATS interchanges (On/off)
Number of incident tickets raised / closed (move all open tickets to the app monitoring engineer taking over)
Portal checks (settlement, dispute, authentication checks)
Validation report status; Date of Last validation report sent
WhatsApp DCIR Config Updates:
DCIR config updates must be posted on the DCIR tech support group at every config change, or every 2 hours if no changes have been made within those 2 hours.
Format is below:
DCIR (TeamApt)
FCMB: 4k - 300k
Wema (No Verve): 4k - 300k
Keystone: 4k - 300k
Union Verve: 0 - 99999999999k
Union DCIR: 4k - 99999999999k
Polaris: 0 - 99999999999k
UBA: 4k - 300k
FCMB Verve: 4k - 300k
Access DCIR: 8k - 300k
Turned Off
Fidelity: 4k - 300k
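To keep the update message consistent between shifts, the format above could be generated rather than typed. This is a minimal sketch under that assumption: `format_config_update` is a hypothetical helper, and the limits are passed in exactly as they would be written in the group.

```python
def format_config_update(active, turned_off):
    """Build the config update text.

    active / turned_off: lists of (name, lower, upper) tuples, with the
    limits written the same way as in the group message (e.g. "4k").
    """
    lines = ["DCIR (TeamApt)"]
    lines += [f"{name}: {low} - {high}" for name, low, high in active]
    if turned_off:
        # Interchanges that are currently off go under their own heading.
        lines.append("Turned Off")
        lines += [f"{name}: {low} - {high}" for name, low, high in turned_off]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_config_update(
        active=[("FCMB", "4k", "300k"), ("Polaris", "0", "99999999999k")],
        turned_off=[("Fidelity", "4k", "300k")],
    ))
```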
Please note that issues that occur during a shift, whether transactional or tied to the portal, should be communicated on the Aptpay technical support WhatsApp group when there is no resolution in sight or there is confusion on what action to take. When in doubt, always ask!
ESCALATION RULES
For transactional issues, monitoring is to be done on Slack.
Below is an example of an alert on Slack. Once such alerts occur 3 times within a 30-45 minute interval, it must be assumed there is an issue and escalation protocols must be triggered.
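The three-alert rule can be expressed as a sliding window over alert times. The sketch below assumes the upper bound of the interval (45 minutes) as the window and that alert times are noted from Slack by hand; `should_escalate` is a hypothetical name, not an existing tool.

```python
from datetime import datetime, timedelta

ALERT_THRESHOLD = 3                    # alerts needed before escalating
ALERT_WINDOW = timedelta(minutes=45)   # assumption: upper bound of the 30-45 minute interval


def should_escalate(alert_times):
    """Return True if any ALERT_THRESHOLD alerts fall within ALERT_WINDOW of each other."""
    times = sorted(alert_times)
    for i in range(len(times) - ALERT_THRESHOLD + 1):
        # Compare the first and last alert of each run of ALERT_THRESHOLD alerts.
        if times[i + ALERT_THRESHOLD - 1] - times[i] <= ALERT_WINDOW:
            return True
    return False


if __name__ == "__main__":
    alerts = [datetime(2024, 6, 21, 10, 0),
              datetime(2024, 6, 21, 10, 20),
              datetime(2024, 6, 21, 10, 40)]
    print(should_escalate(alerts))  # three alerts inside 40 minutes
```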
There are two types of links on Slack that can aid you in investigating when an issue occurs:
The first link allows you to view all the errors that have occurred on a particular DCIR within a specified time range. For example, if Access has been returning a low success rate alert, the prevalent error on the bank can be checked using the link below: https://watchtower.teamapt.com/d/rQPtldJIk/spool-samples?viewPanel=3&orgId=1&from=now-6h&to=now&var-Interchange=24&var-Resp_Code=92&var-Time=5
Once the link is clicked the view below shows an example of the expected result to aid in further investigations:
Key components in screenshot above
Interchange: Can be adjusted to any DCIR bank to return the prevalent errors for that bank
Response code: Should be ignored for this particular link
Time range (in mins): Can be adjusted to capture the time range of a particular failure
The second link is a follow-up to the first and allows transactions tied to a specific response code to be spooled. The first link lets you flag which response code is causing issues; the second lets you spool specific transaction details for the response code you specify:
Click here to spool samples of the affected transactions --> https://watchtower.teamapt.com/d/rQPtldJIk/spool-samples?viewPanel=2&orgId=1&from=now-6h&to=now&var-Interchange=24&var-Resp_Code=92&var-Time=5
Once the link is clicked the view below shows an example of the expected result to aid in further investigations:
Key components in screenshot above
Interchange: Can be adjusted to any DCIR bank to return transactions for that bank
Response code: The response code seen to occur most often in the first link can be specified here to bring out the specific transactions affected by that error
Time range (in mins): Can be adjusted to capture the time range of a particular failure
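Both links share the same base URL and differ only in their query parameters, so they can be rebuilt for another bank or response code. The sketch below uses only the parameter names visible in the two links above; `spool_link` is a hypothetical helper, and the reading that viewPanel=3 is the error view and viewPanel=2 the transaction samples view is inferred from those links.

```python
from urllib.parse import urlencode

BASE_URL = "https://watchtower.teamapt.com/d/rQPtldJIk/spool-samples"


def spool_link(view_panel, interchange, resp_code, minutes,
               time_from="now-6h", time_to="now"):
    """Rebuild a Watchtower link from the parameters seen in the Slack alerts."""
    params = {
        "viewPanel": view_panel,         # 3 in the first link, 2 in the second
        "orgId": 1,
        "from": time_from,
        "to": time_to,
        "var-Interchange": interchange,  # 24 is the value in the example links
        "var-Resp_Code": resp_code,
        "var-Time": minutes,             # time range in minutes
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Reproduces the second example link above.
    print(spool_link(view_panel=2, interchange=24, resp_code="92", minutes=5))
```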
The following escalation protocols apply per bank:
Bank | ERROR (Response Code) | PROTOCOL |
---|---|---|
All banks | 69 | Escalate via mail to the channels team requesting a check on the database, as error 69 indicates slowness caused by DB issues |
All banks | 91 (response time 55 seconds) | Check the Grafana VPN tabs to confirm the VPN tunnel is up for all banks except Wema and Access |
UBA, KEYSTONE | 91 (response time between 2-21 seconds) | Escalate to the bank with samples, as this is an issue from the FEP. Always log in to confirm this from the logs before escalating |
All banks without VPN | 91 (response time between 2-21 seconds) | Escalate to the bank with samples, as this is an issue from the FEP |
All banks without VPN | 91 (response time between 0-1 seconds) | Escalate to the bank for a restart of the bank cashout service, as there has been a disconnection |
UBA, KEYSTONE | 91 (response time between 0-1 seconds) | Log in via the VPN channel to the realtime server and restart the bank cashout service |
All banks | 96 | Confirm whether transactions are getting to the bank’s FEP. For banks without VPN, a mail can be sent to the bank to confirm transactions got to them. If it is confirmed that transactions did not get to them, suspect a DB issue and escalate accordingly for a session |
Wema Bank | 96 | Confirm whether the failures are unique to the Verve card BINs. If so, turn off the Wema DCIR Verve only |
UBA | 21 | Log in via VPN and confirm whether there are issues from the bank’s FEP. This issue usually occurs due to delays from the bank’s end |
N.B: Always assume response codes 51, 61 and 57 are customer-induced and ignore them
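The table above can also be read as a lookup on bank, response code and response time. The sketch below is one possible encoding for quick reference, not an existing tool: `escalation_protocol` is a hypothetical name, treating UBA and Keystone as the only banks with VPN login is an assumption drawn from the table, and response times the table does not cover fall through to asking on the support group.

```python
from typing import Optional

CUSTOMER_INDUCED = {"51", "57", "61"}
VPN_LOGIN_BANKS = {"UBA", "KEYSTONE"}  # assumption: the banks the table says to log in to
NO_PROTOCOL = "No protocol listed: ask on the technical support group"


def escalation_protocol(bank: str, code: str,
                        response_secs: Optional[float] = None) -> str:
    """Return a short summary of the protocol for a bank, response code and response time."""
    bank = bank.upper()
    if code in CUSTOMER_INDUCED:
        return "Ignore: customer induced"
    if code == "69":
        return "Mail the channels team to check the database"
    if code == "91" and response_secs is not None:
        via_vpn = bank in VPN_LOGIN_BANKS
        if response_secs == 55:
            return "Check Grafana VPN tabs (all banks except Wema and Access)"
        if 2 <= response_secs <= 21:
            prefix = "Confirm from the logs, then escalate" if via_vpn else "Escalate"
            return f"{prefix} to the bank with samples (FEP issue)"
        if 0 <= response_secs <= 1:
            if via_vpn:
                return "Log in via VPN and restart the bank cashout service"
            return "Escalate to the bank to restart the bank cashout service"
    if code == "96":
        if bank.startswith("WEMA"):
            return "Check if failures are unique to Verve BINs; if so turn off Wema DCIR Verve only"
        return "Confirm transactions are reaching the bank's FEP"
    if code == "21" and bank == "UBA":
        return "Log in via VPN and check for issues from the bank's FEP"
    return NO_PROTOCOL


if __name__ == "__main__":
    print(escalation_protocol("UBA", "91", 0.5))
    print(escalation_protocol("FCMB", "91", 10))
```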
MAILING ETIQUETTE
When escalating an issue to a bank via mail, the following subject format should be adopted:
BANK | DCIR | RESPONSE CODE (transactional failures only) | EFFECT OF FAILURE ON TRANSACTIONS/PORTAL | DATE (use format YYYYMMDD)
See examples below to guide:
Example: FCMB | DCIR | RC 69 | High response time | 20240621
FIDELITY | DCIR | RC 91 | Intermittent timeouts | 20240622
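The subject format can be assembled mechanically, which avoids date and separator slips. A minimal sketch, assuming a hypothetical helper named `escalation_subject`; the response code is left out for portal issues, as the format states it applies to transactional failures only.

```python
from datetime import date
from typing import Optional


def escalation_subject(bank: str, effect: str, on: date,
                       response_code: Optional[str] = None) -> str:
    """Build an escalation mail subject in the BANK | DCIR | RC | EFFECT | DATE format."""
    parts = [bank.upper(), "DCIR"]
    if response_code is not None:            # transactional failures only
        parts.append(f"RC {response_code}")
    parts += [effect, on.strftime("%Y%m%d")]  # date as YYYYMMDD
    return " | ".join(parts)


if __name__ == "__main__":
    print(escalation_subject("FCMB", "High response time", date(2024, 6, 21), "69"))
    print(escalation_subject("Wema", "Settlement pending", date(2024, 6, 22)))
```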
Every escalation is expected to be followed up with the escalated bank every 30 minutes until resolution.
If there is no response after an hour, a call must be placed to the Level 1 contact person via the escalation matrix to ensure there is a response.
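The follow-up timing above can be sketched as a small decision function. The name `next_action` and its return strings are hypothetical; the 30 minute and 1 hour thresholds come from the two rules above.

```python
from datetime import datetime, timedelta

FOLLOW_UP_EVERY = timedelta(minutes=30)
CALL_AFTER = timedelta(hours=1)


def next_action(escalated_at: datetime, last_contact_at: datetime,
                now: datetime, bank_has_responded: bool) -> str:
    """Decide what to do next on an open escalation."""
    # No response at all an hour after escalating: call the Level 1 contact.
    if not bank_has_responded and now - escalated_at >= CALL_AFTER:
        return "call Level 1 contact"
    # Otherwise follow up by mail every 30 minutes until resolution.
    if now - last_contact_at >= FOLLOW_UP_EVERY:
        return "send follow-up mail"
    return "wait"


if __name__ == "__main__":
    start = datetime(2024, 6, 21, 10, 0)
    print(next_action(start, start, datetime(2024, 6, 21, 10, 35), False))
```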