Teamapt IT-Service-Management


Definitions

Incident: An unplanned interruption to a service, a reduction in the quality of a service, or an event that has not yet impacted the service but could do so if action is not taken.

  • Incidents are typically logged, managed, and resolved through the incident management process to restore normal service operation as quickly as possible and minimize the impact on business operations.

Service Request: A formal request from a user for something to be provided, such as information, advice, a standard change, or access to a service.

Task: A discrete unit of work or activity that is required to achieve a specific objective, such as resolving an incident, fulfilling a service request, or implementing a change.

Problem: The underlying cause of one or more incidents. The goal of problem management is to identify and eliminate the root cause to prevent recurrence.

Differences between Incidents, Service Requests, and Tasks

| Incidents | Service request | Task |
| --- | --- | --- |
| Unplanned interruption or reduction in service quality. | Formal user request for something to be provided (e.g., information, access). | Discrete unit of work required to achieve a specific objective. |
| Restore normal service operation as quickly as possible. | Fulfill a user’s request for a service or information. | Complete a specific activity within a process or workflow. |
| Reactive (response to an issue or disruption). | Proactive (fulfillment of a user’s need or request). | Can be reactive or proactive, depending on the context. |
| Server downtime; application crash; network outage. | Password reset; request for software installation. | Investigate the root cause of an incident; implement a change. |
| Often high risk due to potential business impact. | Low risk, as requests are usually routine and pre-approved. | Risk level varies depending on the context (e.g., change tasks may have higher risk). |

 

Incident Management

Incident management is the process of restoring normal service operation as quickly as possible after an IT service disruption in order to minimize the impact on the organization. Following ITIL principles, it involves identifying, logging, categorizing, prioritizing, investigating, diagnosing, and resolving incidents.

It encompasses several key steps:

  1. Incident identification

  2. Incident logging

  3. Incident categorization & prioritization

  4. Incident diagnosis & investigation

  5. Incident escalation

  6. Incident resolution

  7. Incident closure & review

 

Incident_mgt2.png

  

  1. Incident identification

Incidents can be reported through various channels, such as Customer Support, Integration Support, Operations, or monitoring systems.

 

  2. Incident logging

Once an incident is identified and reported, it is logged in the incident management system (Jira). The incident record includes details such as the date and time of the incident, the affected service, and the priority level.
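As a rough illustration of the fields named above, an incident record could be modeled as follows. This is a minimal sketch, not the actual Jira schema: the class name, field names, `summary` field, and the example service name are all assumptions, and the priority labels follow those listed under categorization & prioritization.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Priority(Enum):
    # Priority labels as listed under categorization & prioritization.
    P1 = "critical"
    P2 = "high"
    P3 = "medium"
    P4 = "low"


@dataclass
class IncidentRecord:
    """Hypothetical in-memory shape of an incident record (not a Jira schema)."""
    occurred_at: datetime     # date and time of the incident
    affected_service: str     # the service that was impacted
    priority: Priority        # assigned priority level
    summary: str = ""         # assumed free-text field, not named in this document

    def to_fields(self) -> dict:
        """Flatten the record into plain values, e.g. for a ticket form."""
        return {
            "occurred_at": self.occurred_at.isoformat(),
            "affected_service": self.affected_service,
            "priority": self.priority.name,
            "summary": self.summary,
        }


record = IncidentRecord(
    occurred_at=datetime(2025, 1, 15, 9, 30),
    affected_service="payments-api",  # made-up service name
    priority=Priority.P1,
    summary="Checkout requests timing out",
)
print(record.to_fields())
```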

 

  3. Incident categorization & prioritization

The incident is categorized by type, for example by the service affected, and by how it should be handled. Based on business impact and urgency, it is assigned a priority (critical: P1, high: P2, medium: P3, low: P4) and tagged as either a major or a minor incident. This helps determine the appropriate response and escalation procedures.

Major incidents are incidents of critical and urgent severity reported through internal monitoring/alerting tools. They are handled as follows:

  • Communication: Notify management and relevant stakeholders about the incident.

  • Investigation Call: Conduct a call to analyze the issue, identify the root cause, and develop a resolution plan.

  • Resolution: Execute the plan to resolve the incident.

  • Post-Incident Report: Document the incident, including its cause, resolution steps, and preventive measures to avoid recurrence.

 

  4. Incident diagnosis & investigation

The incident is analyzed to determine the root cause and to identify any workarounds or temporary solutions that can restore the service quickly.

 

  5. Incident escalation

If the incident cannot be resolved within a predefined time frame or requires additional resources, it is escalated to higher-level support teams or management.

 

  6. Incident resolution

The incident is resolved by implementing a permanent solution or a workaround that restores normal service operation.

 

  7. Incident closure & review

Once the incident is resolved, it is closed in the incident management system. For major incidents that have had a significant business impact, a post-incident review is conducted to assess how the cause of the incident was investigated and identified, how effective the resolution process was, and where there are opportunities for improvement.

  • Incident reporting

Incidents are reported to stakeholders, such as service owners, management, or customers, to keep them informed of service performance and incident trends.

 

 

Roles and Responsibilities

  • Incident Manager

  • Incident Analyst (App Monitoring)

  • Technical Support Specialist

  • Problem Manager

  • Change Manager

  • Stakeholders: Business Leads, CTO, Engineering Leads, Infrastructure Lead, Operations Lead, Customer Support Lead, etc.

 

Escalation Matrix

Escalation_matrix.png

 

Problem Management

Problem management focuses on identifying, analyzing, and resolving the root causes of incidents to prevent their recurrence and minimize their impact on business operations.

Problem Identification

  • Problems are identified through various methods, such as recurring incidents, trend analysis, or proactive detection techniques.

  • Problems are categorized and prioritized based on their impact on services, business objectives, and alignment with ISO/IEC 20000 guidelines for effective service management.

Problem Logging and Documentation

  • Problems are logged in Jira Service Desk with comprehensive details, including:

    • Problem description

    • Related incidents

    • Initial impact assessment

  • Documentation is continuously updated to reflect the progress of the investigation, findings, and any changes.

Problem Investigation and Diagnosis

  • The assigned TSE/SRE conducts an in-depth investigation to identify the root cause(s) of the problem and assess its potential impact on services.

  • Detailed documentation of the investigation process, including root cause analysis, is to be maintained.

Problem Resolution and Workaround Implementation

  • The TSE/SRE collaborates with relevant stakeholders to develop a resolution plan that effectively addresses the root cause.

  • Temporary workarounds may be implemented to minimize the problem's impact on services until a permanent solution is established.

Problem Closure and Communication

  • Once the problem is resolved, the TSE/SRE updates the problem record, documents the resolution, and marks it as closed.

  • Users and stakeholders are informed of the resolution and any necessary actions they need to take. For critical incidents, an incident report is shared to ensure transparency and compliance with ISO/IEC 20000's communication requirements.

 

Change Management

Change management, referred to as “change enablement” in ITIL 4, ensures that changes to IT services and systems are controlled and implemented with minimal risk and disruption.

Process

  • Request for Change (RFC): Raise a request for change on Jira.

  • Assessment: Evaluate the impact, risk, and benefits of the change. The change must be categorized appropriately.

  • Approval: Based on the type of change, ensure all necessary approvals are obtained from relevant stakeholders before proceeding.

  • Implementation: Schedule and execute the change with proper planning and testing.

  • Review: Analyze the outcome over a period of time and document learnings.

 

Categories of Change

Changes are categorized based on their impact and urgency, with each requiring different levels of approval and oversight.

 

Types of Change

  1. Standard Change: A pre-approved, routine, low-risk change that is well understood and follows a defined procedure (e.g., configuration updates).

  2. Normal Change: A change that is neither standard nor emergency and requires a formal process for assessment, approval, and implementation (e.g., new features or major updates requiring full testing and stakeholder involvement).

  3. Emergency Change: A change that must be implemented as soon as possible to resolve an incident, prevent a major issue, or address a critical security vulnerability.

  4. Major Change: A high-impact change that typically involves significant risk, cost, or organizational impact (large-scale).

 

Key differences

| Standard Change | Normal Change | Emergency Change | Major Change |
| --- | --- | --- | --- |
| Low risk | Medium-High risk | High risk | High risk |
| Low urgency | Normal urgency | High urgency | Normal urgency |
| Pre-approved | Formal CAB approval | Expedited (ECAB) | Senior management/CAB |
| Examples: user account management, software updates, network configurations, scheduled backups, scheduled system restarts, certificate management, firewall rules, cloud resource management (provisioning), documentation updates, etc. | Examples: software deployment, service introduction, cloud migration, network changes, business continuity planning, policy or compliance changes, end of support mitigation, etc. | Examples: application failures, system outages, network issues, security breaches or vulnerabilities, performance degradation, disaster recovery, configuration errors, etc. | Examples: cloud migration (large scale), network overhaul, organization-wide software rollout, etc. |
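The comparison above can also be expressed as a small lookup, which may be convenient when wiring the change types into tooling. This is only a sketch: the risk, urgency, and approval values are copied from the comparison, while the `ChangeProfile` structure and the `approval_route` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeProfile:
    risk: str
    urgency: str
    approval: str


# Values taken from the key-differences comparison; the structure is illustrative.
CHANGE_PROFILES = {
    "standard": ChangeProfile("Low", "Low", "Pre-approved"),
    "normal": ChangeProfile("Medium-High", "Normal", "Formal CAB approval"),
    "emergency": ChangeProfile("High", "High", "Expedited (ECAB)"),
    "major": ChangeProfile("High", "Normal", "Senior management/CAB"),
}


def approval_route(change_type: str) -> str:
    """Return the approval path for a change type (case-insensitive)."""
    key = change_type.strip().lower()
    if key not in CHANGE_PROFILES:
        raise ValueError(f"Unknown change type: {change_type!r}")
    return CHANGE_PROFILES[key].approval


print(approval_route("Emergency"))
```

Rejecting unknown types with an error, rather than defaulting to one route, keeps a mistyped category from silently bypassing the intended approval path.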

 

 

Change_management_.png

 

Change Advisory Board (CAB)

All changes and product releases must be approved by the Change Advisory Board (CAB) through the Jira change management workflow, according to the type of change. Approval is issued only after the requirements outlined below have been submitted for CAB review.

 

Requirements for Change Management Approval

  • Description of the change.

  • Reason for the change.

  • Expected outcomes.

  • Deployment and release note.

  • Rollback plan (if applicable).

  • Testing and validation scripts (evidence).

  • Risk assessment.

  • Dependencies.
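A pre-submission check against this list could look like the sketch below. The field keys and the `missing_requirements` helper are hypothetical names, not Jira field identifiers, and the rollback plan is treated as optional only because the list marks it "if applicable".

```python
# Hypothetical field keys mirroring the requirements list above.
REQUIRED_FIELDS = [
    "description",
    "reason",
    "expected_outcomes",
    "deployment_and_release_note",
    "testing_evidence",
    "risk_assessment",
    "dependencies",
]

# Listed as "if applicable", so it is not enforced in this sketch.
OPTIONAL_FIELDS = ["rollback_plan"]


def missing_requirements(rfc: dict) -> list:
    """Return the required fields that are absent or blank in an RFC draft."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = rfc.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


draft = {
    "description": "Rotate TLS certificate",  # made-up example values
    "reason": "Certificate nearing expiry",
    "risk_assessment": "   ",                 # blank values count as missing
}
print(missing_requirements(draft))
```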

 

Process Steps:

  1. The PM/EM initiates a Request for Change (RFC) through Jira.

  2. The PM/EM identifies and selects the appropriate category for the change.

  3. A detailed description along with the requirements for the change is provided by the PM/EM.

  4. Depending on the change's specifics, the PM/EM consults the referenced table to determine and obtain the necessary stakeholder approvals.

  5. Once all approvals are secured, a deployment notification must be sent to stakeholders (email: changeadvisoryboard@teamapt.com) at least 24 hours before the planned implementation date.

  6. Any change that may impact the customer experience should be communicated to all internal stakeholders. Additionally, any UI updates, new feature releases, or changes that could potentially cause service disruptions should be directly communicated to customers.
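The 24-hour notice rule in step 5 is simple date arithmetic. The sketch below shows one way to check it; the function names are hypothetical, and it assumes both timestamps are expressed in the same time zone.

```python
from datetime import datetime, timedelta

# Minimum notice required before the planned implementation (from step 5).
MIN_NOTICE = timedelta(hours=24)


def latest_notification_time(planned_at: datetime) -> datetime:
    """Latest moment the deployment notification can be sent."""
    return planned_at - MIN_NOTICE


def notice_is_sufficient(notified_at: datetime, planned_at: datetime) -> bool:
    """True if the notification went out at least 24 hours before the change."""
    return planned_at - notified_at >= MIN_NOTICE


planned = datetime(2025, 3, 10, 22, 0)  # made-up implementation window
print(latest_notification_time(planned))
print(notice_is_sufficient(datetime(2025, 3, 9, 23, 0), planned))
```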

 

A RACI matrix is used to clarify responsibilities and ensure that all CAB stakeholders understand their roles in the change management process:

  • Responsible (R): The person or team performing the task.

  • Accountable (A): The person ultimately answerable for the task or decision.

  • Consulted (C): Those whose input is required (approval).

  • Informed (I): Those who need to be kept informed of progress or decisions.


NB: CAB members marked C or A must approve on Jira.

| Role | Name | RACI Matrix for Normal change | RACI Matrix for Standard change (Pre-Approved) | RACI Matrix for Emergency change |
| --- | --- | --- | --- | --- |
| Associated PM/EM | Associated PM/EM | R | R | R |
| VP, Engineering | Emeka Awagu (Acting) | A | A | A |
| Business Information Security Officer | Dozie Alisah | C | I | I |
| Head Tech Support | Oladapo Onayemi | C | I | I |
| Lead Project Coordination and Documentation | Idris Aliyu | I | I | I |
| Head Operations | Tolulope Obianwu | C | I | I |
| Chief Compliance Officer/Compliance Business Partner | Ugonna Akah/Adefemi Opeogun | C | I | I |
| Head Infrastructure | Tolu Aina | I | I | I |
| CTO | Emeka Awagu | I | C (Pre-approved) | C |
| CPO | Frank Atashili | I | I | C |
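Combining the matrix with the rule that CAB members marked C or A must approve on Jira, the approvers for each change type can be derived mechanically. The sketch below encodes the matrix by role; the dictionary layout and the `required_approvers` helper are hypothetical, and the CTO's "C (Pre-approved)" entry for standard changes is kept verbatim and matched on its leading letter.

```python
# RACI codes per role, in the order: normal, standard, emergency.
RACI = {
    "Associated PM/EM": ("R", "R", "R"),
    "VP, Engineering": ("A", "A", "A"),
    "Business Information Security Officer": ("C", "I", "I"),
    "Head Tech Support": ("C", "I", "I"),
    "Lead Project Coordination and Documentation": ("I", "I", "I"),
    "Head Operations": ("C", "I", "I"),
    "Chief Compliance Officer/Compliance Business Partner": ("C", "I", "I"),
    "Head Infrastructure": ("I", "I", "I"),
    "CTO": ("I", "C (Pre-approved)", "C"),
    "CPO": ("I", "I", "C"),
}

COLUMN = {"normal": 0, "standard": 1, "emergency": 2}


def required_approvers(change_type: str) -> list:
    """Roles marked A or C for the given change type, in matrix order."""
    key = change_type.strip().lower()
    if key not in COLUMN:
        raise ValueError(f"Unknown change type: {change_type!r}")
    index = COLUMN[key]
    # Compare only the leading letter so "C (Pre-approved)" still counts as C.
    return [role for role, codes in RACI.items() if codes[index][0] in ("A", "C")]


print(required_approvers("emergency"))
```

For standard changes the result still includes the CTO, whose consultation is recorded in the matrix as already pre-approved.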