Technical implementation
Phase 1: Technical Deep-dive
Inputs:
PRD document
High fidelity product designs with complete UX and UI
Objective: Get complete clarity on the implementation details for the proposed product, document the implementation steps and estimate the work to be done
Output:
Updated tickets to capture the implementation details and sub tasks
Detailed breakdown of engineering steps into subtasks in each ticket
Story points to be assigned to the ticket
Test cases
Architectural design
Responsibility: Engineering Manager
Participants: Product Manager, Engineers, QA, BL, VPE, Security Analyst
Expected time to complete: 2 days
Measurements:
Metric | Motivation | Source |
---|---|---|
Story point coverage | To measure the adoption of work estimation in our product management process | Jira |
Acceptance criteria coverage | To encourage the addition of acceptance criteria by the PMs | Jira |
Functional Test Coverage | To encourage the addition of relevant test cases to Jira tickets | QMetry via Jira |
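For illustration, story point coverage can be derived directly from Jira by comparing the number of estimated tickets against the total tickets in scope. The sketch below is a minimal example assuming the Jira Cloud REST API, an API token, and that story points live in `customfield_10016` (the field id varies per instance); the dashboards will normally compute this for us.

```python
import os
import requests

# Minimal sketch: story point coverage = estimated tickets / total tickets in scope.
# Assumptions: Jira Cloud REST API, basic auth with an API token, and story points
# stored in customfield_10016 (the field id differs between Jira instances).
JIRA_URL = os.environ["JIRA_URL"]            # e.g. https://yourcompany.atlassian.net
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"])
STORY_POINTS_FIELD = "customfield_10016"

def story_point_coverage(jql: str) -> float:
    """Return the fraction of issues matching `jql` that have story points set."""
    start_at, total, estimated = 0, None, 0
    while total is None or start_at < total:
        resp = requests.get(
            f"{JIRA_URL}/rest/api/2/search",
            params={"jql": jql, "fields": STORY_POINTS_FIELD,
                    "startAt": start_at, "maxResults": 100},
            auth=AUTH,
        )
        resp.raise_for_status()
        data = resp.json()
        total = data["total"]
        for issue in data["issues"]:
            if issue["fields"].get(STORY_POINTS_FIELD) is not None:
                estimated += 1
        start_at += len(data["issues"])
    return estimated / total if total else 0.0

if __name__ == "__main__":
    # Hypothetical JQL scoping the metric to one team's current sprint.
    print(story_point_coverage('project = "ABC" AND sprint in openSprints()'))
```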
Phase 2: Sprint Planning
Input:
Backlog of products that have gone through deep dives and have been estimated
Backlog of bugs & technical debt that have been discussed and estimated
Objective: Determine the sprint goal, i.e. the work to be delivered at the end of the sprint and each product team member’s role
Output:
List of tickets to be worked on during the sprint
Assignment of tickets to engineers
Responsibility: Business Leader
Participants: Product Manager, EM, Engineers, QA, BL, VPE
Expected time to complete: 1 day
Measurements:
Metric | Motivation | Source |
---|---|---|
Team and sprint health | To help measure the flow of work within the team and potentially predict the delivery speed of a team | Jira |
Notes
Team and sprint health is a combination of different metrics that help answer questions such as: How much work are we doing? Where are we spending most of our time, e.g. fixing bugs, new initiatives, etc.? Are there bottlenecks or inefficiencies that could pose a risk to the delivery of the initiatives? Is the team overwhelmed? Are certain resources within the team under-utilised or over-utilised? How much are we investing to execute this initiative? When we have clarity on the specific metrics, this document will be updated accordingly.
Phase 3: Coding
Input: The products to be implemented and test cases to be automated
Objective: Implement the products according to the agreed design and automate the test cases using our automation test suite
Output:
A working deployable feature
Automated test cases
Responsibility: EM
Participants: Engineers
Expected time to complete: variable depending on size of project
Measurements Summary:
Metric | Motivation | Source |
---|---|---|
Cycle time | To measure how efficient our pipeline is by measuring how long it takes to go from code commit to production | Gitlab, ArgoCD |
Code suggestions | To improve code quality and reduce cycle time | SonarLint |
Time to merge | To measure how long it takes code to be reviewed and merged - long wait times require investigation | Gitlab |
Coding time | To know how long an engineer spends implementing a given task | Gitlab |
PR count per contributor | To measure how many PRs are initiated by each developer | Gitlab |
Review reaction time | To measure how quickly reviewers respond to merge requests | Gitlab |
PR size | To check whether developers adhere to the small-PR best practice | Gitlab |
PR comment count | To measure the effort required to review an engineer's code | Gitlab |
Notes
The metrics above are by no means exhaustive; keep in mind that the dashboard tool might offer more metrics than listed here, and sometimes the same metric might be represented under a different name.
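As an illustration of how some of these numbers are derived, the sketch below pulls merged merge requests from the GitLab API and computes the average time to merge and the PR count per contributor. It is a minimal example assuming a project id and a personal access token, not the dashboard tool's implementation.

```python
import os
from datetime import datetime
import requests

# Minimal sketch of two Phase 3 metrics (time to merge, PR count per contributor)
# computed directly from the GitLab API. Assumes GITLAB_TOKEN and a numeric
# project id; in practice these numbers come from the dashboard tool.
GITLAB_URL = os.environ.get("GITLAB_URL", "https://gitlab.com")
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}
PROJECT_ID = os.environ["GITLAB_PROJECT_ID"]

def merged_mrs(per_page: int = 100) -> list[dict]:
    """Fetch the most recently merged MRs for the project (single page for brevity)."""
    resp = requests.get(
        f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/merge_requests",
        params={"state": "merged", "per_page": per_page, "order_by": "updated_at"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def time_to_merge_hours(mr: dict) -> float:
    """Hours between MR creation and merge."""
    created = datetime.fromisoformat(mr["created_at"].replace("Z", "+00:00"))
    merged = datetime.fromisoformat(mr["merged_at"].replace("Z", "+00:00"))
    return (merged - created).total_seconds() / 3600

if __name__ == "__main__":
    mrs = merged_mrs()
    per_author: dict[str, int] = {}
    for mr in mrs:
        author = mr["author"]["username"]
        per_author[author] = per_author.get(author, 0) + 1
    avg = sum(time_to_merge_hours(mr) for mr in mrs) / max(len(mrs), 1)
    print(f"Average time to merge: {avg:.1f}h")
    print("MR count per contributor:", per_author)
```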
Phase 4: Build Pipeline
This is a critical stage comprising a number of quality gates that help ensure the quality of the product feature(s) being shipped. The goal is to pass the modified code base through a series of quality checks and ensure that it passes all of them before it can proceed to the next stage. On a successful build, the feature is deployed to a dev environment for further checks if it is a service; if it is a library, it is pushed to the relevant package registry for distribution. On failure, the pipeline stops, allowing the team or engineer responsible for the changes to fix the failure.
Input
Committed or merged code
Objective: Build the modified code base, pass it through a series of quality checks and ensure that it passes all of these checks before it can proceed to the next stage
Output:
A working deployable package (Docker image, library package, etc.)
Responsibility: EM
Participants: Engineers, QA, BL, VPE
Expected time to complete: 3 days
Pipeline Quality Gates:
Gate | Measurements | Source |
---|---|---|
Unit Test | Unit test coverage | Cobertura, Jasmine |
Static Code Analysis | Code quality issues (bugs, code smells, vulnerabilities, duplication) | SonarQube |
Container Analysis | Container vulnerabilities | Trivy |
Functional Test | Automated functional test coverage (API) | Cypress/Playwright |
Load Test | Latency score and test result (the test passes if the agreed thresholds are maintained; otherwise it is considered failed) | JMeter |
Security Test | API vulnerabilities | Astra (TBD) |
Notes
Executing some of the quality gates, e.g. the functional, load and security tests, will require spinning up a fresh, clean environment and seeding it with the necessary data before running the tests. We have agreed to adopt Garden as an orchestration tool to implement this process. You can find the documentation and a template on how to include this in your projects here.
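To make the mechanics of a quality gate concrete, the sketch below checks the unit test coverage gate by parsing a Cobertura XML report and failing the job when line coverage falls below an agreed threshold. The report path and threshold are illustrative assumptions; each project's actual gates live in its CI configuration.

```python
import sys
import xml.etree.ElementTree as ET

# Minimal sketch of a unit-test-coverage quality gate: parse a Cobertura report
# and exit non-zero (failing the pipeline) when line coverage is below threshold.
# The report path and threshold are assumptions for illustration.
REPORT_PATH = "coverage/cobertura-coverage.xml"
THRESHOLD = 0.80  # agreed minimum line coverage

def line_coverage(report_path: str) -> float:
    """Cobertura stores overall line coverage in the root element's line-rate attribute."""
    root = ET.parse(report_path).getroot()
    return float(root.attrib["line-rate"])

if __name__ == "__main__":
    coverage = line_coverage(REPORT_PATH)
    print(f"Line coverage: {coverage:.1%} (threshold {THRESHOLD:.0%})")
    if coverage < THRESHOLD:
        sys.exit(1)  # fail the build so the responsible engineer can fix it
```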
Phase 5: Production Deployment
The target is to have a completely automated process where the outputs of the build stage are deployed to the production environment without human intervention. However, I acknowledge that this will require work (for instance, having feature flags) and time before we get there, so for now we will retain a human element by keeping our current deployment flow, where EAs or anyone with the right authority can modify the GitOps file to add the new release version before it is picked up by ArgoCD for deployment. From there, we can work towards the ideal scenario as we grow more confident in our pipelines. That said, this ideal scenario will remain an actively tracked target for all teams.
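For reference, the human step in the current flow amounts to a small, auditable edit to the GitOps repository. The sketch below shows that edit in Python against a hypothetical manifest path, assuming a plain Kubernetes Deployment; in practice the change is made by hand or with a tool that preserves comments and formatting, committed via a merge request, and then reconciled by ArgoCD.

```python
import yaml  # PyYAML

# Minimal sketch of the GitOps "bump the release version" step. The manifest path
# and structure (a plain Kubernetes Deployment) are assumptions for illustration;
# the edited file is committed to the GitOps repo and ArgoCD syncs it to production.
MANIFEST = "gitops/production/my-service/deployment.yaml"  # hypothetical path

def bump_image_tag(manifest_path: str, new_tag: str) -> None:
    with open(manifest_path) as f:
        doc = yaml.safe_load(f)
    container = doc["spec"]["template"]["spec"]["containers"][0]
    image = container["image"].rsplit(":", 1)[0]
    container["image"] = f"{image}:{new_tag}"
    with open(manifest_path, "w") as f:
        yaml.safe_dump(doc, f, sort_keys=False)

if __name__ == "__main__":
    bump_image_tag(MANIFEST, "v1.4.2")  # release tag produced by the build pipeline
```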
Input
Release tag
Objective: Deploy the tested service by modifying the GitOps file
Output:
A running service without defects
Responsibility: EM
Participants: Engineers, QA, BL, VPE
Expected time to complete: 1 day
Measurements:
Metric | Motivation | Source |
---|---|---|
Deployment status | To help us evaluate the outcome of a deployment. This will feed into measuring the reliability of our pipelines | ArgoCD rollout |
Error rate | To measure the impact of the deployment on error rate | Newrelic |
Latency | To measure the impact of the deployment on latency | Newrelic |
Throughput | To measure the impact of the deployment on throughput | Newrelic |
Resource usage | To measure the impact of the deployment on resource usage | Newrelic |
Deployment frequency | This is a derived metric that helps evaluate the speed and reliability of our pipeline - being able to deploy fast shows our pipeline is reliable and fast enough to push changes | ArgoCD |
Change failure rate | This is also a derived metric which measures the reliability of our pipeline - a low change failure rate implies a reliable pipeline that prevents defects from getting to production | Newrelic, Pagerduty, ArgoCD |
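Because deployment frequency and change failure rate are derived metrics, the sketch below shows one way they could be computed from a list of deployment records, with status coming from ArgoCD and incident attribution from Newrelic/Pagerduty. The record shape is an assumption for illustration only.

```python
from dataclasses import dataclass
from datetime import date

# Minimal sketch of the two derived Phase 5 metrics. The Deployment record shape
# is an assumption; in practice the fields would be populated from ArgoCD
# (status, date) and Newrelic/Pagerduty (whether the change caused an incident).
@dataclass
class Deployment:
    day: date
    succeeded: bool
    caused_incident: bool

def deployment_frequency(deploys: list[Deployment], days: int) -> float:
    """Average number of deployments per day over the observed window."""
    return len(deploys) / days if days else 0.0

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of deployments that failed or caused an incident in production."""
    if not deploys:
        return 0.0
    failed = sum(1 for d in deploys if not d.succeeded or d.caused_incident)
    return failed / len(deploys)

if __name__ == "__main__":
    history = [
        Deployment(date(2024, 5, 1), succeeded=True, caused_incident=False),
        Deployment(date(2024, 5, 2), succeeded=True, caused_incident=True),
        Deployment(date(2024, 5, 3), succeeded=False, caused_incident=False),
    ]
    print(f"Deployment frequency: {deployment_frequency(history, days=7):.2f}/day")
    print(f"Change failure rate: {change_failure_rate(history):.0%}")
```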
Notes
This assumes that every team uses a GitOps flow for deploying changes. It is therefore now compulsory for every team to implement a GitOps deployment strategy, including canary or blue/green deployments, for all your services. This will be an actively tracked target for every team. Reach out to @David Irivbogbe for guidelines on how to add this to your project.
Phase 6: Production Monitoring
In this phase the services running in production are continuously monitored in order to proactively identify performance bottlenecks, defects, anomalous behaviour, etc.
In addition, products in production should be monitored by Product Manager for customer satisfaction, and optimised based on customer needs on an ongoing basis.
Input: Running services
Objectives:
Monitor the performance of our products on production
Monitor and optimise for Customer Satisfaction and Product Market Fit (PMF)
Output: Product optimisations included in sprint queues.
Responsibility: Product Manager
Participants: EM
Expected time to complete: ongoing
Measurements:
Metric | Motivation | Source |
---|---|---|
Error rate | To know what percentage of our requests end up in failure | Newrelic |
Latency | To know how fast we respond to user requests | Newrelic |
Throughput | To know how much load we are able to handle | Newrelic |
Resource usage | To know the amount of resources spent on handling the current load | Newrelic |
Apdex | To measure how happy our customers are | Newrelic |
Uptime | To know how available our services are | Freshping |
Mean time to recover | To help us see how fast we recover from failures | Newrelic, Pagerduty |
Incident count | To know how many incidents we generate, both customer-reported and those automatically triggered by monitoring tools | Pagerduty |
Customer Satisfaction | To measure how satisfied the customer is with the product in production | CS data, surveys |
Product Market Fit | To measure whether the product in production suits market needs (has the market changed, and our product is left behind?) | Surveys |
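Apdex is the one metric above with a non-obvious formula, so a short illustration may help: for a response-time threshold T, requests at or below T count as satisfied, requests between T and 4T count as tolerating (at half weight), and the rest as frustrated. Newrelic computes this against a configured T; the sketch below only restates the formula.

```python
# Minimal sketch of the standard Apdex calculation:
#   Apdex = (satisfied + tolerating / 2) / total
# where satisfied = response time <= T and tolerating = T < response time <= 4T.
# Newrelic computes this automatically for a configured threshold T; this is
# only an illustration of the formula.
def apdex(response_times_s: list[float], t: float) -> float:
    if not response_times_s:
        return 1.0
    satisfied = sum(1 for rt in response_times_s if rt <= t)
    tolerating = sum(1 for rt in response_times_s if t < rt <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times_s)

if __name__ == "__main__":
    samples = [0.2, 0.4, 0.9, 2.5, 6.0]  # response times in seconds
    print(f"Apdex (T=0.5s): {apdex(samples, t=0.5):.2f}")  # -> 0.50
```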
Notes
Incident analysis is a separate topic detailing how we handle the different types of incidents and how they impact teams; for now we will stick with this and evolve accordingly.
Process Implementation
The implementation of the process above is divided into the following stages:
Visibility
Adoption
Re-evaluation and Continuous improvement
Visibility
We need to implement the framework that will allow us to measure the metrics outlined above; with that in place, it becomes easier to know our base state. This will help us achieve three important things:
We will see what we need to work on.
We can set realistic targets and timelines
We can measure the impact of our efforts towards meeting those targets.
The framework for visibility is divided into the aspects below:
Aspect | Process Phase | Source | Dashboard Tool | Data Ingestion Mode |
---|---|---|---|---|
Product Research Visibility | Phases 1 to 2 | Confluence, Jira | - | Direct |
Product Management Visibility | Phases 3 to 5 | Jira, Gitlab | Selection in Progress. The options are Hatica, Jellyfish, LinearB. Here is the RFC for the tool selection | Direct integrations with the sources |
Build Pipeline Visibility | Phase 6 | SonarQube, Jenkins | Grafana and/or Jellyfish, LinearB, Hatica | Webhook integration and/or ETL |
Production Visibility | Phases 7 to 8 | ArgoCD, Newrelic, Pagerduty | Newrelic and/or Grafana | Webhook integration and/or ETL |
The outputs of this framework will be dashboards showing the team level performance across the different aspects.
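Where a source feeds a dashboard through webhooks rather than a direct integration, the receiving end can stay small. The sketch below is a hypothetical example of a webhook endpoint that accepts GitLab pipeline events and keeps only the fields a dashboard would need; FastAPI and the in-memory store are illustrative choices, not the selected stack.

```python
from fastapi import FastAPI, Request

# Minimal, hypothetical sketch of the "webhook integration" ingestion mode:
# accept GitLab pipeline events and keep only the fields a dashboard needs.
# FastAPI and the in-memory list are illustration choices, not the chosen stack.
app = FastAPI()
events: list[dict] = []  # stand-in for a real metrics store / ETL target

@app.post("/webhooks/gitlab")
async def gitlab_webhook(request: Request) -> dict:
    payload = await request.json()
    if request.headers.get("X-Gitlab-Event") == "Pipeline Hook":
        attrs = payload.get("object_attributes", {})
        events.append({
            "project": payload.get("project", {}).get("path_with_namespace"),
            "status": attrs.get("status"),        # success / failed / canceled
            "duration_s": attrs.get("duration"),  # feeds cycle time dashboards
            "finished_at": attrs.get("finished_at"),
        })
    return {"received": True}
```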
Adoption
Once we have implemented the needed observability framework, we will be able to track these metrics at a team level. This will help teams become self-aware and self-improving. In addition, we can set targets for every team and see their results in real time. This will help every stakeholder objectively see the level of adoption in each team and the impact of the effort each team puts into adopting the process.
Based on these data, we can establish a team appraisal cadence through the leaders of each team.
Re-evaluation and Continuous improvement
The goal is to keep improving our processes until they become a well-oiled machine where every component is healthy and contributing effectively to the success of the organisation. As we do so, we will keep evaluating the process using these metrics and make sure we keep evolving and improving it based on business realities.