opinion
How to test a payments system in a multi resilient world
Digital payment systems are a key part of the UK's critical infrastructure, so ensuring their resilience is paramount.
Availability is a fundamental part of a payments environment. The ability of a system to support high-volume throughput at low latency, with the highest levels of availability, is not optional. Implicit within this availability is the principle of system resilience. Of late, through the EU’s Digital Operational Resilience Act (DORA) and the FCA’s mandates in the UK, operational resilience has become an area that banks and other financial services organisations are obliged to treat in a discrete, defined and evidenceable manner.
It's easy to see why payments, and the assuredness of the payments network, are receiving such treatment. The systems that deliver electronic payments form a key part of the country’s critical infrastructure. Some years ago, the tag “critical infrastructure” in the payments space would almost certainly have been applied solely to Vocalink, with other parts of the payments ecosystem considered important, though not in themselves critical.
However, following the change in payment patterns post-COVID, and the increased use of open banking, immediate payments, alternative payment methods (APMs) and digital payments, it’s clear that the “critical infrastructure” tag can now be applied across the whole ecosystem. The tests set by the FCA for financial institutions with regard to ensuring payments availability (even where their core systems are out for extended periods of time) reflect this. They also reflect the possibility of losing a significant payment facility and the impact that would have on the economy.
Financial institutions argue, with some merit, that their systems are resilient. The use of active/active or hot standby systems ensures availability when systems fail, but it does not in itself address the core issue of resilience. Availability and resilience aren’t interchangeable terms in this context, and real effort needs to be invested in supporting both.
This creates an inflection point for financial institutions of all types and sizes. In ensuring they are resilient within the operational demands of the legislation, institutions must also reflect on whether their current approach to Quality Assurance (QA) and Quality Engineering (QE) across their payments environments is sufficiently robust.
The danger of assumption
The reality is that almost all system failures stem from assumptions. QE and development teams do not set out to leave something only partly tested, but received wisdom often leads them to assume that something is covered, or won’t be impacted. When timeframes are incredibly tight, with little contingency in either time or budget, the apparent certainty offered by these assumptions is welcome.
Sometimes the impact of these assumptions is limited: for example, part of a network is lost because it was assumed that the network controller could handle being reconfigured and that capacity wouldn’t be affected. Sometimes, however, the impact is significant, as with the issues faced by TSB some time ago, which resulted in serious implications and fines for the senior management team under the FCA’s Senior Managers Regime (SMR).
What can we learn from major programmes, such as RTGS?
The Real-Time Gross Settlement (RTGS) replacement programme is one of the biggest change projects in UK banking today. Like all major schemes or central bank deployments, it is mandated with published timelines. It will deliver a major change to the standards that the UK uses to drive RTGS and has the potential to deliver major benefits to banks and corporates in the UK over time. It also imposes considerable change on the institutions involved.
The reality for any scheme is that it must be assured of the interoperability of each financial institution with the scheme environment, and be certain that a member will not generate issues that cause operational problems or impact the integrity of the scheme. This is generally safeguarded through a certification process that ensures the interaction between scheme and member is consistent with the behaviour laid down in the specifications. Certification does not, however, prove the systems within the member institution that will interface with the scheme.
To assure their own operations, banks deploy testing. Unfortunately, testing is too often seen as a check-and-balance exercise, and some view it as more of an ‘afterthought’. This shouldn’t be the case. Rather, Quality Engineering should be engineered into the fundamentals of an institution from the very beginning. It should focus not only on assuring elements such as message content, request/reply interaction and the presentation of the right data to the right part of the infrastructure, but also on the environment, its capacity and its overall resilience.
The Challenge:
With the focus so heavily on functionality, pace of delivery and reduced costs, requirements relating to backup/restore, failover, performance, soak testing, maintainability and overall resilience are often forgotten, or thought about far too late in the process. The nature of requirements in this space is that they fundamentally shape the solution through technology decisions and the overall design. Considering them late may expose issues that affect the overall design and technology, and that prove hugely costly through re-design and rework. In short, the later in the software development lifecycle these issues are found, the more time-consuming and costly they are to remedy.
Identify non-functional requirements early to impact decision making:
It is vitally important to ensure that resiliency through effective non-functional requirement identification is an aim for the business. This should be considered as early as possible and the results must be part of the initial decision process regarding overall design, environment needs and the testing approach going forwards.
The risk of not doing this early is that the project or programme will not be set up to deliver the right resilient system, and the test team will be unable to plan for and deliver effective non-functional testing on time and on budget.
Consider non-functional requirements holistically:
Identifying which non-functional requirements are a concern should be the starting point, and this can be done as early as project initiation. Once this is understood, the detail within them can be brought forward and should be considered holistically, as it is highly likely that one non-functional requirement will have a direct impact on other design decisions and requirements.
The risk of not viewing non-functional requirements holistically is increased cost, as requirements that do not align will need to be re-thought. This may mean delays, re-designs, re-testing and additional cost to the business.
We must view non-functional requirements as an opportunity:
Non-functional requirements deliver resiliency within the system that is being built. The identification of well-written, joined-up non-functional requirements should be seen as an opportunity to truly shape the right solution and ensure that there are no surprises, such as additional costs or a lack of resiliency within the solution itself.
Getting help to identify what counts as a well-written and measurable non-functional requirement should be non-negotiable. Thorough planning and investment early on are necessary and will pay dividends later in the software development lifecycle.
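To make “measurable” concrete, here is a minimal sketch of how non-functional requirements might be written down as explicit thresholds that a test run can be checked against. It is an illustration only: the class name, the metric names and every threshold value are hypothetical, and real targets would have to be agreed with the business and the scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonFunctionalRequirement:
    name: str                       # human-readable label
    metric: str                     # key expected in the measured results
    limit: float                    # hypothetical threshold, not a real target
    unit: str
    higher_is_better: bool = False  # e.g. throughput, as opposed to latency

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.limit
        return measured <= self.limit


def evaluate(requirements, measurements):
    """Return the names of requirements that failed or were never measured."""
    failures = []
    for req in requirements:
        value = measurements.get(req.metric)
        if value is None or not req.is_met(value):
            failures.append(req.name)
    return failures


# Hypothetical requirements and hypothetical results from a test run.
REQUIREMENTS = [
    NonFunctionalRequirement("Failover time", "failover_seconds", 30.0, "s"),
    NonFunctionalRequirement("p99 latency at peak", "p99_latency_ms", 250.0, "ms"),
    NonFunctionalRequirement("Sustained throughput", "tx_per_second", 500.0,
                             "tx/s", higher_is_better=True),
]

measured = {"failover_seconds": 42.0, "p99_latency_ms": 180.0,
            "tx_per_second": 650.0}

print(evaluate(REQUIREMENTS, measured))  # prints ['Failover time']
```

A requirement that was never measured is reported as a failure rather than silently passed, which guards against the kind of assumption described earlier in this article.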
Know your vendor constraints:
Proving non-functional requirements is a complex and time-consuming area of testing, with certain nuances having a direct impact on what decisions can be made and when. A primary example is where solutions are hosted in the cloud. It is not always possible to conduct failover, backup/restore, maintenance, performance or soak test scenarios within cloud-hosted solutions without engaging with the hosting vendor. Because the infrastructure is shared, no one will thank a test team for injecting peak loads of data into a system and slowing the service down for the vendor’s other clients.
Summary:
Roq’s experience of operating within the scheme space, including significant exposure to the RTGS programme, leads us to believe that change is both necessary and inevitable. Continuing to see Quality Assurance as a series of individual elements within myriad projects creates an approach to testing that risks gaps in the end-to-end system and assumes all the other individual elements have done their job effectively.
Roq takes a genuinely holistic look at how to support a fully rounded approach to Quality Assurance, one that addresses the fact that payments are a mission-critical part of UK infrastructure. This won’t be easy for some institutions, and will be a bugbear to others. But in a world where the stakes are high, and there are personal implications for senior members of a bank’s team, it is the only way of mitigating risk and securing assuredness and resiliency.
If you are a financial institution currently dealing with the complexities of the RTGS transformation, we are here to help. Whether you require extra resource, or an expert second opinion, simply reach out to us and one of our specialists will be happy to discuss your organisation’s challenges and requirements.