Automation speeds delivery, but blind trust adds risk. Banks are now testing the security signals and AI outputs that decide what code is safe to ship.

Shippers of code and stewards of uptime are waking up to a simple truth: automated scanners and AI recommendations are invaluable, but they're not gospel. Banking QA teams in the UK and beyond are shifting from just testing apps to testing the signals that declare those apps safe, and that change matters for resilience, compliance and customer trust.
AI hallucination risk: Generative models without live vulnerability feeds can suggest non‑existent or malicious component versions, creating dangerous false leads.
Signal noise is real: High volumes of false positives and false negatives mean scanners can distract teams from actual risk and complicate audits.
QA’s remit is expanding: Teams now need to validate vulnerability data curation, AI training inputs, and detection timelines, not only application behaviour.
Regulatory pressure rising: Rules like DORA push banks to show why software was approved and how risk decisions were made, not just that scans ran.
Practical step: Combine multiple scanning tools, verify AI outputs against live intelligence, and log decision trails to make automation explainable.
Banks have leaned on dependency scanners and automated pipelines to keep pace with rapid releases, and there’s a reassuring hum to tools that flag risky libraries or policy breaches. But when a scanner’s output becomes the de facto reason to ship or block code, QA can’t afford blind faith. Sonatype’s 2026 research shows AI can “hallucinate” component versions and even recommend malware if it lacks real‑time intelligence. That sensory jolt, the feeling that your safety net might be frayed, is pushing teams to treat scanner results as testable artefacts in their own right.
Historically, QA validated behaviour: does the feature do what it says? Now QA also asks: did the tool that recommended this dependency know what it was talking about? That shift is less glamorous, but more impactful; it’s the difference between a postmortem and a defensible audit trail.
The Sonatype report maps how open source usage exploded through 2025, with trillions of downloads and attack vectors that target developer environments rather than end users. When generative tools suggest upgrades or replacements without live vulnerability feeds, nearly a third of those suggestions can be wrong. That’s not just an irritating statistic; it’s a workflow hazard.
False positives consume analyst hours, while false negatives let exploitable code slip through. For regulated firms, that noise complicates reporting and weakens confidence in automated controls. The sensible takeaway is to treat AI recommendations like partner input: useful, but needing independent verification.
If you’re running scanners and AI agents in your CI pipeline, add pragmatic guardrails. First, normalise comparing outputs from multiple sources: cross‑check LLM suggestions against authoritative vulnerability feeds. Second, prioritise real‑time intelligence: feeds and databases that update faster reduce the chance of model hallucination. Third, maintain a provenance log: which tool recommended what, when, and who approved the change.
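As a rough illustration of the first and third guardrails, the sketch below cross‑checks an AI‑suggested package version against a live feed (the public OSV.dev query API) and appends a provenance record. The package, version, tool names and log path are illustrative assumptions, not details from the report.

```python
"""Sketch: cross-check an AI-suggested dependency against a live vulnerability
feed (OSV.dev) and record a provenance entry. Package, version, tool names and
the log path are illustrative assumptions."""
import json
import datetime
import requests

OSV_QUERY_URL = "https://api.osv.dev/v1/query"  # public OSV.dev vulnerability query API


def check_suggestion(ecosystem: str, package: str, version: str) -> list[str]:
    """Return IDs of known vulnerabilities affecting the suggested version."""
    payload = {"version": version, "package": {"name": package, "ecosystem": ecosystem}}
    response = requests.post(OSV_QUERY_URL, json=payload, timeout=10)
    response.raise_for_status()
    return [v["id"] for v in response.json().get("vulns", [])]


def log_decision(package: str, version: str, vuln_ids: list[str],
                 recommended_by: str, approved_by: str,
                 log_path: str = "provenance.jsonl") -> None:
    """Append a provenance record: which tool recommended what, when, and who approved it."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "package": package,
        "version": version,
        "known_vulnerabilities": vuln_ids,
        "recommended_by": recommended_by,
        "approved_by": approved_by,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Hypothetical scenario: an AI assistant suggested upgrading to jinja2 3.1.2.
    vulns = check_suggestion("PyPI", "jinja2", "3.1.2")
    if vulns:
        print(f"Suggestion flagged for review: {vulns}")
    log_decision("jinja2", "3.1.2", vulns,
                 recommended_by="llm-assistant", approved_by="qa-engineer")
```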
Operationally, that looks like small, repeatable practices: automation that flags anomalies for human review, policies that refuse auto‑merge of AI‑only suggestions, and periodic audits of tool performance. These steps don’t slow you down so much as make speed sustainable and defensible.
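One way to encode the “no auto‑merge for AI‑only suggestions” policy is a small gate in the pipeline. The minimal sketch below assumes an invented Finding structure and source names; it simply holds a change for human review unless an independent, non‑AI source corroborates it and every AI suggestion has been verified against a live feed.

```python
"""Minimal sketch of an auto-merge gate: changes backed only by AI suggestions
are held for human review. The Finding structure and source names are
illustrative assumptions."""
from dataclasses import dataclass


@dataclass
class Finding:
    source: str                   # e.g. "llm-assistant", "sca-scanner", "sast-tool"
    is_ai_generated: bool
    verified_against_feed: bool   # cross-checked against a live vulnerability feed


def may_auto_merge(findings: list[Finding]) -> bool:
    """Allow auto-merge only when at least one non-AI source agrees and every
    AI-generated suggestion has been verified against live intelligence."""
    has_independent_source = any(not f.is_ai_generated for f in findings)
    ai_all_verified = all(f.verified_against_feed for f in findings if f.is_ai_generated)
    return has_independent_source and ai_all_verified


if __name__ == "__main__":
    findings = [Finding("llm-assistant", is_ai_generated=True, verified_against_feed=False)]
    print("auto-merge" if may_auto_merge(findings) else "hold for human review")
```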
Regulators now expect more than proof a scan ran; they want to know why a decision was made. Under frameworks such as DORA, financial firms must demonstrate operational resilience and explainability. That moves QA from being a final gatekeeper to a translator: converting machine signals into regulatory narratives.
So QA needs to document not only findings but the context: which datasets informed the model, how vulnerability scores were weighted, and what compensating controls exist if a flagged component is used. Making those threads visible helps in audits and reassures stakeholders that automation isn’t a black box.
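A lightweight way to make that context visible is to capture it as structured data alongside each decision. The sketch below uses an illustrative schema; the field names and example values are assumptions for the example, not a regulatory or vendor standard.

```python
"""Sketch of an audit record capturing the context behind an automated
decision, in the spirit of DORA-style explainability. Field names and values
are illustrative assumptions, not a prescribed schema."""
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DecisionContext:
    component: str
    decision: str                        # e.g. "approved", "blocked", "approved-with-controls"
    model_data_sources: list[str]        # datasets / feeds that informed the recommendation
    score_weighting: dict[str, float]    # how vulnerability scores were weighted
    compensating_controls: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    record = DecisionContext(
        component="log4j-core 2.17.1",
        decision="approved-with-controls",
        model_data_sources=["osv.dev", "vendor-advisories-feed"],
        score_weighting={"cvss_base": 0.6, "exploit_maturity": 0.4},
        compensating_controls=["runtime WAF rule", "restricted network egress"],
    )
    print(record.to_json())
```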
Diversity in tooling reduces correlated failure: combine static analysis, software composition analysis, dynamic testing and human code review. Add sanity checks for AI agents by forcing them to reference live vulnerability indices, and treat any unsupported recommendation as suspect. Vendors like Sonatype have published findings and guidance that help institutions choose feeds and orchestrate scans more safely.
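For the “treat any unsupported recommendation as suspect” rule, one possible sanity check is to require that every advisory an AI agent cites actually resolves in a live index. The sketch below uses OSV.dev’s lookup‑by‑ID endpoint as an example; the recommendation format and the advisory ID are hypothetical.

```python
"""Sketch of a sanity check for AI agent output: a recommendation is treated
as suspect unless every advisory it cites resolves in a live vulnerability
index (OSV.dev used as an example). The recommendation shape and advisory ID
are illustrative assumptions."""
import requests

OSV_VULN_URL = "https://api.osv.dev/v1/vulns/{vuln_id}"  # public lookup by advisory ID


def advisory_exists(vuln_id: str) -> bool:
    """True if the cited advisory resolves in the live index."""
    response = requests.get(OSV_VULN_URL.format(vuln_id=vuln_id), timeout=10)
    return response.status_code == 200


def is_supported(recommendation: dict) -> bool:
    """Reject recommendations that cite no advisories, or cite ones that don't resolve."""
    cited = recommendation.get("cited_advisories", [])
    return bool(cited) and all(advisory_exists(v) for v in cited)


if __name__ == "__main__":
    suggestion = {
        "package": "example-lib",
        "upgrade_to": "9.9.9",
        "cited_advisories": ["GHSA-0000-0000-0000"],  # hypothetical ID, will not resolve
    }
    print("accept for review" if is_supported(suggestion) else "treat as suspect")
```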
Leaders should also invest in skills: QA engineers who understand threat modelling, supply‑chain mechanics and AI caveats will spot patterns a generic tester won’t. It’s the human touch of curiosity, scepticism and good documentation that turns noisy outputs into actionable assurance.
Treating automated signals as testable artefacts is a small change that can make every scan and recommendation far more defensible.