Senior SRE – Scale, Reliability & AI | Remote

Build the reliability foundation for a platform used by millions worldwide
Shape how AI and ML are applied to observability and incident response
Remote first, technical leadership track, genuine engineering culture

There is a difference between managing infrastructure and engineering reliability. This role is firmly in the second category.

A high-traffic global platform on Google Cloud needs an SRE who thinks in systems: someone who looks at a complex, distributed architecture and sees not just what is running, but how it will fail, how to prevent that failure, and how to recover faster when it happens. Someone who measures their impact in error budgets and toil eliminated, not tickets closed.

The platform is operating at genuine scale. The SRE function is not peripheral; it is central to how the engineering organisation grows without degrading. You will own the reliability standards that give product teams the confidence to ship fast, build the observability infrastructure that turns noise into signal, and lead the incident process that turns failure into institutional knowledge.

What makes this role distinct is the AI dimension. The team is actively building machine learning into how it detects anomalies, predicts degradation, and responds to incidents. This is not a future roadmap item; it is happening now, and the person coming into this role will have a real hand in shaping it.

If you are an SRE who wants to move beyond keeping things running and start defining how a scaled platform is built to last, this is worth a conversation.

What you will own:

SLI, SLO, and error budget frameworks across the platform's most critical services
Fault-tolerant, highly available infrastructure design and maintenance on GCP
Containerised workload operations at scale across Kubernetes and service mesh environments
Automation and tooling that systematically removes operational toil
ML and AI integration into observability pipelines, anomaly detection, and incident workflows
End-to-end incident response ownership, from detection through to blameless postmortem

What you will bring:

A strong background in senior SRE, Production Engineering, or a comparable discipline
Deep hands-on experience with GCP or AWS in demanding production environments
Proven ability to operate and scale Kubernetes clusters under real load
Fluency in Python or Node.js for automation, scripting, and tooling development
A track record of owning CI/CD pipelines and improving delivery confidence

What is on offer:

Competitive base salary with bonus, healthcare, and life assurance
Fully remote with optional access to a collaborative hub when you want it
A genuine technical leadership pathway for engineers who want to grow

For more information, contact Samer Jaffer in confidence on +353 1 649 8502 or sa[email protected]
