Senior SRE – Scale, Reliability & AI | Remote

- Build the reliability foundation for a platform used by millions worldwide
- Shape how AI and ML are applied to observability and incident response
- Remote first, technical leadership track, genuine engineering culture

There is a difference between managing infrastructure and engineering reliability. This role is firmly in the second category.

A high-traffic global platform on Google Cloud needs an SRE who thinks in systems: someone who looks at a complex, distributed architecture and sees not just what is running, but how it will fail, how to prevent it, and how to recover faster when it does. Someone who measures their impact in error budgets and toil eliminated, not tickets closed.

The platform is operating at genuine scale. The SRE function is not peripheral; it is central to how the engineering organisation grows without degrading. You will own the reliability standards that give product teams the confidence to ship fast, build the observability infrastructure that turns noise into signal, and lead the incident process that turns failure into institutional knowledge.

What makes this role distinct is the AI dimension. The team is actively building machine learning into how it detects anomalies, predicts degradation, and responds to incidents. This is not a future roadmap item; it is happening now, and the person coming into this role will have a real hand in shaping it.

If you are an SRE who wants to move beyond keeping things running and start defining how a scaled platform is built to last, this is worth a conversation.
What you will own:

- SLI, SLO, and error budget frameworks across the platform's most critical services
- Fault-tolerant, highly available infrastructure design and maintenance on GCP
- Containerised workload operations at scale across Kubernetes and service mesh environments
- Automation and tooling that systematically removes operational toil
- ML and AI integration into observability pipelines, anomaly detection, and incident workflows
- End-to-end incident response ownership, from detection through to blameless postmortem

What you will bring:

- A strong background in senior SRE, Production Engineering, or a comparable discipline
- Deep hands-on experience with GCP or AWS in demanding production environments
- Proven ability to operate and scale Kubernetes clusters under real load
- Fluency in Python or Node.js for automation, scripting, and tooling development
- A track record of owning CI/CD pipelines and improving delivery confidence

What is on offer:

- Competitive base salary with bonus, healthcare, and life assurance
- Fully remote with optional access to a collaborative hub when you want it
- A genuine technical leadership pathway for engineers who want to grow

For more information, contact Samer Jaffer in confidence on +353 1 649 8502 or sa [email protected]