Job Purpose

To provide technical leadership for reliability across RWS’s enterprise systems and internal AI platforms, ensuring systems are resilient, observable, and scalable. The role supports setting the direction for Site Reliability Engineering practices, influences architecture across multiple domains, and enables the safe and reliable operation of business-critical services and AI-driven workflows.

About Product & Technology

Product & Technology plays a pivotal role in aligning the organization with its strategic objectives and enhancing shareholder value. Product & Technology is responsible for establishing unified standards and governance practices throughout the company. Additionally, we oversee the development and maintenance of core applications essential for the seamless operation of various functions across the organization. We are committed to driving and executing future roadmaps that are in line with the overall strategic direction of RWS.

With a global reach, Product & Technology provides support services to over 7500 end users worldwide. We take pride in managing the information security operation and safeguarding all our assets. Our core functions encompass Enterprise & Technical Architecture, Network & Voice, Infrastructure, Service Delivery, Service Operations, Data & Analytics, Security & Quality Compliance, Transformation, Application Development, and Enterprise Platforms. With a dedicated team of over 500 staff, Product & Technology ensures a strong presence across all regions, enabling efficient and effective support to our global operations.

Job Overview

Key Responsibilities

Define and drive enterprise-wide reliability strategy across internal systems, enterprise applications, and AI platforms.
Own reliability standards (SLIs, SLOs, error budgets) for business-critical internal and enterprise services.
Shape the reliability architecture for the AI agentic platform, ensuring scalability, observability, and safe automation.
Lead cross-domain design reviews to ensure resilience and performance of enterprise systems and internal tooling.
Drive adoption of observability platforms and unified telemetry standards across legacy and modern systems.
Partner with enterprise application, platform, and data teams to improve system reliability and operational maturity.
Establish patterns for reliability in hybrid environments (cloud, on-prem, enterprise SaaS).
Act as a technical authority and mentor to engineers across RWS.

Skills & Experience

Extensive experience in Site Reliability Engineering or platform engineering in complex enterprise environments.
Deep knowledge of distributed systems, enterprise applications, and internal platforms.
Strong experience with cloud infrastructure (AWS/GCP), Kubernetes, and hybrid environments.
Expertise in observability tooling (e.g. Prometheus, Grafana, OpenTelemetry).
Experience supporting or designing AI/ML or agent-based platforms at scale (or closely related systems).
Proven ability to influence architectural direction across multiple teams and domains.
Excellent communication and leadership skills with a focus on alignment and clarity.