Company Profile:
Our client is a U.S.-based group of affiliated companies operating at the intersection of legal technology and mass tort litigation. The organization includes a legal technology platform that automates medical record retrieval and case qualification for law firms, a Washington, D.C.–based mass tort litigation firm, and related holding entities. It is a lean, high-growth environment where each team member plays a significant and impactful role.

Overall purpose and responsibilities of the role:
As a Site Reliability Engineer, you will help build and support a technology platform while working closely with support staff and developers. You will be responsible for monitoring and troubleshooting the live platform to ensure optimal performance and stability. The role will also involve participating in new customer onboarding, provisioning customer environments, and resolving production issues to maintain system reliability and performance.

Duties and Responsibilities:
● Monitor and troubleshoot the running platform across multiple services and components
● Analyze Cloud Run logs, Temporal workflow UI, GKE pod status, and Pub/Sub queues to identify and resolve issues
● Perform end-to-end triage to determine whether issues originate from the agent layer (Python), workflow layer (Temporal), API layer (Go), or frontend (Vue)
● Support resolution of paralegal-facing operational issues such as stuck cases, failed faxes, and pending qualifications
● Execute and write SQL queries against AlloyDB for investigation, validation, and troubleshooting
● Participate in platform development and improvement initiatives, including identifying recurring issues and contributing to fixes
● Support new customer onboarding, including provisioning and validating customer environments
● Contribute to the build and enhancement of internal tools, services, and platform components
● Act as a Level 2 support engineer, going beyond surface-level platform monitoring to identify and resolve deeper system and integration errors
● Develop and maintain runbooks, escalation procedures, and operational documentation to improve incident response and system reliability

Site Reliability Engineer (Remote) - #35039
