Role: Cloud Systems Architect (Remote)
Location: Remote (Work from Anywhere)
Payout: Competitive

Role Overview:
We are hiring on behalf of one of our clients, who is seeking a Site Reliability Engineer (SRE) to work on a contractor basis. As a Site Reliability Engineer, you will apply your expertise to help train next-generation AI systems, shaping how models learn, reason, and perform through high-quality, real-world input. This role lets you contribute your domain knowledge directly to the development of frontier AI models.

Key Responsibilities:
• Design, implement, and maintain scalable infrastructure using Linux, Kubernetes, and Prometheus, ensuring seamless deployments and high system availability.
• Monitor system health, analyze performance metrics, and proactively address bottlenecks or potential failures.
• Automate operational processes to minimize manual intervention and increase system reliability.
• Collaborate closely with development and operations teams on deployments, and create comprehensive documentation and clear runbooks for operational excellence.
• Respond swiftly to incidents, conduct root cause analysis, and drive continuous improvements in incident response procedures to minimize downtime.

Required Skills & Qualifications:
• Proven experience designing, implementing, and maintaining scalable infrastructure using Linux, Kubernetes, and Prometheus, with a strong understanding of system health monitoring and performance metrics analysis.
• Strong understanding of automation tools and technologies, with experience automating operational processes to minimize manual intervention and increase system reliability.
• Excellent problem-solving skills, with the ability to analyze complex system issues, identify root causes, and develop effective solutions.
• Strong communication and collaboration skills, with the ability to work closely with development and operations teams to deliver seamless deployments and high system availability.
• Experience writing comprehensive documentation and clear, concise runbooks, with strong attention to detail.

More About the Opportunity:
You will work with a global leader in the AI industry, collaborating with top experts at global scale and contributing to the creation of cutting-edge AI models.

Equal Opportunity Employer:
We hire based on skills and expertise. All qualified candidates are welcome regardless of background, experience, or prior employment history. Applications are reviewed solely on demonstrated technical ability and qualifications.

Apply Now!
