Senior Site Reliability Engineer (SRE)

Job Brief:
We are seeking an experienced Senior Site Reliability Engineer (SRE) to join our team and play a critical role in ensuring the reliability, scalability, and performance of our systems. The ideal candidate will have a strong background in infrastructure management, automation, and a passion for optimizing and improving the reliability of mission-critical systems.

Responsibilities:
Design, implement, and maintain highly available and scalable infrastructure solutions.
Develop and maintain automated deployment, monitoring, and alerting systems to ensure system reliability and performance.
Collaborate with development teams to design, implement, and maintain CI/CD pipelines for automated testing and deployment.
Lead incident response and resolution efforts, including root cause analysis and post-incident reviews.
Implement and enforce best practices for system security, including access controls, data encryption, and vulnerability management.
Proactively identify performance bottlenecks and optimization opportunities in the infrastructure and application stack.
Participate in capacity planning and resource allocation to ensure scalability and cost efficiency.
Mentor junior engineers and provide technical guidance on best practices for reliability engineering.

Requirements:
Bachelor's degree in Computer Science, Information Technology, or related field.
5+ years of experience in site reliability engineering, systems administration, or related field.
Strong expertise in cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
Proficiency in infrastructure as code (IaC) tools such as Terraform, CloudFormation, or Ansible.
Experience with container orchestration systems such as Kubernetes or Docker Swarm.
Deep understanding of Linux/Unix systems administration and networking concepts.
Familiarity with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or Datadog.
Knowledge of scripting languages such as Python, Bash, or PowerShell.
Strong problem-solving skills and ability to troubleshoot complex issues in production environments.
Excellent communication and collaboration skills, with the ability to work effectively in a team environment.

Preferred Qualifications:
Certification in cloud platforms such as AWS Certified Solutions Architect, Google Cloud Certified Professional Cloud Architect, or Azure Solutions Architect.
Experience with service mesh technologies such as Istio or Linkerd.
Knowledge of distributed systems design principles and microservices architecture.
Familiarity with agile methodologies and DevOps practices.
Contributions to open-source projects or participation in the SRE community.
