Job Description: Site Reliability Engineer (SRE)
Contract: 1-Year
Location: Montreal (Remote Role)

Role Overview

We are seeking a highly experienced Site Reliability Engineer (SRE) to support critical cybersecurity platforms by ensuring high availability, reliability, performance, and operational visibility. This role focuses on maintaining production stability, building robust observability practices, enhancing monitoring systems, and improving dashboards used across engineering, operations, risk, and executive leadership. The ideal candidate combines strong technical depth with an operational excellence mindset.

Key Responsibilities

Reliability & Operations
- Ensure reliability, availability, scalability, and performance of critical platforms and infrastructure
- Monitor system health, proactively identify risks, and resolve service-impacting issues
- Support incident response, troubleshooting, and service restoration

Observability & Monitoring
- Instrument applications, infrastructure, APIs, and cloud components for full-stack visibility
- Design and enhance monitoring, logging, alerting, tracing, and observability solutions
- Build actionable alerts that reduce noise and improve response efficiency

Metrics & Reporting
- Define and track key metrics (SLIs, SLOs, SLAs, latency, throughput, error rates)
- Develop and maintain dashboards for engineering, operations, and executive teams
- Continuously enhance dashboards for leadership visibility into system health and risks

Incident Management & Improvement
- Participate in incident response, root-cause analysis, and post-incident reviews
- Drive continuous improvement in operational processes and system resilience

Automation & DevOps
- Automate operational tasks, health checks, and recovery workflows
- Support CI/CD pipelines, release management, and production readiness
- Improve deployment validation and rollback strategies

Collaboration & Resilience
- Partner with engineering, cloud, infrastructure, and security teams
- Contribute to resilience initiatives (capacity planning, failover testing, disaster recovery)
- Align systems with governance, compliance, and security standards

Required Qualifications
- 10+ years of experience in SRE, DevOps, infrastructure, or production engineering
- Strong experience with distributed and cloud-based systems (AWS, Azure, or GCP)
- Hands-on expertise in:
  - Monitoring, logging, metrics, tracing, and observability practices
  - Building dashboards and operational reporting systems
- Deep understanding of SLIs, SLOs, SLAs, error budgets, and incident management
- Strong scripting/programming skills (Python, Java, Bash, or PowerShell)
- Experience with Infrastructure as Code (e.g., Terraform)
- Knowledge of CI/CD pipelines and DevOps workflows
- Strong troubleshooting skills across distributed systems and APIs
- Experience with relational and NoSQL databases
- Excellent communication skills (technical and executive-level reporting)

Preferred Skills
- Experience with cybersecurity or enterprise risk platforms
- Familiarity with observability tools such as Splunk, Grafana, Prometheus, Datadog, Dynatrace, and New Relic
- Experience creating executive dashboards and operational scorecards
- Exposure to:
  - Kubernetes, Docker, serverless architectures
  - Kafka or event-driven systems
- Knowledge of cloud security and governance tools
- Experience with AI/cloud monitoring services (Azure AI, AWS Bedrock, Vertex AI)
- Familiarity with Linux and Windows environments
- Experience with synthetic monitoring and automated health checks

Key Competencies
- Strong ownership mindset and focus on reliability
- Ability to work in fast-paced, production-critical environments
- Effective prioritization based on business impact and operational risk
- Excellent cross-functional collaboration skills
- Clear communication of technical insights to non-technical stakeholders
- Continuous improvement mindset focused on automation and resilience
- High attention to detail in monitoring, alerting, and reporting

