Are you passionate about observability, telemetry, production reliability, and helping engineering teams identify performance bottlenecks before users report them?

Care1 is seeking a Senior Site Reliability Engineer (SRE) to own and evolve our production observability and reliability practices. This is a full-time remote position based in Pakistan. To ensure effective collaboration with our Canadian Engineering team, candidates must have daily working hours that overlap from 10:00 AM to 2:00 PM Pacific Time (PT), Monday to Friday.

This is not a DevOps infrastructure role. We already have a DevOps team, so you will not be doing infrastructure or platform engineering. This is a highly technical, hands-on role focused on the health of our application: reliability, telemetry, latency analysis, monitoring, and alerting. You will work closely with DevOps, Software Engineers, Technical Leads, and QA to identify bottlenecks, improve system visibility, and help evolve the operational maturity of our platform.

Our goal is to move from reactive support, where users report issues first, to proactive observability, where the system alerts us before customers are impacted.

You will help establish observability as an engineering function within Care1. This role is ideal for someone who enjoys deep technical investigation, production debugging, distributed systems, telemetry pipelines, and driving operational excellence across engineering teams.

Responsibilities:

Implement and own the observability stack in our Python/Django application across logs, metrics, traces, queues, infrastructure, and application telemetry
Help evolve the platform from reactive support toward proactive reliability and operational awareness
Ensure production systems generate actionable telemetry that enables effective debugging, latency analysis, and incident investigation
Leverage AI-assisted engineering tools (we use Cursor) to debug application latency and production bottlenecks inside a Django monolith and across distributed services, including slow requests, ORM/query inefficiencies, worker contention, and infrastructure-related issues
Investigate production incidents, recurring failures, and reliability regressions, driving root cause analysis and long-term remediation
Translate production reliability and performance issues into clear, actionable engineering tickets and improvement plans
Lead adoption of observability tooling and telemetry-driven debugging practices across the platform, helping engineering teams use telemetry during debugging, performance analysis, and incident remediation
Define and standardize operational metrics, SLIs, SLOs, alerting strategies, and incident response practices
Design dashboards and alerts that provide meaningful visibility into application health, infrastructure performance, async processing, and deployment stability
Monitor system capacity, throughput, latency, and resource utilization trends to proactively identify scaling and performance risks
Improve visibility into deployment health, release stability, and production regressions following deployments
Collaborate closely with Software Engineers, Technical Leads, QA, and DevOps to improve system reliability, scalability, and operational maturity
Monitor and improve the reliability of distributed systems, async workflows, background workers, and event-driven processing pipelines
Help establish engineering standards around instrumentation, reliability ownership, incident management, and production readiness

Required Experience:

7+ years of software engineering experience as part of a team building and shipping SaaS products in a production environment
3+ years operating in an SRE or equivalent reliability-focused DevOps/Platform role
Strong experience with observability platforms such as Prometheus, Grafana, ELK/OpenSearch, Datadog, New Relic, OpenTelemetry, Loki or similar tooling
Hands-on experience wiring observability, telemetry and tracing tooling into Python/Django production systems
Proven experience designing operational monitoring systems, alerts, dashboards and telemetry pipelines
Strong practical experience with AI-assisted development tools
Strong understanding of distributed systems, async processing, queues, retries, timeouts, circuit breakers and production reliability patterns
Experience identifying and debugging performance bottlenecks across APIs, databases, background workers, and infrastructure
Experience debugging production database query performance issues
Hands-on experience working within AWS production environments and modern deployment workflows
Strong understanding of infrastructure, containers, networking, and cloud-based application architecture
Experience collaborating directly with software engineers to improve reliability, observability, and operational maturity
Exceptional written and verbal communication skills, with the ability to drive clarity with technical and non-technical stakeholders

Our Stack:

Frontend: React, RTK, TypeScript, Ant Design, Nginx, Vite, SCSS
Backend: Python, FastAPI, SQLAlchemy, Django, Celery/Redis, MySQL, MongoDB Atlas
Infrastructure: Docker, AWS (EC2, ECS, Lambda, S3, SQS, CloudWatch, etc.), Prometheus, Grafana, Flower, ELK
CI/CD: Bitbucket Pipelines, AWS CodeDeploy (blue/green)
Other tools: Cursor, Jest, Selenium, Figma, Git, Bitbucket, Jira, Slack, Google

About Care1:

Care1 is a profitable medical AI and technology company that is changing the way eye doctors think about patient care.

Our cloud-based software connects eye doctors (ophthalmologists and optometrists), allowing them to screen, diagnose, and treat patients remotely. We also label, train, fine-tune, and implement deep learning AI models for the automatic detection and treatment of complex eye diseases via a variety of modern diagnostic technologies, including 3-dimensional photos of the eye, optical coherence tomography, and peripheral visual field testing. Over the past decade, we have grown Care1 to integrate with over 200 eye doctors across Canada, who have used our platform to deliver care to over 150,000 patients. This makes the Care1 network one of the largest eyecare telemedicine networks worldwide.

Our vision is to become the leading eyecare AI platform in the world.

For more information about Care1, please visit our website (www.care1.ca) and our LinkedIn page (https://www.linkedin.com/company/care1tech).

Application Process:

Fill out the job application form, where we gather some initial information as the first step of our interview process: https://coda.io/form/SRE-Application-Questionnaire_dXPKp5otWHN
We would like to thank all applicants for their interest in Care1; however, only applicants selected for an interview will be contacted.

Interview Process:

Round 1 - Initial Screening (45 minutes) - We will introduce Care1 and get to know more about your background and experience. We will discuss your SRE experience specifically and explore how that experience aligns with the listed responsibilities for this role.
Round 2 - Technical Interview (45 minutes) - This will be a technical discussion around the key technologies and practices of site reliability and observability. The goal is to assess the technical skills required to perform the listed responsibilities for the SRE role.
Round 3 - Reliability & Observability Architecture (60 minutes) - This will be a system design whiteboard session with members of our engineering team. We will describe a software system and ask you to walk us through how you would architect a reliability and observability solution for that system. We are looking for your ability to architect a comprehensive solution and to reason about trade-offs with clear thinking and strong communication.

Senior Site Reliability Engineer - Observability (Pakistan Remote)
