Are you passionate about observability, telemetry, production reliability, and helping engineering teams identify performance bottlenecks before users report them?
Care1 is seeking a Senior Site Reliability Engineer (SRE) to own and evolve our production observability and reliability practices. This is a full-time remote position based in Pakistan. To ensure effective collaboration with our Canadian Engineering team, candidates must be available daily from 10:00 AM to 2:00 PM Pacific Time (PT), Monday to Friday.
This is not a DevOps infrastructure role. We already have a DevOps team, so you will not be doing infrastructure or platform engineering. This is a highly technical, hands-on role focused on the health of our application: reliability, telemetry, latency analysis, monitoring, and alerting. You will work closely with DevOps, Software Engineers, Technical Leads, and QA to identify bottlenecks, improve system visibility, and help evolve the operational maturity of our platform.
Our goal is to move from reactive support, where users report issues first, to proactive observability, where the system alerts us before customers are impacted.
You will help establish observability as an engineering function within Care1. This role is ideal for someone who enjoys deep technical investigation, production debugging, distributed systems, telemetry pipelines, and driving operational excellence across engineering teams.
Responsibilities:
- Implement and own the observability stack in our Python/Django application across logs, metrics, traces, queues, infrastructure, and application telemetry
- Help evolve the platform from reactive support toward proactive reliability and operational awareness
- Ensure production systems generate actionable telemetry that enables effective debugging, latency analysis, and incident investigation
- Leverage AI-assisted engineering tools (we use Cursor) to debug application latency and production bottlenecks inside a Django monolith and across distributed services, including slow requests, ORM/query inefficiencies, worker contention, and infrastructure-related issues
- Investigate production incidents, recurring failures, and reliability regressions, driving root cause analysis and long-term remediation
- Translate production reliability and performance issues into clear, actionable engineering tickets and improvement plans
- Lead adoption of observability tooling across the platform by helping engineering teams use telemetry during debugging, performance analysis, and incident remediation
- Define and standardize operational metrics, SLIs, SLOs, alerting strategies, and incident response practices
- Design dashboards and alerts that provide meaningful visibility into application health, infrastructure performance, async processing, and deployment stability
- Monitor system capacity, throughput, latency, and resource utilization trends to proactively identify scaling and performance risks
- Improve visibility into deployment health, release stability, and production regressions following deployments
- Collaborate closely with Software Engineers, Technical Leads, QA, and DevOps to improve system reliability, scalability, and operational maturity
- Monitor and improve the reliability of distributed systems, async workflows, background workers, and event-driven processing pipelines
- Help establish engineering standards around instrumentation, reliability ownership, incident management, and production readiness
Required Experience:
- 7+ years of software engineering experience as part of a team building and shipping SaaS products in a production environment
- 3+ years operating in an SRE or equivalent reliability-focused DevOps/Platform role
- Strong experience with observability platforms such as Prometheus, Grafana, ELK/OpenSearch, Datadog, New Relic, OpenTelemetry, Loki or similar tooling
- Hands-on experience wiring observability, telemetry and tracing tooling into Python/Django production systems
- Proven experience designing operational monitoring systems, alerts, dashboards and telemetry pipelines
- Strong practical experience with AI-assisted development tools
- Strong understanding of distributed systems, async processing, queues, retries, timeouts, circuit breakers and production reliability patterns
- Experience identifying and debugging performance bottlenecks across APIs, databases, background workers, and infrastructure
- Experience debugging production database query performance issues
- Hands-on experience working within AWS production environments and modern deployment workflows
- Strong understanding of infrastructure, containers, networking, and cloud-based application architecture
- Experience collaborating directly with software engineers to improve reliability, observability, and operational maturity
- Exceptional written and verbal communication skills, with the ability to drive clarity with technical and non-technical stakeholders
Our Stack:
- Frontend: React, RTK, TypeScript, Ant Design, Nginx, Vite, SCSS
- Backend: Python, FastAPI, SQLAlchemy, Django, Celery/Redis, MySQL, MongoDB Atlas
- Infrastructure: Docker, AWS (EC2, ECS, Lambda, S3, SQS, CloudWatch, etc.), Prometheus, Grafana, Flower, ELK
- CI/CD: Bitbucket Pipelines, AWS CodeDeploy (blue/green)
- Other tools: Cursor, Jest, Selenium, Figma, Git, Bitbucket, Jira, Slack, Google
About Care1:
Care1 is a profitable medical AI and technology company that is changing the way eye doctors think about patient care.
Our cloud-based software connects eye doctors (ophthalmologists and optometrists), allowing them to screen, diagnose, and treat patients remotely. We also label, train, fine-tune, and implement deep learning AI models for the automatic detection and treatment of complex eyecare diseases via a variety of modern diagnostic technologies, including 3-dimensional photos of the eye, optical coherence tomography, and peripheral visual field testing. Over the past decade we have grown Care1 to connect over 200 eye doctors across Canada, who have used our platform to deliver care to over 150,000 patients. This makes the Care1 network one of the largest eyecare telemedicine networks worldwide.
Our vision is to become the leading eyecare AI platform in the world.
For more information about Care1, please visit our website www.care1.ca and our LinkedIn page https://www.linkedin.com/company/care1tech
Application Process:
- Fill out the job application form, which gathers some initial information as the first step of our interview process: https://coda.io/form/SRE-Application-Questionnaire_dXPKp5otWHN
- We would like to thank all applicants for their interest in Care1; however, only applicants selected for an interview will be contacted.
Interview Process:
- Round 1 - Initial Screening (45 minutes) - We will introduce Care1 and get to know your background and experience. We will discuss your SRE experience specifically and explore how it aligns with the listed responsibilities for this role.
- Round 2 - Technical Interview (45 minutes) - This will be a technical discussion of the key technologies and practices of site reliability and observability. The goal is to assess the technical skills required to perform the listed responsibilities for the SRE role.
- Round 3 - Reliability & Observability Architecture (60 minutes) - This will be a system design whiteboard session with members of our engineering team. We will describe a software system and ask you to walk us through how you would architect a reliability and observability solution for it. We are looking for a comprehensive solution, clear reasoning about trade-offs, and strong communication.