About Us

At Bettermode, we are redefining how businesses streamline customer experiences and foster strong relationships. Our platform empowers businesses to seamlessly craft powerful web apps, with engagement tools at their core, tailored to their unique needs. Backed by Silicon Valley investors and trusted by brands like Lenovo, Mercedes, and Xano, we’re proud to connect millions of end-users daily (check our Showcase page 😉). Join us as we continue building tools that redefine customer engagement!

Benefits

🌟 At Bettermode, we’re dedicated to empowering our team to thrive, both professionally and personally. We offer location-based, competitive compensation that reflects your expertise and impact, with annual reviews so you can grow with us. Our culture is built on ownership and trust, giving you real influence over how we scale and succeed.
🩺 From your very first day, you and your family are covered by comprehensive Canadian health benefits, dental and vision included, so you can focus on what matters most.
😎 Enjoy unlimited paid vacation days, paid parental leave to support your family, and bereavement leave should you need it.
🛠️ You’ll have all the equipment you need provided, or you can bring your own device and access our Device Upgrade Policy: an interest-free hardware stipend, repayable via payroll deductions, that lets you upgrade when you need to.
💡 We want you to thrive in your work: every team member receives a monthly Tech & Appreciation Stipend, perfect for testing new software or tools and improving your workflows as you see fit.
🏢 For in-person collaboration, our downtown Toronto office is less than a 15-minute walk from Union Station, with a free shuttle running throughout the day. The office features complimentary snacks, coffee, video games, and board games, as well as dedicated seating and a flexible environment that supports creativity, focus, and teamwork.
🌎 Join a globally diverse and collaborative team where you’re supported to do your best work and have access to all the resources needed to succeed.

About This Role

Employment Type: Full-time
Location: Canada
Location Type: Remote or Hybrid (3 days at the office in downtown Toronto on Monday, Tuesday, and Wednesday, for employees residing within 40 km of the company headquarters)
Timezone: Eastern Standard Time

The Opportunity

This is not a generic DevOps role, not a narrow tool-operator role, and not a vendor-certified specialist role. You will help shape foundational parts of Bettermode's platform across Kubernetes runtime architecture, Terraform-governed AWS and Cloudflare infrastructure, service-to-service networking, data-plane efficiency, OLAP analytics systems, databases such as Aurora PostgreSQL and MongoDB Atlas, cost visibility, security/compliance governance, and deployment architecture.

The role is intentionally broad across solutions architecture and systems programming: tune where appropriate, but build or redesign when necessary, with a strong emphasis on secure, recoverable, and well-governed platform foundations.

Our operating model includes a production on-call program: engineers participate in an every-other-week rotation for P0 incident response, post-incident learning, and production ownership.

Responsibilities

Diagnose and remediate foundational platform problems across Kubernetes/EKS, Terraform-managed AWS and Cloudflare infrastructure, networking, observability, OLAP/data systems, security controls, and deployment architecture.
Own Kubernetes platform patterns and Terraform/OpenTofu workflows that make environments reproducible, reviewable, secure, and recoverable, including promotion, drift control, and policy-aware infrastructure changes.
Design AZ-aware and topology-aware improvements, starting with Aurora PostgreSQL routing/scalability and extending to other data-plane systems where traffic locality, availability, and cost matter.
Build cost and workload observability that attributes AWS infrastructure spend, network transfer, CPU/memory usage, and cross-AZ patterns to services, workloads, and teams.
Build production-grade platform components in Go, Rust, or TypeScript where appropriate, including Kubernetes controllers, Terraform plugins, telemetry collectors, bespoke proxies, and CLIs. Select the language based on SDK maturity, operational correctness, maintainability, and ecosystem fit.
Implement platform security and compliance controls aligned with SOC 2, OWASP, GDPR, IAM least privilege, secrets handling, encryption, network segmentation, auditability, and data protection.
Support OLAP analytics infrastructure and the migration from Pinot to ClickHouse, with attention to ingestion topology, query performance, data correctness, cost, and operational safety.
Design safe rollout, resilience, and DR patterns, including canaries, bypass modes, fast rollback, degraded-mode operation, backup/restore workflows, failover procedures, RTO/RPO trade-offs, and incident playbooks.

Example Projects

Build an AZ-aware Aurora PostgreSQL routing/proxy component and topology-aware controls that reduce inter-AZ traffic, improve reader/writer behaviour, and provide predictable behaviour during scaling and failover.
Build workload-level cost and resource intelligence capabilities across AWS and Kubernetes, including attribution for network traffic, cross-AZ patterns, CPU/GPU/memory utilization, and other signals needed to improve infrastructure efficiency.
Redesign fragile legacy deployment and infrastructure abstractions where the right answer is not more Helm or YAML, but stronger software-defined platform foundations.
Investigate and correct service-to-service transport pathologies, including HTTP/2 and gRPC enablement, multiplexing behaviour, socket starvation, connection skew, and load-distribution inefficiencies.

What You Bring to the Team

Deep production experience with Kubernetes/EKS, Terraform/OpenTofu, AWS, and Cloudflare, including secure deployment patterns, environment promotion, drift control, and operational ownership.
Strong software engineering fundamentals building backend, infrastructure, or distributed systems in production, with systems instincts for concurrency, performance, failure modes, and operational correctness.
Professional experience with Go or Rust for production platform components such as Kubernetes controllers, Terraform providers, CLIs, proxies, telemetry collectors, reconcilers, or internal developer tooling. TypeScript is valuable for higher-level platform definition and GitOps tooling where appropriate.
Solid understanding of Linux, TCP/IP, HTTP, HTTP/2, gRPC, connection behaviour under load, and service-to-service networking.
Strong understanding of security and compliance-oriented platform engineering, including SOC 2 evidence, OWASP-aligned practices, GDPR-aware data handling, IAM boundaries, encryption, secrets management, and audit trails.
Practical experience designing or operating DR capabilities, including backups, restore testing, RTO/RPO trade-offs, failover procedures, degraded-mode operation, and incident response playbooks.
Practical familiarity with databases or OLAP/data infrastructure such as Aurora PostgreSQL, MongoDB Atlas, Pinot, ClickHouse, or similar systems, plus the ability to reason from first principles instead of relying only on vendor defaults.

Bonus if You

Have built Kubernetes controllers, operators, reconcilers, or long-running platform agents in production.
Have experience with service meshes, proxies, transport-aware systems, traffic steering, or network observability for cost attribution.
Have worked on cloud cost attribution, workload-level infrastructure observability, eBPF, VPC Flow Logs, or similar telemetry systems.
Have MLOps experience with KServe, Kubeflow, MLflow, model-serving infrastructure, GPU workloads, or other AI/ML platform systems.
Have contributed to brownfield infrastructure migration, Terraform/OpenTofu import workflows, drift detection, policy-as-code, or infrastructure governance.
Have experience replacing brittle YAML/Helm-heavy abstractions with typed platform tooling, GitOps generators, CDK-style infrastructure definitions, or internal developer platforms.

If you think this role is right for you, apply today! We’re excited to share more details, learn about your experience, and discover together if we’re the perfect fit for each other.

Commitment to Diversity

As we continue to grow with customers and team members worldwide, we are committed to cultivating an environment where everyone’s unique perspectives are heard and valued. The diversity of our team will enable us to build the most inclusive product and workplace possible. We encourage applications from all backgrounds, identities, abilities, and life experiences.

Additional Information

Headcount: This is a vacancy at Bettermode.
Compensation Range: CA$160K-$180K annually for Canada-based candidates
AI Use: Large language models (LLMs) might be used in the hiring process for this position to screen, assess, or select job applicants.

Senior Platform Systems Engineer
