The Opportunity

We’re looking for a Core Engineer at the Principal Infra / SRE level to own the reliability, scalability, upgradeability, and operational excellence of our edge platform at fleet scale. In this role, you’ll be the technical authority for designing and operating compound capabilities that span software, infrastructure, networking, security, data, and hardware, ensuring we can reliably deploy, upgrade, and manage fleets of thousands of devices with the highest technical rigor. You will set and enforce production standards, and you have the authority to stop changes that would put fleet safety or reliability at risk. During high-severity incidents, you are the technical owner, leading root-cause analysis and driving fixes across teams.

This is a hands-on role for someone who thrives in a high-ownership setting and wants to build the infrastructure that makes real-world AI possible. You’ll operate in an AI-native way, using AI to assist diagnostics and operations while ensuring all production changes remain governed, reviewed, and auditable.

What You’ll Do

Own platform-wide reliability and scalability architecture across the fleet, including upgradeability, rollback safety, resilience, observability, and incident response.
Lead the design and delivery of compound capabilities that span multiple specialist domains (hardware, networking, security, data, infrastructure, and AI runtime).
Set and enforce production-grade standards for operational excellence, including SLOs/SLIs, error budgets, on-call readiness, change management, incident management, and postmortem practices, with the authority to stop changes that introduce unacceptable risk.
Serve as the technical owner during high-severity incidents, leading diagnosis, root-cause analysis, and coordinated remediation across teams.
Design and operate secure, automated fleet lifecycle systems for deployment, updates, configuration management, and health management at scale.
Drive the evolution of observability and telemetry systems (metrics, logs, traces, audit, fleet state) so issues are detectable, diagnosable, and preventable.
Partner with engineering and commercial teams to translate real-world constraints into platform-level requirements and prioritization decisions.
Operate in an AI-native way: develop and use AI systems to accelerate diagnostics, automate operational workflows, and increase engineering velocity, while ensuring all production changes remain governed, reviewed, and auditable.
Mentor senior engineers across domains, review technical designs, and raise the quality bar for architecture and reliability across the organization.

What Success Looks Like

In your first 3 months, you will have:

Taken full ownership of a platform-wide reliability, upgradeability, or incident reduction initiative and delivered measurable improvements in fleet stability, deployment safety, and operational clarity.
Established or strengthened production standards that reduce risk and improve consistency across releases and fleet operations.
Demonstrated strong incident ownership by leading at least one high-severity investigation through root cause and durable remediation.

In your first year, you will be:

Owning the fleet-scale operational architecture end-to-end, with clear accountability for reliability, upgradeability, scalability, and security posture across thousands of deployed systems.
Delivering step-function improvements in platform resilience and operational excellence through durable systems (automated lifecycle management, observability, incident reduction, reliability standards).
Raising engineering rigor across the organization by enforcing standards, mentoring technical leaders, and driving cross-domain architectural decisions that compound over time.

Who You Are

10+ years building and operating production infrastructure and distributed systems, including reliability engineering at scale across complex, multi-tenant or fleet environments.
Deep experience with SRE practices: SLOs/SLIs, error budgets, observability, incident response, postmortems, and operational automation (e.g., Kubernetes-based platforms, Linux systems, and automation through infrastructure-as-code).
Strong systems thinking across software, infrastructure, networking, and security, with the ability to drive outcomes across multiple domains and enforce production standards.
Proven ability to lead ambiguous, high-impact initiatives end-to-end with strong technical judgment, crisp execution, and disciplined change management.
Clear communicator and trusted technical partner to engineering leadership, with the ability to lead high-severity incident response and drive cross-team alignment.
Ownership mindset: outcomes over tasks.

Unique Experiences We Value

Designing and operating fleet management and upgrade systems at scale, including safe rollout/rollback, configuration management, and health monitoring (e.g., canary deployments, staged rollouts, and verifiable rollback mechanisms).
Building observability platforms that make complex systems diagnosable and measurable across large distributed deployments (e.g., metrics/logs/tracing pipelines, alerting, and dashboards that drive action).
Security-first operations experience (secure boot, signed updates, audit logging, default-deny posture) and working in compliance-sensitive environments with governed production changes.
Experience operating systems under real-world edge constraints (limited connectivity, bandwidth limits, variable environments, high reliability requirements) and building automation that reduces operational variance.
Applying AI to operations and engineering workflows (automated diagnostics, agentic triage, runbook generation, anomaly detection) to increase rigor and speed while keeping production pathways reviewed and auditable.

Benefits

We work in a high-ownership, real-world startup environment where you’ll move fast, build new systems, and see your impact immediately: what you ship runs in the field and drives measurable customer outcomes.

We work alongside AI every day. Writing static code, docs, or plans “by hand” is no longer accepted; here you’ll use the latest AI tools to iterate and ship faster and to apply AI with our customers at scale.

You’ll take on elite technical challenges at the frontier of infrastructure, including next-generation cloud and IoT, hardware/software/networking in real-world edge environments, the foundation for data and AI inference, and industry-leading secure systems in demanding operational (OT) settings.

You’ll learn fast by working with exceptional teammates and collaborating directly with industry leaders as partners in software, AI, and infrastructure.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. This role has a base salary range of $190,000–$215,000.

Total compensation for this role includes equity in your work. You are eligible for meaningful equity through stock options in an early-stage, high-growth company.

You are eligible to participate in company benefit plans, which may include health, dental, and vision coverage, a 401(k) with company match, flexible PTO, paid parental leave, commuter benefits, and relocation and visa support for eligible roles.

Edgescale AI

At Edgescale AI, we’re deploying AI in the real world, helping customers apply this technology to unlock transformative productivity gains.
Our work sits at the intersection of infrastructure, security, networking, and AI, where reliability and performance are non-negotiable and where solutions demand deep, distributed systems thinking.

We’re intensely AI-native. We build with AI, we ship AI, and we use it every day to accelerate how we design, test, deploy, and operate complex systems.

If you want to help pave the way for applying AI in the real world, at global scale, we want to hear from you.

Edgescale AI is building an inclusive, merit-based organization. We are an equal opportunity employer and do not discriminate on the basis of any legally protected status. We value diversity, inclusion, and a shared passion for creating real-world impact.

Principal Core Engineer — Infra / SRE
