
Member of Technical Staff - ML Infra

Causallabs
Posted 2 weeks ago
Relocation support
United States
Data & Analytics

Support summary

Relocation support

Explicitly identified in the job description.

Visa sponsorship

No visa sponsorship identified.

About this role

Responsibilities

Design, deploy, and maintain large distributed ML training and inference clusters
Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle
Research and test various training approaches, including parallelization techniques and numerical precision trade-offs across different model scales
Analyze, profile, and debug low-level GPU operations to optimize performance
Stay up-to-date on research to bring new ideas to work

What we're looking for

We value a relentless approach to problem-solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

Strong grasp of state-of-the-art techniques for optimizing training and inference workloads
Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models
Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)
Background working on distributed task management systems and scalable model serving & deployment architectures
Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don't have to meet every single requirement above.
