MLOps / DevOps & Cloud Infrastructure Lead
The Opportunity
Lead the design and scaling of ATOM's cloud infrastructure. You'll own everything from Kubernetes clusters and GPU node pools to distributed model inference, simulation job orchestration, and multi-region production deployments. This role is critical: your work enables researchers to experiment rapidly and ensures customers get reliable, cost-optimized service.
What You'll Own
Kubernetes cluster design, GPU node pool management, and scaling strategies for mixed workloads (inference, training, simulation)
CI/CD pipeline architecture for models, services, and infrastructure; GitOps workflows and containerization
Infrastructure-as-code using Terraform or CloudFormation for reproducible, multi-region deployments
High-performance LLM serving using vLLM, TensorRT-LLM, or similar; optimization of batching, latency, and throughput
Model registry, versioning, and experiment tracking infrastructure
Asynchronous job execution, scheduling, and distributed task orchestration
Comprehensive logging, metrics, monitoring, and alerting; SLO definition and enforcement
Auto-scaling policies, cost controls, and GPU utilization optimization
Reliability architecture: retry strategies, failover mechanisms, and disaster recovery planning
Required Experience
7+ years of hands-on Kubernetes and cloud infrastructure experience (EKS, GKE, or AKS)
Production experience serving large ML models at scale
Expertise in CI/CD, Docker and containerization, and GitOps workflows
Deep understanding of GPU workloads, CUDA, and scaling distributed compute jobs
Strong Plus
Experience with simulation, rendering, or physics workloads (NVIDIA Isaac, V-Ray, etc.)
Cost optimization expertise for GPU-heavy systems and distributed inference
Prior work building ML platforms, feature stores, or AI infrastructure at scale