MLOps / DevOps & Cloud Infrastructure Lead
The Opportunity
Lead the design and scaling of ATOM's cloud infrastructure. You'll own everything from Kubernetes clusters and GPU node pools to distributed model inference, simulation job orchestration, and multi-region production deployments. This role is critical: your work enables researchers to experiment rapidly and ensures customers get reliable, cost-optimized service.
What You'll Own
Kubernetes cluster design, GPU node pool management, and scaling strategies for mixed workloads (inference, training, simulation)
CI/CD pipeline architecture for models, services, and infrastructure; GitOps workflows and containerization
Infrastructure-as-code using Terraform or CloudFormation for reproducible, multi-region deployments
High-performance LLM serving using vLLM, TensorRT-LLM, or similar; optimization of batching, latency, and throughput
Model registry, versioning, and experiment tracking infrastructure
Asynchronous job execution, scheduling, and distributed task orchestration
Comprehensive logging, metrics, monitoring, and alerting; SLO definition and enforcement
Auto-scaling policies, cost controls, and GPU utilization optimization
Reliability architecture: retry strategies, failover mechanisms, and disaster recovery planning
Required Experience
7+ years of hands-on Kubernetes and cloud infrastructure experience (EKS, GKE, or AKS)
Production experience serving large ML models at scale
Expertise in CI/CD, Docker and containerization, and GitOps workflows
Deep understanding of GPU workloads, CUDA, and scaling distributed compute jobs
Strong Plus
Experience with simulation, rendering, or physics workloads (NVIDIA Isaac, V-Ray, etc.)
Cost optimization expertise for GPU-heavy systems and distributed inference
Prior work building ML platforms, feature stores, or AI infrastructure at scale