MLOps / DevOps & Cloud Infrastructure Lead 

The Opportunity 

Lead the design and scaling of ATOM's cloud infrastructure. You'll own everything from Kubernetes clusters and GPU node pools to distributed model inference, simulation job orchestration, and multi-region production deployments. This role is critical: your work enables researchers to experiment rapidly and ensures customers receive reliable, cost-optimized service.

What You'll Own 

  • Kubernetes cluster design, GPU node pool management, and scaling strategies for mixed workloads (inference, training, simulation) 

  • CI/CD pipeline architecture for models, services, and infrastructure; GitOps workflows and containerization 

  • Infrastructure-as-code using Terraform or CloudFormation for reproducible, multi-region deployments 

  • High-performance LLM serving using vLLM, TensorRT-LLM, or similar; optimization of batching, latency, and throughput 

  • Model registry, versioning, and experiment tracking infrastructure 

  • Async job execution, scheduling, and distributed task orchestration 

  • Comprehensive logging, metrics, monitoring, and alerting; SLO definition and enforcement 

  • Auto-scaling policies, cost controls, and GPU utilization optimization 

  • Reliability architecture: retries, failover mechanisms, and disaster recovery planning 

Required Experience

  • 7+ years of hands-on Kubernetes and cloud infrastructure experience (EKS, GKE, or AKS)

  • Production experience serving large ML models at scale

  • Expertise in CI/CD, Docker, containerization, and GitOps workflows

  • Deep understanding of GPU workloads, CUDA, and scaling distributed compute jobs

Strong Plus

  • Experience with simulation, rendering, or physics workloads (NVIDIA Isaac, V-Ray, etc.)

  • Cost optimization expertise for GPU-heavy systems and distributed inference

  • Prior work building ML platforms, feature stores, or AI infrastructure at scale