We are currently seeking a Senior Cloud Infrastructure Engineer with MLOps experience to design, implement, and maintain infrastructure and deployment best practices for ML pipelines and models. You will bring expertise in machine learning, AI infrastructure, and automation, along with knowledge of AWS cloud infrastructure and DevOps practices.
What You'll Do
Design, build, and maintain scalable and robust infrastructure for AI/ML (Artificial Intelligence / Machine Learning) systems, including cloud-based environments, containerization, and orchestration platforms
Develop and implement CI/CD pipelines to automate the deployment, testing, and monitoring of AI/ML models and applications
Evaluate and integrate new tools, technologies, and frameworks to improve the efficiency and effectiveness of our MLOps processes
Design and manage continuous deployment using Kubernetes, ArgoCD, and Jenkins
Maintain the related container and model registries
Monitor infrastructure utilization and costs for model training and inference, including GPU utilization
Monitor and troubleshoot AI/ML systems to ensure high availability, performance, and reliability
What You'll Need
4+ years of experience as a DevOps Engineer
1 year of experience managing AI/ML infrastructure in public cloud environments
In-depth hands-on experience with at least one public cloud platform, preferably AWS
Experience with Python or any other programming language
Experience with Docker and Kubernetes in production
Experience with continuous deployment tools such as Jenkins or ArgoCD
Experience with logging and monitoring tools for SaaS, such as Sumo Logic, Splunk, or Datadog