Model Training: Lead and manage the end-to-end process of training large-scale deep learning models, from data collection and preprocessing to model development and optimization.
Architect and Design Training Pipelines: Design efficient and scalable training pipelines, incorporating best practices, distributed computing techniques, and the latest model training methodologies.
Research and Innovation: Stay current with the latest developments in model training and implement innovative techniques to enhance model performance and efficiency.
Performance Optimization: Identify and resolve performance bottlenecks in model training processes, optimizing for speed and resource utilization.
Documentation: Maintain thorough documentation of model training processes and keep it accessible to the broader team.
Mentoring and Knowledge Sharing: Mentor junior engineers and actively participate in knowledge-sharing initiatives to foster a culture of learning and growth.
Requirements & Skills:
Proven experience in training large-scale machine learning and deep learning models
Strong proficiency in popular deep learning frameworks such as TensorFlow or PyTorch
Solid understanding of distributed computing, GPU acceleration, and parallel processing
Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure) and containerization technologies (e.g., Docker, Kubernetes)
Excellent problem-solving skills and the ability to troubleshoot complex issues in model training pipelines
Strong communication and teamwork skills
Experience with productionizing machine learning models is a plus
Bachelor’s, Master’s, or Ph.D. in Computer Science, Machine Learning, or a related field