Bachelor's degree in computer science or a related technical field.
Minimum of 5 years of experience in designing or operating high-performance computing systems.
Introduce technology and software to improve the performance, resiliency, and quality of service in infrastructure.
Possess a deep understanding of Linux fundamentals.
Understand Kubernetes environments and be able to debug them.
Proven ability to manage priorities in a dynamic, fast-paced environment.
The primary work location is in Kuala Lumpur, but the role may require occasional travel to Johor.
JOB DESCRIPTION
YTL AI Cloud is looking for an experienced system manager to lead the infrastructure team in developing, managing, and operating the GPU cluster infrastructure. This role is responsible for the development, integration, and operation of platforms central to sustaining cluster availability. The role also oversees the entire ticketing system, from the upstream side facing customers to the downstream side facing suppliers for RMA.
Key Responsibilities:
Work closely with the system architecture team and external resources to design and develop the system platform and all relevant tools.
Implement, manage, and maintain the platform to ensure optimal performance and high reliability.
Provide expert technical guidance and leadership across complex infrastructure projects.
Develop and implement strategies to resolve technical challenges, enhancing cluster performance while meeting SLA requirements.
Lead the team in managing and implementing all owned and customer provisioning on the platform.
Explore possible enhancements to improve daily processes and the troubleshooting and ticketing procedures.
Ensure robust monitoring of the system infrastructure.
Manage strong relationships with vendors, service providers, and internal stakeholders to ensure seamless integration and operations.
Desired Skills:
Deep hands-on expertise in and comprehensive knowledge of GPU clusters and platforms.
Exceptional communication skills, capable of discussing both technical and non-technical topics with diverse audiences.
Strong interpersonal skills, with a proven ability to develop professional relationships across business and technical teams.
Proficient in project management, with keen analytical abilities and problem-solving skills.
Knowledgeable in operating ticketing systems and in troubleshooting processes for CPU/GPU clusters.
Excellent documentation skills to effectively articulate technical designs, issues, procedures, and assessments.