Site Reliability Engineer (SRE)

Kuala Lumpur, Malaysia

Job Description

Key Responsibilities:



1. Disaster Recovery Planning (DRP):

  • Design and maintain scalable failover systems, backup strategies, and redundancy mechanisms across cloud and on-prem environments.
  • Develop and update DR documentation, runbooks, and recovery playbooks for infrastructure and application layers.

2. Business Continuity Testing:

  • Plan, coordinate, and execute tabletop exercises, DR drills, and failover simulations.
  • Analyze and report outcomes of BC/DR tests; identify gaps and lead remediation initiatives.

3. Incident Response & Crisis Management:

  • Develop and refine incident response procedures, escalation paths, and communication frameworks for major outages.
  • Act as a key responder and facilitator during critical incidents, ensuring swift coordination across teams.

4. Data Backup & Recovery Strategy:

  • Implement and manage cloud-based and on-premises backup solutions, aligned with the defined RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
  • Regularly test and validate data restoration processes to ensure system recoverability.

5. 24/7/365 Coverage:

  • Participate in a rotating on-call schedule to ensure continuous coverage.
  • Daily operations run in 3 shifts of 9 hours each, with 1 member per shift and a one-hour overlap between consecutive shifts to facilitate smooth handovers.

6. Collaboration with Tier 1 and Tier 2 Support:

  • Work closely with the Tier 1 and Tier 2 teams, who serve as the first point of contact for incidents and service requests.
  • Provide expertise and escalation support as needed, ensuring efficient resolution of issues and seamless communication between teams.

Qualifications:



  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 3+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
  • Proven experience with DR planning, testing, and recovery operations.
  • Proficiency in AWS, with a focus on relevant services that support infrastructure and application layers.
  • Hands-on experience with backup solutions (e.g., Veeam, Rubrik, AWS Backup, Azure Site Recovery).
  • Strong understanding of high availability, system redundancy, and incident management frameworks (e.g., ITIL, NIST).
  • Familiarity with monitoring and alerting tools (e.g., Prometheus, Grafana, Splunk, PagerDuty).
  • Strong spoken and written English communication skills, essential for effective collaboration with global teams.

Preferred Skills:



  • Certifications in cloud platforms (e.g., AWS Solutions Architect, Azure Administrator).
  • Experience with chaos engineering or reliability testing tools (e.g., Gremlin, Chaos Monkey).

Job Type: Contract
Contract length: 12 months

Pay: Up to RM8,000.00 per month

Benefits:

  • Flexible schedule
  • Health insurance
  • Maternity leave
  • Opportunities for promotion
  • Professional development

Schedule:

Monday to Friday
Application Question(s):

Will you need visa sponsorship to work legally in Malaysia?
Education:

Bachelor's (Preferred)
Experience:

Site Reliability Engineering: 3 years (Required)
Work Location: In person

Expected Start Date: 09/01/2025



Job Detail

  • Job Id
    JD1124651
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Contract
  • Salary:
    84,722 to 107,277 USD
  • Employment Status
    Permanent
  • Job Location
    Kuala Lumpur, Malaysia
  • Education
    Not mentioned