Job Description
TechGlobal Nusantara, a leading digital innovation company based in Jakarta, is seeking a highly skilled and passionate Site Reliability Engineer (SRE) to join our dynamic infrastructure team. In this critical role, you will bridge the gap between development and operations, ensuring the reliability, scalability, and efficiency of our production systems. You will be instrumental in automating infrastructure, monitoring system health, and responding to incidents to maintain an exceptional user experience for millions of our customers.
We offer a collaborative and forward-thinking work environment where you can leverage cutting-edge cloud technologies and make a tangible impact on our platform's success. If you thrive in a fast-paced environment and are passionate about building resilient systems, we want to hear from you.
Responsibilities
- Design, implement, and maintain highly available, scalable, and fault-tolerant cloud infrastructure on AWS and GCP.
- Develop and manage robust monitoring, alerting, and observability solutions using tools like Prometheus, Grafana, and Datadog.
- Automate operational tasks, deployments, and configuration management using Infrastructure as Code (IaC) tools such as Terraform and Ansible.
- Lead incident response, perform root cause analysis, and run blameless post-mortems to prevent recurrence.
- Collaborate closely with software engineering teams to improve application performance, reliability, and deployment processes.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and maintain system reliability.
- Optimize system performance and cost-efficiency across all production environments.
- Participate in an on-call rotation to provide 24/7 support for critical production systems.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
- 3 to 5 years of experience in a Site Reliability Engineering, DevOps, or Systems Administration role.
- Strong hands-on experience with at least one major cloud provider (AWS, GCP, or Azure).
- Proficiency in at least one scripting or programming language (e.g., Python, Go, Bash).
- Deep understanding of containerization and orchestration technologies, specifically Docker and Kubernetes.
- Experience with Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Pulumi.
- Solid knowledge of Linux system administration, networking (TCP/IP, DNS, HTTP), and distributed systems concepts.
- Experience with monitoring and logging tools such as Prometheus, Grafana, ELK Stack, or Datadog.
- Excellent problem-solving skills and the ability to work effectively under pressure during incidents.