Job Description
Join TechCorp Indonesia as a Reliability Engineer and help safeguard our digital infrastructure. We are looking for an engineer to design, implement, and optimize the systems that sustain 99.99% uptime for millions of users. You will collaborate with experienced teams in an environment that values both innovation and resilience, and help turn reliability engineering into a competitive advantage. If you stay calm under pressure and enjoy bringing order to chaos, this role is for you.
Responsibilities
- Design and implement robust monitoring for cloud and on-premises infrastructure using tools such as Prometheus, Grafana, and the ELK Stack
- Define SLIs and SLOs, and conduct root cause analysis for critical incidents
- Automate chaos engineering experiments and resilience testing frameworks
- Create runbooks and incident response playbooks for high-impact systems
- Optimize CI/CD pipelines for deployment reliability and rollback capabilities
- Lead post-mortem analyses and implement preventive measures
- Mentor junior engineers on reliability best practices
Qualifications
- Bachelor's degree (S1) in Computer Science, Engineering, or a related field (minimum GPA 3.2)
- 3+ years of experience in production reliability engineering at scale
- Expertise in Linux systems, Kubernetes, and cloud platforms (AWS/Azure/GCP)
- Proficiency in at least one scripting language (Python/Bash/Go)
- Experience with observability tools and incident management systems
- Strong understanding of distributed systems and microservices architecture
- Relevant certifications (e.g., AWS Certified DevOps Engineer - Professional, Certified Kubernetes Administrator)