Requirements:
- 5+ years of experience as a Reliability Engineer in a production environment. Experience working for an AWS-based SaaS product organization.
- 2+ years of experience designing and maintaining cloud-based solutions with AWS.
- Experience with tooling service.
- Understanding of industry enterprise monitoring solutions in hybrid (AWS and on-premises) environments.
- Experience with the AWS Well-Architected Framework.
- Proficiency in coding with Python (to implement custom monitoring and automation solutions), Groovy, shell scripts, etc.
- Exposure to implementing CI/CD pipelines with GitOps on Argo CD, Flux CD, or Harness. Expertise in using Jenkins to build automated pipelines.
- Proficiency in AWS Cloud (EC2, RDS, EKS, EMR/Spark, S3, IAM, auto-scaling, Lambda, etc.).
- Proficiency in Terraform (for maintenance of AWS infrastructure), Helm, and configuration management tools.
- Knowledge of the Databricks/EMR platform; any experience debugging Spark jobs will be an advantage.
- Experience maintaining Helm templates.
- Experience working with pipeline scheduling and orchestration tools such as Airflow.
- Excellent business communication skills.
- Intellectual curiosity and innovative thinking, with a passion for problem-solving and for working independently or immersed within teams with no boundaries.
- Ability to prioritize and handle parallel issues and complete other assigned work.
- Ability to tackle incidents and handle high-stress situations.
- Bachelor's degree.

Responsibilities:
- Set up and maintain monitoring systems to track performance, collect metrics, identify issues, and facilitate proactive problem resolution.
- Be responsible for ensuring the reliability of systems, minimizing downtime, and maintaining service-level objectives (SLOs).
- Respond to and resolve incidents, conduct post-incident reviews, and implement improvements to prevent future occurrences.
- Collaborate with Product Engineering and DevOps teams to design scalable and reliable architectures that meet and support the application's needs.
- Develop automation and implement tools to streamline processes, deploy applications, and manage infrastructure.
- Forecast future capacity needs and ensure systems can scale to meet growing demand.
- Integrate security best practices into the development and operations processes to ensure a secure environment.
- Actively participate in retrospectives and continuously seek ways to improve system reliability and performance.
- Create and maintain documentation for systems, procedures, and configurations.
- Implement and test fail-safe strategies and backup plans.

We Offer:
- US and EU projects based on advanced technologies.
- Competitive compensation based on skills and experience.
- Annual performance appraisals.
- Remote-friendly culture and no micromanagement.
- Bonuses for recommendations of new employees.
- Bonuses for article writing, public talks, and other activities.
- 15 vacation days, 10 national holidays, and sick leave.
- Udemy unlimited training account.
- Free webinars, meetups, and conferences organized by Svitla.
- Fun corporate celebrations and activities.
- Awesome team, friendly and supportive community!