About the role

Joining Amex Tech means discovering and shaping your contribution to something big. Here, you can work alongside talented tech teams and build a unique career with the Powerful Backing of American Express. With a range of opportunities to work with the latest technologies, and a commitment to back the broader engineering community through open source, our mission is to power your success. Because Amex Tech is powered by our technology, our culture, and our colleagues.

Sr Site Reliability Engineer I develops and implements Site Reliability Engineering (SRE) strategies, ensures real-time system observability, promotes best practices and automation, and collaborates with cross-functional teams to enhance system reliability and customer experiences while mentoring junior engineers.

Mentors junior Site Reliability Engineers and cross-functional team of colleagues, fostering a culture of excellence and innovation
Provides guidance and support to junior engineers, fostering professional growth and development within the team, ensuring adherence to best practices in Site Reliability Engineering
Manages and oversees collaboration with Software Engineering teams to design, develop, and implement advanced features that enhance system resilience, scalability, and performance, proactively identifying and resolving complex system bottlenecks and failure points
Leads the development and refinement of sophisticated automation tools and frameworks, including advanced infrastructure as code (IaC) practices, to streamline complex operational workflows, deployment processes, and infrastructure management, significantly reducing manual intervention and ensuring high system efficiency
Actively engages in and influences high-level architectural design discussions, ensuring that advanced reliability, scalability, and performance considerations are deeply integrated into strategic decision-making processes, and driving the adoption of innovative solutions
Designs, executes, and oversees comprehensive chaos engineering experiments and advanced resiliency testing, analyzing results to implement robust improvements that enhance system robustness and recovery capabilities, and mentors colleagues in these practices
Leads the development, optimization, and maintenance of comprehensive disaster recovery plans and business continuity strategies, ensuring systems can recover quickly and effectively from complex and unexpected disruptions
Advocates for and implements advanced observability practices, including error budgeting, service-level objectives (SLOs), and service-level indicators (SLIs), contributing to a culture of continuous improvement and reliability, and mentoring colleagues in these practices
Collaborates with cross-functional teams to enhance customer journeys, ensuring seamless and reliable technology experiences by addressing potential reliability and performance issues proactively, and leading initiatives to improve overall system reliability
Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives

Education Qualifications

Bachelor’s degree in Computer Science, Information Technology, Engineering, and/or comparable experience; advanced degree preferred
8+ years of experience in software engineering and application development with strong proficiency in Java/J2EE, Python, Kotlin, Spring Boot, SQL, and NoSQL
Knowledge of the modern observability stack: Splunk, Elasticsearch, Prometheus, Grafana
Knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture
Knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms
Knowledge of cloud-based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud
Experience in software development, or technology operations, with a focus on Site Reliability Engineering
Experience in Linux/Unix systems, object-oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP)

Licenses and Certifications

Advanced certification in Site Reliability Engineering (SRE) or a related field is a plus

Employment eligibility to work with American Express in the United States is required, as the company will not pursue visa sponsorship for these positions.

Senior Site Reliability Engineer I