About the job:
As a Senior Site Reliability Engineer (SRE) in this team, you will be responsible for creating and supporting the tools, infrastructure, and processes that allow us to develop, test, deploy, and operate our services in production. You will work at the intersection of DevOps and SRE, where applications meet infrastructure. In this role, you will have the opportunity to apply your technical skills in systems management, software development, databases, and leadership to provide best-practices guidance and tier-3 support for a new business we are launching, and to lead the effort to create and sustain a high-performance SRE team.
We are open to hiring someone remotely. Our preferred locations are Germany, Belgium, the Czech Republic, France, Poland, Ireland, Spain, Israel, the US, and Canada.
More information about the project:
https://www.redhat.com/en/about/press-releases/red-hat-introduces-red-hat-trusted-software-supply-chain
What you will do:
- Work closely with SRE and software engineering teams to design and implement scalable and high-performance solutions for our cloud services and internal development tools
- Help define the production deployment architecture
- Deploy new clusters and configure them using infrastructure-as-code and GitOps tooling
- Manage the scaling of existing clusters
- Design, implement, test, and use production backup and recovery systems
- Design, implement, test, and use workload migration systems
- Drive automation of application deployment for production and pre-release environments
- Train development teams on how to use the automation tools to deploy their own applications
- Design, implement, and manage continuous integration, build management, and deployment scripts and systems
- Provide troubleshooting and break-fix support for production services
- Quickly and efficiently troubleshoot simple and complex issues in order to provide outstanding support for internal service level objectives
- Identify areas for process and efficiency improvement; recommend solutions and assist in overseeing implementation. Actively facilitate continuous improvement
- Ensure all necessary operational processes and procedures are carried out with a high level of attention to detail, expediency, and on-time delivery
- Document run books and standard operating procedures
- Create and maintain system information and architecture diagrams
- Create customer self-help documentation
- Monitor capacity and health indicators and trends across systems; provide analytics and forecasts for added or reduced capacity as required
- Train development teams on how to set up their own logging, metrics, monitoring, and alerting using our operational toolchain, according to established best practices
- Implement automated incident resolution solutions
What you will bring:
- Willingness to do SRE work, including PagerDuty on-call rotations
- Experience with deploying resources using public cloud providers
- Extensive Kubernetes experience; OpenShift experience is preferred
- Fluency with the Kubernetes CLI, web console, and YAML configuration files
- Experience with GitOps and/or infrastructure-as-code required; Argo CD preferred
The following is considered a plus:
- Degree in Computer Science
- Experience with Site Reliability Engineering
- Experience with Kustomize
- Experience with configuration and change management
- Understanding of TCP/IP, HTTP, load balancing clusters, server load balancing, and firewalls
- Understanding of automation practices throughout the development, build, and deployment phases of the application life-cycle
- Demonstrated ability to support and administer high-volume pre-release and production environments
- Experience with one or more Unix shell scripting languages (Bash, C Shell, Z Shell, etc.)
- Experience with one or more structured programming languages; Golang preferred
- Experience with build management and continuous integration tools; Tekton preferred
- Understanding of revision control and continuous integration best practices
- Experience using an operational ticketing system, such as Jira, OTRS, or ServiceNow, to record changes and work history details
- Experience with cloud services (Amazon EC2/S3, OpenStack), elastic capacity administration, and cloud deployment and administration tools