Find your dream job as Senior Site Reliability Engineer - Platform Engineering (Team Lead) at Red Hat Software, working remotely

Red Hat Software

About the job:

As a Senior Site Reliability Engineer (SRE) in this team, you will be responsible for creating and supporting tools, infrastructure, and processes that allow us to develop, test, deploy, and operate our services in production. You will be working at the intersection of DevOps and SRE, where applications meet infrastructure. In this role, you will have the opportunity to leverage your technical skills in systems management, software development, databases, and leadership to provide best-practices guidance and tier-3 support for a new business we are launching, and to lead the effort to build and sustain a high-performance SRE team.

We are open to hiring someone remotely and our preferred locations are Germany, Belgium, Czech Republic, France, Poland, Ireland, Spain, Israel, the US, and Canada.

More information about the project:

https://www.redhat.com/en/about/press-releases/red-hat-introduces-red-hat-trusted-software-supply-chain

What you will do:

Work closely with SRE and software engineering teams to design and implement scalable and high-performance solutions for our cloud services and internal development tools
Help define the production deployment architecture
Deploy new clusters and configure them using infrastructure-as-code and GitOps tooling
Manage the scaling of existing clusters
Design, implement, test, and use production backup and recovery systems
Design, implement, test, and use workload migration systems
Drive automation of application deployment for production and pre-release environments
Train development teams on how to use the automation tools to deploy their own applications
Design, implement, and manage continuous integration, build management, and deployment scripts and systems
Provide troubleshooting and break-fix support for production services
Quickly and efficiently troubleshoot simple and complex issues in order to provide outstanding support for internal service level objectives
Identify areas for process and efficiency improvement; recommend solutions and assist in overseeing implementation. Actively facilitate continuous improvement
Ensure all necessary operational processes and procedures are carried out with a high level of attention to detail, expediency, and on-time delivery
Document run books and standard operating procedures
Create and maintain system information and architecture diagrams
Create customer self-help documentation
Monitor system capacity and health indicators and trends; provide analytics and forecasts for adding or reducing capacity as required
Train development teams on how to set up their own logging, metrics, monitoring, and alerting using our operational toolchain, according to established best practices
Implement automated incident resolution solutions

What you will bring:

Willingness to do SRE work, including PagerDuty on-call rotations
Experience with deploying resources using public cloud providers
Extensive Kubernetes experience; OpenShift experience is preferred
Fluency with the Kubernetes CLI, web console, and YAML configuration files
Experience with GitOps and/or infrastructure-as-code required; Argo CD preferred

The following is considered a plus:

Degree in Computer Science
Experience with Site Reliability Engineering
Experience with Kustomize
Experience with configuration and change management
Understanding of TCP/IP, HTTP, load balancing clusters, server load balancing, and firewalls
Understanding of automation practices throughout the development, build, and deployment phases of the application life-cycle
Demonstrated ability to support and administer high-volume pre-release and production environments
Experience with one or more Unix shell scripting languages (Bash, C Shell, Z Shell, etc.)
Experience with one or more structured programming languages; Golang preferred
Experience with build management and continuous integration tools; Tekton preferred
Understanding of revision control and continuous integration best practices
Experience using an operational ticketing system, such as Jira, OTRS, or ServiceNow, to record changes and work history details
Experience with cloud services (Amazon EC2/S3, OpenStack), elastic capacity administration, and cloud deployment and administration tools