Site Reliability Engineer
Location: Pleasanton, CA
We are looking for an exceptional site reliability engineer with a solid understanding of Linux and distributed computing to join our team. Our multi-disciplinary team in microfluidics, biochemistry, mechanical engineering, computational biology, and software has a proven track record of delivering successful commercial products built on deep technological innovation. If you are a self-starter who is passionate about building and operating reliable, scalable and performant systems, and is excited to work in a highly collaborative environment alongside a diverse team of experts every day, join us at 10x Genomics.
Lead a team of SREs to design, build and maintain resilient and scalable Linux high-performance computing (HPC) systems and storage on premises and in the cloud.
Automate the deployment, operations, and monitoring of infrastructure.
Monitor infrastructure and applications for uptime and resource utilization, identify performance bottlenecks, troubleshoot system issues, and develop solutions to improve reliability and performance.
Scale systems and improve operational efficiency.
Maintain detailed documentation of system build and operational procedures.
Off-hours support may occasionally be required.
Required Skills and Background
Bachelor’s degree in Computer Science or a related field, or equivalent work experience.
10+ years of Linux systems engineering experience in a large-scale environment.
5+ years of experience in an SRE-type role.
Extensive experience with automation, provisioning and configuration management tools (e.g. Ansible, Puppet, Chef).
Knowledge of Linux kernel tuning, networking and performance optimization, with the ability to dive deep into code.
Software engineering experience with proficiency in one or more of the following: Go, Python, or shell scripting.
Experience with IaaS, e.g. AWS.
Strong desire to learn and implement new technologies.
Excellent written and verbal communication skills.
Desired Skills and Background
Experience managing multi-petabyte-scale network-attached storage (NAS) and operational knowledge of the NFS protocol.
Familiarity with HPC workload managers such as SGE or Slurm.
Working knowledge of LDAP and Active Directory.
All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, protected veteran status, or on the basis of disability.