Staff Site Reliability Engineer
Location: Pleasanton, CA
10x Genomics is building tools for scientific discovery that reveal and address the true complexities of biology and disease. Through a combination of novel microfluidics, chemistry and bioinformatics, our award-winning Chromium™ System is enabling researchers around the world to more fully understand the fundamentals of biology at unprecedented resolution and scale. Learn more at 10xGenomics.com.
Fueled by equal parts scientific vision and determined passion, we are delivering unprecedented innovation to short-read sequencing technologies and transforming how genomic information is accessed. You will feel the 10x difference the moment you enter our offices and labs. There’s a dynamic energy here, and we’re looking for the best of the best to be a part of it. We are seeking talented professionals excited to build new technology that advances scientific research while growing their careers within a dynamic, supportive environment.
We are looking for an exceptional site reliability engineer with a solid understanding of Linux and distributed computing to join our team. Our multi-disciplinary team in microfluidics, biochemistry, mechanical engineering, computational biology, and software has a proven track record of delivering successful commercial products built on deep technological innovation. If you are a self-starter who is passionate about building and operating reliable, scalable and performant systems, and is excited to work in a highly collaborative environment alongside a diverse team of experts every day, join us at 10x Genomics.
Responsibilities
- Lead a team of SREs to build, deploy and maintain resilient and scalable high-performance computing (HPC) systems and services on premises and in the cloud.
- Scale systems and improve operational efficiency through extensive automation.
- Collaborate with the software engineering team on continuous delivery and deployment.
- Monitor infrastructure and applications for uptime and resource utilization, identify performance bottlenecks, troubleshoot and mitigate system issues, and develop solutions to improve reliability and performance.
- Maintain detailed documentation of system build and operational procedures.
- Participate in on-call rotations.
Required Skills and Background
- Bachelor’s degree in Computer Science or a related field, or equivalent work experience.
- 10+ years of Linux systems engineering or development experience in a large-scale environment.
- 5+ years of experience in an SRE-type role.
- Extensive experience with orchestration and configuration management tools (e.g. Terraform, CloudFormation, Ansible, Puppet, Chef).
- Knowledge of Linux kernel tuning, networking and performance optimization.
- Proficiency in shell scripting and at least one other language, e.g. Python.
- Experience with AWS services and infrastructure design.
- Strong desire to learn and implement new technologies.
- Excellent written and verbal communication skills.
Desired Skills and Background
- Experience managing multi-petabyte-scale network-attached storage (NAS) and operational knowledge of the NFS protocol.
- Familiarity with HPC workload managers such as SGE or Slurm.
All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, protected veteran status, or on the basis of disability.