Nvidia

Senior Production Engineer – Storage (Santa Clara, CA, USA)

Nvidia
Full Time
2788 San Tomas Expressway, Santa Clara, CA 95051, United States
January 5, 2025
Engineering - Information Technology - Networking

Job Overview

Join NVIDIA as a Site Reliability Engineer (SRE)

Are you ready to take your engineering skills to the next level? NVIDIA, a world leader in AI computing, is seeking a Site Reliability Engineer (SRE) to join our team and revolutionize GPU cloud services. This role offers a unique opportunity to ensure the reliability and performance of large-scale production systems while supporting cutting-edge AI/ML workloads. Join us in shaping the future of technology!

Job Description

As a Site Reliability Engineer at NVIDIA, you will play a crucial role in designing, building, and maintaining our highly efficient and available production systems. You’ll collaborate with a diverse team of experts, foster a culture of innovation, and tackle complex challenges using automation, performance tuning, and proactive problem-solving.

Your work will directly impact the stability and scalability of our GPU cloud services, helping us deliver on our promise of reliability and uptime to both internal and external users.

Responsibilities

Design, implement, and support large-scale storage clusters, including monitoring, logging, and alerting systems.
Collaborate with AI/ML teams to analyze and optimize complex workflows in large-scale clusters.
Improve the lifecycle of services from inception to refinement.
Support live services through system design consulting, framework development, and capacity management.
Measure and monitor system health, leveraging machine learning models for insights.
Scale and evolve systems using automation and AI/ML methodologies.
Lead incident response and conduct blameless postmortems.
Participate in an on-call rotation to ensure production system reliability.

Requirements

Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience.
5+ years of hands-on experience in site reliability engineering or similar roles.
Proficiency in algorithms, data structures, and software design, with experience in managing large-scale Linux-based systems.
Coding skills in languages such as C/C++, Java, Python, Go, Perl, or Ruby.
Knowledge of AI/ML frameworks and methodologies.
Familiarity with infrastructure configuration tools like Ansible, Chef, Puppet, or Terraform.
Experience with observability and tracing tools (InfluxDB, Prometheus, Elastic Stack).

Ways to Stand Out

Proven SRE mindset with a customer-first approach.
Experience in CI/CD pipelines, Git, and code review processes.
Strong debugging skills and a systematic problem-solving approach.
Expertise in managing large-scale cloud systems using Kubernetes, OpenStack, and Docker.
Adaptability to diverse working styles and thriving in collaborative environments.

Benefits

Competitive base salary: $148,000 – $339,250 USD, determined by location and experience.
Equity and comprehensive benefits package.
Opportunities for growth and mentorship in a dynamic, supportive environment.
A chance to work with some of the most innovative minds in the technology industry.

How to Apply

Ready to join a world-class team at NVIDIA? Apply now and take the first step in your journey to redefine the future of technology. NVIDIA accepts applications on an ongoing basis. Don’t miss this opportunity to grow your career in a collaborative, inclusive, and innovative environment.

More Information

Address: 2788 San Tomas Expressway, Santa Clara, CA, USA
Salary Offers: $148,000 – $339,250 USD, determined by location and experience.
Experience Level: Senior
Total Years Experience: 5-10