Nvidia

Senior Production Engineer – Storage (Santa Clara, CA, USA)

Job Overview

Join NVIDIA as a Site Reliability Engineer (SRE)

Are you ready to take your engineering skills to the next level? NVIDIA, a world leader in AI computing, is seeking a Site Reliability Engineer (SRE) to join our team and revolutionize GPU cloud services. This role offers a unique opportunity to ensure the reliability and performance of large-scale production systems while supporting cutting-edge AI/ML workloads. Join us in shaping the future of technology!


Job Description

As a Site Reliability Engineer at NVIDIA, you will play a crucial role in designing, building, and maintaining our highly efficient and available production systems. You’ll collaborate with a diverse team of experts, foster a culture of innovation, and tackle complex challenges using automation, performance tuning, and proactive problem-solving.

Your work will directly impact the stability and scalability of our GPU cloud services, helping us deliver on our promise of reliability and uptime to both internal and external users.


Responsibilities

  • Design, implement, and support large-scale storage clusters, including monitoring, logging, and alerting systems.
  • Collaborate with AI/ML teams to analyze and optimize complex workflows in large-scale clusters.
  • Improve the lifecycle of services from inception to refinement.
  • Support live services through system design consulting, framework development, and capacity management.
  • Measure and monitor system health, leveraging machine learning models for insights.
  • Scale and evolve systems using automation and AI/ML methodologies.
  • Lead incident response and conduct blameless postmortems.
  • Participate in an on-call rotation to ensure production system reliability.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience.
  • 5+ years of hands-on experience in site reliability engineering or similar roles.
  • Proficiency in algorithms, data structures, and software design, with experience in managing large-scale Linux-based systems.
  • Coding skills in languages such as C/C++, Java, Python, Go, Perl, or Ruby.
  • Knowledge of AI/ML frameworks and methodologies.
  • Familiarity with infrastructure configuration tools like Ansible, Chef, Puppet, or Terraform.
  • Experience with observability and tracing tools (InfluxDB, Prometheus, Elastic Stack).

Ways to Stand Out:

  • Proven SRE mindset with a customer-first approach.
  • Experience in CI/CD pipelines, Git, and code review processes.
  • Strong debugging skills and a systematic problem-solving approach.
  • Expertise in managing large-scale cloud systems using Kubernetes, OpenStack, and Docker.
  • Adaptability to diverse working styles and thriving in collaborative environments.

Benefits

  • Competitive base salary: $148,000 – $339,250 USD, determined by location and experience.
  • Equity and comprehensive benefits package.
  • Opportunities for growth and mentorship in a dynamic, supportive environment.
  • A chance to work with some of the most innovative minds in the technology industry.

How to Apply

Ready to join a world-class team at NVIDIA? Apply now and take the first step in your journey to redefine the future of technology. NVIDIA accepts applications on an ongoing basis. Don’t miss this opportunity to grow your career in a collaborative, inclusive, and innovative environment.


Nvidia

The way it's meant to be played
Company Information
  • Total Jobs: 16
  • Category: IT
  • Location: California
  • Full Address: Santa Clara, California, United States
  • Company Size: > 2000 employees
  • CEO: Jensen Huang
