Nvidia

Site Reliability Engineer (Santa Clara, CA, USA) (Remote in US)

Job Overview

Job Opportunity: Site Reliability Engineer at NVIDIA

Are you ready to elevate your career with a role at one of the world’s leading technology companies? NVIDIA is looking for a highly skilled Site Reliability Engineer (SRE) to join our innovative team. This position offers the flexibility to work from our Santa Clara, CA office or remotely from anywhere in the United States. If you’re passionate about building robust systems and maintaining high-performance production environments, this is your chance to shine.

Job Description

As an SRE at NVIDIA, you will design, build, and maintain large-scale production systems, ensuring optimal efficiency and availability. By combining software and systems engineering practices, you’ll play a critical role in supporting our GPU cloud services. This includes maintaining system reliability, enabling smooth deployment processes, and driving automation to enhance system performance. NVIDIA’s SRE culture fosters diversity, intellectual curiosity, and a proactive approach to solving challenges, ensuring a dynamic and supportive work environment.

Responsibilities

  • Design and support the operational aspects of a large-scale Observability & Telemetry platform with a focus on real-time monitoring, logging, and alerting.
  • Manage the entire service lifecycle, from inception and design to deployment and operation.
  • Conduct system design consulting, develop tools for capacity management, and perform launch reviews.
  • Monitor and enhance the availability, latency, and health of live services.
  • Scale systems sustainably through automation and advocate for improvements in reliability and velocity.
  • Lead incident response and conduct blameless postmortems to identify root causes.
  • Participate in an on-call rotation to ensure seamless system support.

Requirements

  • Bachelor’s degree in Computer Science or a related technical field (or equivalent experience).
  • 5+ years of experience in infrastructure automation, distributed systems design, and operating large-scale cloud systems in production.
  • Proficiency in at least one programming language: Python, Go, Perl, or Ruby.
  • Strong expertise in Linux, networking, and containers.
  • Proven experience delivering foundational infrastructure and observability platforms.

Preferred Qualifications

  • Passion for analyzing and fixing large-scale distributed systems.
  • A systematic problem-solving approach paired with excellent communication skills.
  • Experience with Kubernetes, OpenStack, and Docker.
  • Familiarity with tools such as Grafana, OpenTelemetry, and Prometheus.

Benefits and Compensation

  • Competitive base salary ranging from $144,000 to $270,250 USD, depending on location, experience, and benchmarks for similar roles.
  • Eligibility for equity and comprehensive benefits.
  • A collaborative and mentorship-focused work environment designed to foster professional growth.

Nvidia

The way it's meant to be played
Company Information
  • Total Jobs: 16
  • Category: IT
  • Location: California
  • Full Address: Santa Clara, California, United States
  • Company Size: > 2000 employees
  • CEO: Jensen Huang

Explore thousands of job offers from leading companies across the USA. Start your career journey today with Joblya!