Nvidia

Site Reliability Engineer (Santa Clara, CA, USA) (Remote in US)

Nvidia | Full Time | Santa Clara, CA, USA | California, United States | January 14, 2025 | Engineering - Information Technology

Job Overview

Job Opportunity: Site Reliability Engineer at NVIDIA

Are you ready to elevate your career with a role at one of the world’s leading technology companies? NVIDIA is looking for a highly skilled Site Reliability Engineer (SRE) to join our innovative team. This position offers the flexibility to work from our Santa Clara, CA office or remotely from anywhere in the United States. If you’re passionate about building robust systems and maintaining high-performance production environments, this is your chance to shine.

Job Description

As an SRE at NVIDIA, you will design, build, and maintain large-scale production systems, ensuring optimal efficiency and availability. By combining software and systems engineering practices, you’ll play a critical role in supporting our GPU cloud services. This includes maintaining system reliability, enabling smooth deployment processes, and driving automation to enhance system performance. NVIDIA’s SRE culture fosters diversity, intellectual curiosity, and a proactive approach to solving challenges, ensuring a dynamic and supportive work environment.

Responsibilities

Design and support the operational aspects of a large-scale Observability & Telemetry platform with a focus on real-time monitoring, logging, and alerting.
Manage the entire service lifecycle, from inception and design to deployment and operation.
Conduct system design consulting, develop tools for capacity management, and perform launch reviews.
Monitor and enhance the availability, latency, and health of live services.
Scale systems sustainably through automation and advocate for improvements in reliability and velocity.
Lead incident response and conduct blameless postmortems to identify root causes.
Participate in an on-call rotation to ensure seamless system support.

Requirements

Bachelor’s degree in Computer Science or a related technical field (or equivalent experience).
5+ years of experience in infrastructure automation, distributed systems design, and operating large-scale cloud systems in production.
Proficiency in at least one programming language: Python, Go, Perl, or Ruby.
Strong expertise in Linux, networking, and containers.
Proven experience delivering foundational infrastructure and observability platforms.

Preferred Qualifications

Passion for analyzing and fixing large-scale distributed systems.
A systematic problem-solving approach paired with excellent communication skills.
Experience with Kubernetes, OpenStack, and Docker.
Familiarity with tools such as Grafana, OpenTelemetry, and Prometheus.

Benefits and Compensation

Competitive base salary ranging from $144,000 to $270,250 USD, depending on location, experience, and benchmarks for similar roles.
Eligibility for equity and comprehensive benefits.
A collaborative and mentorship-focused work environment designed to foster professional growth.

More Information

Address: Santa Clara, CA, USA
Salary Offers: $144,000 to $270,250 USD, depending on location, experience, and benchmarks for similar roles.
Experience Level: Senior
Total Years Experience: 5-10