Lead Site Reliability Engineer at Lightbend
San Francisco, CA, US

We are looking for an experienced Lead Site Reliability Engineer to join our Cloud Services team. This individual will be a key part of the operational and development team for our new Cloud offerings, and should have a deep background in software and systems engineering.

You will...


  • Develop and extend software to monitor and improve end-to-end platform performance, identify runtime deficiencies, find potential failures, and fix production issues in a fully managed Cloud environment.
  • Build deep, full-stack knowledge of our platforms and applications.
  • Simplify and automate deployment processes and runtime operations, and provide non-disruptive releases.
  • Help create and maintain an environment that provides security and privacy for our customers' data.
  • Maintain application reliability and uptime SLAs throughout the application lifecycle using programmatic self-healing and software automation.
  • Travel occasionally to meet with the rest of Lightbend’s technical team.

Candidates can live anywhere in North America, as this is a fully remote position. This is much more than a support position: we are looking for an operations expert to be a part of building and running our new offerings.

You...



  • Are an experienced SRE, with the ability and drive to build and lead an SRE team and capability for Lightbend.
  • Have a passion for automating the complexities of orchestrating and running multi-tenant cloud application services.
  • Are accustomed to collaborating with business owners and understanding diverse business requirements.
  • Have two or more years of experience in distributed systems architecture and runtime requirements.
  • Are a voracious learner, ready to take on new technologies and techniques quickly and constantly.
  • Have excellent written and verbal communication skills in English, at a minimum.
  • Are skilled at working with people, including within a self-organized, lean, and agile team, to mitigate project risks, manage effort, and ensure quality.
  • Are dedicated to best practices such as 100% source control, automated testing, code reviews, continuous integration, and continuous deployment.
  • Are biased towards action on tough problems and issues, and focused on your customer’s success.
  • Are an agent of change, constantly learning and seeking better outcomes.
  • Are familiar with many of the supporting technologies we use, including monitoring, logging, Kubernetes, service mesh frameworks, and related technologies.
  • Are experienced with complex and secure networking environments, including TLS.

Ideally, you also...

  • Have experience with Google Cloud Platform (GCP), GKE, and related services, specifically from an operational perspective.
  • Have supported SaaS/PaaS systems.
  • Have an awareness of serverless/Functions-as-a-Service platforms.
  • Are familiar with streaming data technologies, such as Spark, Flink, and Kafka.

What we offer:

Lightbend is a welcoming, transparent, and highly distributed company dedicated to creating high-performance systems that bring success to all who use them. With a strong focus on work-life balance, our company offers a fast-paced, collaborative environment mixed with challenging and engaging work. This combination has attracted and retained some of the brightest minds in our technology communities.

Lightbend is an Equal Opportunity Employer.