Lead Site Reliability Engineer at Lightbend
San Francisco, CA, US

We are looking for an experienced Lead Site Reliability Engineer to join our Cloud Services team. This individual will be a key part of the operational and development team for our new Cloud offerings, and should have a deep background in software and systems engineering.


Develop and extend software to monitor and improve end-to-end platform performance, identify runtime deficiencies, find potential failures, and fix production issues in a fully managed Cloud environment.

Build deep, full-stack knowledge of our platforms and applications. 

Work to simplify and automate deployment processes and runtime operations, and to provide non-disruptive releases.

Help create and maintain an environment that provides security and privacy for our customers’ data.

Maintain application reliability and uptime SLAs throughout the application lifecycle using programmatic self-healing and software automation.

Travel occasionally to meet with the rest of Lightbend’s technical team.

Candidates can live anywhere in North America, as this is a fully remote position. This is much more than a support position: we are looking for an operations expert to be a part of building and running our new offerings.

You...



Are an experienced SRE, with the ability and drive to build and lead an SRE team and capability for Lightbend.

Have a passion for automating the complexities of orchestrating and running multi-tenant cloud application services.

Are accustomed to collaborating with business owners and understanding diverse business requirements.

Have two or more years of experience in distributed systems architecture and runtime requirements.

Are a voracious learner, ready to take on new technologies and techniques quickly and constantly.

Have excellent written and verbal communication skills, at least in English.

Are skillful at interacting and working with people, including working within a self-organized, lean, and agile team to mitigate project risks, manage effort, and ensure quality.

Are dedicated to best practices such as 100% source control, automated testing, code reviews, continuous integration, and continuous deployment.

Are biased towards action on tough problems and issues, and focused on your customer’s success.

Are an agent of change, constantly learning and seeking better outcomes.

Are familiar with many of the supporting technologies we use, including monitoring, logging, Kubernetes, and service mesh frameworks.

Are experienced with complex and secure networking environments, including TLS.

Ideally, you also...

Have experience with Google’s Cloud service offerings (GCP, GKE, and related services), specifically from an operational perspective.

Have supported SaaS/PaaS systems.

Have an awareness of serverless/Functions-as-a-Service platforms.

Are familiar with streaming data technologies, such as Spark, Flink, and Kafka.

What we offer:

Lightbend is a welcoming, transparent, and highly distributed company dedicated to creating high-performance systems that bring success to all who use them. With a strong focus on work-life balance, our company offers a fast-paced, collaborative environment mixed with challenging and engaging work. This combination has attracted and retained some of the brightest minds in our technology communities.