The SRE Resources

ALL SRE resources in a single page!

Books

Free

Book Title Description Link
Building Secure & Reliable Systems Best Practices for Designing, Implementing, and Maintaining Systems. Read online
Site Reliability Engineering How Google Runs Production Systems. Read online
The Site Reliability Workbook Practical Ways to Implement SRE. Read online

Paid

Book Title Description Link
Becoming a Rockstar SRE Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems. Amazon
Implementing Service Level Objectives A Practical Guide to SLIs, SLOs, and Error Budgets. Amazon
Observability Engineering Achieving Production Excellence. Amazon
Seeking SRE Conversations About Running Production Systems at Scale. Amazon

Conferences

SRE themed

Conference Characteristics Organizer Link
SREday In-person, multiple cities, and quarterly Harness Site
SREcon EMEA In-person, variable city, and yearly USENIX Site
SREcon Americas In-person, variable city, and yearly USENIX Site

Technology themed

Conference Characteristics Organizer Link
KubeCon + CloudNativeCon China In-person, variable city, and yearly LF/CNCF Site
KubeCon + CloudNativeCon Europe In-person, variable city, and yearly LF/CNCF Site
KubeCon + CloudNativeCon India In-person, variable city, and yearly LF/CNCF Site
KubeCon + CloudNativeCon North America In-person, variable city, and yearly LF/CNCF Site
KubeCon + CloudNativeCon Japan In-person, variable city, and yearly LF/CNCF Site

Education

Learning (Free)

Training Title Type Description Provider Link
Getting Started with OpenTofu (LFEL1009) Course This foundational course is designed for any learner familiar with how to open a command line interface. The Linux Foundation Page
DevOps Engineer, SRE Learning Path Learning Path This learning path guides you through a curated collection of on-demand courses, labs, and skill badges that provide you with real-world, hands-on experience using Google Cloud technologies essential to the DevOps Engineer/SRE role. Google Page
Exploring GraphQL: A Query Language for APIs (LFS141x) Course This course is for both management and technical teams involved in the building and management of websites. Before enrolling you should be familiar with web architecture, such as clients and servers and web development concepts such as caching, HTTP requests, and build-time. It is helpful to have some general knowledge about how websites get information from servers, but it is not required. The Linux Foundation Page
Introduction to AI/ML Toolkits with Kubeflow (LFS147) Course This course is designed for developers, engineers, data scientists or anyone interested in understanding the anatomy of a machine learning tool kit that harnesses the power of Kubernetes. The Linux Foundation Page
Introduction to Backstage: Developer Portals Made Easy (LFS142) Course This course is designed for DevOps engineers and professionals interested in or working in Developer Productivity or Developer Experience teams. To make the most of this course, you should be familiar with source control systems and repositories and have basic knowledge of GitHub and JavaScript (especially React and Node.js). For learners using Windows, knowing how to install PostgreSQL locally is a plus. The Linux Foundation Page
Introduction to Cilium (LFS146) Course This course is designed for application developers, systems operators, and security professionals with an interest in learning how to use Cilium to better connect, observe, and secure Kubernetes. Learners should be familiar with basic Kubernetes concepts, Kubernetes operations and the kubectl tool. The Linux Foundation Page
Introduction to GitOps (LFS169) Course This course is for software developers interested in learning how to easily deploy their cloud native applications to Kubernetes; quality assurance engineers interested in understanding what a continuous delivery pipeline on Kubernetes looks like with GitOps; site reliability engineers looking for a simple, easy and secure solution to set up automated and continuous applications, infrastructure, and policy rollouts with an ability to do quick roll backs when needed; and anyone looking to understand the landscape of GitOps and learn how to choose and implement the right tools. The Linux Foundation Page
Introduction to Istio (LFS144) Course This course is intended for application developers, systems operators, and security professionals who already have familiarity and experience with Kubernetes and who wish to take their first steps towards learning and understanding Istio. The Linux Foundation Page
Introduction to Jenkins (LFS167) Course This course is for teams considering using Jenkins as a CI/CD tool and looking to automate their software delivery process, as well as those who need guidelines on how to set up a CI/CD workflow using the Jenkins automation server. The Linux Foundation Page
Introduction to Linux (LFS101) Course This Introduction to Linux course is designed for experienced computer users who have limited or no previous exposure to Linux, whether they are working in an individual or enterprise environment. The Linux Foundation Page
Introduction to Kubernetes (LFS158) Course This course is for teams considering or beginning to use Kubernetes for container orchestration who need guidelines on how to start transforming their organization with Kubernetes and cloud native patterns. Some knowledge of Linux system administration is helpful but not required. The Linux Foundation Page
Introduction to Kubernetes on Edge with k3s (LFS156x) Course This course is designed for those interested in learning more about Kubernetes, as well as in deploying applications or embedded sensors in edge locations. While learners do not need a Kubernetes certification for this course, experience with a Linux operating system and shell scripting will be beneficial. Programming experience is also not strictly required. Learners will need to be able to run Docker on their computer. The Linux Foundation Page
Introduction to Node.js (LFW111) Course This course is designed for frontend or backend developers who would like to become more familiar with the fundamentals of Node.js and its most common use cases. Before enrolling, students should know how to use a command line terminal, and have some familiarity with JavaScript. The Linux Foundation Page
Introduction to Serverless on Kubernetes (LFS157) Course The course is designed for developers and IT operators interested in exploring new approaches for building software, who prefer being able to set their own limits when it comes to things such as timeouts and choice of programming languages. Before enrolling, students should have an understanding of cloud and container technologies – including Kubernetes – as well as experience with Python. The Linux Foundation Page
Introduction to DevOps and Site Reliability Engineering (LFS162) Course If you are a manager looking for guidelines on how to start transforming organizations, and understand where to start, this course is for you. If you aspire to make a career in the world of DevOps and Site Reliability Engineering, this course is your starting point. The Linux Foundation Page
Introduction to Site Reliability Engineering (SRE) Learning Path Gain a basic understanding of Site Reliability Engineering (SRE). Microsoft Learn Page
OpenAPI Fundamentals (LFEL1011) Course This course is for technical professionals, such as software developers, who want to learn more about how to describe their APIs using OpenAPI and about the benefits that flow from doing so. The Linux Foundation Page
Practical Observability Course Join Knox Lively, recovering DevOps Engineer and Lead Tech Evangelist at Observe, Inc., alongside his trusty side-kick robot, o11y, as they guide you through the vast and sometimes confusing universe of observability. o11y Academy Page

Accreditation

Name Type Issuer Level Link
IBM Certified Professional SRE - Cloud v2 IBM Cloud focused IBM Professional Page
Site Reliability Engineering (SRE) Foundation Vendor agnostic DevOps Institute Aspiring Page
Site Reliability Engineering (SRE) Practitioner Vendor agnostic DevOps Institute Associate Page

Reports & Guides

Document Title Description Link
The SRE Report 2025 Now in its seventh year, Catchpoint's annual SRE Report is considered the trusted resource for catalyzing innovative business conversations and infusing IT practitioner experiences into professional research. Catchpoint
Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program Providing training and education for Site Reliability Engineers is universally important to set them up for success in your organization. Google

Tools

All OSS tools and technologies for SREs!

Automation

Tool name Type Description Main Features Link
Ansible Infrastructure configuration website
Chef Infrastructure configuration website
Helm K8s configuration website
Puppet Infrastructure configuration website
Terraform Infrastructure provisioning, IaC website

Container

Tool name Type Description Main Features Link
Docker Container runtime and management website
Kanivete K8s troubleshooting repo
Kubernetes Pod orchestration website
Podman Container runtime and management website

DevOps

Network

Observability

Tool name Type Description Main Features Link
Grafana Monitoring and observability stack Grafana, Grafana Loki, Grafana Mimir, and Grafana Tempo [+] AI/ML; [+] APM; [+] metrics; [+] events; [+] logs; [+] traces; [+] service levels; [+] visualization. website
OpenTelemetry Monitoring platform A.k.a. OTel [-] AI/ML; [+] APM; [+] metrics; [-] events; [-] logs; [+] traces; [+] service levels; [+] visualization. website
Prometheus Monitoring platform [-] AI/ML; [+] APM; [+] metrics; [+] events; [-] logs; [+] traces; [+] service levels; [+] visualization. website

Platform

Security