SREcon19 Americas has ended

Sunday, March 24
 

5:00pm

 
Monday, March 25
 

7:30am

7:45am

Continental Breakfast
Sponsored by Twitter

Sponsors
Twitter

At Twitter, our mission is to give everyone the power to create and share ideas and information instantly without barriers. Our Site Reliability Engineering team is critical to enabling this across the globe and we are looking to grow our team. So join us and #LoveWhereYouWork.


Monday March 25, 2019 7:45am - 8:45am
Grand Ballroom Foyer and Northside Ballroom

8:45am

Welcome and Opening Remarks
Monday March 25, 2019 8:45am - 9:00am
Grand Ballroom ABCD

9:00am

What Breaks Our Systems: A Taxonomy of Black Swans
Black swan events: unforeseen, unanticipated, and catastrophic issues. These are the incidents that take our systems down, hard, and keep them down for a long time.

By definition, you cannot predict true black swans. But black swans often fall into certain categories that we've seen before. This talk examines those categories and how we can harden our systems against these categories of events, which include unforeseen hard capacity limits, cascading failures, hidden system dependencies, and more.

Speakers
Laura Nolan

Slack
Laura Nolan is an SRE who believes in the power of checklists to help us tame complexity and chaos. She is one of the contributors to the books Site Reliability Engineering and Seeking SRE, both published by O'Reilly.


Monday March 25, 2019 9:00am - 9:30am
Grand Ballroom ABCD

9:30am

Complexity: The Crucial Ingredient in Your Kitchen
Software engineering is basically rocket science, so it comes as no surprise that we can learn a lot from that industry. For example, the Challenger explosion in 1986 is a fascinating subject for study. The details of the incident are well documented from a variety of angles (engineering, political, sociotechnical, ethnographical, etc) providing a rich dataset. Highlighting a few examples from this, we can empathize with the architecture considerations and organizational issues that engineers faced at NASA during that time. There are strong, informative parallels between the events that led up to that tragic incident and how software engineers think about reliability today. As Churchill allegedly quipped, "Never let a good crisis go to waste."

Speakers
Casey Rosenthal

Verica.io
CEO/Founder of Verica.io. Previously an engineering manager for the Traffic Engineering Team and the Chaos Engineering Team at Netflix. As an executive manager and senior architect, Casey has managed teams to tackle Big Data, architect solutions to difficult problems, and train others...


Monday March 25, 2019 9:30am - 10:00am
Grand Ballroom ABCD

10:00am

Break with Refreshments
Sponsored by PayPal

Sponsors
PayPal

Fueled by a fundamental belief that having access to financial services creates opportunity, PayPal is committed to democratizing financial services and empowering people and businesses to join and thrive in the global economy. Our open digital payments platform gives PayPal’s 218...


Monday March 25, 2019 10:00am - 10:30am
Grand Ballroom Foyer and Northside Ballroom

10:30am

Case Study: Implementing SLOs for a New Service
Implementing service level objectives (SLOs) effectively is a hard task, especially for a service which not only is new within your engineering and product organizations but also encompasses both a request-driven and a storage subsystem.

In this talk, I will discuss our experience defining and measuring service level indicators (SLIs) and objectives for our Ceph Object Storage service. I will describe our approach in specifying service level indicators plus the tradeoffs and implementation decisions we made when it came to measuring various types of SLIs, including availability, latency, and durability.

I will also share the lessons learned and benefits gained from our implementation. You will understand why SLOs are crucial for site reliability engineers and service users and will be given some tips on how to implement them for either a request-driven or a storage system.
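As a generic illustration of the SLI and SLO vocabulary used in this abstract (not the speaker's implementation), an availability SLI for a request-driven service is often computed as the fraction of good requests, and then compared against an SLO target. The function names and the sample numbers below are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully."""
    if total_requests == 0:
        # No traffic in the window: treat the service as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure_ratio = 1.0 - slo_target
    observed_failure_ratio = 1.0 - sli
    return 1.0 - observed_failure_ratio / allowed_failure_ratio


# Hypothetical window: 100,000 requests, 50 of them failed, 99.9% target.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
budget = error_budget_remaining(sli, slo_target=0.999)
```

Latency and durability SLIs follow the same pattern with a different definition of a "good" event.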

Speakers
Arnaud Lawson

Squarespace
Arnaud is a Senior Site Reliability Engineer at Squarespace in New York, where—among other things—he has led the productionization of Ceph as a storage backend used by many Squarespace services.


Monday March 25, 2019 10:30am - 11:00am
Grand Ballroom ABC

10:30am

Keeping the Balance: Internet-Scale Loadbalancing Demystified
Can you explain the entire path that an IP packet takes from your users to your binary? What about a web request? Do you understand the tradeoffs that different kinds of load balancing techniques make? If not, this talk is for you.

Load balancing is hard, and it is made up of many disparate technologies. It cuts across network, transport, and application layers. We'll describe different flavours of load balancing (network, naming, application) and how they are composed together by cloud providers and other large Internet companies to provide fast, reliable, multi-region services.

Speakers
Laura Nolan

Slack
Laura Nolan is an SRE who believes in the power of checklists to help us tame complexity and chaos. She is one of the contributors to the books Site Reliability Engineering and Seeking SRE, both published by O'Reilly.

Murali Suriar

Google
Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running cluster filesystem and locking services. Left Google to get on a boat. Got bored and came back.


Monday March 25, 2019 10:30am - 11:00am
Grand Ballroom D

11:05am

Fixing On-Call When Nobody Thinks It's (Too) Broken
What's a team to do when they receive more than 30 pages a day, every day, for almost a decade? Deny there's a problem, of course! Join me as we relive the data-informed journey from around 70,000 pages over 7 years (~200/week) to under 50/week in just a few short months, in a way that shows those carrying the pager that improvement is possible and empowers them to keep questioning and improving the status quo. We'll look at not only the technical challenges but also non-technical ones, like getting buy-in when nobody thinks there's a problem and managing risk when the on-call team is concerned about silencing legitimate pages along with the noise.

Speakers
Tony Lykke

Hudson River Trading
Tony is an SRE on the trade systems team at Hudson River Trading based in NYC, where he gets to tackle hard (often not just technically) automation problems and tech debt cleanup projects across a variety of environments. He is obsessively anti-toil, and regularly refuses to accept...


Monday March 25, 2019 11:05am - 11:35am
Grand Ballroom ABC

11:05am

Aperture: A Non-Cooperative, Client-Side Load Balancing Algorithm
Twitter's RPC framework, Finagle, employs non-cooperative, client-side load balancing. That is, clients make load balancing decisions independently. Although this architecture continues to serve Twitter well, it also comes with some unique trade-offs and challenges. In particular, it scales poorly as service clusters grow to thousands of instances. In this talk, we will dive deeper into the problem space and how we addressed it via an algorithm we call "Aperture."

Speakers
Ruben Oanta

Twitter
Ruben has been working on Twitter’s RPC stack for the past five years. In that time, he has made substantial contributions to both the design and implementation of Finagle which have markedly improved the resiliency and operability of Twitter services.


Monday March 25, 2019 11:05am - 11:35am
Grand Ballroom D

11:40am

Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value
How do you monitor systems that don't want to be monitored or ones that you don't have internal access to? Why monitor these systems at all? The United States Digital Service finds the truth and tells the truth, and fights fires across government, even when those fires don't want to be found. We put together a system to black box monitor all 25,000 .GOV domains and then expanded to perform more robust monitoring of important citizen-facing, government-provided services so we can go where the work is and restore services. In the process, we're hoping to change the culture and prove the value of SRE teams across government. This is how we're doing it.

Speakers
Aaron Wieczorek

United States Digital Service
Aaron is a Site Reliability Engineer at the United States Digital Service Headquarters team. He works on hard technical problems and hard bureaucratic problems, from infrastructure to CI/CD pipelines, to network engineering.


Monday March 25, 2019 11:40am - 12:10pm
Grand Ballroom ABC

11:40am

Capacity Prediction in External Services
Applications are often limited by resources in third-party external systems. As an SRE, I want to be able to predict when, and under what conditions, those resources will be exhausted to facilitate pre-emptive remedial actions and appropriate planning. In this talk, I will describe how we use linear regression analysis to generate a predictive model that empowers us to properly plan and size external services, as well as to adapt to changes in our application.
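To make the idea of regression-based capacity prediction concrete, here is a minimal sketch, assuming a single driver variable (application load) and a single external resource with a hard limit. It is a generic ordinary least squares fit, not the speaker's model; the function names, the sample observations, and the limit of 100 are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def load_at_limit(slope, intercept, limit):
    """Application load at which the fitted line reaches the resource limit."""
    if slope <= 0:
        return None  # usage does not grow with load, so no exhaustion point
    return (limit - intercept) / slope


# Hypothetical observations: requests per minute vs. connections in use
# on a third-party service that allows at most 100 connections.
requests_per_minute = [100, 200, 300, 400]
connections_in_use = [12, 22, 32, 42]

slope, intercept = fit_line(requests_per_minute, connections_in_use)
exhaustion_load = load_at_limit(slope, intercept, limit=100)
```

With these made-up numbers the fitted line predicts exhaustion at roughly 980 requests per minute. A real model would also need to check how well a straight line fits the data before trusting the extrapolation.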

Speakers
Jerome Kraus

Alaska Airlines
Jerome Kraus is a Senior Software Development Engineer/SRE with Alaska Airlines. He has 20 years of software development engineering experience and 15 years with Alaska Airlines in Seattle, WA. He has been practicing Site Reliability Engineering for the past three years assisting...


Monday March 25, 2019 11:40am - 12:10pm
Grand Ballroom D

12:10pm

Luncheon
Sponsored by NS1

Sponsors
NS1

NS1 is defining the future of application delivery and performance by converging real-time user, infrastructure and network data, enabling organizations to control their applications at the extreme edge. Our intelligent DNS & traffic management platform delivers the speed, performance...


Monday March 25, 2019 12:10pm - 1:40pm
Grand Ballroom EFGHI

1:40pm

Benefits of Taking the Less Traveled Road with Containers Infrastructure
After almost a year of running OpenShift Origin, we decided to migrate to a vanilla Kubernetes setup, and during this phase we had to make some hard decisions. This talk will explain the reasons for, the benefits of, and the technical details behind some decisions we made that were not mainstream at the time.

Speakers
Eduard Iacoboaia

Booking.com
I'm a Senior Systems Administrator working for more than 5 years at Booking.com. During the first years, I worked on several teams, some of them managing infra for more than a hundred services. I saw the need for a change in process and I'm happy to see our containers infrastructure...


Monday March 25, 2019 1:40pm - 1:55pm
Grand Ballroom D

1:40pm

How Did Things Go Right? Learning More from Incidents
Solely learning from failure isn't a fundamental—it's a limitation.

A look into the New View of Safety, Human & Organizational Performance, and Resilience Engineering shows us that safety, great performance, and sources of resilience do not come from the absence of failure but rather the presence of adaptive capacity.

Navigating a perfect storm in a world where availability is made up and the 9's don't matter requires expertise. This talk will describe more rewarding ways to approach incident investigation without overly focusing on failure prevention.

  • What's going on when it seems like nothing is happening?
  • When failure does occur, what's going to keep it from being worse?
  • How do teams adapt successfully when preventative techniques fail?
  • How should we prioritize the effort to develop systems that help us safely manage the consequences of failure?

These questions cannot be answered by trying to explain the causes of failure and fixing remediation items.

Speakers
Ryan Kitchens

Netflix
Ryan Kitchens is a Site Reliability Engineer on the Core team at Netflix where he works on building capacity across the organization to ensure its availability and reliability. Before that, Ryan was a founding member of the SRE team at Blizzard Entertainment.


Monday March 25, 2019 1:40pm - 2:10pm
Grand Ballroom ABC

1:55pm

The Ops in Serverless
In this talk, we will examine the increased need for specialized Operations Engineering in the Age of Serverless. We'll use the serverless platform to explore three critical areas of operational readiness: testing, monitoring, and debugging.

Speakers
Jennifer Davis

Senior Cloud Advocate, Microsoft
Jennifer Davis is a Senior Cloud Advocate at Microsoft. She is also the co-author of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms...


Monday March 25, 2019 1:55pm - 2:10pm
Grand Ballroom D

2:15pm

Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way
We will look at the process for Code Yellow, the term we use for this process of "righting the ship," and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, what steps were taken, and how the engineering organization as a whole supports the process.

Speakers
Michael Kehoe

Staff SRE, LinkedIn
Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles, and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automatio...

Todd Palino

LinkedIn
Todd Palino is a Senior Staff Engineer in Site Reliability at LinkedIn on the Capacity Engineering team, where his team is creating a framework for application capacity measurement, analysis, and change intelligence. Prior to that, he was responsible for architecture, day-to-day operations...


Monday March 25, 2019 2:15pm - 2:45pm
Grand Ballroom ABC

2:15pm

Testing in Production at Scale
Once frowned upon, testing in production has started to become a viable solution, especially in the microservices architecture. We present a case study of implementing testing in production at one such large-scale organization. This talk provides insights into real-world testing in production architecture. This is the talk for you if large-scale integration and load testing are on your mind.

Speakers
Amit Gud

Uber
Having worked for multiple companies in the storage and systems domain, from startups to multi-billion companies, Amit has a track record of tackling issues relating to performance and scalability. Amit has a master's degree from Kansas State University. He has worked on multiple research...


Monday March 25, 2019 2:15pm - 2:45pm
Grand Ballroom D

2:50pm

Creating a Code Review Culture
Code review is one of the best ways to keep code quality high, and for engineering teams to communicate their best practices and patterns. But how do organizations build a sustainable code review culture? This talk explores best practices for introducing code review to teams and looks at how to improve the code review process from the perspective of organizations, code authors, and code reviewers.

Speakers
Johnathan Turner

Squarespace
Johnathan Turner is a Site Reliability Engineer at Squarespace, where he works on tooling and processes that enable product engineers to care less about infrastructure. He spends his spare time playing guitar, reading comic books, listening to heavy metal, and thinking about how to...


Monday March 25, 2019 2:50pm - 3:20pm
Grand Ballroom ABC

2:50pm

Tackling Kafka, with a Small Team
This is a story about what happens when a distributed system becomes a big part of a small team's infrastructure. This distributed system was Kafka and the team size was one engineer. I will discuss my failures along with my journey of deploying Kafka at scale with very little prior distributed systems experience. This presentation will be a tactical approach to conquering a complex system with an understaffed team while your business is growing fast.

Speakers
Jaren Glover

Robinhood
Jaren Glover is an early engineer at Robinhood. He has spent the last 3 years scaling Robinhood's distributed systems to support its rapid customer growth. He also allocates a large percentage of his time scaling Robinhood's human capital via new hire mentoring and on-boarding...


Monday March 25, 2019 2:50pm - 3:20pm
Grand Ballroom D

3:20pm

Break with Refreshments
Sponsored by Bloomberg

Sponsors
Bloomberg

Bloomberg has built the world's most trusted information network for financial professionals. Our core product, the Bloomberg Terminal, is an independent and unbiased source of information for our clients – everyone from C-Suite executives, traders, analysts, government officials...


Monday March 25, 2019 3:20pm - 3:50pm
Grand Ballroom Foyer and Northside Ballroom

3:50pm

Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance
Do you maintain a Rube Goldberg-like service? Perhaps it's highly distributed? Or you recently walked onto a team with an unfamiliar codebase? Have you noticed your service responds slower than molasses? This talk walks you through how to pinpoint bottlenecks, along with approaches and tools for making improvements, and will make you seem like the hero! All in a day's work.

The talk will describe various ways of tracing a web service, including black box and white box tracing and tracing distributed systems, as well as various tools and external services available to measure performance. I also present a few different rabbit holes to dive into when trying to improve your service's performance.

Speakers
Lynn Root

Software Engineer, Spotify
Lynn Root is an SRE at Spotify in NYC, with historical issues of using her last name as her username, and the resident FOSS evangelist. She is also a global leader of PyLadies and former Vice Chair of the Python Software Foundation Board of Directors. When her hands are not on a keyboard...


Monday March 25, 2019 3:50pm - 4:20pm
Grand Ballroom ABC

3:50pm

SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager
SRE and product management—do those even go together? Yes! In this talk, we'll go over small ways and big strategies to form sustainable, impactful relationships with your users and build products that they love whether or not your SRE team has an official product manager. SRE teams' users are other engineers, data scientists, designers, and anyone else who pushes code at your company. It's not enough to build perfectly engineered platforms and tooling. SRE teams must build scalable, opinionated, USABLE products and workflows. This talk will give you the framework to get there.

Speakers
Jen Wohlner

product manager, platform engineering, Fastly
Jen Wohlner is a product manager for platform engineering at Fastly, an edge cloud platform that provides a content delivery network, Internet security products, load balancing, and video and streaming services for major companies across the globe. Previously, Jen worked as a senior...


Monday March 25, 2019 3:50pm - 4:20pm
Grand Ballroom D

4:25pm

Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest
Loggers and tracers have become crucial components of computing systems, providing invaluable visibility into the runtime behavior of our software. Ironically, these vital components are opaque when it comes to their own runtime behavior. We typically look at loggers only as suspects in performance-related incidents, as part of post-mortem analysis.

Why do we have such blind spots when it comes to components that are pervasively used in our systems? We explore possible explanations and present example solutions.

Speakers
Danny Chen

software engineer, Bloomberg LP
Danny Chen started his career almost 40 years ago as a UNIX performance engineer at Bell Laboratories where he was a co-developer of one of the first general-purpose UNIX kernel tracing facilities (USENIX/1988: CASPER the Friendly Daemon). He also contributed to the SVR4 virtual memory...


Monday March 25, 2019 4:25pm - 4:55pm
Grand Ballroom ABC

4:25pm

Shipping Software with an SRE Mindset
Most SRE techniques revolve around resiliency and reliability of service delivery. Most "product" is the type of product that is deployed, not shipped. At Circonus, we deal with a lot of on-premises software shipment due to hybrid customer requirements. It turns out that many SRE techniques can apply directly to the construction, packaging, and shipment of installed software as well. In this talk, we'll learn all about it.

Speakers
Theo Schlossnagle

Founder & CEO, Circonus
The Founder/CEO of Circonus, Theo Schlossnagle is a practicing software engineer and serial entrepreneur. At Johns Hopkins University he earned undergraduate and graduate degrees in computer science, with a focus on graphics and randomized algorithms in distributed systems. Theo founded...


Monday March 25, 2019 4:25pm - 4:55pm
Grand Ballroom D

5:00pm

Operating within Normal Parameters: Monitoring Kubernetes
After Kubernetes takes over your data centers, how can you be sure that it's operating within normal parameters? What does "normal" even mean? By formalizing your expected quality of service, you can measure and compare against known targets with open source tools like Prometheus. In this talk, we'll use Kubernetes as a case study for introducing service level objectives (SLOs) to guide monitoring efforts. Come learn the how and why of metric selection for monitoring Kubernetes quality of service, what gaps exist in the open source Kubernetes monitoring ecosystem, how to use Prometheus and its exporters to establish predictability and "normal" baselines, and how to use this telemetry to debug service degradations in a Kubernetes cluster.

Speakers
Elana Hashman

Two Sigma
Elana Hashman currently works as a Reliability Engineer at Two Sigma, wrangling Kubernetes clusters and automating operations. She is currently a member of the Kubernetes Instrumentation SIG, where she focuses on benchmarking and metrics usability. In the wider FOSS community, she...


Monday March 25, 2019 5:00pm - 5:30pm
Grand Ballroom ABC

5:00pm

Using PRDs and User Journeys to Design User-Friendly Tools
Implementing software is one core aspect of the SRE role. Often this software will be used by multiple teams. SREs need to make sure that what they build is easy to use and understandable by all users. Product Requirement Documents (PRDs) can help collect and prioritize requirements for tooling and other software. But how do you write a good PRD?

Speakers
Gwendolyn Stockman

Google
Gwendolyn Stockman has worked at Google since 2008, first as an SWE then as an SRE for the last 5 years. She is on the Customer Reliability Engineering team, which she joined after being in a similar group that works with teams within Google launching to production. Before helping...


Monday March 25, 2019 5:00pm - 5:30pm
Grand Ballroom D

5:30pm

Happy Hour
Monday March 25, 2019 5:30pm - 6:30pm
Grand Ballroom EFGHI
 
Tuesday, March 26
 

7:30am

8:00am

Continental Breakfast
Sponsored by Microsoft Azure

Sponsors
Microsoft Azure

Cloud for all. Microsoft Azure believes that all individuals and groups should be empowered with the full freedom and power of the cloud. The cloud should not be available to only an elite few. Azure offers the trust, transparency, and humanity that all companies need to navigate...


Tuesday March 26, 2019 8:00am - 9:00am
Grand Ballroom Foyer and Northside Ballroom

9:00am

Migrating a Monolith to the Cloud
After over a decade of hosting itself in the data center, Etsy.com moved to the Google Cloud Platform (GCP) in 2018. In this talk, I'll go over:

  • why the company decided to make the transition
  • our architectural approaches to migrating a large monolith, and the difficulties we faced gaining confidence in them
  • the assumptions we never knew about and had to fix: in the application code, the infrastructure tooling, and our processes
  • cutting over to GCP, safely
  • things we learnt running there for the last 9+ months

Speakers
Keyur Govande

Etsy
Keyur is the Chief Architect at Etsy. He has led multiple large architectural changes during his tenure, most recently the move to Google Cloud. Prior to this role, he was a key member of the Systems Engineering team helping scale the site and keeping PHP, Gearman, MySQL, Memcached...


Tuesday March 26, 2019 9:00am - 9:30am
Grand Ballroom D

9:00am

SRE Classroom - How to Design a Distributed System in 3 Hours
Participants in this workshop will learn principles of systems design, and work in small groups to apply the concepts to designing a distributed system. This workshop emphasizes design skills for the real world, including how to integrate third-party or Cloud-based software components into your own systems.

Speakers
Ryan Thomas

Google
Ryan is a Site Reliability Engineering Manager at Google Australia, and currently manages the Accelerated Storage SRE team. Ryan is passionate about the design, implementation, and operation of large-scale distributed systems, and sharing his experiences with anyone interested in...

JC van Winkel

Google
JC has been teaching UNIX and programming languages since 1992, working for AT Computing, a small courseware spin-off of the University of Nijmegen, the Netherlands. JC joined Google's Site Reliability Engineering team in 2010 and is both a founding member and lead educator of the...

Phillip Tischler

Google
Phillip Tischler is a Senior Software Engineer & Site Reliability Engineer at Google NYC. Phillip is currently SRE Tech Lead of ACL-d Search, which is search over data with permissions/sharing. Phillip also works on general indexing and search, aggregations, and low latency serving...

Jennifer Mace

SRE, Google
Macey is a Senior Site Reliability Engineer at Google Seattle, where she wrangles the world's largest fleet of Kubernetes clusters under the banner of GKE. Previously the tech lead of Display Ads SRE, she has contributed to the latest SRE Workbook on topics from Incident Management...


Tuesday March 26, 2019 9:00am - 12:30pm
Grand Ballroom ABC

9:30am

An Introduction to GraphQL
GraphQL is a data sharing schema from Facebook. This talk will introduce the schema, its common uses, and its pros and cons versus other data formats. Nat will also talk about some things to consider when using GraphQL in production, and about common problems people encounter while running GraphQL deployments and how to combat them.

Speakers
Nat Welch

Staff Site Reliability Engineer, Google
Nat Welch is an SRE based in Brooklyn, NY, and the author of "Real World SRE" from Packt Publishing. He currently works for Google on the Customer Reliability Engineering team. In the past, he has worked for First Look Media, Hillary for America, iFixit, and others.


Tuesday March 26, 2019 9:30am - 10:00am
Grand Ballroom D

10:00am

Service Discovery Challenges at Scale
We'll discuss the challenges one faces while building service discovery at a scale of millions of processes, tens of millions of clients, and tens of thousands of state changes per second.

Speakers
Ruslan Nigmatullin

Dropbox, Inc.
Ruslan Nigmatullin is a Software Engineer on the Traffic team at Dropbox. Before that, he was a Software Engineer on the Internal Components Team at Yandex.


Tuesday March 26, 2019 10:00am - 10:30am
Grand Ballroom D

10:30am

Break with Refreshments
Sponsored by Catchpoint

Tuesday March 26, 2019 10:30am - 11:00am
Grand Ballroom Foyer and Northside Ballroom

11:00am

Inside the Kube: A Guided Tour of Kubernetes Cluster Setup
A lot of SREs are (or will soon be) responsible for Kubernetes clusters. But what exactly makes up Kubernetes? This talk will dive into the services and systems that make a cluster work, how they interact, and what can go wrong. Kubernetes will no longer be a black box, but a system that can be debugged, reconfigured, and improved to suit every administrator's needs.


Speakers
Liz Frost

VMware
Liz Frost is a Kubernetes contributor and engineer at VMware, née Heptio. She is also a dog mom, queer woman, and occasionally a colorful pony.


Tuesday March 26, 2019 11:00am - 12:30pm
Grand Ballroom D

12:30pm

Luncheon
Sponsored by LaunchDarkly

Tuesday March 26, 2019 12:30pm - 2:00pm
Grand Ballroom EFGHI

2:00pm

What I Wish I Knew before Going On-call
Firefighting a broken system is time-sensitive and stressful but becomes even more challenging as teams and systems evolve. As an on-call engineer, scaling processes among humans is an important problem to solve. How do we ramp up new engineers effectively? How can we bring existing on-call engineers up to speed? In this workshop we'll share common myths among new on-call engineers and the Do's and Don'ts of on-call onboarding, as well as run through hands-on activities you can take back to work and apply directly to your own on-call processes.

Speakers
Chie Shu

Yelp
Chie Shu is a backend Software Engineer at Yelp. She has worked on improving Yelp's revenue-critical Ads data pipeline to be more resilient to system failures, and designed heuristics used internally by executives and Product Managers to assess the financial impact of on-call incidents...

Dorothy Jung

Software Engineer, Yelp
Dorothy Jung is a Software Engineer with multiple years of on-call experience. At Yelp she served as a "pushmaster", managing and monitoring company-wide deployments to production; and as a release engineering deputy, helping to set up CI/CD pipelines within the Ads organization...

Wenting Wang

Yelp
Wenting Wang is a Software Engineer with three years of industry experience. She has been on-call for different teams at Yelp: on the BizApp backend team, where she worked closely with mobile developers and monitored mobile user traffic; and on the Ads team, where she currently develops...


Tuesday March 26, 2019 2:00pm - 3:30pm
Grand Ballroom D

2:00pm

Getting Started with Observability Lab: Opentracing, Prometheus, and Jaeger
Building a cloud native organization without having a robust understanding of what your applications are doing in production is almost impossible. Exposure to these tools early will give engineers who are beginning to make the transformation in their organizations a greater understanding of the need for observability.

In this session, we will cover how to install the Prometheus Operator, Grafana, and Jaeger, and begin monitoring a live production Kubernetes cluster. We will then instrument an example application composed of Java and AngularJS microservices using technologies such as Opentracing and Micrometer.

We will show how developers can see transactions occurring between their applications and explore how these tools will help both developers and operations troubleshoot and diagnose issues. We will also cover how these tools can be leveraged to build alerts and deliver business intelligence.

Speakers

Kevin Crawley

Developer Relations, Instana
Kevin works as a developer evangelist for Instana, an APM and container monitoring service provider and speaks globally on topics including distributed computing, microservices, containers, monitoring, logging, deployment automation, observability, public speaking, alert fatigue... Read More →


Tuesday March 26, 2019 2:00pm - 5:30pm
Grand Ballroom ABC

3:30pm

Break with Refreshments
Sponsored by Circonus

Sponsors

Circonus

Analyze billions of metrics a second with the Circonus monitoring and analytics platform. Developed specifically for the requirements of DevOps, the Circonus platform delivers alerts, graphs, dashboards and machine-learning intelligence that help to optimize not just your operations... Read More →


Tuesday March 26, 2019 3:30pm - 4:00pm
Grand Ballroom Foyer and Northside Ballroom

4:00pm

Running Excellent Retrospectives: Talking for Humans
How many awful meetings have you been to in your life, where people are talking forever and saying nothing, or where people are talking at cross purposes and not listening, or where they're saying things that make everyone feel bad? Have you been in retrospectives like that? (Did it make you never want to attend a retrospective again?)

Let's do better! Come learn practical techniques for facilitating pleasant, productive, welcoming retrospectives (which will improve any meeting you need to run). We will talk about the structure of welcoming language and discuss when it's necessary to interrupt someone. We'll examine what it means for language to include blame and how to reframe blaming conversations. We'll practice the mental work of understanding things that seem contrafactual but are actually just confusing (especially helpful for discussing complex systems). When you leave, you'll be ready to make any meeting or retrospective you're in more comfortable and effective, as a leader or an attendee.

Speakers

Courtney Eckhardt

Heroku, a Salesforce Company
Courtney Eckhardt first got into retrospectives when she signed up for comp.risks as an undergrad (and since then, not as much has changed as we’d like to think). Her perspectives on engineering process improvement are strongly informed by the work of Kathy Sierra and Don Norman... Read More →

Lex Neva

SRE, Fastly
Lex has six years of experience keeping large services running, including Linden Lab's Second Life, Deviantart.com, Heroku, and his current position at Fastly. While originally trained in computer science, he's found that he most enjoys applying his software engineering skills to... Read More →


Tuesday March 26, 2019 4:00pm - 5:30pm
Grand Ballroom D

5:30pm

Reception
Sponsored by Packet

Sponsors

Packet

Founded in 2014, Packet’s proprietary technology automates physical servers and networks without the use of virtualization or multi-tenancy to provide on-demand compute and connectivity. Customers can either build on Packet’s public cloud service, deploy customized hardware in... Read More →


Tuesday March 26, 2019 5:30pm - 7:30pm
Grand Ballroom EFGHI

7:30pm

Lightning Talks
  • Livetweeting Tech Conferences
    Bridget Kromhout, Microsoft
  • 5 Insights from 200 SREs on How Incident Response Affects Them
    Jaime Woo, Dawn Parzych, Catchpoint
  • Distributed Systems Need Deadlines
    Paul Henry, Coinbase
  • Doughnut Dilemma: A Lesson in Resource Managers
    Ravi Lachhman, AppDynamics
  • Automating SRE Work: Focusing on High-Return Customer and Business Outcomes
    Aniket Kulkarni, PayPal
  • How and Why We Lowered Our SLO
    Deborah Wood, Pivotal
  • Durable Disorder
    Anthony Sandoval, GitLab Inc
  • The Operation Maturity Model
    Matthew Fornaciari, Gremlin, Inc.
  • "Monitoring and Alerting, Ain't Nobody Got Time for That": How USDS bootstrapped basic SRE best practices a week before launch at FEMA.
    David Holmes, USDS

Tuesday March 26, 2019 7:30pm - 9:00pm
Grand Ballroom D
 
Wednesday, March 27
 

8:00am

Continental Breakfast
Wednesday March 27, 2019 8:00am - 9:00am
Grand Ballroom Foyer

8:00am

Badge Pickup
Wednesday March 27, 2019 8:00am - 12:00pm
Grand Ballroom DE Foyer & Promenade

9:00am

Optimizing for Learning
The talk is about the most powerful observability system SREs have at their disposal: the human mind! I draw from cognitive science to discuss how we can improve how we learn and store information about our systems in our brains in order to respond better to incidents and anomalies. It's a talk broken into four parts: preparing to learn, gaining knowledge, building mental models, and enabling a team to learn well together.

Speakers

Logan McDonald

SRE, BuzzFeed
Logan is a security-focused Site Reliability Engineer at BuzzFeed, based in New York City. She is a maintainer of BuzzFeed's open source centralized sign-on platform, sso, and has written for dev.to and Increment Magazine. She is obsessed with learning, but especially with the learning... Read More →


Wednesday March 27, 2019 9:00am - 9:30am
Grand Ballroom ABC

9:00am

Scaling SRE Organizations: The Journey from 1 to Many Teams
In this talk, the author will share their experience starting new teams, splitting them, and moving them, from both technical and non-technical standpoints. This is ideal for new leaders in charge of SRE wondering when it's time to grow beyond a single team and how to do so. This is also very valuable for SREs who are interested to know what happens behind the scenes, how to influence such changes and how they can help while avoiding burnout.

Speakers

Gustavo Franco

Google
Gustavo Franco is a Customer Reliability Engineer at Google working on learning more about, helping to define, and expanding the reach of SRE. He's been at Google since 2007 and has started, moved and managed several SRE teams such as Google Plus Frontend, BreakFix, Horizon Web, Cluster... Read More →


Wednesday March 27, 2019 9:00am - 9:30am
Grand Ballroom D

9:35am

Zero to SRE
Being able to transform a junior engineer into an excellent mid, then senior engineer is a competitive advantage for any company. Unfortunately, there aren't many entry-level SRE job postings, and if your company hasn't hired juniors before, you'll need to make changes in order to create an environment where they can thrive.

This talk is the story of a junior Web Developer turned SRE. I've been able to successfully transition into my role because my company has embraced junior engineers by creating a 'Culture of Error,' encouraging all engineers to be mentors, and ensuring that all employees take time during the day to learn new skills.

By the end of this talk, I'll have shared the specific details, and you will have a roadmap for how to support junior SREs during their first day, month, 90 days and year.

Speakers

Kim Schlesinger

ReactiveOps & diversity
Kim Schlesinger is a Site Reliability Engineer at ReactiveOps. Prior to being an SRE, Kim was an Instructor, Web Developer, and Curriculum Designer for the Full-Stack Immersive Program at Galvanize, a code school based in Denver, Colorado. In her spare time, Kim is active in the Colorado... Read More →


Wednesday March 27, 2019 9:35am - 10:05am
Grand Ballroom ABC

9:35am

The Curse of SRE Autonomy and How to Manage It
Within an SRE organization, teams usually develop very different automation tools and processes for accomplishing similar tasks. Some of this can be explained by the software they support: different systems require different reliability solutions. But many SRE tasks are essentially the same across all software: compiling, building, deploying, canarying, load testing, managing traffic, monitoring, and so on.

There are two puzzles here: why does this diversity exist, and how can it be overcome so that SRE teams stop duplicating their development efforts?

This talk presents a solution to both puzzles using the ten-year history of a single SRE tool. The tool is used only internally at a large company. It is one of the rare tools there that has been adopted widely by our very large SRE organization.

Speakers

Richard Bondi

Google
Richard Bondi has been an engineer at Google since 2011, specializing in the entire web stack and working on travel applications. In 2016 he converted to SRE, and then joined the SRE tech writer team. Before Google, and after leaving his political philosophy PhD program to join the... Read More →


Wednesday March 27, 2019 9:35am - 10:05am
Grand Ballroom D

10:10am

One on One SRE
When Amy started at GitHub, support for SRE principles and technical solutions was well underway. What was missing was how to handle the human side: how can a group of individual contributors influence a company to prioritize reliability? To that end, she created the 1:1 SRE outreach and 1:1 incident debrief programs for the purpose of growing GitHub's culture of resilience by embracing the values of empathy and psychological safety. This talk will cover how the programs work, how they were launched, and real-world outcomes.

Speakers

Amy Tobey

Sr SRE, GitHub
Amy has worked in web operations for 20 years at companies of every size, touching everything from kernel code to user interfaces. When she's not working she can usually be found around her home in San Jose, caring for her family, practicing piano, or running slowly in the sun.


Wednesday March 27, 2019 10:10am - 10:40am
Grand Ballroom ABC

10:10am

Learning from Learnings: Anatomy of Three Incidents
The best response to a system outage is not "What did you do?", but "What did we learn?" This session will walk through three system-wide outages at Google, at Stitch Fix, and at WeWork—their incidents, aftermaths, and recoveries. In all cases, many things went right and a few went wrong; also in all cases, because of blameless cultures, we buckled down, learned a lot, and made substantial improvements in the systems for the future. Looking back with the perspective of 20-20 hindsight, all of these incidents were seminal events that changed the focus and trajectory of engineering at each organization. You will leave with a set of actionable suggestions in dealing with customers, engineering teams, and upper management. You will also enjoy a few war stories from the trenches, none of which has been previously told fully in public.

Speakers

Randy Shoup

WeWork
Over the past several decades, Randy Shoup has led high-performing engineering teams at eBay, Google, Stitch Fix, and WeWork. A long-time advocate of DevOps practices, Randy specializes in scaling engineering organizations, company cultures, and technology infrastructures. He is equally... Read More →


Wednesday March 27, 2019 10:10am - 10:40am
Grand Ballroom D

10:40am

Break with Refreshments
Sponsored by Two Sigma

Sponsors

Two Sigma

We are Two Sigma. We imagine breakthroughs in investment management, insurance and related fields by pushing the boundaries of what open source and proprietary technology can do. In the process, we work to help real people. Our engineers, data scientists and modelers harness data at... Read More →


Wednesday March 27, 2019 10:40am - 11:10am
Grand Ballroom Foyer

11:10am

Fault Tree Analysis Applied to Apache Kafka
At last year's SREcon, we were inspired by talks that introduced fault tree analysis. We decided to apply the technique to bulletproof our Apache Kafka deployments. In this talk, learn about fault tree analysis and what you should focus on to make your Apache Kafka clusters resilient.
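
For readers new to the technique, here is a minimal sketch of how a fault tree combines basic-event probabilities through AND and OR gates. It is not taken from the talk: the event names and probabilities are hypothetical, and the formulas assume the basic events are independent.

```python
from math import prod


def and_gate(*probabilities):
    """All inputs must fail: multiply the probabilities (assumes independence)."""
    return prod(probabilities)


def or_gate(*probabilities):
    """Any single input failing is enough: 1 minus the chance that none fail."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical basic-event probabilities over some fixed period.
replica_loss = 0.01          # one replica of a partition becomes unavailable
coordination_outage = 0.001  # the cluster's coordination service is down

# The partition is lost only if all three replicas fail together (AND),
# or if the coordination service is unavailable (OR with the branch above).
all_replicas_lost = and_gate(replica_loss, replica_loss, replica_loss)
partition_unavailable = or_gate(all_replicas_lost, coordination_outage)
```

With these made-up numbers the top event is dominated by the single coordination outage rather than by replica loss, which is the kind of result a fault tree is meant to surface when deciding where to focus.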

Speakers

Andrey Falko

Lyft
Andrey Falko is one of the first Reliability Software Engineers hired at Lyft, where he has been for seven months. He is currently focused on building and scaling reliable PubSub systems for Lyft's Data Platform. Prior to Lyft, Andrey worked at Salesforce for nine years where he... Read More →


Wednesday March 27, 2019 11:10am - 11:40am
Grand Ballroom ABC

11:10am

Sublinear Scaling in Practice: The 1k SRE Project
At Google, one of the primary objectives of SRE teams is sublinear scaling: the size and number of SRE teams should grow more slowly than the number of supported services. This talk will describe how one team has implemented this principle. Over the last 3 years, we have more than doubled our portfolio (from 187 to 431 supported services) without additional staffing, and we plan for continued growth up to 1000 services. We will review the extensive automation infrastructure that we have in place, describe ongoing projects (including automated incident handling), and discuss the changes we've made in how we approach SRE - moving away from service-specific production readiness reviews towards automated policy verification and service-agnostic consulting. Audience members will hear about a vision for the long-term role of SRE in large organizations, where sublinear scaling requires not just increasing automation but a cultural shift from providing service-specific expertise to mostly service-independent consulting.

Speakers

Nikolaus Rath

Google
Dr. Nikolaus Rath is a site reliability engineer working on Google's advertising services. Before joining Google, he worked on feedback control systems for magnetically confined plasmas. He is a maintainer of a number of open-source projects, including libfuse and S3QL.


Wednesday March 27, 2019 11:10am - 11:40am
Grand Ballroom D

11:45am

Strategies to Edit Production Data
At some point, we all find ourselves at a SQL prompt making edits to the production database. We know it’s a bad practice and we always intend to put in place safer infrastructure before we need to do it again—what does a better system actually look like?

This talk progresses through 5 strategies for teams using a Python stack to do SQL writes against a database, to achieve increasing safety and auditability:

  1. Develop a process for raw SQL edits
  2. Run scripts locally
  3. Run scripts on an existing server
  4. Use a task runner
  5. Build a Script Runner service

We’ll talk about the pros and cons of each strategy and help you determine which one is right for your specific needs.

By the end of this talk you’ll be ready to start upgrading your infrastructure for making changes to your production database safely!
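
As a rough illustration of moving from an ad hoc SQL prompt to a script, the sketch below wraps a single write in a transaction, checks the affected row count against what the operator expected, and defaults to a dry run. It is not from the talk: it uses Python's built-in `sqlite3` module as a stand-in for a production database, and the `run_edit` function, the `users` table, and the `expected_rows` guard are all hypothetical.

```python
import sqlite3


def run_edit(conn, sql, params, expected_rows, dry_run=True):
    """Run one write statement in a transaction.

    Rolls back if the statement touches an unexpected number of rows,
    or if dry_run is True; commits only when both checks pass.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        affected = cur.rowcount
        if affected != expected_rows:
            raise RuntimeError(
                f"expected {expected_rows} rows, statement touched {affected}"
            )
        if dry_run:
            conn.rollback()
        else:
            conn.commit()
        return affected
    except Exception:
        # Any failure, including the row-count mismatch, undoes the write.
        conn.rollback()
        raise


def demo():
    """Apply the same edit as a dry run, then for real, on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO users (email) VALUES (?)",
        [("a@example.com",), ("b@example.com",)],
    )
    conn.commit()

    sql = "UPDATE users SET email = ? WHERE id = ?"
    query = "SELECT email FROM users WHERE id = 1"

    run_edit(conn, sql, ("new@example.com", 1), expected_rows=1, dry_run=True)
    after_dry_run = conn.execute(query).fetchone()[0]

    run_edit(conn, sql, ("new@example.com", 1), expected_rows=1, dry_run=False)
    after_commit = conn.execute(query).fetchone()[0]
    return conn, after_dry_run, after_commit
```

The same function could be called from a task runner or a small service; the point of the sketch is only that the safety checks live in reviewed code rather than in the operator's memory.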

Speakers

Julie Qiu

Google


Wednesday March 27, 2019 11:45am - 12:15pm
Grand Ballroom ABC

11:45am

Pragmatic Automation
Automation is great, but how do you know when the right thing to do is to stop writing it? How do you take on complex automation projects of unknown scope and deliver impact incrementally?

This talk explores lessons learned in the automation space at a large public cloud provider that are applicable to anyone looking for new ideas to reduce toil in their day-to-day work.

Speakers

Max Luebbe

Google
Max has been an SRE at Google since 2009, having spent most of that time working in Storage Infrastructure. More recently he was on the teams that externalized Bigtable and Spanner as GCP Products and currently leads the effort to deploy new Google Cloud Regions all over the glob... Read More →


Wednesday March 27, 2019 11:45am - 12:15pm
Grand Ballroom D

12:20pm

Madaari: Ordering for the Monkeys
Lineage Driven Fault Injection (LDFI) is a state-of-the-art technique in chaos engineering experiment selection. As SREs, we would like to perform chaos experiments that reveal the bugs that the customers are most likely to hit first. In this talk, we present new improvements to LDFI that order the experiment suggestions.

In the first half of the talk we will introduce LDFI as a technique that can be widely used within an enterprise. We also highlight how ordering is a general-purpose technique that we can use to encode the peculiarities of a heterogeneous microservices architecture. LDFI can work in an enterprise by harnessing the observability infrastructure to model the redundancy of the system.

Next, we present experiments conducted within eBay using ordered LDFI and some preliminary results. We show examples of services where we discovered bugs, and how carefully controlling the order of experiments allowed LDFI to avoid running unnecessary experiments.
We will discuss open problems and future directions of LDFI.

Key takeaways:
  1. Understand how LDFI can be integrated in the enterprise by harnessing the observability infrastructure
  2. Limitations of LDFI w.r.t. unordered solutions and why ordering matters for chaos experiments
  3. Preliminary results of prioritized LDFI and a future direction for the community

No prior knowledge of LDFI is required.

Speakers

Ashutosh Raina

SRE, eBay
Ashutosh is a member of the Site Reliability team at eBay focussed on bringing LDFI to the enterprise. He works at the intersection of academia and industry, trying his best to fuse them together. Previously, Ashutosh was a graduate student at UCSC working at Disorderly Labs making... Read More →

Ramprasad Ellupuru

eBay
Ramprasad is a member of the Site Reliability team at eBay working on making checkout highly reliable and available. He is an experienced developer and a new practitioner of chaos engineering at eBay.


Wednesday March 27, 2019 12:20pm - 12:50pm
Grand Ballroom ABC

12:20pm

Differences in SRE Implementations across Companies
With the popularity of "SRE" as a job role, people have become aware that not all such roles are entirely equivalent. There's been a Slack channel on the USENIX-SREcon workspace (https://usenix.org/srecon/slack #sre_between_companies) where people have started to explore these distinctions.
This session will be an opportunity to crowd-source more information. It will be a moderated, audience-driven session. Come and tell us what SRE means at your company!

Speakers

Kurt Andersen

Liaison, LinkedIn
Kurt Andersen has been one of the co-chairs for SREcon Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for... Read More →


Wednesday March 27, 2019 12:20pm - 12:50pm
Grand Ballroom D

12:50pm

Luncheon
Sponsored by Google

Sponsors

Google

Google is a global technology leader focused on improving the ways people connect with information. Google's innovations in web search and advertising have made its website a top internet property and its brand one of the most recognized in the world. For more information, visit... Read More →


Wednesday March 27, 2019 12:50pm - 2:20pm
Grand Ballroom EFGHI

2:20pm

Latency SLOs Done Right
Median, average, 90th, 99th percentile. We've all seen these metrics on our monitoring systems, both open source and from commercial vendors, but often they are used incorrectly when constructing Service Level Objectives. This session will show three different approaches to correctly calculating latency SLOs, and how histograms can be used to calculate mathematically correct quantiles and set SLOs based on those.
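
One way to see why histograms help: percentiles computed separately per host or per time window cannot simply be averaged, whereas bucket counts can be summed and the quantile derived afterwards. The sketch below illustrates that idea only; it is not the speaker's method. The bucket boundaries, function names, and sample latencies are hypothetical, and the quantile returned is a bucket upper bound rather than an exact value.

```python
from collections import Counter

# Hypothetical bucket upper bounds in milliseconds.
BUCKETS_MS = [50, 100, 250, 500, 1000, float("inf")]


def to_histogram(latencies_ms):
    """Count samples per bucket, keyed by the bucket's upper bound."""
    hist = Counter()
    for value in latencies_ms:
        for bound in BUCKETS_MS:
            if value <= bound:
                hist[bound] += 1
                break
    return hist


def merge(*histograms):
    """Histograms with identical buckets merge by summing their counts."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


def fraction_within(hist, threshold_ms):
    """Share of requests at or below threshold_ms (must be a bucket bound)."""
    total = sum(hist.values())
    good = sum(count for bound, count in hist.items() if bound <= threshold_ms)
    return good / total


def quantile_upper_bound(hist, q):
    """Smallest bucket bound that covers at least a fraction q of the samples."""
    total = sum(hist.values())
    running = 0
    for bound in BUCKETS_MS:
        running += hist.get(bound, 0)
        if running >= q * total:
            return bound
    return BUCKETS_MS[-1]


host_a = to_histogram([20, 40, 60, 80, 120])
host_b = to_histogram([30, 300, 700, 900])
merged = merge(host_a, host_b)
```

An SLO such as "95% of requests complete within 250 ms" can then be checked by comparing `fraction_within(merged, 250)` with 0.95 over the whole fleet and window, instead of combining per-host percentiles.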

Speakers

Fred Moyer

Developer Evangelist, Circonus Inc
Fred works as a Developer Evangelist at Circonus, where he likes to do fun mathy things with lots of data. He is a recovering Perl programmer and likes to hack in Go and occasionally C. Fred likes to ride his bike and spend time with his family when he's not hacking away. Fred is... Read More →


Wednesday March 27, 2019 2:20pm - 2:50pm
Grand Ballroom ABC

2:20pm

Automating the Management of the Operational Health of Cloud Accounts at Scale
In a large scale environment where engineers are empowered to independently deliver an application from concept to working production system, and in public cloud providers that allow access to do almost anything, there is a unique challenge of implementing and maintaining controls that align with tight banking regulations. I will discuss how we've used a combination of open source tools and our custom automation to solve various challenges such as:

  • Limiting public access
  • Staying ahead of account resource limits
  • Enforcing resource ownership
  • Cost control
  • Security patching
  • Account-impacting mistakes

Speakers

Jamie Walls

Capital One
Jamie has experience in operations and on feature delivery teams and brings an understanding of the balance between high operational quality and time to market. He understands the value in "Shift Left" operational testing and validation where a focus on simplifying and automating... Read More →


Wednesday March 27, 2019 2:20pm - 2:50pm
Grand Ballroom D

2:55pm

Extending the Error Budget Model to Security and Feature Freshness
Everyone knows about error budgets (most every SRE at this conference, anyway) and how to use them to manage availability.

But what about operations outcomes beyond availability, like _security_ and _feature freshness_? In this talk, we'll describe how to apply the error budget model to measure and improve security and feature value and mitigate the risk of change. And we'll give you the tools to brag about your success.
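
As a reminder of the availability arithmetic the talk builds on, here is a minimal sketch of an error budget calculation. The function names and the numbers in the usage note are hypothetical, and how the speakers map the model onto security or feature freshness is not shown here.

```python
def error_budget(slo_target, total_events):
    """Number of bad events the SLO tolerates over the measurement window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * total_events


def budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the budget still unspent; negative once the SLO is blown."""
    budget = error_budget(slo_target, total_events)
    return (budget - bad_events) / budget
```

For example, a 99.9% target over 1,000,000 requests allows roughly 1,000 failed requests, so 250 failures would leave about three quarters of the budget.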

Speakers

Jim Thomson

Pivotal
Jim is a Product Lead at Pivotal Cloud R+D, and loves to bring product-thinking into an operations world. While he's more into dogs, he shares a love of dad-jokes with David.

David Laing

Pivotal
David is an Engineering Lead at Pivotal Cloud R+D. He previously ran CloudOpsEU - the team that keeps Pivotal Tracker's foundation available, secure, and feature-fresh. He is particularly fond of cats and dad-jokes.


Wednesday March 27, 2019 2:55pm - 3:25pm
Grand Ballroom ABC

2:55pm

Designing Resilient Data Pipelines
There are a number of questions that plague any operator of a complex data pipeline. How do I quickly recover from failures in my pipeline? How do I know that the data I generate is accurate? How do I minimize the risk associated with updating my pipeline? Designing your data pipeline with resiliency and observability in mind will help to answer these questions. In this talk, I will present several strategies that my team has adopted for reducing operational complexity, risk associated with updates, and concerns about accuracy of data pipelines.

Speakers

Andrew Bolin

Two Sigma Investments, LP
Andrew Bolin is a Reliability Engineer at Two Sigma Investments where he is responsible for the design and operation of data pipelines critical to the firm's research environment. Before his current role, Andrew worked on the team responsible for the development of Two Sigma's open... Read More →


Wednesday March 27, 2019 2:55pm - 3:25pm
Grand Ballroom D

3:30pm

You Don't Have to Love Your Job
"Do what you love, and you'll never work another day in your life." -- someone who's never had a job

We're often told that we need to love our jobs—which sounds great on paper. But if everyone did what they loved, the world would only have astronaut pilots and pony huggers. Feeling we need to love our jobs pushes an imposter syndrome myth and makes great employees feel like they're not doing the right thing.

You don't have to love your job, you just need to like it!

Speakers

Leslie Carr

Quip
Leslie Carr is an Engineering Manager at Quip. Leslie transformed from a productive engineer into a pointy-haired manager while at Clover Health. In her past life, Leslie worked at Cumulus Networks in DevOps, helping to push automation in the network world. Prior to that, she... Read More →


Wednesday March 27, 2019 3:30pm - 3:45pm
Grand Ballroom ABC

3:30pm

From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services
Artificial intelligence is all around us, from the digital assistants in our microwaves to the apps we rely on every day. Many of these systems build on APIs and services that use machine learning to provide key features. This talk will describe techniques for building predictable, reliable ML-based services as well as ways to sustain these services through social and technical change. We discuss challenges unique to the reliability of these systems and relate our experiences with ML in our production systems to illustrate our techniques.

Speakers

Salim Virji

Google LLC
Salim Virji is a Site Reliability Engineer at Google, where he has worked on distributed compute, consensus, and storage systems.

Carlos Villavieja

Google LLC
Carlos Villavieja is a Computer Architect/Researcher working as a Software/Site Reliability Engineer at Google. He works on Storage optimizations and his interests vary from micro-architecture to machine learning.


Wednesday March 27, 2019 3:30pm - 4:00pm
Grand Ballroom D

3:45pm

Mindfulness in SRE: Monitoring and Alerting for One's Self
As SREs, we are all permanently on-call for our own well-being. Without proper monitoring and alerting about what's going on in our body, mind, and surroundings, we're likely to fall short of our own expectations regarding stress management, work-life balance, social interactions, and risk management. This talk provides an illustrative definition of mindfulness, provides practical examples of its usefulness in SRE, and builds the concept of self-monitoring that can improve performance both on the job and on the street.

Speakers

Tommy Lutz

Google
Tommy Lutz is an SRE manager at Google and a former engineering manager at Bloomberg Tradebook. The SRE team he serves supports Google's archival storage systems. Tommy is known for commuting to Google NYC by folding boat and bicycle on the Hudson River. The long float down the river... Read More →


Wednesday March 27, 2019 3:45pm - 4:00pm
Grand Ballroom ABC

4:00pm

Break with Refreshments
Wednesday March 27, 2019 4:00pm - 4:30pm
Grand Ballroom Foyer

4:30pm

Resilience Engineering Mythbusting
How confident are you in your prod servers staying up without your help? Too often in tech we mistakenly interchange three important concepts when describing our socio-technical systems: how resilient they are, the reliability they exhibit in day to day work, and how robust they are under duress. Though interrelated, they are not equivalent.

How can we successfully gain insights in post-incident reviews, execute chaos engineering experiments, and build scalable infrastructure if we're misinterpreting our approaches? By separating out these core concepts, we can isolate better approaches in adapting to unforeseen circumstances. We'll look at common misconceptions when describing our systems as resilient and focus on proven methods to help us improve our understanding of our systems.

Speakers

Will Gallego

Software Engineer, Fastly
Will Gallego is a systems engineer with 15+ years of experience in the web development field, currently as a Senior Software Engineer at Fastly. Comfortable with several parts of the stack, he focuses now on building scalable, distributed backend systems and tools to help engineers... Read More →


Wednesday March 27, 2019 4:30pm - 5:00pm
Grand Ballroom ABCD

5:00pm

Why Are Distributed Systems So Hard?
Distributed systems are known for being notoriously difficult to wrangle. But why? This talk will cover a brief history of distributed databases, clear up some common myths about the CAP theorem, dig into why network partitions are inevitable, and close out by highlighting how a few popular consensus algorithms mitigate the risks of operating in a distributed fashion and the importance of considering human factors to fully understand the systems we build. Almost all slides will contain original illustrations featuring mischievous cats masquerading as sysadmins. By the end of this talk you will have a better understanding of the design trade-offs involved in architecting for distributed systems, and hopefully, be inspired to start doodling tech concepts!

Speakers

Denise Yu

Senior Software Engineer, Pivotal
Denise is a Senior Software Engineer at Pivotal Cloud Foundry (PCF) in Toronto. In her time at Pivotal she has worked on a variety of open source and enterprise products, served briefly as the Product Manager of Pivotal's On-Demand Service Broker SDK, then moved across an ocean to... Read More →


Wednesday March 27, 2019 5:00pm - 5:30pm
Grand Ballroom ABCD

5:30pm

Closing Remarks
Wednesday March 27, 2019 5:30pm - 5:45pm
Grand Ballroom ABCD