SREcon19 Americas has ended

Track 2
Monday, March 25


Keeping the Balance: Internet-Scale Loadbalancing Demystified
Can you explain the entire path that an IP packet takes from your users to your binary? What about a web request? Do you understand the tradeoffs that different kinds of load balancing techniques make? If not, this talk is for you.

Load balancing is hard, and it is made up of many disparate technologies. It cuts across network, transport, and application layers. We'll describe different flavours of load balancing (network, naming, application) and how they are composed together by cloud providers and other large Internet companies to provide fast, reliable, multi-region services.


Laura Nolan

Laura Nolan is an SRE who believes in the power of checklists to help us tame complexity and chaos. She is one of the contributors to the books Site Reliability Engineering and Seeking SRE, both published by O'Reilly.

Murali Suriar

Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running cluster filesystem and locking services. Left Google to get on a boat. Got bored and came back.

Monday March 25, 2019 10:30am - 11:00am
Grand Ballroom D


Aperture: A Non-Cooperative, Client-Side Load Balancing Algorithm
Twitter's RPC framework, Finagle, employs non-cooperative, client-side load balancing. That is, clients make load balancing decisions independently. Although this architecture continues to serve Twitter well, it also comes with some unique trade-offs and challenges. In particular, it scales poorly as service clusters grow to thousands of instances. In this talk, we will dive deeper into the problem space and how we addressed it via an algorithm we call "Aperture."


Ruben Oanta

Ruben has been working on Twitter’s RPC stack for the past five years. In that time, he has made substantial contributions to both the design and implementation of Finagle, which have markedly improved the resiliency and operability of Twitter services.

Monday March 25, 2019 11:05am - 11:35am
Grand Ballroom D


Capacity Prediction in External Services
Applications are often limited by resources in third-party external systems. As an SRE, I want to be able to predict when, and under what conditions, those resources will be exhausted, so that we can take pre-emptive remedial action and plan appropriately. In this talk, I will describe how we use linear regression analysis to generate a predictive model that empowers us to properly plan and size external services and to adapt to changes in our application.


Jerome Kraus

Alaska Airlines
Jerome Kraus is a Senior Software Development Engineer/SRE with Alaska Airlines. He has 20 years of software development engineering experience and 15 years with Alaska Airlines in Seattle, WA. He has been practicing Site Reliability Engineering for the past three years assisting...

Monday March 25, 2019 11:40am - 12:10pm
Grand Ballroom D


Benefits of Taking the Less Traveled Road with Containers Infrastructure
After almost a year of running OpenShift Origin, we decided to migrate to a vanilla Kubernetes setup, and during this phase we had to make some hard decisions. This talk will explain the reasons for, the benefits of, and the technical details of some decisions we made that were not mainstream at the time.


Eduard Iacoboaia

I'm a Senior Systems Administrator who has worked at Booking.com for more than 5 years. During the first years, I worked on several teams, some of them managing infra for more than a hundred services. I saw the need for a change in process and I'm happy to see our containers infrastructure...

Monday March 25, 2019 1:40pm - 1:55pm
Grand Ballroom D


The Ops in Serverless
In this talk, we will examine the increased need for specialized Operations Engineering in the Age of Serverless. We'll use the serverless platform to explore three critical areas of operational readiness: testing, monitoring, and debugging.


Jennifer Davis

Senior Cloud Advocate, Microsoft
Jennifer Davis is the co-author of Effective DevOps. She is a senior cloud advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms...

Monday March 25, 2019 1:55pm - 2:10pm
Grand Ballroom D


Testing in Production at Scale
Once frowned upon, testing in production has become a viable approach, especially in microservices architectures. We present a case study of implementing testing in production at a large-scale organization. This talk provides insights into a real-world testing-in-production architecture. This is the talk for you if large-scale integration and load testing are on your mind.


Amit Gud

Having worked for multiple companies in the storage and systems domain, from startups to multi-billion-dollar companies, Amit has a track record of tackling issues relating to performance and scalability. Amit has a master's degree from Kansas State University. He has worked on multiple research...

Monday March 25, 2019 2:15pm - 2:45pm
Grand Ballroom D


Tackling Kafka, with a Small Team
This is a story about what happens when a distributed system becomes a big part of a small team's infrastructure. This distributed system was Kafka and the team size was one engineer. I will discuss my failures along with my journey of deploying Kafka at scale with very little prior distributed systems experience. This presentation will be a tactical approach to conquering a complex system with an understaffed team while your business is growing fast.


Jaren Glover

Jaren Glover is an early engineer at Robinhood. He has spent the last 3 years scaling Robinhood's distributed systems to support its rapid customer growth. He also allocates a large percentage of his time to scaling Robinhood's human capital via new hire mentoring and on-boarding...

Monday March 25, 2019 2:50pm - 3:20pm
Grand Ballroom D


SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager
SRE and product management—do those even go together? Yes! In this talk, we'll go over small ways and big strategies to form sustainable, impactful relationships with your users and build products that they love whether or not your SRE team has an official product manager. SRE teams' users are other engineers, data scientists, designers, and anyone else who pushes code at your company. It's not enough to build perfectly engineered platforms and tooling. SRE teams must build scalable, opinionated, USABLE products and workflows. This talk will give you the framework to get there.


Jen Wohlner

product manager, platform engineering, Livepeer
Jen Wohlner leads product management at Livepeer, a decentralized video transcoding and live-streaming platform built on the Ethereum blockchain. Before Livepeer, Jen was the product manager for platform engineering at Fastly, an edge cloud platform that provides a content delivery...

Monday March 25, 2019 3:50pm - 4:20pm
Grand Ballroom D


Shipping Software with an SRE Mindset
Most SRE techniques revolve around resiliency and reliability of service delivery. Most "product" is the type of product that is deployed, not shipped. At Circonus, we deal with a lot of on-premise software shipment due to hybrid customer requirements. It turns out that many SRE techniques can apply directly to the construction, packaging, and shipment of installed software as well. In this talk, we'll learn all about it.


Theo Schlossnagle

Founder & CEO, Circonus
The Founder/CEO of Circonus, Theo Schlossnagle is a practicing software engineer and serial entrepreneur. At Johns Hopkins University he earned undergraduate and graduate degrees in computer science, with a focus on graphics and randomized algorithms in distributed systems. Theo founded...

Monday March 25, 2019 4:25pm - 4:55pm
Grand Ballroom D


Using PRDs and User Journeys to Design User-Friendly Tools
Implementing software is one core aspect of the SRE role. Often this software will be used by multiple teams. SREs need to make sure that what they build is easy to use and understandable by all users. Product Requirement Documents (PRDs) can help collect and prioritize requirements for tooling and other software. But how do you write a good PRD?


Gwendolyn Stockman

Gwendolyn Stockman has worked at Google since 2008, first as an SWE, then as an SRE for the last 5 years. She is on the Customer Reliability Engineering team, which she joined after being in a similar group that works with teams within Google launching to production. Before helping...

Monday March 25, 2019 5:00pm - 5:30pm
Grand Ballroom D
Tuesday, March 26


Migrating a Monolith to the Cloud
After over a decade of hosting itself in the data center, Etsy.com moved to the Google Cloud Platform (GCP) in 2018. In this talk, I'll go over:

  • why the company decided to make the transition
  • our architectural approaches to migrating a large monolith, and the difficulties we faced gaining confidence in them
  • the assumptions we never knew about and had to fix: in the application code, the infrastructure tooling, and our processes
  • cutting over to GCP, safely
  • things we learnt running there for the last 9+ months


Keyur Govande

Keyur is the Chief Architect at Etsy. He has led multiple large architectural changes during his tenure, most recently the move to Google Cloud. Prior to this role, he was a key member of the Systems Engineering team helping scale the site and keeping PHP, Gearman, MySQL, Memcached...

Tuesday March 26, 2019 9:00am - 9:30am
Grand Ballroom D


An Introduction to GraphQL
GraphQL is a data sharing schema from Facebook. This talk will introduce the schema, its common uses, and its pros and cons versus other data formats. Nat will also talk about some things to consider when using GraphQL in production, the common problems people encounter while running GraphQL deployments, and how to combat those issues.


Nat Welch

Staff Site Reliability Engineer, Google
Nat Welch is an SRE based in Brooklyn, NY, and the author of "Real World SRE" from Packt Publishing. He currently works for Google on the Customer Reliability Engineering team. In the past, he has worked for First Look Media, Hillary for America, iFixit, and others.

Tuesday March 26, 2019 9:30am - 10:00am
Grand Ballroom D


Service Discovery Challenges at Scale
We'll discuss the challenges one faces when building service discovery at a scale of millions of processes, tens of millions of clients, and tens of thousands of state changes per second.


Ruslan Nigmatullin

Dropbox, Inc.
Ruslan Nigmatullin is a Software Engineer on the Traffic team at Dropbox. Before that, he was a Software Engineer on the Internal Components Team at Yandex.

Tuesday March 26, 2019 10:00am - 10:30am
Grand Ballroom D


Inside the Kube: A Guided Tour of Kubernetes Cluster Setup
A lot of SREs are (or will soon be) responsible for Kubernetes clusters. But what exactly makes up Kubernetes? This talk will dive into the services and systems that make a cluster work, how they interact, and what can go wrong. Kubernetes will no longer be a black box, but a system that can be debugged, reconfigured, and improved to suit every administrator's needs.


Liz Frost

Member of Technical Staff, VMware
Liz Frost is a Kubernetes contributor and engineer at VMware, née Heptio. She is also a dog mom, queer woman, and occasionally a colorful pony.

Tuesday March 26, 2019 11:00am - 12:30pm
Grand Ballroom D


What I Wish I Knew before Going On-call
Firefighting a broken system is time-sensitive and stressful but becomes even more challenging as teams and systems evolve. As an on-call engineer, scaling processes among humans is an important problem to solve. How do we ramp up new engineers effectively? How can we bring existing on-call engineers up to speed? In this workshop we'll share common myths among new on-call engineers and the Do's and Don'ts of on-call onboarding, as well as run through hands-on activities you can take back to work and apply directly to your own on-call processes.


Chie Shu

Chie Shu is a backend Software Engineer at Yelp. She has worked on improving Yelp's revenue-critical Ads data pipeline to be more resilient to system failures, and designed heuristics used internally by executives and Product Managers to assess the financial impact of on-call incidents...

Dorothy Jung

Software Engineer, Yelp
Dorothy Jung is a backend Software Engineer at Yelp. At Yelp she served as a "pushmaster", managing and monitoring company-wide deployments to production; and as a release engineering deputy, helping to set up CI/CD pipelines within the Ads organization. She was previously at DreamWorks...

Wenting Wang

Wenting Wang is a Software Engineer with three years of industry experience. She has been on-call for different teams at Yelp: on the BizApp backend team, where she worked closely with mobile developers and monitored mobile user traffic; and on the Ads team, where she currently develops...

Tuesday March 26, 2019 2:00pm - 3:30pm
Grand Ballroom D


Running Excellent Retrospectives: Talking for Humans
How many awful meetings have you been to in your life, where people are talking forever and saying nothing, or where people are talking at cross purposes and not listening, or where they're saying things that make everyone feel bad? Have you been in retrospectives like that? (Did it make you never want to attend a retrospective again?)

Let's do better! Come learn practical techniques for facilitating pleasant, productive, welcoming retrospectives (which will improve any meeting you need to run). We will talk about the structure of welcoming language and discuss when it's necessary to interrupt someone. We'll examine what it means for language to include blame and how to reframe blaming conversations. We'll practice the mental work of understanding things that seem contrafactual but are actually just confusing (especially helpful for discussing complex systems). When you leave, you'll be ready to make any meeting or retrospective you're in more comfortable and effective, as a leader or an attendee.


Courtney Eckhardt

Heroku, a Salesforce Company
Courtney Eckhardt first got into retrospectives when she signed up for comp.risks as an undergrad (and since then, not as much has changed as we’d like to think). Her perspectives on engineering process improvement are strongly informed by the work of Kathy Sierra and Don Norman...

Lex Neva

SRE, Fastly
Lex has six years of experience keeping large services running, including Linden Lab's Second Life, Deviantart.com, Heroku, and his current position at Fastly. While originally trained in computer science, he's found that he most enjoys applying his software engineering skills to...

Tuesday March 26, 2019 4:00pm - 5:30pm
Grand Ballroom D
Wednesday, March 27


Scaling SRE Organizations: The Journey from 1 to Many Teams
In this talk, the author will share their experience starting new teams, splitting them, and moving them, from both technical and non-technical standpoints. This is ideal for new leaders in charge of SRE wondering when it's time to grow beyond a single team and how to do so. This is also very valuable for SREs who are interested in knowing what happens behind the scenes, how to influence such changes, and how they can help while avoiding burnout.


Gustavo Franco

Gustavo Franco is a Customer Reliability Engineer at Google working to learn more about, help define, and expand the reach of SRE. He's been at Google since 2007 and has started, moved, and managed several SRE teams such as Google Plus Frontend, BreakFix, Horizon Web, Cluster...

Wednesday March 27, 2019 9:00am - 9:30am
Grand Ballroom D


The Curse of SRE Autonomy and How to Manage It
Within an SRE organization, teams usually develop very different automation tools and processes for accomplishing similar tasks. Some of this can be explained by the software they support: different systems require different reliability solutions. But many SRE tasks are essentially the same across all software: compiling, building, deploying, canarying, load testing, managing traffic, monitoring, and so on.

There are two puzzles here: why does this diversity exist, and how can it be overcome so that SRE teams stop duplicating their development efforts?

This talk presents a solution to both puzzles using the ten-year history of a single SRE tool. The tool is used only internally at a large company. It is one of the rare tools there that has been adopted widely by our very large SRE organization.


Richard Bondi

Richard Bondi has been an engineer at Google since 2011, specializing in the entire web stack and working on travel applications. In 2016 he converted to SRE, and then joined the SRE tech writer team. Before Google, and after leaving his political philosophy PhD program to join the...

Wednesday March 27, 2019 9:35am - 10:05am
Grand Ballroom D


Learning from Learnings: Anatomy of Three Incidents
The best response to a system outage is not "What did you do?", but "What did we learn?" This session will walk through three system-wide outages at Google, at Stitch Fix, and at WeWork—their incidents, aftermaths, and recoveries. In all cases, many things went right and a few went wrong; also in all cases, because of blameless cultures, we buckled down, learned a lot, and made substantial improvements in the systems for the future. Looking back with the perspective of 20-20 hindsight, all of these incidents were seminal events that changed the focus and trajectory of engineering at each organization. You will leave with a set of actionable suggestions in dealing with customers, engineering teams, and upper management. You will also enjoy a few war stories from the trenches, none of which has been previously told fully in public.


Randy Shoup

Over the past several decades, Randy Shoup has led high-performing engineering teams at eBay, Google, Stitch Fix, and WeWork. A long-time advocate of DevOps practices, Randy specializes in scaling engineering organizations, company cultures, and technology infrastructures. He is equally...

Wednesday March 27, 2019 10:10am - 10:40am
Grand Ballroom D


Sublinear Scaling in Practice: The 1k SRE Project
At Google, one of the primary objectives of SRE teams is sublinear scaling: the size and number of SRE teams should grow more slowly than the number of supported services. This talk will describe how one team has implemented this principle. Over the last 3 years, we have increased our portfolio by more than 200% (from 187 to 431 supported services) without additional staffing, and we plan for continued growth up to 1000 services. We will review the extensive automation infrastructure that we have in place, describe ongoing projects (including automated incident handling), and discuss the changes we've made in how we approach SRE - moving away from service-specific production readiness reviews towards automated policy verification and service-agnostic consulting. Audience members will hear about a vision for the long-term role of SRE in large organizations, where sublinear scaling requires not just increasing automation but a cultural shift from providing service-specific expertise to mostly service-independent consulting.


Nikolaus Rath

Dr. Nikolaus Rath is a site reliability engineer working on Google's advertising services. Before joining Google, he worked on feedback control systems for magnetically confined plasmas. He is a maintainer of a number of open-source projects, including libfuse and S3QL.

Wednesday March 27, 2019 11:10am - 11:40am
Grand Ballroom D


Pragmatic Automation
Automation is great, but how do you know when the right thing to do is to stop writing it? How do you take on complex automation projects of unknown scope and deliver impact incrementally?

This talk explores lessons learned in the automation space at a large public cloud provider that are applicable to anyone looking for new ideas to reduce toil in their day-to-day work.


Max Luebbe

Max has been an SRE at Google since 2009, having spent most of that time working in Storage Infrastructure. More recently he was on the teams that externalized Bigtable and Spanner as GCP Products and currently leads the effort to deploy new Google Cloud Regions all over the glob...

Wednesday March 27, 2019 11:45am - 12:15pm
Grand Ballroom D


Differences in SRE Implementations across Companies
With the popularity of "SRE" as a job role, people have become aware that not all such roles are entirely equivalent. There's been a Slack channel on the USENIX-SREcon workspace (https://usenix.org/srecon/slack #sre_between_companies) where people have started to explore these distinctions.
This session will be an opportunity to crowd-source more information. It will be a moderated, audience-driven session. Come and tell us what SRE means at your company!


Kurt Andersen

Liaison, LinkedIn
Kurt Andersen has been one of the co-chairs for SREcon Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for...

Wednesday March 27, 2019 12:20pm - 12:50pm
Grand Ballroom D


Automating the Management of the Operational Health of Cloud Accounts at Scale
In a large scale environment where engineers are empowered to independently deliver an application from concept to working production system, and in public cloud providers that allow access to do almost anything, there is a unique challenge of implementing and maintaining controls that align with tight banking regulations. I will discuss how we've used a combination of open source tools and our custom automation to solve various challenges such as:

  • Limiting public access
  • Staying ahead of account resource limits
  • Enforcing resource ownership
  • Cost control
  • Security patching
  • Account-impacting mistakes


Jamie Walls

Capital One
Jamie has experience in operations and on feature delivery teams and brings an understanding of the balance between high operational quality and time to market. He understands the value in "Shift Left" operational testing and validation where a focus on simplifying and automating...

Wednesday March 27, 2019 2:20pm - 2:50pm
Grand Ballroom D


Designing Resilient Data Pipelines
There are a number of questions that plague any operator of a complex data pipeline. How do I quickly recover from failures in my pipeline? How do I know that the data I generate is accurate? How do I minimize the risk associated with updating my pipeline? Designing your data pipeline with resiliency and observability in mind will help to answer these questions. In this talk, I will present several strategies that my team has adopted for reducing operational complexity, risk associated with updates, and concerns about accuracy of data pipelines.


Andrew Bolin

Two Sigma Investments, LP
Andrew Bolin is a Reliability Engineer at Two Sigma Investments where he is responsible for the design and operation of data pipelines critical to the firm's research environment. Before his current role, Andrew worked on the team responsible for the development of Two Sigma's open...

Wednesday March 27, 2019 2:55pm - 3:25pm
Grand Ballroom D


From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services
Artificial intelligence is all around us, from the digital assistants in our microwaves to the apps we rely on every day. Many of these systems build on APIs and services that use machine learning to provide key features. This talk will describe techniques for building predictable, reliable ML-based services as well as ways to sustain these services through social and technical change. We discuss challenges unique to the reliability of these systems and relate our experiences with ML in our production systems to illustrate our techniques.


Salim Virji

Google LLC
Salim Virji is a Site Reliability Engineer at Google, where he has worked on distributed compute, consensus, and storage systems.

Carlos Villavieja

Google LLC
Carlos Villavieja is a Computer Architect/Researcher working as a Software/Site Reliability Engineer at Google. He works on Storage optimizations and his interests vary from micro-architecture to machine learning.

Wednesday March 27, 2019 3:30pm - 4:00pm
Grand Ballroom D