Engineering Philosophy: Werner Vogels

Key Takeaways

  • His defining principle is that everything fails all the time, so you design for failure rather than against it. As Amazon’s CTO since 2005, Werner Vogels turned a blunt observation – at sufficient scale, component failure is constant and statistically guaranteed – into a design doctrine: assume every disk, server, network link, and dependency will fail, and build systems that stay available through failure instead of pretending it can be prevented.16
  • He co-authored the Dynamo paper, which became the blueprint for modern NoSQL. “Dynamo: Amazon’s Highly Available Key-value Store” (SOSP 2007) put consistent hashing, vector clocks, sloppy quorums, gossip-based membership, and eventual consistency into one always-writable store, and directly influenced Cassandra, Riak, Voldemort, and Amazon’s own DynamoDB.23
  • He is the leading evangelist of eventual consistency. His essay “Eventually Consistent” laid out the availability-versus-consistency trade plainly: when the network partitions – and at scale it will – you must choose, and Dynamo chose to stay available and converge soon rather than block until every replica agrees.47
  • He gave engineering culture the line “you build it, you run it.” In a 2006 conversation he described Amazon’s model of developers owning their services in production – scoping, building, and operating them – and argued that putting builders on the pager and in front of the customer is what drives quality.5

The Principle

“Everything fails, all the time.” – Werner Vogels, Amazon CTO, on designing reliable distributed systems6

Most engineering optimizes for the case where everything works. You build the happy path, handle the few errors you can imagine, and ship. That instinct survives right up until you operate at scale – and then it betrays you. When you are running hundreds of thousands of machines, “rare” stops being rare. A disk failure that happens once in three years per drive happens somewhere in your fleet every few minutes. A network link that drops one packet in a million drops millions of packets a day. Vogels’s most famous line compresses this into five words: everything fails all the time.6 Failure is not an exception to engineer around; at scale it is the steady-state condition you must engineer for.
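The arithmetic behind the disk claim is easy to check. What follows is a minimal sketch, assuming drives fail independently at a constant average rate; the function name and the fleet sizes are illustrative assumptions, and only the three-year figure comes from the text above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def minutes_between_fleet_failures(drives: int, years_per_failure: float) -> float:
    """Expected minutes between failures somewhere in the fleet.

    Assumes independent drives, each failing on average once per
    `years_per_failure` years (a simplification of real failure curves).
    """
    failures_per_minute = drives / (years_per_failure * MINUTES_PER_YEAR)
    return 1 / failures_per_minute


# Hypothetical fleet sizes; the article only says "hundreds of thousands".
for drives in (1, 10_000, 500_000):
    gap = minutes_between_fleet_failures(drives, years_per_failure=3.0)
    print(f"{drives:>7} drives: one failure every {gap:,.1f} minutes")
```

Under these assumptions a fleet of 500,000 drives sees a failure a little more often than once every four minutes, even though each individual drive still fails only once in three years.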

The principle that follows is the inversion of the usual one. If you cannot prevent failure – and at scale you provably cannot – then preventing it is the wrong goal. The right goal is to stay available while things are failing. So you assume every component will die, and you design so that any single death is survivable: replicate data across machines so the loss of one is invisible, decouple services behind APIs so a failing dependency degrades a feature instead of toppling the system, and minimize the blast radius of any one fault so it cannot take the whole thing down. The system is not built to avoid the failure case; it is built so the failure case is boring.16

There is a second half to the principle, and it is the one that makes the first half real: you cannot have perfect consistency and perfect availability at the same time once the network can partition, so you must choose – and Vogels chose availability. When a network split cuts your replicas off from each other, a system can either refuse to answer until everyone agrees (consistent, but unavailable) or keep answering with what it has and reconcile later (available, but briefly inconsistent). For Amazon’s shopping cart, refusing to answer was unacceptable – a cart that rejects an “add to cart” during a partition is a cart that loses a sale.4 So Dynamo always accepts the write and lets the replicas converge afterward. The cost is a small window where different replicas may return different answers; the payoff is a system that never tells a customer “no.” That trade – staying available and converging soon rather than blocking until everyone agrees – is eventual consistency, and Vogels spent a career arguing it is the right trade at scale.47

Context

Werner Vogels was born on 3 October 1958 in Ermelo, Netherlands.1 His path into computing was not the conventional straight line through a top university. He studied computer science at The Hague University of Applied Sciences, finishing in 1989, and only later earned a PhD in computer science from Vrije Universiteit Amsterdam – his 2003 thesis, “Scalable Cluster Technologies for Mission Critical Enterprise Computing,” was supervised by Henri Bal and Andrew Tanenbaum, the latter one of the field’s foundational figures in distributed systems and operating systems.1 The detail worth keeping is that the doctorate followed years of real systems work rather than preceding it; the theory caught up to the practice.

The most formative chapter came at Cornell University, where from 1994 to 2004 he was a research scientist working on scalable, reliable enterprise systems.1 At Cornell he was inside Ken Birman’s distributed-systems group – the lineage behind Isis and reliable group communication, the body of work that asked how a set of machines can agree, stay consistent, and keep running as members fail and recover. He co-founded a company, Reliable Network Solutions, with Birman and Robbert van Renesse, and served as its VP and CTO.1 This is the intellectual soil Vogels grew in: not “how do we keep machines from failing,” but “how does a group of machines stay correct and available while its members fail.” When he later said everything fails all the time, he was not improvising – he was stating the founding premise of the reliable-distributed-systems tradition he had spent a decade inside.

He joined Amazon in September 2004 as director of systems research, was named CTO in January 2005, and added the VP title in March 2005 – the role he has held ever since, driving technology direction across the company.1 His arrival coincided with the years Amazon was inventing the modern cloud: the Dynamo storage system was built and written up in this period, Amazon Web Services launched its foundational services, and Vogels became the public voice for the architectural principles underneath all of it – design for failure, decouple via services, embrace eventual consistency, and put the people who build a service in charge of running it.245

The Work

“Everything fails all the time”: design for failure and eventual consistency

Start here, because it is the principle made into engineering. The doctrine has two moves. The first is design for failure: treat every component as something that will fail, and make the system survive its loss. That means redundancy (replicate so any one copy can vanish), decoupling (services talk through APIs so a sick dependency degrades gracefully instead of cascading), and blast-radius containment (partition the system so a fault is trapped in a small cell rather than spreading).16 The test of a design is not “does it work when everything is healthy” but “what happens when this piece dies at the worst possible moment” – and the answer has to be “the system keeps serving.”
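Graceful degradation is simpler to see in code than in prose. The sketch below is a hypothetical product page that depends on a recommendations service; every name in it (`fetch_recommendations`, `ProductPage`, the `healthy` flag that simulates an outage) is an illustrative assumption, not Amazon's code.

```python
from dataclasses import dataclass, field


class DependencyDown(Exception):
    """Raised by a (hypothetical) client when its service is unreachable."""


def fetch_recommendations(product_id: str, healthy: bool) -> list[str]:
    # Stand-in for a remote call; `healthy` simulates the dependency's state.
    if not healthy:
        raise DependencyDown("recommendations service unreachable")
    return [f"{product_id}-accessory", f"{product_id}-bundle"]


@dataclass
class ProductPage:
    product_id: str
    recommendations: list[str] = field(default_factory=list)
    degraded: bool = False


def render_product_page(product_id: str, recs_healthy: bool) -> ProductPage:
    try:
        recs = fetch_recommendations(product_id, healthy=recs_healthy)
        return ProductPage(product_id, recs)
    except DependencyDown:
        # The core page is still served; only the optional feature is dropped.
        return ProductPage(product_id, degraded=True)


print(render_product_page("sku-1", recs_healthy=True))
print(render_product_page("sku-1", recs_healthy=False))
```

The design choice is where the `except` sits: at the boundary of the optional feature, so the failure of the dependency costs one panel on the page rather than the page itself.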

The second move is the one that makes high availability possible at scale: eventual consistency. Eric Brewer’s CAP observation says that when the network partitions, a distributed system cannot be both perfectly consistent and fully available – it must give one up.7 Vogels’s “Eventually Consistent” makes the choice explicit and defines the alternative precisely: under eventual consistency, “the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.”4 The word eventually is the whole trade. A system that emphasizes availability “may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write.”4 For a window that is usually brief, though the model itself puts no fixed bound on it, two replicas can disagree – but neither ever refuses to answer. Convergence happens in the background, and the user is never blocked.
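A toy model makes the stale-then-converged behavior concrete. This is a sketch under loud assumptions: two in-memory replicas, caller-supplied version numbers, and a highest-version-wins merge. Dynamo itself tracked versions with vector clocks and could hand conflicting versions back to the client, so treat the merge rule here as a simplification, and the class and function names as hypothetical.

```python
from typing import Optional


class Replica:
    """One copy of the data. Every write is accepted locally, with no coordination."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: dict[str, tuple[int, str]] = {}  # key -> (version, value)

    def write(self, key: str, value: str, version: int) -> None:
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key: str) -> Optional[str]:
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(a: Replica, b: Replica) -> None:
    """Background sync: both replicas keep the highest-versioned entry per key."""
    for key in set(a.data) | set(b.data):
        winner = max(r.data[key] for r in (a, b) if key in r.data)
        a.data[key] = winner
        b.data[key] = winner


east, west = Replica("east"), Replica("west")
east.write("cart:42", "book", version=1)
anti_entropy(east, west)                      # both replicas now hold version 1

east.write("cart:42", "book,pen", version=2)  # west has not heard of this yet
stale = west.read("cart:42")                  # west still answers, with the old value
anti_entropy(east, west)                      # replicas converge in the background
fresh = west.read("cart:42")
print(stale, "->", fresh)
```

Neither replica ever refuses a read or a write; the only cost is the interval between the second write and the next sync, during which `west` returns the older cart.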

Why it matters as engineering: most developers’ mental model of a database is the single-machine one, where a write is instantly visible to every subsequent read because there is only one copy. That model does not survive scale, because one copy is a single point of failure and one machine is a ceiling on throughput. The moment you replicate – which you must, to be available – you inherit the question of what a reader sees while the copies are catching up. Vogels’s contribution was to insist this is not a bug to be hidden but a design dimension to be chosen deliberately, and to give engineers the vocabulary – read-your-writes, monotonic reads, session consistency – to pick exactly how much consistency a given workload actually needs, rather than paying for the strongest guarantee everywhere.4
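The session guarantees named above can be sketched on the client side. The example below assumes a single writer per object and a client that remembers the newest version it has written or read; `Replica`, `Session`, and `floor` are hypothetical names for illustration, not an API from the essay.

```python
from typing import Optional


class Replica:
    def __init__(self) -> None:
        self.version = 0
        self.value: Optional[str] = None

    def apply(self, version: int, value: str) -> None:
        if version > self.version:
            self.version, self.value = version, value


class Session:
    """Tracks the newest version this client has written or read."""

    def __init__(self, replicas: list[Replica]) -> None:
        self.replicas = replicas
        self.floor = 0

    def write(self, replica: Replica, value: str) -> None:
        version = replica.version + 1          # single-writer assumption
        replica.apply(version, value)
        self.floor = max(self.floor, version)  # basis for read-your-writes

    def read(self) -> Optional[str]:
        # Accept the first replica that is at least as new as the floor.
        for replica in self.replicas:
            if replica.version >= self.floor:
                self.floor = replica.version   # basis for monotonic reads
                return replica.value
        raise RuntimeError("no replica is fresh enough for this session")


a, b = Replica(), Replica()
session = Session([a, b])            # reads try replica `a` first
session.write(b, "v1")               # only replica `b` has the write so far
own_read = session.read()            # `a` is below the floor, so `b` answers
other_read = Session([a, b]).read()  # a fresh session may still see stale `a`
print(own_read, other_read)
```

The writer sees its own write while an unrelated session can still read the stale replica, which is the point: the guarantee is bought per session, not paid for globally.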

The Dynamo paper and the NoSQL movement

The principle has a canonical artifact: “Dynamo: Amazon’s Highly Available Key-value Store,” which Vogels co-authored and which was published at SOSP 2007, the field’s premier operating-systems venue.2 Dynamo was Amazon’s answer to a specific, brutal requirement – the shopping cart had to accept writes always, even during data-center partitions and disk failures, because an unavailable cart loses revenue directly.23 Traditional relational databases, tuned for strong consistency, could not promise that under partition. So Amazon built a store that traded consistency for availability and wrote down exactly how.

Dynamo is a catalog of distributed-systems techniques assembled into one always-writable, decentralized system, and the paper’s influence comes from how cleanly it laid them out.23 Consistent hashing partitions data across nodes so the ring can grow or shrink without reshuffling everything – “incremental, possibly linear scalability.”3 Vector clocks track the causal history of each value so that concurrent writes can be detected rather than silently lost. Sloppy quorums and hinted handoff keep the system writable even when some replicas are unreachable, parking writes on a temporary stand-in until the rightful node returns. Anti-entropy with Merkle trees lets replicas efficiently find and repair their differences. Gossip-based membership lets nodes learn about each other and detect failures with no central coordinator – the design is deliberately symmetric and decentralized, so “every node in Dynamo should have the same set of responsibilities as its peers,” which means there is no special node whose death is catastrophic.3 Every one of these choices serves the same master: stay available when things fail.
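Consistent hashing, the first technique in that list, fits in a few lines. This is a minimal sketch, not Dynamo's implementation: the `HashRing` class, the choice of 64 virtual points per node, and the node names are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Any stable, well-mixed hash works for the sketch; MD5 is used for placement only.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        self.vnodes = vnodes
        self._points: list[tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns several points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key, wrapping around.
        idx = bisect.bisect(self._points, (_hash(key), "")) % len(self._points)
        return self._points[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add("node-d")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d:",
      all(after[k] == "node-d" for k in moved))
```

Adding a node moves only the keys that the new node takes over; everything else stays where it was, which is what lets the ring grow without reshuffling the whole data set.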

Amazon never released Dynamo’s code, but the paper did the work – it became one of the most influential systems papers of its decade, the intellectual seed of the NoSQL movement.3 Apache Cassandra, Riak, and Project Voldemort all trace their leaderless, eventually-consistent designs directly to it.3 And the name lived on commercially in Amazon DynamoDB, which is built on Dynamo’s principles even though it made different engineering choices under the hood (single-leader replication rather than Dynamo’s pure leaderless model).3 The lesson of Dynamo’s influence is itself worth noting: Amazon’s competitive moat was not the code, it was the clarity. By explaining precisely which guarantees they gave up and why, they taught a generation of engineers how to reason about the trade.

AWS, service orientation, and “you build it, you run it”

Dynamo is a storage system; the deeper Vogels contribution is architectural and cultural. Amazon’s platform is built as a mesh of independent services that talk only through APIs – no shared databases reached behind the curtain, no hidden coupling.5 The discipline matters for failure: when services are decoupled behind hard interfaces, a failing one degrades the specific feature it powers instead of corrupting the data or stalling the threads of everything that touched it. Service orientation is blast-radius containment expressed as architecture. It is also what made AWS possible – once your internal systems are clean, API-addressable services, exposing them to the outside world as products is a natural next step.

The cultural half is the line Vogels is quoted for as often as “everything fails all the time”: “you build it, you run it.” In a 2006 conversation with Jim Gray, he described Amazon’s model where each service is owned end to end by the team that makes it: “Each service has a team associated with it, and that team is completely responsible for the service – from scoping out the functionality to architecting it, to building it and operating it.”5 And the rationale was explicitly about quality through ownership: “You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer.”5 There is no wall to throw code over; the engineer who wrote the service carries the pager for it. The effect is a tight feedback loop – the person best able to fix a fragility is the person who feels its pain at 3 a.m., and the person who designed the feature hears the customer complaint directly. Ownership is not an HR slogan here; it is a reliability mechanism. A team that runs what it builds designs for failure because failure wakes them up.

Evangelizing the cloud: blast radius, cells, and well-architected systems

Vogels’s fourth body of work is less a single artifact than a sustained role: for two decades he has been the architect-evangelist who codified how to build on the cloud, not just how to build the cloud.16 The recurring themes are the principle applied at ever-larger scope. Minimize blast radius: partition systems into independent cells so a fault, a bad deploy, or a poison request is contained to a slice of customers rather than all of them. Decouple aggressively: prefer asynchronous, loosely-coupled services with explicit contracts over tight synchronous chains where one slow dependency stalls the whole call path. Automate the recovery, do not document it: a runbook that needs a human does not execute when the human is asleep. Embrace failure as a test input, deliberately injecting faults to prove the system survives them rather than hoping it will. Each of these is “everything fails all the time” turned into an operating practice – the consistent message, repeated across talks and writing and a second ACM conversation years on, that resilience is a property you design in from the first line, not a layer you add after the demo works.6
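The cell idea can be illustrated with a stable mapping from customer to cell. The sketch below is an assumption-laden toy, not how AWS routes traffic: the cell names, the SHA-256 based assignment, and the helper functions are all hypothetical.

```python
import hashlib

CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]  # hypothetical cell names


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Pin each customer to one cell with a stable hash."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return cells[int.from_bytes(digest[:8], "big") % len(cells)]


def affected_customers(customers: list[str], failed_cell: str) -> list[str]:
    """Customers who feel a fault (bad deploy, poison request) in one cell."""
    return [c for c in customers if cell_for(c) == failed_cell]


customers = [f"customer-{i}" for i in range(1000)]
hit = affected_customers(customers, "cell-2")
share = len(hit) / len(customers)
print(f"cell-2 failure touches {share:.0%} of customers")
```

With four cells a single bad cell touches roughly a quarter of customers instead of all of them, and adding cells shrinks the hole further. A real deployment would also need a way to move customers between cells, which this sketch leaves out.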

The Method

Read across Dynamo, eventual consistency, service orientation, and “you build it, you run it,” and the same commitments recur. Vogels’s method is less a slogan than a set of standing habits.

Design for the failure case first. At scale, failure is the steady state, not the exception, so the question is never “does this work” but “what happens when each piece of this dies.”6 The lesson transfers far past Amazon’s scale: do not write the happy path and patch in error handling – enumerate the failure modes first and let the working path fall out of a system that already survives them. It is the evidence gate applied to reliability: “it works in the demo” is not evidence; “it stays available when I kill a node mid-request” is. That is the same standard for self-healing that Radia Perlman built into networks that re-converge with no human in the loop.

Choose your consistency, do not inherit it. The deepest move in Dynamo is refusing the default that every read must see every prior write. Vogels makes consistency a dial you set per workload – strong where correctness demands it, eventual where availability matters more – and is precise about which guarantee a system actually provides.47 The discipline is to know exactly what your consistency claim rests on and never to pay for a guarantee a workload does not need. This is the same precision about correctness that Leslie Lamport brought to distributed time: do not assume the property, define it exactly and know when it holds.

Decouple to contain the blast radius. Independent services behind hard APIs mean a failure is trapped where it happens instead of cascading.5 The standing habit is to draw the boundaries so that the worst case is a degraded feature, never a downed system – to ask of every dependency, “when this fails, how big is the hole?” and to make the hole small. It is the architectural form of minimum worthy product: the cleanest boundary is the one that does exactly its job and fails alone.

Make the builders own the running. “You build it, you run it” puts the people who design a service on the pager for it, closing the loop between a fragility and the person able to fix it.5 The lesson is that operational pain is the most honest quality signal there is – a team insulated from production will under-invest in resilience, because the cost of fragility lands on someone else. Ownership is a reliability mechanism. It is “quality is the only variable” made into an org chart: the only way to guarantee quality is to make the builder feel the consequence of its absence.

Explain the trade in the open. Dynamo’s influence came not from its code – which was never released – but from a paper that stated plainly which guarantees were given up and why.23 The habit is to make the reasoning legible: name the trade-off, justify the side you chose, and teach the next engineer to reason about it rather than cargo-cult the result. Clarity about why is what lets a design outlive its author – the same explanatory discipline that made Perlman’s and Lamport’s papers teachable decades on.

Influence Chain

Who Shaped Him

Ken Birman and the Cornell reliable-distributed-systems tradition. Vogels’s decade at Cornell, inside Birman’s group and the Isis/reliable-group-communication lineage, is the source of his founding premise.1 That tradition’s central question – how does a group of machines stay correct and available while its members fail and recover – is precisely the question “everything fails all the time” answers. He did not coin a slogan; he restated his field’s first principle for a planetary audience. (Formative influence)

Andrew Tanenbaum and the distributed-systems academy. His Vrije Universiteit doctorate was supervised in part by Tanenbaum, one of the field’s foundational teachers of operating and distributed systems.1 The grounding shows: Dynamo reads like a working synthesis of the distributed-systems canon – consistent hashing, vector clocks, quorums, gossip – assembled by someone who knew the literature cold. (Formative influence)

Eric Brewer and the CAP trade-off. Vogels’s case for eventual consistency rests explicitly on the CAP observation that a partition-tolerant system must trade consistency against availability.47 Brewer framed the impossibility; Vogels operationalized the choice at Amazon’s scale and made “pick availability and converge” a respectable default. (Direct influence)

Who He Shaped

The entire NoSQL movement. The Dynamo paper is the direct ancestor of Cassandra, Riak, and Voldemort, and the namesake of DynamoDB – the leaderless, eventually-consistent design pattern propagated from one 2007 paper into the data layer of a generation of systems.3

Cloud-native architecture and DevOps culture. “You build it, you run it” became one of the founding ideas of modern DevOps – full-service ownership, on-call developers, and the dissolution of the dev/ops wall trace directly to the model Vogels described in 2006.5

A generation of cloud architects. Through AWS’s design principles and his sustained evangelism, “design for failure,” “minimize blast radius,” and “decouple via services” became the default vocabulary engineers use to reason about building reliable systems on the cloud.6

The Throughline

Vogels is the operational-scale keystone of this series – the figure who took distributed-systems theory and ran it on a planet’s worth of machines. Leslie Lamport gave distributed systems their foundations: how to define time, ordering, and consensus precisely, and how to keep a system correct when participants fail or behave arbitrarily. Vogels is what those foundations look like when they have to serve a Black Friday shopping cart – the same questions of consistency and failure, answered not on a whiteboard but under real load, with real revenue riding on staying available.4 And Radia Perlman built networks that treat the failure case as the design center, healing themselves with no human in the loop; Vogels built services on exactly that instinct, one layer up the stack – replicate, decouple, contain the blast radius, and let the system converge on its own. Where Lamport says define correctness and prove it survives failure and Perlman says build it to heal itself, Vogels says: everything fails all the time, so stop trying to prevent it – design so the system stays available straight through the failure, and let the builders who run it feel every crack. (Series bridge)

What I Take From This

The lesson I keep from Vogels is to treat failure as the normal case, not the exception. My instinct, like most builders’, is to write the path where the call succeeds, the dependency answers, the disk is there – and then bolt on a try/catch once it works. “Everything fails all the time” is the rebuke: at any real scale the failure is not a rare event happening to my system, it is a constant condition my system lives in. So when I build something now – a sync job, an API client, a queue consumer – I try to start from “what dies, and does the rest keep serving when it does?” rather than getting there last. The honest version of “it works” is not the green demo; it is killing a dependency mid-request and watching the system degrade gracefully instead of falling over. A system that only survives the happy path is a system I have not finished designing.

The second lesson is that availability and consistency are a trade I have to make on purpose. It is tempting to want both – every read sees every write, and the system never says no – and for a single machine you can have it. The moment I replicate anything, that comfort is gone, and Vogels’s discipline is to choose the side deliberately for each workload rather than defaulting to the strongest guarantee everywhere out of habit. Most of what I build does not need a read to instantly reflect the latest write; it needs to never refuse the customer. Eventual consistency reframed that for me from a scary compromise into a precise tool: name exactly how stale a read the feature can tolerate, buy availability with the slack, and stop paying for a guarantee the feature never needed. The skill is not always reaching for the strongest promise – it is knowing which promise the work actually requires.

FAQ

What does “everything fails all the time” mean?

It is Werner Vogels’s compression of a hard-won lesson about scale: when you operate enough machines, component failure stops being a rare exception and becomes a constant, statistically guaranteed condition.6 A failure mode rare enough to ignore on one server happens somewhere in a large fleet constantly. The practical consequence is the inversion of normal engineering: instead of trying to prevent failure, you assume every disk, server, link, and dependency will fail, and you design systems that stay available through failure – via redundancy, decoupling, and contained blast radius – so that any single fault is survivable and, ideally, invisible.16

What is the Dynamo paper?

“Dynamo: Amazon’s Highly Available Key-value Store” is a 2007 SOSP paper, co-authored by Vogels, describing the storage system Amazon built to keep services like the shopping cart writable even during failures and network partitions.23 It combined consistent hashing for partitioning, vector clocks for tracking concurrent writes, sloppy quorums and hinted handoff for staying available under failure, anti-entropy with Merkle trees for repair, and gossip for decentralized membership – all in service of always accepting a write and reconciling later. Amazon never released the code, but the paper became foundational to the NoSQL movement, directly influencing Cassandra, Riak, Voldemort, and Amazon DynamoDB.3
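To show what "tracking concurrent writes" means in practice, here is a minimal vector-clock sketch. It assumes clocks are plain dictionaries mapping a node name to a counter; the function names and node names are hypothetical, and the paper's own version handling has details (such as clock truncation) that are left out.

```python
def increment(clock: dict[str, int], node: str) -> dict[str, int]:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if clock `a` has seen everything clock `b` has."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict[str, int], b: dict[str, int]) -> str:
    if a == b:
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"  # neither saw the other: a conflict to reconcile


def merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Clock for a reconciled value: the per-node maximum of both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


v1 = increment({}, "node-a")   # first write, handled by node-a
v2 = increment(v1, "node-a")   # later write on top of v1
v3 = increment(v1, "node-b")   # a different write on top of v1, via node-b

print(compare(v2, v1))         # v2 replaces v1
print(compare(v2, v3))         # siblings: both must be kept and reconciled
print(merge(v2, v3))
```

The useful property is the last case: when neither clock descends from the other, the store knows the writes were concurrent and can keep both versions instead of silently discarding one.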

What is eventual consistency?

Eventual consistency is a relaxed consistency model Vogels championed and defined in his essay “Eventually Consistent”: “if no new updates are made to the object, eventually all accesses will return the last updated value.”4 In a replicated system, a write may reach some replicas before others, so for a brief window different replicas can return different answers – but none ever refuses a request. The system stays available and converges in the background rather than blocking until every replica agrees. It is the availability side of the CAP trade-off: when the network partitions, a system can be consistent (refuse to answer until all agree) or available (answer with what it has and reconcile later), and eventual consistency chooses available.47

What does “you build it, you run it” mean?

“You build it, you run it” is Vogels’s description, from a 2006 ACM Queue conversation, of Amazon’s model of full-service ownership: the team that builds a service is “completely responsible for the service – from scoping out the functionality to architecting it, to building it and operating it.”5 There is no wall between development and operations – the engineers who wrote the code carry the pager for it. Vogels argued this “brings developers into contact with the day-to-day operation of their software” and “into day-to-day contact with the customer,” and that the resulting feedback loop is what drives quality.5 The idea became one of the founding principles of modern DevOps culture.


Sources


  1. “Werner Vogels,” Wikipedia. Born 3 October 1958 in Ermelo, Netherlands. Studied computer science at The Hague University of Applied Sciences (completed 1989); PhD in computer science from Vrije Universiteit Amsterdam (2003), thesis “Scalable Cluster Technologies for Mission Critical Enterprise Computing,” supervised by Henri Bal and Andrew Tanenbaum. Visiting scientist then research scientist at Cornell University (1994-2004) working on scalable, reliable enterprise systems; co-founded Reliable Network Solutions, Inc. with Kenneth Birman and Robbert van Renesse (serving as VP and CTO). Joined Amazon in September 2004 as director of systems research; named CTO in January 2005 and VP in March 2005, the role driving technology innovation across the company. Co-author of the Dynamo paper. 

  2. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels, “Dynamo: Amazon’s Highly Available Key-value Store,” Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP ’07), ACM, 2007, pp. 205-220. Describes Dynamo, the highly available, eventually-consistent key-value store Amazon built to keep core services (such as the shopping cart) writable during failures and partitions; trades strong consistency for availability, always accepting writes and reconciling later.

  3. “Dynamo (storage system),” Wikipedia. Dynamo is a set of techniques that together form a highly available key-value store built by Amazon, presented in the 2007 SOSP paper. Techniques: consistent hashing for partitioning (“incremental, possibly linear scalability”); vector clocks (or dotted version vectors) for highly available writes; sloppy quorum and hinted handoff for temporary failures; anti-entropy using Merkle trees for permanent failure recovery; gossip-based membership protocol and failure detection for decentralization. Architected around symmetry and decentralization – “every node in Dynamo should have the same set of responsibilities as its peers.” Amazon published the paper but never released the implementation; the work strongly influenced the NoSQL movement, inspiring Apache Cassandra, Project Voldemort, and Riak. Amazon DynamoDB is built on the principles of Dynamo but uses a different (single-leader) architecture. 

  4. Werner Vogels, “Eventually Consistent,” All Things Distributed (December 2008), revised for ACM Queue (2008) and published in Communications of the ACM 52(1), January 2009, pp. 40-44. Defines eventual consistency: “the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.” References Eric Brewer’s CAP theorem and explains the availability-versus-consistency trade: a system that emphasizes availability “may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write.” Describes consistency variations including read-your-writes, session consistency, and monotonic reads. 

  5. Jim Gray, “A Conversation with Werner Vogels,” ACM Queue 4(4), May 2006 (the queue.acm.org page may return HTTP 403 to automated fetches; the quotations are corroborated by HandWiki, “Software:You Build It You Run It”). Vogels describes Amazon’s full-service-ownership model: “Each service has a team associated with it, and that team is completely responsible for the service – from scoping out the functionality to architecting it, to building it and operating it.” And: “Giving developers operational responsibilities has greatly enhanced the quality of the services… You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer.” 

  6. “Everything Fails All the Time,” Communications of the ACM, on the design principle attributed to Werner Vogels (the cacm.acm.org page may return HTTP 403 to automated fetches; the attribution is corroborated by The Next Web, “Werner Vogels: ‘Everything fails all the time’”). Vogels’s widely-cited maxim that, at scale, component failure is constant and statistically guaranteed, so systems must be designed for failure – via redundancy, decoupling, automated recovery, and contained blast radius – to remain available through failure rather than attempting to prevent it. The principle is foundational to AWS’s design guidance and the Well-Architected Framework. 

  7. “Eventual consistency,” Wikipedia. Eventual consistency is a consistency model used in distributed computing to achieve high availability: informally, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. It is the availability-favoring side of the CAP theorem trade-off (consistency, availability, partition tolerance – a partition-tolerant system must trade consistency against availability), and is widely deployed in distributed systems including DNS and many NoSQL stores descended from Amazon’s Dynamo. 
