I wrote a first version of this posting on consistency models about a year ago, but I was never happy with it as it was written in haste and the topic is important enough to receive a more thorough treatment. ACM Queue asked me to revise it for use in their magazine and I took the opportunity to improve the article. This is that new version.
Eventually Consistent - Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.
At the foundation of Amazon's cloud computing are infrastructure services such as Amazon's S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.
Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and need to be accounted for up front in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.
One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when the underlying distributed system provides an eventual consistency model for data replication. When designing these large-scale systems at Amazon, we use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. In this article I present some of the relevant background that has informed our approach to delivering reliable distributed systems that need to operate on a global scale. An earlier version of this text appeared as a posting on the All Things Distributed weblog in December 2007 and was greatly improved with the help of its readers.
A question I get asked frequently is how working in industry is different from working in academia. My answer from the beginning has been that the main difference is teamwork. While in academia there are collaborations among faculty and there are student teams working together, the work is still rather individual, as is the reward structure. In industry you cannot get anything done without teamwork. Products do not get build by individuals but by teams; definition, implementation, delivery and operation are all collaborative processes that have many people from many different disciplines working together.
As such the Information Week's Chief of the Year award cannot be my award. It is an award for all the Amazonians who in the past years have developed technologies and processes that are so innovative that they have defined a whole business landscape: first in ecommerce and now with Amazon Web Services they are defining Cloud Computing through the delivery of Infrastructure as a Service. Compared to the immense work that was needed to make all of this work, my involvement has been small.
A relentless focus on innovation by all Amazonians has made this possible: from new hardware development to the definition of new business models, from building ultra-reliable storage services to a massively scalable compute cloud, from pervasive monitoring and performance control to revolutionary efficient software architectures. At a scale and with reliability, performance and cost-effectiveness that is unparalleled in today's technology world. All these advances are based on 13 years of experience with building the world's most customer centric ecommerce operation, and as such the success of AWS is absolutely not the work of a single individual but the success of all Amazonians.
But this is only the beginning. We are intent on building the world's most customer-centric cloud computing operation and, as we have done with ecommerce, we will not accept the old norms of what must be done. We will always focus on what our customers need and work backwards from there. We will continue to innovate and roll out services and features that address the real needs of our customers.
It is still only Day One...
Starting today the Amazon Elastic Computing Cloud (EC2) supports the ability to launch instances in multiple geographically distinct regions. The new EU region enables users to launch instances in Europe.
This addresses the requests from many our European customers and from companies that want to run instances closer to European customers. Over the past year I have visited with many of our European customers and frequently they remarked "if only we had EC2 in Europe". We heard their requests loud and clear and have worked very hard to roll out the European Region. This is a very important milestone on the road to local access to all our services.
These are three of the main drivers for the requests by our customers
- Lower latency from EC2 instances to their clients. The European Region can be accessed with low latency from all major European network hubs.
- Low latency access to data stored in the Amazon Simple Storage Service (S3). A large number of customers have stored data into the European Region of Amazon S3. With the new European region this data can now be accessed with low latency from within EC2 at no cost
- Regulatory requirements may require that data be stored in the EU and/or processing take place within the EU. With the European Regions of Amazon S3 and Amazon EC2 developers now can address those requirements.
The new European Region will also contain two Availability Zones such that developers can build applications that can tolerate a variety of failure scenarios. One can even develop fail-over scenarios that will span multiple continents. Amazon Elastic Block Storage will also be available to our customers that launch instances in the European Region.
With the European Regions of Amazon EC2, S3 and SQS, combined with Amazon CloudFront, developers now have a full set of services that can help them address the European market.
I am very excited about the launch of the Amazon EC2 in European and I am looking forward to work with our European partners and customers to roll out their applications and services in the EU Region.
More details on the Amazon EC2 detail page , the AWS blog and at RightScale
Today marks the launch of Amazon CloudFront, the new Amazon Web Service for content delivery. It integrates seamlessly with Amazon S3 to provide low-latency distribution of content with high data transfer speeds through a world-wide network of edge locations. It requires no upfront commitments and is a pay-as-you-go service in the same style as the other Amazon Web Services.
Amazon CloudFront has been designed to be fast; the service will cache copies of the content in edge locations close to the end-user's location, significantly lowering the access latency to the content. High sustainable data transfer rates can be achieved with the service especially when distributing larger objects.
Amazon CloudFront will be useful for many different application scenarios such as giving your customers low-latency access to popular objects and protecting your site from popularity surges; other popular examples are low-cost delivery of rich media and sustainable fast transfer rates for software distributions.
See also the posting on the AWS Developer weblog and at Rightscale.
Seamless integration
A content delivery service that would extend Amazon S3 has been something that is very high on the wish list of our customers. They were already successfully using Amazon S3 for some of their content distribution needs, but many wanted the choice to do so with even lower latency and with higher data transfer rates to any place in the world.
Customers really appreciate the scalability, reliability and cost-effectiveness of Amazon S3 and the fact that it integrates so easily with Amazon EC2. Amazon CloudFront builds further on that seamless integration by making it really simple to distribute Amazon S3 content world-wide. The combination of the two services is really powerful: Amazon S3 will give you durable storage of your data, and the network of edge locations on three continents used by the Amazon CloudFront will deliver the content to your customers with low latency from the most appropriate location.
The network of edge locations
To ensure low-latency delivery, Amazon CloudFront uses a network of edge locations world-wide:
- United States: Ashburn (VA), Dallas/Fort Worth, Los Angeles, Miami, Newark, Palo Alto, Seattle and St. Louis
- Europe: Amsterdam, Dublin, Frankfurt and London
- Asia: Hong Kong and Tokyo
These edge locations work together to direct customers' requests to the edge location that can provide the response with the lowest latency.
Simplicity
Because Amazon CloudFront follows the core principles of all Amazon Web Services it is a unique content delivery service. The simplicity in getting started has been described by many of our early customers as a very important feature.
Using Amazon CloudFront is dead simple:
- Put your objects in an Amazon S3 bucket.
- Call the CreateDistribution API with the name of the S3 bucket, which will return your distribution's domain name.
- Use the new domain name in urls on your web or in your application. Whenever these urls are accessed CloudFront will determine the optimal edge location from where to serve your content.
Many of our private beta customers have reported that it only took them 10-15 minutes from the moment that they first signed up for the service to the moment that Amazon CloudFront was distributing their content.
The second Amazon Web Services principle that sets Amazon CloudFront apart is that no upfront commitments are necessary and you only pay for what you have used. There are no upfront fees or high volume requirements and no negotiations are necessary because we have published low prices from the start. This brings content delivery in the hands of all businesses, and you can exploit the benefits of Amazon's world-wide network of edge locations, regardless of whether you are a highly popular website, a small blog, a complex enterprise application or a developer doing some prototyping.
Tools such as S3Fox have support for Amazon CloudFront built-in such that if you want to avoid any programming you can immediately start exploiting world-wide, low-latency content delivery.
A core distributed systems component
It is not uncommon to think about a service for content delivery such as Amazon CloudFront only in the context of media distribution for web sites, but it actually plays a more fundamental role.
There are two main technology components to such a service; the first is intelligent request routing, which routes requests to the location that can best serve the user given a series of requirements and the status of the network. The second technology component is that of object caching, which is a fundamental building block in both operating systems and in distributed systems.
For example your operating system will have a file cache, where it will store popular, recently-accessed files in memory to provide much faster access and greater throughput. Without a file cache your whole computer would appear much slower as all work would happen at the speed of the disk instead of memory.
Caching is an essential technique that is used to make sure that components can operate at the fastest speed possible, to overcome the performance differences that exist in systems. For example CPU's have caches that are much faster than memory, memory works as caches for disks, local disks can function as caches for remote disks, etc.
In distributed systems caching is primarily used to provide fast access to popular objects that are located in remote storage servers. These systems of caching servers often cooperate to create massive aggregate world-wide capacity to provide low latency access. And by using globally decentralized cache servers for distribution, very high data transfer speed can be achieved.
Caching technology has long been the center piece of computer systems research and in Amazon CloudFront we use the type of highly advanced algorithms for reliability and scale that you have come to expect from our Amazon services.
Many of our customers will look to Amazon CloudFront for rock solid content distribution for websites, but its application is not limited to that. Developers can easily integrate the service into their desktop and server applications and benefit from the advanced routing and caching that Amazon CloudFront offers. For example enterprise style applications such as NASDAQ's Market Replay application are ideal candidates to integrate Amazon CloudFront to provide low latency access to popular market data while reducing the cost of data transfers.
Graphic by Renato Valdés Olmos of Postmachina
These are times where many companies are focusing on the basics of their IT operations and are asking themselves how they can operate more efficiently to make sure that every dollar is spent wisely. This is not the first time that we have gone through this cycle, but this time there are tools available to CIOs and CTOs that help them to manage their IT budgets very differently. By using infrastructure as a service, basic IT costs are moved from a capital expense to a variable cost, building clearer relationships between expenditures and revenue generating activities. CFOs are especially excited about the premise of this shift.
In recent weeks in my discussions with many of our Amazon Web Services customers I have seen a heightened interest in moving functionality into the AWS cloud to get a better grasp on controlling cost. And this is across the board; from young businesses to Fortune 500 enterprises, from research labs to television networks, all are concerned about reducing upfront cost associated with the new ventures and reducing waste in existing operations. Most of them point to 3 properties of the Amazon Web Services model that helps them become more efficient:
The pay-as-you-go model. There are significant advantages to this model for efficiency as one only pays for those resources one has actually consumed. If the application scales along the right revenue generating dimensions these costs will be in line with the revenue being generated.
Managing peak capacity. Many IT organizations need to maintain extra capacity for anticipated peak loads, capacity that sits idle for most of the time. These peak loads can be driven by customer demand such as in the online world, but it can also be capacity required to execute essential IT tasks such as periodic document indexing or business tasks such as closing the books at the end of a quarter. This is often the first step that our enterprise customers take to become familiar with using infrastructure as a service. After successfully running some of their peaks jobs they will then starting moving more permanent processing into the cloud.
A great example in the online world is the Indy 500 organization that normally runs 50 servers to serve their customers, but during the races move all of their processing into Amazon EC2 to handle all traffic no matter how many hundreds of thousands of customers show up at the same time. The savings for the Indy IT budget during the races this spring was over 50%.
Higher reliability at lower cost. Negotiating several contracts with different datacenter and network providers to make sure the IT tasks can survive complex failure scenarios is a difficult task and many organizations find it hard to achieve this in a cost efficient manner. Amazon EC2 with its Regions and Availability Zones gives its customers access to several high-end datacenters with highly redundant networking capabilities at a single pricing model, without any negotiations.
Amazon's efficiency principles
At Amazon we have a long history of implementing our services in a highly efficient manner. Whether these are our infrastructure services or our high-level ecommerce services, frugality is essential in our retail business. Margins in a retail business are traditionally small and these constraints have driven major innovations in the way that we manage our IT capacity. We have developed a lot of expertise in building highly efficient architectures to support Amazon's goal of providing our customer with products at low prices. Every savings we have been able to make in our IT cost we have been able to give back to our customers in terms of lowering prices. This tradition of letting customers benefit from our cost saving is something that we also apply to our Amazon Web Services business. When earlier this year we were able to negotiate better deals with our network providers we immediately reduced the bandwidth cost for our customers.
But we have learned at Amazon that having a low cost infrastructure is only the starting point of being as efficient as possible. You need to make sure that your applications will make use of the infrastructure in an adaptive and scalable manner to achieve a high degree of efficiency. In the Amazon architecture being incrementally scalable is key. This means that services' and applications' main course of action to handle increasing load or larger datasets is to grow one unit at a time. A more precise definition can be found here.
All services at Amazon are built to be horizontally scalable. An efficient request routing mechanism delivers requests to services in a manner that optimizes performance at a certain efficiency point. Capacity is acquired and released on short time frames to handle increase and decreases in resource usage. To achieve this principle of automatic scaling our services there are four basic components that need to work together:
Elastic Compute Capacity. The basic resources required to execute our services and applications need to be able to grow and shrink at a moment's notice in a fully automated fashion. This is the fundamental premise behind Amazon EC2; whenever an Amazon service requires additional capacity it can use a simple API call to acquire additional capacity without any interference from operators or data techs, and can release it when no longer needed.
Monitoring. We relentlessly measure every possible resource usage parameter, every application counter, and every customer's experience. Many gigabits per second of monitoring data flows continuously through the Amazon networks to make sure that our customers are getting serviced at the levels they can expect and at an efficiency level the business desires. We don't really care that much about averages or medians, for us performance at the 99.9 percentile is important to make sure that all our customers get the right experience.
Load balancing. Using the monitoring information we route requests intelligently, using several algorithms, to those services instances that can provide responses with the expected performance. In reality balancing the load is a secondary task of the request routing system as it is the customer's experience we are most driven by. The optimization quest is to deliver the right customer experience at the optimal resource utilization.
Automatic scaling. Using Monitoring data, Load balancing and EC2, the auto-scaling service monitors service health and performance, brings more capacity on-line if needed or reduces the number of instances to meet efficiency goals. It spreads instances over multiple availability zones and regions to achieve the desired reliability guarantees. All without interference of developers or operators.
These four services are the core of Amazon's highly efficient infrastructure that has allowed us to drive our IT costs to the floor for our retail operations.
Building highly-efficient systems on AWS
To make sure our customers can also benefit from our experience in building highly efficient systems we have decided to release versions of these services on the Amazon Web Services platform. The Monitoring, Load Balancing and Auto-Scaling services will be combined with a Management Console that provides a simple, point-and-click web interface that lets you configure, manage and access your AWS cloud resources.
They will first be released in private beta and you can express your interest in that program on the AWS web site. More details can be found in the posting on the Amazon Web Services blog.
Graphic by Renato Valdés Olmos of Postmachina
Congratulations to the Amazon EC2 team for the hard work to get to the point where the beta tag is removed from the service and it is now in full production. Not only that, but there now is an SLA, and Microsoft Windows and SQL Server are available as of today.
More details on the Amazon EC2 product page and on the Amazon Web Services weblog. Also, it's becoming a tradition for the folks at Rightscale to have a detailed posting on the new features.
The backend servers that power the world of Internet Services have become increasingly diverse. With today's announcement that Microsoft Windows Server is available on Amazon EC2 we can now run the majority of popular software systems in the cloud. Windows Server ranked very high on the list of requests by customers so we are happy that we will be able to provide this.
One particular area that customers have been asking for Amazon EC2 with Windows Server was for Windows Media transcoding and streaming. There is a range of excellent codecs available for Windows Media and there is a large amount of legacy content in those formats. In past weeks I met with a number of folks from the entertainment industry and often their first question was: when can we run on windows?
There are many different reasons why customers have requested Windows Server; for example many customers want to run ASP.NET websites using Internet Information Server and use Microsoft SQL Server as their database. Amazon EC2 running Windows Server enables this scenario for building scalable websites. In addition, several customers would like to maintain a global single Windows-based desktop environment using Microsoft Remote Desktop, and Amazon EC2 is a scalable and dependable platform on which to do so.
Amazon EC2 with Windows Server is still currently in private beta testing, but will be available for general use before the end of the year. Keep an eye on the AWS Weblog for information about Amazon Web Services at the Microsoft Professional Developer Conference.
The last week for submitting the applications for the AWS Startup Challenge has started. Looking at the proposals that are being submitted it looks like this will be another very inspiring challenge. These proposals are reviewed by a panel and five finalists will be selected. The finalists will come to Seattle to compete for $50K in cash, $50K in AWS credits, 2 years of Premium Support and more. All finalists will receive Rightscale Premium for 6 months and there will be a number of promotional events that includes all the finalists.
Last year there were 900 applications which made for very intense proposal reading sessions. Eventual Ooyala won the challenge and got to smash the server to bits. The videos of last year's finalists are still online.
If you have a brilliant idea/business that we should be evaluating you have until October 10 to let us know.
For many the "Cloud" in Cloud Computing signifies the notion of location independence; that somewhere in the internet services are provided and that to access them you do not need any specific knowledge of where they are located. Many applications have already been built using cloud services and they indeed achieve this location transparency; their customers do not have to worry about where and how the application is being served.
However for developers to do their job properly the cloud cannot be fully transparent. As much as we would like to make it easy and simple for everyone, building high-performance and highly reliable applications in the cloud requires that the developers have more control. For example a reality is that failures can happen; servers can crash and networks can become disconnected. Even if these are only temporary glitches and are transient errors, the developer of applications in the cloud really wants to make sure his or her application can continue to serve customers even in the face of these rare glitches. A similar issue is that of network latency; as much as we would like to see the cloud to be transparent, the transport of network packets is still limited to the speed of light (at best) and customers of cloud applications may experience a different performance depending on where they are located in relation to where the applications are running. We have seen that for many applications that works just fine, but there are developers who would like more control over how their customers are being served and for example would like to give all their customers low latency access, regardless of their location.
At Amazon we have been building applications on these cloud principles for several years now and we are very much aware of the tools that developers need to build applications that are required to meet very high standards with respect to scalability, reliability, performance and cost-effectiveness. We are also listening very closely to the feedback AWS customers are giving us to make sure we expose the right tools for them to do their job. We launched Amazon S3 in Europe to ensure that developers could build applications that could serve data out of a European storage cloud. We launched Regions and Availability Zones (combined with Elastic IPs) for Amazon EC2 such that developers would have better control over where their applications would be running to ensure high-availability. We are now ready to expand the cloud even further and bring the cloud storage to its customers' doorstep.
Today we are announcing that we are expanding the cloud by adding a new service that will give developers and businesses the ability to serve data to their customers world-wide, using low-latency and high data transfer rates. Using a global network of edge locations this new service can deliver popular data stored in Amazon S3 to customers around the globe through local access.
We have developed this content delivery service using the robust AWS principles we know work well for our customers:
- Cost-effective: no commitments and no minimum usage requirements. You only pay for what you use in a manner similar to the other Amazon Web Services.
- Simple to use: one API call gets you going. You store the data you want to distribute in an Amazon S3 bucket and you use this API call to register this bucket with the content distribution service. The registration will provide you with a new domain name that you can use in url's to access the data through this service with HTTP. When your customer accesses your content through your new url the data it refers to will be delivered through a network of edge servers.
- Works well with other services: The service integrates seamlessly with Amazon S3 and the data/content served through the service can be accessed using the standard HTTP access techniques.
- Reliable: Amazon S3 will give you durable storage of your data, and the network of edge locations on three continents used by the new service will deliver your content to your customers from the most appropriate location.
This is an important first step in expanding the cloud to give developers even more control over how their applications and their data are served by the cloud. The service is currently in private beta but we expect to have the service widely available before the end of the year. You can get a few more details and sign up to get notified when the service is becoming on this AWS page Also check Jeff Bar's posting on the AWS weblog.
Today marks the launch of Amazon EBS (Elastic Block Store), the long awaited persistent storage service for EC2. Details can be found on the EC2 detail page, the press release and Jeff Barr's posting over on the AWS evangelists blog. Also the folks at Rightscale have two detailed postings: why Amazon EBS matters and Amazon EBS explained.
With the launch of the Elastic Block Store we complete an important milestone in offering a complete suite of storage solutions as part of the Amazon Infrastructure Services. Back in the days when we made the architectural decision to virtualize the internal Amazon infrastructure one of the first steps we took was a deep analysis of the way that storage was used by the internal Amazon services. We had to make sure that the infrastructure storage solutions we were going to develop would be highly effective for developers by addressing the most common patterns first. That analysis led us to three top patterns:
- Key-Value storage. The majority of the Amazon storage patterns were based on primary key access leading to single value or object. This pattern led to the development of Amazon S3.
- Simple Structured Data storage. A second large category of storage patterns were satisfied by access to simple query interface into structured datasets. Fast indexing allows high-speed lookups over large dataset. This pattern led to the development of Amazon SimpleDB. A common pattern we see is that secondary keys to objects stored in Amazon S3 are stored in SimpleDB, where lookups result in sets of S3 (primary) keys.
- Block storage. The remaining bucket holds a variety of storage patterns ranging special file systems such as ZFS to applications managing their own block storage (e.g. cache servers) to relational databases. This category is served by Amazon EBS which provides the fundamental building block for implementing a variety of storage patterns.
I have written before about the basic features of Amazon EBS:
- Amazon EBS will be offered in the form of storage volumes which you can mount into your EC2 instance as a raw block storage device. It basically looks like an unformatted hard disk. Once you have the volume mounted for the first time you can format it with any file system you want or if you have advanced applications such as high-end database engines, you could use it directly.
- Developers can create multiple volumes, in size ranging from 1 GB to 1TB. This volume will be created within a specified Availability Zone and will be accessible by your EC2 instances running in that Availability Zone. As to be expected with a volume abstraction only one instance can have the volume mounted at any given time. Volumes can migrate and be reattached to other instances if necessary for failure handling or application migration reasons.
- The consistency of data written to this device is similar to that of other local and network-attached devices; it is under control of the developer when and how to force flush data to disk if you want to bypass the traditional lazy-writer functionality in the operating systems file-cache. Because of the session oriented model for access to the volume you do not need to worry about eventual consistency issues.
However Amazon EBS isn't just a massive volume storage array within an Availability Zone, it provides a unique feature that allows for the creation of novel storage management scenarios: the ability to create snapshots and store those snapshots into Amazon S3. These snapshots can then be used as the starting point for creating new volumes within any availability zone.
We see developers use this feature for long term backup purposes, for use in rollback strategies, for (world-wide) volume re-creation purposes. Snapshots also play an important role in building fault-tolerance scenarios when combined with managing applications using Elastic IP addresses and Availability Zones.
Congratulations to the EBS team for delivering a great service that will help a lot of EC2 customers managing their storage efficiently.
