Today we launched a new option for acquiring Amazon EC2 Compute resources: Spot Instances. Using this option, customers bid any price they like on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current "Spot Price." Spot Instances are ideal for tasks that can be flexible as to when they start and stop. This gives our customers an exciting new approach to IT cost management.
The central concept in this new option is that of the Spot Price, which we determine based on current supply and demand and which will fluctuate periodically. If the maximum price a customer has bid exceeds the current Spot Price then their instances will be run, priced at the current Spot Price. If the Spot Price rises above the customer's bid, their instances will be terminated, and they will be restarted (if the customer wants them restarted at all) when the Spot Price falls below the customer's bid. This gives customers exact control over the maximum cost they are incurring for their workloads, and often will provide them with substantial savings. It is important to note that customers will pay only the existing Spot Price; the maximum price just specifies how much a customer is willing to pay for capacity as the Spot Price changes.
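To make the rule concrete, here is a minimal sketch of how a bid interacts with a fluctuating Spot Price. It is only an illustration: the function name, the hourly granularity of the simulation, and the prices are made up for this example, and it is not how Amazon EC2 computes anything internally. An instance runs in an hour only when the bid exceeds the Spot Price, and it is charged the Spot Price, never the bid.

```python
def simulate_spot(bid, hourly_spot_prices):
    """Return (hours_run, total_cost) for one instance under a maximum bid.

    Assumption for this sketch: the instance runs in an hour only when the
    bid strictly exceeds that hour's Spot Price, and it pays the Spot Price.
    """
    hours_run = 0
    total_cost = 0.0
    for spot_price in hourly_spot_prices:
        if bid > spot_price:
            hours_run += 1
            total_cost += spot_price  # charged the Spot Price, not the bid
    return hours_run, total_cost


# Hypothetical Spot Prices in dollars per hour (not real AWS prices).
prices = [0.03, 0.04, 0.06, 0.05, 0.03]
hours, cost = simulate_spot(bid=0.05, hourly_spot_prices=prices)
print(f"ran {hours} of {len(prices)} hours for ${cost:.2f}")
```

In this example the instance sits out the two hours in which the Spot Price reaches or passes the bid, and the total charge stays below what the same hours would have cost at the bid price.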
Spot Instances are ideal for Amazon EC2 customers who have workloads that are flexible as to when their tasks are run. These can be incidental tasks, such as the analysis of a particular dataset, or tasks where the amount of work to be done is almost never finished, such as media conversion from a Hollywood studio's movie vault, or web crawling for a search indexing company. For most of these tasks completion is not time critical, and as such they are ideal targets for additional cost savings.
Economies of scale
Spot Instances are an innovation that is made possible by the unparalleled economies of scale created by the tremendous growth of the AWS Infrastructure Services. The broad Amazon EC2 customer base brings such diversity in workload and utilization patterns that it allows us to operate Amazon EC2 with extreme efficiency. True to the Amazon philosophy, we let our customers benefit from the economies of scale they help us create by lowering our prices when we achieve lower cost structures. Consistently we have lowered compute, storage and bandwidth prices based on such cost savings.
This massive scale also enables new innovative purchasing models such as Spot Instances that empower our customers to gain even more control over the cost-effectiveness of their IT infrastructure. A highly efficient purchasing model such as Spot Instances is another way in which Amazon EC2 customers benefit from the unique economies of scale found in AWS Infrastructure Services.
Different Purchasing Models
The three different purchasing models Amazon EC2 offers give customers maximum flexibility in managing their IT costs: On-Demand Instances are charged by the hour at a fixed rate with no commitment; with Reserved Instances you pay a low, one-time fee and in turn receive a significant discount on the hourly usage charge for that instance; and Spot Instances provide the ability to assign the maximum price you want to pay for capacity with flexible start and end times.
- On-Demand Instances - On-Demand Instances let you pay for compute capacity by the hour with no long-term commitments or upfront payments. You can increase or decrease your compute capacity depending on the demands of your application and only pay the specified hourly rate for the instances you use. These instances are used mostly for short term workloads and for workloads with unpredictable resource demand characteristics.
- Reserved Instances - Reserved Instances let you make a low, one-time, upfront payment for an instance, reserve it for a one or three year term, and pay a significantly lower rate for each hour you run that instance. You are assured that your Reserved Instance will always be available in the Availability Zone in which you purchased it. These instances are used for longer running workloads with predictable resource demands.
- Spot Instances - Spot Instances allow you to specify the maximum hourly price that you are willing to pay to run a particular instance type. We set a Spot Price for each instance type in each region, which is the price all customers will pay to run a Spot Instance for that given hour. The Spot Price fluctuates based on supply and demand for instances, but customers will never pay more than the maximum price they have specified. These instances are used for workloads with flexible completion times.
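The trade-off between On-Demand and Reserved Instances comes down to simple arithmetic: the one-time fee pays for itself once the instance has run enough hours at the lower rate. The sketch below works that out. The function name and all of the prices are hypothetical and chosen only for illustration; they are not actual Amazon EC2 prices.

```python
def break_even_hours(upfront_fee, reserved_rate, on_demand_rate):
    """Hours of usage after which a Reserved Instance becomes cheaper.

    On-Demand cost: hours * on_demand_rate
    Reserved cost:  upfront_fee + hours * reserved_rate
    """
    if on_demand_rate <= reserved_rate:
        raise ValueError("the reserved rate must be lower than the on-demand rate")
    return upfront_fee / (on_demand_rate - reserved_rate)


# Hypothetical prices in dollars (assumptions for this example only).
hours = break_even_hours(upfront_fee=300.0, reserved_rate=0.05, on_demand_rate=0.10)
print(f"break-even after about {hours:.0f} hours of usage")
```

A workload that is expected to run for fewer hours than the break-even point is better served by On-Demand Instances, which matches the guidance above about short-term versus longer-running workloads.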
Managing Spot Instance Applications
There is one technological requirement for the applications that run as Spot Instances: they need to be able to handle their computation being stopped and restarted based on the Spot Price in relation to the customer's pricing limit. Ideally these applications will periodically save their state into, for example, Amazon EBS or Amazon S3 and upon restart read the last saved state and continue their work. This snapshot-restart technique is a well-known methodology already available to many batch-oriented applications.
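A minimal sketch of the snapshot-restart technique looks like this. To keep it self-contained it writes the checkpoint to a local JSON file, which stands in for an Amazon EBS volume or an Amazon S3 object; the function names, the state layout, and the `max_items` parameter (used here only to mimic a termination) are assumptions made for this example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # the rename replaces the checkpoint in one step


def process(items, path, max_items=None):
    """Sum the items, resuming from the checkpoint.

    max_items simulates the instance being terminated part way through.
    """
    state = load_checkpoint(path)
    done = 0
    while state["next_item"] < len(items):
        if max_items is not None and done >= max_items:
            break  # simulated termination
        state["total"] += items[state["next_item"]]
        state["next_item"] += 1
        save_checkpoint(path, state)
        done += 1
    return state


checkpoint = os.path.join(tempfile.mkdtemp(), "state.json")
work = [1, 2, 3, 4, 5]
process(work, checkpoint, max_items=2)  # first run is cut short after two items
final = process(work, checkpoint)       # the restart continues with the third item
print(final)
```

The sketch saves after every item for clarity. A real application would pick a checkpoint interval that balances the cost of saving state against the amount of work it is willing to redo after a restart.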
There are three features that help customers manage their Spot Instances and the pricing.
- Persistent Requests: Spot Instance requests can be one-time or persistent. A one-time request will only be satisfied once; a persistent request will remain in consideration after each instance termination. A persistent request is useful when you have a large amount of computing that you want to get done but only below a certain price. By using a persistent request, customers can launch instances any time the Spot Price is below the target price and steadily work through the tasks.
- Launch Groups and specifying Availability Zones: Customers can request a cluster of instances to always launch and terminate simultaneously by specifying a Launch Group. They can also specify the Availability Zone these Instances should be launched in.
- Price History: Amazon EC2 provides a history of the Spot Price for each instance type in each Region via the AWS Management Console and the APIs. Spot Price history is a valuable tool in helping customers use what-if scenarios to determine the right pricing level for a particular workload.
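A what-if scenario over the price history can be as simple as replaying past prices against a few candidate bids. The sketch below does that and picks the lowest bid that would have kept instances running for a target fraction of the time. The history values, the candidate bids, and the function names are all hypothetical; in practice the history would come from the console or the APIs mentioned above.

```python
def evaluate_bid(bid, price_history):
    """Return (availability, average_price_paid) for a bid over past prices."""
    if not price_history:
        raise ValueError("price history must not be empty")
    running = [p for p in price_history if bid > p]
    availability = len(running) / len(price_history)
    average_paid = sum(running) / len(running) if running else None
    return availability, average_paid


def lowest_bid_meeting(target_availability, candidate_bids, price_history):
    """Return the lowest candidate bid that reaches the target, or None."""
    for bid in sorted(candidate_bids):
        availability, _ = evaluate_bid(bid, price_history)
        if availability >= target_availability:
            return bid
    return None


# Hypothetical hourly Spot Prices in dollars (not real AWS data).
history = [0.030, 0.032, 0.045, 0.060, 0.038, 0.031, 0.052, 0.034]
candidates = [0.035, 0.04, 0.05, 0.07]
choice = lowest_bid_meeting(0.75, candidates, history)
print(f"lowest bid that ran at least 75% of the time: {choice}")
```

Past prices are no guarantee of future prices, so a result like this is a starting point for choosing a bid rather than a prediction.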
Spot Instances are a great innovation that, as far as I know, has no equivalent in the IT industry. They bring our customers a powerful new way of managing the cost of those workloads that are flexible in their execution and completion times. This new customer-managed pricing approach holds the power to make new areas of computing feasible for which the economics were previously unfavorable.
For more details and background information visit the Amazon EC2 Spot Instance detail page, the AWS developer blog and the good folks at RightScale.
Today a powerful new feature is available for our Amazon EC2 customers: the ability to boot their instances from Amazon EBS (Elastic Block Store).
Customers like the simplicity of the AMI (Amazon Machine Image) model where they either choose a preconfigured AMI or upload their own AMI into Amazon S3. A wide variety of operating systems and software configurations is available for use. But customers have also asked us for more flexibility and control in the way that Amazon EC2 instances are booted, such that they have finer-grained control over, for example, what software configurations and data sets are available to the instance at boot time.
The ability to boot from Amazon EBS gives customers very powerful control over the boot configuration of their Amazon EC2 instances. In the traditional boot process, the root partition of the image will be the local disk, which is created and populated at boot time. In the new Amazon EBS boot process, the root partition is an Amazon EBS volume, which is created at boot time from an Amazon EBS snapshot. Other Amazon EBS volumes beyond the root disk can also be made part of the instance before it is booted. This allows for very fine-grained control of software and data configuration. An additional advantage of using the Amazon EBS boot process is that root partitions are no longer constrained by the size of the local disk and can be up to 1TB in size. And the new boot process is significantly faster because a local disk no longer needs to be populated.
With this new boot process another powerful feature is available to our Amazon EC2 customers: the ability to stop an instance and restart it at a later time with the disk configuration intact. When an instance is restarted, the customer can choose to use a different instance type (e.g., with more memory or CPU), a different operating system (e.g., with new security patches installed), or add new user data. While the instance is stopped it does not accrue any usage hours and customers are only charged for the storage associated with the Amazon EBS volume. The ability to stop and restart an instance is a very powerful mechanism that makes management of instances much easier; many scenarios related to adaptive instance sizing and software management have now become much simpler.
The new boot from Amazon EBS feature is an important step in our continuing quest to remove more and more of the heavy lifting that comes with today's computer environments.
For more details on the new boot features visit the Amazon EC2 detail page and the posting on the AWS developer blog. RightScale's perspective is also worth reading.
We have expanded the AWS footprint in the US and starting today a new AWS Region is available for use: US-West (Northern California). This new Region consists of multiple Availability Zones and provides low-latency access to the AWS services from, for example, the Bay Area. In the US, AWS customers now can choose between the US-East (Northern Virginia) Region and the new US-West (Northern California) Region. In addition, the EU (Ireland) Region is available to customers who want local access to services from Europe to address their performance or jurisdiction requirements.
As we announced earlier this month, a Region with multiple Availability Zones will come online in Singapore in the first half of 2010, with other Regions in Asia to follow later in 2010.
AWS is committed to making its services available at low cost. We use the cost-following principle in pricing the services and several times we have lowered our pricing in response to the cost savings we were able to achieve. Operation costs are often different based on location and, as such, the pricing for services may vary somewhat between Regions, giving our customers the power to make trade-offs between, for example, cost and latency.
At the end of Q3 2009 we counted over 82 billion objects in Amazon S3. Congrats to the team for providing such a rock solid service!
When looking at the graph keep in mind that the first 4 markers are a year apart, but the last one only 6 months.
Today marks the launch of Amazon RDS - the Amazon Relational Database Service. Amazon RDS is a web service that makes it easy to set up, operate, and scale a relational database in the cloud. Amazon RDS handles all the "muck" of relational database management, freeing up its users to focus on their applications and business.
Fine Tuning Data Management
At Amazon we have a long history of fine tuning our data management solutions to make sure that our systems can be reliable and cost-effective as we continue to scale. Almost from the beginning of operating the Amazon ecommerce platform it was clear that its scalability, reliability, performance, and cost-effectiveness were all dependent on the way that data was managed. In the first years of Amazon.com the site was architected like a traditional two-tier web system: a collection of application servers connected to a backend of databases. Many of the old-timer Amazonians recall how hard it was to scale the site and keep it reliable, as all of that work was rooted in scaling the centralized database servers. Looking back they jokingly talk about "duct tape and WD-40 engineering." With the move years ago from the two-tier system to a fine-grained, decentralized, service-oriented architecture this changed dramatically.
In the Amazon services architecture, each service is responsible for its own data management, which means that each service team can pick exactly those solutions that are ideally suited for the particular application they are implementing. It allows them to tailor the data management system such that they get maximum reliability and guaranteed performance at the right cost as the system scales up. Early on, the distinction was already made between key-value storage systems and structured data management. Key-value storage systems play a very important role in the Amazon architecture and this has ultimately led to the creation of the Amazon Simple Storage Service (Amazon S3). Amazon S3 addresses the need for a highly scalable and reliable key-value data storage system while shielding customers from all the complexities such as geo-replication, capacity planning, and performance management at high scale.
Structured data management systems are traditionally served by relational databases but these sophisticated systems have their limitations, especially when it comes to scale and reliability. Often they also require tremendous expertise to operate efficiently and reliably, especially when scaling up. Of course, a significant portion of the structured data world does not require RDBMS features such as complex transactions and relations, and can be served by a simpler, much more agile system. Such a simple structured storage system, for example, does not require the use of a rigid schema and can allow attributes and indexes to be adapted on the fly. This has led to the creation of Amazon SimpleDB, where customers get the benefits of such a simple, scalable structured storage system without having to worry about replication, backups, buffer cache optimizations, database resizing, etc.
There are several applications and services that do need the feature richness of an RDBMS. Until now they were served through the use of the Relational Database AMIs that are available for Amazon EC2. These AMIs can be launched to create a compute instance with database technologies such as Vertica, Oracle, DB2, SQL Server, Sybase, and PostgreSQL. These RDBMS are best used in concert with the Amazon Elastic Block Store (EBS) to create a scalable and reliable storage volume that can be used for persisting the databases.
As I mentioned earlier, running your own database system efficiently and reliably requires expertise and dedication of resources. Quite a few of our AWS customers are running relational databases, either because they require the specific relational functionality or because they are using software packages that have been designed with an RDBMS as the database solution. These customers typically spend a significant amount of time on database management. Indeed, for many of these customers database management is yet another form of "muck": the tremendous amount of work they have to do that doesn't differentiate them and prevents them from focusing more on delivering value with their product. For these customers who require a relational database but do not have a need to exert complete administrative control over their database server, there is now another option: the Amazon Relational Database Service (Amazon RDS).
Amazon Relational Database Service
Amazon RDS provides a MySQL 5.1 relational database in the cloud. It provides cost-efficient and resizable capacity, while managing time-consuming database administration tasks for customers. The service takes much of the hassle out of setting up and managing relational databases, such as backups and code patching, freeing up its users to focus on their applications and business.
Amazon RDS provides the full capabilities of a MySQL database, which means that libraries, applications and tools that have been designed for use with MySQL can be used without modification. This makes it very simple for customers to start using Amazon RDS. As with all AWS services, Amazon RDS is a scalable resource; its storage, processing power and memory usage can be adjusted on demand and the customer only pays for those resources that have been used.
Amazon RDS is a very important addition to our offering of database solutions as it addresses a significant stumbling block for many of our customers: the management of relational databases. Amazon RDS makes this much simpler, which will free up our customers' resources to focus on contributions that really matter to their own customers.
AWS customers now have three database solutions available:
- Amazon RDS for when the application requires a relational database but you want to reduce the time you spend on database management. Amazon RDS automates common administrative tasks to reduce your complexity and total cost of ownership. Amazon RDS allows you to manage your database compute and storage resources with a simple API call, and only pay for the infrastructure resources you actually consume.
- Amazon EC2 - Relational Database AMIs for when the application requires the use of a particular relational database and/or when the customer wants to exert complete administrative control over their database. An Amazon EC2 instance can be used to run a database, and the data can be stored within an Amazon Elastic Block Store (Amazon EBS) volume. Amazon EBS is a fast and reliable persistent storage feature of Amazon EC2. Available AMIs include IBM DB2, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, Sybase, and Vertica.
- Amazon SimpleDB for applications that do not require a relational model, and that principally demand index and query capabilities. Amazon SimpleDB eliminates the administrative overhead of running a highly available production database, and is unbound by the strict requirements of an RDBMS. With Amazon SimpleDB, you store and query data items via simple web services requests, and Amazon SimpleDB does the rest. In addition to handling infrastructure provisioning, software installation and maintenance, Amazon SimpleDB automatically indexes your data, creates geo-redundant replicas of the data to ensure high availability, and performs database tuning on customers' behalf. Amazon SimpleDB also provides no-touch scaling. There is no need to anticipate and respond to changes in request load or database utilization; the service simply responds to traffic as it comes and goes, charging only for the resources consumed.
More details at the Amazon RDS detail page and the AWS developer blog. Other relevant readings are James Hamilton's posting and the RightScale blog.
In the past week both Vivek Kundra, the U.S. CIO, and Casey Coleman, the CIO of the GSA, have made very strong statements in supporting the use of cloud computing to power Federal programs. A good example is today's announcement about apps.gov. In conversations with Vivek and Casey, I am struck every time by how much their observations that Federal CIOs are focused too much on infrastructure issues are similar to the observations within Amazon a number of years ago that motivated us to develop the AWS Infrastructure services. At that time, Amazon engineering teams focused more than 70% of their work effort on keeping their infrastructure efficient, scalable and reliable, which were important, but non-differentiating tasks. The development of the Infrastructure Services such as Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2) abstracted the "muck" away from our teams so that they could focus on delivering true value for Amazon customers.
It is exciting to hear these CIOs talk about how cloud computing can help the Federal Government focus on those activities that deliver real value for its citizens. Since the launch of those first AWS services, more than 3.5 years ago, we have seen companies of every size, from startups to Fortune 100 companies, from innovative media companies to efficient financial services organizations to large scale pharmaceutical companies, become able to focus more and more on delivering value to their customers because of the use of our cloud services. We are excited and looking forward to counting the Federal Government among our customers and helping them achieve their goals.
Next to the ability to focus more on delivering value instead of managing infrastructure, the other benefits of cloud computing will also become of great importance to the public sector: the cost savings that they will be able to achieve can be immediately applied to truly meaningful programs and the self-service and elasticity of the cloud will help them bring programs to market much faster than they were ever able to do before. On stage with me at Gov 2.0 Casey mentioned that at this moment about 45% of Federal Computing projects could be considered for powering by the cloud, so the opportunities for the government to reduce cost and become more agile are significant.
I am looking forward to working closely with the Federal CIOs to make sure our services can meet the requirements that can make them successful in their quest.
At this 3rd anniversary of the launch of Amazon Elastic Compute Cloud (Amazon EC2), it is amazing to see the impact this service has had on the industry. It is truly disruptive technology and its impact has reached far beyond a pure technology offering as the benefits of the cloud have changed the way we view IT Infrastructure. As one of the CIOs at the ACM Cloud Computing Roundtable summarized it: "IT used to be the blocker in anything we did, but with our shift to the cloud IT is now the enabler." From young businesses and established enterprises to hospitals and government agencies, all are equally enthusiastic cloud customers for whom IT infrastructure has changed forever.
Even though we keep rolling out new services and features, and several existing AWS services are already very successful, this is still Day One. We are only at the brink of what is possible to deliver in the cloud and at Amazon we continue to innovate to make this future a reality.
We continuously listen to our customers to make sure our roadmap matches their needs. One important piece of feedback that mainly came from our enterprise customers was that the transition to the cloud of more complex enterprise environments was challenging. We made it a priority to address this and have worked hard in the past year to find new ways to help our customers transition applications and services to the cloud, while protecting their investments in their existing IT infrastructure.
Protecting investments during the transition
Most enterprises with a datacenter practice have invested significantly over the past decade into the management of their systems and applications. CIOs of Fortune 500 companies are responsible for hundreds if not thousands of applications running in a variety of locations. Keeping track of those resources and managing access to them is a daunting task that continues to require significant investment.
The CIO of a large financial services company in the Northeast explained to me that his teams manage close to 3000 applications and services in 27 different locations. Consolidation of applications, resources and locations is a process that never stops in a world where mergers and acquisitions happen frequently. For him the cloud is attractive as a target for his consolidated services: it allows him to significantly reduce both his capital and operational costs, while gaining significant flexibility and reliability with resources that are globally distributed, without the headache of owning and maintaining them.
He has set the guideline that their current data center infrastructure should not expand any further and that all new development will target the cloud. He expects that the process of moving his existing applications and services to the cloud will take time to complete, as his road map is driven by many internal and external factors. And there are certainly some legacy applications that may never move. He has set the goal of moving 20% of his applications into the cloud by the end of 2010, but to meet this goal he needed to find a solution for a significant obstacle: how to integrate applications running in the cloud into his existing management frameworks. In his world, this especially applies to those management practices that manage policy-driven access controls and required, cross-application regulatory auditing.
This story is typical of many of the conversations I have had with CIOs around the globe. They have bought into the cloud as a target for a significant portion of their services, as the benefits are too obvious to ignore, and most expect that their transition will be a continuous process. They would accelerate the adoption of cloud services if they could access a form of cloud that would give them the best of both worlds: the flexibility and cost-effectiveness of accessing a virtually infinite pool of resources without owning it, while being able to integrate those resources into their existing datacenter environments such that they could continue to leverage existing investments in their management and control infrastructure.
Private Cloud is not the Cloud
These CIOs know that what is sometimes dubbed "private cloud" does not meet their goal as it does not give them the benefits of the cloud: true elasticity and capex elimination. Virtualization and increased automation may give them some improvements in utilization, but they would still be holding the capital, and the operational cost would still be significantly higher.
I often get asked to define "The Cloud," especially because of the many permutations that different vendors use in trying to make their existing businesses look like a cloud offering. I define the cloud by its benefits, as those are very clear. What are called private clouds have few of these benefits and as such, I don't think of them as true clouds.
The cloud:
- Eliminates Cost. The cloud changes capital expense to variable expense and lowers operating costs. The utility-based pricing model of the cloud combined with its on-demand access to resources eliminates the need for capital investments in IT Infrastructure. And because resources can be released when no longer needed, effective utilization rises dramatically and our customers see a significant reduction in operational costs.
- Is Elastic. The ready access to vast cloud resources eliminates the need for complex procurement cycles, improving the time-to-market for its users. Many organizations have deployment cycles that are counted in weeks or months, while cloud resources such as Amazon EC2 only take minutes to deploy. The scalability of the cloud no longer forces designers and architects to think in resource-constrained ways and they can now pursue opportunities without having to worry how to grow their infrastructure if their product becomes successful.
- Removes Undifferentiated "Heavy Lifting." The cloud lets its users focus on delivering differentiating business value instead of wasting valuable resources on the undifferentiated heavy lifting that makes up most of IT infrastructure. Over time Amazon has invested over $2B in developing technologies that could deliver security, reliability and performance at tremendous scale and at low cost. Our teams have created a culture of operational excellence that powers some of the world's largest distributed systems. All of this expertise is instantly available to customers through the AWS services.
Elasticity is one of the fundamental properties of the cloud that drives many of its benefits. While virtualization has tremendous benefits to the enterprise, certainly as an important tool in server consolidation, it by itself is not sufficient to give the benefits of the cloud. To achieve true cloud-like elasticity in a private cloud, such that you can rapidly scale up and down in your own datacenter, will require you to allocate significant hardware capacity. While to your internal customers it may appear that they have increased efficiency, at the company level you still own all the capital expense of the IT infrastructure. Without the diversity and heterogeneity of the large number of AWS cloud customers to drive a high utilization level, it can never be a cost-effective solution.
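The utilization argument can be illustrated with a small calculation. The sketch below compares two workloads that each provision for their own peak with the same workloads sharing one pool. The demand numbers and names are invented for this example and only serve to show the effect; real utilization depends on how many workloads share the pool and how different their patterns are.

```python
def utilization(demand, capacity):
    """Average fraction of the provisioned capacity that is actually used."""
    return sum(demand) / (capacity * len(demand))


# Hypothetical demand, in servers, over four periods of a day.
daytime_app = [10, 80, 100, 20]
nightly_batch = [90, 10, 5, 70]

# Separate datacenters: each workload must own enough capacity for its own peak.
separate_capacity = max(daytime_app) + max(nightly_batch)

# Shared pool: capacity only has to cover the peak of the combined demand.
pooled_demand = [a + b for a, b in zip(daytime_app, nightly_batch)]
pooled_capacity = max(pooled_demand)

separate_util = utilization(pooled_demand, separate_capacity)
pooled_util = utilization(pooled_demand, pooled_capacity)
print(f"separate: {separate_capacity} servers, {separate_util:.0%} utilized")
print(f"pooled:   {pooled_capacity} servers, {pooled_util:.0%} utilized")
```

Because the two workloads peak at different times, the shared pool needs far fewer servers for the same work. A single company running a private cloud rarely has enough uncorrelated workloads to get this effect on its own.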
We have been listening very closely to the real requirements that our customers have and have worked closely with many of these CIOs and their teams to understand what solution would allow them to treat the cloud as a seamless extension of their datacenter, where their standard management practices can be applied with limited or no modifications. This needs to be a solution where they get all the benefits of cloud as mentioned above while treating it as a part of their datacenter.

Introducing Amazon Virtual Private Cloud
We have developed Amazon Virtual Private Cloud (Amazon VPC) to allow our customers to seamlessly extend their IT infrastructure into the cloud while maintaining the levels of isolation required for their enterprise management tools to do their work.
With Amazon VPC you can:
- Create a Virtual Private Cloud and assign an IP address block to the VPC. The address block needs to be a CIDR block such that it will be easy for your internal networking to route traffic to and from the VPC instance. These are addresses you own and control, most likely as part of your current datacenter addressing practice.
- Divide the VPC addressing up into subnets in a manner that is convenient for managing the applications and services you want to run in the VPC.
- Create a VPN connection between the VPN Gateway that is part of the VPC instance and an IPSec-based VPN router on your own premises. Configure your internal routers such that traffic for the VPC address block will flow over the VPN.
- Start adding AWS cloud resources to your VPC. These resources are fully isolated and can only communicate to other resources in the same VPC and with those resources accessible via the VPN router. Accessibility of other resources, including those on the public internet, is subject to the standard enterprise routing and firewall policies.
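The addressing part of these steps can be sketched with Python's standard `ipaddress` module. This is only a planning illustration and does not call any AWS API: the 10.0.0.0/16 block, the /24 subnet size, and the `routes_over_vpn` helper are assumptions chosen for this example, and your own block would come from your datacenter addressing plan.

```python
import ipaddress

# Hypothetical CIDR block assigned to the VPC (an assumption for this example).
vpc_block = ipaddress.ip_network("10.0.0.0/16")

# Divide the block into /24 subnets and take the first three for applications.
subnets = list(vpc_block.subnets(new_prefix=24))[:3]
for subnet in subnets:
    print(subnet, "with", subnet.num_addresses, "addresses")


def routes_over_vpn(address):
    """Mimic the internal router rule: VPC-bound traffic goes over the VPN."""
    return ipaddress.ip_address(address) in vpc_block


print(routes_over_vpn("10.0.1.25"))    # inside the VPC block
print(routes_over_vpn("192.168.1.1"))  # outside the VPC block
```

Keeping the VPC in a single CIDR block means the internal routers need only one rule to send traffic for every subnet over the VPN connection.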
Amazon VPC offers customers the best of both the cloud and the enterprise managed data center:
- Full flexibility in creating a network layout in the cloud that complies with the manner in which IT resources are managed in your own infrastructure.
- Isolating resources allocated in the cloud by only making them accessible through industry standard IPSec VPNs.
- Familiar cloud paradigm to acquire and release resources on demand within your VPC, making sure that you only use those resources you really need.
- Only pay for what you use. The resources that you place within a VPC are metered and billed using the familiar pay-as-you-go approach at the standard pricing levels published for all cloud customers. The creation of VPCs, subnets and VPN gateways is free of charge. VPN usage and VPN traffic are also priced using the familiar usage-based structure.
- All the benefits from the cloud with respect to scalability and reliability, freeing up your engineers to work on things that really matter to your business.
For more details on Amazon Virtual Private Cloud, visit the Amazon VPC detail page and the posting on the AWS developer weblog. For how our partners view Amazon VPC, see for example the posting at RightScale.
And happy birthday to Amazon EC2!
Ingrained in the DNA of the Amazon Technologist is a single-minded focus on the needs of our customers. The Amazon development process is even called "Working from the customer backwards". Essential in this process is a good understanding of what the customers need in terms of new services, new features for existing services, or different approaches to things that we are already doing. We collect this feedback continuously from various sources: the AWS forums, the AWS Premium Support Team, Amazonians on the road talking to customers, solution architects helping to define customer architectures, ISV partners building on our services, system integration partners who relay customer needs, advisory boards, and of course the Amazon ecommerce engineers building on the AWS platform.
Once a year, however, we take a moment to make sure that everyone who wants to give their input into the direction of Amazon Web Services has the opportunity to do so. We have developed a survey that helps us define what is really important to our current and future customers. If you have feedback that you would like to give to the Amazon Web Services team, this survey is an excellent place to do so.
Join recruiters and hiring managers from several of Amazon's global offices on July 14, 2009. We'll be in-world from 6am through midnight (Pacific/Seattle time) for the first ever Amazon Second Life Job Fair. This free event is a unique opportunity for candidates to have direct access to hiring managers and recruiters from around the world! We are looking across all levels of technical and non-technical profiles - from hands-on engineers to program managers and game-changing principal architects. Visit our U.S. career site at www.amazon.com/careers for open U.S. positions and links to our global careers pages, then join us in-world at www.bit.ly/AmazonJobFair on July 14. We look forward to meeting your avatar!
More details in Jeff Barr's post over at the AWS blog
Amazon careers. Work hard. Have fun. Make history
Image of the Amazon Developer Island in Second Life by Tao Takashi
http://www.flickr.com/photos/taotakashi/ / CC BY-NC-SA 2.0
Before networks were everywhere, the easiest way to transport information from one computer in your machine room to another was to write the data to a floppy disk, run to the other computer and load the data there from that floppy. This form of data transport was jokingly called "sneaker net". It was efficient because networks only had limited bandwidth and you wanted to reserve that for essential tasks.
In some ways the computing world has changed dramatically; networks have become ubiquitous and their latency and bandwidth capabilities have improved immensely. Alongside this growth in network capabilities we have been able to grow something else to even bigger proportions, namely our datasets. Gigabyte datasets are considered small, terabyte datasets are commonplace, and we see several customers working with petabyte-size datasets.
No matter how much we have improved our network throughput in the past 10 years, our datasets have grown faster, and this is likely to be a pattern that will only accelerate in the coming years. While networks may improve another order of magnitude in throughput, it is certain that datasets will grow two or more orders of magnitude in the same period of time.
At the same time, processing large amounts of data has become commonplace. Where this used to be the domain of Physics and Biotech researchers or maybe business intelligence, now increasingly other domains are being driven by large datasets. In research we see that traditional social sciences such as psychology and history are moving to become data driven. In the commercial world, for example, no ecommerce site can function anymore without mining massive amounts of data to optimize recommendations to its customers. In the systems management domain, datasets are also growing faster and faster; consequently, backup and disaster recovery have to deal with increasingly large sets. Log files and monitoring also spew out more and more relevant data.
Many of our customers have large datasets and would love to move them into our storage services and process them in Amazon EC2. However, moving these large datasets over the network can be cumbersome, as becomes clear when you look at typical network speeds and how long it would take to move a terabyte dataset.
Depending on the network throughput available to you and the data set size it may take rather long to move your data into Amazon S3. To help customers move their large data sets into Amazon S3 faster, we offer them the ability to do this over Amazon's internal high-speed network using AWS Import/Export.
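To get a feel for the numbers, here is a minimal back-of-the-envelope sketch in Python. The function name, the list of link speeds, the decimal definition of a terabyte and the 80% utilization default are all assumptions made for illustration. They are not figures published by AWS, and real transfers will also be affected by protocol overhead and latency.

```python
SECONDS_PER_DAY = 86_400
TERABYTE = 10**12  # assumption: decimal terabyte (10^12 bytes)


def transfer_days(dataset_bytes, link_bits_per_second, utilization=0.8):
    """Estimate how many days it takes to move a dataset over a network link.

    utilization is the assumed fraction of the nominal link speed that is
    actually available for the transfer (hypothetical default of 80%).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    effective_bits_per_second = link_bits_per_second * utilization
    seconds = dataset_bytes * 8 / effective_bits_per_second
    return seconds / SECONDS_PER_DAY


# Illustrative nominal link speeds in bits per second (assumed examples).
LINKS = {
    "T1 (1.544 Mbps)": 1.544e6,
    "10 Mbps": 10e6,
    "T3 (44.736 Mbps)": 44.736e6,
    "100 Mbps": 100e6,
    "1000 Mbps": 1000e6,
}

if __name__ == "__main__":
    for name, speed in LINKS.items():
        days = transfer_days(TERABYTE, speed)
        print(f"{name:>18}: {days:8.2f} days per terabyte")
```

The estimate scales linearly with dataset size, so a dataset ten times larger takes ten times as long over the same link. That is why shipping devices becomes attractive once datasets grow faster than the available bandwidth.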
AWS Import/Export allows you to ship your data on one or more portable storage devices to be loaded into Amazon S3. For each portable storage device to be loaded, a manifest explains how and where to load the data, and how to map files to Amazon S3 object keys. After loading the data into Amazon S3, AWS Import/Export stores the resulting keys and MD5 checksums in log files so that you can check whether the transfer was successful.
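As an illustration of that last check, the sketch below compares local files against a set of expected MD5 checksums. It does not parse the actual AWS Import/Export log format, which is not described here. Instead it assumes you have already extracted the log into a plain dictionary that maps object keys to MD5 hex digests, and that each key is also the file's path relative to a local root directory. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(local_root, key_to_md5):
    """Return the keys whose local file is missing or has a different MD5.

    key_to_md5 is an assumed, pre-parsed mapping of object key to MD5 hex
    digest; each key is assumed to be a path relative to local_root.
    """
    root = Path(local_root)
    mismatches = []
    for key, expected in key_to_md5.items():
        local_file = root / key
        if not local_file.is_file() or md5_hex(local_file) != expected.lower():
            mismatches.append(key)
    return mismatches
```

An empty list from `find_mismatches` means every listed key has a local file with a matching checksum. Any key that is returned is worth re-checking before you delete your local copy.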
AWS Import/Export is of great help to many of our customers who have to handle large data sets. We continue to listen to our customers to make sure we are adding features, tools and services that help them solve real problems. For more information on AWS Import/Export visit the detail page.
For more background on the evolution of large datasets and the challenges of moving them over the network, you should read some of the papers by and interviews with Jim Gray, who was a pioneer in the area of computing.
