AWS and Elastic MapReduce (EMR): Netflix

Why Use Elastic MapReduce (EMR)?

EMR enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.

By reducing the cost and complexity of analyzing huge data sets, EMR also enables greater experimentation and innovation.

Case Study: Netflix

Netflix receives 50 billion daily events from Netflix-enabled televisions, mobile devices, and laptops. How do you collect and store all of that data?

Netflix streams 8 TB of data into the cloud per day. This is collected, aggregated, and pushed to Amazon S3 via a fleet of EC2 servers running Apache Chukwa.

After processing on EMR, the data is streamed back into Amazon S3, where it is accessible to other teams, including personalization/recommendation services, and to analysts through a real-time custom visualization tool called Sting.

Netflix can run its production cluster with 300 nodes during the day and expand it to 400+ in the evenings and at weekends. Using EMR's alarm capabilities, this can also be set up to happen automatically, based on the load on the cluster.
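The day/evening sizing described above can be sketched as a simple schedule. This is only an illustration: the 300 and 400 node figures come from the text, while the 08:00 to 18:00 weekday "daytime" window and the function name are assumptions, and the load-based automation mentioned above would replace the clock with a cluster metric.

```python
from datetime import datetime

DAY_NODES = 300        # daytime size quoted in the text
PEAK_NODES = 400       # "400+" evening/weekend size quoted in the text
DAY_START_HOUR = 8     # assumption: daytime starts at 08:00
DAY_END_HOUR = 18      # assumption: evening starts at 18:00


def target_nodes(now):
    """Return the desired cluster size for a given local time."""
    is_weekend = now.weekday() >= 5  # Saturday=5, Sunday=6
    is_daytime = DAY_START_HOUR <= now.hour < DAY_END_HOUR
    if is_weekend or not is_daytime:
        return PEAK_NODES
    return DAY_NODES


# Example: a weekday at noon versus the same weekday at 20:00.
weekday_noon = target_nodes(datetime(2013, 1, 23, 12, 0))
weekday_evening = target_nodes(datetime(2013, 1, 23, 20, 0))
```

A scheduler or monitoring hook could call `target_nodes` periodically and resize the cluster only when the returned value differs from the current size.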

For jobs with specific hardware or capacity requirements, analysts can spin up their own query clusters, again streaming from the same data source.

Test-tube data (Storing information in DNA)

The Economist Jan 26th 2013
LIKE all the best ideas, this one was born in a pub. Nick Goldman and Ewan Birney of the European Bioinformatics Institute (EBI) near Cambridge, were pondering what they could do with the torrent of genomic data their research group generates, all of which has to be archived. The volume of data is growing faster than the capacity of the hard drives used to hold it. “That means the cost of storage is rising, but our budgets are not,” says Dr Goldman. Over a few beers, the pair began wondering if artificially constructed DNA might be one way to store the data torrent generated by the natural stuff. After a few more drinks and much scribbling on beer mats, what started out as a bit of amusing speculation had turned into the bones of a workable scheme. After some fleshing out and a successful test run, the full details were published this week in Nature.

The idea is not new. DNA is, after all, already used to store information in the form of genomes by every living organism on Earth. Its prowess at that job is the reason that information scientists have been trying to co-opt it for their own uses. But this has not been without problems.

Dr Goldman’s new scheme is significant in several ways. He and his team have managed to set a record (739.3 kilobytes) for the amount of unique information encoded. But it has been designed to do far more than that. It should, think the researchers, be easily capable of swallowing the roughly 3 zettabytes (a zettabyte is one billion trillion or 10²¹ bytes) of digital data thought presently to exist in the world and still have room for plenty more. It would do so with a density of around 2.2 petabytes (10¹⁵) per gram; enough, in other words, to fit all the world’s digital information into the back of a lorry. Moreover, their method dramatically reduces the copying errors to which many previous DNA storage attempts have been prone.
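The "back of a lorry" claim can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted above (3 zettabytes of data, 2.2 petabytes per gram) and assumes decimal units throughout.

```python
ZETTABYTE = 10**21  # bytes
PETABYTE = 10**15   # bytes

world_data_bytes = 3 * ZETTABYTE          # estimate quoted in the article
density_bytes_per_gram = 2.2 * PETABYTE   # storage density quoted in the article

grams_needed = world_data_bytes / density_bytes_per_gram
tonnes_needed = grams_needed / 1_000_000  # 1 tonne = 1,000,000 grams

print(f"{grams_needed:,.0f} g of DNA, about {tonnes_needed:.2f} tonnes")
```

The result is a little under 1.4 tonnes of DNA, which is indeed a load a lorry could carry.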

Faithful reproduction

The trick to this fidelity lies in the way the researchers translate their files from the hard drive to the test tube. DNA uses four chemical “bases”, adenine (A), thymine (T), cytosine (C) and guanine (G), to encode information. Previous approaches have often mapped the binary 1s and 0s used by computers directly onto these bases. For instance, A and C might represent 0, while G and T signify 1. The problem is that sequences of 1s or 0s in the source code can generate repetition of a single base in the DNA (say, TTTT). Such repetitions are more likely to be misread by DNA-sequencing machines, leading to errors when reading the information back.

The team’s solution was to translate the binary computer information into ternary (a system that uses three numerals: 0, 1 and 2) and then encode that information into the DNA. Instead of a direct link between a given number and a particular base, the encoding scheme depends on which base has been used most recently (see table). For instance, if the previous base was A, then a 2 would be represented by T. But if the previous base was G, then 2 would be represented by C. Similar substitution rules cover every possible combination of letters and numbers, ensuring that a sequence of identical digits in the data is not represented by a sequence of identical bases in the DNA, helping to avoid mistakes.
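The substitution rule can be sketched in a few lines of Python. The article gives only two entries of the table (after A, a 2 becomes T; after G, a 2 becomes C), so the full table used here is an assumption chosen to agree with those two entries: each digit selects one of the three bases that follow the previous base in the cyclic order A, C, G, T. The fixed six-digit ternary expansion of each byte and the starting base are also simplifications for illustration, not the researchers' actual file encoding.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: 3**6 = 729 values is enough to cover one byte


def next_base(prev, trit):
    """Pick the base for a ternary digit; the result is never equal to prev."""
    return BASES[(BASES.index(prev) + 1 + trit) % 4]


def bytes_to_trits(data):
    """Expand each byte into a fixed-width base-3 number (simplified scheme)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits


def encode(data, start_prev="A"):
    """Translate bytes into a DNA string with no repeated adjacent bases."""
    prev = start_prev  # assumption: an agreed imaginary base before the first one
    out = []
    for trit in bytes_to_trits(data):
        prev = next_base(prev, trit)
        out.append(prev)
    return "".join(out)


def decode(dna, start_prev="A"):
    """Invert encode(): recover the ternary digits, then rebuild the bytes."""
    prev = start_prev
    trits = []
    for base in dna:
        trits.append((BASES.index(base) - BASES.index(prev) - 1) % 4)
        prev = base
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


# Even a run of identical bytes produces no run of identical bases.
dna = encode(b"\x00\x00\x00\xff")
```

Because each digit is encoded relative to the previous base, a long run of zeros in the data cycles through the bases instead of producing a string such as TTTT.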

The code then had to be created in artificial DNA. The simplest approach would be to synthesise one long DNA string for every file to be stored. But DNA-synthesis machines are not yet able to do that reliably. So the researchers decided to chop their files into thousands of individual chunks, each 117 bases long. In each chunk, 100 bases are devoted to the file data themselves, and the remainder used for indexing information that records where in the completed file a specific chunk belongs. The process also contains the DNA equivalent of the error-detecting “parity bit” found in most computer systems.

To provide yet more tolerance for mistakes, the researchers chopped up the source files a further three times, each in a slightly different, overlapping way. The idea is to ensure that each 25-base quarter of a 100-base chunk was also represented in three other chunks of DNA. If any copying errors did occur in a particular chunk, it could be compared against its three counterparts, and a majority vote used to decide which was correct. Reading the chunks back is simply a matter of generating multiple copies of the fragments using a standard chemical reaction, feeding these into a DNA-sequencing machine and stitching the files back together.
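The overlap-and-vote idea can be illustrated as follows. This is a simplified sketch, not the researchers' pipeline: it keeps only the 100 data bases of each chunk (the indexing and parity bases are left out), steps through the sequence 25 bases at a time, and votes position by position. In this simplified version the quarters at the very start and end of the sequence are covered by fewer than four chunks.

```python
from collections import Counter

SEGMENT = 25  # bases per quarter, as in the article
CHUNK = 100   # data bases per chunk (four quarters), as in the article


def overlapping_chunks(dna):
    """Return (offset, chunk) pairs, stepping one quarter at a time."""
    return [(i, dna[i:i + CHUNK])
            for i in range(0, len(dna) - CHUNK + 1, SEGMENT)]


def reconstruct(chunks, length):
    """Rebuild the sequence by majority vote at every position."""
    votes = [Counter() for _ in range(length)]
    for offset, chunk in chunks:
        for j, base in enumerate(chunk):
            votes[offset + j][base] += 1
    # "N" marks a position that no chunk covered.
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)


# Demonstration: corrupt one base in one chunk and recover the original.
original = "ACGT" * 75
chunks = overlapping_chunks(original)
offset, chunk = chunks[4]
chunks[4] = (offset, chunk[:10] + "T" + chunk[11:])  # the true base here is G
recovered = reconstruct(chunks, len(original))
```

The corrupted position is covered by four chunks, so the three intact copies outvote the damaged one.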

When the scheme was tested, it worked almost as planned. The researchers were able to encode and decode five computer files, including an MP3 recording of part of Martin Luther King’s “I have a dream” speech and a PDF version of the 1953 paper by Francis Crick and James Watson describing the structure of DNA. The one glitch was that, despite all the precautions, two 25-base segments of the DNA paper went missing. The problem was eventually traced to a combination of a quirk of DNA chemistry and another quirk in the machines used to do the synthesis. Dr Goldman is confident that a tweak to their code will avoid the problem in future.

There are downsides to DNA as a data-storage medium. One is the relatively slow speed at which data can be read back. It took the researchers two weeks to reconstruct their five files, although with better equipment it could be done in a day. Beyond that, the process can be sped up by adding more sequencing machines.

Ironically, then, the method is not suitable for the EBI’s need to serve up its genome data over the internet at a moment’s notice. But for less intensively used archives, that might not be a problem. One example given is that of CERN, Europe’s biggest particle-physics lab, which maintains a big archive of data from the Large Hadron Collider.

Store out of direct sunlight

The other disadvantage is cost. Dr Goldman estimates that, at commercial rates, their method costs around $12,400 per megabyte stored. That is millions of times more than the cost of writing the same data to the magnetic tape currently used to archive digital information. But magnetic tapes degrade and must be replaced every few years, whereas DNA remains readable for tens of thousands of years so long as it is kept somewhere cool, dark and dry—as proved by the recovery of DNA from woolly mammoths and Neanderthals.

The longer you want to store information, then, the more attractive DNA becomes. And the cost of sequencing and synthesising DNA is falling fast. The researchers reckon that, within a decade, that could make DNA competitive with other methods for (infrequently-used) archives designed to last fifty years or more.

There is one final advantage in using DNA. Modern, digital storage technologies tend to come and go: just think of the fate of the laser disc, for example. In the early 2000s NASA, America’s space agency, was reduced to trawling around internet auction sites in order to find old-style eight-inch floppy drives to get at the data it had laid down in the 1960s and 1970s. But, says Dr Goldman, DNA has endured for more than 3 billion years. So long as life—and biologists—endure, someone should know how to read it.

Amazon Elastic Block Store (Amazon EBS)

Amazon Elastic Block Store (Amazon EBS) provides persistent block storage volumes for use with Amazon EC2 instances in the AWS Cloud. Each Amazon EBS volume is automatically replicated within its Availability Zone to protect you from component failure, offering high availability and durability. Amazon EBS volumes offer the consistent and low-latency performance needed to run your workloads. With Amazon EBS, you can scale your usage up or down within minutes – all while paying a low price for only what you provision.

Amazon EBS allows you to create storage volumes and attach them to Amazon EC2 instances. Once attached, you can create a file system on top of these volumes, run a database, or use them in any other way you would use block storage. Amazon EBS volumes are placed in a specific Availability Zone, where they are automatically replicated to protect you from the failure of a single component. All EBS volume types offer durable snapshot capabilities and are designed for 99.999% availability.

Amazon EBS provides a range of options that allow you to optimize storage performance and cost for your workload. These options are divided into two major categories: SSD-backed storage for transactional workloads such as databases and boot volumes (performance depends primarily on IOPS) and HDD-backed storage for throughput intensive workloads such as MapReduce and log processing (performance depends primarily on MB/s).

SSD-backed volumes include the highest performance Provisioned IOPS SSD (io1) for latency-sensitive transactional workloads and General Purpose SSD (gp2) that balance price and performance for a wide variety of transactional data. HDD-backed volumes include Throughput Optimized HDD (st1) for frequently accessed, throughput intensive workloads and the lowest cost Cold HDD (sc1) for less frequently accessed data.
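The choice among the four volume types described above can be written down as a small decision helper. This is a rough sketch of the guidance in the text, not an official sizing rule; the function and parameter names are hypothetical.

```python
def choose_ebs_volume_type(throughput_bound, latency_sensitive=False,
                           frequently_accessed=True):
    """Map workload traits to one of the EBS volume types named in the text."""
    if throughput_bound:
        # HDD-backed: performance is measured mainly in MB/s.
        return "st1" if frequently_accessed else "sc1"
    # SSD-backed: performance is measured mainly in IOPS.
    return "io1" if latency_sensitive else "gp2"


# Examples: a latency-sensitive database and a rarely read log archive.
database_volume = choose_ebs_volume_type(throughput_bound=False,
                                         latency_sensitive=True)
archive_volume = choose_ebs_volume_type(throughput_bound=True,
                                        frequently_accessed=False)
```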

Elastic Volumes is a feature of Amazon EBS that allows you to dynamically increase capacity, tune performance, and change the type of live volumes with no downtime or performance impact. This allows you to easily right-size your deployment and adapt to performance changes.
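As an illustration, an Elastic Volumes change could be expressed as the arguments to the EC2 `modify_volume` call in boto3. The helper below only builds the request, so it runs without AWS credentials; the helper name and the volume ID are hypothetical placeholders.

```python
def build_modify_volume_request(volume_id, size_gib=None, volume_type=None,
                                iops=None):
    """Collect only the settings that should change on a live volume."""
    params = {"VolumeId": volume_id}
    if size_gib is not None:
        params["Size"] = size_gib          # new capacity in GiB
    if volume_type is not None:
        params["VolumeType"] = volume_type  # e.g. "gp2", "io1", "st1", "sc1"
    if iops is not None:
        params["Iops"] = iops              # provisioned IOPS, used with io1
    if len(params) == 1:
        raise ValueError("no change requested")
    return params


# Hypothetical volume ID, for illustration only.
request = build_modify_volume_request("vol-0123456789abcdef0",
                                      size_gib=200, volume_type="io1",
                                      iops=3000)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ec2").modify_volume(**request)
```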

For fast-performing websites, applications or databases, the combination of Amazon EC2 and EBS is a class-leading platform. Amazon understands customers and business requirements and classes itself as the most customer-centric company on earth. With services like this, maybe it is!

Amazon S3 storage

Amazon S3

S3, the Simple Storage Service, is a reliable, fast and cheap way to store “stuff” on the Internet. S3 can be used to store just about anything: XML documents, binary data, images, videos, or whatever else our customers want to store.

Amazon Simple Storage Service (Amazon S3)

  • Object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web.
  • It is designed to deliver 99.999999999% durability, and scale past trillions of objects worldwide.
  • Businesses use S3 as primary storage for cloud-native applications; as a bulk repository, or “data lake,” for analytics; as a target for backup & recovery and disaster recovery; and with serverless computing.
  • It’s simple to move large volumes of data into or out of S3.
  • Once data is stored in Amazon S3, it can be automatically tiered into lower cost, longer-term cloud storage classes like S3 Standard – Infrequent Access and Amazon Glacier for archiving.
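The automatic tiering in the last bullet is configured through a bucket lifecycle configuration. The sketch below builds one as a plain Python dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the rule ID, prefix, bucket name and day counts are illustrative assumptions.

```python
def tiering_lifecycle(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that moves ageing objects to cheaper tiers."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-down-old-objects",   # hypothetical rule name
                "Filter": {"Prefix": prefix},    # apply only to keys under prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


lifecycle = tiering_lifecycle("logs/")

# With boto3 installed and credentials configured, it could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=lifecycle)
```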

About Amazon S3

Amazon S3 stores data as objects within resources called “buckets”. You can store as many objects as you want within a bucket, and write, read, and delete objects in your bucket. Objects can be up to 5 terabytes.
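The write, read and delete cycle might look like this in Python. The `put_object`, `get_object` and `delete_object` calls use the same keyword arguments as the boto3 S3 client; `InMemoryS3` is a hypothetical stand-in included only so the sketch runs offline, and the bucket and key names are placeholders.

```python
import io


class InMemoryS3:
    """Hypothetical offline stand-in that mimics three boto3 S3 client calls."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = bytes(Body)

    def get_object(self, Bucket, Key):
        # boto3 returns the payload as a stream under the "Body" key.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

    def delete_object(self, Bucket, Key):
        self._objects.pop((Bucket, Key), None)


def write_read_delete(s3, bucket, key, payload):
    """Store an object, read it back, then remove it from the bucket."""
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    stored = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.delete_object(Bucket=bucket, Key=key)
    return stored


client = InMemoryS3()  # swap for boto3.client("s3") to talk to the real service
result = write_read_delete(client, "example-bucket", "reports/2013-01.xml",
                           b"<report/>")
```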

You can control access to the bucket (who can create, delete, and retrieve objects in the bucket for example), view access logs for the bucket and its objects, and choose the AWS region where a bucket is stored to optimize for latency, minimize costs, or address regulatory requirements.


Amazon S3 is designed as a complete storage platform. Consider the ownership value included with every GB.

Simplicity. Amazon S3 is built for simplicity, with a web-based management console, mobile app, and full REST APIs and SDKs for easy integration with third party technologies.

Durability. Amazon S3 is available in regions around the world, and includes geographic redundancy within each region as well as the option to replicate across regions. In addition, multiple versions of an object may be preserved for point-in-time recovery.

Scalability. Customers around the world depend on Amazon S3 to safeguard trillions of objects every day. Costs grow and shrink on demand, and global deployments can be done in minutes. Industries like financial services, healthcare, media, and entertainment use it to build big data, analytics, transcoding, and archive applications.

Security. Amazon S3 supports data transfer over SSL and automatic encryption of your data once it is uploaded. You can also configure bucket policies to manage object permissions and control access to your data using AWS Identity and Access Management (IAM).
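As one example of a bucket policy, the sketch below builds a policy document that rejects any request not made over SSL/TLS, using the `aws:SecureTransport` condition key. The bucket name and statement ID are placeholders, and a real deployment would normally combine this with further statements.

```python
import json


def require_tls_policy(bucket_name):
    """Build a bucket policy that denies requests sent without SSL/TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # hypothetical statement name
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object inside it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy_json = json.dumps(require_tls_policy("example-bucket"), indent=2)

# With boto3 installed and credentials configured, it could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_policy(Bucket="example-bucket",
#                                        Policy=policy_json)
```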

Broad integration. Amazon S3 is designed to integrate directly with other AWS services for security (IAM and KMS), alerting (CloudWatch, CloudTrail and Event Notifications), computing (Lambda), and analytics and database (EMR, Redshift).

Cloud Data Migration options. AWS storage includes multiple specialized methods to help you get data into and out of the cloud.

Enterprise-class Storage Management. S3 Storage Management features allow you to take a data-driven approach to storage optimization, data security, and management efficiency.

What is Cloud Computing and why bring your business there?

Whether you are running applications that share photos with mobile users or supporting critical operations, the cloud platform provides instant access to elastic, low-cost IT resources. Cloud computing means you don’t need to make large upfront investments in hardware. Instead, you can spin up exactly the right type and size of computing resources you need to power your IT infrastructure, accessing as many resources as you need and paying only for what you use, exactly like your household utilities.

Cloud Computing and How It Works

Cloud services provide a simple way to access servers, storage, databases and application services via the Internet. Cloud services platforms like Amazon Web Services (AWS) and Microsoft Azure own and maintain the networks and hardware required to power these services; you simply configure what you need.

Benefits of the cloud – Benefit from massive economies of scale


Using cloud computing means you can operate at a lower variable cost than you would by maintaining and scaling your own hardware, and you can scale up and down to meet business demand during peak and off-peak times.

Future Requirements and Limitations

Having to estimate your infrastructure capacity needs means you must spend on what you “think” the business requirement may be. As a result, you often end up either sitting on expensive idle resources or dealing with limited capacity. Cloud services give you the ability to access as much or as little as you need, and to scale up and down as required.


In a cloud computing environment, new IT resources can be spun up almost instantly, which means you reduce the time it takes to make those resources available to your developers from weeks to just minutes. This results in a dramatic increase in agility for the organization, since the cost and time it takes to experiment and develop new applications are significantly lower.

Physical benefits

No more costs for running and maintaining data centers or PoP sites.

Cloud providers let you focus on your business rather than on the heavy lifting of racking, stacking and powering servers. They also take over physical and electronic security, and even the cooling systems that can be the biggest cost of any datacenter.

Global Reach

You have the ability to deploy your application in multiple sites around the world. You can choose to have your data or application hosted in any datacenter the cloud provider has to offer, which allows faster response times than hosting all your services in one country. This means you can provide lower latency and a better experience for your customers, simply and at minimal cost.

Types of Cloud Computing

Cloud computing has three main types that are commonly referred to as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Selecting the right type of cloud computing for your needs can help you

Cloud Solutions

The move to cloud has progressed at a steady rate, and thousands of businesses have now turned to Microsoft Azure, Oracle, Google and Amazon Web Services (AWS) for solutions to build their businesses. Cloud computing platforms provide the flexibility to build your application, your way, regardless of your industry or business size. Companies can save huge amounts of resources, time and money without compromising security requirements or business availability and performance.