Amazon thinks big data was made for the cloud




For Amazon Web Services Chief Data Scientist Matt Wood, the day isn’t filled with performing data alchemy on behalf of his employer; it’s spent helping its customers. Wood helps AWS users build big data architectures that use the company’s cloud computing resources, then takes what he learns about those users’ needs and turns it into products — such as the Data Pipeline Service and Redshift data warehouse AWS announced this week.

He and I sat down this week at AWS’s inaugural re:Invent conference and talked about many things, including what he’s seen in the field and where cloud-based big data efforts are headed. Here are the highlights.



The end of constraint-based thinking

Not so long ago, computer scientists understood many of the concepts that we now call data science, but limited resources meant they were hamstrung in the types of analysis they could attempt. “That can be very limiting, very constraining when you’re working with data,” Wood said.

Now, however, data storage and processing resources are relatively inexpensive and abundant — so much so that they’ve actually made the concept of big data possible. Cloud computing has only made those resources cheaper and more abundant. The result, Wood said, is that people working with data are undergoing a shift from that mindset of limiting their data analysis to the resources they have available to one where they think about business needs first.

If they’re able to get past traditional notions of sampling and days-long processing times, he added, individuals can focus their attention on what they can do because they have so many resources available. He noted how Yelp gave developers relatively free rein with Elastic MapReduce early on, saving them from having to formally request resources just “to see if the crazy idea [someone] had over coffee is going to play out.” Yelp was able to spot a shift in mobile traffic volume years ago, and got a head start on its mobile efforts because of that, Wood added.

Data problems aren’t just about scale

Generally speaking, Wood said, solving customers’ data problems isn’t just about figuring out how to store ever greater volumes at ever cheaper prices. “You don’t have to be at a petabyte scale in order to get some insight on who’s using your social game,” he said.

In fact, access to limitless storage and processing is a solution to one problem that actually creates another. Companies want to keep all the data they generate, and that creates complexity, Wood explained. As that data piles up in various repositories — perhaps in Amazon’s S3 and DynamoDB services, as well as on physical machines within a company’s own data center — moving it from place to place in order to reuse it becomes a difficult process.

Wood said AWS built its new Data Pipeline Service in order to address this problem. Pipelines can be “arbitrarily complex,” he explained — from running a simple piece of business logic against data to running whole batches through Elastic MapReduce — but the idea is to automate the movement and processing so users don’t have to build these flows themselves and then manually run them.
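To make the idea concrete, here is a minimal Python sketch of what such an automated flow does. The stage names (extract, transform, load) and the toy log data are illustrative assumptions, not the Data Pipeline Service’s actual API — the point is simply that the stages run in order without anyone kicking them off by hand.

```python
# Conceptual sketch of an automated data pipeline: stages chained together
# so movement and processing run without manual intervention.
# All names here are illustrative, not the AWS Data Pipeline API.

def extract(source):
    """Pull records out of a source store, normalizing them on the way."""
    return [record.lower() for record in source]

def transform(records):
    """Apply a simple piece of business logic: drop error records."""
    return [r for r in records if "error" not in r]

def load(records, sink):
    """Deliver the processed records to their destination store."""
    sink.extend(records)
    return sink

def run_pipeline(source, sink):
    """Run every stage in order -- the automation the service provides."""
    return load(transform(extract(source)), sink)

warehouse = []
logs = ["OK request", "ERROR timeout", "OK request"]
run_pipeline(logs, warehouse)
print(warehouse)  # only the normalized, non-error records arrive
```

A real pipeline would swap these in-memory lists for S3 buckets, DynamoDB tables, or Elastic MapReduce jobs, but the shape — source, arbitrary processing, sink, run on a schedule — is the same.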



The cloud isn’t just for storing tweets

People sometimes question the relevance of cloud computing for big data workloads, if only because any data generated on in-house systems has to make its way to the cloud over inherently slow connections. The bigger the dataset, the longer the upload time.
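A back-of-envelope calculation shows how quickly that adds up. The figures here are illustrative: a 200TB dataset (the size of the 1000 Genomes collection mentioned later) pushed over the kind of 700-megabit-per-second link Wood describes.

```python
# Back-of-envelope upload time: dataset size divided by link bandwidth.
# Figures are illustrative, not a benchmark.

def transfer_days(dataset_terabytes, link_megabits_per_sec):
    bits = dataset_terabytes * 1e12 * 8              # decimal TB -> bits
    seconds = bits / (link_megabits_per_sec * 1e6)   # bits / (bits per sec)
    return seconds / 86400                           # seconds -> days

# 200TB over a consistent 700 Mbps connection:
print(round(transfer_days(200, 700), 1))  # roughly 26.5 days
```

Even at speeds most enterprises would envy, a genuinely large dataset is weeks in transit — which is why the mitigations below matter.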

Wood said AWS is trying hard to alleviate these problems. For example, partners such as Aspera, and even some open source projects, enable customers to move large files over the internet at high speed (Wood said he’s seen consistent speeds of 700 megabits per second). This is also why AWS has eliminated data-transfer fees for inbound data, turned on parallel uploads for large files, and created its Direct Connect program with data center operators that provide dedicated connections to AWS facilities.

And if datasets are too large for all those methods, customers can just send AWS their physical disks. “We definitely receive hard drives,” Wood said.

Collaboration is the future

Once data makes its way to the cloud, it opens up entirely new methods of collaboration where researchers or even entire industries can access and work together on shared datasets too big to move around. “This sort of data space is something that’s becoming common in fields where there are very large datasets,” Wood said, citing as an example the 1000 Genomes project dataset that AWS houses.


DNAnexus’s cloud-based architecture

As we’ve covered recently, the genetics space is drooling over the promise of cloud computing. The 1000 Genomes database is only 200TB, Wood explained, but very few project leads could get the budget to store that much data and make it accessible to their peers, much less the computation power required to process it. And even in fields such as pharmaceuticals, Amazon CTO Werner Vogels told me during an earlier interview, companies are using the cloud to collaborate on certain datasets so they don’t have to spend time and money reinventing the wheel.

No more supercomputers?

Wood seemed very impressed with the work that AWS’s high-performance computing customers have been doing on the platform — work that previously would have been done on supercomputers or other physical systems. Thanks to AWS partner Cycle Computing, he noted, the Morgridge Institute at the University of Wisconsin was able to perform 116 years’ worth of computing in just one week. In the past, access to that kind of power would have required waiting in line until resources opened up on a supercomputer somewhere.
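Rough arithmetic gives a sense of the parallelism such a run implies. The calculation below is an illustration, not Cycle Computing’s actual cluster size: it simply asks how many machine-weeks fit into 116 years.

```python
# Rough scale of "116 years of computing in one week": how many machines
# must work in parallel? Illustrative arithmetic, not the actual cluster.

YEARS_OF_COMPUTE = 116
WEEKS_PER_YEAR = 365.25 / 7   # about 52.18

compute_weeks = YEARS_OF_COMPUTE * WEEKS_PER_YEAR
cores_needed = compute_weeks / 1   # all finished in one wall-clock week
print(round(cores_needed))         # on the order of 6,000 machines
```

Queuing for that many nodes on a shared supercomputer could take far longer than the week of computation itself, which is the point Wood was making.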

The collaborative efforts Wood discussed certainly facilitate this type of extreme computation, as do AWS’s continuing efforts to beef up its instances with more and more power. Whatever users might need, from the new 250GB RAM on-demand instances to GPU-powered Cluster Compute Instances, Wood said AWS will try to provide it. And because cost sometimes matters, AWS has opened Cluster Compute Instances and Elastic MapReduce to its spot market for buying capacity on the cheap.

But whatever data-intensive workloads organizations want to run, many will now look first to the cloud. Because cloud computing and big data — Hadoop, especially — have come of age roughly in parallel with each other, Wood hypothesized, they often go hand in hand in people’s minds.


