The technical wizardry behind Google's successful search engine may come down to a blindingly obvious insight: PCs crash.
On Wednesday, Urs Hoelzle, a vice president of engineering and of operations at the search giant, shed some light on how Google's data centers operate. Many people consider the company's operations expertise more valuable than the actual search algorithms that launched the enterprise.
Hoelzle spoke at EclipseCon, a conference for application programmers that runs through Thursday here.
Google has been able to build out its computing infrastructure for millions, rather than tens of millions, of dollars by buying relatively cheap machines. Looking at hardware costs, company engineers saw that purchasing a few high-end servers, with eight or more powerful processors, costs significantly more than buying dozens of simpler "commodity" servers.
The trick is to make these racks of hardware operate in tandem and to ensure that the failure of one machine does not derail an operation, such as returning a search query or serving up an ad.
Consider a home PC, Hoelzle said. Optimistically, a consumer PC might crash once in three years from a software glitch or hardware problem.
"At Google scale...if you have thousands of PCs, you can expect one (failure) a day," he said. "So you better deal with that in an automated way, or you will have service outages."
Google, known for its rigorous hiring practices aimed at attracting the brightest minds in computer science, has created a number of software tools to manage its computing installation.
The company wrote its own file system, called Google File System, which is optimized for handling large, 64-megabyte blocks of data. Significantly, the file system was designed to assume that a failure, such as a failed disk or an unplugged network cable, can happen at any time.
Data is replicated in three places, and a "master" machine can locate the copies of a piece of data, such as a keyword index, if the original is out of commission.
"You make the software tolerate failures. If you can expect failures, then this is what makes cheap commodity PCs viable for Internet services," Hoelzle said.
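Hoelzle's one-failure-a-day estimate can be checked with simple arithmetic. The sketch below is illustrative only: it assumes every machine fails independently about once every three years, and the fleet sizes and the function name `expected_failures_per_day` are made up for the example rather than taken from Google.

```python
# Back-of-the-envelope check of the failure rate quoted in the article.
# Assumption: each PC fails independently about once every three years.

DAYS_PER_YEAR = 365
YEARS_BETWEEN_FAILURES = 3  # Hoelzle's optimistic figure for a consumer PC


def expected_failures_per_day(machine_count: int) -> float:
    """Average number of machine failures per day across the whole fleet."""
    daily_failure_rate = 1 / (YEARS_BETWEEN_FAILURES * DAYS_PER_YEAR)
    return machine_count * daily_failure_rate


if __name__ == "__main__":
    # Hypothetical fleet sizes, chosen only to show how the rate scales.
    for machine_count in (1, 1_000, 10_000):
        rate = expected_failures_per_day(machine_count)
        print(f"{machine_count:>6} machines: {rate:.2f} failures per day")
```

Under that assumption, a fleet of about 1,000 machines already averages roughly 0.9 failures a day, which is consistent with the "thousands of PCs, one failure a day" rule of thumb.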
Google's PC servers, which number in the thousands, run a stripped-down version of Linux that is based on the Red Hat distribution but is really just the operating system kernel, modified for Google, he added.
The company has also devised a system for handling massive amounts of data and returning rapid responses to queries. Google splits the Web into millions of pieces, or "shards" in Google tech speak, which are replicated in case of failure.
Not surprisingly, the company creates an index of words that appear on the Web, which it stores as an array of large files. But it also has document servers, which hold copies of Web pages that Google crawls and downloads.
Another important engineering feat at Google is making it straightforward to write programs that run across thousands of servers, according to Hoelzle. Normally, building applications to run in a "parallel" configuration of servers requires specialized tools and skills.
Google's programming tool, called MapReduce, which automates the task of recovering a program in case of a failure, is critical to keeping the company's costs down.
"Cost is really the sum of what the equipment you need to do the work costs and how much programming time you need to put into getting something useful," Hoelzle said, adding that Google has started using MapReduce more widely over the past year.
Finally, Google has created "batch" job scheduling software that acts as a sort of taskmaster for millions of operations. Called the Global Work Queue, it breaks up computing jobs into many smaller tasks and distributes them across machines.
For all its built-in redundancy in case of failure, the system doesn't address all problems, Hoelzle revealed. During the presentation, he showed a photo of six fire trucks responding to an emergency at a Google data center in an undisclosed location.
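To make the MapReduce idea described above more concrete, here is a minimal single-process sketch of building a word index from crawled pages. It is not Google's API: the names `map_words`, `reduce_postings`, `run_with_retry`, and `build_index` are hypothetical, and retrying a failed task in a loop merely stands in for rescheduling it on another machine.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_words(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Map step: emit one (word, doc_id) pair per distinct word on a page."""
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]


def reduce_postings(word: str, doc_ids: Iterable[str]) -> List[str]:
    """Reduce step: collapse all pairs for one word into a sorted page list."""
    return sorted(set(doc_ids))


def run_with_retry(task: Callable, *args, max_attempts: int = 3):
    """Re-run a task that raises RuntimeError (assumed to model a crash)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(*args)
        except RuntimeError as error:
            last_error = error
    raise RuntimeError("task failed on every attempt") from last_error


def build_index(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Run every map task, group the output by word, then run reduce tasks."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, text in pages.items():
        for word, emitted_id in run_with_retry(map_words, doc_id, text):
            grouped[word].append(emitted_id)
    return {
        word: run_with_retry(reduce_postings, word, doc_ids)
        for word, doc_ids in grouped.items()
    }


if __name__ == "__main__":
    # Made-up pages, used only to exercise the sketch.
    pages = {"a": "cheap PCs crash", "b": "PCs fail"}
    print(build_index(pages))
```

The point of the design is that the programmer writes only the two small functions, while the surrounding framework (here, `build_index` and `run_with_retry`) takes care of distributing the work and recovering from failures.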
In a follow-up interview with CNET News.com, Hoelzle said the cost of power is another important factor in Google's data center designs.
"The physical cost of operations, excluding people, is directly proportional to power costs," he said. "(Power) becomes a factor in running cheaper operations in a data center. It's not just buying cheaper components but you also have to have an operating expense that makes sense."
 
 
