Analysis & Commentary:

QUALITY OF SEARCH RESULTS IS OUR PRIMARY GOAL
by Christoph Poeppe, Special to DSstar

Search engines, special programs that are used via the net, have become indispensable guides to the chaotic diversity of the World Wide Web. When a user enters a keyword, a search engine delivers a list of Webpages which have something important to say about the keyword (ideal case), or which merely contain the keyword somewhere (real case). Approximately one year ago the search engine Google, www.google.com or www.google.de, also became well known in Europe by word of mouth and soon became the most requested search engine of all. The name Google alludes to the American nickname "googol" for 10^100 -- which is a slight exaggeration for the current number of registered Webpages.

Monika Henzinger, director of research at Google Inc., will speak about the current state of the art of search engines at the 17th International Supercomputer Conference, ISC2002, in Heidelberg, www.supercomp.de. Dr. Henzinger has been invited to give a keynote at ISC2002 on Thursday, June 20, 2002, on the topic "Indexing the Web -- A Challenge for Supercomputing".

This interview for HPCwire was conducted by Christoph Poeppe, editor of the magazine "Spektrum der Wissenschaft", which is the German version of "Scientific American".

HPCwire: How big is Google today?

HENZINGER: We have 3 billion pages in our repository. Among those there are 700 million newsgroup articles dating far back into the past, which we bought from Deja-News, 300 million images, and over 2 billion Webpages.

HPCwire: And you have stored all of them in your database?

HENZINGER: Yes, in compressed form.

HPCwire: Is this the whole Web?

HENZINGER: No, not at all! Actually the Web is infinite. There exist databases which can create a large number of Webpages on demand. Of course, it is useless to have all of them in the search engine. We restrict ourselves to pages of high quality.

HPCwire: What is the quality measure?

HENZINGER: The PageRank. This is a kind of grade we attribute to each page, independent of the queries for which the page could be relevant. The quality of a page increases with the number of other pages pointing to it and with the quality of these other pages. Moreover, the PageRank essentially determines the order in which Google presents search results to the user.

HPCwire: Is it possible to manipulate the calculation of the PageRanks, for example by setting up something like citation cartels on the net?

HENZINGER: Spammers at least try it again and again. For instance, there are a lot of queries for "Britney Spears". Therefore many people try to increase their PageRank in order to appear among the top answers for "Britney Spears", even if they only sell sneakers.

HPCwire: What are you doing about it?

HENZINGER: If we see an obvious abuse we take the corresponding page out. This is in the interest of our users for whom we want to preserve the quality.

HPCwire: Do you also take other steps apart from these individual corrections?

HENZINGER: Yes, but we do not discuss them in public, since we do not want to get into an arms race with the Spammers.

HPCwire: How long has Google existed?

HENZINGER: The company was founded three years ago. More than two years ago we went public. The news about us has spread by word of mouth. By now half of the queries come from outside the USA, 12 percent from German-speaking countries alone. We answer over 150 million queries a day, either directly or via our partners. If, for instance, the search engine Yahoo does not find a keyword in its own index, it passes the query on to us and returns our answer to the user.

HPCwire: Which hardware runs your system?

HENZINGER: We have more than 10,000 PCs, distributed over four data centers. Our operating system is Linux.

HPCwire: How often do you check whether the Webpages you list still exist?

HENZINGER: We update our database every 28 days. Moreover, there are some very busy Webpages which we visit daily. Every 28 days we recreate the index which, for every word, lists all the Webpages containing this word. If you enter two words into the search field, you are presented with the intersection of both lists, sorted by PageRank and a few other criteria. In particular, it can happen that you get the homepage of a company which does not even contain the name of the company in readable form, but maybe only as part of an image. But from the many Webpages which point to this homepage and quote the name of the company we know that this has to be the homepage, and present it that way.
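The index lookup described in this answer can be illustrated with a minimal Python sketch. It is an assumption-laden toy, not Google's implementation: the function names build_index and search, the whitespace tokenization, and the sample pages and scores are all hypothetical, and ranking uses only a given per-page score, leaving out the "few other criteria" mentioned above.

```python
from collections import defaultdict


def build_index(pages):
    """Map every word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, scores, query):
    """Return pages containing ALL query words, best score first."""
    words = query.lower().split()
    if not words:
        return []
    # Intersect the per-word page lists, as described in the interview.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    # Sort by descending score; the page id breaks ties deterministically.
    return sorted(hits, key=lambda p: (-scores.get(p, 0.0), p))


# Hypothetical sample data; the scores stand in for precomputed PageRanks.
pages = {
    "a": "maple syrup from Maine",
    "b": "maple trees",
    "c": "fresh maple syrup",
}
scores = {"a": 0.2, "b": 0.5, "c": 0.3}

index = build_index(pages)
results = search(index, scores, "maple syrup")  # ['c', 'a']
```

Page "b" has the highest score but is dropped because it does not contain "syrup"; the score only orders pages that survive the intersection.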

HPCwire: How much effort does this indexing take?

HENZINGER: A lot. About one week.

HPCwire: How many employees does your company have?

HENZINGER: Approximately 350. Up to now the number of employees has doubled each year.

HPCwire: How is Google financed?

HENZINGER: First, by conventional advertising: one-line, running-text ads. They only appear for queries like "cars", that is, if they might be interesting for the person asking; we call this keyword targeting. Second, everybody can buy an advertisement online with their credit card. If you want to congratulate your wife on her birthday with Google, you can place an advertisement which only appears for her name. However, the typical advertiser is a small producer of maple syrup in Maine. His advertisement will be displayed to the right of the search results if the user types "maple syrup". Third, by search services. For instance, Yahoo pays us for displaying our search results on their page. Some companies want to set up search functionality for their Webpage, but do not want to program it themselves. For these companies we build a separate index and answer the search queries the company receives. Fourth, we recently started to sell our products for internal use in company intranets. We are one of the few startup companies that are really well off.

HPCwire: Are there any new projects?

HENZINGER: There are several. Speech input, for instance. The user speaks his question into a microphone and gets the answers on the screen; in the future there may even be spoken answers. Another project is our news search. Our machines read daily newspapers and group articles on the same topic from many different countries. This is really interesting, because the coverage usually is locally biased. A regular comparison can expand your horizon a lot. Click on "News and Services" and then "Try out our beta news search". Or user interfaces. How can you convince a user to type in more than two words? The more words he tells us, the better we can serve him.

Glossary:

Repository: a very large database, which is used by Google to store the contents of the registered Webpages, pictures, and news articles.

Crawling: the systematic search of the Web for Webpages via the links they contain. Google's computers regularly crawl through more than 2 billion Webpages per week.

PageRank: a measure for the quality of a Webpage, more precisely: for the respect it gets inside the Web. To determine the PageRank of Webpage A, one takes the PageRank of each Webpage T_j pointing to A, divides it by the total number of links contained in T_j, and adds up these quotients. This sum, combined with an additive constant and a proportionality factor, gives the PageRank of A. If a user clicks from page to page at random, most often clicking a link on the current page and sometimes (this is the additive constant) choosing a completely unrelated Webpage, then the probability of entering Webpage A at some time is equal to the PageRank of A. The PageRank is thus defined by the PageRank itself -- a definition which only becomes sensible by solving a huge set of equations (more precisely: an eigenvalue problem) with all the PageRanks as unknowns.
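The self-referential definition in this glossary entry can be made concrete with a small Python sketch that solves the equations by repeated substitution (power iteration). Several details are assumptions rather than facts from the article: the commonly published form PR(A) = (1 - d)/N + d * sum of PR(T_j)/C(T_j), the damping value d = 0.85, the even redistribution of rank from pages without outgoing links, and the function name pagerank with its sample link graph.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank for a dict {page: [pages it links to]}.

    damping is the assumed probability of following a link; with
    probability 1 - damping the random surfer jumps to any page.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform guess
    for _ in range(iterations):
        # Rank sitting on pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        for source, targets in links.items():
            for target in targets:
                # Each link passes on an equal share of its source's rank.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page Web: C is pointed to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)  # 'C'
```

The ranks sum to 1, so each value can be read as the probability that the random surfer is on that page. Page D, which nobody links to, keeps only the additive share (1 - d)/N.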
