Analysis & Commentary:
QUALITY OF SEARCH RESULTS IS OUR PRIMARY GOAL
by Christoph Poeppe, Special to DSstar
Search engines, special programs that are used via the net, have become
indispensable guides in the chaotic diversity of the World Wide Web. Upon
entering a keyword, a search engine delivers a list of Webpages which have
something important to say about the keyword (ideal case), or which contain
the keyword somewhere (real case). Approximately one year ago the search
engine Google, www.google.com or www.google.de, also became well known in
Europe by word of mouth and soon became the most-used search engine of all.
The name Google alludes to "googol", the American nickname for 10^100, which
is a slight exaggeration of the current number of registered Webpages.
Monika Henzinger, director of research at Google Inc., speaks about the
current state of the art in search engines at the 17th International
Supercomputer Conference, ISC2002, in Heidelberg, www.supercomp.de.
Dr. Henzinger has been invited to keynote ISC2002 on Thursday, June 20, 2002,
and will deliver a presentation on the topic "Indexing the Web -- A
Challenge for Supercomputing".
This interview for HPCwire was conducted by Christoph Poeppe, editor of the
magazine "Spektrum der Wissenschaft", which is the German version of
"Scientific American".
HPCwire: How big is Google today?
HENZINGER: We have 3 billion pages in our repository. Among those are
700 million newsgroup articles reaching far back in time, which we
bought from Deja-News, 300 million images, and over 2 billion Webpages.
HPCwire: And you have stored all of them in your database?
HENZINGER: Yes, in compressed form.
HPCwire: Is this the whole Web?
HENZINGER: No, not at all! Actually the Web is infinite. There exist databases
which can create a large number of Webpages on demand. Of course, it is
useless to have all of them in the search engine. We restrict ourselves to
pages of high quality.
HPCwire: What is the quality measure?
HENZINGER: The PageRank. This is a kind of grade we attribute to each page,
independent of the queries for which the page could be relevant. In fact, the quality
of a page increases with the number of other pages pointing to it and the
quality of these other pages. Moreover, the PageRank essentially determines
the order in which Google presents search results to the user.
HPCwire: Is it possible to manipulate the calculation of the PageRanks, for
example by putting something like citation cartels onto the net?
HENZINGER: Spammers certainly try, again and again. For instance, there are a
lot of queries for "Britney Spears". Therefore many people try to increase
their PageRank in order to land in one of the top places among the results for
"Britney Spears", even if they only sell sneakers.
HPCwire: What are you doing about it?
HENZINGER: If we see obvious abuse, we take the corresponding page out. This
is in the interest of our users, for whom we want to preserve the quality.
HPCwire: Do you also take other steps apart from these individual corrections?
HENZINGER: Yes, but we do not discuss them in public, since we do not want to
get into an arms race with the spammers.
HPCwire: How long has Google existed?
HENZINGER: The company was founded three years ago. More than two years ago we
went public. The news about us has spread by word of mouth. By now
half of the queries come from outside the USA, 12 percent from German-speaking
countries alone. We answer over 150 million queries a day, either directly
or via our partners. If, for instance, the search engine Yahoo does not find a
keyword in its own index, it passes the query on to us and returns our answer
to the user.
HPCwire: Which hardware runs your system?
HENZINGER: We have more than 10000 PCs, distributed over four data centers.
Our operating system is Linux.
HPCwire: How often do you check whether the Webpages you list still exist?
HENZINGER: We update our database every 28 days. In addition, there are some
very busy Webpages which we visit daily. Every 28 days we recreate the index
which, for every word, lists all the Webpages containing this word. If you
enter two words into the search field you are presented with the intersection
of both lists, sorted by PageRank and a few other criteria. In particular, it
can happen that you get the homepage of a company which does not even contain
the name of the company in readable form, but maybe only as part of an image.
But from the many Webpages which point to this homepage and quote the name of
the company we know that this has to be the homepage, and present it that way.
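The lookup described in this answer can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not Google's
implementation: the three pages, their texts, and their PageRank values are
invented, and `build_index` and `search` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical toy corpus: page -> text. Invented for illustration.
PAGES = {
    "a.example": "maple syrup from maine",
    "b.example": "maple trees and maple leaves",
    "c.example": "pancakes with maple syrup",
}

# Hypothetical PageRank values, assumed to have been computed elsewhere.
PAGERANK = {"a.example": 0.5, "b.example": 0.3, "c.example": 0.2}


def build_index(pages):
    """Map every word to the set of pages that contain it."""
    index = defaultdict(set)
    for page, text in pages.items():
        for word in text.lower().split():
            index[word].add(page)
    return index


def search(index, pagerank, query):
    """Intersect the page sets of all query words, highest PageRank first."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits, key=lambda page: pagerank[page], reverse=True)


INDEX = build_index(PAGES)
RESULTS = search(INDEX, PAGERANK, "maple syrup")
print(RESULTS)
```

The sketch sorts by PageRank alone. The "few other criteria" mentioned in the
answer, and the use of text on linking pages to identify a homepage, are left
out.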
HPCwire: How much effort does this indexing take?
HENZINGER: A lot. About one week.
HPCwire: How many employees does your company have?
HENZINGER: Approximately 350. Up to now the number of employees has doubled
each year.
HPCwire: How is Google financed?
HENZINGER: First, by conventional advertising: one-line, running-text ads. They
only appear for queries like "cars", that is, if they might be interesting to
the person asking; we call this keyword targeting. Second, anybody can buy an
advertisement online with a credit card. If you want to congratulate your
wife on her birthday via Google, you can place an advertisement which only
appears for her name. However, the typical advertiser is a small producer of
maple syrup in Maine. His advertisement will be displayed to the right of the
search results if the user types "maple syrup". Third, by search services. For
instance, Yahoo pays us for displaying our search results on their page. Some
companies want to set up search functionality for their Webpage, but do not
want to program it themselves. For these companies we build a separate index
and answer the search queries the company receives. Fourth, we recently
started to sell our products for internal use in company intranets. We are one
of the few startup companies that are really doing well.
HPCwire: Are there any new projects?
HENZINGER: There are several. Speech input, for instance. The user speaks his
question into a microphone and gets the answers on the screen. In the future
there may even be spoken answers. Another project is our news search. Our
machines read daily newspapers and group articles on the same topic from many
different countries. This is really interesting, because the coverage usually
is locally biased. A regular comparison can expand your horizon a lot. Click
on "News and Services" and then "Try out our beta news search". Another is
user interfaces. How can you convince a user to type in more than two words?
The more words he tells us, the better we can serve him.
Glossary:
Repository: a very large database, which is used by Google to store the
contents of the registered Webpages, pictures, and news articles.
Crawling: the systematic search of the Web for Webpages via the links they
contain. Google's computers regularly crawl through more than 2 billion
Webpages per week.
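Following links from page to page can be sketched as a graph traversal. The
example below is a minimal illustration, not a description of Google's
crawler: the link graph is invented, it is held in memory instead of being
fetched from the net, and the breadth-first visiting order is an assumption.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to. Invented for illustration.
LINKS = {
    "start.example": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": ["start.example"],
    "c.example": [],
    "island.example": ["a.example"],  # no other page links to this one
}


def crawl(links, seed):
    """Return the pages reachable from seed, in breadth-first order."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never visit a page twice
                seen.add(target)
                queue.append(target)
    return order


VISITED = crawl(LINKS, "start.example")
print(VISITED)
```

A page that no visited page links to, like `island.example` here, is never
found this way.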
PageRank: a measure for the quality of a Webpage, more precisely: for the
respect it gets inside the Web. To determine the PageRank of Webpage A, one
takes every Webpage T_j pointing to A, divides the PageRank of T_j by the
total number of links contained in T_j, and adds up these quotients. This sum
times a proportionality factor, plus an additive constant, is the PageRank of
A. If a user clicks
from page to page at random, most often clicking a link on the current page
and sometimes (this is the additive constant) choosing a completely unrelated
Webpage, then the probability of entering Webpage A at some time is equal to
the PageRank of A. The PageRank is thus defined by the PageRank itself -- a
definition which only becomes sensible by solving a huge set of equations
(more precisely: an eigenvalue problem) with all the PageRanks as unknowns.
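One common way to solve such a set of equations approximately is repeated
substitution: start with equal values and apply the rule again and again until
the numbers settle. The sketch below illustrates this on an invented four-page
graph. It is not Google's code. The proportionality factor of 0.85, the
division of the additive constant by the number of pages (so that all
PageRanks add up to 1, matching the probability reading above), and the
handling of pages without links are assumptions.

```python
# Hypothetical link graph: page -> pages it links to. Invented for illustration.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by repeated substitution (power iteration)."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Additive constant: the chance of jumping to an unrelated page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # Each link passes on an equal share of the page's rank.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


RANKS = pagerank(LINKS)
for page, value in sorted(RANKS.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this toy graph page C collects links from three pages and ends up with the
highest value, while page D, to which nobody links, keeps only the additive
constant.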