In this paper, we present Google, a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext.
Google is designed to crawl and index the Web efficiently and to produce much more satisfying search results than existing systems. The prototype, with a full-text
and hyperlink database of at least 24 million pages, is available at http://google.stanford.edu/. Engineering a search engine is a challenging task. Search engines
index tens to hundreds of millions of web pages containing a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the
importance of large-scale search engines on the Web, very little academic research has been done on them. Furthermore, because of rapid advances in technology and the
proliferation of web pages, building a web search engine today is very different from building one three years ago.
This paper provides an in-depth description of our large-scale web search engine; as far as we know, it is the first such detailed description to be published.
Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved in using the additional
information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system that can exploit
this additional information. We also look at the problem of how to deal effectively with uncontrolled hypertext collections, where anyone can publish anything they
want.
Keywords: World Wide Web, search engines, information retrieval, PageRank, Google
1 Introduction
The Web creates new challenges for information retrieval. The amount of information on the Web is growing rapidly, as is the number of new users who are inexperienced
in the art of web research. People tend to surf the Web by following hyperlinks, often starting from high-quality, human-maintained indices such as Yahoo or from
search engines. Human-maintained lists cover popular topics effectively, but they are subjective, expensive to build and maintain, slow to improve, and cannot cover
all esoteric topics. Automated search engines that rely on keyword matching usually return too many low-quality matches. To make matters worse, some advertisers try
to gain people's attention by taking measures meant to mislead automated search engines.
We have built a large-scale search engine that addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present
in hypertext to provide much higher-quality search results. We chose the name Google for our system because it is a common spelling of googol, or 10^100, and fits
well with our goal of building very large-scale search engines.
1.1 Web Search Engines - Scaling Up: 1994-2000
Search engine technology has had to scale dramatically to keep up with the growth of the Web. In 1994, one of the first web search engines, the World Wide Web Worm
(WWWW), had an index of 110,000 web pages and web-accessible documents. As of November 1997, the top search engines claimed to index from 2 million (WebCrawler) to
100 million web documents (according to Search Engine Watch). It is foreseeable that by the year 2000 a comprehensive index of the Web will contain over a billion
documents. At the same time, the number of queries that search engines handle has grown at an incredible rate. In March and April 1994, the World Wide Web Worm
received an average of about 1,500 queries per day.
In November 1997, Altavista claimed that it handled roughly 20 million queries per day. With the growth in the number of Internet users, automated search engines
will likely handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, in both quality and
scalability, that arise when search engine technology is scaled to such extraordinary numbers.
1.2 Google: Scaling with the Web
Creating a search engine that scales even to today's Web presents many challenges. Fast crawling technology is needed to gather web documents and keep them up to
date. Storage space must be used efficiently to hold the indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of
data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second. These tasks become increasingly difficult as the Web grows. However,
hardware performance and cost have improved dramatically, which partially offsets the difficulty. There are several notable exceptions to this progress, such as disk
seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological change. Google is designed to
scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access
(see Section 4.2). Further, we expect that the cost of indexing and storing text or HTML will eventually decline relative to the amount that will be available (see
Appendix B). This will result in favorable scaling properties for centralized systems such as Google.
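To relate the daily query volumes quoted earlier to the per-second rate mentioned in this section, the short sketch below converts queries per day into an average number of queries per second. It is only an illustrative calculation, not part of the system described here: the function name `average_qps` is hypothetical, and the figure of 100 million queries per day is an assumed lower bound standing in for "hundreds of millions".

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_qps(queries_per_day: float) -> float:
    """Return the average queries per second implied by a daily query count."""
    return queries_per_day / SECONDS_PER_DAY


# Daily volumes taken from the text. The last entry is an assumption:
# 100 million is used as a lower bound for "hundreds of millions".
daily_volumes = {
    "World Wide Web Worm": 1_500,
    "Altavista, November 1997": 20_000_000,
    "Year 2000 (assumed lower bound)": 100_000_000,
}

for label, per_day in daily_volumes.items():
    print(f"{label}: {average_qps(per_day):,.2f} queries per second on average")
```

These are averages over a whole day. Peak load would be higher, so a system sized for "hundreds to thousands" of queries per second is consistent with daily volumes in the tens to hundreds of millions.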
1.3 Design Goals
1.3.1 Improved Search Quality
Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything
easily. According to Best of the Web 1994 - Navigators, "The best navigation service should make it easy to find almost anything on the Web (once all the data is
entered)." However, the Web of 1997 is quite different. Anyone who has used a search engine recently can readily testify that the completeness of the index is not
the only factor in the quality of search results. "Junk results" often wash out the results that a user is actually interested in.