- Author: zhaozj
- Published: 2020-12-23 11:02
- Source: unknown
Comment on this article at TheCodeProject
Article I describes building a simple search engine that crawls the filesystem from a specified folder and indexes all HTML (or other types of) documents. A basic design and object model were developed, as well as a query/results page which you can see here.
This second article in the series discusses replacing the 'filesystem crawler' with a 'web spider' to search and catalog a website by following the links in the HTML. The challenges involved include:
- Downloading HTML (and other document types) via HTTP
- Parsing the HTML looking for links to other pages
- Ensuring that we don't keep recursively searching the same pages, resulting in an infinite loop
- Parsing the HTML to extract the words to populate the search catalog from Article I
Design
The design from Article I remains unchanged...
A Catalog contains a collection of Words, and each Word contains a reference to every File that it appears in.
... the object model is the same too...
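As a reminder of how those three pieces relate, here is a minimal sketch of the Catalog, Word and File relationship. It is not the code from Searcharoo.cs: the class names carry a `Sketch` suffix to make that clear, and every member shown (`Add`, `Search`, the per-file occurrence count, the lower-casing of words) is an assumption made for illustration. It also uses generic collections, which assumes a newer C# than the original article may have targeted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a File: one indexed document.
public class FileSketch
{
    public string Url;
    public string Title;
    public FileSketch(string url, string title) { Url = url; Title = title; }
}

// Hypothetical stand-in for a Word: remembers every file it appears in.
public class WordSketch
{
    public string Text;
    // Maps each file containing this word to the number of occurrences seen.
    public Dictionary<FileSketch, int> Files = new Dictionary<FileSketch, int>();

    public WordSketch(string text) { Text = text; }

    public void Add(FileSketch file)
    {
        int count;
        Files.TryGetValue(file, out count); // count stays 0 if the file is new
        Files[file] = count + 1;
    }
}

// Hypothetical stand-in for the Catalog: a collection of words keyed by text.
public class CatalogSketch
{
    private Dictionary<string, WordSketch> index = new Dictionary<string, WordSketch>();

    // Records that 'text' occurs in 'file'. Words are lower-cased (an assumption).
    public void Add(string text, FileSketch file)
    {
        string key = text.ToLowerInvariant();
        WordSketch word;
        if (!index.TryGetValue(key, out word))
        {
            word = new WordSketch(key);
            index[key] = word;
        }
        word.Add(file);
    }

    // Returns the files containing 'text', or an empty result if it is unknown.
    public Dictionary<FileSketch, int> Search(string text)
    {
        WordSketch word;
        if (index.TryGetValue(text.ToLowerInvariant(), out word)) return word.Files;
        return new Dictionary<FileSketch, int>();
    }
}
```

Searching is then a single dictionary lookup on the word, which is why it does not matter to the query page whether the catalog was filled by a filesystem crawler or by a spider.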
What has changed is the way the Catalog is populated. Instead of looping through folders in the filesystem to look for files to open, the code requires the URL of a start page, which it will load and index; it then attempts to follow every link within that page, indexing those pages too. To prevent the code from indexing the entire internet, (in this version) it only attempts to download pages on the same server as the start page.
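The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in SearcharooSpider.aspx: `SpiderSketch`, `ExtractLinks` and `Crawl` are hypothetical names, the link matcher is a deliberately naive regular expression that only handles quoted `href` values, and the download step is passed in as a delegate so the loop can be tried without network access. A set of already-visited URLs is what stops the spider from looping forever, and the host comparison keeps it on the start page's server.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch; none of these names come from the Searcharoo source.
public static class SpiderSketch
{
    // Naive matcher for href="..." or href='...'; stops at a '#' fragment.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"']([^\"'#]+)", RegexOptions.IgnoreCase);

    // Returns the links found in 'html' as absolute URLs, resolved against 'pageUrl'.
    public static List<Uri> ExtractLinks(Uri pageUrl, string html)
    {
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            if (Uri.TryCreate(pageUrl, m.Groups[1].Value, out absolute))
                links.Add(absolute);
        }
        return links;
    }

    // Visits each page reachable from 'start' on the same host exactly once.
    // 'download' returns the HTML for a URL, or null if it cannot be fetched.
    public static List<Uri> Crawl(Uri start, Func<Uri, string> download)
    {
        var visited = new HashSet<string>();
        var indexed = new List<Uri>();
        var pending = new Queue<Uri>();
        pending.Enqueue(start);

        while (pending.Count > 0)
        {
            Uri current = pending.Dequeue();

            // Stay on the start page's server.
            if (!string.Equals(current.Host, start.Host, StringComparison.OrdinalIgnoreCase))
                continue;
            // Add returns false for a URL seen before, which prevents endless loops.
            if (!visited.Add(current.AbsoluteUri))
                continue;

            string html = download(current);
            if (html == null) continue;

            indexed.Add(current); // a real spider would add the page's words to the Catalog here
            foreach (Uri link in ExtractLinks(current, html))
                pending.Enqueue(link);
        }
        return indexed;
    }
}
```

A queue gives a breadth-first walk of the site; a recursive version would work too, but the explicit queue avoids deep call stacks on heavily linked sites. Calling `SpiderSketch.Crawl(startUrl, fetch)` with a `fetch` delegate that performs the HTTP request would return the pages in the order they were indexed.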
Code Structure
Some of the code from Article I will be referenced again, but we've added a new page, SearcharooSpider.aspx, that does the HTTP access and HTML link parsing (making the code that walks directories in the filesystem, SearcharooCrawler.aspx, obsolete). We've also changed the name of the search page to SearcharooToo.aspx so you can use it side-by-side with the old one.
- Searcharoo.cs: Implementation of the object model; compiled into both ASPX pages. RE-USED FROM ARTICLE 1
- SearcharooCrawler.aspx: OBSOLETE, REPLACED WITH SPIDER
- SearcharooToo.aspx: <%@ Page Language="C#" Src="Searcharoo.cs" %><%@ import Namespace="Searcharoo.Net"%> Retrieves the Catalog object from the Cache and allows searching via an HTML form. UPDATED SINCE ARTICLE 1 TO IMPROVE USABILITY, and renamed to SearcharooToo.aspx
- SearcharooSpider.aspx: <%@ Page Language="C#" Src="Searcharoo.cs" %><%@ import Namespace="Searcharoo.Net"%> Starting from the start page, download and index every linked page. NEW PAGE FOR THIS ARTICLE