Continuing our earlier blog post on the types and levels of web crawlers, there is still more ground to cover in this series of categorization. In case you missed any of the previous types of crawlers, click here to read about them.
Meanwhile, let’s move ahead and check out a few more interesting types of web crawlers and crawling processes.
Incremental Crawl
Incremental crawling, as the name suggests, is the classic, old-school crawling process in which the crawler lists URLs in order of relevance and then revisits them to refresh the content it has already fetched. With each incremental crawl, the crawler replaces outdated content with a fresh, updated copy. How often a URL is revisited depends on how frequently that page has historically changed.
Along with refreshing existing content, it also picks up new, content-rich pages each time it revisits a URL, which keeps the crawling process and the fetched data consistent throughout. One key plus of incremental crawling is that only relevant, needed data is delivered to the user, clearing away the clutter and offering a well-filtered data set. The resulting efficient use of bandwidth also makes it cost-effective.
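To make the idea concrete, here is a minimal sketch (in Python, not from the original post) of how an incremental crawler might adapt its revisit interval based on a page's observed change history. The `schedule` store, the interval bounds and the use of the `requests` library are illustrative assumptions.

```python
import hashlib

import requests  # assumed available; any HTTP client would do

# Hypothetical in-memory store: url -> (content_hash, revisit_interval_seconds)
schedule = {}

MIN_INTERVAL = 60 * 60           # revisit at most once an hour
MAX_INTERVAL = 7 * 24 * 60 * 60  # revisit at least once a week

def incremental_visit(url):
    """Fetch a URL, compare it with the last fetched copy, and adapt the
    revisit interval: shrink it when the page changed, grow it when it
    did not. This mirrors revisiting pages according to their historical
    refresh frequency."""
    body = requests.get(url, timeout=10).text
    new_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()

    old_hash, interval = schedule.get(url, (None, 24 * 60 * 60))
    if old_hash is None or new_hash != old_hash:
        interval = max(MIN_INTERVAL, interval // 2)   # page changed: come back sooner
    else:
        interval = min(MAX_INTERVAL, interval * 2)    # page stable: back off
    schedule[url] = (new_hash, interval)
    return interval
```

The halving and doubling of the interval is just one simple policy; production crawlers typically estimate a page's change rate statistically rather than using a fixed multiplier.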
Parallel Crawl
A parallel crawl, as the name literally suggests, runs multiple web crawling processes simultaneously. These processes are known as C-procs, and they typically run on a network of desktops. While it is very convenient to set up such a crawl on a local network, it can also be scaled up to cover geographically distant locations. What makes this a popular process among coders? Distributing the work across several processes running in parallel makes for a very time-efficient crawl.
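As a rough illustration, the following Python sketch runs several crawling workers in parallel, each one playing the role of a C-proc. The seed URLs and the use of a process pool are assumptions for the example, not part of the original post.

```python
from concurrent.futures import ProcessPoolExecutor
from urllib.request import urlopen

# Hypothetical seed list; a real parallel crawler would share work through a queue
SEED_URLS = [
    "https://example.com/",
    "https://example.org/",
    "https://example.net/",
]

def crawl_one(url):
    """A single crawling task: fetch the page and return its size.
    A real worker would also extract and enqueue newly found links."""
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

if __name__ == "__main__":
    # Each worker process acts like an independent C-proc; the pool
    # spreads URLs across them and collects results as they finish.
    with ProcessPoolExecutor(max_workers=3) as pool:
        for url, size in pool.map(crawl_one, SEED_URLS):
            print(f"{url}: {size} bytes")
```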
Web Dynamics
Web dynamics refers to the frequency and rate at which information on the web is refreshed. Search engines use models of web dynamics to decide when to update the index record for a given content source.
The index entry for a certain document, indexed at time t₀, is said to be β-current at time t if the document has not changed in the time interval between t₀ and t − β. Basically, β is a ‘grace period’: if we pretend that the user query was made β time units ago rather than now, then the information in the search engine would be up to date. A search engine for a given collection of documents is said to be (α, β)-current if the probability that a document is β-current is at least α.
According to this definition, we can ask interesting questions like ‘how many documents per day should a search engine refresh in order to guarantee it will remain (0.9, 1 week)-current?’ Answering this question requires that we develop a probabilistic model of Web dynamics.
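As a hedged illustration of what such a model could look like, the sketch below assumes each document changes according to a Poisson process and is re-crawled on a fixed period, then searches for the longest refresh period that still keeps a page (0.9, 1 week)-current. The change rate and the closed-form expression are assumptions of this toy model, not from the post.

```python
import math

def beta_current_probability(change_rate, refresh_period, beta):
    """Probability that a document's index entry is beta-current, assuming
    the document changes as a Poisson process with rate `change_rate`
    (changes per day), the crawler re-fetches it every `refresh_period`
    days, and the query arrives uniformly at random within a cycle."""
    lam, T = change_rate, refresh_period
    if beta >= T:
        return 1.0  # the grace period spans the whole refresh cycle
    # Average P(no change in [t0, t - beta]) = exp(-lam * max(0, age - beta))
    # over the entry's age, which is uniform on (0, T):
    return (beta + (1.0 - math.exp(-lam * (T - beta))) / lam) / T

# Toy question: how often must we refresh a page that changes about once a
# month (lam = 1/30 per day) to keep it (0.9, 7 days)-current?
lam, beta, alpha = 1 / 30, 7.0, 0.9
T = 7.0
while beta_current_probability(lam, T, beta) >= alpha:
    T += 1.0
print(f"refresh at least every {T - 1:.0f} days")
```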
Distributed Crawl
A distributed crawl refers to a crawling setup that splits the crawling work into segments and spreads them across multiple machines, in a way that is strikingly similar to distributed computing. A distributed crawling process is characterized by a central server whose entire job is to manage communication and ensure all nodes respond efficiently. These nodes may be physically separated by over a thousand miles.
To improve efficiency and speed, a distributed crawler often uses the PageRank algorithm to prioritize which pages to fetch first.
One good thing about a distributed crawler is its resilience against system crashes and similar failures: if one node goes down, the rest keep crawling. A distributed web crawler is also highly scalable and more memory-efficient, which is why it is best described as an agile and reliable crawler.
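As a small illustration of how the work can be split across nodes (this shows URL partitioning rather than PageRank-based prioritization), the sketch below hashes a URL's host name to decide which node should crawl it; the node names are purely hypothetical.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical list of crawler nodes coordinated by the central server
NODES = ["node-eu-1", "node-us-1", "node-apac-1"]

def assign_node(url, nodes=NODES):
    """Map a URL to a crawler node by hashing its host name, so every page
    of a given site is always crawled by the same node. The central server
    then only has to route URLs and watch node health."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

print(assign_node("https://example.com/about"))
print(assign_node("https://example.com/contact"))  # same host, so same node as above
```

Because the same host always maps to the same node, per-site politeness limits stay easy to enforce even when the crawl spans many machines.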