Web crawling is gaining momentum across the industry by the day, and the number of recognized use cases keeps growing. It is remarkable to observe how far web crawling has evolved as a product and service since its inception in the early days of the internet.
Initially, web crawlers were designed for the much smaller web of roughly two decades ago. Today, with billions of webpages spread across the internet, the early-day web crawler simply lacks the endurance to crawl and scale through such a base. Modern web crawling therefore relies on a series of processes to work through this huge pile of webpages and fetch what actually matters.
Here is a closer look at each of these processes.
Selective Crawl
A selective crawl is a process in which the web crawler works through web pages based on a predetermined algorithm that defines the criteria for the crawling process. This algorithm can be likened to a scoring model that rates the relevance of the content crawled from each web page.
The crawled URLs are then sorted by the relevance score the model assigns. This produces a top-to-bottom listing of URLs, with the most relevant URLs first, trickling down to the only remotely relevant ones.
The scoring is typically based on factors such as the number of link levels needed to reach the content on a page, the number of backlinks, and similar depth and popularity metrics.
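As a rough illustration, here is a minimal Python sketch of such a scored frontier. The scoring weights, the score_url helper, and the sample depth and backlink figures are all hypothetical; a real selective crawler would fetch pages, parse links, and maintain these metrics at far larger scale.

```python
import heapq

# Hypothetical weights: shallower pages and pages with more
# backlinks score higher (depth and popularity metrics).
DEPTH_WEIGHT = 1.0
BACKLINK_WEIGHT = 0.1

def score_url(depth, backlink_count):
    """Relevance score: penalize link depth, reward backlinks."""
    return BACKLINK_WEIGHT * backlink_count - DEPTH_WEIGHT * depth

def build_frontier(candidates):
    """candidates: list of (url, depth, backlink_count) tuples.
    Returns a priority queue ordered by relevance."""
    frontier = []
    for url, depth, backlinks in candidates:
        score = score_url(depth, backlinks)
        # heapq is a min-heap, so negate the score to pop the
        # most relevant URL first.
        heapq.heappush(frontier, (-score, url))
    return frontier

if __name__ == "__main__":
    seen = [
        ("https://example.com/", 0, 120),
        ("https://example.com/docs/intro", 1, 45),
        ("https://example.com/archive/2009/misc", 4, 2),
    ]
    frontier = build_frontier(seen)
    # URLs come out most relevant first, trickling down to the
    # remotely relevant ones.
    while frontier:
        neg_score, url = heapq.heappop(frontier)
        print(f"{-neg_score:6.1f}  {url}")
```

The exact weighting is a design choice; the essential idea is simply that the crawler processes the frontier in relevance order rather than in the order URLs were discovered.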
Focus Crawl
A focus crawl, as the name suggests, is a targeted crawling process. The crawler fetches only content that is relevant to a specific keyword or set of keywords, so the pages it collects are also relevant to one another.
Once the crawl is done, the fetched pages are classified and organized into categories. This helps gauge how deeply the content on a given page relates to a specific topic.
This process is also known as 'topic crawling' owing to its style of crawling. A major reason for its popularity is the affordability of the infrastructure it requires, and it can also cut network traffic considerably. Relative to its resource footprint, the coverage a topic crawler achieves is comparatively large.
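To make the idea concrete, here is a minimal sketch of a keyword-driven focused crawl. The topic keywords, the relevance threshold, and the stubbed FAKE_WEB dictionary are illustrative assumptions standing in for live fetching and a real relevance classifier.

```python
import re
from collections import deque

# Hypothetical topic keywords; a production focused crawler would
# typically use a richer relevance model (e.g. a text classifier).
TOPIC_KEYWORDS = {"crawler", "crawling", "crawl", "scraping", "spider", "spiders"}
RELEVANCE_THRESHOLD = 0.2

# Stubbed "web": url -> (page text, outgoing links). A real crawler
# would fetch and parse live pages instead.
FAKE_WEB = {
    "https://example.com/a": ("Intro to web crawling and spiders",
                              ["https://example.com/b", "https://example.com/c"]),
    "https://example.com/b": ("Web scraping with a focused crawler", []),
    "https://example.com/c": ("Unrelated cooking recipes", []),
}

def relevance(text):
    """Fraction of words on the page that match the topic keywords."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in TOPIC_KEYWORDS) / len(words)

def focused_crawl(seed):
    """Breadth-first crawl that only follows links from on-topic pages."""
    queue, visited, on_topic = deque([seed]), set(), []
    while queue:
        url = queue.popleft()
        if url in visited or url not in FAKE_WEB:
            continue
        visited.add(url)
        text, links = FAKE_WEB[url]
        if relevance(text) >= RELEVANCE_THRESHOLD:
            on_topic.append(url)
            queue.extend(links)  # only expand pages judged relevant
    return on_topic

if __name__ == "__main__":
    print(focused_crawl("https://example.com/a"))
```

Because off-topic pages are never expanded, the crawler downloads far fewer pages, which is exactly where the savings in infrastructure and network traffic come from.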
Distributed Crawl
A distributed crawl refers to a crawling activity that is split into segments across multiple machines, strikingly similar to distributed computing. A distributed crawling process is characterized by a central server whose job is to manage communication and ensure that all the nodes respond efficiently.
These nodes may be physically separated by over a thousand miles. To improve efficiency and speed, a distributed crawler can use the PageRank algorithm to prioritize pages. A major advantage of a distributed crawler is its resilience against system crashes and similar failures. It is also quite scalable and memory efficient, which makes it an agile and reliable crawler.
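One common way the central server divides the work is to partition the URL frontier across nodes by hashing each URL's host. The sketch below illustrates that idea; the node names and the hashing scheme are assumptions for illustration, not any particular system's implementation.

```python
import hashlib
from urllib.parse import urlparse
from collections import defaultdict

# Hypothetical crawler nodes managed by a central coordinator.
NODES = ["node-us-east", "node-eu-west", "node-ap-south"]

def assign_node(url, nodes=NODES):
    """Consistently map a URL's host to one crawler node.

    Hashing the hostname keeps all pages of a site on the same node,
    which makes per-host politeness limits easy to enforce locally."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def partition_frontier(urls):
    """The coordinator's job: split the frontier across the nodes."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[assign_node(url)].append(url)
    return buckets

if __name__ == "__main__":
    frontier = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.org/",
        "https://example.net/docs",
    ]
    for node, urls in partition_frontier(frontier).items():
        print(node, "->", urls)
```

Because the mapping is deterministic, any node can be replaced after a crash and the remaining URLs can simply be re-assigned, which is part of what makes a distributed crawler resilient and scalable.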
These types of web crawlers are quite popular for web crawling activities. However, there's much more; watch this space for a few more interesting types of web crawlers.