
How Can I Create An Open Source Distributed Web Scraper

  • 09/10/2022

Distributed Web Scraper

Web scraping services in the USA deliver huge amounts of data in very little time, and open-source web scrapers add to that efficiency. Because the source code is open, users can modify and extend the crawler themselves, which makes web scraping both faster and more convenient.

Among the different kinds of web scrapers, open-source scrapers and crawlers let users build on an existing codebase or framework, which makes it much easier to scrape quickly, thoroughly, and at scale.

On the other hand, open-source web crawlers, while highly robust and versatile, can generally only be used by programmers. With so many no-code applications now available, scraping is no longer just a developer's luxury: those tools make scraping simpler for people without programming skills, letting them run an entire scraping job with a few clicks and customize the crawler through a visual workflow. In this post, however, we will build the crawler ourselves in code.

Let us walk through the steps for a better understanding of how to create an open-source distributed web scraper.

To begin with, make sure Redis and Python 3 are installed on your system if they are not already. Next, install all the required libraries with pip install.
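For reference, the libraries used in the rest of this post can be installed in one command (package names inferred from the imports that follow):

pip install celery redis requests beautifulsoup4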

Knowing Celery and Redis

Celery is used to divide the load among multiple workers and servers, which we would otherwise have to coordinate across nodes ourselves when creating a distributed web crawler. Celery is open source and runs tasks asynchronously.

Redis is an open-source in-memory data store that works as a database, cache, and message broker. Instead of keeping everything in plain arrays and sets in memory, we will use Redis as our database to store the crawling state. Celery can also use Redis as its broker, so we won't need any additional programs to run it.
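As a quick sketch of the redis-py calls we will lean on later (the demo key names here are just placeholders):

from redis import Redis

connection = Redis(db=1)
connection.rpush('demo:to_visit', 'https://example.com')   # lists work as a FIFO queue
print(connection.blpop('demo:to_visit', 5))                # blocking pop with a 5-second timeout
connection.sadd('demo:visited', 'https://example.com')     # sets give us de-duplication for free
print(connection.sismember('demo:visited', 'https://example.com'))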

Executing Celery

We define a Celery task that simply prints the value it receives. Keep in mind that calling the task directly runs it like a regular Python function, while .delay() sends it to a worker through the broker.

 

from celery import Celery

app = Celery('tasks', broker='redis://127.0.0.1:6379/1')

@app.task
def demo(value):
    print(f'Str: {value}')

demo('ZenRows') # Str: ZenRows
demo.delay('ZenRows') # queued on the broker; a worker prints it
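Calling demo() directly runs it in the current process, while demo.delay() only queues the task. To see it execute, start a worker from the same folder with the standard Celery CLI:

celery -A tasks worker --loglevel=INFO

The worker's log will then show the Str: ZenRows output for the delayed call.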

 

Connecting Celery To The Task

This is where extract_links comes in: it collects the links found on a web page, leaving out the nofollow ones. URL filtering can be plugged into this step later on.

 

from celery import Celery
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

app = Celery('tasks', broker='redis://127.0.0.1:6379/1')

@app.task
def crawl(url):
    html = get_html(url)
    soup = BeautifulSoup(html, 'html.parser')
    links = extract_links(url, soup)
    print(links)

def get_html(url):
    try:
        response = requests.get(url)
        return response.content
    except Exception as e:
        print(e)
    return ''

def extract_links(url, soup):
    return list({
        urljoin(url, a.get('href'))
        for a in soup.find_all('a')
        if a.get('href') and not (a.get('rel') and 'nofollow' in a.get('rel'))
    })

starting_url = 'https://scrapeme.live/shop/page/1/'
crawl.delay(starting_url)

 

URL Tracking With Redis

Distributed web scraping cannot rely on in-memory variables. Data such as the pages already crawled and the list of URLs still to visit must be persistent. We will use Redis for this rather than keeping state inside Celery, which will also let us eliminate duplicates.

from redis import Redis
from tasks import crawl

connection = Redis(db=1)
starting_url = 'https://scrapeme.live/shop/page/1/'
connection.rpush('crawling:to_visit', starting_url)

while True:
    # timeout after 1 minute
    item = connection.blpop('crawling:to_visit', 60)
    if item is None:
        print('Timeout! No more items to process')
        break
    url = item[1].decode('utf-8')
    print('Pop URL', url)
    crawl.delay(url)
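With the Celery app living in tasks.py and this loop saved as main.py (the file names used in the next section), running the crawler takes two terminals:

celery -A tasks worker --loglevel=INFO    # terminal 1: start the Celery workers
python main.py                            # terminal 2: seed the queue and dispatch URLs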

 

This loops automatically: each crawled page pushes new links that are then picked up and crawled in turn. However, it is not advisable as-is, because the same pages would be queued and processed over and over. We need to track which URLs have already been seen.

 

from redis import Redis
# ...

connection = Redis(db=1)

@app.task
def crawl(url):
    connection.sadd('crawling:queued', url) # add URL to set
    html = get_html(url)
    soup = BeautifulSoup(html, 'html.parser')
    links = extract_links(url, soup)
    for link in links:
        if allow_url_filter(link) and not seen(link):
            print('Add URL to visit queue', link)
            add_to_visit(link)
    # atomically move a URL from queued to visited
    connection.smove('crawling:queued', 'crawling:visited', url)

def allow_url_filter(url):
    return '/shop/page/' in url and '#' not in url

def seen(url):
    return connection.sismember('crawling:visited', url) or connection.sismember('crawling:queued', url)

def add_to_visit(url):
    # LPOS command is not available in the Redis library
    if connection.execute_command('LPOS', 'crawling:to_visit', url) is None:
        connection.rpush('crawling:to_visit', url) # add URL to the end of the list

# main.py: stop the loop once the maximum number of pages has been processed
maximum_items = 5

while True:
    visited = connection.scard('crawling:visited') # count URLs in visited
    queued = connection.scard('crawling:queued')
    if queued + visited > maximum_items:
        print('Exiting! Over maximum')
        break
    # …

Separating Concerns

At this point, the project calls for a separation of concerns. We already have two files, tasks.py and main.py; we will split the code further into crawling functions (crawler.py) and database access (repo.py). The following snippet is part of repo.py.

 

from redis import Redis

connection = Redis(db=1)

to_visit_key = 'crawling:to_visit'
visited_key = 'crawling:visited'
queued_key = 'crawling:queued'

def pop_to_visit_blocking(timeout=0):
    return connection.blpop(to_visit_key, timeout)

def count_visited():
    return connection.scard(visited_key)

def is_queued(value):
    return connection.sismember(queued_key, value)
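With repo.py in place, main.py no longer needs to talk to Redis directly. A rough sketch of how the earlier loop could look once it goes through repo (pop_to_visit_blocking comes from the snippet above; repo.add_to_visit is an assumed helper wrapping the rpush/LPOS logic shown earlier):

import repo
from tasks import crawl

repo.add_to_visit('https://scrapeme.live/shop/page/1/')  # assumed helper, see note above

while True:
    item = repo.pop_to_visit_blocking(60)  # timeout after 1 minute
    if item is None:
        print('Timeout! No more items to process')
        break
    crawl.delay(item[1].decode('utf-8'))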

 

Parser Personalization

Next, we will extract and store page content and follow only a specific subset of links. To make this configurable per site, we start with a default parser, parsers/defaults.py.

 

import repo

def extract_content(url, soup):
    return soup.title.string # extract the page's title

def store_content(url, content):
    # store in a hash with the URL as the key and the title as the content
    repo.set_content(url, content)

def allow_url_filter(url):
    return True # allow all by default

def get_html(url):
    # ... same as before

# and the matching helper added to repo.py:
content_key = 'crawling:content'
# ..

def set_content(key, value):
    connection.hset(content_key, key=key, value=value)
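To double-check what the crawler has stored, the hash can be read back directly (the key name comes from the snippet above):

from redis import Redis

connection = Redis(db=1)
for url, title in connection.hgetall('crawling:content').items():
    print(url.decode('utf-8'), '->', title.decode('utf-8'))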

 

The process itself stays the same: extract links, then extract and store content. But instead of hardcoding these functions into the crawler, they become a set of functions the crawler receives, which can be swapped in via imports for now and loaded dynamically later. To host the mapping we create parserlist.py, and we simplify things by allowing one parser per domain, using two test domains: scrapeme.live and quotes.toscrape.com.

 

from urllib.parse import urlparse
from parsers import defaults

parsers = {
    'scrapeme.live': defaults,
    'quotes.toscrape.com': defaults,
}

def get_parser(url):
    hostname = urlparse(url).hostname # extract the domain from the URL
    if hostname in parsers:
        # use the dict above to return the custom parser if present
        return parsers[hostname]
    return defaults

@app.task
def crawl(url):
    parser = get_parser(url) # get the parser, either custom or the default one
    html = parser.get_html(url)
    # ...
    for link in links:
        if parser.allow_url_filter(link) and not seen(link):
            # …

 

From here, the next steps are writing custom parsers per site, fetching HTML with headless browsers, and avoiding detection and blocks, for instance by rotating proxies.
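As a small taste of the custom parser step, a per-site module could override only what it needs and reuse the defaults for the rest. The file name and the selector below are illustrative assumptions, not something verified against the test sites:

# parsers/scrapemelive.py (hypothetical custom parser)
from parsers import defaults

get_html = defaults.get_html
store_content = defaults.store_content

def extract_content(url, soup):
    # assumption: the product name sits in the first <h1>; fall back to the page title otherwise
    h1 = soup.find('h1')
    return h1.get_text(strip=True) if h1 else defaults.extract_content(url, soup)

def allow_url_filter(url):
    return '/shop/' in url and '#' not in url # only follow shop URLs

Registering it would then just be a matter of mapping 'scrapeme.live' to this module in parserlist.py instead of defaults.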

Building a large-scale distributed web crawler is a technically heavy job, and that is exactly what professional web scraping services in the USA are set up to handle.

If you require web scraping services for your business or personal use, feel free to contact us. We would be more than happy to help!
