Web scraping services in the USA deliver huge amounts of data in very little time, and open-source web scrapers add to their efficiency. Because the source code is open, users can modify and extend these crawlers directly, which makes web scraping faster and considerably more convenient.
Among the different web scrapers, open-source scrapers and crawlers let users build on the existing source code or framework, and they play a significant role in making scraping quick, easy, and thorough, even at scale.
On the other hand, open-source web crawlers, robust and versatile as they are, are usable mostly by programmers. With so many no-code applications now available, though, scraping is no longer just a developer's luxury: these tools make scraping simpler for people without programming skills, letting them run an entire scraping job with a few clicks and design a workflow to adjust the crawler.
Let us walk through the steps for a better understanding of how to create an open-source distributed web scraper.
To begin with, building an automated open-source web crawler requires Redis and Python 3, so install them first if your system does not already have them. Next, you can install all the required libraries through pip install.
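The exact command depends on your environment, but the code in this walkthrough relies on four libraries, which can be installed like this:
pip install celery redis requests beautifulsoup4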
Understanding Celery and Redis
Celery is an open-source asynchronous task queue. For a distributed web crawler, we use it to divide the load among multiple workers; without it, coordinating numerous nodes would be a task of its own.
Redis is an open-source in-memory data store that can act as a database, cache, and message broker. We will use Redis as the database, rather than in-memory lists and sets, to store all the crawler's state. Celery can also use Redis as its broker, so we won't need any additional software to run it.
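If you have not used Redis before, the following minimal sketch (my own illustration, not part of the project code) shows the two primitives the crawler leans on: a list used as a queue of URLs to visit, and sets used to mark URLs as queued or visited.
from redis import Redis

connection = Redis(db=1)  # same local database the crawler will use

# a list works as the queue of URLs to visit
connection.rpush('crawling:to_visit', 'https://example.com')
item = connection.blpop('crawling:to_visit', 5)  # blocking pop with a 5-second timeout

# sets track which URLs have been queued or visited
connection.sadd('crawling:visited', 'https://example.com')
print(connection.sismember('crawling:visited', 'https://example.com'))  # membership check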
Executing Celery
We create a small Celery task that prints the value it receives as a parameter. Keep in mind that calling the function directly just runs it like regular Python code.
from celery import Celery
app = Celery('tasks', broker_url='redis://127.0.0.1:6379/1')
@app.task
def demo(str):
    print(f'Str: {str}')
demo('ZenRows') # Str: ZenRows
demo.delay('ZenRows') # ?
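Calling demo('ZenRows') runs locally and prints the string, while demo.delay('ZenRows') only enqueues the task: nothing is printed until a Celery worker picks it up. Assuming the snippet is saved as tasks.py, a worker can be started with:
celery -A tasks worker --loglevel=info
The worker then executes the queued task and prints Str: ZenRows.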
Connecting Celery To The Task
Here extract_links gathers all the links on a page, skipping those marked rel="nofollow". Further filtering logic can be plugged in later in the process.
from celery import Celery
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
app = Celery('tasks', broker_url='redis://127.0.0.1:6379/1')
@app.task
def crawl(url):
    html = get_html(url)
    soup = BeautifulSoup(html, 'html.parser')
    links = extract_links(url, soup)
    print(links)

def get_html(url):
    try:
        response = requests.get(url)
        return response.content
    except Exception as e:
        print(e)
        return ''

def extract_links(url, soup):
    return list({
        urljoin(url, a.get('href'))
        for a in soup.find_all('a')
        if a.get('href') and not (a.get('rel') and 'nofollow' in a.get('rel'))
    })
starting_url = 'https://scrapeme.live/shop/page/1/'
crawl.delay(starting_url)
Tracking URLs With Redis
A distributed web crawler cannot rely on in-memory variables: state such as the pages already crawled and the list of URLs to visit must be persistent and shared. We achieve this by storing that state in Redis instead of local variables, which will also let us eliminate duplicates.
from redis import Redis
from tasks import crawl
connection = Redis(db=1)
starting_url = 'https://scrapeme.live/shop/page/1/'
connection.rpush('crawling:to_visit', starting_url)
while True:
    # timeout after 1 minute
    item = connection.blpop('crawling:to_visit', 60)
    if item is None:
        print('Timeout! No more items to process')
        break
    url = item[1].decode('utf-8')
    print('Pop URL', url)
    crawl.delay(url)
The next step is for crawl to push the links it discovers back onto the to_visit list, so the process feeds itself. Done naively, though, the same pages would be crawled over and over, so we also track queued and visited URLs in Redis sets.
from redis import Redis
# ...
connection = Redis(db=1)
@app.task
def crawl(url):
    connection.sadd('crawling:queued', url)  # add URL to set
    html = get_html(url)
    soup = BeautifulSoup(html, 'html.parser')
    links = extract_links(url, soup)
    for link in links:
        if allow_url_filter(link) and not seen(link):
            print('Add URL to visit queue', link)
            add_to_visit(link)
    # atomically move a URL from queued to visited
    connection.smove('crawling:queued', 'crawling:visited', url)

def allow_url_filter(url):
    return '/shop/page/' in url and '#' not in url

def seen(url):
    return connection.sismember('crawling:visited', url) or connection.sismember('crawling:queued', url)

def add_to_visit(url):
    # LPOS is not available in older redis-py versions, so call it via execute_command
    if connection.execute_command('LPOS', 'crawling:to_visit', url) is None:
        connection.rpush('crawling:to_visit', url)  # add URL to the end of the list

maximum_items = 5

while True:
    visited = connection.scard('crawling:visited')  # count URLs in visited
    queued = connection.scard('crawling:queued')
    if queued + visited > maximum_items:
        print('Exiting! Over maximum')
        break
    # …
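To check what a run actually did, you can inspect the Redis keys directly. This quick helper is my own addition, using the key names from the snippets above:
from redis import Redis

connection = Redis(db=1)
print('visited :', connection.smembers('crawling:visited'))        # URLs already crawled
print('queued  :', connection.smembers('crawling:queued'))         # URLs currently being crawled
print('to visit:', connection.lrange('crawling:to_visit', 0, -1))  # pending queue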
Separating Concerns
At this point the project is big enough to call for a separation of concerns. So far we have two files, tasks.py and main.py; we now add one for the crawling functions (crawler.py) and one for database access (repo.py).
from redis import Redis
connection = Redis(db=1)
to_visit_key = 'crawling:to_visit'
visited_key = 'crawling:visited'
queued_key = 'crawling:queued'
def pop_to_visit_blocking(timeout=0):
    return connection.blpop(to_visit_key, timeout)

def count_visited():
    return connection.scard(visited_key)

def is_queued(value):
    return connection.sismember(queued_key, value)
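The remaining Redis calls used by the crawl task can be wrapped the same way. The helper names below are illustrative, not taken from the original files:
def add_to_visit(value):
    # LPOS needs Redis >= 6.0.6; it keeps the to-visit list free of duplicates
    if connection.execute_command('LPOS', to_visit_key, value) is None:
        connection.rpush(to_visit_key, value)

def add_to_queued_set(value):
    connection.sadd(queued_key, value)

def move_queued_to_visited(value):
    connection.smove(queued_key, visited_key, value)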
Customizing the Parser
We want to extract the content, store it, and follow only a specific subset of links, and we want to be able to customize that behaviour per site. The default behaviour lives in parsers/defaults.py:
import repo

def extract_content(url, soup):
    return soup.title.string  # extract the page's title

def store_content(url, content):
    # store in a hash with the URL as the key and the title as the content
    repo.set_content(url, content)

def allow_url_filter(url):
    return True  # allow all by default

def get_html(url):
    # ... same as before
The matching storage helper goes into repo.py:
content_key = 'crawling:content'
# ...

def set_content(key, value):
    connection.hset(content_key, key=key, value=value)
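Once a crawl has run, the stored titles can be read back from that hash; this check is an illustration, assuming the crawling:content key defined above:
from redis import Redis

connection = Redis(db=1)
for url, title in connection.hgetall('crawling:content').items():
    print(url.decode('utf-8'), '->', title.decode('utf-8'))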
The process itself stays the same: extract the links, extract the content, and store it. But instead of hardcoding that logic in the crawler, the crawl task now calls a set of functions that can be swapped out per site simply by importing a different module. We need something to pick the right parser for each URL; parserlist.py will host that. To keep it simple, we allow one parser per domain, with two test domains: scrapeme.live and quotes.toscrape.com.
from urllib.parse import urlparse
from parsers import defaults
parsers = {
    'scrapeme.live': defaults,
    'quotes.toscrape.com': defaults,
}

def get_parser(url):
    hostname = urlparse(url).hostname  # extract the domain from the URL
    if hostname in parsers:
        # use the dict above to return the custom parser if present
        return parsers[hostname]
    return defaults
The crawl task then looks up the right parser for each URL and calls its functions:
@app.task
def crawl(url):
    parser = get_parser(url)  # get the parser, either custom or the default one
    html = parser.get_html(url)
    # ...
    for link in links:
        if parser.allow_url_filter(link) and not seen(link):
            # …
From here, the remaining steps cover writing custom parsers, getting HTML with headless browsers, and avoiding detection, for example with proxies.
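As a sketch of the custom parser step, a site-specific module can reuse the defaults and override only what differs. The file name and the h1 selector below are assumptions for illustration, not taken from the original project:
# parsers/scrapemelive.py (hypothetical)
from parsers import defaults

get_html = defaults.get_html
store_content = defaults.store_content
allow_url_filter = defaults.allow_url_filter

def extract_content(url, soup):
    # assume the product name is the first h1 on the page
    h1 = soup.find('h1')
    return h1.get_text(strip=True) if h1 else None
Registering it is then a one-line change in parserlist.py, mapping 'scrapeme.live' to the new module instead of defaults.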
Building an advanced web crawler for large-scale web scraping is a technically heavy job, and these are exactly the needs that good web scraping services in the USA cater to.
If you require web scraping services for your business or personal use, feel free to contact us. We would be more than happy to help!