  • 19/03/2020

What is Web Scraping?

Many of you want to know what Web Scraping is. The concept may seem complicated, but it is actually quite simple.

This kind of "gold mining" on the internet involves extracting relevant information from a particular website so that it can be analyzed later. The data is then used to support decision making and improve the chances of success.

It is possible to do the same process manually, but the idea behind Web Scraping is to automate the work using bots. This makes it possible to collect a much larger amount of data in a fraction of the time.
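To make the idea of automated extraction concrete, here is a minimal sketch in Python using only the standard library. It assumes, purely for illustration, that the information of interest sits in the page's h2 headings; the sample HTML, the class name TitleExtractor and the function extract_titles are hypothetical and not taken from any real site or tool.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside every <h2> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")  # start a new, empty title

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_h2:
            self.titles[-1] += data


def extract_titles(html):
    """Return the stripped text of every <h2> in the given HTML string."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Inline sample page (an assumption) so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <h2>Product A</h2><p>$10</p>
  <h2>Product B</h2><p>$12</p>
</body></html>
"""

print(extract_titles(SAMPLE_HTML))  # ['Product A', 'Product B']
```

A real bot would download the page first and would probably rely on a more robust parser, but the principle is the same: the program reads the markup and keeps only the pieces you care about.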

Naturally, since we are talking about capturing data from other sites, it is very important to be aware of the limits of this practice, both legal and ethical.

Precautions to take when applying Web Scraping in your digital strategy

Now we need to talk about the ethical and legal limits of Web Scraping. First of all, it must be said that the practice is not illegal in itself.

But in some cases, there are barriers you need to respect in order to avoid mistakes and negative consequences.

The fact is that many sites have specific policies and measures to prohibit or hinder data mining. Here are the main points of attention and how to handle each of them:

  • Robots.txt: This file may contain restrictions on what can and cannot be mined. Respect its limitations to avoid negative consequences;

  • Terms of service: Assuming that a site's terms of service do not apply to scraping is a mistake. If someone takes the matter to court, the provisions of those terms may be considered valid;

  • Laws of the location where the website is hosted: If the website is hosted in another country, care must be taken to avoid violating local data protection laws;

  • Crawl rate: The faster the bots work, the more requests hit the server, and the greater the chance that the site will perceive this as an attack. Take it easy on the pace of extraction;

  • Scraper identification: Creating an identification file for your scraper, indicating who you are and how you will use the data, is a good practice that can help avoid problems;

  • Protection of collected data: If the data you want to use has copyright protection, it is best not to collect it.
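Several of the points above (robots.txt, crawl rate and scraper identification) can be sketched in a few lines of Python with the standard library. This is only an illustration under stated assumptions: the robots.txt content is an inline sample, the example.com URLs are placeholders, and the names USER_AGENT, load_rules, RateLimiter and polite_fetch are hypothetical. Instead of a separate identification file, the sketch identifies the bot through a User-Agent string that points to a hypothetical information page.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical identification: says who runs the bot and where to learn more.
USER_AGENT = "ExampleScraper/1.0 (+https://example.com/bot-info)"

# Inline stand-in for a site's robots.txt so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class RateLimiter:
    """Ensures at least min_interval seconds pass between two requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def polite_fetch(url, rules, limiter):
    """Fetch url only if robots.txt allows it; return bytes, or None if disallowed."""
    if not rules.can_fetch(USER_AGENT, url):
        return None
    limiter.wait()
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


rules = load_rules(SAMPLE_ROBOTS_TXT)
# Use the site's requested delay, or fall back to 1 second (an arbitrary choice).
delay = rules.crawl_delay(USER_AGENT) or 1
limiter = RateLimiter(min_interval=delay)

print(rules.can_fetch(USER_AGENT, "https://example.com/products"))      # True
print(rules.can_fetch(USER_AGENT, "https://example.com/private/data"))  # False
print(delay)                                                            # 2
```

In real use, the rules would be loaded from the target site's own robots.txt rather than from a sample string, and polite_fetch would be called once per page. Note that none of this covers the terms of service, local laws or copyright: those still have to be checked by a person.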

