Web Crawling: the act of automatically downloading data from a web page, extracting the hyperlinks it contains, and following them. The downloaded data is usually stored in an index or database so that it can be searched. Web crawling is sometimes loosely called indexing, although strictly speaking crawling collects the pages and indexing organizes what was collected. The work is done by bots, also called crawlers or spiders. Web crawlers are used mainly by major search engines such as Google, Bing and Yahoo.
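To make the download, extract, follow loop concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not how any particular search engine works. The names `LinkExtractor`, `crawl`, and `FAKE_SITE` are made up for this example, and the "site" is an in-memory dictionary standing in for real HTTP requests so the code runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    index = {}                  # url -> downloaded HTML (our stand-in "index")
    queue = deque([start_url])  # pages waiting to be downloaded
    seen = {start_url}          # pages already queued, to avoid loops
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        index[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical in-memory "site" (assumption: replaces real HTTP downloads).
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/">home</a>',
    "https://example.com/b": "<p>no links here</p>",
}

pages = crawl("https://example.com/", FAKE_SITE.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, and `max_pages` is a simple safety cap. A real crawler would replace `FAKE_SITE.get` with a function that performs an HTTP request.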
Because a crawler automatically downloads and stores other people's content, you should be very cautious when using web crawling services or running a crawler yourself. Here is some advice:
- Use an API, if one is provided, instead of scraping the site.
- Respect the Terms of Service.
- Respect the rules in robots.txt.
- Use a reasonable crawl rate; do not bombard the site with requests. Respect the crawl-delay setting in robots.txt, if there is one; otherwise use a conservative crawl rate (for example, 1 request every 10 to 15 seconds).
- Identify your scraper or crawler with a legitimate user agent string. Create a page that explains what you are doing and why, and link to that page in the user agent string (for example, “MY-BOT (+https://seusite.com/mybot.html)”).
- If the Terms of Service or robots.txt prevent you from crawling or scraping, ask the site owner for written permission before doing anything else.
- Do not republish crawled or scraped data, or any data set derived from it, without checking the data license or obtaining written permission from the copyright holder.
- If you doubt the legality of what you are doing, don't do it, or seek advice from a lawyer.
- Do not base your entire business on data extraction. The website you scrape may eventually block you.
- Be suspicious of any advice you find on the internet (including this list), and consult an attorney.
- Remember that companies and individuals are free to sue you for whatever reason they wish. If you scrape or crawl their website without permission and do something they don't like, you put yourself in a vulnerable position.
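
The robots.txt, crawl-rate, and user agent advice above can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration under stated assumptions: the bot name, the info-page URL, the sample robots.txt content, and the helper names (`load_rules`, `delay_for`, `polite_fetch`) are all hypothetical, and the robots.txt text is inlined so the example runs offline. Note also that `Crawl-delay` is a widely used but non-standard directive, so not every site sets it.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity; the URL is a placeholder for your own info page.
BOT_NAME = "MY-BOT"
USER_AGENT = f"{BOT_NAME} (+https://example.com/mybot.html)"
DEFAULT_DELAY = 10.0  # seconds; conservative fallback when no Crawl-delay is set

# Sample robots.txt content (an assumption, inlined so the sketch runs offline).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(rules, bot_name=BOT_NAME):
    """Return the site's Crawl-delay in seconds, or the conservative default."""
    delay = rules.crawl_delay(bot_name)
    return float(delay) if delay is not None else DEFAULT_DELAY


def polite_fetch(urls, rules, fetch, sleep=time.sleep):
    """Fetch only allowed URLs, pausing between consecutive requests.

    `fetch(url, headers)` is supplied by the caller; `sleep` is injectable
    so the pacing can be checked without actually waiting.
    """
    results = {}
    delay = delay_for(rules)
    first = True
    for url in urls:
        if not rules.can_fetch(BOT_NAME, url):
            continue  # disallowed by robots.txt: skip it
        if not first:
            sleep(delay)  # wait before every request except the first
        first = False
        results[url] = fetch(url, {"User-Agent": USER_AGENT})
    return results


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(rules.can_fetch(BOT_NAME, "https://example.com/private/x"))
print(delay_for(rules))
```

In practice you would load the live file with `RobotFileParser.set_url(...)` followed by `read()` instead of parsing an inline string, and pass a real HTTP function as `fetch`. The short bot name is used for robots.txt matching, while the full string with the info-page link is sent as the `User-Agent` header.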