Tips & Tricks to Steer Your Web Crawling & Scraping Journey Clear of Legal Sludge

  • 26/11/2017

In the last few posts, we established the ground reality for the various legal aspects of web scraping, data extraction & web crawling. While only the web administrator can decide how deep into the legal sludge they wish to drag you, there are a few guidelines that can help you steer clear of legal trouble.

These guidelines are no guarantee of staying out of legal trouble; it is always advisable to seek professional legal counsel to stay on the right side of the law. However, they can form the foundation for setting up internal processes.

  1. Abide by the “Terms of Service” or “Terms of Use”
    Most websites publish a “Terms of Service” or “Terms of Use”; be sure to read it carefully before launching a crawler at the site. This is typically a set of rules published by the web administrator. You can usually find it linked from the site's footer or sitemap. It is always advisable to stick to it.
  2. Respect “Robots.txt”
    The “robots.txt” file is maintained by website administrators specifically to communicate the right code of conduct to web scraping and web crawling bots. It is strongly recommended to abide by the rules given in this file, stay out of the paths it disallows, and not follow links marked with a “nofollow” attribute.
  3. Maintain a natural crawl rate
    Do not flood the website with continual, high-frequency requests. This chokes the website's bandwidth and puts tremendous pressure on the server, which degrades the experience for real human users. Keep in mind that these websites are built to serve real humans, not data hunters.
  4. Don’t be anonymous
    Anonymity looks shady and creates doubt in the web administrator's mind. It is advisable to identify your web crawler with a legitimate user agent string that links back to a web page explaining what you are doing and why. This allows the website administrator to understand your intent, and they usually oblige if web crawlers come clean.
  5. When in doubt, ask
    If the ‘robots.txt’, ‘Terms of Use’ or ‘Terms of Service’ disallows you from scraping or crawling certain data, it is always prudent to seek consent in writing, for example by email.
  6. Don’t republish blindly
    If you have acquired certain data sets through web crawling or scraping, it is always a good idea to check for license requirements or copyrights before publishing. In case you are unable to find anything, it is always prudent to write to the website administrator for permission to use the data for publishing.
  7. Never do anything you have a doubt about
    If you have the slightest doubt about crawling or scraping a certain website, step back immediately. It is not worth the risk. Seek legal counsel and ensure that you are treading on safe ground.
  8. Don’t build your revenue only by scraping a certain website
    If you build your entire revenue model around scraping one website in particular, you are exposing yourself to massive concentration risk. Your web crawler will most likely end up putting a lot of pressure on the website, which may lead the website to block you and shut off your entire revenue source in one stroke.
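
Guidelines 2, 3 and 4 above lend themselves to a small code sketch. The Python snippet below is only an illustration under stated assumptions, not a definitive implementation: it uses the standard library's `urllib.robotparser` to check which URLs a robots.txt allows, reads the `Crawl-delay` value if one is given, and pauses between requests. The user agent string, the sample robots.txt content, the fallback delay and the helper names (`build_parser`, `crawl_plan`, `polite_crawl`) are all hypothetical and chosen just for this example.

```python
import time
from urllib import robotparser

# Hypothetical user agent; the URL should point to a page that explains
# who runs the crawler and why (guideline 4).
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot-info)"

# Assumed fallback pause when robots.txt gives no Crawl-delay (guideline 3).
DEFAULT_DELAY_SECONDS = 2.0

# Made-up robots.txt content, used here so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser (guideline 2)."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_plan(parser, urls, user_agent=USER_AGENT):
    """Return the URLs robots.txt allows and the delay to use between them."""
    delay = parser.crawl_delay(user_agent) or DEFAULT_DELAY_SECONDS
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, float(delay)


def polite_crawl(parser, urls, fetch, sleep=time.sleep):
    """Fetch only allowed URLs, pausing between consecutive requests.

    `fetch` is any callable taking (url, user_agent); it is left to the
    caller so this sketch does not depend on a particular HTTP library.
    """
    allowed, delay = crawl_plan(parser, urls)
    results = []
    for index, url in enumerate(allowed):
        if index > 0:
            sleep(delay)  # wait before every request after the first
        results.append(fetch(url, USER_AGENT))
    return results
```

In a real crawler you would load the live file with `RobotFileParser.set_url(...)` followed by `read()` instead of parsing a hard-coded string, and `fetch` would send `USER_AGENT` in the request's `User-Agent` header. Passing `fetch` and `sleep` in as parameters is simply a design choice that keeps the politeness logic separate from the networking code.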

Web scraping and crawling are not really illegal if done the right way. Although filing a lawsuit against you is not a very common first step for web administrators, there is nothing stopping them from doing it if you get on their wrong side.
