Web scraping is a process with more layers and complexity than first meets the eye, and that complexity only deepens when the task is scraping dynamic websites. By some estimates, nearly three-quarters of websites are dynamic and require JavaScript to function.
Before scraping a web page, we should understand the specimen, which will lead us into dynamic website scraping with Python and the tools around it. Initially, we need to figure out whether the given website is indeed dynamic, and we can check that with a short script in Python, the same language we will eventually be scraping with.
Here's the script, which looks for data inside the page's results <div>:
import re
import urllib.request

# Fetch the raw HTML, before any JavaScript has run
response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()

# Search for content inside the results <div>
re.findall('<div id="results">(.+?)</div>', text)
Here’s the output:
[]
The empty list tells us the scraper could not produce any data: the results <div> in the raw HTML is an empty slot, because its content is filled in by JavaScript only after the page loads.
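To double-check the same thing without regular expressions, here is a minimal sketch using the third-party requests and BeautifulSoup packages (the id='results' container is the same one targeted by the CSS selector later in this article):
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML, exactly as a client without JavaScript would see it
html = requests.get('http://example.webscraping.com/places/default/search').text
soup = BeautifulSoup(html, 'html.parser')

# The container exists in the static HTML, but JavaScript has not yet
# filled it with links, so this prints an empty list
results = soup.find(id='results')
print(results.find_all('a'))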
We shall now walk through the different procedures that make Python scraping of dynamic websites possible, with the help of expert web scraping services USA.
The most common difficulty while scraping dynamic websites is that an ordinary scraper generally cannot extract data from a dynamic web page, because the information is embedded in the page at load time by JavaScript. The most efficient web scraping services USA rely on two solutions to deal with such circumstances: Reverse Engineering JavaScript and Rendering JavaScript.
Let's begin with the first approach, Reverse Engineering JavaScript.
Reverse Engineering JavaScript
Reverse Engineering JavaScript is quite a useful process in scraping dynamic websites: rather than executing the page's scripts, we identify the backend requests that deliver the data and call them directly.
As the initial step, open the browser's "Inspect Element" panel for the URL in question. Then click on the "Network" tab, which lists the requests the page makes; among them you will find an AJAX request to a path such as /ajax/search.json. Once that endpoint is identified, the browser is no longer needed: we can call it directly with a Python script like this:
import requests

# Call the AJAX endpoint directly and parse the JSON body
response = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
response.json()
Let's look at a specimen here. The script above obtains the JSON response using the requests library's built-in json() method. The same result is possible in a different way too: we can download the raw string response and parse it ourselves with Python's json.loads function. This entire framework rests on Python, which is why there has been an emphasis on dynamic website scraping with Python.
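As a quick illustration, here is a minimal sketch of the json.loads alternative, assuming the same endpoint as above:
import json
import requests

# Download the raw string response instead of calling response.json()
response = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
data = json.loads(response.text)  # equivalent to response.json()
print(len(data['records']))
With the endpoint understood, the next script retrieves the information for all countries by searching every letter of the alphabet in turn and paging through the result pages delivered by the JSON API: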
import requests
import string

PAGE_SIZE = 15
url = 'http://example.webscraping.com/ajax/' + 'search.json?page={}&page_size={}&search_term={}'
countries = set()

for letter in string.ascii_lowercase:
    print('Searching with %s' % letter)
    page = 0
    while True:
        # Request one page of JSON results for the current letter
        response = requests.get(url.format(page, PAGE_SIZE, letter))
        data = response.json()
        print('adding %d records from the page %d' % (len(data.get('records')), page))
        for record in data.get('records'):
            countries.add(record['country'])
        page += 1
        if page >= data['num_pages']:
            break

with open('countries.txt', 'w') as countries_file:
    countries_file.write('\n'.join(sorted(countries)))
Once we have run the script above, the country names it collects are sorted and written to a file named countries.txt.
Here's the output.
Searching with a
adding 15 records from the page 0
adding 15 records from the page 1
...
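To confirm the run, a quick read-back of the output file (a sketch, assuming the script above completed successfully) looks like this:
# Reload the saved file and show a sample of the collected countries
with open('countries.txt') as countries_file:
    countries = countries_file.read().splitlines()

print('%d countries collected' % len(countries))
print(countries[:5])  # first few entries, alphabetically sorted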
The next approach is Rendering JavaScript
Rendering JavaScript
Our previous attempt was Reverse Engineering, which had its own set of pros and cons. On the plus side, once the API was located it worked well: we could pull a full set of results with a single request. However, as mentioned, there are a few cons as well. Let's look at a couple of those.
- Some web pages are built with high-level browser tools such as the Google Web Toolkit (GWT). In such cases the JavaScript served to the browser is machine-generated and obfuscated, making it tedious to reverse engineer.
- Other pages are difficult to reverse engineer because they are built on a high-level framework such as React.js, which layers abstraction on top of already complex JavaScript, adding another obstacle to scraping dynamic websites.
Nevertheless, there is a method that avoids these hindrances: using a browser rendering engine. A rendering engine parses the HTML, applies the CSS formatting, and executes the JavaScript, displaying the web page just as a real browser would.
Here is an example of the same. We will render the JavaScript using a well-known Python module, Selenium, which drives a real web browser from Python code while scraping dynamic websites.
To begin with, we shall import the web driver from Selenium.
This is how we can do it:
from selenium import webdriver
Next, we tell the web driver where to find the ChromeDriver executable that will relay our commands to the browser.
path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path = path)
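A note if you are running Selenium 4 or newer: the executable_path argument has been removed in favor of a Service object, and the find_element_by_* helpers used below were replaced by a single find_element method. A sketch of the equivalent setup, assuming the same driver path:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: wrap the driver path in a Service object
driver = webdriver.Chrome(service=Service(r'C:\Users\gaurav\Desktop\Chromedriver'))
# Elements are then located through the By class, for example:
# driver.find_element(By.ID, 'search_term')
The rest of this walkthrough keeps the legacy Selenium 3 style for continuity.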
We shall now send the browser, driven by our Python script, to the target URL.
driver.get('http://example.webscraping.com/search')
We will proceed to select the search box by its ID and type a search term into it.
driver.find_element_by_id('search_term').send_keys('.')
Next, we execute a snippet of JavaScript that rewrites the second page-size option to '100', so that one search returns up to 100 results.
js = "document.getElementById('page_size').options[1].text = '100';"
driver.execute_script(js)
The next line clicks the search button on the web page.
driver.find_element_by_id('search').click()
Now we tell the driver to wait up to 45 seconds for the AJAX request to finish. Note that implicitly_wait sets a maximum wait, not a fixed pause: if the elements appear sooner, execution continues immediately.
driver.implicitly_wait(45)
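If you would rather wait for a specific condition than rely on an implicit timeout, Selenium also provides explicit waits. A minimal sketch, assuming the same #results container used in the next step:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until at least one result link is present, or raise after 45 seconds
WebDriverWait(driver, 45).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#results a'))
)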
We are nearing the conclusion, as we can now select the country links using a CSS selector.
links = driver.find_elements_by_css_selector('#results a')
Finally, we extract the text from each link to build the list of countries.
countries = [link.text for link in links]
print(countries)
driver.close()
Here we conclude the process of scraping dynamic websites with Python, using the two approaches favored by the most efficient web scraping services USA.
If you need web scraping services for websites, then you should definitely consider consulting the experts at BotScraper. BotScraper offers a multitude of services, such as data extraction, web scraping, web crawling, and data mining. Our team of professionals is dedicated to providing clients with accurate, reliable, and instantaneous data extraction services. So what are you waiting for? Contact us today to get started with your web scraping needs!