Tutorial Dependencies
Before you learn how to scrape the web, you should probably be familiar with basic Python and HTML. If you're on Windows, do install the Windows Subsystem for Linux (WSL), as that will make life a lot easier when running Python scripts (macOS and Linux users, you're all good to go). Instructions are available on my site. In addition, I recommend learning how to use your browser's developer tools (F12) to select HTML elements on a website, as you'll be doing that quite a lot in this tutorial.
Introduction
The Art of Web Scraping
Before we delve into web scraping with the Python programming language, we'll need to understand the fundamentals of web scraping. So with that, you may ask: what is a web scraper, and how useful is it?
Basically, web scraping allows us to "scrape," or retrieve, data and text from websites on the Internet. If you want to pull the name or other details of a YouTube video and display them in your Python program, you're effectively web scraping that YouTube page.
However, some sites are harder to scrape than others. These sites don't want users crawling and scraping their pages, but we'll get into a few workarounds that let us emulate a browser and gain scraping access. There are a few reasons these sites dislike web scrapers: scraping takes a toll on server performance, and careless scrapers can overload a server outright. Still, as long as we're not spamming a site with millions of requests at once, we're okay to crawl its pages, and web scraping done this way is ethical.
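If you want to stay firmly on the polite side, you can check a site's robots.txt and space out your requests before crawling. Here's a minimal sketch, assuming an illustrative URL and delay:

```python
# a minimal politeness sketch -- the URL and delay are illustrative assumptions
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.example.com/robots.txt')
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch('*', 'https://www.example.com/some/page'):
    time.sleep(2)  # wait between requests so we don't hammer the server
    # ...fetch and scrape the page here...
```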
With all of this in mind, here's the action plan for this tutorial. First, our script will gather user input for a product the user might be interested in (e.g., they can type in Microsoft Surface). Then, our scraper will search three retailers (eBay, Amazon, and Best Buy) for the top listing's price. Finally, it will return the cheapest of those prices and a link to the purchase page.
Sounds cool right? Let's jump right in!
Setting Up
Terminal Environment, Files & Libraries
Once you've got WSL set up on Windows, or you've got a terminal running and ready to go on Linux, type the following in your terminal:
sudo apt update && sudo apt upgrade
sudo apt install python3 python3-pip ipython3
If you're on macOS, follow this tutorial.
Now, you've got a working Python setup on your computer that you can run your scripts on.
Let's create a new main.py file in your favorite IDE. Personally, mine is Visual Studio Code, but even Windows Notepad or Mac TextEdit will do the job just fine.
In order to extract our target site's HTML code into our Python script, we're going to want to install Python's requests library. This library lets our script talk to websites on the Internet and download their contents. To install it, fire up your terminal and type:
pip3 install requests
If you're using Windows, pip should work just fine as well.
In order to use requests, we must import it into our Python script. To do that, open main.py and type at the top:
import requests
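As a quick sanity check that requests is installed and working, you can fetch any page; example.com below is just a stand-in URL:

```python
import requests

# download a page and confirm the request succeeded
page = requests.get('https://example.com')
print(page.status_code)   # 200 means success
print(len(page.content))  # size of the downloaded HTML, in bytes
```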
Next, we're going to install another important library, one that allows us to explore and work with the HTML we downloaded with requests. With that, I'd like to introduce Beautiful Soup! Beautiful Soup is a highly versatile library that allows us to manage the HTML we extract. To install it, type in your terminal:
pip3 install beautifulsoup4
Let's import Beautiful Soup next by typing into our Python script:
from bs4 import BeautifulSoup
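To get a taste of what Beautiful Soup does before we point it at a real site, here's a minimal sketch that parses a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# a made-up HTML snippet, purely for illustration
html = '<ul><li class="s-item">Widget - $9.99</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

item = soup.find('li', class_='s-item')  # find a tag by name and class
print(item.text)  # Widget - $9.99
```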
Now, as mentioned in the introduction, we'll need to emulate a browser to scrape sites like Amazon. To do this, we'll enlist the help of a web automation tool, Selenium. With Selenium, our script can quietly drive a real web browser and automate its navigation.
The process for configuring Selenium is a little confusing, but this site has easy-to-follow instructions for all web browsers. I recommend you install a web driver listed for a browser already installed on your machine.
Once you've got Selenium configured and a browser driver downloaded on your machine, we must import it into our script:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
The above may be a little different depending on which browser/browser driver you've installed on your machine.
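For example, if you've installed chromedriver for Google Chrome instead, the rough equivalent (using the same older Selenium API as this tutorial, with an assumed driver path) would be:

```python
# rough Chrome equivalent -- the driver path is an assumption about your setup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(executable_path=r'C:\BrowserDriver\chromedriver.exe', chrome_options=options)
```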
Now that you've imported everything you need to begin your project, here's what your script should look like so far:
# libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
Yes, we've imported the Selenium library into our script. However, we can't use it just yet because we still need to point it to the right driver installed on our machine. With that said, let's move on.
Scraping on Dynamic HTML and Strict Websites
BeautifulSoup, as you will see at the end of the tutorial, is truly a BEAUTIFUL library for helping scrape HTML in our script. However, it's not so beautiful when trying to work with dynamic sites (sites where HTML changes constantly) like Amazon, or sites that try to prevent web scraping like Best Buy.
But don't worry guys, we have just the tools required to circumvent these obstacles. Let's dive right in.
First, let's bypass the strict no-scraping policies of sites like Best Buy and Amazon. We can't scrape these sites out of the box because they don't allow automated bots (including our requests-powered BeautifulSoup setup) to pull content without going through an API. However, with the right browser headers, we're able to trick these sites into believing that our scraper is an actual web browser.
To create our browser headers, let's set up a Python dictionary that will store all of them (including the User-Agent string, which differs for every browser). For convenience, here are the headers I've used to scrape Amazon and Best Buy:
headers = {
    'Host': 'www.amazon.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}
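These headers do their work when passed along with a request; here's a sketch of how they'd be attached (the URL is illustrative):

```python
# attach the browser-like headers to a request -- the URL is illustrative
page = requests.get('https://www.amazon.com/s?k=laptop', headers=headers)
print(page.status_code)  # 200 if the site accepted us as a "browser"
```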
Next, let's configure the Selenium library we imported. The browser I've used for this project is Firefox on Windows, as geckodriver supports a headless mode (no GUI window). Your actual script may differ depending on which browser you've installed. For Google Chrome users, check out this link.
If you've got Firefox, enter into your script:
binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')  # this must point to where Firefox is installed on your machine
options = webdriver.FirefoxOptions()  # this holds all the configuration settings for our webdriver
options.add_argument('-headless')  # this opens the browser silently, without displaying a window
driver = webdriver.Firefox(firefox_binary=binary, executable_path=r'C:\BrowserDriver\geckodriver.exe', firefox_options=options)  # executable_path must point to where you've downloaded your web driver
With the above, we can now emulate Firefox on our computer and make sure that our BeautifulSoup parser will only scrape the site once the site is fully loaded.
Summary of Your Progress
Here is what our file should look like so far!
Summary
```python
# libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# global variables
headers = {
    'Host': 'www.amazon.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')  # this must point to where Firefox is installed on your machine
options = webdriver.FirefoxOptions()  # this holds all the configuration settings for our webdriver
options.add_argument('-headless')  # this opens the browser silently, without displaying a window
driver = webdriver.Firefox(firefox_binary=binary, executable_path=r'C:\BrowserDriver\geckodriver.exe', firefox_options=options)  # executable_path must point to where you've downloaded your web driver
```
To summarize, we first imported all the libraries needed to scrape our sites (Requests, BeautifulSoup, and Selenium). We then created a set of headers to trick stricter websites into believing that our scraper is a normal Internet browser, which is necessary for Amazon and Best Buy. Finally, we configured Selenium to point to the browser and web driver installed on our machine.
If you have questions about the setup process, let us know in the comments below and we can help you!
HTML Scraping
Getting User Input
Now that we have the fundamentals in place, let's get working on our main script. First, we'll ask the user which product they'd like to find at the online retailers.
To gather user input, we can use Python's input() function to save the user's entry into a string variable:
search = input('Enter the item you want to search for: ')
print()
print('Searching for ' + search)
Now that we know what the user wants to find, we can get down to the actual scraping.
Scraping eBay
Before we scrape eBay, let's take a look at its page layout and URL structure (in your browser's address bar). Visit eBay and type any term into the search bar, then press Enter (I recommend a search term with two or more words, like Apple iPhone).
After you're taken to the results page, take a look at your address bar. Let's examine the URL path listed. If you entered "Apple iPhone", here is the path you should expect to see:
https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313.TR12.TRC2.A0.H0.XApple+iPhone.TRS0&_nkw=Apple+iPhone... (rest of the path doesn't really matter)
We can split this above URL into two parts: a base URL and the query parameters.
The base URL represents the path to the search query. In the path above, this is everything up to and including &_nkw=. It will mostly remain the same regardless of what you type into the search box. (Ignore the Apple+iPhone text inside the base URL.)
Next, the query parameters house your search query. In the path above, this is _nkw=Apple+iPhone. Since you cannot have spaces inside a browser URL, eBay joins the two words together using a plus sign. Different sites may use different symbols to concatenate the search words in the URL.
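Incidentally, Python's standard library can do this encoding for you, handling spaces and any other unsafe characters; a small sketch:

```python
from urllib.parse import quote_plus

# quote_plus swaps spaces for '+' and escapes other unsafe characters
print(quote_plus('Apple iPhone'))  # Apple+iPhone
```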
With these rules in mind, we can use Python's replace() method to swap the spaces in our search string for plus signs, then append the result to the base URL from above, as shown below:
search = search.replace(' ', '+') # first argument is the text to replace, second is its replacement
print('Gathering eBay listings...')
URL_e = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313.TR12.TRC2.A0.H0.XApple+iPhone.TRS0&_nkw=' + search
page = requests.get(URL_e)
The last line of the snippet above performs an HTTP request to the URL we built. This effectively downloads the HTML content from the URL, and we can now work with it as a standard Python object.
Now, let's use BeautifulSoup to parse this downloaded HTML. To do this, type in your script:
soup = BeautifulSoup(page.content, 'html.parser')
Essentially, we're converting the downloaded page object into a BeautifulSoup object. We can now use the useful functions that BeautifulSoup provides to scrape text from this object!
Now that we've got the Python object, let's identify the HTML we need to scrape from the downloaded content.
All of the listings on the results page must live inside a container of some sort. Let's try to find the parent HTML tag that contains all of these listings. After toggling the element-picker button in your browser's developer tools (Inspect Element in the browser menu), hover over the page's elements until you find the one that fits.
Solution
```html
<div id="mainContent"><!-- listings --></div>
```
Using BeautifulSoup, we can find this specific element by its ID:
results = soup.find(id='mainContent')
By selecting this element ID, we're essentially filtering out the soup broth liquid in the Beautiful Soup, picking out the vegetables to savor.
From the results object, we can extract the actual listing objects themselves into a new variable, listings, through the following:
listings = results.find_all('li', class_='s-item')
BeautifulSoup allows us to find items by class as well, which is what we're doing above: we're looking for li tags (the listings) that have the class attribute "s-item", thus filtering the carrots from the many vegetables we've picked out.
Now that we've got all the listings, we only want to pick the very first one on the page (the top result). For this, we'll iterate through listings with a for loop, as shown in the code snippet below:
runs = 0
for listing in listings:
    if runs >= 1:
        break
    item_element = listing.find('h3', class_='s-item__title')
    item_price = listing.find('span', class_='s-item__price')
    if None in (item_element, item_price):
        continue
    print(item_element.text.strip())
    print(item_price.text.strip())
    print()
    print()
    runs += 1

# save price
ebay_price = item_price.text.strip()
ebay_price = ebay_price.replace('$', '')
ebay_price = float(ebay_price)
The code above loops through all of the listings on the eBay results page, extracting the item name and price by finding their dedicated HTML elements and calling the strip() function to remove extraneous whitespace. We also save the price to a variable to facilitate the price comparison near the end of our script.
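One caveat: eBay sometimes shows a price range (e.g. "$120.00 to $150.00") or a price with a thousands separator ("$1,299.00"), either of which would crash the bare float() conversion above. Here's a defensive sketch using a hypothetical parse_price() helper:

```python
import re

def parse_price(text):
    # pull the first dollar amount out of the string, ignoring commas
    match = re.search(r'[\d,]+\.\d{2}', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))

print(parse_price('$1,299.00 to $1,499.00'))  # 1299.0
```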
And there we have it! We've just implemented code that will scrape the top result and its price from eBay. Basically, we found the listings code and then we pointed BeautifulSoup to scrape the first listing and its price.
The full code that we have so far is down below:
Summary
```python
# libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# global variables
headers = {
    'Host': 'www.amazon.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')  # this must point to where Firefox is installed on your machine
options = webdriver.FirefoxOptions()  # this holds all the configuration settings for our webdriver
options.add_argument('-headless')  # this opens the browser silently, without displaying a window
driver = webdriver.Firefox(firefox_binary=binary, executable_path=r'C:\BrowserDriver\geckodriver.exe', firefox_options=options)  # executable_path must point to where you've downloaded your web driver

# main code
search = input('Enter the item you want to search for: ')
print()
print('Searching for ' + search)

# eBay
search = search.replace(' ', '+')  # first argument is the text to replace, second is its replacement
print('Gathering eBay listings...')
URL_e = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313.TR12.TRC2.A0.H0.XApple+iPhone.TRS0&_nkw=' + search
page = requests.get(URL_e)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='mainContent')
listings = results.find_all('li', class_='s-item')

runs = 0
for listing in listings:
    if runs >= 1:
        break
    item_element = listing.find('h3', class_='s-item__title')
    item_price = listing.find('span', class_='s-item__price')
    if None in (item_element, item_price):
        continue
    print(item_element.text.strip())
    print(item_price.text.strip())
    print()
    print()
    runs += 1

# save price
ebay_price = item_price.text.strip()
ebay_price = ebay_price.replace('$', '')
ebay_price = float(ebay_price)
```
We'll reuse similar logic to scrape results from Amazon and Best Buy in the coming steps.
Scraping Amazon
Let's start by examining the URL from an Amazon results page. Type a term in the Amazon search bar.
After you're taken to the results page, take a look at your address bar. Let's examine the URL path listed. If you entered "Apple iPhone", here is the path you should expect to see:
https://www.amazon.com/s?k=Apple+iPhone... (rest of the path doesn't really matter)
From the URL above, try to determine the base URL and the query parameters.
Solution
Base URL: https://www.amazon.com/s?k=
Query: Apple+iPhone
Now, we'll use the base URL and parameters to generate a URL with our search term:
print('Gathering Amazon listings...')
URL = 'https://www.amazon.com/s?k=' + search # search term we generated before
driver.get(URL)
After creating the search URL, you might notice that we're using driver.get(URL). Here, we're using the Selenium webdriver we configured above, passing in our URL so our hidden Firefox instance loads Amazon's dynamic HTML for us.
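One hedge worth knowing: driver.get() waits for the page's load event, but JavaScript-rendered listings can still appear a moment later. If your results come back empty, an explicit wait usually fixes it; here's a sketch, where the CSS selector is an assumption about Amazon's current markup:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one result title to render
# (the selector is an assumption about Amazon's current markup)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'span.a-size-medium'))
)
```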
Now, let's use BeautifulSoup to parse this downloaded HTML. To do this, type in your script:
soup = BeautifulSoup(driver.page_source, 'html.parser')
This time, we're passing in the HTML source from our web driver into BeautifulSoup to parse.
Now that we've got the Python object, let's identify the HTML we need to scrape from the downloaded content.
NOTE: Sites like Amazon are notorious for changing the HTML of search results pages based on the search term you've entered. For example, if you type "Apple iPhone", Amazon would have a section called "New from Apple", not present with other search terms like "Windows 10."
With Amazon, we're going to have to loop through the results twice: once to get the product's name and again to get its price. This is because Amazon's listing layout is more complex than eBay's above. Let's loop as per the code below:
runs = 0
results = soup.findAll('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'})
for listing in results:
    if runs >= 1:
        break
    print(soup.select_one('span.a-size-medium').get_text())
    runs += 1

runs = 0
results = soup.findAll('span', attrs={'class': 'a-offscreen'})
for listing in results:
    if runs >= 1:
        break
    element = soup.select_one('span.a-offscreen')
    if element is None:
        continue
    print(element.get_text())
    print()
    print()
    runs += 1

# save price (element still holds the price span we found above)
amazon_price = element.get_text().strip()
amazon_price = amazon_price.replace('$', '')
amazon_price = float(amazon_price)
The code above loops through the listings on the Amazon results page twice, extracting the product name and price by finding their dedicated HTML elements and calling the strip() function to remove extraneous whitespace. We also save the price to a variable for the later comparison.
And there we have it! We've just implemented code that will scrape the top result and its price from Amazon. Basically, we found the listings code and then we pointed BeautifulSoup to scrape the first listing and its price.
Next up, our final search site, Best Buy.
Scraping Best Buy
Let's start by examining the URL from a Best Buy results page. Open Best Buy and search for an item.
After you're taken to the results page, take a look at your address bar. Let's examine the URL path listed. If you entered "Windows 10", here is the path you should expect to see:
https://www.bestbuy.com/site/searchpage.jsp?st=Windows+10... (rest of the path doesn't really matter)
From the URL above, try to determine the base URL and the query parameters.
Solution
Base URL: https://www.bestbuy.com/site/searchpage.jsp?st=
Query: Windows+10
Now, we'll use the base URL and parameters to generate a URL with our search term:
print('Gathering Best Buy listings...')
URL_bb = 'https://www.bestbuy.com/site/searchpage.jsp?st=' + search # search term we generated before
driver.get(URL_bb)
With driver.get(URL_bb), we're navigating the hidden Firefox instance from the current site to the Best Buy page with all the search results listed.
Now, let's use BeautifulSoup to parse this downloaded HTML from the search results. To do this, type in your script:
soup = BeautifulSoup(driver.page_source, 'html.parser')
This time, we're passing in the HTML source from our web driver into BeautifulSoup to parse.
Now that we've got the Python object, let's identify the HTML we need to scrape from the downloaded content.
NOTE: Best Buy is as notorious as Amazon above. If we type in a product such as 'Apple iPhone', Best Buy does not load search results; instead, it loads an Apple-affiliated purchase page, which is not what we want. Our search results page will differ based on the product we enter.
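Because the results page can shift like this, it's worth guarding against selectors that find nothing before reading from them; here's a small defensive sketch using the sku-header selector from the code below:

```python
# guard against a page whose layout doesn't match our selectors
name_element = soup.select_one('h4.sku-header')
if name_element is None:
    print('No standard search results found - try a different search term.')
else:
    print(name_element.get_text())
```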
With Best Buy, we're also going to have to loop twice (one loop for the product name, another loop for the price). Let's loop as per the code below:
runs = 0
results = soup.findAll('h4', attrs={'class': 'sku-header'})
for listing in results:
    if runs >= 1:
        break
    print(soup.select_one('h4.sku-header').get_text())
    runs += 1

runs = 0
results = soup.findAll('div', attrs={'class': 'priceView-hero-price priceView-customer-price'})
for listing in results:
    if runs >= 1:
        break
    element = soup.select_one('div.priceView-customer-price > span:first-child')
    if element is None:
        continue
    print(element.get_text())
    print()
    print()
    runs += 1

# save price (element still holds the price span we found above)
bb_price = element.get_text().strip()
bb_price = bb_price.replace('$', '')
bb_price = float(bb_price)
The code above loops through the listings on the Best Buy results page twice, extracting the product name and price by finding their dedicated HTML elements and calling the strip() function to remove extraneous whitespace. We also save the price to a variable for the later comparison.
As you may have noticed at the end of each scraping snippet, we remove the dollar sign from the price. This is because we're converting these prices to float datatypes so we can compare them numerically at the end.
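For example, assuming a scraped string like '$1,099.99', the conversion also needs to drop the comma before float() will accept it:

```python
raw = '$1,099.99'  # an illustrative scraped price string
price = float(raw.replace('$', '').replace(',', ''))
print(price)  # 1099.99
```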
Summary of Your Progress
What did we accomplish so far? Adding on to the first module, in this module, we've basically cooked the vegetables for our Beautiful Soup.
We basically scraped three retail sites for the top result product and price! Let's take a look at the code you should have in your main Python script up to this point:
Summary
```python
# libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# global variables
headers = {
    'Host': 'www.amazon.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')  # this must point to where Firefox is installed on your machine
options = webdriver.FirefoxOptions()  # this holds all the configuration settings for our webdriver
options.add_argument('-headless')  # this opens the browser silently, without displaying a window
driver = webdriver.Firefox(firefox_binary=binary, executable_path=r'C:\BrowserDriver\geckodriver.exe', firefox_options=options)  # executable_path must point to where you've downloaded your web driver

# main code
search = input('Enter the item you want to search for: ')
print()
print('Searching for ' + search)

# eBay
search = search.replace(' ', '+')  # first argument is the text to replace, second is its replacement
print('Gathering eBay listings...')
URL_e = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313.TR12.TRC2.A0.H0.XApple+iPhone.TRS0&_nkw=' + search
page = requests.get(URL_e)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='mainContent')
listings = results.find_all('li', class_='s-item')

runs = 0
for listing in listings:
    if runs >= 1:
        break
    item_element = listing.find('h3', class_='s-item__title')
    item_price = listing.find('span', class_='s-item__price')
    if None in (item_element, item_price):
        continue
    print(item_element.text.strip())
    print(item_price.text.strip())
    print()
    print()
    runs += 1

# save price
ebay_price = item_price.text.strip()
ebay_price = ebay_price.replace('$', '')
ebay_price = float(ebay_price)

# Amazon
print('Gathering Amazon listings...')
URL = 'https://www.amazon.com/s?k=' + search  # search term we generated before
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html.parser')

runs = 0
results = soup.findAll('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'})
for listing in results:
    if runs >= 1:
        break
    print(soup.select_one('span.a-size-medium').get_text())
    runs += 1

runs = 0
results = soup.findAll('span', attrs={'class': 'a-offscreen'})
for listing in results:
    if runs >= 1:
        break
    element = soup.select_one('span.a-offscreen')
    if element is None:
        continue
    print(element.get_text())
    print()
    print()
    runs += 1

# save price (element still holds the price span we found above)
amazon_price = element.get_text().strip()
amazon_price = amazon_price.replace('$', '')
amazon_price = float(amazon_price)

# Best Buy
print('Gathering Best Buy listings...')
URL_bb = 'https://www.bestbuy.com/site/searchpage.jsp?st=' + search  # search term we generated before
driver.get(URL_bb)
soup = BeautifulSoup(driver.page_source, 'html.parser')

runs = 0
results = soup.findAll('h4', attrs={'class': 'sku-header'})
for listing in results:
    if runs >= 1:
        break
    print(soup.select_one('h4.sku-header').get_text())
    runs += 1

runs = 0
results = soup.findAll('div', attrs={'class': 'priceView-hero-price priceView-customer-price'})
for listing in results:
    if runs >= 1:
        break
    element = soup.select_one('div.priceView-customer-price > span:first-child')
    if element is None:
        continue
    print(element.get_text())
    print()
    print()
    runs += 1

# save price (element still holds the price span we found above)
bb_price = element.get_text().strip()
bb_price = bb_price.replace('$', '')
bb_price = float(bb_price)
```
Now, we're on the home stretch. We're just going to compare these prices, see which retailer has the lowest one, then return that product to you with a link.
If you're lost and need help, let us know in the comments below!
Price Comparison
In the code above, we saved our prices to three different variables, one for each company: ebay_price, amazon_price, and bb_price. We're going to compare these prices and figure out which one is the lowest, and then we will list the product with the cheapest price along with a link to the purchase site.
We'll be using if-else statements to perform the comparisons as shown below:
print('Comparing the three prices...')
elected_link = ''
elected_price = 0.00
if (ebay_price <= amazon_price) and (ebay_price <= bb_price):
    elected_price = ebay_price
    elected_link = URL_e
elif (amazon_price <= ebay_price) and (amazon_price <= bb_price):
    elected_price = amazon_price
    elected_link = URL
elif (bb_price <= ebay_price) and (bb_price <= amazon_price):
    elected_price = bb_price
    elected_link = URL_bb
print('Here\'s the link to the lowest price. Have fun!')
print(elected_link)
print('$' + str(elected_price))
# Tutorial finished!
In the code above, we compare the prices from all three retailers, set the final price to the cheapest of the three, and set the link to the corresponding search results page. Then, to wrap up the tutorial, we print out the link along with the price. To prepend the dollar sign, we convert the price back to a string so we can concatenate.
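As an aside, the same selection can be written more compactly with Python's built-in min() over a dictionary; here's a sketch of that alternative:

```python
# an equivalent, more compact comparison using min() over a dict
prices = {URL_e: ebay_price, URL: amazon_price, URL_bb: bb_price}
elected_link = min(prices, key=prices.get)  # URL with the smallest price
elected_price = prices[elected_link]
```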
With that, here's the final program:
Solution
```python
# libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# global variables
headers = {
    'Host': 'www.amazon.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')  # this must point to where Firefox is installed on your machine
options = webdriver.FirefoxOptions()  # this holds all the configuration settings for our webdriver
options.add_argument('-headless')  # this opens the browser silently, without displaying a window
driver = webdriver.Firefox(firefox_binary=binary, executable_path=r'C:\BrowserDriver\geckodriver.exe', firefox_options=options)  # executable_path must point to where you've downloaded your web driver

# main code
search = input('Enter the item you want to search for: ')
print()
print('Searching for ' + search)

# eBay
search = search.replace(' ', '+')  # first argument is the text to replace, second is its replacement
print('Gathering eBay listings...')
URL_e = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313.TR12.TRC2.A0.H0.XApple+iPhone.TRS0&_nkw=' + search
page = requests.get(URL_e)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='mainContent')
listings = results.find_all('li', class_='s-item')

runs = 0
for listing in listings:
    if runs >= 1:
        break
    item_element = listing.find('h3', class_='s-item__title')
    item_price = listing.find('span', class_='s-item__price')
    if None in (item_element, item_price):
        continue
    print(item_element.text.strip())
    print(item_price.text.strip())
    print()
    print()
    runs += 1

# save price
ebay_price = item_price.text.strip()
ebay_price = ebay_price.replace('$', '')
ebay_price = float(ebay_price)

# Amazon
print('Gathering Amazon listings...')
URL = 'https://www.amazon.com/s?k=' + search  # search term we generated before
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html.parser')

runs = 0
results = soup.findAll('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'})
for listing in results:
    if runs >= 1:
        break
    print(soup.select_one('span.a-size-medium').get_text())
    runs += 1

runs = 0
results = soup.findAll('span', attrs={'class': 'a-offscreen'})
for listing in results:
    if runs >= 1:
        break
    element = soup.select_one('span.a-offscreen')
    if element is None:
        continue
    print(element.get_text())
    print()
    print()
    runs += 1

# save price (element still holds the price span we found above)
amazon_price = element.get_text().strip()
amazon_price = amazon_price.replace('$', '')
amazon_price = float(amazon_price)

# Best Buy
print('Gathering Best Buy listings...')
URL_bb = 'https://www.bestbuy.com/site/searchpage.jsp?st=' + search  # search term we generated before
driver.get(URL_bb)
soup = BeautifulSoup(driver.page_source, 'html.parser')

runs = 0
results = soup.findAll('h4', attrs={'class': 'sku-header'})
for listing in results:
    if runs >= 1:
        break
    print(soup.select_one('h4.sku-header').get_text())
    runs += 1

runs = 0
results = soup.findAll('div', attrs={'class': 'priceView-hero-price priceView-customer-price'})
for listing in results:
    if runs >= 1:
        break
    element = soup.select_one('div.priceView-customer-price > span:first-child')
    if element is None:
        continue
    print(element.get_text())
    print()
    print()
    runs += 1

# save price (element still holds the price span we found above)
bb_price = element.get_text().strip()
bb_price = bb_price.replace('$', '')
bb_price = float(bb_price)

# price comparison
print('Comparing the three prices...')
elected_link = ''
elected_price = 0.00
if (ebay_price <= amazon_price) and (ebay_price <= bb_price):
    elected_price = ebay_price
    elected_link = URL_e
elif (amazon_price <= ebay_price) and (amazon_price <= bb_price):
    elected_price = amazon_price
    elected_link = URL
elif (bb_price <= ebay_price) and (bb_price <= amazon_price):
    elected_price = bb_price
    elected_link = URL_bb
print('Here\'s the link to the lowest price. Have fun!')
print(elected_link)
print('$' + str(elected_price))
# Tutorial finished!
```
Project Summary
My friends, that's the essence of web scraping. Now that you've completed this tutorial, you can use Beautiful Soup to scrape and pull data from websites with ease. To take this a step further, feel free to modify your code to scrape other retailers of your choice! This was a lot to take in, but at the end, we all get to enjoy the well-prepared, steaming-hot vegetables inside the Beautiful Soup.