STAT 29000: Project 3 — Spring 2022
Motivation: Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle those tasks when you next run into them. In this project, we will use a variety of scraping tools in order to scrape data from zillow.com.
Context: In the previous project, we got our first taste of actually scraping data from a website and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that require a tool that lets you interact with a browser: selenium.
Scope: python, web scraping, selenium
Questions
Question 1
Pop open a browser and visit zillow.com. Many websites have a similar interface — a bold and centered search bar for a user to interact with.
First, in your browser, type 34474 into the search bar and press enter/return. There are two possible outcomes of this search, depending on the computer you are using and whether or not you’ve been browsing zillow. The first is your search results. The second is a page where the user is asked to select which type of listing they would like to see.
This second option may or may not consistently pop up. For this reason, we’ve included the relevant HTML below, for your convenience.
<div>
<img alt="" src="https://www.zillowstatic.com/s3/homepage/static/interstitial_graphic.png" class="sc-14dvu6m-0 iYqEdo " width="262px" height="100px">
<h3 id="interstitial-title" class="sc-14dvu6m-1 kvYidp ">What type of listings would you like to see?</h3>
<ul class="sc-14dvu6m-2 gfkDFS listing-interstitial-buttons">
<li><button class="StyledButton-c11n-8-48-0__sc-wpcbcc-0 bCYrmZ">For sale</button></li>
<li><button class="StyledButton-c11n-8-48-0__sc-wpcbcc-0 bCYrmZ">For rent</button></li>
<li><button class="StyledTextButton-c11n-8-48-0__sc-n1gfmh-0 jBjBRQ">Skip this question</button></li>
</ul>
</div>
Remember that the value of an element is the text that is displayed between the tags. For example, the following element has "happy" as its value.
<div>happy</div>
You can use XPath expressions to find elements by their value. For example, the following XPath expression will find all div elements whose value is "happy".
//div[text()='happy']
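As a small sketch of matching by value, the helper below builds this kind of XPath expression for any tag and text. The name xpath_by_text is made up for this illustration and is not part of selenium, and the commented usage line assumes a selenium driver already exists.

```python
def xpath_by_text(tag, text):
    """Build an XPath that matches <tag> elements whose text is exactly `text`."""
    # XPath 1.0 has no escape character, so pick a quote style the text doesn't use.
    if "'" not in text:
        return f"//{tag}[text()='{text}']"
    if '"' not in text:
        return f'//{tag}[text()="{text}"]'
    raise ValueError("text contains both quote styles; build the XPath with concat()")


print(xpath_by_text("div", "happy"))  # //div[text()='happy']

# With a live selenium driver (assumed to exist as `driver`), this could be used as:
# elements = driver.find_elements("xpath", xpath_by_text("div", "happy"))
```

The same helper would produce an expression for the "Skip this question" button in the HTML above, since that button is also identified by its value.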
Use selenium and write Python code that first finds the search bar input element. Then, use selenium to emulate typing the zip code 34474 into the search bar followed by a press of the enter/return button.
Confirm your code works by printing the current URL of the page after the search has been performed. What happens? Well, it is likely that the URL is unchanged. Remember, the "For sale", "For rent", "Skip this question" page may pop up, and this page has the same URL. To confirm this, print the HTML after the search instead of the URL.
To print the HTML of an element using
If you don’t know what HTML to expect, the
Of course, please only print a sample of the HTML. We don’t want to print it all, as that would be a lot!
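Here is one possible sketch of the search step and of printing only a sample of the HTML. The function names, the pause length, and the search bar XPath in the usage comment are all assumptions for illustration, so check the real element in the Inspector. To keep the sketch runnable without a browser, it passes the locator strategy as the plain string "xpath" (the value of selenium's By.XPATH) and the enter key as the character behind Keys.ENTER; in your own code you would normally import By and Keys from selenium instead.

```python
import time

XPATH = "xpath"   # same string as selenium's By.XPATH
ENTER = "\ue007"  # same character as selenium's Keys.ENTER


def search_zip(driver, zip_code, search_xpath, pause=5):
    """Type a zip code into the search bar, press enter, and return the new URL."""
    search_bar = driver.find_element(XPATH, search_xpath)
    search_bar.send_keys(str(zip_code))
    search_bar.send_keys(ENTER)
    time.sleep(pause)  # crude wait so the next page has time to load
    return driver.current_url


def html_sample(driver, n_chars=500):
    """Return only the first n_chars characters of the current page's HTML."""
    return driver.page_source[:n_chars]


# Example usage, assuming `driver` is a running selenium WebDriver already on
# zillow.com and that the XPath below (a guess) finds the search bar:
# print(search_zip(driver, 34474, "//input[@type='text']"))
# print(html_sample(driver))
```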
Remember, in the background, |
One downside to selenium is it has some more boilerplate code than,
Please feel free to "reset" your driver (for example, if you’ve lost track of "where" it is or you aren’t getting results you expected) by running the following code, followed by the code shown above.
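A reset could look roughly like the sketch below: quit the old driver, then build a new one. This is not the course's own setup code. It assumes Firefox and geckodriver are installed, and the function names are made up for this example.

```python
def make_firefox_driver():
    """Create a headless Firefox driver (one possible setup, an assumption)."""
    # Imported here so the sketch can be read and tested without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("--headless")  # run without opening a window
    return webdriver.Firefox(options=options)


def reset_driver(old_driver, make_driver=make_firefox_driver):
    """Quit the old driver (if any) and return a fresh one."""
    if old_driver is not None:
        try:
            old_driver.quit()  # closes every window and ends the session
        except Exception:
            pass  # the old session may already be dead; start over anyway
    return make_driver()


# Example usage, assuming Firefox and geckodriver are available:
# driver = reset_driver(driver)
# driver.get("https://www.zillow.com")
```

Passing the driver-building function as an argument means you can swap in whatever setup code your environment actually needs.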
- Code used to solve this problem.
- Output from running the code.
Question 2
Okay, let’s go forward with the assumption that we will always see the "For sale", "For rent", and "Skip this question" page. We need our code to handle this situation and click the "Skip this question" button so we can get our search results!
Write Python code that uses selenium to find the "Skip this question" button and click it. Confirm your code works by printing the current URL of the page after the button has been clicked.
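A minimal sketch of the click, assuming the button still matches the HTML shown in question 1. The function name and the fixed pause are choices made for this example, and the locator strategy is passed as the plain string "xpath", which is the value of selenium's By.XPATH.

```python
import time

XPATH = "xpath"  # same string as selenium's By.XPATH


def click_skip(driver, pause=5):
    """Click the "Skip this question" button and return the resulting URL."""
    # Match the button by its value, using the HTML shown in question 1.
    button = driver.find_element(XPATH, "//button[text()='Skip this question']")
    button.click()
    time.sleep(pause)  # give the results page a moment to load
    return driver.current_url


# Example usage, assuming `driver` is currently showing the listing-type page:
# print(click_skip(driver))
```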
Don’t forget, it may be best to put a |
Uh oh! If you did this correctly, it is likely that the URL is not quite right, something like: www.zillow.com/homes/rb/. By default, this URL will place the nearest city in the search bar, which is not what we wanted. On the bright side, we did notice (when doing this search manually) that the URL should look like: www.zillow.com/homes/34474_rb/. We can just insert our zip code directly in the URL and that should work without any fuss, plus we save some page loads and clicks. Great!
If you are paying close attention — you will find that this is an inconsistency between using a browser manually and using |
Test out (using selenium) that simply inserting the zip code in the URL works as intended. Finding the title element and printing the contents should verify quickly that it works as intended.
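from selenium.webdriver.common.by import By
# find_element_by_xpath is deprecated in selenium 4 and removed in later releases; this form works in old and new versions
element = driver.find_element(By.XPATH, "//title")
print(element.get_attribute("outerHTML"))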
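The URL pattern can be wrapped in a small helper so the zip code is easy to change. This is only a sketch: the function names are invented here, the https:// prefix is an assumption added to the URL shown above, and the pause is a crude stand-in for waiting on the page.

```python
import time


def zip_search_url(zip_code):
    """Build the search URL for a zip code, following the pattern noticed above."""
    # zfill keeps leading zeros if the zip code was passed as an integer.
    return f"https://www.zillow.com/homes/{str(zip_code).zfill(5)}_rb/"


def title_for_zip(driver, zip_code, pause=5):
    """Load the search page for a zip code and return the page title."""
    driver.get(zip_search_url(zip_code))
    time.sleep(pause)  # give the page a moment to load
    return driver.title


print(zip_search_url(34474))  # https://www.zillow.com/homes/34474_rb/

# Example usage, assuming `driver` is a running selenium WebDriver:
# print(title_for_zip(driver, 34474))
```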
- Code used to solve this problem.
- Output from running the code.
Question 3
Okay great! Take your time to open a browser to www.zillow.com/homes/34474_rb/ and use the Inspector to figure out how the web page is structured. For now, let’s not worry about any of the filters. The main useful content is within the cards shown on the page. Price, number of beds, number of baths, square feet, address, etc., are all listed within each of the cards.
What non-li element contains the cards in their entirety? Use selenium and XPath expressions to extract those elements from the web page. Print the value of the id attribute for each of the cards. How many cards were there? (This could vary depending on when the data was scraped, and that is ok.)
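Once you know which element wraps each card, collecting the id attributes might look like the sketch below. The function name is made up, and the XPath in the usage comment is only a guess at the page structure, so replace it with whatever the Inspector shows. The locator strategy is passed as the plain string "xpath", the value of selenium's By.XPATH.

```python
XPATH = "xpath"  # same string as selenium's By.XPATH


def card_ids(driver, cards_xpath):
    """Return the id attribute of every element matched by cards_xpath."""
    cards = driver.find_elements(XPATH, cards_xpath)  # a list, possibly empty
    return [card.get_attribute("id") for card in cards]


# Example usage, assuming `driver` is on the search results page and that the
# XPath below (a guess, not verified against the site) matches the cards:
# ids = card_ids(driver, "//article[@id]")
# print(ids)
# print(len(ids))
```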
You can use the |
- Code used to solve this problem.
- Output from running the code.
Question 4
Write code to compute the mean price of the cards on the page, as well as the mean square footage. Print the values.
Uh oh! Once again, something is not working right. If you were to dig in, you’d find that only about 10 or so cards contain their data. This is because the cards are lazy-loaded, which means that you must scroll in order for the rest of the info to show up. You can verify this by scrolling very quickly: even if the page has been loaded for 10 seconds, content at the bottom will take a second to load after you scroll to it. To fix this problem, we need to scroll! Try the following code. Of course, fill in the
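One way the scrolling could be done, as a minimal sketch: scroll each card into view with a short pause so the lazy-loaded content has time to appear. The function name and the one-second pause are assumptions, and `cards` is whatever list of card elements you extracted in question 3.

```python
import time


def scroll_through_cards(driver, cards, pause=1.0):
    """Scroll each card into view so lazy-loaded content gets a chance to load."""
    for card in cards:
        # Ask the browser to bring this particular element into the viewport.
        driver.execute_script("arguments[0].scrollIntoView();", card)
        time.sleep(pause)  # wait for the card's data to be filled in
    return len(cards)


# Example usage, assuming `driver` and a list of card elements `cards` exist:
# scroll_through_cards(driver, cards)
```

After scrolling, it is safest to extract the text from the cards again, since the content read before scrolling may have been empty.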
Your project writer is mean. Of course, not every card contains a house. Some of them are land. Unfortunately, land doesn’t have a square footage on the website! Do something similar to the following to skip over those annoying plots of land (and don’t forget to fill in the xpaths).
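A sketch of one way to skip the land: look up the price and square footage with find_elements, which returns an empty list instead of raising an error when nothing matches, and ignore any card where either value is missing. The function names are invented for this example, and the two XPaths are left as parameters for you to fill in. Note that parse_number only handles plain figures like "$250,000", not abbreviations like "$1.2M".

```python
import re
from statistics import mean

XPATH = "xpath"  # same string as selenium's By.XPATH


def parse_number(text):
    """Turn text like "$250,000" or "1,850 sqft" into a float, or None if no digits."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def first_number(card, xpath):
    """Return the number in the first element under `card` matching `xpath`, else None."""
    matches = card.find_elements(XPATH, xpath)  # empty list when nothing matches
    if not matches:
        return None
    return parse_number(matches[0].text)


def mean_price_and_sqft(cards, price_xpath, sqft_xpath):
    """Return (mean price, mean square footage) over cards that have both values."""
    prices, sqfts = [], []
    for card in cards:
        price = first_number(card, price_xpath)
        sqft = first_number(card, sqft_xpath)
        if price is None or sqft is None:
            continue  # for example, a plot of land with no square footage
        prices.append(price)
        sqfts.append(sqft)
    if not prices:
        return None, None
    return mean(prices), mean(sqfts)
```

The XPaths passed in should be relative to each card (starting with ".//") so that each lookup stays inside that card.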
- Code used to solve this problem.
- Output from running the code.
Question 5
Update your code from question (4) to first filter the homes by the number of bedrooms and bathrooms. Let’s look at some bigger homes. Filter to get houses with 4+ bedrooms and 3+ bathrooms. Recalculate the mean price and square footage for said houses. Print the values.
To apply said filters, you will need to emulate 3 clicks. One to activate the menu of filters, another to select the number of bedrooms, and another to select the number of bathrooms. You should be able to use a combination of element type (div/button/span/etc.) and attributes to accomplish this.
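Because the bedroom and bathroom options only exist after the filter menu opens, it can help to wait for each element before clicking it. The sketch below polls for an element by hand and then clicks a list of XPaths in order. The function names are made up and the XPaths in the usage comment are placeholders, not the real ones. Selenium also ships a WebDriverWait class for this purpose, which you may prefer in your own code.

```python
import time

XPATH = "xpath"  # same string as selenium's By.XPATH


def wait_for(driver, xpath, timeout=10.0, poll=0.5):
    """Poll until an element matching xpath exists, then return the first match."""
    deadline = time.monotonic() + timeout
    while True:
        matches = driver.find_elements(XPATH, xpath)  # empty list if not there yet
        if matches:
            return matches[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nothing matched {xpath!r} within {timeout} seconds")
        time.sleep(poll)


def click_in_order(driver, xpaths, timeout=10.0, poll=0.5):
    """Click the elements matched by each XPath, one after another."""
    for xpath in xpaths:
        wait_for(driver, xpath, timeout=timeout, poll=poll).click()


# Example usage with placeholder XPaths (fill in the real ones from the Inspector):
# click_in_order(driver, [
#     "//button[...filter menu...]",
#     "//button[...4+ bedrooms...]",
#     "//button[...3+ bathrooms...]",
# ])
```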
- Code used to solve this problem.
- Output from running the code.
Question 6 (optional, 0 pts)
Package your code up into a function that lets you choose the zip code, number of bedrooms, and number of bathrooms. Experiment with the function for different combinations and print your results. If you really want to have some fun, create an interesting graphic to show your results.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.