Week Four of Data Science for the Public Good

Gathering addresses and images for our AI Models!

Week Four
Author

Kailyn Hogan

Published

June 9, 2023

Monday

Meeting with Erin Olson-Douglas

On Monday this week, we met with Erin Olson-Douglas, the Director of the Community and Economic Development department at Iowa State University Extension and Outreach. She is one of our stakeholders for the Housing Project, and we had a great meeting with her. The agenda was to update her on where we are with the project. As of Monday, we were gathering data and ideas for the AI Image Model.

She had a couple of questions:

  1. If the image gathered from Google Street View is bad, can the model be told to look at a different site for a better image?
  2. How will the model choose one of the photos?
    1. Is the image chosen on a rating scale (good, usable, not usable)?
    2. Is the image chosen based on a simple yes or no?

I ended the meeting with a question of my own:

  1. How do we get the model to work directly from websites instead of from pre-loaded images that are already available?

This is something I had not thought about yet. So far, we have been making, training, and testing models on images that we gathered and put into a database ourselves. I am not sure how we would get an AI Model to gather its own images, or if that is something we even need it to do.

We also left the meeting with a better idea of how we will present our final project. We need to keep in mind how our project can be picked up and used by others, and we need to document and explain our process so that future DSPG interns can continue this housing project.

Finally, Erin Olson-Douglas is going to arrange a meeting with a county assessor for us. We are very curious how assessors evaluate houses; we want to know what they look at and look for when completing an assessment. Erin thinks we will be meeting with the Polk County assessor.

Sorting Address Data

On Monday, I also helped my team assemble Excel spreadsheets with all the addresses for Slater, Grundy Center, Independence, and New Hampton. I was in charge of the Slater data and part of the Grundy Center and Independence data sets.

We first needed to gather the addresses for each city. Gavin and Angelina used a spatial tool on Vanguard and Beacon to do this. On each site's map, the second tool from the left is the Selection Tool. When you drag it over a section of properties, the list of parcels shows up in a “Results” section on the right.

Gavin then used a Chrome-extension web scraper to scrape the parcel information listed in “Results” on Beacon. Angelina was lucky and could download the parcel information from Vanguard directly as a .csv file.

Once the parcel information was scraped from Beacon, we had to go in and clean the data. Below is a sample of what the .csv file looked like after Gavin scraped it. The first thing we had to do was get each record onto one line. We used the following Excel formula to strip out line breaks and non-breaking spaces from the street addresses:

=TRIM(CLEAN(SUBSTITUTE(cell, CHAR(160), " ")))

We then used the “Text to Columns” tool in Excel to separate the data into Parcel ID, Street Address, and Owner, using “-” as the delimiter. We cleaned the data to remove any addresses that were obviously non-residential, and we narrowed the data down to just the Parcel ID and address.
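For anyone repeating this step outside of Excel, roughly the same cleaning can be done in R. The sketch below is only a hypothetical equivalent; the example rows and column names are made up, assuming the scraped text looks like “Parcel ID - Street Address - Owner” with stray non-breaking spaces.

library(tidyverse)

# made-up example rows in the shape described above (not the real scraped data)
raw <- tibble(scraped = c("0912345678 - 123 MAIN ST - DOE JOHN",
                          "0987654321 - 405 1ST AVE\u00a0- SMITH JANE"))

cleaned <- raw %>%
  # replace non-breaking spaces (the CHAR(160) characters the Excel formula handles) and squeeze extra whitespace
  mutate(scraped = str_squish(str_replace_all(scraped, "\u00a0", " "))) %>%
  # split on " - ", mirroring Excel's "Text to Columns" with "-" as the delimiter
  separate(scraped, into = c("parcel_id", "address", "owner"), sep = " - ") %>%
  # keep just the Parcel ID and address, as we did in the spreadsheets
  select(parcel_id, address)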

From there, we created the URLs to gather Google Street View images.

We used the following function in Excel to transform the street addresses into workable addresses for Google Street View:

=SUBSTITUTE(TRIM(cell)," ","+")

This formula trims extra whitespace and replaces the remaining spaces with + signs.
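If we later script this step instead of doing it in Excel, the plus-sign addresses can be dropped into Street View Static API request URLs. The sketch below is only an illustration of the idea, not the exact URLs we built; the size parameter and the YOUR_API_KEY placeholder are assumptions.

# hypothetical sketch: turning cleaned addresses into Street View Static API URLs
street_view_url <- function(address, key = "YOUR_API_KEY") {
  plus_address <- gsub(" ", "+", trimws(address))   # same idea as the Excel SUBSTITUTE/TRIM formula
  paste0("https://maps.googleapis.com/maps/api/streetview?size=640x640&location=",
         plus_address, "&key=", key)
}

street_view_url("123 Main St Slater IA")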

Here is the output of the cleaned Slater data:

Tuesday

On Tuesday morning, the entire DSPG group went to Slater to get some practice for the WINVEST project. We walked around with a city council member so we would have some local knowledge of the town while we were practicing. The local knowledge was very helpful, as we would not have known what some of the downtown commercial buildings were used for without the city council member.

We discovered on Tuesday that gutters are not the best test of our AI Model because most houses in Slater either had perfectly fine gutters or had no gutters at all. We do not think we will be able to get enough images to successfully train an AI Model to identify damaged gutters.

Tuesday afternoon was spent web scraping. I did the DataCamp training for web scraping in R about a week and a half ago; it made sense at the time, but applying it to real-world scenarios has been harder. Angelina and I have done a lot of Googling to find other examples to help us. We were tasked earlier with scraping county housing assessor data. Independence, in Buchanan County, is on Vanguard. Slater in Story County, Grundy Center in Grundy County, and New Hampton in Chickasaw County are all on Beacon. We needed help figuring out how to scrape from these sites this week.

I successfully scraped the categories of shoes from my favorite shoe site, Jonak, though.

library(rvest)
Warning: package 'rvest' was built under R version 4.1.3
jonak = "https://www.jonak-paris.com/collection/shoes/sandals.html"
categories <- read_html(jonak) %>% html_elements(".categ_itm_name") %>% html_text2()
head(categories)
[1] "New in"              "Mules, Clogs"        "Sandals"            
[4] "Beach sandals"       "Wedges, Espadrilles" "Babies, salomes"    

I wanted to know whether Beacon and Vanguard have anti-web-scraping protections, and whether that is why Angelina and I were unsuccessful in scraping them. I found a function called paths_allowed() in the robotstxt package that checks a site’s robots.txt file to see whether scraping a given path is disallowed. Running the URLs through the function showed that both Beacon and Vanguard disallow scraping. Jonak doesn’t, which is why it was easy to scrape. Zillow doesn’t have any protections, either.

library(robotstxt)
Warning: package 'robotstxt' was built under R version 4.1.3
#TRUE = web scraping allowed, FALSE = web scraping not allowed
paths_allowed("https://beacon.schneidercorp.com/Application.aspx?AppID=165&LayerID=2145&PageTypeID=3&PageID=1107&Q=1818183221")

 beacon.schneidercorp.com                      
[1] FALSE
paths_allowed("https://buchanan.iowaassessors.com/results.php?mode=basic&history=-1&ipin=%25&idba=&ideed=&icont=&ihnum=&iaddr=&ilegal=&iacre1=&iacre2=&iphoto=0")

 buchanan.iowaassessors.com                      
[1] FALSE
paths_allowed("https://www.zillow.com/homedetails/2925-Arbor-St-Ames-IA-50014/93961907_zpid/")

 www.zillow.com                      
[1] TRUE
paths_allowed("https://www.jonak-paris.com/collection/shoes/sandals.html")

 www.jonak-paris.com                      
[1] TRUE

Because Zillow doesn’t have protections, the housing team decided to switch tactics. We had decided earlier to scrape Zillow and Trulia alongside Vanguard and Beacon for housing data. When we started scraping Trulia, we learned that Zillow owns Trulia. This was a huge win for us because it meant we only had to scrape one of the sites. We chose Zillow because it provides Zestimates (Zillow’s estimated market value for a home) and Trulia does not.

Wednesday and Thursday

On Wednesday, I was tasked with scraping the houses in Slater, IA. I made this data frame in R with the data I scraped from Zillow.

library(rvest)
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.1.3
Warning: package 'ggplot2' was built under R version 4.1.3
Warning: package 'tibble' was built under R version 4.1.3
Warning: package 'tidyr' was built under R version 4.1.3
Warning: package 'readr' was built under R version 4.1.3
Warning: package 'purrr' was built under R version 4.1.3
Warning: package 'dplyr' was built under R version 4.1.3
Warning: package 'stringr' was built under R version 4.1.3
Warning: package 'forcats' was built under R version 4.1.3
Warning: package 'lubridate' was built under R version 4.1.3
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.2     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.2     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.1     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter()         masks stats::filter()
x readr::guess_encoding() masks rvest::guess_encoding()
x dplyr::lag()            masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#### Pulling recently SOLD houses ######
########################################

sold = "https://www.zillow.com/slater-ia/sold/?searchQueryState=%7B%22mapBounds%22%3A%7B%22north%22%3A41.930365556704984%2C%22east%22%3A-93.55027834838869%2C%22south%22%3A41.782563414617314%2C%22west%22%3A-93.76760165161134%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22sche%22%3A%7B%22value%22%3Afalse%7D%2C%22schm%22%3A%7B%22value%22%3Afalse%7D%2C%22schh%22%3A%7B%22value%22%3Afalse%7D%2C%22schp%22%3A%7B%22value%22%3Afalse%7D%2C%22schr%22%3A%7B%22value%22%3Afalse%7D%2C%22schc%22%3A%7B%22value%22%3Afalse%7D%2C%22schu%22%3A%7B%22value%22%3Afalse%7D%2C%22land%22%3A%7B%22value%22%3Afalse%7D%2C%22rs%22%3A%7B%22value%22%3Atrue%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A12%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20522%2C%22regionType%22%3A6%7D%5D%2C%22pagination%22%3A%7B%7D%7D"
# read the html in the url
ss = read_html(sold)

# each property card is an <article> element; this shows how many listings there are to pull from
housesold <- ss %>% html_elements("article")

#create a dataframe with addresses, prices, bathrooms, bedrooms, and square footage of all SOLD houses
res_ss <- tibble(
      address= ss %>% html_nodes(xpath = "/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/a/address") %>% html_text(),
      price = ss %>% html_nodes(xpath = '/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/div/div/span') %>% html_text(),
      bedrooms = ss %>% html_nodes(xpath = '/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/div[3]/ul/li[1]/b') %>% 
        html_text(),
      bathrooms = ss %>% html_nodes(xpath = '/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/div[3]/ul/li[2]/b') %>% 
        html_text(),
      sqft = ss %>% html_nodes(xpath = '/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/div[3]/ul/li[3]/b') %>% 
        html_text()
    ) 

    
##### Pulling FOR SALE houses #####
######################################

sale = "https://www.zillow.com/slater-ia/?searchQueryState=%7B%22mapBounds%22%3A%7B%22north%22%3A41.930365556704984%2C%22east%22%3A-93.55027834838869%2C%22south%22%3A41.782563414617314%2C%22west%22%3A-93.76760165161134%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22sche%22%3A%7B%22value%22%3Afalse%7D%2C%22schm%22%3A%7B%22value%22%3Afalse%7D%2C%22schh%22%3A%7B%22value%22%3Afalse%7D%2C%22schp%22%3A%7B%22value%22%3Afalse%7D%2C%22schr%22%3A%7B%22value%22%3Afalse%7D%2C%22schc%22%3A%7B%22value%22%3Afalse%7D%2C%22schu%22%3A%7B%22value%22%3Afalse%7D%2C%22land%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A12%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20522%2C%22regionType%22%3A6%7D%5D%2C%22pagination%22%3A%7B%7D%7D"
# read the html in the webpage
pg = read_html(sale)

# get the list of houses for sale that appear on the page
# each property card is an <article> element when you inspect the webpage
housesale <- pg %>% html_elements("article")

# create a dataframe for the FOR SALE houses
res_pg <- tibble(
  address= pg %>% html_nodes(xpath = "/html/body/div[1]/div[5]/div/div/div[1]/div[1]/ul/li//div/div/article/div/div[1]/a/address") %>% html_text(),
  price = pg %>% html_nodes(xpath = '/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/div/div/span') %>% html_text(),
  bedrooms = pg %>% html_nodes(xpath = '/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/div[3]/ul/li[1]/b') %>% 
    html_text(),
  bathrooms = pg %>% html_nodes(xpath = '/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/div[3]/ul/li[2]/b') %>% 
    html_text(),
  sqft = pg %>% html_nodes(xpath = '/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/div[3]/ul/li[3]/b') %>% 
    html_text()
) 

# combine recently SOLD and FOR SALE houses in one data frame
results <- res_pg %>% bind_rows(res_ss)
print(results)
# A tibble: 12 x 5
   address                             price    bedrooms bathrooms sqft 
   <chr>                               <chr>    <chr>    <chr>     <chr>
 1 201 10th Ave, Slater, IA 50244      $295,000 3        2         924  
 2 50287 210th Hwy, Slater, IA 50244   $475,000 4        2         2,147
 3 1013 Redbud Dr, Slater, IA 50244    $429,900 4        3         1,884
 4 506 8th Ave, Slater, IA 50244       $328,000 4        3         1,180
 5 604 Story St, Slater, IA 50244      $200,000 3        1         1,359
 6 611 1st Ave N, Slater, IA 50244     $265,000 4        2         1,116
 7 101 Main St, Slater, IA 50244       $255,000 5        3         1,896
 8 1015 Redbud Dr, Slater, IA 50244    $397,732 2        3         1,325
 9 107 Main St, Slater, IA 50244       $242,500 4        2         1,515
10 52898 Highway 210, Slater, IA 50244 $400,000 4        1.75      2,550
11 604 Marshall St, Slater, IA 50244   $135,000 3        1         1,166
12 104 N Benton St, Slater, IA 50244   $210,000 3        2         1,056
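One thing the printed tibble shows is that every column comes back as character (prices like “$295,000” and square footage like “2,147”). We have not done this yet, but if we want to summarize those numbers later, a conversion along these lines (just a sketch, not part of my scraping code) should work:

# possible follow-up: convert the character columns to numeric for later summaries
results_clean <- results %>%
  mutate(price = parse_number(price),        # strips "$" and commas
         bedrooms = parse_number(bedrooms),
         bathrooms = parse_number(bathrooms),
         sqft = parse_number(sqft))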

This web scraping was really hard. I spent a lot of time understanding the XPaths and why I was using them, but I think it was beneficial to me. I got the XPaths for my code by inspecting the Zillow webpage. To inspect, right-click on the web page and select “Inspect.” This opens a panel that shows you the HTML of the page.

If you right-click on any element in the HTML, you can select “Copy” and then “Full xpath” to copy that element’s XPath. Some minor changes needed to be made to the copied XPaths, though. Below is an example:

/html/body/div[1]/div[5]/div/div/div/div/div[1]/ul/li[1]/div/div/article/div/div[1]/a/address

v.

/html/body/div[1]/div[5]/div/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/a/address

The difference between these two XPaths is what comes after the li element. The first XPath selects only the first li element. The second replaces li[1] with li//, which matches every li element and all of its descendants, so the rest of the path is evaluated for every listing card instead of just the first one.
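As a sanity check, you can run both forms against the same page and compare how many addresses come back. The snippet below reuses the ss page object from my sold-houses code; the exact count depends on what Zillow serves, but the li[1] version should return a single address and the li// version one per listing.

# comparing the two XPath forms against the sold-listings page read earlier (ss)
first_li_only <- ss %>%
  html_nodes(xpath = "/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li[1]/div/div/article/div/div[1]/a/address") %>%
  html_text()

all_li <- ss %>%
  html_nodes(xpath = "/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li//div/div/article/div/div[1]/a/address") %>%
  html_text()

length(first_li_only)   # just the first listing
length(all_li)          # one address per listing card on the page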

Friday

I spent a lot of time on Thursday creating the team blog for this week. We gave a presentation to a visitor from Oklahoma State University on Friday morning, so I made sure to put a lot of my attention toward that.

After the presentation on Friday morning, I got back to web scraping. Gavin was able to scrape some images from Zillow on Thursday night, and he shared the code he used with Angelina and me. The code is below:

# webpage to scrape. This specific link brings you to the Grundy Center houses for sale.
zillow_url_grundy <- "https://www.zillow.com/grundy-center-ia/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22Grundy%20Center%2C%20IA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-93.21166512207031%2C%22east%22%3A-92.40828987792969%2C%22south%22%3A42.153050722920995%2C%22north%22%3A42.55594363773797%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A24980%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22land%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D"
webpage_grundy <- read_html(zillow_url_grundy)

# gathers addresses. This XPath can be obtained by right-clicking on the address you want and clicking Inspect.
# you then must navigate to the HTML section that contains the text, right-click again, and go to Copy -> Full xpath.
# to gather all addresses on the page, the copied XPath must be altered: the XPath below uses li//, which selects the descendants of every li, where the copied XPath would just have li[1]/...
addresses <- webpage_grundy %>%
  html_nodes(xpath = "/html/body/div[1]/div[5]/div/div/div[1]/div[1]/ul/li//div/div/article/div/div[1]/a/address") %>%
  html_text()
print(addresses)


# gathers image links. Similar method as above
image_urls <- webpage_grundy %>%
  html_nodes(xpath = '//*[@id="swipeable"]/div[1]/a/div/img') %>%
  html_attr("src")

print(image_urls)

#downloads first item
#download.file(image_urls[1], "image.png", mode = "wb")

# creates a folder for the scraped images, then iteratively names each image (1-9 in this case).
# More specifically, it takes the image links gathered above, goes to each link, and downloads the image.
# dir.create makes a new folder. You only need to run this once. Every time you file.path to that folder, newly downloaded images are added to it.
# This method simply names each image 1 through the number of images.
dir.create("images_grundy_sale")
for (i in seq_along(image_urls)) {
  file_path <- file.path("images_grundy_sale", paste0("image_", i, ".png"))
  download.file(image_urls[i], file_path, mode = "wb")
  print(file_path)
}

# creates a folder for the scraped images, then names them based on the address they were scraped with
# same as above except for how the images are named: for each image, the address grabbed earlier is used in the name.
# this returns (image_123 main st), for example
# you can alter this to follow our naming convention (source_city_address_) by replacing "image_" with the source and city
# for example, if you are pulling Slater images from Zillow it would be paste0("Z_S_", address, "_.png"), which titles the images like Z_S_123 main st_.png
# Z (Zillow) G (Google) V (Vanguard) B (Beacon) :: S (Slater) H (New Hampton) D (Independence) G (Grundy Center)
dir.create("images_grundy_sale_addresses")
for (j in seq_along(image_urls)) {
  address <- addresses[j]
  file_name <- paste0("image_", address, ".png")
  file_path <- file.path("images_grundy_sale_addresses", file_name)
  download.file(image_urls[j], file_path, mode = "wb")
  print(file_path)
}
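To apply the naming convention Gavin describes in the comments (source_city_address_), a tiny helper like the hypothetical one below could replace the "image_" prefix. The function name and the example address are mine, but the source and city codes come straight from his comment.

# hypothetical helper for the naming convention above
# source codes: Z = Zillow, G = Google, V = Vanguard, B = Beacon
# city codes:   S = Slater, H = New Hampton, D = Independence, G = Grundy Center
image_file_name <- function(source_code, city_code, address) {
  paste0(source_code, "_", city_code, "_", address, "_.png")
}

image_file_name("Z", "G", "123 Main St")   # "Z_G_123 Main St_.png"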

Next week, I will work on getting around the anti-web-scraping protections Beacon and Vanguard have on their sites. Beacon and Vanguard have information on houses that isn’t listed on Zillow, as well as pictures that aren’t available on Zillow or Google Street View.

Notes to self:

git pull

git add .

git commit -m "message"

git push