Module 5–WEDNESDAY MORNING, June 24
You’ve no doubt heard the term “web scraping.” It refers to the use of automated means to collect data from websites. We can’t teach you web scraping in a general data course, but we can show you some ways to collect data that might spark your imagination. We will use Google Sheets in both videos and, in the second one, the Google Chrome web browser as well.
Watch the video on using IMPORTDATA and IMPORTHTML (runs 8:09). Or download.
Watch the video on extracting JSON data with developer tools (runs 9:27). Or download.
- Find a CSV file online that you would like to import into a Google sheet. Right-click (Control-click on a Mac) the link to copy the URL for the CSV file. Use the IMPORTDATA function in Google Sheets to bring the data into the sheet. If you like, you can use one of the CSVs linked on the additional resources page.
- Go to a website that has some COVID-19 data arranged in a table (example: https://www.princeedwardisland.ca/en/information/health-and-wellness/pei-covid-19-testing-data). Examine the page source, as shown in the video, to determine whether <table> tags enclose the data (a table will also contain <tr> and <td> tags). Remember that closing tags add a /, so </table> ends a table. Once you have determined that the data is in an HTML table coded into the page source, use IMPORTHTML to import the data to a Google sheet.
- Challenge: Go to https://novascotia.ca/coronavirus/data/ and notice that there is no download link for the data displayed on the page. You learned in the video that the data a page displays is sometimes passed to it in the background, often as JSON but possibly as XML, or in a CSV or other text file. Using the method shown in the video, use Chrome Developer Tools to locate the file that contains the raw data. You may have to examine more than one file. Once you find the right file, copy the response and paste it into a text file. Extra challenge: see if you can get it into Google Sheets. We can go over the answer in the live session.
- Still want more? Explore the network traffic that populates the federal government page at https://health-infobase.canada.ca/covid-19/epidemiological-summary-covid-19-cases.html
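If you are curious what two of the steps above might look like in code, here is a minimal Python sketch using only the standard library. It is not the method shown in the videos. The first part pulls the rows out of any <table> found in a page's source, and the second turns a JSON response into CSV text that you could save and import into a sheet. The sample HTML and JSON strings, and the names TableExtractor, extract_tables and json_records_to_csv, are made up for illustration; they are not the real contents of any site listed above. The sketch assumes simple, non-nested tables and a JSON array of flat records.

```python
import csv
import io
import json
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in an HTML document.

    Assumption: tables are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html_text):
    """Return every table in html_text as a list of rows of cell strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Build the column list from every key seen, in first-seen order.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample data, standing in for a page source and a copied response.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Tests</th></tr>
  <tr><td>2020-06-22</td><td>120</td></tr>
</table>
"""
SAMPLE_JSON = '[{"date": "2020-06-22", "cases": 1}, {"date": "2020-06-23", "cases": 0}]'

if __name__ == "__main__":
    print(extract_tables(SAMPLE_HTML))
    print(json_records_to_csv(SAMPLE_JSON))
```

If extract_tables returns an empty list for a page's source, the data is probably not in an HTML table, which is a hint that it may be arriving in the background as in the challenge exercise. Real responses are often nested rather than a flat list, so expect to dig into the JSON structure before a conversion like this will work.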
Data for this exercise is all online. The exercise contains the necessary links.
Live session east, on Zoom: 1:30 p.m. Atlantic/12:30 p.m. Eastern
Live session west: 2:30 p.m. Eastern/1:30 p.m. Central/11:30 a.m. Pacific
From the textbook: Tantalized by what we’ve shown you so far and want to go deeper? Chapter 9 of The Data Journalist gets into more detail on web scraping and how to do it using the Python programming language.