Web Scraping and SCOTUS

October 17, 2020

Table of Contents

1. Intro to Web Scraping and HTML

Every website you see on the web is built from the so-called Hypertext Markup Language (HTML). The father of the web, not to be confused with the father of the internet, birthed it alongside the web itself circa 1990. The term "markup" comes from how paper documents were (and are) annotated by people like editors, whose notes covered not only spelling and grammar but also instructions for how the text should be presented (e.g. bold text, larger header, etc.). Similarly, HTML tags instruct the browser how to present content to you - not just text, but every element you see on the screen - while another language, CSS (Cascading Style Sheets), is used to specify their style.

Web scraping, then, is the process of automatically capturing data (images, text, etc.) from a web page and storing it somewhere for your own purposes, likely for use in research, an analysis, or even an application or product. And while the goal with web scraping is often pulling down data, the same techniques can be used for browser automation or robotic process automation (RPA) in general. The possibilities, much like for the web itself, are endless.

🧮 Actuarial Shout-out

Hypothetically, if you wanted to systematically compare and analyze insurance plans available through an insurer that provides quotes online (e.g. AXA, across a variety of products), or even through a public marketplace (say healthcare.gov for health insurance plans in the U.S.), you would of course need to source the required data first; in this case, that data is often semi-publicly available.

In comes web scraping: you can write a script that goes to the website, enters all of the requested info to generate a quote (age, gender, area, whatever else they expose that may go into the pricing of a given product), and repeats that over and over for many combinations of the inputs, capturing the resulting quote for each to build up a comprehensive dataset. You could then use that data for cost comparisons, inferring the rating factors, and so on. A rough sketch of what that might look like follows below.
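To make that concrete, here is a minimal, purely hypothetical sketch: the endpoint URL, parameter names, and response fields are all made up for illustration, since a real quoting site would have its own forms, quirks, and terms of service to deal with.

quote_grid_sketch.py (hypothetical)
import csv
import itertools

import requests

# Hypothetical endpoint and parameters - illustration only, not a real API.
QUOTE_ENDPOINT = "https://example-insurer.test/api/quote"

ages = [25, 35, 45, 55]
genders = ["F", "M"]
areas = ["urban", "suburban", "rural"]

rows = []
for age, gender, area in itertools.product(ages, genders, areas):
    # Ask the (imaginary) quoting service for a premium for this combination of inputs.
    response = requests.get(
        QUOTE_ENDPOINT, params={"age": age, "gender": gender, "area": area}
    )
    response.raise_for_status()
    rows.append([age, gender, area, response.json()["premium"]])

# Store the full grid of quotes for later cost comparisons or rating-factor analysis.
with open("quote_grid.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Age", "Gender", "Area", "Premium"])
    writer.writerows(rows)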

In this post, we'll talk a bit about some of the controversy surrounding the use of web scraping, and then give a gentle introduction to how to get started doing it yourself.

2. ScrapedIn by hiQ - the Supreme Court is Interested

A three-year war between LinkedIn and a company you've never heard of called hiQ Labs, whose entire business hinges on being able to scrape data from public profiles on LinkedIn, appears to be headed to the Supreme Court of the United States (SCOTUS). LinkedIn has already lost two battles (once in District Court and then again in the 9th Circuit Appellate Court), but this final battle will be the only one that matters and could set a serious precedent.

hiQ takes the info it pulls from people's profiles on LinkedIn, runs some algorithms to identify people who, by its estimates, are likely to turn over soon, and then sells reports summarizing this info to employers. Not exactly a wholesome business if you ask me - I hope they have set some pretty high confidence thresholds, because my actuarial judgement tells me some people would get pretty ticked off if their boss thought they were planning on quitting based on an erroneous report.

The drama started with a cease-and-desist letter from LinkedIn, demanding hiQ stop scraping their user data; hiQ fought it and actually won in the Northern District of California. LinkedIn's argument is founded on the Computer Fraud and Abuse Act (CFAA), which was enacted in 1986, long before the internet had legs to stand on and became what it is today. The law is broad and vague, and can result in lengthy sentences for otherwise common and innocuous online acts. LinkedIn, backed by Microsoft, was not going to take a loss lying down. So they appealed, and lost again - the appellate court ruled that hiQ's scraping didn't violate the CFAA because it targeted profiles that did not require a password to be viewed.

"The Ninth Circuit's decision upsets that stable understanding and prevents websites from setting and enforcing transparent standards that allow their users to understand how their data will (and will not) be used and made available to third parties." - LinkedIn

But again, LinkedIn being LinkedIn, they petitioned the Supreme Court to hear the case. The Supreme Court has asked hiQ to respond, which signals at least some interest and makes it more likely the Court takes the case on, since it typically rejects the vast majority of such petitions. If the Court ends up not hearing it, the Ninth Circuit's decision will stand.

"If the Ninth Circuit's decision is left in place, technology companies in that Circuit will have no recourse under the CFAA against, for instance, a third-party-scraper employing artificial intelligence to compile a massive database that could allow for instant facial recognition (and possible surveillance) of billions of people, while companies litigating in other circuits will be able to combat such activity." - LinkedIn, not without some hypocrisy

This is something we'll be keeping an eye on... once more info comes out you can expect a follow-up post.

2.1. What's at Stake

The issue of the CFAA's vagueness and outdatedness is of paramount importance. Data is the new oil in our globalized, digital economy. The status quo casts questions of who can access and control it into a grey area. That creates significant potential business risk for just about any organization when it comes to sourcing and stewarding data (especially if that is their sole business, e.g. the alternative data industry, which has been on the rise), and risk for individuals with regard to how their data gets used and whether they even own it. The hope is that the Supreme Court will hear the case and provide some much-needed clarity around these topics, at least from a U.S. perspective.

3. Playing in the Grey Zone

With your jurisdiction and your appetite for data in mind, let's carry on with how to go about doing a bit of scraping of your own - just for practice. Please choose wisely what you scrape; I bear no liability in sharing with you what is now so commonplace and simple to do.

Let's start with a few basics. In order to scrape data from a website, you need to understand the nuts and bolts of how a website is structured... and in particular, the exact one that you want to extract information from.

The rest of this post gets into some hands-on technical aspects, so if that's of no interest to you, feel free to bail now; that said, I'd recommend you come along for the ride and give it a shot.

3.1. How to Browse HTML

If you're reading this page on a desktop, go ahead and press F12 right now on your keyboard. If you haven't done this before, congratulations, you've just stepped through the portal. You should see all the elements that make up the page, and there's a little button (the element selector) that lets you go back to the site, click any element on the screen, and jump straight to the HTML that created it. You can do this on any website to get a peek under the hood, and it will come in very handy when you roll up your sleeves to write code that scrapes data from a given page.

Here's a quick clip of me showing this in action to change part of Google's background to yellow:

But don't get too excited - it only turned yellow on my local machine, just in my browser, and only during that session. If you manage to turn it yellow for everyone else, well, then I imagine Google would pay you a pretty hefty bug bounty if you told them how you did it. Something to strive for I guess.

If this is all news to you and you want to learn more about HTML, check out w3schools.com's tutorial, which is 100% free and comes with exercises. This is also a very good whirlwind tour: Beginners' Guide To Writing Good HTML.

3.2. Scraping with Python

Let's start slow and safe, with a basic website that is just begging to be scraped: Quotes to Scrape!

It's a simple site that lists a bunch of great quotes, each displayed as a block containing the quote text, its author, and a set of tags (see the sample quote element below).

In our scenario, we really want to capture all these great quotes, but copying and pasting would take a long time (many pages, formatting issues, etc.) and would miss some things completely - for example, the tags attached to each quote (like "truth" in the sample we'll inspect below) wouldn't get picked up. Furthermore, if this is something we wanted to repeat, we would have to check the site for updates and manually repeat the process to add new quotes. This is just no good. Writing a simple script once gives us a repeatable, schedulable solution that can be completely automated.

Identifying the Structure and the Targets

The Targets

If we hit F12 on our keyboard while looking at the quotes site, and use the element selector tool (CTRL + SHIFT + C) to click on one of the quotes, we'll see the following HTML:

Quote Element from quotes.toscrape.com
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">โ€œNever tell the truth to people who are not worthy of it.โ€</span>        <span>by <small class="author" itemprop="author">Mark Twain</small>          <a href="/author/Mark-Twain">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="truth">
            <a class="tag" href="/tag/truth/page/1/">truth</a>
        </div>
    </div>

If we look at this closely, we can see how it represents the end result seen on screen, and which elements contain the information that we're interested in obtaining programmatically: the quote and the author.

The entire quote item is represented by a <div>, and the first <span> element that it contains (i.e. its first "child") contains the text of the quote itself. So that's our first target.

Our second target is the author, which sits in the very next <span>'s first child, a <small> element. It is the content (what is in between the opening and closing tags, e.g. <span> and </span>) that we are after, which is just text in this case.
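To make this concrete, here's a minimal sketch that feeds the snippet above into Beautiful Soup (which we'll set up properly below) and pulls out both targets:

target_practice.py
from bs4 import BeautifulSoup

# The quote element from above, pasted in as a string purely for illustration.
sample_html = """
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">"Never tell the truth to people who are not worthy of it."</span>
    <span>by <small class="author" itemprop="author">Mark Twain</small>
        <a href="/author/Mark-Twain">(about)</a>
    </span>
</div>
"""

soup = BeautifulSoup(sample_html, features="html.parser")

quote_text = soup.find("span", {"class": "text"}).text   # first target: the quote itself
author = soup.find("small", {"class": "author"}).text    # second target: the author

print(author, quote_text)  # Mark Twain "Never tell the truth to people who are not worthy of it."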

The Collection

As in this example, you'll often be capturing info across a collection of items, not just the info for one item. In our case, we know what we want from each quote, but we also need to identify how to get at all of those quotes.

The website only shows 10 quotes per page. At the bottom of each page is a Next and/or Previous button. But we don't have to bother with those directly in our code - instead, we just observe what happens as we click them: the page number in the URL gets incremented or decremented accordingly, and the URL takes the form:

http://quotes.toscrape.com/page/{PAGE_NUMBER}/
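As a quick sanity check (just a sketch, assuming the pattern above holds), you can build a few of these URLs yourself and confirm they respond:

page_check.py
import requests

BASE_URL = "http://quotes.toscrape.com/page/"

# Build the first few page URLs from the pattern and confirm each one responds.
for page_number in range(1, 4):
    response = requests.get(f"{BASE_URL}{page_number}/")
    print(response.url, response.status_code)  # expecting a 200 for each page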

So now we are armed with enough information to sketch out a plan of attack, and then execute it.

Crafting an Approach

Now that we have identified what we are after and where it can be found, we need to design an approach to retrieve it. As with most things in life, there are many ways to tackle the problem and achieve the same aim. When it comes to web scraping in particular, here are a few things to keep in mind:

  1. The site may change. If your scraping code relies on too many conditions to find the target, it may break more easily. Some websites rarely update, others may update quite frequently. A good setup will reduce your maintenance headache (assuming you're writing something that you plan on using more than once).

  2. Try to find the most direct path. Instead of first looking for a quote item (which represents the entire block with all the info) and going quote by quote, you may instead target all of the specific elements that contain the quote text itself in one shot. In addition, try not to rely on the order of elements: order is subject to change (as in point 1), and order-based code typically reads worse (e.g. code that searches for a <small> element with class="author" is a lot more self-explanatory than code that searches for the 2nd <span>'s first child under each <div> with class="quote"). A short sketch contrasting the two approaches follows after this list.

  3. Pick the right tool for the job. Sometimes websites are very dynamic, e.g. clicking certain buttons may retrieve or unhide additional information without taking you to a new page. This is done through JavaScript that runs in your web browser, and it makes scraping more complicated. Libraries like Selenium help with managing these types of interactions, which we can cover in a separate post for a more advanced scenario. Here, however, good old Beautiful Soup will do just fine, as we can see that the web page requires no interaction for us to get to our desired data elements.
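Here's the sketch promised in point 2: a readable, class-based lookup versus a fragile, order-based one, both run against the live quotes site (using Beautiful Soup and Requests, which we set up in the next section).

selector_comparison.py
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(
    requests.get("http://quotes.toscrape.com/page/1/").content,
    features="html.parser",
)

# Readable and robust: ask for exactly what you want, by tag and class.
authors_by_class = [el.text for el in soup.findAll("small", {"class": "author"})]

# Fragile and opaque: rely on the position of elements within each quote block.
# If the site ever reorders or inserts elements, this silently grabs the wrong thing.
authors_by_position = [
    quote.findAll("span")[1].find("small").text
    for quote in soup.findAll("div", {"class": "quote"})
]

print(authors_by_class == authors_by_position)  # True today, but only the first is built to last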

Writing the Script

If you already have Python up and running on your machine, fantastic. If not, and you're brand new to it, check out this First Steps with Python guide to get set up and cover a basic intro to this great programming language, which has really taken off this past decade, in part due to its low barrier to entry. I would recommend using VS Code, however, which I outline here in a post under the Tools section of this site.

Make sure Beautiful Soup and Requests are installed: pip install beautifulsoup4 requests (run in your command prompt/terminal). Then, in your code editor of choice (which, again, I would strongly recommend VS Code for), run the following:

quote_scraper.py
from bs4 import BeautifulSoup
import requests
import csv

QUOTE_SITE_BASE_URL = "http://quotes.toscrape.com/page/"

def get_quotes():

    page_no = 0
    quote_collection = []

    while True:
        page_no += 1
        print(f"Checking page {page_no} of quotes!")
        request = requests.get(f"{QUOTE_SITE_BASE_URL}{page_no}")
        soup = BeautifulSoup(request.content, features="html.parser")

        authors = [
            authorElement.text
            for authorElement in soup.findAll("small", {"class": "author"})
        ]

        if len(authors) == 0:
            print("No more quotes found")
            return quote_collection

        quotes = [
            quoteElement.text[1:-1]  # drops quotation marks
            for quoteElement in soup.findAll("span", {"class": "text"})
        ]

        author_quotes = zip(authors, quotes)
        quote_collection.extend(list(author_quotes))

print("Starting")
quote_collection = get_quotes()

field_names = ["Author", "Quote"]

print("Writing output to CSV")
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(field_names)
    writer.writerows(quote_collection)

print("Completed")
The Play-by-Play
  • starting in line 7, we write a function that will help us retrieve the desired data; this is where the action happens
  • in line 12, we kick off a while loop that has no condition (True is always True), so it will only stop once we return something to end the function and break out of the loop, which happens in line 25
  • line 15 makes a request to the web site, using the base URL and filling in a page number that we increment by 1 each time the loop runs
  • line 16 parses the HTML response and creates a BeautifulSoup object variable called soup that we will use to extract out our desired data
    • 💡 Practical Advice: When first crafting the logic to extract the data, it is convenient to use Jupyter Notebooks. In one cell, you can have the code that fetches the page and stores it in our soup variable, and run it just once; then, in subsequent cells, you can play around with that soup object to extract different data (without having to keep re-running the code that actually hits the website). See the sketch just after this list.
  • lines 18-21 we use the findAll() method to find all <small> elements that have a class of "author", and for each of these elements we take the text out using a Python list comprehension
  • in line 23 we check to see if we got any data back, and if not, then we have probably hit all of the pages of quotes and have gone too far, so we just return the quotes that we've collected
  • lines 27-30 now we again use the findAll() method, but this time to get all the <span> elements that have a style class of "text"
    • note that in line 28 we're slicing the characters in our quote string to exclude the first and the last, which are quotation marks, because we only want the actual quote itself
  • line 32 zips the 2 lists together, lining up each quote with its author from the other list and returning pairs
  • line 33 adds the quotes we just gathered from this go-round of the loop to our master quote collection, which will eventually get returned
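And here's the notebook workflow from the practical-advice tip above, sketched as two cells (the cell boundaries are just comments here):

notebook_sketch.py
# --- Notebook cell 1: hit the site once and keep the parsed soup around ---
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("http://quotes.toscrape.com/page/1/").content,
    features="html.parser",
)

# --- Notebook cell 2: experiment freely without re-requesting the page ---
print(soup.find("small", {"class": "author"}).text)
print(soup.find("span", {"class": "text"}).text)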

All of that covers the main function itself, so now it's time to use it!

  • line 36 calls the function we wrote to get all of the quotes from the site
    • note that we have written the program in such a way where even if more pages were added, it would be able to retrieve them
  • lines 41-44 write the header row and then our quote collection rows into a comma-separated value file called quotes.csv
    • we specify a UTF-8 encoding to be able to handle any special characters without them coming out all garbled
    • we could have stored our results in a pandas DataFrame and then simply used its built-in to_csv() or to_excel() methods, but for this example we avoided pulling in yet another library (a quick sketch of that alternative follows below)
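For completeness, here's what that pandas alternative might look like, as a minimal sketch (assuming pip install pandas):

pandas_alternative.py
import pandas as pd

quote_collection = get_quotes()  # the same (author, quote) pairs from quote_scraper.py above

df = pd.DataFrame(quote_collection, columns=["Author", "Quote"])
df.to_csv("quotes.csv", index=False, encoding="utf-8")
# df.to_excel("quotes.xlsx", index=False)  # requires an Excel writer such as openpyxl installed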

And that's it!

Hopefully you enjoyed this post and it gave you a good idea of how to get started doing some web data collection of your own. Please let me know what you think in the comments below.