Web Scraping From Scratch With 3 Simple Steps

Photo by Pankaj Patel on Unsplash

Introduction

Web scraping comes in handy when, for example:

  • There isn’t any public API available for you to get data from the source sites
  • The information is updated from time to time, such as an exchange rate, so you cannot manage it manually
  • The final data you need is pieced together from multiple sites; and so on

Before you decide to implement a scraping script, you will also need to check the terms of use for the data you are going to scrape, to be sure that you are not violating them. Some sites explicitly prohibit scraping robots. This article is intended for educational purposes, to help you understand the overall process of web scraping, so we will assume you already know the implications of web scraping and the possible legal issues around how the data is used.

Scraping a website can sometimes be difficult, depending on how the target website is designed and where the data resides. But generally you can split the process into 3 steps. Let’s walk through them one by one.

Understand the structure of your target website

The first thing we shall do is understand how the information is organized on the website. Below is a screenshot of the data presented on the web page:


In the Chrome browser, if you right-click on the web page to inspect the HTML elements, you shall see that the entire data table sits under <section class="cex-table">…</section>. You can verify this by hovering your mouse over this element; you would see a light blue overlay on the data table as per below:


Next, you may want to inspect each text field on the page to further understand how the table header and records are arranged. For instance, when you check the “Asset” text field, you would see the below HTML structure:

<section class="cex-table">
<section class="thead">
<div>...</div>
<div class="tr-wrapper">
<div class="tr-left">
<div class="tr">
<div>...</div>
<div style="flex:7" class="th">
<span class="cell">
<i class="sorting-icon">
</i>
<span class="cell-text">Asset</span>
</span>
</div>
</div>
</div>
</div>
...
</section>
</section>

And similarly you can find the structure of the first row in the table body as per below:

<section class="tbody">
<section class="tr-section">
<a href="/price/bitcoin">
<div class="tr-wrapper">
<div class="tr-left">
<div class="tr">
<div style="flex:2" class="td">
<span class="cell cell-rank">
<strong>01</strong>
</span>
</div>
<div style="flex:7" class="td">
<span class="cell cell-asset">
<img>...</img>
<strong class="cell-asset-title">Bitcoin</strong>
<span class="cell-asset-iso">BTC</span>
</span>
</div>
</div>
</div>
</div>
</a>
</section>
</section>

You may notice that the majority of these HTML elements do not have an id or name attribute as a unique identifier, but the style sheet (the "class" attribute) is quite consistent for the same row of data. So in this case, we shall consider using the style sheet classes as a reference to find our data elements.

Locate and parse the target data element with XPath

For this demonstration, we will use the requests and lxml libraries to send the HTTP requests and parse the results. There are other packages for parsing the DOM, such as beautifulsoup, but personally I find using XPath expressions more straightforward when locating an element, although the syntax may not be as intuitive as the way beautifulsoup works.

Below are the pip commands if you do not have these two packages installed:

pip install requests
pip install lxml

Let’s import the packages and send a GET request to our target URL:

import requests 
from lxml import html
target_url = "https://www.coindesk.com/coindesk20"
result = requests.get(target_url)

Our target URL does not require any parameters; in case you need to pass in parameters, you can pass them via the params argument as per below:

payload = {"q" : "bitcoin", "s" : "relevant"} 
result = requests.get("https://www.coindesk.com/search", params=payload)

The result is a response object, which has a status_code attribute to indicate whether the correct response has been returned from the target website, and a text attribute that holds the returned HTML in string format. To simplify the code, let’s assume we can always get the correct response.
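
In a real script, though, a quick sanity check before parsing is worth adding; below is a minimal sketch using the standard requests API:

# raise_for_status() raises an HTTPError for any 4xx/5xx response,
# so the script fails fast instead of parsing an error page
result.raise_for_status()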

We then pass our HTML string to lxml and use it to parse the DOM tree as per below:

tree = html.fromstring(result.text)

Now we come to the most important step: we will need to use XPath syntax to locate the data elements we want and extract the data out.

Since the id and name attributes are not available for these elements, we will need to use the style sheet classes to locate our data elements. To locate the table header, we need to perform the below:

  • Find the section tag with the style sheet class "cex-table" from the entire DOM
  • Find its child section node with the style sheet class "thead"
  • Further find its child div node with the style sheet class "tr-wrapper"

Below is how the syntax looks in XPath:

table_header = tree.xpath("//section[@class='cex-table']/section[@class='thead']/div[@class='tr-wrapper']")

It will scan through the entire DOM tree to find elements matching this structure and return a list of the matched nodes.

If everything goes well, the table_header list should contain only 1 element, which is the div with the "tr-wrapper" style sheet class. If it returns multiple nodes, you may need to recheck your path expression to see how you can fine-tune it to get only the unique node that you need.
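
If you want the script to fail loudly when this assumption breaks, a one-line check (my own addition, not part of the original flow) does the job:

# stop early if the XPath did not match exactly one wrapper node
assert len(table_header) == 1, f"expected 1 node, got {len(table_header)}"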

From the wrapper div, there are still a few levels before we can reach the node with the text. But you may notice that all the data fields we need are under span tags with the style name "cell-text". So we can locate all these span tags by CSS class and extract their text with the text() function. Below is how it works as an XPath expression:

headers = table_header[0].xpath(".//span[@class='cell']/span[@class='cell-text']/text()")

Note that "." means to start from the current node, and "//" indicates that the following path expression is a relative path.
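
To make the difference concrete, below is a small illustration; the simplified path here is my own, and the real expression used above is more specific:

# absolute path: searches the entire document from the root
all_cells = tree.xpath("//span[@class='cell-text']/text()")
# relative path: searches only within the table_header node itself
header_cells = table_header[0].xpath(".//span[@class='cell-text']/text()")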

If you examine the headers now, you can see all the column headers are extracted into a list as per below:

['Asset', 
'Price',
'Market Cap',
'Total Exchange Volume',
'Returns (24h)',
'Total Supply',
'Category',
'Value Proposition',
'Consensus Mechanism']

Let’s continue to move on to the table body. Following the same logic, we shall be able to locate the row sections with the "tr-section" class using the below syntax:

table_body = tree.xpath("//section[@class='cex-table']/section[@class='tbody']/section[@class='tr-section']")

This collects all the row nodes in the table body, and we can now loop through them to get the individual elements. We will again use the style sheet classes to locate our elements, but the "Asset" column actually contains a few child nodes with different style classes, so we need to handle them separately from the rest of the columns. Below is the code to extract the data row by row and append it to a records list:

records = []
for row in table_body:
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    spans = row.xpath(".//div[contains(@class,'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    row_data = ranks + tokens + assets + rest_cols
    records.append(row_data)

Note that we are using contains() in order to match nodes whose class is something like "cell cell-rank", and text_content() to extract all the text from the current node and its child nodes.
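
As a quick illustration of the difference between the two extraction styles (the sample values assume the Bitcoin row shown earlier):

cell = table_body[0].xpath(".//span[contains(@class, 'cell-asset')]")[0]
# text() returns only the text held directly by the matched child node
print(cell.xpath("./strong/text()"))  # ['Bitcoin']
# text_content() concatenates the text of the node and all its descendants
print(cell.text_content())  # 'Bitcoin' and 'BTC' joined, plus any source whitespace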

Occasionally you may find that the number of columns extracted does not tally with the original column headers due to merged or hidden header columns, such as the ranking and token ticker columns above. So let’s also give them the column names "Rank" and "Token":

column_header = ["Rank", "Token"] + headers

Save the scraping result

The last step is to save our scraping result. Let’s first load the records into a pandas dataframe:

import pandas as pd
df = pd.DataFrame(records, columns=column_header)

The resulting pandas dataframe looks pretty good, except that some formatting still needs to be done to convert the amounts into proper number formats.

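Some of that cleanup can be sketched as below; the exact characters to strip depend on how the site formats its numbers, so treat the pattern and the column list as assumptions:

# strip currency symbols and thousand separators before casting to float;
# the column names come from the headers we extracted earlier
for col in ["Price", "Market Cap", "Total Exchange Volume"]:
    df[col] = df[col].str.replace(r"[$,]", "", regex=True).astype(float)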

Or you can also write the scraped data into a csv file with the csv module:

import csv

with open("token_price.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(column_header)
    for row in records:
        writer.writerow(row)
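
Since we already have the dataframe, the same file can also be produced in one line with pandas:

# index=False skips writing the dataframe's row index into the file
df.to_csv("token_price.csv", index=False)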

Limitations & Constraints

If your target website requires authentication before you can retrieve the data, you may need to create a session and send multiple POST/GET requests to the server in order to get yourself authorized. Depending on how complicated the authentication process is, you will need to understand which parameters are to be supplied and how the requests are chained together. This process may take some time and effort.
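
A minimal sketch of such a session-based flow is shown below; the login URL and form field names are hypothetical and would need to be taken from your target site:

# a Session object persists cookies across requests
session = requests.Session()
# hypothetical login endpoint and form fields
login_payload = {"username": "your_user", "password": "your_pass"}
session.post("https://example.com/login", data=login_payload)
# subsequent requests reuse the authenticated session cookies
result = session.get("https://example.com/protected-data")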

If the response from your target website returns some JavaScript code to populate the data, or you need to trigger some JavaScript function in order to have the data populated on the web page, you may find that the requests package simply would not work.

For both scenarios, you may consider using selenium, which I have mentioned in one of my past posts. It has a headless mode where you can simulate a user’s actions, such as keying in credentials or clicking buttons, without actually showing the browser, and you can also execute JavaScript code to interact with the web page. The downside is that you will have to periodically upgrade your driver file to match the browser’s version.
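
For reference, a minimal headless sketch (assuming the chromedriver binary is available on your PATH) could look like this:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(target_url)
# page_source holds the HTML after JavaScript has rendered the page
rendered_html = driver.page_source
driver.quit()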

Conclusion

In this article, we have walked through the three basic steps of web scraping: understanding the structure of the target website, locating and parsing the data elements with XPath, and saving the result into a file. We have also discussed the limitations of this requests-based approach and when you may need a tool like selenium instead.

Originally published at https://www.codeforests.com on December 6, 2020.
