Web Scraping From Scratch With 3 Simple Steps

Photo by Pankaj Patel on Unsplash

Introduction

Understand the structure of your target website

<section class="cex-table">
  <section class="thead">
    <div>...</div>
    <div class="tr-wrapper">
      <div class="tr-left">
        <div class="tr">
          <div>...</div>
          <div style="flex:7" class="th">
            <span class="cell">
              <i class="sorting-icon">
              </i>
              <span class="cell-text">Asset</span>
            </span>
          </div>
        </div>
      </div>
    </div>
    ...
  </section>
</section>

<section class="tbody">
  <section class="tr-section">
    <a href="/price/bitcoin">
      <div class="tr-wrapper">
        <div class="tr-left">
          <div class="tr">
            <div style="flex:2" class="td">
              <span class="cell cell-rank">
                <strong>01</strong>
              </span>
            </div>
            <div style="flex:7" class="td">
              <span class="cell cell-asset">
                <img>...</img>
                <strong class="cell-asset-title">Bitcoin</strong>
                <span class="cell-asset-iso">BTC</span>
              </span>
            </div>
          </div>
        </div>
      </div>
    </a>
  </section>
</section>
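
One way to turn nested markup like this into an XPath is to list the chain of tags and classes that leads to each piece of text. The sketch below does that with Python's built-in html.parser on a trimmed-down, hypothetical copy of the header markup. The PathTracer class and FRAGMENT string are illustrative names, not part of the original walkthrough, and the tracer assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down copy of the header markup (several wrapper divs omitted).
FRAGMENT = """
<section class="cex-table">
  <section class="thead">
    <div class="tr-wrapper">
      <span class="cell">
        <span class="cell-text">Asset</span>
      </span>
    </div>
  </section>
</section>
"""


class PathTracer(HTMLParser):
    """Record the chain of tag.class ancestors for every text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, outermost first
        self.paths = {}  # text -> ancestor path

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Assumes well-nested markup with no void elements such as <img>.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths[text] = "/".join(self.stack)


tracer = PathTracer()
tracer.feed(FRAGMENT)
for text, path in tracer.paths.items():
    print(f"{text!r}: {path}")
```

Each segment of the printed path corresponds to one step of an XPath such as //section[@class='cex-table']/section[@class='thead'].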

Locate and parse the target data element with XPath

pip install requests
pip install lxml
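import requests
from lxml import html

# Download the page that contains the price table
target_url = "https://www.coindesk.com/coindesk20"
result = requests.get(target_url)

# Aside: query parameters can be passed as a dict through `params`.
# A separate variable keeps this response from overwriting `result`.
payload = {"q": "bitcoin", "s": "relevant"}
search_result = requests.get("https://www.coindesk.com/search", params=payload)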
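
To see what the params dictionary does to the request, it can help to build the query string by hand. This is a minimal sketch using only the standard library. It shows the URL that a simple ASCII payload like this one is expected to produce, and it does not contact the site. The names base_url and full_url are illustrative.

```python
from urllib.parse import urlencode

base_url = "https://www.coindesk.com/search"
payload = {"q": "bitcoin", "s": "relevant"}

# urlencode turns the dict into "q=bitcoin&s=relevant"
full_url = f"{base_url}?{urlencode(payload)}"
print(full_url)
```

Passing the dictionary through params is usually preferable to concatenating strings, because special characters in the values are escaped for you.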
# Parse the response body into an element tree that supports XPath queries
tree = html.fromstring(result.text)

# Locate the header row wrapper, then pull the text of every header cell
table_header = tree.xpath("//section[@class='cex-table']/section[@class='thead']/div[@class='tr-wrapper']")
headers = table_header[0].xpath(".//span[@class='cell']/span[@class='cell-text']/text()")
print(headers)
# Output:
# ['Asset', 'Price', 'Market Cap', 'Total Exchange Volume', 'Returns (24h)',
#  'Total Supply', 'Category', 'Value Proposition', 'Consensus Mechanism']
# Each tr-section under tbody is one table row
table_body = tree.xpath("//section[@class='cex-table']/section[@class='tbody']/section[@class='tr-section']")

records = []
for row in table_body:
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    # The remaining columns live in the right-hand wrapper of the row
    spans = row.xpath(".//div[contains(@class,'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    row_data = ranks + tokens + assets + rest_cols
    records.append(row_data)

# "Rank" and "Token" are not in the scraped headers, so prepend them
column_header = ["Rank", "Token"] + headers
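
Because the live page can change at any time, it may be worth checking the row XPaths against a small fixed fragment first. The sketch below runs the same expressions on a hypothetical piece of HTML modeled on the row markup shown earlier. The tr-right-wrapper cell and its "$1.00" value are made up for illustration, since the original snippet does not show that part of the row.

```python
from lxml import html

# Hypothetical fragment mirroring the row markup; the right-hand cell is invented.
FRAGMENT = """
<section class="cex-table">
  <section class="tbody">
    <section class="tr-section">
      <div class="tr-left">
        <span class="cell cell-rank"><strong>01</strong></span>
        <span class="cell cell-asset">
          <strong class="cell-asset-title">Bitcoin</strong>
          <span class="cell-asset-iso">BTC</span>
        </span>
      </div>
      <div class="tr-right-wrapper">
        <div><span class="cell"> $1.00 </span></div>
      </div>
    </section>
  </section>
</section>
"""

tree = html.fromstring(FRAGMENT)
rows = tree.xpath("//section[@class='tbody']/section[@class='tr-section']")

records = []
for row in rows:
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    # 'cell-asset' also matches the 'cell-asset-iso' span, but only the
    # asset span has a <strong> child, so just the title text is returned.
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    spans = row.xpath(".//div[contains(@class, 'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    records.append(ranks + tokens + assets + rest_cols)

print(records)
```

If a row comes back shorter than the header list, that is usually a sign that one of the class names on the page has changed.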

Save the scraping result

import pandas as pd 
df = pd.DataFrame(records, columns=column_header)
import csv

# newline="" stops the csv module from writing blank lines between rows on Windows
with open("token_price.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(column_header)
    for row in records:
        writer.writerow(row)
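
A quick way to confirm the file layout is to write the rows and read them straight back. This sketch uses an in-memory buffer instead of token_price.csv, and a small hypothetical sample in place of the real column_header and records, so it runs without scraping anything.

```python
import csv
import io

# Hypothetical sample data in the same shape as column_header and records.
column_header = ["Rank", "Token", "Asset"]
records = [["01", "BTC", "Bitcoin"]]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(column_header)
writer.writerows(records)

# Read the CSV text back as dictionaries keyed by the header row.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows)
```

Note that the rank comes back as the string "01", since CSV stores everything as text.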

Limitations & Constraints

Conclusion

