Web Crawling with Python: From Lxml to Scrapy Techniques
Introduction to Web Crawling
Web crawling has been an integral part of the internet since its inception, with origins dating back to 1989 when Tim Berners-Lee envisioned the World Wide Web. Just four years later, the first crawlers, "The Wanderer" and "JumpStation," emerged. JumpStation became a pioneering crawler-based search engine, paving the way for platforms like Google, Bing, and Yahoo.
Web crawling involves automated methods for discovering and retrieving web data through specialized software. This process, executed by a web crawler (often referred to as a spider or bot), systematically visits web pages, extracts pertinent information, and organizes it in a structured format.
While the terms "web crawling" and "web scraping" are often used interchangeably, they denote different processes. Web scraping focuses on extracting data from specific websites, whereas web crawling is concerned with discovering URLs across the web. Both methods serve distinct purposes, such as building search engine databases or monitoring website performance.
In this guide, we will explore two popular methods for web scraping: Requests combined with Lxml, and Scrapy. Before diving in, let's inspect the website we will target: the e-commerce test site hosted at webscraper.io under /test-sites/e-commerce/allinone.
Inspecting a Website Prior to Scraping
Often, our goal is not a single page but information on all items available for purchase. This requires identifying and collecting URLs for each page we wish to scrape. The typical web scraping workflow includes the following steps (a compressed sketch follows the list):
- Identifying the target website (Completed).
- Collecting URLs of the pages containing desired data.
- Using the browser's inspect tool to locate relevant HTML tags.
- Making HTTP requests to the identified URLs to retrieve HTML content.
- Utilizing locators to extract specific data from the HTML.
- Saving the extracted data in structured formats like JSON or CSV.
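As a preview of how the last few steps fit together in code, here is a compressed sketch; the URL and the XPath locator are placeholders that we will replace with real ones in the sections that follow.
import json

import requests
from lxml import html

# Placeholder URL and XPath, to be replaced with the real ones later on.
response = requests.get('https://example.com/some-page')         # make the HTTP request
root = html.fromstring(response.content)                          # parse the HTML
rows = root.xpath('//div[@class="item"]')                         # apply locators to find data
data = [{'text': row.text_content().strip()} for row in rows]     # extract the data
with open('items.json', 'w', encoding='utf-8') as f:              # save it as JSON
    json.dump(data, f, indent=2)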
Goal 1: Scraping Top Items
To scrape the "top items currently being scraped," we begin by visiting the target page. Using Chrome, Safari, or Firefox, we right-click and select "Inspect" to examine the page's structure.
We find that each item is rendered as its own card: an inner block nested inside a common outer container. Every card follows the same structure; only the content differs. A trimmed-down version of one item's markup looks like this:
<div class="col-sm-4 col-lg-4 col-md-4">
  <div class="thumbnail">
    <img class="img-responsive" src="cart_image_url" />
    <div class="caption">
      <h4 class="price">$101.99</h4>
      <h4><a href="product_url" title="Memo Pad HD 7" class="title">Memo Pad HD 7</a></h4>
      <p class="description">IPS, Dual-Core 1.2GHz, 8GB, Android 4.3</p>
    </div>
    <div class="ratings">
      <p>10 reviews</p>
      <p data-rating="3"></p>
    </div>
  </div>
</div>
From this snippet, we can deduce the following (a quick sanity check follows the list):
- The image URL is found in the src attribute of the img tag.
- The price is within the first h4 tag.
- The item title is in the title attribute of an a tag.
- The description resides in a p tag with a class of "description".
- The review count is present in another p tag within a div tagged as "ratings".
- The star rating is stored as a value in the data-rating attribute of a p tag.
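As a quick sanity check, we can try the locators implied by these observations against an abbreviated copy of the snippet above (placeholder values only, not live data):
from lxml import html

# Abbreviated copy of the snippet above, with placeholder values.
snippet = """
<div class="thumbnail">
  <div class="caption">
    <h4 class="price">$101.99</h4>
    <h4><a href="product_url" title="Memo Pad HD 7" class="title">Memo Pad HD 7</a></h4>
    <p class="description">IPS, Dual-Core 1.2GHz, 8GB, Android 4.3</p>
  </div>
  <div class="ratings"><p>10 reviews</p><p data-rating="3"></p></div>
</div>
"""

node = html.fromstring(snippet)
print(node.xpath('.//h4[@class="price"]/text()'))       # ['$101.99']
print(node.xpath('.//a[@class="title"]/@title'))        # ['Memo Pad HD 7']
print(node.xpath('.//p[@data-rating]/@data-rating'))    # ['3']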
This foundational knowledge will assist us in creating our scraper in subsequent sections. Next, let’s consider Goal 2.
Goal 2: Scraping All Items
For this task, we aim to extract all items under the "Computers" and "Phones" categories. By exploring the website, we note that "Computers" comprises two sub-categories, "Laptops" and "Tablets," while "Phones" has one sub-category, "Touch." This results in three pages to scrape.
Given the simplicity of the website, we could simply note down the three URLs and download and parse each one by hand. However, real-world websites often contain far more pages, making manual extraction impractical. Instead, we'll automate the URL collection by inspecting the navigation bar.
Understanding the Navigation Bar
The navigation menu is encapsulated within a ul tag, containing li tags for the categories: "Home," "Computers," and "Phones."
<ul>
<li><a href="/test-sites/e-commerce/allinone">Home</a></li>
<li><a href="/test-sites/e-commerce/allinone/computers">Computers</a></li>
<li><a href="/test-sites/e-commerce/allinone/phones">Phones</a></li>
</ul>
Each li tag contains an a tag with the href attribute leading to the relevant URLs. However, since sub-categories aren't explicitly defined on this page, we will scrape the "Computers" and "Phones" pages to gather their URLs.
Next, we'll see how to implement this with Requests combined with Lxml and Scrapy.
Scraping with Requests + Lxml
One of the most straightforward methods to begin scraping in Python is by using the Requests and Lxml libraries.
The Requests library simplifies sending HTTP requests, while Lxml is adept at processing XML and HTML, making it ideal for web scraping tasks.
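Both libraries are third-party packages; if they aren't already available, they can typically be installed with pip:
pip install requests lxml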
Goal 1 Implementation
For our first goal, let's look into making a request to our target page and storing the HTML.
import requests

# Specify the URL of the target page (the webscraper.io e-commerce test site)
url = 'https://webscraper.io/test-sites/e-commerce/allinone'

# Query the website and return the HTML content
response = requests.get(url)

# Save the HTML to a file for later review
filename = response.url.replace('/', '|') + '.html'
with open(filename, 'wb') as f:
    f.write(response.content)
This code snippet retrieves the HTML content from the target URL and saves it to a file for future reference.
Next, we will parse the HTML using Lxml.
from lxml import html
# Parse the HTML content
root_node = html.fromstring(response.content)
# Locate a specific node corresponding to one of our items
node = root_node.find('body/div[1]/div[3]/div/div[2]/div[2]/div[1]')
# Display node information
print(f"Node tag: {node.tag}, Node attributes: {node.attrib}")
To extract all relevant items, we first collect every item node and then loop through them, gathering the required information from each.
# Collect every item node; each product card uses the same Bootstrap column class
nodes = root_node.xpath('//div[@class="col-sm-4 col-lg-4 col-md-4"]')

all_items = []

# Iterate through all item nodes and extract the information we identified earlier
for node in nodes:
    all_items.append(
        {
            'image_url': node.find('div[1]/img').attrib['src'],
            'price': node.find('div[1]/div[1]/h4[1]').text,
            'title': node.find('div[1]/div[1]/h4[2]/a').text,
            'product_url': node.find('div[1]/div[1]/h4[2]/a').attrib['href'],
            'description': node.find('div[1]/div[1]/p').text,
            'review': node.find('div[1]/div[2]/p[1]').text,
            'rating': node.find('div[1]/div[2]/p[2]').attrib['data-rating'],
        }
    )

print(all_items)
If successful, the output should yield a list of dictionaries containing all desired product information.
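The final workflow step, saving the data in a structured format, can be handled with the standard json module; here is a minimal sketch (the output filename is arbitrary):
import json

# Persist the scraped items as JSON; "top_items.json" is an arbitrary filename.
with open('top_items.json', 'w', encoding='utf-8') as f:
    json.dump(all_items, f, indent=2)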
Goal 2 Implementation
For our second goal, we will gather URLs from the navigation bar and scrape each sub-category.
from urllib.parse import urljoin

# Extract the relative category URLs from the navigation bar
rel_category_urls = root_node.xpath('//ul[@class="nav"]/li/a[@class="category-link "]/@href')

# Turn them into absolute URLs
full_category_urls = [urljoin(url, rel_url) for rel_url in rel_category_urls]
print(full_category_urls)
Next, we'll extract sub-category links and wrap our item extraction logic in a reusable function.
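First, the sub-category links. Below is a sketch of collecting them from each category page, continuing from the code above (requests, html, urljoin, and full_category_urls are already in scope). Note that the "subcategory-link" class used in the XPath is an assumption about the expanded sidebar markup on the category pages, not something shown in the snippets above, so verify it with the browser's inspect tool.
# Visit each category page and collect its sub-category links.
# NOTE: the "subcategory-link" class is an assumption about the sidebar markup;
# confirm it via the browser's inspect tool.
all_sub_category_urls = []
for category_url in full_category_urls:
    category_response = requests.get(category_url)
    category_node = html.fromstring(category_response.content)
    rel_sub_urls = category_node.xpath('//a[contains(@class, "subcategory-link")]/@href')
    all_sub_category_urls += [urljoin(category_url, rel_url) for rel_url in rel_sub_urls]

print(all_sub_category_urls)
If the assumption holds, this yields the three sub-category URLs (Laptops, Tablets, and Touch) identified earlier. Now for the reusable extraction function.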
def extract_items_from_url(url):
    """Scrapes items from the specified URL and returns their details."""
    # Query the website and parse the HTML
    response = requests.get(url)
    root_node = html.fromstring(response.content)

    # Each product card sits in the same Bootstrap column class
    nodes = root_node.xpath('//div[@class="col-sm-4 col-lg-4 col-md-4"]')

    all_items = []
    for node in nodes:
        # Note the leading "." so each expression is evaluated relative to the
        # current node rather than the whole document
        all_items.append(
            {
                'image_url': node.xpath('.//img')[0].attrib['src'],
                'price': node.xpath('.//h4[contains(@class, "price")]')[0].text,
                'title': node.xpath('.//h4/a[@class="title"]')[0].text,
                'product_url': node.xpath('.//h4/a[@class="title"]')[0].attrib['href'],
                'description': node.xpath('.//p[@class="description"]')[0].text,
                'review': node.xpath('.//div[@class="ratings"]/p')[0].text,
                'ratings': node.xpath('.//div[@class="ratings"]/p[boolean(@data-rating)]')[0].attrib['data-rating'],
            }
        )
    return all_items
# Run the extraction function for all discovered URLs
all_items = []
for url in all_sub_category_urls:
    all_items += extract_items_from_url(url)

print(f"Number of items: {len(all_items)}")
print(all_items[:1])
This will yield all the information we need from the various sub-categories.
Stepping Up with Scrapy
Scrapy is a robust framework tailored for creating web crawlers that can efficiently traverse multiple sites and extract relevant data. It can manage large datasets and is designed for scalability and fault tolerance.
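Scrapy is also a third-party package and can typically be installed with pip:
pip install scrapy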
To initiate a new Scrapy project, use the following commands:
mkdir webscraper
cd webscraper
scrapy startproject webscraper
This generates the necessary project files and folder structure. Now, let’s define our crawler in the webscraper/spiders directory.
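For orientation, the generated layout looks roughly like this (the exact files can vary slightly between Scrapy versions):
webscraper/                   (created with mkdir)
└── webscraper/               (created by scrapy startproject)
    ├── scrapy.cfg
    └── webscraper/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── __init__.py   (our crawler module goes alongside this file)
With the scaffolding in place, the spider itself looks like this: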
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class WebscraperCrawler(CrawlSpider):
    name = "webscraper"

    custom_settings = {
        # Cache responses locally so repeated runs don't hammer the site
        'HTTPCACHE_DIR': './httpcache',
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_EXPIRATION_SECS': 0,
        'ROBOTSTXT_OBEY': False,
        # Be polite: few concurrent requests and a delay between them
        'CONCURRENT_REQUESTS': 2,
        'DOWNLOAD_DELAY': 1,
    }

    allowed_domains = ['webscraper.io']
    # Start from the e-commerce test site's landing page
    start_urls = ['https://webscraper.io/test-sites/e-commerce/allinone']

    rules = (
        # Follow the category links in the navigation bar and parse each followed page
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="nav"]//li/a[contains(@class, "category-link ")]'),
             callback="parse_item",
             follow=True),
    )

    def parse_item(self, response):
        # Yield one record per product card on the page
        for node in response.xpath('//div[@class="col-sm-4 col-lg-4 col-md-4"]'):
            yield {
                'url': response.url,
                'status': response.status,
                'image_url': node.xpath('.//img/@src').get(),
                'price': node.xpath('.//h4[contains(@class, "price")]/text()').get(),
                'title': node.xpath('.//h4/a[@class="title"]/text()').get(),
                'product_url': node.xpath('.//h4/a[@class="title"]/@href').get(),
                'description': node.xpath('.//p[@class="description"]/text()').get(),
                'review': node.xpath('.//div[@class="ratings"]/p/text()').get(),
                'ratings': node.xpath('.//div[@class="ratings"]/p[boolean(@data-rating)]/@data-rating').get(),
            }
To execute the crawler, run the following command:
scrapy crawl webscraper --logfile webscraper.log -o webscraper.jl -t jsonlines
After completing the crawl, you can inspect the results with:
cat webscraper.jl | head -n 1
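To continue working with the results in Python rather than the shell, the JSON Lines output can be read back one record per line; here is a minimal sketch:
import json

# Load the JSON Lines output produced by the crawl.
with open('webscraper.jl', 'r', encoding='utf-8') as f:
    items = [json.loads(line) for line in f]

print(f"Number of items: {len(items)}")
print(items[0])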
Conclusion
Web crawling is a powerful technique with a rich history, serving as the backbone for major search engines like Google and Bing. This guide has introduced two prevalent methods for web scraping: Requests with Lxml and Scrapy. While Requests and Lxml offer a simple entry point for beginners, Scrapy provides enhanced flexibility and scalability for larger projects.
With the right tools and understanding, web crawling can be an invaluable asset for collecting data from the internet across various applications, including market analysis, competitive research, and content aggregation.
Join my mailing list to receive updates on new content as it becomes available!