Web Crawling with Python: From Lxml to Scrapy Techniques

Introduction to Web Crawling

Web crawling has been an integral part of the internet since its inception, with origins dating back to 1989 when Tim Berners-Lee envisioned the World Wide Web. Just four years later, the first crawlers, "The Wanderer" and "JumpStation," emerged. JumpStation became a pioneering crawler-based search engine, paving the way for platforms like Google, Bing, and Yahoo.

Web crawling involves automated methods for discovering and retrieving web data through specialized software. This process, executed by a web crawler (often referred to as a spider or bot), systematically visits web pages, extracts pertinent information, and organizes it in a structured format.

While the terms "web crawling" and "web scraping" are often used interchangeably, they denote different processes. Web scraping focuses on extracting data from specific websites, whereas web crawling is concerned with discovering URLs across the web. Both methods serve distinct purposes, such as building search engine databases or monitoring website performance.

In this guide, we will explore two popular methods for web scraping: Requests combined with Lxml, and Scrapy. Before diving in, let's inspect the website we will target: the e-commerce test site hosted at webscraper.io.

Inspecting a Website Prior to Scraping

Scraping a single page is straightforward, but often our goal is to gather information on all items available for purchase. This requires identifying and collecting the URL of every page we wish to scrape. The typical web scraping workflow includes:

  1. Identifying the target website (Completed).
  2. Collecting URLs of the pages containing desired data.
  3. Using the browser's inspect tool to locate relevant HTML tags.
  4. Making HTTP requests to the identified URLs to retrieve HTML content.
  5. Utilizing locators to extract specific data from the HTML.
  6. Saving the extracted data in structured formats like JSON or CSV.

Goal 1: Scraping Top Items

To scrape the "top items currently being scraped," we begin by visiting the target page. Using Chrome, Safari, or Firefox, we right-click and select "Inspect" to examine the page's structure.

We find that the top items are encapsulated within individual inner tags, which are themselves nested within a common outer tag. Each item's structure is similar yet unique in content.

<div class="item">
  <img src="cart_image_url" />
  <h4 class="price">$101.99</h4>
  <h4><a href="product_url" class="title" title="Memo Pad HD 7">Memo Pad HD 7</a></h4>
  <p class="description">IPS, Dual-Core 1.2GHz, 8GB, Android 4.3</p>
  <div class="ratings">
    <p>10 reviews</p>
    <p data-rating="3"></p>
  </div>
</div>

From this snippet, we can deduce the following:

  • The image URL is found in the src attribute of the img tag.
  • The price is within the first h4 tag.
  • The item title is in the title attribute of an a tag.
  • The description resides in a p tag with a class of "description".
  • The review count is present in another p tag within a div tagged as "ratings".
  • The star rating is stored as a value in the data-rating attribute of a p tag.

This foundational knowledge will assist us in creating our scraper in subsequent sections. Next, let’s consider Goal 2.

Goal 2: Scraping All Items

For this task, we aim to extract all items under the "Computers" and "Phones" categories. By exploring the website, we note that "Computers" comprises two sub-categories, "Laptops" and "Tablets," while "Phones" has one sub-category, "Touch." This results in three pages to scrape.

Given the simplicity of the website, we could note down the three URLs and manually download and parse each one. However, real-world websites often contain far more pages, making manual collection impractical. Instead, we'll automate the URL extraction by inspecting the navigation bar.

Understanding the Navigation Bar

The navigation menu is encapsulated within a ul tag, containing li tags for the categories: "Home," "Computers," and "Phones."

<ul>
  <li><a href="/test-sites/e-commerce/allinone">Home</a></li>
  <li><a href="/test-sites/e-commerce/allinone/computers">Computers</a></li>
  <li><a href="/test-sites/e-commerce/allinone/phones">Phones</a></li>
</ul>

Each li tag contains an a tag with the href attribute leading to the relevant URLs. However, since sub-categories aren't explicitly defined on this page, we will scrape the "Computers" and "Phones" pages to gather their URLs.

Next, we'll see how to implement this with Requests combined with Lxml and Scrapy.

Scraping with Requests + Lxml

One of the most straightforward methods to begin scraping in Python is by using the Requests and Lxml libraries.

The Requests library simplifies sending HTTP requests, while Lxml is adept at processing XML and HTML, making it ideal for web scraping tasks.

Goal 1 Implementation

For our first goal, let's look into making a request to our target page and storing the HTML.

import requests

# Specify the URL of the target page
url = 'https://webscraper.io/test-sites/e-commerce/allinone'

# Query the website and return the HTML content
response = requests.get(url)

# Save the HTML to a file for later review
filename = response.url.replace('/', '|') + '.html'
with open(filename, 'wb') as f:
    f.write(response.content)

This code snippet retrieves the HTML content from the target URL and saves it to a file for future reference.

Next, we will parse the HTML using Lxml.

from lxml import html

# Parse the HTML content
root_node = html.fromstring(response.content)

# Locate a specific node corresponding to one of our items
node = root_node.find('body/div[1]/div[3]/div/div[2]/div[2]/div[1]')

# Display node information
print(f"Node tag: {node.tag}, Node attributes: {node.attrib}")

To extract all relevant items, we can loop through the nodes and gather the required information.

# Select every item container on the page
nodes = root_node.xpath('//div[@class="col-sm-4 col-lg-4 col-md-4"]')

all_items = []

# Iterate through all relevant nodes to extract information
for node in nodes:
    all_items.append(
        {
            'image_url': node.find('div[1]/img').attrib['src'],
            'price': node.find('div[1]/div[1]/h4[1]').text,
            'title': node.find('div[1]/div[1]/h4[2]/a').text,
            'product_url': node.find('div[1]/div[1]/h4[2]/a').attrib['href'],
            'description': node.find('div[1]/div[1]/p').text,
            'review': node.find('div[1]/div[2]/p[1]').text,
            'rating': node.find('div[1]/div[2]/p[2]').attrib['data-rating'],
        }
    )

print(all_items)

If successful, the output should yield a list of dictionaries containing all desired product information.
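To complete step 6 of the workflow, we can persist this list in a structured format. Below is a minimal sketch using Python's standard json module; the output filename is arbitrary.

import json

# Persist the scraped items as a JSON file for later use
with open('top_items.json', 'w', encoding='utf-8') as f:
    json.dump(all_items, f, indent=2)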

Goal 2 Implementation

For our second goal, we will gather URLs from the navigation bar and scrape each sub-category.

from urllib.parse import urljoin

# Extract the relative category URLs from the navigation bar
rel_category_urls = root_node.xpath('//ul[@class="nav"]/li/a[@class="category-link "]/@href')

# Convert the relative paths into absolute URLs
full_category_urls = [urljoin(response.url, rel_url) for rel_url in rel_category_urls]

print(full_category_urls)

Next, we'll extract sub-category links and wrap our item extraction logic in a reusable function.
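Before writing that function, we need the list of sub-category URLs (all_sub_category_urls) that we will loop over afterwards. Here is a minimal sketch: it fetches each category page found above and pulls out its sub-category links, assuming those links carry a "subcategory-link" class (verify the exact class name with the browser's inspect tool).

all_sub_category_urls = []

for category_url in full_category_urls:
    # Download and parse the category page
    response = requests.get(category_url)
    root_node = html.fromstring(response.content)
    # "subcategory-link" is an assumed class name; confirm it in the inspector
    rel_urls = root_node.xpath('//a[contains(@class, "subcategory-link")]/@href')
    all_sub_category_urls += [urljoin(category_url, rel_url) for rel_url in rel_urls]

print(all_sub_category_urls)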

def extract_items_from_url(url):
    """Scrapes items from the specified URL and returns their details."""
    # Query the website and return the HTML
    response = requests.get(url)
    root_node = html.fromstring(response.content)
    nodes = root_node.xpath('//div[@class="col-sm-4 col-lg-4 col-md-4"]')

    all_items = []
    for node in nodes:
        # The leading "." keeps each expression relative to the current node
        # instead of searching the entire document
        all_items.append(
            {
                'image_url': node.xpath('.//img')[0].attrib['src'],
                'price': node.xpath('.//h4[contains(@class, "price")]')[0].text,
                'title': node.xpath('.//h4/a[@class="title"]')[0].text,
                'product_url': node.xpath('.//h4/a[@class="title"]')[0].attrib['href'],
                'description': node.xpath('.//p[@class="description"]')[0].text,
                'review': node.xpath('.//div[@class="ratings"]/p')[0].text,
                'ratings': node.xpath('.//div[@class="ratings"]/p[boolean(@data-rating)]')[0].attrib['data-rating'],
            }
        )
    return all_items

# Run the extraction function for all discovered URLs
all_items = []
for url in all_sub_category_urls:
    all_items += extract_items_from_url(url)

print(f"Number of items: {len(all_items)}")
print(all_items[:1])

This will yield all the information we need from the various sub-categories.
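The same items can also be written to CSV, the other structured format mentioned in the workflow. Below is a minimal sketch with the standard csv module; the field names simply mirror the dictionary keys built above.

import csv

# Write the scraped items to a CSV file, one row per product
fieldnames = ['image_url', 'price', 'title', 'product_url', 'description', 'review', 'ratings']
with open('all_items.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_items)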

Stepping Up with Scrapy

Scrapy is a robust framework tailored for creating web crawlers that can efficiently traverse multiple sites and extract relevant data. It can manage large datasets and is designed for scalability and fault tolerance.

To initiate a new Scrapy project, use the following commands:

mkdir webscraper
cd webscraper
scrapy startproject webscraper

This generates the necessary project files and folder structure. Now, let’s define our crawler in the webscraper/spiders directory.
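For orientation, startproject lays out roughly the following skeleton (shown relative to the directory where the command was run); the spider file we write next goes under the spiders/ folder:

webscraper/
    scrapy.cfg          # deploy configuration
    webscraper/         # the project's Python module
        __init__.py
        items.py        # item definitions
        middlewares.py  # spider and downloader middlewares
        pipelines.py    # item pipelines
        settings.py     # project-wide settings
        spiders/        # spiders live here
            __init__.py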

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class WebscraperCrawler(CrawlSpider):
    name = "webscraper"

    custom_settings = {
        'HTTPCACHE_DIR': './httpcache',
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_EXPIRATION_SECS': 0,
        'ROBOTSTXT_OBEY': False,
        'CONCURRENT_REQUESTS': 2,
        'DOWNLOAD_DELAY': 1,
    }

    allowed_domains = ['webscraper.io']
    # Starting point for the crawl: the e-commerce test site's landing page
    start_urls = ['https://webscraper.io/test-sites/e-commerce/allinone']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//ul[@class="nav"]//li/a[contains(@class, "category-link ")]'),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # .get() returns the first match on the page (or None if nothing matches)
        return {
            'url': response.url,
            'status': response.status,
            'image_url': response.selector.xpath('//img/@src').get(),
            'price': response.selector.xpath('//h4[contains(@class, "price")]/text()').get(),
            'title': response.selector.xpath('//h4/a[@class="title"]/text()').get(),
            'product_url': response.selector.xpath('//h4/a[@class="title"]/@href').get(),
            'description': response.selector.xpath('//p[@class="description"]/text()').get(),
            'review': response.selector.xpath('//div[@class="ratings"]/p/text()').get(),
            'ratings': response.selector.xpath('//div[@class="ratings"]/p[boolean(@data-rating)]/@data-rating').get(),
        }

To execute the crawler, run the following command:

scrapy crawl webscraper --logfile webscraper.log -o webscraper.jl -t jsonlines

After completing the crawl, you can inspect the results with:

cat webscraper.jl | head -n 1
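Each line of the .jl file is a standalone JSON object, so the full result set can be loaded back into Python with a few lines (a minimal sketch; the filename matches the -o argument above):

import json

# Load every scraped record from the JSON Lines output
with open('webscraper.jl', 'r', encoding='utf-8') as f:
    items = [json.loads(line) for line in f]

print(f"Number of records: {len(items)}")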

Conclusion

Web crawling is a powerful technique with a rich history, serving as the backbone for major search engines like Google and Bing. This guide has introduced two prevalent methods for web scraping: Requests with Lxml and Scrapy. While Requests and Lxml offer a simple entry point for beginners, Scrapy provides enhanced flexibility and scalability for larger projects.

With the right tools and understanding, web crawling can be an invaluable asset for collecting data from the internet across various applications, including market analysis, competitive research, and content aggregation.

Join my mailing list to receive updates on new content as it becomes available!
