Download Entire Website Text
Background
I previously worked on a GenAI RAG system where the reference data came from a public website.
For that project, I built a simple Selenium-based crawler that opened a page, followed its sub-pages, and then followed the links on those sub-pages down to a predetermined depth. This crawler quickly proved inadequate: every depth level had a different page structure, so I effectively had to rewrite the extraction logic for each level. Worse, whenever the site structure changed, for example when a first-level page linked directly to a third-level page, the crawler broke.
So in the second version I applied a single, unified crawling routine to every page and let it recurse through the site with no fixed depth limit. This worked at first, but as more and more pages were opened, the Chrome driver controlled by Selenium would eventually crash with the error message "lost connection to the driver."
Writing and maintaining a dedicated Selenium crawler for this one task was wasteful, and when no login is required, driving a browser just to fetch public pages is overkill. So I built a more general and lightweight solution: using BeautifulSoup to extract the text of web pages (including text from PDFs).
Development Process
Step 1: Technology Selection and Architecture Design
Considering the issues with the previous Selenium solution, I chose a more lightweight technology stack (see the import sketch after this list):
- BeautifulSoup for HTML parsing, replacing Selenium's browser automation
- requests for HTTP requests, which is more stable and uses far fewer resources than a driven browser
- pdfplumber for PDF text extraction, with PyPDF2 as a fallback
- JSON for result storage, convenient for subsequent data processing
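As a rough sketch, the stack above maps onto the following imports (the module names correspond to the PyPI packages requests, beautifulsoup4, pdfplumber, and PyPDF2; PdfReader assumes PyPDF2 2.x):

import json                    # result storage
import requests                # HTTP requests
from bs4 import BeautifulSoup  # HTML parsing
import pdfplumber              # primary PDF text extraction
from PyPDF2 import PdfReader   # fallback PDF text extraction (PyPDF2 >= 2.0)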
Step 2: Implementing Core Crawling Logic
The entire crawling process is divided into two phases:
Phase 1: Web Content Crawling
I designed a queue-based breadth-first search: starting from the initial URL, it extracts each page's content and links, and adds newly discovered same-domain links to the crawling queue (a sketch follows the list below). This ensures:
- Systematic traversal of the entire website
- Avoidance of duplicate visits to the same page
- Control of crawling scope within the same domain
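A minimal sketch of that breadth-first loop, assuming the hypothetical function name crawl and the parameters max_pages and delay (the repository's actual implementation may differ in detail):

import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(base_url, max_pages=500, delay=0.1):
    """Breadth-first crawl of same-domain pages; returns (page_texts, pdf_links)."""
    domain = urlparse(base_url).netloc
    queue = deque([base_url])
    visited = set()
    page_texts, pdf_links = [], []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # graceful degradation on network errors
        soup = BeautifulSoup(resp.text, "html.parser")
        # Keep the visible text together with its source URL
        page_texts.append([soup.get_text(separator=" ", strip=True), url])
        # Collect PDF links for phase 2; queue same-domain HTML links
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.lower().endswith(".pdf"):
                pdf_links.append(link)
            elif urlparse(link).netloc == domain and link not in visited:
                queue.append(link)
        time.sleep(delay)  # request delay: be polite to the server
    return page_texts, pdf_links

The visited set provides the duplicate-visit protection, and the netloc comparison keeps the crawl inside the starting domain.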
Phase 2: PDF File Processing
After all PDF links have been collected in the first phase, the PDF files are processed separately (see the sketch after this list):
- Download PDFs to temporary files
- Prioritize pdfplumber for text extraction, fallback to PyPDF2 if it fails
- Clean and format the extracted text content
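A sketch of that download-and-fallback logic, using a hypothetical extract_pdf_text helper and assuming PyPDF2 2.x (where the reader class is PdfReader):

import os
import tempfile

import pdfplumber
import requests
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_url):
    """Download a PDF and extract its text, preferring pdfplumber over PyPDF2."""
    resp = requests.get(pdf_url, timeout=30)
    resp.raise_for_status()
    # Write to a temporary file so both parsers can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(resp.content)
        path = tmp.name
    try:
        try:
            with pdfplumber.open(path) as pdf:
                text = "\n".join(page.extract_text() or "" for page in pdf.pages)
        except Exception:
            # Fallback parser (PdfReader exists in PyPDF2 >= 2.0)
            reader = PdfReader(path)
            text = "\n".join(page.extract_text() or "" for page in reader.pages)
        return " ".join(text.split())  # basic cleanup: collapse whitespace
    finally:
        os.remove(path)  # resource cleanup: delete the temporary file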
Step 3: Optimization and Error Handling
Several optimization measures were added:
- Request delays: Avoid putting pressure on the target server
- URL normalization: Handle relative paths and query parameters (see the helper sketch after this list)
- Error handling: Graceful degradation when network exceptions or parsing errors occur
- Progress display: Real-time display of crawling progress and statistics
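As an illustration of the URL normalization step, a minimal helper with the hypothetical name normalize_url; the real project may treat query strings differently, here they are stripped along with fragments:

from urllib.parse import urljoin, urlparse, urlunparse

def normalize_url(base_url, href):
    """Resolve a relative link, then drop the query string and fragment."""
    absolute = urljoin(base_url, href)  # handle relative paths
    parts = urlparse(absolute)._replace(query="", fragment="")
    return urlunparse(parts).rstrip("/")

# Example: normalize_url("https://example.com/docs/", "../about?page=2#team")
# -> "https://example.com/about"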
Step 4: Data Storage and Output
All extracted text content and corresponding URLs are saved in JSON format:
[
  ["Page text content", "Page URL"],
  ["PDF text content", "PDF file URL"]
]
This format is convenient for subsequent data analysis and processing, and is particularly well suited to the vectorization step of a RAG pipeline.
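A sketch of how such a list of [text, url] pairs might be written out; the save_to_json signature and filename pattern here are illustrative, not necessarily what the repository uses:

import json
from datetime import datetime

def save_to_json(results, filename=None):
    """Write the [[text, url], ...] pairs to a UTF-8 JSON file."""
    if filename is None:
        filename = f"website_text_{datetime.now():%Y%m%d_%H%M%S}.json"
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return filename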
Code Highlights
Several key features of this tool:
- Two-phase processing: First crawl all web pages, then process PDFs centrally, avoiding the complexity of mixed processing
- Intelligent URL filtering: Automatically exclude CSS, JS, images, and other non-content files (see the sketch after this list)
- Dual PDF parsing: pdfplumber with a PyPDF2 fallback maximizes the text extraction success rate
- Domain restriction: Prevent the crawler from "wandering" to other websites
- Resource cleanup: Temporary PDF files are deleted automatically so they do not accumulate on disk
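The URL-filtering idea can be expressed as a small predicate; the exclusion list below is an assumption, not the repository's exact set:

from urllib.parse import urlparse

# Illustrative exclusion list; the actual project may filter a different set.
SKIP_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg",
                   ".ico", ".woff", ".woff2", ".zip", ".mp4"}

def is_content_url(url):
    """Return True for URLs worth crawling, False for static assets."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in SKIP_EXTENSIONS)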
Usage Example
# Create a downloader instance
downloader = WebsiteDownloader(
    base_url="https://example.com",
    delay=0.1,  # 100 ms delay between requests
)

# Start crawling (at most 500 pages)
result = downloader.crawl_website(max_pages=500)

# Save the results to a JSON file
filename = downloader.save_to_json()
Summary
Compared to the previous Selenium solution, this BeautifulSoup-based website downloader has clear advantages:
- Better stability: Avoids browser driver crash issues
- Lower resource consumption: No need to launch a full browser
- Stronger processing capability: Supports content extraction from large-scale websites
- Better versatility: Can handle any publicly accessible website
This tool is particularly suitable for preparing reference data for RAG systems, or for any scenario that requires batch extraction of website text. The whole project is just over 300 lines of code, yet fully functional and practical.
GitHub Repository: https://github.com/Harvey-Labs/download-entire-website-text