Web Scraper Project in Python



About the project: This is a Python project that implements a simple web scraper.

This program will connect to a website, extract specific data (in this case, the titles listed on a practice page built for scraping exercises), and display it.

For this to work, we'll use two libraries:

  • requests to download the webpage's content, and
  • BeautifulSoup (from the beautifulsoup4 package) to parse the HTML.
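To see how the two libraries divide the work, here is a minimal sketch of the parsing half on its own. BeautifulSoup parses an inline HTML snippet standing in for a downloaded page, so no network request is needed; the `entry-title` class used here is just the example structure this project targets.

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet standing in for a downloaded page,
# so the parsing step can be seen in isolation.
html = """
<html><body>
  <h2 class="entry-title"><a href="/post-1">First Post</a></h2>
  <h2 class="entry-title"><a href="/post-2">Second Post</a></h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for heading in soup.find_all("h2", class_="entry-title"):
    print(heading.get_text(strip=True))
# Prints:
# First Post
# Second Post
```

In the full project below, requests supplies the real HTML that takes the place of this inline string.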

How to use this program:

To use this Web Scraper, you'll first need to install the two required libraries, requests and beautifulsoup4. Open your terminal or command prompt and run:


  pip install requests beautifulsoup4
  

Once the libraries are installed, you can run the Python script:

  • Save the code: Save the code as a Python file (e.g., web_scraper.py).
  • Run the script: Open your terminal, navigate to the directory, and run python web_scraper.py.
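If you want to confirm the installation succeeded before running the scraper, a quick sanity check is to import both packages and print their versions:

```python
# Quick sanity check that both libraries are importable.
import requests
import bs4

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```

If either import raises ModuleNotFoundError, rerun the pip command above.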

This code targets the HTML structure of one specific site, so you will likely need to adjust the parameters passed to the find_all method if you want to scrape a different website.
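For instance, suppose a different site marked its headlines with a (hypothetical) `<h3 class="post-title">` element instead. You would change the `find_all` arguments to match; BeautifulSoup's `select()` method with a CSS selector is an equivalent alternative:

```python
from bs4 import BeautifulSoup

# Hypothetical markup from a different site, where titles live in
# <h3 class="post-title"> elements.
html = '<h3 class="post-title"><a href="/a">Hello</a></h3>'
soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways to match the new structure:
by_find_all = soup.find_all("h3", class_="post-title")
by_select = soup.select("h3.post-title a")  # CSS selector form

print(by_find_all[0].get_text(strip=True))  # Hello
print(by_select[0].get_text(strip=True))    # Hello
```

Your browser's developer tools (right-click, "Inspect") are the easiest way to find the tag names and classes a given site actually uses.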

Project Level: Intermediate

You can copy the snippet below, paste it into any Python editor you have, and run it.

Steps: Follow these steps

Step 1: Copy the code below.

Step 2: Paste the code into your chosen editor.

Step 3: Save the file with a .py extension (e.g., web_scraper.py).

Step 4: Run the script (press F5 if you are using Python's IDLE).




# web_scraper.py

import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    """
    Fetches a webpage, parses its HTML, and extracts the titles listed on it.

    Args:
        url (str): The URL of the website to scrape.
    """
    try:
        # Send a GET request to the URL
        response = requests.get(url, timeout=10)
        
        # Raise an exception for bad status codes (e.g., 404, 500)
        response.raise_for_status()
        
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find all <h3> tags, which contain the book titles on
        # books.toscrape.com. This selector is site-specific and may
        # need to be adjusted for different websites.
        titles = soup.find_all('h3')

        if not titles:
            print("No titles found. The website structure may have changed.")
            return

        print("\n--- Scraped Titles ---")
        for title in titles:
            # The full title is stored in the 'title' attribute of the
            # <a> tag nested inside each <h3> tag; fall back to its text.
            link = title.find('a')
            if link:
                print(link.get('title') or link.text.strip())
        print("----------------------")
        
    except requests.exceptions.RequestException as e:
        print(f"Network error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

def main():
    """
    Main function to run the Web Scraper.
    """
    print("--- Python Web Scraper ---")
    
    # books.toscrape.com is a practice site built for scraping exercises
    url_to_scrape = "http://books.toscrape.com/"
    print(f"Scraping data from: {url_to_scrape}")
    
    scrape_website(url_to_scrape)

    # You can easily prompt the user for a URL to make it more dynamic
    # user_url = input("\nEnter the URL to scrape: ").strip()
    # if user_url:
    #     scrape_website(user_url)
    # else:
    #     print("URL cannot be empty.")

# This ensures that main() is called only when the script is executed directly.
if __name__ == "__main__":
    main()