How To Automate Web Source Code Retrieval: Python urllib3 Tutorial for Beginners

1. Why Learn to Get Web Page Source Code with Python?

Retrieving web page source code is a fundamental task in web development, data scraping, and website testing. While you can view source code directly via the browser’s “View Page Source” feature, this manual method is inefficient: it can’t handle batch processing or automatic saving. In contrast, using Python’s `urllib3` library to send HTTP GET requests lets you fetch source code from a URL and automatically save it as an HTML file, which is much faster than copying it by hand. Opening the saved file in a browser later displays the same content as the original web page. This article uses a local WordPress site as an example to guide you through every step, from environment setup to code execution.

2. Step 1: Check and Install the `urllib3` Library

`urllib3` is a popular Python library for sending HTTP requests and the core tool for this tutorial. Before writing code, you need to confirm if `urllib3` is installed on your computer. If not, you’ll need to install it manually. Here are the detailed steps:

  1. Open the VS Code Terminal:
    Launch VS Code, click “View” in the top menu bar, and select “Terminal” from the dropdown. A terminal panel (PowerShell by default on Windows) will appear at the bottom of VS Code.
  2. Check if `urllib3` is installed:
    Type the command `pip show urllib3` in the terminal and press Enter. If the terminal displays `urllib3`’s version, installation path, and other details, it’s already installed. If it shows “WARNING: Package(s) not found: urllib3”, it’s not installed.
  3. (Optional) Uninstall `urllib3` (for demonstration purposes):
    If `urllib3` is installed but you want to re-demonstrate the installation process, type `pip uninstall urllib3`. The terminal will prompt “Proceed (Y/n)?”. Enter “Y” and press Enter to uninstall. After uninstallation, run `pip show urllib3` again—you’ll see “not found”, confirming successful uninstallation.
  4. Install the `urllib3` library:
    Type `pip install urllib3` in the terminal and press Enter. pip will automatically download and install `urllib3` from the Python Package Index (PyPI). Once installed, the terminal will show “Successfully installed urllib3-x.x.x” (x.x.x is the version number). Running `pip show urllib3` again will display the installation details, confirming your environment is ready.
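
As an optional alternative to `pip show`, the same check can be done from Python itself. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this illustration and is not part of `urllib3` or pip.

```python
import importlib.util
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    # find_spec returns None when the top-level module cannot be imported.
    if importlib.util.find_spec(package_name) is None:
        return None
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # Importable, but without package metadata (e.g. a standard-library module).
        return "unknown"


version = installed_version("urllib3")
if version is None:
    print("urllib3 is not installed; run: pip install urllib3")
else:
    print(f"urllib3 is installed (version {version})")
```

Run it with the same Python interpreter that you will use for the tutorial script, since each interpreter has its own set of installed packages.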

3. Step 2: Write Python Code to Implement the HTTP GET Request Logic

After installing `urllib3`, you can write the Python code. The core logic of the code is: import required libraries → define a function to fetch web source code → receive the URL parameter via the command line → call the function to fetch the source code and save it as an HTML file. Here are the detailed steps:

  1. Create a new Python file:
    In VS Code, click “Explorer” on the left, right-click any folder, select “New File”, and name it `get_web_source.py` (the filename can be customized, but the extension must be `.py`).
  2. Import necessary libraries:
    The first line imports the `urllib3` library for sending HTTP requests. The second line imports Python’s built-in `sys` library for receiving URL parameters from the command line. The code is as follows:

    import urllib3
    import sys
  3. Define the `get_request` function (core functionality):
    This function takes a `url` parameter (the URL of the target web page) and implements the full logic of “send a GET request → fetch source code → save as an HTML file”. Here are the steps inside the function:
    – Create a connection pool manager:
    Use `with urllib3.PoolManager() as http` to create a connection pool manager named `http`. Connection pools reuse HTTP connections to improve request efficiency, and the `with` statement automatically manages connections to avoid resource leaks.
    – Send an HTTP GET request:
    Call the `http.request()` method. The first parameter is `"GET"` (the HTTP request method), and the second parameter is `url` (the target web page address). Assign the result to the `resp` variable. `resp` holds the full response, including the status code in `resp.status` and the headers in `resp.headers`; `resp.data` is the web page’s source code as a `bytes` object.
    – Print the source code to the console:
    Use `print(resp.data)` to print the fetched source code to the terminal for real-time verification.
    – Save the source code as an HTML file:
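    Use `with open("web_source.html", "wb") as f` to create a file named `web_source.html`. The `"wb"` mode opens the file for binary writing, so the bytes are written exactly as received and non-ASCII text (for example, Chinese characters) is not garbled by a re-encoding step. Then use `f.write(resp.data)` to write the source code from `resp.data` to the file. The file is saved in the terminal’s current working directory, which is the script’s folder if you run the script from there.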
    – Print a success message:
    After writing the file, use `print("Response content saved to file")` to notify the user that the operation is complete.
    The complete function code is as follows:

    def get_request(url):
        # Create a connection pool manager
        with urllib3.PoolManager() as http:
            # Send an HTTP GET request and get the response
            resp = http.request("GET", url)
            # Print the source code (a bytes object) to the console
            print(resp.data)
            # Save the source code as an HTML file
            with open("web_source.html", "wb") as f:
                f.write(resp.data)
            # Print a success message
            print("Response content saved to file")
  4. Write the main block to receive command-line parameters:
    To let the script receive the URL parameter via the command line (without modifying the URL in the code each time), add an `if __name__ == "__main__":` block at the end of the code. This block runs only when the file is executed as a script, not when it is imported. Here are the steps:
    – Get the URL parameter from the command line:
    `sys.argv` is a list that stores all command-line parameters. `sys.argv[0]` is the script name (possibly with its path), and `sys.argv[1]` is the first parameter after the script name (the URL we need). Use `url = sys.argv[1]` to get the URL. If you run the script without a URL, this line raises an `IndexError`.
    – Call the `get_request` function:
    Pass `url` to the function to trigger the GET request, print the source code, and save the file.
    The complete main block is as follows:

    if __name__ == "__main__":
        # Get the URL parameter from the command line
        url = sys.argv[1]
        # Call the function to fetch the source code and save it
        get_request(url)
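
Because `sys.argv[1]` fails with an `IndexError` when no URL is given, you may want to validate the command-line input first. The sketch below is one possible way to do that; the function name `url_from_argv` and the rule "only http and https URLs are accepted" are assumptions made for this example, not part of the original script.

```python
from urllib.parse import urlparse


def url_from_argv(argv):
    """Return the URL passed on the command line, or exit with a message."""
    # argv[0] is the script name, so a URL is present only if len(argv) >= 2.
    if len(argv) < 2:
        raise SystemExit(f"Usage: python {argv[0]} <url>")
    url = argv[1]
    parts = urlparse(url)
    # Require an explicit http/https scheme and a host name.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise SystemExit(f"Not a valid http(s) URL: {url}")
    return url


# Demonstration with a hand-written argument list instead of the real sys.argv.
print(url_from_argv(["get_web_source.py", "http://localhost/wordpress"]))
```

In the script, this would replace the direct lookup: `url = url_from_argv(sys.argv)`. Raising `SystemExit` with a string prints the message and ends the script with a non-zero exit code.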

4. Step 3: Run the Python Script and Verify the Web Page Source Code

Once the code is written, you can run the script via the command line, pass the target web page’s URL, fetch the source code, and verify the results. This tutorial uses a local WordPress site (the URL is usually `http://localhost/wordpress`; adjust it to your local site’s address). Here are the detailed steps:

  1. Confirm the target web page is accessible:
    Open a browser, enter the URL of your local WordPress site (e.g., `http://localhost/wordpress`) in the address bar. If the web page displays normally, the URL is valid. Right-click the blank area of the web page and select “View Page Source” to see the original code—this is the code we’ll fetch with Python.
  2. Run the Python script in the terminal:
    Return to the VS Code terminal. First, use the `cd` command to navigate to the folder where the Python script is stored (e.g., if the script is in “D:\PythonProjects”, type `cd D:\PythonProjects` and press Enter). Then type the command `python get_web_source.py http://localhost/wordpress` (note: “get_web_source.py” is your script’s filename, and “http://localhost/wordpress” is the target URL—there’s a space between them). Press Enter.
  3. Check the running results:
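    After the script runs, the terminal will first print the web page’s source code, then display “Response content saved to file”, indicating successful execution. The printed code has the same content as “View Page Source” in the browser, but because `resp.data` is a `bytes` object, it appears as a single bytes literal (`b'...'`) with line breaks shown as `\n`.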
  4. Verify the saved HTML file:
    – Locate the HTML file: Open File Explorer, navigate to the folder where the Python script is stored. You’ll see a file named `web_source.html`—this is the file automatically generated by the script.
    – Open the file with a browser: Right-click `web_source.html`, select “Open with”, and choose Microsoft Edge (or another browser) from the dropdown. After opening, the page should look the same as when you access `http://localhost/wordpress` directly, provided the local site is still running so that the stylesheets, scripts, and images referenced by the HTML can load. This confirms that the source code has been successfully fetched and saved.
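
If you would rather see readable text in the terminal in step 3, you can decode the bytes before printing them. This is a small sketch that assumes the page is encoded as UTF-8 (a common default, but an assumption here); `to_text` is a hypothetical helper name, and `sample` stands in for `resp.data` so the example runs without a network request.

```python
def to_text(data, encoding="utf-8"):
    """Decode response bytes into a str; undecodable bytes become U+FFFD."""
    return data.decode(encoding, errors="replace")


# Stand-in for resp.data (assumption: the server sends UTF-8).
sample = "<title>Café</title>\n".encode("utf-8")

print(sample)           # b'<title>Caf\xc3\xa9</title>\n'
print(to_text(sample))  # readable text with a real line break
```

Inside `get_request`, this would mean printing `to_text(resp.data)` while still writing the unmodified `resp.data` to the file in binary mode.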

5. Conclusion: Advantages and Extensions of Getting Web Source Code with Python

Through this tutorial, we implemented the full process of “sending an HTTP GET request → fetching web page source code → saving it as an HTML file” using Python. Compared with manual operations, this method has three key advantages: ① High automation: no need to copy and paste code manually; ② Batch processing capability: modify the code to loop through multiple URLs, and you can fetch source code from multiple web pages in one run; ③ Secondary development potential: after fetching the source code, you can add logic to parse HTML (e.g., using the `BeautifulSoup` library to extract titles, image links, etc.) to meet more needs.
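
The batch processing idea could be sketched roughly as follows. Everything here is illustrative: `filename_for`, `save_pages`, and `fetch_all_with_urllib3` are hypothetical names, and the file-naming rule is one assumption among many possible ones. Each page gets its own file so that later pages do not overwrite `web_source.html`.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url):
    """Build a file name such as 'localhost_wordpress.html' from a URL."""
    parts = urlparse(url)
    raw = f"{parts.netloc}{parts.path}".strip("/")
    # Replace runs of characters that are unsafe in file names with "_".
    # The query string is ignored, so URLs differing only there would collide.
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return f"{safe or 'index'}.html"


def save_pages(urls, fetch, out_dir="."):
    """Call fetch(url) for every URL, save the bytes, and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        path = out_dir / filename_for(url)
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved


def fetch_all_with_urllib3(urls, out_dir="."):
    """Fetch all URLs through one shared urllib3 connection pool."""
    import urllib3  # imported here so the helpers above work without urllib3

    with urllib3.PoolManager() as http:
        return save_pages(urls, lambda u: http.request("GET", u).data, out_dir)
```

A call such as `fetch_all_with_urllib3(["http://localhost/wordpress"], "pages")` would then write `pages/localhost_wordpress.html`. Passing the fetch step in as a function keeps the file-saving logic independent of the network, which makes it easier to test.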

If you encounter problems during operation (e.g., inaccessible URLs, code errors), check if the URL is correct, confirm `urllib3` is installed successfully, or leave a comment for discussion. Mastering this basic skill will help you complete web development testing, data collection, and analysis work more efficiently in the future.
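
To make such problems easier to diagnose, the request can be wrapped in basic error handling. The following is a hedged sketch rather than a complete solution: `fetch_or_none` is a hypothetical name, the 5-second timeout is an arbitrary choice, and treating every status other than 200 as a failure is a simplification.

```python
import urllib3


def fetch_or_none(url, timeout=5.0):
    """Return the page body as bytes, or None if the request fails."""
    with urllib3.PoolManager() as http:
        try:
            resp = http.request("GET", url, timeout=timeout)
        except urllib3.exceptions.HTTPError as exc:
            # Base class of urllib3's own errors, e.g. refused connections,
            # timeouts, and URLs without a host.
            print(f"Request failed: {exc!r}")
            return None
        if resp.status != 200:
            # The server answered, but not with the page (e.g. 404 or 500).
            print(f"Server answered with HTTP status {resp.status}")
            return None
        return resp.data
```

For example, `fetch_or_none("http://localhost/wordpress")` returns `None` and prints a message if the local web server is not running, instead of ending the script with a traceback.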

6. Demo Video

You can watch the demo video below and select your preferred subtitle language.
