Automating URL Indexing with Python: A Deep Dive into the Google Indexing API

In today’s digital landscape, ensuring that your website’s content is discoverable by search engines is paramount. Search engine optimization (SEO) relies on indexing, where search engines crawl and store information about web pages. While search engines like Google automatically index many pages, you can directly notify them of new or updated content using the Google Indexing API. In this article, we’ll explore how to automate this process using Python.

Introduction to the Google Indexing API

The Google Indexing API allows developers to programmatically notify Google when a URL on their site is added, removed, or updated. This direct method of communication ensures that new or updated content is promptly indexed, potentially leading to faster discovery and inclusion in search results.
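Concretely, each notification is a small JSON body POSTed to the API's publish endpoint: a URL plus a notification type. A minimal sketch of the payload (the example URL is hypothetical; `URL_UPDATED` covers both new and changed pages, `URL_DELETED` signals removal):

```python
import json

# One notification per request: the URL and what happened to it.
# "URL_UPDATED" covers both new and changed pages; "URL_DELETED" signals removal.
notification = {
    "url": "https://example.com/new-article",  # hypothetical URL
    "type": "URL_UPDATED",
}

# This is the JSON body POSTed to the publish endpoint
payload = json.dumps(notification)
print(payload)
```

On success, the API echoes back notification metadata for the URL; on failure it returns a JSON error object with a `code` and `message`, which is what the script below inspects.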

Understanding the Code

Let’s delve into the Python code that automates URL indexing using the Google Indexing API.

# Importing required libraries
from tqdm import tqdm
import asyncio
import aiohttp
import os
import pandas as pd
from oauth2client.service_account import ServiceAccountCredentials
import json

# Constants
SCOPES = ["https://www.googleapis.com/auth/indexing"]
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"
URLS_PER_ACCOUNT = 200  # The Indexing API's default quota is 200 publish requests per project per day

# Function to send URL indexing requests
async def send_url(session, http, url):
    # Content to be sent in the request
    content = {
        'url': url.strip(),
        'type': "URL_UPDATED"
    }
    # Retry up to 3 times in case of errors
    for _ in range(3):
        try:
            async with session.post(ENDPOINT, json=content, headers={"Authorization": f"Bearer {http}"}, ssl=False, timeout=60) as response:
                return await response.text()
        except (aiohttp.ServerDisconnectedError, asyncio.TimeoutError):
            print("Connection error: Retrying...")
            await asyncio.sleep(2)  # Wait for 2 seconds before retrying
            continue
    # Return a custom error message after all retries fail
    return '{"error": {"code": 500, "message": "Server Disconnected after multiple retries"}}'

# Function to asynchronously index URLs
async def indexURL(http, urls):
    successful_urls = 0
    error_429_count = 0
    other_errors_count = 0
    tasks = []

    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(send_url(session, http, url))

        # tqdm wraps as_completed so the progress bar advances as requests finish
        results = []
        for task in tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Processing URLs", unit="url"):
            results.append(await task)

        # Processing the results
        for result in results:
            # Non-JSON responses (e.g. HTML error pages) count as failures
            try:
                data = json.loads(result)
            except json.JSONDecodeError:
                other_errors_count += 1
                continue
            if "error" in data:
                if data["error"]["code"] == 429:
                    error_429_count += 1
                else:
                    other_errors_count += 1
            else:
                successful_urls += 1

    # Printing the summary
    print(f"\nTotal URLs Tried: {len(urls)}")
    print(f"Successful URLs: {successful_urls}")
    print(f"URLs with Error 429: {error_429_count}")
    print(f"URLs with Other Errors: {other_errors_count}")

# Function to set up the OAuth2 client
def setup_http_client(json_key_file):
    credentials = ServiceAccountCredentials.from_json_keyfile_name(json_key_file, scopes=SCOPES)
    token = credentials.get_access_token().access_token
    return token

# Main function
def main():
    # Check if CSV file exists
    if not os.path.exists("data.csv"):
        print("Error: data.csv file not found!")
        return

    # Ask user for number of accounts
    try:
        num_accounts = int(input("How many accounts have you created (1-5)? "))
    except ValueError:
        print("Invalid input. Please enter a whole number between 1 and 5.")
        return
    if not 1 <= num_accounts <= 5:
        print("Invalid number of accounts. Please enter a number between 1 and 5.")
        return

    # Read all URLs from CSV
    try:
        all_urls = pd.read_csv("data.csv")["URL"].tolist()
    except Exception as e:
        print(f"Error reading data.csv: {e}")
        return

    # Process URLs for each account
    for i in range(num_accounts):
        print(f"\nProcessing URLs for Account {i+1}...")
        json_key_file = f"account{i+1}.json"

        # Check if account JSON file exists
        if not os.path.exists(json_key_file):
            print(f"Error: {json_key_file} not found!")
            continue

        start_index = i * URLS_PER_ACCOUNT
        end_index = start_index + URLS_PER_ACCOUNT
        urls_for_account = all_urls[start_index:end_index]

        http = setup_http_client(json_key_file)
        asyncio.run(indexURL(http, urls_for_account))

# Call the main function
if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nScript interrupted. Press Enter to restart from the beginning or Ctrl+C again to exit.")
        input()
        main()

Explaining the Code

  1. Dependencies: The script relies on tqdm for progress bars, aiohttp for asynchronous HTTP requests, pandas for reading the URL list from CSV, and oauth2client for service-account authentication.
  2. Constants: Constants like SCOPES and ENDPOINT define the Google Indexing API scope and endpoint URL, respectively.
  3. Functions:
  • send_url: Sends a single URL notification asynchronously, retrying up to three times on connection errors before giving up.
  • indexURL: Fires off the requests for a batch of URLs concurrently and tallies successes, 429 (quota) errors, and other failures.
  • setup_http_client: Obtains an OAuth2 access token from a service-account JSON key file.
  • main: Orchestrates the process: reads URLs from data.csv, slices them into batches of 200, and processes one batch per service account.
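The script expects a data.csv with a single URL column and one service-account key file per account (account1.json, account2.json, and so on). As a minimal sketch, using hypothetical URLs, this is how to produce a CSV in the shape main() reads:

```python
import pandas as pd

# Hypothetical URLs; data.csv needs exactly one "URL" column
urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]
pd.DataFrame({"URL": urls}).to_csv("data.csv", index=False)

# Read it back the same way the script's main() does
loaded = pd.read_csv("data.csv")["URL"].tolist()
print(loaded)
```

Because each account handles a slice of 200 URLs, a list of, say, 650 URLs needs four service accounts to be fully processed in one run.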

Conclusion

Automating URL indexing with Python using the Google Indexing API offers a convenient way to ensure that your website’s content is promptly indexed by search engines. By leveraging asynchronous programming and error handling mechanisms, you can efficiently manage large volumes of URLs across multiple user accounts. Integrating this script into your workflow can streamline the SEO process and improve the discoverability of your web content.
