Documentation: URL Indexing with Python
This script is designed to asynchronously index URLs using the Google Indexing API. It reads URLs from a CSV file, processes them in batches, and sends them for indexing. It provides support for multiple user accounts and handles errors gracefully.
Dependencies
- tqdm: A fast, extensible progress bar for loops and iterables.
- asyncio: Asynchronous I/O framework for concurrent code execution (standard library).
- aiohttp: Asynchronous HTTP client/server framework.
- os: Operating system interface for interacting with the filesystem (standard library).
- pandas: Data manipulation and analysis library.
- oauth2client.service_account: Module of the oauth2client library for handling OAuth2 credentials for service accounts.
- json: Module for encoding and decoding JSON data (standard library).
Constants
- SCOPES: OAuth2 scopes required for authentication with the Google Indexing API.
- ENDPOINT: API endpoint URL for submitting URL indexing requests.
- URLS_PER_ACCOUNT: Maximum number of URLs to process per user account.
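As a rough illustration, the constants might be defined as below. The scope and endpoint are the publicly documented values for the Indexing API. The value of URLS_PER_ACCOUNT is an assumption (it matches the API's default daily publish quota), and the script's actual value may differ.

```python
# OAuth2 scope documented for the Google Indexing API.
SCOPES = ["https://www.googleapis.com/auth/indexing"]

# Endpoint that accepts URL update notifications.
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

# Assumed value: the default quota is 200 publish requests per day per project.
URLS_PER_ACCOUNT = 200
```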
Functions
send_url(session, http, url)
Asynchronously sends a URL indexing request to the Google Indexing API.
- session: aiohttp.ClientSession object for making HTTP requests.
- http: OAuth2 token for authorization.
- url: URL to be indexed.
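A minimal sketch of what send_url could look like, assuming the token is sent as a Bearer header and the notification type is URL_UPDATED. The helper build_request is hypothetical and exists only to keep the example readable; the real function may build its request differently and may return something other than the status and body.

```python
import json

ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"


def build_request(token, url):
    """Hypothetical helper: build the headers and JSON body for one URL."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    # URL_UPDATED asks Google to (re)crawl the page; assumed notification type.
    body = json.dumps({"url": url, "type": "URL_UPDATED"})
    return headers, body


async def send_url(session, http, url):
    """POST one notification and return (status code, response text).

    `session` is expected to behave like an aiohttp.ClientSession.
    """
    headers, body = build_request(http, url)
    async with session.post(ENDPOINT, data=body, headers=headers) as response:
        return response.status, await response.text()
```

Returning the status code lets the caller count successes and 429 responses separately.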
indexURL(http, urls)
Asynchronously indexes multiple URLs.
- http: OAuth2 token for authorization.
- urls: List of URLs to be indexed.
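The concurrent part of indexURL can be sketched with asyncio.gather. This is an illustration under assumptions, not the script's actual code: index_urls and fake_send are hypothetical names, and fake_send stands in for the real HTTP request so the example runs offline.

```python
import asyncio
from collections import Counter


async def index_urls(send, urls):
    """Run `send(url)` for every URL concurrently and tally the status codes."""
    statuses = await asyncio.gather(*(send(url) for url in urls))
    return Counter(statuses)


async def fake_send(url):
    # Stand-in for the real request: pretend URLs ending in /busy are rate limited.
    await asyncio.sleep(0)
    return 429 if url.endswith("/busy") else 200


counts = asyncio.run(
    index_urls(fake_send, ["https://example.com/a", "https://example.com/busy"])
)
print("successful:", counts[200], "rate limited (429):", counts[429])
```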
setup_http_client(json_key_file)
Sets up the OAuth2 client for authorization using the provided JSON key file.
- json_key_file: Path to the JSON key file containing service account credentials.
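One possible shape for setup_http_client, assuming it returns the access token string that the other functions receive as http. The oauth2client import is done inside the function so the sketch loads even where the library is not installed. key_file_for_account is a hypothetical helper that mirrors the account1.json, account2.json naming described under Usage.

```python
SCOPES = ["https://www.googleapis.com/auth/indexing"]


def key_file_for_account(number):
    """Hypothetical helper: map account 1, 2, ... to its key file name."""
    return f"account{number}.json"


def setup_http_client(json_key_file):
    """Load service account credentials and return an OAuth2 access token."""
    # Imported lazily; requires the oauth2client package at call time.
    from oauth2client.service_account import ServiceAccountCredentials

    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        json_key_file, scopes=SCOPES
    )
    # get_access_token() returns an object carrying the token and its expiry.
    return credentials.get_access_token().access_token
```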
main()
Main entry point of the script. Reads URLs from a CSV file, processes them for indexing, and prints the results.
Workflow
- Check if the required CSV file (data.csv) exists.
- Prompt the user to enter the number of user accounts to process URLs for (between 1 and 5).
- Read all URLs from the CSV file.
- Process URLs for each user account sequentially:
  - Retrieve OAuth2 token for authorization.
  - Partition URLs into batches based on the URLS_PER_ACCOUNT constant.
  - Asynchronously index each batch of URLs.
- Print the total number of URLs processed, successful URLs, and URLs with error code 429 (Too Many Requests).
- Handle keyboard interrupts gracefully to pause or exit the script.
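The partitioning step in the workflow above can be sketched as follows. It assumes each account receives exactly one batch of at most URLS_PER_ACCOUNT URLs and that any remaining URLs are left for a later run; the function names are hypothetical, and the script may distribute URLs differently.

```python
def partition(urls, size):
    """Split `urls` into consecutive batches of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def assign_batches(urls, num_accounts, per_account):
    """Map each account's key file to the batch of URLs it should submit."""
    batches = partition(urls, per_account)
    return {
        f"account{index + 1}.json": batch
        for index, batch in enumerate(batches[:num_accounts])
    }


urls = [f"https://example.com/page{i}" for i in range(5)]
plan = assign_batches(urls, num_accounts=2, per_account=2)
for key_file, batch in plan.items():
    print(key_file, len(batch))
```

With five URLs, two accounts, and a batch size of two, the fifth URL is not assigned to any account in this sketch.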
Usage
- Ensure the required dependencies are installed.
- Prepare a CSV file (data.csv) containing the URLs to be indexed.
- Create service account JSON key files (account1.json, account2.json, etc.) for each user account.
- Run the script and follow the prompts to specify the number of user accounts.
- Monitor the progress and review the indexing results.
Error Handling
- The script retries failed URL indexing requests up to three times.
- It handles server disconnection errors (ServerDisconnectedError) gracefully by retrying after a brief delay.
- It prints a custom error message if a URL fails to index after multiple retries.
- It provides a pause/resume mechanism to interrupt execution and resume later.
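The retry behavior described above might look roughly like this. It is a sketch under assumptions: "up to three times" is read as three attempts in total, the delay value is invented, and the built-in ConnectionError stands in for aiohttp's ServerDisconnectedError so the example runs without aiohttp. The name with_retries is hypothetical.

```python
import asyncio

MAX_ATTEMPTS = 3    # assumed reading of "up to three times"
RETRY_DELAY = 0.5   # seconds; assumed value for the "brief delay"


async def with_retries(make_request, url, attempts=MAX_ATTEMPTS, delay=RETRY_DELAY):
    """Await `make_request(url)`, retrying on connection errors.

    Returns the request's result, or None once every attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await make_request(url)
        except ConnectionError:
            if attempt == attempts:
                # Custom message once all attempts are exhausted.
                print(f"Failed to index {url} after {attempts} attempts")
                return None
            # Wait briefly before the next attempt.
            await asyncio.sleep(delay)
```

In the real script the except clause would name the aiohttp exception instead of ConnectionError.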