Automating Log File Analysis with Python and Pandas


The "Black Box" of SEO is Sitting on Your Server.

Most SEOs are flying blind. They rely on third-party crawlers (like Screaming Frog or Ahrefs) to simulate what Google sees. They look at Google Search Console (GSC) sample data to guess what is happening.

But simulation is not reality. And sampling is not accuracy.

The single source of truth in SEO is your Server Log File. This file records every request made to your server, including every hit from Googlebot. It tells you exactly:

Which pages Googlebot visits most (and least).
Where your “Crawl Budget” is being wasted on useless parameters.
Which high-value pages are being completely ignored.

Traditionally, Log File Analysis was expensive (Enterprise SaaS tools cost $1,000+/mo) or difficult (Excel crashes after 1 million rows).

At kōdōkalabs, we solve this with Python.

This guide is a technical tutorial on how to build your own Log Analysis Engine using the Pandas library. We will move you from “Guessing” to “Knowing” in less than 50 lines of code.

Part 1: Why You Cannot Ignore Log Files

If you manage a site with over 5,000 pages (e.g., E-commerce, Publisher, B2B SaaS), Crawl Budget decides which pages get a chance to rank: a page that is never crawled cannot be indexed or refreshed. Google does not have infinite resources. If it spends its budget crawling your filters, 404s, and redirects, it won’t have budget left to discover your new “Money Pages.”

The "Zombie Page" Phenomenon

Our internal data suggests that on large sites, 40% of pages receive zero crawls per month.
These are “Zombie Pages.” They exist on your site, but to Google, they are dead.

SaaS Tools won’t show you this (they only show what they found).
GSC won’t show you this (it aggregates data).
Python reveals it instantly.

Part 2: The Stack (Free & Powerful)

We are going to replicate the functionality of a $10k enterprise tool for free.

Language: Python 3.10+
Library: Pandas (For high-speed data manipulation).
Library: Advertools (An amazing SEO-specific library by Elias Dabbas).
Input: Your raw access log files (usually .log or .gz from Nginx/Apache).

Part 3: The Tutorial (From Raw Data to Insight)

This is where 99% of marketers fail. They rush the data preparation and jump straight to conclusions.
You must spend 80% of your time on the data.

Step 1: Accessing the Data

Ask your DevOps team or hosting provider for the “Raw Access Logs” for the last 30 days. You want the unprocessed Nginx or Apache logs.
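Logs for a 30-day window usually arrive as a mix of plain and gzipped rotation files. Here is a minimal sketch for merging them into one file before parsing. It assumes the files sit in one folder and are named access.log, access.log.1, access.log.2.gz and so on; the function name and the output file name are illustrative, not part of any library.

```python
import gzip
import shutil
from pathlib import Path


def combine_logs(log_dir, output_path):
    """Concatenate plain and gzipped access logs from log_dir into one file.

    Returns the number of source files that were merged.
    """
    log_dir = Path(log_dir)
    output_path = Path(output_path)
    # Sort by name so the merge order is repeatable (order does not matter
    # for the analysis, because every line carries its own timestamp).
    sources = sorted(
        p for p in log_dir.iterdir()
        if p.is_file()
        and p.name.startswith("access.log")  # assumed naming scheme
        and p.resolve() != output_path.resolve()
    )
    with open(output_path, "wb") as out:
        for src in sources:
            # Transparently decompress rotated .gz files
            opener = gzip.open if src.suffix == ".gz" else open
            with opener(src, "rb") as f:
                shutil.copyfileobj(f, out)
    return len(sources)
```

Calling combine_logs('logs/', 'combined_access.log') would give you a single file to hand to the parsing script in the next step.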

Step 2: The Parsing Script

Raw logs are messy strings. We need to convert them into a structured DataFrame (a table) so we can ask questions.

We will use advertools because it handles the regex parsing automatically.

import advertools as adv
import pandas as pd

# 1. Parse the log file (handles standard combined log format)
# converting the raw .log file into a structured Parquet file for speed
adv.logs_to_df(
    log_file='access.log',
    output_file='logs_output.parquet',
    errors_file='errors.txt',
    log_format='combined'
)

# 2. Load the data into Pandas
df = pd.read_parquet('logs_output.parquet')

# 3. Convert timestamp to datetime object (crucial for time-series analysis)
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%b/%Y:%H:%M:%S %z')

print(f"Total Requests Loaded: {len(df)}")
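To see what “regex parsing” means in practice, here is a small sketch that splits one combined-format line into named fields using only the standard library. The pattern is our own illustration, not the one advertools uses internally, and the sample line is made up.

```python
import re

# Illustrative pattern for the Apache/Nginx "combined" log format
COMBINED_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up example line (the IP comes from a reserved documentation range)
sample_line = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /blog/log-analysis/ HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
parsed = parse_line(sample_line)
print(parsed["request"], parsed["status"])
```

Lines that do not fit the pattern return None, which is roughly the kind of line a parser would set aside in an errors file.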

Step 3: Filtering for Googlebot

The internet is full of noise (scrapers, user traffic, malicious bots). We only care about the VIP: Googlebot. Warning: Do not rely on the User Agent string alone (it can be spoofed). In a production environment, you should verify the IP address against Google’s list. For this tutorial, we will filter by User Agent for simplicity.

# Create a filter for Googlebot User Agents
googlebot_filter = df['user_agent'].str.contains('Googlebot', case=False, na=False)

# Create a clean DataFrame of only Googlebot hits
google_df = df[googlebot_filter].copy()

print(f"Total Googlebot Hits: {len(google_df)}")
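For the production-grade check mentioned in the warning, one possible approach is to test each client IP against a list of CIDR ranges. The sketch below uses only the standard library. The ranges in it are placeholders from the reserved documentation blocks, not real Googlebot addresses; you would swap in the prefixes from Google's published list. The function names are our own.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), NOT real Googlebot IPs.
# Replace them with the prefixes from Google's published Googlebot IP list.
ILLUSTRATIVE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def build_networks(cidrs):
    """Turn CIDR strings into network objects once, up front."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_in_ranges(ip, networks):
    """Return True if ip falls inside any of the given networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed or missing client value in the log
        return False
    return any(
        addr.version == net.version and addr in net
        for net in networks
    )


networks = build_networks(ILLUSTRATIVE_RANGES)
print(is_in_ranges("192.0.2.55", networks))
```

Assuming the parsed logs keep the client IP in a column named client, something like google_df['client'].map(lambda ip: is_in_ranges(ip, networks)) would give you a boolean mask to combine with the User Agent filter.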

Part 4: The Analysis (Asking the Right Questions)

Now that we have clean data, we can answer the strategic questions that drive SEO revenue.

Analysis A: Status Code Distribution (The Health Check)

Are bots hitting 200s (Success), or are they hitting 301s (Redirects) and 404s (Errors)?

# Count status codes
status_counts = google_df['status'].value_counts()
print(status_counts)

# Calculate Percentage
status_percent = google_df['status'].value_counts(normalize=True) * 100
print(status_percent)

Strategic Insight:

If 3xx > 10%: You have redirect chains wasting budget.
If 4xx > 5%: You are sending Google into dead ends. Prioritize fixing these links immediately.
If 5xx > 1%: Your server is unstable when Google visits. This is a critical ranking emergency.
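The thresholds above apply to status classes (3xx, 4xx, 5xx) rather than individual codes, so it helps to group the codes first. Here is a rough sketch of that check. The function name and the sample data are made up for illustration, and the thresholds are simply the three rules of thumb listed above.

```python
import pandas as pd

# Rule-of-thumb limits from the list above, in percent of Googlebot hits
THRESHOLDS = {"3xx": 10.0, "4xx": 5.0, "5xx": 1.0}


def status_class_report(statuses):
    """Return (percent share per status class, classes over their limit)."""
    # Cast to str so this works whether codes are stored as '404' or 404
    classes = pd.Series(statuses).astype(str).str[0] + "xx"
    shares = classes.value_counts(normalize=True) * 100
    breaches = [
        cls for cls, limit in THRESHOLDS.items()
        if shares.get(cls, 0.0) > limit
    ]
    return shares, breaches


# Made-up sample: 20 hits with 3 redirects, 2 not-found errors, no 5xx
sample = ["200"] * 15 + ["301"] * 3 + ["404"] * 2
shares, breaches = status_class_report(sample)
print(shares)
print("Over threshold:", breaches)
```

On real data you would pass google_df['status'] instead of the sample list.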

Analysis B: Top Crawled Sections (The Priority Check)

Google tells you what it thinks is important by where it spends its time. Let’s see which folders get the most love.

# Extract the first folder from the URL path
google_df['folder'] = google_df['request'].str.split('/').str[1]

# Count hits per folder
folder_counts = google_df.groupby('folder').size().sort_values(ascending=False)
print(folder_counts.head(10))

Strategic Insight:

Is your /blog/ getting 80% of the crawl but driving 5% of revenue?
Is your /products/ folder being ignored?
Action: If low-value folders are dominating, block them in robots.txt or improve internal linking to high-value folders.
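To answer the “80% of the crawl” question you need shares, not raw counts, and query strings can muddy the folder names (a hit on /?utm_source=x is really a homepage hit). Below is a small sketch that strips the query string before grouping and returns percentages. The function names and the "(root)" label are our own choices.

```python
from collections import Counter
from urllib.parse import urlsplit


def first_folder(request):
    """Return the first path segment of a request, ignoring any query string."""
    path = urlsplit(request).path
    segment = path.lstrip("/").split("/")[0]
    # Hits on the homepage have no first segment
    return segment or "(root)"


def folder_shares(requests):
    """Return {folder: percent of hits}, most-crawled folder first."""
    counts = Counter(first_folder(r) for r in requests)
    total = sum(counts.values())
    return {folder: 100 * n / total for folder, n in counts.most_common()}


# Made-up requests for illustration
print(folder_shares(["/blog/a", "/blog/b?x=1", "/products/p1", "/?utm=1"]))
```

With the real data, folder_shares(google_df['request']) would give the share of Googlebot hits per top-level folder.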

Analysis C: Identifying "Zombie Pages"

This is the money script. We need to compare “All Known Pages” (from your Sitemap) vs. “Crawled Pages” (from Logs). Prerequisite: Download your XML sitemap into a CSV/List.
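If you have the sitemap saved locally as XML, a sketch like the following could produce the CSV. It assumes a standard single sitemap file (not a sitemap index) and writes one column named url. The function name and file names are illustrative. advertools also ships a sitemap_to_df function that may be a convenient alternative.

```python
import csv
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_csv(sitemap_path, csv_path):
    """Write every <loc> in a local sitemap XML file to a one-column CSV.

    Returns the number of URLs written.
    """
    root = ET.parse(sitemap_path).getroot()
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])  # header row
        writer.writerows([u] for u in urls)
    return len(urls)
```

For example, sitemap_to_csv('sitemap.xml', 'sitemap_urls.csv') would create the file the zombie script reads.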

from urllib.parse import urlsplit

# Load your sitemap URLs
sitemap_urls = pd.read_csv('sitemap_urls.csv')['url'].tolist()

# Sitemaps list absolute URLs, while the log 'request' column holds paths,
# so reduce each sitemap URL to its path before comparing
sitemap_paths = {urlsplit(url).path or '/' for url in sitemap_urls}

# Get unique URLs crawled by Google
crawled_urls = google_df['request'].unique().tolist()

# Find the difference (Pages in Sitemap but NOT in Logs)
zombie_pages = sitemap_paths - set(crawled_urls)

print(f"Total Pages in Sitemap: {len(sitemap_paths)}")
print(f"Total Pages Crawled: {len(crawled_urls)}")
print(f"Zombie Pages (Zero Crawls): {len(zombie_pages)}")

# Export Zombies to CSV for the Content Team
pd.DataFrame(sorted(zombie_pages), columns=['url']).to_csv('zombie_pages.csv', index=False)

Strategic Insight:

A Zombie Page has no chance of ranking or being updated in the index.
Action: Check the internal linking. Does this page have zero inbound links? Add it to your “Interlinking Agent” workflow.

Part 5: Advanced "Crawl Velocity" Analysis

For news sites or high-frequency publishers, knowing how fast Google discovers new content is key. Think of it as the crawling equivalent of “Time to First Byte.” A good starting point is to chart how many Googlebot hits arrive each hour.

import matplotlib.pyplot as plt

# Resample data by hour to see crawl frequency trends
# ('h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas)
hourly_crawl = google_df.set_index('datetime').resample('h').size()

# Plot it (requires matplotlib)
hourly_crawl.plot(title="Googlebot Hits Per Hour", figsize=(12, 6))
plt.show()

If you see massive spikes, correlate them with your deployments. Did a new code push trigger a re-crawl? Or did a server error cause a drop-off?
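To measure discovery speed directly, you can compare each page's publish time with its first Googlebot hit. The sketch below assumes you can export publish times from your CMS into a table with request and published_at columns. That table, the function name, and the sample data are hypothetical. Both timestamp columns need to be timezone-aware so they can be subtracted.

```python
import pandas as pd


def time_to_first_crawl(hits, published):
    """Hours between publication and the first Googlebot hit, per URL.

    hits:      DataFrame with 'request' and 'datetime' columns (from the logs)
    published: DataFrame with 'request' and 'published_at' columns (hypothetical)
    """
    # Earliest Googlebot hit for each URL
    first_hit = (
        hits.groupby("request", as_index=False)["datetime"].min()
        .rename(columns={"datetime": "first_crawl"})
    )
    # Left join keeps pages that were never crawled (first_crawl is NaT)
    merged = published.merge(first_hit, on="request", how="left")
    merged["hours_to_first_crawl"] = (
        (merged["first_crawl"] - merged["published_at"]).dt.total_seconds() / 3600
    )
    return merged


# Made-up sample data for illustration
hits = pd.DataFrame({
    "request": ["/news/a", "/news/a", "/news/b"],
    "datetime": pd.to_datetime(
        ["2024-01-01 12:00", "2024-01-01 09:00", "2024-01-02 00:00"], utc=True
    ),
})
published = pd.DataFrame({
    "request": ["/news/a", "/news/b", "/news/c"],
    "published_at": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 18:00", "2024-01-01 10:00"], utc=True
    ),
})
result = time_to_first_crawl(hits, published)
print(result[["request", "hours_to_first_crawl"]])
```

Pages with an empty hours_to_first_crawl value were published but never crawled within the log window.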

Part 6: Automating the Pipeline

Doing this once is an audit. Doing it on a schedule is DevOps.
At kōdōkalabs, we wrap this script in a Docker container and run it every Monday morning.

Cron Job: Pulls logs from AWS S3.
Script: Runs the analysis.
Alerting: If Zombie_Rate > 30% or Error_Rate > 5%, it sends a Slack alert to the Technical SEO channel.
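The alerting step could look roughly like the sketch below. It is not our production code. It assumes Zombie_Rate means zombie pages as a percentage of sitemap pages and Error_Rate means error responses as a percentage of Googlebot hits. The function names are illustrative, and the webhook URL is something you would supply from your own Slack workspace.

```python
import json
import urllib.request


def build_alerts(zombie_count, sitemap_count, error_hits, total_hits,
                 zombie_limit=30.0, error_limit=5.0):
    """Return a list of alert messages for any threshold that is exceeded."""
    # Guard against empty inputs to avoid dividing by zero
    zombie_rate = 100 * zombie_count / sitemap_count if sitemap_count else 0.0
    error_rate = 100 * error_hits / total_hits if total_hits else 0.0

    alerts = []
    if zombie_rate > zombie_limit:
        alerts.append(f"Zombie rate is {zombie_rate:.1f}% (limit {zombie_limit:.0f}%)")
    if error_rate > error_limit:
        alerts.append(f"Error rate is {error_rate:.1f}% (limit {error_limit:.0f}%)")
    return alerts


def post_to_slack(webhook_url, text):
    """Send one message to a Slack incoming webhook and return the HTTP status."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Made-up numbers: 400 zombies out of 1,000 sitemap pages, 10 errors in 1,000 hits
for message in build_alerts(400, 1000, 10, 1000):
    print(message)  # in the pipeline, call post_to_slack(webhook_url, message)
```

Keeping the threshold logic separate from the Slack call makes it easy to test the rules without sending real messages.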

This is how you move from “Reactive SEO” (fixing things after traffic drops) to “Proactive SEO” (fixing things before Google cares).

Part 7: The "kōdōkalabs" Advantage

Why do we share this code? Because having the tool isn’t the same as having the strategy.
Knowing you have 4,000 zombie pages is step one.
Knowing which of those pages should be pruned, which should be merged, and which need a schema injection is where the Human Strategist comes in.

Do you trust your log files to a spreadsheet?

Or do you want to engineer your crawl budget?
