kōdōkalabs

The "Black Box" of SEO is Sitting on Your Server.

Most SEOs are flying blind. They rely on third-party crawlers (like Screaming Frog or Ahrefs) to simulate what Google sees. They look at Google Search Console (GSC) sample data to guess what is happening.

But simulation is not reality. And sampling is not accuracy.

The only single source of truth in SEO is your Server Log File. This file records every single request made to your server—including every hit from Googlebot. It tells you exactly:

  • Which pages Googlebot visits most (and least).
  • Where your “Crawl Budget” is being wasted on useless parameters.
  • Which high-value pages are being completely ignored.

Traditionally, Log File Analysis was expensive (Enterprise SaaS tools cost $1,000+/mo) or painful (Excel simply cannot handle logs beyond its ~1-million-row limit).

At kōdōkalabs, we solve this with Python.

This guide is a technical tutorial on how to build your own Log Analysis Engine using the Pandas library. We will move you from “Guessing” to “Knowing” in less than 50 lines of code.

Part 1: Why You Cannot Ignore Log Files

If you manage a site with over 5,000 pages (e.g., E-commerce, Publisher, B2B SaaS), crawl budget directly shapes what gets indexed and how fast. Google does not have infinite resources. If it spends its budget crawling your filters, 404s, and redirects, it won’t have budget left to discover your new “Money Pages.”

The "Zombie Page" Phenomenon

Our internal data suggests that on large sites, 40% of pages receive zero crawls per month.
These are “Zombie Pages.” They exist on your site, but to Google, they are dead.

  • SaaS Tools won’t show you this (they only show what they found).
  • GSC won’t show you this (it aggregates data).
  • Python reveals it instantly.

Part 2: The Stack (Free & Powerful)

We are going to replicate the functionality of a $10k enterprise tool for free.

  • Language: Python 3.10+
  • Library: Pandas (For high-speed data manipulation).
  • Library: Advertools (An amazing SEO-specific library by Elias Dabbas).
  • Input: Your raw access log files (usually .log or .gz from Nginx/Apache).

Part 3: The Tutorial (From Raw Data to Insight)

This is where most marketers fail: they spend weeks on the reporting template and minutes on the data. You should spend 80% of your time on the data.

Step 1: Accessing the Data

Ask your DevOps team or hosting provider for the “Raw Access Logs” for the last 30 days. You want the unprocessed Nginx or Apache logs.
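
In practice, those logs usually arrive as a pile of rotated files (access.log, access.log.1, access.log.2.gz, and so on). Here is a minimal sketch for stitching them into a single file before parsing; the logs/ directory and file pattern are assumptions, so adjust them to whatever your host hands you.

import glob
import gzip
import shutil

# Combine rotated log files (plain and gzipped) into one access.log
# NOTE: 'logs/access.log*' is a placeholder pattern for your own export
with open('access.log', 'wb') as combined:
    for path in sorted(glob.glob('logs/access.log*')):
        opener = gzip.open if path.endswith('.gz') else open
        with opener(path, 'rb') as handle:
            shutil.copyfileobj(handle, combined)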

Step 2: The Parsing Script

Raw logs are messy strings. We need to convert them into a structured DataFrame (a table) so we can ask questions.

We will use advertools because it handles the regex parsing automatically.

import advertools as adv
import pandas as pd

# 1. Parse the log file (handles standard combined log format)
# Converting the raw .log file into a structured Parquet file for speed
adv.logs_to_df(
    log_file='access.log',
    output_file='logs_output.parquet',
    errors_file='errors.txt',
    log_format='combined'
)

# 2. Load the data into Pandas
df = pd.read_parquet('logs_output.parquet')

# 3. Convert timestamp to datetime object (crucial for time-series analysis)
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%b/%Y:%H:%M:%S %z')

print(f"Total Requests Loaded: {len(df)}")

Step 3: Filtering for Googlebot

The internet is full of noise (scrapers, user traffic, malicious bots). We only care about the VIP: Googlebot.

Warning: Do not rely on the User Agent string alone (it can be spoofed). In a production environment, verify that the hits really come from Google, either against Google’s published crawler IP ranges or via a reverse DNS lookup. For this tutorial, we will filter by User Agent for simplicity.

# Create a filter for Googlebot User Agents
googlebot_filter = df['user_agent'].str.contains('Googlebot', case=False, na=False)

# Create a clean DataFrame of only Googlebot hits
google_df = df[googlebot_filter].copy()

print(f"Total Googlebot Hits: {len(google_df)}")

Part 4: The Analysis (Asking the Right Questions)

Now that we have clean data, we can answer the strategic questions that drive SEO revenue.

Analysis A: Status Code Distribution (The Health Check)

Are bots hitting 200s (Success), or are they hitting 301s (Redirects) and 404s (Errors)?

# Count status codes
status_counts = google_df['status'].value_counts()
print(status_counts)

# Calculate Percentage
status_percent = google_df['status'].value_counts(normalize=True) * 100
print(status_percent)

Strategic Insight:

  • If 3xx > 10%: You have redirect chains wasting budget.
  • If 4xx > 5%: You are sending Google into dead ends. Prioritize fixing these links immediately.
  • If 5xx > 1%: Your server is unstable when Google visits. This is a critical ranking emergency.
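
To make the 4xx point immediately actionable, list the exact URLs Googlebot keeps hitting. A short follow-on using the same google_df (status codes are cast to strings here in case the parser stored them as text):

# The most-crawled URLs returning client errors (4xx)
not_found = google_df[google_df['status'].astype(str).str.startswith('4')]
top_4xx = not_found['request'].value_counts().head(20)
print(top_4xx)

# Export for the dev team to fix, redirect, or remove from internal links
top_4xx.rename('googlebot_hits').to_csv('googlebot_4xx_urls.csv')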

Analysis B: Top Crawled Sections (The Priority Check)

Google tells you what it thinks is important by where it spends its time. Let’s see which folders get the most love.

# Extract the first folder from the URL path
google_df['folder'] = google_df['request'].str.split('/').str[1]

# Count hits per folder
folder_counts = google_df.groupby('folder').size().sort_values(ascending=False)
print(folder_counts.head(10))

Strategic Insight:

  • Is your /blog/ getting 80% of the crawl but driving 5% of revenue?
  • Is your /products/ folder being ignored?
  • Action: If low-value folders are dominating, block them in robots.txt or improve internal linking to high-value folders.
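
To answer the 80%-of-the-crawl question directly, turn the raw counts into a crawl-share percentage. The revenue comparison is left as a commented-out placeholder because the file name and columns depend on your analytics export:

# Googlebot's attention share per top-level folder
crawl_share = (folder_counts / folder_counts.sum() * 100).round(2)
print(crawl_share.head(10))

# Optional: line it up against revenue share from your analytics export
# (hypothetical CSV with 'folder' and 'revenue' columns)
# revenue = pd.read_csv('revenue_by_folder.csv').set_index('folder')['revenue']
# print(pd.concat([crawl_share, (revenue / revenue.sum() * 100).round(2)], axis=1))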

Analysis C: Identifying "Zombie Pages"

This is the money script. We need to compare “All Known Pages” (from your Sitemap) vs. “Crawled Pages” (from Logs). Prerequisite: Download your XML sitemap into a CSV/List.
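
If you would rather skip the manual export, advertools can fetch the sitemap for you; sitemap_to_df returns one row per URL with the address in a 'loc' column (the sitemap URL below is a placeholder). If you go this route, skip the read_csv line in the block that follows.

# Pull sitemap URLs directly instead of exporting a CSV by hand
import advertools as adv  # already imported in Step 2

sitemap_df = adv.sitemap_to_df('https://www.example.com/sitemap.xml')
sitemap_urls = sitemap_df['loc'].tolist()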

from urllib.parse import urlparse

# Load your sitemap URLs
sitemap_urls = pd.read_csv('sitemap_urls.csv')['url'].tolist()

# Sitemaps list absolute URLs, but the logs store paths (/page), so normalize to paths
sitemap_paths = [urlparse(url).path for url in sitemap_urls]

# Get unique URL paths crawled by Google
crawled_urls = google_df['request'].unique().tolist()

# Find the difference (Pages in Sitemap but NOT in Logs)
zombie_pages = set(sitemap_paths) - set(crawled_urls)

print(f"Total Pages in Sitemap: {len(sitemap_paths)}")
print(f"Total Pages Crawled: {len(crawled_urls)}")
print(f"Zombie Pages (Zero Crawls): {len(zombie_pages)}")

# Export Zombies to CSV for the Content Team
pd.DataFrame(sorted(zombie_pages), columns=['url']).to_csv('zombie_pages.csv', index=False)

Strategic Insight:

  • A Zombie Page has no chance of ranking or being updated in the index.
  • Action: Check the internal linking. Does this page have zero inbound links? Add it to your “Interlinking Agent” workflow.

Part 5: Advanced "Crawl Velocity" Analysis

For news sites and high-frequency publishers, knowing how quickly and how often Google crawls new content is key. By resampling the log data over time, we can chart crawl frequency and spot spikes or drop-offs.

# Resample data by hour to see crawl frequency trends
hourly_crawl = google_df.set_index('datetime').resample('H').size()

# Plot it (requires matplotlib)
import matplotlib.pyplot as plt
hourly_crawl.plot(title="Googlebot Hits Per Hour", figsize=(12, 6))
plt.show()

If you see massive spikes, correlate them with your deployments. Did a new code push trigger a re-crawl? Or did a server error cause a drop-off?
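
If you keep a record of deployment times, you can overlay them on the same chart to make that correlation visual. The timestamps below are placeholders; swap in your own deploy log.

# Mark deployments on the crawl-frequency chart (timestamps are examples)
deploys = pd.to_datetime(['2024-05-06 09:00', '2024-05-08 14:30'], utc=True)

ax = hourly_crawl.plot(title="Googlebot Hits Per Hour vs. Deployments", figsize=(12, 6))
for ts in deploys:
    ax.axvline(ts, color='red', linestyle='--', alpha=0.7)
plt.show()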

Part 6: Automating the Pipeline

Doing this once is an audit. Doing it on a schedule is DevOps.
At kōdōkalabs, we wrap this script in a Docker container and run it every Monday morning.

  • Cron Job: Pulls logs from AWS S3.
  • Script: Runs the analysis.
  • Alerting: If Zombie_Rate > 30% or Error_Rate > 5%, it sends a Slack alert to the Technical SEO channel.

This is how you move from “Reactive SEO” (fixing things after traffic drops) to “Proactive SEO” (fixing things before Google cares).
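
As a rough illustration of the alerting step, here is a minimal sketch that reuses the metrics computed earlier and posts to a standard Slack incoming webhook; the webhook URL and thresholds are placeholders, not our production pipeline.

import requests

SLACK_WEBHOOK = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder

# Metrics from the earlier steps
zombie_rate = len(zombie_pages) / max(len(sitemap_paths), 1) * 100
error_rate = google_df['status'].astype(str).str[0].isin(['4', '5']).mean() * 100

alerts = []
if zombie_rate > 30:
    alerts.append(f"Zombie rate is {zombie_rate:.1f}% (threshold: 30%)")
if error_rate > 5:
    alerts.append(f"Googlebot error rate is {error_rate:.1f}% (threshold: 5%)")

# Only ping the Technical SEO channel when a threshold is breached
if alerts:
    requests.post(SLACK_WEBHOOK, json={'text': 'Weekly log audit:\n' + '\n'.join(alerts)})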

Part 7: The "kōdōkalabs" Advantage

Why do we share this code? Because having the tool isn’t the same as having the strategy.
Knowing you have 4,000 zombie pages is step one.
Knowing which of those pages should be pruned, which should be merged, and which need a schema injection is where the Human Strategist comes in.

Do you trust your log files to a spreadsheet? Or do you want to engineer your crawl budget?