Most SEOs are flying blind. They rely on third-party crawlers (like Screaming Frog or Ahrefs) to simulate what Google sees. They look at Google Search Console (GSC) sample data to guess what is happening.
But simulation is not reality. And sampling is not accuracy.
The only single source of truth in SEO is your Server Log File. This file records every single request made to your server—including every hit from Googlebot. It tells you exactly:
Traditionally, Log File Analysis was expensive (Enterprise SaaS tools cost $1,000+/mo) or difficult (Excel crashes after 1 million rows).
At kōdōkalabs, we solve this with Python.
This guide is a technical tutorial on how to build your own Log Analysis Engine using the Pandas library. We will move you from “Guessing” to “Knowing” in less than 50 lines of code.
If you manage a site with over 5,000 pages (e.g., E-commerce, Publisher, B2B SaaS), Crawl Budget is a ranking factor. Google does not have infinite resources. If it spends its budget crawling your filters, 404s, and redirects, it won’t have budget left to discover your new “Money Pages.”
Our internal data suggests that on large sites, 40% of pages receive zero crawls per month.
These are “Zombie Pages.” They exist on your site, but to Google, they are dead.
We are going to replicate the functionality of a $10k enterprise tool for free.
This is where 99% of marketers fail. They spend weeks on the template and minutes on the data.
You must spend 80% of your time on the data.
Ask your DevOps team or hosting provider for the “Raw Access Logs” for the last 30 days. You want the unprocessed Nginx or Apache logs.
Raw logs are messy strings. We need to convert them into a structured DataFrame (a table) so we can ask questions.
We will use advertools because it handles the regex parsing automatically.
import advertools as adv
import pandas as pd
# 1. Parse the log file (handles standard combined log format)
# converting the raw .log file into a structured Parquet file for speed
adv.logs_to_df(
log_file='access.log',
output_file='logs_output.parquet',
errors_file='errors.txt',
log_format='combined'
)
# 2. Load the data into Pandas
df = pd.read_parquet('logs_output.parquet')
# 3. Convert timestamp to datetime object (crucial for time-series analysis)
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%b/%Y:%H:%M:%S %z')
print(f"Total Requests Loaded: {len(df)}")
# Create a filter for Googlebot User Agents
googlebot_filter = df['user_agent'].str.contains('Googlebot', case=False, na=False)
# Create a clean DataFrame of only Googlebot hits
google_df = df[googlebot_filter].copy()
print(f"Total Googlebot Hits: {len(google_df)}")
# Count status codes
status_counts = google_df['status'].value_counts()
print(status_counts)
# Calculate Percentage
status_percent = google_df['status'].value_counts(normalize=True) * 100
print(status_percent) Strategic Insight:
Google tells you what it thinks is important by where it spends its time. Let’s see which folders get the most love.
# Extract the first folder from the URL path
google_df['folder'] = google_df['request'].str.split('/').str[1]
# Count hits per folder
folder_counts = google_df.groupby('folder').size().sort_values(ascending=False)
print(folder_counts.head(10))
Strategic Insight:
# Load your sitemap URLs
sitemap_urls = pd.read_csv('sitemap_urls.csv')['url'].tolist()
# Get unique URLs crawled by Google
crawled_urls = google_df['request'].unique().tolist()
# Find the difference (Pages in Sitemap but NOT in Logs)
zombie_pages = set(sitemap_urls) - set(crawled_urls)
print(f"Total Pages in Sitemap: {len(sitemap_urls)}")
print(f"Total Pages Crawled: {len(crawled_urls)}")
print(f"Zombie Pages (Zero Crawls): {len(zombie_pages)}")
# Export Zombies to CSV for the Content Team
pd.DataFrame(list(zombie_pages), columns=['url']).to_csv('zombie_pages.csv', index=False)
Strategic Insight:
# Resample data by hour to see crawl frequency trends
hourly_crawl = google_df.set_index('datetime').resample('H').size()
# Plot it (requires matplotlib)
import matplotlib.pyplot as plt
hourly_crawl.plot(title="Googlebot Hits Per Hour", figsize=(12, 6))
plt.show()
Doing this once is an audit. Doing it daily is DevOps.
At kōdōkalabs, we wrap this script in a Docker container and run it every Monday morning.
This is how you move from “Reactive SEO” (fixing things after traffic drops) to “Proactive SEO” (fixing things before Google cares).
Why do we share this code? Because having the tool isn’t the same as having the strategy.
Knowing you have 4,000 zombie pages is step one.
Knowing which of those pages should be pruned, which should be merged, and which need a schema injection is where the Human Strategist comes in.