Your Website is Not a List. It is a Graph.

Most SEOs treat internal linking as an afterthought. They write a new article, remember 2 or 3 related older posts, and manually add links. This approach is random, biased, and inefficient.

It relies on human memory (“I think we wrote about this last year?”) rather than mathematical precision.

To a search engine crawler (Googlebot), your website is a mathematical structure known as a Directed Graph.

  • Nodes = Pages.
  • Edges = Links.

If you want to maximize crawl efficiency and “Link Juice” flow (PageRank), you cannot rely on human intuition. You must apply Graph Theory.
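
In code, that mental model is only a few lines. A toy sketch with hypothetical URLs:

import networkx as nx

# A tiny site as a directed graph: pages are nodes, links are edges
site = nx.DiGraph()
site.add_edge("/", "/blog/seo-guide/")          # homepage links to a post
site.add_edge("/blog/seo-guide/", "/pricing/")  # the post links to a money page

print(site.number_of_nodes(), "pages,", site.number_of_edges(), "links")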

At kōdōkalabs, we use Python to visualize our clients’ site architecture, calculate the “Centrality” of every page, and programmatically identify where the structure is broken.

This guide will teach you how to audit your site using the NetworkX library in Python, moving your internal linking strategy from “Art” to “Science.”

Part 1: The Metrics That Matter

Before we build the full analyzer, we need to understand the metrics that actually matter to a bot.

1. PageRank (The Classic)

PageRank is the probability that a random surfer clicking links will arrive at a specific page.

  • The Goal: Ensure your “Money Pages” (High Conversion) have high PageRank.
  • The Failure: Often, low-value blog posts inadvertently hoard PageRank because of poor navigation structure (e.g., “Recent Posts” widgets). The toy example below shows how that happens.
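
Here is a toy illustration of that failure mode, using made-up URLs: a three-page site where a sitewide “Recent Posts” widget points every page at an old post.

import networkx as nx

toy = nx.DiGraph()
toy.add_edge("/", "/pricing/")
toy.add_edge("/", "/blog/old-post/")
# The sitewide "Recent Posts" widget: the pricing page also links to the old post
toy.add_edge("/pricing/", "/blog/old-post/")

print(nx.pagerank(toy))
# The old post out-ranks /pricing/ because every other page feeds it authority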

2. Betweenness Centrality (The Bridge)

This measures how often a node (page) appears on the shortest path between two other nodes.

  • High Centrality Pages: These are your “Bridges.” If you remove a High Centrality page, your site might split into isolated islands.
  • The Goal: Identify these bridges and fortify them. The sketch below pulls the top bridges straight out of the graph.
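
Once the graph G from Part 3 exists, surfacing the bridges is a single NetworkX call. A minimal sketch:

import networkx as nx

# Betweenness centrality on the site graph built in Part 3
# (for very large sites, pass k=500 to approximate via sampling)
betweenness = nx.betweenness_centrality(G)

top_bridges = sorted(betweenness.items(), key=lambda kv: kv[1], reverse=True)[:10]
for url, score in top_bridges:
    print(f"{score:.4f}  {url}")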

3. Modularity (The Clusters)

This measures how well your site is divided into “Topical Clusters.”

  • Good Modularity: Distinct clusters (e.g., /blog/seo/ pages link heavily to other /blog/seo/ pages).
  • Bad Modularity: Spaghetti links (everything links to everything). LLMs struggle to understand “Entity Salience” on sites with low modularity. A quick community-detection pass (sketched below) tells you which camp your site is in.
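
Once the graph from Part 3 is built, you can measure this yourself. A minimal sketch using NetworkX’s community-detection helpers (clustering runs on the undirected view of the link graph):

import networkx as nx
from networkx.algorithms import community as nx_comm

# G is the DiGraph built in Part 3
undirected = G.to_undirected()
clusters = nx_comm.greedy_modularity_communities(undirected)
score = nx_comm.modularity(undirected, clusters)

print(f"{len(clusters)} topical clusters, modularity = {score:.2f}")
# Rough rule of thumb: above ~0.3 indicates real clusters; near 0 means spaghetti links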

Part 2: The Stack

We will use Python end to end: advertools to crawl the site, pandas to wrangle the link data, and NetworkX to build the graph and calculate the metrics above (install them with pip install advertools pandas networkx). Part 5 also uses scikit-learn for the similarity math.


Part 3: Building the Graph Analyzer

Step 1: The Crawl

First, we need a map of every link on your site. We use advertools to run a quick crawl.

import advertools as adv
import pandas as pd
import networkx as nx

# 1. Crawl the website (capped at 500 pages so the demo stays small)
# In production, use your sitemap or a full crawl.
adv.crawl(
    'https://www.yourdomain.com',
    'output_file.jl',
    follow_links=True,
    custom_settings={'CLOSESPIDER_PAGECOUNT': 500}  # Scrapy setting: stop after ~500 pages
)

# 2. Load the crawl data
crawl_df = pd.read_json('output_file.jl', lines=True)

# 3. Create an "Edge List" (Source -> Target)
# advertools joins all links found on a page with '@@', so split before exploding
# into one row per (page, link) pair
links_df = crawl_df[['url', 'links_url']].copy()
links_df['links_url'] = links_df['links_url'].str.split('@@')
links_df = links_df.explode('links_url')

# Filter out external links and self-links
domain = "yourdomain.com"
internal_links = links_df[
    (links_df['links_url'].str.contains(domain, na=False)) &
    (links_df['url'] != links_df['links_url'])
].dropna()

print(f"Pages with outgoing links: {internal_links['url'].nunique()}")
print(f"Graph Edges (Links): {len(internal_links)}")

Step 2: Constructing the Graph

Now we feed this data into NetworkX.

# Initialize a Directed Graph
G = nx.from_pandas_edgelist(
    internal_links,
    source='url',
    target='links_url',
    create_using=nx.DiGraph()
)

print("Graph built successfully.")

Let’s see where the authority actually lives.

# Calculate PageRank
pagerank = nx.pagerank(G, alpha=0.85)  # alpha=0.85 is the standard damping factor from the original PageRank paper

# Convert to DataFrame for easy viewing
pr_df = pd.DataFrame.from_dict(pagerank, orient='index', columns=['PageRank'])
pr_df = pr_df.sort_values(by='PageRank', ascending=False)

print("--- Top 10 Authority Pages ---")

print(pr_df.head(10))

Strategic Check: Look at the Top 10. Are they your “Money Pages” (Product/Pricing)?

  • If Yes: Good architecture.
  • If No: If your #1 page is “Terms of Service” or a random blog post from 2018, your architecture is bleeding revenue. You need to adjust your navigation or inject links from those high-authority pages to your product pages. A quick way to audit this is sketched below.
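
For example, assuming you list your own “Money Page” URLs in money_pages, this quick check flags high-authority pages that do not yet link to any of them:

# Which high-authority pages do NOT yet link to a money page?
money_pages = [
    "https://www.yourdomain.com/pricing/",   # replace with your own URLs
    "https://www.yourdomain.com/product/",
]

for page in pr_df.head(20).index:
    missing = [m for m in money_pages if m in G and not G.has_edge(page, m)]
    if missing:
        print(f"{page} is missing links to: {missing}")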

Part 4: Automating "Orphan Page" Detection

Tools like Ahrefs report orphan pages, but they miss “Functionally Orphaned” pages—pages that technically have one link, but are buried so deep (Click Depth > 5) that Google rarely crawls them.

# Calculate Shortest Path from Homepage
homepage = "https://www.yourdomain.com/"

try:
    shortest_paths = nx.shortest_path_length(G, source=homepage)
    depth_df = pd.DataFrame.from_dict(
        shortest_paths, orient='index', columns=['Click_Depth']
    )

    # Identify Deep Pages (this demo flags anything deeper than 3 clicks; adjust to taste)
    deep_pages = depth_df[depth_df['Click_Depth'] > 3]
    print(f"Pages buried deeper than 3 clicks: {len(deep_pages)}")

    # Identify True Orphans (nodes in the graph but not reachable from the Homepage)
    all_nodes = set(G.nodes())
    reachable_nodes = set(shortest_paths.keys())
    orphans = all_nodes - reachable_nodes
    print(f"True Orphan Pages: {len(orphans)}")

except Exception as e:
    # Most common cause: the homepage string does not exactly match a node URL
    print(f"Error calculating paths: {e}")

The Fix:
Feed the list of deep_pages and orphans into your “Interlinking Agent” (from our previous Agentic Workflow tutorial). Ask the agent to find 3 relevant parent pages for each orphan and insert a link.
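
A minimal hand-off sketch, assuming the agent (or an editor) consumes CSV files:

# Export the problem pages so a downstream agent or editor can fix them
pd.DataFrame({'url': sorted(orphans)}).to_csv('orphan_pages.csv', index=False)
deep_pages.rename_axis('url').reset_index().to_csv('deep_pages.csv', index=False)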

Part 5: Advanced "Semantic Clustering" Logic

Graph theory tells you where links are. Embeddings tell you where links should be.

To automate suggestions, we combine Graph Theory with Vector Embeddings.

The Logic:

  1. Calculate Embeddings: Convert every page’s content into a vector (using OpenAI’s text-embedding-3-small; sketched right after this list).
  2. Find Mismatches: Look for pairs of pages that have High Semantic Similarity (Vectors are close) but Zero Graph Connection (No link exists).
  3. The Opportunity: These are your “Missing Links.”
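
A minimal sketch of step 1, assuming the official openai Python client and taking each page’s text from the crawl’s body_text column (for large sites, batch the requests):

import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# One row per unique page, with the text advertools extracted during the crawl
pages = crawl_df[['url', 'body_text']].dropna().drop_duplicates(subset='url')
urls = pages['url'].tolist()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[text[:8000] for text in pages['body_text']]  # truncate very long pages
)
vectors = np.array([item.embedding for item in response.data])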

(Conceptual snippet: assumes the vectors and urls from the step above, plus the graph G from Part 3)

# Semantic bridge: flag pairs of pages that are similar but not yet linked
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(vectors)

suggestions = []
for i, source in enumerate(urls):
    for j, target in enumerate(urls):
        # If Similarity > 0.85 AND no link exists, record the opportunity
        if i != j and similarity_matrix[i, j] > 0.85 and not G.has_edge(source, target):
            suggestions.append({'source': source, 'target': target,
                                'similarity': round(float(similarity_matrix[i, j]), 3)})

pd.DataFrame(suggestions).to_csv('suggested_links.csv', index=False)

This is the exact logic kōdōkalabs uses to generate our “Internal Link Opportunities” report. It removes human bias. It doesn’t care if you wrote the article 3 years ago; if it is semantically relevant to your new post, the math finds it.

Part 6: Visualizing the Mess

Sometimes, you need to show the client why their SEO is failing. A visual graph is undeniable. Export your graph to Gephi format.

nx.write_gexf(G, "site_structure.gexf")

Download Gephi (free open-source software). Open this .gexf file.

  • Run “Force Atlas 2” layout.
  • Color nodes by “Modularity Class” (Topic).
  • Size nodes by “PageRank” (Gephi’s built-in statistic, or attach your own computed values; see the sketch below).
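
Gephi can compute these statistics itself, but if you want the visual to match the exact numbers from this audit, attach your Python-computed metrics as node attributes before exporting. A sketch re-using pagerank and depth_df from earlier:

# Attach audit metrics so Gephi can size and color nodes by the exact values
nx.set_node_attributes(G, pagerank, name='pagerank')
nx.set_node_attributes(G, depth_df['Click_Depth'].to_dict(), name='click_depth')
nx.write_gexf(G, "site_structure.gexf")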

What you will see:

  • Good SEO: Distinct, colored clusters (The “Silo” structure) with thick bridges between them.
  • Bad SEO: A giant hairball where everything links to everything, or a “Galaxy” with a dense center and hundreds of floating, disconnected stars (Orphans).

Conclusion: Let the Math Decide

Internal linking is the most undervalued lever in SEO. It costs $0 (unlike backlinks) and you have 100% control over it. By moving from manual placement to Graph-Theory automation, you ensure that every drop of authority your site earns is distributed efficiently to the pages that drive revenue.

Stop guessing.
Start calculating.
