Most SEOs treat internal linking as an afterthought. They write a new article, remember 2 or 3 related older posts, and manually add links. This approach is random, biased, and inefficient.
It relies on human memory (“I think we wrote about this last year?”) rather than mathematical precision.
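To a search engine crawler such as Googlebot, your website is a mathematical structure known as a Directed Graph: every page is a node, and every link is an edge pointing from the source page to the target page.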
If you want to maximize crawl efficiency and “Link Juice” flow (PageRank), you cannot rely on human intuition. You must apply Graph Theory.
At kōdōkalabs, we use Python to visualize our clients’ site architecture, calculate the “Centrality” of every page, and programmatically identify where the structure is broken.
This guide will teach you how to audit your site using the NetworkX library in Python, moving your internal linking strategy from “Art” to “Science.”
Before we write code, we must understand the metrics that actually matter to a bot.
PageRank is the probability that a random surfer clicking links will arrive at a specific page.
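To make the random-surfer definition concrete, here is a minimal, dependency-free sketch of PageRank by power iteration. The four-page `site` dict is invented for illustration, and the function is a teaching version rather than a description of how NetworkX computes it; on a real crawl, `nx.pagerank` does this work.

```python
def pagerank(links, alpha=0.85, iterations=100):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # With probability (1 - alpha) the surfer jumps to a random page.
        new_rank = {p: (1.0 - alpha) / n + alpha * dangling / n for p in pages}
        # With probability alpha the surfer follows one of the page's links.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += alpha * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page site: page -> pages it links to
site = {
    "/": ["/pricing", "/blog"],
    "/blog": ["/blog/post-1", "/"],
    "/blog/post-1": ["/pricing"],
    "/pricing": ["/"],
}

scores = pagerank(site)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page:15s} {score:.3f}")
```

On this toy site the homepage comes out on top, followed by `/pricing`, because both receive links from pages that themselves hold rank.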
Betweenness Centrality measures how often a node (page) appears on the shortest path between two other nodes.
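The shortest-path counting can be sketched in plain Python with Brandes' algorithm. This is an illustration under assumptions: the `site` dict is made up, each link list is assumed to contain no duplicates, and the scores are raw path counts rather than normalized values. On a real graph, `nx.betweenness_centrality(G)` is the call to reach for.

```python
from collections import deque


def betweenness(links):
    """Count, for each page, the shortest paths between other pages that pass through it."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    score = {n: 0.0 for n in nodes}
    for source in nodes:
        # Breadth-first search from `source`, counting shortest paths (sigma).
        order = []
        preds = {n: [] for n in nodes}
        sigma = {n: 0 for n in nodes}
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in links.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest pages, crediting the pages in between.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


# Hypothetical site where every route between pages runs through /hub
site = {
    "/": ["/hub"],
    "/hub": ["/guide-a", "/guide-b"],
    "/guide-a": ["/hub"],
    "/guide-b": ["/hub"],
}

bridge_scores = betweenness(site)
for page, value in sorted(bridge_scores.items()):
    print(page, value)
```

Here `/hub` is the only page with a non-zero score: remove it and the guides can no longer be reached from the homepage or from each other.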
Modularity measures how well your site is divided into “Topical Clusters.”
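As a rough sketch of what a modularity score captures, the function below compares the share of links that stay inside each cluster with the share you would expect from link counts alone. Several assumptions apply: links are treated as undirected for simplicity, every page belongs to exactly one cluster, the edge list has no duplicates, and the `/seo` and `/ppc` pages are invented. In NetworkX, `nx.community.modularity(G, communities)` scores a partition in a comparable way.

```python
def modularity(edges, clusters):
    """Q = sum over clusters of (links inside / all links) - (cluster degree / 2m)^2."""
    m = len(edges)
    cluster_of = {page: name for name, pages in clusters.items() for page in pages}
    inside = {name: 0 for name in clusters}
    degree = {name: 0 for name in clusters}
    for a, b in edges:
        degree[cluster_of[a]] += 1
        degree[cluster_of[b]] += 1
        if cluster_of[a] == cluster_of[b]:
            inside[cluster_of[a]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in clusters)


# Hypothetical site: two tightly linked sections joined by a single bridge link
edges = [
    ("/seo/a", "/seo/b"), ("/seo/b", "/seo/c"), ("/seo/a", "/seo/c"),
    ("/ppc/a", "/ppc/b"), ("/ppc/b", "/ppc/c"), ("/ppc/a", "/ppc/c"),
    ("/seo/c", "/ppc/a"),
]

topical = {"seo": ["/seo/a", "/seo/b", "/seo/c"], "ppc": ["/ppc/a", "/ppc/b", "/ppc/c"]}
mixed = {"x": ["/seo/a", "/seo/b", "/ppc/a"], "y": ["/seo/c", "/ppc/b", "/ppc/c"]}

q_topical = modularity(edges, topical)
q_mixed = modularity(edges, mixed)
print(f"Topical grouping: {q_topical:.3f}")
print(f"Mixed grouping:   {q_mixed:.3f}")
```

The topical grouping scores well above the mixed one, which is the signal you want to see when your internal links follow your content silos.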
We will use Python to scrape the site, build the graph, and calculate these metrics.
import advertools as adv
import pandas as pd
import networkx as nx
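# 1. Crawl the website (cap the page count to avoid a massive crawl for this demo)
# In production, use your sitemap or a full crawl.
adv.crawl(
    'https://www.yourdomain.com',
    'output_file.jl',
    follow_links=True,
    # Scrapy setting that stops the crawl after roughly this many pages;
    # 500 is an arbitrary demo cap, adjust or remove it for a full crawl
    custom_settings={'CLOSESPIDER_PAGECOUNT': 500},
)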
# 2. Load the crawl data
crawl_df = pd.read_json('output_file.jl', lines=True)
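# 3. Create an "Edge List" (Source -> Target)
# advertools stores all links found on a page in one string joined by '@@',
# so split that string into a list before exploding to one row per link
links_df = crawl_df[['url', 'links_url']].copy()
links_df['links_url'] = links_df['links_url'].str.split('@@')
links_df = links_df.explode('links_url')
# Filter out external links and self-links, then drop repeated links
domain = "yourdomain.com"
internal_links = links_df[
    (links_df['links_url'].str.contains(domain, na=False, regex=False)) &
    (links_df['url'] != links_df['links_url'])
].dropna().drop_duplicates()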
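# Count pages on both ends of a link, not only the pages that contain links
print(f"Graph Nodes (Pages): {len(set(internal_links['url']) | set(internal_links['links_url']))}")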
print(f"Graph Edges (Links): {len(internal_links)}")
# Initialize a Directed Graph
G = nx.from_pandas_edgelist(
    internal_links,
    source='url',
    target='links_url',
    create_using=nx.DiGraph()
)
print("Graph built successfully.")
Let’s see where the authority actually lives.
# Calculate PageRank
pagerank = nx.pagerank(G, alpha=0.85)  # alpha=0.85 is the damping factor from the original PageRank paper (also the NetworkX default)
# Convert to DataFrame for easy viewing
pr_df = pd.DataFrame.from_dict(pagerank, orient='index', columns=['PageRank'])
pr_df = pr_df.sort_values(by='PageRank', ascending=False)
print("--- Top 10 Authority Pages ---")
print(pr_df.head(10))
Strategic Check: Look at the Top 10. Are they your “Money Pages” (Product/Pricing)?
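The same check can be scripted instead of eyeballed. The sketch below assumes you keep a hand-maintained list of money-page URLs; the URLs and scores are invented stand-ins for the dict returned by `nx.pagerank(G)`, and `money_page_positions` is a hypothetical helper name.

```python
def money_page_positions(scores, money_pages):
    """Map each money page to its 1-based PageRank position (None if not in the graph)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    position = {url: i for i, url in enumerate(ranked, start=1)}
    return {url: position.get(url) for url in money_pages}


# Hypothetical scores standing in for the dict returned by nx.pagerank(G)
example_scores = {
    "https://www.yourdomain.com/": 0.31,
    "https://www.yourdomain.com/blog/old-viral-post": 0.24,
    "https://www.yourdomain.com/product": 0.22,
    "https://www.yourdomain.com/privacy": 0.18,
    "https://www.yourdomain.com/pricing": 0.05,
}
money_pages = [
    "https://www.yourdomain.com/pricing",
    "https://www.yourdomain.com/product",
    "https://www.yourdomain.com/demo",
]
TOP_N = 3  # use 10 on a real site; 3 keeps the toy example readable

positions = money_page_positions(example_scores, money_pages)
for url, pos in positions.items():
    status = "ok" if pos is not None and pos <= TOP_N else "needs more internal links"
    print(f"{url}: position {pos} -> {status}")
```

A position of `None` means the page never appeared in the crawl graph at all, which is worth investigating before any link building.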
Tools like Ahrefs report orphan pages, but they miss “Functionally Orphaned” pages—pages that technically have one link, but are buried so deep (Click Depth > 5) that Google rarely crawls them.
# Calculate Shortest Path from Homepage
# The URL must match the crawled 'url' value exactly (including any trailing slash)
homepage = "https://www.yourdomain.com/"
try:
    shortest_paths = nx.shortest_path_length(G, source=homepage)
    depth_df = pd.DataFrame.from_dict(shortest_paths, orient='index', columns=['Click_Depth'])
    # Identify Deep Pages (tune the threshold to your site)
    deep_pages = depth_df[depth_df['Click_Depth'] > 3]
    print(f"Pages buried deeper than 3 clicks: {len(deep_pages)}")
    # Identify True Orphans (Nodes in the graph but not reachable from Homepage)
    all_nodes = set(G.nodes())
    reachable_nodes = set(shortest_paths.keys())
    orphans = all_nodes - reachable_nodes
    print(f"True Orphan Pages: {len(orphans)}")
except nx.NodeNotFound as e:
    # Raised when the homepage URL is not a node in the graph
    print(f"Error calculating paths: {e}")
The Fix:
Feed the list of deep_pages and orphans into your “Interlinking Agent” (from our previous Agentic Workflow tutorial). Ask the agent to find 3 relevant parent pages for each orphan and insert a link.
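One caveat before handing the list over: a graph built only from a link-following crawl contains just the pages the crawler could reach, so the orphan set tends to come back empty unless you compare it against URLs from another source, such as the sitemap. The sketch below shows that comparison in plain Python. The `links` dict, the `sitemap_urls` list, and the `build_worklist` helper are all hypothetical; the output format is only one possible shape for the hand-off.

```python
from collections import deque


def click_depths(links, homepage):
    """Breadth-first search: minimum number of clicks from the homepage to each page."""
    depth = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def build_worklist(links, sitemap_urls, homepage, max_depth=3):
    """List sitemap URLs that are unreachable (orphan) or deeper than max_depth."""
    depth = click_depths(links, homepage)
    worklist = []
    for url in sorted(sitemap_urls):
        if url not in depth:
            worklist.append({"url": url, "issue": "orphan", "click_depth": None})
        elif depth[url] > max_depth:
            worklist.append({"url": url, "issue": "deep", "click_depth": depth[url]})
    return worklist


# Hypothetical crawl result (page -> pages it links to) and sitemap
links = {
    "/": ["/blog"],
    "/blog": ["/blog/page-2"],
    "/blog/page-2": ["/blog/page-3"],
    "/blog/page-3": ["/blog/old-guide"],
}
sitemap_urls = ["/", "/blog", "/blog/old-guide", "/landing/2021-promo"]

worklist = build_worklist(links, sitemap_urls, homepage="/")
for item in worklist:
    print(item)
```

Each row tells the agent which page needs new parent links and why, so it can treat a page that is four clicks deep differently from one that has no inbound links at all.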
Graph theory tells you where links are. Embeddings tell you where links should be.
To automate suggestions, we combine Graph Theory with Vector Embeddings.
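(Conceptual Snippet)
# Sketch of the Semantic Bridge
from sklearn.metrics.pairwise import cosine_similarity
# Assume 'vectors' is a matrix of page embeddings (one row per page)
# and 'urls' is the list of page URLs in the same row order (both are assumptions)
similarity_matrix = cosine_similarity(vectors)
suggestions = []
for i, source in enumerate(urls):
    for j, target in enumerate(urls):
        # If Similarity > 0.85 AND Link_Exists is False:
        if i != j and similarity_matrix[i, j] > 0.85 and not G.has_edge(source, target):
            suggestions.append((source, target, similarity_matrix[i, j]))
# Write the suggestions to "suggested_links.csv"
pd.DataFrame(suggestions, columns=['source', 'target', 'similarity']).to_csv("suggested_links.csv", index=False)
This is the exact logic kōdōkalabs uses to generate our “Internal Link Opportunities” report. It removes human bias. It doesn’t care if you wrote the article 3 years ago; if it is semantically relevant to your new post, the math finds it.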
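To see the rule fire on concrete numbers, here is a dependency-free toy version. The three-dimensional embeddings are made up (real ones come from an embedding model and have hundreds of dimensions), the URLs are hypothetical, and `suggest_links` is an illustrative helper, not the production report code.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def suggest_links(embeddings, existing_links, threshold=0.85):
    """Return (source, target, similarity) for similar pages that are not yet linked."""
    suggestions = []
    for a, b in combinations(sorted(embeddings), 2):
        score = cosine(embeddings[a], embeddings[b])
        if score <= threshold:
            continue
        # Check each direction separately: a link a -> b does not imply b -> a.
        for source, target in ((a, b), (b, a)):
            if (source, target) not in existing_links:
                suggestions.append((source, target, round(score, 3)))
    return suggestions


# Made-up embeddings: two posts on related topics, one unrelated post
embeddings = {
    "/blog/internal-linking": [0.9, 0.1, 0.0],
    "/blog/site-architecture": [0.8, 0.2, 0.1],
    "/blog/email-marketing": [0.0, 0.1, 0.9],
}
# Links that already exist on the site, as (source, target) pairs
existing_links = {("/blog/site-architecture", "/blog/internal-linking")}

suggested = suggest_links(embeddings, existing_links)
for source, target, score in suggested:
    print(f"Add link: {source} -> {target} (similarity {score})")
```

Only the missing direction is suggested: the architecture post already links to the internal-linking post, so the sketch proposes the reverse link and ignores the unrelated email post.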
Sometimes, you need to show the client why their SEO is failing. A visual graph is undeniable. Export your graph to Gephi format.
nx.write_gexf(G, "site_structure.gexf")
Download Gephi (free open-source software). Open this .gexf file.