Last week was incredibly eventful due to the leak of Google’s Content Warehouse API documentation. In this article, we share what we believe are the most significant findings.
About the Leak
The leak that every SEO professional has been dreaming about covers 2,596 internal modules from Google’s Search API. This documentation unveils details that were previously hidden from the public. It describes over 14,000 features potentially used in Google’s search ranking algorithm, offering a deep dive into the factors influencing search results.
The leak surfaced on GitHub in March 2024 and remained accessible until May 2024. The information, shared by Erfan Azimi of EA Eagle Digital with Rand Fishkin and Mike King, has sparked significant interest and analysis within the SEO community. Google has confirmed the authenticity of the leak but advises caution, as the information might be outdated or incomplete, leading to potential misinterpretations.
We’ve categorized some of our key takeaways below:
Ranking Signals
NavBoost and Click Data
- Google has been using a system called NavBoost, which relies on click data such as CTR, long clicks, short clicks, and other user behavior signals. This contradicts Google’s past public denials of using click-based signals for ranking.
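To make the click vocabulary concrete, here is a minimal sketch of how CTR and the split between long and short clicks could be computed from a log of impressions. It is an illustration only, not NavBoost itself: the `Impression` record, the `click_summary` function, and the 30-second cut-off for a "long click" are all assumptions, since the leak does not spell out how these values are derived.

```python
from dataclasses import dataclass

# Hypothetical threshold: the leaked docs do not say how long a "long click" is.
LONG_CLICK_SECONDS = 30.0


@dataclass
class Impression:
    """One appearance of a result in the SERP (hypothetical record layout)."""
    clicked: bool
    dwell_seconds: float = 0.0  # time spent before returning to the results


def click_summary(impressions):
    """Return CTR plus the share of clicks that were long or short."""
    total = len(impressions)
    clicks = [i for i in impressions if i.clicked]
    n = len(clicks)
    long_clicks = sum(1 for i in clicks if i.dwell_seconds >= LONG_CLICK_SECONDS)
    return {
        "ctr": n / total if total else 0.0,
        "long_click_share": long_clicks / n if n else 0.0,
        "short_click_share": (n - long_clicks) / n if n else 0.0,
    }


log = [
    Impression(clicked=True, dwell_seconds=95.0),  # long click
    Impression(clicked=True, dwell_seconds=4.0),   # short click (quick bounce)
    Impression(clicked=False),
    Impression(clicked=False),
]
summary = click_summary(log)
print(summary)
```

A page with a healthy CTR but mostly short clicks would look very different in this summary from one where visitors stay, which is the distinction the long click and short click terminology points at.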
Chrome Clickstream Data
- Google utilizes clickstream data from the Chrome browser to determine page and site popularity. This includes tracking Chrome views to create site links and rank pages, contradicting previous claims that Chrome data isn’t used for search ranking.
Domain Authority and Site Authority
- Despite Google’s repeated denials, the leak reveals that Google calculates a “siteAuthority” metric, essentially a form of domain authority used in their ranking system.
Page Title Matching
- Page titles are still evaluated against query relevance. A metric called “titlematchScore” indicates that having a page title closely match the query remains an important ranking factor.
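The leak names the metric but not its formula, so the following is only a rough sketch of what "title matches query" could mean in practice: the fraction of query terms that also appear in the title. The `title_match` function and its tokenization are our own assumptions, useful mainly for sanity-checking your own titles against target queries.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_match(title, query):
    """Fraction of query terms found in the title, from 0.0 to 1.0."""
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(title)) / len(query_terms)


score_close = title_match("Best Running Shoes for Flat Feet", "running shoes flat feet")
score_far = title_match("Our Spring Catalogue", "running shoes flat feet")
print(score_close, score_far)
```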
Content Quality and Relevance
Quality Rater Feedback
- Google uses data from quality raters in its ranking system, incorporating human evaluations to train their algorithms and influence rankings. This suggests that quality ratings have a more direct impact than previously acknowledged.
Embedding and Topic Relevance
- Google uses embeddings to measure the relevance of a page’s content to its site’s overall topic. Pages deviating significantly from a site’s main topic can be demoted based on this analysis.
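One common way to compare a page with its site’s overall topic is cosine similarity between the page’s embedding and the average embedding of all pages on the site. The sketch below uses made-up three-dimensional vectors and hypothetical URLs; real embeddings have hundreds of dimensions, and the leak does not confirm that Google uses this exact calculation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Element-wise mean of the vectors, used here as the 'site topic'."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy embeddings (assumed values, for illustration only).
pages = {
    "/running-shoes": [0.9, 0.1, 0.0],
    "/trail-shoes": [0.8, 0.2, 0.0],
    "/marathon-training": [0.7, 0.3, 0.1],
    "/crypto-tips": [0.0, 0.1, 0.9],
}

site_vector = centroid(list(pages.values()))
similarity = {url: cosine(vec, site_vector) for url, vec in pages.items()}
off_topic = min(similarity, key=similarity.get)
print(off_topic)
```

In this toy data the crypto page sits far from the site average, which is the kind of outlier the leak suggests can be demoted.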
Content Freshness and Dates
- Google tracks various dates associated with content, including byline dates, syntactic dates (from URLs and titles), and semantic dates (from content). Fresh and consistently dated content is favored in rankings.
Handling of Short Content
- Google has a specific metric called “OriginalContentScore” for evaluating short content. This score measures the originality of shorter documents, indicating that not all short content is automatically deemed low quality.
Brand Name Searches
- Brand name searches are important; the more people search for your brand name, the more you’ll show up for keyword searches versus your competitors.
Human Written Content
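- The documentation includes a “golden” flag for gold-standard documents, which can be used to put extra weight on human-labeled documents over automatically labeled ones. Some read this as “Golden Pages” of 100% human-written content outranking AI content of the same quality, although the documentation does not state that directly. Either way, quality is the deciding factor.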
Spam and Security Measures
Sandbox for New Sites
- The leak points to a “sandbox” for new or potentially spammy sites, where new domains are temporarily restricted to prevent spam. This contradicts Google’s public stance that no sandbox exists. If your new website is not ranking at all, this may be one of the reasons.
Link Graph Analysis and Spam Detection
- Google has detailed systems for analyzing link quality, including identifying and demoting spammy links based on link velocity and relevance. They also store historical link data and evaluate link quality based on the freshness and tier of the source page.
Exact Match Domain (EMD) Demotions
- Google has specific measures in place to demote exact match domains (EMDs) that align too closely with unbranded search queries. This aligns with Google’s previous efforts to reduce the ranking advantages of such domains.
Other Demotions
- Content can be demoted for a variety of reasons, including:
  - A link doesn’t match the target site it points to (anchor mismatch).
  - SERP signals indicate user dissatisfaction.
  - Product review signals.
  - Location.
  - Exact match domains.
  - Porn.
Change History
- Google apparently keeps a copy of every version of every page it has ever indexed, meaning Google can “remember” every change ever made to a page. However, Google only uses the last 20 changes of a URL when analyzing links.
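The "keep everything, but only look at the last 20 changes" idea can be pictured as a full archive with a bounded window on top of it. This is a sketch of that data structure only; the `UrlHistory` class and the snapshot strings are hypothetical, and only the figure of 20 comes from the point above.

```python
from collections import deque

MAX_VERSIONS = 20  # figure quoted above; everything else here is hypothetical


class UrlHistory:
    """Full archive of a URL's versions plus a window of the latest ones."""

    def __init__(self):
        self.all_versions = []                     # every version ever seen
        self.recent = deque(maxlen=MAX_VERSIONS)   # window used for analysis

    def record(self, snapshot):
        self.all_versions.append(snapshot)
        self.recent.append(snapshot)  # oldest entry drops out once full


history = UrlHistory()
for n in range(1, 26):
    history.record(f"version-{n}")

print(len(history.all_versions), len(history.recent), history.recent[0])
```

After 25 recorded changes, the archive still holds all of them, while the window starts at the sixth version.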
Special Content Considerations
Use of Whitelists
- Google employs whitelists for specific types of content, such as travel, Covid-19 information, and election-related content. This lets Google prioritize certain authoritative sites, or demote others, during critical periods.
Homepage Trust and PageRank
- Google assesses the trustworthiness of a homepage and uses its PageRank as a proxy for new pages on the same site. The trust level of a homepage (e.g., fully trusted, partially trusted) influences the value assigned to links from that page.
Freshness Matters
- Google looks at dates in the byline (bylineDate), URL (syntacticDate), and on-page content (semanticDate). Content dated within the last 30 days is classed as fresh.
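A simple audit you can run on your own pages is to check that the three dates agree and that the newest one falls inside the 30-day window. The sketch below is an assumption-heavy illustration: the `/YYYY/MM/DD/` URL pattern, the `date_report` function, and the rule that "consistent" means "all known dates are identical" are ours, not taken from the leak.

```python
import re
from datetime import date

FRESH_DAYS = 30  # window quoted above


def date_from_url(url):
    """Return a date embedded as /YYYY/MM/DD in the URL path, or None."""
    match = re.search(r"/(\d{4})/(\d{2})/(\d{2})(?:/|$)", url)
    if not match:
        return None
    try:
        return date(*map(int, match.groups()))
    except ValueError:  # digits that do not form a real calendar date
        return None


def date_report(byline, url, content_date, today):
    """Check that the known dates agree and whether the newest is fresh."""
    dates = {
        "bylineDate": byline,
        "syntacticDate": date_from_url(url),
        "semanticDate": content_date,
    }
    known = [d for d in dates.values() if d is not None]
    newest = max(known) if known else None
    return {
        "dates": dates,
        "consistent": len(set(known)) <= 1,
        "fresh": newest is not None and (today - newest).days <= FRESH_DAYS,
    }


report = date_report(
    byline=date(2024, 6, 1),
    url="https://example.com/2024/06/01/leak-recap/",
    content_date=date(2024, 6, 1),
    today=date(2024, 6, 20),
)
print(report["consistent"], report["fresh"])
```

A page whose byline says this month while its URL still carries a date from two years ago would come back as inconsistent, which is worth fixing regardless of how Google weighs it.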
Page Truncation and Content Tokens
- There is a maximum number of tokens (words, tags, punctuation) that Google considers for a document. Documents are truncated at a certain length, which makes it important to place critical content early in the text so that it is indexed and considered.
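To see why placement matters, here is a small sketch of truncation at a token limit. The tokenizer (tags, words, and punctuation marks each count as one token) and the limit of 6 are assumptions chosen to keep the example short; the point above does not give the actual limit.

```python
import re


def tokenize(html):
    """Split markup into tags, words, and single punctuation marks."""
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", html)


def truncate(html, max_tokens):
    """Return (kept, dropped) token lists for a hypothetical token limit."""
    toks = tokenize(html)
    return toks[:max_tokens], toks[max_tokens:]


doc = "<p>Key point first.</p><p>Supporting detail comes later, maybe unread.</p>"
kept, dropped = truncate(doc, max_tokens=6)
print(kept)
print(dropped)
```

Everything in the second paragraph lands in `dropped`, so a key message buried there would never be seen under this limit.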
Domain Registration Information
- Google stores domain registration information (RegistrationInfo).
Topic Focus
- Websites with too many topics lose out on a high Topic Focus Score.
Implications for SEO
The insights from the leak may prompt significant changes in SEO practices, encouraging a focus on newly revealed ranking factors. High-quality content remains crucial for achieving favorable search rankings, as emphasized by the disclosed information. However, not all SEO tactics may be affected, as the leak might not encompass the entirety of Google’s ranking system.
Despite the wealth of information, there are still uncertainties. The precise weight and influence of each ranking factor remain unclear, necessitating careful analysis and application. Given Google’s frequent updates, the leaked information might not entirely represent the current algorithm.
Disclaimer: While the information provided is based on the leaked Google Search API documentation, it is essential to approach it with caution. The details may be outdated or incomplete, and Google’s algorithm is subject to frequent changes and updates. Always consider multiple sources and stay informed about the latest developments in SEO.
For a deeper dive into the leaked docs, check out this amazing website: 2596.org.