Google’s documentation leak: a rare glimpse into Google Search

June 2024
By Jorge Repiso

On 27th May, the leak of 2,500+ internal Google documents caused ripples across the search community. The leak contained extensive internal documentation appearing to come from Google’s Content Warehouse API. The documentation revealed detailed information about 14,000+ ranking signals used or potentially used to rank search results.

This marks another public setback for Google. Last year, the tech giant was forced to disclose some of the inner workings of its search product during a landmark antitrust trial brought by the US Justice Department.

Unlike the unprecedented Yandex code leak in January 2023, this latest leak is believed to have been accidental, with the documentation spreading to public indices where it was discovered, assessed, and shared by members of the search community. On 29th May, Google confirmed the data’s authenticity.

What is in the leak?

The leak revealed various factors that could potentially influence how the algorithm ranks pages.

It seemed to confirm the existence of navBoost, a ranking factor that provides signals to the algorithm based on user clicks. This works in conjunction with a wealth of data Google collects from Chrome users. Although Google long denied that clicks are a ranking factor, the documentation indicates that it collects and assesses clicks and post-click behaviour in its ranking algorithms.

The leak also appeared to confirm that Google uses a metric called siteAuthority, suggesting that having an authoritative website positively affects site ranking. The company had previously denied that such a metric existed. The leak also revealed that Google uses a feature called smallPersonalSite, though it is unclear whether this is used to promote or demote such sites.

The leak also shows the use of golden documents, apparently a flag used for adding “additional weight to human-labeled” documents in contrast to “automatically labelled annotations”. This could mean that a Google employee could manually flag a specific URL to boost it in the results page.

The documentation indicates the existence of whitelists for certain topics, such as isElectionAuthority and isCovidLocalAuthority, which would explain the high level of curation around specific queries.

What was Google’s response?

Google issued the following statement in response to the leak: “We would caution against making inaccurate assumptions about search based on out-of-context, outdated, or incomplete information. We’ve shared extensive information about how search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation.”

What the documentation does not reveal is which of the 14,000+ ranking signals leaked are in production, or the weighting that ranking signals and factors have in its algorithms. Google declined to comment on a signal-by-signal basis.

What does this mean for Google Search?

In truth, nothing really changes, but Google will have to work hard to earn the search community’s trust again. It did not deny the accuracy or validity of the leaked data; it said only that the data lacked context.

Google also said that its ranking systems change over time to improve its services, and that it will communicate any information it can to the community.

One thing is clear though: everything Google releases in the foreseeable future about Google Search will be heavily scrutinised through the lens of this leak.

© Digitalis Media Ltd.