Google Data Leak 🔥 How a Software Engineer Sees It?

This post is to give you “Uncle Rad’s” software engineering perspective on the data leak. I’m looking at what it is and what it isn’t and providing some context into how I see it.

Categories: Articles about SEO, Technical

Time to read: 17 min



I rarely comment on hot takes like this because it’s really difficult to surprise me or have me extremely excited about any news.

This isn’t because I’m an old-school SEO who’s been doing it for 14 years, and who’s very hard to impress (*although that too! 🤣). It’s because I am more of a sceptic than an opportunist who jumps on all possible RTM (real-time marketing) chances.

So, ever since the news about “the greatest of all leaks” was published by Rand and Mike, I’ve been reading the docs, doing my research and due diligence, and making my own assessment of what this is and what it gives us.

I’m not going to do a backstory on it because it’s everywhere, so you can just click the above links and read more yourself.

I think it’s great that Rand and Mike published their pieces, and it’s interesting/crazy/enviable how quickly they got so much attention. Definitely a huge 👏🏻 to them!

Today, I read a great, toned-down article by Patrick Stox of Ahrefs that quite well describes the positives and negatives of, not the leak itself, but its perception. I think it’s a great take, so I’ll link to it below:

👉🏻 Google Documents Leaked & SEOs Are Making Some Wild Assumptions

This piece in many ways reflects some of my reservations.

But, anyway, I’m really drifting away from what I wanted to share. So, without dragging it out much further, I’m just going to shoot my thoughts about these revelations right out.

Things to note before reading

Before we really jump in, let me just state one last bit, important for context: why do I think I am qualified to talk about it?

First of all, despite my day job being an SEO, I am a developer first and an SEO second. I have been involved in dev projects since 2001, and I have my IT engineering degree (BEng). I definitely understand code, developers, and the project lifecycle. I am a software architect and lead quite a few web dev teams, so I also get requirements for repositories, storage solutions, and generally have a very technical mind.

But before anything else, also because I simply can, and since everyone in the SEO space (including our grandmas and barbers 🤭) is commenting on the leak, I think I can also contribute a few observations.

I’d just ask you, before going ahead and commenting or straight-up bashing me for my opinions here, please read everything carefully. These are my opinions and they might or might not be correct. I’m working with the same data available to everyone 😉

1/ This leak isn’t really a leak

I dug up a Java repository that has all the JavaDoc with the same content as the documentation. Everyone can see it here:

How did I find it? Just by searching for the file name in GitHub Search:

You can see the search results for Java, PHP, and Elixir files. Some have the exact same doc (/** any form of {Java|PHP|ex}Doc **/), some don’t, but it’s definitely the same API.

Now, I have no idea who EduMelo, the GitHub user who owns this Java repository, is or what they do. It’s probably not the best idea to go chasing them to ask about it. They probably just happened to have it in their repo, so leave them be!

However, if you look at the commit dates, you’ll see that they date back to Oct 2022:

My argument here is that it’s been around for a bit, it’s just now that it’s been brought to the attention of a larger audience.

I also don’t fully agree with the statement that “internal version of documentation(1) for the deprecated Document AI Warehouse was accidentally published publicly to a code repository(2) for the client library“.

I am highlighting these two: documentation (1) and code repository (2) because they are the key terms in this context. Get ready for some semantics now…

The documentation (1) was pushed to the code repo (2) by mistake – this is something I am almost 100% certain of. The reason, however, is different to what’s assumed here (namely a leak).

It was a mistake, not because they wanted to hide it or because it was a top-secret thing, and they s*at 💩 their pants when it got committed to the code repo. If you know a few things about the development workflow, you know that most devs would rather not include documentation in their code repositories.

Documentation, especially when the code is well written and has (in this case) good JavaDoc or annotations, would be generated separately (often on an autopilot) and stored on a different service – like, where this documentation is currently still residing.

Since the whole Java|exDoc annotations are in the code that was distributed, it’s not really difficult to argue that it wasn’t by mistake.
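To make the point about annotations concrete, here is a minimal sketch of how reference docs get generated straight from comments living in the code. It uses Python docstrings as a stand-in for JavaDoc/exDoc. The `QualitySignals` class and the `generate_docs` helper are hypothetical names made up for this illustration, not anything taken from the actual client libraries.

```python
import inspect


class QualitySignals:
    """Hypothetical container for per-document signals (illustrative only)."""

    def set_score(self, value: int) -> None:
        """Store a score for the document.

        This comment ships with the code, so whoever holds the code
        also holds the documentation.
        """
        self.score = value


def generate_docs(cls: type) -> str:
    """Build a plain-text reference page from the docstrings found in `cls`."""
    lines = [cls.__name__, inspect.getdoc(cls) or "", ""]
    for name, member in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and dunder methods
        doc = inspect.getdoc(member) or ""
        summary = doc.splitlines()[0] if doc else ""
        lines.append(f"{name}{inspect.signature(member)}")
        lines.append(f"    {summary}")
    return "\n".join(lines)


print(generate_docs(QualitySignals))
```

The generated page contains nothing that wasn’t already sitting in the source file, which is why distributing annotated code and publishing its documentation amount to pretty much the same thing.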

Moreover, have a look at this fella ➡️ Brent Shaffer, who works at … wait for it … Google (and who, by the way, is very active on GitHub).

I bring him up as GitHub search is showing the “CompressedQualitySignals.php” file in his repository as the first result.

I sincerely hope I’m not throwing Brent under the bus – I’ll be happy to learn his side of the story and clarify if I got anything wrong! Just reach out to me and I’ll include it.

To be clear, however, it’s not Brent who committed the “leaked” PHP code matching this very version of the ContentWarehouse API. It was originally a bot that made the initial commit to the official Google API PHP Client Services repository.

It surprises me a bit that GitHub search, instead of showing the official repo and the yoshi-code-bot commit (that you can check below), shows Brent’s repository:

This is the official Google API PHP Client, where I found the earliest commit with the file I’m looking for, from Oct 1st, 2022.

But anyway, I have some serious doubts around why Brent would have still been maintaining the official repository for over a year (like it’s business as usual) after this sensitive/leaked data got uploaded there.

But hey, maybe these were some automated changes. And the pull request comments, too…


Knowing how developers work and think, I wouldn’t be surprised if Brent simply hadn’t noticed all of these extra properties, methods, or anything out of the ordinary.

Firstly, doing some changes or maintenance on the code doesn’t mean that a developer is reviewing EVERYTHING that’s in the repository. Some repos are giant beasts, and it would just not make sense.

Secondly, developers focus on the exact task (or ticket) they are working on. So, even if the same file reveals the entirety of Google’s algorithm, if a developer just wants to change a simple attribute (give it a default value, for example), they’ll do that and only that. The rest of the code doesn’t matter as long as the dev’s current change is not breaking anything. And this “breaking anything” assessment is often done automatically by the IDE (e.g. PhpStorm), a set of unit tests or CI (continuous integration) tests.

So Brent could absolutely have been unaware of what’s in these files, despite making changes to them.

However, on May 7th of this year, as Mike King described in his iPullRank article, things started to get a little sketchy. This is when the ContentWarehouse API, in the form described in the “leaked” API documentation, was removed from the repositories.

Lo and behold, also in this PHP repository, there is a big change around the Content Warehouse code. And guess what? The Content Warehouse code was actually removed personally by Brent with just a simple “update Contentwarehouse” commit message:


(Shiiiiiiiiit, that’s a lot of files changed/deleted…)


It’s definitely unclear why. Let’s explore some potential reasons below.

Was it an attempt at trying to undo something that shouldn’t have happened? Then why wouldn’t he just purge it from the history as sensitive data?

Didn’t purge because it was already broadly forked (including Brent’s own fork)? Or because he would rather not draw attention to it? Or perhaps because it wasn’t considered that big of a problem?

Or simply none of the above, and maybe, just maybe, because the API was being deprecated and replaced? 🤔

Unfortunately, this is all just a maybe – these are all maybes, and it’s really difficult to say for sure!

What is sure, however, is that the beauty of GitHub allows us to track the history of the files, speculate on what happened, and picture some wild ‘maybes’! 😜

This is now another wild speculation from me, but looking at the dates of the Document AI Warehouse and Content Warehouse Java releases, it seems possible that they all went live in Oct 2022 (I actually found someone on X mentioning a date as early as Sept 20, 2022, too, but I cannot find the post again or confirm this date).



However, still, it’s worth noting that in the Java API client documentation I linked above, there is absolutely no mention, whatsoever, of anything to do with, for example, CompressedQualitySignals.


So, yes, this could indicate that it was a fuckup. But in the end, who knows…

If anyone is to blame here, I’d blame this yoshi-code-bot (we can’t be sure what this bot is exactly) and whoever gave it so much freedom to harvest and publish whatever it wants to Google’s official repos, possibly without any human oversight. Welcome to the automation, slash, AI world, baby!

So, to summarise, and paraphrase this 1st point of mine, my question is this:
How can something that’s been around for 19 months suddenly become a leak?

I don’t know, can it? I suppose, since everyone is calling it a leak, as if everyone suddenly became a plumber! 👩🏻‍🔧🪠

In my world, precision matters, and although it’s just semantics, I wouldn’t call it a “leak”. Instead, I’d call it a “discovery”.

We also have these mysteries that remain unsolved: did Google really automatically push something internal to a publicly available repository? And have they actually kept maintaining it for over 19 months?

It does appear so, but who knows!?

2/ Data (content) warehouse is storage, not algorithm

I absolutely CRINGE 🫤 when someone says that this leak reveals the algorithm. A statement like this is untrue and blown out of proportion. Allow me to explain why.

Any developer who knows anything about a data warehouse knows that this is designated as a storage system.

To quote Geeks For Geeks:

“A Data Warehouse is separate from DBMS (database management system – author’s note), it stores a huge amount of data, which is typically collected from multiple heterogeneous sources like files, DBMS, etc. The goal is to produce statistical results that may help in decision-making.” – Geeks For Geeks


“A data warehouse is meant for only query and analysis rather than transaction processing. The data warehouse is essentially subject-oriented, non-volatile, integrated, time-variant, and consists of historical data stored over long periods of time. A blueprint of BI and data mining algorithms.” – Geeks For Geeks

A data warehouse (or content warehouse) is fundamentally a storage system. It is designed to store large volumes of data collected from various sources.

Primary functions of a data warehouse vs. an algorithm:

| Data warehouse | Algorithm |
| --- | --- |
| Data Consolidation: combining data from different sources into a central repository. | Data Processing: transforming and analysing data stored in the warehouse. |
| Data Integration: ensuring that the data is consistent and well-organized. | Data Mining: discovering patterns or insights from the data. |
| Data Retrieval: allowing users to efficiently query and analyse the data. | Machine Learning: building predictive models or automating decision-making processes based on the data from the warehouse. |

It might, again, just be semantics, but it’s semantics that is dangerous, because people who misunderstand it will just draw the wrong conclusions!

Remember what Patrick said? “SEOs Are Making Some Wild Assumptions” – and this misunderstanding or misrepresentation is precisely why this happens.

Here’s another thing Patrick said that is so true for a data warehouse:

Exactly – if you have something stored in your data, it doesn’t automatically mean that it’s being used.

To make a little break from this point, let me introduce another term here – data lake.

A data lake is a central repository for raw, unstructured, or semi-structured data. Data lakes can accommodate data from various sources without the need for predefined schemas. They offer more flexibility in terms of data storage options and are often used for machine learning and data exploration. Data lakes prioritize storage volume and cost-effectiveness over performance, providing a lower-cost option for storing large amounts of data, including images and videos.

So here we have 2 common approaches to data storage (at least those I used personally when designing solutions to process and store large volumes of data):

  1. Schema-on-write (or schema-first)
    Usually applied in a data warehouse.
    Approach where you have a preconfigured data structure. This is usually when you know exactly what you’ll be storing.
    For example, a “person” object would have predefined properties (attributes) like weight, height, gender, sex, some other medical information, eye and hair colour, etc.
  2. Schema-on-read (or schema-last)
    Usually applied in a data lake.
    Approach where you know very little about the data that’s going to be coming in so you can’t prepare the structure. This is typically when you have just a vague idea about what you’ll be storing.
    Say, you only know what type of entity the data is describing, but you don’t know what attributes to expect.
    An example can be a system for monitoring wildlife, automatically recognising the animals and storing the information about them in the data storage solution.
    In this scenario, you would submit a wildebeest described by a set of entirely different attributes just the same as you would a lion or a croc. This approach allows you to store details about each animal, regardless of its kind or the number and types of attributes describing it.
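The two approaches above can be sketched in a few lines of Python. This is a minimal sketch, assuming a dataclass as the “warehouse” schema and a list of raw JSON strings as the “lake”. All function names, field names and values are made up for illustration.

```python
from dataclasses import dataclass
import json


@dataclass
class Person:
    """Schema-on-write: the structure is fixed before any data arrives."""
    weight_kg: float
    height_cm: float
    eye_colour: str


def write_to_warehouse(record: dict) -> Person:
    """Accept only records whose field names match the predefined schema.

    Raises TypeError on missing or unknown fields (value types are not checked).
    """
    return Person(**record)


def write_to_lake(lake: list, raw: str) -> None:
    """Schema-on-read: store the raw payload untouched."""
    lake.append(raw)


def read_from_lake(lake: list, attribute: str) -> list:
    """Interpret the structure only at read time; skip records lacking the attribute."""
    values = []
    for raw in lake:
        record = json.loads(raw)
        if attribute in record:
            values.append(record[attribute])
    return values


person = write_to_warehouse(
    {"weight_kg": 82.0, "height_cm": 180.0, "eye_colour": "blue"}
)

lake = []
write_to_lake(lake, json.dumps({"kind": "wildebeest", "herd_size": 120}))
write_to_lake(lake, json.dumps({"kind": "lion", "mane": True, "pride": "north"}))
write_to_lake(lake, json.dumps({"kind": "croc", "length_m": 4.2}))

print(person)
print(read_from_lake(lake, "kind"))
```

The warehouse rejects anything that doesn’t fit its structure at write time, while the lake happily takes every animal and leaves the question of “what attributes does this thing have?” to whoever reads it later.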


I know what you’re thinking: WTF Rad? What is all this and why talk about it?

Well, again, I want to talk about it just to be very precise and give it some context! There is a reason why we’re reading documentation for a “content warehouse” in this leak.

I’m assuming that, in this context, content warehouse is nothing else than a data warehouse for content.

This means we have a set structure and the data being pushed into it is pre-processed (which there is a lot of evidence for! Read the next point below for more), but by no means do we know what attributes, connections and/or data points are in use.

Let me give you a couple of examples to easily see how that might be.

Example 1

You are building a system, where the algorithm using a data warehouse is meant to generate the best medical insurance providers for a specific person.

You already have all the data about people from other sources stored in your warehouse. The data consists of and expands the attributes I used with the schema-on-write example above:

  • weight (thanks, Zepp, Xiaomi, Google Fit and Apple Health!)
  • height
  • race
  • gender
  • sex
  • age/DoB
  • some medical information
    • your first-contact clinic/surgery or GP visits
    • all your dentist appointments and treatment
    • BMI – calculated based on other metrics
    • all your blood pressure readings
    • all your heart rate readings (thanks, Apple Watch and Oura Ring!)
    • etc.
  • eye colour
  • hair colour
  • interests, hobbies
  • shopping data (thanks, Nectar card!)
  • financial information (thanks, banks and Experian!)
  • job situation (thanks, LinkedIn!)
  • family situation,
  • browsing history, etc.

Let’s start with the fact that you will not have all that data about every person you have in your system. Some people are open to share more and others are very private or simply don’t use some (or all) of these services.

Now, when it comes to the system itself – this is your algorithm.

In its most basic form, your system would likely use some information describing your physical wellbeing (like: weight, height, race, age, BMI) as well as selected medical information and, maybe, financials to know if you can be pushed towards more or less expensive options.

But you would, arguably, not need all your shopping data or family situation.

Example 2

Let’s say you build a system based on the same data as in the above example, but this time it’s meant to serve you targeted coupon codes as ads.

In this case, you’d probably still care about basic personal information (race, age, sex and gender), but a lot more about interests, shopping data, browsing history and, perhaps, also family situation (in order to, for example, serve ads for pregnancy supplements).
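Both examples can be put side by side in code. This is a minimal sketch, assuming one stored record and two toy “algorithms” reading from it. The `TrackedRecord` helper, the feature functions and all the values are hypothetical, there only to show that the same warehouse feeds very different consumers and that some stored attributes are read by neither.

```python
class TrackedRecord(dict):
    """A dict that remembers which keys were read (illustration only)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read_keys = set()

    def __getitem__(self, key):
        self.read_keys.add(key)
        return super().__getitem__(key)


# One warehouse record; attribute names follow the list above, values are made up.
STORED = {
    "weight_kg": 82.0,
    "height_cm": 180.0,
    "age": 41,
    "interests": ["cycling", "coffee"],
    "shopping": ["espresso beans", "bike lights"],
    "family_situation": None,  # the field exists, but nothing is known here
}


def insurance_features(rec) -> dict:
    """Example 1: reads physical attributes only and derives BMI from them."""
    height_m = rec["height_cm"] / 100
    return {"age": rec["age"], "bmi": round(rec["weight_kg"] / height_m ** 2, 1)}


def coupon_features(rec) -> dict:
    """Example 2: reads interests and shopping data, ignores physical attributes."""
    topics = sorted(set(rec["interests"]) | set(rec["shopping"]))
    return {"age": rec["age"], "topics": topics}


def unused_attributes(algorithm) -> set:
    """Run an algorithm on the record and report the attributes it never read."""
    rec = TrackedRecord(STORED)
    algorithm(rec)
    return set(STORED) - rec.read_keys


print(insurance_features(STORED))
print(sorted(unused_attributes(insurance_features)))
print(sorted(unused_attributes(coupon_features)))
```

Looking only at `STORED` (which is all a warehouse schema gives you), there is no way to tell which of the two consumers exists, or that `family_situation` is read by neither.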

The last little nuance here, just to emphasize it, is also about what you “know” about an entity (person, website or a document).

For instance, if we’re talking about a person, every system on earth, including Google will know much more about, for example, Elon Musk than it would know about me. Not because Elon oversubscribes to all these apps I mentioned (it’s actually very likely that his privacy is much better protected than mine!), but because he’s just a much better known person and there will be tons and tons of sources talking about him.

It’s obvious that for many entities, even if you’re the most powerful, data hungry and (arguably) capable search engine in the world, you just won’t be able to get all the information, or you won’t be willing to compute it, to put them in the logical graph.

As I said in my Chiang Mai SEO presentation “Cracking The Code – Secrets of Technical SEO” in 2018, Google’s algorithm works within a multidimensional vector space. I brought it up in the context of RankBrain, but it applies to everything.

This is by design because, firstly, vectors are easy to manage, store, update and access.

Secondly, vectorised information is easier to compute with, because you can (almost effortlessly) apply mathematical operations to it (e.g. subtracting one word from another with the help of word2vec).

Thirdly, multidimensional vectors (meaning they’re vectors within vectors) allow you to build quite an extensive, ontological knowledge graph. Oh, yes, having said ontological, you can also collect and organise this information hierarchically, so it all makes much more sense and reflects relationships between entities you’re collecting or describing.
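The word arithmetic mentioned above is easy to show with toy numbers. This is a minimal sketch, assuming hand-made 3-dimensional vectors. They are NOT real word2vec output (real embeddings have hundreds of dimensions and are learnt from text); they are invented so the maths is easy to follow.

```python
import math

# Hand-made toy vectors with dimensions (royalty, male, female); values are invented.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, exclude):
    """Return the known word whose vector is most similar to `target`."""
    candidates = {w: v for w, v in VECTORS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


# "king" - "man" + "woman"
result = add(subtract(VECTORS["king"], VECTORS["man"]), VECTORS["woman"])
print(nearest(result, exclude={"king", "man", "woman"}))
```

With these toy values the closest remaining word is “queen”, which is the classic word2vec party trick: relationships between entities become simple additions and subtractions.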

More reading to show that Google is pretty bad-ass when it comes to their knowledge graph approach, manipulation and management can be found in Google’s own patents:

(The image above is here just because it looks cool and shows how a knowledge graph can be used to predict and merge information based on the knowledge of entity relationships 😉)

Me dropping the links above doesn’t mean they are in use in search, but they’re to show that Google is all about knowledge graphs across many products, because, well, knowledge graph is something that is excellent to use in a distributed knowledge storage system.

So, I’m just trying to say that a system storing your data, as much as it needs to be well thought through and designed, is absolutely NOT the system that does something with this data.

So, we should take it with a grain of salt that we have all of these attributes there. Not all of them will be used, not all of them will be used for each document (still possible that documents that don’t have some attributes are losing out!) and we’re not quite sure what the algorithm is doing with some of these attributes.

This brings me to the next point.

3/ This leak is useless, BUT …

This is probably the most important point. I’ll say it straight, but wait for the big “BUT…” in a moment.

This leak is useless – here, I said it.

I said what many people are already saying on X – the documentation on its own (even combined with the code it’s for) is useless in deciphering how the algorithm works.

It’s useless because:

  • there is absolutely no information on inner working of the attributes (how some are computed, used or where their values come from),
  • there is very little information on which attributes are being used (if any),
  • it doesn’t contain any actual source code – even the API client source code doesn’t give us anything about how some of the properties or methods (e.g. setters) are being used,
  • we have no direct insight into the underlying algorithm,
  • we don’t know how the systems populating the information in the content warehouse work,
  • we have no idea what values most of the attributes get – there are no example values (only their types), apart from a few (like the spamRank attribute, for which the value is 0 or 65535)!

The way I see it, it is (almost for sure) something Google uses internally (read the next point to learn why “almost for sure”), but like I said in the above point, we have no idea how, why and whether it’s even setting the data attributes available.

So, just be mindful before you jump in and change your whole SEO process based on this article:

Or before trusting the AI Overview analysis of the docs done on some website:

This is ABSOLUTELY NOT to discredit Rand or Matt (who created the website mentioned above)!

In fact, let me tell you something – I will definitely be changing some of our internal SEO processes – exactly as Rand suggested!

We’ve already fed our internal AI Knowledge Base system the entirety of the leak content, along with some great articles about it (I’ll share some in a moment). If we didn’t have our own system, Matt’s website would definitely be of help!

I’m just trying to say that whatever is in this leak must be carefully analysed, understood and precisely concluded before you should be jumping on and treating it as an incontrovertible truth.


Yes, here’s the big “BUT…” I mentioned earlier.

The document is the first-ever peek of this size we ever got into the parameters (attributes), methods (functions), data structures and data most likely powering the Google algorithm.

It’s always been a black box kept well under lock and key. That makes this leak an EXTREMELY valuable source of information!

Wearing my engineering hat 👷🏻🎓, I am totally in awe looking at the technical composition and engineering skill of what is in this document.

It’s so complex, but also organised and well thought-through! It’s just beautiful 😍!

It’s absolutely remarkable to finally know what Google stores or generates and what are some connections between the data points, documents, entities, calculations and parameters.

There are just so many nuggets in these documents! We can all learn so much from it!

Some things should really not surprise us.

Like for example Google using clicks or CTR for rankings – here’s a slide from my presentation from 2018:

Here are some more patents I linked to on this slide:

See this cloud on the slide saying “Results Satisfaction Analysis (CTR, BR)”? Hmm, did I call it?!

No, it was widely assumed based on patents and some empirical evidence + my understanding of how Google can really have a feedback loop to know what users like/dislike.

Considering the data they have access to, it was extremely easy to predict that they use clicks for this purpose (regardless of what they’ve been saying for years before and after). How else would they measure it reliably?

And adding Chrome data to it? Heck, they probably learnt the value of clickstream through Google Toolbar and decided to build Chrome!

With this, it also baffles me how people jumped on poor Gary Illyes who said “using clicks directly in ranking would be a mistake”.

Word precision matters, remember?

So if we look at the leak data and consider how many different signals around clicks Google is using, I’d say they are most definitely NOT “using clicks directly“, but they really work with the click information to make it reliable.
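To illustrate the difference between “using clicks directly” and working with click information, here is a purely hypothetical sketch. Nothing in it comes from the leaked documentation: the function names, the expected CTR per position and the smoothing weight are all my own assumptions, chosen only to show how a derived signal can disagree with a raw count.

```python
def raw_click_score(clicks: int) -> int:
    """'Using clicks directly': the click count itself is the score."""
    return clicks


def derived_click_signal(
    clicks: int,
    impressions: int,
    expected_ctr: float,
    prior_weight: int = 100,
) -> float:
    """Smoothed ratio of observed CTR to the CTR expected at that position.

    A value above 1.0 means the result is clicked more than its position
    alone would predict. The prior pulls low-traffic results towards 1.0.
    """
    smoothed_ctr = (clicks + prior_weight * expected_ctr) / (impressions + prior_weight)
    return smoothed_ctr / expected_ctr


# Made-up numbers: page A sits at position 1, page B at position 5.
page_a = {"clicks": 300, "impressions": 1000, "expected_ctr": 0.30}
page_b = {"clicks": 60, "impressions": 1000, "expected_ctr": 0.05}

for name, page in (("A", page_a), ("B", page_b)):
    print(name, raw_click_score(page["clicks"]), round(derived_click_signal(**page), 2))
```

By raw clicks, page A wins easily. By the derived signal, page B comes out ahead, because it outperforms what its position would predict. Same click data, very different answer, and that is exactly why the word “directly” carries so much weight in that statement.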

Here are a few other slides from this same presentation that have confirmation in this leaked document:

I’m just trying to say that as much as anecdotal evidence is not really something to blindly follow, the empirical evidence gives you a reliable answer to measured questions.

This is why you should apply the empirically confirmed findings in your SEO work.

This is also why SEO testing is so important!

But, in the leak itself, there are also many other things that did surprise me, some of which I haven’t seen people talk about yet.

For example, Google likely does automatic speech recognition on videos and compares it with what’s in the video frame using OCR:

We also learn that a lot of stuff is happening before many of the data attributes are populated – like actions taken based on mysterious Cookbook recipes, generated MuppetSignals or the whole SpamBrain data.

I have no idea what most of these mean. (Yet, but working on it!)

There’s just so much to unpack that I’ll be spending weeks analysing everything and milking this information until it’s dry 😉

All to, obviously, expand my own knowledge, understanding of Google and predict which of these can still be in use, which are important and which are not.

Oh, yes, and to give our company a competitive advantage. Even though this information is public and everyone has access to it, the interpretation and testing are still up to you. IYKYK!

If you want to read a great piece of content analysing the leak data (although still only scratching the surface of it), definitely give this article a read:

👉🏻 Unpacking Google’s massive search documentation leak

Andrew Ansley really goes into some great details and provides awesome context! Thanks Andy, if you’re ever reading this!

4/ The documentation is legitimate

I mean, Google confirmed it, too, right?


But even without them confirming, there were just too many references to Google’s intranet, teams and other internal resources for it to NOT be legitimate.

Firstly, the document mentions internal teams very often:

There are also too many references to Google’s internal-only resources:

And, lastly, it literally gives Googlers instructions on some elements (no chance the below is for public use 🤣):

I think that if it was to be somehow generated, it would take a very creative LLM to hallucinate that many internal references, all of which made sense!

My theory is that Google built this Content Warehouse for themselves first, before they decided, “hmm, maybe, we could make money with this content store solution?!“.

After which, they included it in their Google Cloud stack, but (likely) forgot to remove all the parameters they use internally.

Or they just released it as is (OKR and $$$ pressure, baby!), before they created the API that was designated for public eyes.

It just looks like they built a lot of solutions because they needed them themselves, and only (eventually) let them out as a public offering, like Google Cloud, to double down on monetisation.

At least something like this. I am just coming up with some wild, personal maybes, remember?


Well, as I said earlier, it doesn’t happen often that I am really excited. But with this piece of documentation, I absolutely am!

I am excited because this is truly a hell of an engineering piece that I can be a part of (even if it’s only conceptually). There’s a lot to learn from it.

I am also excited because there is so much you can take away and put together to better understand the algorithm. Nothing direct that says “do this and you’ll rank”, but a plethora of indirect suggestions along with elements that you now have strong indication might work.

I am also excited because we can really go wild with field-testing. This, literally, feels like 2010 again, when I was performing plenty of single-variable tests on a daily basis. It was easier back then, but at least now I know exactly what to test! 💪🏻

I am also happy because a fair bit of mystery just got removed. We can see into it and understand it better. In practical terms, it means the industry (we) will draw some biased conclusions, still get a lot of it wrong, and blindly trust what Google says. But IYKYK, remember? 😉

So, yeah, there’s a lot to look forward to in the coming weeks/months/years.

Keep digging, be curious, but always be precise!
