How chatbots source their info: key findings from our study revealed

June 2024

by The Digital Risk and Intelligence Team

Chatbots have quickly become part of everyday life, especially since ChatGPT’s release in November 2022, which popularised Large Language Models (LLMs). According to web statistics, ChatGPT – one of the many chatbots available – saw 1.8 billion total visits in April alone. Chatbots are now so ubiquitous that they are used professionally across a range of industries. Teachers use them to create lesson plans, programmers to write code and realtors to write property listings.

There has even been debate over whether chatbots could replace search engines altogether. A March 2023 survey found 42% of professionals envisage predominantly using AI chatbots for online queries in future.

Chatbots and online reputation

In terms of online reputation, chatbots can present a big issue. Hallucinations, which occur when a chatbot generates a false or fabricated response because it has misinterpreted the data or the data does not exist, can present major reputational concerns.

That’s why it is important for everyone – but high-profile individuals and companies in particular – to understand which sources chatbots use and how they prioritise the information from these sources when presenting responses to queries about them.

To get to the bottom of this issue, we looked into the types of sources chatbots use and where these sources rank on Search Engine Results Pages (SERPs).

How we did our research

For this piece of research, we used three well-known chatbots:

Gemini – Google’s flagship chatbot, previously known as Bard
ChatGPT – version 4.0
Perplexity – a newer chatbot which has quickly risen in popularity to become a competitor to ChatGPT

Our aim was to find out whether there is a correlation between the sources chatbots use and SERP rankings, and establish which source types are the most influential in chatbot responses.

To do this, we entered a query and compared the sources cited by chatbots for their claims with the ranking of the same query on the first 20 pages of search engine results. For reference, Gemini uses Google as the basis for its research, while ChatGPT and Perplexity use Bing.
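The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling used in the study: the function names, the URL normalisation and the figure of ten results per page are all assumptions made for the example.

```python
from urllib.parse import urlsplit

RESULTS_PER_PAGE = 10  # assumption: ten organic results per results page
MAX_PAGES = 20         # the study looked at the first 20 pages of results


def normalise(url):
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url if "://" in url else "https://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def serp_page(cited_url, ranked_urls):
    """Return the 1-based results page that holds cited_url.

    ranked_urls is the ordered list of search results for the same query.
    Returns None when the source is not within the first MAX_PAGES pages.
    """
    target = normalise(cited_url)
    limit = RESULTS_PER_PAGE * MAX_PAGES
    for position, url in enumerate(ranked_urls[:limit]):
        if normalise(url) == target:
            return position // RESULTS_PER_PAGE + 1
    return None
```

A source that returns None here would fall into the "outside the first 20 pages" group discussed in the results.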

We categorised the chatbot sources into the following categories:

Wikipedia.
Owned asset – including individual or company websites or other assets they have direct control over such as a LinkedIn profile.
Mainstream news – news outlets such as Forbes, Bloomberg and the New York Times.
Industry publications – more specialist outlets.
Alternative media – including blog and social media posts.
Reference and research – non-Wikipedia encyclopaedia citations such as Britannica as well as academic journals and databases.
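As a rough illustration of how sources could be sorted into these categories, the sketch below classifies a cited URL by its host name. It is a hypothetical helper, not the method used in the study: the lookup sets contain only the outlets named in the list above, and anything it cannot place is flagged for manual review rather than guessed.

```python
from urllib.parse import urlsplit

# Illustrative lookup sets, limited to the outlets named in the list above.
MAINSTREAM_NEWS = {"forbes.com", "bloomberg.com", "nytimes.com"}
REFERENCE = {"britannica.com"}


def host_of(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def categorise(url, owned_hosts):
    """Assign a cited URL to a source category.

    owned_hosts is a set of hosts the entity controls (an assumption of this
    sketch; profile pages on shared platforms such as LinkedIn would need a
    path-level check that is not shown here).
    """
    host = host_of(url)
    if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
        return "Wikipedia"
    if host in owned_hosts:
        return "Owned asset"
    if host in MAINSTREAM_NEWS:
        return "Mainstream news"
    if host in REFERENCE:
        return "Reference and research"
    # Industry publications and alternative media have no simple host rule.
    return "Needs manual review"
```

Industry publications and alternative media are deliberately left to a human reviewer in this sketch, since telling a specialist outlet from a blog usually requires judgement.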

For our queries, we examined entities for the following categories: “Fortune 500 CEOs”, “High-profile families”, “Fortune 500 companies” and “Private US companies”. From looking at these queries over Gemini, ChatGPT and Perplexity, we analysed 1,544 different claims made by chatbots.
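As a quick consistency check, the claim counts reported for each category in the results section add up to the 1,544 total. The short snippet below does the arithmetic, using only figures from this article; the variable names are illustrative.

```python
# Claim counts per query category, as reported in the results section.
claims_by_category = {
    "Fortune 500 CEOs": 376,
    "High-profile families": 386,
    "Fortune 500 companies": 321,
    "Private US companies": 461,
}

total_claims = sum(claims_by_category.values())

# Share of all claims contributed by each category, as a percentage.
share_by_category = {
    name: round(100 * count / total_claims, 1)
    for name, count in claims_by_category.items()
}

print(total_claims)  # 1544
```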

The results

For our queries on Fortune 500 CEOs, the chatbots returned 376 claims. The majority (54%) of sources cited are located on Page 1 of SERPs, 30 percentage points ahead of the next closest result. This is especially prevalent for Perplexity, with 61% of all its sources located on Page 1.

In terms of type of source, owned assets proved the most prevalent, accounting for 43% of sources cited. Wikipedia is the second most common source at 21%.

For our queries on high-profile families, the three chatbots returned 386 claims. The majority of sources (58%) are located on Page 1, 36 percentage points ahead of the next closest result. In this instance, ChatGPT used the most sources on Page 1.

For these queries, Wikipedia is the most popular source, comprising 43% of all sources cited by the three chatbots. High-profile families proved to be the only category where mainstream news is the second most prevalent source, comprising 25% of all sources cited. The preference for Wikipedia and mainstream news on this particular topic is largely due to many high-profile families having a reputation dominated by historical information.

For our queries on Fortune 500 companies, the three chatbots returned 321 claims. This set of queries is the only instance in which sources outside the first 20 pages of Google proved more prevalent than Page 1 sources, comprising 42.1% of claims compared to 41.7%. This is largely driven by Gemini and ChatGPT which, in their responses to queries on the companies, tended to focus on press releases from the past year reporting on recent changes at the company. These press releases are often from smaller outlets or company newsrooms and do not rank strongly on SERPs. Perplexity, which tended to give profiles encompassing the company’s operations, continued to overwhelmingly favour Page 1 sources.

The three chatbots favoured owned assets as a source, using them in 55% of claims in total. This figure is driven largely by ChatGPT and Gemini, which heavily favoured owned assets as sources, by 55 and 43 percentage points respectively. These owned assets appear in chatbot responses as citations of company press releases and website information. ChatGPT, in particular, tended to focus its responses on recent news about the companies and regularly used company website information as a source.

For our queries on private US companies, the three chatbots returned 461 claims and 66% of sources are located on Page 1, making it the most common location for sources cited. This is in large part driven by Perplexity, which favoured Page 1 sources by a greater margin than any chatbot did in any other set of queries. This is partly down to Perplexity heavily favouring Wikipedia or the home page of a company’s website as its source. These are almost always located on Page 1 for an entity. Both ChatGPT and Gemini also used Page 1 sources the most, albeit to a lesser extent.

Wikipedia proved the most popular domain sourced, cited in 45% of claims. However, having a collection of owned assets appears to be just as important. Both are among the top source categories people check and both can significantly influence a person’s online profile. This reflects the findings for Fortune 500 Companies, where the chatbots heavily favoured owned assets as well.

Overall, the majority of sources on both ChatGPT and Perplexity are ranked on Page 1, while Gemini actually tended to use sources outside the first 20 pages of Bing and Google. Perplexity has a much stronger first page bias than ChatGPT, using Page 1 sources 66% of the time compared to 48% for ChatGPT.

Surprisingly, owned assets proved to be the most prevalent source type. Wikipedia was the single most used website, cited only 28 fewer times than all owned assets combined. Alternative media proved the least common source type, with blog posts used only occasionally.

The takeaways

From our research, we can now confirm that what shows up on Page 1 of a SERP is related to key chatbots’ answers to queries, particularly for Perplexity.

As a result, the appearance of positive and trustworthy assets on Page 1 may reduce the likelihood of chatbots drawing on nefarious sources.

However, maintaining a positive and accurate Page 1 should not be the only focus, as 27% of sources consulted for responses did not appear on the first 20 pages of Bing and Google.

Another big finding is the prevalence of owned assets as a source for chatbots, alongside Wikipedia. These two sources make up 72% of all sources used by chatbots. That means companies need a factually accurate Wikipedia page and detailed online assets describing the company and individuals if they want chatbots giving out accurate information about them.

Potential red flags

These findings raise some issues. Firstly, users may be dissuaded from using chatbots to research companies if they feel as though the response they receive is simply a corporate puff piece. Secondly, there’s potential for hostile actors to create content impersonating a company’s online assets in the hopes of confusing the chatbots into assuming it is the company’s page.

Users may also be dissuaded from using chatbots for research on entities due to their heavy reliance on Wikipedia as a non-owned asset source. Often, responses from chatbots are taken word for word from Wikipedia.

Other interesting observations

In four cases, chatbots cited documentaries or video clips as sources for their claims. For example, in response to a query about a high-profile family, Perplexity listed an episode of a BBC documentary on the family as one of its additional sources.

There were also cases where chatbots made accurate claims, but the information was not found in the source directly attributed to the claim. For example, in a query about a high-profile chairman, ChatGPT accurately described him as the chairman of the company, citing Wikipedia as the source. However, this information does not appear on Wikipedia.

We identified very few hallucinations in our research, and none of them dramatically altered the profile of the entity presented in the query. For example, in its response to a query about a Fortune 500 company, Perplexity stated a fact about the company that did not appear in the source it provided and that other sources contradicted.
