© Copyright Acquisition International 2024 - All Rights Reserved.

Posted 15th April 2024

Should You Block AI Bots from Crawling Your Website?

Did you know AIs like ChatGPT could be crawling your site for data? The rise of AI large language models (LLMs) like ChatGPT and Bard (now called Gemini) has raised a question for businesses: should you block or allow AI bots like OpenAI’s GPTBot to crawl your site? As AI is a rapidly developing technology, it’s a question businesses might not have expected to face in 2024, but one that should be high on the agenda. With sites like Amazon choosing to block ChatGPT, it’s clear that we should all be considering whether this is the right move.


According to SEO and PPC agency MRS Digital, businesses should make a careful decision on whether they want to block AI or not. On the one hand, blocking AI could help prevent risks, such as your content being unintentionally misrepresented. On the other, could you be missing a world of opportunity presented by this seemingly unstoppable technology shift? 

Quick recap: What are crawlers?  

A crawler is essentially a tool, typically operated by a search engine like Google or Bing, that reviews your website and indexes data from it, such as the content you’ve written and information about your company, so that your website appears in search results. It’s how search engines discover and understand your site, so the concept of a crawler is nothing new. As a website owner, you can use a robots.txt file to decide which parts of your website crawlers can view and index in search results.
 
AI crawlers use the same technology, but instead of simply indexing your website data, they review the information on your site and can use it to train their operators’ large language models.

What AI chatbots could crawl your site?  

While most people will have heard of ChatGPT and Bard (now called Gemini), there are several AI crawlers out there, some of them lesser known. They include:

  • ChatGPT-User. This is used by ChatGPT when a user on GPT-4 directs the bot to your site in a prompt like “tell me how many times [SITE URL] mentions AI”.
  • GPTBot. This is OpenAI’s crawler, which collects data from your site to use as training data for its AI models.
  • Google-Extended. This is the robots.txt token that controls whether Google can use your content for its AI products, including Gemini (previously called Bard), its AI chatbot.
  • anthropic-ai. Anthropic has a range of AI tools, including Claude, its AI chatbot, and its crawler collects the data for these.
  • CCBot. This is Common Crawl’s bot, and the data it collects was used to train GPT-3. Common Crawl is designed to make web data accessible to everyone, without any fees.

Why would you block AI crawlers? 

Blocking AI crawlers might be the right decision for you, especially if you’re concerned about your content being misrepresented, or your site is in development.  

Misrepresented content 

When humans write content, we write with nuance, and there may be cultural or business context that means what you write makes sense to a specific audience. When your content is taken out of that context and used to form part of an AI chatbot’s answer, it will most likely lose that nuance, and the point it made may be lost or misrepresented entirely. Some companies don’t want to run that risk, so they block AI crawlers to prevent it. For example, if you were a medical company that had specific advice pertaining to one of your products, you wouldn’t want an AI to take that advice out of context and apply it to an unrelated product or medical query.

Unwanted association 

AI crawlers tend to take sections of information from varying websites without always understanding the context of that information, so there is a risk that your information may be presented next to additional sources that your business doesn’t want to be associated with. If this is a concern, you may want to block AI crawlers. This will stop your company from being mixed in with competitors, or with those in your industry who may not uphold best practice. For companies where reputation management is an issue, or has been historically, this could be a very strong argument.

Data Scraping 

It’s best practice to block any crawlers from viewing parts of your site you don’t want them to see. For example, you might have a staff wellbeing portal on your intranet or customer logins on your website. You don’t want these crawled, as they contain personally identifiable information, something your customers or employees definitely don’t want an AI company to have! OpenAI says that pages crawled by GPTBot are “filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies.” Most websites will already block general crawlers from these areas, so it’s worth asking your hosting provider or SEO team whether they can extend those rules to AI crawlers.

Spam Generation 

As technology evolves, so do cybercriminals. Thanks to AI-generated content, we’re now seeing more sophisticated phishing emails and malicious links than ever. By combining AI-powered chatbots like ChatGPT with data harvested from your site, malicious actors can create realistic spam emails that closely imitate your employees or the company itself. This could ultimately lead to more successful phishing attempts, which can cause financial and reputational loss for your business.

How to block the ChatGPT crawler: 

1. robots.txt 

Your site will usually already have a robots.txt file. It’s simply a matter of updating it to exclude the pages you want to block AI crawlers from viewing. Doing this will help keep crawlers that follow the rules away from sensitive data and content that should not be public knowledge. A misconfigured robots.txt file can cause your site to drop out of Google and other search engines, so it’s best to proceed with caution here. If you have an SEO agency, check in with them before you do this, as they may be able to help you.

You can also ask crawlers to crawl at a certain speed. It’s also possible to let them crawl some, but not all, of your site, for example by keeping them out of admin areas. Different businesses will have varying reasons to block crawlers, or not block them at all.
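
As a rough illustration of the approach above, the Python sketch below builds robots.txt rules that block a list of AI crawlers from the whole site and keep all other crawlers out of a couple of private areas. It then checks the result with Python’s standard-library robots.txt parser. The user-agent tokens, the private paths and the example.com URLs are assumptions for illustration only, so check each vendor’s documentation for the current token before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for the crawlers discussed in this article.
# Confirm the current spelling in each vendor's documentation.
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai", "CCBot"]


def build_robots_txt(blocked_agents, private_paths):
    """Return robots.txt text that blocks the given agents from the whole
    site and keeps every other crawler out of the private paths only."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in private_paths]
    return "\n".join(lines) + "\n"


# Hypothetical private areas; replace with the paths on your own site.
robots_txt = build_robots_txt(AI_CRAWLERS, ["/admin/", "/login/"])
print(robots_txt)

# Sanity-check the rules with the standard-library parser before deploying.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))       # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/users"))  # False
```

Checking the rules this way before uploading the file is one way to reduce the risk of accidentally shutting out Google and other search engines.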

2. Web Application Firewall (WAF) 

You can also use a WAF to block the crawler(s) as well as any unwanted traffic to your site. You’ll be able to keep it up and running for your customers without hindering their experience on your website. 
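
Every WAF has its own rule syntax, so the Python sketch below only illustrates the kind of check such a rule performs. It is not a WAF and not a replacement for one. It wraps a tiny, hypothetical web application and returns a 403 response when the request’s User-Agent header contains an AI crawler token. The token list and the sample User-Agent strings are simplified assumptions.

```python
from wsgiref.util import setup_testing_defaults

# Assumed lower-case tokens to look for in the User-Agent header.
BLOCKED_TOKENS = ("gptbot", "chatgpt-user", "anthropic-ai", "ccbot")


def is_blocked(user_agent):
    """Return True if the User-Agent string contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def block_ai_crawlers(app):
    """Wrap a WSGI app so that blocked crawlers receive a 403 response."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Hypothetical stand-in for your real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Welcome\n"]


protected_site = block_ai_crawlers(site)


def status_for(user_agent):
    """Send one fake request through the wrapped app and return its status."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    protected_site(environ, lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))       # 403 Forbidden
print(status_for("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # 200 OK
```

Unlike robots.txt, which relies on crawlers choosing to follow the rules, a check like this refuses the request outright. It can still be sidestepped by a crawler that sends a different User-Agent string.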

So, is it really worth blocking AI crawlers?  

When considering whether to block ChatGPT and similar crawlers, there’s more to ponder over than the downsides alone. 

In November 2023, ChatGPT hit 100 million users per week. With this figure likely to grow, that’s a great deal of brand visibility you’re missing out on if you refuse to embrace this technology.  

LLMs are the future of search. Did nobody tell you yet? Bing has already embraced AI in the form of Microsoft’s Copilot, and Google is hot on its heels, recently moving the testing of its own AI-powered search, Google Search Generative Experience (SGE), into the main Google search results. This means that if you’ve ever relied on organic search or SEO for a portion of your business generation, blocking AI could seriously hamper your efforts, if not now, then in the near future.

There’s even a branch of SEO forming known as Generative Engine Optimisation (GEO) that focuses on improving visibility on popular LLMs like ChatGPT. Again, this may be an emerging acquisition channel that you’re missing out on if you block AI crawlers. 

You should also consider how effective trying to block LLMs from your site really is. First, you must look beyond the big-name AIs. Blocking ChatGPT alone won’t cut it. Large language models like this are trained on a range of different datasets, including content from sites like Wikipedia and Reddit.

One of the datasets most commonly used by LLMs (including ChatGPT) is Common Crawl, which is created by a non-profit organisation and built by crawling the web at large. So, if you’re genuinely determined to exclude your site from LLMs, you need to block Common Crawl’s bot as well as the more popular crawlers.

Granting access to your website content can assist in ensuring that your brand is accurately and favourably portrayed to ChatGPT users. Blocking it may actually have the opposite effect if you’re trying to avoid being misrepresented online. 

All said and done, let’s say you bend over backwards to block every known crawler belonging to and contributing to AI LLMs. You’re safe, right? Wrong. Your website has almost certainly already been crawled and incorporated into existing datasets like Common Crawl’s. And, at present, there’s no way of removing your website content from these datasets. It all feels rather futile.

A final word

The rapidly evolving world of AI is intimidating, and whether or not you decide to attempt to block LLMs from your site, we’d recommend genning up on the subject. Whatever decision you make should be active and informed by up-to-date knowledge.

A not insignificant 32.9% of the top 1,000 websites on the internet have elected to block the GPTBot. However, for many the growing opportunity presented by AI, combined with the futility of trying to resist the tide, has led to the decision that blocking AI is not the right move. At least for now. 

Categories: News, Strategy


