© Copyright Acquisition International 2024 - All Rights Reserved.

Posted 15th April 2024

Should You Block AI Bots from Crawling Your Website?

Did you know AIs like ChatGPT could be crawling your site for data? The rise of AI large language models (LLMs) like ChatGPT and Bard (now called Gemini) has raised a question for businesses: should you block or allow AI bots like ChatGPT’s GPTBot to crawl your site? As AI is a rapidly developing technology, it’s a question businesses might not have thought they would need to consider in 2024, but one that should be high on the agenda. With sites like Amazon choosing to block ChatGPT, it’s clear that we should all be considering whether this is the right move.

According to SEO and PPC agency MRS Digital, businesses should make a careful decision on whether they want to block AI or not. On the one hand, blocking AI could help prevent risks, such as your content being unintentionally misrepresented. On the other, could you be missing a world of opportunity presented by this seemingly unstoppable technology shift? 

Quick recap: What are crawlers?  

A crawler is essentially a tool that is typically operated by search engines like Google or Bing to review your website and index data from it, like the content you’ve written and information about your company, ensuring your website appears in search results. It’s how search engines like Google discover and understand your site, so the concept of a crawler is nothing new. As a website owner you can decide which parts of your website you want crawlers to be able to view and index in search results by making use of robots.txt files.  
 
AI crawlers use the same technology, but instead of simply indexing your website data, they review the information on your site and can utilise it to train their own technology (large language models).

What AI chatbots could crawl your site?  

While most people will have heard of ChatGPT and Bard (now called Gemini), there are other lesser-known AI crawlers out there.  

The main AI crawlers to be aware of are:

  • ChatGPT-User. This is used by ChatGPT when a user on GPT-4 directs the bot to your site in a prompt like “tell me how many times [SITE URL] mentions AI”.
  • GPTBot. This is OpenAI’s crawler that collects data from your site to use as training data for its AI models.
  • Google-Extended. This is the token that controls whether Google can use your site’s data for its AI products, including Gemini (previously called Bard), its AI chatbot.
  • anthropic-ai. Anthropic has a range of AI tools, including Claude, its AI chatbot, and its crawler collects the data for this.
  • CCBot. This is the Common Crawl bot, and its data is part of what GPT-3 was trained on. It’s designed to make web data accessible to everyone, without any fees.
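
To see which of these crawlers are already visiting your site, one option is to look for their names in the user-agent strings recorded in your server logs. The snippet below is a minimal sketch of that idea, not a definitive tool: the function names and the sample user-agent strings are illustrative assumptions, and real strings will vary. Google-Extended is left out on the assumption that it is a robots.txt control token rather than a user agent that shows up in requests.

```python
from collections import Counter
from typing import Iterable, Optional

# Crawler names from the list above, matched case-insensitively.
AI_CRAWLER_TOKENS = ("ChatGPT-User", "GPTBot", "anthropic-ai", "CCBot")


def identify_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the AI crawler name found in a user-agent string, if any."""
    lowered = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def count_ai_crawler_hits(user_agents: Iterable[str]) -> Counter:
    """Count how many requests came from each known AI crawler."""
    hits = Counter()
    for user_agent in user_agents:
        token = identify_ai_crawler(user_agent)
        if token is not None:
            hits[token] += 1
    return hits


# Illustrative user-agent strings (assumed for this example only).
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/124.0",
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
]

if __name__ == "__main__":
    for name, hits in count_ai_crawler_hits(SAMPLE_USER_AGENTS).items():
        print(f"{name}: {hits} request(s)")
```

In practice you would feed this the user-agent field from each line of your access log. Bear in mind that user-agent strings can be faked, so treat the counts as a rough indication only.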

Why would you block AI crawlers? 

Blocking AI crawlers might be the right decision for you, especially if you’re concerned about your content being misrepresented, or your site is in development.  

Misrepresented content 

When humans write content, we write with nuance and there may be cultural or business context that means what you write makes sense to a specific audience. When your content is taken out of that context and used to form part of an AI chatbot’s answer, it will most likely lose the nuance, and the point your content made may be lost or misrepresented entirely. For some companies, that isn’t a risk they want to run, and so they block AI crawlers to prevent this. For example, if you were a medical company that had specific advice pertaining to one of your products, you wouldn’t want an AI to apply that advice out of context to an unrelated product or medical query.

Unwanted association 

AI crawlers tend to take sections of information from varying websites without always understanding the context of that information, so there is a risk that your information may be presented next to additional sources that your business doesn’t want to be associated with. If this is a concern, you may want to block AI crawlers. This will help stop your company from being mixed in with competitors, or those in your industry who may not uphold best practice. For some companies where reputation management is an issue, or has been historically, this could be a very strong argument.

Data Scraping 

It’s best practice to block any crawlers from viewing parts of your site you don’t want them to see. For example, you might have a staff wellbeing portal on your intranet or customer logins on your website. You don’t want these crawled as they contain personally identifiable information, something your customers or employees definitely don’t want an AI company to have! OpenAI says that pages crawled by GPTBot are “filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies.” Most websites will already have these areas blocked for search engine crawlers, so it’s worth speaking to your hosting provider or SEO team to see if they can add AI crawlers to the same rules.

Spam Generation 

As technology evolves, so do cybercriminals. We’re now seeing more sophisticated phishing emails and malicious links being sent thanks to AI-generated content. By combining AI-powered chatbots like ChatGPT with data harvested from your site, malicious actors can create spam emails that are more realistic than ever and that more closely imitate your employees or the company itself. This could ultimately lead to more successful phishing attempts, which can cause financial and reputational loss for your business.

How to block the ChatGPT crawler: 

1. robots.txt 

A robots.txt file will usually already be present on your site. It’s simply a matter of updating it to exclude the pages you want to block any AI crawler from viewing. Doing this tells compliant crawlers to stay away from sensitive data and content that should not be public knowledge, although robots.txt is a request rather than an access control, so anything truly private should also sit behind a login. Robots.txt files done wrong can cause your site to no longer be seen by Google and other search engines, so it’s best to proceed with caution here. If you have an SEO agency, check in with them before you do this as they may be able to help you.

You can also ask them to crawl at a certain speed, although not every crawler honours this. If you want them to crawl some, but not all, of your site, for example everything except admin areas, this is possible too. Different businesses will have varying reasons to block crawlers, or not block them at all.
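
As a rough illustration of the robots.txt approach, the sketch below holds an example robots.txt in a Python string and uses the standard library’s `urllib.robotparser` to check what it would allow. The rules themselves are assumptions for this example: they block the AI crawlers named earlier from the whole site, keep every other crawler out of a hypothetical `/admin/` area, and request a ten second crawl delay. `Crawl-delay` is a non-standard directive, so treat it as a request that some crawlers ignore.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption for illustration, not a recommendation).
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: anthropic-ai
User-agent: CCBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text so its rules can be queried locally."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        for path in ("/blog/post", "/admin/login"):
            verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
            print(f"{agent:10} {path:14} {verdict}")
```

Checking the rules like this before uploading the file is one way to confirm that search engine crawlers such as Googlebot can still reach your public pages while the AI crawlers are excluded.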

2. Web Application Firewall (WAF) 

You can also use a WAF to block the crawler(s) as well as any unwanted traffic to your site. You’ll be able to keep your site up and running for your customers without hindering their experience on your website.
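
A real WAF is configured through your provider’s own rules, but the underlying idea can be sketched in a few lines: inspect each request’s user-agent header and refuse the ones that match a block list. The example below is a simplified stand-in written as WSGI middleware, not how any particular WAF product works. The names `block_ai_crawlers`, `hello_app` and `BLOCKED_TOKENS` are hypothetical, and because user-agent strings can be faked, this only stops crawlers that identify themselves honestly.

```python
# Lowercase user-agent fragments to refuse (assumed block list for this example).
BLOCKED_TOKENS = ("gptbot", "ccbot", "anthropic-ai")


def block_ai_crawlers(app):
    """Wrap a WSGI app so requests from listed crawlers get a 403 response."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        # Everyone else reaches the site as normal.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """A stand-in for your real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, customer"]


protected_app = block_ai_crawlers(hello_app)

if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serve locally on port 8000 for manual testing.
    with make_server("127.0.0.1", 8000, protected_app) as server:
        server.serve_forever()
```

Ordinary visitors pass straight through to the site, which is the point made above about not hindering the customer experience.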

So, is it really worth blocking AI crawlers?  

When considering whether to block ChatGPT and similar crawlers, there’s more to ponder over than the downsides alone. 

In November 2023, ChatGPT hit 100 million users per week. With this figure likely to grow, that’s a great deal of brand visibility you’re missing out on if you refuse to embrace this technology.  

LLMs are the future of search. Did nobody tell you yet? Bing has already embraced AI in the form of Microsoft’s Copilot, and Google is hot on its heels, recently moving the testing of its own AI-powered search, Google Search Generative Experience (SGE), into the main Google search results. This means that if you’ve ever relied on organic search or SEO for a portion of your business generation, blocking AI could seriously hamper your efforts, if not now, then in the near future.

There’s even a branch of SEO forming known as Generative Engine Optimisation (GEO) that focuses on improving visibility on popular LLMs like ChatGPT. Again, this may be an emerging acquisition channel that you’re missing out on if you block AI crawlers. 

You should also consider how effective trying to block LLMs from your site really is. First, you must look beyond the big-name AIs. Blocking ChatGPT alone won’t cut it. Large language models like this are trained on a range of different datasets, drawn from sources like Wikipedia and Reddit.

One of the datasets most commonly used by LLMs (including ChatGPT) is Common Crawl, which is created by a non-profit organisation that crawls much of the public web. So, if you’re genuinely determined to exclude your site from LLMs, then you need to block bots like Common Crawl’s as well as the more popular crawlers.

Granting access to your website content can assist in ensuring that your brand is accurately and favourably portrayed to ChatGPT users. Blocking it may actually have the opposite effect if you’re trying to avoid being misrepresented online. 

All said and done, let’s say you bend over backwards to block every known crawler belonging to and contributing to AI LLMs. You’re safe, right? Wrong. Your website has almost certainly already been crawled and incorporated into existing datasets like Common Crawl’s. And, at present, there’s no way of removing your website content from these datasets. It all feels rather futile.

A final word

The rapidly evolving world of AI is intimidating, and whether you decide to attempt to block LLMs from your site or not, we’d recommend genning up on the subject. Whatever decision you make should be active and informed by up-to-date knowledge.

A not insignificant 32.9% of the top 1,000 websites on the internet have elected to block the GPTBot. However, for many the growing opportunity presented by AI, combined with the futility of trying to resist the tide, has led to the decision that blocking AI is not the right move. At least for now. 

Categories: News, Strategy

