How to resolve typical robots.txt issues?

WebDev March 17, 2024
On this page
  1. What is a robots.txt file?
  2. What functions does robots.txt serve?
  3. Common errors in robots.txt and how to fix them
  4. Misplacing Robots.txt outside the root directory
  5. Misuse of wildcards
  6. Noindex vs robots.txt
  7. Blocking scripts and stylesheets
  8. Absence of sitemap in robots.txt
  9. Overreliance on absolute URLs
  10. Usage of deprecated or unsupported elements
  11. How to resolve issues with robots.txt?

Robots.txt is a vital tool for telling search engine crawlers how to navigate your website, and it plays a key role in technical SEO. It is not all-powerful; as Google notes, its main purpose is to keep crawlers from overloading your server. Using it correctly, especially on sites with dynamic URLs, is essential. This post covers common robots.txt issues, how they affect your site and its search presence, and how to fix them. But first, let's briefly look at what robots.txt does and what alternatives exist.

What is a robots.txt file?

Robots.txt is a plain text file placed in your website's root directory, meaning it must sit at the topmost level; if it is placed in a subdirectory, search engines will ignore it. Despite its significant capabilities, a basic robots.txt file is straightforward to create and can be put together in minutes with a tool as simple as Notepad, and you can include additional comments for human readers. There are also alternative ways to achieve goals similar to those robots.txt addresses: individual pages can include a robots meta tag in their code, or the X-Robots-Tag HTTP header can be used to influence how content is displayed in search results.

What functions does robots.txt serve?

Robots.txt can produce various outcomes for different content types:

  • It can prevent webpages from being crawled; a blocked page may still appear in search results, but without a text description, and non-HTML content on it will not be indexed.
  • Media files such as images, videos, and audio can be kept out of Google search results (see the example after this list).
  • Blocking resource files such as external scripts may affect how Googlebot interprets a page's content.
  • However, robots.txt cannot entirely remove a webpage from Google search results; for that, a noindex meta tag in the page's head section is necessary.
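
For instance, a minimal robots.txt that keeps media files out of Google's image and video results might look like the sketch below. The /images/ and /videos/ paths are hypothetical placeholders for your own media directories.

User-agent: Googlebot-Image
Disallow: /images/

User-agent: Googlebot-Video
Disallow: /videos/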

Common errors in robots.txt and how to fix them

  1. Misplacing Robots.txt outside the root directory
  2. Misuse of wildcards
  3. Noindex vs robots.txt
  4. Blocking scripts and stylesheets
  5. Absence of sitemap in robots.txt
  6. Overreliance on absolute URLs
  7. Usage of deprecated or unsupported elements

When your website exhibits unusual behavior in search results, it's wise to inspect your robots.txt file for errors, syntax issues, and overly restrictive rules. Let's delve into each of these errors further and explore methods to ensure the validity of your robots.txt file.

Misplacing Robots.txt outside the root directory

Search engine bots only look for your robots.txt file in the root directory: there should be nothing but a single forward slash between your domain name and the 'robots.txt' filename. If the file sits in a subfolder, bots won't see it and will behave as if no robots.txt exists. Fixing this usually means moving the file to the root directory, which may require server access. Some content management systems upload files to a subfolder by default, so you may need to work around that to place your robots.txt file where bots can find it.
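
For example, assuming your site is served from www.example.com, crawlers will only read the file at the first of these two locations; the second uses a hypothetical /media/ subfolder and is simply ignored:

https://www.example.com/robots.txt
https://www.example.com/media/robots.txt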

Misuse of wildcards

Robots.txt utilizes two wildcard characters:

Asterisk (*) – matches any sequence of valid characters, including none, akin to a wildcard in a deck of cards.

Dollar sign ($) – signifies the end of a URL, allowing rules to be applied solely to the final part, such as the file extension.

User-agent: *
Disallow: /*.pdf$

In this example, the robots.txt file instructs all user agents (search engine crawlers) not to crawl any URL ending in ".pdf". Compliant crawlers will therefore skip the website's PDF files, although blocking crawling on its own does not guarantee that those URLs will never show up in search results.

Use wildcards sparingly, as they can restrict a far broader part of your website than intended; a misplaced asterisk can inadvertently block robot access to your entire site, as in the sketch below. Verify wildcard rules with a robots.txt testing tool to confirm they behave as expected, and take care to avoid blocking too much or allowing too much.
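
For instance, the following hypothetical rule disallows every URL on the site, because /* matches any path; if the intent was to block only PDF files, the pattern should have been /*.pdf$ as shown earlier:

User-agent: *
Disallow: /*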

Noindex vs robots.txt

This issue is prevalent on websites that have been around for several years. Google stopped honoring noindex directives in robots.txt files on September 1, 2019. If your robots.txt file was created before that date or still includes noindex instructions, it's likely that those pages will be indexed in Google's search results anyway.

To address this issue, consider employing an alternative method to prevent indexing. One solution is to use the robots meta tag, which can be added to the head section of individual webpages to instruct Google not to index them.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="robots" content="noindex">
    <title>Your Page Title</title>
</head>
<body>
    <!-- Your webpage content goes here -->
</body>
</html>
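
If you cannot edit a page's HTML, or you need to keep non-HTML files such as PDFs out of the index, the X-Robots-Tag HTTP header mentioned earlier achieves the same effect. As a minimal sketch, assuming an Apache server with mod_headers enabled, an .htaccess rule might look like this:

<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>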

Blocking scripts and stylesheets

Blocking crawler access to external JavaScript and CSS files may seem logical, but remember that Googlebot needs those files to render HTML and PHP pages correctly. If your pages look odd in Google's results or appear improperly indexed, check whether access to essential external files is blocked. The simplest fix is to remove the line in your robots.txt file that blocks access. Alternatively, if certain files must stay blocked, consider adding an exception that re-allows the necessary CSS and JavaScript (see the sketch after the example below).

User-agent: *
Disallow: /js/
Disallow: /css/

In this example, the robots.txt file instructs all user agents (search engine crawlers) not to access any files in the "/js/" and "/css/" directories, which blocks crawler access to external JavaScript and CSS files.
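
If those directories must remain blocked for other reasons, a sketch of an exception that re-allows individual files could look like the following; the filenames main.js and site.css are hypothetical placeholders for the resources your pages actually need:

User-agent: *
Disallow: /js/
Disallow: /css/
Allow: /js/main.js
Allow: /css/site.css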

Absence of sitemap in robots.txt

This primarily pertains to SEO considerations. You have the option to incorporate the URL of your XML sitemap into your robots.txt file. Given that Googlebot typically scans this location first during website crawling, it provides the crawler with early insight into your site's structure and key pages. Although omitting a sitemap isn't technically an error and doesn't directly impact your website's functionality or appearance in search results, including your sitemap URL in robots.txt can enhance your SEO endeavors.

How to add a sitemap to robots.txt?

User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml

In this example, the robots.txt file specifies that all user agents (search engine crawlers) are allowed to access every part of the website (the empty Disallow: directive blocks nothing). Additionally, it includes a line pointing to the XML sitemap at https://www.example.com/sitemap.xml, which tells search engine crawlers where to find the sitemap file for your website.

Overreliance on absolute URLs

While it's advisable to employ absolute URLs for elements like canonicals and hreflang attributes, the opposite holds true for URLs within the robots.txt file.

Opting for relative paths in the robots.txt file is the preferred method for specifying which sections of a site should be excluded from crawler access.

This information is elaborated in Google's robots.txt documentation, which explains:

A directory or page, relative to the root domain, that may be crawled by the user agent just mentioned.

When you use an absolute URL, there’s no guarantee that crawlers will interpret it as intended and that the disallow/allow rule will be followed.
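
As a brief illustration, the first rule below uses the recommended relative path, while the second uses an absolute URL that crawlers may not interpret as intended; /private/ is a hypothetical path and example.com stands in for your own domain:

User-agent: *
# Recommended: relative path
Disallow: /private/
# Risky: absolute URL may not be interpreted as intended
Disallow: https://www.example.com/private/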

Usage of deprecated or unsupported elements

Although the rules for robots.txt files have changed relatively little over time, two elements that webmasters still commonly include are:

  • Crawl-delay
  • Noindex

Bing supports crawl-delay, but Google does not, and yet webmasters often specify it anyway. Crawl rate settings could previously be configured in Google Search Console, but that feature was removed toward the end of 2023.
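
If you still want to throttle Bing, a minimal sketch of a crawl-delay rule scoped to Bingbot only (Google simply ignores the directive) could look like this, with the 10-second delay chosen as an arbitrary example value:

User-agent: Bingbot
Crawl-delay: 10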

In July 2019, Google announced that it would stop supporting the noindex directive in robots.txt files. Before that change, some webmasters placed noindex rules directly in robots.txt, but the approach was never widely supported or standardized. The preferred way to implement noindex is at the page level, using the on-page robots meta tag or the X-Robots-Tag header.

How to resolve issues with robots.txt?

If a mistake in your robots.txt file is hurting your website's search visibility, the first step is to correct the file and confirm that the new directives have the intended effect. Some SEO crawling tools can check this for you, so you don't have to wait for search engines to crawl your site again. Once you're confident in how your robots.txt behaves, get the site re-crawled as soon as possible; Google Search Console and Bing Webmaster Tools can help with this. Submit an updated sitemap and request a re-crawl of any pages that were wrongly delisted. Unfortunately, there's no way to know how long Googlebot will take to restore missing pages to its index; all you can do is take the right steps to shorten that window and keep monitoring until Googlebot has fully applied the fixed robots.txt.