5 May 2023
If you own a website or manage its content, you’ve likely heard of robots.txt. It’s a file that instructs search engine robots on how to crawl and index your website’s pages. Despite its importance in search engine optimization (SEO), many website owners overlook the significance of a well-designed robots.txt file.
In this complete guide, we’ll explore what robots.txt is, why it’s important for SEO, and how to create a robots.txt file for your website.
Robots.txt is a plain text file located in the root directory of a website that tells search engine robots (also known as crawlers or spiders) which pages or sections of the site should or should not be crawled. It typically lists the directories, files, or URLs that the webmaster wants to block from crawling or indexing.
This is what a robots.txt file looks like:
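A minimal example, with placeholder paths and a placeholder sitemap URL, might be:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.example.com/sitemap.xml

Here, all bots are asked to stay out of the /wp-admin/ directory except for the admin-ajax.php file, and the sitemap location is listed at the end.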
There are three main reasons why robots.txt is important for your website:
“Crawl budget” refers to the number of pages Google will crawl on your site in a given period of time. That number depends on your site’s size, health, and the number of backlinks pointing to it.
Crawl budget matters because if the number of pages on your site exceeds it, some of your pages won’t be indexed.
Furthermore, pages that are not indexed will not rank for anything.
By using robots.txt to block useless pages, Googlebot (Google’s web crawler) may spend more of your crawl budget on pages that matter.
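As a rough sketch, assuming your low-value pages are tag archives and thank-you pages (both paths here are hypothetical), the rules might look like this:

User-agent: Googlebot
Disallow: /tag/
Disallow: /thank-you/

With these pages blocked, Googlebot can spend its visits on the content you actually want in search results.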
You probably have pages on your site that you don’t want indexed.
For example, you might have an internal search results page or a login page. These pages need to exist. However, you don’t want random people to land on them.
In this case, you’d use robots.txt to prevent search engine crawlers and bots from accessing certain pages.
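For example, for a hypothetical site with an internal search results page under /search/ and a login page at /login/, the rules might look like this:

User-agent: *
Disallow: /search/
Disallow: /login/

The pages still exist and visitors can reach them directly; crawlers are simply asked to stay away.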
Sometimes you will want Google to exclude resources such as PDFs, videos, and images from search results.
Possibly you want to keep those resources private, or you want Google to focus more on important content.
In such cases, robots.txt is a simple way to keep them from being crawled and indexed.
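For example, assuming your PDFs live in a hypothetical /downloads/ directory and your videos in a /videos/ directory, the rules might look like this:

User-agent: *
Disallow: /downloads/
Disallow: /videos/

Keep in mind that a disallowed URL can still show up in search results if other sites link to it, so treat this as a strong hint rather than a guarantee.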
Robots.txt files instruct search engine bots which pages or directories of the website they should or should not crawl or index.
While crawling, search engine bots find and follow links. This process leads them from site X to site Y to site Z over billions of links and websites.
When a bot visits a site, the first thing it does is look for a robots.txt file.
If it detects one, it will read the file before doing anything else.
For example, suppose you want to allow all bots except DuckDuckGo to crawl your site:
User-agent: DuckDuckBot
Disallow: /
Note: A robots.txt file can only give instructions; it cannot impose them. It’s similar to a code of conduct. Good bots (such as search engine bots) will follow the rules, whereas bad bots (such as spam bots) will ignore them.
The robots.txt file, like any other file on your website, is hosted on your server.
You can access the robots.txt file of any website by entering the complete URL of the homepage and then adding /robots.txt at the end, such as https://pickupwp.com/robots.txt.
However, if the website does not have a robots.txt file, you will receive a “404 Not Found” error message.
Before showing how to create a robots.txt file, let’s first look at the robots.txt syntax.
The syntax of a robots.txt file can be broken down into the following components: a User-agent line that names which crawler the rules apply to, Disallow and Allow lines that block or permit specific paths, an optional Crawl-delay, and an optional Sitemap line that points to your XML sitemap.
Here is an example of a robots.txt file:
User-agent: Googlebot
Disallow: /private/
Allow: /public/
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
Note: Robots.txt files are case-sensitive, so make sure to use the correct case when specifying URLs.
For example, /public/ is not the same as /Public/.
Directives like “Allow” and “Disallow”, on the other hand, are not case-sensitive, so it’s up to you whether to capitalize them.
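To illustrate both points with a hypothetical example: the first rule below blocks /Private/ but not /private/, while the lowercase directive on the second line is still understood.

Disallow: /Private/
disallow: /temp/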
After learning about robots.txt syntax, you can create a robots.txt file using a robots.txt generator tool or create one yourself.
Here is how to create a robots.txt file in just four steps:
Simply open a .txt document with any text editor.
Next, name the document robots.txt; it must have exactly this name to work.
Once that’s done, you can start typing directives.
A robots.txt file contains one or more groups of directives, each with multiple lines of instructions.
Each group starts with a “User-agent” line and contains the following data: which bot the group applies to, which directories or files that bot can access, and which directories or files it cannot access.
Lines that do not match any of these directives are ignored by crawlers.
For example, suppose you want to prevent Google from crawling your /private/ directory.
It would look like this:
User-agent: Googlebot
Disallow: /private/
If you had further instructions for Google, you’d put each one on a separate line directly below, like this:
User-agent: Googlebot
Disallow: /private/
Disallow: /not-for-google
Once you’re done with Google’s specific instructions, you can create a new group of directives for other crawlers.
For example, if you wanted to prevent all search engines from crawling your /archive/ and /support/ directories, it would look like this:
User-agent: Googlebot
Disallow: /private/
Disallow: /not-for-google
User-agent: *
Disallow: /archive/
Disallow: /support/
When you’re finished, you can add your sitemap.
Your completed robots.txt file should look like this:
User-agent: Googlebot
Disallow: /private/
Disallow: /not-for-google
User-agent: *
Disallow: /archive/
Disallow: /support/
Sitemap: https://www.example.com/sitemap.xml
Next, save your robots.txt file. Remember, it must be named robots.txt.
For more useful robots.txt rules, check out this helpful guide from Google.
After saving your robots.txt file to your computer, upload it to your website and make it available for search engines to crawl.
Unfortunately, there is no tool that can help with this step.
How you upload the robots.txt file depends on your site’s file structure and web hosting.
For instructions on how to upload your robots.txt file, search online or contact your hosting provider.
After you’ve uploaded the robots.txt file, you can check whether it is publicly visible and whether Google can read it.
Simply open a new tab in your browser and navigate to your robots.txt file.
For example, https://pickupwp.com/robots.txt.
If you see your robots.txt file, you’re ready to test the markup.
For this, you can use a Google robots.txt Tester.
Note: You must have a Search Console account set up to test your robots.txt file using the robots.txt Tester.
The robots.txt Tester will find any syntax warnings or logic errors, highlight them, and list them below the editor.
You can edit errors or warnings on the page and retest as often as necessary.
Just keep in mind that changes made on the page aren’t saved to your site.
To apply any changes, copy the edited version and paste it into the robots.txt file on your site.
Keep these best practices in mind while creating your robots.txt file to avoid some common mistakes.
To prevent confusion for search engine crawlers, add each directive to a new line in your robots.txt file. This applies to both Allow and Disallow rules.
For example, if you don’t want a web crawler to crawl your blog or contact page, add the following rules:
Disallow: /blog/
Disallow: /contact/
Bots won’t have any problem if you use the same user agent over and over.
However, using it just once keeps things organized and reduces the chance of human error.
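Carrying on the blog and contact example from above, you could technically write:

User-agent: *
Disallow: /blog/
User-agent: *
Disallow: /contact/

but grouping the rules under a single user agent is cleaner:

User-agent: *
Disallow: /blog/
Disallow: /contact/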
If you have a large number of pages to block, adding a rule for each one might be time-consuming. Fortunately, you may use wildcards to simplify your instructions.
A wildcard is a character that can represent one or more characters. The most commonly used wildcard is the asterisk (*).
For instance, if you want to block all files that end in .jpg, you would add the following rule:
Disallow: /*.jpg
The dollar sign ($) is another wildcard that may be used to identify the end of a URL. This is useful if you want to restrict a specific page but not the ones after it.
Suppose you want to block the contact page but not the contact-success page, you would add the following rule:
Disallow: /contact$
Everything that begins with a hash (#) is ignored by crawlers.
As a result, developers often use the hash to add comments to the robots.txt file. It keeps the document organized and readable.
For example, if you want to prevent all files ending with .jpg, you may add the following comment:
# Block all files that end in .jpg
Disallow: /*.jpg
This helps anyone understand what the rule is for and why it’s there.
If you have a website that has multiple subdomains, it is recommended to create an individual robots.txt file for each one. This keeps things organized and helps search engine crawlers grasp your rules more easily.
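For example, a site with hypothetical blog and shop subdomains would serve a separate file from each:

# https://blog.example.com/robots.txt
User-agent: *
Disallow: /drafts/

# https://shop.example.com/robots.txt
User-agent: *
Disallow: /cart/

Each file only applies to the subdomain it is served from.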
The robots.txt file is a useful SEO tool since it instructs search engine bots on what to crawl and index and what to skip.
However, it’s important to use it with caution, since a misconfiguration (e.g., using Disallow: /) can result in your entire website being deindexed.
Generally, the best approach is to allow search engines to crawl as much of your site as possible while keeping sensitive information private and avoiding duplicate content. For example, you can use the Disallow directive to block specific pages or directories, or the Allow directive to override a Disallow rule for a particular page.
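A small sketch, with a hypothetical /private/ directory and report page:

User-agent: *
Disallow: /private/
Allow: /private/annual-report.html

Because the more specific (longer) rule wins, the single report page stays crawlable while the rest of the directory is blocked.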
It’s also worth mentioning that not all bots follow the rules provided in the robots.txt file, so it’s not a perfect method for controlling what gets indexed. But it’s still a valuable tool to have in your SEO strategy.
We hope this guide helps you learn what a robots.txt file is and how to create one.
For more, you can check out these other helpful resources:
Lastly, follow us on Facebook and Twitter for regular updates on new articles.