In a LinkedIn post, Gary Illyes, an Analyst at Google, reiterated long-standing guidance for website owners: Use the robots.txt file to prevent web crawlers from accessing URLs that trigger actions like adding items to carts or wishlists.
Illyes highlighted a common complaint: unnecessary crawler traffic overloading servers, often stemming from search engine bots crawling URLs intended for user actions.
He wrote:
“Looking at what we’re crawling from the sites in the complaints, way too often it’s action URLs such as ‘add to cart’ and ‘add to wishlist.’ These are useless for crawlers, and you likely don’t want them crawled.”
To avoid this wasted server load, Illyes advised blocking access in the robots.txt file for URLs with parameters like “?add_to_cart” or “?add_to_wishlist.”
As an example, he suggested:
“If you have URLs like:
https://example.com/product/scented-candle-v1?add_to_cart
and
https://example.com/product/scented-candle-v1?add_to_wishlist
You should probably add a disallow rule for them in your robots.txt file.”
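Based on that example, a minimal robots.txt sketch might look like the following. This is an illustration, not a quote from Illyes’ post; it assumes the action parameters only ever appear in the query string as shown, so adjust the patterns to your own URL structure:

```
User-agent: *
# Block action URLs where the parameter appears right after "?"
Disallow: /*?add_to_cart
Disallow: /*?add_to_wishlist
# Also catch cases where other query parameters come first
Disallow: /*&add_to_cart
Disallow: /*&add_to_wishlist
```

The “*” wildcard is supported by Googlebot and most major crawlers, though it was not part of the original 1994 standard.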
While using the HTTP POST method can also prevent the crawling of such URLs, Illyes noted crawlers can still make POST requests, so robots.txt remains advisable.
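To illustrate the POST approach Illyes mentioned, here is a rough sketch using a hypothetical Flask route (the framework, route, and handler names are assumptions for illustration, not anything from his post). The cart action accepts only POST, so a crawler following an ordinary link, which issues a GET, cannot trigger it:

```python
# Hypothetical sketch: the add-to-cart action accepts only POST requests,
# so link-following crawlers (which send GET) cannot trigger it by accident.
# Requires: pip install flask
from flask import Flask, request

app = Flask(__name__)

@app.route("/product/<slug>/add_to_cart", methods=["POST"])
def add_to_cart(slug):
    # GET requests are rejected automatically with 405 Method Not Allowed,
    # because only POST is listed in `methods`.
    quantity = request.form.get("quantity", "1")
    # ...add the item to the visitor's cart here...
    return {"added": slug, "quantity": quantity}
```

Even then, as Illyes noted, some crawlers will issue POST requests, so a disallow rule for the action URLs in robots.txt remains a sensible backstop.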
Reinforcing Decades-Old Best Practices
Alan Perkins, who engaged in the thread, pointed out that this guidance echoes web standards introduced in the 1990s for the same reasons.
Quoting from a 1994 document titled “A Standard for Robot Exclusion”:
“In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren’t welcome for various reasons…robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).”
The robots.txt standard, which proposes rules for restricting access by well-behaved crawlers, emerged as a “consensus” solution among web stakeholders back in 1994.
Obedience & Exceptions
Illyes affirmed that Google’s crawlers fully obey robots.txt rules, with rare exceptions thoroughly documented for scenarios involving “user-triggered or contractual fetches.”
This adherence to the robots.txt protocol has been a pillar of Google’s web crawling policies.
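For site owners who want to confirm that rules like the sketch above behave as intended, a short script can evaluate sample URLs against them. The example below is an assumption-laden sketch: it uses the third-party “protego” parser (pip install protego), which implements Google-style wildcard matching that the standard library’s urllib.robotparser lacks, and reuses the hypothetical example.com URLs from Illyes’ post:

```python
# Check which sample URLs a well-behaved crawler would skip under these rules.
# Requires: pip install protego
from protego import Protego

ROBOTS_TXT = """
User-agent: *
Disallow: /*?add_to_cart
Disallow: /*?add_to_wishlist
Disallow: /*&add_to_cart
Disallow: /*&add_to_wishlist
"""

robots = Protego.parse(ROBOTS_TXT)

sample_urls = [
    "https://example.com/product/scented-candle-v1",
    "https://example.com/product/scented-candle-v1?add_to_cart",
    "https://example.com/product/scented-candle-v1?add_to_wishlist",
]

for url in sample_urls:
    verdict = "allowed" if robots.can_fetch(url, "Googlebot") else "disallowed"
    print(f"{verdict}: {url}")
```

The plain product URL comes back as allowed, while the two action URLs are reported as disallowed.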
Why SEJ Cares
While the advice may seem rudimentary, the re-emergence of this decades-old best practice underscores its relevance.
By leveraging the robots.txt standard, sites can keep overzealous crawlers from hogging bandwidth with unproductive requests.
How This Can Help You
Whether you run a small blog or a major e-commerce platform, following Google’s advice to leverage robots.txt for blocking crawler access to action URLs can help in several ways:
- Reduced Server Load: You can reduce needless server requests and bandwidth usage by preventing crawlers from hitting URLs that invoke actions like adding items to carts or wishlists.
- Improved Crawler Efficiency: Explicit rules in your robots.txt file about which URLs crawlers should avoid can lead to more efficient crawling of the pages and content you want indexed and ranked.
- Better User Experience: With server resources focused on actual user actions rather than wasted crawler hits, end-users will likely experience faster load times and smoother functionality.
- Stay Aligned with Standards: Implementing the guidance puts your site in compliance with the widely adopted robots.txt protocol standards, which have been industry best practices for decades.
Revisiting robots.txt directives could be a simple but impactful step for websites looking to exert more control over crawler activity.
Illyes’ messaging indicates that the ancient robots.txt rules remain relevant in our modern web environment.
Featured Image: BestForBest/Shutterstock