Serve a different robots.txt file for every site hosted in the same directory

Problem:


We have a global brand website project, of which we are handling only the LATAM portion. The setup allows a single website installation to serve several ccTLDs, in order to reduce costs.



Because of this, the robots.txt at www.domain.com/robots.txt is the same file as the one at www.domain.com.ar/robots.txt.



We would like to implement custom robots.txt files for each LATAM country locale (AR, CO, CL, etc.). One solution we are considering is placing a 301 redirect from www.domain.com.ar/robots.txt to www.domain.com.ar/directory/robots.txt.



This way we could have custom robots.txt files for each country locale.




  1. Does this make sense?

  2. Is it possible to redirect a robots.txt file to another robots.txt file?

  3. Any other suggestions?



Thanks in advance for any input you might have.


Solution:

I wouldn't count on all spiders being able to follow a redirect to get to a robots.txt file. See: Does Google respect a redirect header for robots.txt to a different file name?



Assuming you are hosted on an Apache server, you could use mod_rewrite from your .htaccess file to serve the correct file for the correct domain:



RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.([a-z.]+)$
RewriteRule ^robots\.txt$ /%1/robots.txt [L]


In that case, your robots.txt file for your .cl domain would be in /cl/robots.txt, and your .com.au robots.txt file would be in /com.au/robots.txt.
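Under that rule, the shared document root would hold one subdirectory per TLD suffix. A sketch of the resulting layout (the directory names are illustrative):

```
/                       <- shared document root, with the .htaccess above
    cl/robots.txt       <- served for www.example.cl
    com.ar/robots.txt   <- served for www.example.com.ar
    com.au/robots.txt   <- served for www.example.com.au
```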



While this should work, it has a few potential drawbacks:




  • Every crawler has to do two HTTP requests: one to discover the redirect, and another one to actually fetch the file.


  • Some crawlers might not handle the 301 response for robots.txt correctly; there's nothing in the original robots.txt specification that says anything about redirects, so presumably they should be treated the same way as for ordinary web pages (i.e. followed), but there's no guarantee that all the countless robots that might want to crawl your site will get that right.



    (The 1997 Internet Draft does explicitly say that "[o]n server response indicating Redirection (HTTP Status Code 3XX) a robot should follow the redirects until a resource can be found", but since that was never turned into an official standard, there's no real requirement for any crawlers to actually follow it.)




Generally, it would be better to simply configure your web server to return different content for robots.txt depending on the domain it's requested for. For example, using Apache mod_rewrite, you could internally rewrite robots.txt to a domain-specific file like this:



RewriteEngine On
RewriteBase /

RewriteCond %{HTTP_HOST} ^(www\.)?domain(\.com?)?\.([a-z][a-z])$
RewriteCond %{DOCUMENT_ROOT}/robots_%3.txt -f
RewriteRule ^robots\.txt$ robots_%3.txt [NS]


This code, placed in an .htaccess file in the shared document root of the sites, should rewrite any requests for e.g. www.domain.com.ar/robots.txt to the file robots_ar.txt, provided that it exists (that's what the second RewriteCond checks). If the file does not exist, or if the host name doesn't match the regexp, the standard robots.txt file is served by default.



(The host name regexp should be flexible enough to also match URLs without the www. prefix, and to also accept the 2LD co. instead of com. (as in domain.co.uk) or even just a plain ccTLD after domain; if necessary, you can tweak it to accept even more cases. Note that I have not tested this code, so it could have bugs / typos.)
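For illustration, a per-country file such as robots_ar.txt might look like this (the Disallow path and sitemap URL are placeholders, not taken from the question):

```
User-agent: *
Disallow: /checkout/

Sitemap: https://www.domain.com.ar/sitemap.xml
```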



Another possibility would be to internally rewrite requests for robots.txt to (e.g.) a PHP script, which can then generate the content of the file dynamically based on the host name and anything else you want. With mod_rewrite, this could be accomplished simply with:



RewriteEngine On
RewriteBase /

RewriteRule ^robots\.txt$ robots.php [NS]


(Writing the actual robots.php script is left as an exercise.)
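As a rough sketch of what such a dynamic generator could do — shown here in Python purely to illustrate the logic, since the same idea ports directly to PHP; the host-to-country mapping and the sample rules are assumptions, not from the question:

```python
import re

# Hypothetical default rules for www.domain.com or any unmatched host.
DEFAULT_RULES = "User-agent: *\nDisallow:\n"

def robots_for_host(host: str) -> str:
    """Return robots.txt content for a host like www.domain.com.ar.

    Extracts a trailing two-letter ccTLD (ar, cl, co, ...); anything
    else, such as a plain .com host, falls back to the default rules.
    """
    m = re.search(r"\.([a-z]{2})$", host)
    if m:
        cc = m.group(1)  # e.g. "ar" for www.domain.com.ar
        # Placeholder per-country rules; real rules would differ.
        return f"User-agent: *\nDisallow: /staging-{cc}/\n"
    return DEFAULT_RULES

if __name__ == "__main__":
    # In a real handler, the host would come from the request headers.
    print(robots_for_host("www.domain.com.ar"), end="")
```

A PHP version would do the same: read the host from `$_SERVER['HTTP_HOST']`, send a `text/plain` content type, and echo the matching rules.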

