Links
Robots Dot Text
Categories
Robots Dot Text
Robots dot text allows you to create a robots.txt file dynamically using Ruby. This gives you the option to create directives which are updated dynamically as your site's content changes.
...It's also a bit more fun, it allows you to use your routes to specify directives, it saves you from having to remember the names of all of the various crawlers and the syntax for robots.txt files.
You also have the option to log every request made to /robots.txt in a separate log file called user_agents.log - Perfect for spider-spotters
The user_agents.log is written in csv(ish) format so you can change the extension and upload it to a database or spreadsheet of your choice.
All feedback welcome: dr_gavin@hotmail.com
Setup Instructions
script/plugin install http://github.com/GavinM/robots dot text.git
remove robots.txt from your /public directory
create a controller for your robots with an action called index
script/generate controller robots indexadd a route in routes.rb to your robots index action:
map.connect "robots.txt", :controller => "robots"
See the examples below for implementation.
Examples:
Simple example
class RobotsController < ActionController::Base
def index
respond_to do |format|
format.text do
log_user_agent # adds the crawler's user_agent to user_agents.log
@page_content = robots dot text do |rules|
rules.comment "Tell all crawlers to keep out of these pages"
rules.add :all, admin_path, customers_path, log_path
rules.br
rules.sitemap sitemap_url
end
render :text => @page_content, :layout => false
end
end
end
end
will render:
# Tell all crawlers to keep out of these pages
User-agent: *
Disallow: /admin
Disallow: /customers
Disallow: /log
Sitemap: http://handyrailstips.com/sitemap.xml
Complex Example
class RobotsController & ActionController::Base
def index
respond_to do |format|
format.txt do
log_user_agent(:short, logger) # :short is the datetime format, logger specifies to use Rails.logger instead
@page_content = robots dot text do |rules|
rules.add :all
rules.sitemap sitemap_url, google_news_sitemap_url
rules.br
rules.comment "Google ignores most directives so here are some rules for Google"
rules.add [:google, :google_image, :google_mobile]
rules.allow article_path("*")
rules.block articles_path
rules.line_break
rules.comment "These crawlers respect the Crawl-delay directive"
rules.add [:yahoo, :msn, :cuil, :ask], private_path, admin_path
rules.rate "1/500s"
rules.delay 10
rules.comment <<-END
Request robots only crawl between 2am and 8am.
(Those are our quiet times)
END
rules.visit_time "0200", "0800"
end
render :text => @page_content, :layout => false
end
end
end
end
will render:
User-agent: *
Sitemap: http://handyrailstips.com/sitemap.xml
Sitemap: http://handyrailstips.com/google_news_sitemap.xml
# Google ignores most directives so here are some rules for Google
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-Mobile
Allow: /articles/*
Disallow: /articles
# These crawlers respect the Crawl-delay directive
User-agent: Slurp
User-agent: MSNBot
User-agent: Twiceler
User-agent: Teoma
Disallow: /private
Disallow: /admin
Request-rate: 1/500s
Crawl-delay: 10
# Request robots only crawl between 2am and 8am.
# (Those are our quiet times)
Visit-time: 0200-0800
Extras
To turn off sessions for robots, add turn_off_sessions_for_robots to your application controller. (not required from Rails >=2.3)
To log the user-agent in the current request header, call log_user_agent in your controller. You can use this in your respond_to block to log the user-agent of any robots hitting /robots.txt
Check out the RDocs - /doc/index.html
For more info on crawler user-agents check out this page
Copyright © 2009 Gavin Morrice, released under the MIT license
Vitals
| Home | http://handyrailstips.com/robots.txt |
|---|---|
| Repository | git://github.com/GavinM/robots_dot_text.git |
| License | Rails' (MIT) |
| Tags |
bot crawler
|
| Rating | (1 vote) |
| Owner | Gavin Morrice |
| Created | 20 June 2009 |

