Guidelines for site/page optimisation when using a Site Accelerator
Depending on the mode of Site Acceleration (fully fronting a site or serving assets only), there can be SEO implications. The guidelines below are recommended to prevent search engines such as Google from indexing the accelerated site domain in preference to your actual site domain.
- Make sure you have <link rel="canonical" href="http://www.yoursite.com"/> on the homepage and other pages, replacing www.yoursite.com with your actual site domain (see http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139066). Many CMS platforms have plugins that can place the correct canonical tag on each of your web pages. A sample snippet is shown after this list.
- Configure your webserver to serve a custom robots.txt for requests to the site accelerator domain. The custom robots.txt should contain a rule that tells indexing bots not to index the accelerated site domain - an example is here: http://stage.metacdn.com/robots.txt
- Use absolute links on your pages instead of relative ones - e.g. links on your pages should be of the form http://www.yoursite.com/page.html, not /page.html.
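For reference, the canonical tag from the first guideline sits in the <head> of each page and points at that page's preferred URL. A minimal illustration (www.yoursite.com and page.html are placeholders - use each page's own preferred address):

<head>
  <!-- Tells search engines that this URL is the authoritative copy of the page -->
  <link rel="canonical" href="http://www.yoursite.com/page.html"/>
</head>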
Serving a custom robots.txt
If you use Apache, you can set a custom rule in your .htaccess file to ensure that a non-permissive robots.txt is delivered when requested via your accelerated URL (e.g. yoursite.sa.metacdn.com). This will ensure Google or other search engines do not mistakenly index your website using this URL instead of your canonical URL (i.e. www.yoursite.com). This works by maintaining two robots files:
- robots.txt that is served when a search engine accesses your site normally (e.g. via your preferred URL, www.yoursite.com).
- robots_cdn.txt that is served when a search engine accesses your site via your accelerated URL (e.g. yoursite.sa.metacdn.com).
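The robots_cdn.txt file should deny all crawling, so that nothing served from the accelerated domain gets indexed. The standard deny-all robots file looks like this:

User-agent: *
Disallow: /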
One of the following rule blocks needs to be present in the .htaccess file in the root of your website (together with the RewriteEngine on line). Which block you need depends on the tier of Site Accelerator you currently use, so choose the one that matches your site.
# This serves a custom robots.txt to the CDN subdomain
RewriteEngine on

# Tier identified by "ECAcc" in the Via header
RewriteCond %{HTTP:Via} ECAcc
RewriteRule ^robots\.txt$ robots_cdn.txt [NC,L]

# Tier identified by "CloudFront" in the Via header
RewriteCond %{HTTP:Via} CloudFront
RewriteRule ^robots\.txt$ robots_cdn.txt [NC,L]

# Tier identified by "ECS" in the Via header
RewriteCond %{HTTP:Via} ECS
RewriteRule ^robots\.txt$ robots_cdn.txt [NC,L]

# Tier identified by the edge servers' source addresses (92.60.240.208-215)
RewriteCond %{REMOTE_ADDR} ^92\.60\.240\.20[8-9] [OR]
RewriteCond %{REMOTE_ADDR} ^92\.60\.240\.21[0-5]
RewriteRule ^robots\.txt$ robots_cdn.txt [NC,L]

# Tier identified by the edge servers' source addresses (108.161.176.0 - 108.161.191.255)
RewriteCond %{REMOTE_ADDR} ^108\.161\.17[6-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^108\.161\.18[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^108\.161\.19[0-1]\.
RewriteRule ^robots\.txt$ robots_cdn.txt [NC,L]
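You can sanity-check the Via-header blocks from the command line by sending a spoofed Via header straight at your origin. The header value below is only an illustration - any value containing the matching token will trigger the rewrite. (The IP-based blocks cannot be tested this way, since REMOTE_ADDR cannot be spoofed with a header.)

# Should return the contents of robots_cdn.txt
curl -H "Via: 1.1 ECAcc" http://www.yoursite.com/robots.txt

# Should return your normal robots.txt
curl http://www.yoursite.com/robots.txt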
IIS example
For Microsoft's IIS server, you can translate the above to something like the following, using the URL Rewrite module in the <system.webServer> section of your web.config. You should also add the remote-address rules above, but we're not familiar enough with IIS to be able to write those rules ourselves (anyone capable, feel free to submit the necessary match rules) - an untested sketch is included after the example below.
<rewrite>
  <rules>
    <rule name="Robots ECS">
      <match url="^robots\.txt$" ignoreCase="true" />
      <conditions>
        <!-- %{HTTP:Via} in Apache corresponds to the HTTP_VIA server variable in IIS -->
        <add input="{HTTP_VIA}" pattern="ECS" />
      </conditions>
      <action type="Rewrite" url="robots_cdn.txt" />
    </rule>
    <rule name="Robots ECAcc">
      <match url="^robots\.txt$" ignoreCase="true" />
      <conditions>
        <add input="{HTTP_VIA}" pattern="ECAcc" />
      </conditions>
      <action type="Rewrite" url="robots_cdn.txt" />
    </rule>
    <rule name="Robots CloudFront">
      <match url="^robots\.txt$" ignoreCase="true" />
      <conditions>
        <add input="{HTTP_VIA}" pattern="CloudFront" />
      </conditions>
      <action type="Rewrite" url="robots_cdn.txt" />
    </rule>
  </rules>
</rewrite>
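As a starting point for the missing remote-address rules, the sketch below combines both IP ranges from the Apache example into a single rule, using the URL Rewrite module's "match any" condition grouping to mirror the [OR] flags in .htaccess. It is untested - treat it as an assumption to verify (and split it into one rule per tier if you only want to match your own tier's range):

<rule name="Robots CDN IP ranges">
  <match url="^robots\.txt$" ignoreCase="true" />
  <!-- MatchAny ORs the conditions together, like [OR] in .htaccess -->
  <conditions logicalGrouping="MatchAny">
    <add input="{REMOTE_ADDR}" pattern="^92\.60\.240\.20[8-9]" />
    <add input="{REMOTE_ADDR}" pattern="^92\.60\.240\.21[0-5]" />
    <add input="{REMOTE_ADDR}" pattern="^108\.161\.17[6-9]\." />
    <add input="{REMOTE_ADDR}" pattern="^108\.161\.18[0-9]\." />
    <add input="{REMOTE_ADDR}" pattern="^108\.161\.19[0-1]\." />
  </conditions>
  <action type="Rewrite" url="robots_cdn.txt" />
</rule>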