mattgadient.com

Joomla, SJSB, SMF, and Google Canonical to reduce duplicate content – Part 2

Check out Part 1 before going any further.

Tiny issue: after you do it, you may find that the wrapped version has “noindex” on all of its pages. Well, actually, it’ll have Joomla’s “index,follow”, but keep reading down the meta tags and you’ll find a “noindex” as well. This assumes you’ve told SJSB to pass the META tags to Joomla.

Anyway, you can fix it.

Edit your index.template.php file, and add the following:

if ($_SERVER['SERVER_NAME'] == "forums.eyeglassretailerreviews.com") {

	// Please don't index these Mr Robot.
	if (!empty($context['robot_no_index']))
		echo '
	<meta name="robots" content="noindex" />';

}

Now, the lines in the middle (the “Mr Robot” comment, the if, and the echo) are already there, so don’t add them again! You’re only adding the first line and the last line (the closing squiggly bracket). This says “only do the noindex stuff if the site is whatever hostname you put on the first line”. That hostname should be your regular, non-wrapped SMF forum.

This creates a minor problem. Now, the noindex tag NEVER shows on your wrapped site. That’s still better than before; after all, it’s better to index EVERYTHING than NOTHING.

There is a fix though. It involves work with your robots.txt on your Joomla site.

Below the first section (which should already be there with a User-agent: * ), add a new section with User-agent: Googlebot and make it look something like this:

User-agent: Googlebot
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
Disallow: /forums.html*%3B*
Disallow: /forums.html*.msg*
Allow: /forums.html*wap2$
Allow: /forums.html*wap$
Allow: /forums.html*imode$

Everything except the last 5 lines is a copy/paste of what was already under User-agent: * . You’re using that again, plus the last 5 lines, which are new. The %3B refers to anything with a semi-colon ( ; ). The .msg shows up any time someone clicks on a single message (which is essentially duplicate content, since each single message is also displayed in a thread). By disallowing these, you keep Googlebot away from the non-pure URLs, which means you keep it from indexing duplicate content! Finally, the last 3 lines tell Googlebot that it may crawl ALL of the WAP, WAP2, and IMODE versions of your site. This is needed because they ALL have semi-colons in their URLs and would otherwise be blocked by the lines right before them.

If you don’t use Adsense, you’re probably done. If you DO use Adsense, make 2 more sections below the Googlebot one. They should be EXACTLY the same as the Googlebot section, except that they’re named User-agent: Mediapartners-Google and User-agent: Adsbot-Google , and instead of only the last 3 lines being Allow, the last 5 lines *all* say Allow (you’re changing 2 of them from Disallow to Allow).

By allowing the wildcards for the ad-servers you’re allowing them to show ads on the duplicate-content pages.
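
For reference, here is a sketch of what the first of those two sections could look like. It is built purely from the Googlebot section above, with the two forum Disallow lines flipped to Allow; the Adsbot-Google section would be identical apart from its User-agent line. Treat it as an illustration of the description above, not a tested file, and adjust the directory list to match whatever your own User-agent: * section contains.

```
# Same as the Googlebot section, except the last 5 lines are all Allow
User-agent: Mediapartners-Google
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
Allow: /forums.html*%3B*
Allow: /forums.html*.msg*
Allow: /forums.html*wap2$
Allow: /forums.html*wap$
Allow: /forums.html*imode$
```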

As always, make sure you back up any files before changing them! You may also want to experiment in Google’s Webmaster Tools to make sure your robots.txt file is going to work correctly!
