Seo Analytics Blog Archive

Google Sitemap rapid deployment

I was going to call this "Quick and Dirty Sitemaps", but "Rapid Deployment" sounds more buzz-word-worthy. This is how to get a Google sitemap up and running quickly, using the Google sitemap generator and the Web Developer Firefox plug-in.

I had occasion to set up a sitemap using the Google sitemap generator for a site recently. Here's what I did:

Download the generator using the excellent documentation found at the previous link. Unpack it into a convenient location and copy the example_config.xml file to something else, e.g., www.mysite.com_config.xml. Edit the new configuration file and:

  1. Modify the "base_url" setting to your site;
  2. Change the "store_into" setting to a file in your site's document root;
  3. Add a pointer to a file that will contain your list-of-links, e.g.,

    <urllist path="site_urls.txt" encoding="UTF-8" />

    I would locate this file in the same path as your new configuration file.

Now, if you don't already have Web Developer, give yourself a demerit and go install it.

...

Okay, you'll thank me for that. Now pick a few pages from your site: good choices, depending on your site's design, are the home page, the sitemap (if you have one), and any of the top-level "nav links" you may have set up.

Visit each of those pages in turn. Use Web Developer to assemble the links from the pages, clicking:

  1. Tools menu
  2. Web Developer extension
  3. Information
  4. View link information

Copy and paste each informational list-of-links and append it to a text file. You can clean it up a bit when you are done, removing any links you don't want in the sitemap, or you can let the sitemap generator tell you which ones to remove while testing.

You can sort and de-duplicate the file with something like this:

$ sort site_urls.txt | uniq > site_urls.out

Inspect the site_urls.out file and, when you're happy with it, rename it to "site_urls.txt".
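
If you would rather filter and de-duplicate in one pass, here is a minimal Python sketch. It assumes the list has one URL per line and that only URLs under your base_url belong in the sitemap. The function name clean_urls and the sample URLs are made up for illustration and are not part of the sitemap generator.

```python
def clean_urls(lines, base_url):
    """Return sorted, de-duplicated URLs that start with base_url."""
    seen = set()
    for line in lines:
        url = line.strip()
        # Skip blank lines and links that point off-site.
        if url and url.startswith(base_url):
            seen.add(url)
    return sorted(seen)


# Hypothetical sample input, standing in for the lines of site_urls.txt.
sample = [
    "http://www.mysite.com/about.html\n",
    "http://www.mysite.com/\n",
    "http://www.example.org/elsewhere.html\n",
    "http://www.mysite.com/about.html\n",
    "\n",
]
for url in clean_urls(sample, "http://www.mysite.com/"):
    print(url)
```

To use it on a real file, pass an open file handle as the first argument and write the result back out.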

You're ready to run the sitemap generator:

$ python sitemap_gen.py --config=www.mysite.com_config.xml --testing

Check the output for warnings, adjust the configuration and/or the site_urls.txt file, and repeat until the output is clean. Then run the same command without the --testing flag. Now you just need to add that command to a crontab so it runs on a schedule that suits how often your site changes, and you're done!

Slash URL

There's always more to learn in this job. Today I learned that the Apache web server is smarter than me.

A typical SEO-friendly solution to Interchange pre-defined searches (item categories, manufacturer lists, etc.) is to put together a URL that includes the search parameter, but looks like a hierarchical URL:

/accessories/Mens-Briefs.html
/manufacturer/Hanes.html

Through the magic of actionmaps, we can serve up a search results page that looks for products which match on the "accessories" or "manufacturer" field. The problem comes when a less-savvy person adds a field value that includes a slash:

accessories: "Socks/Hosiery"
or
manufacturer: "Disney/Pixar"

Within my actionmap Perl code, I wanted to redirect some URLs to the canonical actionmap page (because we were trying to short-circuit a crazy Web spider, but that's beside the point). So I ended up (after several wild goose chases) with:

my $new_path = '/accessories/' .
   Vend::Tags->filter({body => (join '%2f' => (grep { /\D/ } @path)),
       op => 'urlencode', }) .
   '.html';

By this I mean: I put together my path out of my selected elements, joined them with a URL-encoded slash character (%2f), and then further URL-encoded the result. This was counter-intuitive, but as you can see at the first link in this article, it's necessary because Apache is smarter than you. Well, than me anyway.
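
To see what that double encoding produces, here is a small Python sketch (not Interchange code). It uses urllib.parse.quote as a stand-in for the urlencode filter, which is an assumption about roughly equivalent behavior, and it leaves out the grep that drops purely numeric path elements. The function name encode_path is hypothetical. One likely reason the single-encoded form fails: with its AllowEncodedSlashes directive at the default of Off, Apache refuses request paths that contain %2F.

```python
from urllib.parse import quote, unquote


def encode_path(prefix, parts):
    """Join parts with an encoded slash, then URL-encode the whole result."""
    joined = "%2f".join(parts)  # e.g. "Socks%2fHosiery"
    # safe="" makes quote() encode "%" as well, so "%2f" becomes "%252f".
    return prefix + quote(joined, safe="") + ".html"


path = encode_path("/accessories/", ["Socks", "Hosiery"])
print(path)
# One round of decoding leaves an encoded slash rather than a literal "/".
print(unquote(path))
```

After one round of decoding, the value still contains %2f rather than a bare slash, so the field value can be recovered with a second decode.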

Interchange Search Caching with "Permanent More"

Most sites that use Interchange take advantage of Interchange's "more lists". These are built-in tools that support an Interchange "search" (either the search/scan action, or result of direct SQL via [query]) to make it very easy to paginate results. Under the hood, the more list is a drill-in to a cached "search object", so each page brings back a slice from the cache of the original search. There are extensive ways to modify the look and behavior of more lists and, with a bit of effort, they can be configured to meet design requirements.

Where more lists tend to fall short, however, is with respect to SEO. There are two primary SEO deficiencies that get business stakeholders' attention:

  • There is little control over the construction of the URLs for more lists. They leverage the scan actionmap and contain a hash key for the search object and numeric data to identify the slice and page location. They possess no intrinsic value in identifying the content they reference.
  • The search cache by default is ephemeral and session-specific. This means every page beyond page 1 that a search engine has cataloged becomes a dead link for search users who try to land directly on it.

It is the latter issue that I wish to address because there is--and has been for some time now--a simple mechanism called "permanent more" to remedy the default behavior.

You can leverage "permanent more" by adding the boolean mv_more_permanent, or the shorthand pm, to your search conditions. E.g.:

Link:

    <a href="[area search="
        co=1
        sf=category
        se=Foo
        op=rm
        more=1
        ml=5
        pm=1
    "]">All Foos</a>

Loop:

    [loop search="
        co=1
        sf=category
        se=Foo
        op=rm
        more=1
        ml=5
        pm=1
    "]
    ...loop body with [more-list]...
    [/loop]

Query:

    [query
        list=1
        more=1
        ml=10
        pm=1
        sql="SELECT * FROM products WHERE category LIKE '%Foo%'"
    ]
    ...same as loop but with 10 matches/page...
    [/query]

If the initial search is defined with the "permanent more" setting, it will produce the following adjustments:

  • The hash key used to store and identify the search cache is deterministic, based on the search conditions. Many Interchange searches are category driven, so all end users who browse a category click identical links; by default, each click creates a duplicate search cache that belongs only to that user. With permanent more, they all share the same cache, with the same identifier. As long as the search conditions don't change, neither does the cache identifier. Even as the cache is refreshed with new executions of the search, the object remains in the same location. Thus, the links a search engine cataloged this morning are still valid now, tomorrow, or next week, provided they reference the same search conditions.
  • The cached search object has no session affinity. Any link referencing the cache with the correct hash key has access to the content.
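
The idea of a deterministic key can be illustrated with a short Python sketch. This is not Interchange's actual key derivation; it only shows how hashing a normalized form of the search conditions yields the same identifier for every user. The function name search_cache_key and the choice of SHA-1 are assumptions made for illustration.

```python
import hashlib


def search_cache_key(conditions):
    """Derive a stable key from search conditions, independent of any session."""
    # Sort the items so the same conditions always serialize the same way.
    normalized = "\n".join(f"{k}={v}" for k, v in sorted(conditions.items()))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Two users issuing the same category search, with parameters in different order.
alice = search_cache_key({"sf": "category", "se": "Foo", "op": "rm", "ml": 5})
bob = search_cache_key({"ml": 5, "op": "rm", "se": "Foo", "sf": "category"})
# A different search term.
other = search_cache_key({"sf": "category", "se": "Bar", "op": "rm", "ml": 5})

print(alice == bob)    # same conditions, same key
print(alice == other)  # different conditions, different key
```

Because the key depends only on the conditions, a link that embeds it keeps pointing at the same cache location no matter who follows it.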

Taken together, these adjustments mean that "permanent more" removes most dead links (the exceptions are addressed below) from more lists cataloged by search engines. There are, however, other benefits to "permanent more" beyond the intended ones described above:

  • As stated in passing, standard Interchange search caching produces duplicate search objects for common search conditions. For a busy site, these caches can have an impact on storage. Typically, a maintenance job removes cache files whose age exceeds the session duration (48 hours by default) by some margin. With permanent more, duplicate caches are eliminated. A cache location is reused by all users with the same search requirements, keeping data-storage requirements for caches to the minimum necessary. As searches change, orphaned caches can still be cleaned up easily, because they start to age as soon as nothing accesses them any longer.
  • For the same reason that "permanent more" keeps search-engine links valid, it also simplifies content management for sites that put a caching reverse proxy in front of Interchange. Because most (and certainly the easiest) cache keys are based on the URL, the deterministic hash keys of "permanent more" give assurance that the content cached in the proxy accurately reflects the search content over time, and that all users hit the cached resource rather than generating new, unique links with varying hash keys.
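
The age-based cleanup mentioned in the first bullet might look something like the following Python sketch. It assumes cache files live in a single directory and that modification time is a good enough measure of age; the function name stale_cache_files, the file names, and the 50-hour threshold (48 hours plus a small margin) are all hypothetical. It only reports candidates and deletes nothing.

```python
import os
import tempfile
import time


def stale_cache_files(cache_dir, max_age_hours, now=None):
    """Return paths of regular files last modified more than max_age_hours ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for entry in os.scandir(cache_dir):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            stale.append(entry.path)
    return sorted(stale)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as cache_dir:
    fresh = os.path.join(cache_dir, "fresh.cache")
    old = os.path.join(cache_dir, "old.cache")
    for path in (fresh, old):
        open(path, "w").close()
    # Backdate one file by 60 hours to simulate an orphaned cache.
    sixty_hours_ago = time.time() - 60 * 3600
    os.utime(old, (sixty_hours_ago, sixty_hours_ago))
    print([os.path.basename(p) for p in stale_cache_files(cache_dir, 50)])
```

A shared "permanent more" cache that is still in use keeps getting refreshed, so under this assumption it never crosses the threshold.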

One shortcoming of "permanent more" to be aware of is the impact of changing data underneath the search. Even if search conditions do not change, the count and order of matching record sets may. So, e.g., enough products may be removed from a given category to cause the last page of a more list to become empty, which would cause any specific link into that page to become dead. More minor, but still a possibility, is the introduction or removal of products so that a particularly searched-for term has been "bumped" to another page within the search cache since the last time the search engine crawled the more lists. For searches backed by particularly volatile data, "permanent more" may not be sufficient to address search-engine or caching demands.
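
The arithmetic behind the "empty last page" case is simple enough to sketch in Python. The numbers below are made up for illustration: a category with 23 matches at 5 per page, which then loses 3 products. The function name last_page is hypothetical.

```python
import math


def last_page(match_count, per_page):
    """Number of the last non-empty page in a paginated result."""
    return math.ceil(match_count / per_page)


before = last_page(23, 5)  # pages at the time the search engine crawled
after = last_page(20, 5)   # pages once 3 products have been removed

print(before, after)
print("link to page", before, "still valid:", before <= after)
```

Any cataloged link that points past the new last page lands on an empty slice, even though the cache key itself is unchanged.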

Finally, "permanent more" should be avoided for any search features that may cache data sensitive to an individual user. This is unlikely to happen as, under most circumstances, the configuration of the search itself will change based on the unique characteristics of the user executing the search (e.g., a username included in a query to review order history). However, it is still possible that context-sensitive information could be stored in the search object and, if so, all other users with access to the more lists would have access to that information.

SEO friendly redirects in Interchange

In the past, I've had a few Interchange clients who wanted their site to be able to issue an SEO-friendly 301 redirect to a new page. The reasons varied: a product had gone out of stock and wasn't going to return, or they had completely reworked their URL structure to be more SEO friendly and wanted the link juice to transfer to the new URLs. The normal way to handle this kind of request is to set up a bunch of Apache rewrite rules.

There were a few issues with going that route. The main issue is that adding or removing rules means restarting or reloading Apache every time a change is made. The clients don't normally have the access to do this, so they would have to contact me. They also don't have access to modify the Apache virtual host file where the rules live, so again they would have to contact me. To avoid the editing issue, we could have put the rules in a .htaccess file and let them modify that, but this presents its own challenges because some text editors and FTP clients don't handle hidden files very well. Finally, even though basic rewrite rules are pretty easy to copy, paste and reuse, they can have nasty side effects if not done properly and can be difficult to troubleshoot.

So I devised a way to let clients manage their 301 redirects using a simple database table and Interchange's Autoload directive.

The database table is very simple, with two fields. I called them old_url and new_url, with old_url as the primary key. The Autoload directive accepts a list of subroutines as its arguments, so this requires us to create two GlobalSubs: one to actually do the redirect, and one to check the database and see if we need to redirect. The redirect sub is really straightforward and looks like this:

sub redirect {
   my ($url, $status) = @_;
   $status ||= 302;
   $Vend::StatusLine = qq|Status: $status moved\nLocation: $url\n|;
   $::Pragma->{download} = 1;
   my $body = '';
   ::response($body);
   $Vend::Sent = 1;
   return 1;
}

The code for the sub that checks to see if we need to redirect looks like this:

sub redirect_old_links {
   # Look up the redirect table; do nothing if it is not configured.
   my $db = Vend::Data::database_exists_ref('page_redirects')
       or return;
   my $dbh = $db->dbh();
   my $current_url = $::Tag->env({ arg => "REQUEST_URI" });
   my $normal_server = $::Variable->{NORMAL_SERVER};

   # Load all redirects once per session and cache them in scratch.
   if ( ! exists $::Scratch->{redirects} ) {
       my $sth = $dbh->prepare(q{select old_url, new_url from page_redirects});
       $sth->execute();
       while ( my ($old, $new) = $sth->fetchrow_array() ) {
           $::Scratch->{redirects}{$old} = $new;
       }
       $sth->finish();
   }

   # Issue a 301 if the requested URL has an entry in the table.
   my $redirects = $::Scratch->{redirects};
   if ( $redirects && exists $redirects->{$current_url} ) {
       my $path = $normal_server . $redirects->{$current_url};
       my $Sub = Vend::Subs->new;
       $Sub->redirect($path, '301');
   }
   return;
}

We normally save these two subs as separate files in our own directory under the Interchange directory, custom/GlobalSub, and then add the line "include custom/GlobalSub/*.sub" to the interchange.cfg file so that they are loaded when Interchange restarts. After those files are loaded, you'll need to tell the catalog to Autoload the subroutine. To do that, use the Autoload directive in your catalog.cfg file like this:

Autoload redirect_old_links

After modifying your catalog.cfg file, you will need to reload your catalog to ensure the change takes effect. Once these things are in place, you should be able to add rows to the page_redirects table, start a new session, and be redirected properly. (A new session is needed because the sub caches the table in the session's scratch space.) When I was working on the system, I created an entry that redirected /cgi-bin/vlink/redirect_test.html to /cgi-bin/vlink/index.html so I could confirm that it was redirecting me properly.
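
To make the table and the lookup concrete outside of Interchange, here is a rough Python sketch using an in-memory SQLite database. The column types, the function names load_redirects and redirect_target, and the server name are assumptions for illustration; only the table name, the two column names, the primary key, and the test entry come from the description above.

```python
import sqlite3


def load_redirects(conn):
    """Read the whole table into a dict, as the sub does once per session."""
    rows = conn.execute("SELECT old_url, new_url FROM page_redirects")
    return dict(rows.fetchall())


def redirect_target(redirects, request_uri, server):
    """Return the absolute URL to redirect to, or None if no rule matches."""
    new_url = redirects.get(request_uri)
    return server + new_url if new_url is not None else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_redirects (old_url TEXT PRIMARY KEY, new_url TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO page_redirects VALUES (?, ?)",
    ("/cgi-bin/vlink/redirect_test.html", "/cgi-bin/vlink/index.html"),
)

redirects = load_redirects(conn)
server = "http://www.mysite.com"  # stands in for NORMAL_SERVER
print(redirect_target(redirects, "/cgi-bin/vlink/redirect_test.html", server))
print(redirect_target(redirects, "/cgi-bin/vlink/other.html", server))
```

Making old_url the primary key guarantees at most one destination per source URL, which is what lets the lookup be a plain dictionary access.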