Hosting Blog Archive

Our SoftLayer API tools

We do a lot of our hosting at SoftLayer, which seems to be one of the hosts with the most servers in the world -- they claim to have over 100,000 servers as of last month. More important for us than sheer size are many other fine attributes that SoftLayer has, in no particular order:

  • a strong track record of reliability
  • responsive support
  • datacenters around the U.S. and some in Europe and Asia
  • solid power backup
  • well-connected redundant networks with multiple 10 Gbps uplinks
  • gigabit Ethernet pipes all the way to the Internet
  • first-class IPv6 support
  • an internal private network with no data transfer charge
  • Red Hat Enterprise Linux offered at no extra charge
  • diverse dedicated server offerings at many price & performance points
  • some disk partitioning options (though more flexibility here would be nice, especially with LVM for the /boot and / filesystems)
  • fully automated provisioning, without salesman & quote hassles for standard offerings
  • 3000 GB data transfer per month included standard with most servers
  • month-to-month contracts
  • reasonable prices (though we can of course always use lower prices, we'll take quality over cheapness for most of our hosting needs!)
  • no arbitrary port blocks (some other providers rate-limit incoming TCP connections on port 22 to slow down ssh dictionary attacks, while others forbid IRC, etc.)
  • a web service API for monitoring and controlling many aspects of our account via REST/JSON or SOAP

(No, they're not paying me for writing this! But they really have nice offerings.)

It is this last item, the SoftLayer API, that I want to elaborate on here.

The SoftLayer Development Network features API information and documentation and once you have an API account set up in the management website (quick and easy to do), you can start automating all sorts of tasks, from provisioning new hosts, monitoring your upcoming invoice or other accounting information, and much more.

I've released as open source two scripts we use: One is for managing secondary DNS domains in SoftLayer's DNS servers, from a primary name server running BIND 9. The other is a Nagios check script for monitoring monthly data transfer used and alerting when over a set threshold or over the monthly allotment.

See the GitHub repository of endpoint-softlayer-api if they would be useful to you, or to use as a starting point to interface with other SoftLayer APIs.

MySQL replication monitoring on Ubuntu 10.04 with Nagios and NRPE

If you're using MySQL replication, then you're probably counting on it for some fairly important need. Monitoring via Nagios is generally considered a best practice. This article assumes you've already got your Nagios server setup and your intention is to add a Ubuntu 10.04 NRPE client. This article also assumes the Ubuntu 10.04 NRPE client is your MySQL replication master, not the slave. The OS of the slave does not matter.

Getting the Nagios NRPE client setup on Ubuntu 10.04

At first it wasn't clear what packages would be appropriate packages to install. I was initially mislead by the naming of the nrpe package, but I found the correct packages to be:

sudo apt-get install nagios-nrpe-server nagios-plugins

The NRPE configuration is stored in /etc/nagios/nrpe.cfg, while the plugins are installed in /usr/lib/nagios/plugins/ (or lib64). The installation of this package will also create a user nagios which does not have login permissions. After the packages are installed the first step is to make sure that /etc/nagios/nrpe.cfg has some basic configuration.

Make sure you note the server port (defaults to 5666) and open it on any firewalls you have running. (I got hung up because I forgot I have both a software and hardware firewall running!) Also make sure the server_address directive is commented out; you wouldn't want to only listen locally in this situation. I recommend limiting incoming hosts by using your firewall of choice.

Choosing what NRPE commands you want to support

Further down in the configuration, you'll see lines like command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10. These are the commands you plan to offer the Nagios server to monitor. Review the contents of /usr/lib/nagios/plugins/ to see what's available and feel free to add what you feel is appropriate. Well designed plugins should give you a usage if you execute them from the command line. Otherwise, you may need to open your favoriate editor and dig in!

After verifying you've got your NRPE configuration completed and made sure to open the appropriate ports on your firewall(s), let's restart the NRPE service:

service nagios-nrpe-server restart

This would also be an appropriate time to confirm that the nagios-nrpe-server service is configured to start on boot. I prefer the chkconfig package to help with this task, so if you don't already have it installed:

sudo apt-get install chkconfig
chkconfig | grep nrpe

# You should see...
nagios-nrpe-server     on

# If you don't...
chkconfig nagios-nrpe-server on

Pre flight check - running check_nrpe

Before going any further, log into your Nagios server and run check_nrpe and make sure you can execute at least one of the commands you chose to support in nrpe.cfg. This way, if there are any issues, it is obvious now, while we've not started modifying your Nagios server configuration. The location of your check_nrpe binary may vary, but the syntax is the same:

check_nrpe -H host_of_new_nrpe_client -c command_name

If your command output something useful and expected, your on the right track. A common error you might see: Connection refused by host. Here's a quick checklist:

  • Did you start the nagios-nrpe-server service?
  • Run netstat -aunt on the NRPE client to make sure the service is listening on the right address and ports.
  • Did you open the appropriate ports on all your firewall(s)?
  • Is there NAT translation which needs configuration?

Adding the check_mysql_replication plugin

There is a lot of noise out there on Google for Nagios plugins which offer MySQL replication monitoring. I wrote the following one using ideas pulled from several existing plugins. It is designed to run on the MySQL master server, check the master's log position and then compare it to the slave's log position. If there is a difference in position, the alert is considered Critical. Additionally, it checks the slave's reported status, and if it is not "Waiting for master to send event", the alert is also considered critical. You can find the source for the plugin at my Github account under the project check_mysql_replication. Pull that source down into your plugins directory (/usr/lib/nagios/plugins/ (or lib64)) and make sure the permissions match the other plugins.

With the plugin now in place, add a command to your nrpe.cfg.

command[check_mysql_replication]=sudo /usr/lib/nagios/plugins/check_mysql_replication.sh -H 

At this point you may be saying, WAIT! How will the user running this command (nagios) have login credentials to the MySQL server? Thankfully we can create a home directory for that nagios user, and add a .my.cnf configuration with the appropriate credentials.

usermod -d /home/nagios nagios #set home directory
mkdir /home/nagios
chmod 755 /home/nagios
chown nagios:nagios /home/nagios

# create /home/nagios/.my.cnf with your preferred editor with the following:
[client]
user=example_replication_username
password=replication_password

chmod 600 /home/nagios/.my.cnf
chown nagios:nagios /home/nagios/.my.cnf

This would again be an appropriate place to run a pre flight check and run the check_nrpe from your Nagios server to make sure this configuration works as expected. But first we need to add this command to the sudoer's file.

nagios ALL= NOPASSWD: /usr/lib/nagios/plugins/check_mysql_replication.sh

Wrapping Up

At this point, you should run another check_nrpe command from your server and see the replication monitoring report. If not, go back and check these steps carefully. There are lots of gotchas and permissions and file ownership are easily overlooked. With this in place, just add the NRPE client using the existing templates you have for your Nagios servers and make sure the monitoring is reporting as expected.

Some great press for College District

College District has been getting some positive press lately, the most recent being a Forbes article which talks about the success they have been seeing in the last few years.

College District is a company that sells collegiate merchandise to fans. They got their start focusing on the LSU Tigers at TigerDistrict.com and have branched out to teams such as the Oregon Ducks and Alabama Roll Tide.

We've been working with Jared Loftus @ College District for more then four and a half years. College District is running on a heavily modified Interchange system with some cool Postgres tricks. The system can support a nearly unlimited number of sites, running on 2 catalogs (1 for the admin, 1 for the front end) and 1 database. The key to the system is different schemas, fronted by views, that hide and expose records based on the database user that is connected. The great thing about this system is that Jared can choose to launch a new store within a day and be ready for sales, something he has taken advantage of in the past when a team is on fire and he sees an opportunity he can't pass up.

We are currently preparing for a re-launch of the College District site that will focus on crowd-sourced designs. Artists and fans will submit their designs, have them voted on, some will be chosen to be sold and the folks that have their designs chosen will get paid for their efforts. The goal here is to grow a community that guides what College District and the individual school sites ultimately sell.

With College District's quick growth we've also been helping them improve their order fulfillment process. This includes streamlining how orders are picked, packed and shipped. The introduction of bar code scanners will help with the accuracy and speed of the process.

We get a kick out of seeing our clients succeed, especially those that come to us with a clear vision and a good attitude, and then put the hard work in to make it happen. It's an exciting year ahead for College District and we'll be right there supporting them on the journey.

Automating removal of SSH key patterns

Every now and again, it becomes necessary to remove a user's SSH key from a system. At End Point, we'll often allow multiple developers into multiple user accounts, so cleaning up these keys can be cumbersome. I decided to write a shell script to brush up on those skills, make sure I completed my task comprehensively, and automate future work.

Initial Design and Dependencies

My plan for this script is to accept a single argument which would be used to search the system's authorized_keys files. If the pattern was found, it would offer you the opportunity to delete the line of the file on which the pattern was found.

I've always found mlocate to be very helpful; it makes finding files extremely fast and it's usage is trivial. For this script, we'll use the output from locate to find all authorized_keys files in the system. Of course, we'll want to make sure that the mlocate.db has recently been updated. So let's show the user when the database was last updated and offer them a chance to update it.

mlocate_path="/var/lib/mlocate/mlocate.db"
if [ -r $mlocate_path ]
then
    echo -n "mlocate database last updated: "
    stat -c %y $mlocate_path
    echo -n "Do you want to update the locate database this script depends on? [y/n]: "
    read update_locate
    if [ "$update_locate" = "y" ]
    then
        echo "Updating locate database.  This may take a few minutes..."
        updatedb
        echo "Update complete."
    fi  
else
    echo "Cannot read the mlocate db path: $mlocate_path"
    exit 2
fi

First we define the path where we can find the mlocate database. Then we check to see if we can read that file. If we can't read the file, we let the user know and exit. If we can read the file, print the date and time it was last modified and offer the user a chance to update the database. While this is functional, it's pretty brittle. Let's make things a bit more flexible by letting locate tell us where its database is.

if
    mlocate_path=`locate -S`
then
    # locate -S command will output database path in following format:
    # Database /full/path/to/db: (more output)...
    mlocate_path=${mlocate_path%:*} #remove content after colon
    mlocate_path=${mlocate_path#'Database '*} #remove 'Database '
else
    echo "Couldn't run locate command.  Is mlocate installed?"
    exit 5
fi

Instead of hard coding the path to the database, we collect the locate database statics using the -S parameter. By using some string manipulation functions we can tease out the file path from the output.

Because we are going to offering to update the location database (as well as eventually manipulating authorized_keys files), it makes sense to check that we are root before proceeding. Additionally, let's check to see that we get a pattern form our user, and provide some usage guidance.

if [ ! `whoami` = "root" ]
then
    echo "Please run as root."
    exit 4
fi

if [ -z $1 ]
then
    echo "Usage: check_authorized_keys PATTERN"
    exit 3
fi

Checking and modifying authorization_keys for a pattern

With some prerequisites in place, we're finally ready to scan the system's authorized keys files. Let's just start with the syntax for that loop.

for key_file in `locate authorized_keys`; do
    echo "Searching $key_file..."
done

We do not specify a dollar sign ($) in front of key_file when defining the loop, but once inside our loop we use the regular syntax. We use command substitution by placing a command around back quotes (`) around the output of the command we want to use. We're now scanning each file, but how do we find matching entries?

IFS=$'\n'
for matching_entry in `grep "$1" $key_file`; do
    IFS=' '
    echo "Found an entry in $key_file:"
    echo $matching_entry
done

For each $key_file, we now grep our user's pattern ($1) and store it in $matching_entry. We have to change the Input Field Seperator (IFS) to a new line, instead of the default space, in order to capture each grepped line in its entriety. (Thanks to Brian Miller for that one!)

With a matching entry found in a key file, it's time to finally offer the user a chance to remove the entry.

echo "Found an entry in $key_file:"
echo $matching_entry
echo -n "Remove entry? [y/n]: "
read remove_entry
if [ "$remove_entry" = "y" ]
then
    if [ ! -w $key_file ]
    then
        echo "Cannot write to $key_file."
     exit 1
 else
     sed -i "/$matching_entry/d" $key_file
     echo "Deleted."
 fi
else
    echo "Not deleted."
fi

We prompt the user if they want to delete the shown entry, verify we can write to the $key_file, and then delete the $matching entry. By using the -i option to the sed command, we are able to make modifications inline.

The Final Product

I'm sure there is a lot of room for improvement on this script and I'd welcome pull requests on the Github repo I setup for this little block of code. As always, be *very* careful when running automated scripts as root. Please test this script out on a non-production system before use.

#!/bin/bash

if [ ! `whoami` = "root" ]
then
    echo "Please run as root."
    exit 4
fi


if [ -z $1 ]
then
    echo "Usage: check_authorized_keys PATTERN"
    exit 3
fi

if
    mlocate_path=`locate -S`
then
   # locate -S command will output database path in following format:
   # Database /full/path/to/db: (more output)...
   mlocate_path=${mlocate_path%:*} #remove content after colon
   mlocate_path=${mlocate_path#'Database '*} #remove 'Database '
else
    echo "Couldn't run locate command.  Is mlocate installed?"
    exit 5
fi

if [ -r $mlocate_path ]
then
    echo -n "mlocate database last updated: "
    stat -c %y $mlocate_path
    echo -n "Do you want to update the locate database this script depends on? [y/n]: "
    read update_locate
    if [ "$update_locate" = "y" ]
    then
        echo "Updating locate database.  This may take a few minutes..."
        updatedb
        echo "Update complete."
        echo ""
    fi
else
    echo "Cannot read from $mlocate_path"
    exit 2
fi

for key_file in `locate authorized_keys`; do
    echo "Searching $key_file..."
    IFS=$'\n'
    for matching_entry in `grep "$1" $key_file`; do
    IFS=' '
        echo "Found an entry in $key_file:"
        echo $matching_entry
        echo -n "Remove entry? [y/n]: "
        read remove_entry
        if [ "$remove_entry" = "y" ]
        then
            if [ ! -w $key_file ]
            then
                echo "Cannot write to $key_file."
                exit 1
            else
                sed -i "/$matching_entry/d" $key_file
                echo "Deleted."
            fi
        else
            echo "Not deleted."
        fi
    done
done

echo "Search complete."

Converting CentOS 6 to RHEL 6

A few years ago I needed to convert a Red Hat Enterprise Linux (RHEL) 5 development system to CentOS 5, as our customer did not actively use the system any more and no longer wanted to renew the Red Hat Network entitlement for it. Making the conversion was surprisingly straightforward.

This week I needed to make a conversion in the opposite direction: from CentOS 6 to RHEL 6. I didn't find any instructions on doing so, but found a RHEL 6 to CentOS 6 conversion guide with roughly these steps:

yum clean all
mkdir centos
cd centos
wget http://mirror.centos.org/centos/6.0/os/x86_64/RPM-GPG-KEY-CentOS-6
wget http://mirror.centos.org/centos/6.0/os/x86_64/Packages/centos-release-6-0.el6.centos.5.x86_64.rpm
wget http://mirror.centos.org/centos/6.0/os/x86_64/Packages/yum-3.2.27-14.el6.centos.noarch.rpm
wget http://mirror.centos.org/centos/6.0/os/x86_64/Packages/yum-utils-1.1.26-11.el6.noarch.rpm
wget http://mirror.centos.org/centos/6.0/os/x86_64/Packages/yum-plugin-fastestmirror-1.1.26-11.el6.noarch.rpm
rpm --import RPM-GPG-KEY-CentOS-6
rpm -e --nodeps redhat-release-server
rpm -e yum-rhn-plugin rhn-check rhnsd rhn-setup rhn-setup-gnome
rpm -Uhv --force *.rpm
yum upgrade

I then put together a plan to do more or less the opposite of that. The high-level overview of the steps is:

  1. Completely upgrade the current CentOS and reboot to run the latest kernel, if necessary, to make sure you're starting with a solid system.
  2. Install a handful of packages that will be needed by various RHN tools.
  3. Log into the Red Hat Network web interface and search for and download onto the server the most recent version of these packages for RHEL 6 x86_64:
    • redhat-release-server-6Server
    • rhn-check
    • rhn-client-tools
    • rhnlib
    • rhnsd
    • rhn-setup
    • yum
    • yum-metadata-parser
    • yum-rhn-plugin
    • yum-utils
  4. Install the Red Hat GnuPG signing key.
  5. Forcibly remove the package that identifies this system as CentOS.
  6. Forcibly upgrade to the downloaded RHEL and RHN packages.
  7. Register the system with Red Hat Network.
  8. Update any packages that now need it using the new Yum repository.

The exact steps I used today to convert from CentOS 6.1 to RHEL 6.2 (with URL session tokens munged):

yum upgrade
shutdown -r now
yum install dbus-python libxml2-python m2crypto pyOpenSSL python-dmidecode python-ethtool python-gudev usermode
mkdir rhel
cd rhel
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/redhat-release-server/6Server-6.2.0.3.el6/x86_64/redhat-release-server-6Server-6.2.0.3.el6.x86_64.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhn-check/1.0.0-73.el6/noarch/rhn-check-1.0.0-73.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhn-client-tools/1.0.0-73.el6/noarch/rhn-client-tools-1.0.0-73.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhnlib/2.5.22-12.el6/noarch/rhnlib-2.5.22-12.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhnsd/4.9.3-2.el6/x86_64/rhnsd-4.9.3-2.el6.x86_64.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhn-setup/1.0.0-73.el6/noarch/rhn-setup-1.0.0-73.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/yum/3.2.29-22.el6/noarch/yum-3.2.29-22.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/yum-metadata-parser/1.1.2-16.el6/x86_64/yum-metadata-parser-1.1.2-16.el6.x86_64.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/yum-rhn-plugin/0.9.1-36.el6/noarch/yum-rhn-plugin-0.9.1-36.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/yum-utils/1.1.30-10.el6/noarch/yum-utils-1.1.30-10.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget https://www.redhat.com/security/fd431d51.txt
rpm --import fd431d51.txt
rpm -e --nodeps centos-release
rpm -e centos-release-cr
rpm -Uhv --force *.rpm
rpm -e yum-plugin-fastestmirror
yum clean all
rhn_register
yum upgrade

I'm expecting to use this process a few more times in the near future. It is very useful when working with a hosting provider that does not directly support RHEL, but provides CentOS, so we can get the new servers set up without needing to request a custom operating system installation that may add a day or two to the setup time.

Given the popularity of both RHEL and CentOS, it would be neat for Red Hat to provide a tool that would easily switch, at least "upgrading" from CentOS to RHEL to bring more customers into their fold, if not the other direction!

Semaphore limits and many Apache instances on Linux

On some of our development servers, we run many instances of the Apache httpd web server on the same system. By "many", I mean 30 or more separate Apache instances, each with its own configuration file and child processes. This is not unusual on DevCamps setups with many developers working on many projects on the same server at the same time, each project having a complete software stack nearly identical to production.

On Red Hat Enterprise Linux 5, with somewhere in the range of 30 to 40 Apache instances on a server, you can run into failures at startup time with this error or another similar one in the error log:

[error] (28)No space left on device: Cannot create SSLMutex

The exact error will depend on what Apache modules you are running. The "space left on device" error does not mean you've run out of disk space or free inodes on your filesystem, but that you have run out of SysV IPC semaphores.

You can see what your limits are like this:

# cat /proc/sys/kernel/sem
250 32000 32 128

I typically double those limits by adding this line to /etc/sysctl.conf:

kernel.sem = 500 64000 64 256

That makes sure you'll get the change at the next boot. To make the change take immediate effect:

# sysctl -p

With those limits I've run 100 Apache instances on the same server.

RPM building: Fedora's _sharedstatedir

When Red Hat Enterprise Linux does not offer packages that we need, EPEL (Extra Packages for Enterprise Linux) often has what we want, kept compatible with RHEL. When EPEL also doesn't have a package, or we need a newer release than is offered, we rebuild packages from Fedora, which has consistently high-quality packages even in its "rawhide" development phase. We then distribute our packages in several compatibility-oriented Yum repositories at packages.endpoint.com.

Of course some things in the latest Fedora are not compatible with RHEL. In rebuilding the logcheck package (needed as a dependency for another package), I found that Fedora RPM spec files have begun using the _sharedstatedir macro in /usr/lib/rpm/macros, which RHEL has never used before.

On RHEL that macro has been set to /usr/com, a strange nonexistent path that apparently came from the GNU autoconf tools but wasn't used in RHEL. Now in Fedora the macro is set to /var/lib and is being used, as described in a Fedora wiki page on packaging.

The easiest and most compatible way to make the change without munging the system- or user-wide RPM macros is to add this definition to the top of the spec file where it's needed:

%define _sharedstatedir /var/lib

And then the RPM build is happy.

In related news, alongside the new logcheck package, there are also new End Point RHEL 5 x86_64 packages for the brand-new Git 1.7.7.1 and pbzip2 1.1.5, the multi-CPU core parallel compressor that has had several bugfix releases this year.

OpenSSH known_hosts oddity

A new version of the excellent OpenSSH was recently released, version 5.9. As you'd expect from such widely-used mature software, there are lots of minor improvements to enjoy rather than anything too major.

But what I want to write about today is a little surprise in how ssh handles multiple cached host keys in its known_hosts files.

I had wrongly thought that ssh stopped scanning known_hosts when it hit the first hostname or IP address match, such as happens with lookups in /etc/hosts. But that isn't how it works. The sshd manual reads:

It is permissible (but not recommended) to have several lines or different host keys for the same names. This will inevitably happen when short forms of host names from different domains are put in the file. It is possible that the files contain conflicting information; authentication is accepted if valid information can be found from either file.

The "files" it refers to are the global /etc/ssh/known_hosts and the per-user ~/.ssh/known_hosts.

The surprise was that if there are multiple host key entries in ~/.ssh/known_hosts, say, for 10.0.0.1. If the first one has a non-matching host key, the ssh client tries the second one, and so on until it runs out of matching IP address entries to check. If none have a matching host key, the ssh client error reports the offending line number for the last matching IP address, but gives no indication there are earlier mismatches as well.

This is actually kind of convenient if you have scripts that simply append new host keys to the end of the known_hosts file, and it also makes sense since hostname wildcards and multiple hostnames per line are allowed. It's fine, but it isn't what I expected and is nice to know.

Home router problems with .0 IP address

In our work the occasional mysterious problem surfaces which makes me appreciate how tractable and sane the majority of the challenges are. Here I'll tell the story of one of the mysterious problems.

In Internet routing of IPv4 addresses, there's nothing inherently special about an IP address that ends in .0, .255, or anything else. It all depends on the subnet. In the days before CIDR (Classless Inter-Domain Routing) brought us arbitrary subnet masks, there were classes of routing, most commonly A, B, and C. And the .0 and .255 addresses were special.

That was a long time ago, but it can still cause occasional trouble today. One of our hosting providers assigned us an IP address ending in .0, which we used for hosting a website. It worked fine, and was in service for many months before we heard any reports of trouble.

Then we heard a report from one of our clients that they could not access that website from their home, but they could from their office. We couldn't ever figure out why.

Next one of our own employees found that he could not access the website from his home, but he could from other locations.

Finally we had enough evidence when a friend from the open source community also could not access that website from his home.

The commonality was in the router they were using:

  • Belkin G Wireless Router Model F5D7234-4 v4
  • Belkin F5D9231-4 v1
  • and the third thought it was a Belkin but they were not able to provide the exact model.

We moved the website to a different IP address on the same server, and they had no problem accessing it.

The routers are obviously broken, but there's little sense arguing about that. For now we avoid using any .0 IP address because there are going to be some few people who can't reach it.

Raising open file descriptor limits for Dovecot and nginx

Recently we've needed to increase some limits in two excellent open-source servers: Dovecot, for IMAP and POP email service, and nginx, for HTTP/HTTPS web service. These are running on different servers, both using Red Hat Enterprise Linux 5.

First, let's look at Dovecot. We have a somewhat busy mail server and as it grew busier, it occasionally hit connection limits when the server itself still has plenty of available capacity.

Raising the number of processes in Dovecot is easy. Edit /etc/dovecot.conf and change from the prior (now commented-out) limits to the new limits:

#login_max_processes_count = 128
login_max_processes_count = 512

and later in the file:

#max_mail_processes = 512
max_mail_processes = 2048

However, then Dovecot won't start at all due to a shortage of available file descriptors. There are various ways to change that, including munging the init scripts, changing the system defaults, etc. The most standard and non-interventive way to do so with this RHEL 5 Dovecot RPM package is to edit /etc/sysconfig/dovecot and add:

ulimit -n 131072

That sets the shell's maximum number of open file descriptors allowed in the init script /etc/init.d/dovecot before the Dovecot daemon is run. The default ulimit -n is 1024, so we here increased it to an arbitrarily big enough number (2 * 64K) to handle the new limits and then some.

Similarly, on another server we needed to increase the number of connections allowed per nginx worker process from the default 1024 for a very high-capacity HTTP caching proxy server.

We edited /etc/nginx/nginx.conf and changed the events block like this:

events {
    worker_connections  65536;
}

But then nginx wouldn't start at all. The same problem and same solution applied. We edited /etc/sysconfig/nginx to add:

ulimit -n 131072

And now nginx has enough file descriptors to start.

Changing the limits this way also has the benefit of surviving an upgrade, because /etc/sysconfig files are marked in RPM as configuration files that should have any changes preserved.

RailsConf 2011 - Day One

Today was the first official day of RailsConf 2011. As with most technical conferences, this one spent the first day with tutorials and workshops. For those of us without paid access to the tutorial sessions, the BohConf was a nice way to spend our first day of the four-day event.

BohConf is described as the "unconference" of RailsConf. It's a loosely organized collection of presentations, mini-hackathons and barcamp-style meetings. I spent the first half of Monday at the BohConf. Of particular interest to me was Chris Eppstein and Scott Davis' introduction to Sass and Compass. I've dabbled with Sass in the past but only recently learned of Compass.

Sass is a great way to construct your CSS without the tedious duplication that's typical of most modern spreadsheets. Introducing programming features like variables, inheritance and nested blocks, Sass makes it easy to keep your source material concise and logical. Once your source declarations are ready, compile your production spreadsheets with Sass or Compass.

Compass is effectively a framework for easy construction and deployment of spreadsheets using Sass. To hear Scott describe it, "Compass is to Sass as Rails is to Ruby". Together they're a very attractive combination for the Ruby developer who also dabbles in design (and who doesn't these days?). Truth be told, while I'm very impressed with the capabilities of Sass, I worry about the trend to re-introduce logic and presentation. My mom raised me to abstract the presentation layer for ease on graphic designers, and that rule has suited me well to date. Time will tell.

In the afternoon I wandered into a sponsored workshop from VMware. Dave McCrory and Dekel Tankel led a demonstration of their new CloudFoundry service. A nice reward for attending was getting instant approval of your CloudFoundry.com beta registration. Although I felt mildly guilty for it afterwards, I took advantage of this opportunity to get an extra beta account (my personal request had been approved a week earlier).

Dave introduced everyone to the CloudFoundry beta offering and discussed a vague roadmap for the open source project and their own commercial VM product slated for early 2012. They emphasized that the CloudFoundry core project will remain open source, and that interested parties can fork it on github, hack on the code, and deploy their own private "micro-clouds". Dave even hinted that at least one startup has based their PHP PaaS service on CloudFoundry.

Once all the attendees had received their beta accounts, Dekel walked us through the installation and basic command-line usage of the vmc utility. I was able to immediately vmc login with my Beta account credentials and change my password with vmc passwd. I should note that before logging in, I had to choose my target server with vmc target api.cloudfoundry.com. Why is this important? This will allow developers to easily switch between targeted environments. For example, I could install CloudFoundry on my workstation or development server and "target" it as my development or staging environment. Once my tests pass, I can quickly switch targets and push the changes to production.

They had us follow along by recreating a sample Ruby application designed to check if one Twitter account follows another. The examples were simple and easy to follow. Once we had our Gemfile, controller and views in place, we had to bundle package all of the dependencies. Once this was complete, a quick vmc push myappname and we were live!

Unfortunately, most (if not all) of us encountered a gotcha with this example. Because all of our instantiated applications reside behind the same IP address on CloudFoundry's network, we quickly hit Twitter's API quota. I'm not sure if this will be a problem once CloudFoundry officially launches, but it's something to keep in mind. And while it was useful to vmc logs myappname to debug this problem, an astute attendee brought up the fact that there was no way to tail application logs in CloudFoundry. This is a glaring oversight and one I hope they rectify before the Beta is finished.

The workshop continued with an introduction into binding services like Redis, Mongo or MySQL. We added new functionality to our existing application that introduced a Redis backend to store leaderboard information for Twitter activity. Lastly, Dekel demonstrated the ease of scaling our applications with vmc instances myappname 5. This simple command instantiates new copies of the application behind their dynamic load balancer. Coincidentally, they're not currently offering any sort of scalable backend storage, so keep that in mind before you try to launch any production sites on their Beta service. Now you'd never do that, would you? ;-)

The first day of RailsConf 2011 was impressive. I came last year as an exhibitor and am really excited that I get to attend this year for the conference sessions. I can't wait to see all the talks tomorrow!

RHEL 5 SELinux initscripts problem

I ran into a strange problem updating Red Hat Enterprise Linux 5 a few months ago, and just ran into it again and this time better understood what went wrong.

The problem was serious: After a `yum upgrade` of a RHEL 5 x86_64 server with SELinux enforcing, it never came back after a reboot. Logging into the console I could see that it was stuck in single user mode because there were no init scripts! Investigation showed that indeed the initscripts package was completely missing.

I searched on bugzilla.redhat.com looking for any reported problems and didn't find any. I reinstalled initscripts, rebooted, and the server was fine, but it was not happytimes to have that unexpected downtime.

Tonight I ran into the problem again, this time on a build server where downtime didn't matter so I could investigate more leisurely.

The error from yum looked like this (the same problem affected the screen package as affected initscripts):

Downloading Packages:
screen-4.0.3-4.el5.i386.rpm          | 559 kB      00:00
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
groupadd: unable to open group file
error: %pre(screen-4.0.3-4.el5.i386) scriptlet failed, exit status 10
error:   install: %pre scriptlet failed (2), skipping screen-4.0.3-4.el5

Updated:
  screen.i386 0:4.0.3-4.el5

Complete!
# cat /selinux/enforce
1

The way I dealt with this initially was to temporarily disable SELinux enforcing, update the package, then reboot (to also load a kernel update):

# setenforce 0
# yum -y upgrade
# shutdown -r now

But following up on the specific error message showed:

# ls -lFaZ /etc/group
-rw-r--r--  root root system_u:object_r:file_t:s0      /etc/group

Aha! The SELinux context is wrong. Given that this has happened a couple of different machines, I'm guessing some past upgrade broke the context. What should it be? Let's check /etc/passwd for reference:

# ls -lFaZ /etc/passwd
-rw-r--r--  root root system_u:object_r:etc_t:s0       /etc/passwd

That's confirmed the correct context for /etc/group on another working server. To fix:

# chcon system_u:object_r:etc_t:s0 /etc/group

Then the updates proceed fine. It would be nice to find out what past action set the context wrong on /etc/group.

Google 2-factor authentication

About a month ago, Google made available to all users their new 2-factor authentication, which they call 2-step authentication. In addition to the customary username and password, this optional new feature requires that you enter a 6-digit number that changes every 30 seconds, generated by the Google Authenticator app on your Android, BlackBerry, or iPhone. The app looks like this:

This was straightforward to set up and has worked well for me in the past month. It would thwart bad guys who intercept your password in most cases. It would also lock you out of your Google account if you lose your phone and your emergency scratch codes. :)

I was happy to see this is all based on some open standards under development, and Google has made this even more useful by releasing an open source PAM module called google-authenticator. With that PAM module, a Linux system administrator can require a Google Authenticator code in addition to password authentication for login.

I tried this out on a CentOS x86_64 system and found it fairly straightforward to set up. I ran into two minor gotchas which were reported by others as well:

  • The Makefile calls sudo directly, which it shouldn't -- I was running a minimal installation without sudo installed, and in any case the administrator should decide when to become root and how. (Issue 17)
  • The Makefile installs into /lib/security instead of /lib64/security. This has since been fixed. (Issue 6)

After build and installation it was easy to generate a secret key for each individual user account. The key is stored in the user's home directory, which Issue 4 notes has some downsides, and the resolution to Issue 24 provides a partial workaround for this. The home directory seems like a nice default to me.

In the end, I found the google-authenticator module isn't suitable for my regular ssh use due to no fault of its developers. I normally use SSH public key authentication and that's handled by OpenSSH natively, separately from PAM, and thus bypasses 2-factor authentication entirely. So I can have 2-factor authentication with password authentication, but not with public key authentication, which is really what I want.

Does anyone know of a way to configure things that way? I wasn't able to find a way, so I'm not planning on using this for shell logins right now. But it's still a nice option for Google logins right now, and I expect the google-authenticator project will advance over time.

SAS 70 becomes SSAE 16

In recent years it’s become increasingly common for hosting providers to advertise their compliance with the SAS 70 Type II audit. Interest in that audit often comes from hosting customers’ need to meet Sarbanes-Oxley (aka Sarbox) or other legal requirements in their own businesses. But what is SAS 70?

It was not clear to me at first glance that SAS 70 is actually a financial accounting audit, not one that deals primarily with privacy, information technology security, or other areas.

SAS 70 was created by the American Institute of Certified Public Accountants (AICPA) and contains guidelines for assessing organizations’ service delivery processes and controls. The audit is performed by an independent Certified Public Accountant.

Practically speaking, what does passing a SAS 70 audit tell us about an organization? Most importantly that it is financially reliable, and thus hopefully a safe partner for providing critical Internet hosting and data storage services.

On June 15, 2011, the SAS 70 audit will be effectively replaced by the new SSAE 16 attestation standard (Statement on Standards for Attestation Engagements no. 16, Reporting on Controls at a Service Organization). Thus the focus appears to shift from an external auditor investigating an organization, to the organization making claims about itself under the guidance of an auditor.

SSAE 16 was created by the AICPA to make the United States service organization reporting standard compatible with the new international service organization reporting standard, ISAE 3402, which is freely available in PDF format. The SSAE 16 document is available only for a fee.

The AICPA’s FAQ on the SAS 70 to SSAE 16 transition makes an interesting point:

Q. — Will entities now become “SSAE 16 certified”?

A. — No! A popular misconception about SAS 70 is that a service organization becomes “certified” as SAS 70 compliant after undergoing a type 1 or type 2 service auditor’s engagement. There is no such thing as being SAS 70 certified and there will be no such thing as being SSAE 16 certified. An SSAE 16 report (as with a SAS 70 report) is primarily an auditor to auditor communication, the purpose of which is to provide user auditors with information about controls at a service organization that are relevant to the user entities’ financial statements.

This is interesting because many in the industry informally state that they are “SAS 70 Type II certified”. But practically speaking for those of us involved in Internet hosting, is “certification” very different from “passing an audit”? It serves primarily as a requirement checklist item about hosting providers in either case.

Many major hosting providers have completed a SAS 70 Type II audit, including Rackspace (and Rackspace Cloud), Amazon Web Services, SoftLayer (and The Planet, which SoftLayer recently acquired), Verio, Terremark, and ServePath, to mention a few that we have worked with. Presumably these will make an SSAE 16 attestation later this year.

Note that many VPS and cloud hosting providers do not report having been SAS 70 audited. If this is a requirement for your hosting, it's important to look for it early before settling on a provider.

More details about the SAS 70 to SSAE 16 transition are available on the AICPA Service Organization Controls Reporting website.

Speeding up the Spree demo site

There's a lot that can be done to speed up Spree, and Rails apps in general. Here I'm not going to deal with most of that. Instead I want to show how easy it is to speed up page delivery using standard HTTP server tuning techniques, demonstrated on demo.spreecommerce.com.

First, let's get a baseline performance measure from the excellent webpagetest.org service using their remote Internet Explorer 7 tests:

  • First page load time: 2.1 seconds
  • Repeat page load time: 1.5 seconds

The repeat load is faster because the browser has images, JavaScript, and CSS cached, but it still has to check back with the server to make sure they haven't changed. Full details are in this initial report.

The demo.spreecommerce.com site is run on a Xen VPS with 512 MB RAM, CentOS 5 i386, Apache 2.2, and Passenger 2.2. There were several things to tune in the Apache httpd.conf configuration:

  • mod_deflate was already enabled. Good. That's a big help.
  • Enable HTTP keepalive: KeepAlive On and KeepAliveTimeout 3
  • Limit Apache children to keep RAM available for Rails: StartServers 5, MinSpareServers 2, MaxSpareServers 5
  • Limit Passenger pool size to 2 child processes (down from the default 6), to queue extra requests instead of using slow swap memory: PassengerMaxPoolSize 2
  • Enable browser & intermediate proxy caching of static files: ExpiresActive On and ExpiresByType image/jpeg "access plus 2 hours" etc. (see below for full example)
  • Disable ETags which aren't necessary once Expires is enabled: FileETag None and Header unset ETag
  • Disable unused Apache modules: free up memory by commenting out LoadModule proxy, proxy_http, info, logio, usertrack, speling, userdir, negotiation, vhost_alias, dav_fs, autoindex, most authn_* and authz_* modules
  • Disable SSLv2 (for security and PCI compliance, not performance): SSLProtocol all -SSLv2 and SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:-LOW:-SSLv2:-EXP

After making these changes, without tuning Rails, Spree, or the database at all, a new webpagetest.org run reports:

  • First page load time: 1.2 seconds
  • Repeat page load time: 0.4 seconds

That's an easy improvement, a reduction of 0.9 seconds for the initial load and 1.1 seconds for a repeat load! Complete details are in this follow-on report.

The biggest wins came from enabling HTTP keepalive, which allows serving multiple files from a single HTTP connection, and enabling static file caching which eliminates the majority of requests once the images, JavaScript, and CSS are cached in the browser.

Note that many of the resource-limiting changes I made above to Apache and Passenger would be too restrictive if more RAM or CPU were available, as is typical on a dedicated server with 2 GB RAM or more. But when running on a memory-constrained VPS, it's important to put such limits in place or you'll practically undo any other tuning efforts you make.

I wrote about these topics a year ago in a blog post about Interchange ecommerce performance optimization. I've since expanded the list of MIME types I typically enable static asset caching for in Apache. Here's a sample configuration snippet to put in the <VirtualHost> container in httpd.conf:

    ExpiresActive On
    ExpiresByType image/gif   "access plus 2 hours"
    ExpiresByType image/jpeg  "access plus 2 hours"
    ExpiresByType image/png   "access plus 2 hours"
    ExpiresByType image/tiff  "access plus 2 hours"
    ExpiresByType text/css    "access plus 2 hours"
    ExpiresByType image/bmp   "access plus 2 hours"
    ExpiresByType video/x-flv "access plus 2 hours"
    ExpiresByType video/mpeg  "access plus 2 hours"
    ExpiresByType video/quicktime "access plus 2 hours"
    ExpiresByType video/x-ms-asf  "access plus 2 hours"
    ExpiresByType video/x-ms-wm   "access plus 2 hours"
    ExpiresByType video/x-ms-wmv  "access plus 2 hours"
    ExpiresByType video/x-ms-wmx  "access plus 2 hours"
    ExpiresByType video/x-ms-wvx  "access plus 2 hours"
    ExpiresByType video/x-msvideo "access plus 2 hours"
    ExpiresByType application/postscript        "access plus 2 hours"
    ExpiresByType application/msword            "access plus 2 hours"
    ExpiresByType application/x-javascript      "access plus 2 hours"
    ExpiresByType application/x-shockwave-flash "access plus 2 hours"
    ExpiresByType image/vnd.microsoft.icon      "access plus 2 hours"
    ExpiresByType application/vnd.ms-powerpoint "access plus 2 hours"
    ExpiresByType text/x-component              "access plus 2 hours"

Of course you'll still need to tune your Spree application and database, but why not tune the web server to get the best performance you can there?

Red Hat SELinux policy for mod_wsgi

Using SELinux, you can safely grant a process only the permissions it needs to perform its function, and no more. Linux distributions provide policies to enforce these limits on most software they package, but many aren't covered. We've made allowances for mod_wsgi on RHEL and CentOS 5 by extending Apache httpd's SELinux policy.

It seems the SELinux policy for Apache httpd is twice as large as any other package's. The folks at Red Hat have put a lot of work into making sure that attackers who manage to exploit httpd can't break out to the rest of your system, while still allowing the flexibility to serve most applications. Consult the httpd_selinux man page if messages in audit.log coincide with your error.

File Contexts

If you've created files and/or directories in /etc/httpd, make sure they have the proper file contexts so the daemon can read them:

  # restorecon -vR /etc/httpd

httpd can only serve files with an explicitly allowed file context. Configure the context of files and directories within your production code base using the semanage command:

  # semanage fcontext --add --ftype -- --type httpd_sys_content_t "/home/projectname/live(/.*)?"
  # semanage fcontext --add --ftype -d --type httpd_sys_content_t "/home/projectname/live(/.*)?"
  # restorecon -vR /home/projectname/live

View file contexts with ls -Z. Changes should be generally accomplished with semanage and restorecon -vR.

Booleans

The httpd policy provides several boolean options for easy run-time configuration:

  • httpd_can_network_connect - Allows httpd to make network connections, including the local ones you'll be making to a database
  • httpd_enable_homedirs - Allows httpd to access /home/

Booleans are persistently set using the setsebool command with the -P flag:

  # setsebool -P httpd_can_network_connect on

WSGI Socket

When running in daemon mode, httpd and the mod_wsgi daemon communicate via a UNIX socket file. This should usually have a context of httpd_var_run_t. The standard Red Hat SELinux policy includes an entry for /var/run/wsgi.* to use this context, so it makes sense to put the socket there using the WSGISocketPrefix directive within your httpd configuration:

  WSGISocketPrefix run/wsgi

(Note that run/wsgi translates to /etc/httpd/run/wsgi which is symlinked to /var/run/wsgi.)

If socket communication fails, httpd returns a 503 "Temporarily Unavailable" error response.

SELinux Policy Module

In the course of our testing SELinux denials like the following appeared:

  host=example.com type=AVC msg=audit(1262803154.315:1851): avc:  denied  { execmem } for  pid=5337 comm="httpd" scontext=root:system_r:httpd_t:s0 tcontext=root:system_r:httpd_t:s0 tclass=process

Unusual behavior like this is usually best allowed by creating application-specific SELinux policy modules. If you cannot resolve these AVC errors by manipulating file contexts or booleans, collect all the errors into a single file and feed that into the audit2allow utility:

  # yum install policycoreutils
  # mkdir ~/tmp  # if this doesn't exist already
  # audit2allow --module wsgi < ~/tmp/pile_of_auditd_output > ~/tmp/wsgi.te

This will output source for a new policy module. You might review the .te file before compiling. Ours looks like this:

module wsgi 1.0;

require {
      type httpd_t;
      class process execmem;
}

#============= httpd_t ==============
allow httpd_t self:process execmem;

Compile this source into a new policy module and package it:

  # checkmodule -M -m -o ~/tmp/wsgi.mod ~/tmp/wsgi.te
  # semodule_package --outfile ~/tmp/wsgi.pp --module ~/tmp/wsgi.mod

Once created, the module may be installed permanently into any compatible system's SELinux configuration:

  # semodule --install ~/tmp/wsgi.pp

There's plenty of room for improvement here. The file contexts we assigned with semanage should be defined in a .fc source file and included within the policy module. And creating a new context just for the WSGI daemon to transition into would restrict it further, allowing only a subset of Apache httpd's abilities. Writing your own policy like this allows you much finer tuning of your processes' limits, while allowing their needed functionality.

Surge 2010 day 1

Today (technically, yesterday) was the first day of the Surge 2010 conference in Baltimore, Maryland. The Tremont Grand venue is perfect for a conference. The old Masonic lodge makes for great meeting rooms, and having a hallway connect it to the hotel was nice to avoid the heavy rain today. The conference organization and scheduling and Internet have all been solid. Well done!

There were a lot of great talks, but I wanted to focus on just one that was very interesting to me: Artur Bergman's on scaling Wikia. Some points of interest:

  • They (ab)use Google Analytics to track other things besides the typical pages viewed by whom, when. For example, page load time as measured by JavaScript, with data sent to a separate GA profile for analysis separately from normal traffic. That is then correlated with exit rates to give an idea of the benefit of page delivery speed in terms of user stickiness.
  • They use the excellent Varnish reverse proxy cache.
  • 500 errors from the origin result in a static page served by Varnish, with error data hitting a separate Google Analytics profile.
  • They have both geographically distributed servers and team.
  • They've found SSDs (solid state disks) to be well worth the extra cost: fast, using less power in a given server, and requiring fewer servers overall. They have to use Linux software RAID because no hardware RAID controllers they've tested could keep up with the speed of SSD. They have run into the known problems with disk write performance dropping as they fill and recycle, but haven't found it to be a problem when used on replaceable cache machines.
  • They run their own CDN, with nodes running Varnish, Quagga (for BGP routing), BIND, and Scribe. But they use Akamai for static content.
  • Even running Varnish with 1 second TTL can save your backend app servers when heavy traffic arrives! One hit per second is no problem; thousands may mean meltdown.
  • Serving stale cached content when the backend is down can be a good choice. It means most visitors will never know anything was wrong. (Depends on the site's functions, of course.)
  • Their backup datacenter in Iowa is in a former nuclear bunker. See monitoring graphs for it.
  • Wikia ops staff interact with their users via IRC. This "crowdsourced monitoring" has resulted in a competition between Wikia ops people and the users to see who can spot outages first.
  • Having their own hardware in multiple redundant datacenters has meant much more leverage in pricing discussions with datacenters. "We can just move."
  • They own their own hardware, and run on bare metal. At no time does user traffic pass through any virtualized systems at all. The performance just isn't there. They do use virtual machines for some external monitoring stuff.
  • They use Riak for N-master inter-datacenter synchronization, and RiakFS for sessions and files. RiakFS is for the "legacy" MediaWiki need for POSIX access to files, but they can serve those files to the general public from Riak's HTTP interface via Varnish cache.
  • They use VPN tunnels between datacenters. Sometimes using their own routes, even over multiple hops, leads to faster transit than going over the public Internet.
  • Lots of interesting custom VCL (Varnish Configuration Language) examples.

This had plenty of interesting things to consider for any web application architecture.

PostgreSQL 8.4 in RHEL/CentOS 5.5

The announcement of end of support coming soon for PostgreSQL 7.4, 8.0, and 8.1 means that people who've put off upgrading their Postgres systems are running out of time before they're in the danger zone where critical bugfixes won't be available.

Given that PostgreSQL 7.4 was released in November 2003, that's nearly 7 years of support, quite a long time for free community support of an open-source project.

Many of our systems run Red Hat Enterprise Linux 5, which shipped with PostgreSQL 8.1. All indications are that Red Hat will continue to support that version of Postgres as it does all parts of a given version of RHEL during its support lifetime. But of course it would be nice to get those systems upgraded to a newer version of Postgres to get the performance and feature benefits of newer versions.

For any developers or DBAs familiar with Postgres, upgrading to a new version with RPMs from the PGDG or other custom Yum repository is not a big deal, but occasionally we've had a client worry that using a packages other than the ones supplied by Red Hat is riskier.

For those holdouts still on PostgreSQL 8.1 because it's the "norm" on RHEL 5, Red Hat gave us a gift in their RHEL 5.5 update. It now includes separate PostgreSQL 8.4 packages that may optionally be used on RHEL 5 instead of PostgreSQL 8.1. (Both can't be used on the same system at the same time.)

I know that getting these packages from Red Hat shouldn't be necessary, but for those who feel jittery about using 3rd-party packages, it's a good nudge to switch to Postgres 8.4 using Red Hat's supported packages. Thanks to Tom Lane at Red Hat for making this happen. Though I don't know whose idea it was, Tom is the author of all the RPM commitlog messages, so thanks, Tom!

This brings up a few other rhetorical questions: Will RHEL 6 ship with PostgreSQL 9.0? Will RHEL 5.6 have backported PostgreSQL 9.0 in similar postgresql90 packages? It'd be great to see each new PostgreSQL release have supported packages in RHEL so that there's even less reason to start a new project on an older version of Postgres. RHEL 5.5 with PostgreSQL 8.4 is a nice start in that direction.

Ruby on Rails Typo blog upgrade

I needed to migrate a Typo blog (built on Ruby on Rails) from one RHEL 5 x86_64 server to another. To date I've done Ruby on Rails deployments using Apache with FastCGI, mongrel, and Passenger, and I've been looking for an opportunity to try out an nginx + Unicorn deployment to see how it compares. This was that opportunity, and here are the changes that I made to the stack during the migration:

I used the following packages from End Point's Yum repository for RHEL 5 x86_64:

  • nginx-0.7.64-2.ep
  • ruby-enterprise-1.8.7-3.ep
  • ruby-enterprise-rubygems-1.3.6-3.ep

The rest were standard Red Hat Enterprise Linux packages, including the new RHEL 5.5 postgresql84 packages. The exceptions were the Ruby gems, which were installed locally with the `gem` command as root.

I had to install an older version of one gem dependency manually, sqlite3-ruby, because the current version requires a newer version of sqlite than comes with RHEL 5. The installation commands were roughly:

yum install sqlite-devel.x86_64
gem install sqlite3-ruby -v 1.2.5

gem install unicorn
gem install typo

yum install postgresql84-devel.x86_64
gem install postgres

Then I followed (mostly) the instructions in Upgrading to Typo 5.4, which are still pretty accurate even though outdated by one release. One difference was the need to specify PostgreSQL to override the default of MySQL (even though the default is documented as being sqlite):

typo install /path/to/typo database=postgresql

Then I ran pg_dump on the old Postgres database and imported the data into the new database, and put in place the database.yml configuration file.

The Typo upgrade went pretty smoothly this time. I had to delete the sidebars configuration from the database to stop getting a 500 error for that, and redo the sidebars manually -- which I've had to do with every past Typo upgrade as well. But otherwise it was easy.

I first tested the migrated blog by running unicorn_rails manually in development mode. Then to have Unicorn start at boot time, I wrote this little shell script and put it in ~/bin/start-unicorn.sh:

#!/bin/bash
cd /path/to/app || exit 1
unicorn_rails -E production -D -c config/unicorn.conf.rb

Then added a cron job to run it:

@reboot bin/start-unicorn.sh

That unicorn.conf.rb file contains only:

listen 8080
worker_processes 4

The listen port 8080 is the default, but I may need to change it. Unicorn defaults to only 1 worker process, so I increased it to 4.

I added the following nginx configuration inside the http { ... } block (actually in a separate include file):

upstream app_server {
    server 127.0.0.1:8080 fail_timeout=0;
}

server {
    listen       the.ip.add.ress:80;
    server_name  the.host.name;

    location / { 
        root   /path/to/rails/typo/public/cache;

        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        #proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Host $http_host;
        proxy_redirect off;

        rewrite ^/blog/xml/atom/feed\.xml$ /articles.atom permanent;
        rewrite ^/blog/xml/rss20/feed\.xml$ /articles.rss permanent;

        if (-f $request_filename) {
            break;
        }   

        set $possible_request_filename $request_filename/index.html;
        if (-f $possible_request_filename) {
            rewrite (.*) $1/index.html;
            break;
        }   

        set $possible_request_filename $request_filename.html;
        if (-f $possible_request_filename) {
            rewrite (.*) $1.html;
            break;
        }   

        if (!-f $request_filename) {
            proxy_pass http://app_server;
            break;
        }   
    }   

    # Rails error pages
    error_page 500 502 503 504 /500.html;
    location = /500.html {
        root   /path/to/rails/typo/public;
    }   
}

The configuration was a little complicated to get nginx serving static content directly, including cache files that Typo writes out. I had to add special handling for / which gets cached as /index.html, but can't be called that when passing URIs to Typo, as it doesn't know about any /index.html. And all HTML cache files end in .html, though the URIs don't, so those need special handling too.

But when all is said and done, the blog is now running on the latest version of Typo, on the latest Unicorn, Rails, Ruby Enterprise Edition, PostgreSQL, and nginx, with all static content and fully-cached pages served directly by nginx, and for the most part only dynamic requests being served by Unicorn. I need to tweak the nginx rewrite rules a bit more to get 100% of static content served directly by nginx.

As far as blogging platforms go, I can recommend Typo mainly for Rails enthusiasts who want to write their own plugins, tweak the source, etc. WordPress or Movable Type are so much more widely used that non-programmers are going to have a lot easier time deploying and supporting them. They've had a lot more security vulnerabilities requiring updates, though that may also be a function of popularity and payoff for those exploiting them.

Rails deployment seems to take a lot of memory no matter how you do it. I don't think nginx + Unicorn uses much less RAM than Apache + Passenger, mostly the different between nginx and Apache themselves. But using Unicorn does allow for running the application processes on another server or several servers without needing nginx or Apache on those other servers. It does provide for clean separation between the web server and the application(s), including possibly different SELinux contexts rather than always httpd_sys_script_t as we see with Passenger. Passenger at least switches the child process UID to run with different permissions from the main web server, which is good. Both Passenger and Unicorn are much nicer than FastCGI, which I've always found to be a little buggy, and mongrel, which required specifying a range of ports and load-balancing across all of them in the proxy -- managing multiple port ranges is a pain with multiple apps on the same machine, especially when some need more than others.

I think if you have plenty of RAM, going with Apache + Passenger may still be the easiest Rails web deployment method overall, when mixed with other static content, server-side includes, PHP, and CGIs. But for high-traffic and custom setups, nginx + Unicorn is a nice option.

Efficiency of find -exec vs. find | xargs

This is a quick tip for anyone writing a cron job to purge large numbers of old files.

Without xargs, this is a pretty common way to do such a purge, in this case of all files older than 31 days:

find /path/to/junk/files -type f -mtime +31 -exec rm -f {} \;

But that executes rm once for every single file to be removed, which adds a ton of overhead just to fork and exec rm so many times. Even on modern operating systems that are so efficient with fork, it can easily increase the I/O and load and runtime by 10 times or more than just running a single rm command with a lot of file arguments.

Instead do this:

find /path/to/junk/files -type f -mtime +31 -print0 | xargs -0 -r rm -f

That will run xargs once for each very long list of files to be removed, so the overhead of fork & exec is incurred very rarely, and the job can spend most of its effort actually unlinking files. (The xargs -r option says not to run the command if there is no input to xargs.)

How long can the argument list to xargs be? It depends on the system, but xargs --show-limits will tell us. Here's output from a RHEL 5 x86_64 system (using findutils 4.2.27):

% xargs --show-limits                                                                                                   
Your environment variables take up 2293 bytes                                                                                        
POSIX lower and upper limits on argument length: 2048, 129024                                                                        
Maximum length of command we could actually use: 126731                                                                              
Size of command buffer we are actually using: 126731

The numbers are similar on Debian Etch and Lenny.

And here's output from an Ubuntu 10.04 x86_64 system (using findutils 4.4.2):

% xargs --show-limits
Your environment variables take up 1370 bytes
POSIX upper limit on argument length (this system): 2093734
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2092364
Size of command buffer we are actually using: 131072

Roughly 2 megabytes of arguments is a lot. But even the POSIX minimum of 4 kB is a lot better than processing one file at a time.

It doesn't usually make much of a difference, but we can tune even more. Make sure the maximum number of files is processed at one time by first changing to the base directory so that the relative pathnames are shorter:

cd /path/to/junk/files && find . -type f -mtime +31 -print0 | xargs -0 -r rm -f

That way each file argument is shorter, e.g. ./junkfile compared to /path/to/junk/files/junkfile.

The above assumes you're using GNU findutils, which includes find -print0 and xargs -0 for processing ASCII NUL-delimited filenames for safety when filenames include embedded spaces, newlines, etc.

Why is my load average so high?

One of the most common ways people notice there's a problem with their server is when Nagios, or some other monitoring tool, starts complaining about a high load average. Unfortunately this complaint carries with it very little information about what might be causing the problem. But there are ways around that. On Linux, where I spend most of my time, the load average represents the average number of process in either the "run" or "uninterruptible sleep" states. This code snippet will display all such processes, including their process ID and parent process ID, current state, and the process command line:

#!/bin/sh

ps -eo pid,ppid,state,cmd |\
    awk '$3 ~ /[RD]/ { print $0 }'

Most of the time, this script has simply confirmed what I already anticipated, such as, "PostgreSQL is trying to service 20 times as many simultaneous queries as normal." On occasion, however, it's very useful, such as when it points out that a backup job is running far longer than normal, or when it finds lots of "[pdflush]" operations in process, indicating that the system was working overtime to write dirty pages to disk. I hope it can be similarly useful to others.

Tip: Find all non-UTF-8 files

Here's an easy way to find all non-UTF-8 files for later perusal:

find . -type f | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || echo {}" > utf8_fail

I've needed this before for converting projects over into UTF-8; obviously certain files are going to be binary and will show up in this list, so manual vetting will need to be done before converting all your images over into UTF-8.

Xen MAC mismatch VNC mouse escape HOWTO

This is a story that probably shouldn't need to be told if everything is in documentation somewhere. I'm not using any fancy virtualization management tools and didn't have an easy time piecing everything together, so I thought it'd be worth writing down the steps of the manual approach I took.

Dramatis personæ:

  • Server: Red Hat Enterprise Linux 5.4 with Xen kernel
  • Guest virtual server: CentOS 5.4 running paravirtualized under Xen
  • Workstation: Ubuntu 9.10

The situation: I updated the CentOS 5 Xen virtual guest via yum and rebooted to load the new Linux kernel and other libraries such as glibc. According to Xen as reported by xm list, the guest had started back up fine, but the host wasn't reachable over the network via ping, http, or ssh, including from the host network.

The guest wasn't using much CPU (as shown by xm top), so I figured it wasn't just a slow-running fsck during startup. And I was familiar with the iptables firewall rules on this guest, so I was fairly sure I wasn't being blocked there. I needed to get to the console to see what was wrong.

The way I've done this before is using VNC to access the virtual console remotely. The Xen host was configured to accept VNC connections on localhost, which I could see by looking in /etc/xen/xend-config.sxp:

(vnc-listen '127.0.0.1')

There are 11 Xen guests, with consoles listening on TCP ports 5900-5910. Which one was the one? I don't know any simple way to get a list that maps ports to Xen guests, but I did it this way:

ps auxww | grep qemu-dm

I noted the PID of the process that was running for my guest as revealed in its command line. Then I looked for the listener running under that PID:

netstat -nlp

I looked for $pid/qemu-dm in the PID/Program Name column and could then see the TCP port in the Local Address column. In my case it was 127.0.0.1:5903.

So I set up an ssh tunnel to the server for my VNC traffic:

ssh -f -N -L 5903:localhost:5903 root@$remote_host &

Then I opened the default Ubuntu/GNOME VNC viewer, labeled "Remote Desktop Viewer" under the Internet menu. This program is actually called Vinagre, and is basic but works. I connected to localhost:5903, since I'd forwarded my own local TCP port 5903 to the remote port 5903.

The remote console came up, and I was presented with the login banner and prompt. If I hadn't had the root password, I would have needed to reboot the guest and start it in runlevel 0 to get a root shell without a password and change the password. But I did have the password, so that wasn't necessary.

Then when trying to change to another window on my desktop, I ran into the biggest snag of the whole exercise: Getting control of the mouse out of the VNC remote desktop window and back in my own desktop! I couldn't find anything accurate on this in any documentation, forums, etc. Finally I stumbled across the trick: Press F10, which pulls down the Machine menu in Vinagre and as a side-effect takes control of the mouse away from the remote desktop. It was nice not to have to ssh in from another machine to kill Vinagre. But it makes me wonder how I'd send an F10 through to the remote console ...

Armed with the root password I was able to log into the guest and discover that only the lo (loopback) interface started on boot. The eth0 and eth1 interfaces failed because there was no virtual NIC available with the MAC addresses specified in /etc/sysconfig/network-scripts/ifcfg-eth0 and eth1.

That was because the virtual machine image had been cloned from another one and hadn't been given new unique MAC addresses. The problem was easily fixed by updating the ifcfg-eth0 and ifcfg-eth1 files with the MAC addresses actually given to the interfaces, as seen by ifconfig, which were ultimately assigned by the Xen host in /etc/xen/$host in the vif parameter. (You can also specify no MAC addresses in the guest at all and it will use whatever it gets.)

Then after running service network restart the networking was back, and I rebooted to make sure it started correctly on its own.

Spree on Heroku for Development

Yesterday, I worked through some issues to setup and run Spree on Heroku. One of End Point's clients is using Spree for a multi-store solution. We are using the the recently released Spree 0.10.0.beta gem, which includes some significant Spree template and hook changes discussed here and here in addition to other substantial updates and fixes. Our client will be using Heroku for their production server, but our first goal was to work through deployment issues to use Heroku for development.

Since Heroku includes a free offering to be used for development, it's a great option for a quick and dirty setup to run Spree non-locally. I experienced several problems and summarized them below.

Application Changes

1. After a failed attempt to setup the basic Heroku installation described here because of a RubyGems 1.3.6 requirement, I discovered the need for Heroku's bamboo deployment stack, which requires you to declare the gems required for your application. I also found the Spree Heroku extension and reviewed the code, but I wanted to take a more simple approach initially since the extension offers several features that I didn't need. After some testing, I created a .gems file in the main application directory including the contents below to specify the gems required on the Badious Bamboo Heroku stack.

rails -v 2.3.5
highline -v '1.5.1'
authlogic -v '>=2.1.2'
authlogic-oid -v '1.0.4'
activemerchant -v '1.5.1'
activerecord-tableless -v '0.1.0'
less -v '1.2.20'
stringex -v '1.0.3'
chronic -v '0.2.3'
whenever -v '0.3.7'
searchlogic -v '2.3.5'
will_paginate -v '2.3.11'
faker -v '0.3.1'
paperclip -v '>=2.3.1.1'
state_machine -v '0.8.0'

2. The next block I hit was that git submodules are not supported by Heroku, mentioned here. I replaced the git submodules in our application with the Spree extension source code.

3. I also worked through addressing Heroku's read-only filesystem limitation. The setting perform_caching is set to true for a production environment by default in an application running from the Spree gem. In order to run the application for development purposes, perform_caching was set to false in RAILS_APP/config/environments/production.rb:

config.action_controller.perform_caching             = false

Another issue that came up due to the read-only filesystem constraint was that the Spree extensions were attempting to copy files over to the rails application public directory during the application restart, causing the application to die. To address this issue, I removed the public images and stylesheets from the extension directories and verified that these assets were included in the main application public directory.

I also removed the frozen Spree gem extension public files (javascripts, stylesheets and images) to prevent these files from being copied over during application restart. These files were moved to the main application public directory.

4. Finally, I disabled the allow_ssl_in_production to turn SSL off in my development application. This change was made in the extension directory extension_name_extensions.rb file.

AppConfiguration.class_eval do
  preference :allow_ssl_in_production, :boolean, :default => false
end

Obviously, this isn't the preference setting that will be used for the production application, but it works for a quick and dirty Heroku development app. Heroku's SSL options are described here.

Deployment Tips

1. To create a Heroku application running on the Bamboo stack, I ran:

heroku create --stack bamboo-ree-1.8.7 --remote bamboo

2. Since my git repository is hosted on github, I ran the following to push the existing repository to my heroku app:

git push bamboo master

3. To run the Spree database bootstrap (or database reload), I ran the following:

heroku rake db:bootstrap AUTO_ACCEPT=1

As a side note, I ran the command heroku logs several times to review the latest application logs throughout troubleshooting.

Despite the issues noted above, the troubleshooting yielded an application that can be used for development. I also learned more about Heroku configurations that will need to be addressed when moving the project to production, such as SSL setup and multi domain configuration. We'll also need to determine the best option for serving static content, such as using Amazon's S3, which is supported by the Spree Heroku extension mentioned above.

Learn more about End Point's Ruby on Rails Development or Ruby on Rails Ecommerce Services.

Riak Install on Debian Lenny

I'm doing some comparative analysis of various distributed non-relational databases and consequently wrestled with the installation of Riak on a server running Debian Lenny.

I relied upon the standard "erlang" debian package, which installs cleanly on a basically bare system without a hitch (as one would expect). However, the latest Riak's "make" tasks fail to run; this is because the rebar script on which the make tasks rely chokes on various bad characters:

riak@nosql-01:~/riak$ make all rel
./rebar compile
./rebar:2: syntax error before: PK
./rebar:11: illegal atom
./rebar:30: illegal atom
./rebar:72: illegal atom
./rebar:76: syntax error before: ��n16
./rebar:79: syntax error before: ','
./rebar:91: illegal integer
./rebar:149: illegal atom
./rebar:160: syntax error before: Za��ze
./rebar:172: illegal atom
./rebar:176: illegal atom
escript: There were compilation errors.
make: *** [compile] Error 127

Delicious.

Ultimately, I came across this article describing issues getting Riak to install on Ubuntu 9.04, and ultimately determined that the Erlang version mentioned seemed to apply here. Following the article's instructions for building Erlang from source worked out fine, and so far I've been able to start, ping, and stop the local Riak server without incident.

Since a true investigation requires running these kinds of tools in a cluster, and that means automation of the installation/configuration is desirable, I've been scripting out the configuration steps (putting things into a configuration management tool like Puppet will come later when we're farther along and closer to picking the right solution for the problem in question). So, here's the script I've been running to build these things from my local machine (relying upon SSH); these are rough, a work in progress, and are not intended as examples of excellence, elegance, or beauty -- they simply get the job done (so far) for me and may help somebody else.

#!/bin/sh

hostname=$1
erlang_release=otp_src_R13B04
riak_release=riak-0.8.1

ssh root@$hostname "
# necessary for Erlang build
apt-get install build-essential libncurses5-dev m4
apt-get install openssl libssl-dev
# standard from-source build
mkdir erlang-build
cd erlang-build
wget http://ftp.sunet.se/pub/lang/erlang/download/$erlang_release.tar.gz
tar xzf $erlang_release.tar.gz
cd $erlang_release
./configure
make
make install
# put all of riak in a riak user
useradd -m riak
su -c 'wget http://bitbucket.org/basho/riak/downloads/$riak_release.tar.gz' - riak
su -c 'tar xzf $riak_release.tar.gz' - riak
su -c 'cd $riak_release && make all rel' - riak
su -c 'mv $riak_release/rel riak' - riak
"

(I have other scripts for preparing the box post-OS-install, but I don't think they impact this particular part of the process.)

PostgreSQL EC2/EBS/RAID 0 snapshot backup

One of our clients uses Amazon Web Services to host their production application and database servers on EC2 with EBS (Elastic Block Store) storage volumes. Their main database is PostgreSQL.

A big benefit of Amazon's cloud services is that you can easily add and remove virtual server instances, storage space, etc. and pay as you go. One known problem with Amazon's EBS storage is that it is much more I/O limited than, say, a nice SAN.

To partially mitigate the I/O limitations, they're using 4 EBS volumes to back a Linux software RAID 0 block device. On top of that is the xfs filesystem. This gives roughly 4x the I/O throughput and has been effective so far.

They ship WAL files to a secondary server that serves as warm standby in case the primary server fails. That's working fine.

They also do nightly backups using pg_dumpall on the master so that there's a separate portable (SQL) backup not dependent on the server architecture. The problem that led to this article is that extra I/O caused by pg_dumpall pushes the system beyond its I/O limits. It adds both reads (from the PostgreSQL database) and writes (to the SQL output file).

There are several solutions we are considering so that we can keep both binary backups of the database and SQL backups, since both types are valuable. In this article I'm not discussing all the options or trying to decide which is best in this case. Instead, I want to consider just one of the tried and true methods of backing up the binary database files on another host to offload the I/O:

  1. Create an atomic snapshot of the block devices
  2. Spin up another virtual server
  3. Mount the backup volume
  4. Start Postgres and allow it to recover from the apparent "crash" the server had (since there wasn't a clean shutdown of the database before the snapshot
  5. Do whatever pg_dump or other backups are desired
  6. Make throwaway copies of the snapshot for QA or other testing

The benefit of such snapshots is that you get an exact backup of the database, with whatever table bloat, indexes, statistics, etc. exactly as they are in production. That's a big difference from a freshly created database and import from pg_dump.

The difference here is that we're using 4 EBS volumes with RAID 0 striped across them, and there isn't currently a way to do an atomic snapshot of all 4 volumes at the same time. So it's no longer "atomic" and who knows what state the filesystem metadata and the file data itself would be in?

Well, why not try it anyway? Filesystem metadata doesn't change that often, especially in the controlled environment of a Postgres data volume. Snapshotting within a relatively short timeframe would be pretty close to atomic, and probably look to the software (operating system and database) like some kind of strange crash since some EBS volumes would have slightly newer writes than others. But aren't all crashes a little unpredictable? Why shouldn't the software be able to deal with that? Especially if we have Postgres make a checkpoint right before we snapshot.

I wanted to know if it was crazy or not, so I tried it on a new set of services in a separate AWS account. Here are the notes and some details of what I did:

  1. Created one EC2 image:
    Amazon EC2 Debian 5.0 lenny AMI built by Eric Hammond
    Debian AMI ID ami-4ffe1926 (x86_64)
    Instance Type: High-CPU Extra Large (c1.xlarge) - 7 GB RAM, 8 CPU cores
  2. Created 4 x 10 GB EBS volumes
  3. Attached volumes to the image
  4. Created software RAID 0 device:
    mdadm -C /dev/md0 -n 4 -l 0 -z max /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  5. Created XFS filesystem on top of RAID 0 device:
    mkfs -t xfs -L /pgdata /dev/md0
  6. Set up in /etc/fstab and mounted:
    mkdir /pgdata
    # edit /etc/fstab, with noatime
    mount /pgdata
  7. Installed PostgreSQL 8.3
  8. Configured postgresql.conf to be similar to primary production database server
  9. Created empty new database cluster with data directory in /pgdata
  10. Started Postgres and imported a play database (from public domain census name data and Project Gutenberg texts), resulting in about 820 MB in data directory
  11. Ran some bulk inserts to grow database to around 5 GB
  12. Rebooted EC2 instance to confirm everything came back up correctly on its own
  13. Set up two concurrent data-insertion processes:
    • 50 million row insert based on another local table (INSERT INTO ... SELECT ...), in a single transaction (hits disk hard, but nothing should be visible in the snapshot because the transaction won't have committed before the snapshot is taken)
    • Repeated single inserts in autocommit mode (Python script writing INSERT statements using random data from /usr/share/dict/words piped into psql), to verify that new inserts made it into the snapshot, and no partial row garbage leaked through
  14. Started those "beater" jobs, which mostly consumed 2-3 CPU cores
  15. Manually inserted a known test row and created a known view that should appear in the snapshot
  16. Started Postgres's backup mode that allows for copying binary data files in a non-atomic manner, which also does a CHECKPOINT and thus also a filesystem sync:
    SELECT pg_start_backup('raid_backup');
  17. Manually inserted a 2nd known test row & 2nd known test view that I don't want to appear in the snapshot after recovery
  18. Ran snapshot script which calls ec2-create-snapshot on each of the 4 EBS volumes -- during first run, run serially quite slowly taking about 1 minute total; during second run, run in parallel such that the snapshot point was within 1 second for all 4 volumes
  19. Tell Postgres the backup's over:
    SELECT pg_stop_backup();
  20. Ran script to create new EBS volumes derived from the 4 snapshots (which aren't directly usable and always go into S3), using ec2-create-volume --snapshot
  21. Run script to attach new EBS volumes to devices on the new EC2 instance using ec2-attach-volume
  22. Then, on the new EC2 instance for doing backups:
    • mdadm --assemble --scan
    • mount /pgdata
    • Start Postgres
    • Count rows on the 2 volatile tables; confirm that the table with the in-process transaction doesn't show any new rows, and that the table getting individual rows committed to reads correctly
    • VACUUM VERBOSE -- and confirm no errors or inconsistencies detected
    • pg_dumpall # confirmed no errors and data looks sound

It worked! No errors or problems, and pretty straightforward to do.

Actually before doing all the above I first did a simpler trial run with no active database writes happening, and didn't make any attempt for the 4 EBS snapshots to happen simultaneously. They were actually spread out over almost a minute, and it worked fine. With the confidence that the whole thing wasn't a fool's errand, I then put together the scripts to do lots of writes during the snapshot and made the snapshots run in parallel so they'd be close to atomic.

There are lots of caveats to note here:

  • This is an experiment in progress, not a how-to for the general public.
  • The data set that was snapshotted was fairly small.
  • Two successful runs, even with no failures, is not a very big sample set. :)
  • I didn't use Postgres's point-in-time recovery (PITR) here at all -- I just started up the database and let Postgres recover from an apparent crash. Shipping over the few WAL logs from the master collected during the pg_backup run after the snapshot copying is complete would allow a theoretically fully reliable recovery to be made, not just a practically non-failing recovery as I did above.

So there's more work to be done to prove this technique viable in production for a mission-critical database, but it's a promising start worth further investigation. It shows that there is a way to back up a database across multiple EBS volumes without adding noticeably to its I/O load by utilizing the Amazon EBS data store's snapshotting and letting a separate EC2 server offload the I/O of backups or anything else we want to do with the data.

MySQL Ruby Gem CentOS RHEL 5 Installation Error Troubleshooting

Building and installing the Ruby mysql gem on freshly-installed Red Hat based systems sometimes produces the frustratingly ambiguous error below:

# gem install mysql
/usr/bin/ruby extconf.rb
checking for mysql_ssl_set()... no
checking for rb_str_set_len()... no
checking for rb_thread_start_timer()... no
checking for mysql.h... no
checking for mysql/mysql.h... no
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers.  Check the mkmf.log file for more
details.  You may need configuration options.

Searching the web for info on this error yields two basic solutions:

  1. Install the mysql-devel package (this provides the mysql.h file in /usr/include/mysql/).
  2. Run gem install mysql -- --with-mysql-config=/usr/bin/mysql_config or some other additional options.

These are correct but not sufficient. Because this gem compiles a library to interface with MySQL's C API, the gcc and make packages are also required to create the build environment:

# yum install mysql-devel gcc make
# gem install mysql -- --with-mysql-config=/usr/bin/mysql_config

Alternatively, if you're using your distro's ruby (not a custom build like Ruby Enterprise Edition), you can install EPEL's ruby-mysql package along with their rubygem-rails and other packages.

On Linux, noatime includes nodiratime

Note to performance-tweaking Linux sysadmins, pointed out to me by Selena Deckelmann: On Linux, the filesystem attribute noatime includes nodiratime, so there's no need to say both "noatime,nodiratime" as I once did. (See this article on atime for details if you're not familiar with it.)

Apparently the nodiratime attribute was added later as a subset of noatime applying only to directories to still offer a bit of performance boost in situations where noatime on files would cause trouble (as with mutt and a few other applications that care about atimes).

See also the related newer relatime attribute in the mount(8) manpage.

DevCamps on different systems, including Plesk, CPanel and ISPConfig

In the last few months I've been active setting up DevCamps for several of our newer clients. DevCamps is an open source development environment system, that once setup, allows for easily starting up and tearing down a development environment for a specific site/code base.

I've done many camps setups, and you tend to run into surprises from system to system, but what was most interesting and challenging about these latest installs was that they were to be done on systems running Plesk, CPanel, and ISPConfig. Some things that are different between a normal deployment and one on the above mentioned platforms are:

  • On the Plesk system there was a secured Linux called 'Atomic Secured Linux' which includes the grsecurity module. One restriction of this module is (TPE) Trusted Path Execution which required the camp bin scripts to be owned by root and the bin directory could not be writable by other groups, otherwise they would fail to run.
  • Permissions are a mixed bag, where typically we set all of the files to be owned by the site owner, in Plesk there are special groups such as psacln that the files need to be owned by.
  • On the CPanel system we needed to move the admin images for Interchange to a different directory since CPanel includes Interchange and has aliases for /interchange/ and /interchange-5/ to point at a central location which we would not be using.
  • On ISPConfig and Plesk the home directories of the sites are in different places, which required deploying the code in such places as /var/www/clients/client/user/domain.com or /var/www/vhosts/domain.com.

In the end we were able to get DevCamps to run properly on these various platforms both in development and production. If you are starting a new project or working on an existing project and could use a strong development environment, consider DevCamps.

SSHFS and ServerAliveInterval

If you're using SSHFS (as I do recently since OpenVPN started crashing frequently on my OpenBSD firewall), note that the ServerAliveInterval option for SSH can have significant impact on the stability of your mounts.

I set it to 10 seconds on my system and have been happy with the results so far. It could probably safely go considerably higher than that.

It's not on by default, which leaves the stability of your SSH tunnels up to the success of TCP keepalive (which is on by default). On my wireless network, that alone has not been sufficient.

dstat: better system resource monitoring

I recently came across a useful tool I hadn't heard of before: dstat, by Dag Wieers (of DAG RPM-building fame). He describes it as "a versatile replacement for vmstat, iostat, netstat, nfsstat and ifstat."

The most immediate benefit I found is the collation of system resource monitoring output at each point in time, removing the need to look at output from multiple monitors. The coloring helps readability too:

% dstat                                                                         
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--         
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw          
  4   1  92   3   0   0|  56  84k|   0     0 |  94 188B|1264  1369          
  3   7  43  44   1   1| 368  11M| 151 222B|   0   260k|1453  1565          
  3   2  46  48   1   0| 4325784k|   0     0 |   0     0 |1421  1584          
  2   2  47  49   0   0| 592k    0 |   0     0 |   0     0 |1513  1763          
  6   2  44  49   1   0| 448 248k|   0     0 |   0     0 |1398  1640          
  8   4  41  45   3   0| 456k    0 | 135 222B|   0     0 |1530  2102          
 18   4  38  41   0   0| 408 128k|   0    47B|   0     0 |1261  1977          
 10   4  44  43   0   0| 728 208k|   0     0 |   0     0 |1445  2203          
  6   3  39  51   0   0| 648 256k|36074124B|   0     0 |1496  2180          
  7   7  34  53   0   0|1088k    0 |1234 582B|   0     0 |1465  2057          
 14   8  28  49   0   0|2856 104k|   0     0 |   0    52k|1610  2995          
  6   6  43  45   0   0|1992k    0 |59644836B|   0     0 |1493  2391          
  9  14  34  44   0   0|2432 112k|7854 726B|   0     0 |1527  2190          
  9  11  40  41   1   0|2680k    0 |1382 972B|   0     0 |1550  2298          
  5   4  68  22   0   0| 5761096k|  124628B|   0     0 |1522  1731 ^C       

(Textual screenshot by script of util-linux and Perl module HTML::FromANSI.)

Its default one-line-per-timeslice output makes it good for collecting data samples over time, as opposed to full-screen top-like utilities such as atop, which give much more detailed information at each snapshot, but don't show history.

Since dstat is a standard package available in RHEL/CentOS and Debian/Ubuntu, it is a reasonably easy add-on to get on various systems.

dstat also allows plugins, and just in the most recent release last month were added new plugins "for showing NTP time, power usage, fan speed, remaining battery time, memcache hits and misses, process count, top process total and average latency, top process total and average CPU timeslice, and per disk utilization rates."

It sounds like it'll grow even more useful over time and is worth keeping an eye on.

Multiple links to files in /etc

I came across an unfamiliar error in /var/log/messages on a RHEL 5 server the other day:

Dec  2 17:17:23 X restorecond: Will not restore a file with more than one hard link (/etc/resolv.conf) No such file or directory

Sure enough, ls showed the inode pointed to by /etc/resolv.conf having 2 links. What was the other link?

# find /etc -samefile resolv.conf
/etc/resolv.conf
/etc/sysconfig/networking/profiles/default/resolv.conf
# ls -lai /etc/resolv.conf /etc/sysconfig/networking/profiles/default/resolv.conf
1526575 -rw-r--r-- 2 root root 69 Nov 30  2008 /etc/resolv.conf
1526575 -rw-r--r-- 2 root root 69 Nov 30  2008 /etc/sysconfig/networking/profiles/default/resolv.conf

I've worked with a lot of RHEL/CentOS 5 servers and hadn't ever dealt with these network profiles. Kiel guessed it was probably a system configuration tool that we never use, and he was right: Running system-config-network (part of the system-config-network-tui RPM package) creates the hardlinks for the default profile.

/etc/hosts gets the same treatment as /etc/resolv.conf.

I suppose SELinux's restorecond doesn't want to apply any context changes because its rules are based on filesystem paths, and the paths of the multiple links are different and could result in conflicting context settings.

Since we don't use network profiles, we can just delete the extra links in /etc/sysconfig/networking/profiles/default/.

Cisco PIX mangled packets and iptables state tracking

Kiel and I had a fun time tracking down a client's networking problem the other day. Their scp transfers from their application servers behind a Cisco PIX firewall failed after a few seconds, consistently, with a connection reset.

The problem was easily reproducible with packet sizes of 993 bytes or more, not just with TCP but also ICMP (bloated ping packets, generated with ping -s 993 $host). That raised the question of how this problem could go undetected for their heavy web traffic. We determined that their HTTP load balancer avoided the problem as it rewrote the packets for HTTP traffic on each side.

Kiel narrowed the connect resets down to iptables' state-tracking considering packets INVALID, not ESTABLISHED or RELATED as they should be.

Then he found via tcpdump that the problem was easily visible in scp connections when TCP window scaling adjustments were made by either side of the connection. We tried disabling window scaling but that didn't help.

We tried having iptables allow packets in state INVALID when they were also ESTABLISHED or RELATED, and that reduced the frequency of terminated connections, but still didn't eliminate them entirely. (And it was a kludge we weren't eager to keep in place anyway.)

We wanted to avoid some unpleasant possibilities: (1) turn off stateful firewalling or (2) perform risky updates or configuration changes on the Cisco PIX, which may or may not fix the problem, in the middle of the busy holiday ecommerce season.

Finally, Kiel found this netfilter mailing list post which describes how to enable a Linux kernel workaround for the mangled packets the Cisco generates:

echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

Of course saving that in /etc/sysctl.conf so it persists after a reboot.

So we have reliable long-running scp connections with TCP window scaling working and iptables doing its job. I love it when a plan comes together.

XZ compression

XZ is a new free compression file format that is starting to be more widely used. The LZMA2 compression method it uses first became popular in the 7-Zip archive program, with an analogous Unix command-line version called 7z.

We used XZ for the first time in the Interchange project in the Interchange 5.7.3 packages. Compared to gzip and bzip2, the file sizes were as follows:

interchange-5.7.3.tar.gz   2.4M
interchange-5.7.3.tar.bz2  2.1M
interchange-5.7.3.tar.xz   1.7M

Getting that tighter compression comes at the cost of its runtime being about 4 times slower than bzip2, but a bonus is that it decompresses about 3 times faster than bzip2. The combination of significantly smaller file sizes and faster decompression made it a clear win for distributing software packages, leading to it being the format used for packages in Fedora 12.

It's also easy to use on Ubuntu 9.10, via the standard xz-utils package. When you install that with apt-get, aptitude, etc., you'll get a scary warning about it replacing lzma, a core package, but this is safe to do because xz-utils provides compatible replacement binaries /usr/bin/lzma and friends (lzcat, lzless, etc.). There is also built-in support in GNU tar with the new --xz aka -J options.

Port knocking with knockd

One of the best ways to secure your box against SSH attacks is the use of port knocking. Basically, port knocking seals off your SSH port, usually with firewall rules, such that nobody can even tell if you are running SSH until the proper "knock" is given, at which time the SSH port appears again to a specific IP address. In most cases, a "knock" simply means accessing specific ports in a specific order within a given time frame.

Let's step back a moment and see why this solution is needed. Before SSH there was telnet, which was a great idea way back at the start of the Internet when hosts trusted each other. However, it was (and is) extremely insecure, as it entails sending usernames and passwords "in the clear" over the internet. SSH, or Secure Shell, is like telnet on steroids. With a mean bodyguard. There are two common ways to log in to a system using SSH. The first way is with a password. You enter the username, then the password. Nice and simple, and similar to telnet, except that the information is not sent in the clear. The second common way to connect with SSH is by using public key authentication. This is what I use 99% of the time. It's very secure, and very convenient. You put the public copy of your PGP key on the server, and then use your local private SSH key to authenticate. Since you can cache your private key, this means only having to type in your SSH password once, and then you can ssh to many different systems with no password needed.

So, back to port knocking. It turns out that any system connected to the internet is basically going to come under attack. One common target is SSH - specifically, people connecting to the SSH port, then trying combinations of usernames and passwords in the hopes that one of them is right. The best prevention against these attacks is to have a good password. Because public key authentication is so easy, and makes typing in the actual account password such a rare event, you can make the password something very secure, such as:

gtsmef#3ZdbVdAebAS@9e[AS4fed';8fS14S0A8d!!9~d1aAQ5.81sa0'ed

However, this won't stop others from trying usernames and passwords anyway, which fills up your logs with their attempts and is generally annoying. Thus, the need to "hide" the SSH port, which by default is 22. One thing some people do is move SSH to a "non-standard" port, where non-standard means anything but 22. Typically, some random number that won't conflict with anything else. This will reduce and/or stop all the break-in attempts, but at a high cost: all clients connecting have to know to use that port. With the ssh client, it's adding a -p argument, or setting a "Port" line in the relevant section of your .ssh/config file.

All of which brings us to port knocking. What if we could run SSH on port 22, but not answer to just anyone, but only to people who knew the secret code? That's what port knocking allows us to do. There are many variants on port knocking and many programs that implement it. My favorite is "knockd", mostly because it's simple to learn and use, and is available in some distros' packaging systems. My port knocking discussion and examples will focus on knockd, unless stated otherwise.

knockd is a daemon that listens for incoming requests to your box, and reacts when a certain combination is reached. Once knockd is installed and running, you modify your firewall rules (e.g. iptables) to drop all incoming traffic to port 22. To the outside world, it's exactly as if you are not running SSH at all. No break-in attempts are possible, and your security logs stay nice and boring. When you want to connect to the box via SSH, you first send a series of knocks to the box. If the proper combination is received, knockd will open a hole in the firewall for your IP on port 22. From this point forward, you can SSH in as normal. The new firewall entry can get removed right away, cleared out at some time period later, or you can define another knock sequence to remove the firewall listing and close the hole again.

What exactly is the knock? It's a series of connections to TCP or UDP ports. I prefer choosing a few random TCP ports, so that I can simply use telnet calls to connect to the ports. Keep in mind that when you do connect, it will appear as if nothing happened - you cannot tell that knockd is logging your attempt, and possibly acting on it.

Here's a sample knockd configuration file:

[options]
  logfile = /var/log/knockd.log

[openSSH]
  sequence    = 32144,21312,21120
  seq_timeout = 15
  command     = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
  tcpflags    = syn

[closeSSH]
  sequence    = 32144,21312,21121
  seq_timeout = 15
  command     = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
  tcpflags    = syn

In the above file, we've stated that any host that sends a TCP syn flag to ports 32144, 21312, and 21120, in that order, within 15 seconds, will cause the iptables command to be run. Note that the use of iptables is completely not hard-coded to knockd at all. Any command at all can be run when the port sequence is triggered, which allows for all sorts of fancy tricks.To close it up, we do the same sequence, except the final port is 21221.

Once knockd is installed, and the configuration file is put in place, start it up and begin testing. Leave a separate SSH connection open to the box while you are testing! If you are really paranoid, you might want to open a second SSH daemon on a second port as well. First, check that the port knocking works by triggering the port combinations. knockd comes with a command-line utility for doing so, but I usually just use telnet like so:

[greg@home ~] telnet example.com 32144
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused
[greg@home ~] telnet example.com 21312
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused
[greg@home ~] telnet example.com 21120
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused

Note that we reveived a bunch of "Connection refused" - the same message as if we tried any other random port. Also the same message that people trying to connect to a port knock protected SSH will see. If you look in the logs for knockd (set as /var/log/knockd.log in the example file above), you'll see some lines like this if all went well:

[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 1
[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 2
[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 3
[2009-11-09 14:01] 100.200.300.400: openSSH: OPEN SESAME
[2009-11-09 14:01] openSSH: running command: /sbin/iptables -I INPUT -s 100.200.300.400 -p tcp --dport 22 -j ACCEPT

Voila! Your iptables should now contain a new line:

$ iptables -L -n | grep 100.200
ACCEPT     tcp  --  100.200.300.400  anywhere            tcp dpt:ssh

The next step is to lock everyone else out from the SSH port. Add a new rule to the firewall, but make sure it goes to the bottom:

$ iptables -A INPUT -p tcp --dport ssh -j DROP
$ iptables -L | grep DROP
DROP       tcp  --  anywhere             anywhere            tcp dpt:ssh

You'll note that we used "A" to append the DROP to the bottom of the INPUT chain, and "I" to insert the exceptions to the top of the INPUT chain. At this point, you should try a new SSH connection and make sure you can still connect.If all is working, the final step is to make sure the knockd daemon starts up on boot, and that the DROP rule is added on boot as well. You can also add some hard-coded exceptions for boxes you know are secure, if you don't want to have to port knock from them every time.

One flaw in the above scheme the sharp reader may have spotted is that although the SSH port cannot be reached without a knock, the sequence of knocks used can easily be intercepted and played back. While this doesn't gain the potential bad guy too much, there is a way to overcome it. The knockd program allows the port knocking combinations to be stored inside of a file, and read from, one line at a time. Each successful knock will move the required knocks to the next line, so that even knowing someone else's knock sequence will not help, as it changes each time. To implement this, just replace the 'sequence' line as seen in the above configuration file with a line like this:

one_time_sequences = /etc/knockd.sequences.txt

In this case, the sequences will be read from the file named "/etc/knockd.sequences.txt". See the manpage for knockd for more details on one_time_sequences as well as other features not discussed here. For more on port knocking in general, visit portknocking.org.

While the one_time_sequences is a great idea, I'd like to see something a little different implemented someday. Specifically, having to pre-populate a fixed list of sequences is a drag. Not only do you have to make sure they are random, and that you have enough, but you have to keep the list with you locally. Lose that list, and you cannot get in! A better way would be to have your port knocking program generate the new port sequences on the fly. It would also encrypt the new port sequences to one or more public keys, and then put the file somewhere web accessible. Thus, one could simply grab the file from the server, decrypt it, and perform the port knocking based on the list of ports inside of it. Is all of that overkill for SSH? Almost certainly. :) However, there are many other uses for port knocking that simple SSH blocking and unblocking. Remember that many pieces of information can be used against your server, including what services are running on which ports, and which versions are in use.

Upgrading from RHEL 5.2 to CentOS 5.4

I have a testing server that was running RHEL 5.2 (x86_64) but its RHN entitlement ran out and I wanted to upgrade it to CentOS 5.4. I found a few tips online about how to do that, but they were a little dated so here are updated instructions showing the steps I took:

yum clean all
mkdir ~/centos
cd ~/centos
wget http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-5
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/centos-release-5-4.el5.centos.1.x86_64.rpm
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/centos-release-notes-5.4-4.x86_64.rpm
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/yum-3.2.22-20.el5.centos.noarch.rpm
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/yum-updatesd-0.9-2.el5.noarch.rpm
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/yum-fastestmirror-1.1.16-13.el5.centos.noarch.rpm
rpm --import RPM-GPG-KEY-CentOS-5 
rpm -e --nodeps redhat-release
rpm -e yum-rhn-plugin yum-updatesd
rpm -Uvh *.rpm
yum -y upgrade
# edit /etc/grub.conf to point to correct new kernel (with Xen, in my case)
shutdown -r now

It has worked well so far.

rsync and bzip2 or gzip compressed data

A few days ago, I learned that gzip has a custom option --rsyncable on Debian (and thus also Ubuntu). This old write-up covers it well, or you can just `man gzip` on a Debian-based system and see the --rsyncable option note.

I hadn't heard of this before and think it's pretty neat. It resets the compression algorithm on block boundaries so that rsync won't view every block subsequent to a change as completely different.

Because bzip2 has such large block sizes, it forces rsync to resend even more data for each plaintext change than plain gzip does, as noted here.

Enter pbzip2. Based on how it works, I suspect that pbzip2 will be friendlier to rsync, because each thread's compressed chunk has to be independent of the others. (However, pbzip2 can only operate on real input files, not stdin streams, so you can't use it with e.g. tar cj directly.)

In the case of gzip --rsyncable and pbzip2, you trade a little lower compression efficency (< 1% or so worse) for reduced network usage by rsync. This is probably a good tradeoff in many cases.

But even more interesting for me, a couple of days ago Avery Pennarun posted an article about his experimental code to use the same principles to more efficiently store deltas of large binaries in Git repositories. It's painful to deal with large binaries in any version control system I've used, and most people simply say, "don't do that". It's too bad, because when you have everything else related to a project in version control, why not some large images or audio files too? It's much more convenient for storage, distribution, complete documentation, and backups.

Avery's experiment gives a bit of hope that someday we'll be able to store big file changes in Git much more efficiently. (Though it doesn't affect the size of the initial large object commits, which will still be bloated.)

Using ln -sf to replace a symlink to a directory

When you want to forcibly replace a symbolic link on some kind of Unix (here I'm using the version of ln from GNU coreutils), you can do it the manual way:

rm -f /path/to/symlink
ln -s /new/target /path/to/symlink

Or you can provide the -f argument to ln to have it replace the existing symlink automatically:

ln -sf /new/target /path/to/symlink

(I was hoping this would be an atomic action such that there's no brief period when /path/to/symlink doesn't exist, as when mv moves a file over top of an existing file. But it's not. Behind the scenes it tries to create the symlink, fails because a file already exists, then unlinks the existing file and finally creates the symlink.)

Anyway, that's convenient, but I ran into a gotcha which was confusing. If the existing symlink you're trying to replace points to a directory, the above actually creates a symlink inside the dereferenced directory the old symlink points to. (Or fails if the referent is invalid.)

To replace an existing directory symlink, use the -n argument to ln:

ln -sfn /new/target /path/to/symlink

That's always what I have wanted it to do, so I need to remember the -n.

Starting processes at boot under SELinux

There are a few common ways to start processes at boot time in Red Hat Enterprise Linux 5 (and thus also CentOS 5):

  1. Standard init scripts in /etc/init.d, which are used by all standard RPM-packaged software.
  2. Custom commands added to the /etc/rc.local script.
  3. @reboot cron jobs (for vixie-cron, see `man 5 crontab` -- it is not supported in some other cron implementations).

Custom standalone /etc/init.d init scripts become hard to differentiate from RPM-managed scripts (not having the separation of e.g. /usr/local vs. /usr), so in most of our hosting we've avoided those unless we're packaging software as RPMs.

rc.local and @reboot cron jobs seemed fairly equivalent, with crond starting at #90 in the boot order, and local at #99. Both of those come after other system services such as Postgres & MySQL have already started.

To start up processes as various users we've typically used su - $user -c "$command" in the desired order in /etc/rc.local. This was mostly for convenience in easily seeing in one place what all would be started at boot time. However, when running under SELinux this runs processes in the init_t context which usually prevents them from working properly.

The cron @reboot jobs don't have that SELinux context problem and work fine, just as if run from a login shell, so now we're using those. Of course they have the added advantage that regular users can edit the cron jobs without system administrator intervention.

Increasing maildrop's hardcoded 5-minute timeout

One of the ways I like to retrieve email is to use fetchmail as a POP and IMAP client with maildrop as the local delivery agent. I prefer maildrop to Postfix, Exim, or sendmail for this because it doesn't add any headers to the messages.

The only annoyance I have had is that maildrop has a hardcoded hard timeout of 5 minutes for delivering a mail message. When downloading a very long message such as a Git commit notification of a few hundred megabytes, or a short message with an attached file of dozens of megabytes, especially over a slow network connections, this timeout prevents the complete message from being delivered.

Confusingly, a partial message will be delivered locally without warning -- with the attachment or other long message data truncated. When fetchmail receives the error status return from maildrop, it then tries again, and given similar circumstances it suffers a similar fate. In the worst case this leads to hours of clogged tubes and many partial copies of the same email message, and no other new mail.

This maildrop hard timeout is compiled in and there is no runtime option to override it. Thus it is helpful to compile a custom build from source, specifying a different timeout at configure time. In my case, I set the timeout to be 1 day:

./configure --enable-global-timeout=86400 --without-db --enable-syslog=1 \
    --enable-tempdir=tmp --enable-smallmsg=65536 
make

If you choose to configure with --without-db as I do, you need to manually remove two occurrences of makedatprog from Makefile, as makedatprog is a utility only needed by DBM and won't have been compiled. Then make install as root and edit your ~/.fetchmailrc lines, adding mda "/usr/local/bin/maildrop", and restart the fetchmail daemon.

Long messages will still take a long time to deliver over a slow link, but they will at least be allowed to eventually finish this way.

Rejecting SSLv2 politely or brusquely

Once upon a time there were still people using browsers that only supported SSLv2. It's been a long time since those browsers were current, but when running an ecommerce site you typically want to support as many users as you possibly can, so you support old stuff much longer than most people still need it.

At least 4 years ago, people began to discuss disabling SSLv2 entirely due to fundamental security flaws. See the Debian and GnuTLS discussions, and this blog post about PCI's stance on SSLv2, for example.

To politely alert people using those older browsers, yet still refusing to transport confidential information over the insecure SSLv2 and with ciphers weaker than 128 bits, we used an Apache configuration such as this:

# Require SSLv3 or TLSv1 with at least 128-bit cipher
<Directory "/">
    SSLRequireSSL
    # Make an exception for the error document itself
    SSLRequire (%{SSL_PROTOCOL} != "SSLv2" and %{SSL_CIPHER_USEKEYSIZE} >= 128) or %{REQUEST_URI} =~ m:^/errors/:
    ErrorDocument 403 /errors/403-weak-ssl.html
</Directory>

That accepts their SSLv2 connection, but displays an error page explaining the problem and suggesting some links to free modern browsers they can upgrade to in order to use the secure part of the website in question.

Recently we've decided to drop that extra fuss and block SSLv2 entirely with Apache configuration such as this:

SSLProtocol all -SSLv2
SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:-LOW:-SSLv2:-EXP

The downside of that is that the SSL connection won't be allowed at all, and the browser doesn't give any indication of why or what the user should do. They would simply stare at a blank screen and presumably go away frustrated. Because of that we long considered the more polite handling shown above to be superior.

But recently, after having completely disabled SSLv2 on several sites we manage, we have gotten zero complaints from customers. Doing this also makes PCI and other security audits much simpler because SSLv2 and weak ciphers are simply not allowed at all and don't raise audit warnings.

So at long last I think we can consider SSLv2 dead, at least in our corner of the Internet!

Defining variables for rpmbuild

RPM spec files offer a way to define and test build variables with a directive like this:

%define <variable> <value>

Sometimes it's useful to override such variables temporarily for a single build, without modifying the spec file, which would make the changed variable appear in the output source RPM. For some reason, how to do this has been hard for me to find in the docs and hard for me to remember, despite its simplicity.

Here's how. For example, to override the standard _prefix variable with value /usr/local:

rpmbuild -ba SPECS/$package.spec --define '_prefix /usr/local'

SDCH: Shared Dictionary Compression over HTTP

Here's something new in HTTP land to play with: Shared Dictionary Compression over HTTP (SDCH, apparently pronounced "sandwich") is a new HTTP 1.1 extension announced by Wei-Hsin Lee of Google last September. Lee explains that with it "a user agent obtains a site-specific dictionary that then allows pages on the site that have many common elements to be transmitted much more quickly." SDCH is applied before gzip or deflate compression, and Lee notes 40% better compression than gzip alone in their tests. Access to the dictionaries stored in the client is scoped by site and path just as cookies are.

The first client support was in the Google Toolbar for Internet Explorer, but it is now going to be much more widely used because it is supported in the Google Chrome browser for Windows. (It's still not in the latest Chrome developer build for Linux, or at any rate not enabled by default if the code is there.)

Only Google's web servers support it to date, as far as I know. Someone intended to start a mod_sdch project for Apache, but there's no code at all yet and no activity since September 2008.

It is interesting to consider the challenge this will have on HTTP proxies that filter content, since the entire content would not be available to the proxy to scan during a single HTTP conversation. Sneakily-split malicious payloads would then be reassembled by the browser or other client, not requiring JavaScript or other active reassembly methods. This forum thread discusses this threat and gives an example of stripping the Accept-encoding: sdch request headers to prevent SDCH from being used at all. Though the threat is real, it's hard to escape the obvious analogy with TCP filtering, which had to grow from stateless to more difficult stateful TCP packet inspection. New features mean not just new benefits but also new complexity, but that's not reason to reflexively reject them.

SDCH references:

Packaging Ruby Enterprise Edition into RPM

It's unfortunate that past versions of Ruby have gained a reputation of performing poorly, consuming too much memory, or otherwise being "unfit for the enterprise." According to the fine folks at Phusion, this is partly due to the way Ruby does memory management. And they've created an alternative branch of Ruby 1.8 called "Ruby Enterprise Edition." This code base includes many significant patches to the stock Ruby code which dramatically improve performance.

Phusion advertises an average memory savings of 33% when combined with Passenger, their Apache module for serving Rails apps. We did some testing of our own, using virtualized Xen servers from our Spreecamps.com offering. These servers use the DevCamps system to run several separate instances of httpd for each developer, so reducing the usage of Passenger was crucial to fitting into less than a gigabyte of memory. Our findings were dramatic: one instance dropped 100MB down to 40MB. (The status tools included with Passenger were very helpful in confirming this.)

There has been some discussion on the Phusion Passenger and other mailing lists about packaging Ruby Enterprise Edition for Red Hat Enterprise Linux and its derivatives (CentOS and Fedora). Packages are available from Phusion for Ubuntu Linux, but many of our clients prefer RHEL's reputation as a stable platform for e-commerce hosting. So we've packaged ruby-enterprise into RPM and made them available to give back to the Rails community.

We want our SpreeCamps systems to be easy to maintain, following the "Principle of Least Astonishment." By default, Phusion's script installs ruby-enterprise into /opt, so invocation must include the full path to the executable. This would be unsettling to a developer who mistakenly installed gems to Red Hat's rubygems path while intending to install gems usable by REE and Passenger. It is important to install the ruby and gem executables into all users' $PATH.

We took a cue from our customized local-perl packages. These packages install themselves into /usr/local. This means that all executables reside in /usr/local/bin; no $PATH modifications are necessary to utilize them via the command-line. Our ruby-enterprise packages are configured the same way. (If another /usr/local/bin/ruby exists, package installation will fail before clobbering another ruby installation.) Applications which specify #!/usr/bin/ruby will continue to use Red Hat's packaged ruby.

Similar to a source-based installation, once these packages are installed you may do gem install passenger and any other gems your application needs. Phusion's REE installer also installs several "useful gems". However we elected not to include these in the main ruby-enterprise RPM package. More, smaller packages limited to a particular module or piece of software, is better than one or two big fat RPMs with a bunch of stuff you may or may not need. We will likely package individual gems in the near future.

These packages are publicly available from our repository. We've just begun using these but are finding them reliable and very helpful so far. Any of you who would like to are welcome to try them out via direct download, or much easier, adding our Yum repository to your system as described here:

https://packages.endpoint.com/

Once you've done that, a simple command should get you most of the way there:

yum install ruby-enterprise ruby-enterprise-rubygems

If you prefer to download them directly, the .rpm packages are available on that site as well, just browse through the repo.

The .spec file is available for review and forking on GitHub: http://gist.github.com/108940

Many thanks to list member Tim Charper for providing an example .spec, and my colleagues at End Point for reviewing this work.

We appreciate any comments or questions you may have. This package repo is for us and our clients primarily, but if there's a package you need that isn't in there, let us know and maybe we'll add it.

Announcing SpreeCamps.com hosting

On day two of RailsConf 2009, we are pleased to announce our new SpreeCamps.com hosting service. SpreeCamps is the quickest way to get started developing your new e-commerce website with Ruby on Rails and Spree and easily deploy it into production.

You get the latest Spree 0.8.0 that was just released yesterday, as part of a fully configured environment built on the best industry-standard open-source software: CentOS, Ruby on Rails, your choice of PostgreSQL or MySQL, Apache, Passenger, Git, and DevCamps. Your system is harmonized and pre-installed on high-performance hardware, so you can simply sign up and start coding today.

SpreeCamps gives you a 64-bit virtual private server and include backups, your own preconfigured iptables firewall, ping and ssh monitoring, and DNS. We also include a benefit unheard of in the virtual private server space: Out of the box we enforce an SELinux security policy that protects you against many types of unforeseen security vulnerabilities, and is configured to work with Passenger, Rails, and Spree.

SpreeCamps' built-in DevCamps system gives you development and staging environments that make it easy to work together in teams, show others your work in progress, and deploying your changes from development to production is as easy as "git pull".

And the best part of all? You can sign up now for just $95 per month, with no setup charge. We also offer hands-on training and support for Spree, DevCamps, and any other part of your application. Visit SpreeCamps.com for more information or to sign up for your own SpreeCamp.

(By the way, it works fine for hosting any other kind of Rails application, or whatever else you want too. We've already used it for one non-Rails website because it made getting going with DevCamps so easy.)

Learn more about End Point's rails development and rails shopping cart development.

Apache RewriteRule to a destination URL containing a space

Today I needed to do a 301 redirect for an old category page on a client's site to a new category which contained spaces in the filename. The solution to this issue seemed like it would be easy and straight forward, and maybe it is to some, but I found it to be tricky as I had never escaped a space in an Apache RewriteRule on the destination page.

The rewrite rule needed to rewrite:

/scan/mp=cat/se=Video Games

to:

/scan/mp=cat/se=DS Video Games

I was able to get the first part of the rewrite rule quickly:

^/scan/mp=cat/se=Video\sGames\.html$

The issue was figuring out how to properly escape the space on the destination page. A literal space, %20 and \s all failed to work properly. Jon Jensen took a look and suggested a standard unix escape of '\ ' and that worked. Some times a solution is right under your nose and it's obvious once you step back or ask for help from another engineer. Googling for the issue did not turn up such a simple solution, thus the reason for this blog posting.

The final rule:

RewriteRule ^/scan/mp=cat/se=Video\sGames\.html$ http://www.site.com/scan/mp=cat/se=DS\ Video\ Games.html [L,R=301]

Varnish, Radiant, etc.

As my colleague Jon mentioned, the Presidential Youth Debates launched its full debate content this week. And, as Jon also mentioned, the mix of tools involved was fairly interesting:

Our use of Postgres for this project was not particularly special, and is simply a reflection of our using Postgres by default. So I won't discuss the Postgres usage further (though it pains me to ignore my favorite piece of the software stack).

Radiant

Dan Collis-Puro, who has done a fair amount of CMS-focused work throughout his career, was the initial engineer on this project and chose Radiant as the backbone of the site. He organized the content within Radiant, configured the Page Attachments extension for use with Amazon's S3 (Simple Storage Service), and designed the organization of videos and thumbnails for easy administration through the standard Radiant admin interface. Furthermore, prior to the release of the debate videos, Dan built a question submission and moderation facility as a Radiant extension, through which users could submit questions that might ultimately get passed along to the candidates for the debate.

In the last few days prior to launch, it fell to me to get the new debate materials into production, and we had to reorganize the way we wanted to lay out the campaign videos and associated content. Because the initial implementation relied purely on conventions in how page parts and page attachments are used, accomplishing the reorganization was straightforward and easily achieved; it was not the sort of thing that required code tweaks and the like, managed purely through the CMS. It ended up being quite -- dare I say it? -- an agile solution. (Agility! Baked right in! Because it's opinionated software! Where's my Mac? It just works! Think Same.)

For managing small, simple, straightforward sites, Radiant has much to recommend it. For instance:

  • the hierarchical management of content/pages is quite effective and intuitive
  • a pretty rich set of extensions (such as page attachments)
  • the "filter" option on content is quite handy (switch between straight text, fckeditor, etc.) and helpful
  • the Radiant tag set for basic templating/logic is easy to use and understand
  • the general resources available for organizing content (pages, layouts, snippets) enables and readily encourages effective reuse of content and/or presentation logic

That said, there are a number of things for which one quickly longs within Radiant:

  • In-place editing user interface: an adminstrative mode of viewing the site in which editing tools would show in-place for the different components on a given page. This is not an uncommon approach to content management. The fact that you can view the site in one window/tab and the admin in another mitigates the pain of not having this feature to a healthy extent, but the ease of use undoubtedly suffers nevertheless.
  • Radiant offers different publishing "states" for any given page ("draft", "published", "hidden", etc.), and only publicly displays pages in the "published" state in production. This is certainly helpful, but it is ultimately insufficient. This is no substitute for versioning of resources; there is no way to have a staging version of a given page, in which the staging version is exposed to administrative users only at the same URL as the published version. To get around this, one needs to make an entirely different page that will replace the published page when you're ready. While it's possible to work around the problem in this manner, it clutters up the set of resources in the CMS admin UI, and doesn't fit well with the hierarchical nature of the system; the staging version of a page can't have the same children as the published version of the page, so any staging involving more than one level of edits is problematic and awkward. That leaves quite a lot to be desired: any engineer who has ever done all development on a production site (no development sites) and moved to version-controlled systems knows full well that working purely against a live system is extremely painful. Content management is no different.
  • The page attachments extension, while quite handy in general, has configuration information (such as file size limits and the attachment_fu storage backend to use) hard-coded into its PageAttachment model class definition, rather than abstracting that configuration information into YAML files. Furthermore, it's all or nothing: you can only use one storage backend, apparently, rather than having the flexibility of choosing different storage backends by the content type of the file attached, or choosing manually when uploading the file, etc. The result in our case is that all page attachments go to Amazon S3, even though videos were the only thing we really wanted to have in S3 (bandwidth on our server is not a concern for simple images and the like).

The in-place editing UI features could presumably be added to Radiant given a reasonable degree of patience. The page attachment criticisms also seem achievable. The versioning, however, is a more fundamental issue. Many CMSes attempt to solve this problem many different ways, and ultimately things tend to get unpleasant. I tend to think that CMSes would do well to learn from version control systems like Git in their design; beyond that, integrate with Git: dump the content down to some intelligent serialized format and integrate with git branching, checkin, checkout, pushing, etc. That dandy, glorious future is not easily realized.

To be clear: Radiant is a very useful, effective, straightforward tool; I would be remiss not to emphasize that the things it does well are more important than the areas that need improvement. As is the case with most software, it could be better. I'd happily use/recommend it for most content management cases I've encountered.

Amazon S3

I knew it was only a matter of time before I got to play with Amazon S3. Having read about it, I felt like I pretty much knew what to expect. And the expectations were largely correct: it's been mostly reliable, fairly straightforward, and its cost-effectiveness will have to be determined over time. A few things did take me by surprise, though:

  • The documentation on certain aspects, particularly the logging is, fairly uninspiring. It could be a lot worse. It could also be a lot better. Given that people pay for this service, I would expect it to be documented extremely well. Of course, given the kind of documentation Microsoft routinely spits out, this expectation clearly lacks any grounding in reality.
  • Given that the storage must be distributed under the hood, making usage information aggregation somewhat complicated, it's nevertheless disappointing that Amazon doesn't give any interface for capping usage for a given bucket. It's easy to appreciate that Amazon wouldn't want to be on the hook over usage caps when the usage data comes in from multiple geographically-scattered servers, presumably without any guarantee of serialization in time. Nevertheless, it's a totally lame problem. I have reason to believe that Amazon plans to address this soon, for which I can only applaud them.

So, yeah, Amazon S3 has worked fine and been fine and generally not offended me overmuch.

Varnish

The Presidential Youth Debate project had a number of high-profile sponsors potentially capable of generating significant usage spikes. Given the simplicity of the public-facing portion of the site (read-only content, no forms to submit), scaling out with a caching reverse proxy server was a great option. Fortunately, Varnish makes it pretty easy; basic Varnish configuration is simple, and putting it in place took relatively little time.

Why go with Varnish? It's designed from the ground up to be fast and scalable (check out the architecture notes for an interesting technical read). The time-based caching of resources is a nice approach in this case; we can have the cached representations live for a couple of minutes, which effectively takes the load off of Apache/Rails (we're running Rails with Phusion Passenger) while refreshing frequently enough for little CMS-driven tweaks to percolate up in a timely fashion. Furthermore, it's not a custom caching design, instead relying upon the fundamentals of caching in HTTP itself. Varnish, with its Varnish Configuration Language (VCL), is extremely flexible and configurable, allowing us to easily do things like ignore cookies, normalize domain names (though I ultimately did this in Apache), normalize the annoying Accept-Encoding header values, etc. Furthermore, if the cache interval is too long for a particular change, Varnish gives you a straightforward, expressive way of purging cached representations, which came in handy on a number of occasions close to launch time.

A number of us at End Point have been interested in Varnish for some time. We've made some core patches: JT Justman tracked down a caching bug when using Edge-Side Includes (ESI), and Charles Curley and JT have done some work to add native gzip/deflate support in Varnish, though that remains to be released upstream. We've also prototyped a system relying on ESI and message-driven cache purging for an up-to-date, high-speed, extremely scalable architecture. (That particular project hasn't gone into production yet due to the degree of effort required to refactor much of the underlying app to fit the design, though it may still come to pass next year -- I hope!)

Getting Varnish to play nice with Radiant was a non-issue, because the relative simplicity of the site feature set and content did not require specialized handling of any particular resource: one cache interval was good for all pages. Consequently, rather than fretting about having Radiant issue Cache-Control headers on a per-page basis (which may have been fairly unpleasant, though I didn't look into it deeply; eventually I'll need to, though, having gotten modestly hooked on Radiant and less-modestly hooked on Varnish), the setup was refreshingly simple:

  • The public site's domain disallows all access to the Radiant admin, meaning it's effectively a read-only site.
  • The public domain's Apache container always issues a couple of cache-related headers:
    Header always set Cache-Control "public; max-age=120"
    Header always set Vary "Accept-Encoding"
    The Cache-Control header tells clients (Varnish in this case) that it's acceptable to cache representations for 120 seconds, and that all representations are valid for all users ("public"). We can, if we want, use VCL to clean this out of the representation Varnish passes along to clients (i.e. browsers) so that browsers don't cache automatically, instead relying on conditional GET. The Vary header tells clients that cache (again, primarily concerned with Varnish here) to consider the "Accept-Encoding" header value of a request when keying cached representations.
  • An entirely separate domain exists that is not fronted by Varnish and allows access to the Radiant admin. We could have it fronted by Varnish with caching deactivated, but the configuration we used keeps things clean and simple.
  • We use some simple VCL to tell Varnish to ignore cookies (in case of Rails sessions on the public site), to normalize the Accept-Encoding header value to one of "gzip" or "deflate" (or none at all) to avoid caching different versions of the same representation due to inconsistent header values submitted by competing browsers.

Getting all that sorted was, as stated, refreshingly easy. It was a little less easy, surprisingly, to deal with logging. The main Varnish daemon (varnishd) logs to a shared memory block. The logs just sit there (and presumably eventually get overwritten) unless consumed by another process. A varnishlog utility, which can be run as a one-off or as a daemon, reads in the logs and outputs them in various ways. Furthermore, a varnishncsa utility outputs logging information in an Apache/NCSA-inspired "combined log" format (though it includes full URLs in the request string rather than just the path portion, presumably due to the possibility of Varnish fronting many different domains). Neither one of these is particularly complicated, though the varnishlog output is reportedly quite verbose and may need frequent rotation, and when run in daemon mode, both will re-open the log file to which they write upon receiving SIGHUP, meaning they'll play nice with log rotation routines. I found myself repeatedly wishing, however, that they both interfaced with syslog.

So, I'm very happy with Varnish at this point. Being a jerk, I must nevertheless publicly pick a few nits:

  • Why no syslog support in the logging utilities? Is there a compelling argument against it (I haven't encountered one, but admittedly I haven't looked very hard), or is it simply a case of not having been handled yet?
  • The VCL snippet we used for normalizing the Accept-Encoding header came right off the Varnish FAQ, and seems to be a pretty common case. I wonder if it would make more sense for this to be part of the default VCL configuration requiring explicit deactivation if not desired. It's not a big deal either way, but it seems like the vast majority of deployments are likely to use this strategy.

That's all I have to whine about, so either I'm insufficiently observant or the software effectively solves the problem it set out to address. These options are not mutually exclusive.

I'm definitely looking forward to further work with Varnish. This project didn't get into ESI support at all, but the native ESI support, combined with the high-performance caching, seems like a real win, potentially allowing for simplification of resource design in the application server, since documents can be constructed by the edge server (Varnish in this case) from multiple components. That sort of approach to design calls into question many of the standard practices seen in popular (and unpopular) application servers (namely, high-level templating with "pages" fitting into an overall "layout") but could help engineers keep maintain component encapsulation, think through more effectively the URL space, resource privacy and scoping considerations (whether or not a resource varies per user, by context, etc.), etc. But I digress. Shocking.

Walden University Presidential Youth Debate goes live

This afternoon was the launch of Walden University's Presidential Youth Debate website, which features 14 questions and video responses from Presidential candidates Barack Obama and John McCain. The video responses are about 44 minutes long overall.

The site has a fairly simple feature set but is technologically interesting for us. It was developed by Dan Collis-Puro and Ethan Rowe using Radiant, PostgreSQL, CentOS Linux, Ruby on Rails, Phusion Passenger, Apache, Varnish, and Amazon S3.

Nice work, guys!

Machine virtualization on the Linux desktop

In the past I've used virtualization mostly in server environments: Xen as a sysadmin, and VMware and Virtuozzo as a user. They have worked well enough. When there've been problems they've mostly been traceable to network configuration trouble.

Lately I've been playing with virtualization on the desktop, specifically on Ubuntu desktops, using Xen, kvm, and VirtualBox. Here are a few notes.

Xen: Requires hardware virtualization support for full virtualization, and paravirtualization is of course only for certain types of guests. It feels a little heavier on resource usage, but I haven't tried to move beyond lame anecdote to confirm that.

kvm: Rumored to have been not ready for prime time, but when used from libvirt with virt-manager, has been very nice for me. It requires hardware virtualization support. One major problem in kvm on Ubuntu 8.04 is with the CD/DVD driver when using RHEL/CentOS guests. To work around that, I used the net install and it worked fine.

VirtualBox: This was for me the simplest of all for desktop stuff. I've used both the OSE (Open Source Edition) in Ubuntu and Sun's cost-free but proprietary package on Windows Vista. The current release of VirtualBox only emulates i386 32-bit machines at the moment, though! (No 64-bit guests, though a 64-bit host is fine.) It's also been a little buggy at times -- I've had a few machine crashes when running both an OpenBSD 4.3 and a RHEL 5 guest, though I wasn't able to reproduce the problem and it's possible it wasn't a VirtualBox issue.

I should note that some manufacturers have a BIOS option to disable hardware virtualization, and that it is sometimes disabled by default. When booting a new machine, check for that, especially in servers you won't necessarily want to take down later.

A final note about RHEL 5's net install: Why, oh why, does the installer ask for an HTTP install location as separate web site and directory entries, instead of a universally used and easy URL? And further, when the install source I'm using goes down (as download mirrors occasionally do), why are my only options to reboot or retry? Would it have been so hard to allow me the option of entering a new download URL? Yes, I know, I need to send in a patch.