Hosting Blog Archive

Efficiency of find -exec vs. find | xargs

This is a quick tip for anyone writing a cron job to purge large numbers of old files.

Without xargs, this is a pretty common way to do such a purge, in this case of all files older than 31 days:

find /path/to/junk/files -type f -mtime +31 -exec rm -f {} \;

But that executes rm once for every single file to be removed, which adds a ton of overhead just to fork and exec rm so many times. Even on modern operating systems where fork is very efficient, it can easily push I/O, load, and runtime to 10 times or more what a single rm command with many file arguments would need.

Instead do this:

find /path/to/junk/files -type f -mtime +31 -print0 | xargs -0 -r rm -f

That will run rm once for each very long list of files to be removed, so the overhead of fork & exec is incurred very rarely, and the job can spend most of its effort actually unlinking files. (The xargs -r option says not to run the command if there is no input to xargs.)
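The difference in process count is easy to see on a scratch directory. The following is a minimal sketch, assuming GNU findutils and a throwaway directory from mktemp (the junk1..junk200 file names are made up for the demonstration). It swaps rm for a tiny sh -c command that prints one line per invocation, so counting lines counts how often the command was started.

```shell
#!/bin/sh
# Sketch: count command invocations for -exec ... \; versus xargs batching.
# Assumes GNU find/xargs (-print0, -0, -r); the file names are illustrative.
tmpdir=$(mktemp -d)
i=1
while [ "$i" -le 200 ]; do
    : > "$tmpdir/junk$i"
    i=$((i + 1))
done

# One "run" line per invocation of the command.
per_file=$(find "$tmpdir" -type f -exec sh -c 'echo run' \; | wc -l)
batched=$(find "$tmpdir" -type f -print0 | xargs -0 -r sh -c 'echo run' sh | wc -l)

echo "per-file invocations: $per_file"
echo "batched invocations:  $batched"

rm -rf "$tmpdir"
```

With only 200 short filenames the whole list fits into a single xargs batch, so the second count should be 1 against 200 for the first.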

How long can the argument list to xargs be? It depends on the system, but xargs --show-limits will tell us. Here's output from a RHEL 5 x86_64 system (using findutils 4.2.27):

% xargs --show-limits
Your environment variables take up 2293 bytes
POSIX lower and upper limits on argument length: 2048, 129024
Maximum length of command we could actually use: 126731
Size of command buffer we are actually using: 126731

The numbers are similar on Debian Etch and Lenny.

And here's output from an Ubuntu 10.04 x86_64 system (using findutils 4.4.2):

% xargs --show-limits
Your environment variables take up 1370 bytes
POSIX upper limit on argument length (this system): 2093734
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2092364
Size of command buffer we are actually using: 131072

Roughly 2 megabytes of arguments is a lot. But even the POSIX minimum of 4 kB is a lot better than processing one file at a time.

It doesn't usually make much of a difference, but we can tune even more. Make sure the maximum number of files is processed at one time by first changing to the base directory so that the relative pathnames are shorter:

cd /path/to/junk/files && find . -type f -mtime +31 -print0 | xargs -0 -r rm -f

That way each file argument is shorter, e.g. ./junkfile compared to /path/to/junk/files/junkfile.

The above assumes you're using GNU findutils, which includes find -print0 and xargs -0 for processing ASCII NUL-delimited filenames for safety when filenames include embedded spaces, newlines, etc.

Why is my load average so high?

One of the most common ways people notice there's a problem with their server is when Nagios, or some other monitoring tool, starts complaining about a high load average. Unfortunately this complaint carries with it very little information about what might be causing the problem. But there are ways around that. On Linux, where I spend most of my time, the load average represents the average number of processes in either the "run" or "uninterruptible sleep" states. This code snippet will display all such processes, including their process ID and parent process ID, current state, and the process command line:

#!/bin/sh

ps -eo pid,ppid,state,cmd |\
    awk '$3 ~ /[RD]/ { print $0 }'

Most of the time, this script has simply confirmed what I already anticipated, such as, "PostgreSQL is trying to service 20 times as many simultaneous queries as normal." On occasion, however, it's very useful, such as when it points out that a backup job is running far longer than normal, or when it finds lots of "[pdflush]" operations in progress, indicating that the system is working overtime to write dirty pages to disk. I hope it can be similarly useful to others.
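To see at a glance how many processes sit in each state, the state column can also be tallied. This is a small sketch rather than part of the original script: count_states is a hypothetical helper name, and it only assumes one state string per input line, as printed by ps -eo state=.

```shell
#!/bin/sh
# Tally processes per state letter (R, D, S, ...).
# Reads one state string per line on stdin and uses its first character.
count_states() {
    awk 'NF { count[substr($1, 1, 1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

Typical use would be ps -eo state= | count_states, with the R and D totals compared against the figures in /proc/loadavg.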

Tip: Find all non-UTF-8 files

Here's an easy way to find all non-UTF-8 files for later perusal:

find . -type f -print0 | xargs -0 -I {} bash -c 'iconv -f utf-8 -t utf-16 "$1" &>/dev/null || echo "$1"' _ {} > utf8_fail

I've needed this before for converting projects over into UTF-8; obviously certain files are going to be binary and will show up in this list, so manual vetting will need to be done before converting all your images over into UTF-8.

Xen MAC mismatch VNC mouse escape HOWTO

This is a story that probably shouldn't need to be told if everything is in documentation somewhere. I'm not using any fancy virtualization management tools and didn't have an easy time piecing everything together, so I thought it'd be worth writing down the steps of the manual approach I took.

Dramatis personæ:

  • Server: Red Hat Enterprise Linux 5.4 with Xen kernel
  • Guest virtual server: CentOS 5.4 running paravirtualized under Xen
  • Workstation: Ubuntu 9.10

The situation: I updated the CentOS 5 Xen virtual guest via yum and rebooted to load the new Linux kernel and other libraries such as glibc. According to Xen as reported by xm list, the guest had started back up fine, but it wasn't reachable over the network via ping, HTTP, or ssh, even from the Xen host's own network.

The guest wasn't using much CPU (as shown by xm top), so I figured it wasn't just a slow-running fsck during startup. And I was familiar with the iptables firewall rules on this guest, so I was fairly sure I wasn't being blocked there. I needed to get to the console to see what was wrong.

The way I've done this before is using VNC to access the virtual console remotely. The Xen host was configured to accept VNC connections on localhost, which I could see by looking in /etc/xen/xend-config.sxp:

(vnc-listen '127.0.0.1')

There are 11 Xen guests, with consoles listening on TCP ports 5900-5910. Which one was the one? I don't know any simple way to get a list that maps ports to Xen guests, but I did it this way:

ps auxww | grep qemu-dm

I noted the PID of the process that was running for my guest as revealed in its command line. Then I looked for the listener running under that PID:

netstat -nlp

I looked for $pid/qemu-dm in the PID/Program Name column and could then see the TCP port in the Local Address column. In my case it was 127.0.0.1:5903.

So I set up an ssh tunnel to the server for my VNC traffic:

ssh -f -N -L 5903:localhost:5903 root@$remote_host

Then I opened the default Ubuntu/GNOME VNC viewer, labeled "Remote Desktop Viewer" under the Internet menu. This program is actually called Vinagre, and is basic but works. I connected to localhost:5903, since I'd forwarded my own local TCP port 5903 to the remote port 5903.

The remote console came up, and I was presented with the login banner and prompt. If I hadn't had the root password, I would have needed to reboot the guest and start it in runlevel 1 (single-user mode) to get a root shell without a password and change the password. But I did have the password, so that wasn't necessary.

Then when trying to change to another window on my desktop, I ran into the biggest snag of the whole exercise: Getting control of the mouse out of the VNC remote desktop window and back in my own desktop! I couldn't find anything accurate on this in any documentation, forums, etc. Finally I stumbled across the trick: Press F10, which pulls down the Machine menu in Vinagre and as a side-effect takes control of the mouse away from the remote desktop. It was nice not to have to ssh in from another machine to kill Vinagre. But it makes me wonder how I'd send an F10 through to the remote console ...

Armed with the root password I was able to log into the guest and discover that only the lo (loopback) interface started on boot. The eth0 and eth1 interfaces failed because there was no virtual NIC available with the MAC addresses specified in /etc/sysconfig/network-scripts/ifcfg-eth0 and eth1.

That was because the virtual machine image had been cloned from another one and hadn't been given new unique MAC addresses. The problem was easily fixed by updating the ifcfg-eth0 and ifcfg-eth1 files with the MAC addresses actually given to the interfaces, as seen by ifconfig, which were ultimately assigned by the Xen host in /etc/xen/$host in the vif parameter. (You can also specify no MAC addresses in the guest at all and it will use whatever it gets.)

Then after running service network restart the networking was back, and I rebooted to make sure it started correctly on its own.

Spree on Heroku for Development

Yesterday, I worked through some issues to set up and run Spree on Heroku. One of End Point's clients is using Spree for a multi-store solution. We are using the recently released Spree 0.10.0.beta gem, which includes some significant Spree template and hook changes discussed here and here, in addition to other substantial updates and fixes. Our client will be using Heroku for their production server, but our first goal was to work through deployment issues to use Heroku for development.

Since Heroku includes a free offering to be used for development, it's a great option for a quick and dirty setup to run Spree non-locally. I experienced several problems and summarized them below.

Application Changes

1. After a failed attempt to set up the basic Heroku installation described here because of a RubyGems 1.3.6 requirement, I discovered the need for Heroku's bamboo deployment stack, which requires you to declare the gems required for your application. I also found the Spree Heroku extension and reviewed the code, but I wanted to take a simpler approach initially since the extension offers several features that I didn't need. After some testing, I created a .gems file in the main application directory with the contents below to specify the gems required on the Badious Bamboo Heroku stack.

rails -v 2.3.5
highline -v '1.5.1'
authlogic -v '>=2.1.2'
authlogic-oid -v '1.0.4'
activemerchant -v '1.5.1'
activerecord-tableless -v '0.1.0'
less -v '1.2.20'
stringex -v '1.0.3'
chronic -v '0.2.3'
whenever -v '0.3.7'
searchlogic -v '2.3.5'
will_paginate -v '2.3.11'
faker -v '0.3.1'
paperclip -v '>=2.3.1.1'
state_machine -v '0.8.0'

2. The next block I hit was that git submodules are not supported by Heroku, mentioned here. I replaced the git submodules in our application with the Spree extension source code.

3. I also worked through addressing Heroku's read-only filesystem limitation. The setting perform_caching is set to true for a production environment by default in an application running from the Spree gem. In order to run the application for development purposes, perform_caching was set to false in RAILS_APP/config/environments/production.rb:

config.action_controller.perform_caching             = false

Another issue that came up due to the read-only filesystem constraint was that the Spree extensions were attempting to copy files over to the rails application public directory during the application restart, causing the application to die. To address this issue, I removed the public images and stylesheets from the extension directories and verified that these assets were included in the main application public directory.

I also removed the frozen Spree gem extension public files (javascripts, stylesheets and images) to prevent these files from being copied over during application restart. These files were moved to the main application public directory.

4. Finally, I disabled the allow_ssl_in_production preference to turn SSL off in my development application. This change was made in the extension directory's extension_name_extensions.rb file:

AppConfiguration.class_eval do
  preference :allow_ssl_in_production, :boolean, :default => false
end

Obviously, this isn't the preference setting that will be used for the production application, but it works for a quick and dirty Heroku development app. Heroku's SSL options are described here.

Deployment Tips

1. To create a Heroku application running on the Bamboo stack, I ran:

heroku create --stack bamboo-ree-1.8.7 --remote bamboo

2. Since my git repository is hosted on GitHub, I ran the following to push the existing repository to my Heroku app:

git push bamboo master

3. To run the Spree database bootstrap (or database reload), I ran the following:

heroku rake db:bootstrap AUTO_ACCEPT=1

As a side note, I ran the command heroku logs several times to review the latest application logs throughout troubleshooting.

Despite the issues noted above, the troubleshooting yielded an application that can be used for development. I also learned more about Heroku configurations that will need to be addressed when moving the project to production, such as SSL setup and multi domain configuration. We'll also need to determine the best option for serving static content, such as using Amazon's S3, which is supported by the Spree Heroku extension mentioned above.

Learn more about End Point's Ruby on Rails Development or Ruby on Rails Ecommerce Services.

Riak Install on Debian Lenny

I'm doing some comparative analysis of various distributed non-relational databases and consequently wrestled with the installation of Riak on a server running Debian Lenny.

I relied upon the standard "erlang" debian package, which installs cleanly on a basically bare system without a hitch (as one would expect). However, the latest Riak's "make" tasks fail to run; this is because the rebar script on which the make tasks rely chokes on various bad characters:

riak@nosql-01:~/riak$ make all rel
./rebar compile
./rebar:2: syntax error before: PK
./rebar:11: illegal atom
./rebar:30: illegal atom
./rebar:72: illegal atom
./rebar:76: syntax error before: ��n16
./rebar:79: syntax error before: ','
./rebar:91: illegal integer
./rebar:149: illegal atom
./rebar:160: syntax error before: Za��ze
./rebar:172: illegal atom
./rebar:176: illegal atom
escript: There were compilation errors.
make: *** [compile] Error 127

Delicious.

Eventually I came across this article describing issues getting Riak to install on Ubuntu 9.04, and determined that the Erlang version problem it mentions seemed to apply here as well. Following the article's instructions for building Erlang from source worked out fine, and so far I've been able to start, ping, and stop the local Riak server without incident.

Since a true investigation requires running these kinds of tools in a cluster, and that means automation of the installation/configuration is desirable, I've been scripting out the configuration steps (putting things into a configuration management tool like Puppet will come later when we're farther along and closer to picking the right solution for the problem in question). So, here's the script I've been running to build these things from my local machine (relying upon SSH); it is rough, a work in progress, and not intended as an example of excellence, elegance, or beauty -- it simply gets the job done (so far) for me and may help somebody else.

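#!/bin/sh
# Build Erlang from source and install Riak on a remote Debian host over SSH.

hostname=$1
erlang_release=otp_src_R13B04
riak_release=riak-0.8.1

if [ -z "$hostname" ]; then
    echo "usage: $0 hostname" >&2
    exit 1
fi

# The release variables above are expanded locally, before the command
# string is sent to the remote shell.
ssh "root@$hostname" "
# stop at the first failed step
set -e
# necessary for Erlang build (-y so the run does not wait for confirmation)
apt-get install -y build-essential libncurses5-dev m4
apt-get install -y openssl libssl-dev
# standard from-source build
mkdir -p erlang-build
cd erlang-build
wget http://ftp.sunet.se/pub/lang/erlang/download/$erlang_release.tar.gz
tar xzf $erlang_release.tar.gz
cd $erlang_release
./configure
make
make install
# put all of riak in a riak user
useradd -m riak
su -c 'wget http://bitbucket.org/basho/riak/downloads/$riak_release.tar.gz' - riak
su -c 'tar xzf $riak_release.tar.gz' - riak
su -c 'cd $riak_release && make all rel' - riak
su -c 'mv $riak_release/rel riak' - riak
"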

(I have other scripts for preparing the box post-OS-install, but I don't think they impact this particular part of the process.)

PostgreSQL EC2/EBS/RAID 0 snapshot backup

One of our clients uses Amazon Web Services to host their production application and database servers on EC2 with EBS (Elastic Block Store) storage volumes. Their main database is PostgreSQL.

A big benefit of Amazon's cloud services is that you can easily add and remove virtual server instances, storage space, etc. and pay as you go. One known problem with Amazon's EBS storage is that it is much more I/O limited than, say, a nice SAN.

To partially mitigate the I/O limitations, they're using 4 EBS volumes to back a Linux software RAID 0 block device. On top of that is the xfs filesystem. This gives roughly 4x the I/O throughput and has been effective so far.

They ship WAL files to a secondary server that serves as warm standby in case the primary server fails. That's working fine.

They also do nightly backups using pg_dumpall on the master so that there's a separate portable (SQL) backup not dependent on the server architecture. The problem that led to this article is that extra I/O caused by pg_dumpall pushes the system beyond its I/O limits. It adds both reads (from the PostgreSQL database) and writes (to the SQL output file).

There are several solutions we are considering so that we can keep both binary backups of the database and SQL backups, since both types are valuable. In this article I'm not discussing all the options or trying to decide which is best in this case. Instead, I want to consider just one of the tried and true methods of backing up the binary database files on another host to offload the I/O:

  1. Create an atomic snapshot of the block devices
  2. Spin up another virtual server
  3. Mount the backup volume
  4. Start Postgres and allow it to recover from the apparent "crash" the server had (since there wasn't a clean shutdown of the database before the snapshot)
  5. Do whatever pg_dump or other backups are desired
  6. Make throwaway copies of the snapshot for QA or other testing

The benefit of such snapshots is that you get an exact backup of the database, with whatever table bloat, indexes, statistics, etc. exactly as they are in production. That's a big difference from a freshly created database and import from pg_dump.

The difference here is that we're using 4 EBS volumes with RAID 0 striped across them, and there isn't currently a way to do an atomic snapshot of all 4 volumes at the same time. So it's no longer "atomic" and who knows what state the filesystem metadata and the file data itself would be in?

Well, why not try it anyway? Filesystem metadata doesn't change that often, especially in the controlled environment of a Postgres data volume. Snapshotting within a relatively short timeframe would be pretty close to atomic, and probably look to the software (operating system and database) like some kind of strange crash since some EBS volumes would have slightly newer writes than others. But aren't all crashes a little unpredictable? Why shouldn't the software be able to deal with that? Especially if we have Postgres make a checkpoint right before we snapshot.

I wanted to know if it was crazy or not, so I tried it on a new set of services in a separate AWS account. Here are the notes and some details of what I did:

  1. Created one EC2 instance:
    Amazon EC2 Debian 5.0 lenny AMI built by Eric Hammond
    Debian AMI ID ami-4ffe1926 (x86_64)
    Instance Type: High-CPU Extra Large (c1.xlarge) - 7 GB RAM, 8 CPU cores
  2. Created 4 x 10 GB EBS volumes
  3. Attached volumes to the instance
  4. Created software RAID 0 device:
    mdadm -C /dev/md0 -n 4 -l 0 -z max /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  5. Created XFS filesystem on top of RAID 0 device:
    mkfs -t xfs -L /pgdata /dev/md0
  6. Set up in /etc/fstab and mounted:
    mkdir /pgdata
    # edit /etc/fstab, with noatime
    mount /pgdata
  7. Installed PostgreSQL 8.3
  8. Configured postgresql.conf to be similar to primary production database server
  9. Created empty new database cluster with data directory in /pgdata
  10. Started Postgres and imported a play database (from public domain census name data and Project Gutenberg texts), resulting in about 820 MB in data directory
  11. Ran some bulk inserts to grow database to around 5 GB
  12. Rebooted EC2 instance to confirm everything came back up correctly on its own
  13. Set up two concurrent data-insertion processes:
    • 50 million row insert based on another local table (INSERT INTO ... SELECT ...), in a single transaction (hits disk hard, but nothing should be visible in the snapshot because the transaction won't have committed before the snapshot is taken)
    • Repeated single inserts in autocommit mode (Python script writing INSERT statements using random data from /usr/share/dict/words piped into psql), to verify that new inserts made it into the snapshot, and no partial row garbage leaked through
  14. Started those "beater" jobs, which mostly consumed 2-3 CPU cores
  15. Manually inserted a known test row and created a known view that should appear in the snapshot
  16. Started Postgres's backup mode that allows for copying binary data files in a non-atomic manner, which also does a CHECKPOINT and thus also a filesystem sync:
    SELECT pg_start_backup('raid_backup');
  17. Manually inserted a 2nd known test row & 2nd known test view that I don't want to appear in the snapshot after recovery
  18. Ran snapshot script which calls ec2-create-snapshot on each of the 4 EBS volumes -- during first run, run serially quite slowly taking about 1 minute total; during second run, run in parallel such that the snapshot point was within 1 second for all 4 volumes
  19. Told Postgres the backup's over:
    SELECT pg_stop_backup();
  20. Ran script to create new EBS volumes derived from the 4 snapshots (which aren't directly usable and always go into S3), using ec2-create-volume --snapshot
  21. Ran script to attach new EBS volumes to devices on the new EC2 instance using ec2-attach-volume
  22. Then, on the new EC2 instance for doing backups:
    • mdadm --assemble --scan
    • mount /pgdata
    • Start Postgres
    • Count rows on the 2 volatile tables; confirm that the table with the in-process transaction doesn't show any new rows, and that the table getting individual rows committed to reads correctly
    • VACUUM VERBOSE -- and confirm no errors or inconsistencies detected
    • pg_dumpall # confirmed no errors and data looks sound

It worked! No errors or problems, and pretty straightforward to do.

Actually before doing all the above I first did a simpler trial run with no active database writes happening, and didn't make any attempt for the 4 EBS snapshots to happen simultaneously. They were actually spread out over almost a minute, and it worked fine. With the confidence that the whole thing wasn't a fool's errand, I then put together the scripts to do lots of writes during the snapshot and made the snapshots run in parallel so they'd be close to atomic.
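The parallel snapshot step might look roughly like the sketch below. It is not the script used in the experiment: the volume IDs are placeholders, snapshot_all and SNAPSHOT_CMD are hypothetical names, and the only thing taken from the notes above is that ec2-create-snapshot is called once per EBS volume. Each call is started in the background so that all four requests go out at nearly the same moment.

```shell
#!/bin/sh
# Placeholder volume IDs (assumption); substitute the four volumes behind /dev/md0.
VOLUMES="vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444"

# Command used to snapshot one volume; overridable for a dry run.
SNAPSHOT_CMD=${SNAPSHOT_CMD:-ec2-create-snapshot}

# Start one snapshot request per volume in the background, then wait for all.
snapshot_all() {
    for vol in $VOLUMES; do
        $SNAPSHOT_CMD "$vol" &
    done
    wait
}
```

In the procedure above, snapshot_all would be called between the pg_start_backup and pg_stop_backup statements. Setting SNAPSHOT_CMD=echo first gives a dry run that only prints the volume IDs.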

There are lots of caveats to note here:

  • This is an experiment in progress, not a how-to for the general public.
  • The data set that was snapshotted was fairly small.
  • Two successful runs, even with no failures, is not a very big sample set. :)
  • I didn't use Postgres's point-in-time recovery (PITR) here at all -- I just started up the database and let Postgres recover from an apparent crash. Shipping over the few WAL files that the master collected between pg_start_backup() and pg_stop_backup(), once the snapshot copying is complete, would allow a theoretically fully reliable recovery to be made, not just a practically non-failing recovery as I did above.

So there's more work to be done to prove this technique viable in production for a mission-critical database, but it's a promising start worth further investigation. It shows that there is a way to back up a database across multiple EBS volumes without adding noticeably to its I/O load by utilizing the Amazon EBS data store's snapshotting and letting a separate EC2 server offload the I/O of backups or anything else we want to do with the data.

MySQL Ruby Gem CentOS RHEL 5 Installation Error Troubleshooting

Building and installing the Ruby mysql gem on freshly-installed Red Hat based systems sometimes produces the frustratingly ambiguous error below:

# gem install mysql
/usr/bin/ruby extconf.rb
checking for mysql_ssl_set()... no
checking for rb_str_set_len()... no
checking for rb_thread_start_timer()... no
checking for mysql.h... no
checking for mysql/mysql.h... no
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers.  Check the mkmf.log file for more
details.  You may need configuration options.

Searching the web for info on this error yields two basic solutions:

  1. Install the mysql-devel package (this provides the mysql.h file in /usr/include/mysql/).
  2. Run gem install mysql -- --with-mysql-config=/usr/bin/mysql_config or some other additional options.

These are correct but not sufficient. Because this gem compiles a library to interface with MySQL's C API, the gcc and make packages are also required to create the build environment:

# yum install mysql-devel gcc make
# gem install mysql -- --with-mysql-config=/usr/bin/mysql_config
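Before retrying the gem build, it can help to check which build prerequisites are actually absent. The helper below is only a sketch: missing_tools is a hypothetical name, and it looks for commands on the PATH, so it cannot tell whether the mysql.h header itself is installed.

```shell
#!/bin/sh
# Print every named command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For this gem the call would be missing_tools gcc make mysql_config; empty output means all three were found.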

Alternatively, if you're using your distro's ruby (not a custom build like Ruby Enterprise Edition), you can install EPEL's ruby-mysql package along with their rubygem-rails and other packages.

On Linux, noatime includes nodiratime

Note to performance-tweaking Linux sysadmins, pointed out to me by Selena Deckelmann: On Linux, the filesystem attribute noatime includes nodiratime, so there's no need to say both "noatime,nodiratime" as I once did. (See this article on atime for details if you're not familiar with it.)

Apparently the nodiratime attribute was added later as a subset of noatime applying only to directories to still offer a bit of performance boost in situations where noatime on files would cause trouble (as with mutt and a few other applications that care about atimes).

See also the related newer relatime attribute in the mount(8) manpage.
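To find entries that still carry the redundant pair, the options field can be scanned. This is a minimal sketch assuming fstab-style input, where the mount point is field 2 and the comma-separated options are field 4; redundant_nodiratime is a hypothetical helper name.

```shell
#!/bin/sh
# Print mount points whose option list names both noatime and nodiratime.
# Reads fstab-style lines on stdin and skips comment lines.
redundant_nodiratime() {
    awk '/^[[:space:]]*#/ { next }
    {
        n = split($4, opts, ",")
        has_noatime = 0
        has_nodiratime = 0
        for (i = 1; i <= n; i++) {
            if (opts[i] == "noatime") has_noatime = 1
            if (opts[i] == "nodiratime") has_nodiratime = 1
        }
        if (has_noatime && has_nodiratime) print $2
    }'
}
```

It would typically be run as redundant_nodiratime < /etc/fstab; any mount point it prints can drop nodiratime.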

DevCamps on different systems, including Plesk, CPanel and ISPConfig

In the last few months I've been busy setting up DevCamps for several of our newer clients. DevCamps is an open source development environment system that, once set up, allows for easily starting up and tearing down a development environment for a specific site/code base.

I've done many camp setups, and you tend to run into surprises from system to system, but what was most interesting and challenging about these latest installs was that they were to be done on systems running Plesk, CPanel, and ISPConfig. Some things that are different between a normal deployment and one on the above mentioned platforms are:

  • On the Plesk system there was a secured Linux called 'Atomic Secured Linux', which includes the grsecurity module. One restriction of this module is Trusted Path Execution (TPE), which required the camp bin scripts to be owned by root and the bin directory not to be writable by other groups; otherwise the scripts would fail to run.
  • Permissions are a mixed bag: where typically we set all of the files to be owned by the site owner, in Plesk there are special groups such as psacln that the files need to be owned by.
  • On the CPanel system we needed to move the admin images for Interchange to a different directory since CPanel includes Interchange and has aliases for /interchange/ and /interchange-5/ to point at a central location which we would not be using.
  • On ISPConfig and Plesk the home directories of the sites are in different places, which required deploying the code in such places as /var/www/clients/client/user/domain.com or /var/www/vhosts/domain.com.

In the end we were able to get DevCamps to run properly on these various platforms both in development and production. If you are starting a new project or working on an existing project and could use a strong development environment, consider DevCamps.

SSHFS and ServerAliveInterval

If you're using SSHFS (as I have been recently, since OpenVPN started crashing frequently on my OpenBSD firewall), note that the ServerAliveInterval option for SSH can have a significant impact on the stability of your mounts.

I set it to 10 seconds on my system and have been happy with the results so far. It could probably safely go considerably higher than that.

It's not on by default, which leaves the stability of your SSH tunnels up to the success of TCP keepalive (which is on by default). On my wireless network, that alone has not been sufficient.

dstat: better system resource monitoring

I recently came across a useful tool I hadn't heard of before: dstat, by Dag Wieers (of DAG RPM-building fame). He describes it as "a versatile replacement for vmstat, iostat, netstat, nfsstat and ifstat."

The most immediate benefit I found is the collation of system resource monitoring output at each point in time, removing the need to look at output from multiple monitors. The coloring helps readability too:

% dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  4   1  92   3   0   0|  56B   84k|   0     0 |  94B  188B|1264  1369
  3   7  43  44   1   1| 368B   11M| 151B  222B|   0   260k|1453  1565
  3   2  46  48   1   0| 432B 5784k|   0     0 |   0     0 |1421  1584
  2   2  47  49   0   0| 592k    0 |   0     0 |   0     0 |1513  1763
  6   2  44  49   1   0| 448B  248k|   0     0 |   0     0 |1398  1640
  8   4  41  45   3   0| 456k    0 | 135B  222B|   0     0 |1530  2102
 18   4  38  41   0   0| 408B  128k|   0    47B|   0     0 |1261  1977
 10   4  44  43   0   0| 728B  208k|   0     0 |   0     0 |1445  2203
  6   3  39  51   0   0| 648B  256k|3607B 4124B|   0     0 |1496  2180
  7   7  34  53   0   0|1088k    0 |1234B  582B|   0     0 |1465  2057
 14   8  28  49   0   0|2856B  104k|   0     0 |   0    52k|1610  2995
  6   6  43  45   0   0|1992k    0 |5964B 4836B|   0     0 |1493  2391
  9  14  34  44   0   0|2432B  112k|7854B  726B|   0     0 |1527  2190
  9  11  40  41   1   0|2680k    0 |1382B  972B|   0     0 |1550  2298
  5   4  68  22   0   0| 576B 1096k|  12B 4628B|   0     0 |1522  1731 ^C

(Textual screenshot by script of util-linux and Perl module HTML::FromANSI.)

Its default one-line-per-timeslice output makes it good for collecting data samples over time, as opposed to full-screen top-like utilities such as atop, which give much more detailed information at each snapshot, but don't show history.

Since dstat is a standard package available in RHEL/CentOS and Debian/Ubuntu, it is a reasonably easy add-on to get on various systems.

dstat also allows plugins, and the most recent release, just last month, added new plugins "for showing NTP time, power usage, fan speed, remaining battery time, memcache hits and misses, process count, top process total and average latency, top process total and average CPU timeslice, and per disk utilization rates."

It sounds like it'll grow even more useful over time and is worth keeping an eye on.

Multiple links to files in /etc

I came across an unfamiliar error in /var/log/messages on a RHEL 5 server the other day:

Dec  2 17:17:23 X restorecond: Will not restore a file with more than one hard link (/etc/resolv.conf) No such file or directory

Sure enough, ls showed the inode pointed to by /etc/resolv.conf having 2 links. What was the other link?

# find /etc -samefile resolv.conf
/etc/resolv.conf
/etc/sysconfig/networking/profiles/default/resolv.conf
# ls -lai /etc/resolv.conf /etc/sysconfig/networking/profiles/default/resolv.conf
1526575 -rw-r--r-- 2 root root 69 Nov 30  2008 /etc/resolv.conf
1526575 -rw-r--r-- 2 root root 69 Nov 30  2008 /etc/sysconfig/networking/profiles/default/resolv.conf

I've worked with a lot of RHEL/CentOS 5 servers and hadn't ever dealt with these network profiles. Kiel guessed it was probably a system configuration tool that we never use, and he was right: Running system-config-network (part of the system-config-network-tui RPM package) creates the hardlinks for the default profile.

/etc/hosts gets the same treatment as /etc/resolv.conf.

I suppose SELinux's restorecond doesn't want to apply any context changes because its rules are based on filesystem paths, and the paths of the multiple links are different and could result in conflicting context settings.

Since we don't use network profiles, we can just delete the extra links in /etc/sysconfig/networking/profiles/default/.
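Before deleting anything, it can help to list every multiply-linked file. Here is a minimal sketch that runs against a scratch directory rather than the real /etc; the directory layout and file contents are made up for illustration:

```shell
# Build a scratch tree that mimics the situation (hypothetical contents).
scratch=$(mktemp -d)
mkdir -p "$scratch/sysconfig/networking/profiles/default"
echo 'nameserver 192.0.2.1' > "$scratch/resolv.conf"
ln "$scratch/resolv.conf" "$scratch/sysconfig/networking/profiles/default/resolv.conf"

# Regular files with more than one hard link.
multi=$(find "$scratch" -type f -links +1 | sort)
printf '%s\n' "$multi"

# Every name that shares an inode with resolv.conf.
same=$(find "$scratch" -samefile "$scratch/resolv.conf" | sort)
printf '%s\n' "$same"
```

On a real server the first search would be something like find /etc -xdev -type f -links +1. Once the extra name under profiles/default is removed, the link count on /etc/resolv.conf drops back to 1.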

Cisco PIX mangled packets and iptables state tracking

Kiel and I had a fun time tracking down a client's networking problem the other day. Their scp transfers from their application servers behind a Cisco PIX firewall failed after a few seconds, consistently, with a connection reset.

The problem was easily reproducible with packet sizes of 993 bytes or more, not just with TCP but also ICMP (bloated ping packets, generated with ping -s 993 $host). That raised the question of how this problem could go undetected for their heavy web traffic. We determined that their HTTP load balancer avoided the problem as it rewrote the packets for HTTP traffic on each side.

Kiel narrowed the connection resets down to iptables' state tracking classifying packets as INVALID rather than ESTABLISHED or RELATED, as they should have been.

Then he found via tcpdump that the problem was easily visible in scp connections when TCP window scaling adjustments were made by either side of the connection. We tried disabling window scaling but that didn't help.

We tried having iptables allow packets in state INVALID when they were also ESTABLISHED or RELATED, and that reduced the frequency of terminated connections, but still didn't eliminate them entirely. (And it was a kludge we weren't eager to keep in place anyway.)

We wanted to avoid some unpleasant possibilities: (1) turn off stateful firewalling or (2) perform risky updates or configuration changes on the Cisco PIX, which may or may not fix the problem, in the middle of the busy holiday ecommerce season.

Finally, Kiel found this netfilter mailing list post which describes how to enable a Linux kernel workaround for the mangled packets the Cisco generates:

echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

Of course, we also saved that setting in /etc/sysctl.conf so it persists after a reboot.
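For the /etc/sysctl.conf entry, the key is the /proc/sys path with the prefix removed and the slashes turned into dots. A small sketch of that conversion follows; the function name is hypothetical and only the /proc path comes from the command above:

```shell
# Convert a /proc/sys path into the dotted key used by sysctl.conf.
proc_to_sysctl_key() {
    printf '%s\n' "${1#/proc/sys/}" | tr '/' '.'
}

key=$(proc_to_sysctl_key /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal)
line="$key = 1"
echo "$line"

# As root, on a kernel that exposes this setting:
#   echo "$line" >> /etc/sysctl.conf && sysctl -p
```

This prints net.ipv4.netfilter.ip_conntrack_tcp_be_liberal = 1, which is the line to add to /etc/sysctl.conf.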

So we have reliable long-running scp connections with TCP window scaling working and iptables doing its job. I love it when a plan comes together.

XZ compression

XZ is a new free compression file format that is starting to be more widely used. The LZMA2 compression method it uses first became popular in the 7-Zip archive program, with an analogous Unix command-line version called 7z.

We used XZ for the first time in the Interchange project in the Interchange 5.7.3 packages. Compared to gzip and bzip2, the file sizes were as follows:

interchange-5.7.3.tar.gz   2.4M
interchange-5.7.3.tar.bz2  2.1M
interchange-5.7.3.tar.xz   1.7M

Getting that tighter compression comes at a cost: compressing takes about 4 times as long as with bzip2. A bonus is that it decompresses about 3 times faster than bzip2. The combination of significantly smaller file sizes and faster decompression made it a clear win for distributing software packages, leading to it being the format used for packages in Fedora 12.

It's also easy to use on Ubuntu 9.10, via the standard xz-utils package. When you install that with apt-get, aptitude, etc., you'll get a scary warning about it replacing lzma, a core package, but this is safe to do because xz-utils provides compatible replacement binaries /usr/bin/lzma and friends (lzcat, lzless, etc.). There is also built-in support in GNU tar with the new --xz aka -J options.
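To get a feel for the size differences on your own data, something like the following rough sketch can help. It compresses a generated sample file with whichever of gzip, bzip2, and xz are installed and prints the resulting sizes. The function name and the sample text are made up for illustration, and the numbers will vary a lot with the input:

```shell
# Print the compressed size of file $1 for each tool that is installed.
compare_sizes() {
    for tool in gzip bzip2 xz; do
        if command -v "$tool" >/dev/null 2>&1; then
            size=$("$tool" -9 -c < "$1" | wc -c | tr -d ' ')
            printf '%-6s %s bytes\n' "$tool" "$size"
        else
            printf '%-6s not installed\n' "$tool"
        fi
    done
}

# Generate some repetitive sample text to compress.
sample=$(mktemp)
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i of some fairly repetitive sample text"
    i=$((i + 1))
done > "$sample"

report=$(compare_sizes "$sample")
printf '%s\n' "$report"
rm -f "$sample"
```

With a GNU tar that has the -J option mentioned above, creating an archive directly would look like tar -cJf name.tar.xz directory/.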

Port knocking with knockd

One of the best ways to secure your box against SSH attacks is the use of port knocking. Basically, port knocking seals off your SSH port, usually with firewall rules, such that nobody can even tell if you are running SSH until the proper "knock" is given, at which time the SSH port appears again to a specific IP address. In most cases, a "knock" simply means accessing specific ports in a specific order within a given time frame.

Let's step back a moment and see why this solution is needed. Before SSH there was telnet, which was a great idea way back at the start of the Internet when hosts trusted each other. However, it was (and is) extremely insecure, as it entails sending usernames and passwords "in the clear" over the internet. SSH, or Secure Shell, is like telnet on steroids. With a mean bodyguard. There are two common ways to log in to a system using SSH. The first way is with a password. You enter the username, then the password. Nice and simple, and similar to telnet, except that the information is not sent in the clear. The second common way to connect with SSH is by using public key authentication. This is what I use 99% of the time. It's very secure, and very convenient. You put the public half of your SSH key pair on the server, and then use your local private SSH key to authenticate. Since you can cache your private key, this means only having to type in your key's passphrase once, and then you can ssh to many different systems with no password needed.

So, back to port knocking. It turns out that any system connected to the internet is basically going to come under attack. One common target is SSH - specifically, people connecting to the SSH port, then trying combinations of usernames and passwords in the hopes that one of them is right. The best prevention against these attacks is to have a good password. Because public key authentication is so easy, and makes typing in the actual account password such a rare event, you can make the password something very secure, such as:

gtsmef#3ZdbVdAebAS@9e[AS4fed';8fS14S0A8d!!9~d1aAQ5.81sa0'ed

However, this won't stop others from trying usernames and passwords anyway, which fills up your logs with their attempts and is generally annoying. Thus, the need to "hide" the SSH port, which by default is 22. One thing some people do is move SSH to a "non-standard" port, where non-standard means anything but 22. Typically, some random number that won't conflict with anything else. This will reduce and/or stop all the break-in attempts, but at a high cost: all clients connecting have to know to use that port. With the ssh client, it's adding a -p argument, or setting a "Port" line in the relevant section of your .ssh/config file.

All of which brings us to port knocking. What if we could run SSH on port 22, but not answer to just anyone, but only to people who knew the secret code? That's what port knocking allows us to do. There are many variants on port knocking and many programs that implement it. My favorite is "knockd", mostly because it's simple to learn and use, and is available in some distros' packaging systems. My port knocking discussion and examples will focus on knockd, unless stated otherwise.

knockd is a daemon that listens for incoming requests to your box, and reacts when a certain combination is reached. Once knockd is installed and running, you modify your firewall rules (e.g. iptables) to drop all incoming traffic to port 22. To the outside world, it's exactly as if you are not running SSH at all. No break-in attempts are possible, and your security logs stay nice and boring. When you want to connect to the box via SSH, you first send a series of knocks to the box. If the proper combination is received, knockd will open a hole in the firewall for your IP on port 22. From this point forward, you can SSH in as normal. The new firewall entry can get removed right away, cleared out at some time period later, or you can define another knock sequence to remove the firewall listing and close the hole again.

What exactly is the knock? It's a series of connections to TCP or UDP ports. I prefer choosing a few random TCP ports, so that I can simply use telnet calls to connect to the ports. Keep in mind that when you do connect, it will appear as if nothing happened - you cannot tell that knockd is logging your attempt, and possibly acting on it.

Here's a sample knockd configuration file:

[options]
  logfile = /var/log/knockd.log

[openSSH]
  sequence    = 32144,21312,21120
  seq_timeout = 15
  command     = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
  tcpflags    = syn

[closeSSH]
  sequence    = 32144,21312,21121
  seq_timeout = 15
  command     = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
  tcpflags    = syn

In the above file, we've stated that any host that sends TCP SYN packets to ports 32144, 21312, and 21120, in that order, within 15 seconds, will cause the iptables command to be run. Note that the use of iptables is not hard-coded into knockd at all. Any command can be run when the port sequence is triggered, which allows for all sorts of fancy tricks. To close it up, we do the same sequence, except the final port is 21121.

Once knockd is installed, and the configuration file is put in place, start it up and begin testing. Leave a separate SSH connection open to the box while you are testing! If you are really paranoid, you might want to open a second SSH daemon on a second port as well. First, check that the port knocking works by triggering the port combinations. knockd comes with a command-line utility for doing so, but I usually just use telnet like so:

[greg@home ~] telnet example.com 32144
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused
[greg@home ~] telnet example.com 21312
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused
[greg@home ~] telnet example.com 21120
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused

Note that we received a bunch of "Connection refused" messages, the same as if we had tried any other random port. It is also the same message that people trying to connect to a port-knock-protected SSH port will see. If you look in the logs for knockd (set as /var/log/knockd.log in the example file above), you'll see some lines like this if all went well:

[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 1
[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 2
[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 3
[2009-11-09 14:01] 100.200.300.400: openSSH: OPEN SESAME
[2009-11-09 14:01] openSSH: running command: /sbin/iptables -I INPUT -s 100.200.300.400 -p tcp --dport 22 -j ACCEPT

Voila! Your iptables should now contain a new line:

$ iptables -L -n | grep 100.200
ACCEPT     tcp  --  100.200.300.400      0.0.0.0/0           tcp dpt:22
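The three telnet calls used for knocking can also be wrapped in a small loop. This is only a sketch: the function name and the KNOCK_DRY_RUN switch are made up for illustration, and it assumes a netcat that accepts -z (connect only) and -w (timeout in seconds), which varies between netcat flavors. The dry run below just prints what would be executed:

```shell
# knock_ports HOST PORT... : try one TCP connection per port, in order.
# Set KNOCK_DRY_RUN=1 to print the commands instead of running them.
knock_ports() {
    host=$1
    shift
    for port in "$@"; do
        if [ -n "${KNOCK_DRY_RUN:-}" ]; then
            echo "nc -z -w 1 $host $port"
        else
            # A refused connection is the expected result, so ignore failures.
            nc -z -w 1 "$host" "$port" || true
        fi
    done
}

KNOCK_DRY_RUN=1
plan=$(knock_ports example.com 32144 21312 21120)
printf '%s\n' "$plan"
```

Unset KNOCK_DRY_RUN to send the knocks for real, then check the knockd log as before.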

The next step is to lock everyone else out from the SSH port. Add a new rule to the firewall, but make sure it goes to the bottom:

$ iptables -A INPUT -p tcp --dport ssh -j DROP
$ iptables -L | grep DROP
DROP       tcp  --  anywhere             anywhere            tcp dpt:ssh

You'll note that we used "A" to append the DROP to the bottom of the INPUT chain, and "I" to insert the exceptions to the top of the INPUT chain. At this point, you should try a new SSH connection and make sure you can still connect. If all is working, the final step is to make sure the knockd daemon starts up on boot, and that the DROP rule is added on boot as well. You can also add some hard-coded exceptions for boxes you know are secure, if you don't want to have to port knock from them every time.

One flaw in the above scheme the sharp reader may have spotted is that although the SSH port cannot be reached without a knock, the sequence of knocks used can easily be intercepted and played back. While this doesn't gain the potential bad guy too much, there is a way to overcome it. The knockd program allows the port knocking combinations to be stored inside of a file, and read from, one line at a time. Each successful knock will move the required knocks to the next line, so that even knowing someone else's knock sequence will not help, as it changes each time. To implement this, just replace the 'sequence' line as seen in the above configuration file with a line like this:

one_time_sequences = /etc/knockd.sequences.txt

In this case, the sequences will be read from the file named "/etc/knockd.sequences.txt". See the manpage for knockd for more details on one_time_sequences as well as other features not discussed here. For more on port knocking in general, visit portknocking.org.
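Populating that file by hand is tedious, so here is a rough sketch of a generator. It assumes the file holds one comma-separated sequence per line, in the same form as the sequence option, and that a leading space is useful so knockd has room to mark a used line; check the knockd manpage before relying on either. The function name is made up for illustration.

```shell
# Print COUNT random three-port sequences, one per line.
gen_knock_sequences() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        line=""
        for n in 1 2 3; do
            # Two random bytes give 0..65535; fold that into 1024..65535.
            r=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
            port=$(( r % 64512 + 1024 ))
            line="${line:+$line,}$port"
        done
        # Leading space is an assumption: room for knockd to mark a used line.
        printf ' %s\n' "$line"
        i=$((i + 1))
    done
}

sequences=$(gen_knock_sequences 5)
printf '%s\n' "$sequences"
```

In practice you would redirect a longer list into the sequences file and keep a copy of it on the machine you knock from.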

While the one_time_sequences option is a great idea, I'd like to see something a little different implemented someday. Specifically, having to pre-populate a fixed list of sequences is a drag. Not only do you have to make sure they are random, and that you have enough, but you have to keep the list with you locally. Lose that list, and you cannot get in! A better way would be to have your port knocking program generate the new port sequences on the fly. It would also encrypt the new port sequences to one or more public keys, and then put the file somewhere web accessible. Thus, one could simply grab the file from the server, decrypt it, and perform the port knocking based on the list of ports inside of it. Is all of that overkill for SSH? Almost certainly. :) However, there are many other uses for port knocking than simple SSH blocking and unblocking. Remember that many pieces of information can be used against your server, including what services are running on which ports, and which versions are in use.

Upgrading from RHEL 5.2 to CentOS 5.4

I have a testing server that was running RHEL 5.2 (x86_64) but its RHN entitlement ran out and I wanted to upgrade it to CentOS 5.4. I found a few tips online about how to do that, but they were a little dated so here are updated instructions showing the steps I took:

yum clean all
mkdir ~/centos
cd ~/centos
wget http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-5
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/centos-release-5-4.el5.centos.1.x86_64.rpm
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/centos-release-notes-5.4-4.x86_64.rpm
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/yum-3.2.22-20.el5.centos.noarch.rpm
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/yum-updatesd-0.9-2.el5.noarch.rpm
wget http://mirror.centos.org/centos/5/os/x86_64/CentOS/yum-fastestmirror-1.1.16-13.el5.centos.noarch.rpm
rpm --import RPM-GPG-KEY-CentOS-5 
rpm -e --nodeps redhat-release
rpm -e yum-rhn-plugin yum-updatesd
rpm -Uvh *.rpm
yum -y upgrade
# edit /etc/grub.conf to point to correct new kernel (with Xen, in my case)
shutdown -r now

It has worked well so far.

rsync and bzip2 or gzip compressed data

A few days ago, I learned that gzip has a custom option --rsyncable on Debian (and thus also Ubuntu). This old write-up covers it well, or you can just `man gzip` on a Debian-based system and see the --rsyncable option note.

I hadn't heard of this before and think it's pretty neat. It resets the compression algorithm on block boundaries so that rsync won't view every block subsequent to a change as completely different.

Because bzip2 has such large block sizes, it forces rsync to resend even more data for each plaintext change than plain gzip does, as noted here.

Enter pbzip2. Based on how it works, I suspect that pbzip2 will be friendlier to rsync, because each thread's compressed chunk has to be independent of the others. (However, pbzip2 can only operate on real input files, not stdin streams, so you can't use it with e.g. tar cj directly.)

In the case of gzip --rsyncable and pbzip2, you trade slightly lower compression efficiency (< 1% or so worse) for reduced network usage by rsync. This is probably a good tradeoff in many cases.
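Since not every gzip build accepts --rsyncable, a script that wants to use it may need to probe for it first. A small sketch, with a made-up function name, that falls back to plain gzip when the option is rejected:

```shell
# Compress file $1 to stdout, using --rsyncable when this gzip supports it.
gzip_rsync_friendly() {
    if gzip --rsyncable -c </dev/null >/dev/null 2>&1; then
        gzip --rsyncable -c <"$1"
    else
        gzip -c <"$1"
    fi
}

# Round-trip a small sample to show the output is ordinary gzip data.
plain=$(mktemp)
echo 'sample data for the round trip' > "$plain"
roundtrip=$(gzip_rsync_friendly "$plain" | gzip -dc)
echo "$roundtrip"
rm -f "$plain"
```

Either way the result decompresses with any gzip; only the layout of the compressed stream differs.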

But even more interesting for me, a couple of days ago Avery Pennarun posted an article about his experimental code to use the same principles to more efficiently store deltas of large binaries in Git repositories. It's painful to deal with large binaries in any version control system I've used, and most people simply say, "don't do that". It's too bad, because when you have everything else related to a project in version control, why not some large images or audio files too? It's much more convenient for storage, distribution, complete documentation, and backups.

Avery's experiment gives a bit of hope that someday we'll be able to store big file changes in Git much more efficiently. (Though it doesn't affect the size of the initial large object commits, which will still be bloated.)

Using ln -sf to replace a symlink to a directory

When you want to forcibly replace a symbolic link on some kind of Unix (here I'm using the version of ln from GNU coreutils), you can do it the manual way:

rm -f /path/to/symlink
ln -s /new/target /path/to/symlink

Or you can provide the -f argument to ln to have it replace the existing symlink automatically:

ln -sf /new/target /path/to/symlink

(I was hoping this would be an atomic action such that there's no brief period when /path/to/symlink doesn't exist, as when mv moves a file over top of an existing file. But it's not. Behind the scenes it tries to create the symlink, fails because a file already exists, then unlinks the existing file and finally creates the symlink.)

Anyway, that's convenient, but I ran into a gotcha which was confusing. If the existing symlink you're trying to replace points to a directory, the above actually creates a symlink inside the dereferenced directory the old symlink points to. (Or fails if the referent is invalid.)

To replace an existing directory symlink, use the -n argument to ln:

ln -sfn /new/target /path/to/symlink

That's always what I have wanted it to do, so I need to remember the -n.
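The gotcha is easy to reproduce in a scratch directory. In this sketch all names are made up for illustration. The last step is an addition of mine rather than something tested above: it builds the link under a temporary name and renames it into place with GNU mv -T, which should avoid the brief window where the symlink doesn't exist.

```shell
# Scratch-directory demo; nothing outside the temp dir is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir old new
ln -s old current            # current -> old

ln -sf new current           # gotcha: follows current and creates old/new
readlink current             # still prints: old
readlink old/new             # prints: new (a stray link inside old/)

ln -sfn new current          # -n treats current itself as the thing to replace
readlink current             # now prints: new

# Rename-based variant (assumes GNU mv): link under a temp name, then rename.
ln -s old current.tmp
mv -T current.tmp current
readlink current             # prints: old
```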

Starting processes at boot under SELinux

There are a few common ways to start processes at boot time in Red Hat Enterprise Linux 5 (and thus also CentOS 5):

  1. Standard init scripts in /etc/init.d, which are used by all standard RPM-packaged software.
  2. Custom commands added to the /etc/rc.local script.
  3. @reboot cron jobs (for vixie-cron, see `man 5 crontab` -- it is not supported in some other cron implementations).

Custom standalone /etc/init.d init scripts become hard to differentiate from RPM-managed scripts (not having the separation of e.g. /usr/local vs. /usr), so in most of our hosting we've avoided those unless we're packaging software as RPMs.

rc.local and @reboot cron jobs seemed fairly equivalent, with crond starting at #90 in the boot order, and local at #99. Both of those come after other system services such as Postgres & MySQL have already started.

To start up processes as various users we've typically used su - $user -c "$command" in the desired order in /etc/rc.local. This was mostly for convenience in easily seeing in one place what all would be started at boot time. However, when running under SELinux this runs processes in the init_t context which usually prevents them from working properly.

The cron @reboot jobs don't have that SELinux context problem and work fine, just as if run from a login shell, so now we're using those. Of course they have the added advantage that regular users can edit the cron jobs without system administrator intervention.
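Adding an @reboot entry by script needs a little care so that repeated runs don't pile up duplicates. The following is a sketch only. The function name, the MAILTO line, and the /home/app/bin/start-app path are hypothetical, and it works on crontab text passed through stdin so it can be tried without touching a real crontab:

```shell
# Copy the crontab on stdin to stdout, appending "@reboot CMD"
# unless an identical line is already present.
with_reboot_job() {
    REBOOT_ENTRY="@reboot $1" awk '
        { print }
        $0 == ENVIRON["REBOOT_ENTRY"] { found = 1 }
        END { if (!found) print ENVIRON["REBOOT_ENTRY"] }
    '
}

current='MAILTO=admin@example.com'
updated=$(printf '%s\n' "$current" | with_reboot_job '/home/app/bin/start-app')
printf '%s\n' "$updated"
```

Run as the target user, the real thing would be along the lines of crontab -l 2>/dev/null | with_reboot_job "$HOME/bin/start-app" | crontab -.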

Increasing maildrop's hardcoded 5-minute timeout

One of the ways I like to retrieve email is to use fetchmail as a POP and IMAP client with maildrop as the local delivery agent. I prefer maildrop to Postfix, Exim, or sendmail for this because it doesn't add any headers to the messages.

The only annoyance I have had is that maildrop has a hardcoded hard timeout of 5 minutes for delivering a mail message. When downloading a very long message such as a Git commit notification of a few hundred megabytes, or a short message with an attached file of dozens of megabytes, especially over a slow network connection, this timeout prevents the complete message from being delivered.

Confusingly, a partial message will be delivered locally without warning -- with the attachment or other long message data truncated. When fetchmail receives the error status return from maildrop, it then tries again, and given similar circumstances it suffers a similar fate. In the worst case this leads to hours of clogged tubes and many partial copies of the same email message, and no other new mail.

This maildrop hard timeout is compiled in and there is no runtime option to override it. Thus it is helpful to compile a custom build from source, specifying a different timeout at configure time. In my case, I set the timeout to be 1 day:

./configure --enable-global-timeout=86400 --without-db --enable-syslog=1 \
    --enable-tempdir=tmp --enable-smallmsg=65536 
make

If you choose to configure with --without-db as I do, you need to manually remove two occurrences of makedatprog from Makefile, as makedatprog is a utility only needed by DBM and won't have been compiled. Then make install as root and edit your ~/.fetchmailrc lines, adding mda "/usr/local/bin/maildrop", and restart the fetchmail daemon.

Long messages will still take a long time to deliver over a slow link, but they will at least be allowed to eventually finish this way.

Rejecting SSLv2 politely or brusquely

Once upon a time there were still people using browsers that only supported SSLv2. It's been a long time since those browsers were current, but when running an ecommerce site you typically want to support as many users as you possibly can, so you support old stuff much longer than most people still need it.

At least 4 years ago, people began to discuss disabling SSLv2 entirely due to fundamental security flaws. See the Debian and GnuTLS discussions, and this blog post about PCI's stance on SSLv2, for example.

To politely alert people using those older browsers, yet still refusing to transport confidential information over the insecure SSLv2 and with ciphers weaker than 128 bits, we used an Apache configuration such as this:

# Require SSLv3 or TLSv1 with at least 128-bit cipher
<Directory "/">
    SSLRequireSSL
    # Make an exception for the error document itself
    SSLRequire (%{SSL_PROTOCOL} != "SSLv2" and %{SSL_CIPHER_USEKEYSIZE} >= 128) or %{REQUEST_URI} =~ m:^/errors/:
    ErrorDocument 403 /errors/403-weak-ssl.html
</Directory>

That accepts their SSLv2 connection, but displays an error page explaining the problem and suggesting some links to free modern browsers they can upgrade to in order to use the secure part of the website in question.

Recently we've decided to drop that extra fuss and block SSLv2 entirely with Apache configuration such as this:

SSLProtocol all -SSLv2
SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:-LOW:-SSLv2:-EXP

The downside of that is that the SSL connection won't be allowed at all, and the browser doesn't give any indication of why or what the user should do. They would simply stare at a blank screen and presumably go away frustrated. Because of that we long considered the more polite handling shown above to be superior.

But recently, after having completely disabled SSLv2 on several sites we manage, we have gotten zero complaints from customers. Doing this also makes PCI and other security audits much simpler because SSLv2 and weak ciphers are simply not allowed at all and don't raise audit warnings.

So at long last I think we can consider SSLv2 dead, at least in our corner of the Internet!

Defining variables for rpmbuild

RPM spec files offer a way to define and test build variables with a directive like this:

%define <variable> <value>

Sometimes it's useful to override such variables temporarily for a single build, without modifying the spec file, which would make the changed variable appear in the output source RPM. For some reason, how to do this has been hard for me to find in the docs and hard for me to remember, despite its simplicity.

Here's how. For example, to override the standard _prefix variable with value /usr/local:

rpmbuild -ba SPECS/$package.spec --define '_prefix /usr/local'

SDCH: Shared Dictionary Compression over HTTP

Here's something new in HTTP land to play with: Shared Dictionary Compression over HTTP (SDCH, apparently pronounced "sandwich") is a new HTTP 1.1 extension announced by Wei-Hsin Lee of Google last September. Lee explains that with it "a user agent obtains a site-specific dictionary that then allows pages on the site that have many common elements to be transmitted much more quickly." SDCH is applied before gzip or deflate compression, and Lee notes 40% better compression than gzip alone in their tests. Access to the dictionaries stored in the client is scoped by site and path just as cookies are.

The first client support was in the Google Toolbar for Internet Explorer, but it is now going to be much more widely used because it is supported in the Google Chrome browser for Windows. (It's still not in the latest Chrome developer build for Linux, or at any rate not enabled by default if the code is there.)

Only Google's web servers support it to date, as far as I know. Someone intended to start a mod_sdch project for Apache, but there's no code at all yet and no activity since September 2008.

It is interesting to consider the challenge this poses for HTTP proxies that filter content, since the entire content would not be available to the proxy to scan during a single HTTP conversation. Sneakily-split malicious payloads would then be reassembled by the browser or other client, not requiring JavaScript or other active reassembly methods. This forum thread discusses this threat and gives an example of stripping the Accept-encoding: sdch request headers to prevent SDCH from being used at all. Though the threat is real, it's hard to escape the obvious analogy with TCP filtering, which had to grow from stateless to more difficult stateful TCP packet inspection. New features bring not just new benefits but also new complexity, but that's no reason to reflexively reject them.

Packaging Ruby Enterprise Edition into RPM

It's unfortunate that past versions of Ruby have gained a reputation of performing poorly, consuming too much memory, or otherwise being "unfit for the enterprise." According to the fine folks at Phusion, this is partly due to the way Ruby does memory management. And they've created an alternative branch of Ruby 1.8 called "Ruby Enterprise Edition." This code base includes many significant patches to the stock Ruby code which dramatically improve performance.

Phusion advertises an average memory savings of 33% when combined with Passenger, their Apache module for serving Rails apps. We did some testing of our own, using virtualized Xen servers from our Spreecamps.com offering. These servers use the DevCamps system to run several separate instances of httpd for each developer, so reducing the memory usage of Passenger was crucial to fitting into less than a gigabyte of memory. Our findings were dramatic: one instance dropped from 100MB to 40MB. (The status tools included with Passenger were very helpful in confirming this.)

There has been some discussion on the Phusion Passenger and other mailing lists about packaging Ruby Enterprise Edition for Red Hat Enterprise Linux and its derivatives (CentOS and Fedora). Packages are available from Phusion for Ubuntu Linux, but many of our clients prefer RHEL's reputation as a stable platform for e-commerce hosting. So we've packaged ruby-enterprise into RPMs and made them available to give back to the Rails community.

We want our SpreeCamps systems to be easy to maintain, following the "Principle of Least Astonishment." By default, Phusion's script installs ruby-enterprise into /opt, so invocation must include the full path to the executable. This would be unsettling to a developer who mistakenly installed gems to Red Hat's rubygems path while intending to install gems usable by REE and Passenger. It is important to install the ruby and gem executables into all users' $PATH.

We took a cue from our customized local-perl packages. These packages install themselves into /usr/local. This means that all executables reside in /usr/local/bin; no $PATH modifications are necessary to utilize them via the command-line. Our ruby-enterprise packages are configured the same way. (If another /usr/local/bin/ruby exists, package installation will fail before clobbering another ruby installation.) Applications which specify #!/usr/bin/ruby will continue to use Red Hat's packaged ruby.

Similar to a source-based installation, once these packages are installed you may do gem install passenger and any other gems your application needs. Phusion's REE installer also installs several "useful gems". However, we elected not to include these in the main ruby-enterprise RPM package. More, smaller packages, each limited to a particular module or piece of software, are better than one or two big fat RPMs with a bunch of stuff you may or may not need. We will likely package individual gems in the near future.

These packages are publicly available from our repository. We've just begun using these but are finding them reliable and very helpful so far. Any of you who would like to are welcome to try them out via direct download, or much easier, adding our Yum repository to your system as described here:

https://packages.endpoint.com/

Once you've done that, a simple command should get you most of the way there:

yum install ruby-enterprise ruby-enterprise-rubygems

If you prefer to download them directly, the .rpm packages are available on that site as well, just browse through the repo.

The .spec file is available for review and forking on GitHub: http://gist.github.com/108940

Many thanks to list member Tim Charper for providing an example .spec, and my colleagues at End Point for reviewing this work.

We appreciate any comments or questions you may have. This package repo is for us and our clients primarily, but if there's a package you need that isn't in there, let us know and maybe we'll add it.

Announcing SpreeCamps.com hosting

On day two of RailsConf 2009, we are pleased to announce our new SpreeCamps.com hosting service. SpreeCamps is the quickest way to get started developing your new e-commerce website with Ruby on Rails and Spree and easily deploy it into production.

You get the latest Spree 0.8.0 that was just released yesterday, as part of a fully configured environment built on the best industry-standard open-source software: CentOS, Ruby on Rails, your choice of PostgreSQL or MySQL, Apache, Passenger, Git, and DevCamps. Your system is harmonized and pre-installed on high-performance hardware, so you can simply sign up and start coding today.

SpreeCamps gives you a 64-bit virtual private server and includes backups, your own preconfigured iptables firewall, ping and ssh monitoring, and DNS. We also include a benefit unheard of in the virtual private server space: Out of the box we enforce an SELinux security policy that protects you against many types of unforeseen security vulnerabilities, and is configured to work with Passenger, Rails, and Spree.

SpreeCamps' built-in DevCamps system gives you development and staging environments that make it easy to work together in teams and show others your work in progress, and deploying your changes from development to production is as easy as "git pull".

And the best part of all? You can sign up now for just $95 per month, with no setup charge. We also offer hands-on training and support for Spree, DevCamps, and any other part of your application. Visit SpreeCamps.com for more information or to sign up for your own SpreeCamp.

(By the way, it works fine for hosting any other kind of Rails application, or whatever else you want too. We've already used it for one non-Rails website because it made getting going with DevCamps so easy.)

Learn more about End Point's Rails development and Rails shopping cart development.

Apache RewriteRule to a destination URL containing a space

Today I needed to do a 301 redirect for an old category page on a client's site to a new category which contained spaces in the filename. The solution seemed like it would be easy and straightforward, and maybe it is to some, but I found it tricky, as I had never escaped a space in the destination of an Apache RewriteRule.

The rewrite rule needed to rewrite:

/scan/mp=cat/se=Video Games.html

to:

/scan/mp=cat/se=DS Video Games.html

I was able to get the first part of the rewrite rule quickly:

^/scan/mp=cat/se=Video\sGames\.html$

The issue was figuring out how to properly escape the space on the destination page. A literal space, %20, and \s all failed to work properly. Jon Jensen took a look and suggested a standard Unix escape of '\ ', and that worked. Sometimes a solution is right under your nose and it's obvious once you step back or ask for help from another engineer. Googling for the issue did not turn up such a simple solution, which is the reason for this blog post.

The final rule:

RewriteRule ^/scan/mp=cat/se=Video\sGames\.html$ http://www.site.com/scan/mp=cat/se=DS\ Video\ Games.html [L,R=301]
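
One way to check a rule like this from the command line is to request the old URL and look only at the status line and the Location header. The helper below is a minimal sketch, not part of the original fix: the function name redirect_summary is made up for illustration, and the curl line in the comment assumes the placeholder host used in the rule above.

```shell
# redirect_summary: read raw HTTP response headers on stdin and print
# only the status line and any Location header.
redirect_summary() {
    # Strip carriage returns first, since HTTP header lines end in CRLF.
    tr -d '\r' | awk 'NR == 1 { print; next } tolower($1) == "location:" { print }'
}

# Hypothetical usage against the placeholder host from the rule above
# (the space in the request URL has to be sent as %20):
#   curl -sI 'http://www.site.com/scan/mp=cat/se=Video%20Games.html' | redirect_summary
```

If the rule is working, the status line should report a 301 and the Location header should point at the new category page.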

Varnish, Radiant, etc.

As my colleague Jon mentioned, the Presidential Youth Debates launched its full debate content this week. And, as Jon also mentioned, the mix of tools involved was fairly interesting.

Our use of Postgres for this project was not particularly special, and is simply a reflection of our policy that for new systems we use Postgres by default. So I won't discuss the Postgres usage further (though it pains me to ignore my favorite piece of the software stack).

Radiant

Dan Collis-Puro, who has done a fair amount of CMS-focused work throughout his career, was the initial engineer on this project and chose Radiant as the backbone of the site. He organized the content within Radiant, configured the Page Attachments extension for use with Amazon's S3 (Simple Storage Service), and designed the organization of videos and thumbnails for easy administration through the standard Radiant admin interface. Furthermore, prior to the release of the debate videos, Dan built a question submission and moderation facility as a Radiant extension, through which users could submit questions that might ultimately get passed along to the candidates for the debate.

In the last few days prior to launch, it fell to me to get the new debate materials into production, and we had to reorganize the way we wanted to lay out the campaign videos and associated content. Because the initial implementation relied purely on conventions in how page parts and page attachments are used, the reorganization was straightforward; it required no code tweaks and the like, and was managed purely through the CMS. It ended up being quite -- dare I say it? -- an agile solution. (Agility! Baked right in! Because it's opinionated software! Where's my Mac? It just works! Think Same.)

For managing small, simple, straightforward sites, Radiant has much to recommend it. For instance:

  • the hierarchical management of content/pages is quite effective and intuitive
  • a pretty rich set of extensions (such as page attachments)
  • the "filter" option on content is quite handy (switch between straight text, fckeditor, etc.) and helpful
  • the Radiant tag set for basic templating/logic is easy to use and understand
  • the general resources available for organizing content (pages, layouts, snippets) enables and readily encourages effective reuse of content and/or presentation logic

That said, there are a number of things for which one quickly longs within Radiant:

  • In-place editing user interface: an administrative mode of viewing the site in which editing tools would show in-place for the different components on a given page. This is not an uncommon approach to content management. The fact that you can view the site in one window/tab and the admin in another mitigates the pain of not having this feature to a healthy extent, but the ease of use undoubtedly suffers nevertheless.
  • Radiant offers different publishing "states" for any given page ("draft", "published", "hidden", etc.), and only publicly displays pages in the "published" state in production. This is certainly helpful, but it is ultimately insufficient. This is no substitute for versioning of resources; there is no way to have a staging version of a given page, in which the staging version is exposed to administrative users only at the same URL as the published version. To get around this, one needs to make an entirely different page that will replace the published page when you're ready. While it's possible to work around the problem in this manner, it clutters up the set of resources in the CMS admin UI, and doesn't fit well with the hierarchical nature of the system; the staging version of a page can't have the same children as the published version of the page, so any staging involving more than one level of edits is problematic and awkward. That leaves quite a lot to be desired: any engineer who has ever done all development on a production site (no development sites) and moved to version-controlled systems knows full well that working purely against a live system is extremely painful. Content management is no different.
  • The page attachments extension, while quite handy in general, has configuration information (such as file size limits and the attachment_fu storage backend to use) hard-coded into its PageAttachment model class definition, rather than abstracting that configuration information into YAML files. Furthermore, it's all or nothing: you can only use one storage backend, apparently, rather than having the flexibility of choosing different storage backends by the content type of the file attached, or choosing manually when uploading the file, etc. The result in our case is that all page attachments go to Amazon S3, even though videos were the only thing we really wanted to have in S3 (bandwidth on our server is not a concern for simple images and the like).

The in-place editing UI features could presumably be added to Radiant given a reasonable degree of patience. The page attachment criticisms also seem achievable. The versioning, however, is a more fundamental issue. Many CMSes attempt to solve this problem many different ways, and ultimately things tend to get unpleasant. I tend to think that CMSes would do well to learn from version control systems like Git in their design; beyond that, integrate with Git: dump the content down to some intelligent serialized format and integrate with git branching, checkin, checkout, pushing, etc. That dandy, glorious future is not easily realized.

To be clear: Radiant is a very useful, effective, straightforward tool; I would be remiss not to emphasize that the things it does well are more important than the areas that need improvement. As is the case with most software, it could be better. I'd happily use/recommend it for most content management cases I've encountered.

Amazon S3

I knew it was only a matter of time before I got to play with Amazon S3. Having read about it, I felt like I pretty much knew what to expect. And the expectations were largely correct: it's been mostly reliable, fairly straightforward, and its cost-effectiveness will have to be determined over time. A few things did take me by surprise, though:

  • The documentation on certain aspects, particularly the logging, is fairly uninspiring. It could be a lot worse. It could also be a lot better. Given that people pay for this service, I would expect it to be documented extremely well. Of course, given the kind of documentation Microsoft routinely spits out, this expectation clearly lacks any grounding in reality.
  • Given that the storage must be distributed under the hood, making usage information aggregation somewhat complicated, it's nevertheless disappointing that Amazon doesn't give any interface for capping usage for a given bucket. It's easy to appreciate that Amazon wouldn't want to be on the hook over usage caps when the usage data comes in from multiple geographically-scattered servers, presumably without any guarantee of serialization in time. Nevertheless, it's a totally lame problem. I have reason to believe that Amazon plans to address this soon, for which I can only applaud them.

So, yeah, Amazon S3 has worked fine and been fine and generally not offended me overmuch.

Varnish

The Presidential Youth Debate project had a number of high-profile sponsors potentially capable of generating significant usage spikes. Given the simplicity of the public-facing portion of the site (read-only content, no forms to submit), scaling out with a caching reverse proxy server was a great option. Fortunately, Varnish makes it pretty easy; basic Varnish configuration is simple, and putting it in place took relatively little time.

Why go with Varnish? It's designed from the ground up to be fast and scalable (check out the architecture notes for an interesting technical read). The time-based caching of resources is a nice approach in this case; we can have the cached representations live for a couple of minutes, which effectively takes the load off of Apache/Rails (we're running Rails with Phusion Passenger) while refreshing frequently enough for little CMS-driven tweaks to percolate up in a timely fashion. Furthermore, it's not a custom caching design, instead relying upon the fundamentals of caching in HTTP itself. Varnish, with its Varnish Configuration Language (VCL), is extremely flexible and configurable, allowing us to easily do things like ignore cookies, normalize domain names (though I ultimately did this in Apache), normalize the annoying Accept-Encoding header values, etc. Furthermore, if the cache interval is too long for a particular change, Varnish gives you a straightforward, expressive way of purging cached representations, which came in handy on a number of occasions close to launch time.

A number of us at End Point have been interested in Varnish for some time. We've made some core patches: JT Justman tracked down a caching bug when using Edge-Side Includes (ESI), and Charles Curley and JT have done some work to add native gzip/deflate support in Varnish, though that remains to be released upstream. We've also prototyped a system relying on ESI and message-driven cache purging for an up-to-date, high-speed, extremely scalable architecture. (That particular project hasn't gone into production yet due to the degree of effort required to refactor much of the underlying app to fit the design, though it may still come to pass next year -- I hope!)

Getting Varnish to play nice with Radiant was a non-issue, because the relative simplicity of the site feature set and content did not require specialized handling of any particular resource: one cache interval was good for all pages. Consequently, rather than fretting about having Radiant issue Cache-Control headers on a per-page basis (which may have been fairly unpleasant, though I didn't look into it deeply; eventually I'll need to, though, having gotten modestly hooked on Radiant and less-modestly hooked on Varnish), the setup was refreshingly simple:

  • The public site's domain disallows all access to the Radiant admin, meaning it's effectively a read-only site.
  • The public domain's Apache container always issues a couple of cache-related headers:
    Header always set Cache-Control "public, max-age=120"
    Header always set Vary "Accept-Encoding"
    The Cache-Control header tells clients (Varnish in this case) that it's acceptable to cache representations for 120 seconds, and that all representations are valid for all users ("public"). We can, if we want, use VCL to clean this out of the representation Varnish passes along to clients (i.e. browsers) so that browsers don't cache automatically, instead relying on conditional GET. The Vary header tells clients that cache (again, primarily concerned with Varnish here) to consider the "Accept-Encoding" header value of a request when keying cached representations.
  • An entirely separate domain exists that is not fronted by Varnish and allows access to the Radiant admin. We could have it fronted by Varnish with caching deactivated, but the configuration we used keeps things clean and simple.
  • We use some simple VCL to tell Varnish to ignore cookies (in case of Rails sessions on the public site), and to normalize the Accept-Encoding header value to one of "gzip" or "deflate" (or none at all) to avoid caching different versions of the same representation due to inconsistent header values submitted by competing browsers.
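
The VCL itself isn't reproduced in this post. As a rough model of what the Accept-Encoding normalization does, here is a shell sketch; it is not VCL and not the actual snippet from the Varnish FAQ, and the function name normalize_accept_encoding is made up for illustration. It assumes a plain substring match is good enough, which ignores q-values and letter case.

```shell
# normalize_accept_encoding: collapse a raw Accept-Encoding value to
# "gzip", "deflate", or an empty line (meaning: drop the header).
# Simplification: substring match only; q-values such as "gzip;q=0"
# and upper-case spellings are not handled.
normalize_accept_encoding() {
    case "$1" in
        *gzip*)    printf '%s\n' gzip ;;     # prefer gzip whenever it is offered
        *deflate*) printf '%s\n' deflate ;;  # otherwise fall back to deflate
        *)         printf '\n' ;;            # anything else: no encoding at all
    esac
}
```

With this in place, values like "gzip,deflate" and "deflate, gzip" both collapse to "gzip", so a cache that varies on Accept-Encoding stores at most three variants of each page rather than one per browser spelling.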

Getting all that sorted was, as stated, refreshingly easy. It was a little less easy, surprisingly, to deal with logging. The main Varnish daemon (varnishd) logs to a shared memory block. The logs just sit there (and presumably eventually get overwritten) unless consumed by another process. A varnishlog utility, which can be run as a one-off or as a daemon, reads in the logs and outputs them in various ways. Furthermore, a varnishncsa utility outputs logging information in an Apache/NCSA-inspired "combined log" format (though it includes full URLs in the request string rather than just the path portion, presumably due to the possibility of Varnish fronting many different domains). Neither one of these is particularly complicated, though the varnishlog output is reportedly quite verbose and may need frequent rotation, and when run in daemon mode, both will re-open the log file to which they write upon receiving SIGHUP, meaning they'll play nice with log rotation routines. I found myself repeatedly wishing, however, that they both interfaced with syslog.

So, I'm very happy with Varnish at this point. Being a jerk, I must nevertheless publicly pick a few nits:

  • Why no syslog support in the logging utilities? Is there a compelling argument against it (I haven't encountered one, but admittedly I haven't looked very hard), or is it simply a case of not having been handled yet?
  • The VCL snippet we used for normalizing the Accept-Encoding header came right off the Varnish FAQ, and seems to be a pretty common case. I wonder if it would make more sense for this to be part of the default VCL configuration requiring explicit deactivation if not desired. It's not a big deal either way, but it seems like the vast majority of deployments are likely to use this strategy.

That's all I have to whine about, so either I'm insufficiently observant or the software effectively solves the problem it set out to address. These options are not mutually exclusive.

I'm definitely looking forward to further work with Varnish. This project didn't get into ESI support at all, but the native ESI support, combined with the high-performance caching, seems like a real win, potentially allowing for simplification of resource design in the application server, since documents can be constructed by the edge server (Varnish in this case) from multiple components. That sort of approach to design calls into question many of the standard practices seen in popular (and unpopular) application servers (namely, high-level templating with "pages" fitting into an overall "layout") but could help engineers maintain component encapsulation and think through more effectively the URL space, resource privacy and scoping considerations (whether or not a resource varies per user, by context, etc.), and so on. But I digress. Shocking.

Walden University Presidential Youth Debate goes live

This afternoon was the launch of Walden University's Presidential Youth Debate website, which features 14 questions and video responses from Presidential candidates Barack Obama and John McCain. The video responses are about 44 minutes long overall.

The site has a fairly simple feature set but is technologically interesting for us. It was developed by Dan Collis-Puro and Ethan Rowe using Radiant, PostgreSQL, CentOS Linux, Ruby on Rails, Phusion Passenger, Apache, Varnish, and Amazon S3.

Nice work, guys!

Machine virtualization on the Linux desktop

In the past I've used virtualization mostly in server environments: Xen as a sysadmin, and VMware and Virtuozzo as a user. They have worked well enough. When there've been problems they've mostly been traceable to network configuration trouble.

Lately I've been playing with virtualization on the desktop, specifically on Ubuntu desktops, using Xen, kvm, and VirtualBox. Here are a few notes.

Xen: Requires hardware virtualization support for full virtualization, and paravirtualization is of course only for certain types of guests. It feels a little heavier on resource usage, but I haven't tried to move beyond lame anecdote to confirm that.

kvm: Rumored to have been not ready for prime time, but when used from libvirt with virt-manager, has been very nice for me. It requires hardware virtualization support. One major problem in kvm on Ubuntu 8.04 is with the CD/DVD driver when using RHEL/CentOS guests. To work around that, I used the net install and it worked fine.

VirtualBox: This was for me the simplest of all for desktop stuff. I've used both the OSE (Open Source Edition) in Ubuntu and Sun's cost-free but proprietary package on Windows Vista. The current release only emulates 32-bit i386 machines, though! (No 64-bit guests, though a 64-bit host is fine.) It's also been a little buggy at times -- I've had a few machine crashes when running both an OpenBSD 4.3 and a RHEL 5 guest, though I wasn't able to reproduce the problem and it's possible it wasn't a VirtualBox issue.

I should note that some manufacturers have a BIOS option to disable hardware virtualization, and that it is sometimes disabled by default. When booting a new machine, check for that, especially in servers you won't necessarily want to take down later.
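
On a running Linux system, a quick first check is to look for the CPU flags vmx (Intel VT-x) or svm (AMD-V) in /proc/cpuinfo. The function below is a minimal sketch with a made-up name, hw_virt_flags, and it takes an optional file argument only so it can be tried against a saved copy of cpuinfo. The flag shows what the CPU advertises; whether a BIOS setting hides it varies by vendor, so treat a positive result as a hint rather than proof that virtualization is switched on.

```shell
# hw_virt_flags: report which hardware virtualization CPU flag, if any,
# appears in a cpuinfo-style file (defaults to /proc/cpuinfo on Linux).
hw_virt_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches whole words only, so a flag like "svm_lock" does not count as "svm".
    if grep -qw vmx "$cpuinfo" 2>/dev/null; then
        echo "vmx (Intel VT-x)"
    elif grep -qw svm "$cpuinfo" 2>/dev/null; then
        echo "svm (AMD-V)"
    else
        echo "none"
    fi
}
```

Running hw_virt_flags with no argument reads the live /proc/cpuinfo; if it prints "none" on hardware that should support virtualization, the BIOS setting is the first place to look.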

A final note about RHEL 5's net install: Why, oh why, does the installer ask for an HTTP install location as separate web site and directory entries, instead of a universally used and easy URL? And further, when the install source I'm using goes down (as download mirrors occasionally do), why are my only options to reboot or retry? Would it have been so hard to allow me the option of entering a new download URL? Yes, I know, I need to send in a patch.

Death, Taxes -- and Spam?!

End Point has been hard at work keeping spam from reaching our clients' email boxes.

In December 2004, End Point installed new email filters, powered by two state-of-the-art packages: amavisd-new and SpamAssassin.

By filtering out the most egregious junk emails at the server level, far fewer undesirable emails reach our clients' inboxes. Real mail downloads to their inboxes faster, and the painful morning routine of going through the pile of junk mail delivered overnight is gone. Those time savings, multiplied across every employee, really add up.

We're gratified to hear about the positive, immediate difference this change has made for our current email services clients. If you're not currently using End Point's email service, please contact us. We'd be happy to discuss your requirements and explain how we can help resolve your email-related concerns.