Hosting Blog Archive

Making SSL Work with Django Behind an Apache Reverse Proxy

Bouncing Admin Logins

We have a Django application that runs on Gunicorn behind an Apache reverse proxy server. I was asked to look into a strange issue with it: After a successful login to the admin interface, the browser was re-directed to the http (non-SSL) version of the interface.

After some googling and investigation, I determined the issue was likely due to our specific server arrangement. Although the login requests were made over https, the requests proxied by Apache to Gunicorn used http (securely on the same host). Checking the Apache SSL error logs quickly confirmed this suspicion. I described the issue in the #django channel on freenode IRC and received some assistance from Django core developer Carl Meyer. As of Django 1.4 there was a new setting Carl had developed to handle this particular scenario.

Enter SECURE_PROXY_SSL_HEADER

The documentation for the SECURE_PROXY_SSL_HEADER variable describes how to configure it for your project. I added the following to the settings.py config file:

SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')

Because this setting tells Django to trust the X-Forwarded-Proto header coming from the proxy (Apache) there are security concerns which must be addressed. The details are described in the Django documentation and this is the Apache configuration I ended up with:

# strip the X-Forwarded-Proto header from incoming requests
RequestHeader unset X-Forwarded-Proto

# set the header for requests using HTTPS
RequestHeader set X-Forwarded-Proto https env=HTTPS

With SECURE_PROXY_SSL_HEADER in place and the Apache configuration updated, logins to the admin site began to work correctly.
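
To make the mechanism concrete, here is a simplified Python sketch of the kind of check this setting enables. It is not Django's actual implementation; the function name request_is_secure and the plain environ dict are assumptions made for illustration.

```python
def request_is_secure(environ, secure_proxy_ssl_header=None):
    """Return True if the request should be treated as HTTPS.

    environ is a WSGI-style dict; secure_proxy_ssl_header is a
    (header_name, secure_value) tuple shaped like the Django setting.
    """
    if secure_proxy_ssl_header is not None:
        header_name, secure_value = secure_proxy_ssl_header
        header_value = environ.get(header_name)
        if header_value is not None:
            # The proxy supplied the header, so trust what it says.
            return header_value == secure_value
    # No setting or no header: fall back to the direct connection's scheme.
    return environ.get("wsgi.url_scheme") == "https"


# Apache terminated SSL and proxied the request to Gunicorn over plain http.
setting = ("HTTP_X_FORWARDED_PROTO", "https")
proxied = {"wsgi.url_scheme": "http", "HTTP_X_FORWARDED_PROTO": "https"}
print(request_is_secure(proxied, setting))  # True
print(request_is_secure(proxied))           # False
```

The second call shows the original symptom: without the setting, the application only sees the plain http hop from the proxy. It also shows why Apache must strip any client-supplied X-Forwarded-Proto header, since whatever arrives in that header is believed.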

This is standard practice for web applications that sit behind an HTTP reverse proxy, but when an application is initially set up with plain HTTP only and HTTPS is added later, this part of the setup is easy to overlook.

Converting RHEL 5.9 and 6.4 to CentOS

CentOS is, by design, an almost identical rebuild of Red Hat Enterprise Linux (RHEL). Any given version of each OS should behave the same as the other and packages and yum repositories built for one should work for the other unchanged. Any exception I would call a bug.

Because Red Hat is the source or origin of packages that ultimately end up in CentOS, there is an inherent delay between when Red Hat releases new packages and when they appear in CentOS. CentOS is financed by optional donations of work, hosting, and money, while Red Hat Enterprise Linux is financed by requiring customers to purchase entitlements to use the software and get various levels of support from Red Hat.

Thanks to this close similarity and the tradeoff between rapidity of updates vs. cost and entitlement tracking, we find reasons to use both RHEL and CentOS, depending on the situation.

Sometimes we want to convert RHEL to CentOS or vice versa, on a running machine, without the expense and destabilizing effect of having to reinstall the operating system. In the past I've written on this blog about converting from CentOS 6 to RHEL 6, and earlier about converting from RHEL 5 to CentOS 5.

I recently needed to migrate several servers from RHEL to CentOS, and found an update of the procedure was in order because some URLs and package versions had changed. Here are current instructions on how to migrate from RHEL 5.9 to CentOS 5.9, and RHEL 6.4 to CentOS 6.4.

These commands should of course be run as root, and observed carefully by a human eye to look for any errors or warnings and adapt accordingly.

RHEL 5.9 to CentOS 5.9 conversion, 64-bit (x86_64)

cd
mkdir centos
cd centos
wget http://mirror.centos.org/centos/5.9/os/x86_64/RPM-GPG-KEY-CentOS-5
wget http://mirror.centos.org/centos/5.9/os/x86_64/CentOS/centos-release-5-9.el5.centos.1.x86_64.rpm
wget http://mirror.centos.org/centos/5.9/os/x86_64/CentOS/centos-release-notes-5.9-0.x86_64.rpm
wget http://mirror.centos.org/centos/5.9/os/x86_64/CentOS/yum-3.2.22-40.el5.centos.noarch.rpm
wget http://mirror.centos.org/centos/5.9/os/x86_64/CentOS/yum-updatesd-0.9-5.el5.noarch.rpm
wget http://mirror.centos.org/centos/5.9/os/x86_64/CentOS/yum-fastestmirror-1.1.16-21.el5.centos.noarch.rpm
wget http://mirror.centos.org/centos/5.9/os/x86_64/CentOS/gamin-python-0.1.7-10.el5.x86_64.rpm
yum erase yum-rhn-plugin rhn-client-tools rhn-virtualization-common rhn-setup rhn-check rhnsd yum-updatesd
yum clean all
rpm --import RPM-GPG-KEY-CentOS-5
rpm -e --nodeps redhat-release
yum localinstall *.rpm
yum upgrade
shutdown -r now

RHEL 5.9 to CentOS 5.9 conversion, 32-bit (i386)

cd
mkdir centos
cd centos
wget http://mirror.centos.org/centos/5.9/os/i386/RPM-GPG-KEY-CentOS-5
wget http://mirror.centos.org/centos/5.9/os/i386/CentOS/centos-release-5-9.el5.centos.1.i386.rpm
wget http://mirror.centos.org/centos/5.9/os/i386/CentOS/centos-release-notes-5.9-0.i386.rpm
wget http://mirror.centos.org/centos/5.9/os/i386/CentOS/yum-3.2.22-40.el5.centos.noarch.rpm
wget http://mirror.centos.org/centos/5.9/os/i386/CentOS/yum-updatesd-0.9-5.el5.noarch.rpm
wget http://mirror.centos.org/centos/5.9/os/i386/CentOS/yum-fastestmirror-1.1.16-21.el5.centos.noarch.rpm
wget http://mirror.centos.org/centos/5.9/os/i386/CentOS/gamin-python-0.1.7-10.el5.i386.rpm
yum erase yum-rhn-plugin rhn-client-tools rhn-virtualization-common rhn-setup rhn-check rhnsd yum-updatesd
yum clean all
rpm --import RPM-GPG-KEY-CentOS-5
rpm -e --nodeps redhat-release
yum localinstall *.rpm
yum upgrade
shutdown -r now

RHEL 6.4 to CentOS 6.4 conversion, 64-bit (x86_64)

cd
mkdir centos
cd centos
wget http://mirror.centos.org/centos/6.4/os/x86_64/RPM-GPG-KEY-CentOS-6
wget http://mirror.centos.org/centos/6.4/os/x86_64/Packages/centos-release-6-4.el6.centos.10.x86_64.rpm
wget http://mirror.centos.org/centos/6.4/os/x86_64/Packages/yum-3.2.29-40.el6.centos.noarch.rpm
wget http://mirror.centos.org/centos/6.4/os/x86_64/Packages/yum-utils-1.1.30-14.el6.noarch.rpm
wget http://mirror.centos.org/centos/6.4/os/x86_64/Packages/yum-plugin-fastestmirror-1.1.30-14.el6.noarch.rpm
yum erase yum-rhn-plugin rhn-client-tools rhn-virtualization-common rhn-setup rhn-check rhnsd yum-updatesd subscription-manager
yum clean all
rpm --import RPM-GPG-KEY-CentOS-6
rpm -e --nodeps redhat-release-server
yum localinstall *.rpm
yum upgrade
shutdown -r now

We don't use 32-bit (i386) RHEL or CentOS 6, so you're on your own with that, but it should be very straightforward to adapt the x86_64 instructions.

If during the yum localinstall you get an error like this that references a URL containing %24releasever:

[Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
Error: Cannot retrieve repository metadata (repomd.xml) for repository

Then you need to temporarily disable that add-on yum repository until after the conversion is complete by editing /etc/yum.repos.d/name.repo to change enabled=1 to enabled=0. The problem here is caused by the repo configuration using the releasever yum variable which is undefined mid-conversion because we forcibly removed the redhat-release* package that defines it. We can't expect the OS to know what kind it is in the middle of its identity crisis and change!
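
If several add-on repositories are affected, the edit can be scripted. The following is a minimal Python sketch that works on the text of a .repo file; the function name disable_repos and the sample repository (including its URL) are hypothetical, and reading and writing the actual file is left out.

```python
import re


def disable_repos(repo_text):
    """Return repo_text with every 'enabled=1' line changed to 'enabled=0'."""
    # [ \t]* instead of \s* so the pattern never swallows line breaks.
    pattern = r"^enabled[ \t]*=[ \t]*1[ \t]*$"
    return re.sub(pattern, "enabled=0", repo_text, flags=re.MULTILINE)


sample = """[addon]
name=Add-on repo
baseurl=http://repo.example.com/el/$releasever/$basearch/
enabled=1
gpgcheck=1
"""
print(disable_repos(sample))
```

Remember to change the value back to enabled=1 once the conversion is complete.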

If all goes well, nothing will look any different at all, except you'll now see:

# cat /etc/redhat-release 
CentOS release 5.9 (Final)

or:

# cat /etc/redhat-release 
CentOS release 6.4 (Final)

Deploying password files with Chef

Today I worked on a Chef recipe that needed to deploy an rsync password file from an encrypted data bag. Obtaining the password from the data bag in the recipe is well documented, but I knew that great care should be taken when writing the file. There is a plethora of ways to write strings to files in Chef, but many have potential vulnerabilities when dealing with secrets. Caveats:

  • The details of execute resources may be gleaned from globally visible areas of /proc.
  • The contents of a template may be echoed to the Chef client.log or stored in cache, stacktrace, or backup areas.
  • Some Chef resources which write to files can be made to dump the diff or contents to stdout when run with verbosity.

With tremendous help from Jay Feldblum in freenode#chef, we came up with a safe, optimized solution to deploy the password from a series of ruby blocks:

pw_path = Pathname("/path/to/pwd/file")
pw_path_uid = 0
pw_path_gid = 0
pw = Chef::EncryptedDataBagItem.load("bag", "item")['password']

ruby_block "#{pw_path}-touch" do
  block   { FileUtils.touch pw_path } # so that we can chown & chmod it before writing the pw to it
  not_if  { pw_path.file? }
end

ruby_block "#{pw_path}-chown" do
  block   { FileUtils.chown pw_path_uid, pw_path_gid, pw_path }
  not_if  { s = pw_path.stat ; s.uid == pw_path_uid && s.gid == pw_path_gid }
end

ruby_block "#{pw_path}-chmod" do
  block   { FileUtils.chmod 0600, pw_path }
  not_if  { s = pw_path.stat ; "%o" % s.mode == "100600" }
end

ruby_block "#{pw_path}-content" do
  block   { pw_path.open("w") {|f| f.write pw} }
  not_if  { pw_path.read == pw } # NOTE: a secure compare method might make this even better
end
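
The same ordering (create, restrict, then write) is useful outside Chef too. Here is a rough Python sketch of it, offered only as an illustration and not as part of the recipe; write_secret is a hypothetical helper, and it uses hmac.compare_digest as one possible take on the "secure compare" mentioned in the note above.

```python
import hmac
import os


def write_secret(path, secret, uid=None, gid=None):
    """Write secret to path with mode 0600; return False if already current."""
    data = secret.encode("utf-8")
    # O_CREAT without O_TRUNC: create the file if missing, keep old content.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+b") as f:
        # Lock down mode and ownership before any secret bytes are written.
        os.fchmod(f.fileno(), 0o600)
        if uid is not None and gid is not None:
            os.fchown(f.fileno(), uid, gid)
        # Constant-time comparison of the existing content with the secret.
        if hmac.compare_digest(f.read(), data):
            return False
        # Content differs: replace it in place.
        f.seek(0)
        f.truncate()
        f.write(data)
    return True
```

Like the ruby blocks, the helper is idempotent: a second call with the same secret leaves the file alone and returns False.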


Getting started with Heroku

It's becoming increasingly popular to host applications with a nice cloud-based platform like Engine Yard or Heroku.

Here is a little guide showing how to join the development of a Heroku-based project. In Heroku terms it's called "collaborating on the project". The official tutorial does provide answers to most of the questions, but I would like to enhance it with my thoughts and experiences.

First essential question: how to get your hands on the app source code?

I wish Heroku provided something like the DevCamps service, so you wouldn't need to go through the hassle of launching the application locally and dealing with the database and system processes needed for development. With Heroku, the code does need to be cloned to the local environment like this:

$ heroku git:clone --app my_heroku_app

Second, how to commit the changes?

I got this error when trying to push to the repository:

! Your key with fingerprint xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx is not authorized
to access my_heroku_app.
fatal: The remote end hung up unexpectedly

It turned out I needed to add the new identity to my local machine.

Also, if you previously had Heroku accounts with a different email address, it's essential to create a new ssh key just for the application you are collaborating on. Heroku does not allow the same ssh key to be used for different accounts.

Here is the full sequence:

$ ssh-keygen -t rsa -C "yourname@yourdomain.com" -f  ~/.ssh/id_rsa_heroku
$ ssh-add ~/.ssh/id_rsa_heroku

and, finally

$ heroku keys:add ~/.ssh/id_rsa_heroku.pub
$ git push heroku master

The code is not only pushed with this command, but it also gets immediately deployed on the server.

Finally, how to run the application console?

I use the application console a lot to debug, troubleshoot, and check things after deployment.

For Heroku it's the Heroku Toolbelt "run" command that triggers all the usual command line routines. The "-a" parameter is necessary to define the application.

heroku run -a my_heroku_app script/rails console

That's it! Nice & easy!

Install SSL Certificate from Network Solutions on nginx

Despite nginx serving pages for 12.22% of the web's million busiest sites, Network Solutions does not provide instructions for installing SSL certificates for nginx. This article provides the exact steps for chaining the intermediary certificates for use with nginx.

Chaining the Certificates

Unlike Apache, nginx does not allow specification of intermediate certificates in a directive, so we must combine the server certificate, the intermediates, and the root in a single file. The zip file provided from Network Solutions contains a number of certificates, but no instructions on the order in which to chain them together. Network Solutions' instructions for installing on Apache provide a hint, but let's make it clear.

cat your.site.com.crt UTNAddTrustServer_CA.crt NetworkSolutions_CA.crt > chained_your.site.com.crt

This follows the general convention of "building up" to a trusted "root" authority by appending each intermediary. In this case UTNAddTrustServer_CA.crt is the intermediary while NetworkSolutions_CA.crt is the parent authority. With your certificates now chained together properly, use the usual nginx directives to configure SSL.

listen                 443;
ssl                    on;
ssl_certificate        /etc/ssl/chained_your.site.com.crt;
ssl_certificate_key    /etc/ssl/your.site.com.key;
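
One thing to watch when chaining with cat: if a certificate file does not end with a newline, the END line of one certificate runs into the BEGIN line of the next and the chained file becomes unparseable. A small Python sketch of the chaining step that guards against this follows; chain_pem is a hypothetical helper and the certificate bodies are placeholders.

```python
def chain_pem(*pem_texts):
    """Concatenate PEM certificates in the order given."""
    # Strip stray whitespace so every block ends with exactly one newline.
    return "".join(text.strip() + "\n" for text in pem_texts)


# Placeholder bodies; note the first one has no trailing newline.
server = "-----BEGIN CERTIFICATE-----\nSERVER\n-----END CERTIFICATE-----"
intermediate = "-----BEGIN CERTIFICATE-----\nINTERMEDIATE\n-----END CERTIFICATE-----\n"
parent = "-----BEGIN CERTIFICATE-----\nPARENT\n-----END CERTIFICATE-----\n\n"

print(chain_pem(server, intermediate, parent))
```

The argument order mirrors the cat command above: the server certificate first, then each authority up the chain.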

As always, make sure your key file is secure by giving it minimal permissions.

chmod 600 your.site.com.key

I hope this little note eases the way for nginx users looking to use a Network Solutions SSL certificate.

Redirect from HTTP to HTTPS before basic auth

While reviewing PCI scan results for a client, I found that the scanner had flagged a private admin URL for requesting HTTP basic auth over plain HTTP. The admin portion of the site has its own authentication method and is served completely over HTTPS. We have a second layer of protection with basic auth, but the problem is that the username and password could be snooped on, since the basic auth prompt can be reached via HTTP.

My initial research and attempts at fixing the problem did not work out as intended, until I found this blog post on the subject. The post laid out all of the approaches I had already tried and then presented a new solution.

I followed the recommended hack, which is to use SSLRequireSSL in a location matching the admin and a custom 403 ErrorDocument. This 403 ErrorDocument does a bit of munging of the URL and redirects from HTTP to HTTPS. The instructions in the blog did have one issue: in our environment I could not serve the 403 document from the admin area. I had to put it in an area that could be accessed over HTTP by the public. I'm not sure how it could work when served from a URL that requires SSL and is protected by basic auth. The hack works because SSLRequireSSL is processed before any auth requirements, and the 403 ErrorDocument is presented when SSL is not being used.
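
The URL munging itself is small. As a rough illustration only (not the error document we deployed), here is a Python sketch that builds the HTTPS target from CGI-style variables. The function name is hypothetical, and it assumes the error document can read HTTP_HOST plus the REDIRECT_URL and REDIRECT_QUERY_STRING variables that Apache sets for custom error documents.

```python
def https_redirect_target(env):
    """Build the https:// URL matching the request that triggered the 403."""
    # Drop any explicit port such as ":80"; IPv6 literals are not handled.
    host = env.get("HTTP_HOST", "").split(":")[0]
    path = env.get("REDIRECT_URL") or "/"
    query = env.get("REDIRECT_QUERY_STRING", "")
    url = "https://" + host + path
    if query:
        url += "?" + query
    return url


env = {
    "HTTP_HOST": "www.example.com:80",
    "REDIRECT_URL": "/admin/orders",
    "REDIRECT_QUERY_STRING": "page=2",
}
print(https_redirect_target(env))  # https://www.example.com/admin/orders?page=2
```

The error document would then send this value in a Location header or a meta refresh, so the browser retries over HTTPS before basic auth is ever requested.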

Now hopefully the scanner will be happy (as happy as a scanner can be) by always requiring HTTPS when /admin appears in the URL and presenting an error when that is not the case, before the basic auth is requested.

Job Opening: DevOps Engineer

We're looking for a full-time, salaried DevOps engineer to work with our existing hosting and system administration team and consult with our clients on their needs. If you like to figure out problems and solve them, and can take responsibility for getting a job done well without intensive oversight, please read on!

What is in it for you?

  • Work from your home office
  • Flexible full-time work hours
  • Health insurance benefit
  • 401(k) retirement savings plan
  • Annual bonus opportunity
  • Ability to move without being tied to your job location

What you will be doing:

  • Remotely set up and maintain Linux servers (mostly RHEL/CentOS, Debian, and Ubuntu), daemons, and custom software written mostly in Ruby, Python, Perl, and PHP
  • Audit and improve security, reliability, backups, monitoring (with Nagios etc.)
  • Support developer use of major language ecosystems: Perl's CPAN, Python PyPI (pip/easy_install), Ruby gems, PHP PEAR/PECL, etc.
  • Automate provisioning with Chef, Puppet, etc.
  • Work with internal and customer systems and staff
  • Use open source tools and contribute back as opportunity arises
  • Use your desktop platform of choice: Linux, Mac OS X, Windows

What you will need:

  • Professional experience with Linux system administration, networking, firewalls, Apache or nginx web servers, SSL, DNS
  • A customer-centered focus
  • Strong verbal and written communication skills
  • Experience directing your own work, and working from home
  • Ability to learn new technologies
  • Willingness to shift work time to evening and weekend hours when required

Bonus points for experience:

  • Packaging software for RPM, Yum, and apt/dpkg
  • Managing Amazon Web Services, Rackspace Cloud, Heroku, or other cloud hosting services
  • Working with PostgreSQL, MySQL, Cassandra, CouchDB, or other databases
  • Complying or auditing for PCI and other security standards
  • Using load balancers, virtualization (kvm, Xen, VirtualBox, VMware), FC or iSCSI SAN storage
  • Working with JavaScript, HTML/CSS, Java/JVM, Node.js, etc.
  • Contributing to open source projects

About us

End Point is a 17-year-old Internet consulting company based in New York City, with 31 full-time employees working mostly remotely from home offices. We serve over 200 clients ranging from small family businesses to large corporations, using a variety of open source technologies. Our team is made up of strong ecommerce, database, and system administration talent, working together using ssh, Screen and tmux, IRC, Google+ Hangouts, Skype, and good old phones.

How to apply

Please email us an introduction to jobs@endpoint.com to apply. Include a resume and your GitHub or other URLs that would help us get to know you. We look forward to hearing from you!

Piggybak on Heroku

Several weeks ago, we were contacted through our website with a request for Heroku support on Piggybak. Piggybak is an open source Ruby on Rails ecommerce platform developed and maintained by End Point. Piggybak is similar to many other Rails gems in that it can be installed from Rubygems in any Rails application, and Heroku understands this requirement from the application’s Gemfile. This is a brief tutorial for getting a Rails application up and running with Piggybak. For the purpose of this tutorial, I’ll be using the existing Piggybak demo for deployment, instead of creating a Rails application from scratch.

a) First, clone the existing Piggybak demo. This will be your base application. On your development machine (local or other), you must run bundle install to get all the application’s dependencies.

b) Next, add config.assets.initialize_on_precompile = false to config/application.rb to allow your assets to be compiled without requiring a local database.

c) Next, compile the assets according to this Heroku article with the command RAILS_ENV=production bundle exec rake assets:precompile. This will generate all the application assets into the public/assets/ directory.

d) Next, add the assets to the repo by removing public/assets/ from .gitignore and committing all modified files. Heroku’s disk read-only limitation prohibits you from writing public/assets/ files on the fly, so this is a necessary step for Heroku deployment. It is not necessary for standard Rails deployments.

e) Next, assuming you have a Heroku account and have installed the Heroku toolbelt, run heroku create to create a new Heroku application.

f) Next, run git push heroku master to push your application to your new Heroku application. This will push the code and install the required dependencies in Heroku.

g) Next, run heroku pg:psql, followed by \i sample.psql to load the sample data to the Heroku application.

h) Finally, run heroku restart to restart your application. You can access your application through a browser by running heroku open.

That should be it. From there, you can manipulate and modify the demo to experiment with Piggybak functionality. The major difference between Heroku deployment and standard deployment is that all your compiled assets must be in the repository because Heroku cannot write them out on the fly. If you plan to deploy the application elsewhere, you will have to make modifications to the repository regarding public/assets.

A full set of commands for this tutorial includes:

# Clone and set up the demo app
git clone git://github.com/piggybak/demo.git
cd demo
bundle install
# add config.assets.initialize_on_precompile = false
# to config/application.rb

# Precompile assets and add to repository
RAILS_ENV=production bundle exec rake assets:precompile
# edit .gitignore here to stop ignoring public/assets/
git add .
git commit -m "Heroku support commit."

# Deploy to Heroku
heroku create
git push heroku master
heroku pg:psql
>> \i sample.psql
heroku restart
heroku open

cPanel no-pty ssh noise removal

We commonly use non-interactive ssh for automation of various tasks. This usually involves setting BatchMode=yes in the ~/.ssh/config file or the no-pty option in the ~/.ssh/authorized_keys file, and stops a tty from being assigned for the ssh session so that a job will not wait for interactive input in unexpected places.

When using a RHEL 5 Linux server that has been modified by cPanel, ssh sessions display “stdin: is not a tty” on stderr. For ad-hoc tasks this is merely an annoyance, but for jobs run from cron it means an email is sent because cron didn’t see an empty result from the job and wants an administrator to review the output.

You could quell all output from ssh, but then if any legitimate errors or warnings were sent, you won’t see those. So that is not ideal.

Using bash’s set -v option to trace commands being run on the cPanel server we found that they had modified Red Hat’s stock /etc/bashrc file and added this line:

mesg y

That writes a warning to stderr when there’s no tty because mesg doesn’t make sense in non-interactive environments.

The solution is simple, since we don’t care to hear that warning. We edit that line like this:

mesg y 2>/dev/null

This tip may only be useful to one or two people ever, if even that many. I hope they enjoy it. :)

Setting user ownership of nginx and Passenger processes

Do this now on all your production Rails app servers:

ps aux | grep Rails

The first column in the results of that command shows which user runs your Rails and Passenger processes. If this is a privileged user (sudoer, or worse yet password-less sudoer), then this article is for you.

Assumptions Check

There are several different strategies for modifying which user your Rails app runs as. By default the owner of config/environment.rb is the user which Passenger will run your application as. For some, simply changing the ownership of this file is sufficient, but in some cases, we may want to force Passenger to always use a particular user.

This article assumes you are running nginx compiled with Passenger support and that you have configured an unprivileged user named rails-app. This configuration has been tested with nginx version 0.7.67 and Passenger version 2.2.15. (Dated I know, but now that you can't find the docs for these old versions, this article is extra helpful.)

Modifying nginx.conf

The changes required in nginx are very straightforward.

# Added in the main, top-level section
user rails-app;

# Added in the appropriate http section among your other Passenger related options
passenger_user_switching off;
passenger_default_user rails-app;

The first directive tells nginx to run its worker processes as the rails-app user. It's not completely clear to me why this was required, but failing to include it resulted in the following error. Bonus points to anyone who can help me understand this one.

[error] 1085#0: *1 connect() to unix:/tmp/passenger.1064/master/helper_server.sock failed (111: Connection refused) while connecting to upstream, client: XXX, server: XXX, request: "GET XXX HTTP/1.0", upstream: "passenger://unix:/tmp/passenger.1064/master/helper_server.sock:", host: "XXX"

The second directive, passenger_user_switching off, tells Passenger to ignore the ownership of config/environment.rb and instead use the user specified in the passenger_default_user directive. Pretty straightforward!

Log File Permissions Gotcha

Presumably you're not storing your production log files in your app's log directory, but instead in /var/log/app_name and using logrotate to archive and compress your logs nightly. Make sure you update the configuration of logrotate to create the new log files with the appropriate user. Additionally, make sure you change the ownership of the current log file so that Passenger can write your application's logs!

DevCamps: Creating new camps from a non-default Git branch

I recently set up part of a new Rails project DevCamps installation with a unique Git repo setup and discovered a trick for creating camps from a Git branch other than master. Admittedly, the circumstances that led to me discovering this trick are a bit specific to this project, but the trick itself can be useful in other situations as well.

The Git repo specified in local-config had a master branch with nothing in it but the standard "initial commit." This relatively new project uses a simplified git-flow workflow and as such, all its code was still in the "develop" branch.

In my case, this empty-ish master branch meant there were no tracked files in the __CAMP_PATH__/public directory. This meant that Git did not create that directory when the repo was cloned by `mkcamp`, which in turn meant that apache2 would refuse to start. Camping without a web server makes my back hurt, so I snooped around a little bit...

I discovered two things:

  1. You can tell `git clone` which branch to checkout initially by passing it a '--branch $your_non_default_branch' switch
  2. The `mkcamp` command will happily pass that switch (as well as any other spicy options you include) along to the `git clone` system command it executes. To do that, just add it to your camp type's local-config file as part of the 'repo_path_git' config variable. For example:

    repo_path_git:git@github.com:somegituser/somegitrepo.git --branch develop

Note that this option means your fresh new camp won't have a 'master' branch checked out. This might confuse some users, but we all know the 'master' branch is nothing but a tracking branch with some convention mixed in. A simple `git checkout master` will create that expected master branch easily enough. It's probably worth giving your devs a heads up about this, lest they think something wonky is afoot with mkcamp.
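
Why does appending a switch to the repo path work at all? The internals of `mkcamp` are not shown here, so the following Python sketch is only a guess at the general shape: if a tool splits the configured value on whitespace and hands the pieces to `git clone`, any extra options ride along. The function name and the destination path are hypothetical.

```python
import shlex


def git_clone_argv(repo_path_git, destination):
    """Build a git clone command line from a config value that may carry options."""
    # shlex.split turns "url --branch develop" into separate arguments.
    return ["git", "clone"] + shlex.split(repo_path_git) + [destination]


argv = git_clone_argv(
    "git@github.com:somegituser/somegitrepo.git --branch develop",
    "/home/camp/camp1",
)
print(argv)
```

Git accepts options after the repository argument, so the resulting command checks out develop in the new clone.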

Now, there are people out there that may try to find fault with my solution. These detractors, these misanthropes, these malingering sluggards might cry, "Why don't you just commit an empty __CAMP_PATH__/public/.gitkeep to your master branch?" Well, I like a clean git history. So, to those people I would say, "David, that's messy and silly and wouldn't make a very good blog article at all. I'm embarrassed for you for even bringing it up, David."



Automatically kill process using too much memory on Linux

Sometimes on Linux (and other Unix variants) a process will consume way too much memory. This is more likely if you have a fair amount of swap space configured -- but within the range of normal, for example, as much swap as you have RAM.

There are various methods to try to limit trouble from such situations. You can use the shell's ulimit setting to put a hard cap on the amount of RAM allowed to the process. You can adjust settings in /etc/security/limits.conf on both Red Hat- and Debian-based distros. You can wait for the OOM (out of memory) killer to notice the process and kill it.

But none of those remedies help in situations where you want a process to be able to use a lot of RAM, sometimes, when there's a point to it and it's not just in an infinite loop that will eventually use all memory.

Sometimes such a bad process will bog the machine down horribly before the OOM killer notices it.

We put together the following script about a year ago to handle such cases:

It uses the Proc::ProcessTable module from Perl's CPAN to do the heavy lifting. We invoke it once per minute in cron. If you have processes eating up memory so quickly that they bring down the machine in less than a minute, you could run it in a loop every few seconds instead.

It's easy to customize based on various attributes of a process. In our example here we have it ignore root processes which are assumed to be better vetted. We have commented out a restriction to watch only for Ruby on Rails processes in Passenger. And we kill only processes using 1 GiB or more RAM.

If a process makes it past these tests and is considered bad, we print out a report that crond emails to us, so we can investigate and ideally fix the problem. Then we try to kill the process gracefully, and after 5 seconds forcibly terminate it.
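
To make the decision logic concrete, here is a minimal Python sketch of the same tests. It is not the Perl script described above: the function names are hypothetical, the process list is passed in as plain dicts rather than read from the system, and using resident memory ('rss') as the measure is an assumption.

```python
import os
import signal
import time

LIMIT_BYTES = 1024 ** 3  # 1 GiB


def select_offenders(processes, limit_bytes=LIMIT_BYTES):
    """Pick non-root processes using at least limit_bytes of memory.

    processes is an iterable of dicts with 'pid', 'uid' and 'rss' (bytes).
    """
    return [p for p in processes if p["uid"] != 0 and p["rss"] >= limit_bytes]


def terminate(pid, grace_seconds=5):
    """Ask the process to exit, then force it after grace_seconds."""
    try:
        os.kill(pid, signal.SIGTERM)
        time.sleep(grace_seconds)
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the process is already gone


table = [
    {"pid": 101, "uid": 0, "rss": 2 * LIMIT_BYTES},     # root: ignored
    {"pid": 202, "uid": 500, "rss": 200 * 1024 ** 2},   # small: ignored
    {"pid": 303, "uid": 500, "rss": LIMIT_BYTES + 1},   # offender
]
for proc in select_offenders(table):
    print("would kill pid %d using %d bytes" % (proc["pid"], proc["rss"]))
```

The example only prints what it would kill. Anything printed by a cron job ends up in the email report, which is how the real script notifies us.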

It's simple, easily customizable, and has come in handy for us.

cPanel Exim false positive failure & restart fix

I'm not a big fan of add-on graphical control panels for Linux such as cPanel, Webmin, Ensim, etc. They deviate from the distributor's standard packages and locations for files, often simultaneously tightening security in various ways and weakening security practically by making several more remotely accessible administration logins.

On one of the few servers we maintain that has cPanel on it, today we did a routine Red Hat Network update and reboot to load the latest RHEL 5 kernel, and all seemed to go well.

However, within a few minutes we started getting emailed reports from the cPanel service monitor saying that the Exim mail server had failed and been restarted. These emails began coming in at roughly 5-minute intervals:

Date: Tue, 24 Jul 2012 14:21:05 -0400
From: cPanel ChkServd Service Monitor <cpanel@[SNIP]>
To: [SNIP]
Subject: exim on [SNIP] status: failed

exim failed @ Tue Jul 24 14:21:04 2012. A restart was attempted automagically.

Service Check Method:  [socket connect] 

Reason: TCP Transaction Log: 
<< 220-[SNIP] ESMTP Exim 4.77 #2 Tue, 24 Jul 2012 14:21:04 -0400 
<< 
<< 
>> EHLO localhost
<< 250-[SNIP] Hello localhost.localdomain [127.0.0.1]
<< 
<< 
<< 
<< 
<< 
>> AUTH PLAIN
[SNIP]=
<< 535 Incorrect authentication data
exim: ** [535 Incorrect authentication data != 2]
: Died at /usr/local/cpanel/Cpanel/TailWatch/ChkServd.pm line 689, <$socket_scc> line 10.


Number of Restart Attempts: 1

Startup Log: Starting exim: [  OK  ]

And the relevant entry in /var/log/exim_mainlog was:

2012-07-24 14:08:05 fixed_plain authenticator failed for localhost.localdomain (localhost) [127.0.0.1]:48454: 535 Incorrect authentication data (set_id=__cpane
l__service__auth__exim__[SNIP])

I wasn't able to find a way to fix this in any reasonable amount of time, so I opened a trouble ticket with cPanel support. They asked for server access, logged in, and fixed the problem within a little over an hour. It was about as painless as tech support ever gets, so kudos to cPanel for that!

The solution was to run this as root:

/scripts/upcp --force

Which resyncs cPanel so that chkservd reports Exim as up and the unwanted service restarts no longer happen.

Here's to responsive tech support.

Automated VM cloning with PowerCLI

Most small businesses cannot afford the high performance storage area networks (SANs) that make traditional redundancy options such as high availability and fault tolerance possible. Despite this, the APIs available to administrators of virtualized infrastructure using direct attached storage (DAS) make it possible to recreate many of the benefits of high availability.

High Availability on SAN vs DAS

A single server failure in a virtualized environment can mean many applications and services become unavailable simultaneously; for small organizations, this can be particularly damaging. High availability with SANs minimizes the downtime of applications and services when a host fails by keeping virtual machine (VM) storage off the host and on the SAN. VMs on a failed host can then be automatically restarted on hosts with excess capacity. This of course requires SAN infrastructure to be highly redundant, adding to the already expensive and complex nature of SANs.

Alternatively, direct attached storage (DAS) is very cost effective, performant, and well understood. By using software to automate the snapshot and cloning of VMs via traditional gigabit Ethernet from host to host, we can create a "poor man's" high availability system.

It's important for administrators to understand that there is a very real window of data loss that can range from hours to days depending on the number of systems backed up and hardware in use. However, for many small businesses who may not have trustworthy backups, automated cloning is an excellent step forward.

Automated cloning with VMware's PowerCLI

Although End Point is primarily an open source shop, my introduction to virtualization was with VMware. For automation and scripting, we will build on PowerCLI, the PowerShell-based command line interface for vSphere. The process is as follows:

  • A scheduled task executes the backup script.
  • Delete all old backups to free space.
  • Read CSV of VMs to be backed up and the target host and datastore.
  • For each VM, snapshot and clone to destination.
  • Collect data on cloning failures and email report.
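
The control flow of the last three steps can be sketched in a few lines. This is only an illustration in Python, not the PowerCLI code from the repository: the CSV column names (vm, host, datastore) and the clone_vm callable are assumptions made for the example, and the real snapshot-and-clone step would be a vSphere call.

```python
import csv
import io


def run_backup(csv_text, clone_vm):
    """Clone every VM listed in the CSV and return a plain-text report.

    clone_vm is a hypothetical callable (vm, host, datastore) standing in
    for the snapshot-and-clone step; it should raise on failure.
    The column names vm, host and datastore are assumed for this sketch.
    """
    failures = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            clone_vm(row["vm"], row["host"], row["datastore"])
        except Exception as exc:
            # Record the failure and keep going so one bad VM
            # does not stop the remaining backups.
            failures.append(f"{row['vm']}: {exc}")
    if not failures:
        return "All clones succeeded."
    return "Clone failures:\n" + "\n".join(failures)
```

Catching the exception per VM is the important design choice: the report at the end covers every VM, rather than stopping at the first failed clone.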

I have created a public GitHub repository for the code and called it powercli_cloner.

Currently, it's fairly customized around the needs of the particular client it was implemented for, so there is much room for generalization and improvement. One area of improvement is immediately obvious: only delete a backup after successfully replacing it. Also, the script must be run as a Windows user with administrator vSphere privileges, as the script assumes pass-through authentication is in place. This is probably best for keeping credentials out of plain text. The script should be run during non-peak hours, especially if you have I/O intensive workloads.

Hopefully this tool can provide opportunities to develop backup and disaster recovery procedures that are flexible, cost-effective, and simple. I'd welcome pull requests and other suggestions for improvement.

Changing Passenger Nginx Timeouts

It may frighten you to know that there are applications which take longer than Passenger's default timeout of 10 minutes. Well, it's true. And yes, those application owners know they have bigger fish to fry. But when a customer needs that report run *today*, being able to lengthen a timeout is a welcome stopgap.

Tracing the timeout

There are many different layers at which a timeout can occur, although these may not be immediately obvious to your users. Typically they receive a 504 and an ugly "Gateway Time-out" message from Nginx. Review the Nginx error logs on both the reverse proxy and the application server; you might see a message like this:

upstream timed out (110: Connection timed out) while reading response header from upstream

If you're seeing this message on the reverse proxy, the solution is fairly straightforward: update the proxy_read_timeout setting in your nginx.conf and restart. However, it's more likely you've already tried that and found it ineffective. If you read further into the Nginx error, you might notice another clue:

upstream timed out (110: Connection timed out) while reading response header from upstream, 
upstream: "passenger://unix:/tmp/passenger.3940/master/helper_server.sock:"

This is the kind of error message you'd see on the Nginx application server when a Passenger process takes longer than the default timeout of 10 minutes. If you're seeing this message, it'd be wise to review the Rails logs to get a sense for how long this process actually takes to complete so you can make a sane adjustment to the timeout. Additionally, it's good to see what task is actually taking so long so you can offload the job into the background eventually.

Changing nginx-passenger module's timeout

If you're unable to address the slow Rails process problem and must extend the timeout, you'll need to modify the Passenger gem's Nginx configuration. Start by locating the Passenger gem's Nginx config with locate nginx/Configuration.c and edit the following lines:

ngx_conf_merge_msec_value(conf->upstream.read_timeout,
                              prev->upstream.read_timeout, 60000);

Replace the 60000 value with your desired timeout in milliseconds. Then run sudo passenger-install-nginx-module to recompile Nginx and restart.

Improving Error Pages

Another lesson worth addressing here is that Nginx error pages are ugly and unhelpful. Even if you have a Rails plugin like exception_notification installed, these kinds of Nginx errors will be missed unless you use the error_page directive. In other applications I've set up explicit routes to test that exception_notification properly sends an email, by creating a controller action that simply raises an error. Using Nginx's error_page directive, you can call an exception controller action and pass useful information along to yourself as well as present the user with a consistent error experience.

.rbenv and Passenger: Working through an Upgrade

Yesterday, I worked on upgrading the Piggybak demo application, which runs on Piggybak, an open source Ruby on Rails ecommerce plugin developed and maintained by End Point. The demo was running on Ruby 1.8.7 and Rails 3.1.3, but I wanted to update it to Ruby 1.9.* and Rails 3.2.6 to take advantage of improved performance in Ruby and the recent Rails security updates. I also wanted to update the Piggybak version, since there have been several recent bug fixes and commits.

One of the constraints with the upgrade was that I wanted to upgrade via .rbenv, because End Point has been happily using .rbenv recently. Below are the steps Richard and I went through for the upgrade, as well as a minor Passenger issue.

Step 1: .rbenv Installation

First, I followed the instructions here to install rbenv and Ruby 1.9.3 locally under the user that Piggybak runs under (let's call it the steph user). I set the local Ruby version to my local install. I also installed bundler using the local Ruby version.

Step 2: bundle update

Next, I blew away the existing bundle config for my application, as well as the installed bundler gem files for the application. I followed the standard steps to install and update the new gems with the local updated Ruby and updated Rails. Then I restarted the app.

Step 3: Fail

At this point, my application would not restart, and the backtrace complained of a Passenger issue and referenced Ruby 1.8. Richard and I investigated the errors and concluded that the application's Passenger configuration was still referencing the system Ruby install and the outdated Passenger installation.

Here's where I hit the catch-22: I needed root access to update the passenger.conf as well as to install Passenger against Ruby 1.9.3. This defeated the purpose of using .rbenv and working with a local Ruby install only.

Step 4: Local Passenger Installation

To install Passenger against the local Ruby version, I decided to install it as the steph user. First, I installed the gem:

gem install passenger

Then, I went to the local installed version of Passenger to run the installation:

cd /home/steph/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/passenger-3.0.13/bin
./passenger-install-apache2-module

Next, I copied the passenger installation output to the passenger.conf file:

   LoadModule passenger_module /home/steph/.rbenv/versions/1.9.3-p194/lib/ruby/gems/\
     1.9.1/gems/passenger-3.0.13/ext/apache2/mod_passenger.so
   PassengerRoot /home/steph/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/passenger-3.0.13
   PassengerRuby /home/steph/.rbenv/versions/1.9.3-p194/bin/ruby

With a server restart, the Piggybak demo was up and running on updated Ruby and Rails!

Conclusion

Retrospectively, I could have avoided the Passenger issue by installing Ruby 1.9.3 on the server as root, because there isn't much else on the server. But I like using .rbenv and it's possible that a Passenger upgrade won't be required with every Ruby update, so the new Passenger configuration is acceptable [to me, for now].

SELinux Local Policy Modules

If you don't want to use SELinux, fair enough. But I find many system administrators would like to use it but get flustered at the first problem it causes, and disable it. That's unfortunate, because often it's simple to customize SELinux policy by creating what's known as a local policy module. That way you allow the actions you need while retaining the added security SELinux brings to the system as a whole.

A few years ago my co-worker Adam Vollrath wrote an article on this same subject for Red Hat Enterprise Linux (RHEL) 5, and went into more detail on SELinux file contexts, booleans, etc. I recently went through the process of building an SELinux local policy module on a RHEL 6 mail server and found a few differences and want to document some of the details here. This applies to RHEL 5 and RHEL 6, and near relatives CentOS, Scientific Linux, et al.

When under pressure …

If you're tempted to disable SELinux, consider leaving it on, but in "permissive" mode. That will leave it running but stop it from blocking disallowed actions until you have time to deal with them properly. It's as simple as:

setenforce 0

That will last until you reboot, unless otherwise changed manually. To keep permissive mode even after a reboot, edit /etc/sysconfig/selinux and set:

SELINUX=permissive

To see what mode SELinux is in, you can do either of:

getenforce
# or
cat /selinux/enforce

Prerequisites

First make sure you have installed:

yum install policycoreutils
yum install policycoreutils-python   # also needed on RHEL 6

You must have SELinux enabled, though enforcing isn't required; permissive mode is fine. If it's not enabled, edit /etc/sysconfig/selinux for permissive mode and reboot.

You'll need an up-to-date file /var/lib/sepolgen/interface_info, which is created by /usr/sbin/sepolgen-ifgen for the specific machine you're running it on. That should be done automatically, but be aware of it in case it somehow got stale. If you run into any unexpected problems, make sure the timestamp on interface_info is recent, or just regenerate it, which is harmless.

Making the policy module

Choose a unique name for your local policy module. It's better to use something specific to your organization, or the hostname, rather than just "postfix" or "dovecot" or something similar which may conflict with existing vendor policy modules.

Run semodule -l to list the existing modules. For this example I'll use "epmail".

Create a directory for your new policy module:

mkdir -p /root/local-policy-modules/epmail
cd /root/local-policy-modules/epmail

Copy relevant error messages verbatim from /var/log/audit/audit.log to a new file. Here for example are two denials of a script called by Postfix as a transport agent, which needed to connect to PostgreSQL locally:

type=AVC msg=audit(1335581974.308:69047): avc:  denied  { write } for  pid=14649 comm=F9616121202873696E676C65206D65 name=".s.PGSQL.5432" dev=sda2 ino=79924 scontext=system_u:system_r:postfix_pipe_t:s0 tcontext=system_u:object_r:postgresql_tmp_t:s0 tclass=sock_file
type=AVC msg=audit(1335581974.308:69047): avc:  denied  { connectto } for  pid=14649 comm=F9616121202873696E676C65206D65 path="/tmp/.s.PGSQL.5432" scontext=system_u:system_r:postfix_pipe_t:s0 tcontext=system_u:system_r:postgresql_t:s0 tclass=unix_stream_socket

In the logs you want to look for "AVC", which stands for Access Vector Cache and is how SELinux logs denials. You can grab all the recent denials with:

grep ^type=AVC /var/log/audit/audit.log > epmail.log

and then filter it manually to contain just what you need.
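
When the log is long, a quick summary of who was denied what helps decide which lines to keep. The following is a minimal Python sketch, not part of the SELinux toolchain; it assumes the AVC line layout shown above (permissions in braces, followed later by scontext, tcontext and tclass) and takes the type as the third colon-separated field of each context.

```python
import re
from collections import Counter

# Matches the permission list and the three context fields of an AVC denial.
AVC_RE = re.compile(
    r"denied\s+\{ (?P<perms>[^}]+) \}.*?"
    r"scontext=(?P<scontext>\S+)\s+"
    r"tcontext=(?P<tcontext>\S+)\s+"
    r"tclass=(?P<tclass>\S+)"
)


def summarize_denials(lines):
    """Count denials per (source type, target type, class, permission)."""
    counts = Counter()
    for line in lines:
        if not line.startswith("type=AVC"):
            continue
        match = AVC_RE.search(line)
        if match is None:
            continue
        # Contexts look like user:role:type:level; the type is field 3.
        source_type = match.group("scontext").split(":")[2]
        target_type = match.group("tcontext").split(":")[2]
        for perm in match.group("perms").split():
            counts[(source_type, target_type, match.group("tclass"), perm)] += 1
    return counts
```

Feeding it the lines of epmail.log gives one counter entry per distinct denial, which makes unrelated denials easy to spot before they end up in the policy module.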

You can see a usually more informative explanation of each error by piping it into audit2why:

audit2why < epmail.log

Now you're ready to create your policy module:

audit2allow -m epmail < epmail.log > epmail.te
checkmodule -M -m -o epmail.mod epmail.te
semodule_package -o epmail.pp -m epmail.mod
semodule -i epmail.pp

That's a somewhat longwinded way to do things, but that's how I learned it from my co-worker Kiel, and it's easy once put into a script. See the man page of each program for more details on what that step is doing, and various options.

A more streamlined way that has audit2allow performing the functions of checkmodule and semodule_package is:

audit2allow -M epmail -R -i epmail.log
semodule -i epmail.pp

Wrap-up

You will of course need to keep an eye on the audit log to look for any more AVC denials, as you exercise all the functions of the system. For a production system it may be best to leave SELinux permissive for a few weeks, and once you're confident you've allowed all the actions needed, you can switch it to enforcing mode.

Finally, I have not normally had to do this, but if you need to force reload the SELinux policy on the server, you can do it with:

semodule -R

Have fun with the extra security SELinux offers!

Easy Creating Ramdisk on Ubuntu

Hard drives are extremely slow compared to RAM. Sometimes it is useful to use a small amount of RAM as a drive.

However, there are some drawbacks to this solution. All the files will be gone when you reboot your computer, so in fact it is suitable only for storing some temporary files - those which are generated during some process and are not useful later.

I will mount the ramdisk in my local directory. I use Ubuntu 11.10, my user name is 'szymon', and my home directory is '/home/szymon'.

I create the directory for mounting the ramdisk in my home dir:

mkdir /home/szymon/ramdisk

When creating the ramdisk, I have a couple of possibilities:

  • ramdisk - there are sixteen standard block devices at /dev/ram* (from /dev/ram0 to /dev/ram15) which can be used for storing data in RAM. I can format one with any filesystem I want, but usually this is too much complication.
  • ramfs - a virtual filesystem stored in RAM. It can grow dynamically, and in fact it can use all available RAM, which could be dangerous.
  • tmpfs - another virtual filesystem stored in RAM, but because it has a fixed size, it cannot grow like ramfs.

I want to have a ramdisk that won't be able to use all of my RAM, and I want to keep it as simple as possible; therefore, I will use tmpfs.

The following command will mount a simple ramdisk in my new local directory.

$ sudo mount -t tmpfs -o size=512M,mode=777 tmpfs /home/szymon/ramdisk

I can even unmount it with:

$ sudo umount /home/szymon/ramdisk

Let's check if the ramdisk is really there. I can do so in a couple of ways.

I can use df -h to check the size of the mounted device:

$ df -h | grep szymon
tmpfs                 512M     0  512M   0% /home/szymon/ramdisk

I can also use mount to report on the mounted devices:

$ mount | grep ramdisk
tmpfs on /home/szymon/ramdisk type tmpfs (rw,size=512M,mode=777)
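
If a script depends on the ramdisk, the same check can be automated. This is a minimal Python sketch that parses text in the format printed by mount above; the function name find_mount is made up for the example, and it assumes the mount point contains no spaces.

```python
def find_mount(mount_output, mount_point):
    """Return (fstype, options) for mount_point, or None if it is not mounted.

    Expects lines shaped like the output of mount shown above:
    "<device> on <mount point> type <fstype> (<options>)".
    """
    for line in mount_output.splitlines():
        parts = line.split()
        if (
            len(parts) >= 6
            and parts[1] == "on"
            and parts[2] == mount_point
            and parts[3] == "type"
        ):
            # Strip the surrounding parentheses and split the option list.
            return parts[4], parts[5].strip("()").split(",")
    return None
```

In practice the text would come from running mount (for example through subprocess), and the script could refuse to start unless the result reports a tmpfs filesystem.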

There is one more thing to do - make the ramdisk load automatically at machine start. This can be done by adding the following line into /etc/fstab:

tmpfs    /home/szymon/ramdisk    tmpfs    rw,size=512M,mode=777 0    0

Two things to be aware of:

  • Data stored in ramdisk will be removed at the machine restart. Creating a script for saving the files to hard drive at machine shutdown won't persist the data during a machine crash or reset.
  • The computer won't be able to boot normally if the entry in the fstab file has any errors in it. If you do make such an error, you can always boot the computer in recovery mode, so you have the root console and can fix the fstab file.

Check JSON responses with Nagios

As the developer's love affair with JSON continues to grow, the need to monitor successful JSON output does as well. I wanted a Nagios plugin which would do a few things:

  • Confirm the content-type of the response header was "application/json"
  • Decode the response to verify it is parsable JSON
  • Optionally, verify the JSON response against a data file

Verify content of JSON response

For the most part, Perl's LWP::UserAgent class makes short work of the first requirement. Using $response->header("content-type") the plugin is able to check the content-type easily. Next up, we use the JSON module's decode function to see if we can successfully decode $response->content.

Optionally, we can give the plugin an absolute path to a file which contains a Perl hash, which is iterated through in an attempt to find corresponding key/value pairs in the decoded JSON response. For each key/value pair in the hash that it doesn't find in the JSON response, it will append the expected and actual results to the output string, exiting with a critical status. Currently there's no way to check that a key/value pair does not appear in the response, but feel free to make a pull request on check_json on my GitHub page.
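
The three checks are simple enough to sketch. The plugin itself is written in Perl; the following Python version is only an illustration of the logic, not a port of check_json. The function name, message wording and the restriction to top-level keys are assumptions of this sketch; the exit codes 0 and 2 are the usual Nagios OK and CRITICAL values.

```python
import json

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_json(content_type, body, expected=None):
    """Return (status, message) for a response's content type and body.

    expected is an optional dict of top-level key/value pairs that must
    appear in the decoded JSON (an assumption of this sketch).
    """
    if not content_type.lower().startswith("application/json"):
        return CRITICAL, f"CRITICAL: content-type is {content_type!r}"
    try:
        data = json.loads(body)
    except ValueError as exc:
        return CRITICAL, f"CRITICAL: response is not valid JSON ({exc})"
    problems = []
    for key, want in (expected or {}).items():
        got = data.get(key) if isinstance(data, dict) else None
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    return OK, "OK: valid JSON response"
```

A wrapper would print the message and exit with the status, which is all Nagios needs from a plugin.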

Check HTTP redirects with Nagios

Oftentimes there are critical page redirects on a site that you may want to monitor. It can be as simple as making sure your checkout page is redirecting from HTTP to HTTPS. Or perhaps you have valuable old URLs which Google has been indexing and you want to make sure these redirects remain in place for your PageRank. Whatever your reason for checking HTTP redirects with Nagios, you'll find there are a few scripts available, but none (that I found) which are able to follow more than one redirect. For example, let's suppose we have a redirect chain that looks like this:

http://myshop.com/cart >> http://www.myshop.com/cart >> https://www.mycart.com/cart

Following multiple redirects

In my travels, I found check_http_redirect on Nagios Exchange. It was a well designed plugin, written by Eugene Kovalenja in 2009 and licensed under GPLv2. After experimenting with the plugin, I found it was unable to traverse multiple redirects. Fortunately, Perl's LWP::UserAgent class provides a nifty little option called max_redirect. By revising Eugene's work, I've exposed additional command arguments that help control how many redirects to follow. Here's a summary of usage:

-U          URL to retrieve (http or https)
-R          URL that must be equal to Header Location Redirect URL
-t          Timeout in seconds to wait for the URL to load. If the page fails to load,
            check_http_redirect will exit with UNKNOWN state (default 60)
-c          Depth of redirects to follow (default 10)
-v          Print redirect chain

If check_http_redirect is unable to find any redirects to follow, or any of the redirects results in a 4xx or 5xx status code, the plugin will report a critical state code and the nature of the problem. Additionally, if the number of redirects exceeds the depth of redirects to follow as specified in the command arguments, it will notify you of this and exit with an unknown state code. An OK status will be returned only if the redirects result in a successful response from a URL which is a regex match against the value given in the -R argument.
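
The state rules above can be sketched as follows. This is a Python illustration of the decision logic only, not the Perl plugin: fetch is a hypothetical callable returning (status code, Location header or None) for a URL, and treating a final URL that fails the regex as critical is an assumption of this sketch.

```python
import re
from urllib.parse import urljoin

OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios plugin exit codes


def check_redirects(url, expected_pattern, fetch, max_depth=10):
    """Follow redirects from url and return (state, chain of URLs visited).

    fetch(url) is a hypothetical callable returning
    (status_code, location_header_or_None).
    """
    chain = [url]
    for _ in range(max_depth + 1):
        status, location = fetch(chain[-1])
        if status >= 400:
            return CRITICAL, chain  # a 4xx or 5xx somewhere in the chain
        if 300 <= status < 400 and location:
            # Location may be relative, so resolve it against the current URL.
            chain.append(urljoin(chain[-1], location))
            continue
        if len(chain) == 1:
            return CRITICAL, chain  # no redirect was found at all
        if re.search(expected_pattern, chain[-1]):
            return OK, chain
        return CRITICAL, chain  # assumed: wrong destination is critical
    return UNKNOWN, chain  # more redirects than max_depth allows
```

Returning the chain alongside the state makes it easy to print the full path of redirects, as the -v option does.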

The updated check_http_redirect plugin is available on my GitHub page along with several other Nagios plugins I'll write about in the coming weeks. Pull requests welcome, and thank you to Eugene for his original work on this plugin.

PHP Vulnerabilities and Logging

I've recently been working on a Ruby on Rails site on my personal Linode machine. The Rails application was running in development with virtually no caching or optimization, so page load was very slow. While I was not actively developing on the site, I received a Linode alert that the disk I/O rate exceeded the notification threshold for the last 2 hours.

Since I was not working on the site and I did not expect to see search traffic to the site, I was not sure what caused the alert. I logged on to the server and checked the Rails development log to see the following:

Started GET "/muieblackcat" for 200.195.156.242 at 2012-02-15 10:01:18 -0500
Started GET "/admin/index.php" for 200.195.156.242 at 2012-02-15 10:01:21 -0500
Started GET "/admin/pma/index.php" for 200.195.156.242 at 2012-02-15 10:01:22 -0500
Started GET "/admin/phpmyadmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:24 -0500
Started GET "/db/index.php" for 200.195.156.242 at 2012-02-15 10:01:25 -0500
Started GET "/dbadmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:27 -0500
Started GET "/myadmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:28 -0500
Started GET "/mysql/index.php" for 200.195.156.242 at 2012-02-15 10:01:30 -0500
Started GET "/mysqladmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:32 -0500
Started GET "/typo3/phpmyadmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:33 -0500
Started GET "/phpadmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:35 -0500
Started GET "/phpMyAdmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:36 -0500
Started GET "/phpmyadmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:38 -0500
Started GET "/phpmyadmin1/index.php" for 200.195.156.242 at 2012-02-15 10:01:39 -0500
Started GET "/phpmyadmin2/index.php" for 200.195.156.242 at 2012-02-15 10:01:41 -0500
Started GET "/pma/index.php" for 200.195.156.242 at 2012-02-15 10:01:42 -0500
Started GET "/web/phpMyAdmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:44 -0500
Started GET "/xampp/phpmyadmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:46 -0500
Started GET "/web/index.php" for 200.195.156.242 at 2012-02-15 10:01:48 -0500
Started GET "/php-my-admin/index.php" for 200.195.156.242 at 2012-02-15 10:01:50 -0500
Started GET "/websql/index.php" for 200.195.156.242 at 2012-02-15 10:01:52 -0500
Started GET "/phpmyadmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:53 -0500
Started GET "/phpMyAdmin/index.php" for 200.195.156.242 at 2012-02-15 10:01:55 -0500
Started GET "/phpMyAdmin-2/index.php" for 200.195.156.242 at 2012-02-15 10:01:57 -0500
Started GET "/php-my-admin/index.php" for 200.195.156.242 at 2012-02-15 10:01:59 -0500
Started GET "/phpMyAdmin-2.2.3/index.php" for 200.195.156.242 at 2012-02-15 10:02:00 -0500
Started GET "/phpMyAdmin-2.2.6/index.php" for 200.195.156.242 at 2012-02-15 10:02:02 -0500
Started GET "/phpMyAdmin-2.5.1/index.php" for 200.195.156.242 at 2012-02-15 10:02:04 -0500
Started GET "/phpMyAdmin-2.5.4/index.php" for 200.195.156.242 at 2012-02-15 10:02:07 -0500
Started GET "/phpMyAdmin-2.5.5-rc1/index.php" for 200.195.156.242 at 2012-02-15 10:02:09 -0500
Started GET "/phpMyAdmin-2.5.5-rc2/index.php" for 200.195.156.242 at 2012-02-15 10:02:10 -0500
Started GET "/phpMyAdmin-2.5.5/index.php" for 200.195.156.242 at 2012-02-15 10:02:12 -0500
Started GET "/phpMyAdmin-2.5.5-pl1/index.php" for 200.195.156.242 at 2012-02-15 10:02:14 -0500
Started GET "/phpMyAdmin-2.5.6-rc1/index.php" for 200.195.156.242 at 2012-02-15 10:02:16 -0500
Started GET "/phpMyAdmin-2.5.6-rc2/index.php" for 200.195.156.242 at 2012-02-15 10:02:17 -0500
Started GET "/phpMyAdmin-2.5.6/index.php" for 200.195.156.242 at 2012-02-15 10:02:19 -0500
Started GET "/phpMyAdmin-2.5.7/index.php" for 200.195.156.242 at 2012-02-15 10:02:21 -0500
Started GET "/phpMyAdmin-2.5.7-pl1/index.php" for 200.195.156.242 at 2012-02-15 10:02:23 -0500
Started GET "/phpMyAdmin-2.5.5-pl1/index.php" for 174.111.11.143 at 2012-02-15 14:09:10 -0500

As it turns out, the domain somehow got picked up by crawlers that were looking for PHP vulnerabilities. It's interesting to see the various versions of phpMyAdmin the crawler is attempting to exploit. Judging from the crawled pages, there may also be a few other applications (e.g. TYPO3) that the crawler was trying to exploit. I'm not up to date on the various security exploits in PHP applications, but I was surprised to not see anything directly related to WordPress in the log, since I often hear of WordPress security issues.
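
To see at a glance which addresses are responsible for a burst like this, the requests can be counted per client. This is a minimal Python sketch that assumes the Rails "Started ..." line format shown in the log above; the function name is made up for the example.

```python
import re
from collections import Counter

# Matches lines like: Started GET "/path" for 1.2.3.4 at 2012-02-15 ...
STARTED_RE = re.compile(r'^Started (\w+) "([^"]*)" for (\S+) at ')


def requests_per_ip(log_lines):
    """Count request-start lines per client address."""
    counts = Counter()
    for line in log_lines:
        match = STARTED_RE.match(line)
        if match:
            counts[match.group(3)] += 1
    return counts
```

Calling most_common() on the result lists the busiest clients first.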

Luckily, this particular application and all other applications on this server have virtually no private data, since most applications running on the server are CMS-type applications where all content is displayed on the front-end.

IPv6 Tunnels with Debian/Ubuntu behind NAT

As part of End Point's preparation for World IPv6 Launch Day, I was asked to get my IPv6 certification from Hurricane Electric. It's a fun little game-based learning program which had me set up an IPv6 tunnel. IPv6 tunnels are used to provide IPv6 for those whose ISP or hosting provider doesn't currently support IPv6, by "tunneling" it over IPv4. The process for creating a tunnel is straightforward enough, but there were a few configuration steps I felt could be better explained.

After creating a tunnel, Hurricane Electric kindly provides a summary of your configuration and offers example configurations for several different operating systems and routers. Below is my configuration summary and the example generated by Hurricane Electric.

However, changes made by entering these commands won't survive a restart. For Debian/Ubuntu users, an update to /etc/network/interfaces does the trick.

#/etc/network/interfaces
auto he-ipv6
iface he-ipv6 inet6 v4tunnel
  address 2001:470:4:9ae::2
  netmask 64
  endpoint 209.51.161.58
  local 204.8.67.188
  ttl 225 
  gateway 2001:470:4:9ae::1
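
A typo in the address or gateway is easy to make and awkward to debug once the interface is down. As a sanity check before restarting networking, this small Python sketch (not part of Hurricane Electric's instructions; the function name is made up) confirms that the gateway falls inside the tunnel's prefix, using the standard ipaddress module.

```python
import ipaddress


def gateway_in_prefix(address, gateway, prefix_len):
    """Return True if gateway lies in the network of address/prefix_len."""
    network = ipaddress.ip_interface(f"{address}/{prefix_len}").network
    return ipaddress.ip_address(gateway) in network
```

With the values from the configuration above, gateway_in_prefix("2001:470:4:9ae::2", "2001:470:4:9ae::1", 64) returns True.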

Firewall Configuration

If you're running UFW the updates to /etc/default/ufw are very straightforward. Simply change the IPV6 directive to yes. Restart the firewall and your network interfaces and you should be able to ping6 ipv6.google.com. I also recommend hitting http://test-ipv6.com/ for a detailed configuration test.

Behind NAT

If you're behind a NAT, the configuration needs to be tweaked a bit. First, you'll want to set up a static IP address behind your router. If your router supports forwarding more than just TCP/UDP, you'll want to forward protocol 41 (NOT PORT 41), which is responsible for IPv6 tunneling over IPv4, to your static address. If you've got a consumer grade router that doesn't support this, you'll just have to put your machine in the DMZ, thus putting your computer "in front" of your router's firewall. Please make sure you are running a local software firewall if you choose this option.

After handling the routing of protocol 41, there is one small configuration change to /etc/network/interfaces. You must change your tunnel's local address from your public IP address, to your private NATed address. Here is an example configuration including both the static IP configuration and the updated tunnel configuration.

#/etc/network/interfaces
auto eth0
iface eth0 inet static
  address 192.168.0.50
  netmask 255.255.255.0
  gateway 192.168.0.1 

auto he-ipv6
iface he-ipv6 inet6 v4tunnel
  address 2001:470:4:9ae::2
  netmask 64
  endpoint 209.51.161.58
  local 192.168.0.50
  ttl 225 
  gateway 2001:470:4:9ae::1

Don't forget to restart your networking interfaces after these changes. I found a good ol' restart was helpful as well, but of course, we don't have this luxury in production, so be careful!

Checking IPv6

If you're reading this article, you're probably responsible for several hosts. For a gentle reminder of which of your sites you've not yet set up for IPv6, I recommend checking out IPvFoo for Chrome or 4or6 for Firefox. These tools make it easy for you to see which of your sites are ready for World IPv6 Launch Day!

Getting Help

Hurricane Electric provides really great support for their IPv6 tunnel services (which are completely free). Simply email ipv6@he.net and provide them with some useful information such as:

cat /etc/network/interfaces
netstat -nrA inet6  (these are your IPv6 routing tables)
cat /etc/default/ufw
relevant router configurations

I was very impressed to get a response from a competent person in 15 minutes! Sadly, there is one downside to using this tunnel: IRC is not allowed.

"Due to an increase in IRC abuse, new non-BGP tunnels now have IRC blocked by default. If you are a Sage, you can re-enable IRC by visiting the tunnel details page for that specific tunnel and selecting the 'Unblock IRC' option. Existing tunnels have not been filtered."

I guess ya gotta earn it to use IRC over your tunnel. Good luck!

World IPv6 Launch: 6 June 2012

For any of our readers who don’t know: The classic Internet Protocol (IPv4) supported around 4 billion IP addresses, but it recently ran out of free addresses. That the addresses would eventually all be used was no surprise. For more than a decade, a replacement called IPv6 has been under development and allows practically unlimited addresses.

Last year there was a one-day trial run called World IPv6 Day. Our own Josh Williams wrote about it here. It was the first major attempt for mainstream websites to enable dual-stack IPv4 and IPv6 networking, so that both the “old” and “new” Internet could access the same site. It was intended to bring to the surface any problems, and it went very well – most people never knew it was happening, which was the goal.

This year there’s a much bigger event planned: World IPv6 Launch, and this time IPv6 is meant to stay on. Google, Facebook, Yahoo!, Bing, and many other major sites are participating. A big advance over last year’s event is that many ISPs and vendors of home networking gear are also participating. That means it won’t just be a test that classic IPv4 still works for people alongside IPv6, but that for some customers, native IPv6 starts working end to end.

Last year, Josh mused that it would have been most appropriate for the day to be June 6, 6-6, for IPv6 and sixes everywhere. I suspect that wasn’t done because it fell on a Monday in 2011, and there are enough new support complaints backlogged from the weekend without adding IPv6 to the mix! But this year it is on June 6.

We got our own www.endpoint.com website running on IPv6 in time for last year’s World IPv6 Day. A few months ago, we switched on IPv6 for bucardo.org and a few new customer sites as well. For this year’s event we plan to prepare most of our remaining internal infrastructure to be dual-stack and to be ready to enable IPv6 for any customer sites upon request.

What can you do to join in and help free the Internet of its address shortage? Try visiting test-ipv6.com to see how compatible your own current Internet access is. If your ISP doesn’t yet offer IPv6, ask them when they will. Demand drives support. And visit Hurricane Electric’s IPv6 Tunnel Broker to set up a free tunnel from your location to the IPv6 Internet. It’s a very nice service.

Have fun, and if all goes well, the beginning of widespread adoption of IPv6 starts on June 6!

DevCamps setup with Ruby 1.9.3, rbenv, Nginx and Unicorn

I was working with Steph Skardal on the setup of a new DevCamps installation that was going to need to use Ruby 1.9.3, Rails 3, Unicorn and Nginx. This setup was going to be much different than a standard setup due to the different application stack that was required.

The first trick was going to be getting Ruby 1.9.3 on the server. We were using Debian Squeeze, but that still only comes with Ruby 1.9.1. We wanted Ruby 1.9.3 for the increased overall speed and the significant speed increase with Rails 3. We decided on using rbenv for this task. It's a utility that is very easy to set up and allows you to maintain multiple versions of Ruby in your system user account without the headache of adjusting anything but the PATH environment variable. It takes advantage of another easy-to-set-up utility called ruby-build to handle the actual installation of the Ruby source code.

A quick and easy version for setting up a user with this is as follows:

Ensure you are in the home directory. Then, clone the repository into a .rbenv directory:

git clone git://github.com/sstephenson/rbenv.git .rbenv

Adjust your user's path to find the newly installed commands:

echo 'export PATH=$HOME/.rbenv/shims:$HOME/.rbenv/bin:$PATH' >> ~/.bash_profile

Install Ruby version 1.9.3-p0:

rbenv install 1.9.3-p0

Make Ruby version 1.9.3-p0 your default version every time you log in:

rbenv global 1.9.3-p0

Install the bundler gem for Ruby version 1.9.3-p0:

gem install bundler

Refresh rbenv to let it know the new system command bundler exists:

rbenv rehash

Now you are ready to use the bundler gem to install any other gems required for the application.

The normal camps setup assumes you are going to be using Apache for the web server. In this case, we wanted to use Nginx due to memory constraints. We decided to use the proxy capability and just proxy through to Unicorn instead of having to build our own version of Nginx to use Passenger. To do this, we had to use a feature in the local-config file in camps that allows you to skip the Apache setup and use your own commands to start, stop and restart your web server and application. Here is the example from our local-config that controls Nginx and Unicorn. This approach could also be used with Interchange or any other application if you need other services started when mkcamp is run.

skip_apache:1
httpd_start:/usr/sbin/nginx -c __CAMP_PATH__/nginx/nginx.conf
httpd_stop:pid=`cat __CAMP_PATH__/var/run/nginx.pid 2>/dev/null` && kill $pid
httpd_restart:pid=`cat __CAMP_PATH__/var/run/nginx.pid 2>/dev/null` && kill -HUP $pid || /usr/sbin/nginx -c __CAMP_PATH__/nginx/nginx.conf
app_start:__CAMP_PATH__/bin/start-app
app_stop:pid=`cat __CAMP_PATH__/var/run/unicorn.pid 2>/dev/null` && kill $pid
app_restart:pid=`cat __CAMP_PATH__/var/run/unicorn.pid 2>/dev/null` && kill $pid ; sleep 5 ;  __CAMP_PATH__/bin/start-app
The start-app script simply contains:
cd __CAMP_PATH__ && bundle exec unicorn_rails -c __CAMP_PATH__/config/unicorn.conf.rb -D

You could create one script that handles all aspects of start, stop and restart if you wanted. This setup really wasn't much harder than a normal Ruby on Rails setup. The added time required to set up rbenv per camp user is offset by the fact that users can manage and try multiple versions of Ruby.
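As a rough sketch of that single-script idea, here is what a combined control function might look like. The name app_ctl and the CAMP_PATH variable are my own placeholders (camps substitutes __CAMP_PATH__ in local-config; in a standalone script you would have to supply the path yourself), and the stop and restart logic simply mirrors the local-config lines above.

```shell
#!/bin/bash
# Hypothetical bin/app-ctl: one script for start, stop and restart.
# CAMP_PATH is an assumed variable; it defaults to the current directory.
CAMP_PATH=${CAMP_PATH:-$PWD}

app_ctl() {
    local pid_file="$CAMP_PATH/var/run/unicorn.pid"
    local pid
    case "$1" in
        start)
            # Run in a subshell so the caller's working directory is untouched
            ( cd "$CAMP_PATH" && bundle exec unicorn_rails -c "$CAMP_PATH/config/unicorn.conf.rb" -D )
            ;;
        stop)
            # Same logic as app_stop above: signal the PID that Unicorn recorded
            pid=$(cat "$pid_file" 2>/dev/null) && kill "$pid"
            ;;
        restart)
            app_ctl stop
            sleep 5
            app_ctl start
            ;;
        *)
            echo "Usage: app_ctl {start|stop|restart}" >&2
            return 1
            ;;
    esac
}

# When run as a script, dispatch on the first argument, e.g. "app-ctl restart"
if [ $# -gt 0 ]; then
    app_ctl "$1"
fi
```

With something like this in place, the app_start, app_stop and app_restart lines in local-config could each call the same script with a different argument.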

Our SoftLayer API tools

We do a lot of our hosting at SoftLayer, which seems to be one of the hosts with the most servers in the world -- they claim to have over 100,000 servers as of last month. More important for us than sheer size are many other fine attributes that SoftLayer has, in no particular order:

  • a strong track record of reliability
  • responsive support
  • datacenters around the U.S. and some in Europe and Asia
  • solid power backup
  • well-connected redundant networks with multiple 10 Gbps uplinks
  • gigabit Ethernet pipes all the way to the Internet
  • first-class IPv6 support
  • an internal private network with no data transfer charge
  • Red Hat Enterprise Linux offered at no extra charge
  • diverse dedicated server offerings at many price & performance points
  • some disk partitioning options (though more flexibility here would be nice, especially with LVM for the /boot and / filesystems)
  • fully automated provisioning, without salesman & quote hassles for standard offerings
  • 3000 GB data transfer per month included standard with most servers
  • month-to-month contracts
  • reasonable prices (though we can of course always use lower prices, we'll take quality over cheapness for most of our hosting needs!)
  • no arbitrary port blocks (some other providers rate-limit incoming TCP connections on port 22 to slow down ssh dictionary attacks, while others forbid IRC, etc.)
  • a web service API for monitoring and controlling many aspects of our account via REST/JSON or SOAP

(No, they're not paying me for writing this! But they really have nice offerings.)

It is this last item, the SoftLayer API, that I want to elaborate on here.

The SoftLayer Development Network features API information and documentation, and once you have an API account set up in the management website (quick and easy to do), you can start automating all sorts of tasks, from provisioning new hosts to monitoring your upcoming invoice or other accounting information, and much more.

I've released as open source two scripts we use: One is for managing secondary DNS domains in SoftLayer's DNS servers, from a primary name server running BIND 9. The other is a Nagios check script for monitoring monthly data transfer used and alerting when over a set threshold or over the monthly allotment.

See the GitHub repository endpoint-softlayer-api if they would be useful to you, or as a starting point for interfacing with other SoftLayer APIs.

MySQL replication monitoring on Ubuntu 10.04 with Nagios and NRPE

If you're using MySQL replication, then you're probably counting on it for some fairly important need. Monitoring via something like Nagios is generally considered a best practice. This article assumes you've already got your Nagios server set up and your intention is to add an Ubuntu 10.04 NRPE client. This article also assumes the Ubuntu 10.04 NRPE client is your MySQL replication master, not the slave. The OS of the slave does not matter.

Getting the Nagios NRPE client set up on Ubuntu 10.04

At first it wasn't clear which packages would be appropriate to install. I was initially misled by the naming of the nrpe package, but I found the correct packages to be:

sudo apt-get install nagios-nrpe-server nagios-plugins

The NRPE configuration is stored in /etc/nagios/nrpe.cfg, while the plugins are installed in /usr/lib/nagios/plugins/ (or lib64). The installation of this package will also create a user nagios which does not have login permissions. After the packages are installed the first step is to make sure that /etc/nagios/nrpe.cfg has some basic configuration.

Make sure you note the server port (defaults to 5666) and open it on any firewalls you have running. (I got hung up because I forgot I have both a software and hardware firewall running!) Also make sure the server_address directive is commented out; you wouldn't want to only listen locally in this situation. I recommend limiting incoming hosts by using your firewall of choice.

Choosing what NRPE commands you want to support

Further down in the configuration, you'll see lines like command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10. These are the commands you plan to offer the Nagios server to monitor. Review the contents of /usr/lib/nagios/plugins/ to see what's available and feel free to add what you feel is appropriate. Well-designed plugins should give you a usage message if you execute them from the command line. Otherwise, you may need to open your favorite editor and dig in!

After verifying you've got your NRPE configuration completed and made sure to open the appropriate ports on your firewall(s), let's restart the NRPE service:

service nagios-nrpe-server restart

This would also be an appropriate time to confirm that the nagios-nrpe-server service is configured to start on boot. I prefer the chkconfig package to help with this task, so if you don't already have it installed:

sudo apt-get install chkconfig
chkconfig | grep nrpe

# You should see...
nagios-nrpe-server     on

# If you don't...
chkconfig nagios-nrpe-server on

Pre flight check - running check_nrpe

Before going any further, log into your Nagios server and run check_nrpe and make sure you can execute at least one of the commands you chose to support in nrpe.cfg. This way, if there are any issues, it is obvious now, while we've not started modifying your Nagios server configuration. The location of your check_nrpe binary may vary, but the syntax is the same:

check_nrpe -H host_of_new_nrpe_client -c command_name

If your command outputs something useful and expected, you're on the right track. A common error you might see: Connection refused by host. Here's a quick checklist:

  • Did you start the nagios-nrpe-server service?
  • Run netstat -lunt on the NRPE client to make sure the service is listening on the right address and ports.
  • Did you open the appropriate ports on all your firewall(s)?
  • Is there NAT that needs configuration?

Adding the check_mysql_replication plugin

There is a lot of noise out there on Google for Nagios plugins which offer MySQL replication monitoring. I wrote the following one using ideas pulled from several existing plugins. It is designed to run on the MySQL master server, check the master's log position and then compare it to the slave's log position. If there is a difference in position, the alert is considered critical. Additionally, it checks the slave's reported status, and if it is not "Waiting for master to send event", the alert is also considered critical. You can find the source for the plugin at my GitHub account under the project check_mysql_replication. Pull that source down into your plugins directory (/usr/lib/nagios/plugins/ (or lib64)) and make sure the permissions match the other plugins.
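To make the decision logic concrete, here is a minimal sketch of the comparison described above. It is not the plugin's actual source: the function name evaluate_replication and its arguments are made up for illustration, and it assumes the positions and slave state have already been fetched (for example from the Position field of SHOW MASTER STATUS and the Read_Master_Log_Pos and Slave_IO_State fields of SHOW SLAVE STATUS). The return values follow the usual Nagios plugin convention of 0 for OK, 2 for CRITICAL and 3 for UNKNOWN.

```shell
#!/bin/bash
# Hypothetical helper, not the real check_mysql_replication code.
# Usage: evaluate_replication MASTER_POS SLAVE_POS SLAVE_IO_STATE
evaluate_replication() {
    local master_pos=$1 slave_pos=$2 slave_state=$3

    if [ -z "$master_pos" ] || [ -z "$slave_pos" ]; then
        echo "UNKNOWN - could not read log positions"
        return 3
    fi
    # Any slave state other than idle waiting is treated as critical
    if [ "$slave_state" != "Waiting for master to send event" ]; then
        echo "CRITICAL - slave state is '$slave_state'"
        return 2
    fi
    # Any difference between the two positions is treated as critical
    if [ "$master_pos" != "$slave_pos" ]; then
        echo "CRITICAL - master at $master_pos, slave at $slave_pos"
        return 2
    fi
    echo "OK - master and slave both at $master_pos"
    return 0
}
```

Treating any difference in position as critical is strict; on a busy master you may prefer to tolerate a small lag before alerting.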

With the plugin now in place, add a command to your nrpe.cfg.

command[check_mysql_replication]=sudo /usr/lib/nagios/plugins/check_mysql_replication.sh -H 

At this point you may be saying, WAIT! How will the user running this command (nagios) have login credentials to the MySQL server? Thankfully we can create a home directory for that nagios user, and add a .my.cnf configuration with the appropriate credentials.

usermod -d /home/nagios nagios #set home directory
mkdir /home/nagios
chmod 755 /home/nagios
chown nagios:nagios /home/nagios

# create /home/nagios/.my.cnf with your preferred editor with the following:
[client]
user=example_replication_username
password=replication_password

chmod 600 /home/nagios/.my.cnf
chown nagios:nagios /home/nagios/.my.cnf

This would again be an appropriate place to run a pre-flight check, running check_nrpe from your Nagios server to make sure this configuration works as expected. But first we need to add this command to the sudoers file.

nagios ALL= NOPASSWD: /usr/lib/nagios/plugins/check_mysql_replication.sh

Wrapping Up

At this point, you should run another check_nrpe command from your server and see the replication monitoring report. If not, go back and check these steps carefully. There are lots of gotchas, and permissions and file ownership are easily overlooked. With this in place, just add the NRPE client using the existing templates you have for your Nagios servers and make sure the monitoring is reporting as expected.

Some great press for College District

College District has been getting some positive press lately, the most recent being a Forbes article which talks about the success they have been seeing in the last few years.

College District is a company that sells collegiate merchandise to fans. They got their start focusing on the LSU Tigers at TigerDistrict.com and have branched out to teams such as the Oregon Ducks and Alabama Roll Tide.

We've been working with Jared Loftus @ College District for more than four and a half years. College District is running on a heavily modified Interchange system with some cool Postgres tricks. The system can support a nearly unlimited number of sites, running on 2 catalogs (1 for the admin, 1 for the front end) and 1 database. The key to the system is different schemas, fronted by views, that hide and expose records based on the database user that is connected. The great thing about this system is that Jared can choose to launch a new store within a day and be ready for sales, something he has taken advantage of in the past when a team is on fire and he sees an opportunity he can't pass up.

We are currently preparing for a re-launch of the College District site that will focus on crowd-sourced designs. Artists and fans will submit their designs and have them voted on. Some will be chosen to be sold, and the folks whose designs are chosen will get paid for their efforts. The goal here is to grow a community that guides what College District and the individual school sites ultimately sell.

With College District's quick growth we've also been helping them improve their order fulfillment process. This includes streamlining how orders are picked, packed and shipped. The introduction of bar code scanners will help with the accuracy and speed of the process.

We get a kick out of seeing our clients succeed, especially those that come to us with a clear vision and a good attitude, and then put the hard work in to make it happen. It's an exciting year ahead for College District and we'll be right there supporting them on the journey.

Automating removal of SSH key patterns

Every now and again, it becomes necessary to remove a user's SSH key from a system. At End Point, we'll often allow multiple developers into multiple user accounts, so cleaning up these keys can be cumbersome. I decided to write a shell script to brush up on those skills, make sure I completed my task comprehensively, and automate future work.

Initial Design and Dependencies

My plan for this script was to accept a single argument, which would be used to search the system's authorized_keys files. If the pattern were found, the script would offer you the opportunity to delete the line of the file on which it was found.

I've always found mlocate to be very helpful; it makes finding files extremely fast and its usage is trivial. For this script, we'll use the output from locate to find all authorized_keys files in the system. Of course, we'll want to make sure that the mlocate.db has recently been updated. So let's show the user when the database was last updated and offer them a chance to update it.

mlocate_path="/var/lib/mlocate/mlocate.db"
if [ -r $mlocate_path ]
then
    echo -n "mlocate database last updated: "
    stat -c %y $mlocate_path
    echo -n "Do you want to update the locate database this script depends on? [y/n]: "
    read update_locate
    if [ "$update_locate" = "y" ]
    then
        echo "Updating locate database.  This may take a few minutes..."
        updatedb
        echo "Update complete."
    fi  
else
    echo "Cannot read the mlocate db path: $mlocate_path"
    exit 2
fi

First we define the path where we can find the mlocate database. Then we check to see if we can read that file. If we can't read the file, we let the user know and exit. If we can read the file, we print the date and time it was last modified and offer the user a chance to update the database. While this is functional, it's pretty brittle. Let's make things a bit more flexible by letting locate tell us where its database is.

if
    mlocate_path=`locate -S`
then
    # locate -S command will output database path in following format:
    # Database /full/path/to/db: (more output)...
    mlocate_path=${mlocate_path%:*} #remove content after colon
    mlocate_path=${mlocate_path#'Database '*} #remove 'Database '
else
    echo "Couldn't run locate command.  Is mlocate installed?"
    exit 5
fi

Instead of hard-coding the path to the database, we collect the locate database details using the -S parameter. With a couple of shell parameter expansions we can tease the file path out of the output.
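To see those two expansions in isolation, here is a small sketch that runs them against a canned string instead of a live locate -S call. The sample text and the function name extract_db_path are just for illustration; the counts on your system will differ.

```shell
#!/bin/bash
# Illustrative sample of what locate -S might print (numbers are made up)
sample_output='Database /var/lib/mlocate/mlocate.db:
	21,050 directories
	180,422 files'

extract_db_path() {
    local out=$1
    out=${out%:*}           # % strips the shortest suffix matching ":*"
    out=${out#'Database '}  # # strips the shortest prefix matching "Database "
    printf '%s\n' "$out"
}

extract_db_path "$sample_output"
```

Because only the first line of the output contains a colon, stripping from the last colon onward also throws away all of the statistics lines that follow it.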

Because we are going to offer to update the locate database (as well as eventually manipulate authorized_keys files), it makes sense to check that we are root before proceeding. Additionally, let's check to see that we get a pattern from our user, and provide some usage guidance.

if [ ! `whoami` = "root" ]
then
    echo "Please run as root."
    exit 4
fi

if [ -z "$1" ]
then
    echo "Usage: check_authorized_keys PATTERN"
    exit 3
fi

Checking and modifying authorized_keys for a pattern

With some prerequisites in place, we're finally ready to scan the system's authorized_keys files. Let's just start with the syntax for that loop.

for key_file in `locate authorized_keys`; do
    echo "Searching $key_file..."
done

We do not put a dollar sign ($) in front of key_file when defining the loop, but once inside the loop we use the regular syntax. We use command substitution, placing back quotes (`) around the command whose output we want to loop over. We're now scanning each file, but how do we find matching entries?

IFS=$'\n'
for matching_entry in `grep "$1" $key_file`; do
    IFS=' '
    echo "Found an entry in $key_file:"
    echo $matching_entry
done

For each $key_file, we now grep for our user's pattern ($1) and store each matching line in $matching_entry. We have to change the Internal Field Separator (IFS) to a newline, instead of the default of space, tab and newline, in order to capture each grepped line in its entirety. (Thanks to Brian Miller for that one!)
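To show the difference the field separator makes, here is a small sketch with a throwaway file of two fake key lines. It also shows an alternative I did not use in the script: reading the matches line by line with read on a separate file descriptor, which avoids changing IFS for the rest of the script. The function names and sample keys are made up for the demonstration.

```shell
#!/bin/bash
key_file=$(mktemp)
printf '%s\n' \
    'ssh-rsa AAAAB3NzaExample1 alice@example.com' \
    'ssh-rsa AAAAB3NzaExample2 bob@example.com' > "$key_file"

# Default IFS: the unquoted substitution is split on every space, tab and newline
count_default_ifs() {
    local n=0 word
    for word in $(grep -- "$1" "$2"); do n=$((n + 1)); done
    echo "$n"
}

# read -r with an empty IFS takes one whole line at a time. The matches arrive
# on file descriptor 3, so stdin stays free for an interactive prompt in the loop.
count_with_read() {
    local n=0 entry matches
    matches=$(mktemp)
    grep -- "$1" "$2" > "$matches"
    while IFS= read -r entry <&3; do n=$((n + 1)); done 3< "$matches"
    rm -f "$matches"
    echo "$n"
}

echo "for loop, default IFS: $(count_default_ifs ssh-rsa "$key_file") items"   # 6
echo "while read loop:       $(count_with_read ssh-rsa "$key_file") items"     # 2
rm -f "$key_file"
```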

With a matching entry found in a key file, it's time to finally offer the user a chance to remove the entry.

echo "Found an entry in $key_file:"
echo $matching_entry
echo -n "Remove entry? [y/n]: "
read remove_entry
if [ "$remove_entry" = "y" ]
then
    if [ ! -w $key_file ]
    then
        echo "Cannot write to $key_file."
        exit 1
    else
        # Use | as the sed delimiter; base64 key material often contains /
        sed -i "\|$matching_entry|d" $key_file
        echo "Deleted."
    fi
else
    echo "Not deleted."
fi

We ask the user whether they want to delete the entry shown, verify that we can write to $key_file, and then delete $matching_entry. By using the -i option to the sed command, we are able to make modifications in place.
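One thing to watch with the sed approach is that the entry is handed to sed as a regular expression, so characters in the key line that are special to sed can cause misses or errors. As a rough alternative, not what the script here does, the line can be removed by exact fixed-string match with grep. The function name remove_exact_line is made up for this sketch.

```shell
#!/bin/bash
# Hypothetical helper: delete every line of FILE that is exactly equal to LINE.
# Usage: remove_exact_line FILE LINE
remove_exact_line() {
    local file=$1 line=$2 tmp
    tmp=$(mktemp) || return 1
    # -F: treat LINE as a fixed string, -x: match whole lines only, -v: keep the rest.
    # grep exits 1 when nothing is left to print, which is not an error here.
    grep -vxF -- "$line" "$file" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
    # Write back through the existing file so its owner and permissions are kept
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Inside the loop this would be called as remove_exact_line "$key_file" "$matching_entry" in place of the sed line.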

The Final Product

I'm sure there is a lot of room for improvement on this script and I'd welcome pull requests on the GitHub repo I set up for this little block of code. As always, be very careful when running automated scripts as root. Please test this script out on a non-production system before use.

#!/bin/bash

if [ ! `whoami` = "root" ]
then
    echo "Please run as root."
    exit 4
fi


if [ -z "$1" ]
then
    echo "Usage: check_authorized_keys PATTERN"
    exit 3
fi

if
    mlocate_path=`locate -S`
then
    # locate -S command will output database path in following format:
    # Database /full/path/to/db: (more output)...
    mlocate_path=${mlocate_path%:*} #remove content after colon
    mlocate_path=${mlocate_path#'Database '*} #remove 'Database '
else
    echo "Couldn't run locate command.  Is mlocate installed?"
    exit 5
fi

if [ -r $mlocate_path ]
then
    echo -n "mlocate database last updated: "
    stat -c %y $mlocate_path
    echo -n "Do you want to update the locate database this script depends on? [y/n]: "
    read update_locate
    if [ "$update_locate" = "y" ]
    then
        echo "Updating locate database.  This may take a few minutes..."
        updatedb
        echo "Update complete."
        echo ""
    fi
else
    echo "Cannot read from $mlocate_path"
    exit 2
fi

for key_file in `locate authorized_keys`; do
    echo "Searching $key_file..."
    IFS=$'\n'
    for matching_entry in `grep "$1" $key_file`; do
        IFS=' '
        echo "Found an entry in $key_file:"
        echo $matching_entry
        echo -n "Remove entry? [y/n]: "
        read remove_entry
        if [ "$remove_entry" = "y" ]
        then
            if [ ! -w $key_file ]
            then
                echo "Cannot write to $key_file."
                exit 1
            else
                # Use | as the sed delimiter; base64 key material often contains /
                sed -i "\|$matching_entry|d" $key_file
                echo "Deleted."
            fi
        else
            echo "Not deleted."
        fi
    done
done

echo "Search complete."

Converting CentOS 6 to RHEL 6

A few years ago I needed to convert a Red Hat Enterprise Linux (RHEL) 5 development system to CentOS 5, as our customer did not actively use the system any more and no longer wanted to renew the Red Hat Network entitlement for it. Making the conversion was surprisingly straightforward.

This week I needed to make a conversion in the opposite direction: from CentOS 6 to RHEL 6. I didn't find any instructions on doing so, but found a RHEL 6 to CentOS 6 conversion guide with roughly these steps:

yum clean all
mkdir centos
cd centos
wget http://mirror.centos.org/centos/6.0/os/x86_64/RPM-GPG-KEY-CentOS-6
wget http://mirror.centos.org/centos/6.0/os/x86_64/Packages/centos-release-6-0.el6.centos.5.x86_64.rpm
wget http://mirror.centos.org/centos/6.0/os/x86_64/Packages/yum-3.2.27-14.el6.centos.noarch.rpm
wget http://mirror.centos.org/centos/6.0/os/x86_64/Packages/yum-utils-1.1.26-11.el6.noarch.rpm
wget http://mirror.centos.org/centos/6.0/os/x86_64/Packages/yum-plugin-fastestmirror-1.1.26-11.el6.noarch.rpm
rpm --import RPM-GPG-KEY-CentOS-6
rpm -e --nodeps redhat-release-server
rpm -e yum-rhn-plugin rhn-check rhnsd rhn-setup rhn-setup-gnome
rpm -Uhv --force *.rpm
yum upgrade

I then put together a plan to do more or less the opposite of that. The high-level overview of the steps is:

  1. Completely upgrade the current CentOS and reboot to run the latest kernel, if necessary, to make sure you're starting with a solid system.
  2. Install a handful of packages that will be needed by various RHN tools.
  3. Log into the Red Hat Network web interface and search for and download onto the server the most recent version of these packages for RHEL 6 x86_64:
    • redhat-release-server-6Server
    • rhn-check
    • rhn-client-tools
    • rhnlib
    • rhnsd
    • rhn-setup
    • yum
    • yum-metadata-parser
    • yum-rhn-plugin
    • yum-utils
  4. Install the Red Hat GnuPG signing key.
  5. Forcibly remove the package that identifies this system as CentOS.
  6. Forcibly upgrade to the downloaded RHEL and RHN packages.
  7. Register the system with Red Hat Network.
  8. Update any packages that now need it using the new Yum repository.

The exact steps I used today to convert from CentOS 6.1 to RHEL 6.2 (with URL session tokens munged):

yum upgrade
shutdown -r now
yum install dbus-python libxml2-python m2crypto pyOpenSSL python-dmidecode python-ethtool python-gudev usermode
mkdir rhel
cd rhel
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/redhat-release-server/6Server-6.2.0.3.el6/x86_64/redhat-release-server-6Server-6.2.0.3.el6.x86_64.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhn-check/1.0.0-73.el6/noarch/rhn-check-1.0.0-73.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhn-client-tools/1.0.0-73.el6/noarch/rhn-client-tools-1.0.0-73.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhnlib/2.5.22-12.el6/noarch/rhnlib-2.5.22-12.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhnsd/4.9.3-2.el6/x86_64/rhnsd-4.9.3-2.el6.x86_64.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/rhn-setup/1.0.0-73.el6/noarch/rhn-setup-1.0.0-73.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/yum/3.2.29-22.el6/noarch/yum-3.2.29-22.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/yum-metadata-parser/1.1.2-16.el6/x86_64/yum-metadata-parser-1.1.2-16.el6.x86_64.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/yum-rhn-plugin/0.9.1-36.el6/noarch/yum-rhn-plugin-0.9.1-36.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget 'https://content-web.rhn.redhat.com/rhn/public/NULL/yum-utils/1.1.30-10.el6/noarch/yum-utils-1.1.30-10.el6.noarch.rpm?__gda__=XXX_YYY&ext=.rpm'
wget https://www.redhat.com/security/fd431d51.txt
rpm --import fd431d51.txt
rpm -e --nodeps centos-release
rpm -e centos-release-cr
rpm -Uhv --force *.rpm
rpm -e yum-plugin-fastestmirror
yum clean all
rhn_register
yum upgrade

I'm expecting to use this process a few more times in the near future. It is very useful when working with a hosting provider that does not directly support RHEL, but provides CentOS, so we can get the new servers set up without needing to request a custom operating system installation that may add a day or two to the setup time.

Given the popularity of both RHEL and CentOS, it would be neat for Red Hat to provide a tool to switch between them easily, at least "upgrading" from CentOS to RHEL to bring more customers into their fold, if not the other direction!

Semaphore limits and many Apache instances on Linux

On some of our development servers, we run many instances of the Apache httpd web server on the same system. By "many", I mean 30 or more separate Apache instances, each with its own configuration file and child processes. This is not unusual on DevCamps setups with many developers working on many projects on the same server at the same time, each project having a complete software stack nearly identical to production.

On Red Hat Enterprise Linux 5, with somewhere in the range of 30 to 40 Apache instances on a server, you can run into failures at startup time with this error or another similar one in the error log:

[error] (28)No space left on device: Cannot create SSLMutex

The exact error will depend on what Apache modules you are running. The "No space left on device" error does not mean you've run out of disk space or free inodes on your filesystem, but that you have run out of SysV IPC semaphores.

You can see what your limits are like this:

# cat /proc/sys/kernel/sem
250 32000 32 128

I typically double those limits by adding this line to /etc/sysctl.conf:

kernel.sem = 500 64000 64 256

That makes sure you'll get the change at the next boot. To make the change take immediate effect:

# sysctl -p

With those limits I've run 100 Apache instances on the same server.
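For reference, the four numbers in kernel.sem are, in order, SEMMSL (maximum semaphores per set), SEMMNS (maximum semaphores system-wide), SEMOPM (maximum operations per semop call) and SEMMNI (maximum number of semaphore sets). Here is a small sketch that labels them; the function name describe_sem_limits is just something I made up for this example.

```shell
#!/bin/bash
# Label the four fields of kernel.sem.
# Usage: describe_sem_limits "250 32000 32 128"
describe_sem_limits() {
    # Word splitting on whitespace is intended: the value is four numbers
    set -- $1
    echo "semmsl=$1 semmns=$2 semopm=$3 semmni=$4"
}

# On a Linux system, show the live values
if [ -r /proc/sys/kernel/sem ]; then
    describe_sem_limits "$(cat /proc/sys/kernel/sem)"
fi
```

To see how close you are to the limits, ipcs -s lists the semaphore sets currently allocated, so you can compare the number of sets against semmni.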

RPM building: Fedora's _sharedstatedir

When Red Hat Enterprise Linux does not offer packages that we need, EPEL (Extra Packages for Enterprise Linux) often has what we want, kept compatible with RHEL. When EPEL also doesn't have a package, or we need a newer release than is offered, we rebuild packages from Fedora, which has consistently high-quality packages even in its "rawhide" development phase. We then distribute our packages in several compatibility-oriented Yum repositories at packages.endpoint.com.

Of course some things in the latest Fedora are not compatible with RHEL. In rebuilding the logcheck package (needed as a dependency for another package), I found that Fedora RPM spec files have begun using the _sharedstatedir macro in /usr/lib/rpm/macros, which RHEL has never used before.

On RHEL that macro has been set to /usr/com, a strange nonexistent path that apparently came from the GNU autoconf tools but wasn't used in RHEL. Now in Fedora the macro is set to /var/lib and is being used, as described in a Fedora wiki page on packaging.

The easiest and most compatible way to make the change without munging the system- or user-wide RPM macros is to add this definition to the top of the spec file where it's needed:

%define _sharedstatedir /var/lib

And then the RPM build is happy.

In related news, alongside the new logcheck package, there are also new End Point RHEL 5 x86_64 packages for the brand-new Git 1.7.7.1 and pbzip2 1.1.5, the multi-CPU core parallel compressor that has had several bugfix releases this year.

OpenSSH known_hosts oddity

A new version of the excellent OpenSSH was recently released, version 5.9. As you'd expect from such widely-used mature software, there are lots of minor improvements to enjoy rather than anything too major.

But what I want to write about today is a little surprise in how ssh handles multiple cached host keys in its known_hosts files.

I had wrongly thought that ssh stopped scanning known_hosts when it hit the first hostname or IP address match, such as happens with lookups in /etc/hosts. But that isn't how it works. The sshd manual reads:

It is permissible (but not recommended) to have several lines or different host keys for the same names. This will inevitably happen when short forms of host names from different domains are put in the file. It is possible that the files contain conflicting information; authentication is accepted if valid information can be found from either file.

The "files" it refers to are the global /etc/ssh/known_hosts and the per-user ~/.ssh/known_hosts.

The surprise came when there are multiple host key entries in ~/.ssh/known_hosts for, say, 10.0.0.1. If the first one has a non-matching host key, the ssh client tries the second one, and so on until it runs out of matching IP address entries to check. If none has a matching host key, the ssh client error reports the offending line number for the last matching IP address, but gives no indication that there are earlier mismatches as well.

This is actually kind of convenient if you have scripts that simply append new host keys to the end of the known_hosts file, and it also makes sense since hostname wildcards and multiple hostnames per line are allowed. It's fine, but it isn't what I expected and is nice to know.
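Since the error only names the last mismatching line, it can help to list every entry for a host yourself. Here is a minimal sketch using awk; the function name list_known_hosts_matches is made up, and it only handles plain, unhashed entries with exact names (no wildcards, hashed hostnames or @ markers). For hashed files, ssh-keygen -F with the hostname does the searching for you.

```shell
#!/bin/bash
# Print "LINE_NUMBER KEY_TYPE" for every known_hosts entry naming HOST exactly.
# Usage: list_known_hosts_matches HOST FILE
list_known_hosts_matches() {
    awk -v host="$1" '
        /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
        {
            n = split($1, names, ",")      # first field: comma-separated names
            for (i = 1; i <= n; i++) {
                if (names[i] == host) {
                    print NR, $2
                    break
                }
            }
        }' "$2"
}
```

For example, list_known_hosts_matches 10.0.0.1 ~/.ssh/known_hosts would print one line per entry, in the same order ssh tries them.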

Home router problems with .0 IP address

In our work the occasional mysterious problem surfaces which makes me appreciate how tractable and sane the majority of the challenges are. Here I'll tell the story of one of the mysterious problems.

In Internet routing of IPv4 addresses, there's nothing inherently special about an IP address that ends in .0, .255, or anything else. It all depends on the subnet. In the days before CIDR (Classless Inter-Domain Routing) brought us arbitrary subnet masks, there were classes of routing, most commonly A, B, and C. And the .0 and .255 addresses were special.

That was a long time ago, but it can still cause occasional trouble today. One of our hosting providers assigned us an IP address ending in .0, which we used for hosting a website. It worked fine, and was in service for many months before we heard any reports of trouble.

Then we heard a report from one of our clients that they could not access that website from their home, but they could from their office. We couldn't ever figure out why.

Next one of our own employees found that he could not access the website from his home, but he could from other locations.

Finally we had enough evidence when a friend from the open source community also could not access that website from his home.

The commonality was in the router they were using:

  • Belkin G Wireless Router Model F5D7234-4 v4
  • Belkin F5D9231-4 v1
  • and the third thought it was a Belkin but they were not able to provide the exact model.

We moved the website to a different IP address on the same server, and they had no problem accessing it.

The routers are obviously broken, but there's little sense arguing about that. For now we avoid using any .0 IP address because there are going to be a few people who can't reach it.
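To illustrate why a .0 address is not special in itself, here is a small sketch that works out whether an address is the network address, the broadcast address, or an ordinary host address for a given prefix length. The function names and the 10.0.1.0 example are mine, chosen only for the demonstration.

```shell
#!/bin/bash
# Convert a dotted-quad IPv4 address to an integer
ip_to_int() {
    local IFS=.
    set -- $1                      # split the address on dots
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Usage: classify_address ADDRESS PREFIX_LENGTH
classify_address() {
    local addr mask network broadcast
    addr=$(ip_to_int "$1")
    mask=$(( (0xFFFFFFFF << (32 - $2)) & 0xFFFFFFFF ))
    network=$(( addr & mask ))
    broadcast=$(( network | (~mask & 0xFFFFFFFF) ))
    if [ "$addr" -eq "$network" ]; then
        echo "network address"
    elif [ "$addr" -eq "$broadcast" ]; then
        echo "broadcast address"
    else
        echo "ordinary host address"
    fi
}

classify_address 10.0.1.0 24   # network address
classify_address 10.0.1.0 23   # ordinary host address
```

In a /24 the .0 address is the network address, but in the /23 that spans 10.0.0.0 through 10.0.1.255 it sits in the middle of the range and is as usable as any other.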

Raising open file descriptor limits for Dovecot and nginx

Recently we've needed to increase some limits in two excellent open-source servers: Dovecot, for IMAP and POP email service, and nginx, for HTTP/HTTPS web service. These are running on different servers, both using Red Hat Enterprise Linux 5.

First, let's look at Dovecot. We have a somewhat busy mail server, and as it grew busier it occasionally hit connection limits while the server itself still had plenty of available capacity.

Raising the number of processes in Dovecot is easy. Edit /etc/dovecot.conf and change from the prior (now commented-out) limits to the new limits:

#login_max_processes_count = 128
login_max_processes_count = 512

and later in the file:

#max_mail_processes = 512
max_mail_processes = 2048

However, then Dovecot won't start at all due to a shortage of available file descriptors. There are various ways to change that, including munging the init scripts, changing the system defaults, etc. The most standard and least invasive way to do so with this RHEL 5 Dovecot RPM package is to edit /etc/sysconfig/dovecot and add:

ulimit -n 131072

That sets the shell's maximum number of open file descriptors allowed in the init script /etc/init.d/dovecot before the Dovecot daemon is run. The default ulimit -n is 1024, so here we increased it to an arbitrary but comfortably large number (2 * 64K) to handle the new limits and then some.

Similarly, on another server we needed to increase the number of connections allowed per nginx worker process from the default 1024 for a very high-capacity HTTP caching proxy server.

We edited /etc/nginx/nginx.conf and changed the events block like this:

events {
    worker_connections  65536;
}

But then nginx wouldn't start at all. The same problem and same solution applied. We edited /etc/sysconfig/nginx to add:

ulimit -n 131072

And now nginx has enough file descriptors to start.

Changing the limits this way also has the benefit of surviving an upgrade, because /etc/sysconfig files are marked in RPM as configuration files that should have any changes preserved.
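The reason a ulimit line in a sysconfig file works at all is that the limit is set in the init script's shell and inherited by every process that shell starts afterwards. Here is a small sketch of that inheritance. It lowers the limit to 256 inside a subshell purely as a demonstration value, since lowering never requires root.

```shell
#!/bin/bash
# Set the soft limit in a subshell, then ask a child process what it sees
child_limit=$( ulimit -S -n 256 && sh -c 'ulimit -n' )

echo "limit seen by the child process: $child_limit"
echo "limit in the original shell:     $(ulimit -n)"
```

The original shell's limit is unchanged because the ulimit call happened in a subshell. In the init script it happens in the same shell that launches the daemon, and raising the limit beyond the hard limit works there because the init script runs as root.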

RailsConf 2011 - Day One

Today was the first official day of RailsConf 2011. As with most technical conferences, this one spent the first day with tutorials and workshops. For those of us without paid access to the tutorial sessions, the BohConf was a nice way to spend our first day of the four-day event.

BohConf is described as the "unconference" of RailsConf. It's a loosely organized collection of presentations, mini-hackathons and barcamp-style meetings. I spent the first half of Monday at the BohConf. Of particular interest to me was Chris Eppstein and Scott Davis' introduction to Sass and Compass. I've dabbled with Sass in the past but only recently learned of Compass.

Sass is a great way to construct your CSS without the tedious duplication that's typical of most modern stylesheets. By introducing programming features like variables, inheritance and nested blocks, Sass makes it easy to keep your source material concise and logical. Once your source declarations are ready, compile your production stylesheets with Sass or Compass.

Compass is effectively a framework for easy construction and deployment of stylesheets using Sass. To hear Scott describe it, "Compass is to Sass as Rails is to Ruby". Together they're a very attractive combination for the Ruby developer who also dabbles in design (and who doesn't these days?). Truth be told, while I'm very impressed with the capabilities of Sass, I worry about the trend to mix logic back into presentation. My mom raised me to abstract the presentation layer for ease on graphic designers, and that rule has suited me well to date. Time will tell.

In the afternoon I wandered into a sponsored workshop from VMware. Dave McCrory and Dekel Tankel led a demonstration of their new CloudFoundry service. A nice reward for attending was getting instant approval of your CloudFoundry.com beta registration. Although I felt mildly guilty for it afterwards, I took advantage of this opportunity to get an extra beta account (my personal request had been approved a week earlier).

Dave introduced everyone to the CloudFoundry beta offering and discussed a vague roadmap for the open source project and their own commercial VM product slated for early 2012. They emphasized that the CloudFoundry core project will remain open source, and that interested parties can fork it on GitHub, hack on the code, and deploy their own private "micro-clouds". Dave even hinted that at least one startup has based its PHP PaaS service on CloudFoundry.

Once all the attendees had received their beta accounts, Dekel walked us through the installation and basic command-line usage of the vmc utility. I was able to immediately vmc login with my Beta account credentials and change my password with vmc passwd. I should note that before logging in, I had to choose my target server with vmc target api.cloudfoundry.com. Why is this important? This will allow developers to easily switch between targeted environments. For example, I could install CloudFoundry on my workstation or development server and "target" it as my development or staging environment. Once my tests pass, I can quickly switch targets and push the changes to production.

They had us follow along by recreating a sample Ruby application designed to check if one Twitter account follows another. The examples were simple and easy to follow. Once we had our Gemfile, controller and views in place, we had to bundle package all of the dependencies. Once this was complete, a quick vmc push myappname and we were live!

Unfortunately, most (if not all) of us encountered a gotcha with this example. Because all of our instantiated applications reside behind the same IP address on CloudFoundry's network, we quickly hit Twitter's API quota. I'm not sure if this will be a problem once CloudFoundry officially launches, but it's something to keep in mind. And while it was useful to vmc logs myappname to debug this problem, an astute attendee brought up the fact that there was no way to tail application logs in CloudFoundry. This is a glaring oversight and one I hope they rectify before the Beta is finished.

The workshop continued with an introduction to binding services like Redis, Mongo or MySQL. We added new functionality to our existing application that introduced a Redis backend to store leaderboard information for Twitter activity. Lastly, Dekel demonstrated the ease of scaling our applications with vmc instances myappname 5. This simple command instantiates new copies of the application behind their dynamic load balancer. Incidentally, they're not currently offering any sort of scalable backend storage, so keep that in mind before you try to launch any production sites on their Beta service. Now you'd never do that, would you? ;-)

The first day of RailsConf 2011 was impressive. I came last year as an exhibitor and am really excited that I get to attend this year for the conference sessions. I can't wait to see all the talks tomorrow!

RHEL 5 SELinux initscripts problem

I ran into a strange problem updating Red Hat Enterprise Linux 5 a few months ago, and just ran into it again and this time better understood what went wrong.

The problem was serious: After a `yum upgrade` of a RHEL 5 x86_64 server with SELinux enforcing, it never came back after a reboot. Logging into the console I could see that it was stuck in single user mode because there were no init scripts! Investigation showed that indeed the initscripts package was completely missing.

I searched on bugzilla.redhat.com looking for any reported problems and didn't find any. I reinstalled initscripts, rebooted, and the server was fine, but it was not happytimes to have that unexpected downtime.

Tonight I ran into the problem again, this time on a build server where downtime didn't matter so I could investigate more leisurely.

The error from yum looked like this (the same problem affected the screen package as affected initscripts):

Downloading Packages:
screen-4.0.3-4.el5.i386.rpm          | 559 kB      00:00
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
groupadd: unable to open group file
error: %pre(screen-4.0.3-4.el5.i386) scriptlet failed, exit status 10
error:   install: %pre scriptlet failed (2), skipping screen-4.0.3-4.el5

Updated:
  screen.i386 0:4.0.3-4.el5

Complete!
# cat /selinux/enforce
1

The way I dealt with this initially was to temporarily disable SELinux enforcing, update the package, then reboot (to also load a kernel update):

# setenforce 0
# yum -y upgrade
# shutdown -r now

But following up on the specific error message showed:

# ls -lFaZ /etc/group
-rw-r--r--  root root system_u:object_r:file_t:s0      /etc/group

Aha! The SELinux context is wrong. Given that this has happened on a couple of different machines, I'm guessing some past upgrade broke the context. What should it be? Let's check /etc/passwd for reference:

# ls -lFaZ /etc/passwd
-rw-r--r--  root root system_u:object_r:etc_t:s0       /etc/passwd

I confirmed on another working server that this is also the correct context for /etc/group. To fix:

# chcon system_u:object_r:etc_t:s0 /etc/group

Then the updates proceed fine. It would be nice to find out what past action set the context wrong on /etc/group.
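
For spotting this kind of mismatch across several files, a short Python sketch along these lines may be handy. The names SELinuxContext, parse_context and read_context are invented for illustration; read_context assumes a Linux system where the label is exposed through the security.selinux extended attribute.

```python
import os
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(text):
    """Split a context such as 'system_u:object_r:etc_t:s0' into its fields.

    The level may itself contain colons (for example 's0:c0.c1023'),
    so only the first three colons are treated as separators.
    """
    user, role, type_, level = text.strip().split(":", 3)
    return SELinuxContext(user, role, type_, level)


def read_context(path):
    """Read a file's SELinux label from its extended attribute (Linux only)."""
    raw = os.getxattr(path, "security.selinux")
    return parse_context(raw.rstrip(b"\0").decode())


if __name__ == "__main__":
    broken = parse_context("system_u:object_r:file_t:s0")
    wanted = parse_context("system_u:object_r:etc_t:s0")
    print("type mismatch:", broken.type, "vs", wanted.type)
```

Comparing only the type field is usually enough to catch a file that has fallen back to file_t.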

Google 2-factor authentication

About a month ago, Google made available to all users their new 2-factor authentication, which they call 2-step verification. In addition to the customary username and password, this optional new feature requires that you enter a 6-digit number that changes every 30 seconds, generated by the Google Authenticator app on your Android, BlackBerry, or iPhone.

This was straightforward to set up and has worked well for me in the past month. It would thwart bad guys who intercept your password in most cases. It would also lock you out of your Google account if you lose your phone and your emergency scratch codes. :)

I was happy to see this is all based on some open standards under development, and Google has made this even more useful by releasing an open source PAM module called google-authenticator. With that PAM module, a Linux system administrator can require a Google Authenticator code in addition to password authentication for login.
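
As far as I can tell, the open standards in question are HOTP (RFC 4226) and its time-based variant TOTP. The following Python sketch shows how a 6-digit code that changes every 30 seconds can be derived from a shared secret. It is an illustration of the algorithm, not the google-authenticator source, and the key used in the example is the published RFC test key, not a real secret.

```python
import hashlib
import hmac
import struct
import time


def hotp(key, counter, digits=6):
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA1."""
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick an offset,
    # and 31 bits starting at that offset become the numeric code.
    offset = digest[-1] & 0x0F
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


def totp(key, now=None, step=30, digits=6):
    """Time-based variant: the counter is the number of `step`-second periods since the epoch."""
    if now is None:
        now = time.time()
    return hotp(key, int(now) // step, digits)


if __name__ == "__main__":
    test_key = b"12345678901234567890"  # RFC test key, for demonstration only
    print("current code:", totp(test_key))
```

Because both sides only need the shared key and a roughly synchronized clock, the phone can generate codes without any network connection.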

I tried this out on a CentOS x86_64 system and found it fairly straightforward to set up. I ran into two minor gotchas which were reported by others as well:

  • The Makefile calls sudo directly, which it shouldn't -- I was running a minimal installation without sudo installed, and in any case the administrator should decide when to become root and how. (Issue 17)
  • The Makefile installs into /lib/security instead of /lib64/security. This has since been fixed. (Issue 6)

After build and installation it was easy to generate a secret key for each individual user account. The key is stored in the user's home directory, which Issue 4 notes has some downsides, and the resolution to Issue 24 provides a partial workaround for this. The home directory seems like a nice default to me.

In the end, I found the google-authenticator module isn't suitable for my regular ssh use due to no fault of its developers. I normally use SSH public key authentication and that's handled by OpenSSH natively, separately from PAM, and thus bypasses 2-factor authentication entirely. So I can have 2-factor authentication with password authentication, but not with public key authentication, which is really what I want.

Does anyone know of a way to configure things that way? I wasn't able to find one, so I'm not planning on using this for shell logins for now. But it's still a nice option for Google logins, and I expect the google-authenticator project will advance over time.

SAS 70 becomes SSAE 16

In recent years it’s become increasingly common for hosting providers to advertise their compliance with the SAS 70 Type II audit. Interest in that audit often comes from hosting customers’ need to meet Sarbanes-Oxley (aka Sarbox) or other legal requirements in their own businesses. But what is SAS 70?

It was not clear to me at first glance that SAS 70 is actually a financial accounting audit, not one that deals primarily with privacy, information technology security, or other areas.

SAS 70 was created by the American Institute of Certified Public Accountants (AICPA) and contains guidelines for assessing organizations’ service delivery processes and controls. The audit is performed by an independent Certified Public Accountant.

Practically speaking, what does passing a SAS 70 audit tell us about an organization? Most importantly that it is financially reliable, and thus hopefully a safe partner for providing critical Internet hosting and data storage services.

On June 15, 2011, the SAS 70 audit will be effectively replaced by the new SSAE 16 attestation standard (Statement on Standards for Attestation Engagements no. 16, Reporting on Controls at a Service Organization). Because it is an attestation standard rather than an auditing standard, the focus appears to shift from an external auditor investigating an organization to the organization making claims about itself under the guidance of an auditor.

SSAE 16 was created by the AICPA to make the United States service organization reporting standard compatible with the new international service organization reporting standard, ISAE 3402, which is freely available in PDF format. The SSAE 16 document is available only for a fee.

The AICPA’s FAQ on the SAS 70 to SSAE 16 transition makes an interesting point:

Q. — Will entities now become “SSAE 16 certified”?

A. — No! A popular misconception about SAS 70 is that a service organization becomes “certified” as SAS 70 compliant after undergoing a type 1 or type 2 service auditor’s engagement. There is no such thing as being SAS 70 certified and there will be no such thing as being SSAE 16 certified. An SSAE 16 report (as with a SAS 70 report) is primarily an auditor to auditor communication, the purpose of which is to provide user auditors with information about controls at a service organization that are relevant to the user entities’ financial statements.

This is interesting because many in the industry informally state that they are “SAS 70 Type II certified”. But practically speaking for those of us involved in Internet hosting, is “certification” very different from “passing an audit”? It serves primarily as a requirement checklist item about hosting providers in either case.

Many major hosting providers have completed a SAS 70 Type II audit, including Rackspace (and Rackspace Cloud), Amazon Web Services, SoftLayer (and The Planet, which SoftLayer recently acquired), Verio, Terremark, and ServePath, to mention a few that we have worked with. Presumably these will make an SSAE 16 attestation later this year.

Note that many VPS and cloud hosting providers do not report having been SAS 70 audited. If this is a requirement for your hosting, it's important to look for it early before settling on a provider.

More details about the SAS 70 to SSAE 16 transition are available on the AICPA Service Organization Controls Reporting website.

Speeding up the Spree demo site

There's a lot that can be done to speed up Spree, and Rails apps in general. Here I'm not going to deal with most of that. Instead I want to show how easy it is to speed up page delivery using standard HTTP server tuning techniques, demonstrated on demo.spreecommerce.com.

First, let's get a baseline performance measure from the excellent webpagetest.org service using their remote Internet Explorer 7 tests:

  • First page load time: 2.1 seconds
  • Repeat page load time: 1.5 seconds

The repeat load is faster because the browser has images, JavaScript, and CSS cached, but it still has to check back with the server to make sure they haven't changed. Full details are in this initial report.

The demo.spreecommerce.com site is run on a Xen VPS with 512 MB RAM, CentOS 5 i386, Apache 2.2, and Passenger 2.2. There were several things to tune in the Apache httpd.conf configuration:

  • mod_deflate was already enabled. Good. That's a big help.
  • Enable HTTP keepalive: KeepAlive On and KeepAliveTimeout 3
  • Limit Apache children to keep RAM available for Rails: StartServers 5, MinSpareServers 2, MaxSpareServers 5
  • Limit Passenger pool size to 2 child processes (down from the default 6), to queue extra requests instead of using slow swap memory: PassengerMaxPoolSize 2
  • Enable browser & intermediate proxy caching of static files: ExpiresActive On and ExpiresByType image/jpeg "access plus 2 hours" etc. (see below for full example)
  • Disable ETags which aren't necessary once Expires is enabled: FileETag None and Header unset ETag
  • Disable unused Apache modules: free up memory by commenting out LoadModule proxy, proxy_http, info, logio, usertrack, speling, userdir, negotiation, vhost_alias, dav_fs, autoindex, most authn_* and authz_* modules
  • Disable SSLv2 (for security and PCI compliance, not performance): SSLProtocol all -SSLv2 and SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:-LOW:-SSLv2:-EXP

After making these changes, without tuning Rails, Spree, or the database at all, a new webpagetest.org run reports:

  • First page load time: 1.2 seconds
  • Repeat page load time: 0.4 seconds

That's an easy improvement, a reduction of 0.9 seconds for the initial load and 1.1 seconds for a repeat load! Complete details are in this follow-on report.
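
Expressed as percentages, the same numbers look even better. This tiny Python sketch just redoes the arithmetic on the four load times reported above; the function name is made up for illustration.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    first = percent_reduction(2.1, 1.2)   # first page load, in seconds
    repeat = percent_reduction(1.5, 0.4)  # repeat page load, in seconds
    print(f"first load: {first:.0f}% less time")
    print(f"repeat load: {repeat:.0f}% less time")
```

That works out to roughly 43% less time for the first load and 73% less for a repeat load.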

The biggest wins came from enabling HTTP keepalive, which allows serving multiple files from a single HTTP connection, and enabling static file caching which eliminates the majority of requests once the images, JavaScript, and CSS are cached in the browser.

Note that many of the resource-limiting changes I made above to Apache and Passenger would be too restrictive if more RAM or CPU were available, as is typical on a dedicated server with 2 GB RAM or more. But when running on a memory-constrained VPS, it's important to put such limits in place or you'll practically undo any other tuning efforts you make.

I wrote about these topics a year ago in a blog post about Interchange ecommerce performance optimization. I've since expanded the list of MIME types I typically enable static asset caching for in Apache. Here's a sample configuration snippet to put in the <VirtualHost> container in httpd.conf:

    ExpiresActive On
    ExpiresByType image/gif   "access plus 2 hours"
    ExpiresByType image/jpeg  "access plus 2 hours"
    ExpiresByType image/png   "access plus 2 hours"
    ExpiresByType image/tiff  "access plus 2 hours"
    ExpiresByType text/css    "access plus 2 hours"
    ExpiresByType image/bmp   "access plus 2 hours"
    ExpiresByType video/x-flv "access plus 2 hours"
    ExpiresByType video/mpeg  "access plus 2 hours"
    ExpiresByType video/quicktime "access plus 2 hours"
    ExpiresByType video/x-ms-asf  "access plus 2 hours"
    ExpiresByType video/x-ms-wm   "access plus 2 hours"
    ExpiresByType video/x-ms-wmv  "access plus 2 hours"
    ExpiresByType video/x-ms-wmx  "access plus 2 hours"
    ExpiresByType video/x-ms-wvx  "access plus 2 hours"
    ExpiresByType video/x-msvideo "access plus 2 hours"
    ExpiresByType application/postscript        "access plus 2 hours"
    ExpiresByType application/msword            "access plus 2 hours"
    ExpiresByType application/x-javascript      "access plus 2 hours"
    ExpiresByType application/x-shockwave-flash "access plus 2 hours"
    ExpiresByType image/vnd.microsoft.icon      "access plus 2 hours"
    ExpiresByType application/vnd.ms-powerpoint "access plus 2 hours"
    ExpiresByType text/x-component              "access plus 2 hours"
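
To make it concrete what "access plus 2 hours" means for a response, here is a rough Python sketch of the headers mod_expires adds for a matching file: an Expires date two hours after the request time and a matching Cache-Control max-age. The function name is hypothetical and this is not how Apache implements it internally.

```python
from email.utils import formatdate


def expires_headers(access_epoch, seconds=2 * 60 * 60):
    """Headers equivalent to 'access plus 2 hours' for a request at `access_epoch`."""
    return {
        # HTTP dates are always expressed in GMT.
        "Expires": formatdate(access_epoch + seconds, usegmt=True),
        "Cache-Control": f"max-age={seconds}",
    }


if __name__ == "__main__":
    import time
    for name, value in expires_headers(time.time()).items():
        print(f"{name}: {value}")
```

Until that time passes, the browser can reuse its cached copy without contacting the server at all, which is where the repeat-load savings come from.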

Of course you'll still need to tune your Spree application and database, but why not tune the web server to get the best performance you can there?

Red Hat SELinux policy for mod_wsgi

Using SELinux, you can safely grant a process only the permissions it needs to perform its function, and no more. Linux distributions provide policies to enforce these limits on most software they package, but many aren't covered. We've made allowances for mod_wsgi on RHEL and CentOS 5 by extending Apache httpd's SELinux policy.

It seems the SELinux policy for Apache httpd is twice as large as any other package's. The folks at Red Hat have put a lot of work into making sure that attackers who manage to exploit httpd can't break out to the rest of your system, while still allowing the flexibility to serve most applications. Consult the httpd_selinux man page if messages in audit.log coincide with your error.

File Contexts

If you've created files and/or directories in /etc/httpd, make sure they have the proper file contexts so the daemon can read them:

  # restorecon -vR /etc/httpd

httpd can only serve files with an explicitly allowed file context. Configure the context of files and directories within your production code base using the semanage command:

  # semanage fcontext --add --ftype -- --type httpd_sys_content_t "/home/projectname/live(/.*)?"
  # semanage fcontext --add --ftype -d --type httpd_sys_content_t "/home/projectname/live(/.*)?"
  # restorecon -vR /home/projectname/live

View file contexts with ls -Z. Changes should generally be made with semanage and then applied with restorecon -vR.

Booleans

The httpd policy provides several boolean options for easy run-time configuration:

  • httpd_can_network_connect - Allows httpd to make network connections, including the local ones you'll be making to a database
  • httpd_enable_homedirs - Allows httpd to access /home/

Booleans are persistently set using the setsebool command with the -P flag:

  # setsebool -P httpd_can_network_connect on

WSGI Socket

When running in daemon mode, httpd and the mod_wsgi daemon communicate via a UNIX socket file. This should usually have a context of httpd_var_run_t. The standard Red Hat SELinux policy includes an entry for /var/run/wsgi.* to use this context, so it makes sense to put the socket there using the WSGISocketPrefix directive within your httpd configuration:

  WSGISocketPrefix run/wsgi

(Note that run/wsgi translates to /etc/httpd/run/wsgi which is symlinked to /var/run/wsgi.)

If socket communication fails, httpd returns a 503 "Service Temporarily Unavailable" error response.
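
When chasing a 503, it can help to take the real application out of the picture. A minimal WSGI script such as this sketch can be served in its place; mod_wsgi looks for a callable named application by default, while the response text here is arbitrary.

```python
def application(environ, start_response):
    """Smallest useful WSGI app: always answers 200 with a plain-text body."""
    body = b"mod_wsgi is alive\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

If this script also produces a 503 in daemon mode, the socket location or its SELinux context is a more likely culprit than the application code.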

SELinux Policy Module

In the course of our testing, SELinux denials like the following appeared:

  host=example.com type=AVC msg=audit(1262803154.315:1851): avc:  denied  { execmem } for  pid=5337 comm="httpd" scontext=root:system_r:httpd_t:s0 tcontext=root:system_r:httpd_t:s0 tclass=process

Unusual behavior like this is usually best allowed by creating application-specific SELinux policy modules. If you cannot resolve these AVC errors by manipulating file contexts or booleans, collect all the errors into a single file and feed that into the audit2allow utility:

  # yum install policycoreutils
  # mkdir ~/tmp  # if this doesn't exist already
  # audit2allow --module wsgi < ~/tmp/pile_of_auditd_output > ~/tmp/wsgi.te

This will output source for a new policy module. You might review the .te file before compiling. Ours looks like this:

module wsgi 1.0;

require {
      type httpd_t;
      class process execmem;
}

#============= httpd_t ==============
allow httpd_t self:process execmem;
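
To see how the denial maps onto that allow rule, here is a rough Python sketch of the translation. It is a simplified illustration of the idea, not audit2allow's actual logic, and the function names are invented.

```python
import re

# Pull the denied permissions, both contexts and the object class out of an AVC line.
AVC_RE = re.compile(
    r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}.*?"
    r"scontext=(?P<scontext>\S+)\s+tcontext=(?P<tcontext>\S+)\s+tclass=(?P<tclass>\S+)"
)


def context_type(context):
    """Return the type field of a user:role:type:level context."""
    return context.split(":")[2]


def allow_rule(avc_line):
    """Build the allow rule that would permit the denied access, or None if the line is not an AVC denial."""
    match = AVC_RE.search(avc_line)
    if match is None:
        return None
    source = context_type(match["scontext"])
    target = context_type(match["tcontext"])
    if target == source:
        target = "self"  # a domain acting on itself is written as "self"
    perms = match["perms"].split()
    perm_text = perms[0] if len(perms) == 1 else "{ " + " ".join(perms) + " }"
    return f"allow {source} {target}:{match['tclass']} {perm_text};"


if __name__ == "__main__":
    line = ('type=AVC msg=audit(1262803154.315:1851): avc:  denied  { execmem } '
            'for  pid=5337 comm="httpd" scontext=root:system_r:httpd_t:s0 '
            'tcontext=root:system_r:httpd_t:s0 tclass=process')
    print(allow_rule(line))
```

Reading denials this way makes it easier to judge whether a generated rule grants more than you intended before you compile it.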

Compile this source into a new policy module and package it:

  # checkmodule -M -m -o ~/tmp/wsgi.mod ~/tmp/wsgi.te
  # semodule_package --outfile ~/tmp/wsgi.pp --module ~/tmp/wsgi.mod

Once created, the module may be installed permanently into any compatible system's SELinux configuration:

  # semodule --install ~/tmp/wsgi.pp

There's plenty of room for improvement here. The file contexts we assigned with semanage should be defined in a .fc source file and included within the policy module. And creating a new context just for the WSGI daemon to transition into would restrict it further, allowing only a subset of Apache httpd's abilities. Writing your own policy like this allows you much finer tuning of your processes' limits, while allowing their needed functionality.

Surge 2010 day 1

Today (technically, yesterday) was the first day of the Surge 2010 conference in Baltimore, Maryland. The Tremont Grand venue is perfect for a conference. The old Masonic lodge makes for great meeting rooms, and having a hallway connect it to the hotel was nice to avoid the heavy rain today. The conference organization and scheduling and Internet have all been solid. Well done!

There were a lot of great talks, but I wanted to focus on just one that was very interesting to me: Artur Bergman's on scaling Wikia. Some points of interest:

  • They (ab)use Google Analytics to track other things besides the typical pages viewed by whom, when. For example, page load time as measured by JavaScript, with data sent to a separate GA profile for analysis separately from normal traffic. That is then correlated with exit rates to give an idea of the benefit of page delivery speed in terms of user stickiness.
  • They use the excellent Varnish reverse proxy cache.
  • 500 errors from the origin result in a static page served by Varnish, with error data hitting a separate Google Analytics profile.
  • Both their servers and their team are geographically distributed.
  • They've found SSDs (solid state disks) to be well worth the extra cost: fast, using less power in a given server, and requiring fewer servers overall. They have to use Linux software RAID because no hardware RAID controllers they've tested could keep up with the speed of SSD. They have run into the known problems with disk write performance dropping as they fill and recycle, but haven't found it to be a problem when used on replaceable cache machines.
  • They run their own CDN, with nodes running Varnish, Quagga (for BGP routing), BIND, and Scribe. But they use Akamai for static content.
  • Even running Varnish with 1 second TTL can save your backend app servers when heavy traffic arrives! One hit per second is no problem; thousands may mean meltdown.
  • Serving stale cached content when the backend is down can be a good choice. It means most visitors will never know anything was wrong. (Depends on the site's functions, of course.)
  • Their backup datacenter in Iowa is in a former nuclear bunker. See monitoring graphs for it.
  • Wikia ops staff interact with their users via IRC. This "crowdsourced monitoring" has resulted in a competition between Wikia ops people and the users to see who can spot outages first.
  • Having their own hardware in multiple redundant datacenters has meant much more leverage in pricing discussions with datacenters. "We can just move."
  • They own their own hardware, and run on bare metal. At no time does user traffic pass through any virtualized systems at all. The performance just isn't there. They do use virtual machines for some external monitoring stuff.
  • They use Riak for N-master inter-datacenter synchronization, and RiakFS for sessions and files. RiakFS is for the "legacy" MediaWiki need for POSIX access to files, but they can serve those files to the general public from Riak's HTTP interface via Varnish cache.
  • They use VPN tunnels between datacenters. Sometimes using their own routes, even over multiple hops, leads to faster transit than going over the public Internet.
  • Lots of interesting custom VCL (Varnish Configuration Language) examples.

This had plenty of interesting things to consider for any web application architecture.
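
The point about a 1 second TTL is easy to demonstrate. This Python sketch is a toy in-process cache, not Varnish, and all names in it are made up. It shows how even a very short TTL collapses a burst of identical requests into a single backend hit per second.

```python
import time


class TTLCache:
    """Tiny cache that re-fetches a key only after its TTL has expired."""

    def __init__(self, fetch, ttl=1.0, clock=time.monotonic):
        self._fetch = fetch   # function that hits the (slow) backend
        self._ttl = ttl
        self._clock = clock   # injectable so the behavior can be tested
        self._store = {}      # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]   # still fresh: no backend call
        value = self._fetch(key)
        self._store[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    backend_hits = []
    cache = TTLCache(lambda url: backend_hits.append(url) or "page body", ttl=1.0)
    for _ in range(5000):
        cache.get("/wiki/Main_Page")
    print("requests: 5000, backend hits:", len(backend_hits))
```

A real proxy also has to deal with concurrent requests arriving while the fetch is in flight, which this single-threaded sketch ignores.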

PostgreSQL 8.4 in RHEL/CentOS 5.5

The announcement of end of support coming soon for PostgreSQL 7.4, 8.0, and 8.1 means that people who've put off upgrading their Postgres systems are running out of time before they're in the danger zone where critical bugfixes won't be available.

Given that PostgreSQL 7.4 was released in November 2003, that's nearly 7 years of support, quite a long time for free community support of an open-source project.

Many of our systems run Red Hat Enterprise Linux 5, which shipped with PostgreSQL 8.1. All indications are that Red Hat will continue to support that version of Postgres as it does all parts of a given version of RHEL during its support lifetime. But of course it would be nice to get those systems upgraded to a newer version of Postgres to get the performance and feature benefits of newer versions.

For any developers or DBAs familiar with Postgres, upgrading to a new version with RPMs from the PGDG or another custom Yum repository is not a big deal, but occasionally we've had a client worry that using packages other than the ones supplied by Red Hat is riskier.

For those holdouts still on PostgreSQL 8.1 because it's the "norm" on RHEL 5, Red Hat gave us a gift in their RHEL 5.5 update. It now includes separate PostgreSQL 8.4 packages that may optionally be used on RHEL 5 instead of PostgreSQL 8.1. (Both can't be used on the same system at the same time.)

I know that getting these packages from Red Hat shouldn't be necessary, but for those who feel jittery about using 3rd-party packages, it's a good nudge to switch to Postgres 8.4 using Red Hat's supported packages. Thanks to Tom Lane at Red Hat for making this happen. Though I don't know whose idea it was, Tom is the author of all the RPM commitlog messages, so thanks, Tom!

This brings up a few other rhetorical questions: Will RHEL 6 ship with PostgreSQL 9.0? Will RHEL 5.6 have backported PostgreSQL 9.0 in similar postgresql90 packages? It'd be great to see each new PostgreSQL release have supported packages in RHEL so that there's even less reason to start a new project on an older version of Postgres. RHEL 5.5 with PostgreSQL 8.4 is a nice start in that direction.

Ruby on Rails Typo blog upgrade

I needed to migrate a Typo blog (built on Ruby on Rails) from one RHEL 5 x86_64 server to another. To date I've done Ruby on Rails deployments using Apache with FastCGI, mongrel, and Passenger, and I've been looking for an opportunity to try out an nginx + Unicorn deployment to see how it compares. This was that opportunity, and here are the changes that I made to the stack during the migration:

I used the following packages from End Point's Yum repository for RHEL 5 x86_64:

  • nginx-0.7.64-2.ep
  • ruby-enterprise-1.8.7-3.ep
  • ruby-enterprise-rubygems-1.3.6-3.ep

The rest were standard Red Hat Enterprise Linux packages, including the new RHEL 5.5 postgresql84 packages. The exceptions were the Ruby gems, which were installed locally with the `gem` command as root.

I had to install an older version of one gem dependency manually, sqlite3-ruby, because the current version requires a newer version of sqlite than comes with RHEL 5. The installation commands were roughly:

yum install sqlite-devel.x86_64
gem install sqlite3-ruby -v 1.2.5

gem install unicorn
gem install typo

yum install postgresql84-devel.x86_64
gem install postgres

Then I followed (mostly) the instructions in Upgrading to Typo 5.4, which are still pretty accurate even though outdated by one release. One difference was the need to specify PostgreSQL to override the default of MySQL (even though the default is documented as being sqlite):

typo install /path/to/typo database=postgresql

Then I ran pg_dump on the old Postgres database and imported the data into the new database, and put in place the database.yml configuration file.

The Typo upgrade went pretty smoothly this time. I had to delete the sidebars configuration from the database to stop getting a 500 error for that, and redo the sidebars manually -- which I've had to do with every past Typo upgrade as well. But otherwise it was easy.

I first tested the migrated blog by running unicorn_rails manually in development mode. Then to have Unicorn start at boot time, I wrote this little shell script and put it in ~/bin/start-unicorn.sh:

#!/bin/bash
cd /path/to/app || exit 1
unicorn_rails -E production -D -c config/unicorn.conf.rb

Then added a cron job to run it:

@reboot bin/start-unicorn.sh

That unicorn.conf.rb file contains only:

listen 8080
worker_processes 4

The listen port 8080 is the default, but I may need to change it. Unicorn defaults to only 1 worker process, so I increased it to 4.

I added the following nginx configuration inside the http { ... } block (actually in a separate include file):

upstream app_server {
    server 127.0.0.1:8080 fail_timeout=0;
}

server {
    listen       the.ip.add.ress:80;
    server_name  the.host.name;

    location / { 
        root   /path/to/rails/typo/public/cache;

        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        #proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Host $http_host;
        proxy_redirect off;

        rewrite ^/blog/xml/atom/feed\.xml$ /articles.atom permanent;
        rewrite ^/blog/xml/rss20/feed\.xml$ /articles.rss permanent;

        if (-f $request_filename) {
            break;
        }   

        set $possible_request_filename $request_filename/index.html;
        if (-f $possible_request_filename) {
            rewrite (.*) $1/index.html;
            break;
        }   

        set $possible_request_filename $request_filename.html;
        if (-f $possible_request_filename) {
            rewrite (.*) $1.html;
            break;
        }   

        if (!-f $request_filename) {
            proxy_pass http://app_server;
            break;
        }   
    }   

    # Rails error pages
    error_page 500 502 503 504 /500.html;
    location = /500.html {
        root   /path/to/rails/typo/public;
    }   
}

The configuration was a little complicated to get nginx serving static content directly, including cache files that Typo writes out. I had to add special handling for / which gets cached as /index.html, but can't be called that when passing URIs to Typo, as it doesn't know about any /index.html. And all HTML cache files end in .html, though the URIs don't, so those need special handling too.
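
The lookup order that the nginx rules implement may be easier to follow as plain code. This Python sketch mirrors the decision only. The function name, the /srv/cache root and the sample URIs are all hypothetical, and nginx's actual rewrite processing is more involved.

```python
import os.path


def resolve(uri, cache_root, is_file=os.path.isfile):
    """Decide whether a URI is served from Typo's cache directory or proxied to Unicorn."""
    base = cache_root + uri
    candidates = (
        base,                  # an exact static file, e.g. an image
        base + "/index.html",  # "/" is cached as /index.html
        base + ".html",        # cached pages end in .html, URIs do not
    )
    for candidate in candidates:
        path = os.path.normpath(candidate)
        if is_file(path):
            return ("static", path)
    return ("proxy", uri)


if __name__ == "__main__":
    cached = {"/srv/cache/index.html", "/srv/cache/2010/05/some-post.html"}
    for uri in ("/", "/2010/05/some-post", "/admin"):
        print(uri, "->", resolve(uri, "/srv/cache", cached.__contains__))
```

Anything that falls through all three checks is a dynamic request and goes to the upstream application server.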

But when all is said and done, the blog is now running on the latest version of Typo, on the latest Unicorn, Rails, Ruby Enterprise Edition, PostgreSQL, and nginx, with all static content and fully-cached pages served directly by nginx, and for the most part only dynamic requests being served by Unicorn. I need to tweak the nginx rewrite rules a bit more to get 100% of static content served directly by nginx.

As far as blogging platforms go, I can recommend Typo mainly for Rails enthusiasts who want to write their own plugins, tweak the source, etc. WordPress or Movable Type are so much more widely used that non-programmers are going to have a lot easier time deploying and supporting them. They've had a lot more security vulnerabilities requiring updates, though that may also be a function of popularity and payoff for those exploiting them.

Rails deployment seems to take a lot of memory no matter how you do it. I don't think nginx + Unicorn uses much less RAM than Apache + Passenger; the savings are mostly the difference between nginx and Apache themselves. But using Unicorn does allow for running the application processes on another server or several servers without needing nginx or Apache on those other servers. It also provides clean separation between the web server and the application(s), including possibly different SELinux contexts rather than always httpd_sys_script_t as we see with Passenger. Passenger at least switches the child process UID to run with different permissions from the main web server, which is good. Both Passenger and Unicorn are much nicer than FastCGI, which I've always found to be a little buggy, and mongrel, which required specifying a range of ports and load-balancing across all of them in the proxy -- managing multiple port ranges is a pain with multiple apps on the same machine, especially when some need more than others.

I think if you have plenty of RAM, going with Apache + Passenger may still be the easiest Rails web deployment method overall, when mixed with other static content, server-side includes, PHP, and CGIs. But for high-traffic and custom setups, nginx + Unicorn is a nice option.

Efficiency of find -exec vs. find | xargs

This is a quick tip for anyone writing a cron job to purge large numbers of old files.

Without xargs, this is a pretty common way to do such a purge, in this case of all files older than 31 days:

find /path/to/junk/files -type f -mtime +31 -exec rm -f {} \;

But that executes rm once for every single file to be removed, which adds a ton of overhead just to fork and exec rm so many times. Even on modern operating systems that are very efficient with fork, it can easily make the I/O, load, and runtime 10 times or more what a single rm command with a lot of file arguments would need.

Instead do this:

find /path/to/junk/files -type f -mtime +31 -print0 | xargs -0 -r rm -f

That will run rm once for each very long list of files to be removed, so the overhead of fork & exec is incurred very rarely, and the job can spend most of its effort actually unlinking files. (The xargs -r option says not to run the command if there is no input to xargs.)
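
To see the difference in process launches without deleting anything that matters, here is a rough sketch that counts how many times each approach starts its command. It assumes GNU findutils; the temporary directory, the 500 empty junk files, and the sh -c 'echo run' stand-in for rm are all made up for the demonstration.

```shell
#!/bin/sh
# Throwaway directory holding 500 empty files.
tmpdir=$(mktemp -d)
i=1
while [ "$i" -le 500 ]; do
    : > "$tmpdir/junk$i"
    i=$((i + 1))
done

# "sh -c 'echo run' sh" stands in for rm: it prints one line per launch
# and ignores the file names it is handed.

# find -exec ... \; launches the command once per file.
exec_runs=$(find "$tmpdir" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# find | xargs launches the command once per batch of file names.
xargs_runs=$(find "$tmpdir" -type f -print0 | xargs -0 -r sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "find -exec launches:   $exec_runs"
echo "find | xargs launches: $xargs_runs"

rm -rf "$tmpdir"
```

The first count equals the number of files, while the second should be far smaller because all 500 short names fit comfortably in a single batch on typical systems.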

How long can the argument list to xargs be? It depends on the system, but xargs --show-limits will tell us. Here's output from a RHEL 5 x86_64 system (using findutils 4.2.27):

% xargs --show-limits
Your environment variables take up 2293 bytes
POSIX lower and upper limits on argument length: 2048, 129024
Maximum length of command we could actually use: 126731
Size of command buffer we are actually using: 126731

The numbers are similar on Debian Etch and Lenny.

And here's output from an Ubuntu 10.04 x86_64 system (using findutils 4.4.2):

% xargs --show-limits
Your environment variables take up 1370 bytes
POSIX upper limit on argument length (this system): 2093734
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2092364
Size of command buffer we are actually using: 131072

Roughly 2 megabytes of arguments is a lot. But even the POSIX minimum of 4 kB is a lot better than processing one file at a time.

It doesn't usually make much of a difference, but we can tune even more. Make sure the maximum number of files is processed at one time by first changing to the base directory so that the relative pathnames are shorter:

cd /path/to/junk/files && find . -type f -mtime +31 -print0 | xargs -0 -r rm -f

That way each file argument is shorter, e.g. ./junkfile compared to /path/to/junk/files/junkfile.
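
As a back-of-the-envelope check, this sketch estimates how many file names fit in one batch for each form, using the 131072-byte command buffer from the Ubuntu output above. It is only an approximation: it assumes every file name is as long as the example and ignores the space taken by the rm -f command itself.

```shell
#!/bin/sh
# "Size of command buffer we are actually using" from the Ubuntu 10.04 output.
buffer=131072

abs='/path/to/junk/files/junkfile'
rel='./junkfile'

# Each argument costs its length plus one byte for the terminating NUL.
abs_cost=$(( ${#abs} + 1 ))
rel_cost=$(( ${#rel} + 1 ))

echo "absolute paths: $abs_cost bytes each, about $(( buffer / abs_cost )) files per rm"
echo "relative paths: $rel_cost bytes each, about $(( buffer / rel_cost )) files per rm"
```

With these example names the relative form fits well over twice as many files into each rm invocation, and the gain grows with the depth of the base directory.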

The above assumes you're using GNU findutils, which includes find -print0 and xargs -0 for processing ASCII NUL-delimited filenames for safety when filenames include embedded spaces, newlines, etc.
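
Here is a small sketch of why the NUL delimiter matters, again assuming GNU findutils. It creates three hypothetical files in a temporary directory and counts how many arguments the command receives with and without -print0 and -0; the sh -c 'echo $#' stand-in for rm just prints its argument count, so nothing is deleted by mistake.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
: > "$tmpdir/plain"
: > "$tmpdir/with space"
: > "$tmpdir/with
newline"

# Newline-delimited: xargs splits on every blank and newline, so names
# containing them arrive as several bogus arguments.
unsafe_args=$(find "$tmpdir" -type f | xargs sh -c 'echo $#' sh)

# NUL-delimited: each file name arrives intact as exactly one argument.
safe_args=$(find "$tmpdir" -type f -print0 | xargs -0 -r sh -c 'echo $#' sh)

echo "arguments without -print0/-0: $unsafe_args"
echo "arguments with -print0/-0:    $safe_args"

rm -rf "$tmpdir"
```

With rm in place of the stand-in, the extra bogus arguments would leave the intended files in place and could remove unrelated files that happen to match the fragments.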

Why is my load average so high?

One of the most common ways people notice there's a problem with their server is when Nagios, or some other monitoring tool, starts complaining about a high load average. Unfortunately this complaint carries with it very little information about what might be causing the problem. But there are ways around that. On Linux, where I spend most of my time, the load average represents the average number of processes in either the "run" or "uninterruptible sleep" states. This code snippet will display all such processes, including their process ID and parent process ID, current state, and the process command line:

#!/bin/sh

ps -eo pid,ppid,state,cmd |\
    awk '$3 ~ /[RD]/ { print $0 }'

Most of the time, this script has simply confirmed what I already anticipated, such as, "PostgreSQL is trying to service 20 times as many simultaneous queries as normal." On occasion, however, it's very useful, such as when it points out that a backup job is running far longer than normal, or when it finds lots of "[pdflush]" operations in process, indicating that the system is working overtime to write dirty pages to disk. I hope it can be similarly useful to others.
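
When the list is long, a tally by command name can make the culprit easier to spot. This is a minimal variation on the script above, not part of the original; it assumes a ps that supports the state and comm output columns, and it keeps only the first word of command names that contain spaces.

```shell
#!/bin/sh
# Count processes in the "run" (R) or "uninterruptible sleep" (D) state,
# grouped by command name, busiest first.
busy_summary=$(ps -eo state,comm |
    awk '$1 ~ /^[RD]/ { count[$2]++ }
         END { for (c in count) print count[c], c }' |
    sort -rn)

echo "$busy_summary"
```

Each output line is a count followed by a command name, so a flood of identical database or backup processes stands out at the top.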