PostgreSQL Blog Archive

Protecting and auditing your secure PostgreSQL data

PostgreSQL functions can be written in many languages. These languages fall into two categories, "trusted" and "untrusted". Trusted languages cannot do things "outside of the database", such as writing to local files, opening sockets, sending email, or connecting to other systems. Two such languages are PL/pgSQL and PL/Perl. For "untrusted" languages, such as PL/PerlU, all bets are off: no limitations are placed on what they can do. Untrusted languages can be very powerful, and sometimes dangerous.

One of the reasons untrusted languages can be considered dangerous is that they can cause side effects outside of the normal transactional flow that cannot be rolled back. If your function writes to local disk, and the transaction then rolls back, the changes on disk are still there. Working around this is extremely difficult, as there is no way to detect when a transaction has rolled back at the level where you could, for example, undo your local disk changes.

However, there are times when this effect can be very useful. For example, in a recent thread on the PostgreSQL "general" mailing list (aka pgsql-general), somebody asked for a way to audit SELECT queries into a logging table that would survive someone doing a ROLLBACK. In other words, if you had a function named weapon_details() and wanted to have that function log all requests to it by inserting into a table, a user could simply run the query, read the data, and then roll back to thwart the auditing:


BEGIN;

SELECT weapon_details('BFG 9000'); -- also inserts to an audit table

ROLLBACK;                          -- inserts to the audit table are now gone!

Certainly there are other ways to track who is using this query, the most obvious being to enable full Postgres logging (by setting log_statement = 'all' in your postgresql.conf file). However, extracting that information from logs is no fun, so let's find a way to make that INSERT stick, even if the surrounding transaction is rolled back.

Stepping back for one second, we can see there are actually two problems here: restricting access to the data, and logging that access somewhere. The ultimate access restriction is to simply force everyone to go through your custom interface. However, in this example, we will assume that someone has psql access and needs to be able to run ad hoc SQL queries, as well as be able to BEGIN, ROLLBACK, COMMIT, etc.

Let's assume we have a table with some Very Important Data inside of it. Further, let's establish that regular users can only see some of that data, and that we need to know who asked for what data, and when. For this example, we will create a normal user named Alice:


postgres=> CREATE USER alice;
CREATE ROLE

We need a way to tell which rows are suitable for people like Alice to view. We will set up a quick classification scheme using the nifty ENUM feature of PostgreSQL:


postgres=> CREATE TYPE classification AS ENUM (
 'unclassified',
 'restricted',
 'confidential',
 'secret',
 'top secret'
);
CREATE TYPE
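
The order in which the labels are declared matters: PostgreSQL sorts ENUM values by their position in the CREATE TYPE statement, and that ordering is what makes a comparison such as security_level <= 'confidential' meaningful. As a rough illustration of the idea (a sketch in Python rather than SQL; the Classification class and the may_view helper are hypothetical and simply mirror the labels above):

```python
from enum import IntEnum

class Classification(IntEnum):
    # Members are numbered in declaration order, mirroring how
    # PostgreSQL orders ENUM labels by their position in CREATE TYPE.
    UNCLASSIFIED = 1
    RESTRICTED = 2
    CONFIDENTIAL = 3
    SECRET = 4
    TOP_SECRET = 5

def may_view(item_level, max_level=Classification.CONFIDENTIAL):
    """Return True when an item is at or below the viewer's ceiling."""
    return item_level <= max_level

print(may_view(Classification.RESTRICTED))  # True
print(may_view(Classification.SECRET))      # False
```

Had the labels been declared in a different order, the same comparison would silently allow or deny different rows, so the declaration order is effectively part of the security policy.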

Next, as a superuser, we create the table containing sensitive information, and populate it:


postgres=> CREATE TABLE weapon (
  id              SERIAL          PRIMARY KEY,
  name            TEXT            NOT NULL,
  cost            TEXT            NOT NULL,
  security_level  CLASSIFICATION  NOT NULL,
  description     TEXT            NOT NULL DEFAULT 'a fine weapon'
);
NOTICE:  CREATE TABLE will create implicit sequence "weapon_id_seq" for serial column "weapon.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "weapon_pkey" for table "weapon"
CREATE TABLE

postgres=> INSERT INTO weapon (name,cost,security_level) VALUES
 ('Crowbar',  10,  'unclassified'),
 ('M9',  200,  'restricted'),
 ('M16A2',  300,  'restricted'),
 ('M4A1',  400,  'restricted'),
 ('FGM-148 Javelin',  700,  'confidential'),
 ('Pulse Rifle',  50000,  'secret'),
 ('Zero Point Energy Field Manipulator',  'unknown',  'top secret');
INSERT 0 7

We don't want anyone but ourselves to be able to access this table, so for safety, we make some explicit revocations. We'll examine the permissions before and after we do this:


postgres=> \dp weapon
                           Access privileges
 Schema |  Name  | Type  | Access privileges | Column access privileges 
--------+--------+-------+-------------------+--------------------------
 public | weapon | table |                   | 

postgres=> REVOKE ALL ON TABLE weapon FROM public;
REVOKE

postgres=> \dp weapon

                               Access privileges
 Schema |  Name  | Type  |     Access privileges     | Column access privileges 
--------+--------+-------+---------------------------+--------------------------
 public | weapon | table | postgres=arwdDxt/postgres | 

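As you can see, the empty entry did not mean that the table was open to everyone: an empty "Access privileges" column means the default privileges are in effect, which for a table give the owner everything and everyone else nothing. What the REVOKE really does is replace that implicit default with an explicit entry granting only the postgres user permission to view or modify the table. Let's confirm that Alice cannot do anything with that table: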


postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon;
ERROR:  permission denied for relation weapon
postgres=> UPDATE weapon SET id = id;
ERROR:  permission denied for relation weapon

Alice does need to have access to parts of this table, so we will create a "wrapper function" that will query the table for us and return some results. By declaring this function as SECURITY DEFINER, it will run as if the person who created the function invoked it - in this case, the postgres user. For this example, we'll be letting Alice see the "cost and description" of exactly one item at a time. Further, we are not going to let her (or anyone else using this function) view certain items. Only those items classified as "confidential" or lower can be viewed (i.e. "confidential", "restricted", or "unclassified"). Here's the first version of our function:


postgres=> CREATE LANGUAGE plperlu;
CREATE LANGUAGE

postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$

use strict;
use warnings;

## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any surrounding whitespace
## (.*? rather than \S+, so that multi-word names such as "Pulse Rifle" match too)
$name =~ s{^\s*(.*?)\s*$}{lc $1}e;

## What is the maximum security_level that people who are 
## calling this function can view?
my $seclevel = 'confidential';

## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
  CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};

## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};

## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
   return undef;
}

## Throw an exception if we are not allowed to view this row
if (! $allowed) {
   die qq{Sorry, you are not allowed to view information on that weapon!\n};
}

## Return the requested data
return_next($row);

$bc$;
CREATE FUNCTION

The above should be fairly self-explanatory. We are using PL/Perl's built-in database access functions, such as spi_prepare, to do the actual querying. Let's confirm that this works as it should for Alice:


postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".

postgres=> SELECT * FROM weapon_details('crowbar');
  name   | cost |  description  
---------+------+---------------
 Crowbar | 10   | a fine weapon
(1 row)

postgres=> SELECT * FROM weapon_details('anvil');
 name | cost | description 
------+------+-------------
(0 rows)

postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR:  Sorry, you are not allowed to view information on that weapon!
CONTEXT:  PL/Perl function "weapon_details"

Now that we have solved the restricted access problem, let's move on to the auditing. We will create a simple table to hold information about who accessed what and when:


postgres=> CREATE TABLE data_audit (
  tablename TEXT         NOT NULL,
  arguments TEXT             NULL,
  results   INTEGER          NULL,
  status    TEXT         NOT NULL  DEFAULT 'normal',
  username  TEXT         NOT NULL  DEFAULT session_user,
  txntime   TIMESTAMPTZ  NOT NULL  DEFAULT now(),
  realtime  TIMESTAMPTZ  NOT NULL  DEFAULT clock_timestamp()
);
CREATE TABLE

The 'tablename' column simply records which table they are getting data from. The 'arguments' column is a free-form field describing what they were looking for. The 'results' column shows how many matching rows were found. The 'status' column will be used primarily to log unusual requests, such as the case where Alice looks for a forbidden item. The 'username' column records the name of the user doing the searching. Because we are using functions with SECURITY DEFINER set, this needs to be session_user, not current_user, as the latter will switch to 'postgres' within the function, and we want to log the real caller (e.g. 'alice'). The final two columns tell us when the current transaction started, and the exact time when an entry was made inside of this table. As a first attempt, we'll have our function do some simple inserts to this new data_audit table:


postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$

use strict;
use warnings;

## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any surrounding whitespace
## (.*? rather than \S+, so that multi-word names such as "Pulse Rifle" match too)
$name =~ s{^\s*(.*?)\s*$}{lc $1}e;

## What is the maximum security_level that people who are 
## calling this function can view?
my $seclevel = 'confidential';

## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
  CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};

## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};


## Log this request
$SQL = 'INSERT INTO data_audit(tablename,arguments,results,status)
  VALUES ($1,$2,$3,$4)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
$sth = spi_prepare($SQL, 'TEXT', 'TEXT', 'INTEGER', 'TEXT');
spi_exec_prepared($sth, 'weapon', $name, $rv->{processed}, $status);


## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
   return undef;
}

## Throw an exception if we are not allowed to view this row
if (! $allowed) {
   die qq{Sorry, you are not allowed to view information on that weapon!\n};
}

## Return the requested data
return_next($row);

$bc$;

However, this fails the case pointed out in the original poster's email about viewing the data within a transaction that is then rolled back. It also fails to work at all when a forbidden item is requested, as that insert is rolled back by the die() call:


postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".

postgres=> SELECT * FROM weapon_details('crowbar');
  name   | cost |  description  
---------+------+---------------
 Crowbar | 10   | a fine weapon
(1 row)

postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR:  Sorry, you are not allowed to view information on that weapon!
CONTEXT:  PL/Perl function "weapon_details"

postgres=> BEGIN;
BEGIN
postgres=> SELECT * FROM weapon_details('m9');
 name | cost |  description  
------+------+---------------
 M9   | 200  | a fine weapon
(1 row)
postgres=> ROLLBACK;
ROLLBACK

postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=> SELECT * FROM data_audit \x \g
Expanded display is on.
-[ RECORD 1 ]----------------------------
tablename | weapon
arguments | crowbar
results   | 1
status    | normal
username  | alice
txntime   | 2012-01-30 17:37:39.497491-05
realtime  | 2012-01-30 17:37:39.545891-05

How do we get around this? We need a way to commit something that will survive the surrounding transaction's rollback. The closest thing Postgres has to this at the moment is to connect back to the database with a new and entirely separate connection. Two popular ways to do so are the dblink module and the PL/PerlU language. Obviously, we are going to focus on the latter, but all of this could be done with dblink as well. Here are the additional steps to connect back to the database, do the insert, and then leave again:


postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
VOLATILE
AS $bc$

use strict;
use warnings;
use DBI;

## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any surrounding whitespace
## (.*? rather than \S+, so that multi-word names such as "Pulse Rifle" match too)
$name =~ s{^\s*(.*?)\s*$}{lc $1}e;

## What is the maximum security_level that people who are 
## calling this function can view?
my $seclevel = 'confidential';

## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
  CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};

## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = defined $row ? delete $row->{allowed} : 1;

## Log this request
$SQL = 'INSERT INTO data_audit(username,tablename,arguments,results,status)
  VALUES (?,?,?,?,?)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
my $dbh = DBI->connect('dbi:Pg:service=auditor', '', '',
  {AutoCommit=>0, RaiseError=>1, PrintError=>0});
$sth = $dbh->prepare($SQL);
my $user = spi_exec_query('SELECT session_user')->{rows}[0]{session_user};
$sth->execute($user, 'weapon', $name, $rv->{processed}, $status);
$dbh->commit();

## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
   return undef;
}

## Throw an exception if we are not allowed to view this row
if (! $allowed) {
   die qq{Sorry, you are not allowed to view information on that weapon!\n};
}

## Return the requested data
return_next($row);

$bc$;
CREATE FUNCTION

Note that because we are making external changes, we marked the function as VOLATILE, which ensures that it will be run every time it is called, and not cached in any form. We are also using a Postgres service file, via the 'dbi:Pg:service=auditor' connection string. This means that the connection information (username, password, database) is contained in an external file. This is not only tidier than hard-coding those values into this function, but safer as well, as the function itself can be viewed by Alice. Finally, note that we are passing the username explicitly in the INSERT this time: the brand new connection is no longer linked to the 'alice' user, so the column default would record the wrong name. We have to derive it ourselves from "SELECT session_user" and then pass it along.
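
The reason the separate connection matters can be seen in miniature outside of Postgres. The following is only an analogy, sketched in Python with the standard sqlite3 module rather than in PL/PerlU: two in-memory SQLite databases stand in for the caller's session and the auditor connection, and the variable names are made up for the illustration. An audit row written inside the caller's own transaction disappears on rollback, while one committed through the independent connection stays put.

```python
import sqlite3

# "main" plays the role of Alice's session; "auditor" is an entirely
# separate connection, like the DBI handle opened inside the function.
main = sqlite3.connect(":memory:")
auditor = sqlite3.connect(":memory:")

for conn in (main, auditor):
    conn.execute("CREATE TABLE data_audit (username TEXT, arguments TEXT)")
    conn.commit()

# Attempt 1: log inside the caller's own transaction, then roll back.
main.execute("INSERT INTO data_audit VALUES ('alice', 'm9')")
main.rollback()

# Attempt 2: log through the separate connection and commit right away;
# the caller's later rollback cannot touch it.
auditor.execute("INSERT INTO data_audit VALUES ('alice', 'm9')")
auditor.commit()
main.rollback()

in_txn = main.execute("SELECT count(*) FROM data_audit").fetchone()[0]
separate = auditor.execute("SELECT count(*) FROM data_audit").fetchone()[0]
print(in_txn, separate)  # 0 1
```

The trade-off is the same in both worlds: the audit write is no longer atomic with the work it describes, so a request can be logged even though the caller never ends up keeping its results.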

Once this new function is in place, and we re-run the same queries as we did before, we see three entries in our audit table:


postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=> SELECT * FROM data_audit \x \g
Expanded display is on.
-[ RECORD 1 ]----------------------------
tablename | weapon
arguments | crowbar
results   | 1
status    | normal
username  | alice
txntime   | 2012-01-30 17:56:01.544557-05
realtime  | 2012-01-30 17:56:01.54569-05
-[ RECORD 2 ]----------------------------
tablename | weapon
arguments | pulse rifle
results   | 1
status    | forbidden
username  | alice
txntime   | 2012-01-30 17:56:01.559532-05
realtime  | 2012-01-30 17:56:01.561225-05
-[ RECORD 3 ]----------------------------
tablename | weapon
arguments | m9
results   | 1
status    | normal
username  | alice
txntime   | 2012-01-30 17:56:01.573335-05
realtime  | 2012-01-30 17:56:01.574989-05

So that's the basic premise of how to solve the auditing problem. For an actual production script, you would probably want to cache the database connection by storing the handle inside of the special %_SHARED hash available to PL/Perl and PL/PerlU. Note that each user gets their own version of that hash, so Alice will not be able to create a function and have access to the same %_SHARED hash that the postgres user has access to. It's probably a good idea to simply not let users like Alice use the language at all. Indeed, that's the default when we do the CREATE LANGUAGE call as above:


postgres=>  \c postgres alice
You are now connected to database "postgres" as user "alice".

postgres=> CREATE FUNCTION showplatform()
RETURNS TEXT
LANGUAGE plperlu
AS $bc$
  return $^O;
$bc$;
ERROR:  permission denied for language plperlu

Further refinements to the actual script might include refactoring the logging bits to a separate function, writing some of the auditing data to a file on the local disk, recording the actual results returned to the user, and sending the data to another Postgres server entirely. For that matter, as we are using DBI, you could send it somewhere else entirely, such as a MySQL, Oracle, or DB2 database!

Another place for improvement would be associating each user with a security_level classification, such that any user could run the function and only see things at or below their level, rather than hard-coding the level as "confidential" as we have done here. Another nice refinement might be to always return undef (no matches) for items marked "top secret", to prevent the very existence of a top secret weapon from being deduced. :)
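
Those two refinements could be combined along the following lines. This is a minimal sketch in Python rather than PL/PerlU, and everything in it is hypothetical: the Level class, the USER_LEVEL mapping, the sample rows, and the choice to report top secret items as "no match" to anyone not cleared for them are all assumptions made for the illustration.

```python
from enum import IntEnum

class Level(IntEnum):
    UNCLASSIFIED = 1
    RESTRICTED = 2
    CONFIDENTIAL = 3
    SECRET = 4
    TOP_SECRET = 5

# Hypothetical data: the level each user is cleared for, and a few
# rows from the weapon table keyed by lower-cased name.
USER_LEVEL = {"alice": Level.CONFIDENTIAL, "bob": Level.SECRET}
WEAPONS = {
    "crowbar": Level.UNCLASSIFIED,
    "pulse rifle": Level.SECRET,
    "zero point energy field manipulator": Level.TOP_SECRET,
}

def check_access(user, name):
    """Return 'normal', 'forbidden', or 'na' (reported as no match)."""
    level = WEAPONS.get(name.strip().lower())
    if level is None:
        return "na"
    # Unknown users get the lowest clearance.
    allowed = level <= USER_LEVEL.get(user, Level.UNCLASSIFIED)
    if not allowed and level is Level.TOP_SECRET:
        # Pretend the item does not exist rather than admit it is hidden.
        return "na"
    return "normal" if allowed else "forbidden"

print(check_access("alice", "Crowbar"))      # normal
print(check_access("alice", "pulse rifle"))  # forbidden
print(check_access("bob", "pulse rifle"))    # normal
```

In this arrangement the audit table would still want to record the real outcome for a hidden item, even though the caller only ever sees an empty result.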

Some great press for College District

College District has been getting some positive press lately, the most recent being a Forbes article which talks about the success they have been seeing in the last few years.

College District is a company that sells collegiate merchandise to fans. They got their start focusing on the LSU Tigers at TigerDistrict.com and have branched out to teams such as the Oregon Ducks and Alabama Crimson Tide.

We've been working with Jared Loftus at College District for more than four and a half years. College District is running on a heavily modified Interchange system with some cool Postgres tricks. The system can support a nearly unlimited number of sites, running on 2 catalogs (1 for the admin, 1 for the front end) and 1 database. The key to the system is different schemas, fronted by views, that hide and expose records based on the database user that is connected. The great thing about this system is that Jared can choose to launch a new store within a day and be ready for sales, something he has taken advantage of in the past when a team is on fire and he sees an opportunity he can't pass up.

We are currently preparing for a re-launch of the College District site that will focus on crowd-sourced designs. Artists and fans will submit their designs and have them voted on. Some designs will be chosen to be sold, and the folks whose designs are chosen will get paid for their efforts. The goal here is to grow a community that guides what College District and the individual school sites ultimately sell.

With College District's quick growth we've also been helping them improve their order fulfillment process. This includes streamlining how orders are picked, packed and shipped. The introduction of bar code scanners will help with the accuracy and speed of the process.

We get a kick out of seeing our clients succeed, especially those that come to us with a clear vision and a good attitude, and then put the hard work in to make it happen. It's an exciting year ahead for College District and we'll be right there supporting them on the journey.

Sanitizing supposed UTF-8 data

As time passes, it's clear that Unicode has won the character set encoding wars, and UTF-8 is by far the most popular encoding, and the expected default. In a few more years we'll probably find discussion of different character set encodings to be arcane, relegated to "data historians" and people working with legacy systems.

But we're not there yet! There's still lots of migration to do before we can forget about everything that's not UTF-8.

Last week I again found myself converting data. This time I was taking data from a PostgreSQL database with no specified encoding (so-called "SQL_ASCII", really just raw bytes), and sending it via JSON to a remote web service. JSON uses UTF-8 by default, and that's what I needed here. Most of the source data was in either UTF-8, ISO Latin-1, or Windows-1252, but some was in non-Unicode Chinese or Japanese encodings, and some was just plain mangled.

At this point I need to remind you about one of the most unusual aspects of UTF-8: it has limited valid forms. Legacy encodings typically used all or most of the 255 code points in their 8-bit space (leaving point 0 for traditional ASCII NUL). While UTF-8 is compatible with 7-bit ASCII, it does not allow just any 8-bit byte in any position. See the Wikipedia summary of invalid byte sequences to know what can be considered invalid.
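
To make the "limited valid forms" point concrete, here is a small check, sketched in Python rather than Perl (the is_valid_utf8 helper is a made-up name). The same letter is valid as a two-byte UTF-8 sequence but invalid as a lone Latin-1 byte:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True when the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b"caf\xc3\xa9"))  # True: e-acute as a two-byte UTF-8 sequence
print(is_valid_utf8(b"caf\xe9"))      # False: the same letter as a lone Latin-1 byte
print(is_valid_utf8(b"plain ASCII"))  # True: 7-bit ASCII is always valid UTF-8
```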

We had no need to try to fix the truly broken data, but we wanted to convert everything possible to UTF-8 and at the very least guarantee no invalid UTF-8 strings appeared in what we sent.

I previously wrote about converting a PostgreSQL database dump to UTF-8, and used the Perl CPAN module IsUTF8.

I was going to use that again, but looked around and found an even better module, exactly targeting this use case: Encoding::FixLatin, by Grant McLean. Its documentation says it "takes mixed encoding input and produces UTF-8 output" and that's exactly what it does, focusing on input with mixed UTF-8, Latin-1, and Windows-1252.

It worked as advertised, very well. We would need to use a different module to convert some other legacy encodings, but in this case this was good enough and got the vast majority of the data right.
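
The general idea of mixed-encoding repair can be sketched in a few lines. To be clear, this is not how Encoding::FixLatin is implemented; it is a rough Python approximation of the same strategy, and the "latin_fallback" handler name and fix_latin function are made up for the example. Valid UTF-8 sequences are kept as they are, and any bytes that are not valid UTF-8 are reinterpreted as Windows-1252, falling back to Latin-1 for the few byte values Windows-1252 leaves undefined.

```python
import codecs

def _latin_fallback(err):
    """Decode the bytes UTF-8 rejected as Windows-1252, else Latin-1."""
    bad = err.object[err.start:err.end]
    try:
        text = bad.decode("cp1252")
    except UnicodeDecodeError:
        # A handful of bytes (0x81, 0x8D, ...) are undefined in cp1252.
        text = bad.decode("latin-1")
    return text, err.end  # replacement text, and where to resume decoding

codecs.register_error("latin_fallback", _latin_fallback)

def fix_latin(raw: bytes) -> str:
    """Decode bytes that may mix UTF-8 with Latin-1/Windows-1252."""
    return raw.decode("utf-8", errors="latin_fallback")

# A Latin-1 e-acute and a UTF-8 i-diaeresis in the same string:
print(fix_latin(b"caf\xe9 na\xc3\xafve"))  # café naïve
```

Like the module, a sketch of this kind cannot help with the non-Unicode Chinese or Japanese data: those bytes would be decoded without error, but into the wrong characters.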

There's even a standalone fix_latin program designed specifically for processing Postgres pg_dump output from legacy encodings, with some nice examples of how to use it.

One gotcha is similar to a catch that David Christensen reported with the Encode module in a blog post here about a year ago: If the Perl string already has the UTF-8 flag set, Encoding::FixLatin immediately returns it, rather than trying to process it. So it's important that the incoming data be a pure byte stream, or that you otherwise turn off the UTF-8 flag, if you expect it to change anything.

Along the way I found some other CPAN modules that look useful for cases where I need more manual control than Encoding::FixLatin gives:

  • Search::Tools::UTF8 - test for and/or fix bad ASCII, Latin-1, Windows-1252, and UTF-8 strings
  • Encode::Detect - use Mozilla's universal charset detector and convert to UTF-8
  • Unicode::Tussle - ridiculously comprehensive set of Unicode tools that has to be seen to be believed

Once again Perl's thriving open source/free software community made my day!

Finding PostgreSQL temporary_file problems with tail_n_mail


Image by Flickr user dirkjanranzijn

PostgreSQL does as much work as it can in RAM, but sometimes it needs to (or thinks that it needs to) write things temporarily to disk. Typically, this happens on large or complex queries in which the required memory is greater than the work_mem setting.

This is usually an unwanted event: not only is going to disk much slower than keeping things in memory, but it can cause I/O contention. For very large, not-run-very-often queries, writing to disk can be warranted, but in most cases, you will want to adjust the work_mem setting. Keep in mind that this is a very flexible setting, and can be adjusted globally (via the postgresql.conf file), per-user (via the ALTER USER command), and dynamically within a session (via the SET command). A good rule of thumb is to set it to something reasonable in your postgresql.conf (e.g. 8MB), and set it higher for specific users that are known to run complex queries. When you discover a particular query run by a normal user requires a lot of memory, adjust the work_mem for that particular query or set of queries.

How do you tell when your work_mem needs adjusting, or more to the point, when Postgres is writing files to disk? The key is the setting in postgresql.conf called log_temp_files. By default it is set to -1, which does no logging at all. Not very useful. My preferred setting is 0, which logs all temporary files that are created. Setting log_temp_files to a positive number will only log entries that have an on-disk size greater than the given number (in kilobytes). Entries about temporary files used by Postgres will appear like this in your log file:


2011-01-12 16:33:34.175 EST LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp16501.0", size 130220032

The only important part is the size, in bytes. In the example above, the size is 124 MB, which is not that small of a file, especially as it may be created many, many times. So the question becomes, how can we quickly parse the files and get a sense of which queries are causing excess writes to disk? Enter the tail_n_mail program, which I recently tweaked to add a "tempfile" mode for just this purpose.
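
Pulling the size out of such a line, and checking the arithmetic, takes only a few lines. This is a sketch in Python, not code from tail_n_mail itself (which is a Perl program); the regular expression and function name are assumptions based solely on the log line shown above.

```python
import re

# Matches the message format shown above and captures path and size.
TEMPFILE_RE = re.compile(
    r'temporary file: path "(?P<path>[^"]+)", size (?P<size>\d+)'
)

def parse_tempfile_line(line):
    """Return (path, size_in_bytes) for a temporary-file log line, else None."""
    match = TEMPFILE_RE.search(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("size"))

line = ('2011-01-12 16:33:34.175 EST LOG:  temporary file: '
        'path "base/pgsql_tmp/pgsql_tmp16501.0", size 130220032')
path, size = parse_tempfile_line(line)
print(path)                             # base/pgsql_tmp/pgsql_tmp16501.0
print(round(size / (1024 * 1024), 2))   # 124.19
```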

To enter this mode, just name your config file with "tempfile" in its name, and have it find the lines containing the temporary file information. It's also recommended you make use of the tempfile_limit parameter, which limits the results to the "top X" ones, as the report can get very verbose otherwise. An example config file and an example invocation via cron:


$ cat tail_n_mail.tempfile.myserver.txt

## Config file for the tail_n_mail program
## This file is automatically updated
## Last updated: Thu Nov 10 01:23:45 2011
MAILSUBJECT: Myserver tempfile sizes
EMAIL: greg@endpoint.com
FROM: postgres@myserver.com
INCLUDE: temporary file
TEMPFILE_LIMIT: 5

FILE: /var/log/pg_log/postgres-%Y-%m-%d.log

$ crontab -l | grep tempfile

## Mail a report each morning about tempfile usage:
0 5 * * * bin/tail_n_mail tnm/tail_n_mail.tempfile.myserver.txt --quiet

For the client I wrote this for, we run this once a day and it mails us a nice report giving the worst tempfile offenders. The queries are broken down in three ways:

  • Largest overall temporary file size
  • Largest arithmetic mean (average) size
  • Largest total size across all the same query

Here is a slightly edited version of an actual tempfile report email:


Date: Mon Nov  7 06:39:57 2011 EST
Host: myserver.example.com
Total matches: 1342
Matches from [A] /var/log/pg_log/2011-11-08.log: 1241
Matches from [B] /var/log/pg_log/2011-11-09.log:  101
Not showing all lines: tempfile limit is 5

  Top items by arithmetic mean    |   Top items by total size
----------------------------------+-------------------------------
    860 MB (item 5, count is 1)   |   17 GB (item 4, count is 447)
    779 MB (item 1, count is 2)   |    8 GB (item 2, count is 71)
    597 MB (item 7, count is 1)   |    6 GB (item 334, count is 378)
    597 MB (item 8, count is 1)   |    6 GB (item 46, count is 104)
    596 MB (item 9, count is 1)   |    5 GB (item 3, count is 63)

[1] From file B Count: 2
Arithmetic mean is 779.38 MB, total size is 1.52 GB
Smallest temp file size: 534.75 MB (2011-11-08 12:33:14.312 EST)
Largest temp file size: 1024.00 MB (2011-11-08 16:33:14.121 EST)
First: 2011-11-08 05:30:12.541 EST
Last:  2011-11-09 03:12:22.162 EST
SELECT ab.order_number, TO_CHAR(ab.creation_date, 'YYYY-MM-DD HH24:MI:SS') AS order_date,
FROM orders o
JOIN order_summary os ON (os.order_id = o.id)
JOIN customer c ON (o.customer = c.id)
ORDER BY creation_date DESC

[2] From file A Count: 71
Arithmetic mean is 8.31 MB, total size is 654 MB
Smallest temp file size: 12.12 MB (2011-11-08 06:12:15.012 EST)
Largest temp file size: 24.23 MB (2011-11-08 19:32:45.004 EST)
First: 2011-11-08 06:12:15.012 EST
Last:  2011-11-09 04:12:14.042 EST
CREATE TEMPORARY TABLE tmp_sales_by_month AS SELECT * FROM sales_by_month_view;

While it still needs a little polishing (such as showing which file each smallest/largest came from), it has already been an indispensable tool for finding queries that are causing I/O problems via frequent and/or large temporary files.

PG West 2011 Re-cap

I just recently got back from PG West 2011, and have had some time to ruminate on the experience (note to self: do elephants chew a cud?). I definitely enjoyed San Jose as the location; it's always neat to visit new places and to meet new people, and I have to say that San Jose's weather was perfect for this time of year. I was also glad to be able to renew professional relationships and meet others in the PostgreSQL community.

Topic-wise, I noticed that quite a few talks had to do with replication and virtualization; this certainly seems to be a trend in the industry in general, and has definitely been a pet topic of mine for quite a while. It's interesting to see the various problems that necessitate some form of replication (e.g. availability, read/write scaling, redundancy, etc.), the tradeoffs/considerations for each specific problem, and the wide variety of tools that are available to attack each of them.

A few high points from each of the days:

Tuesday

I had dinner with fellow PostgreSQL contributors; some I knew ahead of time, others I got to know. This was followed by additional socializing.

Wednesday

I attended a talk on PostgreSQL HA, which covered the use of traditional cluster-level warm/hot standbys, as well as a solution using pgpool and Slony. This was followed by the keynote address at the conference, given by Charles Fan, Senior Vice President from VMware. This was a high-level overview of the type of work that VMware had been doing in order to support virtualizing PostgreSQL and optimizing for running multiple PostgreSQL instances on separate VMs efficiently.

I was involved in some "lunch track" discussions, and followed this all up with several more talks covering VMware's specific offerings in more detail.

Evening was dinner and mandatory socializing.

Thursday

I went to Robert Hodges' talk about Tungsten. I had only heard of it in general terms, so it was interesting to get more specific details. Robert's talk covered the basic architecture of Tungsten, as well as how their various adapters between multiple types of databases were used to ensure that the SQL that was executed on heterogeneous clusters would account for differences in datatype representation, encoding, DDL, specific query syntax, etc; for instance when executing a CREATE TABLE statement, MySQL's AUTO_INCREMENT fields would be converted to PostgreSQL's equivalent SERIAL type. There was lots of good discussion after the presentation, and I spoke with Robert after the talk about different design/architecture choices that they made with Tungsten and we discussed differences between that and Bucardo.

At lunchtime I got to meet David Fetter's wife and baby (who looks just like him!), then gave an updated version of my Bucardo: More than just Multimaster talk. Attendance was good, around 30-35, and the audience asked plenty of questions.

After my talk, I attended one about database optimization. This is always an interesting topic for me, so I was glad to hear others' insights on the subject.

This was all followed up by mandatory socializing.

Friday

I found the talk about Translattice to be very interesting, as it highlighted specific problem domains for distributed, redundant, multi-write database clusters for more fault-tolerant applications. It struck me as utilizing some of the same ideas as Cassandra or other decentralized distributed datastores, but doing so in a way that is transparent to the use of PostgreSQL. What I found particularly interesting about this system was the use of data access/usage patterns, explicit policy, and locality to specify both the costing algorithm for accessing data as well as distributing knowledge about just where each copy of each piece of data exists. The talk, while an introduction to the system, did not skimp on the details and the presenter was happy to answer my many specific questions.

The remaining talks were fairly light-hearted. I went to one called Redis: Data Bacon for the title alone. While I still don't understand why bacon, I walked away with an appreciation of the problem domain Redis addresses and how it could be used in specific cases. The final talk I attended was about Schemaverse, a project which implements a game entirely in SQL. Each player has their own database user created that they can then use from either the web interface or even via just a regular psql connection. I can't speak for the game itself other than the overview given in the talk, but creative use/hacking of the game was explicitly encouraged, and seems like an interesting approach for testing things which may not often be stressed enough in (at least my) regular use of PostgreSQL, such as intra-database security/permissions, huge numbers of users, etc. (It didn't surprise me that this game had been a hit at DEFCON.)

This was followed by the closing session, and final goodbyes, etc. Oh, and (need I say) mandatory socializing.

Final Thoughts

I always enjoy going to PostgreSQL events, and continue to be impressed with the community that surrounds PostgreSQL. Thanks to everyone who attended, and a special thanks to Josh Drake for the work he put into it. Hope to see ya next time!

Viewing schema changes over time with check_postgres


Image by Flickr user edenpictures

Version 2.18.0 of check_postgres, a monitoring tool for PostgreSQL, has just been released. This new version has quite a large number of changes: see the announcement for the full list. One of the major features is the overhaul of the same_schema action. This allows you to compare the structure of one database to another and get a report of all the differences check_postgres finds. Note that "schema" here means the database structure, not the object you get from a "CREATE SCHEMA" command. Further, remember the same_schema action does not compare the actual data, just its structure.

Unlike most check_postgres actions, which deal with the current state of a single database, same_schema can compare databases to each other, as well as audit things by finding changes over time. In addition to having the entire system overhauled, same_schema now allows comparing as many databases as you want to each other. The arguments have been simplified, in that a comma-separated list is all that is needed for multiple entries. For example:


./check_postgres.pl --action=same_schema \
  --dbname=prod,qa,dev --dbuser=alice,bob,charlie

The above command will connect to three databases, as three different users, and compare their schemas (i.e. structures). Note that we don't need to specify a warning or critical value: we consider this an 'OK' Nagios check if the schemas match, otherwise it is 'CRITICAL'. Each database gets assigned a number for ease of reporting, and the output looks like this:


POSTGRES_SAME_SCHEMA CRITICAL: (databases:prod,qa,dev)
  Databases were different. Items not matched: 1 | time=0.54s 
DB 1: port=5432 dbname=prod user=alice
DB 1: PG version: 9.1.1
DB 1: Total objects: 312
DB 2: port=5432 dbname=qa user=bob
DB 2: PG version: 9.1.1
DB 2: Total objects: 312
DB 3: port=5432 dbname=dev user=charlie
DB 3: PG version: 9.1.1
DB 3: Total objects: 313
Language "plpgsql" does not exist on all databases:
  Exists on:  3
  Missing on: 1, 2

The second large change was a simplification of the filtering options. Everything is now controlled by the --filter argument, and basically you can tell it what things to ignore. For example:


./check_postgres.pl --action=same_schema \
  --dbname=A,B --filter=nolanguage,nosequence

The above command will compare the schemas on databases A and B, but will ignore any difference in which languages are installed, and ignore any differences in the sequences used by the databases. Most objects can be filtered out in a similar way. There are also a few other useful options for the --filter argument:

  • noposition: Ignore what order columns are in
  • noperms: Do not worry about any permissions on database objects
  • nofuncbody: Do not check function source

The final and most exciting large change is the ability to compare a database to itself, over time. In other words, you can see exactly what changed during a certain time period. We have a client using that now to send a daily report on all schema changes made in the last 24 hours, for all the databases in their system. This is a very nice thing for a DBA to receive: not only is there a nice audit trail in your email, you can answer questions such as:

  • Was this a known change, or did someone make it without letting anyone else know?
  • Did somebody fat-finger and drop an index by mistake?
  • Were the changes applied to database X also applied to database Y and Z?

To enable time-based checks, simply provide a single database to check. The first time it is run, same_schema simply gathers all the schema information and stores it on disk. The next time it is run, it detects the file, reads it in as database "2", and compares it to the current database (number "1"). The --replace argument will rewrite the file with the current data when it is done. So the cronjob for the aforementioned client is as simple as:


0 10 * * * ~/bin/check_postgres.pl --action=same_schema \
  --host=bar --dbname=abc --quiet --replace

The --quiet argument ensures that no output is given if everything is 'OK'. If everything is not okay (i.e. if differences are found), cron gets a bunch of output sent to it and duly mails it out. Thus, a few minutes after 10AM each day, a report is sent if anything has changed in the last day. Here's a slightly redacted version of this morning's report, which shows that a schema named "stat_backup" was dropped at some point in the last 24 hours (which was a known operation):


POSTGRES_SAME_SCHEMA CRITICAL: DB "abc" (host:bar)
  Databases were different. Items not matched: 1 | time=516.56s
DB 1: port=5432 host=bar dbname=abc user=postgres
DB 1: PG version: 8.3.16
DB 1: Total objects: 11863
DB 2: File=check_postgres.audit.port.5432.host.bar.db.abc
DB 2: Creation date: Sun Oct  2 10:06:12 2011  CP version: 2.18.0
DB 2: port=5432 host=bar dbname=abc user=postgres
DB 2: PG version: 8.3.16
DB 2: Total objects: 11864
Schema "stat_backup" does not exist on all databases:
  Exists on:  2
  Missing on: 1

As you can see, the first part is a standard Nagios-looking output, followed by a header explaining how we defined database "1" and "2" (the former a direct database call, and the latter a frozen version of the same.)
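The store-then-compare cycle described above can be sketched in a few lines of shell. This is only an illustration of the idea, not how check_postgres is implemented: the function name audit_snapshot, its arguments, and the use of plain text files compared with diff are all assumptions made for the example. In practice the "current" file could come from something like pg_dump --schema-only.

```shell
# Hypothetical helper, not part of check_postgres.
# Usage: audit_snapshot CURRENT_FILE SAVED_FILE [replace]
# CURRENT_FILE holds a fresh description of the schema; SAVED_FILE is
# the frozen copy kept on disk from a previous run.
audit_snapshot() {
  current=$1
  saved=$2
  replace=${3:-}

  # First run: nothing to compare against, so just store the snapshot.
  if [ ! -f "$saved" ]; then
    cp "$current" "$saved"
    echo "OK: first run, snapshot stored"
    return 0
  fi

  # Later runs: compare the frozen copy (database "2") to the
  # current state (database "1").
  if changes=$(diff "$saved" "$current"); then
    status=0
    echo "OK: no changes"
  else
    status=2                      # Nagios exit code for CRITICAL
    echo "CRITICAL: schema changed"
    printf '%s\n' "$changes"
  fi

  # Mimic --replace: overwrite the frozen copy once the comparison is done.
  if [ "$replace" = "replace" ]; then
    cp "$current" "$saved"
  fi
  return "$status"
}
```

Called once a day with the third argument set to replace, a function like this reports only what changed since the previous run, which is the behavior the cron entry above relies on.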

Sometimes you want to store more than one version at a time: for example, if you want both a daily and a weekly view. To enable this, use the --suffix argument to create different instances of the saved file. For example:


0 10 * * * ~/bin/check_postgres.pl --action=same_schema \
  --host=bar --dbname=abc --quiet --replace --suffix=daily
0 10 * * Fri ~/bin/check_postgres.pl --action=same_schema \
  --host=bar --dbname=abc --quiet --replace --suffix=weekly

The above entries would end up recreating this file every morning at 10: check_postgres.audit.port.5432.host.bar.db.abc.daily, and this file each Friday at 10: check_postgres.audit.port.5432.host.bar.db.abc.weekly.

Thanks to all the people that made 2.18.0 happen (see the release notes for the list). There are still some rough edges to the same_schema action: for example, the output could be a little more user-friendly, and not all database objects are checked yet (e.g. no custom aggregates or operator classes). Development is ongoing; patches and other contributions are always welcome. In particular, we need more translators. We have French covered, but would like to include more languages. The code can be checked out at:


git clone git://bucardo.org/check_postgres.git

There is also a github mirror if you so prefer:


https://github.com/bucardo/check_postgres

You can also file a bug (or feature request), or join one of the mailing lists: general, announce, and commit.

PostgreSQL Serializable and Repeatable Read Switcheroo

PostgreSQL allows for different transaction isolation levels to be specified. Because Bucardo needs a consistent snapshot of each database involved in replication to perform its work, the first thing that the Bucardo daemon does when connecting to a remote PostgreSQL database is:


SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE;

The 'READ WRITE' bit sets us in read/write mode, just in case the entire database has been set to read only (a quick and easy way to make your slave databases non-writeable!). It also sets the transaction isolation level to 'SERIALIZABLE'. At least, it used to. Now Bucardo uses 'REPEATABLE READ' like this:


SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ WRITE;

Why the change? In version 9.1 of PostgreSQL the concept of SSI (Serializable Snapshot Isolation) was introduced. How it actually works is a little complicated (follow the link for more detail), but before 9.1 PostgreSQL was only *sort of* doing serialized transactions when you asked for serializable mode. What it was really doing was repeatable read and not trying to really serialize the transactions. In 9.1, PostgreSQL is doing *true* serializable transactions. It also adds a new distinct 'internal' transaction mode, 'repeatable read', which does exactly what the old 'serializable' used to do. Finally, if you issue a 'repeatable read' on a pre-9.1 database, it silently upgrades it to the old 'serializable' mode.

So in summary, if your application was using 'SERIALIZABLE' before, you can now replace that with 'REPEATABLE READ' and get the exact same behavior as before, regardless of the version. Of course, if you want *true* serializable transactions, use SERIALIZABLE. It will continue to mean the same as 'REPEATABLE READ' in pre-9.1 databases, and provide true serializability in 9.1 and beyond. (I haven't determined yet if Bucardo is going to use this new level, as it comes with a little bit of overhead.)

Since this can be a little confusing, here's a handy chart showing how version 9.1 changed the meaning of SERIALIZABLE, and added a new 'internal' isolation level:

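Requested (9.0 and earlier) | Actual internal (9.0 and earlier) | Version comparison     | Actual internal (9.1 and later) | Requested (9.1 and later)
----------------------------+-----------------------------------+------------------------+---------------------------------+--------------------------
READ UNCOMMITTED            | Read committed                    | Exact same             | Read committed                  | READ UNCOMMITTED
READ COMMITTED              | Read committed                    | Exact same             | Read committed                  | READ COMMITTED
REPEATABLE READ             | Serializable                      | Functionally identical | Repeatable read                 | REPEATABLE READ
SERIALIZABLE                | Serializable                      | Functionally identical | Repeatable read                 | REPEATABLE READ
(none)                      | (none)                            | 9.1 only!              | Serializable (true)             | SERIALIZABLE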
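
For scripts that need to reason about this, the chart above can be restated as a small lookup. The following is a sketch only: the function name actual_isolation_level is made up for this example, and it assumes the server version is passed in the integer form reported by SHOW server_version_num (for example 90004 or 90100).

```shell
# Hypothetical helper mirroring the chart; not part of Postgres or Bucardo.
# Usage: actual_isolation_level SERVER_VERSION_NUM "REQUESTED LEVEL"
actual_isolation_level() {
  version=$1
  requested=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')

  case "$requested" in
    "READ UNCOMMITTED"|"READ COMMITTED")
      echo "Read committed" ;;
    "REPEATABLE READ")
      if [ "$version" -ge 90100 ]; then
        echo "Repeatable read"
      else
        echo "Serializable"          # the old, pre-9.1 behavior
      fi ;;
    "SERIALIZABLE")
      if [ "$version" -ge 90100 ]; then
        echo "Serializable (true)"   # SSI, new in 9.1
      else
        echo "Serializable"
      fi ;;
    *)
      echo "unknown isolation level: $2" >&2
      return 1 ;;
  esac
}

actual_isolation_level 90004 "repeatable read"
actual_isolation_level 90100 "serializable"
```

The names it prints are the "actual internal isolation level" labels used in the chart, so pre-9.1 REPEATABLE READ and 9.1 REPEATABLE READ come back with different labels even though they behave the same.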

Congratulations and thanks to Kevin Grittner and Dan Ports for making true serializability a reality!

Another Post-Postgres Open Post

Well, that was fun! I've always found attending conferences to be an invigorating experience. The talks are generally very informative, it's always nice to put a face to names seen online in the community, and between the "hall track", lunches, and after-session social activities it's difficult to not find engaging discussions.

My favorite presentations:

  • Scaling servers with Skytools -- seeing what it takes to balance several high-velocity nodes was intriguing.
  • Mission Impossible -- lots of good arguments for why Postgres can be an equivalent, nay, better replacement for an enterprise database.
  • The PostgreSQL replication protocol -- even if I never intend to write something that'll interact with it directly, knowing how something like the new streaming replication works under the hood goes a long way to keeping it running at a higher level.
  • True Serializable Transactions Are Here! -- I'll admit I haven't had a chance to fully check out the changes to Serializable, so getting to hear some of the reasoning and stepping through some of the use cases was quite helpful.

But what of my talks? Monitoring went well -- it seemed to get the message out. There was a lot of "gee, I have Postgres, and Nagios, but they're not talkin'. Now they can!" So hopefully, with a little more visibility into how the database is standing, the tools can boost confidence within business environments that aren't as sure about Postgres and help keep existing installations in place. I think the Bucardo presentation had me a bit more animated for some reason. That one also led to some interesting questions from the audience, and a couple challenges for the Bucardo project.

All in all, great work everyone!

Headed out to PgWest next week

I'm gearing up to head out to sunny San Jose to attend and speak at the PG West PostgreSQL conference. (Does anyone have directions...?)

I'm excited to again meet and mingle with more PostgreSQL experts and enthusiasts and look forward to the various talks, technical discussions, and social opportunities. My talk will be on Bucardo and many uses for it as a general tool. It'll also cover additional changes coming down the pipe in Bucardo 5.

I look forward to seeing everyone!

Bucardo, 9.1, and you!

A little bit of bad news for Bucardo fans: Greg Sabino Mullane won't be making Postgres Open due to scheduling conflicts. But not to worry, I'll be giving the "Postgres masters, other slaves" talk in his place.

In looking over the slides, one thing that catches my eye is how quickly Bucardo is adopting PostgreSQL 9.1 features. Specifically, Unlogged Tables will be very useful in boosting performance where Bucardo stages information about changed rows for multi-database updates. I also wonder if the enhanced Serializable Snapshot Isolation would be helpful in some situations. Innovation encouraging more innovation, gotta love open source!

If I haven't said it before, thanks to everyone who made Postgres 9.1 possible. Some of the other enhancements are just as exciting. For instance, I'm eager to see some creative uses for Writable CTEs. And it'll be very interesting to see what additional Foreign Data Wrappers pop up over time.

Now, back to packing...

Postgres Open: One week to go!

Wow, time flies, Postgres Open is almost upon us!

I'll be there giving a talk Thursday morning on monitoring tools and techniques, and possibly helping with the Bucardo 5 replication session Friday afternoon. Sadly I'll need to catch a flight shortly after that, so there won't be much time to explore Chicago around everything going on. But at least it'll be nice to get out to a conference again!

Bucardo PostgreSQL replication to other tables with customname


Image by Flickr user Soggydan

(Don't miss the Bucardo5 talk at Postgres Open in Chicago)

Work on the next major version of Bucardo is wrapping up (version 5 is now in beta), and two new features have been added to this major version. The first, called customname, allows you to replicate to a table with a different name. This is a feature people have been requesting for a long time, and it even allows you to replicate between differently named Postgres schemas. The second, called customcols, allows you to replicate to different columns on the target: not only a subset, but different column names (and types), as well as other neat tricks.

The "customname" option allows changing the table name for one or more targets. Normally, Bucardo replicates tables from the source databases to the target databases, and all tables must have the same name and schema everywhere. With the customname feature, you can change the target table names, either globally, per database, or per sync.

We'll go through a full example here, using a stock 64-bit RedHat 6.1 EC2 box (ami-5e837b37). I find EC2 a great testing platform - not only can you try different operating systems and architectures, but (as my own personal box is very customized) it is great to start afresh from a stock configuration.

First, let's turn off SELinux, install the EPEL rpm, update the box, and install a few needed packages.

# echo 0 > /selinux/enforce
# wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-5.noarch.rpm
# rpm -ivh epel-release-6-5.noarch.rpm
# yum update
# yum install emacs-nox perl-DBIx-Safe perl-DBD-Pg git postgresql-plperl
# cpan boolean

The yum update takes a while to run, but I always feel better when things are up to date. Next, we will create a new database cluster, create the /var/run/bucardo directory that Bucardo uses to store its PIDs, adjust the ultraconservative stock pg_hba.conf file, and start Postgres up:

# service postgresql initdb
# mkdir /var/run/bucardo
# chown postgres.postgres /var/run/bucardo
# emacs /var/lib/pgsql/data/pg_hba.conf
# service postgresql start

For the pg_hba.conf configuration file, because we want to be able to connect to the database as the bucardo user without actually logging into that account, we will allow access using the 'md5' (password) method instead of 'ident'. Since we don't want to bother creating a password for the postgres user, we will still allow those connections via ident. The relevant lines in the pg_hba.conf will end up like this:


# TYPE   DATABASE   USER       METHOD
local    all        postgres   ident                          
local    all        all        md5                          

At this point, we (as the postgres user) download and install Bucardo itself:

# su - postgres
$ git clone git://bucardo.org/bucardo.git
$ cd bucardo
$ perl Makefile.PL
$ make
$ sudo make install
$ bucardo install   # (enter 'p' and keep the default values)

We are now ready to start testing out the new customname feature. First we will need some data to replicate! For this demo we are going to use one of the handy sample datasets from the dbsamples project. The one we will use has a few small tables with information about towns in France. Note that the tarball does not (sadly) contain a top-level directory, so we have to create one ourselves. We will then create three identical databases holding the data from that file.

$ wget http://pgfoundry.org/frs/download.php/935/french-towns-communes-francaises-1.0.tar.gz
$ mkdir frenchtowns
$ cd frenchtowns
$ tar xvfz ../french-towns-communes-francaises-1.0.tar.gz
$ psql -c 'create database french1'
$ psql french1 -q -f french-towns-communes-francaises.sql
$ psql -c 'create database french2 template french1'
$ psql -c 'create database french3 template french1'
$ psql -c 'create database french4 template french1'

Bucardo is installed but does not know what to do yet, so we will teach Bucardo about each of the databases, and add in all the tables, grouping them into a herd in the process. Finally, we create a sync in which french1 and french2 are both source (master) databases, and french3 and french4 will be target (slave) databases.

$ bucardo add db f1 db=french1
$ bucardo add db f2 db=french2
$ bucardo add db f3 db=french3
$ bucardo add db f4 db=french4
$ bucardo add all tables herd=fherd
$ bucardo add sync wildstar herd=fherd dbs=f1=source,f2=source,f3=target,f4=target

Before starting it up, I usually raise the debug level, as it gives a much clearer picture of what is going on in the logs. It does make the logs a lot more crowded, so it is not recommended for production use:

$ echo log_level=DEBUG >> ~/.bucardorc

Next, we start Bucardo up and make sure everything is working as it should. Scanning the log.bucardo file that is generated is a great way to do this:

$ bucardo start
$ sleep 3
$ tail log.bucardo

If all goes well, you should see something very similar to this in the last lines of your log.bucardo file:


(972) [Sat Sep  3 16:18:54 2011] KID Total time for sync "wildstar" (0 rows): 0.05 seconds
(966) [Sat Sep  3 16:18:55 2011] CTL Got NOTICE ctl_syncdone_wildstar from 973 (line 1624)
(966) [Sat Sep  3 16:18:55 2011] CTL Kid 973 has reported that sync wildstar is done
(966) [Sat Sep  3 16:18:55 2011] CTL Sending NOTIFY "syncdone_wildstar" (line 1709)
(954) [Sat Sep  3 16:18:55 2011] MCP Got NOTICE syncdone_wildstar from 967 (line 749)
(954) [Sat Sep  3 16:18:55 2011] MCP Sync wildstar has finished
(954) [Sat Sep  3 16:18:55 2011] MCP Sending NOTIFY "syncdone_wildstar" (line 812)
(954) [Sat Sep  3 16:18:56 2011] MCP Got NOTICE syncdone_wildstar from 957 (Bucardo DB) (line 749)

From the above, we see that a KID finished running the sync we created, without finding any changed rows to replicate. Then there is some chatter between the different Bucardo processes. Now to test out the customname feature. We'll rename one of the tables, tell Bucardo about the change, reload the sync, and verify that all is still being replicated.

$ psql french3 -c 'ALTER TABLE regions RENAME TO tesla'
$ bucardo add customname regions tesla db=f3
$ bucardo reload wildstar
$ psql french3 -c 'truncate table tesla cascade'
TRUNCATE
$ psql french3 -t -c 'select count(*) from tesla'
0
$ psql french1 -c 'update regions set name=name'
UPDATE 26
$ psql french3 -t -c 'select count(*) from tesla'
26

In the above, the update on the regions table in the french1 database calls a trigger that notifies Bucardo that some rows have changed; Bucardo then has a KID copy the rows from the source database french1 to the other source database french2, as well as to the targets french3 and french4. The final internal DELETE and COPY that it performs on database french3 is done against the tesla table rather than the regions table.

The customname feature cannot be used to change the tables in a source database, as they must all be the same (for obvious reasons). We can, however, specify that a different schema be used for a target, as well as a different table. This only applies to Postgres targets, as other database types (e.g. MySQL) do not use schemas. Let's see that in action:

$ psql french4 -c 'create schema banana'
$ psql french4 -c 'alter table regions set schema banana'
$ psql french4 -c 'truncate table banana.regions cascade'
$ bucardo add customname regions banana.regions db=f4
$ bucardo reload wildstar
$ psql french4 -t -c 'select count(*) from banana.regions'
0
$ psql french2 -c 'update regions set name=name'
UPDATE 26
$ psql french4 -t -c 'select count(*) from banana.regions'
26

As before, the update on a source causes the changes to propagate to the other source database, as well as both targets. Note that the ALTER TABLE also moved the associated sequence for the table, so there will be warnings in Bucardo's logs about the DEFAULT values for the primary keys in the regions tables being different. Since this post is getting long, I will save the discussion of customcols for another day.

PostgreSQL log analysis / PGSI


Image by "exfordy" on Flickr

End Point recently started working with a new client (a startup in stealth mode, cannot name names, etc.) who is using PostgreSQL because of the great success some of the people starting the company have had with Postgres in previous companies. One of the things we recommend to our clients is a regular look at the database to see where the bottlenecks are. A good way to do this is by analyzing the logs. The two main tools for doing so are PGSI (Postgres System Impact) and pgfouine. We prefer PGSI for a few reasons: the output is better, it considers more factors, and it does not require you to munge your log_line_prefix setting quite as badly.

Both programs work basically the same: given a large number of log lines from Postgres, normalize the queries, see how long they took, and produce some pretty output. If you only want to look at the longest queries, it's usually enough to set your log_min_duration_statement to something sane (such as 200), and then run a daily tail_n_mail job against it. This is what we are doing with this client, and it sends a daily report that looks like this:


Date: Mon Aug 29 11:22:33 2011 UTC
Host: acme-postgres-1
Minimum duration: 2000 ms
Matches from /var/log/pg_log/postgres-2011-08-29.log: 7

[1] (from line 227)
2011-08-29 08:36:50 UTC postgres@maindb [25198]
LOG: duration: 276945.482 ms statement: COPY public.sales 
(id, name, region, item, quantity) TO stdout;

[2] (from line 729)
2011-08-29 21:29:18 UTC tony@quadrant [17176]
LOG: duration: 8229.237 ms execute dbdpg_p29855_1: SELECT 
id, singer, track FROM album JOIN artist ON artist.id = 
album.singer WHERE id < 1000 AND track <> 1

However, the PGSI program was born of the need to look at all the queries in the database, not just the slowest-running ones; the cumulative effect of many short queries can have much more of an impact on the server than a smaller number of long-running queries. Thus, PGSI looks not only at how long a query takes to run, but how many times it has run in a certain period, as well as how often it runs. All of this information is put together to give a score to each normalized query, known as the "system impact". Like the costs on a Postgres explain plan, this is a unit-less number and of little importance in and of itself - the important thing is to compare it to the other queries to see the relative impact. We also have that report emailed out; it looks similar to this (this is a text version of the HTML produced):


Log file: /var/log/pg_log/postgres-2011-08-29.log

 * SELECT (24)
 * UPDATE (1)

Query System Impact : SELECT

 Log activity from 2011-08-29 11:00:01 to 2011-08-29 11:15:01

   +----------------------------------+
   |   System Impact: | 0.15          |
   |   Mean Duration: | 1230.95 ms    |
   | Median Duration: | 1224.70 ms    |
   |     Total Count: | 411           |
   |   Mean Interval: | 4195 seconds  |
   |  Std. Deviation: | 126.01 ms     |
   +----------------------------------+

 SELECT *
  FROM albums
  WHERE track <> ? AND artist = ?
  ORDER BY artist, track
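
To make the idea of a "normalized query" concrete, here is a rough sketch that collapses literal values to ? and counts how often each resulting query shape appears. It is not PGSI's actual normalization or scoring: the function name normalize_queries and the two sed rules are assumptions for illustration, and the input is assumed to hold one statement per line.

```shell
# Hypothetical helper, not PGSI's real normalizer.
# Reads one SQL statement per line on stdin, replaces quoted strings and
# bare numbers with "?", then prints each distinct query shape with its
# count, most frequent first.
normalize_queries() {
  sed -E "s/'[^']*'/?/g; s/(^|[^A-Za-z0-9_])[0-9]+(\.[0-9]+)?/\1?/g" \
    | sort | uniq -c | sort -rn
}

# Made-up sample statements: the two SELECTs differ only in their literals.
printf '%s\n' \
  "SELECT * FROM albums WHERE track <> 1 AND artist = 'Rush'" \
  "SELECT * FROM albums WHERE track <> 7 AND artist = 'Yes'" \
  "UPDATE albums SET plays = 10 WHERE id = 42" \
  | normalize_queries
```

The two SELECT statements collapse into a single shape with a count of two, which is the kind of grouping that lets many short queries be weighed against a few slow ones.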

At this point you may be wondering how we get all the queries into the log. This is done by setting log_min_duration_statement to 0. However, most (but not all!) clients do not want full logging 24 hours a day, as this creates some very large log files. So the solution we use is to analyze a slice of the day, only. It depends on the client, but we try for about 15 minutes during a busy time of day. Thus, the sequence of events is:

  1. Turn on "full logging" by dropping log_min_duration_statement to zero
  2. Some time later, set log_min_duration_statement back to what it was (e.g. 200)
  3. Extract the logs from the time it was set to zero to when it was flipped back.
  4. Run PGSI against the log subsection we pulled out
  5. Mail the results out

All of this is run by cron. The first problem is how to update the postgresql.conf file and have Postgres re-read it, all automatically. As covered previously, we use the modify_postgres_conf.pl script for this.

The exact incantation looks like this:


0 11 * * * perl bin/modify_postgres_conf --quiet \
  --pgconf /etc/postgresql/9.0/main/postgresql.conf \
  --change log_min_duration_statement=0
15 11 * * * perl bin/modify_postgres_conf --quiet \
  --pgconf /etc/postgresql/9.0/main/postgresql.conf \
  --change log_min_duration_statement=200 --no-comment
## The above are both one line each, but split for readability here

This changes log_min_duration_statement to 0 at 11AM, and then back to 200 15 minutes later. We use the --quiet argument as this is run from cron so we don't want any output from modify_postgres_conf on success. We do want a comment when we flip it to 0, as this is the temporary state and we want people viewing the postgresql.conf file at that time to realize it (or someone just doing a "git diff"). We don't want a comment when we flip it back, as the timestamp in the comment would cause git to think the file had changed.

Now for the tricky bit: extracting out just the section of logs that we want and sending it to PGSI. Here's the recipe I came up with for this client:


16 11 * * * tac `ls -1rt /var/log/pg_log/postgres*log \
  | tail -1` \
  | sed -n '/statement" changed to "200"/,/statement" changed to "0"/ p' \
  | tac \
  | bin/pgsi.pl --quiet > tmp/pgsi.html && bin/send_pgsi.pl
## Again, the above is all one line

What does this do? First, it finds the latest file in the /var/log/pg_log directory that starts with 'postgres' and ends with 'log'. Then it uses the tac program to spool the file backwards, one line at a time ('tac' is the opposite of 'cat'). Then it pipes that output to the sed program, which prints out all lines starting with the one where we changed the log_min_duration_statement to 200, and ending with the one where we changed it to 0 (the reverse of what we actually did, as we are reading it backwards). Finally, we use tac again to put the lines back in the correct order, pipe the output to pgsi, write the output to a temporary file, and then call a quick Perl script named send_pgsi.pl which mails the temporary HTML file to some interested parties.

Why do we use tac? Because we want to read the file backwards, so as to make sure we get the correct slice of log files as delimited by the log_min_duration_statement changes. If we simply started at the beginning of the file, we might encounter other similar changes that were made earlier and not by us.
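
The pipeline is easy to try against a small sample file. The sketch below wraps the same tac and sed commands in a function and runs them on a made-up log (the log lines are invented for the demo, though the "parameter ... changed to" wording follows what Postgres writes on a reload). One detail worth knowing: sed starts a new range every time the start pattern matches again, so if the file holds an earlier 0-to-200 window, the plain recipe prints that one too. The second function, extract_latest_slice, is a variation of my own rather than part of the recipe above; it quits after the first range so only the most recent window survives.

```shell
# The recipe from the cron entry, wrapped in a function for testing.
extract_slice() {
  tac "$1" \
    | sed -n '/statement" changed to "200"/,/statement" changed to "0"/ p' \
    | tac
}

# Variation (an assumption, not part of the original recipe): quit as
# soon as the first range ends, so only the most recent window is kept.
extract_latest_slice() {
  tac "$1" \
    | sed -n '/statement" changed to "200"/,/statement" changed to "0"/{
        p
        /statement" changed to "0"/q
      }' \
    | tac
}

# Made-up log with two full-logging windows in the same file.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
LOG: parameter "log_min_duration_statement" changed to "0"
LOG: duration: 0.9 ms statement: SELECT 'earlier window'
LOG: parameter "log_min_duration_statement" changed to "200"
LOG: duration: 450.0 ms statement: SELECT 'between windows'
LOG: parameter "log_min_duration_statement" changed to "0"
LOG: duration: 1.2 ms statement: SELECT 1
LOG: duration: 3.4 ms statement: SELECT 2
LOG: parameter "log_min_duration_statement" changed to "200"
LOG: duration: 900.0 ms statement: SELECT 'after the window'
EOF

extract_latest_slice "$logfile"
```

With one log file per day and one toggle per day, both functions return the same lines; the difference only shows up when a file contains more than one window.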

All of this is not foolproof, of course, but it does not have to be, as it is very easy to run manually if something (for example the sed recipe) goes wrong, as the log file will still be there. Yes, it's also possible to grab the ranges in other ways (such as perl), but I find sed the quickest and easiest. As tempting as it was to write Yet Another Perl Script to extract the lines, sometimes a few chained Unix programs can do the job quite nicely.

Changing postgresql.conf from a script


Image by "TheBusyBrain" on Flickr

The modify_postgres_conf script for Postgres allows you to change your postgresql.conf file from the command line, via a cron job, or any time when you want to automate the process.

Postgres runs as a background daemon. The configuration parameters it runs with are stored in a file named postgresql.conf. To change the behavior of Postgres, one must usually edit this file, and then tell Postgres that you have made the changes. Sometimes all that is needed is to 'HUP' or reload Postgres. Most changes fall into this category. Other changes require a full restart of Postgres, which entails disconnecting all current clients.

Thus, to make a change, one must edit the file, find the item to change (the file consists of "name = value" lines), change it, then send a signal to the main Postgres process so it picks up the change. Finally, you should then connect to Postgres to make sure it is still running and has accepted the latest change.

Doing this automatically (such as via a cron script) is very difficult. One method, if you are doing something simple like toggling between two known configuration files, is to simply store copies of both files and replace them, like this example cronjob:


30 10 * * * cp -f conf/postgresql.conf.1 /etc/postgresql.conf; /etc/init.d/postgresql reload
50 10 * * * cp -f conf/postgresql.conf.2 /etc/postgresql.conf; /etc/init.d/postgresql reload

The major problem with that approach, as I quickly learned when I tried it, is that despite nobody making changes to the postgresql.conf file in *years*, a few days after I put the above change in place, someone decided to edit postgresql.conf. At 10:30AM the next day, their changes were blown away. A better way is to simply write a program to make the change for you. Thus, the modify_postgres_conf.pl script.

The basic usage is to tell the script where the conf file is, and list what changes you want to make. Here's an example that will change the random_page_cost to 2 on a Debian system:


./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2

Here is exactly what the script does for the above statement:

  • For each item to be changed, we:
    • Ask the database what the current value is (and die if that parameter does not exist)
    • If the current and new value are the same, do nothing
    • Otherwise, open (and flock) the configuration file and change the parameter
  • If no changes were made, exit
  • Otherwise, close the configuration file
  • Figure out the Postgres PID and send it a HUP signal
  • Reconnect to the database and confirm each change has taken effect
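
The file-editing step in the list above can be sketched in a few lines of shell. This is only an illustration under stated assumptions, not the script's actual Perl code: the function name change_param is hypothetical, the parameter must already be set explicitly (uncommented) in the file, the new value must not contain |, & or backslash characters, and the locking, HUP, and verification steps are left out.

```shell
#!/bin/sh
# Hypothetical helper (not part of modify_postgres_conf.pl): rewrite one
# "name = value" line in place, preserving the whitespace around the "=".
change_param() {
    conf=$1
    name=$2
    value=$3
    stamp="## changed by change_param on $(LC_ALL=C date)"
    tmp=$(mktemp) || return 1
    # \1 is the leading "name = " part, spacing included; everything after it
    # (the old value and any trailing comment) is replaced.
    sed -E "s|^([[:space:]]*${name}[[:space:]]*=[[:space:]]*)[^[:space:]#].*|\1${value} ${stamp}|" \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Writing the result back with cat, rather than moving the temporary file over the original, keeps the owner and permissions of postgresql.conf intact. Lines that are commented out are deliberately not matched.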

By default, it adds a comment after the changed value as well, to help in tracking down who made the change. A diff of the postgresql.conf file after running the example above produces:


diff -r1.1 postgresql.conf
499c499
< random_page_cost = 4
---
> random_page_cost = 2 ## changed by modify_postgres_conf.pl on Wed Aug 10 13:31:34 2011

The addition of the comment can be stopped by adding a --no-comment argument. If the script runs successfully, it also returns two items of information: the size and name of the current Postgres log file. This is useful so you can know exactly where in the log this change took place. Note that this only works for items that are already explicitly set in your configuration file. However, as discussed before, you should have all the items that you may possibly change explicitly listed out at the bottom of the file already. Whitespace is preserved as well, for those (like me) who like to keep things lined up neatly inside the file (see examples in the link above).

Here are some more examples of the script in action:


$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
114991 /var/log/postgres/postgres-2011-08-10.log

$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
No change made: value of "random_page_cost" is already 2

$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change random_page_cost=2 \
> --change log_statement=ddl \
> --change log_min_duration_statement=100

No change made: value of "random_page_cost" is already 2
118459 /var/log/postgres/postgres-2011-08-10.log

$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change default_statitics_target=200 --no-comment
There is no Postgres variable named "default_statitics_target"!

$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change default_statistics_target=200 --no-comment
123396 /var/log/postgres/postgres-2011-08-10.log
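
The two values printed on success (the log file's size in bytes and its path) make it easy to read only what was logged after the change. A minimal sketch, assuming the output format shown above; the function name log_since is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print everything appended to a log file after it
# had reached the given size in bytes.
log_since() {
    size=$1
    logfile=$2
    # tail -c +N starts output at byte N (1-based), so this skips the
    # first $size bytes and prints the rest.
    tail -c "+$((size + 1))" "$logfile"
}

# Example use with the first run shown above:
#   log_since 114991 /var/log/postgres/postgres-2011-08-10.log
```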

Note that we make no attempt to automatically check changes in to version control: as you will see in an upcoming blog post on a real-life use case, such a checkin is usually not wanted, as we are making temporary changes.

This is a fairly simple Perl script, but I thought I would put it out there in the hopes of helping others out (and preventing the reinventing of wheels). Of course, if you find a bug or want to write a patch for it, those are welcome additions at any time! The code can be found on github:


git clone git://github.com/bucardo/modify_postgres_config.git

Debian Postgres readline psql problem and the solutions

There was a bit of a controversy back in February as Debian decided to replace libreadline with libedit, which affected a number of apps, the most important of which for Postgres people is the psql utility. They did this because psql links to both OpenSSL and readline, and although psql is compatible with both, they are not compatible with each other!

By compatible, I mean that the licenses they use (OpenSSL and readline) are not, in one strict interpretation, allowed to be used together. Debian attempts to live by the letter and spirit of the law as closely as possible, and thus determined that they could not bundle both together. Interestingly, Red Hat does still ship psql using OpenSSL and readline; apparently their lawyers reached a different conclusion. Or perhaps they, as a business, are being more pragmatic than strictly legal, as it's very unlikely there would be any consequence for violating the licenses in this way.

While libreadline (the library for GNU readline) is a feature rich, standard, mature, and widely used library, libedit (sadly) is not as developed and has some important bugs and shortcomings (including no home page, apparently, and no Wikipedia page!). This resulted in frustration for many Debian users, who found that their command-line history commands in psql no longer worked, and worse, psql no longer supported non-ASCII input! Since I came across this problem recently on a client machine, I thought I would lay out the current solutions.

The first and easiest solution is to simply upgrade. Debian has made a "workaround" by forcing psql to use the readline library when it is invoked.

The next best solution, for those rare cases when you cannot upgrade, is to apply Debian's solution yourself by patching the 'pg_wrapper' program that Debian uses. In order to support running different versions of Postgres on the same box in a sane and standard fashion, Debian uses some wrapper scripts around some of the Postgres command-line utilities such as psql. Thus, the psql command in /usr/bin/psql is actually a symlink to the Perl script pg_wrapper, which parses some arguments and then calls the actual psql binary, which is no longer in the default path. So, to apply the Debian fix, just patch your pg_wrapper file like so:


*** pg_wrapper  2011/07/18 03:46:49     1.1
--- pg_wrapper  2011/07/18 03:48:23
***************
*** 94,100 ****
  }
  
  error 'Invalid PostgreSQL cluster version' unless -d "/usr/lib/postgresql/$version";
! my $cmd = get_program_path (((split '/', $0)[-1]), $version);
  error 'pg_wrapper: invalid command name' unless $cmd;
  unshift @ARGV, $cmd;
  exec @ARGV;
--- 94,110 ----
  }
  
  error 'Invalid PostgreSQL cluster version' unless -d "/usr/lib/postgresql/$version";
! my $cmdname = (split '/', $0)[-1];
! my $cmd = get_program_path ($cmdname, $version);
! 
! # libreadline is a lot better than libedit, so prefer that                                                                  
! if ($cmdname eq 'psql') {
!     my @readlines = sort(</lib/libreadline.so.?>);
!     if (@readlines) {
!       $ENV{'LD_PRELOAD'} = ($ENV{'LD_PRELOAD'} or '') . ':' . $readlines[-1];
!     }
! }
! 
  error 'pg_wrapper: invalid command name' unless $cmd;
  unshift @ARGV, $cmd;
  exec @ARGV;

As you can see, what Debian has done is set the LD_PRELOAD environment variable to point to the libreadline shared object, which means that when psql is started, it uses the libreadline library instead of libedit. This is great news for Debian users. I'm unconvinced of how "legal" this is per Debian's standards, but then I'm in the camp that thinks they are interpreting all the licensing around this in the wrong way, and should have just left libreadline alone.

The next best solution, after patching pg_wrapper, is to simply define LD_PRELOAD yourself, either globally or per user.
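
For the per-user route, something along these lines in a shell startup file should work. This is a sketch only: the function name pick_readline is hypothetical, and the /lib default is an assumption; the library may live elsewhere on your system.

```shell
#!/bin/sh
# Hypothetical helper: print the highest-versioned libreadline found in a
# directory (default /lib, which is an assumption about your system).
pick_readline() {
    dir=${1:-/lib}
    # The glob matches names such as libreadline.so.5 or libreadline.so.6
    ls "$dir"/libreadline.so.? 2>/dev/null | sort | tail -n 1
}

# Possible use in ~/.bashrc, affecting psql only rather than every program:
#   lib=$(pick_readline)
#   if [ -n "$lib" ]; then alias psql="LD_PRELOAD=$lib psql"; fi
```

Scoping the variable to psql through an alias is a deliberate choice here: exporting LD_PRELOAD globally would preload the library into every program started from that shell.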

Another solution is to use the 'rlwrap' program, which is a wrapper around some arbitrary program (in this case, psql) which routes the user input through readline. So a quick alias would be:


alias p='rlwrap psql --no-readline'

(Yes, we could also use -n, but it's an alias and thus we don't have to type it out each time, so it's better to be more verbose). The rlwrap solution is a quick hack, and I do not recommend it, as it still leaves out many psql features, such as autocompletion and ctrl-c support.

All of this is not strictly Debian's fault. If you read the various Debian bug reports as well as some of the Postgres mailing list threads about this topic, you will find there is plenty of finger pointing going around. It seems to me the least guilty party here is readline itself, whose only fault is that it is GPL and not a better license ;). Debian should take a little blame, both for being too strict in what is obviously a very uncharted legal licensing mess, and for making this change so quickly without any announcement and apparently without realizing how many things would break. The worst offender appears to be OpenSSL, which apparently is being stubborn about changing its license to allow linking with the GPL readline. I'll throw a little bit of blame towards libedit as well, merely for its inability to keep up with 20th century ideas like Unicode (because whose database doesn't need more 麟?).

The current Debian "solution" has stilled the waters a little bit, but we (Postgres) really need a long-term solution. Or solutions, as the case may be. As with my previous post, the big question there is "who shall put the bell on the cat"? I'd like to see Debian itself fund some work into improving libedit, since they are strongly encouraging use of it over libreadline. That's solution one: improve libedit such that it becomes a decent readline replacement. This is nice because as great as libreadline is, it's one of the only pieces of Postgres that uses the GPL, and it would be nice to get rid of it for that reason alone (the other big one is PostGIS).

Another solution is to replace OpenSSL, since they apparently are never going to change their license, despite it being in everyone's best interest. GnuTLS is an oft-mentioned replacement, which seems to be production ready, unlike libedit. The problem here is that psql has a lot of "openssl-isms" in the code. However, that is something that can be accomplished by the Postgres community.

Another option is to get readline to make an exception so it can play nicely with OpenSSL. Not only is this unlikely to happen, I think it's a band-aid and I'd rather see the above two actions happen instead.

So, in summary, there are really two ways out of this mess: fix up libedit (hello Debian community) and allow Postgres support for GnuTLS (or other non-OpenSSL system for that matter) (hello Postgres community).

For those wanting to dig into this some more, Greg Smith's excellent summation in this thread is a great read.

Announcing pg_blockinfo!

I'm pleased to announce the initial release of pg_blockinfo. It is a tool to examine your PostgreSQL heap data files, written in Perl.

Similar in purpose to pg_filedump, it is used to display (and soon validate) buffer-page-level information for PostgreSQL page/heap files.

pg_blockinfo aims to work in a portable and non-destructive way, using read-only "mmap", sys-level IO functions, and "unpack" in order to minimize any Perl overhead.

What we buy for the compromise of writing this in Perl instead of C is two-fold:

  1. portability/low impact: pg_blockinfo has no other dependencies than Perl and several core Perl modules and will work in environments where you can't or won't easily install other packages or compile based on specific headers.
  2. expressibility: while not currently supported in full, one of pg_blockinfo's future goals is to allow you to specify criteria for display of both page-level and tuple-level info. It will allow you to define arbitrary Perl expressions to filter the objects you're looking at (i.e., pages, tuples, etc; think "grep" but on a tuple level). It will support a DSL for querying based off of the named fields as well as allow you to supply arbitrary Perl for scanning for any criteria.

Requirements

We require a perl version with PerlIO ":mmap" support, which basically means any perl >= 5.8. We do not require any non-core perl modules; currently we only use Data::Dumper and Getopt::Long for debugging and option parsing respectively, the former only when requested.

Getting pg_blockinfo

The canonical git repo for development for pg_blockinfo is located at github:

http://github.com/machack666/pg_blockinfo/

For the development repo, simply run:

$ git clone git://github.com/machack666/pg_blockinfo.git

Or you can just grab the current script directly here:

https://raw.github.com/machack666/pg_blockinfo/master/pg_blockinfo

Using pg_blockinfo

To get help including available options, canonical and alternate/abbreviated names of recognized fields, range syntax:

$ pg_blockinfo -h

To dump all fields for all page headers for all pages in a relation:

$ pg_blockinfo /path/to/relfile

To include only specific fields in the output you can specify multiple -f options and/or include multiple options per -f argument by comma delimiting. Field specifiers are processed in order, so only the final logical set will be included.

"all" is a special shorthand type which will expand to all known columns. pg_blockinfo may support other shorthand groups in the future. When no fields are provided explicitly, "all" is implicitly assumed.

There are both positive and negative field inclusions. An example of a positive inclusion is:

$ pg_blockinfo /path/to/relfile -f prune_xid,tli

This will display only the indicated fields in question for all blocks in relfile. To include all fields *except* certain ones, prefix their name with a '-' sign:

$ pg_blockinfo -f -pagesize_version /path/to/relfile

This will display all page header fields in all blocks with the exception of the pagesize_version header field.

One consequence of the way these field display options are designed (particularly going forward with additional field/tuple display options) is that you could define a "view" of the column data using a shell alias, then add/remove columns/criteria by passing additional -f options to it:

# using this as a shorthand to display just those fields
$ alias lsn='pg_blockinfo -f lsn_seq,lsn_off,tli'
$ lsn -f -tli /path/to/foo                          # remove fields from the display
$ lsn -f prune_xid /path/to/foo                     # or add to the list as well

Similar functionality is available for selecting the specific blocks available using the range option (-r or -b), which lets you specify a range of blocks to look at instead of the entire file.

$ pg_blockinfo -r 2-49 /path/to/relfile
$ pg_blockinfo -r -100 /path/to/relfile
$ pg_blockinfo -r 2,4,120-140,0xFF-0x1100 /path/to/relfile

Range options can be provided multiple times, each with one or more comma-delimited block-range specifications. Blocks are numbered from 0, can be provided in decimal or hexadecimal (when prefixed via 0x), and can appear singly or in a range (bounded or unbounded) when separated by a hyphen.
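
To make the range syntax concrete, here is a rough shell sketch of how such a specification could be expanded into first/last block pairs. It is not pg_blockinfo's own parser: the function name expand_ranges is hypothetical, and treating a missing lower bound as block 0 and a missing upper bound as the last block in the file is an assumption about what "unbounded" means.

```shell
#!/bin/sh
# Hypothetical sketch, not pg_blockinfo's own parser: turn a range spec into
# one "first last" pair of decimal block numbers per line.
expand_ranges() {
    spec=$1
    last=$2             # highest block number in the file, for open-ended ranges
    old_ifs=$IFS
    IFS=,
    set -- $spec        # split the spec on commas
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*) first=${part%%-*}; final=${part#*-} ;;   # a range
            *)   first=$part; final=$part ;;              # a single block
        esac
        # Assumption: a missing bound means "from block 0" or "to the last block"
        [ -n "$first" ] || first=0
        [ -n "$final" ] || final=$last
        # Shell arithmetic converts 0x-prefixed hexadecimal to decimal
        printf '%s %s\n' "$(($first))" "$(($final))"
    done
}

# Example: expand_ranges 2,4,120-140,0xFF-0x1100 5000
```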

Planned future features/TODO

In no particular order:

  • dump tuples/tuple headers.
  • better output/interpretation of bitflags.
  • support alternate structures to allow detection/specification of different target versions of the page/tuple headers.
  • allow querying/filtering pages/tuples.
  • validation/sanity checking of various pages.
  • actual extraction of ranges in the heap file.
  • extract/dump tuples by raw ctid.
  • allow arbitrary expressions to define powerful filtering options when querying/displaying information about the tuples/data files.
  • detection of invalid toast tuple pointers/corrupted lz_compressed data (will require connection to the active system catalog).
  • detect relfile type?
  • mvcc queries against tuples at a given arbitrarily-constructed snapshot
  • detect xids that are invalid (i.e. map to non-existent pages in the pg_clog directory).
  • better/shorter name?

I look forward to any feedback, patches, or other improvements/interest.

DBD::Pg UTF-8 for PostgreSQL server_encoding

We are preparing to make a major version bump in DBD::Pg, the Perl interface for PostgreSQL, from the 2.x series to 3.x. This is due to a reworking of how we handle UTF-8. The change is not going to be backwards compatible, but will probably not affect many people. If you are using the pg_enable_utf8 flag, however, you definitely need to read on for the details.

The short version is that DBD::Pg is going to return all strings from the Postgres server with the Perl utf8 flag on. The sole exception will be databases in which the server_encoding is SQL_ASCII, in which case the flag will never be turned on.

For backwards compatibility and fine-tuning control, there is a new attribute called pg_utf8_strings that can be set at connection time to override the decision above. For example, if you need your connection to return byte-soup, non-utf8-marked strings, despite coming from a UTF-8 Postgres database, you can say:


  my $dsn = 'dbi:Pg:dbname=foobar';
  my $dbh = DBI->connect($dsn, $dbuser, $dbpass,
    { AutoCommit => 0,
      RaiseError => 0,
      PrintError => 0,
      pg_utf8_strings => 0,
    }
  );

Similarly, you can set pg_utf8_strings to 1 and it will force returned strings to be marked as utf8, even if the backend is SQL_ASCII. You should not be using SQL_ASCII of course, and certainly not forcing the strings returned from it to UTF-8. :)

All Perl variables (be they strings or otherwise) are actually Perl objects, with some internal attributes defined on them. One of those is the utf8 flag, which can be flipped on to indicate that the string should be treated as possibly containing multi-byte characters, or it can be left off, to indicate the string should always be treated on a byte-by-byte basis. This will affect things like the Perl length function, and the Perl \w regex character class. This is completely unrelated to the Perl pragma use utf8, which DBD::Pg has nothing at all to do with. Have I mentioned that UTF-8, and UTF-8 in Perl in particular, can be quite confusing?

There are a few exceptions as to what things DBD::Pg will mark as utf8. Integers and other numbers will not, boolean values will not, and no bytea data will ever have the flag set. When in doubt, assume that it is set.

The old attribute, pg_enable_utf8, will be deprecated, and have no effect. We thought about re-using that but it seemed clearer and cleaner to simply create a new variable (pg_utf8_strings), as the behavior has significantly changed.

A beta version of DBD::Pg (2.99.9_1) with these changes has been uploaded to CPAN for anyone to experiment with. Right now, none of this is set in stone, but we did want to get a working version out there to start the discussion and see how it interacts with applications that were making use of the pg_enable_utf8 flag. You can web search for "dbdpg" and look for the "Latest Dev. Release", or jump straight to the page for DBD::Pg 2.99.9_1. The underscore in the version number is a CPAN convention that indicates this is a development version only, and thus will not replace the latest production version (2.18.1 as of this writing).

As a reminder, DBD::Pg has switched to using git, so you can follow along with the development with:


git clone git://bucardo.org/dbdpg.git

There is also a commits mailing list you can join to receive notifications of commits as they are pushed to the main repo. To sign up, send an email to dbd-pg-changes-subscribe@perl.org.

DBD::Pg moves to git!

Just a note to everyone that the official source code repository for DBD::Pg, the DBI driver for PostgreSQL, has moved from its old home in SVN to a git repository. All development has now moved to this repo.

We have imported the SVN revision history, so it's just a matter of pointing your git clients to:

$ git clone git://bucardo.org/dbdpg.git

For those who prefer, there is a github mirror:

$ git clone git://github.com/bucardo/dbdpg.git

Git is available via many package managers or by following the download links at http://git-scm.com/download for your platform.

Enjoy!

MongoDB replication from Postgres using Bucardo

One of the features of the upcoming version of Bucardo (a replication system for the PostgreSQL RDBMS) is the ability to replicate data to things other than PostgreSQL databases. One of those new targets is MongoDB, a non-relational 'document-based' database. (To be clear, we can only use MongoDB as a target, not as a source.)

To see this in action, let's set up a quick example, modified from the earlier blog post on running Bucardo 5. We will create a Bucardo instance that replicates from two Postgres master databases to a Postgres database target and a MongoDB instance target. We will start by setting up the prerequisites:


sudo aptitude install postgresql-server \
perl-DBIx-Safe \
perl-DBD-Pg \
postgresql-contrib

Getting Postgres up and running is left as an exercise to the reader. If you have problems, the friendly folks at #postgresql on irc.freenode.net will be able to help you out.

Now for the MongoDB parts. First, we need the server itself. Your distro may have it already available, in which case it's as simple as:


aptitude install mongodb

For more installation information, follow the links from the MongoDB Quickstart page. For my test box, I ended up installing from source by following the directions at the Building for Linux page.

Once MongoDB is installed, we will need to start it up. First, create a place for MongoDB to store its data, and then launch the mongodb process:


$  mkdir /tmp/mongodata
$  mongod --dbpath=/tmp/mongodata --fork --logpath=/tmp/mongo.log
all output going to: /tmp/mongo.log
forked process: 428

You can perform a quick test that it is working by invoking the command-line shell for MongoDB (named "mongo" of course). Use quit() to exit:


$  mongo
MongoDB shell version: 1.8.1
Fri Jun 10 12:45:00
connecting to: test
> quit()
$ 

The other piece we need is a Perl driver so that Bucardo (which is written in Perl) can talk to the MongoDB server. Luckily, there is an excellent one available on CPAN named 'MongoDB'. We started the MongoDB server before doing this step because the driver we will install needs a running MongoDB instance to pass all of its tests. The module has very good documentation available on its CPAN page. Installation may be as easy as:


$  sudo cpan MongoDB

If that did not work for you (case matters!), there are more detailed directions on the Perl Language Center page.

Our next step is to grab the latest Bucardo, install it, and create a new Bucardo instance. See the previous blog post for more details about each step.


$ git clone git://bucardo.org/bucardo.git
Initialized empty Git repository...

$ cd bucardo
$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for Bucardo
$ make
cp bucardo.schema blib/share/bucardo.schema
cp Bucardo.pm blib/lib/Bucardo.pm
cp bucardo blib/script/bucardo
/usr/bin/perl -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/bucardo
Manifying blib/man1/bucardo.1pm
Manifying blib/man3/Bucardo.3pm
$ sudo make install
Installing /usr/local/lib/perl5/site_perl/5.10.0/Bucardo.pm
Installing /usr/local/share/bucardo/bucardo.schema
Installing /usr/local/bin/bucardo
Installing /usr/local/share/man/man1/bucardo.1pm
Installing /usr/local/share/man/man3/Bucardo.3pm
Appending installation info to /usr/lib/perl5/5.10.0/i386-linux-thread-multi/perllocal.pod
$ sudo mkdir /var/run/bucardo
$ sudo chown $USER /var/run/bucardo
$ bucardo install
This will install the bucardo database into an existing Postgres cluster.
...
Installation is now complete.

Now we create some test databases and populate with pgbench:


$ psql -c 'create database btest1'
CREATE DATABASE
$ pgbench -i btest1
NOTICE:  table "pgbench_branches" does not exist, skipping
...
creating tables...
10000 tuples done.
20000 tuples done.
...
100000 tuples done.
$ psql -c 'create database btest2 template btest1'
CREATE DATABASE
$ psql -c 'create database btest3 template btest1'
CREATE DATABASE
$ psql btest3 -c 'truncate table pgbench_accounts'
TRUNCATE TABLE

$ bucardo add db t1 dbname=btest1
Added database "t1"
$ bucardo add db t2 dbname=btest2
Added database "t2"
$ bucardo add db t3 dbname=btest3
Added database "t3"
$ bucardo list dbs
Database: t1  Status: active  Conn: psql -p 5432 -U bucardo -d btest1
Database: t2  Status: active  Conn: psql -p 5432 -U bucardo -d btest2
Database: t3  Status: active  Conn: psql -p 5432 -U bucardo -d btest3

$ bucardo add tables pgbench_accounts pgbench_branches pgbench_tellers herd=therd
Created herd "therd"
Added table "public.pgbench_accounts"
Added table "public.pgbench_branches"
Added table "public.pgbench_tellers"

$ bucardo list tables
Table: public.pgbench_accounts  DB: t1  PK: aid (int4)
Table: public.pgbench_branches  DB: t1  PK: bid (int4)
Table: public.pgbench_tellers   DB: t1  PK: tid (int4)

The next step is to add in our MongoDB instance. The syntax is the same as the "add db" above, but we also tell it the type of database, as it is not the default of "postgres". We will also assign an arbitrary database name, "btest1", the same as the others. Everything else (such as the port and host) is default, so all we need to say is:


$  bucardo add db m1 dbname=btest1 type=mongo
Added database "m1"
$  bucardo list dbs
Database: m1  Type: mongo     Status: active  
Database: t1  Type: postgres  Status: active  Conn: psql -p 5432 -U bucardo -d btest1
Database: t2  Type: postgres  Status: active  Conn: psql -p 5432 -U bucardo -d btest2
Database: t3  Type: postgres  Status: active  Conn: psql -p 5432 -U bucardo -d btest3

Next we group our databases together and assign them roles:


$  bucardo add dbgroup tgroup  t1:source  t2:source  t3:target  m1:target
Created database group "tgroup"
Added database "t1" to group "tgroup" as source
Added database "t2" to group "tgroup" as source
Added database "t3" to group "tgroup" as target
Added database "m1" to group "tgroup" as target

Note that "target" is the default action, so we could shorten that to:


$  bucardo add dbgroup tgroup t1:source  t2  t3  m1

However, I think it is best to be explicit, even if it does (incorrectly) hint that m1 could be anything *other* than a target. :)

We are almost ready to go. The final step is to create a sync (a basic replication event in Bucardo), then we can start up Bucardo, put some test data into the master databases, and 'kick' the sync:


$  bucardo add sync mongotest  herd=therd  dbs=tgroup  ping=false
Added sync "mongotest"

$  bucardo start
Checking for existing processes
Starting Bucardo

$  pgbench -t 10000 btest1
starting vacuum...end.
transaction type: TPC-B (sort of)
number of transactions actually processed: 10000/10000
...
tps = 503.300595 (excluding connections establishing)
$  pgbench -t 10000 btest2
number of transactions actually processed: 10000/10000
...
tps = 408.059368 (excluding connections establishing)
$  bucardo kick mongotest

We'll give it a few seconds to replicate those changes (it took 18 seconds on my test box), and then check the output of bucardo status:


$  bucardo status
PID of Bucardo MCP: 3317
 Name        State    Last good    Time    Last I/D/C    Last bad    Time  
===========+========+============+=======+=============+===========+=======
 mongotest | Good   | 21:57:47   | 11s   | 6/36234/898 | none      |

Looks good, but what about the data in MongoDB? Let's get some counts from the Postgres masters and slave, and then look at the data inside MongoDB with the mongo command-line client:


$  psql btest1 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$  psql btest2 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$  psql btest3 -c 'SELECT count(*) FROM pgbench_accounts'
18106
$  psql btest1 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$  psql btest2 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$  psql btest3 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010

$  mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:46:00
connecting to: btest1
> show collections
bucardo_status
pgbench_accounts
pgbench_branches
pgbench_tellers
system.indexes
>  db.pgbench_accounts.count()
18106
>  db.pgbench_accounts.find().sort({abalance:1}).limit(1).next()
{
  "_id" : ObjectId("4df39bcb8795839660001de5"),
  "abalance" : -12071,
  "aid" : 84733,
  "bid" : 1,
  "filler" : "               "
}
> db.pgbench_accounts.find().sort({abalance:-1}).limit(1).next()
{
  "_id" : ObjectId("4df39bd08795839660002fb0"),
  "abalance" : 13010,
  "aid" : 45500,
  "bid" : 1,
  "filler" : "               "
}

Why the difference in counts? We only started replicating after we populated the Postgres tables on the master databases with 100,000 rows, so the eighteen thousand is the number of rows that were changed during the subsequent pgbench run. (Note that pgbench uses randomness, so your numbers will be different than the above.) In the future Bucardo will support the "onetimecopy" feature for MongoDB, but until then we can fully populate the pgbench_accounts collection simply by "touching" all the records on one of the masters:


$ psql btest1 -c 'UPDATE pgbench_accounts SET aid=aid'
UPDATE 100000
$ bucardo kick mongotest
Kicked sync mongotest
$ echo 'db.pgbench_accounts.count()' | mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:47:00
connecting to: btest1
> 100000
> bye

A nice feature of MongoDB is its autovivification ability (aka dynamic schemas), which means unlike Postgres you do not have to create your tables first, but can simply ask MongoDB to do an insert, and it will create the table (or, in mongospeak, the collection) automatically for you.

Because MongoDB has no concept of transactions, and because Bucardo does not update, but does deletes plus inserts (for reasons I'll not get into today), there is one more trick Bucardo does when replicating to a MongoDB instance. A collection named 'bucardo_status' is created and updated at the start and the end of a sync (a replication event). Thus, your application can pause if it sees this table has a 'started' value, and wait until it sees 'complete' or 'failed'. Not foolproof by any means, but better than nothing :) You should, of course, carefully consider the way your app and Bucardo will coordinate things.

Feedback from Postgres or MongoDB folk is much appreciated: there are probably some rough edges, but as you can see from above, the basics are there and working. Feel free to email the bucardo-general mailing list or make a feature request / bug report on the Bucardo Bugzilla page.

Bucardo multi-master for PostgreSQL


The next version of Bucardo, a replication system for Postgres, is almost complete. The scope of the changes required a major version bump, so this Bucardo will start at version 5.0.0. Much of the innards was rewritten, with the following goals:

Multi-master support

Where "multi" means "as many as you want"! There are no more pushdelta (master to slaves) or swap (master to master) syncs: there is simply one sync where you tell it which databases to use, and what role they play. See examples below.

Ease of use

The bucardo program (previously known as 'bucardo_ctl') has been greatly improved, making all the administrative tasks such as adding tables, creating syncs, etc. much easier.

Performance

Much of the underlying architecture was improved, and sometimes rewritten, to make things go much faster. Most striking is the difference between the old multi-master "swap syncs" and the new method, which has been described as "orders of magnitudes" faster by early testers. We use async database calls whenever possible, and no longer have the bottleneck of a single large bucardo_delta table.

Improved logging

Not only are more details provided, there is now the ability to control how verbose the logs are. Just set the log_level parameter to terse, normal, verbose, or debug. Those with busy systems, where the old logging was the equivalent of a 'debug' firehose, will really appreciate this.

Different targets

Who says your slave (target) databases need to be Postgres? In addition to the ability to write text SQL files (for say, shipping to a different system), you can have Bucardo push to other systems as well. Stay tuned for more details on this. (Update: there is a blog post about using MongoDB as a target)


This new version is not quite at beta yet, but you can try out a demo of multi-master on Postgres quite easily. Let's see if we can do it in ten steps.

I. Download all prerequisites

To run Bucardo, you will need a Postgres database (obviously), the DBIx::Safe module, the DBI and DBD::Pg modules, and (for the purposes of this demo) the pgbench utility. Systems vary, but on aptitude-based systems, one can grab all of the above like this:


aptitude install postgresql-server \
perl-DBIx-Safe \
perl-DBD-Pg \
postgresql-contrib

II. Grab the latest Bucardo


git clone git://bucardo.org/bucardo.git

III. Install the program


cd bucardo
perl Makefile.PL
make
sudo make install

You can ignore any errors that come up about ExtUtils::MakeMaker not being recent.

IV. Set up an instance of Bucardo

This step assumes there is a running Postgres available to connect to.


sudo mkdir /var/run/bucardo
sudo chown $USER /var/run/bucardo
bucardo install

V. Use the pgbench program to create some test tables


psql -c 'CREATE DATABASE btest1'
pgbench -i btest1
psql -c 'CREATE DATABASE btest2 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest3 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest4 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest5 TEMPLATE btest1'

VI. Tell Bucardo about the databases and tables you are going to use


bucardo add db t1 dbname=btest1
bucardo add db t2 dbname=btest2
bucardo add db t3 dbname=btest3
bucardo add db t4 dbname=btest4
bucardo add db t5 dbname=btest5
bucardo list dbs

bucardo add table pgbench_accounts pgbench_branches pgbench_tellers herd=therd
bucardo list tables

A herd is simply a logical grouping of tables. We did not add the other pgbench table, pgbench_history, because it has no primary key or unique index.

VII. Group the databases together and set their roles


bucardo add dbgroup tgroup t1:source t2:source t3:source t4:source t5:target

We've grouped all five databases together, and made four of them masters (aka source), and one of them a slave (aka target). You can have any combination of masters and slaves you want, as long as there is at least one master.
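To make the "at least one master" rule concrete, here is a rough Python sketch of how name:role pairs like the ones above could be split and checked. This is only an illustration, not how Bucardo itself parses its arguments: the parse_dbgroup function is hypothetical, and it insists on an explicit role for every database to keep things simple.

```python
def parse_dbgroup(members):
    """Split 'name:role' strings into a {name: role} mapping and validate it."""
    roles = {}
    for member in members:
        name, sep, role = member.partition(":")
        if not sep or role not in ("source", "target"):
            raise ValueError(f"expected name:source or name:target, got {member!r}")
        roles[name] = role
    # The one hard rule from the text: somebody has to be a master.
    if "source" not in roles.values():
        raise ValueError("a group needs at least one source (master)")
    return roles


group = parse_dbgroup(["t1:source", "t2:source", "t3:source", "t4:source", "t5:target"])
sources = sorted(name for name, role in group.items() if role == "source")
targets = sorted(name for name, role in group.items() if role == "target")
print("sources:", sources)
print("targets:", targets)
```

A group made up only of targets, such as parse_dbgroup(["t1:target"]), raises a ValueError in this sketch.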

VII. Create the Bucardo sync

bucardo add sync foobar herd=therd dbs=tgroup ping=false

Here we simply create a new sync, which is a controllable replication event, telling it which tables we want to replicate, and which databases we are going to use. We also set ping to false, which means that we will not create triggers to automatically fire off replication on any changes, but will do it manually. In a real world scenario, you generally do want those triggers, or want to set Bucardo to check periodically.

VIII. Start up Bucardo


bucardo start

If all went well, you should see some information in the log.bucardo file in the current directory.

IX. Make a bunch of changes on all the source databases.


pgbench -t 10000 btest1
pgbench -t 10000 btest2
pgbench -t 10000 btest3
pgbench -t 10000 btest4

Here, we've told pgbench to run ten thousand transactions against each of the first four databases. Triggers on these tables have captured the changes.

X. Kick off the sync and watch the fun.


bucardo kick foobar

You can now tail the log.bucardo file to see the fun, or simply run:


bucardo status

...to see what it is doing, and the final counts when we are done. Don't forget to stop Bucardo when you are done testing:


bucardo stop

The output of bucardo status, after the sync has completed, should look like this:


bucardo status

Name     State    Last good    Time    Last I/D/C           Last bad    Time
========+========+============+=======+====================+===========+=======
foobar | Good   | 17:58:37   | 3m2s  | 131836/131836/4785 | none      |

Here we see that this sync has never failed ("Last bad"), the time of day of the last good run, how long ago that was (3 minutes and 2 seconds), as well as details of the last successful run. Last I/D/C stands for the number of inserts, deletes, and collisions across all databases for this sync. This is just an overview of all syncs at a high level, but we can also give status an argument of a sync name to see more details like so:


bucardo status foobar

Last good                       : Jun 02, 2011 17:57:47 (time to run: 42s)
Rows deleted/inserted/conflicts : 131,836 / 131,836 / 4,785
Sync name                       : foobar
Current state                   : Good
Source herd/database            : therd / t1
Tables in sync                  : 3
Status                          : active
Check time                      : none
Overdue time                    : 00:00:00
Expired time                    : 00:00:00
Stayalive/Kidsalive             : yes / yes
Rebuild index                   : 0
Ping                            : no
Onetimecopy                     : 0
Post-copy analyze               : Yes
Last error:                     :

This gives us a little more information about the sync itself, as well as another important metric, how long the sync itself took to run, in this case, 42 seconds. That particular metric might make its way back to the overall "status" view above. Try things out and help us find bugs and improve Bucardo!

Saving time with generate_series()

I was giving a presentation once on various SQL constructs, and, borrowing an analogy I'd seen elsewhere, described PostgreSQL's generate_series() function as something you might use in places where, in some other language, you'd use a FOR loop. One attendee asked, "So, why would you ever want a FOR loop in a SQL query?" A fair question, and one that I answered using examples later in the presentation. Another such example showed up recently on a client's system where the ORM was trying to be helpful, and chose a really bad query to do it.

The application in question was trying to display a list of records, and allow the user to search through them, modify them, filter them, etc. Since the ORM knew users might filter on a date-based field, it wanted to present a list of years containing valid records. So it did this:

SELECT DISTINCT DATE_TRUNC('year', some_date_field) FROM some_table;

In fairness to the ORM, this query wouldn't be so bad if some_table only had a few hundred or thousand rows. But in our case it has several tens of millions. This query results in a sequential scan of each of those records, in order to build a list of, as it turns out, about fifty total years. There must be a better way...

The better way we chose turns out to be, in essence, this: find the years of the maximum and minimum date values in the date field, construct a list of all years between the minimum and maximum, inclusive, and see which ones exist in the table. This date field is indexed, so finding its maximum and minimum is very fast:

SELECT
    DATE_TRUNC('year', MIN(some_date_field)) AS mymin,
    DATE_TRUNC('year', MAX(some_date_field)) AS mymax
FROM some_table

Here's where the FOR loop idea comes in, though it's probably better described as an "iterator" rather than a FOR loop specifically: for each year between mymin and mymax inclusive, I want a database row. The analogy may not hold terribly well, but the technique is very useful, because it will create a list of all the possible years I might be interested in, and it will do it with just two scans of the some_date_field index, rather than a sequential scan of millions of rows.

SELECT
    generate_series(mymin::INTEGER, mymax::INTEGER) AS yearnum
FROM (
    -- EXTRACT gives a plain year number, which can be cast to INTEGER
    SELECT
        EXTRACT(YEAR FROM MIN(some_date_field)) AS mymin,
        EXTRACT(YEAR FROM MAX(some_date_field)) AS mymax
    FROM some_table
) minmax_tbl

Now I simply have to convert these values to years, and see which ones exist in the underlying table:

SELECT
    yearbegin::timestamptz
FROM
    (
        SELECT
            -- Turn the year number into January 1 of that year
            '0001-01-01'::date + (yearnum - 1) * INTERVAL '1 year' AS yearbegin
        FROM (
            SELECT
                generate_series(mymin::INTEGER, mymax::INTEGER) AS yearnum
            FROM (
                SELECT
                    EXTRACT(YEAR FROM MIN(some_date_field)) AS mymin,
                    EXTRACT(YEAR FROM MAX(some_date_field)) AS mymax
                FROM some_table
            ) minmax_tbl
        ) yearnum_tbl
    ) beginend_tbl
WHERE
    EXISTS (
        SELECT 1 FROM some_table
        WHERE
            -- Half-open range, so the first instant of the next year is not counted
            some_date_field >= yearbegin
            AND some_date_field < yearbegin + INTERVAL '1 year'
    )
ORDER BY yearbegin ASC
;

As expected, this probes the some_date_field index twice, to get the maximum and minimum date values, and then once for each year between those values. Because of some strangely-dated data in there, that means nearly 10,000 index probes, but that's still much faster than scanning the entire table.
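For anyone who finds the "iterator" idea easier to read outside of SQL, here is a small Python sketch of the same technique. It is only an analogy under some assumptions: a sorted list stands in for the index on some_date_field, bisect stands in for an index probe, and the years_with_records name is made up for this example.

```python
import bisect
from datetime import date


def years_with_records(sorted_dates):
    """Return the years that have at least one date, probing once per year.

    sorted_dates plays the role of the index on some_date_field: a sorted
    list that bisect can search without walking every element.
    """
    if not sorted_dates:
        return []
    first_year = sorted_dates[0].year    # cheap: the minimum, straight from the "index"
    last_year = sorted_dates[-1].year    # cheap: the maximum, straight from the "index"
    found = []
    for year in range(first_year, last_year + 1):   # the generate_series() part
        # One probe: position of the first date on or after January 1 of this year
        pos = bisect.bisect_left(sorted_dates, date(year, 1, 1))
        if pos < len(sorted_dates) and sorted_dates[pos].year == year:
            found.append(year)
    return found


dates = sorted([date(1999, 5, 1), date(2001, 1, 1), date(2001, 7, 4), date(2004, 12, 31)])
print(years_with_records(dates))
```

The loop runs once per candidate year rather than once per row, which is the whole point: the cost depends on the span of years, not on the size of the table.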

Postgres Bug Tracking - Help Wanted!

Once again there is talk in the Postgres community about adopting the use of a bug tracker. The latest thread, on pgsql-hackers, was started by someone asking about the status of their patch. Or rather, asking an even better meta-question about how one finds out the status of a PostgreSQL bug report or patch. Sadly, the answer is that there is no standard way, other than sending emails until someone replies one way or another. The current process works something like this:

  1. Someone finds a bug
  2. They send an email to pgsql-bugs@postgresql.org OR they use the web form, which grabs a sequential number and mails the report to pgsql-bugs@postgresql.org. Nothing else is done/stored, it just sends the email.
  3. Someone replies about the bug OR nobody replies about the bug.
  4. After a fix is found, which may involve some emails on other mailing lists, someone replies that the bug is fixed on the original thread. Maybe.

As you can see, there is some room for improvement there. Some of the most glaring holes in the current system:

  • No way to search previous / existing bugs
  • No way to tell the status of a bug
  • No way to categorize and group bugs (per version, per platform, per component, per severity, etc.)
  • No way to know who is working on a bug
  • No way to prevent things from slipping through the cracks

Luckily, the above problems have been solved for many, many years now by a wide variety of bug tracking software. There have traditionally been three problems to getting a bug tracker working for the Postgres project:

Inertia

The current system is, in a very literal sense, "good enough", so it's hard to impose the inevitable short-term pain of a new system when there always seem to be more pressing matters to attend to.

Doesn't Make Julienne Fries

Everyone wants a different set of features, and getting all the hackers involved to agree on even a simple subset of desired features is pretty difficult. This is sort of similar to the crusade by myself and others to get git as the replacement version control system; there were some strong voices for competing systems (e.g. mercurial).

Who Will Put the Bell on the Cat?

Everyone talks about the problem, and there have even been some attempts over the years to implement some sort of system, but the problem remains that setting up such a system, getting it smoothly integrated into the project's work flow, and then maintaining said system is a non-trivial task. Especially when you can't be assured of buy-in from some of the major players.

I'm hopeful that the recent thread indicates a slight shift of late in global acceptance of the need for a bug tracking system. The question is, which one, and who is going to take the time to write something? I'm really hoping someone who has been lurking in the background will step up and help create something wonderful (okay, we can start with 'decent' :) Perhaps even someone with experience setting up bug tracking systems. Certainly Postgres must be one of the last major open source projects without a bug tracker; there is plenty of hard-won experience out there to be learned from. It would also be ideal if the person or persons was *not* a Postgres hacker of any sort, as taking the time to build and maintain this system would definitely take time away from their other hacking tasks. On the other hand, one could argue that a bug tracker is a vital piece of project infrastructure that is potentially as important as any other work that goes on. I certainly think so.

Only Try This At Home

Taken by Josh 6 years to the day before the release of 9.1 beta 1

For the record, 9.1 is gearing up to be an awesome release. I was tinkering and testing PostgreSQL 9.1 Beta 1 (... You are beta testing, too, right?) ... and some of the new PL/Python features caught my eye. These are minor among all the really cool high profile features, to be sure. But it made me think back to a little bit of experimental code written some time ago, and how these couple language additions could make a big difference.

For one reason or another I'd just hit the top level postgresql.org website, and suddenly realized just how many Postgres databases it took to put together what I was seeing on the screen. Not only does it power the content database that generated the page, of course, but even the lookup of the .org went through Afilias and their Postgres-backed domain service. It's a pity the DBMS couldn't act as the middle layer between those.

Or could it?

That's a shortened form of it just for demonstration purposes (the original one had things like a table browser) ... but it works. For example, on this test 9.1 install, hit http://localhost:8000/public/webtest and the following table appears:

generate_serieslhrnd
100.548577250913
211.70926172473
311.24841631576
(etc)......

Note the use of two specific 9.1 features, though. The plpy object contains nice query building helper utilities like quote_ident that you may be familiar with in other languages. But this also makes use of subtransactions, which helps recover from db errors. That's important here, as something like a typo in a table name will generate an error from Postgres and without that in place the database will end the transaction and ignore any subsequent commands the function tries to run.
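As a rough illustration of the quoting step, here is a standalone Python sketch that turns a request path such as /public/webtest into a query. It runs outside the database, so the quote_ident below is a simplified, hypothetical stand-in for plpy.quote_ident (it always adds quotes), and query_for_path is a made-up name. Inside a real PL/Python function on 9.1, the query would be run with plpy.execute inside a "with plpy.subtransaction():" block so that a bad table name can be caught and reported.

```python
def quote_ident(name):
    """Simplified stand-in for plpy.quote_ident: always double-quote the name."""
    return '"' + name.replace('"', '""') + '"'


def query_for_path(path):
    """Map '/schema/table' to a SELECT statement, or None if the path is malformed."""
    parts = [p for p in path.split("/") if p]
    if len(parts) != 2:
        return None  # the caller would answer with a 404
    schema, table = parts
    return f"SELECT * FROM {quote_ident(schema)}.{quote_ident(table)}"


print(query_for_path("/public/webtest"))
print(query_for_path('/public/we"ird'))
print(query_for_path("/public"))
```

Quoting both identifiers means that whatever arrives in the URL is only ever treated as a schema or table name, never as SQL.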

But with that in place, the page shows the 404 error, and picks up where it left off with subsequent requests:

Error code 404.

Message: Table not found.

By the way, if it's not clear by now, don't take this anywhere near a production database, if for no other reason than that a transaction will be held open as long as that function runs. That will hold back all the nice maintenance stuff that keeps things running efficiently. Still, I think it helps show off what just a handful of lines of code can do in a powerful language like PL/Python. I'm sure with the right module PL/PerlU could do something very similar. But even more I think it shows how Postgres is growing and innovating by leaps and bounds, seemingly every day!

DBD::Pg and the libpq COPY bug

(image by kvanhorn)

Version 2.18.1 of DBD::Pg, the Perl driver for Postgres, was just released. This was to fix a serious bug in which we were not properly clearing things out after performing a COPY. The only time the bug manifested, however, is if an asynchronous query was done immediately after a COPY finished. I discovered this while working on the new version of Bucardo. The failing code section was this (simplified):


## Prepare the source
my $srccmd = "COPY (SELECT * FROM $S.$T WHERE $pkcols IN ($pkvals)) TO STDOUT";
$fromdbh->do($srccmd);

## Prepare each target
for my $t (@$todb) {
    my $tgtcmd = "COPY $S.$T FROM STDIN";
    $t->{dbh}->do($tgtcmd);
}

## Pull a row from the source, and push it to each target
while ($fromdbh->pg_getcopydata($buffer) >= 0) {
    for my $t (@$todb) {
        $t->{dbh}->pg_putcopydata($buffer);
    }
}

## Tell each target we are done with COPYing
for my $t (@$todb) {
    $t->{dbh}->pg_putcopyend();
}

## Later on, run an asynchronous command on the source database
$sth{track}{$dbname}{$g} = $fromdbh->prepare($SQL, {pg_async => PG_ASYNC});
$sth{track}{$dbname}{$g}->execute();

This gave the error "another command is already in progress". This error did not come from Postgres or DBD::Pg, but from libpq, the underlying C library which DBD::Pg uses to talk to the database. Strangely enough, taking out the async part and running the exact same command produced no errors.

After tracking back through the libpq code, it turns out that DBD::Pg was only calling PQgetResult a single time after the copy ended. I can see why this was done: the docs for PQputCopyEnd state: "After successfully calling PQputCopyEnd, call PQgetResult to obtain the final result status of the COPY command. One can wait for this result to be available in the usual way. Then return to normal operation." What's not explicitly stated is that you need to call PQgetResult again, and keep calling it, until it returns null, to "clear out the message queue". In this case, PQgetResult pulled back a 'c' message from Postgres, via the frontend/backend protocol, indicating that the copy command was complete. However, what it really needed was to call PQgetResult two more times: once to get back a 'C' (indicating the COPY statement was complete), and once to get back a 'Z' (indicating the backend was ready for a new query). Technically, there was nothing stopping libpq from sending a fresh query except that its own internal flag, conn->asyncStatus, is not reset on a simple end of copy, but only when 'Z' is encountered. Thus, DBD::Pg 2.18.1 now calls PQgetResult until it returns null.

If your application is encountering this bug and you cannot upgrade to 2.18.1 yet, the solution is simple: perform a non-asynchronous query between the end of the copy and the start of the asynchronous query. It can be any query at all, so the above code could be cured with:


...
## Tell each target we are done with COPYing
for my $t (@$todb) {
    $t->{dbh}->pg_putcopyend();
    $t->{dbh}->do('SELECT 123');
}

## Later on, run an asynchronous command on the source database
$fromdbh->do('SELECT 123');
$sth{track}{$dbname}{$g} = $fromdbh->prepare($SQL, {pg_async => PG_ASYNC});
$sth{track}{$dbname}{$g}->execute();

Why does the non-asynchronous command work? Doesn't it check the conn->asyncStatus as well? The secret is that PQexecStart has this bit of code in it:


    /*
     * Silently discard any prior query result that application didn't eat.
     * This is probably poor design, but it's here for backward compatibility.
     */
    while ((result = PQgetResult(conn)) != NULL)

Wow, that code looks familiar! So it turns out that the only reason this was not spotted earlier is that non-asynchronous commands (e.g. those using PQexec) were silently clearing out the message queue, kind of as a little favor from libpq to the driver. The async function, PQsendQuery, is not as nice, so it does the correct thing and fails right away with the error seen above (via PQsendQueryStart).
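The whole story can be played out with a toy model. The Python sketch below is not libpq and does not talk to a database: ToyConnection and its methods are hypothetical names, and the three pending items simply mirror the 'c', 'C' and 'Z' messages described above. It only shows the rule of thumb, which is to keep reading results until you get nothing back.

```python
class ToyConnection:
    """Toy model of the behavior described above; it is not libpq."""

    def __init__(self):
        self.pending = []   # results the application has not read yet
        self.busy = False   # stands in for conn->asyncStatus

    def finish_copy(self):
        # After a COPY ends, several results are waiting to be read.
        self.pending = ["copy done", "command complete", "ready for query"]
        self.busy = True

    def get_result(self):
        # Like PQgetResult: hand back one result, or None once drained.
        if self.pending:
            return self.pending.pop(0)
        self.busy = False
        return None

    def exec_sync(self, sql):
        # Like PQexec: silently discard anything left unread, then run.
        while self.get_result() is not None:
            pass
        return f"ran: {sql}"

    def send_async(self, sql):
        # Like PQsendQuery: refuse to start while the connection is busy.
        if self.busy:
            raise RuntimeError("another command is already in progress")
        return f"sent: {sql}"


conn = ToyConnection()
conn.finish_copy()
conn.get_result()                      # the old behavior: read a single result
try:
    conn.send_async("SELECT 1")
except RuntimeError as err:
    print("async failed:", err)

while conn.get_result() is not None:   # the fix: drain until None
    pass
print(conn.send_async("SELECT 1"))
```

Calling exec_sync in place of the drain loop has the same effect in this model, which is why the 'SELECT 123' workaround does the trick.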

NOTIFY vs Prepared Transactions in Postgres (the Bucardo solution)

We recently had a client use Bucardo to migrate their app from Postgres 8.2 to Postgres 9.0 with no downtime (which went great). They also were using Bucardo to replicate from the new 9.0 master to a bunch of 9.0 slaves. This ran into problems the moment the application started, as we started seeing these messages in the logs:


ERROR:  cannot PREPARE a transaction that has executed LISTEN, UNLISTEN or NOTIFY

The problem is that the Postgres LISTEN/NOTIFY system cannot be used with prepared transactions. Bucardo uses a trigger on the source tables that issues a NOTIFY to let the main Bucardo daemon know that something has changed and needs to be replicated. However, their application was issuing a PREPARE TRANSACTION as an occasional part of its work. Thus, they would update the table, which would fire the trigger, which would send the NOTIFY. Then the application would issue the PREPARE TRANSACTION, which produced the error given above. Bucardo is set up to deal with this situation: rather than using notify triggers, the Bucardo daemon can be set to look for any changes at a set interval. The steps to change Bucardo's behavior for a given sync are simply:


$ bucardo_ctl update sync foobar ping=false checktime=15
$ bucardo_ctl validate foobar
$ bucardo_ctl reload foobar

The first command tells the sync not to use notify triggers (these are actually statement-level triggers that simply issue a NOTIFY bucardo_kick_sync_foobar). It also sets a checktime of 15 seconds, which means that the Bucardo daemon will check for changes every 15 seconds, as if the original notify trigger were firing every 15 seconds. The second command validates the sync by checking that all supporting tables, functions, triggers, etc. are installed and up to date. It also removes triggers that are no longer needed: in this case, the statement-level notify triggers for all tables in this sync. Finally, the third command simply tells the Bucardo daemon to stop the sync, load in the new changes, and restart it.

Another solution to the problem is to simply not use prepared transactions: very few applications actually need them, but I've noticed a few that use them anyway when they should not. What exactly is a prepared transaction? It's the Postgres way of implementing two-phase commit. Basically, this means that a transaction's state is stored away on disk, and can be committed or rolled back at a later time, even by a different session. This is handy if you need to ensure that, for example, you can atomically commit multiple database connections. By atomically, I mean that either they all commit or none of them do. This is done by doing work on each database, issuing a PREPARE TRANSACTION, and then, once all have been prepared, issuing a COMMIT PREPARED against each one.
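The "all commit or none do" dance is easier to see in a few lines of code. This is a minimal Python sketch with in-memory stand-ins, not a real database client: ToyDatabase and commit_everywhere are hypothetical names, and the methods merely mimic PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK PREPARED and a plain ROLLBACK.

```python
class ToyDatabase:
    """In-memory stand-in for one database connection (hypothetical)."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.state = "in transaction"

    def prepare_transaction(self, gid):
        if self.fail_on_prepare:
            raise RuntimeError(f"{self.name}: PREPARE TRANSACTION failed")
        self.state = "prepared"

    def commit_prepared(self, gid):
        self.state = "committed"

    def rollback_prepared(self, gid):
        self.state = "rolled back"

    def rollback(self):
        self.state = "rolled back"


def commit_everywhere(databases, gid):
    """Phase one: prepare on every database. Phase two: commit them all.

    If any prepare fails, undo everything, so that either every database
    commits or none does.
    """
    prepared = []
    try:
        for db in databases:
            db.prepare_transaction(gid)
            prepared.append(db)
    except RuntimeError:
        for db in databases:
            if db in prepared:
                db.rollback_prepared(gid)   # already stored on disk
            else:
                db.rollback()               # never got as far as preparing
        return False
    for db in prepared:
        db.commit_prepared(gid)
    return True


good = [ToyDatabase("a"), ToyDatabase("b")]
print(commit_everywhere(good, "demo"), [db.state for db in good])

bad = [ToyDatabase("a"), ToyDatabase("b", fail_on_prepare=True), ToyDatabase("c")]
print(commit_everywhere(bad, "demo"), [db.state for db in bad])
```

A real coordinator also has to cope with crashing between the two phases, which is exactly how prepared transactions end up being left behind.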

As an aside, prepared transactions are often confused with prepared statements. While the use of prepared statements is very common, use of prepared transactions is very rare. Prepared statements are simply a way of planning a query one time, then re-running it multiple times without having to run the query through the planner each time. Many interfaces, such as DBD::Pg, will do this for you automatically behind the scenes. Sometimes using prepared statements can cause issues, but it is usually a win.

As mentioned above, the use of 2PC (two-phase commit) is very rare, which is why the default for the max_prepared_transactions variable was recently changed to 0, which effectively disallows the use of prepared transactions until you explicitly turn them on in your postgresql.conf file. This helps prevent people from accidentally issuing a PREPARE TRANSACTION and then leaving it around. This mistake is easy to make, for once you issue the command, everything goes back to normal and it's easy to forget about it. However, having prepared transactions lying around is a bad thing, as they continue to hold locks, and can prevent vacuum from doing its job. The check_postgres program even has a specific check for this situation: check_prepared_txns.
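If you just want the gist of such a check, here is a small Python sketch. It is not how check_postgres implements check_prepared_txns: the stale_prepared function and the five-minute threshold are assumptions made for this example, and the rows are hard-coded samples of what a query against the pg_prepared_xacts view (which has gid and prepared columns) might return.

```python
from datetime import datetime, timedelta


def stale_prepared(rows, now, max_age=timedelta(minutes=5)):
    """Return the gids of prepared transactions older than max_age.

    rows are (gid, prepared) pairs, for example as fetched with
    SELECT gid, prepared FROM pg_prepared_xacts.
    """
    return [gid for gid, prepared in rows if now - prepared > max_age]


now = datetime(2011, 6, 2, 18, 0, 0)
rows = [
    ("foobar", datetime(2011, 6, 2, 9, 30, 0)),      # forgotten since this morning
    ("batch_42", datetime(2011, 6, 2, 17, 59, 30)),  # prepared 30 seconds ago
]
print(stale_prepared(rows, now))
```

Anything this reports is a candidate for a COMMIT PREPARED or ROLLBACK PREPARED, once you have worked out who left it there.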

What does two-phase commit look like? There are only three basic commands: PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED. Each takes a name, which is an arbitrary string less than 200 bytes long. Usage is to start a transaction, do some work, and then issue a PREPARE TRANSACTION instead of a COMMIT. At this point, all the work you have done is gone from your session and stored on disk. You cannot get back into this transaction: you can only commit it or roll it back. See the docs on PREPARE TRANSACTION for the full details.

Here's an example of two-phase commit in action:


testdb=# BEGIN;
BEGIN
testdb=#*  CREATE TABLE preptest(a int);
CREATE TABLE
testdb=#*  INSERT INTO preptest VALUES (1),(2),(3);
INSERT 0 3
testdb=#*  SELECT * FROM preptest;
 a 
---
 1
 2
 3
(3 rows)

testdb=#*  PREPARE TRANSACTION 'foobar';
PREPARE TRANSACTION
testdb=# SELECT * FROM preptest;
ERROR:  relation "preptest" does not exist
LINE 1: SELECT * FROM preptest;
                      ^
testdb=# COMMIT PREPARED 'foobar';
COMMIT PREPARED
testdb=# SELECT * FROM preptest;
 a 
---
 1
 2
 3
(3 rows)

A contrived example, but you can see how easy it could be to issue a PREPARE TRANSACTION and not even realize that it actually sticks around forever!

Postgres query caching with DBIx::Cache

A few years back, I started working on a module named DBIx::Cache which would add a caching layer at the database driver level. The project that was driving it got put on hold indefinitely, so it's been on my long-term todo list to release what I did have to the public in the hope that someone else may find it useful. Hence, I've just released version 1.0.1 of DBIx::Cache. Consider it the closest thing Postgres has at the moment for query caching. :) The canonical webpage:

http://bucardo.org/wiki/DBIx-Cache

You can also grab it via git, either directly:

git clone git://bucardo.org/dbixcache.git/

or through the indispensable github:

https://github.com/bucardo/dbixcache

So, what does it do exactly? Well, the idea is that certain queries that are either repeated often and/or are very expensive to run should be cached somewhere, such that the database does not have to redo all the same work, just to return the same results over and over to the client application. Currently, the best you can hope for with Postgres is that things are in RAM from being run recently. DBIx::Cache changes this by caching the results somewhere else. The default destination is memcached.

DBIx::Cache acts as a transparent layer around your DBI calls. You can control which queries, or classes of queries get cached. Most of the basic DBI methods are overridden so that rather than query Postgres, they actually query memcached as needed (or other caching layer - could even query back into Postgres itself!). Let's look at a simple example:


use strict;
use warnings;
use Data::Dumper;
use DBIx::Cache;
use Cache::Memcached::Fast;

## Connect to an existing memcached server, 
## and establish a default namespace
my $mc = Cache::Memcached::Fast->new(
  {
    servers   => [ { address => 'localhost:11211' } ],
    namespace => 'joy',
  });

## Rather than DBI->connect, use DBIx::Cache->connect
## Tell it what to use as our caching source
## (the memcached server above)
my $dbh = DBIx::Cache->connect('', '', '',
  { RaiseError => 1,
    dxc_cachehandle => $mc
});

## This is an expensive query, that takes 30 seconds to run:
my $SQL = 'SELECT * FROM analyze_sales_data()';

## Prepare this query
my $sth = $dbh->prepare($SQL);

## Run it ten times in a row.
## The first time takes 30 seconds, the other nine return instantly.
for (1..10) {
    my $count = $sth->execute();
    my $info = $sth->fetchall_arrayref({});
    print Dumper $info;
}

In the above, the prepare($SQL) is actually calling the DBIx::Cache::prepare method. This parses the query and tries to determine if it is cacheable or not, then stores that decision internally. Regardless of the result, it calls DBI::prepare (which is technically DBD::Pg::prepare), and returns the result. The magic comes in the call to execute() later on. As you might imagine, this is also actually the DBIx::Cache::execute() method. If the query is not cacheable, it simply runs it as normal and returns. If it is cacheable, and this is the first time it is run, DBIx::Cache runs an EXPLAIN EXECUTE on the original statement, and parses out a list of all tables that are used in this query. Then it caches all of this information into memcached, so that subsequent runs using the same list of arguments to execute() don't need to do that work again.

Finally, we come to fetchall_arrayref(). The first time it is run, we simply call the parent methods and get the data back. Then we build unique keys and store the results of the query into memcached. Finally, we mark the execute() as fully cached. Thus, on subsequent calls to execute(), we don't actually execute anything on the database server, but simply return the count as stashed inside of memcached (in the case of execute, this is the number of affected rows). For the various fetch() methods, we do the same thing - rather than fetch things from the database (via DBI, DBD::Pg, and libpq), we get the results from memcached (frozen via Data::Dumper), and then unpack and return them. Since we don't actually need to do any work against the database, everything returns as fast as we can query memcached - which is in general very fast indeed.

Most of the above is working, but the piece that is not written is the cache invalidation. DBIx::Cache knows which tables go to which queries, so in theory you could have (for example), an UPDATE/INSERT/DELETE trigger on table X which calls DBIx::Cache and tells it to invalidate all items related to table X, so that the next call to prepare() or execute() or fetch() will not find any memcached matches and re-run the whole query and store the results. You could also simply handle that in your application, of course, and have it decide when to invalidate items.
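To show what table-based invalidation could look like, here is a minimal Python sketch. It is not DBIx::Cache code: a plain dictionary stands in for memcached, and the QueryCache class, its fetch and invalidate methods, and the run_query callback are all hypothetical names made up for this example.

```python
class QueryCache:
    """Dictionary-backed sketch of a result cache with per-table invalidation."""

    def __init__(self):
        self.results = {}    # (sql, args) -> cached rows
        self.by_table = {}   # table name -> set of (sql, args) keys

    def fetch(self, sql, args, tables, run_query):
        """Return cached rows, or call run_query(sql, args) and cache the result."""
        key = (sql, tuple(args))
        if key not in self.results:
            self.results[key] = run_query(sql, args)
            # Remember which tables this result depends on
            for table in tables:
                self.by_table.setdefault(table, set()).add(key)
        return self.results[key]

    def invalidate(self, table):
        """Forget every cached result that read from the given table."""
        for key in self.by_table.pop(table, set()):
            self.results.pop(key, None)


calls = []

def run_query(sql, args):
    # Pretend to hit the database; record the call so we can count real runs
    calls.append(sql)
    return [("widgets", 42)]


cache = QueryCache()
sql = "SELECT * FROM analyze_sales_data()"
cache.fetch(sql, [], ["sales"], run_query)   # runs the query
cache.fetch(sql, [], ["sales"], run_query)   # served from the cache
cache.invalidate("sales")                    # e.g. fired by a trigger on sales
cache.fetch(sql, [], ["sales"], run_query)   # runs the query again
print("real query runs:", len(calls))
```

The only bookkeeping needed is the reverse mapping from table to cache keys, which is the information the EXPLAIN step already gathers.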

It's been a while since I've really looked at the code, but as far as I can tell it is close to being able to actually use somewhere. :) Patches and questions welcome!

DBD::Pg query cancelling in Postgres

A new version of DBD::Pg, the Perl driver for PostgreSQL, has just been released. In addition to fixing some memory leaks and other minor bugs, this release (version 2.18.0) introduces support for the DBI method known as cancel(). A giant thanks to Eric Simon, who wrote this new feature. The new method is similar to the existing pg_cancel() method, except it works on synchronous rather than asynchronous queries. I'll show an example of both below.

DBD::Pg has been able to handle asynchronous queries for a while now. Basically, that means you don't have to wait around for the database to finish a query. Your application can do other things while the query runs, then check back later to see if it has completed and grab the results. The way to cancel an already kicked-off asynchronous query is with the pg_cancel() method (the other asynchronous methods are pg_ready and pg_result, which have no synchronous equivalents).

The prefix "pg_" is used because there is no corresponding built-in DBI method to override, and the convention is to prefix everything custom to a driver with the driver's prefix, in our case 'pg'. Here's an example showing one possible use of asynchronous queries using DBD::Pg in some Perl code:


  ## We are connecting to two servers and running expensive 
  ## queries on both. We kick both off right away, then wait 
  ## for them both to finish. Our total wait time is thus
  ## max(server1,server2) rather than sum(server1,server2)

  use strict;
  use warnings;
  use DBI;
  use DBD::Pg qw{ :async };

  my $dsn1 = 'dbi:Pg:dbname=sales;host=example1.com';
  my $dsn2 = 'dbi:Pg:dbname=sales;host=example2.com';

  my $dbh1 = DBI->connect($dsn1, '', '', {AutoCommit=>0, RaiseError=>1});
  my $dbh2 = DBI->connect($dsn2, '', '', {AutoCommit=>0, RaiseError=>1});

  my $SQL = 'SELECT gather_yearly_sales_data()';
  print "Kicking off a long, expensive query on database one\n";
  ## Normally, a do() will not return until the query is complete
  ## However, the async flag causes it to return immediately
  $dbh1->do($SQL, {pg_async => PG_ASYNC});

  print "Kicking off a long, expensive query on database two\n";
  $dbh2->do($SQL, {pg_async => PG_ASYNC});

  ## Both queries are running in the 'background'
  ## We have to wait for both, so it doesn't matter which one we wait for here
  ## However, if it's been over 2 minutes, we'll cancel both and quit
  my $time = 0;
  while ( ! $dbh1->pg_ready() ) {
    sleep 1;
    if ($time++ > 120) {
      print "Taking too long, let's cancel the queries\n";
      $dbh1->pg_cancel();
      $dbh2->pg_cancel();
      $dbh1->rollback();
      $dbh2->rollback();
      die "No sales data was retrieved\n";
    }
  }

  ## We know that database 1 has finished, so we read in the results
  my $rows1 = $dbh1->pg_result();
  ## We then grab results from database 2
  ## This will block until done, which is okay
  my $rows2 = $dbh2->pg_result();

The new method, simply known as cancel(), will kill any synchronously running query. One of the main uses for this is to time out a query by using the builtin Perl alarm function. However, since the builtin alarm function has some quirks, we will instead use the much safer POSIX::SigAction method. Another example:


  ## We are running a series of queries against a database, but if
  ## the whole thing is taking over 30 seconds, we want to cancel
  ## the currently running query and move on to something else.

  use strict;
  use warnings;
  use DBI;
  use DBD::Pg qw{ :async };
  ## Needed for POSIX::SigSet, POSIX::SigAction, sigaction() and SIGALRM
  use POSIX qw{ :signal_h };

  my $dsn = 'dbi:Pg:dbname=dq';

  my $dbh = DBI->connect($dsn, '', '', {AutoCommit=>0, RaiseError=>1});

  ## Setup all the POSIX alarm plumbing
  my $mask = POSIX::SigSet->new(SIGALRM);
  my $action = POSIX::SigAction->new(
    sub { die "TIMEOUT\n" },
    $mask,
  );
  my $oldaction = POSIX::SigAction->new();
  sigaction( SIGALRM, $action, $oldaction );

  ## Prepare the queries
  my $upd = $dbh->prepare('UPDATE foobar SET x=? WHERE y=?');
  my $inv = $dbh->prepare('SELECT refresh_inventory(?)');

  ## Yes, a double eval. Async is looking better all the time :)
  eval {
    eval {
      alarm 30;
      for my $y (12,24,48) {
        print "Adjusting widget #$y\n";
        $upd->execute(555,$y);
        print "Recalculating inventory\n";
        $inv->execute($y);
      }
    };
    alarm 0; ## Turn off our alarm
    die "$@\n" if $@; ## Bubble the error to the outer eval
  };
  if ($@) { ## Something went wrong
    if ($@ =~ /TIMEOUT/) {
      print "Queries are taking too long! Cancelling\n";
      ## We don't know which one is still running, and don't care
      ## It's safe to cancel a non-active statement handle
      $upd->cancel() or die qq{Failed to cancel the query!\n};
      $inv->cancel() or die qq{Failed to cancel the query!\n};
      $dbh->rollback();
      die "Who has time to wait 30 seconds anymore?";
    }
    ## Some other non-alarm error, so we simply:
    die $@;
  }

  print "Updates are complete\n";
  $dbh->commit();
  exit;

Got an interesting use case for asynchronous queries or the new cancel() method? Let me know!

Annotating Your Logs

We recently did some PostgreSQL performance analysis for a client with an application having some scaling problems. In essence, they wanted to know where Postgres was getting bogged down, and once we knew that we'd be able to target some fixes. But to get to that point, we had to gather a whole bunch of log data for analysis while the test software hit the site.

This is on Postgres 8.3 in a rather locked down environment, by the way. Coordinated pg_rotate_logfile() was useful, but occasionally it would seem to devolve to something resembling: "Okay, we're adding 60 more users ... now!" And I'd write down the time stamp, and figure out an appropriate place to slice the log file later.

Got me thinking, what if we could just drop an entry into the log file, and use it to filter things out later? My first instinct was to see whether a patch would be accepted, maybe a wrapper for ereport(), something easy. Turns out, it's even easier than that...

pubsite=# DO $$BEGIN RAISE LOG 'MARK: 60 users'; END;$$;
DO
Time: 0.464 ms
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 120 users'; END;$$;
DO
Time: 0.378 ms
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 360 users'; END;$$;
DO
Time: 0.700 ms

Of course the above will only work on version 9.0 and up (eventually). Previous versions that have PL/pgSQL turned on can just create a function that does the same thing. The "LOG" severity level is an informational message that's supposed to always make it into the log files. So with those in place, a grep through the log can reveal just where they appear, and sed can extract the sections of log between those lines and feed them into your favorite analysis utility:

postgres@mothra:~$ grep -n 'LOG:  MARK' /var/log/postgresql/postgresql-9.0-main.log 
19180:2011-03-31 20:20:37 EDT LOG:  MARK: 60 users
19478:2011-03-31 20:25:48 EDT LOG:  MARK: 120 users
20247:2011-03-31 20:32:15 EDT LOG:  MARK: 360 users
postgres@mothra:~$ sed -n '19180,19478p' /var/log/postgresql/postgresql-9.0-main.log | bin/pgsi.pl > 60users.html
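
If you do this often, the grep and sed steps can be rolled into one small helper. Here's a rough sketch, assuming the markers were written exactly as above (so each one ends a log line as "LOG:  MARK: <label>"); the name slice_between_marks is made up for this example:

```shell
# slice_between_marks LOGFILE START_LABEL END_LABEL
# Print the log lines from the "MARK: START_LABEL" entry through the
# "MARK: END_LABEL" entry, inclusive.
slice_between_marks() {
  awk -v startmark="LOG:  MARK: $2" -v endmark="LOG:  MARK: $3" '
    # true if line ends with the string s
    function ends_with(line, s) {
      return length(line) >= length(s) &&
             substr(line, length(line) - length(s) + 1) == s
    }
    !printing && ends_with($0, startmark) { printing = 1 }
    printing                              { print }
    printing && ends_with($0, endmark)    { exit }
  ' "$1"
}
```

Something like slice_between_marks /var/log/postgresql/postgresql-9.0-main.log '60 users' '120 users' | bin/pgsi.pl > 60users.html should then give the same slice as the sed call, without having to look up the line numbers first.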

Oh, and the performance problem? Turns out it wasn't Postgres at all: the average execution time of every single query was shown to vary minimally as the concurrent user count was scaled higher and higher. But that's another story.

Postgres Build Farm Animal Differences

I'm a big fan of the Postgres Build Farm, a distributed network of computers that are constantly installing, building, and testing Postgres to detect any problems in the code. The build farm works best when there is a wide variety of operating systems and architectures testing. Thus, while I have a rather common x86_64 Linux box available for testing, I try to make it a little unique to get better test coverage.

One thing I've been working on is clang support (clang is an alternative to gcc). Unfortunately, the latest version of clang has a bug that prevents it from building Postgres on Linux boxes. I submitted a small patch to the Postgres source to fix this, but it was decided that we'll wait until clang fixes their bug. Supposedly they have in their svn head, but I've not been able to get that to compile successfully.

So I also just installed gcc 4.6.0, the latest and greatest. Installing it was not easy (nasty problems with the mpfr dependencies), but it's done now and working. It probably won't make any difference as far as the results, but at least my box is somewhat different from all the other x86_64 Linux boxes in the farm. :)

I've asked before on the list (with no response) about what sort of configuration changes could be made to expand the range of testing. The build farm itself provides a handful of things to choose from, and most of the animals in the farm have most of them configured (I have everything except "pam" and "vpath" enabled). However, one thing I've thought about changing is NAMEDATALEN. It's basically a compile-time option that sets the maximum number of characters things like table names can have. It is set by default to 64, while the SQL spec wants it to be 128. The problem is that this causes some tests to fail, as they have a hard-coded assumption about the length. The real problem of course is that Postgres' 'make check' is a very crude test. I've got some ideas on how to fix that, but that's another post for another day. So, anyone have other ideas on how to make my particular build farm member, and others like it, more useful?

Presenting at PgEast

I'm excited to be going to the upcoming PostgreSQL East Conference. This will be both the first PostgreSQL conference I attend and my first time presenting. I will be giving a talk on Bucardo entitled Bucardo: More than Just Multi-Master. I'll be in NYC for the conference, so I'll get to work for a couple days at our company's main office as well.

I look forward to learning more about PostgreSQL, putting some names and faces with some IRC nicks, and socializing with others in the PostgreSQL community; after all, Postgres' community is one of its strongest assets.

Hope to see you there!

Pausing Hot Standby Replay in PostgreSQL 9.0

When using a PostgreSQL Hot Standby master/replica pair, it can be useful to temporarily pause WAL replay on the replica. While future versions of Postgres will include the ability to pause recovery using administrative SQL functions, the current released version does not have this support. This article describes two options for pausing recovery for the rest of us that need this feature in the present. These two approaches are both based around the same basic idea: utilizing a "pause file", whose presence causes recovery to pause until the file has been removed.

Option 1: patched pg_standby

pg_standby is a fairly standard tool that is often used as a restore_command for WAL replay. I wrote a patch for it (available at my github repo) to support the "pause file" notion. The patch adds a -p path/to/pausefile optional argument, which if present will check for the pausefile and wait until it is removed before proceeding with recovery.

The benefit of patching pg_standby is that we're building on mature production-level code, adding functionality at its most relevant place. In particular, we know that signal handling is already sensibly handled (this was something I was less than positive about when it comes to the wrapper shell script described later). The downside here is that you need to compile your own version of pg_standby in order to take advantage of it. However, it may be considered a useful enough patch to accept in the 9.0 tree, so future releases could support it out-of-the-box.

After patching, compiling, and installing the modified version of pg_standby the only change to an existing restore_command already using pg_standby would be the addition of the -p /path/to/pausefile argument; e.g.:

restore_command = 'pg_standby -p /tmp/pausefile /path/to/archive %f %p'

After restarting the standby, simply touching the /tmp/pausefile file will pause recovery until the file is subsequently removed.

Option 2: a shell script

The pause-while script is a simple wrapper script I wrote which can be used to gate the invocation of any command by checking if the "pause file" (a file path passed as the first argument) exists. If the pause file exists, we loop in a sleep cycle until it is removed. Once the pause file does not exist (or if it did not exist in the first place), we execute the rest of the provided command string.

Sample invocation:

[user@host<1>] $ touch /tmp/pausefile; pause-while /tmp/pausefile echo hi
... # pauses, notifying of status

[user@host<2>] $ rm /tmp/pausefile
... # shell 1 will now output "hi"

Here's the script:

pause-while:

#!/bin/bash

# exit instead of running the wrapped command if we are interrupted while paused
trap 'exit 1' INT

PAUSE_FILE=$1
shift

while [ -f "$PAUSE_FILE" ]; do
  echo "'$PAUSE_FILE' present; pausing. remove to continue" >&2
  sleep 1
  PAUSED=1
done

[ "$PAUSED" ] && echo "'$PAUSE_FILE' removed; continuing" >&2

# untrap so we don't block the invoked command's expected signal handling
trap - INT

# now we know the pause file doesn't exist, proceed to execute our
# command as normal; "$@" is quoted so arguments containing spaces survive

exec "$@"

We need to trap SIGINT to prevent the wrapped command from executing if the sleep cycle is interrupted.

Putting this to use in our Hot Standby case, we will want to use pause-while as a wrapper for the existing restore_command, thus adjusting recovery.conf to something like this:

restore_command = 'pause-while /tmp/standby.pause pg_standby ... <args>'

With this configuration, when you want to pause WAL replay on the replica simply touch the /tmp/standby.pause pause file and the next invocation of restore_command will wait until that file is removed before proceeding.

The wrapper script approach has the benefit of working with any defined restore_command and is not limited to just working with pg_standby.
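
One thing that's easy to get wrong with either option is forgetting to remove the pause file afterwards, which leaves recovery paused indefinitely. Here's a small sketch of a guard against that. Note that with_pause_file is a hypothetical helper (it is not part of the pg_standby patch or of pause-while), and the report script in the usage line below is equally made up:

```shell
# with_pause_file PAUSE_FILE COMMAND [ARGS...]
# Create the pause file, run the command, then always remove the pause
# file again so that WAL replay can resume. Returns the command's exit status.
with_pause_file() {
  pause_file=$1
  shift
  touch "$pause_file" || return 1
  "$@"
  status=$?
  rm -f "$pause_file"
  return "$status"
}
```

Usage would look something like with_pause_file /tmp/standby.pause ./run_long_report.sh. Keep in mind that replay only actually stops at the next invocation of restore_command, and that a wrapper killed outright will still leave the pause file behind.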

Limitations

  • Since this is based on WAL archive restoration, this has a very coarse granularity; recovery can only pause between WAL files, which are 16MB. It is likely that future SQL support functions will support this at arbitrary transaction boundaries and will not have this specific limitation.
  • Neither of these options will work with Streaming Replication. Streaming Replication uses a non-zero exit status of the restore_command as the "End of Archive" marker to flip from archive restoration/catchup mode to WAL Streaming mode. pg_standby's default behavior (even before this patch) is to wait for the next archive file to appear before returning a zero exit status, and returning a non-zero exit status only on error, signal, or because its failover trigger file now exists. This means that if you use pg_standby as the restore_command with Streaming Replication enabled, you will never actually flip over into WAL streaming mode, and will stay pointlessly in archive restoration mode. (Technically speaking you could touch the failover trigger file; that would get you out of the archive mode, and into WAL streaming mode, but would not result in actually failing over.) It is likely that future SQL support functions for pausing recovery will not have this same dependency/limitation, and will be able to pause recovery when utilizing Streaming Replication.
  • While reviewed/manually tested, these programs have not been production-tested. I've done basic testing on both the shell script and pg_standby patch, however this has not been battle-tested, and likely has some corner cases that haven't been considered (I'm particularly concerned about the shell script's signal handling interactions.)
  • pg_standby has been deprecated and removed in future releases of PostgreSQL. I believe it would still be possible to compile/use pg_standby for future releases based on the version in the 9.0 source tree, but I believe it was removed because of the issues in conjunction with Streaming Replication. Presumably it (and this approach) would still be relevant if people wanted to utilize a traditional log-shipping standby with Hot Standby.

Comments/improvements welcome/appreciated!

check_postgres without Nagios (Postgres checkpoints)

Version 2.16.0 of check_postgres, a monitoring tool for Postgres, was just released. We're still trying to keep a "release often" schedule, and hopefully this year will see many releases. In addition to a few minor bug fixes, we added a new check by Nicolas Thauvin called hot_standby_delay, which, as you might have guessed from the name, calculates the streaming replication lag between a master server and one of the slaves connected to it. Obviously the servers must be running PostgreSQL 9.0 or better.

Another recently added feature (in version 2.15.0) was the simple addition of a --quiet flag. All this does is to prevent any normal output when an OK status is found. I wrote this because sometimes even Nagios is overkill. In the default mode (Nagios, the other major mode is MRTG), check_postgres will exit with one of four states, each with their own exit code: OK, WARNING, CRITICAL, or UNKNOWN. It also outputs a small message, per Nagios conventions, so a txn_idle action might exit with a value of 1 and output something similar to this:


POSTGRES_TXN_IDLE WARNING: (host:svr1) longest idle in txn: 4638s

I had a situation where I wanted to use the functionality of check_postgres (to examine the lag on a warm standby server), but did not want the overhead of adding it into Nagios, and just needed a quick email to be sent if there were any problems. Thus, the use of the quiet flag yielded a quick and cheap Nagios replacement using cron:


*/10 * * * * bin/check_postgres.pl --action=checkpoint -w 300 -c 600 --datadir=/dbdir --quiet

So every 10 minutes the script gathers the number of seconds since the last checkpoint was run. If that number is under five minutes (300 seconds), it exits silently. If it's over five minutes, it outputs something similar to this, which cron then sends in an email:


POSTGRES_CHECKPOINT CRITICAL:  Last checkpoint was 842 seconds ago

I'm not advocating replacing Nagios of course: there are many other good reasons to use Nagios instead of cron, but this worked well for the situation at hand. Other actions, feature requests, and patches for check_postgres are always welcome, either on the check_postgres bug tracker or the mailing list.
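
For Nagios-style checks that lack such a flag, the same trick can be approximated with a tiny wrapper. A sketch, relying on the Nagios convention that an exit code of 0 means OK and anything else means a problem; quiet_unless_problem is a name I made up here, not something check_postgres provides:

```shell
# quiet_unless_problem COMMAND [ARGS...]
# Run a Nagios-style check. Print its output only when the exit status
# is non-zero (WARNING, CRITICAL, or UNKNOWN), and pass that status along.
quiet_unless_problem() {
  output=$("$@" 2>&1)
  status=$?
  if [ "$status" -ne 0 ]; then
    printf '%s\n' "$output"
  fi
  return "$status"
}
```

Saved in a script and put in front of a check in the crontab, this stays silent while things are OK, and cron mails out the check's message otherwise.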

DBD::Pg, UTF-8, and Postgres client_encoding

Photo by Roger Smith

I've been working on getting DBD::Pg to play nicely with UTF-8, as the current system is suboptimal at best. DBD::Pg is the Perl interface to Postgres, and is the glue code that takes the data from the database (via libpq) and gives it to your Perl program. However, not all data is created equal, and that's where the complications begin.

Currently, everything coming back from the database is, by default, treated as byte soup, meaning no conversion is done, and no strings are marked as utf8 (Perl strings are actually objects in which one of the attributes you can set is 'utf8'). If you want strings marked as utf8, you must currently set the pg_enable_utf8 attribute on the database handle like so:

$dbh->{pg_enable_utf8} = 1;

This causes DBD::Pg to scan incoming strings for high bits and mark the string as utf8 if it finds them. There are a few drawbacks to this system:

  • It does this for all databases, even SQL_ASCII!
  • It doesn't do this for everything, e.g. arrays, custom data types, xml.
  • It requires the user to remember to set pg_enable_utf8.
  • It adds overhead as we have to parse every single byte coming back from the database.

Here's one proposal for a new system. Feedback welcome, as this is a tricky thing to get right.

DBD::Pg will examine the client_encoding parameter, and see if it matches UTF8. If it does, then we can assume everything coming back to us from Postgres is UTF-8. Therefore, we'll simply flip the utf8 bit on for all strings. The one exception is bytea data, of course, which we'll read in and dequote into a non-utf8 string. Any non-UTF8 client_encodings (e.g. the monstrosity that is SQL_ASCII) will simply get back a byte soup, with no utf8 markings on our part.

The pg_enable_utf8 attribute will remain, so that applications that do their own decoding, or otherwise do not want the utf8 flag set, can forcibly disable it by setting pg_enable_utf8 to 0. Similarly, it can be forced on by setting pg_enable_utf8 to 1. The flag will always trump the client_encoding parameter.

A further complication is client_encoding: What if it defaults to something else? We can set it ourselves upon first connecting, and then if the program changes it after that point, it's on them to deal with the issues. (DBD::Pg will still assume it is UTF-8, as we don't constantly recheck the parameter.)

Someone also raised the issue of marking ASCII-only strings as utf8. While technically this is not correct, it would be nice to avoid having to parse every single byte that comes out of the database to look for high bits. Hopefully, programs requesting data from a UTF-8 database will not be surprised when things come back marked as utf8.

Feel free to comment here or on the bug that started it all. Thanks also to David Christensen, who has given me great input on this topic.

SSH config wildcards and multiple Postgres servers per client

The SSH config file has some nice features that help me to keep my sanity among a wide variety of servers spread across many different clients. Nearly all of my Postgres work is done by using SSH to connect to remote client sites, so the ability to connect to the various servers easily and intuitively is important. I'll go over an example of how a ssh config file might progress as you deal with an ever‑expanding client.

Some quick background: the ssh config file is a per‑user configuration file for the SSH program. It typically exists as ~/.ssh/config. It has two main purposes: setting global configuration items (such as ForwardX11 no), and setting things on a host‑by‑host basis. We'll be focusing on the latter.

Inside the ssh config file, you can create Host sections which specify options that apply only to one or more matching hosts. The sections are applied if the host name you type in as the argument to the ssh command matches what is after the word "Host". As we'll see, this also allows for wildcards, which can be very useful.

I'm going to walk through a hypothetical client, Acme Corporation, and show how the ssh config can grow as the client does, until the final example mirrors an actual section of my ssh config file.

So, you've just got a new Postgres client called Acme Corporation, and they are using Amazon Web Services (AWS) to host their server. We're coming in as the postgres user, and have our public ssh keys already in place inside ~postgres/.ssh/authorized_keys on their server. The hostname is ec2-456-55-123-45.compute-1.amazonaws.com. So, generally, we would connect by running:


$ ssh postgres@ec2-456-55-123-45.compute-1.amazonaws.com

That's a lot to type each time! We could create a bash alias to handle this, but it's better to use the ssh config file instead. We'll add this to the end of our ssh config:


##
## Client: Acme Corporation
##

Host  acmecorp
User postgres
Hostname  ec2-456-55-123-45.compute-1.amazonaws.com

Now we can simply use 'acmecorp' in place of that ugly string:


$ ssh acmecorp

Notice that we don't need to specify the user anymore: ssh config plugs that in for us. We can still override it if we need to connect as someone else:


$ ssh greg@acmecorp

The next week, Acme Corporation decides that rather than allow anyone to SSH to their servers, they will use iptables or something similar to restrict access to select known hosts. Because different people with different IPs at End Point may need to access Acme, and because we don't want to have Acme have to open a new hole each time we connect from a different place, we will connect from a shared company box. In this case, the box is vp.endpoint.com. Acme arranges to allow SSH from that box to their servers, and each End Point employee has a login on the vp.endpoint.com box. What we need to do now is create a SSH tunnel. Inside of the ssh config file, we add a new line to the entry for 'acmecorp':


Host  acmecorp
User  postgres
Hostname  ec2-456-55-123-45.compute-1.amazonaws.com
ProxyCommand  ssh -q greg@vp.endpoint.com nc -w 180 %h %p

Now, when we run this:


$ ssh acmecorp

...everything looks the same to us, but what we are really doing is connecting to vp.endpoint.com, running the nc (netcat) command, and then connecting to the amazonaws.com box over the new netcat connection. (The -w 180 argument tells netcat to give up if the connection sits idle for 180 seconds, and ssh replaces %h and %p with the host and port we are trying to reach.) As far as amazonaws.com is concerned, we are connecting from vp.endpoint.com. As far as we are concerned, we are going directly to amazonaws.com. A nice side effect, and a big reason why we don't simply use bash aliases, is that the scp program will use these aliases as well. So we can now do something like this:


$ scp check_postgres.pl acmecorp:

This will copy the check_postgres.pl program from our computer to the Acme one, going through the tunnel at vp.endpoint.com.

Business has been good for Acme lately and they finally have conceded to your strong suggestion to set up a warm standby server (using Postgres' Point In Time Recovery system). This new server is located at ec2-456-55-123-99.compute-1.amazonaws.com, and the internal host name they give it is maindb-replica (the original box is known as maindb-db). This new server requires another host entry to ssh config. Rather than copy over the same ProxyCommand, we'll refactor the information out into a separate host entry. What we end up with is this:


Host  acmetunnel
User  greg
Hostname  vp.endpoint.com

Host  acmedb
User  postgres
Hostname  ec2-456-55-123-45.compute-1.amazonaws.com
ProxyCommand  ssh -q acmetunnel nc -w 180 %h %p

Host  acmereplica
User  postgres
Hostname  ec2-456-55-123-99.compute-1.amazonaws.com
ProxyCommand  ssh -q acmetunnel nc -w 180 %h %p

We also shortened the prefix from "acmecorp" to just "acme" as that's enough to uniquely identify this client among ours, and who wants to type more than they have to?

Next, the company adds a QA box they want End Point to help setup. This box, however, is *not* reachable from outside their network; it can be reached only from other hosts in their network. Luckily, we already have access to some of those. What we'll do is extend our tunnel by one more host, so that the path we travel from us to the Acme QA box is:

Local box → vp.endpoint.com → acmereplica → acmeqa

Here's the section of the ssh config after we've added in the QA box:


Host  acmetunnel
User  greg
Hostname  vp.endpoint.com

Host  acmedb
User  postgres
Hostname  ec2-456-55-123-45.compute-1.amazonaws.com
ProxyCommand  ssh -q acmetunnel nc -w 180 %h %p

Host  acmereplica
User  postgres
Hostname  ec2-456-55-123-99.compute-1.amazonaws.com
ProxyCommand  ssh -q acmetunnel nc -w 180 %h %p

Host  acmeqa
User  postgres
Hostname  qa
ProxyCommand  ssh -q acmereplica nc -w 180 %h %p

Note that we don't need the full hostname at this point for the "acmeqa" Hostname, as we can simply say 'qa' and the acmereplica box knows how to get there.

There is still some unwanted repetition in the file, so let's take advantage of the fact that the "Host" item inside the ssh config file will take wildcards as well. It's not really apparent until you use wildcards, but a ssh host can match more than one "Host" section in the ssh config file, and thus you can achieve a form of inheritance. (However, once something has been set, it cannot be changed, so you always want to set the more specific items first). Here's what the file looks like after adding a wildcard section:


## The QA box comes first: the first value found for an option wins,
## so its User and ProxyCommand must be seen before the wildcard ones
Host  acmeqa
User  root
Hostname  qa
ProxyCommand  ssh -q acmereplica nc -w 180 %h %p

Host  acme*
User  postgres
ProxyCommand  ssh -q greg@vp.endpoint.com nc -w 180 %h %p

Host  acmedb
Hostname  ec2-456-55-123-45.compute-1.amazonaws.com

Host  acmereplica
Hostname  ec2-456-55-123-99.compute-1.amazonaws.com

Notice that the file is now simplified quite a bit. If we run this command:


$ ssh acmereplica

...then the Host acme* section sets up both the User and the ProxyCommand. It then also matches on the Host acmereplica section and applies the Hostname there.
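
If you'd rather not trust your own reading of the file, ssh can tell you what it resolved. A small sketch, assuming OpenSSH 6.8 or later (which added the -G option to print the effective configuration without connecting); show_ssh_settings is just a throwaway name:

```shell
# show_ssh_settings [SSH_OPTIONS...] HOST
# Print the user, hostname, and proxycommand that ssh would use for HOST.
# Nothing is connected to; ssh -G only evaluates the config file.
show_ssh_settings() {
  ssh -G "$@" | grep -E '^(user|hostname|proxycommand) '
}
```

Running show_ssh_settings acmereplica should then list postgres as the user, the long ec2 name as the hostname, and the vp.endpoint.com ProxyCommand.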

Note that we have removed the "acmetunnel" section. Now that all the ProxyCommands are in a single place, we can simply go back to the original ProxyCommand and specify the exact user and host.

All of the above presumes we want to login as the postgres user, but there are also times when we need to login as a different user (e.g. 'root'). We can again use wildcards, this time to match the end of the host, to specify which user we want. Anything ending in the letter "r" means we log in as user root, and anything ending in the letter "p" means we log in as user postgres. Our final ssh config section for Acme is now:


##
## Client: Acme Corporation
##

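## The QA box goes first: its ProxyCommand must be seen before the
## generic one below, since the first value found for an option wins.
## We hop through the replica as postgres, hence the trailing "p".
Host  acmeqa*
Hostname  qa
ProxyCommand  ssh -q acmereplicap nc -w 180 %h %p

Host  acme*
ProxyCommand  ssh -q greg@vp.endpoint.com nc -w 180 %h %p
Host  acme*r
User  root
Host  acme*p
User  postgres

Host  acmedb*
Hostname  ec2-456-55-123-45.compute-1.amazonaws.com

Host  acmereplica*
Hostname  ec2-456-55-123-99.compute-1.amazonaws.com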

From this point on, if Acme decides to add a new server, adding it into our ssh config is as simple as adding two lines:


Host  acmedev*
Hostname  ec2-456-55-999-45.compute-1.amazonaws.com

This automatically sets up two hosts for us, "acmedevr" and "acmedevp". What if we leave out the ending "r" or "p" and just ssh to "acmedev"? Then we'll connect as the default user, or $ENV{USER} (in my case, "greg").

Have fun configuring your ssh config file, don't be afraid to leave lots of comments inside of it, and of course keep it in version control!

Version Control Visualization and End Point in Open Source

Over the weekend, I discovered an open source tool for version control visualization, Gource. I decided to put together a few videos to showcase End Point's involvement in several open source projects.

Here's a quick legend to help understand the videos below:

The branches and nodes correlate to directories and files, respectively. In the case of the image to the left, the repository has a main directory with several files and three directories. One of the child directories has one file and the other two have multiple files.
A big dot represents a person, and a flash connecting the person and a file signifies a commit.
White + blue dots represent current End Point employees.
White + grey dots represent former End Point employees.
White dots represent other people, out there!

The Videos

Interchange from endpoint on Vimeo.

pgsi from endpoint on Vimeo.

Spree from endpoint on Vimeo.

Bucardo from endpoint on Vimeo.

One of the articles that references Gource suggests that the videos can be used to visualize and analyze the community involvement of a project (open source or not). One might also be able to qualitatively analyze the stability of project file architecture from a video, but this won't reveal anything definitive about the code stability since external factors can influence file structure. For example, since I am intimately familiar with the progress of Spree, I can identify when Spree transitioned to Rails 3 in the video, which required reorganization of the Spree core functionality (read more about this here and here).

In the case of this article, I wanted to highlight End Point's involvement in a few open source projects where we've had various levels of involvement. We've contributed to Interchange since 2000. We've been involved in Spree less lately, but had more presence in early 2009. In the smaller projects Bucardo and pgsi, End Point employees have worked on a team to be the primary contributors to the projects in addition to a few external contributors. Open source is important to End Point, and it's great to see our presence demonstrated in these cute videos.

PostgreSQL 9.0 High Performance Review

I recently had the privilege of reading and reviewing the book PostgreSQL 9.0 High Performance by Greg Smith. While the title of the book suggests that it may be relevant only to PostgreSQL 9.0, there is in fact a wealth of information to be found which is relevant for all community supported versions of Postgres.

Achieving the highest performance with PostgreSQL is definitely something which touches all layers of the stack, from your specific disk hardware, OS and filesystem to the database configuration, connection/data access patterns, and queries in use. This book gathers up a lot of the information and advice that I've seen bandied about on the IRC channel and the PostgreSQL mailing lists and presents it in one place.

While seemingly related, I believe some of the main points of the book could be summed up as:

  1. Measure, don't guess. From the early chapters which cover the lowest-level considerations, such as disk hardware/configuration to the later chapters which cover such topics as query optimization, replication and partitioning, considerable emphasis is placed on determining the metrics by which to measure performance before/after specific changes. This is the only way to determine the impact the changes you make have.
  2. Tailor to your specific needs/workflows. While there are many good rules of thumb out there when it comes to configuration/tuning, this book emphasizes the process of refining those general numbers to tailor the configuration/setup to your specific database's needs.
  3. Review the information the database system itself gives you. Information provided by the pg_stat_* views can be useful in identifying bottlenecks in queries and unused/underused indexes.

This book also introduced me to a few goodies which I had not encountered previously. One of the more interesting ones is the pg_buffercache contrib module. This suite of functions allows you to peek at the internals of the shared_buffers cache to get a feel for which relations are heavily accessed on a block-by-block basis. The examples in the book show this being used to more accurately size shared_buffers based on the actual number of accesses to specific portions of different relations.

I found the book to be well-written (always a plus when reading technical books) and felt it covered quite a bit of depth given its ambitious scope. Overall, it was an informative and enjoyable read.

PostgreSQL 9.0 Admin Cookbook

I've been reading through the recently published book PostgreSQL 9.0 Admin Cookbook of late, and found that it satisfies an itch for me, at least for now. Every time I get involved in a new project, or work with a new group of people, there's a period of adjustment where I get introduced to new tools and new procedures. I enjoy seeing new (and not uncommonly, better) ways of doing the things I do regularly. At conferences I'll often spend time playing "What's on your desktop" with people I meet, to get an idea of how they do their work, and what methods they use. Questions about various people's favorite window manager, email reader, browser plugin, or IRC client are not uncommon. Sometimes I'm surprised by a utility or a technique I'd never known before, and sometimes it's nice just to see minor differences in the ways people do things, to expand my toolbox somewhat. This book did that for me.

As the title suggests, authors Simon Riggs and Hannu Krosing have organized their book similarly to a cookbook, made up of simple "recipes" organized in subject groups. Each recipe covers a simple topic, such as "Connecting using SSL", "Adding/Removing tablespaces", and "Managing Hot Standby", with detail sufficient to guide a user from beginning to end. Of course, in many of the more complex cases some amount of detail must be skipped, and in general this book probably won't provide its reader with an in-depth education, but it will provide a framework to guide further research into a particular topic. It includes a description of the manuals, and locations of some of the mailing lists to get the researcher started.

I've used PostgreSQL for many different projects and been involved in the community for several years, so I didn't find anything in the book that was completely unfamiliar. But PostgreSQL is an open source project with a large community. There exists a wide array of tools, many of which I've never had occasion to use. Reading about some of them, and seeing examples in print, was a pleasant and educational experience. For instance, one recipe describes "Selective replication using Londiste". My tool of choice for such problems is generally Bucardo, so I'd not been exposed to Londiste's way of doing things. Nor have I used pgstatspack, a project for collecting various statistics and metrics from database views which is discussed under "Collecting regular statistics from pg_stat_* views".

In short, the book gave me the opportunity to look over the shoulder of experienced PostgreSQL users and administrators to see how they go about doing things, and compare to how I've done them. I'm glad to have had the opportunity.

Upgrading old versions of Postgres

Old elephant courtesy of Photos8.com

The recent release of Postgres 9.0.0 at the start of October 2010 was not the only big news from the project. Also released were versions 7.4.30 and 8.0.26, which, as I noted in my usual PGP checksum report, are going to be the last publicly released revisions in the 7.4 and 8.0 branches. In addition, the 8.1 branch will no longer be supported by the end of 2010. If you are still using one of those branches (or something older!), this should be the incentive you need to upgrade as soon as possible. To be clear, this means that anyone running Postgres 8.1 or older is not going to get any official updates, including security and bug fixes.

A brief recap: Postgres uses major versions, containing two numbers, to indicate a major change in features and functionality. These are released about every two years. Each of these major versions has many revisions, which are released as often as needed. These revisions are designed to be completely binary compatible with the previous revision, meaning you can upgrade revisions very easily, with no dump and restore of the data needed.
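
As a rough illustration of the numbering scheme just described, the following Python sketch splits a version string into its two-number major version and its revision, then checks whether a move between two versions stays inside one major version (and so, per the recap above, needs no dump and restore). The function names are made up for this example.

```python
def split_version(version):
    """Split 'X.Y.Z' into ((X, Y), Z): the major version pair and the revision."""
    parts = [int(piece) for piece in version.split(".")]
    if len(parts) != 3:
        raise ValueError("expected a three-part version such as 8.0.26")
    return (parts[0], parts[1]), parts[2]


def is_revision_upgrade(old, new):
    """True when both versions share a major version and 'new' is not older."""
    old_major, old_revision = split_version(old)
    new_major, new_revision = split_version(new)
    return old_major == new_major and new_revision >= old_revision


print(split_version("8.0.26"))
print(is_revision_upgrade("7.4.2", "7.4.30"))
print(is_revision_upgrade("8.1.21", "8.4.4"))
```

Going from 7.4.2 to 7.4.30 counts as a revision upgrade, while going from 8.1.21 to 8.4.4 crosses major versions and does not.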

Below are the options available for those running older versions of Postgres, from the most desirable to the least desirable. The three general options are to upgrade to the latest release (9.0 as I write this), migrate to a newer version, or stay on your release.

1. Upgrade to the latest release

This is the best option, as each new version of Postgres adds more features and becomes more efficient, all while maintaining the high code quality standards Postgres is known for. There are three general approaches to upgrading: pg_upgrade, pg_dump, and Bucardo / Slony.

Using pg_upgrade

The pg_upgrade utility is the preferred method for upgrading in the future. Basically, it rewrites your data directory from the "old" on-disk format to the "new" one. Unfortunately, pg_upgrade only works from version 8.3 and onwards, which means it cannot be used if you are coming from an older version. (This utility used to be called pg_migrator, in case you see references to that.)

Dump and restore

The next best method is the tried and true "dump and restore". This involves using pg_dump to create a logical representation of the old database, and then loading it into your new database with pg_restore or psql. The disadvantage to this method is time - dump and reload can take a very, very long time for large databases. Not only does the data need to get loaded into the new database tables, but all the indexes must be recreated, which can be agonizingly slow.

Replication systems

A third option is to use a replication system such as Slony or Bucardo to help with the upgrade. With Slony, you can set up replication from the old version to the new version, and then fail over to the new version once replication is caught up and running smoothly. You can do something similar with Bucardo. Note that both systems can only replicate sequences, and tables containing primary keys or unique indexes. Bucardo has a "fullcopy" mode that will copy any table, regardless of primary keys, but it's slow as it's equivalent to a full dump and restore of the table. Note that Bucardo is really only tested on the 8.X versions: for anything older, you will need to use Slony.

Even if you cannot replicate all your tables, such systems can help a migration by replicating most of your data. For example, if you have a 750 GB table full of mostly historical data, you can have Bucardo start tracking changes to the table, set up a copy on the new version (perhaps by using warm standby or a snapshot to reduce load on the master), and then start Bucardo to catch up the rows that have changed since the changes were tracked. If you do this for all your large tables, the actual upgrade process can proceed with minimal downtime by shutting down the master, doing a pg_dump of only the non-tracked tables, and then pointing your apps at the new server.
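
To summarize the three approaches, here is a small Python sketch that lists which of them apply to a given old version, in the order of preference used above. It encodes only the constraints stated in this section (pg_upgrade from 8.3 onwards, Bucardo tested only on 8.X); the function name and the wording of the returned labels are invented for illustration.

```python
def upgrade_options(old_major):
    """Return the approaches from this section that apply, most preferred first.

    old_major is a (major, minor) tuple such as (8, 1) for the 8.1 branch.
    """
    options = []
    if old_major >= (8, 3):
        options.append("pg_upgrade")            # only works from 8.3 onwards
    options.append("dump and restore")          # always possible, but slow
    # Replication only covers sequences and tables with a primary key
    # or unique index; anything else still needs a dump or a full copy.
    if old_major[0] == 8:
        options.append("replication (Slony or Bucardo)")
    else:
        options.append("replication (Slony)")   # Bucardo is only tested on 8.X
    return options


for branch in [(7, 4), (8, 1), (8, 3)]:
    print(branch, upgrade_options(branch))
```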

2. Migrate to a newer version

Even if you don't go to 9.0, you may want to upgrade to a newer version. Why not go all the way to 9.0? There are only two good reasons not to. One, if your system's packaging system does not have 9.0 yet, or you have custom packaging requirements that prevent you from doing so. Two, if you have concerns about application compatibility between two versions. However, that latter concern should be minimal. The largest and most disruptive compatibility change appeared in version 8.3 with the removal of implicit casts. Since 8.2 is likely to be unsupported in the next couple of years, you should be going to at least 8.3. And if you can go to 8.3, you can go to 9.0.

3. Stay on your release

This is obviously the least-desirable option, but may be necessary due to real-world constraints involving time, testing, compatibility with other programs, etc. At the bare minimum, make sure you are at least running the latest revision, e.g. 7.4.30 if running 7.4. Moving forward, you will need to keep an eye on the Postgres commits list and/or the detailed release notes for new versions, and examine if any of the fixed bugs apply to your version or your situation. If they do, you'll need to figure out how to apply the patch to your older version, and then release this new version into your environment. Sound risky? It gets worse, because your patch is only being used and tested by an extremely small pool of people, has no build farm support, and is not available to the Postgres developers. If you want to go this route, there are companies familiar with the Postgres code base (including End Point) that will help you do so. But know in advance that we are also going to push you very hard to upgrade to a modern, supported version instead (which we can help you with as well, of course :).

PostgreSQL 8.4 in RHEL/CentOS 5.5

The announcement of end of support coming soon for PostgreSQL 7.4, 8.0, and 8.1 means that people who've put off upgrading their Postgres systems are running out of time before they're in the danger zone where critical bugfixes won't be available.

Given that PostgreSQL 7.4 was released in November 2003, that's nearly 7 years of support, quite a long time for free community support of an open-source project.

Many of our systems run Red Hat Enterprise Linux 5, which shipped with PostgreSQL 8.1. All indications are that Red Hat will continue to support that version of Postgres as it does all parts of a given version of RHEL during its support lifetime. But of course it would be nice to get those systems upgraded to a newer version of Postgres to get the performance and feature benefits of newer versions.

For any developers or DBAs familiar with Postgres, upgrading to a new version with RPMs from the PGDG or other custom Yum repository is not a big deal, but occasionally we've had a client worry that using packages other than the ones supplied by Red Hat is riskier.

For those holdouts still on PostgreSQL 8.1 because it's the "norm" on RHEL 5, Red Hat gave us a gift in their RHEL 5.5 update. It now includes separate PostgreSQL 8.4 packages that may optionally be used on RHEL 5 instead of PostgreSQL 8.1. (Both can't be used on the same system at the same time.)

I know that getting these packages from Red Hat shouldn't be necessary, but for those who feel jittery about using 3rd-party packages, it's a good nudge to switch to Postgres 8.4 using Red Hat's supported packages. Thanks to Tom Lane at Red Hat for making this happen. Though I don't know whose idea it was, Tom is the author of all the RPM commitlog messages, so thanks, Tom!

This brings up a few other rhetorical questions: Will RHEL 6 ship with PostgreSQL 9.0? Will RHEL 5.6 have backported PostgreSQL 9.0 in similar postgresql90 packages? It'd be great to see each new PostgreSQL release have supported packages in RHEL so that there's even less reason to start a new project on an older version of Postgres. RHEL 5.5 with PostgreSQL 8.4 is a nice start in that direction.

Postgres configuration best practices

This is the first in an occasional series of articles about configuring PostgreSQL. The main way to do this, of course, is the postgresql.conf file, which is read by the Postgres daemon on startup and contains a large number of parameters that affect the database's performance and behavior. Later posts will address specific settings inside this file, but before we do that, there are some global best practices to address.

Version Control

The single most important thing you can do is to put your postgresql.conf file into version control. I care not which one you use, but go do it right now. If you don't already have a version control system on your database box, git is a good choice to use. Barring that, RCS. Doing so is extremely easy. Just change to the directory postgresql.conf is in. The process for git:

  • Install git if not there already (e.g. "sudo yum install git")
  • Run: git init
  • Run: git add postgresql.conf pg_hba.conf
  • Run: git commit -a -m "Initial commit"

For RCS:

  • Install as needed (e.g. "sudo apt-get install rcs")
  • Run: mkdir RCS
  • Run: ci -l postgresql.conf pg_hba.conf

Note that we checked in pg_hba.conf as well. You want to check in any file in that directory you may possibly change. For most people, that only means postgresql.conf and pg_hba.conf, but if you use other files (pg_ident.conf) check those in as well.

Ideally you want the version checked in to be the "raw" configuration files that came with the system - in other words, before you started messing with them. Then you make your initial changes and check it in. From then on of course, you commit every time you change the file.

At a bare minimum, the version control system should be telling you:

  • Exactly what was changed
  • When it was changed
  • Who made the change
  • Why it was changed

The first two items happen automatically in all version control systems, so you don't have to worry about those. The third item, "who made the change", must be entered manually if on a shared account (e.g. postgres) and using RCS. If you are using git, you can simply set the environment variables GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL. For shared accounts, I have a custom bashrc file called "gregbashrc" that is called when I log in that sets those ENVs as well as a host of other items.

The fourth item, "why it was changed", is generally the content of the commit message. Never leave this blank, and be as descriptive and verbose as possible - someone later on will be grateful you did. It's okay to be repetitive and state the obvious. If this was done as part of a specific ticket number or project name, mention that as well.

Safe Changes

It's important that the changes you make to the postgresql.conf file (or other files) actually work and don't cause Postgres to be unable to parse the file, or handle a changed setting. Never make changes and restart Postgres, because if it doesn't work, you've got a broken config file, no Postgres daemon, and most likely unhappy applications and/or users. At the very least, do a reload first (e.g. /etc/init.d/postgresql reload or just kill -HUP the PID). Check the logs and see if Postgres was happy with your changes. If you are lucky, it won't even require a restart (some changes do, some do not).

A better way to test your changes is to make them on an identical test box. That way, all the wrinkles are ironed out before you make the changes on production and attempt a reload or restart.

Another way I've found handy is to simply start a new Postgres daemon. Sounds like a lot of work, but it's pretty automatic once you've done it a few times. The process generally looks like this, assuming your production postgresql.conf is in the "data" directory, and your changes are in data/postgresql.conf.new:

  • cd ..
  • initdb testdata
  • cp -f data/postgresql.conf.new testdata/postgresql.conf
  • echo port=5555 >> testdata/postgresql.conf
  • echo max_connections=10 >> testdata/postgresql.conf

The max_connections is not strictly necessary, of course, but unless you are changing something that relies on that setting, it's nicer to keep it (and the resulting memory) low.

  • pg_ctl -D testdata -l test.log start
  • cat test.log
  • pg_ctl -D testdata stop
  • rm -fr testdata (or just keep it around for next time)

The test.log file will show you any problems that might have popped up with your changes, and once it works you can be fairly confident it will work for the "main" daemon as well, so to finish up:

  • cd data
  • mv -f postgresql.conf.new postgresql.conf
  • git commit postgresql.conf -m "Adjusted random_page_cost to 2, per bug #4151"
  • kill -HUP `head -1 postmaster.pid`
  • psql -c 'show random_page_cost'

Keeping it Clean

The postgresql.conf file is fairly long, and can be confusing to read with its mixture of comments, in-line comments, strange wrapping, and the commented out vs. not-commented-out variables. Hence, I recommend this system:

  • Put a big notice at the top of the file asking people to make changes to the bottom
  • Put all important variables at the bottom, sans comments, one per line
  • Line things up
  • Put into logical groups.

This avoids having to hunt for settings, prevents the gotcha of when a setting is changed twice in the file, and makes things much easier to read visually. Here's what I put at the top of the postgresql.conf:

##
## PLEASE MAKE ALL CHANGES TO THE BOTTOM OF THIS FILE!
##

I then add a good 20+ empty lines, so anyone viewing the file is forced to focus on the all-caps message above.

The next step is to put all the settings you care about at the bottom of the file. Which ones should you care about? Any setting you have changed (obviously), any setting that you *might* change in the future, and any that you may not have changed, but someone may want to look up. In practice, this means a list of about 25 items. After aligning all the values to the right and breaking things into logical groups, here's what the bottom of the postgresql.conf looks like:

## Connecting
port                            = 5432
listen_addresses                = '*'
max_connections                 = 100

## Memory
shared_buffers                  = 400MB
work_mem                        = 1MB
maintenance_work_mem            = 1GB

## Disk
fsync                           = on
synchronous_commit              = on
full_page_writes                = on
checkpoint_segments             = 100

## PITR
archive_mode                    = off
archive_command                 = ''
archive_timeout                 = 0

## Planner
effective_cache_size            = 18GB
random_page_cost                = 2

## Logging
log_destination                 = 'stderr'
logging_collector               = on
log_filename                    = 'postgres-%Y-%m-%d.log'
log_truncate_on_rotation        = off
log_rotation_age                = 1d
log_rotation_size               = 0
log_min_duration_statement      = 200
log_statement                   = 'ddl'
log_line_prefix                 = '%t %u@%d %p'

## Autovacuum
autovacuum                      = on
autovacuum_vacuum_scale_factor  = 0.1
autovacuum_analyze_scale_factor = 0.3

Because everything is in one place, at the bottom of the file, and not commented out, it's very easy to see what is going on. The groups above are somewhat arbitrary, and you can leave them out or create your own, but at least keep things grouped together as much as possible. When in doubt, use the same order as they appear in the original postgresql.conf.
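
If you would rather generate that aligned block than maintain the spacing by hand, something like the following Python sketch would do. It is only an illustration: the function name and the nested-dictionary input are my own choices, and it pads every name to the longest one so the equals signs line up.

```python
def format_settings(groups):
    """Render {group name: {setting: value}} as aligned 'name = value' lines."""
    width = max(len(name) for settings in groups.values() for name in settings)
    lines = []
    for group, settings in groups.items():
        if lines:
            lines.append("")                  # blank line between groups
        lines.append("## " + group)
        for name, value in settings.items():
            lines.append(name.ljust(width) + " = " + str(value))
    return "\n".join(lines)


example = {
    "Connecting": {"port": 5432, "max_connections": 100},
    "Memory": {"shared_buffers": "400MB", "work_mem": "1MB"},
}
print(format_settings(example))
```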

Sometimes people change important settings in a group, such as for bulk loading of data. In this case, I usually make a separate group for it at the very bottom. This makes it easy to switch back and forth, and helps to prevent people from (for example) forgetting to switch fsync back on:

## Bulk loading only - leave 'on' for everyday use!
autovacuum                      = off
fsync                           = off
full_page_writes                = off
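
Because the last occurrence of a setting in postgresql.conf is the one that takes effect, a group like this silently overrides the earlier values. A quick way to see every setting that appears more than once is sketched below in Python. This is a simplified reader written for this example, not the parser Postgres itself uses: it understands "name = value" lines and trailing # comments, but not backslash escapes or include directives.

```python
import re

# Matches "name = value" or "name value"; the "=" is optional in postgresql.conf.
SETTING_RE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*(?:=\s*|\s+)(.*)$")


def strip_comment(line):
    """Drop a trailing # comment, ignoring # characters inside single quotes."""
    in_quote = False
    for pos, ch in enumerate(line):
        if ch == "'":
            in_quote = not in_quote
        elif ch == "#" and not in_quote:
            return line[:pos]
    return line


def find_repeated_settings(conf_text):
    """Map each setting that is set more than once to its (line number, value) pairs."""
    seen = {}
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = strip_comment(raw).strip()
        if not line:
            continue
        match = SETTING_RE.match(line)
        if match:
            name = match.group(1).lower()
            seen.setdefault(name, []).append((lineno, match.group(2).strip()))
    return {name: hits for name, hits in seen.items() if len(hits) > 1}


sample = "\n".join([
    "fsync = on",
    "log_line_prefix = '%t %u@%d %p'",
    "#fsync = off",
    "fsync = off   # bulk loading only",
])
for name, hits in find_repeated_settings(sample).items():
    print(name, hits, "-> effective value:", hits[-1][1])
```

Running it against the real file before a reload makes it obvious when a bulk-loading group has been left active.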

Ownership and permissions

All the conf files should be owned by the postgres user, and the configuration files should be world-readable if possible (indeed, it's a requirement on Debian-based systems that postgresql.conf be readable for psql to work!). Be careful about SELinux as well: it can get ornery if you do things like use symlinks.
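
A check along these lines is easy to script. The Python sketch below reports when a file is not owned by the expected user ID or is not world-readable; the function name is hypothetical, and it deliberately says nothing about SELinux contexts.

```python
import os
import stat


def conf_file_problems(path, expected_uid):
    """Return a list of ownership/permission problems for one config file."""
    info = os.stat(path)
    problems = []
    if info.st_uid != expected_uid:
        problems.append("owned by uid %d, expected %d" % (info.st_uid, expected_uid))
    if not info.st_mode & stat.S_IROTH:
        problems.append("not world-readable")
    return problems
```

On a real server you might look up the expected owner with pwd.getpwnam("postgres").pw_uid and call the function for postgresql.conf, pg_hba.conf, and any other conf file you track.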

Backups

One final note - make sure you are backing up your changes as well. PITR and pg_dump won't save your postgresql.conf! If you are checking things in to a remote version control system, then some of the pressure is off, but you should have some sort of policy for backing up all your conf files explicitly. Even if using a local git repo, tarring and copying up the whole thing is usually a very quick and cheap action.
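
As one possible shape for such a policy, the Python sketch below bundles the conf files and any local version control directory into a timestamped tarball that can then be copied elsewhere. The file list, the archive name, and the function name are assumptions made for this example; adjust them to whatever you actually keep in that directory.

```python
import os
import tarfile
import time

# Names to pick up if they exist; .git and RCS hold the local history.
DEFAULT_NAMES = ("postgresql.conf", "pg_hba.conf", "pg_ident.conf", ".git", "RCS")


def archive_conf_files(conf_dir, dest_dir, names=DEFAULT_NAMES):
    """Write a timestamped .tar.gz of the listed files and return its path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(dest_dir, "pgconf-%s.tar.gz" % stamp)
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in names:
            full_path = os.path.join(conf_dir, name)
            if os.path.exists(full_path):
                # arcname stores a relative name rather than the absolute path
                archive.add(full_path, arcname=name)
    return archive_path
```

Listing the files explicitly matters when the conf files live inside the data directory, since archiving the whole directory would pull in the database files too.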

Anonymous code blocks

With the release of PostgreSQL 9.0 comes the ability to execute "anonymous code blocks" in several of the PostgreSQL procedural languages. The idea stemmed from work back in autumn of 2009 that tried to respond to a common question on IRC or the mailing lists: how do I grant a permission to a particular user for all objects in a schema? At the time, the only solution short of manually writing commands to grant the permission in question on every object individually was to write a script of some sort. Further discussion uncovered several people who often found themselves writing simple functions to handle various administrative tasks. Many of those people, it turned out, would rather simply call one statement than create a function, call the function, and then drop (or just ignore) the function they'd never need again. Hence, the new DO command.

The first language to support DO was PL/pgSQL. The PostgreSQL documentation provides an example to answer the original question: how do I grant permissions on everything to a particular user?

DO $$DECLARE r record;
BEGIN
    FOR r IN SELECT table_schema, table_name FROM information_schema.tables
             WHERE table_type = 'VIEW' AND table_schema = 'public'
    LOOP
        EXECUTE 'GRANT ALL ON ' || quote_ident(r.table_schema) || '.' || quote_ident(r.table_name) || ' TO webuser';
    END LOOP;
END$$;

Notice that this doesn't actually tell us what language to use. If no language is specified, DO defaults to PL/pgSQL (which, in 9.0, is enabled by default). But you can use other languages as well:

DO $$
HAI
    BTW Calculate pi using Gregory-Leibniz series
    BTW This method does not converge particularly quickly...
    I HAS A PIADD ITZ 0.0
    I HAS A PISUB ITZ 0.0
    I HAS A ITR ITZ 0
    I HAS A T1
    I HAS A T2
    I HAS A PI ITZ 0.0
    I HAS A ITERASHUNZ ITZ 1000

    IM IN YR LOOP
        T1 R QUOSHUNT OF 4.0 AN SUM OF 3.0 AN ITR
        T2 R QUOSHUNT OF 4.0 AN SUM OF 5.0 AN ITR
        PISUB R SUM OF PISUB AN T1
        PIADD R SUM OF PIADD AN T2
        ITR R SUM OF ITR AN 4.0
        BOTH SAEM ITR AN BIGGR OF ITR AN ITERASHUNZ, O RLY?
            YA RLY, GTFO
        OIC
    IM OUTTA YR LOOP
    PI R SUM OF 4.0 AN DIFF OF PIADD AN PISUB
    VISIBLE "PI R: "
    VISIBLE PI
    FOUND YR PI
KTHXBYE
$$ LANGUAGE PLLOLCODE;

I tried to rewrite the GRANT function shown above in PL/LOLCODE for this example, until I discovered that some of PL/LOLCODE's limitations make it extremely difficult, if not impossible. So far as I know, PL/LOLCODE was the second language to support anonymous blocks, thanks to what turned out to be a relatively simple programming exercise. After finishing PL/LOLCODE's DO support, I decided to do the same for PL/Perl. I wasn't particularly surprised to find that PL/Perl was harder to extend than PL/LOLCODE; PL/Perl is a much more feature-rich (and hence, complicated) language and I wasn't as familiar with its internals. However, after my initial submission and with helpful commentary from several other people, Andrew Dunstan tied off the loose ends and got it committed. It looks like this:

DO $$
    # Collect the schema-qualified, safely quoted name of every view in "public"
    my $rv = spi_exec_query(q{
        SELECT quote_ident(table_schema) || '.' || quote_ident(table_name) AS relname
        FROM information_schema.tables WHERE table_type = 'VIEW' AND table_schema = 'public'
    });
    my $nrows = $rv->{processed};
    foreach my $i (0 .. $nrows - 1) {
        # Fetch the current row by its loop index and grant on that object
        my $row = $rv->{rows}[$i];
        spi_exec_query("GRANT ALL ON $row->{relname} TO webuser");
    }
$$ LANGUAGE plperl;

DO wasn't the only thing to come from the pgsql-hackers discussion I mentioned above. In PostgreSQL 9.0, the GRANT command has also been modified, so it's now possible to grant permissions on several objects in one stroke. For instance:

GRANT SELECT ON ALL TABLES IN SCHEMA public TO webuser
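
For comparison, the pre-9.0 "write a script of some sort" approach might look roughly like the Python sketch below, which turns a list of (schema, object) pairs into individual GRANT statements. The helper names are mine, and this quote_ident is a simplified stand-in for the server-side function: it always adds double quotes, whereas the real one only quotes when necessary. The privilege string is assumed to come from you, not from user input.

```python
def quote_ident(name):
    """Always double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def grant_statements(objects, privilege, role):
    """Build one GRANT statement per (schema, object name) pair."""
    return [
        "GRANT %s ON %s.%s TO %s;"
        % (privilege, quote_ident(schema), quote_ident(obj), quote_ident(role))
        for schema, obj in objects
    ]


for statement in grant_statements(
    [("public", "orders_view"), ("public", 'odd"name')], "ALL", "webuser"
):
    print(statement)
```

The list of objects would normally come from a query against information_schema.tables, as in the DO examples above.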

pg_wrapper's very symbolic links

I like pg_wrapper. For a development environment, or testing replication scenarios, it's brilliant. If you're not familiar with pg_wrapper and its family of tools, it's a set of scripts in the postgresql-common and postgresql-client-common packages available in Debian, as well as Ubuntu and other Debian-like distributions. As you may have guessed pg_wrapper itself is a wrapper script that calls the correct version of the binary you're invoking – psql, pg_dump, etc – depending on the version of the database you want to connect to. Maybe not all that exciting in itself, but implied therein is the really cool bit: This set of tools lets you manage multiple installations of Postgres, spanning multiple versions, easily and reliably.

Well, usually reliably. We were helping a client upgrade their production boxes from Postgres 8.1 to 8.4. This was just before the 9.0 release, otherwise we'd have considered moving them directly to that instead. It was going fairly smoothly until on one box we hit this message:

Could not parse locale out of pg_controldata output

Oops, they had pinned the older postgresql-common version. An upgrade of those packages and no more error!

$ pg_lsclusters
Version Cluster   Port Status Owner    Data directory                     Log file
8.1     main      5439 online postgres /var/lib/postgresql/8.1/main       custom
Error: Invalid data directory

Hmm, interesting. Okay, so not quite, got a little bit more work to do. This one took some tracing through the code. The pg_wrapper scripts, if they don't already know it, look for the data directory in a couple of places. The first stop is the postgresql.conf file, specifically /etc/postgresql/<version>/<cluster-name>/postgresql.conf, looking for the data_directory parameter. But, in its transitional state at the time, the postgresql.conf was still a work in progress.

The second place it looks is a symlink in the same /etc/postgresql/<version>/<cluster-name>/ directory. While that's the old way of doing things, it at least let us get things looking reasonable:

# ln -s /var/lib/postgresql/8.4/main /etc/postgresql/8.4/main/pgdata
# /etc/init.d/postgresql-8.4 status
8.1     main      5439 online postgres /var/lib/postgresql/8.1/main       custom
8.4     main      5432 online postgres /var/lib/postgresql/8.4/main       custom

Voilà! From there we were able to proceed with the upgrade, confident that the instance will behave as expected. And now, everything is running great!

As with most things that provide a simpler experience on the surface, there's additional complexity under the hood. But for now, we have one more client upgraded. Thanks, Postgres!
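
The lookup order described above can be modelled in a few lines of Python. To be clear, this is not the actual postgresql-common code, just a sketch of the behavior as I traced it: first an uncommented data_directory line in postgresql.conf, then a pgdata symlink in the same directory. The function name is invented.

```python
import os
import re

# An uncommented line such as: data_directory = '/var/lib/postgresql/8.4/main'
DATA_DIR_RE = re.compile(r"^\s*data_directory\s*=?\s*'([^']*)'")


def find_data_directory(conf_dir):
    """Return the data directory for a cluster config dir, or None if unknown."""
    conf_path = os.path.join(conf_dir, "postgresql.conf")
    if os.path.isfile(conf_path):
        found = None
        with open(conf_path) as conf:
            for line in conf:
                match = DATA_DIR_RE.match(line)
                if match:
                    found = match.group(1)   # a later line overrides an earlier one
        if found:
            return found
    # Fall back to the older convention: a symlink named pgdata
    link = os.path.join(conf_dir, "pgdata")
    if os.path.islink(link):
        return os.path.realpath(link)
    return None
```

Called on a directory like /etc/postgresql/8.4/main, it returns None in exactly the situation we hit: a postgresql.conf without data_directory and no symlink yet.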

Listen/Notify improvements in PostgreSQL 9.0

Improved listen/notify is one of the new features of Postgres 9.0 that I've been looking forward to for a long time. There are basically two major changes: everything is in shared memory instead of using system tables, and full support for "payload" messages is enabled.

Before I demonstrate the changes, here's a review of what exactly the listen/notify system in Postgres is. Basically, it is an inter-process signalling system, which uses the pg_listener system table to coordinate simple named events between processes. One or more clients connect to the database and issue a command such as:

LISTEN foobar;

The name foobar can be replaced by any valid name; usually the name is something that gives a contextual clue to the listening process, such as the name of a table. Another client (or even one of the original ones) will then issue a notification like so:

NOTIFY foobar;

Each client that is listening for the 'foobar' message will receive a notification that the sender has issued the NOTIFY. It also receives the PID of the sending process. Multiple notifications are collapsed into a single notice, and the notification is not sent until a transaction is committed.

Here's some sample code using DBD::Pg that demonstrates how the system works:

#!/usr/bin/env perl
# -*-mode:cperl; indent-tabs-mode: nil-*-

use strict;
use warnings;
use DBI;

my $dsn = 'dbi:Pg:dbname=test';
my $dbh1 = DBI->connect($dsn,'test','', {AutoCommit=>0,RaiseError=>1,PrintError=>0});
my $dbh2 = DBI->connect($dsn,'test','', {AutoCommit=>0,RaiseError=>1,PrintError=>0});

print "Postgres version is $dbh1->{pg_server_version}\n";

my $SQL = 'SELECT pg_backend_pid(), version()';
my $pid1 = $dbh1->selectall_arrayref($SQL)->[0][0];
my $pid2 = $dbh2->selectall_arrayref($SQL)->[0][0];
print "Process one has a PID of $pid1\n";
print "Process two has a PID of $pid2\n";

## Process one listens for a notice named "jtx"
$dbh1->do(q{LISTEN jtx});
$dbh1->commit();
## Process one checks for any notices received
print show_notices($dbh1);

## Process two sends a notice, but does not commit
$dbh2->do(q{NOTIFY jtx});
## Process one does not see the notice yet
print show_notices($dbh1);
## Process two sends the same notice again, then commits
$dbh2->do(q{NOTIFY jtx});
$dbh2->commit();

sleep 1; ## Ensure the notice has time to propagate
## Process one receives a single notice from process two
print show_notices($dbh1);

## Now that it has seen the notice, it reports nothing again:
print show_notices($dbh1);

sub show_notices { ## Function to return any notices received
       my $dbh = shift;
       my $messages = '';
       $dbh->commit();
       while (my $n = $dbh->func('pg_notifies')) {
          $messages .= "Got notice '$n->[0]' from PID $n->[1]\n";
       }
       return $messages || "No messages\n";
}

The output of the above script on an 8.4 Postgres server is:

Postgres version is 80401
Process one has a PID of 18238
Process two has a PID of 18239
No messages
No messages
Got notice 'jtx' from PID 18239
No messages

As expected, we got a notification only after the other process committed.

Note that because this is asynchronous and involves the system tables, we added a sleep call to ensure that the notice had time to propagate so that the other processes will see it. Without the sleep, we usually see four "No messages" appear, as the script goes too fast for the pg_listener table to catch up.

Now for the aforementioned payloads. Payloads allow an arbitrary string to be attached to the notification, such that you can have a standard name like before, but you can also attach some specific text that the other processes can see. I added support for payloads to DBD::Pg back in June 2008, so let's modify the script a little bit to demonstrate the new payload mechanism:

...
## Process two sends two notices, but does not commit
$dbh2->do(q{NOTIFY jtx, 'square'});
$dbh2->do(q{NOTIFY jtx, 'square'});
## Process one does not see the notice yet
print show_notices($dbh1);
## Process two sends a notice with a different payload, then commits
$dbh2->do(q{NOTIFY jtx, 'triangle'});
$dbh2->commit();
...
 ## This part changes: we get an extra item from our array:
 $messages .= "Got notice '$n->[0]' from PID $n->[1] message is '$n->[2]'\n";
...

Here's what the output looks like under version 9.0 of Postgres:

Postgres version is 90000
Process one has a PID of 19089
Process two has a PID of 19090
No messages
No messages
Got notice 'jtx' from PID 19090 message is 'square'
Got notice 'jtx' from PID 19090 message is 'triangle'
No messages

Note that the collapsing of identical messages into a single notification now takes into account the message as well, so we received two notifications in the above example for the three total notifications sent. To add a payload, we simply say NOTIFY, then the name of the notification, add a comma, and specify a payload as a quoted string. Of course, the payload string is still completely optional. If no payload is specified, DBD::Pg will simply treat the payload as an empty string (this is also the behavior when you request the payload using DBD::Pg against a pre-9.0 server, so all combinations should be 100% backwards compatible).
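
To make the collapsing rule concrete, here is a toy in-memory model in Python. It has nothing to do with how Postgres implements the feature internally, and it simplifies LISTEN by making it take effect immediately; the class and method names are invented. It only mirrors the two behaviors shown above: notifications are held until the sender commits, and duplicates within a transaction are collapsed on the (name, payload) pair.

```python
class ToyNotifyBus:
    """A toy stand-in for 9.0-style LISTEN/NOTIFY delivery semantics."""

    def __init__(self):
        self.listeners = {}   # channel -> set of listener names
        self.inboxes = {}     # listener name -> list of (channel, sender, payload)
        self.pending = {}     # sender -> ordered list of (channel, payload)

    def listen(self, listener, channel):
        self.listeners.setdefault(channel, set()).add(listener)
        self.inboxes.setdefault(listener, [])

    def notify(self, sender, channel, payload=""):
        queued = self.pending.setdefault(sender, [])
        if (channel, payload) not in queued:    # collapse identical notifications
            queued.append((channel, payload))

    def commit(self, sender):
        # Nothing is delivered until the sending transaction commits
        for channel, payload in self.pending.pop(sender, []):
            for listener in self.listeners.get(channel, ()):
                self.inboxes[listener].append((channel, sender, payload))

    def rollback(self, sender):
        self.pending.pop(sender, None)          # a rollback discards pending notices

    def fetch(self, listener):
        messages = self.inboxes.get(listener, [])
        self.inboxes[listener] = []
        return messages


bus = ToyNotifyBus()
bus.listen("one", "jtx")
bus.notify("two", "jtx", "square")
bus.notify("two", "jtx", "square")
print(bus.fetch("one"))      # nothing yet: process two has not committed
bus.notify("two", "jtx", "triangle")
bus.commit("two")
print(bus.fetch("one"))      # one 'square' and one 'triangle'
```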

We also got rid of the sleep. Because we are now using shared memory instead of system tables, there is no lag whatsoever, and the other process can see the notices right away.

Another large advantage to removing the pg_listener table is that systems that make heavy use of it (such as the replication systems Bucardo and Slony) no longer have to worry about bloat in that table.

The use of payloads also means that many applications can be greatly simplified: in the past, you had to be creative in naming your notifications in order to pass meta-information to your listener. For example, Bucardo uses a large collection of notifications, meaning that the Bucardo processes had to do the equivalent of things like this:

$dbh->do(q{LISTEN bucardo_reload_config});
$dbh->do(q{LISTEN bucardo_log_message});
$dbh->do(qq{LISTEN bucardo_activate_sync_$sync});
$dbh->do(qq{LISTEN bucardo_deactivate_sync_$sync});
$dbh->do(qq{LISTEN bucardo_kick_sync_$sync});
...
while (my $notice = $dbh->func('pg_notifies')) {
 my ($name, $pid) = @$notice;
 if ($name eq 'bucardo_reload_config') {
 ...
 }
 elsif ($name =~ /bucardo_kick_sync_(.+)/) {
 ...
 }
...
}

We can instead do things like this:

$dbh->do(q{LISTEN bucardo});
...
while (my $notice = $dbh->func('pg_notifies')) {
 my ($name, $pid, $msg) = @$notice;
 if ($msg eq 'bucardo_reload_config') {
 ...
 }
 elsif ($msg =~ /bucardo_kick_sync_(.+)/) {
 ...
 }
...
}

I hope to add this support to Bucardo shortly; it's simply a matter of refactoring all the listen and notify calls into a function that does the right thing depending on the server version it is attached to.
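
The version switch itself is simple. Bucardo is written in Perl, so the following Python sketch is only meant to show the shape of such a helper, not proposed Bucardo code: given the numeric server version (as reported by pg_server_version above), it builds either a payload-style NOTIFY on a single "bucardo" name or an old-style NOTIFY with the information packed into the name. The function name and argument layout are hypothetical.

```python
import re

# Only allow names that are safe to use unquoted in LISTEN/NOTIFY
NAME_RE = re.compile(r"[a-z_][a-z0-9_]*")


def build_notify(server_version, event, sync=None):
    """Return the NOTIFY statement appropriate for the given server version."""
    if sync is None:
        name = "bucardo_" + event
    else:
        name = "bucardo_%s_%s" % (event, sync)
    if not NAME_RE.fullmatch(name):
        raise ValueError("unsafe notification name: %r" % name)
    if server_version >= 90000:
        # 9.0 and later: one channel, details carried in the payload
        return "NOTIFY bucardo, '%s'" % name
    # Before 9.0: no payloads, so the details live in the name itself
    return "NOTIFY %s" % name


print(build_notify(80401, "kick_sync", "sales"))
print(build_notify(90000, "kick_sync", "sales"))
```

A matching helper for LISTEN would issue a single LISTEN bucardo on 9.0 and the full list of names on older servers.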

PostgreSQL odd checkpoint failure

Nothing strikes fear into the heart of a DBA like error messages, particularly ones which indicate that there may be data corruption. We ran into one such situation recently during an upgrade to PostgreSQL 8.1.21. We had updated the software and were manually running a REINDEX DATABASE command when we started to notice some errors being reported on the front-end. We decided to dump the database in question to ensure we had a backup to return to; however, we still ended up with more messages:

  pg_dump -Fc database1 > pgdump.database1.archive

  pg_dump: WARNING:  could not write block 1 of 1663/207394263/443523507
  DETAIL:  Multiple failures --- write error may be permanent.
  pg_dump: ERROR:  could not open relation 1663/207394263/443523507: No such file or directory
  CONTEXT:  writing block 1 of relation 1663/207394263/443523507
  pg_dump: SQL command to dump the contents of table "table1" failed: PQendcopy() failed.
  pg_dump: Error message from server: ERROR:  could not open relation 1663/207394263/443523507: No such file or directory
  CONTEXT:  writing block 1 of relation 1663/207394263/443523507
  pg_dump: The command was: COPY public."table1" (id, field1, field2, field3) TO stdout;

Looking at the pg_database contents revealed that 207394263 was not even the database in question. I connected to the aforementioned database and looked for a relation whose pg_class.oid matched the last number in the path (443523507) and, failing that, one whose pg_class.relfilenode matched. This search revealed nothing. So where was the object itself living, and why were we getting this message?
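
For readers who have not had to decode one of these paths before, the three numbers are the tablespace OID, the database OID, and the relfilenode. Here is a small Python sketch that splits the path from the error message and builds the catalog lookups described above. The function and class names are made up for the example; the catalog columns are the standard ones.

```python
from typing import NamedTuple


class RelFile(NamedTuple):
    tablespace_oid: int
    database_oid: int
    relfilenode: int


def parse_relation_path(path):
    """Split 'tablespace/database/relfilenode' as shown in the error text."""
    tablespace, database, relfilenode = (int(p) for p in path.split("/"))
    return RelFile(tablespace, database, relfilenode)


def lookup_queries(rel):
    """Build the catalog queries to identify the pieces of the path.

    The pg_class query only finds anything when run while connected to
    the database that the second number points at.
    """
    return [
        f"SELECT spcname FROM pg_tablespace WHERE oid = {rel.tablespace_oid}",
        f"SELECT datname FROM pg_database WHERE oid = {rel.database_oid}",
        "SELECT relname FROM pg_class "
        f"WHERE oid = {rel.relfilenode} OR relfilenode = {rel.relfilenode}",
    ]


rel = parse_relation_path("1663/207394263/443523507")
for query in lookup_queries(rel):
    print(query)
```

The 1663 at the front is the OID of the default tablespace (pg_default), which is why it shows up in so many of these messages.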

We decided that, since something appeared to be awry with the database system in general, we should take this opportunity to dump the tables in question. I proceeded to write a quick script to go through the database tables and dump each one individually using pg_dump's -t option. This worked for some of the tables, but not all of them; the rest died with the same error. Looking at the pg_class.relpages field for the non-dumpable tables revealed that these were all the larger tables in the database. Obviously not good, since this is where the bulk of the data lay. However, we also noticed that the message that we got referenced the exact same filesystem path every time, so it appeared to be something separate from the table that was being dumped.

After some advice on IRC, we reviewed the logs for checkpoint logging, which revealed that checkpoints had been failing. This meant that the database could not be shut down cleanly, had we wanted to restart it to see if that cleared up the flakiness. Our only option would have been a hard kill, which is definitely something to avoid, WAL or not, particularly since there had not been a checkpoint for some time. A manual CHECKPOINT also failed after a timeout.

Before we went down the road of forcing a hard server shutdown, we ended up just touching the specific relation path in question into existence and then running a CHECKPOINT. This time, since the file existed, PostgreSQL was able to complete the checkpoint, which restored working order to the database. We successfully (and quickly) ran a full pg_dump, and went about the task of manually vetting a few of the affected tables, etc.

Our working theory for this is that somehow there was a dirty buffer that referenced a relation that no longer existed. Whenever there was a checkpoint or other event which attempted to flush shared_buffers (e.g., the loading of a large relation which would require a flush of Least Recently Used pages, as in the pg_dump case), the flush attempt for the missing relation failed, which aborted the checkpoint or other action.

Once the file existed and PostgreSQL had successfully synced to disk, the file was two blocks long: the first block was completely empty and the second block looked like an index page (due to the layout/contents of the data). The most suggestive cause was that there had been an interrupted REINDEX earlier in the day. Since this machine was showing no other signs of data corruption and everything else seemed reasonable, our best guess is that there was some race condition that had caused the relation's data to exist in memory even while the canceled REINDEX ensured that the actual relfile and the pg_class rows did not exist for the buffer.

Perl Testing - stopping the firehose

I maintain a large number of Perl modules and scripts, and one thing they all have in common is a test suite, which is basically a collection of scripts inside a "t" subdirectory used to thoroughly test the behavior of the program. When using Perl, this means you are using the awesome Test::More module, which uses the Test Anything Protocol (TAP). While I love Test::More, I often find myself needing to stop the testing entirely after a certain number of failures (usually one). This is the solution I came up with.

Normally tests are run as a group, by invoking all files named t/*.t; each file has numerous tests inside of it, and these individual tests issue a pass or a fail. At the end of each file, a summary is output stating how many tests passed and how many failed. So why is stopping after a failed test even needed? The reasons below mostly relate to the tests I write for the Bucardo program, which has a fairly large and complex test suite. Some of the reasons I like having fine-grained control of when to stop are:

  • Scrolling back through screens and screens of failing tests to find the point where the test began to fail is not just annoying, but a very unproductive use of my time.
  • Tests are very often dependent. If test #23 fails, it means there is a very good chance that most if not all of the subsequent tests are going to fail as well, and it makes no sense for me to look at fixing anything but test #23 first.
  • Tests can take a very long time to run, and I can't wait around for the errors to start appearing and hit ctrl-c. I need to kick them off, go do something else, and then come back to find that the tests stopped running immediately after the first failed test. Bucardo tests, for example, create and start up four different Postgres clusters, populate the databases inside each cluster with test data, install a fresh copy of Bucardo, and *then* begin the real testing. No way I'm going to wait around for that to happen.
  • Debugging is greatly aided by having the tests stop where I want them to. Often tests after the failing one will modify data and otherwise destroy the "state" such that I cannot manually duplicate the error right then and there, and thus fix it easily.

For now, my solution is to override some of the methods from Test::More. I have a standard script that does this, and I 'use' this script after I 'use Test::More' inside my test scripts. For example, a test script might look like this:


#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;
use Test::More tests => 356;
use TestOverride;

sub some_function {
       my $arr = [];
       push @$arr => 4,9;
       return [$arr];
}

my $t = q{Function some_function() returns correct value when called with 'foo'};
my $value = some_function('foo');
my $res = [[3],[5]];
is_deeply( $value, $res, $t);

...

$t = q{Value of baz is 123};
is ($baz, 123, $t);
...

In turn, the TestOverride file contains this:


...
use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Pad = '|';

use base 'Exporter';
our @EXPORT = qw{ is_deeply like pass is isa_ok ok };

my $bail_on_error = $ENV{TESTBAIL} || 0;

my $total_errors = 0;

sub is_deeply {

   # Return right away if the test passes
   my $rv = Test::More::is_deeply(@_);
   return $rv if $rv;

   if ($bail_on_error and ++$total_errors >= $bail_on_error) {
       my ($file,$line) = (caller)[1,2];
       Test::More::diag("GOT: ".Dumper $_[0]);
       Test::More::diag("EXPECTED: ".Dumper $_[1]);
       Test::More::BAIL_OUT "Stopping on a failed 'is_deeply' test from line $line of $file.";
   }

   return;

} ## end of is_deeply

sub is {
   my $rv = Test::More::is(@_);
   return $rv if $rv;
   if ($bail_on_error and ++$total_errors >= $bail_on_error) {
       my ($file,$line) = (caller)[1,2];
       Test::More::BAIL_OUT "Stopping on a failed 'is' test from line $line of $file.";
   }
   return;
} ## end of is

The is_deeply compares two arbitrary Perl structures (such as the arrayref here, but it can do hashes as well), and points out if they differ, and where. The "deeply" is because it will walk through the entire structure to find any differences. Good stuff.

Some things to note about the new is_deeply function: first, we simply pass in our parameters to the "real" is_deeply subroutine - the one found inside the Test::More package. If this passes (by returning true), we simply pass that truth back to the caller, and it's completely as if is_deeply had not been overridden at all. However, if the test fails, Test::More::is_deeply will output a failure notice, and we then check to see if the total number of failures for this test script ($total_errors) is greater than or equal to the threshold ($bail_on_error) that we set via the environment variable TESTBAIL. (Having it as an environment variable that defaults to zero allows the traditional behavior to be easily changed without editing any files.)

If the number of failed tests has reached our threshold, we call the BAIL_OUT method from Test::More, which not only stops the current test script from running any more tests, but stops any subsequent test files from running as well.

Before calling BAIL_OUT however, we also take advantage of the overriding to provide a little more detail about the failure. We output the line and file the test came from (because Test::More::is_deeply only sees that we are calling it from within the TestOverride.pm file). Most importantly, we output a complete dump of the expected and actual structures passed to is_deeply to be compared. The regular is_deeply only describes where the first mismatch occurs, but I often need to see the entire surrounding object. So rather than normal output looking like this:


1..356
not ok 1 - Function some_function() returns correct value when called with 'foo'
#   Failed test 'Function some_function() returns correct value when called with 'foo''
#   at test1.t line 18.
#     Structures begin differing at:
#          $got->[0] = '4'
#     $expected->[0] = '3'
# Looks like you planned 356 tests but ran 1.
# Looks like you failed 1 test of 1 run.

The new output looks like this:


1..356
not ok 1 - Function some_function() returns correct value when called with 'foo'
#   Failed test 'Function some_function() returns correct value when called with 'foo''
#   at TestOverride.pm line 23.
#     Structures begin differing at:
#          $got->[0] = '4'
#     $expected->[0] = '3'
# GOT: |[
# |  4,
# |  [
# |    9
# |  ]
# |]
# EXPECTED: |[
# |  3
# |]
Bail out!  Stopping on a failed 'is_deeply' test from line 17 of test1.t.

Yes, the Test::Most module does some similar things, but I don't use it because it's yet another module dependency, it doesn't allow me to control the number of acceptable failures before bailing, and it doesn't show pretty output for is_deeply.

Reducing bloat without locking

It's not altogether uncommon to find a database where someone has turned off vacuuming, for a table or for the entire database. I assume people do this thinking that vacuuming is taking too much processor time or disk IO or something, and needs to be turned off. While this fixes the problem very temporarily, in the long run it causes tables to grow enormous and performance to take a dive. There are two ways to fix the problem: moving rows around to consolidate them, or rewriting the table completely. Prior to PostgreSQL 9.0, VACUUM FULL did the former; in 9.0 and above, it does the latter. CLUSTER is another suitable alternative, which also does the latter. Unfortunately all these methods require heavy table locking.

Recently I've been experimenting with an alternative method -- sort of a VACUUM FULL Lite. Vanilla VACUUM can reduce table size when the pages at the end of a table are completely empty. The trick is to empty those pages of live data. You do that by paying close attention to the table's ctid column:

5432 josh@josh# \d foo
      Table "public.foo"
 Column |  Type   | Modifiers 
--------+---------+-----------
 a      | integer | not null
 b      | integer | 
Indexes:
    "foo_pkey" PRIMARY KEY, btree (a)

5432 josh@josh# select ctid, * from foo;
 ctid  | a | b 
-------+---+---
 (0,1) | 1 | 1
 (0,2) | 2 | 2
(2 rows)

The ctid is one of several hidden columns found in each PostgreSQL table. It shows up in query results only if you explicitly ask for it, and tells you two values: a page number, and a tuple number. Pages are numbered sequentially from zero, starting with the first page in the relation's first file, and ending with the last page in its last file. Tuple numbers refer to entries within each page, and are numbered sequentially starting from one. When I update a row, the row's ctid changes, because the update creates a new version of the row and leaves the old version behind (see this page for explanation of that behavior).

5432 josh@josh# update foo set a = 3 where a = 2;
UPDATE 1
5432 josh@josh*# select ctid, * from foo;
 ctid  | a | b 
-------+---+---
 (0,1) | 1 | 1
 (0,3) | 3 | 2
(2 rows)

Note the changed ctid for the second row. If I vacuum this table now, I'll see it remove one dead row version, from both the table and its associated index:

5432 josh@josh# VACUUM verbose foo;
INFO:  vacuuming "public.foo"
INFO:  scanned index "foo_pkey" to remove 1 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "foo": removed 1 row versions in 1 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "foo_pkey" now contains 2 row versions in 2 pages
DETAIL:  1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "foo": found 1 removable, 2 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
1 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

So given these basics, how can I make tables smaller? Let's build a bloated table:

5432 josh@josh# truncate foo;
TRUNCATE TABLE
5432 josh@josh*# insert into foo select generate_series(1, 1000);
INSERT 0 1000
5432 josh@josh*# delete from foo where a % 2 = 0;
DELETE 500
5432 josh@josh*# select max(ctid) from foo;
   max   
---------
 (3,234)
(1 row)
5432 josh@josh# vacuum verbose foo;
INFO:  vacuuming "public.foo"
INFO:  scanned index "foo_pkey" to remove 500 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "foo": removed 500 row versions in 4 pages
...

I've filled the table with 1000 rows, and then deleted every other row. The last tuple is on the fourth page (remember they're numbered starting with zero), but since half the table is empty space, I can probably squish it into three or maybe just two pages. I'll start by moving the tuples on the last page off to another page, by updating them:

5432 josh@josh# begin;
BEGIN
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
UPDATE 117
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
UPDATE 117
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
UPDATE 21
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
UPDATE 0
5432 josh@josh*# commit;
COMMIT

Here I'm not changing the row at all, but the tuples are moving around into dead space earlier in the table; this is apparent because the number of rows affected decreases. For the first update or two, there's room enough on the page to store all the new rows, but after a few updates they have to start moving to new pages. Eventually the row count goes to zero, meaning there are no rows on or after page #3, so vacuum can truncate that page:

5432 josh@josh# vacuum verbose foo;
INFO:  vacuuming "public.foo"
...

INFO:  "foo": truncated 4 to 3 pages

It's important to note that I did this all within a transaction. If I hadn't, there's a possibility that vacuum would have reclaimed some of the dead space made by the updates, so instead of moving to different pages, the tuples would have moved back and forth within the same page.
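
If you wanted to script the repeated updates rather than type them by hand, the loop might look something like this Python sketch. It is only an illustration under some assumptions: the helper name is invented, `execute` stands in for whatever your database driver provides to run a statement and report the affected row count, and the caller is expected to wrap the whole thing in a single transaction as described above.

```python
def empty_tail_pages(execute, table, column, first_page, max_rounds=20):
    """Repeat a no-op UPDATE until no rows remain on or after first_page.

    execute: callable that runs one SQL statement and returns its row count
    (hypothetical; supply one backed by your driver of choice).
    Returns the list of row counts seen, ending in 0 on success.
    """
    sql = (
        f"UPDATE {table} SET {column} = {column} "
        f"WHERE ctid >= '({first_page},0)'"
    )
    counts = []
    for _ in range(max_rounds):  # safety cap so we never loop forever
        moved = execute(sql)
        counts.append(moved)
        if moved == 0:
            break
    return counts


# Stand-in for a real connection: replays the row counts from the session above.
replay = iter([117, 117, 21, 0])
print(empty_tail_pages(lambda sql: next(replay), "foo", "a", 3))
```

The cap on the number of rounds matters: if the earlier pages have no free space left, the rows just keep landing on new pages at the end of the table and the count never reaches zero.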

There remains one problem: I can't remove index bloat, and in fact, all this tuple-moving causes more index bloat. I can't fix that completely, but I can avoid creating too much new bloat by updating an unindexed column instead of an indexed one. In PostgreSQL 8.3 and later, the heap-only tuples (HOT) feature avoids modifying indexes if:

  1. the update touches only unindexed columns, and
  2. there's sufficient free space available for the tuple to stay on the same page.

Despite the index bloat caveat, this can be a useful technique to slim down particularly bloated tables without VACUUM FULL and its associated locking.

Creativity with fuzzy string search

PostgreSQL provides a useful set of contrib modules for "fuzzy" string searching; that is, searching for something that sounds like or looks like the original search key, but that might not exactly match. One place this type of searching shows up frequently is when looking for people's names. For instance, a receptionist at the dentist's office doesn't want to have to ask for the exact spelling of your name every time you call asking for an appointment, so the scheduling application allows "fuzzy" searches, and the receptionist doesn't have to get it exactly right to find out who you really are. The PostgreSQL documentation provides an excellent introduction to the topic in terms of the available modules; this blog post also demonstrates some of the things they can do.

The TriSano application was originally written to use soundex search alone to find patient names, but that proved insufficient, particularly because common-sounding last names with unusual spellings would be ranked very poorly in the search results. Our solution, which has worked quite well in practice, involved creative use of PostgreSQL's full-text search combined with the pg_trgm contrib module.

A trigram is a set of three characters. In the case of pg_trgm, it's three adjacent characters taken from a given input text. The pg_trgm module provides easy ways to extract all possible trigrams from an input, and compare them with similar sets taken from other inputs. Two strings that generate similar trigram lists are, in theory, similar strings. There's no particular reason you couldn't use two, four, or some other number of characters instead of trigrams, but you'd trade sensitivity and variability. And as the name implies, pg_trgm only supports trigrams.
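
To make the idea concrete, here is a small Python approximation of trigram extraction and comparison. It follows the behavior described in the pg_trgm documentation (lowercase the input, split it into words, and pad each word with two spaces in front and one behind), but it is a sketch for illustration and not the module's actual code, so edge cases may well differ.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, padded roughly as pg_trgm does."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams from both inputs."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("cat")))
print(similarity("Elswick", "Elswik"))
```

Dropping one letter from a seven-letter name still leaves most of the trigrams intact, which is exactly why this works so well for misspelled names.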

Straight trigram search didn't buy us much on top of soundex, so we got a bit more creative. A trigram is just a set of three characters, which looks pretty much just like a word, so we thought we'd try using PostgreSQL's full text search on trigram data. Typically full text search has a list of "stop words": un-indexed words judged too common and too short to contribute meaningfully to an index. Our words would all be three characters long, so we had to create a new text search configuration using a dictionary with an empty stop word list. With that text search configuration, we could index trigrams effectively.

This search helped, but wasn't quite good enough. We finally borrowed a simplified version of a data mining technique called "boosting", which involves using multiple "weak" classifiers or searchers to create one relatively good result set. We combined straightforward trigram, soundex, and metaphone searches with a normal full text search of the unmodified name data and a full text search over the trigrams generated from the names. The data sizes in question aren't particularly large, so this amount of searching hasn't proven unsustainably taxing on processor power, and it provides excellent results. The code is on github; feel free to try it out.
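
The combining step can be pictured with a short Python sketch. To be clear, this is not what name_search.sql does internally: the function name, the simple "add up the scores" rule, and the made-up IDs and scores below are all assumptions chosen to show the general shape of merging several weak searchers into one ranked list.

```python
from collections import defaultdict


def combine(results_by_source):
    """Merge hits from several searchers into one ranked list.

    results_by_source: {source_name: [(record_id, score), ...]}
    Returns [(record_id, [sources...], total_score), ...], best first.
    """
    totals = defaultdict(float)
    sources = defaultdict(list)
    for source, hits in results_by_source.items():
        for record_id, score in hits:
            totals[record_id] += score          # each match adds to the rank
            sources[record_id].append(source)   # remember who found it
    ranked = sorted(totals, key=lambda rid: totals[rid], reverse=True)
    return [(rid, sources[rid], totals[rid]) for rid in ranked]


# Invented scores, purely for illustration.
hits = {
    "name_trgm": [(1, 0.5), (2, 0.25)],
    "trigram_fts": [(1, 0.25)],
}
for row in combine(hits):
    print(row)
```

A record found by several searchers floats to the top even if no single searcher was especially confident about it, which is the whole point of the technique.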

Update: One of the comments suggested a demonstration of the results, which of course makes perfect sense. So I resurrected some of the scripts I used when developing the technique. In addition to the scripts used to install the fuzzystrmatch and pg_trgm modules and the name_search.sql script linked above, I had a script that populated the people table with a bunch of fake names. Then, it's easy to test the search mechanism like this:

select * from search_for_name('John Doe')
as a(id integer, last_name text, first_name text, sources text[], rank double precision);

 id  |  last_name  | first_name |                     sources                     |        rank        
-----+-------------+------------+-------------------------------------------------+--------------------
 167 | Krohn       | Javier     | {trigram_fts,name_trgm,trigram_fts,trigram_fts} |  0.281305521726608
 228 | Jordahl     | Javier     | {trigram_fts,name_trgm,trigram_fts}             |  0.237995445728302
  59 | Pesce       | Dona       | {trigram_fts}                                   |  0.174265757203102
 185 | Finchum     | Dona       | {trigram_fts}                                   |  0.174265757203102
 104 | Rumore      | Dona       | {trigram_fts}                                   |  0.174265757203102
 250 | Dumond      | Julio      | {name_trgm,trigram_fts,trigram_fts}             |   0.16849160194397
 200 | Dedmon      | Javier     | {name_trgm,trigram_fts,trigram_fts}             |  0.163729697465897
 230 | Dossey      | Malinda    | {name_trgm,trigram_fts}                         |  0.158055320382118
  50 | Dress       | Darren     | {name_trgm,trigram_fts}                         |  0.153293430805206
 136 | Doshier     | Neil       | {name_trgm,trigram_fts}                         |  0.148531511425972
 165 | Donatelli   | Lance      | {name_trgm,trigram_fts}                         |  0.132845237851143
 280 | Dollinger   | Clinton    | {name_trgm,trigram_fts}                         |  0.132845237851143
 273 | Dimeo       | Milagros   | {name_trgm,trigram_fts}                         | 0.0866267532110214
  49 | Dawdy       | Christian  | {name_trgm,trigram_fts}                         | 0.0866267532110214
 298 | Elswick     | Jami       | {trigram_fts}                                   | 0.0845221653580666

These aren't all the results it returned, but they give an idea of what the results look like. The rank value ranks results based on the rankings given by each of the underlying search methods, and the sources column shows which of the search methods found this particular entry. Some search methods may show up more than once, because that search method found multiple matches between the input text and the result record. These results don't look particularly good, because there isn't really a good match for "John Doe" in the data set. But if I horribly misspell "Jami Elswick", the search does a good job:

select * from search_for_name('Jomy Elswik')
as a(id integer, last_name text, first_name text, sources text[], rank double precision);

 id  |  last_name  | first_name |                     sources                     |        rank        
-----+-------------+------------+-------------------------------------------------+--------------------
 298 | Elswick     | Jami       | {trigram_fts,name_trgm,trigram_fts,trigram_fts} |  0.480943143367767
 312 | Elswick     | Kurt       | {name_trgm,trigram_fts}                         |  0.381967514753342
 228 | Jordahl     | Javier     | {trigram_fts,name_trgm,trigram_fts}             |  0.197063013911247
 403 | Walberg     | Erik       | {trigram_fts}                                   |  0.145491883158684
 309 | Hammaker    | Erik       | {trigram_fts}                                   |  0.145491883158684

Tail_n_mail and the log_line_prefix curse

One of the problems I had when writing tail_n_mail (a program that parses log files and mails interesting lines to you) was getting the program to understand the format of the Postgres log files. There are quite a few options inside of postgresql.conf that control where the logging goes, and what it looks like. The basic three options are to send it to a rotating logfile with a custom prefix at the start of each line, to use syslog, or to write it in CSV format. I'll save a discussion of all the logging parameters for another time, but the important one for this story is log_line_prefix. This is what gets prepended to each log line when using 'stderr' mode (e.g. regular log files and not syslog or csvlog). By default, log_line_prefix is an empty string. This is a very useless default.

What you can put in the log_line_prefix parameter is a string of sprintf style escapes, which Postgres will expand for you as it writes the log. There are a large number of escapes, but only a few are commonly used or useful. Here's a log_line_prefix I commonly use:


log_line_prefix = '%t [%p] %u@%d '

This tells Postgres to print out the timestamp, the PID aka process id (inside of square brackets), the current username and database name, and finally a single space to help separate the prefix visually from the rest of the line. The above will generate lines that look like this:


2010-08-06 09:24:57 EDT [7229] joy@joymail LOG: execute dbdpg_p7228_5: SELECT count(id) FROM joymail WHERE folder = $1
2010-08-06 09:24:57 EDT [7229] joy@joymail DETAIL:  parameters: $1 = '4'

As you might imagine, the customizability of log_line_prefix makes parsing the log files all but impossible without some prior knowledge. I didn't want to go the pgfouine route and make people change their log_line_prefix to a specific setting. I think it's kind of rude to force your database to change its logging to accommodate your tools :). The original quick solution I came up with was to have a set of predefined regular expressions, from which the user would pick the one that most closely matched their logs. For tail_n_mail to work properly, it needs to pick up at least the PID so it can tell when one statement ends and a new one begins. For example, if you chose "regex #1", the log parsing regex would look like this:


(\d\d\d\d\-\d\d\-\d\d \d\d:\d\d:\d\d).+?(\d+)

This works fine on the example above, and gets us the timestamp and the PID from each line. The stock regexes worked for many different log_line_prefixes I came across that our clients were using, but I was never very happy with this solution. Not only was it susceptible to failing completely when a client was using a log_line_prefix not fitting into the current list of regexes, but there was no way to know exactly where the prefix ended and the statement began, which is important for the formatting of the output and the canonicalization of similar queries.
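
Here is a quick Python check of how fragile a stock regex like that can be. The pattern is the same one shown above; the two sample lines are written for this example, one with a second-resolution timestamp (what %t produces) and one with milliseconds (what %m produces).

```python
import re

# The "regex #1" pattern from above, unchanged.
STOCK = re.compile(r"(\d\d\d\d\-\d\d\-\d\d \d\d:\d\d:\d\d).+?(\d+)")


def timestamp_and_pid(line):
    """Return (timestamp, pid) as the stock regex sees them, or None."""
    match = STOCK.match(line)
    return match.groups() if match else None


plain = "2010-08-06 09:24:57 EDT [7229] joy@joymail LOG: ..."
millis = "2010-08-06 09:24:57.714 EDT [7229] joy@joymail LOG: ..."

print(timestamp_and_pid(plain))   # ('2010-08-06 09:24:57', '7229')
print(timestamp_and_pid(millis))  # ('2010-08-06 09:24:57', '714')
```

With milliseconds in the timestamp, the first run of digits after the seconds is the fractional part, so the "PID" comes back as 714. Nothing errors out; the grouping of statements just quietly goes wrong.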

Enter the current solution: building a regex on the fly. Since we don't have a connection to the database at all, merely to the log files, this requires that the user enter their current log_line_prefix. This is a simple entry in the tailnmailrc file that looks just like the entry in postgresql.conf, e.g.:


log_line_prefix = '%t [%p] %u@%d '

The tail_n_mail script uses that variable to build a custom regex specifically tailored to that log_line_prefix and thus to the Postgres logs being used. Not only can we grab whatever bits we want (currently we only care about the timestamp (%t and %m) and the PID (%p)), but we can now cleanly break apart each line in the log into the prefix and the actual statement. This means the canonicalization/flattening of the queries is more effective, and allows us to only output the prefix information once. The output of tail_n_mail looks something like this:


Date: Fri Aug  6 11:01:03 2010 UTC                                                        
Host: whale.example.com
Unique items: 7
Total matches: 85
Matches from [A] /var/log/pg_log/postgresql-2010-08-05.log: 61
Matches from [B] /var/log/pg_log/postgresql-2010-08-06.log: 24

[1] From files A to B (between lines 14,205 of A and 527 of B, occurs 64 times)
First: [A] 2010-08-05 16:52:11 UTC [1602]  postgres@mydb
Last:  [B] 2010-08-06 01:18:14 UTC [20981] postgres@mydb
ERROR: syntax error at or near ")" 
STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (?,?,?))
-
ERROR: syntax error at or near ")"
STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (123,'chocolate','donut'));

[2] From file A (line 12,172)                                                                                                
2010-08-05 12:27:48 UTC [2906] bob@otherdb
ERROR: invalid input syntax for type date: "May" 
STATEMENT: UPDATE personnel SET birthdate='May' WHERE id = 1234;

(plus five other entries)

For the entry in the above example, we are able to show the complete prefix of the log lines where the error first occurred and where it most recently occurred. The next two lines show the "flattened" version of the query that tail_n_mail uses to group together similar errors. We then show a non-flattened example of an actual query from that group. In this case, someone added an extra closing paren in their application somewhere, which gives the same error each time, although the exact output changes depending on the values used. In the second example, because there is only one match, we don't bother to show the flattened version at all.
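
The regex-building step itself is easy to sketch. The following Python is not tail_n_mail's code (tail_n_mail is written in Perl, and its real handling is surely more thorough); it is a minimal illustration that walks a log_line_prefix, captures %t, %m, and %p, treats every other escape as "some text", and keeps whatever follows the prefix as the statement. The function and group names are invented, and it assumes %t and %m are not both present and that %p appears at most once.

```python
import re

# Escapes we care about; anything not listed falls back to a lazy wildcard.
ESCAPES = {
    "t": r"(?P<time>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d \S+)",
    "m": r"(?P<time>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+ \S+)",
    "p": r"(?P<pid>\d+)",
    "%": "%",  # a literal percent sign
}


def prefix_to_regex(prefix):
    """Turn a log_line_prefix string into a compiled regex."""
    parts = []
    i = 0
    while i < len(prefix):
        ch = prefix[i]
        if ch == "%" and i + 1 < len(prefix):
            parts.append(ESCAPES.get(prefix[i + 1], r".*?"))
            i += 2
        else:
            parts.append(re.escape(ch))  # literal text such as '[' or '@'
            i += 1
    return re.compile("^" + "".join(parts) + r"(?P<rest>.*)$")


line = ("2010-08-06 09:24:57 EDT [7229] joy@joymail LOG: "
        "execute dbdpg_p7228_5: SELECT count(id) FROM joymail WHERE folder = $1")
match = prefix_to_regex("%t [%p] %u@%d ").match(line)
print(match.group("time"), match.group("pid"))
print(match.group("rest"))
```

Because the pattern is anchored and the literal parts of the prefix are matched exactly, the split between prefix and statement falls in the right place, which is what makes the flattening and grouping shown above possible.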

So in theory tail_n_mail should now be able to handle any Postgres log you care to throw at it (yes, it can read syslog and csvlog format as well). As my coworker pointed out, parsing log files in this way is something that should probably be abstracted into a common module so other tools like pgsi can take advantage of it as well.

Distributed Transactions and Two-Phase Commit

The typical example of a transaction involves Alice and Bob, and their bank. Alice pays Bob $100, and the bank needs to debit Alice and credit Bob. Easy enough, provided the server doesn't crash. But what happens if the bank debits Alice, and then before crediting Bob, the server goes down? Or what if they credit Bob first, and then try to debit Alice only to find she doesn't have enough funds? A transaction allows the debit and credit operations to happen as a package ("atomically" is the word commonly used), so either both operations happen or neither happens, even if the server crashes halfway through the transaction. That way the bank never credits Bob without debiting Alice, or vice versa.

That's simple enough, but the situation can become more complex. What if, for instance, for buzzword-compliance purposes, the bank has "sharded" its accounts database by splitting it in pieces and putting each piece on a different server? (Whether this would be smart or not is outside the scope of this post.) The typical transaction handles statements issued only for one database, so we can't wrap the debit and credit operations within a single BEGIN/COMMIT if Alice's account information lives on one server and Bob's lives on another.

Enter "distributed transactions". A distributed transaction allows applications to group multiple transaction-aware systems into a single transaction. These systems might be different databases, or they might include other systems such as message queues, in which case the transaction concept means a message would get delivered if and only if the rest of the transaction completed. So with a distributed transaction, the bank could debit Alice's account in one database and credit Bob's in another, atomically.

All this comes at some cost. First, distributed transactions require a "transaction manager", an application which handles the special semantics required to commit a distributed transaction. Second, the systems involved must support "two-phase commit" (which was added to PostgreSQL in version 8.1). Distributed transactions are committed using PREPARE TRANSACTION 'foo' (phase 1), and COMMIT PREPARED 'foo' or ROLLBACK PREPARED 'foo' (phase 2), rather than the usual COMMIT or ROLLBACK.

The beginning of a distributed transaction looks just like any other transaction: the application issues a BEGIN statement (optional in PostgreSQL), followed by normal SQL statements. When the transaction manager is instructed to commit, it runs the first commit phase by saying "PREPARE TRANSACTION 'foo'" (where "foo" is some arbitrary identifier for this transaction) on each system involved in the distributed transaction. Each system does whatever it needs to do to determine whether or not this transaction can be committed and to make sure it can be committed even if the server crashes, and reports success or failure. If all systems succeed, the transaction manager follows up with "COMMIT PREPARED 'foo'", and if a system reports failure, the transaction manager can roll back all the other systems using either ROLLBACK (for those transactions it hasn't yet prepared), or "ROLLBACK PREPARED 'foo'". Using two-phase commit is obviously slower than committing transactions on only one database, but sometimes the data integrity it provides justifies the extra cost.
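
To make the two phases concrete, here is a toy model in plain Ruby. Treat it as a sketch under stated assumptions rather than how any real transaction manager works: the Participant class and the distributed_transaction helper are invented for illustration, nothing here talks to a database, the starting balance of 200 is arbitrary, and the balance check simply stands in for whatever would make a real system refuse the work.

```ruby
# Toy model of two-phase commit. Participant and distributed_transaction are
# hypothetical names; in-memory objects stand in for real database servers.
class Participant
  attr_reader :name

  def initialize(name, opening_balance = 0)
    @name = name
    @ledger = [opening_balance]   # committed entries
    @pending = []                 # entries written by the open transaction
    @prepared = false
  end

  def balance
    @ledger.sum
  end

  # An ordinary statement inside the transaction. It refuses any entry that
  # would take the balance below zero (the "insufficient funds" case).
  def insert(amount)
    raise "#{name}: balance would drop below 0" if balance + @pending.sum + amount < 0
    @pending << amount
  end

  # Phase 1 (PREPARE TRANSACTION): promise that a later commit cannot fail.
  def prepare
    @prepared = true
  end

  # Phase 2 (COMMIT PREPARED): make the pending work permanent.
  def commit_prepared
    raise "#{name}: not prepared" unless @prepared
    @ledger.concat(@pending)
    reset
  end

  # ROLLBACK or ROLLBACK PREPARED: the pending work is discarded either way.
  def rollback
    reset
  end

  private

  def reset
    @pending = []
    @prepared = false
  end
end

# Hypothetical coordinator: run the block, then commit everywhere or nowhere.
def distributed_transaction(*participants)
  yield
  participants.each(&:prepare)           # phase 1 on every system
  participants.each(&:commit_prepared)   # phase 2, reached only if phase 1 succeeded everywhere
  true
rescue StandardError
  participants.each(&:rollback)
  false
end

athos = Participant.new("athos", 200)   # holds Alice's account
porthos = Participant.new("porthos")    # holds Bob's account

results = 3.times.map do
  distributed_transaction(athos, porthos) do
    porthos.insert(100)    # credit Bob
    athos.insert(-100)     # debit Alice
  end
end

puts results.inspect                                       # prints [true, true, false]
puts "Alice: #{athos.balance}, Bob: #{porthos.balance}"    # prints Alice: 0, Bob: 200
```

The third attempt fails on the debit, so the credit already written on the other participant is thrown away with it. That is the all-or-nothing behavior the two phases exist to provide.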

In PostgreSQL, two-phase commit is supported provided max_prepared_transactions is nonzero. A PREPARE TRANSACTION statement persists the current transaction to disk, and dissociates it from the current session. That way it can survive even if the database goes down. The current session no longer has an active transaction. However, the prepared transaction acts like any other open transaction in that all locks held by the prepared transaction remain held, and VACUUM cannot reclaim storage from that transaction. So it's not a good idea to leave prepared transactions open for a long time.

Distributed transactions are most common, it seems, in Java applications. Full J2EE application servers typically come with a transaction manager component. For my examples I'll use an open source, standalone transaction manager, called Bitronix. I'm not particularly fond of using Java for simple scripts, though, so I've used JRuby for this demonstration code.

This script uses two databases, which I've called "athos" and "porthos". Each has the same schema, which provides a simple framework for the sharded bank example described above. This schema provides a table for account names, another for ledger information, and a simple trigger to raise an exception when a transaction would bring a person's balance below $0. I'll first populate athos with Alice's account information. She gets $200 to start. Bob will go in the porthos database, with no initial balance.

5432 josh@athos# insert into accounts values ('Alice');
INSERT 0 1
5432 josh@athos*# insert into ledger values ('Alice', 200);
INSERT 0 1
5432 josh@athos*# commit;
COMMIT
5432 josh@athos# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# insert into accounts values ('Bob');
INSERT 0 1
5432 josh@porthos*# commit;
COMMIT

Use of Bitronix is pretty straightforward. After setting up a few constants for easier typing, I create a Bitronix data source for each PostgreSQL database. Here I have to use the PostgreSQL JDBC driver's org.postgresql.xa.PGXADataSource class; "XA" is the X/Open standard for two-phase commit, which Java exposes through JTA, and it requires JDBC driver support. Here's the code for setting up one data source; the other is just the same.

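# PDS is one of the shorthand constants mentioned above (presumably Bitronix's PoolingDataSource)
ds1 = PDS.new
ds1.set_class_name 'org.postgresql.xa.PGXADataSource'   # XA-capable driver class
ds1.set_unique_name 'pgsql1'                            # must differ for each data source
ds1.set_max_pool_size 3
ds1.get_driver_properties.set_property 'databaseName', 'athos'
ds1.get_driver_properties.set_property 'user', 'josh'
ds1.init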

Then I simply get a connection from each data source, instantiate a Bitronix TransactionManager object, and begin a transaction.

c1 = ds1.get_connection
c2 = ds2.get_connection
btm = TxnSvc.get_transaction_manager
btm.begin

Within my transaction, I just use normal JDBC commands to debit Alice and credit Bob, after which I commit the transaction through the TransactionManager object. If this transaction fails, it raises an exception, which I can capture using Ruby's begin/rescue exception handling, and roll back the transaction.

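begin
  # Credit Bob on porthos (c2)
  s2 = c2.prepare_statement "INSERT INTO ledger VALUES ('Bob', 100)"
  s2.execute_update
  s2.close

  # Debit Alice on athos (c1); the trigger raises if her balance would go negative
  s1 = c1.prepare_statement "INSERT INTO ledger VALUES ('Alice', -100)"
  s1.execute_update
  s1.close

  # Two-phase commit across both databases
  btm.commit
  puts "Successfully committed"
rescue
  # Interpolate instead of concatenating: String#+ needs a String, not an exception
  puts "Something bad happened: #{$!}"
  btm.rollback
end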

When I run this, Bitronix gives me a bunch of output, which I haven't bothered to suppress, but among it all is the "Successfully committed" string I told it to print on success. Since Alice is debited $100 each time we run this, and she started with $200, we can run it twice before hitting errors. On the third time, we get this:

Something bad happened: org.postgresql.util.PSQLException: ERROR: Rejecting operation; account owner Alice's balance would drop below 0

This is our trigger firing, to tell us that we can't debit Alice any more. If I look in the two databases, I can see that everything worked as planned:

5432 josh@athos*# select get_balance('Alice');
 get_balance 
-------------
           0
(1 row)

5432 josh@athos*# \c porthos 
You are now connected to database "porthos".
5432 josh@porthos# select get_balance('Bob');
 get_balance 
-------------
         200
(1 row)

Remember I've run my script three times, but Bob has only been credited $200, because that's all Alice had to start with.

PostgreSQL: per-version .psqlrc

File this under "you learn something new every day." I came across this little tidbit while browsing the source code for psql: you can have a per-version .psqlrc file which will be executed only by the psql of that exact version. Just name the file .psqlrc-$version, substituting the full version number for the $version token. So psql 8.4.4 would look for a file named .psqlrc-8.4.4 in your $HOME directory.

It's worth noting that the version-specific .psqlrc file requires the full version number, minor release included, so you cannot currently define (say) an 8.4-only file which applies to all 8.4 psqls. I don't know if this feature gets enough mileage to make said modification worth it, but it would be easy enough to keep the settings in a .psqlrc-$majorversion file and symlink each full-version name (.psqlrc-8.4.4, for instance) to it.
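
The symlink workaround could be scripted along these lines. This is only a sketch: link_psqlrc is a made-up helper, it assumes the two-part major version numbers (8.4, 9.0) in use at the time of writing, the ON_ERROR_ROLLBACK setting is just sample content, and the demonstration runs in a scratch directory instead of the real $HOME.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: point .psqlrc-<full version> at a shared
# .psqlrc-<major version> file. Assumes two-part major versions such as 8.4.
def link_psqlrc(home, full_version)
  major = full_version.split(".").first(2).join(".")   # "8.4.4" -> "8.4"
  shared = File.join(home, ".psqlrc-#{major}")
  versioned = File.join(home, ".psqlrc-#{full_version}")
  FileUtils.touch(shared)                               # make sure the link target exists
  FileUtils.ln_s(shared, versioned, force: true)
  versioned
end

# Try it in a scratch directory rather than the real home directory.
demo = Dir.mktmpdir do |home|
  File.write(File.join(home, ".psqlrc-8.4"), "\\set ON_ERROR_ROLLBACK interactive\n")
  link = link_psqlrc(home, "8.4.4")
  { name: File.basename(link), symlink: File.symlink?(link), contents: File.read(link) }
end

puts demo[:name]       # prints .psqlrc-8.4.4
puts demo[:contents]   # prints \set ON_ERROR_ROLLBACK interactive
```

Each new minor release then needs only one more call to the helper, and the settings themselves stay in a single file.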

This seems of most interest to developers, who may simultaneously run many versions of psql which may have incompatible settings, but also could come in handy to regular users as well.

PostgreSQL: Dynamic SQL Function

Sometimes when you're doing something in SQL, you find yourself doing something repetitive, which naturally lends itself to the desire to abstract out the boring parts. This pattern is often prevalent when doing maintenance-related tasks such as creating or otherwise modifying DDL in a systematic kind of way. If you've ever thought, "Hey, I could write a query to handle this," then you're probably looking for dynamic SQL.

The standard approach to using dynamic SQL in PostgreSQL is plpgsql's EXECUTE statement, which takes a text expression as the SQL statement to execute. One technique fairly well-known on the #postgresql IRC channel is to create a function which essentially wraps the EXECUTE statement, commonly known as exec(). Here is the definition of exec():

CREATE FUNCTION exec(text) RETURNS text AS $$ BEGIN EXECUTE $1; RETURN $1; END $$ LANGUAGE plpgsql;

Using exec() then takes the form of a SELECT query, with the generated statement passed as the sole argument. The function returns the generated query text as an aid in auditing what was actually executed. Some examples:

SELECT exec('CREATE TABLE partition_' || generate_series(1,100) || ' (LIKE original_table)');
SELECT exec('ALTER TABLE ' || attrelid::regclass || ' DROP COLUMN foo') FROM pg_attribute WHERE attname = 'foo';

Some notes about the exec() function: since the generated SQL statement is being run inside a function, it is not run in a top-level transaction, so some commands will not work, including CREATE/DROP DATABASE, ALTER TABLESPACE, VACUUM, etc.

Starting in PostgreSQL 9.0, the plpgsql language will be pre-installed in all new databases, which will make this recipe even easier to use.

PostgreSQL: Migration Support Checklist

A database migration (be it from some other database to PostgreSQL, or even from an older version of PostgreSQL to a nice shiny new one) can be a complicated procedure with many details and many moving parts. I've found it helpful to construct a list of questions to make sure that you're considering all aspects of the migration and to gauge the scope of what will be involved. This list includes questions we ask our clients; feel free to contribute your own additional considerations or suggestions.

Technical questions:

  1. Database servers: How many database servers do you have? For each, what are the basic system specifications (OS, CPU architecture, 32- vs 64-bit, RAM, disk, etc)? What kind of storage are you using for the existing database, and what do you plan to use for the new database? Direct-attached storage (SAS, SATA, etc.), SAN (what vendor?), or other? Do you use any configuration management system such as Puppet, Chef, etc.?
  2. Application servers and other remote access: How many application servers do you have? For each, what are the basic system specifications (OS, CPU architecture, 32- vs 64-bit, RAM, disk, etc)? Do you use any configuration management system such as Puppet, Chef, etc.? What other network considerations are there? Is ODBC used, or SSL transport, any VPNs? Are multiple datacenters involved? How about egress/ingress firewalls?
  3. Middleware: Do you currently use any sort of connection pooling, load balancing, or other middleware between your application and database servers?
  4. Data needs: Can you describe your data access patterns? i.e., is the majority of your data historical and rarely accessed? Are there any existing reporting needs that will need to be duplicated on the PostgreSQL system? Do you already have reports of database usage, including traffic levels, frequent or intensive queries, etc?
  5. Size: What kind of transaction volume do you see? How large are your databases? How many tables do you have and what is the size of the larger ones? How many users or database connections will you need to support?
  6. Backups: What are your current backup policies/procedures? How will these need to change with the move to PostgreSQL?
  7. Replication/load balancing: What kind of system redundancy do you currently have/need? Do you have any kind of database load-balancing or master-slave replication?
  8. Monitoring: What is the current monitoring/in-house support infrastructure? What needs to be duplicated, and can any portion of this facility be reused?
  9. Interfaces: What language are your applications written in, and what drivers exist to connect to your current database? Will a compatible PostgreSQL driver be available in your language of choice?
  10. Extensions: Are you currently using any in-database procedures or functionality (i.e., in PL/SQL or another embedded language of choice)? If so, how many? What will the difficulty be in porting these functions to PostgreSQL?

And a couple of business-related questions:

  1. Scheduling: What is the timeframe for transition? When can appropriate downtime be scheduled? How much database downtime can you afford?
  2. Staffing: Do you currently have in-house DBAs to manage the servers, etc on a day-to-day basis? Is there anyone with PostgreSQL experience or familiarity on staff?

Being able to answer all of these questions is critical to formulating a migration plan and carrying out a migration successfully.

Particularly with the impending (July 2010) end of life for previous PostgreSQL releases 7.4, 8.0 and (in November 2010) 8.1, a database migration may be on your radar. End Point is one of many professional PostgreSQL support companies who would be happy to assist you in your transition.

Views across many similar tables

An application I'm working on has a host of (a dozen or so) status tables, each containing various rows that reflect the state of associated rows in other tables. For instance:

Table "public.inventory"
...
status_code      | character varying(50)       | not null

Table "public.inventory_statuses"
code          | character varying(50)       | not null
display_label | character varying(70)       | not null

SELECT * FROM inventory_statuses;

  code    | display_label
-----------+---------------
ordered   | Ordered
shipped   | Shipped
returned  | Returned
repaired  | Repaired
etc.

Several of the codes are common to several tables. For instance, "void" is a status that occurs in seven tables. The application cares about this; there are code-level triggers that will respond to a change of status to "void" in one table, and pass that information along to another table higher up the chain.

Since I wasn't present at the birth of the system (nor do I have unlimited memory to keep 180+ codes in my head), I needed a way to answer the question, "In which table(s) does status 'foo' occur?" This was made rather easier by attention to detail early on: each of the status tables was named "*_statuses"; each primary key was named "code"; and each human-readable description field was named "display_label". I wrote a PL/pgSQL function to create a view spanning all the tables. (I could have just created the SQL by hand, but I wanted a way to reproduce this effort later, if tables are added, dropped, or modified.)

CREATE FUNCTION create_all_statuses()
RETURNS VOID
LANGUAGE 'plpgsql'
AS $$
DECLARE
   stmt TEXT;
   tbl RECORD;
BEGIN
   stmt := '';
   FOR tbl IN EXECUTE $SQL$
SELECT DISTINCT table_name
FROM information_schema.columns a
JOIN information_schema.columns b
USING (table_name)
JOIN information_schema.tables t
USING (table_name)
WHERE a.column_name = 'code'
AND   b.column_name = 'display_label'
AND   table_name ~ '_statuses$'
AND   t.table_type  = 'BASE TABLE'
$SQL$
   LOOP
       IF (LENGTH(stmt) > 0)
       THEN
           stmt := stmt || ' UNION ';
       END IF;
       stmt := stmt || 'SELECT code, display_label, ' ||
           quote_literal(tbl.table_name) ||
           ' AS table_name FROM ' ||
           quote_ident(tbl.table_name);
   END LOOP;

   EXECUTE 'CREATE VIEW all_statuses AS ' || stmt;
   RETURN;
END;
$$;

Now it's easy to answer the question:

select * from all_statuses where code = 'void';

code | display_label |              table_name
------+---------------+--------------------------------------
void | Void          | inventory_statuses
void | Void          | parcel_statuses
void | Void          | pick_list_statuses
etc.

If your database uses boilerplate columns such as "last_modified" or "date_created" to record timestamps on rows, you could use similar logic to create a view that would tell you which tables were the most recently modified.
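
For readers who would rather see the string-building step outside the database, the loop in create_all_statuses() boils down to something like the Ruby below. It is a sketch under assumptions: the two quoting helpers are simplified stand-ins for PostgreSQL's quote_ident and quote_literal (they always quote and only double embedded quote characters), and the table list is supplied by hand instead of coming from information_schema.

```ruby
# Simplified, hypothetical stand-ins for PostgreSQL's quote_ident and
# quote_literal: always quote, and double any embedded quote character.
def quote_ident(name)
  '"' + name.gsub('"', '""') + '"'
end

def quote_literal(value)
  "'" + value.gsub("'", "''") + "'"
end

# Build the CREATE VIEW statement the way the PL/pgSQL loop does: one SELECT
# per status table, tagged with the table's name, glued together with UNION.
def all_statuses_view_sql(table_names)
  selects = table_names.map do |t|
    "SELECT code, display_label, #{quote_literal(t)} AS table_name FROM #{quote_ident(t)}"
  end
  "CREATE VIEW all_statuses AS " + selects.join(" UNION ")
end

view_sql = all_statuses_view_sql(%w[inventory_statuses parcel_statuses])
puts view_sql
```

Because every branch carries a different table_name literal, UNION can only collapse duplicates within a single table, which is harmless here.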

pgcrypto pg_cipher_exists errors on upgrade from PostgreSQL 8.1

While migrating a client from an 8.1 Postgres database to an 8.4 Postgres database, I came across a very annoying pgcrypto problem. (pgcrypto is a very powerful and useful contrib module that contains many functions for encryption and hashing.) Specifically, the following functions were removed from pgcrypto as of version 8.2 of Postgres:

  • pg_cipher_exists
  • pg_digest_exists
  • pg_hmac_exists

While the functions listed above were deprecated, and marked as such for a while, their complete removal from 8.2 presents problems when upgrading via a simple pg_dump. Specifically, even though the client was not using those functions, they were still there as part of the dump. Here's what the error message looked like:

$ pg_dump mydb --create | psql -X -p 5433 -f - >pg.stdout 2>pg.stderr
...
psql:<stdin>:2654: ERROR:  could not find function "pg_cipher_exists"
  in file "/var/lib/postgresql/8.4/lib/pgcrypto.so"
psql:<stdin>:2657: ERROR:  function public.cipher_exists(text) does not exist

While it doesn't stop the rest of the dump from importing, I like to remove any errors I can. In this case, it really was a SMOP. Inside the Postgres 8.4 source tree, in the contrib/pgcrypto directory, I added the following declarations to pgcrypto.h:


Datum       pg_cipher_exists(PG_FUNCTION_ARGS);
Datum       pg_digest_exists(PG_FUNCTION_ARGS);
Datum       pg_hmac_exists(PG_FUNCTION_ARGS);

Then I added three simple functions to the bottom of the pgcrypto.c file that simply throw an error if they are invoked, letting the user know that the functions are deprecated. This is a much friendlier way than simply removing the functions, IMHO.


/* SQL function: pg_cipher_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_cipher_exists);

Datum
pg_cipher_exists(PG_FUNCTION_ARGS)
{
    ereport(ERROR,
            (errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
             errmsg("pg_cipher_exists is a deprecated function")));
    PG_RETURN_BOOL(false);      /* not reached; keeps the compiler quiet */
}

/* SQL function: pg_digest_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_digest_exists);

Datum
pg_digest_exists(PG_FUNCTION_ARGS)
{
    ereport(ERROR,
            (errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
             errmsg("pg_digest_exists is a deprecated function")));
    PG_RETURN_BOOL(false);      /* not reached; keeps the compiler quiet */
}

/* SQL function: pg_hmac_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_hmac_exists);

Datum
pg_hmac_exists(PG_FUNCTION_ARGS)
{
    ereport(ERROR,
            (errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
             errmsg("pg_hmac_exists is a deprecated function")));
    PG_RETURN_BOOL(false);      /* not reached; keeps the compiler quiet */
}

After running make install from the pgcrypto directory, the dump proceeded without any further pgcrypto errors. From this point forward, if anyone attempts to use one of the functions, it will be quite obvious that the function is deprecated, rather than leaving the user wondering if they typed the function name incorrectly or wondering if pgcrypto is perhaps not installed.

Why not just add some dummy SQL functions to the pgcrypto.sql file instead of hacking the C code? Because pg_dump by default will create the database as a copy of template0. While there are other ways around the problem (such as putting the SQL functions into template1 and forcing the load to use that instead of template0, or by creating the database, adding the SQL functions, and then loading the data), this was the simplest approach.

Photo of Enigma machine by Marcin Wichary

Learn more about End Point's Postgres Support, Development, and Consulting.

Tracking Down Database Corruption With psql

I love broken Postgres. Really. Well, not nearly as much as I love the usual working Postgres, but it's still a fantastic learning opportunity. A crash can expose a slice of the inner workings you wouldn't normally see in any typical case. And, assuming you have the resources to poke at it, that can provide some valuable insight without lots and lots of studying internals (still on my TODO list.)

As a member of the PostgreSQL support team at End Point, I see a number of diverse situations cross my desk. So imagine my excitement when I get an email containing a bit of log output that would normally make a DBA tremble in fear:

LOG:  server process (PID 10023) was terminated by signal 11
LOG:  terminating any other active server processes
FATAL:  the database system is in recovery mode
LOG:  all server processes terminated; reinitializing

Oops, signal 11 is SIGSEGV, Segmentation Fault. Really not supposed to happen, especially in day to day activities. That'll cause Postgres to drop all of its current sessions and restart itself, as the log lines indicate. That crash was in response to a specific query their application was running, which essentially runs a process on a column across an entire table. Upon running pg_dump they received a different error:

ERROR:  invalid memory alloc request size 2667865904
STATEMENT:  COPY public.different_table (etc, etc) TO stdout

Different, but still very annoying and in the way of their data. So we have (at least) two areas of corruption. But therein lies the bigger problem: neither of these messages gives us any clue about where in these potentially very large tables it's encountering a problem.

Yes, my hope is that the corruption is not widespread. I know this database tends to not see a whole lot of churn, relatively speaking, and that they look at most if not all the data rather frequently. So the expectation is that it was caught not long after the disk controller or some memory or something went bad, and that whatever's wrong is isolated to a handful of pages.

Our good and trusty psql command line client to the rescue! One of the options available in psql is FETCH_COUNT, which if set will wrap a SELECT query in a cursor then automatically and repeatedly fetch the specified number of rows from it. This option is there primarily to allow psql to show the results of large queries without having to dedicate so much memory up front. But in this case it lets us see the output of a table scan as it happens:

testdb=# \set FETCH_COUNT 1
testdb=# \pset pager off
Pager usage is off.
testdb=# SELECT ctid, * FROM gs;
 ctid  | generate_series 
-------+-----------------
 (0,1) |               0
 (0,2) |               1
(scroll, scroll, scroll...)

(You did start that in a screen session, right? No need to have it send all the data over to your terminal, especially if you're working remotely. Set screen to watch for the output to go idle, Ctrl-A, _ keys by default, and switch to a different window. Oh, and this of course isn't the client's database, but one where I've intentionally introduced some corruption.)

We select the system column ctid to tell us the page where the problem occurs. Or more specifically, the page and positions leading up to the problem:

 (439,226) |           99878
 (439,227) |           99879
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
:|!>?

Yup, there it is. Some point after item pointer 227 on page 439, which probably actually means page 440. At this point we can reconnect, and possibly through a bit of trial and error narrow down the affected area a little more. But for now let's run with page 440 being suspect; let's take a closer look. And here it should be noted that if you're going to try anything, shut down Postgres and take a file-level backup of the data directory first. Anyway, first we need to find the underlying file for our table...

testdb=# select oid from pg_database where datname = 'testdb';
  oid  
-------
 16393
(1 row)

testdb=#* select relfilenode from pg_class where relname = 'gs';
 relfilenode 
-------------
       16394
(1 row)

testdb=#* \q
demo:~/p82$ dd if=data/base/16393/16394 bs=8192 skip=440 count=1 | hexdump -C | less
...
000001f0  00 91 40 00 e0 90 40 00  00 00 00 00 00 00 00 00  |..@...@.........|
00000200  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  1f 8b 08 08 00 00 00 00  02 03 70 6f 73 74 67 72  |..........postgr|
00001010  65 73 71 6c 2d 39 2e 30  62 65 74 61 31 2e 74 61  |esql-9.0beta1.ta|
00001020  72 00 ec 7d 69 63 1b b7  d1 f0 f3 55 fb 2b 50 8a  |r..}ic.....U.+P.|
00001030  2d 25 96 87 24 5f 89 14  a6 a5 25 5a 56 4b 1d 8f  |-%..$_....%ZVK..|
00001040  28 27 4e 2d 87 5a 91 2b  6a 6b 72 97 d9 25 75 c4  |('N-.Z.+jkr..%u.|
00001050  f6 fb db df 39 00 2c b0  bb a4 28 5b 71 d2 3e 76  |....9.,...([q.>v|
00001060  1b 11 8b 63 30 b8 06 83  c1 60 66 1c c6 93 41 e4  |...c0....`f...A.|
...

Huh, so through perhaps either a kernel bug, a disk controller problem, or bizarre action on the part of a sysadmin, the last bit of our table has been overwritten by the 9.0beta1 tarball distribution. Incidentally this is not one of the recommended ways of upgrading your database.
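
The arithmetic behind that dd invocation, and the page read itself, can be sketched in a few lines of Ruby. The assumptions are mine: parse_ctid and read_page are hypothetical helpers, the block size is taken to be the 8192-byte default, and a throwaway temp file stands in for the relation file so nothing touches a real data directory.

```ruby
require "tempfile"

BLOCK_SIZE = 8192   # PostgreSQL's default page size (an assumption; builds can differ)

# Turn psql's ctid output, e.g. "(439,227)", into [page, item].
def parse_ctid(text)
  m = text.match(/\((\d+),(\d+)\)/)
  raise ArgumentError, "not a ctid: #{text}" unless m
  [m[1].to_i, m[2].to_i]
end

# Read a single page from a relation file, like dd bs=8192 skip=PAGE count=1.
def read_page(path, page)
  File.open(path, "rb") do |f|
    f.seek(page * BLOCK_SIZE)
    f.read(BLOCK_SIZE)
  end
end

# The last row psql printed before the crash sat on page 439,
# so the following page is the first suspect.
last_page, _item = parse_ctid("(439,227)")
suspect = last_page + 1
offset = suspect * BLOCK_SIZE
puts "page #{suspect} starts at byte #{offset}"   # prints "page 440 starts at byte 3604480"

# Stand-in relation file: page 0 looks ordinary, page 1 starts with 1f 8b,
# the gzip magic bytes that also show up in the hexdump above.
magic = nil
Tempfile.create("relation") do |f|
  f.binmode
  f.write("A" * BLOCK_SIZE)
  f.write([0x1f, 0x8b].pack("C*") + "x" * (BLOCK_SIZE - 2))
  f.flush
  magic = read_page(f.path, 1).byteslice(0, 2).unpack("H*").first
end
puts magic   # prints 1f8b
```

Spotting a recognizable signature like that is a strong hint that the page holds foreign data and not merely a damaged tuple or two.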

With a corrupt page identified, if it's fairly clear the invalid data covers most or all of the page it's probably not too likely we'll be able to recover any rows from it. Our best bet is to "zero out" the page so that Postgres will skip over it and let us pull the rest of the data from the table. We can use `dd` to seek to the corrupt block in the table and write out an 8k block of zero-bytes in its place. Shut down Postgres (just to make sure it doesn't re-overwrite your work later) and note the conv=notrunc that'll keep dd from truncating the rest of the table.

demo:~/p82$ dd if=/dev/zero of=data/base/16393/16394 bs=8192 seek=440 count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000141498 s, 57.9 MB/s
demo:~/p82$ dd if=data/base/16393/16394 bs=8192 skip=440 count=1 | hexdump -C
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000147993 s, 55.4 MB/s
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00002000

Cool, it's now an empty, uninitialized page that Postgres should be fine skipping right over. Let's test it, start Postgres back up and run psql again...
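
For the curious, here is roughly what that dd command does, expressed as a Ruby sketch. The zero_page helper is hypothetical, the 8192-byte block size is assumed, and the demonstration runs against a throwaway three-page file. It is an illustration of the in-place write, not a recovery tool.

```ruby
require "tempfile"

BLOCK_SIZE = 8192   # assumed default PostgreSQL page size

# Overwrite one page with zero bytes. Mode "r+b" opens the existing file for
# update without truncating it, which is the job conv=notrunc does for dd.
def zero_page(path, page)
  File.open(path, "r+b") do |f|
    f.seek(page * BLOCK_SIZE)
    f.write("\0" * BLOCK_SIZE)
  end
end

# Demonstration on a scratch file with three pages filled with A, B and C.
size = nil
pages = nil
Tempfile.create("relation") do |f|
  f.binmode
  %w[A B C].each { |byte| f.write(byte * BLOCK_SIZE) }
  f.flush
  zero_page(f.path, 1)
  data = File.binread(f.path)
  size = data.bytesize
  pages = (0...3).map { |i| data.byteslice(i * BLOCK_SIZE, BLOCK_SIZE) }
end

puts size                                           # prints 24576
puts pages.map { |page| page.bytes.uniq }.inspect   # prints [[65], [0], [67]]
```

The file keeps its original length and the neighboring pages are untouched; only the middle page now reads as all zeroes.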

testdb=# select count(*) from gs;
 count 
-------
 99880
(1 row)

No crash, hurray! We've clearly lost some rows from the table, but that should now allow us to rescue any of the surrounding data. As always it's worth dumping out all the data you can, running initdb, and loading it back in. You never know what else might have been affected in the original database. This is of course no substitute for a real backup, but if you're in a pinch at least there is some hope. For now, PostgreSQL is happy again!

The PGCon "Hall Track"

One of my favorite parts of PGCon is always the "hall track", a general term for the sideline discussions and brainstorming sessions that happen over dinner, between sessions (or sometimes during sessions), and pretty much everywhere else during the conference. This year's hall track topics seemed to be set by the developers' meeting; everywhere I went, someone was talking about hooks for external security modules, MERGE, predicate locking, extension packaging and distribution, or exposing transaction order for replication. Other developers' pet projects that didn't appear in the meeting showed up occasionally, including unlogged tables and range types. Even more than, for instance, the wiki pages describing the things people plan to work on, these interstitial discussions demonstrate the vibrancy of the community and give a good idea just how active our development really is.

This year I shared rooms with Robert Haas, so I got a good overview of his plans for global temporary and unlogged tables. I spent a while with Jeff Davis looking through the code for exclusion constraints and deciding whether it was realistically possible to cause a starvation problem with many concurrent insertions into a table with an exclusion constraint. I didn't spend the time I should have talking with Dimitri Fontaine about his PostgreSQL extensions project, but if time permits I'd like to see if I could help out with it. Nor did I find the time I'd have liked to work on PL/Parrot, but I was glad to meet Jonathan Leto, who has done most of the coding work thus far on that project.

In contrast to other conferences, I didn't have a particular itch of my own to scratch between sessions. During past conferences I've been eager to discuss ideas for multi-column statistics; though that work continues, slowly, time hasn't permitted enough recent development even for the topic to be fresh in my mind, much less worthy of in-depth discussion. This lack of one overriding subject turned out to be a refreshing change, however, as it left the other hall track subjects less filtered.

Finally, it was nice to spend time with co-workers, and in fact to meet (finally) one of the "Greg"s I'd talked to on the phone many times but never actually met in person. Various engagements in my family or his have gotten in the way in the past. One of the quirks of working for a distributed organization...

Update: Fixed link to developers' meeting wiki page, thanks to comment from roppert

Postgres Conference - PGCon2010 - Day Two

Day two of the PostgreSQL Conference started a little later than the previous day in obvious recognition of the fact that many people were up very, very late the night before. (Technically, this is day four, as the first two days consisted of tutorials; this was the second day of "talks").

The first talk I went to was PgMQ: Embedding messaging in PostgreSQL by Chris Bohn. It was well attended, although there were definitely a lot of late-comers and bleary eyes. A tough slot to fill! Chris is from Etsy.com and I've worked with him there, although I had no interaction with the PgMQ project, which looks pretty cool. From the talk description:

PgMQ (PostgreSQL Message Queueing) is an add-on that embeds a messaging client inside PostgreSQL. It supports the AMQP, STOMP and OpenWire messaging protocols, meaning that it can work with all of the major messaging systems such as ActiveMQ and RabbitMQ. PgMQ enables two replication capabilities: "Eventually Consistent" Replication and sharding.

As near as I can tell, "eventually consistent" is the same as "asynchronous replication": the slave won't be the same as the master right away, but will be eventually. As with Bucardo and Slony, the actual lag is very small in practice: a handful of seconds at the most. I like the fact that it supports all those common messaging protocols. Chris mentioned in the talk that it should be possible for other systems like Bucardo to support something similar. I'll have to play around with PgMQ a bit and see about doing just that. :)

The typical post-talk gatherings

The next "talk" was the enigmatically labeled Replication Panel. Enigmatic in this case as it had no description whatsoever. It's a good thing I had decided to check it out anyway (I'm a sucker for any talk related to replication, in case it wasn't obvious yet). I was apparently nominated to be on the panel, representing Bucardo! So much for getting all my speaking done and over with the first day. The panel covered a pretty wide swath of Postgres replication technologies, each represented by people who are very deep in its development. From left to right on a cluster of stools at the front of the room were:

After a quick one-minute each intro describing who we were and what our replication system was, we took questions from the audience. Rather, Dan Langille played the part of the moderator and gathered written questions from the audience which he read to us, and we each took turns answering. We managed to get through 16 questions. All were interesting, even if some did not apply to all the solutions. Some of the more relevant ones I remember:

    "If your replication solution was not available, which of the other replication solutions would you recommend?" This was my favorite question. My answer was: if using Bucardo in multi-master mode, switch to pgpool. If using in master-slave mode, use Slony.

    "How will PG 9.0 affect your solution? Will your solution still remain relevant?" This most heavily affects Bucardo, Slony, and Londiste, and we all agreed that we're happy to lose users who simply need a read-only copy of their database. There remain plenty of use cases that 9.0 will not solve, however.

    "For multi-master solutions: How are database collisions resolved? Do you recommend your solution for geographically remote locations?" This one is pretty much for me alone. :) I gave a quick overview of Bucardo's built-in conflict resolution systems, and how custom ones built on business logic work. Since Bucardo was originally built to support servers over a non-optimal network, the second part was an easy Yes.

    "Is there a way to standardize and reduce the number of replication systems and focus on making the subset more robust, efficient, and versatile?" The general answer was no, as the use cases for all of them are so wildly different. I thought the only possible reduction was to combine Slony and Londiste, as they are very close technically and have pretty much identical use cases.

    "How easy is it to switch masters? Are you planning on improving the tools to do so?" With Bucardo, switching is as easy as pointing to a different database if using master-master. However, Bucardo master-slave has no built in support at all for failover (like Slony does). So the answer is "not easy at all" and yes, we want to provide tools to do so.

    "What is your biggest bug, problem, or limitation you are fixing now?" All three of the async trigger solutions (Bucardo, Slony, and Londiste) answered "DDL triggers". Which is hopefully coming for 9.1 (stop reading this blog and get to work on that, Jan).

    All in all, I really liked the panel, and I think the audience did as well. Hopefully we'll see more things like it at future conferences. Since we did not know the questions beforehand, and took everything from the audience, it was the polar opposite of someone giving a talk with prepared slides.

    I had some people come up to me afterwards to ask for more details about Bucardo, because (as they pointed out), it's the only multi-master replication system for Postgres (not technically true, as pg-pool and rubyrep provide multi-master use cases as well, but the former is synchronous and fairly complex, while the latter is very new and lacking some features). Maybe next year I should give a whole talk on Bucardo rather than just blabbing about it here on the blog. :)

    After that, I popped into the Check Please! What Your Postgres Databases Wishes You Would Monitor talk by Robert Treat (who I also used to work with). It was a good talk, but pretty much review for me, as watching over and monitoring databases is what I spend a lot of my time doing. :) Here's the description:

    Compared to many proprietary systems, Postgres tends to be pretty straight forward to run. However, if you want to get the most from your database, you shouldn't just set it and forget it, you need to monitor a few key pieces of information to keep performance going. This talk will review several key metrics you should be aware of, and explain under which scenarios you may need additional monitoring.

    The final talk I went to was Deploying and testing triggers and functions in multiple databases by Norman Yamada. This was an interesting talk for me because he was using a lot of the code from the same_schema action in the check_postgres program to do the actual comparison. Indeed, I made some patches while at the conference to allow for better index comparisons at Norman's request. I also managed to get some work done on tail_n_mail and Bucardo while there - something about being surrounded by all that Postgres energy made me productive despite having very little free time.

    I had to catch an early flight, and was not able to catch the final talk slot of the day, nor the closing session or the BOFs that night. Hopefully someone who did catch those will blog about it and let me know how it went. I hear the t-shirt we signed at the developer's meeting went for a sweet ransom.

    If you went to PgCon, I have two requests for you. First, please fill out the feedback for each talk you went to. It takes less than a minute per talk, and is invaluable for both the speakers and the conference organizers. Second, please blog about PgCon. It's helpful for people who did not get to go to see the conference through other people's eyes. And do it now, while things are still fresh.

    If you did not go to PgCon, I have one request for you: go next year! Perhaps next year at PgCon 2011 we'll break the 200 person mark. Thanks to Dan Langille as always for creating PgCon and keeping it running smoothly year after year.

Learn more about End Point's Postgres Support, Development, and Consulting.

PostgreSQL Conference - PGCon 2010 - Day One

The first day of talks for PGCon 2010 is now over; here's a recap of the parts that I attended.

On Wednesday, the developer's meeting took place. It was basically 20 of us gathered around a long conference table, with Dave Page keeping us to a strict schedule. While there were a few side conversations and contentious issues, overall we covered an amazing amount of things in a short period of time, and actually made action items out of almost all of them. My favorite *decision* we made was to finally move to git, something myself and others have been championing for years. The other most interesting parts for me were the discussion of what features we will try to focus on for 9.1 (it's an ambitious list, no doubt), and DDL triggers! It sounds like Jan Wieck has already given this a lot of thought, so I'm looking forward to working with him in implementing these triggers (or at least nagging him about it if he slows down). These triggers will be immensely useful to replication systems like Bucardo and Slony, which implement DDL replication in a very manual and unsatisfactory way. These triggers will not be like the current triggers, in that they will not be directly attached to system tables. Instead, they will be associated with certain DDL events, such that you could have a trigger on any CREATE events (or perhaps also allowing something finer grained such as a trigger on a CREATE TABLE event). Whenever it comes in, I'll make sure that Bucardo supports it, of course!

The first day of talks kicked off with the plenary by Gavin Roy called "Perspectives on NoSQL" (description and slides are available). Gavin actually took the time to *gasp* research the topic, and gave a quick rundown of some of the more popular "NoSQL" solutions, including CouchDB, MongoDB, Cassandra, Project Voldemort, Redis, and Tokyo Tyrant. He then benchmarked all of them against Postgres for various tasks - and did it against both "regular safe" Postgres and "running with scissors" fsync-off Postgres. The results? Postgres scales, very well, and more than holds its own against the NoSQL newcomers. MongoDB did surprisingly well: see the slides for the details. His slides also had the unfortunate portmanteau of "YeSQL", which only helps to emphasize how silly our "PostgreSQL" name is. :)

The next talk was Postgres (for non-Postgres people) by Greg Sabino Mullane (me!). Unlike previous years, my slides are already online. Yes, at first blush, it seems a strange talk to give at a conference like this, but we always have a good number of people from other database systems that are considering Postgres, are in the process of migrating to Postgres, or are just new to Postgres. The talk was in three parts: the first was about the mechanics of migrating your application to Postgres: the data types that Postgres uses, how we implement indexes, the best way to migrate your data, and many other things, with an eye towards common migration problems (especially when coming from MySQL). The second part of the talk discussed some of the quirks of Postgres that people coming from DB2, Oracle, etc. should be aware of. Some things discussed: how Postgres does MVCC and the need for vacuum, our really smart planner and lack of hints, the automatic (and against the spec) lowercasing, and our concept of schemas. I also touched on what I see as some of our drawbacks: tuned for a toaster, no true in place upgrade, the unpronounceable name, the lack of marketing. I also covered some of our perceived-but-not-real drawbacks: lack of replication, poor speed. What would a list of drawbacks be without a list of strengths? Those included transactional DDL, a very friendly and helpful community, PostGIS, authentication options, an awesome query planner, the ability to create your own custom database objects, and our distributed nature that ensures the project cannot be bought out or destroyed. The last part of the talk went over the Postgres project itself: the community, the developers, the philosophy, and how it all fits together. I ran out of time so did not get to tell my "longest patch process ever" story for \dfS (six years!) but I don't think I missed anything important and gave time for some questions.

The next talk was Hypothetical Indexes towards self-tuning in PostgreSQL by Sergio Lifschitz. In the words of Sergio:

Hypothetical indexes are simulated index structures created solely in the database catalog. This type of index has no physical extension and, therefore, cannot be used to answer actual queries. The main benefit is to provide a means for simulating how query execution plans would change if the hypothetical indexes were actually created in the database. This feature is quite useful for database tuners and DBAs.

It was a very interesting talk. Robert Haas asked him to release it under the PostgreSQL license so we can easily put it into the project as needed. Sergio promised to make the change immediately after the talk!

After lunch, the next talk was pg_statsinfo - More useful statistics information for DBAs by Tatsuhito Kasahara. This talk was a little hard to follow along, but had some interesting ideas about monitoring Postgres, a lot of which overlapped with some of my projects such as tail_n_mail and check_postgres.

The next talk was Forensic Analysis of Corrupted Databases by Greg Stark. This was a neat little talk; many of the error messages he displayed were all too familiar to me. It was a nice overview of how to track down the exact location of a problem in a corrupted database, and some strategies for fixing it, including the old "using dd to write things from /dev/zero directly into your Postgres files" trick. There was even a discussion about the possibility of zeroing out specific parts of a page header, with the consensus that it would not work as one would hope.

After a quick hacky sack break with Robert Treat and some Canadian locals, I went to the final real talk of the day: The PostgreSQL Query Planner by Robert Haas. I had seen this talk recently, but wanted to see it again as I missed some of the beginning of the talk when I saw it at Pg East 2010 in Philly. Robert gave a good talk, and was very good at repeating the audience's questions. I didn't learn all that much, but it was a very good overview of the planner, including some of the new planner tricks (such as join removal) in 9.0 and 9.1.

After that, the lightning talks started. I really like lightning talks, and thankfully they weren't held on the last day of the conference this time (a common mistake). The MC was Selena Deckelmann, who did a great job of making sure all the slides were gathered up beforehand, and strictly enforced the five minute time limit. The list of slides is on the Postgres wiki. I talked on my latest favorite project, tail_n_mail - the slides are available on the wiki. I didn't make it through all my slides, so if you were at the talks, check out the PDF for the final two that were not shown. There seemed to be good interest in the project, and I had several people tell me afterwards they would try it out.

The night ended with the EnterpriseDB sponsored party. I spoke to a lot of people there, about replication, PITR scripts, log monitoring, the problem with a large number of inherited objects, and many other topics. Note to EDB: I don't think that venue is going to scale, as the conference gets bigger each year! The total number of people at the conference this year was 184, a new record.

A very good first day: I learned a lot, met new people, saw old friends, and hopefully sold Postgres to some non-Postgres people :). I also managed to git push some changes to tail_n_mail, check_postgres, and Bucardo. It's hard to say no to feature requests when someone asks you in person. :)


PostgreSQL switches to Git

Looks like the Postgres project is finally going to bite the bullet and switch to git as the canonical VCS. Some details are yet to be hashed out, but the decision has been made and a new repo will be built soon. Now to lobby to get that commit-with-inline-patches list to be created...

PostgreSQL 8.4 on RHEL 4: Teaching an old dog new tricks

So a client has been running a really old version of PostgreSQL in production for a while. We finally got the approval to upgrade them from 7.3 to the latest 8.4. Considering the age of the installation, it should come as little surprise that they had been running a similarly ancient OS: RHEL 4.

Like the installed PostgreSQL version, RHEL 4 is ancient -- 5 years old. I anticipated that in order to get us to a current version of PostgreSQL, we'd need to resort to a source build or rolling our own PostgreSQL RPMs. Neither approach was particularly appealing.

While the age/decrepitude of the current machine's OS came as little surprise, what did come as a surprise was that there were supported RPMs available for RHEL 4 in the community yum rpm repository, located at http://yum.pgrpms.org/8.4/redhat/rhel-4-i386/repoview/ (modulo your architecture of choice).

In order to get things installed, I followed the instructions for installing the specific yum repo. There were a few seconds where I was confused because the installation command was giving a "permission denied" error when attempting to install the 8.4 PGDG rpm as root. A little brainstorming and a lsattr later revealed that a previous administrator, apparently in the quest for über-security, had performed a chattr +i on the /etc/yum.repos.d directory.

Evil having been thwarted, in the interest of über-usability I did a quick chattr -i /etc/yum.repos.d and installed the PGDG rpm. Away we went. From that point, the install was completely straightforward; I had a PostgreSQL 8.4.4 system running in no time, and could finally get off that 7.3 behemoth. Now to talk my way into an OS upgrade...


Finding the PostgreSQL version - without logging in!

Metasploit used the error messages given by a PostgreSQL server to find out the version without actually having to log in and issue a "SELECT version()" command. The original article is at http://blog.metasploit.com/2010/02/postgres-fingerprinting.html and is worth a read. I'll wait.

The basic idea is that because version 3 of the Postgres protocol gives you the file and the line number in which the error is generated, you can use the information to figure out what version of Postgres is running, as the line numbers change from version to version. In effect, each version of Postgres reveals enough in its error message to fingerprint it. This was a neat little trick, and I wanted to explore it more myself. The first step was to write a quick Perl script to connect and get the error string out. The original Metasploit script focuses on failed login attempts, but after some experimenting I found an easier way was to send an invalid protocol number (Postgres expects "2.0" or "3.0"). Sending a startup packet with an invalid protocol of "3.1" gave me back the following string:


E|SFATALC0A000Munsupported frontend protocol 3.1: 
server supports 1.0 to 3.0Fpostmaster.cL1507RProcessStartupPacket

The important part of the string was the parts indicating the file and line number:


Fpostmaster.cL1507

In this case, we can clearly see that line 1507 of postmaster.c was throwing the error. After firing up a few more versions of Postgres and recording the line numbers, I found that all versions since 7.3 were hitting the same chunk of code from postmaster.c:


/* Check we can handle the protocol the frontend is using. */

if (PG_PROTOCOL_MAJOR(proto) < PG_PROTOCOL_MAJOR(PG_PROTOCOL_EARLIEST) ||
  PG_PROTOCOL_MAJOR(proto) > PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST) ||
  (PG_PROTOCOL_MAJOR(proto) == PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST) &&
   PG_PROTOCOL_MINOR(proto) > PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST)))
  ereport(FATAL,
  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
    errmsg("unsupported frontend protocol %u.%u: server supports %u.0 to %u.%u",
      PG_PROTOCOL_MAJOR(proto), PG_PROTOCOL_MINOR(proto),
      PG_PROTOCOL_MAJOR(PG_PROTOCOL_EARLIEST),
      PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST),
      PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST))));

Line numbers were definitely different across major versions of Postgres (e.g. 8.2 vs. 8.3), and were even different sometimes across revisions. Rather than fire up every possible revision of Postgres and run my program against it, I simply took advantage of the cvs tags (aka symbolic names) and did this:


cvs update -rREL8_3_0 -p postmaster.c | grep -Fn 'LATEST))))'

This showed me that the string occurred on line 1497 of postmaster.c. I created a Postgres instance and verified that the line number was the same. At that point, it was a simple matter of making a bash script to grab all releases since 7.3 and build up a comprehensive list of when that line changed from version to version.
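As a rough illustration of that grep step, here is a minimal Python sketch. It assumes the text of postmaster.c for a given tag is already in hand; the function name `marker_lines` and the three-line sample are made up for the example and are not part of the original tooling.

```python
def marker_lines(source_text, marker="LATEST))))"):
    """Return the 1-based numbers of lines containing marker, like grep -Fn."""
    return [
        number
        for number, line in enumerate(source_text.splitlines(), start=1)
        if marker in line
    ]


# A tiny made-up stand-in for postmaster.c, just to exercise the function.
sample = "\n".join([
    "ereport(FATAL,",
    "  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),",
    "      PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST))));",
])
print(marker_lines(sample))
```

Running this over the checked-out file for each release tag would give one line number per version, which is the raw material for the table at the end of the script.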

Once that was done, I rolled the whole thing up into a new Perl script called "detect_postgres_version.pl". Here's the script, broken into pieces for explanation. A link to the entire script is at the bottom of the post.

First, we do some standard Perl script things and read in the __DATA__ section at the bottom of the script, which lists at which version the message has changed:


#!/usr/bin/env perl

## Quickly and roughly determine what version of Postgres is running
## greg@endpoint.com

use strict;
use warnings;
use IO::Socket;
use Data::Dumper;
use Getopt::Long;

## __DATA__ looks like this: filename / line / version when it changed
## postmaster.c 1287 7.4.0
## postmaster.c 1293 7.4.2
## postmaster.c 1293 7.4.29
##
## postmaster.c 1408 8.0.0
## postmaster.c 1431 8.0.2

## Build our hash of file-and-line to version matches
my %map;
my ($last,$lastmin,$lastline) = ('',0,0);
while (<DATA>) {
   next if $_ !~ /(\w\S+)\s+(\d+)\s+(.+)/;
   my ($file,$line,$version) = ($1,$2,$3);
   die if $version !~ /(\d+)\.(\d+)\.(\d+)/;
   my ($vmaj,$vmin,$vrev) = ($1,$2,$3);
   my $current = "$file|$vmaj|$vmin";
   if ($current eq $last) {
       my ($lfile,$lmaj,$lmin) = split /\|/ => $last;
       for (my $x = $lastmin+1 ; $x<$vrev; $x++) {
           push @{$map{$file}{$lastline}}
             => ["$lmaj.$lmin","$lmaj.$lmin.$x"];
       }
   }
   push @{$map{$file}{$line}} => ["$vmaj.$vmin",$version];
   $last = $current;
   $lastmin = $vrev;
   $lastline = $line;
}
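For readers who do not speak Perl, the same gap-filling logic can be sketched in Python. This is an approximation under stated assumptions rather than a port: the names `build_map` and `ENTRY` are invented here, line numbers are stored as integers, and malformed lines are skipped instead of aborting the script.

```python
import re
from collections import defaultdict

# filename, line number, then a three-part version such as 8.3.11
ENTRY = re.compile(r"^(\w\S+)\s+(\d+)\s+(\d+)\.(\d+)\.(\d+)\s*$")


def build_map(data_lines):
    """Map file -> line number -> list of (major, full_version) candidates."""
    version_map = defaultdict(lambda: defaultdict(list))
    last_branch, last_rev, last_line = None, 0, 0
    for raw in data_lines:
        match = ENTRY.match(raw.strip())
        if not match:
            continue  # blank lines and comments
        filename, line = match.group(1), int(match.group(2))
        vmaj, vmin, vrev = (int(part) for part in match.group(3, 4, 5))
        branch = (filename, vmaj, vmin)
        major = f"{vmaj}.{vmin}"
        if branch == last_branch:
            # Revisions between two listed entries kept the earlier line number.
            for rev in range(last_rev + 1, vrev):
                version_map[filename][last_line].append((major, f"{major}.{rev}"))
        version_map[filename][line].append((major, f"{major}.{vrev}"))
        last_branch, last_rev, last_line = branch, vrev, line
    return version_map


sample = [
    "postmaster.c 1497 8.3.0",
    "postmaster.c 1507 8.3.8",
    "postmaster.c 1507 8.3.11",
    "",
    "postmaster.c 1570 8.4.0",
]
versions = build_map(sample)
print([full for _, full in versions["postmaster.c"][1507]])
```

The point of the inner loop is that the data only records the revisions where the line number changed (plus the last revision of each branch), so every unlisted revision in between inherits the previous entry's line.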

Next, we allow a few options to the script: port and host. We'll default to a Unix socket if the host is not set, and default to port 5432 if none is given:


## Read in user options and set defaults
my %opt;
GetOptions(\%opt,
          'port=i',
          'host=s',
);

my $port = $opt{port} || 5432;
my $host = $opt{host} || '';

We're ready to connect, using the very standard IO::Socket module. If the host starts with a slash, we assume this is the unix_socket_directory and replace the default '/tmp' location:


## Start the connection, either unix or tcp
my $server;
if (!$host or !index $host, '/') {
   my $path = $host || '/tmp';
   $server = IO::Socket::UNIX->new(
       Type => IO::Socket::SOCK_STREAM,
       Peer => "$path/.s.PGSQL.$port",
   ) or die "Could not connect!: $@";
}
else {
   $server = IO::Socket::INET->new(
       PeerAddr => $host,
       PeerPort => $port,
       Proto    => 'tcp',
       Timeout  => 3,
   ) or die "Could not connect!: $@";
}

Now we're ready to actually send something over our new socket. Postgres expects the startup packet to be in a certain format. We'll follow that format, but send it an invalid protocol number, 3.1. The rest of the information does not really matter, but we'll also tell it we're connecting as user "pg". Finally, we read back in the message, extract the file and line number, and spit them back out to the user:


## Build and send the packet
my $packet = pack('nn', 3,1) . "user\0pg\0\0";
$packet = pack('N', length($packet) + 4). $packet;
$server->send($packet, 0);

## Get the message back and extract the filename and line number
my $msg;
recv $server, $msg, 1000, 0;
if ($msg !~ /F([\w\.]+)\0L(\d+)/) {
   die "Could not find a file and line from error message: $msg\n";
}

my ($file,$line) = ($1,$2);

print "File: $file Line: $line\n";
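The packet layout and the field extraction can also be sketched in Python with the standard struct module. The function names are hypothetical, and the sample reply is rebuilt by hand from the error string shown earlier (assuming the usual layout of a type byte, a four-byte length, then null-terminated fields); it was not captured from a live server.

```python
import struct


def build_startup_packet(major=3, minor=1, user="pg"):
    """Length-prefixed startup packet asking for protocol major.minor."""
    body = struct.pack("!HH", major, minor) + b"user\x00" + user.encode() + b"\x00\x00"
    # The 4-byte length field counts itself as well as the body.
    return struct.pack("!I", len(body) + 4) + body


def parse_error_fields(message):
    """Split an error reply into {one-letter field code: value}."""
    fields = {}
    # Skip the 'E' type byte and the 4-byte length, then read "<code><text>\0" items.
    for item in message[5:].split(b"\x00"):
        if item:
            fields[chr(item[0])] = item[1:].decode()
    return fields


# Hand-built reply mirroring the error string shown earlier in the post.
payload = (
    b"SFATAL\x00C0A000\x00"
    b"Munsupported frontend protocol 3.1: server supports 1.0 to 3.0\x00"
    b"Fpostmaster.c\x00L1507\x00RProcessStartupPacket\x00\x00"
)
reply = b"E" + struct.pack("!I", len(payload) + 4) + payload

fields = parse_error_fields(reply)
print("File:", fields["F"], "Line:", fields["L"])
```

Splitting on the null bytes gives every field at once, whereas the Perl regex above pulls out only the F and L fields; either approach works for this purpose.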

Finally, we try to map the file name and line number we received back to the version of PostgreSQL it came from. If the file is not recognized, or the line number is not known, we bail out early:


$map{$file}
   or die qq{Sorry, I do not know anything about the file "$file"\n};

$map{$file}{$line}
   or die qq{Sorry, I do not know anything about line $line of file "$file"\n};

If there is only one result for this file and line number, we can state what it is and exit.


my $result = $map{$file}{$line};

if (1 == @$result) {
   print "Most likely Postgres version $result->[0][1]\n";
   exit;
}

In most cases, though, we don't know the exact version down to the revision after the second dot, so we'll state what the major version is, and all the possible revisions:


## Walk through and figure out which versions it may be.
## For now, we know that the major version does not overlap
print "Most likely Postgres version $result->[0][0]\n";
print "Specifically, one of these:\n";

for my $row (@$result) {
   print "  Postgres version $row->[1]\n";
}

exit;
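The lookup and reporting steps above boil down to a small decision tree, sketched here in Python. The function name `describe` is invented for the example, and `sample_map` is a hand-written fragment in the shape of the %map hash (file, then line, then a list of major/full version pairs), not the complete table.

```python
def describe(version_map, filename, line):
    """Return the report lines for a (file, line) pair from the error message."""
    by_line = version_map.get(filename)
    if by_line is None:
        return [f'Sorry, I do not know anything about the file "{filename}"']
    candidates = by_line.get(line)
    if not candidates:
        return [f'Sorry, I do not know anything about line {line} of file "{filename}"']
    if len(candidates) == 1:
        # Only one release ever had this line number: an exact match.
        return [f"Most likely Postgres version {candidates[0][1]}"]
    report = [
        f"Most likely Postgres version {candidates[0][0]}",
        "Specifically, one of these:",
    ]
    report += [f"  Postgres version {full}" for _, full in candidates]
    return report


sample_map = {
    "postmaster.c": {
        1570: [("8.4", "8.4.0")],
        1621: [("8.4", "8.4.1"), ("8.4", "8.4.2"), ("8.4", "8.4.3"), ("8.4", "8.4.4")],
    }
}

for text in describe(sample_map, "postmaster.c", 1621):
    print(text)
```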

The only thing left is the DATA section, which I'll show here to be complete:


__DATA__

## Format: filename line version

postmaster.c 1167 7.3.0
postmaster.c 1167 7.3.21

postmaster.c 1287 7.4.0
postmaster.c 1293 7.4.2
postmaster.c 1293 7.4.29

postmaster.c 1408 8.0.0
postmaster.c 1431 8.0.2
postmaster.c 1441 8.0.5
postmaster.c 1445 8.0.6
postmaster.c 1439 8.0.7
postmaster.c 1443 8.0.9
postmaster.c 1445 8.0.14
postmaster.c 1445 8.0.25

postmaster.c 1449 8.1.0
postmaster.c 1450 8.1.1
postmaster.c 1454 8.1.2
postmaster.c 1448 8.1.3
postmaster.c 1452 8.1.4
postmaster.c 1448 8.1.9
postmaster.c 1454 8.1.10
postmaster.c 1454 8.1.21

postmaster.c 1432 8.2.0
postmaster.c 1437 8.2.1
postmaster.c 1440 8.2.5
postmaster.c 1432 8.2.17

postmaster.c 1497 8.3.0
postmaster.c 1507 8.3.8
postmaster.c 1507 8.3.11

postmaster.c 1570 8.4.0
postmaster.c 1621 8.4.1
postmaster.c 1621 8.4.4

postmaster.c 1664 9.0.0

(Because version 9.0 is not released yet, its line number may still change.)

I found this particular protocol error to be a good one because there is no overlap of line numbers across major versions. Of the approximately 125 different versions released since 7.3.0, only 6 are unique enough to identify to the exact revision. That's okay for this iteration of the script. If you wanted to know the exact revision, you could try other errors, such as an invalid login, as the metasploit code does.

The complete code can be read here: detect_postgres_version.pl

I'll be giving a talk later on this week at PgCon 2010, so say hi if you see me there. I'll probably be giving a lightning talk as well.


Using PostgreSQL Hooks

PostgreSQL is well known for its extensibility; users can build new functions, operators, data types, and procedural languages, among others, without having to modify the core PostgreSQL code. Less well known is PostgreSQL's extensive set of "hooks", available to the more persistent coder. These hooks allow users to interrupt and modify behavior in all kinds of places without having to rebuild PostgreSQL.

Few if any of these hooks appear in the documentation, mostly because the code documents them quite well, and anyone wanting to use them is assumed already to be sufficiently familiar with the code to find the information they'd need to use one. For those interested in getting started using hooks, though, an example can be useful. Fortunately, the contrib source provides one, in the form of passwordcheck, a simple contrib module that checks users' passwords for sufficient strength. These checks include having a length of at least 8 characters, being distinct from the username, and containing both alphabetic and non-alphabetic characters. It can also use CrackLib for more intense password testing, if built against the CrackLib code.

In general, these hooks consist of global function pointers of a specific type, which are initially set to NULL. Whenever PostgreSQL wants actually to use a hook, it checks the function pointer, and if it's not NULL, calls the function it points to. When someone implements a hook, they write a function of the proper type and an initialization function to set the function pointer variable. They then package the functions in a library, and tell PostgreSQL to load the result, often using shared_preload_libraries.
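The pattern is easier to see in miniature. The following Python sketch is only an analogy, since real hooks are C function pointers: None stands in for NULL, and every name in it (`create_role`, `reject_short_passwords`, `plugin_init`) is invented for illustration.

```python
# Hypothetical stand-in for a hook variable; None plays the role of NULL.
check_password_hook = None


def create_role(name, password):
    """Core code: call the hook only if a plugin has installed one."""
    if check_password_hook is not None and password is not None:
        check_password_hook(name, password)
    return f"role {name} created"


def reject_short_passwords(name, password):
    """Plugin code: a function with the signature the core expects."""
    if len(password) < 8:
        raise ValueError("password is too short")


def plugin_init():
    """Plays the part of the library's initialization function."""
    global check_password_hook
    check_password_hook = reject_short_passwords


# No plugin has been loaded yet, so nothing objects to a weak password.
print(create_role("alice", "bad"))
```

Once `plugin_init()` has run, the same `create_role("alice", "bad")` call raises an error instead, without any change to the core function.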

For our example, the important pieces of the PostgreSQL code are in src/backend/commands/user.c and src/include/commands/user.h. First, we need a function pointer type, which in this case is called check_password_hook_type:

typedef void (*check_password_hook_type)
   (const char *username, const char *password,
   int password_type, Datum validuntil_time,
   bool validuntil_null);

extern PGDLLIMPORT check_password_hook_type check_password_hook;

This says the check_password_hook will take arguments for user name, password, password type, and validity information (for passwords valid until certain dates). It also provides an extern declaration of the actual function pointer, called "check_password_hook".

The next important pieces of code are in src/backend/commands/user.c, as follows:

/* Hook to check passwords in CreateRole() and AlterRole() */
check_password_hook_type check_password_hook = NULL;

...which defines the function hook variable, and this:

 if (check_password_hook && password)
  (*check_password_hook) (stmt->role, password,
      isMD5(password) ? PASSWORD_TYPE_MD5 : PASSWORD_TYPE_PLAINTEXT,
    validUntil_datum,
    validUntil_null);

...which actually uses the hook. Actually the hook is used twice, with identical code, once in CreateRole() and once in AlterRole(), so as to provide password checking in both places. (Insert D.R.Y. rant here).

In order to take advantage of this hook, the passwordcheck module needs to implement the hook function, and set the check_password_hook variable to point to that function. First, passwordcheck.c needs to include a few things, including "commands/user.h" to get the definitions of check_password_hook and check_password_hook_type, and call the PG_MODULE_MAGIC macro every PostgreSQL shared library needs. Then, it implements the password checking logic in a function called check_password():


static void
check_password(const char *username,
      const char *password,
      int password_type,
      Datum validuntil_time,
      bool validuntil_null)
{
/* Actual password checking logic goes here */
}

Note that this declaration matches the arguments described in the check_password_hook_type, above.
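To give a feel for what the elided checking logic does, here is a Python sketch of the three rules described earlier. It is not the module's C code: the function name `password_problem` is made up, the minimum length of 8 and the "must not contain the user name" reading of the username rule are assumptions, and only the "password is too short" message is confirmed by the output shown in this post.

```python
MIN_PASSWORD_LENGTH = 8  # assumed threshold


def password_problem(username, password):
    """Return a complaint about the password, or None if it passes."""
    if len(password) < MIN_PASSWORD_LENGTH:
        return "password is too short"
    if username and username in password:
        return "password must not contain user name"
    has_letter = any(ch.isalpha() for ch in password)
    has_nonletter = any(not ch.isalpha() for ch in password)
    if not (has_letter and has_nonletter):
        return "password must contain both letters and nonletters"
    return None


print(password_problem("badpass", "bad"))
```

A real hook would report the problem by raising an error in the server rather than returning a string; returning the text just keeps the sketch easy to experiment with.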

Now to ensure the check_password_hook variable points to this new check_password() function. When loading a shared library, PostgreSQL looks for a function defined in that library called _PG_init(), and runs it if it exists. In passwordcheck, the _PG_init() function is as simple as this:

void
_PG_init(void)
{
 /* activate password checks when the module is loaded */
 check_password_hook = check_password;
}

Other modules using hooks often save the hook variable's existing value before setting it, in case something else is already using the hook, so that the earlier hook can still be called. For instance, the auto_explain contrib module does this in _PG_init() (note that auto_explain uses three different hooks):

 prev_ExecutorStart = ExecutorStart_hook;
 ExecutorStart_hook = explain_ExecutorStart;
 prev_ExecutorRun = ExecutorRun_hook;
 ExecutorRun_hook = explain_ExecutorRun;
 prev_ExecutorEnd = ExecutorEnd_hook;
 ExecutorEnd_hook = explain_ExecutorEnd;

auto_explain also resets the hook variables in its _PG_fini() function. Since unloading modules isn't yet supported and thus, _PG_fini() never gets called, this is perhaps unimportant, but is good for the sake of being thorough.
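The save-then-set idiom matters because it lets several modules share one hook. A small Python analogy follows, with invented names throughout (`executor_start_hook`, `install_logging_hook` and so on are not PostgreSQL identifiers): each installer remembers whatever was there before and calls it from its own hook, falling back to the standard routine when nothing was installed.

```python
calls = []

# Hypothetical hook variable; None means no plugin is installed.
executor_start_hook = None


def standard_executor_start(query):
    calls.append(("standard", query))


def executor_start(query):
    """Core entry point: use the hook if set, else the standard routine."""
    if executor_start_hook is not None:
        executor_start_hook(query)
    else:
        standard_executor_start(query)


def install_logging_hook():
    """Save the previous hook, then install ours so that it chains to it."""
    global executor_start_hook
    previous = executor_start_hook

    def logging_hook(query):
        calls.append(("logged", query))
        if previous is not None:
            previous(query)
        else:
            standard_executor_start(query)

    executor_start_hook = logging_hook


install_logging_hook()
install_logging_hook()  # a second install chains to the first
executor_start("SELECT 1")
print(calls)
```

If the second installer had simply overwritten the variable without calling the saved value, the first module's hook would silently stop running.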

Back to passwordcheck. Having set the hook variable, all that remains is to get PostgreSQL to load this library. The easiest way to do that is to set shared_preload_libraries in postgresql.conf:

josh@eddie:~/devel/pgsrc/pg-eggyknap/contrib/passwordcheck$ psql
psql (9.0devel)
Type "help" for help.

5432 josh@josh# show shared_preload_libraries ;
 shared_preload_libraries 
--------------------------
 passwordcheck
(1 row)

Restarting PostgreSQL loads the library, proven as follows:


5432 josh@josh# create user badpass with password 'bad';
ERROR:  password is too short

There are hooks like this all over the PostgreSQL code base. Simply search for "_hook_type", to find such possibilities as these:

Name                         Description
shmem_startup_hook           Called when PostgreSQL initializes its shared memory segment
explain_get_index_name_hook  Called when explain finds indexes' names.
planner_hook                 Runs when the planner begins, so plugins can monitor or even modify the planner's behavior
get_relation_info_hook       Allows modification of expansion of the information PostgreSQL gets from the catalogs for a particular relation, including adding fake indexes


PostgreSQL template databases to restore to a known state

Someone asked on the mailing lists recently about restoring a PostgreSQL database to a known state for testing purposes. How to do this depends a little bit on what one means by "known state", so let's explore a few scenarios and their solutions.

First, let's assume you have a Postgres cluster with one or more databases that you create for developers or QA people to mess around with. At some point, you want to "reset" the database to the pristine state it was in before people started making changes to it.

The first situation is that people have made both DDL changes (such as ALTER TABLE ... ADD COLUMN) and DML changes (such as INSERT/UPDATE/DELETE). In this case, what you want is a complete snapshot of the database at a point in time, which you can then restore from. The easiest way to do this is to use the TEMPLATE feature of the CREATE DATABASE command.

Every time you run CREATE DATABASE, it uses an already existing database as the "template". Basically, it creates a copy of the template database you specify. If no template is specified, it uses "template1" by default, so that these two commands are equivalent:


CREATE DATABASE foobar;
CREATE DATABASE foobar TEMPLATE template1;

Thus, if we want to create a complete copy of an existing database, we simply use it as a template for our copy:


CREATE DATABASE mydb_template TEMPLATE mydb;

Thus, when we want to restore the mydb database to the exact same state as it was when we ran the above command, we simply do:


DROP DATABASE mydb;
CREATE DATABASE mydb TEMPLATE mydb_template;
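If the reset is going to be run repeatedly from a test harness, it can help to generate the statements rather than retype them. The Python sketch below only builds SQL strings and connects to nothing; `quote_ident`, `snapshot_sql`, and `reset_sql` are hypothetical helper names, and how the statements get executed is left to whatever client you already use.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def snapshot_sql(dbname, template):
    """Statement that copies dbname into a new template database."""
    return f"CREATE DATABASE {quote_ident(template)} TEMPLATE {quote_ident(dbname)};"


def reset_sql(dbname, template):
    """Statements that throw dbname away and recreate it from the template."""
    return [
        f"DROP DATABASE {quote_ident(dbname)};",
        f"CREATE DATABASE {quote_ident(dbname)} TEMPLATE {quote_ident(template)};",
    ]


print(snapshot_sql("mydb", "mydb_template"))
for statement in reset_sql("mydb", "mydb_template"):
    print(statement)
```

Keep in mind that both steps expect nobody else to be connected to the database being dropped or copied, so a harness should close its own connections to it first.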

You may want to make sure that nobody changes your new template database. One way to do this is to not allow any non-superusers to connect to the database by setting the connection limit to zero. This can be done either at creation time, or afterwards, like so:


CREATE DATABASE mydb_template TEMPLATE mydb CONNECTION LIMIT 0;

ALTER DATABASE mydb_template CONNECTION LIMIT 0;

You may want to go further by granting the database official "template" status by adjusting the datistemplate column in the pg_database table:


UPDATE pg_database SET datistemplate = TRUE WHERE datname = 'mydb_template';

This will allow anyone to use the database as a template, as long as they have the CREATEDB privilege. You can also restrict *all* connections to the database, even superusers, by adjusting the datallowconn column:


UPDATE pg_database SET datallowconn = FALSE WHERE datname = 'mydb_template';

Another way to restore the database to a known state is to use the pg_dump utility to create a file, then use psql to restore that database. In this case, the command to save a copy would be:


pg_dump mydb --create > mydb.template.pg

The --create option tells pg_dump to create the database itself as the first command in the file. If you look at the generated file, you'll see that it is using template0 as the template database in this case. Why does Postgres have template0 and template1? The template1 database is meant as a user-configurable template: changes you make to it will be picked up by all future CREATE DATABASE commands (a common example is a CREATE LANGUAGE command). The template0 database, on the other hand, is meant as a "hands off, don't ever change it" stable database that can always safely be used as a template, with no changes from when the cluster was first created. To that end, you are not even allowed to connect to the template0 database (thanks to the datallowconn column mentioned earlier).

Now that we have a file (mydb.template.pg), the procedure to recreate the database becomes:


psql -X -c 'DROP DATABASE mydb'

psql -X --set ON_ERROR_STOP=on --quiet --file mydb.template.pg

We use the -X argument to ensure we don't have any surprises lurking inside of psqlrc files. The --set ON_ERROR_STOP=on option tells psql to stop processing the moment it encounters an error, and the --quiet tells psql to not be verbose and only let us know about very important things. (While I normally advocate using the --single-transaction option as well, we cannot in this case as our file contains a CREATE DATABASE line).


What if (as someone posited in the thread) the original poster really wanted only the *data* to be cleaned out, and not the schema (i.e. the DDL)? In this case, what we want to do is remove all rows from all tables. The easiest way to do this is with the TRUNCATE command, of course. Because we don't want to worry about which tables need to be emptied before other ones because of foreign key constraints, we'll also use the CASCADE option to TRUNCATE. We'll query the system catalogs for a list of all user tables, generate TRUNCATE commands for them, and then play back the commands we just created. First, we create a simple text file containing a query that generates the commands to truncate all the tables:


SELECT 'TRUNCATE TABLE '
 || quote_ident(nspname)
 || '.'
 || quote_ident(relname)
 || ' CASCADE;'
FROM pg_class
JOIN pg_namespace n ON (n.oid = relnamespace)
WHERE nspname !~ '^pg'
AND nspname <> 'information_schema'
AND relkind = 'r';

Once that's saved as truncate_all_tables.pg, resetting the database by removing all rows from all tables becomes as simple as:


psql mydb -X -t -f truncate_all_tables.pg | psql mydb --quiet

We again use the --quiet option to limit the output, as we don't need to see a string of "TRUNCATE TABLE" strings scroll by. The -t option (also written as --tuples-only) prevents the headers and footers from being output, as we don't want to pipe those back in.
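Since TRUNCATE is transactional in Postgres, one optional refinement is to wrap the generated commands in a single transaction, so that a failure part-way through leaves every table untouched. Here is a minimal sketch of that idea; the function name reset_all_rows is my own invention, and it assumes the truncate_all_tables.pg file from above is in the current directory:

```shell
# Hypothetical helper: run the generated TRUNCATE commands inside one
# transaction. Assumes truncate_all_tables.pg is in the current directory.
reset_all_rows() {
    db=$1
    {
        echo 'BEGIN;'
        # Generate the TRUNCATE commands (tuples only, no psqlrc)
        psql "$db" -X -t -f truncate_all_tables.pg
        echo 'COMMIT;'
    } | psql "$db" -X --quiet --set ON_ERROR_STOP=on
}
```

Calling it looks like "reset_all_rows mydb". With ON_ERROR_STOP set, the second psql quits at the first error, the connection closes, and the open transaction is rolled back.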

It's most likely you'd also want the sequences to be reset to their starting point as well. While sequences generally start at "1", we'll take out the guesswork by using the "ALTER SEQUENCE seqname RESTART" syntax. We'll append the following SQL to the text file we created earlier:


SELECT 'ALTER SEQUENCE '
 || quote_ident(nspname)
 || '.'
 || quote_ident(relname)
 || ' RESTART;'
FROM pg_class
JOIN pg_namespace n ON (n.oid = relnamespace)
WHERE nspname !~ '^pg'
AND nspname <> 'information_schema'
AND relkind = 'S';

The command is run the same as before, but now in addition to table truncation, the sequences are all reset to their starting values.


A final way to restore the database to a known state is a variation on the previous pg_dump command. Rather than save the schema *and* data, we simply want to restore the database without any data:


## Create the template file:
pg_dump mydb --schema-only --create > mydb.template.schemaonly.pg

## Restore it:
psql -X -c 'DROP DATABASE mydb'
psql -X --set ON_ERROR_STOP=on --file mydb.template.schemaonly.pg

Those are a few basic ideas on how to reset your database. There are a few limitations that got glossed over, such as that nobody can be connected to the database that is being used as a template for another one when the CREATE DATABASE command is being run, but this should be enough to get you started.
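As for that connection limitation, one way to deal with it is to disconnect any lingering sessions right before the reset. The sketch below writes a small SQL file using the pg_terminate_backend() function; the file name is made up for this example, and the column is called "pid" on Postgres 9.2 and later but "procpid" on older releases, so adjust to taste:

```shell
# Write a hypothetical reset script; file name is an assumption.
cat > reset_from_template.sql <<'SQL'
-- Disconnect sessions still attached to either database
-- (the column is "pid" on 9.2 and later, "procpid" before that)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname IN ('mydb', 'mydb_template')
  AND pid <> pg_backend_pid();

DROP DATABASE mydb;
CREATE DATABASE mydb TEMPLATE mydb_template;
SQL

# Run it as a superuser while connected to some other database, e.g.:
#   psql -X --set ON_ERROR_STOP=on -d postgres -f reset_from_template.sql
```

Keep in mind that a new client could still sneak in between the disconnect and the DROP DATABASE, in which case the script stops with an error and can simply be run again.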

Learn more about End Point's Postgres Support, Development, and Consulting.

Tail_n_Mail does Windows (log file monitoring)

I've just released version 1.10.1 of tail_n_mail.pl, the handy script for watching over your Postgres logs and sending email when interesting things happen.

Much of the recent work on tail_n_mail has been in improving the parsing of statements in order to normalize them and give reports like this:


[1] From files A to Q Count: 839
First: [A] 2010-05-08T05:10:46-05:00 alpha postgres[13567]
Last:  [Q] 2010-05-09T05:02:27-05:00 bravo postgres[19334]
ERROR: duplicate key violates unique constraint "unique_email_address"
STATEMENT: INSERT INTO email_table (id, email, request, token) VALUES (?)

[2] From files C to E (between lines 12523 of A and 268431 of B, occurs 6159 times)
First: [C] 2010-05-04 16:32:23 UTC [22504]
Last:  [E] 2010-05-05 05:04:53 UTC [23907]
ERROR: invalid byte sequence for encoding "UTF8": 0x????
HINT: This error can also happen if the byte sequence does not 
match the encoding expected by the server, which is controlled 
by "client_encoding".

## The above examples are from two separate instances, the first 
## of which has the "find_line_number" option turned off

However, I've only ever used tail_n_mail on Linux-like systems, so it did not work on Windows systems...until now. Thanks to an error report and patch from Paulo Saudin, this program will now work on Windows. There is a new option, mailmode, which defaults to 'sendmail', for the same behavior as previous versions of tail_n_mail. This assumes you have access to a sendmail binary (which may or may not be from the actual Sendmail program: many mail programs provide a compatible binary of the same name). If you don't have sendmail, you can now set mailmode to 'smtp' (you can also simply use --smtp). This switches to using the Net::SMTP::SSL module to send the mail instead of sendmail.

Switching the mailmode is not enough, of course, so there are some additional flags to help the mail go out:

  • --mailserver : the name of the outgoing SMTP server
  • --mailuser : the user to authenticate with
  • --mailpass : the password of the user
  • --mailport : the port to use: defaults to 465

Needless to say, using the --mailpass option from the command line or even in a script is not the best practice, so it is highly recommended that you put the new variables inside a tailnmailrc file. When the script starts, it looks for a file named .tailnmailrc in the current directory. If that is not found, it looks for the same file in your home directory (or technically, whatever the HOME environment variable is set to). If that does not exist, it checks for the file /etc/tailnmailrc. You can override those checks by specifying the file directly with the --tailnmailrc= option, or disable all rc files with the --no-tailnmailrc option.

The tailnmailrc file is very straightforward: each line is a name and value pair, separated by a colon or an equal sign. Lines starting with a '#' indicate a comment and are skipped. So someone using the new Net::SMTP::SSL method might have a .tailnmailrc in their home directory that looks like this:

mailmode=smtp
mailserver=mail.example.com
mailuser=greg@example.com
mailpass=mysupersekretpassword
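The lookup order for the rc file described above can be pictured with a few lines of shell. This is only a sketch of the described behavior, not the actual code inside tail_n_mail (which is Perl), and the function name is made up:

```shell
# Sketch of the documented search order: current directory first,
# then $HOME, then /etc. Prints the first file found.
find_tailnmailrc() {
    for candidate in ./.tailnmailrc "$HOME/.tailnmailrc" /etc/tailnmailrc; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # no rc file anywhere
}
```

Because the file now holds a password, it is also worth making it readable only by its owner, for example with chmod 600 ~/.tailnmailrc.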

The tail_n_mail program is open source and BSD licensed. Contributions are always welcome: send a patch, or fork a version through the GitHub mirror. There is also a Bugzilla system to accept bug reports and feature requests.


PostgreSQL startup Debian logging failure

I ran into issues with debugging why a fresh PostgreSQL replica wasn't starting on Debian. This was with a highly-customized postgresql.conf file with custom logging location, data_directory, etc. set.

The system log files were not showing any information about the failed pg_ctlcluster output, nor was there any information in /var/log/postgresql/ or the defined log_directory.

I was able to successfully create a new cluster with pg_createcluster and see logs for the new cluster in /var/log/postgresql/. The utility pg_lsclusters showed both clusters in the listing, but the initial cluster was still down, showing up with a custom log location. After reviewing the Debian wrapper scripts (fortunately written in Perl) I disabled log_filename, log_directory, and logging_collector, leaving log_destination = stderr. I was then finally able to get log information spit out to the terminal.

In this case, it was due to a fresh Amazon EC2 instance lacking appropriate sysctl.conf settings for kernel.shmmax and kernel.shmall. This particular error occurred before the logging was fully set up, which is why we did not get logging information in the postgresql.conf-designated location.
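If you suspect the same problem, a quick look at the current kernel limits can save some time. Here is a small Linux-only sketch that reads the values straight from /proc; the function name is my own, and what counts as "big enough" depends on your shared_buffers and related settings:

```shell
# Linux only: show the current shared memory limits, or say so if the
# /proc entries are not available on this system.
show_shm_limits() {
    for key in shmmax shmall; do
        f=/proc/sys/kernel/$key
        if [ -r "$f" ]; then
            printf 'kernel.%s = %s\n' "$key" "$(cat "$f")"
        else
            printf 'kernel.%s: not available\n' "$key"
        fi
    done
}

show_shm_limits
```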

Once I had the log information, it was a short matter to correct the issue. It just goes to show that often finding the problem is 90% of the work. Hopefully this comes in handy to someone else.

Tickle me Postgres: Tcl inside PostgreSQL with pl/tcl and pl/tclu

Although I really love Pl/Perl and find it the most useful language to write PostgreSQL functions in, Postgres has had (for a long time) another set of procedural languages: Pl/Tcl and Pl/TclU. The Tcl language is pronounced "tickle", so those two languages are pronounced as "pee-el-tickle" and "pee-el-tickle-you". The pl/tcl languages have been around since before any others, even pl/perl; for a long time in the early days of Postgres using pl/tclu was the only way to do things "outside of the database", such as making system calls, writing files, sending email, etc.

Sometimes people are surprised when they hear I still use Tcl. Although it's not as widely mentioned as other procedural languages, it's a very clean, easy to read, powerful language that shouldn't be overlooked. Of course, with Postgres, you have a wide variety of languages to write your functions in.

The nice thing about Tcl is that not only is it an easy language to write in, it's fully supported by Postgres. Besides PL/pgSQL, only three languages are maintained inside the Postgres tree itself: Perl, Tcl, and Python. Only two of those have a trusted and untrusted version: Perl and Tcl. All procedural languages in Postgres are untrusted by default, which means they can do things like make system calls. To be a trusted language, there must be some capacity to limit what can be done by the language. With Perl, this is accomplished through the "Safe" Perl module. For Tcl, this is accomplished by having two versions of the Tcl interpreter: a normal one for pltclu and a separate one that uses the "Safe-Tcl mechanism" for pltcl.

Let's take a quick look at what a pltcl function looks like. We'll use pl/tcl to tackle the common problem of "SELECT COUNT(*) is very slow" by tracking the row count using triggers as we go along. For this, we'll start with a sample table for which we want to be able to find out exactly how many rows it holds at any time, without suffering the delay of COUNT(*). Here's the table definition, and a quick command to populate it with some dummy data:


CREATE SEQUENCE customer_id_seq;

CREATE TABLE customer (
  id      INTEGER     NOT NULL DEFAULT nextval('customer_id_seq') PRIMARY KEY,
  email   TEXT            NULL,
  address TEXT            NULL,
  cdate   TIMESTAMPTZ NOT NULL DEFAULT now()
);

INSERT INTO customer (email, address)
  SELECT 'jsixpack@example.com', '123 Main Street'
  FROM generate_series(1,10000);

A quick review: we create a sequence for use by the table to populate its primary key, the 'id' column. Each customer also has an optional email and address, plus we automatically track when we create the row by using the "DEFAULT now()" trick on the 'cdate' column. Finally, we use the super handy generate_series function to populate the new table with ten thousand rows of data.

Next, we'll create a helper table that will keep track of the rows for us. We'll make it generic so that it can track any number of tables:


CREATE TABLE table_count (
  schemaname TEXT   NOT NULL,
  tablename  TEXT   NOT NULL,
  rows       BIGINT NOT NULL DEFAULT 0
);

INSERT INTO table_count(schemaname,tablename,rows)
  SELECT 'public', 'customer', count(*) FROM customer;

We also populated it with the current number of rows in customer. Of course, this will be out of date as soon as someone updates the table, so let's add our triggers. We don't want to update the table_count table on every single row change, but only at the end of each statement. To do that, we'll make a row-level trigger that stores up the changes inside a global variable, and then a statement-level trigger that uses the global variable to update the table_count table.


CREATE FUNCTION update_table_count_row()
  RETURNS TRIGGER
  SECURITY DEFINER
  VOLATILE
  LANGUAGE pltcl
AS $BC$

  ## Declare tablecount as a global variable so other functions
  ## can access our changes
  variable tablecount

  ## Set the local count of rows changed to 0
  set rows 0

  ## $TG_op indicates what type of command was just run
  ## Modify the local variable rows depending on what we just did
  switch $TG_op {
    INSERT {
      incr rows 1
    }
    UPDATE {
      ## No change in number of rows
      ## We could also leave out the ON UPDATE from the trigger below
    }
    DELETE {
      incr rows -1
    }
  }

  ## The tablecount variable will be an associative array
  ## The index will be this table's name, the value is the rows changed
  ## We should probably be using $TG_schema_name as well, but we'll ignore that

  ## If there is no variable for this table yet, create it, otherwise just change it
  if {![ info exists tablecount($TG_table_name) ] } {
    set tablecount($TG_table_name) $rows
  } else {
    incr tablecount($TG_table_name) $rows
  }

  return OK
$BC$;

CREATE FUNCTION update_table_count_statement()
  RETURNS TRIGGER
  SECURITY DEFINER
  LANGUAGE pltcl
AS $BC$

  ## Make sure we access the global version of the tablecount variable
  variable tablecount

  ## If it doesn't exist yet (for example, when an update changes no 
  ## rows), we simply exit early without making changes
  if { ! [ info exists tablecount ] } {
    return OK
  }
  ## Same logic if our specific entry in the array does not exist
  if { ! [ info exists tablecount($TG_table_name) ] } {
    return OK
  }
  ## If no rows were changed, we simply exit
  if { $tablecount($TG_table_name) == 0 } {
    return OK
  }

  ## Update the table_count table: may be a positive or negative shift
  spi_exec "
    UPDATE table_count
    SET rows=rows+$tablecount($TG_table_name)
    WHERE tablename = '$TG_table_name'
  "

  ## Reset the global variable for the next round
  set tablecount($TG_table_name) 0

  return OK
$BC$;

CREATE TRIGGER update_table_count_row
  AFTER INSERT OR UPDATE OR DELETE
  ON public.customer
  FOR EACH ROW
  EXECUTE PROCEDURE update_table_count_row();

CREATE TRIGGER update_table_count_statement
  AFTER INSERT OR UPDATE OR DELETE
  ON public.customer
  FOR EACH STATEMENT
  EXECUTE PROCEDURE update_table_count_statement();

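(Caveat: these functions are not 100% safe. Global variables live in the Tcl interpreter of the current database session, which is shared by the pl/tcl functions that session runs, so another function that happens to use the same variable name could step on our counts. A pending count can also be left behind if an error aborts a statement after the row-level triggers have run but before the statement-level trigger does. In practice, this is unlikely.)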
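If the stored count ever does drift away from reality, it can be re-baselined from the real table. The following is a minimal sketch rather than part of the trigger setup itself: the file name is made up, and it takes a SHARE lock so that nobody can change the customer table while it is being counted (plain reads still work):

```shell
# Write a hypothetical resync script for the customer row count.
cat > resync_table_count.sql <<'SQL'
BEGIN;
-- Block writes to customer while we count; reads are still allowed
LOCK TABLE public.customer IN SHARE MODE;
UPDATE table_count
   SET rows = (SELECT count(*) FROM public.customer)
 WHERE schemaname = 'public'
   AND tablename = 'customer';
COMMIT;
SQL

# Run it with something like:
#   psql -X --set ON_ERROR_STOP=on -f resync_table_count.sql
```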

If everything is working correctly, we should see the entries in the table_count table match up with the output of SELECT COUNT(*). Let's take a look via a psql session:

psql=# \t
Showing only tuples.
psql=# \a
Output format is unaligned.

psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|10000
10000

psql=# UPDATE customer SET email=email WHERE id <= 10;
UPDATE 10

psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|10000
10000

psql=# INSERT INTO customer (email, address)
psql-#   SELECT email, address FROM customer LIMIT 4;
INSERT 0 4

psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|10004
10004

psql=# DELETE FROM customer WHERE id <= 10;
DELETE 10

psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|9994
9994

psql=# TRUNCATE TABLE customer;
TRUNCATE TABLE

psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|9994
0

Whoops! Everything matched up until that TRUNCATE. On earlier versions of Postgres, there was no way around that problem, but if we have Postgres version 8.4 or better, we can use truncate triggers!


CREATE FUNCTION update_table_count_truncate()
  RETURNS TRIGGER
  SECURITY DEFINER
  LANGUAGE pltcl
AS $BC$

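  ## Link to the global tablecount array: without this, the "set"
  ## below would only create a local variable
  variable tablecount

  spi_exec "
    UPDATE table_count
    SET rows=0
    WHERE tablename = '$TG_table_name'
  "

  ## Discard any pending row changes for this table
  set tablecount($TG_table_name) 0

  return OK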
$BC$;

CREATE TRIGGER update_table_count_truncate
  AFTER TRUNCATE
  ON public.customer
  FOR EACH STATEMENT
  EXECUTE PROCEDURE update_table_count_truncate();

Pretty straightforward, let's make sure it works:

psql=# TRUNCATE TABLE customer;
TRUNCATE TABLE

psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|0
0

Success! This was a fairly contrived example, but Tcl (and especially pl/tclU) offers a lot more functionality. If you want to examine pl/tcl and pl/tclu for yourself, you'll need to make sure it's compiled into the Postgres you are using. If using a packaging system, it's as simple as doing this (or something like it, depending on what packaging system you use):

yum install postgresql-pltcl

If compiling from source, just pass the --with-tcl option to configure. You'll probably also need to install the Tcl development package, e.g. with yum install tcl-devel

Once installed, installing it into a specific database is as simple as:

psql=# CREATE LANGUAGE pltcl;
CREATE LANGUAGE
psql=# CREATE LANGUAGE pltclu;
CREATE LANGUAGE

For more about Tcl, check out the Tcl Wiki, the Tcl tutorial, or this Tcl reference. For more about pl/tcl and pl/tclu, visit the Postgres pltcl documentation.

LinuxFest Northwest: PostgreSQL 9.0 upcoming features

Once again, LinuxFest Northwest provided a full track of PostgreSQL talks during their two-day conference in Bellingham, WA.

Gabrielle Roth and I presented our favorite features in 9.0, including a live demo of Hot Standby with streaming replication! We also demonstrated several other new features.

The full feature list is available on the developer site right now!

Viewing Postgres function progress from the outside

Getting visibility into what your PostgreSQL function is doing can be a difficult task. While you can sprinkle notices inside your code, for example with the RAISE feature of plpgsql, that only shows the notices to the session that is currently running the function. Let's look at a solution to peek inside a long-running function from any session.

While there are a few ways to do this, one of the most elegant is to use Postgres sequences, which have the unique property of living "outside" the normal MVCC visibility rules. We'll abuse this feature to allow the function to update its status as it goes along.

First, let's create a simple example function that simulates doing a lot of work, and taking a long time to do so. The function doesn't really do anything, of course, so we'll throw some random sleeps in to emulate the effects of running on a busy production machine. Here's what the first version looks like:


DROP FUNCTION IF EXISTS slowfunc();

CREATE FUNCTION slowfunc()
RETURNS TEXT
VOLATILE
SECURITY DEFINER
LANGUAGE plpgsql
AS $BC$
DECLARE
  x INT = 1;
  mynumber INT;
BEGIN
  RAISE NOTICE 'Start of function';

  WHILE x <= 5 LOOP
    -- Random number from 1 to 10
    SELECT 1+(random()*9)::int INTO mynumber;
    RAISE NOTICE 'Start expensive step %: time to run=%', x, mynumber;
    PERFORM pg_sleep(mynumber);
    x = x + 1;
  END LOOP;

  RETURN 'End of function';
END
$BC$;

Pretty straightforward function: we simply emulate doing five expensive steps, and output a small notice as we go along. Running it gives this output (with pauses from 1-10 seconds of course):

$ psql -f slowfunc.sql
DROP FUNCTION
CREATE FUNCTION
psql:slowfunc.sql:30: NOTICE:  Start of function
psql:slowfunc.sql:30: NOTICE:  Start expensive step 1: time to run=2
psql:slowfunc.sql:30: NOTICE:  Start expensive step 2: time to run=7
psql:slowfunc.sql:30: NOTICE:  Start expensive step 3: time to run=3
psql:slowfunc.sql:30: NOTICE:  Start expensive step 4: time to run=8
psql:slowfunc.sql:30: NOTICE:  Start expensive step 5: time to run=5
    slowfunc     
-----------------
 End of function

To grant some visibility to other processes about where we are, we're going to change a sequence from within the function itself. First we need to decide on what sequence to use. While we could pick a common name, this won't allow us to run the function in more than one process at a time. Therefore, we'll create unique sequences based on the PID of the process running the function. Doing so is fairly trivial for an application: just create that sequence before the expensive function is called. For this example, we'll use some psql tricks to achieve the same effect like so:


\t
\o tmp.drop.sql
SELECT 'DROP SEQUENCE IF EXISTS slowfuncseq_' || pg_backend_pid() || ';';
\o tmp.create.sql
SELECT 'CREATE SEQUENCE slowfuncseq_' || pg_backend_pid() || ';';
\o
\t
\i tmp.drop.sql
\i tmp.create.sql

From the top, this script turns off everything but tuples (so we have a clean output), then arranges for all output to go to the file named "tmp.drop.sql". Then we build a sequence name by concatenating the string 'slowfuncseq_' with the current PID. We put that into a DROP SEQUENCE statement. Then we redirect the output to a new file named "tmp.create.sql" (this closes the old one as well). We do the same thing for CREATE SEQUENCE. Finally, we stop sending things to the file, turn off "tuples only" mode, and import the two files we just created, first to drop the sequence if it exists, and then to create it. The files will look something like this:

$ more tmp.*.sql
::::::::::::::
tmp.drop.sql
::::::::::::::
 DROP SEQUENCE IF EXISTS slowfuncseq_8762;

::::::::::::::
tmp.create.sql
::::::::::::::
 CREATE SEQUENCE slowfuncseq_8762;
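If your psql is version 9.6 or newer, the temporary files can be skipped entirely by using the \gexec meta-command, which runs each value returned by a query as a new SQL statement. A sketch of the same drop-and-create dance, with a file name I made up for the example:

```shell
# Write a hypothetical psql script that builds and runs the statements
# in one step. Needs psql 9.6+ (\gexec) and a 9.1+ server (format()).
cat > slowfuncseq_setup.sql <<'SQL'
SELECT format('DROP SEQUENCE IF EXISTS %I',
              'slowfuncseq_' || pg_backend_pid()) \gexec
SELECT format('CREATE SEQUENCE %I',
              'slowfuncseq_' || pg_backend_pid()) \gexec
SQL

# Load it with \i slowfuncseq_setup.sql from inside the psql session
# that will call the function, so that the PID matches.
```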

The only thing left is to add the calls to the sequence from within the function itself. Remember that the sequence called must exist, or the function will throw an exception, so make sure you create the sequence before the function is called! (Alternatively, you could use the same named sequence every time, but as explained before, you lose the ability to track more than one iteration of the function at a time.)


DROP FUNCTION IF EXISTS slowfunc();

CREATE FUNCTION slowfunc()
RETURNS TEXT
VOLATILE
SECURITY DEFINER
LANGUAGE plpgsql
AS $BC$
DECLARE
  x INT = 1;
  mynumber INT;
  seqname TEXT;
BEGIN
  SELECT INTO seqname 'slowfuncseq_' || pg_backend_pid();
  PERFORM nextval(seqname);

  RAISE NOTICE 'Start of function';

  WHILE x <= 5 LOOP
    -- Random number from 1 to 10
    SELECT 1+(random()*9)::int INTO mynumber;
    RAISE NOTICE 'Start expensive step %: time to run=%', x, mynumber;
    PERFORM pg_sleep(mynumber);
    PERFORM nextval(seqname);
    x = x + 1;
  END LOOP;

  RETURN 'End of function';
END
$BC$;

Again, it's important to follow the steps in order: create the sequence, run the function, and then drop the sequence. While changes to a sequence's value live outside MVCC, creation of the sequence itself does not, so the CREATE SEQUENCE must be committed before any other session can see it. Here's what the whole thing will look like in psql:


\t
\o tmp.drop.sql
SELECT 'DROP SEQUENCE IF EXISTS slowfuncseq_' || pg_backend_pid() || ';';
\o tmp.create.sql
SELECT 'CREATE SEQUENCE slowfuncseq_' || pg_backend_pid() || ';';
\o
\t
\i tmp.drop.sql
\i tmp.create.sql
SELECT slowfunc();
\i tmp.drop.sql

Now you can see how far along the function is from any other process. For example, if we kick off the script above, then go into psql from another window, we can use the process id from the pg_stat_activity view to see how far along our function is:

psql=# select procpid, current_query from pg_stat_activity;
 procpid |                    current_query                     
---------+------------------------------------------------------
   10206 | SELECT slowfunc();
   10313 | select procpid, current_query from pg_stat_activity;

psql=# select last_value from slowfuncseq_10206;
 last_value 
------------
          3

You can assign your own values and meanings to the numbers, of course: this one simply tells us that the script is on the third iteration of our sleep loop. You could use multiple sequences to convey even more information.
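Rather than running that query by hand over and over, a small shell loop can watch the sequence for you. This is just a sketch: the function name is hypothetical, it assumes psql connects to the right database with your usual defaults, and it relies on psql returning a non-zero exit code once the sequence has been dropped:

```shell
# Hypothetical helper: poll the progress sequence of a given backend PID
# once a second until the sequence no longer exists.
watch_progress() {
    pid=$1
    while out=$(psql -X -A -t -c "SELECT last_value FROM slowfuncseq_$pid" 2>/dev/null); do
        printf 'step %s\n' "$out"
        sleep 1
    done
}
```

Using the PID from the example above, this would be started with "watch_progress 10206" and stops on its own when the function's script drops the sequence.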

There are other ways besides sequences to achieve this trick: one that I've used before is to have a plperlu function open a new connection to the existing database and update a text column in a simple tracking table. Another idea is to update a small semaphore table within the function, and check the modification time of the underlying file underneath your data directory.

PostgreSQL at LinuxFest Northwest

This is my third year driving up to Bellingham for LinuxFest Northwest, and I'm excited to be presenting two talks about PostgreSQL there. Adrian Klaver is one of the organizers of the conference, and has always been a huge supporter of PostgreSQL. He has gone out of his way to have a track of content about our favorite database.

I'll be presenting an introduction to Bucardo and co-hosting a talk about new features in version 9.0 of PostgreSQL with Gabrielle Roth.

Talking about Bucardo and replication is always a blast. The last time I gave this talk, it was to a packed house in Seattle, so I'm hoping for another lively discussion about the state of replication in PostgreSQL.

Restoring individual table data from a Postgres dump

Recently, one of our clients needed to restore the data in a specific table from the previous night's PostgreSQL dump file. Basically, there was an UPDATE query that did not do what it was supposed to, and some of the columns in the table were irreversibly changed. So, the challenge was to quickly restore the contents of that table.

The SQL dump file was generated by the pg_dumpall command, and thus there was no easy way to extract individual tables. If you are using the pg_dump command, you can specify a "custom" dump format by adding the -Fc option. Then, pulling out the data from a single table becomes as simple as adding a few flags to the pg_restore command like so:

$ pg_restore --data-only --table=alpha large.custom.dumpfile.pg > alpha.data.pg

One of the drawbacks of using the custom format is that it is only available on a per-database basis; you cannot use it with pg_dumpall. That was the case here, so we needed to extract the data of that one table from within the large dump file. If you know me well, you might suspect at this point that I've written yet another handy perl script to tackle the problem. As tempting as that may have been, time was of the essence, and the wonderful array of Unix command line tools already provided me with everything I needed.

Our goal at this point was to pull the data from a single table ("alpha") from a very large dump file ("large.dumpfile.pg") into a separate and smaller file that we could use to import directly into the database.

The first step was to find exactly where in the file the data was. We knew the name of the table, and we also knew that a dump file inserts data by using the COPY command, so there should be a line like this in the dump file:

COPY alpha (a,b,c,d) FROM stdin;

Because all the COPYs are done together, we can be pretty sure that the command after "COPY alpha" is another copy. So the first thing to try is:

$ grep -n COPY large.dumpfile.pg | grep -A1 'COPY alpha '

This uses grep's handy -n option (aka --line-number) to output the line number that each match appears on. Then we pipe that back to grep, search for our table name, and print the line after it with the -A option (aka --after-context). The output looked like this:

$ grep -n COPY large.dumpfile.pg | grep -A1 'COPY alpha '
1233889:COPY alpha (cdate, who, state, add, remove) FROM stdin;
12182851:COPY alpha_sequence (sname, value) FROM stdin;

Note that many of the options here are GNU specific. If you are using an operating system that doesn't support the common GNU tools, you are going to have a much harder time doing this (and many other shell tasks)!

We now have a pretty good guess at the starting and ending lines for our data: lines 1233889 through 12182850 (we subtract 1 as we don't want the next COPY). We can now use head and tail to extract the lines we want, once we figure out how many lines our data spans. We use head to keep everything up to the last line of our data, and then tail to keep only the final lines of that output:

$ echo 12182851 - 1233889 | bc
10948962
$ head -12182850 large.dumpfile.pg | tail -10948962 > alpha.data.pg
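The arithmetic is easy to get wrong by one, so here is the general recipe as a tiny shell function: keep the first LAST lines with head, then keep the final LAST - FIRST + 1 of those with tail. The function name is just something I made up for illustration:

```shell
# extract_lines FILE FIRST LAST
# Print lines FIRST through LAST (inclusive) of FILE.
extract_lines() {
    file=$1
    first=$2
    last=$3
    head -n "$last" "$file" | tail -n "$((last - first + 1))"
}
```

For the dump file above, that would be "extract_lines large.dumpfile.pg 1233889 12182850 > alpha.data.pg".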

However, what if the next command was not a COPY? We'll have to scan forward for the end of the COPY section, which is always a backslash and a single dot at the start of a new line. The new command becomes (all one line, but broken down for readability):

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | grep -n -m1 '^\\\.'

That's a lot, but in the spirit of Unix tools doing one thing and one thing well, it's easy to break down. First, we grab the line numbers where COPY occurs in our file, then we find the first occurrence of our table (using the -m aka --max-count option). We cut out the first field from that output, using a colon as the delimiter. This gives us the line number where the COPY begins. We pass this to xargs, and tail the file with a --lines=+x argument, which outputs all lines from that file *starting* at the given line number. Finally, we pipe that output to grep and look for the end of copy indicator, stopping at the first one, and also outputting the line number. Here's what we get:

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | grep -n -m1 '^\\\.'

148956:\.
xargs: tail: terminated by signal 13

This tells us that 148956 lines after the COPY, we encountered the string "\.". (The complaint from xargs can be ignored). Now we can create our data file:

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | head -148956 > alpha.data.pg
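For the curious, the same extraction can also be done in a single pass with awk, which avoids counting lines at all. This is an alternative sketch, not what we actually ran that day, and the function name is hypothetical. It prints from the COPY line for the given table through the first "\." line after it, then stops reading:

```shell
# extract_copy TABLE FILE
# Print the COPY block for TABLE, from its "COPY table " line through
# the terminating "\." line.
extract_copy() {
    awk -v start="COPY $1 " '
        index($0, start) == 1   { copying = 1 }   # line begins with "COPY table "
        copying                 { print }
        copying && $0 == "\\."  { exit }          # end-of-data marker reached
    ' "$2"
}
```

Here that would be "extract_copy alpha large.dumpfile.pg > alpha.data.pg". The trailing space in the match keeps a table such as alpha_sequence from being picked up by mistake.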

Now that the file is there, we should do a quick sanity check on it. If the file is small enough, you could simply call it up in your favorite editor or run it through less or more. You can also check things out by knowing that a Postgres dump file separates the data in columns by a tab character when using the COPY command. So we can view all lines that don't have a tab, and make sure there is nothing except comments and the COPY and \. lines:

$ grep -v -P '\t' alpha.data.pg

The grep option -P (aka --perl-regexp) instructs grep to interpret the argument ("backslash t" in this case) as a Perl regular expression. You could also simply input a literal tab there: on most systems this can be done with the <ctrl-v><TAB> key combination.
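A slightly stronger check is to count the tab-separated fields on every data line and make sure they all agree with the number of columns in the COPY line. Here is a sketch of that idea using awk; the function name is made up, and it assumes the file holds a single COPY block like the one we just extracted:

```shell
# field_counts FILE
# Show how many data lines have each number of tab-separated fields.
# A healthy extract prints exactly one line of output.
field_counts() {
    awk -F '\t' '
        /^COPY / || $0 == "\\." { next }   # skip the COPY header and end marker
        { print NF }
    ' "$1" | sort -n | uniq -c
}
```

For our five-column alpha table, "field_counts alpha.data.pg" should report a single count with a field total of 5. This also works for tables with only one column, where the "no tab" test above would flag every line.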

It's time to replace that bad data. We'll need to truncate the existing table, then COPY our data back in. To do this, we'll create a file that we'll feed to psql -X -f. Here's the top of the file:

$ cat > alpha.restore.pg

\set ON_ERROR_STOP on
\timing

\c mydatabase someuser

BEGIN;

CREATE SCHEMA backup;

CREATE TABLE backup.alpha AS SELECT * FROM public.alpha;

TRUNCATE TABLE alpha;

From the top: we tell psql to stop right away if it encounters any problems, and then turn on the timing of all queries. We explicitly connect to the correct database as the correct user. Putting it here in the script is a safety feature. Then we start a new transaction, create a backup schema, and make a copy of the existing data into a backup table before truncating the original table. The next step is to add in the data, then wrap things up:

$ cat alpha.data.pg >> alpha.restore.pg

Now we run it and check for any errors. We use the -X argument to ensure control of exactly which psql options are in effect, bypassing any psqlrc files that may be in use.

$ psql -X -f alpha.restore.pg

If everything looks good, the final step is to add a COMMIT and run the file again:

$ echo "COMMIT;" >> alpha.restore.pg
$ psql -X -f alpha.restore.pg

And we are done! All of this is a little simplified, as in real life there was actually more than one table to be restored, and each had some foreign key dependencies that had to be worked around, but the basic idea remains the same. (And yes, I know you can do the extraction in a Perl one-liner.)

PostgreSQL Conference East 2010 review

I just returned from the PostgreSQL Conference East 2010. This is one of the US "regional" Postgres conferences, which usually occur once a year on both the East and West coast. This is the second year the East conference has taken place in my home town of Philadelphia.

Overall, it was a great conference. In addition to the talks, of course, there are many other important benefits to such a conference, such as the "hallway tracks", seeing old friends and clients, meeting new ones, and getting to argue about default postgresql.conf settings over lunch. I gave a 90 minute talk on "Postgres for non-Postgres people" and a lightning talk on the indispensable tail_n_mail.pl program.

This year saw the conference take place at a hotel for the first time, and this was a big improvement over the previous school campus-based conferences. Everything was in one building, there was plenty of space to hang out and chat between the talks, and everything just felt a little bit easier. The one drawback was that the rooms were not really designed to lecture to large numbers of people (e.g. no stadium seating), but this was not too much of an issue for most of the talks.

A few of the talks I attended included:

  • Mine! Luckily, my talk was in the very first slot, so I was able to give it and then be done talking for the rest of the conference (with the exception of the lightning talk). My talk was "PostgreSQL for MySQL (and other database people)". A quick show of hands showed that in addition to a good number of MySQL people, we had people coming from Oracle, Microsoft SQL Server, and even Informix. I walked through the steps to take when upgrading your application from using some other database to using Postgres, pointing out some of the pain points and particular Postgres gotchas, focusing on the SQL syntax. The second half of the talk focused on the Postgres project itself, explaining how it all worked, what the "community" and "core" consists of, how companies are involved, how development is done, and the philosophy of the project.
  • "PostgreSQL at myYearbook.com" by Gavin M. Roy. I've heard earlier versions of this talk before, but it was neat to see how much myyearbook.com had grown in just one year and some of the new challenges they faced. Of course, Gavin is still upset about the primary key situation and they are still doing unique indexes instead of PKs so they can do in-place reindexing for bloat removal.
  • Baron Schwartz spoke about "Query Analysis with mk-query-digest". The "mk" is short for maatkit, a nice suite of tools for doing all sorts of database-related things. Granted, it's very MySQL focused at the moment, but Baron has started to port things over to Postgres, and the demo he gave was pretty impressive. I'll definitely be downloading that code and taking a look.
  • Magnus Hagander gave a talk on "Secure PostgreSQL Deployment" which was a lot more interesting than I thought it would be (I knew it had Windows slides). My take-home lessons: never use the ssl mode of "prefer", and always check your Debian systems as they like to switch SSL on everything for no good reason. It's also quite fascinating to see the number of ways you can authenticate to a Postgres database.
  • I attended a talk on "Inside the PostgreSQL Infrastructure" by Dave Page. A lot of it I already knew, as I'm a little involved in said infrastructure, but it was good to hear some of the future plans, including standardizing on Debian instead of FreeBSD in the future.
  • Spencer Christensen's talk on "PostgreSQL Administration for System Administrators" was very well done but mostly review for me :). It was nice to see a shout out in his talk (and some others) for check_postgres.pl.
  • Robert Haas gave a good talk on "The PostgreSQL Query Planner" that seemed to be very well received. The bit about the join removal tech was particularly interesting: the Postgres planner does some really, really clever things when trying to build the best possible plan for your query.

At the lunch on Saturday, Josh Drake asked if anyone else wanted to do a lightning talk, so I made a quick outline on the back of a nearby piece of paper and gave a no-slides, no-notes five minute talk on tail_n_mail.pl. It went pretty well, and I even had 30 seconds left over at the end for questions. To clarify my answer to one of those questions: tail_n_mail.pl can read CSV logs (indeed, any text file), but it cannot yet consolidate similar entries, or do any of the other neat things it does, until we teach TNM how to parse the CSV format properly.

An excellent conference overall, but I'd be remiss if I didn't offer a little constructive criticism for the next time (and other conferences):

  • Scheduling. The rooms were sometimes hard to find, and the schedule did not list the room next to the talk. That color-coded thing just does not work. In addition, it seemed like similar talks were sometimes stacked up against each other rather than staggered. Thus, you could learn about londiste OR rubyrep, but not both. Similarly, there were two Python talks up against each other.
  • Lightning talks. Always, always put the lightning talks at the *start* of the conference, not the end. Lightning talks are a great way to learn about what other people are doing. By having them at the start of the conference, you have the entire rest of the time to follow up with people about their talks and foster more real-life discussions.
  • Lightning talks. Okay, not done talking about these yet. Lightning talks are somewhat notorious for spending lots of time getting the video to work right, as people switch computers, fiddle with plugs, etc. If you can't get it set up in 30 seconds, start the clock! You should be able to give your lightning talk without slides, if need be.

LibrePlanet 2010: Eben Moglen and the future of Oracle in free software

I just got back from Libre Planet 2010, a conference for free software activists put on by the Free Software Foundation. I imagine most readers of this blog are familiar with the language debate over free software vs. open source. Much of the business and software community has settled into using open source as the term of choice, but Libre Planet is certainly a place where saying "free software" is the norm.

I presented two talks - one on how to give good talks by connecting with your audience, and a second about non-coding roles in free software communities. The first talk is built on my work with user groups and giving presentations at primarily free software conferences over the last five years. The second was built off of the great work of Josh Berkus, for a talk that he first gave at a mini-conference I arranged the day before OSCON 2007 for Postgres.

One talk I attended surprised me with an important discussion of the future of the open source database market.

Eben Moglen spoke about the future of the Free Software Foundation and the new challenges that software freedom faces in a world increasingly dominated by network services - social networking, collaboration tools and other software where ownership of data is largely shared, and no single person or entity can be legitimately claimed to be sole owner of the data or structure that emerges.

Eben Moglen said, "We are at a point of inflection in our long campaign." He talked at length about the work the Software Freedom Law Center has done, collaborating with organizations whose goals were not necessarily software freedom, nor directly aligned with the FSF. He specifically brought up patent pools, and work that the SFLC has done to bring non-free companies into the fight against abusive patents.

Eben then turned his attention to the issue of the Oracle/Sun acquisition. He commented that we haven't really looked to Oracle for pro-software freedom activity in the past, and then said that "every technically competent 15-year old in the world uses MySQL." While this isn't music to the ears of Postgres users and developers, with applications like Wordpress, I'd say that Eben isn't too far off.

What was interesting to me was Eben's conjecture that MySQL is essentially a tool that's now being sharpened to stab deeply into the heart of Microsoft's SQL Server market. He pointed out that Oracle has about 375,000 customers, and claimed that there's nowhere you can learn Oracle for free. (Several people have since pointed out that you can download crippled versions of Oracle for free to learn the basics, but I claim that's not the same thing as being able to download and install full server versions of something like MySQL or PostgreSQL.)

Regardless of the details, this play by Oracle would be an interesting use of open source software to disrupt a market.

I suggest to the Postgres community that SQL Server to Postgres migrations are a real business opportunity for our consultants, and an area in which we as a community should pursue documenting and assisting with transitions as much as possible.

Using psql \o to append to a file

I had a slow query I was working on recently, and wanted to capture the output of EXPLAIN ANALYZE to a file. This is easy, with psql's \o command:

5432 josh@josh# \o explain-results

Once EXPLAIN ANALYZE had finished running, I wanted the psql output back in my psql console window. This, too, is easy, using the \o command without a filename:

5432 josh@josh# \o

But later, after adding an index or two and changing some settings, I wanted to run a new EXPLAIN ANALYZE, and I wanted its output appended to the explain-results file I built earlier. At least on my system, \o will normally overwrite the target file, which would mean I'd lose my original results. I realize it's simple to, say, pipe output to a new file ("explain-results-2"), but I wasn't interested. Instead, because \o can also accept a pipe character and a shell command to pipe its output to, I did this:

5432 josh@josh# \o | cat - >> explain-results

Life is good.

Update: A helpful commenter pointed out I hadn't actually used the same files in the original post. Oops. Fixed.

PostgreSQL UTF-8 Conversion

It's becoming increasingly common for me to be involved in conversion of an old version of PostgreSQL to a new one, and at the same time, from an old "SQL_ASCII" encoding (that is, undeclared, unvalidated byte soup) to UTF-8.

Common ways to do this are to run pg_dumpall and then pipe the output through iconv or recode. When your source encoding is all pure ASCII, you don't need to do even that. When it's really all Windows-1252 (a superset of Latin-1 aka ISO-8859-1) it's easy.

But often, the data is stored in various unknown encodings from several sources over the course of years, including some that's already in UTF-8. When you convert with iconv, it dies with an error at the first problem, whereas recode will let you ignore encoding problems, but that leaves you with junk in your output.

The case I'm often encountering is fairly easy, but not perfect: Lots of ASCII, some Windows-1252, and some UTF-8. Since both pure ASCII and UTF-8 can be mechanistically detected, I put together this script to do the detection. It's Perl and uses the nice IsUTF8 module to do its character encoding detection:

Pipe input to the script. It handles one line at a time. When run with any arguments (such as --test) it will swallow pure ASCII lines, write lines it thinks are valid UTF-8 to stderr, and will convert the remaining presumed Windows-1252 lines to stdout, for manual examination.

If its guesses look correct, run it again with no arguments, and it will write all 3 types of encoding to stdout, ready for input to psql in your new UTF-8 encoded database.

(Don't forget to munge your pg_dump file to remove any hardcoded declarations of "SQL_ASCII" encoding from CREATE DATABASE statements, or otherwise make sure your database actually is created with UTF-8 encoding!)
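For readers without Perl handy, here is a rough Python sketch of the same line-by-line approach. To be clear, this is not the script described above: the function names are invented for the example, and it relies on Python's built-in codecs rather than the IsUTF8 module. It treats anything that is neither pure ASCII nor valid UTF-8 as Windows-1252, which is the assumption this post makes about the data.

```python
def classify_line(raw: bytes) -> str:
    """Return 'ascii', 'utf8', or 'cp1252' for one line of raw bytes."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        return "cp1252"  # a guess: everything else is presumed Windows-1252


def to_utf8(raw: bytes) -> bytes:
    """Re-encode a presumed Windows-1252 line as UTF-8; pass others through."""
    if classify_line(raw) != "cp1252":
        return raw
    # A few bytes (0x81, 0x8D, ...) are undefined in cp1252; they become U+FFFD.
    return raw.decode("cp1252", errors="replace").encode("utf-8")


def route_lines(lines, test_mode):
    """Yield ('stdout' or 'stderr', bytes) pairs, mimicking the two modes."""
    for raw in lines:
        kind = classify_line(raw)
        if not test_mode:
            yield ("stdout", to_utf8(raw))  # all three kinds, ready for psql
        elif kind == "utf8":
            yield ("stderr", raw)  # presumed UTF-8, for manual examination
        elif kind == "cp1252":
            yield ("stdout", to_utf8(raw))  # converted, for manual examination
        # in test mode, pure ASCII lines are swallowed
```

To use it as a filter, you could feed it sys.stdin.buffer and write each yielded line to sys.stdout.buffer or sys.stderr.buffer as indicated.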

PostgreSQL tip: arbitrary serialized rows

Sometimes when using PostgreSQL, you want to deal with a record in its serialized form. If you're dealing with a specific table, you can accomplish this using the table name itself:

psql # CREATE TABLE foo (bar text, baz int);
CREATE TABLE

psql # INSERT INTO foo VALUES ('test 1', 1), ('test 2', 2);
INSERT 0 2

psql # SELECT foo FROM foo;
     foo      
--------------
 ("test 1",1)
 ("test 2",2)
(2 rows)

This works fine for defined tables, but how do you go about this for arbitrary SELECTs? The answer is simple: wrap the query in a subselect and alias it, like so:

psql # SELECT q FROM (SELECT 1, 2) q;
   q   
-------
 (1,2)
(1 row)

PostgreSQL EC2/EBS/RAID 0 snapshot backup

One of our clients uses Amazon Web Services to host their production application and database servers on EC2 with EBS (Elastic Block Store) storage volumes. Their main database is PostgreSQL.

A big benefit of Amazon's cloud services is that you can easily add and remove virtual server instances, storage space, etc. and pay as you go. One known problem with Amazon's EBS storage is that it is much more I/O limited than, say, a nice SAN.

To partially mitigate the I/O limitations, they're using 4 EBS volumes to back a Linux software RAID 0 block device. On top of that is the xfs filesystem. This gives roughly 4x the I/O throughput and has been effective so far.

They ship WAL files to a secondary server that serves as warm standby in case the primary server fails. That's working fine.

They also do nightly backups using pg_dumpall on the master so that there's a separate portable (SQL) backup not dependent on the server architecture. The problem that led to this article is that extra I/O caused by pg_dumpall pushes the system beyond its I/O limits. It adds both reads (from the PostgreSQL database) and writes (to the SQL output file).

There are several solutions we are considering so that we can keep both binary backups of the database and SQL backups, since both types are valuable. In this article I'm not discussing all the options or trying to decide which is best in this case. Instead, I want to consider just one of the tried and true methods of backing up the binary database files on another host to offload the I/O:

  1. Create an atomic snapshot of the block devices
  2. Spin up another virtual server
  3. Mount the backup volume
  4. Start Postgres and allow it to recover from the apparent "crash" the server had (since there wasn't a clean shutdown of the database before the snapshot)
  5. Do whatever pg_dump or other backups are desired
  6. Make throwaway copies of the snapshot for QA or other testing

The benefit of such snapshots is that you get an exact backup of the database, with whatever table bloat, indexes, statistics, etc. exactly as they are in production. That's a big difference from a freshly created database and import from pg_dump.

The difference here is that we're using 4 EBS volumes with RAID 0 striped across them, and there isn't currently a way to do an atomic snapshot of all 4 volumes at the same time. So it's no longer "atomic" and who knows what state the filesystem metadata and the file data itself would be in?

Well, why not try it anyway? Filesystem metadata doesn't change that often, especially in the controlled environment of a Postgres data volume. Snapshotting within a relatively short timeframe would be pretty close to atomic, and probably look to the software (operating system and database) like some kind of strange crash since some EBS volumes would have slightly newer writes than others. But aren't all crashes a little unpredictable? Why shouldn't the software be able to deal with that? Especially if we have Postgres make a checkpoint right before we snapshot.

I wanted to know if it was crazy or not, so I tried it on a new set of services in a separate AWS account. Here are the notes and some details of what I did:

  1. Created one EC2 image:
    Amazon EC2 Debian 5.0 lenny AMI built by Eric Hammond
    Debian AMI ID ami-4ffe1926 (x86_64)
    Instance Type: High-CPU Extra Large (c1.xlarge) - 7 GB RAM, 8 CPU cores
  2. Created 4 x 10 GB EBS volumes
  3. Attached volumes to the image
  4. Created software RAID 0 device:
    mdadm -C /dev/md0 -n 4 -l 0 -z max /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  5. Created XFS filesystem on top of RAID 0 device:
    mkfs -t xfs -L /pgdata /dev/md0
  6. Set up in /etc/fstab and mounted:
    mkdir /pgdata
    # edit /etc/fstab, with noatime
    mount /pgdata
  7. Installed PostgreSQL 8.3
  8. Configured postgresql.conf to be similar to primary production database server
  9. Created empty new database cluster with data directory in /pgdata
  10. Started Postgres and imported a play database (from public domain census name data and Project Gutenberg texts), resulting in about 820 MB in data directory
  11. Ran some bulk inserts to grow database to around 5 GB
  12. Rebooted EC2 instance to confirm everything came back up correctly on its own
  13. Set up two concurrent data-insertion processes:
    • 50 million row insert based on another local table (INSERT INTO ... SELECT ...), in a single transaction (hits disk hard, but nothing should be visible in the snapshot because the transaction won't have committed before the snapshot is taken)
    • Repeated single inserts in autocommit mode (Python script writing INSERT statements using random data from /usr/share/dict/words piped into psql), to verify that new inserts made it into the snapshot, and no partial row garbage leaked through
  14. Started those "beater" jobs, which mostly consumed 2-3 CPU cores
  15. Manually inserted a known test row and created a known view that should appear in the snapshot
  16. Started Postgres's backup mode that allows for copying binary data files in a non-atomic manner, which also does a CHECKPOINT and thus also a filesystem sync:
    SELECT pg_start_backup('raid_backup');
  17. Manually inserted a 2nd known test row & 2nd known test view that I don't want to appear in the snapshot after recovery
  18. Ran snapshot script which calls ec2-create-snapshot on each of the 4 EBS volumes -- during first run, run serially quite slowly taking about 1 minute total; during second run, run in parallel such that the snapshot point was within 1 second for all 4 volumes
  19. Told Postgres the backup was over:
    SELECT pg_stop_backup();
  20. Ran script to create new EBS volumes derived from the 4 snapshots (which aren't directly usable and always go into S3), using ec2-create-volume --snapshot
  21. Ran script to attach new EBS volumes to devices on the new EC2 instance using ec2-attach-volume
  22. Then, on the new EC2 instance for doing backups:
    • mdadm --assemble --scan
    • mount /pgdata
    • Start Postgres
    • Count rows on the 2 volatile tables; confirm that the table with the in-process transaction doesn't show any new rows, and that the table receiving individually committed rows reads correctly
    • VACUUM VERBOSE -- and confirm no errors or inconsistencies detected
    • pg_dumpall # confirmed no errors and data looks sound
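The parallel snapshot in step 18 is the part most worth sketching. What follows is not the script I ran, just a minimal Python illustration of kicking off one snapshot per volume at nearly the same moment. The names run_snapshot and snapshot_all are invented for this example, and it assumes ec2-create-snapshot is on the PATH, has credentials configured, and takes the volume ID as its argument.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def run_snapshot(volume_id):
    # Assumption: the EC2 API tools are installed and configured.
    subprocess.run(["ec2-create-snapshot", volume_id], check=True)


def snapshot_all(volume_ids, snapshot=run_snapshot):
    """Start one snapshot per volume concurrently.

    Returns the number of seconds between the first and the last request
    being launched, which is only a rough proxy for how far apart the
    actual snapshot points are.
    """
    started = {}

    def worker(volume_id):
        started[volume_id] = time.monotonic()
        snapshot(volume_id)

    with ThreadPoolExecutor(max_workers=len(volume_ids)) as pool:
        # list() forces completion and re-raises any worker exception
        list(pool.map(worker, volume_ids))
    return max(started.values()) - min(started.values())


# Hypothetical usage, between pg_start_backup() and pg_stop_backup():
# spread = snapshot_all(["vol-aaaa", "vol-bbbb", "vol-cccc", "vol-dddd"])
# print(f"snapshot requests launched within {spread:.3f} seconds")
```

The snapshot callable is a parameter so the launcher can be exercised with a stand-in function before pointing it at real volumes.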

It worked! No errors or problems, and pretty straightforward to do.

Actually before doing all the above I first did a simpler trial run with no active database writes happening, and didn't make any attempt for the 4 EBS snapshots to happen simultaneously. They were actually spread out over almost a minute, and it worked fine. With the confidence that the whole thing wasn't a fool's errand, I then put together the scripts to do lots of writes during the snapshot and made the snapshots run in parallel so they'd be close to atomic.

There are lots of caveats to note here:

  • This is an experiment in progress, not a how-to for the general public.
  • The data set that was snapshotted was fairly small.
  • Two successful runs, even with no failures, is not a very big sample set. :)
  • I didn't use Postgres's point-in-time recovery (PITR) here at all -- I just started up the database and let Postgres recover from an apparent crash. Shipping over the few WAL files that the master generated during the backup window (between pg_start_backup and pg_stop_backup), once the snapshot copying is complete, would allow a theoretically fully reliable recovery to be made, not just a practically non-failing recovery as I did above.

So there's more work to be done to prove this technique viable in production for a mission-critical database, but it's a promising start worth further investigation. It shows that there is a way to back up a database across multiple EBS volumes without adding noticeably to its I/O load by utilizing the Amazon EBS data store's snapshotting and letting a separate EC2 server offload the I/O of backups or anything else we want to do with the data.

PostgreSQL tip: dump objects into a new schema

Sometimes the need arises to export a PostgreSQL database and put its contents into its own schema; say you've been busy developing things in the public schema. Sometimes people suggest manipulating the pg_dump output either manually or using a tool such as sed or perl to explicitly schema-qualify all table objects, etc, but this is error-prone depending on your table names, and can be more trouble than it's worth.

One trick that may work for you if your current database is not in use by anyone else is to rename the default public schema to your desired schema name before dumping, and then optionally change it back to public afterward. This has the benefit that all objects will be properly dumped in the new schema (sequences, etc) and not just tables, plus you don't have to worry about trying to parse SQL with regexes to modify this explicitly.

$ psql -c "ALTER SCHEMA public RENAME TO new_name"
$ pg_dump --schema=new_name > new_name_dump.sql
$ psql -c "ALTER SCHEMA new_name RENAME TO public"
$ # load new_name_dump.sql elsewhere

Cheers!

PostgreSQL version 9.0 release date prediction

So when will PostgreSQL version 9.0 come out? I decided to "run the numbers" and take a look at how the Postgres project has done historically. Here's a quick graph showing the approximate number of days each major release since version 6.0 took:

Some interesting things can be seen here: there is a rough correlation between the complexity of a new release and the time it takes, major releases take longer, and the trend is gradually towards more days per release. Overall the project is doing great, releasing on average every 288 days since version 6. If we only look at version 7 and onwards, the releases are on average 367 days apart. If we look at *just* version 7, the average is 324 days. If we look at *just* version 8, the average is 410. Since the last major version that came out was on July 1, 2009, the numbers predict 9.0 will be released on July 3, 2010, based on the version 7 and 8 averages, and on August 15, 2010, based on just the version 8 averages. However, this upcoming version has two very major features, streaming replication (SR) and hot standby (HS). How those will affect the release schedule remains to be seen, but I suspect the 9.0 to 9.1 window will be short indeed.
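The date arithmetic behind those two predictions is easy to reproduce. Here is a small Python sketch using only the figures quoted above (the July 1, 2009 release date and the 367 and 410 day averages); the function name predicted_release is just for illustration.

```python
from datetime import date, timedelta

LAST_MAJOR_RELEASE = date(2009, 7, 1)  # date quoted in the post


def predicted_release(average_days: int) -> date:
    """Add an average release interval to the last major release date."""
    return LAST_MAJOR_RELEASE + timedelta(days=average_days)


print(predicted_release(367))  # version 7 and 8 average
print(predicted_release(410))  # version 8 average only
```

The first call lands on July 3, 2010 and the second on August 15, 2010, matching the dates given above.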

As a recap, the Postgres project only bumps the first part of the version number for major changes (although many, myself included, would argue that 7.4 was such a major jump it should have been called 8.0). The second number is incremented anytime a "new release" happens, and means new features and enhancements. The final number, the revision, is only incremented for security and bug fixes, and is almost always a 100% binary compatible drop in for the previous revision in the branch. (What's the average (mean) days between revisions? 84 days since version 6, and 88 days since version 7. The medians are 84 and 87 respectively.)

How busy were those periods? Here's the number of commits per release period. Note that I said release period, not release, as commits are still being made to old branches, although this is a very small minority of the commits, so I did not bother to break it down at that level.

There is a strong correlation with the previous chart. Of note is version 8.1, which had few commits and was released relatively quickly. Also note that version 8.0 is still winning as far as the sheer number of commits, most likely due to the fact that native Windows support was added in that version.

Some other items of interest from the data:

  • There have been roughly 140,000 commits from version 6.0 to 8.4.2.
  • There have been 32 CVS committers since the start of the project (and of course, many hundreds of others whose work was funnelled through those committers)
  • The mean number of commits per person is 4383, but the distribution is very skewed: Bruce, Peter, and Tom account for 80% of all commits, with the mean between them of 37,000 commits.
  • Commits changed about 40 lines on average.
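To get a feel for how skewed that distribution is, here is a quick back-of-the-envelope calculation in Python. It uses only the rounded figures from the list above (roughly 140,000 commits, 32 committers, about 37,000 commits each for the top three), so its results differ slightly from the exact numbers quoted.

```python
total_commits = 140_000   # "roughly 140,000", per the list above
committers = 32
top_three_mean = 37_000   # mean for Bruce, Peter, and Tom

overall_mean = total_commits / committers
top_three_total = 3 * top_three_mean
top_three_share = top_three_total / total_commits
# Mean for the other 29 committers once the top three are set aside
rest_mean = (total_commits - top_three_total) / (committers - 3)

print(f"overall mean:    {overall_mean:.0f} commits per committer")
print(f"top three share: {top_three_share:.0%} of all commits")
print(f"mean for others: {rest_mean:.0f} commits per committer")
```

With these rounded inputs, the remaining 29 committers average about a thousand commits each, a small fraction of the overall mean.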

Alright, two final charts: commits per time periods. I'll let the data speak for itself this time. Stay tuned for future blog posts exploring this data further!

LCA2010: Postgres represent!

I had the pleasure of attending and presenting at linux.conf.au this year in Wellington, NZ. LCA is an institution whose friendliness and focus on the practical business of creating and sustaining open source projects were truly inspirational.

My talk this year was "A Survey of Open Source Databases", where I actually created a survey and asked over 35 open source database projects to respond. I have received about 15 responses so far, and also did my own research on the over 50 projects I identified. I created a place-holder site for my research at: ossdbsurvey.org. I'm hoping to revise the survey (make it shorter!!) and get more projects to provide information.

Ultimately, I'd like the site to be a central location for finding information and comparing different projects. Performance of each is a huge issue, and there are a lot of individuals constructing good (and bad) systems for comparing. I don't think I want to dive into that pool, yet. But I would like to start collecting the work others have done in a central place. Right now it is really far too difficult to find all of this information.

Part of the talk was also a foray into the dangerous world of classification. I tried to put together basic categories, based on conversations with individual developers and some fine-tuning with Josh Berkus. Josh gave a short overview of database models during "Relational vs Non-relational" in the Data Storage mini-conf, and we collaborated some on category definition. I also saw Devdas Bhagat give a use case talk on using Postgres, yet again confirming how wonderful transactional DDL is for developers. I also gave a lightning talk (WITHOUT SLIDES!) on Bucardo at the tail end of the Data Storage mini-conf.

Josh Berkus, during "PostgreSQL Development Today", announced to the world that the new version of Postgres would be version 9.0! And he did a live demonstration of streaming replication and hot standby. The audience seemed pleased.

I was delighted to see representatives from the Postgres community on the main stage of the conference three times during LCA!

And finally, I had the pleasure of participating in the Friday keynote lightning talks. I kicked things off by telling the story of the elections in Ondo State, Nigeria, in 5 minutes. I saw that one of the IT people I met while in Akure was now helping Osun state investigate and correct election fraud in January. So glad to see that their good work continues!

Automatic migration from Slony to Bucardo

About a month ago, Bucardo added an interesting set of features in the form of a new script called slony_migrator.pl. In this post I'll describe slony_migrator.pl and its three major functions.

The Setup

For these examples, I'm using the pagila sample database along with a set of scripts I wrote and made available here. These scripts build two different Slony clusters. The first is a simple one, which replicates this database from a database called "pagila1" on one host to a database "pagila2" on another host. The second is more complex. Its one master node replicates the pagila database to two slave nodes, one of which replicates it again to a fourth slave using Slony's FORWARD function as described here. I implemented this setup on two FreeBSD virtual machines, known as myfreebsd and myfreebsd2. The reset-simple.sh and reset-complex.sh scripts in the script package I've linked to will build all the necessary databases from one pagila database and do all the Slony configuration.

Slony Synopsis

The slony_migrator.pl script has three possible actions, the first of which is to connect to a running Slony cluster and print a synopsis of the Slony setup it discovers. You can do this safely against a running, production Slony cluster; it gathers all its necessary information from a few simple Slony queries. Here's the synopsis the script writes for the simple configuration I described above:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd
Slony version: 1.2.16
psql version: 8.3
Postgres version: 8.3.7
Slony schema: _pagila
Local node: 1
SET 1: All pagila tables
* Master node: 1  Active: Yes  PID: 3309  Comment: "Cluster node 1"
  (dbname=pagila1 host=myfreebsd user=postgres)
  ** Slave node:  2  Active: Yes  Forward: Yes  Provider:  1  Comment: "Node 2"
     (dbname=pagila2 host=myfreebsd2 user=postgres)

The script has reported the Slony, PostgreSQL, and psql versions, the Slony schema name, and shows that there's only one set, replicated from the master node to one slave node, including connection information for each node. Here is the output of the same action, run against the complex slony setup. Notice that node 3 has node 2 as its provider, not node 1:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd
Slony version: 1.2.16
psql version: 8.3
Postgres version: 8.3.7
Slony schema: _pagila
Local node: 1
SET 1: All pagila tables
* Master node: 1  Active: Yes  PID: 3764  Comment: "Cluster node 1"
  (dbname=pagila1 host=myfreebsd  user=postgres)
  ** Slave node:  2  Active: Yes  Forward: Yes  Provider:  1  Comment: "Cluster node 2"
     (dbname=pagila2 host=myfreebsd2 user=postgres)
  ** Slave node:  3  Active:  No  Forward: Yes  Provider:  2  Comment: "Cluster node 3"
     (dbname=pagila3 host=myfreebsd2 user=postgres)
  ** Slave node:  4  Active: Yes  Forward: Yes  Provider:  1  Comment: "Cluster node 4"
     (dbname=pagila4 host=myfreebsd  user=postgres)

This is a simple way to get an idea of how a Slony cluster is organized. Again, we can get all this without downtime or any impact on the Slony cluster.

Creating Slonik Scripts Automatically

Slony gets its configuration entirely through scripts passed to an application called Slonik, which writes configuration entries into a Slony schema within a replicated database. At least as far as I know, however, Slony doesn't provide a way to regenerate those scripts based on the contents of that schema. The slony_migrator.pl script will do that for you with the --slonik option. For example, here is the Slonik script it generates for the simple configuration:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd --slonik
CLUSTER NAME = pagila;
NODE 1 ADMIN CONNINFO = 'dbname=pagila1 host=myfreebsd user=postgres';
NODE 2 ADMIN CONNINFO = 'dbname=pagila2 host=myfreebsd2 user=postgres';
INIT CLUSTER (ID = 1, COMMENT = 'Cluster node 1');
STORE NODE (ID = 2, EVENT NODE = 1, COMMENT = 'Node 2');
STORE PATH (SERVER = 1, CLIENT = 2, CONNINFO = 'dbname=pagila1 host=myfreebsd user=postgres', CONNRETRY = 10);
STORE PATH (SERVER = 2, CLIENT = 1, CONNINFO = 'dbname=pagila2 host=myfreebsd2 user=postgres', CONNRETRY = 10);
ECHO 'Please start up replication nodes here';
TRY {
    CREATE SET (ID = 1, ORIGIN = 1, COMMENT = 'All pagila tables');
} ON ERROR {
    EXIT -1;
}
SET ADD TABLE (ID = 6, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.customer', KEY = 'customer_pkey', COMMENT = 'public.customer');
SET ADD TABLE (ID = 11, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.language', KEY = 'language_pkey', COMMENT = 'public.language');
--- snip ---
SET ADD SEQUENCE (ID = 13, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.store_store_id_seq', COMMENT = 'public.store_store_id_seq');
SET ADD SEQUENCE (ID = 10, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.payment_payment_id_seq', COMMENT = 'public.payment_payment_id_seq');
SET ADD SEQUENCE (ID = 5, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.country_country_id_seq', COMMENT = 'public.country_country_id_seq');
SUBSCRIBE SET (ID = 1, PROVIDER = 1, RECEIVER = 2, FORWARD = YES);

The pagila database contains many tables and sequences, and I've removed the repetitive commands to tell Slony about all of them, for the sake of brevity, but in its original form, the code above would rebuild the simple Slony cluster exactly, and can be very useful for getting an idea of how an otherwise unknown cluster is configured. I won't promise the Slonik code is ideal, but it does recreate a working cluster. The more complex Slonik output is very similar, differing only in how the sets are subscribed. Here I'll show only the major differences, which are the commands required to create the more complex Slony subscription scheme. In the downloadable script package I mentioned above, this subscription code is somewhat more complex, specifically because Slony won't let you subscribe node 3 to updates from node 2 until node 2 is fully subscribed itself. The slony_migrator.pl script isn't smart enough on its own to add necessary WAIT FOR EVENT Slonik commands, but it does get most of the code right, and, importantly, creates the subscriptions in the proper order.

SET ADD SEQUENCE (ID = 10, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.payment_payment_id_seq', COMMENT = 'public.payment_payment_id_seq');
SET ADD SEQUENCE (ID = 5, ORIGIN = 1, SET ID = 1, FULLY QUALIFIED NAME = 'public.country_country_id_seq', COMMENT = 'public.country_country_id_seq');
SUBSCRIBE SET (ID = 1, PROVIDER = 1, RECEIVER = 4, FORWARD = YES);
SUBSCRIBE SET (ID = 1, PROVIDER = 1, RECEIVER = 2, FORWARD = YES);
SUBSCRIBE SET (ID = 1, PROVIDER = 2, RECEIVER = 3, FORWARD = YES);
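The ordering requirement is easy to state: a node can only act as a provider once it has been subscribed itself. Here is a minimal Python sketch of that idea. It is not how slony_migrator.pl is implemented (that script is Perl, and I'm not reproducing its logic); order_subscriptions and to_slonik are hypothetical helper names.

```python
def order_subscriptions(origin, subscriptions):
    """Order (provider, receiver) pairs so every provider is subscribed first.

    `origin` owns the set and needs no subscription of its own.
    Raises ValueError if a receiver can never be reached from the origin.
    """
    subscribed = {origin}
    pending = list(subscriptions)
    ordered = []
    while pending:
        # Pairs whose provider already has the set can be issued now.
        ready = [pair for pair in pending if pair[0] in subscribed]
        if not ready:
            raise ValueError(f"unreachable subscriptions: {pending}")
        for provider, receiver in ready:
            ordered.append((provider, receiver))
            subscribed.add(receiver)
        pending = [pair for pair in pending if pair not in ready]
    return ordered


def to_slonik(set_id, ordered):
    """Render ordered pairs in the same form as the generated script above."""
    return [
        f"SUBSCRIBE SET (ID = {set_id}, PROVIDER = {provider}, "
        f"RECEIVER = {receiver}, FORWARD = YES);"
        for provider, receiver in ordered
    ]


# Deliberately listed in the wrong order: node 3 depends on node 2.
pairs = [(2, 3), (1, 4), (1, 2)]
for line in to_slonik(1, order_subscriptions(1, pairs)):
    print(line)
```

This only fixes the order of the commands. As noted above, a real script still needs WAIT FOR EVENT commands between the steps, which this sketch does not attempt to generate.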

Migrating Slony Clusters to Bucardo

The final slony_migrator.pl option will create a set of bucardo_ctl commands to create a Bucardo cluster to match an existing Slony setup. Although Bucardo can be configured by directly modifying its configuration database, a great deal of work of late has gone into making configuration easier through the bucardo_ctl program. Here's the output from slony_migrator.pl on the simple Slony cluster. Note the --bucardo command-line option, which invokes this function:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd --bucardo
./bucardo_ctl add db pagila_1 dbname=pagila1  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_2 dbname=pagila2  host=myfreebsd2 user=postgres
./bucardo_ctl add table public.customer db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.language db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.store db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.category db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.film db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
--- snip ---
./bucardo_ctl add sequence public.city_city_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sequence public.store_store_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sequence public.payment_payment_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sequence public.country_country_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sync pagila_set1_node1_to_node2 source=pagila_node1_set1 targetdb=pagila_2 type=pushdelta

The Bucardo model of a replication system differs from Slony's, but the two match fairly closely, especially for a simple scenario like this one. slony_migrator.pl also works for the more complex Slony example I've been using, shown here:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd --bucardo
./bucardo_ctl add db pagila_1 dbname=pagila1  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_4 dbname=pagila4  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_3 dbname=pagila3  host=myfreebsd2 user=postgres
./bucardo_ctl add db pagila_2 dbname=pagila2  host=myfreebsd2 user=postgres
./bucardo_ctl add table public.customer db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.language db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.store db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
--- snip ---
./bucardo_ctl add sequence public.payment_payment_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sequence public.country_country_id_seq db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add sync pagila_set1_node1_to_node4 source=pagila_node1_set1 targetdb=pagila_4 type=pushdelta
./bucardo_ctl add sync pagila_set1_node1_to_node2 source=pagila_node1_set1 targetdb=pagila_2 type=pushdelta target_makedelta=on
./bucardo_ctl add table public.customer db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add table public.language db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add table public.store db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
--- snip ---
./bucardo_ctl add sequence public.store_store_id_seq db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add sequence public.payment_payment_id_seq db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add sequence public.country_country_id_seq db=pagila_2 ping=true standard_conflict=source herd=pagila_node2_set1
./bucardo_ctl add sync pagila_set1_node2_to_node3 source=pagila_node2_set1 targetdb=pagila_3 type=pushdelta

I mentioned the Bucardo data model differs from that of Slony. Slony contains a set of tables and sequences in a "set", and that Slony set remains a distinct object on all databases where those objects are found. Bucardo, on the other hand, has a concept of a "sync", which is a replication job from one database to one or more slaves (here I'm talking only about master->slave syncs, and ignoring for purposes of this post Bucardo's ability to do multi-master replication). This makes the setup slightly different for the more complex Slony scenario, in that whereas Slony has one set and different subscriptions, in Bucardo I need to define the tables and sequences involved in each of three syncs: one from node 1 to node 2, one from node 1 to node 4, and one from node 2 to node 3. I also need to turn on Bucardo's "makedelta" option for the node 1 -> node 2 sync, which is the Bucardo equivalent of the Slony FORWARD subscription option.
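To make the mapping concrete, here is a rough Python sketch that turns Slony-style (provider, receiver) pairs into one sync description each, and switches on makedelta for any target that in turn feeds another sync. The naming pattern is copied from the slony_migrator.pl output above; the function bucardo_syncs and the dictionary layout are assumptions for illustration only, not part of Bucardo.

```python
def bucardo_syncs(cluster, set_id, subscriptions):
    """Map Slony (provider, receiver) pairs to one sync description each."""
    # Any node that provides data to another node is a cascade point.
    providers = {provider for provider, _ in subscriptions}
    syncs = []
    for provider, receiver in subscriptions:
        syncs.append({
            "name": f"{cluster}_set{set_id}_node{provider}_to_node{receiver}",
            "herd": f"{cluster}_node{provider}_set{set_id}",
            "targetdb": f"{cluster}_{receiver}",
            # A target that feeds another sync must record its own changes.
            "makedelta": receiver in providers,
        })
    return syncs


for sync in bucardo_syncs("pagila", 1, [(1, 4), (1, 2), (2, 3)]):
    print(sync)
```

For the complex example this yields three syncs, with makedelta set only on the node 1 to node 2 sync, which matches the generated bucardo_ctl commands.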

Migrating from Slony to Bucardo

This post is getting long, but for the sake of demonstration let's show a migration from Slony to Bucardo, using the more complex Slony example. First, I'll create a blank database, and install Bucardo in it:

josh@eddie:~/devel/bucardo$ createdb bucardo
josh@eddie:~/devel/bucardo$ ./bucardo_ctl install
This will install the bucardo database into an existing Postgres cluster.
Postgres must have been compiled with Perl support,
and you must connect as a superuser

We will create a new superuser named 'bucardo',
and make it the owner of a new database named 'bucardo'

Current connection settings:
1. Host:          /tmp
2. Port:          5432
3. User:          postgres
4. Database:      postgres
5. PID directory: /var/run/bucardo
Enter a number to change it, P to proceed, or Q to quit: 

I'll make the necessary configuration changes, and run the installation by following the simple menu.

Current connection settings:
1. Host:          /tmp
2. Port:          5432
3. User:          postgres
4. Database:      bucardo
5. PID directory: /home/josh/devel/bucardo/pid
Enter a number to change it, P to proceed, or Q to quit: p

Postgres version is: 8.3
Attempting to create and populate the bucardo database and schema
Database creation is complete

Connecting to database 'bucardo' as user 'bucardo'
Updated configuration setting "piddir"
Installation is now complete.

If you see any unexpected errors above, please report them to bucardo-general@bucardo.org

You should probably check over the configuration variables next, by running:
./bucardo_ctl show all
Change any setting by using: ./bucardo_ctl set foo=bar

Now I'll use slony_migrator.pl to get a set of bucardo_ctl scripts to build my Bucardo cluster:

josh@eddie:~/devel/bucardo/scripts$ ./slony_migrator.pl -db pagila1 -H myfreebsd --bucardo > pagila-slony2bucardo.sh
josh@eddie:~/devel/bucardo/scripts$ head pagila-slony2bucardo.sh 
./bucardo_ctl add db pagila_1 dbname=pagila1  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_4 dbname=pagila4  host=myfreebsd user=postgres
./bucardo_ctl add db pagila_3 dbname=pagila3  host=myfreebsd2 user=postgres
./bucardo_ctl add db pagila_2 dbname=pagila2  host=myfreebsd2 user=postgres
./bucardo_ctl add table public.customer db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.language db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.store db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.category db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.film db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1
./bucardo_ctl add table public.film_category db=pagila_1 ping=true standard_conflict=source herd=pagila_node1_set1

I'll run the script...

josh@eddie:~/devel/bucardo$ sh scripts/pagila-slony2bucardo.sh
Added database "pagila_1"   
Added database "pagila_4"   
Added database "pagila_3"   
Added database "pagila_2"   
Created herd "pagila_node1_set1"
Added table "public.customer"
Added table "public.language"
Added table "public.store"
--- snip ---
Added sequence "public.store_store_id_seq"
Added sequence "public.payment_payment_id_seq"
Added sequence "public.country_country_id_seq"
Added sync "pagila_set1_node1_to_node4"
Added sync "pagila_set1_node1_to_node2"
Created herd "pagila_node2_set1"
Added table "public.customer"
Added table "public.language"
Added table "public.store"
--- snip ---
Added sequence "public.store_store_id_seq"
Added sequence "public.payment_payment_id_seq"
Added sequence "public.country_country_id_seq"
Added sync "pagila_set1_node2_to_node3"

Now all that's left is to shut down Slony (I just use the "pkill slon" command on each database server), start Bucardo, and, eventually, remove the Slony schemas. Note that Bucardo runs only on one machine (which in this case isn't either of the database servers I'm using for this demonstration -- Bucardo can run effectively anywhere you want).

josh@eddie:~/devel/bucardo$ ./bucardo_ctl start
Checking for existing processes
Removing /home/josh/devel/bucardo/pid/fullstopbucardo
Starting Bucardo
josh@eddie:~/devel/bucardo$ tail -f log.bucardo 
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.actor_actor_id_seq to value of 202, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.city_city_id_seq to value of 600, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.store_store_id_seq to value of 2, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.payment_payment_id_seq to value of 32098, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Setting sequence public.country_country_id_seq to value of 109, is_called is 1
[Mon Feb  1 21:45:27 2010]  KID Total delta count: 0
[Mon Feb  1 21:45:27 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node1_to_node2_pagila_2" from 22961
[Mon Feb  1 21:45:27 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node1_to_node2"
[Mon Feb  1 21:45:27 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node1_to_node4_pagila_4" from 22962
[Mon Feb  1 21:45:27 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node1_to_node4"

Based on those logs, it looks like everything's running fine, but just to make sure, I'll use bucardo_ctl's "list syncs" and "status" commands:

josh@eddie:~/devel/bucardo$ ./bucardo_ctl list syncs
Sync: pagila_set1_node1_to_node2  (pushdelta)  pagila_node1_set1 =>  pagila_2  (Active)
Sync: pagila_set1_node1_to_node4  (pushdelta)  pagila_node1_set1 =>  pagila_4  (Active)
Sync: pagila_set1_node2_to_node3  (pushdelta)  pagila_node2_set1 =>  pagila_3  (Active)

josh@eddie:~/devel/bucardo$ ./bucardo_ctl status
Days back: 3  User: bucardo  Database: bucardo  Host: /tmp  PID of Bucardo MCP: 22936
Name                       Type  State PID   Last_good Time  I/U/D Last_bad Time
==========================+=====+=====+=====+=========+=====+=====+========+====
pagila_set1_node1_to_node2| P   |idle |22952|52s      |0s   |0/0/0|unknown |    
pagila_set1_node1_to_node4| P   |idle |22953|52s      |0s   |0/0/0|unknown |    
pagila_set1_node2_to_node3| P   |idle |22954|52s      |0s   |0/0/0|unknown |    

Everything looks good. Before I test that data are really replicated correctly, I'll issue a "DROP SCHEMA _pagila CASCADE" command in each database, which I can do while Bucardo's running. If this were a production system, the best strategy (to avoid things getting replicated twice) would be to stop all applications, stop Slony, start Bucardo, and start the applications again, though because Slony and Bucardo both replicate rows using primary keys, doing otherwise wouldn't cause duplicated data.

Finally, I'll tail the Bucardo logs while inserting rows in the pagila1 database, to see what happens. These rows tell me it's working:

[Mon Feb  1 21:55:42 2010]  KID Setting sequence public.payment_payment_id_seq to value of 32098, is_called is 1
[Mon Feb  1 21:55:42 2010]  KID Setting sequence public.inventory_inventory_id_seq to value of 4581, is_called is 1
[Mon Feb  1 21:55:42 2010]  KID Setting sequence public.country_country_id_seq to value of 109, is_called is 1
[Mon Feb  1 21:55:42 2010]  KID Total delta count: 1
[Mon Feb  1 21:55:42 2010]  KID Deleting rows from public.actor
[Mon Feb  1 21:55:42 2010]  KID Begin COPY to public.actor
[Mon Feb  1 21:55:42 2010]  KID End COPY to public.actor
[Mon Feb  1 21:55:42 2010]  KID Pushdelta counts: deletes=0 inserts=1
[Mon Feb  1 21:55:42 2010]  KID Updating bucardo_track for public.actor on pagila_1
...
[Mon Feb  1 21:55:43 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node1_to_node4_pagila_4" from 22962
[Mon Feb  1 21:55:43 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node1_to_node4"
[Mon Feb  1 21:55:43 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node1_to_node2_pagila_2" from 22961
[Mon Feb  1 21:55:43 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node1_to_node2"

In this case I need to "kick" the node 2 -> node 3 sync to get it to replicate, but I could configure the sync with a timeout so that it happens automatically. Once I kick it, I get log messages for that sync as well.

[Mon Feb  1 22:00:34 2010]  CTL Got notice "bucardo_syncdone_pagila_set1_node2_to_node3_pagila_3" from 22963
[Mon Feb  1 22:00:34 2010]  CTL Sent notice "bucardo_syncdone_pagila_set1_node2_to_node3"

Please consider giving slony_migrator.pl a try. I'd be glad to hear how it works out.

PostgreSQL tip: using pg_dump to extract a single function

A common task that comes up in PostgreSQL is the need to dump/edit a specific function. While ideally, you're using DDL files and version control (hello, git!) to manage your schema, you don't always have the luxury of working in such a controlled environment. Recent versions of psql have the \ef command to edit a function from within your favorite editor, but this is available from version 8.4 onward only.

An alternate approach is to use the following invocation:

  pg_dump -Fc -s | pg_restore -P 'funcname(args)'

The -s flag is the short form of --schema-only; i.e., we don't care about wasting time/space with the data. -P tells pg_restore to extract the function with the following signature.

As always, there are some caveats: the function name must be spelled out explicitly using the full types as they occur in the dump's custom format (i.e., you must use 'foo_func(integer)' instead of 'foo_func(int)'). You can always see a list of all of the available functions by using the command:

  pg_dump -Fc -s | pg_restore -l | grep FUNCTION
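Since the signature has to use the spelled-out type names, a small helper can save a round trip through the listing. The Python sketch below rewrites a few common aliases and builds the two commands; the alias table is deliberately incomplete, the function names are hypothetical, and the database name is whatever you pass in. One thing to watch: PostgreSQL 12 and later require "-f -" for pg_restore to write to standard output, so that is offered as an option.

```python
import re
import subprocess

# A few common aliases and the names PostgreSQL prints for them (not exhaustive).
CANONICAL_TYPES = {
    "int": "integer",
    "int4": "integer",
    "int2": "smallint",
    "int8": "bigint",
    "bool": "boolean",
    "varchar": "character varying",
    "float4": "real",
    "float8": "double precision",
}


def canonical_type(type_name):
    """Map an alias such as 'int' or 'int8[]' to its spelled-out form."""
    suffix = ""
    while type_name.endswith("[]"):
        type_name, suffix = type_name[:-2].rstrip(), suffix + "[]"
    return CANONICAL_TYPES.get(type_name.lower(), type_name) + suffix


def canonical_signature(signature):
    """Rewrite 'foo_func(int)' as 'foo_func(integer)'."""
    match = re.fullmatch(r"\s*([^()]+?)\s*\((.*)\)\s*", signature)
    if match is None:
        raise ValueError(f"not a function signature: {signature!r}")
    name, args = match.groups()
    types = [canonical_type(arg.strip()) for arg in args.split(",") if arg.strip()]
    return f"{name}({', '.join(types)})"


def extraction_commands(dbname, signature, explicit_stdout=False):
    """Build the pg_dump and pg_restore argument lists for one function."""
    dump = ["pg_dump", "-Fc", "-s", dbname]
    restore = ["pg_restore", "-P", canonical_signature(signature)]
    if explicit_stdout:
        restore += ["-f", "-"]  # needed on PostgreSQL 12 and later
    return dump, restore


def dump_function(dbname, signature, explicit_stdout=False):
    """Run pg_dump | pg_restore and return the function definition as text."""
    dump, restore = extraction_commands(dbname, signature, explicit_stdout)
    producer = subprocess.Popen(dump, stdout=subprocess.PIPE)
    result = subprocess.run(restore, stdin=producer.stdout,
                            capture_output=True, text=True)
    producer.stdout.close()
    producer.wait()
    return result.stdout


print(canonical_signature("foo_func(int)"))
```

Only dump_function touches a real database; the other helpers are plain string handling, so they can be tried without a server.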

Slony: Cascading Subscriptions

Sometimes you run into a situation where you need to replicate one dataset to many machines in multiple data centers, with different costs associated with sending to each (either real costs as in bandwidth, or virtual costs as in the amount of time it takes to transmit to each machine). Defining a Slony cluster to handle this is easy, as you can specify the topology and paths taken to replicate any changes.

    Basic topology:
  • Data center A, with machines A1, A2, A3, and A4.
  • Data center B, with machines B1, B2, B3, and B4.
  • Data center C, with machines C1, C2, C3, and C4.


Figure 1: Non-cascaded slony replication nodes/pathways.

Node A1 is the master, which propagates its changes to all other machines. In the simple setup, A1 would push all of its changes to each node. However, if data centers B and C have high costs associated with transfer to the nodes, you end up transferring 4x the data needed for each data center. (We are assuming that traffic on the local subnet at each data center is cheap and fast.)

The basic idea, then, is to push the changes only once to each data center, and let the "master" machine in that data center push the changes out to the others. This reduces traffic from the master to each data center, and removes the other costs associated with pushing to every node.
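The saving is simple arithmetic, sketched here in Python. The function wan_copies is a hypothetical name, and the model is deliberately crude: it only counts how many copies of each change cross between data centers, using the three four-machine data centers from the topology above.

```python
def wan_copies(datacenters, master_dc, cascaded):
    """Count copies of each change that leave the master's data center.

    `datacenters` maps a data center name to its number of machines.
    """
    remote = {dc: count for dc, count in datacenters.items() if dc != master_dc}
    if cascaded:
        # One copy per remote data center; its local "master" fans it out.
        return len(remote)
    # Without cascading, every remote machine gets its own copy.
    return sum(remote.values())


topology = {"A": 4, "B": 4, "C": 4}
print("non-cascaded:", wan_copies(topology, "A", cascaded=False))
print("cascaded:    ", wan_copies(topology, "A", cascaded=True))
```

With this topology the non-cascaded layout sends eight copies across the expensive links and the cascaded one sends two, which is the 4x difference per data center mentioned above.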


Figure 2: Cascaded slony replication nodes/pathways

Let's look at an example configuration:

cluster_init.sh:
    #!/bin/bash

    # admin node definitions and other slony-related information are
    # stored in our preamble file.  This will define the $PREAMBLE
    # environment variable that contains basic information common to all
    # Slony-related scripts, such as slony cluster name, the nodes
    # present, and how to reach them to install slony, etc.

    . slony_preamble.sh

    slonik <<EOF
    $PREAMBLE

    init cluster ( id = 1, comment = 'A1' );

    store node (id=2,  comment = 'A2', event node=1);
    store node (id=3,  comment = 'A3', event node=1);
    store node (id=4,  comment = 'A4', event node=1);
    store node (id=5,  comment = 'B1', event node=1);
    store node (id=6,  comment = 'B2', event node=1);
    store node (id=7,  comment = 'B3', event node=1);
    store node (id=8,  comment = 'B4', event node=1);
    store node (id=9,  comment = 'C1', event node=1);
    store node (id=10, comment = 'C2', event node=1);
    store node (id=11, comment = 'C3', event node=1);
    store node (id=12, comment = 'C4', event node=1);

    # In each store path, conninfo is the connection string that the
    # client node uses to reach the server node.

    # pathways from A1 -> A2, A3, A4 and back
    store path (server = 1, client = 2, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 1, client = 3, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 1, client = 4, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 2, client = 1, conninfo = 'dbname=data host=node2.datacenter-a.com');
    store path (server = 3, client = 1, conninfo = 'dbname=data host=node3.datacenter-a.com');
    store path (server = 4, client = 1, conninfo = 'dbname=data host=node4.datacenter-a.com');

    # pathway from A1 -> B1 and back
    store path (server = 1, client = 5, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 5, client = 1, conninfo = 'dbname=data host=node1.datacenter-b.com');

    # pathways from B1 -> B2, B3, B4 and back
    store path (server = 5, client = 6, conninfo = 'dbname=data host=node1.datacenter-b.com');
    store path (server = 5, client = 7, conninfo = 'dbname=data host=node1.datacenter-b.com');
    store path (server = 5, client = 8, conninfo = 'dbname=data host=node1.datacenter-b.com');
    store path (server = 6, client = 5, conninfo = 'dbname=data host=node2.datacenter-b.com');
    store path (server = 7, client = 5, conninfo = 'dbname=data host=node3.datacenter-b.com');
    store path (server = 8, client = 5, conninfo = 'dbname=data host=node4.datacenter-b.com');

    # pathway from A1 -> C1 and back
    store path (server = 1, client = 9, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 9, client = 1, conninfo = 'dbname=data host=node1.datacenter-c.com');

    # pathways from C1 -> C2, C3, C4 and back
    store path (server = 9, client = 10, conninfo = 'dbname=data host=node1.datacenter-c.com');
    store path (server = 9, client = 11, conninfo = 'dbname=data host=node1.datacenter-c.com');
    store path (server = 9, client = 12, conninfo = 'dbname=data host=node1.datacenter-c.com');
    store path (server = 10, client = 9, conninfo = 'dbname=data host=node2.datacenter-c.com');
    store path (server = 11, client = 9, conninfo = 'dbname=data host=node3.datacenter-c.com');
    store path (server = 12, client = 9, conninfo = 'dbname=data host=node4.datacenter-c.com');

    EOF

As you can see in the initialization script, we're defining the basic topology for the cluster. We're defining each individual node, and the paths that slony will use to communicate events and other status. Since slony needs to communicate status both ways, we need to define the paths for each node's edge both ways. In particular, we've defined pathways from A1 to each of the other A nodes, A1 to B1 and C1, and B1 and C1 to each of their respective nodes.
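Writing out both directions of every edge by hand is tedious and easy to get wrong, so it can help to generate the store path commands from a description of the topology. The Python sketch below is one way to do that. It assumes, following the Slonik documentation for STORE PATH, that conninfo is the connection string the client node uses to reach the server node; the function name store_paths and the host naming scheme are taken from or modeled on the example and are otherwise hypothetical.

```python
def store_paths(hosts, edges, dbname="data"):
    """Return slonik 'store path' lines for both directions of each edge.

    hosts maps node id -> host name; edges is a list of (node, node) pairs.
    """
    lines = []
    for a, b in edges:
        for server, client in ((a, b), (b, a)):
            # conninfo points at the server, since the client connects to it.
            lines.append(
                f"store path (server = {server}, client = {client}, "
                f"conninfo = 'dbname={dbname} host={hosts[server]}');"
            )
    return lines


# Nodes 1-4 live in data center A, 5-8 in B, and 9-12 in C.
hosts = {}
node_id = 1
for dc in "abc":
    for machine in range(1, 5):
        hosts[node_id] = f"node{machine}.datacenter-{dc}.com"
        node_id += 1

edges = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 9),
         (5, 6), (5, 7), (5, 8), (9, 10), (9, 11), (9, 12)]

for line in store_paths(hosts, edges):
    print(line)
```

Eleven edges give 22 store path commands, the same number as in cluster_init.sh.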

Now it's a matter of defining the replication sets and describing the subscriptions for each. We will use something like the following for our script:

cluster_define_set1.sh:
    #!/bin/bash

    # reusing our standard cluster information
    . slony_preamble.sh

    slonik <<EOF
    $PREAMBLE

    create set ( id = 1, origin = 1, comment = 'set 1' );

    set add table ( set id = 1, origin = 1, id = 1, fully qualified name = 'public.table1');
    set add table ( set id = 1, origin = 1, id = 2, fully qualified name = 'public.table2');
    set add table ( set id = 1, origin = 1, id = 3, fully qualified name = 'public.table3');

    EOF

Here we've defined the tables that we want replicated from A1 to the entire cluster; there is nothing specific to this particular scenario that we need to consider.

cluster_subscribe_set1.sh:
    #!/bin/bash

    # reusing our standard cluster information
    . slony_preamble.sh

    slonik <<EOF
    $PREAMBLE

    # define our forwarding subscriptions (i.e., A1 -> B1, C1)
    subscribe set ( id = 1, provider = 1, receiver = 5, forward = yes);
    subscribe set ( id = 1, provider = 1, receiver = 9, forward = yes);

    # define the subscriptions for each of the datacenter sets
    # A1 -> A2, A3, A4
    subscribe set ( id = 1, provider = 1, receiver = 2, forward = no);
    subscribe set ( id = 1, provider = 1, receiver = 3, forward = no);
    subscribe set ( id = 1, provider = 1, receiver = 4, forward = no);

    # B1 -> B2, B3, B4
    subscribe set ( id = 1, provider = 5, receiver = 6, forward = no);
    subscribe set ( id = 1, provider = 5, receiver = 7, forward = no);
    subscribe set ( id = 1, provider = 5, receiver = 8, forward = no);

    # C1 -> C2, C3, C4
    subscribe set ( id = 1, provider = 9, receiver = 10, forward = no);
    subscribe set ( id = 1, provider = 9, receiver = 11, forward = no);
    subscribe set ( id = 1, provider = 9, receiver = 12, forward = no);

    EOF

The key point here is that the provider and receiver nodes in each subscription determine the path replication takes. For the subscription of any cascade point (i.e., B1 and C1), you need the 'forward = yes' parameter to ensure that the events properly cascade to the sub-nodes. In every other node's subscription, you should set 'forward = no'.

In actual deployment of this setup, you would want to wait for the subscription from A1 -> B1 and A1 -> C1 to complete successfully before subscribing the sub-nodes. Additionally, this solution assumes high availability between nodes and does not address failure of particular machines; in particular, A1, B1, and C1 are key to maintaining the full replication.
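The forward flag and the subscription order both follow from the topology, so they can be derived rather than typed. Here is a small Python sketch under that assumption: given a map of receiver to provider, it emits subscribe set commands with cascade points first and marks a node as forwarding only if something subscribes through it. The name subscribe_commands is hypothetical, and the sketch does not emit the waits between steps that a real deployment needs.

```python
from collections import deque


def subscribe_commands(set_id, origin, providers):
    """Build 'subscribe set' lines from a {receiver: provider} map."""
    forwarding = set(providers.values())  # nodes that feed other nodes
    children = {}
    for receiver, provider in providers.items():
        children.setdefault(provider, []).append(receiver)

    lines = []
    queue = deque([origin])
    while queue:
        provider = queue.popleft()
        # Cascade points first, then leaf nodes, each group in node order.
        receivers = sorted(children.get(provider, []),
                           key=lambda node: (node not in forwarding, node))
        for receiver in receivers:
            forward = "yes" if receiver in forwarding else "no"
            lines.append(
                f"subscribe set ( id = {set_id}, provider = {provider}, "
                f"receiver = {receiver}, forward = {forward});"
            )
            queue.append(receiver)
    return lines


providers = {2: 1, 3: 1, 4: 1, 5: 1, 9: 1,
             6: 5, 7: 5, 8: 5, 10: 9, 11: 9, 12: 9}
for line in subscribe_commands(1, 1, providers):
    print(line)
```

For the twelve-node example this reproduces the eleven commands in cluster_subscribe_set1.sh, in the same order.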

Postgres: Hello git, goodbye CVS

It looks like 2010 *might* be the year that Postgres officially makes the jump to git. Currently, the project uses CVS, with a script that moves things to the now canonical Postgres git repo at git.postgresql.org. This script has been causing problems, and is still continuing to do so, as CVS is not atomic. Once the project flips over, CVS will still be available, but CVS will be the slave and git the master, to put things in database terms. The conversion from git to CVS is trivial compared to the other way around, so there is no reason Postgres cannot continue to offer CVS access to the code for those unwilling or unable to use git.

On that note, I'm happy to see that the number of developers and committers who are using git - and publicly stating their happiness with doing so - has grown sharply in the last couple of years. Peter Eisentraut (with some help from myself) set up git.postgresql.org in 2008, but interest at that time was not terribly high, and there was still a lingering question of whether git was really the replacement for CVS, or if it would be some other version control system. There is little doubt now that git is going to win. Not only for the Postgres project, but across the development world in general (both open and closed source).

To drive the point home, Andrew has announced he is working on git integration with the Postgres build farm. Of course, I submitted a patch to do just that back in March 2008, but I was ahead of my time :). Besides, mine was a simple proof of concept, while it sounds like Andrew is actually going to do it the right way. Go Andrew!

Of all the projects I work on, the great majority are using git now. We've been using git at End Point as our preferred VCS for both internal projects and client work for a while now, and are very happy with our choice. There is only one other project I work on besides Postgres that uses CVS, but it's a small project. I don't know of any other project of Postgres' size that is still using CVS (anyone know of any?). Even emacs recently switched away from CVS, although they went with bazaar instead of git for some reason. Subversion is still being used by a substantial minority of the projects I'm involved with, mostly due to the historical fact that there was a window of time in which CVS was showing its limitations, but subversion was the only viable option. Sure would be nice if perl.org would offer git for Perl modules, as they do for subversion currently (/hint). Finally, there are a few of my projects that use something else (mercurial, monotone, etc.). Overall, git accounts for the lion's share of all my projects, and I'm very happy about that. There is a very steep learning curve with git, but the effort is well worth it.

If you want to try out git with the Postgres project, first start by installing git. Unfortunately, git is still new enough, and actively developed enough, that it may not be available on your distro's packaging system, or worse, the version available may be too old to be useful. Anything older than 1.5 should *not* be used, period, and 1.6 is highly preferred. I'd recommend taking the trouble to install from source if your packaged git is older than 1.6. Once installed, here's the command to clone the Postgres repo:

git clone git://git.postgresql.org/git/postgresql.git postgres

This step may take a while, as git is basically putting the entire Postgres project on your computer - history and all! It took me three and a half minutes to run, but your time may vary.

Once that is done, you'll have a directory named "postgres". Change to it, and you can now start poking around in the code, just like CVS, but without all the ugly CVS directories. :)

For more information, check out the "Working with git" page on the Postgres wiki.

Here's to 2010 being the year Postgres finally abandons CVS!

Splitting Postgres pg_dump into pre and post data files

I've just released a small Perl script that has helped me solve a specific problem with Postgres dump files. When you use pg_dump or pg_dumpall, it outputs things in the following order, per database:

  1. schema creation commands (e.g. CREATE TABLE)
  2. data loading command (e.g. COPY tablename FROM STDIN)
  3. post-data schema commands (e.g. CREATE INDEX)

The problem is that using the --schema-only flag outputs the first and third sections into a single file. Hence, if you load the file and then load a separate --data-only dump, it can be very slow, as all the constraints, indexes, and triggers are already in place. The split_postgres_dump script breaks the dump file into two segments, a "pre" and a "post". (It doesn't handle a file with a data section yet, only a --schema-only version.)
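To show the general idea of such a split, here is a minimal Python sketch. It is not the split_postgres_dump script (which is written in Perl and may work quite differently); it simply relies on the "-- Name: ...; Type: ...;" comment headers that pg_dump writes in plain-text dumps, and the list of post-data object types is my own assumption.

```python
import re

# Object types treated as "post-data" here; an assumed list for illustration.
POST_DATA_TYPES = {"INDEX", "CONSTRAINT", "FK CONSTRAINT", "TRIGGER", "RULE"}

# pg_dump precedes each object with a comment such as:
# -- Name: actor_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
HEADER = re.compile(r"^-- Name: .*?; Type: ([^;]+);")


def split_schema_dump(text):
    """Split a schema-only dump into (pre, post) strings."""
    pre, post = [], []
    current = pre
    for line in text.splitlines(keepends=True):
        match = HEADER.match(line)
        if match:
            # Switch output whenever a new object header appears.
            current = post if match.group(1) in POST_DATA_TYPES else pre
        current.append(line)
    return "".join(pre), "".join(post)


SAMPLE = """\
--
-- Name: actor; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE actor (actor_id integer NOT NULL);

--
-- Name: actor_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY actor ADD CONSTRAINT actor_pkey PRIMARY KEY (actor_id);

--
-- Name: idx_actor_id; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX idx_actor_id ON actor USING btree (actor_id);
"""

pre, post = split_schema_dump(SAMPLE)
print(post)
```

Because the switch happens on the header line itself, the bare "--" line just before a header ends up in the previous section. That is harmless to psql, but it means the output is not perfectly formatted.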

Why would you need to do this instead of just using a full dump? Some reasons I've found include:

  • When you need to load the data more than once, such as debugging a data load error.
  • When you want to stop after the data load step (which you can't do with a full dump)
  • When you need to make adjustments to the schema before the data is loaded (seen quite a bit on major version upgrades)

Usage is simply ./split_postgres_dump.pl yourdumpfile.pg, which will then create two new files, yourdumpfile.pg.pre and yourdumpfile.pg.post. It doesn't produce perfectly formatted files, but it gets the job done!

It's a small script, so it has no bug tracker, git repo, etc. but it does have a small wiki page at http://bucardo.org/wiki/Split_postgres_dump from which you can download the latest version.

Future versions of pg_dump will allow you to break things into pre and post data sections with flags, but until then, I hope somebody finds this script useful.

Update: There is now a git repo:
git clone git://bucardo.org/split_postgres_dump.git

Gathering server information with boxinfo

I've just publicly released another Postgres-related script, this one called "boxinfo". Basically, it gathers information about a box (server), hence the catchy and original name. It outputs the information it finds into an HTML page, or into a MediaWiki formatted page.

The goal of boxinfo is to have a simple, single script that quickly gathers important information about a server into a web page, so that you can get a quick overview of what is installed on the server and how things are configured. It's also useful as a reference page when you are trying to remember which server it was that had Bucardo version 4.5.0 installed and was running pgbouncer.

As we use MediaWiki internally here at End Point (running with a Postgres backend, naturally), the original (and default) format is HTML with some MediaWiki specific items inside of it.

Because it is meant to run on as wide a range of boxes as possible, it's written in Perl. While we've run into a few boxes over the years that did not have Perl installed, the number that lacked any other language you might choose (except perhaps sh) is much greater. It requires no other Perl modules, and simply makes a lot of system calls.

Various information about the box is gathered. System wide things such as mount points, disk space, schedulers, packaging systems are gathered first, along with versions of many common Unix utilities. We also gather information on some programs where more than just the version number is important, such as puppet, heartbeat, and lifekeeper. Of course, we also go into a great amount of detail about all the installed Postgres clusters on the box as well.

The program tries its best to locate every active Postgres cluster on the box, and then gathers information about it, such as where pg_xlog is linked to, any contrib modules installed, any interesting configuration variables from postgresql.conf, the size of each database, and lots of detailed information about any Slony or Bucardo configurations it finds.

The main page for it is on the Bucardo wiki at http://bucardo.org/wiki/Boxinfo. That page details the various command line options and should be considered the canonical documentation for the script. The latest version of boxinfo can be downloaded from that page as well. For any enhancement requests or problems to report, please visit the bug tracker at http://bucardo.org/bugzilla/.

What exactly does the output look like? We've got an example on the wiki showing the sample output from a run against my laptop. Some of the items were removed, but it should give you an idea of what the script can do, particularly with regards to the Postgres information: http://bucardo.org/wiki/Boxinfo/Example

The script is still a little rough, so we welcome any patches, bug reports, requests, or comments. The development version can be obtained by running: git clone git://bucardo.org/boxinfo.git

Postgres Upgrades - Ten Problems and Solutions

Upgrading between major versions of Postgres is a fairly straightforward affair, but Murphy's law often gets in the way. Here at End Point we perform a lot of upgrades, and the following list explains some of the problems that come up, either during the upgrade itself, or afterwards.

When we say upgrade, we mean going from an older major version to a newer major version. We've (recently) migrated client systems as old as 7.2 to as new as 8.4. The canonical way to perform such an upgrade is to simply do:

pg_dumpall -h oldsystem > dumpfile
psql -h newsystem -f dumpfile

The reality can be a little more complicated. Here are the top ten gotchas we've come across, and their solutions. The more common and severe problems are at the top.

1. Removal of implicit casting

Postgres 8.3 removed many of the "implicit casts", meaning that many queries that used to work on previous versions now give an error. This was a pretty severe regression, and while it is technically correct to not have them, the sudden removal of these casts has caused *lots* of problems. Basically, if you are going from any version of PostgreSQL 8.2 or lower to any version 8.3 or higher, expect to run into this problem.

Solution: The best way of course is to "fix your app", which means specifically casting items to the proper datatype, for example writing "123::int" instead of "123". However, it's not always easy to do this - not only can finding and changing all instances across your code base be a huge undertaking, but the problem also exists for some database drivers and other parts of your system that may be out of your direct control. Therefore, the other option is to add the casts back in. Peter Eisentraut posted a list of casts that restore some of the pre-8.3 behavior. Do not just apply them all, but add in the ones that you need. We've found that the first one (integer AS text) solves 99% of our clients' casting issues.

2. Encoding issues (bad data)

Older databases frequently were not careful about their encoding, and ended up using the default "no encoding" mode of SQL_ASCII. Often this was done because nobody was thinking about, or worrying about, encoding issues when the database was first being designed. Flash forward years later, and people want to move to something better than SQL_ASCII such as the now-standard UTF-8. The problem is that SQL_ASCII accepts everything without complaint, and this can cause your migration to fail as the data will not load into the new database with a different encoding. (Also note that even UTF-8 to UTF-8 may cause problems as it was not until Postgres version 8.1 that UTF-8 input was strictly validated.)

Solution: The best remedy is to clean the data on the "old" database and try the migration again. How to do this depends on the nature of the bad data. If it's just a few known rows, manual updates can be done. Otherwise, we usually write a Perl script to search for invalid characters and replace them. Alternatively, you can pipe the data through iconv in the middle of the upgrade. If all else fails, you can always fall back to SQL_ASCII on the new database, but that should really be a last resort.
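
We normally do this search in Perl, but the idea is language-neutral, so here is a minimal Python sketch of it. It assumes the dump is read as raw bytes, one line at a time, and that the bad lines were originally written in a single known legacy encoding (latin-1 is only a guess here; the function names are made up for the example).

```python
def find_invalid_utf8(lines):
    """Yield (line_number, raw_bytes) for lines that are not valid UTF-8."""
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield number, raw

def repair_line(raw, assumed_encoding="latin-1"):
    """Re-decode a bad line under an assumed legacy encoding, return UTF-8 bytes."""
    return raw.decode(assumed_encoding).encode("utf-8")
```

Opening the dump in binary mode (open(path, "rb")) and passing the file object to find_invalid_utf8 gives you the line numbers to inspect. Look at a sample of the offending lines before repairing anything, as a wrong guess about the original encoding will happily produce valid but incorrect UTF-8.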

3. Time

Since the database is almost always an integral part of the business, minimizing the time it is unavailable for use is very important. People tend to underestimate how much time an upgrade can take. (Here we are talking about the actual migration, not the testing, which is a very important step that should not be neglected.) Creating the new database and schema objects is very fast, of course, but the data must be copied row by row, and then all the constraints and indexes created. For large databases with many indexes, the index creation step can take longer than the data import!

Solution: The first step is to do a practice run on hardware as similar as possible to production, to get an idea of how long it will take. If this time period does not comfortably fit within your downtime window (and by comfortable, I mean add 50% to account for Murphy), then another solution is needed. The easiest way is to use a replication system like Bucardo to "pre-populate" the static part of the database, and then the final migration only involves a small percentage of your database. It should also be noted that recent versions of Postgres can speed things up by using the "-j" flag to the pg_restore utility, which allows some of the restore to be done in parallel.
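
The "add 50%" rule of thumb is simple arithmetic, sketched here in Python. The function name and the use of minutes as the unit are arbitrary choices for the example.

```python
def fits_window(practice_minutes, window_minutes, margin=0.5):
    """True if the practice run plus a safety margin fits the downtime window."""
    return practice_minutes * (1 + margin) <= window_minutes
```

For example, a practice run of two hours just barely fits a three-hour window, while a run of two hours and ten minutes does not.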

4. Dependencies

When you upgrade Postgres, you're upgrading the libraries as well, which many other programs (e.g. database drivers) depend on. Therefore, it's important to make sure everything else relying on those libraries still works. If you are installing Postgres with a packaging system, this is usually not a problem as the dependencies are taken care of for you.

Solution: Make sure your test box has all the applications, drivers, cron scripts, etc. that your production box has and make sure that each of them either works with the new version, or has a sane upgrade plan. Note: Postgres may have some hidden indirect dependencies as well. For example, if you are using PL/PerlU, make sure that any external modules used by your functions are installed on the box.

5. Postgres contrib modules

Going from one version of Postgres to another can introduce some serious challenges when it comes to contrib modules. Unfortunately, they are not treated with the same level of care as the Postgres core is. To be fair, most of them will continue to just work, simply by doing a "make install" on the new database before attempting to import. Some modules, however, have functions that no longer exist. Some are not 100% forward compatible, and some even lack important pieces such as uninstall scripts.

Solution: Solving this depends quite a bit on the exact nature of the problem. We've done everything from carefully modifying the --schema-only output, to modifying the underlying C code and recompiling the modules, to removing them entirely and getting the functionality in other ways.

6. Invalid constraints (bad data)

Sometimes when upgrading, we find that the existing constraints are not letting the existing data back in! This can happen for a number of reasons, but basically it means that you have invalid data. This can be mundane (a check constraint is missing a potential value) or more serious (multiple primary keys with the same value).

Solution: The best bet is to fix the underlying problem on the old database. Sometimes this is a few rows, but sometimes (as in a case with multiple identical primary keys), it indicates an underlying hardware problem (e.g. RAM). In the latter case, the damage can be very widespread, and your simple upgrade plan has now turned into a major damage control exercise (but aren't you glad you found such a problem now rather than later?) Detecting and preventing such problems is the topic for another day. :)

7. tsearch2

This is a special case of the contrib module situation mentioned above. The tsearch2 module first appeared in version 7.4, and was moved into the core of Postgres in version 8.3. While there was a good attempt at providing an upgrade path, upgrades can still cause an occasional issue.

Solution: Sometimes the only real solution is to edit the pg_dump output by hand. If you are not using tsearch in that many places (e.g. just a few indexes or columns on a couple tables), you can also simply remove it before the upgrade, then add it back in afterwards.

8. Application behavior

In addition to the implicit casting issues above, applications sometimes have bad behaviors that were tolerated in older versions of Postgres, but now are not. A typical example is writing queries without explicitly naming all of the tables in the "FROM" section.

Solution: As always, fixing the app is the best solution. However, for some things you can also flip a compatibility switch inside of postgresql.conf. In the example above, one would change the "add_missing_from" from its default of 'off' to 'on'. This should be considered an option of last resort, however.

9. System catalogs

Seldom does a major update go by that doesn't see a change in the system catalogs, the low-level meta-data tables used by Postgres to describe everything in the database. Sometimes programs rely on the catalogs looking a certain way.

Solution: Most programs, if they use the system catalogs directly, are careful about it, and upgrading the program version often solves the problem. At other times, we've had to rewrite the program right then and there, either by having it abstract out the information (for example, by using the information_schema views), or (less preferred) by adding conditionals to the code to handle multiple versions of the system catalogs.

10. Embedded data

This is a rare but annoying problem: triggers on a table rely on certain data being in other tables, such that doing a --schema-only dump before a --data-only dump will always fail when importing.

Solution: The easiest way is to simply use pg_dumpall, which loads the schema, then the data, then the constraints and indexes. However, this may not be possible if you have to separate things for other reasons (such as contrib module issues). In this case, you can break the --schema-only pg_dump output into pre and post segments. We have a script that does this for us, but it is also slated to be an option for pg_dump in the future.

That's the list! If you've seen other things, please make a note in the comments. Don't forget to run a database-wide ANALYZE after importing into your new database, as the table statistics are not carried across when using pg_dump.

Postgres SQL Backup Gzip Shrinkage, aka DON'T PANIC!!!

I was investigating a recent Postgres server issue, where we had discovered that one of the RAM modules on the server in question had gone bad. Unsurprisingly, one of the things we looked at was the possibility of having to restore from a SQL dump: if the data directory had been corrupted, a base backup would carry the same corruption we were trying to escape by restoring.

As it was already the middle of the night (does anyone ever have a server emergency during normal business hours?), my investigations were hampered by my lack of sleep.

If there had been some data directory corruption, the pg_dump process would likely fail earlier than in the backup process, and we'd expect the dumps to be truncated; ideally this wasn't the case, as memory testing had not shown the DIMM to be bad, but the sensor had alerted us as well.

I logged into the backup server and looked at the backup dumps; from the alerts that we'd gotten, the memory was flagged bad on January 3. I listed the files, and noticed the following oddity:

 -rw-r--r-- 1 postgres postgres  2379274138 Jan  1 04:33 backup-Jan-01.sql.gz
 -rw-r--r-- 1 postgres postgres  1957858685 Jan  2 09:33 backup-Jan-02.sql.gz

Well, this was disconcerting. The memory event had taken place on the 3rd, but there was a large drop in size of the dumps between January 1st and January 2nd (more than 400MB of *compressed* output, for those of you playing along at home). This indicated that either the memory event took place earlier than recorded, or something somewhat catastrophic had happened to the database; perhaps some large deletion or truncation of some key tables.

Racking my brains, I tried to come up with an explanation: we'd had a recent maintenance window that took place between January 1 and January 2; we'd scheduled a CLUSTER/REINDEX to reclaim some of the bloat which was in the database itself. But this would only reduce the size of the data directory; the amount of live data would have stayed the same or increased modestly.

Obviously we needed to compare the two files in order to determine what had changed between the two days. I tried:

 diff <(zcat backup-Jan-01.sql.gz | head -2300) <(zcat backup-Jan-02.sql.gz | head -2300)

Based on my earlier testing, this was the offset in the SQL dumps which defined the actual schema for the database excluding the data; in particular I was interested to see if there had been (say) any temporarily created tables which had been dropped during the maintenance window. However, this showed only minor changes (updates to default sequence values). It was time to do a full diff of the data to try and see if some of the aforementioned temporary tables had been truncated or if some catastrophic deletion had occurred or...you get the idea. I tried:

 diff <(zcat backup-Jan-01.sql.gz) <(zcat backup-Jan-02.sql.gz)

However, this approach fell down when diff ran out of memory. We decided to unzip the files and manually diff the two files in case it had something to do with the parallel unzips, and here was a mystery; after unzipping the dumps in question, we saw the following:

 -rw-r--r-- 1 root root 10200609877 Jan  8 02:19 backup-Jan-01.sql
 -rw-r--r-- 1 root root 10202928838 Jan  8 02:24 backup-Jan-02.sql

The uncompressed versions of these files showed sizes consistent with slow growth; the Jan 02 backup was slightly larger than the Jan 01 backup. This was really weird! Was there some threshold in gzip where given a particular size file it switched to a different compression algorithm? Had someone tweaked the backup script to gzip with a different compression level? Had I just gone delusional from lack of sleep? Since gzip can operate on streams, the first option seemed unlikely, and something I would have heard about before. I verified that the arguments to gzip in the backup job had not changed, so that took that choice off the table. Which left the last option, but I had the terminal scrollback history to back me up.

We finished the rest of our work that night, but the gzip oddity stuck with me through the next day. I was relating the oddity of it all to a co-worker, when insight struck: since we'd CLUSTERed the table, that meant that similar data (in the form of the tables' multi-part primary keys) had been reorganized to be on the same database pages, so when pg_dump read/wrote out the data in page order, gzip had that much more similarity in the same neighborhood to work with, which resulted in the dramatic decrease in the compressed gzip dumps.
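
The effect is easy to reproduce on synthetic data. The sketch below is only an illustration, not a measurement from the server in question: it builds made-up rows with a two-part key, then compresses the same rows once in key order (as after a CLUSTER) and once in shuffled order. It uses Python's zlib module, which implements the same DEFLATE algorithm that gzip uses; the row layout and the row counts are invented.

```python
import random
import zlib

def make_rows(customers=200, orders=50):
    """Build tab-separated rows keyed by a two-part (customer, order) key."""
    return [
        f"{c:06d}\t{o:04d}\tcustomer-{c:06d}@example.com\n"
        for c in range(customers)
        for o in range(orders)
    ]

def compressed_size(rows):
    """Size in bytes of the rows after zlib (DEFLATE) compression."""
    return len(zlib.compress("".join(rows).encode("ascii")))

clustered = make_rows()        # rows in primary-key order
scattered = clustered[:]       # the very same rows, in arbitrary order
random.Random(42).shuffle(scattered)

print("clustered:", compressed_size(clustered))
print("scattered:", compressed_size(scattered))
```

Both lists hold identical rows, so the uncompressed size is the same; only the compressed size differs, with the key-ordered version coming out smaller because neighboring rows share long runs of bytes.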

So the good news was that CLUSTER will save you space in your SQL dumps as well (if you're compressing), the bad news was that it took an emergency situation and an almost heart-attack for this engineer to figure it all out. Hope I've saved you the trouble... :-)

State of the Postgres project

It's been interesting watching the MySQL drama unfold, but I have to take issue when people start trying to drag Postgres into it again by spreading FUD (Fear, Uncertainty, and Doubt). Rather than simply rebut the FUD, I thought this was a good opportunity to examine the strength of the Postgres project.

Monty recently espoused the following in a blog comment:

"...This case is about ensuring that Oracle doesn't gain money and market share by killing an Open Source competitor. Today MySQL, tomorrow PostgreSQL. Yes, PostgreSQL can also be killed; To prove the case, think what would happen if someone managed to ensure that the top 20 core PostgreSQL developers could not develop PostgreSQL anymore or if each of these developers would fork their own PostgreSQL project."

Later on in his blog he raises the same theme again with a slight bit more detail:

"Note that not even PostgreSQL is safe from this threat! For example, Oracle could buy some companies developing PostgreSQL and target the core developers. Without the core developers working actively on PostgreSQL, the PostgreSQL project will be weakened tremendously and it could even die as a result."

Is this a valid concern? It's easy enough to overlook it considering the Sturm und Drang in Monty's recent posts, but I think this is something worth seriously looking into. Specifically, is the Postgres project capable of withstanding a direct threat from a large company with deep pockets (e.g. Oracle)?

To get to the answer, let's run some numbers first. Monty mentions the "top 20" Postgres developers. If we look at the community contributors page, we see that there are in fact 25 major developers listed, as well as 7 core members, so 20 would indeed be a significant chunk of that page. To dig deeper, I looked at the cvs logs for the year of 2009 for the Postgres project, and ran some scripts against them. The 9185 commits were spread across 16 different people, and about 16 other people were mentioned in the commit notes as having contributed in some way (e.g. a patch from a non-committer). So again, it looks like Monty's number of 20 is a pretty good approximation.

However (and you knew there was a however), the catch comes from being able to actually stop 20 of those people from working on Postgres. There are basically two ways to do this: Oracle could buy out a company, or they could hire (buy out) a person. The first problem is that the Postgres community is very widely distributed. If you look at the people on the community contributors page, you'll see that the 32 people work for 24 different companies. Further, no one company holds sway: the median is one developer per company, and the high water mark is a mere three developers. All of this is much better than it was years ago, in the total number and in the distribution.

The next fly in the ointment is that buying out a company is not always easy to do, despite the size of your pockets. Many companies on that list are privately held and will not sell. Even if you did buy out the company, there is no way to prevent the people working there from then moving to a different company. Finally, buying out some companies just isn't possible, even if you are Oracle, because there are some big names on the list of people employing major Postgres developers: Google, Red Hat, Skype, and SRA. Then of course there is NTT, which is a really, really big company (larger than Oracle). NTT's Postgres developers are not always as visible as some of the English-speaking ones, but NTT employs a lot of people to work on Postgres (which is extremely popular in Japan).

The second way is hiring people directly. However, people cannot always be bought off. Sure, some of the developers might choose to leave if Oracle offered them $20 million, but not all of them (Larry, I might go for $19 million, call me :). Even if they did leave, the depth of the Postgres community should not be underestimated. For every "major developer" on that page, there are many others who read the lists, know the code well, but just haven't, for one reason or another, made it on to that list. At a rough guess, I'd say that there are a couple hundred people in the world who would be able to make commits to the Postgres source code. Would all of them be as fast or effective as some of the existing people? Perhaps not, but the point is that it would be nigh impossible to thin the pool fast enough to make a dent.

The project's email lists are as strong as ever, to such a point that I find it hard to keep up with the traffic, a problem I did not have a few years ago. The number of conferences and people attending each is growing rapidly, and there is a great demand for people with Postgres skills. The number of projects using Postgres, or offering it as an alternative database backend, is constantly growing. It's no longer difficult to find a hosting provider that offers Postgres in addition to MySQL. Most important of all, the project continues to regularly release stable new versions. Version 8.5 will probably be released in 2010.

In conclusion, the Postgres project is in great shape, due to the depth and breadth of the community (and the depth and breadth of the developer subset). There is no danger of Postgres going the MySQL route; the PG developers are spread across a number of businesses, the code (and documentation!) is BSD, and no one firm holds sway in the project.

Monitoring Postgres log files with tail_n_mail

We've just publicly released a useful script named tail_n_mail that keeps an eye on your Postgres log files and mails interesting lines to one or more addresses. It's released under a BSD license and is available at:

http://bucardo.org/wiki/Tail_n_mail

Complete documentation is available at the above, but here's a quick overview. First, it figures out the current log file (it actually works for any file, but it's primarily aimed at Postgres log files). Then, it finds any lines that match the INCLUDE lines in the config file, and finally removes any that match the EXCLUDE lines in the config file. It summarizes the results and sends a report to one or more email addresses.
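
The include-then-exclude filtering can be sketched in a few lines of Python. This is an illustration of the logic only, not the code of tail_n_mail (which is a Perl script): it assumes both kinds of entries are treated as regular expressions, and the function name is invented.

```python
import re

def interesting_lines(lines, include, exclude):
    """Return lines matching any INCLUDE pattern and no EXCLUDE pattern."""
    inc = [re.compile(p) for p in include]
    exc = [re.compile(p) for p in exclude]
    return [
        line for line in lines
        if any(p.search(line) for p in inc)
        and not any(p.search(line) for p in exc)
    ]
```

Calling interesting_lines(log_lines, ["FATAL:  ", "PANIC:  "], ['database ".+" does not exist']) keeps the fatal and panic lines while dropping the "does not exist" noise.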

To use, just specify a configuration file as the first argument. Typically, the script is run from cron, either for instant reports (e.g. FATAL or PANIC errors), or for daily reports (e.g. all interesting ERRORs in the last 24 hours).

Here's what a typical config file looks like. In this example, we'll look for any FATAL or PANIC notices from Postgres, while ignoring a few known errors that we don't care about.


 ## Config file for the tail_n_mail.pl program
 ## This file is automatically updated
 EMAIL: greg@endpoint.com, postgres@endpoint.com
 
 FILE: /var/log/pg_log/postgres-%Y-%m-%d.log
 INCLUDE: FATAL:  
 INCLUDE: PANIC:  
 EXCLUDE: database ".+" does not exist
 EXCLUDE: database "template0" is not currently accepting connections
 MAILSUBJECT: HOST Postgres fatal errors (FILE)
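
The FILE entry contains date escapes. Assuming they carry the usual strftime meanings (%Y for the year, %m for the month, %d for the day), the name of the log file for a given day can be worked out as in this small Python sketch. The function name is made up for the example.

```python
import datetime

def current_log_path(template, day=None):
    """Expand strftime-style escapes in a FILE template for the given day."""
    day = day or datetime.date.today()
    return day.strftime(template)
```

For January 1, 2010, the template above expands to /var/log/pg_log/postgres-2010-01-01.log.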

The script should be set up to run often from cron:

  */5 * * * * perl bin/tail_n_mail.pl bin/tnm/tnm.fatals.config

The resulting mail message will look like this:

Matches from /var/log/pg_log/postgres-2010-01-01.log: 42
Date: Fri Jan  1 10:34:00 2010
Host: pollo

[1] Between lines 123005 and 147976, occurs 39 times.
First:  Jan  1 00:00:01 rojogrande postgres[4306]
Last:   Jan  1 10:30:00 rojogrande postgres[16854]
Statement:  user=root,db=rojogrande FATAL:  password authentication failed for user "root"

[2] Between lines 147999 and 148213, occurs 2 times.
First:  Jan  1 10:31:01 rojogrande postgres[3561]
Last:   Jan  1 10:31:10 rojogrande postgres[15312]
Statement: FATAL  main: write to worker pipe failed -(9) Bad file descriptor

[3] (from line 152341)
PANIC:  could not locate a valid checkpoint record

There may be false positives, but it's not designed to be a complete log parser. There are some other command line flags and options for the config file: see the documentation for the full list. This script has been watching over a number of production systems for a while now, but improvements, ideas, and patches are always welcome. It's tracked via git; you can clone it by running:

  git clone git://bucardo.org/tail_n_mail.git

Bugs and feature requests can be filed and tracked at:

http://bucardo.org/bugzilla/

MySQL and Postgres command equivalents (mysql vs psql)

Users toggling between MySQL and Postgres are often confused by the equivalent commands to accomplish basic tasks. Here's a chart listing some of the differences between the command line client for MySQL (simply called mysql), and the command line client for Postgres (called psql).

MySQL (using mysql) | Postgres (using psql) | Notes
\c Clears the buffer | \r (same)
\d string Changes the delimiter | No equivalent
\e Edit the buffer with external editor | \e (same) | Postgres also allows \e filename which will become the new buffer
\g Send current query to the server | \g (same)
\h Gives help - general or specific | \h (same)
\n Turns the pager off | \pset pager off (same) | The pager is only used when needed based on number of rows; to force it on, use \pset pager always
\p Print the current buffer | \p (same)
\q Quit the client | \q (same)
\r [dbname] [dbhost] Reconnect to server | \c [dbname] [dbuser] (same)
\s Status of server | No equivalent | Some of the same info is available from the pg_settings table
\t Stop teeing output to file | No equivalent | However, \o (without any argument) will stop writing to a previously opened outfile
\u dbname Use a different database | \c dbname (same)
\w Do not show warnings | No equivalent | Postgres always shows warnings by default
\C charset Change the charset | \encoding encoding Change the encoding | Run \encoding with no argument to view the current one
\G Display results vertically (one column per line) | \x (same) | Note that \G is a one-time effect, while \x is a toggle from one mode to another. To get the exact same effect as \G in Postgres, use \x\g\x
\P pagername Change the current pager program | Environment variable PAGER or PSQL_PAGER
\R string Change the prompt | \set PROMPT1 string (same) | Note that the Postgres prompt cannot be reset by omitting an argument. A good prompt to use is: \set PROMPT1 '%n@%`hostname`:%>%R%#%x%x%x '
\T filename Sets the tee output file | No direct equivalent | Postgres can output to a pipe, so you can do: \o |tee filename
\W Show warnings | No equivalent | Postgres always shows warnings by default
\? Help for internal commands | \? (same)
\# Rebuild tab-completion hash | No equivalent | Not needed, as tab-completion in Postgres is always done dynamically
\! command Execute a shell command | \! command (same) | If no command is given with Postgres, the user is dropped to a new shell (exit to return to psql)
\. filename Include a file as if it were typed in | \i filename (same)
Timing is always on | \timing Toggles timing on and off
No equivalent | \t Toggles 'tuple only' mode | This shows the data from select queries, with no headers or footers
show tables; List all tables | \dt (same) | Many also use just \d, which lists tables, views, and sequences
desc tablename; Display information about the given table | \d tablename (same)
show index from tablename; Display indexes on the given table | \d tablename (same) | The bottom of the \d tablename output always shows indexes, as well as triggers, rules, and constraints
show triggers from tablename; Display triggers on the given table | \d tablename (same) | See notes on show index above
show databases; List all databases | \l (same)
No equivalent | \dn List all schemas | MySQL does not have the concept of schemas, but uses databases as a similar concept
select version(); Show backend server version | select version(); (same)
select now(); Show current time | select now(); (same) | Postgres will give fractional seconds in the output
select current_user; Show the current user | select current_user; (same)
select database(); Show the current database | select current_database(); (same)
show create table tablename; Output a CREATE TABLE statement for the given table | No equivalent | The closest you can get with Postgres is to use pg_dump --schema-only -t tablename
show engines; List all server engines | No equivalent | Postgres does not use separate engines
CREATE object ... Create an object: database, table, etc. | CREATE object ... Mostly the same | Most CREATE commands are similar or identical. Look up specific help on commands (for example: \h CREATE TABLE)

If there are any commands not listed you would like to see, or if there are errors in the above, please let me know. There are differences in how you invoke mysql and psql, and in the flags that they use, but that's a topic for another day.

Updates: Added PSQL_PAGER and \o |tee filename, thanks to the Davids in the comments section. Added \t back in, per Joe's comment.

Verifying Postgres tarballs with PGP

If you are downloading the Postgres source code tarballs from a mirror, how can you tell if these are the same tarballs that were created by the packagers? You can't really - although they come with an MD5 checksum file, these files are packaged right alongside the tarballs themselves, so it would be easy enough for someone to create an evil tarball along with a new MD5 file. All you could do is perhaps check if the tarball that came from mirror A has a matching checksum file from mirror B, or even the main repository itself.

One way around this is to use PGP (which almost always means GnuPG in the open-source software world) to digitally sign the tarballs. Until the Postgres project gets an official key and starts doing this, one workaround is to at least know the checksums from one single point in time. To that end, I've been digitally signing messages containing the checksums for the tarballs for many years now and posting them to pgsql-announce. You'll need a copy of my public key (0x14964AC8, fingerprint 2529 DF6A B8F7 9407 E944 45B4 BC9B 9067 1496 4AC8) to verify the messages. A copy of the latest announcement message is below.

Note that I've also added a sha1sum for each tarball, as a precaution against relying on a single MD5 checksum (sha1sum does a SHA-1 checksum, naturally). Also note that rather than signing each tarball, I've simply signed a message containing the checksums for each one.

While this is far from a fool-proof system, it's much, much better than the existing system, and provides a way for changed tarballs to be detected. If anyone ever finds a mismatch, please let me know (or better yet, email pgsql-general@postgresql.org).
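
Once the signature on an announcement has been verified with your PGP tool of choice, comparing a downloaded tarball against the listed checksums can be scripted. The following Python sketch is one way to do it, not an official tool: it assumes checksum lines have the form "hexdigest, whitespace, filename" as in the message below, and the function names are invented for the example. It does not check the PGP signature itself.

```python
import hashlib

HEX_DIGITS = set("0123456789abcdef")

def parse_checksums(text):
    """Map each filename to the set of hex digests listed for it."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines shaped like "<32 or 40 hex chars>  <filename>".
        if (len(parts) == 2 and len(parts[0]) in (32, 40)
                and set(parts[0]) <= HEX_DIGITS):
            expected.setdefault(parts[1], set()).add(parts[0])
    return expected

def file_digests(path):
    """Compute the MD5 and SHA-1 hex digests of a file, reading in chunks."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            md5.update(block)
            sha1.update(block)
    return {md5.hexdigest(), sha1.hexdigest()}

def matches_announcement(path, name, announcement_text):
    """True only if the file's MD5 and SHA-1 are exactly those listed for name."""
    listed = parse_checksums(announcement_text).get(name, set())
    return file_digests(path) == listed
```

Requiring both digests to match is deliberate: it is the same precaution as listing a sha1sum next to each md5sum.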

-----BEGIN PGP SIGNED MESSAGE-----                                   
Hash: RIPEMD160                                                      


Source code MD5 and SHA1 checksums for PostgreSQL 
versions 8.4.2, 8.3.9, 8.2.15, 8.1.19, 8.0.23, and 7.4.27

For instructions on how to use this file to verify Postgres 
tarballs, please see:                                       

http://www.gtsm.com/postgres_sigs.html

## Created with md5sum:
1bc9cdc76c6a2a13bd7fdc0f3f53667f  postgresql-8.4.2.tar.gz
d738227e2f1f742d2f2d4ab56496c5c6  postgresql-8.4.2.tar.bz2
4f176a4e7c0a9f8a7673bec99d1905a0  postgresql-8.3.9.tar.gz 
e120b001354851b5df26cbee8c2786d5  postgresql-8.3.9.tar.bz2
a9d97def309c93998f4ff3e360f3f226  postgresql-8.2.15.tar.gz
e6f2274613ad42fe82f4267183ff174a  postgresql-8.2.15.tar.bz2
335d8c42bd6e7522bb310d19d1f9a91b  postgresql-8.1.19.tar.gz 
ba84995e1e2d53b0d750b75adfaeede3  postgresql-8.1.19.tar.bz2
eb35f66d1c49d87c27f2ab79f0cebf8e  postgresql-8.0.23.tar.gz 
1c6fac4265e71b4f314a827ca5f58f6a  postgresql-8.0.23.tar.bz2
77d09f4806bd913820f82abc27aca70e  postgresql-7.4.27.tar.gz 
1fd1d2702303f9b29b5dba1ec4e1aade  postgresql-7.4.27.tar.bz2

## Created with sha1sum:
563caa3da16ca84608e5ff9c487753f3bd127883  postgresql-8.4.2.tar.gz
a617698ef3b41a74fe2c4af346172eb03e7f8a7f  postgresql-8.4.2.tar.bz2
6ee1e384bdd37150ce6fafa309a3516ec3bbef02  postgresql-8.3.9.tar.gz 
5403f13bb14fe568e2b46a3350d6e28808d93a2c  postgresql-8.3.9.tar.bz2
bd803d74bf9aeac756cb69ae6c1c261046d90772  postgresql-8.2.15.tar.gz
4de199b3223dba2164a9e56d998f6deb708f0f74  postgresql-8.2.15.tar.bz2
233a365985a5a636a97f9d1ab4e777418937caed  postgresql-8.1.19.tar.gz 
f1667a64e92a365ae3d46903382648bdc0daa1ba  postgresql-8.1.19.tar.bz2
7783dc54638e044cff3c339d9fd960a9b65a31df  postgresql-8.0.23.tar.gz 
a2c37eb802a4d67bc2508f72035dae6fb29494df  postgresql-8.0.23.tar.bz2
405909d755aa907fc176d22d1b51d6b5704eb3b4  postgresql-7.4.27.tar.gz 
bb35cc844157b8a0d0b2e9e1ab25b6597c82dd1c  postgresql-7.4.27.tar.bz2

- -- 
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200912151528     
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAksoDPgACgkQvJuQZxSWSsikVQCgiE34ycdexL9lwSfZ+TLTZh5m
G3AAnRkazEu/uHLJCNvDZe2cmqCrCkem                                
=HjAS                                                           
-----END PGP SIGNATURE-----
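
To put the signed message above to use, save it to a file, check the signature with gpg --verify, and then compare the listed checksums against the tarball you downloaded. The following is only a minimal sketch, assuming GNU coreutils; the function name check_tarball and the file name postgres_sums.asc are made up for illustration:

```shell
#!/bin/sh
# check_tarball SUMS_FILE TARBALL  (hypothetical helper, not part of any package)
# Picks the SHA-1 line for TARBALL out of SUMS_FILE and lets sha1sum
# compare it with the file of that name in the current directory.
check_tarball() {
    # A SHA-1 checksum is 40 hex digits; this skips the shorter MD5 lines.
    awk -v name="$2" 'length($1) == 40 && $2 == name { print $1 "  " name }' "$1" |
        sha1sum -c -
}

# Typical use, after importing the signer's public key:
#   gpg --verify postgres_sums.asc
#   check_tarball postgres_sums.asc postgresql-8.4.2.tar.gz
```

The checksum comparison only means something if the signature check passed first; otherwise the list of checksums itself could have been altered.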

Editing large files in place

Running out of disk space seems to be an all too common problem lately, especially when dealing with large databases. One situation that came up recently was a client who needed to import a large Postgres dump file into a new database. Unfortunately, they were very low on disk space and the file needed to be modified. Without going into all the reasons, we needed the databases to use template1 as the template database, and not template0. This was a very large, multi-gigabyte file, and the amount of space left on the disk was measured in megabytes. It would have taken too long to copy the file somewhere else to edit it, so I did a low-level edit using the Unix utility dd. The rest of this post gives the details.

To demonstrate the problem and the solution, we'll need a disk partition that has little-to-no free space available. In Linux, it's easy enough to create such a thing by using a RAM disk. Most Linux distributions already have these ready to go. We'll check it out with:

$ ls -l /dev/ram*
brw-rw---- 1 root disk 1,  0 2009-12-14 13:04 /dev/ram0
brw-rw---- 1 root disk 1,  1 2009-12-14 22:27 /dev/ram1

From the above, we see that there are some RAM disks available (there are actually 16 of them available on my box, but I only showed two). Here are the steps to create a usable partition from /dev/ram1, and to then check the size:

$ mkdir /home/greg/ramtest

$ sudo mke2fs /dev/ram1
mke2fs 1.41.4 (27-Jan-2009)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
4096 inodes, 16384 blocks
819 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=16777216
2 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
        8193

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

$ sudo mount /dev/ram1 /home/greg/ramtest

$ sudo chown greg:greg /home/greg/ramtest

$ df -h /dev/ram1
Filesystem            Size  Used Avail Use% Mounted on
/dev/ram1              16M  140K   15M   1% /home/greg/ramtest

First we created a new directory to serve as the mount point, then we used the mke2fs utility to create a new file system (ext2) on the RAM disk at /dev/ram1. It's a fairly verbose program by default, but there is nothing in the output that's really important for this example. Then we mounted our new filesystem to the directory we just created. Finally, we reset the permissions on the directory such that an ordinary user (e.g. 'greg') can read and write to it. At this point, we've got a directory/filesystem that is just under 16 MB large (we could have made it much closer to 16 MB by specifying a -m 0 to mke2fs, but the actual size doesn't matter).

To simulate what happened, let's create a database dump and then bloat it until it takes up all available space:

$ cd /home/greg/ramtest

$ pg_dumpall > data.20091215.pg

$ ls -l data.20091215.pg
-rw-r--r-- 1 greg greg 3685 2009-12-15 10:42 data.20091215.pg

$ dd seek=3685 if=/dev/zero of=data.20091215.pg bs=1024 count=99999
dd: writing 'data.20091215.pg': No space left on device
13897+0 records in
13896+0 records out
14229504 bytes (14 MB) copied, 0.0814188 s, 175 MB/s

$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/ram1              16M   15M     0 100% /home/greg/ramtest

First we created the dump, then we found the size of it, and used dd's 'seek' argument to start writing past the existing 3685 bytes, so that the dump itself is left untouched (in other words, we appended to the file). Note that seek is counted in blocks of 'bs' bytes rather than in single bytes, so the zeroes actually start at block 3685, leaving a gap after the dump; that makes no difference for this test. We used the special file /dev/zero as the 'if' (input file), and our existing dump as the 'of' (output file). Finally, we told it to chunk the inserts into 1024 bytes at a time, and to attempt to add 99,999 of those chunks. Since this is approximately 100MB, we ran out of disk space quickly, as we intended. The filesystem is now at 100% usage, and will refuse any further writes to it.
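
One detail of dd that is easy to trip over is that seek counts output blocks, not bytes. A small sketch on a scratch file (the function name seek_demo and the file demo.bin are made up for this illustration) shows the effect:

```shell
#!/bin/sh
# seek_demo: write one 1024-byte block of zeroes into a 3-byte file,
# after seeking two blocks in, then report what the file looks like.
seek_demo() {
    dir=$(mktemp -d) || return 1
    printf 'abc' > "$dir/demo.bin"
    # seek=2 with bs=1024 skips 2 * 1024 bytes of output before writing.
    dd if=/dev/zero of="$dir/demo.bin" bs=1024 seek=2 count=1 2>/dev/null
    wc -c < "$dir/demo.bin"     # 3072 bytes: 2048 skipped + 1024 written
    head -c 3 "$dir/demo.bin"   # abc: the start of the file survives
    echo
    rm -r "$dir"
}
seek_demo
```

With bs=1 the block size is a single byte, which is why the byte-level edits later in this post can pass a plain byte offset to seek.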

To recap, we need to replace the first three instances of template0 with template1. Let's use grep to view the lines:

$ grep --text --max-count=3 template data.20091215.pg
CREATE DATABASE greg WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';

We need the --text argument here because grep correctly surmises that we've changed the file from text-based to binary with the addition of all those zeroes on the end. We also used the handy --max-count argument to stop processing once we've found the lines we want. Very handy argument when the actual file is gigabytes in size!

There are two major problems with using a normal text editor to change the file. First, the file (in the real situation, not this example!) was very, very large. We only needed to edit something at the very top of the file, so loading the entire thing into an editor is very inefficient. Second, editors need to save their changes somewhere, and there just was not enough room to do so.

Attempting to edit with emacs gives us: emacs: IO error writing /home/greg/ramtest/data.20091215.pg: No space left on device

An attempt with vi gives us: vi: Write error in swap file on startup. "data.20091215.pg" E514: write error (file system full?)

Although emacs gives the better error message (why is vim making a guess and outputting some weird E514 error?), the advantage always goes to vi in cases like this as emacs has a major bug in that it cannot even open very large files.

What about something more low-level like sed? Unfortunately, while sed is more efficient than emacs or vim, it still needs to read the old file and write the new one. We can't do that writing as we have no disk space! More importantly, in sed there is no way (that I could find anyway) to tell it to stop processing after a certain number of matches.

What we need is something *really* low-level. The utility dd comes to the rescue again. We can use dd to truly edit the file in place. Basically, we're going to overwrite some of the bytes on disk, without needing to change anything else. First though, we have to figure out exactly which bytes to change. The grep program has a nice option called --byte-offset that can help us out:

$ grep --text --byte-offset --max-count=3 template data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';

This tells us the offset for each line, but we want to replace the number '0' in 'template0' with the number '1'. Rather than count it out manually, let's just use another Unix utility, hexdump, to help us find the number:

$ grep --text --byte-offset --max-count=3 template data.20091215.pg | hexdump -C
00000000  33 30 31 3a 43 52 45 41  54 45 20 44 41 54 41 42  |301:CREATE DATAB|
00000010  41 53 45 20 67 72 65 67  20 57 49 54 48 20 54 45  |ASE greg WITH TE|
00000020  4d 50 4c 41 54 45 20 3d  20 74 65 6d 70 6c 61 74  |MPLATE = templat|
00000030  65 30 20 4f 57 4e 45 52  20 3d 20 67 72 65 67 20  |e0 OWNER = greg |
00000040  45 4e 43 4f 44 49 4e 47  20 3d 20 27 55 54 46 38  |ENCODING = 'UTF8|
...

Each line is 16 characters, so the first three lines come to 48 characters, then we add two for the 'e0', subtract four for the '301:', and get 301+48+2-4=347. We subtract one more as we want to seek to the point just before that character, and we can now use our dd command:

$ echo 1 | dd of=data.20091215.pg seek=346 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.00012425 s, 8.0 kB/s

Instead of an input file (the 'if' argument), we simply pass the number '1' via stdin to the dd command. We use our calculated seek, tell it to copy a single byte (bs=1), one time (count=1), and (this is very important!) tell dd NOT to truncate the file when it is done (conv=notrunc). Technically, we are sending two characters to the dd program, the number one and a newline, but the bs=1 argument ensures only the first character is being copied. We can now verify that the change was made as we expected:

$ grep --text --byte-offset --max-count=3 TEMPLATE data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';

Now for the other two entries. From before, the magic number is 45, so we now add 380 to 45 to get 425. For the third line, the name of the database is 1 character longer so we add 459+45+1 = 505:

$ echo 1 | dd of=data.20091215.pg seek=425 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000109234 s, 9.2 kB/s

$ echo 1 | dd of=data.20091215.pg seek=505 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000109932 s, 9.1 kB/s

$ grep --text --byte-offset --max-count=3 TEMPLATE data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
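
The offset arithmetic above can also be left to grep: combined with --only-matching, GNU grep's --byte-offset reports where the match itself starts rather than where the line starts. The following is a rough sketch along those lines, assuming GNU grep; the function name patch_template0 is invented for this example:

```shell
#!/bin/sh
# patch_template0 FILE [LINES]  (hypothetical helper)
# Overwrites the trailing 0 of template0 with 1 on the first LINES matching
# lines of FILE (default 3), in place, without changing the file's size.
patch_template0() {
    file=$1
    lines=${2:-3}
    # With --only-matching, --byte-offset gives the offset of each match.
    offsets=$(grep --text --byte-offset --only-matching --max-count="$lines" \
                   'template0' "$file" | cut -d: -f1)
    for start in $offsets; do
        # 'template' is 8 bytes long, so the digit sits 8 bytes past the match.
        printf '1' | dd of="$file" bs=1 seek=$((start + 8)) count=1 conv=notrunc 2>/dev/null
    done
}

# Usage:
#   patch_template0 data.20091215.pg 3
```

This assumes template0 appears only once on each matching line. Checking the result with grep afterwards, as done above, is still a good idea.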

Success! On the real system, the database was loaded with no errors, and the large file was removed. If you've been following along and need to clean up:

$ cd ~
$ sudo umount /home/greg/ramtest
$ rmdir ramtest

Keep in mind that dd is a very powerful and thus very dangerous utility, so treat it with care. It can be invaluable for times like this however!

Live by the sword, die by the sword

In an amazing display of chutzpah, Monty Widenius recently asked on his blog for people to write to the EC about the takeover of Sun by Oracle and its effect on MySQL, saying:

I, Michael "Monty" Widenius, the creator of MySQL, is asking you urgently to 
help save MySQL from Oracle's clutches. Without your immediate help Oracle
might get to own MySQL any day now. By writing to the European Commission (EC)
you can support this cause and help secure the future development of the
product MySQL as an Open Source project.

"Help secure the future development"? Sorry, but that ship has sailed. Specifically, when MySQL was sold to Sun. There were many other missed opportunities over the years to keep MySQL as a good open source project. Some of the missteps:

  • Bringing in venture capitalists
  • Selling to Sun instead of making an IPO (Initial Public Offering)
  • Failing to check on the long-term health of Sun before selling to them
  • Choosing the proprietary dual-licensing route
  • Making the documentation have a restricted license
  • Failing to acquire InnoDB (which instead was bought by Oracle)
  • Failing to acquire SleepyCat (which was instead bought by Oracle)
  • Spreading FUD about the dual license and twisting the GPL in novel and dubious ways

Also interesting are some of the related blog posters and pundits, who seem to think that MySQL has some sort of special mystical quality that requires it be 'saved'. Sorry, but the business world and the open source world are both harsh ecosystems, where today's market leader can become tomorrow's has-been. For all those who are bemoaning MySQL's fate (especially those directly involved in selling this dual-licensed project for money), I offer a quote: "live by the sword, die by the sword". Not that MySQL is dead yet, but it's been dealt quite a number of near-fatal blows, and I'm not convinced that all the forks, spinoffs, well-wishers, and ex-developers can fix that. Should be interesting times ahead.

Automatically building Pentaho metadata

Every so often I'll hear of someone asking for a way to allow their users to write queries against their database without having to teach everyone SQL. There are various applications to do this: BusinessObjects and Cognos are two common commercial examples, among many others. Pentaho and JasperReports provide similar capabilities in the open-source world. These tools allow users to write reports by selecting fields from a user-friendly list, adding suitable constraints, and making other formatting and filtering choices, all without needing to understand SQL.

Those familiar with these packages know that in order to provide those nice, readable field names and simple, meaningful field groupings, the software generally needs some sort of metadata file. This file maps actual database fields to readable descriptions, specifies relationships between tables, and translates database field types to data types the reporting software understands. Typically to create such a file, an administrator spends a few hours in front of a vendor-supplied GUI application dragging graphical representations of their tables and columns around, defining joins and entering friendly descriptions.

For the TriSano™ project's data warehouse, we needed a way to make regular modifications to the metadata file we gave to our Pentaho instance, in order to allow users to write reports that included data from the custom-built forms TriSano allowed them to create. To this end, we dove into the Pentaho APIs and developed a system to modify the metadata file automatically, adding tables and relationships whenever users create a new custom form.

TriSano is a Ruby-on-Rails application, running on JRuby, and the ability to use Java objects natively within JRuby was critical to interfacing correctly with Pentaho, a Java application. Within JRuby, our script can create Pentaho objects at will. Interested parties are encouraged to browse the source code of the TriSano script for the many details required to make this work.

In short, the script makes a new Pentaho metadata file entirely from scratch, using only information from a small number of purpose-built database tables, and database structure information taken directly from the PostgreSQL catalogs. It creates a schema file, populates it with descriptions of each of the actual database tables our users are interested in, assigns friendly names to each of the database objects with which users will interact, and divides up the results into user-defined groupings meaningful to their business.

I'm not familiar with a commercial reporting package that allows for modification of the underlying metadata without user intervention; doing something like this without the benefit of open-source software would have been daunting indeed.

PL/LOLCODE and INLINE functions

PostgreSQL 8.5 recently learned how to handle "inline functions" through the DO statement. Further discussion is here, but the basic idea is that within certain limitations, you can write ad hoc code in any language that supports it, without having to create a full-fledged function. One of those limitations is that you can't actually return anything from your function. Another is that the language has to support an "inline handler".

PostgreSQL procedural languages all have a language handler function, which gets called whenever you execute a stored procedure in that language. An inline handler is a separate function, somewhat slimmed down from the standard language handler. PostgreSQL gives the inline handler an argument containing, among other things, the source text passed in the DO block, which the inline handler simply has to parse and execute.

As of when the change was committed in PostgreSQL, only PL/pgSQL supported inline functions. Other languages may now support them; today I spent the surprisingly short time needed to add the capability to PL/LOLCODE. Here's a particularly useless example:

DO $$
HAI
 VISIBLE "This is a test of INLINE stuff"
KTHXBYE
$$ language pllolcode;

Talk slides are available! Bucardo: Replication for PostgreSQL

I'm in Seattle for the PostgreSQL Conference West today! I just finished giving a talk on Bucardo, a master-slave and multi-master replication system for Postgres.

The talk was full, and had lots of people who've used Slony in the past, so I got lots of great questions. I realized we should publish some "recommended architectures" for setting up the Bucardo control database, and provide more detailed diagrams for how replication events actually occur. I also talked to someone interested in using Bucardo to show DDL differences between development databases and suggested he post to the mailing list. Greg has created scripts to do similar things in the past, and it would be really cool to have Bucardo output runnable SQL for applying changes.

I also made a hard pitch for people to start a SEAPUG, and it sounds like some folks from the Fred Hutchinson Cancer Research Center are interested. (I'm naming names, hoping that we can actually do it this time :D). If you are from the Seattle area, go ahead and subscribe to the seapug@postgresql.org mailing list (pick 'seapug' from the list dropdown menu)!

Thanks everyone who attended, and I'm looking forward to having lunch with a bunch of PostgreSQL users here in Seattle!

Permission denied for postgresql.conf

I recently saw a problem in which Postgres would not start up when called via the standard 'service' script, /etc/init.d/postgresql. This was on a normal Linux box, Postgres was installed via yum, and the startup script had not been altered at all. However, running this as root:

 service postgresql start

...simply gave a "FAILED".

Looking into the script showed that output from the startup attempt should be going to /var/lib/pgsql/pgstartup.log. Tailing that file showed this message:

  postmaster cannot access the server configuration file
  "/var/lib/pgsql/data/postgresql.conf": Permission denied

However, the postgres user can see this file, as evidenced by an su to the account and viewing the file. What's going on? Well, anytime you see something odd when using Linux, especially if permissions are involved, you should suspect SELinux. The first thing to check is if SELinux is running, and in what mode:

# sestatus

SELinux status:                 enabled
SELinuxfs mount:                /selinux
Current mode:                   enforcing
Mode from config file:          enforcing
Policy version:                 21
Policy from config file:        targeted

Yes, it is running and most importantly, in 'enforcing' mode. SELinux logs to /var/log/audit/ by default on most distros, although some older ones may log directly to /var/log/messages. In this case, I quickly found the problem in the logs:

# grep postgres /var/log/audit/audit.log | grep denied | tail -1

type=AVC msg=audit(1234567890.334:432): avc:  denied  { read } for
pid=1234 comm="postmaster" name="pgsql" dev=newpgdisk ino=403123  
scontext=user_u:system_r:postgresql_t:s0
tcontext=system_u:object_r:var_lib_t:s0 tclass=lnk_file

Looks like SELinux did not like a symlink, and sure enough:

# ls -ld /var/lib/pgsql /var/lib/pgsql/data /var/lib/pgsql/data/postgresql.conf

lrwxrwxrwx. 1 postgres postgres 18 1999-12-31 23:55 /var/lib/pgsql -> /mnt/newpgdisk
drwx------. 2 postgres postgres  4096 1999-12-31 23:56 /var/lib/pgsql/data
-rw-------. 1 postgres postgres 16816 1999-12-31 23:57 /var/lib/pgsql/data/postgresql.conf

Here we see that although the postgres user owns the symlink, owns the data directory at /var/lib/pgsql/data, and owns the file in question, /var/lib/pgsql/data/postgresql.conf, the conf file is no longer really on /var/lib/pgsql, but is on /mnt/newpgdisk. SELinux did not like the fact that the postmaster process was trying to read across that symlink.
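
When the audit log holds many denials, it can help to boil each one down to the process, the permission, and the class of object involved. The following sed one-liner is only a sketch for AVC lines shaped like the one shown earlier; the function name summarize_avc and its output format are made up for this example:

```shell
#!/bin/sh
# summarize_avc: read audit log lines on stdin and print one short line
# per AVC denial, e.g. "postmaster: read denied on lnk_file".
# Lines that are not AVC denials are dropped.
summarize_avc() {
    sed -n 's/.*denied  *{ *\([^}]*[^} ]\) *}.*comm="\([^"]*\)".*tclass=\([a-z_0-9]*\).*/\2: \1 denied on \3/p'
}

# Usage (as root):
#   grep postgres /var/log/audit/audit.log | summarize_avc | sort | uniq -c
```

A tclass of lnk_file, as in this case, points at a symlink rather than at the file or directory behind it.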

Now that we know SELinux is the problem, what can we do about it? There are four possible solutions at this point to get Postgres working again:

First, we can simply edit the PGDATA assignment within the /etc/init.d/postgresql file to point to the actual data dir, and bypass the symlink. In this case, we'd change the line as follows:

#PGDATA=/var/lib/pgsql/data
PGDATA=/mnt/newpgdisk/data

The second solution is to simply turn SELinux off. Unless you are specifically using it for something, this is the quickest and easiest solution.

The third solution is to change the SELinux mode. Switching from "enforcing" to "permissive" will keep SELinux on, but rather than denying access, it will log the attempt and still allow it to proceed. This mode is a good way to debug things while you attempt to put in new enforcement rules or change existing ones.

The fourth solution is the most correct one, but also the most difficult: carving out an SELinux exception for the new symlink. If you move things around again, you'll need to tweak the rules again, of course.

Migrating Postgres with Bucardo 4

Bucardo just released a major version (4). The latest version, 4.0.3, can be found at the Bucardo website. The complete list of changes is available on the new Bucardo wiki.

One of the neat tricks you can do with Bucardo is an in-place upgrade of Postgres. While it still requires application downtime, you can minimize your downtime to a very, very small window by using Bucardo. We'll work through an example below, but for the impatient, the basic process is this:

  1. Install Bucardo and add large tables to a pushdelta sync
  2. Copy the tables to the new server (e.g. with pg_dump)
  3. Start up Bucardo and catch things up (e.g. copy all rows changed since step 2)
  4. Stop your application from writing to the original database
  5. Do a final Bucardo sync, and copy over non-replicated tables
  6. Point the application to the new server

With this, you can migrate very large databases from one server to another (or from Postgres 8.2 to 8.4, for example) with a downtime measured in minutes, not hours or days. This is possible because Bucardo supports replicating a "pre-warmed" database - one in which most of the data is already there.

Let's test out this process, using the handy pgbench utility to create a database. We'll go from PostgreSQL 8.2 (the original database, called "A") to PostgreSQL 8.4 (the new database, called "B"). The first step is to create and populate database A:

  initdb -D testA
  echo port=5555 >> testA/postgresql.conf
  pg_ctl -D testA -l a.log start
  createdb -p 5555 alpha
  pgbench -p 5555 -i alpha
  psql -p 5555 -c 'create user bucardo superuser'

At this point, we have four tables:

  $ psql -p 5555 -d alpha -c '\d+'
                          List of relations
   Schema |   Name   | Type  |  Owner   |    Size    | Description
  --------+----------+-------+----------+------------+-------------
   public | accounts | table | postgres | 13 MB      |
   public | branches | table | postgres | 8192 bytes |
   public | history  | table | postgres | 0 bytes    |
   public | tellers  | table | postgres | 8192 bytes |

For the purposes of this example, let's make believe that the accounts table is actually 13 TB. :) The next step is to prepare the 8.4 database:

  initdb -D testB
  echo port=5566 >> testB/postgresql.conf
  pg_ctl -D testB -l b.log start

We'll copy everything except the data itself to the new server:

  pg_dumpall --schema-only -p 5555 | psql -p 5566 -f -

Because the other tables are very small, we're only going to use Bucardo to copy over the large "accounts" table. So let's install Bucardo and add a sync to do just that:

  sudo yum install perl-DBIx-Safe
  tar xvf Bucardo-4.0.3.tar.gz
  cd Bucardo-4.0.3
  perl Makefile.PL
  sudo make install

(That's a very quick overview - see the Installation page for more information.)

Let's install Bucardo on the new database:

  mkdir /tmp/bctest
  bucardo_ctl install --dbport=5566 --piddir=/tmp/bctest

Set the port so we don't have to keep typing it in:

  echo dbport=5566 > .bucardorc

Now teach Bucardo about both databases:

  bucardo_ctl add db alpha name=oldalpha port=5555
  bucardo_ctl add db alpha name=newalpha port=5566

Finally, create a sync to copy from old to new:

  bucardo_ctl add sync pepper type=pushdelta source=oldalpha targetdb=newalpha tables=accounts ping=false

This adds a new sync named "pepper" which is of type pushdelta (master-slave: copy changes from the source table to the target(s)). The source is our old server, named "oldalpha" by Bucardo. The target database is our new server, named "newalpha". The only table in this sync is "accounts", and we set ping as false, which means that we do NOT create a trigger on this table to signal Bucardo that a change has been made, as we will be kicking the sync manually.

At this point, the accounts table has a trigger on it that is capturing which rows have been changed (this is separate from the notification trigger that ping=false turned off). The next step is to copy the existing table from the old database to the new database. There are many ways to do this, such as a NetApp snapshot, using ZFS, etc., but we'll use the traditional way of a slow but effective pg_dump:

  pg_dump alpha -p 5555 --data-only -t accounts | psql -p 5566 -d alpha -f -

This can take as long as it needs to. Reads and writes can still happen against the old server, and changes can be made to the accounts table. Once that is done, here's the situation:

  • The old server is still in production
  • The new server has a full but outdated copy of 'accounts'
  • The new server has empty tables for everything but 'accounts'
  • All changes to the accounts table on the old server are being logged.

Our next step is to start up Bucardo, and let it "catch up" the new server with all changes that have occurred since we created the sync:

  bucardo_ctl start

You can keep track of how far along the sync is by tailing the log file (syslog and ./log.bucardo by default) or by checking on the sync itself:

  bucardo_ctl status pepper

Once it has caught up (how long depends on how busy the accounts table is, of course), the only disparity should be any rows that have changed since the sync last ran. You can kick off the sync again if you want:

  bucardo_ctl kick pepper 0

The final 0 there will allow you to see when the sync has finished.

For the final step, we'll need to move the remainder of the data over. This begins our production downtime window. First, stop the app from writing to the database (reading is okay). Next, once you've confirmed nothing is making changes to the database, make a final kick:

  bucardo_ctl kick pepper 0

Next, copy over the other data that was not replicated by Bucardo. This should be small tables that will copy quickly. In our case, we can do it like this:

  pg_dump alpha -p 5555 --data-only -T accounts -N bucardo | psql -p 5566 -d alpha -f -

Note that we excluded the schema bucardo, and copied all tables *except* the 'accounts' one.
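
Before repointing the application, it may be worth confirming that both servers agree on how many rows each table holds. A minimal sketch, assuming the ports and the database name "alpha" used in this walkthrough; the function names count_rows and same_count are invented here:

```shell
#!/bin/sh
# count_rows PORT TABLE: print the row count of TABLE in database "alpha".
# -A and -t make psql print just the bare number.
count_rows() {
    psql -p "$1" -d alpha -At -c "SELECT count(*) FROM $2"
}

# same_count TABLE: succeed only if the old (5555) and new (5566) servers
# report the same number of rows for TABLE.
same_count() {
    old=$(count_rows 5555 "$1") || return 2
    new=$(count_rows 5566 "$1") || return 2
    echo "$1: old=$old new=$new"
    [ "$old" = "$new" ]
}

# Usage:
#   for t in accounts branches history tellers; do same_count "$t" || echo "MISMATCH: $t"; done
```

Matching counts are a quick sanity check, not proof that the contents are identical, and on a truly huge table count(*) is itself slow.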

That's it! You can now point your application to the new server. There are no Bucardo triggers or other artifacts on the new server to clean up. At this point, you can shut down Bucardo itself:

  bucardo_ctl stop

Then shut down your old Postgres and start enjoying your new 8.4 server!

Two quick tips: egrep & SQL dumps, VIM and deleting things that don't match

Sometimes, I just don't want to restore a full SQL dump. The restore might take too long, and maybe I just want a small subset of the records anyway.

I was in exactly this situation the other day - faced with a 10+ hour restore process, it was way faster to grep out the records and then push them into the production databases, than to restore five different versions.

So! egrep and vim to the rescue!

In my case, the SQL dump was full of COPY commands, and I had a username that was used as a partial-key on all the tables I was interested in. So:

egrep "((^COPY)|username)" PostgresDump.sql > username.out

I get a pretty nice result from this. But, there are some records I'm not so interested in that got mixed in, so I opened the output file in vim and turned line numbers on (:set number).

The first thing that I do is insert the '\.' needed to tell Postgres that we're at the end of a COPY statement.

:2,$s/^COPY/\\\.^V^MCOPY/

The '^V^M' is a control sequence that results in a '^M' (a newline character, essentially). And the '2' starts the substitution command on the second line rather than the first COPY statement (which, in my case, was on the first line).

Next, I want to strip out any records that the egrep found that I really don't want to insert into the database:

:.,2000g!/stuff_i_wanna_keep/d

Broken down:

  • '.,2000' - start from the current line and apply the command through line 2000
  • 'g!' - find lines that do not match the following regular expression
  • '/stuff_i_wanna_keep/' - the regular expression
  • 'd' - delete what you find

I also use the ':split' command to divide my vim screen. This lets me look at both the start of a series of records as well as the end, and most importantly find the line number for where I want to stop my line deletion command.

I also add a 'BEGIN;' and 'ROLLBACK;' to the file to run tests on the script before applying to the database.

Once I got the system down, I was able to pull and process about 3000 records I needed out of a 15 GB dump file in about 5 minutes. Testing and finally applying the records took another 10 minutes.
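
For dumps where no hand-pruning is needed, the egrep step, the '\.' terminators, and the BEGIN/ROLLBACK wrapper can be rolled into a single awk filter. This is a sketch of that idea rather than the exact process described above; the function name extract_user_rows is made up:

```shell
#!/bin/sh
# extract_user_rows USERNAME < dump.sql > username.out   (hypothetical helper)
# Keeps every COPY statement plus any line containing USERNAME, closes each
# COPY block with "\." and wraps the result in BEGIN; ... ROLLBACK;
extract_user_rows() {
    awk -v user="$1" '
        BEGIN           { print "BEGIN;" }
        /^COPY /        { if (in_copy) print "\\."; print; in_copy = 1; next }
        index($0, user) { print }
        END             { if (in_copy) print "\\."; print "ROLLBACK;" }
    '
}

# Usage:
#   extract_user_rows username < PostgresDump.sql > username.out
```

Once a test run against the database looks right, change the final ROLLBACK; to COMMIT; to apply the rows for real.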

Text sequences

Somebody recently asked on the Postgres mailing list about "Generating random unique alphanumeric IDs". While there were some interesting solutions given, from a simple PL/pgSQL function to using mathematical transformations, I'd like to lay out a simple and powerful solution using PL/PerlU.

First, to paraphrase the original request, the poster needed a table to have a text column be its primary key, and to have a five-character alphanumeric string used as that key. Let's knock out a quick function using PL/PerlU that solves the generation part of the question:

DROP FUNCTION IF EXISTS nextvalalpha(TEXT);
CREATE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
LANGUAGE plperlu
AS $_$
  use strict;
  my $numchars = 5;
  my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
  my $value = join '' => @chars[map{rand @chars}(1..$numchars)];
  return $value;
$_$;

Pretty simple: it pulls a number of random characters from a string (with some commonly confused letters and numbers removed) and returns them as a string:

greg=# SELECT nextvalalpha('foo');
 nextvalalpha
--------------
 MChNf
(1 row)

greg=# SELECT nextvalalpha('foo');
 nextvalalpha
--------------
 q4jHm
(1 row)
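For a sense of scale, here is a small Python sketch that counts how many distinct IDs this alphabet allows and estimates how likely a duplicate is after a given number of draws. The alphabet is copied from the function above; the helper names are hypothetical, and the estimate uses the standard birthday approximation, which assumes uniform, independent draws:

```python
import math

ALPHABET = "abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789"
NUM_CHARS = 5

def id_space_size(alphabet=ALPHABET, length=NUM_CHARS):
    """Number of distinct strings of the given length over the alphabet."""
    return len(alphabet) ** length

def collision_probability(n_ids, space):
    """Approximate chance that n_ids random draws contain a duplicate."""
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2.0 * space))
```

With 52 characters and five positions there are 52**5 = 380,204,032 possible values, so duplicates are rare for small tables but far from impossible as the table grows.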

So let's set up our test table. Since Postgres can use many things as column DEFAULTs, including user-defined functions, this is pretty straightforward:

DROP TABLE IF EXISTS seq_test;
CREATE TABLE seq_test (
  id    VARCHAR(5) NOT NULL DEFAULT nextvalalpha('foo'),
  city  TEXT,
  state TEXT
);

A quick test shows that the id column is auto-populated with some random values:

greg=# PREPARE abc(TEXT,TEXT) AS INSERT INTO seq_test(city,state) 
greg-# VALUES($1,$2) RETURNING id;

greg=# EXECUTE abc('King of Prussia', 'Pennsylvania');
  id
-------
 9zbsd
(1 row)

INSERT 0 1

greg=# EXECUTE abc('Buzzards Bay', 'Massachusetts');
  id
-------
 4jJ5D
(1 row)

INSERT 0 1

So far so good. But while those returned values are random, they are not guaranteed to be unique, which a primary key needs to be. First, let's create a helper table to keep track of which values we've already seen. We'll also track the 'name' of the sequence, to allow for more than one unique set of sequences at a time:

DROP TABLE IF EXISTS alpha_sequence;
CREATE TABLE alpha_sequence (
  sname TEXT,
  value TEXT
);
CREATE UNIQUE INDEX alpha_sequence_unique_value ON alpha_sequence(sname,value);

Now we tweak the original function to use this new table.

CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  use strict;
  my $sname = shift;
  my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
  my $numchars = 5;
  my $toomanyloops = 10000; ## Completely arbitrary pick
  my $loops = 0;

  my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = $1 AND value = $2';
  my $sth = spi_prepare($SQL, 'text', 'text');

  my $value = '';
  SEARCHING:
  {
    ## Safety valve
    if ($loops++ >= $toomanyloops) {
      die "Could not find a unique value, even after $toomanyloops tries!\n";
    }
    ## Build a new value, then test it out
    $value = join '' => @chars[map{rand @chars}(1..$numchars)];
    my $count = spi_exec_prepared($sth,$sname,$value)->{processed};
    redo if $count >= 1;
  } 

  ## Store it (this INSERT is still part of the caller's transaction)
  $SQL = 'INSERT INTO alpha_sequence VALUES ($1,$2)';
  $sth = spi_prepare($SQL, 'text', 'text');
  spi_exec_prepared($sth,$sname,$value);
  return $value;
$_$;

Alright, that seems to work well, and prevents duplicate values. Or does it? Recall that one of the properties of sequences in Postgres is that they live outside of the normal MVCC rules. In other words, once you get a number via a call to nextval(), nobody else can get that number again (even you!) - regardless of whether you commit or rollback. Thus, sequences are guaranteed unique across all transactions and sessions, even if used for more than one table, called manually, etc. Can we do the same with our text sequence? Yes!

For this trick, we'll need to ensure that we only return a new value if we are 100% sure it is unique. We also need to record the value returned, even if the transaction that calls it rolls back. In other words, we need to make a small 'subtransaction' that commits, regardless of the rest of the transaction. Here's the solution:

CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  use strict;
  use DBI;
  my $sname = shift;
  my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
  my $numchars = 5;
  my $toomanyloops = 10000;
  my $loops = 0;

  ## Connect to this very database, but with a new session
  my $port = spi_exec_query('SHOW port')->{rows}[0]{port};
  my $dbname = spi_exec_query('SELECT current_database()')->{rows}[0]{current_database};
  my $dbuser = spi_exec_query('SELECT current_user')->{rows}[0]{current_user};
  my $dsn = "dbi:Pg:dbname=$dbname;port=$port";
  my $dbh = DBI->connect($dsn, $dbuser, '', {AutoCommit=>1,RaiseError=>1,PrintError=>0});

  my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = ? AND value = ?';
  my $sth = $dbh->prepare($SQL);

  my $value = '';
  SEARCHING:
  {
    ## Safety valve
    if ($loops++ >= $toomanyloops) {
      die "Could not find a unique value, even after $toomanyloops tries!\n";
    }
    ## Build a new value, then test it out
    $value = join '' => @chars[map{rand @chars}(1..$numchars)];
    my $count = $sth->execute($sname,$value);
    $sth->finish();
    redo if $count >= 1;
  } 

  ## Store it and commit the change
  $SQL = 'INSERT INTO alpha_sequence VALUES (?,?)';
  $sth = $dbh->prepare($SQL);
  $sth->execute($sname,$value); ## Does a commit

  ## Only now do we return the value to the caller
  return $value;
$_$;

What's the big difference between this one and the previous version? Rather than examine the alpha_sequence table in our /current/ session, we figure out who and where we are, and make a completely separate connection to the same database using DBI. Then we find an unused value, INSERT that value into the alpha_sequence table, and commit that outside of our current transaction. Only then can we return the value to the caller.
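The same pattern can be tried out without a Postgres server. The Python sketch below uses the standard sqlite3 module purely as a stand-in for Postgres plus DBI: one connection plays the caller's transaction, a second autocommit connection reserves the value. The table names come from this post; the file location, helper names and the demo itself are assumptions for illustration only:

```python
import os
import random
import sqlite3
import tempfile

ALPHABET = "abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789"

def nextvalalpha(side_conn, sname, numchars=5, max_loops=10000):
    """Reserve a unique value through a separate autocommit connection."""
    for _ in range(max_loops):
        value = "".join(random.choice(ALPHABET) for _ in range(numchars))
        taken = side_conn.execute(
            "SELECT 1 FROM alpha_sequence WHERE sname = ? AND value = ?",
            (sname, value),
        ).fetchone()
        if taken is None:
            # Autocommit: the row is committed as soon as execute() returns
            side_conn.execute(
                "INSERT INTO alpha_sequence VALUES (?, ?)", (sname, value)
            )
            return value
    raise RuntimeError("Could not find a unique value after %d tries" % max_loops)

def demo():
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None puts the sqlite3 module in autocommit mode
    main = sqlite3.connect(path, isolation_level=None)
    side = sqlite3.connect(path, isolation_level=None)
    main.execute(
        "CREATE TABLE alpha_sequence (sname TEXT, value TEXT, UNIQUE (sname, value))"
    )
    main.execute("CREATE TABLE seq_test (id TEXT, city TEXT)")

    main.execute("BEGIN")              # the caller's transaction
    value = nextvalalpha(side, "foo")  # committed on the side connection
    main.execute("INSERT INTO seq_test VALUES (?, ?)", (value, "Buzzards Bay"))
    main.execute("ROLLBACK")           # undoes only the caller's insert

    kept = main.execute("SELECT count(*) FROM alpha_sequence").fetchone()[0]
    rows = main.execute("SELECT count(*) FROM seq_test").fetchone()[0]
    main.close()
    side.close()
    return value, kept, rows
```

After demo() runs, the reserved value is still recorded in alpha_sequence even though the caller's own insert into seq_test was rolled back, which is the sequence-like behavior we are after.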

Postgres sequences also have a currval() function, which returns the last value returned via a nextval() on that sequence in the current session. The lastval() function is similar, but it returns the value from the most recent call to nextval(), regardless of which sequence was used. We can make a version of these easily enough, because PL/Perl functions have a built-in shared hash named '%_SHARED'. Thus, we'll add two new lines to the end of the function above:

...
  $sth->execute($sname,$value); ## Does a commit
  $_SHARED{nva_currval}{$sname} = $value;
  $_SHARED{nva_lastval} = $value;
...

Then we create a simple function to display that value, as well as throw an error if called too early - just like currval() does:

DROP FUNCTION IF EXISTS currvalalpha(TEXT);
CREATE FUNCTION currvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  my $sname = shift;
  if (exists $_SHARED{nva_currval}{$sname}) {
    return $_SHARED{nva_currval}{$sname};
  }
  else {
    die qq{currval of text sequence "$sname" is not yet defined in this session\n};
  }
$_$;

Now the lastval() version:

DROP FUNCTION IF EXISTS lastvalalpha();
CREATE FUNCTION lastvalalpha()
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  if (exists $_SHARED{nva_lastval}) {
    return $_SHARED{nva_lastval};
  }
  else {
    die qq{lastval (text) is not yet defined in this session\n};
  }
$_$;

For the next tests, we'll create a normal (integer) sequence, and see how it acts compared to our newly created text sequence:

DROP SEQUENCE IF EXISTS newint;
CREATE SEQUENCE newint START WITH 42;

greg=# SELECT lastval();
ERROR: lastval is not yet defined in this session

greg=# SELECT currval('newint');
ERROR:  currval of sequence "newint" is not yet defined in this session

greg=# SELECT nextval('newint');
 nextval
---------
      42
(1 row)

greg=# SELECT currval('newint');
 currval
---------
      42

greg=# SELECT lastval();
 lastval
---------
      42

greg=# SELECT lastvalalpha();
ERROR: error from Perl function "lastvalalpha": lastval (text) is not yet defined in this session

greg=# SELECT currvalalpha('newtext');
ERROR:  error from Perl function "currvalalpha": currval of text sequence "newtext" is not yet defined in this session

greg=# SELECT nextvalalpha('newtext');
 nextvalalpha
--------------
 rRwJ6

greg=# SELECT currvalalpha('newtext');
 currvalalpha
--------------
 rRwJ6

greg=# SELECT lastvalalpha();
 lastvalalpha
--------------
 rRwJ6
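
The session bookkeeping behind currvalalpha() and lastvalalpha() is small enough to sketch on its own. In this Python illustration a plain object stands in for the per-session %_SHARED hash; the class name and the generate callable are hypothetical, and the error messages simply mirror the ones used above:

```python
class TextSequenceSession:
    """Per-session memory of the values handed out by a text sequence."""

    def __init__(self, generate):
        self._generate = generate  # callable(sname) -> new unique value
        self._currval = {}         # sname -> last value for that sequence
        self._has_last = False
        self._lastval = None

    def nextval(self, sname):
        value = self._generate(sname)
        self._currval[sname] = value
        self._lastval = value
        self._has_last = True
        return value

    def currval(self, sname):
        if sname not in self._currval:
            raise LookupError(
                'currval of text sequence "%s" is not yet defined in this session'
                % sname
            )
        return self._currval[sname]

    def lastval(self):
        if not self._has_last:
            raise LookupError("lastval (text) is not yet defined in this session")
        return self._lastval
```

As with the real functions, currval is tracked per sequence name while lastval only remembers the single most recent value.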

There is one more quick optimization we could make. Since the %_SHARED hash is available across our session, there is no need to do anything in the function more than once if we can cache it away. In this case, we'll cache away the server information we look up, the database handle, and the prepares. Our final function looks like this:

CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  use strict;
  use DBI;
  my $sname = shift;
  my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
  my $numchars = 5;
  my $toomanyloops = 10000;
  my $loops = 0;

  ## Connect to this very database, but with a new session
  if (! exists $_SHARED{nva_dbi}) {
    my $port = spi_exec_query('SHOW port')->{rows}[0]{port};
    my $dbname = spi_exec_query('SELECT current_database()')->{rows}[0]{current_database};
    my $dbuser = spi_exec_query('SELECT current_user')->{rows}[0]{current_user};
    my $dsn = "dbi:Pg:dbname=$dbname;port=$port";
    $_SHARED{nva_dbi} = DBI->connect($dsn, $dbuser, '', {AutoCommit=>1,RaiseError=>1,PrintError=>0});
    my $dbh = $_SHARED{nva_dbi};
    my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = ? AND value = ?';
    $_SHARED{nva_sth_check} = $dbh->prepare($SQL);
    $SQL = 'INSERT INTO alpha_sequence VALUES (?,?)';
    $_SHARED{nva_sth_add} = $dbh->prepare($SQL);
  }


  my $value = '';
  SEARCHING:
  {
    ## Safety valve
    if ($loops++ >= $toomanyloops) {
      die "Could not find a unique value, even after $toomanyloops tries!\n";
    }
    ## Build a new value, then test it out
    $value = join '' => @chars[map{rand @chars}(1..$numchars)];
    my $count = $_SHARED{nva_sth_check}->execute($sname,$value);
    $_SHARED{nva_sth_check}->finish();
    redo if $count >= 1;
  } 

  ## Store it and commit the change
  $_SHARED{nva_sth_add}->execute($sname,$value); ## Does a commit
  $_SHARED{nva_currval}{$sname} = $value;
  $_SHARED{nva_lastval} = $value;
  return $value;
$_$;

Having the ability to reach outside the database in PL/PerlU - even if simply to go back in again! - can be a powerful tool, and allows us to do things that might otherwise seem impossible.

Debugging prepared statements

I was recently handed the all-too-familiar DBA task of answering "why is this script running so slow?". After figuring out exactly which script it was and where it was running from, I narrowed down the large number of SQL commands it was issuing to one particularly slow one, which looked something like this in the pg_stat_activity view:

current_query 
-------------
SELECT DISTINCT id
FROM containers
WHERE code LIKE $1

Although each call ran too quickly to time just by watching pg_stat_activity, the query did show up quite often. So it was likely slower than it should be *and* being called many times in a loop somewhere. The use of 'LIKE' always throws a yellow flag, so those factors encouraged me to look closer at the query.

While the table in question did have an index on the 'code' column, it was not being used. This is because LIKE (on non-C locale databases) cannot work against normal indexes - it needs a simpler character-by-character index. In Postgres, you can achieve this by using some of the built-in operator classes when creating an index. More details can be found at the documentation on operator classes. What I ended up doing was using text_pattern_ops:

SET maintenance_work_mem = '2GB';

CREATE INDEX CONCURRENTLY containers_code_textops
  ON containers (code text_pattern_ops);

Since this was on a production system (yes, I tested on a QA box first!), the CONCURRENTLY phrase ensured that the index did not block any reads or writes on the table while the index was being built. Details on this awesome option can be found in the docs on CREATE INDEX.

After the index was created, the following test query went from 800ms to 0.134ms:

EXPLAIN ANALYZE SELECT * FROM containers WHERE code LIKE 'foobar%';

I then created a copy of the original script, stripped out any parts that made changes to the database, added a rollback to the end of it, and tested the speed. Still slow! Recall that the original query looked like this:

SELECT DISTINCT id
FROM containers
WHERE code LIKE $1

The $1 indicates that this is a prepared query. This leads us to the most important lesson of this post: whenever you see that a prepared statement is being used, it's not enough to test with a normal EXPLAIN or EXPLAIN ANALYZE. You must emulate what the script (e.g. the database driver) is really doing. So from psql, I did the following:

PREPARE foobar(text) AS SELECT DISTINCT id FROM containers WHERE code LIKE $1;
EXPLAIN ANALYZE EXECUTE foobar('foobar%');

Bingo! This time, the new index was *not* being used. This is the great trade-off of prepared statements - while the query only has to be parsed and planned once, the planner cannot anticipate what you might pass in as an argument (here, the pattern could just as well start with a '%'), so it makes the best generic plan possible. Thus, the EXPLAIN output for the same query can look very different depending on whether you use literals or placeholders via PREPARE.
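One way to see why the literal matters: when the pattern is a constant such as 'foobar%', its fixed prefix can be turned into a range condition on the index (roughly code >= 'foobar' AND code < 'foobas'), while a bare $1 offers no prefix to work with. The Python sketch below only illustrates that reasoning; it is not how the Postgres planner is implemented, the function name is made up, and it ignores backslash escapes in the pattern:

```python
def like_prefix_range(pattern):
    """Return (lower, upper) bounds for a left-anchored LIKE pattern, else None."""
    prefix = ""
    for ch in pattern:
        if ch in "%_":
            break              # first wildcard ends the fixed prefix
        prefix += ch
    if not prefix:
        return None            # pattern starts with a wildcard: no usable range
    # Smallest string that sorts after every string starting with the prefix
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper
```

A pattern like '%foobar' yields no range at all, which is the case a generic plan has to allow for.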

While it's possible to make workarounds at the database level for the problem of prepared statements using the "wrong" plan, in this case it was simply easier to tell the existing script not to use prepared statements at all for this one query. As the script was using DBD::Pg, the solution was to simply use the pg_server_prepare attribute like so:

$dbh->{pg_server_prepare} = 0;
my $sth = $dbh->prepare('SELECT DISTINCT id FROM containers WHERE code LIKE ?');
$dbh->{pg_server_prepare} = 1;

The effect of this inside of DBD::Pg is that instead of using PQprepare and then PQexecPrepared for each call to $sth->execute(), DBD::Pg will, for every call to $sth->execute(), quote the parameter itself, build a string containing the original SQL statement and the quoted literal, and send it to the backend via PQexec. Normally not something you want to do, but the slight overhead of doing it that way was completely overshadowed by the speedup of using the new index.

The final result: the script that used to take over 6 hours to run now only takes about 9 minutes to complete. Not only are the people using the script much happier, but it means less load on the database.

Perl+Postgres: changes in DBD::Pg 2.15.1

DBD::Pg, the Perl interface to Postgres, recently released version 2.15.1. The last two weeks have seen a quick flurry of releases: 2.14.0, 2.14.1, 2.15.0, and 2.15.1. Per the usual versioning convention, the numbers on the far right (in this case the "dot one" releases) were simply bug fixes, while 2.14.0 and 2.15.0 introduced API and/or major internal changes. Some of these changes are explained below.

From the Changes file for 2.15.0:

CHANGE:
 - Allow execute_array and bind_param_array to take oddly numbered items, 
   such that DBI will make missing entries undef/null (CPAN bug #39829) [GSM]

The Perl Database Interface (DBI) has a neat feature, known as execute_array, that lets you execute a statement many times in one call. The basic format is to pass in a list of arrays, one array per placeholder, in which each array holds that placeholder's value for every execution. For example:

## Create a simple test table with two columns
$dbh->do('DROP TABLE IF EXISTS people');
$dbh->do('CREATE TABLE people (id int, fname text)');

## Pass in all ids as a single array
my @numbers = (1,2,3);

## Pass in all names as a single array
my @names = ("Garrett", "Viktoria", "Basso");

## Prepare the statement
my $sth = $dbh->prepare('INSERT INTO people VALUES (?, ?)');

## Execute the statement multiple times (three times in this case)
$sth->execute_array(undef, \@numbers, \@names);
## (the first argument is an optional argument hash which we don't use here)

## Pull back and display the rows from our new table
$SQL = 'SELECT id, fname FROM people ORDER BY fname';
for my $row (@{$dbh->selectall_arrayref($SQL)}) {
    print "Found: $row->[0] : $row->[1]\n";
}

$ perl testscript.pl
Found: 3 : Basso
Found: 1 : Garrett
Found: 2 : Viktoria

In 2.15.0, we loosened the requirement that every array contain the same number of items. Per the DBI spec, any "missing" items are considered undef, which maps to a SQL NULL. Thus:

$dbh->do('DROP TABLE IF EXISTS people');
$dbh->do('CREATE TABLE people (id int, fname text)');

## Note that this time there are only two ids given, not three:
my @numbers = (1,2);
my @names = ("Garrett", "Viktoria", "Basso");
my $sth = $dbh->prepare("INSERT INTO people VALUES (?, ?)");

$sth->execute_array(undef, \@numbers, \@names);

## Show a question mark for any null ids
$SQL = q{
SELECT CASE WHEN id IS NULL THEN '?' ELSE id::text END, fname 
FROM people ORDER BY fname
};
for my $row (@{$dbh->selectall_arrayref($SQL)}) {
    print "Found: $row->[0] : $row->[1]\n";
}

$ perl testscript2.pl
Found: ? : Basso
Found: 1 : Garrett
Found: 2 : Viktoria
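
The padding rule is easy to picture outside of Perl. As a rough analogy (this is not DBI code, and the function name is made up), Python's itertools.zip_longest does the same thing when it turns per-placeholder arrays into per-execution rows, with None playing the role of undef/NULL:

```python
from itertools import zip_longest

def rows_for_execute_array(*columns):
    """Turn per-placeholder arrays into per-execution tuples, padding with None."""
    return list(zip_longest(*columns, fillvalue=None))

numbers = [1, 2]
names = ["Garrett", "Viktoria", "Basso"]
rows = rows_for_execute_array(numbers, names)
```

Here rows holds three tuples, and the third one pairs None with "Basso", matching the row with the NULL id in the output above.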

Also note that bind_param_array is an alternate way to add the list of arrays before the execute is called. This is similar in concept to a regular execute: if you bind the values first, you can call execute without any arguments:

...
$sth->bind_param_array(1, \@numbers);
$sth->bind_param_array(2, \@names);
$sth->execute_array(undef);
...

CHANGE:
 - Use PQexecPrepared even when no placeholders (CPAN bug #48155) [GSM]

Sending queries to Postgres via DBD::Pg usually involves two steps: prepare and execute. The prepare is done one time, while the execute can be called many times, often with different arguments. Previously, DBD::Pg would call PQexec for queries that had no placeholders. However, the ability to handle placeholders smoothly is only one advantage of using server-side prepares in Postgres. The other advantage is that Postgres only has to parse the query a single time, in the initial prepare. In 2.15.0, we use PQexecPrepared for all queries, whether they have placeholders or not. The upshot of this is that multiple calls to the execute() function will be a little bit faster, and that we only use PQexec when we really have to.


CHANGE:
 - Fix quoting of booleans to respect more Perlish variants (CPAN bug #41565) [GSM]

In previous versions, the mapping of Perl vars to booleans was very simple, and did only simple 0/1 mapping. However, Perl's notion of "truth" is richer than that. We can now do things like this:

for my $name ('0', '1', '0E0', '0 but true', 'F', 'T', 'TRUE', 'false') {
  printf qq{Value '%s' is %s\n}, $name, $dbh->quote($name, {pg_type => PG_BOOL});
}

$ perl testscript3.pl
Value '0' is FALSE
Value '1' is TRUE
Value '0E0' is TRUE
Value '0 but true' is TRUE
Value 'F' is FALSE
Value 'T' is TRUE
Value 'TRUE' is TRUE
Value 'false' is FALSE

CHANGE:
  - Return ints and bools-cast-to-number from the db as true Perlish numbers.
    (CPAN bug #47619) [GSM]

This one is a little more subtle. When a value is returned from the database, it gets mapped back to a string. So even if the value in the database came from an INTEGER column, by the time it made its way back to your Perl script it was a string that happened to hold an integer value. DBD::Pg now attempts to cast some types to their Perl equivalent. This is normally hard to see without peering inside Perl internals, but using Data::Dumper can show you the difference:

## Ask Postgres to return a string and an integer
$SQL = 'SELECT 123::text, 123::integer';
$info = $dbh->selectall_arrayref($SQL)->[0];
print Dumper $info;

## Older versions of DBD::Pg give:
$VAR1 = [
          '123',
          '123'
        ];

## The new and improved version gives:
$VAR1 = [
          '123',
          123
        ];

A small difference, but not unimportant - this change came about through a bug request, as it was causing problems when DBD::Pg was interacting with JSON::XS. Special thanks to Tim Bunce, (author of DBI, maintainer of the amazing NYTProf, and all around Perl guru) who found an important bug regarding this solution in 2.14.0, which led to the quick release of 2.14.1. Lesson learned: don't try converting ints to floats via sv_setnv.
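
To see why a JSON encoder cares about the distinction, here is the same effect in Python's standard json module. This is only an analogy for the JSON::XS behavior mentioned above, with made-up variable names:

```python
import json

def encode_row(row):
    """Serialize one result row as a JSON array."""
    return json.dumps(row)

old_style = encode_row(["123", "123"])  # integer column arrives as a string
new_style = encode_row(["123", 123])    # integer column arrives as a number
```

In the first case the integer is emitted with quotes around it, in the second as a bare number, and anything consuming the JSON on the other end may well treat the two differently.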


Most of the other changes to 2.14 and 2.15 are bug fixes of one sort or another. To keep up on the changes or to talk about the project more, please join the mailing list.

More PostgreSQL and SystemTap

Recently I've been working on a database with many multi-column indexes, and I've wondered how often all the columns of the index were used. Many of the indexes in question are primary key indexes, and I need all the columns to guarantee uniqueness, but for non-unique indexes, it would make sense to remove as many columns from the index as possible. Especially with PostgreSQL 8.3 or greater, where I can take advantage of heap-only tuples[1], leaving columns out of the index would be a big win. PostgreSQL's statistics collector will already tell me how often an index is scanned. That shows up in pg_stat_all_indexes. But for a hypothetical index scanned 100 times, there's no way to know how many of those 100 scans used all the columns of the index, or, for instance, just the first column.

First, an example. I'll create a table with three integer columns, and fill it with random data:

5432 josh@josh# CREATE TABLE a (i INTEGER, j INTEGER, k INTEGER);
CREATE TABLE
5432 josh@josh*# INSERT INTO a SELECT i, j, k FROM (SELECT FLOOR(RANDOM() * 10) AS i, FLOOR(RANDOM() * 100) AS j, FLOOR(RANDOM() * 1000) AS k, GENERATE_SERIES(1, 1000)) f;
INSERT 0 1000
5432 josh@josh*# CREATE INDEX a_ix ON a (i, j, k);
CREATE INDEX
5432 josh@josh*# ANALYZE a;
ANALYZE
5432 josh@josh*# COMMIT;
COMMIT

This leaves me with a three-column index on 1000 rows of the following:

5432 josh@josh*# SELECT * FROM a LIMIT 10;
 i | j  |  k  
---+----+-----
 3 |  6 | 380
 7 | 94 | 933
 1 | 73 | 326
 2 | 86 | 224
 2 | 59 | 336
 9 | 44 | 220
 9 | 48 | 694
 3 | 27 | 268
 3 |  0 | 410
 8 | 25 | 337
(10 rows)

Now I need to make a query that will use the index. That's easy enough, with these two queries. As shown by the index condition, the first query uses all three columns of the index, and the second, only two.

5432 josh@josh# EXPLAIN SELECT * FROM a WHERE i > 8 AND j > 80 AND k > 800;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Bitmap Heap Scan on a  (cost=5.64..10.74 rows=4 width=12)
   Recheck Cond: ((i > 8) AND (j > 80) AND (k > 800))
   ->  Bitmap Index Scan on a_ix  (cost=0.00..5.64 rows=4 width=0)
         Index Cond: ((i > 8) AND (j > 80) AND (k > 800))
(4 rows)

5432 josh@josh*# EXPLAIN SELECT * FROM a WHERE i > 8 AND j > 80;
                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on a  (cost=5.37..10.67 rows=20 width=12)
   Recheck Cond: ((i > 8) AND (j > 80))
   ->  Bitmap Index Scan on a_ix  (cost=0.00..5.36 rows=20 width=0)
         Index Cond: ((i > 8) AND (j > 80))
(4 rows)

Inside PostgreSQL, these queries result in a call to _bt_first() inside src/backend/access/nbtree/nbtsearch.c. This function takes two parameters: an IndexScanDesc object called scan, which describes the index to scan, the key to look for, and some other stuff, and a ScanDirection parameter to tell _bt_first() which direction to scan the index. It's this call that tells the statistics collector about each index scan, and it's this call that we'll instrument with SystemTap. I'm interested in the value in scan->numberOfKeys, which tells me how many of the index's keys will be considered in each scan. SystemTap makes getting this information really easy. I gave an introduction to SystemTap and using it with PostgreSQL in an earlier post; the following assumes familiarity with that material.

Since PostgreSQL doesn't come with a DTrace probe built into the _bt_first() function, I'll use SystemTap's ability to probe directly into a function. Conveniently, SystemTap also allows access to the values of variables in the function's memory space at runtime. Note that the technique shown below requires a PostgreSQL binary built with --enable-debug. Without debug information in the binary, different techniques are used, and the information is harder to get.

The test script I used is as follows:

probe process("/usr/local/pgsql/bin/postgres").function("_bt_first")
{
        /* Time of call */
        printf("_bt_first at time %d\n", get_cycles())
        /* Number of scan keys */
        printf("%d scan keys\n", $scan->numberOfKeys)
        /* OID of index being scanned */
        printf("%u index oid\n\n", $scan->indexRelation->rd_id)
}

Note that the script above accesses variables in the _bt_first() function just as standard C functions would. The script has the following output:

[josh@localhost ~]$ sudo /usr/local/bin/stap -v test.d
Pass 1: parsed user script and 59 library script(s) in 130usr/70sys/196real ms.
Pass 2: analyzed script: 2 probe(s), 3 function(s), 0 embed(s), 0 global(s) in 50usr/50sys/103real ms.
Pass 3: translated to C into "/tmp/stapzCCwZE/stap_1854c2da59908c3e3633d6385ca6ce52_2782.c" in 120usr/80sys/209real ms.
Pass 4, preamble: (re)building SystemTap's version of uprobes.
Pass 4: compiled C into "stap_1854c2da59908c3e3633d6385ca6ce52_2782.ko" in 2240usr/3270sys/8102real ms.
Pass 5: starting run.
_bt_first at time 49379911010213
1 scan keys
2703 index oid

_bt_first at time 49379982691988
1 scan keys
2684 index oid

_bt_first at time 49379987397126
1 scan keys
2684 index oid

You'll note several indexes get scanned immediately. These are indexes from the PostgreSQL catalog. The index we created above has OID 16388. First, I'll run the query with three scan keys, followed by the query with two keys:

_bt_first at time 50357469430819
3 scan keys
16388 index oid

_bt_first at time 50363763650571
2 scan keys
16388 index oid

As expected, SystemTap reported first three and then two scan keys used, along with the OID of the a_ix index I created. With a technique like this I could, at least theoretically, get an exact usage profile for each index, and determine whether they need all the columns they have.
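
Turning that raw output into a usage profile is mostly a counting exercise. The Python sketch below assumes the exact three-line format printed by the test script above and tallies how often each index was scanned with each number of keys; the function name is hypothetical:

```python
import re
from collections import Counter

def scan_key_profile(stap_output):
    """Count _bt_first() calls per (index OID, number of scan keys)."""
    profile = Counter()
    keys = None
    for line in stap_output.splitlines():
        line = line.strip()
        m = re.match(r"(\d+) scan keys$", line)
        if m:
            keys = int(m.group(1))       # remember until the OID line arrives
            continue
        m = re.match(r"(\d+) index oid$", line)
        if m and keys is not None:
            profile[(int(m.group(1)), keys)] += 1
            keys = None
    return profile
```

Fed the two records above, the profile would hold one scan of index 16388 with three keys and one with two. An index whose trailing columns never show up in the counts is a candidate for slimming down.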

[1] See, for example, this page.

Slony, sl_status and diagnosing a particular type of lag

During some routine checking on a slony cluster, Greg noticed something curious. Replication was still happening between the master and a couple slaves, but we were seeing our indicator for lag inside of slony increasing.

To check out the status of slony replication, you will typically take a look at the view ‘sl_status’:

mydatabase=# select * from sl_status; 
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts | st_lag_num_events |       st_lag_time       
-----------+-------------+---------------+----------------------------+------------------+----------------------------+---------------------------+-------------------+-------------------------
         2 |           1 |       2697511 | 2008-04-30 02:40:06.034144 |          2565031 | 2008-04-14 15:31:32.897165 | 2008-04-14 16:24:08.81738 |            132480 | 15 days 10:16:03.060499
(1 row)

This view pulls data out of sl_event and sl_confirm, two tables that keep track of the forward progress of replication. Every time there is an event - SYNCs, DDL changes, slony administrative events - a row is added to sl_event. Slony is very chatty and so all of the slaves send events to each other, as well as the master. (That statement is a simplification, and it is possible to make some configuration changes that reduce the traffic, but in general, this is what people who set up slony will see.)

Broken down, the columns are:

st_origin: the local slony system
st_received: the slony instance that sent an event
st_last_event: the sequence number of the last event received from that origin/received pair
st_last_event_ts: the timestamp on the last event received
st_last_received: the sequence number of the last sl_event + sl_confirm pair received
st_last_received_ts: the timestamp on the sl_confirm in that pair
st_last_received_event_ts: the timestamp on the sl_event in that pair
st_lag_num_events: difference between st_last_event and st_last_received
st_lag_time: time elapsed since st_last_received_event_ts (the view computes it as the current time minus that timestamp, which matches the row shown above)
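
As a quick sanity check on the event-count column, the arithmetic can be reproduced from the row shown above. This Python sketch also wraps it in a simple threshold test of the kind a monitor might use; the threshold value and function names are assumptions, not anything slony provides:

```python
def lag_num_events(status):
    """Events generated at the origin but not yet confirmed."""
    return status["st_last_event"] - status["st_last_received"]

def is_lagging(status, max_events=1000):
    """True when the unconfirmed backlog exceeds a (hypothetical) threshold."""
    return lag_num_events(status) > max_events

# Values taken from the sl_status row above
row = {"st_last_event": 2697511, "st_last_received": 2565031}
backlog = lag_num_events(row)
```

For this row the backlog works out to 132480, the same number the view reports in st_lag_num_events.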

Depending on the type of event, a row might be added to sl_confirm immediately (by the same thread that created the event), or this may be created separately by another process. The important thing here is that there is a separation between sl_event and sl_confirm, so it is possible for sl_event SYNCs (replication events) to continue to come through and be applied to the server, without the corresponding sl_confirm rows ever being created.

We have a monitor which checks the status of replication by looking at a recently added value on the master and comparing that to what is on the slave. This works well for workloads that are primarily append-only. So, that monitor thought replication was working fine, even though the lag was increasing steadily.

The sl_event and sl_confirm tables are periodically cleaned up by cleanupEvent(), which slony runs automatically. Typically, this function is run every 100 seconds. When the slon process kicks it off, it checks to see what the newest confirmed events are, then deletes old event records and old confirm rows.

When confirms stop coming through, sl_events can’t be cleaned up on the affected server (because they haven’t been confirmed!). Depending on how active your servers are, this will eat up disk space. But you’ve got disk space monitors in place, right? :)

So, how do you fix the problem when the confirms stop coming through?

I had a look at process tables on all the slon slaves, and noticed that on the two lagged systems, there was no incoming connection from the master slony system. The fix: restart slony on the master so that it could reconnect.

There are a couple of things I wish slony had told me:

  • Notification on the slave that it no longer had its connection back to the master. We’ll set up our own monitors to detect that this connection no longer exists, but it would be much nicer for slony to warn about this. Additionally, it would be nice to be able to re-connect to a single slave without restarting slon entirely.
  • More explanation about sl_confirm and likely causes of failed confirmations. I hope I’ve shed a little light with this blog post.

The documentation for setting up slony is very good, but the troubleshooting information is lacking around events and confirmations, and how each type of event and confirmation actually happens. I’m happy to be proven wrong -- so please leave pointers in the comments!

Comparing databases with check_postgres

One of the more recent additions to check_postgres, the all-singing, all-dancing Postgres monitoring tool, is the "same_schema" action. This was necessitated by clients who wanted to make sure that their schemas were identical across different servers. The two use cases I've seen are servers that are being replicated by Bucardo or Slony, and servers that are doing horizontal sharding (e.g. same schema and database on different servers: which server you go to depends on (for example) your customer id). Oft times a new index fails to make it to one of the slaves, or some function is tweaked on one server by a developer, who then forgets to change it back or propagate it. This program allows a quick and automatable check for such problems.

The idea behind the same_schema check is simple: we walk the schema and check for any differences, then throw a warning if any are found. In this case, we're using the term "schema" in the classic sense of a description of your database objects. Thus, one of the things we check is that all the schemas (in the classic RDBMS sense of a container of other database objects) are the same, when running the "same_schema" check. Only slightly confusing. :)
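
The "walk and compare" idea can be sketched in a few lines of Python. This is not how check_postgres is implemented; it is a toy that assumes each database has already been summarized as a dictionary mapping table names to column types, and the function name and message wording are only loosely modeled on the tool's output:

```python
def diff_schemas(db1, db2):
    """Compare two {table: {column: type}} descriptions and list differences."""
    problems = []
    for table in sorted(set(db1) - set(db2)):
        problems.append("Table in 1 but not 2: %s" % table)
    for table in sorted(set(db2) - set(db1)):
        problems.append("Table in 2 but not 1: %s" % table)
    for table in sorted(set(db1) & set(db2)):
        cols1, cols2 = db1[table], db2[table]
        for col in sorted(set(cols1) | set(cols2)):
            if col not in cols2:
                problems.append('Column "%s" of "%s" is missing on 2.' % (col, table))
            elif col not in cols1:
                problems.append('Column "%s" of "%s" is missing on 1.' % (col, table))
            elif cols1[col] != cols2[col]:
                problems.append(
                    'Column "%s" of "%s": type is %s on 1, but %s on 2.'
                    % (col, table, cols1[col], cols2[col])
                )
    return problems
```

An empty list means the two descriptions match; anything else is a difference worth a warning. The real check covers many more object types than tables and columns.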

Not only is this program nice for monitoring (e.g. as a Nagios check), but if you pass in a --verbose argument, you get a simple not-all-on-one-line breakdown of all the differences between the two databases. Let's do a quick example.

First, we download and install check_postgres. We'll pull straight from a git repository for check_postgres. While we have our own repo at bucardo.org, we are also keeping it in sync with a tree at github.com, so we'll use that one:

git clone git://github.com/bucardo/check_postgres.git
cd check_postgres
perl Makefile.PL
make
make test
sudo make install

Let's create a Postgres cluster with the initdb command, start it up, then create two new databases to compare to each other.

initdb -D cptest
echo port=5555 >> cptest/postgresql.conf
pg_ctl -D cptest -l cp.log start
psql -p 5555 -c 'CREATE DATABASE yin'
psql -p 5555 -c 'CREATE DATABASE yang'

We're ready to run the script. By default, it outputs things in a Nagios-friendly manner. We should see an 'OK' because the two databases are identical:

./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang

POSTGRES_SAME_SCHEMA OK: DB "yin" (port=5555 => 5555) Both databases have identical items | time=0.01

The message could be clearer and show both database names, but the check worked and showed that things are exactly the same. Let's throw some differences in and run it again:

psql -p 5555 -d yin -c 'create table foobar(a int primary key, b text, c text)'
psql -p 5555 -d yang -c 'create table foobar(a int, b text, c varchar(99))'
psql -p 5555 -d yin -c 'create schema yinonly'
psql -p 5555 -d yang -c 'create table pineapple(id int)'

./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang

POSTGRES_SAME_SCHEMA CRITICAL: DB "yin" (port=5555 => 5555) Databases were different. Items not matched: 5 | time=0.01
Schema in 1 but not 2: yinonly  Table in 2 but not 1: public.pineapple  Column "a" of "public.foobar": nullable is NO on 1, but YES on 2.  Column "c" of "public.foobar": type is text on 1, but character varying on 2.  Table "public.foobar" on 1 has constraint "public.foobar_pkey", but 2 does not. 

It works, but it's a little messy for human consumption. Nagios requires everything to be in a single line, but we'll add a --verbose argument to ask the script for prettier formatting:

./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang --verbose

POSTGRES_SAME_SCHEMA CRITICAL: DB "yin" (port=5555 => 5555) Databases were different. Items not matched: 5 | time=0.01
Schema in 1 but not 2: yinonly
Table in 2 but not 1: public.pineapple
Column "a" of "public.foobar": nullable is NO on 1, but YES on 2.
Column "c" of "public.foobar": type is text on 1, but character varying on 2.
Table "public.foobar" on 1 has constraint "public.foobar_pkey", but 2 does not.
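The walk-and-compare idea can be sketched in a few lines of Python. This is only a toy model and not how check_postgres (a Perl script that queries the system catalogs) is actually implemented: the dictionary layout and the diff_schemas name are made up for illustration, and only schemas, tables, and column types are covered.

```python
# Toy model of a same_schema-style comparison. The data layout is invented
# for illustration; the real tool reads the Postgres system catalogs.

def diff_schemas(db1, db2):
    """Return a list of human-readable differences between two descriptions."""
    problems = []

    # Schemas and tables: report anything present on only one side.
    for kind, label in (("schemas", "Schema"), ("tables", "Table")):
        names1, names2 = set(db1[kind]), set(db2[kind])
        for name in sorted(names1 - names2):
            problems.append(f"{label} in 1 but not 2: {name}")
        for name in sorted(names2 - names1):
            problems.append(f"{label} in 2 but not 1: {name}")

    # Column types: only compare columns of tables that exist on both sides.
    for table in sorted(set(db1["tables"]) & set(db2["tables"])):
        cols1, cols2 = db1["tables"][table], db2["tables"][table]
        for col in sorted(set(cols1) & set(cols2)):
            if cols1[col] != cols2[col]:
                problems.append(
                    f'Column "{col}" of "{table}": type is {cols1[col]} on 1, '
                    f"but {cols2[col]} on 2."
                )
    return problems


# Hand-written descriptions loosely mirroring the yin and yang databases.
yin = {
    "schemas": ["public", "yinonly"],
    "tables": {"public.foobar": {"a": "integer", "b": "text", "c": "text"}},
}
yang = {
    "schemas": ["public"],
    "tables": {
        "public.foobar": {"a": "integer", "b": "text", "c": "character varying"},
        "public.pineapple": {"id": "integer"},
    },
}

for line in diff_schemas(yin, yang):
    print(line)
```

Nullability and constraints are left out to keep the sketch short; adding them would mean one more nested comparison per attribute.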

There are also ways to filter the output, for times when you have known differences. For example, to exclude any tables with the word 'bucardo' in them, you could add this argument:

--warning="notable=bucardo"

The online documentation has more details about all the filtering options.
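The effect of such an exclusion can be pictured as a filter applied to the list of table names before any comparing happens. The Python sketch below is an illustration only: filter_tables is a hypothetical helper, the table names are examples, and treating the value as a regular expression is an assumption here, so check the documentation for the exact matching rules.

```python
import re


def filter_tables(table_names, notable=None):
    """Drop table names that match the `notable` pattern (assumed to be a regex)."""
    if notable is None:
        return list(table_names)
    pattern = re.compile(notable)
    return [name for name in table_names if not pattern.search(name)]


# Example names only; any table whose name contains "bucardo" is skipped.
tables = ["public.foobar", "bucardo.bucardo_delta", "public.pineapple"]
print(filter_tables(tables, notable="bucardo"))
```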

So what kind of things do we check for? Right now, we are checking:

  • users (existence and powers, i.e. createdb, superuser)
  • schemas
  • tables
  • sequences
  • views
  • triggers
  • constraints
  • columns
  • functions (including volatility, strictness, etc.)

Got something else we aren't covering? Send in a patch, or a quick request, to the mailing list.

OSCON so far! Filesystem information bonanza on Wednesday

Wednesday was the first official day of OSCON, and I spent it elbow deep in filesystems. The morning was kicked off with Val Aurora delivering a great overview of Btrfs, a new filesystem currently in development. Some of the features include:

  • Copy on write filesystem
  • Cheap, easy filesystem snapshots
  • Dynamically resizable partitions
  • Indexed directory structure
  • Very simple administration

Val demonstrated basic functionality, including creating snapshots and creating a Btrfs filesystem on top of an ext3 filesystem. Cool stuff! The filesystem is still under heavy development, but seems very promising.

Next I saw Theodore Ts'o, the primary developer behind ext4, talk about the future of filesystems and storage. He referenced a great paper that dives deep into the economics behind SSD (solid state drive) and platter hard drive manufacturing. One interesting calculation was that even if we could convert all the silicon fabs to manufacture flash, we would only be able to cover about 12% of the worldwide capacity of hard drive production. Because of this, Theodore believes that it is going to be challenging for the cost of SSDs to drop to the point where they become cost competitive with hard drives.

Other observations from Theodore concerned the slowing of innovation around hard drives, and companies like Seagate cutting back in their R&D departments. He sees opportunity for software and filesystem innovation in this environment, and so far that is playing out in the rapid development of new filesystems for Linux (NILFS2, POHMELFS, and EXOFS as three recent examples). One open issue he brought up is the need for more and better benchmarking tools.

In the afternoon, I presented Linux Filesystem Performance for Databases. I've uploaded the slides to the conference site. I talked about the work that the Portland PostgreSQL Performance Pad team did on filesystem testing with some hardware donated from HP. I also included results from some recent DBT-2 tests Mark had run with PostgreSQL, using pgtune and then refining a few key parameters.

There were quite a few interesting questions, and I talked to one of the Wikia admins about a recent change he'd made to use SSDs instead of hard drives in some of their servers. I mentioned that it would be great to see a case study and data from his experience.

pgGearman 0.1 release!

Yesterday, Brian Aker and Eric Day presented pgGearman: A distributed worker queue for PostgreSQL during the OSCON/SFPUG PgDay.

Gearman is a distributed worker queuing system that allows you to farm work out to a collection of servers, and basically run arbitrary operations. The example they presented was automating and distributing the load of image processing for Livejournal. For example, everyone loves to share pictures of their kittens, but once an image is uploaded, it may need to be scaled or cropped in different ways to display in different contexts. Gearman is a tool you can use to farm these types of jobs out.

So, in anticipation of the talk, I worked with Eric Day on a set of C-language user defined functions for Postgres that allow client connections to a Gearman server.

You can try out the pgGearman 0.1 release on Launchpad!

Subverting PostgreSQL Aggregates for Pentaho

In a recent post I described MDX and a project I'm working on with the Mondrian MDX engine. In this post I'll describe a system I implemented to overcome one of Mondrian's limitations.

Each Mondrian measure has an associated aggregate function defined. For instance, here's a measure from the sample data that ships with Pentaho:

<Measure name="Quantity" column="QUANTITYORDERED" aggregator="sum" />

The schema defines the database connection properties and the table this cube deals with elsewhere; this line says there's a column called QUANTITYORDERED which Mondrian can meaningfully aggregate with the sum() function. Mondrian knows about six aggregates: count, avg, sum, min, max, and distinct-count. And therein lies the problem. In this case, the client wanted to use other aggregates such as median and standard deviation, but Mondrian didn't provide them[1].

Mondrian uses the aggregator attribute of the measure definition to generate SQL statements exactly as you might expect. In the case of the measure above, the SQL query involving that measure would read "sum(QUANTITYORDERED)". In our case, Mondrian is backed by a PostgreSQL database, which offers a much richer set of aggregates (such as stddev() for the standard deviation, one of the numbers we need), but Mondrian doesn't know how to get to them.

Measures can be defined in terms of SQL expressions, rather than simple column names, but this doesn't immediately help. If I wanted the standard deviation of the quantity ordered, I might try something like this:

<Measure name="Quantity">
    <KeyExpression><SQL dialect="postgres">
        stddev(quantityordered)
    </SQL></KeyExpression>
</Measure>

Here, Mondrian would complain that the measure was defined without an aggregator attribute. And if I define one, such as sum, the resulting SQL becomes "sum(stddev(quantityordered))", which is illegal and makes PostgreSQL complain about nested aggregates.

But PostgreSQL's function overloading can help here. Mondrian's generated SQL will always include a call to a "count()" function if the aggregator is defined as "count", but there's no reason we can't make PostgreSQL use some other count() function. For instance, let's define a new "count()" function that isn't an aggregate, but simply returns whatever argument it is passed. Then we can use it to wrap whatever function we want, including arbitrary aggregate functions.

Consider an attempt to get Mondrian to use the stddev() aggregate. It returns a DOUBLE PRECISION type, so our fake count function must simply accept a DOUBLE PRECISION variable and return it:

CREATE FUNCTION count(DOUBLE PRECISION) RETURNS DOUBLE PRECISION AS $$
    SELECT $1
$$ LANGUAGE SQL IMMUTABLE;

Then we define a measure like this:

<Measure name="Quantity Std. Dev" aggregator="count">
    <KeyExpression><SQL dialect="postgres">
        stddev(quantityordered)
    </SQL></KeyExpression>
</Measure>

The resulting SQL is "count(stddev(quantityordered))", but in this case PostgreSQL uses our new count() function, and we get exactly the return value we want.

There's a catch: if we have a double precision column "foo" in a table "bar", and write:

SELECT count(foo) FROM bar;

...it uses our new count function, and rather than returning the number of rows in bar, it returns the value for foo from each row in bar.

To get around this problem, we can define a new data type. We'll write a function to create that datatype from another data type, and rewrite our count function to accept only that data type, and return the original data type, like this:

CREATE TYPE dp_cust AS (
    dp DOUBLE PRECISION); 

CREATE FUNCTION make_dpcust(a DOUBLE PRECISION) RETURNS dp_cust IMMUTABLE AS $$
DECLARE
    dpc dp_cust;
BEGIN
    dpc.dp := a;
    RETURN dpc;
END;
$$ LANGUAGE plpgsql;

DROP FUNCTION count(double precision);

CREATE FUNCTION count(dp_cust) RETURNS DOUBLE PRECISION IMMUTABLE AS $$
    SELECT $1.dp
$$ LANGUAGE sql;

Now our count() function will only be called when we're dealing with the dp_cust type, and we can control precisely when that happens, because the only way we make dp_cust values will be with the make_dpcust function. Our measure now looks like this:

<Measure name="Quantity Std. Dev" aggregator="count">
    <KeyExpression><SQL dialect="postgres">
        make_dpcust(stddev(quantityordered))
    </SQL></KeyExpression>
</Measure>

With this new data type and our custom count() function we can use whatever PostgreSQL aggregate we want as a measure aggregate in Mondrian.
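For readers who think more easily in terms of type-based dispatch than SQL overloading, here is a loose Python analogy. It is only an analogy: Python has no notion of aggregate versus scalar functions, and the names DpCust and make_dpcust merely mirror the SQL objects above. The point it illustrates is that the pass-through version of count() is picked only for the wrapper type, so ordinary values still get the ordinary count.

```python
from dataclasses import dataclass
from functools import singledispatch
from statistics import stdev


@dataclass(frozen=True)
class DpCust:
    """Stand-in for the dp_cust composite type: a wrapped double."""
    dp: float


@singledispatch
def count(arg):
    """Stand-in for the built-in count(): the number of values passed in."""
    return len(arg)


@count.register
def _(arg: DpCust):
    """Stand-in for count(dp_cust): unwrap the value and return it untouched."""
    return arg.dp


def make_dpcust(value):
    """Stand-in for make_dpcust(): the only place DpCust values get created."""
    return DpCust(float(value))


quantities = [30.0, 50.0, 22.0, 49.0]  # made-up sample data

print(count(quantities))                       # ordinary count of the values
print(count(make_dpcust(stdev(quantities))))   # the standard deviation, passed through
```

Python's statistics.stdev computes the sample standard deviation, which is also what PostgreSQL's stddev() returns.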

[1] Note that the Mondrian developers already recognize this as a shortcoming worth removing. Allowing user-defined aggregates is on the Mondrian roadmap.

Bucardo and truncate triggers

Version 8.4 of Postgres was recently released. One of the features that hasn't gotten a lot of press, but which I'm excited about, is truncate triggers. This fixes a critical hole in trigger-based PostgreSQL replication systems, and support for these new triggers is now working in the Bucardo replication program.

Truncate triggers were added to Postgres by Simon Riggs (thanks Simon!), and unlike other types of triggers (UPDATE, DELETE, and INSERT), they are statement-level only, as truncate is not a row-level action.

Here's a quick demo showing off the new triggers. This is using the development version of Bucardo - a major new version is expected to be released in the next week or two that will include truncate trigger support and many other things. If you want to try this out for yourself, just run:

$ git clone http://bucardo.org/bucardo.git/

Bucardo does three types of replication; for this example, we'll be using the 'pushdelta' method, which is your basic "master to slaves" relationship. In addition to the master database (which we'll name A) and the slave database (which we'll name B), we'll create a third database for Bucardo itself.

$ initdb -D bcdata
$ initdb -D testA 
$ initdb -D testB 

(Technically, we are creating three new database clusters, and since we are doing this as the postgres user, the default database for all three will be 'postgres'.)

Let's give them all unique port numbers:

$ echo port=5400 >> bcdata/postgresql.conf
$ echo port=5401 >> testA/postgresql.conf 
$ echo port=5402 >> testB/postgresql.conf 

Now start them all up:

$ pg_ctl start -D bcdata -l bc.log
$ pg_ctl start -D testA -l A.log
$ pg_ctl start -D testB -l B.log

We'll create a simple test table on both sides:

$ psql -d postgres -p 5401 -c 'CREATE TABLE trtest(id int primary key)'
$ psql -d postgres -p 5402 -c 'CREATE TABLE trtest(id int primary key)'

Before we go any further, let's install Bucardo itself. Bucardo is a Perl daemon that uses a central database to store its configuration information. The first step is to create the Bucardo schema. This, like almost everything else with Bucardo, is done with the 'bucardo_ctl' script. The install process is interactive:

$ bucardo_ctl install --dbport=5400

This will install the bucardo database into an existing Postgres cluster.
Postgres must have been compiled with Perl support,
and you must connect as a superuser

We will create a new superuser named 'bucardo',
and make it the owner of a new database named 'bucardo'

Current connection settings:
1. Host:          
2. Port:          5400
3. User:          postgres
4. PID directory: /var/run/bucardo
Enter a number to change it, P to proceed, or Q to quit: P

Version is: 8.4
Attempting to create and populate the bucardo database and schema
Database creation is complete

Connecting to database 'bucardo' as user 'bucardo'
Updated configuration setting "piddir"
Installation is now complete.

If you see any unexpected errors above, please report them to bucardo-general@bucardo.org

You should probably check over the configuration variables next, by running:
bucardo_ctl show all
Change any setting by using: bucardo_ctl set foo=bar

Because we don't want to tell the bucardo_ctl program our custom port each time we call it, we'll store that info into the ~/.bucardorc file:

$ echo dbport=5400 > ~/.bucardorc

Let's double check that everything went okay by checking the list of databases that Bucardo knows about:

$ bucardo_ctl list db
There are no entries in the 'db' table.

Time to teach Bucardo about our two new databases. The format for the add commands is: bucardo_ctl add [type of thing] [name of thing within the database] [arguments of foo=bar format]

$ bucardo_ctl add database postgres name=master port=5401
Database added: master

$ bucardo_ctl add database postgres name=slave1 port=5402
Database added: slave1

Before we go any further, let's look at our databases:

$ bucardo_ctl list dbs
Database: master   Status: active
Conn: psql -h  -p 5401 -U bucardo -d postgres

Database: slave1   Status: active
Conn: psql -h  -p 5402 -U bucardo -d postgres

Note that by default we connect as the 'bucardo' user. This is a highly recommended practice, for safety and auditing. Since that user obviously does not exist on the newly created databases, we need to add them in:

$ psql -p 5401 -c 'create user bucardo superuser'
$ psql -p 5402 -c 'create user bucardo superuser'

Now we need to teach Bucardo about the tables we want to replicate:

$ bucardo_ctl add table trtest db=master herd=herd1
Created herd "herd1"
Table added: public.trtest

A herd is simply a named collection of tables. Typically, you put tables that are linked together by foreign keys or other logic into a herd so that they all get replicated at the same time.

The final setup step is to create a replication event, which in Bucardo is known as a 'sync':

$ bucardo_ctl add sync willow source=herd1 targetdb=slave1 type=pushdelta
NOTICE:  Starting validate_sync for willow
CONTEXT:  SQL statement "SELECT validate_sync('willow')"
Sync added: willow

This command actually did quite a bit of work behind the scenes, including creating all the supporting schemas, tables, functions, triggers, and indexes that Bucardo will need.

We are now ready to start Bucardo up. Simple enough:

$ bucardo_ctl start
Checking for existing processes
Starting Bucardo

Let's add a row to the master table and make sure it goes to the slave:

$ psql -p 5401 -c 'insert into trtest(id) VALUES (1)'
INSERT 0 1
$ psql -p 5402 -c 'select * from trtest'
 id
----
  1
(1 row)

Looks fine, so let's try out the truncate. On versions of Postgres earlier than 8.4, there was no way for Bucardo (or Slony) to know that a truncate had been run, so the rows were removed from the master but not from the slave. We'll do a truncate and add a new row in a single transaction:

$ psql -p 5401 -c 'begin; truncate table trtest; insert into trtest values (2); commit'
COMMIT
$ psql -p 5402 -c 'select * from trtest'
 id
----
  2
(1 row)

It works! Let's clean up our test environment for good measure:

$ bucardo_ctl stop
$ pg_ctl stop -D bcdata
$ pg_ctl stop -D testA
$ pg_ctl stop -D testB

As mentioned, there are three types of syncs in Bucardo. The other type that can make use of truncate triggers is the 'swap' sync, aka "master to master". I've not yet decided on the behavior for such syncs, but one possibility is simply:

  • Database A gets truncated at time X
  • Bucardo truncates database B, then discards all delta rows older than X for both A and B, and all delta rows for B
  • Everything after X gets processed as normal (conflict resolution, etc.)
  • The same thing for a truncate on database B (truncate A, discard all older rows).

Second proposal:

  • Database A gets truncated at time X
  • We populate the delta table with every primary key in the table before truncation (assuming we can get at it)
  • That's it! Bucardo does its normal thing as if we just deleted a whole bunch of rows on A, and in theory deletes them from B as well.
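To see why the second proposal needs no special-case code, here is a toy Python model of it. This is a sketch under heavy assumptions, not Bucardo code: tables are plain sets of primary keys, the delta table is a set of changed keys, only changes flowing from A to B are modeled (no deltas on B, no conflict resolution), and all function names are invented.

```python
def truncate_master(master, delta):
    """Second proposal: log every existing primary key as changed, then empty the table."""
    delta.update(master)
    master.clear()


def insert_master(master, delta, pk):
    """An ordinary insert: add the row and log its key in the delta table."""
    master.add(pk)
    delta.add(pk)


def run_sync(master, slave, delta):
    """Replay the delta: copy keys still present on the master, delete the rest."""
    for pk in sorted(delta):
        if pk in master:
            slave.add(pk)
        else:
            slave.discard(pk)
    delta.clear()


# Same scenario as the demo above: one row everywhere, then truncate plus insert.
master, slave, delta = {1}, {1}, set()
truncate_master(master, delta)
insert_master(master, delta, 2)
run_sync(master, slave, delta)
print(sorted(slave))
```

The obvious cost is that a truncate of a large table writes one delta row per existing key, which gives up much of what makes truncate cheap in the first place.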

Comments on this strategy welcome!

Update: Clarified initdb cluster vs. database per comment #1 below, and added new truncation handling scheme for multi-master replication per comment #2.

Last day in Nigeria! A short summary

Today is my last day in Nigeria. I hop in a car in a couple hours and head off to visit the university in Akure, and then I will be driving to Lagos to catch a plane home.

My students are pictured above. We covered a great deal of material this week. They learned about the PostgreSQL project, basic database administration, how to develop a schema from forms and application requirements, how to write procedural code inside the database, and how to use the pgAdmin and psql interfaces.

I learned that many of the officials and IT workers I met (both in the class and outside of it) had worked very hard on the court case that led to the change in government in Ondo State three months ago. There had been systematic election fraud, and they were able to prove it in court using some clever IT and forensic analysis work. The members of SITEDEC believe very strongly in the importance of IT in increasing government accountability and transparency, a belief re-affirmed by their recent successes.

I'm looking forward to hearing about how the work progresses on their census and voter registration databases. Of course, I want to come back to Nigeria. It's a beautiful country, and I didn't have nearly enough time here to appreciate it.

Windows installer tip: passwords

Updated below!

When specifying a password for the Windows PostgreSQL one-click installer, you get this message:

Please provide a password for the database superuser and service account (postgres). If the service account already exists in Windows, you must enter the current password for the account. If the account does not exist, it will be created when you click 'Next'.

If you have already installed Postgres as a service, you will need to enter the current password of the postgres service user to get past the password dialog box. I first took this to mean the password of whoever is logged in to Windows (if you're logged in as 'selena', you enter selena's password), but that turned out to be wrong; see the update below. As a non-Windows user, this baffled me, and a few other people on this thread.

Otherwise, you can just enter a password that will be used for the 'postgres' database user. Hope this helps someone!

Update:

Further explanation from Dave Page, the maintainer of the Windows package:

Selena: It's not the password for the user that you are logged in as that you need to enter, it's the password for the service account (ie. postgres).

Unlike *nix & Mac, service accounts on Windows need to have passwords so unfortunately we need to ensure we have the correct password to install the service. Hence, if there's an existing postgres account, we need the existing password, otherwise the account will be created with whatever password you specify.

In all OSs, we use the password entered on that page as the database superuser password.

In Nigeria: Weekend exploring

Yesterday, I traveled to a Michelin (yes, the tire company!) plantation for a party thrown in honor of the new Secretary to the Ondo State Government, Dr. Aderotimi Adelola.

Michelin grows rubber trees on this sprawling estate. It took nearly 20 minutes to get from the highway to the primary school deep inside the plantation where the celebration was held. Tapped rubber trees pictured below!

I was invited to a table inside the Governor's main tent, and spent most of the time just looking around at all the government officials, and chatting with the Chairman of SITEDEC, Cyril Egunlayi.

The high point of the afternoon was Dr. Olusegun Mimiko's speech welcoming Dr. Adelola to the government. He's a charismatic speaker. The people around the perimeter pressed closer, and were attentively silent for his 10 or 15 minute speech. He emphasized education -- his hometown's slogan is "Home of Education". He also said that despite Ondo State's history of leading Nigeria in educational opportunities, the state had regressed and needed to catch up again. Mimiko speaking:

The car ride out and back to the plantation took about two hours each way. I spent much of that time talking about open source options for various IT infrastructure, where something like Google Apps might fit in for them, and passed on information I'd gotten about microwave links from a Portland WiMax provider, Stephouse Wireless. I also told Cyril about feedback regarding a replacement for Exchange. My followers on Twitter universally recommended Zimbra, and that was confirmed by at least one End Point coworker, Adam Volrath.

We also stopped by the office on our way home to check in on a new wireless repeater the engineers were installing on the tower they have out behind the SITEDEC center. We still have a few details to work out for the class arrangements.

In the evening, I enjoyed some Nigerian barbecue with Deji Agbebi. Originally from Lagos, he worked for a Canadian firm in the early 90s whose goal was to provide clean drinking water to villages in Ondo state. For various reasons, including a military coup, that business failed. Now Deji works in the US. He's a friend of Cyril's, and is here in Akure, hoping to help with the work the government is trying to complete before January.

Nigeria PostgreSQL Training: Day 1

I am in Lagos, Nigeria this morning, preparing for a half-day car ride to Akure in Ondo State. I'll be spending the next seven days with programmers from Ondo state, who are six months or so away from deploying a system to deliver government services using a centralized card system. They are designing their database using PostgreSQL!

Ondo state has a little over 3 million people, and plans to integrate a half-dozen government services under the centralized data system. They conducted a census in 2006, and will be using their new system to gather data yearly going forward.

Their plan is extremely ambitious, given obstacles like lack of power in most of the rural areas, and social issues like people not wanting to give accurate information about themselves to the government. Some biometric information, like fingerprints, will be gathered electronically using special machines that they will primarily lease (instead of buying - significant cost savings), and these machines require power. They have been specially outfitted with dry-cell batteries that operate for about 8 hours before needing to be recharged.

For the social problems around data collection, they are planning a marketing campaign to explain exactly what benefits those who provide accurate information are entitled to. After I mentioned to my host the American aversion to centralized government identification cards, he explained that in Nigeria they had the same issue. In addition to the marketing on TV, radio, newspapers and even leaflets, data collection volunteers will be trained on exactly how to collect accurate information. I am looking forward to having a look at the surveys and data collection strategy.

Otherwise, I've had a lot of fun talking with people. My car trip from the airport and remaining evening was mostly spent with me making funny vocabulary errors (tshirt == vest - who knew?), and explaining that Americans were mourning and in shock just like Nigerians because of Michael Jackson's death. I made an offhand comment about the number of people walking around outside at dusk because a friend had said a similar thing about Portland, OR's nightlife, and my escort commented on how peaceful and free people are in Lagos.

Inside PostgreSQL - Clause selectivity

One of the more valuable features of any conference is the so-called "hall track", or in other words, the opportunity to talk to all sorts of people about all sorts of things. PGCon was no exception, and I found the hall track particularly interesting because of suggestions I was able to gather regarding multi-column statistics, not all of which boiled down to "You're dreaming -- give it a rest". One of the problems I'd been trying to solve was where, precisely, to put the code that actually applies the statistics to a useful problem. There are several candidate locations, and certainly quite a few places where we could make use of such statistics. The lowest-hanging fruit, however, seems to be finding groups of query clauses that aren't as independent as we would normally assume. Between PGCon sessions one day, Tom Lane pointed me to a place where we already do something very similar: clausesel.c

"Clause selectivity" means much the same thing as any other selectivity: it's the proportion of rows from a relation that an operation will return. A "clause", in this case, is a filter on a relation, such as the "X = 1" and the "Y < 10" in "WHERE X = 1 AND Y < 10". PostgreSQL uses functions in clausesel.c to find clauses whose combined selectivity differs from the product of their individual selectivities. For instance, in "WHERE X < 4 AND X < 5", the "X < 5" is redundant; the clauses' combined selectivity is simply that of "X < 4". With "WHERE Y > 4 AND Y < 10", clausesel.c can determine that we really want the selectivity of the clause "4 < Y < 10". It's also smart enough to recognize "pseudo-constants": values from non-volatile functions, or from the outer relation of a nested loop. Although these values aren't truly constants, they remain constant at the level of the query where the clause will be applied, and can be treated as constants.

With any luck, one day clausesel.c will also know enough to notice cases where, for instance, although "foo.x = 3" and "foo.y > 10" are individually true for much of table "foo", there are very few rows where both conditions are true.
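To make the arithmetic concrete, here is a rough Python sketch. It is a toy model rather than a transcription of clausesel.c: the function names and the sample data are invented, the first part assumes values spread uniformly over [0, 20), and the real code handles many more cases (including falling back to defaults when the arithmetic gives a nonsensical result). The last part shows the kind of correlated data where multiplying individual selectivities goes badly wrong.

```python
from math import prod


def independent_selectivity(selectivities):
    """Default assumption: clauses are independent, so multiply."""
    return prod(selectivities)


def redundant_bound_selectivity(selectivities):
    """Several upper bounds on one column (X < 4 AND X < 5): the tightest wins."""
    return min(selectivities)


def range_selectivity(lower_bound_sel, upper_bound_sel):
    """One lower and one upper bound on a column (Y > 4 AND Y < 10)."""
    return lower_bound_sel + upper_bound_sel - 1.0


def measured_selectivity(rows, predicate):
    """Fraction of rows that actually satisfy the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


# Part 1: Y uniform over [0, 20). The true answer for 4 < Y < 10 is 6/20.
sel_gt_4 = 16 / 20   # fraction of rows with Y > 4
sel_lt_10 = 10 / 20  # fraction of rows with Y < 10
print("naive product:", independent_selectivity([sel_gt_4, sel_lt_10]))
print("range pairing:", range_selectivity(sel_gt_4, sel_lt_10))

# Part 2: 100 made-up rows of (x, y) where x = 3 and y > 10 rarely coincide.
rows = [(3 if i < 50 else 0, 11 if i >= 48 else 0) for i in range(100)]
sel_x = measured_selectivity(rows, lambda r: r[0] == 3)
sel_y = measured_selectivity(rows, lambda r: r[1] > 10)
sel_both = measured_selectivity(rows, lambda r: r[0] == 3 and r[1] > 10)
estimated = independent_selectivity([sel_x, sel_y])
print("estimated:", estimated, "actual:", sel_both)
```

In the second part each condition holds for about half the table, so the independence assumption predicts roughly a quarter of the rows, while only two rows in a hundred actually match. That gap is what multi-column statistics would be meant to close.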

PostgreSQL with SystemTap

Those familiar with PostgreSQL know it has supported DTrace since version 8.2. The 8.4beta2 release includes support for several new DTrace probes. But for those of us using platforms on which DTrace doesn't exist, this support hasn't necessarily meant much. SystemTap is a relatively new Linux package with a similar purpose to DTrace, and it is under heavy development. As luck would have it, PostgreSQL's DTrace probes work with SystemTap as well.

A few caveats: it helps to run a very new SystemTap version (I used one I pulled from SystemTap's git repository today), and in order for SystemTap to have access to userspace software, your kernel must support utrace. I don't know precisely what kernel versions include the proper patches; my Ubuntu 8.04 laptop didn't have the right kernel, but the Fedora 10 virtual machine I just set up does.

Step 1 was to build SystemTap. This was a straightforward ./configure, make, make install, once I got the correct packages in place. Step 2 was to build PostgreSQL, including the --enable-dtrace option. This also was straightforward. Note that PostgreSQL won't build with the --enable-dtrace option unless you've already installed SystemTap. Finally, I initialized a PostgreSQL database cluster and started the database.

Here's where the fun starts. SystemTap's syntax differs from DTrace syntax. Here's an example probe SystemTap would accept:

probe process("/usr/local/pgsql/bin/postgres").function("eqjoinsel")
{
        printf ("%d\n", pid())
}

This tells SystemTap to print out the process ID (which comes from the SystemTap pid() function) each time the PostgreSQL eqjoinsel function is called. That's the function to estimate join selectivity with most equality operators, and gets called a lot, so it's a decently useful test. It also shows that SystemTap can probe inside programs without an explicitly defined probe. I saved this file as test.d, and ran it like this:

[josh@localhost ~]$ sudo stap -v test.d
Pass 1: parsed user script and 52 library script(s) in 160usr/220sys/641real ms.
Pass 2: analyzed script: 1 probe(s), 1 function(s), 1 embed(s), 0 global(s) in 40usr/60sys/331real ms.
Pass 3: translated to C into "/tmp/stapDD5a4p/stap_c0b737cdffdb48cec3fd55b631bb0656_1057.c" in 30usr/160sys/211real ms.
Pass 4, preamble: (re)building SystemTap's version of uprobes.
Pass 4: compiled C into "stap_c0b737cdffdb48cec3fd55b631bb0656_1057.ko" in 1510usr/3430sys/8052real ms.
Pass 5: starting run.
4521
4521
4521
4521

4521 is the process ID of the PostgreSQL backend I'm connected to, and it gets printed every time I type "\dt" in my psql session.

Now for something more interesting. Although SystemTap lets me probe whatever function I want, it's nice to be able to use the defined DTrace probes, because that way I don't have to find the function name I'm interested in, in order to trace something. Here are some examples I added to my test.d script, pulled more or less at random from the list of available DTrace probes in the PostgreSQL documentation. Note that whereas the documentation lists the probe names with dashes (or are these hyphens?), to make it work with SystemTap, I needed to use double-underscores, so "transaction-start" in the docs becomes "transaction__start" in my script.

probe process("/usr/local/pgsql/bin/postgres").mark("transaction__start")
{      
        printf("Transaction start: %d\n", pid())
}

probe process("/usr/local/pgsql/bin/postgres").mark("lwlock__condacquire") {
        printf("lock wait start at %d for process %d on cpu %d\n", gettimeofday_s(), pid(), cpu())
}

probe process("/usr/local/pgsql/bin/postgres").mark("sort__start") {
        printf("sort start at %d for process %d on cpu %d\n", gettimeofday_s(), pid(), cpu())
}

probe process("/usr/local/pgsql/bin/postgres").mark("smgr__md__write__done") {
        printf("smgr-md-write-done at %d for process %d on cpu %d\n", gettimeofday_s(), pid(), cpu())
}

...which resulted in something like this when I ran pgbench:

[josh@localhost ~]$ sudo stap -v test.d
Pass 1: parsed user script and 52 library script(s) in 130usr/150sys/286real ms.
Pass 2: analyzed script: 7 probe(s), 4 function(s), 2 embed(s), 0 global(s) in 30usr/30sys/120real ms.
Pass 3: translated to C into "/tmp/stapW9yfAQ/stap_f6f3ffd834ef5b249edcf7d1ca19dce2_3025.c" in 10usr/150sys/163real ms.
Pass 4, preamble: (re)building SystemTap's version of uprobes.
Pass 4: compiled C into "stap_f6f3ffd834ef5b249edcf7d1ca19dce2_3025.ko" in 1380usr/2690sys/4155real ms.
Pass 5: starting run.
Transaction start: 4894
Transaction start: 4894
lock wait start at 1243552147 for process 4907 on cpu 0
Transaction start: 4907
Transaction start: 4907
lock wait start at 1243552147 for process 4907 on cpu 0
Transaction start: 4907
lock wait start at 1243552174 for process 2770 on cpu 0
smgr-md-write-done at 1243552174 for process 2770 on cpu 0
smgr-md-write-done at 1243552174 for process 2770 on cpu 0
smgr-md-write-done at 1243552174 for process 2770 on cpu 0
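
Output in this shape is easy to post-process. As a minimal sketch, assuming only the two printf formats used in the script above (the helper name count_events and the regular expressions are illustrative, not part of SystemTap), a few lines of Python can tally events per backend process:

```python
import re
from collections import Counter

# A few lines copied from the output above.
SAMPLE = """\
Transaction start: 4894
lock wait start at 1243552147 for process 4907 on cpu 0
Transaction start: 4907
smgr-md-write-done at 1243552174 for process 2770 on cpu 0
smgr-md-write-done at 1243552174 for process 2770 on cpu 0
"""

# Two line shapes: "Transaction start: PID" and
# "EVENT at TIME for process PID on cpu N".
START_RE = re.compile(r"^Transaction start: (\d+)$")
EVENT_RE = re.compile(r"^(.+?) at (\d+) for process (\d+) on cpu (\d+)$")


def count_events(text):
    """Return a Counter keyed by (event name, pid)."""
    counts = Counter()
    for line in text.splitlines():
        m = START_RE.match(line)
        if m:
            counts[("transaction start", int(m.group(1)))] += 1
            continue
        m = EVENT_RE.match(line)
        if m:
            counts[(m.group(1), int(m.group(3)))] += 1
    return counts


counts = count_events(SAMPLE)
for (event, pid), n in sorted(counts.items()):
    print(f"{event:25s} pid={pid} count={n}")
```

In practice the text would come from the stap process's standard output rather than a string constant.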

This could be a very interesting way of profiling, performance testing, debugging, troubleshooting, and who knows what else. I'm interested to see SystemTap become more ubiquitous. I should note that I have no idea how SystemTap compares to DTrace or whether it will manage to do for Linux what DTrace can do on other operating systems. Time will tell, I guess.

UPDATE: As has been pointed out in the comments, compiling PostgreSQL with --enable-dtrace is only necessary if I want to use the built-in "taps" (the SystemTap word, apparently, for its equivalent of DTrace probes). Probing by function call, or any of the other probe methods SystemTap supports, works without --enable-dtrace.

UPDATE 2: It's important to note that the defined DTrace probes include sets of useful variables that DTrace and SystemTap scripts might be interested in. For instance, it's possible to get the transaction ID within the transaction__start probe. In SystemTap, these variables are referenced as $arg1, $arg2, etc. So in a transaction__start probe, you could say:

printf("Transaction with ID %d started\n", $arg1)

Writing Procedural Languages - slides

Although I'll be working to change this, the slides for my "Writing a PostgreSQL Procedural Language" tutorial available from the PGCon website are from an earlier iteration of the talk. The current ones, which I used in the presentation, are available here, on Scribd.

PGCon thus far

Though it might flood the End Point blog with PGCon content, I'm compelled to scribble something of my own to report on the last couple of days. Wednesday's Developers' Meeting was an interesting experience and I felt privileged to be invited. Although I could only stay for the first half, as my own presentation was scheduled for the afternoon, I enjoyed the opportunity to meet many PostgreSQL "luminaries", and participate in some of the decisions behind the project.

Attendance at my "How to write a PostgreSQL Procedural Language" tutorial exceeded my expectations, no doubt in part, at least, because aside from the Developers' Meeting it was the only thing going on. Many people seem interested in being able to write code for the PostgreSQL backend, and the lessons learned from PL/LOLCODE have broad application. It was suggested, even, that since PL/pgSQL converts most of its statements to SQL and passes the result to the SQL parser, PL/LOLCODE would have less parsing overhead than PL/pgSQL. Ensuing discussions of high performance LOLCODE were cancelled due to involuntary giggling.

Between talks I've had the opportunity to meet a wide variety of PostgreSQL users and contributors, and been interested to see various people's ideas for future development. Perhaps it will result in a blog post one day, but suffice it to say there's lots of activity under way. Most surprising to me has been the interest in my (still embryonic) work with multi-column statistics. On a number of different occasions people have unexpectedly asked me about it. Thanks to a hallway conversation with Tom Lane, another of the hard problems involved has a possible solution, the probable subject of yet another blog post.

Thanks to the organizers, sponsors, speakers, helpers, etc. who have made the conference possible so far; I'm looking forward to today.

PgCon: the developer's meeting and the 2009 keynote

Yesterday, I spent the entire day at a Postgres Developers meeting, discussing what happened over the last year, and how we're going to tackle a series of critical problems in the next year. We talked about how to get the Synchronous Replication and Hot Standby patches completed, important adoption issues, our continued participation in the SQL Standards committee (a surprising number of people were interested!), moving forward with alpha releases after commitfests (woo!), and creating a better infrastructure for managing modules and addons to Postgres.

That evening, a few of us were treated by Paul Vallee of Pythian Group to dinner and a trip to another of Ottawa's great local pubs. We discussed the future of open source databases and the relative quality of beer in Ottawa, Portland and the UK. Of course, I think Portland has the best beer ;)

This morning, Dan introduced everyone to the start of the sessions, and then Dave, Magnus and I managed to get through the keynote. It was mostly an opportunity to announce 8.4 Beta2, plug a few of the talks and mention all the different individuals involved in development. And have a laugh about our conference T-shirts.

I have an hour and a half until I give the Power psql talk and then tonight is the big EnterpriseDB party. And one more talk tomorrow. And lightning talks. What a full conference :)

PgCon: Preparing the keynote, more talks and today is Developer Meeting day

I spent most of Tuesday polishing up slides for my VACUUM strategy talk, reviewing the Power psql talk slides, working a little bit and then meeting up with all the new arrivals.

Dave Page and Greg Stark rescued Magnus and me from the coffee shop and we settled in at the Royal Oak for the evening. Dave, Magnus and I decided on the theme "Why people are choosing Postgres" for our keynote, and we managed to produce a few slides to guide us!

Peter Eisentraut was there and we chatted briefly about his fun FUSE project for Postgres, which he'll be giving a Lightning Talk about on Friday. (There is still time to give a lightning talk, by the way! Find me, or just update the wiki and I'll add you to the agenda.)

I also saw CB (one of the database gurus) from Etsy there, and I'm hoping to meet up with him and a few more people this evening. Tom Lane and I chatted a little bit about my experience at MySQL Conference, and how things seem to be going with Drizzle.

All in all, had a great evening and I even survived Dave's frequent refilling of my beer glass. I'm looking forward to today's Developer Meeting.

PGCon: First day in Ottawa

I arrived in Ottawa late Sunday night a little in advance of the conference. I'm spending a couple days working on the final bits of my slides, and spending a little time with friends in the Postgres community that I only get to see once a year!

I started the morning with Dan Langille, the PGCon organizer, Magnus Hagander, and Josh Berkus. During that conversation, I managed to avoid being assigned to give the keynote on Thursday by myself, but instead enlisted Magnus and Dave Page to come up with something together with me. They gave a keynote together at PgDay EU, so I figured I would be in good company.

One project that I've helped with in the past is the code that runs planet.postgresql.org. Magnus Hagander and I spent most of yesterday renaming the project, identifying the next few features we'd like to add, and getting the source tree moved over to git.postgresql.org.

I'm hoping we have a little more time between tweaking slides to get our new features finished and deployed to the production server today.

Competitors to Bucardo version 1

Last time I described the design and major functions of Bucardo version 1 in detail. A natural question to ask about Bucardo 1 is, why didn't I use something else already out there? And that's a very good question.

I had no desire to create a new replication system and work out the inevitable kinks that would come with that. However, nothing then available met our needs, and even today nothing I'm familiar with quite would. So writing something new was necessary. Writing an asynchronous multimaster replication system for Postgres was not trivial, but turned out to be easier than I had expected thanks to Postgres itself -- with the caveats noted in the last post.

But, back to the landscape. What follows is a survey of the Postgres replication landscape as it looked in mid-2002 when I first needed multimaster replication for PostgreSQL 7.2.

pgreplicator

PostgreSQL Replicator is probably the most similar project to Bucardo 1. It was released in 2001 and does not appear to have had any updates since October 2001. I don't recall why I didn't use this, but from reviewing the documentation I suspect it was because it hadn't been updated for PostgreSQL 7.2, it used PL/Tcl, and required a daemon to run on every node. But the asynchronous store-and-forward approach, the use of triggers and data storage tables is similar to Bucardo 1.

dbmirror

I don't remember whether this was around in 2002, but it's part of PostgreSQL contrib now. It is master/slave replication only.

Slony-I

I don't think Slony-I existed in 2002 -- version 1.0 was released in 2004. But in any case, it only does master/slave replication.

Slony2

There has been no code released from this project and the website is now gone.

erserver

Master/slave replication, abandoned in favor of Slony-I. Website is now gone.

Postgres-R

This was a research project that worked with PostgreSQL 6.4. Some Postgres-R design documents were published. An effort to port it to PostgreSQL 7.2 (the pgreplication project) did not appear to have gotten very far. In 2008 it seems to have been partially revived. I don't know what the current status is.

PGCluster

This didn't exist in 2002. I'm not sure where it's at now. I believe it uses synchronous replication.

pgpool

This isn't the kind of "replication" I wanted; it's database load balancing and multiplexing. The pgpool listener is a single point of failure, and all databases must be accessible or data will be lost on a database server that is down.

Usogres

Master/slave replication for backup purposes.

Mammoth PostgreSQL + Replication

This didn't exist in 2002. It is only master/slave replication. It began as proprietary software but I believe is open source now.

EnterpriseDB Replication Server

A proprietary offering that came out in 2005 or 2006, for master/slave replication only. Has apparently been replaced by Slony, or perhaps was always rebranded Slony.

pgComparator

An rsync-like tool for comparing databases. Didn't exist in 2002. Probably much better than Bucardo 1's compare operation.

DBBalancer

Kind of like pgpool, more of a connection pooler. Hasn't been updated since 2002.

DRAGON

"Database Replication based on Group Communication." Links to this project were defunct.

DBI-Link

DBI-Link isn't about replication.

(Summary)

I assembled this list some time back and have made some updates to it. I'm sure there are more to consider today. Please comment if you have any corrections or additions.

The design of Bucardo version 1

Since PGCon 2009 begins next week, I thought it would be a good time to start publishing some history of the Bucardo replication system for PostgreSQL. Here I will cover only Bucardo version 1 and leave Bucardo versions 2 and 3 for a later post.

Bucardo 1 is an asynchronous multi-master and master/slave database replication system. I designed it in August-September 2002, to run in Perl 5.6 using PostgreSQL 7.2. It was later updated to support PostgreSQL 7.4 and 8.1, and changes in DBD::Pg's COPY functionality. It was built for and funded by Backcountry.com, and various versions of Bucardo have been used in production as a core piece of their infrastructure from September 2002 to the present.

Bucardo's design is simple, relying on the consistently correct behavior of the underlying PostgreSQL database software. It made some compromises on ideal behavior in order to have a working system in a reasonable amount of time, but the compromises are few and are mentioned below.

General design

Bucardo 1 needed to:

  • Support asynchronous multimaster replication.
  • Support asynchronous master/slave replication of full tables and changes to tables.
  • Leave frequency of replication up to the administrator, which came by default since each replication event is a separate run of the program.
  • Preserve transaction atomicity and isolation across databases.
  • Continue collecting change information even when no replication process is running.
  • Be fairly efficient in storing changes and in bandwidth usage sending them to the other database.
  • Have a default "winner" in collision situations, with special handling possible for certain tables where more intelligent collision merges could be done.
  • Not require any database downtime for maintenance, upgrades, etc.
  • Be fairly simple to understand and support.
  • Support a data flow arrangement such that the replicator is behind a firewall and reaches out to an external database, but doesn't require inbound access to the internal database.

Operations

There are four types of database operations Bucardo 1 can perform:

  • peer - synchronize changes in one or more tables between two peer databases (multi-master)
  • pushdelta - copy only changed rows from a table or set of tables from a master database to a slave database
  • push - copy an entire table or set of tables from a master database to a slave database
  • compare - compare all rows of one or more tables between two databases

I will discuss each of these operations in turn.

Peer sync

The peer sync operation is the most groundbreaking feature of Bucardo 1. The much smaller Backcountry.com of 2002 wanted to have an internal master database in their office, which housed their customer service and warehouse employees, buyers, and management. Their office had a low-bandwidth and not entirely reliable Internet connection. Their e-commerce web, application, and database servers were at a colocation facility with a fast Internet connection, and they wanted an identical master database to reside there, so that in the case of any disruption in connectivity between their office and colocation facility, both locations could continue to function independently, and their databases would automatically synchronize after connectivity was restored.

To summarize, what they needed was multi-master replication. Their needs would be satisfied with asynchronous multi-master replication. That meant that it was acceptable for the databases to be current with each other within 1-2 minutes of lag time. (Synchronous multi-master replication requires a continuous connection between the two master databases, and transactions are not allowed to commit until the transaction is completed on both databases.)

I want to review some of the features that are required for multi-master replication to work. First, it needs to have ACID properties just like the underlying database itself. The most relevant properties for our multi-master replication system are atomicity and isolation. A transaction must be entirely visible on a given database, or not visible at all.

For example, let us imagine that a customer ecommerce order consists of exactly 1 row in the "orders" table, which references 1 row in the "users" table, and the following tables may have 0 or more rows pointing to the "orders" table:

  • order_lines
  • order_notes
  • credit_cards
  • payments
  • gift_certificates
  • coupon_uses
  • affiliate_commissions
  • inventory

To add an order to the source database, a transaction is started, rows are added to relevant tables, the transaction is committed, and then those rows will all appear to other database users at once. Until the transaction is committed, no changes are visible. If an error occurs, the entire transaction rolls back, and it will never have been seen by any other database user.

This ensures that warehouse employees, customer service representatives, etc. will never see a partial order. This is especially important since we don't want to ship an order that is missing some of its line items, or double-charge a credit card because we didn't have a payment record yet. And an order without its associated inventory records would have trouble shipping at the warehouse.

This is all standard ACID stuff. But since I was writing a multi-master replication system from scratch, I had to assure the same properties across two database clusters, for which PostgreSQL had no facilities.

Changes are tracked by having a "delta table" paired with every table that's part of the multi-master replication system. The table has three columns: the primary key in the table being tracked, the wallclock timestamp, and an indicator of whether the change was due to an insert, update, or delete. Every change in the table being tracked is recorded by rules and triggers that insert a corresponding row in the delta table.

This is what the delta table for "orders" looks like (simplified a bit for readability):

                      Table "public.orders_delta"
    Column     |     Type    |                Modifiers 
---------------+-------------+-----------------------------------------
 delta_key     | varchar(14) | not null
 delta_action  | char(1)     | not null
 last_modified | timestamp   | not null default timeofday()::timestamp
Check constraints:
    "delta_action_valid" CHECK (delta_action IN ('I','U','D'))
Triggers:
    orders_delta_last_modified BEFORE INSERT OR UPDATE ON orders_delta
        FOR EACH ROW EXECUTE PROCEDURE update_last_modified()

The new row data itself in the tracked table is not copied, because the data is right there for the taking. It is enough to note that a change was made. If multiple changes are made, only the most recent version of the row is available, but that is fine because that's the only one we need to replicate.
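
The idea can be sketched in a few lines of Python. This is a minimal in-memory model, not Bucardo's code: the dicts stand in for the "orders" and "orders_delta" tables, record_change() stands in for the rules and triggers, and it assumes one delta row per key holding the most recent action. The order keys and column names are made up for illustration.

```python
import time

orders = {}        # tracked table: primary key -> row data
orders_delta = {}  # delta table: primary key -> (action, last_modified)


def record_change(key, action):
    """Stand-in for the rules/triggers: note that a row changed, not what changed."""
    assert action in ("I", "U", "D")
    # Wallclock time; a newer entry for the same key replaces the older one.
    orders_delta[key] = (action, time.time())


def insert_order(key, row):
    orders[key] = row
    record_change(key, "I")


def update_order(key, **changes):
    orders[key].update(changes)
    record_change(key, "U")


def delete_order(key):
    del orders[key]
    record_change(key, "D")


insert_order("A1001", {"status": "new", "total": 25})
update_order("A1001", status="paid")
insert_order("A1002", {"status": "new", "total": 40})
delete_order("A1002")

# The delta table says which keys to replicate; the current row data is read
# from the tracked table itself. A key missing there means the row was deleted.
for key, (action, stamp) in orders_delta.items():
    print(key, action, orders.get(key, "<deleted>"))
```

Note that the intermediate state of A1001 ("new") is never recorded anywhere, which matches the point above: only the latest version of a row needs to be replicated.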

Because nothing outside of the database is required to track changes, the tracking continues even when Bucardo 1 is not running. As long as the delta table exists and can be written to, and the tracking rules and triggers are in place on the tracked table, the changes will be recorded.

Bucardo 1 achieves atomicity and isolation of the replication transaction with this process:

  1. Open a connection to the first database, set transaction isolation to serializable, and disable triggers and rules.
  2. Open a connection to the second database, set transaction isolation to serializable, and disable triggers and rules.
  3. For each table to be synchronized in this group:
    1. Verify that the table's column names and order match in the two databases.
    2. Walk through the delta table on the first database, making identical changes to the second database. Empty the delta table when done.
    3. Walk through the delta table on the second database, making identical changes to the first database. Empty the delta table when done.
    4. Make a note of any changes that were made to the same rows on both databases ("conflicts"). By default, we resolve the conflicts silently by allowing the designated "winner" database's change to be the one that remains. For certain tables such as "inventory", appropriate table-specific conflict resolution code was added that merged the changes instead of designating a winner and loser version of the row.
  4. Once all changes have succeeded, commit transactions on both databases.
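
The steps above can be sketched roughly as follows. This is a simplified in-memory model under stated assumptions, not Bucardo's Perl code: each database table is a dict keyed by primary key, each delta table is a set of changed keys, the column check in step 3.1 is omitted, and the function name peer_sync and its parameters are hypothetical.

```python
def peer_sync(db_a, delta_a, db_b, delta_b, winner="a", merge=None):
    """Synchronize two in-memory tables using their sets of changed keys.

    Changes are staged on copies and applied only at the end, mimicking
    two transactions that are committed together in the final step.
    Returns the set of conflicting keys.
    """
    new_a, new_b = dict(db_a), dict(db_b)
    conflicts = delta_a & delta_b   # same row changed on both sides

    def copy_row(key, src, dst):
        if key in src:
            dst[key] = src[key]     # insert or update
        else:
            dst.pop(key, None)      # gone from the source: delete

    for key in delta_a - conflicts:
        copy_row(key, db_a, new_b)
    for key in delta_b - conflicts:
        copy_row(key, db_b, new_a)

    for key in conflicts:
        if merge is not None and key in db_a and key in db_b:
            # Table-specific resolution, e.g. combining inventory changes.
            new_a[key] = new_b[key] = merge(db_a[key], db_b[key])
        elif winner == "a":
            copy_row(key, db_a, new_b)
        else:
            copy_row(key, db_b, new_a)

    # "Commit": install the new contents and empty both delta tables.
    db_a.clear()
    db_a.update(new_a)
    db_b.clear()
    db_b.update(new_b)
    delta_a.clear()
    delta_b.clear()
    return conflicts


# Both sides started with o1 = "new"; each side then changed o1 and added a row.
office = {"o1": "paid", "o2": "new"}
office_delta = {"o1", "o2"}
colo = {"o1": "cancelled", "o3": "new"}
colo_delta = {"o1", "o3"}

conflicts = peer_sync(office, office_delta, colo, colo_delta, winner="a")
print(sorted(conflicts), office == colo)
```

Because nothing is written to either dict until every change has been computed, an exception partway through leaves both sides untouched, which is the in-memory analogue of rolling back both transactions.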

This last step of the process does not satisfy the ACID durability requirement. Since Bucardo 1 was designed on PostgreSQL 7.2, with no 2-phase commit possible, there is a chance that one database will fail to commit its transaction after the other database already did, and the changes will be lost on one side only. This has never happened in practice, mostly due to the fact that committing a transaction in PostgreSQL is a nearly instantaneous operation, since the data is already in place and no separate rollback or log tables need to be modified. But it is certainly possible that it could happen, and it is an undesirable risk. With real 2-phase commit now available in PostgreSQL, complete durability could be achieved.

All of a sudden, the changes on each side are now available to the other side, all at once. Only entire orders are visible, never partial orders.

ACID consistency is achieved by assuming that due to PostgreSQL's integrity checks on the source database, the data was already consistent there, and it is copied verbatim to the destination database where it will still be consistent. Thus, CHECK constraints, referential integrity constraints, etc. are expected to be identical between the two databases. Bucardo 1 does not propagate database schema changes.

Thus the main principles to provide fairly reliable replication are:

  1. All related tables must be synchronized within the same transaction.
  2. Synchronization must always be done in both directions in the same transaction, so that the code can detect simultaneous change conflicts.
  3. The most recent change to a given row must of course be the last change, so changes should be replayed in order. (We optimize this by not copying over row changes that we know will be deleted later in the same transaction.)
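
The optimization in the third point can be illustrated with a short sketch. It assumes the changes arrive as an ordered list of (key, action) pairs, oldest first; the function name collapse_changes is hypothetical.

```python
def collapse_changes(changes):
    """Reduce an ordered list of (key, action) pairs to one final action per key.

    Later entries override earlier ones, so a row that ends up deleted is
    never copied across; it is only deleted on the other side.
    """
    final = {}
    for key, action in changes:   # oldest first
        final[key] = action
    return final


changes = [
    ("o1", "I"), ("o1", "U"),     # inserted, then updated
    ("o2", "I"), ("o2", "D"),     # inserted, then deleted again
    ("o3", "U"),
]
plan = collapse_changes(changes)
to_copy = sorted(k for k, a in plan.items() if a != "D")
to_delete = sorted(k for k, a in plan.items() if a == "D")
print(to_copy, to_delete)
```

Here the row data for o2 never crosses the network at all, since its last recorded action is a delete.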

Things to consider with multi-master replication:

  1. Conflicts are less likely the more often the synchronization is performed. But conflicts can still happen, and must be resolved somehow. Creating a generic conflict resolution mechanism is difficult, but declaring a "winning" database is easy and special conflict resolution logic can be added for certain tables where lost changes would be troublesome.
  2. Very large change sets can take a long time to synchronize. For example, consider an unintentionally large update like this:

    UPDATE inventory SET quantity = quantity + 5

    That may change hundreds of thousands of rows, all in a single transaction. Our replication system needs to make all those changes in a single transaction to the other database, but it must do so over a comparatively slow Internet connection. As transactions run longer, they often encounter locks from other concurrent database activity, and roll back. Then the process must start over, but now there are even more changes to copy over, so it takes even longer. In the worst situations, the synchronization simply cannot complete until other concurrent database activity is temporarily stopped, so that no locks will conflict. And that means downtime of applications, and manual intervention of the system administrator.

    Perhaps you could ship over all the data to the other database server ahead of time, then begin transactions on both databases and make the changes based on the local copy of the data, and expect the changes to be accepted more quickly since the network is no longer a bottleneck. But the destination database won't have been idle during that copying, which needs to be accounted for.

    Statement replication does not have this same weakness, but it has many weaknesses of its own.

  3. Sequences need to be set up to operate independently without collisions on the two servers in a peer sync. Two easy ways to do this are:
    1. Set up sequences to cover separate ranges on each server. For example, MAXVALUE 999999 on the first server, and MINVALUE 1000000 on the second server. Make sure to spread the ranges far enough apart that they are unlikely ever to collide.
    2. Set up sequences to supply odd numbers on one server, and even on the other. For example, START 1 INCREMENT 2 on the first server, and START 2 INCREMENT 2 on the second server.
  4. A primary key is required. Currently, it must be a single column, and must be the first column in the table.
  5. Because each table's primary key may be of a different datatype, and to keep queries on delta tables as simple as possible, Bucardo 1 uses a separate delta table for each table being tracked.
  6. A more pluggable system for adding table-specific collision handling would be nice.
  7. The delta table column "delta_action" isn't actually necessary -- inserts and updates are already handled identically, and deletes can be inferred from the join on the tracked table. The "delta_action" is perhaps a nice bit of diagnostic information, and not burdensome as a CHAR(1), but otherwise could be removed.
  8. It's important that the delta table's "last_modified" column be based on wallclock time, not transaction start time, because we only keep the most recent change, and if all changes within a transaction are tagged by transaction start time, we'd end up with an arbitrary row as the "most recent" one, resulting in inconsistent data between the databases.
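
The two sequence arrangements from item 3 can be modelled with plain Python iterators to show that the generated keys never overlap. This is only an illustration of the numbering schemes, not of PostgreSQL sequences themselves; the upper bound of 1999999 for the second server is an assumed value, and the function names are hypothetical.

```python
from itertools import count, islice


def ranged_sequence(minvalue, maxvalue):
    """Yield minvalue..maxvalue inclusive, like MINVALUE/MAXVALUE on a sequence."""
    return iter(range(minvalue, maxvalue + 1))


def stepped_sequence(start, increment):
    """Yield start, start + increment, ... like START n INCREMENT m."""
    return count(start, increment)


# Option 1: separate ranges per server.
server1_ids = list(islice(ranged_sequence(1, 999999), 5))
server2_ids = list(islice(ranged_sequence(1000000, 1999999), 5))

# Option 2: odd numbers on one server, even numbers on the other.
odd_ids = list(islice(stepped_sequence(1, 2), 5))
even_ids = list(islice(stepped_sequence(2, 2), 5))

print(server1_ids, server2_ids)
print(odd_ids, even_ids)
```

The odd/even scheme never runs out of room on either side, while the range scheme keeps each server's keys contiguous but needs the boundary chosen generously up front.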

Pushdelta

The pushdelta operation uses the same kind of delta tables and associated triggers and rules that the peer sync uses, but is a one-way push of the changed rows from master to slave. It is useful for large tables that don't have a high percentage of changed rows.

The pushdelta operation currently only supports a single target database. The ability to use pushdelta from a master to multiple slaves would be useful.

Push

The push operation very simply copies entire tables from the master to one or more slaves, for each table in a group. It requires no delta tables, triggers, or rules.

Table pushes can optionally supply a query that will be used instead of a bare "SELECT *" on the source table. Any query is allowed that will result in matching columns for the target table. We've used this to push out only in-stock inventory, rather than the whole inventory table, for example.

No primary key is required on tables that are pushed out in full.

The push operation uses DELETE to empty the target table. It would be good to optionally specify that TRUNCATE be used instead, and to take advantage of the PostgreSQL 8.1 multi-table truncate feature on tables with foreign key references.
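
A push with an optional query can be sketched as below. To stay self-contained, this uses two in-memory SQLite databases from Python's standard library as stand-ins for the PostgreSQL master and slave; the push() function, the table contents, and the column names are illustrative assumptions, not Bucardo's actual interface.

```python
import sqlite3

master = sqlite3.connect(":memory:")
slave = sqlite3.connect(":memory:")
for db in (master, slave):
    db.execute("CREATE TABLE inventory (sku TEXT, quantity INTEGER)")

master.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("tent", 4), ("stove", 0), ("boots", 12)],
)
slave.execute("INSERT INTO inventory VALUES ('stale', 1)")


def push(source, target, table, query=None):
    """Replace the target table's contents with the result of a source query."""
    rows = source.execute(query or f"SELECT * FROM {table}").fetchall()
    with target:  # one transaction: commit on success, roll back on error
        target.execute(f"DELETE FROM {table}")
        if rows:
            placeholders = ", ".join("?" * len(rows[0]))
            target.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", rows
            )
    return len(rows)


# Push only in-stock inventory instead of the whole table.
pushed = push(master, slave, "inventory",
              query="SELECT * FROM inventory WHERE quantity > 0")
result = slave.execute(
    "SELECT sku, quantity FROM inventory ORDER BY sku"
).fetchall()
print(pushed, result)
```

Doing the DELETE and the inserts in one transaction means readers of the target never see an empty table in between.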

Compare

The compare operation compares every row of the tables in its group, and displays any differences. It is a read-only operation. It can be used to make sure that tables to be used in multi-master replication start out identical, and later, to verify correct functioning of peer, pushdelta, and push operations.

The compare operation is fairly slow. It reads in all primary keys from both tables first, then fetches each row in turn. It could be made much more efficient.
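
The approach described here, reading all keys from both sides and then checking rows one at a time, looks roughly like this in a minimal in-memory sketch. The tables are modelled as dicts keyed by primary key, and the function name and sample rows are hypothetical.

```python
def compare_tables(table_a, table_b):
    """Read-only comparison of two tables modelled as dicts keyed by primary key."""
    keys_a, keys_b = set(table_a), set(table_b)   # read all keys first
    differences = []
    for key in sorted(keys_a | keys_b):
        if key not in keys_b:
            differences.append((key, "only in first"))
        elif key not in keys_a:
            differences.append((key, "only in second"))
        elif table_a[key] != table_b[key]:        # then fetch each row in turn
            differences.append((key, "rows differ"))
    return differences


first = {1: ("tent", 4), 2: ("stove", 0), 3: ("boots", 12)}
second = {1: ("tent", 4), 2: ("stove", 1), 4: ("rope", 7)}
diffs = compare_tables(first, second)
for key, problem in diffs:
    print(key, problem)
```

Against a real database each row lookup is a separate round trip, which is where the slowness comes from; comparing per-row checksums computed on the server would be one way to reduce that cost.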

Options

Optionally, tables can be vacuumed and/or analyzed after each operation.

In earlier versions of Bucardo 1, there was also an option to drop and rebuild all indexes automatically, to reduce index bloat, but beginning with PostgreSQL 7.3, primary key indexes could not be dropped when foreign keys required them, and the index bloat problem was dramatically reduced in PostgreSQL 7.4, mostly eliminating the need for the feature.

Limitations

Some of these are limitations that could easily be lifted, but no need had arisen. Some are minor annoyances, and others are major feature requests.

  1. For peer, pushdelta, and compare operations, a primary key is required. There are currently limitations on that key:
    1. Only single-part primary keys are supported.
    2. The primary key is assumed to be the first column. It would be easy to allow specifying another column as the primary key, or to interrogate the database schema directly to determine the key column, but we've never needed it.
  2. If an operation of one type is already underway, other operations of the same type will be rejected. It would be much more convenient for the users to add the newly requested operation to a queue and perform it when the current operation has finished.
  3. The program stands alone, performing a single operation and exiting. It was designed to run from cron. A persistent daemon that accepts requests in a queue or by message passing could better handle the many operations needed on a busy server.
  4. The program could use PostgreSQL's LISTEN and NOTIFY feature to learn of changes in a table and run a peer sync based on that notification, instead of being run on a timed schedule or on demand.
  5. Delta tables and triggers must be created or removed manually, though our helper script makes that fairly easy. It would be nice to have Bucardo automatically create delta tables and triggers as needed, or remove them when no longer needed (so that the overhead of tracking changes isn't incurred).
  6. Delta tables clutter the schema of the tables they are connected to. PostgreSQL didn't yet have the schema (namespace) feature when Bucardo 1 was created, but it would be nice to centralize the delta tables and functions in a separate schema.
  7. The datatypes of the fields in tables being replicated are not compared; only the names and order are compared.
  8. The configuration file syntax is fairly unpleasant.
  9. Only tables can be synchronized. It would be good to add support for views, sequences, and functions as first-class objects that could be pushed from master to slave or synchronized between two masters.
  10. It would be more convenient, and could reduce the chance of trouble due to misconfiguration, if Bucardo would interrogate the database to learn of all foreign key relationships between tables so that it could automatically create groups of tables that need to be processed together. Trigger functions and rules can cause changes to one table's row to modify rows in other table(s), in an opaque way that is resistant to introspection, but Bucardo could offer a location for users to declare what other tables a function can affect, and use that in building its dependency tree.
  11. There is no unit test suite.
  12. The insert trigger and update_last_modified function are written in PL/pgSQL, and are the only dependency on PL/pgSQL. They are both simple functions and should work fine as plain SQL functions, but it seems like there was a reason I had to use PL/pgSQL -- I just can't remember why anymore.
  13. In Bucardo 1, permission to insert to the various delta tables must be granted to any user that would change the base tables, or changes will be prevented by PostgreSQL. For a database with many users of varying access levels, this is a pain. It would be better to define the function to run as SECURITY DEFINER, and create the function as the superuser. Then no explicit permission would need to be granted on any delta table, and the delta tables would be inaccessible except through the Bucardo 1 API (except to the superuser). That would necessitate a change to using functions for updates and deletes, which currently are tracked by rules.
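
The automatic grouping suggested in item 10 amounts to finding connected components in the graph of foreign key relationships. A minimal sketch, assuming the (child, parent) pairs have already been read from the database catalog; the function name group_tables and the "products", "categories", and "audit_log" tables are hypothetical:

```python
from collections import defaultdict


def group_tables(tables, foreign_keys):
    """Group tables into sets connected by foreign keys (connected components)."""
    neighbours = defaultdict(set)
    for child, parent in foreign_keys:
        neighbours[child].add(parent)
        neighbours[parent].add(child)

    groups, seen = [], set()
    for table in sorted(tables):
        if table in seen:
            continue
        group, stack = set(), [table]
        while stack:                       # depth-first walk of one component
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


tables = ["orders", "order_lines", "payments", "users",
          "products", "categories", "audit_log"]
foreign_keys = [
    ("orders", "users"),
    ("order_lines", "orders"),
    ("payments", "orders"),
    ("products", "categories"),
]
groups = group_tables(tables, foreign_keys)
for group in groups:
    print(group)
```

Dependencies introduced by triggers or rules would not show up in the catalog's foreign keys, so user-declared relationships would simply be appended to the foreign_keys list before grouping.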

Future

Bucardo 1 performed admirably for Backcountry.com for over 4 years. The most serious problems, already mentioned above, have been the lack of a queue for push and pushdelta requests, limitations of running one-off processes from cron, limited row collision resolution, and bogging down under a large insert or update that happens inside a single transaction.

Greg Sabino Mullane then created Bucardo 2, which is a rearchitected system built around all new code. It has all the important features of Bucardo 1, addresses most of Bucardo 1's deficiencies, and adds many of the desired features listed above. We hope to publish some design notes about Bucardo 2 in the near future.

The Name

I originally gave Bucardo 1 the fairly descriptive but uninspiring name "sync-tables". Greg Sabino Mullane came up with the name Bucardo, a reference to the logo of this program's patron, Backcountry.com. You can read about attempts to clone the extinct bucardo in the Wikipedia articles Bucardo and Cloning.

Being at the MySQL User Conference: how Postgres fits in

I spent last week in Santa Clara attending the MySQL User Conference. Friends had clued me in that the conference was going to be a riot, with developers from the many forks of MySQL in attendance, all vying for the spotlight and trying to differentiate themselves from the MySQL core code.

The Oracle announcement of acquiring Sun cast an uncertain and uncomfortable light over the talks about forks, community and the future of MySQL. Many people wondered aloud what development on the core of MySQL’s code would be like now, and what would become of the remaining MySQL engineers.

Would the engineers defect to Monty’s new company? Would Oracle end support of MySQL development? How would MySQL end users feel about the changes? Would there be a surge in interest in Postgres, my favorite open source database?

Of course, it’s a bit early to tell. So, I’ve really got two posts about the trip, and this first one is about PostgreSQL, aka Postgres.

There’s a huge opportunity right now for Postgres to tell its story. Not because of a specific failure on the part of MySQL, but because the Oracle acquisition has raised the consciousness of all of mainstream tech. Developers and IT managers are taking a serious look at Postgres for new development projects, and evaluating their database technology choices with an eye toward whatever Oracle decides to do.

In this window of uncertainty is an opportunity for Postgres advocates to explain what it is that draws us to the project.

As a developer and a sysadmin, my enthusiasm for Postgres comes directly from the people that work on the code. The love of their craft - developing beautiful, purpose-built code - is reflected in the product, the mailing lists and the individuals who make up our community.

When someone asks me why I choose Postgres, I have to first answer that it is because of the people I know who are involved in the project. I trust them, and believe that they make the best technology decisions when it comes to the core of the code.

I believe that there’s room for improvement in extending Postgres’ reach, and speaking to people who don’t already believe the same things that we believe: that conforming to the SQL standard is fundamentally a useful and important goal, that vertical scaling is an important design objective, and that consistency is just as important to excellent user experience as are verbose command names and syntactic sugar extensions.

All of those issues are debated when discussing (typically by people outside of the Postgres community) how the Postgres development is prioritized and how this community works. It is inarguable that in the web space, Postgres lost the race. But the initial goal of the project, I’d argue, wasn’t necessarily to be the most popular end-user database. Now, that may have changed... :)

Meantime, the Postgres community continues to mature. There are clear constraints we need to overcome on the people side. Two that I think about frequently are the need for more code reviewers for patch review and testing, and smoothing over our prickly mailing-list reputation by getting more volunteers responding to requests for information on the lists.

During a particularly raucous panel session at the Percona Performance Conference, a friend in the Postgres community commented that he was so happy that our community didn’t have the issues that the MySQL community has. And I said to him that it’s just a matter of time before we experience those issues if Postgres grows as MySQL has.

We will have issues with forks, conflicts and deep-cutting (founded, or unfounded) criticism. So, my advice to all the people I know in the Postgres community is to pay attention to what is happening with MySQL right now, because we can only benefit from being prepared.

End Point speakers at PGCon 2009

PGCon is the annual conference for PostgreSQL users and developers, and PGCon 2009 in Ottawa, Canada, is now only about 3 weeks away. The schedule of presentations looks excellent, and I'm excited to have three of my co-workers presenting talks there. Here's a quick rundown of those talks.

Power psql by Greg Sabino Mullane: The psql command-line interface to PostgreSQL is extremely powerful and versatile. While it's easy to get started with, investing a little time in learning its many features really pays off in improved productivity. Greg will explore some corners and features you might not have known about, and also delve a little into its history and, more importantly, its future.

VACUUM Strategy by Selena Deckelmann: VACUUM is an important topic for both new and seasoned users of Postgres. Selena's talk will focus on changes in Postgres from version 8.0 on, tuning configuration parameters related to VACUUM for best performance, autovacuum, the updated Free Space Map in 8.4, and the brand new Visibility Map.

Writing a Procedural Language by Josh Tolley: Stored procedures and user-defined functions offer a lot of power, and PostgreSQL already allows developing such code in many different programming languages. Josh will show how to write a new PostgreSQL procedural language, which offers many practical lessons in PostgreSQL internals. Using the thoroughly impractical language PL/LOLCODE makes it fun, to boot.

For more details about the conference, see the PGCon website.

Inside PostgreSQL - Data Types and Operator Classes

Two separate posts taken from two separate mailing lists I'm on have gotten me thinking about PostgreSQL data types and operator classes today. The first spoke of a table where the poster had noticed that there was no entry in the pg_stats table for a particular column using the point data type. The second talks about Bucardo failing when trying to select DISTINCT values from a polygon type column. I'll only talk about the first, here, but both of these behaviors stem from the fact that the data types in question lack a few things more common types always have.

The first stems from the point type's lack of a default b-tree operator class and lack of an explicitly-declared analyze function. What are those, you ask? In the pg_type table, the column typanalyze contains the OID of a function that will analyze the data type in question, so when you call ANALYZE on a table containing that data type, that function will be run. In a default installation of PostgreSQL, all rows contain 0 in this column, meaning use the default analyze function.

This default analyze function tries, among other things, to build a histogram of the data in the column. Histograms depend on the values in a table having a defined one-dimensional ordering (e.g. X < Y, like numbers on a number line or words in alphabetical order). Now it gets a bit more complex. Index access methods define "strategies", which are numbers that identify the role a particular operator plays for that kind of index. Per the PostgreSQL documentation, the b-tree access method defines the following:

Operation                Strategy Number
less than                1
less than or equal       2
equal                    3
greater than or equal    4
greater than             5

To build a histogram we might use strategies 1, 3, and 5, to determine whether two given values are equal, or which is greater. So having found that there's an appropriate operator class for this data type, the analyze function would finally look in the pg_amop table to get the operators it needs to build its histogram. pg_amop matches these strategy numbers with actual operator OIDs, which in turn identify the functions it should actually call.

This whole line of thought stemmed from the point data type not having these functions. B-tree indexes try to sort their data in some order, as determined by the functions talked about above. But point types don't have an obvious one-dimensional ordering, so the b-tree index isn't really appropriate for them. So there's no b-tree operator class, and thus no statistics from columns of point type.
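
To make the dependence on ordering concrete, here is a small Python sketch of building equal-frequency histogram bounds. It is not PostgreSQL's actual ANALYZE code; the Point class and the histogram_bounds helper are made up for illustration. The only thing the histogram needs is the ability to sort, which is exactly what a b-tree operator class supplies and what a type like point lacks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A 2-D point with equality but, deliberately, no ordering defined."""
    x: float
    y: float


def histogram_bounds(values, buckets):
    """Return bucket boundaries that split the sorted values into equal-sized buckets.

    Sorting is the only requirement, which means the values need a "less than".
    """
    ordered = sorted(values)  # raises TypeError if the type has no ordering
    last = len(ordered) - 1
    return [ordered[(i * last) // buckets] for i in range(buckets + 1)]


numbers = [7, 1, 9, 3, 5, 8, 2, 6, 4]
print(histogram_bounds(numbers, 4))

try:
    histogram_bounds([Point(0, 1), Point(1, 0), Point(2, 2)], 2)
    point_result = "sorted"
except TypeError:
    point_result = "no ordering"
print(point_result)
```

The integers yield the bounds 1, 3, 5, 7, 9, while the points fail at the sorting step, much as ANALYZE has nothing to build a histogram from when no ordering operators exist.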

All that said, if you can think of a nice set of statistics ANALYZE might get from point data that would be useful for later query planning, you might implement a custom analyze function to fill the pg_stats table, and selectivity estimation functions to consume the data you generate, to make queries on point data that much better...

UPDATE: Those interested in the guts of a type-specific analyze function might take a look at ts_typanalyze, which is in 8.4. Note that on its own, the typanalyze function doesn't do any good -- it needs selectivity functions, defined in this file, which also were committed in 8.4. Both patches courtesy of Jan Urbanski, and various reviewers.

OFFSET 0, FTW

A query I worked with the other day gave me a nice example of a useful PostgreSQL query planner trick. This query originally selected a few fields from a set of inner-joined tables, sorting by one particular field in descending order and limiting the results, like this:

SELECT <some stuff> FROM table_a INNER JOIN table_b ON (...)
INNER JOIN table_c ON (...) WHERE table_a.field1 = 'value'
ORDER BY table_a.field2 DESC LIMIT 20

The resulting query plan involved a bunch of index scans on the various tables, joined with nested loops, all based on a backward index scan of an index on the table_a.field2 column, looking for rows that matched the condition in the WHERE clause. PostgreSQL likes to choose backward index scans when there's a LIMIT clause and it needs results sorted in reverse order, because although backward index scans can be fairly slow, they're easy to interrupt once enough rows have been found to satisfy the LIMIT. In this case, it figured it could search backward through the index on table_a.field2 and quickly find 20 rows where table_a.field1 = 'value' is true. The problem was that it didn't find enough rows as quickly as it thought it would.

One way of dealing with this is to improve your statistics, which is what PostgreSQL uses to estimate how long the backward index scan will take in the first place. But sometimes that method still doesn't pan out, and it takes a lot of experimentation to be sure it works. That level of experimenting didn't seem appropriate in this case, so I used another trick. I guessed that maybe if I could get PostgreSQL to first pull out all the rows matching the WHERE clause, it could join them to the other tables involved and then do a separate sorting step, and come out faster than the plan that it was using currently. Step one is to separate out the part that filters table_a:

SELECT <some stuff> FROM
(SELECT * FROM table_a WHERE field1 = 'value') a
INNER JOIN table_b ON (...) INNER JOIN table_c ON (...)
ORDER BY a.field2 DESC LIMIT 20

The problem is that this doesn't change the query plan at all. PostgreSQL tries to "flatten" nested subqueries -- that is, it fiddles with join orders and query ordering to avoid subquery operations. In order to convince it not to flatten the new subquery, I added "OFFSET 0" to the subquery. This new query gives me the plan I want:

SELECT <some stuff> FROM
(SELECT * FROM table_a WHERE field1 = 'value' OFFSET 0) a
INNER JOIN table_b ON (...) INNER JOIN table_c ON (...)
ORDER BY a.field2 DESC LIMIT 20

This selects all rows from table_a where field1 = 'value', and uses them as a distinct relation for the rest of the query. This led to a distinct sorting step, and made the resulting query much faster than it had been previously.
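
The difference between the two plans can be imitated outside the database. The following Python sketch is only an analogy, with made-up data and function names. It counts how many rows each strategy touches when rows matching the WHERE clause are rare and sit at the low end of the sort order, which is the situation where the backward index scan disappoints.

```python
def backward_scan(rows, wanted, limit):
    """Walk rows in descending field2 order, stopping once `limit` matches are found.

    Mimics a backward index scan on field2 with a filter on field1.
    Returns (matching rows, number of rows visited).
    """
    found, visited = [], 0
    for row in sorted(rows, key=lambda r: r["field2"], reverse=True):
        visited += 1
        if row["field1"] == wanted:
            found.append(row)
            if len(found) == limit:
                break
    return found, visited


def filter_then_sort(rows, wanted, limit):
    """Pull out every matching row first, then sort only those.

    Mimics the OFFSET 0 subquery followed by a separate sort step.
    Returns (matching rows, number of rows sorted).
    """
    matches = [r for r in rows if r["field1"] == wanted]
    matches.sort(key=lambda r: r["field2"], reverse=True)
    return matches[:limit], len(matches)


# Made-up data: 100,000 rows, of which only the 25 with the lowest field2 match.
rows = [{"field1": "value" if i < 25 else "other", "field2": i} for i in range(100_000)]

scan_result, scan_visited = backward_scan(rows, "value", 20)
sort_result, sort_sorted = filter_then_sort(rows, "value", 20)
print(scan_visited, sort_sorted)
```

Both strategies return the same 20 rows. The first visits 99,995 rows to find them, while the second sorts only the 25 that match.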

CAVEAT: The query planner is pretty much always smarter than whoever is sending it queries. This trick just happened to work, but can be a really bad idea in some cases. It tells PostgreSQL to pull all matching rows out of the table and keep them all in memory (or worse, temporary disk files), and renders useless any indexes on the original table. If there were lots of rows matching the condition, this would be Very Bad. If one day my table changes and suddenly has lots of rows matching that condition, it will be Very Bad. It's because of potential problems like this that PostgreSQL doesn't support planner hints -- such things are a potent foot gun. Use with great care.

Greg's THREE talks at PostgreSQL Conference East this weekend

(Cross posted from my personal/postgres blog)

Greg Sabino Mullane will be presenting three talks at PostgreSQL Conference East this weekend in Philadelphia, at Drexel University. The talks are listed on the site, and here's what he'll be speaking about:

Bucardo
April 5, Sunday, 10am
Bucardo is a replication system for Postgres that uses triggers to asynchronously copy data from one server to many others (master-slave) or to exchange data between two servers (master-master). We'll look at replication in general and where Bucardo fits in among other solutions, we'll take a look at some of its features and use-cases, and discuss where it is going next. We'll set up a running system along the way to demonstrate how it all works.

Monitoring Postgres with check_postgres.pl
April 4, Saturday, 2:30pm
What should you monitor? And how? We'll look at the sort of things you should care about when watching over your Postgres databases, as well as ways to graph and analyze metadata about your database, with a focus on the check_postgres.pl script.

The Power of psql
April 4, Saturday 10:30am
All about everyone's favorite Postgres utility, psql, the best command-line database interface, period. We'll cover basic and advanced usage.

I've seen a few of Greg's talks -- The Magic of MVCC, Cloning an elephant and a few others. He's a great speaker and cool guy. And he's my boss. But I'm not just saying that because he's my boss! Really!

He doesn't like to brag about himself, so I'm gonna help him out. He maintains DBD::Pg, check_postgres.pl, Bucardo and has had MANY patches committed to PostgreSQL. He's also a volunteer for the PostgreSQL sysadmins team, and specifically helps maintain the git repo box. He's a contributor to the MediaWiki project. He's on the board of the United States PostgreSQL Association. He's basically awesome.

If you're gonna be there, you should check out his talks. And if you can't make it, here's hoping Josh Drake records the talks and shares them with us all! :)

Inside PostgreSQL -- Multi-Batch Hash Join Improvements

A few days ago a patch was committed to improve PostgreSQL's performance when hash joining tables too large to fit into memory. I found this particularly interesting, as I was a minor participant in the patch review.

A hash join is a way of joining two tables where the database partitions each table, starting with the smaller one, using a hash algorithm on the values in the join columns. It then goes through each partition in turn, joining the rows from the first table with those from the second that fell in the same partition.

Things get more interesting when the set of partitions from the first table is too big to fit into memory. As the database partitions a table, if it runs out of memory it has to flush one or more partitions to disk. Then when it's done partitioning everything, it reads each partition back from the disk and joins the rows inside it. That's where the "Multi-Batch" in the title of this post comes in -- each partition is a batch. The database chooses the smaller of the two tables to partition first to help guard against having to flush to disk, but it still needs to use the disk for sufficiently large tables.

In practice, there's one important optimization: after partitioning the first table, even if some partitions are flushed to disk, the database can keep some of the partitions in memory. It then partitions the second table, and if a row in that second table falls into a partition that's already in memory, the database can join it and then forget about it. It doesn't need to read in anything else from disk, or hang on to the row for later use. But if it can't immediately join the row with a partition already in memory, the database has to write that row to disk with the rest of the partition it belongs to. It will read that partition back later and join the rows inside. So when the partitions of the first table get too big to fit into memory, there are performance gains to be had if it intelligently chooses which partitions go to disk. Specifically, it should keep in memory those partitions that are more likely to join with something in the second table.

How, you ask, can the database know which partitions those are? Because it has statistics describing the distribution of data in every column of every table: the histogram. Assume it wants to join tables A and B, as in "SELECT * FROM A JOIN B USING (id)". If B.id is significantly skewed -- that is, if some values show up noticeably more frequently than others -- PostgreSQL can tell by looking at its statistics for that column, assuming we have an adequately large statistics_target on the column and have analyzed the table appropriately. Using the statistics, PostgreSQL can determine approximately what percentage of the rows in B have a particular value in the "id" column. So when deciding to flush a partition to disk while partitioning table A, PostgreSQL now knows enough to hang on to those partitions containing values that show up most often in B.id, resulting in a noticeable speed improvement in common cases.
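
The idea can be sketched in a few lines of Python. This follows the description above rather than the committed executor code, and every name and number in it is an assumption made for illustration: value frequencies for B.id stand in for the column statistics, and a simple modulo stands in for the hash function.

```python
def partition_of(key, n_partitions):
    """Assign a join key to a partition (a stand-in for the real hash function)."""
    return hash(key) % n_partitions


def choose_resident_partitions(b_value_frequencies, n_partitions, memory_slots):
    """Pick which partitions of table A to keep in memory.

    b_value_frequencies: {join value: estimated fraction of B's rows}, as the
        planner might estimate from column statistics.
    memory_slots: how many partitions fit in memory at once.
    Returns the set of partition numbers expected to match the most rows of B.
    """
    expected = [0.0] * n_partitions
    for value, fraction in b_value_frequencies.items():
        expected[partition_of(value, n_partitions)] += fraction
    ranked = sorted(range(n_partitions), key=lambda p: expected[p], reverse=True)
    return set(ranked[:memory_slots])


def rows_spilled(b_keys, resident, n_partitions):
    """Count rows of B that cannot be joined immediately and must go to disk."""
    return sum(1 for k in b_keys if partition_of(k, n_partitions) not in resident)


# Made-up skewed data: of 100 rows in B, 70 have id 7 and 20 have id 2.
b_keys = [7] * 70 + [2] * 20 + [0, 1, 3, 4, 5, 6, 8, 9, 10, 11]
freqs = {7: 0.70, 2: 0.20}

informed = choose_resident_partitions(freqs, n_partitions=4, memory_slots=2)
naive = {0, 1}  # keeping the first two partitions, ignoring statistics
print(rows_spilled(b_keys, informed, 4), rows_spilled(b_keys, naive, 4))
```

With these numbers the statistics-guided choice writes 6 rows of B to disk, against 94 when the resident partitions are picked without regard to skew.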

pg_controldata

PostgreSQL ships with several utility applications to administer the server life cycle and clean up in the event of problems. I spent some time lately looking at what is probably one of the least well known of these, pg_controldata. This utility dumps out a number of useful tidbits about a database cluster, given the data directory it should look at. Here's an example from a little-used 8.3.6 instance:
josh@eddie:~$ pg_controldata
pg_control version number:            833
Catalog version number:               200711281
Database system identifier:           5291243377389434335
Database cluster state:               in production
pg_control last modified:             Mon 09 Mar 2009 04:05:23 PM MDT
Latest checkpoint location:           0/B70E5B9C
Prior checkpoint location:            0/B70E5B5C
Latest checkpoint's REDO location:    0/B70E5B9C
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/307060
Latest checkpoint's NextOID:          37410
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Fri 06 Mar 2009 02:27:02 PM MST
Minimum recovery ending location:     0/0
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        2000
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           en_US.UTF-8
LC_CTYPE:                             en_US.UTF-8
I can't claim to speak with authority on all these data, but leave it as an exercise to the reader to determine the meaning of those that appear most captivating. One of pg_controldata's more interesting features is that it doesn't have to actually connect to anything; it reads everything from the disk. That means you can use it on databases in the middle of WAL recovery, even though you can't actually query the recovering database. The check_postgres.pl script uses this unique capability to make inferences about the health of a WAL replica, specifically by making sure checkpoints happen fairly regularly. pg_controldata requires only one argument, the data directory of the PostgreSQL instance you're interested in, and that only if you haven't already set the PGDATA environment variable.
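
Because the output is plain "label: value" lines, it is easy to consume from a script. A minimal Python sketch follows. The parse_controldata and checkpoint_age helpers and the three-day threshold are invented for illustration, and this is not how check_postgres.pl (a Perl script) does it. In practice the text would come from running pg_controldata itself; here a few lines from the output above are pasted in.

```python
from datetime import datetime, timedelta

SAMPLE = """\
pg_control version number:            833
Database cluster state:               in production
Latest checkpoint location:           0/B70E5B9C
Time of latest checkpoint:            Fri 06 Mar 2009 02:27:02 PM MST
"""


def parse_controldata(text):
    """Turn pg_controldata output into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:
            info[label.strip()] = value.strip()
    return info


def checkpoint_age(info, now):
    """Return how long ago the latest checkpoint happened.

    The time zone abbreviation is dropped before parsing, so the result is
    only approximate; that is an assumption made to keep the sketch short.
    """
    stamp = info["Time of latest checkpoint"].rsplit(" ", 1)[0]
    checkpoint = datetime.strptime(stamp, "%a %d %b %Y %I:%M:%S %p")
    return now - checkpoint


info = parse_controldata(SAMPLE)
age = checkpoint_age(info, now=datetime(2009, 3, 9, 16, 5, 23))
print(info["Database cluster state"], age > timedelta(days=3))
```

Splitting on the first colon only matters because values such as timestamps contain colons of their own. The date format matches the sample above and would need adjusting for a server running under a different locale.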

Replicate only parts of my table

A day or two ago in #slony, someone asked if Slony would replicate only selected columns of a table. A natural response might be to create a view containing only the columns you're interested in, and have Slony replicate that. But Slony is trigger-based -- the only reason it knows there's something to replicate is because a trigger has told it so -- and you can't have a trigger on a view. So that won't work. Greg chimed in to say that Bucardo could do it, and mentioned a Bucardo feature I'd not yet noticed. Bucardo is trigger-based, like Slony, so defining a view won't work. But it allows you to specify a special query string for each table you're replicating. This query is called a "customselect", and can serve to limit the columns you replicate, transform the rows as they're being replicated, etc., and probably a bunch of other stuff I haven't thought of yet. A simple example:
  1. Create a table in one database as follows:
    CREATE TABLE synctest (
       id INTEGER PRIMARY KEY,
       field1 TEXT,
       field2 TEXT,
       field3 TEXT
    );
    
  2. Also create this table in the replication destination database; Bucardo won't replicate schema changes or database structure.
  3. Tell Bucardo about the table. I won't give the SQL here because it's already available in the Bucardo documentation. Suffice it to say you need to tell the goat table about a customselect query. For my testing, I used 'SELECT id, field1 FROM synctest'. Note that the fields returned by this query must:
    • Include all the primary key fields from the table. Bucardo will complain if it can't find the primary key in the results of the customselect query.
    • Return field names matching those of the table. This means, for example, that if you somehow transform the contents of a field, you need to make sure the query explicitly names the results something Bucardo can recognize, e.g. 'SELECT id, do_some_transformation(field1) AS field1 FROM synctest'
  4. Tell the sync to use the custom select statements by setting the 'usecustomselect' field in the sync table to TRUE for the sync in question.
  5. Fire up Bucardo and see the results. Here's my source table:
    58921 josh@bucardo_test# select * from uniq_test ;
     id |  field1  | field2 | field3  
    ----+----------+--------+---------
      1 | alpha    | bravo  | charlie
      2 | delta    | echo   | foxtrot
      3 | hotel    | india  | juliet
      4 | kilo     | lima   | mike
      5 | november | oscar  | papa
      6 | romeo    | sierra | tango
      7 | uniform  | victor | whiskey
      8 | xray     | yankee | zulu
    (8 rows)
    
    ...and here's my destination table...
    58922 josh@bucardo_test# select * from uniq_test;
     id |  field1  | field2 | field3 
    ----+----------+--------+--------
      1 | alpha    |        | 
      2 | delta    |        | 
      3 | hotel    |        | 
      4 | kilo     |        | 
      5 | november |        | 
      6 | romeo    |        | 
      7 | uniform  |        | 
      8 | xray     |        | 
    (8 rows)
    
Note that at least for now, customselect works only with fullcopy sync types. Also, the destination table must match the source table in structure, even if you're not going to copy all the fields. That is, even though I'm only replicating the 'id' and 'field1' fields in the example above, the destination table needs to contain all the fields in the source table. This is one of Bucardo's TODO items...

Announcing Release of PostgreSQL System Impact (PGSI) Log Analyzer

The PostgreSQL System Impact (PGSI) log analyzer is now available at http://bucardo.org/wiki/Pgsi.

System Impact (SI) is a measure of the overall load a given query imposes on a server. It is expressed as a percentage: a query's average duration divided by its average interval between successive calls.

Queries are collected into canonical form with respect to literals and bind params; further, IN lists of varying cardinality are collapsed. Thus, queries that differ only in argument composition will be collected together in the evaluation. However, logically equivalent queries that differ in any other manner of structure (say two comparisons between AND that are transposed) will be seen as distinct.
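
A rough Python sketch of this kind of canonicalization is shown below. It is not PGSI's own code, and the regular expressions are simplified assumptions that ignore comments, dollar-quoted strings, and other corners of SQL syntax.

```python
import re


def canonicalize(query):
    """Reduce a query to a canonical form so that calls differing only in arguments match."""
    q = re.sub(r"\s+", " ", query.strip())
    q = re.sub(r"'(?:[^']|'')*'", "?", q)          # string literals
    q = re.sub(r"\$\d+", "?", q)                    # bind parameters such as $1
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "?", q)        # numeric literals
    q = re.sub(r"(?i)\bIN \(\s*\?(?:\s*,\s*\?)*\s*\)", "IN (?)", q)  # collapse IN lists
    return q


a = canonicalize("SELECT * FROM orders WHERE id IN (1, 2, 3) AND status = 'new'")
b = canonicalize("SELECT * FROM orders  WHERE id IN (42) AND status = 'shipped'")
c = canonicalize("SELECT * FROM orders WHERE status = 'new' AND id IN (1, 2, 3)")
print(a)
print(a == b, a == c)
```

The first two queries collapse to the same canonical string even though their IN lists differ in length, while the third, with its two comparisons transposed, stays distinct.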

The goal of SI is to identify those queries most likely to cause performance degradation on the database during heaviest traffic periods. Focusing exclusively on the least efficient queries can hide relatively fast-running queries that saturate the system more because they are called far more frequently. By contrast, focusing only on the most-frequently called queries will tend to emphasize small, highly optimized queries at the expense of slightly less popular queries that spend much more of their time between successive calls in an active state. These are often smaller queries that have failed to be optimized and punish a system severely under heavy load.
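
As a worked example of the definition, here is the calculation in Python with made-up timings. The function name and the numbers are illustrative only.

```python
def system_impact(avg_duration_ms, avg_interval_ms):
    """System Impact: average duration as a percentage of the average interval between calls."""
    return 100.0 * avg_duration_ms / avg_interval_ms


# Made-up numbers: a slow report query run once a minute, and a fast lookup run constantly.
slow_but_rare = system_impact(avg_duration_ms=900, avg_interval_ms=60_000)
fast_but_frequent = system_impact(avg_duration_ms=4, avg_interval_ms=10)
print(slow_but_rare, fast_but_frequent)
```

The 900 ms query scores an SI of 1.5, while the 4 ms query called every 10 ms scores 40.0, which is the kind of query a report sorted by duration alone would bury.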

PGSI requires full PostgreSQL logging through syslog with a prescribed format. Specifically, log_statement must be 'all' and log_duration must be 'on'. Given a continuous log interval of any duration, PGSI will calculate reports in wiki-ready format with the following data over that interval:

  • First line defines suggested wiki page name for the given report
  • Log interval over which the report applies
  • SI, sorted from worst to best
  • Average duration of execution for the canonical query
  • Total count of times canonical query was executed
  • Average interval between successive executions
  • Standard deviation of the duration
  • Display of the canonical query
  • List of log entries for best- and worst-duration instances of the canonical query (only if report was generated using the --offenders option).

PGSI can be downloaded in tar.gz format or can be accessed from Git, its version-control system. To obtain it from git, run:

git clone http://bucardo.org/pgsi.git/

Contributions are welcome. Send patches (unified output format, please) to mark@endpoint.com.

Test::Database Postgres support

At our recent company meeting, we organized a 'hackathon' at which the company was split into small groups to work on specific projects. My group was Postgres-focused and we chose to add Postgres support to the new Perl module Test::Database.

This turned out to be a decent sized task for the few hours we had to accomplish it. The team consisted of myself (Greg Sabino Mullane), Mark Johnson, Selena Deckelmann, and Josh Tolley. While I undertook the task of downloading the latest version and putting it into a local git repository, others were assigned to get an overview of how it worked, examine the API, and start writing some unit tests.

In a nutshell, the Test::Database module provides an easy interface for creating and destroying test databases. This can be a non-trivial task on some systems, so putting it all into a module makes sense (and has the added benefit of preventing everyone from reinventing this particular wheel). Once we had a basic understanding of how it worked, we were off.

While all of our tasks overlapped to some degree, we managed to get the job done without too much trouble, and in a fairly efficient manner. We made a new file for Postgres, added in all the required API methods, wrote tests for each one, and documented everything as we went along. The basic method to create a test database is to use the initdb program to create a new Postgres cluster, then modify the cluster to use a local Unix socket in the newly created directory (thus completely side-stepping the problem of using an already occupied port). Then we can start up the new cluster via the pg_ctl command, and create a new database.
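
The sequence of commands can be sketched as follows. This is a Python illustration, not the module's code (which is Perl); the cluster_commands helper, the directory, and the database name are hypothetical. It passes the socket directory on the server command line instead of editing the cluster's configuration, which is a different route to the same result, and it only builds the command lines without running anything.

```python
import shlex


def cluster_commands(data_dir, dbname="testdb"):
    """Build the commands for a throwaway cluster that listens only on a Unix socket.

    Nothing is executed here; each command is returned as an argument list
    suitable for subprocess.run().
    """
    # -k puts the Unix socket in the data directory; -h '' disables TCP listening,
    # so no TCP port on the machine is ever occupied.
    server_options = f"-k {shlex.quote(data_dir)} -h ''"
    return [
        ["initdb", "-D", data_dir],
        ["pg_ctl", "-D", data_dir, "-o", server_options, "-w", "start"],
        ["createdb", "-h", data_dir, dbname],
    ]


for command in cluster_commands("/tmp/test_database_pg"):
    print(shlex.join(command))
```

Clients reach such a cluster by giving the socket directory as the host, as the createdb call does.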

At the end of the day, we had a working module that passed all of its tests. We combined our git patches into a single one and mailed it to the author of the module, so hopefully you'll soon see a new version of Test::Database with Postgres support!

Slony1-2.0.0 + PostgreSQL 8.4devel

Many people use Slony to replicate PostgreSQL databases in various interesting ways. Slony is a bit tough to get used to, but works very well, and can be found in important places at a number of high-load, high-profile sites. A few weeks back I set up Slony1-2.0.0 (the latest release) with a development version of PostgreSQL 8.4, and kept track of the play-by-play, as follows:

Starting Environment

On this machine, PostgreSQL is installed from the CVS tree. I updated the tree and reinstalled just to have a well-known starting platform (output of each command has been removed for brevity).
jtolley@uber:~/devel/pgsql$ make distclean
jtolley@uber:~/devel/pgsql$ cvs update -Ad
jtolley@uber:~/devel/pgsql$ ./configure --prefix=/home/jtolley/devel/pgdb
jtolley@uber:~/devel/pgsql$ make
jtolley@uber:~/devel/pgsql$ make install
jtolley@uber:~/devel/pgsql$ cd ../pgdb
jtolley@uber:~/devel/pgdb$ initdb data
jtolley@uber:~/devel/pgdb$ pg_ctl -l ~/logfile -D data start
The --prefix option in ./configure tells PostgreSQL where to install itself. Slony uses a daemon called slon to do its work, and slon connects to a database over TCP, so I needed to configure PostgreSQL to allow TCP connections by editing postgresql.conf appropriately and restarting PostgreSQL.

Installing Slony

Next, I downloaded slony1-2.0.0.tar.bz2 and checked its MD5 checksum:
jtolley@uber:~/downloads$ wget http://www.slony.info/downloads/2.0/source/slony1-2.0.0.tar.bz2
--2008-12-17 11:29:54--  http://www.slony.info/downloads/2.0/source/slony1-2.0.0.tar.bz2
Resolving www.slony.info... 207.173.203.170
Connecting to www.slony.info|207.173.203.170|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 909567 (888K) [application/x-tar]
Saving to: `slony1-2.0.0.tar.bz2'
...
jtolley@uber:~/downloads$ md5sum slony1-2.0.0.tar.bz2 
d0c4955f10fe8efb7f4bbacbe5ee732b  slony1-2.0.0.tar.bz2
The MD5 checksum matches the one given on the slony website, so we can continue. First, I unzipped the download into /home/jtolley/devel/slony1-2.0.0. Now we need to configure and build the source. Again, the output of each command has been removed for brevity.
jtolley@uber:~/devel/slony1-2.0.0$ ./configure --with-pgconfigdir=/home/jtolley/devel/pgdb/bin/ \
> --prefix=/home/jtolley/devel/slony --with-perltools=/home/jtolley/devel/pgdb/bin
jtolley@uber:~/devel/slony1-2.0.0$ make
jtolley@uber:~/devel/slony1-2.0.0$ make install
The configure options tell slony where to find pg_config, a program that reports the locations of various important PostgreSQL libraries and other components, where to install slony, and where to put slony's perl-based toolset, which we'll use later. I also added /home/jtolley/devel/slony/bin to my PATH.

Setting Up Replication

Configuring PostgreSQL

The Slony documentation demonstrates setting up a database with pgbench and replicating it to another database. This document demonstrates the same thing. We'll create a slony user, databases pgbench and pgbenchslave, use the pgbench utility to create a schema, and then copy that schema to pgbenchslave. We'll then set up slony to replicate changes in pgbench to pgbenchslave. Each slony database needs a slon process connected to it using a superuser account. First, we'll create the superuser account, called slony, and a pair of databases called pgbench and pgbenchslave:
jtolley@uber:~/devel/pgdb$ createuser -sP slony
Enter password for new role: 
Enter it again: 
jtolley@uber:~/devel/pgdb$ createdb pgbench
jtolley@uber:~/devel/pgdb$ createdb pgbenchslave
Now we'll create some schema objects in the pgbench database using the pgbench utility (the output from pgbench isn't shown here):
jtolley@uber:~/devel/pgdb$ pgbench -i -s 1 pgbench
Slony requires PL/pgSQL, so we'll install it now, in both databases:
jtolley@uber:~/devel/pgdb$ for i in pgbench pgbenchslave ; do createlang plpgsql $i ; done
Note: Here we have to make changes from what older versions of slony expect. Slony requires every replicated table to have a primary key, and used to be able to create keys for tables that didn't otherwise have them, if instructed to do so. As of version 2.0.0 that's no longer possible, perhaps because it was a bad idea anyway, in most cases, for users to do it. So we have to make sure each table has a primary key. The pgbench schema consists of four tables, called accounts, branches, tellers, and history. Of these four, history doesn't have a primary key, so we need to create one. Here's how I did it:
jtolley@uber:~/devel/pgdb$ psql -Xc "alter table history add id serial primary key;" pgbench
NOTICE:  ALTER TABLE will create implicit sequence "history_id_seq" for serial column "history.id"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "history_pkey" for table "history"
ALTER TABLE
Note the -X in the call to psql; this prevents the "\set AUTOCOMMIT off" setting in my psqlrc file from taking effect, so I didn't have to add a "commit" command to the stuff I send psql. Now that our schema is set up properly, let's copy it from pgbench to pgbenchslave. In this case we want to replicate all tables, so we'll copy everything.
jtolley@uber:~/devel/pgdb$ pg_dump -s pgbench | psql -X pgbenchslave

Configuring Slony's altperl Scripts

Now we're ready to set up slony, and we'll make use of slony's altperl scripts to do most of the configuration grunt work for us. To make altperl work, we need to set up slon_tools.conf. A sample already lives in slony/etc/slon_tools.conf-sample.
jtolley@uber:~/devel/pgdb$ cd ../slony/etc/
jtolley@uber:~/devel/slony/etc$ cp slon_tools.conf-sample slon_tools.conf
jtolley@uber:~/devel/slony/etc$ vim slon_tools.conf
This file defines nodes and sets, and is written in Perl. First, a group of nodes makes a slony cluster, which is a named object. You can set that name with the $CLUSTER_NAME parameter. We also need a directory where log information will be written, which goes in the $LOGDIR parameter. In this case, I've set it to "/home/jtolley/devel/slony/log", which I've manually created. The slon daemons need write access to this directory; since they'll be running as me on this machine, that's fine. Next we add all the nodes. In this case, there are only two nodes, defined as follows:
    add_node(node     => 1,
             host     => 'localhost',
             dbname   => 'pgbench',
             port     => 5432,
             user     => 'slony',
             password => 'slony');
    add_node(node     => 2,
             host     => 'localhost',
             dbname   => 'pgbenchslave',
             port     => 5432,
             user     => 'slony',
             password => 'slony');
I had to remove definitions of nodes 3 and 4 from the sample configuration. Now we define replication sets. This involves defining tables and the unique, not null indexes slony can use as a primary key. If the table has an explicitly defined primary key, slony will use it automatically. Because of our modifications to the history table above, all four of our tables have primary keys, so this part is simple. All our table names go in the pkeyedtables element of the $SLONY_SETS hash, as follows:
        "pkeyedtables" => [     
                                'public.accounts',
                                'public.tellers',
                                'public.history',
                                'public.branches'
                           ],
We don't have any tables without primary keys, so we don't need the keyedtables element, and Slony no longer creates serial indexes for you as of v2.0.0, so we can delete the serialtables element. We do need to replicate the history_id_seq we created as part of the history table's primary key, so add that to the sequences element, as follows:
        "sequences" => ['history_id_seq' ],
Finally, remove the sample configuration for set 2, and save the file.

Generating Slonik Configuration

Now that we've configured the altperl stuff, we can use it to generate scripts to pass to slonik, which will actually set things up.
jtolley@uber:~/devel/pgdb$ slonik_init_cluster > initcluster
jtolley@uber:~/devel/pgdb$ slonik_create_set 1 > createset
jtolley@uber:~/devel/pgdb$ slonik_subscribe_set 1 2 > subscribeset
This creates three files, each containing slonik code to set up a cluster and get it running. If you tried to use the serialtables stuff, you'll run into problems here with new versions of slony (not that I had that problem or anything...). Note that the arguments to slonik_subscribe_set differ from those given in the documentation. This script requires two arguments: the set you're interested in, and the node that's subscribing to it.

Starting Everything Up

We're ready to do real work. Tell slonik to initialize the cluster:
jtolley@uber:~/devel/pgdb$ slonik < initcluster 
:6: Possible unsupported PostgreSQL version (80400) 8.4, defaulting to 8.3 support
:6: could not open file /home/jtolley/devel/slony/share/postgresql/slony1_base.sql
The complaints about version 8.4 aren't surprising, as I'm using bleeding-edge PostgreSQL. But I think I had something wrong with my directories when I built slony. The files in question ended up in /home/jtolley/devel/pgdb/share/postgresql, so I did this:
jtolley@uber:~/devel/pgdb$ mkdir ../slony/share
jtolley@uber:~/devel/pgdb$ cp -r share/postgresql/ ../slony/share/
jtolley@uber:~/devel/pgdb$ slonik < initcluster 
:6: Possible unsupported PostgreSQL version (80400) 8.4, defaulting to 8.3 support
:9: Possible unsupported PostgreSQL version (80400) 8.4, defaulting to 8.3 support
:10: Set up replication nodes
:13: Next: configure paths for each node/origin
:16: Replication nodes prepared
:17: Please start a slon replication daemon for each node
This looks right, so the next step is to start the slon daemon for each node:
jtolley@uber:~/devel/pgdb$ slon_start 1
Invoke slon for node 1 - /home/jtolley/devel/slony/bin/slon -s 1000 -d2 replication 'host=localhost dbname=pgbench user=slony port=5432 password=slony' > /home/jtolley/devel/slony/log/slony1/node1/pgbench-2008-12-17_12:33:18.log 2>&1 &
Slon successfully started for cluster replication, node node1
PID [24745]
Start the watchdog process as well...
jtolley@uber:~/devel/pgdb$ syntax error at /home/jtolley/devel/pgdb/bin/slon_watchdog line 47, near "open "
Execution of /home/jtolley/devel/pgdb/bin/slon_watchdog aborted due to compilation errors.
Slony shipped a bug in slon_watchdog -- line 46 needs to have a semicolon at the end.
jtolley@uber:~/devel/pgdb$ pkill slon
jtolley@uber:~/devel/pgdb$ vim ../pgdb/bin/slon_watchdog
Change line 46 to read:
    my ($logfile) = "$LOGDIR/slon-$dbname-$node.err";
... and try again:
jtolley@uber:~/devel/pgdb$ slon_start 1
Invoke slon for node 1 - /home/jtolley/devel/slony/bin/slon -s 1000 -d2 replication 'host=localhost dbname=pgbench user=slony port=5432 password=slony' > /home/jtolley/devel/slony/log/slony1/node1/pgbench-2008-12-17_12:35:29.log 2>&1 &
Slon successfully started for cluster replication, node node1
PID [24918]
Start the watchdog process as well...
jtolley@uber:~/devel/pgdb$ slon_start 2
Invoke slon for node 2 - /home/jtolley/devel/slony/bin/slon -s 1000 -d2 replication 'host=localhost dbname=pgbenchslave user=slony port=5432 password=slony' > /home/jtolley/devel/slony/log/slony1/node2/pgbenchslave-2008-12-17_12:35:31.log 2>&1 &
Slon successfully started for cluster replication, node node2
PID [24962]
Start the watchdog process as well...
Now we need to create the replication set and subscribe node 2 to it:
jtolley@uber:~/devel/pgdb$ slonik < createset 
:16: Subscription set 1 created
:17: Adding tables to the subscription set
:21: Add primary keyed table public.accounts
:25: Add primary keyed table public.tellers
:29: Add primary keyed table public.history
:33: Add primary keyed table public.branches
:36: Adding sequences to the subscription set
:40: Add sequence public.history_id_seq
:41: All tables added
jtolley@uber:~/devel/pgdb$ slonik < subscribeset 
:10: Subscribed nodes to set 1

Watching It Work

Now we can make it do something interesting. First, start watching the logs. They live in /home/jtolley/devel/slony/log/slony1, and we can watch them like this, since there aren't too many log files involved:
jtolley@uber:~/devel/slony/log/slony1$ find . -type f | xargs tail -f
This shows lots of log info. If you want to see more, run another pgbench instance:
jtolley@uber:~/devel/pgdb$ pgbench -s 1 -c 5 -t 1000 pgbench
For extra credit, add another table to the replication set, get it replicated, and manually insert data. See if the new data come across.

Using cron and psql to transfer data across databases

I recently had to move information from one database to another automatically. I had centralized some auditing information so that specific information about each database in the cluster could be stored in a single table, inside a single database. While I still needed to copy the associated functions and views to each database, I was able to make use of the new "COPY TO query" feature to do it all in one step via cron.

At the top of the cron script, I added two lines defining the database I was pulling the information from ("alpha"), and the database I was sending the information to ("postgres"):

PSQL_ALPHA='/usr/bin/psql -X -q -t -d alpha'
PSQL_POSTGRES='/usr/bin/psql -X -q -t -d postgres'

From left to right, the options tell psql to not use any psqlrc file found (-X), to be quiet in the output (-q), to print tuples only and no header/footer information (-t), and to connect to the named database (-d).

The cron entry that did the work looked like this:

*/5 * * * * (echo "COPY audit_mydb_stats FROM STDIN;" && $PSQL_ALPHA -c "COPY (SELECT *, current_database(), now(), round(date_part('epoch'::text, now())) FROM audit_mydb_stats()) TO STDOUT" && echo "\\.") | $PSQL_POSTGRES -f -

From left to right, the command does this:

  • Run once every five minutes.
  • Take the entire output of the first parenthesized command and pipe it to the second command.
  • We build a complete COPY command to feed to the second database.
    • First, we echo the line that tells it where to store the data (COPY ... FROM STDIN)
    • Next, we run the 'COPY TO' command on the first database, which, instead of dumping a table, outputs the results of a function, plus three other columns indicating the current database, the current time and the current time as an epoch value.
    • After all the data is dumped out, we echo a "backslash dot" to indicate the end of the copied data
  • All of this is now piped to the second database by calling psql with a -f argument, indicating that we are reading from a file. In this case, the file is stdin via the newly opened pipe, indicated by a single dash after the -f argument.

This allowed me to simply move the data from one database to the other, with a transformation in the middle, neatly avoiding any need to change either the function's output or the columns on the target table.
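
The stream that reaches the second psql is just three pieces of text glued together: the COPY header, the dumped rows, and the terminator. A minimal Python sketch of that assembly is shown here purely as an illustration. The helper name build_copy_stream is hypothetical and not part of the cron job; it assumes COPY's default text format (tab-separated columns, \N for NULL) and values that contain no tabs, newlines, or backslashes.

```python
def build_copy_stream(target_table, rows):
    """Return the text the pipeline feeds to the receiving psql.

    Assumes COPY's default text format and values free of tabs,
    newlines, and backslashes (real data would need escaping).
    """
    # 1. The header tells the receiving database where the data goes.
    lines = ["COPY {} FROM STDIN;".format(target_table)]
    # 2. One line per row: tab-separated columns, \N standing in for NULL.
    for row in rows:
        lines.append("\t".join(r"\N" if value is None else str(value) for value in row))
    # 3. A "backslash dot" on its own line marks the end of the data.
    lines.append("\\.")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample row, only to show the shape of the stream.
    print(build_copy_stream("audit_mydb_stats", [(42, None, "alpha")]), end="")
```

In the cron entry the middle part comes from the first psql rather than from a Python list, but the framing (header first, terminator last) is the same.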

Greg Sabino Mullane @ US PostgreSQL Association

Belated congratulations to End Point's Greg Sabino Mullane for his election to the United States PostgreSQL Association's board for 2009-2011. It didn't really happen this way, but I think of Greg taking over Selena's board position there. (Actually Bruce Momjian filled that role till the elections.)

Anyway, nice work, all of you, on improving and promoting a great database and its equally important community.

Parallel Inventory Access using PostgreSQL

Inventory management has a number of challenges. One of the more vexing issues with which I've dealt is that of forced serial access. We have a product with X items in inventory. We also have multiple concurrent transactions vying for that inventory. Under any normal circumstance, whether the count is a simple scalar, or is composed of any number of records up to one record/quantity, the concurrent transactions are all going to home in on the same record, or set of records. In doing so, all transactions must wait and get their inventory serially, even if doing so isn't of interest.

If inventory is a scalar value, we don't have much hope of circumventing the problem. And, in fact, we wouldn't want to under that scenario because each transaction must reflect the part of the whole it consumed so that the next transaction knows how much is left to work with.

However, if we have inventory represented with one record = one quantity, we aren't forced to serialize in the same way. If we have multiple concurrent transactions vying for inventory, and the sum of the need is less than that available, why must the transactions wait at all? They would normally line up serially because, no matter what ordering you apply to the selection (short of random), it'll be the same ordering for each transaction (and even an increasing probability of conflict with random as concurrency increases). Thus, to all of them, the same inventory record looks the "most interesting" and, so, each waits for the lock from the transaction before it to resolve before moving on.

What we really want is for those transactions to attack the inventory like an Easter-egg hunt. They may all make a dash for the "most interesting" egg first, but only one of them will get it. And, instead of the other transactions standing there, coveting the taken egg, we want them to scurry on unabated and look for the next "most interesting" egg to throw in their baskets.

We can leverage some PostgreSQL features to accomplish this goal. The key for establishing parallel access into the inventory is to use the row lock on the inventory records as an indicator of a "soft lock" on the inventory. That is, we assume any row-locked inventory will ultimately be consumed, but recognize that it might not be. That allows us to pass over locked inventory, looking for other inventory to fill the need; but if we find we don't have enough inventory for our need, those locked records indicate that we should take another pass and try again. Eventually, we either get all the inventory we need, or we have consumed all the inventory there is, meaning less than we asked for but with no locked inventory present.
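
The control flow just described can be modeled outside the database. The following Python sketch is only an analogy, not the database implementation: threading.Lock objects stand in for row locks, a non-blocking acquire stands in for a row-lock attempt that refuses to wait, and the names grab_unlocked, max_passes, and base_sleep are assumptions made up for this illustration.

```python
import random
import threading
import time


def grab_unlocked(locks, desired, max_passes=10, base_sleep=0.45):
    """Acquire up to `desired` locks without ever waiting on a busy one.

    Returns the indexes of the locks now held by the caller.
    """
    for _ in range(max_passes):
        held = []
        saw_busy = False
        for index, lock in enumerate(locks):
            # Non-blocking attempt: take the lock if it is free, else move on.
            if lock.acquire(blocking=False):
                held.append(index)
                if len(held) >= desired:
                    return held  # got everything we asked for
            else:
                saw_busy = True  # someone else may or may not consume it
        if not saw_busy:
            return held  # nothing was busy: this is all there is
        # Short of the target while some items were busy: give everything
        # back, pause briefly with some jitter, and start over.
        for index in held:
            locks[index].release()
        time.sleep(base_sleep + random.random() * 0.1)
    raise RuntimeError("Giving up. Try again later.")
```

With five free items and the first one already held elsewhere, grab_unlocked(items, 2) skips the busy item and returns the next two. Releasing everything before a retry is a design choice: a caller never sits on a partial set while waiting for others.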

We write a pl/pgsql function to do all the dirty work for us. The function has the following args:

  • Name of table on which we want to apply parallel access
  • Query that retrieves all pertinent records, and in the desired order
  • Integer number of records we ultimately want locked for this transaction.

The function returns a set of ctids (SETOF tid). Using the ctid has the advantage that the function needs to know nothing about the composition of the table, and it provides exceedingly fast access back to the records of interest. Thus, the function can be applied to any table if desired and doesn't depend on properly indexed fields in the case of larger tables.

    CREATE OR REPLACE FUNCTION getlockedrows (
           tname TEXT,
           query TEXT,
           desired INT
       )
    RETURNS SETOF TID
    STRICT
    VOLATILE
    LANGUAGE PLPGSQL
    AS $EOR$
    DECLARE
       total   INT NOT NULL := 0;
       locked  BOOL NOT NULL := FALSE;
       myst    TEXT;
       myrec   RECORD;
       mytid   TEXT;
       found   TID[];
       loops   INT NOT NULL := 1;
    BEGIN
       -- Variables: tablename, full query of interest returning ctids of tablename rows, and # of rows desired.
       RAISE DEBUG 'Desired rows: %', desired;
       <<outermost>>
       LOOP
    /*
       May want a sanity limit here, based on loops:
       IF loops > 10 THEN
           RAISE EXCEPTION 'Giving up. Try again later.';
       END IF;
    */
           BEGIN
               total := 0;
               FOR myrec IN EXECUTE query
               LOOP
                   RAISE DEBUG 'Checking lock on id %',myrec.ctid;
                   mytid := myrec.ctid;
                   myst := 'SELECT 1 FROM '
                       || quote_ident(tname)
                       || ' WHERE ctid = $$'
                       || mytid
                       || '$$ FOR UPDATE NOWAIT';
                   BEGIN
                       EXECUTE myst;
                       -- If it worked:
                       total := total + 1;
                       found[total] := myrec.ctid;
                       -- quit as soon as we have all requested
                       EXIT outermost WHEN total >= desired;
                   -- It did not work
                   EXCEPTION
                       WHEN LOCK_NOT_AVAILABLE THEN
                           -- indicate we have at least one candidate locked
                           locked := TRUE;
                   END;
               END LOOP; -- end each row in the table
               IF NOT locked THEN
                   -- We have as many in found[] as we can get.
                   RAISE DEBUG 'Found % of the requested % rows.',
                       total,
                       desired;
                   EXIT outermost;
               END IF;
               -- We did not find as many rows as we wanted!
               -- But, some are currently locked, so keep trying.
               RAISE DEBUG 'Did not find enough rows!';
               RAISE EXCEPTION 'Roll it back!';
           EXCEPTION
               WHEN RAISE_EXCEPTION THEN
                   PERFORM pg_sleep(RANDOM()*0.1+0.45);
                   locked := FALSE;
                   loops := loops + 1;
           END;
       END LOOP outermost;
       FOR x IN 1 .. total LOOP
           RETURN NEXT found[x];
       END LOOP;
       RETURN;
    END;
    $EOR$
    ;
    

    The function makes a pass through all the records, attempting to row lock each one as it can. If we happen to lock as many as requested, we exit <<outermost>> immediately and start returning ctids. If we pass through all records without hitting any locks, we return the set even though it's less than requested. The calling code can decide how to react if there aren't as many as requested.

    To avoid artificial deadlocks, with each failed pass of <<outermost>>, we raise an exception in the enclosing block. That is, with each failed pass, we start over completely instead of holding on to those records we've already locked. Once a run has finished, it's all or nothing.

    We also mix up the sleep times just a bit so any two transactions that happen to be locked into a dance precisely because of their timing will (likely) break the cycle after the first loop.

    Example of using our new function from within a pl/pgsql function:

    ...
       text_query := $EOQ$
    SELECT ctid
    FROM inventory
    WHERE sku = 'COOLSHOES'
       AND status = 'AVAILABLE'
    ORDER BY age, location
    $EOQ$
    ;
    
       OPEN curs_inv FOR
           SELECT inventory_id
           FROM inventory
           WHERE ctid IN (
                   SELECT *
                   FROM getlockedrows(
                       'inventory',
                       text_query,
                       3
                   )
           );
    
       LOOP
    
           FETCH curs_inv INTO int_invid;
    
           EXIT WHEN NOT FOUND;
    
           UPDATE inventory
           SET status = 'SOLD'
           WHERE inventory_id = int_invid;
    
       END LOOP;
    ...
    

    The risk we run with this approach is that our ordering will not be strictly enforced. In the above example, if it's absolutely critical that the sort on age and location never be violated, then we cannot run our access to the inventory in parallel. The risk comes if T1 grabs the first record, T2 only needs one and grabs the second, but T1 aborts for some other reason and never consumes the record it originally locked.

    Why is my function slow?

    I often hear people ask "Why is my function so slow? The query runs fast when I do it from the command line!" The answer lies in the fact that a function's query plans are cached by Postgres, and the plan derived by the function is not always the same as shown by an EXPLAIN from the command line. To illustrate the difference, I downloaded the pagila test database. To show the problem, we'll need a table with a lot of rows, so I used the largest table, rental, which has the following structure:

    pagila# \d rental
                           Table "public.rental"
        Column    |   Type     |             Modifiers
    --------------+-----------------------------+--------------------------------
     rental_id    | integer    | not null default nextval('rental_rental_id_seq')
     rental_date  | timestamp  | not null
     inventory_id | integer    | not null
     customer_id  | smallint   | not null
     return_date  | timestamp  |
     staff_id     | smallint   | not null
     last_update  | timestamp  | not null default now()
    Indexes:
        "rental_pkey" PRIMARY KEY (rental_id)
        "idx_unq_rental" UNIQUE (rental_date, inventory_id, customer_id)
        "idx_fk_inventory_id" (inventory_id)
    

    It only had 16044 rows, however, not quite enough to demonstrate the difference we need. So let's add a few more rows. The unique index means any new rows will have to vary in one of the three columns: rental_date, inventory_id, or customer_id. The easiest to change is the rental date. By changing just that one item and adding the table back into itself, we can quickly and exponentially increase the size of the table like so:

    INSERT INTO rental(rental_date, inventory_id, customer_id, staff_id)
      SELECT rental_date + '1 minute'::interval, inventory_id, customer_id, staff_id
      FROM rental;
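
Each run of this statement copies every existing row once, so the table doubles. If the interval doubles on each run as well (1, 2, 4, 8, 16 minutes), the copies of any one original row all land on different dates. A small Python model of that arithmetic follows. The function name grow is hypothetical, and the model only tracks copies of a single original row; it does not rule out two different original rows that happen to collide.

```python
def grow(initial_rows, intervals_in_minutes):
    """Model repeated 'INSERT ... SELECT rental_date + interval FROM rental'.

    Returns the final row count and the set of minute offsets that copies
    of any single original row end up with.
    """
    rows = initial_rows
    offsets = {0}  # the original row sits at offset 0
    for minutes in intervals_in_minutes:
        shifted = {offset + minutes for offset in offsets}
        clash = offsets & shifted
        if clash:
            # Same rental_date, inventory_id and customer_id twice:
            # the unique index would reject the INSERT.
            raise ValueError("duplicate offsets: {}".format(sorted(clash)))
        offsets |= shifted
        rows *= 2  # every existing row is copied exactly once
    return rows, offsets


if __name__ == "__main__":
    rows, offsets = grow(16044, [1, 2, 4, 8, 16])
    print(rows, len(offsets))  # prints: 513408 32
```

Running the model with the same interval twice, for example grow(16044, [1, 1]), raises the error, which is why the interval has to change between runs.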
    

    I then ran the same query again, but with '2 minutes', '4 minutes', '8 minutes', and finally '16 minutes'. At this point, the table had 513,408 rows, which is enough for this example. I also ran an ANALYZE on the table in question (this should always be the first step when trying to figure out why things are going slower than expected). The next step is to write a simple function that accesses the table by counting how many rentals have occurred since a certain date:

    DROP FUNCTION IF EXISTS count_rentals_since_date(date);
    
    CREATE FUNCTION count_rentals_since_date(date)
    RETURNS BIGINT
    LANGUAGE plpgsql
    AS $body$
      DECLARE
        tcount INTEGER;
      BEGIN
        SELECT INTO tcount
          COUNT(*) FROM rental WHERE rental_date > $1;
      RETURN tcount;
      END;
    $body$;
    

    Simple enough, right? Let's test out a few dates and see how long each one takes:

    pagila# \timing
    
    pagila# select count_rentals_since_date('2005-08-01');
     count_rentals_since_date
    --------------------------
                       187901
    Time: 242.923 ms
    
    pagila# select count_rentals_since_date('2005-09-01');
     count_rentals_since_date
    --------------------------
                         5824
    Time: 224.718 ms
    

    Note: all of the queries in this article were run multiple times first to reduce any caching effects. Those times appear to be about the same, but I know from the distribution of the data that the first query will not hit the index, but the second one should. Thus, when we try and emulate what the function is doing on the command line, the first effort often looks like this:

    pagila# explain analyze select count(*) from rental where rental_date > '2005-08-01';
                         QUERY PLAN
    --------------------------------------------------------------------------------
     Aggregate (actual time=579.543..579.544)
       Seq Scan on rental (actual time=4.462..403.122 rows=187901)
         Filter: (rental_date > '2005-08-01 00:00:00')
     Total runtime: 579.603 ms
    
    pagila# explain analyze select count(*) from rental where rental_date > '2005-09-01';
    
                         QUERY PLAN
    --------------------------------------------------------------------------------
     Aggregate  (actual time=35.133..35.133)
       Bitmap Heap Scan on rental (actual time=1.852..30.451)
         Recheck Cond: (rental_date > '2005-09-01 00:00:00')
         -> Bitmap Index Scan on idx_unq_rental (actual time=1.582..1.582 rows=5824)
             Index Cond: (rental_date > '2005-09-01 00:00:00')
     Total runtime: 35.204 ms
    
    

    Wow, that's a huge difference! The second query is hitting the index and using some bitmap magic to pull back the rows in a blistering time of 35 milliseconds. However, the same date, using the function, takes 224 ms - over six times as slow! What's going on? Obviously, the function is *not* using the index, regardless of which date is passed in. This is because the function cannot know ahead of time what the dates are going to be, but caches a single query plan. In this case, it is caching the 'wrong' plan.
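
To make the reasoning concrete, here is a deliberately tiny Python model of the choice being made. It is a toy, not PostgreSQL's planner: the real planner compares estimated costs, while this sketch uses a single invented cutoff. Both constants (DEFAULT_FRACTION and INDEX_CUTOFF) and all function names are assumptions for illustration only.

```python
# Both constants are assumptions for this toy model, not planner internals.
DEFAULT_FRACTION = 1 / 3   # guess used when the comparison value is unknown
INDEX_CUTOFF = 0.10        # below this fraction of the table, prefer the index


def choose_plan(fraction_of_rows):
    """Pick a scan type from the estimated fraction of rows matched."""
    return "index scan" if fraction_of_rows < INDEX_CUTOFF else "seq scan"


def plan_with_literal(matching_rows, total_rows):
    """Planning with the date visible, as EXPLAIN on the command line does."""
    return choose_plan(matching_rows / total_rows)


def plan_with_parameter():
    """Planning once for $1, as the cached plan inside the function does."""
    return choose_plan(DEFAULT_FRACTION)


if __name__ == "__main__":
    total = 513408
    print(plan_with_literal(187901, total))  # prints: seq scan
    print(plan_with_literal(5824, total))    # prints: index scan
    print(plan_with_parameter())             # prints: seq scan
```

The point of the model is only that a plan built without seeing the date has to rely on a generic guess, and that guess can land on the wrong side of the cutoff for selective dates.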

    The correct way to see queries as a function sees them is to use prepared statements. This caches the query plan into memory and simply passes a value to the already prepared plan, just like a function does. The process looks like this:

    pagila# PREPARE foobar(DATE) AS SELECT count(*) FROM rental WHERE rental_date > $1;
    PREPARE
    
    pagila# EXPLAIN ANALYZE EXECUTE foobar('2005-08-01');
                    QUERY PLAN
    --------------------------------------------------------------
     Aggregate  (actual time=535.708..535.709 rows=1)
       ->  Seq Scan on rental (actual time=4.638..364.351 rows=187901)
             Filter: (rental_date > $1)
     Total runtime: 535.781 ms
    
    pagila# EXPLAIN ANALYZE EXECUTE foobar('2005-09-01');
                    QUERY PLAN
    --------------------------------------------------------------
     Aggregate  (actual time=280.374..280.375 rows=1)
       ->  Seq Scan on rental  (actual time=5.936..274.911 rows=5824)
             Filter: (rental_date > $1)
     Total runtime: 280.448 ms
    

    These numbers match the function, so we can now see the reason the function is running as slowly as it does: it is sticking to the "Seq Scan" plan. What we want to do is to have it use the index when the given date argument is such that the index would be faster. A function keeps only one cached plan per statement, so what we need to do is dynamically construct the SQL statement every time the function is called. This costs us a small bit of overhead versus having a cached query plan, but in this particular case (and, you'll find, in nearly all cases), the overhead is more than compensated for by the faster final plan. Making a dynamic query in plpgsql is a little more involved than the previous function, but it becomes old hat after you've written a few. Here's the same function, but with a dynamically generated SQL statement inside of it:

    DROP FUNCTION IF EXISTS count_rentals_since_date_dynamic(date);
    
    CREATE FUNCTION count_rentals_since_date_dynamic(date)
    RETURNS BIGINT
    LANGUAGE plpgsql
    AS $body$
      DECLARE
        myst TEXT;
        myrec RECORD;
      BEGIN
        myst = 'SELECT count(*) FROM rental WHERE rental_date > ' || quote_literal($1);
        FOR myrec IN EXECUTE myst LOOP
          RETURN myrec.count;
        END LOOP;
      END;
    $body$;
    

    Note that we use the quote_literal function to take care of any quoting we may need. Also notice that we need to enter into a loop to run the query and then parse the output, but we can simply return right away, as we only care about the output from the first (and only) returned row. Let's see how this new function performs compared to the old one:

    pagila# \timing
    
    pagila# select count_rentals_since_date_dynamic('2005-08-01');
     count_rentals_since_date_dynamic
    ----------------------------------
                               187901
    Time: 255.022 ms
    
    pagila# select count_rentals_since_date('2005-08-01');
     count_rentals_since_date
    --------------------------
                       187901
    Time: 249.724 ms
    
    pagila# select count_rentals_since_date('2005-09-01');
     count_rentals_since_date
    --------------------------
                         5824
    Time: 228.224 ms
    
    pagila# select count_rentals_since_date_dynamic('2005-09-01');
     count_rentals_since_date_dynamic
    ----------------------------------
                                 5824
    Time: 6.618 ms
    

    That's more like it! Problem solved. The function is running much faster now, as it can hit the index. The take-home lessons here are:

    1. Always make sure the tables you are using have been analyzed.
    2. Emulate the queries inside a function by using PREPARE + EXPLAIN EXECUTE, not EXPLAIN.
    3. Use dynamic SQL inside a function to prevent unwanted query plan caching.

    OpenSQL Camp 2008

    I attended the OpenSQL Camp last weekend, which ran Friday night to Sunday, November 14-16th. This was the first "unconference" I had been to, and Baron Schwartz did a great job in pulling this all together. I drove down with Bruce Momjian who said that this is the first cross-database conference of any kind since at least the year 2000.

    The conference was slated to start at 6 pm, and Bruce and I arrived at our hotel a few minutes before then. Our hotel was at one end of the Charlottesville Downtown Mall, and the conference was at the other end, so we got a quick walking tour of the mall. Seems like a great place - lots of shops, people walking, temporary booths set out, outdoor seating for the restaurants. It reminded me a lot of Las Ramblas, but without the "human statue" performance artists. Having a hotel within walking distance of a conference is a big plus in my book, and I'll go out of my way to find one.


    The first night was simply mingling with other people and designing the next day's sessions. There was a grid of talk slots on a wall, with large sticky notes stuck to some of them to indicate already-scheduled sessions. Next to the grid were two sections, where people added sticky notes for potential lightning talks, and for potential regular talks. There were probably about 20 of each type of talk by the end of the night. The idea was to put a check next to any talk you were interested in, although I don't think everyone really got the message about that, judging by the number of checks vs. the number of people. At one point, we gathered in a circle and gave a quick 5 word introduction about ourselves. Mine was "Just Another Perl Postgres Hacker." There were probably around 50-60 or so people there, and the vast majority were from Sun/MySQL. A smaller group of people were non-Sun MySQL people, such as Baron and Sheeri. Coming in at a minority of two were Bruce and I, representing Postgres (although Saturday saw our numbers swell to three, with the addition of Kelly McDonald). However, the smallest minority was the SQLite contingent, consisting solely of Dr. Richard Hipp (whom it was great to meet in person). Needless to say, I met a lot of MySQL people at this conference! All were very friendly and receptive to Bruce and me, and it did feel mostly like an open source database conference rather than a MySQL one. Seven of the twenty-one talks were by non-MySQL people, which means we were technically overrepresented. Or had more interesting talks! ;)

    After heading back to the room and reviewing my notes before bed, I got up the next day and caught the keynote, given by Brian Aker, about the future of open-source databases. Thanks for the Skype/Postgres shout out, Brian! :) A comment by Jim Starkey at the end of the talk led to an interesting discussion on bot nets, the current kings of cloud computing.

    My talk on MVCC was the first talk of the day, which of course means lots of technical difficulties. As usual, my laptop refused to cooperate with the overhead projector. In anticipation of this, I had copied the presentation in PDF format to a USB disk, and ended up using someone else's Mac laptop to give the presentation. (I don't remember whose it was, but thank you!) I've given the talk before, but this was a major rewrite to suit the audience: much less Postgres-specific material, and some details about how other systems implement MVCC, as well as the advantages and disadvantages of both approaches. Both Oracle and InnoDB update the actual value on disk, and save changes elsewhere, optimistically assuming that a rollback won't happen. This makes a rollback expensive, as the old diffs must be looked up and applied to the main table. Postgres is pessimistic, in that rollbacks are not as expensive: we simply add an entire new row on update, and a rollback simply marks it as no longer valid. Both ways involve some sort of cleaning up of old rows, and handle tradeoffs in different ways. There were some interesting discussions during and after the talk, as Jim Starkey and Ann Harrison weighed in on how other systems (Falcon and Firebird) perform MVCC, and the costs and tradeoffs involved. After the talk, I had some interesting talks with Ann about garbage collection and vacuuming in general.

    The next talk was by Dr. Hipp, entitled "How SQL Database Engines Work", which was fascinating as it gave a glimpse into the inner workings and philosophy of SQLite, whose underlying assumptions about power usage, memory, transactions, portability, and resource usage are radically different from most other database systems. Again, certain slides prompted some interesting discussions with the audience during the talk.

    The competing talk for that time slot was "Libdrizzle" by Eric Day. While I missed this talk, I did get to talk to him the night before about libdrizzle, among other things. Patrick Galbraith and I tried to explain the monstrosity that is XS to Eric (as he and I maintain DBD::mysql and DBD::Pg respectively), and Eric showed us how PHP does something similar.

    My DBIx::Cache talk was sabotaged by Bruce having a better session at the same time, so I attended that instead of giving mine. I'll post the slides for the DBIx::Cache talk on the OpenSQL Camp wiki soon, however. I liked Bruce's talk ("Moving Application Logic Into the Database"), mostly because he was preaching to the choir when talking about putting business logic into the database. There was an interesting discussion about the borrowing of LIMIT and OFFSET from MySQL and putting it into Postgres, and we even helped Richard figure out that he was unknowingly supporting the broken and deprecated Postgres "comma-comma" syntax. Bruce's talk was very polished and interesting. I suspect he may have given talks before. :)

    Lunch was catered in, and I talked to many people while eating lunch, indeed over the conference itself. Apparently MySQL 5.1 is finally going to be released, this time for sure, according to first Giuseppe and then Dups. Post-lunch were the lightning talks, which I normally would not miss, but their overall MySQL-centricness and my interest in another session, entitled "MySQL Unconference" by Sheeri K. Cabral, drew me away. Bruce, Sheeri, Giuseppe Maxia, and I talked about the details of such a conference. It was a very interesting perspective: MySQL has the problem of a "one company, and no community" perception, while Postgres suffers from an "all community, and no company" perception. Neither perception is accurate, of course, but there are some seeds of truth to both.

    Bruce's second presentation, "Postgres Talks", turned into mostly a wide-ranging discussion between those present (myself, Bruce, Ann, Kelly, Richard, others?) about materialized views, vacuum, building query trees, and other topics.

    I bailed out on my fellow Postgres talk "Postgres Extensions" by Kelly McDonald (sorry Kelly). I had already picked his brain about it earlier, so I felt not too much guilt in attending "Atomic Commit In SQLite" by Dr. Hipp. Again, it's fascinating to see things from the SQLite perspective. Not only technically, but how their development is structured is different as well.

    I was not feeling well, so I ran back to the hotel to drop off my backpack with super-heavy laptop inside, and thus missed my next planned talk, "Unix Command Line Productivity Tips". If anyone went and can pass on some tips in the comments below, please do so! :)

    The final talk I went to was "Join-Fu" by Jay Pipes. I honestly had no idea what this talk would be about, but I actually found it very interesting (and entertaining). Jay is a great speaker, and is not shy about pointing out some of MySQL's weaknesses. The talk was basically a collection of best practices for MySQL, and I actually learned not only things about MySQL I can put to use, but things to apply to Postgres as well. He spent some time on the MySQL query cache as well, which is particularly interesting to me as I'd love to see Postgres get something similar (and until then, people can use DBIx::Cache of course!).

    After the final set of presentations was more mingling, eating of some pizza with funky toppings, and planning for the next day's hackathon. All the proposed ideas were MySQL-specific, as to be expected, but Bruce and I actually got some work done that night by looking over the pg_memcached code, prompted by Brian. I had looked it over a little bit a few months ago, but Bruce and I managed to fix a bug and, more importantly, found other people to continue working on it. Don't forget to take the credit when they finish their work, Bruce! :)

    All in all, a great time. I would have liked to see the presentations stretched out over two days, and to have seen a greater Postgres turnout, but there's always next year. Thanks to Baron for creating a unique event!

    Varnish, Radiant, etc.

    As my colleague Jon mentioned, the Presidential Youth Debates launched its full debate content this week. And, as Jon also mentioned, the mix of tools involved was fairly interesting:

    Our use of Postgres for this project was not particularly special, and is simply a reflection of our using Postgres by default. So I won't discuss the Postgres usage further (though it pains me to ignore my favorite piece of the software stack).

    Radiant

    Dan Collis-Puro, who has done a fair amount of CMS-focused work throughout his career, was the initial engineer on this project and chose Radiant as the backbone of the site. He organized the content within Radiant, configured the Page Attachments extension for use with Amazon's S3 (Simple Storage Service), and designed the organization of videos and thumbnails for easy administration through the standard Radiant admin interface. Furthermore, prior to the release of the debate videos, Dan built a question submission and moderation facility as a Radiant extension, through which users could submit questions that might ultimately get passed along to the candidates for the debate.

    In the last few days prior to launch, it fell to me to get the new debate materials into production, and we had to reorganize the way we wanted to lay out the campaign videos and associated content. Because the initial implementation relied purely on conventions in how page parts and page attachments are used, accomplishing the reorganization was straightforward and easily achieved; it required no code tweaks and the like, and was managed purely through the CMS. It ended up being quite -- dare I say it? -- an agile solution. (Agility! Baked right in! Because it's opinionated software! Where's my Mac? It just works! Think Same.)

    For managing small, simple, straightforward sites, Radiant has much to recommend it. For instance:

    • the hierarchical management of content/pages is quite effective and intuitive
    • a pretty rich set of extensions (such as page attachments)
    • the "filter" option on content is quite handy (switch between straight text, fckeditor, etc.) and helpful
    • the Radiant tag set for basic templating/logic is easy to use and understand
    • the general resources available for organizing content (pages, layouts, snippets) enables and readily encourages effective reuse of content and/or presentation logic

    That said, there are a number of things for which one quickly longs within Radiant:

    • In-place editing user interface: an administrative mode of viewing the site in which editing tools would show in-place for the different components on a given page. This is not an uncommon approach to content management. The fact that you can view the site in one window/tab and the admin in another mitigates the pain of not having this feature to a healthy extent, but the ease of use undoubtedly suffers nevertheless.
    • Radiant offers different publishing "states" for any given page ("draft", "published", "hidden", etc.), and only publicly displays pages in the "published" state in production. This is certainly helpful, but it is ultimately insufficient. This is no substitute for versioning of resources; there is no way to have a staging version of a given page, in which the staging version is exposed to administrative users only at the same URL as the published version. To get around this, one needs to make an entirely different page that will replace the published page when you're ready. While it's possible to work around the problem in this manner, it clutters up the set of resources in the CMS admin UI, and doesn't fit well with the hierarchical nature of the system; the staging version of a page can't have the same children as the published version of the page, so any staging involving more than one level of edits is problematic and awkward. That leaves quite a lot to be desired: any engineer who has ever done all development on a production site (no development sites) and moved to version-controlled systems knows full well that working purely against a live system is extremely painful. Content management is no different.
    • The page attachments extension, while quite handy in general, has configuration information (such as file size limits and the attachment_fu storage backend to use) hard-coded into its PageAttachment model class definition, rather than abstracting that configuration information into YAML files. Furthermore, it's all or nothing: you can only use one storage backend, apparently, rather than having the flexibility of choosing different storage backends by the content type of the file attached, or choosing manually when uploading the file, etc. The result in our case is that all page attachments go to Amazon S3, even though videos were the only thing we really wanted to have in S3 (bandwidth on our server is not a concern for simple images and the like).

    The in-place editing UI features could presumably be added to Radiant given a reasonable degree of patience. The page attachment criticisms also seem achievable. The versioning, however, is a more fundamental issue. Many CMSes attempt to solve this problem many different ways, and ultimately things tend to get unpleasant. I tend to think that CMSes would do well to learn from version control systems like Git in their design; beyond that, integrate with Git: dump the content down to some intelligent serialized format and integrate with git branching, checkin, checkout, pushing, etc. That dandy, glorious future is not easily realized.

    To be clear: Radiant is a very useful, effective, straightforward tool; I would be remiss not to emphasize that the things it does well are more important than the areas that need improvement. As is the case with most software, it could be better. I'd happily use/recommend it for most content management cases I've encountered.

    Amazon S3

    I knew it was only a matter of time before I got to play with Amazon S3. Having read about it, I felt like I pretty much knew what to expect. And the expectations were largely correct: it's been mostly reliable, fairly straightforward, and its cost-effectiveness will have to be determined over time. A few things did take me by surprise, though:

    • The documentation on certain aspects, particularly the logging, is fairly uninspiring. It could be a lot worse. It could also be a lot better. Given that people pay for this service, I would expect it to be documented extremely well. Of course, given the kind of documentation Microsoft routinely spits out, this expectation clearly lacks any grounding in reality.
    • Given that the storage must be distributed under the hood, making usage information aggregation somewhat complicated, it's nevertheless disappointing that Amazon doesn't give any interface for capping usage for a given bucket. It's easy to appreciate that Amazon wouldn't want to be on the hook over usage caps when the usage data comes in from multiple geographically-scattered servers, presumably without any guarantee of serialization in time. Nevertheless, it's a totally lame problem. I have reason to believe that Amazon plans to address this soon, for which I can only applaud them.

    So, yeah, Amazon S3 has worked fine and been fine and generally not offended me overmuch.

    Varnish

    The Presidential Youth Debate project had a number of high-profile sponsors potentially capable of generating significant usage spikes. Given the simplicity of the public-facing portion of the site (read-only content, no forms to submit), scaling out with a caching reverse proxy server was a great option. Fortunately, Varnish makes it pretty easy; basic Varnish configuration is simple, and putting it in place took relatively little time.

    Why go with Varnish? It's designed from the ground up to be fast and scalable (check out the architecture notes for an interesting technical read). The time-based caching of resources is a nice approach in this case; we can have the cached representations live for a couple of minutes, which effectively takes the load off of Apache/Rails (we're running Rails with Phusion Passenger) while refreshing frequently enough for little CMS-driven tweaks to percolate up in a timely fashion. Furthermore, it's not a custom caching design, instead relying upon the fundamentals of caching in HTTP itself. Varnish, with its Varnish Configuration Language (VCL), is extremely flexible and configurable, allowing us to easily do things like ignore cookies, normalize domain names (though I ultimately did this in Apache), normalize the annoying Accept-Encoding header values, etc. Furthermore, if the cache interval is too long for a particular change, Varnish gives you a straightforward, expressive way of purging cached representations, which came in handy on a number of occasions close to launch time.
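
    The time-based caching and purging described above can be sketched in a few lines of Python. This is only an illustration of the idea, not of Varnish's internals: the TTLCache class, its fetch callable (standing in for Apache/Rails), and the injectable clock are all hypothetical names made up for this example.

```python
import time


class TTLCache:
    """Toy time-based cache: serve a stored body until its TTL runs out."""

    def __init__(self, fetch, ttl=120, clock=time.monotonic):
        self.fetch = fetch    # callable(url) -> body; stands in for the backend
        self.ttl = ttl        # seconds a cached representation stays fresh
        self.clock = clock    # injectable so the behavior can be tested
        self.store = {}       # url -> (expires_at, body)

    def get(self, url):
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]   # still fresh: the backend is not touched
        body = self.fetch(url)  # missing or expired: go to the backend
        self.store[url] = (now + self.ttl, body)
        return body

    def purge(self, url):
        # Drop a cached representation before its TTL expires.
        self.store.pop(url, None)
```

    With a TTL of a couple of minutes, the backend sees at most one request per URL per interval no matter how many clients ask, and purge() covers the case where a change cannot wait that long.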

    A number of us at End Point have been interested in Varnish for some time. We've made some core patches: JT Justman tracked down a caching bug when using Edge-Side Includes (ESI), and Charles Curley and JT have done some work to add native gzip/deflate support in Varnish, though that remains to be released upstream. We've also prototyped a system relying on ESI and message-driven cache purging for an up-to-date, high-speed, extremely scalable architecture. (That particular project hasn't gone into production yet due to the degree of effort required to refactor much of the underlying app to fit the design, though it may still come to pass next year -- I hope!)

    Getting Varnish to play nice with Radiant was a non-issue, because the relative simplicity of the site feature set and content did not require specialized handling of any particular resource: one cache interval was good for all pages. Consequently, rather than fretting about having Radiant issue Cache-Control headers on a per-page basis (which may have been fairly unpleasant, though I didn't look into it deeply; eventually I'll need to, though, having gotten modestly hooked on Radiant and less-modestly hooked on Varnish), the setup was refreshingly simple:

    • The public site's domain disallows all access to the Radiant admin, meaning it's effectively a read-only site.
    • The public domain's Apache container always issues a couple of cache-related headers:
      Header always set Cache-Control "public, max-age=120"
      Header always set Vary "Accept-Encoding"
      The Cache-Control header tells clients (Varnish in this case) that it's acceptable to cache representations for 120 seconds, and that all representations are valid for all users ("public"). We can, if we want, use VCL to clean this out of the representation Varnish passes along to clients (i.e. browsers) so that browsers don't cache automatically, instead relying on conditional GET. The Vary header tells clients that cache (again, primarily concerned with Varnish here) to consider the "Accept-Encoding" header value of a request when keying cached representations.
    • An entirely separate domain exists that is not fronted by Varnish and allows access to the Radiant admin. We could have it fronted by Varnish with caching deactivated, but the configuration we used keeps things clean and simple.
    • We use some simple VCL to tell Varnish to ignore cookies (in case of Rails sessions on the public site), and to normalize the Accept-Encoding header value to one of "gzip" or "deflate" (or none at all) to avoid caching different versions of the same representation due to inconsistent header values submitted by competing browsers.
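
    To show why the Accept-Encoding normalization matters when the cache varies on that header, here is a small Python sketch of the logic. It is not the VCL we deployed; the function names are made up for this example, and the substring matching is a simplification that ignores details such as q-values.

```python
def normalize_accept_encoding(value):
    """Collapse an Accept-Encoding header to 'gzip', 'deflate', or None."""
    if value is None:
        return None
    value = value.lower()
    if "gzip" in value:
        return "gzip"      # prefer gzip whenever the client offers it
    if "deflate" in value:
        return "deflate"
    return None            # anything else: treat as no encoding at all


def cache_key(url, headers):
    """Key a cached representation by URL plus the normalized header."""
    encoding = normalize_accept_encoding(headers.get("Accept-Encoding"))
    return (url, encoding)
```

    Without the normalization, "gzip,deflate" and "gzip, deflate" would be keyed as two different representations of the same page; with it, both collapse to the same key.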

    Getting all that sorted was, as stated, refreshingly easy. It was a little less easy, surprisingly, to deal with logging. The main Varnish daemon (varnishd) logs to a shared memory block. The logs just sit there (and presumably eventually get overwritten) unless consumed by another process. A varnishlog utility, which can be run as a one-off or as a daemon, reads in the logs and outputs them in various ways. Furthermore, a varnishncsa utility outputs logging information in an Apache/NCSA-inspired "combined log" format (though it includes full URLs in the request string rather than just the path portion, presumably due to the possibility of Varnish fronting many different domains). Neither one of these is particularly complicated, though the varnishlog output is reportedly quite verbose and may need frequent rotation, and when run in daemon mode, both will re-open the log file to which they write upon receiving SIGHUP, meaning they'll play nice with log rotation routines. I found myself repeatedly wishing, however, that they both interfaced with syslog.

    So, I'm very happy with Varnish at this point. Being a jerk, I must nevertheless publicly pick a few nits:

    • Why no syslog support in the logging utilities? Is there a compelling argument against it (I haven't encountered one, but admittedly I haven't looked very hard), or is it simply a case of not having been handled yet?
    • The VCL snippet we used for normalizing the Accept-Encoding header came right off the Varnish FAQ, and seems to be a pretty common case. I wonder if it would make more sense for this to be part of the default VCL configuration requiring explicit deactivation if not desired. It's not a big deal either way, but it seems like the vast majority of deployments are likely to use this strategy.

    That's all I have to whine about, so either I'm insufficiently observant or the software effectively solves the problem it set out to address. These options are not mutually exclusive.

    I'm definitely looking forward to further work with Varnish. This project didn't get into ESI support at all, but the native ESI support, combined with the high-performance caching, seems like a real win, potentially allowing for simplification of resource design in the application server, since documents can be constructed by the edge server (Varnish in this case) from multiple components. That sort of approach to design calls into question many of the standard practices seen in popular (and unpopular) application servers (namely, high-level templating with "pages" fitting into an overall "layout") but could help engineers maintain component encapsulation, think through more effectively the URL space, resource privacy and scoping considerations (whether or not a resource varies per user, by context, etc.), etc. But I digress. Shocking.

    Walden University Presidential Youth Debate goes live

    This afternoon was the launch of Walden University's Presidential Youth Debate website, which features 14 questions and video responses from Presidential candidates Barack Obama and John McCain. The video responses are about 44 minutes long overall.

    The site has a fairly simple feature set but is technologically interesting for us. It was developed by Dan Collis-Puro and Ethan Rowe using Radiant, PostgreSQL, CentOS Linux, Ruby on Rails, Phusion Passenger, Apache, Varnish, and Amazon S3.

    Nice work, guys!

    Filesystem I/O: what we presented

    As mentioned last week, Gabrielle Roth and I presented results from tests run in the new Postgres Performance Lab. Our slides are available on Slideshare.

    We tested eight core assumptions about filesystem I/O performance and presented the results to a room of filesystem hackers and a few database specialists. Some important things to remember about our tests: we were testing I/O only - no tuning had been done on the hardware, filesystem defaults or for Postgres - and we did not take reliability into account at all.  Tuning the database and filesystem defaults will be done for our next round of tests.

    Filesystems we tested were ext2, ext3 (with or without data journaling), xfs, jfs, and reiserfs.

    Briefly, here are our assumptions, and the results we presented:

    1. RAID5 is the worst choice for a database. Our tests confirmed this, as expected.
    2. LVM incurs too much overhead to use. Our test showed that for sequential or random reads on RAID0, LVM doesn't incur much more overhead than hardware or software RAID.
    3. Software RAID is slower. Same result as LVM for sequential or random reads.
    4. Turning off 'atime' is a big performance gain. We didn't see a big improvement, but you do generally get 2-3% improvement "for free" by turning atime off on a filesystem.
    5. Partition alignment is a big deal. Our tests weren't able to prove this, but we still think it's a big problem. Here's one set of tests demonstrating the problem on Windows-based servers.
    6. Journaling filesystems will have worse performance than non-journaling filesystems. Turn the data journaling off on ext3, and you will see better performance than ext2. We polled the audience, and nearly all thought ext2 would have performed better than ext3. People in the room suggested that the difference was because of seek-bundling that's done in ext3, but not ext2.
    7. Striping doubles performance. Doubling performance is a best-case scenario, and not what we observed. Throughput increased about 35%.
    8. Your read-ahead buffer is big enough.  The default read-ahead buffer size is 128K. Our tests, and an independent set of tests by another author, confirm that increasing read-ahead buffers can provide a performance boost of about 75%.  We saw improvement leveling out when the buffer is sized at 8MB, with the bulk of the improvement occurring up to 1MB. We plan to test this further in the future.

    All the data from these tests is available on the Postgres Developers wiki.

    Our hope is that someone in the Linux filesystem community takes up these tests and starts to produce them for other hardware, and on a more regular basis. We did have 3 people interested in running their own tests on our hardware from the talk!  In the future, we plan to focus our testing mostly on Postgres performance.

    Mark Wong and Gabrielle will be presenting this talk again, with a few new results, at the PostgreSQL Conference West.

    Authorize.Net Transaction IDs to increase in size

    A sign of their success, Authorize.Net is about to issue Transaction ID numbers greater than 2,147,483,647 (2^31 - 1), which happens to be the maximum value of a signed MySQL INT column and of the default Postgres "integer".

    It probably makes sense to proactively ensure that your transaction ID columns are large enough - this would not be a fun bug to run into ex post facto.
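
    As a quick sanity check on the arithmetic, this Python snippet spells out the limits involved. The helper names are made up for this example; the bounds are those of signed 32-bit and 64-bit integers, which is what a 4-byte integer column and an 8-byte "bigint" column hold.

```python
INT32_MAX = 2**31 - 1   # largest value of a signed 4-byte integer
INT64_MAX = 2**63 - 1   # largest value of a signed 8-byte integer


def fits_int32(n):
    """True if n can be stored in a signed 32-bit integer column."""
    return -2**31 <= n <= INT32_MAX


def fits_int64(n):
    """True if n can be stored in a signed 64-bit integer column."""
    return -2**63 <= n <= INT64_MAX
```

    The first transaction ID past the old ceiling, 2,147,483,648, fails the 32-bit check but fits comfortably in 64 bits, so widening the column to a 64-bit type is one way to stay ahead of the change.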