PostgreSQL Blog Archive
Foreign Data Wrappers
Original images from Flickr user jenniferwilliams
One of our clients, for various historical reasons, runs both MySQL and PostgreSQL to support their website. Information for user login lives in one database, but their customer activity lives in the other. The eventual plan is to consolidate these databases, but thus far, other concerns have been more pressing. So when they needed a report combining user account information and customer activity, the involvement of two separate databases became a significant complicating factor.
In similar situations in the past, using earlier versions of PostgreSQL, we've written scripts to pull data from MySQL and dump it into PostgreSQL. This works well enough, but we've updated PostgreSQL fairly recently, and can use the SQL/MED features added in version 9.1. SQL/MED ("MED" stands for "Management of External Data") is a decade-old standard designed to allow databases to make external data sources, such as text files, web services, and even other databases, look like normal database tables, and to access them with the usual SQL commands. PostgreSQL has supported some of the SQL/MED standard since version 9.1, with a feature called Foreign Data Wrappers, and among other things, it means we can now access MySQL through PostgreSQL seamlessly.
The first step is to install the right software, called mysql_fdw. It comes to us via Dave Page, PostgreSQL core team member and contributor to many projects. It's worth noting Dave's warning that he considers this experimental code. For our purposes it works fine, but as will be seen in this post, we didn't push it too hard. We opted to download the source and build it, but installing using pgxn works as well:
$ env USE_PGXS=1 pgxnclient install mysql_fdw
INFO: best version: mysql_fdw 1.0.1
INFO: saving /tmp/tmpjrznTj/mysql_fdw-1.0.1.zip
INFO: unpacking: /tmp/tmpjrznTj/mysql_fdw-1.0.1.zip
INFO: building extension
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -fpic -I/usr/include/mysql -I. -I. -I/home/josh/devel/pg91/include/postgresql/server -I/home/josh/devel/pg91/include/postgresql/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o mysql_fdw.o mysql_fdw.c
mysql_fdw.c: In function ‘mysqlPlanForeignScan’:
mysql_fdw.c:466:8: warning: ‘rows’ may be used uninitialized in this function [-Wmaybe-uninitialized]
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -fpic -shared -o mysql_fdw.so mysql_fdw.o -L/home/josh/devel/pg91/lib -L/usr/lib -Wl,--as-needed -Wl,-rpath,'/home/josh/devel/pg91/lib',--enable-new-dtags -L/usr/lib/x86_64-linux-gnu -lmysqlclient -lpthread -lz -lm -lrt -ldl
INFO: installing extension
< ... snip ... >
Here I'll refer to the documentation provided in mysql_fdw's README. The first step in using a foreign data wrapper, once the software is installed, is to create the foreign server, and the user mapping. The foreign server tells PostgreSQL how to connect to MySQL, and the user mapping covers what credentials to use. This is an interesting detail; it means the foreign data wrapper system can authenticate with external data sources in different ways depending on the PostgreSQL user involved. You'll note the pattern in creating these objects: each simply takes a series of options that can mean whatever the FDW needs them to mean. This allows the flexibility to support all sorts of different data sources with one interface.
The final step in setting things up is to create a foreign table. In MySQL's case, this is sort of like a view, in that it creates a PostgreSQL table from the results of a MySQL query. For our purposes, we needed access to several thousand structurally identical MySQL tables (I mentioned the goal is to move off of this one day, right?), so I automated the creation of each table with a simple bash script, which I piped into psql:
for i in `cat mysql_tables`; do
echo "CREATE FOREIGN TABLE mysql_schema.$i ( ... table definition ...)
SERVER mysql_server OPTIONS (
database 'mysqldb',
query 'SELECT ... some fields ... FROM $i'
);"
done
In a step not shown above, this script also consolidates the data from each table into one, native PostgreSQL table, to simplify later reporting. In our case, pulling the data once and reporting on the results is perfectly acceptable; in other words, data a few seconds old wasn't a concern. We also didn't need to write back to MySQL, which presumably could complicate things somewhat. We did, however, run into the same data validation problems PostgreSQL users habitually complain about when working with MySQL. Here's an example, in my own test database:
mysql> create table bad_dates (mydate date);
Query OK, 0 rows affected (0.07 sec)
mysql> insert into bad_dates values ('2013-02-30'), ('0000-00-00');
Query OK, 2 rows affected (0.02 sec)
Records: 2 Duplicates: 0 Warnings: 0
Note that MySQL silently transformed '2013-02-30' into '0000-00-00'. Sigh. Then, in psql we do this:
josh=# create extension mysql_fdw;
CREATE EXTENSION
josh=# create server mysql_svr foreign data wrapper mysql_fdw options (address '127.0.0.1', port '3306');
CREATE SERVER
josh=# create user mapping for public server mysql_svr options (username 'josh', password '');
CREATE USER MAPPING
josh=# create foreign table bad_dates (mydate date) server mysql_svr options (query 'select * from test.bad_dates');
CREATE FOREIGN TABLE
josh=# select * from bad_dates ;
ERROR: date/time field value out of range: "0000-00-00"
We've told PostgreSQL we'll be feeding it valid dates, but MySQL's idea of a valid date differs from PostgreSQL's, and the latter complains when the dates don't meet its stricter requirements. Several different workarounds exist, including admitting that '0000-00-00' really is wrong and cleaning up MySQL, but in this case, we modified the query underlying the foreign table to fix the dates on the fly:
SELECT CASE disabled WHEN '0000-00-00' THEN NULL ELSE disabled END,
-- various other fields
FROM some_table
Fortunately this is the only bit of MySQL / PostgreSQL impedance mismatch that has tripped us up thus far; we'd have to deal with any others we found individually, just as we did this one.
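The on-the-fly date fix above can also be generated rather than typed by hand when many tables share the problem. The following is a minimal Python sketch, not the script we actually used: the helper name null_zero_dates and the "id" column are hypothetical, and it assumes table and column names come from a trusted list, since it does no identifier quoting.

```python
def null_zero_dates(table, columns, date_columns):
    """Build a MySQL SELECT that maps '0000-00-00' to NULL for date columns.

    Hypothetical helper: names are interpolated as-is, so they must be trusted.
    """
    select_list = []
    for col in columns:
        if col in date_columns:
            # Same CASE expression as in the hand-written query above.
            select_list.append(
                f"CASE {col} WHEN '0000-00-00' THEN NULL ELSE {col} END AS {col}"
            )
        else:
            select_list.append(col)
    return f"SELECT {', '.join(select_list)} FROM {table}"


# "id" is an invented column, used only for illustration.
query = null_zero_dates("some_table", ["id", "disabled"], {"disabled"})
print(query)
```

If the resulting text is embedded in the query option of CREATE FOREIGN TABLE, its single quotes would need to be doubled, since the option value is itself a single-quoted SQL string.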
SFTP virtual users with ProFTPD and Rails: Part 1
I recently worked on a Rails 3.2 project that used the sweet PLupload JavaScript/Flash upload tool to upload files to the web app. To make it easier for users to upload large and/or remote files to the app, we also wanted to let them upload via SFTP. The catch was, our users didn't have SFTP accounts on our server and we didn't want to get into the business of creating and managing SFTP accounts. Enter: ProFTPD and virtual users.
ProFTPD's virtual users concept allows you to point ProFTPD at a SQL database for your user and group authentication. This means SFTP logins don't need actual system logins (although you can mix and match if you want). Naturally, this is perfect for dynamically creating and destroying SFTP accounts. Give your web app the ability to create disposable SFTP credentials and automatically clean up after the user is done with them, and you have a self-maintaining system.
Starting from the inside-out, you need to configure ProFTPD to enable virtual users. Here are the relevant parts from our proftpd.conf:
##
# Begin proftpd.conf excerpt. For explanation of individual config directives, see the
# great ProFTPD docs at http://www.proftpd.org/docs/directives/configuration_full.html
##
DefaultServer off
Umask 002
AllowOverwrite on
# Don't reference /etc/ftpusers
UseFtpUsers off
# Enable SFTP
SFTPEngine on
# Enable SQL based authentication
SQLAuthenticate on
# From http://www.proftpd.org/docs/howto/CreateHome.html
# Note that the CreateHome params are kind of touchy and easy to break.
CreateHome on 770 dirmode 770 uid ~ gid ~
# chroot them to their home directory
DefaultRoot ~
# Defines the expected format of the passwd database field contents. Hint: An
# encrypted password will look something like: {sha1}IRYEEXBUxvtZSx3j8n7hJmYR7vg=
SQLAuthTypes OpenSSL
# That '*' makes that module authoritative and prevents proftpd from
# falling through to system logins, etc
AuthOrder mod_sql.c*
# sftp_users and sftp_groups are the database tables that must be defined with
# the following column names. You can have other columns in these tables and
# ProFTPD will leave them alone. The sftp_groups table can be empty, but it must exist.
SQLUserInfo sftp_users username passwd uid sftp_group_id homedir shell
SQLGroupInfo sftp_groups name id members
SFTPHostKey /etc/ssh/ssh_host_rsa_key
SFTPHostKey /etc/ssh/ssh_host_dsa_key
SFTPCompression delayed
SFTPAuthMethods password
RequireValidShell no
# SQLLogFile is very verbose, but helpful for debugging while you're getting this working
SQLLogFile /var/log/proftpd_sql.sql
## Customize these for production
SQLConnectInfo database@localhost:5432 dbuser dbpassword
# The UID and GID values here are set to match the user that runs our web app because our
# web app needs to read and delete files uploaded via SFTP. Naturally, that is outside
# the requirements of a basic virtual user setup. But in our case, our web app user needs
# to be able to cd into a virtual user's homedir, and run a `ls` in there. Also, note that
# setting these two IDs here (instead of in our sftp_users table) *only* makes sense if
# you are using the DefaultRoot directive to chroot virtual users.
SQLDefaultUID 510
SQLDefaultGID 500
The CreateHome piece was the trickiest to get working just right for our use case, for two reasons: we needed our web app to be able to read and delete the uploaded files, and we wanted ProFTPD to create those home directories itself. (It only creates a home directory once a user successfully logs in via SFTP. That means you can be more liberal in your UI with generating credentials that may never get used, without having to worry about a ton of empty home directories lying about.)
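The SQLAuthTypes comment in the config above hints at the stored password format: a digest name in braces followed by what looks like base64 data. Below is a rough Python sketch of how such a value could be produced, assuming the format is the base64 encoding of the raw digest of the password. The function name proftpd_openssl_hash is made up for illustration, and the format should be checked against the ProFTPD mod_sql documentation before relying on it.

```python
import base64
import hashlib


def proftpd_openssl_hash(password, digest="sha1"):
    """Return '{digest}' followed by the base64 of the raw digest bytes.

    Assumed format, inferred from the '{sha1}...=' hint in the config excerpt.
    """
    raw = hashlib.new(digest, password.encode("utf-8")).digest()
    return "{" + digest + "}" + base64.b64encode(raw).decode("ascii")


# Value that would go into the passwd column of the sftp_users table.
print(proftpd_openssl_hash("password"))
```

A SHA-1 digest is 20 bytes, so the base64 part is always 28 characters long, which matches the length of the sample value in the config comment.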
That's it for the introductory "Part 1" of this article. In Part 2, I'll show how we generate credentials, the workflow behind displaying those credentials, and our SftpUser ActiveRecord model that handles it all. In Part 3, I'll finish up by running through exactly how our web app accesses these files, and how it cleans up after it's done.
Detecting table rewrites with the ctid column
In a recent article, I mentioned that changing the column definition of a Postgres table will sometimes cause a full table rewrite, but sometimes it will not. The rewrite depends on both the nature of the change and the version of Postgres you are using. So how can you tell for sure if changing a large table will do a rewrite or not? I'll show one method using the internal system column ctid.
Naturally, you do not want to perform this test using your actual table. In this example, we will create a simple dummy table. As long as the column types are the same as your real table, you can determine if the change will do a table rewrite on your version of PostgreSQL.
The aforementioned ctid column represents the physical location of the table's row on disk. This is one of the rare cases in which this column can be useful. The ctid value consists of two numbers: the first is the "page" that the row resides in, and the second number is the slot in that page where it resides. To make things confusing, the page numbering starts at 0, while the slot starts at 1, which is why the very first row is always at ctid (0,1). However, the only important information for this example is determining if the ctid for the rows has changed or not (which indicates that the physical on-disk data has changed, even if the data inside of it has not!).
Let's create a very simple example table and see what the ctids look like. When Postgres updates a row, it actually marks the current row as deleted and inserts a new row. Thus, there is a "dead" row that eventually needs to be cleaned out. (This is the way Postgres implements MVCC; there are others.) Routine cleanup is handled by VACUUM, which only marks the dead space as reusable. VACUUM FULL goes further and rewrites the entire table, so we'll use that command to force a rewrite (and thus 'reset' the ctids, as you will see):
postgres=# DROP TABLE IF EXISTS babies;
DROP TABLE
postgres=# CREATE TABLE babies (gender VARCHAR(10), births INTEGER);
CREATE TABLE
postgres=# INSERT INTO babies VALUES ('Girl', 1), ('Boy', 1);
INSERT 0 2
-- Note: the ctid column is never included as part of '*'
postgres=# SELECT ctid, * FROM babies;
 ctid  | gender | births
-------+--------+--------
 (0,1) | Girl   |      1
 (0,2) | Boy    |      1
(2 rows)
-- Here comes Ivy, another girl:
postgres=# UPDATE babies SET births = births+1 WHERE gender = 'Girl';
UPDATE 1
-- Note that we have a new ctid: slot 3 of page 0
-- The old row at (0,1) is still there, but it is deleted and not visible
postgres=# SELECT ctid, * FROM babies;
 ctid  | gender | births
-------+--------+--------
 (0,2) | Boy    |      1
 (0,3) | Girl   |      2
(2 rows)
-- The vacuum full removes the dead rows and moves the live rows to the front:
postgres=# VACUUM FULL babies;
VACUUM
-- We are back to the original slots, although the data is reversed:
postgres=# SELECT ctid, * FROM babies;
 ctid  | gender | births
-------+--------+--------
 (0,1) | Boy    |      1
 (0,2) | Girl   |      2
(2 rows)
That's what a table rewrite will look like - all the dead rows will be removed, and the rows will be rewritten starting at page 0, adding slots until a new page is needed. We know from the previous article and the fine documentation that Postgres version 9.1 is smarter about avoiding table rewrites. Let's try changing the column definition of the table above on version 8.4 and see what happens. Note that we do an update first so that we have at least one dead row.
postgres=# SELECT substring(version() from $$\d+\.\d+$$);
 substring
-----------
 8.4
postgres=# DROP TABLE IF EXISTS babies;
DROP TABLE
postgres=# CREATE TABLE babies (gender VARCHAR(10), births INTEGER);
CREATE TABLE
postgres=# INSERT INTO babies VALUES ('Girl', 1), ('Boy', 1);
INSERT 0 2
-- No real data change, but does write new rows to disk:
postgres=# UPDATE babies SET gender = gender;
UPDATE 2
postgres=# SELECT ctid, * FROM babies;
 ctid  | gender | births
-------+--------+--------
 (0,3) | Girl   |      1
 (0,4) | Boy    |      1
(2 rows)
-- Change the VARCHAR(10) to a TEXT:
postgres=# ALTER TABLE babies ALTER COLUMN gender TYPE TEXT;
ALTER TABLE
postgres=# SELECT ctid, * FROM babies;
 ctid  | gender | births
-------+--------+--------
 (0,1) | Girl   |      1
 (0,2) | Boy    |      1
(2 rows)
We can see from the above that changing from VARCHAR to TEXT in version 8.4 of Postgres does indeed rewrite the table. Now let's see how version 9.1 performs:
postgres=# SELECT substring(version() from $$\d+\.\d+$$);
 substring
-----------
 9.1
postgres=# DROP TABLE IF EXISTS babies;
DROP TABLE
postgres=# CREATE TABLE babies (gender VARCHAR(10), births INTEGER);
CREATE TABLE
postgres=# INSERT INTO babies VALUES ('Girl', 1), ('Boy', 1);
INSERT 0 2
-- No real data change, but does write new rows to disk:
postgres=# UPDATE babies SET gender = gender;
UPDATE 2
postgres=# SELECT ctid, * FROM babies;
 ctid  | gender | births
-------+--------+--------
 (0,3) | Girl   |      1
 (0,4) | Boy    |      1
(2 rows)
-- Change the VARCHAR(10) to a TEXT:
postgres=# ALTER TABLE babies ALTER COLUMN gender TYPE TEXT;
ALTER TABLE
postgres=# SELECT ctid, * FROM babies;
 ctid  | gender | births
-------+--------+--------
 (0,3) | Girl   |      1
 (0,4) | Boy    |      1
(2 rows)
We confirmed that the ALTER TABLE in this particular case does *not* perform a table rewrite when using version 9.1, as we suspected. We tell this by seeing that the ctids stayed the same. We could further verify by doing a vacuum full and showing that there were indeed dead rows that had been left untouched by the ALTER TABLE.
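The before-and-after comparison of ctids can be automated if you want to run this test across several column changes. Here is a small Python sketch of just the comparison step; it assumes you have already fetched the ctid values as text (for example via SELECT ctid::text), and the function names are invented for this illustration.

```python
import re

# A ctid prints as "(page,slot)", e.g. "(0,3)".
CTID_RE = re.compile(r"^\((\d+),(\d+)\)$")


def parse_ctid(text):
    """Turn the text form of a ctid into a (page, slot) tuple of ints."""
    match = CTID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a ctid: {text!r}")
    return int(match.group(1)), int(match.group(2))


def ctids_changed(before, after):
    """True if the set of physical row locations differs between snapshots."""
    return {parse_ctid(c) for c in before} != {parse_ctid(c) for c in after}


before = ["(0,3)", "(0,4)"]
print(ctids_changed(before, ["(0,1)", "(0,2)"]))  # ctids from the 8.4 session
print(ctids_changed(before, ["(0,3)", "(0,4)"]))  # ctids from the 9.1 session
```

The check is only meaningful if the table contains at least one dead row before the ALTER TABLE, which is why the sessions above run a no-op UPDATE first: without dead rows, a rewrite could leave every ctid where it was.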
Note that this small example works because nothing else is vacuuming the table, as it is too small and transient for autovacuum to care about it. VACUUM FULL is one of three ways a table can get rewritten; besides ALTER TABLE, the other way is with the CLUSTER command. We go through all the trouble above because an ALTER TABLE is the only one of the three that *may* rewrite the table - the other two are guaranteed to do so.
This is just one example of the things you can do by viewing the ctid column. It is always nice to know beforehand if a table rewrite is going to occur, as it can be the difference between a query that runs in milliseconds versus hours!
PostgreSQL search_path Behaviour
PostgreSQL has a great feature: schemas. One database can hold multiple schemas, which is a really good way to separate data between different applications. Each application can use its own schema, and applications can also share some schemas between them.
I have noticed that some programmers tend to name the working schema after their user name. This is not a bad idea; however, I once ran into strange behaviour with such a setup.
I'm using user name szymon in the database szymon.
First let's create a simple table and add some values. I will add one row with information about the table name.
# CREATE TABLE a ( t TEXT );
# INSERT INTO a(t) VALUES ('This is table a');
Let's check if the row is where it should be:
# SELECT t FROM a;
t
-----------------
This is table a
(1 row)
Now let's create another schema, named after my user.
# CREATE SCHEMA szymon;
Let's now create table a in the new schema.
# CREATE TABLE szymon.a ( t TEXT );
So there are two tables a in different schemas.
# SELECT * FROM pg_tables WHERE tablename = 'a';
 schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers
------------+-----------+------------+------------+------------+----------+-------------
 public     | a         | szymon     | \N         | f          | f        | f
 szymon     | a         | szymon     | \N         | f          | f        | f
(2 rows)
I will just add a row similar to the previous one.
# INSERT INTO szymon.a(t) VALUES ('This is table szymon.a');
Let's check the data in the table "szymon.a".
# SELECT t FROM szymon.a;
t
------------------------
This is table szymon.a
(1 row)
OK, now I have all the data prepared for showing the quite interesting behaviour. As you might see in the above queries, selecting table "a" when there is only one schema works. What's more, selecting "szymon.a" works as well.
What will happen when I get data from the table "a"?
# SELECT t FROM a;
t
------------------------
This is table szymon.a
(1 row)
Suddenly PostgreSQL selects data from a different table than at the beginning. The reason for this is the schema search mechanism. There is a PostgreSQL configuration parameter called "search_path". If you set the value of this parameter to "x,a,public" then PostgreSQL will look for all the tables, types and function names in the schema "x". If there is no such table in this schema, then it will look for this table in the next schema, which is "a" in this example.
What's the default value of the search_path parameter? You can check the current value with the following query:
# show search_path;
  search_path
----------------
 "$user",public
(1 row)
The default search path makes PostgreSQL search first in the schema whose name matches the user name you logged in with. If no schema has that name, or the schema exists but contains no table "a", PostgreSQL falls through to "public.a".
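To make the lookup order concrete, here is a simplified Python model of how an unqualified table name is resolved against search_path. It is only an illustration, not how PostgreSQL implements it: the function resolve_table and the dictionary of tables are invented here, and the model ignores the system catalog schema and temporary tables, which the real server also consults.

```python
def resolve_table(table, search_path, session_user, tables_by_schema):
    """Return 'schema.table' for the first schema in search_path that has it.

    Simplified model: "$user" is replaced by the session user name, and
    schemas that do not exist are silently skipped.
    """
    for entry in search_path.split(","):
        schema = entry.strip().strip('"')
        if schema == "$user":
            schema = session_user
        if table in tables_by_schema.get(schema, set()):
            return f"{schema}.{table}"
    return None


path = '"$user",public'

# Before CREATE SCHEMA szymon: only public.a exists.
print(resolve_table("a", path, "szymon", {"public": {"a"}}))

# After creating szymon.a, the same unqualified name resolves differently.
print(resolve_table("a", path, "szymon", {"public": {"a"}, "szymon": {"a"}}))
```

The second call returns the table in the user's own schema even though nothing about the query changed, which is exactly the surprise described above.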
The problem is even trickier because a plain EXPLAIN doesn't help: it shows only the table name and omits the schema name. So the plan for this query looks exactly the same, regardless of the schema used:
# EXPLAIN SELECT * FROM a;
QUERY PLAN
------------------------------------------------------
Seq Scan on a (cost=0.00..1.01 rows=1 width=32)
(1 row)
For a plan with more information, use EXPLAIN VERBOSE. It includes the schema name, which makes it easier to spot that a different schema is being used:
# EXPLAIN VERBOSE SELECT * FROM a;
QUERY PLAN
-------------------------------------------------------------
Seq Scan on szymon.a (cost=0.00..1.01 rows=1 width=32)
Output: t
(2 rows)
Postgres alter column problems and solutions
A common situation for database-backed applications is the need to change the attributes of a column. One can change the data type, or more commonly, only the size limitation, e.g. VARCHAR(32) gets changed to VARCHAR(42). There are a few ways to accomplish this in PostgreSQL, from a straightforward ALTER COLUMN, to replacing VARCHAR with TEXT (plus a table constraint), to some advanced system catalog hacking.
The most common example of such a change is expanding a VARCHAR declaration to allow more characters. For example, your "checksum" column was based on MD5 (at 32 characters), and now needs to be based on Keccak (pronounced "catch-ack"), at 64 characters. In other words, you need a column in your table to change from VARCHAR(32) to VARCHAR(64). The canonical approach is to do this:
ALTER TABLE foobar ALTER COLUMN checksum TYPE VARCHAR(64);
This approach works fine, but it has two huge and interrelated problems: locking and time. This approach locks the table for as long as the command takes to run. And by lock, we are talking a heavy 'access exclusive' lock which shuts everything else out of the table. If your table is small, this is not an issue. If your table has a lot of data, however, this brings us to the second issue: table rewrite. The above command will cause Postgres to rewrite every single row of the table, which can be a very expensive operation (both in terms of disk I/O and wall clock time). So, a simple ALTER COLUMN solution usually comes at a very high cost for large tables. Luckily, there are workarounds for this problem.
First, some good news: as of version 9.2, there are many operations that will no longer require a full table rewrite. Going from VARCHAR(32) to VARCHAR(64) is one of those operations! Thus, if you are lucky enough to be using version 9.2 or higher of Postgres, you can simply run the ALTER TABLE and have it return almost instantly. From the release notes:
Reduce need to rebuild tables and indexes for certain ALTER TABLE ... ALTER COLUMN TYPE operations (Noah Misch)
Increasing the length limit for a varchar or varbit column, or removing the limit altogether, no longer requires a table rewrite. Similarly, increasing the allowable precision of a numeric column, or changing a column from constrained numeric to unconstrained numeric, no longer requires a table rewrite. Table rewrites are also avoided in similar cases involving the interval, timestamp, and timestamptz types.
However, if you are not yet on version 9.2, or are making an operation not covered above (such as shrinking the size limit of a VARCHAR), your only option to avoid a full table rewrite is the system catalog change below. However, before you jump down there, consider a different option: abandoning VARCHAR altogether.
In the Postgres world, there are few differences between the VARCHAR and TEXT data types. The latter can be thought of as an unbounded VARCHAR, or if you like, a VARCHAR(999999999999). You may also add a check constraint to a table to emulate the limit of a VARCHAR. For example, to convert a VARCHAR(32) column named "checksum" to a TEXT column:
ALTER TABLE foobar ALTER COLUMN checksum TYPE text;
ALTER TABLE foobar ADD CONSTRAINT checksum_length
  CHECK (LENGTH(checksum) <= 32);
On older versions, the data type change suffers from the same full table rewrite problem as before, but if you are using version 9.1 or newer of Postgres, the change from VARCHAR to TEXT does not do a table rewrite. The creation of the check constraint, however, will scan all of the existing table rows to make sure they meet the condition. While not as costly as a full table rewrite, scanning every single row in a large table will still be expensive. Luckily, version 9.2 of Postgres comes to the rescue again with the addition of the NOT VALID phrase to the check constraint clause. Thus, in newer versions you can avoid the scan entirely by writing:
ALTER TABLE foobar ADD CONSTRAINT checksum_length CHECK (LENGTH(checksum) <= 32) NOT VALID;
This is a one-time exception for the constraint, and only applies as the constraint is being created. In other words, despite the name, the constraint is very much valid after it is created. If you want to validate all the rows that you skipped at a later time, you can use the ALTER TABLE .. VALIDATE CONSTRAINT command. This has the double advantage of allowing the check to be delayed until a better time, and taking a much lighter lock on the table than the ALTER TABLE .. ADD CONSTRAINT does.
So why would you go through the trouble of switching from your VARCHAR(32) to a TEXT column with a CHECK constraint? There are at least three good reasons.
First, if you are running Postgres 9.2 or better, this means you can change the constraint requirements on the fly, without a table scan - even for the 'non-optimal' situations such as going from 64 characters down to 32. Just drop the old constraint, and add a new one with the NOT VALID clause thrown on it.
Second, the check constraint gives a better error message, and a clearer indication that the limitation was constructed with some thought behind it. Compare these messages:
postgres=# CREATE TABLE river( checksum VARCHAR(4) );
CREATE TABLE
postgres=# INSERT INTO river VALUES ('abcde');
ERROR: value too long for type character varying(4)
postgres=# DROP TABLE river;
DROP TABLE
postgres=# CREATE TABLE river( checksum TEXT,
postgres-# CONSTRAINT checksum_length CHECK (LENGTH(checksum) <= 4) );
CREATE TABLE
postgres=# INSERT INTO river VALUES ('abcde');
ERROR: new row for relation "river" violates check constraint "checksum_length"
DETAIL: Failing row contains (abcde).
Third, and most important, you are no longer limited to a single column attribute (maximum length). You can use the constraint to check for many other things as well: minimum size, actual content, regex matching, you name it. As a good example, if we are truly storing checksums, we probably want the hexadecimal Keccak checksums to be *exactly* 64 characters, and not just a maximum length of 64 characters. So, to illustrate the above point about switching constraints on the fly, you could change the VARCHAR(32) to a TEXT and enforce a strict 64 character limit with:
BEGIN;
ALTER TABLE foobar DROP CONSTRAINT checksum_length;
ALTER TABLE foobar ADD CONSTRAINT checksum_length
  CHECK (LENGTH(checksum) = 64) NOT VALID;
COMMIT;
We just introduced a minimum *and* a maximum, something old VARCHAR could not do. We can constrain it further, as we should only be allowing hexadecimal characters to be stored. Thus, we can also reject any characters other than 0123456789abcdef from being added:
BEGIN;
ALTER TABLE foobar DROP CONSTRAINT checksum_length;
ALTER TABLE foobar ADD CONSTRAINT checksum_valid
  CHECK ( LENGTH(checksum) = 64 AND checksum ~ '^[a-f0-9]*$' ) NOT VALID;
COMMIT;
Since we have already added a regex check, we can reduce the size of the CHECK with a small hit in clarity like so:
BEGIN;
ALTER TABLE foobar DROP CONSTRAINT checksum_length;
ALTER TABLE foobar ADD CONSTRAINT checksum_valid
CHECK ( checksum ~ '^[a-f0-9]{64}$' ) NOT VALID;
COMMIT;
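Before attaching a regex constraint like this to a live table, it can be worth trying the pattern against sample values in the application layer. The following Python sketch mirrors the single-regex check; it uses hashlib's SHA3-256 purely as a convenient source of a 64-character lowercase hex string (an assumption for illustration, not necessarily the digest your application stores), and passes_check is a made-up name.

```python
import hashlib
import re

# Same character class and length as the CHECK constraint's pattern.
# fullmatch is used because Python's "$" would also accept a trailing newline.
CHECKSUM_RE = re.compile(r"[a-f0-9]{64}")


def passes_check(value):
    """True if value is exactly 64 lowercase hexadecimal characters."""
    return CHECKSUM_RE.fullmatch(value) is not None


sample = hashlib.sha3_256(b"hello").hexdigest()  # 64 hex characters
print(passes_check(sample))

old_md5 = hashlib.md5(b"hello").hexdigest()  # 32 hex characters: too short
print(passes_check(old_md5))
print(passes_check(sample.upper()))  # uppercase is rejected by [a-f0-9]
```

One pattern covers both the length and the content rule, which is the same trade of clarity for brevity made in the constraint above.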
Back to the other problem, however: how can we avoid a table rewrite when going from VARCHAR(64) to VARCHAR(32), or when stuck on an older version of Postgres that always insists on a table rewrite? The answer is the system catalogs. Please note that any updating to the system catalogs should be done very, very carefully. This is one of the few types of update I will publicly mention and condone. Do not apply this lesson to any other system table or column, as there may be serious unintended consequences.
So, what does it mean to have VARCHAR(32) vs. VARCHAR(64)? As it turns out, there is no difference in the way the actual table data is written. The length limit of a VARCHAR is simply an implicit check constraint, after all, and as such, it is quite easy to change.
Let's create a table and look at some of the important fields in the system table pg_attribute. In these examples we will use Postgres 8.4, but other versions should look very similar - this part of the system catalog rarely changes.
postgres=# CREATE TABLE foobar ( checksum VARCHAR(32) );
CREATE TABLE
postgres=# \x
Expanded display is on.
postgres=# SELECT attname, atttypid::regtype, atttypmod FROM pg_attribute
postgres-# WHERE attrelid = 'foobar'::regclass AND attname = 'checksum';
-[ RECORD 1 ]----------------
attname   | checksum
atttypid  | character varying
atttypmod | 36
The important column is atttypmod. It indicates the legal length of this varchar column (whose full legal name is 'character varying', but everyone calls it varchar). Postgres stores this value with 4 bytes of overhead added, so VARCHAR(32) shows up as 36 in the atttypmod column. Thus, if we want to change it to a VARCHAR(64), we add 4 to 64 and get a number of 68. Before we do this change, however, we need to make sure that nothing else will be affected. There are other dependencies to consider, such as views and foreign keys, that you need to keep in mind before making this change. What you should do is carefully check all the dependencies this table has:
postgres=# SELECT c.relname||':'||objid AS dependency, deptype
postgres-# FROM pg_depend d JOIN pg_class c ON (c.oid=d.classid)
postgres-# WHERE refobjid = 'foobar'::regclass;
  dependency   | deptype
---------------+---------
 pg_type:16419 | i
We can see in the above that the only dependency is an entry in the pg_type table - which is a normal thing for all tables and will not cause any issues. Any other entries, however, should give you pause before doing a manual update of pg_attribute. You can use the information returned by the first column of the above query to see exactly what is referencing the table. For example, let's make that column unique, as well as adding a view that uses the table, and then see the effects on the pg_depend table:
postgres=# CREATE UNIQUE INDEX jack ON foobar(checksum);
CREATE INDEX
postgres=# CREATE VIEW martha AS SELECT * FROM foobar;
CREATE VIEW
postgres=# SELECT c.relname||':'||objid AS dependency, deptype
postgres-# FROM pg_depend d JOIN pg_class c ON (c.oid=d.classid)
postgres-# WHERE refobjid = 'foobar'::regclass;
    dependency    | deptype
------------------+---------
 pg_type:16419    | i
 pg_class:16420   | a
 pg_rewrite:16424 | n
The 'i', 'a', and 'n' stand for internal, auto, and normal. They are not too important in this context, but more details can be found in the docs on the pg_depend table. The first column shows us the system table and oid of the dependency, so we can look them up and see what they are:
postgres=# SELECT typname FROM pg_type WHERE oid = 16419;
 typname
---------
 foobar
postgres=# SELECT relname, relkind FROM pg_class WHERE oid = 16420;
 relname | relkind
---------+---------
 jack    | i
-- Views require a little redirection as they are implemented via the rules system
postgres=# SELECT relname,relkind FROM pg_class WHERE oid =
postgres-# (SELECT ev_class FROM pg_rewrite WHERE oid = 16424);
 relname | relkind
---------+---------
 martha  | v
postgres=# \d martha
          View "public.martha"
  Column  |         Type          | Modifiers
----------+-----------------------+-----------
 checksum | character varying(32) |
View definition:
 SELECT foobar.checksum
   FROM foobar;
So what does all that tell us? It tells us we should look carefully at the index and the view to make sure they will not be affected by the change. In this case, a simple index on the column will not be affected by changing the length, so it (along with the pg_type entry) can be ignored. The view, however, should be recreated so that it records the actual column size.
We are now ready to make the actual change. This would be an excellent time to make a backup of your database. This procedure should be done very carefully - if you are unsure about any of the entries in pg_depend, do not proceed.
First, we are going to start a transaction, drop the view, and lock the pg_attribute table. Then we are going to change the length of the varchar directly and commit; the view gets recreated afterwards. Here we go:
postgres=# SELECT c.relname||':'||objid AS dependency, deptype postgres-# FROM pg_depend d JOIN pg_class c ON (c.oid=d.classid) postgres-# WHERE refobjid = 'foobar'::regclass; dependency | deptype ------------------+--------- pg_type:16419 | i pg_class:16420 | a pg_rewrite:16424 | n postgres=# \d foobar Table "public.foobar" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(32) | Indexes: "jack" UNIQUE, btree (checksum) postgres=# \d martha View "public.martha" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(32) | View definition: SELECT foobar.checksum FROM foobar; postgres=# BEGIN; BEGIN postgres=# DROP VIEW martha; DROP VIEW postgres=# LOCK TABLE pg_attribute IN EXCLUSIVE MODE; LOCK TABLE postgres=# UPDATE pg_attribute SET atttypmod = 68 postgres-# WHERE attrelid = 'foobar'::regclass AND attname = 'checksum'; UPDATE 1 postgres=# COMMIT; COMMIT
Verify the changes and check out the pg_depend entries:
postgres=# \d foobar Table "public.foobar" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(64) | Indexes: "jack" UNIQUE, btree (checksum) postgres=# CREATE VIEW martha AS SELECT * FROM foobar; CREATE VIEW postgres=# \d martha View "public.martha" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(64) | View definition: SELECT foobar.checksum FROM foobar; postgres=# SELECT c.relname||':'||objid AS dependency, deptype postgres-# FROM pg_depend d JOIN pg_class c ON (c.oid=d.classid) postgres-# WHERE refobjid = 'foobar'::regclass; dependency | deptype ------------------+--------- pg_type:16419 | i pg_class:16420 | a pg_rewrite:16428 | n
Success. Both the table and the view are showing the new VARCHAR size, but the data in the table was not rewritten. Note how the final row returned by the pg_depend query changed: we dropped the view and created a new one, resulting in a new row in both pg_class and pg_rewrite, and thus a new OID shown in the pg_rewrite table.
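The value 68 used in the UPDATE is the declared length plus a four-byte header: for a varchar(n) column, atttypmod holds n + 4, so varchar(64) is stored as 68. A small sketch for checking the value before and after the change, again assuming the foobar table from this example:

```sql
-- For varchar(n), atttypmod is n + 4, so subtracting 4 gives the declared length.
-- format_type() renders the type the same way \d does.
SELECT attname,
       atttypmod,
       atttypmod - 4 AS declared_length,
       format_type(atttypid, atttypmod) AS displayed_type
FROM pg_attribute
WHERE attrelid = 'foobar'::regclass
  AND attname = 'checksum';
```

The n + 4 rule applies to varchar and char; other types encode their modifiers differently, so do not reuse the arithmetic for them.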
Hopefully this is not something you ever have to perform. The new features of 9.1 and 9.2 that prevent table rewrites and table scanning should go a long way towards that.
How to Make PostgreSQL Query Slow
Some applications are very vulnerable to long-running queries. When you test an application, it is sometimes useful to have a query that runs for, let's say, 10 minutes. What's more, it should be a normal query, so the application gets the normal results, only much later than usual.
PostgreSQL has quite a nice function, pg_sleep, which takes exactly one parameter: the number of seconds to wait before returning. You can call it like any other PostgreSQL function, although on its own that is not very useful:
# SELECT pg_sleep(10); pg_sleep ---------- (1 row) Time: 10072.794 ms
The most interesting usage is adding this function into a query. Let's take this query:
# SELECT schemaname, tablename FROM pg_tables WHERE schemaname <> 'pg_catalog'; Time: 0.985 ms
As you can see, this query is quite fast and returns data in less than 1 ms. Let's now make it much slower, so that it returns exactly the same data, but only after 15 seconds:
# SELECT schemaname, tablename FROM pg_tables, pg_sleep(15) WHERE schemaname <> 'pg_catalog'; Time: 15002.084 ms
In fact the query execution time is a little longer than that: pg_sleep waited 15 seconds, but PostgreSQL also had to spend some time parsing and executing the query and returning the data.
I have used this solution many times to simulate a long-running query without changing the application logic, to check how the application behaves during load peaks.
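Where pg_sleep is placed matters. The sketch below contrasts the FROM clause placement used above with a select-list placement. The 0.5 second delay is an arbitrary value chosen for illustration.

```sql
-- In the FROM clause the function is called once: one fixed delay,
-- and the result columns are unchanged.
SELECT schemaname, tablename
FROM pg_tables, pg_sleep(15)
WHERE schemaname <> 'pg_catalog';

-- In the select list the function is evaluated for each returned row,
-- so the total delay grows with the row count, and the result gains
-- an extra (empty) pg_sleep column that the application would see.
SELECT schemaname, tablename, pg_sleep(0.5)
FROM pg_tables
WHERE schemaname <> 'pg_catalog';
```

The first form is usually the better fit when the application must receive exactly the same result set as before.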
PostgreSQL auto_explain Module
PostgreSQL has many nice additional modules, usually hidden and not enabled by default. One of them is auto_explain, which can be very helpful when reviewing bad query plans. auto_explain automatically logs query plans, according to the module's configuration.
This module is very useful for testing. With an ORM, it is hard to repeat by hand exactly the same queries with exactly the same parameters as the ORM generates. Even without an ORM, many applications issue a lot of different queries depending on input data, and it can be painful to repeat all the queries from the logs. It's much easier to run the app and let it perform all the queries normally. The only change would be adding a couple of queries right after the application connects to the database.
To begin with, let's see how my logs look when I run a "SELECT 1" query:
2012-10-24 14:55:09.937 CEST 5087e52d.22da 1 [unknown]@[unknown] LOG: connection received: host=127.0.0.1 port=33004 2012-10-24 14:55:09.947 CEST 5087e52d.22da 2 szymon@szymon LOG: connection authorized: user=szymon database=szymon 2012-10-24 14:55:10.860 CEST 5087e52d.22da 3 szymon@szymon LOG: statement: SELECT 1; 2012-10-24 14:55:10.860 CEST 5087e52d.22da 4 szymon@szymon LOG: duration: 0.314 ms
Your logs can look a little bit different depending on your settings. The settings I use for logging on my development machine are:
log_destination = 'stderr'
logging_collector = on
log_directory = '/var/log/postgresql/'
log_filename = 'postgresql-9.1-%Y-%m-%d_%H%M%S.log'
log_file_mode = 0666
log_rotation_age = 1d
log_rotation_size = 512MB
client_min_messages = notice
log_min_messages = notice
log_min_duration_statement = -1
log_connections = on
log_disconnections = on
log_duration = on
log_line_prefix = '%m %c %l %u@%d '
log_statement = 'all'
The main idea for the above logging configuration is to log all queries before execution, so when a query fails (e.g. because of Out Of Memory Error), it will be logged as well. The execution time won't be logged, but the query will.
Let's run the simple query SELECT 1/0; which should fail. Then the log entries look like:
2012-10-24 15:00:24.767 CEST 5087e52d.22da 5 szymon@szymon LOG: statement: SELECT 1/0; 2012-10-24 15:00:24.823 CEST 5087e52d.22da 6 szymon@szymon ERROR: division by zero 2012-10-24 15:00:24.823 CEST 5087e52d.22da 7 szymon@szymon STATEMENT: SELECT 1/0;
Enabling auto_explain for the whole PostgreSQL installation is not the best idea. I always enable it only for my session, using the following query:
LOAD 'auto_explain';
Now we have to configure this plugin a little. The main thing is to set the minimum statement execution time above which plans are logged. Let's set it to 0, so that all queries are explained:
SET auto_explain.log_min_duration = 0;
Now let's create a table for tests:
CREATE TABLE x(t text); INSERT INTO x(t) SELECT generate_series(1,10000);
The first query will be quite simple: let's just take the first ten rows.
SELECT * FROM x ORDER BY t LIMIT 10;
2012-10-24 16:21:34.102 CEST 5087f8f8.3fe6 16 szymon@szymon LOG: statement: SELECT * FROM x ORDER BY t LIMIT 10;
2012-10-24 16:21:34.109 CEST 5087f8f8.3fe6 17 szymon@szymon LOG: duration: 6.586 ms plan:
Query Text: SELECT * FROM x ORDER BY t LIMIT 10;
Limit (cost=361.10..361.12 rows=10 width=4)
-> Sort (cost=361.10..386.10 rows=10000 width=4)
Sort Key: t
-> Seq Scan on x (cost=0.00..145.00 rows=10000 width=4)
2012-10-24 16:21:34.109 CEST 5087f8f8.3fe6 18 szymon@szymon LOG: duration: 7.285 ms
Another thing we can do with the auto_explain module is log EXPLAIN ANALYZE output. First enable the setting:
SET auto_explain.log_analyze = true;
Now PostgreSQL adds the following lines to the logs:
2012-10-24 16:23:22.514 CEST 5087f8f8.3fe6 21 szymon@szymon LOG: statement: SELECT * FROM x ORDER BY t LIMIT 10;
2012-10-24 16:23:22.522 CEST 5087f8f8.3fe6 22 szymon@szymon LOG: duration: 8.248 ms plan:
Query Text: SELECT * FROM x ORDER BY t LIMIT 10;
Limit (cost=361.10..361.12 rows=10 width=4) (actual time=8.214..8.218 rows=10 loops=1)
-> Sort (cost=361.10..386.10 rows=10000 width=4) (actual time=8.211..8.213 rows=10 loops=1)
Sort Key: t
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on x (cost=0.00..145.00 rows=10000 width=4) (actual time=0.032..2.663 rows=10000 loops=1)
2012-10-24 16:23:22.522 CEST 5087f8f8.3fe6 23 szymon@szymon LOG: duration: 8.722 ms
There are other settings, but I didn't find them as useful as the above.
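For reference, here is a sketch of a session setup that uses a few of those other settings. The 250 ms threshold is an arbitrary example value, and the session is assumed to run as a superuser, as in the examples above.

```sql
-- Load the module for this session only.
LOAD 'auto_explain';

-- Log plans only for statements slower than 250 ms
-- (0 logs every plan, -1 disables plan logging).
SET auto_explain.log_min_duration = '250ms';

-- Print the details EXPLAIN VERBOSE would add, such as output column lists.
SET auto_explain.log_verbose = true;

-- Also log plans of statements executed inside functions.
SET auto_explain.log_nested_statements = true;

-- Buffer usage statistics; this only has an effect when log_analyze is on.
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
```

Raising log_min_duration above zero keeps the log readable when the application issues many fast queries.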
auto_explain is a great module for testing. However, enabling EXPLAIN ANALYZE can be painful on production databases, because every query is then run with per-node timing instrumentation, regardless of the log_min_duration setting. So a query can be much slower than usual, even if its plan is not logged.
An Encouraging LinuxFest
A few weekends ago I gave a talk at Ohio LinuxFest: Yes, You Can Run Your Business On PostgreSQL. Next Question? (slides freshly posted.) The talk isn't as technically oriented as the ones I usually give, but rather more inspirational and encouraging. It seemed like a good and reasonable topic, centered around Postgres but applicable to open source in general, and it's something I'd been wanting to get out there for a while.
In a previous life I worked with Microsoft shops a bit more often. You know, companies that use Windows and related software pretty much exclusively. This talk was, more or less, a result of a number of conversations with those companies about open source software and why it's a valid option. I heard a number of arguments against, some reasonable, some pretty far out there, so it felt like it'd be a good thing to gather up all of those that I'd heard over time.
These days I don't interact with those companies so much, so I was a little worried at first that the landscape had changed enough that the talk wouldn't really be useful any more. But after talking with a few people around the conference a day or two before the talk, it was clear that there are still companies that don't see the value in open source technologies.
The slides are essentially a rough outline, but I tried to go back and add some of the spoken context. Anyway, enjoy, and hopefully it'll help you get the open source word out.
Postgres system triggers error: permission denied
This mystifying Postgres error popped up for one of my coworkers lately while using Ruby on Rails:
ERROR: permission denied: "RI_ConstraintTrigger_16410" is a system trigger
On PostgreSQL version 9.2 and newer, the error may look like this:
ERROR: permission denied: "RI_ConstraintTrigger_a_32778" is a system trigger
ERROR: permission denied: "RI_ConstraintTrigger_c_32780" is a system trigger
I labelled this as mystifying because, while Postgres' error system is generally well designed and gives clear messages, this one stinks. A better one would be something similar to:
ERROR: Cannot disable triggers on a table containing foreign keys unless superuser
As you can now guess, this error is caused by a non-superuser trying to disable triggers on a table that is used in a foreign key relationship, via the SQL command:
ALTER TABLE foobar DISABLE TRIGGER ALL;
Because Postgres enforces foreign keys through the use of triggers, and because data integrity is very important to Postgres, one must be a superuser to perform such an action and bypass the foreign keys. (A superuser is a Postgres role that has "do anything" privileges). We'll look at an example of this in action, and then discuss solutions and workarounds.
Note that if you are not a superuser *and* you are not the owner of the table, you will get a much better error message when you try to disable all the triggers:
ERROR: must be owner of relation foobar
To reproduce the original error, we will create two tables, and then link them together via a foreign key:
postgres=# create user alice; CREATE ROLE postgres=# \c postgres alice You are now connected to database "postgres" as user "alice". -- Verify that we are not a superuser postgres=> select usename, usesuper from pg_user where usename = (select current_user); usename | usesuper ---------+---------- alice | f postgres=> create table foo(a int unique); NOTICE: CREATE TABLE / UNIQUE will create implicit index "foo_a_key" for table "foo" CREATE TABLE postgres=> create table bar(b int); CREATE TABLE postgres=> alter table bar add constraint baz foreign key (b) references foo(a); ALTER TABLE
Let's take a look at both tables, and then try to disable triggers on each one. Because the triggers enforcing the foreign key are internal, they will not show up when we do a \d:
postgres=> \d foo Table "public.foo" Column | Type | Modifiers --------+---------+----------- a | integer | Indexes: "foo_a_key" UNIQUE CONSTRAINT, btree (a) Referenced by: TABLE "bar" CONSTRAINT "baz" FOREIGN KEY (b) REFERENCES foo(a) postgres=> \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- b | integer | Foreign-key constraints: "baz" FOREIGN KEY (b) REFERENCES foo(a) postgres=> alter table foo disable trigger all; ERROR: permission denied: "RI_ConstraintTrigger_41047" is a system trigger postgres=> alter table bar disable trigger all; ERROR: permission denied: "RI_ConstraintTrigger_41049" is a system trigger
If we try the same thing as a superuser, we have no problem:
postgres=# \c postgres postgres You are now connected to database "postgres" as user "postgres". postgres=# select usename, usesuper from pg_user where usename = (select current_user); usename | usesuper ----------+---------- postgres | t postgres=# alter table foo disable trigger all; ALTER TABLE postgres=# alter table bar disable trigger all; ALTER TABLE -- Don't forget to re-enable the triggers! postgres=# alter table foo enable trigger all; ALTER TABLE postgres=# alter table bar enable trigger all; ALTER TABLE
So, this error has happened to you - now what? Well, it depends on exactly what you are trying to do, and how much control over your environment you have. If you are using Ruby on Rails, for example, you may not be able to change anything except the running user. As you may imagine, this is the most obvious solution: become a superuser and run the command, as in the example above.
If you do have the ability to run as a superuser however, it is usually much easier to adjust the session_replication_role. In short, this disables *all* triggers and rules, on all tables, until it is switched back again. Do NOT forget to switch it back again! Usage is like this:
postgres=# \c postgres postgres You are now connected to database "postgres" as user "postgres". postgres=# set session_replication_role to replica; SET -- Do what you need to do - triggers and rules will not fire! postgres=# set session_replication_role to default; SET
Note: while you can do "SET LOCAL" to limit the changes to the current transaction, I always feel safer to explicitly set it before and after the changes, rather than relying on the implicit change back via commit and rollback.
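For completeness, the SET LOCAL variant mentioned in the note would look roughly like this. It is a sketch that assumes a superuser session, as in the example above.

```sql
BEGIN;

-- SET LOCAL lasts only until the end of the current transaction.
SET LOCAL session_replication_role TO replica;

-- Do what you need to do - triggers and rules will not fire!

COMMIT;

-- After COMMIT (or ROLLBACK) the session is back to its previous value.
SHOW session_replication_role;
```

Running SHOW afterwards is a cheap way to confirm that the setting really did revert.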
It may be that you are simply trying to disable one or more of the "normal" triggers that appear on the table. In which case, you can simply disable user triggers manually rather than use 'all':
postgres=# \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- b | integer | Foreign-key constraints: "baz" FOREIGN KEY (b) REFERENCES foo(a) Triggers: trunk AFTER INSERT ON bar FOR EACH STATEMENT EXECUTE PROCEDURE funk() vupd BEFORE UPDATE ON bar FOR EACH ROW EXECUTE PROCEDURE verify_update(); postgres=> alter table bar disable trigger trunk; ALTER TABLE postgres=> alter table bar disable trigger vupd; ALTER TABLE -- Do what you need to do, then: postgres=> alter table bar enable trigger trunk; ALTER TABLE postgres=> alter table bar enable trigger vupd; ALTER TABLE
Another option for a regular user (in other words, a non super-user) is to remove the foreign key relationship yourself. You cannot disable the trigger, but you can drop the foreign key that created it in the first place. Of course, you have to add it back in as well:
postgres=# \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> alter table bar drop constraint baz; ALTER TABLE -- Do what you need to do then: postgres=> alter table bar add constraint baz foreign key (b) references foo(a); ALTER TABLE
The final solution is to work around the problem. Do you really need to disable triggers on this table? If not, you can simply not disable any triggers. Perhaps the action you are ultimately trying to do (e.g. update/delete/insert to the table) can be performed some other way.
All of these solutions have their advantages and disadvantages. And that's what charts are good for!:
Permission denied: "RI_ConstraintTrigger" is a system trigger - now what?

| Solution | Good | Bad |
|---|---|---|
| Become a superuser | Works as you expect it to | Locks the table; must re-enable triggers |
| Adjust session_replication_role | No table locks! Bypasses triggers and rules on ALL tables | Must be superuser; MUST set it back to default setting |
| Disable user triggers manually | Regular users can perform; very clear what is being done; less damage if you forget to re-enable | Locks the table; may not be enough |
| Drop the foreign key | Regular users can perform; very clear what is being done | Locks the tables; must recreate the foreign key |
| Not disable any triggers | No locking; nothing to remember to re-enable | May not work in all situations |
For the rest of this article, we will tie up two loose ends. First, how can we see the triggers if \d will not show them? Second, what's up with the crappy trigger name?
As seen above, the output of \d in the psql program shows us the triggers on a table, but not the internal system triggers, such as those created by foreign keys. Here is how triggers normally appear:
postgres=# \c postgres postgres You are now connected to database "postgres" as user "postgres". postgres=# create language plperl; CREATE LANGUAGE postgres=# create function funk() returns trigger language plperl as $$ return undef; $$; CREATE FUNCTION postgres=# create trigger trunk after insert on bar for each statement execute procedure funk(); CREATE TRIGGER postgres=# \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- b | integer | Foreign-key constraints: "baz" FOREIGN KEY (b) REFERENCES foo(a) Triggers: trunk AFTER INSERT ON bar FOR EACH STATEMENT EXECUTE PROCEDURE funk() postgres=# alter table bar disable trigger all; ALTER TABLE postgres=# \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- b | integer | Foreign-key constraints: "baz" FOREIGN KEY (b) REFERENCES foo(a) Disabled triggers: trunk AFTER INSERT ON bar FOR EACH STATEMENT EXECUTE PROCEDURE funk()
Warning: Versions older than 8.3 will not tell you in the \d output that the trigger is disabled! Yet another reason to upgrade as soon as possible because 8.2 and earlier are end of life.
If you want to see all the triggers on a table, even the internal ones, you will need to look at the pg_trigger table directly. Here is the query that psql uses when generating a list of triggers on a table. Note the exclusion based on the tgisinternal column:
SELECT t.tgname, pg_catalog.pg_get_triggerdef(t.oid, true), t.tgenabled FROM pg_catalog.pg_trigger t WHERE t.tgrelid = '32774' AND NOT t.tgisinternal ORDER BY 1;
So in our example table above, we should find the trigger we created, as well as the two triggers created by the foreign key. Because we ran DISABLE TRIGGER ALL as a superuser above, all three start out disabled, which shows as a 'D' in the tgenabled column. Once they are re-enabled, they show as 'O'. (O stands for origin, and has to do with session_replication_role).
postgres=# select tgname,tgenabled,tgisinternal from pg_trigger postgres-# where tgrelid = 'bar'::regclass; tgname | tgenabled | tgisinternal ------------------------------+-----------+-------------- RI_ConstraintTrigger_c_32780 | D | t RI_ConstraintTrigger_c_32781 | D | t trunk | D | f postgres=# alter table bar enable trigger all; ALTER TABLE postgres=# select tgname,tgenabled,tgisinternal from pg_trigger postgres-# where tgrelid = 'bar'::regclass; tgname | tgenabled | tgisinternal ------------------------------+-----------+-------------- RI_ConstraintTrigger_c_32780 | O | t RI_ConstraintTrigger_c_32781 | O | t trunk | O | f
As you recall, the original error - with the system trigger that had a rather non-intuitive name - looked like this:
ERROR: permission denied: "RI_ConstraintTrigger_16509" is a system trigger
We can break it apart to see what it is doing. The "RI" is short for "Referential Integrity", and anyone who manages to figure that out can probably make a good guess as to what it does. The "Constraint" means it is a constraint on the table - okay, simple enough. The "Trigger" is a little redundant, as it is extraordinarily unlikely you will ever come across this trigger without some context (such as the error message above) that tells you it is a trigger. The final number is simply the oid of the trigger itself. Stick them all together and you get a fairly obscure trigger name that is hopefully not as mysterious now!
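If the name alone is not enough, the trigger can be traced back to the constraint that created it. The query below is a sketch that assumes the bar table from the examples and a version with the tgisinternal column (9.0 or later):

```sql
-- Map each internal trigger on bar to the constraint it enforces.
-- contype 'f' marks a foreign key constraint.
SELECT t.tgname,
       c.conname,
       c.contype,
       t.tgenabled
FROM pg_trigger t
JOIN pg_constraint c ON c.oid = t.tgconstraint
WHERE t.tgrelid = 'bar'::regclass
  AND t.tgisinternal;
```

For the example tables, the conname column should point at the baz foreign key.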
The Real Cost of Data Roundtrip
Sometimes you need to perform some heavy database operations. I don't know why programmers are so often afraid of using the database for that. They usually have some fancy ORM which performs all the operations, and the only way to change the data is to run a SELECT * on a table, create a bunch of unneeded objects, change one field, convert those changed objects into queries, and send them to the database.
Have you ever thought about the cost of this data roundtrip? The cost of fetching all the data from the database just to send the changed data back? Why do that if there is a much faster way of achieving the same result?
Imagine that you have quite a heavy operation. Let's pick something that databases normally cannot do out of the box, some more complicated operation. Many programmers just don't know that there is any way other than writing it in application code. Let's change all the HTML entities into real characters.
HTML entities are a way of writing many different characters in HTML. This way you can write, for instance, the Euro currency sign "€" in HTML even if you don't have it on your keyboard. You just have to write &euro; or &#8364; instead. I don't have to, because when I use UTF-8 encoding and write this character directly, it is displayed normally. What's more, I have this character on my keyboard.
I will convert the text stored in the database, changing all the HTML entities into real Unicode characters. I will do it using three different methods.
- The first will be a simple query run inside PostgreSQL
- The second will be an external program which downloads the text column from the database, changes it externally, and loads it back into the database.
- The third method will be almost the same as the second, except that it will download whole rows.
Generate Data
So, for this test I need to have some data. Let's write a simple data generator.
First, a simple function for returning a random number within the given range.
CREATE FUNCTION random(INTEGER, INTEGER) RETURNS INTEGER AS $$ SELECT floor ( $1 + ($2 - $1 + 1 ) * random())::INTEGER; $$ LANGUAGE SQL;
Now the function for generating random texts of random length filled with the HTML entities.
CREATE FUNCTION generate_random_string() RETURNS TEXT AS $$
DECLARE
items TEXT[] =
ARRAY[
'AAAA','BBBB','CCCC','DDDD','EEEE','FFFF','GGGG',
'HHHH','IIII','JJJJ','KKKK','LLLL','MMMM','NNNN',
'OOOO','PPPP','QQQQ','RRRR','SSSS','TTTT','UUUU',
'VVVV','WWWW','XXXX','YYYY','ZZZZ',
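'&amp;','&quot;','&apos;','&amp;','&lt;','&gt;',
'&cent;','&pound;','&curren;','&yen;','&brvbar;','&sect;',
'&uml;','&copy;','&ordf;','&laquo;','&not;','&shy;',
'&reg;','&macr;','&deg;','&plusmn;','&sup2;','&sup3;',
'&acute;','&micro;','&para;','&middot;','&cedil;','&sup1;',
'&ordm;','&raquo;','&frac14;','&frac12;','&frac34;'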
];
length INTEGER := random(500, 1500);
result TEXT := '';
items_length INTEGER := array_length(items, 1);
BEGIN
FOR x IN 1..length LOOP
result := result || items[ random(1, items_length) ];
END LOOP;
RETURN result;
END;
$$ LANGUAGE PLPGSQL;
The table for storing the data is created with the following query:
CREATE TABLE data (
id SERIAL PRIMARY KEY,
payload TEXT,
t TEXT
);
Then I filled this table using a query generating 50k rows with random data:
INSERT INTO data(payload, t)
SELECT
generate_random_string(),
generate_random_string()
FROM
generate_series(1, 50*1000);
Let's check the table size:
SELECT pg_size_pretty(pg_relation_size('data'));
pg_size_pretty
----------------
207 MB
(1 row)
As the table is filled with random data, I need three tables with exactly the same data, one for each test.
CREATE TABLE query (id SERIAL PRIMARY KEY, payload TEXT, t TEXT);
CREATE TABLE script (id SERIAL PRIMARY KEY, payload TEXT, t TEXT);
CREATE TABLE script_full (id SERIAL PRIMARY KEY, payload TEXT, t TEXT);
INSERT INTO query SELECT * FROM data;
INSERT INTO script SELECT * FROM data;
INSERT INTO script_full SELECT * FROM data;
The Tests
SQL
Many programmers think that such operations are not normally available inside a database. However, PostgreSQL has quite a nice feature: it can execute functions written in many different languages. For this test I will use pl/perlu, the untrusted variant of PL/Perl, which allows me to use external libraries. I will also use the HTML::Entities package for the conversion.
The function I wrote is quite simple:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
The update of the data can be done using the following query:
UPDATE query SET t = decode_html_entities(t);
Application
To keep the tests comparable, I will write a simple Perl script using exactly the same package for converting HTML entities.
#!/usr/bin/env perl
use DBI;
use HTML::Entities;
use Encode;
my $dbh = DBI->connect(
"DBI:Pg:dbname=test;host=localhost",
"szymon",
"",
{'RaiseError' => 1, 'pg_utf8_strings' => 1});
$dbh->do('BEGIN');
my $upd = $dbh->prepare("UPDATE script SET t = ? WHERE id = ?");
my $sth = $dbh->prepare("SELECT id, t FROM script");
$sth->execute();
while(my $row = $sth->fetchrow_hashref()) {
my $t = decode_entities( $row->{'t'} );
$t = encode("UTF-8", $t);
$upd->execute( $t, $row->{'id'} );
}
$dbh->do('COMMIT');
$dbh->disconnect();
The Worst Application
There is another terrible idea implemented by programmers too often. Why select only the column you want to change? Let's select whole rows and send them back to the database.
This script looks like this (the important changes are in the SELECT statement and in the call to execute):
#!/usr/bin/env perl
use DBI;
use HTML::Entities;
use Encode;
my $dbh = DBI->connect(
"DBI:Pg:dbname=test;host=localhost",
"szymon",
"",
{'RaiseError' => 1, 'pg_utf8_strings' => 1});
$dbh->do('BEGIN');
# Send the untouched payload column back as well, to mimic a whole-row update.
my $upd = $dbh->prepare("UPDATE script_full SET t = ?, payload = ? WHERE id = ?");
my $sth = $dbh->prepare("SELECT id, payload, t FROM script_full");
$sth->execute();
while(my $row = $sth->fetchrow_hashref()) {
my $t = decode_entities( $row->{'t'} );
$t = encode("UTF-8", $t);
$upd->execute( $t, $row->{'payload'}, $row->{'id'} );
}
$dbh->do('COMMIT');
$dbh->disconnect();
Results.
The query using the pl/perlu function executed in 26 seconds.
The script changing data externally executed in 2 minutes 10 seconds (5 times slower).
The worst script, getting and resending whole rows, finished in 4 minutes 35 seconds (10 times slower).
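Before trusting timings like these, it is worth confirming that all three approaches produced the same data. The queries below are a sketch of such a check and were not part of the original test. They assume the query, script, and script_full tables created earlier.

```sql
-- Rows where the in-database result and the first script disagree.
SELECT count(*) AS mismatched_rows
FROM query q
JOIN script s USING (id)
WHERE q.t IS DISTINCT FROM s.t;

-- Same check for the whole-row script, including the payload column.
SELECT count(*) AS mismatched_rows
FROM query q
JOIN script_full f USING (id)
WHERE q.t IS DISTINCT FROM f.t
   OR q.payload IS DISTINCT FROM f.payload;
```

A non-zero count would suggest that one of the methods handled the text differently, for example through a different encoding step.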
I used quite a small number of rows: just 50k (about 200MB). On production servers the numbers are much bigger.
Just imagine that the code you developed for changing data could run 10 times faster if you did this in the database.
Enforcing Transaction Compartments with Foreign Keys and SECURITY DEFINER
In support of End Point's evolving offering for multi-master database replication, from the precursor to Bucardo through several versions of Bucardo itself, our code solutions depended on the ability to suppress the actions of triggers and rules through direct manipulation of the pg_class table. Most PostgreSQL database developers are probably familiar with the construct we used from the DDL scripts generated by pg_dump at one time.
Disable triggers and rules on table "public"."foo":
UPDATE pg_class SET
relhasrules = false,
reltriggers = 0
FROM pg_namespace
WHERE pg_namespace.oid = pg_class.relnamespace
AND pg_namespace.nspname = 'public'
AND pg_class.relname = 'foo';
Re-enable all triggers and rules on "public"."foo" when finished with DML that must not fire triggers and rules:
UPDATE pg_class SET
reltriggers = (
SELECT COUNT(*) FROM pg_trigger
WHERE pg_class.oid = pg_trigger.tgrelid
),
relhasrules = (
SELECT COUNT(*) > 0
FROM pg_rules
WHERE schemaname = 'public'
AND tablename = 'foo'
)
FROM pg_namespace
WHERE pg_namespace.oid = pg_class.relnamespace
AND pg_namespace.nspname = 'public'
AND pg_class.relname = 'foo';
In practice, the simple usage described for trigger and rule suppression worked reasonably well, but not always. In particular, there is a somewhat concerning state that exists between the two previously described events. The actions of disabling triggers/rules, then manipulating the affected relations, and then re-enabling triggers/rules, must happen within the confines of a single transaction, and they must happen, period. The risk is that, at some point between the "open and shut" on pg_class, the transaction is committed and the "shut" never fires. If that happens, all later database activity against those relations proceeds with triggers and rules still disabled. I don't recall that we ever isolated the reasons why, on rare occasion, this happened; I only know that it did happen, and it was never welcome news.
In an effort to curb the worst aspect of this issue, I started with a simple question: how can I limit the transaction to a "safe compartment", thinking in terms of perl's Safe.pm? In this case, the "unsafe" action is "commit the transaction with triggers and rules disabled". But in reality, the unsafe list can be any conditions the developer needs to have exposed, but cannot make visible to the rest of the database.
An ancillary issue we faced, too, was the fact that any app code needing to suppress triggers and rules (beyond syncing, there were any number of DML requirements where it was undesirable for syncing to occur, and the pg_class manipulations were quite common) had to operate as the super user. While we had not had an incident where the postgres user for mundane operations had burned down the database, there was certainly concern about that potential.
The resolution I settled on was to construct a pair of functions that made use of the following features:
- PostgreSQL's SECURITY DEFINER function attribute
- Deferred foreign keys
- The ON COMMIT DROP option for CREATE TEMP TABLE
The first function, safe_disable_trigrules(schema_name text, table_name text), is called after beginning a transaction and makes the necessary modifications to pg_class on behalf of schema_name.table_name. After the work within the transaction is finished, the second function--safe_reenable_trigrules(schema_name text, table_name text)--is called before issuing the commit. It, of course, puts pg_class back to the proper state.
Under the hood, the two functions create a dependency that only each other can satisfy when used as a non super user. Before safe_disable_trigrules() will manipulate pg_class, it creates a temp table with a self-referencing, deferred foreign key. Then, based on the schema and table args, it will insert a record for the relation defined by the args that violates the FK. Once the transaction's work is finished, but before committing, safe_reenable_trigrules() is called for every relation that safe_disable_trigrules() was called against and it will delete out the offending record for that relation alone. If the two functions are used properly, by the time of commit, the temp table is empty, thus having no foreign key violations, at which point the transaction can be safely committed. In the process of ensuring the temp table has no foreign key violations, pg_class has been fully restored to its pre-transaction state.
How each of the identified features is used:
- By creating the functions with the SECURITY DEFINER attribute, we open an access point for non-privileged users specifically for the prescribed interaction with pg_class--and nothing more.
- The temp table is created and owned by user postgres. There is no chance of the non-privileged user manipulating this table directly, accidentally or otherwise. Thus, the only positive escape for the transaction is through the use of the reenabling function.
- Deferring the foreign key on the temp table allows us to ensure the transaction is in an invalid state at all times while pg_class is in its vulnerable condition without aborting the transaction.
- ON COMMIT DROP allows the function to clean up after itself without having to make an explicit decision on the right time to drop the table. It allows a single temp table to be utilized per transaction, regardless of how many different relations will be passed through the trigger and rule disabling process.
- Before the temp table actually drops, its deferred foreign keys are evaluated. If any rows are left in the table, it means that for at least one relation we failed to call the reenable function, and the entire transaction is aborted rather than risk committing pg_class in the disabled state.
Converting to this system of pg_class manipulation completely eliminated the instances of finding pg_class in a committed state with triggers and rules disabled for various relations. It also allowed us to convert a number of database-dependent scripts and applications from using the postgres user down to the appropriate application users.
CREATE FUNCTION safe_disable_trigrules (
schema_name TEXT,
table_name TEXT
)
RETURNS void
LANGUAGE plpgsql
STRICT
SECURITY DEFINER
AS $$
DECLARE
text_table_pk TEXT NOT NULL := '';
text_fk_table TEXT NOT NULL := '';
text_cur_min_msg TEXT;
BEGIN
-- Stop any malicious shenanigans by user overloading
-- relations or operators in a different schema and
-- manipulating search_path to use them.
PERFORM
pg_catalog.set_config(
'search_path',
'pg_catalog, '
OPERATOR(pg_catalog.||)
(SELECT
pg_catalog.current_setting(
'search_path'
)
),
TRUE
);
-- Shared PK for table to hold FK in violated state.
-- This naming convention must not change without also
-- reflecting the convention in safe_reenable_trigrules()
-- so that both can immutably create the same name given
-- the same arguments.
text_table_pk :=
schema_name || '_' ||
table_name || '_' ||
TO_CHAR(
NOW(),
'DHH24MISSMS'
);
-- Allowing for the same relation to have triggers and rules
-- disabled and reenabled multiple times within the same
-- transaction. On subsequent calls, the temp table will
-- already exist.
SELECT setting
INTO text_cur_min_msg
FROM pg_settings
WHERE name = 'client_min_messages';
UPDATE pg_settings
SET setting = 'error'
WHERE name = 'client_min_messages'
AND text_cur_min_msg IS DISTINCT FROM 'error';
-- Attempt to create the temp table. If first function call for
-- transaction, it succeeds; otherwise, it fails silently unless
-- error is something other than re-creating extant table.
BEGIN
-- Temp table for this transaction, with same shared
-- convention as the PK above.
text_fk_table :=
'trigrules_' ||
TO_CHAR(
NOW(),
'DHH24MISSMS'
);
-- Use ON COMMIT DROP so PG will garbage collect
-- all such temp tables created within the transaction.
EXECUTE
'CREATE TEMP TABLE ' ||
quote_ident(text_fk_table) || ' (
id TEXT PRIMARY KEY NOT NULL,
fk_id TEXT NOT NULL
CONSTRAINT "Must Call safe_reenable_trigrules() Before Commit"
REFERENCES ' ||
quote_ident(text_fk_table) || '
DEFERRABLE
INITIALLY DEFERRED
)
ON COMMIT DROP';
EXCEPTION
WHEN DUPLICATE_TABLE THEN
-- Ignore
END;
UPDATE pg_settings
SET setting = text_cur_min_msg
WHERE name = 'client_min_messages'
AND text_cur_min_msg IS DISTINCT FROM 'error';
-- Insert new record that violates FK. Allowing for
-- the function to be gracefully recalled on the same
-- relation between calls to re-enable triggers and rules.
EXECUTE '
INSERT INTO ' ||
quote_ident(text_fk_table) || '
SELECT ' ||
quote_literal(text_table_pk) ||
', ' ||
quote_literal(text_table_pk || 'X') || '
WHERE NOT EXISTS (
SELECT 1
FROM ' ||
quote_ident(text_fk_table) || '
WHERE id = ' ||
quote_literal(text_table_pk) || '
)';
-- Disable all rules and triggers on target relation
UPDATE pg_class SET
relhasrules = false,
reltriggers = 0
FROM pg_namespace
WHERE pg_namespace.oid = pg_class.relnamespace
AND pg_namespace.nspname = schema_name
AND pg_class.relname = table_name;
-- Abort transaction if relation doesn't exist
IF NOT FOUND THEN
RAISE EXCEPTION
'Table %.% does not exist',
schema_name,
table_name;
END IF;
-- reset search_path for users legitimately overloading
-- operators or relations.
PERFORM
set_config(
'search_path',
(SELECT
SUBSTRING(
current_setting('search_path')
FROM
'^pg_catalog, (.*)'
)
),
TRUE
);
END;
$$
;
CREATE FUNCTION safe_reenable_trigrules (
schema_name TEXT,
table_name TEXT
)
RETURNS void
LANGUAGE plpgsql
STRICT
SECURITY DEFINER
AS $$
DECLARE
text_fk_table TEXT NOT NULL := '';
text_table_pk TEXT NOT NULL := '';
int_num_del INTEGER;
BEGIN
-- Stop any malicious shenanigans by user overloading
-- relations or operators in a different schema and
-- manipulating search_path to use them.
PERFORM
pg_catalog.set_config(
'search_path',
'pg_catalog, '
OPERATOR(pg_catalog.||)
(SELECT
pg_catalog.current_setting(
'search_path'
)
),
TRUE
);
-- Re-enable rules and triggers on target
UPDATE pg_class SET
reltriggers = (
SELECT COUNT(*) FROM pg_trigger
WHERE pg_class.oid = pg_trigger.tgrelid
),
relhasrules = (
SELECT COUNT(*) > 0
FROM pg_rules
WHERE schemaname = schema_name
AND tablename = table_name
)
FROM pg_namespace
WHERE pg_namespace.oid = pg_class.relnamespace
AND pg_namespace.nspname = schema_name
AND pg_class.relname = table_name;
-- Shared PK for table to hold FK in violated state.
-- This naming convention must not change without also
-- reflecting the convention in safe_disable_trigrules()
-- so that both can immutably create the same name given
-- the same arguments.
text_table_pk :=
schema_name || '_' ||
table_name || '_' ||
TO_CHAR(
NOW(),
'DHH24MISSMS'
);
-- Temp table for this transaction, with same shared convention
-- as the PK above.
text_fk_table :=
'trigrules_' ||
TO_CHAR(
NOW(),
'DHH24MISSMS'
);
-- Remove pertinent row so FK is no longer in violated state
EXECUTE
'DELETE FROM ' ||
quote_ident(text_fk_table) || '
WHERE id = ' ||
quote_literal(text_table_pk);
GET DIAGNOSTICS int_num_del = ROW_COUNT;
IF (int_num_del > 0) IS NOT TRUE THEN
RAISE EXCEPTION
'No entry for %.% set by safe_disable_trigrules()',
schema_name,
table_name;
END IF;
-- reset search_path for users legitimately overloading
-- operators or relations.
PERFORM
set_config(
'search_path',
(SELECT
SUBSTRING(
current_setting('search_path')
FROM
'^pg_catalog, (.*)'
)
),
TRUE
);
END;
$$
;
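The deferred foreign key guard used by the two functions above can be tried out in isolation. What follows is only a minimal sketch in Python, with SQLite standing in for PostgreSQL so that it runs without a server. The table name trigrules_guard, the relation key public_orders, and all of the helper functions are hypothetical. SQLite has no ON COMMIT DROP and no pg_class, so only the commit-time check is illustrated here.

```python
import sqlite3


def open_guarded_connection():
    """Return an in-memory SQLite connection holding a guard table.

    The guard table has a self-referencing foreign key that is checked
    only at COMMIT (DEFERRABLE INITIALLY DEFERRED).
    """
    # isolation_level=None: we issue BEGIN/COMMIT/ROLLBACK ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """
        CREATE TABLE trigrules_guard (
            id    TEXT PRIMARY KEY NOT NULL,
            fk_id TEXT NOT NULL
                  REFERENCES trigrules_guard (id)
                  DEFERRABLE INITIALLY DEFERRED
        )
        """
    )
    return conn


def mark_disabled(conn, relation):
    """Insert a row whose fk_id matches no id: a pending FK violation."""
    conn.execute(
        "INSERT INTO trigrules_guard (id, fk_id) VALUES (?, ?)",
        (relation, relation + "X"),
    )


def mark_reenabled(conn, relation):
    """Delete the violating row so the deferred check passes at COMMIT."""
    conn.execute("DELETE FROM trigrules_guard WHERE id = ?", (relation,))


def commit_succeeds(conn):
    """Try to COMMIT; roll back and return False if the deferred FK fails."""
    try:
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")
        return False


def run_transaction(reenable):
    """Disable, optionally re-enable, then report whether COMMIT worked."""
    conn = open_guarded_connection()
    conn.execute("BEGIN")
    mark_disabled(conn, "public_orders")
    if reenable:
        mark_reenabled(conn, "public_orders")
    return commit_succeeds(conn)
```

Calling run_transaction(False) models a caller that forgets the re-enable step, so the commit is refused; run_transaction(True) models the proper pairing of the two calls.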
Pl/Perl multiplicity issues with PostgreSQL - the Highlander restriction
I came across this error recently for a client using Postgres 8.4:
ERROR: cannot allocate multiple Perl interpreters on this platform
Most times when you see this error it indicates that someone was trying to use both a Pl/Perl function and a Pl/PerlU function on a server in which Perl's multiplicity flag is disabled. In such a case, only a single Perl interpreter can exist for each Postgres backend, and when you try to create a new one (as happens when you execute two functions written in Pl/Perl and Pl/PerlU), the error above is thrown.
However, in this case it was not a combination of Pl/Perl and Pl/PerlU - I confirmed that only Pl/Perl was installed. The error was caused by a slightly less known limitation of a non-multiplicity Perl and Postgres. As the docs mention at the very bottom of the page, "...so any one session can only execute either PL/PerlU functions, or PL/Perl functions that are all called by the same SQL role". So we had two roles both trying to execute some Pl/Perl code in the same session. How is that possible - isn't each session tied to a single role at login? The answer is the SECURITY DEFINER flag for functions, which causes the function to run as if it was being invoked by the role that created the function, not the role that is executing it.
There is still a bit of a gotcha here, because Perl interpreters are created as needed, and thus the order of operations is very important. In other words, you may be able to run function foo() just fine, and run function bar() just fine, but you cannot run them together in the same session! This applies to both the Pl/Perl and Pl/PerlU limitation, as well as the Pl/Perl multiple user limitation.
While Postgres will validate functions as you create them, this is subject to the same in-session limitation. All of the below examples assume you have a non multiplicity-enabled Perl (see the perlguts manpage for gory details on what multiplicity means in Perl). To see what state your Perl is in, you need to determine if the 'usemultiplicity' option is enabled. The -V option to the perl executable tells it to output all of its configuration parameters. While the canonical way to check is to issue a perl -V:usemultiplicity, that's a hard string to remember, so I simply use grep:
$ perl -V | grep multi
useithreads=define, usemultiplicity=define
The above indicates that Perl has been compiled with multiplicity and is thus not subject to the Postgres limitations - you can mix and match Perl functions in your database with abandon. The only problem occurs if the output looks like this:
$ perl -V | grep multi
useithreads=undef, usemultiplicity=undef
Technically, you can also prevent the issue by setting ithreads on, but there really is no reason to not just keep things simpler by setting the multiplicity on.
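If you want to automate the check, the same information can be pulled out of the perl -V text programmatically. This is a small sketch in Python; the function names parse_perl_flags and has_multiplicity are made up for illustration, and the only format assumed is the name=value pairs shown in the grep output above.

```python
import re


def parse_perl_flags(perl_v_output):
    """Extract the useithreads/usemultiplicity values from `perl -V` text."""
    pattern = r"\b(useithreads|usemultiplicity)=(\w+)"
    return dict(re.findall(pattern, perl_v_output))


def has_multiplicity(perl_v_output):
    """True when the text reports usemultiplicity=define."""
    return parse_perl_flags(perl_v_output).get("usemultiplicity") == "define"
```

The text to feed in would be whatever perl -V prints on the database server, captured however is convenient (for example with Python's subprocess module).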
Watch what happens when we try to create two Perl functions using Postgres 9.2:
postgres=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create language plperl;
CREATE LANGUAGE
test=# create language plperlu;
CREATE LANGUAGE
test=# create or replace function test_perlver()
test-# returns text
test-# language plperl
test-# AS $$ return "Running test_perlver on Perl $^V"; $$;
CREATE FUNCTION
test=# create or replace function test_perlverU()
test-# returns text
test-# language plperlU
test-# AS $$ return "Running test_perlverU on Perl $^V"; $$;
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlveru"
What's going on here? We've already used a perl (Pl/Perl) in *this session*, so we cannot create another one, even if just to compile (but not execute) the function. However, if we start a new session, we can create our Pl/PerlU function!
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create or replace function test_perlverU()
test-# returns text
test-# language plperlU
test-# AS $$ return "Running test_perlverU on Perl $^V"; $$;
CREATE FUNCTION
This Highlander restriction ("there can be only one!") applies to both creation and execution of functions. Notice that we have both the Pl/Perl and Pl/PerlU versions installed, but we can only use one in a particular session - and which one depends on which is called first!:
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# select test_perlver();
test_perlver
--------------------------------------
Running test_perlver on Perl v5.10.0
test=# select test_perlverU();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlveru"
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# select test_perlverU();
test_perlveru
---------------------------------------
Running test_perlverU on Perl v5.10.0
test=# select test_perlver();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlver"
As you can imagine, the nondeterministic nature of such functions can make discovery and debugging of this issue on production servers tricky. :) Here's the other variant we talked about, in which only the first of two functions - both of which are Pl/Perl - will run:
postgres=# create database test;
CREATE DATABASE
postgres=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create language plperl;
CREATE LANGUAGE
test=# create or replace function foo()
test-# returns text
test-# language plperl
test-# security invoker
test-# AS $$ return "Running as security invoker"; $$;
CREATE FUNCTION
test=# create or replace function bar()
test-# returns text
test-# language plperl
test-# security definer
test-# AS $$ return "Running as security definer"; $$;
CREATE FUNCTION
Now let's run as the user who created the function - no problemo, because we are the same user that created the function:
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# SELECT foo();
foo
-----------------------------
Running as security invoker
(1 row)
test=# SELECT bar();
bar
-----------------------------
Running as security definer
(1 row)
All is well. However, if we try it as a different user, the Highlander restriction creeps in:
test=# \c test greg
You are now connected to database "test" as user "greg".
test=# SELECT foo();
foo
-----------------------------
Running as security invoker
(1 row)
test=# SELECT bar();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "bar"
test=# \c test greg
You are now connected to database "test" as user "greg".
test=# SELECT bar();
bar
-----------------------------
Running as security definer
(1 row)
test=# SELECT foo();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "foo"
This one took me a while to figure out on a production system, as somewhere in a twisty maze of trigger functions there was one that was set as security definer. Normally, this was not a problem, as the user that created that function did most of the updates, but a different user invoked a non-security-definer function and then the security definer function, causing the error at the top of this article to show up.
So what can one do to prevent this problem from occurring? Luckily, for most people this will not be a problem, as many (if not all) distros and operating systems have the multiplicity compile flag for Perl enabled. If you do have the restriction, one option is to simply be careful about the use of security definer functions. You could either declare everything as security definer, or perhaps make sure that it is only called in a separate session if it really needs to be called by a different user.
A better solution is to recompile your Perl to enable multiplicity. I am not aware of any drawbacks to doing so. In theory, one could even recompile Perl in-place and then restart Postgres, but I have never tried this out. :)
Using Different PostgreSQL Versions at The Same Time.
When I work for multiple clients on multiple different projects, I usually need a bunch of different stuff on my machine. One of the things I need is having multiple PostgreSQL versions installed.
I use Ubuntu 12.04. Installing PostgreSQL there is quite easy. Two versions are currently available out of the box: 8.4 and 9.1. To install them I used the following command:
~$ sudo apt-get install postgresql-9.1 postgresql-8.4 postgresql-client-common
Now I have the above two versions installed.
Starting the database is also very easy:
~$ sudo service postgresql restart
 * Restarting PostgreSQL 8.4 database server [ OK ]
 * Restarting PostgreSQL 9.1 database server [ OK ]
The problem I had for a very long time was using the proper psql version. Both databases install their own programs, such as pg_dump and psql. Normally you can use pg_dump from the higher PostgreSQL version; however, using a different psql version can be dangerous, because psql relies on many queries that dig deep into PostgreSQL's internal tables to get information about the database. Those internals sometimes change from one database version to another, so the best solution is to use the psql from the PostgreSQL installation you want to connect to.
The solution to this problem turned out to be quite simple. There is a pg_wrapper program which can take care of the different versions. It is enough to provide information about the PostgreSQL version you want to connect to and it will automatically choose the correct psql version.
Below you can see the results of the psql --version command, which prints the psql version. As you can see, a different psql version is chosen according to the --cluster parameter.
~$ psql --cluster 8.4/main --version
psql (PostgreSQL) 8.4.11
contains support for command-line editing
~$ psql --cluster 9.1/main --version
psql (PostgreSQL) 9.1.4
contains support for command-line editing
You can find more information in the program manual, using man pg_wrapper or the online pg_wrapper manual.
Postgres Open 2012
My talk will be "Choosing a Logical Replication System: Slony vs Bucardo".
I look forward to seeing many of you there!
Postgres log_statement='all' should be your default
Setting the PostgreSQL log_statement parameter to 'all' is always your best choice; this article will explain why. PostgreSQL does not have many knobs to control logging. The main one is log_statement, which can be set to 'none' (do not ever set it to this!), 'ddl' or 'mod' (decent but flawed values), or 'all', which is what you should be using. In addition, you probably want to set log_connections = on, log_disconnections = on, and log_duration = on. Of course, if you do set all of those, don't forget to set log_min_duration_statement = -1 to turn it off completely, as it is no longer needed.
The common objections to setting log_statement to 'all' can be summed up as Disk Space, Performance, and Noise. Each will be explained and countered below. The very good reasons for having it set to 'all' will be covered as well: Troubleshooting, Analytics, and Forensics/Auditing.
Objection: Disk Space
The most common objection to logging all of your SQL statements is disk space. When log_statement is set to all, every action against the database is logged, from a simple SELECT 1; to a gigantic data warehousing query that is 300 lines long and takes seven hours to complete. As one can imagine, logging all queries generates large logs, very quickly. How much depends on your particular system of course. Luckily, the amount of space is very easy to test: just flip log_statement='all' in your postgresql.conf, and reload your database (no restart required!). Let it run for 15 minutes or so and you will have a decent starting point for extrapolating daily disk space usage. For most of our clients, the median is probably around 30MB per day, but we have some as low as 1MB and some over 50GB! Disk space is cheap, but if it is really an issue to save everything, one solution is to dynamically ship the logs to a different box via syslog (see below). Another not-as-good option is to simply purge older logs, or at least ship the older logs to a separate server, or perhaps to tape. Finally, you could write a quick script to remove common and uninteresting lines (say, all selects below 10 milliseconds) from older logs.
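The extrapolation from a short sample is simple arithmetic. As a rough sketch in Python (the function name and the sample figure in the usage note are made up, and the linear scaling is itself an assumption, since traffic is rarely uniform across the day):

```python
def estimate_daily_log_bytes(sample_bytes, sample_minutes):
    """Scale the log volume of a short sample linearly to 24 hours."""
    if sample_minutes <= 0:
        raise ValueError("sample_minutes must be positive")
    minutes_per_day = 24 * 60
    return sample_bytes * minutes_per_day / sample_minutes
```

For instance, a hypothetical 15-minute sample that produced 1,000,000 bytes of log would extrapolate to 96,000,000 bytes per day. Sampling during a busy period gives an upper-end estimate; sampling overnight will undershoot.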
Objection: Performance
Another objection is that writing all those logs is going to harm the performance of the server the database is on. A valid concern, although the actual impact can be easily measured by toggling the value temporarily. The primary performance issue is I/O, as all those logs have to get written to a disk. The first solution is to make sure the logs are going to a different hard disk, thus reducing contention with anything else on the server. Additionally, you can configure this disk differently, as it will be heavy write/append with little to no random read access. The best filesystems for handling this sort of thing seem to be ext2 and ext4. A better solution is to trade the I/O hit for a network hit, and use syslog (or better, rsyslog) to ship the logs to a different server. Doing this is usually as simple as setting log_destination = 'syslog' in your postgresql.conf and adjusting your [r]syslog.conf. This has many advantages: if shipping to a local server, you can often go over a non-public network interface, and thus not impact the database server at all. This other server can also be queried at will, without affecting the performance of the database server. This means heavy analytics, e.g. running pgsi or pgfouine, can run without fear of impacting production. It can also be easier to provision this other server with larger disks than to mess around with the production database server.
Objection: Noise
A final objection is that the log files get so large and noisy, they are hard to read. Certainly, if you are used to reading sparse logs, this will be a change that will take some getting used to. One should not be reading logs manually anyway: there are tools to do that. If all your logs were showing before was log_min_duration_statement, you can get the same effect (in a prettier format!) by using the 'duration' mode of the tail_n_mail program, which also lets you pick your own minimum duration and then sorts them from longest to shortest.
Advantage: Troubleshooting
When things go wrong, being able to see exactly what is happening in your database can be crucial. Additionally, being able to look back and see what was going on can be invaluable. I cannot count the number of times that full logging has made debugging a production issue easier. Without this logging, the only option sometimes is to switch log_statement to all and then wait for the error to pop up again! Don't let that happen to you - log heavy preemptively. This is not just useful for tracking direct database problems; often the database trail can enable a DBA to work with application developers to see exactly what their application is doing and where things started to go wrong. On that note, it is a good idea to log as verbosely as you can for everything in your stack, from the database to the application to the httpd server: you never know which logs you may need at a moment's notice when major problems arise.
Advantage: Analytics
If the only logging you are doing is of those queries that happen to run longer than your log_min_duration_statement, you have a very skewed and incomplete view of your database activity. Certainly one can view the slowest queries and try to speed them up, but tools like pgsi are designed to parse full logs: the impact of thousands of "fast" queries can often be more stressful on your server than a few long-running queries, but without full logging you will never know. You also won't know if those long-running queries sometimes (or often!) run faster than log_min_duration_statement.
We do have some clients that cannot do log_statement = 'all', but we still want to use pgsi, so what we do is turn on full logging for a period of time via cron (usually 15 minutes, during a busy time of day), then turn it off and run pgsi on that slice of full logging. Not ideal, but better than trying to crunch numbers using incomplete logs.
Advantage: Forensics/Auditing
Full logging via log_statement='all' is extremely important if you need to know exactly what commands a particular user or process has run. This is relevant not just to SQL injection attacks, but also to rogue users, lost laptops, and any situation in which someone has done something unknown to your database. Not every one of these situations will be as noticeable as the infamous DROP TABLE students; statement: often it involves updating a few critical rows, modifying some functions, or simply copying a table to disk. The *only* way to know exactly what was done is to have log_statement = 'all'. Luckily, this parameter cannot be turned off by ordinary clients: short of superuser access, one must edit the postgresql.conf file and then reload the server.
The advantages of complete logging should outweigh the disadvantages, except in the most extreme cases. Certainly, it is better to start from a position of setting Postgres' log_statement to 'all' and defending any change to a lesser setting. Someday it may save your bacon. Disagreements welcome in the comment section!
Speeding Up Integration Tests with PostgreSQL - Follow Up
Last week I wrote a blog article about speeding up integration tests using PostgreSQL. There I proposed changing a couple of PostgreSQL cluster settings. The main drawback of that method is that the settings have to be changed for the whole cluster. If you have important data in other databases, you can run into trouble.
In one of the comments Greg proposed using unlogged tables. This feature appeared in PostgreSQL 9.1. The only difference is that you use CREATE UNLOGGED TABLE instead of CREATE TABLE when creating all your tables.
For an unlogged table, the data is not written to the write-ahead log. All inserts, updates and deletes are much faster; however, the table will be truncated after a server crash or unclean shutdown. Such a table is also not replicated to standby servers, which follows from the fact that standbys are fed from the write-ahead log. More importantly, the indexes created on unlogged tables are unlogged as well.
Everything I describe here is for integration tests. When the database crashes, all the tests should be restarted and should prepare the database before running, so I really don't care what happens to the data when something crashes.
The bad thing about unlogged tables is that you cannot change a normal table to an unlogged one. There is nothing like:
ALTER TABLE SET UNLOGGED
The easiest way I found to convert the tables to unlogged was to create a database dump and add UNLOGGED to all the table creation commands. To do it a little faster, I used this command:
pg_dump pbench | sed 's/^CREATE TABLE/CREATE UNLOGGED TABLE/' > pbench.dump.sql
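For anyone who would rather do the rewrite from a script than from the shell, the same line-anchored substitution can be sketched in Python. The function name make_tables_unlogged is hypothetical; like the sed expression, it only touches lines that begin with CREATE TABLE.

```python
import re


def make_tables_unlogged(dump_sql):
    """Turn each line-initial CREATE TABLE into CREATE UNLOGGED TABLE."""
    return re.sub(
        r"^CREATE TABLE",
        "CREATE UNLOGGED TABLE",
        dump_sql,
        flags=re.MULTILINE,  # make ^ match at the start of every line
    )
```

Because the pattern is anchored to the start of a line, a mention of CREATE TABLE inside a comment or in the middle of a statement is left alone.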
This time I will just drop and recreate the database and load this dump before running the tests, instead of using pgbench to generate the data:
psql -c "drop database pbench"
psql -c "create database pbench"
psql pbench < pbench.dump.sql
Time for tests. The previous test results are in the previous blog article. I'm using standard PostgreSQL settings (the secure ones) and the same scale value for pgbench.
The tests were made using exactly the same command as last time:
./pgbench -h localhost -c 1 -j 1 -T 300 pbench
Below are results combined with the results from previous article.
| settings | 1 client and thread | 2 clients and threads | 3 clients and threads |
|---|---|---|---|
| normal settings | 78 tps | 80 tps | 99 tps |
| dangerous settings | 414 tps | 905 tps | 1215 tps |
| unlogged table | 420 tps | 720 tps | 1126 tps |
As you can see, the efficiency with unlogged tables is almost as good as with the unsafe settings. The great thing is that it doesn't influence other databases in the cluster, so you can use safe/default settings for other databases, and only use unlogged tables for the integration tests, which should be much faster now.
This solution works only with PostgreSQL 9.1 and newer. If you have an older PostgreSQL, you have to use the previous method with unsafe settings, or better: just upgrade the database.
Speeding Up Integration Tests with PostgreSQL
Many people say they don't want to write tests, and one of the usual reasons is that the tests are too slow. Tests can be slow because they are badly written. They can also be slow because of slow components, and one such component is usually the database.
The great thing about PostgreSQL is that nearly all statements, including those that change the database structure, are transactional. This means you can start a transaction and then run the test, which can add, delete and update all the data and database structure it wants. At the end of the integration test a rollback should be issued, which reverts all the changes. The next test will then always start with the same database structure and data, and you don't need to clean anything up manually.
For running the integration tests we need a test database. One of the most important things when running tests is speed. Tests should be fast; programmers don't like to wait ages just to learn that something is wrong.
We can also have a read-only database for the tests. Then you don't need to worry about transactions, but you always need to ensure the tests won't change anything. Even if you assume your tests won't make any changes, it is always better to use a new transaction for each test and roll back at the end.
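The transaction-per-test idea can be sketched as a small helper. This is only an illustration in Python, with SQLite standing in for PostgreSQL so that it runs without a server; the helper rolled_back_transaction, the accounts and scratch tables, and the demo function are all hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rolled_back_transaction(conn):
    """Run the enclosed block in a transaction that is always rolled back."""
    conn.execute("BEGIN")
    try:
        yield conn
    finally:
        conn.execute("ROLLBACK")


def make_test_database():
    """Create a tiny database with one account holding a balance of 100."""
    # isolation_level=None: transactions are controlled explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE accounts (aid INTEGER PRIMARY KEY, abalance INTEGER)"
    )
    conn.execute("INSERT INTO accounts VALUES (1, 100)")
    return conn


def balance(conn, aid):
    """Return the current balance of one account."""
    row = conn.execute(
        "SELECT abalance FROM accounts WHERE aid = ?", (aid,)
    ).fetchone()
    return row[0]


def demo():
    """Change data and structure inside a test, then observe the rollback."""
    conn = make_test_database()
    with rolled_back_transaction(conn) as tx:
        tx.execute("UPDATE accounts SET abalance = abalance + 50 WHERE aid = 1")
        tx.execute("CREATE TABLE scratch (x INTEGER)")
        inside = balance(tx, 1)
    after = balance(conn, 1)
    leftover_tables = conn.execute(
        "SELECT count(*) FROM sqlite_master WHERE name = 'scratch'"
    ).fetchone()[0]
    return inside, after, leftover_tables
```

Inside the block the test sees its own changes; once the block exits, both the data change and the new table are gone, so the next test starts from the same state.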
The main idea for fast integration tests using PostgreSQL is that those tests don't permanently change anything in the database. If nothing changes, we don't need to worry about possible data loss when the database suddenly restarts; we can just restart the tests. Tests should prepare the data before running, assuming the database is in an unknown state.
This database should be as fast as possible, even if it means losing data when something unusual happens. Normally PostgreSQL copes very well when someone suddenly pulls the server's plug or kills the database process: it simply doesn't lose the data.
However, we really don't need this when running the tests. The database can be loaded before running the tests. If the database is suddenly shut down, we should restart the tests.
The simplest approach is to change a couple of settings that provide safe, durable writes but slow the database down. We don't need durable writes; they only matter when something crashes, and in that case we should restart all the components used for the integration tests and reload the database before testing.
For testing I will use the pgbench program, which runs a test similar to TPC-B. It prepares the data in four tables and then repeatedly performs a simple transaction:
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
where the params are randomly chosen during execution.
The database is created normally using an SQL query:
CREATE DATABASE pbench;
Before running the tests, pgbench has to prepare the initial data. This is done using the -i param. My computer is not very slow, so the default database size is too small; I used a considerably larger database via the -s 25 param. This way the database size is about 380MB, including indexes.
./pgbench -h localhost pbench -i -s 25
The first test will just run using standard PostgreSQL configuration settings and will run for 5 minutes (the -T 300 param).
./pgbench -h localhost -T 300 pbench
The initial results show about 80 transactions per second (tps).
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 23587
tps = 78.621158 (including connections establishing)
tps = 78.624383 (excluding connections establishing)
You have probably noticed the "number of clients" and "number of threads" values. This is the scenario where the tests are sequential, run one by one. However, well-written integration tests can be run in parallel, so let's run pgbench once again, but with three threads and three clients.
./pgbench -h localhost -c 3 -j 3 -T 300 pbench
The results show that it is a little bit faster now:
number of clients: 3 number of threads: 3 duration: 300 s number of transactions actually processed: 29782 tps = 99.268609 (including connections establishing) tps = 99.273464 (excluding connections establishing)
Let's now change the PostgreSQL settings to more dangerous ones. With these, the database can lose data when shut down suddenly, but I don't care, as all the data is loaded just before running the tests.
I added the following lines at the end of the postgresql.conf file:
fsync = off # turns forced synchronization on or off
synchronous_commit = off # synchronization level; on, off, or local
full_page_writes = off # recover from partial page writes
These particular settings only need a configuration reload to take effect, but I simply restarted PostgreSQL and then ran the pgbench tests once again.
All the results are in the following table:
| number of clients and threads | 1 | 2 | 3 |
|---|---|---|---|
| normal settings | 78 tps | 80 tps | 99 tps |
| dangerous settings | 414 tps | 905 tps | 1215 tps |
| change ratio | 531 % | 1131 % | 1227 % |
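The change ratio row is simply the tps with dangerous settings expressed as a percentage of the tps with normal settings. A quick sketch to reproduce it from the rounded figures in the table (the variable and function names are only for illustration):

```python
# tps figures copied from the table, keyed by the number of clients/threads
normal = {1: 78, 2: 80, 3: 99}
dangerous = {1: 414, 2: 905, 3: 1215}


def change_ratio(before, after):
    """Return 'after' as a rounded percentage of 'before'."""
    return round(after / before * 100)


ratios = {clients: change_ratio(normal[clients], dangerous[clients])
          for clients in normal}
print(ratios)
```

Because the inputs are the rounded table values rather than the raw pgbench output, the last digit could differ slightly if you recompute from your own runs.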
As you can see, you can do three simple things to speed up your integration tests using PostgreSQL:
- Change default PostgreSQL settings to speed the database up.
- Change your tests to run in parallel.
- Run each test in one transaction.
I've also played with many other settings which could have some impact on PostgreSQL speed. They do have an effect, but it is so small that I don't think it is worth mentioning here. Changing the three settings above can make PostgreSQL fast enough.
You shall not pass! Preventing SQL injection
Greg Sabino Mullane presented a few extremely useful techniques for preventing SQL injection. His advice was mostly based on his recent real-world experience.
A simple chunk of code was opening a potentially very dangerous security hole in the system:
[query … where order_number='[scratch order_number]' and username='[session username]']
This code can generate this SQL query:
select * from orders where order_number = '12345' and username = 'alice';
Or this SQL query:
select * from orders where order_number=' '; delete from orders where id IS NOT NULL;
This is a vulnerability, and you certainly do not want any random stranger to delete records from the "orders" table in your database.
The problem was solved in no time by escaping user input.
Here is Greg's list of recommendations to make SQL injection impossible:
- Escape all user input passed to the database.
- Log extensively. If this system hadn't logged SQL queries, they would never have noticed anything strange. They used tail_n_mail, which tracks PostgreSQL logs and sends out emails whenever a SQL exception occurs.
- Introduce fine-grained control for accessing and manipulating the database. Split responsibilities between a lot of database users and selectively grant permissions to them. Run your code as the appropriate database user with the most restrictive set of permissions possible.
- Database triggers can come in very handy. In Greg's case it was impossible to delete an already shipped order because of the triggers guarding those records.
- Have a lot of eyes on the code to eliminate the obvious mistakes.
- And finally, if SQL injection is happening, consider shutting down the database server until you find the cause. This is an emergency!
Detecting Postgres SQL Injection
SQL injection attacks are often treated with scorn among seasoned DBAs and developers - "oh it could never happen to us!". Until it does, and then it becomes a serious matter. It can, and most likely will eventually happen to you or one of your clients. It's prudent to not just avoid them in the first place, but to be proactively looking for attacks, to know what to do when they occur, and know what steps to take after you have cleaned up the mess.
What is a SQL injection attack? Broadly speaking, it is a malicious user entering data to subvert the nature of your original query. This is almost always through a web interface, and involves an "unescaped" parameter that can be used to change the data returned or perform other database actions. The user "injects" their own SQL into your original SQL statement, changing the query from its original intent.
For example, you have a page in which a logged-in customer can look up their orders by an order_number, a text field on a web form. The query thus looks like this in your code:
$order_id = cgi_param('order_number');
$sql = "SELECT * FROM order WHERE order_id = $order_id AND order_owner = '$username'";
$results = run_query($sql);
Because there is nothing to limit what the user enters in the order_number field, they can inject their own SQL into the middle of your SQL query by creating a non-standard order_number such as:
12345 --
This would return information on anyone's order, without checking the order_owner column, as the SQL sent to the database would become:
SELECT * FROM order WHERE order_id = 12345 -- AND order_owner = 'alice'
Much more creative (and destructive) choices are available to the attacker as well, such as:
SELECT * FROM order WHERE order_id = 12345; UPDATE user SET admin=TRUE WHERE username = 'alice'; --AND order_owner = 'alice';
SELECT * FROM order WHERE order_id = 12345; TRUNCATE TABLE invoices; SELECT * FROM order WHERE order_id = 12345 AND order_owner = 'alice';
The above is a very simplistic generic-language example, but there are many ways for SQL injection attacks to work, including software out of your direct control (anything in your chain, from database driver to language to the database itself) and non-obvious angles (such as getting creative with multi-byte languages).
The correct approach to the above would be to use placeholders:
$order_id = cgi_param('order_number');
$sql = 'SELECT * FROM order WHERE order_id = ? AND order_owner = ?';
$results = run_query($sql, $order_id, $username);
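To see the difference in something runnable, here is a minimal sketch using Python's built-in sqlite3 module rather than Postgres. The table is named orders (not order, which is a reserved word) and the sample rows are made up for illustration; the injected value is the "12345 --" from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, order_owner TEXT);
    INSERT INTO orders VALUES (12345, 'bob');
    INSERT INTO orders VALUES (67890, 'alice');
""")

username = "alice"
malicious = "12345 --"  # the non-standard order_number entered by the attacker

# Vulnerable: the value is pasted into the SQL text, so "--" comments out
# the order_owner check and alice gets to read bob's order.
unsafe_sql = (
    f"SELECT * FROM orders WHERE order_id = {malicious} "
    f"AND order_owner = '{username}'"
)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: placeholders pass the value separately from the SQL text, so the
# whole string is compared against order_id and nothing matches.
safe_sql = "SELECT * FROM orders WHERE order_id = ? AND order_owner = ?"
safe_rows = conn.execute(safe_sql, (malicious, username)).fetchall()

print("unsafe:", unsafe_rows)
print("safe:  ", safe_rows)
```

The placeholder syntax differs between drivers (sqlite3 uses ?, many Postgres drivers use %s or $1), but the principle of keeping values out of the SQL string is the same.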
Reaction
So you've just detected a SQL injection attack. Don't panic! Okay, perhaps panic a little bit. The first order of business is to, as quickly as possible, disable access and prevent the attacker from doing anything else. Their next injected SQL statement could be a DROP TABLE. Do as much as is needed to stop it right away - don't worry about fixing the hole yet. Stop Apache, disable all CGI, shut down your database, whatever it takes. Yes, this will cause a loss of business for a busy site but so will that potential DROP TABLE command! Once things are disabled, start patching up the holes. If it is a well isolated, obvious fix, bring things back up. If not, look for similar code with the same problem, then bring things back up. There are now some important steps to take:
- Double check all similar code for any other problems.
- Check your logs carefully to see if this was an isolated event, or if the hole had been used before. If you are relying on SQL errors for detection, a careful attacker may have already successfully injected some SQL. See below for forensic tactics.
- Learn why this happened in the first place. Didn't update a driver? Someone just wrote some bad code? Something else? Fix it at both the immediate technical and long-term procedural level.
Detection
Detection is the most important part of this article. If someone were to start a SQL injection attack against your site right now, would you even know? How quickly?
Fortunately, SQL injection attacks almost always generate some SQL errors as the attacker tries to work around your SQL. This is the number one way to detect an attack while it is happening. We recommend the invaluable tail_n_mail for this task. For our clients, we have tail_n_mail running via cron every minute, scanning for new and interesting errors and mailing them out to us. Thus, detection is usually within minutes.
In addition to pure SQL errors, permission errors often occur as well, as the attacker tries to do something not allowed by the current database user, such as creating a table or running the COPY command. Remember to never treat a strange error as an uninteresting isolated event, or assume that it is probably one of your developers making a typo. Follow up on everything.
Sometimes, when the attacker is very good, no SQL errors are generated, and the problems have to be detected in other ways. One way is to scan for common SQL injection items. The trick is filtering out valid SQL while finding injected ones. In most cases, attacker access to your database is fairly limited without knowing the names of your tables, columns, functions, and views, so one thing to look for is references to system tables such as pg_class and pg_attribute, system views such as pg_tables and pg_stat_activity, the pg_sleep() function, and the information_schema schema. (pg_sleep() is often used in "blind SQL" attacks, to let the attacker know if something worked or not by the inclusion of a delay, when there may be no other direct feedback from their injection). While looking for these items is not as easy to set up as looking for errors, it can be fairly easy to develop and exclude a whitelist of things that should be accessing those items.
Another thing to watch out for is strange offsets. Because the information an attacker can get back is often limited to a row at a time due to the limitations of the original query, SQL injections often pull back the same information from, say, information_schema.tables, with a "LIMIT 1 OFFSET 1" tacked on. Then they call the page again and inject their SQL, but with an offset of 2, then an offset of 3, etc. Nothing says SQL injection like seeing an OFFSET 871 in your logs.
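The two checks just described (references to system objects, and unusually large offsets) could be sketched along these lines. This is only an illustration: the whitelist entry and the offset threshold are assumptions you would replace with values that fit your own application, and the function expects the statement text already extracted from a log line.

```python
import re

# System objects the text suggests watching for in logged statements.
SUSPICIOUS = re.compile(
    r"\b(pg_class|pg_attribute|pg_tables|pg_stat_activity|pg_sleep"
    r"|information_schema)\b",
    re.IGNORECASE,
)
OFFSET = re.compile(r"\bOFFSET\s+(\d+)", re.IGNORECASE)

# Hypothetical whitelist: statements known to touch these objects legitimately.
WHITELIST = [
    re.compile(r"^SELECT count\(\*\) FROM pg_stat_activity$", re.IGNORECASE),
]

MAX_OFFSET = 500  # assumed threshold; tune to what your application really uses


def is_suspicious(statement):
    """Return True if a logged statement deserves a closer look."""
    statement = statement.strip()
    if any(allowed.search(statement) for allowed in WHITELIST):
        return False
    if SUSPICIOUS.search(statement):
        return True
    match = OFFSET.search(statement)
    return bool(match and int(match.group(1)) > MAX_OFFSET)


print(is_suspicious("SELECT * FROM orders LIMIT 1 OFFSET 871"))
print(is_suspicious("SELECT * FROM orders LIMIT 10 OFFSET 20"))
```

A check like this complements error-based detection; it does not replace it.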
Speaking of logs, you may have noticed that the above checks will only work if you are logging all statements, by adjusting log_statement in your postgresql.conf file. Setting this parameter to 'all' is *highly* recommended, and SQL injection detection (and forensics!) are merely two of the many reasons for doing so.
If you don't have log_statement set to 'all', your only hope of direct detection is if one of the queries happens to get logged for some other reason, such as going over your log_min_duration_statement setting. Good luck with that. /sarcasm.
There are other methods of detecting SQL injection, but they can all be classified as reacting to side effects. Your logs may grow larger, a sysadmin on your team may notice some odd network patterns, your business intelligence people may come across some unexplained buying patterns, etc. Intuition from experienced people is a powerful tool: follow up on those hunches and nagging feelings!
Prevention
Preventing SQL injection is mostly a matter of following some standard software development practices. Basically, you want your code up to date, well vetted, and easy to read and revert. Here are some guidelines:
Use version control
More specifically, use git. For everything related to your site. Application code, HTML pages, system configurations. There are many advantages to git, but it is particularly useful when you are (quickly!) trying to figure out how some bad code (e.g. with SQL injection holes) got into your app, and what safe version you can replace it with. The powers of git log -p, git bisect, git blame, and git checkout will make you wonder how you ever lived without them.
More than one set of eyes
Never commit code that hasn't been looked at by at least one other person not involved in its writing. This can range from something as informal as leaning over and asking someone else to look at the patch to setting up a complex enforcement system via something like gerrit. The most important thing is to have it reviewed by someone qualified, and to note the review in your commit message.
Email is a great way to do this, especially if you have a list of people qualified to give a review of the code in question. So, database changes could go to a "dbgroup" list, and one or more people on the list will review and reply.
Another nice thing is a post-commit hook that mails committed code as a diff to a wide audience, such as all engineers in the company. Sure, most people may ignore it, but the more eyes the better. On that note, make sure the age-old appeal of heavily commenting code is followed, especially code that is trying to fetch information from a database.
Teach people about SQL injection
Using placeholders is the only truly safe way to write code. Make sure everyone knows this, and show some examples of SQL injection problems to new hires so they know what to look out for and what the consequences will be.
Never assume any database input is safe.
Never, ever assume database input is safe, or will remain safe. Always use prepared statements aka placeholders. You say you scrubbed that variable with a regular expression above the SQL call? Someone will tweak that regex someday.
Be proactive in looking for problems
See the section about using tail_n_mail. There are also companies / tools that will attempt to find SQL injection problems in your application. While not foolproof, these can be useful, particularly if you have a very large website with a very large codebase.
Keep your software up to date
Sure, your software is free of all problems, but what about the framework you are using? The language? The database? And the database drivers? They may have a SQL injection problem, and, more importantly, they may have already patched it. Run the most recent version, and make sure you are on all the relevant announcement lists so you hear about new problems and new releases of everything important in your tool chain.
Compartmentalize
In these days of complex frameworks and multiple levels of abstraction, direct SQL access is often hard or impossible to do. Which can be a very good thing, as this is often a good protection against SQL injection. Keep in mind however that there are always other ways to reach your database, such as the boss's daughter or son whipping up a quick PHP script so they can run some reports from home against the production database.
Use the least privileges possible
Make sure you are taking full advantage of roles and users in your database. This means an application should have the bare minimum rights it needs to do its job. No creating of functions, no creating tables, and explicit GRANTs to the tables/views/functions it truly needs. Limit severely what runs as a superuser. If something really needs to run as a superuser, consider wrapping the data/logic in a SECURITY DEFINER function. Having separate "readonly" and "readwrite" versions of each application's user is a great idea as well, and may even help you to scale by being able to send your readonly user to a different database (via hot standby or a Bucardo/Slony slave), or even send them to different PgBouncer ports with different pooling methods.
Access can be further limited by the use of views, which can limit which columns and rows are visible to a user, or you could even limit all application user access to going through stored procedures.
URLs are public
Never assume an application, URL, or API will remain internal. It will end up accessible to an attacker someday, somehow. Treat everything with the same paranoid care and always use placeholders.
Forensics
So you've just closed a SQL injection hole, and carefully audited your code to ensure no other holes exist. Now what? Forensics! Which means, a careful examination after the crime. In this case, we want to see what damage the intruder managed to cause.
The first thing to do is figure out what potential harm there is. You can do this by assuming the worst case scenario. What database user was used in the attack, and what rights does it have? Could tables have been updated? Data deleted? Were tables dropped or views altered? This may be a good time to run something like same_schema in historical mode to find out the answer to that last question.
Now comes the hard part: seeing what was changed. If you do not have log_statement='all' set in your postgresql.conf (as I will once again highly recommend you should have), finding what has changed becomes a very, very difficult task. Your best bet at this point is to go to your backups and start comparing things, and perhaps running some sanity checks on your data (e.g. unusually low prices on things you sell, new mega-useful coupons, users with elevated rights). If you know about when the attack started, you could, in theory, look on disk to see which relations may have been altered to narrow the list of changed data. You will also have to assume that the attacker captured all the possible data the database user was allowed to see.
Enough about the worst case scenario above - what about those of us with log_statement='all'? Well, now we go through the logs to see what exact SQL was injected, and what commands have been successfully run. At this point, you should know what the SQL involved in the attack looks like, and more to the point, where in your code it came from. Now it's a matter of filtering out the good stuff from the bad. Luckily, this is a pretty easy task.
What you will need to do is write a quick script to parse your logs, find the type of query that had the hole, and determine the "bad" ones. Then you can look closer and have it report exactly what commands the attacker ran.
Most SQL injection results in a string of additional SQL in place of where a single value should be, possibly with added quotes. So, for example, if someone forgot to escape an OFFSET at the end of the query, your program could simply look for any variations of the query that ended in something other than OFFSET \d+$. If the unescaped value was in the middle of the query, I find that a simple but reliable test is to look for whitespace or a '*' character. This assumes that whitespace or '*' would not normally appear for that value, but as long as it's not common, it should still work. (The '*' is needed because one can use SQL comments as a means of whitespace, for example SELECT/**/*/*foo*/FROM/**/pg_tables). Your script should ignore any queries in which the value has no whitespace or '*' character, and focus on the ones that do. Then normalize the queries (for example collapse ones that differ only by the OFFSET value), and generate a report. Of course, the exact method to differentiate between "good" and "bad" queries will vary. Find your best Perl hacker and set them on it.
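As a rough illustration of such a script (in Python rather than Perl), the sketch below assumes the hole was an unescaped order_id in the middle of a query shaped like the earlier example. The VULNERABLE pattern, the orders table name, and the sample statements are all hypothetical; a real script would be built around your own query and log format.

```python
import re
from collections import Counter

# Assumed shape of the vulnerable query; the "value" group is whatever
# ended up where a plain order number should have been.
VULNERABLE = re.compile(
    r"SELECT \* FROM orders WHERE order_id = (?P<value>.*?) AND order_owner = '",
    re.IGNORECASE | re.DOTALL,
)


def injected_value(statement):
    """Return the injected text, or None if the value looks legitimate."""
    match = VULNERABLE.search(statement)
    if not match:
        return None
    value = match.group("value")
    # Whitespace or '*' should not appear in a legitimate order number.
    return value if re.search(r"[\s*]", value) else None


def normalize(value):
    """Collapse injections that differ only by their OFFSET number."""
    return re.sub(r"(OFFSET\s+)\d+", r"\g<1>N", value, flags=re.IGNORECASE)


def report(statements):
    """Count each distinct (normalized) injected fragment."""
    counts = Counter()
    for statement in statements:
        value = injected_value(statement)
        if value is not None:
            counts[normalize(value)] += 1
    return counts


prefix = "SELECT * FROM orders WHERE order_id = "
suffix = " AND order_owner = 'alice'"
sample = [
    prefix + "12345" + suffix,
    prefix + "1 UNION SELECT relname FROM pg_class LIMIT 1 OFFSET 1 --" + suffix,
    prefix + "1 UNION SELECT relname FROM pg_class LIMIT 1 OFFSET 2 --" + suffix,
    prefix + "1/**/OR/**/1=1" + suffix,
]
for fragment, count in report(sample).items():
    print(count, fragment)
```

The legitimate first statement is ignored, the two OFFSET variants collapse into one entry, and the comment-as-whitespace trick is caught by the '*' check.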
I should point out that a script is almost always necessary, for three reasons. First, manually reading logs is a time-wasting and error-prone bore. Second, log_statement='all' leads to some really, really big logs. Third, SQL injection attacks usually involve some sort of scripted attack, which can mean a lot of entries. For example, a client recently had over 8000 lines from a SQL injection attack spread out over 20 GB of log files. (This one had a happy ending: the attacker was not very competent and the database user was fairly locked down, so no damage was done.)
So remember: SQL injection can happen to you. Make sure you are able to detect it, recognize it, fix it, and inspect the damage!
Monitoring many Postgres files at once with tail_n_mail
This post discusses version 1.25.0 of tail_n_mail, which can be downloaded at http://bucardo.org/wiki/Tail_n_mail
One of our clients recently had one of their Postgres servers crash. In technical terms, it issued a PANIC because it tried to commit a transaction that had already been committed. We are using tail_n_mail for this client, and while we got notified six ways to Sunday about the server being down (from Nagios, tail_n_mail, and other systems), I was curious as to why the actual PANIC had not gotten picked up by tail_n_mail and mailed out to us.
The tail_n_mail program at its simplest is a Perl script that greps through log files, finds items of interest, and mails them out. It does quite a bit more than that, of course, including normalizing SQL, figuring out which log files to scan, and analyzing the data on the fly. This particular client of ours consolidates all of their logs to some central logging boxes via rsyslog. For the host in question that issued the PANIC, we had two tail_n_mail config files that looked like this:
## Config file for the tail_n_mail program
## This file is automatically updated
## Last updated: Fri Apr 27 18:00:01 2012
MAILSUBJECT: Groucho fatals: NUMBER
INHERIT: tail_n_mail.fatals.global.txt
FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-err.log
LASTFILE: /var/log/2012/groucho/04/27/18/pgsql-err.log
OFFSET: 10199

## Config file for the tail_n_mail program
## This file is automatically updated
## Last updated: Fri Apr 27 18:00:01 2012
MAILSUBJECT: Groucho fatals: NUMBER
INHERIT: tail_n_mail.fatals.global.txt
FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-warning.log
LASTFILE: /var/log/2012/groucho/04/27/18/pgsql-warning.log
OFFSET: 7145
The reason for two files was that rsyslog was splitting the incoming Postgres logs into multiple files. Which is normally a very handy thing, because the main file, pgsql-info.log, is quite large, and it's nice to have the mundane things filtered out for us already. Because rsyslog also splits things based on the timestamp, we don't give it an exact file name, but use a POSIX template instead, e.g. /var/log/apps/%Y/groucho/%m/%d/%H/pgsql-warning.log. By doing this, tail_n_mail knows where to find the latest file. It also uses the LASTFILE and OFFSET to know exactly where it stopped last time, and then walks through all files from LASTFILE until the current one.
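To make the template idea concrete, here is a small Python sketch of expanding such a path and walking forward from the last file seen. It is not how tail_n_mail itself is implemented (that is a Perl script); files_to_scan is a hypothetical helper, and the one-file-per-hour stepping is an assumption based on the %H in the template.

```python
from datetime import datetime, timedelta

TEMPLATE = "/var/log/%Y/groucho/%m/%d/%H/pgsql-warning.log"


def files_to_scan(template, last_seen, now):
    """List every hourly file from the hour of last_seen up to the hour of now."""
    hour = last_seen.replace(minute=0, second=0, microsecond=0)
    paths = []
    while hour <= now:
        # strftime fills in %Y, %m, %d and %H and leaves the rest of the path alone
        paths.append(hour.strftime(template))
        hour += timedelta(hours=1)
    return paths


for path in files_to_scan(TEMPLATE,
                          datetime(2012, 4, 27, 18, 0, 1),
                          datetime(2012, 4, 27, 20, 15)):
    print(path)
```

The first file in the list corresponds to LASTFILE, where reading would resume at OFFSET; the later files would be read from the start.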
So why did we miss the PANIC? Because it was in a heretofore unseen and untracked log file known as pgsql-crit.log. (Which goes to show how rarely Postgres crashes: this was the first time in well over 700,000 log files generated that a PANIC had occurred!) At this point, the solution was to either create yet another set of config files for each host to watch for and parse any pgsql-crit.log files, or to give tail_n_mail some more brains and allow it to handle multiple FILE entries in a single config file. Obviously, I chose the latter.
After some period of coding, testing, debugging, and caffeine consumption, a new tail_n_mail was ready. This one (version 1.25.0) allows multiple values of the FILE parameter inside of a single config. Thus, for the above, I was able to combine everything into a single tail_n_mail config file like so:
MAILSUBJECT: Groucho fatals: NUMBER
INHERIT: tail_n_mail.fatals.global.txt
FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-warning.log
FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-err.log
FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-crit.log
The INHERIT file is a way of keeping common config items in a single file: in this case, groucho and a bunch of other similar hosts all use it. It contains the rules on what tail_n_mail should care about, and looks similar to this:
## Global behavior for all "fatals" configs
EMAIL: acme-alerts@endpoint.com
FROM: postgres@endpoint.com
FIND_LINE_NUMBER: 0
STATEMENT_SIZE: 3000
INCLUDE: FATAL:
INCLUDE: PANIC:
INCLUDE: ERROR:
## Client specific exceptions:
EXCLUDE: ERROR: Anvils cannot be delivered via USPS
EXCLUDE: ERROR: Jetpack fuel quantity missing
EXCLUDE: ERROR: Iron Carrots and Giant Magnets must go to different addresses
EXCLUDE: ERROR: Rocket Powered Rollerskates no longer available
## Postgres exceptions:
EXCLUDE: ERROR: aggregates not allowed in WHERE clause
EXCLUDE: ERROR: negative substring length not allowed
EXCLUDE: ERROR: there is no escaped character
EXCLUDE: ERROR: operator is not unique
EXCLUDE: ERROR: cannot insert multiple commands into a prepared statement
EXCLUDE: ERROR: value "\d+" is out of range for type integer
EXCLUDE: ERROR: could not serialize access due to concurrent update
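The INCLUDE and EXCLUDE rules can be thought of as two lists of regular expressions: a line is reported if it matches at least one INCLUDE and no EXCLUDE. The Python sketch below shows that idea only; the exact matching rules inside tail_n_mail may differ, and should_mail and the sample log lines are made up for illustration.

```python
import re

# A subset of the rules from the shared config, treated as regular expressions.
INCLUDE = [r"FATAL:", r"PANIC:", r"ERROR:"]
EXCLUDE = [
    r"ERROR: aggregates not allowed in WHERE clause",
    r'ERROR: value "\d+" is out of range for type integer',
    r"ERROR: could not serialize access due to concurrent update",
]


def should_mail(line):
    """Report a line if it matches any INCLUDE rule and no EXCLUDE rule."""
    if not any(re.search(pattern, line) for pattern in INCLUDE):
        return False
    return not any(re.search(pattern, line) for pattern in EXCLUDE)


# Hypothetical log lines, for illustration only
print(should_mail("ERROR: division by zero"))
print(should_mail('ERROR: value "99999999999" is out of range for type integer'))
print(should_mail("LOG: statement: SELECT 1"))
```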
Thus, we only have one file per host to worry about, in addition to a common shared file across all hosts. So now tail_n_mail can handle multiple files over a time dimension (by walking forward from LASTFILE to the present), as well as over a vertical dimension (by forcing together the files split by rsyslog). However, there is no reason we cannot handle multiple files over a horizontal dimension as well. In other words, putting multiple hosts into a single file. In this client's case, there were other hosts very similar to "groucho" that had files we wanted to monitor. Thus, the config file was changed to look like this:
MAILSUBJECT: Acme fatals: NUMBER
INHERIT: tail_n_mail.fatals.global.txt
FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-warning.log
FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-err.log
FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-crit.log
FILE: /var/log/%Y/dawson/%m/%d/%H/pgsql-warning.log
FILE: /var/log/%Y/dawson/%m/%d/%H/pgsql-err.log
FILE: /var/log/%Y/dawson/%m/%d/%H/pgsql-crit.log
FILE: /var/log/%Y/cosby/%m/%d/%H/pgsql-warning.log
FILE: /var/log/%Y/cosby/%m/%d/%H/pgsql-err.log
FILE: /var/log/%Y/cosby/%m/%d/%H/pgsql-crit.log
We've just whittled nine config files down to a single one. Of course, the config file cannot stay like that, as the LASTFILE and OFFSET entries need to be applied to specific files. Thus, when tail_n_mail does its first rewrite of the config file, it will assign numbers to each FILE, and the file will then look something like this:
FILE1: /var/log/%Y/groucho/%m/%d/%H/pgsql-warning.log
LASTFILE1: /var/log/2012/groucho/04/27/18/pgsql-warning.log
OFFSET1: 100
FILE2: /var/log/%Y/groucho/%m/%d/%H/pgsql-err.log
LASTFILE2: /var/log/2012/groucho/04/27/18/pgsql-err.log
OFFSET2: 2531
FILE3: /var/log/%Y/groucho/%m/%d/%H/pgsql-crit.log
FILE4: /var/log/%Y/dawson/%m/%d/%H/pgsql-warning.log
LASTFILE4: /var/log/2012/dawson/04/27/18/pgsql-warning.log
OFFSET4: 42
# etc.
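The numbering step itself can be pictured roughly as follows. This is only a Python sketch of the idea, not tail_n_mail's actual Perl code, and number_files is a hypothetical helper; it numbers unnumbered FILE entries in order and leaves every other line untouched.

```python
def number_files(lines):
    """Rewrite each plain 'FILE:' entry as 'FILE1:', 'FILE2:', and so on."""
    numbered = []
    counter = 0
    for line in lines:
        if line.startswith("FILE:"):
            counter += 1
            numbered.append(f"FILE{counter}:" + line[len("FILE:"):])
        else:
            numbered.append(line)
    return numbered


config = [
    "MAILSUBJECT: Acme fatals: NUMBER",
    "INHERIT: tail_n_mail.fatals.global.txt",
    "FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-warning.log",
    "FILE: /var/log/%Y/groucho/%m/%d/%H/pgsql-err.log",
]
for line in number_files(config):
    print(line)
```

Once each FILE has a number, matching LASTFILE and OFFSET entries can carry the same number, which is what ties the bookkeeping to a specific file.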
By using this technique, we were able to reduce a slew of config files (the actual number was around 60), and their crontab entries, into a single config file and a single cron call. We also have a daily "error" report that mails a summary of all ERROR/FATAL calls in the last 24 hours. These were consolidated into a single email, rather than the half dozen that appeared before.
While tail_n_mail has a lot of built-in intelligence to handle Postgres logs, it is ultimately regex-based and can be used on any files you want to track, sending alerts when certain items appear inside them, so feel free to use it for more than just Postgres!
A Little Less of the Middle
I've been meaning to exercise a bit more. You know, just to keep the mid section nice and trim. But getting into that habit doesn't seem to be so easy. Trimming middleware from an app, that's something that can catch my attention.
Something that caught my eye recently is a couple of recent commits to Postgres 9.2 that add a JSON data type. Or more specifically, the second commit, which adds a couple of handy output functions: array_to_json() and row_to_json(). If you want to try it out on 9.1, those have been made available as a backported extension.
Lately I've been doing a bit of work with jQuery, using it for AJAX-y stuff but passing JSON around instead. (AJAJ?) And traditionally that involves something in between the database and the client rewriting rows from one format to another. Not that it's all that difficult; for example, in Python it's a simple module call:
jsonresult = json.dumps(cursor.fetchall())
... assuming I don't have any columns needing processing:
TypeError: datetime.datetime(2012, 3, 09, 18, 34, 20, 730250, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)) is not JSON serializable
Similarly in PHP I can stitch together a JSON array to pass back to the front end:
$rows = array();
while ($row = pg_fetch_assoc($rs))
{
    $rows[] = $row;
}
$jsonresult = json_encode($rows);
Now I can trim that out, and embed the encoding right into the database query:
SELECT row_to_json(pages) FROM pages WHERE page_id = 5;
-- or, to return an array of rows
SELECT array_to_json(array_agg(pages)) FROM pages WHERE page_title LIKE 'A Little Less%';
Notice the use of the row-type reference to the table itself after the SELECT, rather than just a single column. This outputs:
[{"page_id":105,"today":"π day","page_title":"A Little Less of the Middle","contents":"I've been meaning to exercise a bit more. You...","published_on":"2012-03-15 03:30:00+00"}]
Compare that to the output from json_encode() above, where the database driver treated everything as a string, even the page_id integer. The other difference is the Postgres code doesn't do any quoting on Unicode characters:
[{"page_id":"105","today":"\u03c0 day","page_title":"A Little Less of the Middle","contents":"I've been meaning to exercise a bit more. You...","published_on":"2012-03-15 03:30:00+00"}]
I'm a bit on the fence about whether it's a real replacement for doing it in middleware, especially in some web use cases where you typically want to do things like anti-XSS type processing on some fields before sending them off to a browser somewhere. Besides, at the moment at least, there's no built-in way to break JSON back apart in the database. But I'm sure there are some places where getting JSON directly is helpful, and it's certainly an interesting start.
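For comparison, when the encoding does stay in Python middleware, the TypeError shown earlier for timestamp columns can be avoided by giving json.dumps a fallback encoder through its default parameter. A minimal sketch, with a made-up row standing in for cursor.fetchall() and encode_extra as a hypothetical helper name:

```python
import json
from datetime import datetime, timezone

# Stand-in for rows fetched from the database (illustrative data only)
rows = [
    {
        "page_id": 105,
        "published_on": datetime(2012, 3, 15, 3, 30, tzinfo=timezone.utc),
    }
]


def encode_extra(value):
    """Fallback for types json.dumps cannot serialize on its own."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"{type(value).__name__} is not JSON serializable")


jsonresult = json.dumps(rows, default=encode_extra)
print(jsonresult)
```

Note that the timestamp comes out in ISO 8601 form here, which is not quite the same text format that Postgres produces in the output above.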
The Mystery of The Zombie Postgres Row
Being a PostgreSQL DBA is always full of new challenges and mysteries. Tracking them down is one of the best parts of the job. Presented below is an error message we received one day via tail_n_mail from one of our client's production servers. See if you can figure out what was going on as I walk through it. This is from a "read only" database that acts as a Bucardo target (aka slave), and as such, the only write activity should be from Bucardo.
05:46:11 [85]: ERROR: duplicate key value violates unique constraint "foobar_id"
05:46:11 [85]: CONTEXT: COPY foobar, line 1: "12345#011...
Okay, so there was a unique violation during a COPY. Seems harmless enough. However, this should never happen, as Bucardo always deletes the rows it is about to add in with the COPY command. Sure enough, going to the logs showed the delete right above it:
05:45:51 [85]: LOG: statement: DELETE FROM public.foobar WHERE id IN (12345)
05:46:11 [85]: ERROR: duplicate key value violates unique constraint "foobar_id"
05:46:11 [85]: CONTEXT: COPY foobar, line 1: "12345#011...
How weird. Although we killed the row, it seems to have resurrected, and shambled like a zombie into our b-tree index, preventing a new row from being added. At this point, I double checked that the correct schema was being used (it was), that there were no rules or triggers, no quoting problems, no index corruption, and that "id" was indeed the first column in the table. I also confirmed that there were plenty of occurrences of the exact same DELETE/COPY pattern - with the same id! - that had run without any error at all, both before and after this error.
If you are familiar with Postgres' default MVCC mode, you might make a guess what is going on. Inside the postgresql.conf file there is a setting named 'default_transaction_isolation', which is almost always set to read committed. Further discussion of what this mode does can be found in the online documentation, but the short version is that while in this mode, another transaction could have added row 12345 and committed after we did the DELETE, but before we ran the COPY. A great theory that fits the facts, except that Bucardo always sets the isolation level manually to avoid just such problems. Scanning back for the previous command for that PID revealed:
05:45:51 [85]: LOG: statement: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE
05:45:51 [85]: LOG: statement: DELETE FROM public.foobar WHERE id IN (12345)
05:46:11 [85]: ERROR: duplicate key value violates unique constraint "foobar_id"
05:46:11 [85]: CONTEXT: COPY foobar, line 1: "12345#011...
So that rules out any effects of read committed isolation mode. We have Postgres set to the strictest interpretation of MVCC it knows, SERIALIZABLE. (As this was on Postgres 8.3, it was not a "true" serializable mode, but that does not matter here.) What else could be going on? If you look at the timestamps, you will note that there is actually quite a large gap between the DELETE and the COPY error, despite it simply deleting and adding a single row (I have changed the table and data names, but it was actually a single row). So something else must be happening to that table.
Anyone guess what the problem is yet? After all, "when you have eliminated the impossible, whatever remains, however improbable, must be the truth". In this case, the truth must be that Postgres' MVCC was not working, and the database was not as ACID as advertised. Postgres does use MVCC, but has two (that I know of) exceptions: the system tables, and the TRUNCATE command. I knew in this case nothing was directly manipulating the system tables, so that only left truncate. Sure enough, grepping through the logs found that something had truncated the table right around the same time, and then added a bunch of rows back in. As truncate is *not* MVCC-safe, this explains our mystery completely. It's a bit of a race condition, to be sure, but it can and does happen. Here's some more logs showing the complete sequence of events for two separate processes, which I have labeled A and B:
A 05:45:47 [44]: LOG: statement: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE
A 05:45:47 [44]: LOG: statement: TRUNCATE TABLE public.foobar
A 05:45:47 [44]: LOG: statement: COPY public.foobar FROM STDIN
B 05:45:51 [85]: LOG: statement: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE
B 05:45:51 [85]: LOG: statement: DELETE FROM public.foobar WHERE id IN (12345)
A 05:46:11 [44]: LOG: duration: 24039.243 ms
A 05:46:11 [44]: LOG: statement: commit
B 05:46:11 [85]: LOG: duration: 19884.284 ms
B 05:46:11 [85]: LOG: statement: COPY public.foobar FROM STDIN
B 05:46:11 [85]: ERROR: duplicate key value violates unique constraint "foobar_id"
B 05:46:11 [85]: CONTEXT: COPY foobar, line 1: "12345#011...
So despite transaction B doing the correct thing, it still got tripped up by transaction A, which did a truncate, added some rows back in (including row 12345), and committed. If process A had done a DELETE instead of a TRUNCATE, the COPY still would have failed, but with a better error message:
ERROR: could not serialize access due to concurrent update
Why does this truncate problem happen? Truncate, while extraordinarily handy, can be real tricky to implement properly in MVCC without some severe tradeoffs. A DELETE in Postgres actually leaves the row on disk, but changes its visibility information. Only after all other transactions that may need to access the old row have ended can the row truly be removed on disk (usually via the autovacuum daemon). Truncate, however, does not walk through all the rows and add visibility information: as the name implies, it truncates the table by removing all rows, period.
So when we did the truncate, process A was able to add row 12345 back in: it had no idea that the row was "in use" by transaction B. Similarly, B had no idea that something had added the row back in. No idea, that is, until it tried to add the row and the unique index prevented it! There appears to be some work on making truncate more MVCC friendly in future versions.
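The difference between DELETE and TRUNCATE described above can be mimicked with a deliberately tiny model. The Python sketch below is an illustration only: the class names, the `xmin`/`xmax` fields as plain integers, and the idea of a snapshot as a set of committed transaction ids are simplifying assumptions, not how Postgres actually stores or checks visibility.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class RowVersion:
    value: int
    xmin: int                   # transaction that created this version
    xmax: Optional[int] = None  # transaction that deleted it, if any


class ToyTable:
    """Toy heap: a list of row versions. Not how Postgres stores data."""

    def __init__(self) -> None:
        self.versions: List[RowVersion] = []

    def insert(self, xid: int, value: int) -> None:
        self.versions.append(RowVersion(value, xmin=xid))

    def delete_all(self, xid: int) -> None:
        # DELETE leaves each row in place and only stamps it with the deleter.
        for version in self.versions:
            if version.xmax is None:
                version.xmax = xid

    def truncate(self) -> None:
        # TRUNCATE discards every version, along with its visibility stamps.
        self.versions = []

    def visible_to(self, snapshot: Set[int]) -> List[int]:
        """Values seen by a transaction whose snapshot treats these xids as committed."""
        return [v.value for v in self.versions
                if v.xmin in snapshot and (v.xmax is None or v.xmax not in snapshot)]


def what_b_sees(a_uses_truncate: bool) -> List[int]:
    table = ToyTable()
    table.insert(xid=1, value=42)   # committed before A or B start
    snapshot_b = {1}                # B's snapshot predates A's commit
    if a_uses_truncate:
        table.truncate()
    else:
        table.delete_all(xid=2)
    table.insert(xid=2, value=42)   # A (xid 2) puts the row back and commits
    return table.visible_to(snapshot_b)


print("A used DELETE:  ", what_b_sees(a_uses_truncate=False))
print("A used TRUNCATE:", what_b_sees(a_uses_truncate=True))
```

In the DELETE case, B still finds the old version of the row, stamped by A, which is the kind of evidence a real database can use to report a serialization conflict. In the TRUNCATE case, B sees an empty table even though A's new row 42 is already sitting in the unique index.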
Here is a sample script demonstrating the problem:
#!perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw(sleep); ## import sleep so we can reliably sleep less than one second
## Connect and create a test table, populate it:
my $dbh = DBI->connect('dbi:Pg', 'postgres', '', {AutoCommit=>0});
$dbh->do('DROP TABLE IF EXISTS foobar'); ## a plain DROP would abort the transaction on a first run
$dbh->do('CREATE TABLE foobar(a INT UNIQUE)');
$dbh->do('INSERT INTO foobar VALUES (42)');
$dbh->commit();
$dbh->disconnect();
## Fork, then have one process truncate, and the other delete+insert
if (fork) {
    my $dbhA = DBI->connect('dbi:Pg', 'postgres', '', {AutoCommit=>0});
    $dbhA->do('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
    $dbhA->do('TRUNCATE TABLE foobar'); ## 1
    sleep 0.3; ## Wait for B to delete
    $dbhA->do('INSERT INTO foobar VALUES (42)'); ## 2
    $dbhA->commit(); ## 2
}
else {
    my $dbhB = DBI->connect('dbi:Pg', 'postgres', '', {AutoCommit=>0});
    $dbhB->do('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
    sleep 0.3; ## Wait for A to truncate
    $dbhB->do('DELETE FROM foobar'); ## 3
    $dbhB->do('INSERT INTO foobar VALUES (42)'); ## 3
}
Running the above gives us:
ERROR: duplicate key value violates unique constraint "foobar_a_key"
DETAIL: Key (a)=(42) already exists
This should not happen, of course, as process B did a delete of the entire table before trying an INSERT, and was in SERIALIZABLE mode. If we switch out the TRUNCATE with a DELETE, we get a completely different (and arguably better) error message:
ERROR: could not serialize access due to concurrent update
However, if we try it with a DELETE on PostgreSQL version 9.1 or later, which features a brand new true serializable mode, we see yet another error message:
ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during write.
HINT: The transaction might succeed if retried
This doesn't really give us a whole lot more information, and the "detail" line is fairly arcane, but it does give a pretty nice "hint", because in this particular case, the transaction *would* succeed if it were tried again. More specifically, B would DELETE the new row added by process A, and then safely add the row back in without running into any unique violations.
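Acting on that hint usually means wrapping the whole transaction in a retry loop. Here is a minimal Python sketch of the idea. The `SerializationFailure` class is a hypothetical stand-in for whatever exception your database driver raises for SQLSTATE 40001 (serialization_failure), and `run_with_retry` is an illustrative helper, not part of any library.

```python
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""
    sqlstate = "40001"


def run_with_retry(txn, attempts=3, delay=0.0):
    """Call txn() until it succeeds, retrying only on serialization failures.

    txn must redo the entire transaction (BEGIN through COMMIT) on each call,
    because the failed attempt has already been rolled back by the server.
    """
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


# Simulated transaction: fails once, then succeeds, like process B above.
calls = []


def flaky_transaction():
    calls.append(1)
    if len(calls) == 1:
        raise SerializationFailure("could not serialize access")
    return "committed"


result = run_with_retry(flaky_transaction)
print(result, "after", len(calls), "attempts")
```

Only serialization failures are retried here; any other error (such as the duplicate key violation in the TRUNCATE case) is allowed to propagate, since blindly retrying those could hide a real problem.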
So the morals of the mystery are to be very careful when using truncate, and to realize that everything has exceptions, even the supposed sacred visibility walls of MVCC in Postgres.
Tracking down PostgreSQL XYZ error: tablespace, database, and relfilenode
One of our Postgres clients recently had this error show up in their logs:
ERROR: could not read block 3 of relation 1663/18421/31582: read only 0 of 8192 bytes

Because we were using the tail_n_mail program, the above error was actually mailed to us within a minute of it occurring. The message is fairly cryptic, but it basically means that Postgres could not read data from a physical file that represented a table or index. This is generally caused by corruption or a missing file. In this case, the "read only 0 of 8192" indicates this was most likely a missing file.
When presented with an error like this, it's nice to be able to figure out which relation the message is referring to. The word "relation" is Postgres database-speak for a generic object in the database: in this case, it is almost certainly going to be a table or an index. Both of those are, of course, represented by actual files on disk, usually inside of your data_directory. The number given, 1663/18421/31582, is in the standard X/Y/Z format Postgres uses to identify a file, where X represents the tablespace, Y is the database, and Z is the file.
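If you collect these errors from logs, pulling the three numbers apart can be automated. The following Python sketch assumes the message is worded exactly like the one quoted above; other Postgres versions may phrase it differently, and the `RelationPath` and `parse_relation` names are made up for this illustration.

```python
import re
from typing import NamedTuple, Optional


class RelationPath(NamedTuple):
    tablespace: int   # X: OID in pg_tablespace
    database: int     # Y: OID in pg_database
    relfilenode: int  # Z: name of the file on disk


# Matches the "relation X/Y/Z" portion of the error message.
_PATTERN = re.compile(r"relation (\d+)/(\d+)/(\d+)")


def parse_relation(message: str) -> Optional[RelationPath]:
    """Return the X/Y/Z triple from an error message, or None if absent."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return RelationPath(*(int(group) for group in match.groups()))


msg = ("ERROR: could not read block 3 of relation 1663/18421/31582: "
       "read only 0 of 8192 bytes")
rel = parse_relation(msg)
print(rel)
```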
The first number, X, indicates which tablespace this relation belongs to. Tablespaces are physical directories mapped to internal names in the database. Their primary use is to allow you to put tables or indexes on different physical disks. The number here, 1663, is a very familiar one, as it almost always indicates the default tablespace, known as pg_default. If you do not create any additional tablespaces, everything will end up here. On disk, this will be the directory named base underneath your data_directory.
What if the relation you are tracking is not inside of the default tablespace? The number X represents the OID inside the pg_tablespace system table, which will let you know where the tablespace is physically located. To illustrate, let's create a new tablespace and then view the contents of the pg_tablespace table:
$ mkdir /tmp/pgtest
$ psql -c "CREATE TABLESPACE ttest LOCATION '/tmp/pgtest'"
CREATE TABLESPACE
$ psql -c 'select oid, * from pg_tablespace'
  oid  |  spcname   | spcowner | spclocation | spcacl | spcoptions
-------+------------+----------+-------------+--------+------------
  1663 | pg_default |       10 |             |        |
  1664 | pg_global  |       10 |             |        |
 78289 | ttest      |       10 | /tmp/pgtest |        |
Thus, if X were 78289, it would lead us to the tablespace ttest, and we would know that the file we were ultimately looking for will be in the directory indicated by the spclocation column, /tmp/pgtest. If that column is blank, it means the directory to use is data_directory/base.
The second number in our X/Y/Z series, Y, indicates which database the relation belongs to. You can look this information up by querying the pg_database system table like so:
$ psql -xc 'select oid, * from pg_database where oid = 18421'
-[ RECORD 1 ]-+-----------
oid           | 18421
datname       | foobar
datdba        | 10
encoding      | 6
datcollate    | en_US.utf8
datctype      | en_US.utf8
datistemplate | f
datallowconn  | t
datconnlimit  | -1
datlastsysoid | 12795
datfrozenxid  | 1792
dattablespace | 1663
datacl        |
The columns may look different depending on your version of Postgres - the important thing here is that the number Y maps to a database via the oid column - in this case the database foobar. We need to know which database so we can query the correct pg_class table in the next step. We did not have to worry about that until now, as the pg_tablespace and pg_database tables are two of the very few shared system catalogs.
The final number in our X/Y/Z series, Z, represents a file on disk. You can look up which relation it is by querying the pg_class system table of the correct database:
$ psql -d foobar -c "select relname,relkind from pg_class where relfilenode=31582"
 relname | relkind
---------+---------
(0 rows)
No rows, so as far as Postgres is concerned that file does not exist! Let's verify that this is the case by looking on the disk. Recall that X was the default tablespace, which means we start in data_directory/base. Once we are in that directory, we can look for the subdirectory holding the database we want (Y or 18421) - it is named after the OID of the database. We can then look for our relfilenode (Z or 31582) inside of that directory:
$ psql -c 'show data_directory'
data_directory
---------------------------------
/var/lib/pgsql/data
(1 row)
$ cd /var/lib/pgsql/data
/var/lib/pgsql/data $ cd base
/var/lib/pgsql/data/base $ cd 18421
/var/lib/pgsql/data/base/18421 $ stat 31582
stat: cannot stat `31582': No such file or directory
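The same walk through the directory tree can be scripted. The Python sketch below handles only the default tablespace (OID 1663, which the article maps to the base directory); the function name is invented for this example, and the demo builds a throwaway directory instead of touching a real data_directory.

```python
import os
import tempfile

PG_DEFAULT_OID = 1663  # pg_default, which lives in data_directory/base


def relfilenode_path(data_directory, tablespace, database, relfilenode):
    """Build data_directory/base/Y/Z for a relation in the default tablespace."""
    if tablespace != PG_DEFAULT_OID:
        # Other tablespaces live elsewhere; this sketch does not cover them.
        raise ValueError("only the default tablespace is handled in this sketch")
    return os.path.join(data_directory, "base", str(database), str(relfilenode))


def demo():
    """Fake a tiny data directory and check for a missing and a present file."""
    with tempfile.TemporaryDirectory() as data_directory:
        db_dir = os.path.join(data_directory, "base", "18421")
        os.makedirs(db_dir)
        # Create one relfilenode that exists, to contrast with the missing one.
        open(os.path.join(db_dir, "31583"), "w").close()
        missing = os.path.exists(
            relfilenode_path(data_directory, 1663, 18421, 31582))
        present = os.path.exists(
            relfilenode_path(data_directory, 1663, 18421, 31583))
        return missing, present


print(demo())
```

Against a real server you would pass the value reported by `show data_directory` as the first argument, and run the check as a user allowed to read that directory.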
So in this case, we confirmed that the relfilenode was no longer there! If it had been there, we could probably surmise that the file on disk was corrupted somehow. If the relation were an index, the solution would be to simply run a REINDEX INDEX indexname on it, which recreates the entire index with a new relfilenode. If it were a table, things would get trickier: we could try a VACUUM FULL on it, which rewrites the entire table, but you would most likely need to go back to your last SQL backup or take a look at your PITR (Point-In-Time Recovery) server.
So why would a relfilenode file not exist on disk? There are a few possibilities:
→ We are looking in the wrong pg_class table (i.e. user error). Each database has its own copy of the pg_class, with different relfilenodes. This means that each subdirectory corresponding to the database has its own set of files as well.
→ It may be a bug in Postgres. Unlikely, unless we have exhausted the other possibilities.
→ Bad RAM or a bad disk may have caused a flipped bit somewhere, for example changing the relfilenode from 12345 to 12340. Possible, but still unlikely.
→ The relfilenode file was removed by something. This is the most likely explanation. We've already hinted above at one way this could happen: a REINDEX. Since the client in this story was (is!) prudently running with log_statement = 'all', I was able to grep back through the logs and found that a REINDEX of a few system tables, including pg_depend, was kicked off a second before the error popped up. While it's impossible to know exactly what the missing relfilenode referred to, the REINDEX is as close to a smoking gun as we are going to get. So the query started, a REINDEX removed one of the indexes it was using, and then the error occurred as Postgres tried to access that index.
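Grepping back through the logs for a suspect statement just before an error is easy to script. The Python sketch below assumes a log line shape of "HH:MM:SS [pid]: LEVEL: text", which is only a guess borrowed from log excerpts elsewhere on this blog; your log_line_prefix will almost certainly differ. The sample lines are invented for the demonstration and are not the client's logs.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape: "10:00:04 [22]: LOG: statement: ..."
LINE = re.compile(r"^(\d\d:\d\d:\d\d) \[(\d+)\]: (LOG|ERROR): (.*)$")


def parse(line):
    """Split one log line into (time, pid, level, text), or None if it does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%H:%M:%S")
    return stamp, int(match.group(2)), match.group(3), match.group(4)


def statements_before_errors(lines, keyword, window_seconds=5):
    """Find logged statements containing keyword shortly before any ERROR line.

    Times are compared within a single day; wrapping past midnight is ignored.
    """
    entries = [entry for entry in map(parse, lines) if entry is not None]
    window = timedelta(seconds=window_seconds)
    hits = []
    for err_time, _pid, level, _text in entries:
        if level != "ERROR":
            continue
        for stamp, pid, other_level, text in entries:
            if (other_level == "LOG" and keyword in text.upper()
                    and timedelta(0) <= err_time - stamp <= window):
                hits.append((pid, text))
    return hits


# Made-up sample lines, for illustration only.
sample = [
    "10:00:00 [11]: LOG: statement: SELECT 1",
    "10:00:04 [22]: LOG: statement: REINDEX TABLE pg_depend",
    "10:00:05 [11]: ERROR: could not read block 3 of relation 1663/18421/31582: read only 0 of 8192 bytes",
]
hits = statements_before_errors(sample, "REINDEX")
print(hits)
```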
In this case, we were able to simply rerun the query and it worked as expected. In normal everyday usage, this error should not appear, even when reindexing system tables, but should something like this happen to you, at least you will know what those numbers mean. :)
Protecting and auditing your secure PostgreSQL data

PostgreSQL functions can be written in many languages. These languages fall into two categories, 'trusted' and 'untrusted'. Trusted languages cannot do things "outside of the database", such as writing to local files, opening sockets, sending email, connecting to other systems, etc. Two such languages are PL/pgSQL and PL/Perl. For "untrusted" languages, such as PL/PerlU, all bets are off, and they have no limitations placed on what they can do. Untrusted languages can be very powerful, and sometimes dangerous.
One of the reasons untrusted languages can be considered dangerous is that they can cause side effects outside of the normal transactional flow that cannot be rolled back. If your function writes to local disk, and the transaction then rolls back, the changes on disk are still there. Working around this is extremely difficult, as there is no way to detect when a transaction has rolled back at the level where you could, for example, undo your local disk changes.
However, there are times when this effect can be very useful. For example, in a recent thread on the PostgreSQL "general" mailing list (aka pgsql-general), somebody asked for a way to audit SELECT queries into a logging table that would survive someone doing a ROLLBACK. In other words, if you had a function named weapon_details() and wanted to have that function log all requests to it by inserting to a table, a user could simply run the query, read the data, and then rollback to thwart the auditing:
BEGIN;
SELECT weapon_details('BFG 9000'); -- also inserts to an audit table
ROLLBACK; -- inserts to the audit table are now gone!
Certainly there are other ways to track who is using this query, the most obvious being by enabling full Postgres logging (by setting log_statement = 'all' in your postgresql.conf file). However, extracting that information from logs is no fun, so let's find a way to make that INSERT stick, even if the surrounding transaction was rolled back.
Stepping back for one second, we can see there are actually two problems here: restricting access to the data, and logging that access somewhere. The ultimate access restriction is to simply force everyone to go through your custom interface. However, in this example, we will assume that someone has psql access and needs to be able to run ad hoc SQL queries, as well as be able to BEGIN, ROLLBACK, COMMIT, etc.
Let's assume we have a table with some Very Important Data inside of it. Further, let's establish that regular users can only see some of that data, and that we need to know who asked for what data, and when. For this example, we will create a normal user named Alice:
postgres=> CREATE USER alice;
CREATE ROLE
We need a way to tell which rows are suitable for people like Alice to view. We will set up a quick classification scheme using the nifty ENUM feature of PostgreSQL:
postgres=> CREATE TYPE classification AS ENUM (
  'unclassified',
  'restricted',
  'confidential',
  'secret',
  'top secret'
);
CREATE TYPE
Next, as a superuser, we create the table containing sensitive information, and populate it:
postgres=> CREATE TABLE weapon (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
cost TEXT NOT NULL,
security_level CLASSIFICATION NOT NULL,
description TEXT NOT NULL DEFAULT 'a fine weapon'
);
NOTICE: CREATE TABLE will create implicit sequence "weapon_id_seq" for serial column "weapon.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "weapon_pkey" for table "weapon"
CREATE TABLE
postgres=> INSERT INTO weapon (name,cost,security_level) VALUES
('Crowbar', 10, 'unclassified'),
('M9', 200, 'restricted'),
('M16A2', 300, 'restricted'),
('M4A1', 400, 'restricted'),
('FGM-148 Javelin', 700, 'confidential'),
('Pulse Rifle', 50000, 'secret'),
('Zero Point Energy Field Manipulator', 'unknown', 'top secret');
INSERT 0 7
We don't want anyone but ourselves to be able to access this table, so for safety, we make some explicit revocations. We'll examine the permissions before and after we do this:
postgres=> \dp weapon
Access privileges
Schema | Name | Type | Access privileges | Column access privileges
--------+--------+-------+-------------------+--------------------------
public | weapon | table | |
postgres=> REVOKE ALL ON TABLE weapon FROM public;
REVOKE
postgres=> \dp weapon
Access privileges
Schema | Name | Type | Access privileges | Column access privileges
--------+--------+-------+---------------------------+--------------------------
public | weapon | table | postgres=arwdDxt/postgres |
As you can see, what the REVOKE really does is remove the implicit "no permission" and grant explicit permissions to only the postgres user to view or modify the table. Let's confirm that Alice cannot do anything with that table:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon;
ERROR: permission denied for relation weapon
postgres=> UPDATE weapon SET id = id;
ERROR: permission denied for relation weapon
Alice does need to have access to parts of this table, so we will create a "wrapper function" that will query the table for us and return some results. By declaring this function as SECURITY DEFINER, it will run as if the person who created the function invoked it - in this case, the postgres user. For this example, we'll be letting Alice see the "cost and description" of exactly one item at a time. Further, we are not going to let her (or anyone else using this function) view certain items. Only those items classified as "confidential" or lower can be viewed (i.e. "confidential", "restricted", or "unclassified"). Here's the first version of our function:
postgres=> CREATE LANGUAGE plperlu;
CREATE LANGUAGE
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$
use strict;
use warnings;
## The item they are looking for
my $name = shift;
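## We will be nice and ignore the case and any leading or trailing whitespace
## (names may contain spaces, e.g. 'Pulse Rifle', so we cannot match on \S+ alone)
$name =~ s{^\s*(.*?)\s*$}{lc $1}e;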
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
CREATE FUNCTION
The above should be fairly self-explanatory. We are using PL/Perl's built-in database access functions, such as spi_prepare, to do the actual querying. Let's confirm that this works as it should for Alice:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon_details('crowbar');
name | cost | description
---------+------+---------------
Crowbar | 10 | a fine weapon
(1 row)
postgres=> SELECT * FROM weapon_details('anvil');
name | cost | description
------+------+-------------
(0 rows)
postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR: Sorry, you are not allowed to view information on that weapon!
CONTEXT: PL/Perl function "weapon_details"
Now that we have solved the restricted access problem, let's move on to the auditing. We will create a simple table to hold information about who accessed what and when:
postgres=> CREATE TABLE data_audit (
  tablename TEXT NOT NULL,
  arguments TEXT NULL,
  results INTEGER NULL,
  status TEXT NOT NULL DEFAULT 'normal',
  username TEXT NOT NULL DEFAULT session_user,
  txntime TIMESTAMPTZ NOT NULL DEFAULT now(),
  realtime TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()
);
CREATE TABLE
The 'tablename' column simply records which table they are getting data from. The 'arguments' is a free-form field describing what they were looking for. The 'results' column shows how many matching rows were found. The 'status' column will be used primarily to log unusual requests, such as the case where Alice looks for a forbidden item. The 'username' column records the name of the user doing the searching. Because we are using functions with SECURITY DEFINER set, this needs to be session_user, not current_user, as the latter will switch to 'postgres' within the function, and we want to log the real caller (e.g. 'alice'). The final two columns tell us when the current transaction started, and the exact time when an entry was made inside of this table. As a first attempt, we'll have our function do some simple inserts to this new data_audit table:
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$
use strict;
use warnings;
## The item they are looking for
my $name = shift;
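## We will be nice and ignore the case and any leading or trailing whitespace
## (names may contain spaces, e.g. 'Pulse Rifle', so we cannot match on \S+ alone)
$name =~ s{^\s*(.*?)\s*$}{lc $1}e;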
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};
## Log this request
$SQL = 'INSERT INTO data_audit(tablename,arguments,results,status)
VALUES ($1,$2,$3,$4)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
$sth = spi_prepare($SQL, 'TEXT', 'TEXT', 'INTEGER', 'TEXT');
spi_exec_prepared($sth, 'weapon', $name, $rv->{processed}, $status);
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
However, this fails the case pointed out in the original poster's email about viewing the data within a transaction that is then rolled back. It also fails to work at all when a forbidden item is requested, as that insert is rolled back by the die() call:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon_details('crowbar');
name | cost | description
---------+------+---------------
Crowbar | 10 | a fine weapon
(1 row)
postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR: Sorry, you are not allowed to view information on that weapon!
CONTEXT: PL/Perl function "weapon_details"
postgres=> BEGIN;
BEGIN
postgres=> SELECT * FROM weapon_details('m9');
name | cost | description
------+------+---------------
M9 | 200 | a fine weapon
(1 row)
postgres=> ROLLBACK;
ROLLBACK
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=> SELECT * FROM data_audit \x \g
Expanded display is on.
-[ RECORD 1 ]----------------------------
tablename | weapon
arguments | crowbar
results | 1
status | normal
username | alice
txntime | 2012-01-30 17:37:39.497491-05
realtime | 2012-01-30 17:37:39.545891-05
How do we get around this? We need a way to commit something that will survive the surrounding transaction's rollback. The closest thing Postgres has to such a thing at the moment is to connect back to the database with a new and entirely separate connection. Two popular ways to do so are the dblink module and the PL/PerlU language. Obviously, we are going to focus on the latter, but all of this could be done with dblink as well. Here are the additional steps to connect back to the database, do the insert, and then leave again:
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
VOLATILE
AS $bc$
use strict;
use warnings;
use DBI;
## The item they are looking for
my $name = shift;
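## We will be nice and ignore the case and any leading or trailing whitespace
## (names may contain spaces, e.g. 'Pulse Rifle', so we cannot match on \S+ alone)
$name =~ s{^\s*(.*?)\s*$}{lc $1}e;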
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = defined $row ? delete $row->{allowed} : 1;
## Log this request
$SQL = 'INSERT INTO data_audit(username,tablename,arguments,results,status)
VALUES (?,?,?,?,?)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
my $dbh = DBI->connect('dbi:Pg:service=auditor', '', '',
{AutoCommit=>0, RaiseError=>1, PrintError=>0});
$sth = $dbh->prepare($SQL);
my $user = spi_exec_query('SELECT session_user')->{rows}[0]{session_user};
$sth->execute($user, 'weapon', $name, $rv->{processed}, $status);
$dbh->commit();
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
CREATE FUNCTION
Note that because we are making external changes, we marked the function as VOLATILE, which ensures that it will always be run every time it is called, and not cached in any form. We are also using a Postgres service file, via the 'dbi:Pg:service=auditor' connection string. This means that the connection information (username, password, database) is contained in an external file. This is not only tidier than hard-coding those values into this function, but safer as well, as the function itself can be viewed by Alice. Finally, note that we are passing the 'username' directly into the INSERT this time, as we have a brand new connection which is no longer linked to the 'alice' user, so we have to derive it ourselves from "SELECT session_user" and then pass it along.
Once this new function is in place, and we re-run the same queries as we did before, we see three entries in our audit table:
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
Expanded display is on.
-[ RECORD 1 ]----------------------------
tablename | weapon
arguments | crowbar
results   | 1
status    | normal
username  | alice
txntime   | 2012-01-30 17:56:01.544557-05
realtime  | 2012-01-30 17:56:01.54569-05
-[ RECORD 2 ]----------------------------
tablename | weapon
arguments | pulse rifle
results   | 1
status    | forbidden
username  | alice
txntime   | 2012-01-30 17:56:01.559532-05
realtime  | 2012-01-30 17:56:01.561225-05
-[ RECORD 3 ]----------------------------
tablename | weapon
arguments | m9
results   | 1
status    | normal
username  | alice
txntime   | 2012-01-30 17:56:01.573335-05
realtime  | 2012-01-30 17:56:01.574989-05
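The trick of logging through a second connection is not specific to Postgres or Perl. As a rough, self-contained illustration, the Python sketch below uses SQLite from the standard library purely as a stand-in for Postgres (an assumption made so the example runs anywhere). The table names are borrowed from the article, the function name is invented, and write-ahead logging is switched on so that the auditing connection can write while the main connection still holds its read transaction open.

```python
import os
import sqlite3
import tempfile


def audit_rows_after_rollback(use_separate_connection):
    """Run a read inside a transaction, audit it, roll back, and count audit rows."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None puts the sqlite3 module in autocommit mode,
        # so BEGIN and ROLLBACK below are issued explicitly by us.
        main = sqlite3.connect(path, isolation_level=None)
        main.execute("PRAGMA journal_mode=WAL")  # lets a writer run beside a reader
        main.execute("CREATE TABLE weapon (name TEXT)")
        main.execute("CREATE TABLE data_audit (arguments TEXT)")
        main.execute("INSERT INTO weapon VALUES ('m9')")

        main.execute("BEGIN")
        main.execute("SELECT name FROM weapon").fetchall()  # the audited read
        if use_separate_connection:
            # Second connection: its INSERT commits on its own, immediately.
            auditor = sqlite3.connect(path, isolation_level=None)
            auditor.execute("INSERT INTO data_audit VALUES ('m9')")
            auditor.close()
        else:
            # Same connection: the INSERT belongs to the open transaction.
            main.execute("INSERT INTO data_audit VALUES ('m9')")
        main.execute("ROLLBACK")

        count = main.execute("SELECT count(*) FROM data_audit").fetchone()[0]
        main.close()
        return count


print("same connection:    ", audit_rows_after_rollback(False))
print("separate connection:", audit_rows_after_rollback(True))
```

With the same connection, the audit row disappears along with the rollback; with the separate connection, it stays, which mirrors the three surviving records above.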
So that's the basic premise of how to solve the auditing problem. For an actual production script, you would probably want to cache the database connection by sticking things inside of the special %_SHARED hash available to PL/Perl and PL/PerlU. Note that each user gets their own version of that hash, so Alice will not be able to create a function and have access to the same %_SHARED hash that the postgres user has access to. It's probably a good idea to simply not let users like Alice use the language at all. Indeed, that's the default when we do the CREATE LANGUAGE call as above:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> CREATE FUNCTION showplatform()
RETURNS TEXT
LANGUAGE plperlu
AS $bc$
return $^O;
$bc$;
ERROR: permission denied for language plperlu
Further refinements to the actual script might include refactoring the logging bits to a separate function, writing some of the auditing data to a file on the local disk, recording the actual results returned to the user, and sending the data to another Postgres server entirely. For that matter, as we are using DBI, you could send it somewhere else entirely - such as a MySQL, Oracle, or DB2 database!
Another place for improvement would be associating each user with a security_level classification, such that any user could run the function and only see things at or below their level, rather than hard-coding the level as "confidential" as we have done here. Another nice refinement might be to always return undef (no matches) for items marked "top secret", to prevent the very existence of a top secret weapon from being deduced. :)
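Both refinements boil down to comparing ordered classification levels. The following Python sketch shows the logic in isolation. The `USER_LEVELS` mapping, the "bob" user, the in-memory `WEAPONS` table, and the `Forbidden` exception are all hypothetical names made up for the illustration, and nothing here touches a database.

```python
from enum import IntEnum
from typing import Optional


class Classification(IntEnum):
    # Same ordering as the ENUM type created earlier: lowest to highest.
    UNCLASSIFIED = 0
    RESTRICTED = 1
    CONFIDENTIAL = 2
    SECRET = 3
    TOP_SECRET = 4


class Forbidden(Exception):
    """Raised when the caller's level is too low for the requested row."""


# Hypothetical stand-ins for the weapon table and a per-user level table.
WEAPONS = {
    "crowbar": ("10", Classification.UNCLASSIFIED),
    "pulse rifle": ("50000", Classification.SECRET),
    "zero point energy field manipulator": ("unknown", Classification.TOP_SECRET),
}
USER_LEVELS = {
    "alice": Classification.CONFIDENTIAL,
    "bob": Classification.SECRET,
}


def weapon_details(user: str, name: str) -> Optional[dict]:
    """Return the row if allowed, None if absent or top secret, else raise."""
    key = name.strip().lower()
    row = WEAPONS.get(key)
    if row is None:
        return None  # not found
    cost, level = row
    if level is Classification.TOP_SECRET:
        return None  # pretend it does not exist, for everyone
    # Unknown users are treated as having the lowest level.
    if level > USER_LEVELS.get(user, Classification.UNCLASSIFIED):
        raise Forbidden("Sorry, you are not allowed to view information on that weapon!")
    return {"name": key, "cost": cost}


print(weapon_details("alice", " Crowbar "))
print(weapon_details("bob", "Pulse Rifle"))
print(weapon_details("bob", "Zero Point Energy Field Manipulator"))
```

In the database itself, the same comparison would still be done with the ENUM type and a join against a table of per-user levels; the sketch is only meant to make the three possible outcomes explicit.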
Some great press for College District
College District has been getting some positive press lately, the most recent being a Forbes article which talks about the success they have been seeing in the last few years.
College District is a company that sells collegiate merchandise to fans. They got their start focusing on the LSU Tigers at TigerDistrict.com and have branched out to teams such as the Oregon Ducks and Alabama Roll Tide.
We've been working with Jared Loftus @ College District for more than four and a half years. College District is running on a heavily modified Interchange system with some cool Postgres tricks. The system can support a nearly unlimited number of sites, running on 2 catalogs (1 for the admin, 1 for the front end) and 1 database. The key to the system is different schemas, fronted by views, that hide and expose records based on the database user that is connected. The great thing about this system is that Jared can choose to launch a new store within a day and be ready for sales, something he has taken advantage of in the past when a team is on fire and he sees an opportunity he can't pass up.
We are currently preparing for a re-launch of the College District site that will focus on crowd-sourced designs. Artists and fans will submit their designs and have them voted on; some will be chosen to be sold, and the folks whose designs are chosen will get paid for their efforts. The goal here is to grow a community that guides what College District and the individual school sites ultimately sell.
With College District's quick growth we've also been helping them improve their order fulfillment process. This includes streamlining how orders are picked, packed and shipped. The introduction of bar code scanners will help with the accuracy and speed of the process.
We get a kick out of seeing our clients succeed, especially those that come to us with a clear vision and a good attitude, and then put the hard work in to make it happen. It's an exciting year ahead for College District and we'll be right there supporting them on the journey.
Sanitizing supposed UTF-8 data
As time passes, it's clear that Unicode has won the character set encoding wars, and UTF-8 is by far the most popular encoding, and the expected default. In a few more years we'll probably find discussion of different character set encodings to be arcane, relegated to "data historians" and people working with legacy systems.
But we're not there yet! There's still lots of migration to do before we can forget about everything that's not UTF-8.
Last week I again found myself converting data. This time I was taking data from a PostgreSQL database with no specified encoding (so-called "SQL_ASCII", really just raw bytes), and sending it via JSON to a remote web service. JSON uses UTF-8 by default, and that's what I needed here. Most of the source data was in either UTF-8, ISO Latin-1, or Windows-1252, but some was in non-Unicode Chinese or Japanese encodings, and some was just plain mangled.
At this point I need to remind you about one of the most unusual aspects of UTF-8: It has limited valid forms. Legacy encodings typically used all or most of the 255 code points in their 8-bit space (leaving point 0 for traditional ASCII NUL). While UTF-8 is compatible with 7-bit ASCII, it does not allow just any 8-bit byte in any position. See the Wikipedia summary of invalid byte sequences to know what can be considered invalid.
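A quick way to get a feel for those limited valid forms is to hand a strict decoder a few byte strings. This Python sketch is only an illustration (the work described here was done in Perl), and the helper name and sample labels are made up for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (strict mode is the default)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "plain ASCII": b"plain ASCII",
    "UTF-8 e-acute": b"caf\xc3\xa9",       # valid two-byte sequence
    "Latin-1 e-acute": b"caf\xe9",         # lead byte with nothing after it
    "stray continuation": b"\x80abc",      # continuation byte with no lead byte
    "overlong NUL": b"\xc0\x80",           # overlong forms are forbidden
    "byte 0xFF": b"\xff",                  # never appears in UTF-8
}
results = {label: is_valid_utf8(raw) for label, raw in samples.items()}

for label, ok in results.items():
    print(f"{label:20} {'valid' if ok else 'INVALID'}")
```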
We had no need to try to fix the truly broken data, but we wanted to convert everything possible to UTF-8 and at the very least guarantee no invalid UTF-8 strings appeared in what we sent.
I previously wrote about converting a PostgreSQL database dump to UTF-8, and used the Perl CPAN module IsUTF8.
I was going to use that again, but looked around and found an even better module, exactly targeting this use case: Encoding::FixLatin, by Grant McLean. Its documentation says it "takes mixed encoding input and produces UTF-8 output" and that's exactly what it does, focusing on input with mixed UTF-8, Latin-1, and Windows-1252.
It worked as advertised, very well. We would need to use a different module to convert some other legacy encodings, but in this case this was good enough and got the vast majority of the data right.
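For readers who want a feel for the general idea, here is a rough Python approximation. It is not Encoding::FixLatin's actual algorithm, just a sketch under the assumption that anything which is not valid UTF-8 should be read as Windows-1252 (falling back to Latin-1 for the few bytes Windows-1252 leaves undefined). The handler name "cp1252_fallback" and the function to_utf8 are hypothetical.

```python
import codecs


def _cp1252_fallback(err):
    """Decode bytes that are not valid UTF-8 as Windows-1252 instead."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    try:
        text = bad.decode("cp1252")
    except UnicodeDecodeError:
        # A handful of bytes are undefined in Windows-1252; Latin-1 maps every byte.
        text = bad.decode("latin-1")
    return text, err.end  # resume decoding right after the offending bytes


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def to_utf8(raw: bytes) -> bytes:
    """Pass valid UTF-8 through untouched and re-encode everything else."""
    return raw.decode("utf-8", errors="cp1252_fallback").encode("utf-8")


mixed = b"caf\xe9 and caf\xc3\xa9"  # Latin-1 and UTF-8 in the same string
print(to_utf8(mixed).decode("utf-8"))
```

The input must be raw bytes for this to do anything useful, which mirrors the point about Perl's UTF-8 flag: text that has already been decoded is never re-examined.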
There's even a standalone fix_latin program designed specifically for processing Postgres pg_dump output from legacy encodings, with some nice examples of how to use it.
One gotcha is similar to a catch that David Christensen reported with the Encode module in a blog post here about a year ago: If the Perl string already has the UTF-8 flag set, Encoding::FixLatin immediately returns it, rather than trying to process it. So it's important that the incoming data be a pure byte stream, or that you otherwise turn off the UTF-8 flag, if you expect it to change anything.
Along the way I found some other CPAN modules that look useful for cases where I need more manual control than Encoding::FixLatin gives:
- Search::Tools::UTF8 - test for and/or fix bad ASCII, Latin-1, Windows-1252, and UTF-8 strings
- Encode::Detect - use Mozilla's universal charset detector and convert to UTF-8
- Unicode::Tussle - ridiculously comprehensive set of Unicode tools that has to be seen to be believed
Once again Perl's thriving open source/free software community made my day!
Finding PostgreSQL temporary_file problems with tail_n_mail
PostgreSQL does as much work as it can in RAM, but sometimes it needs to (or thinks that it needs to) write things temporarily to disk. Typically, this happens on large or complex queries in which the required memory is greater than the work_mem setting.
This is usually an unwanted event: not only is going to disk much slower than keeping things in memory, but it can cause I/O contention. For very large, not-run-very-often queries, writing to disk can be warranted, but in most cases, you will want to adjust the work_mem setting. Keep in mind that this is a very flexible setting, and can be adjusted globally (via the postgresql.conf file), per-user (via the ALTER USER command), and dynamically within a session (via the SET command). A good rule of thumb is to set it to something reasonable in your postgresql.conf (e.g. 8MB), and set it higher for specific users that are known to run complex queries. When you discover a particular query run by a normal user requires a lot of memory, adjust the work_mem for that particular query or set of queries.
How do you tell when your work_mem needs adjusting, or more to the point, when Postgres is writing files to disk? The key is the setting in postgresql.conf called log_temp_files. By default it is set to -1, which does no logging at all. Not very useful. A better setting is 0, which is my preferred setting: it logs all temporary files that are created. Setting log_temp_files to a positive number will only log entries that have an on-disk size greater than the given number (in kilobytes). Entries about temporary files used by Postgres will appear like this in your log file:
2011-01-12 16:33:34.175 EST LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp16501.0", size 130220032
The only important part is the size, in bytes. In the example above, the size is 124 MB, which is not that small of a file, especially as it may be created many, many times. So the question becomes, how can we quickly parse the files and get a sense of which queries are causing excess writes to disk? Enter the tail_n_mail program, which I recently tweaked to add a "tempfile" mode for just this purpose.
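As a small illustration of what "parsing the files" involves, here is a Python sketch that pulls the path and size out of a log line like the one above and converts the size to megabytes. This is not tail_n_mail's code (tail_n_mail is written in Perl); the regular expression and the function name parse_tempfile_line are assumptions made for this example.

```python
import re

# Matches the message part of a "temporary file" log entry.
TEMPFILE_RE = re.compile(
    r'temporary file: path "(?P<path>[^"]+)", size (?P<size>\d+)'
)


def parse_tempfile_line(line):
    """Return (path, size_in_bytes) for a temporary file log line, else None."""
    match = TEMPFILE_RE.search(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("size"))


line = ('2011-01-12 16:33:34.175 EST LOG: temporary file: '
        'path "base/pgsql_tmp/pgsql_tmp16501.0", size 130220032')

path, size = parse_tempfile_line(line)
print(f"{path}: {size / 1024 ** 2:.2f} MB")  # prints ...pgsql_tmp16501.0: 124.19 MB
```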
To enter this mode, just name your config file with "tempfile" in its name, and have it find the lines containing the temporary file information. It's also recommended you make use of the tempfile_limit parameter, which limits the results to the "top X" ones, as the report can get very verbose otherwise. An example config file and an example invocation via cron:
$ cat tail_n_mail.tempfile.myserver.txt
## Config file for the tail_n_mail program
## This file is automatically updated
## Last updated: Thu Nov 10 01:23:45 2011
MAILSUBJECT: Myserver tempfile sizes
EMAIL: greg@endpoint.com
FROM: postgres@myserver.com
INCLUDE: temporary file
TEMPFILE_LIMIT: 5
FILE: /var/log/pg_log/postgres-%Y-%m-%d.log

$ crontab -l | grep tempfile
## Mail a report each morning about tempfile usage:
0 5 * * * bin/tail_n_mail tnm/tail_n_mail.tempfile.myserver.txt --quiet
For the client I wrote this for, we run this once a day and it mails us a nice report giving the worst tempfile offenders. The queries are broken down in three ways:
- Largest overall temporary file size
- Largest arithmetic mean (average) size
- Largest total size across all runs of the same query
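The three rankings can give quite different answers, which a short Python sketch makes easy to see. This is only an illustration of the breakdown, not how tail_n_mail computes it; the function summarize and the sample queries are invented for the example.

```python
from collections import defaultdict


def summarize(entries, limit=5):
    """Rank queries three ways; entries is an iterable of (query, size_in_bytes)."""
    sizes = defaultdict(list)
    for query, size in entries:
        sizes[query].append(size)

    stats = [
        {
            "query": query,
            "count": len(seen),
            "largest": max(seen),
            "mean": sum(seen) / len(seen),
            "total": sum(seen),
        }
        for query, seen in sizes.items()
    ]

    def top(key):
        ranked = sorted(stats, key=lambda row: row[key], reverse=True)
        return [row["query"] for row in ranked[:limit]]

    return {"largest": top("largest"), "mean": top("mean"), "total": top("total")}


MB = 1024 ** 2
entries = (
    [("big yearly report", 900 * MB)]
    + [("medium join", 300 * MB), ("medium join", 500 * MB)]
    + [("frequent sort", 50 * MB)] * 100
)

for ranking, queries in summarize(entries).items():
    print(f"{ranking:8} {queries}")
```

Here the rarely run report wins on largest file and on mean, but the small sort that runs a hundred times is the top offender by total size, and it is often the better candidate for a work_mem adjustment.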
Here is a slightly edited version of an actual tempfile report email:
Date: Mon Nov 7 06:39:57 2011 EST
Host: myserver.example.com
Total matches: 1342
Matches from [A] /var/log/pg_log/2011-11-08.log: 1241
Matches from [B] /var/log/pg_log/2011-11-09.log: 101
Not showing all lines: tempfile limit is 5
Top items by arithmetic mean | Top items by total size
----------------------------------+-------------------------------
860 MB (item 5, count is 1) | 17 GB (item 4, count is 447)
779 MB (item 1, count is 2) | 8 GB (item 2, count is 71)
597 MB (item 7, count is 1) | 6 GB (item 334, count is 378)
597 MB (item 8, count is 1) | 6 GB (item 46, count is 104)
596 MB (item 9, count is 1) | 5 GB (item 3, count is 63)
[1] From file B Count: 2
Arithmetic mean is 779.38 MB, total size is 1.52 GB
Smallest temp file size: 534.75 MB (2011-11-08 12:33:14.312 EST)
Largest temp file size: 1024.00 MB (2011-11-08 16:33:14.121 EST)
First: 2011-11-08 05:30:12.541 EST
Last: 2011-11-09 03:12:22.162 EST
SELECT ab.order_number, TO_CHAR(ab.creation_date, 'YYYY-MM-DD HH24:MI:SS') AS order_date,
FROM orders o
JOIN order_summary os ON (os.order_id = o.id)
JOIN customer c ON (o.customer = c.id)
ORDER BY creation_date DESC
[2] From file A Count: 71
Arithmetic mean is 8.31 MB, total size is 654 MB
Smallest temp file size: 12.12 MB (2011-11-08 06:12:15.012 EST)
Largest temp file size: 24.23 MB (2011-11-08 19:32:45.004 EST)
First: 2011-11-08 06:12:15.012 EST
Last: 2011-11-09 04:12:14.042 EST
CREATE TEMPORARY TABLE tmp_sales_by_month AS SELECT * FROM sales_by_month_view;
While it still needs a little polishing (such as showing which file each smallest/largest came from), it has already been an indispensable tool for finding queries that are causing I/O problems via frequent and/or large temporary files.
PG West 2011 Re-cap
I just recently got back from PG West 2011, and have had some time to ruminate on the experience (<note-to-self>do elephants chew a cud?</note-to-self>). I definitely enjoyed San Jose as the location; it's always neat to visit new places and to meet new people, and I have to say that San Jose's weather was perfect for this time of year. I was also glad to be able to renew professional relationships and meet others in the PostgreSQL community.
Topic-wise, I noticed that quite a few talks had to do with replication and virtualization; this certainly seems to be a trend in the industry in general, and has definitely been a pet topic of mine for quite a while. It's interesting to see the various problems that necessitate some form of replication, the tradeoffs/considerations for each specific problem, and the wide variety of tools that are available in order to attack each of these problems (e.g. availability, read/write scaling, redundancy, etc).
A few high points from each of the days:
Tuesday
I had dinner with fellow PostgreSQL contributors; some I knew ahead of time, others I got to know. This was followed by additional socializing.
Wednesday
I attended a talk on PostgreSQL HA, which covered the use of traditional cluster-level warm/hot standbys, as well as a solution using pg_pool and slony. This was followed by the keynote address at the conference, given by Charles Fan, Senior Vice President from VMware. This was a high-level overview of the type of work that VMware had been doing in order to support virtualizing PostgreSQL and optimizing for running multiple PostgreSQL instances on separate VMs efficiently.
I was involved in some "lunch track" discussions, and followed this all up with several more talks covering VMware's specific offerings in more detail.
Evening was dinner and mandatory socializing.
Thursday
I went to Robert Hodges' talk about Tungsten. I had only heard of it in general terms, so it was interesting to get more specific details. Robert's talk covered the basic architecture of Tungsten, as well as how their various adapters between multiple types of databases were used to ensure that the SQL that was executed on heterogeneous clusters would account for differences in datatype representation, encoding, DDL, specific query syntax, etc; for instance when executing a CREATE TABLE statement, MySQL's AUTO_INCREMENT fields would be converted to PostgreSQL's equivalent SERIAL type. There was lots of good discussion after the presentation, and I spoke with Robert after the talk about different design/architecture choices that they made with Tungsten and we discussed differences between that and Bucardo.
At lunchtime I got to meet David Fetter's wife and baby (who looks just like him!), then gave an updated version of my Bucardo: More than just Multimaster talk. Attendance was good, around 30-35, and the audience asked plenty of questions.
After my talk, I attended one about database optimization. This is always an interesting topic for me, so I'm glad to hear others' insights on this subject.
This was all followed up by mandatory socializing.
Friday
I found the talk about Translattice to be very interesting, as it highlighted specific problem domains for distributed, redundant, multi-write database clusters for more fault-tolerant applications. It struck me as utilizing some of the same ideas as Cassandra or other decentralized distributed datastores, but doing so in a way that is transparent to the use of PostgreSQL. What I found particularly interesting about this system was the use of data access/usage patterns, explicit policy, and locality to specify both the costing algorithm for accessing data as well as distributing knowledge about just where each copy of each piece of data exists. The talk, while an introduction to the system, did not skimp on the details and the presenter was happy to answer my many specific questions.
The remaining talks were fairly light-hearted. I went to one called Redis: Data Bacon for the title alone. While I still don't understand why bacon, I walked away with an appreciation of the problem domain Redis addresses and how it could be used in specific cases. The final talk I attended was about Schemaverse, a project which implements a game entirely in SQL. Each player has their own database user created that they can then use from either the web interface or even via just a regular psql connection. I can't speak for the game itself other than the overview given in the talk, but creative use/hacking of the game was explicitly encouraged, and seems like an interesting approach for testing things which may not often be stressed enough in (at least my) regular use of PostgreSQL, such as intra-database security/permissions, huge numbers of users, etc. (It didn't surprise me that this game had been a hit at DEFCON.)
This was followed by the closing session, and final goodbyes, etc. Oh, and (need I say) mandatory socializing.
Final Thoughts
I always enjoy going to PostgreSQL events, and continue to be impressed with the community that surrounds PostgreSQL. Thanks to everyone who attended, and a special thanks to Josh Drake for the work he put into it. Hope to see ya next time!
Viewing schema changes over time with check_postgres
Version 2.18.0 of check_postgres, a monitoring tool for PostgreSQL, has just been released. This new version has quite a large number of changes: see the announcement for the full list. One of the major features is the overhaul of the same_schema action. This allows you to compare the structure of one database to another and get a report of all the differences check_postgres finds. Note that "schema" here means the database structure, not the object you get from a "CREATE SCHEMA" command. Further, remember the same_schema action does not compare the actual data, just its structure.
Unlike most check_postgres actions, which deal with the current state of a single database, same_schema can compare databases to each other, as well as audit things by finding changes over time. In addition to having the entire system overhauled, same_schema now allows comparing as many databases as you want to each other. The arguments have been simplified, in that a comma-separated list is all that is needed for multiple entries. For example:
./check_postgres.pl --action=same_schema \
  --dbname=prod,qa,dev --dbuser=alice,bob,charlie
The above command will connect to three databases, as three different users, and compare their schemas (i.e. structures). Note that we don't need to specify a warning or critical value: we consider this an 'OK' Nagios check if the schemas match, otherwise it is 'CRITICAL'. Each database gets assigned a number for ease of reporting, and the output looks like this:
POSTGRES_SAME_SCHEMA CRITICAL: (databases:prod,qa,dev) Databases were different. Items not matched: 1 | time=0.54s
DB 1: port=5432 dbname=prod user=alice
DB 1: PG version: 9.1.1
DB 1: Total objects: 312
DB 2: port=5432 dbname=qa user=bob
DB 2: PG version: 9.1.1
DB 2: Total objects: 312
DB 3: port=5432 dbname=dev user=charlie
DB 3: PG version: 9.1.1
DB 3: Total objects: 313
Language "plpgsql" does not exist on all databases:
  Exists on: 3
  Missing on: 1, 2
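The "Exists on / Missing on" style of report boils down to set comparison across the numbered databases. Here is a minimal Python sketch of that idea; it is not check_postgres's implementation (which is Perl and compares far more than names), and the function schema_differences and the object names are made up for illustration.

```python
def schema_differences(objects_by_db):
    """Report objects that are not present on every database.

    objects_by_db maps a database number to the set of object names found there.
    """
    all_objects = set().union(*objects_by_db.values())
    report = {}
    for name in sorted(all_objects):
        exists = sorted(db for db, objs in objects_by_db.items() if name in objs)
        missing = sorted(db for db in objects_by_db if db not in exists)
        if missing:
            report[name] = {"exists_on": exists, "missing_on": missing}
    return report


databases = {
    1: {"table public.orders"},
    2: {"table public.orders"},
    3: {"table public.orders", "language plpgsql"},
}

for name, where in schema_differences(databases).items():
    print(f"{name}: exists on {where['exists_on']}, missing on {where['missing_on']}")
```

An empty report corresponds to the 'OK' result; anything else is what turns the check 'CRITICAL'.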
The second large change was a simplification of the filtering options. Everything is now controlled by the --filter argument, and basically you can tell it what things to ignore. For example:
./check_postgres.pl --action=same_schema \
  --dbname=A,B --filter=nolanguage,nosequence
The above command will compare the schemas on databases A and B, but will ignore any difference in which languages are installed, and ignore any differences in the sequences used by the databases. Most objects can be filtered out in a similar way. There are also a few other useful options for the --filter argument:
- noposition: Ignore what order columns are in
- noperms: Do not worry about any permissions on database objects
- nofuncbody: Do not check function source
The final and most exciting large change is the ability to compare a database to itself, over time. In other words, you can see exactly what changed during a certain time period. We have a client using that now to send a daily report on all schema changes made in the last 24 hours, for all the databases in their system. This is a very nice thing for a DBA to receive: not only is there a nice audit trail in your email, you can answer questions such as:
- Was this a known change, or did someone make it without letting anyone else know?
- Did somebody fat-finger and drop an index by mistake?
- Were the changes applied to database X also applied to database Y and Z?
To enable time-based checks, simply provide a single database to check. The first time it is run, same_schema simply gathers all the schema information and stores it on disk. The next time it is run, it detects the file, reads it in as database "2", and compares it to the current database (number "1"). The --replace argument will rewrite the file with the current data when it is done. So the cronjob for the aforementioned client is as simple as:
0 10 * * * ~/bin/check_postgres.pl --action=same_schema \
  --host=bar --dbname=abc --quiet --replace
The --quiet argument ensures that no output is given if everything is 'OK'. If everything is not okay (i.e. if differences are found), cron gets a bunch of output sent to it and duly mails it out. Thus, a few minutes after 10AM each day, a report is sent if anything has changed in the last day. Here's a slightly redacted version of this morning's report, which shows that a schema named "stat_backup" was dropped at some point in the last 24 hours (which was a known operation):
POSTGRES_SAME_SCHEMA CRITICAL: DB "abc" (host:bar) Databases were different. Items not matched: 1 | time=516.56s
DB 1: port=5432 host=bar dbname=abc user=postgres
DB 1: PG version: 8.3.16
DB 1: Total objects: 11863
DB 2: File=check_postgres.audit.port.5432.host.bar.db.abc
DB 2: Creation date: Sun Oct 2 10:06:12 2011 CP version: 2.18.0
DB 2: port=5432 host=bar dbname=abc user=postgres
DB 2: PG version: 8.3.16
DB 2: Total objects: 11864
Schema "stat_backup" does not exist on all databases:
  Exists on: 2
  Missing on: 1
As you can see, the first part is a standard Nagios-looking output, followed by a header explaining how we defined database "1" and "2" (the former a direct database call, and the latter a frozen version of the same.)
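The snapshot-then-compare cycle behind the time-based check is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, under the assumption that a schema can be reduced to a name-to-definition mapping; the JSON file format, the function check_over_time, and its replace flag are inventions for this example and not check_postgres's actual storage format or code.

```python
import json
import os
import tempfile


def check_over_time(current, audit_path, replace=False):
    """Compare the current schema (name -> definition) with the stored snapshot.

    Returns None on the first run (nothing to compare against), otherwise a
    dict listing added, dropped and changed objects.
    """
    if not os.path.exists(audit_path):
        with open(audit_path, "w") as fh:
            json.dump(current, fh)  # first run: just store the snapshot
        return None

    with open(audit_path) as fh:
        previous = json.load(fh)

    diff = {
        "added": sorted(set(current) - set(previous)),
        "dropped": sorted(set(previous) - set(current)),
        "changed": sorted(
            name for name in set(current) & set(previous)
            if current[name] != previous[name]
        ),
    }
    if replace:
        with open(audit_path, "w") as fh:
            json.dump(current, fh)  # the next run compares against today
    return diff


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "audit.json")
    yesterday = {"schema public": "", "schema stat_backup": ""}
    today = {"schema public": ""}
    print(check_over_time(yesterday, path, replace=True))  # first run, no report
    print(check_over_time(today, path, replace=True))      # stat_backup dropped
    print(check_over_time(today, path, replace=True))      # nothing changed
```

Without the replace step the comparison would always be against the very first snapshot, which is why the daily cron job passes --replace.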
Sometimes you want to store more than one version at a time: for example, if you want both a daily and a weekly view. To enable this, use the --suffix argument to create different instances of the saved file. For example:
0 10 * * * ~/bin/check_postgres.pl --action=same_schema \
  --host=bar --dbname=abc --quiet --replace --suffix=daily
0 10 * * Fri ~/bin/check_postgres.pl --action=same_schema \
  --host=bar --dbname=abc --quiet --replace --suffix=weekly
The above commands would end up recreating this file every morning at 10: check_postgres.audit.port.5432.host.bar.db.abc.daily, and this file each Friday at 10: check_postgres.audit.port.5432.host.bar.db.abc.weekly.
Thanks to all the people that made 2.18.0 happen (see the release notes for the list). There are still some rough edges to the same_schema action: for example, the output could be a little more user-friendly, and not all database objects are checked yet (e.g. no custom aggregates or operator classes). Development is ongoing; patches and other contributions are always welcome. In particular, we need more translators. We have French covered, but would like to include more languages. The code can be checked out at:
git clone git://bucardo.org/check_postgres.git
There is also a github mirror if you so prefer: https://github.com/bucardo/check_postgres.
You can also file a bug (or feature request), or join one of the mailing lists: general, announce, and commit.
PostgreSQL Serializable and Repeatable Read Switcheroo
PostgreSQL allows for different transaction isolation levels to be specified. Because Bucardo needs a consistent snapshot of each database involved in replication to perform its work, the first thing that the Bucardo daemon does when connecting to a remote PostgreSQL database is:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE;
The 'READ WRITE' bit sets us in read/write mode, just in case the entire database has been set to read only (a quick and easy way to make your slave databases non-writeable!). It also sets the transaction isolation level to 'SERIALIZABLE'. At least, it used to. Now Bucardo uses 'REPEATABLE READ' like this:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ WRITE;
Why the change? In version 9.1 of PostgreSQL the concept of SSI (Serializable Snapshot Isolation) was introduced. How it actually works is a little complicated (follow the link for more detail), but before 9.1 PostgreSQL was only *sort of* doing serialized transactions when you asked for serializable mode. What it was really doing was repeatable read and not trying to really serialize the transactions. In 9.1, PostgreSQL is doing *true* serializable transactions. It also adds a new distinct 'internal' transaction mode, 'repeatable read', which does exactly what the old 'serializable' used to do. Finally, if you issue a 'repeatable read' on a pre-9.1 database, it silently upgrades it to the old 'serializable' mode.
So in summary, if your application was using 'SERIALIZABLE' before, you can now replace that with 'REPEATABLE READ' and get the exact same behavior as before, regardless of the version. Of course, if you want *true* serializable transactions, use SERIALIZABLE. It will continue to mean the same as 'REPEATABLE READ' in pre-9.1 databases, and provide true serializability in 9.1 and beyond. (I haven't determined yet if Bucardo is going to use this new level, as it comes with a little bit of overhead)
Since this can be a little confusing, here's a handy chart showing how version 9.1 changed the meaning of SERIALIZABLE, and added a new 'internal' isolation level:
| Requested isolation level | Actual internal isolation level, Postgres 9.0 and earlier | Actual internal isolation level, Postgres 9.1 and later | Version comparison |
|---|---|---|---|
| READ UNCOMMITTED | Read committed | Read committed | Exact same |
| READ COMMITTED | Read committed | Read committed | Exact same |
| REPEATABLE READ | Serializable | Repeatable read | Functionally identical |
| SERIALIZABLE | Serializable | Serializable (true) | 9.1 only! |
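The same mapping can be written as a small lookup function, which is handy if an application has to reason about which behavior it will really get. This is a Python sketch of the chart only, not anything Bucardo or PostgreSQL ships; the function name and the three behavior labels are made up for this example.

```python
def effective_isolation(server_version, requested):
    """Return the behavior PostgreSQL actually provides for a requested level.

    server_version is a (major, minor) tuple such as (9, 0) or (9, 1).
    The result is one of "read committed", "snapshot" or "true serializable",
    where "snapshot" covers both the pre-9.1 'serializable' and the 9.1+
    'repeatable read', which are functionally identical.
    """
    requested = " ".join(requested.upper().split())
    if requested in ("READ UNCOMMITTED", "READ COMMITTED"):
        return "read committed"
    if requested == "REPEATABLE READ":
        return "snapshot"
    if requested == "SERIALIZABLE":
        return "true serializable" if server_version >= (9, 1) else "snapshot"
    raise ValueError(f"unknown isolation level: {requested!r}")


for version in [(9, 0), (9, 1)]:
    for level in ("READ COMMITTED", "REPEATABLE READ", "SERIALIZABLE"):
        print(version, level, "->", effective_isolation(version, level))
```

Requesting REPEATABLE READ is the only choice that yields the old behavior on both sides of the 9.1 boundary, which is the reasoning behind Bucardo's switch.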
Congratulations and thanks to Kevin Grittner and Dan Ports for making true serializability a reality!
Another Post-Postgres Open Post
Well, that was fun! I've always found attending conferences to be an invigorating experience. The talks are generally very informative, it's always nice to put a face to names seen online in the community, and between the "hall track", lunches, and after-session social activities it's difficult to not find engaging discussions.
My favorite presentations:
- Scaling servers with Skytools -- seeing what it takes to balance several high-velocity nodes was intriguing.
- Mission Impossible -- lots of good arguments for why Postgres can be an equivalent, nay, better replacement for an enterprise database.
- The PostgreSQL replication protocol -- even if I never intend to write something that'll interact with it directly, knowing how something like the new streaming replication works under the hood goes a long way to keeping it running at a higher level.
- True Serializable Transactions Are Here! -- I'll admit I haven't had a chance to fully check out the changes to Serializable, so getting to hear some of the reasoning and stepping through some of the use cases was quite helpful.
But what of my talks? Monitoring went well -- it seemed to get the message out. There was a lot of "gee, I have Postgres, and Nagios, but they're not talkin'. Now they can!" So hopefully, with a little more visibility into how the database is standing, the tools can boost confidence within business environments that aren't as sure about Postgres and help keep existing installations in place. I think the Bucardo presentation had me a bit more animated for some reason. That one also led to some interesting questions from the audience, and a couple challenges for the Bucardo project.
All in all, great work everyone!
Headed out to PgWest next week
I'm gearing up to go out to San Jose to attend and speak at the PG West PostgreSQL conference in sunny San Jose. (Does anyone have directions...?)
I'm excited to again meet and mingle with more PostgreSQL experts and enthusiasts and look forward to the various talks, technical discussions, and social opportunities. My talk will be on Bucardo and many uses for it as a general tool. It'll also cover additional changes coming down the pipe in Bucardo 5.
I look forward to seeing everyone!
Bucardo, 9.1, and you!
A little bit of bad news for Bucardo fans: Greg Sabino Mullane won't be making Postgres Open due to scheduling conflicts. But not to worry, I'll be giving the "Postgres masters, other slaves" talk in his place.
In looking over the slides, one thing that catches my eye is how quickly Bucardo is adopting PostgreSQL 9.1 features. Specifically, Unlogged Tables will be very useful in boosting performance where Bucardo stages information about changed rows for multi-database updates. I also wonder if the enhanced Serializable Snapshot Isolation would be helpful in some situations. Innovation encouraging more innovation, gotta love open source!
If I haven't said it before, thanks to everyone that made Postgres 9.1 possible. Some of the other enhancements are just as exciting. For instance, I'm eager to see some creative uses for Writable CTEs. And it'll be very interesting to see what additional Foreign Data Wrappers pop up over time.
Now, back to packing...
Postgres Open: One week to go!
Wow, time flies, Postgres Open is almost upon us!
I'll be there giving a talk Thursday morning on monitoring tools and techniques, and possibly helping with the Bucardo 5 replication session Friday afternoon. Sadly I'll need to catch a flight shortly after that, so there won't be much time to explore Chicago around everything going on. But at least it'll be nice to get out to a conference again!
Bucardo PostgreSQL replication to other tables with customname
(Don't miss the Bucardo5 talk at Postgres Open in Chicago)
Work on the next major version of Bucardo is wrapping up (version 5 is now in beta), and two new features have been added to this major version. The first, called customname, allows you to replicate to a table with a different name. This is a feature people have been requesting for a long time, and it even allows you to replicate between differently named Postgres schemas. The second option, called customcols, allows you to replicate to different columns on the target: not only a subset, but different column names (and types), as well as other neat tricks.
The "customname" option allows changing the table name for one or more targets. Normally, Bucardo replicates tables from the source databases to the target databases, and all tables must have the same name and schema everywhere. With the customname feature, you can change the target table names, either globally, per database, or per sync.
We'll go through a full example here, using a stock 64-bit RedHat 6.1 EC2 box (ami-5e837b37). I find EC2 a great testing platform - not only can you try different operating systems and architectures, but (as my own personal box is very customized) it is great to start afresh from a stock configuration.
First, let's turn off SELinux, install the EPEL rpm, update the box, and install a few needed packages.
echo 0 > /selinux/enforce
wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-5.noarch.rpm
rpm -ivh epel-release-6-5.noarch.rpm
yum update
yum install emacs-nox perl-DBIx-Safe perl-DBD-Pg git postgresql-plperl
cpan boolean
The yum update takes a while to run, but I always feel better when things are up to date. Next, we will create a new database cluster, create the /var/run/bucardo directory that Bucardo uses to store its PIDs, adjust the ultraconservative stock pg_hba.conf file, and start Postgres up:
service postgresql initdb
mkdir /var/run/bucardo
chown postgres.postgres /var/run/bucardo
emacs /var/lib/pgsql/data/pg_hba.conf
service postgresql start
For the pg_hba.conf configuration file, because we want to be able to connect to the database as the bucardo user without actually logging into that account, we will allow access using the 'md5' (password) method instead of 'ident'. But because we don't want to bother creating a password for the postgres user, we will still allow those connections via ident. The relevant lines in the pg_hba.conf will end up like this:
# TYPE  DATABASE  USER      METHOD
local   all       postgres  ident
local   all       all       md5
At this point, we (as the postgres user) download and install Bucardo itself:
su - postgres
git clone git://bucardo.org/bucardo.git
cd bucardo
perl Makefile.PL
make
sudo make install
bucardo install   # (enter 'p' and keep the default values)
We are now ready to start testing out the new customname feature. First we will need some data to replicate! For this demo we are going to use one of the handy sample datasets from the dbsamples project. The one we will use has a few small tables with information about towns in France. Note that the tarball does not (sadly) contain a top-level directory, so we have to create one ourselves. We will then create four identical databases holding the data from that file.
wget http://pgfoundry.org/frs/download.php/935/french-towns-communes-francaises-1.0.tar.gz
mkdir frenchtowns
cd frenchtowns
tar xvfz ../french-towns-communes-francaises-1.0.tar.gz
psql -c 'create database french1'
psql french1 -q -f french-towns-communes-francaises.sql
psql -c 'create database french2 template french1'
psql -c 'create database french3 template french1'
psql -c 'create database french4 template french1'
Bucardo is installed but does not know what to do yet, so we will teach Bucardo about each of the databases, and add in all the tables, grouping them into a herd in the process. Finally, we create a sync in which french1 and french2 are both source (master) databases, and french3 and french4 will be target (slave) databases.
bucardo add db f1 db=french1
bucardo add db f2 db=french2
bucardo add db f3 db=french3
bucardo add db f4 db=french4
bucardo add all tables herd=fherd
bucardo add sync wildstar herd=fherd dbs=f1=source,f2=source,f3=target,f4=target
Before starting it up, I usually raise the debug level, as it gives a much clearer picture of what is going on in the logs. It does make the logs a lot more crowded, so it is not recommended for production use:
echo log_level=DEBUG >> ~/.bucardorc
Next, we start Bucardo up and make sure everything is working as it should. Scanning the log.bucardo file that is generated is a great way to do this:
bucardo start
sleep 3
tail log.bucardo
If all goes well, you should see something very similar to this in the last lines of your log.bucardo file:
(972) [Sat Sep 3 16:18:54 2011] KID Total time for sync "wildstar" (0 rows): 0.05 seconds
(966) [Sat Sep 3 16:18:55 2011] CTL Got NOTICE ctl_syncdone_wildstar from 973 (line 1624)
(966) [Sat Sep 3 16:18:55 2011] CTL Kid 973 has reported that sync wildstar is done
(966) [Sat Sep 3 16:18:55 2011] CTL Sending NOTIFY "syncdone_wildstar" (line 1709)
(954) [Sat Sep 3 16:18:55 2011] MCP Got NOTICE syncdone_wildstar from 967 (line 749)
(954) [Sat Sep 3 16:18:55 2011] MCP Sync wildstar has finished
(954) [Sat Sep 3 16:18:55 2011] MCP Sending NOTIFY "syncdone_wildstar" (line 812)
(954) [Sat Sep 3 16:18:56 2011] MCP Got NOTICE syncdone_wildstar from 957 (Bucardo DB) (line 749)
From the above, we see that a KID finished running the sync we created, without finding any changed rows to replicate. Then there is some chatter between the different Bucardo processes. Now to test out the customname feature. We'll rename one of the tables, tell Bucardo about the change, reload the sync, and verify that all is still being replicated.
psql french3 -c 'ALTER TABLE regions RENAME TO tesla'
bucardo add customname regions tesla db=f3
bucardo reload wildstar

psql french3 -c 'truncate table tesla cascade'
TRUNCATE
psql french3 -t -c 'select count(*) from tesla'
0
psql french1 -c 'update regions set name=name'
UPDATE 26
psql french3 -t -c 'select count(*) from tesla'
26
In the above, the update on the regions table in the french1 database calls a trigger that notifies Bucardo that some rows have changed; Bucardo then has a KID copy the rows from the source database french1 to the other source database french2, as well as the targets french3 and french4. The final internal DELETE and COPY that it performs is done on database french3 to the tesla table rather than the regions table.
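Conceptually, customname is a lookup from the source table name to the name to use on a particular target. The Python sketch below illustrates that lookup only; it is not Bucardo's code (Bucardo is Perl), it leaves out the per-sync level, and the precedence shown (a per-database entry wins over a global one) is my assumption for the example. The function target_table and the mapping layout are hypothetical.

```python
def target_table(customnames, db, table):
    """Return the table name to write to on database `db`.

    customnames maps (db, source_table) to a target table name; a db of None
    stands for a global entry. Assumed precedence: per-database, then global,
    then the unchanged source name.
    """
    if (db, table) in customnames:
        return customnames[(db, table)]
    return customnames.get((None, table), table)


# Mirrors the command: bucardo add customname regions tesla db=f3
customnames = {("f3", "regions"): "tesla"}

for db in ("f2", "f3", "f4"):
    print(db, "->", target_table(customnames, db, "regions"))
```

Only f3 sees the new name; every other database in the sync keeps writing to regions.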
The customname feature cannot be used to change the tables in a source database, as they must all be the same (for obvious reasons). We can, however, specify that a different schema be used for a target, as well as a different table. This only applies to Postgres targets, as other database types (e.g. MySQL) do not use schemas. Let's see that in action:
psql french4 -c 'create schema banana'
psql french4 -c 'alter table regions set schema banana'
psql french4 -c 'truncate table banana.regions cascade'
bucardo add customname regions banana.regions db=f4
bucardo reload wildstar
psql french4 -t -c 'select count(*) from banana.regions'
0
psql french2 -c 'update regions set name=name'
UPDATE 26
psql french4 -t -c 'select count(*) from banana.regions'
26
As before, the update on a source causes the changes to propagate to the other source database, as well as both targets. Note that the ALTER TABLE also mutated the associated sequence for the table, so there will be warnings in Bucardo's logs about the DEFAULT values for the primary keys in the regions' tables being different. Since this post is getting long, I will save the discussion of customcols for another day.
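The two customname commands above amount to a per-target lookup: for a given target database and source table, use the custom name if one was registered, otherwise the original name. The Python sketch below illustrates only that mapping idea. It is not how Bucardo (which is written in Perl) stores or applies custom names; the dictionary, the function name, the "public" default schema, and the database name "f2" are assumptions made for illustration.

```python
# Hypothetical in-memory picture of the two "bucardo add customname" calls above.
CUSTOM_NAMES = {
    ("f3", "regions"): "tesla",            # renamed table on target f3
    ("f4", "regions"): "banana.regions",   # different schema on target f4
}

def target_relation(db, table, default_schema="public"):
    """Return (schema, table) to write to on database `db` for source `table`."""
    name = CUSTOM_NAMES.get((db, table), table)
    schema, _, relation = name.rpartition(".")
    # No dot in the name means no schema was given, so fall back to the default.
    return (schema or default_schema, relation)

on_f3 = target_relation("f3", "regions")
on_f4 = target_relation("f4", "regions")
on_f2 = target_relation("f2", "regions")   # no custom name registered
```

Databases without a registered custom name fall through to the original table name, which matches the observation that the source databases must all use the same names.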
PostgreSQL log analysis / PGSI
End Point recently started working with a new client (a startup in stealth mode, cannot name names, etc.) who is using PostgreSQL because of the great success some of the people starting the company have had with Postgres in previous companies. One of the things we recommend to our clients is a regular look at the database to see where the bottlenecks are. A good way to do this is by analyzing the logs. The two main tools for doing so are PGSI (Postgres System Impact) and pgfouine. We prefer PGSI for a few reasons: the output is better, it considers more factors, and it does not require you to munge your log_line_prefix setting quite as badly.
Both programs work basically the same: given a large number of log lines from Postgres, normalize the queries, see how long they took, and produce some pretty output. If you only want to look at the longest queries, it's usually enough to set your log_min_duration_statement to something sane (such as 200), and then run a daily tail_n_mail job against it. This is what we are doing with this client, and it sends a daily report that looks like this:
Date: Mon Aug 29 11:22:33 2011 UTC
Host: acme-postgres-1
Minimum duration: 2000 ms
Matches from /var/log/pg_log/postgres-2011-08-29.log: 7

[1] (from line 227)
2011-08-29 08:36:50 UTC postgres@maindb [25198]
LOG: duration: 276945.482 ms statement: COPY public.sales (id, name, region, item, quantity) TO stdout;

[2] (from line 729)
2011-08-29 21:29:18 UTC tony@quadrant [17176]
LOG: duration: 8229.237 ms execute dbdpg_p29855_1: SELECT id, singer, track FROM album JOIN artist ON artist.id = album.singer WHERE id < 1000 AND track <> 1
However, the PGSI program was born of the need to look at all the queries in the database, not just the slowest-running ones; the cumulative effect of many short queries can have much more of an impact on the server than a smaller number of long-running queries. Thus, PGSI looks not only at how long a query takes to run, but how many times it has run in a certain period, as well as how often it runs. All of this information is put together to give a score to each normalized query, known as the "system impact". Like the costs on a Postgres explain plan, this is a unit-less number and of little importance in and of itself: the important thing is to compare it to the other queries to see the relative impact. We also have that report emailed out. It looks similar to this (this is a text version of the HTML produced):
Log file: /var/log/pg_log/postgres-2011-08-29.log
* SELECT (24)
* UPDATE (1)

Query System Impact : SELECT
Log activity from 2011-08-29 11:00:01 to 2011-08-29 11:15:01

+----------------------------------+
| System Impact:   | 0.15          |
| Mean Duration:   | 1230.95 ms    |
| Median Duration: | 1224.70 ms    |
| Total Count:     | 411           |
| Mean Interval:   | 4195 seconds  |
| Std. Deviation:  | 126.01 ms     |
+----------------------------------+

SELECT * FROM albums WHERE track <> ? AND artist = ? ORDER BY artist, track
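The ? placeholders in the report above come from query normalization: literal values are stripped so that queries differing only in their values are counted together. A minimal Python sketch of that idea follows. It is not PGSI's code (PGSI is written in Perl) and it does not compute the system impact score, whose formula is not given here; the function names are hypothetical, and the regexes only cope with simple quoted strings and bare numbers.

```python
import re
from collections import defaultdict

def normalize_query(sql):
    """Replace literal values with ? and collapse runs of whitespace."""
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)      # single-quoted strings
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)   # bare integers and decimals
    return re.sub(r"\s+", " ", sql).strip()

def summarize(entries):
    """entries: iterable of (duration_ms, sql) pairs.

    Returns {normalized_sql: (count, mean_duration_ms)}.
    """
    buckets = defaultdict(list)
    for duration_ms, sql in entries:
        buckets[normalize_query(sql)].append(duration_ms)
    return {q: (len(d), sum(d) / len(d)) for q, d in buckets.items()}

# Two made-up log entries that differ only in their literal values.
sample = [
    (1200.0, "SELECT * FROM albums WHERE track <> 1 AND artist = 'Rush' ORDER BY artist, track"),
    (1300.0, "SELECT * FROM albums WHERE track <>  7 AND artist = 'Yes' ORDER BY artist, track"),
]
report = summarize(sample)
```

Both sample entries collapse into one normalized query, so their count and mean duration are reported together, which is the grouping that the per-query statistics in the report rely on.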
At this point you may be wondering how we get all the queries into the log. This is done by setting log_min_duration_statement to 0. However, most (but not all!) clients do not want full logging 24 hours a day, as this creates some very large log files. So the solution we use is to analyze a slice of the day, only. It depends on the client, but we try for about 15 minutes during a busy time of day. Thus, the sequence of events is:
- Turn on "full logging" by dropping log_min_duration_statement to zero
- Some time later, set log_min_duration_statement back to what it was (e.g. 200)
- Extract the logs from the time it was set to zero to when it was flipped back.
- Run PGSI against the log subsection we pulled out
- Mail the results out
All of this is run by cron. The first problem is how to update the postgresql.conf file and have Postgres re-read it, all automatically. As covered previously, we use the modify_postgres_conf.pl script for this.
The exact incantation looks like this:
0 11 * * * perl bin/modify_postgres_conf --quiet \
  --pgconf /etc/postgresql/9.0/main/postgresql.conf \
  --change log_min_duration_statement=0
15 11 * * * perl bin/modify_postgres_conf --quiet \
  --pgconf /etc/postgresql/9.0/main/postgresql.conf \
  --change log_min_duration_statement=200 --no-comment
## The above are both one line each, but split for readability here
This changes log_min_duration_statement to 0 at 11AM, and then back to 200 15 minutes later. We use the --quiet argument as this is run from cron so we don't want any output from modify_postgres_conf on success. We do want a comment when we flip it to 0, as this is the temporary state and we want people viewing the postgresql.conf file at that time to realize it (or someone just doing a "git diff"). We don't want a comment when we flip it back, as the timestamp in the comment would cause git to think the file had changed.
Now for the tricky bit: extracting out just the section of logs that we want and sending it to PGSI. Here's the recipe I came up with for this client:
16 11 * * * tac `ls -1rt /var/log/pg_log/postgres*log \
  | tail -1` \
  | sed -n '/statement" changed to "200"/,/statement" changed to "0"/ p' \
  | tac \
  | bin/pgsi.pl --quiet > tmp/pgsi.html && bin/send_pgsi.pl
## Again, the above is all one line
What does this do? First, it finds the latest file in the /var/log/pg_log directory that starts with 'postgres' and ends with 'log'. Then it uses the tac program to spool the file backwards, one line at a time ('tac' is the opposite of 'cat'). Then it pipes that output to the sed program, which prints out all lines starting with the one where we changed the log_min_duration_statement to 200, and ending with the one where we changed it to 0 (the reverse of what we actually did, as we are reading it backwards). Finally, we use tac again to put the lines back in the correct order, pipe the output to pgsi, write the output to a temporary file, and then call a quick Perl script named send_pgsi.pl which mails the temporary HTML file to some interested parties.
Why do we use tac? Because we want to read the file backwards, so as to make sure we get the correct slice of log files as delimited by the log_min_duration_statement changes. If we simply started at the beginning of the file, we might encounter other similar changes that were made earlier and not by us.
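For readers who find the backwards sed range hard to picture, here is a rough Python sketch of the same idea: walk the log from the end, start collecting at the "changed to 200" line, and stop at the "changed to 0" line. The function name and the sample log lines are made up for illustration. One deliberate difference: this sketch stops after the most recent window, whereas a sed range expression would go on to print any earlier matching windows it meets.

```python
START = 'statement" changed to "0"'     # full logging switched on
END = 'statement" changed to "200"'     # full logging switched off again

def latest_window(lines, start=START, end=END):
    """Return the most recent start..end slice of lines, in original order."""
    collected = []
    inside = False
    for line in reversed(lines):         # read backwards, like tac
        if inside:
            collected.append(line)
            if start in line:            # reached where full logging began
                break
        elif end in line:                # first hit from the end: window closes here
            inside = True
            collected.append(line)
    collected.reverse()                  # second tac: restore forward order
    return collected

log = [
    'LOG:  parameter "log_min_duration_statement" changed to "0"',
    'LOG:  duration: 1.0 ms  statement: SELECT 1',
    'LOG:  parameter "log_min_duration_statement" changed to "200"',
    'LOG:  duration: 250.0 ms  statement: SELECT 2',
    'LOG:  parameter "log_min_duration_statement" changed to "0"',
    'LOG:  duration: 3.0 ms  statement: SELECT 3',
    'LOG:  parameter "log_min_duration_statement" changed to "200"',
    'LOG:  duration: 900.0 ms  statement: SELECT 4',
]
window = latest_window(log)
```

In the sample, an earlier toggle pair appears near the top of the log; reading from the end picks the later pair, which is the slice we actually want to hand to PGSI.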
All of this is not foolproof, of course, but it does not have to be, as it is very easy to run manually if something (for example the sed recipe) goes wrong, as the log file will still be there. Yes, it's also possible to grab the ranges in other ways (such as perl), but I find sed the quickest and easiest. As tempting as it was to write Yet Another Perl Script to extract the lines, sometimes a few chained Unix programs can do the job quite nicely.
Changing postgresql.conf from a script
The modify_postgres_conf script for Postgres allows you to change your postgresql.conf file from the command line, via a cron job, or any time when you want to automate the process.
Postgres runs as a background daemon. The configuration parameters it runs with are stored in a file named postgresql.conf. To change the behavior of Postgres, one must usually edit this file, and then tell Postgres that you have made the changes. Sometimes all that is needed is to 'HUP' or reload Postgres. Most changes fall into this category. Other changes require a full restart of Postgres, which entails disconnecting all current clients.
Thus, to make a change, one must edit the file, find the item to change (the file consists of "name = value" lines), change it, then send a signal to the main Postgres process so it picks up the change. Finally, you should then connect to Postgres to make sure it is still running and has accepted the latest change.
Doing this automatically (such as via a cron script) is very difficult. One method, if you are doing something simple like toggling between two known configuration files, is to simply store copies of both files and replace them, like this example cronjob:
30 10 * * * cp -f conf/postgresql.conf.1 /etc/postgresql.conf; /etc/init.d/postgresql reload
50 10 * * * cp -f conf/postgresql.conf.2 /etc/postgresql.conf; /etc/init.d/postgresql reload
The major problem with that approach, as I quickly learned when I tried it, is that despite nobody making changes to the postgresql.conf file in *years*, a few days after I put the above change in place, someone decided to edit postgresql.conf. At 10:30AM the next day, their changes were blown away. A better way is to simply write a program to make the change for you. Thus, the modify_postgres_conf.pl script.
The basic usage is to tell the script where the conf file is, and list what changes you want to make. Here's an example that will change the random_page_cost to 2 on a Debian system:
./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
Here is exactly what the script does for the above statement:
- For each item to be changed, we:
- Ask the database what the current value is (and die if that parameter does not exist)
- If the current and new value are the same, do nothing
- Otherwise, open (and flock) the configuration file and change the parameter
- If no changes were made, exit
- Otherwise, close the configuration file
- Figure out the Postgres PID and send it a HUP signal
- Reconnect to the database and confirm each change has taken effect
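The file-editing part of the steps above can be sketched in a few lines. This is a Python illustration, not the Perl script itself, and it leaves out everything that needs a live server (asking the database for the current value, locking the file, sending the HUP, and re-checking). The function name is hypothetical, and the sketch assumes simple unquoted values on lines of the form "name = value".

```python
import re

def change_setting(conf_text, name, new_value, comment=None):
    """Rewrite the value of an explicitly set parameter in postgresql.conf text.

    Returns (new_text, changed). Commented-out lines are not touched.
    """
    pattern = re.compile(r"^(\s*" + re.escape(name) + r"\s*=\s*)(\S+)(.*)$")
    out = []
    changed = False
    for line in conf_text.splitlines():
        m = pattern.match(line)
        if m and m.group(2) != new_value:
            # Keep the original "name = " prefix so existing alignment survives.
            line = m.group(1) + new_value
            if comment:
                line += " ## " + comment
            changed = True
        out.append(line)
    return "\n".join(out) + "\n", changed

conf = "shared_buffers = 128MB\nrandom_page_cost   = 4\n"
new_conf, changed = change_setting(conf, "random_page_cost", "2",
                                   comment="changed by sketch")
again, changed_again = change_setting(new_conf, "random_page_cost", "2")
```

Running the same change twice leaves the text untouched the second time, mirroring the "if the current and new value are the same, do nothing" step.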
By default, it adds a comment after the changed value as well, to help in tracking down who made the change. A diff of the postgresql.conf file after running the example above produces:
diff -r1.1 postgresql.conf
499c499
< random_page_cost = 4
---
> random_page_cost = 2 ## changed by modify_postgres_conf.pl on Wed Aug 10 13:31:34 2011
The addition of the comment can be stopped by adding a --no-comment argument. If the script runs successfully, it also returns two items of information: the size and name of the current Postgres log file. This is useful so you can know exactly where in the log this change took place. Note that this only works for items that are already explicitly set in your configuration file. However, as discussed before, you should have all the items that you may possibly change explicitly listed out at the bottom of the file already. Whitespace is preserved as well, for those (like me) who like to keep things lined up neatly inside the file (see examples in the link above).
Here are some more examples of the script in action:
$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
114991 /var/log/postgres/postgres-2011-08-10.log

$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
No change made: value of "random_page_cost" is already 2

$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change random_page_cost=2 \
> --change log_statement=ddl \
> --change log_min_duration_statement=100
No change made: value of "random_page_cost" is already 2
118459 /var/log/postgres/postgres-2011-08-10.log

$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change default_statitics_target=200 --no-comment
There is no Postgres variable named "default_statitics_target"!

$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change default_statistics_target=200 --no-comment
123396 /var/log/postgres/postgres-2011-08-10.log
Note that we make no attempt to automatically check changes in to version control: as you will see in an upcoming blog post on a real-life use case, such a checkin is usually not wanted, as we are making temporary changes.
This is a fairly simple Perl script, but I thought I would put it out there in the hopes of helping others out (and preventing the reinventing of wheels). Of course, if you find a bug or want to write a patch for it, those are welcome additions at any time! The code can be found on github:
git clone git://github.com/bucardo/modify_postgres_config.git
Debian Postgres readline psql problem and the solutions
There was a bit of a controversy back in February as Debian decided to replace libreadline with libedit, which affected a number of apps, the most important of which for Postgres people is the psql utility. They did this because psql links to both OpenSSL and readline, and although psql is compatible with both, they are not compatible with each other!
By compatible, I mean that the licenses they use (OpenSSL and readline) are not, in one strict interpretation, allowed to be used together. Debian attempts to live by the letter and spirit of the law as close as possible, and thus determined that they could not bundle both together. Interestingly, Red Hat does still ship psql using OpenSSL and readline; apparently their lawyers reached a different conclusion. Or perhaps they, as a business, are being more pragmatic than strictly legal, as it's very unlikely there would be any consequence for violating the licenses in this way.
While libreadline (the library for GNU readline) is a feature rich, standard, mature, and widely used library, libedit (sadly) is not as developed and has some important bugs and shortcomings (including no home page, apparently, and no Wikipedia page!). This resulted in frustration for many Debian users, who found that their command-line history commands in psql no longer worked, and worse, psql no longer supported non-ASCII input! Since I came across this problem recently on a client machine, I thought I would lay out the current solutions.
The first and easiest solution is to simply upgrade. Debian has made a "workaround" by forcing psql to use the readline library when it is invoked.
The next best solution, for those rare cases when you cannot upgrade, is to apply Debian's solution yourself by patching the 'pg_wrapper' program that Debian uses. In order to support running different versions of Postgres on the same box in a sane and standard fashion, Debian uses some wrapper scripts around some of the Postgres command-line utilities such as psql. Thus, the psql command in /usr/bin/psql is actually a symlink to the shell script pg_wrapper, which parses some arguments and then calls the actual psql binary, which is no longer in the default path. So, to apply the Debian fix, just patch your pg_wrapper file like so:
*** pg_wrapper 2011/07/18 03:46:49 1.1
--- pg_wrapper 2011/07/18 03:48:23
***************
*** 94,100 ****
}
error 'Invalid PostgreSQL cluster version' unless -d "/usr/lib/postgresql/$version";
! my $cmd = get_program_path (((split '/', $0)[-1]), $version);
error 'pg_wrapper: invalid command name' unless $cmd;
unshift @ARGV, $cmd;
exec @ARGV;
--- 94,110 ----
}
error 'Invalid PostgreSQL cluster version' unless -d "/usr/lib/postgresql/$version";
! my $cmdname = (split '/', $0)[-1];
! my $cmd = get_program_path ($cmdname, $version);
!
! # libreadline is a lot better than libedit, so prefer that
! if ($cmdname eq 'psql') {
! my @readlines = sort(</lib/libreadline.so.?>);
! if (@readlines) {
! $ENV{'LD_PRELOAD'} = ($ENV{'LD_PRELOAD'} or '') . ':' . $readlines[-1];
! }
! }
!
error 'pg_wrapper: invalid command name' unless $cmd;
unshift @ARGV, $cmd;
exec @ARGV;
As you can see, what Debian has done is set the LD_PRELOAD environment variable to point to the libreadline shared object, which means that when psql is started, it uses the libreadline library instead of libedit. This is great news for Debian users. I'm unconvinced of how "legal" this is per Debian's standards, but then I'm in the camp that think they are interpreting all the licensing around this in the wrong way, and should have just left libreadline alone.
The next solution in line, after patching pg_wrapper, is to simply define LD_PRELOAD yourself, either globally or per user.
Another solution is to use the 'rlwrap' program, which is a wrapper around some arbitrary program (in this case, psql) which routes the user input through readline. So a quick alias would be:
alias p='rlwrap psql --no-readline'
(Yes, we could also use -n, but it's an alias and thus we don't have to type it out each time, so it's better to be more verbose). The rlwrap solution is a quick hack, and I do not recommend it, as it still leaves out many psql features, such as autocompletion and ctrl-c support.
All of this is not strictly Debian's fault. If you read the various Debian bug reports as well as some of the Postgres mailing list threads about this topic, you will find there is plenty of finger pointing going around. It seems to me the least guilty party here is readline itself, whose only fault is that it is GPL and not a better license ;). Debian should take a little blame, both for being too strict in what is obviously a very uncharted legal licensing mess, and for making this change so quickly without any announcement and apparently without realizing how many things would break. The worst offender appears to be OpenSSL, which apparently is being stubborn about changing its license to allow linking with the GPL readline. I'll throw a little bit of blame towards libedit as well, merely for its inability to keep up with 20th century ideas like Unicode (because whose database doesn't need more 麟?).
The current Debian "solution" has stilled the waters a little bit, but we (Postgres) really need a long-term solution. Or solutions, as the case may be. As with my previous post, the big question there is "who shall put the bell on the cat"? I'd like to see Debian itself fund some work into improving libedit, since they are strongly encouraging use of it over libreadline. That's solution one: improve libedit such that it becomes a decent readline replacement. This is nice because as great as libreadline is, it's one of the only pieces of Postgres that used the GPL, and it would be nice to get rid of it for that reason alone (the other big one is PostGIS).
Another solution is to replace OpenSSL, since they apparently are never going to change their license, despite it being in everyone's best interest. GnuTLS is an oft-mentioned replacement, which seems to be production ready, unlike libedit. The problem here is that psql has a lot of "openssl-isms" in the code. However, that is something that can be accomplished by the Postgres community.
Another option is to get readline to make an exception so it can play nicely with OpenSSL. Not only is this unlikely to happen, I think it's a band-aid and I'd rather see the above two actions happen instead.
So, in summary, there are really two ways out of this mess: fix up libedit (hello Debian community) and allow Postgres support for GnuTLS (or other non-OpenSSL system for that matter) (hello Postgres community).
For those wanting to dig into this some more, Greg Smith's excellent summation in this thread is a great read.
Announcing pg_blockinfo!
I'm pleased to announce the initial release of pg_blockinfo. It is a tool to examine your PostgreSQL heap data files, written in Perl.
Similar in purpose to pg_filedump, it is used to display (and soon validate) buffer-page-level information for PostgreSQL page/heap files.
pg_blockinfo aims to work in a portable and non-destructive way, using read-only "mmap", sys-level IO functions, and "unpack" in order to minimize any Perl overhead.
What we buy for the compromise of writing this in Perl instead of C is two-fold:
- portability/low impact — pg_blockinfo has no other dependencies than Perl and several core Perl modules and will work in environments where you can't or won't easily install other packages or compile based on specific headers.
- expressibility — while not currently supported in full, one of pg_blockinfo's future goals is to allow you to specify criteria for display of both page-level and tuple-level info. It will allow you to define arbitrary Perl expressions to filter the objects you're looking at (i.e., pages, tuples, etc; think "grep" but on a tuple level). It will support a DSL for querying based off of the named fields as well as allow you to supply arbitrary Perl for scanning for any criteria.
Requirements
We require a perl version with PerlIO ":mmap" support, which basically means any perl >= 5.8. We do not require any non-core perl modules; currently we only use Data::Dumper and Getopt::Long for debugging and option parsing respectively, the former only when requested.
Getting pg_blockinfo
The canonical git repo for development for pg_blockinfo is located at github: http://github.com/machack666/pg_blockinfo/.
For the development repo, simply run:
$ git clone git://github.com/machack666/pg_blockinfo.git
Or you can just grab the current script directly here.
Using pg_blockinfo
To get help including available options, canonical and alternate/abbreviated names of recognized fields, range syntax:
$ pg_blockinfo -h
To dump all fields for all page headers for all pages in a relation:
$ pg_blockinfo /path/to/relfile
To include only specific fields in the output you can specify multiple -f options and/or include multiple options per -f argument by comma delimiting. Field specifiers are processed in order, so only the final logical set will be included.
"all" is a special shorthand type which will expand to all known columns. pg_blockinfo may support other shorthand groups in the future. When no fields are provided explicitly, "all" is implicitly assumed.
There are both positive and negative field inclusions. An example of a positive inclusion is:
$ pg_blockinfo /path/to/relfile -f prune_xid,tli
This will display only the indicated fields in question for all blocks in relfile. To include all fields *except* certain ones, prefix their name with a '-' sign:
$ pg_blockinfo -f -pagesize_version /path/to/relfile
This will display all page header fields in all blocks with the exception of the pagesize_version header field.
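The in-order processing of -f specifiers described above can be sketched as follows. This is a Python illustration of one reading of those rules, not pg_blockinfo's Perl code: it assumes that a leading negative specifier starts from the full set, that "all" resets to the full set, and that output follows the tool's own field order. The function name is hypothetical, and the field list contains only the handful of field names mentioned in this post.

```python
# Only the field names that appear in this post; the real tool knows more.
KNOWN_FIELDS = ["lsn_seq", "lsn_off", "tli", "pagesize_version", "prune_xid"]

def resolve_fields(f_args, known=KNOWN_FIELDS):
    """Turn a list of -f argument strings into the final ordered field list."""
    specs = [s for arg in f_args for s in arg.split(",") if s]
    if not specs:
        return list(known)                    # no -f at all means "all"
    # A leading exclusion is read as "everything except ...".
    selected = set(known) if specs[0].startswith("-") else set()
    for spec in specs:                        # processed strictly in order
        if spec == "all":
            selected = set(known)
        elif spec.startswith("-"):
            selected.discard(spec[1:])
        elif spec in known:
            selected.add(spec)
        else:
            raise ValueError("unknown field: " + spec)
    return [f for f in known if f in selected]

only_two = resolve_fields(["prune_xid,tli"])
all_but_one = resolve_fields(["-pagesize_version"])
narrowed = resolve_fields(["lsn_seq,lsn_off,tli", "-tli"])
```

Because later specifiers simply add to or subtract from the running set, extra -f options can always be appended to an existing command line to adjust its output.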
One consequence of the way these field display options are designed (particularly going forward with additional field/tuple display options) is that you could define a "view" of the column data using a shell alias, then add/remove columns/criteria by passing additional -f options to it:
# using this as a shorthand to display just those fields
$ alias lsn='pg_blockinfo -f lsn_seq,lsn_off,tli'
$ lsn -f -tli /path/to/foo # remove fields from the display
$ lsn -f prune_xid /path/to/foo # or add to the list as well
Similar functionality is available for selecting the specific blocks available using the range option (-r or -b), which lets you specify a range of blocks to look at instead of the entire file.
$ pg_blockinfo -r 2-49 /path/to/relfile
$ pg_blockinfo -r -100 /path/to/relfile
$ pg_blockinfo -r 2,4,120-140,0xFF-0x1100 /path/to/relfile
Range options can be provided multiple times, each with one or more comma-delimited block-range specifications. Blocks are numbered from 0, can be provided in decimal or hexadecimal (when prefixed via 0x), and can appear singly or in a range (bounded or unbounded) when separated by a hyphen.
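To make the range syntax concrete, here is a small Python sketch of a parser for it. It is an illustration under assumptions, not pg_blockinfo's Perl implementation: the function name is hypothetical, an open start is taken to mean block 0, an open end is taken to mean the last block in the file, and ranges are clipped to the file's size.

```python
def parse_block_ranges(r_args, last_block):
    """Expand -r style arguments into a sorted list of block numbers.

    r_args: list of strings such as "2,4,120-140,0xFF-0x1100"
    last_block: number of the final block in the file (blocks start at 0)
    """
    blocks = set()
    for arg in r_args:
        for spec in arg.split(","):
            low_text, hyphen, high_text = spec.partition("-")
            if not hyphen:                       # single block, e.g. "4"
                low = high = int(low_text, 0)    # base 0 accepts decimal and 0x hex
            else:
                low = int(low_text, 0) if low_text else 0
                high = int(high_text, 0) if high_text else last_block
            blocks.update(range(low, min(high, last_block) + 1))
    return sorted(blocks)

mixed = parse_block_ranges(["2,4,6-8"], last_block=100)
open_start = parse_block_ranges(["-3"], last_block=100)
hex_open_end = parse_block_ranges(["0xFE-"], last_block=0x100)
```

Collecting into a set means overlapping or repeated ranges across several -r options do not produce duplicate blocks.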
Planned future features/TODO
In no particular order:
- dump tuples/tuple headers.
- better output/interpretation of bitflags.
- support alternate structures to allow detection/specification of different target versions of the page/tuple headers.
- allow querying/filtering pages/tuples.
- validation/sanity checking of various pages.
- actual extraction of ranges in the heap file.
- extract/dump tuples by raw ctid.
- allow arbitrary expressions to define powerful filtering options when querying/displaying information about the tuples/data files.
- detections of invalid toast tuple pointers/corrupted lz_compressed data (will require connection to the active system catalog).
- detect relfile type?
- mvcc queries against tuples at a given arbitrarily-constructed snapshot
- detect xids that are invalid (i.e. map to non-existent pages in the pg_clog directory).
- better/shorter name?
I look forward to any feedback, patches, or other improvements/interest.
DBD::Pg UTF-8 for PostgreSQL server_encoding
We are preparing to make a major version bump in DBD::Pg, the Perl interface for PostgreSQL, from the 2.x series to 3.x. This is due to a reworking of how we handle UTF-8. The change is not going to be backwards compatible, but will probably not affect many people. If you are using the pg_enable_utf8 flag, however, you definitely need to read on for the details.
The short version is that DBD::Pg is going to return all strings from the Postgres server with the Perl utf8 flag on. The sole exception will be databases in which the server_encoding is SQL_ASCII, in which case the flag will never be turned on.
For backwards compatibility and fine-tuning control, there is a new attribute called pg_utf8_strings that can be set at connection time to override the decision above. For example, if you need your connection to return byte-soup, non-utf8-marked strings, despite coming from a UTF-8 Postgres database, you can say:
my $dsn = 'dbi:Pg:dbname=foobar';
my $dbh = DBI->connect($dsn, $dbuser, $dbpass,
{ AutoCommit => 0,
RaiseError => 0,
PrintError => 0,
pg_utf8_strings => 0,
}
);
Similarly, you can set pg_utf8_strings to 1 and it will force returned strings to be marked as utf8, even if the backend is SQL_ASCII. You should not be using SQL_ASCII of course, and certainly not forcing the strings returned from it to UTF-8. :)
All Perl variables (be they strings or otherwise) are actually Perl objects, with some internal attributes defined on them. One of those is the utf8 flag, which can be flipped on to indicate that the string should be treated as possibly containing multi-byte characters, or it can be left off, to indicate the string should always be treated on a byte-by-byte basis. This will affect things like the Perl length function, and the Perl \w regex flag. This is completely unrelated to the Perl pragma use utf8, which DBD::Pg has nothing at all to do with. Have I mentioned that UTF-8, and UTF-8 in Perl in particular, can be quite confusing?
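As an analogy only (this is Python, not Perl, and it involves neither DBD::Pg nor the utf8 flag itself), the difference between a string with the flag on and one with it off resembles the difference between Python's str and bytes types. The sketch below shows how length and word-character matching change between the two views of the same data.

```python
import re

text = "麟"                     # character view: roughly "utf8 flag on"
raw = text.encode("utf-8")     # byte view: roughly "utf8 flag off"

char_len = len(text)           # counts characters
byte_len = len(raw)            # counts bytes of the UTF-8 encoding

# \w on a character string understands non-ASCII word characters;
# \w on a byte string only matches ASCII word bytes.
word_as_text = re.fullmatch(r"\w", text) is not None
word_as_bytes = re.fullmatch(rb"\w+", raw) is not None
```

The same three bytes are either one word character or three opaque bytes depending on how they are marked, which is why it matters whether returned strings carry the flag.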
There are a few exceptions as to what things DBD::Pg will mark as utf8. Integers and other numbers will not, boolean values will not, and no bytea data will ever have the flag set. When in doubt, assume that it is set.
The old attribute, pg_enable_utf8, will be deprecated, and have no effect. We thought about re-using that but it seemed clearer and cleaner to simply create a new variable (pg_utf8_strings), as the behavior has significantly changed.
A beta version of DBD::Pg (2.99.9_1) with these changes has been uploaded to CPAN for anyone to experiment with. Right now, none of this is set in stone, but we did want to get a working version out there to start the discussion and see how it interacts with applications that were making use of the pg_enable_utf8 flag. You can web search for "dbdpg" and look for the "Latest Dev. Release", or jump straight to the page for DBD::Pg 2.99.9_1. The trailing underscore is a CPAN convention that indicates this is a development version only, and thus will not replace the latest production version (2.18.1 as of this writing).
As a reminder, DBD::Pg has switched to using git, so you can follow along with the development with:
git clone git://bucardo.org/dbdpg.git
There is also a commits mailing list you can join to receive notifications of commits as they are pushed to the main repo. To sign up, send an email to dbd-pg-changes-subscribe@perl.org.
DBD::Pg moves to git!
Just a note to everyone that the source code repository for DBD::Pg, the official DBI driver for PostgreSQL, has moved from its old home in SVN to a git repository. All development has now moved to this repo.
We have imported the SVN revision history, so it's just a matter of pointing your git clients to:
$ git clone git://bucardo.org/dbdpg.git
For those who prefer, there is a github mirror:
$ git clone git://github.com/bucardo/dbdpg.git
Git is available via many package managers or by following the download links at http://git-scm.com/download for your platform.
Enjoy!
MongoDB replication from Postgres using Bucardo
One of the features of the upcoming version of Bucardo (a replication system for the PostgreSQL RDBMS) is the ability to replicate data to things other than PostgreSQL databases. One of those new targets is MongoDB, a non-relational 'document-based' database. (To be clear, we can only use MongoDB as a target, not as a source.)
To see this in action, let's set up a quick example, modified from the earlier blog post on running Bucardo 5. We will create a Bucardo instance that replicates from two Postgres master databases to a Postgres database target and a MongoDB instance target. We will start by setting up the prerequisites:
sudo aptitude install postgresql-server \
  perl-DBIx-Safe \
  perl-DBD-Pg \
  postgresql-contrib
Getting Postgres up and running is left as an exercise to the reader. If you have problems, the friendly folks at #postgresql on irc.freenode.net will be able to help you out.
Now for the MongoDB parts. First, we need the server itself. Your distro may have it already available, in which case it's as simple as:
aptitude install mongodb
For more installation information, follow the links from the MongoDB Quickstart page. For my test box, I ended up installing from source by following the directions at the Building for Linux page.
Once MongoDB is installed, we will need to start it up. First, create a place for MongoDB to store its data, and then launch the mongodb process:
$ mkdir /tmp/mongodata
$ mongod --dbpath=/tmp/mongodata --fork --logpath=/tmp/mongo.log
all output going to: /tmp/mongo.log
forked process: 428
You can perform a quick test that it is working by invoking the command-line shell for MongoDB (named "mongo" of course). Use quit() to exit:
$ mongo
MongoDB shell version: 1.8.1
Fri Jun 10 12:45:00 connecting to: test
> quit()
$
The other piece we need is a Perl driver so that Bucardo (which is written in Perl) can talk to the MongoDB server. Luckily, there is an excellent one available on CPAN named 'MongoDB'. We started the MongoDB server before doing this step because the driver we will install needs a running MongoDB instance to pass all of its tests. The module has very good documentation available on its CPAN page. Installation may be as easy as:
$ sudo cpan MongoDB
If that did not work for you (case matters!), there are more detailed directions on the Perl Language Center page.
Our next step is to grab the latest Bucardo, install it, and create a new Bucardo instance. See the previous blog post for more details about each step.
$ git clone git://bucardo.org/bucardo.git
Initialized empty Git repository...
$ cd bucardo
$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for Bucardo
$ make
cp bucardo.schema blib/share/bucardo.schema
cp Bucardo.pm blib/lib/Bucardo.pm
cp bucardo blib/script/bucardo
/usr/bin/perl -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/bucardo
Manifying blib/man1/bucardo.1pm
Manifying blib/man3/Bucardo.3pm
$ sudo make install
Installing /usr/local/lib/perl5/site_perl/5.10.0/Bucardo.pm
Installing /usr/local/share/bucardo/bucardo.schema
Installing /usr/local/bin/bucardo
Installing /usr/local/share/man/man1/bucardo.1pm
Installing /usr/local/share/man/man3/Bucardo.3pm
Appending installation info to /usr/lib/perl5/5.10.0/i386-linux-thread-multi/perllocal.pod
$ sudo mkdir /var/run/bucardo
$ sudo chown $USER /var/run/bucardo
$ bucardo install
This will install the bucardo database into an existing Postgres cluster.
...
Installation is now complete.
Now we create some test databases and populate with pgbench:
$ psql -c 'create database btest1'
CREATE DATABASE
$ pgbench -i btest1
NOTICE: table "pgbench_branches" does not exist, skipping
...
creating tables...
10000 tuples done.
20000 tuples done.
...
100000 tuples done.
$ psql -c 'create database btest2 template btest1'
CREATE DATABASE
$ psql -c 'create database btest3 template btest1'
CREATE DATABASE
$ psql btest3 -c 'truncate table pgbench_accounts'
TRUNCATE TABLE
$ bucardo add db t1 dbname=btest1
Added database "t1"
$ bucardo add db t2 dbname=btest2
Added database "t2"
$ bucardo add db t3 dbname=btest3
Added database "t3"
$ bucardo list dbs
Database: t1 Status: active Conn: psql -p 5432 -U bucardo -d btest1
Database: t2 Status: active Conn: psql -p 5432 -U bucardo -d btest2
Database: t3 Status: active Conn: psql -p 5432 -U bucardo -d btest3
$ bucardo add tables pgbench_accounts pgbench_branches pgbench_tellers herd=therd
Created herd "therd"
Added table "public.pgbench_accounts"
Added table "public.pgbench_branches"
Added table "public.pgbench_tellers"
$ bucardo list tables
Table: public.pgbench_accounts DB: t1 PK: aid (int4)
Table: public.pgbench_branches DB: t1 PK: bid (int4)
Table: public.pgbench_tellers DB: t1 PK: tid (int4)
The next step is to add in our MongoDB instance. The syntax is the same as the "add db" above, but we also tell it the type of database, as it is not the default of "postgres". We will also assign an arbitrary database name, "btest1", the same as the others. Everything else (such as the port and host) is default, so all we need to say is:
$ bucardo add db m1 dbname=btest1 type=mongo
Added database "m1"
$ bucardo list dbs
Database: m1 Type: mongo Status: active
Database: t1 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest1
Database: t2 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest2
Database: t3 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest3
Next we group our databases together and assign them roles:
$ bucardo add dbgroup tgroup t1:source t2:source t3:target m1:target
Created database group "tgroup"
Added database "t1" to group "tgroup" as source
Added database "t2" to group "tgroup" as source
Added database "t3" to group "tgroup" as target
Added database "m1" to group "tgroup" as target
Note that "target" is the default action, so we could shorten that to:
$ bucardo add dbgroup tgroup t1:source t2:source t3 m1
However, I think it is best to be explicit, even if it does (incorrectly) hint that m1 could be anything *other* than a target. :)
We are almost ready to go. The final step is to create a sync (a basic replication event in Bucardo), then we can start up Bucardo, put some test data into the master databases, and 'kick' the sync:
$ bucardo add sync mongotest herd=therd dbs=tgroup ping=false
Added sync "mongotest"
$ bucardo start
Checking for existing processes
Starting Bucardo
$ pgbench -t 10000 btest1
starting vacuum...end.
transaction type: TPC-B (sort of)
number of transactions actually processed: 10000/10000
...
tps = 503.300595 (excluding connections establishing)
$ pgbench -t 10000 btest2
number of transactions actually processed: 10000/10000
...
tps = 408.059368 (excluding connections establishing)
$ bucardo kick mongotest
We'll give it a few seconds to replicate those changes (it took 18 seconds on my test box), and then check the output of bucardo status:
$ bucardo status
PID of Bucardo MCP: 3317
Name        State    Last good    Time    Last I/D/C    Last bad    Time
===========+========+============+=======+=============+===========+=======
mongotest  | Good   | 21:57:47   | 11s   | 6/36234/898 | none      |
Looks good, but what about the data in MongoDB? Let's get some counts from the Postgres masters and slave, and then look at the data inside MongoDB with the mongo command-line client:
$ psql btest1 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$ psql btest2 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$ psql btest3 -c 'SELECT count(*) FROM pgbench_accounts'
18106
$ psql btest1 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ psql btest2 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ psql btest3 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:46:00
connecting to: btest1
> show collections
bucardo_status
pgbench_accounts
pgbench_branches
pgbench_tellers
system.indexes
> db.pgbench_accounts.count()
18106
> db.pgbench_accounts.find().sort({abalance:1}).limit(1).next()
{
"_id" : ObjectId("4df39bcb8795839660001de5"),
"abalance" : -12071,
"aid" : 84733,
"bid" : 1,
"filler" : " "
}
> db.pgbench_accounts.find().sort({abalance:-1}).limit(1).next()
{
"_id" : ObjectId("4df39bd08795839660002fb0"),
"abalance" : 13010,
"aid" : 45500,
"bid" : 1,
"filler" : " "
}
Why the difference in counts? We only started replicating after we populated the Postgres tables on the master databases with 100,000 rows, so the eighteen thousand is the number of rows that were changed during the subsequent pgbench run. (Note that pgbench uses randomness, so your numbers will differ from the above.) In the future Bucardo will support the "onetimecopy" feature for MongoDB, but until then we can fully populate the pgbench_accounts collection simply by "touching" all the records on one of the masters:
$ psql btest1 -c 'UPDATE pgbench_accounts SET aid=aid'
UPDATE 100000
$ bucardo kick mongotest
Kicked sync mongotest
$ echo 'db.pgbench_accounts.count()' | mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:47:00
connecting to: btest1
> 100000
> bye
A nice feature of MongoDB is its autovivification ability (aka dynamic schemas), which means unlike Postgres you do not have to create your tables first, but can simply ask MongoDB to do an insert, and it will create the table (or, in mongospeak, the collection) automatically for you.
Because MongoDB has no concept of transactions, and because Bucardo does not update, but does deletes plus inserts (for reasons I'll not get into today), there is one more trick Bucardo does when replicating to a MongoDB instance. A collection named 'bucardo_status' is created and updated at the start and the end of a sync (a replication event). Thus, your application can pause if it sees this table has a 'started' value, and wait until it sees 'complete' or 'failed'. Not foolproof by any means, but better than nothing :) You should, of course, carefully consider the way your app and Bucardo will coordinate things.
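As a rough illustration of that coordination, here is a minimal Python sketch of an application-side wait loop. It assumes a hypothetical callable, fetch_state, that returns the current status string read from the bucardo_status collection; how that value is read (driver, field name) is left out on purpose, since it depends on your setup.

```python
import time


def wait_for_sync(fetch_state, poll_seconds=0.5, timeout_seconds=30.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Block while the sync is marked 'started'; return the state seen afterwards.

    fetch_state is a hypothetical callable supplied by the application; it
    should return the status string stored in the bucardo_status collection.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = fetch_state()
        if state != "started":
            # Typically 'complete' or 'failed'; the caller decides what to do.
            return state
        if clock() >= deadline:
            raise TimeoutError("sync still marked 'started'")
        sleep(poll_seconds)


# Usage with a canned sequence of states instead of a real MongoDB read.
states = iter(["started", "started", "complete"])
final_state = wait_for_sync(lambda: next(states), poll_seconds=0)
print(final_state)  # complete
```

The timeout matters: if a sync dies without updating the collection, an application that waits forever is worse off than one that reads slightly stale data.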
Feedback from Postgres or MongoDB folk is much appreciated: there are probably some rough edges, but as you can see from above, the basics are there and working. Feel free to email the bucardo-general mailing list or make a feature request / bug report on the Bucardo Bugzilla page.
Bucardo multi-master for PostgreSQL
The next version of Bucardo, a replication system for Postgres, is almost complete. The scope of the changes required a major version bump, so this Bucardo will start at version 5.0.0. Much of the innards was rewritten, with the following goals:
Multi-master support
Where "multi" means "as many as you want"! There are no more pushdelta (master to slaves) or swap (master to master) syncs: there is simply one sync where you tell it which databases to use, and what role they play. See examples below.
Ease of use
The bucardo program (previously known as 'bucardo_ctl') has been greatly improved, making all the administrative tasks such as adding tables, creating syncs, etc. much easier.
Performance
Much of the underlying architecture was improved, and sometimes rewritten, to make things go much faster. Most striking is the difference between the old multi-master "swap syncs" and the new method, which has been described as "orders of magnitudes" faster by early testers. We use async database calls whenever possible, and no longer have the bottleneck of a single large bucardo_delta table.
Improved logging
Not only are more details provided, there is now the ability to control how verbose the logs are. Just set the log_level parameter to terse, normal, verbose, or debug. Those with busy systems, where the old logging was the equivalent of a 'debug' firehose, will really appreciate this.
Different targets
Who says your slave (target) databases need to be Postgres? In addition to the ability to write text SQL files (for say, shipping to a different system), you can have Bucardo push to other systems as well. Stay tuned for more details on this. (Update: there is a blog post about using MongoDB as a target)
This new version is not quite at beta yet, but you can try out a demo of multi-master on Postgres quite easily. Let's see if we can do it in ten steps.
I. Download all prerequisites
To run Bucardo, you will need a Postgres database (obviously), the DBIx::Safe module, the DBI and DBD::Pg modules, and (for the purposes of this demo) the pgbench utility. Systems vary, but on aptitude-based systems, one can grab all of the above like this:
aptitude install postgresql-server \
perl-DBIx-Safe \
perl-DBD-Pg \
postgresql-contrib
II. Grab the latest Bucardo
git clone git://bucardo.org/bucardo.git
III. Install the program
cd bucardo
perl Makefile.PL
make
sudo make install
You can ignore any errors that come up about ExtUtils::MakeMaker not being recent.
IV. Setup an instance of Bucardo
This step assumes there is a running Postgres available to connect to.
sudo mkdir /var/run/bucardo
sudo chown $USER /var/run/bucardo
bucardo install
V. Use the pgbench program to create some test tables
psql -c 'CREATE DATABASE btest1'
pgbench -i btest1
psql -c 'CREATE DATABASE btest2 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest3 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest4 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest5 TEMPLATE btest1'
VI. Tell Bucardo about the databases and tables you are going to use
bucardo add db t1 dbname=btest1
bucardo add db t2 dbname=btest2
bucardo add db t3 dbname=btest3
bucardo add db t4 dbname=btest4
bucardo add db t5 dbname=btest5
bucardo list dbs
bucardo add table pgbench_accounts pgbench_branches pgbench_tellers herd=therd
bucardo list tables
A herd is simply a logical grouping of tables. We did not add the other pgbench table, pgbench_history, because it has no primary key or unique index.
VII. Group the databases together and set their roles
bucardo add dbgroup tgroup t1:source t2:source t3:source t4:source t5:target
We've grouped all five databases together, and made four of them masters (aka source), and one of them a slave (aka target). You can have any combination of masters and slaves you want, as long as there is at least one master.
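To make the role rule concrete, here is a small Python sketch that parses the same name:role arguments and enforces "at least one master". The function parse_dbgroup is purely illustrative and not part of Bucardo; it assumes, as noted in the MongoDB post above, that a member with no role is treated as a target.

```python
def parse_dbgroup(members):
    """Map each 'name[:role]' entry to its role; a missing role means 'target'.

    Illustrative helper only; it mirrors the rule described in the text,
    not Bucardo's actual implementation.
    """
    roles = {}
    for member in members:
        name, _, role = member.partition(":")
        roles[name] = role or "target"
    if "source" not in roles.values():
        raise ValueError("a group needs at least one source (master)")
    return roles


group = parse_dbgroup("t1:source t2:source t3:source t4:source t5:target".split())
print(group["t4"], group["t5"])  # source target
```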
VII. Create the Bucardo sync
bucardo add sync foobar herd=therd dbs=tgroup ping=false
Here we simply create a new sync, which is a controllable replication event, telling it which tables we want to replicate, and which databases we are going to use. We also set ping to false, which means that we will not create triggers to automatically fire off replication on any changes, but will do it manually. In a real world scenario, you generally do want those triggers, or want to set Bucardo to check periodically.
VIII. Start up Bucardo
bucardo start
If all went well, you should see some information in the log.bucardo file in the current directory.
IX. Make a bunch of changes on all the source databases.
pgbench -t 10000 btest1
pgbench -t 10000 btest2
pgbench -t 10000 btest3
pgbench -t 10000 btest4
Here, we've told pgbench to run ten thousand transactions against each of the first four databases. Triggers on these tables have captured the changes.
X. Kick off the sync and watch the fun.
bucardo kick foobar
You can now tail the log.bucardo file to see the fun, or simply run:
bucardo status
...to see what it is doing, and the final counts when we are done. Don't forget to stop Bucardo when you are done testing:
bucardo stop
The output of bucardo status, after the sync has completed, should look like this:
bucardo status
Name     State    Last good    Time    Last I/D/C           Last bad    Time
========+========+============+=======+====================+===========+=======
foobar  | Good   | 17:58:37   | 3m2s  | 131836/131836/4785 | none      |
Here we see that this sync has never failed ("Last bad"), the time of day of the last good run, how long ago it was from right now (3 minutes and 2 seconds), as well as details of the last successful run. Last I/D/C stands for the number of inserts, deletes, and collisions across all databases for this sync. This is just an overview of all syncs at a high level, but we can also give status an argument of a sync name to see more details like so:
bucardo status foobar
Last good                       : Jun 02, 2011 17:57:47 (time to run: 42s)
Rows deleted/inserted/conflicts : 131,836 / 131,836 / 4,785
Sync name                       : foobar
Current state                   : Good
Source herd/database            : therd / t1
Tables in sync                  : 3
Status                          : active
Check time                      : none
Overdue time                    : 00:00:00
Expired time                    : 00:00:00
Stayalive/Kidsalive             : yes / yes
Rebuild index                   : 0
Ping                            : no
Onetimecopy                     : 0
Post-copy analyze               : Yes
Last error:                     :
This gives us a little more information about the sync itself, as well as another important metric, how long the sync itself took to run, in this case, 42 seconds. That particular metric might make its way back to the overall "status" view above. Try things out and help us find bugs and improve Bucardo!
Saving time with generate_series()
I was giving a presentation once on various SQL constructs, and, borrowing an analogy I'd seen elsewhere, described PostgreSQL's generate_series() function as something you might use in places where, in some other language, you'd use a FOR loop. One attendee asked, "So, why would you ever want a FOR loop in a SQL query?" A fair question, and one that I answered using examples later in the presentation. Another such example showed up recently on a client's system where the ORM was trying to be helpful, and chose a really bad query to do it.
The application in question was trying to display a list of records, and allow the user to search through them, modify them, filter them, etc. Since the ORM knew users might filter on a date-based field, it wanted to present a list of years containing valid records. So it did this:
SELECT DISTINCT DATE_TRUNC('year', some_date_field) FROM some_table;
In fairness to the ORM, this query wouldn't be so bad if some_table only had a few hundred or thousand rows. But in our case it has several tens of millions. This query results in a sequential scan of each of those records, in order to build a list of, as it turns out, about fifty total years. There must be a better way...
The better way we chose turns out to be, in essence, this: find the years of the maximum and minimum date values in the date field, construct a list of all years between the minimum and maximum, inclusive, and see which ones exist in the table. This date field is indexed, so finding its maximum and minimum is very fast:
SELECT
DATE_TRUNC('year', MIN(some_date_field)) AS mymin,
DATE_TRUNC('year', MAX(some_date_field)) AS mymax
FROM some_table
Here's where the FOR loop idea comes in, though it's probably better described as an "iterator" rather than a FOR loop specifically: for each year between mymin and mymax inclusive, I want a database row. The analogy may not hold terribly well, but the technique is very useful, because it will create a list of all the possible years I might be interested in, and it will do it with just two scans of the some_date_field index, rather than a sequential scan of millions of rows.
SELECT
generate_series(mymin::INTEGER, mymax::INTEGER) AS yearnum
FROM (
SELECT
DATE_PART('year', MIN(some_date_field)) AS mymin,
DATE_PART('year', MAX(some_date_field)) AS mymax
FROM some_table
) minmax_tbl
Now I simply have to convert these values to years, and see which ones exist in the underlying table:
SELECT
yearbegin::timestamptz
FROM
(
SELECT
yearnum * INTERVAL '1 year' + '0000-01-01'::date AS yearbegin
FROM (
SELECT
generate_series(mymin::INTEGER, mymax::INTEGER) AS yearnum
FROM (
SELECT
DATE_PART('year', MIN(some_date_field)) AS mymin,
DATE_PART('year', MAX(some_date_field)) AS mymax
FROM some_table
) yearnum_tbl
) beginend_tbl
WHERE
EXISTS (
SELECT 1 FROM some_table
WHERE
some_date_field BETWEEN yearbegin AND yearbegin + INTERVAL '1 year'
)
ORDER BY yearbegin ASC
;
As expected, this probes the some_date_field index twice, to get the maximum and minimum date values, and then once for each year between those values. Because of some strangely-dated data in there, that means nearly 10,000 index probes, but that's still much faster than scanning the entire table.
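The same shape can be tried without a Postgres server. The sketch below is only an approximation of the idea: it uses Python's built-in sqlite3 module, ISO-formatted text dates, and a plain Python loop standing in for generate_series(). The sample rows and the index name some_date_idx are invented for the demo, and it uses a half-open range per year rather than BETWEEN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (some_date_field TEXT)")
conn.execute("CREATE INDEX some_date_idx ON some_table (some_date_field)")
conn.executemany(
    "INSERT INTO some_table VALUES (?)",
    [("2001-03-14",), ("2001-11-02",), ("2004-07-30",), ("2005-01-01",)],
)


def years_with_rows(conn):
    # Step 1: the minimum and maximum dates bound the years worth probing.
    lo, hi = conn.execute(
        "SELECT MIN(some_date_field), MAX(some_date_field) FROM some_table"
    ).fetchone()
    if lo is None:
        return []  # empty table, nothing to report
    found = []
    # Step 2: one EXISTS probe per candidate year (the "FOR loop" part).
    for year in range(int(lo[:4]), int(hi[:4]) + 1):
        row = conn.execute(
            "SELECT EXISTS (SELECT 1 FROM some_table"
            " WHERE some_date_field >= ? AND some_date_field < ?)",
            (f"{year:04d}-01-01", f"{year + 1:04d}-01-01"),
        ).fetchone()
        if row[0]:
            found.append(year)
    return found


result = years_with_rows(conn)
print(result)  # [2001, 2004, 2005]
```

The cost is proportional to the number of candidate years, not the number of rows, which is the whole point of the rewrite.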
Postgres Bug Tracking - Help Wanted!
Once again there is talk in the Postgres community about adopting the use of a bug tracker. The latest thread, on pgsql-hackers, was started by someone asking about the status of their patch. Or rather, asking an even better meta-question about how one finds out the status of a PostgreSQL bug report or patch. Sadly, the answer is that there is no standard way, other than sending emails until someone replies one way or another. The current process works something like this:
- Someone finds a bug
- They send an email to pgsql-bugs@postgresql.org OR they use the web form, which grabs a sequential number and mails the report to pgsql-bugs@postgresql.org. Nothing else is done/stored, it just sends the email.
- Someone replies about the bug OR nobody replies about the bug.
- After a fix is found, which may involve some emails on other mailing lists, someone replies that the bug is fixed on the original thread. Maybe.
As you can see, there is some room for improvement there. Some of the most glaring holes in the current system:
- No way to search previous / existing bugs
- No way to tell the status of a bug
- No way to categorize and group bugs (per version, per platform, per component, per severity, etc.)
- No way to know who is working on a bug
- No way to prevent things from slipping through the cracks
Luckily, the above problems have been solved for many, many years now by a wide variety of bug tracking software. There have traditionally been three problems with getting a bug tracker working for the Postgres project:
Inertia
The current system is, in a very literal sense, "good enough", so it's hard to impose the inevitable short-term pain of a new system when there always seem to be more pressing matters to attend to.
Doesn't Make Julienne Fries
Everyone wants a different set of features, and getting all the hackers involved to agree on even a simple subset of desired features is pretty difficult. This is sort of similar to the crusade by myself and others to get git as the replacement version control system; there were some strong voices for competing systems (e.g. mercurial).
Who Will Put the Bell on the Cat?
Everyone talks about the problem, and there have even been some attempts over the years to implement some sort of system, but the problem remains that setting up such a system, getting it smoothly integrated into the project's work flow, and then maintaining said system is a non-trivial task. Especially when you can't be assured of buy-in from some of the major players.
I'm hopeful that the recent thread indicates a slight shift of late in global acceptance of the need for a bug tracking system. The question is, which one, and who is going to take the time to write something? I'm really hoping someone who has been lurking in the background will step up and help create something wonderful (okay, we can start with 'decent' :) ), perhaps even someone with experience setting up bug tracking systems. Certainly Postgres must be one of the last major open source projects without a bug tracker; there is plenty of hard-won experience out there to be learned from. It would also be ideal if the person or persons were *not* Postgres hackers of any sort, as building and maintaining this system would definitely take time away from their other hacking tasks. On the other hand, one could argue that a bug tracker is a vital piece of project infrastructure that is potentially as important as any other work that goes on. I certainly think so.
Only Try This At Home

Taken by Josh 6 years to the day before the release of 9.1 beta 1
For the record, 9.1 is gearing up to be an awesome release. I was tinkering and testing PostgreSQL 9.1 Beta 1 (... You are beta testing, too, right?) ... and some of the new PL/Python features caught my eye. These are minor among all the really cool high profile features, to be sure. But it made me think back to a little bit of experimental code written some time ago, and how these couple of language additions could make a big difference.
For one reason or another I'd just hit the top level postgresql.org website, and suddenly realized just how many Postgres databases it took to put together what I was seeing on the screen. Not only does it power the content database that generated the page, of course, but even the lookup of the .org went through Afilias and their Postgres-backed domain service. It's a pity the DBMS couldn't act as the middle layer between those.
Or could it?
That's a shortened form of it just for demonstration purposes (the original one had things like a table browser) ... but it works. For example, on this test 9.1 install, hit http://localhost:8000/public/webtest and the following table appears:
| generate_series | lh | rnd |
|---|---|---|
| 1 | 0 | 0.548577250913 |
| 2 | 1 | 1.70926172473 |
| 3 | 1 | 1.24841631576 |
| (etc) | ... | ... |
Note the use of two specific 9.1 features, though. The plpy object contains nice query building helper utilities like quote_ident that you may be familiar with in other languages. But this also makes use of subtransactions, which helps recover from db errors. That's important here, as something like a typo in a table name will generate an error from Postgres and without that in place the database will end the transaction and ignore any subsequent commands the function tries to run.
But with that in place, the page shows the 404 error, and picks up where it left off with subsequent requests:
Error code 404. Message: Table not found.
By the way, if it's not clear by now, don't take this anywhere near a production database, if for no other reason than that a transaction will be held open as long as that function runs. That will hold back all the nice maintenance stuff that keeps things running efficiently. Still, I think it helps show off what just a handful of lines of code can do in a powerful language like PL/Python. I'm sure with the right module PL/PerlU could do something very similar. But even more I think it shows how Postgres is growing and innovating by leaps and bounds, seemingly every day!
DBD::Pg and the libpq COPY bug
(image by kvanhorn) Version 2.18.1 of DBD::Pg, the Perl driver for Postgres, was just released. This was to fix a serious bug in which we were not properly clearing things out after performing a COPY. The only time the bug manifested, however, was when an asynchronous query was run immediately after a COPY finished. I discovered this while working on the new version of Bucardo. The failing code section was this (simplified):
## Prepare the source
my $srccmd = "COPY (SELECT * FROM $S.$T WHERE $pkcols IN ($pkvals)) TO STDOUT";
$fromdbh->do($srccmd);
## Prepare each target
for my $t (@$todb) {
my $tgtcmd = "COPY $S.$T FROM STDIN";
$t->{dbh}->do($tgtcmd);
}
## Pull a row from the source, and push it to each target
while ($fromdbh->pg_getcopydata($buffer) >= 0) {
for my $t (@$todb) {
$t->{dbh}->pg_putcopydata($buffer);
}
}
## Tell each target we are done with COPYing
for my $t (@$todb) {
$t->{dbh}->pg_putcopyend();
}
## Later on, run an asynchronous command on the source database
$sth{track}{$dbname}{$g} = $fromdbh->prepare($SQL, {pg_async => PG_ASYNC});
$sth{track}{$dbname}{$g}->execute();
This gave the error "another command is already in progress". This error did not come from Postgres or DBD::Pg, but from libpq, the underlying C library which DBD::Pg uses to talk to the database. Strangely enough, taking out the async part and running the exact same command produced no errors.
After tracking back through the libpq code, it turns out that DBD::Pg was only calling PQgetResult a single time after the copy ended. I can see why this was done: the docs for PQputCopyEnd state: "After successfully calling PQputCopyEnd, call PQgetResult to obtain the final result status of the COPY command. One can wait for this result to be available in the usual way. Then return to normal operation." What's not explicitly stated is that you need to call PQgetResult again, and keep calling it, until it returns null, to "clear out the message queue". In this case, PQgetResult pulled back a 'c' message from Postgres, via the frontend/backend protocol, indicating that the copy command was complete. However, what it really needed was to call PQgetResult two more times, once to get back a 'C' (indicating the COPY statement was complete), and once to get back a 'Z' (indicating the backend was ready for a new query). Technically, there was nothing stopping libpq from sending a fresh query except that its own internal flag, conn->asyncStatus, is not reset on a simple end of copy, but only when 'Z' is encountered. Thus, DBD::Pg 2.18.1 now calls PQgetResult until it returns null.
If your application is encountering this bug and you cannot upgrade to 2.18.1 yet, the solution is simple: perform a non-asynchronous query between the end of the copy and the start of the asynchronous query. It can be any query at all, so the above code could be cured with:
...
## Tell each target we are done with COPYing
for my $t (@$todb) {
$t->{dbh}->pg_putcopyend();
$t->{dbh}->do('SELECT 123');
}
## Later on, run an asynchronous command on the source database
$fromdbh->do('SELECT 123');
$sth{track}{$dbname}{$g} = $fromdbh->prepare($SQL, {pg_async => PG_ASYNC});
$sth{track}{$dbname}{$g}->execute();
Why does the non-asynchronous command work? Doesn't it check the conn->asyncStatus as well? The secret is that PQexecStart has this bit of code in it:
/*
* Silently discard any prior query result that application didn't eat.
* This is probably poor design, but it's here for backward compatibility.
*/
while ((result = PQgetResult(conn)) != NULL)
Wow, that code looks familiar! So it turns out that the only reason this was not spotted earlier is that non-asynchronous commands (e.g. those using PQexec) were silently clearing out the message queue, kind of as a little favor from libpq to the driver. The async function, PQsendQuery, is not as nice, so it does the correct thing and fails right away with the error seen above (via PQsendQueryStart).
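The difference between the two code paths can be shown with a toy model. The Python class below is not libpq and does not talk to any database; ToyConnection and its methods are invented for illustration, and the three queued messages simply mirror the 'c', 'C' and 'Z' sequence described above.

```python
from collections import deque


class ToyConnection:
    """Toy stand-in for a connection with a queue of unread backend messages."""

    def __init__(self):
        self.pending = deque()

    def finish_copy(self):
        # In this model, ending a COPY leaves three messages to be consumed.
        self.pending.extend(["c", "C", "Z"])

    def get_result(self):
        # Consume one pending message; None means the queue is drained.
        return self.pending.popleft() if self.pending else None

    def send_query(self, sql):
        # Asynchronous path: refuses to start while anything is unread.
        if self.pending:
            raise RuntimeError("another command is already in progress")
        return f"sent: {sql}"

    def exec_sync(self, sql):
        # Synchronous path: silently discards leftovers, then sends.
        while self.get_result() is not None:
            pass
        return self.send_query(sql)


conn = ToyConnection()
conn.finish_copy()
conn.get_result()  # the old behaviour: read one message and stop
try:
    conn.send_query("SELECT 1")
except RuntimeError as err:
    print(err)  # another command is already in progress

while conn.get_result() is not None:  # the fix: drain until None
    pass
print(conn.send_query("SELECT 1"))  # sent: SELECT 1
```

Calling exec_sync right after finish_copy succeeds without any explicit draining, which is why the bug stayed hidden for synchronous queries.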
NOTIFY vs Prepared Transactions in Postgres (the Bucardo solution)

We recently had a client use Bucardo to migrate their app from Postgres 8.2 to Postgres 9.0 with no downtime (which went great). They also were using Bucardo to replicate from the new 9.0 master to a bunch of 9.0 slaves. This ran into problems the moment the application started, as we started seeing these messages in the logs:
ERROR: cannot PREPARE a transaction that has executed LISTEN, UNLISTEN or NOTIFY
The problem is that the Postgres LISTEN/NOTIFY system cannot be used with prepared transactions. Bucardo uses a trigger on the source tables that issues a NOTIFY to let the main Bucardo daemon know that something has changed and needs to be replicated. However, their application was issuing a PREPARE TRANSACTION as an occasional part of its work. Thus, they would update the table, which would fire the trigger, which would send the NOTIFY. Then the application would issue the PREPARE TRANSACTION which produced the error given above. Bucardo is set up to deal with this situation; rather than using notify triggers, the Bucardo daemon can be set to look for any changes at a set interval. The steps to change Bucardo's behavior for a given sync are simply:
$ bucardo_ctl update sync foobar ping=false checktime=15
$ bucardo_ctl validate foobar
$ bucardo_ctl reload foobar
The first command tells the sync not to use notify triggers (these are actually statement-level triggers that simply issue a NOTIFY bucardo_kick_sync_foobar). It also sets a checktime of 15 seconds, which means that the Bucardo daemon will check for changes every 15 seconds, as if the original notify trigger were firing every 15 seconds. The second command validates the sync by checking that all supporting tables, functions, triggers, etc. are installed and up to date. It also removes triggers that are no longer needed: in this case, the statement-level notify triggers for all tables in this sync. Finally, the third command simply tells the Bucardo daemon to stop the sync, load in the new changes, and restart it.
Another solution to the problem is to simply not use prepared transactions: very few applications actually need them, but I've noticed a few that use them anyway when they should not. What exactly is a prepared transaction? It's the Postgres way of implementing two-phase commit. Basically, this means that a transaction's state is stored away on disk, and can be committed or rolled back at a later time, even by a different session. This is handy if you need to ensure that, for example, you can atomically commit across multiple database connections. By atomically, I mean that either they all commit or none of them do. This is done by doing work on each database, issuing a PREPARE TRANSACTION, and then, once all have been prepared, issuing a COMMIT PREPARED against each one.
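The coordination logic can be sketched in a few lines of Python. This is a simplified illustration, not production code: RecordingConnection is a hypothetical stand-in that only records the SQL it is given, and the transaction name 'demo' is arbitrary. It assumes each connection already has an open transaction with work done in it.

```python
class RecordingConnection:
    """Hypothetical stand-in for a database handle; it only records SQL."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.log = []

    def execute(self, sql):
        if self.fail_on_prepare and sql.startswith("PREPARE TRANSACTION"):
            raise RuntimeError(f"{self.name}: prepare failed")
        self.log.append(sql)


def two_phase_commit(connections, gid):
    """Prepare on every connection, then commit all or roll back all."""
    prepared = []
    try:
        # Phase one: every participant must successfully prepare.
        for conn in connections:
            conn.execute(f"PREPARE TRANSACTION '{gid}'")
            prepared.append(conn)
    except Exception:
        # A participant failed: undo the ones that had already prepared.
        for conn in prepared:
            conn.execute(f"ROLLBACK PREPARED '{gid}'")
        return False
    # Phase two: all prepared, so commit each stored transaction.
    for conn in prepared:
        conn.execute(f"COMMIT PREPARED '{gid}'")
    return True


a, b = RecordingConnection("a"), RecordingConnection("b")
print(two_phase_commit([a, b], "demo"))  # True
print(a.log)  # ["PREPARE TRANSACTION 'demo'", "COMMIT PREPARED 'demo'"]
```

A real coordinator also has to handle a crash between the two phases, which is exactly how forgotten prepared transactions end up lingering on a server.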
As an aside, prepared transactions are often confused with prepared statements. While the use of prepared statements is very common, use of prepared transactions is very rare. Prepared statements are simply a way of planning a query one time, then re-running it multiple times without having to run the query through the planner each time. Many interfaces, such as DBD::Pg, will do this for you automatically behind the scenes. Sometimes using prepared statements can cause issues, but it is usually a win.
As mentioned above, the use of 2PC (two-phase commit) is very rare, which is why the default for the max_prepared_transactions variable was recently changed to 0, which effectively disallows the use of prepared transactions until you explicitly turn them on in your postgresql.conf file. This helps prevent people from accidentally issuing a PREPARE TRANSACTION and then leaving the prepared transaction around. This mistake is easy to make, for once you issue the command, everything goes back to normal and it's easy to forget about it. However, having prepared transactions around is a bad thing, as they continue to hold locks, and can prevent vacuum from running. The check_postgres program even has a specific check for this situation: check_prepared_txns.
What does two-phase commit look like? There are only three basic commands: PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED. Each takes a name, which is an arbitrary string of less than 200 bytes. Usage is to start a transaction, do some work, and then issue a PREPARE TRANSACTION instead of a COMMIT. At this point, all the work you have done is gone from your session and stored on disk. You cannot get back into this transaction: you can only commit it or roll it back. See the docs on PREPARE TRANSACTION for the full details.
Here's an example of two-phase commit in action:
testdb=# BEGIN;
BEGIN
testdb=#* CREATE TABLE preptest(a int);
CREATE TABLE
testdb=#* INSERT INTO preptest VALUES (1),(2),(3);
INSERT 0 3
testdb=#* SELECT * FROM preptest;
a
---
1
2
3
(3 rows)
testdb=#* PREPARE TRANSACTION 'foobar';
PREPARE TRANSACTION
testdb=# SELECT * FROM preptest;
ERROR: relation "preptest" does not exist
LINE 1: SELECT * FROM preptest;
^
testdb=# COMMIT PREPARED 'foobar';
COMMIT PREPARED
testdb=# SELECT * FROM preptest;
a
---
1
2
3
(3 rows)
A contrived example, but you can see how easy it could be to issue a PREPARE TRANSACTION and not even realize that it actually sticks around forever!
Postgres query caching with DBIx::Cache
A few years back, I started working on a module named DBIx::Cache which would add a caching layer at the database driver level. The project that was driving it got put on hold indefinitely, so it's been on my long-term todo list to release what I did have to the public in the hope that someone else may find it useful. Hence, I've just released version 1.0.1 of DBIx::Cache. Consider it the closest thing Postgres has at the moment for query caching. :) The canonical webpage:
http://bucardo.org/wiki/DBIx-Cache
You can also grab it via git, either directly:
git clone git://bucardo.org/dbixcache.git/
or through the indispensable github:
https://github.com/bucardo/dbixcache
So, what does it do exactly? Well, the idea is that certain queries that are repeated often, are very expensive to run, or both should be cached somewhere, such that the database does not have to redo all the same work, just to return the same results over and over to the client application. Currently, the best you can hope for with Postgres is that things are in RAM from being run recently. DBIx::Cache changes this by caching the results somewhere else. The default destination is memcached.
DBIx::Cache acts as a transparent layer around your DBI calls. You can control which queries, or classes of queries get cached. Most of the basic DBI methods are overridden so that rather than query Postgres, they actually query memcached as needed (or other caching layer - could even query back into Postgres itself!). Let's look at a simple example:
use strict;
use warnings;
use Data::Dumper;
use DBIx::Cache;
use Cache::Memcached::Fast;
## Connect to an existing memcached server,
## and establish a default namespace
my $mc = Cache::Memcached::Fast->new(
{
servers => [ { address => 'localhost:11211' } ],
namespace => 'joy',
});
## Rather than DBI->connect, use DBIx::Cache->connect
## Tell it what to use as our caching source
## (the memcached server above)
my $dbh = DBIx::Cache->connect('', '', '',
{ RaiseError => 1,
dxc_cachehandle => $mc
});
## This is an expensive query, that takes 30 seconds to run:
my $SQL = 'SELECT * FROM analyze_sales_data()';
## Prepare this query
my $sth = $dbh->prepare($SQL);
## Run it ten times in a row.
## The first time takes 30 seconds, the other nine return instantly.
for (1..10) {
my $count = $sth->execute();
my $info = $sth->fetchall_arrayref({});
print Dumper $info;
}
In the above, the prepare($SQL) is actually calling the DBIx::Cache::prepare method. This parses the query and tries to determine if it is cacheable or not, then stores that decision internally. Regardless of the result, it calls DBI::prepare (which is technically DBD::Pg::prepare), and returns the result.
The magic comes in the call to execute() later on. As you might imagine, this is also actually the DBIx::Cache::execute() method. If the query is not cacheable, it simply runs it as normal and returns. If it is cacheable, and this is the first time it is run, DBIx::Cache runs an EXPLAIN EXECUTE on the original statement, and parses out a list of all tables that are used in this query. Then it caches all of this information into memcached, so that subsequent runs using the same list of arguments to execute() don't need to do that work again.
Finally, we come to fetchall_arrayref(). The first time it is run, we simply call the parent methods and get the data back. Then we build unique keys and store the results of the query into memcached. Finally, we mark the execute() as fully cached. Thus, on subsequent calls to execute(), we don't actually execute anything on the database server, but simply return the count as stashed inside of memcached (in the case of execute, this is the number of affected rows). For the various fetch() methods, we do the same thing - rather than fetch things from the database (via DBI, DBD::Pg, and libpq), we get the results from memcached (frozen via Data::Dumper), and then unpack and return them. Since we don't actually need to do any work against the database, everything returns as fast as we can query memcached - which is in general very fast indeed.
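The flow described above boils down to "build a key from the query and its arguments, and only hit the database on a miss". Here is a minimal Python sketch of that idea, not the actual DBIx::Cache code: the CachingRunner class is hypothetical, a plain dict stands in for memcached, and pickle stands in for the Data::Dumper freezing.

```python
import hashlib
import pickle

class CachingRunner:
    """Hypothetical stand-in for the execute/fetch caching flow."""

    def __init__(self, run_query):
        self.run_query = run_query  # callable(sql, args) -> list of rows
        self.store = {}             # dict standing in for memcached

    def _key(self, sql, args):
        # One unique key per (query text, argument list) combination
        raw = repr((sql, tuple(args))).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def fetchall(self, sql, args=()):
        key = self._key(sql, args)
        if key in self.store:
            # Cache hit: the database is never contacted
            return pickle.loads(self.store[key])
        rows = self.run_query(sql, args)
        self.store[key] = pickle.dumps(rows)
        return rows

calls = []

def expensive(sql, args):
    """Pretend database call; records each time it is really run."""
    calls.append((sql, args))
    return [{"region": "east", "total": 42}]

runner = CachingRunner(expensive)
results = [runner.fetchall("SELECT * FROM analyze_sales_data()") for _ in range(10)]
print(f"database was hit {len(calls)} time(s) for {len(results)} fetches")
```

Keying on the arguments as well as the query text matters: the same prepared statement run with different arguments has to be treated as a different cache entry.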
Most of the above is working, but the piece that is not written is the cache invalidation. DBIx::Cache knows which tables go to which queries, so in theory you could have (for example), an UPDATE/INSERT/DELETE trigger on table X which calls DBIx::Cache and tells it to invalidate all items related to table X, so that the next call to prepare() or execute() or fetch() will not find any memcached matches, and will re-run the whole query and store the results. You could also simply handle that in your application, of course, and have it decide when to invalidate items.
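Since the invalidation piece is unwritten, the following is purely a sketch of how table-based invalidation could work, in Python rather than Perl. The TableAwareCache class and its method names are invented for illustration, and a dict again stands in for memcached.

```python
from collections import defaultdict

class TableAwareCache:
    """Hypothetical cache that remembers which tables each entry depends on."""

    def __init__(self):
        self.results = {}                      # cache key -> cached rows
        self.keys_by_table = defaultdict(set)  # table name -> cache keys

    def put(self, key, rows, tables):
        self.results[key] = rows
        for table in tables:
            self.keys_by_table[table].add(key)

    def get(self, key):
        return self.results.get(key)

    def invalidate_table(self, table):
        """Drop every cached result that read from this table."""
        removed = 0
        for key in self.keys_by_table.pop(table, set()):
            if key in self.results:
                del self.results[key]
                removed += 1
        return removed

cache = TableAwareCache()
cache.put("q1", [("east", 42)], tables=["sales", "customers"])
cache.put("q2", [("widget", 7)], tables=["inventory"])

# What a trigger on the sales table (or the application) would ask for
print(cache.invalidate_table("sales"), "entry invalidated")
```

A query that joins two tables is registered under both, so a change to either one throws the cached result away.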
It's been a while since I've really looked at the code, but as far as I can tell it is close to being able to actually use somewhere. :) Patches and questions welcome!
DBD::Pg query cancelling in Postgres
A new version of DBD::Pg, the Perl driver for PostgreSQL, has just been released. In addition to fixing some memory leaks and other minor bugs, this release (version 2.18.0) introduces support for the DBI method known as cancel(). A giant thanks to Eric Simon, who wrote this new feature. The new method is similar to the existing pg_cancel() method, except it works on synchronous rather than asynchronous queries. I'll show an example of both below.
DBD::Pg has been able to handle asynchronous queries for a while now. Basically, that means you don't have to wait around for the database to finish a query. Your application can do other things while the query runs, then check back later to see if it has completed and grab the results. The way to cancel an already kicked-off asynchronous query is with the pg_cancel() method (the other asynchronous methods are pg_ready and pg_result, which have no synchronous equivalents).
The prefix "pg_" is used because there is no corresponding built-in DBI method to override, and the convention is to prefix everything custom to a driver with the driver's prefix, in our case 'pg'. Here's an example showing one possible use of asynchronous queries using DBD::Pg in some Perl code:
## We are connecting to two servers and running expensive
## queries on both. We kick both off right away, then wait
## for them both to finish. Our total wait time is thus
## max(server1,server2) rather than sum(server1,server2)
use strict;
use warnings;
use DBI;
use DBD::Pg qw{ :async };
my $dsn1 = 'dbi:Pg:dbname=sales;host=example1.com';
my $dsn2 = 'dbi:Pg:dbname=sales;host=example2.com';
my $dbh1 = DBI->connect($dsn1, '', '', {AutoCommit=>0, RaiseError=>1});
my $dbh2 = DBI->connect($dsn2, '', '', {AutoCommit=>0, RaiseError=>1});
my $SQL = 'SELECT gather_yearly_sales_data()';
print "Kicking off a long, expensive query on database one\n";
## Normally, a do() will not return until the query is complete
## However, the async flag causes it to return immediately
$dbh1->do($SQL, {pg_async => PG_ASYNC});
print "Kicking off a long, expensive query on database two\n";
$dbh2->do($SQL, {pg_async => PG_ASYNC});
## Both queries are running in the 'background'
## We have to wait for both, so it doesn't matter which one we wait for here
## However, if it's been over 2 minutes, we'll cancel both and quit
my $time = 0;
while ( ! $dbh1->pg_ready() ) {
sleep 1;
if ($time++ > 120) {
print "Taking too long, let's cancel the queries\n";
$dbh1->pg_cancel();
$dbh2->pg_cancel();
$dbh1->rollback();
$dbh2->rollback();
die "No sales data was retrieved\n";
}
}
## We know that database 1 has finished, so we read in the results
my $rows1 = $dbh1->pg_result();
## We then grab results from database 2
## This will block until done, which is okay
my $rows2 = $dbh2->pg_result();
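The "max rather than sum" claim is easy to check with a toy. The Python sketch below is not how DBD::Pg does it (the driver uses libpq's asynchronous calls, not threads); it uses a thread pool, and slow_query is a made-up stand-in that only sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_query(seconds):
    """Made-up stand-in for an expensive query on one server."""
    time.sleep(seconds)
    return seconds

def run_both(first, second):
    """Kick off both 'queries' at once, then wait for both results."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        future1 = pool.submit(slow_query, first)
        future2 = pool.submit(slow_query, second)
        results = (future1.result(), future2.result())
    return results, time.monotonic() - start

results, elapsed = run_both(0.2, 0.3)
print(f"waited {elapsed:.2f}s for queries taking 0.2s and 0.3s")
```

Run one after the other, the two calls would take about half a second; started together, the wait is governed by the slower of the two.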
The new method, simply known as cancel(), will kill any synchronously running query. One of the main uses for this is to time out a query by using the builtin Perl alarm function. However, since the builtin alarm function has some quirks, we will instead use the much safer POSIX::SigAction approach. Another example:
## We are running a series of queries against a database, but if
## the whole thing is taking over 30 seconds, we want to cancel
## the currently running query and move on to something else.
use strict;
use warnings;
use DBI;
use DBD::Pg qw{ :async };
## Needed for POSIX::SigSet, POSIX::SigAction, sigaction, and SIGALRM
use POSIX qw{ :signal_h };
my $dsn = 'dbi:Pg:dbname=dq';
my $dbh = DBI->connect($dsn, '', '', {AutoCommit=>0, RaiseError=>1});
## Setup all the POSIX alarm plumbing
my $mask = POSIX::SigSet->new(SIGALRM);
my $action = POSIX::SigAction->new(
sub { die "TIMEOUT\n" },
$mask,
);
my $oldaction = POSIX::SigAction->new();
sigaction( SIGALRM, $action, $oldaction );
## Prepare the queries
my $upd = $dbh->prepare('UPDATE foobar SET x=? WHERE y=?');
my $inv = $dbh->prepare('SELECT refresh_inventory(?)');
## Yes, a double eval. Async is looking better all the time :)
eval {
eval {
alarm 30;
for my $y (12,24,48) {
print "Adjusting widget #$y\n";
$upd->execute(555,$y);
print "Recalculating inventory\n";
$inv->execute($y);
}
};
alarm 0; ## Turn off our alarm
die "$@\n" if $@; ## Bubble the error to the outer eval
};
if ($@) { ## Something went wrong
if ($@ =~ /TIMEOUT/) {
print "Queries are taking too long! Cancelling\n";
## We don't know which one is still running, and don't care
## It's safe to cancel a non-active statement handle
$upd->cancel() or die qq{Failed to cancel the query!\n};
$inv->cancel() or die qq{Failed to cancel the query!\n};
$dbh->rollback();
die "Who has time to wait 30 seconds anymore?";
}
## Some other non-alarm error, so we simply:
die $@;
}
print "Updates are complete\n";
$dbh->commit();
exit;
Got an interesting use case for asynchronous queries or the new $sth->cancel()? Let me know!
Annotating Your Logs
We recently did some PostgreSQL performance analysis for a client with an application having some scaling problems. In essence, they wanted to know where Postgres was getting bogged down, and once we knew that we'd be able to target some fixes. But to get to that point, we had to gather a whole bunch of log data for analysis while the test software hit the site.
This is on Postgres 8.3 in a rather locked down environment, by the way. Coordinated pg_rotate_logfile() was useful, but occasionally it would seem to devolve to something resembling: "Okay, we're adding 60 more users ... now!" And I'd write down the time stamp, and figure out an appropriate place to slice the log file later.
Got me thinking, what if we could just drop an entry into the log file, and use it to filter things out later? My first instinct was to start looking at seeing if a patch would be accepted, maybe a wrapper for ereport(), something easy. Turns out, it's even easier than that...
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 60 users'; END;$$;
DO
Time: 0.464 ms
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 120 users'; END;$$;
DO
Time: 0.378 ms
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 360 users'; END;$$;
DO
Time: 0.700 ms
Of course the above will only work on version 9.0 and up (eventually). Previous versions that have PL/pgSQL turned on can just create a function that does the same thing. The "LOG" severity level is an informational message that's supposed to always make it into the log files. So with those in place, a grep through the log can reveal just where they appear, and sed can extract the sections of log between those lines and feed them into your favorite analysis utility:
postgres@mothra:~$ grep -n 'LOG: MARK' /var/log/postgresql/postgresql-9.0-main.log
19180:2011-03-31 20:20:37 EDT LOG: MARK: 60 users
19478:2011-03-31 20:25:48 EDT LOG: MARK: 120 users
20247:2011-03-31 20:32:15 EDT LOG: MARK: 360 users
postgres@mothra:~$ sed -n '19180,19478p' /var/log/postgresql/postgresql-9.0-main.log | bin/pgsi.pl > 60users.html
Oh, and the performance problem? Turns out it wasn't Postgres at all: the average execution time of every single query was shown to vary minimally as the concurrent user count was scaled higher and higher. But that's another story.
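If you'd rather not look up line numbers by hand, the grep and sed steps can be folded into one small function. This is only a sketch in Python: slice_between_marks is a name I made up, the sample log lines are invented, and it assumes each mark label appears in the log as "MARK: <label>".

```python
def slice_between_marks(lines, start_label, end_label):
    """Return the lines from the start mark through the end mark, inclusive.

    Mirrors what the sed line range does, but finds the two marks itself.
    """
    start = end = None
    for number, line in enumerate(lines):
        if start is None and f"MARK: {start_label}" in line:
            start = number
        elif start is not None and f"MARK: {end_label}" in line:
            end = number
            break
    if start is None or end is None:
        raise ValueError("both marks must be present, in order")
    return lines[start:end + 1]

# Invented sample log, shaped like the grep output above
log = [
    "2011-03-31 20:20:30 EDT LOG: duration: 1.204 ms",
    "2011-03-31 20:20:37 EDT LOG: MARK: 60 users",
    "2011-03-31 20:22:10 EDT LOG: duration: 3.410 ms",
    "2011-03-31 20:25:48 EDT LOG: MARK: 120 users",
    "2011-03-31 20:32:15 EDT LOG: MARK: 360 users",
]

section = slice_between_marks(log, "60 users", "120 users")
print("\n".join(section))
```

The result can then be written to a file or piped into whatever analysis tool you like, the same way the sed output was.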
Postgres Build Farm Animal Differences
I'm a big fan of the Postgres Build Farm, a distributed network of computers that are constantly installing, building, and testing Postgres to detect any problems in the code. The build farm works best when there is a wide variety of operating systems and architectures testing. Thus, while I have a rather common x86_64 Linux box available for testing, I try to make it a little unique to get better test coverage.
One thing I've been working on is clang support (clang is an alternative to gcc). Unfortunately, the latest version of clang has a bug that prevents it from building Postgres on Linux boxes. I submitted a small patch to the Postgres source to fix this, but it was decided that we'll wait until clang fixes their bug. Supposedly they have in their svn head, but I've not been able to get that to compile successfully.
So I also just installed gcc 4.6.0, the latest and greatest. Installing it was not easy (nasty problems with the mpfr dependencies), but it's done now and working. It probably won't make any difference as far as the results, but at least my box is somewhat different from all the other x86_64 Linux boxes in the farm. :)
I've asked before on the list (with no response) about what sort of configuration changes could be made to expand the range of testing. The build farm itself provides a handful of things to choose from, and most of the animals in the farm have most of them configured (I have everything except "pam" and "vpath" enabled). However, one thing I've thought about changing is NAMEDATALEN. It's basically a compile-time option that sets the maximum number of characters things like table names can have. It is set by default to 64, while the SQL spec wants it to be 128. The problem is that this causes some tests to fail, as they have a hard-coded assumption about the length. The real problem of course is that Postgres' 'make check' is a very crude test. I've got some ideas on how to fix that, but that's another post for another day. So, anyone have other ideas on how to make my particular build farm member, and others like it, more useful?
Presenting at PgEast
I'm excited to be going to the upcoming PostgreSQL East Conference. This will be both my first PostgreSQL conference to attend, as well as my first time presenting. I will be giving a talk on Bucardo entitled Bucardo: More than Just Multi-Master. I'll be in NYC for the conference, so I'll get to work for a couple days at our company's main office as well.
I look forward to learning more about PostgreSQL, putting some names and faces with some IRC nicks, and socializing with others in the PostgreSQL community; after all, Postgres' community is one of its strongest assets.
Hope to see you there!
Pausing Hot Standby Replay in PostgreSQL 9.0
When using a PostgreSQL Hot Standby master/replica pair, it can be useful to temporarily pause WAL replay on the replica. While future versions of Postgres will include the ability to pause recovery using administrative SQL functions, the current released version does not have this support. This article describes two options for pausing recovery for the rest of us that need this feature in the present. These two approaches are both based around the same basic idea: utilizing a "pause file", whose presence causes recovery to pause until the file has been removed.
Option 1: patched pg_standby
pg_standby is a fairly standard tool that is often used as a restore_command for WAL replay. I wrote a patch for it (available at my github repo) to support the "pause file" notion. The patch adds a -p path/to/pausefile optional argument, which if present will check for the pausefile and wait until it is removed before proceeding with recovery.
The benefit of patching pg_standby is that we're building on mature production-level code, adding functionality at its most relevant place. In particular, we know that signal handling is already sensibly handled (this was something I was less than positive about when it comes to the wrapper shell script described later). The downside here is that you need to compile your own version of pg_standby in order to take advantage of it. However, it may be considered useful enough of a patch to accept in the 9.0 tree, so future releases could support it out-of-the-box.
After patching, compiling, and installing the modified version of pg_standby the only change to an existing restore_command already using pg_standby would be the addition of the -p /path/to/pausefile argument; e.g.:
restore_command = 'pg_standby -p /tmp/pausefile /path/to/archive %f %p'
After restarting the standby, simply touching the /tmp/pausefile file will pause recovery until the file is subsequently removed.
Option 2: a shell script
The pause-while script is a simple wrapper script I wrote which can be used to gate the invocation of any command by checking if the "pause file" (a file path passed as the first argument) exists. If the pause file exists, we loop in a sleep cycle until it is removed. Once the pause file does not exist (or if it did not exist in the first place), we execute the rest of the provided command string.
Sample invocation:
[user@host<1>] $ touch /tmp/pausefile; pause-while /tmp/pausefile echo hi
... # pauses, notifying of status
[user@host<2>] $ rm /tmp/pausefile
... # shell 1 will now output "hi"
Here's the script:
pause-while:
#!/bin/bash
# we're trapping this signal
trap 'exit 1' INT;
PAUSE_FILE=$1;
shift;
# quote the path so a pause file with spaces in its name still works
while [ -f "$PAUSE_FILE" ]; do
    echo "'$PAUSE_FILE' present; pausing. remove to continue" >&2
    sleep 1;
    PAUSED=1
done
[ "$PAUSED" ] && echo "'$PAUSE_FILE' removed; " >&2
# untrap so we don't block the invoked command's expected signal handling
trap INT;
# now we know the pause file doesn't exist, proceed to execute our
# command as normal (quoted so arguments keep their word boundaries)
exec "$@";
We need to trap SIGINT to prevent the wrapped command from executing if the sleep cycle is interrupted.
Putting this to use in our Hot Standby case, we will want to use pause-while as a wrapper for the existing restore_command, thus adjusting recovery.conf to something like this:
restore_command = 'pause-while /tmp/standby.pause pg_standby ... <args>'
With this configuration, when you want to pause WAL replay on the replica simply touch the /tmp/standby.pause pause file and the next invocation of restore_command will wait until that file is removed before proceeding.
The wrapper script approach has the benefit of working with any defined restore_command and is not limited to just working with pg_standby.
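The gating logic itself is tiny, and seeing it outside of shell may help if you want to port it. Here is a rough Python sketch of just the wait loop, with no signal handling and no exec of the wrapped command. The wait_while_exists name and the injectable sleep argument are my own additions so the loop can be exercised without actually waiting.

```python
import os
import tempfile
import time

def wait_while_exists(pause_file, poll_seconds=1.0, sleep=time.sleep):
    """Block while pause_file exists; return how many polls found it there."""
    polls = 0
    while os.path.isfile(pause_file):
        polls += 1
        sleep(poll_seconds)
    return polls

# Demonstration: create a pause file, and have a fake sleep() remove it
# on the third poll, the way an admin would eventually rm the file.
handle, path = tempfile.mkstemp()
os.close(handle)
remaining = [3]

def fake_sleep(_seconds):
    remaining[0] -= 1
    if remaining[0] == 0:
        os.remove(path)

polls = wait_while_exists(path, sleep=fake_sleep)
print(f"paused for {polls} polls, then carried on")
```

If the pause file never existed, the loop body never runs and the caller proceeds immediately, which matches the "or if it did not exist in the first place" case above.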
Limitations
- Since this is based on WAL archive restoration, this has a very coarse granularity; recovery can only pause between WAL files, which are 16MB. It is likely that future SQL support functions will support this at arbitrary transaction boundaries and will not have this specific limitation.
- Neither of these options will work with Streaming Replication. Streaming Replication uses a non-zero exit status of the restore_command as the "End of Archive" marker to flip from archive restoration/catchup mode to WAL Streaming mode. pg_standby's default behavior (even before this patch) is to wait for the next archive file to appear before returning a zero exit status, and returning a non-zero exit status only on error, signal, or because its failover trigger file now exists. This means that if you use pg_standby as the restore_command with Streaming Replication enabled, you will never actually flip over into WAL streaming mode, and will stay pointlessly in archive restoration mode. (Technically speaking you could touch the failover trigger file; that would get you out of the archive mode, and into WAL streaming mode, but would not result in actually failing over.) It is likely that future SQL support functions for pausing recovery will not have this same dependency/limitation, and will be able to pause recovery when utilizing Streaming Replication.
- While reviewed/manually tested, these programs have not been production-tested. I've done basic testing on both the shell script and pg_standby patch, however this has not been battle-tested, and likely has some corner cases that haven't been considered (I'm particularly concerned about the shell script's signal handling interactions.)
- pg_standby has been deprecated and removed in future releases of PostgreSQL. I believe it would still be possible to compile/use pg_standby for future releases based on the version in the 9.0 source tree, but I believe it was removed because of the issues in conjunction with Streaming Replication. Presumably it (and this approach) would still be relevant if people wanted to utilize a traditional log-shipping standby with Hot Standby.
Comments/improvements welcome/appreciated!
check_postgres without Nagios (Postgres checkpoints)
Version 2.16.0 of check_postgres, a monitoring tool for Postgres, was just released. We're still trying to keep a "release often" schedule, and hopefully this year will see many releases. In addition to a few minor bug fixes, we added a new check by Nicolas Thauvin called hot_standby_delay, which, as you might have guessed from the name, calculates the streaming replication lag between a master server and one of the slaves connected to it. Obviously the servers must be running PostgreSQL 9.0 or better.
Another recently added feature (in version 2.15.0) was the simple addition of a --quiet flag. All this does is to prevent any normal output when an OK status is found. I wrote this because sometimes even Nagios is overkill. In the default mode (Nagios, the other major mode is MRTG), check_postgres will exit with one of four states, each with their own exit code: OK, WARNING, CRITICAL, or UNKNOWN. It also outputs a small message, per Nagios conventions, so a txn_idle action might exit with a value of 1 and output something similar to this:
POSTGRES_TXN_IDLE WARNING: (host:svr1) longest idle in txn: 4638s
I had a situation where I wanted to use the functionality of check_postgres (to examine the lag on a warm standby server), but did not want the overhead of adding it into Nagios, and just needed a quick email to be sent if there were any problems. Thus, the use of the quiet flag yielded a quick and cheap Nagios replacement using cron:
*/10 * * * * bin/check_postgres.pl --action=checkpoint -w 300 -c 600 --datadir=/dbdir --quiet
So every 10 minutes the script gathers the number of seconds since the last checkpoint was run. If that number is under five minutes (300 seconds), it exits silently. If it's over five minutes, it reports a WARNING, and if it's over ten minutes (600 seconds), it reports a CRITICAL. Either way, it outputs something similar to this, which cron then sends in an email:
POSTGRES_CHECKPOINT CRITICAL: Last checkpoint was 842 seconds ago
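The threshold logic is simple enough to sketch. The Python below is not check_postgres code: checkpoint_report is a made-up function, the exit codes follow the usual Nagios convention (0 OK, 1 WARNING, 2 CRITICAL), and treating a value exactly on a threshold as tripping it is my assumption.

```python
def checkpoint_report(seconds_since_checkpoint, warning=300, critical=600, quiet=True):
    """Return (exit_code, message) for a given checkpoint age in seconds."""
    if seconds_since_checkpoint >= critical:
        state, code = "CRITICAL", 2
    elif seconds_since_checkpoint >= warning:
        state, code = "WARNING", 1
    else:
        state, code = "OK", 0
    message = (
        f"POSTGRES_CHECKPOINT {state}: "
        f"Last checkpoint was {seconds_since_checkpoint} seconds ago"
    )
    if quiet and code == 0:
        # With the quiet flag, an OK result produces no output at all,
        # so cron has nothing to mail.
        message = ""
    return code, message

for age in (120, 450, 842):
    code, message = checkpoint_report(age)
    print(code, message or "(no output, no email)")
```

Cron only sends mail when a job produces output, which is why silencing the OK case is all it takes to turn the script into a cheap alerting tool.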
I'm not advocating replacing Nagios of course: there are many other good reasons to use Nagios instead of cron, but this worked well for the situation at hand. Other actions, feature requests, and patches for check_postgres are always welcome, either on the check_postgres bug tracker or the mailing list.
DBD::Pg, UTF-8, and Postgres client_encoding
Photo by Roger Smith
I've been working on getting DBD::Pg to play nicely with UTF-8, as the current system is suboptimal at best. DBD::Pg is the Perl interface to Postgres, and is the glue code that takes the data from the database (via libpq) and gives it to your Perl program. However, not all data is created equal, and that's where the complications begin.
Currently, everything coming back from the database is, by default, treated as byte soup, meaning no conversion is done, and no strings are marked as utf8 (Perl strings are actually objects in which one of the attributes you can set is 'utf8'). If you want strings marked as utf8, you must currently set the pg_enable_utf8 attribute on the database handle like so:
$dbh->{pg_enable_utf8} = 1;
This causes DBD::Pg to scan incoming strings for high bits and mark the string as utf8 if it finds them. There are a few drawbacks to this system:
- It does this for all databases, even SQL_ASCII!
- It doesn't do this for everything, e.g. arrays, custom data types, xml.
- It requires the user to remember to set pg_enable_utf8.
- It adds overhead as we have to parse every single byte coming back from the database.
Here's one proposal for a new system. Feedback welcome, as this is a tricky thing to get right.
DBD::Pg will examine the client_encoding parameter, and see if it matches UTF8. If it does, then we can assume everything coming back to us from Postgres is UTF-8. Therefore, we'll simply flip the utf8 bit on for all strings. The one exception is bytea data, of course, which we'll read in and dequote into a non-utf8 string. Any non-UTF8 client_encodings (e.g. the monstrosity that is SQL_ASCII) will simply get back a byte soup, with no utf8 markings on our part.
The pg_enable_utf8 attribute will remain, so that applications that do their own decoding, or otherwise do not want the utf8 flag set, can forcibly disable it by setting pg_enable_utf8 to 0. Similarly, it can be forced on by setting pg_enable_utf8 to 1. The flag will always trump the client_encoding parameter.
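To pin down the proposed rules, here they are as a small decision function. This is a Python sketch of the proposal only, not DBD::Pg code. The should_mark_utf8 name is invented, None stands for "attribute not set", and having bytea stay unmarked even when the flag is forced on is my reading of the proposal rather than something that has been decided.

```python
def should_mark_utf8(client_encoding, pg_enable_utf8=None, is_bytea=False):
    """Decide whether a value coming back from Postgres gets the utf8 flag."""
    if is_bytea:
        # Binary data is never text, whatever the encoding says (assumption:
        # this holds even when pg_enable_utf8 is forced on)
        return False
    if pg_enable_utf8 is not None:
        # An explicit setting always trumps client_encoding
        return bool(pg_enable_utf8)
    # Otherwise trust the connection: UTF8 means mark everything
    return client_encoding.upper() == "UTF8"

print(should_mark_utf8("UTF8"))                       # text from a UTF8 client
print(should_mark_utf8("SQL_ASCII"))                  # byte soup stays byte soup
print(should_mark_utf8("UTF8", pg_enable_utf8=0))     # application decodes itself
print(should_mark_utf8("LATIN1", pg_enable_utf8=1))   # forced on
```

Note that nothing here inspects the bytes of the string, which is the whole point: the per-byte scan for high bits goes away.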
A further complication is client_encoding: What if it defaults to something else? We can set it ourselves upon first connecting, and then if the program changes it after that point, it's on them to deal with the issues. (DBD::Pg will still assume it is UTF-8, as we don't constantly recheck the parameter.)
Someone also raised the issue of marking ASCII-only strings as utf8. While technically this is not correct, it would be nice to avoid having to parse every single byte that comes out of the database to look for high bits. Hopefully, programs requesting data from a UTF-8 database will not be surprised when things come back marked as utf8.
Feel free to comment here or on the bug that started it all. Thanks also to David Christensen, who has given me great input on this topic.
SSH config wildcards and multiple Postgres servers per client
The SSH config file has some nice features that help me to keep my sanity among a wide variety of servers spread across many different clients. Nearly all of my Postgres work is done by using SSH to connect to remote client sites, so the ability to connect to the various servers easily and intuitively is important. I'll go over an example of how a ssh config file might progress as you deal with an ever‑expanding client.
Some quick background: the ssh config file is a per‑user configuration file for the SSH program. It typically exists as ~/.ssh/config. It has two main purposes: setting global configuration items (such as ForwardX11 no), and setting things on a host‑by‑host basis. We'll be focusing on the latter.
Inside the ssh config file, you can create Host sections which specify options that apply only to one or more matching hosts. The sections are applied if the host name you type in as the argument to the ssh command matches what is after the word "Host". As we'll see, this also allows for wildcards, which can be very useful.
I'm going to walk through a hypothetical client, Acme Corporation, and show how the ssh config can grow as the client does, until the final example mirrors an actual section of my ssh config file.
So, you've just got a new Postgres client called Acme Corporation, and they are using Amazon Web Services (AWS) to host their server. We're coming in as the postgres user, and have our public ssh keys already in place inside ~postgres/.ssh/authorized_keys on their server. The hostname is ec2-456-55-123-45.compute-1.amazonaws.com. So, generally, we would connect by running:
$ ssh postgres@ec2-456-55-123-45.compute-1.amazonaws.com
That's a lot to type each time! We could create a bash alias to handle this, but it's better to use the ssh config file instead. We'll add this to the end of our ssh config:
##
## Client: Acme Corporation
##
Host acmecorp
  User postgres
  Hostname ec2-456-55-123-45.compute-1.amazonaws.com
Now we can simply use 'acmecorp' in place of that ugly string:
$ ssh acmecorp
Notice that we don't need to specify the user anymore: ssh config plugs that in for us. We can still override it if we need to connect as someone else:
$ ssh greg@acmecorp
The next week, Acme Corporation decides that rather than allow anyone to SSH to their servers, they will use iptables or something similar to restrict access to select known hosts. Because different people with different IPs at End Point may need to access Acme, and because we don't want to have Acme have to open a new hole each time we connect from a different place, we will connect from a shared company box. In this case, the box is vp.endpoint.com. Acme arranges to allow SSH from that box to their servers, and each End Point employee has a login on the vp.endpoint.com box. What we need to do now is create a SSH tunnel. Inside of the ssh config file, we add a new line to the entry for 'acmecorp':
Host acmecorp
  User postgres
  Hostname ec2-456-55-123-45.compute-1.amazonaws.com
  ProxyCommand ssh -q greg@vp.endpoint.com nc -w 180 %h %p
Now, when we run this:
$ ssh acmecorp
...everything looks the same to us, but what we are really doing is connecting to vp.endpoint.com, running the nc (netcat) command, and then connecting to the amazonaws.com box over the new netcat connection. (The arguments to netcat specify that the connection should be closed if it sits idle for 180 seconds, and that the host and port should be passed along). As far as amazonaws.com is concerned, we are connecting from vp.endpoint.com. As far as we are concerned, we are going directly to amazonaws.com. A nice side effect, and a big reason why we don't simply use bash aliases, is that the scp program will use these aliases as well. So we can now do something like this:
$ scp check_postgres.pl acmecorp:
This will copy the check_postgres.pl program from our computer to the Acme one, going through the tunnel at vp.endpoint.com.
Business has been good for Acme lately and they finally have conceded to your strong suggestion to set up a warm standby server (using Postgres' Point In Time Recovery system). This new server is located at ec2-456-55-123-99.compute-1.amazonaws.com, and the internal host name they give it is maindb-replica (the original box is known as maindb-db). This new server requires another host entry to ssh config. Rather than copy over the same ProxyCommand, we'll refactor the information out into a separate host entry. What we end up with is this:
Host acmetunnel
  User greg
  Hostname vp.endpoint.com
Host acmedb
  User postgres
  Hostname ec2-456-55-123-45.compute-1.amazonaws.com
  ProxyCommand ssh -q acmetunnel nc -w 180 %h %p
Host acmereplica
  User postgres
  Hostname ec2-456-55-123-99.compute-1.amazonaws.com
  ProxyCommand ssh -q acmetunnel nc -w 180 %h %p
We also shortened the prefix from "acmecorp" to just "acme" as that's enough to uniquely identify them among our clients, and who wants to type more than they have to?
Next, the company adds a QA box they want End Point to help setup. This box, however, is *not* reachable from outside their network; it can be reached only from other hosts in their network. Luckily, we already have access to some of those. What we'll do is extend our tunnel by one more host, so that the path we travel from us to the Acme QA box is:
Local box → vp.endpoint.com → acmereplica → acmeqa
Here's the section of the ssh config after we've added in the QA box:
Host acmetunnel
  User greg
  Hostname vp.endpoint.com
Host acmedb
  User postgres
  Hostname ec2-456-55-123-45.compute-1.amazonaws.com
  ProxyCommand ssh -q acmetunnel nc -w 180 %h %p
Host acmereplica
  User postgres
  Hostname ec2-456-55-123-99.compute-1.amazonaws.com
  ProxyCommand ssh -q acmetunnel nc -w 180 %h %p
Host acmeqa
  User postgres
  Hostname qa
  ProxyCommand ssh -q acmereplica nc -w 180 %h %p
Note that we don't need the full hostname at this point for the "acmeqa" Hostname, as we can simply say 'qa' and the acmereplica box knows how to get there.
There is still some unwanted repetition in the file, so let's take advantage of the fact that the "Host" item inside the ssh config file will take wildcards as well. It's not really apparent until you use wildcards, but a ssh host can match more than one "Host" section in the ssh config file, and thus you can achieve a form of inheritance. (However, once something has been set, it cannot be changed, so you always want to set the more specific items first). Here's what the file looks like after adding a wildcard section:
Host acmeqa
  User root
  Hostname qa
  ProxyCommand ssh -q acmereplica nc -w 180 %h %p

Host acme*
  User postgres
  ProxyCommand ssh -q greg@vp.endpoint.com nc -w 180 %h %p

Host acmedb
  Hostname ec2-456-55-123-45.compute-1.amazonaws.com

Host acmereplica
  Hostname ec2-456-55-123-99.compute-1.amazonaws.com
Notice that the file is now simplified quite a bit. If we run this command:
$ ssh acmereplica
...then the Host acme* section sets up both the User and the ProxyCommand. It then also matches on the Host acmereplica section and applies the Hostname there.
Note that we have removed the "acmetunnel" section. Now that all the ProxyCommands are in a single place, we can simply go back to the original ProxyCommand and specify the exact user and host.
All of the above presumes we want to login as the postgres user, but there are also times when we need to login as a different user (e.g. 'root'). We can again use wildcards, this time to match the end of the host, to specify which user we want. Anything ending in the letter "r" means we log in as user root, and anything ending in the letter "p" means we log in as user postgres. Our final ssh config section for Acme is now:
##
## Client: Acme Corporation
##

Host acmeqa*
  Hostname qa
  ProxyCommand ssh -q acmereplica nc -w 180 %h %p

Host acme*
  ProxyCommand ssh -q greg@vp.endpoint.com nc -w 180 %h %p

Host acme*r
  User root

Host acme*p
  User postgres

Host acmedb*
  Hostname ec2-456-55-123-45.compute-1.amazonaws.com

Host acmereplica*
  Hostname ec2-456-55-123-99.compute-1.amazonaws.com
From this point on, if Acme decides to add a new server, adding it into our ssh config is as simple as adding two lines:
Host acmedev*
  Hostname ec2-456-55-999-45.compute-1.amazonaws.com
This automatically sets up two hosts for us, "acmedevr" and "acmedevp". What if we leave out the ending "r" or "p" and just ssh to "acmedev"? Then we'll connect as the default user, or $ENV{USER} (in my case, "greg").
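To make the "first value wins" rule a little more concrete, here is a rough Python sketch of how a host name picks up settings from several matching sections. This is only a toy model, not how ssh itself is implemented: the `SECTIONS` list and the `resolve` function are made up for illustration, `fnmatch` only approximates ssh's pattern matching, and the sketch assumes the QA section is listed ahead of the wildcard section so that its ProxyCommand is the one that sticks.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, options) pairs, standing in for "Host" sections.
# Hypothetical data for illustration, modelled on the Acme config.
SECTIONS = [
    ("acmeqa*", {"Hostname": "qa",
                 "ProxyCommand": "ssh -q acmereplica nc -w 180 %h %p"}),
    ("acme*", {"ProxyCommand": "ssh -q greg@vp.endpoint.com nc -w 180 %h %p"}),
    ("acme*r", {"User": "root"}),
    ("acme*p", {"User": "postgres"}),
    ("acmedb*", {"Hostname": "ec2-456-55-123-45.compute-1.amazonaws.com"}),
    ("acmereplica*", {"Hostname": "ec2-456-55-123-99.compute-1.amazonaws.com"}),
]


def resolve(host, sections=SECTIONS):
    """Collect settings for host; the first value seen for a key is kept."""
    settings = {}
    for pattern, options in sections:
        if fnmatchcase(host, pattern):
            for key, value in options.items():
                # setdefault never overwrites, mirroring "first match wins"
                settings.setdefault(key, value)
    return settings


if __name__ == "__main__":
    for name in ("acmedbr", "acmeqap", "acmedev"):
        print(name, resolve(name))
```

In this model "acmedbr" ends up with User root plus the acmedb Hostname, "acmeqap" keeps the ProxyCommand that goes through acmereplica, and "acmedev" gets no User at all, which is why ssh falls back to the local user name.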
Have fun configuring your ssh config file, don't be afraid to leave lots of comments inside of it, and of course keep it in version control!
Version Control Visualization and End Point in Open Source
Over the weekend, I discovered an open source tool for version control visualization, Gource. I decided to put together a few videos to showcase End Point's involvement in several open source projects.
[Videos: Interchange and Bucardo, from endpoint on Vimeo.]
One of the articles that references Gource suggests that the videos can be used to visualize and analyze the community involvement of a project (open source or not). One might also be able to qualitatively analyze the stability of project file architecture from a video, but this won't reveal anything definitive about the code stability since external factors can influence file structure. For example, since I am intimately familiar with the progress of Spree, I can identify when Spree transitioned to Rails 3 in the video, which required reorganization of the Spree core functionality (read more about this here and here).
In the case of this article, I wanted to highlight End Point's involvement in a few open source projects where we've had various levels of involvement. We've contributed to Interchange since 2000. We've been involved in Spree less lately, but had more presence in early 2009. In the smaller projects Bucardo and pgsi, End Point employees have worked on a team to be the primary contributors to the projects in addition to a few external contributors. Open source is important to End Point, and it's great to see our presence demonstrated in these cute videos.
PostgreSQL 9.0 High Performance Review
I recently had the privilege of reading and reviewing the book PostgreSQL 9.0 High Performance by Greg Smith. While the title of the book suggests that it may be relevant only to PostgreSQL 9.0, there is in fact a wealth of information to be found which is relevant for all community supported versions of Postgres.
Achieving the highest performance with PostgreSQL is definitely something which touches all layers of the stack, from your specific disk hardware, OS and filesystem to the database configuration, connection/data access patterns, and queries in use. This book gathers up a lot of the information and advice that I've seen bandied about on the IRC channel and the PostgreSQL mailing lists and presents it in one place.
Though they are closely related, I believe some of the main points of the book could be summed up as:
- Measure, don't guess. From the early chapters which cover the lowest-level considerations, such as disk hardware/configuration to the later chapters which cover such topics as query optimization, replication and partitioning, considerable emphasis is placed on determining the metrics by which to measure performance before/after specific changes. This is the only way to determine the impact the changes you make have.
- Tailor to your specific needs/workflows. While there are many good rules of thumb out there when it comes to configuration/tuning, this book emphasizes the process of refining those more general numbers to tailor the configuration/setup to your specific database's needs.
- Review the information the database system itself gives you. Information provided by the pg_stat_* views can be useful in identifying bottlenecks in queries and unused or underused indexes.
This book also introduced me to a few goodies which I had not encountered previously. One of the more interesting ones is the pg_buffercache contrib module. This suite of functions allows you to peek at the internals of the shared_buffers cache to get a feel for which relations are heavily accessed on a block-by-block basis. The examples in the book show this being used to more accurately size shared_buffers based on the actual number of accesses to specific portions of different relations.
I found the book to be well-written (always a plus when reading technical books) and felt it covered quite a bit of depth given its ambitious scope. Overall, it was an informative and enjoyable read.
PostgreSQL 9.0 Admin Cookbook
I've been reading through the recently published book PostgreSQL 9.0 Admin Cookbook of late, and found that it satisfies an itch for me, at least for now. Every time I get involved in a new project, or work with a new group of people, there's a period of adjustment where I get introduced to new tools and new procedures. I enjoy seeing new (and not uncommonly, better) ways of doing the things I do regularly. At conferences I'll often spend time playing "What's on your desktop" with people I meet, to get an idea of how they do their work, and what methods they use. Questions about various peoples' favorite window manager, email reader, browser plugin, or IRC client are not uncommon. Sometimes I'm surprised by a utility or a technique I'd never known before, and sometimes it's nice just to see minor differences in the ways people do things, to expand my toolbox somewhat. This book did that for me.
As the title suggests, authors Simon Riggs and Hannu Krosing have organized their book similarly to a cookbook, made up of simple "recipes" organized in subject groups. Each recipe covers a simple topic, such as "Connecting using SSL", "Adding/Removing tablespaces", and "Managing Hot Standby", with detail sufficient to guide a user from beginning to end. Of course in many of the more complex cases some amount of detail must be skipped, and in general this book probably won't provide its reader with an in depth education, but it will provide a framework to guide further research into a particular topic. It includes a description of the manuals, and locations of some of the mailing lists to get the researcher started.
I've used PostgreSQL for many different projects and been involved in the community for several years, so I didn't find anything in the book that was completely unfamiliar. But PostgreSQL is an open source project with a large community. There exists a wide array of tools, many of which I've never had occasion to use. Reading about some of them, and seeing examples in print, was a pleasant and educational experience. For instance, one recipe describes "Selective replication using Londiste". My tool of choice for such problems is generally Bucardo, so I'd not been exposed to Londiste's way of doing things. Nor have I used pgstatspack, a project for collecting various statistics and metrics from database views which is discussed under "Collecting regular statistics from pg_stat_* views".
In short, the book gave me the opportunity to look over the shoulder of experienced PostgreSQL users and administrators to see how they go about doing things, and compare to how I've done them. I'm glad to have had the opportunity.
Upgrading old versions of Postgres
Old elephant courtesy of Photos8.com
The recent release of Postgres 9.0.0 at the start of October 2010 was not the only big news from the project. Also released were versions 7.4.30 and 8.0.26, which, as I noted in my usual PGP checksum report, are going to be the last publicly released revisions in the 7.4 and 8.0 branches. In addition, the 8.1 branch will no longer be supported by the end of 2010. If you are still using one of those branches (or something older!), this should be the incentive you need to upgrade as soon as possible. To be clear, this means that anyone running Postgres 8.1 or older is not going to get any official updates, including security and bug fixes.
A brief recap: Postgres uses major versions, containing two numbers, to indicate a major change in features and functionality. These are released about every two years. Each of these major versions has many revisions, which are released as often as needed. These revisions are designed to be completely binary compatible with the previous revision, meaning you can upgrade revisions very easily, with no dump and restore of the data needed.
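As a small illustration of that numbering scheme, here is a Python sketch that tells a revision upgrade from a major-version upgrade. The function names are invented for this example, and it assumes the two-number major versions described above (7.4, 8.0, 9.0 and so on).

```python
def major_version(version):
    """Return the major version as a tuple, e.g. '8.0.26' -> (8, 0)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_revision_upgrade(old, new):
    """True when both versions share a major version, so no dump and
    restore should be needed; False means a major-version upgrade."""
    return major_version(old) == major_version(new)


if __name__ == "__main__":
    print(is_revision_upgrade("8.0.25", "8.0.26"))  # same 8.0 branch
    print(is_revision_upgrade("8.1.21", "9.0.0"))   # different major versions
```

The first call reports a simple revision bump; the second is the kind of jump the rest of this article is about.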
Below are the options available for those running older versions of Postgres, from the most desirable to the least desirable. The three general options are to upgrade to the latest release (9.0 as I write this), migrate to a newer version, or stay on your release.
1. Upgrade to the latest release
This is the best option, as each new version of Postgres adds more features and becomes more efficient, all while maintaining the high code quality standards Postgres is known for. There are three general approaches to upgrading: pg_upgrade, pg_dump, and Bucardo / Slony.
Using pg_upgrade
The pg_upgrade utility is the preferred method for upgrading in the future. Basically, it rewrites your data directory from the "old" on-disk format to the "new" one. Unfortunately, pg_upgrade only works from version 8.3 and onwards, which means it cannot be used if you are coming from an older version. (This utility used to be called pg_migrator, in case you see references to that.)
Dump and restore
The next best method is the tried and true "dump and restore". This involves using pg_dump to create a logical representation of the old database, and then loading it into your new database with pg_restore or psql. The disadvantage to this method is time - dump and reload can take a very, very long time for large databases. Not only does the data need to get loaded into the new database tables, but all the indexes must be recreated, which can be agonizingly slow.
Replication systems
A third option is to use a replication system such as Slony or Bucardo to help with the upgrade. With Slony, you can set up a replication from the old version to the new version, and then failover to the new version once replication is caught up and running smoothly. You can do something similar with Bucardo. Note that both systems can only replicate sequences, and tables containing primary keys or unique indexes. Bucardo has a "fullcopy" mode that will copy any table, regardless of primary keys, but it's slow as it's equivalent to a full dump and restore of the table. Note that Bucardo is really only tested on the 8.X versions: for anything older, you will need to use Slony.
Even if you cannot replicate all your tables, such systems can help a migration by replicating most of your data. For example, if you have a 750 GB table full of mostly historical data, you can have Bucardo start tracking changes to the table, set up a copy on the new version (perhaps by using warm standby or a snapshot to reduce load on the master), and then start Bucardo to catch up the rows that have changed since the changes were tracked. If you do this for all your large tables, the actual upgrade process can proceed with minimal downtime by shutting down the master, doing a pg_dump of only the non-tracked tables, and then pointing your apps at the new server.
2. Migrate to a newer version
Even if you don't go to 9.0, you may want to upgrade to a newer version. Why not go all the way to 9.0? There are only two good reasons not to. One, if your system's packaging system does not have 9.0 yet, or you have custom packaging requirements that prevent you from doing so. Two, if you have concerns about application compatibility between two versions. However, that latter concern should be minimal. The largest and most disruptive compatibility change appeared in version 8.3 with the removal of implicit casts. Since 8.2 is likely to be unsupported in the next couple years, you should be going to at least 8.3. And if you can go to 8.3, you can go to 9.0.
3. Stay on your release
This is obviously the least-desirable option, but may be necessary due to real-world constraints involving time, testing, compatibility with other programs, etc. At the bare minimum, make sure you are at least running the latest revision, e.g. 7.4.30 if running 7.4. Moving forward, you will need to keep an eye on the Postgres commits list and/or the detailed release notes for new versions, and examine if any of the fixed bugs apply to your version or your situation. If they do, you'll need to figure out how to apply the patch to your older version, and then release this new version into your environment. Sound risky? It gets worse, because your patch is only being used and tested by an extremely small pool of people, has no build farm support, and is not available to the Postgres developers. If you want to go this route, there are companies familiar with the Postgres code base (including End Point) that will help you do so. But know in advance that we are also going to push you very hard to upgrade to a modern, supported version instead (which we can help you with as well, of course :).
PostgreSQL 8.4 in RHEL/CentOS 5.5
The announcement of end of support coming soon for PostgreSQL 7.4, 8.0, and 8.1 means that people who've put off upgrading their Postgres systems are running out of time before they're in the danger zone where critical bugfixes won't be available.
Given that PostgreSQL 7.4 was released in November 2003, that's nearly 7 years of support, quite a long time for free community support of an open-source project.
Many of our systems run Red Hat Enterprise Linux 5, which shipped with PostgreSQL 8.1. All indications are that Red Hat will continue to support that version of Postgres as it does all parts of a given version of RHEL during its support lifetime. But of course it would be nice to get those systems upgraded to a newer version of Postgres to get the performance and feature benefits of newer versions.
For any developers or DBAs familiar with Postgres, upgrading to a new version with RPMs from the PGDG or other custom Yum repository is not a big deal, but occasionally we've had a client worry that using packages other than the ones supplied by Red Hat is riskier.
For those holdouts still on PostgreSQL 8.1 because it's the "norm" on RHEL 5, Red Hat gave us a gift in their RHEL 5.5 update. It now includes separate PostgreSQL 8.4 packages that may optionally be used on RHEL 5 instead of PostgreSQL 8.1. (Both can't be used on the same system at the same time.)
I know that getting these packages from Red Hat shouldn't be necessary, but for those who feel jittery about using 3rd-party packages, it's a good nudge to switch to Postgres 8.4 using Red Hat's supported packages. Thanks to Tom Lane at Red Hat for making this happen. Though I don't know whose idea it was, Tom is the author of all the RPM commitlog messages, so thanks, Tom!
This brings up a few other rhetorical questions: Will RHEL 6 ship with PostgreSQL 9.0? Will RHEL 5.6 have backported PostgreSQL 9.0 in similar postgresql90 packages? It'd be great to see each new PostgreSQL release have supported packages in RHEL so that there's even less reason to start a new project on an older version of Postgres. RHEL 5.5 with PostgreSQL 8.4 is a nice start in that direction.
Postgres configuration best practices
This is the first in an occasional series of articles about configuring PostgreSQL. The main way to do this, of course, is the postgresql.conf file, which is read by the Postgres daemon on startup and contains a large number of parameters that affect the database's performance and behavior. Later posts will address specific settings inside this file, but before we do that, there are some global best practices to address.
Version Control
The single most important thing you can do is to put your postgresql.conf file into version control. I care not which one you use, but go do it right now. If you don't already have a version control system on your database box, git is a good choice to use. Barring that, RCS. Doing so is extremely easy. Just change to the directory postgresql.conf is in. The process for git:
- Install git if not there already (e.g. "sudo yum install git")
- Run: git init
- Run: git add postgresql.conf pg_hba.conf
- Run: git commit -a -m "Initial commit"
For RCS:
- Install as needed (e.g. "sudo apt-get install rcs")
- Run: mkdir RCS
- Run: ci -l postgresql.conf pg_hba.conf
Note that we checked in pg_hba.conf as well. You want to check in any file in that directory you may possibly change. For most people, that only means postgresql.conf and pg_hba.conf, but if you use other files (pg_ident.conf) check those in as well.
Ideally you want the version checked in to be the "raw" configuration files that came with the system - in other words, before you started messing with them. Then you make your initial changes and check it in. From then on of course, you commit every time you change the file.
At a bare minimum, the version control system should be telling you:
- Exactly what was changed
- When it was changed
- Who made the change
- Why it was changed
The first two items happen automatically in all version control systems, so you don't have to worry about those. The third item, "who made the change", must be entered manually if on a shared account (e.g. postgres) and using RCS. If you are using git, you can simply set the environment variables GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL. For shared accounts, I have a custom bashrc file called "gregbashrc" that is called when I log in that sets those ENVs as well as a host of other items.
The fourth item, "why it was changed", is generally the content of the commit message. Never leave this blank, and be as descriptive and verbose as possible - someone later on will be grateful you did. It's okay to be repetitive and state the obvious. If this was done as part of a specific ticket number or project name, mention that as well.
Safe Changes
It's important that the changes you make to the postgresql.conf file (or other files) actually work and don't cause Postgres to be unable to parse the file, or handle a changed setting. Never make changes and restart Postgres, because if it doesn't work, you've got a broken config file, no Postgres daemon, and most likely unhappy applications and/or users. At the very least, do a reload first (e.g. /etc/init.d/postgresql reload or just kill -HUP the PID). Check the logs and see if Postgres was happy with your changes. If you are lucky, it won't even require a restart (some changes do, some do not).
A better way to test your changes is to make them on an identical test box. That way, all the wrinkles are ironed out before you make the changes on production and attempt a reload or restart.
Another way I've found handy is to simply start a new Postgres daemon. Sounds like a lot of work, but it's pretty automatic once you've done it a few times. The process generally looks like this, assuming your production postgresql.conf is in the "data" directory, and your changes are in data/postgresql.conf.new:
- cd ..
- initdb testdata
- cp -f data/postgresql.conf.new testdata/postgresql.conf
- echo port=5555 >> testdata/postgresql.conf
- echo max_connections=10 >> testdata/postgresql.conf
The max_connections is not strictly necessary, of course, but unless you are changing something that relies on that setting, it's nicer to keep it (and the resulting memory) low.
- pg_ctl -D testdata -l test.log start
- cat test.log
- pg_ctl -D testdata stop
- rm -fr testdata (or just keep it around for next time)
The test.log file will show you any problems that might have popped up with your changes, and once it works you can be fairly confident it will work for the "main" daemon as well, so to finish up:
- cd data
- mv -f postgresql.conf.new postgresql.conf
- git commit postgresql.conf -m "Adjusted random_page_cost to 2, per bug #4151"
- kill -HUP `head -1 postmaster.pid`
- psql -c 'show random_page_cost'
Keeping it Clean
The postgresql.conf file is fairly long, and can be confusing to read with its mixture of comments, in-line comments, strange wrapping, and the commented out vs. not-commented-out variables. Hence, I recommend this system:
- Put a big notice at the top of the file asking people to make changes to the bottom
- Put all important variables at the bottom, sans comments, one per line
- Line things up
- Put into logical groups.
This avoids having to hunt for settings, prevents the gotcha of when a setting is changed twice in the file, and makes things much easier to read visually. Here's what I put at the top of the postgresql.conf:
##
## PLEASE MAKE ALL CHANGES TO THE BOTTOM OF THIS FILE!
##
I then add a good 20+ empty lines, so anyone viewing the file is forced to focus on the all-caps message above.
The next step is to put all the settings you care about at the bottom of the file. Which ones should you care about? Any setting you have changed (obviously), any setting that you *might* change in the future, and any that you may not have changed, but someone may want to look up. In practice, this means a list of about 25 items. After aligning all the values to the right and breaking things into logical groups, here's what the bottom of the postgresql.conf looks like:
## Connecting
port                            = 5432
listen_addresses                = '*'
max_connections                 = 100

## Memory
shared_buffers                  = 400MB
work_mem                        = 1MB
maintenance_work_mem            = 1GB

## Disk
fsync                           = on
synchronous_commit              = on
full_page_writes                = on
checkpoint_segments             = 100

## PITR
archive_mode                    = off
archive_command                 = ''
archive_timeout                 = 0

## Planner
effective_cache_size            = 18GB
random_page_cost                = 2

## Logging
log_destination                 = 'stderr'
logging_collector               = on
log_filename                    = 'postgres-%Y-%m-%d.log'
log_truncate_on_rotation        = off
log_rotation_age                = 1d
log_rotation_size               = 0
log_min_duration_statement      = 200
log_statement                   = 'ddl'
log_line_prefix                 = '%t %u@%d %p'

## Autovacuum
autovacuum                      = on
autovacuum_vacuum_scale_factor  = 0.1
autovacuum_analyze_scale_factor = 0.3
Because everything is in one place, at the bottom of the file, and not commented out, it's very easy to see what is going on. The groups above are somewhat arbitrary, and you can leave them out or create your own, but at least keep things grouped together as much as possible. When in doubt, use the same order as they appear in the original postgresql.conf.
Sometimes people change important settings in a group, such as for bulk loading of data. In this case, I usually make a separate group for it at the very bottom. This makes it easy to switch back and forth, and helps to prevent people from (for example) forgetting to switch fsync back on:
## Bulk loading only - leave 'on' for everyday use!
autovacuum       = off
fsync            = off
full_page_writes = off
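Because Postgres uses the last occurrence when a setting appears more than once in postgresql.conf, it can be handy to have a quick check for repeated settings, such as a bulk-loading group that someone forgot to comment back out. Below is a minimal Python sketch of such a check. The function names are invented for this example, and the parsing is deliberately simplified (it does not follow include directives, for instance), so treat it as a starting point rather than a faithful copy of the real config parser.

```python
import re
from collections import defaultdict

# name, optional "=", then the rest of the line as the value
SETTING_RE = re.compile(r"^([A-Za-z_][\w.]*)\s*=?\s*(.*)$")


def strip_comment(line):
    """Drop a trailing # comment, ignoring any # inside single quotes."""
    in_quote = False
    for i, ch in enumerate(line):
        if ch == "'":
            in_quote = not in_quote
        elif ch == "#" and not in_quote:
            return line[:i]
    return line


def find_duplicates(text):
    """Map each setting that is assigned more than once to a list of
    (line number, value) pairs, in file order. The last pair is the
    one that takes effect."""
    seen = defaultdict(list)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = strip_comment(raw).strip()
        if not line:
            continue
        match = SETTING_RE.match(line)
        if match:
            name = match.group(1).lower()  # setting names are case-insensitive
            seen[name].append((lineno, match.group(2).strip()))
    return {name: hits for name, hits in seen.items() if len(hits) > 1}


if __name__ == "__main__":
    sample = "fsync = on\nport = 5432\n## Bulk loading only\nfsync = off\n"
    for name, hits in find_duplicates(sample).items():
        print(name, hits, "-> effective value:", hits[-1][1])
```

Run against the real file (for example by reading data/postgresql.conf into a string), anything it reports is worth a second look before the next reload.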
Ownership and permissions
All the conf files should be owned by the postgres user, and the configuration files should be world-readable if possible (indeed, it's a requirement on Debian-based systems that postgresql.conf be readable for psql to work!). Be careful about SELinux as well: it can get ornery if you do things like use symlinks.
Backups
One final note - make sure you are backing up your changes as well. PITR and pg_dump won't save your postgresql.conf! If you are checking things in to a remote version control system, then some of the pressure is off, but you should have some sort of policy for backing up all your conf files explicitly. Even if using a local git repo, tarring and copying up the whole thing is usually a very quick and cheap action.
Anonymous code blocks
With the release of PostgreSQL 9.0 comes the ability to execute "anonymous code blocks" in several of the PostgreSQL procedural languages. The idea stemmed from work back in autumn of 2009 that tried to respond to a common question on IRC or the mailing lists: how do I grant a permission to a particular user for all objects in a schema? At the time, the only solution short of manually writing commands to grant the permission in question on every object individually was to write a script of some sort. Further discussion uncovered several people who often found themselves writing simple functions to handle various administrative tasks. Many of those people, it turned out, would rather simply call one statement than create a function, call the function, and then drop (or just ignore) the function they'd never need again. Hence, the new DO command.
The first language to support DO was PL/pgSQL. The PostgreSQL documentation provides an example to answer the original question: how do I grant permissions on everything to a particular user.
DO $$DECLARE r record;
BEGIN
FOR r IN SELECT table_schema, table_name FROM information_schema.tables
WHERE table_type = 'VIEW' AND table_schema = 'public'
LOOP
EXECUTE 'GRANT ALL ON ' || quote_ident(r.table_schema) || '.' || quote_ident(r.table_name) || ' TO webuser';
END LOOP;
END$$;
Notice that this doesn't actually tell us what language to use. If no language is specified, DO defaults to PL/pgSQL (which, in 9.0, is enabled by default). But you can use other languages as well:
DO $$
HAI
BTW Calculate pi using Gregory-Leibniz series
BTW This method does not converge particularly quickly...
I HAS A PIADD ITZ 0.0
I HAS A PISUB ITZ 0.0
I HAS A ITR ITZ 0
I HAS A T1
I HAS A T2
I HAS A PI ITZ 0.0
I HAS A ITERASHUNZ ITZ 1000
IM IN YR LOOP
T1 R QUOSHUNT OF 4.0 AN SUM OF 3.0 AN ITR
T2 R QUOSHUNT OF 4.0 AN SUM OF 5.0 AN ITR
PISUB R SUM OF PISUB AN T1
PIADD R SUM OF PIADD AN T2
ITR R SUM OF ITR AN 4.0
BOTH SAEM ITR AN BIGGR OF ITR AN ITERASHUNZ, O RLY?
YA RLY, GTFO
OIC
IM OUTTA YR LOOP
PI R SUM OF 4.0 AN DIFF OF PIADD AN PISUB
VISIBLE "PI R: "
VISIBLE PI
FOUND YR PI
KTHXBYE
$$ LANGUAGE PLLOLCODE;
I tried to rewrite the GRANT function shown above in PL/LOLCODE for this example, until I discovered that some of PL/LOLCODE's limitations make it extremely difficult, if not impossible. So far as I know, PL/LOLCODE was the second language to support anonymous blocks, thanks to what turned out to be a relatively simple programming exercise. After finishing PL/LOLCODE's DO support, I decided to do the same for PL/Perl. I wasn't particularly surprised to find that PL/Perl was harder to extend than PL/LOLCODE; PL/Perl is a much more feature-rich (and hence, complicated) language and I wasn't as familiar with its internals. However, after my initial submission and with helpful commentary from several other people, Andrew Dunstan tied off the loose ends and got it committed. It looks like this:
DO $$
my $row;
my $rv = spi_exec_query(q{
SELECT quote_ident(table_schema) || '.' || quote_ident(table_name) AS relname
FROM information_schema.tables WHERE table_type = 'VIEW' AND table_schema = 'public'
});
my $nrows = $rv->{processed};
foreach my $i (0 .. $nrows - 1) {
my $row = $rv->{rows}[$i];
spi_exec_query("GRANT ALL ON $row->{relname} TO webuser");
}
$$ LANGUAGE plperl;
DO wasn't the only thing to come from the pgsql-hackers discussion I mentioned above. In PostgreSQL 9.0, the GRANT command has also been modified, so it's now possible to grant permissions on several objects in one stroke. For instance:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO webuser
pg_wrapper's very symbolic links
I like pg_wrapper. For a development environment, or testing replication scenarios, it's brilliant. If you're not familiar with pg_wrapper and its family of tools, it's a set of scripts in the postgresql-common and postgresql-client-common packages available in Debian, as well as Ubuntu and other Debian-like distributions. As you may have guessed pg_wrapper itself is a wrapper script that calls the correct version of the binary you're invoking – psql, pg_dump, etc – depending on the version of the database you want to connect to. Maybe not all that exciting in itself, but implied therein is the really cool bit: This set of tools lets you manage multiple installations of Postgres, spanning multiple versions, easily and reliably.
Well, usually reliably. We were helping a client upgrade their production boxes from Postgres 8.1 to 8.4. This was just before the 9.0 release; otherwise we'd have considered moving them directly to that instead. It was going fairly smoothly until on one box we hit this message:
Could not parse locale out of pg_controldata output
Oops, they had pinned the older postgresql-common version. An upgrade of those packages and no more error!
$ pg_lsclusters
Version Cluster Port Status Owner Data directory Log file
8.1 main 5439 online postgres /var/lib/postgresql/8.1/main custom
Error: Invalid data directory
Hmm, interesting. Okay, so not quite, got a little bit more work to do. This one took some tracing through the code. The pg_wrapper scripts, if they don't already know it, look for the data directory in a couple of places. The first stop is the postgresql.conf file, specifically /etc/postgresql/<version>/<cluster-name>/postgresql.conf, looking for the data_directory parameter. But, in its transitional state at the time, the postgresql.conf was still a work in progress.
The second place it looks is a symlink in the same /etc/postgresql/<version>/<cluster-name>/ directory. While that's the old way of doing things, it at least let us get things looking reasonable:
# ln -s /var/lib/postgresql/8.4/main /etc/postgresql/8.4/main/pgdata
# /etc/init.d/postgresql-8.4 status
8.1 main 5439 online postgres /var/lib/postgresql/8.1/main custom
8.4 main 5432 online postgres /var/lib/postgresql/8.4/main custom
Voilà! From there we were able to proceed with the upgrade, confident that the instance will behave as expected. And now, everything is running great!
As with most things that provide a simpler experience on the surface, there's additional complexity under the hood. But for now, we have one more client upgraded. Thanks, Postgres!
Listen/Notify improvements in PostgreSQL 9.0
Improved listen/notify is one of the new features of Postgres 9.0 that I've been awaiting for a long time. There are basically two major changes: everything is in shared memory instead of using system tables, and full support for "payload" messages is enabled.
Before I demonstrate the changes, here's a review of what exactly the listen/notify system in Postgres is. Basically, it is an inter-process signalling system, which uses the pg_listener system table to coordinate simple named events between processes. One or more clients connect to the database and issue a command such as:
LISTEN foobar;
The name foobar can be replaced by any valid name; usually the name is something that gives a contextual clue to the listening process, such as the name of a table. Another client (or even one of the original ones) will then issue a notification like so:
NOTIFY foobar;
Each client that is listening for the 'foobar' message will receive a notification that the sender has issued the NOTIFY. It also receives the PID of the sending process. Multiple notifications are collapsed into a single notice, and the notification is not sent until a transaction is committed.
Here's some sample code using DBD::Pg that demonstrates how the system works:
#!/usr/bin/env perl
# -*-mode:cperl; indent-tabs-mode: nil-*-
use strict;
use warnings;
use DBI;
my $dsn = 'dbi:Pg:dbname=test';
my $dbh1 = DBI->connect($dsn,'test','', {AutoCommit=>0,RaiseError=>1,PrintError=>0});
my $dbh2 = DBI->connect($dsn,'test','', {AutoCommit=>0,RaiseError=>1,PrintError=>0});
print "Postgres version is $dbh1->{pg_server_version}\n";
my $SQL = 'SELECT pg_backend_pid(), version()';
my $pid1 = $dbh1->selectall_arrayref($SQL)->[0][0];
my $pid2 = $dbh2->selectall_arrayref($SQL)->[0][0];
print "Process one has a PID of $pid1\n";
print "Process two has a PID of $pid2\n";
## Process one listens for a notice named "jtx"
$dbh1->do(q{LISTEN jtx});
$dbh1->commit();
## Process one checks for any notices received
print show_notices($dbh1);
## Process two sends a notice, but does not commit
$dbh2->do(q{NOTIFY jtx});
## Process one does not see the notice yet
print show_notices($dbh1);
## Process two sends the same notice again, then commits
$dbh2->do(q{NOTIFY jtx});
$dbh2->commit();
sleep 1; ## Ensure the notice has time to propagate
## Process one receives a single notice from process two
print show_notices($dbh1);
## Now that it has seen the notice, it reports nothing again:
print show_notices($dbh1);
sub show_notices { ## Function to return any notices received
my $dbh = shift;
my $messages = '';
$dbh->commit();
while (my $n = $dbh->func('pg_notifies')) {
$messages .= "Got notice '$n->[0]' from PID $n->[1]\n";
}
return $messages || "No messages\n";
}
The output of the above script on an 8.4 Postgres server is:
Postgres version is 80401
Process one has a PID of 18238
Process two has a PID of 18239
No messages
No messages
Got notice 'jtx' from PID 18239
No messages
As expected, we got a notification only after the other process committed.
Note that because this is asynchronous and involves the system tables, we added a sleep call to ensure that the notice had time to propagate so that the other process would see it. Without the sleep, we usually see four "No messages" appear, as the script goes too fast for the pg_listener table to catch up.
Now for the aforementioned payloads. Payloads allow an arbitrary string to be attached to the notification, such that you can have a standard name like before, but you can also attach some specific text that the other processes can see. I added support for payloads to DBD::Pg back in June 2008, so let's modify the script a little bit to demonstrate the new payload mechanism:
...
## Process two sends two notices, but does not commit
$dbh2->do(q{NOTIFY jtx, 'square'});
$dbh2->do(q{NOTIFY jtx, 'square'});
## Process one does not see the notice yet
print show_notices($dbh1);
## Process two sends the same notice again, then commits
$dbh2->do(q{NOTIFY jtx, 'triangle'});
$dbh2->commit();
...
## This part changes: we get an extra item from our array:
$messages .= "Got notice '$n->[0]' from PID $n->[1] message is '$n->[2]'\n";
...
Here's what the output looks like under version 9.0 of Postgres:
Postgres version is 90000
Process one has a PID of 19089
Process two has a PID of 19090
No messages
No messages
Got notice 'jtx' from PID 19090 message is 'square'
Got notice 'jtx' from PID 19090 message is 'triangle'
No messages
Note that the collapsing of identical messages into a single notification now takes into account the message as well, so we received two notifications in the above example for the three total notifications sent. To add a payload, we simply say NOTIFY, then the name of the notification, add a comma, and specify a payload as a quoted string. Of course, the payload string is still completely optional. If no payload is specified, DBD::Pg will simply treat the payload as an empty string (this is also the behavior when you request the payload using DBD::Pg against a pre-9.0 server, so all combinations should be 100% backwards compatible).
We also got rid of the sleep. Because we are now using shared memory instead of system tables, there is no lag whatsoever, and the other process can see the notices right away.
Another large advantage to removing the pg_listener table is that systems that make heavy use of it (such as the replication systems Bucardo and Slony) no longer have to worry about bloat in these tables.
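The delivery rules described above (nothing arrives before the sender commits, and identical name-plus-payload pairs from one transaction collapse into one notice) can be illustrated with a small in-memory model. This is only a toy sketch in Python, not how PostgreSQL implements its queue; the class name `ToyNotifyBus` and its methods are invented for illustration.

```python
class ToyNotifyBus:
    """Toy in-memory model of the LISTEN/NOTIFY delivery rules (not PostgreSQL code)."""

    def __init__(self):
        self.listeners = {}  # channel name -> set of listener names
        self.inboxes = {}    # listener name -> notices delivered but not yet fetched
        self.pending = {}    # sender pid -> notices queued in its open transaction

    def listen(self, listener, channel):
        self.listeners.setdefault(channel, set()).add(listener)
        self.inboxes.setdefault(listener, [])

    def notify(self, pid, channel, payload=""):
        queued = self.pending.setdefault(pid, [])
        notice = (channel, pid, payload)
        # Same channel and same payload within one transaction: keep a single copy
        if notice not in queued:
            queued.append(notice)

    def commit(self, pid):
        # Queued notices only reach listeners when the sender commits
        for notice in self.pending.pop(pid, []):
            for listener in self.listeners.get(notice[0], ()):
                self.inboxes[listener].append(notice)

    def fetch(self, listener):
        # Return everything delivered so far and empty the inbox
        notices, self.inboxes[listener] = self.inboxes.get(listener, []), []
        return notices


# Replay the payload example: two 'square' notices, then 'triangle', then commit
bus = ToyNotifyBus()
bus.listen("one", "jtx")
bus.notify(19090, "jtx", "square")
bus.notify(19090, "jtx", "square")
before_commit = bus.fetch("one")
bus.notify(19090, "jtx", "triangle")
bus.commit(19090)
after_commit = bus.fetch("one")
```

Here `before_commit` is empty and `after_commit` holds one 'square' and one 'triangle' notice, mirroring the 9.0 output shown earlier.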
The use of payloads also means that many applications can be greatly simplified: in the past, you had to be creative in naming your notifications in order to pass meta-information to your listener. For example, Bucardo uses a large collection of notifications, meaning that the Bucardo processes had to do the equivalent of things like this:
$dbh->do(q{LISTEN bucardo_reload_config});
$dbh->do(q{LISTEN bucardo_log_message});
$dbh->do(qq{LISTEN bucardo_activate_sync_$sync});   ## qq{} so that $sync is interpolated
$dbh->do(qq{LISTEN bucardo_deactivate_sync_$sync});
$dbh->do(qq{LISTEN bucardo_kick_sync_$sync});
...
while (my $notice = $dbh->func('pg_notifies')) {
my ($name, $pid) = @$notice;
if ($name eq 'bucardo_reload_config') {
...
}
elsif ($name =~ /bucardo_kick_sync_(.+)/) {
...
}
...
}
We can instead do things like this:
$dbh->do(q{LISTEN bucardo});
...
while (my $notice = $dbh->func('pg_notifies')) {
my ($name, $pid, $msg) = @$notice;
if ($msg eq 'bucardo_reload_config') {
...
}
elsif ($msg =~ /bucardo_kick_sync_(.+)/) {
...
}
...
}
I hope to add this support to Bucardo shortly; it's simply a matter of refactoring all the listen and notify calls into a function that does the right thing depending on the server version it is attached to.
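As a rough illustration of that refactoring, here is a Python sketch (not Bucardo code) of a helper that picks the NOTIFY form based on the server version, plus a payload-based dispatcher like the loop above. The function names, the `handlers` table, the sync name `sales`, and the PID are all hypothetical; the version threshold 90000 matches the pg_server_version value printed earlier.

```python
import re


def notify_sql(server_version, message, channel="bucardo"):
    """Build a NOTIFY statement suited to the server version (illustrative helper)."""
    if server_version >= 90000:
        # 9.0 and later: one shared channel, with the message carried as the payload
        escaped = message.replace("'", "''")
        return f"NOTIFY {channel}, '{escaped}'"
    # Older servers: no payloads, so the message has to be the notification name
    return f"NOTIFY {message}"


def dispatch(notice, handlers):
    """Route a (name, pid, payload) notice to the first handler whose pattern matches."""
    name, pid, payload = notice
    for pattern, handler in handlers:
        match = re.fullmatch(pattern, payload)
        if match:
            return handler(pid, *match.groups())
    return None


log = []
handlers = [
    (r"bucardo_reload_config", lambda pid: log.append(("reload", pid))),
    (r"bucardo_kick_sync_(.+)", lambda pid, sync: log.append(("kick", sync, pid))),
]
dispatch(("bucardo", 4242, "bucardo_kick_sync_sales"), handlers)
dispatch(("bucardo", 4242, "bucardo_reload_config"), handlers)
```

Keeping the version check in one place means the calling code never needs to know which style of notification the server supports.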
PostgreSQL odd checkpoint failure
Nothing strikes fear into the heart of a DBA like error messages, particularly ones which indicate that there may be data corruption. We ran into one such situation recently during an upgrade to PostgreSQL 8.1.21. We had updated the software and had been manually running a REINDEX DATABASE command when we started to notice some errors being reported on the front end. We decided to dump the database in question to ensure we had a backup to return to; however, the dump only produced more messages:
pg_dump -Fc database1 > pgdump.database1.archive
pg_dump: WARNING: could not write block 1 of 1663/207394263/443523507
DETAIL: Multiple failures --- write error may be permanent.
pg_dump: ERROR: could not open relation 1663/207394263/443523507: No such file or directory
CONTEXT: writing block 1 of relation 1663/207394263/443523507
pg_dump: SQL command to dump the contents of table "table1" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: could not open relation 1663/207394263/443523507: No such file or directory
CONTEXT: writing block 1 of relation 1663/207394263/443523507
pg_dump: The command was: COPY public."table1" (id, field1, field2, field3) TO stdout;
Looking at the pg_database contents revealed that 207394263 was not even the database in question. I connected to the aforementioned database and looked for a relation that matched that pg_class.oid, and failing that, pg_class.relfilenode. This search revealed nothing. So where was the object itself living, and why were we getting this message?
We decided that since it appeared that something was awry with the database system in general, that we should take this opportunity to dump the tables in question. I proceeded to write a quick script to go through the database tables and dump each one individually using pg_dump's -t option. This worked for some of the tables, but not all of them, which would die with the same error. Looking at the pg_class.relpages field for the non-dumpable tables revealed that these were all the larger tables in the database. Obviously not good, since this is where the bulk of the data lay. However, we also noticed that the message that we got referenced the exact same filesystem path, so it appeared to be something separate from the table that was being dumped.
After some advice on IRC, we reviewed the logs for checkpoint logging, which revealed that checkpoints had been failing. This meant that the database could not be shut down cleanly, had we wanted to restart it to see if that cleared up the flakiness. We would only be able to shut down via a hard kill, which is definitely something to avoid, WAL or not, particularly since there had not been a checkpoint for some time. A manual CHECKPOINT also failed after a timeout.
Before we went down the road of forcing a hard server shutdown, we ended up just touching the specific relation path in question into existence and then running a CHECKPOINT. This time since the file existed, it was able to complete the checkpoint, and restore working order to the database. We successfully (and quickly) ran a full pg_dump, and went about the task of manually vetting a few of the affected tables, etc.
Our working theory for this is that somehow there was a dirty buffer that referenced a relation that no longer existed, and hence when there was a checkpoint or other event which attempted to flush shared_buffers (i.e., the loading of a large relation which would require a flush of Least Recently Used pages as in the pg_dump case), the flush attempt for the missing relation failed, which aborted the checkpoint/other action.
After the file existed and PostgreSQL had successfully synched to disk, it was a single two-block file, of which the first block was completely empty and the second block looked like an index page (due to the layout/contents of the data). The most suggestive cause was that there had been an interrupted REINDEX earlier in the day. Since this machine was showing no other signs of data corruption and everything else seemed reasonable, our best guess is that there was some race condition that had caused the relation's data to exist in memory even while the canceled REINDEX ensured that the actual relfile and the pg_class rows did not exist for the buffer.
Perl Testing - stopping the firehose
I maintain a large number of Perl modules and scripts, and one thing they all have in common is a test suite, which is basically a collection of scripts inside a "t" subdirectory used to thoroughly test the behavior of the program. When using Perl, this means you are using the awesome Test::More module, which uses the Test Anything Protocol (TAP). While I love Test::More, I often find myself needing to stop the testing entirely after a certain number of failures (usually one). This is the solution I came up with.
Normally tests are run as a group, by invoking all files named t/*.t; each file has numerous tests inside of it, and these individual tests issue a pass or a fail. At the end of each file, a summary is output stating how many tests passed and how many failed. So why is stopping after a failed test even needed? The reasons below mostly relate to the tests I write for the Bucardo program, which has a fairly large and complex test suite. Some of the reasons I like having fine-grained control of when to stop are:
- Scrolling back through screens and screens of failing tests to find the point where the test began to fail is not just annoying, but a very unproductive use of my time.
- Tests are very often dependent. If test #23 fails, it means there is a very good chance that most if not all of the subsequent tests are going to fail as well, and it makes no sense for me to look at fixing anything but test #23 first.
- Tests can take a very long time to run, and I can't wait around for the errors to start appearing and hit ctrl-c. I need to kick them off, go do something else, and then come back and have the tests stop running immediately after the first failed test. Bucardo tests, for example, create and start up four different Postgres clusters, populate the databases inside each cluster with test data, install a fresh copy of Bucardo, and *then* begin the real testing. No way I'm going to wait around for that to happen.
- Debugging is greatly aided by having the tests stop where I want them to. Often tests after the failing one will modify data and otherwise destroy the "state" such that I cannot manually duplicate the error right then and there, and thus fix it easily.
For now, my solution is to override some of the methods from Test::More. I have a standard script that does this, and I 'use' this script after I 'use Test::More' inside my test scripts. For example, a test script might look like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Test::More tests => 356;
use TestOverride;
sub some_function {
my $arr = [];
push @$arr => 4,9;
return [$arr];
}
my $t = q{Function some_function() returns correct value when called with 'foo'};
my $value = some_function('foo');
my $res = [[3],[5]];
is_deeply( $value, $res, $t);
...
$t = q{Value of baz is 123};
is ($baz, 123, $t);
...
In turn, the TestOverride file contains this:
...
use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Pad = '|';
use base 'Exporter';
our @EXPORT = qw{ is_deeply like pass is isa_ok ok };
my $bail_on_error = $ENV{TESTBAIL} || 0;
my $total_errors = 0;
sub is_deeply {
# Return right away if the test passes
my $rv = Test::More::is_deeply(@_);
return $rv if $rv;
if ($bail_on_error and ++$total_errors >= $bail_on_error) {
my ($file,$line) = (caller)[1,2];
Test::More::diag("GOT: ".Dumper $_[0]);
Test::More::diag("EXPECTED: ".Dumper $_[1]);
Test::More::BAIL_OUT "Stopping on a failed 'is_deeply' test from line $line of $file.";
}
return;
} ## end of is_deeply
sub is {
my $rv = Test::More::is(@_);
return $rv if $rv;
if ($bail_on_error and ++$total_errors >= $bail_on_error) {
my ($file,$line) = (caller)[1,2];
Test::More::BAIL_OUT "Stopping on a failed 'is' test from line $line of $file.";
}
return;
} ## end of is
The is_deeply function compares two arbitrary Perl structures (such as the arrayref here, but it can do hashes as well), and points out if they differ, and where. The "deeply" is because it will walk through the entire structure to find any differences. Good stuff.
Some things to note about the new is_deeply function: first, we simply pass in our parameters to the "real" is_deeply subroutine - the one found inside the Test::More package. If this passes (by returning true), we simply pass that truth back to the caller, and it's completely as if is_deeply had not been overridden at all. However, if the test fails, Test::More::is_deeply will output a failure notice, and we then check to see if the total number of failures for this test script ($total_errors) is greater than or equal to the threshold ($bail_on_error) that we set via the environment variable TESTBAIL. (Having it as an environment variable that defaults to zero allows the traditional behavior to be easily changed without editing any files.)
If the number of failed tests has reached our threshold, we call the BAIL_OUT method from Test::More, which not only stops the current test script from running any more tests, but stops any subsequent test files from running as well.
Before calling BAIL_OUT however, we also take advantage of the overriding to provide a little more detail about the failure. We output the line and file the test came from (because Test::More::is_deeply only sees that we are calling it from within the TestOverride.pm file). Most importantly, we output a complete dump of the expected and actual structures passed to is_deeply to be compared. The regular is_deeply only describes where the first mismatch occurs, but I often need to see the entire surrounding object. So rather than normal output looking like this:
1..356
not ok 1 - Function some_function() returns correct value when called with 'foo'
# Failed test 'Function some_function() returns correct value when called with 'foo''
# at test1.t line 18.
# Structures begin differing at:
# $got->[0] = '4'
# $expected->[0] = '3'
# Looks like you planned 356 tests but ran 1.
# Looks like you failed 1 test of 1 run.
The new output looks like this:
1..356
not ok 1 - Function some_function() returns correct value when called with 'foo'
# Failed test 'Function some_function() returns correct value when called with 'foo''
# at TestOverride.pm line 23.
# Structures begin differing at:
# $got->[0] = '4'
# $expected->[0] = '3'
# GOT: |[
# | 4,
# | [
# | 9
# | ]
# |]
# EXPECTED: |[
# | 3
# |]
Bail out! Stopping on a failed 'is_deeply' test from line 17 of test1.t.
Yes, the Test::Most module does some similar things, but I don't use it because it's yet another module dependency, it doesn't allow me to control the number of acceptable failures before bailing, and it doesn't show pretty output for is_deeply.
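For readers who don't live in Perl, the same bail-after-N-failures idea can be sketched in Python. This is an analogue of the approach, not a port of TestOverride: the `Checker` class, the `BailOut` exception, and the `is_equal` method are invented here, and only the TESTBAIL variable and its zero default come from the text above.

```python
import os


class BailOut(Exception):
    """Raised to stop the whole run once too many checks have failed."""


class Checker:
    def __init__(self, bail_on_error=None):
        # A threshold of 0 means "never bail", like the TESTBAIL default above
        if bail_on_error is None:
            bail_on_error = int(os.environ.get("TESTBAIL", "0"))
        self.bail_on_error = bail_on_error
        self.total_errors = 0

    def is_equal(self, got, expected, name):
        """Report one check; raise BailOut when the failure threshold is reached."""
        if got == expected:
            print(f"ok - {name}")
            return True
        print(f"not ok - {name}")
        self.total_errors += 1
        if self.bail_on_error and self.total_errors >= self.bail_on_error:
            # Show both complete structures, not just the first point of difference
            raise BailOut(
                f"Stopping on failed check {name!r}: got {got!r}, expected {expected!r}"
            )
        return False
```

Raising an exception plays the role of BAIL_OUT here: unless something catches it, the run ends at the failing check and later checks never get the chance to disturb the state.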
Reducing bloat without locking
It's not altogether uncommon to find a database where someone has turned off vacuuming, for a table or for the entire database. I assume people do this thinking that vacuuming is taking too much processor time or disk IO or something, and needs to be turned off. While this fixes the problem very temporarily, in the long run it causes tables to grow enormous and performance to take a dive. There are two ways to fix the problem: moving rows around to consolidate them, or rewriting the table completely. Prior to PostgreSQL 9.0, VACUUM FULL did the former; in 9.0 and above, it does the latter. CLUSTER is another suitable alternative, which also does the latter. Unfortunately all these methods require heavy table locking.
Recently I've been experimenting with an alternative method -- sort of a VACUUM FULL Lite. Vanilla VACUUM can reduce table size when the pages at the end of a table are completely empty. The trick is to empty those pages of live data. You do that by paying close attention to the table's ctid column:
5432 josh@josh# \d foo
Table "public.foo"
Column | Type | Modifiers
--------+---------+-----------
a | integer | not null
b | integer |
Indexes:
"foo_pkey" PRIMARY KEY, btree (a)
5432 josh@josh# select ctid, * from foo;
ctid | a | b
-------+---+---
(0,1) | 1 | 1
(0,2) | 2 | 2
(2 rows)
The ctid is one of several hidden columns found in each PostgreSQL table. It shows up in query results only if you explicitly ask for it, and tells you two values: a page number, and a tuple number. Pages are numbered sequentially from zero, starting with the first page in the relation's first file, and ending with the last page in its last file. Tuple numbers refer to entries within each page, and are numbered sequentially starting from one. When I update a row, the row's ctid changes, because the update creates a new version of the row and leaves the old version behind (see this page for explanation of that behavior).
5432 josh@josh# update foo set a = 3 where a = 2;
UPDATE 1
5432 josh@josh*# select ctid, * from foo;
ctid | a | b
-------+---+---
(0,1) | 1 | 1
(0,3) | 3 | 2
(2 rows)
Note the changed ctid for the second row. If I vacuum this table now, I'll see it remove one dead row version, from both the table and its associated index:
5432 josh@josh# VACUUM verbose foo;
INFO: vacuuming "public.foo"
INFO: scanned index "foo_pkey" to remove 1 row versions
DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "foo": removed 1 row versions in 1 pages
DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "foo_pkey" now contains 2 row versions in 2 pages
DETAIL: 1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "foo": found 1 removable, 2 nonremovable row versions in 1 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
1 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
So given these basics, how can I make tables smaller? Let's build a bloated table:
5432 josh@josh# truncate foo;
TRUNCATE TABLE
5432 josh@josh*# insert into foo select generate_series(1, 1000);
INSERT 0 1000
5432 josh@josh*# delete from foo where a % 2 = 0;
DELETE 500
5432 josh@josh*# select max(ctid) from foo;
max
---------
(3,234)
(1 row)
5432 josh@josh# vacuum verbose foo;
INFO: vacuuming "public.foo"
INFO: scanned index "foo_pkey" to remove 500 row versions
DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "foo": removed 500 row versions in 4 pages
...
I've filled the table with 1000 rows, and then deleted every other row. The last tuple is on the fourth page (remember they're numbered starting with zero), but since half the table is empty space, I can probably squish it into three or maybe just two pages. I'll start by moving the tuples on the last page off to another page, by updating them:
5432 josh@josh# begin;
BEGIN
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
UPDATE 117
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
UPDATE 117
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
UPDATE 21
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
UPDATE 0
5432 josh@josh*# commit;
COMMIT
Here I'm not changing the row at all, but the tuples are moving around into dead space earlier in the table; this is apparent because the number of rows affected decreases. For the first update or two, there's room enough on the page to store all the new rows, but after a few updates they have to start moving to new pages. Eventually the row count goes to zero, meaning there are no rows on or after page #3, so vacuum can truncate that page:
5432 josh@josh# vacuum verbose foo;
INFO: vacuuming "public.foo"
...
INFO: "foo": truncated 4 to 3 pages
It's important to note that I did this all within a transaction. If I hadn't, there's a possibility that vacuum would have reclaimed some of the dead space made by the updates, so instead of moving to different pages, the tuples would have moved back and forth within the same page.
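The repeat-until-zero pattern from the transaction above is easy to script. The Python sketch below assumes a caller-supplied `execute` function (hypothetical) that runs one statement inside the already-open transaction and returns the affected row count; the function name `empty_tail_pages` and the `max_rounds` safety limit are also my own. Table and column names are pasted into the SQL as-is, so they must come from a trusted source.

```python
def empty_tail_pages(execute, table, column, first_page, max_rounds=20):
    """Re-run a no-op UPDATE on rows at or beyond first_page until none remain.

    Returns the list of row counts, ending with 0 on success.
    """
    sql = f"UPDATE {table} SET {column} = {column} WHERE ctid >= '({first_page},0)'"
    counts = []
    for _ in range(max_rounds):
        moved = execute(sql)
        counts.append(moved)
        if moved == 0:
            return counts
    raise RuntimeError(
        f"rows still on page {first_page} or later after {max_rounds} rounds"
    )


# Stand-in for a database: replay the row counts from the session above
replay = iter([117, 117, 21, 0])
seen = []


def fake_execute(sql):
    seen.append(sql)
    return next(replay)


counts = empty_tail_pages(fake_execute, "foo", "a", 3)
```

With a real connection, the whole loop would still need to run inside a single transaction, followed by a commit and a plain VACUUM, for the reason given above.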
There remains one problem: I can't remove index bloat, and in fact, all this tuple-moving causes more index bloat. I can't fix that completely, but I can avoid creating too much new bloat by updating an unindexed column instead of an indexed one. In PostgreSQL 8.3 and later, the heap-only tuples (HOT) feature avoids modifying indexes if:
- the update touches only unindexed columns, and
- there's sufficient free space available for the tuple to stay on the same page.
Creativity with fuzzy string search
PostgreSQL provides a useful set of contrib modules for "fuzzy" string searching; that is, searching for something that sounds like or looks like the original search key, but that might not exactly match. One place this type of searching shows up frequently is when looking for people's names. For instance, a receptionist at the dentist's office doesn't want to have to ask for the exact spelling of your name every time you call asking for an appointment, so the scheduling application allows "fuzzy" searches, and the receptionist doesn't have to get it exactly right to find out who you really are. The PostgreSQL documentation provides an excellent introduction to the topic in terms of the available modules; this blog post also demonstrates some of the things they can do.
The TriSano application was originally written to use soundex search alone to find patient names, but that proved insufficient, particularly because common-sounding last names with unusual spellings would be ranked very poorly in the search results. Our solution, which has worked quite well in practice, involved creative use of PostgreSQL's full-text search combined with the pg_trgm contrib module.
A trigram is a set of three characters. In the case of pg_trgm, it's three adjacent characters taken from a given input text. The pg_trgm module provides easy ways to extract all possible trigrams from an input, and compare them with similar sets taken from other inputs. Two strings that generate similar trigram lists are, in theory, similar strings. There's no particular reason you couldn't use two, four, or some other number of characters instead of trigrams, but you'd trade sensitivity and variability. And as the name implies, pg_trgm only supports trigrams.
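To make the idea concrete, here is a small Python approximation of trigram extraction and comparison. It follows the documented pg_trgm convention of lowercasing and padding each word with two leading spaces and one trailing space, but it is only a sketch: it handles ASCII letters and digits, and the function names `trigrams` and `similarity` are chosen here for illustration rather than taken from the module's source.

```python
import re


def trigrams(text):
    """Approximate pg_trgm-style trigram set: lowercase, split into words, pad each word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams found in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Under these rules "cat" yields the four trigrams "  c", " ca", "cat", and "at ", and a one-letter misspelling such as "Elswik" for "Elswick" still shares half of the combined trigram set.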
Straight trigram search didn't buy us much on top of soundex, so we got a bit more creative. A trigram is just a set of three characters, which looks pretty much just like a word, so we thought we'd try using PostgreSQL's full text search on trigram data. Typically full text search has a list of "stop words": un-indexed words judged too common and too short to contribute meaningfully to an index. Our words would all be three characters long, so we had to create a new text search configuration using a dictionary with an empty stop word list. With that text search configuration, we could index trigrams effectively.
This search helped, but wasn't quite good enough. We finally borrowed a simplified version of a data mining technique called "boosting", which involves using multiple "weak" classifiers or searchers to create one relatively good result set. We combined straightforward trigram, soundex, and metaphone searches with a normal full text search of the unmodified name data and a full text search over the trigrams generated from the names. The data sizes in question aren't particularly large, so this amount of searching hasn't proven unsustainably taxing on processor power, and it provides excellent results. The code is on github; feel free to try it out.
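The general shape of combining several weak searchers can be sketched in a few lines of Python. To be clear about the assumptions: the real ranking logic lives in the SQL on github and is not reproduced here, summing the scores is just one simple way to combine them, and the searcher functions, record ids, and scores below are made up for the example.

```python
from collections import defaultdict


def combine_searches(query, searchers):
    """Merge hits from several weak searchers into one ranked list.

    searchers maps a source name to a function returning (record_id, score) pairs.
    Summing scores is an illustrative choice, not the article's actual formula.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for name, search in searchers.items():
        for record_id, score in search(query):
            scores[record_id] += score
            sources[record_id].append(name)
    # Highest combined score first; ties broken by record id for stable output
    ranked = sorted(scores, key=lambda rid: (-scores[rid], rid))
    return [(rid, sources[rid], scores[rid]) for rid in ranked]


# Hypothetical searchers with canned results
searchers = {
    "name_trgm": lambda q: [(298, 0.5), (312, 0.25)],
    "trigram_fts": lambda q: [(298, 0.25), (403, 0.125)],
}
results = combine_searches("Jomy Elswik", searchers)
```

A record found by several searchers accumulates both a higher score and a longer list of sources, which is the effect a combined search like this is after.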
Update: One of the comments suggested a demonstration of the results, which of course makes perfect sense. So I resurrected some of the scripts I used when developing the technique. In addition to the scripts used to install the fuzzystrmatch and pg_trgm modules and the name_search.sql script linked above, I had a script that populated the people table with a bunch of fake names. Then, it's easy to test the search mechanism like this:
select * from search_for_name('John Doe')
as a(id integer, last_name text, first_name text, sources text[], rank double precision);
id | last_name | first_name | sources | rank
-----+-------------+------------+-------------------------------------------------+--------------------
167 | Krohn | Javier | {trigram_fts,name_trgm,trigram_fts,trigram_fts} | 0.281305521726608
228 | Jordahl | Javier | {trigram_fts,name_trgm,trigram_fts} | 0.237995445728302
59 | Pesce | Dona | {trigram_fts} | 0.174265757203102
185 | Finchum | Dona | {trigram_fts} | 0.174265757203102
104 | Rumore | Dona | {trigram_fts} | 0.174265757203102
250 | Dumond | Julio | {name_trgm,trigram_fts,trigram_fts} | 0.16849160194397
200 | Dedmon | Javier | {name_trgm,trigram_fts,trigram_fts} | 0.163729697465897
230 | Dossey | Malinda | {name_trgm,trigram_fts} | 0.158055320382118
50 | Dress | Darren | {name_trgm,trigram_fts} | 0.153293430805206
136 | Doshier | Neil | {name_trgm,trigram_fts} | 0.148531511425972
165 | Donatelli | Lance | {name_trgm,trigram_fts} | 0.132845237851143
280 | Dollinger | Clinton | {name_trgm,trigram_fts} | 0.132845237851143
273 | Dimeo | Milagros | {name_trgm,trigram_fts} | 0.0866267532110214
49 | Dawdy | Christian | {name_trgm,trigram_fts} | 0.0866267532110214
298 | Elswick | Jami | {trigram_fts} | 0.0845221653580666
This isn't all the results it returned, but it gives an idea what the results look like. The rank value ranks results based on the rankings given by each of the underlying search methods, and the sources column shows which of the search methods found this particular entry. Some search methods may show up twice, because that search method found multiple matches between the input text and the result record. These results don't look particularly good, because there isn't really a good match for "John Doe" in the data set. But if I horribly misspell "Jamie Elswick", the search does a good job:
select * from search_for_name('Jomy Elswik') as a(id integer, last_name text,
first_name text, sources text[], rank double precision)
id | last_name | first_name | sources | rank
-----+-------------+------------+-------------------------------------------------+--------------------
298 | Elswick | Jami | {trigram_fts,name_trgm,trigram_fts,trigram_fts} | 0.480943143367767
312 | Elswick | Kurt | {name_trgm,trigram_fts} | 0.381967514753342
228 | Jordahl | Javier | {trigram_fts,name_trgm,trigram_fts} | 0.197063013911247
403 | Walberg | Erik | {trigram_fts} | 0.145491883158684
309 | Hammaker | Erik | {trigram_fts} | 0.145491883158684
Tail_n_mail and the log_line_prefix curse
One of the problems I had when writing tail_n_mail (a program that parses log files and mails interesting lines to you) was getting the program to understand the format of the Postgres log files. There are quite a few options inside of postgresql.conf that control where the logging goes, and what it looks like. The basic three options are to send it to a rotating logfile with a custom prefix at the start of each line, to use syslog, or to write it in CSV format. I'll save a discussion of all the logging parameters for another time, but the important one for this story is log_line_prefix. This is what gets prepended to each log line when using 'stderr' mode (e.g. regular log files and not syslog or csvlog). By default, log_line_prefix is an empty string. This is a very useless default.
What you can put in the log_line_prefix parameter is a string of sprintf style escapes, which Postgres will expand for you as it writes the log. There are a large number of escapes, but only a few are commonly used or useful. Here's a log_line_prefix I commonly use:
log_line_prefix = '%t [%p] %u@%d '
This tells Postgres to print out the timestamp, the PID aka process id (inside of square brackets), the current username and database name, and finally a single space to help separate the prefix visually from the rest of the line. The above will generate lines that look like this:
2010-08-06 09:24:57.714 EDT [7229] joy@joymail LOG: execute dbdpg_p7228_5: SELECT count(id) FROM joymail WHERE folder = $1
2010-08-06 09:24:57.714 EDT [7229] joy@joymail DETAIL: parameters: $1 = '4'
As you might imagine, the customizability of log_line_prefix makes parsing the log files all but impossible without some prior knowledge. I didn't want to go the pgfouine route and make people change their log_line_prefix to a specific setting. I think it's kind of rude to force your database to change its logging to accommodate your tools :). The original quick solution I came up with was to have a set of predefined regular expressions and the user would pick one that most closely matched their logs. For tail_n_mail to work properly, it needs to pick up at least the PID so it can tell when one statement ends and a new one begins. For example, if you chose "regex #1", the log parsing regex would look like this:
(\d\d\d\d\-\d\d\-\d\d \d\d:\d\d:\d\d).+?(\d+)
This works fine on the example above, and gets us the timestamp and the PID from each line. The stock regexes worked for many different log_line_prefixes I came across that our clients were using, but I was never very happy with this solution. Not only was it susceptible to failing completely when a client was using a log_line_prefix not fitting into the current list of regexes, but there was no way to know exactly where the prefix ended and the statement began, which is important for the formatting of the output and the canonicalization of similar queries.
Enter the current solution: building a regex on the fly. Since we don't have a connection to the database at all, merely to the log files, this requires that the user enter their current log_line_prefix. This is a simple entry into the tailnmailrc file that looks just like the entry in postgresql.conf, e.g.:
log_line_prefix = '%t [%p] %u@%d '
The tail_n_mail script uses that variable to build a custom regex specifically tailored to that log_line_prefix and thus to the Postgres logs being used. Not only can we grab whatever bits we want (currently we only care about the timestamp (%t and %m) and the PID (%p)), but we can now cleanly break apart each line in the log into the prefix and the actual statement. This means the canonicalization/flattening of the queries is more effective, and allows us to only output the prefix information once. The output of tail_n_mail looks something like this:
Date: Fri Aug 6 11:01:03 2010 UTC
Host: whale.example.com
Unique items: 7
Total matches: 85
Matches from [A] /var/log/pg_log/postgresql-2010-08-05.log: 61
Matches from [B] /var/log/pg_log/postgresql-2010-08-06.log: 24

[1] From files A to B (between lines 14,205 of A and 527 of B, occurs 64 times)
First: [A] 2010-08-05 16:52:11 UTC [1602] postgres@mydb
Last: [B] 2010-08-06 01:18:14 UTC [20981] postgres@mydb
ERROR: syntax error at or near ")"
STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (?,?,?))
-
ERROR: syntax error at or near ")"
STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (123,'chocolate','donut'));

[2] From file A (line 12,172)
2010-08-05 12:27:48 UTC [2906] bob@otherdb
ERROR: invalid input syntax for type date: "May"
STATEMENT: UPDATE personnel SET birthdate='May' WHERE id = 1234;

(plus five other entries)
For the entry in the above example, we are able to show the complete prefix of the log lines where the error first occurred and where it most recently occurred. The next two lines show the "flattened" version of the query that tail_n_mail uses to group together similar errors. We then show a non-flattened example of an actual query from that group. In this case, someone added an extra closing paren in their application somewhere, which gives the same error each time, although the exact output changes depending on the values used. In the second example, because there is only one match, we don't bother to show the flattened version at all.
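The "flattening" step can be approximated in a few lines. This is only a rough Python sketch of the idea, not tail_n_mail's actual rules: it assumes that replacing single-quoted strings and bare numbers with a placeholder is enough to group similar statements, and the function names are made up for illustration.

```python
import re
from collections import Counter


def flatten(statement):
    """Replace literal values with '?' so similar statements compare equal."""
    # Single-quoted strings first (a doubled '' is an escaped quote) ...
    flat = re.sub(r"'(?:[^']|'')*'", "?", statement)
    # ... then bare numbers that are not part of an identifier.
    flat = re.sub(r"\b\d+(?:\.\d+)?\b", "?", flat)
    return flat


def group_statements(statements):
    """Count how many statements share each flattened form."""
    return Counter(flatten(s) for s in statements)


statements = [
    "INSERT INTO mytable (id, foo, bar) VALUES (123,'chocolate','donut'));",
    "INSERT INTO mytable (id, foo, bar) VALUES (7,'it''s','x'));",
    "UPDATE personnel SET birthdate='May' WHERE id = 1234;",
]
for flat, count in group_statements(statements).items():
    print(count, flat)
```

Both INSERT statements collapse to the same flattened text, so they would be reported as one item with two matches.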
So in theory tail_n_mail should now be able to handle any Postgres log you care to throw at it (yes, it can read syslog and csvlog format as well). As my coworker pointed out, parsing log files in this way is something that should probably be abstracted into a common module so other tools like pgsi can take advantage of it as well.
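As a rough illustration of the regex-on-the-fly idea, here is a Python sketch that turns a log_line_prefix into a regex. The escape table is an assumption made for this example: it covers only a handful of escapes, treats any unknown escape as a run of non-space characters, and would need more care for a prefix that uses both %t and %m (the two would clash on the group name). It is not how tail_n_mail itself is implemented.

```python
import re

# Assumed regex fragments for a few log_line_prefix escapes.
ESCAPES = {
    "t": r"(?P<timestamp>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d \S+)",
    "m": r"(?P<timestamp>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+ \S+)",
    "p": r"(?P<pid>\d+)",
    "u": r"\S*",
    "d": r"\S*",
    "%": "%",
}


def prefix_to_regex(prefix):
    """Build a regex that splits a log line into prefix fields and statement."""
    parts = []
    i = 0
    while i < len(prefix):
        if prefix[i] == "%" and i + 1 < len(prefix):
            # Unknown escapes fall back to "anything without spaces".
            parts.append(ESCAPES.get(prefix[i + 1], r"\S*"))
            i += 2
        else:
            # Literal characters such as "[", "]" and "@" are escaped.
            parts.append(re.escape(prefix[i]))
            i += 1
    return re.compile("^" + "".join(parts) + r"(?P<statement>.*)$", re.DOTALL)


line_regex = prefix_to_regex("%t [%p] %u@%d ")
match = line_regex.match(
    '2010-08-05 16:52:11 UTC [1602] postgres@mydb ERROR: syntax error at or near ")"'
)
print(match.group("timestamp"), match.group("pid"))
print(match.group("statement"))
```

Because the literal parts of the prefix are anchored in the pattern, the statement group starts exactly where the prefix ends, which is the property the stock regexes could not offer.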
Distributed Transactions and Two-Phase Commit
The typical example of a transaction involves Alice and Bob, and their bank. Alice pays Bob $100, and the bank needs to debit Alice and credit Bob. Easy enough, provided the server doesn't crash. But what happens if the bank debits Alice, and then before crediting Bob, the server goes down? Or what if they credit Bob first, and then try to debit Alice only to find she doesn't have enough funds? A transaction allows the debit and credit operations to happen as a package ("atomically" is the word commonly used), so either both operations happen or neither happens, even if the server crashes halfway through the transaction. That way the bank never credits Bob without debiting Alice, or vice versa.
That's simple enough, but the situation can become more complex. What if, for instance, for buzzword-compliance purposes, the bank has "sharded" its accounts database by splitting it in pieces and putting each piece on a different server? (Whether this would be smart or not is outside the scope of this post.) The typical transaction handles statements issued only for one database, so we can't wrap the debit and credit operations within a single BEGIN/COMMIT if Alice's account information lives on one server and Bob's lives on another.
Enter "distributed transactions". A distributed transaction allows applications to group multiple transaction-aware systems into a single transaction. These systems might be different databases, or they might include other systems such as message queues, in which case the transaction concept means a message would get delivered if and only if the rest of the transaction completed. So with a distributed transaction, the bank could debit Alice's account in one database and credit Bob's in another, atomically.
All this comes at some cost. First, distributed transactions require a "transaction manager", an application which handles the special semantics required to commit a distributed transaction. Second, the systems involved must support "two-phase commit" (which was added to PostgreSQL in version 8.1). Distributed transactions are committed using PREPARE TRANSACTION 'foo' (phase 1), and COMMIT PREPARED 'foo' or ROLLBACK PREPARED 'foo' (phase 2), rather than the usual COMMIT or ROLLBACK.
The beginning of a distributed transaction looks just like any other transaction: the application issues a BEGIN statement (optional in PostgreSQL), followed by normal SQL statements. When the transaction manager is instructed to commit, it runs the first commit phase by saying "PREPARE TRANSACTION 'foo'" (where "foo" is some arbitrary identifier for this transaction) on each system involved in the distributed transaction. Each system does whatever it needs to do to determine whether or not this transaction can be committed and to make sure it can be committed even if the server crashes, and reports success or failure. If all systems succeed, the transaction manager follows up with "COMMIT PREPARED 'foo'", and if a system reports failure, the transaction manager can roll back all the other systems using either ROLLBACK (for those transactions it hasn't yet prepared), or "ROLLBACK PREPARED 'foo'". Using two-phase commit is obviously slower than committing transactions on only one database, but sometimes the data integrity it provides justifies the extra cost.
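The decision logic a transaction manager follows can be sketched with a toy in-memory model. Everything in this Python example (the Participant class, its method names, the state strings) is hypothetical and stands in for real databases; it only mirrors the prepare/commit/rollback order described above and has none of the durability a real transaction manager needs.

```python
class Participant:
    """Toy stand-in for one transaction-aware system (hypothetical)."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "active"

    def prepare(self, gid):
        # Phase 1: promise that the transaction can be committed, or refuse.
        if not self.can_prepare:
            raise RuntimeError(self.name + " cannot prepare " + gid)
        self.state = "prepared"

    def commit_prepared(self, gid):
        self.state = "committed"

    def rollback_prepared(self, gid):
        self.state = "rolled back"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants, gid):
    """Return True if every participant committed, False if all rolled back."""
    prepared = []
    try:
        for participant in participants:
            participant.prepare(gid)
            prepared.append(participant)
    except RuntimeError:
        for participant in participants:
            if participant in prepared:
                participant.rollback_prepared(gid)  # already prepared
            else:
                participant.rollback()              # never got that far
        return False
    # Phase 2: everyone promised, so tell everyone to commit.
    for participant in participants:
        participant.commit_prepared(gid)
    return True


athos, porthos = Participant("athos"), Participant("porthos")
print(two_phase_commit([athos, porthos], "foo"), athos.state, porthos.state)
```

A real manager also has to record its own decision durably, so that after a crash it knows whether prepared transactions left behind should be committed or rolled back.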
In PostgreSQL, two-phase commit is supported provided max_prepared_transactions is nonzero. A PREPARE TRANSACTION statement persists the current transaction to disk, and dissociates it from the current session. That way it can survive even if the database goes down. The current session no longer has an active transaction. However, the prepared transaction acts like any other open transaction in that all locks held by the prepared transaction remain held, and VACUUM cannot reclaim storage from that transaction. So it's not a good idea to leave prepared transactions open for a long time.
Distributed transactions are most common, it seems, in Java applications. Full J2EE application servers typically come with a transaction manager component. For my examples I'll use an open source, standalone transaction manager, called Bitronix. I'm not particularly fond of using Java for simple scripts, though, so I've used JRuby for this demonstration code.
This script uses two databases, which I've called "athos" and "porthos". Each has the same schema, which provides a simple framework for the sharded bank example described above. This schema provides a table for account names, another for ledger information, and a simple trigger to raise an exception when a transaction would bring a person's balance below $0. I'll first populate athos with Alice's account information. She gets $200 to start. Bob will go in the porthos database, with no initial balance.
5432 josh@athos# insert into accounts values ('Alice');
INSERT 0 1
5432 josh@athos*# insert into ledger values ('Alice', 200);
INSERT 0 1
5432 josh@athos*# commit;
COMMIT
5432 josh@athos# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# insert into accounts values ('Bob');
INSERT 0 1
5432 josh@porthos*# commit;
COMMIT
Use of Bitronix is pretty straightforward. After setting up a few constants for easier typing, I create a Bitronix data source for each PostgreSQL database. Here I have to use the PostgreSQL JDBC driver's org.postgresql.xa.PGXADataSource class; "XA" is the X/Open standard interface for two-phase commit, which Java exposes through JTA, and it requires JDBC driver support. Here's the code for setting up one data source; the other is just the same.
ds1 = PDS.new
ds1.set_class_name 'org.postgresql.xa.PGXADataSource'
ds1.set_unique_name 'pgsql1'
ds1.set_max_pool_size 3
ds1.get_driver_properties.set_property 'databaseName', 'athos'
ds1.get_driver_properties.set_property 'user', 'josh'
ds1.init
Then I simply get a connection from each data source, instantiate a Bitronix TransactionManager object, and begin a transaction.
c1 = ds1.get_connection
c2 = ds2.get_connection
btm = TxnSvc.get_transaction_manager
btm.begin
Within my transaction, I just use normal JDBC commands to debit Alice and credit Bob, after which I commit the transaction through the TransactionManager object. If this transaction fails, it raises an exception, which I can capture using Ruby's begin/rescue exception handling, and roll back the transaction.
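begin
  # Credit Bob on porthos
  s2 = c2.prepare_statement "INSERT INTO ledger VALUES ('Bob', 100)"
  s2.execute_update
  s2.close
  # Debit Alice on athos; the trigger raises if her balance would go below 0
  s1 = c1.prepare_statement "INSERT INTO ledger VALUES ('Alice', -100)"
  s1.execute_update
  s1.close
  # Two-phase commit across both databases
  btm.commit
  puts "Successfully committed"
rescue
  # Interpolation calls to_s on the exception, so any exception type prints
  puts "Something bad happened: #{$!}"
  btm.rollback
end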
When I run this, Bitronix gives me a bunch of output, which I haven't bothered to suppress, but among it all is the "Successfully committed" string I told it to print on success. Since Alice is debited $100 each time we run this, and she started with $200, we can run it twice before hitting errors. On the third time, we get this:
Something bad happened: org.postgresql.util.PSQLException: ERROR: Rejecting operation; account owner Alice's balance would drop below 0
This is our trigger firing, to tell us that we can't debit Alice any more. If I look in the two databases, I can see that everything worked as planned:
5432 josh@athos*# select get_balance('Alice');
get_balance
-------------
0
(1 row)
5432 josh@athos*# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# select get_balance('Bob');
get_balance
-------------
200
(1 row)
Remember I've run my script three times, but Bob has only been credited $200, because that's all Alice had to start with.
PostgreSQL: per-version .psqlrc
File this under "you learn something new every day." I came across this little tidbit while browsing the source code for psql: you can have a per-version .psqlrc file which will be executed only by the psql of that exact version. Just name the file .psqlrc-$version, substituting the full version number for the $version token. So psql 8.4.4 would look for a file named .psqlrc-8.4.4 in your $HOME directory.
It's worth noting that the version-specific .psqlrc file requires the full version number, minor release included, so you cannot currently define (say) an 8.4-only file which applies to all 8.4 psqls. I don't know if this feature gets enough mileage to make said modification worth it, but it would be easy enough to keep the settings in a .psqlrc-$majorversion file and symlink each full-version name (such as .psqlrc-8.4.4) to it.
This seems of most interest to developers, who may simultaneously run many versions of psql which may have incompatible settings, but it could come in handy for regular users as well.
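The symlink workaround is easy to script. The following Python sketch is just one way to do it; the helper name and the sample version numbers are made up for illustration, and the demo runs against a temporary directory rather than a real $HOME.

```python
import os
import tempfile


def link_versioned_psqlrc(home, major, full_versions):
    """Point .psqlrc-<full version> at a shared .psqlrc-<major> file."""
    shared = ".psqlrc-" + major
    created = []
    for version in full_versions:
        link = os.path.join(home, ".psqlrc-" + version)
        if not os.path.lexists(link):  # leave existing files alone
            os.symlink(shared, link)   # relative target, resolved inside home
            created.append(link)
    return created


def demo():
    # Use a throwaway directory as a stand-in for $HOME.
    with tempfile.TemporaryDirectory() as home:
        with open(os.path.join(home, ".psqlrc-8.4"), "w") as rc:
            rc.write("\\set ON_ERROR_ROLLBACK interactive\n")
        links = link_versioned_psqlrc(home, "8.4", ["8.4.3", "8.4.4"])
        return [(os.path.basename(link), os.readlink(link)) for link in links]


print(demo())
```

Each installed psql then finds a file matching its own full version, while the settings live in one place.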
PostgreSQL: Dynamic SQL Function
Sometimes when you're doing something in SQL, you find yourself doing something repetitive, which naturally lends itself to the desire to abstract out the boring parts. This pattern is often prevalent when doing maintenance-related tasks such as creating or otherwise modifying DDL in a systematic kind of way. If you've ever thought, "Hey, I could write a query to handle this," then you're probably looking for dynamic SQL.
The standard approach to using dynamic SQL in PostgreSQL is plpgsql's EXECUTE statement, which takes a text expression as the SQL statement to execute. One technique fairly well-known on the #postgresql IRC channel is to create a function which essentially wraps the EXECUTE statement, commonly known as exec(). Here is the definition of exec():
CREATE FUNCTION exec(text) RETURNS text AS $$
BEGIN
  EXECUTE $1;
  RETURN $1;
END
$$ LANGUAGE plpgsql;
Using exec() then takes the form of a SELECT query with the appropriately generated query to be executed passed as the sole argument. We return the generated query text to make it easy to audit what was actually executed. Some examples:
SELECT exec('CREATE TABLE partition_' || generate_series(1,100) || ' (LIKE original_table)');
SELECT exec('ALTER TABLE ' || attrelid::regclass::text || ' DROP COLUMN foo') FROM pg_attribute WHERE attname = 'foo';
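Some notes about the exec() function: since the generated SQL statement is being run inside a function, it always runs inside a transaction block, so commands that refuse to run inside one will not work, including CREATE/DROP DATABASE, CREATE/DROP TABLESPACE, VACUUM, etc. Also note that casting to regclass already quotes (and, where needed, schema-qualifies) the table name, so no extra quote_ident() call is wanted in the second example.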
Starting in PostgreSQL 9.0, the plpgsql language will be pre-installed in all new databases, which will make this recipe even easier to use.
PostgreSQL: Migration Support Checklist
A database migration (be it from some other database to PostgreSQL, or even from an older version of PostgreSQL to a nice shiny new one) can be a complicated procedure with many details and many moving parts. I've found it helpful to construct a list of questions to make sure that all aspects of the migration are considered and to gauge the scope of what will be involved. This list includes questions we ask our clients; feel free to contribute your own additional considerations or suggestions.
Technical questions:
- Database servers: How many database servers do you have? For each, what are the basic system specifications (OS, CPU architecture, 32- vs 64-bit, RAM, disk, etc)? What kind of storage are you using for the existing database, and what do you plan to use for the new database? Direct-attached storage (SAS, SATA, etc.), SAN (what vendor?), or other? Do you use any configuration management system such as Puppet, Chef, etc.?
- Application servers and other remote access: How many application servers do you have? For each, what are the basic system specifications (OS, CPU architecture, 32- vs 64-bit, RAM, disk, etc)? Do you use any configuration management system such as Puppet, Chef, etc.? What other network considerations are there? Is ODBC used, or SSL transport, any VPNs? Are multiple datacenters involved? How about egress/ingress firewalls?
- Middleware: Do you currently use any sort of connection pooling, load balancing, or other middleware between your application and database servers?
- Data needs: Can you describe your data access patterns? For example, is the majority of your data historical and rarely accessed? Are there any existing reporting needs that will need to be duplicated on the PostgreSQL system? Do you already have reports of database usage, including traffic levels, frequent or intensive queries, etc?
- Size: What kind of transaction volume do you see? How large are your databases? How many tables do you have and what is the size of the larger ones? How many users or database connections will you need to support?
- Backups: What are your current backup policies/procedures? How will these need to change with the move to PostgreSQL?
- Replication/load balancing: What kind of system redundancy do you currently have/need? Do you have any kind of database load-balancing or master-slave replication?
- Monitoring: What is the current monitoring/in-house support infrastructure? What needs to be duplicated, and can any portion of this facility be reused?
- Interfaces: What language are your applications written in, and what drivers exist to connect to your current database? Will a compatible PostgreSQL driver be available in your language of choice?
- Extensions: Are you currently using any in-database procedures or functionality (e.g., in PL/SQL or another embedded language of choice)? If so, how many? What will the difficulty be in porting these functions to PostgreSQL?
And a couple of business-related questions:
- Scheduling: What is the timeframe for transition? When can appropriate downtime be scheduled? How much database downtime can you afford?
- Staffing: Do you currently have in-house DBAs to manage the servers, etc on a day-to-day basis? Is there anyone with PostgreSQL experience or familiarity on staff?
Being able to answer all of these questions is critical to formulating a migration plan and carrying out a migration successfully.
Particularly with the impending (July 2010) end of life for previous PostgreSQL releases 7.4, 8.0 and (in November 2010) 8.1, a database migration may be on your radar. End Point is one of many professional PostgreSQL support companies who would be happy to assist you in your transition.
Views across many similar tables
An application I'm working on has a host of (a dozen or so) status tables, each containing various rows that reflect the state of associated rows in other tables. For instance:
Table "public.inventory"
...
status_code | character varying(50) | not null

Table "public.inventory_statuses"
code          | character varying(50) | not null
display_label | character varying(70) | not null

SELECT * FROM inventory_statuses;
   code    | display_label
-----------+---------------
 ordered   | Ordered
 shipped   | Shipped
 returned  | Returned
 repaired  | Repaired
etc.
Several of the codes are common to several tables. For instance, "void" is a status that occurs in seven tables. The application cares about this; there are code-level triggers that will respond to a change of status to "void" in one table, and pass that information along to another table higher up the chain.
Since I wasn't present at the birth of the system (nor do I have unlimited memory to keep 180+ codes in my head), I needed a way to answer the question, "In which table(s) does status 'foo' occur?" This was made rather easier by attention to detail early on: each of the status tables was named "*_statuses"; each primary key was named "code"; and each human-readable description field was named "display_label". I wrote a PL/pgSQL function to create a view spanning all the tables. (I could have just created the SQL by hand, but I wanted a way to reproduce this effort later, if tables are added, dropped, or modified.)
CREATE FUNCTION create_all_statuses()
RETURNS VOID
LANGUAGE 'plpgsql'
AS $$
DECLARE
stmt TEXT;
tbl RECORD;
BEGIN
stmt := '';
FOR tbl IN EXECUTE $SQL$
SELECT DISTINCT table_name
FROM information_schema.columns a
JOIN information_schema.columns b
USING (table_name)
JOIN information_schema.tables t
USING (table_name)
WHERE a.column_name = 'code'
AND b.column_name = 'display_label'
AND table_name ~ '_statuses$'
AND t.table_type = 'BASE TABLE'
$SQL$
LOOP
IF (LENGTH(stmt) > 0)
THEN
stmt := stmt || ' UNION ';
END IF;
stmt := stmt || 'SELECT code, display_label, ' ||
quote_literal(tbl.table_name) ||
' AS table_name FROM ' ||
quote_ident(tbl.table_name);
END LOOP;
EXECUTE 'CREATE VIEW all_statuses AS ' || stmt;
RETURN;
END;
$$;
After calling the function once (SELECT create_all_statuses();) to build the view, it's easy to answer the question:
select * from all_statuses where code = 'void';
 code | display_label | table_name
------+---------------+--------------------------------------
 void | Void          | inventory_statuses
 void | Void          | parcel_statuses
 void | Void          | pick_list_statuses
etc.
If your database uses boilerplate columns such as "last_modified" or "date_created" to record timestamps on rows, you could use similar logic to create a view that would tell you which tables were the most recently modified.
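The string-building loop inside create_all_statuses() is easy to try outside the database. Here is a Python sketch of the same logic; the two quoting helpers are simplified stand-ins written for this example (they always quote and ignore backslashes, unlike PostgreSQL's quote_ident and quote_literal), and the table names are just the ones mentioned above.

```python
def quote_ident(name):
    """Simplified identifier quoting: always double-quote, double any quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Simplified literal quoting: single-quote, double any single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_all_statuses_sql(table_names):
    """Build the CREATE VIEW statement spanning every status table."""
    selects = [
        "SELECT code, display_label, "
        + quote_literal(table)
        + " AS table_name FROM "
        + quote_ident(table)
        for table in table_names
    ]
    return "CREATE VIEW all_statuses AS " + " UNION ".join(selects)


print(build_all_statuses_sql(["inventory_statuses", "parcel_statuses"]))
```

Joining a list of SELECTs with " UNION " does the same job as the IF (LENGTH(stmt) > 0) check in the PL/pgSQL loop: the separator only ever appears between two SELECTs.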
pgcrypto pg_cipher_exists errors on upgrade from PostgreSQL 8.1
While migrating a client from an 8.1 Postgres database to an 8.4 Postgres database, I came across a very annoying pgcrypto problem. (pgcrypto is a very powerful and useful contrib module that contains many functions for encryption and hashing.) Specifically, the following functions were removed from pgcrypto as of version 8.2 of Postgres:
- pg_cipher_exists
- pg_digest_exists
- pg_hmac_exists
While the functions listed above were deprecated, and marked as such for a while, their complete removal from 8.2 presents problems when upgrading via a simple pg_dump. Specifically, even though the client was not using those functions, they were still there as part of the dump. Here's what the error message looked like:
$ pg_dump mydb --create | psql -X -p 5433 -f - >pg.stdout 2>pg.stderr
...
psql:<stdin>:2654: ERROR: could not find function "pg_cipher_exists" in file "/var/lib/postgresql/8.4/lib/pgcrypto.so"
psql:<stdin>:2657: ERROR: function public.cipher_exists(text) does not exist
While it doesn't stop the rest of the dump from importing, I like to remove any errors I can. In this case, it really was a SMOP. Inside the Postgres 8.4 source tree, in the contrib/pgcrypto directory, I added the following declarations to pgcrypto.h:
Datum pg_cipher_exists(PG_FUNCTION_ARGS);
Datum pg_digest_exists(PG_FUNCTION_ARGS);
Datum pg_hmac_exists(PG_FUNCTION_ARGS);
Then I added three simple functions to the bottom of the pgcrypto.c file that simply throw an error if they are invoked, letting the user know that the functions are deprecated. This is a much friendlier way than simply removing the functions, IMHO.
/* SQL function: pg_cipher_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_cipher_exists);
Datum
pg_cipher_exists(PG_FUNCTION_ARGS)
{
    ereport(ERROR,
            (errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
             errmsg("pg_cipher_exists is a deprecated function")));
    /* not reached: ereport(ERROR) does not return; keeps the compiler quiet */
    PG_RETURN_BOOL(false);
}
/* SQL function: pg_digest_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_digest_exists);
Datum
pg_digest_exists(PG_FUNCTION_ARGS)
{
    ereport(ERROR,
            (errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
             errmsg("pg_digest_exists is a deprecated function")));
    /* not reached */
    PG_RETURN_BOOL(false);
}
/* SQL function: pg_hmac_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_hmac_exists);
Datum
pg_hmac_exists(PG_FUNCTION_ARGS)
{
    ereport(ERROR,
            (errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
             errmsg("pg_hmac_exists is a deprecated function")));
    /* not reached */
    PG_RETURN_BOOL(false);
}
After running make install from the pgcrypto directory, the dump proceeded without any further pgcrypto errors. From this point forward, if anyone attempts to use one of the functions, it will be quite obvious that the function is deprecated, rather than leaving the user wondering if they typed the function name incorrectly or wondering if pgcrypto is perhaps not installed.
Why not just add some dummy SQL functions to the pgcrypto.sql file instead of hacking the C code? Because pg_dump by default will create the database as a copy of template0. While there are other ways around the problem (such as putting the SQL functions into template1 and forcing the load to use that instead of template0, or by creating the database, adding the SQL functions, and then loading the data), this was the simplest approach.
Photo of Enigma machine by Marcin Wichary
Learn more about End Point's Postgres Support, Development, and Consulting.
Tracking Down Database Corruption With psql
I love broken Postgres. Really. Well, not nearly as much as I love the usual working Postgres, but it's still a fantastic learning opportunity. A crash can expose a slice of the inner workings you wouldn't normally see in any typical case. And, assuming you have the resources to poke at it, that can provide some valuable insight without lots and lots of studying internals (still on my TODO list.)
As a member of the PostgreSQL support team at End Point, I see a number of diverse situations cross my desk. So imagine my excitement when I get an email containing a bit of log output that would normally make a DBA tremble in fear:
LOG: server process (PID 10023) was terminated by signal 11
LOG: terminating any other active server processes
FATAL: the database system is in recovery mode
LOG: all server processes terminated; reinitializing
Oops, signal 11 is SIGSEGV, Segmentation Fault. Really not supposed to happen, especially in day to day activities. That'll cause Postgres to drop all of its current sessions and restart itself, as the log lines indicate. That crash was in response to a specific query their application was running, which essentially runs a process on a column across an entire table. Upon running pg_dump they received a different error:
ERROR: invalid memory alloc request size 2667865904
STATEMENT: COPY public.different_table (etc, etc) TO stdout
Different, but still very annoying and in the way of their data. So we have (at least) two areas of corruption. But therein lies the bigger problem: Neither of these messages gives us any clues about where in these potentially very large tables it's encountering a problem.
Yes, my hope is that the corruption is not widespread. I know this database tends to not see a whole lot of churn, relatively speaking, and that they look at most if not all the data rather frequently. So the expectation is that it was caught not long after the disk controller or some memory or something went bad, and that whatever's wrong is isolated to a handful of pages.
Our good and trusty psql command line client to the rescue! One of the options available in psql is FETCH_COUNT, which if set will wrap a SELECT query in a cursor then automatically and repeatedly fetch the specified number of rows from it. This option is there primarily to allow psql to show the results of large queries without having to dedicate so much memory up front. But in this case it lets us see the output of a table scan as it happens:
testdb=# \set FETCH_COUNT 1
testdb=# \pset pager off
Pager usage is off.
testdb=# SELECT ctid, * FROM gs;
 ctid  | generate_series
-------+-----------------
 (0,1) |               0
 (0,2) |               1
(scroll, scroll, scroll...)
(You did start that in a screen session, right? No need to have it send all the data over to your terminal, especially if you're working remotely. Set screen to watch for the output to go idle, Ctrl-A, _ keys by default, and switch to a different window. Oh, and this of course isn't the client's database, but one where I've intentionally introduced some corruption.)
We select the system column ctid to tell us the page where the problem occurs. Or more specifically, the page and positions leading up to the problem:
(439,226) | 99878
(439,227) | 99879
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
:|!>?
Yup, there it is. Some point after item pointer 227 on page 439, which probably actually means page 440. At this point we can reconnect, and possibly through a bit of trial and error narrow down the affected area a little more. But for now let's run with page 440 being suspect; let's take a closer look. And here it should be noted that if you're going to try anything, shut down Postgres and take a file-level backup of the data directory first. Anyway, first we need to find the underlying file for our table...
testdb=# select oid from pg_database where datname = 'testdb';
oid
-------
16393
(1 row)
testdb=#* select relfilenode from pg_class where relname = 'gs';
relfilenode
-------------
16394
(1 row)
testdb=#* \q
demo:~/p82$ dd if=data/base/16393/16394 bs=8192 skip=440 count=1 | hexdump -C | less
...
000001f0 00 91 40 00 e0 90 40 00 00 00 00 00 00 00 00 00 |..@...@.........|
00000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 1f 8b 08 08 00 00 00 00 02 03 70 6f 73 74 67 72 |..........postgr|
00001010 65 73 71 6c 2d 39 2e 30 62 65 74 61 31 2e 74 61 |esql-9.0beta1.ta|
00001020 72 00 ec 7d 69 63 1b b7 d1 f0 f3 55 fb 2b 50 8a |r..}ic.....U.+P.|
00001030 2d 25 96 87 24 5f 89 14 a6 a5 25 5a 56 4b 1d 8f |-%..$_....%ZVK..|
00001040 28 27 4e 2d 87 5a 91 2b 6a 6b 72 97 d9 25 75 c4 |('N-.Z.+jkr..%u.|
00001050 f6 fb db df 39 00 2c b0 bb a4 28 5b 71 d2 3e 76 |....9.,...([q.>v|
00001060 1b 11 8b 63 30 b8 06 83 c1 60 66 1c c6 93 41 e4 |...c0....`f...A.|
...
Huh, so through perhaps either a kernel bug, a disk controller problem, or bizarre action on the part of a sysadmin, the last bit of our table has been overwritten by the 9.0beta1 tarball distribution. Incidentally this is not one of the recommended ways of upgrading your database.
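The dd invocation above boils down to "seek to page number times block size, read one block". Here is a small Python sketch of that arithmetic, assuming the default 8192-byte block size and a table small enough to live in a single file (by default PostgreSQL splits larger tables into 1 GB segment files, which this sketch does not handle). The helper names are made up for illustration.

```python
import re

BLOCK_SIZE = 8192  # default PostgreSQL page size; an assumption here


def parse_ctid(text):
    """Turn a ctid such as '(439,227)' into (page, item)."""
    match = re.fullmatch(r"\((\d+),(\d+)\)", text.strip())
    if match is None:
        raise ValueError("not a ctid: " + repr(text))
    return int(match.group(1)), int(match.group(2))


def page_offset(page, block_size=BLOCK_SIZE):
    """Byte offset of a page within the table's file."""
    return page * block_size


def read_page(path, page, block_size=BLOCK_SIZE):
    """Read one raw page, like dd bs=8192 skip=<page> count=1."""
    with open(path, "rb") as table_file:
        table_file.seek(page_offset(page, block_size))
        return table_file.read(block_size)


last_good_page, last_good_item = parse_ctid("(439,227)")
suspect_page = last_good_page + 1
print(suspect_page, page_offset(suspect_page))
```

read_page() returns fewer than 8192 bytes (possibly none) if the file is shorter than expected, which is worth checking before drawing conclusions from the contents.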
With a corrupt page identified, if it's fairly clear the invalid data covers most or all of the page it's probably not too likely we'll be able to recover any rows from it. Our best bet is to "zero out" the page so that Postgres will skip over it and let us pull the rest of the data from the table. We can use `dd` to seek to the corrupt block in the table and write out an 8k block of zero-bytes in its place. Shut down Postgres (just to make sure it doesn't re-overwrite your work later) and note the conv=notrunc that'll keep dd from truncating the rest of the table.
demo:~/p82$ dd if=/dev/zero of=data/base/16393/16394 bs=8192 seek=440 count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000141498 s, 57.9 MB/s
demo:~/p82$ dd if=data/base/16393/16394 bs=8192 skip=440 count=1 | hexdump -C
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000147993 s, 55.4 MB/s
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00002000
Cool, it's now an empty, uninitialized page that Postgres should be fine skipping right over. Let's test it, start Postgres back up and run psql again...
testdb=# select count(*) from gs;
 count
-------
 99880
(1 row)
No crash, hurray! We've clearly lost some rows from the table, but that should now allow us to rescue any of the surrounding data. As always it's worth dumping out all the data you can, running initdb, and loading it back in. You never know what else might have been affected in the original database. This is of course no substitute for a real backup, but if you're in a pinch at least there is some hope. For now, PostgreSQL is happy again!
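For completeness, the zero-out step can be expressed without dd. This Python sketch assumes the same 8192-byte block size as above and demonstrates itself on a throwaway temporary file; the function names are made up for illustration, and pointing anything like this at a real data directory should only happen with Postgres shut down and a file-level backup in hand.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # default PostgreSQL page size; an assumption here


def zero_page(path, page, block_size=BLOCK_SIZE):
    """Overwrite one page with zero bytes in place, like dd conv=notrunc."""
    # "r+b" updates the file in place and never truncates it.
    with open(path, "r+b") as table_file:
        table_file.seek(page * block_size)
        table_file.write(b"\x00" * block_size)


def demo():
    """Zero the middle page of a three-page scratch file and return its bytes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as scratch:
            scratch.write(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE)
        zero_page(path, 1)
        with open(path, "rb") as scratch:
            return scratch.read()
    finally:
        os.remove(path)


data = demo()
print(len(data), data[BLOCK_SIZE:BLOCK_SIZE + 4])
```

One caveat: if the page number lies past the end of the file, the write extends the file instead of failing, so it pays to check the file size first.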
The PGCon "Hall Track"
One of my favorite parts of PGCon is always the "hall track", a general term for the sideline discussions and brainstorming sessions that happen over dinner, between sessions (or sometimes during sessions), and pretty much everywhere else during the conference. This year's hall track topics seemed to be set by the developers' meeting; everywhere I went, someone was talking about hooks for external security modules, MERGE, predicate locking, extension packaging and distribution, or exposing transaction order for replication. Other developers' pet projects that didn't appear in the meeting showed up occasionally, including unlogged tables and range types. Even more than, for instance, the wiki pages describing the things people plan to work on, these interstitial discussions demonstrate the vibrancy of the community and give a good idea just how active our development really is.
This year I shared rooms with Robert Haas, so I got a good overview of his plans for global temporary and unlogged tables. I spent a while with Jeff Davis looking through the code for exclusion constraints and deciding whether it was realistically possible to cause a starvation problem with many concurrent insertions into a table with an exclusion constraint. I didn't spend the time I should have talking with Dimitri Fontaine about his PostgreSQL extensions project, but if time permits I'd like to see if I could help out with it. Nor did I find the time I'd have liked to work on PL/Parrot, but I was glad to meet Jonathan Leto, who has done most of the coding work thus far on that project.
In contrast to other conferences, I didn't have a particular itch of my own to scratch between sessions. During past conferences I've been eager to discuss ideas for multi-column statistics; though that work continues, slowly, time hasn't permitted enough recent development even for the topic to be fresh in my mind, much less worthy of in-depth discussion. This lack of one overriding subject turned out to be a refreshing change, however, as it left the other hall track subjects less filtered.
Finally, it was nice to spend time with co-workers, and in fact to meet (finally) in person one of the "Greg"s I'd talked to on the phone many times but never actually met. Various engagements in my family or his have gotten in the way in the past. One of the quirks of working for a distributed organization...
Postgres Conference - PGCon2010 - Day Two
Day two of the PostgreSQL Conference started a little later than the previous day in obvious recognition of the fact that many people were up very, very late the night before. (Technically, this is day four, as the first two days consisted of tutorials; this was the second day of "talks").
The first talk I went to was PgMQ: Embedding messaging in PostgreSQL by Chris Bohn. It was well attended, although there were definitely a lot of late-comers and bleary eyes. A tough slot to fill! Chris is from Etsy.com and I've worked with him there, although I had no interaction with the PgMQ project, which looks pretty cool. From the talk description:
PgMQ (PostgreSQL Message Queueing) is an add-on that embeds a messaging client inside PostgreSQL. It supports the AMQP, STOMP and OpenWire messaging protocols, meaning that it can work with all of the major messaging systems such as ActiveMQ and RabbitMQ. PgMQ enables two replication capabilities: "Eventually Consistent" Replication and sharding.
As near as I can tell, "eventually consistent" is the same as "asynchronous replication": the slave won't be the same as the master right away, but will be eventually. As with Bucardo and Slony, the actual lag is very small in practice: a handful of seconds at the most. I like the fact that it supports all those common messaging protocols. Chris mentioned in the talk that it should be possible for other systems like Bucardo to support something similar. I'll have to play around with PgMQ a bit and see about doing just that. :)
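To make the "eventually consistent" idea concrete, here is a minimal Python sketch of asynchronous replication through a queue. It is not PgMQ's API: the `Master`, `Replica`, and `drain` names are invented for illustration, and an in-memory deque stands in for a real message broker such as ActiveMQ or RabbitMQ.

```python
from collections import deque


class Replica:
    """Hypothetical replica: applies changes whenever they are delivered."""

    def __init__(self):
        self.rows = {}

    def apply(self, change):
        key, value = change
        self.rows[key] = value


class Master:
    """Hypothetical master: commits locally, then queues the change."""

    def __init__(self):
        self.rows = {}
        self.queue = deque()  # stands in for a message broker

    def write(self, key, value):
        self.rows[key] = value           # visible on the master immediately
        self.queue.append((key, value))  # shipped to the replica later


def drain(master, replica):
    """Deliver every queued change in commit order; return how many."""
    delivered = 0
    while master.queue:
        replica.apply(master.queue.popleft())
        delivered += 1
    return delivered


master, replica = Master(), Replica()
master.write("user:1", "alice")
master.write("user:1", "alicia")

lagging = replica.rows != master.rows     # the replica has not caught up yet
drain(master, replica)
consistent = replica.rows == master.rows  # equal once the queue is empty
```

The window between `write` and `drain` is the replication lag; in a real system it is whatever time the broker and the consumer need to deliver and apply the message.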

The typical post-talk gatherings
The next "talk" was the enigmatically labeled Replication Panel. Enigmatic in this case as it had no description whatsoever. It's a good thing I had decided to check it out anyway (I'm a sucker for any talk related to replication, in case it wasn't obvious yet). I was apparently nominated to be on the panel, representing Bucardo! So much for getting all my speaking done and over with the first day. The panel covered a pretty wide swath of Postgres replication technologies, each represented by someone very deep in its development. From left to right on a cluster of stools at the front of the room were:
- Londiste (Marko Kreen)
- Slony (Jan Wieck)
- pgpool-II (Tatsuo Ishii)
- Hot standby and Streaming replication (Heikki Linnakangas)
- Bucardo (Greg Sabino Mullane)
- Golconde (Gavin M. Roy)
After a quick one-minute intro from each of us describing who we were and what our replication system was, we took questions from the audience. Rather, Dan Langille played the part of the moderator and gathered written questions from the audience which he read to us, and we each took turns answering. We managed to get through 16 questions. All were interesting, even if some did not apply to all the solutions. Some of the more relevant ones I remember:
"If your replication solution was not available, which of the other replication solutions would you recommend?" This was my favorite question. My answer was: if using Bucardo in multi-master mode, switch to pgpool. If using in master-slave mode, use Slony.
"How will PG 9.0 affect your solution? Will your solution still remain relevant?" This most heavily affects Bucardo, Slony, and Londiste, and we all agreed that we're happy to lose users who simply need a read-only copy of their database. There remain plenty of use cases that 9.0 will not solve, however.
"For multi-master solutions: How are database collisions resolved? Do you recommend your solution for geographically remote locations?" This one is pretty much for me alone. :) I gave a quick overview of Bucardo's built-in conflict resolution systems, and how custom ones built on business logic work. Since Bucardo was originally built to support servers over a non-optimal network, the second part was an easy Yes.
"Is there a way to standardize and reduce the number of replication systems and focus on making the subset more robust, efficient, and versatile?" The general answer was no, as the use cases for all of them are so wildly different. I thought the only possible reduction was to combine Slony and Londiste, as they are very close technically and have pretty much identical use cases.
"How easy is it to switch masters? Are you planning on improving the tools to do so?" With Bucardo, switching is as easy as pointing to a different database if using master-master. However, Bucardo master-slave has no built-in support at all for failover (like Slony does). So the answer is "not easy at all" and yes, we want to provide tools to do so.
"What is your biggest bug, problem, or limitation you are fixing now?" All three of the async trigger solutions (Bucardo, Slony, and Londiste) answered "DDL triggers". Which is hopefully coming for 9.1 (stop reading this blog and get to work on that, Jan).
All in all, I really liked the panel, and I think the audience did as well. Hopefully we'll see more things like it at future conferences. Since we did not know the questions beforehand, and took everything from the audience, it was the polar opposite of someone giving a talk with prepared slides.
I had some people come up to me afterwards to ask for more details about Bucardo, because (as they pointed out), it's the only multi-master replication system for Postgres (not technically true, as pgpool and rubyrep provide multi-master use cases as well, but the former is synchronous and fairly complex, while the latter is very new and lacking some features). Maybe next year I should give a whole talk on Bucardo rather than just blabbing about it here on the blog. :)
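For readers wondering what the panel's collision question looks like in practice, here is a minimal Python sketch of the general idea: when the same row changes on two masters, a strategy function picks the winner. This is not Bucardo's code or configuration. The `RowVersion` type, the "latest wins" rule, and the balance-based rule are all assumptions chosen as common examples.

```python
from dataclasses import dataclass


@dataclass
class RowVersion:
    """One side's version of a row that was changed on both masters."""
    source: str        # which database produced this version
    value: dict        # the row contents
    changed_at: float  # modification time, seconds since the epoch


def latest_wins(a, b):
    """Generic strategy: keep whichever version was modified last."""
    return a if a.changed_at >= b.changed_at else b


def prefer_higher_balance(a, b):
    """Custom business rule (hypothetical): never lose a deposit."""
    return a if a.value["balance"] >= b.value["balance"] else b


def resolve(a, b, strategy=latest_wins):
    """Pick the winning version; the caller copies it to the other side."""
    return strategy(a, b)


east = RowVersion("east", {"id": 7, "balance": 120}, changed_at=1000.0)
west = RowVersion("west", {"id": 7, "balance": 95}, changed_at=1005.0)

winner_by_time = resolve(east, west)                         # west is newer
winner_by_rule = resolve(east, west, prefer_higher_balance)  # east has more
```

The two strategies disagree on the same pair of rows, which is the reason a generic rule is sometimes not enough and a resolver built on business logic is needed.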
After that, I popped into the Check Please! What Your Postgres Databases Wishes You Would Monitor talk by Robert Treat (who I also used to work with). It was a good talk, but pretty much review for me, as watching over and monitoring databases is what I spend a lot of my time doing. :) Here's the description:
Compared to many proprietary systems, Postgres tends to be pretty straight forward to run. However, if you want to get the most from your database, you shouldn't just set it and forget it, you need to monitor a few key pieces of information to keep performance going. This talk will review several key metrics you should be aware of, and explain under which scenarios you may need additional monitoring.
The final talk I went to was Deploying and testing triggers and functions in multiple databases by Norman Yamada. This was an interesting talk for me because he was using a lot of the code from the same_schema action in the check_postgres program to do the actual comparison. Indeed, I made some patches while at the conference to allow for better index comparisons at Norman's request. I also managed to get some work done on tail_n_mail and Bucardo while there - something about being surrounded by all that Postgres energy made me productive despite having very little free time.
I had to catch an early flight, and was not able to catch the final talk slot of the day, nor the closing session or the BOFs that night. Hopefully someone who did catch those will blog about it and let me know how it went. I hear the t-shirt we signed at the developer's meeting went for a sweet ransom.
If you went to PgCon, I have two requests for you.
First, please fill out the feedback for each talk you went to. It takes less than a minute per talk, and is invaluable for both the speakers and the conference organizers. Second, please blog about PgCon. It's helpful for people who did not get to go to see the conference through other people's eyes. And do it now, while things are still fresh.
If you did not go to PgCon, I have one request for you: go next year! Perhaps next year at PgCon 2011 we'll break the 200 person mark. Thanks to Dan Langille as always for creating PgCon and keeping it running smoothly year after year.
PostgreSQL Conference - PGCon 2010 - Day One
The first day of talks for PGCon 2010 is now over; here's a recap of the parts that I attended.
On Wednesday, the developer's meeting took place. It was basically 20 of us gathered around a long conference table, with Dave Page keeping us to a strict schedule. While there were a few side conversations and contentious issues, overall we covered an amazing amount of things in a short period of time, and actually made action items out of almost all of them. My favorite *decision* we made was to finally move to git, something I and others have been championing for years. The other most interesting parts for me were the discussion of what features we will try to focus on for 9.1 (it's an ambitious list, no doubt), and DDL triggers! It sounds like Jan Wieck has already given this a lot of thought, so I'm looking forward to working with him in implementing these triggers (or at least nagging him about it if he slows down). These triggers will be immensely useful to replication systems like Bucardo and Slony, which implement DDL replication in a very manual and unsatisfactory way. These triggers will not be like the current triggers, in that they will not be directly attached to system tables. Instead, they will be associated with certain DDL events, such that you could have a trigger on any CREATE events (or perhaps also allowing something finer grained such as a trigger on a CREATE TABLE event). Whenever it comes in, I'll make sure that Bucardo supports it, of course!
The first day of talks kicked off with the plenary by Gavin Roy called "Perspectives on NoSQL" (description and slides are available). Gavin actually took the time to *gasp* research the topic, and gave a quick rundown of some of the more popular "NoSQL" solutions, including CouchDB, MongoDB, Cassandra, Project Voldemort, Redis, and Tokyo Tyrant. He then benchmarked all of them against Postgres for various tasks - and did it against both "regular safe" Postgres and "running with scissors" fsync-off Postgres. The results? Postgres scales very well, and more than holds its own against the NoSQL newcomers. MongoDB did surprisingly well: see the slides for the details. His slides also had the unfortunate portmanteau of "YeSQL", which only helps to emphasize how silly our "PostgreSQL" name is. :)
The next talk was Postgres (for non-Postgres people) by Greg Sabino Mullane (me!). Unlike previous years, my slides are already online. Yes, at first blush, it seems a strange talk to give at a conference like this, but we always have a good number of people from other database systems that are considering Postgres, are in the process of migrating to Postgres, or are just new to Postgres. The talk was in three parts: the first was about the mechanics of migrating your application to Postgres: the data types that Postgres uses, how we implement indexes, the best way to migrate your data, and many other things, with an eye towards common migration problems (especially when coming from MySQL). The second part of the talk discussed some of the quirks of Postgres that people coming from DB2, Oracle, etc. should be aware of. Some things discussed: how Postgres does MVCC and the need for vacuum, our really smart planner and lack of hints, the automatic (and against the spec) lowercasing, and our concept of schemas. I also touched on what I see as some of our drawbacks (tuned for a toaster, no true in-place upgrade, the unpronounceable name, the lack of marketing) and what some of our perceived-but-not-real drawbacks are (lack of replication, poor speed). What would a list of drawbacks be without a list of strengths? Ours include transactional DDL, a very friendly and helpful community, PostGIS, authentication options, an awesome query planner, the ability to create your own custom database objects, and our distributed nature that ensures the project cannot be bought out or destroyed. The last part of the talk went over the Postgres project itself: the community, the developers, the philosophy, and how it all fits together. I ran out of time so did not get to tell my "longest patch process ever" story for \dfS (six years!) but I don't think I missed anything important and gave time for some questions.
The next talk was Hypothetical Indexes towards self-tuning in PostgreSQL by Sergio Lifschitz. In the words of Sergio:
Hypothetical indexes are simulated index structures created solely in the database catalog. This type of index has no physical extension and, therefore, cannot be used to answer actual queries. The main benefit is to provide a means for simulating how query execution plans would change if the hypothetical indexes were actually created in the database. This feature is quite useful for database tuners and DBAs.
It was a very interesting talk. Robert Haas asked him to put it in the PostgreSQL license so we can easily put it into the project as needed. Sergio promised to make the change immediately after the talk!
After lunch, the next talk was pg_statsinfo - More useful statistics information for DBAs by Tatsuhito Kasahara. This talk was a little hard to follow along, but had some interesting ideas about monitoring Postgres, a lot of which overlapped with some of my projects such as tail_n_mail and check_postgres.
The next talk was Forensic Analysis of Corrupted Databases by Greg Stark. This was a neat little talk; many of the error messages he displayed were all too familiar to me. It was a nice overview of how to track down the exact location of a problem in a corrupted database, and some strategies for fixing it, including the old "using dd to write things from /dev/zero directly into your Postgres files" trick. There was even a discussion about the possibility of zeroing out specific parts of a page header, with the consensus that it would not work as one would hope.
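The dd trick boils down to overwriting exactly one page of a relation file at the right byte offset. Here is a minimal Python sketch of that arithmetic, assuming the default 8 kB block size; `zero_block` is a hypothetical helper, not a Postgres tool. It runs against a scratch file, since this is a last resort that should only ever be tried on a copy of the data files with the server stopped.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default page size; a custom build may differ


def zero_block(path, block_number, block_size=BLOCK_SIZE):
    """Overwrite one page of a file with zeros, leaving the rest untouched."""
    offset = block_number * block_size
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        if offset + block_size > f.tell():
            raise ValueError("block %d is past the end of the file" % block_number)
        f.seek(offset)
        f.write(b"\x00" * block_size)


# Demonstrate on a scratch file of three fake pages, never on a live database.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE)
    scratch = tmp.name

zero_block(scratch, 1)  # wipe only the middle page

with open(scratch, "rb") as f:
    data = f.read()
os.remove(scratch)
```

Opening the file in `r+b` mode and seeking before the write keeps the file length and the neighbouring pages intact, which is the same effect dd gives when told not to truncate its output file.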
After a quick hacky sack break with Robert Treat and some Canadian locals, I went to the final real talk of the day: The PostgreSQL Query Planner by Robert Haas. I had seen this talk recently, but wanted to see it again as I missed some of the beginning of the talk when I saw it at Pg East 2010 in Philly. Robert gave a good talk, and was very good at repeating the audience's questions. I didn't learn all that much, but it was a very good overview of the planner, including some of the new planner tricks (such as join removal) in 9.0 and 9.1.
After that, the lightning talks started. I really like lightning talks, and thankfully they weren't held on the last day of the conference this time (a common mistake). The MC was Selena Deckelmann, who did a great job of making sure all the slides were gathered up beforehand, and strictly enforced the five minute time limit. The list of slides is on the Postgres wiki. I talked on my latest favorite project, tail_n_mail - the slides are available on the wiki. I didn't make it through all my slides, so if you were at the talks, check out the PDF for the final two that were not shown. There seemed to be good interest in the project, and I had several people tell me afterwards they would try it out.
The night ended with the EnterpriseDB sponsored party. I spoke to a lot of people there, about replication, PITR scripts, log monitoring, the problem with a large number of inherited objects, and many other topics. Note to EDB: I don't think that venue is going to scale, as the conference gets bigger each year! The total number of people at the conference this year was 184, a new record.
A very good first day: I learned a lot, met new people, saw old friends, and hopefully sold Postgres to some non-Postgres people :). I also managed to git push some changes to tail_n_mail, check_postgres, and Bucardo. It's hard to say no to feature requests when someone asks you in person. :)
PostgreSQL switches to Git
Looks like the Postgres project is finally going to bite the bullet and switch to git as the canonical VCS. Some details are yet to be hashed out, but the decision has been made and a new repo will be built soon. Now to lobby to get that commit-with-inline-patches list to be created...
PostgreSQL 8.4 on RHEL 4: Teaching an old dog new tricks
So a client has been running a really old version of PostgreSQL in production for a while. We finally got the approval to upgrade them from 7.3 to the latest 8.4. Considering the age of the installation, it should come as little surprise that they had been running a similarly ancient OS: RHEL 4.
Like the installed PostgreSQL version, RHEL 4 is ancient -- 5 years old. I anticipated that in order to get us to a current version of PostgreSQL, we'd need to resort to a source build or rolling our own PostgreSQL RPMs. Neither approach was particularly appealing.
While the age/decrepitude of the current machine's OS came as little surprise, what did come as a surprise was that there were supported RPMs available for RHEL 4 in the community yum rpm repository, located at http://yum.pgrpms.org/8.4/redhat/rhel-4-i386/repoview/ (modulo your architecture of choice).
In order to get things installed, I followed the instructions for installing the specific yum repo. There were a few seconds where I was confused because the installation command was giving a "permission denied" error when attempting to install the 8.4 PGDG rpm as root. A little brainstorming and an lsattr later revealed that a previous administrator, apparently in the quest for über-security, had performed a chattr +i on the /etc/yum.repos.d directory.
Evil having been thwarted, in the interest of über-usability I did a quick chattr -i /etc/yum.repos.d and installed the PGDG rpm. Away we went. From that point, the install was completely straightforward; I had a PostgreSQL 8.4.4 system running in no time, and could finally get off that 7.3 behemoth. Now to talk my way into an OS upgrade...