Database Blog Archive
Foreign Data Wrappers
Original images from Flickr user jenniferwilliams
One of our clients, for various historical reasons, runs both MySQL and PostgreSQL to support their website. Information for user login lives in one database, but their customer activity lives in the other. The eventual plan is to consolidate these databases, but thus far, other concerns have been more pressing. So when they needed a report combining user account information and customer activity, the involvement of two separate databases became a significant complicating factor.
In similar situations in the past, using earlier versions of PostgreSQL, we've written scripts to pull data from MySQL and dump it into PostgreSQL. This works well enough, but we've updated PostgreSQL fairly recently, and can use the SQL/MED features added in version 9.1. SQL/MED ("MED" stands for "Management of External Data") is a decade-old standard designed to allow databases to make external data sources, such as text files, web services, and even other databases look like normal database tables, and access them with the usual SQL commands. PostgreSQL has supported some of the SQL/MED standard since version 9.1, with a feature called Foreign Data Wrappers, and among other things, it means we can now access MySQL through PostgreSQL seamlessly.
The first step is to install the right software, called mysql_fdw. It comes to us via Dave Page, PostgreSQL core team member and contributor to many projects. It's worth noting Dave's warning that he considers this experimental code. For our purposes it works fine, but as will be seen in this post, we didn't push it too hard. We opted to download the source and build it, but installing using pgxn works as well:
$ env USE_PGXS=1 pgxnclient install mysql_fdw INFO: best version: mysql_fdw 1.0.1 INFO: saving /tmp/tmpjrznTj/mysql_fdw-1.0.1.zip INFO: unpacking: /tmp/tmpjrznTj/mysql_fdw-1.0.1.zip INFO: building extension gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -fpic -I/usr/include/mysql -I. -I. -I/home/josh/devel/pg91/include/postgresql/server -I/home/josh/devel/pg91/include/postgresql/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o mysql_fdw.o mysql_fdw.c mysql_fdw.c: In function ‘mysqlPlanForeignScan’: mysql_fdw.c:466:8: warning: ‘rows’ may be used uninitialized in this function [-Wmaybe-uninitialized] gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -fpic -shared -o mysql_fdw.so mysql_fdw.o -L/home/josh/devel/pg91/lib -L/usr/lib -Wl,--as-needed -Wl,-rpath,'/home/josh/devel/pg91/lib',--enable-new-dtags -L/usr/lib/x86_64-linux-gnu -lmysqlclient -lpthread -lz -lm -lrt -ldl INFO: installing extension < ... snip ... >
Here I'll refer to the documentation provided in mysql_fdw's README. The first step in using a foreign data wrapper, once the software is installed, is to create the foreign server, and the user mapping. The foreign server tells PostgreSQL how to connect to MySQL, and the user mapping covers what credentials to use. This is an interesting detail; it means the foreign data wrapper system can authenticate with external data sources in different ways depending on the PostgreSQL user involved. You'll note the pattern in creating these objects: each simply takes a series of options that can mean whatever the FDW needs them to mean. This allows the flexibility to support all sorts of different data sources with one interface.
The final step in setting things up is to create a foreign table. In MySQL's case, this is sort of like a view, in that it creates a PostgreSQL table from the results of a MySQL query. For our purposes, we needed access to several thousand structurally identical MySQL tables (I mentioned the goal is to move off of this one day, right?), so I automated the creation of each table with a simple bash script, which I piped into psql:
for i in `cat mysql_tables`; do
echo "CREATE FOREIGN TABLE mysql_schema.$i ( ... table definition ...)
SERVER mysql_server OPTIONS (
database 'mysqldb',
query 'SELECT ... some fields ... FROM $i'
);"
done
In a step not shown above, this script also consolidates the data from each table into one, native PostgreSQL table, to simplify later reporting. In our case, pulling the data once and reporting on the results is perfectly acceptable; in other words, data a few seconds old wasn't a concern. We also didn't need to write back to MySQL, which presumably could complicate things somewhat. We did, however, run into the same data validation problems PostgreSQL users habitually complain about when working with MySQL. Here's an example, in my own test database:
mysql> create table bad_dates (mydate date);
Query OK, 0 rows affected (0.07 sec)
mysql> insert into bad_dates values ('2013-02-30'), ('0000-00-00');
Query OK, 2 rows affected (0.02 sec)
Records: 2 Duplicates: 0 Warnings: 0
Note that MySQL silently transformed '2013-02-30' into '0000-00-00'. Sigh. Then, in psql we do this:
josh=# create extension mysql_fdw; CREATE EXTENSION josh=# create server mysql_svr foreign data wrapper mysql_fdw options (address '127.0.0.1', port '3306'); CREATE SERVER josh=# create user mapping for public server mysql_svr options (username 'josh', password ''); CREATE USER MAPPING josh=# create foreign table bad_dates (mydate date) server mysql_svr options (query 'select * from test.bad_dates'); CREATE FOREIGN TABLE josh=# select * from bad_dates ; ERROR: date/time field value out of range: "0000-00-00"
We've told PostgreSQL we'll be feeding it valid dates, but MySQL's idea of a valid date differs from PostgreSQL's, and the latter complains when the dates don't meet its stricter requirements. Several different workarounds exist, including admitting that '0000-00-00' really is wrong and cleaning up MySQL, but in this case, we modified the query underlying the foreign table to fix the dates on the fly:
SELECT CASE disabled WHEN '0000-00-00' THEN NULL ELSE disabled END,
-- various other fields
FROM some_table
Fortunately this is the only bit of MySQL / PostgreSQL impedance mismatch that has tripped us up thus far; we'd have to deal with any others we found individually, just as we did this one.
Dimensional Modeling
People occasionally bring up the question, "What exactly is a data warehouse?" Though answers to this question vary, in short a data warehouse exists to analyze data and its behavior over large swaths of time or many different data sources. There's more to it, though, than simply cramming historical data into the same old database. There are a number of defining characteristics, including the following:
- Query patterns and behavior
- Data retention policy
- Database structure
Query Behavior
Data warehouses are sometimes called "OLAP" databases, which stands for "on-line analytical processing", in contrast to the more common "OLTP", or "on-line transaction processing" databases that manage data for an online storefront, a bug tracker, or a blog. A typical OLTP database supports applications that issue short, simple queries, and expect quick answers and support for many simultaneous transactions. The average OLAP query, by contrast, is generally read-only, but can be quite complex, and might take minutes or hours to complete. Such queries will often include heavy-duty statistical processing and data mining, involving terabytes of data.
Data Retention
In a typical e-commerce database, eventually it becomes helpful to archive away older data that the front-end applications won't need anymore. This helps performance, simplifies backups, and has the nice side-effect of leaving less data available for nefarious black hats that come snooping around. But it doesn't make sense simply to discard much of this data, because it contains valuable information: customer behavior, supplier response time, etc. Often this deleted data remains alive, in a different form perhaps, in a data warehouse, which can contain data spanning many years.
Database structure
Because query patterns in data warehouses differ so much from OLTP databases, it makes sense to structure the OLAP database to support its queries better. OLTP databases typically follow an "entity" model, hence the ubiquitous (if often not particularly useful) entity-relationship diagram. In such a design, tables represent objects stored in the database, such as an order, a user, or a product. OLAP databases, on the other hand, are commonly "dimensionally" modeled, which results in something called a "star schema". In a star schema, the database contains large "fact" tables, full of foreign keys pointing to a set of "dimension" tables; when diagrammed, this looks like star with the fact table in the center, or for the more mechanically oriented, a wheel with the fact table at the hub and dimension tables at the end of each spoke. Rather than modeling a particular entity, the fact table generally describes business processes, such as customer conversion or shipping efficiency.
Dimensional modeling is an interesting topic full of its own rules of thumb, which often differ quite dramatically from typical entity modeling. For instance, whereas many database modelers complain about the use of surrogate keys in OLTP databases, it's recommended in dimensional modeling. Dimensional databases generally don't need OUTER joins, and rarely contain NULL values. As a result, business intelligence applications designed for data warehousing can make certain assumptions about how they'll be asked to query the database. These assumptions obviously place some limits on the types of queries the database can process effectively. However it is through these assumptions that the system gains most of its efficiency.
Biggest among the limitations of an OLAP database is what's called the "grain". The grain describes exactly what information the fact table contains, and at what level of detail; it should be made clear during the first stages of warehouse design and widely understood by all involved. Queries that require information that isn't part of the grain, or at finer levels of detail, must find a different fact table to use. But for queries which depend only on the available data, the fact table can be very efficient, as the database can partition it easily, and scan it unencumbered by simultaneous writes from other transactions, and filtered by simple conditions and INNER joins to the various dimension tables.
Data warehouses differ from the traditional database in several other ways, but this covers some of the basics. Dimensional modeling alone is a well-developed field of study with numerous intricacies, where experience and careful training are important for developing a useful final model. But the analytical power of such databases has been proven.
NoSQL benchmark of Cassandra, HBase, MongoDB
We're excited to have recently worked on an interesting benchmarking project for DataStax, the key company supporting the Cassandra "NoSQL" database for large horizontally-scalable data stores. This was done over the course of about 2 months.
This benchmark compares the performance of MongoDB, HBase, and Cassandra on the widely-used Amazon Web Services (AWS) EC2 cloud instances with local storage in RAID, in configurations ranging from 1-32 database nodes. The software stack included 64-bit Ubuntu 12.04 LTS AMIs, Oracle Java 1.6, and YCSB (Yahoo! Cloud Serving Benchmark) for its lowest-common-denominator NoSQL database performance testing features. Seven different test workloads were used to get a good mix of read, write, modify, and combined scenarios.
Because cloud computing resources are subject to "noisy neighbor" situations of degraded CPU or I/O performance, the tests were run 3 times each on 3 different days, with different EC2 instances to minimize any AWS-related variance.
The project involved some interesting automation challenges for repeatedly spinning up the correct numbers and types of nodes, configuring the node software, running tests, and gathering and collating results data. We kept the AWS costs more reasonable by using Spot Instances for most instances.
You can read more at DataStax's white paper page and see all the details in the white paper itself.
SFTP virtual users with ProFTPD and Rails: Part 1
I recently worked on a Rails 3.2 project that used the sweet PLupload JavaScript/Flash upload tool to upload files to the web app. To make it easier for users to upload large and/or remote files to the app, we also wanted to let them upload via SFTP. The catch was, our users didn't have SFTP accounts on our server and we didn't want to get into the business of creating and managing SFTP accounts. Enter: ProFTPD and virtual users.
ProFTPD's virtual users concept allows you to point ProFTPD at a SQL database for your user and group authentication. This means SFTP logins don't need actual system logins (although you can mix and match if you want). Naturally, this is perfect for dynamically creating and destroying SFTP accounts. Give your web app the ability to create disposable SFTP credentials and automatically clean up after the user is done with them, and you have a self-maintaining system.
Starting from the inside-out, you need to configure ProFTPD to enable virtual users. Here are the relevant parts from our proftpd.conf:
## # Begin proftpd.conf excerpt. For explanation of individual config directives, see the # great ProFTPD docs at http://www.proftpd.org/docs/directives/configuration_full.html ## DefaultServer off Umask 002 AllowOverwrite on # Don't reference /etc/ftpusers UseFtpUsers off# Enable SFTP SFTPEngine on # Enable SQL based authentication SQLAuthenticate on # From http://www.proftpd.org/docs/howto/CreateHome.html # Note that the CreateHome params are kind of touchy and easy to break. CreateHome on 770 dirmode 770 uid ~ gid ~ # chroot them to their home directory DefaultRoot ~ # Defines the expected format of the passwd database field contents. Hint: An # encrypted password will look something like: {sha1}IRYEEXBUxvtZSx3j8n7hJmYR7vg= SQLAuthTypes OpenSSL # That '*' makes that module authoritative and prevents proftpd from # falling through to system logins, etc AuthOrder mod_sql.c* # sftp_users and sftp_groups are the database tables that must be defined with # the proceeding column names. You can have other columns in these tables and # ProFTPD will leave them alone. The sftp_groups table can be empty, but it must exist. SQLUserInfo sftp_users username passwd uid sftp_group_id homedir shell SQLGroupInfo sftp_groups name id members SFTPHostKey /etc/ssh/ssh_host_rsa_key SFTPHostKey /etc/ssh/ssh_host_dsa_key SFTPCompression delayed SFTPAuthMethods password RequireValidShell no # SQLLogFile is very verbose, but helpful for debugging while you're getting this working SQLLogFile /var/log/proftpd_sql.sql ## Customize these for production SQLConnectInfo database@localhost:5432 dbuser dbpassword # The UID and GID values here are set to match the user that runs our web app because our # web app needs to read and delete files uploaded via SFTP. Naturally, that is outside # the requirements of a basic virtual user setup. But in our case, our web app user needs # to be able to cd into a virtual user's homedir, and run a `ls` in there. Also, note that # setting these two IDs here (instead of in our sftp_users table) *only* makes sense if # you are using the DefaultRoot directive to chroot virtual users. SQLDefaultUID 510 SQLDefaultGID 500
The CreateHome piece was the trickiest to get working just right for our use-case. But there are two reasons for that; we needed our web app to be able to read/delete the uploaded files, and we wanted to make ProFTPD create those home directories itself. (And it only creates that home directory once a user successfully logs in via SFTP. That means you can be more liberal in your UI with generating credentials that may never get used without having to worry about a ton of empty home directories lying about.)
That's it for the introductory "Part 1" of this article. In Part 2, I'll show how we generate credentials, the workflow behind displaying those credentials, and our SftpUser ActiveRecord model that handles it all. In Part 3, I'll finish up by running through exactly how our web app accesses these files, and how it cleans up after it's done.
Detecting table rewrites with the ctid column
In a recent article, I mentioned that changing the column definition of a Postgres table will sometimes cause a full table rewrite, but sometimes it will not. The rewrite depends on both the nature of the change and the version of Postgres you are using. So how can you tell for sure if changing a large table will do a rewrite or not? I'll show one method using the internal system column ctid.
Naturally, you do not want to perform this test using your actual table. In this example, we will create a simple dummy table. As long as the column types are the same as your real table, you can determine if the change will do a table rewrite on your version of PostgreSQL.
The aforementioned ctid column represents the physical location of the table's row on disk. This is one of the rare cases in which this column can be useful. The ctid value consists of two numbers: the first is the "page" that the row resides in, and the second number is the slot in that page where it resides. To make things confusing, the page numbering starts at 0, while the slot starts at 1, which is why the very first row is always at ctid (0,1). However, the only important information for this example is determining if the ctid for the rows has changed or now (which indicates that the physical on-disk data has changed, even if the data inside of it has not!).
Let's create a very simple example table and see what the ctids look like. When Postgres updates a row, it actually marks the current row as deleted and inserts a new row. Thus, there is a "dead" row that needs to be eventually cleaned out. (this is the way Postgres implements MVCC - there are others). The primary way this cleanup happens is through the use of VACUUM FULL, so we'll use that command to force the table to rewrite itself (and thus 'reset' the ctids as you will see):
postgres=# DROP TABLE IF EXISTS babies; DROP TABLE postgres=# CREATE TABLE babies (gender VARCHAR(10), births INTEGER); CREATE TABLE postgres=# INSERT INTO babies VALUES ('Girl', 1), ('Boy', 1); INSERT 0 2 -- Note: the ctid column is never included as part of '*' postgres=# SELECT ctid, * FROM babies; ctid | gender | births -------+--------+-------- (0,1) | Girl | 1 (0,2) | Boy | 1 (2 rows) -- Here comes Ivy, another girl: postgres=# UPDATE babies SET births = births+1 WHERE gender = 'Girl'; UPDATE 1 -- Note that we have a new ctid: slot 3 of page 0 -- The old row at (0,1) is still there, but it is deleted and not visible postgres=# SELECT ctid, * FROM babies; ctid | gender | births -------+--------+-------- (0,2) | Boy | 1 (0,3) | Girl | 2 (2 rows) -- The vacuum full removes the dead rows and moves the live rows to the front: postgres=# VACUUM FULL babies; VACUUM -- We are back to the original slots, although the data is reversed: postgres=# SELECT ctid, * FROM babies; ctid | gender | births -------+--------+-------- (0,1) | Boy | 1 (0,2) | Girl | 2 (2 rows)
That's what a table rewrite will look like - all the dead rows will be removed, and the rows will be rewritten starting at page 0, adding slots until a new page is needed. We know from the previous article and the fine documentation that Postgres version 9.1 is smarter about avoiding table rewrites. Let's try changing the column definition of the table above on version 8.4 and see what happens. Note that we do an update first so that we have at least one dead row.
postgres=# SELECT substring(version() from $$\d+\.\d+$$); substring ----------- 8.4 postgres=# DROP TABLE IF EXISTS babies; DROP TABLE postgres=# CREATE TABLE babies (gender VARCHAR(10), births INTEGER); CREATE TABLE postgres=# INSERT INTO babies VALUES ('Girl', 1), ('Boy', 1); INSERT 0 2 -- No real data change, but does write new rows to disk: postgres=# UPDATE babies SET gender = gender; UPDATE 2 postgres=# SELECT ctid, * FROM babies; ctid | gender | births -------+--------+-------- (0,3) | Girl | 1 (0,4) | Boy | 1 (2 rows) -- Change the VARCHAR(32) to a TEXT: postgres=# ALTER TABLE babies ALTER COLUMN gender TYPE TEXT; ALTER TABLE postgres=# SELECT ctid, * FROM babies; ctid | gender | births -------+--------+-------- (0,1) | Girl | 1 (0,2) | Boy | 1 (2 rows)
We can see from the above that changing from VARCHAR to TEXT in version 8.4 of Postgres does indeed rewrite the table. Now let's see how version 9.1 performs:
postgres=# SELECT substring(version() from $$\d+\.\d+$$); substring ----------- 9.1 postgres=# DROP TABLE IF EXISTS babies; DROP TABLe postgres=# CREATE TABLE babies (gender VARCHAR(10), births INTEGER); CREATE TABLe postgres=# INSERT INTO babies VALUES ('Girl', 1), ('Boy', 1); INSERT 0 2 -- No real data change, but does write new rows to disk: postgres=# UPDATE babies SET gender = gender; UPDATE 2 postgres=# SELECT ctid, * FROM babies; ctid | gender | births -------+--------+-------- (0,3) | Girl | 1 (0,4) | Boy | 1 (2 rows) -- Change the VARCHAR(32) to a TEXT: postgres=# ALTER TABLE babies ALTER COLUMN gender TYPE TEXT; ALTER TABLE postgres=# SELECT ctid, * FROM babies; ctid | gender | births -------+--------+-------- (0,3) | Girl | 1 (0,4) | Boy | 1 (2 rows)
We confirmed that the ALTER TABLE in this particular case does *not* perform a table rewrite when using version 9.1, as we suspected. We tell this by seeing that the ctids stayed the same. We could further verify by doing a vacuum full and showing that there were indeed dead rows that had been left untouched by the ALTER TABLE.
Note that this small example works because nothing else is vacuuming the table, as it is too small and transient for autovacuum to care about it. VACUUM FULL is one of three ways a table can get rewritten; besides ALTER TABLE, the other way is with the CLUSTER command. We go through all the trouble above because an ALTER TABLE is the only one of the three that *may* rewrite the table - the other two are guaranteed to do so.
This is just one example of the things you can do by viewing the ctid column. It is always nice to know beforehand if a table rewrite is going to occur, as it can be the difference between a query that runs in milliseconds versus hours!
Postgres alter column problems and solutions
A common situation for database-backed applications is the need to change the attributes of a column. One can change the data type, or more commonly, only the size limitation, e.g. VARCHAR(32) gets changed to VARCHAR(42). There are a few ways to accomplish this in PostgreSQL, from a straightforward ALTER COLUMN, to replacing VARCHAR with TEXT (plus a table constraint), to some advanced system catalog hacking.
The most common example of such a change is expanding a VARCHAR declaration to allow more characters. For example, your "checksum" column was based on MD5 (at 32 characters), and now needs to be based on Keccak (Keccak is pronounced "catch-ack") (at 64 characters) In other words, you need a column in your table to change from VARCHAR(32) to VARCHAR(64). The canonical approach is to do this:
ALTER TABLE foobar ALTER COLUMN checksum TYPE VARCHAR(64);
This approach works fine, but it has two huge and interrelated problems: locking and time. This approach locks the table for as long as the command takes to run. And by lock, we are talking a heavy 'access exclusive' lock which shuts everything else out of the table. If your table is small, this is not an issue. If your table has a lot of data, however, this brings us to the second issue: table rewrite. The above command will cause Postgres to rewrite every single row of the table, which can be a very expensive operation (both in terms of disk I/O and wall clock time). So, a simple ALTER COLUMN solution usually comes at a very high cost for large tables. Luckily, there are workarounds for this problem.
First, some good news: as of version 9.2, there are many operations that will no longer require a full table rewrite. Going from VARCHAR(32) to VARCHAR(64) is one of those operations! Thus, if you are lucky enough to be using version 9.2 or higher of Postgres, you can simply run the ALTER TABLE and have it return almost instantly. From the release notes:
Reduce need to rebuild tables and indexes for certain ALTER TABLE ... ALTER COLUMN TYPE operations (Noah Misch)
Increasing the length limit for a varchar or varbit column, or removing the limit altogether, no longer requires a table rewrite. Similarly, increasing the allowable precision of a numeric column, or changing a column from constrained numeric to unconstrained numeric, no longer requires a table rewrite. Table rewrites are also avoided in similar cases involving the interval, timestamp, and timestamptz types.
However, if you are not yet on version 9.2, or are making an operation not covered above (such as shrinking the size limit of a VARCHAR), your only option to avoid a full table rewrite is the system catalog change below. However, before you jump down there, consider a different option: abandoning VARCHAR altogether.
In the Postgres world, there are few differences between the VARCHAR and TEXT data types. The latter can be thought of as an unbounded VARCHAR, or if you like, a VARCHAR(999999999999). You may also add a check constraint to a table to emulate the limit of a VARCHAR. For example, to convert a VARCHAR(32) column named "checksum" to a TEXT column:
ALTER TABLE foobar ALTER COLUMN checksum TYPE text; ALTER TABLE foobar ADD CONSTRAINT checksum_length CHECK (LENGTH(checksum) <= 32);
The data type change suffers from the same full table rewrite problem as before, but if you are using version 9.1 or newer of Postgres, the change from VARCHAR to TEXT does not do a table rewrite. The creation of the check constraint, however, will scan all of the existing table rows to make sure they meet the condition. While not as costly as a full table rewrite, scanning every single row in a large table will still be expensive. Luckily, version 9.2 of Postgres comes to the rescue again with the addition of the NOT VALID phrase to the check constraint clause. Thus, in newer versions you can avoid the scan entirely by writing:
ALTER TABLE foobar ADD CONSTRAINT checksum_length CHECK (LENGTH(checksum) <= 32) NOT VALID;
This is a one-time exception for the constraint, and only applies as the constraint is being created. In other words, despite the name, the constraint is very much valid after it is created. If you want to validate all the rows that you skipped at a later time, you can use the ALTER TABLE .. VALIDATE CONSTRAINT command. This has the double advantage of allowing the check to be delayed until a better time, and taking a much lighter lock on the table than the ALTER TABLE .. ADD CONSTRAINT does.
So why would you go through the trouble of switching from your VARCHAR(32) to a TEXT column with a CHECK constraint? There are at least three good reasons.
First, if you are running Postgres 9.2 or better, this means you can change the constraint requirements on the fly, without a table scan - even for the 'non-optimal' situations such as going from 64 characters down to 32. Just drop the old constraint, and add a new one with the NOT VALID clause thrown on it.
Second, the check constraint gives a better error message, and a clearer indication that the limitation was constructed with some thought behind it. Compare these messages:
postgres=# CREATE TABLE river( checksum VARCHAR(4) ); CREATE TABLE postgres=# INSERT INTO river VALUES ('abcde'); ERROR: value too long for type character varying(4) postgres=# CREATE TABLE river( checksum TEXT, postgres-# CONSTRAINT checksum_length CHECK (LENGTH(checksum) <= 4) ); CREATE TABLE postgres=# INSERT INTO river VALUES ('abcde'); ERROR: new row for relation "river" violates check constraint "checksum_length" DETAIL: Failing row contains (abcde).
Third, and most important, you are no longer limited to a single column attribute (maximum length). You can use the constraint to check for many other things as well: minimum size, actual content, regex matching, you name it. As a good example, if we are are truly storing checksums, we probably want the hexadecimal Keccak checksums to be *exactly* 64 characters, and not just a maximum length of 64 characters. So, to illustrate the above point about switching constraints on the fly, you could change the VARCHAR(32) to a TEXT and enforce a strict 64 character limit with:
BEGIN; ALTER TABLE foobar DROP CONSTRAINT checksum_length; ALTER TABLE foobar ADD CONSTRAINT checksum_length CHECK (LENGTH(checksum) = 64) NOT VALID; COMMIT;
We just introduced a minimum *and* a maximum, something old VARCHAR could not do. We can constrain it further, as we should only be allowing hexadecimal characters to be stored. Thus, we can also reject and characters other than 0123456789abcdef from being added:
BEGIN; ALTER TABLE foobar DROP CONSTRAINT checksum_length; ALTER TABLE foobar ADD CONSTRAINT checksum_valid CHECK ( LENGTH(checksum) = 64 AND checksum ~ '^[a-f0-9]*$' ) NOT VALID; COMMIT;
Since we have already added a regex check, we can reduce the size of the CHECK with a small hit in clarity like so:
BEGIN;
ALTER TABLE foobar DROP CONSTRAINT checksum_length;
ALTER TABLE foobar ADD CONSTRAINT checksum_valid
CHECK ( checksum ~ '^[a-f0-9]{64}$' ) NOT VALID;
COMMIT;
Back to the other problem, however: how can we avoid a table rewrite when going from VARCHAR(64) to VARCHAR(32), or when stuck on an older version of Postgres that always insists on a table rewrite? The answer is the system catalogs. Please note that any updating to the system catalogs should be done very, very carefully. This is one of the few types of update I will publicly mention and condone. Do not apply this lesson to any other system table or column, as there may be serious unintended consequences.
So, what does it mean to have VARCHAR(32) vs. VARCHAR(64)? As it turns out, there is no difference in the way the actual table data is written. The length limit of a VARCHAR is simply an implicit check constraint, after all, and as such, it is quite easy to change.
Let's create a table and look at some of the important fields in the system table pg_attribute. In these examples we will use Postgres 8.4, but other versions should look very similar - this part of the system catalog rarely changes.
postgres=# CREATE TABLE foobar ( checksum VARCHAR(32) ); CREATE TABLE postgres=# \x Expanded display is on. postgres=# SELECT attname, atttypid::regtype, atttypmod FROM pg_attribute postgres=# WHERE attrelid = 'foobar'::regclass AND attname = 'checksum'; -[ RECORD 1 ]---------------- attname | checksum atttypid | character varying atttypmod | 36
The important column is atttypmod. It indicates the legal length of this varchar column (whose full legal name is 'character varying', but everyone calls it varchar). In the case of Postgres, there is also 4 characters of overhead. So VARCHAR(32) shows up as 36 in the atttypmod column. Thus, if we want to change it to a VARCHAR(64), we add 4 to 64 and get a number of 68. Before we do this change, however, we need to make sure that nothing else will be affected. There are other dependencies to consider, such as views and foreign keys, that you need to keep in mind before making this change. What you should do is carefully check all the dependencies this table has:
postgres=# SELECT c.relname||':'||objid AS dependency, deptype postgres-# FROM pg_depend d JOIN pg_class c ON (c.oid=d.classid) postgres-# WHERE refobjid = 'foobar'::regclass; dependency | deptype ---------------+--------- pg_type:16419 | i
We can see in the above that the only dependency is an entry in the pg_type table - which is a normal thing for all tables and will not cause any issues. Any other entries, however, should give you pause before doing a manual update of pg_attribute. You can use the information returned by the first column of the above query to see exactly what is referencing the table. For example, let's make that column unique, as well as adding a view that uses the table, and then see the effects on the pg_depend table:
postgres=# CREATE UNIQUE INDEX jack ON foobar(checksum); CREATE INDEX postgres=# CREATE VIEW martha AS SELECT * FROM foobar; CREATE VIEW postgres=# SELECT c.relname||':'||objid AS dependency, deptype postgres-# FROM pg_depend d JOIN pg_class c ON (c.oid=d.classid) postgres-# WHERE refobjid = 'foobar'::regclass; dependency | deptype ------------------+--------- pg_type:16419 | i pg_class:16420 | a pg_rewrite:16424 | n
The 'i', 'a', and 'n' stand for internal, auto, and normal. They are not too important in this context, but more details can be found in the docs on the pg_depend table. The first column shows us the system table and oid of the dependency, so we can look them up and see what they are:
postgres=# SELECT typname FROM pg_type WHERE oid = 16419; typname --------- foobar postgres=# SELECT relname, relkind FROM pg_class WHERE oid = 16420; relname | relkind ---------+--------- jack | i -- Views require a little redirection as they are implemented via the rules system postgres=# SELECT relname,relkind FROM pg_class WHERE oid = postgres-# (SELECT ev_class FROM pg_rewrite WHERE oid = 16424); relname | relkind ---------+--------- martha | v postgres=# \d martha View "public.martha" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(32) | View definition: SELECT foobar.checksum FROM foobar;
So what does all that tell us? It tells us we should look carefully at the index and the view to make sure they will not be affected by the change. In this case, a simple index on the column will not be affected by changing the length, so it (along with the pg_type entry) can be ignored. The view, however, should be recreated so that it records the actual column size.
We are now ready to make the actual change. This would be an excellent time to make a backup of your database. This procedure should be done very carefully - if you are unsure about any of the entries in pg_depend, do not proceed.
First, we are going to start a transaction, lock the table, and drop the view. Then we are going to change the length of the varchar directly, recreate the view, and commit! Here we go:
postgres=# SELECT c.relname||':'||objid AS dependency, deptype postgres-# FROM pg_depend d JOIN pg_class c ON (c.oid=d.classid) postgres-# WHERE refobjid = 'foobar'::regclass; dependency | deptype ------------------+--------- pg_type:16419 | i pg_class:16420 | a pg_rewrite:16424 | n postgres=# \d foobar Table "public.foobar" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(32) | Indexes: "jack" UNIQUE, btree (checksum) postgres=# \d martha View "public.martha" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(32) | View definition: SELECT foobar.checksum FROM foobar; postgres=# BEGIN; BEGIN postgres=# DROP VIEW martha; DROP VIEW postgres=# LOCK TABLE pg_attribute IN EXCLUSIVE MODE; LOCK TABLE postgres=# UPDATE pg_attribute SET atttypmod = 68 postgres-# WHERE attrelid = 'foobar'::regclass AND attname = 'checksum'; UPDATE 1 postgres=# COMMIT; COMMIT
Verify the changes and check out the pg_depend entries:
postgres=# \d foobar Table "public.foobar" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(64) | Indexes: "jack" UNIQUE, btree (checksum) postgres=# CREATE VIEW martha AS SELECT * FROM foobar; CREATE VIEW postgres=# \d martha View "public.martha" Column | Type | Modifiers ----------+-----------------------+----------- checksum | character varying(64) | View definition: SELECT foobar.checksum FROM foobar; postgres=# SELECT c.relname||':'||objid AS dependency, deptype postgres-# FROM pg_depend d JOIN pg_class c ON (c.oid=d.classid) postgres-# WHERE refobjid = 'foobar'::regclass; dependency | deptype ------------------+--------- pg_type:16419 | i pg_class:16420 | a pg_rewrite:16428 | n
Success. Both the table and the view are showing the new VARCHAR size, but the data in the table was not rewritten. Note how the final row returned by the pg_depend query changed: we dropped the view and created a new one, resulting in a new row in both pg_class and pg_rewrite, and thus a new OID shown in the pg_rewrite table.
Hopefully this is not something you ever have to perform. The new features of 9.1 and 9.2 that prevent table rewrites and table scanning should go a long way towards that.
An Encouraging LinuxFest
A few weekends ago I gave a talk at Ohio LinuxFest: Yes, You Can Run Your Business On PostgreSQL. Next Question? (slides freshly posted.) The talk isn't as technically oriented as the ones I'll usually give, but rather more inspirational and encouraging. It seemed like a good and reasonable topic, centered around Postgres but applicable to open source in general, and it's something I'd been wanting to get out there for a while.
In a previous life I worked with Microsoft shops a bit more often. You know, companies that use Windows and related software pretty much exclusively. This talk was, more or less, a result of a number of conversations with those companies about open source software and why it's a valid option. I heard a number of arguments against, some reasonable, some pretty far out there, so it felt like it'd be a good thing to gather up all of those that I'd heard over time.
These days I don't interact with those companies so much, so I was a little worried at first that the landscape had changed enough that the talk wouldn't really be useful any more. But after talking with a few people around the conference a day or two before the talk, there's definitely some companies that don't see the value in open source technologies.
The slides are essentially a rough outline, but I tried to go back and add some of the spoken context. Anyway, enjoy, and hopefully it'll help you get the open source word out.
Postgres system triggers error: permission denied
This mystifying Postgres error popped up for one of my coworkers lately while using Ruby on Rails:
ERROR: permission denied: "RI_ConstraintTrigger_16410" is a system trigger
On PostgreSQL version 9.2 and newer, the error may look like this:
ERROR: permission denied: "RI_ConstraintTrigger_a_32778" is a system trigger
ERROR: permission denied: "RI_ConstraintTrigger_c_32780" is a system trigger
I labelled this as mystifying because, while Postgres' error system is generally well designed and gives clear messages, this one stinks. A better one would be something similar to:
ERROR: Cannot disable triggers on a table containing foreign keys unless superuser
As you can now guess, this error is caused by a non-superuser trying to disable triggers on a table that is used in a foreign key relationship, via the SQL command:
ALTER TABLE foobar DISABLE TRIGGERS ALL;
Because Postgres enforces foreign keys through the use of triggers, and because data integrity is very important to Postgres, one must be a superuser to perform such an action and bypass the foreign keys. (A superuser is a Postgres role that has "do anything" privileges). We'll look at an example of this in action, and then discuss solutions and workarounds.
Note that if you are not a superuser *and* you are not the owner of the table, you will get a much better error message when you try to disable all the triggers:
ERROR: must be owner of relation foobar
To reproduce the original error, we will create two tables, and then link them together via a foreign key:
postgres=# create user alice; CREATE ROLE postgres=# \c postgres alice You are now connected to database "postgres" as user "alice". -- Verify that we are not a superuser postgres=> select usename, usesuper from pg_user where usename = (select current_user); usename | usesuper ---------+---------- alice | f postgres=> create table foo(a int unique); NOTICE: CREATE TABLE / UNIQUE will create implicit index "foo_a_key" for table "foo" CREATE TABLE postgres=> create table bar(b int); CREATE TABLE postgres=> alter table bar add constraint baz foreign key (b) references foo(a); ALTER TABLE
Let's take a look at both tables, and then try to disable triggers on each one. Because the triggers enforcing the foreign key are internal, they will not show up when we do a \d:
postgres=> \d foo Table "public.foo" Column | Type | Modifiers --------+---------+----------- a | integer | Indexes: "foo_a_key" UNIQUE CONSTRAINT, btree (a) Referenced by: TABLE "bar" CONSTRAINT "baz" FOREIGN KEY (b) REFERENCES foo(a) postgres=> \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- b | integer | Foreign-key constraints: "baz" FOREIGN KEY (b) REFERENCES foo(a) postgres=> alter table foo disable trigger all; ERROR: permission denied: "RI_ConstraintTrigger_41047" is a system trigger postgres=> alter table bar disable trigger all; ERROR: permission denied: "RI_ConstraintTrigger_41049" is a system trigger
If we try the same thing as a superuser, we have no problem:
postgres=# \c postgres postgres You are now connected to database "postgres" as user "postgres". postgres=# select usename, usesuper from pg_user where usename = (select current_user); usename | usesuper ----------+---------- postgres | t postgres=# alter table foo disable trigger all; ALTER TABLE postgres=# alter table bar disable trigger all; ALTER TABLE -- Don't forget to re-enable the triggers! postgres=# alter table foo enable trigger all; ALTER TABLE postgres=# alter table bar enable trigger all; ALTER TABLE
So, this error has happened to you - now what? Well, it depends on exactly what you are trying to do, and how much control over your environment you have. If you are using Ruby on Rails, for example, you may not be able to change anything except the running user. As you may imagine, this is the most obvious solution: become a superuser and run the command, as in the example above.
If you do have the ability to run as a superuser however, it is usually much easier to adjust the session_replication_role. In short, this disables *all* triggers and rules, on all tables, until it is switched back again. Do NOT forget to switch it back again! Usage is like this:
postgres=# \c postgres postgres You are now connected to database "postgres" as user "postgres". postgres=# set session_replication_role to replica; SET -- Do what you need to do - triggers and rules will not fire! postgres=# set session_replication_role to default; SET
Note: while you can do "SET LOCAL" to limit the changes to the current transaction, I always feel safer to explicitly set it before and after the changes, rather than relying on the implicit change back via commit and rollback.
It may be that you are simply trying to disable one or more of the "normal" triggers that appear on the table. In which case, you can simply disable user triggers manually rather than use 'all':
postgres=# \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- b | integer | Foreign-key constraints: "baz" FOREIGN KEY (b) REFERENCES foo(a) Triggers: trunk AFTER INSERT ON bar FOR EACH STATEMENT EXECUTE PROCEDURE funk() vupd BEFORE UPDATE ON bar FOR EACH ROW EXECUTE PROCEDURE verify_update(); postgres=> alter table bar disable trigger trunk; ALTER TABLE postgres=> alter table bar disable trigger vupd; ALTER TABLE -- Do what you need to do, then: postgres=> alter table bar enable trigger trunk; ALTER TABLE postgres=> alter table bar enable trigger vupd; ALTER TABLE
Another option for a regular user (in other words, a non super-user) is to remove the foreign key relationship yourself. You cannot disable the trigger, but you can drop the foreign key that created it in the first place. Of course, you have to add it back in as well:
postgres=# \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> alter table bar drop constraint baz; ALTER TABLE -- Do what you need to do then: postgres=> alter table bar add constraint baz foreign key (b) references foo(a); ALTER TABLE
The final solution is to work around the problem. Do you really need to disable triggers on this table? Then you can simply not disable any triggers. Perhaps the action you are ultimately trying to do (e.g. update/delete/insert to the table) can be performed some other way.
All of these solutions have their advantages and disadvantages. And that's what charts are good for!:
| Permission denied: "RI_ConstraintTrigger" is a system trigger - now what? | ||
|---|---|---|
| Solution | Good | Bad |
| Become a superuser | Works as you expect it to | Locks the table Must re-enable triggers |
| Adjust session_replication_role | No table locks! Bypasses triggers and rules on ALL tables |
Must be superuser MUST set it back to default setting |
| Disable user triggers manually | Regular users can perform Very clear what is being done Less damage if forget to re-enable |
Locks the table May not be enough |
| Drop the foreign key | Regular users can perform Very clear what is being done |
Locks the tables Must recreate the foreign key |
| Not disable any triggers | No locking Nothing to remember to re-enable |
May not work in all situations |
For the rest of this article, we will tie up two loose ends. First, how can we see the triggers if \d will not show them? Second, what's up with the crappy trigger name?
As seen above, the output of \d in the psql program shows us the triggers on a table, but not the internal system triggers, such as those created by foreign keys. Here is how triggers normally appear:
postgres=# \c postgres postgres You are now connected to database "postgres" as user "postgres". postgres=# create language plperl; CREATE LANGUAGE postgres=# create function funk() returns trigger language plperl as $$ return undef; $$; CREATE FUNCTION postgres=# create trigger trunk after insert on bar for each statement execute procedure funk(); CREATE TRIGGER postgres=# \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- b | integer | Foreign-key constraints: "baz" FOREIGN KEY (b) REFERENCES foo(a) Triggers: trunk AFTER INSERT ON bar FOR EACH STATEMENT EXECUTE PROCEDURE funk() postgres=# alter table bar disable trigger all; ALTER TABLE postgres=# \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- b | integer | Foreign-key constraints: "baz" FOREIGN KEY (b) REFERENCES foo(a) Disabled triggers: trunk AFTER INSERT ON bar FOR EACH STATEMENT EXECUTE PROCEDURE funk()
Warning: Versions older than 8.3 will not tell you in the \d output that the trigger is disabled! Yet another reason to upgrade as soon as possible because 8.2 and earlier are end of life.
If you want to see all the triggers on a table, even the internal ones, you will need to look at the pg_trigger table directly. Here is the query that psql uses when generating a list of triggers on a table. Note the exclusion based on the tgisinternal column:
SELECT t.tgname, pg_catalog.pg_get_triggerdef(t.oid, true), t.tgenabled FROM pg_catalog.pg_trigger t WHERE t.tgrelid = '32774' AND NOT t.tgisinternal ORDER BY 1;
So in our example table above, we should find the trigger we created, as well as the two triggers created by the foreign key. All of them are enabled. Disabled triggers will show as a 'D' in the tgenabled column. (O stands for origin, and has to do with session_replication_role).
postgres=# select tgname,tgenabled,tgisinternal from pg_trigger postgres-# where tgrelid = 'bar'::regclass; tgname | tgenabled | tgisinternal ------------------------------+-----------+-------------- RI_ConstraintTrigger_c_32780 | D | t RI_ConstraintTrigger_c_32781 | D | t trunk | D | f postgres=# alter table bar enable trigger all; ALTER TABLE postgres=# select tgname,tgenabled,tgisinternal from pg_trigger postgres-# where tgrelid = 'bar'::regclass; tgname | tgenabled | tgisinternal ------------------------------+-----------+-------------- RI_ConstraintTrigger_c_32780 | O | t RI_ConstraintTrigger_c_32781 | O | t trunk | O | f
As you recall, the original error - with the system trigger that had a rather non-intuitive named - looked like this:
ERROR: permission denied: "RI_ConstraintTrigger_16509" is a system trigger
We can break it apart to see what it is doing. The "RI" is short for "Referential Integrity", and anyone who manages to figure that out can probably make a good guess as to what it does. The "Constraint" means it is a constraint on the table - okay, simple enough. The "Trigger" is a little redundant, as it is extraordinarily unlikely you will ever come across this trigger without some context (such as the error message above) that tells you it is a trigger. The final number is simply the oid of the trigger itself. Stick them all together and you get a fairly obscure trigger name that is hopefully not as mysterious now!
Case Sensitive MySQL Searches
MySQL's support for case sensitive search is explained somewhat opaquely in the aptly titled Case Sensitivity in String Searches documentation. In short, it explains that by default, MySQL won't treat strings as case sensitive when executing a statement such as:
SELECT first_name FROM contacts WHERE first_name REGEXP '^[a-z]';
This simple search to look for contacts whose first name starts with a lower case letter, will return *all* contacts because in the default character set used by MySQL (latin1), upper and lower case letters share the same "sort value".
UPDATE: After many helpful comments from readers, it would seem the term I should have used was collation, not sort value. The documentation for both MySQL and PostgreSQL have lengthy discussions on the topic.
Enough with the backstory, how do I perform case sensitive searches!
The docs say to convert the string representation to a binary one. This allows "comparisons [to] use the numeric values of the bytes in the operands". Let's see it in action:
SELECT first_name FROM contacts WHERE BINARY(first_name) REGEXP '^[a-z]';
There are other strategies available, such as changing the character set being used for comparisons with the COLLATE function. This would likely work better for cases where you had many columns to compare.
SELECT first_name FROM contacts WHERE first_name REGEXP '^[a-z]' COLLATE latin1_bin;
You can even go so far as to have MySQL switch character sets and collations. But you do have to do this for each database, each table, and each column you need to convert. Not terribly fun.
The Real Cost of Data Roundtrip
Sometimes you need to perform some heavy database operations. I don't know why very often programmers are afraid of using databases for that. They usually have some fancy ORM which performs all the operations, and the only way to change the data is to make some SELECT * from a table, create a bunch of unneeded objects, change one field, convert those changed objects into queries and send that to the database.
Have you ever thought about the cost of the roundtrip of data? The cost of getting all the data from database just to send changed data into the database? Why do that if there would be much faster way of achieving the same results?
Imagine that you have quite a heavy operation. Let's make something which normally databases cannot do, some more complicated operation. Many programmers just don't know that there is any other way than writing this in the application code. Let's change all the HTML entities into real characters.
The HTML entities are a way of writing many different characters in HTML. This way you can write for instance the Euro currency sign "€" in HTML even if you don't have it on your keyboard. You just have to write € or € instead. I don't have to, as when I use UTF-8 encoding and write this character directly, it should be showed normally. What's more I have this character on my keyboard.
I will convert the text stored in database changing all the htmlentities into real unicode characters. I will do it using three different methods.
- The first will be a simple query run inside PostgreSQL
- The second will be an external program which downloads the text column from database, changes it externally and loads into database.
- The third method will be almost the same as the second, however it will download whole rows.
Generate Data
So, for this test I need to have some data. Let's write a simple data generator.
First, a simple function for returning a random number within the given range.
CREATE FUNCTION random(INTEGER, INTEGER) RETURNS INTEGER AS $$ SELECT floor ( $1 + ($2 - $1 + 1 ) * random())::INTEGER; $$ LANGUAGE SQL;
Now the function for generating random texts of random length filled with the HTML entities.
CREATE FUNCTION generate_random_string() RETURNS TEXT AS $$
DECLARE
items TEXT[] =
ARRAY[
'AAAA','BBBB','CCCC','DDDD','EEEE','FFFF','GGGG',
'HHHH','IIII','JJJJ','KKKK','LLLL','MMMM','NNNN',
'OOOO','PPPP','QQQQ','RRRR','SSSS','TTTT','UUUU',
'VVVV','WWWW','XXXX','YYYY','ZZZZ',
'&', '"', ''', '&','<','>',
'¢','£','¤','¥','¦','§',
'¨','©','ª','«','¬','­',
'®','¯','°','±','²','³',
'´','µ','¶','·','¸','¹',
'º','»','¼','½','¾'
];
length INTEGER := random(500, 1500);
result TEXT := '';
items_length INTEGER := array_length(items, 1);
BEGIN
FOR x IN 1..length LOOP
result := result || items[ random(1, items_length) ];
END LOOP;
RETURN result;
END;
$$ LANGUAGE PLPGSQL;
The table for storing the data is created with the following query:
CREATE TABLE data (
id SERIAL PRIMARY KEY,
padding TEXT,
t TEXT
);
Then I filled this table using a query generating 50k rows with random data:
INSERT INTO data(payload, t)
SELECT
generate_random_string(),
generate_random_string()
FROM
generate_series(1, 50*1000);
Let's check the table size:
SELECT pg_size_pretty(pg_relation_size('data'));
pg_size_pretty
----------------
207 MB
(1 row)
As the table is filled with random data, I need to have two tables with exactly the same data.
CREATE TABLE query (id SERIAL PRIMARY KEY, payload TEXT, t TEXT); CREATE TABLE script (id SERIAL PRIMARY KEY, payload TEXT, t TEXT); CREATE TABLE script_full (id SERIAL PRIMARY KEY, payload TEXT, t TEXT); INSERT INTO query SELECT * FROM data; INSERT INTO script SELECT * FROM data; INSERT INTO script_full SELECT * FROM data;
The Tests
SQL
Many programmers think that such operations are not normally available inside a database. However PostgreSQL has quite a nice feature, it can execute functions written in many different languages. For the purpose of this test I will use the language pl/perlu which allows me to use external libraries. I will also use HTML::Entities package for the conversion.
The function I wrote is quite simple:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
The update of the data can be done using the following query:
UPDATE query SET t = decode_html_entities(t);
Application
In order to have those tests comparable, I will write a simple perl script using exactly the same package for converting html entities.
#!/usr/bin/env perl
use DBI;
use HTML::Entities;
use Encode;
my $dbh = DBI->connect(
"DBI:Pg:dbname=test;host=localhost",
"szymon",
"",
{'RaiseError' => 1, 'pg_utf8_strings' => 1});
$dbh->do('BEGIN');
my $upd = $dbh->prepare("UPDATE script SET t = ? WHERE id = ?");
my $sth = $dbh->prepare("SELECT id, t FROM script");
$sth->execute();
while(my $row = $sth->fetchrow_hashref()) {
my $t = decode_entities( $row->{'t'} );
$t = encode("UTF-8", $t);
$upd->execute( $t, $row->{'id'} );
}
$dbh->do('COMMIT');
$dbh->disconnect();
The Worst Application
There is another terrible idea implemented by programmers too often. Why select only the column you want to change? Let's select all the rows and send them back to database.
This script will look like this (the important changes are in lines 17 and 23)
#!/usr/bin/env perl
use DBI;
use HTML::Entities;
use Encode;
my $dbh = DBI->connect(
"DBI:Pg:dbname=test;host=localhost",
"szymon",
"",
{'RaiseError' => 1, 'pg_utf8_strings' => 1});
$dbh->do('BEGIN');
my $upd = $dbh->prepare("UPDATE script_all SET t = ? WHERE id = ?");
my $sth = $dbh->prepare("SELECT id, payload, t FROM script_all");
$sth->execute();
while(my $row = $sth->fetchrow_hashref()) {
my $t = decode_entities( $row->{'t'} );
$t = encode("UTF-8", $t);
$upd->execute( $t, $row->{'payload'}, $row->{'id'} );
}
$dbh->do('COMMIT');
$dbh->disconnect();
Results.
The query using pl/perlu function executed in 26 seconds.
The script changing data externally execuded in 2 minutes 10 seconds (5 times slower)
The worst script getting and resending whole rows finished in 4 minutes 35 seconds (10 times slower).
I used quite a small number of rows. There were just 50k rows (about 200MB). On production servers the numbers are much bigger.
Just imagine that the code you developed for changing data could run 10 times faster if you'd do this in the database.
Enforcing Transaction Compartments with Foreign Keys and SECURITY DEFINER
In support of End Point's evolving offering for multi-master database replication, from the precursor to Bucardo through several versions of Bucardo itself, our code solutions depended on the ability to suppress the actions of triggers and rules through direct manipulation of the pg_class table. Most PostgreSQL database developers are probably familiar with the construct we used from the DDL scripts generated by pg_dump at one time.
Disable triggers and rules on table "public"."foo":
UPDATE pg_class SET
relhasrules = false,
reltriggers = 0
FROM pg_namespace
WHERE pg_namespace.oid = pg_class.relnamespace
AND pg_namespace.nspname = 'public'
AND pg_class.relname = 'foo';
Re-enable all triggers and rules on "public"."foo" when finished with DML that must not fire triggers and rules:
UPDATE pg_class SET
reltriggers = (
SELECT COUNT(*) FROM pg_trigger
WHERE pg_class.oid = pg_trigger.tgrelid
),
relhasrules = (
SELECT COUNT(*) > 0
FROM pg_rules
WHERE schemaname = 'public'
AND tablename = 'foo'
)
FROM pg_namespace
WHERE pg_namespace.oid = pg_class.relnamespace
AND pg_namespace.nspname = 'public'
AND pg_class.relname = 'foo';
In practice, the simple usage described for trigger and rule suppression worked reasonably well. However, it didn't always work well. In particular, there is a somewhat concerning state that exists between the two previously described events. The actions of disabling triggers/rules, then manipulating those relations affected, and then re-enforcing triggers/rules, must happen within the confines of a single transaction, and they must happen, period. The risk is that, at some point between the "open and shut" on pg_class, the transaction is committed and the "shut" never fires. If that happens, all database activity against the relations with disabled triggers and rules continues. I don't recall that we ever isolated the reasons why, on rare occasion, this happened; I only know that it did happen, and it was never welcome news.
In an effort to curb the worst aspect of this issue, I started with a simple question: how can I limit the transaction to a "safe compartment", thinking in terms of perl's Safe.pm? In this case, the "unsafe" action is "commit the transaction with triggers and rules disabled". But in reality, the unsafe list can be any conditions the developer needs to have exposed, but cannot make visible to the rest of the database.
An ancillary issue we faced, too, was the fact that any app code needing to suppress triggers and rules (beyond syncing, there were any number of DML requirements where it was undesirable for syncing to occur, and the pg_class manipulations were quite common) had to operate as the super user. While we had not had an incident where the postgres user for mundane operations had burned down the database, there was certainly concern about that potential.
The resolution I settled on was to construct a pair of functions that made use of the following features:
- PostgreSQL's SECURITY DEFINER function attribute
- Deferred foreign keys
- The ON COMMIT DROP option for CREATE TEMP TABLE
The first function, safe_disable_trigrules(schema_name text, table_name text), is called after beginning a transaction and makes the necessary modifications to pg_class on behalf of schema_name.table_name. After the work within the transaction is finished, the second function--safe_reenable_trigrules(schema_name text, table_name text)--is called before issuing the commit. It, of course, puts pg_class back to the proper state.
Under the hood, the two functions create a dependency that only each other can satisfy when used as a non super user. Before safe_disable_trigrules() will manipulate pg_class, it creates a temp table with a self-referencing, deferred foreign key. Then, based on the schema and table args, it will insert a record for the relation defined by the args that violates the FK. Once the transaction's work is finished, but before committing, safe_reenable_trigrules() is called for every relation that safe_disable_trigrules() was called against and it will delete out the offending record for that relation alone. If the two functions are used properly, by the time of commit, the temp table is empty, thus having no foreign key violations, at which point the transaction can be safely committed. In the process of ensuring the temp table has no foreign key violations, pg_class has been fully restored to its pre-transaction state.
How each of the identified features is used:
- Creating the functions with the SECURITY DEFINER attribute, we have now opened an access point for non-privileged users specifically for the purpose of the proscribed interaction with pg_class--and nothing more.
- The temp table is created and owned by user postgres. There is no chance of the non-privileged user manipulating this table directly, accidentally or otherwise. Thus, the only positive escape for the transaction is through the use of the reenabling function.
- Deferring the foreign key on the temp table allows us to ensure the transaction is in an invalid state at all times while pg_class is in its vulnerable condition without aborting the transaction.
- ON COMMIT DROP allows the function to clean up after itself without having to make an explicit decision on the right time to drop the table. It allows a single temp table to be utilized per transaction, regardless of how may different relations will be passed through the trigger and rule disabling process.
- Before the temp table actually drops, its deferred foreign keys are evaluated. If any rows are left in the table, it means for at least one relation we failed to call the reenable function and the entire transction is aborted rather than risk committing pg_class in the disabled state.
Converting to this system of pg_class manipulation completely eliminated the instances of finding pg_class in a committed state with triggers and rules disabled for various relations. It also allowed us to convert a number database-dependent scripts and applications from using the postgres user down to the appropriate application users.
CREATE FUNCTION safe_disable_trigrules (
schema_name TEXT,
table_name TEXT
)
RETURNS void
LANGUAGE plpgsql
STRICT
SECURITY DEFINER
AS $$
DECLARE
text_table_pk TEXT NOT NULL := '';
text_fk_table TEXT NOT NULL := '';
text_cur_min_msg TEXT;
BEGIN
-- Stop any malicious shenanigans by user overloading
-- relations or operators in a different schema and
-- manipulating search_path to use them.
PERFORM
pg_catalog.set_config(
'search_path',
'pg_catalog, '
OPERATOR(pg_catalog.||)
(SELECT
pg_catalog.current_setting(
'search_path'
)
),
TRUE
);
-- Shared PK for table to hold FK in violated state.
-- This naming convention must not change without also
-- reflecting the convention in safe_reenable_trigrules()
-- so that both can immutably create the same name given
-- the same arguments.
text_table_pk :=
schema_name || '_' ||
table_name || '_' ||
TO_CHAR(
NOW(),
'DHH24MISSMS'
);
-- Allowing for the same relation to have triggers and rules
-- disabled and reenabled multiple times within the same
-- transaction. On subsequent calls, the temp table will
-- already exist.
SELECT setting
INTO text_cur_min_msg
FROM pg_settings
WHERE name = 'client_min_messages';
UPDATE pg_settings
SET setting = 'error'
WHERE name = 'client_min_messages'
AND text_cur_min_msg IS DISTINCT FROM 'error';
-- Attempt to create the temp table. If first function call for
-- transaction, it succeeds; otherwise, it fails silently unless
-- error is something other than re-creating extant table.
BEGIN
-- Temp table for this transaction, with same shared
-- convention as the PK above.
text_fk_table :=
'trigrules_' ||
TO_CHAR(
NOW(),
'DHH24MISSMS'
);
-- Use ON COMMIT DROP so PG will garbage collect
-- all such temp tables created within the transaction.
EXECUTE
'CREATE TEMP TABLE ' ||
quote_ident(text_fk_table) || ' (
id TEXT PRIMARY KEY NOT NULL,
fk_id TEXT NOT NULL
CONSTRAINT "Must Call safe_reenable_trigrules() Before Commit"
REFERENCES ' ||
quote_ident(text_fk_table) || '
DEFERRABLE
INITIALLY DEFERRED
)
ON COMMIT DROP';
EXCEPTION
WHEN DUPLICATE_TABLE THEN
-- Ignore
END;
UPDATE pg_settings
SET setting = text_cur_min_msg
WHERE name = 'client_min_messages'
AND text_cur_min_msg IS DISTINCT FROM 'error';
-- Insert new record that violates FK. Allowing for
-- the function to be gracefully recalled on the same
-- relation between calls to re-enable triggers and rules.
EXECUTE '
INSERT INTO ' ||
quote_ident(text_fk_table) || '
SELECT ' ||
quote_literal(text_table_pk) ||
', ' ||
quote_literal(text_table_pk || 'X') || '
WHERE NOT EXISTS (
SELECT 1
FROM ' ||
quote_ident(text_fk_table) || '
WHERE id = ' ||
quote_literal(text_table_pk) || '
)';
-- Disable all rules and triggers on target relation
UPDATE pg_class SET
relhasrules = false,
reltriggers = 0
FROM pg_namespace
WHERE pg_namespace.oid = pg_class.relnamespace
AND pg_namespace.nspname = schema_name
AND pg_class.relname = table_name;
-- Abort transaction if relation doesn't exist
IF NOT FOUND THEN
RAISE EXCEPTION
'Table %.% does not exist',
schema_name,
table_name;
END IF;
-- reset search_path for users legitimately overloading
-- operators or relations.
PERFORM
set_config(
'search_path',
(SELECT
SUBSTRING(
current_setting('search_path')
FROM
'^pg_catalog, (.*)'
)
),
TRUE
);
END;
$$
;
CREATE FUNCTION safe_reenable_trigrules (
schema_name TEXT,
table_name TEXT
)
RETURNS void
LANGUAGE plpgsql
STRICT
SECURITY DEFINER
AS $$
DECLARE
text_fk_table TEXT NOT NULL := '';
text_table_pk TEXT NOT NULL := '';
int_num_del INTEGER;
BEGIN
-- Stop any malicious shenanigans by user overloading
-- relations or operators in a different schema and
-- manipulating search_path to use them.
PERFORM
pg_catalog.set_config(
'search_path',
'pg_catalog, '
OPERATOR(pg_catalog.||)
(SELECT
pg_catalog.current_setting(
'search_path'
)
),
TRUE
);
-- Re-enable rules and triggers on target
UPDATE pg_class SET
reltriggers = (
SELECT COUNT(*) FROM pg_trigger
WHERE pg_class.oid = pg_trigger.tgrelid
),
relhasrules = (
SELECT COUNT(*) > 0
FROM pg_rules
WHERE schemaname = schema_name
AND tablename = table_name
)
FROM pg_namespace
WHERE pg_namespace.oid = pg_class.relnamespace
AND pg_namespace.nspname = schema_name
AND pg_class.relname = table_name;
-- Shared PK for table to hold FK in violated state.
-- This naming convention must not change without also
-- reflecting the convention in safe_disable_trigrules()
-- so that both can immutably create the same name given
-- the same arguments.
text_table_pk :=
schema_name || '_' ||
table_name || '_' ||
TO_CHAR(
NOW(),
'DHH24MISSMS'
);
-- Temp table for this transaction, with same shared convention
-- as the PK above.
text_fk_table :=
'trigrules_' ||
TO_CHAR(
NOW(),
'DHH24MISSMS'
);
-- Remove pertinent row so FK is no longer in violated state
EXECUTE
'DELETE FROM ' ||
quote_ident(text_fk_table) || '
WHERE id = ' ||
quote_literal(text_table_pk);
GET DIAGNOSTICS int_num_del = ROW_COUNT;
IF (int_num_del > 0) IS NOT TRUE THEN
RAISE EXCEPTION
'No entry for %.% set by safe_disable_trigrules()',
schema_name,
table_name;
END IF;
-- reset search_path for users legitimately overloading
-- operators or relations.
PERFORM
set_config(
'search_path',
(SELECT
SUBSTRING(
current_setting('search_path')
FROM
'^pg_catalog, (.*)'
)
),
TRUE
);
END;
$$
;
Pl/Perl multiplicity issues with PostgreSQL - the Highlander restriction
I came across this error recently for a client using Postgres 8.4:
ERROR: cannot allocate multiple Perl
interpreters on this platform
Most times when you see this error it indicates that someone was trying to use both a Pl/Perl function and a Pl/PerlU function on a server in which Perl's multiplicity flag is disabled. In such a case, only a single Perl interpreter can exist for each Postgres backend, and trying to create a new one (as happens when you execute two functions written in Pl/Perl and Pl/PerlU), the error above is thrown.
However, in this case it was not a combination of Pl/Perl and Pl/PerlU - I confirmed that only Pl/Perl was installed. The error was caused by a slightly less known limitation of a non-multiplicity Perl and Postgres. As the docs mention at the very bottom of the page, "...so any one session can only execute either PL/PerlU functions, or PL/Perl functions that are all called by the same SQL role". So we had two roles both trying to execute some Pl/Perl code in the same session. How is that possible - isn't each session tied to a single role at login? The answer is the SECURITY DEFINER flag for functions, which causes the function to run as if it was being invoked by the role that created the function, not the role that is executing it.
There is still a bit of a gotcha here, because Perl interpreters are created as needed, and thus the order of operations is very important. In other words, you may be able to run function foo() just fine, and run function bar() just fine, but you cannot run them together in the same session! This applies to both the Pl/Perl and Pl/PerlU limitation, as well as the Pl/Perl multiple user limitation.
While Postgres will validate functions as you create them, this is
subject to the same in-session limitation. All of the below examples assume
you have a non multiplicity-enabled Perl
(see
the perlguts manpage for gory details on what multiplicity means in Perl)
. To see what state your Perl is,
you need to determine if the 'usemultiplicity' option is enabled.
The -V option to the perl executable tells it to output all
of its configuration parameters. While the canonical way to check is to issue a
perl -V:usemultiplicity, that's a
hard string to remember, so I simply use grep:
$ perl -V | grep multi
useithreads=define, usemultiplicity=define
The above indicates that Perl has been compiled with multiplicity and thus not subject to the Postgres limitations - you can mix and match Perl functions in your database with abandon. The only problem occurs if the output looks like this:
$ perl -V | grep multi
useithreads=undef, usemultiplicity=undef
Technically, you can also prevent the issue by setting ithreads on, but there really is no reason to not just keep things simpler by setting the multiplicity on.
Watch what happens when we try to create two Perl functions using Postgres 9.2:
postgres=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create language plperl;
CREATE LANGUAGE
test=# create language plperlu;
CREATE LANGUAGE
test=# create or replace function test_perlver()
test-# returns text
test-# language plperl
test-# AS $$ return "Running test_perlver on Perl $^V"; $$;
CREATE FUNCTION
test=# create or replace function test_perlverU()
test-# returns text
test-# language plperlU
test-# AS $$ return "Running test_perlverU on Perl $^V"; $$;
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlveru"
What's going on here? We've already used a perl (Pl/Perl) in *this session*, so we cannot create another one, even if just to compile (but not execute) the function. However, if we start a new session, we can create our Pl/PerlU function!
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create or replace function test_perlverU()
test-# returns text
test-# language plperlU
test-# AS $$ return "Running test_perlverU on Perl $^V"; $$;
CREATE FUNCTION
This Highlander restriction ("there can be only one!") applies to both creation and execution of functions. Notice that we have both the Pl/Perl and Pl/PerlU versions installed, but we can only use one in a particular session - and which one depends on which is called first!:
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# select test_perlver();
test_perlver
--------------------------------------
Running test_perlver on Perl v5.10.0
test=# select test_perlverU();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlveru"
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# select test_perlverU();
test_perlveru
---------------------------------------
Running test_perlverU on Perl v5.10.0
test=# select test_perlver();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlver"
As you can imagine, the nondeterministic nature of such functions can make discovery and debugging of this issue on production servers tricky. :) Here's the other variant we talked about, in which only the first of two functions - both of which are Pl/Perl - will run:
postgres=# create database test;
CREATE DATABASE
postgres=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create language plperl;
CREATE LANGUAGE
test=# create or replace function foo()
test-# returns text
test-# language plperl
test-# security invoker
test-# AS $$ return "Running as security invoker"; $$;
CREATE FUNCTION
test=# create or replace function bar()
test-# returns text
test-# language plperl
test-# security definer
test-# AS $$ return "Running as security definer"; $$;
CREATE FUNCTION
Now let's run as the user who created the function - no problemo, because we are the same user that created the function:
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# SELECT foo();
foo
-----------------------------
Running as security invoker
(1 row)
test=# SELECT bar();
bar
-----------------------------
Running as security definer
(1 row)
All is well. However, if we try it as a different user, the Highlander restriction creeps in:
test=# \c test greg
You are now connected to database "test" as user "greg".
test=# SELECT foo();
foo
-----------------------------
Running as security invoker
(1 row)
test=# SELECT bar();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "bar"
test=# \c test greg
You are now connected to database "test" as user "greg".
test=# SELECT bar();
bar
-----------------------------
Running as security definer
(1 row)
test=# SELECT foo();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "foo"
This one took me a while to figure out on a production system, as somewhere in a twisty maze of trigger functions there was one that was set as security definer. Normally, this was not a problem, as the user that created that function did much of the updates, but a different user invoked a non- security definer function and then the security definer function, causing the error at the top of this article to show up.
So what can one do to prevent this problem from occurring? Luckily, for most people this will not be a problem, as many (if not all) distros and operating systems have the multiplicity compile flag for Perl enabled. If you do have the restriction, one option is to simply be careful about the use of security definer functions. You could either declare everything as security definer, or perhaps make sure that it is only called in a separate session if it really needs to be called by a different user.
A better solution is to recompile your Perl to enable multiplicity. I am not aware of any drawbacks to doing so. In theory, one could even recompile Perl in-place and then restart Postgres, but I have never tried this out. :)
Postgres log_statement='all' should be your default
Setting the PostgreSQL log_statement parameter to 'all' is always your best choice; this article will explain why. PostgreSQL does not have many knobs to control logging. The main one is log_statement, which can be set to 'none' (do not ever set it to this!), 'ddl' or 'mod' (decent but flawed values), or 'all', which is what you should be using. In addition, you probably want to set log_connections = on, log_disconnections = on, and log_duration = on. Of course, if you do set all of those, don't forget to set log_min_duration_statement = -1 to turn it off completely, as it is no longer needed.
The common objections to setting log_statement to 'all' can be summed up as Disk Space, Performance, and Noise. Each will be explained and countered below. The very good reasons for having it set to 'all' will be covered as well: Troubleshooting, Analytics, and Forensics/Auditing.
Objection: Disk Space
The most common objection to logging all of your SQL statements is disk space. When log_statement is set to all, every action against the database is logged, from a simple SELECT 1; to a gigantic data warehousing query that is 300 lines long and takes seven hours to complete. As one can imagine, logging all queries generates large logs, very quickly. How much depends on your particular system of course. Luckily, the amount of space is very easy to test: just flip log_statement='all' in your postgresql.conf, and reload your database (no restart required!). Let it run for 15 minutes or so and you will have a decent starting point for extrapolating daily disk space usage. For most of our clients, the median is probably around 30MB per day, but we have some as low as 1MB and some over 50GB! Disk space is cheap, but if it is really an issue to save everything, one solution is to dynamically ship the logs to a different box via syslog (see below). Another not-as-good option is to simply purge older logs, or at least ship the older logs to a separate server, or perhaps to tape. Finally, you could write a quick script to remove common and uninteresting lines (say, all selects below 10 milliseconds) from older logs.
Objection: Performance
Another objection is that writing all those logs is going to harm the performance of the server the database is on. A valid concern, although the actual impact can be easily measured by toggling the value temporarily. The primary performance issue is I/O, as all those logs have to get written to a disk. The first solution is to make sure the logs are going to a different hard disk, thus reducing contention with anything else on the server. Additionally, you can configure this disk differently, as it will be heavy write/append with little to no random read access. The best filesystems for handling this sort of thing seem to be ext2 and ext4. A better solution is to trade the I/O hit for a network hit, and use syslog (or better, rsyslog) to ship the logs to a different server. Doing this is usually as simple as setting log_destination = 'syslog' in your postgresql.conf and adjusting your [r]syslog.conf. This has many advantages: if shipping to a local server, you can often go over a non-public network interface, and thus not impact the database server at all. This other server can also be queried at will, without affecting the performance of the database server. This means heavy analytics, e.g. running pgsi or pgfouine, can run without fear of impacting production. It can also be easier to provision this other server with larger disks than to mess around with the production database server.
Objection: Noise
A final objection is that the log files get so large and noisy, they are hard to read. Certainly, if you are used to reading sparse logs, this will be a change that will take some getting used to. One should not be reading logs manually anyway: there are tools to do that. If all your logs were showing before was log_min_duration_statement, you can get the same effect (in a prettier format!) by using the 'duration' mode of the tail_n_mail program, which also lets you pick your own minimum duration and then sorts them from longest to shortest.
Advantage: Troubleshooting
When things go wrong, being able to see exactly what is happening in your database can be crucial. Additionally, being able to look back and see what was going on can be invaluable. I cannot count the number of times that full logging has made debugging a production issue easier. Without this logging, the only option sometimes is to switch log_statement to all and then wait for the error to pop up again! Don't let that happen to you - log heavy preemptively. This is not just useful for tracking direct database problems; often the database trail can enable a DBA to work with application developers to see exactly what their application is doing and where things started to go wrong. On that note, it is a good idea to log as verbosely as you can for everything in your stack, from the database to the application to the httpd server: you never know which logs you may need at a moment's notice when major problems arise.
Advantage: Analytics
If the only logging you are doing is those queries that happen to be longer than you log_min_duration_statement, you have a very skewed and incomplete view of your database activity. Certainly one can view the slowest queries and try to speed them up, but tools like pgsi are designed to parse full logs: the impact of thousands of "fast" queries can often be more stressful on your server than a few long-running queries, but without full logging you will never know. You also won't know if those long-running queries sometimes (or often!) run faster than log_min_duration_statement.
We do have some clients that cannot do log_statement = 'all', but we still want to use pgsi, so what we do is turn on full logging for a period of time via cron (usually 15 minutes, during a busy time of day), then turn it off and run pgsi on that slice of full logging. Not ideal, but better than trying to crunch numbers using incomplete logs.
Advantage: Forensics/Auditing
Full logging via log_statement='all' is extremely important if you need to know exactly what commands a particular user or process has run. This is not just relevant to SQL injection attacks, but for rogue users, lost laptops, and any situation in which someone has done something unknown to your database. Not every one of these situations will be noticeable, such as the infamous DROP TABLE students;: often it involves updating a few critical rows, modifying some functions, or simply copying a table to disk. The *only* way to know exactly what was done is to have log_statement = 'all'. Luckily, this parameter cannot be turned off by clients: one must edit the postgresql.conf file and then reload the server.
The advantages of complete logging should outweigh the disadvantages, except in the most extreme cases. Certainly, it is better to start from a position of setting Postgres' log_statement to 'all' and defending any change to a lesser setting. Someday it may save your bacon. Disagreements welcome in the comment section!
You shall not pass! Preventing SQL injection
Greg Sabino Mullane presented a few extremely useful techniques for preventing SQL injection. His advice was mostly based on his recent real-world experience.
The chunk of simple code was causing a potentially very dangerous security breach to the system:
[query … where order_number='[scratch order_number] and username='[session username]']
This code can generate this SQL query:
select * from orders where order_number = '12345' and username = 'alice';
Or this SQL query:
select * from orders where order_number=' '; delete from orders where id IS NOT NULL;
This is a vulnerability, and you certainly do not want any random stranger to delete records from the "orders" table in your database.
The problem was solved in no time by escaping user input.
Here is Greg's list of recommendations to make SQL injection impossible:
- Escape all user input passed to the database.
- Log extensively. If this system hadn't logged SQL queries, they would have never noticed anything strange. They used tail_n_mail that tracks PostgreSQL logs and sends out emails whenever SQL exception occurs.
- Introduce fine-grained control for accessing and manipulating the database. Split responsibilities between a lot of database users and selectively grant permissions to them. Run your code as the appropriate database user with the most restrictive set of permissions possible.
- Database triggers can become very handy. In Greg's case it was impossible to delete the already shipped order because of the triggers assigned to the record.
- Have a lot of eyes on the code to eliminate the obvious mistakes.
- And finally, if SQL injection is happening, consider shutting down the database server until you find the cause. This is an emergency!
Detecting Postgres SQL Injection
SQL injection attacks are often treated with scorn among seasoned DBAs and developers - "oh it could never happen to us!". Until it does, and then it becomes a serious matter. It can, and most likely will eventually happen to you or one of your clients. It's prudent to not just avoid them in the first place, but to be proactively looking for attacks, to know what to do when they occur, and know what steps to take after you have cleaned up the mess.
What is a SQL injection attack? Broadly speaking, it is a malicious user entering data to subvert the nature of your original query. This is almost always through a web interface, and involves an "unescaped" parameter that can be used to change the data returned or perform other database actions. The user "injects" their own SQL into your original SQL statement, changing the query from its original intent.
For example, you have a page in which the a logged-in customer can look up their orders by an order_number, a text field on a web form. The query thus looks like this in your code:
$order_id = cgi_param('order_number');
$sql = "SELECT * FROM order WHERE order_id = $order_id AND order_owner = '$username'";
$results = run_query($sql);
Because there is nothing to limit what the user enters in the order_number field, they can inject their own SQL into to the middle of your SQL query by creating a non-standard order_number such as:
12345 --
This would return information on anyone's order, without checking the order_owner column, as the SQL sent to the database would become:
SELECT * FROM order WHERE order_id = 12345 -- AND order_owner = 'alice'
Much more creative (and destructive) choices are available to the attacker as well, such as:
SELECT * FROM order WHERE order_id = 12345; UPDATE user SET admin=TRUE WHERE username = 'alice'; --AND order_owner = 'alice'; SELECT * FROM order WHERE order_id = 12345; TRUNCATE TABLE invoices; SELECT * FROM order WHERE order_id = 12345 AND order_owner = 'alice';
The above is a very simplistic generic-language example, but there are many ways for SQL injection attacks to work, including software out of your direct control (anything in your chain, from database driver to language to the database itself) and non-obvious angles (such as getting creative with multi-byte languages).
The correct approach to the above would be to use placeholders:
$order_id = cgi_param('order_number');
$sql = 'SELECT * FROM order WHERE order_id = ? AND order_owner = ?';
$results = run_query($sql, $order_id, $username);
Reaction
So you've just detected a SQL injection attack. Don't panic! Okay, perhaps panic a little bit. The first order of business is to, as quickly as possible, disable access and prevent the attacker from doing anything else. Their next injected SQL statement could be a DROP TABLE. Do as much as is needed to stop it right away - don't worry about fixing the hole yet. Stop Apache, disable all CGI, shut down your database, whatever it takes. Yes, this will cause a loss of business for a busy site but so will that potential DROP TABLE command! Once things are disabled, start patching up the holes. If it is a well isolated, obvious fix, bring things back up. If not, look for similar code with the same problem, then bring things back up. There are now some important steps to take:
- Double check all similar code for any other problems.
- Check your logs carefully to see if this was an isolated event, or if the hole had been used before. If you are relying on SQL errors for detection, a careful attacker may have already successfully injected some SQL. See below for forensic tactics.
- Learn why this happened in the first place. Didn't update a driver? Someone just wrote some bad code? Something else? Fix it at both the immediate technical and long-term procedural level.
Detection
Detection is the most important part of this article. If someone were to start a SQL injection attack against your site right now, would you even know? How quickly?
Fortunately, SQL injection attacks almost always generate some SQL errors as the attacker tries to work around your SQL. This is the number one way to detect an attack while it is happening. We recommend the invaluable tail_n_mail for this task. For our clients, we have tail_n_mail running via cron every minute, scanning for new and interesting errors and mailing them out to us. Thus, detection is usually within minutes.
In addition to pure SQL errors, permission errors often occur as well, as the attacker tries to do something not allowed by the current database user, such as creating a table or running the COPY command. Remember to never treat a strange error as an unintersting isolated event, or assume that it is probably one of your developers making a typo. Follow up on everything.
Sometimes, when the attacker is very good, no SQL errors are generated, and the problems have to be detected in other ways. One way is to scan for common SQL injection items. The trick is filtering out valid SQL while finding injected ones. In most cases, attacker access to your database is fairly limited without knowing the names of your tables, columns, functions, and views, so one thing to look for is references to system tables such as pg_class and pg_attribute, system views such as pg_tables and pg_stat_activity, the pg_sleep() function, and the information_schema schema. (pg_sleep() is often used in "blind SQL" attacks, to let the attacker know if something worked or not by the inclusion of a delay, when there may be no other direct feedback from their injection). While looking for these items is not as easy to setup as looking for errors, it can be fairly easy to develop and exclude a whitelist of things that should be accessing those items.
Another thing to watch out for strange offsets. Because the information an attacker can get back is often limited to a row at a time due to the limitations of the original query, SQL injections often pull back the same information from, say, information_schema.tables, with a "LIMIT 1 OFFSET 1" tacked on, Then they call the page again and inject their SQL, but with an offset of 2, then an offset of 3, etc. Nothing says SQL injection like seeing an OFFSET 871 in your logs.
Speaking of logs, you may have noticed that the above checks will only work if you are logging all statements, by adjusting log_statement in your postgresql.conf file. Setting this parameter to 'all' is *highly* recommended, and SQL injection detection (and forensics!) are merely two of the many reasons for doing so.
If you don't have log_statement set to 'all', your only hope of direct detection is if one of the queries happens to get logged for some other reason, such as going over your log_min_duration_statement setting. Good luck with that. /sarcasm.
There are other methods of detecting SQL injection, but they can all be classified as reacting to side effects. Your logs may grow larger, a sysadmin on your team may notice some odd network patterns, your business intelligence people may come across some unexplained buying patterns, etc. Intuition from experienced people is a powerful tool: follow up on those hunches and nagging feelings!
Prevention
Preventing SQL injection is mostly a matter of following some standard software development practices. Basically, you want your code up to date, well vetted, and easy to read and revert. Here are some guidelines:
Use version control
More specifically, use git. For everything related to your site. Application code, HTML pages, system configurations. There are many advantages to git, but it is particularly useful when you are (quickly!) trying to figure out how some bad code (e.g. with SQL injection holes) got into your app, and what safe version you can replace it with. The powers of git log -p, git bisect, git blame, and git checkout will make you wonder how you ever lived without them.
More than one set of eyes
Never commit code that hasn't been looked at by at least one other person not involved in its writing. This can be as informal as leaning over and asking someone else to look at the patch, to setting up a complex enforcement system via something like gerrit. The most important thing is to have it reviewed by someone qualified, and to note the review in your commit message.
Email is a great way to do this, especially if you have a list of people qualified to give a review of the code in question. So, database changes could go to a "dbgroup" list, and one or more people on the list will review and reply.
Another nice thing is a post-commit hook that mails committed code as a diff to a wide audience, such as all engineers in the company. Sure, most people may ignore it, but the more eyes the better. On that note, make sure the age-old appeal of heavily commenting code is followed, especially code that is trying to fetch information from a database.
Teach people about SQL injection
Using placeholders is the only truly safe way to write code. Make sure everyone knows this, and show some examples of SQL injection problems to new hires so they know what to look out for and what the consequences will be.
Never assume any database input is safe.
Never, ever assume database input is safe, or will remain safe. Always use prepared statements aka placeholders. You say you scrubbed that variable with a regular expression above the SQL call? Someone will tweak that regex someday.
Be proactive in looking for problems
See the section about about using tail_n_mail. There are also companies / tools that will attempt to find SQL injection problems in your application. While not foolproof, these can be useful, particularly if you have a very large website with a very large codebase.
Keep your software up to date
Sure, your software is free of all problems, but what about the framework you are using? The language? The database? And the database drivers? They may have a SQL injection problem, and, more importantly, they may have already patched it. Run the most recent version, and make sure you are on all the relevant announcement lists so you hear about new problems and new releases of everything important in your tool chain.
Compartmentalize
In these days of complex frameworks and multiple levels of abstraction, direct SQL access is often hard or impossible to do. Which can be a very good thing, as this is often a good protection against SQL injection. Keep in mind however that there are always other ways to reach your database, such as the boss's daughter or son whipping up a quick PHP script so he can run some reports from home against the production database.
Use the least privileges possible
Make sure you are taking full advantage of roles and users in your database. This means an application should have the bare minimum rights it needs to do its job. No creating of functions, no creating tables, and explicit GRANTs to the tables/views/functions it truly needs. Limit severely what runs as a superuser. If something really needs to run as a superuser, consider wrapping the data/logic in a SECURITY DEFINER function. Having separate "readonly" and "readwrite" versions of each application's user is a great idea as well, and may even help you to scale by being able to send your readonly user to a different database (via hot standby or a Bucardo/Slony slave), or even send them to different pg_bouncer ports with different pooling methods.
Access can be further limited by the use of views, which can limit which columns and rows are visible to a user, or you could even limit all application user access to going through stored procedures.
URLs are public
Never assume an application, URL, or API will remain internal. It will end up accessible to an attacker someday, somehow. Treat everything with the same careful, paranoid, care and always use placeholders.
Forensics
So you've just closed a SQL injection hole, and carefully audited your code to ensure no other holes exist. Now what? Forensics! Which means, a careful examination after the crime. In this case, we want to see what damage the intruder managed to cause.
The first thing to do is figure out what potential harm there is. You can do this by assuming the worst case scenario. What database user was used in the attack, and what rights does it have? Could tables have been updated? Data deleted? Were tables dropped or views altered? This may be a good time to run something like same_schema in historical mode to find out the answer to that last question.
Now comes the hard part: seeing what was changed. If you do not have log_statment='all' set in your postgresql.conf (as I will once again highly recommend you should have) finding what has changed becomes a very, very difficult task. Your best bet at this point is to go to your backups and start comparing things, and perhaps running some sanity checks on your data (e.g. unusually low prices on things you sell, new mega-useful coupons, users with elevated rights). If you know about when the attack started, you could, in theory, look on disk to see which relations may have been altered to narrow the list of changed data. You will also have to assume that the attacker captured all the possible data the database user was allowed to see.
Enough about the worst case scenario above - what about those of us with log_statment='all'? Well, now we go through the logs to see what exact SQL was injected, and what commands have been succesfully run. At this point, you should know what the SQL involved in the attack looks like, and more to the point, where in your code it came from. Now its a matter of filtering out the good stuff from the bad. Luckily, this is a pretty easy task.
What you will need to do is write a quick script to parse your logs, find the type of query that had the hole, and determine the "bad" ones. Then you can look closer and have it report exactly what commands the attacker ran.
Most SQL injection results in a string of additional SQL in place of where a single value should be, with an adding of quotes. So, for example, if someone forgot to escape an OFFSET at the end of the query, your program could simply look for any variations of the query that ended in something other than OFFSET \d$. If the unescaped value was in the middle of the query, I find that a simple but reliable test is to look for whitespace or a '*' character. This assumes that whitespace or '*' would not normally appear for that value, but as long as it's not common, it should still work. (The '*' is needed because one can use SQL comments as a means of whitespace, for example SELECT/**/*/*foo*/FROM/**/pg_tables). Your script should ignore any queries in which the value has no whitespace or '*' character, and focus on the ones that do. Then normalize the queries (for example collapse ones that differ only by the OFFSET value), and generate a report. Of course, the exact method to differentiate between "good" and "bad" queries will vary. Find your best Perl hacker and set them on it.
I should point out that a script is almost always necessary, for three reasons. First, manually reading logs is a time-wasting and error-prone bore. Second, log_statment='all' leads to some really, really big logs. Third, SQL injection attacks usually involve some sort of scripted attack, which can mean a lot of entries. For example, a client recently had over 8000 lines from a SQL injection attack spread out over 20 GB of log files. (This one had a happy ending: the attacker was both not very competent and the database user was fairly locked down, so no damage was done.)
So remember: SQL injection can happen to you. Make sure you are able to detect it, recognize it, fix it, and inspect the damage!
UTOSC Recap
I spent three days last week attending the Utah Open Source Conference, in company with Josh Ausborne and Jon Jensen. Since End Point is a "distributed company", I'd never met Josh Ausborne before, and was glad to spend a few days helping and learning from him as we demonstrated the Liquid Galaxy he has already written about.
This time around, the conference schedule struck me as being particularly oriented toward front-end web development. The talks were chosen based on a vote taken on the conference website, so apparently that's what everyone wanted, but front end stuff is not generally my cup of tea. That fact notwithstanding, I found plenty to appeal to my particular interests, and a number of talks I didn't make it to but wished I had.
I delivered two talks during the conference, the first on database constraints, and the second on Google Earth and the Liquid Galaxy as they apply to geospatial visualization (slides here and here, respectively). Though I couldn't get past the feeling that my constraints talk dragged quite a bit, it was well received. Where possible I kept it as database-agnostic as possible, but no talk on the subject would be complete without mentioning PostgreSQL's innovative exclusion constraints. Their applicability to scheduling applications, by easily preventing things like overlapping time ranges, seemed particularly interesting to one attendee with recent experience writing such an application. Should I have opportunity to deliver the talk again, it will definitely include more examples of some of the more overlooked constraint types, as well as a more detailed description of the surrogate vs. natural keys, which generated quite a bit of discussion after I mentioned it in passing.
My mapping talk was less enthusiastically attended, which may well be due to the topic or the speaker, but it was also scheduled at 6:00 PM, in the last slot of the day, and I expect many attendees had gone home. UTOSC features an unusually high number of attendees with young families, compared to most conferences I've attended, and clears out relatively rapidly toward evening. The last day's tracks tend to be family-focused specifically because of all the parents who want to bring their children, and included hands-on labs, board game sessions, and child-friendly demonstrations.
Sparse attendance notwithstanding, I enjoyed introducing my audience to Google Earth's KML language, the Kamelopard library I've been working on to facilitate making KML, and some of the applications of Google Earth for visualization. We moved the Liquid Galaxy from our display booth to the classroom for my presentation; I expect it was one of the more involved demonstrations in any talk, and certainly deserves honorable mention for being a live demo that actually worked.
Integrating Propel ORM
Propel ORM Integration
Most of us have worked in environments that are organized in an less than desirable manner. A common recurring problem is the organization and communication between the business logic and the database model. One helpful methodology to help with a problem like this is implementing an ORM (Object-Relational Mapping) system. There are many to choose from, but I would like to discuss use and integration of Propel ORM.
Propel is currently only used in PHP, but supports many different databases. Currently Propel supports the following databases: MySQL, Postgres, Sqlite, MsSQL, and Oracle.
Installation and Setup
The main point of this post is to show how easily you can start integrating an ORM into your working environment. The explanation and examples below assume that you have installed the correct packages and configured Propel to work with your environment properly.
The Propel website offers great documentation on how to do that:
Integration
After you have set everything up, in particular the build.properties file, you can now generate your schema.xml file. This generated file describes your database in XML, everything form datatypes to the relationships between tables. Run the following command to generate this XML file:
propel-gen reverse
Now we want to generate the PHP classes that you will use to interact with your data. Do this by running the following command.
propel-gen .
If everything went well you should now have a directory called 'build/' where you ran the propel_gen commands. Look in 'build/classes/' you will see the name of the project you named earlier in build.properties. Within this directory you will see a list of files, one for each of your tables. Copy these somewhere you can easily include them from a PHP file.
From now on each time you make a change to your database you will need to either modify your schema.xml to reflect that change, or simply run the propel-gen reverse to have it generated for you. You will also need to copy the generated PHP classes to your project when changes are made. There are many different ways you can use Propel for data model management, this is just one way to get you up and running.
Usage
Propel offers very elaborate functionality for interfacing with your data-model. I have included two quick examples to give you an idea of how you can easily begin to use it. I have included two examples showing some simple retrieval functionality. For this example let's say we have a table called 'Customers' with a few common columns like name and address.
Simple Select Statement
$query = new CustomerQuery(); $customers = CustomerQuery::create()->orderByName()->limit(2)->find(); $first_customer = $query->findPK(1); // This would give us the first in the list
Custom SQL
$con = Propel::getConnection(CustomerPeer::DATABASE_NAME);
$sql = "SELECT * FROM customer WHERE name ilike ('%:name%');
$stmt = $con->prepare($sql);
$stmt->execute(array(':name' => 'Terry'));
These are just two 'very' simple examples but Propel offers many ways to access and update your data. Please visit the Propel website if you would like to read more about this ORM.
The Mystery of The Zombie Postgres Row
Being a PostgreSQL DBA is always full of new challenges and mysteries. Tracking them down is one of the best parts of the job. Presented below is an error message we received one day via tail_n_mail from one of our client's production servers. See if you can figure out what was going on as I walk through it. This is from a "read only" database that acts as a Bucardo target (aka slave), and as such, the only write activity should be from Bucardo.
05:46:11 [85]: ERROR: duplicate key value violates unique constraint "foobar_id" 05:46:11 [85]: CONTEXT: COPY foobar, line 1: "12345#011...
Okay, so there was a unique violation during a COPY. Seems harmless enough. However, this should never happen, as Bucardo always deletes the rows it is about to add in with the COPY command. Sure enough, going to the logs showed the delete right above it:
05:45:51 [85]: LOG: statement: DELETE FROM public.foobar WHERE id IN (12345) 05:46:11 [85]: ERROR: duplicate key value violates unique constraint "foobar_id" 05:46:11 [85]: CONTEXT: COPY foobar, line 1: "12345#011...
How weird. Although we killed the row, it seems to have resurrected, and shambled like a zombie into our b-tree index, preventing a new row from being added. At this point, I double checked that the correct schema was being used (it was), that there were no rules or triggers, no quoting problems, no index corruption, and that "id" was indeed the first column in the table. I also confirmed that there were plenty of occurrences of the exact same DELETE/COPY pattern - with the same id! - that had run without any error at all, both before and after this error. If you are familiar with Postgres' default MVCC mode, you might make a guess what is going on. Inside the postgresql.conf file there is a setting named 'default_transaction_isolation', which is almost always set to read committed. Further discussion of what this mode does can be found in the online documentation, but the short version is that while in this mode, another transaction could have added row 12345 and committed after we did the DELETE, but before we ran the COPY. A great theory that fits the facts, except that Bucardo always sets the isolation level manually to avoid just such problems. Scanning back for the previous command for that PID revealed:
05:45:51 [85]: LOG: statement: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE 05:45:51 [85]: LOG: statement: DELETE FROM public.foobar WHERE id IN (12345) 05:46:11 [85]: ERROR: duplicate key value violates unique constraint "foobar_id" 05:46:11 [85]: CONTEXT: COPY foobar, line 1: "12345#011...
So that rules out any effects of read committed isolation mode. We have Postgres set to the strictest interpretation of MVCC it knows, SERIALIZABLE. (As this was on Postgres 8.3, it was not a "true" serializable mode, but that does not matter here.) What else could be going on? If you look at the timestamps, you will note that there is actually quite a large gap between the DELETE and the COPY error, despite it simply deleting and adding a single row (I have changed the table and data names, but it was actually a single row). So something else must be happening to that table.
Anyone guess what the problem is yet? After all, "when you have eliminated the impossible, whatever remains, however improbable, must be the truth". In this case, the truth must be that Postgres' MVCC was not working, and the database was not as ACID as advertised. Postgres does use MVCC, but has two (that I know of) exceptions: the system tables, and the TRUNCATE command. I knew in this case nothing was directly manipulating the system tables, so that only left truncate. Sure enough, grepping through the logs found that something had truncated the table right around the same time, and then added a bunch of rows back in. As truncate is *not* MVCC-safe, this explains our mystery completely. It's a bit of a race condition, to be sure, but it can and does happen. Here's some more logs showing the complete sequence of events for two separate processes, which I have labeled A and B:
A 05:45:47 [44]: LOG: statement: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE A 05:45:47 [44]: LOG: statement: TRUNCATE TABLE public.foobar A 05:45:47 [44]: LOG: statement: COPY public.foobar FROM STDIN B 05:45:51 [85]: LOG: statement: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE B 05:45:51 [85]: LOG: statement: DELETE FROM public.foobar WHERE id IN (12345) A 05:46:11 [44]: LOG: duration: 24039.243 ms A 05:46:11 [44]: LOG: statement: commit B 05:46:11 [85]: LOG: duration: 19884.284 ms B 05:46:11 [85]: LOG: statement: COPY public.foobar FROM STDIN B 05:46:11 [85]: ERROR: duplicate key value violates unique constraint "foobar_id" B 05:46:11 [85]: CONTEXT: COPY foobar, line 1: "12345#011...
So despite transaction B doing the correct thing, it still got tripped up by transaction A, which did a truncate, added some rows back in (including row 12345), and committed. If process A had done a DELETE instead of a TRUNCATE, the COPY still would have failed, but with a better error message:
ERROR: could not serialize access due to concurrent update
Why does this truncate problem happen? Truncate, while extraordinarily handy, can be real tricky to implement properly in MVCC without some severe tradeoffs. A DELETE in Postgres actually leaves the row on disk, but changes its visibility information. Only after all other transactions that may need to access the old row have ended can the row truly be removed on disk (usually via the autovacuum daemon). Truncate, however, does not walk through all the rows and add visibility information: as the name implies, it truncates the table by removing all rows, period.
So when we did the truncate, process A was able to add row 12345 back in: it had no idea that the row was "in use" by transaction B. Similarly, B had no idea that something had added the row back in. No idea, that is, until it tried to add the row and the unique index prevented it! There appears to be some work on making truncate more MVCC friendly in future versions.
Here is a sample script demonstrating the problem:
#!perl
use strict;
use warnings;
use DBI;
use Time::HiRes; ## so we can reliably sleep less than one second
## Connect and create a test table, populate it:
my $dbh = DBI->connect('dbi:Pg', 'postgres', '', {AutoCommit=>0});
$dbh->do('DROP TABLE foobar');
$dbh->do('CREATE TABLE foobar(a INT UNIQUE)');
$dbh->do('INSERT INTO foobar VALUES (42)');
$dbh->commit();
$dbh->disconnect();
## Fork, then have one process truncate, and the other delete+insert
if (fork) {
my $dbhA = DBI->connect('dbi:Pg', 'postgres', '', {AutoCommit=>0});
$dbhA->do('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
$dbhA->do('TRUNCATE TABLE foobar'); ## 1
sleep 0.3; ## Wait for B to delete
$dbhA->do('INSERT INTO foobar VALUES (42)'); ## 2
$dbhA->commit(); ## 2
}
else {
my $dbhB = DBI->connect('dbi:Pg', 'postgres', '', {AutoCommit=>0});
$dbhB->do('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE');
sleep 0.3; ## Wait for A to truncate
$dbhB->do('DELETE FROM foobar'); ## 3
$dbhB->do('INSERT INTO foobar VALUES (42)'); ## 3
}
Running the above gives us:
ERROR: duplicate key value violates unique constraint "foobar_a_key" DETAIL: Key (a)=(42) already exists
This should not happen, of course, as process B did a delete of the entire table before trying an INSERT, and was in SERIALIZABLE mode. If we switch out the TRUNCATE with a DELETE, we get a completely different (and arguably better) error message:
ERROR: could not serialize access due to concurrent update
However, it we try it with a DELETE on PostgreSQL version 9.1 or better, which features a brand new true serializable mode, we see yet another error message:
ERROR: could not serialize access due to read/write dependencies among transactions DETAIL: Reason code: Canceled on identification as a pivot, during write. HINT: The transaction might succeed if retried
This doesn't really give us a whole lot more information, and the "detail" line is fairly arcane, but it does give a pretty nice "hint", because in this particular case, the transaction *would* succeed if it were tried again. More specifically, B would DELETE the new row added by process A, and then safely add the row back in without running into any unique violations.
So the morals of the mystery are to be very careful when using truncate, and to realize that everything has exceptions, even the supposed sacred visibility walls of MVCC in Postgres.
Tracking down PostgreSQL XYZ error: tablespace, database, and relfilnode
One of our Postgres clients recently had this error show up in their logs:
ERROR: could not read block 3 of relation 1663/18421/31582: read only 0 of 8192 bytes

Because we were using the tail_n_mail program, the above error was actually mailed to us within a minute of it occurring. The message is fairly cryptic, but it basically means that Postgres could not read data from a physical file that represented a table or index. This is generally caused by corruption or a missing file. In this case, the "read only 0 of 8192" indicates this was most likely a missing file.
When presented with an error like this, it's nice to be able to figure out which relation the message is referring to. The word "relation" is Postgres database-speak for a generic object in the database: in this case, it is almost certainly going to be a table or an index. Both of those are, of course, represented by actual files on disk, usually inside of your data_directory. The number given, 1663/18421/31582, is in the standard X/Y/Z format Postgres uses to identify a file, where X represents the tablespace, Y is the database, and Z is the file.
The first number, X, indicates which tablespace this relation belongs to. Tablespaces are physical directories mapped to internal names in the database. Their primary use is to allow you to put tables or indexes on different physical disks. The number here, 1663, is a very familiar one, as it almost always indicates the default tablespace, known as pg_default. If you do not create any additional tablespaces, everything will end up here. On disk, this will be the directory named base underneath your data_directory.
What if the relation you are tracking is not inside of the default tablespace? The number X represents the OID inside the pg_tablespace system table, which will let you know where the tablespace is physically located. To illustrate, let's create a new tablespace and then view the contents of the pg_tablespace table:
$ mkdir /tmp/pgtest $ psql -c "CREATE TABLESPACE ttest LOCATION '/tmp/pgtest'" CREATE TABLESPACE $ psql -c 'select oid, * from pg_tablespace' oid | spcname | spcowner | spclocation | spcacl | spcoptions -------+------------+----------+-------------+--------+------------ 1663 | pg_default | 10 | | | 1664 | pg_global | 10 | | | 78289 | ttest | 10 | /tmp/pgtest | |
Thus, if X were 78289, it would lead us to the tablespace ttest, and we would know that the file we were ultimately looking for will be in the directory indicated by the spclocation column, /tmp/pgtest. If that column is blank, it means the directory to use is data_directory/base.
The second number in our X/Y/Z series, Y, indicates which database the relation belongs to. You can look this information up by querying the pg_database system table like so:
$ psql -xc 'select oid, * from pg_database where oid = 18421' -[ RECORD 1 ]-+----------- oid | 18421 datname | foobar datdba | 10 encoding | 6 datcollate | en_US.utf8 datctype | en_US.utf8 datistemplate | f datallowconn | t datconnlimit | -1 datlastsysoid | 12795 datfrozenxid | 1792 dattablespace | 1663 datacl |
The columns may look different depending on your version of Postgres - the important thing here is that the number Y maps to a database via the oid column - in this case the database foobar. We need to know which database so we can query the correct pg_class table in the next step. We did not have to worry about that in until now as the pg_tablespace and pg_database tables are two of the very few shared system catalogs.
The final number in our X/Y/Z series, Z, represents a file on disk. You can look up which relation it is by querying the pg_class system table of the correct database:
$ psql -d foobar -c "select relname,relkind from pg_class where relfilenode=31582" relname | relkind --------+------- (0 rows)
No rows, so as far as Postgres is concerned that file does not exist! Let's verify that this is the case by looking on the disk. Recall that X was the default tablespace, which means we start in data_directory/base. Once we are in that directory, we can look for the subdirectory holding the database we want (Y or 18421) - it is named after the OID of the database. We can then look for our relfilenode (Z or 31582) inside of that directory:
$ psql -c 'show data_directory'
data_directory
---------------------------------
/var/lib/pgsql/data
(1 row)
$ cd /var/lib/pgsql/data
/var/lib/pgsql/data $ cd base
/var/lib/pgsql/data/base $ cd 18421
/var/lib/pgsql/data/base/18421 $ stat 31582
stat: cannot stat `31582': No such file or directory
So in this case, we confirmed that the relfilenode was no longer there! If it was there, we can probably surmise that the file on disk is corrupted somehow. If the relation was an index, the solution would be to simply run a REINDEX INDEX indexname on it, which will recreate the entire index with a new relfilenode. If it is a table, then things get trickier: we can try a VACUUM FULL on it, which rewrites the entire table, but you will most likely need to go back to your last SQL backup or take a look at your PITR (Point-In-Time Recovery) server.
So why would a relfilenode file not exist on disk? There are a few possibilities:
→ We are looking in the wrong pg_class table (i.e. user error). Each database has its own copy of the pg_class, with different relfilenodes. This means that each subdirectory corresponding to the database has its own set of files as well.
→ It may be a bug in Postgres. Unlikely, unless we have exhausted the other possibilities.
→ Bad RAM or a bad disk may have caused a flipped bit somewhere, for example changing the relfilenode from 12345 to 12340. Possible, but still unlikely.
→ The relfilenode file was removed by something. This is the most likely explanation. We've already hinted above at one way this could happen: a REINDEX. Since the client in this story was (is!) prudently running with log_statement = 'all', I was able to grep back through the logs and found that a REINDEX of a few system tables, including pg_depend, was kicked off a second before the error popped up. While it's impossible to know exactly what the missing relfilenode referred to, the REINDEX is as close to a smoking gun as we are going to get. So the query started, a REINDEX removed one of the indexes it was using, and then the error occurred as Postgres tried to access that index.
In this case, we were able to simply rerun the query and it worked as expected. In normal every day usage, this error should not appear, even when reindexing system tables, but should something like this happen to you, at least you will know what those numbers mean. :)
RailsAdmin Import: Part 2
I recently wrote about importing data in RailsAdmin. RailsAdmin is a Rails engine that provides a nice admin interface for managing your data, which comes packed with configuration options.
In a recent Ruby on Rails ecommerce project, I've been using RailsAdmin, Piggybak (a Rails ecommerce gem supported by End Point), and have been building out custom front-end features such as advanced search and downloadable product support. When this client came to End Point with the project, we offered several options for handling data migration from a legacy system to the new Rails application:
- Create a standard migration file, which migrates data from the existing legacy database to the new data architecture. The advantage with this method is that it requires virtually no manual interaction for the migration process. The disadvantage with this is that it's basically a one-off solution and would never be useful again.
- Have the client manually enter data. This was a reasonable solution for several of the models that required 10 or less entries, but not feasible for the tables containing thousands of entries.
- Develop import functionality to plug into RailsAdmin which imports from CSV files. The advantage to this method is that it could be reused in the future. The disadvantage with ths method is that data exported from the legacy system would have to be cleaned up and formatted for import.
The client preferred option #3. Using a quick script for generating custom actions for RailsAdmin, I developed a new gem called rails_admin_import to handle import that could be plugged into RailsAdmin. Below are some technical details on the generic import solution.
ActiveSupport::Concern
Using ActiveSupport::Concern, the rails_admin_import gem extends ActiveRecord::Base to add the following class methods:
- import_fields: Returns an array of fields that will be included in the import, excluding :id, :created_at, and :updated_at, belongs_to fields, and file fields.
- belongs_to_fields: Returns an array of fields with belongs_to relationships to other models.
- many_to_many_fields: Returns an array of fields with has_and_belongs_to_many relationships to other models.
- file_fields: Returns an array of fields that represent data for Paperclip attached files.
- run_import: Method for running the actual import, receives request params.
And the following instance methods:
- import_files: sets attached files for object
- import_belongs_to_data: sets belongs_to associated data for object
- import_many_to_many_data: sets many_to_many associated data for object
The general approach here is that the import of files, belongs_to, many_to_many relationships, and standard fields makes up the import process for a single object. The run_import method collects success and failure messages for each object import attempt and those results are presented to the user. A regular ActiveRecord save method is called on the object, so the existing validation of objects during each save applies.
Working with Associated Data
One of the tricky parts here is how to handle import of fields representing associations. Given a user model that belongs to a state, country, and has many roles, how would one decide what state, country, or role value to include in the import?
I've solved this by including a dropdown to select the attribute used for mapping in the form. Each of the dropdowns contains a list of model attributes that are used for association mapping. A user can then select the associated mappings when they upload a file. In a real-life situation, I may import the state data via abbreviation, country via display name (e.g. "United States", "Canada") and role via the role name (e.g. "admin"). My data import file might look like this:
| name | favorite_color | state | country | role | |
| Steph Skardal | steph@endpoint.com | blue | CO | United States | admin |
| Aleks Skardal | aleksskardal@gmail.com | green | Norway | user | |
| Roger Skardal | roger@gmail.com | tennis ball yellow | UT | United States | dog |
| Milton Skardal | milton@gmail.com | kibble brown | UT | United States | dog |
Many to Many Relationships
Many to many relationships are handled by allowing multiple columns in the CSV to correspond to the imported data. For example, there may be two columns for role on the user import, where users may be assigned to multiple roles. This may not be suitable for data with a large number of many to many assignments.
Import of File Fields
In this scenario, I've chosen to use open-uri to request existing files from a URL. The CSV must contain the URL for that file to be imported. The import process downloads the file and attaches it to the imported object.
self.class.file_fields.each do |key|
if map[key] && !row[map[key]].nil?
begin
row[map[key]] = row[map[key]].gsub(/\s+/, "")
format = row[map[key]].match(/[a-z0-9]+$/)
open("#{Rails.root}/tmp/uploads/#{self.permalink}.#{format}", 'wb') { |file| file << open(row[map[key]]).read }
self.send("#{key}=", File.open("#{Rails.root}/tmp/uploads/#{self.permalink}.#{format}"))
rescue Exception => e
self.errors.add(:base, "Import error: #{e.inspect}")
end
end
end
If the file request fails, an error is added to the object and presented to the user. This method may not be suitable for handling files that do not currently exist on a web server, but it was suitable for migrating a legacy application.
Configuration: Display
Following RailsAdmin's example for setting configurations, I've added the ability to allow the import display to be set for each model.
config.model User do label :name end
The above configuration will yield success and error messages with the user.name, e.g.:
Configuration: Excluded Fields
In addition to allowing a configurable display option, I've added the configuration for excluding fields.
config.model User do
excluded_fields do
[:reset_password_token, :reset_password_sent_at, :remember_created_at,
:sign_in_count, :current_sign_in_at, :last_sign_in_at, :current_sign_in_ip,
:last_sign_in_ip]
end
end
The above configuration will exclude the specified fields during the import, and they will not display on the import page.
Configuration: Additional Fields and Additional Processing
Another piece of functionality that I found necessary for various imports was to hook in additional import functionality. Any model can have an instance method before_import_save that accepts the row of CSV data and map of CSV keys to perform additional tasks. For example:
def before_import_save(row, map) self.created_nested_items(row, map) end
The above method will create nested items during the import process. This simple extensibility allows for additional data to be handled upon import outside the realm of has_and_belongs_to and belongs_to relationships.
Fields for additional nested data can be defined with the extra_fields configuration, and are shown on the import page.
config.model User do
extra_fields do
[:field1, :field2, :field3, :field4]
end
end
Hooking into RailsAdmin
As I mentioned above, I used a script to generate this Engine. Using RailsAdmin configurable actions, import must be added as an action:
config.actions do dashboard index ... import end
And CanCan settings must be updated to allow for import if applicable, e.g.:
cannot :import, :all can :import, User
Conclusion
My goal in developing this tool was to produce reusable functionality that could easily be applied to multiple models with different import needs, and to use this tool across Rails applications. I've already used this gem in another Rails 3.1 project to quickly import data that would otherwise be difficult to deal with manually. The combination of association mapping and configurability produces a flexibility that encourages reusability.
Feel free to review or check out the code here, or read more about End Point's services here.
Protecting and auditing your secure PostgreSQL data

PostgreSQL functions can be written in many languages. These languages fall into two categories, 'trusted' and 'untrusted'. Trusted languages cannot do things "outside of the database", such as writing to local files, opening sockets, sending email, connecting to other systems, etc. Two such languages are PL/pgSQL and and PL/Perl. For "untrusted" languages, such as PL/PerlU, all bets are off, and they have no limitations placed on what they can do. Untrusted languages can be very powerful, and sometimes dangerous.
One of the reasons untrusted languages can be considered dangerous is that they can cause side effects outside of the normal transactional flow that cannot be rolled back. If your function writes to local disk, and the transaction then rolls back, the changes on disk are still there. Working around this is extremely difficult, as there is no way to detect when a transaction has rolled back at the level where you could, for example, undo your local disk changes.
However, there are times when this effect can be very useful. For example, in a recent thread on the PostgreSQL "general" mailing list (aka pgsql-general), somebody asked for a way to audit SELECT queries into a logging table that would survive someone doing a ROLLBACK. In other words, if you had a function named weapon_details() and wanted to have that function log all requests to it by inserting to a table, a user could simply run the query, read the data, and then rollback to thwart the auditing:
BEGIN;
SELECT weapon_details('BFG 9000'); -- also inserts to an audit table
ROLLBACK; -- inserts to the audit table are now gone!
Certainly there are other ways to track who is using this query, the most obvious being by enabling full Postgres logging (by setting log_statement = 'all' in your postgresql.conf file.) However, extracting that information from logs is no fun, so let's find a way to make that INSERT stick, even if the surrounding function was rolled back.
Stepping back for one second, we can see there are actually two problems here: restricting access to the data, and logging that access somewhere. The ultimate access restriction is to simply force everyone to go through your custom interface. However, in this example, we will assume that someone has psql access and needs to be able to run ad hoc SQL queries, as well as be able to BEGIN, ROLLBACK, COMMIT, etc.
Let's assume we have a table with some Very Important Data inside of it. Further, let's establish that regular users can only see some of that data, and that we need to know who asked for what data, and when. For this example, we will create a normal user named Alice:
postgres=> CREATE USER alice; CREATE ROLE
We need a way to tell which rows are suitable for people like Alice to view. We will set up a quick classification scheme using the nifty ENUM feature of PostgreSQL:
postgres=> CREATE TYPE classification AS ENUM ( 'unclassified', 'restricted', 'confidential', 'secret', 'top secret' ); CREATE TYPE
Next, as a superuser, we create the table containing sensitive information, and populate it:
postgres=> CREATE TABLE weapon (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
cost TEXT NOT NULL,
security_level CLASSIFICATION NOT NULL,
description TEXT NOT NULL DEFAULT 'a fine weapon'
);
NOTICE: CREATE TABLE will create implicit sequence "weapon_id_seq" for serial column "weapon.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "weapon_pkey" for table "weapon"
CREATE TABLE
postgres=> INSERT INTO weapon (name,cost,security_level) VALUES
('Crowbar', 10, 'unclassified'),
('M9', 200, 'restricted'),
('M16A2', 300, 'restricted'),
('M4A1', 400, 'restricted'),
('FGM-148 Javelin', 700, 'confidential'),
('Pulse Rifle', 50000, 'secret'),
('Zero Point Energy Field Manipulator', 'unknown', 'top secret');
INSERT 0 7
We don't want anyone but ourselves to be able to access this table, so for safety, we make some explicit revocations. We'll examine the permissions before and after we do this:
postgres=> \dp weapon
Access privileges
Schema | Name | Type | Access privileges | Column access privileges
--------+--------+-------+-------------------+--------------------------
public | weapon | table | |
postgres=> REVOKE ALL ON TABLE weapon FROM public;
REVOKE
postgres=> \dp weapon
Access privileges
Schema | Name | Type | Access privileges | Column access privileges
--------+--------+-------+---------------------------+--------------------------
public | weapon | table | postgres=arwdDxt/postgres |
As you can see, what the REVOKE really does is remove the implicit "no permission" and grant explicit permissions to only the postgres user to view or modify the table. Let's confirm that Alice cannot do anything with that table:
postgres=> \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> postgres=> SELECT * FROM weapon; ERROR: permission denied for relation weapon postgres=> postgres=> UPDATE weapon SET id = id; ERROR: permission denied for relation weapon
Alice does need to have access to parts of this table, so we will create a "wrapper function" that will query the table for us and return some results. By declaring this function as SECURITY DEFINER, it will run as if the person who created the function invoked it - in this case, the postgres user. For this example, we'll be letting Alice see the "cost and description" of exactly one item at a time. Further, we are not going to let her (or anyone else using this function) view certain items. Only those items classified as "confidential" or lower can be viewed (i.e. "confidential", "restricted", or "unclassified"). Here's the first version of our function:
postgres=> CREATE LANGUAGE plperlu;
CREATE LANGUAGE
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$
use strict;
use warnings;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
CREATE FUNCTION
The above should be fairly self-explanatory. We are using PL/Perl's built-in database access functions, such as spi_prepare, to do the actual querying. Let's confirm that this works as it should for Alice:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon_details('crowbar');
name | cost | description
---------+------+---------------
Crowbar | 10 | a fine weapon
(1 row)
postgres=> SELECT * FROM weapon_details('anvil');
name | cost | description
------+------+-------------
(0 rows)
postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR: Sorry, you are not allowed to view information on that weapon!
CONTEXT: PL/Perl function "weapon_details"
Now that we have solved the restricted access problem, let's move on the auditing. We will create a simple table to hold information about who accessed what and when:
postgres=> CREATE TABLE data_audit ( tablename TEXT NOT NULL, arguments TEXT NULL, results INTEGER NULL, status TEXT NOT NULL DEFAULT 'normal', username TEXT NOT NULL DEFAULT session_user, txntime TIMESTAMPTZ NOT NULL DEFAULT now(), realtime TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp() ); CREATE TABLE
The 'tablename' column simply records which table they are getting data from. The 'arguments' is a free-form field describing what they were looking for. The 'results' column shows how many matching rows were found. The 'status' column will be used primarily to log unusual requests, such as the case where Alice looks for a forbidden item. The 'username' column records the name of the user doing the searching. Because we are using functions with SECURITY DEFINER set, this needs to be session_user, not current_user, as the latter will switch to 'postgres' within the function, and we want to log the real caller (e.g. 'alice'). The final two columns tell us then the current transaction started, and the exact time when an entry was made inside of this table. As a first attempt, we'll have our function do some simple inserts to this new data_audit table:
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$
use strict;
use warnings;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};
## Log this request
$SQL = 'INSERT INTO data_audit(tablename,arguments,results,status)
VALUES ($1,$2,$3,$4)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
$sth = spi_prepare($SQL, 'TEXT', 'TEXT', 'INTEGER', 'TEXT');
spi_exec_prepared($sth, 'weapon', $name, $rv->{processed}, $status);
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
However, this fails the case pointed out in the original poster's email about viewing the data within a transaction that is then rolled back. It also fails to work at all when a forbidden item is requested, as that insert is rolled back by the die() call:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon_details('crowbar');
name | cost | description
---------+------+---------------
Crowbar | 10 | a fine weapon
(1 row)
postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR: Sorry, you are not allowed to view information on that weapon!
CONTEXT: PL/Perl function "weapon_details"
postgres=> BEGIN;
BEGIN
postgres=> SELECT * FROM weapon_details('m9');
name | cost | description
------+------+---------------
M9 | 200 | a fine weapon
(1 row)
postgres=> ROLLBACK;
ROLLBACK
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=> SELECT * FROM data_audit \x \g
Expanded display is on.
-[ RECORD 1 ]----------------------------
tablename | weapon
arguments | crowbar
results | 1
status | normal
username | alice
txntime | 2012-01-30 17:37:39.497491-05
realtime | 2012-01-30 17:37:39.545891-05
How do we get around this? We need a way to commit something that will survive the surrounding transaction's rollback. The closest thing Postgres has to such a thing at the moment is to connect back to the database with a new and entirely separate connection. Two such popular ways to do so are with the dblink program and the PL/PerlU language. Obviously, we are going to focus on the latter, but all of this could be done with dblink as well. Here are the additional steps to connect back to the database, do the insert, and then leave again:
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT) RETURNS TABLE (name TEXT, cost TEXT, description TEXT) LANGUAGE plperlu SECURITY DEFINER VOLATILE AS $bc$
use strict;
use warnings;
>use DBI;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = defined $row ? delete $row->{allowed} : 1;
## Log this request
$SQL = 'INSERT INTO data_audit(username,tablename,arguments,results,status)
VALUES (?,?,?,?,?)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
my $dbh = DBI->connect('dbi:Pg:service=auditor', '', '',
{AutoCommit=>0, RaiseError=>1, PrintError=>0});
$sth = $dbh->prepare($SQL);
my $user = spi_exec_query('SELECT session_user')->{rows}[0]{session_user};
$sth->execute($user, 'weapon', $name, $rv->{processed}, $status);
$dbh->commit();
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
CREATE FUNCTION
Note that because we are making external changes, we marked the function as VOLATILE, which ensures that it will always be run every time it is called, and not cached in any form. We are also using a Postgres service file with the 'db:Pg:service=auditor'. This means that the connection information (username, password, database) is contained in an external file. This is not only tidier than hard-coding those values into this function, but safer as well, as the function itself can be viewed by Alice. Finally, note that we are passing the 'username' directly into the function this time, as we have a brand new connection which is no longer linked to the 'alice' user, so we have to derive it ourselves from "SELECT session_user" and then pass it along.
Once this new function is in place, and we re-run the same queries as we did before, we see three entries in our audit table:
postgres=> \c postgres postgres You are now connected to database "postgres" as user "postgres". Expanded display is on. -[ RECORD 1 ]---------------------------- tablename | weapon arguments | crowbar results | 1 status | normal username | alice txntime | 2012-01-30 17:56:01.544557-05 realtime | 2012-01-30 17:56:01.54569-05 -[ RECORD 2 ]---------------------------- tablename | weapon arguments | pulse rifle results | 1 status | forbidden username | alice txntime | 2012-01-30 17:56:01.559532-05 realtime | 2012-01-30 17:56:01.561225-05 -[ RECORD 3 ]---------------------------- tablename | weapon arguments | m9 results | 1 status | normal username | alice txntime | 2012-01-30 17:56:01.573335-05 realtime | 2012-01-30 17:56:01.574989-05
So that's the basic premise of how to solve the auditing problem. For an actual production script, you would probably want to cache the database connection by sticking things inside of the special %_SHARED hash available to PL/Perl and Pl/PerlU. Note that each user gets their own version of that hash, so Alice will not be able to create a function and have access to the same %_SHARED hash that the postgres user has access to. It's probably a good idea to simply not let users like Alice use the language at all. Indeed, that's the default when we do the CREATE LANGUAGE call as above:
postgres=> \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> CREATE FUNCTION showplatform() RETURNS TEXT LANGUAGE plperlu AS $bc$ return $^O; $bc$; ERROR: permission denied for language plperlu
Further refinements to the actual script might include refactoring the logging bits to a separate function, writing some of the auditing data to a file on the local disk, recording the actual results returned to the user, and sending the data to another Postgres server entirely. For that matter, as we are using DBI, you could send it to other place entirely - such as a MySQL, Oracle, or DB2 database!
Another place for improvement would be associating each user with a security_level classification, such that any user could run the function and only see things at or below their level, rather than hard-coding the level as "confidential" as we have done here. Another nice refinement might be to always return undef (no matches) for items marked "top secret", to prevent the very existence of a top secret weapon from being deduced. :)
Interchange loops using DBI Slice
One day I was reading through the documentation on search.cpan.org for the DBI module and ran across an attribute that you can use with selectall_arrayref() that creates the proper data structure to be used with Interchange's object.mv_results loop attribute. The attribute is called Slice which causes selectall_arrayref() to return an array of hashrefs instead of an array of arrays. To use this you have to be working in global Perl modules as Safe.pm will not let you use the selectall_arrayref() method.
An example of what you could use this for is an easy way to generate a list of items in the same category. Inside the module, you would do like this:
my $results = $dbh->selectall_arrayref(
q{
SELECT
sku,
description,
price,
thumb,
category,
prod_group
FROM
products
WHERE
category = ?},
{ Slice => {} },
$category
);
$::Tag->tmpn("product_list", $results);
In the actual HTML page, you would do this:
<table cellpadding=0 cellspacing=2 border=1>
<tr>
<th>Image</th>
<th>Description</th>
<th>Product Group</th>
<th>Category</th>
<th>Price</th>
</tr>
[loop object.mv_results=`$Scratch->{product_list}` prefix=plist]
[list]
<tr>
<td><a href="/cgi-bin/vlink/[plist-param sku].html"><img src="[plist-param thumb]"></a></td>
<td>[plist-param description]</td>
<td>[plist-param prod_group]</td>
<td>[plist-param category]</td>
<td>[plist-param price]</td>
</tr>
[/list]
[/loop]
</table>
We normally use this when writing ActionMaps and using some template as our setting for mv_nextpage.
Some great press for College District
College District has been getting some positive press lately, the most recent being a Forbes article which talks about the success they have been seeing in the last few years.
College District is a company that sells collegiate merchandise to fans. They got their start focusing on the LSU Tigers at TigerDistrict.com and have branched out to teams such as the Oregon Ducks and Alabama Roll Tide.
We've been working with Jared Loftus @ College District for more then four and a half years. College District is running on a heavily modified Interchange system with some cool Postgres tricks. The system can support a nearly unlimited number of sites, running on 2 catalogs (1 for the admin, 1 for the front end) and 1 database. The key to the system is different schemas, fronted by views, that hide and expose records based on the database user that is connected. The great thing about this system is that Jared can choose to launch a new store within a day and be ready for sales, something he has taken advantage of in the past when a team is on fire and he sees an opportunity he can't pass up.
We are currently preparing for a re-launch of the College District site that will focus on crowd-sourced designs. Artists and fans will submit their designs, have them voted on, some will be chosen to be sold and the folks that have their designs chosen will get paid for their efforts. The goal here is to grow a community that guides what College District and the individual school sites ultimately sell.
With College District's quick growth we've also been helping them improve their order fulfillment process. This includes streamlining how orders are picked, packed and shipped. The introduction of bar code scanners will help with the accuracy and speed of the process.
We get a kick out of seeing our clients succeed, especially those that come to us with a clear vision and a good attitude, and then put the hard work in to make it happen. It's an exciting year ahead for College District and we'll be right there supporting them on the journey.
Interchange Search Caching with "Permanent More"
Most sites that use Interchange take advantage of Interchange's "more lists". These are built-in tools that support an Interchange "search" (either the search/scan action, or result of direct SQL via [query]) to make it very easy to paginate results. Under the hood, the more list is a drill-in to a cached "search object", so each page brings back a slice from the cache of the original search. There are extensive ways to modify the look and behavior of more lists and, with a bit of effort, they can be configured to meet design requirements.
Where more lists tend to fall short, however, is with respect to SEO. There are two primary SEO deficiencies that get business stakeholders' attention:
- There is little control over the construction of the URLs for more lists. They leverage the scan actionmap and contain a hash key for the search object and numeric data to identify the slice and page location. They possess no intrinsic value in identifying the content they reference.
- The search cache by default is ephemeral and session-specific. This means all those results beyond page 1 the search engine has cataloged will result in dead links for search users who try to land directly on the more-listed pages.
It is the latter issue that I wish to address because there is--and has been for some time now--a simple mechanism called "permanent more" to remedy the default behavior.
You can leverage "permanent more" by adding the boolean mv_more_permanent, or the shorthand pm, to your search conditions. E.g.:
Link:
<a href="[area search="
co=1
sf=category
se=Foo
op=rm
more=1
ml=5
pm=1
"]">All Foos</a>
Loop:
[loop search="
co=1
sf=category
se=Foo
op=rm
more=1
ml=5
pm=1
"]
...loop body with [more-list]...
[/loop]
Query:
[query
list=1
more=1
ml=10
pm=1
sql="SELECT * FROM products WHERE category LIKE '%Foo%'"
]
...same as loop but with 10 matches/page...
[/query]
If the initial search is defined with the "permanent more" setting, it will produce the following adjustments:
- The hash key used to store and identify the search cache is deterministic based on the search conditions. Many searches for Interchange are category driven. Thus, all end users who wish to browse a category end up clicking identical links, which create duplicate search caches, belonging uniquely to them. With permanent more, they all share the same cache, with the same identifier. As long as the search conditions don't change, neither does the cache identifier. Even as the cache is refreshed with new executions of the search, the object remains in the same location. Thus, the results a search engine produced this morning reference links still valid now, tomorrow, or next week, provided they reference the same search conditions.
- The cached search object has no session affinity. Any link referencing the cache with the correct hash key has access to the content.
Taken together, "permanent more" removes (for the most part, addressed later) dead links from more lists cataloged by search engines. There are, however, other benefits to "permanent more" beyond those intended as described above:
- As stated in passing, standard Interchange search caching produces duplicate search objects for common search conditions. For a busy site, these caches can have an impact on storage. Typically, maintenance is implemented to clean up cache files for all such files whose age exceeds by some amount the session duration (standard is 48 hours). With permanent more, duplicate caches are eliminated. A cache location is reused by all users with the same search requirements, keeping data-storage requirements for caches to the minimum necessary. As searches change, ophaned caches can still easily be cleaned up as they will immediately start to age with no more access to them necessary for storage.
- For the same reason that "permanent more" resolves search-engine links, it also resolves content management for individual sites using a reverse proxy for caching. Because most (and certainly the easiest) caching keys are based off of URL, the deterministic nature of the hash keys for "permanent more" allows assurance that the cached content in the proxy accurately reflects the search content over time, and that all users will hit the cached resource and not generate new, unique links with varying hash keys.
One shortcoming of "permanent more" to be aware of is the impact of changing data underneath the search. Even if search conditions do not change, the count and order of matching record sets may. So, e.g., enough products may be removed from a given category to cause the last page of a more list to become empty, which would cause any specific link into that page to become dead. More minor, but still a possibility, is the introduction or removal of products so that a particularly searched-for term has been "bumped" to another page within the search cache since the last time the search engine crawled the more lists. For searches backed by particularly volatile data, "permanent more" may not be sufficient to address search-engine or caching demands.
Finally, "permanent more" should be avoided for any search features that may cache data sensitive to an individual user. This is unlikely to happen as, under most circumstances, the configuration of the search itself will change based on the unique characteristics of the user executing the search (e.g., a username included in a query to review order history). However, it is still possible that context-sensitive information could be stored in the search object and, if so, all other users with access to the more lists would have access to that information.
Finding PostgreSQL temporary_file problems with tail_n_mail
PostgreSQL does as much work as it can in RAM, but sometimes it needs to (or thinks that it needs to) write things temporarily to disk. Typically, this happens on large or complex queries in which the required memory is greater than the work_mem setting.
This is usually an unwanted event: not only is going to disk much slower than keeping things in memory, but it can cause I/O contention. For very large, not-run-very-often queries, writing to disk can be warranted, but in most cases, you will want to adjust the work_mem setting. Keep in mind that this is very flexible setting, and can be adjusted globally (via the postgresql.conf file), per-user (via the ALTER USER command), and dynamically within a session (via the SET command). A good rule of thumb is to set it to something reasonable in your postgresql.conf (e.g. 8MB), and set it higher for specific users that are known to run complex queries. When you discover a particular query run by a normal user requires a lot of memory, adjust the work_mem for that particular query or set of queries.
How do you tell when you work_mem needs adjusting, or more to the point, when Postgres is writing files to disk? The key is the setting in postgresql.conf called log_temp_files. By default it is set to -1, which does no logging at all. Not very useful. A better setting is 0, which is my preferred setting: it logs all temporary files that are created. Setting log_temp_files to a positive number will only log entries that have an on-disk size greater than the given number (in kilobytes). Entries about temporary files used by Postgres will appear like this in your log file:
2011-01-12 16:33:34.175 EST LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp16501.0", size 130220032
The only important part is the size, in bytes. In the example above, the size is 124 MB, which is not that small of a file, especially as it may be created many, many times. So the question becomes, how can we quickly parse the files and get a sense of which queries are causing excess writes to disk? Enter the tail_n_mail program, which I recently tweaked to add a "tempfile" mode for just this purpose.
To enter this mode, just name your config file with "tempfile" in its name, and have it find the lines containing the temporary file information. It's also recommended you make use of the tempfile_limit parameter, which limits the results to the "top X" ones, as the report can get very verbose otherwise. An example config file and an example invocation via cron:
$ cat tail_n_mail.tempfile.myserver.txt ## Config file for the tail_n_mail program ## This file is automatically updated ## Last updated: Thu Nov 10 01:23:45 2011 MAILSUBJECT: Myserver tempfile sizes EMAIL: greg@endpoint.com FROM: postgres@myserver.com INCLUDE: temporary file TEMPFILE_LIMIT: 5 FILE: /var/log/pg_log/postgres-%Y-%m-%d.log $ crontab -l | grep tempfile ## Mail a report each morning about tempfile usage: 0 5 * * * bin/tail_n_mail tnm/tail_n_mail.tempfile.myserver.txt --quiet
For the client I wrote this for, we run this once a day and it mails us a nice report giving the worst tempfile offenders. The queries are broken down in three ways:
- Largest overall temporary file size
- Largest arithmetic mean (average) size
- Largest total size across all the same query
Here is a slightly edited version of an actual tempfile report email:
Date: Mon Nov 7 06:39:57 2011 EST
Host: myserver.example.com
Total matches: 1342
Matches from [A] /var/log/pg_log/2011-11-08.log: 1241
Matches from [B] /var/log/pg_log/2011-11-09.log: 101
Not showing all lines: tempfile limit is 5
Top items by arithmetic mean | Top items by total size
----------------------------------+-------------------------------
860 MB (item 5, count is 1) | 17 GB (item 4, count is 447)
779 MB (item 1, count is 2) | 8 GB (item 2, count is 71)
597 MB (item 7, count is 1) | 6 GB (item 334, count is 378)
597 MB (item 8, count is 1) | 6 GB (item 46, count is 104)
596 MB (item 9, count is 1) | 5 GB (item 3, count is 63)
[1] From file B Count: 2
Arithmetic mean is 779.38 MB, total size is 1.52 GB
Smallest temp file size: 534.75 MB (2011-11-08 12:33:14.312 EST)
Largest temp file size: 1024.00 MB (2011-11-08 16:33:14.121 EST)
First: 2011-11-08 05:30:12.541 EST
Last: 2011-11-09 03:12:22.162 EST
SELECT ab.order_number, TO_CHAR(ab.creation_date, 'YYYY-MM-DD HH24:MI:SS') AS order_date,
FROM orders o
JOIN order_summary os ON (os.order_id = o.id)
JOIN customer c ON (o.customer = c.id)
ORDER BY creation_date DESC
[2] From file A Count: 71
Arithmetic mean is 8.31 MB, total size is 654 MB
Smallest temp file size: 12.12 MB (2011-11-08 06:12:15.012 EST)
Largest temp file size: 24.23 MB (2011-11-08 19:32:45.004 EST)
First: 2011-11-08 06:12:15.012 EST
Last: 2011-11-09 04:12:14.042 EST
CREATE TEMPORARY TABLE tmp_sales_by_month AS SELECT * FROM sales_by_month_view;
While it still needs a little polishing (such as showing which file each smallest/largest came from), it has already been an indispensible tool forfinding queries that causing I/O problems via frequent and/or large temporary files.
Viewing schema changes over time with check_postgres
Version 2.18.0 of check_postgres, a monitoring tool for PostgreSQL, has just been released. This new version has quite a large number of changes: see the announcement for the full list. One of the major features is the overhaul of the same_schema action. This allows you to compare the structure of one database to another and get a report of all the differences check_postgres finds. Note that "schema" here means the database structure, not the object you get from a "CREATE SCHEMA" command. Further, remember the same_schema action does not compare the actual data, just its structure.
Unlike most check_postgres actions, which deal with the current state of a single database, same_schema can compare databases to each other, as well as audit things by finding changes over time. In addition to having the entire system overhauled, same_schema now allows comparing as many databases you want to each other. The arguments have been simplified, in that a comma-separated list is all that is needed for multiple entries. For example:
./check_postgres.pl --action=same_schema \ --dbname=prod,qa,dev --dbuser=alice,bob,charlie
The above command will connect to three databases, as three different users, and compare their schemas (i.e. structures). Note that we don't need to specify a warning or critical value: we consider this an 'OK' Nagios check if the schemas match, otherwise it is 'CRITICAL'. Each database gets assigned a number for ease of reporting, and the output looks like this:
POSTGRES_SAME_SCHEMA CRITICAL: (databases:prod,qa,dev) Databases were different. Items not matched: 1 | time=0.54s DB 1: port=5432 dbname=prod user=alice DB 1: PG version: 9.1.1 DB 1: Total objects: 312 DB 2: port=5432 dbname=qa user=bob DB 2: PG version: 9.1.1 DB 2: Total objects: 312 DB 3: port=5432 dbname=dev user=charlie DB 3: PG version: 9.1.1 DB 3: Total objects: 313 Language "plpgsql" does not exist on all databases: Exists on: 3 Missing on: 1, 2
The second large change was a simplification of the filtering options. Everything is now controlled by the --filter argument, and basically you can tell it what things to ignore. For example:
./check_postgres.pl --action=same_schema \ --dbname=A,B --filter=nolanguage,nosequence
The above command will compare the schemas on databases A and B, but will ignore any difference in which languages are installed, and ignore any differences in the sequences used by the databases. Most objects can be filtered out in a similar way. There are also a few other useful options for the --filter argument:
- noposition: Ignore what order columns are in
- noperms: Do not worry about any permissions on database objects
- nofuncbody: Do not check function source
The final and most exciting large change is the ability to compare a database to itself, over time. In other words, you can see exactly what changed during a certain time period. We have a client using that now to send a daily report on all schema changes made in the last 24 hours, for all the databases in their system. This is a very nice thing for a DBA to receive: not only is there a nice audit trail in your email, you can answer questions such as:
- Was this a known change, or did someone make it without letting anyone else know?
- Did somebody fat-finger and drop an index by mistake?
- Were the changes applied to database X also applied to database Y and Z?
To enable time-based checks, simply provide a single database to check. The first time it is run, same_schema simply gathers all the schema information and stores it on disk. The next time it is run, it detects the file, reads it in as database "2", and compares it to the current database (number "1"). The --replace argument will rewrite the file with the current data when it is done. So the cronjob for the aforementioned client is as simple as:
10 0 * * * ~/bin/check_postgres.pl --action=same_schema \ --host=bar --dbname=abc --quiet --replace
The --quiet argument ensures that no output is given if everything is 'OK'. If everything is not okay (i.e. if differences are found), cron gets a bunch of input sent to it and duly mails it out. Thus, a few minutes after 10AM each day, a report is sent if anything has changed in the last day. Here's a slightly redacted version of this morning's report, which shows that a schema named "stat_backup" was dropped at some point in the last 24 hours (which was a known operation):
POSTGRES_SAME_SCHEMA CRITICAL: DB "abc" (host:bar) Databases were different. Items not matched: 1 | time=516.56s DB 1: port=5432 host=bar dbname=abc user=postgres DB 1: PG version: 8.3.16 DB 1: Total objects: 11863 DB 2: File=check_postgres.audit.port.5432.host.bar.db.abc DB 2: Creation date: Sun Oct 2 10:06:12 2011 CP version: 2.18.0 DB 2: port=5432 host=bar dbname=abc user=postgres DB 2: PG version: 8.3.16 DB 2: Total objects: 11864 Schema "stat_backup" does not exist on all databases: Exists on: 2 Missing on: 1
As you can see, the first part is a standard Nagios-looking output, followed by a header explaining how we defined database "1" and "2" (the former a direct database call, and the latter a frozen version of the same.)
Sometimes you want to store more than one version at a time: for example, if you want both a daily and a weekly view. To enable this, use the --suffix argument to create different instances of the saved file. For example:
10 0 * * * ~/bin/check_postgres.pl --action=same_schema \ --host=bar --dbname=abc --quiet --replace --suffix=daily 10 0 * * Fri ~/bin/check_postgres.pl --action=same_schema \ --host=bar --dbname=abc --quiet --replace --suffix=weekly
The above command would end up recreating this file every morning at 10:check_postgres.audit.port.5432.host.bar.db.abc.daily and this file each Friday at 10: check_postgres.audit.port.5432.host.bar.db.abc.weekly.
Thanks to all the people that made 2.18.0 happen (see the release notes for the list). There are still some rough edges to the same_schema action: for example, the output could be a little more user-friendly, and not all database objects are checked yet (e.g. no custom aggregates or operator classes). Development is ongoing; patches and other contributions are always welcome. In particular, we need more translators. We have French covered, but would like to include more languages. The code can be checked out at:
git clone git://bucardo.org/check_postgres.git
There is also a github mirror if you so prefer: https://github.com/bucardo/check_postgres.
You can also file a bug (or feature request), or join one of the mailing lists: general, announce, and commit.
A comparison of JasperSoft iReport and Pentaho Reporting
I've recently been involved in reporting projects using both JasperSoft's iReport and Pentaho's Reporting, so this seemed a good opportunity to compare the two. Both are Java-based, open source reporting systems which claim to build "pixel-perfect" documents ("pixel-perfect" means that when you put something somewhere on a report design, it doesn't move around. That this isn't taken for granted is a rant for another time). I have more experience with Pentaho than with JasperSoft, and once reviewed a book on Pentaho; I'll try to give the two a fair evaluation, but in the end I can't promise my greater experience with Pentaho won't affect my conclusions one way or the other. Both suites are available in open source and commercial flavors; I'll consider only the open source versions here.
First let me point out that JasperSoft and Pentaho both produce business intelligence software suites. The two suites exist in both community (open source) and enterprise (commercial) forms, are well worth comparing in their entirety, but I'm focusing principally on the report builder component of each, because that's where my recent experience has led me. These report builder packages allow users to build complex and visually interesting data-based documents from many different kinds of data sources. A "document" could be a simple form letter, an address book, or a complex dashboard of business metrics. In each case, users build documents by dragging and dropping various components into report "bands", and then modifying components' properties. "Bands" are horizontal sections of the page with different roles. The page header and footer bands, for instance, are (obviously) printed at the top and bottom of each page. "Detail" bands print once per row returned by the query that underlies the report. Both iReport and Pentaho Reporting will group and filter query results for you, if you want, and include header and footer bands for each group. Both products allow users to publish finished reports to a server where other users can view them, schedule them to run periodically, or modify them for new purposes, and both provide a Java-based reporting library to embed reporting in other applications.

iReport Query Dialog
Reports in both products are based on queries. These queries may be SQL using JDBC data sources, or they can come from other more obscure data sources, such as MDX queries, XQuery, scripts in various languages, web services, and more. Both also provide a query editor, at least for SQL queries, and in fact, both use the same query editor. I've only used it rarely, and only in Pentaho; both products also allow users to type in queries free-form, which I much prefer. In Pentaho, the data source for the query is embedded in the report itself, in the form of a JDBC URL, a JNDI name, or something else appropriate for that data source, so if you publish the report to a server, you're responsible for ensuring that JNDI name or JDBC URL or whatever makes sense on that server. Jasper, on the other hand, prompts the user with a list of available data sources when publishing a report to a server. Jasper's method seems more friendly, but Pentaho's choice may have its advantages here, because each report is self-contained, whereas Jasper's reports have metadata outside the single report file to describe the associated data source.
The Pentaho Report Builder
The component libraries available in each product are fairly similar. Users can select from basic drawing components such as lines and circles, static formatted text labels, and of course various numeric, textual, and date components to display query results. Both products also include complex charting components, to display visualizations of many different kinds. Though the charting functionality of both products, at least at the basic level I've used, is quite similar, I found iReport's charting dialogs quite helpful for making complex charts much easier to create than they would otherwise be. In Pentaho, after adding a chart component the user can open a special properties window, but the window offers few clues beyond the usual tooltips and occasionally meaningful property names to help the user know what to do. In contrast, iReport describes many of the properties involved in clear language.
iReport's chart selection dialog
This brings me to the topic of documentation, which, in fact, I found lacking for both produts. Yes, these are open source projects, and yes, documentation isn't always as fun to write as code, so yes, open source projects sometimes end up with lousy documentation. The enterprise versions come with documentation in one form or another, and there are several books published on the different suites' components, including reporting. But the documentation available free-of-charge on the web left me unimpressed. Here Pentaho was particularly frustrating, largely because Pentaho's reporting has changed a great deal from version to version, especially with its 3.5 release a couple years ago, so much of the help available in forums and wikis is completely out of date. Jasper documentation was more difficult to find, in general, but more accurate and up-to-date when I found it.
Both products save their results in fairly comprehensible formats, which is helpful if you ever need to modify them by hand, without the help of the report building tool. iReport's files are in a single XML file; Pentaho creates a ZIP-ed archive of several XML files. That fact has come in quite handy several times, both with Pentaho and with iReport, typically because it's faster for me to edit XML by hand than click on each of 20 components and tweak properties one-by-one, because (in Pentaho) I needed to modify a data source, or because the tool failed to figure out what columns to expect from a query and I wanted to enter them manually.
One major difference between the two products stems from what is available in their open source "community editions". Jasper looks a bit more polished to me, the documentation is somewhat more consistent, and its selection of sample reports and sample code is more comprehensive -- but the community edition does little more than support iReport. Pentaho's community edition, in addition to the reporting functions I've discussed, also offers ad-hoc web based reporting and a powerful MDX analysis package. JasperSoft offers those features only in its enterprise edition. That may not be a big deal in some places, but it is the deciding factor in others. Having access to Pentaho's code has made possible a number of things we certainly couldn't have done otherwise.
I've been trying to keep this "a comparison" of Pentaho Reporting and iReport, rather than a "showdown" or "shootout" or even "Jasper vs. Pentaho", but at some point, conclusions must be drawn. I'll draw here somewhat on evidence not mentioned above, because, well, the post was getting a bit long as it was, and I didn't want to describe everything. I admit to liking iReport's interface better, because to me it seems to make better use of screen real estate. Google was much better able to answer questions for iReport than for Pentaho. But, although both products sometimes seem to make simple things hard, Pentaho seemed to do this less than iReport, and the various magic incantations I needed to get things working were fewer in Pentaho. In the end, the much greater capability of Pentaho's open source offering over JasperSoft's clinches it for me. I'll take open source over closed any day.
PostgreSQL Serializable and Repeatable Read Switcheroo
PostgreSQL allows for different transaction isolation levels to be specified. Because Bucardo needs a consistent snapshot of each database involved in replication to perform its work, the first thing that the Bucardo daemon does when connecting to a remote PostgreSQL database is:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE;
The 'READ WRITE' bit sets us in read/write mode, just in case the entire database has been set to read only (a quick and easy way to make your slave databases non-writeable!). It also sets the transaction isolation level to 'SERIALIZABLE'. At least, it used to. Now Bucardo uses 'REPEATABLE READ' like this:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ WRITE;
Why the change? In version 9.1 of PostgreSQL the concept of SSI (Serializable Snapshot Isolation) was introduced. How it actually works is a little complicated (follow the link for more detail), but before 9.1 PostgreSQL was only *sort of* doing serialized transactions when you asked for serializable mode. What it was really doing was repeatable read and not trying to really serialize the transactions. In 9.1, PostgreSQL is doing *true* serializable transactions. It also adds a new distinct 'internal' transaction mode, 'repeatable read', which does exactly what the old 'serializable' used to do. Finally, if you issue a 'repeatable read' on a pre-9.1 database, it silently upgrades it to the old 'serializable' mode.
So in summary, if your application was using 'SERIALIZABLE' before, you can now replace that with 'REPEATABLE READ' and get the exact same behavior as before, regardless of the version. Of course, if you want *true* serializable transactions, use SERIALIZABLE. It will continue to mean the same as 'REPEATABLE READ' in pre-9.1 databases, and provide true serializability in 9.1 and beyond. (I haven't determined yet if Bucardo is going to use this new level, as it comes with a little bit of overhead)
Since this can be a little confusing, here's a handy chart showing how version 9.1 changed the meaning of SERIALIZABLE, and added a new 'internal' isolation level:
| Postgres version 9.0 and earlier | Postgres version 9.1 and later | ||||||
|---|---|---|---|---|---|---|---|
| Requested isolation level | → | Actual internal isolation level | Version comparison | Actual internal isolation level | ← | Requested isolation level | |
| READ UNCOMMITTED | ↘ | Read committed | Exact same | Read committed | ↙ | READ UNCOMMITTED | |
| READ COMMITTED | ↗ | ↖ | READ COMMITTED | ||||
| REPEATABLE READ | ↘ | Serializable | Functionally identical | Repeatable read | ← | REPEATABLE READ | |
| SERIALIZABLE | ↗ | ||||||
| 9.1 only! | Serializable (true) | ← | SERIALIZABLE | ||||
Congratulations and thanks to Kevin Grittner and Dan Ports for making true serializability a reality!
Bucardo PostgreSQL replication to other tables with customname
(Don't miss the Bucardo5 talk at Postgres Open in Chicago)
Work on the next major version of Bucardo is wrapping up (version 5 is now in beta), and two new features have been added to this major version. The first, called customname, allows you to replicate to a table with a different name. This has been a feature people have been asking for a long time, and even allows you to replicate between differently named Postgres schemas. The second option, called customcols, allows you replicate to different columns on the target: not only a subset, but different column names (and types), as well as other neat tricks.
The "customname" options allows changing of the table name for one or more targets. Bucardo replicates tables from the source databases to the target databases, and all tables must have the same name and schema everywhere. With the customname feature, you can change the target table names, either globally, per database, or per sync.
We'll go through a full example here, using a stock 64-bit RedHat 6.1 EC2 box (ami-5e837b37). I find EC2 a great testing platform - not only can you try different operating systems and architectures, but (as my own personal box is very customized) it is great to start afresh from a stock configuration.
First, let's turn off SELinux, install the EPEL rpm, update the box, and install a few needed packages.
echo 0 > /selinux/enforce wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-5.noarch.rpm rpm -ivh epel-release-6-5.noarch.rpm yum update yum install emacs-nox perl-DBIx-Safe perl-DBD-Pg git postgresql-plperl cpan boolean
The yum update takes a while to run, but I always feel better when things are up to date. Next, we will create a new database cluster, create the /var/run/bucardo directory that Bucardo uses to store its PIDs, adjust the ultraconservative stock pg_hba.conf file, and start Postgres up:
service postgresql initdb mkdir /var/run/bucardo chown postgres.postgres /var/run/bucardo emacs /var/lib/pgsql/data/pg_hba.conf service postgresql start
For the pg_hba.conf configuration file, because we want to be able to connect to the database as the bucardo user without actually logging into that account, we will allow access using the 'md5' (password) method instead of 'ident'. But we don't want to bother creating a password for the postgres user, we will still allow those connections via ident. The relevant lines in the pg_hba.conf will end up like this:
# TYPE DATABASE USER METHOD local all postgres ident local all all md5
At this point, we (as the postgres user) download and install Bucardo itself:
su - postgres git clone git://bucardo.org/bucardo.git cd bucardo perl Makefile.PL make sudo make install bucardo install# (enter 'p' and keep the default values)
We are now ready to start testing out the new customname feature. First we will need some data to replicate! For this demo we are going to use one of the handy sample datasets from the dbsamples project. The one we will use has a few small tables with information about towns in France. Note that the tarball does not (sadly) contain a top-level directory, so we have to create one ourselves. We will then create three identical databases holding the data from that file.
wget http://pgfoundry.org/frs/download.php/935/french-towns-communes-francaises-1.0.tar.gz mkdir frenchtowns cd frenchtowns tar xvfz ../french-towns-communes-francaises-1.0.tar.gz psql -c 'create database french1' psql french1 -q -f french-towns-communes-francaises.sql psql -c 'create database french2 template french1' psql -c 'create database french3 template french1' psql -c 'create database french4 template french1'
Bucardo is installed but does not know what to do yet, so we will teach Bucardo about each of the databases, and add in all the tables, grouping then into a herd in the process. Finally, we create a sync in which french1 and french2 are both source (master) databases, and french3 and french4 will be target (slave) databases.
bucardo add db f1 db=french1 bucardo add db f2 db=french2 bucardo add db f3 db=french3 bucardo add db f4 db=french4 bucardo add all tables herd=fherd bucardo add sync wildstar herd=fherd dbs=f1=source,f2=source,f3=target,f4=target
Before starting it up, I usually raise the debug level, as it gives a much clearer picture of what is going on in the logs. It does make the logs a lot more crowded, so it is not recommended for production use:
echo log_level=DEBUG >> ~/.bucardorc
Next, we start Bucardo up and make sure everything is working as it should. Scanning the log.bucardo file that is generated is a great way to do this:
bucardo start sleep 3 tail log.bucardo
If all goes well, you should see something very similar to this in the last lines of your log.bucardo file:
(972) [Sat Sep 3 16:18:54 2011] KID Total time for sync "wildstar" (0 rows): 0.05 seconds (966) [Sat Sep 3 16:18:55 2011] CTL Got NOTICE ctl_syncdone_wildstar from 973 (line 1624) (966) [Sat Sep 3 16:18:55 2011] CTL Kid 973 has reported that sync wildstar is done (966) [Sat Sep 3 16:18:55 2011] CTL Sending NOTIFY "syncdone_wildstar" (line 1709) (954) [Sat Sep 3 16:18:55 2011] MCP Got NOTICE syncdone_wildstar from 967 (line 749) (954) [Sat Sep 3 16:18:55 2011] MCP Sync wildstar has finished (954) [Sat Sep 3 16:18:55 2011] MCP Sending NOTIFY "syncdone_wildstar" (line 812) (954) [Sat Sep 3 16:18:56 2011] MCP Got NOTICE syncdone_wildstar from 957 (Bucardo DB) (line 749)
From the above, we see that a KID finished running the sync we created, without finding any changed rows to replicate. Then there is some chatter between the different Bucardo processes. Now to test out the customname feature. We'll rename one of the tables, tell Bucardo about the change, reload the sync, and verify that all is still being replicated.
psql french3 -c 'ALTER TABLE regions RENAME TO tesla' bucardo add customname regions tesla db=f3 bucardo reload wildstar
psql french3 -c 'truncate table tesla cascade' TRUNCATE psql french3 -t -c 'select count(*) from tesla' 0 psql french1 -c 'update regions set name=name' UPDATE 26 psql french3 -t -c 'select count(*) from tesla' 26
In the above, the update on the regions table inthe french1 database calls a trigger that notifies Bucardo that some rows have changed; Bucardo then has a KID copy the rows from the source databases french1 to the other source database french2, as well as the targets french3 and french4. The final internal DELETE and COPY that it performs is done on database french3 to the tesla table rather than the regions table.
The customname feature cannot be used to change the tables in a source database, as they must all be the same (for obvious reasons). We can, however, specify that a different schema be used for a target, as well as a different table. This only applies to Postgres targets, as other database types (e.g. MySQL) do not use schemas. Let's see that in action:
psql french4 -c 'create schema banana' psql french4 -c 'alter table regions set schema banana' psql french4 -c 'truncate table banana.regions cascade' bucardo add customname regions banana.regions db=f4 bucardo reload wildstar
psql french4 -t -c 'select count(*) from banana.regions' 0 psql french2 -c 'update regions set name=name' UPDATE 26 psql french4 -t -c 'select count(*) from banana.regions' 26
As before, the update on a source causes the changes to propagate to the other source database, as well as both targets. Note that the ALTER TABLE also mutated the associated sequence for the table, so there will be warnings in Bucardo's logs about the DEFAULT values for the primary keys in the regions' tables being different. Since this post is getting long, I will save the discussion of customcols for another day.
PostgreSQL log analysis / PGSI
End Point recently started working with a new client (a startup in stealth mode, cannot name names, etc.) who is using PostgreSQL because of the great success some of the people starting the company have had with Postgres in previous companies. One of the things we recommend to our clients is a regular look at the database to see where the bottlenecks are. A good way to do this is by analyzing the logs. The two main tools for doing so are PGSI (Postgres System Impact) and pgfouine. We prefer PGSI for a few reasons: the output is better, it considers more factors, and it does not require you to munge your log_line_prefix setting quite as badly.
Both programs work basically the same: given a large number of log lines from Postgres, normalize the queries, see how long they took, and produce some pretty output.If you only want to look at the longest queries, it's usually enough to set your log_min_duration_statement to something sane (such as 200), and then run a daily tail_n_mail job against it. This is what we are doing with this client, and it sends a daily report that looks like this:
Date: Mon Aug 29 11:22:33 2011 UTC Host: acme-postgres-1 Minimum duration: 2000 ms Matches from /var/log/pg_log/postgres-2011-08-29.log: 7 [1] (from line 227) 2011-08-29 08:36:50 UTC postgres@maindb [25198] LOG: duration: 276945.482 ms statement: COPY public.sales (id, name, region, item, quantity) TO stdout; [2] (from line 729) 2011-08-29 21:29:18 UTC tony@quadrant [17176] LOG: duration: 8229.237 ms execute dbdpg_p29855_1: SELECT id, singer, track FROM album JOIN artist ON artist.id = album.singer WHERE id < 1000 AND track <> 1
However, the PGSI program was born of the need to look at all the queries in the database, not just the slowest-running ones; the cumulative effect of many short queries can have much more of an impact on the server than a smaller number of long-running queries. Thus, PGSI looks not only at how long a query takes to run, but how many times it has run in a certain period, as well as how often it runs. All of this information is put together to give a score to each normalized query, known as the "system impact". Like the costs on a Postgres explain plan, this is a unit-less number and of little importance in and of itself - the important thing is to compare it to the other queries to see the relative impact. We also have that report emailed out, it looks similar to this (this is a text version of the HTML produced):
Log file: /var/log/pg_log/postgres-2011-08-29.log * SELECT (24) * UPDATE (1) Query System Impact : SELECT Log activity from 2011-08-29 11:00:01 to 2011-08-29 11:15:01 +----------------------------------+ | System Impact: | 0.15 | | Mean Duration: | 1230.95 ms | | Median Duration: | 1224.70 ms | | Total Count: | 411 | | Mean Interval: | 4195 seconds | | Std. Deviation: | 126.01 ms | +---------------------------------+ SELECT * FROM albums WHERE track <> ? AND artist = ? ORDER BY artist, track
At this point you may be wondering how we get all the queries into the log. This is done by setting log_min_duration_statement to 0. However, most (but not all!) clients do not want full logging 24 hours a day, as this creates some very large log files. So the solution we use is to analyze a slice of the day, only. It depends on the client, but we try for about 15 minutes during a busy time of day. Thus, the sequence of events is:
- Turn on "full logging" by dropping log_min_duration_statement to zero
- Some time later, set log_min_duration_statement back to what it was (e.g. 200)
- Extract the logs from the time it was set to zero to when it was flipped back.
- Run PGSI against the log subsection we pulled out
- Mail the results out
All of this is run by cron. The first problem is how to update the postgresql.conf file and have Postgres re-read it, all automatically. As covered previously, we use the modify_postgres.pl script for this.
The exact incantation looks like this:
0 11 * * * perl bin/modify_postgres_conf --quiet \ --pgconf /etc/postgresql/9.0/main/postgresql.conf \ --change log_min_duration_statement=0 15 11 * * * perl bin/modify_postgres_conf --quiet \ --pgconf /etc/postgresql/9.0/main/postgresql.conf \ --change log_min_duration_statement=200 --no-comment ## The above are both one line each, but split for readability here
This changes log_min_duration_statement to 0 at 11AM, and then back to 200 15 minutes later. We use the --quiet argument as this is run from cron so we don't want any output from modify_postgres_conf on success. We do want a comment when we flip it to 0, as this is the temporary state and we want people viewing the postgresql.conf file at that time to realize it (or someone just doing a "git diff"). We don't want a comment when we flip it back, as the timestamp in the comment would cause git to think the file had changed.
Now for the tricky bit: extracting out just the section of logs that we want and sending it to PGSI. Here's the recipe I came up with for this client:
16 11 * * * tac `ls -1rt /var/log/pg_log/postgres*log \ | tail -1` \ | sed -n '/statement" changed to "200"/,/statement" changed to "0"/ p' \ | tac \ | bin/pgsi.pl --quiet > tmp/pgsi.html && bin/send_pgsi.pl ## Again, the above is all one line
What does this do? First, it finds the latest file in the /var/log/pg_log directory that starts with 'postgres' and ends with 'log'. Then it uses the tac program to spool the file backwards, one line at a time ('tac' is the opposite of 'cat'). Then it pipes that output to the sed program, which prints out all lines starting with the one where we changed the log_min_duration_statement to 200, and ending with the one where we changed it to 0 (the reverse of what we actually did, as we are reading it backwards). Finally, we use tac again to put the lines back in the correct order, pipe the output to pgsi, write the output to a temporary file, and then call a quick Perl script named send_pgsi.pl which mails the temporary HTML file to some interested parties.
Why do we use tac? Because we want to read the file backwards, so as to make sure we get the correct slice of log files as delimited by the log_min_duration_statement changes. If we simply started at the beginning of the file, we might encounter other similar changes that were made earlier and not by us.
All of this is not foolproof, of course, but it does not have to be, as it is very easy to run manually is something (for example the sed recipe) goes wrong, as the log file will still be there. Yes, it's also possible to grab the ranges in other ways (such as perl), but I find sed the quickest and easiest. As tempting as it was to write Yet Another Perl Script to extract the lines, sometimes a few chained Unix programs can do the job quite nicely.
Changing postgresql.conf from a script
The modify_postgres_conf script for Postgres allows you to change your postgresql.conf file from the command line, via a cron job, or any time when you want to automate the process.
Postgres runs as a background daemon. The configuration parameters it runs with are stored in a file named postgresql.conf. To change the behavior of Postgres, one must usually edit this file, and then tell Postgres that you have made the changes. Sometimes all that is needed is to 'HUP' or reload Postgres. Most changes fall into this category. Other changes require a full restart of Postgres, which entails disconnecting all current clients.
Thus, to make a change, one must edit the file, find the item to change (the file consists of "name = value" lines), change it, then send a signal to the main Postgres process so it picks up the change. Finally, you should then connect to Postgres to make sure it is still running and has accepted the latest change.
Doing this automatically (such as via a cron script) is very difficult. One method, if you are doing something simple like toggling between two known configuration files, is to simply store copies of both files and replace them, like this example cronjob:
30 10 * * * cp -f conf/postgresql.conf.1 /etc/postgresql.conf; /etc/init.d/postgresql reload 50 10 * * * cp -f conf/postgresql.conf.2 /etc/postgresql.conf; /etc/init.d/postgresql reload
The major problem with that approach, as I quickly learned when I tried it, is that despite nobody making changes to the postgresql.conf file in *years*, a few days after I put the above change in place, someone decided to edit postgresql.conf. At 10:30AM the next day, their changes were blown away. A better way is to simply write a program to make the change for you. Thus, the modify_postgres_conf.pl script.
The basic usage is to tell the script where the conf file is, and list what changes you want to make. Here's an example that will change the random_page_cost to 2 on a Debian system:
./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
Here is exactly what the script does for the above statement:
- For each item to be changed, we:
- Ask the database what the current value is (and die if that parameter does not exist)
- If the current and new value are the same, do nothing
- Otherwise, open (and flock) the configuration file and change the parameter
- If no changes were made, exit
- Otherwise, close the configuration file
- Figure out the Postgres PID and send it a HUP signal
- Reconnect to the database and confirm each change has taken effect
By default, it adds a comment after the changed value as well, to help in tracking down who made the change. A diff of the postgresql.conf file after running the example above produces:
diff -r1.1 postgresql.conf 499c499 < random_page_cost = 4 --- > random_page_cost = 2 ## changed by modify_postgres_conf.pl on Wed Aug 10 13:31:34 2011
The addition of the comment can be stopped by added a --no-comment argument. If the script runs successfully, it also returns two items of information: the size and name of the current Postgres log file. This is useful so you can know exactly where in the log this change took place. Note that this only works for items that are already explicitly set in your configuration file. However, as discussed before, you should already have all the items that you may possibly change explicitly listed out at the bottom of the file already. Whitespace is preserved as well, for those (like me) who like to keep things lined up neatly inside the file (see examples in the link above).
Here are some more examples of the script in action:
$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2 114991 /var/log/postgres/postgres-2011-08-10.log $ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2 No change made: value of "random_page_cost" is already 2 $ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \ > --change random_page_cost=2 \ > --change log_statement=ddl \ > --change log_min_duration_statement=100 No change made: value of "random_page_cost" is already 2 118459 /var/log/postgres/postgres-2011-08-10.log $ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \ > --change default_statitics_target=200 --no-comment There is no Postgres variable named "default_statitics_target"! $ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \ > --change default_statistics_target=200 --no-comment 123396 /var/log/postgres/postgres-2011-08-10.log
Note that we make no attempt to automatically check changes in to version control: as you will see in an upcoming blog post on a real-life use case, such a checkin is usually not wanted, as we are making temporary changes.
This is a fairly simple Perl script, but I thought I would put it out there in the hopes of helping others out (and preventing the reinventing of wheels). Of course, if you find a bug or want to write a patch for it, those are welcome additions at any time! The code can be found on github:
git clone git://git@github.com:bucardo/modify_postgres_config.git
Announcing pg_blockinfo!
I'm pleased to announce the initial release of pg_blockinfo. It is a tool to examine your PostgreSQL heap data files, written in Perl.
Similar in purpose to pg_filedump, it is used to display (and soon validate) buffer-page-level information for PostgreSQL page/heap files.
pg_blockinfo aims to work in a portable and non-destructive way, using read-only "mmap", sys-level IO functions, and "unpack" in order to minimize any Perl overhead.
What we buy for the compromise of writing this in Perl instead of C is two-fold:
- portability/low impact — pg_blockinfo has no other dependencies than Perl and several core Perl modules and will work in environments where you can't or won't easily install other packages or compile based on specific headers.
- expressibility — while not currently supported in full, one of pg_blockinfo's future goals is to allow you to specify criteria for display of both page-level and tuple-level info. It will allow you to define arbitrary Perl expressions to filter the objects you're looking at (i.e., pages, tuples, etc; think "grep" but on a tuple level). It will support a DSL for querying based off of the named fields as well as allow you to supply arbitrary Perl for scanning for any criteria.
Requirements
We require a perl version with PerlIO ":mmap" support, which basically means any perl >= 5.8. We do not require any non-core perl modules; currently we only use Data::Dumper and Getopt::Long for debugging and option parsing respectively, the former only when requested.
Getting pg_blockinfo
The canonical git repo for development for pg_blockinfo is located at github: http://github.com/machack666/pg_blockinfo/
.For the development repo, simply run:
$ git clone git://github.com/machack666/pg_blockinfo.git
Or you can just grab the current script directly here.
Using pg_blockinfo
To get help including available options, canonical and alternate/abbreviated names of recognized fields, range syntax:
$ pg_blockinfo -h
To dump all fields for all page headers for all pages in a relation:
$ pg_blockinfo /path/to/relfile
To include only specific fields in the output you can specify multiple -f options and/or include multiple options per -f argument by comma delimiting. Field specifiers are processed in order, so only the final logical set will be included.
"all" is a special shorthand type which will expand to all known columns. pg_blockinfo may support other shorthand groups in the future. When no fields are provided explicitly, "all" is implicitly assumed.
There are both positive and negative field inclusions. An example of a positive inclusion is:
$ pg_blockinfo /path/to/relfile -f prune_xid,tli
This will display only the indicated fields in question for all blocks in relfile. To include all fields *except* certain ones, prefix their name with a '-' sign:
$ pg_blockinfo -f -pagesize_version /path/to/relfile
This will display all page header fields in all blocks with the exception of the pagesize_version header field.
One consequence of the way these field display options are designed (particularly going forward with additional field/tuple display options) that you could define a "view" of the column data using a shell alias, then add/remove columns/criteria by passing additional -f options to it:
# using this as a shorthand to display just those fields $ alias lsn='pg_blockinfo -f lsn_seq,lsn_off,tli' $ lsn -f -tli /path/to/foo # remove fields from the display $ lsn -f prune_xid /path/to/foo # or add to the list as well
Similar functionality is available for selecting the specific blocks available using the range option (-r or -b), which lets you specify a range of blocks to look at instead of the entire file.
$ pg_blockinfo -r 2-49 /path/to/relfile $ pg_blockinfo -r -100 /path/to/relfile $ pg_blockinfo -r 2,4,120-140,0xFF-0x1100 /path/to/relfile
Range options can be provided multiple times, each with one or more comma-delimited block-range specifications. Blocks are numbered from 0, can be provided in decimal or hexadecimal (when prefixed via 0x), and can appear singly or in a range (unbounded or unbounded) when separated by a hyphen.
Planned future features/TODO
In no particular order:
- dump tuples/tuple headers.
- better output/interpretation of bitflags.
- support alternate structures to allow detection/specification of different target versions of the page/tuple headers.
- allow querying/filtering pages/tuples.
- validation/sanity checking of various pages.
- actual extraction of ranges in the heap file.
- extract/dump tuples by raw ctid.
- allow arbitrary expressions to define powerful filtering options when querying/displaying information about the tuples/data files.
- detections of invalid toast tuple pointers/corrupted lz_compressed data (will require connection to theactive system catalog).
- detect relfile type?
- mvcc queries against tuples at a given arbitrarily-constructed snapshot
- detect xids that are invalid (i.e. map to non-existent pages in the pg_clog directory).
- better/shorter name?
I look forward to any feedback, patches, or other improvements/interest.
DBD::Pg UTF-8 for PostgreSQL server_encoding
We are preparing to make a major version bump in DBD::Pg, the Perl interface for PostgreSQL, from the 2.x series to 3.x. This is due to a reworking of how we handle UTF-8. The change is not going to be backwards compatible, but will probably not affect many people. If you are using the pg_enable_utf8 flag, however, you definitely need to read on for the details.
The short version is that DBD::Pg is going return all strings from the Postgres server with the Perl utf8 flag on. The sole exception will be databases in which the server_encoding is SQL_ASCII, in which case the flag will never be turned on.
For backwards compatibility and fine-tuning control, there is a new attribute called pg_utf8_strings that can be set at connection time to override the decision above. For example, if you need your connection to return byte-soup, non-utf8-marked strings, despite coming from a UTF-8 Postgres database, you can say:
my $dsn = 'dbi:Pg:dbname=foobar';
my $dbh = DBI->connect($dsn, $dbuser, $dbpass,
{ AutoCommit => 0,
RaiseError => 0,
PrintError => 0,
pg_utf8_strings => 0,
}
);
Similarly, you can set pg_utf8_strings to 1 and it will force settings returned strings as utf8, even if the backend is SQL_ASCII. You should not be using SQL_ASCII of course, and certainly not forcing the strings returned from it to UTF-8. :)
All Perl variables (be they strings or otherwise) are actually Perl objects, with some internal attributes defined on them. One of those is the utf8 flag, which can be flipped on to indicate that the string should be treated as possibly containing multi-byte characters, or it can be left off, to indicate the string should always be treated on a byte-by-byte basis. This will affect things like the Perl length function, and the Perl \w regex flag. This is completely unrelated to the Perl pragma use utf8, which DBD::Pg has nothing at all to do with. Have I mentioned that UTF-8, and UTF-8 in Perl in particular, can be quite confusing?
There are a few exceptions as to what things DBD::Pg will mark as utf8. Integers and other numbers will not, boolean values will not, and no bytea data will ever have the flag set. When in doubt, assume that it is set.
The old attribute, pg_enable_utf8, will be deprecated, and have no effect. We thought about re-using that but it seemed clearer and cleaner to simply create a new variable (pg_utf8_strings), as the behavior has significantly changed.
A beta version of DBD::Pg (2.99.9_1) with these changes has been uploaded to CPAN for anyone to experiment with. Right now, none of this is set in stone, but we did want to get a working version out there to start the discussion and see how it interacts with applications that were making use of the pg_enable_utf8 flag. You can web search for "dbdpg" and look for the "Latest Dev. Release", or jump straight to the page for DBD::Pg 2.99.9_1. The trailing underscore is a CPAN convention that indicates this is a development version only, and thus will not replace the latest production version (2.18.1 as of this writing).
As a reminder, DBD::Pg has switched to using git, so you can follow along with the development with:
git clone git://bucardo.org/dbdpg.git
There is also a commits mailing list you can join to receive notifications of commits as they are pushed to the main repo. To sign up, send an email to dbd-pg-changes-subscribe@perl.org.
MongoDB replication from Postgres using Bucardo
One of the features of the upcoming version of Bucardo (a replication system for the PostgreSQL RDBMS) is the ability to replicate data to things other than PostgreSQL databases. One of those new targets is MongoDB, a non-relational 'document-based' database. (to be clear, we can only use MongoDB as a target, not as a source)
To see this in action, let's setup a quick example, modified from the earlier blog post on running Bucardo 5. We will create a Bucardo instance that replicates from two Postgres master databases to a Postgres database target and a MongoDB instance target. We will start by setting up the prerequisites:
sudo aptitude install postgresql-server \ perl-DBIx-Safe \ perl-DBD-Pg \ postgresql-contrib
Getting Postgres up and running is left as an exercise to the reader. If you have problems, the friendly folks at #postgresql on irc.freenode.net will be able to help you out.
Now for the MongoDB parts. First, we need the server itself. Your distro may have it already available, in which case it's as simple as:
aptitude install mongodb
For more installation information, follow the links from the MongoDB Quickstart page. For my test box, I ended up installing from source by following the directions at the Building for Linux page.
Once MongoDB is installed, we will need to start it up. First, create a place for MongoDB to store its data, and then launch the mongodb process:
$ mkdir /tmp/mongodata $ mongod --dbpath=/tmp/mongodata --fork --logpath=/tmp/mongo.log all output going to: /tmp/mongo.log forked process: 428
You can perform a quick test that it is working by invoking the command-line shell for MongoDB (named "mongo" of course) Use quit() to exit:
$ mongo MongoDB shell version: 1.8.1 Fri Jun 10 12:45:00 connecting to: test > quit() $
The other piece we need is a Perl driver so that Bucardo (which is written in Perl) can talk to the MongoDB server. Luckily, there is an excellent one available on CPAN named 'MongoDB'. We started the MongoDB server before doing this step because the driver we will install needs a running MongoDB instance to pass all of its tests. The module has very good documentation available on its CPAN page. Installation may be as easy as:
$ sudo cpan MongoDB
If that did not work for you (case matters!), there are more detailed directions on the Perl Language Center page.
Our next step is to grab the latest Bucardo, install it, and create a new Bucardo instance. See the previous blog post for more details about each step.
$ git clone git://bucardo.org/bucardo.git Initialized empty Git repository... $ cd bucardo $ perl Makefile.PL Checking if your kit is complete... Looks good Writing Makefile for Bucardo $ make cp bucardo.schema blib/share/bucardo.schema cp Bucardo.pm blib/lib/Bucardo.pm cp bucardo blib/script/bucardo /usr/bin/perl -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/bucardo Manifying blib/man1/bucardo.1pm Manifying blib/man3/Bucardo.3pm $ sudo make install Installing /usr/local/lib/perl5/site_perl/5.10.0/Bucardo.pm Installing /usr/local/share/bucardo/bucardo.schema Installing /usr/local/bin/bucardo Installing /usr/local/share/man/man1/bucardo.1pm Installing /usr/local/share/man/man3/Bucardo.3pm Appending installation info to /usr/lib/perl5/5.10.0/i386-linux-thread-multi/perllocal.pod $ sudo mkdir /var/run/bucardo $ sudo chown $USER /var/run/bucardo $ bucardo install This will install the bucardo database into an existing Postgres cluster. ... Installation is now complete.
Now we create some test databases and populate with pgbench:
$ psql -c 'create database btest1' CREATE DATABASE $ pgbench -i btest1 NOTICE: table "pgbench_branches" does not exist, skipping ... creating tables... 10000 tuples done. 20000 tuples done. ... 100000 tuples done. $ psql -c 'create database btest2 template btest1' CREATE DATABASE $ psql -c 'create database btest3 template btest1' CREATE DATABASE $ psql btest3 -c 'truncate table pgbench_accounts' TRUNCATE TABLE $ bucardo add db t1 dbname=btest1 Added database "t1" $ bucardo add db t2 dbname=btest2 Added database "t2" $ bucardo add db t3 dbname=btest3 Added database "t3" $ bucardo list dbs Database: t1 Status: active Conn: psql -p 5432 -U bucardo -d btest1 Database: t2 Status: active Conn: psql -p 5432 -U bucardo -d btest2 Database: t3 Status: active Conn: psql -p 5432 -U bucardo -d btest3 $ bucardo add tables pgbench_accounts pgbench_branches pgbench_tellers herd=therd Created herd "therd" Added table "public.pgbench_accounts" Added table "public.pgbench_branches" Added table "public.pgbench_tellers" $ bucardo list tables Table: public.pgbench_accounts DB: t1 PK: aid (int4) Table: public.pgbench_branches DB: t1 PK: bid (int4) Table: public.pgbench_tellers DB: t1 PK: tid (int4)
The next step is to add in our MongoDB instance. The syntax is the same as the "add db" above, but we also tell it the type of database, as it is not the default of "postgres". We will also assign an arbitrary database name, "btest1", the same as the others. Everything else (such as the port and host) is default, so all we need to say is:
$ bucardo add db m1 dbname=btest1 type=mongo Added database "m1" $ bucardo list dbs Database: m1 Type: mongo Status: active Database: t1 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest1 Database: t2 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest2 Database: t3 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest3
Next we group our databases together and assign them roles:
$ bucardo add dbgroup tgroup t1:source t2:source t3:target m1:target Created database group "tgroup" Added database "t1" to group "tgroup" as source Added database "t2" to group "tgroup" as source Added database "t3" to group "tgroup" as target Added database "m1" to group "tgroup" as target
Note that "target" is the default action, so we could shorten that to:
$ bucardo add dbgroup tgroup t1:source t2 t3 m1
However, I think it is best to be explicit, even if it does (incorrectly) hint that m1 could be anything *other* than a target. :)
We are almost ready to go. The final step is to create a sync (a basic replication event in Bucardo), then we can start up Bucardo, put some test data into the master databases, and 'kick' the sync:
$ bucardo add sync mongotest herd=therd dbs=tgroup ping=false Added sync "mongotest" $ bucardo start Checking for existing processes Starting Bucardo $ pgbench -t 10000 btest1 starting vacuum...end. transaction type: TPC-B (sort of) number of transactions actually processed: 10000/10000 ... tps = 503.300595 (excluding connections establishing) $ pgbench -t 10000 btest2 number of transactions actually processed: 10000/10000 ... tps = 408.059368 (excluding connections establishing) $ bucardo kick mongotest
We'll give it a few seconds to replicate those changes (it took 18 seconds on my test box), and then check the output of bucardo status:
$ bucardo status PID of Bucardo MCP: 3317 Name State Last good Time Last I/D/C Last bad Time ===========+========+============+=======+=============+===========+======= mongotest | Good | 21:57:47 | 11s | 6/36234/898 | none |
Looks good, but what about the data in MongoDB? Let's get some counts from the Postgres masters and slave, and then look at the data inside MongoDB with the mongo command-line client:
$ psql btest1 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$ psql btest2 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$ psql btest3 -c 'SELECT count(*) FROM pgbench_accounts'
18106
$ psql btest1 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ psql btest2 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ psql btest3 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:46:00
connecting to: btest1
> show collections
bucardo_status
pgbench_accounts
pgbench_branches
pgbench_tellers
system.indexes
> db.pgbench_accounts.count()
18106
> db.pgbench_accounts.find().sort({abalance:1}).limit(1).next()
{
"_id" : ObjectId("4df39bcb8795839660001de5"),
"abalance" : -12071,
"aid" : 84733,
"bid" : 1,
"filler" : " "
}
> db.pgbench_accounts.find().sort({abalance:-1}).limit(1).next()
{
"_id" : ObjectId("4df39bd08795839660002fb0"),
"abalance" : 13010,
"aid" : 45500,
"bid" : 1,
"filler" : " "
}
Why the difference in counts? We only started replicating after we populated the Postgres tables on the master databases with 100,000 rows, so the eighteen thousand is the number of rows that was changed during the subsequent pgbench run. (Note that pgbench uses randomness, so your numbers will be different than the above). In the future Bucardo will support the "onetimecopy" feature for MongoDB, but until then we can fully populate the pgbench_accounts collection simply by "touching' all the records on one of the masters:
$ psql btest1 -c 'UPDATE pgbench_accounts SET aid=aid' UPDATE 100000 $ bucardo kick mongotest Kicked sync mongotest $ echo 'db.pgbench_accounts.count()' | mongo btest1 MongoDB shell version: 1.8.1 Fri Jun 10 12:47:00 connecting to: btest1 > 100000 > bye
A nice feature of MongoDB is its autovivification ability (aka dynamic schemas), which means unlike Postgres you do not have to create your tables first, but can simply ask MongoDB to do an insert, and it will create the table (or, in mongospeak, the collection) automatically for you.
Because MongoDB has no concept of transactions, and because Bucardo does not update, but does deletes plus inserts (for reasons I'll not get into today), there is one more trick Bucardo does when replicating to a MongoDB instance. A collection named 'bucardo_status' is created and updated at the start and the end of a sync (a replication event). Thus, your application can pause if it sees this table has a 'started' value, and wait until it sees 'complete' or 'failed'. Not foolproof by any means, but better than nothing :) You should, of course, carefully consider the way your app and Bucardo will coordinate things.
Feedback from Postgres or MongoDB folk is much appreciated: there are probably some rough edges, but as you can see from above, the basics are there are working. Feel free to email the bucardo-general mailing list or make a feature request / bug report on the Bucardo Bugzilla page.
Bucardo multi-master for PostgreSQL
The next version of Bucardo, a replication system for Postgres, is almost complete. The scope of the changes required a major version bump, so this Bucardo will start at version 5.0.0. Much of the innards was rewritten, with the following goals:
Multi-master support
Where "multi" means "as many as you want"! There are no more pushdelta (master to slaves) or swap (master to master) syncs: there is simply one sync where you tell it which databases to use, and what role they play. See examples below.
Ease of use
The bucardo program (previously known as 'bucardo_ctl') has been greatly improved, making all the administrative tasks such as adding tables, creating syncs, etc. much easier.
Performance
Much of the underlying architecture was improved, and sometimes rewritten, to make things go much faster. Most striking is the difference between the old multi-master "swap syncs" and the new method, which has been described as "orders of magnitudes" faster by early testers. We use async database calls whenever possible, and no longer have the bottleneck of a single large bucardo_delta table.
Improved logging
Not only are more details provided, there is now the ability to control how verbose the logs are. Just set the log_level parameter to terse, normal, verbose, or debug. Those who had busy systems, which was the equivalent of a 'debug' firehose, will really appreciate this.
Different targets
Who says your slave (target) databases need to be Postgres? In addition to the ability to write text SQL files (for say, shipping to a different system), you can have Bucardo push to other systems as well. Stay tuned for more details on this. (Update: there is a blog post about using MongoDB as a target)
This new version is not quite at beta yet, but you can try out a demo of multi-master on Postgres quie easily. Let's see if we can do it in ten steps.
I. Download all prerequisites
To run Bucardo, you will need a Postgres database (obviously), the DBIx::Safe module, the DBI and DBD::Pg modules, and (for the purposes of this demo) the pgbench utility. Systems vary, but on aptitude-based systems, one can grab all of the above like this:
aptitude install postgresql-server \ perl-DBIx-Safe \ perl-DBD-Pg \ postgresql-contrib
II. Grab the latest Bucardo
git clone git://bucardo.org/bucardo.git
III. Install the program
cd bucardo perl Makefile.PL make sudo make install
You can ignore any errors that come up about ExtUtils::MakeMaker not being recent.
IV. Setup an instance of Bucardo
This step assumes there is a running Postgres available to connect to.
sudo mkdir /var/run/bucardo sudo chown $USER /var/run/bucardo bucardo install
V. Use the pgbench program to create some test tables
psql -c 'CREATE DATABASE btest1' pgbench -i btest1 psql -c 'CREATE DATABASE btest2 TEMPLATE btest1' psql -c 'CREATE DATABASE btest3 TEMPLATE btest1' psql -c 'CREATE DATABASE btest4 TEMPLATE btest1' psql -c 'CREATE DATABASE btest5 TEMPLATE btest1'
VI. Tell Bucardo about the databases and tables you are going to use
bucardo add db t1 dbname=btest1 bucardo add db t2 dbname=btest2 bucardo add db t3 dbname=btest3 bucardo add db t4 dbname=btest4 bucardo add db t5 dbname=btest5 bucardo list dbs bucardo add table pgbench_accounts pgbench_branches pgbench_tellers herd=therd bucardo list tables
A herd is simply a logical grouping of tables. We did not add the other pgbench table, pgbench_history, because it has no primary key or unique index.
VII. Group the databases together and set their roles
bucardo add dbgroup tgroup t1:source t2:source t3:source t4:source t5:target
We've grouped all five databases together, and made four of them masters (aka source), and one of them a slave (aka target). You can any combination of master and slaves you want, as long as there is at least one master.
VII. Create the Bucardo sync
bucardo add sync foobar herd=therd dbs=tgroup ping=false
Here we simply create a new sync, which is a controllable replication event, telling it which tables we want to replicate, and which databases we are going to use. We also set ping to false, which means that we will not create triggers to automatically fire off replication on any changes, but will do it manually. In a real world scenario, you generally do want those triggers, or want to set Bucardo to check periodically.
VIII. Start up Bucardo
bucardo start
If all went well, you should see some information in the log.bucardo file in the current directory.
IX. Make a bunch of changes on all the source databases.
pgbench -t 10000 btest1 pgbench -t 10000 btest2 pgbench -t 10000 btest3 pgbench -t 10000 btest4
Here, we've told pgbench to run ten thousand transactions against each of the first four databases. Triggers on these tables have captured the changes.
X. Kick off the sync and watch the fun.
bucardo kick foobar
You can now tail the log.bucardo file to see the fun, or simply run:
bucardo status
...to see what it is doing, and the final counts when we are done. Don't forget to stop Bucardo when you are done testing:
bucardo stop
The output of bucardo status, after the sync has completed, should look like this:
bucardo status Name State Last good Time Last I/D/C Last bad Time ========+========+============+=======+====================+===========+======= foobar | Good | 17:58:37 | 3m2s | 131836/131836/4785 | none |
Here we see that this syncs has never failed ("Last bad"), the time of day of the last good run, how long ago it was from right now (3 minutes and 2 seconds), as well as details of the last successful run. Last I/D/C stands for number of inserts, deletes, and collisions across all databases for this syncs. This is just an overview of all syncs at a high level, but we can also give status an argument of a sync name to see more details like so:
bucardo status foobar Last good : Jun 02, 2011 17:57:47 (time to run: 42s) Rows deleted/inserted/conflicts : 131,836 / 131,836 / 4,785 Sync name : foobar Current state : Good Source herd/database : therd / t1 Tables in sync : 3 Status : active Check time : none Overdue time : 00:00:00 Expired time : 00:00:00 Stayalive/Kidsalive : yes / yes Rebuild index : 0 Ping : no Onetimecopy : 0 Post-copy analyze : Yes Last error: :
This gives us a little more information about the sync itself, as well as another important metric, how long the sync itself took to run, in this case, 42 seconds. That particular metric might make its way back to the overall "status" view above. Try things out and help us find bugs and improve Bucardo!
Postgres Bug Tracking - Help Wanted!
Once again there is talk in the Postgres community about adopting the use of a bug tracker. The latest thread, on pgsql-hackers, was started by someone asking about the status of their patch. Or rather, asking an even better meta-question about how one finds out the status of a PostgreSQL bug report or patch. Sadly, the answer is that there is no standard way, other than sending emails until someone replies one way or another. The current process works something like this:
- Someone finds a bug
- They send an email to pgsql-bugs@postgresql.org OR they use the web form, which grabs a sequential number and mails the report to pgsql-bugs@postgresql.org. Nothing else is done/stored, it just sends the email.
- Someone replies about the bug OR nobody replies about the bug.
- After a fix is found, which may involve some emails on other mailing lists, someone replies that the bug is fixed on the original thread. Maybe.
As you can see, there is some room for improvement there. Some of the most major and glaring holes in the current system:
- No way to search previous / existing bugs
- No way to tell the status of a bug
- No way to categorize and group bugs (per version, per platform, per component, per severity, etc.)
- No way to know who is working on a bug
- No way to prevent things from slipping through the cracks
Luckily, the above problems have been solved for many many years now but a wide variety of bug tracking software. There have traditionally been three problems to getting a bug tracker working for the Postgres project:
Inertia
The current system is, in a very literal sense, "good enough", so it's hard to impose the inevitable short-term pain of a new system when there always seem to be more pressing matters to attend to.
Doesn't Make Julienne Fries
Everyone wants a different set of features, and getting all the hackers involved to agree on even a simple subset of desired features is pretty difficult. This is sort of similar to the crusade by myself and others to get git as the replacement version control system; there were some strong voices for competing systems (e.g. mercurial).
Who Will Put the Bell on the Cat?
Everyone talks about the problem, and there have even been some attempts over the years to implement some sort of system, but the problem remains that setting up such a system, getting it smoothly integrated into the project's work flow, and then maintaining said system is a non-trivial task. Especially when you can't be assured of buy-in from some of the major players.
I'm hopeful that the recent thread indicates a slight shift of late in global acceptance of the need for a bug tracking system. The question is, which one, and who is going to take the time to write something? I'm really hoping someone who has been lurking in the background will step up and help create something wonderful (okay, we can start with 'decent' :) Perhaps even someone with experience setting up bug tracking systems. Certainly Postgres must be one of the last major open source projects without a bug tracker; there is plenty of hard-won experience out there to be learned from. It would also be ideal if the person or persons was *not* a Postgres hacker of any sort, as taking the time to build and maintain this system would definitely take time away from their other hacking tasks. On the other hand, one could argue that a bug tracker is a vital piece of project infrastructure that is potentially as important as any other work that goes on. I certainly think so.
Only Try This At Home

Taken by Josh 6 years to the day before the release of 9.1 beta 1
For the record, 9.1 is gearing up to be an awesome release. I was tinkering and testing PostgreSQL 9.1 Beta 1 (... You are beta testing, too, right?) ... and some of the new PL/Python features caught my eye. These are minor among all the really cool high profile features, to be sure. But it made me think back to a little bit of experimental code written some time ago, and how these couple language additions could make a big difference.
For one reason or another I'd just hit the top level postgresql.org website, and suddenly realized just how many Postgres databases it took to put together what I was seeing on the screen. Not only does it power the content database that generated the page, of course, but even the lookup of the .org went through Afilias and their Postgres-backed domain service. It's a pity the DBMS couldn't act as the middle layer between those.
Or could it?
That's a shortened form of it just for demonstration purposes (the original one had things like a table browser) ... but it works. For example, on this test 9.1 install, hit http://localhost:8000/public/webtest and the following table appears:
| generate_series | lh | rnd |
|---|---|---|
| 1 | 0 | 0.548577250913 |
| 2 | 1 | 1.70926172473 |
| 3 | 1 | 1.24841631576 |
| (etc) | ... | ... |
Note the use of two specific 9.1 features, though. The plpy object contains nice query building helper utilities like quote_ident that you may be familiar with in other languages. But this also makes use of subtransactions, which helps recover from db errors. That's important here, as something like a typo in a table name will generate an error from Postgres and without that in place the database will end the transaction and ignore any subsequent commands the function tries to run.
But with that in place, the page shows the 404 error, and picks up where it left off with subsequent requests:
Error code 404. Message: Table not found.
By the way, if it's not clear by now don't take this anywhere near a production database, if not any other reason that a transaction will be held open as long as that function runs. That will hold back all the nice maintenance stuff that keeps things running efficiently. Still, I think it helps show off what just a handful of lines of code can do in a powerful language like PL/Python. I'm sure with the right module PL/PerlU could do something very similar. But even more I think it shows how Postgres is growing and innovating by leaps and bounds, seemingly every day!
NOTIFY vs Prepared Transactions in Postgres (the Bucardo solution)

We recently had a client use Bucardo to migrate their app from Postgres 8.2 to Postgres 9.0 with no downtime (which went great). They also were using Bucardo to replicate from the new 9.0 mater to a bunch of 9.0 slaves. This ran into problems the moment the application started, as we started seeing these messages in the logs:
ERROR: cannot PREPARE a transaction that has executed LISTEN, UNLISTEN or NOTIFY
The problem is that the Postgres LISTEN/NOTIFY system cannot be used with prepared transactions. Bucardo uses a trigger on the source tables that issues a NOTIFY to let the main Bucardo daemon know that something has changed and needs to be replicated. However, their application was issuing a PREPARE TRANSACTION as an occasional part of its work. Thus, they would update the table, which would fire the trigger, which would send the NOTIFY. Then the application would issue the PREPARE TRANSACTION which produced the error given above. Bucardo is setup to deal with this situation; rather than using notify triggers, the Bucardo daemon can be set to look for any changes at a set interval. The steps to change Bucardo's behavior for a given sync is simply:
$ bucardo_ctl update sync foobar ping=false checktime=15 $ bucardo_ctl validate foobar $ bucardo_ctl reload foobar
The first command tells the sync not to use notify triggers (these are actually statement-level triggers that simply issue a NOTIFY bucardo_kick_sync_foobar. It also sets a checktime of 15 seconds, which means that the Bucardo daemon will check for changes every 15 seconds - or as if the original notify trigger is firing every 15 seconds. The second command validates the sync but checking that all supporting tables, functions, triggers, etc. are installed and up to date. It also removes triggers that are no longer needed: in this case, the statement-level notify triggers for all tables in this sync. Finally, the third command simply tells the Bucardo daemon to stop the sync, load in the new changes, and restart it.
Another solution to the problem is to simply not use prepared transactions: very few applications actually need it, but I've noticed a few that use it anyway when they should not be. What exactly is a prepared transaction? It's the Postgres way of implementing two-part commit. Basically, this means that a transaction's state is stored away on disk, and can be committed or rolled back at a later time - even by a different session. This is handy if you need to ensure that, for example, you can atomically commit multiple database connections. By atomically, I mean that either they all commit or none of them do. This is done by doing work on each database, issuing a PREPARE TRANSACTION, and then, once all have been prepared, issuing the COMMIT TRANSACTION against each one.
As an aside, prepared transactions are often confused with prepared statements. While the use of prepared statements is very common, use of prepared transactions is very rare. Prepared statements are simply a way of planning a query one time, then re-running it multiple times without having to run the query through the planner each time. Many interfaces, such as DBD::Pg, will do this for you automatically behind the scenes. Sometimes using prepared statements can cause issues, but it is usually a win.
As mentioned above, the use of 2PC (two-phase commit) is very rare, which is why the default for the max_prepared_transactions variable was recently changed to 0, which effectively disallows the use of prepared transactions until you explicitly turning them on in your postgresql.conf file. This helps prevent people from accidentally issuing a PREPARE TRANSACTION and then leaving them around. This mistake is easy to do, for once you issue the command, everything goes back to normal and it's easy to forget about them. However, having them around is a bad thing, as they continue to hold locks, and can prevent vacuum from running.The check_postgres program even has a specific check for this situation:check_prepared_txns.
What does two-part commit look like? There are only three basic commands: PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED. Each takes a name, which is an arbitrary string 200 characters or less. Usage is to start a transaction, do some work, and then issue a PREPARE TRANSACTION instead of a COMMIT. At this point, all the work you have done is gone from your session and stored on disk. You cannot get back into this transaction: you can only commit it or roll it back. See the docs on PREPARE TRANSACTION for the full details.
Here's an example of two-part commit in action:
testdb=# BEGIN;
BEGIN
testdb=#* CREATE TABLE preptest(a int);
CREATE TABLE
testdb=#* INSERT INTO preptest VALUES (1),(2),(3);
INSERT 0 3
testdb=#* SELECT * FROM preptest;
a
---
1
2
3
(3 rows)
testdb=#* PREPARE TRANSACTION 'foobar';
PREPARE TRANSACTION
testdb=# SELECT * FROM preptest;
ERROR: relation "preptest" does not exist
LINE 1: SELECT * FROM preptest;
^
testdb=# COMMIT PREPARED 'foobar';
COMMIT PREPARED
testdb=# SELECT * FROM preptest;
a
---
1
2
3
(3 rows)
A contrived example, but you can see how easy it could be to issue a PREPARE TRANSACTION and not even realize that it actually sticks around forever!
MySQL Integer Size Attributes
MySQL has those curious size attributes you can apply to integer data types. For example, when creating a table, you might see:
mysql> CREATE TABLE foo (
-> field_ti tinyint(1),
-> field_si smallint(2),
-> field_int int(4),
-> field_bi bigint(5)
-> );
Query OK, 0 rows affected (0.05 sec)
mysql> desc foo;
+-----------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+-------+
| field_ti | tinyint(1) | YES | | NULL | |
| field_si | smallint(2) | YES | | NULL | |
| field_int | int(4) | YES | | NULL | |
| field_bi | bigint(5) | YES | | NULL | |
+-----------+-------------+------+-----+---------+-------+
3 rows in set (0.03 sec)
mysql>
I had always assumed those size attributes were limiters, MySQL's way of providing some sort of constraint on the integers allowed in the field. While doing some recent work for a MySQL client, I attempted to enforce the range of a tinyint according to that assumption. In reality, I only wanted a sign field, and would have liked to have applied a "CHECK field IN (-1,1)", but without check constraints I figured at least keeping obviously incorrect data out would be better than nothing.
I wanted to see what MySQL's behavior would be on data entry that failed the limiters. I was hoping for an error, but expecting truncation. What I discovered was neither.
mysql> INSERT INTO foo (field_ti) VALUES (-1); Query OK, 1 row affected (0.00 sec) mysql> SELECT field_ti FROM foo; +----------+ | field_ti | +----------+ | -1 | +----------+ 1 row in set (0.00 sec) mysql> INSERT INTO foo (field_ti) VALUES (1); Query OK, 1 row affected (0.00 sec) mysql> SELECT field_ti FROM foo; +----------+ | field_ti | +----------+ | -1 | | 1 | +----------+ 2 rows in set (0.00 sec) mysql> INSERT INTO foo (field_ti) VALUES (10); Query OK, 1 row affected (0.00 sec) mysql> SELECT field_ti FROM foo; +----------+ | field_ti | +----------+ | -1 | | 1 | | 10 | +----------+ 3 rows in set (0.00 sec) mysql> INSERT INTO foo (field_ti) VALUES (100); Query OK, 1 row affected (0.00 sec) mysql> SELECT field_ti FROM foo; +----------+ | field_ti | +----------+ | -1 | | 1 | | 10 | | 100 | +----------+ 4 rows in set (0.00 sec) mysql>
Two possible conclusions followed immediately: either the limiter feature was horribly broken, or those apparent sizes didn't represent a limiter feature. A full review of MySQL's Numeric Types documentation provided the answer:
MySQL supports an extension for optionally specifying the display width of integer data types in parentheses following the base keyword for the type. For example, INT(4) specifies an INT with a display width of four digits. This optional display width may be used by applications to display integer values having a width less than the width specified for the column by left-padding them with spaces. (That is, this width is present in the metadata returned with result sets. Whether it is used or not is up to the application.) The display width does not constrain the range of values that can be stored in the column.
And, so, the lesson is repeated: Beware assumptions.
Postgres query caching with DBIx::Cache
A few years back, I started working on a module named DBIx::Cache which would add a caching layer at the database driver level. The project that was driving it got put on hold indefinitely, so it's been on my long-term todo list to release what I did have to the public in the hope that someone else may find it useful. Hence, I've just released version 1.0.1 of DBIx::Cache. Consider it the closest thing Postgres has at the moment for query caching. :) The canonical webpage:
http://bucardo.org/wiki/DBIx-Cache
You can also grab it via git, either directly:
git clone git://bucardo.org/dbixcache.git/
or through the indispensable github:
https://github.com/bucardo/dbixcache
So, what does it do exactly? Well, the idea is that certain queries that are either repeated often and/or are very expensive to run should be cached somewhere, such that the database does not have to redo all the same work, just to return the same results over and over to the client application. Currently, the best you can hope for with Postgres is that things are in RAM from being run recently. DBIx::Cache changes this by caching the results somewhere else. The default destination is memcached.
DBIx::Cache acts as a transparent layer around your DBI calls. You can control which queries, or classes of queries get cached. Most of the basic DBI methods are overridden so that rather than query Postgres, they actually query memcached as needed (or other caching layer - could even query back into Postgres itself!). Let's look at a simple example:
use strict;
use warnings;
use Data::Dumper;
use DBIx::Cache;
use Cache::Memcached::Fast;
## Connect to an existing memcached server,
## and establish a default namespace
my $mc = Cache::Memcached::Fast->new(
{
servers => [ { address => 'localhost:11211' } ],
namespace => 'joy',
});
## Rather than DBI->connect, use DBIx->connect
## Tell it what to use as our caching source
## (the memcached server above)
my $dbh = DBIx::Cache->connect('', '', '',
{ RaiseError => 1,
dxc_cachehandle => $mc
});
## This is an expensive query, that takes 30 seconds to run:
my $SQL = 'SELECT * FROM analyze_sales_data()';
## Prepare this query
my $sth = $dbh->prepare($SQL);
## Run it ten times in a row.
## The first time takes 30 seconds, the other nine return instantly.
for (1..10) {
my $count = $sth->execute();
my $info = $sth->fetchall_arrayref({});
print Dumper $info;
}
In the above, the prepare($SQL) is actually calling the DBIx::Class::prepare method. This parses the query and tries to determine if it is cacheable or not, then stores that decision internally. Regardless of the result, it calls DBI::prepare (which is techincally DBD::Pg::prepare), and returns the result.The magic comes in the call to execute() later on. As you might imagine, this is also actually the DBIx::Class::execute() method. If the query is not cacheable, it simply runs it as normal and returns. If it is cacheable, and this is the first time it is run, DBIx::Class runs an EXPLAIN EXECUTE on the original statement, and parses out a list of all tables that are used in this query. Then it caches all of this information into memcached, so that subsequent runs using the same list of arguments to execute() don't need to do that work again.
Finally, we come to fetchall_arrayref(). The first time it is run, we simply call the parent methods and get the data back. Then we build unique keys and store the results of the query into memcached. Finally, we mark the execute() as fully cached. Thus, on subsequent calls to execute(), we don't actually execute anything on the database server, but simply return the count as stashed inside of memcached (in the case of execute, this is the number of affected rows). For the various fetch() methods, we do the same thing - rather than fetch things from the database (via DBI, DBD::Pg, and libpq), we get the results from memcached (frozen via Data::Dumper), and then unpack and return them. Since we don't actually need to do any work against the database, everything returns as fast as we can query memcached - which is in general very fast indeed.
Most of the above is working, but the piece that is not written is the cache invalidation. DBIx::Cache knows which tables go to which queries, so in theory you could have (for example), an UPDATE/INSERT/DELETE trigger on table X which calls DBIx::Cache and tells it to invalidate all items related to table X, so that the next call to prepare() or execute() or fetch() will not find any memcached matches and re-run the whole query and store the results. You could also simply handle that in your application, of course, and have it decide when to invalidate items.
It's been a while since I've really looked at the code, but as far as I can tell it is close to being able to actually use somewhere. :) Patches and questions welcome!
DBD::Pg query cancelling in Postgres
A new version of DBD::Pg, the Perl driver for PostgreSQL, has just been released. In addition to fixing some memory leaks and other minor bugs, this release (version 2.18.0) introduces support for the DBI method known as cancel(). A giant thanks to Eric Simon, who wrote this new feature. The new method is similar to the existing pg_cancel() method, except it works on synchronous rather than asynchronous queries. I'll show an example of both below.
DBD::Pg has been able to handle asynchronous queries for a while now. Basically, that means you don't have to wait around for the database to finish a query. Your application can do other things while the query runs, then check back later to see if it has completed and grab the results. The way to cancel an already kicked-off asynchronous query is with the pg_cancel() method (the other asynchronous methods are pg_ready and pg_result, which have no synchronous equivalents).
The prefix "pg_" is used because there is no corresponding built-in DBI method to override, and the convention is to prefix everything custom to a driver with the driver's prefix, in our case 'pg'. Here's an example showing one possible use of asynchronous queries using DBD::Pg in some Perl code:
## We are connecting to two servers and running expensive
## queries on both. We kick both off right away, then wait
## for them both to finish. Our total wait time is thus
## max(server1,server2) rather than sum(server1,server2)
use strict;
use warnings;
use DBI;
use DBD::Pg qw{ :async };
my $dsn1 = 'dbi:Pg:dbname=sales;host=example1.com';
my $dsn2 = 'dbi:Pg:dbname=sales;host=example2.com';
my $dbh1 = DBI->connect($dsn1, '', '', {AutoCommit=>0, RaiseError=>1});
my $dbh2 = DBI->connect($dsn2, '', '', {AutoCommit=>0, RaiseError=>1});
my $SQL = 'SELECT gather_yearly_sales_data()';
print "Kicking off a long, expensive query on database one\n";
## Normally, a do() will not return until the query is complete
## However, the async flag causes it to return immediately
$dbh1->do($SQL, {pg_async => PG_ASYNC});
print "Kicking off a long, expensive query on database two\n";
$dbh2->do($SQL, {pg_async => PG_ASYNC});
## Both queries are running in the 'background'
## We have to wait for both, so it doesn't matter which one we wait for here
## However, if it's been over 2 minutes, we'll cancel both and quit
my $time = 0;
while ( ! $dbh1->pg_ready() ) {
sleep 1;
if ($time++ > 120) {
print "Taking too long, let's cancel the queries\n";
$dbh1->pg_cancel();
$dbh2->pg_cancel();
$dbh1->rollback();
$dbh2->rollback();
die "No sales data was retrieved\n";
}
}
## We know that database 1 has finished, so we read in the results
my $rows1 = $dbh1->pg_result();
## We then grab results from database 2
## This will block until done, which is okay
my $rows2 = $dbh2->pg_result();
The new method, simply known as cancel(), will kill any synchronously running query. One of the main uses for this is to timeout a query by using the builtin Perl alarm function. However, since the builtin alarm function has some quirks, we will instead use the much safer POSIX::SigAction method. Another example:
## We are running a series of queries against a database, but if
## the whole thing is taking over 30 seconds, we want to cancel
## the currently running query and move on to something else.
use strict;
use warnings;
use DBI;
use DBD::Pg qw{ :async };
my $dsn = 'dbi:Pg:dbname=dq';
my $dbh = DBI->connect($dsn, '', '', {AutoCommit=>0, RaiseError=>1});
## Setup all the POSIX alarm plumbing
my $mask = POSIX::SigSet->new(SIGALRM);
my $action = POSIX::SigAction->new(
sub { die "TIMEOUT\n" },
$mask,
);
my $oldaction = POSIX::SigAction->new();
sigaction( SIGALRM, $action, $oldaction );
## Prepare the queries
my $upd = $dbh->prepare('UPDATE foobar SET x=? WHERE y=?');
my $inv = $dbh->prepare('SELECT refresh_inventory(?)');
## Yes, a double eval. Async is looking better all the time :)
eval {
eval {
alarm 30;
for my $y (12,24,48) {
print "Adjusting widget #$y\n";
$upd->execute(555,$y);
print "Recalculating inventory\n";
$inv->execute($y);
}
};
alarm 0; ## Turn off our alarm
die "$@\n" if $@; ## Bubble the error to the outer eval
};
if ($@) { ## Something went wrong
if ($@ =~ /TIMEOUT/) {
print "Queries are taking too long! Cancelling\n";
## We don't know which one is still running, and don't care
## It's safe to cancel a non-active statement handle
$upd->cancel() or die qq{Failed to cancel the query!\n};
$inv->cancel() or die qq{Failed to cancel the query!\n};
$dbh->rollback();
die "Who has time to wait 30 seconds anymore?";
}
## Some other non-alarm error, so we simply:
die $@;
}
print "Updates are complete\n";
$dbh->commit();
exit;
Got an interesting use case for asynchronous queries or the new $dbh‑>cancel()? Let me know!
Annotating Your Logs
We recently did some PostgreSQL performance analysis for a client with an application having some scaling problems. In essence, they wanted to know where Postgres was getting bogged down, and once we knew that we'd be able to target some fixes. But to get to that point, we had to gather a whole bunch of log data for analysis while the test software hit the site.
This is on Postgres 8.3 in a rather locked down environment, by the way. Coordinated pg_rotate_logfile() was useful, but occasionally it would seem to devolve to something resembling: "Okay, we're adding 60 more users ... now!" And I'd write down the time stamp, and figure out an appropriate place to slice the log file later.
Got me thinking, what if we could just drop an entry into the log file, and use it to filter things out later? My first instinct was to start looking at seeing if a patch would be accepted, maybe a wrapper for ereport(), something easy. Turns out, it's even easier than that...
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 60 users'; END;$$; DO Time: 0.464 ms pubsite=# DO $$BEGIN RAISE LOG 'MARK: 120 users'; END;$$; DO Time: 0.378 ms pubsite=# DO $$BEGIN RAISE LOG 'MARK: 360 users'; END;$$; DO Time: 0.700 ms
Of course the above will only work on version 9.0 and up (eventually). Previous versions that have PL/pgSQL turned can just create a function that does the same thing. The "LOG" severity level is an informational message that's supposed to always make it into the log files. So with those in place, a grep through the log can reveal just where they appear, and sed can extract the sections of log between those lines and feed them into your favorite analysis utility:
postgres@mothra:~$ grep -n 'LOG: MARK' /var/log/postgresql/postgresql-9.0-main.log 19180:2011-03-31 20:20:37 EDT LOG: MARK: 60 users 19478:2011-03-31 20:25:48 EDT LOG: MARK: 120 users 20247:2011-03-31 20:32:15 EDT LOG: MARK: 360 users postgres@mothra:~$ sed -n '19180,19478p' /var/log/postgresql/postgresql-9.0-main.log | bin/pgsi.pl > 60users.html
Oh, and the performance problem? Turns out it wasn't Postgres at all, every single query average execution time was shown to vary minimally as the concurrent user count was scaled higher and higher. But that's another story.
Postgres Build Farm Animal Differences
I'm a big fan of the Postgres Build Farm, a distributed network of computers that are constantly installling, building, and testing Postgres to detect any problems in the code. The build farm works best when there is a wide variety of operating systems and architectures testing. Thus, while I have a rather common x86_64 Linux box available for testing, I try to make it a little unique to get better test coverage.
One thing I've been working on is clang support (clang is an alternative to gcc). Unfortunately, the latest version of clang has a bug that prevents it from building Postgres on Linux boxes. I submitted a small patch to the Postgres source to fix this, but it was decided that we'll wait until clang fixes their bug. Supposedly they have in their svn head, but I've not been able to get that to compile successfully.
So I also just installed gcc 4.6.0, the latest and greatest. Installing it was not easy (nasty problems with the mfpr dependencies), but it's done now and working. It probably won't make any difference as far as the results, but at least my box is somewhat different from all the other x86_64 Linux boxes in the farm. :)
I've asked before on the list (with no response) about what sort of configuration changes could be made to expand the range of testing. The build farm itself provides a handful of things to choose from, and most of the animals in the farm have most of them configured (I have everything except "pam" and "vpath" enabled). However, one thing I've thought about changing is NAMEDATALEN. It's basically a compile-time option that sets the maximum number of characters things like table names can have. It is set by default to 64, while the SQL spec wants it to be 128. The problem is that this causes some tests to fail, as they have a hard-coded assumption about the length. The real problem of course is that Postgres' 'make check' is a very crude test. I've got some ideas on how to fix that, but that's another post for another day. So, anyone have other ideas on how to make my particular build farm member, and others like it, more useful?
check_postgres without Nagios (Postgres checkpoints)
Version 2.16.0 of check_postgres, a monitoring tool for Postgres, was just released. We're still trying to keep a "release often" schedule, and hopefully this year will see many releases. In addition to a few minor bug fixes, we added a new check by Nicola Thauvin called hot_standby_delay, which, as you might have guessed from the name, calculates the streaming replication lag between a master server and one of the slaves connected to it. Obviously the servers must be running PostgreSQL 9.0 or better.
Another recently added feature (in version 2.15.0) was the simple addition of a --quiet flag. All this does is to prevent any normal output when an OK status is found. I wrote this because sometimes even Nagios is overkill. In the default mode (Nagios, the other major mode is MRTG), check_postgres will exit with one of four states, each with their own exit code: OK, WARNING, CRITICAL, or UNKNOWN. It also outputs a small message, per Nagios conventions, so a txn_idle action might exit with a value of 1 and output something similar to this:
POSTGRES_TXN_IDLE WARNING: (host:svr1) longest idle in txn: 4638s
I had a situation where I wanted to use the functionality of check_postgres (to examine the lag on a warm standby server), but did not want the overhead of adding it into Nagios, and just needed a quick email to be sent if there were any problems. Thus, the use of the quiet flag yielded a quick and cheap Nagios replacement using cron:
*/10 * * * * bin/check_postgres.pl --action=checkpoint -w 300 -c 600 --datadir=/dbdir --quiet
So every 10 minutes the script gathers the number of seconds since the last checkpoint was run. If that number is under five minutes (300 seconds), it exits silently. If it's over five minutes, it outputs something similar to this, which cron then sends in an email:
POSTGRES_CHECKPOINT CRITICAL: Last checkpoint was 842 seconds ago
I'm not advocating replacing Nagios of course: there are many other good reasons to use Nagios instead of cron, but this worked well for the situation at hand. Other actions, feature requests, and patches for check_postgres are always welcome, either on the check_postgres bug tracker or the mailing list.
DBD::Pg, UTF-8, and Postgres client_encoding
Photo by Roger SmithI've been working on getting DBD::Pg to play nicely with UTF-8, as the current system is suboptimal at best. DBD::Pg is the Perl interface to Postgres, and is the glue code that takes the data from the database (via libpq) and gives it to your Perl program. However, not all data is created equal, and that's where the complications begin.
Currently, everything coming back from the database is, by default, treated as byte soup, meaning no conversion is done, and no strings are marked as utf8 (Perl strings are actually objects in which one of the attributes you can set is 'utf8'). If you want strings marked as utf8, you must currently set the pg_enable_utf8 attribute on the database handle like so:
$dbh->{pg_enable_utf8} = 1;
This causes DBD::Pg to scan incoming strings for high bits and mark the string as utf8 if it finds them. There are a few drawbacks to this system:
- It does this for all databases, even SQL_ASCII!
- It doesn't do this for everything, e.g. arrays, custom data types, xml.
- It requires the user to remember to set pg_enable_utf8.
- It adds overhead as we have to parse every single byte coming back from the database.
Here's one proposal for a new system. Feedback welcome, as this is a tricky thing to get right.
DBD::Pg will examine the client_encoding parameter, and see if it matches UTF8. If it does, then we can assume everything coming back to us from Postgres is UTF-8. Therefore, we'll simply flip the utf8 bit on for all strings. The one exception is bytea data, of course, which we'll read in and dequote into a non-utf8 string. Any non-UTF8 client_encodings (e.g. the monstrosity that is SQL_ASCII) will simply get back a byte soup, with no utf8 markings on our part.
The pg_enable_utf8 attribute will remain, so that applications that do their own decoding, or otherwise do not want the utf8 flag set, can forcibly disable it by setting pg_enable_utf8 to 0. Similarly, it can be forced on by setting pg_enable_utf8 to 1. The flag will always trump the client_encoding parameter.
A further complication is client_encoding: What if it defaults to something else? We can set it ourselves upon first connecting, and then if the program changes it after that point, it's on them to deal with the issues. (As DBD::Pg will still assume it is UTF-8, as we don't constantly recheck the parameter.)
Someone also raised the issue of marking ASCII-only strings as utf8. While technically this is not correct, it would be nice to avoid having to parse every single byte that comes out of the database to look for high bits. Hopefully, programs requesting data from a UTF-8 database will not be surprised when things come back marked as utf8.
Feel free to comment here or on the bug that started it all. Thanks also to David Christensen, who has given me great input on this topic.
Character encoding in perl: decode_utf8() vs decode('utf8')
When doing some recent encoding-based work in Perl, I found myself in a situation which seemed fairly unexplainable. I had a function which used some data which was encoded as UTF-8, ran Encode::decode_utf8() on said data to convert to Perl's internal character format, then converted the "wide" characters to the numeric entity using HTML::Entities::encode_entities_numeric(). Logging/printing of the data on input confirmed that the data was properly formatted UTF-8, as did running `iconv -f utf8 -t utf8 output.log >/dev/null` for the purposes of review.
However when I ended up processing the data, it was as if I had not run the decode function at all. In this case, the character in question was € (unicode code point U+20AC). The expected behavior from encode_entities_numeric() would be to turn any of the hi-bit characters in the perl string (i.e. all Unicode code points > 0x80) into the corresponding numeric entity (€ - € in this case). However instead of that specific character's numeric entity appearing in the output, the entities which appeared were: € i.e., the raw UTF-8 encoded value for €, with each octet being treated as an independent character instead of part of the whole encoded value.
What was particularly confusing was that extracting the relevant parts from the script in question resulted in the expected answer, so it was clearly not an issue of HTML::Entities not being able to deal with Unicode characters, as this code snippet demonstrates:
$ perl -MHTML::Entities+encode_entities_numeric -MEncode -e '$c=qq{\xE2\x82\xAC}; print encode_entities_numeric(decode_utf8($c))'
--> €
In the actual non-extracted version of the code, I was scratching my head. This was exhibiting the signs of doubly-encoded data, however I couldn't see how that could be the case. There were no PerlIO layers (e.g., :utf8 or :encoding) at play, the data I was outputting to a log file for verification purposes was being written via a brand new filehandle from a bare open(); I verified in multiple ways that the raw octets being passed in to the function were not doubly-encoded (printing the raw character points, counting lengths of the runs of octets and verifying that these matched the length of the UTF-8 encoded value for the represented characters, etc). The more things I tried the more puzzled I got. Finally, I changed the Encode::decode_utf8() call to a Encode::decode('utf8') one, providing the encoding explicitly. At this point, the processing pipeline started working as expected, and hi-bit characters were being output as their full numeric entities.
Since the documentation for decode_utf8 indicated that it should be identical to decode('utf8'), I resorted to the code to find out why it worked with the version that specified the encoding explicitly. I found that decode_utf8() does one additional thing that the regular decode('utf8') does not, and that is that before processing via the regular decode() function, decode_utf8 first checks the UTF-8 flag of the data that is being passed in, and if it is set it returns the data without further decoding*. My best guess is that this is to prevent errors if someone attempts to decode UTF-8 data in a string which is already in Perl's internal format, so in most cases this will provide a caller-friendly interface that will DWYM in many expected cases.
Armed with this knowledge, I verified that for some reason, the data that was being passed into the function had the UTF-8 flag set, so using the explicit decode('utf8') in lieu of decode_utf8() fixed the issue for me. (Tracing down the reason for the UTF-8 flag being set on this data was out of scope for this exercise, but is the true fix.) And just to verify that this was in fact the cause of the issue at hand, here's our example, modified slightly (we use the utf8::upgrade function to turn the UTF-8 flag on in the data and treat as actual encoded characters instead of raw octets):
$ perl -l -MHTML::Entities+encode_entities_numeric -MEncode -Mutf8 -e '$c=qq{\xE2\x82\xAC}; utf8::upgrade($c); print encode_entities_numeric(decode_utf8($c))'
--> €
* The UTF-8 flag is more-or-less an implementation detail of how Perl is able to deal with legacy 8-bit binary data in no particular encoding (i.e., raw octets, which it treats as latin-1) as well as the full range of Unicode data, and deal with both efficiently and in a backwards-compatible manner.
PostgreSQL 9.0 High Performance Review
I recently had the privilege of reading and reviewing the book PostgreSQL 9.0 High Performance by Greg Smith. While the title of the book suggests that it may be relevant only to PostgreSQL 9.0, there is in fact a wealth of information to be found which is relevant for all community supported versions of Postgres.
Acheiving the highest performance with PostgreSQL is definitely something which touches all layers of the stack, from your specific disk hardware, OS and filesystem to the database configuration, connection/data access patterns, and queries in use. This book gathers up a lot of the information and advice that I've seen bandied about on the IRC channel and the PostgreSQL mailing lists and presents it in one place.
While seemingly related, I believe some of the main points of the book could be summed up as:
- Measure, don't guess. From the early chapters which cover the lowest-level considerations, such as disk hardware/configuration to the later chapters which cover such topics as query optimization, replication and partitioning, considerable emphasis is placed on determining the metrics by which to measure performance before/after specific changes. This is the only way to determine the impact the changes you make have.
- Tailor to your specific needs/workflows. While there are many good rules of thumb out there when it comes to configuration/tuning, this book emphasizes the process of determining/refining those more general numbers to tailoring configuration/setup to your specific database's needs.
- Review the information the database system itself gives you. Information provided by the pg_stat_* views can be useful in identifying bottlenecks in queries, unused/underused indexes.
This book also introduced me to a few goodies which I had not encountered previously. One of the more interesting ones is the pg_buffercache contrib module. This suite of functions allows you to peek at the internals of the shared_buffers cache to get a feel for which relations are heavily accessed on a block-by-block basis. The examples in the book show this being used to more accurately size shared_buffers based on the actual number of accesses to specific portions of different relations.
I found the book to be well-written (always a plus when reading technical books) and felt it covered quite a bit of depth given its ambitious scope. Overall, it was an informative and enjoyable read.
Utah Open Source Conference 2010 part 1
It's been about a little over a month since the 2010 Utah Open Source Conference, and I decided to take a few minutes to review talks I enjoyed and link to my own talk slides.
Magento: Mac Newbold of Code Greene spoke on the Magento ecommerce framework for PHP. I've somewhat familiar with Magento, but a few things stood out:
- He finds the Magento Enterprise edition kind of problematic because Varien won't support you if you have any unsupported extensions. Some of his customers had problems with Varien support and went back to the community edition.
- Magento is now up to around 30 MB of PHP files!
- As I've heard elsewhere, serious customization has a steep learning curve.
- The Magento data model is an EAV (Entity-Attribute-Value) model. To get 50 columns of output requires 50+ joins between 8 tables (one EAV table for each value datatype).
- There are 120 tables total in default install -- many core features don't use the EAV tables for performance reasons.
- Another observation I've heard in pretty much every conversation about Magento: It is very resource intensive. Shared hosting is not recommended. Virtual servers should have a minimum of 1/2 to 1 GB RAM. Fast disk & database help most. APC cache recommended with at least 128 MB.
- A lot of front-end things are highly adjustable from simple back-end admin options.
- Saved credit cards are stored in the database, and the key is on the server. I didn't get a chance to ask for more details about this. I hope it's only the public part of a public/secret keypair!
It was a good overview for someone wanting to go beyond marketing feature lists.
Node.js: Shane Hansen of Backcountry.com spoke on Node, comparing it to Tornado and Twisted in Python. He calls JavaScript "Lisp in C's clothing", and says its culture of asynchronous, callback-driven code patterns makes Node a natural fit.
Performance and parallel processing are clearly strong incentives to look into Node. The echo server does 20K requests/sec. There are 2000+ Node projects on GitHub and 500+ packages in npm (Node Package Manager), including database drivers, web frameworks, parsers, testing frameworks, payment gateway integrations, and web analytics.
A few packages worth looking into further:
- express - web microframework like Sinatra
- Socket-IO - Web Sockets now; falls back to other things if no Web Sockets available
- hummingbird - web analytics, used by Gilt.com
- bespin - "cloud JavaScript editor"
- yui3 - build HTML via DOM, eventbus, etc.
- connect - like Ruby's Rack
I haven't played with Node at all yet, and this got me much more interested.
Metasploit: Jason Wood spoke on Metasploit, a penetration testing (or just plain penetrating!) tool. It was originally in Perl, and now is in Ruby. It comes with 590 exploits and has a text-based interactive control console.
Metasploit uses several external apps: nmap, Maltego (proprietary reconnaissance tool), Nessus (no longer open source, but GPL version and OpenVAS fork still available), Nexpose, Ratproxy, Karma.
The reconnaissance modules include DNS enumeration, and an email address collector that uses the big search engines.
It can backdoor PuTTY, PDFs, audio, and more.
This is clearly something you've got to experiment with to appreciate. Jason posted his Metasploit talk slides which have more detail.
So Many Choices: Web App Deployment with Perl, Python, and Ruby: This was my talk, and it was a lot of fun to prepare for, as I got to take time to see some new happenings I'd missed in these three languages communities' web server and framework space over the past several years.
The slides give pointers to a lot of interesting projects and topics to check out.
My summary was this. We have an embarrassment of riches in the open source web application world. Perl, Python, and Ruby all have very nice modern frameworks for developing web applications. They also have several equivalent solid options for deploying web applications. If you haven't tried the following, check them out:
That's about half of my notes on talks, but all I have time for now. I'll cover more in a later post.
Upgrading old versions of Postgres
Old elephant courtesy of Photos8.comThe recent release of Postgres 9.0.0 at the start of October 2010 was not the only big news from the project. Also released were versions 7.4.30 and 8.0.26, which, as I noted in my usual PGP checksum report, are going to be the last publicly released revisions in the 7.4 and 8.0 branches. In addition, the 8.1 branch will no longer be supported by the end of 2010. If you are still using one of those branches (or something older!), this should be the incentive you need upgrade as soon as possible. To be clear, this means that anyone running Postgres 8.1 or older is not going to get any official updates, including security and bug fixes.
A brief recap: Postgres uses major versions, containing two numbers, to indicate a major change in features and functionality. These are released about every two years. Each of these major versions has many revisions, which are released as often as needed. These revisions are designed to be completely binary compatible with the previous revision, meaning you can upgrade revisions very easily, with no dump and restore of the data needed.
Below are the options available for those running older versions of Postgres, from the most desirable to the least desirable. The three general options are to upgrade to the latest release (9.0 as I write this), migrate to a newer version, or stay on your release.
1. Upgrade to the latest release
This is the best option, as each new version of Postgres adds more features and becomes more efficient, all while maintaining the high code quality standards Postgres is known for. There are three general approaches to upgrading: pg_upgrade, pg_dump, and Bucardo / Slony.
Using pg_upgrade
The pg_upgrade utility is the preferred method for upgrading in the future. Basically, it rewrites your data directory from the "old" on-disk format to the "new" one. Unfortunately, pg_upgrade only works from version 8.3 and onwards, which means it cannot be used if you are coming from an older version. (This utility used to be called pg_migrator, in case you see references to that.)
Dump and restore
The next best method is the tried and true "dump and restore". This involves using pg_dump to create a logical representation of the old database, and then loading it into your new database with pg_restore or psql. The disadvantage to this method is time - dump and reload can take a very, very long time for large databases. Not only does the data need to get loaded into the new database tables, but all the indexes must be recreated, which can be agonizingly slow.
Replication systems
A third option is to use a replication system such as Slony or Bucardo to help with the upgrade. With Slony, you can set up a replication from the old version to the new version, and then failover to the new version once replication is caught up and running smooth. You can do something similar with Bucardo. Note that both systems can only replicate sequences, and tables containing primary keys or unique indexes. Bucardo has a "fullcopy" mode that will copy any table, regardless of primary keys, but it's slow as it's equivalent to a full dump and restore of the table. Note that Bucardo is really only tested on the 8.X versions: for anything older, you will need to use Slony.
Even if you cannot replicate all your tables, such systems can help a migration by replicating most of your data. For example, if you have a 750 GB table full of mostly historical data, you can have Bucardo start tracking changes to the table, set up a copy on the new version (perhaps by using warm standby or a snapshot to reduce load on the master), and then start Bucardo to catch up the rows that have changed since the changes were tracked. If you do this for all your large tables, the actual upgrade process can proceed with minimal downtime by shutting down the master, doing a pg_dump of only the non-tracked tables, and then pointing your apps at the new server.
2. Migrate to a newer version
Even if you don't go to 9.0, you may want to upgrade to a newer version. Why not go all the way to 9.0? There are only two good reasons not to. One, if your system's packaging system does not have 9.0 yet, or you have custom packaging requirements that prevent you from doing so. Two, if you have concerns about application compatibility between two versions. However, that latter concern should be minimal. The largest and most disruptive compatibility change appeared in version 8.3 with the removal of implicit casts. Since 8.2 is likely to be unsupported in the next couple years, you should be going to at least 8.3. And if you can go to 8.3, you can go to 9.0.
3. Stay on your release
This is obviously the least-desirable option, but may be necessary due to real-world constraints involving time, testing, compatibility with other programs, etc. At the bare minimum, make sure you are at least running the latest revision, e.g. 7.4.30 if running 7.4. Moving forward, you will need to keep an eye on the Postgres commits list and/or the detailed release notes for new versions, and examine if any of the fixed bugs apply to your version or your situation. If they do, you'll need to figure out how to apply the patch to your older version, and then release this new version into your environment. Sound risky? It gets worse, because your patch is only being used and tested by an extremely small pool of people, has no build farm support, and is not available to the Postgres developers. If you want to go this route, there are companies familiar with the Postgres code base (including End Point) that will help you do so. But know in advance that we are also going to push you very hard to upgrade to a modern, supported version instead (which we can help you with as well, of course :).
PostgreSQL 8.4 in RHEL/CentOS 5.5
The announcement of end of support coming soon for PostgreSQL 7.4, 8.0, and 8.1 means that people who've put off upgrading their Postgres systems are running out of time before they're in the danger zone where critical bugfixes won't be available.
Given that PostgreSQL 7.4 was released in November 2003, that's nearly 7 years of support, quite a long time for free community support of an open-source project.
Many of our systems run Red Hat Enterprise Linux 5, which shipped with PostgreSQL 8.1. All indications are that Red Hat will continue to support that version of Postgres as it does all parts of a given version of RHEL during its support lifetime. But of course it would be nice to get those systems upgraded to a newer version of Postgres to get the performance and feature benefits of newer versions.
For any developers or DBAs familiar with Postgres, upgrading to a new version with RPMs from the PGDG or other custom Yum repository is not a big deal, but occasionally we've had a client worry that using a packages other than the ones supplied by Red Hat is riskier.
For those holdouts still on PostgreSQL 8.1 because it's the "norm" on RHEL 5, Red Hat gave us a gift in their RHEL 5.5 update. It now includes separate PostgreSQL 8.4 packages that may optionally be used on RHEL 5 instead of PostgreSQL 8.1. (Both can't be used on the same system at the same time.)
I know that getting these packages from Red Hat shouldn't be necessary, but for those who feel jittery about using 3rd-party packages, it's a good nudge to switch to Postgres 8.4 using Red Hat's supported packages. Thanks to Tom Lane at Red Hat for making this happen. Though I don't know whose idea it was, Tom is the author of all the RPM commitlog messages, so thanks, Tom!
This brings up a few other rhetorical questions: Will RHEL 6 ship with PostgreSQL 9.0? Will RHEL 5.6 have backported PostgreSQL 9.0 in similar postgresql90 packages? It'd be great to see each new PostgreSQL release have supported packages in RHEL so that there's even less reason to start a new project on an older version of Postgres. RHEL 5.5 with PostgreSQL 8.4 is a nice start in that direction.
Postgres configuration best practices
This is the first in an occasional series of articles about configuring PostgreSQL. The main way to do this, of course, is the postgresql.conf file, which is read by the Postgres daemon on startup and contains a large number of parameters that affect the database's performance and behavior. Later posts will address specific settings inside this file, but before we do that, there are some global best practices to address.
Version Control
The single most important thing you can do is to put your postgresql.conf file into version control. I care not which one you use, but go do it right now. If you don't already have a version control system on your database box, git is a good choice to use. Barring that, RCS. Doing so is extremely easy. Just change to the directory postgresql.conf is in. The process for git:
- Install git if not there already (e.g. "sudo yum install git")
- Run: git init
- Run: git add postgresql.conf pg_hba.conf
- Run: git commit -a -m "Initial commit"
For RCS:
- Install as needed (e.g. "sudo apt-get install rcs")
- Run: mkdir RCS
- Run: ci -l postgresql.conf pg_hba.conf
Note that we also checked in pg_hba.conf as well. You want to check in any file in that directory you may possibly change. For most people, that only means postgresql.conf and pg_hba.conf, but if you use other files (pg_ident.conf) check those in as well.
Ideally you want the version checked in to be the "raw" configuration files that came with the system - in other words, before you started messing with them. Then you make your initial changes and check it in. From then on of course, you commit every time you change the file.
At a bare minimum, the version control system should be telling you:
- Exactly what was changed
- When it was changed
- Who made the change
- Why it was changed
The first two items happen automatically in all version control systems, so you don't have to worry about those. The third item, "who made the change", must be entered manually if on a shared account (e.g. postgres) and using RCS. If you are using git, you can simply set the environment variables GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL. For shared accounts, I have a custom bashrc file called "gregbashrc" that is called when I log in that sets those ENVs as well as a host of other items.
The fourth item, "why it was changed", is generally the content of the commit message. Never leave this blank, and be as descriptive and verbose as possible - someone later on will be grateful you did. It's okay to be repetitive and state the obvious. If this was done as part of a specific ticket number or project name, mention that as well.
Safe Changes
It's important that the changes you make to the postgresql.conf file (or other files) actually work and don't cause Postgres to be unable to parse the file, or handle a changed setting. Never make changes and restart Postgres, because if it doesn't work, you've got a broken config file, no Postgres daemon, and most likely unhappy applications and/or users. At the very least, do a reload first (e.g. /etc/init.d/postgresql reload or just kill -HUP the PID). Check the logs and see if Postgres was happy with your changes. If you are lucky, it won't even require a restart (some changes do, some do not).
A better way to test your changes is to make it on an identical test box. That way, all the wrinkles are ironed out before you make the changes on production and attempt a reload or restart.
Another way I've found handy is to simply start a new Postgres daemon. Sounds like a lot of work, but it's pretty automatic once you've done it a few times. The process generally looks like this, assuming your production postgresql.conf is in the "data" directory, and your changes are in data/postgresql.conf.new:
- cd ..
- initdb testdata
- cp -f data/postgresql.conf.new testdata/
- echo port=5555 >> testdata/postgresql.conf
- echo max_connections=10 >> testdata/postgresql.conf
The max_connections is not strictly necessary, of course, but unless you are changing something that relies on that setting, it's nicer to keep it (and the resulting memory) low.
- pg_ctl -D testdata -l test.log start
- cat test.log
- pg_ctl -D testdata stop
- rm -fr testdata (or just keep it around for next time)
The test.log file will show you any problems that might have popped up with your changes, and once it works you can be fairly confident it will work for the "main" daemon as well, so to finish up:
- cd data
- mv -f postgresql.conf.new postgresql.conf
- git commit postgresql.conf -m "Adjusted random_page_cost to 2, per bug #4151"
- kill -HUP `head -1 postmaster.pid`
- psql -c 'show random_page_cost'
Keeping it Clean
The postgresql.conf file is fairly long, and can be confusing to read with its mixture of comments, in-line comments, strange wrapping, and the commented out vs. not-commented-out variables. Hence, I recommend this system:
- Put a big notice at the top of the file asking people to make changes to the bottom
- Put all important variables at the bottom, sans comments, one per line
- Line things up
- Put into logical groups.
This avoids having to hunt for settings, prevents the gotcha of when a setting is changed twice in the file, and makes things much easier to read visually. Here's what I put at the top of the postgresql.conf:
## ## PLEASE MAKE ALL CHANGES TO THE BOTTOM OF THIS FILE! ##
I then add a good 20+ empty lines, so anyone viewing the file is forced to focus on the all-caps message above.
The next step is to put all the settings you care about at the bottom of the file. Which ones should you care about? Any setting you have changed (obviously), any setting that you *might* change in the future, and any that you may not have changed, but someone may want to look up. In practice, this means a list of about 25 items. After aligning all the values to the right and breaking things into logical groups, here's what the bottom of the postgresql.conf looks like:
## Connecting port = 5432 listen_addresses = '*' max_connections = 100 ## Memory shared_buffers = 400MB work_mem = 1MB maintenance_work_mem = 1GB ## Disk fsync = on synchronous_commit = on full_page_writes = on checkpoint_segments = 100 ## PITR archive_mode = off archive_command = '' archive_timeout = 0 ## Planner effective_cache_size = 18GB random_page_cost = 2 ## Logging log_destination = 'stderr' logging_collector = on log_filename = 'postgres-%Y-%m-%d.log' log_truncate_on_rotation = off log_rotation_age = 1d log_rotation_size = 0 log_min_duration_statement = 200 log_statement = 'ddl' log_line_prefix = '%t %u@%d %p' ## Autovacuum autovacuum = on autovacuum_vacuum_scale_factor = 0.1 autovacuum_analyze_scale_factor = 0.3
Because everything is in one place, at the bottom of the file, and not commented out, it's very easy to see what is going on. The groups above are somewhat arbitrary, and you can leave them out or create your own, but at least keep things grouped together as much as possible. When in doubt, use the same order as they appear in the original postgresql.conf.
Sometimes people change important settings in a group, such as for bulk loading of data. In this case, I usually make a separate group for it at the very bottom. This makes it easy to switch back and forth, and helps to prevent people from (for example) forgetting to switch fsync back on:
## Bulk loading only - leave 'on' for everyday use! autovacuum = off fsync = off full_page_writes = off
Ownership and permissions
All the conf files should be owned by the postgres user, and the configuration files should be world-readable if possible (indeed, it's a requirement for Debian based system that postgresql.conf be readable for psql to work!). Be careful about SELinux as well: it can get ornery if you do things like use symlinks.
Backups
One final note - make sure you are backing up your changes as well. PITR and pg_dump won't save your postgresql.conf! If you are checking things in to a remote version control system, then some of the pressure is off, but you should have some sort of policy for backing up all your conf files explicitly. Even if using a local git repo, tarring and copying up the whole thing is usually a very quick and cheap action.
Anonymous code blocks
With the release of PostgreSQL 9.0 comes the ability to execute "anonymous code blocks" in various of the PostgreSQL procedural languages. The idea stemmed from work back in autumn of 2009 that tried to respond to a common question on IRC or the mailing lists: how do I grant a permission to a particular user for all objects in a schema? At the time, the only solution short of manually writing commands to grant the permission in question on every object individually was to write a script of some sort. Further discussion uncovered several people that often found themselves writing simple functions to handle various administrative tasks. Many of those people, it turned out, would rather simply call one statement, rather than create a function, call the function, and then drop (or just ignore) the function they'd never need again. Hence, the new DO command.
The first language to support DO was PL/pgSQL. The PostgreSQL documentation provides an example to answer the original question: how do I grant permissions on everything to a particular user.
DO $$DECLARE r record;
BEGIN
FOR r IN SELECT table_schema, table_name FROM information_schema.tables
WHERE table_type = 'VIEW' AND table_schema = 'public'
LOOP
EXECUTE 'GRANT ALL ON ' || quote_ident(r.table_schema) || '.' || quote_ident(r.table_name) || ' TO webuser';
END LOOP;
END$$;
Notice that this doesn't actually tell us what language to use. If no language is specified, DO defaults to PL/pgSQL (which, in 9.0, is enabled by default). But you can use other languages as well:
DO $$
HAI
BTW Calculate pi using Gregory-Leibniz series
BTW This method does not converge particularly quickly...
I HAS A PIADD ITZ 0.0
I HAS A PISUB ITZ 0.0
I HAS A ITR ITZ 0
I HAS A T1
I HAS A T2
I HAS A PI ITZ 0.0
I HAS A ITERASHUNZ ITZ 1000
IM IN YR LOOP
T1 R QUOSHUNT OF 4.0 AN SUM OF 3.0 AN ITR
T2 R QUOSHUNT OF 4.0 AN SUM OF 5.0 AN ITR
PISUB R SUM OF PISUB AN T1
PIADD R SUM OF PIADD AN T2
ITR R SUM OF ITR AN 4.0
BOTH SAEM ITR AN BIGGR OF ITR AN ITERASHUNZ, O RLY?
YA RLY, GTFO
OIC
IM OUTTA YR LOOP
PI R SUM OF 4.0 AN DIFF OF PIADD AN PISUB
VISIBLE "PI R: "
VISIBLE PI
FOUND YR PI
KTHXBYE
$$ LANGUAGE PLLOLCODE;
I tried to rewrite the GRANT function shown above in PL/LOLCODE for this example, until I discovered that some of PL/LOLCODE's limitations make it extremely difficult, if not impossible. So far as I know, PL/LOLCODE was the second language to support anonymous blocks, thanks to what turned out to be a relatively simple programming exercise. After finishing PL/LOLCODE's DO support, I decided to do the same for PL/Perl. I wasn't particularly surprised to find that PL/Perl was harder to extend than PL/LOLCODE; PL/Perl is a much more feature-rich (and hence, complicated) language and I wasn't as familiar with its internals. However, after my initial submission and with helpful commentary from several other people, Andrew Dunstan tied off the loose ends and got it committed. It looks like this:
DO $$
my $row;
my $rv = spi_exec_query(q{
SELECT quote_ident(table_schema) || '.' || quote_ident(table_name) AS relname
FROM information_schema.tables WHERE table_type = 'VIEW' AND table_schema = 'public'
});
my $nrows = $rv->{processed};
foreach my $i (0 .. $nrows - 1) {
my $row = $rv->{rows}[$rn];
spi_exec_query("GRANT ALL ON $row->{relname} TO webuser");
}
$$ LANGUAGE plperl;
DO wasn't the only thing to come from the pgsql-hackers discussion I mentioned above. In PostgreSQL 9.0, the GRANT command has also been modified, so it's now possible to grant permissions several objects in one stroke syntax. For instance:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO webuser
pg_wrapper's very symbolic links
I like pg_wrapper. For a development environment, or testing replication scenarios, it's brilliant. If you're not familiar with pg_wrapper and its family of tools, it's a set of scripts in the postgresql-common and postgresql-client-common packages available in Debian, as well as Ubuntu and other Debian-like distributions. As you may have guessed pg_wrapper itself is a wrapper script that calls the correct version of the binary you're invoking – psql, pg_dump, etc – depending on the version of the database you want to connect to. Maybe not all that exciting in itself, but implied therein is the really cool bit: This set of tools lets you manage multiple installations of Postgres, spanning multiple versions, easily and reliably.
Well, usually reliably. We were helping a client upgrade their production boxes from Postgres 8.1 to 8.4. This was just before the 9.0 release, otherwise we'd consider moving the directly to that instead. It was going fairly smoothly until on one box we hit this message:
Could not parse locale out of pg_controldata output
Oops, they had pinned the older postgres-common version. An upgrade of those packages and no more error!
$ pg_lsclusters Version Cluster Port Status Owner Data directory Log file 8.1 main 5439 online postgres /var/lib/postgresql/8.1/main custom Error: Invalid data directory
Hmm, interesting. Okay, so not quite, got a little bit more work to do. This one took some tracing through the code. The pg_wrapper scripts, if they don't already know it, look for the data directory in a couple of places. The first stop is the postgresql.conf file, specifically /etc/postgresql/<version>/<cluster-name>/postgresql.conf, looking for the data_directory parameter. But, in its transitional state at the time, the postgresql.conf was still a work in progress.
The second place it looks is a symlink in the same /etc/postgresql/<version>/<cluster-name>/ directory. While that's the old way of doing things, it at least let us get things looking reasonable:
# ln -s /var/lib/postgresql/8.4/main /etc/postgresql/8.4/main/pgdata # /etc/init.d/postgresql-8.4 status 8.1 main 5439 online postgres /var/lib/postgresql/8.1/main custom 8.4 main 5432 online postgres /var/lib/postgresql/8.4/main custom
Voilà! From there we were able to proceed with the upgrade, confident that the instance will behave as expected. And now, everything is running great!
As with most things that provide a simpler experience on the surface, there's additional complexity under the hood. But for now, we have one more client upgraded. Thanks, Postgres!
Listen/Notify improvements in PostgreSQL 9.0
Improved listen/notify is one of the new features of Postgres 9.0 I've been waiting for a long time. There are basically two major changes: everything is in shared memory instead of using system tables, and full support for "payload" messages is enabled.
Before I demonstrate the changes, here's a review of what exactly the listen/notify system in Postgres is. Basically, it is an inter-process signalling system, which uses the pg_listener system table to coordinate simple named events between processes. One or more clients connects to the database and issues a command such as:
LISTEN foobar;
The name foobar can be replaced by any valid name; usually the name is something that gives a contextual clue to the listening process, such as the name of a table. Another client (or even one of the original ones) will then issue a notification like so:
NOTIFY foobar;
Each client that is listening for the 'foobar' message will receive a notification that the sender has issued the NOTIFY. It also receives the PID of the sending process. Multiple notifications are collapsed into a single notice, and the notification is not sent until a transaction is committed.
Here's some sample code using DBD::Pg that demonstrates how the system works:
#!/usr/bin/env perl
# -*-mode:cperl; indent-tabs-mode: nil-*-
use strict;
use warnings;
use DBI;
my $dsn = 'dbi:Pg:dbname=test';
my $dbh1 = DBI->connect($dsn,'test','', {AutoCommit=>0,RaiseError=>1,PrintError=>0});
my $dbh2 = DBI->connect($dsn,'test','', {AutoCommit=>0,RaiseError=>1,PrintError=>0});
print "Postgres version is $dbh1->{pg_server_version}\n";
my $SQL = 'SELECT pg_backend_pid(), version()';
my $pid1 = $dbh1->selectall_arrayref($SQL)->[0][0];
my $pid2 = $dbh2->selectall_arrayref($SQL)->[0][0];
print "Process one has a PID of $pid1\n";
print "Process two has a PID of $pid2\n";
## Process one listens for a notice named "jtx"
$dbh1->do(q{LISTEN jtx});
$dbh1->commit();
## Process one checks for any notices received
print show_notices($dbh1);
## Process two sends a notice, but does not commit
$dbh2->do(q{NOTIFY jtx});
## Process one does not see the notice yet
print show_notices($dbh1);
## Process two sends the same notice again, then commits
$dbh2->do(q{NOTIFY jtx});
$dbh2->commit();
sleep 1; ## Ensure the notice has time to get to propogate
## Process two receives a single notice from process one
print show_notices($dbh1);
## Now that it has seen the notice, it reports nothing again:
print show_notices($dbh1);
sub show_notices { ## Function to return any notices received
my $dbh = shift;
my $messages = '';
$dbh->commit();
while (my $n = $dbh->func('pg_notifies')) {
$messages .= "Got notice '$n->[0]' from PID $n->[1]\n";
}
return $messages || "No messages\n";
}The output of the above script on a 8.4 Postgres server is:
Postgres version is 80401 Process one has a PID of 18238 Process two has a PID of 18239 No messages No messages Got notice 'jtx' from PID 18239 No messages
As expected, we got a notification only after the other process committed.
Note that because this is asychronous and involves the system tables, we added a sleep call to ensure that the notice had time to propagate so that the other processes will see it. Without the sleep, we usually see four "No messages" appear, as the script goes too fast for the pg_listener table to catch up.
Now for the aforementioned payloads. Payloads allow an arbitrary string to be attached to the notification, such that you can have a standard name like before, but you can also attach some specific text that the other processes can see. I added support for payloads to DBD::Pg back in June 2008, so let's modify the script a little bit to demonstrate the new payload mechanism:
...
## Process two sends two notices, but does not commit
$dbh2->do(q{NOTIFY jtx, 'square'});
$dbh2->do(q{NOTIFY jtx, 'square'});
## Process one does not see the notice yet
print show_notices($dbh1);
## Process two sends the same notice again, then commits
$dbh2->do(q{NOTIFY jtx, 'triangle'});
$dbh2->commit();
...
## This part changes: we get an extra item from our array:
$messages .= "Got notice '$n->[0]' from PID $n->[1] message is '$n->[2]'\n";
...Here's what the output looks like under version 9.0 of Postgres:
Postgres version is 90000 Process one has a PID of 19089 Process two has a PID of 19090 No messages No messages Got notice 'jtx' from PID 19090 message is 'square' Got notice 'jtx' from PID 19090 message is 'triangle' No messages
Note that the collapsing of identical messages into a single notification now takes into account the message as well, so we received two notifications in the above example for the three total notifications sent. To add a payload, we simply say NOTIFY, then the name of the notification, add a comma, and specify a payload as a quoted string. Of course, the payload string is still completely optional. If no payload is specified, DBD::Pg will simply treat the payload as an empty string (this is also the behavior when you request the payload using DBD::Pg against a pre-9.0 server, so all combinations should be 100% backwards compatible).
We also got rid of the sleep. Because we are now using shared memory instead of system tables, there is no lag whatsoever, and the other process can see the notices right away.
Another large advantage to removing the pg_listener table is that systems that make heavy use of it (such as the replication systems Bucardo and Slony) no longer have to worry about bloat in these tables.
The use of payloads also means that many application can be greatly simplified: in the past, one had to be creative in the name of your notifications in order to pass meta-information to your listener. For example, Bucardo uses a large collection of notifications, meaning that the Bucardo processes had to do the equivalent of things like this:
$dbh->do(q{LISTEN bucardo_reload_config});
$dbh->do(q{LISTEN bucardo_log_message});
$dbh->do(q{LISTEN bucardo_activate_sync_$sync});
$dbh->do(q{LISTEN bucardo_deactivate_sync_$sync});
$dbh->do(q{LISTEN bucardo_kick_sync_$sync});
...
while (my $notice = $dbh->func('pg_notifies')) {
my ($name, $pid) = @$notice;
if ($name eq 'bucardo_reload_config') {
...
}
elsif ($name =~ /bucardo_kick_sync_(.+)/) {
...
}
...
}
We can instead do things like this:
$dbh->do(q{LISTEN bucardo});
...
while (my $notice = $dbh->func('pg_notifies')) {
my ($name, $pid, $msg) = @$notice;
if ($msg eq 'bucardo_reload_config') {
...
}
elsif ($msg =~ /bucardo_kick_sync_(.+)/) {
...
}
...
}
I hope to add this support to Bucardo shortly; it's simply a matter of refactoring all the listen and notify calls into a function that does the right thing depending on the server version it is attached to.
PostgreSQL odd checkpoint failure
Nothing strikes fear into the heart of a DBA like error messages, particularly ones which indicate that there may be data corruption. One such situation happened recently to us, when we ran into a recent unusual situation in an upgrade to PostgreSQL 8.1.21. We had updated the software and manually been running a REINDEX DATABASE command, when we started to notice some errors being reported on the front-end. We decided to dump the database in question to ensure we had a backup to return to, however we still ended up with more messages:
pg_dump -Fc database1 > pgdump.database1.archive pg_dump: WARNING: could not write block 1 of 1663/207394263/443523507 DETAIL: Multiple failures --- write error may be permanent. pg_dump: ERROR: could not open relation 1663/207394263/443523507: No such file or directory CONTEXT: writing block 1 of relation 1663/207394263/443523507 pg_dump: SQL command to dump the contents of table "table1" failed: PQendcopy() failed. pg_dump: Error message from server: ERROR: could not open relation 1663/207394263/443523507: No such file or directory CONTEXT: writing block 1 of relation 1663/207394263/443523507 pg_dump: The command was: COPY public."table1" (id, field1, field2, field3) TO stdout;
Looking at the pg_database contents revealed that 207394263 was not even the database in question. I connected to the aforementioned database and looked for a relation that matched that pg_class.oid, and barring that pg_class.relfilenode. This search revealed nothing. So where was the object itself living, and why were we getting this message?
We decided that since it appeared that something was awry with the database system in general, that we should take this opportunity to dump the tables in question. I proceeded to write a quick script to go through the database tables and dump each one individually using pg_dump's -t option. This worked for some of the tables, but not all of them, which would die with the same error. Looking at the pg_class.relpages field for the non-dumpable tables revealed that these were all the larger tables in the database. Obviously not good, since this is where the bulk of the data lay. However, we also noticed that the message that we got referenced the exact same filesystem path, so it appeared to be something separate from the table that was being dumped.
After some advice on IRC, we reviewed the logs for checkpoint logging, which revealed that checkpoints had been failing. This further meant that the database was in a state such that it could not be shut down cleanly, had we wanted to try to restart to see if that cleared up the flakiness. This further meant that we'd only be able to shutdown via a hard kill, which is definitely something to avoid, WAL or not, particularly since there had not been a checkpoint for some time. A manual CHECKPOINT further failed after a timeout.
Before we went down the road of forcing a hard server shutdown, we ended up just touching the specific relation path in question into existence and then running a CHECKPOINT. This time since the file existed, it was able to complete the checkpoint, and restore working order to the database. We successfully (and quickly) ran a full pg_dump, and went about the task of manually vetting a few of the affected tables, etc.
Our working theory for this is that somehow there was a dirty buffer that referenced a relation that no longer existed, and hence when the there was a checkpoint or other event which attempted to flush shared_buffers (i.e., the loading of a large relation which would require a flush of Least Recently Used pages as in the pg_dump case), the flush attempt for the missing relation failed, which aborted the checkpoint/other action.
After the file existed and PostgreSQL had successfully synched to disk, it was a single two-block file, of which the first block was completely empty and the second block looked like an index page (due to the layout/contents of the data). The most suggestive cause was that had been an interrupted REINDEX earlier in the day. Since this machine was showing no other signs of data corruption and everything else seemed reasonable, our best guess is that there was some race condition that had caused the relation's data to exist in memory even while the canceled REINDEX ensured that the actual relfile and the pg_class rows did not exist for the buffer.
Distributed Transactions and Two-Phase Commit
The typical example of a transaction involves Alice and Bob, and their bank. Alice pays Bob $100, and the bank needs to debit Alice and credit Bob. Easy enough, provided the server doesn't crash. But what happens if the bank debits Alice, and then before crediting Bob, the server goes down? Or what if they credit Bob first, and then try to debit Alice only to find she doesn't have enough funds? A transaction allows the debit and credit operations to happen as a package ("atomically" is the word commonly used), so either both operations happen or neither happens, even if the server crashes halfway through the transaction. That way the bank never credits Bob without debiting Alice, or vice versa.
That's simple enough, but the situation can become more complex. What if, for instance, for buzzword-compliance purposes, the bank has "sharded" its accounts database by splitting it in pieces and putting each piece on a different server (whether this is would be smart or not is outside the scope of this post). The typical transaction handles statements issued only for one database, so we can't wrap the debit and credit operations within a single BEGIN/COMMIT if Alice's account information lives on one server and Bob's lives on another.
Enter "distributed transactions". A distributed transaction allows applications to group multiple transaction-aware systems into a single transaction. These systems might be different databases, or they might include other systems such as message queues, in which case the transaction concept means a message would get delivered if and only if the rest of the transaction completed. So with a distributed transaction, the bank could debit Alice's account in one database and credit Bob's in another, atomically.
All this comes at some cost. Distributed transactions require a "transaction manager", an application which handles the special semantics required to commit a distributed transaction. Second, the systems involved must support "two-phase commit" (which was added to PostgreSQL in version 8.1). Distributed transactions are committed using PREPARE TRANSACTION 'foo' (phase 1), and COMMIT PREPARED 'foo' or ROLLBACK PREPARED 'foo' (phase 2), rather than the usual COMMIT or ROLLBACK.
The beginning of a distributed transaction looks just like any other transaction: the application issues a BEGIN statement (optional in PostgreSQL), followed by normal SQL statements. When the transaction manager is instructed to commit, it runs the first commit phase by saying "PREPARE TRANSACTION 'foo'" (where "foo" is some arbitrary identifier for this transaction) on each system involved in the distributed transaction. Each system does whatever it needs to do to determine whether or not this transaction can be committed and to make sure it can be committed even if the server crashes, and reports success or failure. If all systems succeed, the transaction manager follows up with "COMMIT PREPARED 'foo'", and if a system reports failure, the transaction manager can roll back all the other systems using either ROLLBACK (for those transactions it hasn't yet prepared), or "ROLLBACK PREPARED 'foo'". Using two-phase commit is obviously slower than committing transactions on only one database, but sometimes the data integrity it provides justifies the extra cost.
In PostgreSQL, two-phase commit is supported provided max_prepared_transactions is nonzero. A PREPARE TRANSACTION statement persists the current transaction to disk, and dissociates it from the current session. That way it can survive even if the database goes down. The current session no longer has an active transaction. However, the prepared transaction acts like any other open transaction in that all locks held by the prepared transaction remain held, and VACUUM cannot reclaim storage from that transaction. So it's not a good idea to leave prepared transactions open for a long time.
Distributed transactions are most common, it seems, in Java applications. Full J2EE application servers typically come with a transaction manager component. For my examples I'll use an open source, standalone transaction manager, called Bitronix. I'm not particularly fond of using Java for simple scripts, though, so I've used JRuby for this demonstration code.
This script uses two databases, which I've called "athos" and "porthos". Each has same schema, which provides a simple framework for the sharded bank example described above. This schema provides a table for account names, another for ledger information, and a simple trigger to raise an exception when a transaction would bring a person's balance below $0. I'll first populate athos with Alice's account information. She gets $200 to start. Bob will go in the porthos database, with no initial balance.
5432 josh@athos# insert into accounts values ('Alice');
INSERT 0 1
5432 josh@athos*# insert into ledger values ('Alice', 200);
INSERT 0 1
5432 josh@athos*# commit;
COMMIT5432 josh@athos# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# insert into accounts values ('Bob');
INSERT 0 1
5432 josh@porthos*# commit;
COMMIT
Use of Bitronix is pretty straightforward. After setting up a few constants for easier typing, I create a Bitronix data source for each PostgreSQL database. Here I have to use the PostgreSQL JDBC driver's org.postgresql.xa.PGXADataSource class; "XA" is Java's protocol for two-phase commit, and requires JDBC driver support. Here's the code for setting up one data source; the other is just the same.
ds1 = PDS.new ds1.set_class_name 'org.postgresql.xa.PGXADataSource' ds1.set_unique_name 'pgsql1' ds1.set_max_pool_size 3 ds1.get_driver_properties.set_property 'databaseName', 'athos' ds1.get_driver_properties.set_property 'user', 'josh' ds1.init
Then I simply get a connection from each data source, instantiate a Bitronix TransactionManager object, and begin a transaction.
c1 = ds1.get_connection c2 = ds2.get_connection btm = TxnSvc.get_transaction_manager btm.begin
Within my transaction, I just use normal JDBC commands to debit Alice and credit Bob, after which I commit the transaction through the TransactionManager object. If this transaction fails, it raises an exception, which I can capture using Ruby's begin/rescue exception handling, and roll back the transaction.
begin
s2 = c2.prepare_statement "INSERT INTO ledger VALUES ('Bob', 100)"
s2.execute_update
s2.close
s1 = c1.prepare_statement "INSERT INTO ledger VALUES ('Alice', -100)"
s1.execute_update
s1.close
btm.commit
puts "Successfully committed"
rescue
puts "Something bad happened: " + $!
btm.rollback
end
When I run this, Bitronix gives me a bunch of output, which I haven't bothered to suppress, but among it all is the "Successfully committed" string I told it to print on success. Since Alice is debited $100 each time we run this, and she started with $200, we can run it twice before hitting errors. On the third time, we get this:
Something bad happened: org.postgresql.util.PSQLException: ERROR: Rejecting operation; account owner Alice's balance would drop below 0
This is our trigger firing, to tell us that we can't debit Alice any more. If I look in the two databases, I can see that everything worked as planned:
5432 josh@athos*# select get_balance('Alice');
get_balance
-------------
0
(1 row)
5432 josh@athos*# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# select get_balance('Bob');
get_balance
-------------
200
(1 row)
Remember I've run my script three times, but Bob has only been credited $200, because that's all Alice had to start with.
PostgreSQL: Migration Support Checklist
A database migration (be it from some other database to PostgreSQL, or even from an older version of PostgreSQL to a nice shiny new one) can be a complicated procedure with many details and many moving parts. I've found it helpful to construct a list of questions in order to make sure that you're considering all aspects of the migrations and gauge the scope of what will be involved. This list includes questions we ask our clients; feel free to contribute your own additional considerations or suggestions.
Technical questions:
- Database servers: How many database servers do you have? For each, what are the basic system specifications (OS, CPU architecture, 32- vs 64-bit, RAM, disk, etc)? What kind of storage are you using for the existing database, and what do you plan to use for the new database? Direct-attached storage (SAS, SATA, etc.), SAN (what vendor?), or other? Do you use any configuration management system such as Puppet, Chef, etc.?
- Application servers and other remote access: How many application servers do you have? For each, what are the basic system specifications (OS, CPU architecture, 32- vs 64-bit, RAM, disk, etc)? Do you use any configuration management system such as Puppet, Chef, etc.? What other network considerations are there? Is ODBC used, or SSL transport, any VPNs? Are multiple datacenters involved? How about egress/ingress firewalls?
- Middleware: Do you currently use any sort of connection pooling, load balancing, or other middleware between your application and database servers?
- Data needs: Can you describe your data access patterns? i.e., is the majority of your data historical and rarely accessed? Are there any existing reporting needs that will need to be duplicated on the PostgreSQL system? Do you already have reports of database usage, including traffic levels, frequent or intensive queries, etc?
- Size: What kind of transaction volume do you see? How large are your databases? How many tables do you have and what is the size of the larger ones? How many users or database connections will you need to support?
- Backups: What are your current backup policies/procedures? How will these need to change with the move to PostgreSQL?
- Replication/load balancing: What kind of system redundancy do you currently have/need? Do you have any kind of database load-balancing or master-slave replication?
- Monitoring: What is the current monitoring/in-house support infrastructure? What needs to be duplicated, and can any portion of this facility be reused?
- Interfaces: What language are your applications written in, and what drivers exist to connect to your current database? Will there be a compatible driver available in your language of choice in order?
- Extensions: Are you currently using any in-database procedures or functionality (i.e., in PL/SQL or another embedded language of choice)? If so, how many? What will the difficulty be in porting these functions to PostgreSQL?
And a couple of business-related questions:
- Scheduling: What is the timeframe for transition? When can appropriate downtime be scheduled? How much database downtime can you afford?
- Staffing: Do you currently have in-house DBAs to manage the servers, etc on a day-to-day basis? Is there anyone with PostgreSQL experience or familiarity on staff?
Being able to answer all of these questions is critical to formulating a migration plan and carrying out a migration successfully.
Particularly with the impending (July 2010) end of life for previous PostgreSQL releases 7.4, 8.0 and (in November 2010) 8.1, a database migration may be on your radar. End Point is one of many professional PostgreSQL support companies who would be happy to assist you in your transition.
Views across many similar tables
An application I'm working on has a host of (a dozen or so) status tables, each containing various rows that reflect the state of associated rows in other tables. For instance:
Table "public.inventory" ... status_code | character varying(50) | not null Table "public.inventory_statuses" code | character varying(50) | not null display_label | character varying(70) | not null SELECT * FROM inventory_statuses; code | display_label -----------+--------------- ordered | Ordered shipped | Shipped returned | Returned repaired | Repairedetc.
Several of the codes are common to several tables. For instance, "void" is a status that occurs in seven tables. The application cares about this; there are code-level triggers that will respond to a change of status to "void" in one table, and pass that information along to another table higher up the chain.
Since I wasn't present at the birth of the system (nor do I have unlimited memory to keep 180+ codes in my head), I needed a way to answer the question, "In which table(s) does status 'foo' occur?" This was made rather easier by attention to detail early on: each of the status tables was named "*_statuses"; each primary key was named "code"; and each human-readable description field was named "display_label". I wrote a Pl/PgSQL function to create a view spanning all the tables. (I could have just created the SQL by hand, but I wanted a way to reproduce this effort later, if tables are added, dropped, or modified.)
CREATE FUNCTION create_all_statuses()
RETURNS VOID
LANGUAGE 'plpgsql'
AS $$
DECLARE
stmt TEXT;
tbl RECORD;
BEGIN
stmt := '';
FOR tbl IN EXECUTE $SQL$
SELECT DISTINCT table_name
FROM information_schema.columns a
JOIN information_schema.columns b
USING (table_name)
JOIN information_schema.tables t
USING (table_name)
WHERE a.column_name = 'code'
AND b.column_name = 'display_label'
AND table_name ~ '_statuses$'
AND t.table_type = 'BASE TABLE'
$SQL$
LOOP
IF (LENGTH(stmt) > 0)
THEN
stmt := stmt || ' UNION ';
END IF;
stmt := stmt || 'SELECT code, display_label, ' ||
quote_literal(tbl.table_name) ||
' AS table_name FROM ' ||
quote_ident(tbl.table_name);
END LOOP;
EXECUTE 'CREATE VIEW all_statuses AS ' || stmt;
RETURN;
END;
$$;Now it's easy to answer the question:select * from all_statuses where code = 'void'; code | display_label | table_name ------+---------------+-------------------------------------- void | Void | inventory_statuses void | Void | parcel_statuses void | Void | pick_list_statusesetc.
If your database uses boilerplate columns such as "last_modified" or "date_created" to record timestamps on rows, you could use similar logic to create a view that would tell you which tables were the most recently modified.
NoSQL at RailsConf 2010: An Ecommerce Example
Even more so than Rails 3, NoSQL was a popular technical topic at RailsConf this year. I haven't had much exposure to NoSQL except for reading a few articles written by Ethan (Quick Thoughts on NoSQL Live Boston Conference, NoSQL Live: The Dynamo Derivatives (Cassandra, Voldemort, Riak), and Cassandra, Thrift, and Fibers in EventMachine), so I attended a few sessions to learn more.
First, it was reinforced several times that if you can read JSON, you should have no problem comprehending NoSQL. So, it shouldn't be too hard to jump into code examples! Next, I found it helpful when one of the speakers presented high-level categorization of NoSQL, whether or not the categories meant much to me at the time:
- Key-Value Stores: Advantages include that this is the simplest possible data model. Disadvantages include that range queries are not straightforward and modeling can get complicated. Examples include Redis, Riak, Voldemort, Tokyo Cabinet, MemcacheDB.
- Document stores: Advantages include that the value associated with a key is a document that exposes a structure that allows some database operations to be performed on it. Examples include CouchDB, MongoDB, Riak, FleetDB.
- Column-based stores: Examples include Cassandra, HBase.
- Graph stores: Advantages include that this allows for deep relationships. Examples include Neo4j, HypergraphDB, InfoGrid.
In one NoSQL talk, Flip Sasser presented an example to demonstrate how an ecommerce application might be migrated to use NoSQL, which was the most efficient (and very familiar) way for me to gain an understanding of NoSQL use in a Rails application. Flip introduced the models and relationships shown here:
In the transition to NoSQL, the transaction model stays as is. As a purchase is created, the Notification.create method is called.
class Purchase < ActiveRecord::Base
after_create :create_notification
# model relationships
# model validations
def total
quantity * product.price
end
protected
def create_notification
notifications.create({
:action => "purchased #{quantity == 1 ? 'a' : quantity} #{quantity == 1 ? product.name : product.name.pluralize}",
:description => "Spent a total of #{total}",
:item => self,
:user => user
}
)
end
end
Flip moves the product class to Document store because it needs a lot of flexibility to handle the diverse product metadata. The structure of the product class is defined in the product class and nowhere else.
Before
class Product < ActiveRecord::Base serialize :info, Hash end
After
class Product include MongoMapper::Document key :name, String key :image_path, String key :info, Hash timestamps! end
The Notification class is moved to a Key-Value store. After a user completes a purchase, the create method is called to store a notification against the user that is to receive the notification.
Before
class Notification < ActiveRecord::Base # model relationships # model validations end
After
require 'ostruct'
class Notification < OpenStruct
class << self
def create(attributes)
message = "#{attributes[:user].name} #{attributes[:action]}"
attributes[:user].follower_ids.each do |follower_id|
Red.lpush("user:#{follower_id}:notifications", {:message => message, :description => attributes[:description], :timestamp => Time.now}.to_json)
end
end
end
end
The user model remains an ActiveRecord model and uses the devise gem for user authentication, but is modified to retrieve the notifications, now an OpenStruct. The result is that whenever a user's friend makes a purchase, the user is notified of the purchase. In this simple example, a purchase contains one product only.
Before
class User < ActiveRecord::Base
# user authentication here
# model relationships
def notifications
Notification.where("friend_relationships.friend_id = notifications.user_id OR notifications.user_id = #{id}").
joins("LEFT JOIN friend_relationships ON friend_relationships.user_id = #{id}")
end
end
After
class User < ActiveRecord::Base
# user authentication here
# model relationships
def followers
User.where('users.id IN (friend_relationships.user_id)').
joins("JOIN friend_relationships ON friend_relationships.friend_id = #{id}")
end
def follower_ids
followers.map(&:id)
end
def notifications
(Red.lrange("user:#{id}:notifications", 0, -1) || []).map{|notification| Notification.new(ActiveSupport::JSON.decode(notification))}
end
end
The disadvantages to the NoSQL and RDBMS hybrid is that data portability is limited and ActiveRecord plugins can no longer be used. But the general idea is that performance justifies the move to NoSQL for some data. In several sessions I attended, the speakers reiterated that you will likely never be in a situation where you'll only use NoSQL, but that it's another tool available to suit performance-related business needs. I later spoke with a few Spree developers and we concluded that the NoSQL approach may work well in some applications for product and variant data for improved performance with flexibility, but we didn't come to an agreement on where else this approach may be applied.
Learn more about End Point's Ruby on Rails Development or Ruby on Rails Ecommerce Services.
pgcrypto pg_cipher_exists errors on upgrade from PostgreSQL 8.1
While migrating a client from a 8.1 Postgres database to a 8.4 Postgres database, I came across a very annoying pgcrypto problem. (pgcrypto is a very powerful and useful contrib module that contains many functions for encryption and hashing.) Specifically, the following functions were removed from pgcrypto as of version 8.2 of Postgres:
- pg_cipher_exists
- pg_digest_exists
- pg_hmac_exists
While the functions listed above were deprecated, and marked as such for a while, their complete removal from 8.2 presents problems when upgrading via a simple pg_dump. Specifically, even though the client was not using those functions, they were still there as part of the dump. Here's what the error message looked like:
$ pg_dump mydb --create | psql -X -p 5433 -f - >pg.stdout 2>pg.stderr ... psql::2654: ERROR: could not find function "pg_cipher_exists" in file "/var/lib/postgresql/8.4/lib/pgcrypto.so" psql: :2657: ERROR: function public.cipher_exists(text) does not exist
While it doesn't stop the rest of the dump from importing, I like to remove any errors I can. In this case, it really was a SMOP. Inside the Postgres 8.4 source tree, in the contrib/pgcrypto directory, I added the following declarations to pgcrypto.h:
Datum pg_cipher_exists(PG_FUNCTION_ARGS); Datum pg_digest_exists(PG_FUNCTION_ARGS); Datum pg_hmac_exists(PG_FUNCTION_ARGS);
Then I added three simple functions to the bottom of the pgcrypto.c file that simply throw an error if they are invoked, letting the user know that the functions are deprecated. This is a much friendlier way than simply removing the functions, IMHO.
/* SQL function: pg_cipher_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_cipher_exists);
Datum
pg_cipher_exists(PG_FUNCTION_ARGS)
{
ereport(ERROR,
(errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
errmsg("pg_cipher_exists is a deprecated function")));
PG_RETURN_TEXT_P("0");
}
/* SQL function: pg_cipher_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_digest_exists);
Datum
pg_digest_exists(PG_FUNCTION_ARGS)
{
ereport(ERROR,
(errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
errmsg("pg_digest_exists is a deprecated function")));
PG_RETURN_TEXT_P("0");
}
/* SQL function: pg_hmac_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_hmac_exists);
Datum
pg_hmac_exists(PG_FUNCTION_ARGS)
{
ereport(ERROR,
(errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
errmsg("pg_hmac_exists is a deprecated function")));
PG_RETURN_TEXT_P("0");
}
After running make install from the pgcrypto directory, the dump proceeded without any further pgcrypto errors. From this point forward, if the anyone attempts to use one of the functions, it will be quite obvious that the function is deprecated, rather than leaving the user wondering if they typed the function name incorrectly or wondering if pgcrypto is perhaps not installed.
Why not just add some dummy SQL functions to the pgcrypto.sql file instead of hacking the C code? Because pg_dump by default will create the database as a copy of template0. While there are other ways around the problem (such as putting the SQL functions into template1 and forcing the load to use that instead of template0, or by creating the database, adding the SQL functions, and then loading the data), this was the simplest approach.
Photo of Enigma machine by Marcin Wichary
Learn more about End Point's Postgres Support, Development, and Consulting.
The PGCon "Hall Track"
One of my favorite parts of PGCon is always the "hall track", a general term for the sideline discussions and brainstorming sessions that happen over dinner, between sessions (or sometimes during sessions), and pretty much everywhere else during the conference. This year's hall track topics seemed to be set by the developers' meeting; everywhere I went, someone was talking about hooks for external security modules, MERGE, predicate locking, extension packaging and distribution, or exposing transaction order for replication. Other developers' pet projects that didn't appear in the meeting showed up occasionally, including unlogged tables and range types. Even more than, for instance, the wiki pages describing the things people plan to work on, these interstitial discussions demonstrate the vibrancy of the community and give a good idea just how active our development really is.
This year I shared rooms with Robert Haas, so I got a good overview of his plans for global temporary and unlogged tables. I spent a while with Jeff Davis looking through the code for exclusion constraints and deciding whether it was realistically possible to cause a starvation problem with many concurrent insertions into a table with an exclusion constraint. I didn't spend the time I should have talking with Dimitri Fontaine about his PostgreSQL extensions project, but if time permits I'd like to see if I could help out with it. Nor did I find the time I'd have liked to work on PL/Parrot, but I was glad to meet Jonathan Leto, who has done most of the coding work thus far on that project.
In contrast to other conferences, I didn't have a particular itch of my own to scratch between sessions. During past conferences I've been eager to discuss ideas for multi-column statistics; though that work continues, slowly, time hasn't permitted enough recent development even for the topic to be fresh in my mind, much less worthy of in-depth discussion. This lack of one overriding subject turned out to be a refreshing change, however, as it left the other hall track subjects less filtered.
Finally, it was nice to spend time with co-workers, and in fact to meet (finally) in person the one of the "Greg"s I'd talked to on the phone many times, but never actually met in person. Various engagements in my family or his have gotten in the way in the past. One of the quirks of working for a distributed organization...
Update: Fixed link to developers' meeting wiki page, thanks to comment from roppert
Learn more about End Point's Postgres Support, Development, and Consulting.
Postgres Conference - PGCon2010 - Day Two
Day two of the PostgreSQL Conference started a little later than the previous day in obvious recognition of the fact that many people were up very, very late the night before. (Technically, this is day four, as the first two days consisted of tutorials; this was the second day of "talks").
The first talk I went to was PgMQ: Embedding messaging in PostgreSQL by Chris Bohn. It was well attended, although there were definitely a lot of late-comers and bleary eyes. A tough slot to fill! Chris is from Etsy.com and I've worked with him there, although I had no interaction with the PgMQ project, which looks pretty cool. From the talk description:
PgMQ (PostgreSQL Message Queueing) is an add-on that embeds a messaging client inside PostgreSQL. It supports the AMQP, STOMP and OpenWire messaging protocols, meaning that it can work with all of the major messaging systems such as ActiveMQ and RabbitMQ. PgMQ enables two replication capabilities: "Eventually Consistent" Replication and sharding.
As near as I can tell, "eventually consistent" is the same as "asynchronous replication": the slave won't be the same as the master right away, but will be eventually. As with Bucardo and Slony, the actual lag is very small in practice: a handful of seconds at the most. I like the fact that it supports all those common messaging protocols. Chris mentioned in the talk that it should be possible for other systems like Bucardo to support something similar. I'll have to play around with PgMQ a bit and see about doing just that. :)

The typical post-talk gatherings
The next "talk" was the enigmatically labeled Replication Panel. Enigmatic in this case as it had no description whatsoever. It's a good thing I had decided to check it out anyway (I'm a sucker for any talk related to replication, in case it wasn't obvious yet). I was apparently nominated to be on the panel, representing Bucardo! So much for getting all my speaking done and over with the first day. The panel represented a pretty wide swatch of Postgres replication technologies, and by the people who are very deep in the development of each one. From left to right on a cluster of stools at the front of the room was:
- Londiste (Marko Kreen)
- Slony (Jan Wieck)
- pgpool-II (Tatsuo Ishii)
- Hot standby and Streaming replication (Heikki Linnakangas)
- Bucardo (Greg Sabino Mullane)
- Golconde (Gavin M. Roy)
After a quick one-minute each intro describing who we were and what our replication system was, we took questions from the audience. Rather, Dan Langille played the part of the moderator and gathered written questions from the audience which he read to us, and we each took turns answering. We managed to get through 16 questions. All were interesting, even if some did not apply to all the solutions. Some of the more relevant ones I remember:
"If your replication solution was not available, which of the other replication solutions would you recommend?" This was my favorite question. My answer was: if using Bucardo in multi-master mode, switch to pgpool. If using in master-slave mode, use Slony.
"How will PG 9.0 affect your solution? Will your solution still remain relevant?" This most heavily affects Bucardo, Slony, and Londiste, and we all agreed that we're happy to lose users who simply need a read-only copy of their database. Their remains plenty of use cases that 9.0 will not solve however.
"For multi-master solutions: How are database collisions resolved? Do you recommend your solution for geographically remote locations?" This one is pretty much for me alone. :) I gave a quick overview of Bucardo's built-in conflict resolution systems, and how custom ones built on business logic works. Since Bucardo was originally built to support servers over a non-optimal network, the second part was an easy Yes.
"Is there a way to standardize and reduce the number of replication systems and focus on making the subset more robust, efficient, and versatile?" The general answer was no, as the use cases for all of them are so wildly different. I thought the only possible reduction was to combine Slony and Londiste, as they are very close technically and have pretty much identical use cases.
"How easy is it to switch masters? Are you planning on improving the tools to do so?" With Bucardo, switching is as easy as pointing to a different database if using master-master. However, Bucardo master-slave has no built in support at all for failover (like Slony does). So the answer is "not easy at all" and yes, we want to provide tools to do so.
"What is your biggest bug, problem, or limitation you are fixing now?" All three of the async trigger solutions (Bucardo, Slony, and Londiste) answered "DDL triggers". Which is hopefully coming for 9.1 (stop reading this blog and get to work on that, Jan).
All in all, I really liked the panel, and I think the audience did as well. Hopefully we'll see more things like at future conferences. Since we did not know the questions before hand, and took everything from the audience, it was the polar opposite of someone giving a talk with prepared slides.
I had some people come up to me afterwards to ask for more details about Bucardo, because (as they pointed out), it's the only multi-master replication system for Postgres (not technically true, as pg-pool and rubyrep provide multi-master use cases as well, but the former is synchronous and fairly complex, while the latter is very new and lacking some features). Maybe next year I should give a whole talk on Bucardo rather than just blabbing about it here on the blog. :)
After that, I popped into the Check Please! What Your Postgres Databases Wishes You Would Monitor talk by Robert Treat (who I also used to work with). It was a good talk, but pretty much review for me, as watching over and monitoring databases is what I spend a lot of my time doing. :) Here's the description:
Compared to many proprietary systems, Postgres tends to be pretty straight forward to run. However, if you want to get the most from your database, you shouldn't just set it and forget it, you need to monitor a few key pieces of information to keep performance going. This talk will review several key metrics you should be aware of, and explain under which scenarios you may need additional monitoring.
The final talk I went to was Deploying and testing triggers and functions in multiple databases by Norman Yamada. This was an interesting talk for me because he was using a lot of the code from the same_schema action in the check_postgres program to do the actual comparison. Indeed, I made some patches while at the conference to allow for better index comparison's at Norman's request. I also managed to get some work done on tail_n_mail and Bucardo while there - something about being surrounded by all that Postgres energy made me productive despite having very little free time.
I had to catch an early flight, and was not able to catch the final talk slot of the day, nor the closing session or the BOFs that night. Hopefully someone who did catch those will blog about it and let me know how it went. I hear the t-shirt we signed at the developer's meeting went for a sweet ransom.
If you went to PgCon, I have two requests for you.
First, please fill out the feedback for each talk you went to. It takes less than a minute per talk, and is invaluable for both the speakers and the conference organizers. Second, please blog about PgCon. It's helpful for people who did not get to go to see the conference through other people's eyes. And do it now, while things are still fresh.
If you did not go to PgCon, I have one request for you: go next year! Perhaps next year at PgCon 2011 we'll break the 200 person mark. Thanks to Dan Langille as always for creating PgCon and keeping it running smooth year after year.
Learn more about End Point's Postgres Support, Development, and Consulting.
PostgreSQL Conference - PGCon 2010 - Day One
The first day of talks for PGCon 2010 is now over, here's a recap of the parts that I attended.
On Wednesday, the developer's meeting took place. It was basically 20 of us gathered around a long conference table, with Dave Page keeping us to a strict schedule. While there were a few side conversations and contentious issues, overall we covered an amazing amount of things in a short period of time, and actually made action items out of almost all of them. My favorite *decision* we made was to finally move to git, something myself and others have been championing for years. The other most interesting parts for me were the discussion of what features we will try to focus on for 9.1 (it's an ambitious list, no doubt), and DDL triggers! It sounds like Jan Wieck has already given this a lot of thought, so I'm looking forward to working with him in implementing these triggers (or at least
nagging him about it if he slows down). These triggers will be immensely useful to replication systems like Bucardo and Slony, which implement DDL replication in a very manual and unsatisfactory way. These triggers will not be like the current triggers, in that they will not be directly attached to system tables. Instead, they will be associated with certain DDL events, such that you could have a trigger on any CREATE events (or perhaps also allowing something finer grained such as a trigger on a CREATE TABLE event). Whenever it comes in, I'll make sure that Bucardo supports it, of course!
The first day of talks kicked off the the plenary by Gavin Roy called "Perspectives on NoSQL" (description and slides are available). Gavin actually took the time to *gasp* research the topic, and gave a quick rundown of some of the more popular "NoSQL" solutions, including CouchDB, MongoDB, Cassandra, Project Voldemort, Redis, and Tokyo Tyrant. He then benchmarked all of them against Postgres for various tasks - and did it against both "regular safe" Postgres and "running with scissors" fsync-off Postgres. The results? Postgres scales, very well, and more than holds it own against the NoSQL newcomers. MongoDB did surprisingly well: see the slides for the details. His slides also had the unfortunate portmanteau of "YeSQL", which only helps to empahsize how silly our "PostgreSQL" name is. :)
The next talk was Postgres (for non-Postgres people) by Greg Sabino Mullane (me!). Unlike previous years, my slides are already online. Yes, at first blush, it seems a strange talk to give at a conference like this, but we always have a good number of people from other database systems that are considering Postgres, are in the process of migrating to Postgres, or are just new to Postgres. The talk was in three parts: the first was about the mechanics of migrating your application to Postgres: the data types that Postgres uses, how we implement indexes, the best way to migrate your data, and many other things, with an eye towards common migration problems (especially when coming from MySQL). The second part of the talk discussed some of the quirks of Postgres people coming from DB2, Oracle, etc. should be aware of. Some things discussed: how Postgres does MVCC and need for vacuum, our really smart planner and lack of hints, the automatic (and against the spec) lowercasing, and our concept of schemas. I also touched on what I see as some of our drawbacks: tuned for a toaster, no true in place upgrade, the unpronounceable name, the lack of marketing. and what some of our perceived-but-not-real drawbacks are: lack of replication, poor speed. What would a list of drawbacks be without a list of strengths?: transactional DDL, very friendly and helpful community, PostGIS, authentication options, awesome query planner, the ability to create your own custom database objects, and our distributed nature that ensures the project cannot be bought out or destroyed. The last part of the talk went over the Postgres project itself: the community, the developers, the philosophy, and how it all fits together. I ran out of time so did not get to tell my "longest patch process ever" story for \dfS (six years!) but I don't think I missed anything important and gave time for some questions.
The next talk was Hypothetical Indexes towards self-tuning in PostgreSQL by Sergio Lifschitz. In the words of Sergio:
Hypothetical indexes are simulated index structures created solely in the database catalog. This type of index has no physical extension and, therefore, cannot be used to answer actual queries. The main benefit is to provide a means for simulating how query execution plans would change if the hypothetical indexes were actually created in the database. This feature is quite useful for database tuners and DBAs.
It was a very interesting talk. Robert Haas asked him to put it in the PostgreSQL license so we can easily put it into the project as needed. Sergio promised to make the change immediately after the talk!
After lunch, the next talk was pg_statsinfo - More useful statistics information for DBAs by Tatsuhito Kasahara. This talk was a little hard to follow along, but had some interesting ideas about monitoring Postgres, a lot of which overlapped with some of my projects such as tail_n_mail and check_postgres.
The next talk was Forensic Analysis of Corrupted Databases by Greg Stark. This was a neat little talk; many of the error messages he displayed were all too familiar to me. It was nice overview of how to track down the exact location of a problem in a corrupted database, and some strategies for fixing it, including the old "using dd to write things from /dev/zero directly into your Postgres files" trick. There was even a discussion about the possibility of zeroing out specific parts of a page header, with the consensus that it would not work as one would hope.
After a quick hacky sack break with Robert Treat and some Canadian locals, I went to the final real talk of the day: The PostgreSQL Query Planner by Robert Haas. I had seen this talk recently, but wanted to see it again as I missed some of the beginning of the talk when I saw it at Pg East 2010 in Philly. Robert gave a good talk, and was very good at repeating the audience's questions. I didn't learn all that much, but it was a very good overview of the planner, including some of the new planner tricks (such as join removal) in 9.0 and 9.1.
After that, the lightning talks started. I really like lightning talks, and thankfully they weren't held on the last day of the conference this time (a common mistake). The MC was Selena Deckelmann, who did a great job of making sure all the slides were gathered up beforehand, and strictly enforced the five minute time limit. The list of slides is on the Postgres wiki. I talked on my latest favorite project, tail_n_mail - the slides are available on the wiki. I didn't make it through all my slides, so if you were at the talks, check out the PDF for the final two that were not shown. There seemed to be good interest in the project, and I had several people tell me afterwards they would try it out.
The night ended with the EnterpriseDB sponsored party. I spoke to a lot of people there, about replication, PITR scripts, log monitoring, the problem with a large number of inherited objects, and many other topics. Note to EDB: I don't think that venue is going to scale, as the conference gets bigger each year! The total number of people at the conference this year was 184, a new record.
A very good first day: I learned a lot, met new people, saw old friends, and hopefully sold Postgres to some non-Postgres people :). I also managed to git push some changes to tail_n_mail, check_postgres, and Bucardo. It's hard to say no to feature requests when someone asks you in person. :)
Learn more about End Point's Postgres Support, Development, and Consulting.
PostgreSQL switches to Git
Looks like the Postgres project is finally going to be bite the bullet and switch to git as the canonical VCS. Some details are yet to be hashed out, but the decision has been made and a new repo will be built soon. Now to lobby to get that commit-with-inline-patches list to be created...
PostgreSQL 8.4 on RHEL 4: Teaching an old dog new tricks
So a client has been running a really old version of PostgreSQL in production for a while. We finally got the approval to upgrade them from 7.3 to the latest 8.4. Considering the age of the installation, it should come as little surprise that they had been running a similarly ancient OS: RHEL 4.
Like the installed PostgreSQL version, RHEL 4 is ancient -- 5 years old. I anticipated that in order to get us to a current version of PostgreSQL, we'd need to resort to a source build or rolling our own PostgreSQL RPMs. Neither approach was particularly appealing.
While the age/decrepitude of the current machine's OS came as little surprise, what did come as a surprise was that there were supported RPMs available for RHEL 4 in the community yum rpm repository, located at http://yum.pgrpms.org/8.4/redhat/rhel-4-i386/repoview/ (modulo your architecture of choice).
In order to get things installed, I followed the instructions for installing the specific yum repo. There were a few seconds where I was confused because the installation command was giving a "permission denied" error when attempting to install the 8.4 PGDG rpm as root. A little brainstorming and a lsattr later revealed that a previous administrator, apparently in the quest for über-security, had performed a chattr +i on the /etc/yum.repo.d directory.
Evil having been thwarted, in the interest of über-usability I did a quick chattr -i /etc/yum.repo.d and installed the PGDG rpm. Away we went. From that point, the install was completely straightforward; I had a PostgreSQL 8.4.4 system running in no time, and could finally get off that 7.3 behemoth. Now to talk my way into an OS upgrade...
Learn more about End Point's Postgres Support, Development, and Consulting.
Finding the PostgreSQL version - without logging in!
Metasploit used the error messages given by a PostgreSQL server to find out the version without actually having to log in and issue a "SELECT version()" command. The original article is at http://blog.metasploit.com/2010/02/postgres-fingerprinting.html and is worth a read. I'll wait.
The basic idea is that because version 3 of the Postgres protocol gives you the file and the line number in which the error is generated, you can use the information to figure out what version of Postgres is running, as the line numbers change from version to version. In effect, each version of Postgres reveals enough in its error message to fingerprint it. This was a neat little trick, and I wanted to explore it more myself. The first step was to write a quick Perl script to connect and get the error string out. The original Metasploit script focuses on failed login attempts, but after some experimenting I found an easier way was to send an invalid protocol number (Postgres expects "2.0" or "3.0"). Sending a startup packet with an invalid protocol of "3.1" gave me back the following string:
E|SFATALC0A000Munsupported frontend protocol 3.1: server supports 1.0 to 3.0Fpostmaster.cL1507RProcessStartupPacket
The important part of the string was the parts indicating the file and line number:
Fpostmaster.cL1507
In this case, we can clearly see that line 1507 of postmaster.c was throwing the error. After firing up a few more versions of Postgres and recording the line numbers, I found that all versions since 7.3 were hitting the same chunk of code from postmaster.c:
/* Check we can handle the protocol the frontend is using. */
if (PG_PROTOCOL_MAJOR(proto) <> PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST) ||
(PG_PROTOCOL_MAJOR(proto) == PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST) &&
PG_PROTOCOL_MINOR(proto) > PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST)))
ereport(FATAL,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("unsupported frontend protocol %u.%u: server supports %u.0 to %u.%u",
PG_PROTOCOL_MAJOR(proto), PG_PROTOCOL_MINOR(proto),
PG_PROTOCOL_MAJOR(PG_PROTOCOL_EARLIEST),
PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST),
PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST))));
Line numbers were definitely different across major versions of Postgres (e.g. 8.2 vs. 8.3), and were even different sometimes across revisions. Rather than fire up every possible revision of Postgres and run my program against it, I simply took advantage of the cvs tags (aka symbolic names) and did this:
cvs update -rREL8_3_0 -p postmaster.c | grep -Fn 'LATEST))))'
This showed me that the string occurred on line 1497 of postmaster.c. I created a Postgres instance and verified that the line number was the same. At that point, it was a simple matter of making a bash script to grab all releases since 7.3 and build up a comprehensive list of when that line changed from version to version.
Once that was done, I rolled the whole thing up into a new Perl script called "detect_postgres_version.pl". Here's the script, broken into pieces for explanation. A link to the entire script is at the bottom of the post.
First, we do some standard Perl script things and read in the __DATA__ section at the bottom of the script, which lists at which version the message has changed:
#!/usr/bin/env perl
## Quickly and roughly determine what version of Postgres is running
## greg@endpoint.com
use strict;
use warnings;
use IO::Socket;
use Data::Dumper;
use Getopt::Long;
## __DATA__ looks like this: filname / line / version when it changed
## postmaster.c 1287 7.4.0
## postmaster.c 1293 7.4.2
## postmaster.c 1293 7.4.29
##
## postmaster.c 1408 8.0.0
## postmaster.c 1431 8.0.2
## Build our hash of file-and-line to version matches
my %map;
my ($last,$lastmin,$lastline) = ('',0,0);
while () {
next if $_ !~ /(\w\S+)\s+(\d+)\s+(.+)/;
my ($file,$line,$version) = ($1,$2,$3);
die if $version !~ /(\d+)\.(\d+)\.(\d+)/;
my ($vmaj,$vmin,$vrev) = ($1,$2,$3);
my $current = "$file|$vmaj|$vmin";
if ($current eq $last) {
my ($lfile,$lmaj,$lmin) = split /\|/ => $last;
for (my $x = $lastmin+1 ; $x<$vrev; $x++) {
push @{$map{$file}{$lastline}}
=> ["$lmaj.$lmin","$lmaj.$lmin.$x"];
}
}
push @{$map{$file}{$line}} => ["$vmaj.$vmin",$version];
$last = $current;
$lastmin = $vrev;
$lastline = $line;
}
Next, we allow a few options to the script: port and host. We'll default to a Unix socket if the host is not set, and default to port 5432 if none is given:
## Read in user options and set defaults
my %opt;
GetOptions(\%opt,
'port=i',
'host=s',
);
my $port = $opt{port} || 5432;
my $host = $opt{host} || '';
We're ready to connect, using the very standard IO::Socket module. If the host starts with a slash, we assume this is the unix_socket_directory and replace the default '/tmp' location:
## Start the connection, either unix or tcp
my $server;
if (!$host or !index $host, '/') {
my $path = $host || '/tmp';
$server = IO::Socket::UNIX->new(
Type => IO::Socket::SOCK_STREAM,
Peer => "$path/.s.PGSQL.$port",
) or die "Could not connect!: $@";
}
else {
$server = IO::Socket::INET->new(
PeerAddr => $host,
PeerPort => $port,
Proto => 'tcp',
Timeout => 3,
) or warn "Could not connect!: $@";
}
Now we're ready to actually send something over our new socket. Postgres expects the startup packet to be in a certain format. We'll follow that format, but send it an invalid protocol number, 3.1. The rest of the information does not really matter, but we'll also tell it we're connecting as user "pg". Finally, we read back in the message, extract the file and line number, and spit them back out to the user:
## Build and sent the packet
my $packet = pack('nn', 3,1) . "user\0pg\0\0";
$packet = pack('N', length($packet) + 4). $packet;
$server->send($packet, 0);
## Get the message back and extract the filename and line number
my $msg;
recv $server, $msg, 1000, 0;
if ($msg !~ /F([\w\.]+)\0L(\d+)/) {
die "Could not find a file and line from error message: $msg\n";
}
my ($file,$line) = ($1,$2);
print "File: $file Line: $line\n";
Finally, we try to map the file name and line number we received back to the version of PostgreSQL it came from. If the file is not recognized, or the line number is not known, we bail out early:
$map{$file}
or die qq{Sorry, I do not know anything about the file "$file"\n};
$map{$file}{$line}
or die qq{Sorry, I do not know anything about line $line of file "$file"\n};
If there is only one result for this line and file number, we can state what it is and exit.
my $result = $map{$file}{$line};
if (1 == @$result) {
print "Most likely Postgres version $result->[0][1]\n";
exit;
}
In most cases, though, we don't know the exact version down to the revision after the second dot, so we'll state what the major version is, and all the possible revisions:
## Walk through and figure out which versions it may be.
## For now, we know that the major version does not overlap
print "Most likely Postgres version $result->[0][0]\n";
print "Specifically, one of these:\n";
for my $row (@$result) {
print " Postgres version $row->[1]\n";
}
exit;
The only thing left is the DATA section, which I'll show here to be complete:
__DATA__ ## Format: filename line version postmaster.c 1167 7.3.0 postmaster.c 1167 7.3.21 postmaster.c 1287 7.4.0 postmaster.c 1293 7.4.2 postmaster.c 1293 7.4.29 postmaster.c 1408 8.0.0 postmaster.c 1431 8.0.2 postmaster.c 1441 8.0.5 postmaster.c 1445 8.0.6 postmaster.c 1439 8.0.7 postmaster.c 1443 8.0.9 postmaster.c 1445 8.0.14 postmaster.c 1445 8.0.25 postmaster.c 1449 8.1.0 postmaster.c 1450 8.1.1 postmaster.c 1454 8.1.2 postmaster.c 1448 8.1.3 postmaster.c 1452 8.1.4 postmaster.c 1448 8.1.9 postmaster.c 1454 8.1.10 postmaster.c 1454 8.1.21 postmaster.c 1432 8.2.0 postmaster.c 1437 8.2.1 postmaster.c 1440 8.2.5 postmaster.c 1432 8.2.17 postmaster.c 1497 8.3.0 postmaster.c 1507 8.3.8 postmaster.c 1507 8.3.11 postmaster.c 1570 8.4.0 postmaster.c 1621 8.4.1 postmaster.c 1621 8.4.4 postmaster.c 1664 9.0.0
(Because version 9.0 is not released yet, its line number may still change.)
I found this particular protocol error to be a good one because there is no overlap of line numbers across major versions. Of the approximately 125 different versions released since 7.3.0, only 6 are unique enough to identify to the exact revision. That's okay for this iteration of the script. If you wanted to know the exact revision, you could try other errors, such as an invalid login, as the metasploit code does.
The complete code can be read here: detect_postgres_version.pl
I'll be giving a talk later on this week at PgCon 2010, so say hi if you see me there. I'll probably be giving a lightning talk as well.
Learn more about End Point's Postgres Support, Development, and Consulting.


Benchmark White Paper















