Database Blog Archive
RailsAdmin Import: Part 2
I recently wrote about importing data in RailsAdmin. RailsAdmin is a Rails engine that provides a nice admin interface for managing your data, which comes packed with configuration options.
In a recent Ruby on Rails ecommerce project, I've been using RailsAdmin, Piggybak (a Rails ecommerce gem supported by End Point), and have been building out custom front-end features such as advanced search and downloadable product support. When this client came to End Point with the project, we offered several options for handling data migration from a legacy system to the new Rails application:
- Create a standard migration file, which migrates data from the existing legacy database to the new data architecture. The advantage with this method is that it requires virtually no manual interaction for the migration process. The disadvantage with this is that it's basically a one-off solution and would never be useful again.
- Have the client manually enter data. This was a reasonable solution for several of the models that required 10 or fewer entries, but not feasible for the tables containing thousands of entries.
- Develop import functionality to plug into RailsAdmin which imports from CSV files. The advantage to this method is that it could be reused in the future. The disadvantage with this method is that data exported from the legacy system would have to be cleaned up and formatted for import.
The client preferred option #3. Using a quick script for generating custom actions for RailsAdmin, I developed a new gem called rails_admin_import to handle import that could be plugged into RailsAdmin. Below are some technical details on the generic import solution.
ActiveSupport::Concern
Using ActiveSupport::Concern, the rails_admin_import gem extends ActiveRecord::Base to add the following class methods:
- import_fields: Returns an array of fields that will be included in the import, excluding :id, :created_at, and :updated_at, belongs_to fields, and file fields.
- belongs_to_fields: Returns an array of fields with belongs_to relationships to other models.
- many_to_many_fields: Returns an array of fields with has_and_belongs_to_many relationships to other models.
- file_fields: Returns an array of fields that represent data for Paperclip attached files.
- run_import: Method for running the actual import, receives request params.
And the following instance methods:
- import_files: sets attached files for object
- import_belongs_to_data: sets belongs_to associated data for object
- import_many_to_many_data: sets many_to_many associated data for object
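As a rough illustration of how class methods like these can be mixed into a model, here is a minimal plain-Ruby sketch. It does not use ActiveSupport::Concern or ActiveRecord; the `self.included` hook stands in for what Concern automates, and the module name, the `column_names` list, and the hard-coded `belongs_to_fields` and `file_fields` values are all hypothetical.

```ruby
# Plain-Ruby sketch: ActiveSupport::Concern automates this
# "extend ClassMethods when included" pattern.
module ImportSketch
  SKIPPED = [:id, :created_at, :updated_at].freeze

  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Hypothetical stand-ins for what ActiveRecord reflection would supply.
    def belongs_to_fields
      [:state_id, :country_id]
    end

    def file_fields
      [:avatar]
    end

    # Every column except id, timestamps, belongs_to keys, and file fields.
    def import_fields
      column_names - SKIPPED - belongs_to_fields - file_fields
    end
  end
end

class User
  include ImportSketch

  # Hypothetical column list; a real model would read this from the schema.
  def self.column_names
    [:id, :name, :email, :favorite_color, :state_id, :country_id,
     :avatar, :created_at, :updated_at]
  end
end
```

Calling `User.import_fields` on this sketch leaves only the plain attribute columns.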
The general approach here is that the import of files, belongs_to, many_to_many relationships, and standard fields makes up the import process for a single object. The run_import method collects success and failure messages for each object import attempt and those results are presented to the user. A regular ActiveRecord save method is called on the object, so the existing validation of objects during each save applies.
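The per-object flow described above might be sketched as follows. This is not the gem's code: `ImportRecord`, its validation rule, and `run_import_sketch` are hypothetical, and the `save` method only imitates ActiveRecord's convention of returning false when validation fails.

```ruby
# Hypothetical record type: save returns false when validation fails,
# mirroring how ActiveRecord's save behaves.
ImportRecord = Struct.new(:name, :email) do
  def errors
    @errors ||= []
  end

  def save
    errors << "Email can't be blank" if email.to_s.strip.empty?
    errors.empty?
  end
end

# Build one object per row, save it, and sort the outcome into
# success or failure messages for display.
def run_import_sketch(rows)
  results = { success: [], error: [] }
  rows.each do |row|
    record = ImportRecord.new(row[:name], row[:email])
    if record.save
      results[:success] << "Imported #{record.name}"
    else
      results[:error] << "Failed to import #{record.name}: #{record.errors.join(', ')}"
    end
  end
  results
end
```

A failed row does not stop the loop, so one bad record never blocks the rest of the file.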
Working with Associated Data
One of the tricky parts here is how to handle import of fields representing associations. Given a user model that belongs to a state, country, and has many roles, how would one decide what state, country, or role value to include in the import?
I've solved this by including a dropdown to select the attribute used for mapping in the form. Each of the dropdowns contains a list of model attributes that are used for association mapping. A user can then select the associated mappings when they upload a file. In a real-life situation, I may import the state data via abbreviation, country via display name (e.g. "United States", "Canada") and role via the role name (e.g. "admin"). My data import file might look like this:
| name | email | favorite_color | state | country | role |
| Steph Skardal | steph@endpoint.com | blue | CO | United States | admin |
| Aleks Skardal | aleksskardal@gmail.com | green | | Norway | user |
| Roger Skardal | roger@gmail.com | tennis ball yellow | UT | United States | dog |
| Milton Skardal | milton@gmail.com | kibble brown | UT | United States | dog |
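To make the mapping idea concrete, the lookup for a single association could look roughly like this. The structs and in-memory lists are hypothetical stand-ins for the associated models, and `lookup_association` is an illustrative name, not a method of the gem.

```ruby
State   = Struct.new(:name, :abbreviation)
Country = Struct.new(:name)

# Hypothetical in-memory stand-ins for the associated tables.
STATES    = [State.new("Colorado", "CO"), State.new("Utah", "UT")].freeze
COUNTRIES = [Country.new("United States"), Country.new("Norway")].freeze

# Find the associated record whose chosen attribute matches the CSV value.
# Returns nil when the cell is blank or nothing matches.
def lookup_association(records, attribute, value)
  return nil if value.nil? || value.strip.empty?
  records.find { |record| record.public_send(attribute) == value.strip }
end

# The attribute (:abbreviation, :name) is what the user picks in the dropdown.
state   = lookup_association(STATES, :abbreviation, "CO")
country = lookup_association(COUNTRIES, :name, "United States")
```

A blank cell, such as the missing state in the second data row, simply yields no association.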
Many to Many Relationships
Many to many relationships are handled by allowing multiple columns in the CSV to correspond to the imported data. For example, there may be two columns for role on the user import, where users may be assigned to multiple roles. This may not be suitable for data with a large number of many to many assignments.
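One way to gather the values from repeated columns is sketched below. It assumes the header row and each data row arrive as parallel arrays, as a CSV parser would produce them; `values_for_header` is a hypothetical helper name.

```ruby
# Collect every value from columns that share the given header name.
# headers and row are parallel arrays, as a CSV parser would yield them.
def values_for_header(headers, row, name)
  headers.each_index
         .select { |i| headers[i] == name }
         .map { |i| row[i] }
         .reject { |value| value.nil? || value.strip.empty? }
end

headers = ["name", "role", "role"]
roles   = values_for_header(headers, ["Steph Skardal", "admin", "editor"], "role")
```

Empty cells are dropped, so a user with a single role can leave the second column blank.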
Import of File Fields
In this scenario, I've chosen to use open-uri to request existing files from a URL. The CSV must contain the URL for that file to be imported. The import process downloads the file and attaches it to the imported object.
self.class.file_fields.each do |key|
  if map[key] && !row[map[key]].nil?
    begin
      row[map[key]] = row[map[key]].gsub(/\s+/, "")
      format = row[map[key]].match(/[a-z0-9]+$/)
      open("#{Rails.root}/tmp/uploads/#{self.permalink}.#{format}", 'wb') { |file| file << open(row[map[key]]).read }
      self.send("#{key}=", File.open("#{Rails.root}/tmp/uploads/#{self.permalink}.#{format}"))
    rescue Exception => e
      self.errors.add(:base, "Import error: #{e.inspect}")
    end
  end
end
If the file request fails, an error is added to the object and presented to the user. This method may not be suitable for handling files that do not currently exist on a web server, but it was suitable for migrating a legacy application.
Configuration: Display
Following RailsAdmin's example for setting configurations, I've added the ability to allow the import display to be set for each model.
config.model User do
  label :name
end
The above configuration will yield success and error messages that identify each record by user.name.
Configuration: Excluded Fields
In addition to allowing a configurable display option, I've added the configuration for excluding fields.
config.model User do
  excluded_fields do
    [:reset_password_token, :reset_password_sent_at, :remember_created_at,
     :sign_in_count, :current_sign_in_at, :last_sign_in_at, :current_sign_in_ip,
     :last_sign_in_ip]
  end
end
The above configuration will exclude the specified fields during the import, and they will not display on the import page.
Configuration: Additional Fields and Additional Processing
Another piece of functionality that I found necessary for various imports was to hook in additional import functionality. Any model can have an instance method before_import_save that accepts the row of CSV data and map of CSV keys to perform additional tasks. For example:
def before_import_save(row, map)
  self.created_nested_items(row, map)
end
The above method will create nested items during the import process. This simple extensibility allows for additional data to be handled upon import outside the realm of has_and_belongs_to_many and belongs_to relationships.
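For illustration, the importer side of such a hook could be as small as the sketch below. How the gem actually invokes `before_import_save` is not shown in this post, so `save_with_import_hook` and the `Pet` class are assumptions made for the example.

```ruby
# Call the model's before_import_save hook, if it defines one,
# just before saving. The importer-side code here is an assumption.
def save_with_import_hook(object, row, map)
  object.before_import_save(row, map) if object.respond_to?(:before_import_save)
  object.save
end

# Hypothetical model that records a nickname from an extra CSV column.
class Pet
  attr_reader :nickname, :saved

  def before_import_save(row, map)
    @nickname = row[map[:nickname]]
  end

  def save
    @saved = true
  end
end
```

Models that do not define the hook are saved unchanged, which keeps the extension point optional.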
Fields for additional nested data can be defined with the extra_fields configuration, and are shown on the import page.
config.model User do
  extra_fields do
    [:field1, :field2, :field3, :field4]
  end
end
Hooking into RailsAdmin
As I mentioned above, I used a script to generate this Engine. Using RailsAdmin configurable actions, import must be added as an action:
config.actions do
  dashboard
  index
  ...
  import
end
And CanCan settings must be updated to allow for import if applicable, e.g.:
cannot :import, :all
can :import, User
Conclusion
My goal in developing this tool was to produce reusable functionality that could easily be applied to multiple models with different import needs, and to use this tool across Rails applications. I've already used this gem in another Rails 3.1 project to quickly import data that would otherwise be difficult to deal with manually. The combination of association mapping and configurability produces a flexibility that encourages reusability.
Feel free to review or check out the code here, or read more about End Point's services here.
Protecting and auditing your secure PostgreSQL data

PostgreSQL functions can be written in many languages. These languages fall into two categories, 'trusted' and 'untrusted'. Trusted languages cannot do things "outside of the database", such as writing to local files, opening sockets, sending email, connecting to other systems, etc. Two such languages are PL/pgSQL and PL/Perl. For "untrusted" languages, such as PL/PerlU, all bets are off, and they have no limitations placed on what they can do. Untrusted languages can be very powerful, and sometimes dangerous.
One of the reasons untrusted languages can be considered dangerous is that they can cause side effects outside of the normal transactional flow that cannot be rolled back. If your function writes to local disk, and the transaction then rolls back, the changes on disk are still there. Working around this is extremely difficult, as there is no way to detect when a transaction has rolled back at the level where you could, for example, undo your local disk changes.
However, there are times when this effect can be very useful. For example, in a recent thread on the PostgreSQL "general" mailing list (aka pgsql-general), somebody asked for a way to audit SELECT queries into a logging table that would survive someone doing a ROLLBACK. In other words, if you had a function named weapon_details() and wanted to have that function log all requests to it by inserting to a table, a user could simply run the query, read the data, and then rollback to thwart the auditing:
BEGIN;
SELECT weapon_details('BFG 9000'); -- also inserts to an audit table
ROLLBACK; -- inserts to the audit table are now gone!
Certainly there are other ways to track who is using this query, the most obvious being to enable full Postgres logging (by setting log_statement = 'all' in your postgresql.conf file). However, extracting that information from logs is no fun, so let's find a way to make that INSERT stick, even if the surrounding transaction is rolled back.
Stepping back for one second, we can see there are actually two problems here: restricting access to the data, and logging that access somewhere. The ultimate access restriction is to simply force everyone to go through your custom interface. However, in this example, we will assume that someone has psql access and needs to be able to run ad hoc SQL queries, as well as be able to BEGIN, ROLLBACK, COMMIT, etc.
Let's assume we have a table with some Very Important Data inside of it. Further, let's establish that regular users can only see some of that data, and that we need to know who asked for what data, and when. For this example, we will create a normal user named Alice:
postgres=> CREATE USER alice;
CREATE ROLE
We need a way to tell which rows are suitable for people like Alice to view. We will set up a quick classification scheme using the nifty ENUM feature of PostgreSQL:
postgres=> CREATE TYPE classification AS ENUM (
'unclassified',
'restricted',
'confidential',
'secret',
'top secret'
);
CREATE TYPE
Next, as a superuser, we create the table containing sensitive information, and populate it:
postgres=> CREATE TABLE weapon (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
cost TEXT NOT NULL,
security_level CLASSIFICATION NOT NULL,
description TEXT NOT NULL DEFAULT 'a fine weapon'
);
NOTICE: CREATE TABLE will create implicit sequence "weapon_id_seq" for serial column "weapon.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "weapon_pkey" for table "weapon"
CREATE TABLE
postgres=> INSERT INTO weapon (name,cost,security_level) VALUES
('Crowbar', 10, 'unclassified'),
('M9', 200, 'restricted'),
('M16A2', 300, 'restricted'),
('M4A1', 400, 'restricted'),
('FGM-148 Javelin', 700, 'confidential'),
('Pulse Rifle', 50000, 'secret'),
('Zero Point Energy Field Manipulator', 'unknown', 'top secret');
INSERT 0 7
We don't want anyone but ourselves to be able to access this table, so for safety, we make some explicit revocations. We'll examine the permissions before and after we do this:
postgres=> \dp weapon
Access privileges
Schema | Name | Type | Access privileges | Column access privileges
--------+--------+-------+-------------------+--------------------------
public | weapon | table | |
postgres=> REVOKE ALL ON TABLE weapon FROM public;
REVOKE
postgres=> \dp weapon
Access privileges
Schema | Name | Type | Access privileges | Column access privileges
--------+--------+-------+---------------------------+--------------------------
public | weapon | table | postgres=arwdDxt/postgres |
As you can see, what the REVOKE really does is remove the implicit "no permission" and grant explicit permissions to only the postgres user to view or modify the table. Let's confirm that Alice cannot do anything with that table:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon;
ERROR: permission denied for relation weapon
postgres=> UPDATE weapon SET id = id;
ERROR: permission denied for relation weapon
Alice does need to have access to parts of this table, so we will create a "wrapper function" that will query the table for us and return some results. By declaring this function as SECURITY DEFINER, it will run as if the person who created the function invoked it - in this case, the postgres user. For this example, we'll be letting Alice see the "cost and description" of exactly one item at a time. Further, we are not going to let her (or anyone else using this function) view certain items. Only those items classified as "confidential" or lower can be viewed (i.e. "confidential", "restricted", or "unclassified"). Here's the first version of our function:
postgres=> CREATE LANGUAGE plperlu;
CREATE LANGUAGE
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$
use strict;
use warnings;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
CREATE FUNCTION
The above should be fairly self-explanatory. We are using PL/Perl's built-in database access functions, such as spi_prepare, to do the actual querying. Let's confirm that this works as it should for Alice:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon_details('crowbar');
name | cost | description
---------+------+---------------
Crowbar | 10 | a fine weapon
(1 row)
postgres=> SELECT * FROM weapon_details('anvil');
name | cost | description
------+------+-------------
(0 rows)
postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR: Sorry, you are not allowed to view information on that weapon!
CONTEXT: PL/Perl function "weapon_details"
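The three outcomes above (a row, no rows, an error) follow from comparing ENUM values in their declaration order. As a rough model of that logic outside the database, here is a plain-Ruby sketch; it is not PL/Perl and does not touch PostgreSQL, and the in-memory `WEAPONS` hash and the method name are hypothetical.

```ruby
# Declaration order of the classification ENUM from the text.
LEVELS = ["unclassified", "restricted", "confidential", "secret", "top secret"].freeze

# Hypothetical in-memory rows standing in for the weapon table.
WEAPONS = {
  "crowbar"     => { name: "Crowbar", cost: "10", level: "unclassified" },
  "pulse rifle" => { name: "Pulse Rifle", cost: "50000", level: "secret" }
}.freeze

class NotAllowed < StandardError; end

# Mirrors the function's three outcomes: nil for "not found",
# an exception for "not allowed", otherwise the visible columns.
def weapon_details_sketch(name, max_level = "confidential")
  row = WEAPONS[name.strip.downcase]
  return nil if row.nil?
  if LEVELS.index(row[:level]) > LEVELS.index(max_level)
    raise NotAllowed, "Sorry, you are not allowed to view information on that weapon!"
  end
  { name: row[:name], cost: row[:cost] }
end
```

Because the comparison uses position in the list, adding a new level only requires inserting it at the right place in the declaration.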
Now that we have solved the restricted access problem, let's move on to the auditing. We will create a simple table to hold information about who accessed what and when:
postgres=> CREATE TABLE data_audit (
tablename TEXT NOT NULL,
arguments TEXT NULL,
results INTEGER NULL,
status TEXT NOT NULL DEFAULT 'normal',
username TEXT NOT NULL DEFAULT session_user,
txntime TIMESTAMPTZ NOT NULL DEFAULT now(),
realtime TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()
);
CREATE TABLE
The 'tablename' column simply records which table they are getting data from. The 'arguments' is a free-form field describing what they were looking for. The 'results' column shows how many matching rows were found. The 'status' column will be used primarily to log unusual requests, such as the case where Alice looks for a forbidden item. The 'username' column records the name of the user doing the searching. Because we are using functions with SECURITY DEFINER set, this needs to be session_user, not current_user, as the latter will switch to 'postgres' within the function, and we want to log the real caller (e.g. 'alice'). The final two columns tell us when the current transaction started, and the exact time when an entry was made inside of this table. As a first attempt, we'll have our function do some simple inserts to this new data_audit table:
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$
use strict;
use warnings;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};
## Log this request
$SQL = 'INSERT INTO data_audit(tablename,arguments,results,status)
VALUES ($1,$2,$3,$4)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
$sth = spi_prepare($SQL, 'TEXT', 'TEXT', 'INTEGER', 'TEXT');
spi_exec_prepared($sth, 'weapon', $name, $rv->{processed}, $status);
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
However, this fails the case pointed out in the original poster's email about viewing the data within a transaction that is then rolled back. It also fails to work at all when a forbidden item is requested, as that insert is rolled back by the die() call:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon_details('crowbar');
name | cost | description
---------+------+---------------
Crowbar | 10 | a fine weapon
(1 row)
postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR: Sorry, you are not allowed to view information on that weapon!
CONTEXT: PL/Perl function "weapon_details"
postgres=> BEGIN;
BEGIN
postgres=> SELECT * FROM weapon_details('m9');
name | cost | description
------+------+---------------
M9 | 200 | a fine weapon
(1 row)
postgres=> ROLLBACK;
ROLLBACK
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=> SELECT * FROM data_audit \x \g
Expanded display is on.
-[ RECORD 1 ]----------------------------
tablename | weapon
arguments | crowbar
results | 1
status | normal
username | alice
txntime | 2012-01-30 17:37:39.497491-05
realtime | 2012-01-30 17:37:39.545891-05
How do we get around this? We need a way to commit something that will survive the surrounding transaction's rollback. The closest thing Postgres has to such a thing at the moment is to connect back to the database with a new and entirely separate connection. Two popular ways to do so are the dblink module and the PL/PerlU language. Obviously, we are going to focus on the latter, but all of this could be done with dblink as well. Here are the additional steps to connect back to the database, do the insert, and then leave again:
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
VOLATILE
AS $bc$
use strict;
use warnings;
use DBI;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = defined $row ? delete $row->{allowed} : 1;
## Log this request
$SQL = 'INSERT INTO data_audit(username,tablename,arguments,results,status)
VALUES (?,?,?,?,?)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
my $dbh = DBI->connect('dbi:Pg:service=auditor', '', '',
{AutoCommit=>0, RaiseError=>1, PrintError=>0});
$sth = $dbh->prepare($SQL);
my $user = spi_exec_query('SELECT session_user')->{rows}[0]{session_user};
$sth->execute($user, 'weapon', $name, $rv->{processed}, $status);
$dbh->commit();
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
CREATE FUNCTION
Note that because we are making external changes, we marked the function as VOLATILE, which ensures that it will always be run every time it is called, and not cached in any form. We are also using a Postgres service file with the 'dbi:Pg:service=auditor' connection string. This means that the connection information (username, password, database) is contained in an external file. This is not only tidier than hard-coding those values into this function, but safer as well, as the function itself can be viewed by Alice. Finally, note that we are passing the 'username' explicitly into the INSERT this time, as we have a brand new connection which is no longer linked to the 'alice' user, so we have to derive it ourselves from "SELECT session_user" and then pass it along.
Once this new function is in place, and we re-run the same queries as we did before, we see three entries in our audit table:
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=> SELECT * FROM data_audit \x \g
Expanded display is on.
-[ RECORD 1 ]----------------------------
tablename | weapon
arguments | crowbar
results | 1
status | normal
username | alice
txntime | 2012-01-30 17:56:01.544557-05
realtime | 2012-01-30 17:56:01.54569-05
-[ RECORD 2 ]----------------------------
tablename | weapon
arguments | pulse rifle
results | 1
status | forbidden
username | alice
txntime | 2012-01-30 17:56:01.559532-05
realtime | 2012-01-30 17:56:01.561225-05
-[ RECORD 3 ]----------------------------
tablename | weapon
arguments | m9
results | 1
status | normal
username | alice
txntime | 2012-01-30 17:56:01.573335-05
realtime | 2012-01-30 17:56:01.574989-05
So that's the basic premise of how to solve the auditing problem. For an actual production script, you would probably want to cache the database connection by sticking things inside of the special %_SHARED hash available to PL/Perl and PL/PerlU. Note that each user gets their own version of that hash, so Alice will not be able to create a function and have access to the same %_SHARED hash that the postgres user has access to. It's probably a good idea to simply not let users like Alice use the language at all. Indeed, that's the default when we do the CREATE LANGUAGE call as above:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> CREATE FUNCTION showplatform()
RETURNS TEXT
LANGUAGE plperlu
AS $bc$
return $^O;
$bc$;
ERROR: permission denied for language plperlu
Further refinements to the actual script might include refactoring the logging bits to a separate function, writing some of the auditing data to a file on the local disk, recording the actual results returned to the user, and sending the data to another Postgres server entirely. For that matter, as we are using DBI, you could send it somewhere else entirely - such as a MySQL, Oracle, or DB2 database!
Another place for improvement would be associating each user with a security_level classification, such that any user could run the function and only see things at or below their level, rather than hard-coding the level as "confidential" as we have done here. Another nice refinement might be to always return undef (no matches) for items marked "top secret", to prevent the very existence of a top secret weapon from being deduced. :)
Interchange loops using DBI Slice
One day I was reading through the documentation on search.cpan.org for the DBI module and ran across an attribute that you can use with selectall_arrayref() that creates the proper data structure to be used with Interchange's object.mv_results loop attribute. The attribute is called Slice, which causes selectall_arrayref() to return an array of hashrefs instead of an array of arrays. To use this you have to be working in global Perl modules, as Safe.pm will not let you use the selectall_arrayref() method.
An example of what you could use this for is an easy way to generate a list of items in the same category. Inside the module, you would do something like this:
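To picture the difference between the two result shapes, here is a small plain-Ruby model. It is not DBI and not Perl; it only shows positional rows being turned into one hash per row keyed by column name, and the helper name and sample values are hypothetical.

```ruby
# Turn positional rows into one hash per row keyed by column name,
# the shape that Slice => {} asks DBI to return.
def rows_as_hashes(columns, rows)
  rows.map { |row| columns.zip(row).to_h }
end

columns = ["sku", "description", "price"]
rows    = [["A100", "Sample item", "9.99"]]   # hypothetical data
by_name = rows_as_hashes(columns, rows)
```

With the hash shape, a template can ask for a field by name rather than by column position.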
my $results = $dbh->selectall_arrayref(
q{
SELECT
sku,
description,
price,
thumb,
category,
prod_group
FROM
products
WHERE
category = ?},
{ Slice => {} },
$category
);
$::Tag->tmpn("product_list", $results);
In the actual HTML page, you would do this:
<table cellpadding=0 cellspacing=2 border=1>
<tr>
<th>Image</th>
<th>Description</th>
<th>Product Group</th>
<th>Category</th>
<th>Price</th>
</tr>
[loop object.mv_results=`$Scratch->{product_list}` prefix=plist]
[list]
<tr>
<td><a href="/cgi-bin/vlink/[plist-param sku].html"><img src="[plist-param thumb]"></a></td>
<td>[plist-param description]</td>
<td>[plist-param prod_group]</td>
<td>[plist-param category]</td>
<td>[plist-param price]</td>
</tr>
[/list]
[/loop]
</table>
We normally use this when writing ActionMaps and using some template as our setting for mv_nextpage.
Some great press for College District
College District has been getting some positive press lately, the most recent being a Forbes article which talks about the success they have been seeing in the last few years.
College District is a company that sells collegiate merchandise to fans. They got their start focusing on the LSU Tigers at TigerDistrict.com and have branched out to teams such as the Oregon Ducks and Alabama Roll Tide.
We've been working with Jared Loftus at College District for more than four and a half years. College District is running on a heavily modified Interchange system with some cool Postgres tricks. The system can support a nearly unlimited number of sites, running on 2 catalogs (1 for the admin, 1 for the front end) and 1 database. The key to the system is different schemas, fronted by views, that hide and expose records based on the database user that is connected. The great thing about this system is that Jared can choose to launch a new store within a day and be ready for sales, something he has taken advantage of in the past when a team is on fire and he sees an opportunity he can't pass up.
We are currently preparing for a re-launch of the College District site that will focus on crowd-sourced designs. Artists and fans will submit their designs and have them voted on; some will be chosen to be sold, and the folks whose designs are chosen will get paid for their efforts. The goal here is to grow a community that guides what College District and the individual school sites ultimately sell.
With College District's quick growth we've also been helping them improve their order fulfillment process. This includes streamlining how orders are picked, packed and shipped. The introduction of bar code scanners will help with the accuracy and speed of the process.
We get a kick out of seeing our clients succeed, especially those that come to us with a clear vision and a good attitude, and then put the hard work in to make it happen. It's an exciting year ahead for College District and we'll be right there supporting them on the journey.
Interchange Search Caching with "Permanent More"
Most sites that use Interchange take advantage of Interchange's "more lists". These are built-in tools that support an Interchange "search" (either the search/scan action, or result of direct SQL via [query]) to make it very easy to paginate results. Under the hood, the more list is a drill-in to a cached "search object", so each page brings back a slice from the cache of the original search. There are extensive ways to modify the look and behavior of more lists and, with a bit of effort, they can be configured to meet design requirements.
Where more lists tend to fall short, however, is with respect to SEO. There are two primary SEO deficiencies that get business stakeholders' attention:
- There is little control over the construction of the URLs for more lists. They leverage the scan actionmap and contain a hash key for the search object and numeric data to identify the slice and page location. They possess no intrinsic value in identifying the content they reference.
- The search cache by default is ephemeral and session-specific. This means all those results beyond page 1 the search engine has cataloged will result in dead links for search users who try to land directly on the more-listed pages.
It is the latter issue that I wish to address because there is--and has been for some time now--a simple mechanism called "permanent more" to remedy the default behavior.
You can leverage "permanent more" by adding the boolean mv_more_permanent, or the shorthand pm, to your search conditions. E.g.:
Link:
<a href="[area search="
co=1
sf=category
se=Foo
op=rm
more=1
ml=5
pm=1
"]">All Foos</a>
Loop:
[loop search="
co=1
sf=category
se=Foo
op=rm
more=1
ml=5
pm=1
"]
...loop body with [more-list]...
[/loop]
Query:
[query
list=1
more=1
ml=10
pm=1
sql="SELECT * FROM products WHERE category LIKE '%Foo%'"
]
...same as loop but with 10 matches/page...
[/query]
If the initial search is defined with the "permanent more" setting, it will produce the following adjustments:
- The hash key used to store and identify the search cache is deterministic based on the search conditions. Many searches for Interchange are category driven. Thus, all end users who wish to browse a category end up clicking identical links, which create duplicate search caches, belonging uniquely to them. With permanent more, they all share the same cache, with the same identifier. As long as the search conditions don't change, neither does the cache identifier. Even as the cache is refreshed with new executions of the search, the object remains in the same location. Thus, the results a search engine produced this morning reference links still valid now, tomorrow, or next week, provided they reference the same search conditions.
- The cached search object has no session affinity. Any link referencing the cache with the correct hash key has access to the content.
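The difference between the two keying strategies can be illustrated with a minimal Python sketch. This is not Interchange's actual hashing code: the function names, the use of SHA-1, and the session-salting scheme are all assumptions, chosen only to show why a key derived purely from the search conditions stays the same across users and over time.

```python
import hashlib

def permanent_key(conditions):
    """Hypothetical: derive a cache key from the search conditions alone."""
    # Sort the conditions so that their ordering does not affect the key.
    canonical = "\n".join(f"{k}={v}" for k, v in sorted(conditions.items()))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]

def session_key(conditions, session_id):
    """Hypothetical: a default-style key that also depends on the session."""
    salted = dict(conditions, _session=session_id)
    return permanent_key(salted)

search = {"co": "1", "sf": "category", "se": "Foo", "op": "rm", "ml": "5"}

# Two different users running the same category search:
same_for_everyone = permanent_key(search) == permanent_key(dict(search))
differs_per_user = session_key(search, "userA") != session_key(search, "userB")
```

With the session mixed into the key, every visitor produces a separate cache and a separate set of more-list URLs; with the conditions alone, every visitor lands on the same one.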
Taken together, "permanent more" removes (for the most part, addressed later) dead links from more lists cataloged by search engines. There are, however, other benefits to "permanent more" beyond those intended as described above:
- As stated in passing, standard Interchange search caching produces duplicate search objects for common search conditions. For a busy site, these caches can have an impact on storage. Typically, maintenance is implemented to clean up all cache files whose age exceeds the session duration (standard is 48 hours) by some amount. With permanent more, duplicate caches are eliminated. A cache location is reused by all users with the same search requirements, keeping data-storage requirements for caches to the minimum necessary. As searches change, orphaned caches can still easily be cleaned up: once nothing accesses them any longer, they immediately start to age.
- For the same reason that "permanent more" resolves search-engine links, it also resolves content management for individual sites using a reverse proxy for caching. Because most (and certainly the easiest) caching keys are based off of URL, the deterministic nature of the hash keys for "permanent more" allows assurance that the cached content in the proxy accurately reflects the search content over time, and that all users will hit the cached resource and not generate new, unique links with varying hash keys.
One shortcoming of "permanent more" to be aware of is the impact of changing data underneath the search. Even if search conditions do not change, the count and order of matching record sets may. So, e.g., enough products may be removed from a given category to cause the last page of a more list to become empty, which would cause any specific link into that page to become dead. More minor, but still a possibility, is the introduction or removal of products so that a particularly searched-for term has been "bumped" to another page within the search cache since the last time the search engine crawled the more lists. For searches backed by particularly volatile data, "permanent more" may not be sufficient to address search-engine or caching demands.
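The empty-last-page case is simple arithmetic. The following Python lines are only an illustration with made-up match counts; `page_count` is a hypothetical helper, not an Interchange function.

```python
import math

def page_count(matches, per_page):
    """Number of more-list pages needed for a given match count."""
    return math.ceil(matches / per_page)

# With ml=5, 23 matching products fill five pages.
before = page_count(23, 5)
# After three products leave the category, only four pages remain,
# so a stored link to page 5 no longer has anything to show.
after = page_count(20, 5)
```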
Finally, "permanent more" should be avoided for any search features that may cache data sensitive to an individual user. This is unlikely to happen as, under most circumstances, the configuration of the search itself will change based on the unique characteristics of the user executing the search (e.g., a username included in a query to review order history). However, it is still possible that context-sensitive information could be stored in the search object and, if so, all other users with access to the more lists would have access to that information.
Finding PostgreSQL temporary_file problems with tail_n_mail
PostgreSQL does as much work as it can in RAM, but sometimes it needs to (or thinks that it needs to) write things temporarily to disk. Typically, this happens on large or complex queries in which the required memory is greater than the work_mem setting.
This is usually an unwanted event: not only is going to disk much slower than keeping things in memory, but it can cause I/O contention. For very large, not-run-very-often queries, writing to disk can be warranted, but in most cases, you will want to adjust the work_mem setting. Keep in mind that this is a very flexible setting, and can be adjusted globally (via the postgresql.conf file), per-user (via the ALTER USER command), and dynamically within a session (via the SET command). A good rule of thumb is to set it to something reasonable in your postgresql.conf (e.g. 8MB), and set it higher for specific users that are known to run complex queries. When you discover a particular query run by a normal user requires a lot of memory, adjust the work_mem for that particular query or set of queries.
How do you tell when your work_mem needs adjusting, or more to the point, when Postgres is writing files to disk? The key is the setting in postgresql.conf called log_temp_files. By default it is set to -1, which does no logging at all. Not very useful. A better setting is 0, which is my preferred setting: it logs all temporary files that are created. Setting log_temp_files to a positive number will only log entries that have an on-disk size greater than the given number (in kilobytes). Entries about temporary files used by Postgres will appear like this in your log file:
2011-01-12 16:33:34.175 EST LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp16501.0", size 130220032
The only important part is the size, in bytes. In the example above, the size is 124 MB, which is not that small of a file, especially as it may be created many, many times. So the question becomes, how can we quickly parse the files and get a sense of which queries are causing excess writes to disk? Enter the tail_n_mail program, which I recently tweaked to add a "tempfile" mode for just this purpose.
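To make the bytes-to-megabytes step concrete, here is a small Python sketch that pulls the path and size out of the log entry shown above. It is not part of tail_n_mail (which is a Perl program); the regular expression and the function name are assumptions, and the text before "temporary file" will vary with your log_line_prefix.

```python
import re

# Pattern for the "temporary file" entries that log_temp_files produces.
TEMPFILE_RE = re.compile(
    r'temporary file: path "(?P<path>[^"]+)", size (?P<size>\d+)'
)

def parse_tempfile_line(line):
    """Return (path, size_in_bytes), or None if the line is not a tempfile entry."""
    match = TEMPFILE_RE.search(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("size"))

line = ('2011-01-12 16:33:34.175 EST LOG: temporary file: '
        'path "base/pgsql_tmp/pgsql_tmp16501.0", size 130220032')
path, size = parse_tempfile_line(line)
size_mb = size / (1024 * 1024)  # the logged size is in bytes
print(f"{path}: {size_mb:.2f} MB")
```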
To enter this mode, just name your config file with "tempfile" in its name, and have it find the lines containing the temporary file information. It's also recommended you make use of the tempfile_limit parameter, which limits the results to the "top X" ones, as the report can get very verbose otherwise. An example config file and an example invocation via cron:
$ cat tail_n_mail.tempfile.myserver.txt
## Config file for the tail_n_mail program
## This file is automatically updated
## Last updated: Thu Nov 10 01:23:45 2011
MAILSUBJECT: Myserver tempfile sizes
EMAIL: greg@endpoint.com
FROM: postgres@myserver.com
INCLUDE: temporary file
TEMPFILE_LIMIT: 5
FILE: /var/log/pg_log/postgres-%Y-%m-%d.log
$ crontab -l | grep tempfile
## Mail a report each morning about tempfile usage:
0 5 * * * bin/tail_n_mail tnm/tail_n_mail.tempfile.myserver.txt --quiet
For the client I wrote this for, we run this once a day and it mails us a nice report giving the worst tempfile offenders. The queries are broken down in three ways:
- Largest overall temporary file size
- Largest arithmetic mean (average) size
- Largest total size across all instances of the same query
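The three rankings can be sketched in a few lines of Python. This is an illustration only, not how tail_n_mail is implemented: the function names and the sample data are made up, and the sketch assumes each query has already been normalized so that repeated runs of the same statement share one text.

```python
from collections import defaultdict

def summarize(entries):
    """entries: iterable of (query_text, tempfile_size_in_bytes) pairs."""
    sizes = defaultdict(list)
    for query, size in entries:
        sizes[query].append(size)
    report = {}
    for query, values in sizes.items():
        report[query] = {
            "count": len(values),
            "largest": max(values),
            "mean": sum(values) / len(values),
            "total": sum(values),
        }
    return report

def top(report, key, limit):
    """Return the `limit` queries ranked highest by the given statistic."""
    ranked = sorted(report.items(), key=lambda item: item[1][key], reverse=True)
    return [query for query, _ in ranked[:limit]]

# Made-up sample data: (query, bytes written to a temporary file)
entries = [
    ("SELECT * FROM big_report", 900),
    ("SELECT * FROM small_lookup", 40),
    ("SELECT * FROM small_lookup", 60),
    ("SELECT * FROM small_lookup", 1000),
]
report = summarize(entries)
```

The sample data shows why all three views are useful: the query with the largest mean is not the one with the largest total, because a statement that runs often can write more to disk overall than one that runs once with a big file.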
Here is a slightly edited version of an actual tempfile report email:
Date: Mon Nov 7 06:39:57 2011 EST
Host: myserver.example.com
Total matches: 1342
Matches from [A] /var/log/pg_log/2011-11-08.log: 1241
Matches from [B] /var/log/pg_log/2011-11-09.log: 101
Not showing all lines: tempfile limit is 5
Top items by arithmetic mean | Top items by total size
----------------------------------+-------------------------------
860 MB (item 5, count is 1) | 17 GB (item 4, count is 447)
779 MB (item 1, count is 2) | 8 GB (item 2, count is 71)
597 MB (item 7, count is 1) | 6 GB (item 334, count is 378)
597 MB (item 8, count is 1) | 6 GB (item 46, count is 104)
596 MB (item 9, count is 1) | 5 GB (item 3, count is 63)
[1] From file B Count: 2
Arithmetic mean is 779.38 MB, total size is 1.52 GB
Smallest temp file size: 534.75 MB (2011-11-08 12:33:14.312 EST)
Largest temp file size: 1024.00 MB (2011-11-08 16:33:14.121 EST)
First: 2011-11-08 05:30:12.541 EST
Last: 2011-11-09 03:12:22.162 EST
SELECT ab.order_number, TO_CHAR(ab.creation_date, 'YYYY-MM-DD HH24:MI:SS') AS order_date,
FROM orders o
JOIN order_summary os ON (os.order_id = o.id)
JOIN customer c ON (o.customer = c.id)
ORDER BY creation_date DESC
[2] From file A Count: 71
Arithmetic mean is 8.31 MB, total size is 654 MB
Smallest temp file size: 12.12 MB (2011-11-08 06:12:15.012 EST)
Largest temp file size: 24.23 MB (2011-11-08 19:32:45.004 EST)
First: 2011-11-08 06:12:15.012 EST
Last: 2011-11-09 04:12:14.042 EST
CREATE TEMPORARY TABLE tmp_sales_by_month AS SELECT * FROM sales_by_month_view;
While it still needs a little polishing (such as showing which file each smallest/largest came from), it has already been an indispensable tool for finding queries that are causing I/O problems via frequent and/or large temporary files.
A comparison of JasperSoft iReport and Pentaho Reporting
I've recently been involved in reporting projects using both JasperSoft's iReport and Pentaho's Reporting, so this seemed a good opportunity to compare the two. Both are Java-based, open source reporting systems which claim to build "pixel-perfect" documents ("pixel-perfect" means that when you put something somewhere on a report design, it doesn't move around. That this isn't taken for granted is a rant for another time). I have more experience with Pentaho than with JasperSoft, and once reviewed a book on Pentaho; I'll try to give the two a fair evaluation, but in the end I can't promise my greater experience with Pentaho won't affect my conclusions one way or the other. Both suites are available in open source and commercial flavors; I'll consider only the open source versions here.
First let me point out that JasperSoft and Pentaho both produce business intelligence software suites. The two suites, which exist in both community (open source) and enterprise (commercial) forms, are well worth comparing in their entirety, but I'm focusing principally on the report builder component of each, because that's where my recent experience has led me. These report builder packages allow users to build complex and visually interesting data-based documents from many different kinds of data sources. A "document" could be a simple form letter, an address book, or a complex dashboard of business metrics. In each case, users build documents by dragging and dropping various components into report "bands", and then modifying components' properties. "Bands" are horizontal sections of the page with different roles. The page header and footer bands, for instance, are (obviously) printed at the top and bottom of each page. "Detail" bands print once per row returned by the query that underlies the report. Both iReport and Pentaho Reporting will group and filter query results for you, if you want, and include header and footer bands for each group. Both products allow users to publish finished reports to a server where other users can view them, schedule them to run periodically, or modify them for new purposes, and both provide a Java-based reporting library to embed reporting in other applications.

iReport Query Dialog
Reports in both products are based on queries. These queries may be SQL using JDBC data sources, or they can come from other more obscure data sources, such as MDX queries, XQuery, scripts in various languages, web services, and more. Both also provide a query editor, at least for SQL queries, and in fact, both use the same query editor. I've only used it rarely, and only in Pentaho; both products also allow users to type in queries free-form, which I much prefer. In Pentaho, the data source for the query is embedded in the report itself, in the form of a JDBC URL, a JNDI name, or something else appropriate for that data source, so if you publish the report to a server, you're responsible for ensuring that JNDI name or JDBC URL or whatever makes sense on that server. Jasper, on the other hand, prompts the user with a list of available data sources when publishing a report to a server. Jasper's method seems more friendly, but Pentaho's choice may have its advantages here, because each report is self-contained, whereas Jasper's reports have metadata outside the single report file to describe the associated data source.
The Pentaho Report Builder
The component libraries available in each product are fairly similar. Users can select from basic drawing components such as lines and circles, static formatted text labels, and of course various numeric, textual, and date components to display query results. Both products also include complex charting components, to display visualizations of many different kinds. Though the charting functionality of both products, at least at the basic level I've used, is quite similar, I found iReport's charting dialogs quite helpful for making complex charts much easier to create than they would otherwise be. In Pentaho, after adding a chart component the user can open a special properties window, but the window offers few clues beyond the usual tooltips and occasionally meaningful property names to help the user know what to do. In contrast, iReport describes many of the properties involved in clear language.
iReport's chart selection dialog
This brings me to the topic of documentation, which, in fact, I found lacking for both products. Yes, these are open source projects, and yes, documentation isn't always as fun to write as code, so yes, open source projects sometimes end up with lousy documentation. The enterprise versions come with documentation in one form or another, and there are several books published on the different suites' components, including reporting. But the documentation available free-of-charge on the web left me unimpressed. Here Pentaho was particularly frustrating, largely because Pentaho's reporting has changed a great deal from version to version, especially with its 3.5 release a couple years ago, so much of the help available in forums and wikis is completely out of date. Jasper documentation was more difficult to find, in general, but more accurate and up-to-date when I found it.
Both products save their results in fairly comprehensible formats, which is helpful if you ever need to modify them by hand, without the help of the report building tool. iReport saves each report as a single XML file; Pentaho creates a ZIP archive of several XML files. That fact has come in quite handy several times, both with Pentaho and with iReport: typically because it's faster for me to edit XML by hand than to click on each of 20 components and tweak properties one by one, because (in Pentaho) I needed to modify a data source, or because the tool failed to figure out what columns to expect from a query and I wanted to enter them manually.
One major difference between the two products stems from what is available in their open source "community editions". Jasper looks a bit more polished to me, the documentation is somewhat more consistent, and its selection of sample reports and sample code is more comprehensive -- but the community edition does little more than support iReport. Pentaho's community edition, in addition to the reporting functions I've discussed, also offers ad-hoc web based reporting and a powerful MDX analysis package. JasperSoft offers those features only in its enterprise edition. That may not be a big deal in some places, but it is the deciding factor in others. Having access to Pentaho's code has made possible a number of things we certainly couldn't have done otherwise.
I've been trying to keep this "a comparison" of Pentaho Reporting and iReport, rather than a "showdown" or "shootout" or even "Jasper vs. Pentaho", but at some point, conclusions must be drawn. I'll draw here somewhat on evidence not mentioned above, because, well, the post was getting a bit long as it was, and I didn't want to describe everything. I admit to liking iReport's interface better, because to me it seems to make better use of screen real estate. Google was much better able to answer questions for iReport than for Pentaho. But, although both products sometimes seem to make simple things hard, Pentaho seemed to do this less than iReport, and the various magic incantations I needed to get things working were fewer in Pentaho. In the end, the much greater capability of Pentaho's open source offering over JasperSoft's clinches it for me. I'll take open source over closed any day.
Viewing schema changes over time with check_postgres
Version 2.18.0 of check_postgres, a monitoring tool for PostgreSQL, has just been released. This new version has quite a large number of changes: see the announcement for the full list. One of the major features is the overhaul of the same_schema action. This allows you to compare the structure of one database to another and get a report of all the differences check_postgres finds. Note that "schema" here means the database structure, not the object you get from a "CREATE SCHEMA" command. Further, remember the same_schema action does not compare the actual data, just its structure.
Unlike most check_postgres actions, which deal with the current state of a single database, same_schema can compare databases to each other, as well as audit things by finding changes over time. In addition to having the entire system overhauled, same_schema now allows comparing as many databases as you want to each other. The arguments have been simplified, in that a comma-separated list is all that is needed for multiple entries. For example:
./check_postgres.pl --action=same_schema \
--dbname=prod,qa,dev --dbuser=alice,bob,charlie
The above command will connect to three databases, as three different users, and compare their schemas (i.e. structures). Note that we don't need to specify a warning or critical value: we consider this an 'OK' Nagios check if the schemas match, otherwise it is 'CRITICAL'. Each database gets assigned a number for ease of reporting, and the output looks like this:
POSTGRES_SAME_SCHEMA CRITICAL: (databases:prod,qa,dev)
Databases were different. Items not matched: 1 | time=0.54s
DB 1: port=5432 dbname=prod user=alice
DB 1: PG version: 9.1.1
DB 1: Total objects: 312
DB 2: port=5432 dbname=qa user=bob
DB 2: PG version: 9.1.1
DB 2: Total objects: 312
DB 3: port=5432 dbname=dev user=charlie
DB 3: PG version: 9.1.1
DB 3: Total objects: 313
Language "plpgsql" does not exist on all databases:
Exists on: 3
Missing on: 1, 2
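The "Exists on / Missing on" part of the report boils down to set comparison. Here is a minimal Python sketch of that idea; it is not check_postgres code (which is Perl), and the function name and object inventories are hypothetical stand-ins for what the real tool gathers from the system catalogs.

```python
def compare_schemas(databases):
    """databases: dict mapping a database number to the set of object names it holds.

    Returns {object_name: (exists_on, missing_on)} for every object that is
    not present on all databases.
    """
    all_objects = set().union(*databases.values())
    differences = {}
    for name in sorted(all_objects):
        exists_on = sorted(n for n, objs in databases.items() if name in objs)
        missing_on = sorted(n for n, objs in databases.items() if name not in objs)
        if missing_on:
            differences[name] = (exists_on, missing_on)
    return differences

# Hypothetical object inventories, mirroring the report above
databases = {
    1: {'table "public.orders"'},
    2: {'table "public.orders"'},
    3: {'table "public.orders"', 'language "plpgsql"'},
}
diffs = compare_schemas(databases)
```

An empty result corresponds to the 'OK' case; anything else corresponds to 'CRITICAL'.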
The second large change was a simplification of the filtering options. Everything is now controlled by the --filter argument, and basically you can tell it what things to ignore. For example:
./check_postgres.pl --action=same_schema \
--dbname=A,B --filter=nolanguage,nosequence
The above command will compare the schemas on databases A and B, but will ignore any difference in which languages are installed, and ignore any differences in the sequences used by the databases. Most objects can be filtered out in a similar way. There are also a few other useful options for the --filter argument:
- noposition: Ignore what order columns are in
- noperms: Do not worry about any permissions on database objects
- nofuncbody: Do not check function source
The final and most exciting large change is the ability to compare a database to itself, over time. In other words, you can see exactly what changed during a certain time period. We have a client using that now to send a daily report on all schema changes made in the last 24 hours, for all the databases in their system. This is a very nice thing for a DBA to receive: not only is there a nice audit trail in your email, you can answer questions such as:
- Was this a known change, or did someone make it without letting anyone else know?
- Did somebody fat-finger and drop an index by mistake?
- Were the changes applied to database X also applied to database Y and Z?
To enable time-based checks, simply provide a single database to check. The first time it is run, same_schema simply gathers all the schema information and stores it on disk. The next time it is run, it detects the file, reads it in as database "2", and compares it to the current database (number "1"). The --replace argument will rewrite the file with the current data when it is done. So the cronjob for the aforementioned client is as simple as:
10 0 * * * ~/bin/check_postgres.pl --action=same_schema \
--host=bar --dbname=abc --quiet --replace
The --quiet argument ensures that no output is given if everything is 'OK'. If everything is not okay (i.e. if differences are found), cron gets a bunch of input sent to it and duly mails it out. Thus, a few minutes after 10AM each day, a report is sent if anything has changed in the last day. Here's a slightly redacted version of this morning's report, which shows that a schema named "stat_backup" was dropped at some point in the last 24 hours (which was a known operation):
POSTGRES_SAME_SCHEMA CRITICAL: DB "abc" (host:bar)
Databases were different. Items not matched: 1 | time=516.56s
DB 1: port=5432 host=bar dbname=abc user=postgres
DB 1: PG version: 8.3.16
DB 1: Total objects: 11863
DB 2: File=check_postgres.audit.port.5432.host.bar.db.abc
DB 2: Creation date: Sun Oct 2 10:06:12 2011 CP version: 2.18.0
DB 2: port=5432 host=bar dbname=abc user=postgres
DB 2: PG version: 8.3.16
DB 2: Total objects: 11864
Schema "stat_backup" does not exist on all databases:
Exists on: 2
Missing on: 1
As you can see, the first part is a standard Nagios-looking output, followed by a header explaining how we defined database "1" and "2" (the former a direct database call, and the latter a frozen version of the same.)
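The store-then-compare cycle can be sketched as follows. This is only an illustration of the idea in Python: the JSON format, the file name, and the function name are assumptions, and check_postgres itself stores far more detail than a list of object names.

```python
import json
import os
import tempfile

def check_over_time(current, audit_file, replace=False):
    """Compare `current` (a list of object names) with the stored snapshot.

    Returns None on the first run (nothing to compare against), otherwise a
    dict with the objects added and dropped since the snapshot was written.
    """
    previous = None
    if os.path.exists(audit_file):
        with open(audit_file) as fh:
            previous = set(json.load(fh))
    # The first run always writes the file; later runs only when asked to.
    if previous is None or replace:
        with open(audit_file, "w") as fh:
            json.dump(sorted(current), fh)
    if previous is None:
        return None
    return {
        "added": sorted(set(current) - previous),
        "dropped": sorted(previous - set(current)),
    }

audit_file = os.path.join(tempfile.mkdtemp(), "schema.audit.json")
first = check_over_time(['schema "public"', 'schema "stat_backup"'], audit_file)
second = check_over_time(['schema "public"'], audit_file, replace=True)
third = check_over_time(['schema "public"'], audit_file, replace=True)
```

Because the second call replaces the snapshot, the dropped schema is reported once and the third call comes back clean, which is the behavior you want from a daily report.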
Sometimes you want to store more than one version at a time: for example, if you want both a daily and a weekly view. To enable this, use the --suffix argument to create different instances of the saved file. For example:
10 0 * * * ~/bin/check_postgres.pl --action=same_schema \
--host=bar --dbname=abc --quiet --replace --suffix=daily
10 0 * * Fri ~/bin/check_postgres.pl --action=same_schema \
--host=bar --dbname=abc --quiet --replace --suffix=weekly
The above commands would end up recreating this file every morning at 10: check_postgres.audit.port.5432.host.bar.db.abc.daily, and this file each Friday at 10: check_postgres.audit.port.5432.host.bar.db.abc.weekly.
Thanks to all the people that made 2.18.0 happen (see the release notes for the list). There are still some rough edges to the same_schema action: for example, the output could be a little more user-friendly, and not all database objects are checked yet (e.g. no custom aggregates or operator classes). Development is ongoing; patches and other contributions are always welcome. In particular, we need more translators. We have French covered, but would like to include more languages. The code can be checked out at:
git clone git://bucardo.org/check_postgres.git
There is also a GitHub mirror if you so prefer:
https://github.com/bucardo/check_postgres
You can also file a bug (or feature request), or join one of the mailing lists: general, announce, and commit.
PostgreSQL Serializable and Repeatable Read Switcheroo

PostgreSQL allows for different transaction isolation levels to be specified. Because Bucardo needs a consistent snapshot of each database involved in replication to perform its work, the first thing that the Bucardo daemon does when connecting to a remote PostgreSQL database is:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE;
The 'READ WRITE' bit sets us in read/write mode, just in case the entire database has been set to read only (a quick and easy way to make your slave databases non-writeable!). It also sets the transaction isolation level to 'SERIALIZABLE'. At least, it used to. Now Bucardo uses 'REPEATABLE READ' like this:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ WRITE;
Why the change? In version 9.1 of PostgreSQL the concept of SSI (Serializable Snapshot Isolation) was introduced. How it actually works is a little complicated (follow the link for more detail), but before 9.1 PostgreSQL was only *sort of* doing serialized transactions when you asked for serializable mode. What it was really doing was repeatable read and not trying to really serialize the transactions. In 9.1, PostgreSQL is doing *true* serializable transactions. It also adds a new distinct 'internal' transaction mode, 'repeatable read', which does exactly what the old 'serializable' used to do. Finally, if you issue a 'repeatable read' on a pre-9.1 database, it silently upgrades it to the old 'serializable' mode.
So in summary, if your application was using 'SERIALIZABLE' before, you can now replace that with 'REPEATABLE READ' and get the exact same behavior as before, regardless of the version. Of course, if you want *true* serializable transactions, use SERIALIZABLE. It will continue to mean the same as 'REPEATABLE READ' in pre-9.1 databases, and provide true serializability in 9.1 and beyond. (I haven't determined yet if Bucardo is going to use this new level, as it comes with a little bit of overhead.)
Since this can be a little confusing, here's a handy chart showing how version 9.1 changed the meaning of SERIALIZABLE, and added a new 'internal' isolation level:
| Requested isolation level | Actual internal isolation level, Postgres 9.0 and earlier | Actual internal isolation level, Postgres 9.1 and later | Version comparison |
|---|---|---|---|
| READ UNCOMMITTED | Read committed | Read committed | Exact same |
| READ COMMITTED | Read committed | Read committed | Exact same |
| REPEATABLE READ | Serializable | Repeatable read | Functionally identical |
| SERIALIZABLE | Serializable | Serializable (true) | 9.1 only! |
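The same mapping can be written as a small Python function, which may be handy if an application has to reason about which behavior it will actually get. The function name and the descriptive return strings are my own labels, not anything Postgres reports.

```python
def effective_isolation(requested, version):
    """Map a requested isolation level to the behavior Postgres provides.

    `version` is a (major, minor) tuple such as (9, 0) or (9, 1).
    """
    requested = requested.upper()
    if requested in ("READ UNCOMMITTED", "READ COMMITTED"):
        return "read committed"
    if requested == "REPEATABLE READ":
        # Same snapshot behavior on every version; before 9.1 it was
        # internally called "serializable".
        return "snapshot (repeatable read)"
    if requested == "SERIALIZABLE":
        if version >= (9, 1):
            return "true serializable (SSI)"
        return "snapshot (repeatable read)"
    raise ValueError(f"unknown isolation level: {requested}")

old_serializable = effective_isolation("SERIALIZABLE", (9, 0))
new_repeatable = effective_isolation("REPEATABLE READ", (9, 1))
new_serializable = effective_isolation("SERIALIZABLE", (9, 1))
```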
Congratulations and thanks to Kevin Grittner and Dan Ports for making true serializability a reality!
Bucardo PostgreSQL replication to other tables with customname
(Don't miss the Bucardo5 talk at Postgres Open in Chicago)
Work on the next major version of Bucardo is wrapping up (version 5 is now in beta), and two new features have been added to this major version. The first, called customname, allows you to replicate to a table with a different name. This is a feature people have been requesting for a long time, and it even allows you to replicate between differently named Postgres schemas. The second option, called customcols, allows you to replicate to different columns on the target: not only a subset, but different column names (and types), as well as other neat tricks.
The "customname" option allows changing of the table name for one or more targets. By default, Bucardo replicates tables from the source databases to the target databases, and all tables must have the same name and schema everywhere. With the customname feature, you can change the target table names, either globally, per database, or per sync.
We'll go through a full example here, using a stock 64-bit RedHat 6.1 EC2 box (ami-5e837b37). I find EC2 a great testing platform - not only can you try different operating systems and architectures, but (as my own personal box is very customized) it is great to start afresh from a stock configuration.
First, let's turn off SELinux, install the EPEL rpm, update the box, and install a few needed packages.
# echo 0 > /selinux/enforce
# wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-5.noarch.rpm
# rpm -ivh epel-release-6-5.noarch.rpm
# yum update
# yum install emacs-nox perl-DBIx-Safe perl-DBD-Pg git postgresql-plperl
# cpan boolean
The yum update takes a while to run, but I always feel better when things are up to date. Next, we will create a new database cluster, create the /var/run/bucardo directory that Bucardo uses to store its PIDs, adjust the ultraconservative stock pg_hba.conf file, and start Postgres up:
# service postgresql initdb
# mkdir /var/run/bucardo
# chown postgres.postgres /var/run/bucardo
# emacs /var/lib/pgsql/data/pg_hba.conf
# service postgresql start
For the pg_hba.conf configuration file, because we want to be able to connect to the database as the bucardo user without actually logging into that account, we will allow access using the 'md5' (password) method instead of 'ident'. But because we don't want to bother creating a password for the postgres user, we will still allow those connections via ident. The relevant lines in the pg_hba.conf will end up like this:
# TYPE DATABASE USER METHOD
local all postgres ident
local all all md5
At this point, we (as the postgres user) download and install Bucardo itself:
# su - postgres
$ git clone git://bucardo.org/bucardo.git
$ cd bucardo
$ perl Makefile.PL
$ make
$ sudo make install
$ bucardo install    # (enter 'p' and keep the default values)
We are now ready to start testing out the new customname feature. First we will need some data to replicate! For this demo we are going to use one of the handy sample datasets from the dbsamples project. The one we will use has a few small tables with information about towns in France. Note that the tarball does not (sadly) contain a top-level directory, so we have to create one ourselves. We will then create four identical databases holding the data from that file.
$ wget http://pgfoundry.org/frs/download.php/935/french-towns-communes-francaises-1.0.tar.gz
$ mkdir frenchtowns
$ cd frenchtowns
$ tar xvfz ../french-towns-communes-francaises-1.0.tar.gz
$ psql -c 'create database french1'
$ psql french1 -q -f french-towns-communes-francaises.sql
$ psql -c 'create database french2 template french1'
$ psql -c 'create database french3 template french1'
$ psql -c 'create database french4 template french1'
Bucardo is installed but does not know what to do yet, so we will teach Bucardo about each of the databases, and add in all the tables, grouping them into a herd in the process. Finally, we create a sync in which french1 and french2 are both source (master) databases, and french3 and french4 will be target (slave) databases.
$ bucardo add db f1 db=french1
$ bucardo add db f2 db=french2
$ bucardo add db f3 db=french3
$ bucardo add db f4 db=french4
$ bucardo add all tables herd=fherd
$ bucardo add sync wildstar herd=fherd dbs=f1=source,f2=source,f3=target,f4=target
Before starting it up, I usually raise the debug level, as it gives a much clearer picture of what is going on in the logs. It does make the logs a lot more crowded, so it is not recommended for production use:
$ echo log_level=DEBUG >> ~/.bucardorc
Next, we start Bucardo up and make sure everything is working as it should. Scanning the log.bucardo file that is generated is a great way to do this:
$ bucardo start
$ sleep 3
$ tail log.bucardo
If all goes well, you should see something very similar to this in the last lines of your log.bucardo file:
(972) [Sat Sep 3 16:18:54 2011] KID Total time for sync "wildstar" (0 rows): 0.05 seconds
(966) [Sat Sep 3 16:18:55 2011] CTL Got NOTICE ctl_syncdone_wildstar from 973 (line 1624)
(966) [Sat Sep 3 16:18:55 2011] CTL Kid 973 has reported that sync wildstar is done
(966) [Sat Sep 3 16:18:55 2011] CTL Sending NOTIFY "syncdone_wildstar" (line 1709)
(954) [Sat Sep 3 16:18:55 2011] MCP Got NOTICE syncdone_wildstar from 967 (line 749)
(954) [Sat Sep 3 16:18:55 2011] MCP Sync wildstar has finished
(954) [Sat Sep 3 16:18:55 2011] MCP Sending NOTIFY "syncdone_wildstar" (line 812)
(954) [Sat Sep 3 16:18:56 2011] MCP Got NOTICE syncdone_wildstar from 957 (Bucardo DB) (line 749)
From the above, we see that a KID finished running the sync we created, without finding any changed rows to replicate. Then there is some chatter between the different Bucardo processes. Now to test out the customname feature. We'll rename one of the tables, tell Bucardo about the change, reload the sync, and verify that all is still being replicated.
$ psql french3 -c 'ALTER TABLE regions RENAME TO tesla'
$ bucardo add customname regions tesla db=f3
$ bucardo reload wildstar
$ psql french3 -c 'truncate table tesla cascade'
TRUNCATE
$ psql french3 -t -c 'select count(*) from tesla'
0
$ psql french1 -c 'update regions set name=name'
UPDATE 26
$ psql french3 -t -c 'select count(*) from tesla'
26
In the above, the update on the regions table in the french1 database calls a trigger that notifies Bucardo that some rows have changed; Bucardo then has a KID copy the rows from the source database french1 to the other source database french2, as well as the targets french3 and french4. The final internal DELETE and COPY that it performs is done on database french3 to the tesla table rather than the regions table.
The customname feature cannot be used to change the tables in a source database, as they must all be the same (for obvious reasons). We can, however, specify that a different schema be used for a target, as well as a different table. This only applies to Postgres targets, as other database types (e.g. MySQL) do not use schemas. Let's see that in action:
$ psql french4 -c 'create schema banana'
$ psql french4 -c 'alter table regions set schema banana'
$ psql french4 -c 'truncate table banana.regions cascade'
$ bucardo add customname regions banana.regions db=f4
$ bucardo reload wildstar

$ psql french4 -t -c 'select count(*) from banana.regions'
0
$ psql french2 -c 'update regions set name=name'
UPDATE 26
$ psql french4 -t -c 'select count(*) from banana.regions'
26
As before, the update on a source causes the changes to propagate to the other source database, as well as both targets. Note that the ALTER TABLE also mutated the associated sequence for the table, so there will be warnings in Bucardo's logs about the DEFAULT values for the primary keys in the regions tables being different. Since this post is getting long, I will save the discussion of customcols for another day.
PostgreSQL log analysis / PGSI
End Point recently started working with a new client (a startup in stealth mode, cannot name names, etc.) who is using PostgreSQL because of the great success some of the people starting the company have had with Postgres in previous companies. One of the things we recommend to our clients is a regular look at the database to see where the bottlenecks are. A good way to do this is by analyzing the logs. The two main tools for doing so are PGSI (Postgres System Impact) and pgfouine. We prefer PGSI for a few reasons: the output is better, it considers more factors, and it does not require you to munge your log_line_prefix setting quite as badly.
Both programs work basically the same: given a large number of log lines from Postgres, normalize the queries, see how long they took, and produce some pretty output.
If you only want to look at the longest queries, it's usually enough to set your log_min_duration_statement to something sane (such as 200), and then run a daily tail_n_mail job against it. This is what we are doing with this client, and it sends a daily report that looks like this:
Date: Mon Aug 29 11:22:33 2011 UTC
Host: acme-postgres-1
Minimum duration: 2000 ms
Matches from /var/log/pg_log/postgres-2011-08-29.log: 7
[1] (from line 227)
2011-08-29 08:36:50 UTC postgres@maindb [25198]
LOG: duration: 276945.482 ms statement: COPY public.sales
(id, name, region, item, quantity) TO stdout;
[2] (from line 729)
2011-08-29 21:29:18 UTC tony@quadrant [17176]
LOG: duration: 8229.237 ms execute dbdpg_p29855_1: SELECT
id, singer, track FROM album JOIN artist ON artist.id =
album.singer WHERE id < 1000 AND track <> 1
However, the PGSI program was born of the need to look at all the queries in the database, not just the slowest-running ones; the cumulative effect of many short queries can have much more of an impact on the server than a smaller number of long-running queries. Thus, PGSI looks not only at how long a query takes to run, but how many times it has run in a certain period, as well as how often it runs. All of this information is put together to give a score to each normalized query, known as the "system impact". Like the costs on a Postgres explain plan, this is a unit-less number and of little importance in and of itself: the important thing is to compare it to the other queries to see the relative impact. We also have that report emailed out. It looks similar to this (this is a text version of the HTML produced):
Log file: /var/log/pg_log/postgres-2011-08-29.log
* SELECT (24)
* UPDATE (1)
Query System Impact : SELECT
Log activity from 2011-08-29 11:00:01 to 2011-08-29 11:15:01
+------------------+--------------+
| System Impact:   | 0.15         |
| Mean Duration:   | 1230.95 ms   |
| Median Duration: | 1224.70 ms   |
| Total Count:     | 411          |
| Mean Interval:   | 4195 seconds |
| Std. Deviation:  | 126.01 ms    |
+------------------+--------------+
SELECT *
FROM albums
WHERE track <> ? AND artist = ?
ORDER BY artist, track
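To make the "normalize the queries, then aggregate" idea concrete, here is a minimal Python sketch. It is not PGSI's code, and it does not compute the system impact score (the formula is not covered in this post). The function names and the regexes used to spot literals are assumptions made purely for illustration.

```python
import re
from collections import defaultdict
from statistics import mean, median


def normalize(query):
    """Replace literals with '?' so that similar queries group together."""
    # String literals first, so digits inside them are not touched separately.
    query = re.sub(r"'(?:[^']|'')*'", "?", query)
    # Stand-alone numeric literals (identifiers such as col1 are left alone).
    query = re.sub(r"\b\d+(?:\.\d+)?\b", "?", query)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", query).strip()


def summarize(entries):
    """Group (duration_ms, query) pairs by normalized query text."""
    durations = defaultdict(list)
    for duration_ms, query in entries:
        durations[normalize(query)].append(duration_ms)
    return {
        text: {
            "count": len(values),
            "mean_ms": mean(values),
            "median_ms": median(values),
        }
        for text, values in durations.items()
    }
```

Feeding it the duration and statement from each log line gives per-query counts and timings, which is the raw material a report like the one above is built from.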
At this point you may be wondering how we get all the queries into the log. This is done by setting log_min_duration_statement to 0. However, most (but not all!) clients do not want full logging 24 hours a day, as this creates some very large log files. So the solution we use is to analyze only a slice of the day. It depends on the client, but we try for about 15 minutes during a busy time of day. Thus, the sequence of events is:
- Turn on "full logging" by dropping log_min_duration_statement to zero
- Some time later, set log_min_duration_statement back to what it was (e.g. 200)
- Extract the logs from the time it was set to zero to when it was flipped back.
- Run PGSI against the log subsection we pulled out
- Mail the results out
All of this is run by cron. The first problem is how to update the postgresql.conf file and have Postgres re-read it, all automatically. As covered previously, we use the modify_postgres_conf.pl script for this.
The exact incantation looks like this:
0 11 * * * perl bin/modify_postgres_conf --quiet \
--pgconf /etc/postgresql/9.0/main/postgresql.conf \
--change log_min_duration_statement=0
15 11 * * * perl bin/modify_postgres_conf --quiet \
--pgconf /etc/postgresql/9.0/main/postgresql.conf \
--change log_min_duration_statement=200 --no-comment
## The above are both one line each, but split for readability here
This changes log_min_duration_statement to 0 at 11AM, and then back to 200 fifteen minutes later. We use the --quiet argument because this is run from cron, so we don't want any output from modify_postgres_conf on success. We do want a comment when we flip it to 0, as this is the temporary state and we want anyone viewing the postgresql.conf file at that time (or just running "git diff") to realize it. We don't want a comment when we flip it back, as the timestamp in the comment would cause git to think the file had changed.
Now for the tricky bit: extracting out just the section of logs that we want and sending it to PGSI. Here's the recipe I came up with for this client:
16 11 * * * tac `ls -1rt /var/log/pg_log/postgres*log \
| tail -1` \
| sed -n '/statement" changed to "200"/,/statement" changed to "0"/ p' \
| tac \
| bin/pgsi.pl --quiet > tmp/pgsi.html && bin/send_pgsi.pl
## Again, the above is all one line
What does this do? First, it finds the latest file in the /var/log/pg_log directory that starts with 'postgres' and ends with 'log'. Then it uses the tac program to spool the file backwards, one line at a time ('tac' is the opposite of 'cat'). Then it pipes that output to the sed program, which prints out all lines starting with the one where we changed the log_min_duration_statement to 200, and ending with the one where we changed it to 0 (the reverse of what we actually did, as we are reading it backwards). Finally, we use tac again to put the lines back in the correct order, pipe the output to pgsi, write the output to a temporary file, and then call a quick Perl script named send_pgsi.pl which mails the temporary HTML file to some interested parties.
Why do we use tac? Because we want to read the file backwards, so as to make sure we get the correct slice of log files as delimited by the log_min_duration_statement changes. If we simply started at the beginning of the file, we might encounter other similar changes that were made earlier and not by us.
All of this is not foolproof, of course, but it does not have to be, as it is very easy to run manually if something (for example the sed recipe) goes wrong, as the log file will still be there. Yes, it's also possible to grab the ranges in other ways (such as perl), but I find sed the quickest and easiest. As tempting as it was to write Yet Another Perl Script to extract the lines, sometimes a few chained Unix programs can do the job quite nicely.
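For anyone who would rather not chain tac and sed, the same slicing idea can be sketched in Python. This is only an illustration, not part of our cron job. The function name and the marker defaults are assumptions based on the sed patterns above. Unlike the sed range, this version stops after the most recent slice and returns nothing if the opening marker is never found.

```python
def last_full_logging_slice(
    lines,
    start_marker='statement" changed to "0"',
    end_marker='statement" changed to "200"',
):
    """Return the most recent block of lines logged while full logging was on.

    The lines are scanned backwards (the job tac does in the shell recipe),
    so the first marker met is the one that switched full logging off again.
    """
    collected = []
    inside = False
    for line in reversed(lines):
        if not inside:
            if end_marker in line:
                inside = True
                collected.append(line)
            continue
        collected.append(line)
        if start_marker in line:
            collected.reverse()  # restore the original order, like the second tac
            return collected
    return []  # no complete slice found
```

Reading the newest log file into a list of lines and passing it to this function should yield the same kind of slice that gets piped into pgsi.pl.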
Changing postgresql.conf from a script
The modify_postgres_conf script for Postgres allows you to change your postgresql.conf file from the command line, via a cron job, or any time when you want to automate the process.
Postgres runs as a background daemon. The configuration parameters it runs with are stored in a file named postgresql.conf. To change the behavior of Postgres, one must usually edit this file, and then tell Postgres that you have made the changes. Sometimes all that is needed is to 'HUP' or reload Postgres. Most changes fall into this category. Other changes require a full restart of Postgres, which entails disconnecting all current clients.
Thus, to make a change, one must edit the file, find the item to change (the file consists of "name = value" lines), change it, then send a signal to the main Postgres process so it picks up the change. Finally, you should then connect to Postgres to make sure it is still running and has accepted the latest change.
Doing this automatically (such as via a cron script) is very difficult. One method, if you are doing something simple like toggling between two known configuration files, is to simply store copies of both files and replace them, like this example cronjob:
30 10 * * * cp -f conf/postgresql.conf.1 /etc/postgresql.conf; /etc/init.d/postgresql reload
50 10 * * * cp -f conf/postgresql.conf.2 /etc/postgresql.conf; /etc/init.d/postgresql reload
The major problem with that approach, as I quickly learned when I tried it, is that despite nobody making changes to the postgresql.conf file in *years*, a few days after I put the above change in place, someone decided to edit postgresql.conf. At 10:30AM the next day, their changes were blown away. A better way is to simply write a program to make the change for you. Thus, the modify_postgres_conf.pl script.
The basic usage is to tell the script where the conf file is, and list what changes you want to make. Here's an example that will change the random_page_cost to 2 on a Debian system:
./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
Here is exactly what the script does for the above statement:
- For each item to be changed, we:
  - Ask the database what the current value is (and die if that parameter does not exist)
  - If the current and new value are the same, do nothing
  - Otherwise, open (and flock) the configuration file and change the parameter
- If no changes were made, exit
- Otherwise, close the configuration file
- Figure out the Postgres PID and send it a HUP signal
- Reconnect to the database and confirm each change has taken effect
By default, it adds a comment after the changed value as well, to help in tracking down who made the change. A diff of the postgresql.conf file after running the example above produces:
diff -r1.1 postgresql.conf
499c499
< random_page_cost = 4
---
> random_page_cost = 2 ## changed by modify_postgres_conf.pl on Wed Aug 10 13:31:34 2011
The addition of the comment can be stopped by adding a --no-comment argument. If the script runs successfully, it also returns two items of information: the size and name of the current Postgres log file. This is useful so you can know exactly where in the log this change took place. Note that this only works for items that are already explicitly set in your configuration file. However, as discussed before, you should have all the items that you may possibly change explicitly listed out at the bottom of the file already. Whitespace is preserved as well, for those (like me) who like to keep things lined up neatly inside the file (see examples in the link above).
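The file-editing part of this process can be sketched in a few lines of Python. This is a rough illustration only, not the Perl script itself: the function name, the regex, and the choice to strip an earlier "## changed by" comment are all assumptions, and the database check, file locking, and HUP signal are left out entirely.

```python
import re


def change_setting(conf_text, name, new_value, comment=None):
    """Return (new_text, changed) after setting name to new_value.

    Only lines where the parameter is already explicitly set are touched,
    and the whitespace around the '=' is preserved.
    """
    pattern = re.compile(
        r"^(?P<lead>[ \t]*" + re.escape(name) + r"[ \t]*=[ \t]*)"
        r"(?P<value>'[^'\n]*'|[^\s#]+)"
        r"(?P<rest>[^\n]*)$",
        re.MULTILINE,
    )
    matches = list(pattern.finditer(conf_text))
    if not matches:
        raise KeyError(f'"{name}" is not explicitly set in the file')
    match = matches[-1]  # Postgres honors the last setting in the file
    if match.group("value") == new_value:
        return conf_text, False
    # Assumption: drop a comment left behind by an earlier change.
    rest = re.sub(r"[ \t]*## changed by.*$", "", match.group("rest"))
    new_line = match.group("lead") + new_value + rest
    if comment:
        new_line += " ## changed by " + comment
    return conf_text[: match.start()] + new_line + conf_text[match.end():], True
```

Returning a "changed" flag mirrors the script's behavior of doing nothing further when the value is already what was asked for.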
Here are some more examples of the script in action:
$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
114991 /var/log/postgres/postgres-2011-08-10.log
$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
No change made: value of "random_page_cost" is already 2
$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change random_page_cost=2 \
> --change log_statement=ddl \
> --change log_min_duration_statement=100
No change made: value of "random_page_cost" is already 2
118459 /var/log/postgres/postgres-2011-08-10.log
$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change default_statitics_target=200 --no-comment
There is no Postgres variable named "default_statitics_target"!
$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \
> --change default_statistics_target=200 --no-comment
123396 /var/log/postgres/postgres-2011-08-10.log
Note that we make no attempt to automatically check changes in to version control: as you will see in an upcoming blog post on a real-life use case, such a checkin is usually not wanted, as we are making temporary changes.
This is a fairly simple Perl script, but I thought I would put it out there in the hopes of helping others out (and preventing the reinventing of wheels). Of course, if you find a bug or want to write a patch for it, those are welcome additions at any time! The code can be found on github:
git clone git://github.com/bucardo/modify_postgres_config.git
Announcing pg_blockinfo!
I'm pleased to announce the initial release of pg_blockinfo. It is a tool to examine your PostgreSQL heap data files, written in Perl.
Similar in purpose to pg_filedump, it is used to display (and soon validate) buffer-page-level information for PostgreSQL page/heap files.
pg_blockinfo aims to work in a portable and non-destructive way, using read-only "mmap", sys-level IO functions, and "unpack" in order to minimize any Perl overhead.
What we buy for the compromise of writing this in Perl instead of C is two-fold:
- portability/low impact — pg_blockinfo has no other dependencies than Perl and several core Perl modules and will work in environments where you can't or won't easily install other packages or compile based on specific headers.
- expressibility — while not currently supported in full, one of pg_blockinfo's future goals is to allow you to specify criteria for display of both page-level and tuple-level info. It will allow you to define arbitrary Perl expressions to filter the objects you're looking at (i.e., pages, tuples, etc; think "grep" but on a tuple level). It will support a DSL for querying based off of the named fields as well as allow you to supply arbitrary Perl for scanning for any criteria.
Requirements
We require a perl version with PerlIO ":mmap" support, which basically means any perl >= 5.8. We do not require any non-core perl modules; currently we only use Data::Dumper and Getopt::Long for debugging and option parsing respectively, the former only when requested.
Getting pg_blockinfo
The canonical git repo for development for pg_blockinfo is located at github:
http://github.com/machack666/pg_blockinfo/
For the development repo, simply run:
$ git clone git://github.com/machack666/pg_blockinfo.git
Or you can just grab the current script directly here:
https://raw.github.com/machack666/pg_blockinfo/master/pg_blockinfo
Using pg_blockinfo
To get help including available options, canonical and alternate/abbreviated names of recognized fields, range syntax:
$ pg_blockinfo -h
To dump all fields for all page headers for all pages in a relation:
$ pg_blockinfo /path/to/relfile
To include only specific fields in the output you can specify multiple -f options and/or include multiple options per -f argument by comma delimiting. Field specifiers are processed in order, so only the final logical set will be included.
"all" is a special shorthand type which will expand to all known columns. pg_blockinfo may support other shorthand groups in the future. When no fields are provided explicitly, "all" is implicitly assumed.
There are both positive and negative field inclusions. An example of a positive inclusion is:
$ pg_blockinfo /path/to/relfile -f prune_xid,tli
This will display only the indicated fields in question for all blocks in relfile. To include all fields *except* certain ones, prefix their name with a '-' sign:
$ pg_blockinfo -f -pagesize_version /path/to/relfile
This will display all page header fields in all blocks with the exception of the pagesize_version header field.
One consequence of the way these field display options are designed (particularly going forward with additional field/tuple display options) is that you could define a "view" of the column data using a shell alias, then add/remove columns/criteria by passing additional -f options to it:
# using this as a shorthand to display just those fields
$ alias lsn='pg_blockinfo -f lsn_seq,lsn_off,tli'
$ lsn -f -tli /path/to/foo # remove fields from the display
$ lsn -f prune_xid /path/to/foo # or add to the list as well
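The in-order processing of -f options can be illustrated with a small Python sketch. This is not how pg_blockinfo (a Perl script) is implemented. The function name is hypothetical, and the rule that a leading negative specifier starts from the full set is an assumption inferred from the examples above.

```python
def resolve_fields(specs, known_fields):
    """Apply -f style field specifiers in order and return the final list.

    specs is a list of comma-delimited strings, one per -f option.
    """
    selected = None  # None means nothing has been chosen explicitly yet
    for spec in specs:
        for name in spec.split(","):
            name = name.strip()
            if not name:
                continue
            if name == "all":
                selected = list(known_fields)
            elif name.startswith("-"):
                # A negative specifier removes a field; with no earlier
                # selection it is taken to mean "all except this one".
                if selected is None:
                    selected = list(known_fields)
                selected = [f for f in selected if f != name[1:]]
            else:
                if selected is None:
                    selected = []
                if name not in selected:
                    selected.append(name)
    # No specifiers at all: "all" is implicitly assumed.
    return list(known_fields) if selected is None else selected
```

With the alias above, the two follow-up commands correspond to the spec lists ["lsn_seq,lsn_off,tli", "-tli"] and ["lsn_seq,lsn_off,tli", "prune_xid"].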
Similar functionality is available for selecting the specific blocks available using the range option (-r or -b), which lets you specify a range of blocks to look at instead of the entire file.
$ pg_blockinfo -r 2-49 /path/to/relfile
$ pg_blockinfo -r -100 /path/to/relfile
$ pg_blockinfo -r 2,4,120-140,0xFF-0x1100 /path/to/relfile
Range options can be provided multiple times, each with one or more comma-delimited block-range specifications. Blocks are numbered from 0, can be provided in decimal or hexadecimal (when prefixed via 0x), and can appear singly or in a range (bounded or unbounded) when separated by a hyphen.
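As a rough illustration of that range syntax, here is a Python sketch that turns a specification into (first, last) block pairs. The function names are hypothetical and this is not pg_blockinfo's own parser. In particular, treating a missing lower bound as block 0 and a missing upper bound as "to the end of the file" (None) is an assumption.

```python
def parse_block(text):
    """Parse one block number, in decimal or 0x-prefixed hexadecimal."""
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def parse_ranges(spec):
    """Turn a spec such as '2,4,120-140' into a list of (first, last) pairs.

    A last value of None stands for an open-ended range.
    """
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            first = parse_block(low) if low else 0
            last = parse_block(high) if high else None
        else:
            first = last = parse_block(part)
        ranges.append((first, last))
    return ranges
```

Since range options can be given more than once, a caller would simply concatenate the lists returned for each -r argument.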
Planned future features/TODO
In no particular order:
- dump tuples/tuple headers.
- better output/interpretation of bitflags.
- support alternate structures to allow detection/specification of different target versions of the page/tuple headers.
- allow querying/filtering pages/tuples.
- validation/sanity checking of various pages.
- actual extraction of ranges in the heap file.
- extract/dump tuples by raw ctid.
- allow arbitrary expressions to define powerful filtering options when querying/displaying information about the tuples/data files.
- detection of invalid toast tuple pointers/corrupted lz_compressed data (will require connection to the active system catalog).
- detect relfile type?
- mvcc queries against tuples at a given arbitrarily-constructed snapshot
- detect xids that are invalid (i.e. map to non-existent pages in the pg_clog directory).
- better/shorter name?
I look forward to any feedback, patches, or other improvements/interest.
DBD::Pg UTF-8 for PostgreSQL server_encoding
We are preparing to make a major version bump in DBD::Pg, the Perl interface for PostgreSQL, from the 2.x series to 3.x. This is due to a reworking of how we handle UTF-8. The change is not going to be backwards compatible, but will probably not affect many people. If you are using the pg_enable_utf8 flag, however, you definitely need to read on for the details.
The short version is that DBD::Pg is going to return all strings from the Postgres server with the Perl utf8 flag on. The sole exception will be databases in which the server_encoding is SQL_ASCII, in which case the flag will never be turned on.
For backwards compatibility and fine-tuning control, there is a new attribute called pg_utf8_strings that can be set at connection time to override the decision above. For example, if you need your connection to return byte-soup, non-utf8-marked strings, despite coming from a UTF-8 Postgres database, you can say:
my $dsn = 'dbi:Pg:dbname=foobar';
my $dbh = DBI->connect($dsn, $dbuser, $dbpass,
{ AutoCommit => 0,
RaiseError => 0,
PrintError => 0,
pg_utf8_strings => 0,
}
);
Similarly, you can set pg_utf8_strings to 1 and it will force returned strings to be marked as utf8, even if the backend is SQL_ASCII. You should not be using SQL_ASCII of course, and certainly not forcing the strings returned from it to UTF-8. :)
All Perl variables (be they strings or otherwise) are actually Perl objects, with some internal attributes defined on them. One of those is the utf8 flag, which can be flipped on to indicate that the string should be treated as possibly containing multi-byte characters, or it can be left off, to indicate the string should always be treated on a byte-by-byte basis. This will affect things like the Perl length function, and the \w character class in Perl regexes. This is completely unrelated to the Perl pragma use utf8, which DBD::Pg has nothing at all to do with. Have I mentioned that UTF-8, and UTF-8 in Perl in particular, can be quite confusing?
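The practical difference between "characters" and "bytes" is easier to see with a tiny example. The sketch below uses Python rather than Perl, so it is only an analogy: Python keeps text and bytes as separate types instead of using a flag, but length and \w matching differ between the two in the same spirit.

```python
import re

as_chars = "café"                    # text: a sequence of characters
as_bytes = as_chars.encode("utf-8")  # the same data as raw UTF-8 bytes

char_length = len(as_chars)  # counts characters
byte_length = len(as_bytes)  # counts bytes; "é" takes two bytes in UTF-8

# \w understands accented letters in text, but is ASCII-only for bytes.
word_as_chars = re.fullmatch(r"\w+", as_chars) is not None
word_as_bytes = re.fullmatch(rb"\w+", as_bytes) is not None
```

The same four-letter word has a different length, and a different answer to "is this all word characters?", depending on whether it is treated as text or as a byte-soup.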
There are a few exceptions as to what things DBD::Pg will mark as utf8. Integers and other numbers will not, boolean values will not, and no bytea data will ever have the flag set. When in doubt, assume that it is set.
The old attribute, pg_enable_utf8, will be deprecated, and have no effect. We thought about re-using that but it seemed clearer and cleaner to simply create a new variable (pg_utf8_strings), as the behavior has significantly changed.
A beta version of DBD::Pg (2.99.9_1) with these changes has been uploaded to CPAN for anyone to experiment with. Right now, none of this is set in stone, but we did want to get a working version out there to start the discussion and see how it interacts with applications that were making use of the pg_enable_utf8 flag. You can web search for "dbdpg" and look for the "Latest Dev. Release", or jump straight to the page for DBD::Pg 2.99.9_1. The trailing underscore is a CPAN convention that indicates this is a development version only, and thus will not replace the latest production version (2.18.1 as of this writing).
As a reminder, DBD::Pg has switched to using git, so you can follow along with the development with:
git clone git://bucardo.org/dbdpg.git
There is also a commits mailing list you can join to receive notifications of commits as they are pushed to the main repo. To sign up, send an email to dbd-pg-changes-subscribe@perl.org.
MongoDB replication from Postgres using Bucardo
One of the features of the upcoming version of Bucardo (a replication system for the PostgreSQL RDBMS) is the ability to replicate data to things other than PostgreSQL databases. One of those new targets is MongoDB, a non-relational 'document-based' database. (To be clear, we can only use MongoDB as a target, not as a source.)
To see this in action, let's set up a quick example, modified from the earlier blog post on running Bucardo 5. We will create a Bucardo instance that replicates from two Postgres master databases to a Postgres database target and a MongoDB instance target. We will start by setting up the prerequisites:
sudo aptitude install postgresql-server \
perl-DBIx-Safe \
perl-DBD-Pg \
postgresql-contrib
Getting Postgres up and running is left as an exercise to the reader. If you have problems, the friendly folks at #postgresql on irc.freenode.net will be able to help you out.
Now for the MongoDB parts. First, we need the server itself. Your distro may have it already available, in which case it's as simple as:
aptitude install mongodb
For more installation information, follow the links from the MongoDB Quickstart page. For my test box, I ended up installing from source by following the directions at the Building for Linux page.
Once MongoDB is installed, we will need to start it up. First, create a place for MongoDB to store its data, and then launch the mongodb process:
$ mkdir /tmp/mongodata
$ mongod --dbpath=/tmp/mongodata --fork --logpath=/tmp/mongo.log
all output going to: /tmp/mongo.log
forked process: 428
You can perform a quick test that it is working by invoking the command-line shell for MongoDB (named "mongo" of course). Use quit() to exit:
$ mongo
MongoDB shell version: 1.8.1
Fri Jun 10 12:45:00
connecting to: test
> quit()
$
The other piece we need is a Perl driver so that Bucardo (which is written in Perl) can talk to the MongoDB server. Luckily, there is an excellent one available on CPAN named 'MongoDB'. We started the MongoDB server before doing this step because the driver we will install needs a running MongoDB instance to pass all of its tests. The module has very good documentation available on its CPAN page. Installation may be as easy as:
$ sudo cpan MongoDB
If that did not work for you (case matters!), there are more detailed directions on the Perl Language Center page.
Our next step is to grab the latest Bucardo, install it, and create a new Bucardo instance. See the previous blog post for more details about each step.
$ git clone git://bucardo.org/bucardo.git
Initialized empty Git repository...
$ cd bucardo
$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for Bucardo
$ make
cp bucardo.schema blib/share/bucardo.schema
cp Bucardo.pm blib/lib/Bucardo.pm
cp bucardo blib/script/bucardo
/usr/bin/perl -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/bucardo
Manifying blib/man1/bucardo.1pm
Manifying blib/man3/Bucardo.3pm
$ sudo make install
Installing /usr/local/lib/perl5/site_perl/5.10.0/Bucardo.pm
Installing /usr/local/share/bucardo/bucardo.schema
Installing /usr/local/bin/bucardo
Installing /usr/local/share/man/man1/bucardo.1pm
Installing /usr/local/share/man/man3/Bucardo.3pm
Appending installation info to /usr/lib/perl5/5.10.0/i386-linux-thread-multi/perllocal.pod
$ sudo mkdir /var/run/bucardo
$ sudo chown $USER /var/run/bucardo
$ bucardo install
This will install the bucardo database into an existing Postgres cluster.
...
Installation is now complete.
Now we create some test databases and populate with pgbench:
$ psql -c 'create database btest1'
CREATE DATABASE
$ pgbench -i btest1
NOTICE: table "pgbench_branches" does not exist, skipping
...
creating tables...
10000 tuples done.
20000 tuples done.
...
100000 tuples done.
$ psql -c 'create database btest2 template btest1'
CREATE DATABASE
$ psql -c 'create database btest3 template btest1'
CREATE DATABASE
$ psql btest3 -c 'truncate table pgbench_accounts'
TRUNCATE TABLE
$ bucardo add db t1 dbname=btest1
Added database "t1"
$ bucardo add db t2 dbname=btest2
Added database "t2"
$ bucardo add db t3 dbname=btest3
Added database "t3"
$ bucardo list dbs
Database: t1 Status: active Conn: psql -p 5432 -U bucardo -d btest1
Database: t2 Status: active Conn: psql -p 5432 -U bucardo -d btest2
Database: t3 Status: active Conn: psql -p 5432 -U bucardo -d btest3
$ bucardo add tables pgbench_accounts pgbench_branches pgbench_tellers herd=therd
Created herd "therd"
Added table "public.pgbench_accounts"
Added table "public.pgbench_branches"
Added table "public.pgbench_tellers"
$ bucardo list tables
Table: public.pgbench_accounts DB: t1 PK: aid (int4)
Table: public.pgbench_branches DB: t1 PK: bid (int4)
Table: public.pgbench_tellers DB: t1 PK: tid (int4)
The next step is to add in our MongoDB instance. The syntax is the same as the "add db" above, but we also tell it the type of database, as it is not the default of "postgres". We will also assign an arbitrary database name, "btest1", the same as the others. Everything else (such as the port and host) is default, so all we need to say is:
$ bucardo add db m1 dbname=btest1 type=mongo
Added database "m1"
$ bucardo list dbs
Database: m1 Type: mongo Status: active
Database: t1 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest1
Database: t2 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest2
Database: t3 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest3
Next we group our databases together and assign them roles:
$ bucardo add dbgroup tgroup t1:source t2:source t3:target m1:target
Created database group "tgroup"
Added database "t1" to group "tgroup" as source
Added database "t2" to group "tgroup" as source
Added database "t3" to group "tgroup" as target
Added database "m1" to group "tgroup" as target
Note that "target" is the default action, so we could shorten that to:
$ bucardo add dbgroup tgroup t1:source t2 t3 m1
However, I think it is best to be explicit, even if it does (incorrectly) hint that m1 could be anything *other* than a target. :)
We are almost ready to go. The final step is to create a sync (a basic replication event in Bucardo), then we can start up Bucardo, put some test data into the master databases, and 'kick' the sync:
$ bucardo add sync mongotest herd=therd dbs=tgroup ping=false
Added sync "mongotest"
$ bucardo start
Checking for existing processes
Starting Bucardo
$ pgbench -t 10000 btest1
starting vacuum...end.
transaction type: TPC-B (sort of)
number of transactions actually processed: 10000/10000
...
tps = 503.300595 (excluding connections establishing)
$ pgbench -t 10000 btest2
number of transactions actually processed: 10000/10000
...
tps = 408.059368 (excluding connections establishing)
$ bucardo kick mongotest
We'll give it a few seconds to replicate those changes (it took 18 seconds on my test box), and then check the output of bucardo status:
$ bucardo status
PID of Bucardo MCP: 3317
Name State Last good Time Last I/D/C Last bad Time
===========+========+============+=======+=============+===========+=======
mongotest | Good | 21:57:47 | 11s | 6/36234/898 | none |
Looks good, but what about the data in MongoDB? Let's get some counts from the Postgres masters and slave, and then look at the data inside MongoDB with the mongo command-line client:
$ psql btest1 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$ psql btest2 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$ psql btest3 -c 'SELECT count(*) FROM pgbench_accounts'
18106
$ psql btest1 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ psql btest2 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ psql btest3 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:46:00
connecting to: btest1
> show collections
bucardo_status
pgbench_accounts
pgbench_branches
pgbench_tellers
system.indexes
> db.pgbench_accounts.count()
18106
> db.pgbench_accounts.find().sort({abalance:1}).limit(1).next()
{
"_id" : ObjectId("4df39bcb8795839660001de5"),
"abalance" : -12071,
"aid" : 84733,
"bid" : 1,
"filler" : " "
}
> db.pgbench_accounts.find().sort({abalance:-1}).limit(1).next()
{
"_id" : ObjectId("4df39bd08795839660002fb0"),
"abalance" : 13010,
"aid" : 45500,
"bid" : 1,
"filler" : " "
}
Why the difference in counts? We only started replicating after we populated the Postgres tables on the master databases with 100,000 rows, so the eighteen thousand is the number of rows that was changed during the subsequent pgbench run. (Note that pgbench uses randomness, so your numbers will be different than the above). In the future Bucardo will support the "onetimecopy" feature for MongoDB, but until then we can fully populate the pgbench_accounts collection simply by "touching" all the records on one of the masters:
$ psql btest1 -c 'UPDATE pgbench_accounts SET aid=aid'
UPDATE 100000
$ bucardo kick mongotest
Kicked sync mongotest
$ echo 'db.pgbench_accounts.count()' | mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:47:00
connecting to: btest1
> 100000
> bye
A nice feature of MongoDB is its autovivification ability (aka dynamic schemas), which means unlike Postgres you do not have to create your tables first, but can simply ask MongoDB to do an insert, and it will create the table (or, in mongospeak, the collection) automatically for you.
Because MongoDB has no concept of transactions, and because Bucardo does not update, but does deletes plus inserts (for reasons I'll not get into today), there is one more trick Bucardo does when replicating to a MongoDB instance. A collection named 'bucardo_status' is created and updated at the start and the end of a sync (a replication event). Thus, your application can pause if it sees this table has a 'started' value, and wait until it sees 'complete' or 'failed'. Not foolproof by any means, but better than nothing :) You should, of course, carefully consider the way your app and Bucardo will coordinate things.
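How an application might wait on that status can be sketched as below. This is an assumption-heavy illustration in Python: read_status is a hypothetical callable that your application would supply to fetch the current value from the bucardo_status collection (the layout of that collection is not described here), and only the 'started', 'complete', and 'failed' values come from the text above.

```python
import time


def wait_for_sync(read_status, poll_seconds=0.5, timeout=60.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until the sync is no longer 'started', then return the status.

    read_status is a caller-supplied function returning the current status
    string. A TimeoutError is raised if the sync never leaves 'started'.
    """
    deadline = clock() + timeout
    while True:
        status = read_status()
        if status != "started":
            return status  # for example 'complete' or 'failed'
        if clock() >= deadline:
            raise TimeoutError("sync still marked as started")
        sleep(poll_seconds)
```

The caller still has to decide what to do with a 'failed' result; the sketch only handles the waiting.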
Feedback from Postgres or MongoDB folk is much appreciated: there are probably some rough edges, but as you can see from above, the basics are there and working. Feel free to email the bucardo-general mailing list or make a feature request / bug report on the Bucardo Bugzilla page.
Bucardo multi-master for PostgreSQL
The next version of Bucardo, a replication system for Postgres, is almost complete. The scope of the changes required a major version bump, so this Bucardo will start at version 5.0.0. Much of the innards was rewritten, with the following goals:
Multi-master support
Where "multi" means "as many as you want"! There are no more pushdelta (master to slaves) or swap (master to master) syncs: there is simply one sync where you tell it which databases to use, and what role they play. See examples below.
Ease of use
The bucardo program (previously known as 'bucardo_ctl') has been greatly improved, making all the administrative tasks such as adding tables, creating syncs, etc. much easier.
Performance
Much of the underlying architecture was improved, and sometimes rewritten, to make things go much faster. Most striking is the difference between the old multi-master "swap syncs" and the new method, which has been described as "orders of magnitudes" faster by early testers. We use async database calls whenever possible, and no longer have the bottleneck of a single large bucardo_delta table.
Improved logging
Not only are more details provided, there is now the ability to control how verbose the logs are. Just set the log_level parameter to terse, normal, verbose, or debug. Those with busy systems, where the old logging was the equivalent of a 'debug' firehose, will really appreciate this.
Different targets
Who says your slave (target) databases need to be Postgres? In addition to the ability to write text SQL files (for say, shipping to a different system), you can have Bucardo push to other systems as well. Stay tuned for more details on this. (Update: there is a blog post about using MongoDB as a target)
This new version is not quite at beta yet, but you can try out a demo of multi-master on Postgres quite easily. Let's see if we can do it in ten steps.
I. Download all prerequisites
To run Bucardo, you will need a Postgres database (obviously), the DBIx::Safe module, the DBI and DBD::Pg modules, and (for the purposes of this demo) the pgbench utility. Systems vary, but on aptitude-based systems, one can grab all of the above like this:
aptitude install postgresql-server \
perl-DBIx-Safe \
perl-DBD-Pg \
postgresql-contrib
II. Grab the latest Bucardo
git clone git://bucardo.org/bucardo.git
III. Install the program
cd bucardo
perl Makefile.PL
make
sudo make install
You can ignore any errors that come up about ExtUtils::MakeMaker not being recent.
IV. Setup an instance of Bucardo
This step assumes there is a running Postgres available to connect to.
sudo mkdir /var/run/bucardo
sudo chown $USER /var/run/bucardo
bucardo install
V. Use the pgbench program to create some test tables
psql -c 'CREATE DATABASE btest1'
pgbench -i btest1
psql -c 'CREATE DATABASE btest2 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest3 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest4 TEMPLATE btest1'
psql -c 'CREATE DATABASE btest5 TEMPLATE btest1'
VI. Tell Bucardo about the databases and tables you are going to use
bucardo add db t1 dbname=btest1
bucardo add db t2 dbname=btest2
bucardo add db t3 dbname=btest3
bucardo add db t4 dbname=btest4
bucardo add db t5 dbname=btest5
bucardo list dbs
bucardo add table pgbench_accounts pgbench_branches pgbench_tellers herd=therd
bucardo list tables
A herd is simply a logical grouping of tables. We did not add the other pgbench table, pgbench_history, because it has no primary key or unique index.
VII. Group the databases together and set their roles
bucardo add dbgroup tgroup t1:source t2:source t3:source t4:source t5:target
We've grouped all five databases together, and made four of them masters (aka source), and one of them a slave (aka target). You can have any combination of masters and slaves you want, as long as there is at least one master.
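To make the "at least one master" rule concrete, here is a small Python sketch that parses the same name:role pairs and checks them. This is not how Bucardo itself validates a dbgroup; the `parse_dbgroup` helper is hypothetical, and it only knows the two roles mentioned in this post.

```python
def parse_dbgroup(spec):
    """Parse 'name:role' pairs such as 't1:source t5:target' into a dict."""
    roles = {}
    for item in spec.split():
        name, _, role = item.partition(':')
        # Assumption: only the two roles discussed in this post are accepted
        if role not in ('source', 'target'):
            raise ValueError(f'unknown role in {item!r}')
        roles[name] = role
    if 'source' not in roles.values():
        raise ValueError('a group needs at least one source (master)')
    return roles

group = parse_dbgroup('t1:source t2:source t3:source t4:source t5:target')
sources = sorted(n for n, r in group.items() if r == 'source')
print(sources)
```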
VII. Create the Bucardo sync
bucardo add sync foobar herd=therd dbs=tgroup ping=false
Here we simply create a new sync, which is a controllable replication event, telling it which tables we want to replicate, and which databases we are going to use. We also set ping to false, which means that we will not create triggers to automatically fire off replication on any changes, but will do it manually. In a real world scenario, you generally do want those triggers, or want to set Bucardo to check periodically.
VIII. Start up Bucardo
bucardo start
If all went well, you should see some information in the log.bucardo file in the current directory.
IX. Make a bunch of changes on all the source databases.
pgbench -t 10000 btest1
pgbench -t 10000 btest2
pgbench -t 10000 btest3
pgbench -t 10000 btest4
Here, we've told pgbench to run ten thousand transactions against each of the first four databases. Triggers on these tables have captured the changes.
X. Kick off the sync and watch the fun.
bucardo kick foobar
You can now tail the log.bucardo file to see the fun, or simply run:
bucardo status
...to see what it is doing, and the final counts when we are done. Don't forget to stop Bucardo when you are done testing:
bucardo stop
The output of bucardo status, after the sync has completed, should look like this:
bucardo status
Name      State    Last good    Time    Last I/D/C           Last bad    Time
========+========+============+=======+====================+===========+=======
foobar  | Good   | 17:58:37   | 3m2s  | 131836/131836/4785 | none      |
Here we see that this sync has never failed ("Last bad"), the time of day of the last good run, how long ago it was from right now (3 minutes and 2 seconds), as well as details of the last successful run. Last I/D/C stands for number of inserts, deletes, and collisions across all databases for this sync. This is just an overview of all syncs at a high level, but we can also give status an argument of a sync name to see more details like so:
bucardo status foobar
Last good                       : Jun 02, 2011 17:57:47 (time to run: 42s)
Rows deleted/inserted/conflicts : 131,836 / 131,836 / 4,785
Sync name                       : foobar
Current state                   : Good
Source herd/database            : therd / t1
Tables in sync                  : 3
Status                          : active
Check time                      : none
Overdue time                    : 00:00:00
Expired time                    : 00:00:00
Stayalive/Kidsalive             : yes / yes
Rebuild index                   : 0
Ping                            : no
Onetimecopy                     : 0
Post-copy analyze               : Yes
Last error:                     :
This gives us a little more information about the sync itself, as well as another important metric, how long the sync itself took to run, in this case, 42 seconds. That particular metric might make its way back to the overall "status" view above. Try things out and help us find bugs and improve Bucardo!
Postgres Bug Tracking - Help Wanted!
Once again there is talk in the Postgres community about adopting the use of a bug tracker. The latest thread, on pgsql-hackers, was started by someone asking about the status of their patch. Or rather, asking an even better meta-question about how one finds out the status of a PostgreSQL bug report or patch. Sadly, the answer is that there is no standard way, other than sending emails until someone replies one way or another. The current process works something like this:
- Someone finds a bug
- They send an email to pgsql-bugs@postgresql.org OR they use the web form, which grabs a sequential number and mails the report to pgsql-bugs@postgresql.org. Nothing else is done/stored, it just sends the email.
- Someone replies about the bug OR nobody replies about the bug.
- After a fix is found, which may involve some emails on other mailing lists, someone replies that the bug is fixed on the original thread. Maybe.
As you can see, there is some room for improvement there. Some of the most glaring holes in the current system:
- No way to search previous / existing bugs
- No way to tell the status of a bug
- No way to categorize and group bugs (per version, per platform, per component, per severity, etc.)
- No way to know who is working on a bug
- No way to prevent things from slipping through the cracks
Luckily, the above problems have been solved for many, many years now by a wide variety of bug tracking software. There have traditionally been three problems with getting a bug tracker working for the Postgres project:
Inertia
The current system is, in a very literal sense, "good enough", so it's hard to impose the inevitable short-term pain of a new system when there always seem to be more pressing matters to attend to.
Doesn't Make Julienne Fries
Everyone wants a different set of features, and getting all the hackers involved to agree on even a simple subset of desired features is pretty difficult. This is sort of similar to the crusade by myself and others to get git as the replacement version control system; there were some strong voices for competing systems (e.g. mercurial).
Who Will Put the Bell on the Cat?
Everyone talks about the problem, and there have even been some attempts over the years to implement some sort of system, but the problem remains that setting up such a system, getting it smoothly integrated into the project's work flow, and then maintaining said system is a non-trivial task. Especially when you can't be assured of buy-in from some of the major players.
I'm hopeful that the recent thread indicates a slight shift of late in global acceptance of the need for a bug tracking system. The question is, which one, and who is going to take the time to write something? I'm really hoping someone who has been lurking in the background will step up and help create something wonderful (okay, we can start with 'decent' :) Perhaps even someone with experience setting up bug tracking systems. Certainly Postgres must be one of the last major open source projects without a bug tracker; there is plenty of hard-won experience out there to be learned from. It would also be ideal if the person or persons was *not* a Postgres hacker of any sort, as taking the time to build and maintain this system would definitely take time away from their other hacking tasks. On the other hand, one could argue that a bug tracker is a vital piece of project infrastructure that is potentially as important as any other work that goes on. I certainly think so.
Only Try This At Home

Taken by Josh 6 years to the day before the release of 9.1 beta 1
For the record, 9.1 is gearing up to be an awesome release. I was tinkering and testing PostgreSQL 9.1 Beta 1 (... You are beta testing, too, right?) ... and some of the new PL/Python features caught my eye. These are minor among all the really cool high profile features, to be sure. But it made me think back to a little bit of experimental code written some time ago, and how these couple language additions could make a big difference.
For one reason or another I'd just hit the top level postgresql.org website, and suddenly realized just how many Postgres databases it took to put together what I was seeing on the screen. Not only does it power the content database that generated the page, of course, but even the lookup of the .org went through Afilias and their Postgres-backed domain service. It's a pity the DBMS couldn't act as the middle layer between those.
Or could it?
That's a shortened form of it just for demonstration purposes (the original one had things like a table browser) ... but it works. For example, on this test 9.1 install, hit http://localhost:8000/public/webtest and the following table appears:
| generate_series | lh | rnd |
|---|---|---|
| 1 | 0 | 0.548577250913 |
| 2 | 1 | 1.70926172473 |
| 3 | 1 | 1.24841631576 |
| (etc) | ... | ... |
Note the use of two specific 9.1 features, though. The plpy object contains nice query building helper utilities like quote_ident that you may be familiar with in other languages. But this also makes use of subtransactions, which helps recover from db errors. That's important here, as something like a typo in a table name will generate an error from Postgres and without that in place the database will end the transaction and ignore any subsequent commands the function tries to run.
But with that in place, the page shows the 404 error, and picks up where it left off with subsequent requests:
Error code 404. Message: Table not found.
By the way, if it's not clear by now, don't take this anywhere near a production database, if for no other reason than that a transaction will be held open as long as that function runs. That will hold back all the nice maintenance stuff that keeps things running efficiently. Still, I think it helps show off what just a handful of lines of code can do in a powerful language like PL/Python. I'm sure with the right module PL/PerlU could do something very similar. But even more I think it shows how Postgres is growing and innovating by leaps and bounds, seemingly every day!
NOTIFY vs Prepared Transactions in Postgres (the Bucardo solution)

We recently had a client use Bucardo to migrate their app from Postgres 8.2 to Postgres 9.0 with no downtime (which went great). They were also using Bucardo to replicate from the new 9.0 master to a bunch of 9.0 slaves. This ran into problems the moment the application started, as we started seeing these messages in the logs:
ERROR: cannot PREPARE a transaction that has
executed LISTEN, UNLISTEN or NOTIFY
The problem is that the Postgres LISTEN/NOTIFY system cannot be used with prepared transactions. Bucardo uses a trigger on the source tables that issues a NOTIFY to let the main Bucardo daemon know that something has changed and needs to be replicated. However, their application was issuing a PREPARE TRANSACTION as an occasional part of its work. Thus, they would update the table, which would fire the trigger, which would send the NOTIFY. Then the application would issue the PREPARE TRANSACTION, which produced the error given above. Bucardo is set up to deal with this situation; rather than using notify triggers, the Bucardo daemon can be set to look for any changes at a set interval. The steps to change Bucardo's behavior for a given sync are simply:
$ bucardo_ctl update sync foobar ping=false checktime=15
$ bucardo_ctl validate foobar
$ bucardo_ctl reload foobar
The first command tells the sync not to use notify triggers (these are actually statement-level triggers that simply issue a NOTIFY bucardo_kick_sync_foobar). It also sets a checktime of 15 seconds, which means that the Bucardo daemon will check for changes every 15 seconds, as if the original notify trigger were firing every 15 seconds. The second command validates the sync by checking that all supporting tables, functions, triggers, etc. are installed and up to date. It also removes triggers that are no longer needed: in this case, the statement-level notify triggers for all tables in this sync. Finally, the third command simply tells the Bucardo daemon to stop the sync, load in the new changes, and restart it.
Another solution to the problem is to simply not use prepared transactions: very few applications actually need them, but I've noticed a few that use them anyway when they should not. What exactly is a prepared transaction? It's the Postgres way of implementing two-phase commit. Basically, this means that a transaction's state is stored away on disk, and can be committed or rolled back at a later time - even by a different session. This is handy if you need to ensure that, for example, you can atomically commit across multiple database connections. By atomically, I mean that either they all commit or none of them do. This is done by doing work on each database, issuing a PREPARE TRANSACTION, and then, once all have been prepared, issuing a COMMIT PREPARED against each one.
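The all-or-nothing flow can be sketched in a few lines of Python. This is only a simulation under stated assumptions: the `Participant` class is a hypothetical stand-in for a database connection and no real database is involved. The comments note which Postgres command each step would correspond to.

```python
class Participant:
    """Stand-in for one database connection (hypothetical, no real database)."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.state = 'active'

    def prepare(self, gid):
        # Corresponds to: PREPARE TRANSACTION 'gid'
        if self.fail_on_prepare:
            self.state = 'aborted'
            raise RuntimeError(f'{self.name}: prepare failed')
        self.state = 'prepared'

    def commit_prepared(self, gid):
        # Corresponds to: COMMIT PREPARED 'gid'
        self.state = 'committed'

    def rollback_prepared(self, gid):
        # Corresponds to: ROLLBACK PREPARED 'gid'
        self.state = 'rolled back'


def two_phase_commit(participants, gid):
    prepared = []
    try:
        # Phase one: every participant must prepare successfully
        for p in participants:
            p.prepare(gid)
            prepared.append(p)
    except RuntimeError:
        # Undo whichever participants had already prepared
        for p in prepared:
            p.rollback_prepared(gid)
        return False
    # Phase two: all prepared, so commit everywhere
    for p in participants:
        p.commit_prepared(gid)
    return True


ok = two_phase_commit([Participant('db1'), Participant('db2')], 'demo')
bad_group = [Participant('db1'), Participant('db2', fail_on_prepare=True)]
failed = two_phase_commit(bad_group, 'demo')
print(ok, failed, [p.state for p in bad_group])
```

A real coordinator would also have to roll back participants that had not yet reached the prepare step, and survive its own crash between the two phases; the sketch leaves both out.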
As an aside, prepared transactions are often confused with prepared statements. While the use of prepared statements is very common, use of prepared transactions is very rare. Prepared statements are simply a way of planning a query one time, then re-running it multiple times without having to run the query through the planner each time. Many interfaces, such as DBD::Pg, will do this for you automatically behind the scenes. Sometimes using prepared statements can cause issues, but it is usually a win.
As mentioned above, the use of 2PC (two-phase commit) is very rare, which is why the default for the max_prepared_transactions variable was recently changed to 0, which effectively disallows the use of prepared transactions until you explicitly turn them on in your postgresql.conf file. This helps prevent people from accidentally issuing a PREPARE TRANSACTION and then leaving it around. This mistake is easy to make, for once you issue the command, everything goes back to normal and it's easy to forget about it. However, having prepared transactions around is a bad thing, as they continue to hold locks, and can prevent vacuum from running. The check_postgres program even has a specific check for this situation: check_prepared_txns.
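For a sense of what such a check involves, here is a minimal Python sketch that flags old prepared transactions. It is not how check_postgres implements its check; the five-minute threshold and the sample rows are made up for illustration. The pg_prepared_xacts system view and its gid, prepared, owner, and database columns are standard Postgres.

```python
from datetime import datetime, timedelta

# Query one could run to list prepared transactions
# (pg_prepared_xacts is a standard Postgres system view)
QUERY = 'SELECT gid, prepared, owner, database FROM pg_prepared_xacts'

def stale_transactions(rows, now, max_age=timedelta(minutes=5)):
    """Return the gids of prepared transactions older than max_age.

    rows is a list of (gid, prepared_timestamp, owner, database) tuples,
    in the column order of QUERY above.
    """
    return [gid for gid, prepared, _owner, _db in rows
            if now - prepared > max_age]

# Made-up sample rows standing in for a query result
now = datetime(2011, 5, 1, 12, 0, 0)
rows = [
    ('foobar', datetime(2011, 5, 1, 9, 30, 0), 'app', 'testdb'),
    ('recent', datetime(2011, 5, 1, 11, 59, 0), 'app', 'testdb'),
]
stale = stale_transactions(rows, now)
print(stale)
```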
What does two-phase commit look like? There are only three basic commands: PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED. Each takes a name, which is an arbitrary string less than 200 bytes long. Usage is to start a transaction, do some work, and then issue a PREPARE TRANSACTION instead of a COMMIT. At this point, all the work you have done is gone from your session and stored on disk. You cannot get back into this transaction: you can only commit it or roll it back. See the docs on PREPARE TRANSACTION for the full details.
Here's an example of two-phase commit in action:
testdb=# BEGIN;
BEGIN
testdb=#* CREATE TABLE preptest(a int);
CREATE TABLE
testdb=#* INSERT INTO preptest VALUES (1),(2),(3);
INSERT 0 3
testdb=#* SELECT * FROM preptest;
a
---
1
2
3
(3 rows)
testdb=#* PREPARE TRANSACTION 'foobar';
PREPARE TRANSACTION
testdb=# SELECT * FROM preptest;
ERROR: relation "preptest" does not exist
LINE 1: SELECT * FROM preptest;
^
testdb=# COMMIT PREPARED 'foobar';
COMMIT PREPARED
testdb=# SELECT * FROM preptest;
a
---
1
2
3
(3 rows)
A contrived example, but you can see how easy it could be to issue a PREPARE TRANSACTION and not even realize that it actually sticks around forever!
MySQL Integer Size Attributes
MySQL has those curious size attributes you can apply to integer data types. For example, when creating a table, you might see:
mysql> CREATE TABLE foo (
-> field_ti tinyint(1),
-> field_si smallint(2),
-> field_int int(4),
-> field_bi bigint(5)
-> );
Query OK, 0 rows affected (0.05 sec)
mysql> desc foo;
+-----------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+-------+
| field_ti | tinyint(1) | YES | | NULL | |
| field_si | smallint(2) | YES | | NULL | |
| field_int | int(4) | YES | | NULL | |
| field_bi | bigint(5) | YES | | NULL | |
+-----------+-------------+------+-----+---------+-------+
4 rows in set (0.03 sec)
mysql>
I had always assumed those size attributes were limiters, MySQL's way of providing some sort of constraint on the integers allowed in the field. While doing some recent work for a MySQL client, I attempted to enforce the range of a tinyint according to that assumption. In reality, I only wanted a sign field, and would have liked to have applied a "CHECK field IN (-1,1)", but without check constraints I figured at least keeping obviously incorrect data out would be better than nothing.
I wanted to see what MySQL's behavior would be on data entry that failed the limiters. I was hoping for an error, but expecting truncation. What I discovered was neither.
mysql> INSERT INTO foo (field_ti) VALUES (-1);
Query OK, 1 row affected (0.00 sec)
mysql> SELECT field_ti FROM foo;
+----------+
| field_ti |
+----------+
|       -1 |
+----------+
1 row in set (0.00 sec)
mysql> INSERT INTO foo (field_ti) VALUES (1);
Query OK, 1 row affected (0.00 sec)
mysql> SELECT field_ti FROM foo;
+----------+
| field_ti |
+----------+
|       -1 |
|        1 |
+----------+
2 rows in set (0.00 sec)
mysql> INSERT INTO foo (field_ti) VALUES (10);
Query OK, 1 row affected (0.00 sec)
mysql> SELECT field_ti FROM foo;
+----------+
| field_ti |
+----------+
|       -1 |
|        1 |
|       10 |
+----------+
3 rows in set (0.00 sec)
mysql> INSERT INTO foo (field_ti) VALUES (100);
Query OK, 1 row affected (0.00 sec)
mysql> SELECT field_ti FROM foo;
+----------+
| field_ti |
+----------+
|       -1 |
|        1 |
|       10 |
|      100 |
+----------+
4 rows in set (0.00 sec)
mysql>
Two possible conclusions followed immediately: either the limiter feature was horribly broken, or those apparent sizes didn't represent a limiter feature. A full review of MySQL's Numeric Types documentation provided the answer:
MySQL supports an extension for optionally specifying the display width of integer data types in parentheses following the base keyword for the type. For example, INT(4) specifies an INT with a display width of four digits. This optional display width may be used by applications to display integer values having a width less than the width specified for the column by left-padding them with spaces. (That is, this width is present in the metadata returned with result sets. Whether it is used or not is up to the application.) The display width does not constrain the range of values that can be stored in the column.
And, so, the lesson is repeated: Beware assumptions.
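To make the distinction concrete, here is a small Python sketch of the arithmetic. The range of values comes from the storage size of the type (1, 2, 4, and 8 bytes for the four types used above), while the display width only affects padding. The `pad_for_display` helper is a hypothetical illustration of what a client application might do with the width, not anything MySQL itself runs.

```python
def signed_range(num_bytes):
    """Range of a signed integer stored in num_bytes bytes."""
    bits = num_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# Storage sizes in bytes for the MySQL integer types used above
STORAGE = {'tinyint': 1, 'smallint': 2, 'int': 4, 'bigint': 8}

def pad_for_display(value, display_width):
    """What a client might do with the display width: left-pad with spaces."""
    return str(value).rjust(display_width)

tiny_range = signed_range(STORAGE['tinyint'])
print(tiny_range)                     # range allowed by a one-byte signed integer
print(repr(pad_for_display(7, 4)))    # padded out to four characters
print(repr(pad_for_display(100, 1)))  # wider than the display width: not truncated
```

So a tinyint(1) column accepts 100 because 100 fits in one signed byte; the (1) never enters into it.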
Postgres query caching with DBIx::Cache
A few years back, I started working on a module named DBIx::Cache which would add a caching layer at the database driver level. The project that was driving it got put on hold indefinitely, so it's been on my long-term todo list to release what I did have to the public in the hope that someone else may find it useful. Hence, I've just released version 1.0.1 of DBIx::Cache. Consider it the closest thing Postgres has at the moment for query caching. :) The canonical webpage:
http://bucardo.org/wiki/DBIx-Cache
You can also grab it via git, either directly:
git clone git://bucardo.org/dbixcache.git/
or through the indispensable github:
https://github.com/bucardo/dbixcache
So, what does it do exactly? Well, the idea is that certain queries that are either repeated often and/or are very expensive to run should be cached somewhere, such that the database does not have to redo all the same work, just to return the same results over and over to the client application. Currently, the best you can hope for with Postgres is that things are in RAM from being run recently. DBIx::Cache changes this by caching the results somewhere else. The default destination is memcached.
DBIx::Cache acts as a transparent layer around your DBI calls. You can control which queries, or classes of queries get cached. Most of the basic DBI methods are overridden so that rather than query Postgres, they actually query memcached as needed (or other caching layer - could even query back into Postgres itself!). Let's look at a simple example:
use strict;
use warnings;
use Data::Dumper;
use DBIx::Cache;
use Cache::Memcached::Fast;
## Connect to an existing memcached server,
## and establish a default namespace
my $mc = Cache::Memcached::Fast->new(
{
servers => [ { address => 'localhost:11211' } ],
namespace => 'joy',
});
## Rather than DBI->connect, use DBIx::Cache->connect
## Tell it what to use as our caching source
## (the memcached server above)
my $dbh = DBIx::Cache->connect('', '', '',
{ RaiseError => 1,
dxc_cachehandle => $mc
});
## This is an expensive query, that takes 30 seconds to run:
my $SQL = 'SELECT * FROM analyze_sales_data()';
## Prepare this query
my $sth = $dbh->prepare($SQL);
## Run it ten times in a row.
## The first time takes 30 seconds, the other nine return instantly.
for (1..10) {
my $count = $sth->execute();
my $info = $sth->fetchall_arrayref({});
print Dumper $info;
}
In the above, the prepare($SQL) is actually calling the DBIx::Cache::prepare method. This parses the query and tries to determine if it is cacheable or not, then stores that decision internally. Regardless of the result, it calls DBI::prepare (which is technically DBD::Pg::prepare), and returns the result. The magic comes in the call to execute() later on. As you might imagine, this is also actually the DBIx::Cache::execute() method. If the query is not cacheable, it simply runs it as normal and returns. If it is cacheable, and this is the first time it is run, DBIx::Cache runs an EXPLAIN EXECUTE on the original statement, and parses out a list of all tables that are used in this query. Then it caches all of this information into memcached, so that subsequent runs using the same list of arguments to execute() don't need to do that work again.
Finally, we come to fetchall_arrayref(). The first time it is run, we simply call the parent methods and get the data back. Then we build unique keys and store the results of the query into memcached. Finally, we mark the execute() as fully cached. Thus, on subsequent calls to execute(), we don't actually execute anything on the database server, but simply return the count as stashed inside of memcached (in the case of execute, this is the number of affected rows). For the various fetch() methods, we do the same thing - rather than fetch things from the database (via DBI, DBD::Pg, and libpq), we get the results from memcached (frozen via Data::Dumper), and then unpack and return them. Since we don't actually need to do any work against the database, everything returns as fast as we can query memcached - which is in general very fast indeed.
Most of the above is working, but the piece that is not written is the cache invalidation. DBIx::Cache knows which tables go to which queries, so in theory you could have (for example), an UPDATE/INSERT/DELETE trigger on table X which calls DBIx::Cache and tells it to invalidate all items related to table X, so that the next call to prepare() or execute() or fetch() will not find any memcached matches and re-run the whole query and store the results. You could also simply handle that in your application, of course, and have it decide when to invalidate items.
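The caching and the table-based invalidation idea can be sketched in Python as follows. This is a toy in-memory model, not the DBIx::Cache code: a plain dict stands in for memcached, the `QueryCache` class and its method names are hypothetical, and the list of tables is passed in by hand rather than parsed out of EXPLAIN output.

```python
class QueryCache:
    """Toy in-memory stand-in for the memcached layer (not DBIx::Cache itself)."""

    def __init__(self):
        self.results = {}      # (sql, args) -> cached rows
        self.by_table = {}     # table name -> set of cache keys

    def fetch(self, sql, args, tables, run_query):
        key = (sql, tuple(args))
        if key not in self.results:
            # Cache miss: run the real query and remember which tables it used
            self.results[key] = run_query(sql, args)
            for table in tables:
                self.by_table.setdefault(table, set()).add(key)
        return self.results[key]

    def invalidate_table(self, table):
        # Drop every cached result that depends on this table
        for key in self.by_table.pop(table, set()):
            self.results.pop(key, None)


calls = []

def run_query(sql, args):
    calls.append(sql)            # count how often the "database" is hit
    return [('widget', 42)]

cache = QueryCache()
sql = 'SELECT * FROM sales WHERE region = ?'
cache.fetch(sql, ['east'], ['sales'], run_query)
cache.fetch(sql, ['east'], ['sales'], run_query)   # served from the cache
cache.invalidate_table('sales')                    # e.g. requested by a trigger
cache.fetch(sql, ['east'], ['sales'], run_query)   # runs the query again
print(len(calls))
```

Keying on both the SQL text and the arguments mirrors the point above that only runs with the same list of arguments to execute() can share a cached result.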
It's been a while since I've really looked at the code, but as far as I can tell it is close to being able to actually use somewhere. :) Patches and questions welcome!
DBD::Pg query cancelling in Postgres
A new version of DBD::Pg, the Perl driver for PostgreSQL, has just been released. In addition to fixing some memory leaks and other minor bugs, this release (version 2.18.0) introduces support for the DBI method known as cancel(). A giant thanks to Eric Simon, who wrote this new feature. The new method is similar to the existing pg_cancel() method, except it works on synchronous rather than asynchronous queries. I'll show an example of both below.
DBD::Pg has been able to handle asynchronous queries for a while now. Basically, that means you don't have to wait around for the database to finish a query. Your application can do other things while the query runs, then check back later to see if it has completed and grab the results. The way to cancel an already kicked-off asynchronous query is with the pg_cancel() method (the other asynchronous methods are pg_ready and pg_result, which have no synchronous equivalents).
The prefix "pg_" is used because there is no corresponding built-in DBI method to override, and the convention is to prefix everything custom to a driver with the driver's prefix, in our case 'pg'. Here's an example showing one possible use of asynchronous queries using DBD::Pg in some Perl code:
## We are connecting to two servers and running expensive
## queries on both. We kick both off right away, then wait
## for them both to finish. Our total wait time is thus
## max(server1,server2) rather than sum(server1,server2)
use strict;
use warnings;
use DBI;
use DBD::Pg qw{ :async };
my $dsn1 = 'dbi:Pg:dbname=sales;host=example1.com';
my $dsn2 = 'dbi:Pg:dbname=sales;host=example2.com';
my $dbh1 = DBI->connect($dsn1, '', '', {AutoCommit=>0, RaiseError=>1});
my $dbh2 = DBI->connect($dsn2, '', '', {AutoCommit=>0, RaiseError=>1});
my $SQL = 'SELECT gather_yearly_sales_data()';
print "Kicking off a long, expensive query on database one\n";
## Normally, a do() will not return until the query is complete
## However, the async flag causes it to return immediately
$dbh1->do($SQL, {pg_async => PG_ASYNC});
print "Kicking off a long, expensive query on database two\n";
$dbh2->do($SQL, {pg_async => PG_ASYNC});
## Both queries are running in the 'background'
## We have to wait for both, so it doesn't matter which one we wait for here
## However, if it's been over 2 minutes, we'll cancel both and quit
my $time = 0;
while ( ! $dbh1->pg_ready() ) {
    sleep 1;
    if ($time++ > 120) {
        print "Taking too long, let's cancel the queries\n";
        $dbh1->pg_cancel();
        $dbh2->pg_cancel();
        $dbh1->rollback();
        $dbh2->rollback();
        die "No sales data was retrieved\n";
    }
}
## We know that database 1 has finished, so we read in the results
my $rows1 = $dbh1->pg_result();
## We then grab results from database 2
## This will block until done, which is okay
my $rows2 = $dbh2->pg_result();
The new method, simply known as cancel(), will kill any synchronously running query. One of the main uses for this is to timeout a query by using the builtin Perl alarm function. However, since the builtin alarm function has some quirks, we will instead use the much safer POSIX::SigAction method. Another example:
## We are running a series of queries against a database, but if
## the whole thing is taking over 30 seconds, we want to cancel
## the currently running query and move on to something else.
use strict;
use warnings;
use DBI;
use DBD::Pg qw{ :async };
## Needed for SIGALRM, sigaction, POSIX::SigSet, and POSIX::SigAction
use POSIX qw{ :signal_h };
my $dsn = 'dbi:Pg:dbname=dq';
my $dbh = DBI->connect($dsn, '', '', {AutoCommit=>0, RaiseError=>1});
## Setup all the POSIX alarm plumbing
my $mask = POSIX::SigSet->new(SIGALRM);
my $action = POSIX::SigAction->new(
sub { die "TIMEOUT\n" },
$mask,
);
my $oldaction = POSIX::SigAction->new();
sigaction( SIGALRM, $action, $oldaction );
## Prepare the queries
my $upd = $dbh->prepare('UPDATE foobar SET x=? WHERE y=?');
my $inv = $dbh->prepare('SELECT refresh_inventory(?)');
## Yes, a double eval. Async is looking better all the time :)
eval {
    eval {
        alarm 30;
        for my $y (12,24,48) {
            print "Adjusting widget #$y\n";
            $upd->execute(555,$y);
            print "Recalculating inventory\n";
            $inv->execute($y);
        }
    };
    alarm 0; ## Turn off our alarm
    die "$@\n" if $@; ## Bubble the error to the outer eval
};
if ($@) { ## Something went wrong
    if ($@ =~ /TIMEOUT/) {
        print "Queries are taking too long! Cancelling\n";
        ## We don't know which one is still running, and don't care
        ## It's safe to cancel a non-active statement handle
        $upd->cancel() or die qq{Failed to cancel the query!\n};
        $inv->cancel() or die qq{Failed to cancel the query!\n};
        $dbh->rollback();
        die "Who has time to wait 30 seconds anymore?";
    }
    ## Some other non-alarm error, so we simply:
    die $@;
}
print "Updates are complete\n";
$dbh->commit();
exit;
Got an interesting use case for asynchronous queries or the new $sth->cancel()? Let me know!
Annotating Your Logs
We recently did some PostgreSQL performance analysis for a client with an application having some scaling problems. In essence, they wanted to know where Postgres was getting bogged down, and once we knew that we'd be able to target some fixes. But to get to that point, we had to gather a whole bunch of log data for analysis while the test software hit the site.
This is on Postgres 8.3 in a rather locked down environment, by the way. Coordinated pg_rotate_logfile() was useful, but occasionally it would seem to devolve to something resembling: "Okay, we're adding 60 more users ... now!" And I'd write down the time stamp, and figure out an appropriate place to slice the log file later.
Got me thinking, what if we could just drop an entry into the log file, and use it to filter things out later? My first instinct was to start looking at seeing if a patch would be accepted, maybe a wrapper for ereport(), something easy. Turns out, it's even easier than that...
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 60 users'; END;$$;
DO
Time: 0.464 ms
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 120 users'; END;$$;
DO
Time: 0.378 ms
pubsite=# DO $$BEGIN RAISE LOG 'MARK: 360 users'; END;$$;
DO
Time: 0.700 ms
Of course the above will only work on version 9.0 and up (eventually). Previous versions that have PL/pgSQL turned on can just create a function that does the same thing. The "LOG" severity level is an informational message that's supposed to always make it into the log files. So with those in place, a grep through the log can reveal just where they appear, and sed can extract the sections of log between those lines and feed them into your favorite analysis utility:
postgres@mothra:~$ grep -n 'LOG: MARK' /var/log/postgresql/postgresql-9.0-main.log
19180:2011-03-31 20:20:37 EDT LOG: MARK: 60 users
19478:2011-03-31 20:25:48 EDT LOG: MARK: 120 users
20247:2011-03-31 20:32:15 EDT LOG: MARK: 360 users
postgres@mothra:~$ sed -n '19180,19478p' /var/log/postgresql/postgresql-9.0-main.log | bin/pgsi.pl > 60users.html
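The same slicing can be done in one pass instead of reading line numbers off grep by hand. Here is a minimal Python sketch of that idea; the function name, the regular expression, and the sample log lines are all made up for illustration, and it assumes every marker line contains "LOG:" followed by "MARK:" as in the output above.

```python
import re

# Matches marker lines such as "2011-03-31 20:20:37 EDT LOG: MARK: 60 users".
# The pattern is an assumption based on the log format shown in the post.
MARK_RE = re.compile(r"LOG:\s+MARK: (.*)$")

def split_log_by_marks(lines):
    """Group log lines into sections, keyed by the MARK label that opened them."""
    sections = {}
    current = None  # lines before the first marker are ignored
    for line in lines:
        match = MARK_RE.search(line)
        if match:
            current = match.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

# Hypothetical log excerpt, not taken from a real server.
sample_log = [
    "2011-03-31 20:19:59 EDT LOG: duration: 1.204 ms",
    "2011-03-31 20:20:37 EDT LOG: MARK: 60 users",
    "2011-03-31 20:21:02 EDT LOG: duration: 3.101 ms",
    "2011-03-31 20:25:48 EDT LOG: MARK: 120 users",
    "2011-03-31 20:26:10 EDT LOG: duration: 3.350 ms",
    "2011-03-31 20:27:44 EDT LOG: duration: 2.998 ms",
]

sections = split_log_by_marks(sample_log)
for label, body in sections.items():
    print(label, len(body))
```

Each section could then be written to its own file and handed to whatever analysis tool you prefer.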
Oh, and the performance problem? Turns out it wasn't Postgres at all: the average execution time of every single query varied only minimally as the concurrent user count was scaled higher and higher. But that's another story.
Postgres Build Farm Animal Differences
I'm a big fan of the Postgres Build Farm, a distributed network of computers that are constantly installing, building, and testing Postgres to detect any problems in the code. The build farm works best when there is a wide variety of operating systems and architectures testing. Thus, while I have a rather common x86_64 Linux box available for testing, I try to make it a little unique to get better test coverage.
One thing I've been working on is clang support (clang is an alternative to gcc). Unfortunately, the latest version of clang has a bug that prevents it from building Postgres on Linux boxes. I submitted a small patch to the Postgres source to fix this, but it was decided that we'll wait until clang fixes their bug. Supposedly they have in their svn head, but I've not been able to get that to compile successfully.
So I also just installed gcc 4.6.0, the latest and greatest. Installing it was not easy (nasty problems with the mpfr dependencies), but it's done now and working. It probably won't make any difference as far as the results, but at least my box is somewhat different from all the other x86_64 Linux boxes in the farm. :)
I've asked before on the list (with no response) about what sort of configuration changes could be made to expand the range of testing. The build farm itself provides a handful of things to choose from, and most of the animals in the farm have most of them configured (I have everything except "pam" and "vpath" enabled). However, one thing I've thought about changing is NAMEDATALEN. It's basically a compile-time option that sets the maximum number of characters things like table names can have. It is set by default to 64, while the SQL spec wants it to be 128. The problem is that this causes some tests to fail, as they have a hard-coded assumption about the length. The real problem of course is that Postgres' 'make check' is a very crude test. I've got some ideas on how to fix that, but that's another post for another day. So, anyone have other ideas on how to make my particular build farm member, and others like it, more useful?
check_postgres without Nagios (Postgres checkpoints)
Version 2.16.0 of check_postgres, a monitoring tool for Postgres, was just released. We're still trying to keep a "release often" schedule, and hopefully this year will see many releases. In addition to a few minor bug fixes, we added a new check by Nicola Thauvin called hot_standby_delay, which, as you might have guessed from the name, calculates the streaming replication lag between a master server and one of the slaves connected to it. Obviously the servers must be running PostgreSQL 9.0 or better.
Another recently added feature (in version 2.15.0) was the simple addition of a --quiet flag. All this does is prevent any normal output when an OK status is found. I wrote this because sometimes even Nagios is overkill. In the default mode (Nagios; the other major mode is MRTG), check_postgres will exit with one of four states, each with its own exit code: OK, WARNING, CRITICAL, or UNKNOWN. It also outputs a small message, per Nagios conventions, so a txn_idle action might exit with a value of 1 and output something similar to this:
POSTGRES_TXN_IDLE WARNING: (host:svr1) longest idle in txn: 4638s
I had a situation where I wanted to use the functionality of check_postgres (to examine the lag on a warm standby server), but did not want the overhead of adding it into Nagios, and just needed a quick email to be sent if there were any problems. Thus, the use of the quiet flag yielded a quick and cheap Nagios replacement using cron:
*/10 * * * * bin/check_postgres.pl --action=checkpoint -w 300 -c 600 --datadir=/dbdir --quiet
So every 10 minutes the script gathers the number of seconds since the last checkpoint was run. If that number is under five minutes (300 seconds), it exits silently. If it's over five minutes (a warning, or critical once past the 600 seconds given with -c), it outputs something similar to this, which cron then sends in an email:
POSTGRES_CHECKPOINT CRITICAL: Last checkpoint was 842 seconds ago
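To make the thresholds concrete, here is a small Python sketch of the decision being described. It is not check_postgres code: the function name is made up, the exit codes are the standard Nagios plugin values, and whether the real tool treats the boundaries as inclusive is an assumption.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def checkpoint_status(seconds_since_checkpoint, warning=300, critical=600, quiet=True):
    """Return (exit_code, message) for a checkpoint-age check.

    Hypothetical helper: inclusive thresholds are an assumption, not
    something verified against check_postgres itself.
    """
    if seconds_since_checkpoint >= critical:
        state, code = "CRITICAL", CRITICAL
    elif seconds_since_checkpoint >= warning:
        state, code = "WARNING", WARNING
    else:
        state, code = "OK", OK

    message = (
        f"POSTGRES_CHECKPOINT {state}: "
        f"Last checkpoint was {seconds_since_checkpoint} seconds ago"
    )
    # With --quiet, nothing is printed for OK, so cron has nothing to mail.
    if quiet and code == OK:
        message = ""
    return code, message

for age in (120, 400, 842):
    print(age, checkpoint_status(age))
```

Because cron only sends mail when a job produces output, the empty message for the OK case is what turns the check into a cheap alerting mechanism.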
I'm not advocating replacing Nagios of course: there are many other good reasons to use Nagios instead of cron, but this worked well for the situation at hand. Other actions, feature requests, and patches for check_postgres are always welcome, either on the check_postgres bug tracker or the mailing list.
DBD::Pg, UTF-8, and Postgres client_encoding
Photo by Roger Smith
I've been working on getting DBD::Pg to play nicely with UTF-8, as the current system is suboptimal at best. DBD::Pg is the Perl interface to Postgres, and is the glue code that takes the data from the database (via libpq) and gives it to your Perl program. However, not all data is created equal, and that's where the complications begin.
Currently, everything coming back from the database is, by default, treated as byte soup, meaning no conversion is done, and no strings are marked as utf8 (Perl strings are actually objects in which one of the attributes you can set is 'utf8'). If you want strings marked as utf8, you must currently set the pg_enable_utf8 attribute on the database handle like so:
$dbh->{pg_enable_utf8} = 1;
This causes DBD::Pg to scan incoming strings for high bits and mark the string as utf8 if it finds them. There are a few drawbacks to this system:
- It does this for all databases, even SQL_ASCII!
- It doesn't do this for everything, e.g. arrays, custom data types, xml.
- It requires the user to remember to set pg_enable_utf8.
- It adds overhead as we have to parse every single byte coming back from the database.
Here's one proposal for a new system. Feedback welcome, as this is a tricky thing to get right.
DBD::Pg will examine the client_encoding parameter, and see if it matches UTF8. If it does, then we can assume everything coming back to us from Postgres is UTF-8. Therefore, we'll simply flip the utf8 bit on for all strings. The one exception is bytea data, of course, which we'll read in and dequote into a non-utf8 string. Any non-UTF8 client_encodings (e.g. the monstrosity that is SQL_ASCII) will simply get back a byte soup, with no utf8 markings on our part.
The pg_enable_utf8 attribute will remain, so that applications that do their own decoding, or otherwise do not want the utf8 flag set, can forcibly disable it by setting pg_enable_utf8 to 0. Similarly, it can be forced on by setting pg_enable_utf8 to 1. The flag will always trump the client_encoding parameter.
A further complication is client_encoding: What if it defaults to something else? We can set it ourselves upon first connecting, and then if the program changes it after that point, it's on them to deal with the issues. (DBD::Pg will still assume it is UTF-8, as we don't constantly recheck the parameter.)
Someone also raised the issue of marking ASCII-only strings as utf8. While technically this is not correct, it would be nice to avoid having to parse every single byte that comes out of the database to look for high bits. Hopefully, programs requesting data from a UTF-8 database will not be surprised when things come back marked as utf8.
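The proposed rules can be summarised as a small decision function. This is only a Python sketch of the logic, not DBD::Pg code: the function name and arguments are hypothetical, None stands for "pg_enable_utf8 not set", and keeping bytea unmarked even when the flag is forced on is my own assumption.

```python
def should_mark_utf8(client_encoding, pg_enable_utf8=None, is_bytea=False):
    """Decide whether a returned string would be flagged as utf8.

    pg_enable_utf8: None means unset; 0 or 1 means explicitly set.
    """
    # bytea is raw binary data, so it is never marked (assumed to hold
    # even when the flag is forced on).
    if is_bytea:
        return False
    # An explicit pg_enable_utf8 always trumps client_encoding.
    if pg_enable_utf8 is not None:
        return bool(pg_enable_utf8)
    # Otherwise, trust the client_encoding seen at connection time.
    return client_encoding.upper().replace("-", "") == "UTF8"

print(should_mark_utf8("UTF8"))
print(should_mark_utf8("SQL_ASCII"))
print(should_mark_utf8("SQL_ASCII", pg_enable_utf8=1))
print(should_mark_utf8("UTF8", pg_enable_utf8=0))
```

Note that nothing here inspects the bytes themselves, which is the point of the proposal: the per-byte scan for high bits goes away.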
Feel free to comment here or on the bug that started it all. Thanks also to David Christensen, who has given me great input on this topic.
Character encoding in perl: decode_utf8() vs decode('utf8')
When doing some recent encoding-based work in Perl, I found myself in a situation which seemed fairly unexplainable. I had a function which used some data which was encoded as UTF-8, ran Encode::decode_utf8() on said data to convert to Perl's internal character format, then converted the "wide" characters to the numeric entity using HTML::Entities::encode_entities_numeric(). Logging/printing of the data on input confirmed that the data was properly formatted UTF-8, as did running `iconv -f utf8 -t utf8 output.log >/dev/null` for the purposes of review.
However, when I ended up processing the data, it was as if I had not run the decode function at all. In this case, the character in question was € (Unicode code point U+20AC). The expected behavior from encode_entities_numeric() would be to turn any of the hi-bit characters in the Perl string (i.e. all Unicode code points > 0x80) into the corresponding numeric entity (&#x20AC; for € in this case). However, instead of that specific character's numeric entity appearing in the output, the entities which appeared were &#xE2;&#x82;&#xAC;, i.e., the raw UTF-8 encoded value for €, with each octet being treated as an independent character instead of part of the whole encoded value.
What was particularly confusing was that extracting the relevant parts from the script in question resulted in the expected answer, so it was clearly not an issue of HTML::Entities not being able to deal with Unicode characters, as this code snippet demonstrates:
$ perl -MHTML::Entities+encode_entities_numeric -MEncode -e '$c=qq{\xE2\x82\xAC}; print encode_entities_numeric(decode_utf8($c))'
--> &#x20AC;
In the actual non-extracted version of the code, I was scratching my head. This was exhibiting the signs of doubly-encoded data, but I couldn't see how that could be the case. There were no PerlIO layers (e.g., :utf8 or :encoding) at play, and the data I was outputting to a log file for verification purposes was being written via a brand new filehandle from a bare open(). I verified in multiple ways that the raw octets being passed in to the function were not doubly-encoded (printing the raw character points, counting lengths of the runs of octets and verifying that these matched the length of the UTF-8 encoded value for the represented characters, etc). The more things I tried the more puzzled I got. Finally, I changed the Encode::decode_utf8() call to an Encode::decode('utf8') one, providing the encoding explicitly. At this point, the processing pipeline started working as expected, and hi-bit characters were being output as their full numeric entities.
Since the documentation for decode_utf8 indicated that it should be identical to decode('utf8'), I resorted to the code to find out why it worked with the version that specified the encoding explicitly. I found that decode_utf8() does one additional thing that the regular decode('utf8') does not, and that is that before processing via the regular decode() function, decode_utf8 first checks the UTF-8 flag of the data that is being passed in, and if it is set it returns the data without further decoding*. My best guess is that this is to prevent errors if someone attempts to decode UTF-8 data in a string which is already in Perl's internal format, so in most cases this will provide a caller-friendly interface that will DWYM in many expected cases.
Armed with this knowledge, I verified that for some reason, the data that was being passed into the function had the UTF-8 flag set, so using the explicit decode('utf8') in lieu of decode_utf8() fixed the issue for me. (Tracing down the reason for the UTF-8 flag being set on this data was out of scope for this exercise, but is the true fix.) And just to verify that this was in fact the cause of the issue at hand, here's our example, modified slightly (we use the utf8::upgrade function to turn the UTF-8 flag on in the data and treat as actual encoded characters instead of raw octets):
$ perl -l -MHTML::Entities+encode_entities_numeric -MEncode -Mutf8 -e '$c=qq{\xE2\x82\xAC}; utf8::upgrade($c); print encode_entities_numeric(decode_utf8($c))'
--> &#xE2;&#x82;&#xAC;
* The UTF-8 flag is more-or-less an implementation detail of how Perl is able to deal with legacy 8-bit binary data in no particular encoding (i.e., raw octets, which it treats as latin-1) as well as the full range of Unicode data, and deal with both efficiently and in a backwards-compatible manner.
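The symptom itself is easy to reproduce outside of Perl. The Python sketch below is only an analogy: Python has no UTF-8 flag, so decoding the octets as latin-1 stands in for "the decode step was silently skipped", and numeric_entities is a made-up stand-in for encode_entities_numeric().

```python
def numeric_entities(text):
    """Replace every character at or above U+0080 with a hex numeric entity."""
    return "".join(
        f"&#x{ord(ch):X};" if ord(ch) >= 0x80 else ch
        for ch in text
    )

raw = b"\xe2\x82\xac"  # the UTF-8 encoding of U+20AC (euro sign)

# Properly decoded: one character, so one entity.
decoded = raw.decode("utf-8")

# Decode skipped: each octet is treated as its own latin-1 character.
undecoded = raw.decode("latin-1")

print(len(decoded), numeric_entities(decoded))      # 1 &#x20AC;
print(len(undecoded), numeric_entities(undecoded))  # 3 &#xE2;&#x82;&#xAC;
```

Seeing three entities where one is expected is the telltale sign that the octets reached the entity encoder without being decoded.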
PostgreSQL 9.0 High Performance Review
I recently had the privilege of reading and reviewing the book PostgreSQL 9.0 High Performance by Greg Smith. While the title of the book suggests that it may be relevant only to PostgreSQL 9.0, there is in fact a wealth of information to be found which is relevant for all community supported versions of Postgres.
Achieving the highest performance with PostgreSQL is definitely something which touches all layers of the stack, from your specific disk hardware, OS and filesystem to the database configuration, connection/data access patterns, and queries in use. This book gathers up a lot of the information and advice that I've seen bandied about on the IRC channel and the PostgreSQL mailing lists and presents it in one place.
While seemingly related, I believe some of the main points of the book could be summed up as:
- Measure, don't guess. From the early chapters which cover the lowest-level considerations, such as disk hardware/configuration to the later chapters which cover such topics as query optimization, replication and partitioning, considerable emphasis is placed on determining the metrics by which to measure performance before/after specific changes. This is the only way to determine the impact the changes you make have.
- Tailor to your specific needs/workflows. While there are many good rules of thumb out there when it comes to configuration/tuning, this book emphasizes the process of refining those more general numbers to tailor configuration/setup to your specific database's needs.
- Review the information the database system itself gives you. Information provided by the pg_stat_* views can be useful in identifying bottlenecks in queries, unused/underused indexes.
This book also introduced me to a few goodies which I had not encountered previously. One of the more interesting ones is the pg_buffercache contrib module. This suite of functions allows you to peek at the internals of the shared_buffers cache to get a feel for which relations are heavily accessed on a block-by-block basis. The examples in the book show this being used to more accurately size shared_buffers based on the actual number of accesses to specific portions of different relations.
I found the book to be well-written (always a plus when reading technical books) and felt it covered quite a bit of depth given its ambitious scope. Overall, it was an informative and enjoyable read.
Utah Open Source Conference 2010 part 1
It's been a little over a month since the 2010 Utah Open Source Conference, and I decided to take a few minutes to review talks I enjoyed and link to my own talk slides.
Magento: Mac Newbold of Code Greene spoke on the Magento ecommerce framework for PHP. I'm somewhat familiar with Magento, but a few things stood out:
- He finds the Magento Enterprise edition kind of problematic because Varien won't support you if you have any unsupported extensions. Some of his customers had problems with Varien support and went back to the community edition.
- Magento is now up to around 30 MB of PHP files!
- As I've heard elsewhere, serious customization has a steep learning curve.
- The Magento data model is an EAV (Entity-Attribute-Value) model. To get 50 columns of output requires 50+ joins between 8 tables (one EAV table for each value datatype).
- There are 120 tables total in default install -- many core features don't use the EAV tables for performance reasons.
- Another observation I've heard in pretty much every conversation about Magento: It is very resource intensive. Shared hosting is not recommended. Virtual servers should have a minimum of 1/2 to 1 GB RAM. Fast disk & database help most. APC cache recommended with at least 128 MB.
- A lot of front-end things are highly adjustable from simple back-end admin options.
- Saved credit cards are stored in the database, and the key is on the server. I didn't get a chance to ask for more details about this. I hope it's only the public part of a public/secret keypair!
It was a good overview for someone wanting to go beyond marketing feature lists.
Node.js: Shane Hansen of Backcountry.com spoke on Node, comparing it to Tornado and Twisted in Python. He calls JavaScript "Lisp in C's clothing", and says its culture of asynchronous, callback-driven code patterns makes Node a natural fit.
Performance and parallel processing are clearly strong incentives to look into Node. The echo server does 20K requests/sec. There are 2000+ Node projects on GitHub and 500+ packages in npm (Node Package Manager), including database drivers, web frameworks, parsers, testing frameworks, payment gateway integrations, and web analytics.
A few packages worth looking into further:
- express - web microframework like Sinatra
- Socket-IO - Web Sockets now; falls back to other things if no Web Sockets available
- hummingbird - web analytics, used by Gilt.com
- bespin - "cloud JavaScript editor"
- yui3 - build HTML via DOM, eventbus, etc.
- connect - like Ruby's Rack
I haven't played with Node at all yet, and this got me much more interested.
Metasploit: Jason Wood spoke on Metasploit, a penetration testing (or just plain penetrating!) tool. It was originally in Perl, and now is in Ruby. It comes with 590 exploits and has a text-based interactive control console.
Metasploit uses several external apps: nmap, Maltego (proprietary reconnaissance tool), Nessus (no longer open source, but GPL version and OpenVAS fork still available), Nexpose, Ratproxy, Karma.
The reconnaissance modules include DNS enumeration, and an email address collector that uses the big search engines.
It can backdoor PuTTY, PDFs, audio, and more.
This is clearly something you've got to experiment with to appreciate. Jason posted his Metasploit talk slides which have more detail.
So Many Choices: Web App Deployment with Perl, Python, and Ruby: This was my talk, and it was a lot of fun to prepare for, as I got to take time to see some new happenings I'd missed in these three language communities' web server and framework space over the past several years.
The slides give pointers to a lot of interesting projects and topics to check out.
My summary was this. We have an embarrassment of riches in the open source web application world. Perl, Python, and Ruby all have very nice modern frameworks for developing web applications. They also have several equivalent solid options for deploying web applications. If you haven't tried the following, check them out:
That's about half of my notes on talks, but all I have time for now. I'll cover more in a later post.
Upgrading old versions of Postgres
Old elephant courtesy of Photos8.com
The recent release of Postgres 9.0.0 at the start of October 2010 was not the only big news from the project. Also released were versions 7.4.30 and 8.0.26, which, as I noted in my usual PGP checksum report, are going to be the last publicly released revisions in the 7.4 and 8.0 branches. In addition, the 8.1 branch will no longer be supported by the end of 2010. If you are still using one of those branches (or something older!), this should be the incentive you need to upgrade as soon as possible. To be clear, this means that anyone running Postgres 8.1 or older is not going to get any official updates, including security and bug fixes.
A brief recap: Postgres uses major versions, containing two numbers, to indicate a major change in features and functionality. These are released about every two years. Each of these major versions has many revisions, which are released as often as needed. These revisions are designed to be completely binary compatible with the previous revision, meaning you can upgrade revisions very easily, with no dump and restore of the data needed.
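As a quick illustration of that numbering scheme, here is a Python sketch that splits a version string into its major version and revision, and checks whether moving between two versions is a simple revision upgrade. The function names are made up, and the parsing assumes the three-part version strings used at the time of writing (e.g. 8.0.26).

```python
def parse_version(version):
    """Split a three-part Postgres version such as '8.0.26' into (major, revision)."""
    first, second, revision = (int(part) for part in version.split("."))
    return (first, second), revision

def is_revision_upgrade(old, new):
    """True if 'new' is a later revision within the same major version as 'old'."""
    old_major, old_revision = parse_version(old)
    new_major, new_revision = parse_version(new)
    return old_major == new_major and new_revision > old_revision

print(parse_version("8.0.26"))                  # ((8, 0), 26)
print(is_revision_upgrade("7.4.29", "7.4.30"))  # True: no dump and restore needed
print(is_revision_upgrade("8.1.21", "9.0.0"))   # False: a major version change
```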
Below are the options available for those running older versions of Postgres, from the most desirable to the least desirable. The three general options are to upgrade to the latest release (9.0 as I write this), migrate to a newer version, or stay on your release.
1. Upgrade to the latest release
This is the best option, as each new version of Postgres adds more features and becomes more efficient, all while maintaining the high code quality standards Postgres is known for. There are three general approaches to upgrading: pg_upgrade, pg_dump, and Bucardo / Slony.
Using pg_upgrade
The pg_upgrade utility is the preferred method for upgrading in the future. Basically, it rewrites your data directory from the "old" on-disk format to the "new" one. Unfortunately, pg_upgrade only works from version 8.3 and onwards, which means it cannot be used if you are coming from an older version. (This utility used to be called pg_migrator, in case you see references to that.)
Dump and restore
The next best method is the tried and true "dump and restore". This involves using pg_dump to create a logical representation of the old database, and then loading it into your new database with pg_restore or psql. The disadvantage to this method is time - dump and reload can take a very, very long time for large databases. Not only does the data need to get loaded into the new database tables, but all the indexes must be recreated, which can be agonizingly slow.
Replication systems
A third option is to use a replication system such as Slony or Bucardo to help with the upgrade. With Slony, you can set up replication from the old version to the new version, and then fail over to the new version once replication is caught up and running smoothly. You can do something similar with Bucardo. Note that both systems can only replicate sequences, and tables containing primary keys or unique indexes. Bucardo has a "fullcopy" mode that will copy any table, regardless of primary keys, but it's slow as it's equivalent to a full dump and restore of the table. Note that Bucardo is really only tested on the 8.X versions: for anything older, you will need to use Slony.
Even if you cannot replicate all your tables, such systems can help a migration by replicating most of your data. For example, if you have a 750 GB table full of mostly historical data, you can have Bucardo start tracking changes to the table, set up a copy on the new version (perhaps by using warm standby or a snapshot to reduce load on the master), and then start Bucardo to catch up the rows that have changed since the changes were tracked. If you do this for all your large tables, the actual upgrade process can proceed with minimal downtime by shutting down the master, doing a pg_dump of only the non-tracked tables, and then pointing your apps at the new server.
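The constraints on these three approaches can be collected into one small lookup. The Python sketch below only restates the rules given above (pg_upgrade from 8.3 onwards, Bucardo for the 8.X versions, dump and restore and Slony otherwise); the function name is made up, and it is meant for source versions older than 9.0.

```python
def upgrade_approaches(major):
    """List the approaches from the text that apply to a source major version.

    'major' is a (first, second) tuple such as (8, 1) for Postgres 8.1.
    """
    approaches = []
    if major >= (8, 3):
        approaches.append("pg_upgrade")       # only works from 8.3 onwards
    approaches.append("dump and restore")     # always possible, but slow
    if major >= (8, 0):
        approaches.append("Bucardo")          # only really tested on 8.X
    approaches.append("Slony")
    return approaches

for major in [(7, 4), (8, 1), (8, 3)]:
    print(major, upgrade_approaches(major))
```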
2. Migrate to a newer version
Even if you don't go to 9.0, you may want to upgrade to a newer version. Why not go all the way to 9.0? There are only two good reasons not to. One, if your system's packaging system does not have 9.0 yet, or you have custom packaging requirements that prevent you from doing so. Two, if you have concerns about application compatibility between two versions. However, that latter concern should be minimal. The largest and most disruptive compatibility change appeared in version 8.3 with the removal of implicit casts. Since 8.2 is likely to be unsupported in the next couple years, you should be going to at least 8.3. And if you can go to 8.3, you can go to 9.0.
3. Stay on your release
This is obviously the least-desirable option, but may be necessary due to real-world constraints involving time, testing, compatibility with other programs, etc. At the bare minimum, make sure you are at least running the latest revision, e.g. 7.4.30 if running 7.4. Moving forward, you will need to keep an eye on the Postgres commits list and/or the detailed release notes for new versions, and examine if any of the fixed bugs apply to your version or your situation. If they do, you'll need to figure out how to apply the patch to your older version, and then release this new version into your environment. Sound risky? It gets worse, because your patch is only being used and tested by an extremely small pool of people, has no build farm support, and is not available to the Postgres developers. If you want to go this route, there are companies familiar with the Postgres code base (including End Point) that will help you do so. But know in advance that we are also going to push you very hard to upgrade to a modern, supported version instead (which we can help you with as well, of course :).
PostgreSQL 8.4 in RHEL/CentOS 5.5
The announcement of end of support coming soon for PostgreSQL 7.4, 8.0, and 8.1 means that people who've put off upgrading their Postgres systems are running out of time before they're in the danger zone where critical bugfixes won't be available.
Given that PostgreSQL 7.4 was released in November 2003, that's nearly 7 years of support, quite a long time for free community support of an open-source project.
Many of our systems run Red Hat Enterprise Linux 5, which shipped with PostgreSQL 8.1. All indications are that Red Hat will continue to support that version of Postgres as it does all parts of a given version of RHEL during its support lifetime. But of course it would be nice to get those systems upgraded to a newer version of Postgres to get the performance and feature benefits of newer versions.
For any developers or DBAs familiar with Postgres, upgrading to a new version with RPMs from the PGDG or other custom Yum repository is not a big deal, but occasionally we've had a client worry that using packages other than the ones supplied by Red Hat is riskier.
For those holdouts still on PostgreSQL 8.1 because it's the "norm" on RHEL 5, Red Hat gave us a gift in their RHEL 5.5 update. It now includes separate PostgreSQL 8.4 packages that may optionally be used on RHEL 5 instead of PostgreSQL 8.1. (Both can't be used on the same system at the same time.)
I know that getting these packages from Red Hat shouldn't be necessary, but for those who feel jittery about using 3rd-party packages, it's a good nudge to switch to Postgres 8.4 using Red Hat's supported packages. Thanks to Tom Lane at Red Hat for making this happen. Though I don't know whose idea it was, Tom is the author of all the RPM commitlog messages, so thanks, Tom!
This brings up a few other rhetorical questions: Will RHEL 6 ship with PostgreSQL 9.0? Will RHEL 5.6 have backported PostgreSQL 9.0 in similar postgresql90 packages? It'd be great to see each new PostgreSQL release have supported packages in RHEL so that there's even less reason to start a new project on an older version of Postgres. RHEL 5.5 with PostgreSQL 8.4 is a nice start in that direction.
Postgres configuration best practices
This is the first in an occasional series of articles about configuring PostgreSQL. The main way to do this, of course, is the postgresql.conf file, which is read by the Postgres daemon on startup and contains a large number of parameters that affect the database's performance and behavior. Later posts will address specific settings inside this file, but before we do that, there are some global best practices to address.
Version Control
The single most important thing you can do is to put your postgresql.conf file into version control. I care not which one you use, but go do it right now. If you don't already have a version control system on your database box, git is a good choice to use. Barring that, RCS. Doing so is extremely easy. Just change to the directory postgresql.conf is in. The process for git:
- Install git if not there already (e.g. "sudo yum install git")
- Run: git init
- Run: git add postgresql.conf pg_hba.conf
- Run: git commit -a -m "Initial commit"
For RCS:
- Install as needed (e.g. "sudo apt-get install rcs")
- Run: mkdir RCS
- Run: ci -l postgresql.conf pg_hba.conf
Note that we also checked in pg_hba.conf. You want to check in any file in that directory you may possibly change. For most people, that only means postgresql.conf and pg_hba.conf, but if you use other files (pg_ident.conf) check those in as well.
Ideally you want the version checked in to be the "raw" configuration files that came with the system - in other words, before you started messing with them. Then you make your initial changes and check it in. From then on of course, you commit every time you change the file.
At a bare minimum, the version control system should be telling you:
- Exactly what was changed
- When it was changed
- Who made the change
- Why it was changed
The first two items happen automatically in all version control systems, so you don't have to worry about those. The third item, "who made the change", must be entered manually if on a shared account (e.g. postgres) and using RCS. If you are using git, you can simply set the environment variables GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL. For shared accounts, I have a custom bashrc file called "gregbashrc" that is called when I log in that sets those ENVs as well as a host of other items.
The fourth item, "why it was changed", is generally the content of the commit message. Never leave this blank, and be as descriptive and verbose as possible - someone later on will be grateful you did. It's okay to be repetitive and state the obvious. If this was done as part of a specific ticket number or project name, mention that as well.
Safe Changes
It's important that the changes you make to the postgresql.conf file (or other files) actually work and don't cause Postgres to be unable to parse the file, or handle a changed setting. Never make changes and restart Postgres, because if it doesn't work, you've got a broken config file, no Postgres daemon, and most likely unhappy applications and/or users. At the very least, do a reload first (e.g. /etc/init.d/postgresql reload or just kill -HUP the PID). Check the logs and see if Postgres was happy with your changes. If you are lucky, it won't even require a restart (some changes do, some do not).
A better way to test your changes is to make them on an identical test box. That way, all the wrinkles are ironed out before you make the changes on production and attempt a reload or restart.
Another way I've found handy is to simply start a new Postgres daemon. Sounds like a lot of work, but it's pretty automatic once you've done it a few times. The process generally looks like this, assuming your production postgresql.conf is in the "data" directory, and your changes are in data/postgresql.conf.new:
- cd ..
- initdb testdata
- cp -f data/postgresql.conf.new testdata/
- echo port=5555 >> testdata/postgresql.conf
- echo max_connections=10 >> testdata/postgresql.conf
The max_connections is not strictly necessary, of course, but unless you are changing something that relies on that setting, it's nicer to keep it (and the resulting memory) low.
- pg_ctl -D testdata -l test.log start
- cat test.log
- pg_ctl -D testdata stop
- rm -fr testdata (or just keep it around for next time)
The test.log file will show you any problems that might have popped up with your changes, and once it works you can be fairly confident it will work for the "main" daemon as well, so to finish up:
- cd data
- mv -f postgresql.conf.new postgresql.conf
- git commit postgresql.conf -m "Adjusted random_page_cost to 2, per bug #4151"
- kill -HUP `head -1 postmaster.pid`
- psql -c 'show random_page_cost'
Keeping it Clean
The postgresql.conf file is fairly long, and can be confusing to read with its mixture of comments, in-line comments, strange wrapping, and the commented out vs. not-commented-out variables. Hence, I recommend this system:
- Put a big notice at the top of the file asking people to make changes to the bottom
- Put all important variables at the bottom, sans comments, one per line
- Line things up
- Put into logical groups.
This avoids having to hunt for settings, prevents the gotcha of when a setting is changed twice in the file, and makes things much easier to read visually. Here's what I put at the top of the postgresql.conf:
##
## PLEASE MAKE ALL CHANGES TO THE BOTTOM OF THIS FILE!
##
I then add a good 20+ empty lines, so anyone viewing the file is forced to focus on the all-caps message above.
The next step is to put all the settings you care about at the bottom of the file. Which ones should you care about? Any setting you have changed (obviously), any setting that you *might* change in the future, and any that you may not have changed, but someone may want to look up. In practice, this means a list of about 25 items. After aligning all the values to the right and breaking things into logical groups, here's what the bottom of the postgresql.conf looks like:
## Connecting
port                            = 5432
listen_addresses                = '*'
max_connections                 = 100

## Memory
shared_buffers                  = 400MB
work_mem                        = 1MB
maintenance_work_mem            = 1GB

## Disk
fsync                           = on
synchronous_commit              = on
full_page_writes                = on
checkpoint_segments             = 100

## PITR
archive_mode                    = off
archive_command                 = ''
archive_timeout                 = 0

## Planner
effective_cache_size            = 18GB
random_page_cost                = 2

## Logging
log_destination                 = 'stderr'
logging_collector               = on
log_filename                    = 'postgres-%Y-%m-%d.log'
log_truncate_on_rotation        = off
log_rotation_age                = 1d
log_rotation_size               = 0
log_min_duration_statement      = 200
log_statement                   = 'ddl'
log_line_prefix                 = '%t %u@%d %p'

## Autovacuum
autovacuum                      = on
autovacuum_vacuum_scale_factor  = 0.1
autovacuum_analyze_scale_factor = 0.3
Because everything is in one place, at the bottom of the file, and not commented out, it's very easy to see what is going on. The groups above are somewhat arbitrary, and you can leave them out or create your own, but at least keep things grouped together as much as possible. When in doubt, use the same order as they appear in the original postgresql.conf.
Sometimes people change important settings in a group, such as for bulk loading of data. In this case, I usually make a separate group for it at the very bottom. This makes it easy to switch back and forth, and helps to prevent people from (for example) forgetting to switch fsync back on:
## Bulk loading only - leave 'on' for everyday use!
autovacuum       = off
fsync            = off
full_page_writes = off
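Because the last uncommented occurrence of a setting is the one Postgres uses, a bulk-loading group at the very bottom overrides the everyday values above it, and a forgotten duplicate elsewhere in the file can do the same thing by accident. Here is a minimal Python sketch for spotting settings that appear more than once. The function names are my own, and the parser is deliberately simplified: it ignores include directives and backslash-escaped quotes.

```python
import re

# name, optional "=", then the value (comments are stripped beforehand)
SETTING = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=?\s*(.*?)\s*$")


def strip_comment(line):
    """Drop a trailing '#' comment, leaving '#' inside single quotes alone."""
    in_quote = False
    for i, ch in enumerate(line):
        if ch == "'":
            in_quote = not in_quote
        elif ch == "#" and not in_quote:
            return line[:i]
    return line


def parse_conf(text):
    """Return {name: [(line_number, value), ...]} for every active setting."""
    seen = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = strip_comment(raw).strip()
        if not line:
            continue
        match = SETTING.match(line)
        if match:
            name = match.group(1).lower()  # setting names are case-insensitive
            seen.setdefault(name, []).append((lineno, match.group(2)))
    return seen


def duplicates(text):
    """Settings that are assigned more than once, with every occurrence."""
    return {k: v for k, v in parse_conf(text).items() if len(v) > 1}


def effective(text):
    """The value that wins for each setting: the last occurrence in the file."""
    return {k: v[-1][1] for k, v in parse_conf(text).items()}
```

Calling duplicates() on the contents of postgresql.conf lists each repeated setting with its line numbers, which makes it easy to check whether the bulk-loading group is still switched on.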
Ownership and permissions
All the conf files should be owned by the postgres user, and they should be world-readable if possible (indeed, on Debian-based systems postgresql.conf must be readable for psql to work!). Be careful about SELinux as well: it can get ornery if you do things like use symlinks.
Backups
One final note - make sure you are backing up your changes as well. PITR and pg_dump won't save your postgresql.conf! If you are checking things in to a remote version control system, then some of the pressure is off, but you should have some sort of policy for backing up all your conf files explicitly. Even if using a local git repo, tarring and copying up the whole thing is usually a very quick and cheap action.
Anonymous code blocks
With the release of PostgreSQL 9.0 comes the ability to execute "anonymous code blocks" in several of the PostgreSQL procedural languages. The idea stemmed from work back in autumn of 2009 that tried to respond to a common question on IRC or the mailing lists: how do I grant a permission to a particular user for all objects in a schema? At the time, the only solution short of manually writing commands to grant the permission in question on every object individually was to write a script of some sort. Further discussion uncovered several people who often found themselves writing simple functions to handle various administrative tasks. Many of those people, it turned out, would rather simply call one statement than create a function, call the function, and then drop (or just ignore) the function they'd never need again. Hence, the new DO command.
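For comparison, the "script of some sort" approach might look like the following Python sketch, which only builds the GRANT statements and leaves running them to psql or a driver. The function names and the sample view names are hypothetical; the quoting helper always double-quotes identifiers, which is valid SQL but stricter than the server's own quote_ident(), and the privilege argument is not validated.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def grant_statements(objects, role, privilege="ALL"):
    """Build one GRANT statement per (schema, object_name) pair.

    `privilege` is inserted as-is, so pass only trusted keywords
    such as ALL or SELECT.
    """
    return [
        "GRANT {} ON {}.{} TO {};".format(
            privilege, quote_ident(schema), quote_ident(name), quote_ident(role)
        )
        for schema, name in objects
    ]


if __name__ == "__main__":
    # Hypothetical view names, as they might come back from
    # information_schema.tables
    views = [("public", "active_orders"), ("public", "Monthly Totals")]
    for statement in grant_statements(views, "webuser"):
        print(statement)
```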
The first language to support DO was PL/pgSQL. The PostgreSQL documentation provides an example that answers a version of the original question, granting all privileges on every view in the public schema to the role webuser:
DO $$DECLARE r record;
BEGIN
FOR r IN SELECT table_schema, table_name FROM information_schema.tables
WHERE table_type = 'VIEW' AND table_schema = 'public'
LOOP
EXECUTE 'GRANT ALL ON ' || quote_ident(r.table_schema) || '.' || quote_ident(r.table_name) || ' TO webuser';
END LOOP;
END$$;
Notice that this doesn't actually tell us what language to use. If no language is specified, DO defaults to PL/pgSQL (which, in 9.0, is enabled by default). But you can use other languages as well:
DO $$
HAI
BTW Calculate pi using Gregory-Leibniz series
BTW This method does not converge particularly quickly...
I HAS A PIADD ITZ 0.0
I HAS A PISUB ITZ 0.0
I HAS A ITR ITZ 0
I HAS A T1
I HAS A T2
I HAS A PI ITZ 0.0
I HAS A ITERASHUNZ ITZ 1000
IM IN YR LOOP
T1 R QUOSHUNT OF 4.0 AN SUM OF 3.0 AN ITR
T2 R QUOSHUNT OF 4.0 AN SUM OF 5.0 AN ITR
PISUB R SUM OF PISUB AN T1
PIADD R SUM OF PIADD AN T2
ITR R SUM OF ITR AN 4.0
BOTH SAEM ITR AN BIGGR OF ITR AN ITERASHUNZ, O RLY?
YA RLY, GTFO
OIC
IM OUTTA YR LOOP
PI R SUM OF 4.0 AN DIFF OF PIADD AN PISUB
VISIBLE "PI R: "
VISIBLE PI
FOUND YR PI
KTHXBYE
$$ LANGUAGE PLLOLCODE;
I tried to rewrite the GRANT function shown above in PL/LOLCODE for this example, until I discovered that some of PL/LOLCODE's limitations make it extremely difficult, if not impossible. So far as I know, PL/LOLCODE was the second language to support anonymous blocks, thanks to what turned out to be a relatively simple programming exercise. After finishing PL/LOLCODE's DO support, I decided to do the same for PL/Perl. I wasn't particularly surprised to find that PL/Perl was harder to extend than PL/LOLCODE; PL/Perl is a much more feature-rich (and hence, complicated) language and I wasn't as familiar with its internals. However, after my initial submission and with helpful commentary from several other people, Andrew Dunstan tied off the loose ends and got it committed. It looks like this:
DO $$
# Collect the schema-qualified, safely quoted name of every view in "public"
my $rv = spi_exec_query(q{
    SELECT quote_ident(table_schema) || '.' || quote_ident(table_name) AS relname
    FROM information_schema.tables WHERE table_type = 'VIEW' AND table_schema = 'public'
});
my $nrows = $rv->{processed};
# Grant on each view in turn
foreach my $i (0 .. $nrows - 1) {
    my $row = $rv->{rows}[$i];
    spi_exec_query("GRANT ALL ON $row->{relname} TO webuser");
}
$$ LANGUAGE plperl;
DO wasn't the only thing to come from the pgsql-hackers discussion I mentioned above. In PostgreSQL 9.0, the GRANT command has also been modified, so it's now possible to grant permissions on several objects in one stroke. For instance:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO webuser
pg_wrapper's very symbolic links
I like pg_wrapper. For a development environment, or testing replication scenarios, it's brilliant. If you're not familiar with pg_wrapper and its family of tools, it's a set of scripts in the postgresql-common and postgresql-client-common packages available in Debian, as well as Ubuntu and other Debian-like distributions. As you may have guessed pg_wrapper itself is a wrapper script that calls the correct version of the binary you're invoking – psql, pg_dump, etc – depending on the version of the database you want to connect to. Maybe not all that exciting in itself, but implied therein is the really cool bit: This set of tools lets you manage multiple installations of Postgres, spanning multiple versions, easily and reliably.
Well, usually reliably. We were helping a client upgrade their production boxes from Postgres 8.1 to 8.4. This was just before the 9.0 release; otherwise we'd have considered moving them directly to that instead. It was going fairly smoothly until on one box we hit this message:
Could not parse locale out of pg_controldata output
Oops, they had pinned the older postgresql-common version. An upgrade of those packages and no more error!
$ pg_lsclusters
Version Cluster Port Status Owner    Data directory               Log file
8.1     main    5439 online postgres /var/lib/postgresql/8.1/main custom
Error: Invalid data directory
Hmm, interesting. Okay, so not quite, got a little bit more work to do. This one took some tracing through the code. The pg_wrapper scripts, if they don't already know it, look for the data directory in a couple of places. The first stop is the postgresql.conf file, specifically /etc/postgresql/<version>/<cluster-name>/postgresql.conf, looking for the data_directory parameter. But, in its transitional state at the time, the postgresql.conf was still a work in progress.
The second place it looks is a symlink in the same /etc/postgresql/<version>/<cluster-name>/ directory. While that's the old way of doing things, it at least let us get things looking reasonable:
# ln -s /var/lib/postgresql/8.4/main /etc/postgresql/8.4/main/pgdata
# /etc/init.d/postgresql-8.4 status
8.1 main 5439 online postgres /var/lib/postgresql/8.1/main custom
8.4 main 5432 online postgres /var/lib/postgresql/8.4/main custom
Voilà! From there we were able to proceed with the upgrade, confident that the instance will behave as expected. And now, everything is running great!
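The lookup order described above (first the data_directory parameter in postgresql.conf, then the pgdata symlink) can be sketched in Python as follows. This only mirrors the behavior as I traced it; it is not the actual code in postgresql-common, the function name is my own, and the config parsing is a simplification.

```python
import os
import re

# Matches an uncommented data_directory line, with or without quotes
DATA_DIR = re.compile(r"\s*data_directory\s*=?\s*'?([^'#\n]+?)'?\s*(#.*)?$")


def find_data_directory(conf_dir):
    """Return the data directory for a cluster config dir, or None.

    conf_dir is something like /etc/postgresql/<version>/<cluster-name>.
    """
    # First stop: the data_directory parameter in postgresql.conf
    conf = os.path.join(conf_dir, "postgresql.conf")
    if os.path.isfile(conf):
        found = None
        with open(conf) as handle:
            for line in handle:
                match = DATA_DIR.match(line)
                if match:
                    found = match.group(1)  # the last occurrence wins
        if found:
            return found

    # Second stop (the old way): a "pgdata" symlink in the same directory
    link = os.path.join(conf_dir, "pgdata")
    if os.path.islink(link):
        return os.path.realpath(link)

    return None
```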
As with most things that provide a simpler experience on the surface, there's additional complexity under the hood. But for now, we have one more client upgraded. Thanks, Postgres!
Listen/Notify improvements in PostgreSQL 9.0
Improved listen/notify is one of the new features of Postgres 9.0 that I've been awaiting for a long time. There are basically two major changes: everything is in shared memory instead of using system tables, and full support for "payload" messages is enabled.
Before I demonstrate the changes, here's a review of what exactly the listen/notify system in Postgres is. Basically, it is an inter-process signalling system, which uses the pg_listener system table to coordinate simple named events between processes. One or more clients connect to the database and issue a command such as:
LISTEN foobar;
The name foobar can be replaced by any valid name; usually the name is something that gives a contextual clue to the listening process, such as the name of a table. Another client (or even one of the original ones) will then issue a notification like so:
NOTIFY foobar;
Each client that is listening for the 'foobar' message will receive a notification that the sender has issued the NOTIFY. It also receives the PID of the sending process. Multiple notifications are collapsed into a single notice, and the notification is not sent until a transaction is committed.
Here's some sample code using DBD::Pg that demonstrates how the system works:
#!/usr/bin/env perl
# -*-mode:cperl; indent-tabs-mode: nil-*-
use strict;
use warnings;
use DBI;
my $dsn = 'dbi:Pg:dbname=test';
my $dbh1 = DBI->connect($dsn,'test','', {AutoCommit=>0,RaiseError=>1,PrintError=>0});
my $dbh2 = DBI->connect($dsn,'test','', {AutoCommit=>0,RaiseError=>1,PrintError=>0});
print "Postgres version is $dbh1->{pg_server_version}\n";
my $SQL = 'SELECT pg_backend_pid(), version()';
my $pid1 = $dbh1->selectall_arrayref($SQL)->[0][0];
my $pid2 = $dbh2->selectall_arrayref($SQL)->[0][0];
print "Process one has a PID of $pid1\n";
print "Process two has a PID of $pid2\n";
## Process one listens for a notice named "jtx"
$dbh1->do(q{LISTEN jtx});
$dbh1->commit();
## Process one checks for any notices received
print show_notices($dbh1);
## Process two sends a notice, but does not commit
$dbh2->do(q{NOTIFY jtx});
## Process one does not see the notice yet
print show_notices($dbh1);
## Process two sends the same notice again, then commits
$dbh2->do(q{NOTIFY jtx});
$dbh2->commit();
sleep 1; ## Ensure the notice has time to propagate
## Process one receives a single notice from process two
print show_notices($dbh1);
## Now that it has seen the notice, it reports nothing again:
print show_notices($dbh1);
sub show_notices { ## Function to return any notices received
my $dbh = shift;
my $messages = '';
$dbh->commit();
while (my $n = $dbh->func('pg_notifies')) {
$messages .= "Got notice '$n->[0]' from PID $n->[1]\n";
}
return $messages || "No messages\n";
}
The output of the above script on an 8.4 Postgres server is:
Postgres version is 80401
Process one has a PID of 18238
Process two has a PID of 18239
No messages
No messages
Got notice 'jtx' from PID 18239
No messages
As expected, we got a notification only after the other process committed.
Note that because this is asynchronous and involves the system tables, we added a sleep call to ensure that the notice had time to propagate so that the other processes will see it. Without the sleep, we usually see four "No messages" appear, as the script goes too fast for the pg_listener table to catch up.
Now for the aforementioned payloads. Payloads allow an arbitrary string to be attached to the notification, such that you can have a standard name like before, but you can also attach some specific text that the other processes can see. I added support for payloads to DBD::Pg back in June 2008, so let's modify the script a little bit to demonstrate the new payload mechanism:
...
## Process two sends two notices, but does not commit
$dbh2->do(q{NOTIFY jtx, 'square'});
$dbh2->do(q{NOTIFY jtx, 'square'});
## Process one does not see the notice yet
print show_notices($dbh1);
## Process two sends a notice with a different payload, then commits
$dbh2->do(q{NOTIFY jtx, 'triangle'});
$dbh2->commit();
...
## This part changes: we get an extra item from our array:
$messages .= "Got notice '$n->[0]' from PID $n->[1] message is '$n->[2]'\n";
...
Here's what the output looks like under version 9.0 of Postgres:
Postgres version is 90000
Process one has a PID of 19089
Process two has a PID of 19090
No messages
No messages
Got notice 'jtx' from PID 19090 message is 'square'
Got notice 'jtx' from PID 19090 message is 'triangle'
No messages
Note that the collapsing of identical messages into a single notification now takes into account the message as well, so we received two notifications in the above example for the three total notifications sent. To add a payload, we simply say NOTIFY, then the name of the notification, add a comma, and specify a payload as a quoted string. Of course, the payload string is still completely optional. If no payload is specified, DBD::Pg will simply treat the payload as an empty string (this is also the behavior when you request the payload using DBD::Pg against a pre-9.0 server, so all combinations should be 100% backwards compatible).
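To make the collapsing rule concrete, here is a toy Python model of the behavior described above: notifications are held until commit, and duplicates are collapsed on the channel name and payload together. It models only a single sending transaction, and the class and method names are invented for illustration; it says nothing about how Postgres implements this internally.

```python
class ToyNotifier:
    """In-memory imitation of 9.0-style LISTEN/NOTIFY delivery rules."""

    def __init__(self):
        self.listeners = {}  # channel name -> list of listener inboxes
        self.pending = []    # notifications queued by the open transaction

    def listen(self, channel):
        """Register a listener and return the inbox it will receive into."""
        inbox = []
        self.listeners.setdefault(channel, []).append(inbox)
        return inbox

    def notify(self, channel, payload=""):
        """Queue a notification; identical (channel, payload) pairs collapse."""
        event = (channel, payload)
        if event not in self.pending:
            self.pending.append(event)

    def commit(self):
        """Deliver everything queued, in order, then start a new transaction."""
        for channel, payload in self.pending:
            for inbox in self.listeners.get(channel, []):
                inbox.append((channel, payload))
        self.pending = []

    def rollback(self):
        """Discard queued notifications without delivering them."""
        self.pending = []


if __name__ == "__main__":
    bus = ToyNotifier()
    inbox = bus.listen("jtx")
    bus.notify("jtx", "square")
    bus.notify("jtx", "square")
    bus.notify("jtx", "triangle")
    bus.commit()
    print(inbox)
```

Sending 'square' twice and 'triangle' once leaves two entries in the listener's inbox, matching the two notices in the 9.0 output above.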
We also got rid of the sleep. Because we are now using shared memory instead of system tables, there is no lag whatsoever, and the other process can see the notices right away.
Another large advantage to removing the pg_listener table is that systems that make heavy use of it (such as the replication systems Bucardo and Slony) no longer have to worry about bloat in these tables.
The use of payloads also means that many applications can be greatly simplified: in the past, one had to be creative in naming notifications in order to pass meta-information to the listener. For example, Bucardo uses a large collection of notifications, meaning that the Bucardo processes had to do the equivalent of things like this:
$dbh->do(q{LISTEN bucardo_reload_config});
$dbh->do(q{LISTEN bucardo_log_message});
$dbh->do(qq{LISTEN bucardo_activate_sync_$sync});
$dbh->do(qq{LISTEN bucardo_deactivate_sync_$sync});
$dbh->do(qq{LISTEN bucardo_kick_sync_$sync});
...
while (my $notice = $dbh->func('pg_notifies')) {
my ($name, $pid) = @$notice;
if ($name eq 'bucardo_reload_config') {
...
}
elsif ($name =~ /bucardo_kick_sync_(.+)/) {
...
}
...
}
We can instead do things like this:
$dbh->do(q{LISTEN bucardo});
...
while (my $notice = $dbh->func('pg_notifies')) {
my ($name, $pid, $msg) = @$notice;
if ($msg eq 'bucardo_reload_config') {
...
}
elsif ($msg =~ /bucardo_kick_sync_(.+)/) {
...
}
...
}
I hope to add this support to Bucardo shortly; it's simply a matter of refactoring all the listen and notify calls into a function that does the right thing depending on the server version it is attached to.
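A rough idea of what such a version-aware helper could look like, sketched in Python rather than Bucardo's Perl. The function names and the "bucardo" prefix handling are illustrative assumptions, not Bucardo's actual code, and the literal quoting assumes event names contain no backslashes.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(text):
    """Single-quote an SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def notify_sql(server_version, event, prefix="bucardo"):
    """Build a NOTIFY statement suited to the server version.

    server_version is the integer form reported by the driver,
    for example 80401 for 8.4.1 or 90000 for 9.0.0.
    """
    if server_version >= 90000:
        # 9.0 and up: one shared channel, with the event as the payload
        return "NOTIFY {}, {}".format(quote_ident(prefix), quote_literal(event))
    # Older servers: encode the event in the notification name itself
    return "NOTIFY {}".format(quote_ident(prefix + "_" + event))


def listen_sql(server_version, events, prefix="bucardo"):
    """Build the LISTEN statements needed to hear all of the given events."""
    if server_version >= 90000:
        return ["LISTEN {}".format(quote_ident(prefix))]
    return ["LISTEN {}".format(quote_ident(prefix + "_" + e)) for e in events]
```

On a 9.0 server every event travels over a single channel, so the listener side shrinks to one LISTEN no matter how many syncs exist.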
PostgreSQL odd checkpoint failure
Nothing strikes fear into the heart of a DBA like error messages, particularly ones which indicate that there may be data corruption. One such situation happened to us recently during an upgrade to PostgreSQL 8.1.21. We had updated the software and had been manually running a REINDEX DATABASE command when we started to notice some errors being reported on the front end. We decided to dump the database in question to ensure we had a backup to return to; however, we still ended up with more messages:
pg_dump -Fc database1 > pgdump.database1.archive
pg_dump: WARNING: could not write block 1 of 1663/207394263/443523507
DETAIL: Multiple failures --- write error may be permanent.
pg_dump: ERROR: could not open relation 1663/207394263/443523507: No such file or directory
CONTEXT: writing block 1 of relation 1663/207394263/443523507
pg_dump: SQL command to dump the contents of table "table1" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: could not open relation 1663/207394263/443523507: No such file or directory
CONTEXT: writing block 1 of relation 1663/207394263/443523507
pg_dump: The command was: COPY public."table1" (id, field1, field2, field3) TO stdout;
Looking at the pg_database contents revealed that 207394263 was not even the database in question. I connected to the aforementioned database and looked for a relation that matched that pg_class.oid, and barring that pg_class.relfilenode. This search revealed nothing. So where was the object itself living, and why were we getting this message?
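For reference, the three numbers in the error message are the tablespace OID, the database OID, and the relation's relfilenode, in that order. A small Python sketch (function names are my own) that splits such a path and builds the catalog lookups described above:

```python
def parse_relation_path(path):
    """Split 'tablespace/database/relfilenode' into its three numeric parts."""
    tablespace_oid, database_oid, relfilenode = (int(p) for p in path.split("/"))
    return {
        "tablespace_oid": tablespace_oid,
        "database_oid": database_oid,
        "relfilenode": relfilenode,
    }


def lookup_queries(path):
    """Return catalog queries that identify each part of the path.

    The pg_class query only makes sense when run while connected to the
    database named by the second query.
    """
    parts = parse_relation_path(path)
    return [
        "SELECT spcname FROM pg_tablespace WHERE oid = {}".format(
            parts["tablespace_oid"]
        ),
        "SELECT datname FROM pg_database WHERE oid = {}".format(
            parts["database_oid"]
        ),
        "SELECT relname, relkind FROM pg_class WHERE oid = {0} OR relfilenode = {0}".format(
            parts["relfilenode"]
        ),
    ]


if __name__ == "__main__":
    for query in lookup_queries("1663/207394263/443523507"):
        print(query + ";")
```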
We decided that since it appeared that something was awry with the database system in general, that we should take this opportunity to dump the tables in question. I proceeded to write a quick script to go through the database tables and dump each one individually using pg_dump's -t option. This worked for some of the tables, but not all of them, which would die with the same error. Looking at the pg_class.relpages field for the non-dumpable tables revealed that these were all the larger tables in the database. Obviously not good, since this is where the bulk of the data lay. However, we also noticed that the message that we got referenced the exact same filesystem path, so it appeared to be something separate from the table that was being dumped.
After some advice on IRC, we reviewed the logs for checkpoint logging, which revealed that checkpoints had been failing. This meant that the database was in a state such that it could not be shut down cleanly, had we wanted to try a restart to see if that cleared up the flakiness. We would only be able to shut down via a hard kill, which is definitely something to avoid, WAL or not, particularly since there had not been a checkpoint for some time. A manual CHECKPOINT also failed after a timeout.
Before we went down the road of forcing a hard server shutdown, we ended up just touching the specific relation path in question into existence and then running a CHECKPOINT. This time since the file existed, it was able to complete the checkpoint, and restore working order to the database. We successfully (and quickly) ran a full pg_dump, and went about the task of manually vetting a few of the affected tables, etc.
Our working theory for this is that somehow there was a dirty buffer that referenced a relation that no longer existed, and hence when there was a checkpoint or other event which attempted to flush shared_buffers (i.e., the loading of a large relation which would require a flush of Least Recently Used pages as in the pg_dump case), the flush attempt for the missing relation failed, which aborted the checkpoint/other action.
After the file existed and PostgreSQL had successfully synched to disk, it was a single two-block file, of which the first block was completely empty and the second block looked like an index page (due to the layout/contents of the data). The most suggestive cause was that there had been an interrupted REINDEX earlier in the day. Since this machine was showing no other signs of data corruption and everything else seemed reasonable, our best guess is that there was some race condition that had caused the relation's data to exist in memory even while the canceled REINDEX ensured that the actual relfile and the pg_class rows did not exist for the buffer.
Distributed Transactions and Two-Phase Commit
The typical example of a transaction involves Alice and Bob, and their bank. Alice pays Bob $100, and the bank needs to debit Alice and credit Bob. Easy enough, provided the server doesn't crash. But what happens if the bank debits Alice, and then before crediting Bob, the server goes down? Or what if they credit Bob first, and then try to debit Alice only to find she doesn't have enough funds? A transaction allows the debit and credit operations to happen as a package ("atomically" is the word commonly used), so either both operations happen or neither happens, even if the server crashes halfway through the transaction. That way the bank never credits Bob without debiting Alice, or vice versa.
That's simple enough, but the situation can become more complex. What if, for instance, for buzzword-compliance purposes, the bank has "sharded" its accounts database by splitting it in pieces and putting each piece on a different server? (Whether this would be smart or not is outside the scope of this post.) The typical transaction handles statements issued only for one database, so we can't wrap the debit and credit operations within a single BEGIN/COMMIT if Alice's account information lives on one server and Bob's lives on another.
Enter "distributed transactions". A distributed transaction allows applications to group multiple transaction-aware systems into a single transaction. These systems might be different databases, or they might include other systems such as message queues, in which case the transaction concept means a message would get delivered if and only if the rest of the transaction completed. So with a distributed transaction, the bank could debit Alice's account in one database and credit Bob's in another, atomically.
All this comes at some cost. First, distributed transactions require a "transaction manager", an application which handles the special semantics required to commit a distributed transaction. Second, the systems involved must support "two-phase commit" (which was added to PostgreSQL in version 8.1). Distributed transactions are committed using PREPARE TRANSACTION 'foo' (phase 1), and COMMIT PREPARED 'foo' or ROLLBACK PREPARED 'foo' (phase 2), rather than the usual COMMIT or ROLLBACK.
The beginning of a distributed transaction looks just like any other transaction: the application issues a BEGIN statement (optional in PostgreSQL), followed by normal SQL statements. When the transaction manager is instructed to commit, it runs the first commit phase by saying "PREPARE TRANSACTION 'foo'" (where "foo" is some arbitrary identifier for this transaction) on each system involved in the distributed transaction. Each system does whatever it needs to do to determine whether or not this transaction can be committed and to make sure it can be committed even if the server crashes, and reports success or failure. If all systems succeed, the transaction manager follows up with "COMMIT PREPARED 'foo'", and if a system reports failure, the transaction manager can roll back all the other systems using either ROLLBACK (for those transactions it hasn't yet prepared), or "ROLLBACK PREPARED 'foo'". Using two-phase commit is obviously slower than committing transactions on only one database, but sometimes the data integrity it provides justifies the extra cost.
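The decision logic a transaction manager follows can be sketched in a few lines of Python. This is a toy model with in-memory participants standing in for database connections; the class and function names are hypothetical, and a real manager such as the one used below also has to log its decisions and recover from its own crashes, which this sketch ignores.

```python
class Participant:
    """Stand-in for one system taking part in a distributed transaction."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.state = "active"

    def prepare(self, gid):
        # Phase 1: the equivalent of PREPARE TRANSACTION 'gid'
        if self.fail_on_prepare:
            raise RuntimeError(self.name + " cannot prepare " + gid)
        self.state = "prepared"

    def commit_prepared(self, gid):
        # Phase 2, success path: COMMIT PREPARED 'gid'
        self.state = "committed"

    def rollback_prepared(self, gid):
        # Phase 2, failure path for systems that already prepared
        self.state = "rolled back"

    def rollback(self):
        # Plain ROLLBACK for systems that never got as far as preparing
        self.state = "rolled back"


def two_phase_commit(participants, gid):
    """Commit on every participant or on none; return True on commit."""
    prepared = []
    for participant in participants:
        try:
            participant.prepare(gid)
        except Exception:
            # One failure is enough: undo everything, prepared or not
            for other in participants:
                if other in prepared:
                    other.rollback_prepared(gid)
                else:
                    other.rollback()
            return False
        prepared.append(participant)

    # Every system promised it can commit, so tell them all to do so
    for participant in participants:
        participant.commit_prepared(gid)
    return True
```

The important property is that no participant is told to commit until every participant has successfully prepared.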
In PostgreSQL, two-phase commit is supported provided max_prepared_transactions is nonzero. A PREPARE TRANSACTION statement persists the current transaction to disk, and dissociates it from the current session. That way it can survive even if the database goes down. The current session no longer has an active transaction. However, the prepared transaction acts like any other open transaction in that all locks held by the prepared transaction remain held, and VACUUM cannot reclaim storage from that transaction. So it's not a good idea to leave prepared transactions open for a long time.
Distributed transactions are most common, it seems, in Java applications. Full J2EE application servers typically come with a transaction manager component. For my examples I'll use an open source, standalone transaction manager, called Bitronix. I'm not particularly fond of using Java for simple scripts, though, so I've used JRuby for this demonstration code.
This script uses two databases, which I've called "athos" and "porthos". Each has the same schema, which provides a simple framework for the sharded bank example described above. This schema provides a table for account names, another for ledger information, and a simple trigger to raise an exception when a transaction would bring a person's balance below $0. I'll first populate athos with Alice's account information. She gets $200 to start. Bob will go in the porthos database, with no initial balance.
5432 josh@athos# insert into accounts values ('Alice');
INSERT 0 1
5432 josh@athos*# insert into ledger values ('Alice', 200);
INSERT 0 1
5432 josh@athos*# commit;
COMMIT
5432 josh@athos# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# insert into accounts values ('Bob');
INSERT 0 1
5432 josh@porthos*# commit;
COMMIT
Use of Bitronix is pretty straightforward. After setting up a few constants for easier typing, I create a Bitronix data source for each PostgreSQL database. Here I have to use the PostgreSQL JDBC driver's org.postgresql.xa.PGXADataSource class; "XA" is Java's protocol for two-phase commit, and requires JDBC driver support. Here's the code for setting up one data source; the other is just the same.
ds1 = PDS.new
ds1.set_class_name 'org.postgresql.xa.PGXADataSource'
ds1.set_unique_name 'pgsql1'
ds1.set_max_pool_size 3
ds1.get_driver_properties.set_property 'databaseName', 'athos'
ds1.get_driver_properties.set_property 'user', 'josh'
ds1.init
Then I simply get a connection from each data source, instantiate a Bitronix TransactionManager object, and begin a transaction.
c1 = ds1.get_connection
c2 = ds2.get_connection
btm = TxnSvc.get_transaction_manager
btm.begin
Within my transaction, I just use normal JDBC commands to debit Alice and credit Bob, after which I commit the transaction through the TransactionManager object. If this transaction fails, it raises an exception, which I can capture using Ruby's begin/rescue exception handling, and roll back the transaction.
begin
s2 = c2.prepare_statement "INSERT INTO ledger VALUES ('Bob', 100)"
s2.execute_update
s2.close
s1 = c1.prepare_statement "INSERT INTO ledger VALUES ('Alice', -100)"
s1.execute_update
s1.close
btm.commit
puts "Successfully committed"
rescue
puts "Something bad happened: #{$!}"
btm.rollback
end
When I run this, Bitronix gives me a bunch of output, which I haven't bothered to suppress, but among it all is the "Successfully committed" string I told it to print on success. Since Alice is debited $100 each time we run this, and she started with $200, we can run it twice before hitting errors. On the third time, we get this:
Something bad happened: org.postgresql.util.PSQLException: ERROR: Rejecting operation; account owner Alice's balance would drop below 0
This is our trigger firing, to tell us that we can't debit Alice any more. If I look in the two databases, I can see that everything worked as planned:
5432 josh@athos*# select get_balance('Alice');
get_balance
-------------
0
(1 row)
5432 josh@athos*# \c porthos
You are now connected to database "porthos".
5432 josh@porthos# select get_balance('Bob');
get_balance
-------------
200
(1 row)
Remember I've run my script three times, but Bob has only been credited $200, because that's all Alice had to start with.
PostgreSQL: Migration Support Checklist
A database migration (be it from some other database to PostgreSQL, or even from an older version of PostgreSQL to a nice shiny new one) can be a complicated procedure with many details and many moving parts. I've found it helpful to construct a list of questions in order to make sure that you're considering all aspects of the migrations and gauge the scope of what will be involved. This list includes questions we ask our clients; feel free to contribute your own additional considerations or suggestions.
Technical questions:
- Database servers: How many database servers do you have? For each, what are the basic system specifications (OS, CPU architecture, 32- vs 64-bit, RAM, disk, etc)? What kind of storage are you using for the existing database, and what do you plan to use for the new database? Direct-attached storage (SAS, SATA, etc.), SAN (what vendor?), or other? Do you use any configuration management system such as Puppet, Chef, etc.?
- Application servers and other remote access: How many application servers do you have? For each, what are the basic system specifications (OS, CPU architecture, 32- vs 64-bit, RAM, disk, etc)? Do you use any configuration management system such as Puppet, Chef, etc.? What other network considerations are there? Is ODBC used, or SSL transport, any VPNs? Are multiple datacenters involved? How about egress/ingress firewalls?
- Middleware: Do you currently use any sort of connection pooling, load balancing, or other middleware between your application and database servers?
- Data needs: Can you describe your data access patterns? i.e., is the majority of your data historical and rarely accessed? Are there any existing reporting needs that will need to be duplicated on the PostgreSQL system? Do you already have reports of database usage, including traffic levels, frequent or intensive queries, etc?
- Size: What kind of transaction volume do you see? How large are your databases? How many tables do you have and what is the size of the larger ones? How many users or database connections will you need to support?
- Backups: What are your current backup policies/procedures? How will these need to change with the move to PostgreSQL?
- Replication/load balancing: What kind of system redundancy do you currently have/need? Do you have any kind of database load-balancing or master-slave replication?
- Monitoring: What is the current monitoring/in-house support infrastructure? What needs to be duplicated, and can any portion of this facility be reused?
- Interfaces: What language are your applications written in, and what drivers exist to connect to your current database? Will there be a compatible PostgreSQL driver available in your language of choice?
- Extensions: Are you currently using any in-database procedures or functionality (i.e., in PL/SQL or another embedded language of choice)? If so, how many? What will the difficulty be in porting these functions to PostgreSQL?
And a couple of business-related questions:
- Scheduling: What is the timeframe for transition? When can appropriate downtime be scheduled? How much database downtime can you afford?
- Staffing: Do you currently have in-house DBAs to manage the servers, etc on a day-to-day basis? Is there anyone with PostgreSQL experience or familiarity on staff?
Being able to answer all of these questions is critical to formulating a migration plan and carrying out a migration successfully.
Particularly with the impending (July 2010) end of life for previous PostgreSQL releases 7.4, 8.0 and (in November 2010) 8.1, a database migration may be on your radar. End Point is one of many professional PostgreSQL support companies who would be happy to assist you in your transition.
Views across many similar tables
An application I'm working on has a host of (a dozen or so) status tables, each containing various rows that reflect the state of associated rows in other tables. For instance:
Table "public.inventory"
...
status_code | character varying(50) | not null

Table "public.inventory_statuses"
code          | character varying(50) | not null
display_label | character varying(70) | not null

SELECT * FROM inventory_statuses;
   code   | display_label
----------+---------------
 ordered  | Ordered
 shipped  | Shipped
 returned | Returned
 repaired | Repaired
etc.
Several of the codes are common to several tables. For instance, "void" is a status that occurs in seven tables. The application cares about this; there are code-level triggers that will respond to a change of status to "void" in one table, and pass that information along to another table higher up the chain.
Since I wasn't present at the birth of the system (nor do I have unlimited memory to keep 180+ codes in my head), I needed a way to answer the question, "In which table(s) does status 'foo' occur?" This was made rather easier by attention to detail early on: each of the status tables was named "*_statuses"; each primary key was named "code"; and each human-readable description field was named "display_label". I wrote a PL/pgSQL function to create a view spanning all the tables. (I could have just created the SQL by hand, but I wanted a way to reproduce this effort later, if tables are added, dropped, or modified.)
CREATE FUNCTION create_all_statuses()
RETURNS VOID
LANGUAGE 'plpgsql'
AS $$
DECLARE
stmt TEXT;
tbl RECORD;
BEGIN
stmt := '';
FOR tbl IN EXECUTE $SQL$
SELECT DISTINCT table_name
FROM information_schema.columns a
JOIN information_schema.columns b
USING (table_name)
JOIN information_schema.tables t
USING (table_name)
WHERE a.column_name = 'code'
AND b.column_name = 'display_label'
AND table_name ~ '_statuses$'
AND t.table_type = 'BASE TABLE'
$SQL$
LOOP
IF (LENGTH(stmt) > 0)
THEN
stmt := stmt || ' UNION ';
END IF;
stmt := stmt || 'SELECT code, display_label, ' ||
quote_literal(tbl.table_name) ||
' AS table_name FROM ' ||
quote_ident(tbl.table_name);
END LOOP;
EXECUTE 'CREATE VIEW all_statuses AS ' || stmt;
RETURN;
END;
$$;
Now it's easy to answer the question:
select * from all_statuses where code = 'void';
 code | display_label |     table_name
------+---------------+--------------------
 void | Void          | inventory_statuses
 void | Void          | parcel_statuses
 void | Void          | pick_list_statuses
etc.
If your database uses boilerplate columns such as "last_modified" or "date_created" to record timestamps on rows, you could use similar logic to create a view that would tell you which tables were the most recently modified.
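To see what the function hands to EXECUTE, the same string assembly can be sketched outside the database. This is only an illustration in Ruby, not part of the original function: the helper names are hypothetical, the quoting helpers are simplified stand-ins for PostgreSQL's quote_ident and quote_literal (this one always quotes identifiers and ignores backslashes), and the table list is passed in by hand rather than read from information_schema.

```ruby
# Simplified stand-in for PostgreSQL's quote_ident: always wraps the name
# in double quotes and doubles any embedded double quote.
def quote_ident(name)
  '"' + name.gsub('"', '""') + '"'
end

# Simplified stand-in for quote_literal: single quotes, doubled if embedded.
def quote_literal(value)
  "'" + value.gsub("'", "''") + "'"
end

# Build the CREATE VIEW statement: one SELECT per status table, each tagged
# with its own table name, glued together with UNION.
def all_statuses_sql(table_names)
  selects = table_names.map do |table|
    "SELECT code, display_label, #{quote_literal(table)} AS table_name " \
      "FROM #{quote_ident(table)}"
  end
  "CREATE VIEW all_statuses AS " + selects.join(" UNION ")
end

puts all_statuses_sql(%w[inventory_statuses parcel_statuses])
```

Keeping the per-table SELECT in one place makes it easy to swap in other boilerplate columns (for example a timestamp column) when building a different cross-table view.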
NoSQL at RailsConf 2010: An Ecommerce Example
Even more so than Rails 3, NoSQL was a popular technical topic at RailsConf this year. I haven't had much exposure to NoSQL except for reading a few articles written by Ethan (Quick Thoughts on NoSQL Live Boston Conference, NoSQL Live: The Dynamo Derivatives (Cassandra, Voldemort, Riak), and Cassandra, Thrift, and Fibers in EventMachine), so I attended a few sessions to learn more.
First, it was reinforced several times that if you can read JSON, you should have no problem comprehending NoSQL. So, it shouldn't be too hard to jump into code examples! Next, I found it helpful when one of the speakers presented high-level categorization of NoSQL, whether or not the categories meant much to me at the time:
- Key-Value Stores: Advantages include that this is the simplest possible data model. Disadvantages include that range queries are not straightforward and modeling can get complicated. Examples include Redis, Riak, Voldemort, Tokyo Cabinet, MemcacheDB.
- Document stores: Advantages include that the value associated with a key is a document that exposes a structure that allows some database operations to be performed on it. Examples include CouchDB, MongoDB, Riak, FleetDB.
- Column-based stores: Examples include Cassandra, HBase.
- Graph stores: Advantages include that this allows for deep relationships. Examples include Neo4j, HypergraphDB, InfoGrid.
In one NoSQL talk, Flip Sasser presented an example to demonstrate how an ecommerce application might be migrated to use NoSQL, which was the most efficient (and very familiar) way for me to gain an understanding of NoSQL use in a Rails application. Flip introduced the models and relationships shown here:
In the transition to NoSQL, the transaction model stays as is. As a purchase is created, the Notification.create method is called.
class Purchase < ActiveRecord::Base
after_create :create_notification
# model relationships
# model validations
def total
quantity * product.price
end
protected
def create_notification
notifications.create({
:action => "purchased #{quantity == 1 ? 'a' : quantity} #{quantity == 1 ? product.name : product.name.pluralize}",
:description => "Spent a total of #{total}",
:item => self,
:user => user
}
)
end
end
Flip moves the Product class to a document store because it needs a lot of flexibility to handle the diverse product metadata. The structure of a product is defined in the Product class and nowhere else.
Before
class Product < ActiveRecord::Base
  serialize :info, Hash
end
After
class Product
  include MongoMapper::Document
  key :name, String
  key :image_path, String
  key :info, Hash
  timestamps!
end
The Notification class is moved to a Key-Value store. After a user completes a purchase, the create method is called to store a notification against the user that is to receive the notification.
Before
class Notification < ActiveRecord::Base
  # model relationships
  # model validations
end
After
require 'ostruct'
class Notification < OpenStruct
class << self
def create(attributes)
message = "#{attributes[:user].name} #{attributes[:action]}"
attributes[:user].follower_ids.each do |follower_id|
Red.lpush("user:#{follower_id}:notifications", {:message => message, :description => attributes[:description], :timestamp => Time.now}.to_json)
end
end
end
end
The user model remains an ActiveRecord model and uses the devise gem for user authentication, but is modified to retrieve the notifications, now an OpenStruct. The result is that whenever a user's friend makes a purchase, the user is notified of the purchase. In this simple example, a purchase contains one product only.
Before
class User < ActiveRecord::Base
# user authentication here
# model relationships
def notifications
Notification.where("friend_relationships.friend_id = notifications.user_id OR notifications.user_id = #{id}").
joins("LEFT JOIN friend_relationships ON friend_relationships.user_id = #{id}")
end
end
After
class User < ActiveRecord::Base
# user authentication here
# model relationships
def followers
User.where('users.id IN (friend_relationships.user_id)').
joins("JOIN friend_relationships ON friend_relationships.friend_id = #{id}")
end
def follower_ids
followers.map(&:id)
end
def notifications
(Red.lrange("user:#{id}:notifications", 0, -1) || []).map{|notification| Notification.new(ActiveSupport::JSON.decode(notification))}
end
end
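The fan-out pattern in the two "After" listings (push one JSON message onto each follower's list, then read a user's list back) can be tried without a Redis server. The sketch below is an assumption-laden illustration, not Flip's code: FakeRedis is a hypothetical in-memory stand-in that mimics only lpush and lrange, and the helper names push_notification and notifications_for are made up for this example.

```ruby
require 'json'

# Hypothetical in-memory stand-in for the two Redis list commands used above.
class FakeRedis
  def initialize
    @lists = {}
  end

  # Prepend a value to the list stored at key; return the new list length.
  def lpush(key, value)
    (@lists[key] ||= []).unshift(value)
    @lists[key].length
  end

  # Return the elements from start to stop (inclusive); -1 means the last one.
  def lrange(key, start, stop)
    (@lists[key] || [])[start..stop] || []
  end
end

# Write one JSON-encoded notification to every follower's list.
def push_notification(store, follower_ids, user_name, action, description)
  payload = { message: "#{user_name} #{action}", description: description }.to_json
  follower_ids.each do |follower_id|
    store.lpush("user:#{follower_id}:notifications", payload)
  end
end

# Read a user's notifications back, newest first, as plain hashes.
def notifications_for(store, user_id)
  store.lrange("user:#{user_id}:notifications", 0, -1).map { |raw| JSON.parse(raw) }
end

store = FakeRedis.new
push_notification(store, [2, 3], "Alice", "purchased a Widget", "Spent a total of 10")
puts notifications_for(store, 2).inspect
```

The write path does all the work per follower, so reading a user's notifications becomes a single list lookup instead of the join used in the "Before" version.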
The disadvantages of the NoSQL and RDBMS hybrid are that data portability is limited and ActiveRecord plugins can no longer be used. But the general idea is that performance justifies the move to NoSQL for some data. In several sessions I attended, the speakers reiterated that you will likely never be in a situation where you'll only use NoSQL, but that it's another tool available to suit performance-related business needs. I later spoke with a few Spree developers and we concluded that the NoSQL approach may work well in some applications for product and variant data for improved performance with flexibility, but we didn't come to an agreement on where else this approach may be applied.
Learn more about End Point's Ruby on Rails Development or Ruby on Rails Ecommerce Services.
pgcrypto pg_cipher_exists errors on upgrade from PostgreSQL 8.1
While migrating a client from an 8.1 Postgres database to an 8.4 Postgres database, I came across a very annoying pgcrypto problem. (pgcrypto is a very powerful and useful contrib module that contains many functions for encryption and hashing.) Specifically, the following functions were removed from pgcrypto as of version 8.2 of Postgres:
- pg_cipher_exists
- pg_digest_exists
- pg_hmac_exists
While the functions listed above were deprecated, and marked as such for a while, their complete removal from 8.2 presents problems when upgrading via a simple pg_dump. Specifically, even though the client was not using those functions, they were still there as part of the dump. Here's what the error message looked like:
$ pg_dump mydb --create | psql -X -p 5433 -f - >pg.stdout 2>pg.stderr
...
psql:<stdin>:2654: ERROR: could not find function "pg_cipher_exists" in file "/var/lib/postgresql/8.4/lib/pgcrypto.so"
psql:<stdin>:2657: ERROR: function public.cipher_exists(text) does not exist
While it doesn't stop the rest of the dump from importing, I like to remove any errors I can. In this case, it really was a SMOP. Inside the Postgres 8.4 source tree, in the contrib/pgcrypto directory, I added the following declarations to pgcrypto.h:
Datum pg_cipher_exists(PG_FUNCTION_ARGS);
Datum pg_digest_exists(PG_FUNCTION_ARGS);
Datum pg_hmac_exists(PG_FUNCTION_ARGS);
Then I added three simple functions to the bottom of the pgcrypto.c file that simply throw an error if they are invoked, letting the user know that the functions are deprecated. This is a much friendlier way than simply removing the functions, IMHO.
/* SQL function: pg_cipher_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_cipher_exists);
Datum
pg_cipher_exists(PG_FUNCTION_ARGS)
{
ereport(ERROR,
(errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
errmsg("pg_cipher_exists is a deprecated function")));
PG_RETURN_BOOL(false); /* not reached: ereport(ERROR) does not return */
}
/* SQL function: pg_digest_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_digest_exists);
Datum
pg_digest_exists(PG_FUNCTION_ARGS)
{
ereport(ERROR,
(errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
errmsg("pg_digest_exists is a deprecated function")));
PG_RETURN_BOOL(false); /* not reached: ereport(ERROR) does not return */
}
/* SQL function: pg_hmac_exists(text) returns boolean */
PG_FUNCTION_INFO_V1(pg_hmac_exists);
Datum
pg_hmac_exists(PG_FUNCTION_ARGS)
{
ereport(ERROR,
(errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
errmsg("pg_hmac_exists is a deprecated function")));
PG_RETURN_BOOL(false); /* not reached: ereport(ERROR) does not return */
}
After running make install from the pgcrypto directory, the dump proceeded without any further pgcrypto errors. From this point forward, if anyone attempts to use one of the functions, it will be quite obvious that the function is deprecated, rather than leaving the user wondering if they typed the function name incorrectly or wondering if pgcrypto is perhaps not installed.
Why not just add some dummy SQL functions to the pgcrypto.sql file instead of hacking the C code? Because pg_dump by default will create the database as a copy of template0. While there are other ways around the problem (such as putting the SQL functions into template1 and forcing the load to use that instead of template0, or by creating the database, adding the SQL functions, and then loading the data), this was the simplest approach.
Photo of Enigma machine by Marcin Wichary
Learn more about End Point's Postgres Support, Development, and Consulting.
The PGCon "Hall Track"
One of my favorite parts of PGCon is always the "hall track", a general term for the sideline discussions and brainstorming sessions that happen over dinner, between sessions (or sometimes during sessions), and pretty much everywhere else during the conference. This year's hall track topics seemed to be set by the developers' meeting; everywhere I went, someone was talking about hooks for external security modules, MERGE, predicate locking, extension packaging and distribution, or exposing transaction order for replication. Other developers' pet projects that didn't appear in the meeting showed up occasionally, including unlogged tables and range types. Even more than, for instance, the wiki pages describing the things people plan to work on, these interstitial discussions demonstrate the vibrancy of the community and give a good idea just how active our development really is.
This year I shared rooms with Robert Haas, so I got a good overview of his plans for global temporary and unlogged tables. I spent a while with Jeff Davis looking through the code for exclusion constraints and deciding whether it was realistically possible to cause a starvation problem with many concurrent insertions into a table with an exclusion constraint. I didn't spend the time I should have talking with Dimitri Fontaine about his PostgreSQL extensions project, but if time permits I'd like to see if I could help out with it. Nor did I find the time I'd have liked to work on PL/Parrot, but I was glad to meet Jonathan Leto, who has done most of the coding work thus far on that project.
In contrast to other conferences, I didn't have a particular itch of my own to scratch between sessions. During past conferences I've been eager to discuss ideas for multi-column statistics; though that work continues, slowly, time hasn't permitted enough recent development even for the topic to be fresh in my mind, much less worthy of in-depth discussion. This lack of one overriding subject turned out to be a refreshing change, however, as it left the other hall track subjects less filtered.
Finally, it was nice to spend time with co-workers, and in fact to meet (finally) one of the "Greg"s I'd talked to on the phone many times but never actually met in person. Various engagements in my family or his have gotten in the way in the past. One of the quirks of working for a distributed organization...
Update: Fixed link to developers' meeting wiki page, thanks to comment from roppert
Postgres Conference - PGCon2010 - Day Two
Day two of the PostgreSQL Conference started a little later than the previous day in obvious recognition of the fact that many people were up very, very late the night before. (Technically, this is day four, as the first two days consisted of tutorials; this was the second day of "talks".)
The first talk I went to was PgMQ: Embedding messaging in PostgreSQL by Chris Bohn. It was well attended, although there were definitely a lot of late-comers and bleary eyes. A tough slot to fill! Chris is from Etsy.com and I've worked with him there, although I had no interaction with the PgMQ project, which looks pretty cool. From the talk description:
PgMQ (PostgreSQL Message Queueing) is an add-on that embeds a messaging client inside PostgreSQL. It supports the AMQP, STOMP and OpenWire messaging protocols, meaning that it can work with all of the major messaging systems such as ActiveMQ and RabbitMQ. PgMQ enables two replication capabilities: "Eventually Consistent" Replication and sharding.
As near as I can tell, "eventually consistent" is the same as "asynchronous replication": the slave won't be the same as the master right away, but will be eventually. As with Bucardo and Slony, the actual lag is very small in practice: a handful of seconds at the most. I like the fact that it supports all those common messaging protocols. Chris mentioned in the talk that it should be possible for other systems like Bucardo to support something similar. I'll have to play around with PgMQ a bit and see about doing just that. :)

The typical post-talk gatherings
The next "talk" was the enigmatically labeled Replication Panel. Enigmatic in this case as it had no description whatsoever. It's a good thing I had decided to check it out anyway (I'm a sucker for any talk related to replication, in case it wasn't obvious yet). I was apparently nominated to be on the panel, representing Bucardo! So much for getting all my speaking done and over with the first day. The panel covered a pretty wide swath of Postgres replication technologies, each represented by someone very deep in its development. From left to right on a cluster of stools at the front of the room were:
- Londiste (Marko Kreen)
- Slony (Jan Wieck)
- pgpool-II (Tatsuo Ishii)
- Hot standby and Streaming replication (Heikki Linnakangas)
- Bucardo (Greg Sabino Mullane)
- Golconde (Gavin M. Roy)
After quick one-minute intros describing who we were and what our replication systems were, we took questions from the audience. Rather, Dan Langille played the part of the moderator and gathered written questions from the audience which he read to us, and we each took turns answering. We managed to get through 16 questions. All were interesting, even if some did not apply to all the solutions. Some of the more relevant ones I remember:
"If your replication solution was not available, which of the other replication solutions would you recommend?" This was my favorite question. My answer was: if using Bucardo in multi-master mode, switch to pgpool. If using in master-slave mode, use Slony.
"How will PG 9.0 affect your solution? Will your solution still remain relevant?" This most heavily affects Bucardo, Slony, and Londiste, and we all agreed that we're happy to lose users who simply need a read-only copy of their database. There remain plenty of use cases that 9.0 will not solve, however.
"For multi-master solutions: How are database collisions resolved? Do you recommend your solution for geographically remote locations?" This one is pretty much for me alone. :) I gave a quick overview of Bucardo's built-in conflict resolution systems, and how custom ones built on business logic work. Since Bucardo was originally built to support servers over a non-optimal network, the second part was an easy Yes.
"Is there a way to standardize and reduce the number of replication systems and focus on making the subset more robust, efficient, and versatile?" The general answer was no, as the use cases for all of them are so wildly different. I thought the only possible reduction was to combine Slony and Londiste, as they are very close technically and have pretty much identical use cases.
"How easy is it to switch masters? Are you planning on improving the tools to do so?" With Bucardo, switching is as easy as pointing to a different database if using master-master. However, Bucardo master-slave has no built-in support at all for failover (unlike Slony, which does). So the answer is "not easy at all" and yes, we want to provide tools to do so.
"What is your biggest bug, problem, or limitation you are fixing now?" All three of the async trigger solutions (Bucardo, Slony, and Londiste) answered "DDL triggers". Which is hopefully coming for 9.1 (stop reading this blog and get to work on that, Jan).
All in all, I really liked the panel, and I think the audience did as well. Hopefully we'll see more things like it at future conferences. Since we did not know the questions beforehand, and took everything from the audience, it was the polar opposite of someone giving a talk with prepared slides.
I had some people come up to me afterwards to ask for more details about Bucardo, because (as they pointed out), it's the only multi-master replication system for Postgres (not technically true, as pgpool and rubyrep provide multi-master use cases as well, but the former is synchronous and fairly complex, while the latter is very new and lacking some features). Maybe next year I should give a whole talk on Bucardo rather than just blabbing about it here on the blog. :)
After that, I popped into the Check Please! What Your Postgres Databases Wishes You Would Monitor talk by Robert Treat (who I also used to work with). It was a good talk, but pretty much review for me, as watching over and monitoring databases is what I spend a lot of my time doing. :) Here's the description:
Compared to many proprietary systems, Postgres tends to be pretty straight forward to run. However, if you want to get the most from your database, you shouldn't just set it and forget it, you need to monitor a few key pieces of information to keep performance going. This talk will review several key metrics you should be aware of, and explain under which scenarios you may need additional monitoring.
The final talk I went to was Deploying and testing triggers and functions in multiple databases by Norman Yamada. This was an interesting talk for me because he was using a lot of the code from the same_schema action in the check_postgres program to do the actual comparison. Indeed, I made some patches while at the conference to allow for better index comparisons at Norman's request. I also managed to get some work done on tail_n_mail and Bucardo while there - something about being surrounded by all that Postgres energy made me productive despite having very little free time.
I had to catch an early flight, and was not able to catch the final talk slot of the day, nor the closing session or the BOFs that night. Hopefully someone who did catch those will blog about it and let me know how it went. I hear the t-shirt we signed at the developer's meeting went for a sweet ransom.
If you went to PgCon, I have two requests for you.
First, please fill out the feedback for each talk you went to. It takes less than a minute per talk, and is invaluable for both the speakers and the conference organizers. Second, please blog about PgCon. It's helpful for people who did not get to go to see the conference through other people's eyes. And do it now, while things are still fresh.
If you did not go to PgCon, I have one request for you: go next year! Perhaps next year at PgCon 2011 we'll break the 200 person mark. Thanks to Dan Langille as always for creating PgCon and keeping it running smoothly year after year.
PostgreSQL Conference - PGCon 2010 - Day One
The first day of talks for PGCon 2010 is now over; here's a recap of the parts that I attended.
On Wednesday, the developer's meeting took place. It was basically 20 of us gathered around a long conference table, with Dave Page keeping us to a strict schedule. While there were a few side conversations and contentious issues, overall we covered an amazing amount of things in a short period of time, and actually made action items out of almost all of them. My favorite *decision* we made was to finally move to git, something that I and others have been championing for years. The other most interesting parts for me were the discussion of what features we will try to focus on for 9.1 (it's an ambitious list, no doubt), and DDL triggers! It sounds like Jan Wieck has already given this a lot of thought, so I'm looking forward to working with him in implementing these triggers (or at least nagging him about it if he slows down). These triggers will be immensely useful to replication systems like Bucardo and Slony, which implement DDL replication in a very manual and unsatisfactory way. These triggers will not be like the current triggers, in that they will not be directly attached to system tables. Instead, they will be associated with certain DDL events, such that you could have a trigger on any CREATE events (or perhaps also allowing something finer grained such as a trigger on a CREATE TABLE event). Whenever it comes in, I'll make sure that Bucardo supports it, of course!
The first day of talks kicked off with the plenary by Gavin Roy called "Perspectives on NoSQL" (description and slides are available). Gavin actually took the time to *gasp* research the topic, and gave a quick rundown of some of the more popular "NoSQL" solutions, including CouchDB, MongoDB, Cassandra, Project Voldemort, Redis, and Tokyo Tyrant. He then benchmarked all of them against Postgres for various tasks - and did it against both "regular safe" Postgres and "running with scissors" fsync-off Postgres. The results? Postgres scales, very well, and more than holds its own against the NoSQL newcomers. MongoDB did surprisingly well: see the slides for the details. His slides also had the unfortunate portmanteau of "YeSQL", which only helps to emphasize how silly our "PostgreSQL" name is. :)
The next talk was Postgres (for non-Postgres people) by Greg Sabino Mullane (me!). Unlike previous years, my slides are already online. Yes, at first blush, it seems a strange talk to give at a conference like this, but we always have a good number of people from other database systems that are considering Postgres, are in the process of migrating to Postgres, or are just new to Postgres. The talk was in three parts. The first was about the mechanics of migrating your application to Postgres: the data types that Postgres uses, how we implement indexes, the best way to migrate your data, and many other things, with an eye towards common migration problems (especially when coming from MySQL). The second part of the talk discussed some of the quirks of Postgres that people coming from DB2, Oracle, etc. should be aware of. Some things discussed: how Postgres does MVCC and the need for vacuum, our really smart planner and lack of hints, the automatic (and against the spec) lowercasing, and our concept of schemas. I also touched on what I see as some of our drawbacks: tuned for a toaster, no true in-place upgrade, the unpronounceable name, the lack of marketing. I then covered some of our perceived-but-not-real drawbacks: lack of replication, poor speed. What would a list of drawbacks be without a list of strengths? Among them: transactional DDL, very friendly and helpful community, PostGIS, authentication options, awesome query planner, the ability to create your own custom database objects, and our distributed nature that ensures the project cannot be bought out or destroyed. The last part of the talk went over the Postgres project itself: the community, the developers, the philosophy, and how it all fits together. I ran out of time so did not get to tell my "longest patch process ever" story for \dfS (six years!) but I don't think I missed anything important and gave time for some questions.
The next talk was Hypothetical Indexes towards self-tuning in PostgreSQL by Sergio Lifschitz. In the words of Sergio:
Hypothetical indexes are simulated index structures created solely in the database catalog. This type of index has no physical extension and, therefore, cannot be used to answer actual queries. The main benefit is to provide a means for simulating how query execution plans would change if the hypothetical indexes were actually created in the database. This feature is quite useful for database tuners and DBAs.
It was a very interesting talk. Robert Haas asked him to put it in the PostgreSQL license so we can easily put it into the project as needed. Sergio promised to make the change immediately after the talk!
After lunch, the next talk was pg_statsinfo - More useful statistics information for DBAs by Tatsuhito Kasahara. This talk was a little hard to follow along, but had some interesting ideas about monitoring Postgres, a lot of which overlapped with some of my projects such as tail_n_mail and check_postgres.
The next talk was Forensic Analysis of Corrupted Databases by Greg Stark. This was a neat little talk; many of the error messages he displayed were all too familiar to me. It was a nice overview of how to track down the exact location of a problem in a corrupted database, and some strategies for fixing it, including the old "using dd to write things from /dev/zero directly into your Postgres files" trick. There was even a discussion about the possibility of zeroing out specific parts of a page header, with the consensus that it would not work as one would hope.
After a quick hacky sack break with Robert Treat and some Canadian locals, I went to the final real talk of the day: The PostgreSQL Query Planner by Robert Haas. I had seen this talk recently, but wanted to see it again as I missed some of the beginning of the talk when I saw it at Pg East 2010 in Philly. Robert gave a good talk, and was very good at repeating the audience's questions. I didn't learn all that much, but it was a very good overview of the planner, including some of the new planner tricks (such as join removal) in 9.0 and 9.1.
After that, the lightning talks started. I really like lightning talks, and thankfully they weren't held on the last day of the conference this time (a common mistake). The MC was Selena Deckelmann, who did a great job of making sure all the slides were gathered up beforehand, and strictly enforced the five minute time limit. The list of slides is on the Postgres wiki. I talked on my latest favorite project, tail_n_mail - the slides are available on the wiki. I didn't make it through all my slides, so if you were at the talks, check out the PDF for the final two that were not shown. There seemed to be good interest in the project, and I had several people tell me afterwards they would try it out.
The night ended with the EnterpriseDB sponsored party. I spoke to a lot of people there, about replication, PITR scripts, log monitoring, the problem with a large number of inherited objects, and many other topics. Note to EDB: I don't think that venue is going to scale, as the conference gets bigger each year! The total number of people at the conference this year was 184, a new record.
A very good first day: I learned a lot, met new people, saw old friends, and hopefully sold Postgres to some non-Postgres people :). I also managed to git push some changes to tail_n_mail, check_postgres, and Bucardo. It's hard to say no to feature requests when someone asks you in person. :)
PostgreSQL switches to Git
Looks like the Postgres project is finally going to bite the bullet and switch to git as the canonical VCS. Some details are yet to be hashed out, but the decision has been made and a new repo will be built soon. Now to lobby to get that commit-with-inline-patches list to be created...
PostgreSQL 8.4 on RHEL 4: Teaching an old dog new tricks
So a client has been running a really old version of PostgreSQL in production for a while. We finally got the approval to upgrade them from 7.3 to the latest 8.4. Considering the age of the installation, it should come as little surprise that they had been running a similarly ancient OS: RHEL 4.
Like the installed PostgreSQL version, RHEL 4 is ancient -- 5 years old. I anticipated that in order to get us to a current version of PostgreSQL, we'd need to resort to a source build or rolling our own PostgreSQL RPMs. Neither approach was particularly appealing.
While the age/decrepitude of the current machine's OS came as little surprise, what did come as a surprise was that there were supported RPMs available for RHEL 4 in the community yum rpm repository, located at http://yum.pgrpms.org/8.4/redhat/rhel-4-i386/repoview/ (modulo your architecture of choice).
In order to get things installed, I followed the instructions for installing the specific yum repo. There were a few seconds where I was confused because the installation command was giving a "permission denied" error when attempting to install the 8.4 PGDG rpm as root. A little brainstorming and an lsattr later revealed that a previous administrator, apparently in the quest for über-security, had performed a chattr +i on the /etc/yum.repos.d directory.
Evil having been thwarted, in the interest of über-usability I did a quick chattr -i /etc/yum.repos.d and installed the PGDG rpm. Away we went. From that point, the install was completely straightforward; I had a PostgreSQL 8.4.4 system running in no time, and could finally get off that 7.3 behemoth. Now to talk my way into an OS upgrade...
Finding the PostgreSQL version - without logging in!
Metasploit used the error messages given by a PostgreSQL server to find out the version without actually having to log in and issue a "SELECT version()" command. The original article is at http://blog.metasploit.com/2010/02/postgres-fingerprinting.html and is worth a read. I'll wait.
The basic idea is that because version 3 of the Postgres protocol gives you the file and the line number in which the error is generated, you can use the information to figure out what version of Postgres is running, as the line numbers change from version to version. In effect, each version of Postgres reveals enough in its error message to fingerprint it. This was a neat little trick, and I wanted to explore it more myself. The first step was to write a quick Perl script to connect and get the error string out. The original Metasploit script focuses on failed login attempts, but after some experimenting I found an easier way was to send an invalid protocol number (Postgres expects "2.0" or "3.0"). Sending a startup packet with an invalid protocol of "3.1" gave me back the following string:
E|SFATALC0A000Munsupported frontend protocol 3.1:
server supports 1.0 to 3.0Fpostmaster.cL1507RProcessStartupPacket
The important parts of the string were the ones indicating the file and line number:
Fpostmaster.cL1507
In this case, we can clearly see that line 1507 of postmaster.c was throwing the error. After firing up a few more versions of Postgres and recording the line numbers, I found that all versions since 7.3 were hitting the same chunk of code from postmaster.c:
/* Check we can handle the protocol the frontend is using. */
if (PG_PROTOCOL_MAJOR(proto) < PG_PROTOCOL_MAJOR(PG_PROTOCOL_EARLIEST) ||
PG_PROTOCOL_MAJOR(proto) > PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST) ||
(PG_PROTOCOL_MAJOR(proto) == PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST) &&
PG_PROTOCOL_MINOR(proto) > PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST)))
ereport(FATAL,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("unsupported frontend protocol %u.%u: server supports %u.0 to %u.%u",
PG_PROTOCOL_MAJOR(proto), PG_PROTOCOL_MINOR(proto),
PG_PROTOCOL_MAJOR(PG_PROTOCOL_EARLIEST),
PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST),
PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST))));
Line numbers were definitely different across major versions of Postgres (e.g. 8.2 vs. 8.3), and were even different sometimes across revisions. Rather than fire up every possible revision of Postgres and run my program against it, I simply took advantage of the cvs tags (aka symbolic names) and did this:
cvs update -rREL8_3_0 -p postmaster.c | grep -Fn 'LATEST))))'
This showed me that the string occurred on line 1497 of postmaster.c. I created a Postgres instance and verified that the line number was the same. At that point, it was a simple matter of making a bash script to grab all releases since 7.3 and build up a comprehensive list of when that line changed from version to version.
Once that was done, I rolled the whole thing up into a new Perl script called "detect_postgres_version.pl". Here's the script, broken into pieces for explanation. A link to the entire script is at the bottom of the post.
First, we do some standard Perl script things and read in the __DATA__ section at the bottom of the script, which lists at which version the message has changed:
#!/usr/bin/env perl
## Quickly and roughly determine what version of Postgres is running
## greg@endpoint.com
use strict;
use warnings;
use IO::Socket;
use Data::Dumper;
use Getopt::Long;
## __DATA__ looks like this: filename / line / version when it changed
## postmaster.c 1287 7.4.0
## postmaster.c 1293 7.4.2
## postmaster.c 1293 7.4.29
##
## postmaster.c 1408 8.0.0
## postmaster.c 1431 8.0.2
## Build our hash of file-and-line to version matches
my %map;
my ($last,$lastmin,$lastline) = ('',0,0);
while (<DATA>) {
next if $_ !~ /(\w\S+)\s+(\d+)\s+(.+)/;
my ($file,$line,$version) = ($1,$2,$3);
die if $version !~ /(\d+)\.(\d+)\.(\d+)/;
my ($vmaj,$vmin,$vrev) = ($1,$2,$3);
my $current = "$file|$vmaj|$vmin";
if ($current eq $last) {
my ($lfile,$lmaj,$lmin) = split /\|/ => $last;
for (my $x = $lastmin+1 ; $x<$vrev; $x++) {
push @{$map{$file}{$lastline}}
=> ["$lmaj.$lmin","$lmaj.$lmin.$x"];
}
}
push @{$map{$file}{$line}} => ["$vmaj.$vmin",$version];
$last = $current;
$lastmin = $vrev;
$lastline = $line;
}
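The gap-filling in that loop is easy to misread, so here is a rough Python sketch of the same idea. It is not part of the original Perl script: the function name build_version_map and the tuple-based input are assumptions made purely for illustration.

```python
from collections import defaultdict

def build_version_map(rows):
    """Map file -> line -> list of versions that report that line.

    rows is an ordered list of (filename, line, version) tuples, where each
    row records the release at which the line number changed (or the last
    release of a branch), as in the __DATA__ section.
    """
    version_map = defaultdict(lambda: defaultdict(list))
    last_key, last_rev, last_line = None, 0, 0
    for filename, line, version in rows:
        major, minor, rev = (int(part) for part in version.split("."))
        key = (filename, major, minor)
        if key == last_key:
            # Revisions between the previous row and this one kept the
            # previous row's line number, so file them under that line.
            for skipped in range(last_rev + 1, rev):
                version_map[filename][last_line].append(f"{major}.{minor}.{skipped}")
        version_map[filename][line].append(version)
        last_key, last_rev, last_line = key, rev, line
    return version_map
```

Fed the three 7.4 rows from the comment above, line 1287 ends up holding 7.4.0 and 7.4.1, while line 1293 holds everything from 7.4.2 through 7.4.29.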
Next, we allow a few options to the script: port and host. We'll default to a Unix socket if the host is not set, and default to port 5432 if none is given:
## Read in user options and set defaults
my %opt;
GetOptions(\%opt,
'port=i',
'host=s',
);
my $port = $opt{port} || 5432;
my $host = $opt{host} || '';
We're ready to connect, using the very standard IO::Socket module. If the host starts with a slash, we assume this is the unix_socket_directory and replace the default '/tmp' location:
## Start the connection, either unix or tcp
my $server;
if (!$host or !index $host, '/') {
my $path = $host || '/tmp';
$server = IO::Socket::UNIX->new(
Type => IO::Socket::SOCK_STREAM,
Peer => "$path/.s.PGSQL.$port",
) or die "Could not connect!: $@";
}
else {
$server = IO::Socket::INET->new(
PeerAddr => $host,
PeerPort => $port,
Proto => 'tcp',
Timeout => 3,
) or die "Could not connect!: $@";
}
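The rule for choosing between a Unix socket and TCP can be summarized in a few lines. This is only a sketch in Python, with a made-up helper name (socket_target); it returns a description of where to connect rather than opening a socket.

```python
def socket_target(host="", port=5432, default_dir="/tmp"):
    """Return ("unix", path) or ("tcp", (host, port)).

    An empty host, or a host starting with a slash, means a Unix socket;
    a leading slash is treated as the socket directory.
    """
    if not host or host.startswith("/"):
        directory = host or default_dir
        return ("unix", f"{directory}/.s.PGSQL.{port}")
    return ("tcp", (host, port))
```

For example, socket_target("/var/run/postgresql", 5433) points at /var/run/postgresql/.s.PGSQL.5433, while socket_target("db.example.com") falls through to TCP on port 5432.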
Now we're ready to actually send something over our new socket. Postgres expects the startup packet to be in a certain format. We'll follow that format, but send it an invalid protocol number, 3.1. The rest of the information does not really matter, but we'll also tell it we're connecting as user "pg". Finally, we read back in the message, extract the file and line number, and spit them back out to the user:
## Build and send the packet
my $packet = pack('nn', 3,1) . "user\0pg\0\0";
$packet = pack('N', length($packet) + 4). $packet;
$server->send($packet, 0);
## Get the message back and extract the filename and line number
my $msg;
recv $server, $msg, 1000, 0;
if ($msg !~ /F([\w\.]+)\0L(\d+)/) {
die "Could not find a file and line from error message: $msg\n";
}
my ($file,$line) = ($1,$2);
print "File: $file Line: $line\n";
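For readers who don't speak Perl's pack, here is a hedged Python equivalent of the packet construction, plus a slightly more structured way to pull the fields out of the reply. The function names are invented for this sketch, and parse_error_fields assumes it is handed only the field section of the error message (what follows the leading 'E' type byte and the 4-byte length).

```python
import struct

def build_startup_packet(major=3, minor=1, user="pg"):
    # Protocol version as two 16-bit big-endian integers, followed by
    # NUL-terminated key/value strings and a final NUL ending the list.
    body = struct.pack("!HH", major, minor) + b"user\0" + user.encode("ascii") + b"\0\0"
    # The 4-byte length prefix counts itself as well as the body.
    return struct.pack("!I", len(body) + 4) + body

def parse_error_fields(fields_blob):
    """Split NUL-separated error fields into a {code letter: value} dict."""
    fields = {}
    for chunk in fields_blob.split(b"\0"):
        if not chunk:
            break  # an empty chunk marks the end of the field list
        fields[chr(chunk[0])] = chunk[1:].decode("ascii", "replace")
    return fields
```

With the reply shown earlier, the "F" entry would be postmaster.c and the "L" entry 1507, which is the same pair the Perl regex extracts.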
Finally, we try to map the file name and line number we received back to the version of PostgreSQL it came from. If the file is not recognized, or the line number is not known, we bail out early:
$map{$file}
or die qq{Sorry, I do not know anything about the file "$file"\n};
$map{$file}{$line}
or die qq{Sorry, I do not know anything about line $line of file "$file"\n};
If there is only one result for this file and line number, we can state what it is and exit.
my $result = $map{$file}{$line};
if (1 == @$result) {
print "Most likely Postgres version $result->[0][1]\n";
exit;
}
In most cases, though, we don't know the exact version down to the revision after the second dot, so we'll state what the major version is, and all the possible revisions:
## Walk through and figure out which versions it may be.
## For now, we know that the major version does not overlap
print "Most likely Postgres version $result->[0][0]\n";
print "Specifically, one of these:\n";
for my $row (@$result) {
print " Postgres version $row->[1]\n";
}
exit;
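The reporting logic boils down to "one candidate means an exact answer, several candidates mean a major version plus a list". A minimal Python sketch of that decision, using a hypothetical describe_match helper that takes the list of full version strings found for one file and line:

```python
def describe_match(candidates):
    """Return the report lines for the versions sharing one file and line."""
    if not candidates:
        return ["Sorry, I do not know anything about that file and line"]
    if len(candidates) == 1:
        return [f"Most likely Postgres version {candidates[0]}"]
    # All candidates share a major version, so take it from the first one.
    major = ".".join(candidates[0].split(".")[:2])
    report = [f"Most likely Postgres version {major}", "Specifically, one of these:"]
    report += [f"  Postgres version {version}" for version in candidates]
    return report
```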
The only thing left is the DATA section, which I'll show here to be complete:
__DATA__
## Format: filename line version
postmaster.c 1167 7.3.0
postmaster.c 1167 7.3.21
postmaster.c 1287 7.4.0
postmaster.c 1293 7.4.2
postmaster.c 1293 7.4.29
postmaster.c 1408 8.0.0
postmaster.c 1431 8.0.2
postmaster.c 1441 8.0.5
postmaster.c 1445 8.0.6
postmaster.c 1439 8.0.7
postmaster.c 1443 8.0.9
postmaster.c 1445 8.0.14
postmaster.c 1445 8.0.25
postmaster.c 1449 8.1.0
postmaster.c 1450 8.1.1
postmaster.c 1454 8.1.2
postmaster.c 1448 8.1.3
postmaster.c 1452 8.1.4
postmaster.c 1448 8.1.9
postmaster.c 1454 8.1.10
postmaster.c 1454 8.1.21
postmaster.c 1432 8.2.0
postmaster.c 1437 8.2.1
postmaster.c 1440 8.2.5
postmaster.c 1432 8.2.17
postmaster.c 1497 8.3.0
postmaster.c 1507 8.3.8
postmaster.c 1507 8.3.11
postmaster.c 1570 8.4.0
postmaster.c 1621 8.4.1
postmaster.c 1621 8.4.4
postmaster.c 1664 9.0.0
(Because version 9.0 is not released yet, its line number may still change.)
I found this particular protocol error to be a good one because there is no overlap of line numbers across major versions. Of the approximately 125 different versions released since 7.3.0, only 6 are unique enough to identify to the exact revision. That's okay for this iteration of the script. If you wanted to know the exact revision, you could try other errors, such as an invalid login, as the Metasploit code does.
The complete code can be read here: detect_postgres_version.pl
I'll be giving a talk later on this week at PgCon 2010, so say hi if you see me there. I'll probably be giving a lightning talk as well.
PostgreSQL template databases to restore to a known state
Someone asked on the mailing lists recently about restoring a PostgreSQL database to a known state for testing purposes. How to do this depends a little bit on what one means by "known state", so let's explore a few scenarios and their solutions.
First, let's assume you have a Postgres cluster with one or more databases that you create for developers or QA people to mess around with. At some point, you want to "reset" the database to the pristine state it was in before people started making changes to it.
The first situation is that people have made both DDL changes (such as ALTER TABLE ... ADD COLUMN) and DML changes (such as INSERT/UPDATE/DELETE). In this case, what you want is a complete snapshot of the database at a point in time, which you can then restore from. The easiest way to do this is to use the TEMPLATE feature of the CREATE DATABASE command.
Every time you run CREATE DATABASE, it uses an already existing database as the "template". Basically, it creates a copy of the template database you specify. If no template is specified, it uses "template1" by default, so that these two commands are equivalent:
CREATE DATABASE foobar;
CREATE DATABASE foobar TEMPLATE template1;
Thus, if we want to create a complete copy of an existing database, we simply use it as a template for our copy:
CREATE DATABASE mydb_template TEMPLATE mydb;
Thus, when we want to restore the mydb database to the exact same state as it was when we ran the above command, we simply do:
DROP DATABASE mydb;
CREATE DATABASE mydb TEMPLATE mydb_template;
You may want to make sure that nobody changes your new template database. One way to do this is to not allow any non-superusers to connect to the database by setting the connection limit to zero. This can be done either at creation time, or afterwards, like so:
CREATE DATABASE mydb_template TEMPLATE mydb CONNECTION LIMIT 0;
ALTER DATABASE mydb_template CONNECTION LIMIT 0;
You may want to go further by granting the database official "template" status by adjusting the datistemplate column in the pg_database table:
UPDATE pg_database SET datistemplate = TRUE WHERE datname = 'mydb_template';
This will allow anyone to use the database as a template, as long as they have the CREATEDB privilege. You can also restrict *all* connections to the database, even superusers, by adjusting the datallowconn column:
UPDATE pg_database SET datallowconn = FALSE WHERE datname = 'mydb_template';
Another way to restore the database to a known state is to use the pg_dump utility to create a file, then use psql to restore that database. In this case, the command to save a copy would be:
pg_dump mydb --create > mydb.template.pg
The --create option tells pg_dump to create the database itself as the first command in the file. If you look at the generated file, you'll see that it is using template0 as the template database in this case. Why does Postgres have template0 and template1? The template1 database is meant as a user configurable template that you can make changes to that will be picked up by all future CREATE DATABASE commands (a common example is a CREATE LANGUAGE command). The template0 database on the other hand is meant as a "hands off, don't ever change it" stable database that can always safely be used as a template, with no changes from when the cluster was first created. To that end, you are not even allowed to connect to the template0 database (thanks to the datallowconn column mentioned earlier).
Now that we have a file (mydb.template.pg), the procedure to recreate the database becomes:
psql -X -c 'DROP DATABASE mydb'
psql -X --set ON_ERROR_STOP=on --quiet --file mydb.template.pg
We use the -X argument to ensure we don't have any surprises lurking inside of psqlrc files. The --set ON_ERROR_STOP=on option tells psql to stop processing the moment it encounters an error, and the --quiet tells psql to not be verbose and only let us know about very important things. (While I normally advocate using the --single-transaction option as well, we cannot in this case as our file contains a CREATE DATABASE line).
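If you'd rather drive this from a script than type it each time, the two commands translate directly into argument lists. The following Python sketch assumes psql is on your PATH and that the defaults (user, host, database to connect to) are the same ones you'd get at the shell; the helper names are hypothetical.

```python
import subprocess

def restore_commands(dump_file, dbname="mydb"):
    """Build the two psql invocations: drop the database, then replay the dump."""
    drop = ["psql", "-X", "-c", f"DROP DATABASE {dbname}"]
    restore = ["psql", "-X", "--set", "ON_ERROR_STOP=on", "--quiet", "--file", dump_file]
    return [drop, restore]

def run_restore(dump_file, dbname="mydb"):
    # check=True raises CalledProcessError as soon as a command exits non-zero,
    # so a failed DROP stops us before the restore is attempted.
    for argv in restore_commands(dump_file, dbname):
        subprocess.run(argv, check=True)
```

Calling run_restore("mydb.template.pg") would then perform the same reset as the two shell commands above.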
What if (as someone posited in the thread) the original poster really wanted only the *data* to be cleaned out, and not the schema (e.g. DDL)? In this case, what we want to do is remove all rows from all tables. The easiest way to do this is with the TRUNCATE command of course. Because we don't want to worry about which tables need to be deleted before other ones because of foreign key constraints, we'll also use the CASCADE option to TRUNCATE. We'll query the system catalogs for a list of all user tables, generate truncate commands for them, and then play back the commands we just created. First, we create a simple text file containing a query that generates a TRUNCATE command for every user table:
SELECT 'TRUNCATE TABLE '
|| quote_ident(nspname)
|| '.'
|| quote_ident(relname)
|| ' CASCADE;'
FROM pg_class
JOIN pg_namespace n ON (n.oid = relnamespace)
WHERE nspname !~ '^pg'
AND nspname <> 'information_schema'
AND relkind = 'r';
Once that's saved as truncate_all_tables.pg, resetting the database by removing all rows from all tables becomes as simple as:
psql mydb -X -t -f truncate_all_tables.pg | psql mydb --quiet
We again use the --quiet option to limit the output, as we don't need to see a string of "TRUNCATE TABLE" strings scroll by. The -t option (also written as --tuples-only) prevents the headers and footers from being output, as we don't want to pipe those back in.
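To make the "SQL that writes SQL" step more concrete, here is a small Python sketch of what the catalog query produces, given a list of (schema, table) pairs. The function names are made up for illustration, and the quoting helper is a simplification: unlike the real quote_ident, it always adds double quotes.

```python
def quote_ident(name):
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def truncate_statements(tables):
    """Generate TRUNCATE ... CASCADE for every table outside system schemas."""
    statements = []
    for schema, table in tables:
        # Same filter as the query: skip pg* schemas and information_schema
        if schema.startswith("pg") or schema == "information_schema":
            continue
        statements.append(
            f"TRUNCATE TABLE {quote_ident(schema)}.{quote_ident(table)} CASCADE;"
        )
    return statements
```

Given [("public", "customer"), ("pg_catalog", "pg_class")], only the customer table gets a TRUNCATE line.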
It's most likely you'd also want the sequences to be reset to their starting point as well. While sequences generally start at "1", we'll take out the guesswork by using the "ALTER SEQUENCE seqname RESTART" syntax. We'll append the following SQL to the text file we created earlier:
SELECT 'ALTER SEQUENCE '
|| quote_ident(nspname)
|| '.'
|| quote_ident(relname)
|| ' RESTART;'
FROM pg_class
JOIN pg_namespace n ON (n.oid = relnamespace)
WHERE nspname !~ '^pg'
AND nspname <> 'information_schema'
AND relkind = 'S';
The command is run the same as before, but now in addition to table truncation, the sequences are all reset to their starting values.
A final way to restore the database to a known state is a variation on the previous pg_dump command. Rather than save the schema *and* data, we simply want to restore the database without any data:
## Create the template file:
pg_dump mydb --schema-only --create > mydb.template.schemaonly.pg
## Restore it:
psql -X -c 'DROP DATABASE mydb'
psql -X --set ON_ERROR_STOP=on --file mydb.template.schemaonly.pg
Those are a few basic ideas on how to reset your database. There are a few limitations that got glossed over, such as that nobody can be connected to the database that is being used as a template for another one when the CREATE DATABASE command is being run, but this should be enough to get you started.
Tail_n_Mail does Windows (log file monitoring)
I've just released version 1.10.1 of tail_n_mail.pl, the handy script for watching over your Postgres logs and sending email when interesting things happen.
Much of the recent work on tail_n_mail has been in improving the parsing of statements in order to normalize them and give reports like this:
[1] From files A to Q Count: 839
First: [A] 2010-05-08T05:10:46-05:00 alpha postgres[13567]
Last: [Q] 2010-05-09T05:02:27-05:00 bravo postgres[19334]
ERROR: duplicate key violates unique constraint "unique_email_address"
STATEMENT: INSERT INTO email_table (id, email, request, token) VALUES (?)

[2] From files C to E (between lines 12523 of A and 268431 of B, occurs 6159 times)
First: [C] 2010-05-04 16:32:23 UTC [22504]
Last: [E] 2010-05-05 05:04:53 UTC [23907]
ERROR: invalid byte sequence for encoding "UTF8": 0x????
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

## The above examples are from two separate instances, the first
## of which has the "find_line_number" option turned off
However, I've only ever used tail_n_mail on Linux-like systems, so it would not work on Windows systems...until now. Thanks to an error report and patch from Paulo Saudin, this program will now work on Windows. There is a new option, mailmode, which defaults to 'sendmail', for the same behavior as previous versions of tail_n_mail. This assumes you have access to a sendmail binary (which may or may not be from the actual Sendmail program: many mail programs provide a compatible binary of the same name). If you don't have sendmail, you can now specify an argument of 'smtp' to the mailmode argument (you can also simply use --smtp). This switches to using the Net::SMTP::SSL module to send the mail instead of sendmail.
Switching the mailmode is not enough, of course, so there are some additional flags to help the mail go out:
- --mailserver : the name of the outgoing SMTP server
- --mailuser : the user to authenticate with
- --mailpass : the password of the user
- --mailport : the port to use: defaults to 465
Needless to say, using the --mailpass option from the command line or even in a script is not the best practice, so it is highly recommended that you put the new variables inside a tailnmailrc file. When the script starts, it looks for a file named .tailnmailrc in the current directory. If that is not found, it looks for the same file in your home directory (or technically, whatever the HOME environment variable is set to). If that does not exist, it checks for the file /etc/tailnmailrc. You can override those checks by specifying the file directly with the --tailnmailrc= option, or disable all rc files with the --no-tailnmailrc option.
The tailnmailrc file is very straightforward: each line is a name and value pair, separated by a colon or an equal sign. Lines starting with a '#' indicate a comment and are skipped. So someone using the new Net::SMTP::SSL method might have a .tailnmailrc in their home directory that looks like this:
mailmode=smtp
mailserver=mail.example.com
mailuser=greg@example.com
mailpass=mysupersekretpassword
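The lookup order and the file format described above are simple enough to sketch in a few lines of Python. This is not how tail_n_mail itself is written (it is a Perl script); the function names are invented here, and splitting on whichever of ':' or '=' comes first is an assumption about how ambiguous lines would be handled.

```python
import os

def rc_candidates(home=None):
    """Files checked, in order: current directory, $HOME, then /etc."""
    if home is None:
        home = os.environ.get("HOME", "")
    return [".tailnmailrc", f"{home}/.tailnmailrc", "/etc/tailnmailrc"]

def parse_rc(text):
    """Parse name/value pairs separated by ':' or '='; '#' lines are comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first separator found (assumption: earliest one wins)
        positions = [p for p in (line.find(":"), line.find("=")) if p != -1]
        if not positions:
            continue
        cut = min(positions)
        settings[line[:cut].strip()] = line[cut + 1:].strip()
    return settings
```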
The tail_n_mail program is open source and BSD licensed. Contributions are always welcome: send a patch, or fork a version through the Github mirror. There is also a Bugzilla system to accept bug reports and feature requests.
Tickle me Postgres: Tcl inside PostgreSQL with pl/tcl and pl/tclu
Although I really love Pl/Perl and find it the most useful language to write PostgreSQL functions in, Postgres has had (for a long time) another set of procedural languages: Pl/Tcl and Pl/TclU. The Tcl language is pronounced "tickle", so those two languages are pronounced as "pee-el-tickle" and "pee-el-tickle-you". The pl/tcl languages have been around since before any others, even pl/perl; for a long time in the early days of Postgres using pl/tclu was the only way to do things "outside of the database", such as making system calls, writing files, sending email, etc.
Sometimes people are surprised when they hear I still use Tcl. Although it's not as widely mentioned as other procedural languages, it's a very clean, easy to read, powerful language that shouldn't be overlooked. Of course, with Postgres, you have a wide variety of languages to write your functions in.
The nice thing about Tcl is that not only is it an easy language to write in, it's fully supported by Postgres. Only three languages are maintained inside the Postgres tree itself: Perl, Tcl, and Python. Only two of those have a trusted and untrusted version: Perl and Tcl. All procedural languages in Postgres are untrusted by default, which means they can do things like make system calls. To be a trusted language, there must be some capacity to limit what can be done by the language. With Perl, this is accomplished through the "Safe" Perl module. For Tcl, this is accomplished by having two versions of the Tcl interpreter: a normal one for pltclu and a separate one that uses the "Safe-Tcl mechanism" for pltcl.
Let's take a quick look at what a pltcl function looks like. We'll use pl/tcl to tackle the common problem of "SELECT COUNT(*) is very slow" by tracking the row count using triggers as we go along. For this, we'll start with a sample table for which we want to be able to find out exactly how many rows it holds at any time, without suffering the delay of COUNT(*). Here's the table definition, and a quick command to populate it with some dummy data:
CREATE SEQUENCE customer_id_seq;
CREATE TABLE customer (
id INTEGER NOT NULL DEFAULT nextval('customer_id_seq') PRIMARY KEY,
email TEXT NULL,
address TEXT NULL,
cdate TIMESTAMPTZ NOT NULL DEFAULT now()
);
INSERT INTO customer (email, address)
SELECT 'jsixpack@example.com', '123 Main Street'
FROM generate_series(1,10000);
A quick review: we create a sequence for use by the table to populate its primary key, the 'id' column. Each customer also has an optional email and address, plus we automatically track when we create the row by using the "DEFAULT now()" trick on the 'cdate' column. Finally, we use the super handy generate_series function to populate the new table with ten thousand rows of data.
Next, we'll create a helper table that will keep track of the rows for us. We'll make it generic so that it can track any number of tables:
CREATE TABLE table_count (
schemaname TEXT NOT NULL,
tablename TEXT NOT NULL,
rows BIGINT NOT NULL DEFAULT 0
);
INSERT INTO table_count(schemaname,tablename,rows)
SELECT 'public', 'customer', count(*) FROM customer;
We also populated it with the current number of rows in customer. Of course, this will be out of date as soon as someone updates the table, so let's add our triggers. We don't want to update the table_count table on every single row change, but only at the end of each statement. To do that, we'll make a row-level trigger that stores up the changes inside a global variable, and then a statement-level trigger that uses the global variable to update the table_count table.
CREATE FUNCTION update_table_count_row()
RETURNS TRIGGER
SECURITY DEFINER
VOLATILE
LANGUAGE pltcl
AS $BC$
## Declare tablecount as a global variable so other functions
## can access our changes
variable tablecount
## Set the local count of rows changed to 0
set rows 0
## $TG_op indicates what type of command was just run
## Modify the local variable rows depending on what we just did
switch $TG_op {
INSERT {
incr rows 1
}
UPDATE {
## No change in number of rows
## We could also leave out the ON UPDATE from the trigger below
}
DELETE {
incr rows -1
}
}
## The tablecount variable will be an associative array
## The index will be this table's name, the value is the rows changed
## We should probably be using $TG_schema_name as well, but we'll ignore that
## If there is no variable for this table yet, create it, otherwise just change it
if {![ info exists tablecount($TG_table_name) ] } {
set tablecount($TG_table_name) $rows
} else {
incr tablecount($TG_table_name) $rows
}
return OK
$BC$;
CREATE FUNCTION update_table_count_statement()
RETURNS TRIGGER
SECURITY DEFINER
LANGUAGE pltcl
AS $BC$
## Make sure we access the global version of the tablecount variable
variable tablecount
## If it doesn't exist yet (for example, when an update changes no
## rows), we simply exit early without making changes
if { ! [ info exists tablecount ] } {
return OK
}
## Same logic if our specific entry in the array does not exist
if { ! [ info exists tablecount($TG_table_name) ] } {
return OK
}
## If no rows were changed, we simply exit
if { $tablecount($TG_table_name) == 0 } {
return OK
}
## Update the table_count table: may be a positive or negative shift
spi_exec "
UPDATE table_count
SET rows=rows+$tablecount($TG_table_name)
WHERE tablename = '$TG_table_name'
"
## Reset the global variable for the next round
set tablecount($TG_table_name) 0
return OK
$BC$;
CREATE TRIGGER update_table_count_row
AFTER INSERT OR UPDATE OR DELETE
ON public.customer
FOR EACH ROW
EXECUTE PROCEDURE update_table_count_row();
CREATE TRIGGER update_table_count_statement
AFTER INSERT OR UPDATE OR DELETE
ON public.customer
FOR EACH STATEMENT
EXECUTE PROCEDURE update_table_count_statement();
(Caveat: because there is a single Tcl interpreter for all pl/tcl functions, these functions are not 100% safe, as there is a theoretical chance that changes made by processes running at the exact same time may step on each other's global variables. In practice, this is unlikely.)
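If the split between the two triggers is hard to picture, this toy Python model may help. It is only a simulation of the bookkeeping, not anything that runs inside Postgres: the class name is made up, the pending dictionary stands in for the Tcl global array, and table_count stands in for the helper table.

```python
class RowCountTracker:
    """Toy model of the row-level and statement-level trigger pair."""

    def __init__(self, initial_counts):
        self.table_count = dict(initial_counts)  # plays the table_count table
        self.pending = {}                        # plays the Tcl global array

    def row_trigger(self, table, op):
        # Fired once per affected row: only accumulate, never write.
        delta = {"INSERT": 1, "UPDATE": 0, "DELETE": -1}[op]
        self.pending[table] = self.pending.get(table, 0) + delta

    def statement_trigger(self, table):
        # Fired once per statement: apply the accumulated change, then reset.
        delta = self.pending.get(table, 0)
        if delta == 0:
            return
        self.table_count[table] += delta
        self.pending[table] = 0
```

Four row_trigger("customer", "INSERT") calls followed by one statement_trigger("customer") cause a single update of the stored count, which is the whole point of batching the changes.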
If everything is working correctly, we should see the entries in the table_count table match up with the output of SELECT COUNT(*). Let's take a look via a psql session:
psql=# \t
Showing only tuples.
psql=# \a
Output format is unaligned.
psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|10000
10000
psql=# UPDATE customer SET email=email WHERE id <= 10;
UPDATE 10
psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|10000
10000
psql=# INSERT INTO customer (email, address)
psql-# SELECT email, address FROM customer LIMIT 4;
INSERT 0 4
psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|10004
10004
psql=# DELETE FROM customer WHERE id <= 10;
DELETE 10
psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|9994
9994
psql=# TRUNCATE TABLE customer;
TRUNCATE TABLE
psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|9994
0
Whoops! Everything matched up until that TRUNCATE. On earlier versions of Postgres, there was no way around that problem, but if we have Postgres version 8.4 or better, we can use truncate triggers!
CREATE FUNCTION update_table_count_truncate()
RETURNS TRIGGER
SECURITY DEFINER
LANGUAGE pltcl
AS $BC$
spi_exec "
UPDATE table_count
SET rows=0
WHERE tablename = '$TG_table_name'
"
## Link to the global tablecount array before resetting our entry
variable tablecount
set tablecount($TG_table_name) 0
return OK
$BC$;
CREATE TRIGGER update_table_count_truncate
AFTER TRUNCATE
ON public.customer
FOR EACH STATEMENT
EXECUTE PROCEDURE update_table_count_truncate();
Pretty straightforward, let's make sure it works:
psql=# TRUNCATE TABLE customer;
TRUNCATE TABLE
psql=# SELECT * FROM table_count; SELECT COUNT(*) FROM customer;
public|customer|0
0
Success! This was a fairly contrived example, but Tcl (and especially pl/tclU) offers a lot more functionality. If you want to examine pl/tcl and pl/tclu for yourself, you'll need to make sure it's compiled into the Postgres you are using. If using a packaging system, it's as simple as doing this (or something like it, depending on what packaging system you use):
yum install postgresql-pltcl
If compiling from source, just pass the --with-tcl option to configure. You'll probably also need to install the Tcl development package, e.g. with yum install tcl-devel
Once installed, installing it into a specific database is as simple as:
$ CREATE LANGUAGE pltcl;
CREATE LANGUAGE
$ CREATE LANGUAGE pltclu;
CREATE LANGUAGE
For more about Tcl, check out The Tcl Wiki, the Tcl tutorial, or this Tcl reference. For more about pl/tcl and pl/tclu, visit the Postgres pltcl documentation.
Viewing Postgres function progress from the outside
Getting visibility into what your PostgreSQL function is doing can be a difficult task. While you can sprinkle notices inside your code, for example with the RAISE feature of plpgsql, that only shows the notices to the session that is currently running the function. Let's look at a solution to peek inside a long-running function from any session.
While there are a few ways to do this, one of the most elegant is to use Postgres sequences, which have the unique property of living "outside" the normal MVCC visibility rules. We'll abuse this feature to allow the function to update its status as it goes along.
First, let's create a simple example function that simulates doing a lot of work, and taking a long time to do so. The function doesn't really do anything, of course, so we'll throw some random sleeps in to emulate the effects of running on a busy production machine. Here's what the first version looks like:
DROP FUNCTION IF EXISTS slowfunc();
CREATE FUNCTION slowfunc()
RETURNS TEXT
VOLATILE
SECURITY DEFINER
LANGUAGE plpgsql
AS $BC$
DECLARE
x INT = 1;
mynumber INT;
BEGIN
RAISE NOTICE 'Start of function';
WHILE x <= 5 LOOP
-- Random number from 1 to 10
SELECT 1+(random()*9)::int INTO mynumber;
RAISE NOTICE 'Start expensive step %: time to run=%', x, mynumber;
PERFORM pg_sleep(mynumber);
x = x + 1;
END LOOP;
RETURN 'End of function';
END
$BC$;
Pretty straightforward function: we simply emulate doing five expensive steps, and output a small notice as we go along. Running it gives this output (with pauses from 1-10 seconds of course):
$ psql -f slowfunc.sql
DROP FUNCTION
CREATE FUNCTION
psql:slowfunc.sql:30: NOTICE: Start of function
psql:slowfunc.sql:30: NOTICE: Start expensive step 1: time to run=2
psql:slowfunc.sql:30: NOTICE: Start expensive step 2: time to run=7
psql:slowfunc.sql:30: NOTICE: Start expensive step 3: time to run=3
psql:slowfunc.sql:30: NOTICE: Start expensive step 4: time to run=8
psql:slowfunc.sql:30: NOTICE: Start expensive step 5: time to run=5
 slowfunc
-----------------
 End of function
To grant some visibility to other processes about where we are, we're going to change a sequence from within the function itself. First we need to decide on what sequence to use. While we could pick a common name, this won't allow us to run the function in more than one process at a time. Therefore, we'll create unique sequences based on the PID of the process running the function. Doing so is fairly trivial for an application: just create that sequence before the expensive function is called. For this example, we'll use some psql tricks to achieve the same effect like so:
\t
\o tmp.drop.sql
SELECT 'DROP SEQUENCE IF EXISTS slowfuncseq_' || pg_backend_pid() || ';';
\o tmp.create.sql
SELECT 'CREATE SEQUENCE slowfuncseq_' || pg_backend_pid() || ';';
\o
\t
\i tmp.drop.sql
\i tmp.create.sql
From the top, this script turns off everything but tuples (so we have a clean output), then arranges for all output to go to the file named "tmp.drop.sql". Then we build a sequence name by concatenating the string 'slowfuncseq_' with the current PID. We put that into a DROP SEQUENCE statement. Then we redirect the output to a new file named "tmp.create.sql" (this closes the old one as well). We do the same thing for CREATE SEQUENCE. Finally, we stop sending things to the file, turn off "tuples only" mode, and import the two files we just created, first to drop the sequence if it exists, and then to create it. The files will look something like this:
$ more tmp.*.sql
::::::::::::::
tmp.drop.sql
::::::::::::::
DROP SEQUENCE IF EXISTS slowfuncseq_8762;
::::::::::::::
tmp.create.sql
::::::::::::::
CREATE SEQUENCE slowfuncseq_8762;
The only thing left is to add the calls to the sequence from within the function itself. Remember that the sequence called must exist, or the function will throw an exception, so make sure you create the sequence before the function is called! (Alternatively, you could use the same named sequence every time, but as explained before, you lose the ability to track more than one iteration of the function at a time.)
DROP FUNCTION IF EXISTS slowfunc();
CREATE FUNCTION slowfunc()
RETURNS TEXT
VOLATILE
SECURITY DEFINER
LANGUAGE plpgsql
AS $BC$
DECLARE
  x INT = 1;
  mynumber INT;
  seqname TEXT;
BEGIN
  -- The sequence is named after the PID of the backend running us
  SELECT INTO seqname 'slowfuncseq_' || pg_backend_pid();
  -- Touch the sequence once before the loop starts
  PERFORM nextval(seqname);
  RAISE NOTICE 'Start of function';
  WHILE x <= 5 LOOP
    -- Random number from 1 to 10
    SELECT 1+(random()*9)::int INTO mynumber;
    RAISE NOTICE 'Start expensive step %: time to run=%', x, mynumber;
    PERFORM pg_sleep(mynumber);
    -- Advance the sequence after each completed step
    PERFORM nextval(seqname);
    x = x + 1;
  END LOOP;
  RETURN 'End of function';
END
$BC$;
Again, it's important that the steps be: create the sequence, run the function, and then drop the sequence. While access to a sequence's values lives outside MVCC, creation of the sequence itself does not, so the sequence must be created and committed before the function runs. Here's what the whole thing will look like in psql:
\t
\o tmp.drop.sql
SELECT 'DROP SEQUENCE IF EXISTS slowfuncseq_' || pg_backend_pid() || ';';
\o tmp.create.sql
SELECT 'CREATE SEQUENCE slowfuncseq_' || pg_backend_pid() || ';';
\o
\t
\i tmp.drop.sql
\i tmp.create.sql
SELECT slowfunc();
\i tmp.drop.sql
Now you can see how far along the function is from any other process. For example, if we kick off the script above, then go into psql from another window, we can use the process id from the pg_stat_activity view to see how far along our function is:
$ select procpid, current_query from pg_stat_activity;
 procpid |                    current_query
---------+------------------------------------------------------
   10206 | SELECT slowfunc();
   10313 | select procpid, current_query from pg_stat_activity;

$ select last_value from slowfuncseq_10206;
 last_value
------------
          3
You can assign your own values and meanings to the numbers, of course: this one simply tells us that the script is on the third iteration of our sleep loop. You could use multiple sequences to convey even more information.
There are other ways besides sequences to achieve this trick: one that I've used before is to have a plperlu function open a new connection to the existing database and update a text column in a simple tracking table. Another idea is to update a small semaphore table within the function, and check the modification time of the underlying file underneath your data directory.
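For the application case mentioned earlier (create the sequence, call the function, drop the sequence), here is a minimal Python sketch. It assumes `conn` is a DB-API 2.0 connection to PostgreSQL (for example one opened with psycopg2); the helper name `run_with_progress_sequence` is made up for this illustration and is not part of any library.

```python
def run_with_progress_sequence(conn, call_sql="SELECT slowfunc()"):
    """Create a per-backend sequence, run the slow function, then drop it.

    `conn` is assumed to be a DB-API 2.0 connection to PostgreSQL;
    it is not created here.
    """
    cur = conn.cursor()

    # Name the sequence after this backend's PID, as in the psql version.
    cur.execute("SELECT pg_backend_pid()")
    seqname = "slowfuncseq_%d" % int(cur.fetchone()[0])

    # Sequence creation is transactional, so commit it before the long
    # call; otherwise other sessions cannot see the sequence.
    cur.execute("DROP SEQUENCE IF EXISTS " + seqname)
    cur.execute("CREATE SEQUENCE " + seqname)
    conn.commit()

    try:
        cur.execute(call_sql)
        result = cur.fetchone()[0]
        conn.commit()
    except Exception:
        conn.rollback()  # clear the failed transaction so the DROP can run
        raise
    finally:
        cur.execute("DROP SEQUENCE IF EXISTS " + seqname)
        conn.commit()
    return seqname, result
```

While this is running, any other session can check progress with `SELECT last_value FROM slowfuncseq_<pid>`, exactly as in the psql example.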
PostgreSQL Conference East 2010 review
I just returned from the PostgreSQL Conference East 2010. This is one of the US "regional" Postgres conferences, which usually occur once a year on both the East and West coast. This is the second year the East conference has taken place in my home town of Philadelphia.
Overall, it was a great conference. In addition to the talks, of course, there are many other important benefits to such a conference, such as the "hallway tracks", seeing old friends and clients, meeting new ones, and getting to argue about default postgresql.conf settings over lunch. I gave a 90 minute talk on "Postgres for non-Postgres people" and a lightning talk on the indispensable tail_n_mail.pl program.
This year saw the conference take place at a hotel for the first time, and this was a big improvement over the previous school campus-based conferences. Everything was in one building, there was plenty of space to hang out and chat between the talks, and everything just felt a little bit easier. The one drawback was that the rooms were not really designed to lecture to large numbers of people (e.g. no stadium seating), but this was not too much of an issue for most of the talks.
A few of the talks I attended included:
- Mine! Luckily, my talk was in the very first slot, so I was able to give it and then be done talking for the rest of the conference (with the exception of the lightning talk). My talk was "PostgreSQL for MySQL (and other database people)". A quick show of hands showed that in addition to a good number of MySQL people, we had people coming from Oracle, Microsoft SQL Server, and even Informix. I walked through the steps to take when upgrading your application from using some other database to using Postgres, pointing out some of the pain points and particular Postgres gotchas, focusing on the SQL syntax. The second half of the talk focused on the Postgres project itself, explaining how it all worked, what the "community" and "core" consists of, how companies are involved, how development is done, and the philosophy of the project.
- "PostgreSQL at myYearbook.com" by Gavin M. Roy. I've heard earlier versions of this talk before, but it was neat to see how much myyearbook.com had grown in just one year and some of the new challenges they faced. Of course, Gavin is still upset about the primary key situation and they are still doing unique indexes instead of PKs so they can do in-place reindexing for bloat removal.
- Baron Schwartz spoke about "Query Analysis with mk-query-digest". The "mk" is short for maatkit, a nice suite of tools for doing all sorts of database-related things. Granted, it's very MySQL focused at the moment, but Baron has started to port things over to Postgres, and the demo he gave was pretty impressive. I'll definitely be downloading that code and taking a look.
- Magnus Hagander gave a talk on "Secure PostgreSQL Deployment" which was a lot more interesting than I thought it would be (I knew it had Windows slides). My take-home lessons: never use the ssl mode of "prefer", and always check your Debian systems as they like to switch SSL on everything for no good reason. It's also quite fascinating to see the number of ways you can authenticate to a Postgres database.
- I attended a talk on "Inside the PostgreSQL Infrastructure" by Dave Page. A lot of it I already knew, as I'm a little involved in said infrastructure, but it was good to hear some of the future plans, including standardizing on Debian instead of FreeBSD in the future.
- Spencer Christensen's talk on "PostgreSQL Administration for System Administrators" was very well done but mostly review for me :). It was nice to see a shout out in his talk (and some others) for check_postgres.pl.
- Robert Haas gave a good talk on "The PostgreSQL Query Planner" that seemed to be very well received. The bit about the join removal tech was particularly interesting: the Postgres planner does some really, really clever things when trying to build the best possible plan for your query.
At the lunch on Saturday, Josh Drake asked if anyone else wanted to do a lightning talk, so I made a quick outline on the back of a nearby piece of paper and gave a no-slides, no-notes five minute talk on tail_n_mail.pl. It went pretty well, and I even had 30 seconds left over at the end for questions. To clarify my answer to one of those further now: tail_n_mail.pl can parse CSV logs (indeed, any text file), but it cannot consolidate similar entries yet or any of the other neat things it does until we can teach TNM about how to parse the CSV logs properly.
An excellent conference overall, but I'd be remiss if I didn't offer a little constructive criticism for the next time (and other conferences):
- Scheduling. The rooms were sometimes hard to find, and the schedule did not list the room next to the talk. That color-coded thing just does not work. In addition, it seemed like similar talks were sometimes stacked up against each other rather than staggered. Thus, you could learn about londiste OR rubyrep, but not both. Similarly, there were two Python talks up against each other.
- Lightning talks. Always, always put the lightning talks at the *start* of the conference, not the end. Lightning talks are a great way to learn about what other people are doing. By having them at the start of the conference, you have the entire rest of the time to follow up with people about their talks and foster more real-life discussions.
- Lightning talks. Okay, not done talking about these yet. Lightning talks are somewhat notorious for spending lots of time getting the video to work right, as people switch computers, fiddle with plugs, etc. If you can't get it set up in 30 seconds, start the clock! You should be able to give your lightning talk without slides, if need be.
NoSQL Live: The Dynamo Derivatives (Cassandra, Voldemort, Riak)
For me, one of the big parts of attending the NoSQL Live conference was to hear more about the differences between the various Dynamo-inspired open software projects. The Cassandra, Voldemort, and Riak projects were all represented, and while they differ in various ways at the logical layer (how one models data) and various features, they all share a similar conceptual foundation as outlined in Amazon's seminal Dynamo paper. So what differentiates these projects? What production deployments exist and what kind of stories do people have from running these systems in production?
Some of these questions can be at least partially answered by combing the interwebs, looking over the respective project sites, etc. Yet that's not quite the same thing as having community players in the same room, talking about stuff.
Of the three projects mentioned, Cassandra clearly has the "momentum" (a highly accurate indicator of future dominance). To me, this felt like the case even before Twitter started getting involved with it, but the Twitter effect was pretty evident based on the number of people sticking around for the Cassandra break-out session with Jonathan Ellis, compared to the break-out session given by Alex Feinberg for Voldemort (both of whom were very kind and thoughtful in answering my stream of irritating questions over lunch).
Regrettably, the break-out sessions were scheduled such that one had to choose between the Riak session and the Voldemort session; having already gone through the effort of setting up a small Riak cluster, manipulating data therein, etc., I felt there was more to be gained by attending the Voldemort session. Consequently, it's possible some of my take-aways from the conference are not entirely fair. Additionally, it seems strange to me that in the big room, Riak's representation was purely on the panel to discuss schema design in document-oriented databases; Riak had no representation on the panels related to scaling, operations, etc., despite that being a major focus of the project.
Most of what I learned had to do with nit-picky technical details, changes in upcoming versions, etc. Probably all of it was already documented. But, anyway, here are my takeaways on this topic, which may have been learned at the conference, or simply confirmed or reinforced by the conference. Random thoughts mixed in. Schema-less design.
- The simplicity of the pure key/value store (Voldemort and Riak are more like this) brings flexibility in what you represent; having a somewhat more structured data model with which to work (as in Cassandra) can add some complexity to how you design your data, but brings improved flexibility in how you can navigate that data.
- By digging around the web, one might get the impression that Cassandra has the broadest range of interesting deployments, Voldemort has fewer but is still interesting (LinkedIn is certainly no slouch), and Riak has nothing to point to outside Basho Technologies' non-free Enterprise variant. By attending a conference in which each project was represented, one might get exactly the same impression. Brian Fink (for Riak) spoke of usage scenarios and was obviously informed by production experience with Riak, yet no actual use case, company, site, etc. was ever mentioned (again, the break-out session may contradict this).
- The Voldemort and Cassandra project teams are clearly paying attention to each other's work, at least to some degree. There was even some informal discussion of the merkle tree design in Cassandra potentially making its way into Voldemort. Both Alex and Jonathan had intelligent things to say about Riak, as well, when I pestered them about it.
- Having Ryan King from Twitter present on the "scaling with NoSQL" panel representing Cassandra was cool, and it offered confirmation that Cassandra in particular, but probably the Dynamo model as a whole, achieves its basic purpose: machines can fail but service is maintained and state is preserved; your structured storage system can scale horizontally, can scale writes, etc. Now, all that said, I wish there had been more detail available. Furthermore, Ryan King (understandably) did not seem particularly well-versed in other production deployments (like Digg's, for instance), so the "scaling with NoSQL" Cassandra representation disproportionately focused on exactly one use case.
- A lot of good stuff is coming in Cassandra in particular. Eliminating the need for a particular row to fit in memory will make the data model more flexible, particularly in how one designs secondary indexes (in which one needs millions or potentially billions of columns, which are auto-sorted at write time by Cassandra, to effectively form an index using the column names as the indexed value and the related key as the value). The (relatively recent) support added for Hadoop map/reduce expands the use case scenarios for the database. Jonathan Ellis spoke of potentially adding native secondary index support, which would certainly be helpful.
- We're only at the beginning, here. The shared-nothing design of the Dynamo model is a great foundation on which to work. The production experience of early adopters brings valuable knowledge that is rapidly improving the various solutions (as one would expect). As patterns like the secondary index emerge, those patterns can be integrated into the main projects over time.
- With that in mind, as higher-level abstractions build up over time, it wouldn't surprise me if the space comes to a place in which people write fairly flexible queries that describe the sets they want. In which case, the risk and uncertainty one may feel in contemplating the use of these solutions will probably go down. Additionally, the "NoSQL" name will seem even sillier than it already does.
Quick Thoughts on NoSQL Live Boston Conference
I'm back home now from the Boston "NoSQL Live" conference organized by 10gen.com (the MongoDB folks). It was a good event. A lot of stuff covered, a broad range of topics. I have a fair amount to say, but need to digest, review notes, etc. In any case, thanks to 10gen and the various sponsors that made it happen.
Some quick, random thoughts:
- Picking a good table at lunch is key: we ended up sitting with four different presenters, including Jonathan Ellis for Cassandra and Alex Feinberg for Voldemort, which happen to be two of the systems I'm personally most interested in using at the moment.
- There is an undeniable latest-thing-fan(boy|girl)ism aura surrounding the "NoSQL" brand/meme/whatever, but the various presenters and leading lights in various projects appear to be reasonable and fact-based; don't let the breathless silliness of fans fool you.
- I went in feeling convinced of the desirability of non-relational datastores for specific modeling situations (graphs) and for scalability/availability/volume concerns (Dynamo and BigTable derivatives), while feeling relatively skeptical of "document datastores". I left feeling basically the same way, though decidedly less skeptical of CouchDB than I previously was.
- There is a lot of good thinking and discussion going on in the space, it's moving very fast, and the future looks bright.
More later. Try to contain your anticipation.
PostgreSQL UTF-8 Conversion
It's becoming increasingly common for me to be involved in conversion of an old version of PostgreSQL to a new one, and at the same time, from an old "SQL_ASCII" encoding (that is, undeclared, unvalidated byte soup) to UTF-8.
Common ways to do this are to run pg_dumpall and then pipe the output through iconv or recode. When your source encoding is all pure ASCII, you don't need to do even that. When it's really all Windows-1252 (a superset of Latin-1 aka ISO-8859-1) it's easy.
But often, the data is stored in various unknown encodings from several sources over the course of years, including some that's already in UTF-8. When you convert with iconv, it dies with an error at the first problem, whereas recode will let you ignore encoding problems, but that leaves you with junk in your output.
The case I'm often encountering is fairly easy, but not perfect: Lots of ASCII, some Windows-1252, and some UTF-8. Since both pure ASCII and UTF-8 can be mechanistically detected, I put together this script to do the detection. It's Perl and uses the nice IsUTF8 module to do its character encoding detection:
Pipe input to the script. It handles one line at a time. When run with any arguments (such as --test) it will swallow pure ASCII lines, write lines it thinks are valid UTF-8 to stderr, and will convert the remaining presumed Windows-1252 lines to stdout, for manual examination.
If its guesses look correct, run it again with no arguments, and it will write all 3 types of encoding to stdout, ready for input to psql in your new UTF-8 encoded database.
(Don't forget to munge your pg_dump file to remove any hardcoded declarations of "SQL_ASCII" encoding from CREATE DATABASE statements, or otherwise make sure your database actually is created with UTF-8 encoding!)
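The detection logic described above can be sketched in a few lines. To be clear, this is not the Perl script itself: it is a Python illustration of the same idea, and the function names `classify_line` and `convert_stream` are made up for the example. It assumes, as the text does, that anything that is neither pure ASCII nor valid UTF-8 is Windows-1252.

```python
def classify_line(raw):
    """Return 'ascii', 'utf8', or 'cp1252' for one line of raw bytes."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        # Assumption from the text: whatever is left is Windows-1252.
        return "cp1252"


def convert_stream(lines, out, err, test_mode=False):
    """Copy byte lines to `out`, converting presumed Windows-1252 to UTF-8.

    In test mode, pure ASCII lines are swallowed and lines that already
    look like UTF-8 go to `err`, so only the converted guesses reach `out`.
    """
    for raw in lines:
        kind = classify_line(raw)
        if kind == "cp1252":
            # The five byte values Windows-1252 leaves undefined
            # become U+FFFD rather than raising an error.
            text = raw.decode("cp1252", errors="replace")
            out.write(text.encode("utf-8"))
        elif test_mode:
            if kind == "utf8":
                err.write(raw)
        else:
            out.write(raw)

# Typical wiring, with any command-line argument enabling test mode:
#   import sys
#   convert_stream(sys.stdin.buffer, sys.stdout.buffer,
#                  sys.stderr.buffer, test_mode=len(sys.argv) > 1)
```

Working on bytes rather than decoded text matters here, since the whole point is that the input has no single declared encoding.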
Riak Install on Debian Lenny
I'm doing some comparative analysis of various distributed non-relational databases and consequently wrestled with the installation of Riak on a server running Debian Lenny.
I relied upon the standard "erlang" debian package, which installs cleanly on a basically bare system without a hitch (as one would expect). However, the latest Riak's "make" tasks fail to run; this is because the rebar script on which the make tasks rely chokes on various bad characters:
riak@nosql-01:~/riak$ make all rel
./rebar compile
./rebar:2: syntax error before: PK
./rebar:11: illegal atom
./rebar:30: illegal atom
./rebar:72: illegal atom
./rebar:76: syntax error before: ��n16
./rebar:79: syntax error before: ','
./rebar:91: illegal integer
./rebar:149: illegal atom
./rebar:160: syntax error before: Za��ze
./rebar:172: illegal atom
./rebar:176: illegal atom
escript: There were compilation errors.
make: *** [compile] Error 127
Delicious.
Eventually, I came across this article describing issues getting Riak to install on Ubuntu 9.04, and determined that the Erlang version problem it mentions seemed to apply here as well. Following the article's instructions for building Erlang from source worked out fine, and so far I've been able to start, ping, and stop the local Riak server without incident.
Since a true investigation requires running these kinds of tools in a cluster, and that means automation of the installation/configuration is desirable, I've been scripting out the configuration steps (putting things into a configuration management tool like Puppet will come later when we're farther along and closer to picking the right solution for the problem in question). So, here's the script I've been running to build these things from my local machine (relying upon SSH); these are rough, a work in progress, and are not intended as examples of excellence, elegance, or beauty -- they simply get the job done (so far) for me and may help somebody else.
#!/bin/sh

hostname=$1
erlang_release=otp_src_R13B04
riak_release=riak-0.8.1

ssh root@$hostname "
# necessary for Erlang build
apt-get install build-essential libncurses5-dev m4
apt-get install openssl libssl-dev
# standard from-source build
mkdir erlang-build
cd erlang-build
wget http://ftp.sunet.se/pub/lang/erlang/download/$erlang_release.tar.gz
tar xzf $erlang_release.tar.gz
cd $erlang_release
./configure
make
make install
# put all of riak in a riak user
useradd -m riak
su -c 'wget http://bitbucket.org/basho/riak/downloads/$riak_release.tar.gz' - riak
su -c 'tar xzf $riak_release.tar.gz' - riak
su -c 'cd $riak_release && make all rel' - riak
su -c 'mv $riak_release/rel riak' - riak
"
(I have other scripts for preparing the box post-OS-install, but I don't think they impact this particular part of the process.)
PostgreSQL EC2/EBS/RAID 0 snapshot backup
One of our clients uses Amazon Web Services to host their production application and database servers on EC2 with EBS (Elastic Block Store) storage volumes. Their main database is PostgreSQL.
A big benefit of Amazon's cloud services is that you can easily add and remove virtual server instances, storage space, etc. and pay as you go. One known problem with Amazon's EBS storage is that it is much more I/O limited than, say, a nice SAN.
To partially mitigate the I/O limitations, they're using 4 EBS volumes to back a Linux software RAID 0 block device. On top of that is the xfs filesystem. This gives roughly 4x the I/O throughput and has been effective so far.
They ship WAL files to a secondary server that serves as warm standby in case the primary server fails. That's working fine.
They also do nightly backups using pg_dumpall on the master so that there's a separate portable (SQL) backup not dependent on the server architecture. The problem that led to this article is that extra I/O caused by pg_dumpall pushes the system beyond its I/O limits. It adds both reads (from the PostgreSQL database) and writes (to the SQL output file).
There are several solutions we are considering so that we can keep both binary backups of the database and SQL backups, since both types are valuable. In this article I'm not discussing all the options or trying to decide which is best in this case. Instead, I want to consider just one of the tried and true methods of backing up the binary database files on another host to offload the I/O:
- Create an atomic snapshot of the block devices
- Spin up another virtual server
- Mount the backup volume
- Start Postgres and allow it to recover from the apparent "crash" the server had (since there wasn't a clean shutdown of the database before the snapshot)
- Do whatever pg_dump or other backups are desired
- Make throwaway copies of the snapshot for QA or other testing
The benefit of such snapshots is that you get an exact backup of the database, with whatever table bloat, indexes, statistics, etc. exactly as they are in production. That's a big difference from a freshly created database and import from pg_dump.
The difference here is that we're using 4 EBS volumes with RAID 0 striped across them, and there isn't currently a way to do an atomic snapshot of all 4 volumes at the same time. So it's no longer "atomic" and who knows what state the filesystem metadata and the file data itself would be in?
Well, why not try it anyway? Filesystem metadata doesn't change that often, especially in the controlled environment of a Postgres data volume. Snapshotting within a relatively short timeframe would be pretty close to atomic, and probably look to the software (operating system and database) like some kind of strange crash since some EBS volumes would have slightly newer writes than others. But aren't all crashes a little unpredictable? Why shouldn't the software be able to deal with that? Especially if we have Postgres make a checkpoint right before we snapshot.
I wanted to know if it was crazy or not, so I tried it on a new set of services in a separate AWS account. Here are the notes and some details of what I did:
- Created one EC2 image:
Amazon EC2 Debian 5.0 lenny AMI built by Eric Hammond
Debian AMI ID ami-4ffe1926 (x86_64)
Instance Type: High-CPU Extra Large (c1.xlarge) - 7 GB RAM, 8 CPU cores
- Created 4 x 10 GB EBS volumes
- Attached volumes to the image
- Created software RAID 0 device:
mdadm -C /dev/md0 -n 4 -l 0 -z max /dev/sdf /dev/sdg /dev/sdh /dev/sdi
- Created XFS filesystem on top of RAID 0 device:
mkfs -t xfs -L /pgdata /dev/md0
- Set up in /etc/fstab and mounted:
mkdir /pgdata
# edit /etc/fstab, with noatime
mount /pgdata
- Installed PostgreSQL 8.3
- Configured postgresql.conf to be similar to primary production database server
- Created empty new database cluster with data directory in /pgdata
- Started Postgres and imported a play database (from public domain census name data and Project Gutenberg texts), resulting in about 820 MB in data directory
- Ran some bulk inserts to grow database to around 5 GB
- Rebooted EC2 instance to confirm everything came back up correctly on its own
- Set up two concurrent data-insertion processes:
- 50 million row insert based on another local table (INSERT INTO ... SELECT ...), in a single transaction (hits disk hard, but nothing should be visible in the snapshot because the transaction won't have committed before the snapshot is taken)
- Repeated single inserts in autocommit mode (Python script writing INSERT statements using random data from /usr/share/dict/words piped into psql), to verify that new inserts made it into the snapshot, and no partial row garbage leaked through
- Started those "beater" jobs, which mostly consumed 2-3 CPU cores
- Manually inserted a known test row and created a known view that should appear in the snapshot
- Started Postgres's backup mode that allows for copying binary data files in a non-atomic manner, which also does a CHECKPOINT and thus also a filesystem sync:
SELECT pg_start_backup('raid_backup');
- Manually inserted a 2nd known test row & 2nd known test view that I don't want to appear in the snapshot after recovery
- Ran snapshot script which calls ec2-create-snapshot on each of the 4 EBS volumes -- during first run, run serially quite slowly taking about 1 minute total; during second run, run in parallel such that the snapshot point was within 1 second for all 4 volumes
- Told Postgres the backup's over:
SELECT pg_stop_backup();
- Ran script to create new EBS volumes derived from the 4 snapshots (which aren't directly usable and always go into S3), using ec2-create-volume --snapshot
- Ran script to attach new EBS volumes to devices on the new EC2 instance using ec2-attach-volume
- Then, on the new EC2 instance for doing backups:
- mdadm --assemble --scan
- mount /pgdata
- Start Postgres
- Count rows on the 2 volatile tables; confirm that the table with the in-process transaction doesn't show any new rows, and that the table getting individual rows committed to reads correctly
- VACUUM VERBOSE -- and confirm no errors or inconsistencies detected
- pg_dumpall # confirmed no errors and data looks sound
It worked! No errors or problems, and pretty straightforward to do.
Actually before doing all the above I first did a simpler trial run with no active database writes happening, and didn't make any attempt for the 4 EBS snapshots to happen simultaneously. They were actually spread out over almost a minute, and it worked fine. With the confidence that the whole thing wasn't a fool's errand, I then put together the scripts to do lots of writes during the snapshot and made the snapshots run in parallel so they'd be close to atomic.
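The parallel snapshot step could be sketched roughly like this in Python. This is an illustration, not the script used in the experiment: the function name `snapshot_volumes` and the volume IDs in the usage note are hypothetical, and it assumes the `ec2-create-snapshot` command-line tool mentioned above is on the PATH and takes the volume ID as its argument.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def snapshot_volumes(volume_ids, run=subprocess.check_output):
    """Start ec2-create-snapshot for every volume at nearly the same time.

    `run` is injectable so the function can be exercised without AWS;
    by default it shells out to the ec2-create-snapshot tool.
    """
    def snap(volume_id):
        return run(["ec2-create-snapshot", volume_id])

    # One worker per volume, so no snapshot request waits on another.
    with ThreadPoolExecutor(max_workers=len(volume_ids)) as pool:
        # pool.map returns results in the same order as volume_ids.
        return list(pool.map(snap, volume_ids))
```

It would be called between `pg_start_backup()` and `pg_stop_backup()`, for example `snapshot_volumes(["vol-11111111", "vol-22222222", "vol-33333333", "vol-44444444"])` with the four real volume IDs substituted in.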
There are lots of caveats to note here:
- This is an experiment in progress, not a how-to for the general public.
- The data set that was snapshotted was fairly small.
- Two successful runs, even with no failures, is not a very big sample set. :)
- I didn't use Postgres's point-in-time recovery (PITR) here at all -- I just started up the database and let Postgres recover from an apparent crash. Shipping over the few WAL files that the master generated between pg_start_backup() and pg_stop_backup(), once the snapshot copying is complete, would allow a theoretically fully reliable recovery to be made, not just a practically non-failing recovery as I did above.
So there's more work to be done to prove this technique viable in production for a mission-critical database, but it's a promising start worth further investigation. It shows that there is a way to back up a database across multiple EBS volumes without adding noticeably to its I/O load by utilizing the Amazon EBS data store's snapshotting and letting a separate EC2 server offload the I/O of backups or anything else we want to do with the data.
MySQL Ruby Gem CentOS RHEL 5 Installation Error Troubleshooting
Building and installing the Ruby mysql gem on freshly-installed Red Hat based systems sometimes produces the frustratingly ambiguous error below:
# gem install mysql
/usr/bin/ruby extconf.rb
checking for mysql_ssl_set()... no
checking for rb_str_set_len()... no
checking for rb_thread_start_timer()... no
checking for mysql.h... no
checking for mysql/mysql.h... no
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary libraries and/or headers. Check the mkmf.log file for more details. You may need configuration options.
Searching the web for info on this error yields two basic solutions:
- Install the mysql-devel package (this provides the mysql.h file in /usr/include/mysql/).
- Run gem install mysql -- --with-mysql-config=/usr/bin/mysql_config or some other additional options.
These are correct but not sufficient. Because this gem compiles a library to interface with MySQL's C API, the gcc and make packages are also required to create the build environment:
# yum install mysql-devel gcc make
# gem install mysql -- --with-mysql-config=/usr/bin/mysql_config
Alternatively, if you're using your distro's ruby (not a custom build like Ruby Enterprise Edition), you can install EPEL's ruby-mysql package along with their rubygem-rails and other packages.
PostgreSQL version 9.0 release date prediction
So when will PostgreSQL version 9.0 come out? I decided to "run the numbers" and take a look at how the Postgres project has done historically. Here's a quick graph showing the approximate number of days each major release since version 6.0 took:
Some interesting things can be seen here: there is a rough correlation between the complexity of a new release and the time it takes, major releases take longer, and the trend is gradually towards more days per release. Overall the project is doing great, releasing on average every 288 days since version 6. If we only look at version 7 and onwards, the releases are on average 367 days apart. If we look at *just* version 7, the average is 324 days. If we look at *just* version 8, the average is 410. Since the last major version that came out was on July 1, 2009, the numbers predict 9.0 will be released on July 3, 2010, based on the version 7 and 8 averages, and on August 15, 2010, based on just the version 8 averages. However, this upcoming version has two very major features, streaming replication (SR) and hot standby (HS). How those will affect the release schedule remains to be seen, but I suspect the 9.0 to 9.1 window will be short indeed.
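The two predicted dates are just the last release date plus the mean interval, which is easy to double check. A small sketch using only the figures quoted above (the function name `predict_release` is made up for the example):

```python
from datetime import date, timedelta

# Date of the last major release, as given in the text.
LAST_MAJOR_RELEASE = date(2009, 7, 1)


def predict_release(last_release, mean_days_between_releases):
    """Naive prediction: last release date plus the mean interval."""
    return last_release + timedelta(days=mean_days_between_releases)


# Mean interval over versions 7 and 8 combined, and over version 8 alone.
print(predict_release(LAST_MAJOR_RELEASE, 367))  # 2010-07-03
print(predict_release(LAST_MAJOR_RELEASE, 410))  # 2010-08-15
```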
As a recap, the Postgres project only bumps the first part of the version number for major changes (although many, myself included, would argue that 7.4 was such a major jump it should have been called 8.0). The second number is incremented any time a "new release" happens, and means new features and enhancements. The final number, the revision, is only incremented for security and bug fixes, and is almost always a 100% binary compatible drop-in replacement for the previous revision in the branch. (What's the mean number of days between revisions? 84 days since version 6, and 88 days since version 7. The medians are 84 and 87 respectively.)
How busy were those periods? Here's the number of commits per release period. Note that I said release period, not release, as commits are still being made to old branches, although this is a very small minority of the commits, so I did not bother to break it down at that level.
There is a strong correlation with the previous chart. Of note is version 8.1, which had few commits and was released relatively quickly. Also note that version 8.0 is still winning as far as the sheer number of commits, most likely due to the fact that native Windows support was added in that version.
Some other items of interest from the data:
- There have been roughly 140,000 commits from version 6.0 to 8.4.2.
- There have been 32 CVS committers since the start of the project (and of course, many hundreds of others whose work was funnelled through those committers)
- The mean number of commits per person is 4383, but the distribution is very skewed: Bruce, Peter, and Tom account for 80% of all commits, with the mean between them of 37,000 commits.
- Commits changed about 40 lines on average.
Alright, two final charts: commits per time period. I'll let the data speak for itself this time. Stay tuned for future blog posts exploring this data further!

Splitting Postgres pg_dump into pre and post data files
I've just released a small Perl script that has helped me solve a specific problem with Postgres dump files. When you use pg_dump or pg_dumpall, it outputs things in the following order, per database:
- schema creation commands (e.g. CREATE TABLE)
- data loading command (e.g. COPY tablename FROM STDIN)
- post-data schema commands (e.g. CREATE INDEX)
The problem is that using the --schema-only flag outputs the first and third sections into a single file. Hence, if you load the file and then load a separate --data-only dump, it can be very slow as all the constraints, indexes, and triggers are already in place. The split_postgres_dump script breaks the dump file into two segments, a "pre" and a "post". (It doesn't handle a file with a data section yet, only a --schema-only version.)
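To make the pre/post idea concrete, here is a rough sketch in Python. It is not the split_postgres_dump script itself: the function name, the list of statement types treated as "post", and the assumption that every statement ends on a line terminated by a semicolon are all simplifications made for this example. In particular, function bodies containing semicolons would need smarter parsing.

```python
import re

# Statement types treated as "post-data" in this sketch (an assumption,
# not necessarily the list the real script uses).
POST_DATA = re.compile(
    r"CREATE\s+(UNIQUE\s+)?INDEX\b"
    r"|CREATE\s+(CONSTRAINT\s+)?TRIGGER\b"
    r"|CREATE\s+RULE\b"
    r"|ALTER\s+TABLE\b.*\bADD\s+CONSTRAINT\b",
    re.IGNORECASE | re.DOTALL,
)

def split_schema_dump(sql_text):
    """Return (pre, post) lists of statements from a --schema-only dump."""
    pre, post, current = [], [], []
    for line in sql_text.splitlines():
        if not line.strip() or line.startswith("--"):
            continue  # skip blank lines and pg_dump comment lines
        current.append(line)
        if line.rstrip().endswith(";"):
            statement = "\n".join(current)
            target = post if POST_DATA.match(statement) else pre
            target.append(statement)
            current = []
    if current:  # trailing text without a terminating semicolon
        pre.append("\n".join(current))
    return pre, post

# A tiny, made-up schema-only dump.
sample = """\
CREATE TABLE foo (id integer NOT NULL, name text);
ALTER TABLE ONLY foo
    ADD CONSTRAINT foo_pkey PRIMARY KEY (id);
CREATE INDEX foo_name_idx ON foo USING btree (name);
"""
pre, post = split_schema_dump(sample)
print(len(pre), len(post))
```

Loading the "pre" statements, then the data, then the "post" statements gives the same ordering a full dump uses.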
Why would you need to do this instead of just using a full dump? Some reasons I've found include:
- When you need to load the data more than once, such as debugging a data load error.
- When you want to stop after the data load step (which you can't do with a full dump)
- When you need to make adjustments to the schema before the data is loaded (seen quite a bit on major version upgrades)
Usage is simply ./split_postgres_dump.pl yourdumpfile.pg, which will then create two new files, yourdumpfile.pg.pre and yourdumpfile.pg.post. It doesn't produce perfectly formatted files, but it gets the job done!
It's a small script, so it has no bug tracker, git repo, etc. but it does have a small wiki page at http://bucardo.org/wiki/Split_postgres_dump from which you can download the latest version.
Future versions of pg_dump will allow you to break things into pre and post data sections with flags, but until then, I hope somebody finds this script useful.
Update: There is now a git repo: git clone git://bucardo.org/split_postgres_dump.git
Gathering server information with boxinfo
I've just publicly released another Postgres-related script, this one called "boxinfo". Basically, it gathers information about a box (server), hence the catchy and original name. It outputs the information it finds into an HTML page, or into a MediaWiki formatted page.
The goal of boxinfo is to have a simple, single script that quickly gathers important information about a server into a web page, so that you can get a quick overview of what is installed on the server and how things are configured. It's also useful as a reference page when you are trying to remember which server it was that had Bucardo version 4.5.0 installed and was running pgbouncer.
As we use MediaWiki internally here at End Point (running with a Postgres backend, naturally), the original (and default) format is HTML with some MediaWiki specific items inside of it.
Because it is meant to run on as wide a range of boxes as possible, it's written in Perl. While we've run into a few boxes over the years that did not have Perl installed, the number lacking any other language you choose (except perhaps sh) is much greater. It requires no other Perl modules, and simply makes a lot of system calls.
Various information about the box is gathered. System wide things such as mount points, disk space, schedulers, packaging systems are gathered first, along with versions of many common Unix utilities. We also gather information on some programs where more than just the version number is important, such as puppet, heartbeat, and lifekeeper. Of course, we also go into a great amount of detail about all the installed Postgres clusters on the box as well.
The program tries its best to locate every active Postgres cluster on the box, and then gathers information about it, such as where pg_xlog is linked to, any contrib modules installed, any interesting configuration variables from postgresql.conf, the size of each database, and lots of detailed information about any Slony or Bucardo configurations it finds.
The main page for it is on the Bucardo wiki at http://bucardo.org/wiki/Boxinfo. That page details the various command line options and should be considered the canonical documentation for the script. The latest version of boxinfo can be downloaded from that page as well. For any enhancement requests or problems to report, please visit the bug tracker at http://bucardo.org/bugzilla/.
What exactly does the output look like? We've got an example on the wiki showing the sample output from a run against my laptop. Some of the items were removed, but it should give you an idea of what the script can do, particularly with regards to the Postgres information: http://bucardo.org/wiki/Boxinfo/Example
The script is still a little rough, so we welcome any patches, bug reports, requests, or comments. The development version can be obtained by running: git clone git://bucardo.org/boxinfo.git
Postgres Upgrades - Ten Problems and Solutions
Upgrading between major versions of Postgres is a fairly straightforward affair, but Murphy's law often gets in the way. Here at End Point we perform a lot of upgrades, and the following list explains some of the problems that come up, either during the upgrade itself, or afterwards.
When we say upgrade, we mean going from an older major version to a newer major version. We've (recently) migrated client systems as old as 7.2 to as new as 8.4. The canonical way to perform such an upgrade is to simply do:
pg_dumpall -h oldsystem > dumpfile
psql -h newsystem -f dumpfile
The reality can be a little more complicated. Here are the top ten gotchas we've come across, and their solutions. The more common and severe problems are at the top.
1. Removal of implicit casting
Postgres 8.3 removed many of the "implicit casts", meaning that many queries that used to work on previous versions now gave an error. This was a pretty severe regression, and while it is technically correct to not have them, the sudden removal of these casts has caused *lots* of problems. Basically, if you are going from any version of PostgreSQL 8.2 or lower to any version 8.3 or higher, expect to run into this problem.
Solution: The best way of course is to "fix your app", which means specifically casting items to the proper datatype, for example writing "123::int" instead of "123". However, it's not always easy to do this - not only can finding and changing all instances across your code base be a huge undertaking, but the problem also exists for some database drivers and other parts of your system that may be out of your direct control. Therefore, the other option is to add the casts back in. Peter Eisentraut posted a list of casts that restore some of the pre-8.3 behavior. Do not just apply them all, but add in the ones that you need. We've found that the first one (integer AS text) solves 99% of our clients' casting issues.
2. Encoding issues (bad data)
Older databases frequently were not careful about their encoding, and ended up using the default "no encoding" mode of SQL_ASCII. Often this was done because nobody was thinking about, or worrying about, encoding issues when the database was first being designed. Flash forward years later, and people want to move to something better than SQL_ASCII such as the now-standard UTF-8. The problem is that SQL_ASCII accepts everything without complaint, and this can cause your migration to fail as the data will not load into the new database with a different encoding. (Also note that even UTF-8 to UTF-8 may cause problems as it was not until Postgres version 8.1 that UTF-8 input was strictly validated.)
Solution: The best remedy is to clean the data on the "old" database and try the migration again. How to do this depends on the nature of the bad data. If it's just a few known rows, manual updates can be done. Otherwise, we usually write a Perl script to search for invalid characters and replace them. Alternatively, you can pipe the data through iconv in the middle of the upgrade. If all else fails, you can always fall back to SQL_ASCII on the new database, but that should really be a last resort.
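As an illustration of the "search for invalid characters and replace them" approach, here is a minimal sketch in Python rather than Perl. The function names are invented for this example, and the repair step assumes the bad bytes were originally Latin-1, which may well not hold for your data.

```python
def find_invalid_utf8(lines):
    """Yield (line_number, raw_bytes) for each line that is not valid UTF-8."""
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield number, raw

def repair_line(raw, assumed_encoding="latin-1"):
    """Re-encode a bad line as UTF-8, assuming it was in assumed_encoding."""
    return raw.decode(assumed_encoding).encode("utf-8")

# Made-up rows: the second one contains a lone Latin-1 byte (0xE9).
dump = [b"1\tcaf\xc3\xa9\n", b"2\tcaf\xe9\n", b"3\tplain ascii\n"]

bad = list(find_invalid_utf8(dump))
fixed = [repair_line(raw) for _, raw in bad]
print(bad)
print(fixed)
```

Reporting the line numbers first, and only then deciding how to repair each row, makes it easier to spot data that was written in more than one encoding.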
3. Time
Since the database is almost always an integral part of the business, minimizing the time it is unavailable for use is very important. People tend to underestimate how much time an upgrade can take. (Here we are talking about the actual migration, not the testing, which is a very important step that should not be neglected.) Creating the new database and schema objects is very fast, of course, but the data must be copied row by row, and then all the constraints and indexes created. For large databases with many indexes, the index creation step can take longer than the data import!
Solution: The first step is to do a practice run with as similar hardware as possible to get an idea of how long it will take. If this time period does not comfortably fit within your downtime window (and by comfortable, I mean add 50% to account for Murphy), then another solution is needed. The easiest way is to use a replication system like Bucardo to "pre-populate" the static part of the database, and then the final migration only involves a small percentage of your database. It should also be noted that recent versions of Postgres can speed things up by using the "-j" flag to the pg_restore utility, which allows some of the restore to be done in parallel.
4. Dependencies
When you upgrade Postgres, you're upgrading the libraries as well, which many other programs (e.g. database drivers) depend on. Therefore, it's important to make sure everything else relying on those libraries still works. If you are installing Postgres with a packaging system, this is usually not a problem as the dependencies are taken care of for you.
Solution: Make sure your test box has all the applications, drivers, cron scripts, etc. that your production box has and make sure that each of them either works with the new version, or has a sane upgrade plan. Note: Postgres may have some hidden indirect dependencies as well. For example, if you are using Pl/PerlU, make sure that any external modules used by your functions are installed on the box.
5. Postgres contrib modules
Going from one version of Postgres to another can introduce some serious challenges when it comes to contrib modules. Unfortunately, they are not treated with the same level of care as the Postgres core is. To be fair, most of them will continue to just work, simply by doing a "make install" on the new database before attempting to import. Some modules, however, have functions that no longer exist. Some are not 100% forward compatible, and some even lack important pieces such as uninstall scripts.
Solution: Solving this depends quite a bit on the exact nature of the problem. We've done everything from carefully modifying the --schema-only output, to modifying the underlying C code and recompiling the modules, to removing them entirely and getting the functionality in other ways.
6. Invalid constraints (bad data)
Sometimes when upgrading, we find that the existing constraints are not letting the existing data back in! This can happen for a number of reasons, but basically it means that you have invalid data. This can be mundane (a check constraint is missing a potential value) or more serious (multiple primary keys with the same value).
Solution: The best bet is to fix the underlying problem on the old database. Sometimes this is a few rows, but sometimes (as in a case with multiple identical primary keys), it indicates an underlying hardware problem (e.g. RAM). In the latter case, the damage can be very widespread, and your simple upgrade plan has now turned into a major damage control exercise (but aren't you glad you found such a problem now rather than later?) Detecting and preventing such problems is the topic for another day. :)
7. tsearch2
This is a special case for the contrib module situation mentioned above. The tsearch2 module first appeared in version 7.4, and was moved into core of Postgres in version 8.3. While there was a good attempt at providing an upgrade path, upgrades can still cause an occasional issue.
Solution: Sometimes the only real solution is to edit the pg_dump output by hand. If you are not using tsearch2 in that many places (e.g. just a few indexes or columns on a couple of tables), you can also simply remove it before the upgrade, then add it back in afterwards.
8. Application behavior
In addition to the implicit casting issues above, applications sometimes have bad behaviors that were tolerated in older versions of Postgres, but now are not. A typical example is writing queries without explicitly naming all of the tables in the "FROM" section.
Solution: As always, fixing the app is the best solution. However, for some things you can also flip a compatibility switch inside of postgresql.conf. In the example above, one would change the "add_missing_from" from its default of 'off' to 'on'. This should be considered an option of last resort, however.
9. System catalogs
Seldom does a major update go by that doesn't see a change in the system catalogs, the low-level meta-data tables used by Postgres to describe everything in the database. Sometimes programs rely on the catalogs looking a certain way.
Solution: Most programs, if they use the system catalogs directly, are careful about it, and upgrading the program version often solves the problem. At other times, we've had to rewrite the program right then and there, either by having it abstract out the information (for example, by using the information_schema views), or (less preferred) by adding conditionals to the code to handle multiple versions of the system catalogs.
10. Embedded data
This is a rare but annoying problem: triggers on a table rely on certain data being in other tables, such that doing a --schema-only dump before a --data-only dump will always fail when importing.
Solution: The easiest way is to simply use pg_dumpall, whose output loads the schema, then the data, then the constraints and indexes. However, this may not be possible if you have to separate things for other reasons (such as contrib module issues). In this case, you can break the --schema-only pg_dump output into pre and post segments. We have a script that does this for us, but it is also slated to be an option for pg_dump in the future.
That's the list! If you've seen other things, please make a note in the comments. Don't forget to run a database-wide ANALYZE after importing into your new database, as the table statistics are not carried across when using pg_dump.
Postgres SQL Backup Gzip Shrinkage, aka DON'T PANIC!!!
I was investigating a recent Postgres server issue, where we had discovered that one of the RAM modules on the server in question had gone bad. Unsurprisingly, one of the things we looked at was the possibility of having to restore from a SQL dump: if the data directory had been corrupted, a base backup would have been subject to the same corruption we were trying to avoid by restoring.
As it was already the middle of the night (anyone have a server emergency during the normal business hours?), my investigations were hampered by my lack of sleep.
If there had been some data directory corruption, the pg_dump process would likely fail earlier than in the backup process, and we'd expect the dumps to be truncated; ideally this wasn't the case, as memory testing had not shown the DIMM to be bad, but the sensor had alerted us as well.
I logged into the backup server and looked at the backup dumps; from the alerts that we'd gotten, the memory was flagged bad on January 3. I listed the files, and noticed the following oddity:
-rw-r--r-- 1 postgres postgres 2379274138 Jan 1 04:33 backup-Jan-01.sql.gz
-rw-r--r-- 1 postgres postgres 1957858685 Jan 2 09:33 backup-Jan-02.sql.gz
Well, this was disconcerting. The memory event had taken place on the 3rd, but there was a large drop in size of the dumps between January 1st and January 2nd (more than 400MB of *compressed* output, for those of you playing along at home). This indicated that either the memory event took place earlier than recorded, or something somewhat catastrophic had happened to the database; perhaps some large deletion or truncation of some key tables.
Racking my brains, I tried to come up with an explanation: we'd had a recent maintenance window that took place between January 1 and January 2; we'd scheduled a CLUSTER/REINDEX to reclaim some of the bloat which was in the database itself. But this would only reduce the size of the data directory; the amount of live data would have stayed the same or with a modest increase.
Obviously we needed to compare the two files in order to determine what had changed between the two days. I tried:
diff <(zcat backup-Jan-01.sql.gz | head -2300) <(zcat backup-Jan-02.sql.gz | head -2300)
Based on my earlier testing, this was the offset in the SQL dumps which defined the actual schema for the database excluding the data; in particular I was interested to see if there had been (say) any temporarily created tables which had been dropped during the maintenance window. However, this showed only minor changes (updates to default sequence values). It was time to do a full diff of the data to try and see if some of the aforementioned temporary tables had been truncated or if some catastrophic deletion had occurred or...you get the idea. I tried:
diff <(zcat backup-Jan-01.sql.gz) <(zcat backup-Jan-02.sql.gz)
However, this approach fell down when diff ran out of memory. We decided to unzip the files and manually diff the two files in case it had something to do with the parallel unzips, and here was a mystery; after unzipping the dumps in question, we saw the following:
-rw-r--r-- 1 root root 10200609877 Jan 8 02:19 backup-Jan-01.sql
-rw-r--r-- 1 root root 10202928838 Jan 8 02:24 backup-Jan-02.sql
The uncompressed versions of these files showed sizes consistent with slow growth; the Jan 02 backup was slightly larger than the Jan 01 backup. This was really weird! Was there some threshold in gzip where given a particular size file it switched to a different compression algorithm? Had someone tweaked the backup script to gzip with a different compression level? Had I just gone delusional from lack of sleep? Since gzip can operate on streams, the first option seemed unlikely, and something I would have heard about before. I verified that the arguments to gzip in the backup job had not changed, so that took that choice off the table. Which left the last option, but I had the terminal scrollback history to back me up.
We finished the rest of our work that night, but the gzip oddity stuck with me through the next day. I was relating the oddity of it all to a co-worker, when insight struck: since we'd CLUSTERed the table, that meant that similar data (in the form of the tables' multi-part primary keys) had been reorganized to be on the same database pages, so when pg_dump read/wrote out the data in page order, gzip had that much more similarity in the same neighborhood to work with, which resulted in the dramatic decrease in the compressed gzip dumps.
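The effect is easy to reproduce on a small scale. The sketch below uses Python's zlib module (the same DEFLATE compression that gzip uses) on invented rows with a two-part key; the row layout and sizes are assumptions for illustration only, not the actual data from this incident.

```python
import random
import zlib

# Hypothetical rows with a two-part key, loosely modelled on a table with a
# multi-part primary key; the data is invented for this illustration.
rows = [f"{customer:06d}\t{order:06d}\tstatus-{order % 5}\n"
        for customer in range(200) for order in range(50)]

# The same rows in random physical order (fixed seed for repeatability).
shuffled = rows[:]
random.Random(42).shuffle(shuffled)

def compressed_size(lines):
    """Size in bytes of the DEFLATE-compressed concatenation of lines."""
    return len(zlib.compress("".join(lines).encode("ascii")))

unclustered_size = compressed_size(shuffled)
clustered_size = compressed_size(rows)  # key order, as after CLUSTER

print(unclustered_size, clustered_size)
```

Both inputs contain exactly the same rows, so the uncompressed sizes are identical; only the ordering, and therefore the compressed size, differs.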
So the good news was that CLUSTER will save you space in your SQL dumps as well (if you're compressing); the bad news was that it took an emergency situation and a near heart attack for this engineer to figure it all out. Hope I've saved you the trouble... :-)
State of the Postgres project
It's been interesting watching the MySQL drama unfold, but I have to take issue when people start trying to drag Postgres into it again by spreading FUD (Fear, Uncertainty, and Doubt). Rather than simply rebut the FUD, I thought this was a good opportunity to examine the strength of the Postgres project.
Monty recently espoused the following in a blog comment: "...This case is about ensuring that Oracle doesn't gain money and market share by killing an Open Source competitor. Today MySQL, tomorrow PostgreSQL. Yes, PostgreSQL can also be killed; To prove the case, think what would happen if someone managed to ensure that the top 20 core PostgreSQL developers could not develop PostgreSQL anymore or if each of these developers would fork their own PostgreSQL project."
Later on in his blog he raises the same theme again with a slight bit more detail:
"Note that not even PostgreSQL is safe from this threat! For example, Oracle could buy some companies developing PostgreSQL and target the core developers. Without the core developers working actively on PostgreSQL, the PostgreSQL project will be weakened tremendously and it could even die as a result."
Is this a valid concern? It's easy enough to overlook it considering the Sturm und Drang in Monty's recent posts, but I think this is something worth seriously looking into. Specifically, is the Postgres project capable of withstanding a direct threat from a large company with deep pockets (e.g. Oracle)?
To get to the answer, let's run some numbers first. Monty mentions the "top 20" Postgres developers. If we look at the community contributors page, we see that there are in fact 25 major developers listed, as well as 7 core members, so 20 would indeed be a significant chunk of that page. To dig deeper, I looked at the cvs logs for the year of 2009 for the Postgres project, and ran some scripts against them. The 9185 commits were spread across 16 different people, and about 16 other people were mentioned in the commit notes as having contributed in some way (e.g. a patch from a non-committer). So again, it looks like Monty's number of 20 is a pretty good approximation.
However (and you knew there was a however), the catch comes from being able to actually stop 20 of those people from working on Postgres. There are basically two ways to do this: Oracle could buy out a company, or they could hire (buy out) a person. The first problem is that the Postgres community is very widely distributed. If you look at the people on the community contributors page, you'll see that the 32 people work for 24 different companies. Further, no one company holds sway: the median is one developer per company, and the high water mark is a mere three developers. All of this is much better than it was years ago, in the total number and in the distribution.
The next fly in the ointment is that buying out a company is not always easy to do, despite the size of your pockets. Many companies on that list are privately held and will not sell. Even if you did buy out the company, there is no way to prevent the people working there from then moving to a different company. Finally, buying out some companies just isn't possible, even if you are Oracle, because there are some big names on the list of people employing major Postgres developers: Google, Red Hat, Skype, and SRA. Then of course there is NTT, which is a really, really big company (larger than Oracle). NTT's Postgres developers are not always as visible as some of the English-speaking ones, but NTT employs a lot of people to work on Postgres (which is extremely popular in Japan).
The second way is hiring people directly. However, people cannot always be bought off. Sure, some of the developers might choose to leave if Oracle offered them $20 million, but not all of them (Larry, I might go for $19 million, call me :). Even if they did leave, the depth of the Postgres community should not be underestimated. For every "major developer" on that page, there are many others who read the lists, know the code well, but just haven't, for one reason or another, made it on to that list. At a rough guess, I'd say that there are a couple hundred people in the world who would be able to make commits to the Postgres source code. Would all of them be as fast or effective as some of the existing people? Perhaps not, but the point is that it would be nigh impossible to thin the pool fast enough to make a dent.
The project's email lists are as strong as ever, to such a point that I find it hard to keep up with the traffic, a problem I did not have a few years ago. The number of conferences and people attending each is growing rapidly, and there is a great demand for people with Postgres skills. The number of projects using Postgres, or offering it as an alternative database backend, is constantly growing. It's no longer difficult to find a hosting provider that offers Postgres in addition to MySQL. Most important of all, the project continues to regularly release stable new versions. Version 8.5 will probably be released in 2010.
In conclusion, the Postgres project is in great shape, due to the depth and breadth of the community (and the depth and breadth of the developer subset). There is no danger of Postgres going the MySQL route; the PG developers are spread across a number of businesses, the code (and documentation!) is BSD, and no one firm holds sway in the project.
Monitoring Postgres log files with tail_n_mail
We've just publicly released a useful script named tail_n_mail that keeps an eye on your Postgres log files and mails interesting lines to one or more addresses. It's released under a BSD license and is available at:
http://bucardo.org/wiki/Tail_n_mail
Complete documentation is available at the above URL, but here's a quick overview. First, it figures out the current log file (it actually works for any file, but it's primarily aimed at Postgres log files). Then, it finds any lines that match the INCLUDE lines in the config file, and finally removes any that match the EXCLUDE lines in the config file. It summarizes the results and sends a report to one or more email addresses.
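The include-then-exclude step can be sketched in a few lines. This is an illustration in Python, not the tail_n_mail source: the function name is made up, and treating each INCLUDE and EXCLUDE entry as a regular expression is an assumption.

```python
import re

def filter_lines(lines, include, exclude):
    """Keep lines matching any include pattern and no exclude pattern."""
    inc = [re.compile(pattern) for pattern in include]
    exc = [re.compile(pattern) for pattern in exclude]
    return [line for line in lines
            if any(rx.search(line) for rx in inc)
            and not any(rx.search(line) for rx in exc)]

# Made-up log lines for the example.
log_lines = [
    'FATAL: database "foo" does not exist',
    'FATAL: password authentication failed for user "root"',
    'ERROR: syntax error at or near "selct"',
    'PANIC: could not locate a valid checkpoint record',
]

matches = filter_lines(
    log_lines,
    include=["FATAL:", "PANIC:"],
    exclude=['database ".+" does not exist'],
)
for line in matches:
    print(line)
```

Applying the excludes after the includes means a known, uninteresting FATAL message can be silenced without hiding every other FATAL line.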
To use, just specify a configuration file as the first argument. Typically, the script is run from cron, either for instant reports (e.g. FATAL or PANIC errors), or for daily reports (e.g. all interesting ERRORs in the last 24 hours).
Here's what a typical config file looks like. In this example, we'll look for any FATAL or PANIC notices from Postgres, while ignoring a few known errors that we don't care about.
## Config file for the tail_n_mail.pl program
## This file is automatically updated
EMAIL: greg@endpoint.com, postgres@endpoint.com
FILE: /var/log/pg_log/postgres-%Y-%m-%d.log
INCLUDE: FATAL:
INCLUDE: PANIC:
EXCLUDE: database ".+" does not exist
EXCLUDE: database "template0" is not currently accepting connections
MAILSUBJECT: HOST Postgres fatal errors (FILE)
It should be set up to run often from cron:
*/5 * * * * perl bin/tail_n_mail.pl bin/tnm/tnm.fatals.config
The resulting mail message will look like this:
Matches from /var/log/pg_log/postgres-2010-01-01.log: 42
Date: Fri Jan 1 10:34:00 2010
Host: pollo
[1] Between lines 123005 and 147976, occurs 39 times.
First: Jan 1 00:00:01 rojogrande postgres[4306]
Last: Jan 1 10:30:00 rojogrande postgres[16854]
Statement: user=root,db=rojogrande FATAL: password authentication failed for user "root"
[2] Between lines 147999 and 148213, occurs 2 times.
First: Jan 1 10:31:01 rojogrande postgres[3561]
Last: Jan 1 10:31:10 rojogrande postgres[15312]
Statement: FATAL main: write to worker pipe failed -(9) Bad file descriptor
[3] (from line 152341)
PANIC: could not locate a valid checkpoint record
There may be false positives, but it's not designed to be a complete log parser. There are some other command line flags and options for the config file: see the documentation for the full list. This script has been watching over a number of production systems for a while now, but improvements, ideas, and patches are always welcome. It's tracked via git; you can clone it by running:
git clone git://bucardo.org/tail_n_mail.git
Bugs and feature requests can be filed and tracked at:
MySQL and Postgres command equivalents (mysql vs psql)
Users toggling between MySQL and Postgres are often confused by the equivalent commands to accomplish basic tasks. Here's a chart listing some of the differences between the command line client for MySQL (simply called mysql), and the command line client for Postgres (called psql).
| MySQL (using mysql) | Postgres (using psql) | Notes |
|---|---|---|
| \c Clears the buffer | \r (same) | |
| \d string Changes the delimiter | No equivalent | |
| \e Edit the buffer with external editor | \e (same) | Postgres also allows \e filename which will become the new buffer |
| \g Send current query to the server | \g (same) | |
| \h Gives help - general or specific | \h (same) | |
| \n Turns the pager off | \pset pager off (same) | The pager is only used when needed based on number of rows; to force it on, use \pset pager always |
| \p Print the current buffer | \p (same) | |
| \q Quit the client | \q (same) | |
| \r [dbname] [dbhost] Reconnect to server | \c [dbname] [dbuser] (same) | |
| \s Status of server | No equivalent | Some of the same info is available from the pg_settings table |
| \t Stop teeing output to file | No equivalent | However, \o (without any argument) will stop writing to a previously opened outfile |
| \u dbname Use a different database | \c dbname (same) | |
| \w Do not show warnings | No equivalent | Postgres always shows warnings by default |
| \C charset Change the charset | \encoding encoding Change the encoding | Run \encoding with no argument to view the current one |
| \G Display results vertically (one column per line) | \x (same) | Note that \G is a one-time effect, while \x is a toggle from one mode to another. To get the exact same effect as \G in Postgres, use \x\g\x |
| \P pagername Change the current pager program | Environment variable PAGER or PSQL_PAGER | |
| \R string Change the prompt | \set PROMPT1 string (same) | Note that the Postgres prompt cannot be reset by omitting an argument. A good prompt to use is: \set PROMPT1 '%n@%`hostname`:%>%R%#%x%x%x ' |
| \T filename Sets the tee output file | No direct equivalent | Postgres can output to a pipe, so you can do: \o \| tee filename |
| \W Show warnings | No equivalent | Postgres always shows warnings by default |
| \? Help for internal commands | \? (same) | |
| \# Rebuild tab-completion hash | No equivalent | Not needed, as tab-completion in Postgres is always done dynamically |
| \! command Execute a shell command | \! command (same) | If no command is given with Postgres, the user is dropped to a new shell (exit to return to psql) |
| \. filename Include a file as if it were typed in | \i filename (same) | |
| Timing is always on | \timing Toggles timing on and off | |
| No equivalent | \t Toggles 'tuple only' mode | This shows the data from select queries, with no headers or footers |
| show tables; List all tables | \dt (same) | Many also use just \d, which lists tables, views, and sequences |
| desc tablename; Display information about the given table | \d tablename (same) | |
| show index from tablename; Display indexes on the given table | \d tablename (same) | The bottom of the \d tablename output always shows indexes, as well as triggers, rules, and constraints |
| show triggers from tablename; Display triggers on the given table | \d tablename (same) | See notes on show index above |
| show databases; List all databases | \l (same) | |
| No equivalent | \dn List all schemas | MySQL does not have the concept of schemas, but uses databases as a similar concept |
| select version(); Show backend server version | select version(); (same) | |
| select now(); Show current time | select now(); (same) | Postgres will give fractional seconds in the output |
| select current_user; Show the current user | select current_user; (same) | |
| select database(); Show the current database | select current_database(); (same) | |
| show create table tablename; Output a CREATE TABLE statement for the given table | No equivalent | The closest you can get with Postgres is to use pg_dump --schema-only -t tablename |
| show engines; List all server engines | No equivalent | Postgres does not use separate engines |
| CREATE object ... Create an object: database, table, etc. | CREATE object ... Mostly the same | Most CREATE commands are similar or identical. Look up specific help on commands (for example: \h CREATE TABLE) |
If there are any commands not listed you would like to see, or if there are errors in the above, please let me know. There are differences in how you invoke mysql and psql, and in the flags that they use, but that's a topic for another day.
Updates: Added PSQL_PAGER and \o |tee filename, thanks to the Davids in the comments section. Added \t back in, per Joe's comment.
Verifying Postgres tarballs with PGP
If you are downloading the Postgres source code tarballs from a mirror, how can you tell if these are the same tarballs that were created by the packagers? You can't really - although they come with an MD5 checksum file, these files are packaged right alongside the tarballs themselves, so it would be easy enough for someone to create an evil tarball along with a new MD5 file. All you could do is perhaps check if the tarball that came from mirror A has a matching checksum file from mirror B, or even the main repository itself.
One way around this is to use PGP (which almost always means GnuPG in the open-source software world) to digitally sign the tarballs. Until the Postgres project gets an official key and starts doing this, one workaround is to at least know the checksums from one single point in time. To that end, I've been digitally signing messages containing the checksums for the tarballs for many years now and posting them to pgsql-announce. You'll need a copy of my public key (0x14964AC8, fingerprint 2529 DF6A B8F7 9407 E944 45B4 BC9B 9067 1496 4AC8) to verify the messages. A copy of the latest announcement message is below.
Note that I've also added a sha1sum for each tarball, as a precaution against relying on a single MD5 checksum (sha1sum does a SHA-1 checksum, naturally). Also note that rather than signing each tarball, I've simply signed a message containing the checksums for each one.
While this is far from a fool-proof system, it's much, much better than the existing system, and provides a way for changed tarballs to be detected. If anyone ever finds a mismatch, please let me know (or better yet, email pgsql-general@postgresql.org).
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160
Source code MD5 and SHA1 checksums for PostgreSQL versions
8.4.2, 8.3.9, 8.2.15, 8.1.19, 8.0.23, and 7.4.27
For instructions on how to use this file to verify Postgres tarballs, please see:
http://www.gtsm.com/postgres_sigs.html
## Created with md5sum:
1bc9cdc76c6a2a13bd7fdc0f3f53667f postgresql-8.4.2.tar.gz
d738227e2f1f742d2f2d4ab56496c5c6 postgresql-8.4.2.tar.bz2
4f176a4e7c0a9f8a7673bec99d1905a0 postgresql-8.3.9.tar.gz
e120b001354851b5df26cbee8c2786d5 postgresql-8.3.9.tar.bz2
a9d97def309c93998f4ff3e360f3f226 postgresql-8.2.15.tar.gz
e6f2274613ad42fe82f4267183ff174a postgresql-8.2.15.tar.bz2
335d8c42bd6e7522bb310d19d1f9a91b postgresql-8.1.19.tar.gz
ba84995e1e2d53b0d750b75adfaeede3 postgresql-8.1.19.tar.bz2
eb35f66d1c49d87c27f2ab79f0cebf8e postgresql-8.0.23.tar.gz
1c6fac4265e71b4f314a827ca5f58f6a postgresql-8.0.23.tar.bz2
77d09f4806bd913820f82abc27aca70e postgresql-7.4.27.tar.gz
1fd1d2702303f9b29b5dba1ec4e1aade postgresql-7.4.27.tar.bz2
## Created with sha1sum:
563caa3da16ca84608e5ff9c487753f3bd127883 postgresql-8.4.2.tar.gz
a617698ef3b41a74fe2c4af346172eb03e7f8a7f postgresql-8.4.2.tar.bz2
6ee1e384bdd37150ce6fafa309a3516ec3bbef02 postgresql-8.3.9.tar.gz
5403f13bb14fe568e2b46a3350d6e28808d93a2c postgresql-8.3.9.tar.bz2
bd803d74bf9aeac756cb69ae6c1c261046d90772 postgresql-8.2.15.tar.gz
4de199b3223dba2164a9e56d998f6deb708f0f74 postgresql-8.2.15.tar.bz2
233a365985a5a636a97f9d1ab4e777418937caed postgresql-8.1.19.tar.gz
f1667a64e92a365ae3d46903382648bdc0daa1ba postgresql-8.1.19.tar.bz2
7783dc54638e044cff3c339d9fd960a9b65a31df postgresql-8.0.23.tar.gz
a2c37eb802a4d67bc2508f72035dae6fb29494df postgresql-8.0.23.tar.bz2
405909d755aa907fc176d22d1b51d6b5704eb3b4 postgresql-7.4.27.tar.gz
bb35cc844157b8a0d0b2e9e1ab25b6597c82dd1c postgresql-7.4.27.tar.bz2
- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200912151528
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----
iEYEAREDAAYFAksoDPgACgkQvJuQZxSWSsikVQCgiE34ycdexL9lwSfZ+TLTZh5m
G3AAnRkazEu/uHLJCNvDZe2cmqCrCkem
=HjAS
-----END PGP SIGNATURE-----
Editing large files in place
Running out of disk space seems to be an all too common problem lately, especially when dealing with large databases. One situation that came up recently was a client who needed to import a large Postgres dump file into a new database. Unfortunately, they were very low on disk space and the file needed to be modified. Without going into all the reasons, we needed the databases to use template1 as the template database, and not template0. This was a very large, multi-gigabyte file, and the amount of space left on the disk was measured in megabytes. It would have taken too long to copy the file somewhere else to edit it, so I did a low-level edit using the Unix utility dd. The rest of this post gives the details.
To demonstrate the problem and the solution, we'll need a disk partition that has little-to-no free space available. In Linux, it's easy enough to create such a thing by using a RAM disk. Most Linux distributions already have these ready to go. We'll check it out with:
$ ls -l /dev/ram* brw-rw---- 1 root disk 1, 0 2009-12-14 13:04 /dev/ram0 brw-rw---- 1 root disk 1, 1 2009-12-14 22:27 /dev/ram1
From the above, we see that there are some RAM disks available (there are actually 16 of them available on my box, but I only showed two). Here are the steps to create a usable partition from /dev/ram1, and to then check the size:
$ mkdir /home/greg/ramtest
$ sudo mke2fs /dev/ram1
mke2fs 1.41.4 (27-Jan-2009)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
4096 inodes, 16384 blocks
819 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=16777216
2 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
8193
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
$ sudo mount /dev/ram1 /home/greg/ramtest
$ sudo chown greg:greg /home/greg/ramtest
$ df -h /dev/ram1
Filesystem Size Used Avail Use% Mounted on
/dev/ram1 16M 140K 15M 1% /home/greg/ramtest
First we created a new directory to serve as the mount point, then we used the mke2fs utility to create a new file system (ext2) on the RAM disk at /dev/ram1. It's a fairly verbose program by default, but there is nothing in the output that's really important for this example. Then we mounted our new filesystem to the directory we just created. Finally, we reset the permissions on the directory such that an ordinary user (e.g. 'greg') can read and write to it. At this point, we've got a directory/filesystem that is just under 16 MB large (we could have made it much closer to 16 MB by specifying a -m 0 to mke2fs, but the actual size doesn't matter).
To simulate what happened, let's create a database dump and then bloat it until it takes up all available space:
$ cd /home/greg/ramtest
$ pg_dumpall > data.20091215.pg
$ ls -l data.20091215.pg
-rw-r--r-- 1 greg greg 3685 2009-12-15 10:42 data.20091215.pg
$ dd seek=3685 if=/dev/zero of=data.20091215.pg bs=1024 count=99999
dd: writing 'data.20091215.pg': No space left on device
13897+0 records in
13896+0 records out
14229504 bytes (14 MB) copied, 0.0814188 s, 175 MB/s
$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/ram1 16M 15M 0 100% /home/greg/ramtest
First we created the dump, then we found the size of it, and told dd via the 'seek' argument to start writing past the existing data. (Strictly speaking, dd counts 'seek' in blocks of bs bytes rather than in bytes, so with bs=1024 the write begins 3685 blocks into the file and leaves a sparse gap after the dump. That makes no difference to this demonstration, as the dump itself is untouched.) We used the special file /dev/zero as the 'if' (input file), and our existing dump as the 'of' (output file). Finally, we told it to chunk the writes into 1024 bytes at a time, and to attempt to add 99,999 of those chunks. Since this is approximately 100MB, we ran out of disk space quickly, as we intended. The filesystem is now at 100% usage, and will refuse any further writes to it.
To recap, we need to replace the first three instances of template0 with template1. Let's use grep to view the lines:
$ grep --text --max-count=3 template data.20091215.pg
CREATE DATABASE greg WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
We need the --text argument here because grep correctly surmises that we've changed the file from text-based to binary with the addition of all those zeroes on the end. We also used the handy --max-count argument to stop processing once we've found the lines we want. Very handy argument when the actual file is gigabytes in size!
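The same "treat it as bytes and stop early" idea can be sketched in Python. This is only an illustration of what the two grep options buy us, not a replacement for grep; the function name first_matches is hypothetical.

```python
def first_matches(path, needle, max_count):
    """Return up to max_count lines containing needle, then stop reading."""
    found = []
    # Binary mode plays the role of --text: zero bytes in the file are harmless
    with open(path, "rb") as fh:
        for raw in fh:  # lines are read lazily, one at a time
            if needle in raw:
                found.append(raw.rstrip(b"\n").decode("utf-8", errors="replace"))
                if len(found) >= max_count:
                    break  # like --max-count: never touch the rest of the file
    return found
```

Because the loop breaks after the third hit, the multi-gigabyte remainder of a real dump (or the run of zeroes in this example) is never read.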
There are two major problems with using a normal text editor to change the file. First, the file (in the real situation, not this example!) was very, very large. We only needed to edit something at the very top of the file, so loading the entire thing into an editor is very inefficient. Second, editors need to save their changes somewhere, and there just was not enough room to do so.
Attempting to edit with emacs gives us: emacs: IO error writing /home/greg/ramtest/data.20091215.pg: No space left on device
An attempt with vi gives us: vi: Write error in swap file on startup. "data.20091215.pg" E514: write error (file system full?)
Although emacs gives the better error message (why is vim making a guess and outputting some weird E514 error?), the advantage always goes to vi in cases like this as emacs has a major bug in that it cannot even open very large files.
What about something more low-level like sed? Unfortunately, while sed is more efficient than emacs or vim, it still needs to read the old file and write the new one. We can't do that writing as we have no disk space! More importantly, in sed there is no way (that I could find anyway) to tell it to stop processing after a certain number of matches.
What we need is something *really* low-level. The utility dd comes to the rescue again. We can use dd to truly edit the file in place. Basically, we're going to overwrite some of the bytes on disk, without needing to change anything else. First though, we have to figure out exactly which bytes to change. The grep program has a nice option called --byte-offset that can help us out:
$ grep --text --byte-offset --max-count=3 template data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
This tells us the offset for each line, but we want to replace the number '0' in 'template0' with the number '1'. Rather than count it out manually, let's just use another Unix utility, hexdump, to help us find the number:
$ grep --text --byte-offset --max-count=3 template data.20091215.pg | hexdump -C
00000000 33 30 31 3a 43 52 45 41 54 45 20 44 41 54 41 42 |301:CREATE DATAB|
00000010 41 53 45 20 67 72 65 67 20 57 49 54 48 20 54 45 |ASE greg WITH TE|
00000020 4d 50 4c 41 54 45 20 3d 20 74 65 6d 70 6c 61 74 |MPLATE = templat|
00000030 65 30 20 4f 57 4e 45 52 20 3d 20 67 72 65 67 20 |e0 OWNER = greg |
00000040 45 4e 43 4f 44 49 4e 47 20 3d 20 27 55 54 46 38 |ENCODING = 'UTF8|
...
Each hexdump line shows 16 characters, so the first three lines come to 48 characters, then we add two for the 'e0', subtract four for the '301:', and get 301+48+2-4=347. We subtract one more as we want to seek to the point just before that character, and we can now use our dd command:
$ echo 1 | dd of=data.20091215.pg seek=346 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.00012425 s, 8.0 kB/s
Instead of an input file (the 'if' argument), we simply pass the number '1' via stdin to the dd command. We use our calculated seek, tell it to copy a single byte (bs=1), one time (count=1), and (this is very important!) tell dd NOT to truncate the file when it is done (conv=notrunc). Technically, we are sending two characters to the dd program, the number one and a newline, but the bs=1 argument ensures only the first character is being copied. We can now verify that the change was made as we expected:
$ grep --text --byte-offset --max-count=3 TEMPLATE data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
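The seek value can also be double-checked without counting hexdump columns. The sketch below redoes the arithmetic in Python from the original (pre-edit) grep --byte-offset output: the offset of the line, plus the position within the line of the final character of 'template0'. It assumes the lines are plain ASCII, so one character is one byte; the helper name digit_offset is mine.

```python
# (byte offset of the line, line text) as reported by grep --byte-offset
GREP_OUTPUT = [
    (301, "CREATE DATABASE greg WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';"),
    (380, "CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';"),
    (459, "CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';"),
]


def digit_offset(line_offset, line, token="template0"):
    """Absolute byte offset of the last character of token (the '0')."""
    # index() is case-sensitive, so the upper-case TEMPLATE keyword is skipped
    return line_offset + line.index(token) + len(token) - 1


offsets = [digit_offset(off, line) for off, line in GREP_OUTPUT]
print(offsets)  # [346, 425, 505]
```

The first value matches the seek=346 used above. The third line lands one byte later within its line than the first two because the database name 'sales' is one character longer.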
Now for the other two entries. From before, the magic number is 45, so we now add 380 to 45 to get 425. For the third line, the name of the database is 1 character longer so we add 459+45+1 = 505:
$ echo 1 | dd of=data.20091215.pg seek=425 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000109234 s, 9.2 kB/s
$ echo 1 | dd of=data.20091215.pg seek=505 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000109932 s, 9.1 kB/s
$ grep --text --byte-offset --max-count=3 TEMPLATE data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
Success! On the real system, the database was loaded with no errors, and the large file was removed. If you've been following along and need to clean up:
$ cd ~
$ sudo umount /home/greg/ramtest
$ rmdir ramtest
Keep in mind that dd is a very powerful and thus very dangerous utility, so treat it with care. It can be invaluable for times like this however!
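For comparison, the same kind of in-place overwrite can be expressed in a scripting language. This is a minimal Python sketch, not what was run on the client's system: opening the file in "r+b" mode is the rough counterpart of conv=notrunc, in that existing bytes are overwritten and the file is neither truncated nor rewritten. The function name and the bounds check are my own additions.

```python
import os


def overwrite_bytes(path, offset, data):
    """Overwrite len(data) bytes at offset without changing the file size."""
    size = os.path.getsize(path)
    # Refuse anything that would grow the file: on a full disk that would fail
    if offset < 0 or offset + len(data) > size:
        raise ValueError("write would fall outside the existing file")
    with open(path, "r+b") as fh:  # read/write, no truncation
        fh.seek(offset)
        fh.write(data)
```

A call such as overwrite_bytes("data.20091215.pg", 346, b"1") would correspond to the first dd command above. The same caution applies: a wrong offset silently corrupts the file.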
Live by the sword, die by the sword
In an amazing display of chutzpah, Monty Widenius recently asked on his blog for people to write to the EC about the takeover of Sun by Oracle and its effect on MySQL, saying:
I, Michael "Monty" Widenius, the creator of MySQL, is asking you urgently to help save MySQL from Oracle's clutches. Without your immediate help Oracle might get to own MySQL any day now. By writing to the European Commission (EC) you can support this cause and help secure the future development of the product MySQL as an Open Source project.
"Help secure the future development"? Sorry, but that ship has sailed. Specifically, when MySQL was sold to Sun. There were many other missed opportunities over the years to keep MySQL as a good open source project. Some of the missteps:
- Bringing in venture capitalists
- Selling to Sun instead of making an IPO (Initial Public Offering)
- Failing to check on the long-term health of Sun before selling to them
- Choosing the proprietary dual-licensing route
- Making the documentation have a restricted license
- Failing to acquire InnoDB (which instead was bought by Oracle)
- Failing to acquire SleepyCat (which was instead bought by Oracle)
- Spreading FUD about the dual license and twisting the GPL in novel and dubious ways
Also interesting are some of the related blog posters and pundits, who seem to think that MySQL has some sort of special mystical quality that requires it be 'saved'. Sorry, but the business world and the open source world are both harsh ecosystems, where today's market leader can become tomorrow's has-been. For all those who are bemoaning MySQL's fate (especially those directly involved in selling this dual-licensed project for money), I offer a quote: "live by the sword, die by the sword". Not that MySQL is dead yet, but it's been dealt quite a number of near-fatal blows, and I'm not convinced that all the forks, spinoffs, well-wishers, and ex-developers can fix that. Should be interesting times ahead.
Permission denied for postgresql.conf
I recently saw a problem in which Postgres would not start up when called via the standard 'service' script, /etc/init.d/postgresql. This was on a normal Linux box, Postgres was installed via yum, and the startup script had not been altered at all. However, running this as root:
service postgresql start
...simply gave a "FAILED".
Looking into the script showed that output from the startup attempt should be going to /var/lib/pgsql/pgstartup.log. Tailing that file showed this message:
postmaster cannot access the server configuration file "/var/lib/pgsql/data/postgresql.conf": Permission denied
However, the postgres user can see this file, as evidenced by an su to the account and viewing the file. What's going on? Well, anytime you see something odd when using Linux, especially if permissions are involved, you should suspect SELinux. The first thing to check is if SELinux is running, and in what mode:
# sestatus
SELinux status: enabled
SELinuxfs mount: /selinux
Current mode: enforcing
Mode from config file: enforcing
Policy version: 21
Policy from config file: targeted
Yes, it is running and most importantly, in 'enforcing' mode. SELinux logs to /var/log/audit/ by default on most distros, although some older ones may log directly to /var/log/messages. In this case, I quickly found the problem in the logs:
# grep postgres /var/log/audit/audit.log | grep denied | tail -1
type=AVC msg=audit(1234567890.334:432): avc: denied { read } for
pid=1234 comm="postmaster" name="pgsql" dev=newpgdisk ino=403123
scontext=user_u:system_r:postgresql_t:s0
tcontext=system_u:object_r:var_lib_t:s0 tclass=lnk_file
Looks like SELinux did not like a symlink, and sure enough:
# ls -ld /var/lib/pgsql /var/lib/pgsql/data /var/lib/pgsql/data/postgresql.conf
lrwxrwxrwx. 1 postgres postgres 18 1999-12-31 23:55 /var/lib/pgsql -> /mnt/newpgdisk
drwx------. 2 postgres postgres 4096 1999-12-31 23:56 /var/lib/pgsql/data
-rw-------. 1 postgres postgres 16816 1999-12-31 23:57 /var/lib/pgsql/data/postgresql.conf
Here we see that although the postgres user owns the symlink, owns the data directory at /var/lib/pgsql/data, and owns the file in question, /var/lib/pgsql/data/postgresql.conf, the conf file is no longer really on /var/lib/pgsql, but is on /mnt/newpgdisk. SELinux did not like the fact that the postmaster process was trying to read across that symlink.
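When there are many denials to wade through, it can help to split each AVC line into its fields so that the interesting ones (the denied permission, the process name, and the target class) stand out. Here is a rough Python sketch that does this for a line shaped like the one above. The regular expressions are my own guess at a workable pattern for this particular line, not an official audit log grammar.

```python
import re


def parse_avc(line):
    """Split an AVC denial into its permissions and key=value fields."""
    # The denied permissions sit between braces, e.g. "{ read }"
    perms = re.search(r"\{\s*([^}]*?)\s*\}", line)
    # Everything else of interest is key=value, with some values quoted
    fields = {
        key: value.strip('"')
        for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', line)
    }
    fields["permissions"] = perms.group(1).split() if perms else []
    return fields
```

For the log entry shown earlier, the fields to look at are permissions (read), comm (postmaster), and tclass (lnk_file): a read on a symlink by the postmaster was denied.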
Now that we know SELinux is the problem, what can we do about it? There are four possible solutions at this point to get Postgres working again:
First, we can simply edit the PGDATA assignment within the /etc/init.d/postgresql file to point to the actual data dir, and bypass the symlink. In this case, we'd change the line as follows:
#PGDATA=/var/lib/pgsql/data
PGDATA=/mnt/newpgdisk/data
The second solution is to simply turn SELinux off. Unless you are specifically using it for something, this is the quickest and easiest solution.
The third solution is to change the SELinux mode. Switching from "enforcing" to "permissive" will keep SELinux on, but rather than denying access, it will log the attempt and still allow it to proceed. This mode is a good way to debug things while you attempt to put in new enforcement rules or change existing ones.
The fourth solution is the most correct one, but also the most difficult. That of course is to carve out an SELinux exception for the new symlink. If you move things around again, you'll need to tweak the rules again, of course.
Text sequences
Somebody recently asked on the Postgres mailing list about "Generating random unique alphanumeric IDs". While there were some interesting solutions given, from a simple Pl/pgsql function to using mathematical transformations, I'd like to lay out a simple and powerful solution using Pl/PerlU.
First, to paraphrase the original request, the poster needed a table to have a text column be its primary key, and to have a five-character alphanumeric string used as that key. Let's knock out a quick function using Pl/PerlU that solves the generation part of the question:
DROP FUNCTION IF EXISTS nextvalalpha(TEXT);
CREATE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
LANGUAGE plperlu
AS $_$
use strict;
my $numchars = 5;
my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
my $value = join '' => @chars[map{rand @chars}(1..$numchars)];
return $value;
$_$;
Pretty simple: it pulls a number of random characters from a string (with some commonly confused letters and numbers removed) and returns a string:
greg=# SELECT nextvalalpha('foo');
nextvalalpha
--------------
MChNf
(1 row)
greg=# SELECT nextvalalpha('foo');
nextvalalpha
--------------
q4jHm
(1 row)
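For readers who do not speak Perl, here is roughly the same generator as a Python sketch. It uses the same character list as the function above; the names ALPHABET and random_id are mine. The last line prints how many distinct five-character values exist, which is worth keeping in mind: the alphabet has 52 characters, so the keyspace is finite and random picks will eventually repeat.

```python
import random

# Same characters as the Perl version: look-alikes such as l, o, O, 0, 1 are left out
ALPHABET = "abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789"


def random_id(numchars=5, rng=random):
    """Build a string of numchars characters picked at random from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(numchars))


print(random_id())
print(len(ALPHABET) ** 5)  # size of the five-character keyspace
```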
So let's set up our test table. Since Postgres can use many things as column DEFAULTs, including user-defined functions, this is pretty straightforward:
DROP TABLE IF EXISTS seq_test;
CREATE TABLE seq_test (
id VARCHAR(5) NOT NULL DEFAULT nextvalalpha('foo'),
city TEXT,
state TEXT
);
A quick test shows that the id column is auto-populated with some random values:
greg=# PREPARE abc(TEXT,TEXT) AS INSERT INTO seq_test(city,state)
greg-# VALUES($1,$2) RETURNING id;
greg=# EXECUTE abc('King of Prussia', 'Pennsylvania');
id
-------
9zbsd
(1 row)
INSERT 0 1
greg=# EXECUTE abc('Buzzards Bay', 'Massachusetts');
id
-------
4jJ5D
(1 row)
INSERT 0 1
So far so good. But while those returned values are random, they are not in any way unique, which a primary key needs to be. First, let's create a helper table to keep track of which values we've already seen. We'll also track the 'name' of the sequence as well, to allow for more than one unique set of sequences at a time:
DROP TABLE IF EXISTS alpha_sequence;
CREATE TABLE alpha_sequence (
sname TEXT,
value TEXT
);
CREATE UNIQUE INDEX alpha_sequence_unique_value ON alpha_sequence(sname,value);
Now we tweak the original function to use this new table.
CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
use strict;
my $sname = shift;
my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
my $numchars = 5;
my $toomanyloops = 10000; ## Completely arbitrary pick
my $loops = 0;
my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = $1 AND value = $2';
my $sth = spi_prepare($SQL, 'text', 'text');
my $value = '';
SEARCHING:
{
## Safety valve
if ($loops++ >= $toomanyloops) {
die "Could not find a unique value, even after $toomanyloops tries!\n";
}
## Build a new value, then test it out
$value = join '' => @chars[map{rand @chars}(1..$numchars)];
my $count = spi_exec_prepared($sth,$sname,$value)->{processed};
redo if $count >= 1;
}
## Store it and commit the change
$SQL = 'INSERT INTO alpha_sequence VALUES ($1,$2)';
$sth = spi_prepare($SQL, 'text', 'text');
spi_exec_prepared($sth,$sname,$value);
return $value;
$_$;
Alright, that seems to work well, and prevents duplicate values. Or does it? Recall that one of the properties of sequences in Postgres is that they live outside of the normal MVCC rules. In other words, once you get a number via a call to nextval(), nobody else can get that number again (even you!) - regardless of whether you commit or rollback. Thus, sequences are guaranteed unique across all transactions and sessions, even if used for more than one table, called manually, etc. Can we do the same with our text sequence? Yes!
For this trick, we'll need to ensure that we only return a new value if we are 100% sure it is unique. We also need to record the value returned, even if the transaction that calls it rolls back. In other words, we need to make a small 'subtransaction' that commits, regardless of the rest of the transaction. Here's the solution:
CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
use strict;
use DBI;
my $sname = shift;
my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
my $numchars = 5;
my $toomanyloops = 10000;
my $loops = 0;
## Connect to this very database, but with a new session
my $port = spi_exec_query('SHOW port')->{rows}[0]{port};
my $dbname = spi_exec_query('SELECT current_database()')->{rows}[0]{current_database};
my $dbuser = spi_exec_query('SELECT current_user')->{rows}[0]{current_user};
my $dsn = "dbi:Pg:dbname=$dbname;port=$port";
my $dbh = DBI->connect($dsn, $dbuser, '', {AutoCommit=>1,RaiseError=>1,PrintError=>0});
my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = ? AND value = ?';
my $sth = $dbh->prepare($SQL);
my $value = '';
SEARCHING:
{
## Safety valve
if ($loops++ >= $toomanyloops) {
die "Could not find a unique value, even after $toomanyloops tries!\n";
}
## Build a new value, then test it out
$value = join '' => @chars[map{rand @chars}(1..$numchars)];
my $count = $sth->execute($sname,$value);
$sth->finish();
redo if $count >= 1;
}
## Store it and commit the change
$SQL = 'INSERT INTO alpha_sequence VALUES (?,?)';
$sth = $dbh->prepare($SQL);
$sth->execute($sname,$value); ## Does a commit
## Only now do we return the value to the caller
return $value;
$_$;
What's the big difference between this one and the previous version? Rather than examine the alpha_sequence table in our /current/ session, we figure out who and where we are, and make a completely separate connection to the same database using DBI. Then we find an unused value, INSERT that value into the alpha_sequence table, and commit that outside of our current transaction. Only then can we return the value to the caller.
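The effect of that second connection is easy to demonstrate in miniature. The Python sketch below is an analogy only: it uses SQLite from the standard library as a stand-in for Postgres, and it lets the unique index reject duplicates instead of doing a SELECT first, so it is a variation on the function above rather than a port of it. The function names and the temporary database file are hypothetical.

```python
import os
import random
import sqlite3
import tempfile

ALPHABET = "abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789"


def next_val_alpha(db_path, sname, numchars=5, too_many_loops=10000):
    """Reserve a unique value on a separate connection that commits at once."""
    side = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
    try:
        for _ in range(too_many_loops):  # safety valve, as in the original
            value = "".join(random.choice(ALPHABET) for _ in range(numchars))
            try:
                side.execute("INSERT INTO alpha_sequence VALUES (?, ?)", (sname, value))
            except sqlite3.IntegrityError:
                continue  # the unique index says this value is taken: try again
            return value
        raise RuntimeError("Could not find a unique value")
    finally:
        side.close()


def demo():
    """Roll back the caller's work and show the reserved value survives."""
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        main = sqlite3.connect(db_path)
        main.execute("CREATE TABLE alpha_sequence (sname TEXT, value TEXT)")
        main.execute("CREATE UNIQUE INDEX alpha_sequence_unique_value"
                     " ON alpha_sequence(sname, value)")
        main.execute("CREATE TABLE seq_test (id TEXT, city TEXT)")
        main.commit()
        value = next_val_alpha(db_path, "foo")
        main.execute("INSERT INTO seq_test VALUES (?, ?)", (value, "Buzzards Bay"))
        main.rollback()  # the caller changes its mind
        reserved = main.execute("SELECT count(*) FROM alpha_sequence").fetchone()[0]
        rows = main.execute("SELECT count(*) FROM seq_test").fetchone()[0]
        main.close()
        return value, reserved, rows
    finally:
        os.remove(db_path)


print(demo())
```

After the rollback, seq_test is empty but alpha_sequence still holds the reserved value, which is the sequence-like behavior we are after: a value handed out once is never handed out again, even if the caller's transaction is discarded.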
Postgres sequences also have a currval() function, which returns the last value returned via a nextval() on that sequence in the current session. The lastval() function is similar, but it returns the value from the most recent call to nextval() in the session, regardless of the sequence name used. We can make a version of these easily enough, because Pl/Perl functions have a built-in shared hash named '%_SHARED'. Thus, we'll add two new lines to the end of the function above:
...
$sth->execute($sname,$value); ## Does a commit
$_SHARED{nva_currval}{$sname} = $value;
$_SHARED{nva_lastval} = $value;
...
Then we create a simple function to display that value, as well as throw an error if called too early - just like currval() does:
DROP FUNCTION IF EXISTS currvalalpha(TEXT);
CREATE FUNCTION currvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
my $sname = shift;
if (exists $_SHARED{nva_currval}{$sname}) {
return $_SHARED{nva_currval}{$sname};
}
else {
die qq{currval of text sequence "$sname" is not yet defined in this session\n};
}
$_$;
Now the lastval() version:
DROP FUNCTION IF EXISTS lastvalalpha();
CREATE FUNCTION lastvalalpha()
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
if (exists $_SHARED{nva_lastval}) {
return $_SHARED{nva_lastval};
}
else {
die qq{lastval (text) is not yet defined in this session\n};
}
$_$;
For the next tests, we'll create a normal (integer) sequence, and see how it acts compared to our newly created text sequence:
DROP SEQUENCE IF EXISTS newint;
CREATE SEQUENCE newint START WITH 42;
greg=# SELECT lastval();
ERROR: lastval is not yet defined in this session
greg=# SELECT currval('newint');
ERROR: currval of sequence "newint" is not yet defined in this session
greg=# SELECT nextval('newint');
nextval
---------
42
(1 row)
greg=# SELECT currval('newint');
currval
---------
42
greg=# SELECT lastval();
lastval
---------
42
greg=# SELECT lastvalalpha();
ERROR: error from Perl function "lastvalalpha": lastval (text) is not yet defined in this session
greg=# SELECT currvalalpha('newtext');
ERROR: error from Perl function "currvalalpha": currval of text sequence "newtext" is not yet defined in this session
greg=# SELECT nextvalalpha('newtext');
nextvalalpha
--------------
rRwJ6
greg=# SELECT currvalalpha('newtext');
currvalalpha
--------------
rRwJ6
greg=# SELECT lastvalalpha();
lastvalalpha
--------------
rRwJ6
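The bookkeeping behind those three functions is small enough to model outside the database. The Python sketch below mimics the session behavior shown above: a per-name 'current' value, a single 'last' value, and an error if either is asked for before any nextval call. The class name and the injected generate callable are hypothetical; in the Perl version this state lives in %_SHARED.

```python
class TextSequenceSession:
    """Per-session memory of handed-out values, like %_SHARED in the Perl code."""

    def __init__(self, generate):
        self._generate = generate  # callable(sname) -> new unique value
        self._currval = {}         # last value per sequence name
        self._lastval = None       # last value overall

    def nextval(self, sname):
        value = self._generate(sname)
        self._currval[sname] = value
        self._lastval = value
        return value

    def currval(self, sname):
        if sname not in self._currval:
            raise LookupError(
                f'currval of text sequence "{sname}" is not yet defined in this session')
        return self._currval[sname]

    def lastval(self):
        if self._lastval is None:
            raise LookupError("lastval (text) is not yet defined in this session")
        return self._lastval
```

Keeping currval keyed by name while lastval is a single slot is what makes the two differ once more than one text sequence is in use in the same session.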
There is one more quick optimization we could make. Since the %_SHARED hash is available across our session, there is no need to do anything in the function more than once if we can cache it away. In this case, we'll cache away the server information we look up, the database handle, and the prepares. Our final function looks like this:
CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
use strict;
use DBI;
my $sname = shift;
my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
my $numchars = 5;
my $toomanyloops = 10000;
my $loops = 0;
## Connect to this very database, but with a new session
if (! exists $_SHARED{nva_dbi}) {
my $port = spi_exec_query('SHOW port')->{rows}[0]{port};
my $dbname = spi_exec_query('SELECT current_database()')->{rows}[0]{current_database};
my $dbuser = spi_exec_query('SELECT current_user')->{rows}[0]{current_user};
my $dsn = "dbi:Pg:dbname=$dbname;port=$port";
$_SHARED{nva_dbi} = DBI->connect($dsn, $dbuser, '', {AutoCommit=>1,RaiseError=>1,PrintError=>0});
my $dbh = $_SHARED{nva_dbi};
my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = ? AND value = ?';
$_SHARED{nva_sth_check} = $dbh->prepare($SQL);
$SQL = 'INSERT INTO alpha_sequence VALUES (?,?)';
$_SHARED{nva_sth_add} = $dbh->prepare($SQL);
}
my $value = '';
SEARCHING:
{
## Safety valve
if ($loops++ >= $toomanyloops) {
die "Could not find a unique value, even after $toomanyloops tries!\n";
}
## Build a new value, then test it out
$value = join '' => @chars[map{rand @chars}(1..$numchars)];
my $count = $_SHARED{nva_sth_check}->execute($sname,$value);
$_SHARED{nva_sth_check}->finish();
redo if $count >= 1;
}
## Store it and commit the change
$_SHARED{nva_sth_add}->execute($sname,$value); ## Does a commit
$_SHARED{nva_currval}{$sname} = $value;
$_SHARED{nva_lastval} = $value;
return $value;
$_$;
Having the ability to reach outside the database in Pl/PerlU - even if simply to go back in again! - can be a powerful tool, and allows us to do things that might otherwise seem impossible.
Perl+Postgres: changes in DBD::Pg 2.15.1
DBD::Pg, the Perl interface to Postgres, recently released version 2.15.1. The last two weeks have seen a quick flurry of releases: 2.14.0, 2.14.1, 2.15.0, and 2.15.1. Per the usual versioning convention, the numbers on the far right (in this case the "dot one" releases) were simply bug fixes, while 2.14.0 and 2.15.0 introduced API and/or major internal changes. Some of these changes are explained below.
From the Changes file for 2.15.0:
CHANGE: - Allow execute_array and bind_param_array to take oddly numbered items, such that DBI will make missing entries undef/null (CPAN bug #39829) [GSM]
The Perl Database Interface (DBI) has a neat feature to allow you to execute many sets of items at one time, known as execute_array. The basic format is to pass in a list of arrays, in which each array contains the values for one placeholder across all of the executions. For example:
## Create a simple test table with two columns
$dbh->do('DROP TABLE IF EXISTS people');
$dbh->do('CREATE TABLE people (id int, fname text)');
## Pass in all ids as a single array
my @numbers = (1,2,3);
## Pass in all names as a single array
my @names = ("Garrett", "Viktoria", "Basso");
## Prepare the statement
my $sth = $dbh->prepare('INSERT INTO people VALUES (?, ?)');
## Execute the statement multiple times (three times in this case)
$sth->execute_array(undef, \@numbers, \@names);
## (the first argument is an optional argument hash which we don't use here)
## Pull back and display the rows from our new table
my $SQL = 'SELECT id, fname FROM people ORDER BY fname';
for my $row (@{$dbh->selectall_arrayref($SQL)}) {
print "Found: $row->[0] : $row->[1]\n";
}
$ perl testscript.pl
Found: 3 : Basso
Found: 1 : Garrett
Found: 2 : Viktoria
In 2.15.0, we loosened the requirement that the number of placeholders in each array match up with the expected number. Per the DBI spec, any "missing" items are considered undef, which maps to a SQL NULL. Thus:
$dbh->do('DROP TABLE IF EXISTS people');
$dbh->do('CREATE TABLE people (id int, fname text)');
## Note that this time there are only two ids given, not three:
my @numbers = (1,2);
my @names = ("Garrett", "Viktoria", "Basso");
my $sth = $dbh->prepare("INSERT INTO people VALUES (?, ?)");
$sth->execute_array(undef, \@numbers, \@names);
## Show a question mark for any null ids
$SQL = q{
SELECT CASE WHEN id IS NULL THEN '?' ELSE id::text END, fname
FROM people ORDER BY fname
};
for my $row (@{$dbh->selectall_arrayref($SQL)}) {
print "Found: $row->[0] : $row->[1]\n";
}
$ perl testscript2.pl
Found: ? : Basso
Found: 1 : Garrett
Found: 2 : Viktoria
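The padding behavior can be illustrated outside of Perl as well. The following is a minimal sketch in Python, assuming the standard library's sqlite3 module as a stand-in for Postgres and itertools.zip_longest as a stand-in for DBI's "missing entries become undef" rule; it is not how DBD::Pg implements execute_array.

```python
import sqlite3
from itertools import zip_longest

numbers = [1, 2]                          # only two ids this time
names = ["Garrett", "Viktoria", "Basso"]  # but three names

# zip_longest pads the shorter column with None, which sqlite3 stores as NULL
rows = list(zip_longest(numbers, names, fillvalue=None))

conn = sqlite3.connect(":memory:")        # stand-in database, not Postgres
conn.execute("CREATE TABLE people (id int, fname text)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

found = conn.execute("SELECT id, fname FROM people ORDER BY fname").fetchall()
for person_id, fname in found:
    # Show a question mark for any null ids
    print(f"Found: {'?' if person_id is None else person_id} : {fname}")
```

The row for "Basso" ends up with a NULL id, matching the output of the Perl script above.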
Also note that bind_param_array is an alternate way to add the list of arrays before the execute is called. This is similar in concept to a regular execute: if you bind the values first, you can call execute without any arguments:
...
$sth->bind_param_array(1, \@numbers);
$sth->bind_param_array(2, \@names);
$sth->execute_array(undef);
...
CHANGE: - Use PQexecPrepared even when no placeholders (CPAN bug #48155) [GSM]
Sending queries to Postgres via DBD::Pg usually involves two steps: prepare and execute. The prepare is done one time, while the execute can be called many times, often times with different arguments. Previously, DBD::Pg would call PQexec for queries that had no placeholders. However, the ability to handle placeholders smoothly is only one advantage of using server-side prepares in Postgres. The other advantage is that Postgres only has to parse the query a single time, in the initial prepare. In 2.15.0, we use PQexecPrepared for all queries, whether they have placeholders or not. The upshot of this is that multiple calls to the execute() function will be a little bit faster, and that we only use PQexec when we really have to.
CHANGE: - Fix quoting of booleans to respect more Perlish variants (CPAN bug #41565) [GSM]
In previous versions, the mapping of Perl vars to booleans was very simple, and handled only 0 and 1. However, Perl's notion of "truth" is richer than that. We can now do things like this:
for my $name ('0', '1', '0E0', '0 but true', 'F', 'T', 'TRUE', 'false') {
printf qq{Value '%s' is %s\n}, $name, $dbh->quote($name, {pg_type => PG_BOOL});
}
$ perl testscript3.pl
Value '0' is FALSE
Value '1' is TRUE
Value '0E0' is TRUE
Value '0 but true' is TRUE
Value 'F' is FALSE
Value 'T' is TRUE
Value 'TRUE' is TRUE
Value 'false' is FALSE
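As a rough illustration of the kind of mapping involved, here is a small Python sketch. The function name quote_bool and the two lookup sets are hypothetical and cover only the values shown in the output above; the actual rules live in DBD::Pg's C code and may accept or reject other inputs.

```python
# Hypothetical lookup tables, built only from the values shown in the output above
TRUE_STRINGS = {"t", "true", "1", "0e0", "0 but true"}
FALSE_STRINGS = {"f", "false", "0"}

def quote_bool(value):
    """Map a Perl-style truth string to a SQL boolean literal (illustrative only)."""
    if value is None:
        return "NULL"
    text = str(value).strip().lower()
    if text in TRUE_STRINGS:
        return "TRUE"
    if text in FALSE_STRINGS:
        return "FALSE"
    # Anything else is rejected in this sketch rather than guessed at
    raise ValueError(f"Invalid boolean value: {value!r}")

for name in ["0", "1", "0E0", "0 but true", "F", "T", "TRUE", "false"]:
    print(f"Value '{name}' is {quote_bool(name)}")
```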
CHANGE: - Return ints and bools-cast-to-number from the db as true Perlish numbers. (CPAN bug #47619) [GSM]
This one is a little more subtle. When a value is returned from the database, it gets mapped back to a string. So even if the value in the database came from an INTEGER column, by the time it made its way back to your Perl script it was a string that happened to hold an integer value. DBD::Pg now attempts to cast some types to their Perl equivalent. This is normally hard to see without peering inside Perl internals, but using Data::Dumper can show you the difference:
## Ask Postgres to return a string and an integer
$SQL = 'SELECT 123::text, 123::integer';
$info = $dbh->selectall_arrayref($SQL)->[0];
print Dumper $info;
## Older versions of DBD::Pg give:
$VAR1 = [
'123',
'123'
];
## The new and improved version gives:
$VAR1 = [
'123',
123
];
A small difference, but not unimportant - this change came about through a bug request, as it was causing problems when DBD::Pg was interacting with JSON::XS. Special thanks to Tim Bunce (author of DBI, maintainer of the amazing NYTProf, and all-around Perl guru), who found an important bug regarding this solution in 2.14.0, which led to the quick release of 2.14.1. Lesson learned: don't try converting ints to floats via sv_setnv.
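To see why a JSON encoder cares about the string-versus-number distinction, consider this short Python sketch. It uses Python's json module as an analogy for JSON::XS; the two lists mirror the Data::Dumper output above.

```python
import json

old_row = ["123", "123"]   # older DBD::Pg: both columns arrive as strings
new_row = ["123", 123]     # newer DBD::Pg: the integer column is a real number

old_json = json.dumps(old_row)
new_json = json.dumps(new_row)

print(old_json)   # ["123", "123"]
print(new_json)   # ["123", 123]
```

A consumer that expects a JSON number for the integer column only gets one in the second case.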
Most of the other changes in 2.14 and 2.15 are bug fixes of one sort or another. To keep up with the changes or to talk about the project more, please join the mailing list.
Comparing databases with check_postgres
One of the more recent additions to check_postgres, the all-singing, all-dancing Postgres monitoring tool, is the "same_schema" action. This was necessitated by clients who wanted to make sure that their schemas were identical across different servers. The two use cases I've seen are servers that are being replicated by Bucardo or Slony, and servers that are doing horizontal sharding (e.g. same schema and database on different servers: which server you go to depends on (for example) your customer id). Often a new index fails to make it to one of the slaves, or some function is tweaked on one server by a developer, who then forgets to change it back or propagate it. This program allows a quick and automatable check for such problems.
The idea behind the same_schema check is simple: we walk the schema and check for any differences, then throw a warning if any are found. In this case, we're using the term "schema" in the classic sense of a description of your database objects. Thus, one of the things we check is that all the schemas (in the classic RDBMS sense of a container of other database objects) are the same, when running the "same_schema" check. Only slightly confusing. :)
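The "walk the schema and report differences" idea can be sketched in a few lines of Python. This is only an illustration: the diff_schemas function, the dict layout, and the table names are hypothetical, and check_postgres itself is a Perl program that queries the system catalogs and compares many more object types.

```python
def diff_schemas(one, two):
    """Return readable differences between two schema descriptions.

    Each argument maps a table name to a dict of {column name: type}.
    """
    problems = []
    for table in sorted(set(one) - set(two)):
        problems.append(f"Table in 1 but not 2: {table}")
    for table in sorted(set(two) - set(one)):
        problems.append(f"Table in 2 but not 1: {table}")
    # For tables present on both sides, compare columns one by one
    for table in sorted(set(one) & set(two)):
        cols1, cols2 = one[table], two[table]
        for col in sorted(set(cols1) | set(cols2)):
            if col not in cols2:
                problems.append(f'Column "{col}" of "{table}" in 1 but not 2')
            elif col not in cols1:
                problems.append(f'Column "{col}" of "{table}" in 2 but not 1')
            elif cols1[col] != cols2[col]:
                problems.append(
                    f'Column "{col}" of "{table}": type is {cols1[col]} on 1, '
                    f"but {cols2[col]} on 2."
                )
    return problems

# Hypothetical schema descriptions for two servers
db1 = {
    "public.customers": {"id": "integer", "email": "text"},
    "public.audit_log": {"id": "integer"},
}
db2 = {
    "public.customers": {"id": "integer", "email": "character varying"},
    "public.invoices": {"id": "integer"},
}

differences = diff_schemas(db1, db2)
for line in differences:
    print(line)
```

An empty list corresponds to the "OK" state; anything else would be reported as a problem.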
Not only is this program nice for monitoring (e.g. as a Nagios check), but if you pass in a --verbose argument, you get a simple not-all-on-one-line breakdown of all the differences between the two databases. Let's do a quick example.
First, we download and install check_postgres. We'll pull straight from a git repository for check_postgres. While we have our own repo at bucardo.org, we also are keeping it in sync with a tree at github.com, so we'll use that one:
git clone git://github.com/bucardo/check_postgres.git
cd check_postgres
perl Makefile.PL
make
make test
sudo make install
Let's create a Postgres cluster with the initdb command, start it up, then create two new databases to compare to each other.
initdb -D cptest
echo port=5555 >> cptest/postgresql.conf
pg_ctl -D cptest -l cp.log start
psql -p 5555 -c 'CREATE DATABASE yin'
psql -p 5555 -c 'CREATE DATABASE yang'
We're ready to run the script. By default, it outputs things in a Nagios-friendly manner. We should see an 'OK' because the two databases are identical:
./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang
POSTGRES_SAME_SCHEMA OK: DB "yin" (port=5555 => 5555) Both databases have identical items | time=0.01
The message could be clearer and show both database names, but the check worked and showed that things are exactly the same. Let's throw some differences in and run it again:
psql -p 5555 -d yin -c 'create table foobar(a int primary key, b text, c text)'
psql -p 5555 -d yang -c 'create table foobar(a int, b text, c varchar(99))'
psql -p 5555 -d yin -c 'create schema yinonly'
psql -p 5555 -d yang -c 'create table pineapple(id int)'
./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang
POSTGRES_SAME_SCHEMA CRITICAL: DB "yin" (port=5555 => 5555) Databases were different. Items not matched: 5 | time=0.01 Schema in 1 but not 2: yinonly Table in 2 but not 1: public.pineapple Column "a" of "public.foobar": nullable is NO on 1, but YES on 2. Column "c" of "public.foobar": type is text on 1, but character varying on 2. Table "public.foobar" on 1 has constraint "public.foobar_pkey", but 2 does not.
It works, but the output is a little messy for human consumption. Nagios requires everything to be on a single line, but we'll add a --verbose argument to ask the script for prettier formatting:
./check_postgres.pl --action=same_schema --dbport=5555 --dbname=yin --dbport2=5555 --dbname2=yang --verbose
POSTGRES_SAME_SCHEMA CRITICAL: DB "yin" (port=5555 => 5555) Databases were different. Items not matched: 5 | time=0.01
Schema in 1 but not 2: yinonly
Table in 2 but not 1: public.pineapple
Column "a" of "public.foobar": nullable is NO on 1, but YES on 2.
Column "c" of "public.foobar": type is text on 1, but character varying on 2.
Table "public.foobar" on 1 has constraint "public.foobar_pkey", but 2 does not.
There are also ways to filter the output, for times when you have known differences. For example, to exclude any tables with the word 'bucardo' in them, you could add this argument:
--warning="notable=bucardo"
The online documentation has more details about all the filtering options.
So what kind of things do we check for? Right now, we are checking:
- users (existence and powers, i.e. createdb, superuser)
- schemas
- tables
- sequences
- views
- triggers
- constraints
- columns
- functions (including volatility, strictness, etc.)
Got something else we aren't covering? Send in a patch, or a quick request, to the mailing list.
Bucardo and truncate triggers
Version 8.4 of Postgres was recently released. One of the features that hasn't gotten a lot of press, but which I'm excited about, is truncate triggers. This fixes a critical hole in trigger-based PostgreSQL replication systems, and support for these new triggers is now working in the Bucardo replication program.
Truncate triggers were added to Postgres by Simon Riggs (thanks Simon!), and unlike other types of triggers (UPDATE, DELETE, and INSERT), they are statement-level only, as truncate is not a row-level action.
Here's a quick demo showing off the new triggers. This is using the development version of Bucardo - a major new version is expected to be released in the next week or two that will include truncate trigger support and many other things. If you want to try this out for yourself, just run:
$ git clone http://bucardo.org/bucardo.git/
Bucardo does three types of replication; for this example, we'll be using the 'pushdelta' method, which is your basic "master to slaves" relationship. In addition to the master database (which we'll name A) and the slave database (which we'll name B), we'll create a third database for Bucardo itself.
$ initdb -D bcdata
$ initdb -D testA
$ initdb -D testB
(Technically, we are creating three new database clusters, and since we are doing this as the postgres user, the default database for all three will be 'postgres')
Let's give them all unique port numbers:
$ echo port=5400 >> bcdata/postgresql.conf
$ echo port=5401 >> testA/postgresql.conf
$ echo port=5402 >> testB/postgresql.conf
Now start them all up:
$ pg_ctl start -D bcdata -l bc.log
$ pg_ctl start -D testA -l A.log
$ pg_ctl start -D testB -l B.log
We'll create a simple test table on both sides:
$ psql -d postgres -p 5401 -c 'CREATE TABLE trtest(id int primary key)'
$ psql -d postgres -p 5402 -c 'CREATE TABLE trtest(id int primary key)'
Before we go any further, let's install Bucardo itself. Bucardo is a Perl daemon that uses a central database to store its configuration information. The first step is to create the Bucardo schema. This, like almost everything else with Bucardo, is done with the 'bucardo_ctl' script. The install process is interactive:
$ bucardo_ctl install --dbport=5400
This will install the bucardo database into an existing Postgres cluster.
Postgres must have been compiled with Perl support,
and you must connect as a superuser
We will create a new superuser named 'bucardo',
and make it the owner of a new database named 'bucardo'
Current connection settings:
1. Host:
2. Port: 5400
3. User: postgres
4. PID directory: /var/run/bucardo
Enter a number to change it, P to proceed, or Q to quit: P
Version is: 8.4
Attempting to create and populate the bucardo database and schema
Database creation is complete
Connecting to database 'bucardo' as user 'bucardo'
Updated configuration setting "piddir"
Installation is now complete.
If you see any unexpected errors above, please report them to bucardo-general@bucardo.org
You should probably check over the configuration variables next, by running:
bucardo_ctl show all
Change any setting by using:
bucardo_ctl set foo=bar
Because we don't want to tell the bucardo_ctl program our custom port each time we call it, we'll store that info into the ~/.bucardorc file:
$ echo dbport=5400 > ~/.bucardorc
Let's double check that everything went okay by checking the list of databases that Bucardo knows about:
$ bucardo_ctl list db
There are no entries in the 'db' table.
Time to teach Bucardo about our two new databases. The format for the add commands is: bucardo_ctl add [type of thing] [name of thing within the database] [arguments of foo=bar format]
$ bucardo_ctl add database postgres name=master port=5401
Database added: master
$ bucardo_ctl add database postgres name=slave1 port=5402
Database added: slave1
Before we go any further, let's look at our databases:
$ bucardo_ctl list dbs
Database: master Status: active Conn: psql -h -p 5401 -U bucardo -d postgres
Database: slave1 Status: active Conn: psql -h -p 5402 -U bucardo -d postgres
Note that by default we connect as the 'bucardo' user. This is a highly recommended practice, for safety and auditing. Since that user obviously does not exist on the newly created databases, we need to add it to each one:
$ psql -p 5401 -c 'create user bucardo superuser'
$ psql -p 5402 -c 'create user bucardo superuser'
Now we need to teach Bucardo about the tables we want to replicate:
$ bucardo_ctl add table trtest db=master herd=herd1
Created herd "herd1"
Table added: public.trtest
A herd is simply a named collection of tables. Typically, you put tables that are linked together by foreign keys or other logic into a herd so that they all get replicated at the same time.
The final setup step is to create a replication event, which in Bucardo is known as a 'sync':
$ bucardo_ctl add sync willow source=herd1 targetdb=slave1 type=pushdelta
NOTICE: Starting validate_sync for willow
CONTEXT: SQL statement "SELECT validate_sync('willow')"
Sync added: willow
This command actually did quite a bit of work behind the scenes, including creating all the supporting schemas, tables, functions, triggers, and indexes that Bucardo will need.
We are now ready to start Bucardo up. Simple enough:
$ bucardo_ctl start
Checking for existing processes
Starting Bucardo
Let's add a row to the master table and make sure it goes to the slave:
$ psql -p 5401 -c 'insert into trtest(id) VALUES (1)'
INSERT 0 1
$ psql -p 5402 -c 'select * from trtest'
 id
----
  1
(1 row)
Looks fine, so let's try out the truncate. On versions of Postgres less than 8.4, there was no way for Bucardo (or Slony) to know that a truncate had been run, so the rows were removed from the master but not from the slave. We'll do a truncate and add a new row in a single operation:
$ psql -p 5401 -c 'begin; truncate table trtest; insert into trtest values (2); commit'
COMMIT
$ psql -p 5402 -c 'select * from trtest'
 id
----
  2
(1 row)
It works! Let's clean up our test environment for good measure:
$ bucardo_ctl stop
$ pg_ctl stop -D bcdata
$ pg_ctl stop -D testA
$ pg_ctl stop -D testB
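To see why the truncate trigger matters in the demo above, here is a toy Python simulation of trigger-based replication. Everything in it is an assumption made for illustration (sets standing in for tables, a list standing in for the change queue, the replicate and run_demo names); it does not reflect Bucardo's actual internals.

```python
def replicate(deltas, slave):
    """Apply queued master changes to the slave, oldest first, then clear the queue."""
    for action, key in deltas:
        if action == "TRUNCATE":
            slave.clear()
        elif action == "INSERT":
            slave.add(key)
        elif action == "DELETE":
            slave.discard(key)
    deltas.clear()

def run_demo(truncate_trigger_available):
    """Replay the demo: insert 1, replicate, then truncate and insert 2, replicate."""
    master, slave, deltas = set(), set(), []

    # insert into trtest(id) values (1)
    master.add(1)
    deltas.append(("INSERT", 1))
    replicate(deltas, slave)

    # begin; truncate table trtest; insert into trtest values (2); commit
    master.clear()
    if truncate_trigger_available:
        # Only a truncate trigger can record this event in the queue
        deltas.append(("TRUNCATE", None))
    master.add(2)
    deltas.append(("INSERT", 2))
    replicate(deltas, slave)
    return master, slave

print(run_demo(truncate_trigger_available=False))  # slave still holds row 1
print(run_demo(truncate_trigger_available=True))   # slave matches the master
```

Without the trigger, the truncate never reaches the change queue, so the slave keeps the stale row.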
As mentioned, there are three types of syncs in Bucardo. The other type that can make use of truncate triggers is the 'swap' sync, aka "master to master". I've not yet decided on the behavior for such syncs, but one possibility is simply:
- Database A gets truncated at time X
- Bucardo truncates database B, then discards all delta rows older than X for both A and B, and all delta rows for B
- Everything after X gets processed as normal (conflict resolution, etc.)
- The same thing for a truncate on database B (truncate A, discard all older rows).
Second proposal:
- Database A gets truncated at time X
- We populate the delta table with every primary key in the table before truncation (assuming we can get at it)
- That's it! Bucardo does its normal thing as if we just deleted a whole bunch of rows on A, and in theory deletes them from B as well.
Comments on this strategy welcome!
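The second proposal is simple enough to sketch. The Python below is purely illustrative and assumes a delta table modeled as a list of (key, action, timestamp) tuples; the function names are hypothetical and the behavior for swap syncs has not been decided.

```python
def record_truncate_as_deletes(primary_keys, delta_table, truncated_at):
    """Queue one delete delta per row that existed just before the truncate."""
    for key in sorted(primary_keys):
        delta_table.append((key, "D", truncated_at))

def apply_deltas(delta_table, target):
    """Replay queued deltas against the other database, then empty the queue."""
    for key, action, _when in delta_table:
        if action == "D":
            target.discard(key)
        else:
            target.add(key)
    delta_table.clear()

table_a = {1, 2, 3}
table_b = {1, 2, 3}
deltas_a = []

# Database A gets truncated at time X (here X is just the number 100)
record_truncate_as_deletes(table_a, deltas_a, truncated_at=100)
table_a.clear()

# Bucardo then does its normal thing, as if the rows had been deleted one by one
apply_deltas(deltas_a, table_b)
print(table_a, table_b)
```

The cost of this approach is one delta row per truncated row, which could be large for big tables.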
Update: Clarified initdb cluster vs. database per comment #1 below, and added new truncation handling scheme for multi-master replication per comment #2.
MDX
Recently I've been working with Mondrian, an open source MDX engine. MDX stands for "multi-dimensional expressions", and is a query language used in analytical databases. In MDX, data are considered in "cubes" made up of "dimensions", which are concepts analogous to "tables" and "columns", respectively, in a relational database. And in MDX, much as in SQL, queries written in a special query language tell the MDX engine to return a data set by describing filters in terms of the various dimensions.
But MDX and SQL return data sets in very different ways. Whereas a SQL query will return individual rows (unless aggregate functions are used), MDX always aggregates rows. In MDX, dimensions aren't simple fields that contain arbitrary values; they're hierarchical objects that can be queried at different levels. And finally, in MDX only certain dimensions can be returned in a query. These dimensions are known as "Measures".
Without an example this doubtless makes little sense at first glance. In my case, the underlying data come from a public health application. Among other responsibilities, public health departments are tasked with preventing the spread of disease. Some diseases, such as tuberculosis or swine flu, are of particular interest because of their virulence, their mortality, or other characteristics. Health care providers are legally required to report cases of these diseases to various public health organizations, where the data are analyzed to identify and control outbreaks. The cube in question describes cases of these reportable conditions. Dimensions include the particular disease, the patient's gender and race, the health department jurisdiction the patient lives in, and a few other characteristics. Among the available measures are the count of cases, the average age of the patients, and the average duration of the local public health department's investigation into the case.
You'll note that each of the measures describes groups of cases: a count of cases, the average from a group of values, etc. MDX will tell me the number of cases that meet a criterion, for instance, but not the names of each patient involved. As I said before, MDX only returns aggregates, not individual rows. Each measure's definition includes an aggregate function used to calculate the final value for that measure based on a group of rows in a database.
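A rough Python analogy may help here. The sketch below groups a handful of made-up case rows by one dimension and computes two measures, each defined by an aggregate function. The sample rows and the measure names case_count and average_age are invented for illustration and are not taken from the real cube.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows standing in for the underlying relational data
cases = [
    {"disease": "Tuberculosis", "gender": "F", "age": 34},
    {"disease": "Tuberculosis", "gender": "M", "age": 50},
    {"disease": "Pertussis", "gender": "F", "age": 3},
]

def aggregate(rows, dimension):
    """Group rows by one dimension and compute each measure per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row)
    return {
        member: {
            "case_count": len(members),                       # COUNT-style measure
            "average_age": mean(r["age"] for r in members),   # AVG-style measure
        }
        for member, members in groups.items()
    }

by_disease = aggregate(cases, "disease")
print(by_disease)
```

Note that the result contains one entry per dimension member and no individual case rows.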
The cube also uses hierarchical dimensions. As an example, public health data categorizes cases by age group rather than by age. Groups include '< 1 year' and '1-4 years' at the young end, '85+ years' at the older end, and five year increments for everything in between. So the age dimension hierarchy would include two levels: one for the age group, and one for the specific age. In some instances, the jurisdiction dimension might also be a hierarchy, with the public health department at the top level, and subdivisions such as county, zip code, or neighborhood in levels of increasing specificity underneath.
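The age group level of that hierarchy amounts to a simple bucketing rule. Here is a minimal Python sketch of it; the function name and the label format for the five-year groups (such as "5-9 years") are assumptions, since only the first two and the last group labels are given above.

```python
def age_group(age):
    """Map an age in years to its age group label (upper level of the hierarchy)."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 1:
        return "< 1 year"
    years = int(age)          # work in completed years
    if years <= 4:
        return "1-4 years"
    if years >= 85:
        return "85+ years"
    low = (years // 5) * 5    # five-year increments in between
    return f"{low}-{low + 4} years"

for sample in (0.5, 3, 5, 42, 84, 90):
    print(sample, "->", age_group(sample))
```

Querying at the age group level aggregates over these labels, while drilling down to the lower level uses the specific age.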
At this point, the SQL-oriented reader says, "Well, you can do all this in SQL," and that is perfectly true. In fact, the major duty of an MDX engine is generally to translate MDX queries into SQL queries (or more often, sets thereof). The advantage of MDX is that sometimes it's simply easier to express a particular set of dimensions and measures in MDX than in the corresponding set of SQL queries. Better still, there are nice applications that speak MDX and allow you to browse interactively through MDX cubes without knowing either MDX or SQL. And finally, when the data set gets really large, which is common in OLAP databases, the MDX engine knows about optimizations it can make to speed things up.
A simple MDX query might look like this:
SELECT
NON EMPTY {[Measures].[Quantity]} ON COLUMNS,
NON EMPTY
{([Markets].[All Markets], [Customers].[All Customers],
[Product].[All Products], [Time].[All Years],
[Order Status].[All Status Types])} ON ROWS
FROM [SteelWheelsSales]
These data come from a cube that ships as a sample with the open source Pentaho business intelligence software suite. [SteelWheelsSales] represents the cube name; other bracketed expressions are measure and dimension names. "ON ROWS" and "ON COLUMNS" describe the "axis" on which the particular measure or dimensions should be displayed. The "ROWS" and "COLUMNS" axes exist by default, and others can be defined at will. The query above gives a result set like this one:
This image shows what a more complex Mondrian MDX session might look like. The cube describes sales data from a sample business. Users can easily "slice and dice" data to view trends over time, variations across sales regions or product lines, or mixtures thereof. In this case, the rows describe various combinations of dimension values, and each cell contains the one measure this query asks for, aggregated across the rows that match the corresponding dimensions.
For more on MDX query syntax or MDX in general, see this Microsoft library MDX reference.
Competitors to Bucardo version 1
Last time I described the design and major functions of Bucardo version 1 in detail. A natural question to ask about Bucardo 1 is, why didn't I use something else already out there? And that's a very good question.
I had no desire to create a new replication system and work out the inevitable kinks that would come with that. However, nothing then available met our needs, and today still nothing I'm familiar with quite would. So writing something new was necessary. Writing an asynchronous multimaster replication system for Postgres was not trivial, but turned out to be easier than I had expected thanks to Postgres itself -- with the caveats noted in the last post.
But, back to the landscape. What follows is a survey of the Postgres replication landscape as it looked in mid-2002 when I first needed multimaster replication for PostgreSQL 7.2.
pgreplicator
PostgreSQL Replicator is probably the most similar project to Bucardo 1. It was released in 2001 and does not appear to have had any updates since October 2001. I don't recall why I didn't use this, but from reviewing the documentation I suspect it was because it hadn't been updated for PostgreSQL 7.2, it used PL/Tcl, and required a daemon to run on every node. But the asynchronous store-and-forward approach and the use of triggers and data storage tables are similar to Bucardo 1.
dbmirror
I don't remember whether this was around in 2002, but it's part of PostgreSQL contrib now. It is master/slave replication only.
Slony-I
I don't think Slony-I existed in 2002 -- version 1.0 was released in 2004. But in any case, it only does master/slave replication.
Slony2
There has been no code released from this project and the website is now gone.
erserver
Master/slave replication, abandoned in favor of Slony-I. Website is now gone.
Postgres-R
This was a research project that worked with PostgreSQL 6.4. Some Postgres-R design documents were published. An effort to port it to PostgreSQL 7.2 (the pgreplication project) did not appear to have gotten very far. In 2008 it seems to have been partially revived. I don't know what the current status is.
PGCluster
This didn't exist in 2002. I'm not sure where it's at now. I believe it uses synchronous replication.
pgpool
This isn't the kind of "replication" I wanted; it's database load balancing and multiplexing. The pgpool listener is a single point of failure, and all databases must be accessible or data will be lost on a database server that is down.
Usogres
Master/slave replication for backup purposes.
Mammoth PostgreSQL + Replication
This didn't exist in 2002. It is only master/slave replication. It began as proprietary software but I believe is open source now.
EnterpriseDB Replication Server
A proprietary offering that came out in 2005 or 2006, for master/slave replication only. Has apparently been replaced by Slony, or perhaps was always rebranded Slony.
pgComparator
An rsync-like tool for comparing databases. Didn't exist in 2002. Probably much better than Bucardo 1's compare operation.
DBBalancer
Kind of like pgpool, more of a connection pooler. Hasn't been updated since 2002.
DRAGON
"Database Replication based on Group Communication." Links to this project were defunct.
DBI-Link
DBI-Link isn't about replication.
(Summary)
I assembled this list some time back and have made some updates to it. I'm sure there are more to consider today. Please comment if you have any corrections or additions.
The design of Bucardo version 1
Since PGCon 2009 begins next week, I thought it would be a good time to start publishing some history of the Bucardo replication system for PostgreSQL. Here I will cover only Bucardo version 1 and leave Bucardo versions 2 and 3 for a later post.
Bucardo 1 is an asynchronous multi-master and master/slave database replication system. I designed it in August-September 2002, to run in Perl 5.6 using PostgreSQL 7.2. It was later updated to support PostgreSQL 7.4 and 8.1, and changes in DBD::Pg's COPY functionality. It was built for and funded by Backcountry.com, and various versions of Bucardo have been used in production as a core piece of their infrastructure from September 2002 to the present.
Bucardo's design is simple, relying on the consistently correct behavior of the underlying PostgreSQL database software. It made some compromises on ideal behavior in order to have a working system in a reasonable amount of time, but the compromises are few and are mentioned below.
General design
Bucardo 1 needed to:
- Support asynchronous multimaster replication.
- Support asynchronous master/slave replication of full tables and changes to tables.
- Leave frequency of replication up to the administrator, which came by default since each replication event is a separate run of the program.
- Preserve transaction atomicity and isolation across databases.
- Continue collecting change information even when no replication process is running.
- Be fairly efficient in storing changes and in bandwidth usage sending them to the other database.
- Have a default "winner" in collision situations, with special handling possible for certain tables where more intelligent collision merges could be done.
- Not require any database downtime for maintenance, upgrades, etc.
- Be fairly simple to understand and support.
- Support a data flow arrangement such that the replicator is behind a firewall and reaches out to an external database, but doesn't require inbound access to the internal database.
Operations
There are four types of database operations Bucardo 1 can perform:
- peer - synchronize changes in one or more tables between two peer databases (multi-master)
- pushdelta - copy only changed rows from a table or set of tables from a master database to a slave database
- push - copy an entire table or set of tables from a master database to a slave database
- compare - compare all rows of one or more tables between two databases
I will discuss each of these operations in turn.
Peer sync
The peer sync operation is the most groundbreaking feature of Bucardo 1. The much smaller Backcountry.com of 2002 wanted to have an internal master database in their office, which housed their customer service and warehouse employees, buyers, and management. Their office had a low-bandwidth and not entirely reliable Internet connection. Their e-commerce web, application, and database servers were at a colocation facility with a fast Internet connection, and they wanted an identical master database to reside there, so that in the case of any disruption in connectivity between their office and colocation facility, both locations could continue to function independently, and their databases would automatically synchronize after connectivity was restored.
To summarize, what they needed was multi-master replication. Their needs would be satisfied with asynchronous multi-master replication. That meant that it was acceptable for the databases to be current with each other with 1-2 minutes of lag time. (Synchronous multi-master replication requires a continuous connection between the two master databases, and transactions are not allowed to commit until the transaction is completed on both databases.)
I want to review some of the features that are required for multi-master replication to work. First, it needs to have ACID properties just as the underlying database itself. The most relevant properties for our multi-master replication system are atomicity and isolation. A transaction must be entirely visible on a given database, or not visible at all.
For example, let us imagine that a customer ecommerce order consists of exactly 1 row in the "orders" table, which references 1 row in the "users" table, and the following tables may have 0 or more rows pointing to the "orders" table:
- order_lines
- order_notes
- credit_cards
- payments
- gift_certificates
- coupon_uses
- affiliate_commissions
- inventory
To add an order to the source database, a transaction is started, rows are added to relevant tables, the transaction is committed, and then those rows will all appear to other database users at once. Until the transaction is committed, no changes are visible. If an error occurs, the entire transaction rolls back, and it will never have been seen by any other database user.
This ensures that warehouse employees, customer service representatives, etc. will never see a partial order. This is especially important since we don't want to ship an order that is missing some of its line items, or double-charge a credit card because we didn't have a payment record yet. And an order without its associated inventory records would have trouble shipping at the warehouse.
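The all-or-nothing behavior described above can be demonstrated with a few lines of Python. This sketch uses the standard library's sqlite3 module as a stand-in for PostgreSQL, with a cut-down two-table version of the order schema; the add_order helper, the column names, and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in database, not PostgreSQL
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
    CREATE TABLE order_lines (
        order_id INTEGER NOT NULL REFERENCES orders(id),
        sku      TEXT NOT NULL,
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    );
""")

def add_order(order_id, customer, lines):
    """Insert an order and its lines in one transaction; roll back all of it on error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, customer))
            conn.executemany(
                "INSERT INTO order_lines VALUES (?, ?, ?)",
                [(order_id, sku, qty) for sku, qty in lines],
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = add_order(1, "alice", [("TENT", 1), ("STOVE", 2)])
second = add_order(2, "bob", [("BOOTS", 1), ("SOCKS", 0)])  # quantity 0 violates the CHECK

order_count = conn.execute("SELECT count(*) FROM orders").fetchone()[0]
line_count = conn.execute("SELECT count(*) FROM order_lines").fetchone()[0]
print(first, second, order_count, line_count)
```

The second order fails on its last line item, so neither its "orders" row nor its first line item is ever visible.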
This is all standard ACID stuff. But since I was writing a multi-master replication system from scratch, I had to assure the same properties across two database clusters, for which PostgreSQL had no facilities.
Changes are tracked by having a "delta table" paired with every table that's part of the multi-master replication system. The table has three columns: the primary key in the table being tracked, the wallclock timestamp, and an indicator of whether the change was due to an insert, update, or delete. Every change in the table being tracked is recorded by rules and triggers that insert a corresponding row in the delta table.
This is what the delta table for "orders" looks like (simplified a bit for readability):
Table "public.orders_delta"
Column | Type | Modifiers
---------------+-------------+-----------------------------------------
delta_key | varchar(14) | not null
delta_action | char(1) | not null
last_modified | timestamp | not null default timeofday()::timestamp
Check constraints:
"delta_action_valid" CHECK (delta_action IN ('I','U','D'))
Triggers:
orders_delta_last_modified BEFORE INSERT OR UPDATE ON orders_delta
FOR EACH ROW EXECUTE PROCEDURE update_last_modified()
The new row data itself in the tracked table is not copied, because the data is right there for the taking. It is enough to note that a change was made. If multiple changes are made, only the most recent version of the row is available, but that is fine because that's the only one we need to replicate.
Because nothing outside of the database is required to track changes, the tracking continues even when Bucardo 1 is not running. As long as the delta table exists and can be written to, and the tracking rules and triggers are in place on the tracked table, the changes will be recorded.
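The tracking mechanism can be sketched with SQLite triggers driven from Python. This is an approximation under stated assumptions: sqlite3 stands in for PostgreSQL 7.2, the trigger names are hypothetical, the timestamp default uses CURRENT_TIMESTAMP instead of timeofday(), and Bucardo 1 used a mix of rules and triggers rather than the three triggers shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in database, not PostgreSQL
conn.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT);

    -- Delta table: key of the changed row, kind of change, and when it happened
    CREATE TABLE orders_delta (
        delta_key     TEXT NOT NULL,
        delta_action  TEXT NOT NULL CHECK (delta_action IN ('I','U','D')),
        last_modified TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER orders_track_insert AFTER INSERT ON orders
    BEGIN
        INSERT INTO orders_delta (delta_key, delta_action) VALUES (NEW.order_id, 'I');
    END;
    CREATE TRIGGER orders_track_update AFTER UPDATE ON orders
    BEGIN
        INSERT INTO orders_delta (delta_key, delta_action) VALUES (NEW.order_id, 'U');
    END;
    CREATE TRIGGER orders_track_delete AFTER DELETE ON orders
    BEGIN
        INSERT INTO orders_delta (delta_key, delta_action) VALUES (OLD.order_id, 'D');
    END;
""")

# No replication process is running; the database records the changes by itself
conn.execute("INSERT INTO orders VALUES ('A100', 'new')")
conn.execute("UPDATE orders SET status = 'paid' WHERE order_id = 'A100'")
conn.execute("DELETE FROM orders WHERE order_id = 'A100'")

changes = conn.execute(
    "SELECT delta_key, delta_action FROM orders_delta ORDER BY rowid"
).fetchall()
print(changes)
```

Only the key and the action are stored; the row data itself stays in the tracked table, as described above.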
Bucardo 1 achieves atomicity and isolation of the replication transaction with this process:
- Open a connection to the first database, set transaction isolation to serializable, and disable triggers and rules.
- Open a connection to the second database, set transaction isolation to serializable, and disable triggers and rules.
- For each table to be synchronized in this group:
- Verify that the table's column names and order match in the two databases.
- Walk through the delta table on the first database, making identical changes to the second database. Empty the delta table when done.
- Walk through the delta table on the second database, making identical changes to the first database. Empty the delta table when done.
- Make a note of any changes that were made to the same rows on both databases ("conflicts"). By default, we resolve the conflicts silently by allowing the designated "winner" database's change to be the one that remains. For certain tables such as "inventory", appropriate table-specific conflict resolution code was added that merged the changes instead of designating a winner and loser version of the row.
- Once all changes have succeeded, commit transactions on both databases.
This last step of the process does not satisfy the ACID durability requirement. Since Bucardo 1 was designed on PostgreSQL 7.2, with no 2-phase commit possible, there is a chance that one database will fail to commit its transaction after the other database already did, and the changes will be lost on one side only. This has never happened in practice, mostly due to the fact that committing a transaction in PostgreSQL is a nearly instantaneous operation, since the data is already in place and no separate rollback or log tables need to be modified. But it is certainly possible that it could happen, and it is an undesirable risk. With real 2-phase commit now available in PostgreSQL, complete durability could be achieved.
All of a sudden, the changes on each side are now available to the other side, all at once. Only entire orders are visible, never partial orders.
ACID consistency is achieved by assuming that due to PostgreSQL's integrity checks on the source database, the data was already consistent there, and it is copied verbatim to the destination database where it will still be consistent. Thus, CHECK constraints, referential integrity constraints, etc. are expected to be identical between the two databases. Bucardo 1 does not propagate database schema changes.
Thus the main principles to provide fairly reliable replication are:
- All related tables must be synchronized within the same transaction.
- Synchronization must always be done in both directions in the same transaction, so that the code can detect simultaneous change conflicts.
- The most recent change to a given row must of course be the last change, so changes should be replayed in order. (We optimize this by not copying over row changes that we know will be deleted later in the same transaction.)
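The peer sync process and the principles above can be illustrated with a minimal in-memory sketch in Python. This is not Bucardo 1's actual code: the `peer_sync` function, the dict-based tables, and the set-based delta tables are assumptions made purely for illustration, and real transactions, triggers, and table-specific merge logic are left out.

```python
def peer_sync(table_a, delta_a, table_b, delta_b, winner="a"):
    """Synchronize two in-memory "tables" (dicts keyed by primary key).

    delta_a and delta_b are sets of primary keys changed on each side since
    the last sync, standing in for the delta tables. Returns the set of
    conflicting keys (rows changed on both sides).
    """
    conflicts = delta_a & delta_b

    def replay(src, dst, keys):
        for key in keys:
            if key in src:
                dst[key] = src[key]   # inserts and updates are handled identically
            else:
                dst.pop(key, None)    # row no longer exists in the source: a delete

    # Changes made on only one side are copied to the other side.
    replay(table_a, table_b, delta_a - conflicts)
    replay(table_b, table_a, delta_b - conflicts)

    # For conflicts, the designated winner's version of the row remains.
    if winner == "a":
        replay(table_a, table_b, conflicts)
    else:
        replay(table_b, table_a, conflicts)

    # Empty the delta tables once all changes have been applied.
    delta_a.clear()
    delta_b.clear()
    return conflicts
```

Only the current version of each row is copied, which mirrors the point that the delta table needs to record that a row changed, not what it changed to.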
Things to consider with multi-master replication:
- Conflicts are less likely the more often the synchronization is performed. But conflicts can still happen, and must be resolved somehow. Creating a generic conflict resolution mechanism is difficult, but declaring a "winning" database is easy, and special conflict resolution logic can be added for certain tables where lost changes would be troublesome.
- Very large change sets can take a long time to synchronize. For example, consider an unintentionally large update like this:
UPDATE inventory SET quantity = quantity + 5
That may change hundreds of thousands of rows, all in a single transaction. Our replication system needs to make all those changes in a single transaction to the other database, but it must do so over a comparatively slow Internet connection. As transactions run longer, they often encounter locks from other concurrent database activity, and roll back. Then the process must start over, but now there are even more changes to copy over, so it takes even longer. In the worst situations, the synchronization simply cannot complete until other concurrent database activity is temporarily stopped, so that no locks will conflict. And that means downtime of applications, and manual intervention by the system administrator.
Perhaps you could ship over all the data to the other database server ahead of time, then begin transactions on both databases and make the changes based on the local copy of the data, and expect the changes to be accepted more quickly since the network is no longer a bottleneck. But the destination database won't have been idle during that copying, which needs to be accounted for.
Statement replication does not have this same weakness, but it has many weaknesses of its own.
- Sequences need to be set up to operate independently without collisions on the two servers in a peer sync. Two easy ways to do this are:
- Set up sequences to cover separate ranges on each server. For example, MAXVALUE 999999 on the first server, and MINVALUE 1000000 on the second server. Make sure to spread the ranges far enough apart that they are unlikely ever to collide.
- Set up sequences to supply odd numbers on one server, and even on the other. For example, START 1 INCREMENT 2 on the first server, and START 2 INCREMENT 2 on the second server.
- A primary key is required. Currently, it must be a single column, and must be the first column in the table.
- Because each table's primary key may be of a different datatype, and to keep queries on delta tables as simple as possible, Bucardo 1 uses a separate delta table for each table being tracked.
- A more pluggable system for adding table-specific collision handling would be nice.
- The delta table column "delta_action" isn't actually necessary -- inserts and updates are already handled identically, and deletes can be inferred from the join on the tracked table. The "delta_action" is perhaps a nice bit of diagnostic information, and not burdensome as a CHAR(1), but otherwise could be removed.
- It's important that the delta table's "last_modified" column be based on wallclock time, not transaction start time, because we only keep the most recent change, and if all changes within a transaction are tagged by transaction start time, we'd end up with an arbitrary row as the "most recent" one, resulting in inconsistent data between the databases.
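The two sequence strategies listed above (separate ranges, and odd/even values) can be checked with a small Python sketch. The `sequence` generator below only imitates a sequence with START, INCREMENT, and MAXVALUE settings; it is an illustration, not PostgreSQL's implementation.

```python
from itertools import islice

def sequence(start, increment=1, maxvalue=None):
    """Yield values like a simple ascending database sequence."""
    value = start
    while maxvalue is None or value <= maxvalue:
        yield value
        value += increment

# Odd numbers on the first server, even numbers on the second.
odd = set(islice(sequence(start=1, increment=2), 1000))
even = set(islice(sequence(start=2, increment=2), 1000))

# Separate ranges: up to 999999 on the first server, from 1000000 on the second.
low = set(islice(sequence(start=1, maxvalue=999999), 1000))
high = set(islice(sequence(start=1000000), 1000))

print(odd.isdisjoint(even), low.isdisjoint(high))  # True True
```

The odd/even scheme can never collide, while the range scheme is only safe for as long as the first server stays below its MAXVALUE.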
Pushdelta
The pushdelta operation uses the same kind of delta tables and associated triggers and rules that the peer sync uses, but is a one-way push of the changed rows from master to slave. It is useful for large tables that don't have a high percentage of changed rows.
The pushdelta operation currently only supports a single target database. The ability to use pushdelta from a master to multiple slaves would be useful.
Push
The push operation very simply copies entire tables from the master to one or more slaves, for each table in a group. It requires no delta tables, triggers, or rules.
Table pushes can optionally supply a query that will be used instead of a bare "SELECT *" on the source table. Any query is allowed that will result in matching columns for the target table. We've used this to push out only in-stock inventory, rather than the whole inventory table, for example.
No primary key is required on tables that are pushed out in full.
The push operation uses DELETE to empty the target table. It would be good to optionally specify that TRUNCATE be used instead, and to take advantage of the PostgreSQL 8.1 multi-table truncate feature on tables with foreign key references.
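As a rough sketch of the push operation described above, the following Python example empties a target table with DELETE and refills it from the source, optionally using a custom query. It uses the standard `sqlite3` module only so that it runs without a database server; Bucardo 1 itself targets PostgreSQL, and the `push_table` function and the `inventory` schema are hypothetical.

```python
import sqlite3

def push_table(source, target, table, query=None):
    """Replace the contents of `table` on the target with rows from the source.

    `query` optionally replaces the bare SELECT * on the source table; it
    must return columns matching the target table. `table` is interpolated
    into the SQL, so it must be a trusted identifier.
    """
    rows = source.execute(query or f"SELECT * FROM {table}").fetchall()
    with target:  # one transaction: commit on success, roll back on error
        target.execute(f"DELETE FROM {table}")
        if rows:
            placeholders = ", ".join("?" * len(rows[0]))
            target.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return len(rows)

master = sqlite3.connect(":memory:")
slave = sqlite3.connect(":memory:")
for db in (master, slave):
    db.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER)")
master.executemany("INSERT INTO inventory VALUES (?, ?)",
                   [("A1", 5), ("B2", 0), ("C3", 12)])
slave.execute("INSERT INTO inventory VALUES ('OLD', 1)")

# Push only in-stock inventory rather than the whole table.
pushed = push_table(master, slave, "inventory",
                    query="SELECT * FROM inventory WHERE quantity > 0")
```

After the push, the slave holds only the two in-stock rows, and the stale 'OLD' row is gone.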
Compare
The compare operation compares every row of the tables in its group, and displays any differences. It is a read-only operation. It can be used to make sure that tables to be used in multi-master replication start out identical, and later, to verify correct functioning of peer, pushdelta, and push operations.
The compare operation is fairly slow. It reads in all primary keys from both tables first, then fetches each row in turn. It could be made much more efficient.
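One way the comparison might be made cheaper is to fetch both tables in full and compare them in memory, instead of fetching each row in turn. The Python sketch below shows only that idea; the `compare_tables` function and the dict-of-tuples representation are assumptions, not how Bucardo 1 works.

```python
def compare_tables(rows_a, rows_b):
    """Compare two tables given as dicts mapping primary key -> row tuple.

    Returns a list of (key, row_in_a, row_in_b), ordered by key, for every
    key whose rows differ; a row missing on one side is reported as None.
    """
    differences = []
    for key in sorted(rows_a.keys() | rows_b.keys()):
        row_a, row_b = rows_a.get(key), rows_b.get(key)
        if row_a != row_b:
            differences.append((key, row_a, row_b))
    return differences

a = {1: ("x", 1), 2: ("y", 2)}
b = {1: ("x", 1), 2: ("y", 3), 3: ("z", 0)}
print(compare_tables(a, b))
# [(2, ('y', 2), ('y', 3)), (3, None, ('z', 0))]
```

This trades memory for round trips, so it would suit tables that fit comfortably in memory on the machine running the comparison.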
Options
Optionally, tables can be vacuumed and/or analyzed after each operation.
In earlier versions of Bucardo 1, there was also an option to drop and rebuild all indexes automatically, to reduce index bloat, but beginning with PostgreSQL 7.3, primary key indexes could not be dropped when foreign keys required them, and the index bloat problem was dramatically reduced in PostgreSQL 7.4, mostly eliminating the need for the feature.
Limitations
Some of these are limitations that could easily be lifted, but no need had arisen. Some are minor annoyances, and others are major feature requests.
- For peer, pushdelta, and compare operations, a primary key is required. There are currently limitations on that key:
- Only single-part primary keys are supported.
- The primary key is assumed to be the first column. It would be easy to allow specifying another column as the primary key, or to interrogate the database schema directly to determine the key column, but we've never needed it.
- If an operation of one type is already underway, other operations of the same type will be rejected. It would be much more convenient for the users to add the newly requested operation to a queue and perform it when the current operation has finished.
- The program stands alone, performing a single operation and exiting. It was designed to run from cron. A persistent daemon that accepts requests in a queue or by message passing could better handle the many operations needed on a busy server.
- The program could use PostgreSQL's LISTEN and NOTIFY feature to learn of changes in a table and run a peer sync based on that notification, instead of being run on a timed schedule or on demand.
- Delta tables and triggers must be created or removed manually, though our helper script makes that fairly easy. It would be nice to have Bucardo automatically create delta tables and triggers as needed, or remove them when no longer needed (so that the overhead of tracking changes isn't incurred).
- Delta tables clutter the schema of the tables they are connected to. PostgreSQL didn't yet have the schema (namespace) feature when Bucardo 1 was created, but it would be nice to centralize the delta tables and functions in a separate schema.
- The datatypes of the fields in tables being replicated are not compared; only the names and order are compared.
- The configuration file syntax is fairly unpleasant.
- Only tables can be synchronized. It would be good to add support for views, sequences, and functions as first-class objects that could be pushed from master to slave or synchronized between two masters.
- It would be more convenient, and could reduce the chance of trouble due to misconfiguration, if Bucardo would interrogate the database to learn of all foreign key relationships between tables so that it could automatically create groups of tables that need to be processed together. Trigger functions and rules can cause changes to one table's row to modify rows in other table(s), in an opaque way that is resistant to introspection, but Bucardo could offer a location for users to declare what other tables a function can affect, and use that in building its dependency tree.
- There is no unit test suite.
- The insert trigger and update_last_modified function are written in PL/pgSQL, and are the only dependency on PL/pgSQL. They are both simple functions and should work fine as plain SQL functions, but it seems like there was a reason I had to use PL/pgSQL -- I just can't remember why anymore.
- In Bucardo 1, permission to insert into the various delta tables must be granted to any user that would change the base tables, or changes will be prevented by PostgreSQL. For a database with many users of varying access levels, this is a pain. It would be better to define the function to run as SECURITY DEFINER, and create the function as the superuser. Then no explicit permission would need to be granted on any delta table, and the delta tables would be inaccessible except through the Bucardo 1 API (except to the superuser). That would necessitate a change to using functions for updates and deletes, which currently are tracked by rules.
Future
Bucardo 1 performed admirably for Backcountry.com for over 4 years. The most serious problems, already mentioned above, have been the lack of a queue for push and pushdelta requests, limitations of running one-off processes from cron, limited row collision resolution, and bogging down under a large insert or update that happens inside a single transaction.
Greg Sabino Mullane then created Bucardo 2, which is a rearchitected system built around all new code. It has all the important features of Bucardo 1, addresses most of Bucardo 1's deficiencies, and adds many of the desired features listed above. We hope to publish some design notes about Bucardo 2 in the near future.
The Name
I originally gave Bucardo 1 the fairly descriptive but uninspiring name "sync-tables". Greg Sabino Mullane came up with the name Bucardo, a reference to the logo of this program's patron, Backcountry.com. You can read about attempts to clone the extinct bucardo in the Wikipedia articles Bucardo and Cloning.
Being at the MySQL User Conference: how Postgres fits in
I spent last week in Santa Clara attending the MySQL User Conference. Friends had clued me in that the conference was going to be a riot - with developers from the many forks of MySQL in attendance, all vying for the spotlight, and to differentiate themselves from the MySQL core code.
The Oracle announcement of acquiring Sun cast an uncertain and uncomfortable light over the talks about forks, community and the future of MySQL. Many people wondered aloud what development on the core of MySQL’s code would be like now, and what would become of the remaining MySQL engineers.
Would the engineers defect to Monty’s new company? Will Oracle end support of MySQL development? How would MySQL end users feel about the changes? Would there be a surge in interest in Postgres, my favorite open source database?
Of course, it’s a bit early to tell. So, I’ve really got two posts about the trip, and this first one is about PostgreSQL, aka Postgres.
There’s a huge opportunity right now for Postgres to tell its story. Not because of a specific failure on the part of MySQL, but because the Oracle acquisition has raised the consciousness of all of mainstream tech. Developers and IT managers are taking a serious look at Postgres for new development projects, and evaluating their database technology choices with an eye toward whatever Oracle decides to do.
In this window of uncertainty is an opportunity for Postgres advocates to explain what it is that draws us to the project.
As a developer and a sysadmin, my enthusiasm for Postgres comes directly from the people that work on the code. The love of their craft - developing beautiful, purpose-built code - is reflected in the product, the mailing lists and the individuals who make up our community.
When someone asks me why I choose Postgres, I have to first answer that it is because of the people I know who are involved in the project. I trust them, and believe that they make the best technology decisions when it comes to the core of the code.
I believe that there’s room for improvement in extending Postgres’ reach, and speaking to people who don’t already believe the same things that we believe: that conforming to the SQL standard is fundamentally a useful and important goal, that vertical scaling is an important design objective, and that consistency is just as important to excellent user experience as are verbose command names and syntactic sugar extensions.
All of those issues are debated when discussing (typically by people outside of the Postgres community) how the Postgres development is prioritized and how this community works. It is inarguable that in the web space, Postgres lost the race. But the initial goal of the project, I’d argue, wasn’t necessarily to be the most popular end-user database. Now, that may have changed... :)
Meantime, the Postgres community continues to mature. There are clear constraints we need to overcome on the people side. Two that I think about frequently are the need for more code reviewers for patch review and testing, and smoothing over our prickly mailing-list reputation by getting more volunteers to respond to requests for information on the lists.
During a particularly raucous panel session at the Percona Performance Conference, a friend in the Postgres community commented that he was so happy that our community didn’t have the issues that the MySQL community has. And I said to him that it’s just a matter of time before we experience those issues if Postgres grows as MySQL has.
We will have issues with forks, conflicts and deep-cutting (founded, or unfounded) criticism. So, my advice to all the people I know in the Postgres community is to pay attention to what is happening with MySQL right now, because we can only benefit from being prepared.
OpenSQL Camp 2008
I attended the OpenSQL Camp last weekend, which ran Friday night to Sunday, November 14-16th. This was the first "unconference" I had been to, and Baron Schwartz did a great job in pulling this all together. I drove down with Bruce Momjian who said that this is the first cross-database conference of any kind since at least the year 2000.
The conference was slated to start at 6 pm, and Bruce and I arrived at our hotel a few minutes before then. Our hotel was at one end of the Charlottesville Downtown Mall, and the conference was at the other end, so we got a quick walking tour of the mall. Seems like a great place - lots of shops, people walking, temporary booths set out, outdoor seating for the restaurants. It reminded me a lot of Las Ramblas, but without the "human statue" performance artists. Having a hotel within walking distance of a conference is a big plus in my book, and I'll go out of my way to find one.
The first night was simply mingling with other people and designing the next day's sessions. There was a grid of talk slots on a wall, with large sticky notes stuck to some of them to indicate already-scheduled sessions. Next to the grid were two sections, where people added sticky notes for potential lightning talks, and for potential regular talks. There were probably about 20 of each type of talk by the end of the night. The idea was to put a check next to any talk you were interested in, although I don't think everyone really got the message about that, judging by the number of checks vs. the number of people. At one point, we gathered in a circle and gave a quick 5-word introduction about ourselves. Mine was "Just Another Perl Postgres Hacker." There were probably around 50-60 or so people there, and the vast majority were from Sun/MySQL. A smaller group of people were non-Sun MySQL people, such as Baron and Sheeri. Coming in as a minority of two were Bruce and myself, representing Postgres (although Saturday saw our numbers swell to three, with the addition of Kelly McDonald). However, the smallest minority was the SQLite contingent, consisting solely of Dr. Richard Hipp (whom it was great to meet in person). Needless to say, I met a lot of MySQL people at this conference! All were very friendly and receptive to Bruce and myself, and it did feel mostly like an open source database conference rather than a MySQL one. Seven of the twenty-one talks were by non-MySQL people, which means we were technically overrepresented. Or had more interesting talks! ;)
After heading back to the room and reviewing my notes before bed, I got up the next day and caught the keynote, given by Brian Aker, about the future of open-source databases. Thanks for the Skype/Postgres shout out, Brian! :) A comment by Jim Starkey at the end of the talk led to an interesting discussion on bot nets, the current kings of cloud computing.
My talk on MVCC was the first talk of the day, which of course means lots of technical difficulties. As usual, my laptop refused to cooperate with the overhead projector. In anticipation of this, I had copied the presentation in PDF format to a USB disk, and ended up using someone else's Mac laptop to give the presentation. (I don't remember whose it was, but thank you!) I've given the talk before, but this was a major rewrite to suit the audience: much less Postgres-specific material, and some details about how other systems implement MVCC, as well as the advantages and disadvantages of both ways. Both Oracle and InnoDB update the actual value on disk, and save changes elsewhere, optimistically assuming that a rollback won't happen. This makes a rollback expensive, as the old diffs must be looked up and applied to the main table. Postgres is pessimistic: rollbacks are not as expensive, because we simply add an entire new row on update, and a rollback simply marks it as no longer valid. Both ways involve some sort of cleaning up of old rows, and handle tradeoffs in different ways. There were some interesting discussions during and after the talk, as Jim Starkey and Ann Harrison weighed in on how other systems (Falcon and Firebird) perform MVCC, and the costs and tradeoffs involved. After the talk, I had some interesting talks with Ann about garbage collection and vacuuming in general.
The next talk was by Dr. Hipp, entitled "How SQL Database Engines Work", which was fascinating as it gave a glance into the inner workings and philosophy of SQLite, whose underlying assumptions about power usage, memory, transactions, portability, and resource usage are radically different from most other database systems. Again there were some interesting discussions with the audience about certain slides during the talk.
The competing talk for that time slot was "Libdrizzle" by Eric Day. While I missed this talk, I did get to talk to him the night before about libdrizzle, among other things. Patrick Galbraith and I tried to explain the monstrosity that is XS to Eric (as he and I maintain DBD::mysql and DBD::Pg respectively), and Eric showed us how PHP does something similar.
My DBIx::Cache talk was sabotaged by Bruce having a better session at the same time, so I attended that instead of giving mine. I'll post the slides for the DBIx::Cache talk on the OpenSQL Camp wiki soon, however. I liked Bruce's talk ("Moving Application Logic Into the Database"), mostly because he was preaching to the choir when talking about putting business logic into the database. There was an interesting discussion about the borrowing of LIMIT and OFFSET from MySQL and putting it into Postgres, and we even helped Richard figure out that he was unknowingly supporting the broken and deprecated Postgres "comma-comma" syntax. Bruce's talk was very polished and interesting. I suspect he may have given talks before. :)
Lunch was catered in, and I talked to many people while eating lunch, indeed over the conference itself. Apparently MySQL 5.1 is finally going to be released, this time for sure, according to first Giuseppe and then Dups. Post-lunch were the lightning talks, which I normally would not miss, but their overall MySQL-centricness and my interest in another session, entitled "MySQL Unconference" by Sheeri K. Cabral, drew me away. Bruce, Sheeri, Giuseppe Maxia, and I talked about the details of such a conference. It was a very interesting perspective: MySQL has the problem of a "one company, and no community" perception, while Postgres suffers from an "all community, and no company" perception. Neither perception is accurate, of course, but there are some seeds of truth to both.
Bruce's second presentation, "Postgres Talks", turned into mostly a wide-ranging discussion between those present (myself, Bruce, Ann, Kelly, Richard, others?) about materialized views, vacuum, building query trees, and other topics.
I bailed out on my fellow Postgres talk "Postgres Extensions" by Kelly McDonald (sorry Kelly). I had already picked his brain about it earlier, so I felt not too much guilt in attending "Atomic Commit In SQLite" by Dr. Hipp. Again, it's fascinating to see things from the SQLite perspective. Not only technically, but how their development is structured is different as well.
I was not feeling well, so I ran back to the hotel to drop off my backpack with super-heavy laptop inside, and thus missed my next planned talk, "Unix Command Line Productivity Tips". If anyone went and can pass on some tips in the comments below, please do so! :)
The final talk I went to was "Join-Fu" by Jay Pipes. I honestly had no idea what this talk would be about, but I actually found it very interesting (and entertaining). Jay is a great speaker, and is not shy about pointing out some of MySQL's weaknesses. The talk was basically a collection of best practices for MySQL, and I actually learned not only things about MySQL I can put to use, but things to apply to Postgres as well. He spent some time on the MySQL query cache as well, which is particularly interesting to me as I'd love to see Postgres get something similar (and until then, people can use DBIx::Cache of course!).
After the final set of presentations was more mingling, eating of some pizza with funky toppings, and planning for the next day's hackathon. All the proposed ideas were MySQL-specific, as to be expected, but Bruce and I actually got some work done that night by looking over the pg_memcached code, prompted by Brian. I had looked it over a little bit a few months ago, but Bruce and I managed to fix a bug and, more importantly, found other people to continue working on it. Don't forget to take the credit when they finish their work, Bruce! :)
All in all, a great time. I would have liked to see the presentations stretched out over two days, and to have seen a greater Postgres turnout, but there's always next year. Thanks to Baron for creating a unique event!
Authorize.Net Transaction IDs to increase in size
As a sign of their success, Authorize.Net is about to issue Transaction ID numbers greater than 2,147,483,647 (2^31 - 1), which is the maximum value of a signed MySQL INT column and of the default Postgres "integer" type.
It probably makes sense to proactively ensure that your transaction ID columns are large enough. This would not be a fun bug to run into after the fact.
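The arithmetic behind that limit is easy to check. The short Python sketch below is only an illustration: the `fits_int32` helper is hypothetical, and it simply tests whether a value fits in a signed 32-bit integer such as a MySQL INT or Postgres "integer" column.

```python
INT32_MAX = 2**31 - 1   # 2,147,483,647: largest signed 32-bit integer
INT64_MAX = 2**63 - 1   # largest signed 64-bit integer (BIGINT)

def fits_int32(transaction_id):
    """Return True if the value fits in a signed 32-bit integer column."""
    return -2**31 <= transaction_id <= INT32_MAX

print(INT32_MAX)                   # 2147483647
print(fits_int32(2_147_483_647))   # True
print(fits_int32(2_147_483_648))   # False
```

A 64-bit column (BIGINT in both MySQL and Postgres) leaves ample headroom for larger IDs.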











