Perl Blog Archive
Custom helper subs in Dancer templates
I recently was writing some code using the Dancer Perl web framework, and had a set of HTML links in the template:
<a href="/">Home</a> | <a href="/contact">Contact</a> | [etc.]
Since it's possible this app could be relocated to a different path, say, /something/deeper instead of merely /, I wanted to use Dancer's handy uri_for() routine to get the full URL, which would include any path relocation. (This concept will be familiar to Interchange 5 users from its [area] and [page] tags.)
The uri_for function isn't available in templates. The easiest way to cope would be to just use it in my route sub where it works fine, and store the results in the template tokens as strings. But then for any new URL needed I would have to update the route sub and the template, and this feels like a quintessential template concern.
I found this blog post explaining how to add custom functions to be used in templates, and it worked great. Now my template can look like this:
<a href="<% uri_for('/') %>">Home</a> |
<a href="<% uri_for('contact') %>">Contact</a> |
[etc.]
And the URLs are output fully qualified:
http://localhost:3000/ http://localhost:3000/contact
Which is not always what I'd want, but in this case is.
The only final concern is that I am using Dancer version 1.3111 and I got this warning upon using the before_template setup mentioned in the blog:
Dancer::before_template has been deprecated since version 1.3080. use hooks!
But use hook 'before_template' => sub {} now instead.
So I updated my code, and the final result looks like this:
hook 'before_template' => sub {
my $tokens = shift;
$tokens->{uri_for} = \&uri_for;
};
And that made both Dancer and me happy.
Slash URL
There's always more to learn in this job. Today I learned that Apache web server is smarter than me.
A typical SEO-friendly solution to Interchange pre-defined searches (item categories, manufacturer lists, etc.) is to put together a URL that includes the search parameter, but looks like a hierarchical URL:
/accessories/Mens-Briefs.html
/manufacturer/Hanes.html
Through the magic of actionmaps, we can serve up a search results page that looks for products which match on the "accessories" or "manufacturer" field. The problem comes when a less-savvy person adds a field value that includes a slash:
accessories: "Socks/Hosiery"
or
manufacturer: "Disney/Pixar"
Within my actionmap Perl code, I wanted to redirect some URLs to the canonical actionmap page (because we were trying to short-circuit a crazy Web spider, but that's beside the point). So I ended up (after several wild goose chases) with:
my $new_path = '/accessories/' .
Vend::Tags->filter({body => (join '%2f' => (grep { /\D/ } @path)),
op => 'urlencode', }) .
'.html';
By this I mean: I put together my path out of my selected elements, joined them with a URL-encoded slash character (%2f), and then further URL-encoded the result. This was counter-intuitive, but as you can see at the first link in this article, it's necessary because Apache is smarter than you. Well, than me anyway.
Getting Started with the Perl Debugger
The Perl debugger is not an easy system to leap into unprepared, especially if you learned to program in the "modern era", with fancy, helpful GUIs and other such things.
So, for those of us who are old school, and those who aren't but wondering what the fuss is about, here's a very gentle introduction to debugging Perl code interactively.
First, an aside. You may think to yourself, "Hey, command-line scripting is passé; nobody does that any more. All my code runs within a website (maybe a modern MVC framework), so how can I make use of a command-line debugger?"
Well, that's where test-driven development and its related methodologies come in. If you have developed using a Perl test framework, you can use the approach outlined here.
The debugger is invoked by using the "-d" switch on the command line. You control the execution of your code with the "n" command:
$ perl -d debug.pl Loading DB routines from perl5db.pl version 1.33 Editor support available. Enter h or `h h' for help, or `man perldebug' for more help. main::(debug.pl:1): my $x = 1; DB<1> n main::(debug.pl:2): $x = $x + 1; "n" steps you through the code from line to line. To step into a subroutine call, use "s", e.g.,
$ perl -d debug.pl ... main::(debug.pl:1): my $x = fx(1); DB<1> s main::fx(debug.pl:4): return; You can switch back and forth between step-over ("n") and step-into ("s") mode; just issue the command you want to use.
Next, let's talk about breakpoints. These are places in the code where you'd like the execution to stop so you can look around and take stock.
You can issue a "temporary" breakpoint with the "c" command:
main::(debug.pl:1): ... DB<1> c 23 main::(debug.pl:23): ... You can set a permanent breakpoint with the "b" command:
main::(debug.pl:1): ... DB<1> b 23 DB<2> c main::(debug.pl:23): ... DB<2> c main::(debug.pl:23): ... And note how the "c" command is used to mean "run until breakpoint (or exit)".
If all you could do with the Perl debugger was step through your program, that would be enough. (You could use "print" statements to see what was going on, but it would be awkward to go back and forth between the debugger and your editor.) Of course, we can do more:
main::(debug.pl:23): $x = get_complex_data_structure($arg); DB<1> x $x 0 HASH(0x1e123c8) 'a' => 1 'b' => 2 DB<2> x [1, 2, { b => sin(0.5) } ] 0 ARRAY(0x1e8e7c0) 0 1 1 2 2 HASH(0x19ff298) 'b' => 0.479425538604203 You can even evaluate complex expressions on the fly! Or invoke your code directly:
DB<1> x get_complex_data_structure($arg) 0 HASH(0x1e123c8) 'a' => 1 'b' => 2 Or set a breakpoint within your code, then invoke it:
DB<1> b My::Pkg::_routine DB<2> x get_complex_data_structure($arg) My::Pkg::_routine(My/Pkg.pm:99): ... I hope this brief introduction whetted your appetite for the debugger. It's a powerful system for exploring unfamiliar code, verifying assumptions about data structures, or tracking down bugs. There are many more debugger commands than I've outlined here, but this should get you started.
Happy debugging!
Insidious List Context
Recently, I fell into a deep pit. Not literally, but a deep pit of Perl debugging. As a result, I'm here to warn you and yours about "Insidious List Context(TM)".
(Note: this is a fairly elementary discussion, for people early in their Perl wizardry training.)
Perl has two contexts for evaluating expressions: list and scalar. (All who know this stuff cold can skip down a ways.) "Scalar" context is what non-Perl languages just call "normal reality", but Perl likes to do things ... differently ... so we have more than one context.
In scalar context, a scalar is a scalar is a scalar, but a list becomes a scalar that represents the number of items in the list. Thus,
@x = (1, 1, 1); # @x is a list of three 1s # vs. $x = (1, 1, 1); # $x is "3", the list size
In list context, a list of things is still a list of things. That's pretty simple, but when you are expecting a scalar and you get a list, your world can get pretty confused.
Okay, now the know-it-alls have rejoined us. I had a Perl hashref being initialized with code something like this:
my $hr = {
KEY1 => $value1,
KEY2 => $value2,
KEY_TROUBLE => (defined($foo) ? mysub($foo) : 1),
KEY3 => $value3,
};
So here is the issue: if mysub() returns a list, then the hashref will get extra data. Remember, Perl n00bs, "=>" is not a magical operator, it's just a "fat comma". So a construction like this:
1 => (2, 3, 4)is really the same as:
1, 2, 3, 4
Here's a complete example to illustrate just what size and shape hole I fell into:
use strict;
use Data::Dumper;
my($value1,$value2,$value3,$foo) = qw(value1 value2 value3 foo);
my $hr = {
KEY1 => $value1,
KEY2 => $value2,
KEY_TROUBLE => (defined($foo) ? mysub($foo) : 1),
KEY3 => $value3,
};
print Data::Dumper->Dumper($hr);
sub mysub {
return qw(junk extrajunk);
}
This outputs:
$VAR1 = 'Data::Dumper';
$VAR2 = {
'extrajunk' => 'KEY3',
'KEY2' => 'value2',
'KEY1' => 'value1',
'value3' => undef,
'KEY_TROUBLE' => 'junk'
};
Now, the actual subroutine involved in my little adventure was even more insidious: it returned a list context because it was evaluating a regular expression, in a list context. Its actual source:
sub is_yes {
return( defined($_[0]) && ($_[0] =~ /^[yYtT1]/));
}
So watch those expression-evaluation contexts; they can turn fairly harmless expressions into code-busters.
Pl/Perl multiplicity issues with PostgreSQL - the Highlander restriction
I came across this error recently for a client using Postgres 8.4:
ERROR: cannot allocate multiple Perl
interpreters on this platform
Most times when you see this error it indicates that someone was trying to use both a Pl/Perl function and a Pl/PerlU function on a server in which Perl's multiplicity flag is disabled. In such a case, only a single Perl interpreter can exist for each Postgres backend, and trying to create a new one (as happens when you execute two functions written in Pl/Perl and Pl/PerlU), the error above is thrown.
However, in this case it was not a combination of Pl/Perl and Pl/PerlU - I confirmed that only Pl/Perl was installed. The error was caused by a slightly less known limitation of a non-multiplicity Perl and Postgres. As the docs mention at the very bottom of the page, "...so any one session can only execute either PL/PerlU functions, or PL/Perl functions that are all called by the same SQL role". So we had two roles both trying to execute some Pl/Perl code in the same session. How is that possible - isn't each session tied to a single role at login? The answer is the SECURITY DEFINER flag for functions, which causes the function to run as if it was being invoked by the role that created the function, not the role that is executing it.
There is still a bit of a gotcha here, because Perl interpreters are created as needed, and thus the order of operations is very important. In other words, you may be able to run function foo() just fine, and run function bar() just fine, but you cannot run them together in the same session! This applies to both the Pl/Perl and Pl/PerlU limitation, as well as the Pl/Perl multiple user limitation.
While Postgres will validate functions as you create them, this is
subject to the same in-session limitation. All of the below examples assume
you have a non multiplicity-enabled Perl
(see
the perlguts manpage for gory details on what multiplicity means in Perl)
. To see what state your Perl is,
you need to determine if the 'usemultiplicity' option is enabled.
The -V option to the perl executable tells it to output all
of its configuration parameters. While the canonical way to check is to issue a
perl -V:usemultiplicity, that's a
hard string to remember, so I simply use grep:
$ perl -V | grep multi
useithreads=define, usemultiplicity=define
The above indicates that Perl has been compiled with multiplicity and thus not subject to the Postgres limitations - you can mix and match Perl functions in your database with abandon. The only problem occurs if the output looks like this:
$ perl -V | grep multi
useithreads=undef, usemultiplicity=undef
Technically, you can also prevent the issue by setting ithreads on, but there really is no reason to not just keep things simpler by setting the multiplicity on.
Watch what happens when we try to create two Perl functions using Postgres 9.2:
postgres=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create language plperl;
CREATE LANGUAGE
test=# create language plperlu;
CREATE LANGUAGE
test=# create or replace function test_perlver()
test-# returns text
test-# language plperl
test-# AS $$ return "Running test_perlver on Perl $^V"; $$;
CREATE FUNCTION
test=# create or replace function test_perlverU()
test-# returns text
test-# language plperlU
test-# AS $$ return "Running test_perlverU on Perl $^V"; $$;
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlveru"
What's going on here? We've already used a perl (Pl/Perl) in *this session*, so we cannot create another one, even if just to compile (but not execute) the function. However, if we start a new session, we can create our Pl/PerlU function!
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create or replace function test_perlverU()
test-# returns text
test-# language plperlU
test-# AS $$ return "Running test_perlverU on Perl $^V"; $$;
CREATE FUNCTION
This Highlander restriction ("there can be only one!") applies to both creation and execution of functions. Notice that we have both the Pl/Perl and Pl/PerlU versions installed, but we can only use one in a particular session - and which one depends on which is called first!:
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# select test_perlver();
test_perlver
--------------------------------------
Running test_perlver on Perl v5.10.0
test=# select test_perlverU();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlveru"
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# select test_perlverU();
test_perlveru
---------------------------------------
Running test_perlverU on Perl v5.10.0
test=# select test_perlver();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "test_perlver"
As you can imagine, the nondeterministic nature of such functions can make discovery and debugging of this issue on production servers tricky. :) Here's the other variant we talked about, in which only the first of two functions - both of which are Pl/Perl - will run:
postgres=# create database test;
CREATE DATABASE
postgres=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# create language plperl;
CREATE LANGUAGE
test=# create or replace function foo()
test-# returns text
test-# language plperl
test-# security invoker
test-# AS $$ return "Running as security invoker"; $$;
CREATE FUNCTION
test=# create or replace function bar()
test-# returns text
test-# language plperl
test-# security definer
test-# AS $$ return "Running as security definer"; $$;
CREATE FUNCTION
Now let's run as the user who created the function - no problemo, because we are the same user that created the function:
test=# \c test postgres
You are now connected to database "test" as user "postgres".
test=# SELECT foo();
foo
-----------------------------
Running as security invoker
(1 row)
test=# SELECT bar();
bar
-----------------------------
Running as security definer
(1 row)
All is well. However, if we try it as a different user, the Highlander restriction creeps in:
test=# \c test greg
You are now connected to database "test" as user "greg".
test=# SELECT foo();
foo
-----------------------------
Running as security invoker
(1 row)
test=# SELECT bar();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "bar"
test=# \c test greg
You are now connected to database "test" as user "greg".
test=# SELECT bar();
bar
-----------------------------
Running as security definer
(1 row)
test=# SELECT foo();
ERROR: cannot allocate multiple Perl interpreters on this platform
CONTEXT: compilation of PL/Perl function "foo"
This one took me a while to figure out on a production system, as somewhere in a twisty maze of trigger functions there was one that was set as security definer. Normally, this was not a problem, as the user that created that function did much of the updates, but a different user invoked a non- security definer function and then the security definer function, causing the error at the top of this article to show up.
So what can one do to prevent this problem from occurring? Luckily, for most people this will not be a problem, as many (if not all) distros and operating systems have the multiplicity compile flag for Perl enabled. If you do have the restriction, one option is to simply be careful about the use of security definer functions. You could either declare everything as security definer, or perhaps make sure that it is only called in a separate session if it really needs to be called by a different user.
A better solution is to recompile your Perl to enable multiplicity. I am not aware of any drawbacks to doing so. In theory, one could even recompile Perl in-place and then restart Postgres, but I have never tried this out. :)
Automatically kill process using too much memory on Linux
Sometimes on Linux (and other Unix variants) a process will consume way too much memory. This is more likely if you have a fair amount swap space configured -- but within the range of normal, for example, as much swap as you have RAM.
There are various methods to try to limit trouble from such situations. You can use the shell's ulimit setting to put a hard cap on the amount of RAM allowed to the process. You can adjust settings in /etc/security/limits.conf on both Red Hat- and Debian-based distros. You can wait for the OOM (out of memory) killer to notice the process and kill it.
But all those remedies don't help in situations where you want a process to be able to use a lot of RAM, sometimes, when there's a point to it and it's not just in an infinite loop that will eventually use all memory.
Sometimes such a bad process will bog the machine down horribly before the OOM killer notices it.
We put together the following script about a year ago to handle such cases:
It uses the Proc::ProcessTable module from Perl's CPAN to do the heavy lifting. We invoke it once per minute in cron. If you have processes eating up memory so quickly that they bring down the machine in less than a minute, you could run it in a loop every few seconds instead.
It's easy to customize based on various attributes of a process. In our example here we have it ignore root processes which are assumed to be better vetted. We have commented out a restriction to watch only for Ruby on Rails processes in Passenger. And we kill only processes using 1 GiB or more RAM.
If a process makes it past these tests and is considered bad, we print out a report that crond emails to us, so we can investigate and ideally fix the problem. Then we try to kill the process gracefully, and after 5 seconds forcibly terminate it.
It's simple, easily customizable, and has come in handy for us.
Web service integration in PHP, jQuery, Perl and Interchange
Jeff Boes presented on one of his latest projects.
CityPass.com decided on a project to convert their checkout from being served by Interchange to have the interface served by PHP, but continue to interact with Interchange for the checkout process through a web service.
The original site was entirely served by Interchange, the client then took on a project to convert the frontend to PHP while leveraging Interchange for frontend logic such as pricing and shipping as well as for backend administration for order fulfillment.
Technologies used in the frontend rewrite:
- PHP
- jQuery for jStorage, back-button support and checkout business logic
- AJAX web services for prices, discounts, click-tracking
The Interchange handler is conduit.am that handles the processing of the URL. From this ActionMap the URLs are decoded and passed to a Perl module, Data.pm, which handles processing the input and returning the results.
An order is just a JSON object so testing of the web service is easy. We have a known hash, we post to the proper URL and compare the results and verify they are the same. New test cases are also easy, we can capture any order (JSON) to a log file and add it as a test case.
XOR ROX
Recently a co-worker posed an interesting issue:
Given a non-zero integer$delta,
an array of structures with two key/value pairs,
{ flag => boolean, quantity => non-zero value }
Sort the array so that the first structures are those where either
flag is true and (sign of$deltaand sign of$quantityare different)
or
flag is false and (sign of$deltaand sign of$quantityare the same)
Secondarily, sort on the absolute value of$quantity.
A solution fairly leaped out at me, but I’m not claiming incredible programming skill: in fact, the solution suggested an XOR operation, which was the second time in about as many weeks that I’d gotten to use XOR in Perl code. (It’s one of those things that you can literally write tens of thousands of lines of code without ever needing, so a second opportunity within the same decade was pretty pleasing in a code-geek kind of way.)
The key to recognizing XOR in your problem solution is a pattern like:
A AND (B != C) or ~A AND (B == C)
or more simply:
(A AND ~B) or (~A AND B)
which is nothing more complex than the expanded equivalent of (A XOR B), from your college symbolic-logic class. The daunting sort problem becomes:
@sorted = sort {(
($a->{flag} xor ($a->{quantity} > 0 xor $delta > 0))
<=>
($b->{flag} xor ($b->{quantity} > 0 xor $delta > 0))
) || abs($a->{quantity} <=> abs($b->{quantity})} @items;That’s more “xor” operations in one statement than I’ve used in the last year.
Make your code search-friendly
Here's something about coding style that you may not have considered: is your code "search-friendly"? That is, does the format of your code help or hinder someone who might be searching it for context while debugging, extending, or just learning how it works?
Seriously Contrived Example (from Perl):
my $string = q{Your transaction could not be} .
q{ processed due to a charge} .
q{ card error.};
return $string;
Now someone's going to experience this error and wonder where it occurs. So armed with grep, or ack, or git-grep, they set off into the wilderness:
$ git grep 'could not be processed' $ git grep 'charge card error' $ git grep -e 'transaction.*charge.*error' $ alsdkjgalkghkf
(The last simulates pounding the keyboard with both fists.) I would suggest humbly that "strings you emit as a line should appear as a line in your code", if for no other reason than that it makes it so much easier for you or others to find them. Thus:
my $string = <<'MSG'; Your transaction could not be processed due to a charge card error. MSG return $string;
Perl, UTF-8, and binmode on filehandles
I recently ran into a Perl quirk involving UTF-8, standard filehandles, and the built-in Perl die() and warn() functions. Someone reported a bug in the check_postgres program in which the French output was displaying incorrectly. That is, when the locale was set to FR_fr, the French accented characters generated by the program were coming out as "byte soup" instead of proper UTF-8. Some other languages, English and Japanese among them, seemed to be fine. For example:
## English: "sorry, too many clients already" ## Japanese: "現在クライアント数が多すぎます" ## French expected: "désolé, trop de clients sont déjà connectés" ## French actual: "d�sol�, trop de clients sont d�j� connect�s"
That last line should be very familiar to anyone who has struggled with Unicode on a command line, with those question marks on an inverted background. Our problem was that the output of the script looked like the last line, rather than the one before it. The Japanese output, despite being chock full of Unicode, does have the same problem! More on that later.
I was able to duplicate the problem easy enough by setting my locale to FR_fr and having check_postgres output a message with some non-ASCII characters in it. However, as noted above, some languages were fine, some were not.
Before going any further, I should point out that this Perl script did have a use utf8; at the top of it, as it should. This does not dictate how things will be read in or output,but merely tells Perl that the source code itself contains UTF-8 characters. Now to the quirky parts.
I normally test my Perl scripts on the fly by adding a quick series of debugging statements to warn()s or die()s. Both go to stderr, so it is easy to separate your debugging statements from normal output of the code. However, when I output a non-ASCII message in question immediately after it was defined in the script, it showed a normal, expected UTF-8 string. So I started tracking things through the code, to see if there was some point at which the apparently normal UTF-8 string gets turned back into byte soup. It never did; I finally realized that although print was outputting byte soup, both warn() and die() were outputting UTF-8! Here's a sample script to better demonstrate the problem:
#!perl use strict; use warnings; use utf8; my $msg = 'This is a micro symbol: µ'; print "print = $msg\n"; warn "warn = $msg\n"; die "die = $msg\n";
Now let's run it and see what happens:
print = This is a micro symbol: � warn = This is a micro symbol: µ die = This is a micro symbol: µ
So we've found one Perl quirk: the output of print() and warn() are different, as warn() manages to correctly output the string as UTF-8. Perhaps it is just that the stdout and stderr filehandles are using different encodings? Let's take a look by expanding the script and explicitly printing to both stdout and stderr. We'll also add some other Unicode characters, to emulate the difference between French and Japanese above:
#!perl use strict; use warnings; use utf8; my $msg = 'This is a micro symbol: µ'; my $alert = 'The radioactive snowmen come in peace: ☢ ☃☃☃ ☮'; print STDOUT "print to STDOUT = $msg\n"; print STDOUT "print to STDOUT = $alert\n"; print STDERR "print to STDERR = $msg\n"; print STDERR "print to STDERR = $alert\n"; warn "warn = $msg\n"; warn "warn = $alert\n";
(Note: if you do not see small literal snowmen characters in the above script, you need to get a better browser or RSS reader!)
print to STDOUT = This is a micro symbol: � Wide character in print at utf12 line 11. print to STDOUT = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮ print to STDERR = This is a micro symbol: � Wide character in print at utf12 line 14. print to STDERR = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮ warn = This is a micro symbol: µ warn = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
There are a number of things to note here. First, that the stderr filehandle has the same problem as the stdout filehandle. So, while warn() and die() send things to stderr, there is some magic happening behind the scenes such
that sending a string to them is *not* the same as sending it to stderr ourselves via a print statement. Which is a good thing overall, as it would be more weird for stdout and stderr to have different encoding layers! The solution to this is simple enough: just force stdout to have the proper encoding by use of the binmode function:
binmode STDOUT, ':utf8';
Indeed, the one line above solved the original poster's problem; applying it to our test script shows that the stdout filehandle now outputs things correctly, unlike the stderr filehandle:
print to STDOUT = This is a micro symbol: µ print to STDOUT = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮ print to STDERR = This is a micro symbol: � Wide character in print at utf12 line 16. print to STDERR = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮ warn = This is a micro symbol: µ warn = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
The next thing to notice is that the snowmen alert message is displayed properly everywhere. Why is this? The answer lies in that the micro symbol (and the accented French characters) fall into a range that *could* still be
ASCII, as far as Perl is concerned. What happens is that, in the lack of any explicit guidance, Perl makes a best guess as to whether a string to be outputted contains UTF-8 characters or not. In the case of the French and "micro" strings, it guessed wrong, and the characters were output as ASCII. In the case of the Japanese and "snowmen" strings, it still guessed wrong, even though the strings contained higher bytes that left no doubt that we had left ASCII-land and were exploring the land of Unicode. In other words, even though they were still not coming out as pure UTF-8, there is no direct ASCII equivalent so they appear as the characters one would expect. Note, however, that Perl still emits a wide character warning, for it recognizes that something is probably wrong. The warnings go away when we use binmode to force the encoding layer to :utf8
The correct solution when dealing with UTF-8 is to be explicit and not let Perl make any guesses. Solutions to this vary, but the combination here of adding use utf8; and binmode STDOUT, ':utf8';. While I was able to duplicate the problem right away, the combination of Perl making inconsistent guesses and the odd behavior of warn() and die() turned this from a quick fix into a slightly longer investigation. Yes, Unicode and Perl has given me quite a few gray hairs over the years, but I always feel better when I look at how *other* languages handle Unicode. :)
Lock up your keys
Locking hash keys with Hash::Util
It’s a given that you shouldn’t write Perl without “use strict”; it prevents all kinds of silent bugs involving misspelled and uninitialized variables. A similar aid for misspelled and uninitialized hash keys exists in the module “Hash::Util”.
By way of background: I was working on a long chunk of code that prepares an e-commerce order for storage in a database. Many of the incoming fields map directly to the table, but others do not. The interface between this code and the page which submits a large JSON structure was in flux for a while, so from time to time I had to chase bugs involving “missing” or “extra” fields. I settled on a restricted hash to help me squash these and future bugs.
The idea of a restricted hash is to clamp down on Perl’s rather loose “record” structure (by which I mean the common practice of using a hash to represent a record with named fields), which is great in some circumstances. While in most programming languages you must pre-declare a structure and live with it, in Perl hashes you can add new keys on the fly, misspellings and all. A restricted hash can only have a particular set of keys, but is still a hash for all other purposes.
An example:
my %hash = (aaa => 1, bbb => 2);
Attempts to reference $hash{ccc} will not return an error, but only an undefined value. We can now lock the hash so that its current roster of keys will be constant:
use Hash::Util qw(lock_keys);
lock_keys(%hash);
and now $hash{ccc} is not only undefined, it’s a run-time error:
$hash{ccc};
Attempt to access disallowed key 'ccc' in a restricted hash
If we know the list of keys before the hash is initialized, we can set it up like this:
my %hash;
lock_keys(%hash, qw(aaa bbb ccc));
Keep in mind the values of $hash{aaa}, etc. are mutable (can be undefined, not exist, scalars, references, etc.), just like a normal hash.
What if our key roster needs to change over the course of the program? In my example, there were several kinds of transactions being sent via JSON, and I needed to validate and restrict fields based on the presence and values of other fields. E.g.,
if ($hash{record_type} eq 'A') {
# validate %hash for aaa, bbb, ccc
}
else {
# validate %hash for aaa, bbb, ddd; ccc should not appear
}
You can add to or modify the accepted keys as you go, but it’s a two-step process: not even Hash::Util can modify the keys of a locked hash, so you have to unlock and re-lock:
my %hash;
lock_keys(%hash, qw(record_type aaa bbb ccc));
# …
unlock_keys(%hash);
if ($hash{record_type} eq 'A') {
lock_keys(%hash, qw(record_type aaa bbb ccc));
}
else {
lock_keys(%hash, qw(record_type aaa bbb ddd));
}
Of course, that’s kind of wordy: we’d really rather just splice in a key here and there. Hash::Util has you covered, because you can retrieve the list of legal keys for a hash (even if it’s not currently locked):
lock_keys_plus(%hash, qw(ddd));
adds ‘ddd’ to the list, keeping the previous keys as well. However, if any of the legal keys are not current keys, they won’t make it into the key roster. Instead, use:
lock_keys_plus(%hash, (legal_keys(%hash), qw(more keys here)));
Everything shown here for hashes is also available for hashrefs: for instance, to lock up a hashref $hr:
lock_ref_keys($hr);
unlock_ref_keys($hr);
lock_ref_keys_plus($hr, (legal_ref_keys($hr), qw(other keys)));
Of course, adding all this locking and unlocking adds complexity to your code, so you should consider carefully whether it’s justified. In my case I had 60+ keys, in a nested structure, spanning 1500 lines of code – I just could not keep all the correct spellings in my head any more, so now when I write
if ($opt->{order_status})
when I mean “transaction_status”, I’ll get a helpful run-time error instead of a silent skip of that block of code.
Are there other approaches? Yes, depending on your needs: JSON::Schema, for instance, will let you validate a JSON structure against a “golden master”. However, it does not prevent subsequent assignments to the structure, creating new keys on the fly (possibly in error). Moose would support a restricted object like this, but may add more complexity than you need, so Hash::Util may be the appropriate, lighter-weight approach.
Protecting and auditing your secure PostgreSQL data

PostgreSQL functions can be written in many languages. These languages fall into two categories, 'trusted' and 'untrusted'. Trusted languages cannot do things "outside of the database", such as writing to local files, opening sockets, sending email, connecting to other systems, etc. Two such languages are PL/pgSQL and and PL/Perl. For "untrusted" languages, such as PL/PerlU, all bets are off, and they have no limitations placed on what they can do. Untrusted languages can be very powerful, and sometimes dangerous.
One of the reasons untrusted languages can be considered dangerous is that they can cause side effects outside of the normal transactional flow that cannot be rolled back. If your function writes to local disk, and the transaction then rolls back, the changes on disk are still there. Working around this is extremely difficult, as there is no way to detect when a transaction has rolled back at the level where you could, for example, undo your local disk changes.
However, there are times when this effect can be very useful. For example, in a recent thread on the PostgreSQL "general" mailing list (aka pgsql-general), somebody asked for a way to audit SELECT queries into a logging table that would survive someone doing a ROLLBACK. In other words, if you had a function named weapon_details() and wanted to have that function log all requests to it by inserting to a table, a user could simply run the query, read the data, and then rollback to thwart the auditing:
BEGIN;
SELECT weapon_details('BFG 9000'); -- also inserts to an audit table
ROLLBACK; -- inserts to the audit table are now gone!
Certainly there are other ways to track who is using this query, the most obvious being by enabling full Postgres logging (by setting log_statement = 'all' in your postgresql.conf file.) However, extracting that information from logs is no fun, so let's find a way to make that INSERT stick, even if the surrounding function was rolled back.
Stepping back for one second, we can see there are actually two problems here: restricting access to the data, and logging that access somewhere. The ultimate access restriction is to simply force everyone to go through your custom interface. However, in this example, we will assume that someone has psql access and needs to be able to run ad hoc SQL queries, as well as be able to BEGIN, ROLLBACK, COMMIT, etc.
Let's assume we have a table with some Very Important Data inside of it. Further, let's establish that regular users can only see some of that data, and that we need to know who asked for what data, and when. For this example, we will create a normal user named Alice:
postgres=> CREATE USER alice; CREATE ROLE
We need a way to tell which rows are suitable for people like Alice to view. We will set up a quick classification scheme using the nifty ENUM feature of PostgreSQL:
postgres=> CREATE TYPE classification AS ENUM ( 'unclassified', 'restricted', 'confidential', 'secret', 'top secret' ); CREATE TYPE
Next, as a superuser, we create the table containing sensitive information, and populate it:
postgres=> CREATE TABLE weapon (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
cost TEXT NOT NULL,
security_level CLASSIFICATION NOT NULL,
description TEXT NOT NULL DEFAULT 'a fine weapon'
);
NOTICE: CREATE TABLE will create implicit sequence "weapon_id_seq" for serial column "weapon.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "weapon_pkey" for table "weapon"
CREATE TABLE
postgres=> INSERT INTO weapon (name,cost,security_level) VALUES
('Crowbar', 10, 'unclassified'),
('M9', 200, 'restricted'),
('M16A2', 300, 'restricted'),
('M4A1', 400, 'restricted'),
('FGM-148 Javelin', 700, 'confidential'),
('Pulse Rifle', 50000, 'secret'),
('Zero Point Energy Field Manipulator', 'unknown', 'top secret');
INSERT 0 7
We don't want anyone but ourselves to be able to access this table, so for safety, we make some explicit revocations. We'll examine the permissions before and after we do this:
postgres=> \dp weapon
Access privileges
Schema | Name | Type | Access privileges | Column access privileges
--------+--------+-------+-------------------+--------------------------
public | weapon | table | |
postgres=> REVOKE ALL ON TABLE weapon FROM public;
REVOKE
postgres=> \dp weapon
Access privileges
Schema | Name | Type | Access privileges | Column access privileges
--------+--------+-------+---------------------------+--------------------------
public | weapon | table | postgres=arwdDxt/postgres |
As you can see, what the REVOKE really does is remove the implicit "no permission" and grant explicit permissions to only the postgres user to view or modify the table. Let's confirm that Alice cannot do anything with that table:
postgres=> \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> postgres=> SELECT * FROM weapon; ERROR: permission denied for relation weapon postgres=> postgres=> UPDATE weapon SET id = id; ERROR: permission denied for relation weapon
Alice does need to have access to parts of this table, so we will create a "wrapper function" that will query the table for us and return some results. By declaring this function as SECURITY DEFINER, it will run as if the person who created the function invoked it - in this case, the postgres user. For this example, we'll be letting Alice see the "cost and description" of exactly one item at a time. Further, we are not going to let her (or anyone else using this function) view certain items. Only those items classified as "confidential" or lower can be viewed (i.e. "confidential", "restricted", or "unclassified"). Here's the first version of our function:
postgres=> CREATE LANGUAGE plperlu;
CREATE LANGUAGE
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$
use strict;
use warnings;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
CREATE FUNCTION
The above should be fairly self-explanatory. We are using PL/Perl's built-in database access functions, such as spi_prepare, to do the actual querying. Let's confirm that this works as it should for Alice:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon_details('crowbar');
name | cost | description
---------+------+---------------
Crowbar | 10 | a fine weapon
(1 row)
postgres=> SELECT * FROM weapon_details('anvil');
name | cost | description
------+------+-------------
(0 rows)
postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR: Sorry, you are not allowed to view information on that weapon!
CONTEXT: PL/Perl function "weapon_details"
Now that we have solved the restricted access problem, let's move on the auditing. We will create a simple table to hold information about who accessed what and when:
postgres=> CREATE TABLE data_audit ( tablename TEXT NOT NULL, arguments TEXT NULL, results INTEGER NULL, status TEXT NOT NULL DEFAULT 'normal', username TEXT NOT NULL DEFAULT session_user, txntime TIMESTAMPTZ NOT NULL DEFAULT now(), realtime TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp() ); CREATE TABLE
The 'tablename' column simply records which table they are getting data from. The 'arguments' is a free-form field describing what they were looking for. The 'results' column shows how many matching rows were found. The 'status' column will be used primarily to log unusual requests, such as the case where Alice looks for a forbidden item. The 'username' column records the name of the user doing the searching. Because we are using functions with SECURITY DEFINER set, this needs to be session_user, not current_user, as the latter will switch to 'postgres' within the function, and we want to log the real caller (e.g. 'alice'). The final two columns tell us then the current transaction started, and the exact time when an entry was made inside of this table. As a first attempt, we'll have our function do some simple inserts to this new data_audit table:
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT)
RETURNS TABLE (name TEXT, cost TEXT, description TEXT)
LANGUAGE plperlu
SECURITY DEFINER
AS $bc$
use strict;
use warnings;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = delete $row->{allowed};
## Log this request
$SQL = 'INSERT INTO data_audit(tablename,arguments,results,status)
VALUES ($1,$2,$3,$4)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
$sth = spi_prepare($SQL, 'TEXT', 'TEXT', 'INTEGER', 'TEXT');
spi_exec_prepared($sth, 'weapon', $name, $rv->{processed}, $status);
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
However, this fails the case pointed out in the original poster's email about viewing the data within a transaction that is then rolled back. It also fails to work at all when a forbidden item is requested, as that insert is rolled back by the die() call:
postgres=> \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT * FROM weapon_details('crowbar');
name | cost | description
---------+------+---------------
Crowbar | 10 | a fine weapon
(1 row)
postgres=> SELECT * FROM weapon_details('pulse rifle');
ERROR: Sorry, you are not allowed to view information on that weapon!
CONTEXT: PL/Perl function "weapon_details"
postgres=> BEGIN;
BEGIN
postgres=> SELECT * FROM weapon_details('m9');
name | cost | description
------+------+---------------
M9 | 200 | a fine weapon
(1 row)
postgres=> ROLLBACK;
ROLLBACK
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=> SELECT * FROM data_audit \x \g
Expanded display is on.
-[ RECORD 1 ]----------------------------
tablename | weapon
arguments | crowbar
results | 1
status | normal
username | alice
txntime | 2012-01-30 17:37:39.497491-05
realtime | 2012-01-30 17:37:39.545891-05
How do we get around this? We need a way to commit something that will survive the surrounding transaction's rollback. The closest thing Postgres has to such a thing at the moment is to connect back to the database with a new and entirely separate connection. Two such popular ways to do so are with the dblink program and the PL/PerlU language. Obviously, we are going to focus on the latter, but all of this could be done with dblink as well. Here are the additional steps to connect back to the database, do the insert, and then leave again:
postgres=> CREATE OR REPLACE FUNCTION weapon_details(TEXT) RETURNS TABLE (name TEXT, cost TEXT, description TEXT) LANGUAGE plperlu SECURITY DEFINER VOLATILE AS $bc$
use strict;
use warnings;
>use DBI;
## The item they are looking for
my $name = shift;
## We will be nice and ignore the case and any whitespace
$name =~ s{^\s*(\S+)\s*$}{lc $1}e;
## What is the maximum security_level that people who are
## calling this function can view?
my $seclevel = 'confidential';
## Query the table and pull back the matching row
## We need to differentiate between "not found" and "not allowed",
## by comparing a passed-in level to the security_level for that row.
my $SQL = q{
SELECT name,cost,description,
CASE WHEN security_level <= $1 THEN 1 ELSE 0 END AS allowed
FROM weapon
WHERE LOWER(name) = $2};
## Run the query, pull back the first row, as well as the allowed column value
my $sth = spi_prepare($SQL, 'CLASSIFICATION', 'TEXT');
my $rv = spi_exec_prepared($sth, $seclevel, $name);
my $row = $rv->{rows}[0];
my $allowed = defined $row ? delete $row->{allowed} : 1;
## Log this request
$SQL = 'INSERT INTO data_audit(username,tablename,arguments,results,status)
VALUES (?,?,?,?,?)';
my $status = $rv->{rows}[0] ? $allowed ? 'normal' : 'forbidden' : 'na';
my $dbh = DBI->connect('dbi:Pg:service=auditor', '', '',
{AutoCommit=>0, RaiseError=>1, PrintError=>0});
$sth = $dbh->prepare($SQL);
my $user = spi_exec_query('SELECT session_user')->{rows}[0]{session_user};
$sth->execute($user, 'weapon', $name, $rv->{processed}, $status);
$dbh->commit();
## Did we find anything? If not, simply return undef
if (! $rv->{processed}) {
return undef;
}
## Throw an exception if we are not allowed to view this row
if (! $allowed) {
die qq{Sorry, you are not allowed to view information on that weapon!\n};
}
## Return the requested data
return_next($row);
$bc$;
CREATE FUNCTION
Note that because we are making external changes, we marked the function as VOLATILE, which ensures that it will always be run every time it is called, and not cached in any form. We are also using a Postgres service file with the 'db:Pg:service=auditor'. This means that the connection information (username, password, database) is contained in an external file. This is not only tidier than hard-coding those values into this function, but safer as well, as the function itself can be viewed by Alice. Finally, note that we are passing the 'username' directly into the function this time, as we have a brand new connection which is no longer linked to the 'alice' user, so we have to derive it ourselves from "SELECT session_user" and then pass it along.
Once this new function is in place, and we re-run the same queries as we did before, we see three entries in our audit table:
postgres=> \c postgres postgres You are now connected to database "postgres" as user "postgres". Expanded display is on. -[ RECORD 1 ]---------------------------- tablename | weapon arguments | crowbar results | 1 status | normal username | alice txntime | 2012-01-30 17:56:01.544557-05 realtime | 2012-01-30 17:56:01.54569-05 -[ RECORD 2 ]---------------------------- tablename | weapon arguments | pulse rifle results | 1 status | forbidden username | alice txntime | 2012-01-30 17:56:01.559532-05 realtime | 2012-01-30 17:56:01.561225-05 -[ RECORD 3 ]---------------------------- tablename | weapon arguments | m9 results | 1 status | normal username | alice txntime | 2012-01-30 17:56:01.573335-05 realtime | 2012-01-30 17:56:01.574989-05
So that's the basic premise of how to solve the auditing problem. For an actual production script, you would probably want to cache the database connection by sticking things inside of the special %_SHARED hash available to PL/Perl and Pl/PerlU. Note that each user gets their own version of that hash, so Alice will not be able to create a function and have access to the same %_SHARED hash that the postgres user has access to. It's probably a good idea to simply not let users like Alice use the language at all. Indeed, that's the default when we do the CREATE LANGUAGE call as above:
postgres=> \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> CREATE FUNCTION showplatform() RETURNS TEXT LANGUAGE plperlu AS $bc$ return $^O; $bc$; ERROR: permission denied for language plperlu
Further refinements to the actual script might include refactoring the logging bits to a separate function, writing some of the auditing data to a file on the local disk, recording the actual results returned to the user, and sending the data to another Postgres server entirely. For that matter, as we are using DBI, you could send it to other place entirely - such as a MySQL, Oracle, or DB2 database!
Another place for improvement would be associating each user with a security_level classification, such that any user could run the function and only see things at or below their level, rather than hard-coding the level as "confidential" as we have done here. Another nice refinement might be to always return undef (no matches) for items marked "top secret", to prevent the very existence of a top secret weapon from being deduced. :)
Our SoftLayer API tools
We do a lot of our hosting at SoftLayer, which seems to be one of the hosts with the most servers in the world -- they claim to have over 100,000 servers as of last month. More important for us than sheer size are many other fine attributes that SoftLayer has, in no particular order:
- a strong track record of reliability
- responsive support
- datacenters around the U.S. and some in Europe and Asia
- solid power backup
- well-connected redundant networks with multiple 10 Gbps uplinks
- gigabit Ethernet pipes all the way to the Internet
- first-class IPv6 support
- an internal private network with no data transfer charge
- Red Hat Enterprise Linux offered at no extra charge
- diverse dedicated server offerings at many price & performance points
- some disk partitioning options (though more flexibility here would be nice, especially with LVM for the /boot and / filesystems)
- fully automated provisioning, without salesman & quote hassles for standard offerings
- 3000 GB data transfer per month included standard with most servers
- month-to-month contracts
- reasonable prices (though we can of course always use lower prices, we'll take quality over cheapness for most of our hosting needs!)
- no arbitrary port blocks (some other providers rate-limit incoming TCP connections on port 22 to slow down ssh dictionary attacks, while others forbid IRC, etc.)
- a web service API for monitoring and controlling many aspects of our account via REST/JSON or SOAP
(No, they're not paying me for writing this! But they really have nice offerings.)
It is this last item, the SoftLayer API, that I want to elaborate on here.
The SoftLayer Development Network features API information and documentation and once you have an API account set up in the management website (quick and easy to do), you can start automating all sorts of tasks, from provisioning new hosts, monitoring your upcoming invoice or other accounting information, and much more.
I've released as open source two scripts we use: One is for managing secondary DNS domains in SoftLayer's DNS servers, from a primary name server running BIND 9. The other is a Nagios check script for monitoring monthly data transfer used and alerting when over a set threshold or over the monthly allotment.
See the GitHub repository of endpoint-softlayer-api if they would be useful to you, or to use as a starting point to interface with other SoftLayer APIs.
Interchange loops using DBI Slice
One day I was reading through the documentation on search.cpan.org for the DBI module and ran across an attribute that you can use with selectall_arrayref() that creates the proper data structure to be used with Interchange's object.mv_results loop attribute. The attribute is called Slice which causes selectall_arrayref() to return an array of hashrefs instead of an array of arrays. To use this you have to be working in global Perl modules as Safe.pm will not let you use the selectall_arrayref() method.
An example of what you could use this for is an easy way to generate a list of items in the same category. Inside the module, you would do like this:
my $results = $dbh->selectall_arrayref(
q{
SELECT
sku,
description,
price,
thumb,
category,
prod_group
FROM
products
WHERE
category = ?},
{ Slice => {} },
$category
);
$::Tag->tmpn("product_list", $results);
In the actual HTML page, you would do this:
<table cellpadding=0 cellspacing=2 border=1>
<tr>
<th>Image</th>
<th>Description</th>
<th>Product Group</th>
<th>Category</th>
<th>Price</th>
</tr>
[loop object.mv_results=`$Scratch->{product_list}` prefix=plist]
[list]
<tr>
<td><a href="/cgi-bin/vlink/[plist-param sku].html"><img src="[plist-param thumb]"></a></td>
<td>[plist-param description]</td>
<td>[plist-param prod_group]</td>
<td>[plist-param category]</td>
<td>[plist-param price]</td>
</tr>
[/list]
[/loop]
</table>
We normally use this when writing ActionMaps and using some template as our setting for mv_nextpage.
Sanitizing supposed UTF-8 data
As time passes, it's clear that Unicode has won the character set encoding wars, and UTF-8 is by far the most popular encoding, and the expected default. In a few more years we'll probably find discussion of different character set encodings to be arcane, relegated to "data historians" and people working with legacy systems.
But we're not there yet! There's still lots of migration to do before we can forget about everything that's not UTF-8.
Last week I again found myself converting data. This time I was taking data from a PostgreSQL database with no specified encoding (so-called "SQL_ASCII", really just raw bytes), and sending it via JSON to a remote web service. JSON uses UTF-8 by default, and that's what I needed here. Most of the source data was in either UTF-8, ISO Latin-1, or Windows-1252, but some was in non-Unicode Chinese or Japanese encodings, and some was just plain mangled.
At this point I need to remind you about one of the most unusual aspects of UTF-8: It has limited valid forms. Legacy encodings typically used all or most of the 255 code points in their 8-byte space (leaving point 0 for traditional ASCII NUL). While UTF-8 is compatible with 7-bit ASCII, it does not allow any possible 8-bit byte in any position. See the Wikipedia summary of invalid byte sequences to know what can be considered invalid.
We had no need to try to fix the truly broken data, but we wanted to convert everything possible to UTF-8 and at the very least guarantee no invalid UTF-8 strings appeared in what we sent.
I previously wrote about converting a PostgreSQL database dump to UTF-8, and used the Perl CPAN module IsUTF8.
I was going to use that again, but looked around and found an even better module, exactly targeting this use case: Encoding::FixLatin, by Grant McLean. Its documentation says it "takes mixed encoding input and produces UTF-8 output" and that's exactly what it does, focusing on input with mixed UTF-8, Latin-1, and Windows-1252.
It worked as advertised, very well. We would need to use a different module to convert some other legacy encodings, but in this case this was good enough and got the vast majority of the data right.
There's even a standalone fix_latin program designed specifically for processing Postgres pg_dump output from legacy encodings, with some nice examples of how to use it.
One gotcha is similar to a catch that David Christensen reported with the Encode module in a blog post here about a year ago: If the Perl string already has the UTF-8 flag set, Encoding::FixLatin immediately returns it, rather than trying to process it. So it's important that the incoming data be a pure byte stream, or that you otherwise turn off the UTF-8 flag, if you expect it to change anything.
Along the way I found some other CPAN modules that look useful for cases where I need more manual control than Encoding::FixLatin gives:
- Search::Tools::UTF8 - test for and/or fix bad ASCII, Latin-1, Windows-1252, and UTF-8 strings
- Encode::Detect - use Mozilla's universal charset detector and convert to UTF-8
- Unicode::Tussle - ridiculously comprehensive set of Unicode tools that has to be seen to be believed
Once again Perl's thriving open source/free software community made my day!
Changing postgresql.conf from a script
The modify_postgres_conf script for Postgres allows you to change your postgresql.conf file from the command line, via a cron job, or any time when you want to automate the process.
Postgres runs as a background daemon. The configuration parameters it runs with are stored in a file named postgresql.conf. To change the behavior of Postgres, one must usually edit this file, and then tell Postgres that you have made the changes. Sometimes all that is needed is to 'HUP' or reload Postgres. Most changes fall into this category. Other changes require a full restart of Postgres, which entails disconnecting all current clients.
Thus, to make a change, one must edit the file, find the item to change (the file consists of "name = value" lines), change it, then send a signal to the main Postgres process so it picks up the change. Finally, you should then connect to Postgres to make sure it is still running and has accepted the latest change.
Doing this automatically (such as via a cron script) is very difficult. One method, if you are doing something simple like toggling between two known configuration files, is to simply store copies of both files and replace them, like this example cronjob:
30 10 * * * cp -f conf/postgresql.conf.1 /etc/postgresql.conf; /etc/init.d/postgresql reload 50 10 * * * cp -f conf/postgresql.conf.2 /etc/postgresql.conf; /etc/init.d/postgresql reload
The major problem with that approach, as I quickly learned when I tried it, is that despite nobody making changes to the postgresql.conf file in *years*, a few days after I put the above change in place, someone decided to edit postgresql.conf. At 10:30AM the next day, their changes were blown away. A better way is to simply write a program to make the change for you. Thus, the modify_postgres_conf.pl script.
The basic usage is to tell the script where the conf file is, and list what changes you want to make. Here's an example that will change the random_page_cost to 2 on a Debian system:
./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2
Here is exactly what the script does for the above statement:
- For each item to be changed, we:
- Ask the database what the current value is (and die if that parameter does not exist)
- If the current and new value are the same, do nothing
- Otherwise, open (and flock) the configuration file and change the parameter
- If no changes were made, exit
- Otherwise, close the configuration file
- Figure out the Postgres PID and send it a HUP signal
- Reconnect to the database and confirm each change has taken effect
By default, it adds a comment after the changed value as well, to help in tracking down who made the change. A diff of the postgresql.conf file after running the example above produces:
diff -r1.1 postgresql.conf 499c499 < random_page_cost = 4 --- > random_page_cost = 2 ## changed by modify_postgres_conf.pl on Wed Aug 10 13:31:34 2011
The addition of the comment can be stopped by added a --no-comment argument. If the script runs successfully, it also returns two items of information: the size and name of the current Postgres log file. This is useful so you can know exactly where in the log this change took place. Note that this only works for items that are already explicitly set in your configuration file. However, as discussed before, you should already have all the items that you may possibly change explicitly listed out at the bottom of the file already. Whitespace is preserved as well, for those (like me) who like to keep things lined up neatly inside the file (see examples in the link above).
Here are some more examples of the script in action:
$ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2 114991 /var/log/postgres/postgres-2011-08-10.log $ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf --change random_page_cost=2 No change made: value of "random_page_cost" is already 2 $ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \ > --change random_page_cost=2 \ > --change log_statement=ddl \ > --change log_min_duration_statement=100 No change made: value of "random_page_cost" is already 2 118459 /var/log/postgres/postgres-2011-08-10.log $ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \ > --change default_statitics_target=200 --no-comment There is no Postgres variable named "default_statitics_target"! $ ./modify_postgres_conf.pl --pgconf /etc/postgresql/9.0/main/postgresql.conf \ > --change default_statistics_target=200 --no-comment 123396 /var/log/postgres/postgres-2011-08-10.log
Note that we make no attempt to automatically check changes in to version control: as you will see in an upcoming blog post on a real-life use case, such a checkin is usually not wanted, as we are making temporary changes.
This is a fairly simple Perl script, but I thought I would put it out there in the hopes of helping others out (and preventing the reinventing of wheels). Of course, if you find a bug or want to write a patch for it, those are welcome additions at any time! The code can be found on github:
git clone git://git@github.com:bucardo/modify_postgres_config.git
DevCamps news
DevCamps is a system for managing development, integration, staging, and production environments. It was developed by End Point for, and with the help of, some of our ecommerce clients. It grew over the space of several years, and really started to become its own standalone project in 2007.
Camps are a behind-the-scenes workhorse of our web application development at End Point, and don't always get much attention because everyone's too busy using camps to get work done! But this summer a few things are happening.
In early July we unveiled a redesign of the devcamps.org website that features a more whimsical look, a better explanation of what camps are all about, and endorsements by business and developer users. Marko Bijelic of Hipinspire did the design. Take a look:
In less than two weeks, on August 17, I'm going to be giving a talk on camps at YAPC::EU in Riga, Latvia. YAPC::EU is Europe's annual Perl conference, and will be a nice place to talk about camps.
Many Perl developers are doing web applications, which is camps' main focus, so that's reason enough. But camps also started around the Interchange application server, which is written in Perl. And the camp system is currently implemented in Perl as well.
We've set up a lot of camp systems for Perl web applications. So even though we've also set up camp systems for web applications using Ruby on Rails, Sinatra, Django, and PHP, it's a nice homecoming to talk about camps to Perl enthusiasts.
DBD::Pg UTF-8 for PostgreSQL server_encoding
We are preparing to make a major version bump in DBD::Pg, the Perl interface for PostgreSQL, from the 2.x series to 3.x. This is due to a reworking of how we handle UTF-8. The change is not going to be backwards compatible, but will probably not affect many people. If you are using the pg_enable_utf8 flag, however, you definitely need to read on for the details.
The short version is that DBD::Pg is going return all strings from the Postgres server with the Perl utf8 flag on. The sole exception will be databases in which the server_encoding is SQL_ASCII, in which case the flag will never be turned on.
For backwards compatibility and fine-tuning control, there is a new attribute called pg_utf8_strings that can be set at connection time to override the decision above. For example, if you need your connection to return byte-soup, non-utf8-marked strings, despite coming from a UTF-8 Postgres database, you can say:
my $dsn = 'dbi:Pg:dbname=foobar';
my $dbh = DBI->connect($dsn, $dbuser, $dbpass,
{ AutoCommit => 0,
RaiseError => 0,
PrintError => 0,
pg_utf8_strings => 0,
}
);
Similarly, you can set pg_utf8_strings to 1 and it will force settings returned strings as utf8, even if the backend is SQL_ASCII. You should not be using SQL_ASCII of course, and certainly not forcing the strings returned from it to UTF-8. :)
All Perl variables (be they strings or otherwise) are actually Perl objects, with some internal attributes defined on them. One of those is the utf8 flag, which can be flipped on to indicate that the string should be treated as possibly containing multi-byte characters, or it can be left off, to indicate the string should always be treated on a byte-by-byte basis. This will affect things like the Perl length function, and the Perl \w regex flag. This is completely unrelated to the Perl pragma use utf8, which DBD::Pg has nothing at all to do with. Have I mentioned that UTF-8, and UTF-8 in Perl in particular, can be quite confusing?
There are a few exceptions as to what things DBD::Pg will mark as utf8. Integers and other numbers will not, boolean values will not, and no bytea data will ever have the flag set. When in doubt, assume that it is set.
The old attribute, pg_enable_utf8, will be deprecated, and have no effect. We thought about re-using that but it seemed clearer and cleaner to simply create a new variable (pg_utf8_strings), as the behavior has significantly changed.
A beta version of DBD::Pg (2.99.9_1) with these changes has been uploaded to CPAN for anyone to experiment with. Right now, none of this is set in stone, but we did want to get a working version out there to start the discussion and see how it interacts with applications that were making use of the pg_enable_utf8 flag. You can web search for "dbdpg" and look for the "Latest Dev. Release", or jump straight to the page for DBD::Pg 2.99.9_1. The trailing underscore is a CPAN convention that indicates this is a development version only, and thus will not replace the latest production version (2.18.1 as of this writing).
As a reminder, DBD::Pg has switched to using git, so you can follow along with the development with:
git clone git://bucardo.org/dbdpg.git
There is also a commits mailing list you can join to receive notifications of commits as they are pushed to the main repo. To sign up, send an email to dbd-pg-changes-subscribe@perl.org.
MongoDB replication from Postgres using Bucardo
One of the features of the upcoming version of Bucardo (a replication system for the PostgreSQL RDBMS) is the ability to replicate data to things other than PostgreSQL databases. One of those new targets is MongoDB, a non-relational 'document-based' database. (to be clear, we can only use MongoDB as a target, not as a source)
To see this in action, let's setup a quick example, modified from the earlier blog post on running Bucardo 5. We will create a Bucardo instance that replicates from two Postgres master databases to a Postgres database target and a MongoDB instance target. We will start by setting up the prerequisites:
sudo aptitude install postgresql-server \ perl-DBIx-Safe \ perl-DBD-Pg \ postgresql-contrib
Getting Postgres up and running is left as an exercise to the reader. If you have problems, the friendly folks at #postgresql on irc.freenode.net will be able to help you out.
Now for the MongoDB parts. First, we need the server itself. Your distro may have it already available, in which case it's as simple as:
aptitude install mongodb
For more installation information, follow the links from the MongoDB Quickstart page. For my test box, I ended up installing from source by following the directions at the Building for Linux page.
Once MongoDB is installed, we will need to start it up. First, create a place for MongoDB to store its data, and then launch the mongodb process:
$ mkdir /tmp/mongodata $ mongod --dbpath=/tmp/mongodata --fork --logpath=/tmp/mongo.log all output going to: /tmp/mongo.log forked process: 428
You can perform a quick test that it is working by invoking the command-line shell for MongoDB (named "mongo" of course) Use quit() to exit:
$ mongo MongoDB shell version: 1.8.1 Fri Jun 10 12:45:00 connecting to: test > quit() $
The other piece we need is a Perl driver so that Bucardo (which is written in Perl) can talk to the MongoDB server. Luckily, there is an excellent one available on CPAN named 'MongoDB'. We started the MongoDB server before doing this step because the driver we will install needs a running MongoDB instance to pass all of its tests. The module has very good documentation available on its CPAN page. Installation may be as easy as:
$ sudo cpan MongoDB
If that did not work for you (case matters!), there are more detailed directions on the Perl Language Center page.
Our next step is to grab the latest Bucardo, install it, and create a new Bucardo instance. See the previous blog post for more details about each step.
$ git clone git://bucardo.org/bucardo.git Initialized empty Git repository... $ cd bucardo $ perl Makefile.PL Checking if your kit is complete... Looks good Writing Makefile for Bucardo $ make cp bucardo.schema blib/share/bucardo.schema cp Bucardo.pm blib/lib/Bucardo.pm cp bucardo blib/script/bucardo /usr/bin/perl -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/bucardo Manifying blib/man1/bucardo.1pm Manifying blib/man3/Bucardo.3pm $ sudo make install Installing /usr/local/lib/perl5/site_perl/5.10.0/Bucardo.pm Installing /usr/local/share/bucardo/bucardo.schema Installing /usr/local/bin/bucardo Installing /usr/local/share/man/man1/bucardo.1pm Installing /usr/local/share/man/man3/Bucardo.3pm Appending installation info to /usr/lib/perl5/5.10.0/i386-linux-thread-multi/perllocal.pod $ sudo mkdir /var/run/bucardo $ sudo chown $USER /var/run/bucardo $ bucardo install This will install the bucardo database into an existing Postgres cluster. ... Installation is now complete.
Now we create some test databases and populate with pgbench:
$ psql -c 'create database btest1' CREATE DATABASE $ pgbench -i btest1 NOTICE: table "pgbench_branches" does not exist, skipping ... creating tables... 10000 tuples done. 20000 tuples done. ... 100000 tuples done. $ psql -c 'create database btest2 template btest1' CREATE DATABASE $ psql -c 'create database btest3 template btest1' CREATE DATABASE $ psql btest3 -c 'truncate table pgbench_accounts' TRUNCATE TABLE $ bucardo add db t1 dbname=btest1 Added database "t1" $ bucardo add db t2 dbname=btest2 Added database "t2" $ bucardo add db t3 dbname=btest3 Added database "t3" $ bucardo list dbs Database: t1 Status: active Conn: psql -p 5432 -U bucardo -d btest1 Database: t2 Status: active Conn: psql -p 5432 -U bucardo -d btest2 Database: t3 Status: active Conn: psql -p 5432 -U bucardo -d btest3 $ bucardo add tables pgbench_accounts pgbench_branches pgbench_tellers herd=therd Created herd "therd" Added table "public.pgbench_accounts" Added table "public.pgbench_branches" Added table "public.pgbench_tellers" $ bucardo list tables Table: public.pgbench_accounts DB: t1 PK: aid (int4) Table: public.pgbench_branches DB: t1 PK: bid (int4) Table: public.pgbench_tellers DB: t1 PK: tid (int4)
The next step is to add in our MongoDB instance. The syntax is the same as the "add db" above, but we also tell it the type of database, as it is not the default of "postgres". We will also assign an arbitrary database name, "btest1", the same as the others. Everything else (such as the port and host) is default, so all we need to say is:
$ bucardo add db m1 dbname=btest1 type=mongo Added database "m1" $ bucardo list dbs Database: m1 Type: mongo Status: active Database: t1 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest1 Database: t2 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest2 Database: t3 Type: postgres Status: active Conn: psql -p 5432 -U bucardo -d btest3
Next we group our databases together and assign them roles:
$ bucardo add dbgroup tgroup t1:source t2:source t3:target m1:target Created database group "tgroup" Added database "t1" to group "tgroup" as source Added database "t2" to group "tgroup" as source Added database "t3" to group "tgroup" as target Added database "m1" to group "tgroup" as target
Note that "target" is the default action, so we could shorten that to:
$ bucardo add dbgroup tgroup t1:source t2 t3 m1
However, I think it is best to be explicit, even if it does (incorrectly) hint that m1 could be anything *other* than a target. :)
We are almost ready to go. The final step is to create a sync (a basic replication event in Bucardo), then we can start up Bucardo, put some test data into the master databases, and 'kick' the sync:
$ bucardo add sync mongotest herd=therd dbs=tgroup ping=false Added sync "mongotest" $ bucardo start Checking for existing processes Starting Bucardo $ pgbench -t 10000 btest1 starting vacuum...end. transaction type: TPC-B (sort of) number of transactions actually processed: 10000/10000 ... tps = 503.300595 (excluding connections establishing) $ pgbench -t 10000 btest2 number of transactions actually processed: 10000/10000 ... tps = 408.059368 (excluding connections establishing) $ bucardo kick mongotest
We'll give it a few seconds to replicate those changes (it took 18 seconds on my test box), and then check the output of bucardo status:
$ bucardo status PID of Bucardo MCP: 3317 Name State Last good Time Last I/D/C Last bad Time ===========+========+============+=======+=============+===========+======= mongotest | Good | 21:57:47 | 11s | 6/36234/898 | none |
Looks good, but what about the data in MongoDB? Let's get some counts from the Postgres masters and slave, and then look at the data inside MongoDB with the mongo command-line client:
$ psql btest1 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$ psql btest2 -c 'SELECT count(*) FROM pgbench_accounts'
100000
$ psql btest3 -c 'SELECT count(*) FROM pgbench_accounts'
18106
$ psql btest1 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ psql btest2 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ psql btest3 -qc 'SELECT min(abalance),max(abalance) FROM pgbench_accounts'
-12071 | 13010
$ mongo btest1
MongoDB shell version: 1.8.1
Fri Jun 10 12:46:00
connecting to: btest1
> show collections
bucardo_status
pgbench_accounts
pgbench_branches
pgbench_tellers
system.indexes
> db.pgbench_accounts.count()
18106
> db.pgbench_accounts.find().sort({abalance:1}).limit(1).next()
{
"_id" : ObjectId("4df39bcb8795839660001de5"),
"abalance" : -12071,
"aid" : 84733,
"bid" : 1,
"filler" : " "
}
> db.pgbench_accounts.find().sort({abalance:-1}).limit(1).next()
{
"_id" : ObjectId("4df39bd08795839660002fb0"),
"abalance" : 13010,
"aid" : 45500,
"bid" : 1,
"filler" : " "
}
Why the difference in counts? We only started replicating after we populated the Postgres tables on the master databases with 100,000 rows, so the eighteen thousand is the number of rows that was changed during the subsequent pgbench run. (Note that pgbench uses randomness, so your numbers will be different than the above). In the future Bucardo will support the "onetimecopy" feature for MongoDB, but until then we can fully populate the pgbench_accounts collection simply by "touching' all the records on one of the masters:
$ psql btest1 -c 'UPDATE pgbench_accounts SET aid=aid' UPDATE 100000 $ bucardo kick mongotest Kicked sync mongotest $ echo 'db.pgbench_accounts.count()' | mongo btest1 MongoDB shell version: 1.8.1 Fri Jun 10 12:47:00 connecting to: btest1 > 100000 > bye
A nice feature of MongoDB is its autovivification ability (aka dynamic schemas), which means unlike Postgres you do not have to create your tables first, but can simply ask MongoDB to do an insert, and it will create the table (or, in mongospeak, the collection) automatically for you.
Because MongoDB has no concept of transactions, and because Bucardo does not update, but does deletes plus inserts (for reasons I'll not get into today), there is one more trick Bucardo does when replicating to a MongoDB instance. A collection named 'bucardo_status' is created and updated at the start and the end of a sync (a replication event). Thus, your application can pause if it sees this table has a 'started' value, and wait until it sees 'complete' or 'failed'. Not foolproof by any means, but better than nothing :) You should, of course, carefully consider the way your app and Bucardo will coordinate things.
Feedback from Postgres or MongoDB folk is much appreciated: there are probably some rough edges, but as you can see from above, the basics are there are working. Feel free to email the bucardo-general mailing list or make a feature request / bug report on the Bucardo Bugzilla page.
Bucardo multi-master for PostgreSQL
The next version of Bucardo, a replication system for Postgres, is almost complete. The scope of the changes required a major version bump, so this Bucardo will start at version 5.0.0. Much of the innards was rewritten, with the following goals:
Multi-master support
Where "multi" means "as many as you want"! There are no more pushdelta (master to slaves) or swap (master to master) syncs: there is simply one sync where you tell it which databases to use, and what role they play. See examples below.
Ease of use
The bucardo program (previously known as 'bucardo_ctl') has been greatly improved, making all the administrative tasks such as adding tables, creating syncs, etc. much easier.
Performance
Much of the underlying architecture was improved, and sometimes rewritten, to make things go much faster. Most striking is the difference between the old multi-master "swap syncs" and the new method, which has been described as "orders of magnitudes" faster by early testers. We use async database calls whenever possible, and no longer have the bottleneck of a single large bucardo_delta table.
Improved logging
Not only are more details provided, there is now the ability to control how verbose the logs are. Just set the log_level parameter to terse, normal, verbose, or debug. Those who had busy systems, which was the equivalent of a 'debug' firehose, will really appreciate this.
Different targets
Who says your slave (target) databases need to be Postgres? In addition to the ability to write text SQL files (for say, shipping to a different system), you can have Bucardo push to other systems as well. Stay tuned for more details on this. (Update: there is a blog post about using MongoDB as a target)
This new version is not quite at beta yet, but you can try out a demo of multi-master on Postgres quie easily. Let's see if we can do it in ten steps.
I. Download all prerequisites
To run Bucardo, you will need a Postgres database (obviously), the DBIx::Safe module, the DBI and DBD::Pg modules, and (for the purposes of this demo) the pgbench utility. Systems vary, but on aptitude-based systems, one can grab all of the above like this:
aptitude install postgresql-server \ perl-DBIx-Safe \ perl-DBD-Pg \ postgresql-contrib
II. Grab the latest Bucardo
git clone git://bucardo.org/bucardo.git
III. Install the program
cd bucardo perl Makefile.PL make sudo make install
You can ignore any errors that come up about ExtUtils::MakeMaker not being recent.
IV. Setup an instance of Bucardo
This step assumes there is a running Postgres available to connect to.
sudo mkdir /var/run/bucardo sudo chown $USER /var/run/bucardo bucardo install
V. Use the pgbench program to create some test tables
psql -c 'CREATE DATABASE btest1' pgbench -i btest1 psql -c 'CREATE DATABASE btest2 TEMPLATE btest1' psql -c 'CREATE DATABASE btest3 TEMPLATE btest1' psql -c 'CREATE DATABASE btest4 TEMPLATE btest1' psql -c 'CREATE DATABASE btest5 TEMPLATE btest1'
VI. Tell Bucardo about the databases and tables you are going to use
bucardo add db t1 dbname=btest1 bucardo add db t2 dbname=btest2 bucardo add db t3 dbname=btest3 bucardo add db t4 dbname=btest4 bucardo add db t5 dbname=btest5 bucardo list dbs bucardo add table pgbench_accounts pgbench_branches pgbench_tellers herd=therd bucardo list tables
A herd is simply a logical grouping of tables. We did not add the other pgbench table, pgbench_history, because it has no primary key or unique index.
VII. Group the databases together and set their roles
bucardo add dbgroup tgroup t1:source t2:source t3:source t4:source t5:target
We've grouped all five databases together, and made four of them masters (aka source), and one of them a slave (aka target). You can any combination of master and slaves you want, as long as there is at least one master.
VII. Create the Bucardo sync
bucardo add sync foobar herd=therd dbs=tgroup ping=false
Here we simply create a new sync, which is a controllable replication event, telling it which tables we want to replicate, and which databases we are going to use. We also set ping to false, which means that we will not create triggers to automatically fire off replication on any changes, but will do it manually. In a real world scenario, you generally do want those triggers, or want to set Bucardo to check periodically.
VIII. Start up Bucardo
bucardo start
If all went well, you should see some information in the log.bucardo file in the current directory.
IX. Make a bunch of changes on all the source databases.
pgbench -t 10000 btest1 pgbench -t 10000 btest2 pgbench -t 10000 btest3 pgbench -t 10000 btest4
Here, we've told pgbench to run ten thousand transactions against each of the first four databases. Triggers on these tables have captured the changes.
X. Kick off the sync and watch the fun.
bucardo kick foobar
You can now tail the log.bucardo file to see the fun, or simply run:
bucardo status
...to see what it is doing, and the final counts when we are done. Don't forget to stop Bucardo when you are done testing:
bucardo stop
The output of bucardo status, after the sync has completed, should look like this:
bucardo status Name State Last good Time Last I/D/C Last bad Time ========+========+============+=======+====================+===========+======= foobar | Good | 17:58:37 | 3m2s | 131836/131836/4785 | none |
Here we see that this syncs has never failed ("Last bad"), the time of day of the last good run, how long ago it was from right now (3 minutes and 2 seconds), as well as details of the last successful run. Last I/D/C stands for number of inserts, deletes, and collisions across all databases for this syncs. This is just an overview of all syncs at a high level, but we can also give status an argument of a sync name to see more details like so:
bucardo status foobar Last good : Jun 02, 2011 17:57:47 (time to run: 42s) Rows deleted/inserted/conflicts : 131,836 / 131,836 / 4,785 Sync name : foobar Current state : Good Source herd/database : therd / t1 Tables in sync : 3 Status : active Check time : none Overdue time : 00:00:00 Expired time : 00:00:00 Stayalive/Kidsalive : yes / yes Rebuild index : 0 Ping : no Onetimecopy : 0 Post-copy analyze : Yes Last error: :
This gives us a little more information about the sync itself, as well as another important metric, how long the sync itself took to run, in this case, 42 seconds. That particular metric might make its way back to the overall "status" view above. Try things out and help us find bugs and improve Bucardo!
DBD::Pg and the libpq COPY bug
(image by kvanhorn)Version 2.18.1 of DBD::Pg, the Perl driver for Postgres, was just released. This was to fix a serious bug in which we were not properly clearing things out after performing a COPY. The only time the bug manifested, however, is if an asynchronous query was done immediately after a COPY finished. I discovered this while working on the new version of Bucardo. The failing code section was this (simplified):
## Prepare the source
my $srccmd = "COPY (SELECT * FROM $S.$T WHERE $pkcols IN ($pkvals)) TO STDOUT";
$fromdbh->do($srccmd);
## Prepare each target
for my $t (@$todb) {
my $tgtcmd = "COPY $S.$T FROM STDIN";
$t->{dbh}->do($tgtcmd);
}
## Pull a row from the source, and push it to each target
while ($fromdbh->pg_getcopydata($buffer) >= 0) {
for my $t (@$todb) {
$t->{dbh}->pg_putcopydata($buffer);
}
}
## Tell each target we are done with COPYing
for my $t (@$todb) {
$t->{dbh}->pg_putcopyend();
}
## Later on, run an asynchronous command on the source database
$sth{track}{$dbname}{$g} = $fromdbh->prepare($SQL, {pg_async => PG_ASYNC});
$sth{track}{$dbname}{$g}->execute();
This gave the error "another command is already in progress". This error did not come from Postgres or DBD::Pg, but from libpq, the underlying C library which DBD::Pg uses to talk to the database. Strangely enough, taking out the async part and running the exact same command produced no errors.
After tracking back through the libpq code, it turns out that DBD::Pg was only calling PQresult a single time after the copy ended. I can see why this was done: the docs for PQputCopyEnd state: "After successfully calling PQputCopyEnd, call PQgetResult to obtain the final result status of the COPY command. One can wait for this result to be available in the usual way. Then return to normal operation." What's not explicitly stated is that you need call PQgetResult again, and keep calling it, until it returns null, to "clear out the message queue". In this case, PQresult pulled back a 'c' message from Postgres, via the frontend/backend protocol, indicating that the copy command was complete. However, what it really needed was to call PQresult two more times, once to get back a 'C' (indicating the COPY statement was complete), and a 'Z' (indicating the backend was ready for a new query). Technically, there was nothing stopping libpq from sending a fresh query except that its own internal flag, conn->asyncStatus, is not reset on a simple end of copy, but only when 'Z' is encountered. Thus, DBD::Pg 2.18.1 now calls PQresult until it returns null.
If your application is encountering this bug and you cannot upgrade to 2.18.1 yet, the solution is simple: perform a non-asynchronous query between the end of the copy and the start of the asynchronous query. It can be any query at all, so the above code could be cured with:
...
## Tell each target we are done with COPYing
for my $t (@$todb) {
$t->{dbh}->pg_putcopyend();
$t->{dbh}->do('SELECT 123');
}
## Later on, run an asynchronous command on the source database
$fromdbh->do('SELECT 123');
$sth{track}{$dbname}{$g} = $fromdbh->prepare($SQL, {pg_async => PG_ASYNC});
$sth{track}{$dbname}{$g}->execute();
Why does the non-asynchronous command work? Doesn't it check the conn->asyncStatus as well? The secret is that PQexecstart has this bit of code in it:
/*
* Silently discard any prior query result that application didn't eat.
* This is probably poor design, but it's here for backward compatibility.
*/
while ((result = PQgetResult(conn)) != NULL)
Wow, that code looks familiar! So it turns out that the only reason this was not spotted earlier is that non-asynchronous commands (e.g. those using PQexec) were silently clearing out the message queue, kind of as a little favor from libpq to the driver. The async function, PQsendQuery, is not as nice, so it does the correct thing and fails right away with the error seen above (via PQsendQueryStart).
Benchmarking in Perl: Map versus For Loop
Last week, I was coding in Perl for an Interchange project. I've been in and out of Perl and Ruby a lot lately. While I was working on the project, I came across the following bit of code and wanted to finally sit down and figure out how to use the map function in Perl on this bit of code.
my @options;
for my $obj (@$things) {
push @options, {
value => $obj->{a},
label => $obj->{b}
};
}
return \@options;
I'm a big fan of Ruby's inject method and in general a fan of the Enumerable Module, but I have a brain block when it comes to using the map method in both Perl and Ruby. I spent a little time investigating and working on a small local Perl script to test the implementation of the map method. I came up with the following:
return [ map {
{
value => $_->{a},
label => $_->{b}
}
} @$things ];
After that, I wanted to make sure the code change was justified. The Interchange application that is the source of this code is built for performance, so I wanted to ensure this change didn't hinder performance. It's been a while since I've done benchmarking in Perl, so I also had to refresh my memory regarding using the Benchmark module. I came up with:
#!/usr/bin/perl
use Benchmark;
my $count = 1000000;
my $things = [
{'a' => 123, 'b' => 456, 'c' => 789 },
{'a' => 456, 'b' => 789, 'c' => 123 }
];
#Test definitions as methods to mimic use in application
my $test1 = sub {
my @options;
for my $obj (@$things) {
push @options, {
value => $obj->{a},
label => $obj->{b}
};
}
return \@options;
};
my $test2 = sub {
return [ map {
{
value => $_->{a},
label => $_->{b}
}
} @$things ];
};
#Benchmark tests & results.
$t0 = Benchmark->new;
$test1->() for(1..$count);
$t1 = Benchmark->new;
$td = timediff($t1, $t0);
print "the code for test 1 took:",timestr($td),"\n";
$t0 = Benchmark->new;
$test2->() for(1..$count);
$t1 = Benchmark->new;
$td = timediff($t1, $t0);
print "the code for test 2 took:",timestr($td),"\n";
The results were:
| Test # | Before (For Loop) | After (Map) |
| 1 | 5 sec | 4 sec |
| 2 | 5 sec | 4 sec |
| 3 | 5 sec | 5 sec |
| 4 | 5 sec | 5 sec |
| 5 | 6 sec | 4 sec |
| 6 | 6 sec | 4 sec |
| 7 | 6 sec | 4 sec |
| 8 | 5 sec | 5 sec |
| 9 | 5 sec | 4 sec |
| 10 | 5 sec | 4 sec |
| Average | 5.3 sec | 4.3 sec |
In this case, replacing the imperative programming style here with Functional programming (via map) yielded a small performance improvement, but the script executed each method 1,000,000 times, so the performance gain yielded by just one method call is very small. I doubt it's worth it go on a code cleanup rampage to update and test this, but it's good to keep in mind moving forward as small bits of the code are touched. I also wonder if the performance will vary when the size of $things changes — something I didn't test here. It was nice to practice using Perl's map method and Benchmark module. Yippee.
Postgres query caching with DBIx::Cache
A few years back, I started working on a module named DBIx::Cache which would add a caching layer at the database driver level. The project that was driving it got put on hold indefinitely, so it's been on my long-term todo list to release what I did have to the public in the hope that someone else may find it useful. Hence, I've just released version 1.0.1 of DBIx::Cache. Consider it the closest thing Postgres has at the moment for query caching. :) The canonical webpage:
http://bucardo.org/wiki/DBIx-Cache
You can also grab it via git, either directly:
git clone git://bucardo.org/dbixcache.git/
or through the indispensable github:
https://github.com/bucardo/dbixcache
So, what does it do exactly? Well, the idea is that certain queries that are either repeated often and/or are very expensive to run should be cached somewhere, such that the database does not have to redo all the same work, just to return the same results over and over to the client application. Currently, the best you can hope for with Postgres is that things are in RAM from being run recently. DBIx::Cache changes this by caching the results somewhere else. The default destination is memcached.
DBIx::Cache acts as a transparent layer around your DBI calls. You can control which queries, or classes of queries get cached. Most of the basic DBI methods are overridden so that rather than query Postgres, they actually query memcached as needed (or other caching layer - could even query back into Postgres itself!). Let's look at a simple example:
use strict;
use warnings;
use Data::Dumper;
use DBIx::Cache;
use Cache::Memcached::Fast;
## Connect to an existing memcached server,
## and establish a default namespace
my $mc = Cache::Memcached::Fast->new(
{
servers => [ { address => 'localhost:11211' } ],
namespace => 'joy',
});
## Rather than DBI->connect, use DBIx->connect
## Tell it what to use as our caching source
## (the memcached server above)
my $dbh = DBIx::Cache->connect('', '', '',
{ RaiseError => 1,
dxc_cachehandle => $mc
});
## This is an expensive query, that takes 30 seconds to run:
my $SQL = 'SELECT * FROM analyze_sales_data()';
## Prepare this query
my $sth = $dbh->prepare($SQL);
## Run it ten times in a row.
## The first time takes 30 seconds, the other nine return instantly.
for (1..10) {
my $count = $sth->execute();
my $info = $sth->fetchall_arrayref({});
print Dumper $info;
}
In the above, the prepare($SQL) is actually calling the DBIx::Class::prepare method. This parses the query and tries to determine if it is cacheable or not, then stores that decision internally. Regardless of the result, it calls DBI::prepare (which is techincally DBD::Pg::prepare), and returns the result.The magic comes in the call to execute() later on. As you might imagine, this is also actually the DBIx::Class::execute() method. If the query is not cacheable, it simply runs it as normal and returns. If it is cacheable, and this is the first time it is run, DBIx::Class runs an EXPLAIN EXECUTE on the original statement, and parses out a list of all tables that are used in this query. Then it caches all of this information into memcached, so that subsequent runs using the same list of arguments to execute() don't need to do that work again.
Finally, we come to fetchall_arrayref(). The first time it is run, we simply call the parent methods and get the data back. Then we build unique keys and store the results of the query into memcached. Finally, we mark the execute() as fully cached. Thus, on subsequent calls to execute(), we don't actually execute anything on the database server, but simply return the count as stashed inside of memcached (in the case of execute, this is the number of affected rows). For the various fetch() methods, we do the same thing - rather than fetch things from the database (via DBI, DBD::Pg, and libpq), we get the results from memcached (frozen via Data::Dumper), and then unpack and return them. Since we don't actually need to do any work against the database, everything returns as fast as we can query memcached - which is in general very fast indeed.
Most of the above is working, but the piece that is not written is the cache invalidation. DBIx::Cache knows which tables go to which queries, so in theory you could have (for example), an UPDATE/INSERT/DELETE trigger on table X which calls DBIx::Cache and tells it to invalidate all items related to table X, so that the next call to prepare() or execute() or fetch() will not find any memcached matches and re-run the whole query and store the results. You could also simply handle that in your application, of course, and have it decide when to invalidate items.
It's been a while since I've really looked at the code, but as far as I can tell it is close to being able to actually use somewhere. :) Patches and questions welcome!
Use ZIP+4, except when you shouldn't
The USPS provides a handy API for looking up postal rates on the fly. Recently it started failing for code that had been working for a while, so I investigated. I found a couple of different problems with it:
- First, the "service description" field had been "augmented" by including copyright symbols via HTML mark-up. That meant internal comparisons started to fail, so I "canonicalized" all the responses by stripping out various things from both sides of my comparison.
$string =~ s{&(?:[a-z/;&])+}{}gis; $string =~ s/[^a-z]//gis; $string =~ s/^\s+//; $string =~ s/\s+$//; $string =~ s/\s+/ /gis; - Second, I found that the API inexplicably rejects 9-digit ZIP codes, the "ZIP+4" format. That's right, you can't look up a domestic shipping rate for a 9-digit ZIP. The documentation linked above specifically calls for 5-digit ZIPs. If you pass a 9-digit ZIP to the API, it doesn't smartly recognize that you've given it too much info and just use what it needs. Instead, it throws an error.
So the API got too clever in one regard, and not clever enough where it counts.
DBD::Pg query cancelling in Postgres
A new version of DBD::Pg, the Perl driver for PostgreSQL, has just been released. In addition to fixing some memory leaks and other minor bugs, this release (version 2.18.0) introduces support for the DBI method known as cancel(). A giant thanks to Eric Simon, who wrote this new feature. The new method is similar to the existing pg_cancel() method, except it works on synchronous rather than asynchronous queries. I'll show an example of both below.
DBD::Pg has been able to handle asynchronous queries for a while now. Basically, that means you don't have to wait around for the database to finish a query. Your application can do other things while the query runs, then check back later to see if it has completed and grab the results. The way to cancel an already kicked-off asynchronous query is with the pg_cancel() method (the other asynchronous methods are pg_ready and pg_result, which have no synchronous equivalents).
The prefix "pg_" is used because there is no corresponding built-in DBI method to override, and the convention is to prefix everything custom to a driver with the driver's prefix, in our case 'pg'. Here's an example showing one possible use of asynchronous queries using DBD::Pg in some Perl code:
## We are connecting to two servers and running expensive
## queries on both. We kick both off right away, then wait
## for them both to finish. Our total wait time is thus
## max(server1,server2) rather than sum(server1,server2)
use strict;
use warnings;
use DBI;
use DBD::Pg qw{ :async };
my $dsn1 = 'dbi:Pg:dbname=sales;host=example1.com';
my $dsn2 = 'dbi:Pg:dbname=sales;host=example2.com';
my $dbh1 = DBI->connect($dsn1, '', '', {AutoCommit=>0, RaiseError=>1});
my $dbh2 = DBI->connect($dsn2, '', '', {AutoCommit=>0, RaiseError=>1});
my $SQL = 'SELECT gather_yearly_sales_data()';
print "Kicking off a long, expensive query on database one\n";
## Normally, a do() will not return until the query is complete
## However, the async flag causes it to return immediately
$dbh1->do($SQL, {pg_async => PG_ASYNC});
print "Kicking off a long, expensive query on database two\n";
$dbh2->do($SQL, {pg_async => PG_ASYNC});
## Both queries are running in the 'background'
## We have to wait for both, so it doesn't matter which one we wait for here
## However, if it's been over 2 minutes, we'll cancel both and quit
my $time = 0;
while ( ! $dbh1->pg_ready() ) {
sleep 1;
if ($time++ > 120) {
print "Taking too long, let's cancel the queries\n";
$dbh1->pg_cancel();
$dbh2->pg_cancel();
$dbh1->rollback();
$dbh2->rollback();
die "No sales data was retrieved\n";
}
}
## We know that database 1 has finished, so we read in the results
my $rows1 = $dbh1->pg_result();
## We then grab results from database 2
## This will block until done, which is okay
my $rows2 = $dbh2->pg_result();
The new method, simply known as cancel(), will kill any synchronously running query. One of the main uses for this is to timeout a query by using the builtin Perl alarm function. However, since the builtin alarm function has some quirks, we will instead use the much safer POSIX::SigAction method. Another example:
## We are running a series of queries against a database, but if
## the whole thing is taking over 30 seconds, we want to cancel
## the currently running query and move on to something else.
use strict;
use warnings;
use DBI;
use DBD::Pg qw{ :async };
my $dsn = 'dbi:Pg:dbname=dq';
my $dbh = DBI->connect($dsn, '', '', {AutoCommit=>0, RaiseError=>1});
## Setup all the POSIX alarm plumbing
my $mask = POSIX::SigSet->new(SIGALRM);
my $action = POSIX::SigAction->new(
sub { die "TIMEOUT\n" },
$mask,
);
my $oldaction = POSIX::SigAction->new();
sigaction( SIGALRM, $action, $oldaction );
## Prepare the queries
my $upd = $dbh->prepare('UPDATE foobar SET x=? WHERE y=?');
my $inv = $dbh->prepare('SELECT refresh_inventory(?)');
## Yes, a double eval. Async is looking better all the time :)
eval {
eval {
alarm 30;
for my $y (12,24,48) {
print "Adjusting widget #$y\n";
$upd->execute(555,$y);
print "Recalculating inventory\n";
$inv->execute($y);
}
};
alarm 0; ## Turn off our alarm
die "$@\n" if $@; ## Bubble the error to the outer eval
};
if ($@) { ## Something went wrong
if ($@ =~ /TIMEOUT/) {
print "Queries are taking too long! Cancelling\n";
## We don't know which one is still running, and don't care
## It's safe to cancel a non-active statement handle
$upd->cancel() or die qq{Failed to cancel the query!\n};
$inv->cancel() or die qq{Failed to cancel the query!\n};
$dbh->rollback();
die "Who has time to wait 30 seconds anymore?";
}
## Some other non-alarm error, so we simply:
die $@;
}
print "Updates are complete\n";
$dbh->commit();
exit;
Got an interesting use case for asynchronous queries or the new $dbh‑>cancel()? Let me know!
Managing Perl environments with perlbrew
As a Perl hobbyist, I've gotten used to the methodical evolution of Perl 5 over the years. Perl has always been a reliable language, not without its faults, but with a high level of flexibility in syntactical expression and even deployment options. Even neophytes quickly learn how to install their own Perl distribution and CPAN libraries in $HOME. But the process can become unwieldy, particularly if you want to test across a variety of Perl versions.
To contrast, Ruby core development frequently experiences ABI breakages, even between minor releases. In spite of the wide adoption of Ruby as a Web development language (thanks to Ruby on Rails), Ruby developers are able to plod along unconcerned, where these incompatibilities would almost certainly lead to major bickering within the Perl or PHP communities. How do they do it? The Ruby Version Manager.
Ruby Version Manager (RVM) allows users to install Ruby and RubyGems within their own self-contained environment. This allows each user to install all (or only) the software that their particular application requires. Particularly for Ruby developers, this provides them with the flexibility to quickly test upgrades for regressions, ABI changes and enhancements without impacting system-wide stability. Thankfully a lot of the ideas in RVM have made their way over to the Perl landscape, in the form of perlbrew.
Perlbrew offers many of the same features found in RVM for Ruby. It's easy to install. It isolates different Perl versions and CPAN installations in your $HOME and helps you switch between them. It automates your environment setup and teardown. And most importantly, using perlbrew means not having to clutter your default system Perl with application-specific CPAN dependencies.
Getting started with perlbrew couldn't be easier. A quick one-liner is all it takes to install perlbrew in your home directory.
$ curl -L http://xrl.us/perlbrewinstall | bash
If you need to install perlbrew somewhere other than your home directory, just download the installer and pass it the PERLBREW_ROOT environment variable.
$ curl -LO http://xrl.us/perlbrew $ chmod +x perlbrew $ PERLBREW_ROOT=/mnt/perlbrew ./perlbrew install
Follow the instructions on screen and you'll be ready to use perlbrew in no time. The perlbrew binary will be installed in ~/perl5/perlbrew/bin, so make sure to adjust your login $PATH accordingly.
Once you're done installing perlbrew there are a couple commands you'll want to run before installing your own Perl versions or CPAN modules. The perlbrew init command is mandatory; this initializes your perlbrew directory. It can also be used later if you need to modify your PERLBREW_ROOT setting. The perlbrew mirror is optional (but recommended) to help you select a preferred CPAN mirror.
$ perlbrew init $ perlbrew mirror
Next comes the fun part. Start off by verifying the Perl version(s) that perlbrew sees.
$ perlbrew list * /usr/bin/perl (5.10.1)
Install a newer version of Perl.
$ perlbrew install 5.12.3
Now switch to the newer Perl.
$ perlbrew list * /usr/bin/perl (5.10.1) perl-5.12.3 $ perlbrew switch perl-5.12.3 $ perlbrew list /usr/bin/perl (5.10.1) * perl-5.12.3 $ perl -v This is perl 5, version 12, subversion 3 (v5.12.3) built for x86_64-linux Copyright 1987-2010, Larry Wall Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5 source kit. Complete documentation for Perl, including FAQ lists, should be found on this system using "man perl" or "perldoc perl". If you have access to the Internet, point your browser at http://www.perl.org/, the Perl Home Page.
Alternatively, if you only want to test a different Perl version, try the perlbrew use command (note: this only works in bash and zsh). Unlike the switch command, use is only active for the current shell.
$ perlbrew use system $ perlbrew list * /usr/bin/perl (5.10.1) perl-5.12.3
A quick peek behind the curtain reveals much of the simplicity behind perlbrew.
$ ls -l ~/perl5/perlbrew/ total 2680 -rw-r--r-- 1 testy users 408 Feb 10 23:58 Conf.pm drwxr-xr-x 2 testy users 512 Feb 10 23:46 bin drwxr-xr-x 4 testy users 512 Feb 11 09:59 build -rw-r--r-- 1 testy users 1333196 Feb 11 10:33 build.log drwxr-xr-x 2 testy users 512 Feb 11 09:59 dists drwxr-xr-x 2 testy users 512 Feb 10 23:47 etc drwxr-xr-x 4 testy users 512 Feb 11 10:32 perls $ ls -l ~/perl5/perlbrew/perls/ total 8 drwxr-xr-x 5 testy users 512 Feb 11 00:38 perl-5.12.3 drwxr-xr-x 5 testy users 512 Feb 11 10:32 perl-5.13.6
If you're a Perl developer, the perlbrew project may help alleviate a lot of the pain associated with team development or multi-tenant programming environments. Suddenly it becomes much easier to manage your own software requirements, resulting in faster development and testing cycles for you, and fewer headaches for your System Administrators.
JSON pretty-printer
The other day Sonny and I were troubleshooting some YUI JavaScript code and looking at some fairly complex JSON. It would obviously be a lot easier to read if each nested data structure were indented, and spacing standardized.
I threw together a little Perl program based on the JSON man page:
#!/usr/bin/env perl
use JSON;
my $json = JSON->new;
undef $/;
while (<>) {
print $json->pretty->encode($json->decode($_));
}
It took all of 2 or 3 minutes and I even left out strictures and warnings. Living on the edge!
It turns a mess like this (sample from json.org):
{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":
{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language",
"Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":
"A meta-markup language,used to create markup languages such as DocBook.",
"GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}
into this much more readable specimen:
{
"glossary" : {
"GlossDiv" : {
"GlossList" : {
"GlossEntry" : {
"GlossDef" : {
"para" : "A meta-markup language,used to create markup languages such as DocBook.",
"GlossSeeAlso" : [
"GML",
"XML"
]
},
"GlossTerm" : "Standard Generalized Markup Language",
"ID" : "SGML",
"SortAs" : "SGML",
"Acronym" : "SGML",
"Abbrev" : "ISO 8879:1986",
"GlossSee" : "markup"
}
},
"title" : "S"
},
"title" : "example glossary"
}
}
But today I thought back to that and figured surely something like that must already be at hand if I'd just looked for it. Sure enough, there are many easy options that work conveniently from the shell, similarly to that script:
- json_xs (Perl JSON::XS)
- python -mjson.tool (Python 2.6+)
- prettify_json.rb (Ruby json gem)
And those were just the ones that were likely already on the machine I was using! Hooray for convenience.
Character encoding in perl: decode_utf8() vs decode('utf8')
When doing some recent encoding-based work in Perl, I found myself in a situation which seemed fairly unexplainable. I had a function which used some data which was encoded as UTF-8, ran Encode::decode_utf8() on said data to convert to Perl's internal character format, then converted the "wide" characters to the numeric entity using HTML::Entities::encode_entities_numeric(). Logging/printing of the data on input confirmed that the data was properly formatted UTF-8, as did running `iconv -f utf8 -t utf8 output.log >/dev/null` for the purposes of review.
However when I ended up processing the data, it was as if I had not run the decode function at all. In this case, the character in question was € (unicode code point U+20AC). The expected behavior from encode_entities_numeric() would be to turn any of the hi-bit characters in the perl string (i.e. all Unicode code points > 0x80) into the corresponding numeric entity (€ - € in this case). However instead of that specific character's numeric entity appearing in the output, the entities which appeared were: € i.e., the raw UTF-8 encoded value for €, with each octet being treated as an independent character instead of part of the whole encoded value.
What was particularly confusing was that extracting the relevant parts from the script in question resulted in the expected answer, so it was clearly not an issue of HTML::Entities not being able to deal with Unicode characters, as this code snippet demonstrates:
$ perl -MHTML::Entities+encode_entities_numeric -MEncode -e '$c=qq{\xE2\x82\xAC}; print encode_entities_numeric(decode_utf8($c))'
--> €
In the actual non-extracted version of the code, I was scratching my head. This was exhibiting the signs of doubly-encoded data, however I couldn't see how that could be the case. There were no PerlIO layers (e.g., :utf8 or :encoding) at play, the data I was outputting to a log file for verification purposes was being written via a brand new filehandle from a bare open(); I verified in multiple ways that the raw octets being passed in to the function were not doubly-encoded (printing the raw character points, counting lengths of the runs of octets and verifying that these matched the length of the UTF-8 encoded value for the represented characters, etc). The more things I tried the more puzzled I got. Finally, I changed the Encode::decode_utf8() call to a Encode::decode('utf8') one, providing the encoding explicitly. At this point, the processing pipeline started working as expected, and hi-bit characters were being output as their full numeric entities.
Since the documentation for decode_utf8 indicated that it should be identical to decode('utf8'), I resorted to the code to find out why it worked with the version that specified the encoding explicitly. I found that decode_utf8() does one additional thing that the regular decode('utf8') does not, and that is that before processing via the regular decode() function, decode_utf8 first checks the UTF-8 flag of the data that is being passed in, and if it is set it returns the data without further decoding*. My best guess is that this is to prevent errors if someone attempts to decode UTF-8 data in a string which is already in Perl's internal format, so in most cases this will provide a caller-friendly interface that will DWYM in many expected cases.
Armed with this knowledge, I verified that for some reason, the data that was being passed into the function had the UTF-8 flag set, so using the explicit decode('utf8') in lieu of decode_utf8() fixed the issue for me. (Tracing down the reason for the UTF-8 flag being set on this data was out of scope for this exercise, but is the true fix.) And just to verify that this was in fact the cause of the issue at hand, here's our example, modified slightly (we use the utf8::upgrade function to turn the UTF-8 flag on in the data and treat as actual encoded characters instead of raw octets):
$ perl -l -MHTML::Entities+encode_entities_numeric -MEncode -Mutf8 -e '$c=qq{\xE2\x82\xAC}; utf8::upgrade($c); print encode_entities_numeric(decode_utf8($c))'
--> €
* The UTF-8 flag is more-or-less an implementation detail of how Perl is able to deal with legacy 8-bit binary data in no particular encoding (i.e., raw octets, which it treats as latin-1) as well as the full range of Unicode data, and deal with both efficiently and in a backwards-compatible manner.
Using "diff" and "git" to locate original revision/source of externally modified files
I recently ran into an issue where I had a source file of unknown version which had been substantially modified from its original form, and I wanted to find the version of the originating software that it had originally come from to compare the changes. This file could have come from any number of the 100 tagged releases in the repository, so obviously a hand-review approach was out of the question. While there were certainly clues in the source file (i.e., copyright dates to narrow down the range of commits to review) I thought up and used this technique:
Here are our considerations:
- We know that the number of changes to the original file is likely small compared to the size of the file overall.
- Since we're trying to uncover a likely match for the purposes of reviewing, exactness is not required; i.e., if there are lines in common with future releases, we're interested in the changes, so a revision with the fewest number of changes is preferred over finding the *exact* version of the file that this was originally based on.
The basic thought, then, is that we want to take the content of the unversioned file (i.e., the file that was changed) and find the revision of the corresponding file in the repository with the least number of changes, which we'll measure as the count of the lines in the source code diff. This struck me as similar to the copy detection that git does, insofar as it can detect content that is similar to some source content with a certain amount of tolerance for changes from the base. The difference in this case is that we're comparing content across a number of refs rather than across all of the blobs in a single ref. This recipe distilled down to the following bash command:
for ref in $(git tag);
do
echo -n $ref;
diff -w <(git show $ref:/path/to/versioned/file 2>/dev/null) modified_file | wc -l;
done | sort -k2 -n
The results of running this command is a list of the tags in the repository ordered by how similar they are to the target content (most similar first). A few comments:
- We iterate through all tags in the project; while there could indeed be changes to the relevant file in intermediate versions, due to the way the release worked it's likely the original file was based on a released (aka tagged) version.
- We're using diff's -w option, as the content may have changed spaces to tabs or vice versa, depending on the editor/editing habits of the original user. This helps us ensure that the changes that we're focusing on are the ones that change something substantial.
- We're doing a numeric sort so the lines with the least number of changes show up at the top.
- For the specific case I used this technique with, there were a number of revisions that had the least number of changed lines. Upon reviewing this smaller set of revisions (using the git diff rev1 rev2 -- path/to/content syntax), it turns out that the file in question had remained unchanged in each of these revisions, so any one of them was useful for my purposes.
- The flexibility in the version detection works in this case because this was an isolated part of the system that did not have any changes or dependencies. If there had been important changes to the system as a whole independent of the changes to this file (but which had an affect on the operation of this specific part), we would need to have a more exact method of identifying the file.
Utah Open Source Conference 2010 part 1
It's been about a little over a month since the 2010 Utah Open Source Conference, and I decided to take a few minutes to review talks I enjoyed and link to my own talk slides.
Magento: Mac Newbold of Code Greene spoke on the Magento ecommerce framework for PHP. I've somewhat familiar with Magento, but a few things stood out:
- He finds the Magento Enterprise edition kind of problematic because Varien won't support you if you have any unsupported extensions. Some of his customers had problems with Varien support and went back to the community edition.
- Magento is now up to around 30 MB of PHP files!
- As I've heard elsewhere, serious customization has a steep learning curve.
- The Magento data model is an EAV (Entity-Attribute-Value) model. To get 50 columns of output requires 50+ joins between 8 tables (one EAV table for each value datatype).
- There are 120 tables total in default install -- many core features don't use the EAV tables for performance reasons.
- Another observation I've heard in pretty much every conversation about Magento: It is very resource intensive. Shared hosting is not recommended. Virtual servers should have a minimum of 1/2 to 1 GB RAM. Fast disk & database help most. APC cache recommended with at least 128 MB.
- A lot of front-end things are highly adjustable from simple back-end admin options.
- Saved credit cards are stored in the database, and the key is on the server. I didn't get a chance to ask for more details about this. I hope it's only the public part of a public/secret keypair!
It was a good overview for someone wanting to go beyond marketing feature lists.
Node.js: Shane Hansen of Backcountry.com spoke on Node, comparing it to Tornado and Twisted in Python. He calls JavaScript "Lisp in C's clothing", and says its culture of asynchronous, callback-driven code patterns makes Node a natural fit.
Performance and parallel processing are clearly strong incentives to look into Node. The echo server does 20K requests/sec. There are 2000+ Node projects on GitHub and 500+ packages in npm (Node Package Manager), including database drivers, web frameworks, parsers, testing frameworks, payment gateway integrations, and web analytics.
A few packages worth looking into further:
- express - web microframework like Sinatra
- Socket-IO - Web Sockets now; falls back to other things if no Web Sockets available
- hummingbird - web analytics, used by Gilt.com
- bespin - "cloud JavaScript editor"
- yui3 - build HTML via DOM, eventbus, etc.
- connect - like Ruby's Rack
I haven't played with Node at all yet, and this got me much more interested.
Metasploit: Jason Wood spoke on Metasploit, a penetration testing (or just plain penetrating!) tool. It was originally in Perl, and now is in Ruby. It comes with 590 exploits and has a text-based interactive control console.
Metasploit uses several external apps: nmap, Maltego (proprietary reconnaissance tool), Nessus (no longer open source, but GPL version and OpenVAS fork still available), Nexpose, Ratproxy, Karma.
The reconnaissance modules include DNS enumeration, and an email address collector that uses the big search engines.
It can backdoor PuTTY, PDFs, audio, and more.
This is clearly something you've got to experiment with to appreciate. Jason posted his Metasploit talk slides which have more detail.
So Many Choices: Web App Deployment with Perl, Python, and Ruby: This was my talk, and it was a lot of fun to prepare for, as I got to take time to see some new happenings I'd missed in these three languages communities' web server and framework space over the past several years.
The slides give pointers to a lot of interesting projects and topics to check out.
My summary was this. We have an embarrassment of riches in the open source web application world. Perl, Python, and Ruby all have very nice modern frameworks for developing web applications. They also have several equivalent solid options for deploying web applications. If you haven't tried the following, check them out:
That's about half of my notes on talks, but all I have time for now. I'll cover more in a later post.
Providing Database Handle for Interchange Testing
I've recently begun using the test driven development approach to my projects using Perl's Test::More module. Most of my projects lately have been with Interchange which has some hurdles to get around as far as test driven development is concerned. Primarily this is because Interchange runs as a daemon and provides some readily available utilites like the database handle. This method is not available to our tests, so they need to be made available as discussed below.
I develop Usertags, GlobalSubs and ActionMaps where applicable as it helps keep the separation of business logic and views clear. I generally organize these to call a function within a Perl module so they can be tested properly. Most of these tags involve some sort of connection with the database to present information to the user in which I uses the Interchange ::database_exists_ref method.
When it comes to testing I want to ensure that the test script invokes the same method. Otherwise, your script will not be testing the code as its used in production.
Let's say you are building a Perl module that looks something like this:
package YourMagic;
use strict;
sub do_something {
my ($opt) = @_;
# some code
my $dbh = ::database_exists_ref($opt->{table})->dbh
or return undef;
# ... more code
return $output;
}
1;
The ::database_exists_ref() method will not be available for a test script and needs to be defined. It should return an object to the dbh method in the test script as it does within Interchange. There is no need to test the method itself, as it is not part of the "what" that is being developed. The following code needs to be added to the test script so it can handle the correct type of database reference returned by Interchange.
use lib '/home/user/interchange/custom/lib';
use Test::More tests => 2;
use DBI;
# Here are the methods to provide proper reference to our database handle
################################
sub ::database_exists_ref {
my $table = shift;
return undef if !$table;
# return an object with a dbh method
return bless({}, __PACKAGE__);
}
sub dbh {
# define a dbh method
my $db = DBI->connect('dsn, 'user', 'pass');
return $db;
}
##################################
use YourMagic;
is(
YourMagic::do_something(),
undef,
'do_something() returns undef when called with no arguments',
);
is(
YourMagic::do_something(\%opt),
undef,
'do_something() returns ...',
);
It is also worthwhile to note that you'll need to use the ::database_exists_ref method to look up some information from the existing table that is valuable to test against. Now the do_something() method will call ::database_exists_ref() when invoked.
This approach allows us to use, reuse, and add new tests without worrying about mock data during the intial development. You can be sure that the existing test scripts will function properly against the latest data that is available.
I will cover some other topics regarding Interchange Test Driven Development in future posts. For more information regarding Unit Testing in general see this post by Ethan.
Perl Testing - stopping the firehose
I maintain a large number of Perl modules and scripts, and one thing they all have in common is a test suite, which is basically a collection of scripts inside a "t" subdirectory used to thoroughly test the behavior of the program. When using Perl, this means you are using the awesome Test::More module, which uses the Test Anything Protocol (TAP). While I love Test::More, I often find myself needing to stop the testing entirely after a certain number of failures (usually one). This is the solution I came up with.
Normally tests are run as a group, by invoking all files named t/*.t; each file has numerous tests inside of it, and these individual tests issue a pass or a fail. At the end of each file, a summary is output stating how many tests passed and how many failed. So why is stopping after a failed test even needed? The reasons below mostly relate to the tests I write for the Bucardo program, which has a fairly large and complex test suite. Some of the reasons I like having fine-grained control of when to stop are:
- Scrolling back through screens and screens of failing tests to find the point where the test began to fail is not just annoying, but a very unproductive use of my time.
- Tests are very often dependent. If test #23 fails, it means there is a very good chance that most if not all of the subsequent tests are going to fail as well, and it makes no sense for me to look at fixing anything but test #23 first.
- Tests can take a very long time to run, and I can't wait around for the errors to start appearing and hit ctrl-c. I need to kick them off, go do something else, and then come back and have the tests stop running immediately after the first failed test. Bucardo tests, for example, create and startup four different Postgres clusters, populates the databases inside each cluster with test data, installs a fresh copy of Bucardo, and *then* begins the real testing. No way I'm going to wait around for that to happen.
- Debugging is greatly aided by having the tests stop where I want them to. Often tests after the failing one will modify data and otherwise destroy the "state" such that I cannot manually duplicate the error right then and there, and thus fix it easily.
For now, my solution is to override some of the methods from Test::More. I have a standard script that does this, and I 'use' this script after I 'use Test::More' inside my test scripts. For example, a test script might look like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Test::More tests => 356;
use TestOverride;
sub some_function {
my $arr = [];
push @$arr => 4,9;
return [$arr];
}
my $t = q{Function some_function() returns correct value when called with 'foo'};
my $value = some_function('foo');
my $res = [[3],[5]];
is_deeply( $value, $res, $t);
...
$t = q{Value of baz is 123};
is ($baz, 123, $t);
...
In turn, the TestOverride file contains this:
...
use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Terse = 1;
$Data::Dumper::Pad = '|';
use base 'Exporter';
our @EXPORT = qw{ is_deeply like pass is isa_ok ok };
my $bail_on_error = $ENV{TESTBAIL} || 0;
my $total_errors = 0;
sub is_deeply {
# Return right away if the test passes
my $rv = Test::More::is_deeply(@_);
return $rv if $rv;
if ($bail_on_error and ++$total_errors >= $bail_on_error) {
my ($file,$line) = (caller)[1,2];
Test::More::diag("GOT: ".Dumper $_[0]);
Test::More::diag("EXPECTED: ".Dumper $_[1]);
Test::More::BAIL_OUT "Stopping on a failed 'is_deeply' test from line $line of $file.";
}
return;
} ## end of is_deeply
sub is {
my $rv = Test::More::is(@_);
return $rv if $rv;
if ($bail_on_error and ++$total_errors >= $bail_on_error) {
my ($file,$line) = (caller)[1,2];
Test::More::BAIL_OUT "Stopping on a failed 'is' test from line $line of $file.";
}
return;
} ## end of is
The is_deeply compares two arbitrary Perl structures (such as the arrayref here, but it can do hashes as well), and points out if they differ, and where. The "deeply" is because it will walk through the entire structure to find any differences. Good stuff.
Some things to note about the new is_deeply function: first, we simply pass in our parameters to the "real" is_deeply subroutine - the one found inside the Test::More package. If this passes (by returning true), we simply pass that truth back to the caller, and it's completely as if is_deeply had not been overwritten at all. However, if the test fails, Test::More::is_deeply will output a failure notice, but we check to see if the total number of failures for this test script ($total_errors) is greater than or equal to the threshold ($bail_on_error) that we set via then environment variable TESTBAIL. (Having it as an environment variable that defaults to zero allows the traditional behavior to be easily changed without editing any files).
If the number of failed tests is over our threshhold, we call the BAIL_OUT method from Test::More, which not only stops the current test script from running any more scripts, but stops any subsequent test files from running as well.
Before calling BAIL_OUT however, we also take advantage of the overriding to provide a little more detail about the failure. We output the line and file the test came from (because Test::More::is_deeply only sees that we are calling it from within the TestOverride.pm file). Most importantly, we output a complete dump of the expected and actual structures passed to is_deeply to be compared. The regular is_deeply only describes where the first mismatch occurs, but I often need to see the entire surrounding object. So rather than normal output looking like this:
1..356 not ok 1 - Function some_function() returns correct value when called with 'foo' # Failed test 'Function some_function() returns correct value when called with 'foo'' # at test1.t line 18. # Structures begin differing at: # $got->[0] = '4' # $expected->[0] = '3' # Looks like you planned 356 tests but ran 1. # Looks like you failed 1 test of 1 run.
The new output looks like this:
1..356 not ok 1 - Function some_function() returns correct value when called with 'foo' # Failed test 'Function some_function() returns correct value when called with 'foo'' # at TestOverride.pm line 23. # Structures begin differing at: # $got->[0] = '4' # $expected->[0] = '3' # GOT: |[ # | 4, # | [ # | 9 # | ] # |] # EXPECTED: |[ # | 3 # |] Bail out! Stopping on a failed 'is_deeply' test from line 17 of test1.t.
Yes, the Test::Most module does some similar things, but I don't use it because it's yet another module dependency, it doesn't allow me to control the number of acceptable failures before bailing, and it doesn't show pretty output for is_deeply.
Guidelines for Interchange site migrations
I'm involved at End Point often with Interchange site migrations. These migrations can be due to a new client coming to us and needing hosting or migrating from one server to another within our own infrastructure.
There are many different ways to do a migration, in the end though we need to hit on certain points to make sure that the migration goes smoothly. Below you will find steps which you can adapt for your specific migration.
During the start of the migration it might be a good time to introduce git for source control. You can do this by creating the repository and cloning it to /home/account/live, setting up .gitignore files for logs, counter files, gdbm files. Then commit the changes back to the repo and you've now introduced source control without much effort, improving the ability to make changes to the site in the future. This is also helpful to document the changes you make to the code base along the way during the migration in case you need to merge changes from the current production site before completing the migration.
- Export all of the gdbm databases to their text file equivalents on the production server
- Take a backup from production of the database, catalog, interchange server, htdocs
- Setup an account
- Create the database and user
- Restore the database, catalog, interchange server and htdocs
- Update the paths in interchange/bin for each script to point at the new location
- Grep the restored code for hard coded paths and update those paths to the new locations. Better yet move these paths out to a catalog_local.cfg where environment specific information can go.
- Grep the restored code for hard coded urls and use the [area] tag to generate the urls
- Update the urls in products/variable.txt to point at the test domain
- Update the sql settings in products/variable.txt to point at the new database using the new user
- Remove the gdbm databases so they will be recreated on startup from the source text files
- Install a local Perl if it's not already installed (./configure -des will compile and install Perl locally)
- Install Bundle::InterchangeKitchenSink
- Install the DBD module for MySQL or PostgreSQL
- Review the code base looking for use statements in custom code and Require module settings in interchange.cfg. Install the Perl modules found into the local Perl.
- Setup a non ssl and ssl virtual host using a temporary domain. Configure the temporary domain to use the SSL certificate from the production domain.
- Firewall or password protect the virtual host so it is not accessible to the public
- Generate a vlink using interchange/bin/compile and copy it into the cgi-bin directory and name it properly
- Startup the new Interchange
- Review error messages and resolve until Interchange will start properly
- Test the site thoroughly, resolving issues as they appear. Make sure that checkout, charging credit cards, sending of emails, using the admin, etc all function.
- Migrate any cron jobs running on the current production site, such as session expiration scripts
- Setup logrotation for the new logs that will be created
- Verify that you have access to make DNS changes
- Set the TTL for the domain to a low value such as 5 minutes
- Modify the new production site to respond to the production url, test by updating your hosts file to manually set the IP address of the domain
- Shutdown the new Interchange
- Restore a copy of the original backup for Interchange, the catalog and htdocs to /tmp on the production server
- Shutdown the production Interchange, put up a maintenance note on the production site.
- Take a backup of the production database and restore on the new server
- Diff the Interchange, catalog and htdocs directory between /tmp and the current production locations, making note of the files that have changed since we took the original copy.
- Copy the files that have changed, making sure to merge with any changes we have made on the new production site. Making sure to copy over all .counter and .autonumber files to the new production site.
- Start Interchange on the new production server
- Test the site thoroughly on the new production server, using the production url. Make sure that checkout with charging the credit card functions properly.
- Resolve any remaining issues found during the testing
- Setup the Interchange daemon to start at boot for this site in /etc/rc.d/rc.local or in cron using @reboot
- Update DNS to point at the new production IP address
- Update the TTL of the domain to a longer value
- Open the site to the public by opening the firewall or removing the password protection
- Keep an eye on the error logs for any issues that might crop up
This will hopefully give you a solid guide for performing an Interchange site migration from one server to another and some of the things to watch out for that might cause issues during the migrations.
Tail_n_mail and the log_line_prefix curse
One of the problems I had when writing tail_n_mail (a program that parses log files and mails interesting lines to you) was getting the program to understand the format of the Postgres log files. There are quite a few options inside of postgresql.conf that control where the logging goes, and what it looks like. The basic three options are to send it to a rotating logfile with a custom prefix at the start of each line, to use syslog, or to write it in CSV format. I'll save a discussion of all the logging parameters for another time, but the important one for this story is log_line_prefix. This is what gets prepended to each log line when using 'stderr' mode (e.g. regular log files and not syslog or csvlog). By default, log_line_prefix is an empty string. This is a very useless default.
What you can put in the log_line_prefix parameter is a string of sprintf style escapes, which Postgres will expand for you as it writes the log. There are a large number of escapes, but only a few are commonly used or useful. Here's a log_line_prefix I commonly use:
log_line_prefix = '%t [%p] %u@%d '
This tells Postgres to print out the timestamp, the PID aka process id (inside of square brackets), the current username and database name, and finally a single space to help separate the prefix visually from the rest of the line. The above will generate lines that look like this:
2010-08-06 09:24:57.714 EDT [7229] joy@joymail LOG: execute dbdpg_p7228_5: SELECT count(id) FROM joymail WHERE folder = $1 2010-08-06 09:24:57.714 EDT [7229] joy@joymail DETAIL: parameters: $1 = '4'
As you might imagine, the customizability of log_line_prefix makes parsing the log files all but impossible without some prior knowledge. I didn't want to go the pgfouine route and make people change their log_line_prefix to a specific setting. I think it's kind of rude to force your database to change its logging to accommodate your tools :). The original quick solution I came up with was to have a set of predefined regular expressions and the user would pick one that most closely matched their logs. For tail_n_mail to work properly, it needs to pick up at least the PID so it can tell when one statement ends a new one begins. For example, if you chose "regex #1", the log parsing regex would look like this:
(\d\d\d\d\-\d\d\-\d\d \d\d:\d\d:\d\d).+?(\d+)
This works fine on the example above, and gets us the timestamp and the PID from each line. The stock regexes worked for many different log_line_prefixes I came across that our clients were using, but I was never very happy with this solution. Not only was it susceptible to failing completely when a client was using a log_line_prefix not fitting into the current list of regexes, but there was no way to know exactly where the prefix ended and the statement began, which is important for the formatting of the output and the canonicaliztion of similar queries.
Enter the current solution: building a regex on the fly. Since we don't have a connection to the database at all, merely to the the log files, this requires that the user enter in their current log_line_prefix. This is a simple entry into the tailnmailrc file that looks just like the entry in postgresql.conf, e.g.:
log_line_prefix = '%t [%p] %u@%d '
The tail_n_mail script uses that variable to build a custom regex specifically tailored to that log_line_prefix and thus to the Postgres logs being used. Not only can we grab whatever bits we want (currently we only care about the timestamp (%t and %m) and the PID (%p)), but we can now cleanly break apart each line in the log into the prefix and the actual statement. This means the canonicalization/flattening of the queries is more effective, and allows us to only output the prefix information once. The output of tail_n_mail looks something like this:
Date: Fri Aug 6 11:01:03 2010 UTC Host: whale.example.com Unique items: 7 Total matches: 85 Matches from [A] /var/log/pg_log/postgresql-2010-08-05.log: 61 Matches from [B] /var/log/pg_log/postgresql-2010-08-06.log: 24 [1] From files A to B (between lines 14,205 of A and 527 of B, occurs 64 times) First: [A] 2010-08-05 16:52:11 UTC [1602] postgres@mydb Last: [B] 2010-08-06 01:18:14 UTC [20981] postgres@mydb ERROR: syntax error at or near ")" STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (?,?,?)) - ERROR: syntax error at or near ")" STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (123,'chocolate','donut')); [2] From file A (line 12,172) 2010-08-05 12:27:48 UTC [2906] bob@otherdb ERROR: invalid input syntax for type date: "May" STATEMENT: UPDATE personnel SET birthdate='May' WHERE id = 1234; (plus five other entries)
For the entry in the above example, we are able to show the complete prefix of the log lines where the error first occurred and where it most recently occurred. The next two lines show the "flattened" version of the query that tail_n_mail uses to group together similar errors. We then show a non-flattened example of an actual query from that group. In this case, someone added an extra closing paren in their application somewhere, which gives the same error each time, although the exact output changes depending on the values used. In the second example, because there is only one match, we don't bother to show the flattened version at all.
So in theory tail_n_mail should be now be able to handle any Postgres log you care to throw at it (yes, it can read syslog and csvlog format as well). As my coworker pointed out, parsing log files in this way is something that should probably be abstracted into a common module so other tools like pgsi can take advantage of it as well.
Localize $@ in DESTROY
I have been conditioned now for many years in Perl to trust the relationship of $@ to its preceding eval. The relationship goes something like this: if you have string or block eval, immediately after its execution, $@ will either be false or it will contain the die message of that eval (or the generic "Died at ..." message if none is provided). Implicit here is that evals contained within an eval have their effects on $@ concealed, unless the containing eval "passes on" the inner eval's die.
To quickly demonstrate:
use strict;
use warnings;
eval {
print "Some stuff\n";
eval {
die 'Oops. Bad inner eval';
};
printf '$@ in outer eval: %s', $@;
};
printf '$@ after outer eval: %s', $@;
print $/;
produces the following output:
[mark@sokt ~]$ perl demo.pl Some stuff $@ in outer eval: Oops. Bad inner eval at demo.pl line 7. $@ after outer eval: [mark@sokt ~]$
Only if the containing eval itself dies do we find any data in $@:
use strict;
use warnings;
eval {
print "Some stuff\n";
eval {
die 'Oops. Bad inner eval';
};
printf '$@ in outer eval: %s', $@;
die 'Uh oh. Bad outer eval, too';
};
printf '$@ after outer eval: %s', $@;
print $/;
which produces:
[mark@sokt ~]$ perl demo.pl Some stuff $@ in outer eval: Oops. Bad inner eval at demo.pl line 7. $@ after outer eval: Uh oh. Bad outer eval, too at demo.pl line 11. [mark@sokt ~]$
Why am I covering this, well known to any serious Perl programmer? Because I was caught off guard troubleshooting for a client last week when the result of an inner eval "leaked" through, affecting $@ of the containing eval. Because I was so conditioned to the stated relationship between eval and $@, it took me quite some time before I even opened up to the possibility.
It turned out the hitch had to do with garbage collection. The key was that the inner eval in question was initially called from a routine within an object's DESTROY method. As I discovered, at least in Perl 5.10, if a containing eval dies, causing an object to go out of scope, and that object's DESTROY itself executes an eval, $@ reflects the exit condition of the eval from within DESTROY, and not that of the containing eval. Even more strange, this is only true if the containing eval dies. If instead the containing eval completes, then that same dying eval within DESTROY does not affect the condition of $@ after the containing eval. It will still be false, as (IMO) it should be.
So, some code demonstrating each situation. We have 3 conditions:
- Containing eval dies, eval within DESTROY dies
- Containing eval dies, eval within DESTROY does not die
- Containing eval does not die, and eval is called within DESTROY, die or not.
Sample code demonstrating 1st condition:
use strict;
use warnings;
package Obj;
sub DESTROY {
eval { die 'in DESTROY' };
}
package main;
eval {
my $obj = {};
bless $obj, 'Obj';
die 'in main eval';
print "Super important stuff that must finish or we really need to know about it!\n";
return 1;
};
if ($@) {
printf '$@ comes from code %s', $@;
}
else {
print "Happy days! Our eval code ran to completion. Woot!\n";
}
Output as follows:
[mark@sokt ~]$ perl test1.pl $@ comes from code in DESTROY at test1.pl line 7. [mark@sokt ~]$
Demo of 2nd condition:
use strict;
use warnings;
package Obj;
sub DESTROY {
eval { 1 };
}
package main;
eval {
my $obj = {};
bless $obj, 'Obj';
die 'in main eval';
print "Super important stuff that must finish or we really need to know about it!\n";
return 1;
};
if ($@) {
printf '$@ comes from code %s', $@;
}
else {
print "Happy days! Our eval code ran to completion. Woot!\n";
}
Output as follows:
[mark@sokt ~]$ perl test2.pl Happy days! Our eval code ran to completion. Woot! [mark@sokt ~]$
Notice how particularly insidious the above is. We not only don't know what the error was from the eval block that immediately precedes the evaluation of $@, but we actually think it succeeds!
Finally, the 3rd condition:
use strict;
use warnings;
package Obj;
sub DESTROY {
eval { die 'in DESTROY' };
}
package main;
eval {
my $obj = {};
bless $obj, 'Obj';
print "Super important stuff that must finish or we really need to know about it!\n";
return 1;
};
if ($@) {
printf '$@ comes from code %s', $@;
}
else {
print "Happy days! Our eval code ran to completion. Woot!\n";
}
Output as follows:
[mark@sokt ~]$ perl test3.pl Super important stuff that must finish or we really need to know about it! Happy days! Our eval code ran to completion. Woot! [mark@sokt ~]$
So, fortunately, case 3 contains the leak when the outer eval completes successfully. We don't introduce the worst possible situation: a successful eval that is subsequently treated as a failure. However, cases 1, and especially 2, are bad enough.
Now that I know all this, the solution is thankfully simple. When constructing objects, if they include a supplied DESTROY, always localize $@. It doesn't matter whether I execute any evals or not; if the code calls any other routines that do, anywhere in the stack, the problem is introduced. A local $@ provides full protection.
A rerun of test1 but with localization provides a much more expected result:
use strict;
use warnings;
package Obj;
sub DESTROY {
local $@;
eval { die 'in DESTROY' };
}
package main;
eval {
my $obj = {};
bless $obj, 'Obj';
die 'in main eval';
print "Super important stuff that must finish or we really need to know about it!\n";
return 1;
};
if ($@) {
printf '$@ comes from code %s', $@;
}
else {
print "Happy days! Our eval code ran to completion. Woot!\n";
}
Output as follows:
[mark@sokt ~]$ perl test1.pl $@ comes from code in main eval at test1.pl line 17. [mark@sokt ~]$
and test2, which now doesn't lie to us about the success of the eval of interest:
use strict;
use warnings;
package Obj;
sub DESTROY {
local $@;
eval { 1 };
}
package main;
eval {
my $obj = {};
bless $obj, 'Obj';
die 'in main eval';
print "Super important stuff that must finish or we really need to know about it!\n";
return 1;
};
if ($@) {
printf '$@ comes from code %s', $@;
}
else {
print "Happy days! Our eval code ran to completion. Woot!\n";
}
Output as follows:
[mark@sokt ~]$ perl test2.pl $@ comes from code in main eval at test2.pl line 17. [mark@sokt ~]$
Mock Testing with Perl
I'll start by saying that I probably should have started with Test::MockObject and saved myself all of this trouble. But sometimes things don't work out that way.
So, I'm building unit tests in Perl the hard way. By the hard way, I mean that I am constructing ever more elaborate, interdependent, complex, and brittle test data sets to test the functions that I am hacking on. The data model is moderately complex, so there really isn't any way around it (since I'm doing it the hard way, after all).
At one point, one function (which I am not testing) returns a result that I need for the function I am testing. The problem is that it reaches pretty far away into a section of the data model that I'd rather not set up test data for at the moment just to get that one value. This is where I'm sitting there wishing I had mock objects more than usual, since this would be a perfect place to mock the method. Since I couldn't be bothered to see if someone had written such a handy module, I looked for a hard way to do it. Turns out that there is one.
It's not actually hard, but it could be considered complex if you are not familiar with typeglobs and the workings of the symbol table in Perl. A good discussion can be found in the Perl Cookbook in ch10.14.
In the following example, the function Base::Shipping::Package::weight is called at some point in create_shipment. Being able to call it is imperative to completing create_shipment. In my case, I have to have a successful result from create_shipment in order to test process_shipment.
{
local *Base::Shipping::Package::weight = \&test_weight
my $shipment = $class->create_shipment($shipment);
my $result = $class->process_shipment($shipment);
test $result;
}
sub test_weight { 4.0 }
In here then, the local call redefines the weight function inside the scope of the block. This turns out to be fairly convenient given that I already had the structure in place to test things this way. There are possibly other cases where something like this might make more sense than using Test::MockObject in the first place although I am somewhat skeptical.






