Detecting gaps in time-series data in PostgreSQL
October 26, 2020
A client has a number of data feeds that are supposed to update at regular intervals. Like most things in the universe (thanks, entropy!) this does not always end up working out the way we want. We recently discovered that some of the data had not loaded as expected, and were brought in to assess the extent of the issue.
There were 2 main feeds that had issues, with different types of time data. The first had date-based data with batches and the second had full-range timestamps. We will examine each type individually, since they have similar — but not the same — characteristics.
Date-based, with batches
In the first data feed there was a table which tracked which files had been loaded by file date and batch number (there was a morning/evening batch designated by a
batch field with either
B). Since the files that had not loaded successfully did not have entries in this table we could find that one or both of the day’s batch files would be missing, but since the table tracked those which did get loaded, we needed to turn this list into something useful.
Any time I find myself considering sequences of data, I remember my friend
generate_series() is often called with numeric data, it can also generate date ranges, which can come in very handy:
postgres=# SELECT * FROM generate_series('2020-01-01'::date, '2020-10-01'::date, interval '1 day') LIMIT 10; generate_series ------------------------ 2020-01-01 00:00:00-06 2020-01-02 00:00:00-06 2020-01-03 00:00:00-06 2020-01-04 00:00:00-06 2020-01-05 00:00:00-06 2020-01-06 00:00:00-06 2020-01-07 00:00:00-06 2020-01-08 00:00:00-06 2020-01-09 00:00:00-06 2020-01-10 00:00:00-06 (10 rows)
Let’s create our sample table and populate with data to simulate the situation we ran into:
CREATE TABLE loader_manifest (filename text, batch char, status char, processed_at timestamptz); INSERT INTO loader_manifest (filename, batch, status, processed_at) SELECT to_char(generate_series('2020-01-01'::date, '2020-10-22'::date, '1 day'), 'YYYYMMDD'), batch, 'L', now() FROM (VALUES ('A'::char), ('B')) batches(batch) ORDER BY 1,2 ; -- create some gaps to later detect DELETE FROM loader_manifest WHERE filename = '20200218' AND batch = 'B'; DELETE FROM loader_manifest WHERE filename = '20200426' AND batch = 'A'; DELETE FROM loader_manifest WHERE filename = '20201019';
The data that we were trying to match against looked similar to the following:
postgres=# SELECT * FROM loader_manifest LIMIT 5; filename | batch | status | processed_at ----------+-------+--------+------------------------------- 20200101 | A | L | 2020-10-18 11:25:05.878229-05 20200101 | B | L | 2020-10-18 11:25:05.878229-05 20200102 | A | L | 2020-10-18 11:25:05.878229-05 20200102 | B | L | 2020-10-18 11:25:05.878229-05 20200103 | A | L | 2020-10-18 11:25:05.878229-05 (5 rows)
Since either the batch or entire days could have been missing we basically needed to generate the list of expected combinations and use a
LEFT JOIN to find which values in this table were missing. While this example has only 2 batch options, this solution would work for additional numbers of expected batches.
generate_series() is a set-returning function we can use it as a source in our
FROM clause, in combination with the explicit values we want for our
batch field. We also will need to extract the date pieces out in a way that we can match the expected
filename format. We use something similar to the following:
postgres=# SELECT to_char(filename, 'YYYYMMDD') AS filename, batch FROM generate_series('2020-01-01'::date, '2020-10-01'::date, interval '1 day') AS filename, (VALUES ('A'), ('B')) AS batches(batch) LIMIT 10; filename | batch ----------+------- 20200101 | A 20200101 | B 20200102 | A 20200102 | B 20200103 | A 20200103 | B 20200104 | A 20200104 | B 20200105 | A 20200105 | B (10 rows)
So as we can see, this will generate the list of all expected filename/batch combinations that we can then check the original table against for missing values:
SELECT candidates.* FROM ( SELECT to_char(filename,'YYYYMMDD') AS filename, batch FROM generate_series('2020-01-01'::date, '2020-10-01'::date, interval '1 day') AS filename, (values ('A'), ('B')) AS batches(batch) ) candidates NATURAL LEFT JOIN loader_manifest WHERE loader_manifest.filename IS NULL
candidates is the generated list of all expected file/batch combinations.
And the results from our sample dataset:
filename | batch ----------+------- 20200218 | B 20200426 | A 20201019 | A 20201019 | B (4 rows)
As expected, this has identified the rows that we know are missing from the table. Success!
The second type of query had to identify similar gaps in loaded data, but used a different approach. This data feed had a
processing_time_utc field which would vary, but we expected this to have fairly regular updates. This data feed could be processed somewhat erratically, but if the data was complete we would expect to see records with
processing_time_utc approximately every 10 minutes. If we found that there were gaps much larger than every 10 minutes we could suspect missing data and would need to locate/reprocess the underlying data files.
This approach also uses
generate_series(), but since we do not have a single date that we can check existence of, the
JOIN approach does not work directly. This means that in order to find the gaps we care about, the need to check that at least some data exists within each time period in question.
Here’s the query:
SELECT day FROM generate_series('2020-07-01'::date, '2020-08-01'::date, interval '10 minutes') AS day LEFT JOIN LATERAL (SELECT day FROM the_table WHERE processing_time_utc >= day AND processing_time_utc < day + interval '10 minutes' LIMIT 1) gap ON day = gap.day WHERE gap.day IS NULL;
The key here is that we use
generate_series() with an
interval of the window size we care about, and then use that same
interval when checking for any row’s existence that fits that criteria. With the
LIMIT 1 on the
gap subquery, we stop at the first record we find in that range, and thus we can exclude in the antijoin to find our results.
I should note that this approach works regardless of the size of the logical date partitions, and can be used to identify all gaps of at least the interval size we are looking at; the same query can check for gaps every 10 minutes as for hourly, daily, weekly, etc. It is purely a matter of selecting the proper interval size in the query in question.
Also note that the
processing_time_utc field is indexed, or this would be painfully slow. We use a
LATERAL query to pull in one specific row that matches the time interval we care about and ignore any others.
Be aware that this technique finds only the cases where no data is present for the time interval, not less than expected. You could certainly adapt this approach to that situation by using an appropriate query.
These approaches definitely come in handy when trying to isolate missing data based on time; which tack you take will depend on the expected data present, as well as the granularity in which you identify your gaps.