The wondrous world of effective dated tables
A 101 introduction to effective dated tables: why they are considered useful and how to query them yourself.
This post is very basic and I plan to make a series out of it, but no promises.
Did you know that effective dated tables have existed for a long time? At the time of this writing, the first reference I could find on Stack Overflow is roughly 14 years old. It's highly likely effective dated queries predate Stack Overflow's birth.
So how come they're still hardly used these days? I know from my days working with PeopleSoft® that we wrote a lot of queries using effective dates, and they had both their beauty and their pitfalls. I haven't seen many systems that support effective dates out of the box, and more than six years after a pull request on web2py (which now lives in pydal), it was time to revisit this brilliant way of containing both the current state of affairs and everything that led up to it in just one table. Since I work a lot with interns at Education Warehouse, I thought a blog post or a series might be a good part of the system documentation.
The big advantage
An effective dated table is a table with an effdt column in it. This used to be a date column in my PeopleSoft era, sometimes followed by an effseq column to support either multiple active rows per day, or multiple entries per day of which the latest is active. Add to that an effstatus boolean, which allows filtering out rows marked 'removed', and you have everything set up for both remarkable queries and utter chaos with incomprehensible queries. With a lot of fields comes great responsibility.
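A minimal sketch of such a table (the exact types and the PostgreSQL syntax are my assumption; the examples below only use id, key, value and effdt):

create table kv (
    id        serial primary key,
    key       text not null,
    value     text not null,
    effdt     date not null,                -- the effective date
    effseq    integer not null default 0,   -- optional: orders multiple changes on one day
    effstatus boolean not null default true -- optional: false marks the row as 'removed'
);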
Let's start with an example:
id | key | value | effdt |
1 (active) | oo | foo | 2023-01-01 |
2 (active) | aa | bar | 2023-01-01 |
Using the simplest effective dated query, there are two active rows for two different keys: the value of oo is foo and the value of aa is bar, both effective since the start of this year. Now, what if a new row is added to the table because, after a week, the value loop is assigned to the key oo?
id | key | value | effdt |
1 | oo | foo | 2023-01-01 |
2 (active) | aa | bar | 2023-01-01 |
3 (active) | oo | loop | 2023-01-08 |
Now the table contains three rows, but only two are "effective", because row 1 is superseded by row 3. See where this is going? You have access to the most relevant state as well as historic (and future) rows. This is because you'd normally query the table based on today's date, which can of course be parameterized, offering you a view into the past, or a prediction of the future, because you can shift your effective date linearly through time.
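The change above is nothing more than a plain insert; row 1 is never updated or deleted. A sketch against the kv table from the example:

insert into kv (key, value, effdt)
values ('oo', 'loop', '2023-01-08');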
Timeshifting into the future
Using this time-shifting ability, it's quite easy to register an anticipated future change:
id | key | value | effdt |
1 | oo | foo | 2023-01-01 |
2 (active) | aa | bar | 2023-01-01 |
3 (active) | oo | loop | 2023-01-08 |
4 | oo | zope | 2030-01-01 |
Here we anticipate that oo will have the value zope somewhere in the years to come, but using a proper effective dated query, that specific row won't show up in your results until you have long forgotten about this blog post.
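What keeps row 4 invisible is simply the effective date filter. A sketch of such a query for a single key, assuming PostgreSQL syntax:

select *
from kv
where kv.key = 'oo'
  and kv.effdt <= now()   -- row 4 (2030-01-01) stays hidden until that date
order by kv.effdt desc
limit 1;                  -- the most recent effective row wins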
Why is this practical, you might wonder? Well, there are a lot of applications for effective dated tables. You can imagine some of the benefits:

- using just an effective date column, your application adds changes as new rows, which instantly allows for non-destructive editing of data
- when adding an author column, you can easily audit the data (who did what)
- when adding an effective_status column (or active column) you can even simulate deletes quite easily (see the sketch after this list)
- you can forget about queues with scheduled changes, because you can simply insert them into the table; when the query is performed, the most relevant results at query time are returned
- when combining multiple effective dated tables, you can choose to base the effective date on another table instead of now(), which allows for regenerating output as of a specific time, as if you had stepped back in time
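A sketch of those last points, assuming the optional effstatus column mentioned earlier and PostgreSQL syntax: a 'delete' becomes just another insert, and the as-of moment is an ordinary parameter:

-- simulate deleting the key oo by inserting an inactive row, effective from today
insert into kv (key, value, effdt, effstatus)
values ('oo', 'loop', current_date, false);

-- the effective row for oo as of an arbitrary moment; if that latest row is inactive,
-- the key counts as deleted and nothing is returned
select *
from (
    select *
    from kv
    where kv.key = 'oo'
      and kv.effdt <= '2023-06-01'   -- parameterize this to look at any point in time
    order by kv.effdt desc
    limit 1
) as latest
where latest.effstatus;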
Talk SQL to me
Let's query the effective rows per key, based on the current date and time, from the supposed kv table above using a correlated subquery. That is, a subquery that references values from the outer query. In this case, our key field.
select *
from kv
where kv.effdt = (
        -- the most recent effective date for this key, as of now
        select max(kv_ed.effdt)
        from kv as kv_ed
        where kv_ed.key = kv.key
          and kv_ed.effdt <= now()
      )
group by key
Albeit useful, this does result in a subquery being executed per row, which can quickly become costly. Indexes are your friend here: in this case an index on (key, effdt) benefits this query. With only 8k rows in a table, that index reduced my execution time from 13 seconds down to several milliseconds.
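For example, assuming PostgreSQL syntax and the kv table sketched earlier:

create index kv_key_effdt_idx on kv (key, effdt);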
Using a join is sometimes doable and easier, though it has more limitations due to complexity that can't be handled when a lot of correlations are in place. But it is pretty fast, because the kv_ed query needs to be executed exactly once, and the resulting join is quite fast. Also note that a join allows for multiple columns to be returned, whereas a correlated subselect can only return a single column.
select kv.*
from kv
join (
    -- one row per key: the most recent effective date as of now
    select key, max(effdt) as effdt
    from kv
    where effdt <= now()
    group by key
) as kv_ed
  on kv_ed.key = kv.key
 and kv_ed.effdt = kv.effdt   -- assuming (key, effdt) is unique, this yields one row per key
There are times when it might even be interesting to use subselects in the select clause to gain performance. This mostly happens when you select only a few rows and the query planner is bright enough to perform the subselect only for the resulting rows, instead of executing it for every row. Mind you, this only allows for a single column to be returned. You could of course use compounds like json or similar, but that might make things quite a bit harder.
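A sketch of that pattern: the correlated subselect moves into the select list and returns exactly one column per key (assuming PostgreSQL syntax):

select k.key,
       (select kv_ed.value
        from kv as kv_ed
        where kv_ed.key = k.key
          and kv_ed.effdt <= now()
        order by kv_ed.effdt desc
        limit 1) as effective_value
from (select distinct key from kv) as k;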
Conclusion
Effective dated tables use an effective date field and rely heavily on the right queries to get the result you want. The queries are a little more complex, but they give you a lot of extra functionality without needing multiple tables or loads of separate application logic.