The wondrous world of effective dated tables

A 101 introduction to effective dated tables: why they are considered useful, and how to query some yourself.

This post is very basic and I plan to make a series out of it, but no promises.


Did you know that effective dated tables have existed for a long time? At the time of this writing, the first reference I could find on Stack Overflow is roughly 14 years old. It's highly likely effective dated queries predate Stack Overflow's birth.

So how come they're still hardly used these days? I know from my days working with Peoplesoft® that we wrote a lot of queries using effective dates, and they had both their beauty and their pitfalls. I haven't seen many systems that support effective dates out of the box, and more than six years after a pull request on web2py (which now lives in pydal) it was time to revisit this brilliant way of containing both the current state of affairs, as well as everything that led up to it, in just one table. Since I work a lot with interns at Education Warehouse, I thought a blog post or a series might be a good part of the system documentation.

The big advantage

An effective dated table is a table with an effdt column in it. In my Peoplesoft era this used to be a date column, sometimes accompanied by an effseq column to support either multiple active rows per day, or multiple entries per day of which the latest is active. Add to that an effstatus boolean that lets you filter out rows marked as 'removed', and you have everything set up for both remarkable queries and utter chaos with incomprehensible ones. With a lot of fields comes great responsibility.
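As a minimal sketch of what such a table could look like — using SQLite through Python here; the kv table name matches the example used later in this post, and the column defaults are illustrative assumptions, not a fixed convention:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table kv (
    id        integer primary key,
    key       text not null,              -- business key
    value     text not null,
    effdt     text not null,              -- effective date (ISO string in SQLite)
    effseq    integer not null default 0, -- tie-breaker for same-day changes
    effstatus integer not null default 1  -- 1 = active, 0 = 'removed'
);
""")
# a plain insert; effseq and effstatus fall back to their defaults
con.execute("insert into kv (key, value, effdt) values ('oo', 'foo', '2023-01-01')")
```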

Let's start with an example:

id           key  value  effdt
1 (active)   oo   foo    2023-01-01
2 (active)   aa   bar    2023-01-01

Using the simplest effective dated query, there are 2 active rows for 2 different keys. The value of oo is foo and the value of aa is bar, both effective since the start of this year. Now, what if a new line is added to the table because, after a week, the value loop is assigned to the key oo?

id           key  value  effdt
1            oo   foo    2023-01-01
2 (active)   aa   bar    2023-01-01
3 (active)   oo   loop   2023-01-08

Now the table contains three rows, but only 2 are "effective", because row 1 is superseded by row 3. See where this is going? You have access to both the most relevant state and historic (and future) rows. This is because you'd typically query the table based on today's date (which can of course be parameterized, offering you a view into the past, or a prediction of the future, because you can shift your effective day linearly through time).

Timeshifting into the future

Using this time-shifting ability, it's quite easy to register an anticipated future change:

id           key  value  effdt
1            oo   foo    2023-01-01
2 (active)   aa   bar    2023-01-01
3 (active)   oo   loop   2023-01-08
4            oo   zope   2030-01-01

Here we anticipate that oo will have the value zope somewhere in the years to come, but using a proper effective dated query that specific row won't show up in your results until you have long forgotten about this blog post.
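The time-shifting idea above can be sketched with a parameterized as-of date — a hedged example using SQLite through Python, with the rows from the table above; the effective() helper is hypothetical, not part of any library:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table kv (id integer primary key, key text, value text, effdt text)")
con.executemany("insert into kv values (?, ?, ?, ?)", [
    (1, "oo", "foo",  "2023-01-01"),
    (2, "aa", "bar",  "2023-01-01"),
    (3, "oo", "loop", "2023-01-08"),
    (4, "oo", "zope", "2030-01-01"),
])

def effective(con, as_of):
    # correlated subquery: for each key, keep the row with the
    # latest effdt that is not after the as-of date
    return con.execute("""
        select key, value from kv
         where effdt = (select max(kv_ed.effdt)
                          from kv as kv_ed
                         where kv_ed.key = kv.key
                           and kv_ed.effdt <= :as_of)
         order by key
    """, {"as_of": as_of}).fetchall()

print(effective(con, "2023-06-01"))  # row 3 wins for 'oo'; row 4 is still invisible
print(effective(con, "2030-06-01"))  # shift far enough into the future and row 4 wins
```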

Why is this practical, you might wonder? Well, there are a lot of applications for effective dated tables. You can imagine some of the benefits:

  • using just an effective date column, your application adds changes instead of overwriting rows, which instantly gives you non-destructive editing of data.

  • when adding an author column, you can easily audit the data (who changed what, and when)

  • when adding an effective_status column (or active column) you can even simulate deletes quite easily.

  • you can forget about queues for scheduled changes, because you can simply insert them into the table. When the query is performed, the most relevant results at query time are returned.

  • when combining multiple effective dated tables, you can base the effective date on another table instead of now(), which allows for the regeneration of output based on a specific time, as if you stepped back in time.
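The simulated-delete bullet deserves a small illustration. A hedged sketch (SQLite through Python; the effstatus column follows the naming from earlier in this post): instead of deleting the row for oo, we insert a newer, inactive row. Note that the effstatus filter belongs in the outer query — the inactive row must still supersede the older active one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""create table kv (
    id integer primary key, key text, value text,
    effdt text, effstatus integer default 1)""")
con.executemany("insert into kv (id, key, value, effdt, effstatus) values (?,?,?,?,?)", [
    (1, "oo", "foo", "2023-01-01", 1),
    (2, "oo", "foo", "2023-02-01", 0),  # 'delete' oo by inserting an inactive row
])

rows = con.execute("""
    select key, value from kv
     where effstatus = 1
       and effdt = (select max(kv_ed.effdt)
                      from kv as kv_ed
                     where kv_ed.key = kv.key
                       and kv_ed.effdt <= '2023-06-01')
""").fetchall()
print(rows)  # [] -- the latest effective row is inactive, so 'oo' is hidden
```

Querying with an as-of date before 2023-02-01 would still show oo, so the "delete" is as reversible and time-shiftable as any other change.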

Talk SQL to me

Let's query the effective rows per key, based on the current date and time, from the supposed kv table above, using a correlated subquery. That is, a subquery that references values from the outer query; in this case, our key field.

select *
  from kv
 where kv.effdt = (select max(kv_ed.effdt)
                     from kv as kv_ed
                    where kv_ed.key = kv.key
                      and kv_ed.effdt <= now()
                  )

Albeit useful, this results in a subquery being executed per row, which can quickly become costly. Indexes are your friend here: in this case an index on (key, effdt) benefits this query. With only 8k rows in a table, that index reduced my execution time from 13s down to several milliseconds.
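One way to check that such an index is actually picked up — here with SQLite's explain query plan, as a sketch; your database's tooling will differ, and the index name is just illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table kv (id integer primary key, key text, value text, effdt text)")
# composite index matching the subquery's predicates:
# equality on key first, then the effdt range
con.execute("create index kv_key_effdt on kv (key, effdt)")

plan = con.execute("""
    explain query plan
    select * from kv
     where kv.effdt = (select max(kv_ed.effdt)
                         from kv as kv_ed
                        where kv_ed.key = kv.key
                          and kv_ed.effdt <= '2023-06-01')
""").fetchall()
for row in plan:
    print(row[-1])  # the correlated subquery should report a search using kv_key_effdt
```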

Using a join is sometimes doable and easier, though it has more limitations: complex correlations with the outer query can't be expressed in a plain join. But it is pretty fast, because the kv_ed subquery needs to be executed exactly once, and the resulting join is quick. Also note that a join allows for multiple columns to be returned, whereas a correlated subselect returns just a single column.

select kv.*
  from kv
  join (select key, max(effdt) as effdt
          from kv
         where effdt <= now()
         group by key
       ) as kv_ed
    on kv_ed.key = kv.key
   and kv_ed.effdt = kv.effdt

There are times when it might even be interesting to use subselects in the select list to gain performance. This mostly happens when you select only a few rows and the query planner is bright enough to perform the subselect only for those resulting rows, instead of executing it for every row. Mind you, such a subselect only allows for a single column to be returned. You could of course use compounds like json or similar, but that might make things quite a bit harder.
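A sketch of that single-column pattern, again with SQLite through Python: one scalar subselect per output row, looking up the effective value for each distinct key (the as-of date is hard-coded here for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table kv (id integer primary key, key text, value text, effdt text)")
con.executemany("insert into kv values (?,?,?,?)", [
    (1, "oo", "foo",  "2023-01-01"),
    (2, "aa", "bar",  "2023-01-01"),
    (3, "oo", "loop", "2023-01-08"),
])

rows = con.execute("""
    select k.key,
           -- the scalar subselect can return only a single column:
           -- the effective value for this key
           (select value
              from kv as kv_ed
             where kv_ed.key = k.key
               and kv_ed.effdt <= '2023-06-01'
             order by kv_ed.effdt desc, kv_ed.id desc
             limit 1) as value
      from (select distinct key from kv) as k
     order by k.key
""").fetchall()
print(rows)  # [('aa', 'bar'), ('oo', 'loop')]
```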

Conclusion

Effective dated tables use an effective date field and rely heavily on the right queries to get the result you want. The queries are a little more complex, but they give you a lot of extra functionality that doesn't need multiple tables or loads of separate application logic.