Getting temporal configuration values into date-range value equivalents

I collect a lot of configuration values from my database servers and most of these values are stored by date.

So often I end up with values such as:


config_date config_value
2010-09-01  value_1
2010-09-02  value_1
2010-09-03  value_2
2010-09-04  value_3
2010-09-05  value_3
2010-09-06  value_3
2010-09-07  value_3
2010-09-08  value_4
2010-09-09  value_4
2010-09-10  value_1
2010-09-11  value_5
2010-09-12  value_5
2010-09-13  value_5
I´ve been unsuccessfully been trying to figure out how to convert this in SQL into something like the following:


config_from config_to   config_value
2010-09-01  2010-09-02  value_1
2010-09-03  2010-09-03  value_2
2010-09-04  2010-09-07  value_3
2010-09-08  2010-09-09  value_4
2010-09-10  2010-09-10  value_1
2010-09-11  2010-09-13  value_5

The second format is often much shorter if configuration changes are not that frequent. While this is straight forward to do in a programming language in a single pass once you have the config_date ordered data, I can’t figure out how to do this with SQL and no programming. Celko´s books haven’t enlightened me either.

So do you know how to do this?

Tags: ,

13 Responses to “Getting temporal configuration values into date-range value equivalents”

  1. Jeremy Cole says:

    Simon,

    I accept your challenge!

    Here’s the schema I used (btw, you should provide that if you want people to try…):

    create table config (config_id int unsigned not null auto_increment, config_date date not null, config_value varchar(50) null, primary key(config_id));

    insert into config (config_date, config_value)
    values
    (“2010-09-01″, “value_1″),
    (“2010-09-02″, “value_1″),
    (“2010-09-03″, “value_2″),
    (“2010-09-04″, “value_3″),
    (“2010-09-05″, “value_3″),
    (“2010-09-06″, “value_3″),
    (“2010-09-07″, “value_3″),
    (“2010-09-08″, “value_4″),
    (“2010-09-09″, “value_4″),
    (“2010-09-10″, “value_1″),
    (“2010-09-11″, “value_5″),
    (“2010-09-12″, “value_5″),
    (“2010-09-13″, “value_5″);

    The solution I came up with at first pass is this:

    set @config_seq = 0;
    set @config_value = “”;
    select min(config_date) as config_from, max(config_date) as config_to, config_value from (select config_date, config_value, @config_seq := if(@config_value != config_value, (0*(@config_value:=config_value))+@config_seq + 1, @config_seq) as config_seq from config order by config_date) as t group by t.config_seq;

    And the result is exactly as you suggested above.

    This is of course highly MySQL-specific as a solution.

    Regards,

    Jeremy

  2. Xaprb says:

    select min(config_date) as config_from, max(config_date) as config_to, config_value from config group by config_value;

  3. Simon J Mudd says:

    Xaprb, your solution does not work. Note, these are contiguous date ranges and that value_1 is shown in 2 places. A simple SELECT and GROUP BY won’t work.

  4. Simon J Mudd says:

    Hi Jeremy,

    That’s nice and thanks. Your solution does indeed work perfectly. I’d forgotten about the non-standard way you can use MySQL to use variables to perform the task. Thanks and well done. This is a handy tip to know, and something I’m likely to use often.

  5. Simon J Mudd says:

    Jeremy’s solution is nice but has a big disadvantage. I can’t hide the complexity of the SQL from the users in a VIEW which would be convenient. Also if you try to ignore the intial setting up of the variable then MySQL does NOT complain (which is bad) and the results may not be what you expect. So looking for a clean SQL solution to the problem in spite of Jeremy’s good first offer.

  6. Jeremy Cole says:

    Baron,

    Hehe, I tried that solution first too, and quickly noticed that the result was not correct — suspecting that Simon slipped in the double-instance of “value_1″ to catch that particular oversight.

    Simon,

    You could wrap the set of queries up into a stored procedure pretty easily, but you’re right, it’s not possible to do with a view.

    I’m not sure that it’s actually possible to do it at all without user variables with the current schema, given the non-sequential nature of SQL.

    However, if you could change your schema (updating your old data with a one-time somewhat inefficient query) with e.g. config_version incrementing for each unique *change* of the configuration variable, you could avoid them. In that case it becomes a simple:

    select min(config_date) as config_from, max(config_date) as config_to, config_value from config group by config_value, config_version

    Then each time you insert the data you compare to the previous version and increment config_version if config_value differs.

    Regards,

    Jeremy

  7. Did my solution make it? I posted, but got no response indicating my message was received and queued for moderation or whatever…

  8. select s1.config_date from_date
    , max(s2.config_date) to_date
    , s1.config_value
    from simon s1
    inner join simon s2
    on s1.config_value = s2.config_value
    and s1.config_date <= s2.config_date
    left join simon s3
    on s3.config_value != s1.config_value
    and s3.config_date between s1.config_date
    and s2.config_date
    or s3.config_value = s1.config_value
    and s3.config_date = s1.config_date – INTERVAL 1 DAY
    where s3.config_date is null
    group by s1.config_date

  9. I should point out that there is another solutions that may be more efficient than my 3 way join solution. However, that relies on a SELECT in the FROM list, which cannot be used to create a MySQL view. Hence, I came up with the unwieldy but correct 3-way self join solution shown above. You should be warned that this could be a pretty poor performing query, and you really should have some process in place to extract and store the data for convenient reporting if you need good performance.

  10. Simon J Mudd says:

    Hi Roland, yes, you made it. It also works. I’ve rewritten this so the formatting is a bit cleaner and easier to read. Also changed the table and column names to something a bit shorter to ease readability.

    DROP TABLE IF EXISTS c;
    CREATE TABLE c (d DATE NOT NULL, v INT NULL, PRIMARY KEY(d));

    INSERT INTO c (d, v)
    VALUES
    ('2010-09-01', 1),
    ('2010-09-02', 1),
    ('2010-09-03', 2),
    ('2010-09-04', 3),
    ('2010-09-05', 3),
    ('2010-09-06', 3),
    ('2010-09-07', 3),
    ('2010-09-08', 4),
    ('2010-09-09', 4),
    ('2010-09-10', 1),
    ('2010-09-11', 5),
    ('2010-09-12', 5),
    ('2010-09-13', 5);

    EXPLAIN SELECT c1.d AS from_date
    , MAX(c2.d) AS to_date
    , c1.v
    FROM c c1
    INNER JOIN c c2 ON ( c1.v = c2.v AND c1.d < = c2.d )
    LEFT JOIN c c3 ON (
    ((c3.v != c1.v) AND (c3.d BETWEEN c1.d AND c2.d)) OR
    ((c3.v = c1.v) AND (c3.d = c1.d - INTERVAL 1 DAY))
    )
    WHERE c3.d IS NULL
    GROUP BY c1.d
    ;

    This does give the required results.

    +------------+------------+------+
    | from_date | to_date | v |
    +------------+------------+------+
    | 2010-09-01 | 2010-09-02 | 1 |
    | 2010-09-03 | 2010-09-03 | 2 |
    | 2010-09-04 | 2010-09-07 | 3 |
    | 2010-09-08 | 2010-09-09 | 4 |
    | 2010-09-10 | 2010-09-10 | 1 |
    | 2010-09-11 | 2010-09-13 | 5 |
    +------------+------------+------+
    6 rows in set (0.02 sec)

    The explain is a bit worrying.

    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
    | 1 | SIMPLE | c1 | ALL | PRIMARY | NULL | NULL | NULL | 13 | Using temporary; Using filesort |
    | 1 | SIMPLE | c2 | ALL | PRIMARY | NULL | NULL | NULL | 13 | Using where; Using join buffer |
    | 1 | SIMPLE | c3 | ALL | PRIMARY | NULL | NULL | NULL | 13 | Using where; Not exists |
    +----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
    3 rows in set (0.02 sec)

    This sample table was quite small. The real tables I'm using have 70,000 "sets of configs", that is 1 of these rows represents a set of config values. Denormalised the table holds 18,000,000 rows. So it's likely that even if I'm only looking for config changes of a single server/config_item this select may be unworkable. I'll have to see. I also need to change this so that the assumption that there are consecutive days of config values is removed as while this should be the case it may not always be so.

    Nevertheless, thanks for the solution. It works and points me to how to use it in my real life problem. This is also probably useful for a number of other people doing similar things.

  11. Hi Simon,

    yeah, the query is not efficient. I think there are ways to improve it though, and I’ll post back if I find time to tinker with it more.

    Would you be willing to send me your raw data? It would make it more fun for me. You can use my gmail addres for that if you’re interested.

    Finally, I would like to stress again that I would probably set up some process to massage the data if I’d need regular reporting.

  12. Simon J Mudd says:

    Hi Roland,

    Basically the “config table” is for all the servers I’m managing and stores the output daily of SHOW GLOBAL VARIABLES.

    Something like

    INSERT INTO c
    SELECT @@hostname AS hostname, @@datadir as datadir, CURRENT_DATE() as report_date, * FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES WHERE VARIABLE_NAME <> 'TIMESTAMP'

    See: http://bugs.mysql.com/bug.php?id=49686 which I brought up with MySQL some time ago and still needs fixing.

    Do this by collecting data from several different servers.

    From this you can get things out like:
    1. show all changes current value vs previous value
    2. show the changes for a particular host/config_variable overtime (the original point of this post)
    3. show changes against previous date: variable name, number of hosts with a change
    4. check the configuration of one server versus another one.

    This is very handy for auditing and other purposes.
    So there are a lot of “config variables”. Usual usage of this table (something I’m still working on) is to provide for a specific server a list of values over time for a specific value. Most of these values don’t really change.

    If you want values you can combine

  13. Simon J Mudd says:

    Also worth noting that while normally I do attempt to collect data every day this may not in fact happen so the “simplification” of “(c3.d = c1.d – INTERVAL 1 DAY))” can not be used.

    Given the complexity of the simple form of the query it certainly does seem that some sort of derived table would be better, and that can be updated by triggers inserting into the base tables.

Leave a Reply