grouping consecutive records

Discussion:

(too old to reply)

Morus Walter

2013-02-04 09:34:54 UTC

Hallo,

I have a question regarding a selection.

I'd like to group and merge certain records having the same values in
some columns, but only if they are contiguous with regard to some sort
order.

So for a table
create table foo (
id int,
user_id int,
key varchar,
sort int );

and values e.g.
insert into foo
values ( 1, 1, 'foo', 1),
( 2, 1, 'foo', 2),
( 3, 1, 'bar', 3),
( 4, 1, 'foo', 4),
( 5, 1, 'foo', 5),
( 6, 1, 'foo', 6),
( 7, 1, 'bla', 7),
( 8, 2, 'bar', 1),
( 9, 2, 'foo', 2),
(10, 2, 'foo', 3),
(11, 2, 'bla', 4);

I'd like to merge all consecutive records (ordered by sort, user_id)
having the same value in user_id and key and keep the first/last
value of sort of the merged records (and probably some more values
from the first or last merged record).

So the result should be something like
user_id, key, sort_first, sort_last
1, 'foo', 1, 2
1, 'bar', 3, 3
1, 'foo', 4, 6
1, 'bla', 7, 7
2, 'bar', 1, 1
2, 'foo', 2, 3
2, 'bla', 4, 4

I was trying to do that using window functions, which works great -
except it merges non consecutive occurences (key foo for user_id 1 in
my sample) as well.

select user_id, key, sort_first, sort_last
from (
select user_id,
key,
first_value(sort) over w as sort_first,
last_value(sort) over w as sort_last,
lead(key) over w as next_key
from foo
window w as (partition by user_id, key order by sort
range between unbounded preceding and unbounded following)
) as foo
where next_key is null
order by user_id, sort_first;

user_id | key | sort_first | sort_last
---------+-----+------------+-----------
1 | foo | 1 | 6 <-- would like to have two records
1/2 and 4/6 here
1 | bar | 3 | 3
1 | bla | 7 | 7
2 | bar | 1 | 1
2 | foo | 2 | 3
2 | bla | 4 | 4

Introducing another window on user_id only allows me to keep two records for
1/foo but I still cannot determine the intended sort_first/sort_last.

select user_id, key, sort_first, sort_last
from (
select user_id,
key,
first_value(sort) over w as sort_first,
last_value(sort) over w as sort_last,
lead(key) over u as next_key
from foo
window u as (partition by user_id order by sort),
w as (partition by user_id, key order by sort
range between unbounded preceding and unbounded following)
) as foo
where next_key is null or key != next_key
order by user_id, sort_first;

user_id | key | sort_first | sort_last
---------+-----+------------+-----------
1 | foo | 1 | 6
1 | foo | 1 | 6
1 | bar | 3 | 3
1 | bla | 7 | 7
2 | bar | 1 | 1
2 | foo | 2 | 3
2 | bla | 4 | 4

So the question is: is this doable with a selection?
Can I use window functions for this type of grouping?
Are there other options?

I do have an alternative plan to select records into a temporary table first,
and then do updates merging two consecutive records and repeat that until
all groups are completely merged, but I'd still like to know if I miss
something regarding selection options.

best
Morus

PS: the alternative plan is something like

select id, user_id,
key,
sort,
sort as sort_last,
lead(key) over u as next_key,
lead(id) over u as next_id,
lag(key) over u as prev_key
into temp table footmp
from foo
window u as (partition by user_id order by sort);

update footmp set sort = f2.sort, prev_key = f2.prev_key
from footmp f2
where footmp.id = f2.next_id and
footmp.key = f2.key and
f2.key = f2.next_key and
( f2.prev_key is null or f2.prev_key != f2.key );

delete from footmp
where id in (
select id from (
select first_value(id) over w as id,
count(*) over w as cnt
from footmp
window w as ( partition by user_id, sort )
) as foo where cnt > 1
);

(repeat update/delete until no row is affected)

select user_id,
key,
sort as sort_first,
sort_last
from footmp
order by user_id, sort_first;

pretty ugly and complicated but at least gives me what I want...

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Виктор Егоров

2013-02-04 10:14:06 UTC

Permalink

Post by Morus Walter
I'd like to merge all consecutive records (ordered by sort, user_id)
having the same value in user_id and key and keep the first/last
value of sort of the merged records (and probably some more values
from the first or last merged record).
So the result should be something like
user_id, key, sort_first, sort_last
1, 'foo', 1, 2
1, 'bar', 3, 3
1, 'foo', 4, 6
1, 'bla', 7, 7
2, 'bar', 1, 1
2, 'foo', 2, 3
2, 'bla', 4, 4

This example corresponds to the ORDER BY user_id, sort
while you claim you need to ORDER BY sort, user_id.

I will explain this for the ordering that matches your sample.

You need to group your data, but you should first create an artificial
grouping column.

First, detect ranges of your buckets:
WITH ranges AS (
SELECT id, user_id, key, sort,
CASE WHEN lag(key) OVER
(PARTITION BY user_id ORDER BY user_id, sort) = key
THEN NULL ELSE 1 END r
FROM foo
)
SELECT * FROM ranges;

Here each time a new “range” is found, «r» is 1, otherwise it is NULL.

Now, form your grouping column:
WITH ranges AS (
SELECT id, user_id, key, sort,
CASE WHEN lag(key) OVER
(PARTITION BY user_id ORDER BY user_id, sort) = key
THEN NULL ELSE 1 END r
FROM foo
)
, groups AS (
SELECT id, user_id, key, sort, r,
sum(r) OVER (ORDER BY user_id, sort) grp
FROM ranges
)
SELECT * FROM groups;

Here sum() is used as running total to produce new “grp” values.

Final query looks like this:
WITH ranges AS (
SELECT id, user_id, key, sort,
CASE WHEN lag(key) OVER
(PARTITION BY user_id ORDER BY user_id, sort) = key
THEN NULL ELSE 1 END r
FROM foo
)
, groups AS (
SELECT id, user_id, key, sort, r,
sum(r) OVER (ORDER BY user_id, sort) grp
FROM ranges
)
SELECT min(user_id) user_id, min(key) "key",
min(sort) sort_first,
max(sort) sort_last
FROM groups
GROUP BY grp
ORDER BY user_id,sort_first;

Based on this SO answer: http://stackoverflow.com/a/10624628/1154462

--
Victor Y. Yegorov
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Morus Walter

2013-02-04 10:46:41 UTC

Permalink

Hallo Виктор,

thanks a lot for your explanation :-)
You rock!

Post by ÐÐ¸ÐºÑÐ¾Ñ ÐÐ³Ð¾ÑÐ¾Ð²
This example corresponds to the ORDER BY user_id, sort
while you claim you need to ORDER BY sort, user_id.

right, I confused the order.

Post by ÐÐ¸ÐºÑÐ¾Ñ ÐÐ³Ð¾ÑÐ¾Ð²
I will explain this for the ordering that matches your sample.
You need to group your data, but you should first create an artificial
grouping column.
WITH ranges AS (
SELECT id, user_id, key, sort,
CASE WHEN lag(key) OVER
(PARTITION BY user_id ORDER BY user_id, sort) = key
THEN NULL ELSE 1 END r
FROM foo
)
SELECT * FROM ranges;
Here each time a new “range” is found, «r» is 1, otherwise it is NULL.
WITH ranges AS (
SELECT id, user_id, key, sort,
CASE WHEN lag(key) OVER
(PARTITION BY user_id ORDER BY user_id, sort) = key
THEN NULL ELSE 1 END r
FROM foo
)
, groups AS (
SELECT id, user_id, key, sort, r,
sum(r) OVER (ORDER BY user_id, sort) grp
FROM ranges
)
SELECT * FROM groups;

so the trick is to flag changes in key and afterwards count them using
the dynamic nature of a frame ending with the current row.
great :-)
Once you have a group column, it's pretty clear then.

thanks
Morus

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general