Optimizing query?

Discussion:

Optimizing query?

(too old to reply)

w***@noten5.maas-noten.de

2013-01-30 11:08:05 UTC

Hi,

I am trying to match items from 2 tables based on a common string.
One is a big table which has one column with entries like XY123, ABC44, etc
The table has an index on that column.
The second table is, typically, much smaller

select .... from tab1, tab2 where tab1.code = tab2.code;

This works fine and fast.
Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the
big table and want them to match XY423, GF55 in the second table

Variants I have tried

select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');

both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2
share initial "AX") I get back to manageable times by adding
and tab1.code ~ '^AX'
into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2
is about 40 seconds.

Regards
Wolfgang Hamann

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Vincent Veyron

2013-01-31 07:07:30 UTC

Post by w***@noten5.maas-noten.de
Hi,
I am trying to match items from 2 tables based on a common string.
One is a big table which has one column with entries like XY123, ABC44, etc
The table has an index on that column.
The second table is, typically, much smaller
select .... from tab1, tab2 where tab1.code = tab2.code;
This works fine and fast.
Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the
big table and want them to match XY423, GF55 in the second table
Variants I have tried
select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');

Have you tried the substring function?

select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =
tab2.code

Post by w***@noten5.maas-noten.de
both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2
share initial "AX") I get back to manageable times by adding
and tab1.code ~ '^AX'
into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2
is about 40 seconds.
Regards
Wolfgang Hamann

--
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Pavel Stehule

2013-01-31 08:49:47 UTC

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
Hi,
I am trying to match items from 2 tables based on a common string.
One is a big table which has one column with entries like XY123, ABC44, etc
The table has an index on that column.
The second table is, typically, much smaller
select .... from tab1, tab2 where tab1.code = tab2.code;
This works fine and fast.
Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the
big table and want them to match XY423, GF55 in the second table
Variants I have tried
select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');

Have you tried the substring function?
select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =
tab2.code

Post by w***@noten5.maas-noten.de
both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2
share initial "AX") I get back to manageable times by adding
and tab1.code ~ '^AX'
into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2
is about 40 seconds.

any join where result is related to some function result can be very
slow, because estimation will be out and any repeated function
evaluation is just expensive.

You can try use a functional index.

create index on tab2 ((substring(tab1.code from 1 for 5))

Regards

Pavel Stehule

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
Regards
Wolfgang Hamann

--
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance
--
http://www.postgresql.org/mailpref/pgsql-general

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Vincent Veyron

2013-01-31 10:03:08 UTC

Post by Pavel Stehule
any join where result is related to some function result can be very
slow, because estimation will be out and any repeated function
evaluation is just expensive.

Hi Pavel,

Thank you for the correction. Since we're at it, I have a question
regarding functions in a query :

Suppose I have a query of the form

SELECT my_function(column_1), column_2
FROM my_table
GROUP BY my_function(column_1)
ORDER BY my_function(column_1);

where my_function is a user defined function.

How many times is the function computed?
--
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Pavel Stehule

2013-01-31 10:06:23 UTC

Post by Vincent Veyron

Post by Pavel Stehule
any join where result is related to some function result can be very
slow, because estimation will be out and any repeated function
evaluation is just expensive.

Hi Pavel,
Thank you for the correction. Since we're at it, I have a question
Suppose I have a query of the form
SELECT my_function(column_1), column_2
FROM my_table
GROUP BY my_function(column_1)
ORDER BY my_function(column_1);
where my_function is a user defined function.
How many times is the function computed?

if function is stable or immutable, then once per row

Pavel

Post by Vincent Veyron
--
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Vincent Veyron

2013-01-31 14:00:30 UTC

Post by Pavel Stehule

Post by Vincent Veyron
Suppose I have a query of the form
SELECT my_function(column_1), column_2
FROM my_table
GROUP BY my_function(column_1)
ORDER BY my_function(column_1);
where my_function is a user defined function.
How many times is the function computed?

if function is stable or immutable, then once per row

In this post (watch for line-wrap) :

http://www.postgresql.org/message-id/CAFj8pRAdYL1-hCxH
+***@mail.gmail.com

you wrote that it is usually better not to mark SQL functions (as
opposed to plpgsql functions).

So should I mark SQL functions stable/immutable if I use them in a query
like the one above, or is it unnecessary?
--
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Pavel Stehule

2013-01-31 14:09:00 UTC

Post by Vincent Veyron

Post by Pavel Stehule

Post by Vincent Veyron
Suppose I have a query of the form
SELECT my_function(column_1), column_2
FROM my_table
GROUP BY my_function(column_1)
ORDER BY my_function(column_1);
where my_function is a user defined function.
How many times is the function computed?

if function is stable or immutable, then once per row

http://www.postgresql.org/message-id/CAFj8pRAdYL1-hCxH
you wrote that it is usually better not to mark SQL functions (as
opposed to plpgsql functions).
So should I mark SQL functions stable/immutable if I use them in a query
like the one above, or is it unnecessary?

It should not be marked

Regards

Pavel

Post by Vincent Veyron
--
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

h***@t-online.de

2013-01-31 18:29:28 UTC

Post by w***@noten5.maas-noten.de

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
Hi,
I am trying to match items from 2 tables based on a common string.
One is a big table which has one column with entries like XY123, ABC44, =

etc

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
The table has an index on that column.
The second table is, typically, much smaller
select .... from tab1, tab2 where tab1.code =3D tab2.code;
This works fine and fast.
Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D=

in the

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
big table and want them to match XY423, GF55 in the second table
Variants I have tried
select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z=

])');

Post by Vincent Veyron
Have you tried the substring function?
select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =3D
tab2.code

Hi Pavel, it was just by chance that a fixed size substring would match the
data at hand. It is more common to have a digit/letter (or vice versa) boundary
or a hyphen there

Post by w***@noten5.maas-noten.de

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
both take an enormous time. In the better case that I can subset (e.g. a=

ll candidates in table 2

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
share initial "AX") I get back to manageable times by adding
and tab1.code ~ '^AX'
into the recipe. Actual runtime with about a million entries in tab1 and=

800 entries in tab2

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
is about 40 seconds.

any join where result is related to some function result can be very
slow, because estimation will be out and any repeated function
evaluation is just expensive.

I see the problem since obviously every the ~ operator with a non-constant
pattern is constantly recompiling the pattern.

I wonder whether it would be possible to invent a prefix-match operator that approaches
the performance of string equality. I noted in the past (not sure whether anything
has changed in regex matching) that a constant leading part of regex would improve
performance, i.e. use an index scan to select possible candidates.

Post by w***@noten5.maas-noten.de
You can try use a functional index.
create index on tab2 ((substring(tab1.code from 1 for 5))

What kind of trick is that - mixing two tables into a functional index?
What would the exact syntax be for that?

Regards
Wolfgang Hamann

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Pavel Stehule

2013-02-01 05:50:23 UTC

Hello

Post by h***@t-online.de

Post by w***@noten5.maas-noten.de

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
Hi,
I am trying to match items from 2 tables based on a common string.
One is a big table which has one column with entries like XY123, ABC44, =

etc

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
The table has an index on that column.
The second table is, typically, much smaller
select .... from tab1, tab2 where tab1.code =3D tab2.code;
This works fine and fast.
Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D=

in the

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
big table and want them to match XY423, GF55 in the second table
Variants I have tried
select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z=

])');

Post by Vincent Veyron
Have you tried the substring function?
select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =3D
tab2.code

Hi Pavel, it was just by chance that a fixed size substring would match the
data at hand. It is more common to have a digit/letter (or vice versa) boundary
or a hyphen there

Post by w***@noten5.maas-noten.de

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
both take an enormous time. In the better case that I can subset (e.g. a=

ll candidates in table 2

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
share initial "AX") I get back to manageable times by adding
and tab1.code ~ '^AX'
into the recipe. Actual runtime with about a million entries in tab1 and=

800 entries in tab2

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
is about 40 seconds.

any join where result is related to some function result can be very
slow, because estimation will be out and any repeated function
evaluation is just expensive.

I see the problem since obviously every the ~ operator with a non-constant
pattern is constantly recompiling the pattern.
I wonder whether it would be possible to invent a prefix-match operator that approaches
the performance of string equality. I noted in the past (not sure whether anything
has changed in regex matching) that a constant leading part of regex would improve
performance, i.e. use an index scan to select possible candidates.

Post by w***@noten5.maas-noten.de
You can try use a functional index.
create index on tab2 ((substring(tab1.code from 1 for 5))

What kind of trick is that - mixing two tables into a functional index?

it is not possible - you can do some auxiliary table and creating
indexes over this table

but maybe https://github.com/dimitri/prefix can help

Regards

Pavel

Post by h***@t-online.de
What would the exact syntax be for that?
Regards
Wolfgang Hamann
--
http://www.postgresql.org/mailpref/pgsql-general

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

h***@t-online.de

2013-02-02 09:17:08 UTC

Post by Pavel Stehule
but maybe https://github.com/dimitri/prefix can help

Hi Pavel,

thanks - this works perfect. However, it does not seem to play well
with the optimizer, so I ended up with

select all candidates into a temp table using prefix operator
apply all other conditions by joining that temp table to original ones

Regards
Wolfgang

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Pavel Stehule

2013-02-02 09:25:40 UTC

Post by Vincent Veyron

Post by Pavel Stehule
but maybe https://github.com/dimitri/prefix can help

Hi Pavel,
thanks - this works perfect. However, it does not seem to play well
with the optimizer, so I ended up with
select all candidates into a temp table using prefix operator
apply all other conditions by joining that temp table to original ones

you can send proposals to enhancing to Dimitry - He will be happy :)

Regards

Pavel

Post by Vincent Veyron
Regards
Wolfgang
--
http://www.postgresql.org/mailpref/pgsql-general

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Pavel Stehule

2013-02-02 10:11:18 UTC

Post by Pavel Stehule

Post by Vincent Veyron

Post by Pavel Stehule
but maybe https://github.com/dimitri/prefix can help

Hi Pavel,
thanks - this works perfect. However, it does not seem to play well
with the optimizer, so I ended up with
select all candidates into a temp table using prefix operator
apply all other conditions by joining that temp table to original ones

seriously - it is typical solution - and it is great so PostgreSQL
help to you :)

Regards

Pavel

Post by Pavel Stehule
you can send proposals to enhancing to Dimitry - He will be happy :)
Regards
Pavel

Post by Vincent Veyron
Regards
Wolfgang
--
http://www.postgresql.org/mailpref/pgsql-general

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Jasen Betts

2013-02-03 05:33:50 UTC

Post by h***@t-online.de

Post by w***@noten5.maas-noten.de

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
Hi,
I am trying to match items from 2 tables based on a common string.
One is a big table which has one column with entries like XY123, ABC44, =

etc

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
The table has an index on that column.
The second table is, typically, much smaller
select .... from tab1, tab2 where tab1.code =3D tab2.code;
This works fine and fast.
Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D=

in the

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
big table and want them to match XY423, GF55 in the second table
Variants I have tried
select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z=

])');

Post by Vincent Veyron
Have you tried the substring function?
select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =3D
tab2.code

Hi Pavel, it was just by chance that a fixed size substring would match the
data at hand. It is more common to have a digit/letter (or vice versa) boundary
or a hyphen there

Post by w***@noten5.maas-noten.de

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
both take an enormous time. In the better case that I can subset (e.g. a=

ll candidates in table 2

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
share initial "AX") I get back to manageable times by adding
and tab1.code ~ '^AX'
into the recipe. Actual runtime with about a million entries in tab1 and=

800 entries in tab2

Post by Vincent Veyron

Post by w***@noten5.maas-noten.de
is about 40 seconds.

any join where result is related to some function result can be very
slow, because estimation will be out and any repeated function
evaluation is just expensive.

I see the problem since obviously every the ~ operator with a non-constant
pattern is constantly recompiling the pattern.
I wonder whether it would be possible to invent a prefix-match operator that approaches
the performance of string equality. I noted in the past (not sure whether anything
has changed in regex matching) that a constant leading part of regex would improve
performance, i.e. use an index scan to select possible candidates.

you could write a set returning function that opens cursors on both tables using
"ORDER BY code" and merges the results

--
⚂⚃ 100% natural
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

12 Replies
6 Views
Permalink to this page
Disable enhanced parsing

Thread Navigation

w***@noten5.maas-noten.de 2013-01-30 11:08:05 UTC

Vincent Veyron 2013-01-31 07:07:30 UTC

Pavel Stehule 2013-01-31 08:49:47 UTC

Vincent Veyron 2013-01-31 10:03:08 UTC

Pavel Stehule 2013-01-31 10:06:23 UTC

Vincent Veyron 2013-01-31 14:00:30 UTC

Pavel Stehule 2013-01-31 14:09:00 UTC

h***@t-online.de 2013-01-31 18:29:28 UTC

Pavel Stehule 2013-02-01 05:50:23 UTC

h***@t-online.de 2013-02-02 09:17:08 UTC

Pavel Stehule 2013-02-02 09:25:40 UTC

Pavel Stehule 2013-02-02 10:11:18 UTC

Jasen Betts 2013-02-03 05:33:50 UTC

about - legalese

Loading...