Discussion:
Kernel kills postgres process - help needed
Hervé Piedvache
2008-01-09 21:57:06 UTC
Hi,

I'm having serious trouble with a PostgreSQL server ... regularly, ever since I added
8 GB of memory to a server that already had 8 GB, I run into problems.
Nothing else has changed ... It's a Dell server, and all the memory
diagnostics from Dell look fine ...
When there are a lot of connections to the server (persistent connections from 6
Apache/PHP web servers using PDO, about 110 processes on each web server), or a
long-running query (it's hard for me to tell exactly when it happens), the kernel
seems to kill my PostgreSQL processes, and then the server becomes completely
unstable and usually needs a reboot ...

I'm on Linux kernel 2.6.15 with PostgreSQL 8.1.10.
My database is about 56 GB.
RAM = 16 GB

kernel shmmax : 941604096

PostgreSQL config:
max_connections = 2048
shared_buffers = 40000
#temp_buffers = 1000 # min 100, 8KB each
work_mem = 2048 # min 64, size in KB
maintenance_work_mem = 512000 # min 1024, size in KB
max_stack_depth = 4096 # min 100, size in KB
max_fsm_pages = 25000000
max_fsm_relations = 2000 # min 100, ~70 bytes each
max_files_per_process = 255 # min 25
fsync = on
wal_buffers = 128 # min 4, 8KB each
commit_delay = 500 # range 0-100000, in microseconds
commit_siblings = 5 # range 1-1000
checkpoint_segments = 160
effective_cache_size = 600000 # typically 8KB each
random_page_cost = 2
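
(Rough arithmetic on these settings, for scale, assuming the default 8 kB block size:
shared_buffers = 40000 x 8 kB is about 312 MB of shared memory,
maintenance_work_mem = 512000 kB is about 500 MB per maintenance operation, and
work_mem = 2048 kB is 2 MB per sort or hash per backend, so 2048 backends each
running a single sort could in principle ask for around 4 GB on top of the rest.)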

Syslog output when it crashes:
Jan 9 20:30:47 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
Jan 9 20:30:48 db2 kernel: Mem-info:
Jan 9 20:30:48 db2 kernel: DMA per-cpu:
Jan 9 20:30:48 db2 kernel: cpu 0 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 0 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 1 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 1 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 2 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 2 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 3 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 3 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: DMA32 per-cpu: empty
Jan 9 20:30:48 db2 kernel: Normal per-cpu:
Jan 9 20:30:48 db2 kernel: cpu 0 hot: low 0, high 186, batch 31 used:5
Jan 9 20:30:48 db2 kernel: cpu 0 cold: low 0, high 62, batch 15 used:59
Jan 9 20:30:48 db2 kernel: cpu 1 hot: low 0, high 186, batch 31 used:22
Jan 9 20:30:48 db2 kernel: cpu 1 cold: low 0, high 62, batch 15 used:49
Jan 9 20:30:48 db2 kernel: cpu 2 hot: low 0, high 186, batch 31 used:33
Jan 9 20:30:48 db2 kernel: cpu 2 cold: low 0, high 62, batch 15 used:60
Jan 9 20:30:48 db2 kernel: cpu 3 hot: low 0, high 186, batch 31 used:3
Jan 9 20:30:48 db2 kernel: cpu 3 cold: low 0, high 62, batch 15 used:55
Jan 9 20:30:48 db2 kernel: HighMem per-cpu:
Jan 9 20:30:48 db2 kernel: cpu 0 hot: low 0, high 186, batch 31 used:5
Jan 9 20:30:48 db2 kernel: cpu 0 cold: low 0, high 62, batch 15 used:5
Jan 9 20:30:48 db2 kernel: cpu 1 hot: low 0, high 186, batch 31 used:11
Jan 9 20:30:48 db2 kernel: cpu 1 cold: low 0, high 62, batch 15 used:4
Jan 9 20:30:48 db2 kernel: cpu 2 hot: low 0, high 186, batch 31 used:17
Jan 9 20:30:48 db2 kernel: cpu 2 cold: low 0, high 62, batch 15 used:14
Jan 9 20:30:48 db2 kernel: cpu 3 hot: low 0, high 186, batch 31 used:14
Jan 9 20:30:48 db2 kernel: cpu 3 cold: low 0, high 62, batch 15 used:9
Jan 9 20:30:48 db2 kernel: Free pages: 497624kB (490232kB HighMem)
Jan 9 20:30:48 db2 kernel: Active:3604892 inactive:234379 dirty:20273
writeback:210 unstable:0 free:124406 slab:49119 mapped:547571
pagetables:139724
Jan 9 20:30:48 db2 kernel: DMA free:3588kB min:68kB low:84kB high:100kB
active:0kB inactive:0kB present:16384kB pages_scanned:1 all_unreclaimable?
yes
Jan 9 20:30:48 db2 kernel: lowmem_reserve[]: 0 0 880 17392
Jan 9 20:30:48 db2 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB
inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Jan 9 20:30:48 db2 kernel: lowmem_reserve[]: 0 0 880 17392
Jan 9 20:30:48 db2 kernel: Normal free:3804kB min:3756kB low:4692kB
high:5632kB active:508kB inactive:464kB present:901120kB pages_scanned:975
all_unreclaimable? yes
Jan 9 20:30:48 db2 kernel: lowmem_reserve[]: 0 0 0 132096
Jan 9 20:30:48 db2 kernel: HighMem free:490108kB min:512kB low:18148kB
high:35784kB active:14419044kB inactive:937112kB present:16908288kB
pages_scanned:0 all_unreclaimable? no
Jan 9 20:30:48 db2 kernel: lowmem_reserve[]: 0 0 0 0
Jan 9 20:30:48 db2 kernel: DMA: 1*4kB 0*8kB 2*16kB 1*32kB 1*64kB 1*128kB
1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Jan 9 20:30:48 db2 kernel: DMA32: empty
Jan 9 20:30:48 db2 kernel: Normal: 35*4kB 0*8kB 7*16kB 5*32kB 1*64kB 0*128kB
1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3804kB
Jan 9 20:30:48 db2 kernel: HighMem: 29171*4kB 43358*8kB 1620*16kB 8*32kB
0*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 490108kB
Jan 9 20:30:48 db2 kernel: Swap cache: add 161, delete 160, find 98/138, race
0+0
Jan 9 20:30:48 db2 kernel: Free swap = 15623168kB
Jan 9 20:30:48 db2 kernel: Total swap = 15623172kB
Jan 9 20:30:48 db2 kernel: Free swap: 15623168kB
Jan 9 20:30:48 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
Jan 9 20:30:48 db2 kernel: Mem-info:
Jan 9 20:30:48 db2 kernel: DMA per-cpu:
Jan 9 20:30:48 db2 postgres[7634]: [2-1] LOG: background writer process (PID
7639) was terminated by signal 9
Jan 9 20:30:48 db2 kernel: cpu 0 hot: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 kernel: cpu 0 cold: low 0, high 0, batch 1 used:0
Jan 9 20:30:48 db2 postgres[7634]: [3-1] LOG: terminating any other active
server processes
Jan 9 20:30:48 db2 postgres[4058]: [2-1] WARNING: terminating connection
because of crash of another server process
Jan 9 20:30:48 db2 postgres[4058]: [2-2] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server
Jan 9 20:30:48 db2 postgres[4058]: [2-3] process exited abnormally and
possibly corrupted shared memory.
Jan 9 20:30:48 db2 postgres[4044]: [2-1] WARNING: terminating connection
because of crash of another server process
Jan 9 20:30:48 db2 postgres[4058]: [2-4] HINT: In a moment you should be
able to reconnect to the database and repeat your command.
Jan 9 20:30:48 db2 postgres[4023]: [2-1] WARNING: terminating connection
because of crash of another server process
Jan 9 20:30:48 db2 postgres[4023]: [2-2] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server
Jan 9 20:30:48 db2 postgres[4023]: [2-3] process exited abnormally and
possibly corrupted shared memory.
Jan 9 20:30:48 db2 postgres[4023]: [2-4] HINT: In a moment you should be
able to reconnect to the database and repeat your command.
etc.

At that moment I had 877 connections ... nothing unusually big for our activity.

If somebody has any idea ... a bad configuration parameter ... or any other
way to solve my problem ... help would be really appreciated.

Regards,
--
Hervé

Jeff Davis
2008-01-09 22:17:14 UTC
Post by Hervé Piedvache
Hi,
I'm having serious trouble with a PostgreSQL server ... regularly, ever since I added
8 GB of memory to a server that already had 8 GB, I run into problems.
Nothing else has changed ... It's a Dell server, and all the memory
diagnostics from Dell look fine ...
When there are a lot of connections to the server (persistent connections from 6
Apache/PHP web servers using PDO, about 110 processes on each web server), or a
long-running query (it's hard for me to tell exactly when it happens), the kernel
seems to kill my PostgreSQL processes, and then the server becomes completely
unstable and usually needs a reboot ...
I'm on Linux kernel 2.6.15 with PostgreSQL 8.1.10.
My database is about 56 GB.
RAM = 16 GB
[snip]
Post by Hervé Piedvache
Jan 9 20:30:47 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
It looks like the Out Of Memory Killer was invoked, and you need to find
out why it was invoked.

I posted to LKML here:

http://kerneltrap.org/mailarchive/linux-kernel/2007/2/12/54202

because Linux has a behavior -- which in my opinion is a bug -- that
causes the OOM killer to almost always kill PostgreSQL first, regardless
of whether it was truly the offending process or not.

So, find out which process truly caused the memory pressure that led to
the OOM killer being invoked, and fix that problem.
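
One low-tech way to do that (just a sketch; the log file path is arbitrary) is to
snapshot the biggest memory users every minute from cron and see what was growing
right before the kill, e.g. an /etc/cron.d entry like

* * * * * root ps -eo pid,vsz,rss,comm --sort=-rss | head -20 >> /var/log/mem-snapshots.log

and also to look at the rest of the kernel's oom-killer output in syslog, which
normally names the process it chose to kill.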

You may also consider some Linux configuration options that make
invocation of the OOM killer less likely.
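
For instance (a sketch only; the exact knobs vary by kernel version):

# /etc/sysctl.conf, applied with sysctl -p: strict accounting, so allocations
# fail up front instead of the OOM killer firing later
vm.overcommit_memory = 2
vm.overcommit_ratio = 80        # commit limit = swap + RAM * ratio/100

# or, on kernels that support it, exempt the postmaster from the OOM killer
# (<postmaster_pid> is whatever ps reports; this must be redone after each restart)
echo -17 > /proc/<postmaster_pid>/oom_adj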

Regards,
Jeff Davis


Hervé Piedvache
2008-01-09 22:43:47 UTC
Post by Jeff Davis
Post by Hervé Piedvache
Hi,
I'm having serious trouble with a PostgreSQL server ... regularly, ever since I
added 8 GB of memory to a server that already had 8 GB, I run into problems.
Nothing else has changed ... It's a Dell server, and all the memory
diagnostics from Dell look fine ...
When there are a lot of connections to the server (persistent connections from 6
Apache/PHP web servers using PDO, about 110 processes on each web server), or a
long-running query (it's hard for me to tell exactly when it happens), the kernel
seems to kill my PostgreSQL processes, and then the server becomes completely
unstable and usually needs a reboot ...
I'm on Linux kernel 2.6.15 with PostgreSQL 8.1.10.
My database is about 56 GB.
RAM = 16 GB
[snip]
Post by Hervé Piedvache
Jan 9 20:30:47 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
It looks like the Out Of Memory Killer was invoked, and you need to find
out why it was invoked.
http://kerneltrap.org/mailarchive/linux-kernel/2007/2/12/54202
because Linux has a behavior -- which in my opinion is a bug -- that
causes the OOM killer to almost always kill PostgreSQL first, regardless
of whether it was truly the offending process or not.
So, find out which process truly caused the memory pressure that led to
the OOM killer being invoked, and fix that problem.
How can I go about finding this? It's a production server for a web service,
and I have no idea how to figure out which process was the cause ... !?
Post by Jeff Davis
You may also consider some Linux configuration options that make
invocation of the OOM killer less likely.
On this server only PostgreSQL, Slony, and sshd are running; the rest is
just the basic Linux processes (cron, atd, getty, etc.)

regards,
--
Hervé Piedvache

Robert Treat
2008-01-10 03:12:35 UTC
On Wed, 09 Jan 2008 14:17:14 -0800
Post by Jeff Davis
http://kerneltrap.org/mailarchive/linux-kernel/2007/2/12/54202
because Linux has a behavior -- which in my opinion is a bug -- that
causes the OOM killer to almost always kill PostgreSQL first,
regardless of whether it was truly the offending process or not.
If that isn't an argument for FreeBSD I don't know what is...
Funny, it looked like an argument for Solaris to me. ;-)
--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Tom Lane
2008-01-09 22:17:56 UTC
Post by Hervé Piedvache
When there are a lot of connections to the server (persistent connections from 6
Apache/PHP web servers using PDO, about 110 processes on each web server), or a
long-running query (it's hard for me to tell exactly when it happens), the kernel
seems to kill my PostgreSQL processes, and then the server becomes completely
unstable and usually needs a reboot ...
Turn off memory overcommit.
Post by Hervé Piedvache
max_connections = 2048
Have you considered using a connection pooler in front of a smaller
number of backends?

If you really need that many backends, it'd likely be a good idea to
reduce max_files_per_process to perhaps 100 or so. If you manage
to run the kernel out of filetable slots, all sorts of userland stuff
is going to get very unhappy.
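
(For scale, with the current settings the worst case is 2048 backends x 255 files,
roughly 522,000 open files, which is a large fraction of the kernel's file table.)
A minimal sketch of the two changes in postgresql.conf, where the max_connections
value is only illustrative of "a smaller number of backends" behind a pooler:

max_connections = 256            # pooler in front of the web servers shares these out
max_files_per_process = 100      # down from 255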

regards, tom lane

Hervé Piedvache
2008-01-09 22:38:31 UTC
Tom,
Post by Tom Lane
Post by Hervé Piedvache
When there are a lot of connections to the server (persistent connections from 6
Apache/PHP web servers using PDO, about 110 processes on each web server), or a
long-running query (it's hard for me to tell exactly when it happens), the kernel
seems to kill my PostgreSQL processes, and then the server becomes completely
unstable and usually needs a reboot ...
Turn off memory overcommit.
My sysctl.conf file looks like this:
kernel.shmmax= 941604096
kernel.sem = 250 32000 100 400
fs.file-max=655360
vm.overcommit_memory=2
vm.overcommit_ratio=30
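(If I understand vm.overcommit_memory=2 correctly, the commit limit works out to
swap + RAM x overcommit_ratio/100, so here roughly 15.6 GB + 16 GB x 0.30, about 20 GB.)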
Post by Tom Lane
Post by Hervé Piedvache
max_connections = 2048
Have you considered using a connection pooler in front of a smaller
number of backends?
Which system do you recommend for this?
Post by Tom Lane
If you really need that many backends, it'd likely be a good idea to
reduce max_files_per_process to perhaps 100 or so. If you manage
to run the kernel out of filetable slots, all sorts of userland stuff
is going to get very unhappy.
I'll try this ...

regards,
--
Hervé

Scott Marlowe
2008-01-09 22:59:45 UTC
On Jan 9, 2008 3:57 PM, Hervé Piedvache <***@gmail.com> wrote:

SNIP
Post by Hervé Piedvache
0+0
Jan 9 20:30:48 db2 kernel: Free swap = 15623168kB
Jan 9 20:30:48 db2 kernel: Total swap = 15623172kB
Jan 9 20:30:48 db2 kernel: Free swap: 15623168kB
Jan 9 20:30:48 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
Jan 9 20:30:48 db2 postgres[7634]: [2-1] LOG: background writer process (PID
7639) was terminated by signal 9
This makes no sense to me. The OS is showing that there's
16G free swap. Why is it killing things? I'm betting there's some
bug with too large of a swap resulting in some kind of wrap around or
something.

Martijn van Oosterhout
2008-01-10 11:39:02 UTC
Post by Scott Marlowe
Post by Hervé Piedvache
Jan 9 20:30:48 db2 kernel: Free swap = 15623168kB
Jan 9 20:30:48 db2 kernel: Total swap = 15623172kB
Jan 9 20:30:48 db2 kernel: Free swap: 15623168kB
Jan 9 20:30:48 db2 kernel: oom-killer: gfp_mask=0x84d0, order=0
Jan 9 20:30:48 db2 postgres[7634]: [2-1] LOG: background writer process (PID
7639) was terminated by signal 9
This makes no sense to me. The OS is showing that there's
16G free swap. Why is it killing things? I'm betting there's some
bug with too large of a swap resulting in some kind of wrap around or
something.
At a guess it's this:

Jan 9 20:30:48 db2 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no

Which is why the bgwriter got whacked: it couldn't allocate any memory
for the disk transfer (though why the OOM killer gets invoked here I
don't know). Disabling overcommit won't help you either.

(Note that the Normal, i.e. low-memory, zone is only ~880 MB here, present:901120kB,
and the report shows it at free:3804kB against min:3756kB with all_unreclaimable set,
so low memory looks exhausted even though HighMem and swap are mostly free.)
Perhaps a 64-bit architecture? Or a RAID controller that can access
high memory (is this possible?).

Have a nice day,
--
Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy