PostgreSQL Out Of Memory
Database connections can use a huge amount of memory
Imagine yourself, on a beautiful sunny day, running your production environment when suddenly…
Your database SHUTS DOWN!
The first thing to do is to get everything up & running again; after that, the diagnostic work starts. The first place to look is the logs, and when you start digging through them you come across the messages below:
[Fri Oct 25 14:11:39 2019] Out of memory: Kill process 2591 (postgres) score 225 or sacrifice child
[Fri Oct 25 14:11:39 2019] Killed process 2591 (postgres) total-vm:34872820kB, anon-rss:2776kB, file-rss:0kB, shmem-rss:33416816kB
[...]
2019-10-25 14:12:15 UTC [2589]: [6] user=,db=,client=,app= LOG: database system is shut down
That surprises you, since the database server has plenty of RAM, so hitting an OOM condition makes no sense. With that in mind, the next step is to look at the resource metrics (i.e.: memory, CPU, network, etc…) expecting to find some leak that could explain an increasing memory usage. But, another surprise: nothing changes until the moment of the OOM. The memory metrics were stable, and then free memory suddenly dropped to zero, causing the database shutdown.
So you start looking for tuning recommendations, and the official PostgreSQL documentation has a good chapter on Managing Kernel Resources. The first change is the Linux memory overcommit behavior:
sysctl -w vm.overcommit_memory=2
The second is to manually adjust the PostgreSQL process score so the kernel (i.e.: the OOM Killer) avoids killing the database process; per the documentation, this should go in the PostgreSQL startup script, before the postmaster is launched:
echo -1000 > /proc/self/oom_score_adj
After these optimizations the expectation is that everything is solved, right? NO!
A few hours later another database outage happens, and it starts happening several times a day, even outside working hours. The logs show that the database no longer shuts down, but instead restarts repeatedly…
2019-10-25 21:15:57 UTC [122914]: [2] user=readonly,db=production-db,client=0.0.0.0,app=[unknown] FATAL: the database system is in recovery mode
2019-10-25 21:16:31 UTC [89367]: [7] user=,db=,client=,app= LOG: all server processes terminated; reinitializing
While debugging you identify a lot of heavy queries and start doing a lot of optimizations, and the next step is to decrease the work_mem setting so that complex sorting queries use less memory.
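A minimal sketch of how that change might be applied, assuming superuser access; the connection string and the 16MB value are illustrative, not the ones from this incident:

# Hedged sketch: lower work_mem cluster-wide with ALTER SYSTEM and reload the
# configuration. Connection parameters and the value are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=production-db user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET work_mem = '16MB';")  # illustrative value
    cur.execute("SELECT pg_reload_conf();")             # apply without a restart
conn.close()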
But you still see database restarts caused by out-of-memory errors in the log messages:
2019-11-01 13:31:54 UTC [12954]: [8] user=readonly,db=production-db,client=0.0.0.0,app=[unknown] ERROR: out of memory
2019-11-01 13:31:55 UTC [13144]: [16] user=readonly,db=production-db,client=0.0.0.0,app=[unknown] ERROR: out of memory
[...]
2019-11-01 13:33:45 UTC [69292] LOG: server process (PID 84886) exited with exit code 127
Besides query optimizations, you change more kernel parameters. The next ones concern shared memory, increasing it to leave more memory available to the database process, following the formula described in the documentation:
The default maximum segment size is 32 MB, and the default maximum total size is 2097152 pages. A page is almost always 4096 bytes except in unusual kernel configurations with “huge pages” (use getconf PAGE_SIZE to verify).
sysctl -w kernel.shmmax=17179869184
sysctl -w kernel.shmall=4194304
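To make the formula concrete, here is a small sketch of the arithmetic behind those two values (the 16 GB shmmax target is an assumption for the example, not a general recommendation):

# Reproduce the shmmax/shmall arithmetic from the documentation's formula.
import os

page_size = os.sysconf("SC_PAGE_SIZE")     # usually 4096 bytes; cf. `getconf PAGE_SIZE`
shmmax_bytes = 16 * 1024 ** 3              # 17179869184: largest single shared memory segment
shmall_pages = shmmax_bytes // page_size   # 4194304: total shared memory, counted in pages

print(f"kernel.shmmax = {shmmax_bytes}")
print(f"kernel.shmall = {shmall_pages}")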
Even after all these modifications nothing changes, and the database keeps restarting over and over…
So you'll need to understand more about how PostgreSQL uses memory to try to identify the possible cause of the issue.
One of the first things to do is to try to understand how PostgreSQL's memory allocation works, and, for that, Severalnines has a nice post about PostgreSQL memory architecture, explaining the differences between the local and shared memory areas and what each one is used for. Below is an image illustrating how it all works:
Alright, this helps, but you need deeper knowledge of the topic, and PostgreSQL Addict has a very nice article giving a very good understanding of MemoryContexts, which are, basically, groups of allocated pieces of memory that make lifecycle management easier. They are organized in a tree, roughly matching the execution plans. Their main objective is to minimize malloc calls and book-keeping, maximize memory reuse, and they never really free memory.
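As a side note, on newer PostgreSQL releases (14 and later, so not the version involved in this story) you can inspect your own backend's MemoryContext tree directly; a rough sketch, with made-up connection parameters:

# Hedged sketch (PostgreSQL 14+ only): list the largest MemoryContexts of the
# current backend via the pg_backend_memory_contexts system view.
import psycopg2

conn = psycopg2.connect("dbname=production-db user=readonly")
with conn.cursor() as cur:
    cur.execute("""
        SELECT name, parent, level, total_bytes, used_bytes
          FROM pg_backend_memory_contexts
         ORDER BY total_bytes DESC
         LIMIT 10;
    """)
    for name, parent, level, total_bytes, used_bytes in cur.fetchall():
        print(f"{'  ' * level}{name} (parent={parent}): {total_bytes} bytes total, {used_bytes} used")
conn.close()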
After all this reading you understand more about memory allocation, but it still doesn't clarify the possible cause of the OOM issue.
So you keep researching, and Brandur gives a big picture of how PostgreSQL processes are forked and allocate memory, as we can see in the image below:
Poring over the article, you come across the quote below:
Each of these “backend” processes starts out at around 5 MB in size, but may grow to be much larger depending on the data they’re accessing
With that, something comes to mind: "What if the connections are leaking?" Since PostgreSQL is a very mature codebase, you weren't able to find any memory leak bug, but, as said on Stack Exchange, long-lived connections can use a lot of memory:
So if you have hundreds of thousands or millions of these things, and a long-lived connection will eventually touch each one of them, then their memory usage will continuously grow.
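One way to check this growth in practice (not part of the original debugging session, just a rough Linux-only sketch) is to sample the resident memory of each postgres backend and watch whether the long-lived connections keep climbing between samples:

# Hedged sketch: read /proc to report the RSS of every process whose name
# contains "postgres". Process-name matching is deliberately simplistic.
import os

def postgres_rss_kb():
    rss = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if "postgres" not in f.read():
                    continue
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        rss[int(pid)] = int(line.split()[1])  # value is in kB
        except (FileNotFoundError, PermissionError):
            continue  # process exited or is not readable
    return rss

if __name__ == "__main__":
    for pid, kb in sorted(postgres_rss_kb().items(), key=lambda x: -x[1]):
        print(f"pid={pid} rss={kb} kB")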
OK…! Things start to make sense: your application uses Django persistent database connections, and you realize you've set CONN_MAX_AGE to a very big number (i.e.: 600 seconds), which must be eating the server's memory.
So, what is the right value? As per a Stack Overflow question, there's no "right value", since you need to evaluate your own traffic; the general recommendation is to use CONN_MAX_AGE = 60 and watch.
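In Django terms the change is a one-line edit in settings.py; a minimal sketch, with placeholder connection details:

# settings.py sketch: keep persistent connections, but recycle them every 60 s
# instead of 600 s, so no single backend lives long enough to hoard memory.
# Engine/name/user/host below are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "production-db",
        "USER": "readonly",
        "HOST": "db.example.internal",
        "PORT": 5432,
        "CONN_MAX_AGE": 60,   # was 600; 0 disables persistent connections
    }
}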
After applying this setting…
Conclusion
The most common cause of an out-of-memory issue is PostgreSQL being unable to allocate the memory required for a query to run. This is governed by the work_mem parameter, which sets the maximum amount of memory that can be used by a query operation before writing to temporary disk files.
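To see why this matters, here is a back-of-the-envelope sketch with invented numbers (work_mem applies per sort or hash operation, so a single query can claim it several times, and every backend can do so concurrently):

# Worst-case memory that query operations alone could ask for (illustrative).
max_connections = 300     # server connection limit (assumption)
work_mem_mb = 64          # per sort/hash operation (assumption)
ops_per_query = 2         # sorts/hashes a heavy query might run (assumption)

worst_case_gb = max_connections * work_mem_mb * ops_per_query / 1024
print(f"worst case: ~{worst_case_gb:.1f} GB on top of shared_buffers")  # ~37.5 GB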
As said by Citus, giving too much memory to a query operation can cause collateral effects like an OOM issue…
When you consume more memory than is available on your machine you can start to see out of memory errors within your Postgres logs, or in worse cases the OOM killer can start to randomly kill running processes to free up memory. An out of memory error in Postgres simply errors on the query you’re running, where as the OOM killer in linux begins killing running processes which in some cases might even include Postgres itself.
PGAnalyse also describes some characteristics and recommendations about OOM issues and configuration tuning.
But that isn't the only cause. As described, connections need to be treated as a resource, the same as CPU, memory, network, etc… although there isn't much documentation about connection memory behaviour, as we were able to see in this narrative.
There are some mechanisms to protect the database infrastructure from this application "misbehaviour", one of which is the idle_in_transaction_session_timeout configuration…
Terminate any session with an open transaction that has been idle for longer than the specified duration in milliseconds.
This means the database itself can expire long-lived connections opened by the application when they are left idle inside a transaction.
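A minimal sketch of enabling it server-wide, assuming superuser access and an illustrative 60-second value:

# Hedged sketch: terminate sessions left idle in an open transaction for 60 s.
import psycopg2

conn = psycopg2.connect("dbname=production-db user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET idle_in_transaction_session_timeout = '60000';")  # milliseconds
    cur.execute("SELECT pg_reload_conf();")
conn.close()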
Also, another approach is to use a connection pooling solution such as PgBouncer or Pgpool-II.