Monit times out when stopping unicorn workers if they respawn too quickly
<p>I'm trying to monitor unicorn workers with monit, so that it gracefully kills them when they reach a certain memory threshold.</p>
<p><strong>The problem:</strong></p>
<p>When I tell monit to restart a worker, it first tries to stop it by running my <code>/etc/init.d/unicorn_orly kill_worker 0</code> script command.</p>
<pre><code># my /etc/monit/config.d/unicorn file
check process orly_unicorn_worker_0 with pidfile /tmp/unicorn.orly.0.pid
  start program = "/bin/true"
  stop program = "/etc/init.d/unicorn_orly kill_worker 0"
</code></pre>
<p>Watching the processes with the <code>top</code> command, I can see the worker being killed and the master spawning a new worker with, of course, another pid.</p>
<p>Monit, however, waits for a while and then writes a "failed to stop" error to its log. It is actually waiting 30 seconds and timing out.</p>
<p>Once it times out, monit logs <code>restart action done</code>, then notices that the worker PID has changed and continues to monitor the process as expected.</p>
<p>As a result everything works: monit is able to restart a worker when needed and keeps monitoring it. But the log is full of errors, the web interface shows a nasty (and confusing) <code>execution failed</code> error status on the worker, and I guess it would send erroneous email alerts if they were set up.</p>
<p>This is the relevant part of the log when I try to restart a worker through the web interface (notice how it also gets confused about the worker's parent PID):</p>
<pre><code>[UTC Mar 5 13:29:17] info : 'orly_unicorn_worker_0' trying to restart
[UTC Mar 5 13:29:17] info : 'orly_unicorn_worker_0' stop: /etc/init.d/unicorn_orly
[UTC Mar 5 13:29:47] error : 'orly_unicorn_worker_0' failed to stop
[UTC Mar 5 13:29:47] info : 'orly_unicorn_worker_0' restart action done
[UTC Mar 5 13:29:47] error : 'orly_unicorn_worker_0' process PID changed to 13699
[UTC Mar 5 13:29:49] error : 'orly_unicorn_worker_0' process PPID changed to 0
[UTC Mar 5 13:30:19] info : 'orly_unicorn_worker_0' process PID has not changed since last cycle
[UTC Mar 5 13:30:19] error : 'orly_unicorn_worker_0' process PPID changed to 13660
[UTC Mar 5 13:30:49] info : 'orly_unicorn_worker_0' process PPID has not changed since last cycle
</code></pre>
<p>This took me a long time to figure out, but what's happening here is that the worker gets killed and then respawned so quickly that monit doesn't even notice the change.</p>
<p>My guess is that monit, when performing the stop action, reads <code>/tmp/unicorn.orly.0.pid</code> to get the pid of the process and then checks whether that process still exists.</p>
<p>However, since the <em>kill-respawn worker operation</em> happens so fast, monit doesn't realize that the pid of the worker has changed and keeps waiting for the (brand new) worker to die. Then it times out, realizes the pid has actually changed, and carries on as normal.</p>
<p><strong>The dirty solution I have found:</strong></p>
<p>To test this hypothesis I tried to slow down the <em>kill-respawn worker operation</em>. I edited the unicorn config file so that each new worker sleeps for a few seconds just before it writes its new pid to <code>/tmp/unicorn.orly.0.pid</code>.</p>
<p>I did it like this:</p>
<pre><code>after_fork do |server, worker|
  sleep 3
  # write down the new worker PID so monit can monitor it
  child_pid = server.config[:pid].sub(".pid", ".#{worker.nr}.pid")
  system("echo #{Process.pid} &gt; #{child_pid}")
end
</code></pre>
<p>And it worked wonderfully: birds and flowers sing in the sunny day, the web interface now shows a nice <code>process running</code> status, and the logs show everything going smoothly. Take a look:</p>
<pre><code>[UTC Mar 5 13:30:44] info : 'orly_unicorn_worker_0' trying to restart
[UTC Mar 5 13:30:44] info : 'orly_unicorn_worker_0' stop: /etc/init.d/unicorn_orly
[UTC Mar 5 13:30:45] info : 'orly_unicorn_worker_0' stopped
[UTC Mar 5 13:30:45] info : 'orly_unicorn_worker_0' start: /bin/true
[UTC Mar 5 13:30:46] info : 'orly_unicorn_worker_0' restart action done
</code></pre>
<p><strong>The question:</strong></p>
<p>Is there a <em>monit-way</em> of achieving this? Sleeping my workers for 3 seconds doesn't seem like a good solution. Any ideas?</p>
<p>I understand this is not the normal situation with monit. We have kind of broken monit's <em>restart process cycle</em>, since we don't want monit's <code>start program</code> to perform any action, but instead let the unicorn master process handle it (as explained here: <a href="http://www.stopdropandrew.com/2010/06/01/where-unicorns-go-to-die-watching-unicorn-workers-with-monit.html" rel="nofollow">http://www.stopdropandrew.com/2010/06/01/where-unicorns-go-to-die-watching-unicorn-workers-with-monit.html</a>)</p>
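<p>The guess described above (monit re-reading the pid file on every poll while it waits for the stop) can be illustrated with a small Ruby sketch. This is only a toy model of that hypothesis, not monit's actual implementation: the function <code>stop_seen_at</code>, the tick-based polling and the value of <code>OLD_PID</code> are all assumptions made up for illustration. Only <code>NEW_PID</code> comes from the log.</p>

```ruby
# Toy model of the hypothesised stop-wait loop. NOT monit's real code.
OLD_PID = 13_650 # hypothetical PID of the worker that gets killed
NEW_PID = 13_699 # PID of the respawned worker, as seen in the log

# Returns the poll tick at which the watcher sees the process as stopped,
# or nil if it never does within `timeout` ticks.
# pidfile_delay: ticks the new worker waits before writing its pid file.
def stop_seen_at(pidfile_delay:, timeout: 30)
  (1..timeout).each do |tick|
    # Assumption: the old worker is dead and the new one is already
    # running before the first poll happens.
    alive = [NEW_PID]
    # Until the new worker writes the file, it still holds the old PID.
    pid_in_file = tick > pidfile_delay ? NEW_PID : OLD_PID
    return tick unless alive.include?(pid_in_file)
  end
  nil
end

fast = stop_seen_at(pidfile_delay: 0) # pid file rewritten immediately
slow = stop_seen_at(pidfile_delay: 3) # pid file rewritten after "sleep 3"
puts "immediate pid file write: #{fast.nil? ? 'timed out' : "stopped at tick #{fast}"}"
puts "delayed pid file write:   #{slow.nil? ? 'timed out' : "stopped at tick #{slow}"}"
```

<p>In this model the immediate rewrite never lets the watcher see a dead PID, which would match the 30-second "failed to stop" timeout, while the delayed rewrite lets it see the old, dead PID on the first poll, which would match the log after adding <code>sleep 3</code>.</p>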