StackOverflow2013

plurals

POMonit times out when stopping unicorn workers if they respawn too quickly
primarykey
Id
15228109
data
AcceptedAnswerId
0
AnswerCount
1
ClosedDate
CommentCount
1
CommunityOwnedDate
CreationDate
2013-03-05T15:47:16.217
FavoriteCount
1
LastActivityDate
2013-03-07T23:43:33.577
LastEditDate
2013-03-06T15:22:43.193
LastEditorUserId
159537
OwnerUserId
159537
ParentId
0
PostTypeId
1
Score
2
ViewCount
897
LastEditorDisplayName
text
Body
I'm trying to monitor unicorn workers with monit so that it gracefully kills them when they reach a certain memory threshold.
The problem: When I tell monit to restart a worker, it first tries to stop it by running my <code>/etc/init.d/unicorn kill_worker 0</code> script command.
<pre><code># my /etc/monit/config.d/unicorn file
check process orly_unicorn_worker_0 with pidfile /tmp/unicorn.orly.0.pid
  start program = "/bin/true"
  stop program = "/etc/init.d/unicorn_orly kill_worker 0"
</code></pre>
Watching the processes with the <code>top</code> command, I can see the worker being killed and the master spawning a new worker with, of course, a different PID. Monit, however, waits for a while and then writes a "failed to stop" error to its log. It is actually waiting 30 seconds and timing out. Once it times out, monit recognizes that the <code>restart action is done</code>, then notices that the worker PID has changed and continues to monitor the process as expected.
As a result, everything works: monit is able to restart a worker when needed and keeps monitoring it. But the log is full of errors, the web interface shows a nasty (and confusing) <code>execution failed</code> error status on the worker, and I guess it would send erroneous email alerts if they were set up.
This is the relevant part of the log when I try to restart a worker through the web interface (notice how it also gets confused about the worker's parent PID):
<pre><code>[UTC Mar 5 13:29:17] info : 'orly_unicorn_worker_0' trying to restart
[UTC Mar 5 13:29:17] info : 'orly_unicorn_worker_0' stop: /etc/init.d/unicorn_orly
[UTC Mar 5 13:29:47] error : 'orly_unicorn_worker_0' failed to stop
[UTC Mar 5 13:29:47] info : 'orly_unicorn_worker_0' restart action done
[UTC Mar 5 13:29:47] error : 'orly_unicorn_worker_0' process PID changed to 13699
[UTC Mar 5 13:29:49] error : 'orly_unicorn_worker_0' process PPID changed to 0
[UTC Mar 5 13:30:19] info : 'orly_unicorn_worker_0' process PID has not changed since last cycle
[UTC Mar 5 13:30:19] error : 'orly_unicorn_worker_0' process PPID changed to 13660
[UTC Mar 5 13:30:49] info : 'orly_unicorn_worker_0' process PPID has not changed since last cycle
</code></pre>
This took me a long time to figure out, but what's happening here is that the worker gets killed and then respawned so quickly that monit doesn't even notice the change. My guess is that monit, when performing the stop action, reads <code>/tmp/unicorn.orly.0.pid</code> to get the PID of the process and then checks whether that process still exists. However, since the kill-respawn operation happens so fast, monit doesn't realize that the worker's PID has changed and keeps waiting for the (brand new) worker to die. Then it times out, realizes the PID has actually changed, and carries on as normal.
The dirty solution I have found: To test this hypothesis, I tried to slow down the kill-respawn operation. I edited the unicorn config file so that new workers sleep for a few seconds just before they write their new PID to <code>/tmp/unicorn.orly.0.pid</code>.
I did it like this:
<pre><code>after_fork do |server, worker|
  sleep 3
  # write down the new worker PID so monit can monitor it
  child_pid = server.config[:pid].sub(".pid", ".#{worker.nr}.pid")
  system("echo #{Process.pid} > #{child_pid}")
end
</code></pre>
And it worked wonderfully: birds and flowers sing in the sunny day, the web interface now shows a nice <code>process running</code> status, and the logs show everything going smoothly. Take a look:
<pre><code>[UTC Mar 5 13:30:44] info : 'orly_unicorn_worker_0' trying to restart
[UTC Mar 5 13:30:44] info : 'orly_unicorn_worker_0' stop: /etc/init.d/unicorn_orly
[UTC Mar 5 13:30:45] info : 'orly_unicorn_worker_0' stopped
[UTC Mar 5 13:30:45] info : 'orly_unicorn_worker_0' start: /bin/true
[UTC Mar 5 13:30:46] info : 'orly_unicorn_worker_0' restart action done
</code></pre>
The question: Is there a monit way of achieving this? Making my workers sleep for 3 seconds doesn't seem like a good solution. Any ideas?
I understand this is not the normal situation with monit. We have kind of broken monit's restart cycle, since we don't want monit's <code>start program</code> to perform any action and instead let the unicorn master process handle it (as explained here: <a href="http://www.stopdropandrew.com/2010/06/01/where-unicorns-go-to-die-watching-unicorn-workers-with-monit.html" rel="nofollow">http://www.stopdropandrew.com/2010/06/01/where-unicorns-go-to-die-watching-unicorn-workers-with-monit.html</a>).
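The guess about the stop check can be reproduced outside monit with a small Ruby sketch. This only models the hypothesis described above; it is not monit's actual implementation. The helper names (`alive?`, `naive_stopped?`, `stopped_since?`, `simulate_restart`) are made up for illustration, and the script assumes a Unix-like system where `fork` is available. It kills a fake worker, respawns it immediately, and compares a check that only re-reads the pidfile with one that remembers the PID seen before the stop.

```ruby
require "tempfile"

# True if a process with this pid currently exists.
def alive?(pid)
  Process.kill(0, pid) # signal 0 only tests for existence
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Hypothesised check: re-read the pidfile on every poll and report
# "stopped" only when the pid found there is not running.
def naive_stopped?(pidfile)
  !alive?(File.read(pidfile).to_i)
end

# Alternative check: remember the pid seen before the stop command and
# treat either its death or a different pid in the file as "stopped".
def stopped_since?(pidfile, old_pid)
  File.read(pidfile).to_i != old_pid || !alive?(old_pid)
end

# Kill a fake worker, respawn it at once, then run both checks.
def simulate_restart
  Tempfile.create("worker.pid") do |f|
    pidfile = f.path

    old_pid = fork { sleep 30 } # stand-in for the running worker
    File.write(pidfile, old_pid.to_s)

    # "stop program": kill the worker and reap it
    Process.kill("KILL", old_pid)
    Process.wait(old_pid)

    # the master respawns immediately and the new worker rewrites the pidfile
    new_pid = fork { sleep 30 }
    File.write(pidfile, new_pid.to_s)

    result = {
      naive: naive_stopped?(pidfile),
      compared: stopped_since?(pidfile, old_pid)
    }

    # clean up the fake replacement worker
    Process.kill("KILL", new_pid)
    Process.wait(new_pid)
    result
  end
end
```

Calling `simulate_restart` should show the naive check still seeing a running process after the respawn, while the comparison against the remembered PID reports the stop. That matches the 30-second wait in the first log, but it says nothing definite about how monit is written.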
Tags
<ruby-on-rails><timeout><unicorn><monit>
Title
Monit times out when stopping unicorn workers if they respawn too quickly
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USariera
UserOwnerUserId
1. USariera
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POMonit times out when stopping unicorn workers if they respawn too quickly
 UserUserId
 USariera
 VoteTypeVoteTypeId
 VTBountyStart
2. VO
 singulars
 PostPostId
 POMonit times out when stopping unicorn workers if they respawn too quickly
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POMonit times out when stopping unicorn workers if they respawn too quickly
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTBountyClose
CommentsPostId
1. This table or related slice is empty.
