Node.js app has periodic slowness and/or timeouts (does not accept incoming requests)
<p>This problem is killing the stability of my production servers.</p>
<p>To recap, the basic idea is that my node server(s) intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.</p>
<p><em>The problem ranges in severity</em>. Sometimes requests that should take &lt;100ms take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. <strong>In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the symptom I need to fix is a "Gateway Timeout"</strong>.</p>
<p><em>The issue comes and goes without warning</em>. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the lowest-load time of the day! Restarting the node app does seem to <em>maybe</em> make the problem go away <em>for a while</em>, but that really doesn't tell me much. I do wonder if <a href="https://groups.google.com/forum/?fromgroups=#!topic/nodejs/jncp6KM3EGM" rel="nofollow noreferrer">it might be a bug in node.js</a>... not very comforting, considering it is killing my production servers.</p>
<ul>
<li>The first thing I did was to make sure I had upgraded node.js to the latest (0.8.12), as well as all my modules (<a href="http://pastebin.com/UFS6jW5X" rel="nofollow noreferrer">here they are</a>). Of course, I also have plenty of error catchers in place. I'm not doing anything funky like printing out lots to the console or writing to lots of files.</li>
<li>At first, <a href="https://stackoverflow.com/questions/12868107/node-jsexpress-randomly-drops-requests-resulting-in-a-gateway-timeout">I thought it was outbound HTTP requests blocking the incoming socket, because the express middleware was not even picking up the inbound request, but I gave up the theory because it looks like the node thread itself became busy</a>.</li>
<li>Next, I went through all my code with JSHint and fixed literally every single warning, including a few accidental globals (forgetting to write "var"), but this didn't help.</li>
<li>After that, I assumed that perhaps I was running out of memory. But my heap snapshots via nodetime are looking pretty good now (described below).</li>
<li><a href="https://stackoverflow.com/questions/12887062/node-v8-garbage-collector-how-to-debug-long-mark-sweep-times">Still thinking that memory might be an issue, I took a look at garbage collection</a>. I enabled the --nouse-idle-notification flag and did some more code optimization to null out objects when they were no longer needed.</li>
<li>Still convinced that memory was the issue, I added the --expose-gc flag and executed the gc(); command every minute. This did not change anything, except perhaps to occasionally make requests a bit slower.</li>
<li>In a desperate attempt, I set up the "cluster" module to use 2 workers and automatically restart them every 30 min. Still, no luck.</li>
<li>I increased the ulimit to over 10,000 and kept an eye on the open files. There seem to be &lt;300 open files (or sockets) per node.js app, so increasing the ulimit had no impact.</li>
</ul>
<p>I've been logging my server with nodetime and here's the gist of it:</p>
<ul>
<li>CentOS 5.2 running on the Amazon Cloud (m1.large instance)</li>
<li>Greater than 5000 MB free memory at all times</li>
<li>Less than 150 MB heap size at all times</li>
<li>CPU usage is less than 60% at all times</li>
</ul>
<p>I've also checked my MongoDB servers, which have &lt;5% CPU usage and no requests taking &gt;100ms to complete, so I highly doubt there's a bottleneck there.</p>
<p>I've wrapped (almost) all my code using Q promises (<a href="https://stackoverflow.com/questions/12883370/reading-node-js-heap-snapshots-created-via-nodetime-why-are-my-objects-not#comment17444187_12883370">see code sample</a>), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OS X), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate the problem via stress tests...</p>
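<p>One way to test the "something is blocking the node thread" theory directly is to measure how late a repeating timer fires. If the event loop is busy, the timer callback runs late, and the delay approximates how long the thread was blocked. The following is only a minimal sketch of that idea, not a definitive implementation: <code>createLagMonitor</code>, its parameters, and the injectable <code>now</code> clock are hypothetical names introduced here for illustration. It is written in ES5 style, and the <code>unref</code> call is guarded because older runtimes may not provide it.</p>

```javascript
'use strict';

// Hypothetical helper (not from any library): reports how late a repeating
// timer fires. A late tick suggests the event loop was blocked for roughly
// that long.
//   intervalMs  - how often to sample
//   thresholdMs - minimum lag worth reporting
//   onLag       - callback invoked with the measured lag in milliseconds
//   now         - optional clock function; defaults to Date.now
function createLagMonitor(intervalMs, thresholdMs, onLag, now) {
  now = now || Date.now;
  var expected = now() + intervalMs; // when the next tick should fire

  // Runs on every timer tick; also returned so it can be driven manually
  // with a fake clock.
  function tick() {
    var current = now();
    var lag = Math.max(0, current - expected);
    if (lag >= thresholdMs) {
      onLag(lag);
    }
    expected = current + intervalMs;
    return lag;
  }

  var timer = setInterval(tick, intervalMs);
  // Do not keep the process alive just for monitoring, where supported.
  if (typeof timer.unref === 'function') {
    timer.unref();
  }

  return {
    tick: tick,
    stop: function () { clearInterval(timer); }
  };
}

// Example usage (assumed values): sample every second and log any tick
// that arrives 200 ms or more late.
// var monitor = createLagMonitor(1000, 200, function (lag) {
//   console.error('event loop lag: ' + lag + ' ms');
// });
```

<p>Logging these lag values with timestamps next to the request logs would show whether the Gateway Timeouts line up with periods when the thread was genuinely blocked, or whether the requests are being held up before they ever reach node.</p>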