
What can cause the JVM to fail to resolve DNS under load?
<p>I'm investigating an issue with our service, which fails to resolve S3 bucket names under load.</p>

<p>I'm stressing a single c1.medium EC2 instance:</p>

<pre><code>root@ip-10-243-126-111:/mnt/log# uname -a
Linux ip-10-243-126-111 2.6.35-30-virtual #56-Ubuntu SMP Mon Jul 11 23:41:40 UTC 2011 i686 GNU/Linux
root@ip-10-243-126-111:/mnt/log# cat /etc/issue
Ubuntu 10.10 \n \l
root@ip-10-243-126-111:/mnt/log# free
             total       used       free     shared    buffers     cached
Mem:       1746008    1681752      64256          0      29600    1582508
-/+ buffers/cache:      69644    1676364
Swap:       917500         32     917468
</code></pre>

<p>The application is running with <code>-server</code>, JVM build 1.6.0_23-b05, 32-bit.</p>

<p>The behaviour I'm seeing is that network communications start to "act funny": sometimes socket timeouts occur in our MongoDB connection driver, which look like this:</p>

<pre><code>Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.6.0_23]
	at java.net.SocketInputStream.read(SocketInputStream.java:129) ~[na:1.6.0_23]
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) ~[na:1.6.0_23]
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) ~[na:1.6.0_23]
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317) ~[na:1.6.0_23]
	at org.bson.io.Bits.readFully(Bits.java:35) ~[mongo-java-driver-2.5.3.jar:na]
	at org.bson.io.Bits.readFully(Bits.java:28) ~[mongo-java-driver-2.5.3.jar:na]
	at com.mongodb.Response.&lt;init&gt;(Response.java:35) ~[mongo-java-driver-2.5.3.jar:na]
	at com.mongodb.DBPort.go(DBPort.java:110) ~[mongo-java-driver-2.5.3.jar:na]
	at com.mongodb.DBPort.go(DBPort.java:75) ~[mongo-java-driver-2.5.3.jar:na]
	at com.mongodb.DBPort.call(DBPort.java:65) ~[mongo-java-driver-2.5.3.jar:na]
	at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:201) ~[mongo-java-driver-2.5.3.jar:na]
	... 43 common frames omitted
</code></pre>

<p>and at times the following happens:</p>

<pre><code>Caused by: java.net.UnknownHostException: bucket-system.s3.amazonaws.com
	at java.net.InetAddress.getAllByName0(InetAddress.java:1158) ~[na:1.6.0_23]
	at java.net.InetAddress.getAllByName(InetAddress.java:1084) ~[na:1.6.0_23]
	at java.net.InetAddress.getAllByName(InetAddress.java:1020) ~[na:1.6.0_23]
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.resolveHostname(DefaultClientConnectionOperator.java:242) ~[httpclient-4.1.jar:4.1]
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:130) ~[httpclient-4.1.jar:4.1]
	at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149) ~[httpclient-4.1.jar:4.1]
	at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121) ~[httpclient-4.1.jar:4.1]
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:562) ~[httpclient-4.1.jar:4.1]
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415) ~[httpclient-4.1.jar:4.1]
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820) ~[httpclient-4.1.jar:4.1]
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754) ~[httpclient-4.1.jar:4.1]
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732) ~[httpclient-4.1.jar:4.1]
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:240) ~[aws-java-sdk-1.2.5.jar:na]
	... 48 common frames omitted
</code></pre>

<p>This is reproducible but not consistent. Once load starts on the machine (50 concurrent HTTP requests), the machine goes through cycles of responding correctly for ~5 min, then failing all requests for ~10 s, then another cycle of correct responses.</p>

<p>What can cause such behaviour? Is there any ulimit or other system setting I might try to tune to improve on this? Any more pointers on where to search for clues?</p>

<p>Another possibility I suspect is the infrastructure at Amazon (us-east-1 region): perhaps the routers there activate some kind of DoS-prevention policy on the service, because requests jump almost instantly from 0 to 50. After some time it stabilizes at a steady rate of 50 concurrent requests, at which point the hardware adjusts to the new traffic. Far-fetched? I haven't found any mention of this type of pattern anywhere.</p>
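To help isolate whether name resolution itself is the bottleneck, here is a minimal sketch (not from the original post) that fires concurrent <code>InetAddress.getAllByName</code> calls, the same lookup Apache HttpClient performs, with the JVM's DNS caches disabled so every lookup hits the resolver. The class name, host, and thread/iteration counts are placeholders; if this alone reproduces bursts of <code>UnknownHostException</code> on the instance, the problem is below the application layer (resolver, <code>/etc/resolv.conf</code>, or the EC2 network), not in the S3 or MongoDB clients:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DnsLoadProbe {

    // Fire threads * lookupsPerThread DNS lookups concurrently and
    // return how many failed with UnknownHostException.
    static int probe(String host, int threads, int lookupsPerThread)
            throws InterruptedException {
        // Disable the JVM's positive and negative DNS caches so every lookup
        // actually reaches the resolver, mimicking worst-case resolver load.
        // (Some JVMs read these security properties only once, at the first
        // lookup, so set them before any name resolution happens.)
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failures = new AtomicInteger();
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                for (int j = 0; j < lookupsPerThread; j++) {
                    try {
                        // The same call HttpClient makes before opening a connection.
                        InetAddress.getAllByName(host);
                    } catch (UnknownHostException e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(60, TimeUnit.SECONDS);
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 50 threads to match the concurrency of the load test in the question.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("failed lookups: " + probe(host, 50, 20));
    }
}
```

Running this against the affected bucket hostname during the ~10 s failure windows, and comparing with <code>dig</code> or <code>nslookup</code> from a shell on the same box, would show whether the JVM and the OS resolver fail at the same moments.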
 


 