Running out of heap space with web crawler
<p>I wrote a small crawler and found that it was running out of heap space, even though I currently limit the number of URLs in my list to 300.</p>
<p>With the Java Memory Analyzer I found that the main consumer is <code>char[]</code> (45MB out of 64MB, or more if I increase the allowed size; it just grows constantly).</p>
<p>The analyzer also shows me the content of the <code>char[]</code> arrays. They contain the HTML pages that were read by the crawler.</p>
<p>After deeper analysis with different settings for <code>-Xmx[...]m</code>, I found that Java uses <strong>almost all the space</strong> it has available and then runs <code>out of heap</code> as soon as I want to download an image of 3MB.</p>
<p>When I give Java 16MB, it uses 14MB and fails; when I give it 64MB, it uses 59MB and fails when trying to download a large image.</p>
<p>Reading pages is done with this piece of code (edited to add <code>.close()</code>):</p>
<pre><code>private String readPage(Website url) throws CrawlerException {
    StringBuffer sourceCodeBuffer = new StringBuffer();
    try {
        URLConnection con = url.getUrl().openConnection();
        con.setConnectTimeout(2000);
        con.setReadTimeout(2000);

        BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String strTemp = "";
        try {
            while(null != (strTemp = br.readLine())) {
                sourceCodeBuffer = sourceCodeBuffer.append(strTemp);
            }
        } finally {
            br.close();
        }
    } catch (IOException e) {
        throw new CrawlerException();
    }

    return sourceCodeBuffer.toString();
}
</code></pre>
<p>Another function uses the returned string in a while loop, but to my knowledge the space should be freed as soon as the string is overwritten with the next page.</p>
<pre><code>public void run() {
    boolean stop = false;

    while (stop == false) {
        try {
            Website nextPage = getNextPage();
            String source = visitAndReadPage(nextPage);
            List&lt;Website&gt; links = new LinkExtractor(nextPage).extract(source);
            List&lt;Website&gt; images = new ImageExtractor(nextPage).extract(source);

            // do something with links and images, source is not used anymore
        } catch (CrawlerException e) {
            logger.warning("could not crawl a url");
        }
    }
}
</code></pre>
<p>Below is an example of the output the analyzer gives me. When I want to see <strong>where</strong> these <code>char[]</code> are still required, the analyzer cannot tell. So I guess they are no longer needed and should be garbage collected. As usage is always slightly below the maximum, it also seems that Java <strong>does</strong> garbage collect, but only as much as necessary to keep the program running for now (without anticipating that large input might be coming).</p>
<p>Also, explicitly calling <code>System.gc()</code> every 5 seconds, or even after setting <code>source = null;</code>, did not help.</p>
<p>The page sources just seem to be kept for as long as possible.</p>
<p>Am I using something <a href="https://stackoverflow.com/questions/7495155/java-heap-space-out-of-memory">similar to <code>ObjectOutputStream</code></a> that forces the strings I read to be retained forever? Or how is it possible that Java keeps these website <code>Strings</code> in a <code>char[]</code> array for so long?</p>
<pre><code>Class Name | Shallow Heap | Retained Heap | Percentage
------------------------------------------------------------------------------------------
char[60750] @ 0xb02c3ee0 &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;&lt;title&gt;Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen&lt;/title&gt;&lt;link rel="shortcut icon" href="http://img.e-wallp...| 121.512 | 121.512 | 1,06%
char[60716] @ 0xb017c9b8 &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;&lt;title&gt;Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen&lt;/title&gt;&lt;link rel="shortcut icon" href="http://img.e-wallp...| 121.448 | 121.448 | 1,06%
char[60686] @ 0xb01f3c88 &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;&lt;title&gt;Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen&lt;/title&gt;&lt;link rel="shortcut icon" href="http://img.e-wallp...| 121.384 | 121.384 | 1,06%
char[60670] @ 0xb015ec48 &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;&lt;title&gt;Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen&lt;/title&gt;&lt;link rel="shortcut icon" href="http://img.e-wallp...| 121.352 | 121.352 | 1,06%
char[60655] @ 0xb01d5d08 &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;&lt;title&gt;Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen&lt;/title&gt;&lt;link rel="shortcut icon" href="http://img.e-wallp...| 121.328 | 121.328 | 1,06%
char[60651] @ 0xb009d9c0 &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;&lt;title&gt;Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen&lt;/title&gt;&lt;link rel="shortcut icon" href="http://img.e-wallp...| 121.320 | 121.320 | 1,06%
char[60637] @ 0xb022f418 &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;&lt;title&gt;Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen&lt;/title&gt;&lt;link rel="shortcut icon" href="http://img.e-wallp...| 121.288 | 121.288 | 1,06%
</code></pre>
<h2>Edit</h2>
<p>After testing with even more memory, I found this occurrence of a URL in the <code>dominator tree</code>:</p>
<pre><code>Class Name | Shallow Heap | Retained Heap | Percentage
crawling.Website @ 0xa8d28cb0 | 16 | 759.776 | 0,15%
|- java.net.URL @ 0xa8d289c0 https://www.google.com/recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kN... | 56 | 759.736 | 0,15%
| |- char[379486] @ 0xa8c6f4f8 &lt;!DOCTYPE html&gt;&lt;html lang="en"&gt; &lt;head&gt; &lt;meta charset="utf-8"&gt; &lt;meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9"&gt; &lt;title&gt;Google Accounts&lt;/title&gt;&lt;style type="text/css"&gt; html, body, div, h1, h2, h3, h4, h5, h6, p, img, dl, dt, dd, ol, ul, li, t... | 758.984 | 758.984 | 0,15%
| |- java.lang.String @ 0xa8d28a40 /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl...| 24 | 624 | 0,00%
| | '- char[293] @ 0xa8d28a58 /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl... | 600 | 600 | 0,00%
| |- java.lang.String @ 0xa8d289f8 c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl6YmMgFC77kWZR7vvZIPkS...| 24 | 24 | 0,00%
| |- java.lang.String @ 0xa8d28a10 www.google.com | 24 | 24 | 0,00%
| |- java.lang.String @ 0xa8d28a28 /recaptcha/api/image | 24 | 24 | 0,00%
</code></pre>
<p>From the indentation I am really wondering: why is the HTML source part of <code>java.net.URL</code>? Does this come from the URLConnection I had opened?</p>
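<p>One possible explanation, offered as an assumption rather than a confirmed diagnosis: on JVMs older than Java 7 update 6, <code>String.substring()</code> returns a string that shares the <code>char[]</code> of the string it was cut from. If the link extractor builds each URL from a substring of the page source, every stored URL would keep the entire page alive, which would match a <code>char[379486]</code> hanging under <code>java.net.URL</code>. The sketch below shows how an extracted link could be copied into its own array. <code>DetachedLinks</code>, its <code>extract</code> method, and the <code>href</code> regex are hypothetical and are not the <code>LinkExtractor</code> used above.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the LinkExtractor from the question.
final class DetachedLinks {
    // Simplified pattern (an assumption): matches double-quoted href values only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    private DetachedLinks() {}

    /** Returns every href value found in source, each copied into its own String. */
    static List<String> extract(String source) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(source);
        while (m.find()) {
            // On old JVMs m.group(1) may share the page's char[]; new String(...)
            // makes a copy backed by an array that holds only the link itself.
            links.add(new String(m.group(1)));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><a href=\"http://example.com/a\">a</a> <a href=\"/b\">b</a></html>";
        System.out.println(extract(page));
    }
}
```

<p>On newer JVMs, where <code>substring()</code> already copies the characters, the extra <code>new String(...)</code> is redundant but harmless.</p>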
 
