
Spark cluster fails on bigger input, works well for small
<p>I'm playing with Spark. It is the default, pre-built distribution (0.7.0) from the website, with the default config, cluster mode, and one worker (my localhost). I read the docs on installing and everything seems fine.</p>

<p>I have a CSV file (various sizes, 1,000 - 1 million rows). If I run my app with a small input file (for example the 1,000 rows), everything is fine: the program finishes in seconds and produces the expected output. But when I supply a bigger file (100,000 rows, or 1 million), the execution fails. I tried to dig into the logs, but they did not help much: it repeats the whole process about 9-10 times and exits with a failure after that, and there is also an error about fetching from some null source having failed.</p>

<p>The Iterable returned by the first JavaRDD looks suspicious to me. If I return a hard-coded singleton list (like res.add("something"); return res;), everything is fine, even with one million rows. But if I add all the keys I want (28 strings of length 6-20 chars), the process fails <em>only</em> with the big input. The problem is, I need all these keys; this is the actual business logic.</p>

<p>I'm using Linux amd64, quad core, 8GB RAM, and the latest Oracle Java 7 JDK. Spark config:</p>

<pre><code>SPARK_WORKER_MEMORY=4g
SPARK_MEM=3g
SPARK_CLASSPATH=$SPARK_CLASSPATH:/my/super/application.jar
</code></pre>

<p>I must mention that when I start the program, it says:</p>

<pre><code>13/05/30 11:41:52 WARN spark.Utils: Your hostname, *** resolves to a loopback address: 127.0.1.1; using 192.168.1.157 instead (on interface eth1)
13/05/30 11:41:52 WARN spark.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
</code></pre>

<p>Here is my program.
It is based on the JavaWordCount example, minimally modified.</p>

<pre><code>public final class JavaWordCount {
  public static void main(final String[] args) throws Exception {
    final JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
        System.getenv("SPARK_HOME"), new String[] { "....jar" });

    final JavaRDD&lt;String&gt; words = ctx.textFile(args[1], 1).flatMap(
        new FlatMapFunction&lt;String, String&gt;() {
          @Override
          public Iterable&lt;String&gt; call(final String s) {
            // parsing "s" as the line, computation, building res (it's a List&lt;String&gt;)
            return res;
          }
        });

    final JavaPairRDD&lt;String, Integer&gt; ones = words.map(
        new PairFunction&lt;String, String, Integer&gt;() {
          @Override
          public Tuple2&lt;String, Integer&gt; call(final String s) {
            return new Tuple2&lt;String, Integer&gt;(s, 1);
          }
        });

    final JavaPairRDD&lt;String, Integer&gt; counts = ones.reduceByKey(
        new Function2&lt;Integer, Integer, Integer&gt;() {
          @Override
          public Integer call(final Integer i1, final Integer i2) {
            return i1 + i2;
          }
        });

    counts.collect();

    for (Tuple2&lt;?, ?&gt; tuple : counts.collect()) {
      System.out.println(tuple._1 + ": " + tuple._2);
    }
  }
}
</code></pre>
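<p>The body of the flatMap is elided in the code above. As a purely hypothetical stand-in (the class name, the comma delimiter, and the trimming/filtering are all assumptions, not taken from the original program), the per-line key extraction it describes could look like this standalone sketch:</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the elided flatMap body: split one CSV line
// and emit one key string per non-empty field. The real logic in the
// question is not shown; this only illustrates the "many keys per line" shape.
public final class KeyExtractor {
    public static List<String> extractKeys(final String line) {
        final List<String> res = new ArrayList<String>();
        for (final String field : line.split(",")) {
            final String key = field.trim();
            if (!key.isEmpty()) {
                res.add(key);
            }
        }
        return res;
    }

    public static void main(final String[] args) {
        System.out.println(extractKeys("a, b ,c")); // prints [a, b, c]
    }
}
```

<p>Returning a list like this (instead of a hard-coded singleton) multiplies the number of records shuffled by reduceByKey, which is exactly the part the question reports as failing on large input.</p>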
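<p>Regarding the SPARK_LOCAL_IP warning in the startup log: Spark reads these variables from conf/spark-env.sh. A minimal sketch, assuming the 192.168.1.157 address from the warning is the one to bind to (adjust for your machine):</p>

```shell
# conf/spark-env.sh -- sourced by the Spark launch scripts.
# Bind explicitly to the LAN address instead of the loopback address
# the hostname resolves to (address taken from the warning above).
export SPARK_LOCAL_IP=192.168.1.157

# Memory limits already used in the question, repeated here for completeness.
export SPARK_WORKER_MEMORY=4g
export SPARK_MEM=3g
```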