
Spark cluster fails on bigger input, works well for small
<p>I'm playing with Spark. It is the default, pre-built distribution (0.7.0) from the website, with the default config, cluster mode, and one worker (my localhost). I read the docs on installing and everything seems fine.</p>

<p>I have a CSV file (various sizes, 1,000 - 1 million rows). If I run my app with a small input file (for example the 1,000 rows), everything is fine: the program finishes in seconds and produces the expected output. But when I supply a bigger file (100,000 rows, or 1 million), the execution fails. I tried to dig into the logs, but they did not help much: it repeats the whole process about 9-10 times and exits with a failure after that, and there is also an error about fetching from some null source having failed.</p>

<p>The Iterable returned by the first JavaRDD looks suspicious to me. If I return a hard-coded singleton list (like res.add("something"); return res;), everything is fine, even with one million rows. But if I add all the keys I want (28 strings of length 6-20 chars), the process fails <em>only</em> with the big input. The problem is, I need all these keys; this is the actual business logic.</p>

<p>I'm using Linux amd64, quad core, 8GB RAM, and the latest Oracle Java 7 JDK. Spark config:</p>

<pre><code>SPARK_WORKER_MEMORY=4g
SPARK_MEM=3g
SPARK_CLASSPATH=$SPARK_CLASSPATH:/my/super/application.jar
</code></pre>

<p>I must mention that when I start the program, it says:</p>

<pre><code>13/05/30 11:41:52 WARN spark.Utils: Your hostname, *** resolves to a loopback address: 127.0.1.1; using 192.168.1.157 instead (on interface eth1)
13/05/30 11:41:52 WARN spark.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
</code></pre>

<p>Here is my program.
It is based on the JavaWordCount example, minimally modified.</p>

<pre><code>public final class JavaWordCount {
  public static void main(final String[] args) throws Exception {
    final JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
        System.getenv("SPARK_HOME"), new String[] { "....jar" });

    final JavaRDD&lt;String&gt; words = ctx.textFile(args[1], 1).flatMap(
        new FlatMapFunction&lt;String, String&gt;() {
          @Override
          public Iterable&lt;String&gt; call(final String s) {
            // parsing "s" as the line, computation, building res (it's a List&lt;String&gt;)
            return res;
          }
        });

    final JavaPairRDD&lt;String, Integer&gt; ones = words.map(
        new PairFunction&lt;String, String, Integer&gt;() {
          @Override
          public Tuple2&lt;String, Integer&gt; call(final String s) {
            return new Tuple2&lt;String, Integer&gt;(s, 1);
          }
        });

    final JavaPairRDD&lt;String, Integer&gt; counts = ones.reduceByKey(
        new Function2&lt;Integer, Integer, Integer&gt;() {
          @Override
          public Integer call(final Integer i1, final Integer i2) {
            return i1 + i2;
          }
        });

    counts.collect();

    for (Tuple2&lt;?, ?&gt; tuple : counts.collect()) {
      System.out.println(tuple._1 + ": " + tuple._2);
    }
  }
}
</code></pre>
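<p>The body of the flatMap is elided in the code above. As a purely hypothetical stand-in (the class name, the comma delimiter, and the trimming/filtering are all assumptions, not taken from the original program), the per-line key extraction it describes could look like this standalone sketch:</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the elided flatMap body: split one CSV line
// and emit one key string per non-empty field. The real logic in the
// question is not shown; this only illustrates the "many keys per line" shape.
public final class KeyExtractor {
    public static List<String> extractKeys(final String line) {
        final List<String> res = new ArrayList<String>();
        for (final String field : line.split(",")) {
            final String key = field.trim();
            if (!key.isEmpty()) {
                res.add(key);
            }
        }
        return res;
    }

    public static void main(final String[] args) {
        System.out.println(extractKeys("a, b ,c")); // prints [a, b, c]
    }
}
```

<p>Returning a list like this (instead of a hard-coded singleton) multiplies the number of records shuffled by reduceByKey, which is exactly the part the question reports as failing on large input.</p>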
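<p>Regarding the SPARK_LOCAL_IP warning in the startup log: Spark reads these variables from conf/spark-env.sh. A minimal sketch, assuming the 192.168.1.157 address from the warning is the one to bind to (adjust for your machine):</p>

```shell
# conf/spark-env.sh -- sourced by the Spark launch scripts.
# Bind explicitly to the LAN address instead of the loopback address
# the hostname resolves to (address taken from the warning above).
export SPARK_LOCAL_IP=192.168.1.157

# Memory limits already used in the question, repeated here for completeness.
export SPARK_WORKER_MEMORY=4g
export SPARK_MEM=3g
```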