Hive produces wrong results when a partition has too many rows
<p>The problem is joining a big partition with more than 2^31 rows on the partition key.</p>
<p>(The outputs in this post are from MapR's distribution, but this was also reproduced on Apache Hadoop/Hive.)<br>
Versions: hadoop - 0.20.2, hive - 0.10.0</p>
<p>When the partition has more than 2147483648 rows (even 2147483649), the output of the join is a single row.<br>
When the partition has fewer than 2147483648 rows (even 2147483647), the output is correct.</p>
<p>Test case:</p>
<p>Create a table with 2147483649 rows in a partition with the value "1",<br>
join this table to another table with a single row and a single column holding the value "1" on the partition key,<br>
then delete 2 rows and run the same join.<br>
1st run: only a single row is produced.<br>
2nd run: 2147483647 rows.</p>
<pre class="lang-sql prettyprint-override"><code>create table max_sint_rows (s1 string)
  partitioned by (p1 string)
  ROW FORMAT DELIMITED LINES TERMINATED BY '\n';

create table small_table (p1 string)
  ROW FORMAT DELIMITED LINES TERMINATED BY '\n';

alter table max_sint_rows add partition (p1="1");
</code></pre>
<p>Write 2147483649 random rows to max_sint_rows.<br>
Write the value "1" into small_table.</p>
<pre class="lang-sql prettyprint-override"><code>create table output_rows_over as
select a.s1
from max_sint_rows a
join small_table b on (a.p1 = b.p1);
</code></pre>
<p>In the reducer's syslog we get this output:</p>
<pre><code>INFO ExecReducer: ExecReducer: processing 2147000000 rows: used memory = 715266312
INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarding 1 rows
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Final Path: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_tmp.-ext-10001/000004_1
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Writing to temp file: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_task_tmp.-ext-10001/_tmp.000004_1
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_tmp.-ext-10001/000004_1
INFO ExecReducer: ExecReducer: processed 2147483650 rows: used memory = 828336712
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarded 1 rows
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: SKEWJOINFOLLOWUPJOBS:0
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarded 1 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 forwarded 0 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 Close done
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 Close done
org.apache.hadoop.mapred.Task: Task:attempt_201305071944_2359_r_000004_1 is done. And is in the process of commiting
INFO org.apache.hadoop.mapred.Task: Task 'attempt_201305071944_2359_r_000004_1' done.
INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-
</code></pre>
<p>Notice the TABLE_ID_1_ROWCOUNT:1; in fact, the output table contains only a single random row.</p>
<p>Now delete 2 rows from max_sint_rows and rerun:</p>
<pre class="lang-sql prettyprint-override"><code>create table output_rows_under as
select a.s1
from max_sint_rows a
join small_table b on (a.p1 = b.p1);
</code></pre>
<p>This time we get 2147483647 rows in output_rows_under, and the syslog of the reducer is:</p>
<pre><code>INFO ExecReducer: ExecReducer: processed 2147483648 rows: used memory = 243494552
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarded 2147483647 rows
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: SKEWJOINFOLLOWUPJOBS:0
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarded 2147483647 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 forwarded 0 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:2147483647
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 Close done
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 Close done
INFO org.apache.hadoop.mapred.Task: Task:attempt_201305071944_2360_r_000004_0 is done. And is in the process of commiting
INFO org.apache.hadoop.mapred.Task: Task 'attempt_201305071944_2360_r_000004_0' done.
INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
</code></pre>
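<p>The cutoff at exactly 2147483648 rows strongly suggests a signed 32-bit overflow somewhere in the join path. The following standalone Java sketch only demonstrates the wrap-around arithmetic that matches the observed behavior; it is a hypothesis about the cause, not Hive's actual code, and the variable names are illustrative:</p>

```java
// Hypothesis: somewhere in Hive 0.10's join path a per-key row count is
// narrowed to (or tracked in) a signed 32-bit int. This snippet shows how
// 2^31 + 1 rows wraps to a negative count while 2^31 - 1 stays intact.
// All names here are illustrative, not taken from Hive's source.
public class RowCounterOverflow {
    public static void main(String[] args) {
        long overLimit = 2147483649L;            // 2^31 + 1 rows (join breaks)
        int wrapped = (int) overLimit;           // Java narrowing keeps low 32 bits
        System.out.println(wrapped);             // prints -2147483647 (negative!)

        long underLimit = 2147483647L;           // 2^31 - 1 rows (join works)
        System.out.println((int) underLimit);    // prints 2147483647, unchanged
    }
}
```

<p>A negative internal count would explain why the reducer logs that it "processed 2147483650 rows" (a long) yet forwards only 1 row: any loop or size check driven by the wrapped int would behave as if almost no rows were buffered.</p>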