<p>That is certainly an approach, but if you are doing two random reads per scanned row, your throughput will plummet. If you are filtering the rows out significantly, or have a small dataset in A, that may not be an issue.</p>
<h1>Sort-merge Join</h1>
<p>However, the best approach, which will be available in HBase 0.96, is <code>MultiTableInputFormat</code>. With it, the job scans table A and writes its output under a key that table B's output can be matched against.</p>
<p>E.g. Table A emits (b_id, a_info) and Table B emits (b_id, b_info); the two are merged in the reducer.</p>
<p>This is an example of a sort-merge join.</p>
<h1>Nested-Loop Join</h1>
<p>If you are joining on the row key, or the joining attribute is sorted in line with table B, you can have an instance of a scanner in each task which sequentially reads from table B until it finds what it's looking for.</p>
<p>E.g. Table A row key = "companyId" and Table B row key = "companyId_employeeId". Then for each company in Table A you can get all the employees using the nested-loop algorithm.</p>
<h3>Pseudocode:</h3>
<pre><code>for company in TableA:
    for employee in TableB:
        if employee.company_id == company.id:
            emit(company.id, employee)
</code></pre>
<p>This is an example of a nested-loop join.</p>
<h2>More detailed join algorithms are here:</h2>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Nested_loop_join">http://en.wikipedia.org/wiki/Nested_loop_join</a></li>
<li><a href="http://en.wikipedia.org/wiki/Hash_join">http://en.wikipedia.org/wiki/Hash_join</a></li>
<li><a href="http://en.wikipedia.org/wiki/Sort-merge_join">http://en.wikipedia.org/wiki/Sort-merge_join</a></li>
</ul>
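<p>To make the two strategies concrete, here is a minimal in-memory sketch in Python. It is not HBase code: the lists stand in for table scans, and every name in it (<code>table_a</code>, <code>table_b</code>, <code>reduce_side_join</code>, <code>scan_join</code>) is hypothetical. The first function mimics the map, shuffle and reduce steps of the join keyed on b_id; the second mimics a scanner that only moves forward over table B, and assumes fixed-width company IDs so that both key lists sort in the same order.</p>

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-ins for the two tables (not real HBase scans).
table_a = [("a1", "b1", "order-1"), ("a2", "b2", "order-2"), ("a3", "b1", "order-3")]
table_b = [("b1", "Alice"), ("b2", "Bob"), ("b3", "Carol")]


def reduce_side_join(rows_a, rows_b):
    """Join keyed on b_id: both sides emit (b_id, tagged value), then merge."""
    # "Map" step: each table emits records under the shared join key.
    emitted = [(b_id, ("A", a_info)) for _, b_id, a_info in rows_a]
    emitted += [(b_id, ("B", b_info)) for b_id, b_info in rows_b]
    # "Shuffle" step: sorting by key puts matching records next to each other.
    emitted.sort(key=itemgetter(0))
    joined = []
    # "Reduce" step: pair every A value with every B value for the same key.
    for b_id, group in groupby(emitted, key=itemgetter(0)):
        a_side, b_side = [], []
        for _, (tag, info) in group:
            (a_side if tag == "A" else b_side).append(info)
        joined.extend((b_id, a, b) for a in a_side for b in b_side)
    return joined


def scan_join(company_keys, employee_keys):
    """Forward-only scan over employee keys of the form companyId_employeeId.

    Assumes both lists are sorted and company IDs are fixed-width.
    """
    out, i = [], 0
    for company_id in company_keys:
        prefix = company_id + "_"
        # Skip employee rows that sort before this company's key range.
        while i < len(employee_keys) and employee_keys[i] < prefix:
            i += 1
        # Collect the contiguous run of rows belonging to this company.
        while i < len(employee_keys) and employee_keys[i].startswith(prefix):
            out.append((company_id, employee_keys[i]))
            i += 1
    return out


merged = reduce_side_join(table_a, table_b)
scanned = scan_join(["c1", "c2", "c4"], ["c1_e1", "c1_e2", "c2_e1", "c3_e9"])
```

<p>Note that <code>scan_join</code> never rewinds: each row of table B is read at most once, which is what makes the sorted row-key layout cheaper than two random reads per scanned row.</p>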
 
