Could data.table do joins on arbitrary functions (i.e. fuzzy/soft joins)?
The following is hypothetical. Let's say I have a string similarity function that produces a boolean result (`string_sim`), and another that determines whether the distance between two lat/lon coordinates is below a threshold (`geo_dist`).

I decide to merge on these conditions using a fuzzy join:

```r
merge(LHS, RHS,
      by = string_sim(LHS$string, RHS$string) &
           geo_dist(LHS$lat, LHS$lon, RHS$lat, RHS$lon))
```

Under the hood, data.table would need to compare the Cartesian product: every row against every other row, asymmetrically. That could easily be a gigantic number, in the trillions with medium-sized datasets. Therefore, it might also need a divide-and-conquer strategy when sending data to the comparison functions, using multiple processes and giving each under 2 billion cells to avoid integer limits. Essentially, it would need to map sections of the vectors into pieces and then send them to the function (this is starting to sound like map-reduce).

Let's say the user wises up and wants to save time by iteratively applying join conditions, starting with the least costly to run, such as an equality condition:

```r
merge(LHS, RHS,
      by = list(LHS$first_initial == RHS$first_initial,
                string_sim(LHS$string, RHS$string) &
                geo_dist(LHS$lat, LHS$lon, RHS$lat, RHS$lon)))
```

I would like a feature like this. It's taken me some time, but I've hacked together some code that does this using data.table, and the package could come with something like this in the future.

Edit: Let me express this in a more data.table-native way. First define the variables to match on equality:

```r
setkey(LHS, first_initial)
setkey(RHS, first_initial)
```

Then do a binary merge followed by a vector scan:

```r
LHS[RHS][string_sim(string, string.1) & geo_dist(lat, lon, lat.1, lon.1)]
```

Alternatively, the most expensive operation can be done first. I think the following would be faster:

```r
LHS[RHS][geo_dist(lat, lon, lat.1, lon.1)][string_sim(string, string.1)]
```

However, when the LHS is 20 million rows and the RHS is 20 million rows, this can overload the system. To prevent this, I have needed to split LHS and RHS into pieces, using a divide-and-conquer approach. If there were some more efficient way to parallelize the process on the backend, I don't think I'd have to do all of that.
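Before the code, a minimal sketch of what the two predicates might look like; the question leaves them abstract, so the normalized-Levenshtein test (via base R's `adist`), the haversine formula, and both cutoffs (`max_norm_dist`, `max_km`) are illustrative assumptions, not part of any existing API:

```r
# Hypothetical string-similarity predicate: TRUE when the element-wise
# Levenshtein distance, normalized by the longer string, is small enough.
string_sim <- function(a, b, max_norm_dist = 0.2) {
  d <- mapply(adist, a, b)
  d / pmax(nchar(a), nchar(b)) <= max_norm_dist
}

# Hypothetical geo-distance predicate: TRUE when two lat/lon points are
# within max_km kilometres by the haversine great-circle formula.
geo_dist <- function(lat1, lon1, lat2, lon2, max_km = 10) {
  rad <- pi / 180
  a <- sin((lat2 - lat1) * rad / 2)^2 +
       cos(lat1 * rad) * cos(lat2 * rad) * sin((lon2 - lon1) * rad / 2)^2
  6371 * 2 * asin(pmin(1, sqrt(a))) <= max_km
}
```

Both return a logical vector, so they can sit directly inside a data.table vector scan like the ones above.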
Here is the code:

```r
joiner <- function(x, y,
                   reduce = quote(x[map$x][y[map$y], nomatch = 0]),
                   chunks = 100000,
                   mc.cores = getOption("cores")) {
  require("multicore")
  require("data.table")

  # Build one (x-rows, y-rows) index pair per chunk combination, covering
  # the full Cartesian product of x-chunks and y-chunks.
  map_function <- function(x_N, y_N, chunks) {
    x_partitions <- ceiling(x_N / chunks)
    x_parts <- (0:x_partitions) * chunks
    x_parts[length(x_parts)] <- x_N
    y_partitions <- ceiling(y_N / chunks)
    y_parts <- (0:y_partitions) * chunks
    y_parts[length(y_parts)] <- y_N
    MAP <- vector("list", x_partitions * y_partitions)
    index <- 0
    for (i in 1:x_partitions) {
      for (j in 1:y_partitions) {
        index <- index + 1
        MAP[[index]] <- list(x = (x_parts[i] + 1):x_parts[i + 1],
                             y = (y_parts[j] + 1):y_parts[j + 1])
      }
    }
    return(MAP)
  }

  # Default to a self-join when y is not supplied.
  if (missing(y)) {
    y <- x
  }

  # Evaluate the user-supplied reduce expression on one chunk pair.
  reducer_function <- function(map, reduce, x, y) {
    eval(reduce)
  }

  # Map: one task per chunk pair, run in parallel. Reduce: stack the pieces.
  collector <- mclapply(map_function(nrow(x), nrow(y), chunks = chunks),
                        reducer_function, reduce, x, y, mc.cores = mc.cores)
  collector <- rbindlist(collector)
  return(collector)
}

D <- data.table(value = letters, row_id = sample(1:100, 26))
D <- rbind(D, D)
setkey(D, value)
joiner(D); joiner(D, chunks = 10); D[D]  # they are all the same, except the key is gone
```
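Because `reduce` is just a quoted expression evaluated once per chunk pair, the fuzzy conditions could in principle be folded straight into it. A minimal sketch, assuming the hypothetical `string_sim`/`geo_dist` predicates above and the `string.1`/`lat.1`/`lon.1` column suffixes from the earlier examples:

```r
# Hypothetical: binary merge per chunk pair, then a vector scan of each
# piece with the fuzzy predicates, before the pieces are stacked.
fuzzy_reduce <- quote(
  x[map$x][y[map$y], nomatch = 0][
    string_sim(string, string.1) & geo_dist(lat, lon, lat.1, lon.1)]
)

joiner(LHS, RHS, reduce = fuzzy_reduce, chunks = 100000)
```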