StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POClosure table Root Nodes Query performance with 10s of millions of nodes
text
Body
copied!<p>I currently have a closure table used for hierarchical data that has 5 million nodes which results in ~75 million rows in the closure table. Using SqLite my query time is rising exponentially due to the size of the closure table. </p> <pre><code>CREATE TABLE `Closure` (`Ancestor` INTEGER NOT NULL ,`Descendant` INTEGER NOT NULL ,`Depth` INTEGER, PRIMARY KEY (`Ancestor`,`Descendant`) ) CREATE INDEX `Closure_AncestorDescendant` ON `Closure` (`Ancestor` ASC, `Descendant` ASC); CREATE INDEX `Closure_DescendantAncestor` ON `Closure` (`Descendant` ASC, `Ancestor` ASC); CREATE TABLE `Nodes` (`Node` INTEGER PRIMARY KEY NOT NULL, `Root` BOOLEAN NOT NULL, `Descendants` INTEGER NOT NULL); </code></pre> <p>My query to find the nodes that are roots takes about 20 minutes with this many nodes even though only about 5 or 6 nodes meet the query.</p> <pre><code>SELECT `Closure`.`Ancestor` FROM `Closure` LEFT OUTER JOIN `Closure` AS `Anc` ON `Anc`.`Descendant` = `Closure`.`Descendant` AND `Anc`.`Ancestor` <> `Closure`.`Ancestor` WHERE `Anc`.`Ancestor` IS NULL; </code></pre> <p>20 minutes is to long so right now I'm storing a bool for if the node is a root and modifying the <code>Nodes</code>.<code>Root</code> column when the node is moved.. I'm not exactly happy with the duplicate data but my query times are now in the single digit milliseconds for every query.</p> <p>I also have a lot of queries that require knowledge of how many descendants a given node has (mostly if Descendants > 1 to know if this object can be virtualized/expanded in a tree view). I used to query this every time I needed it but across a gigantic database like I have even with indexes the queries seemed to take to long (more than 1 second) so I also reduced them to the <code>Nodes</code>.<code>Descendants</code> column which I also update every time a node is moved. Unfortunate this is another duplication of data I would like to avoid.</p> <p>The query I used to use was like below. If anyone can explain how to increase the performance of this (consider that I already have an index starting with Ancestor) I would appreciate it.</p> <pre><code>SELECT COUNT(*) FROM `Closure` WHERE `Ancestor`=@Node </code></pre>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload