Note that there are some explanatory texts on larger screens.

plurals
  1. POWhat is the most efficient way to store and query trees?
    primarykey
    data
    text
    <p>I need to analyze 1 TB+ of web access logs, and in particular I need to analyze statistics relating to requested URLs and subsets of the URLs (child branches). If possible, I want the queries to be fast over small subsets of the data (e.g. 10 million requests).</p> <p>For example, given an access log with the following URLs being requested:</p> <pre><code>/ocp/about_us.html /ocp/security/ed-209/patches/urgent.html /ocp/security/rc/ /ocp/food/ /weyland-yutani/products/ </code></pre> <p>I want to do queries such as:</p> <ul> <li>Count the number of requests for everything 'below' /ocp. </li> <li>Same as above, but only count requests for child nodes under /ocp/security</li> <li>Return the top 5 most frequently requested URLs.</li> <li>Same as above, except group by an arbitrary depth,</li> </ul> <p>e.g. For the last query above, depth 2 for the data would return:</p> <pre><code>2: /ocp/security/ 1: /ocp/ 1: /ocp/food/ 1: /weyland-yutani/products/ </code></pre> <p>I think the ideal approach would probably be to use a column DB and tokenize the URLs such that there is a column for each element in the URL. However, I would really like to find a way to do this with open source apps if possible. HBase is a possibility, but query performance seems too slow to be useful for real-time queries (also, I don't really want to be in the business of re-implementing SQL)</p> <p>I'm aware there are commercial apps for doing this this type of analytics, but for various reasons I want to implement this myself.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload