
Vertica and joins
    <p>I'm adapting a web analysis tool to use <code>Vertica</code> as the DB, and I'm having real problems <code>optimizing joins</code>. I tried creating pre-join projections for some of my queries, and while they did make the queries blazing fast, they slowed data loading into the fact table to a crawl.</p> <p>A simple <code>INSERT INTO ... SELECT * FROM</code>, which we use to load data into the fact table from a staging table, went from taking ~5 seconds to taking 20+ minutes.</p> <p>Because of this, I dropped all pre-join projections and tried using the Database Designer to design query-specific projections, but that's not enough. Even with those projections a simple join takes ~14 seconds, whereas the same join takes ~1 second with a pre-join projection.</p> <p>My question is this: is it normal for a pre-join projection to slow data insertion this much, and if not, what could be the culprit? If it is normal, then it's a showstopper for us, so are there other techniques we could use to speed up the joins?</p> <p>We're running Vertica on a 5-node cluster, each node having 2 quad-core CPUs and 32 GB of memory. 
The tables in my example query have 188,843,085 and 25,712,878 rows respectively.</p> <p>The EXPLAIN output looks like this:</p> <pre><code>EXPLAIN SELECT referer_via_.url as referralPageUrl, COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_ ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd BETWEEN '20121123' AND '20121123' AND session.site_id = '49'
GROUP BY referer_via_.url
ORDER BY visits DESC
LIMIT 250;

Access Path:
+-SELECT LIMIT 250 [Cost: 1M, Rows: 250 (STALE STATISTICS)] (PATH ID: 0)
|  Output Only: 250 tuples
|  Execute on: Query Initiator
| +---&gt; SORT [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| |      Order: count(DISTINCT "session".id) DESC
| |      Output Only: 250 tuples
| |      Execute on: All Nodes
| | +---&gt; GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | |      Aggregates: count(DISTINCT "session".id)
| | |      Group By: referer_via_.url
| | |      Execute on: All Nodes
| | | +---&gt; GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | |      Group By: referer_via_.url, "session".id
| | | |      Execute on: All Nodes
| | | | +---&gt; JOIN HASH [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 4) Outer (RESEGMENT)
| | | | |      Join Cond: ("session".referer_id = referer_via_.id)
| | | | |      Execute on: All Nodes
| | | | | +-- Outer -&gt; STORAGE ACCESS for session [Cost: 463, Rows: 1 (STALE STATISTICS)] (PUSHED GROUPING) (PATH ID: 5)
| | | | | |      Projection: public.owa_session_projection
| | | | | |      Materialize: "session".id, "session".referer_id
| | | | | |      Filter: ("session".site_id = '49')
| | | | | |      Filter: (("session".yyyymmdd &gt;= 20121123) AND ("session".yyyymmdd &lt;= 20121123))
| | | | | |      Execute on: All Nodes
| | | | | +-- Inner -&gt; STORAGE ACCESS for referer_via_ [Cost: 293K, Rows: 26M] (PATH ID: 6)
| | | | | |      Projection: public.owa_referer_DBD_1_seg_Potency_20121122_Potency_20121122
| | | | | |      Materialize: referer_via_.id, referer_via_.url
| | | | | |      Execute on: All Nodes
</code></pre>
 
