PostgreSQL Query Optimization and the Postmaster Process
    <p>I am currently working with a large PostgreSQL database derived from a Wikipedia dump; it contains about 40 GB of data. The database is running on an HP ProLiant ML370 G5 server with SUSE Linux Enterprise Server 10; I am querying it from my laptop over a private network managed by a simple D-Link router. I assigned static DHCP (private) IPs to both laptop and server.</p>
    <p>From my laptop, using pgAdmin III, I send off SQL commands and queries such as CREATE INDEX, DROP INDEX, DELETE and SELECT. Sometimes I send a command (like CREATE INDEX) and it returns, telling me that the query executed successfully. However, the postmaster process assigned to that command seems to remain sleeping on the server. I do not really mind this in itself, since I assume that PostgreSQL maintains a pool of postmasters ready to process queries. But if such a process holds 6 GB of its 9.4 GB of assigned RAM, I worry (and that is what it is doing at the moment). Maybe this is a cache of data kept in [shared] memory in case another query needs the same data, but I do not know.</p>
    <p>Another thing is bothering me.</p>
    <p>I have 2 tables. One is the <em>page</em> table; I have an index on its <em>page_id</em> column. The other is the <em>pagelinks</em> table, which has a <em>pl_from</em> column that references either nothing or a value in the <em>page.page_id</em> column; unlike the <em>page_id</em> column, <em>pl_from</em> has no index (yet). To give you an idea of the scale of the tables and why I need a viable solution, the <em>page</em> table has 13.4 million rows (after I deleted those I do not need) while the <em>pagelinks</em> table has 293 million.</p>
    <p>I need to execute the following command to clean the <em>pagelinks</em> table of some of its useless rows:</p>
    <pre><code>DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);
</code></pre>
    <p>So basically, I wish to rid the <em>pagelinks</em> table of all links coming from a page not in the <em>page</em> table. Even after disabling nested loops and/or sequential scans, the query optimizer always gives me the following "solution":</p>
    <pre><code>Nested Loop  (cost=494640.60..112115531252189.59 rows=3953377028232000 width=6)
  Join Filter: ("outer".pl_from &lt;&gt; "inner".page_id)
  -&gt;  Seq Scan on pagelinks  (cost=0.00..5889791.00 rows=293392800 width=17)
  -&gt;  Materialize  (cost=494640.60..708341.51 rows=13474691 width=11)
        -&gt;  Seq Scan on page  (cost=0.00..402211.91 rows=13474691 width=11)
</code></pre>
    <p>It seems that such a task would take weeks or longer to complete; obviously, this is unacceptable. It seems to me that I would much rather have it use the <em>page_id</em> index to do its thing, but it is a stubborn optimizer and I might be wrong.</p>
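    <p>For reference, the intent described above (delete every <em>pagelinks</em> row whose <em>pl_from</em> has no matching <em>page.page_id</em>) could also be written as an anti-join. This is only a sketch, assuming the table and column names given above; it has not been run against this database, and how well the planner handles it depends on the PostgreSQL version:</p>
    <pre><code>-- Sketch only: delete links whose source page is absent from the page table.
-- Rows with a NULL pl_from match no page row, so this would delete them too;
-- prepend "pl_from IS NOT NULL AND" to the condition if they should be kept.
DELETE FROM pagelinks
WHERE NOT EXISTS (
    SELECT 1
    FROM page
    WHERE page.page_id = pagelinks.pl_from
);
</code></pre>
    <p>Unlike the USING form, which pairs every <em>pagelinks</em> row with every <em>page</em> row and keeps the pairs whose ids differ, this version asks once per <em>pagelinks</em> row whether a matching <em>page_id</em> exists, which is a lookup the <em>page_id</em> index could in principle serve.</p>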