Note that there are some explanatory texts on larger screens.

plurals
  1. POPostgreSQL: How to optimize my database for storing and querying a huge graph
    primarykey
    data
    text
    <p>I'm running PostgreSQL 8.3 on a 1.83 GHz Intel Core Duo Mac Mini with 1GB of RAM and Mac OS X 10.5.8. I have a stored a huge graph in my PostgreSQL database. It consists of 1.6 million nodes and 30 million edges. My database schema is like:</p> <pre><code>CREATE TABLE nodes (id INTEGER PRIMARY KEY,title VARCHAR(256)); CREATE TABLE edges (id INTEGER,link INTEGER,PRIMARY KEY (id,link)); CREATE INDEX id_idx ON edges (id); CREATE INDEX link_idx ON edges (link); </code></pre> <p>The data in the table edges looks like</p> <pre><code>id link 1 234 1 88865 1 6 2 365 2 12 ... </code></pre> <p>So it stores for each node with id x the outgoing link to id y. </p> <p>The time for searching all the outgoing links is ok:</p> <pre><code>=# explain analyze select link from edges where id=4620; QUERY PLAN --------------------------------------------------------------------------------- Index Scan using id_idx on edges (cost=0.00..101.61 rows=3067 width=4) (actual time=135.507..157.982 rows=1052 loops=1) Index Cond: (id = 4620) Total runtime: 158.348 ms (3 rows) </code></pre> <p>However, if I search for the incoming links to a node, the database is more than 100 times slower (although the resulting number of incoming links is only 5-10 times higher than the number of outgoing links):</p> <pre><code>=# explain analyze select id from edges where link=4620; QUERY PLAN ---------------------------------------------------------------------------------- Bitmap Heap Scan on edges (cost=846.31..100697.48 rows=51016 width=4) (actual time=322.584..48983.478 rows=26887 loops=1) Recheck Cond: (link = 4620) -&gt; Bitmap Index Scan on link_idx (cost=0.00..833.56 rows=51016 width=0) (actual time=298.132..298.132 rows=26887 loops=1) Index Cond: (link = 4620) Total runtime: 49001.936 ms (5 rows) </code></pre> <p>I tried to force Postgres not to use a Bitmap Scan via</p> <pre><code>=# set enable_bitmapscan = false; </code></pre> <p>but the speed of the query for incoming links didn't improve:</p> <pre><code>=# explain analyze select id from edges where link=1588; QUERY PLAN ------------------------------------------------------------------------------------------- Index Scan using link_idx on edges (cost=0.00..4467.63 rows=1143 width=4) (actual time=110.302..51275.822 rows=43629 loops=1) Index Cond: (link = 1588) Total runtime: 51300.041 ms (3 rows) </code></pre> <p>I also increased my shared buffers from 24MB to 512MB, but it didn't help. So I wonder why my queries for outgoing and incoming links show such an asymmetric behaviour? Is something wrong with my choice of indexes? Or should I better create a third table holding all the incoming links for a node with id x? But that would be quite a waste of disk space. But since I'm new into SQL databases maybe I'm missing something basic here? </p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload