Note that there are some explanatory texts on larger screens.

plurals
  1. POGood graph traversal algorithm
    text
    copied!<p>Abstract problem : I have a graph of about 250,000 nodes and the average connectivity is around 10. Finding a node's connections is a long process (10 seconds lets say). Saving a node to the database also takes about 10 seconds. I can check if a node is already present in the db very quickly. Allowing concurrency, but not having more than 10 long requests at a time, how would you traverse the graph to gain the highest coverage the quickest.</p> <p>Concrete problem : I'm trying to scrape a website user pages. To discover new users I'm fetching the friend list from already known users. I've already imported about 10% of the graph but I keep getting stuck in cycles or using too much memory remembering too many nodes.</p> <p>My current implementation : </p> <pre><code>def run() : import_pool = ThreadPool(10) user_pool = ThreadPool(1) do_user("arcaneCoder", import_pool, user_pool) def do_user(user, import_pool, user_pool) : id = user alias = models.Alias.get(id) # if its been updates in the last 7 days if alias and alias.modified + datetime.timedelta(days=7) &gt; datetime.datetime.now() : sys.stderr.write("Skipping: %s\n" % user) else : sys.stderr.write("Importing: %s\n" % user) while import_pool.num_jobs() &gt; 20 : print "Too many queued jobs, sleeping" time.sleep(15) import_pool.add_job(alias_view.import_id, [id], lambda rv : sys.stderr.write("Done Importing %s\n" % user)) sys.stderr.write("Crawling: %s\n" % user) users = crawl(id, 5) if len(users) &gt;= 2 : for user in random.sample(users, 2) : if (user_pool.num_jobs() &lt; 100) : user_pool.add_job(do_user, [user, import_pool, user_pool]) def crawl(id, limit=50) : '''returns the first 'limit' friends of a user''' *not relevant* </code></pre> <p>Problems of current implementation :</p> <ul> <li>Gets stuck in cliques that I've already imported, thereby wasting time and the importing threads are idle.</li> <li>Will add more as they get pointed out.</li> </ul> <p>So, marginal improvments are welcome, as well as full rewrites. Thanks!</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload