A Python point: `adict.has_key(k)` is obsolete in Python 2.x and gone in Python 3.x. The expression `k in adict`, available since Python 2.2, does the same job; use it instead. It will also be faster (no method call).

An any-language practical point: iterate over the shorter dictionary.

Combined result:

```python
if len(doca_dic) < len(docb_dic):
    short_dict, long_dict = doca_dic, docb_dic
else:
    short_dict, long_dict = docb_dic, doca_dic
similarity = 0
for x in short_dict:
    if x in long_dict:
        # calculate the similarity by summing the products of the tf-idf norms
        similarity += short_dict[x] * long_dict[x]
```

And if you don't need the two dictionaries for anything else, you could create only the A one and iterate over the B `(key, value)` tuples as they pop out of your B query. After the `docb = cursor2.fetchall()`, replace all following code with this:

```python
similarity = 0
for b_token, b_value in docb:
    if b_token in doca_dic:
        similarity += doca_dic[b_token] * b_value
```

Alternative to the above code: this does more work overall, but it does more of the iterating in C instead of Python and may be faster.

```python
similarity = sum(
    doca_dic[k] * docb_dic[k] for k in set(doca_dic) & set(docb_dic)
)
```

**Final version of the Python code**

```python
# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s",
                (docid[i][0],))
doca = cursor1.fetchall()

# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s",
                (docid[j][0],))
docb = cursor2.fetchall()

if len(doca) < len(docb):
    short_doc, long_doc = doca, docb
else:
    short_doc, long_doc = docb, doca

long_dict = dict(long_doc)  # yes, it should be that simple
similarity = 0
for key, value in short_doc:
    if key in long_dict:
        similarity += long_dict[key] * value
```

(Note the trailing comma in `(docid[i][0],)`: `execute` expects a sequence of parameters, and `(x)` without the comma is just `x` in parentheses, not a tuple.)

Another practical point: you haven't said which part of it is slow: working on the dicts, or doing the SELECTs? Put some calls to `time.time()` into your script.

Consider pushing ALL the work onto the database. The following example uses a hardwired SQLite session, but the principle is the same.

```
C:\junk\so>sqlite3
SQLite version 3.6.14
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table atable(docid text, token text, score float, primary key (docid, token));
sqlite> insert into atable values('a', 'apple', 12.2);
sqlite> insert into atable values('a', 'word', 29.67);
sqlite> insert into atable values('a', 'zulu', 78.56);
sqlite> insert into atable values('b', 'apple', 11.0);
sqlite> insert into atable values('b', 'word', 33.21);
sqlite> insert into atable values('b', 'zealot', 11.56);
sqlite> select sum(A.score * B.score) from atable A, atable B where A.token = B.token and A.docid = 'a' and B.docid = 'b';
1119.5407
sqlite>
```

And it's worth checking that the database table is appropriately indexed (e.g. one on `token` by itself); not having a usable index is a good way of making an SQL query run very slowly.

Explanation: having an index on `token` may make your existing queries, the "do all the work in the DB" query, or both run faster, depending on the whims of the query optimiser in your DB software and the phase of the moon. If you don't have a usable index, the DB will read ALL the rows in your table, which is not good.

Creating an index: `create index atable_token_idx on atable(token);`

Dropping an index: `drop index atable_token_idx;`

(but do consult the docs for *your* DB)
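To make the loop-vs-set-intersection equivalence above concrete, here is a minimal self-contained check with toy dictionaries (the token scores are hypothetical values chosen only for illustration; they are not from the asker's data):

```python
# Toy tf-idf dictionaries (hypothetical values, for illustration only).
doca_dic = {"apple": 12.2, "word": 29.67, "zulu": 78.56}
docb_dic = {"apple": 11.0, "word": 33.21, "zealot": 11.56}

# Loop version: iterate over the shorter dictionary.
if len(doca_dic) < len(docb_dic):
    short_dict, long_dict = doca_dic, docb_dic
else:
    short_dict, long_dict = docb_dic, doca_dic
similarity = 0
for x in short_dict:
    if x in long_dict:
        similarity += short_dict[x] * long_dict[x]

# Set-intersection version: more of the iterating happens in C.
similarity2 = sum(
    doca_dic[k] * docb_dic[k] for k in set(doca_dic) & set(docb_dic)
)

# Only "apple" and "word" are shared, so both come out at
# approximately 12.2*11.0 + 29.67*33.21 = 1119.5407.
print(similarity, similarity2)
```

Both versions only ever touch keys present in both dictionaries, so they must agree (up to floating-point rounding order).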
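On the "which part is slow" point, a minimal sketch of bracketing each phase with `time.time()`; the phase bodies here are stand-in literals (hypothetical data) where the real SELECTs and dict work would go:

```python
import time

t0 = time.time()
# Phase 1: in the real script, the execute()/fetchall() calls go here.
doca = [("apple", 12.2), ("word", 29.67)]   # stand-in for cursor1.fetchall()
docb = [("apple", 11.0), ("word", 33.21)]   # stand-in for cursor2.fetchall()
t1 = time.time()

# Phase 2: the dictionary work.
doca_dic = dict(doca)
similarity = sum(v * doca_dic[k] for k, v in docb if k in doca_dic)
t2 = time.time()

print("select phase: %.6f s" % (t1 - t0))
print("dict phase:   %.6f s" % (t2 - t1))
```

Whichever phase dominates tells you where to spend your optimisation effort.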
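The interactive SQLite session above can be reproduced from Python with the standard-library `sqlite3` module (in-memory database, same toy rows), which also shows creating the suggested standalone index on `token`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table atable(docid text, token text, score float, "
            "primary key (docid, token))")
rows = [('a', 'apple', 12.2), ('a', 'word', 29.67), ('a', 'zulu', 78.56),
        ('b', 'apple', 11.0), ('b', 'word', 33.21), ('b', 'zealot', 11.56)]
cur.executemany("insert into atable values (?, ?, ?)", rows)

# The suggested standalone index on token.
cur.execute("create index atable_token_idx on atable(token)")

# The whole similarity computation as a single self-join query.
cur.execute("select sum(A.score * B.score) from atable A, atable B "
            "where A.token = B.token and A.docid = ? and B.docid = ?",
            ('a', 'b'))
(similarity,) = cur.fetchone()
print(similarity)  # approximately 1119.5407, matching the session above
conn.close()
```

This keeps all the row-matching and multiplication inside the database engine; Python only sees the final number.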