StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POHow to optimize this long running sqlite3 query for finding duplicates?
primarykey
Id
14433728
data
AcceptedAnswerId
14434145
AnswerCount
2
ClosedDate
CommentCount
7
CommunityOwnedDate
CreationDate
2013-01-21T06:57:40.760
FavoriteCount
0
LastActivityDate
2013-01-21T17:54:23.257
LastEditDate
2013-01-21T07:35:03.290
LastEditorUserId
321583
OwnerUserId
321583
ParentId
0
PostTypeId
1
Score
0
ViewCount
250
LastEditorDisplayName
text
Body
I've got this rather insane query for finding all but the FIRST record with a duplicate value. It takes a substantially long time to run on 38000 records; about 50 seconds. <pre><code>UPDATE exr_exrresv SET mh_duplicate = 1 WHERE exr_exrresv._id IN ( SELECT F._id FROM exr_exrresv AS F WHERE Exists ( SELECT PHONE_NUMBER, Count(_id) FROM exr_exrresv WHERE exr_exrresv.PHONE_NUMBER = F.PHONE_NUMBER AND exr_exrresv.PHONE_NUMBER != '' AND mh_active = 1 AND mh_duplicate = 0 GROUP BY exr_exrresv.PHONE_NUMBER HAVING Count(exr_exrresv._id) > 1) ) AND exr_exrresv._id NOT IN ( SELECT Min(_id) FROM exr_exrresv AS F WHERE Exists ( SELECT PHONE_NUMBER, Count(_id) FROM exr_exrresv WHERE exr_exrresv.PHONE_NUMBER = F.PHONE_NUMBER AND exr_exrresv.PHONE_NUMBER != '' AND mh_active = 1 AND mh_duplicate = 0 GROUP BY exr_exrresv.PHONE_NUMBER HAVING Count(exr_exrresv._id) > 1 ) GROUP BY PHONE_NUMBER ); </code></pre> Any tips on how to optimize it or how I should begin to go about it? I've checked out the query plan but I'm really not sure how to begin improving it. Temp tables? Better query? Here is the explain query plan output: <pre><code>0|0|0|SEARCH TABLE exr_exrresv USING INTEGER PRIMARY KEY (rowid=?) (~12 rows) 0|0|0|EXECUTE LIST SUBQUERY 0 0|0|0|SCAN TABLE exr_exrresv AS F (~500000 rows) 0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1 1|0|0|SEARCH TABLE exr_exrresv USING AUTOMATIC COVERING INDEX (PHONE_NUMBER=? AND mh_active=? AND mh_duplicate=?) (~7 rows) 1|0|0|USE TEMP B-TREE FOR GROUP BY 0|0|0|EXECUTE LIST SUBQUERY 2 2|0|0|SCAN TABLE exr_exrresv AS F (~500000 rows) 2|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3 3|0|0|SEARCH TABLE exr_exrresv USING AUTOMATIC COVERING INDEX (PHONE_NUMBER=? AND mh_active=? AND mh_duplicate=?) (~7 rows) 3|0|0|USE TEMP B-TREE FOR GROUP BY 2|0|0|USE TEMP B-TREE FOR GROUP BY </code></pre> Any tips would be much appreciated. :) Also, I am using Ruby to make the sql query so if it makes more sense for the logic to leave SQL and be written in Ruby, that's possible. The schema is as follows, and you can use sqlfiddle here: <a href="http://sqlfiddle.com/#!2/2c07e" rel="nofollow">http://sqlfiddle.com/#!2/2c07e</a> <pre><code>_id INTEGER PRIMARY KEY OPPORTUNITY_ID varchar(50) CREATEDDATE varchar(50) FIRSTNAME varchar(50) LASTNAME varchar(50) MAILINGSTREET varchar(50) MAILINGCITY varchar(50) MAILINGSTATE varchar(50) MAILINGZIPPOSTALCODE varchar(50) EMAIL varchar(50) CONTACT_PHONE varchar(50) PHONE_NUMBER varchar(50) CallFromWeb varchar(50) OPPORTUNITY_ORIGIN varchar(50) PROJECTED_LTV varchar(50) MOVE_IN_DATE varchar(50) mh_processed_date varchar(50) mh_control INTEGER mh_active INTEGER mh_duplicate INTEGER </code></pre>
Tags
<sql><ruby><sqlite><sqlite3>
Title
How to optimize this long running sqlite3 query for finding duplicates?
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USJeremy Baker
UserOwnerUserId
1. USJeremy Baker
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. This table or related slice is empty.
CommentsPostId

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.