Note that there are some explanatory texts on larger screens.

plurals
  1. PODELETE SQL with correlated subquery for table with 42 million rows?
    text
    copied!<p>I have a table <code>cats</code> with 42,795,120 rows.</p> <p>Apparently this is a lot of rows. So when I do:</p> <pre><code>/* owner_cats is a many-to-many join table */ DELETE FROM cats WHERE cats.id_cat IN ( SELECT owner_cats.id_cat FROM owner_cats WHERE owner_cats.id_owner = 1) </code></pre> <p>the query times out :(</p> <p><strong><em>(edit: I need to increase my</em> <code>CommandTimeout</code> <em>value, default is only 30 seconds)</em></strong></p> <p>I can't use <code>TRUNCATE TABLE cats</code> because I don't want to blow away cats from other owners.</p> <p>I'm using SQL Server 2005 with "Recovery model" set to "Simple."</p> <p>So, I thought about doing something like this (executing this SQL from an application btw):</p> <pre><code>DELETE TOP (25) PERCENT FROM cats WHERE cats.id_cat IN ( SELECT owner_cats.id_cat FROM owner_cats WHERE owner_cats.id_owner = 1) DELETE TOP(50) PERCENT FROM cats WHERE cats.id_cat IN ( SELECT owner_cats.id_cat FROM owner_cats WHERE owner_cats.id_owner = 1) DELETE FROM cats WHERE cats.id_cat IN ( SELECT owner_cats.id_cat FROM owner_cats WHERE owner_cats.id_owner = 1) </code></pre> <p><strong>My question is: what is the threshold of the number of rows I can <code>DELETE</code> in SQL Server 2005?</strong></p> <p>Or, if my approach is not optimal, please suggest a better approach. Thanks.</p> <p><em>This post didn't help me enough:</em></p> <ul> <li><a href="https://stackoverflow.com/questions/2519639/sql-server-efficiently-dropping-a-group-of-rows-with-millions-and-millions-of-row">SQL Server Efficiently dropping a group of rows with millions and millions of rows</a></li> </ul> <p><strong>EDIT (8/6/2010):</strong></p> <p>Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even <code>id_cat</code> is not a PK, because in my real life schema, it's not a unique field.</p> <p>I will put indexes on:</p> <ol> <li><code>cats.id_cat</code></li> <li><code>owner_cats.id_cat</code></li> <li><code>owner_cats.id_owner</code></li> </ol> <p>I guess I'm still getting the hang of this data warehousing, and obviously I need indexes on all the <code>JOIN</code> fields right?</p> <p>However, it takes hours for me to do this batch load process. I'm already doing it as a <code>SqlBulkCopy</code> (in chunks, not 42 mil all at once). I have some indexes and PKs. I read the following posts which confirms my theory that the indexes are slowing down even a bulk copy:</p> <ul> <li><a href="https://stackoverflow.com/questions/2692654/sqlbulkcopy-slow-as-molasses/2692787#2692787">SqlBulkCopy slow as molasses</a></li> <li><a href="https://stackoverflow.com/questions/24200?tab=votes#tab-top">What’s the fastest way to bulk insert a lot of data in SQL Server (C# client)</a></li> </ul> <p>So I'm going to <code>DROP</code> my indexes before the copy and then re <code>CREATE</code> them when it's done.</p> <p>Because of the long load times, it's going to take me awhile to test these suggestions. I'll report back with the results.</p> <p><strong>UPDATE (8/7/2010):</strong></p> <p>Tom suggested:</p> <pre><code>DELETE FROM cats c WHERE EXISTS (SELECT 1 FROM owner_cats o WHERE o.id_cat = c.id_cat AND o.id_owner = 1) </code></pre> <p>And still with no indexes, for 42 million rows, it took 13:21 min:sec versus 22:08 with the way described above. However, for 13 million rows, took him 2:13 versus 2:10 my old way. It's a neat idea, but I still need to use indexes!</p> <p><strong>Update (8/8/2010):</strong></p> <p>Something is terribly wrong! Now with the indexes on, my first delete query above took 1:9 hrs:min <strong><em>(yes an hour!)</em></strong> versus 22:08 min:sec and 13:21 min:sec versus 2:10 min:sec for 42 mil rows and 13 mil rows respectively. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.</p> <p><strong>Update (8/9/2010):</strong></p> <p>Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes versus 13:21 min:sec and 2:13 min:sec respectively. <strong><em>Deletes are taking longer on my database when I use indexes by an order of magnitude!</em></strong> I think I know why, my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! <strong>What am I doing wrong?</strong></p> <p><strong>Update (8/10/2010):</strong></p> <p>For lack of any other options, I have come up with what I feel is a lackluster solution <em>(hopefully temporary)</em>:</p> <ol> <li>Increase timeout for database connection to 1 hour (<code>CommandTimeout=60000;</code> default was 30 sec)</li> <li>Use Tom's query: <code>DELETE FROM WHERE EXISTS (SELECT 1 ...)</code> because it performed a little faster</li> <li><code>DROP</code> all indexes and PKs before running delete statement <em>(???)</em></li> <li>Run <code>DELETE</code> statement</li> <li><code>CREATE</code> all indexes and PKs</li> </ol> <p>Seems crazy, but at least it's faster than using <code>TRUNCATE</code> and starting over my load from the beginning with the first <code>owner_id</code>, because one of my <code>owner_id</code> takes 2:30 hrs:min to load versus 17:22 min:sec for the delete process I just described with 42 mil rows. (Note: if my load process throws an exception, I start over for that <code>owner_id</code>, but I don't want to blow away previous <code>owner_id</code>, so I don't want to <code>TRUNCATE</code> the <code>owner_cats</code> table, which is why I'm trying to use <code>DELETE</code>.)</p> <p><strong><em>Anymore help would still be appreciated :)</em></strong></p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload