  1. POPrevent duplicate rows in MySQL without unique index/constraint?
    text
<p>I'm writing an application that needs to deal with millions of URLs. It also needs to do retrieval by URL.</p> <p>My table currently looks like this:</p> <pre><code>CREATE TABLE Pages (
  id bigint(20) unsigned NOT NULL,
  url varchar(4096) COLLATE utf8_unicode_ci NOT NULL,
  url_crc int(11) NOT NULL,
  PRIMARY KEY (id),
  KEY url_crc (url_crc)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
</code></pre> <p>The idea behind this structure is to do lookups by a CRC32 hash of the URL, since a B-tree index would be very inefficient on URLs, which tend to have common prefixes (InnoDB doesn't support hash indexes). False positives from CRC32 collisions are filtered out by a comparison with the full URL. A sample retrieval query would look like this:</p> <pre><code>SELECT id
FROM Pages
WHERE url_crc = 2842100667
  AND url = 'example.com/page.html';
</code></pre> <p>The problem I have is preventing duplicate entries from being inserted. The application will always check the database for an existing entry before inserting a new one, but it's likely in my application that multiple queries for the same new URL will be made concurrently, so duplicate CRC32s and URLs will be entered.</p> <p>I don't want to create a unique index on url, as it would be gigantic. I also don't want to write-lock the table on every insert, since that would destroy concurrent insert performance. Is there an efficient way to solve this issue?</p> <p>Edit: To go into a bit more detail about the usage: it's a real-time table for looking up content in response to a URL. By looking up the URL, I can find the internal id for the URL, and then use that to find content for a page. New URLs are added to the system all the time, and I have no idea what those URLs will be beforehand. When new URLs are referenced, they will likely be slammed by simultaneous requests referencing the same URLs, perhaps hundreds per second, which is why I'm concerned about the race condition when adding new content. The results need to be immediate and there can't be read lag (sub-second lag is okay).</p> <p>To start, new URLs will be added at a rate of only a few thousand per day, but the system will need to handle many times that before we have time to move to a more scalable solution next year.</p> <p>One other issue with just using a unique index on url is that the length of the URLs can exceed the maximum key length of a unique index. Even if I drop the CRC32 trick, that doesn't solve the problem of preventing duplicate URLs.</p>
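For context, the client side of the lookup scheme described above can be sketched in Python. This is a minimal sketch under two assumptions: that the application computes the hash itself rather than calling MySQL's `CRC32()` in the query, and that `zlib.crc32` produces the same value as MySQL's `CRC32()` (both implement the standard CRC-32 polynomial, but it's worth verifying against your server). The `Pages` table and column names are taken from the schema above; the query string is illustrative, not a full database integration.

```python
import zlib


def url_crc(url: str) -> int:
    """CRC32 of a URL as an unsigned 32-bit int.

    Assumption: this matches MySQL's CRC32() for the same byte string,
    since both use the standard CRC-32 polynomial. The mask keeps the
    value in the unsigned range that fits the url_crc column.
    """
    return zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF


url = "example.com/page.html"
crc = url_crc(url)

# The small-integer index narrows the search to a handful of candidate
# rows; the full-URL comparison then filters out CRC32 collisions.
query = "SELECT id FROM Pages WHERE url_crc = %s AND url = %s"
params = (crc, url)
```

Passing `params` separately (rather than interpolating the URL into the SQL string) also keeps the lookup safe against SQL injection, which matters when the URLs come from external requests.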
 
