Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>Just to add some hard numbers to the question above, I've implemented a small program to generate ids along the lines mentioned in the question, and these were the results of one of the trial runs:</p> <pre><code>Length Count MD5 Base 62 4 400 3d0e 446 5 925 4fc04 1Myi 6 2434 0e9368 40V6 7 29155 e406d96 GBFiA 8 130615 7ba787c8 2GOiCm 9 403040 75525d163 YNKjL9 10 1302992 e1b3581f52 H47JAIs Total: 1869561 </code></pre> <p>Each time the length of the hash increased, this was due to a collision, so there were six collisions in this instance. The count is the number of keys of a given length that were generated before a collision. The MD5 column shows the last truncated key that was successfully added before there was a duplicate key error. The column on the far right shows the key in base 62 (if I've done the conversion correctly).</p> <p>It looks like more and more keys are being generated with each additional character, which is what you would imagine. I'm hopeful that this approach will scale for user-generated content.</p> <p>For anyone who's interested, here's how I got the base 62 representation for the last column:</p> <pre><code>def base_62_encode(input): "Inspired by code at http://en.wikipedia.org/wiki/Base_36." CLIST="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" rv = "" while input != 0: rv = CLIST[input % 62] + str(rv) input /= 62 return rv base62_id = base_62_encode(long(truncated_hash, 16)) </code></pre> <hr> <p>(Added later on:)</p> <p>Here is a class that takes care of several things related to the generation of these IDs in the context of Google App Engine. By default, keys that are generated by this class are not purely base 62, since the first character of a GAE key name must be alphabetic. That requirement has been addressed here by using base 52 for the first character.</p> <p>The primary base can be changed to something other than 62 by altering the "clist" value that has been passed into (or omitted from) the constructor. One might want to remove characters that are easy to get mixed up, for example, "1", "l", "i", etc.</p> <p>Usage:</p> <pre><code>keygen = LengthBackoffIdGenerator(SomeClass, initial_length=5) small_id, modified = keygen.small_id(seed_value_from_sharded_counter) </code></pre> <p>Here is the class:</p> <pre><code>class LengthBackoffIdGenerator(object): """Class to generate ids along the lines of tinyurl. By default, ids begin with a alphabetic character. Each id that is created is checked against previous ones to ensure uniqueness. When a duplicate is generated, the length of the seed value used is increased by one character. Ids become gradually longer over time while remaining relatively short, and there are very few retries in relation to the number of keys generated. """ ids = set() def __init__(self, cls, clist='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', alpha_first=False, initial_length=5, check_db=False): self.clist = clist self.alpha_first = alpha_first if self.alpha_first: import re alpha_list = re.sub(r'\d', '', clist) if len(alpha_list) &lt; 1: raise Exception, 'At least one non-numeric character is required.' self.alpha_list = alpha_list self.cls = cls self.length = initial_length self.check_db = check_db def divide_long_and_convert(self, radix, clist, input, rv): "Inspired by code at http://en.wikipedia.org/wiki/Base_36." rv += clist[input % radix] input /= radix return (input, rv) def convert_long(self, input): rv = "" if self.alpha_first: input, rv = self.divide_long_and_convert(len(self.alpha_list), self.alpha_list, input, rv) while input != 0: input, rv = self.divide_long_and_convert(len(self.clist), self.clist, input, rv) return rv def hash_truncate_and_binify(self, input, length): """Compute the MD5 hash of an input string, truncate it and convert it to a long. input - A seed value. length - The length to truncate the MD5 hash to. """ from assessment.core.functions import md5_hash input = md5_hash(input)[0:length] return long(input, 16) def _small_id(self, input): """Create an ID that looks similar to a tinyurl ID. The first letter of the ID will be non-numeric by default. """ return self.convert_long(self.hash_truncate_and_binify(input, self.length)) def small_id(self, seed): key_name = self._small_id(seed) increased_length = False if self.check_db and not self.cls.get_by_key_name(key_name) is None: self.ids.add(key_name) if key_name in self.ids: self.length += 1 increased_length = True key_name = self._small_id(seed) self.ids.add(key_name) return (key_name, increased_length) </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload