Note that there are some explanatory texts on larger screens.

plurals
  1. POFastest way to search a SQLite database thousands of times?
    text
    copied!<h2>Question First</h2> <p>How can I search through my SQLite database as quickly as possible?</p> <p>Should I parse the address data from all 60,000 rows in Excel, load those into a list, then just search for all of them at once?</p> <p>Switching from simply looking through a plain text file sped up my script by 3 times, but I still think it could go faster.</p> <p>Thank you in advance!</p> <hr> <h2>The Database</h2> <p>I have a SQLite database of city names, their postal code, coordinates, etc. that I created from Geonames' postal codes data dump: <a href="http://download.geonames.org/export/zip/" rel="nofollow">Geonames Postal Codes</a></p> <p>The database has a table for for each country (DE,US,GB,etc.. 72 in all), and each of those tables has between a a couple dozen to tens of thousands of lines each, in the following format:</p> <pre><code>country code : iso country code, 2 characters postal code : varchar(20) place name : varchar(180) admin name1 : 1. order subdivision (state) varchar(100) admin code1 : 1. order subdivision (state) varchar(20) admin name2 : 2. order subdivision (county/province) varchar(100) admin code2 : 2. order subdivision (county/province) varchar(20) admin name3 : 3. order subdivision (community) varchar(100) admin code3 : 3. order subdivision (community) varchar(20) latitude : estimated latitude (wgs84) longitude : estimated longitude (wgs84) accuracy : accuracy of lat/lng from 1=estimated to 6=centroid </code></pre> <h2>Workflow</h2> <p>Now my current script in Python goes as follows:</p> <ul> <li>Read row in Excel file</li> <li>Parse address and location data (a lot of other, unrelated stuff in between)</li> <li>Search for a match in SQLite database</li> <li>Write the information from matching row in SQLite db to a .CSV file</li> </ul> <p>The Excel file is about 60,000 rows and goes through my entire Python script(above process) for each row.</p> <p>My address data is very inconsistent, containing a mix of postal codes, city names, and country names. Sometimes all of this data is in the Excel row, sometimes not. And it also comes with many misspellings and alternate names.</p> <p>So because the data is so inconsistent, and because sometimes people put down postal codes and cities that don't match, I currently have my Python script try a bunch of different search queries, like:</p> <ul> <li>Check if [postal code] matches column exactly AND [place name] matches column exactly</li> <li>Check if [postal code] matches column exactly AND [place name] in column</li> <li>Check if [postal code] matches column exactly AND [place name](split up by word) in column</li> <li>Check if just[postal code] matches column</li> </ul> <h2>Python Script</h2> <p>Here is the section of the Python script. As you can see, it seems pretty inefficient:</p> <pre><code>if has_country_code == True: not_in_list = False country = country_code.lower()+"_" print "HAS COUNTRY" if has_zipcode == True and has_city_name == True: print "HAS COUNTRY2" success = False try: curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name = ? COLLATE NOCASE", (zipcode, city,)) for row in curs: success = True break except: not_in_list = True success = True if success != True: curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name LIKE ? COLLATE NOCASE", (zipcode,"%"+city+"%",)) for row in curs: success = True break if success != True: newCity = "" newCity = filter(None,re.split('[; / ( ) - ,]',city)) questionMarks = ",".join(["?" for w in newCity]) curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name IN ("+questionMarks+") COLLATE NOCASE", ([zipcode]+newCity)) for row in curs: success = True break if success != True: curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? COLLATE NOCASE", (zipcode,)) for row in curs: success = True break if success != True: curs = conn.execute("SELECT * FROM "+country+" WHERE place_name = ? COLLATE NOCASE", (city,)) for row in curs: success = True break if success != True: curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE ? COLLATE NOCASE", ("%"+city+"%",)) for row in curs: success = True break if success != True: newCity = "" newCity = filter(None,re.split('[; / ( ) - ,]',city)) questionMarks = ",".join(["?" for w in newCity]) curs = conn.execute("SELECT * FROM "+country+" WHERE place_name IN ("+questionMarks+") COLLATE NOCASE", (newCity)) for row in curs: success = True break if success != True: newCity = "" newCity = filter(None,re.split('[; / ( ) - ,]',city)) newCity.sort(key=len, reverse=True) newCity = (["%"+w+"%" for w in newCity]) for item in newCity: curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE (?) COLLATE NOCASE", (item,)) for row in curs: success = True break break if has_city_name == True and has_zipcode == False: try: curs = conn.execute("SELECT * FROM "+country+" WHERE place_name = ? COLLATE NOCASE", (city,)) for row in curs: success = True break except: not_in_list = True success = True if success != True: curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE ? COLLATE NOCASE", ("%"+city+"%",)) for row in curs: success = True break if success != True: newCity = "" newCity = filter(None,re.split('[; / ( ) - ,]',city)) questionMarks = ",".join(["?" for w in newCity]) curs = conn.execute("SELECT * FROM "+country+" WHERE place_name IN ("+questionMarks+") COLLATE NOCASE", (newCity)) for row in curs: success = True break if success != True: newCity = "" newCity = filter(None,re.split('[; / ( ) - ,]',city)) newCity.sort(key=len, reverse=True) newCity = (["%"+w+"%" for w in newCity]) for item in newCity: curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE (?) COLLATE NOCASE", (item,)) for row in curs: success = True break break if has_city_name == False and has_zipcode == True: try: curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ?", (zipcode,)) for row in curs: success = True break except: not_in_list = True success = True </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload