
# Searching for a string in a large text file - profiling various methods in Python
This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...

> - I have a **600 MB** file with **6 million** lines of strings (category paths from the DMOZ project).
> - The entry on each line is unique.
> - I want to **load** the file **once** and **keep searching** for matches in the data.

The methods I tried are listed below, together with the load time, the search time for a *negative match*, and the memory usage as shown in the task manager.

---

1) set:

```
(i)  data = set(f.read().splitlines())
(ii) result = search_str in data
```

> **Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB**

---

2) list:

```
(i)  data = f.read().splitlines()
(ii) result = search_str in data
```

> **Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB**

---

3) mmap:

```
(i)  data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
(ii) result = data.find(search_str)
```

> **Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA**

---

4) Hash lookup (using code from @alienhard below):

> **Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB**

---

5) File search (using code from @EOL below):

```
with open('input.txt') as f:
    print(search_str in f)  # search_str ends with '\n' or '\r\n', as in the file
```

> **Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA**

---

6) sqlite (with a primary index on url):

> **Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA**

---

For my use case, the set seems to be the best option as long as I have sufficient memory available. I was hoping to get some comments on these questions:

> 1. A **better alternative**, e.g. sqlite?
> 2. Ways to **improve the search time using mmap**. I have a 64-bit setup. [edit] e.g. bloom filters
> 3. As the file size grows to a couple of GB, is there any way to keep using `set`, e.g. by splitting it into batches?

[edit 1] P.S. I need to search frequently, add/remove values, and cannot use a hash table alone because I need to retrieve the modified values later.

Any comments/suggestions are welcome!

[edit 2] Updated with results from the methods suggested in answers. [edit 3] Updated with sqlite results.

**Solution**: Based on all the profiling and feedback, I think I'll go with sqlite, with method 4 as the second alternative. One downside of sqlite is that the database ends up more than double the size of the original csv file of urls, due to the primary index on url.
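
---

For reference, here is a minimal, self-contained sketch of how methods 1-3 above can be timed. The file name `input.txt` and the probe string are placeholders, not the asker's actual data, and absolute timings will of course differ from the numbers quoted above.

```python
import mmap
import time

FILENAME = 'input.txt'           # placeholder path
search_str = 'no/such/category'  # negative match: a string not present in the file

# Methods 1/2: load all lines once, then test membership.
# Swap the set(...) wrapper out for a plain list to get the list variant.
t0 = time.perf_counter()
with open(FILENAME) as f:
    data = set(f.read().splitlines())
load_time = time.perf_counter() - t0

t0 = time.perf_counter()
result = search_str in data
search_time = time.perf_counter() - t0
print(f'set: load {load_time:.2f}s, search {search_time:.6f}s, found={result}')

# Method 3: memory-map the file and scan it with find(). There is no upfront
# load cost, but every search is a linear scan over the raw bytes. Note that
# find() matches anywhere, including across line boundaries; prefixing the
# needle with b'\n' is one way to anchor it to line starts.
with open(FILENAME, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    t0 = time.perf_counter()
    pos = mm.find(search_str.encode())  # returns -1 when absent
    search_time = time.perf_counter() - t0
    print(f'mmap: search {search_time:.2f}s, found={pos != -1}')
    mm.close()
```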
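On question 2, one way a bloom filter could help mmap is as a cheap pre-check: a miss in the filter is definitive, so the slow linear scan only runs when the filter says "maybe present". The sketch below is a deliberately minimal filter built on `hashlib` from the standard library; the class, the bit sizes, and the fallback logic are all illustrative assumptions, not code from the question or its answers.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array of m bits."""
    def __init__(self, m_bits, k_hashes):
        assert k_hashes <= 8  # sha256 gives 32 bytes = eight 4-byte chunks
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by slicing a single sha256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], 'big') % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = BloomFilter(m_bits=60_000_000, k_hashes=5)  # sizes are illustrative
with open('input.txt') as f:
    for line in f:
        bloom.add(line.rstrip('\n'))

# A filter miss is definitive; only filter hits need the expensive scan.
if 'no/such/category' in bloom:
    pass  # fall back to mmap.find() or a sqlite lookup
else:
    print('definitely absent')
```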
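Since the chosen solution is sqlite (method 6), here is one hedged way it might look. The database file, table, and column names are assumptions for illustration; what matters is that `url TEXT PRIMARY KEY` gives sqlite a B-tree index, so each lookup is a point query rather than a scan, and it also explains the size overhead noted in the solution.

```python
import sqlite3

conn = sqlite3.connect('urls.db')  # assumed database file name
conn.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)')

# One-time load: INSERT OR IGNORE skips duplicates, and wrapping the bulk
# insert in a single transaction (the `with conn:` block) keeps it fast.
with open('input.txt') as f, conn:
    conn.executemany('INSERT OR IGNORE INTO urls VALUES (?)',
                     ((line.rstrip('\n'),) for line in f))

def exists(url):
    """Point lookup served by the primary-key index, not a table scan."""
    cur = conn.execute('SELECT 1 FROM urls WHERE url = ? LIMIT 1', (url,))
    return cur.fetchone() is not None

print(exists('Top/Computers/Programming'))  # hypothetical DMOZ-style path
```

Unlike the in-memory set, this also supports the add/remove requirement from [edit 1] via ordinary `INSERT` and `DELETE` statements, without reloading anything.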
 
