The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.

"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" and avoid the big traversal. (After all, the filesystem itself is already a data indexer.)

There are operating system features capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.

To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS- or filesystem-level tools.

Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that: you traverse the directory structure of your sample files *while you're building the same files* to create the index, and then later just read that file.

There are two lessons here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially asks, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly that is not an apples-to-apples comparison.

However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster *if you use an already existing index*. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.

Make no mistake: in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely the fastest way you can visit every file. If the question is "can I write Python code to do exactly what `os.walk` does but be an order of magnitude faster than `os.walk`", the answer is a resounding *no*. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file", then the answer is still *no*.

(*Edit in response to "I don't think you understand what I'm trying to do"*)

Let's be clear here: virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.

Let's look at this from another angle. File systems have been an essential component of modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering, and many of the most brilliant minds in the field work on it constantly.

You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task, performed constantly in computing. It's likely happening several times a second even on the computer you're working with right now.

*If it were as simple to speed this process up by an order of magnitude(!) by keeping a text-file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?*

That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.

But your question, as you asked it, has a clear answer: *No*. In the general case, you're not going to get an order of magnitude faster by reimplementing `os.walk`. You may be able to get a better amortized runtime by caching, but you're not going to beat it by an order of magnitude if you properly include the work to build the cache in your profiling.
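To make the "separate the setup from the test" point concrete, here is a minimal benchmarking sketch. The directory and index-file paths are hypothetical placeholders; the point is that once the index build is timed alongside the index read, the "fast" index approach includes a full traversal anyway:

```python
import os
import time

# Hypothetical paths; substitute your own test tree and index location.
ROOT = "/tmp/sample_tree"
INDEX_FILE = "/tmp/sample_tree.index"


def walk_files(root):
    """Visit every file by recursive traversal (what os.walk already does)."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def build_index(root, index_file):
    """Setup step: building the index still has to visit every file."""
    with open(index_file, "w") as fh:
        for path in walk_files(root):
            fh.write(path + "\n")


def read_index(index_file):
    """The 'fast' step: reading back a list that was prepared in advance."""
    with open(index_file) as fh:
        return [line.rstrip("\n") for line in fh]


if __name__ == "__main__":
    t0 = time.perf_counter()
    walked = walk_files(ROOT)
    t1 = time.perf_counter()

    build_index(ROOT, INDEX_FILE)   # setup cost: includes a full traversal
    t2 = time.perf_counter()

    indexed = read_index(INDEX_FILE)
    t3 = time.perf_counter()

    print(f"os.walk traversal:   {t1 - t0:.4f}s ({len(walked)} files)")
    print(f"index build (setup): {t2 - t1:.4f}s")
    print(f"index read only:     {t3 - t2:.4f}s ({len(indexed)} files)")
    print(f"index build + read:  {t3 - t1:.4f}s")
```

Reading the prebuilt index alone will look dramatically faster than `os.walk`, but the build-plus-read total will not, which is exactly the comparison the question's original benchmark skipped.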
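And to illustrate lesson (b), leveraging an index the OS already maintains, here is a small sketch that queries Spotlight through the `mdfind` command on macOS. The search root and name pattern are made-up examples, and the approach only applies where such an OS-level index exists:

```python
import subprocess
import sys


def spotlight_find(name_pattern, root):
    """Query the Spotlight index instead of walking the tree (macOS only).

    mdfind answers from an index the OS has already built "as you go",
    so no directory traversal happens at query time.
    """
    result = subprocess.run(
        ["mdfind", "-onlyin", root, "-name", name_pattern],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    if sys.platform != "darwin":
        sys.exit("This sketch relies on Spotlight and only runs on macOS.")
    # Hypothetical query: files whose name contains 'report' under ~/Documents.
    for path in spotlight_find("report", "/Users/me/Documents"):
        print(path)
```

On Windows, the equivalent would be querying the Windows Search index rather than shelling out to `mdfind`; either way, the speedup comes from an index the operating system paid to build incrementally, not from Python avoiding the visit-every-file cost.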