The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.

"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" and avoid the big traversal. (After all, the filesystem itself is already a data indexer.)

There are operating system features capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.

To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS- or filesystem-level tools.

Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that: you traverse the directory structure of your sample files *while you're building the same files* to create the index, and then later just read that file.

There are two lessons here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially asks, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly that is not an apples-to-apples comparison.

However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster *if you use an already existing index*. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.

Make no mistake: in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely the fastest way you can visit every file. If the question is "can I write Python code to do exactly what `os.walk` does but be an order of magnitude faster than `os.walk`", the answer is a resounding *no*. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file", then the answer is still *no*.

(*Edit in response to "I don't think you understand what I'm trying to do"*)

Let's be clear here: virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.

Let's look at this from another angle. File systems have been an essential component of modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering, and many of the most brilliant minds in the field work on it constantly.

You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task, performed constantly in computing. It's likely happening several times a second even on the computer you're working with right now.

*If it were as simple to speed this process up by an order of magnitude(!) by keeping a text-file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?*

That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.

But your question, as you asked it, has a clear answer: *No*. In the general case, you're not going to get an order of magnitude faster by reimplementing `os.walk`. You may be able to get a better amortized runtime by caching, but you're not going to beat it by an order of magnitude if you properly include the work to build the cache in your profiling.
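To make the "separate the setup from the test" point concrete, here is a minimal benchmarking sketch. The directory and index-file paths are hypothetical placeholders; the point is that once the index build is timed alongside the index read, the "fast" index approach includes a full traversal anyway:

```python
import os
import time

# Hypothetical paths; substitute your own test tree and index location.
ROOT = "/tmp/sample_tree"
INDEX_FILE = "/tmp/sample_tree.index"


def walk_files(root):
    """Visit every file by recursive traversal (what os.walk already does)."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def build_index(root, index_file):
    """Setup step: building the index still has to visit every file."""
    with open(index_file, "w") as fh:
        for path in walk_files(root):
            fh.write(path + "\n")


def read_index(index_file):
    """The 'fast' step: reading back a list that was prepared in advance."""
    with open(index_file) as fh:
        return [line.rstrip("\n") for line in fh]


if __name__ == "__main__":
    t0 = time.perf_counter()
    walked = walk_files(ROOT)
    t1 = time.perf_counter()

    build_index(ROOT, INDEX_FILE)   # setup cost: includes a full traversal
    t2 = time.perf_counter()

    indexed = read_index(INDEX_FILE)
    t3 = time.perf_counter()

    print(f"os.walk traversal:   {t1 - t0:.4f}s ({len(walked)} files)")
    print(f"index build (setup): {t2 - t1:.4f}s")
    print(f"index read only:     {t3 - t2:.4f}s ({len(indexed)} files)")
    print(f"index build + read:  {t3 - t1:.4f}s")
```

Reading the prebuilt index alone will look dramatically faster than `os.walk`, but the build-plus-read total will not, which is exactly the comparison the question's original benchmark skipped.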
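And to illustrate lesson (b), leveraging an index the OS already maintains, here is a small sketch that queries Spotlight through the `mdfind` command on macOS. The search root and name pattern are made-up examples, and the approach only applies where such an OS-level index exists:

```python
import subprocess
import sys


def spotlight_find(name_pattern, root):
    """Query the Spotlight index instead of walking the tree (macOS only).

    mdfind answers from an index the OS has already built "as you go",
    so no directory traversal happens at query time.
    """
    result = subprocess.run(
        ["mdfind", "-onlyin", root, "-name", name_pattern],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    if sys.platform != "darwin":
        sys.exit("This sketch relies on Spotlight and only runs on macOS.")
    # Hypothetical query: files whose name contains 'report' under ~/Documents.
    for path in spotlight_find("report", "/Users/me/Documents"):
        print(path)
```

On Windows, the equivalent would be querying the Windows Search index rather than shelling out to `mdfind`; either way, the speedup comes from an index the operating system paid to build incrementally, not from Python avoiding the visit-every-file cost.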