Traversing files on a distributed file system
<p>I have a filesystem with a few hundred million files (several petabytes), and I want to collect pretty much everything that <code>stat</code> would return and store it in some sort of database. Right now, we have an MPI program in which a central queue feeds directory names to worker nodes, which slam NFS (which can handle this without trying too hard) with <code>stat</code> calls. The worker nodes then write the results to Postgres.</p>
<p>Although this works, it's very slow. A single run takes over 24 hours on a modern 30-node cluster.</p>
<p>Does anyone have any ideas for splitting up the directory structure instead of having a centralized queue? (I'm under the impression that exact algorithms for this are NP-hard.) Also, I've been considering replacing Postgres with something like MongoDB's autosharding with several routers, since Postgres is currently a huge bottleneck.</p>
<p>I'm pretty much just looking for ideas in general on how this setup could be improved.</p>
<p>Unfortunately, using something like the 2.6 kernel audit subsystem is probably out of the question, since it would be extremely difficult (in a political way) to get that running on every machine that hits this filesystem.</p>
<p>If it matters, every machine (several thousand) using this filesystem is running Linux 2.6.x.</p>
<p>The actual primary purpose of this is to find files that are older than a certain date so that we can delete them. We also want to collect data in general on how the filesystem is being used.</p>
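To make the per-worker step concrete, here is a minimal Python sketch of what one worker might do for a single directory: stat every entry, hand back the subdirectories that still need visiting, and pick out regular files older than a cutoff. It is not the MPI program described above. The names FileRecord, scan_directory, owner_of and older_than are hypothetical, and the hash-based owner_of is only one assumed way to assign directories to workers without a central queue; it makes no attempt to balance a skewed tree.

```python
import os
import stat
import zlib
from typing import NamedTuple


class FileRecord(NamedTuple):
    """Subset of stat fields kept per entry (hypothetical schema)."""
    path: str
    size: int
    mode: int
    uid: int
    gid: int
    atime: float
    mtime: float
    ctime: float


def scan_directory(directory):
    """Stat every entry in one directory (non-recursive).

    Returns (records, subdirs): one record per entry, plus the
    subdirectories that some worker still has to visit.
    """
    records, subdirs = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                # follow_symlinks=False gives lstat semantics.
                st = entry.stat(follow_symlinks=False)
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            records.append(FileRecord(
                entry.path, st.st_size, st.st_mode, st.st_uid,
                st.st_gid, st.st_atime, st.st_mtime, st.st_ctime))
            if stat.S_ISDIR(st.st_mode):
                subdirs.append(entry.path)
    return records, subdirs


def owner_of(directory, n_workers):
    """Assumed static partitioning: map a directory to a worker rank
    with a stable hash, so no central queue is consulted."""
    return zlib.crc32(os.fsencode(directory)) % n_workers


def older_than(records, cutoff_epoch):
    """Regular files whose mtime is before cutoff_epoch (seconds since epoch)."""
    return [r for r in records
            if stat.S_ISREG(r.mode) and r.mtime < cutoff_epoch]
```

In a setup like this, each worker would buffer the records from many directories and write them to the database in large batches rather than one row per file, which is usually the first thing to try when the database is the bottleneck.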