Here are some thoughts on implementation and possible issues, based on the following assumptions: an average image size of 100 KB and a steady state of 50M (5 GB) images. This also assumes users will not be accessing the file store directly, but will go through software or a web site:

1. **Storage medium:** The image sizes you give amount to rather paltry read and write speeds; I would think most common hard drives wouldn't have an issue with this throughput. I would put them in a RAID 1 configuration for data security, however. Backups wouldn't appear to be too much of an issue, since it's only 5 GB of data.

2. **File storage:** To prevent issues with the maximum number of files in a directory, I would take a hash of the file (MD5 at minimum; it's the quickest, but also the most collision-prone. And before people chirp in to say MD5 is broken: this is for identification, not security. An attacker could pad images for a second-preimage attack and replace all the images with goatse, but we'll consider this unlikely) and convert that hash to a hexadecimal string. Then, when it comes time to stash the file in the file system, take the hex string in blocks of two characters and create a directory structure for the file based on that. E.g. if the file hashes to `abcdef`, the root directory would be `ab`, then under that a directory called `cd`, under which you would store the image with the name `abcdef`. The real name is kept somewhere else (discussed below; there's also a rough code sketch of this layout after the list).

    With this approach, if you start hitting file system limits (or performance issues) from too many files in a directory, you can just have the file storage part create another level of directories. You could also store with the metadata how many levels of directories the file was created with, so if you expand later, older files won't be looked for in the newer, deeper directories.

    Another benefit here: if you hit transfer speed issues, or file system issues in general, you can easily split off a set of files to other drives. Just change the software to keep the top-level directories on different drives, so if you want to split the store in half, 00-7F goes on one drive and 80-FF on another.

    Hashing also nets you single-instance storage, which can be nice. Since hashes of a normal population of files tend to be random, it should also give you an even distribution of files across all the directories.

3. **Metadata storage:** While 50M rows seems like a lot, most DBMSes are built to scoff at that number of records, given enough RAM, of course. The following is written with SQL Server in mind, but I'm sure most of it applies to others. Create a table with the hash of the file as the primary key, along with things like the size, format, and level of nesting. Then create another table with an artificial key (an `int IDENTITY` column would be fine for this), the original name of the file (`varchar(255)` or whatever), the hash as a foreign key back to the first table, and the date it was added, with an index on the file name column. Also add any other columns you need to figure out whether a file has expired. This lets you keep the original name when people put the same file in under different names (but the files are otherwise identical, since they hash the same). A sketch of these two tables also follows the list.

4. **Maintenance:** This should be a scheduled task. Let Windows worry about when your task runs; that's less for you to debug and get wrong (what if you do maintenance every night at 2:30 AM, and you're somewhere that observes summer/daylight saving time? 2:30 AM doesn't happen during the spring changeover). The task then runs a query against the database to establish which files have expired, based on the data stored per file name, so it knows when all references that point to a stored file have expired: any hashed file that is not referenced by at least one row in the file name table is no longer needed. The task then goes and deletes those files. (There's a sketch of this cleanup at the end of the answer.)
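To make points 2 and 3 a bit more concrete, here's a rough Python sketch of the hash-to-directory layout. It's purely illustrative: MD5 comes from the answer, but `STORE_ROOT`, the two-level default, and the function names are assumptions made up for the example.

```python
import hashlib
import os
import shutil

STORE_ROOT = r"D:\ImageStore"  # hypothetical root of the image store


def file_hash(path: str) -> str:
    """Hex MD5 of the file contents (identification only, not security)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def nested_path(hex_hash: str, levels: int = 2) -> str:
    """Break the hex hash into two-character directory segments.

    With levels=2, a file hashing to 'abcdef...' lands at
    <STORE_ROOT>/ab/cd/abcdef...  The level count is kept with the
    metadata so older, shallower files can still be found if the
    depth is increased later.
    """
    parts = [hex_hash[i * 2:i * 2 + 2] for i in range(levels)]
    return os.path.join(STORE_ROOT, *parts, hex_hash)


def store_image(src: str, levels: int = 2) -> str:
    """Copy an image into the hashed layout and return its hash key."""
    digest = file_hash(src)
    dest = nested_path(digest, levels)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):  # hashing gives single-instance storage
        shutil.copyfile(src, dest)
    return digest
```

And a sketch of the two metadata tables, issued from Python via pyodbc so everything stays in one language. The table and column names (including the `ExpiresOn` expiry column) are just placeholders for the "hash table plus file name table" idea in point 3, and the connection string is likewise a stand-in.

```python
import pyodbc

# Connection details are placeholders; point this at your own SQL Server.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=ImageStore;Trusted_Connection=yes;"
)
cur = conn.cursor()

# One row per unique file, keyed by its hash (point 3, first table).
cur.execute("""
    CREATE TABLE StoredFile (
        FileHash      char(32)    NOT NULL PRIMARY KEY,  -- hex MD5
        SizeBytes     int         NOT NULL,
        FileFormat    varchar(16) NOT NULL,
        NestingLevels tinyint     NOT NULL               -- directory depth used
    )
""")

# One row per name a file was submitted under (point 3, second table).
cur.execute("""
    CREATE TABLE FileName (
        FileNameId   int IDENTITY(1,1) PRIMARY KEY,
        OriginalName varchar(255) NOT NULL,
        FileHash     char(32)     NOT NULL REFERENCES StoredFile (FileHash),
        DateAdded    datetime     NOT NULL DEFAULT GETDATE(),
        ExpiresOn    datetime     NULL                   -- assumed expiry column
    )
""")
cur.execute("CREATE INDEX IX_FileName_OriginalName ON FileName (OriginalName)")
conn.commit()
```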
I think that's about it for the major parts.

EDIT: My comment was getting too long, so I'm moving it into an edit:

Whoops, my mistake; that's what I get for doing math when I'm tired. In this case, if you want to avoid the extra redundancy of adding RAID levels (51 or 61, e.g. mirrored across a striped set), hashing nets you the benefit of being able to slot five 1 TB drives into the server and have the file storage software span the drives by hash, as mentioned at the end of point 2. You could even RAID 1 the drives for added security.

Backing up would be more complex, though the file system creation/modification times would still hold for doing this (you could have it touch each file to update its modification time whenever a new reference to that file is added).

I see a two-fold downside to going by date/time for the directories. First, the distribution is unlikely to be uniform, which will leave some directories fuller than others; hashing distributes evenly. As for spanning, you could monitor the space on each drive as you add files and start spilling over to the next drive when space runs out. I imagine part of the expiry is date-related, so older drives would start to empty as newer ones fill up, and you'd have to figure out how to balance that.

The metadata store doesn't have to be on the server itself. You're already storing file-related data in the database; instead of referencing the path directly from the row where it's used, reference the file name key (the second table I mentioned) instead.

I imagine users go through some sort of web site or application to get at the store, so the smarts to figure out where a file lives on the storage server would live there, and you would just share out the roots of the drives (or do some fancy stuff with NTFS junctions to put all the drives under one subdirectory). If you're expecting to pull a file down via a web site, create a page that takes the file name ID, looks up the hash in the DB, breaks the hash up to whatever nesting level is configured, requests the file over the share to the server, and streams it back to the client. If you're expecting a UNC to access the file, have the server build the UNC instead.

Both of these methods make your end-user app less dependent on the structure of the file system itself, and will make it easier for you to tweak and expand your storage later.
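If it's useful, here's how the nightly cleanup from point 4 and the file-name-ID lookup described above might look, reusing the placeholder schema and layout from the sketches earlier in the answer (again an assumption-laden illustration, not a drop-in implementation):

```python
import os
import pyodbc


def purge_expired(conn: pyodbc.Connection, store_root: str) -> None:
    """Nightly maintenance pass: drop expired name references, then remove
    any stored file that no longer has a single reference pointing at it."""
    cur = conn.cursor()

    # 1. Expire name references (ExpiresOn is the illustrative expiry
    #    column from the schema sketch above).
    cur.execute("DELETE FROM FileName WHERE ExpiresOn < GETDATE()")

    # 2. Find hashes no longer referenced by any row in FileName.
    cur.execute("""
        SELECT sf.FileHash, sf.NestingLevels
        FROM StoredFile sf
        WHERE NOT EXISTS (SELECT 1 FROM FileName fn
                          WHERE fn.FileHash = sf.FileHash)
    """)
    orphans = cur.fetchall()

    # 3. Delete the files on disk, then their metadata rows.
    for file_hash, levels in orphans:
        parts = [file_hash[i * 2:i * 2 + 2] for i in range(levels)]
        path = os.path.join(store_root, *parts, file_hash)
        if os.path.exists(path):
            os.remove(path)
        cur.execute("DELETE FROM StoredFile WHERE FileHash = ?", file_hash)

    conn.commit()


def resolve_path(conn: pyodbc.Connection, file_name_id: int, store_root: str) -> str:
    """Turn a file name ID into the on-disk (or UNC) path, so client code
    never needs to know about the hashed directory layout."""
    cur = conn.cursor()
    cur.execute("""
        SELECT sf.FileHash, sf.NestingLevels
        FROM FileName fn
        JOIN StoredFile sf ON sf.FileHash = fn.FileHash
        WHERE fn.FileNameId = ?
    """, file_name_id)
    row = cur.fetchone()
    if row is None:
        raise FileNotFoundError(f"No stored file for name id {file_name_id}")
    file_hash, levels = row
    parts = [file_hash[i * 2:i * 2 + 2] for i in range(levels)]
    return os.path.join(store_root, *parts, file_hash)
```

A web front end would call something like `resolve_path` and stream the bytes back to the client; a UNC-based consumer would just be handed the resulting path, keeping the on-disk layout hidden either way.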
 
