Get fast random access to binary files, but also sequential access when needed. How to lay out the files?
    <p>I have about 1 billion datasets. Each has a DatasetKey and between 1 and 50 000 000 child entries (objects of some kind). The average is about 100 entries, but the distribution has a fat tail.</p>
    <p>Once the data is written it is never updated, only read.</p>
    <p>I need to read the data by DatasetKey and perform one of the following:<br>
    Get the number of child entries<br>
    Get the first 1000 child entries (or all of them if there are fewer than 1000)<br>
    Get the first 5000 child entries (or all of them if there are fewer than 5000)<br>
    Get the first 100000 child entries (or all of them if there are fewer than 100000)<br>
    Get all child entries</p>
    <p>Each child entry is between about 20 bytes and 2 KB in size (450 bytes on average).</p>
    <p>The layout I want to use is the following:</p>
    <p>I create files of at least 5 MB each.<br>
    Each file contains at least one DatasetKey; if the file is still smaller than 5 MB, I add further DatasetKeys (with their child entries) until it exceeds 5 MB.<br>
    First I store a header that says at which file offsets each kind of data can be found.<br>
    After that I plan to store packages serialized with protocol buffers:<br>
    one package for the first 1000 entries,<br>
    one for the next 4000 entries,<br>
    one for the next 95000 entries,<br>
    one for all remaining entries.</p>
    <p>I keep the file sizes in RAM (keeping all the headers would need too much RAM on the machine I use). When I need to access a specific DatasetKey, I look up in RAM which file I need and get its size, also from RAM. If the file is about 5 MB or smaller, I read the whole file into memory and process it. If it is larger than 5 MB, I read only the first x KB to get the header, and then load the part I need from disk.</p>
    <p>How does this sound? Is this total nonsense? 
Or is it a good way to go?</p>
    <p>With this design I had the following in mind:</p>
    <p>I want to store my data in my own binary files instead of a database, to make it easier to back up and process the files in the future.<br>
    I would have used PostgreSQL, but I found that storing binary data would make PostgreSQL's TOAST mechanism do more than one seek to access the data.<br>
    Storing one file per DatasetKey would take too long to write all the values to disk.<br>
    The data is calculated in RAM (block by block, since it does not all fit in RAM at once).<br>
    The file size of 5 MB is only a rough estimate.</p>
    <p>What do you say? Thank you in advance for your help!</p>
    <p><strong>Edit</strong></p>
    <p>Some more background information:</p>
    <p>DatasetKey is of type ulong.</p>
    <p>A child entry (there are different types) usually looks like this:</p>
    <pre><code>public struct ChildDataSet
{
    public string Val1;
    public string Val2;
    public byte Val3;
    public long Val4;
}
</code></pre>
    <p>I cannot tell in advance exactly which data will be accessed. The plan is that users get access to the first 1000, the first 5000, the first 100000, or all child entries of particular DatasetKeys, depending on their settings.</p>
    <p>I want to keep the response time as low as possible and use as little disk space as possible.</p>
    <p><strong>Regarding random access (Marc Gravell's question):</strong></p>
    <p>I do not need access to element no. 123456 for a specific DatasetKey. 
</p>
    <p>When storing more than one DatasetKey (with its child entries) in one file (which I designed this way to avoid creating too many files), I need random access to the first 1000 entries of a specific DatasetKey in that file, or to the first 5000 (in which case I would read the 1000 package and the 4000 package).</p>
    <p>For one specific DatasetKey (ulong), I only need access to the following:<br>
    1000 child entries (or all child entries if there are fewer than 1000)<br>
    5000 child entries (or all child entries if there are fewer than 5000)<br>
    100000 child entries (or all child entries if there are fewer than 100000)<br>
    all child entries</p>
    <p>Everything else I mentioned was just my attempt at a design :-)</p>
    <p><strong>Edit: streaming for one List in a class?</strong></p>
    <pre><code>using System.Collections.Generic;
using ProtoBuf;

[ProtoContract] // protobuf-net requires this attribute on serialized types
public class ChildDataSet
{
    [ProtoMember(1)]
    public List&lt;Class1&gt; Val1;
    [ProtoMember(2)]
    public List&lt;Class2&gt; Val2;
    [ProtoMember(3)]
    public List&lt;Class3&gt; Val3;
}
</code></pre>
    <p>Could I stream only Val1, for example to get just the first 5000 entries of Val1?</p>
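    <p>To make the layout described above concrete, here is a minimal sketch of one way it could look. It is an illustration under assumptions, not a finished design: the names <code>TieredFile</code>, <code>Write</code>, <code>ReadCount</code> and <code>ReadFirst</code> are hypothetical, the exact byte layout of the header is assumed, and each child entry is treated as an opaque, already-serialized byte array (for example the output of protocol buffers) stored with a 4-byte length prefix.</p>

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch: the class, method names, and byte layout are assumptions,
// not part of the question. Entries are opaque, already-serialized byte arrays.
public static class TieredFile
{
    // Cumulative tier limits from the question: 1000, 5000, 100000, then the rest.
    static readonly int[] TierLimits = { 1000, 5000, 100000, int.MaxValue };

    // Header size: key (8 bytes) + entry count (4 bytes) + one end offset per tier.
    static readonly int HeaderSize = 8 + 4 + 8 * TierLimits.Length;

    // Appends one dataset at the current position of a seekable stream.
    public static void Write(Stream s, ulong key, IList<byte[]> entries)
    {
        long pos = s.Position + HeaderSize;      // offset of the first payload byte
        var w = new BinaryWriter(s);             // not disposed: the caller owns the stream
        w.Write(key);
        w.Write(entries.Count);
        int i = 0;
        foreach (int limit in TierLimits)
        {
            // Each entry is stored as a 4-byte length prefix plus its bytes.
            for (; i < entries.Count && i < limit; i++) pos += 4 + entries[i].Length;
            w.Write(pos);                        // absolute end offset of this tier
        }
        foreach (byte[] e in entries) { w.Write(e.Length); w.Write(e); }
        w.Flush();
    }

    // Reads only the header to get the number of child entries.
    public static int ReadCount(Stream s, long start)
    {
        s.Position = start + 8;                  // skip the key
        return new BinaryReader(s).ReadInt32();
    }

    // Returns the first `max` entries (or all, if fewer) of the dataset at `start`.
    public static List<byte[]> ReadFirst(Stream s, long start, int max)
    {
        s.Position = start;
        var r = new BinaryReader(s);
        r.ReadUInt64();                          // key, unused in this sketch
        int want = Math.Max(0, Math.Min(max, r.ReadInt32()));
        var ends = new long[TierLimits.Length];
        for (int t = 0; t < ends.Length; t++) ends[t] = r.ReadInt64();

        // Pick the smallest tier that covers the request, then fetch everything
        // up to its end in one sequential read.
        int tier = 0;
        while (TierLimits[tier] < want) tier++;
        // A real implementation would stream large tiers; `checked` makes the
        // sketch fail loudly instead of overflowing when a block exceeds 2 GB.
        int size = checked((int)(ends[tier] - (start + HeaderSize)));
        var block = new BinaryReader(new MemoryStream(r.ReadBytes(size)));

        var result = new List<byte[]>(want);
        for (int k = 0; k < want; k++) result.Add(block.ReadBytes(block.ReadInt32()));
        return result;
    }
}
```

    <p>Storing cumulative end offsets per tier means a request for the first 1000, 5000 or 100000 entries can be served with one header read followed by one contiguous read, and the entry count is available from the header alone. Whether this beats reading a whole 5 MB file would need to be measured on the target disk.</p>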