<p>An arbitrary <code>Long</code> takes about 19.5 ASCII characters as text (19 digits, plus a minus sign about half the time) but only 8 bytes in binary, so writing it in binary saves roughly a factor of 2. If most of the values do not actually need all 8 bytes, you could define a compression scheme yourself.</p>
<p>In any case, you are probably best off writing block data using <code>java.nio.ByteBuffer</code> and friends. Binary data is most efficiently read in blocks, and you might want your file to be randomly accessible, in which case you want your data to look something like so:</p>
<pre><code>&lt;some unique binary header that lets you check the file type&gt;
&lt;int saying how many records you have&gt;
&lt;offset of the first record&gt;
&lt;offset of the second record&gt;
...
&lt;offset of the last record&gt;
&lt;int&gt;&lt;int&gt;&lt;length of vector&gt;&lt;long&gt;&lt;long&gt;...&lt;long&gt;
&lt;int&gt;&lt;int&gt;&lt;length of vector&gt;&lt;long&gt;&lt;long&gt;...&lt;long&gt;
...
&lt;int&gt;&lt;int&gt;&lt;length of vector&gt;&lt;long&gt;&lt;long&gt;...&lt;long&gt;
</code></pre>
<p>This is a particularly convenient format for reading and writing using <code>ByteBuffer</code> because you know in advance how big everything is going to be. So you can write</p>
<pre><code>val fos = new FileOutputStream(myFileName)
val fc = fos.getChannel  // java.nio.channels.FileChannel
val header = ByteBuffer.allocate(28)  // 24-byte header string + 4-byte record count
header.put("This is my cool header!!".getBytes)
header.putInt(data.length)
header.flip()  // switch the buffer from filling to draining before writing it
fc.write(header)
val offsets = ByteBuffer.allocate(8*data.length)
data.foldLeft(28L+8*data.length){ (n,d) =&gt;
  offsets.putLong(n)
  n + 12 + d.vector.length*8  // next offset: two ints, the length, then the longs
}
offsets.flip()
fc.write(offsets)
...
</code></pre>
<p>and on the way back in</p>
<pre><code>val fis = new FileInputStream(myFileName)
val fc = fis.getChannel
val header = ByteBuffer.allocate(28)
fc.read(header)
header.flip()  // rewind so the gets below start at the beginning of the buffer
val hbytes = new Array[Byte](24)
header.get(hbytes)
if (new String(hbytes) != "This is my cool header!!") ???
val nrec = header.getInt
val offsets = ByteBuffer.allocate(8*nrec)
fc.read(offsets)
offsets.flip()
val offsetArray = offsets.getLongs(nrec)  // See below!
...
</code></pre>
<p>Some handy methods are missing from <code>ByteBuffer</code>, but you can add them with implicits (here for Scala 2.10; with 2.9 make it a plain class, drop the <code>extends AnyVal</code>, and supply an implicit conversion from <code>ByteBuffer</code> to <code>RichByteBuffer</code>):</p>
<pre><code>implicit class RichByteBuffer(val b: java.nio.ByteBuffer) extends AnyVal {
  def getBytes(n: Int) = { val a = new Array[Byte](n); b.get(a); a }
  def getShorts(n: Int) = { val a = new Array[Short](n); var i=0; while (i&lt;n) { a(i)=b.getShort(); i+=1 } ; a }
  def getInts(n: Int) = { val a = new Array[Int](n); var i=0; while (i&lt;n) { a(i)=b.getInt(); i+=1 } ; a }
  def getLongs(n: Int) = { val a = new Array[Long](n); var i=0; while (i&lt;n) { a(i)=b.getLong(); i+=1 } ; a }
  def getFloats(n: Int) = { val a = new Array[Float](n); var i=0; while (i&lt;n) { a(i)=b.getFloat(); i+=1 } ; a }
  def getDoubles(n: Int) = { val a = new Array[Double](n); var i=0; while (i&lt;n) { a(i)=b.getDouble(); i+=1 } ; a }
}
</code></pre>
<p>Anyway, the reason to do things this way is that you'll end up with decent performance, which matters when you have tens of gigabytes of data (which it sounds like you have, given hundreds of thousands of vectors of length up to ten thousand).</p>
<p>If your problem is actually much smaller, then don't worry so much about it: pack it into XML, use JSON, or use some custom text solution (or use <code>DataOutputStream</code> and <code>DataInputStream</code>, which don't perform as well and won't give you random access).</p>
<p>If your problem is actually bigger, you can define <em>two</em> lists of longs: first, the ones that will fit in an <code>Int</code>, say, and then the ones that actually need a full <code>Long</code> (with indices so you know where they are). Data compression is a very case-specific task (assuming you don't just want to use <code>java.util.zip</code>), so without a lot more knowledge about what the data looks like, it's hard to know what to recommend beyond just storing it as a weakly hierarchical binary file as I've described above.</p>
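<p>To make the random-access point concrete, here is a minimal sketch of a full round trip under the layout described above. It is an illustration under stated assumptions, not a tuned implementation: the <code>Record</code> case class (two ints plus an <code>Array[Long]</code>), the <code>BinaryRecords</code> object, and its method names are hypothetical and do not appear in the original answer. It uses <code>RandomAccessFile</code> and the positional <code>FileChannel.read</code> so that one record can be fetched through the offset table without reading the rest of the file.</p>

```scala
import java.io.{EOFException, File, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel

// Hypothetical record type: two ints plus a vector of longs, matching the layout above.
case class Record(a: Int, b: Int, vector: Array[Long])

object BinaryRecords {
  val Magic: Array[Byte] = "This is my cool header!!".getBytes("US-ASCII") // 24 bytes
  val HeaderSize: Int = Magic.length + 4 // header string + record count

  // Bytes used by one record: two ints, an int length, then the longs.
  def recordSize(r: Record): Int = 12 + 8 * r.vector.length

  def write(file: File, data: Seq[Record]): Unit = {
    val raf = new RandomAccessFile(file, "rw")
    try {
      raf.setLength(0) // discard any previous contents
      val fc = raf.getChannel
      // Header and offset table go into one buffer.
      val head = ByteBuffer.allocate(HeaderSize + 8 * data.length)
      head.put(Magic).putInt(data.length)
      var offset = HeaderSize.toLong + 8L * data.length
      for (r <- data) { head.putLong(offset); offset += recordSize(r) }
      head.flip()
      while (head.hasRemaining) fc.write(head)
      // Records follow, one buffer each.
      for (r <- data) {
        val buf = ByteBuffer.allocate(recordSize(r))
        buf.putInt(r.a).putInt(r.b).putInt(r.vector.length)
        r.vector.foreach(x => buf.putLong(x))
        buf.flip()
        while (buf.hasRemaining) fc.write(buf)
      }
    } finally raf.close()
  }

  // Fills buf from the given file position, then flips it for reading.
  private def readFully(fc: FileChannel, buf: ByteBuffer, pos: Long): ByteBuffer = {
    var p = pos
    while (buf.hasRemaining) {
      val n = fc.read(buf, p)
      if (n < 0) throw new EOFException("file is truncated")
      p += n
    }
    buf.flip()
    buf
  }

  // Reads only the header, one offset-table entry, and the requested record.
  def readRecord(file: File, index: Int): Record = {
    val raf = new RandomAccessFile(file, "r")
    try {
      val fc = raf.getChannel
      val header = readFully(fc, ByteBuffer.allocate(HeaderSize), 0L)
      val magic = new Array[Byte](Magic.length)
      header.get(magic)
      require(java.util.Arrays.equals(magic, Magic), "not a record file")
      val nrec = header.getInt()
      require(index >= 0 && index < nrec, "index out of range")
      val offset = readFully(fc, ByteBuffer.allocate(8), HeaderSize + 8L * index).getLong()
      val fixed = readFully(fc, ByteBuffer.allocate(12), offset)
      val (a, b, len) = (fixed.getInt(), fixed.getInt(), fixed.getInt())
      val body = readFully(fc, ByteBuffer.allocate(8 * len), offset + 12)
      val vector = new Array[Long](len)
      body.asLongBuffer().get(vector)
      Record(a, b, vector)
    } finally raf.close()
  }
}
```

<p>Calling <code>BinaryRecords.write(file, records)</code> and then <code>BinaryRecords.readRecord(file, i)</code> should return the <code>i</code>-th record. The sketch allocates one buffer per record for clarity; for tens of gigabytes you would probably reuse a large buffer or memory-map the file instead.</p>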