
# Need a way to sort a 100 GB log file by date
So, for some strange reason I end up with a 100 GB log file that is unsorted (*actually it's partially sorted*), while the algorithms that I'm attempting to apply require sorted data. A line in the log file looks like so:

```
data <date> data data more data
```

I have access to C# 4.0 and about 4 GB of RAM on my workstation. I would imagine that merge sort of some kind would be best here, but short of implementing these algorithms myself, I want to ask if there's some kind of shortcut I could take.

Incidentally, parsing the date string with `DateTime.Parse()` is very slow and takes up a lot of CPU time; the *chugging* rate is a measly 10 MB/sec. Is there a faster way than the following?

```csharp
// Parses the leading "YYYY-MM-DD" portion of the field.
// TryParse return values are ignored; malformed input yields zeros.
public static DateTime Parse(string data)
{
    int year, month, day;
    int.TryParse(data.Substring(0, 4), out year);
    int.TryParse(data.Substring(5, 2), out month);
    int.TryParse(data.Substring(8, 2), out day);
    return new DateTime(year, month, day);
}
```

I wrote that to speed up `DateTime.Parse()` and it actually works well, but it is still taking a bucket-load of cycles.

*Note that for the current log file I'm interested in hours, minutes and seconds as well. I know that I can give `DateTime.ParseExact()` a format string, but that doesn't seem to speed it up all that much.*

I'm looking for a nudge in the right direction; thanks in advance.

**EDIT**: Some people have suggested that I use string comparison in order to compare dates. That would work for the sorting phase, but I do need to parse the dates for the algorithms. I still have no idea how to sort a 100 GB file with 4 GB of free RAM without doing it manually.
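In the meantime, here is a minimal sketch of an allocation-free parser I'm experimenting with, assuming the timestamp is a fixed-width `YYYY-MM-DD hh:mm:ss.fff` field starting at a known offset (the offset and the three-digit milliseconds are assumptions about the layout, so adjust to taste). Plain character arithmetic avoids the `Substring()` allocations and the `TryParse()` calls entirely:

```csharp
using System;

public static class FastTimestamp
{
    // Sketch only: parses "YYYY-MM-DD hh:mm:ss.fff" starting at `offset`.
    // Assumes the field is fixed-width and well-formed; no validation is done.
    public static DateTime Parse(string line, int offset)
    {
        int year  = Digits(line, offset,      4);
        int month = Digits(line, offset + 5,  2);
        int day   = Digits(line, offset + 8,  2);
        int hour  = Digits(line, offset + 11, 2);
        int min   = Digits(line, offset + 14, 2);
        int sec   = Digits(line, offset + 17, 2);
        int msec  = Digits(line, offset + 20, 3);
        return new DateTime(year, month, day, hour, min, sec, msec);
    }

    // Converts `count` ASCII digits to an int via character arithmetic.
    private static int Digits(string s, int start, int count)
    {
        int value = 0;
        for (int i = start; i < start + count; i++)
            value = value * 10 + (s[i] - '0');
        return value;
    }
}
```

If constructing the `DateTime` itself turns out to dominate, then for the sorting phase comparing the raw timestamp substrings lexicographically (as suggested) remains the cheaper option.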
**EDIT 2**: Well, thanks to several suggestions that I use [windows sort](http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/sort.mspx?mfr=true), I found out that there's a [similar tool for Linux](http://manpages.ubuntu.com/manpages/lucid/man1/sort.1.html). Basically, you call sort and it fixes everything for you. As we speak it's doing *something*, and I hope it'll finish soon. The command I'm using is:

```
sort -k 2b 2008.log > 2008.sorted.log
```

`-k 2b` specifies that I want to sort starting at the second field, which is a date-time string in the usual `YYYY-MM-DD hh:mm:ss.msek` format. I must admit that the man pages fall short of explaining all the options, but I found a lot of examples by running `info coreutils 'sort invocation'`.

I'll report back with results and timings. This part of the log is about 27 GB. I am thinking of sorting 2009 and 2010 separately and then merging the results into a single file with the `sort -m` option.

**EDIT 3**: Well, checking [iotop](http://manpages.ubuntu.com/manpages/lucid/en/man1/iotop.1.html) suggests that it's reading the data file in small chunks and then furiously doing something in order to process them. This process seems to be quite slow. =(

`sort` isn't using any memory to speak of, and only a single core. When it does read data from the drive, it's not processing anything. Am I doing something wrong?

**EDIT 4**: Three hours in and it's still doing the same thing. Now I'm at that stage where I want to try playing with the parameters, but I'm three hours invested... I'll abort in about 4 hours and try to set it up as an overnight run with smarter memory and space parameters.

**EDIT 5**: Before I went home, I restarted the process with the following command:

```
sort -k 2b --buffer-size=60% -T ~/temp/ -T "/media/My Passport" 2010.log -o 2010.sorted.log
```

It returned this, this morning:

```
sort: write failed: /media/My Passport/sortQAUKdT: File too large
```

*Wraawr!* I thought I would just add as many hard drives as possible to speed this process up. Apparently adding a USB drive was the worst idea ever. At the moment I can't even tell if it's about FAT/NTFS or some such, because fdisk is telling me the USB drive is a "wrong device"... no kidding. I'll try to give it another go later; for now let's put this project into the maybe-failed pile.

**Final Notice**: This time it worked, with the same command as above, but without the problematic external hard drive. Thank you all for your help!

**Benchmarking**

Using two workstation-grade hard disks (at least 70 MB/sec read/write I/O) on the same SATA controller, it took me 162 minutes to sort a 30 GB log file. I will need to sort another 52 GB file tonight; I'll post how that goes.
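For completeness, here's the rough shape of the external merge sort I originally had in mind, in case anyone lands here without GNU sort available. It's a sketch under stated assumptions, not a drop-in tool: the chunk size, the temp-file handling, and the key extraction (the 23-character `YYYY-MM-DD hh:mm:ss.fff` stamp after the first space, assuming three-digit milliseconds) would all need adapting to your format.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ExternalSort
{
    // Sketch only: sorts `input` into `output` via temp chunk files.
    // Assumes lexicographic order of the timestamp key equals chronological order.
    public static void Sort(string input, string output, int linesPerChunk)
    {
        var chunks = new List<string>();

        // Phase 1: read chunks that fit in RAM, sort each, spill to disk.
        using (var reader = new StreamReader(input))
        {
            while (!reader.EndOfStream)
            {
                var lines = new List<string>(linesPerChunk);
                string line;
                while (lines.Count < linesPerChunk && (line = reader.ReadLine()) != null)
                    lines.Add(line);

                lines.Sort((a, b) => string.CompareOrdinal(Key(a), Key(b)));

                string chunk = Path.GetTempFileName();
                File.WriteAllLines(chunk, lines);
                chunks.Add(chunk);
            }
        }

        // Phase 2: k-way merge -- repeatedly emit the smallest head line.
        var readers = chunks.Select(c => new StreamReader(c)).ToList();
        var heads = readers.Select(r => r.ReadLine()).ToList();

        using (var writer = new StreamWriter(output))
        {
            while (true)
            {
                int min = -1;
                for (int i = 0; i < heads.Count; i++)
                    if (heads[i] != null &&
                        (min < 0 || string.CompareOrdinal(Key(heads[i]), Key(heads[min])) < 0))
                        min = i;

                if (min < 0) break;            // all chunks exhausted
                writer.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();
            }
        }

        foreach (var r in readers) r.Dispose();
        foreach (var c in chunks) File.Delete(c);
    }

    // Assumed key: the date-time stamp right after the first space,
    // "YYYY-MM-DD hh:mm:ss.fff" = 23 characters.
    private static string Key(string line)
    {
        int start = line.IndexOf(' ') + 1;
        return line.Substring(start, 23);
    }
}
```

Phase 1 only ever holds one chunk in memory, so a few million lines per chunk stays well inside 4 GB. The linear scan in phase 2 is fine for a few dozen chunks, though a heap would be the right structure for many more; GNU sort does essentially this, just with years of tuning behind it.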
 
