Because the log file is fairly large, you should read it with a stream reader rather than loading the whole file into memory. It is reasonable to assume that the set of distinct links fits in memory while we work through the log file.

```
// Pseudo
Hashmap map<url, count>
while (log file has next line) {
    url = next line in log file
    add url to map and update count
}
List list
foreach (m in map) {
    add m to list
}
sort the list by count value
take top n from the list
```

The runtime is O(n) + O(m log m), where n is the number of lines in the log file and m is the number of distinct links found.

Here is a C# implementation of the pseudo-code. An actual file reader and log file are not provided; a simple emulation that reads the log from a list in memory is used instead (a sketch of stream-based reading is given after the implementation).

The algorithm uses a hashmap to count the links as they are found, and a sort then yields the top 100 links. A small data container class is used as the element type for the sort.

The memory complexity depends on the number of distinct links expected: the hashmap must be able to hold all of them, or this algorithm won't work.

```csharp
// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit");
        Console.ReadKey();
    }

    class LinkData : IComparable
    {
        public string Url { get; set; }
        public int Count { get; set; }

        // Compare in descending order of Count, so that Sort()
        // places the most frequent links first.
        public int CompareTo(object obj)
        {
            var other = obj as LinkData;
            int i = other == null ? 0 : other.Count;
            return i.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;

        // Emulate the log file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com",
                rand.Next(1000).ToString("x")));
        }

        // Hashmap memory must be allocated
        // to contain the distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is the log line count
        foreach (var url in urls) // Emulate stream reader, ReadLine
        {
            if (lookup.ContainsKey(url))
            {
                int i = lookup[url];
                lookup[url] = i + 1;
            }
            else
            {
                lookup.Add(url, 1);
            }
        }

        // O(m) where m is the number of distinct urls
        var list = lookup.Select(i => new LinkData
        {
            Url = i.Key,
            Count = i.Value
        }).ToList();

        // O(m log m)
        list.Sort();

        // O(k), k = 100
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(k)
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }

        Console.WriteLine(string.Format("Time elapsed msec: {0}",
            stopwatch.ElapsedMilliseconds));
    }
}
```

**Edit:** This answer has been updated based on the comments:

- added: running time and memory complexity analysis
- added: pseudo-code
- added: an explanation of how to manage a fairly large log file
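The emulated list above stands in for real file input. As a minimal sketch of the stream-based reading that the first paragraph recommends (the file name `access.log` and the one-URL-per-line format are assumptions, not part of the original answer), the counting loop could be driven by a `StreamReader` so that only the dictionary, never the log itself, is held in memory:

```csharp
// Sketch: stream the log file instead of holding it in memory.
// Assumes one URL per line in a file called "access.log" -- both
// the path and the format are hypothetical.
using System;
using System.Collections.Generic;
using System.IO;

class StreamingCounter
{
    static Dictionary<string, int> CountUrls(string path)
    {
        var lookup = new Dictionary<string, int>();

        // Only one line is held in memory at a time; the log file
        // itself is never fully loaded.
        using (var reader = new StreamReader(path))
        {
            string url;
            while ((url = reader.ReadLine()) != null)
            {
                int count;
                lookup[url] = lookup.TryGetValue(url, out count) ? count + 1 : 1;
            }
        }
        return lookup;
    }

    static void Main()
    {
        var lookup = CountUrls("access.log");
        Console.WriteLine("Distinct urls: {0}", lookup.Count);
    }
}
```

The resulting dictionary can then be fed into the same select-sort-take steps as in the implementation above.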
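As a side note, the full O(m log m) sort is not strictly required to extract the top 100: a bounded min-heap of size k keeps only the current best candidates and reduces the selection step to O(m log k). This is a variation on the answer above, not part of it. Older .NET versions have no built-in priority queue, so the sketch below emulates one with a `SortedSet` and a tie-breaking comparer:

```csharp
// Sketch: select the top k urls with a bounded "min-heap"
// (emulated via SortedSet) instead of sorting all m entries.
using System;
using System.Collections.Generic;
using System.Linq;

class TopK
{
    static List<KeyValuePair<string, int>> Top(
        Dictionary<string, int> lookup, int k)
    {
        // Order by count, then by url, so entries with equal
        // counts are still treated as distinct by the set.
        var comparer = Comparer<KeyValuePair<string, int>>.Create((a, b) =>
            a.Value != b.Value
                ? a.Value.CompareTo(b.Value)
                : string.CompareOrdinal(a.Key, b.Key));

        var heap = new SortedSet<KeyValuePair<string, int>>(comparer);
        foreach (var entry in lookup)
        {
            heap.Add(entry);
            if (heap.Count > k)
            {
                heap.Remove(heap.Min); // drop the current smallest count
            }
        }

        // Largest counts first
        return heap.Reverse().ToList();
    }

    static void Main()
    {
        var lookup = new Dictionary<string, int>
        {
            { "http://a.com", 5 }, { "http://b.com", 9 }, { "http://c.com", 2 }
        };
        foreach (var e in Top(lookup, 2))
        {
            Console.WriteLine("Url: {0}, Count: {1}", e.Key, e.Value);
        }
    }
}
```

With k fixed at 100 the extra memory for the heap is constant, and the overall runtime becomes O(n) + O(m log k) rather than O(n) + O(m log m).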
 
