HDFS partition data
<p>I have a large volume (TBs) of DNS logs in text files, where each record has the form</p>
<pre><code>timestamp | resolvername | domainlookedfor | dns_answer
</code></pre>
<p>where</p>
<pre><code>timestamp       - time at which the record was logged
resolvername    - the DNS resolver that served the end user
domainlookedfor - domain that was looked up by the end user
dns_answer      - final DNS resolution record of 'hostname -&gt; ip address'
</code></pre>
<p>As of now, I have <code>individual text files for every five minutes of logs</code> from various <code>dns resolvers</code>. So if I want to see the records from the past 10 days that contain a hostname, say <code>www.google.com</code>, I have to scan the entire data for the past 10 days (let's say 50GB) and then filter only the records that match the domain (let's say 10MB of data). A large amount of data is therefore read from disk unnecessarily, and it takes a long time to get the results.</p>
<p>To improve this situation, I am thinking of partitioning the data by <code>domain name</code> and thereby reducing my search space. I would also like to keep the records separated by time (if not a file for every 5 minutes, then a file for, say, every day).</p>
<p>One simple approach that I can think of is:</p>
<ul>
<li><p>Bucket the records based on the hash of the domain name (or maybe the first two letters) [domain_AC, domain_AF, domain_AI ... domain_ZZ], where directory domain_AC holds the records for all domains whose 1st character is A and whose 2nd character is A, B, or C.</p></li>
<li><p>Within each bucket, there will be a separate file for each day [20130129, 20130130, ... ]</p></li>
</ul>
<p>So to obtain records for <code>www.google.com</code>, first identify the bucket, then scan the files for the requested date range and keep only the records that match <code>www.google.com</code>.</p>
<hr>
<p>Another requirement I have is to group the records by <code>resolvername</code> to answer queries such as <code>get all the records by resolver 'x'</code>.</p>
<p>Please let me know if there are any important details that I should consider, and whether there is any other known approach to this problem. I appreciate any help. Thanks!</p>
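<p>To make the bucketing idea concrete, here is a minimal sketch of how the bucket directory and the daily files for a query could be computed. It is only an illustration under several assumptions: the function names <code>bucket_for_domain</code> and <code>files_for_query</code>, the root path <code>/dns_logs</code>, and the catch-all bucket <code>domain_other</code> (for names that do not start with two letters, such as those with digits or hyphens) are all hypothetical, and the bucket is derived from the first two characters of the full hostname.</p>

```python
import string
from datetime import date, timedelta

LETTERS = string.ascii_uppercase
GROUP_SIZE = 3  # second letter is grouped in threes: A-C, D-F, ..., Y-Z


def bucket_for_domain(domain: str) -> str:
    """Return the bucket directory name for a domain (hypothetical naming)."""
    name = domain.strip().upper()
    # Assumption: anything not starting with two ASCII letters goes to a catch-all bucket.
    if len(name) < 2 or name[0] not in LETTERS or name[1] not in LETTERS:
        return "domain_other"
    idx = LETTERS.index(name[1])
    # Last letter of the 3-letter group containing the second character (capped at Z).
    group_end = min((idx // GROUP_SIZE) * GROUP_SIZE + GROUP_SIZE - 1, len(LETTERS) - 1)
    return f"domain_{name[0]}{LETTERS[group_end]}"


def files_for_query(domain: str, start: date, end: date, root: str = "/dns_logs") -> list:
    """List the daily files (inclusive date range) that would need to be scanned."""
    bucket = bucket_for_domain(domain)
    num_days = (end - start).days + 1
    return [
        f"{root}/{bucket}/{(start + timedelta(days=i)).strftime('%Y%m%d')}"
        for i in range(num_days)
    ]


if __name__ == "__main__":
    for path in files_for_query("www.google.com", date(2013, 1, 29), date(2013, 1, 30)):
        print(path)
```

<p>With this sketch, every hostname beginning with <code>www</code> lands in the same bucket (<code>domain_WX</code>), so bucketing on the raw first two letters may give uneven bucket sizes; hashing the domain name, as mentioned above, is one way that might be avoided.</p>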
 
