Note that there are some explanatory texts on larger screens.

plurals
  1. POEfficiently identify changed fields in CSV files using c#
    text
    copied!<p>This turned out to be more difficult than I thought. Basically, each day a snapshot of a customer master list is being dumped by a system into CSV. It contains about 120000 records and 60 fields. About 25mb. Anyway, I'd like to report on values that change between one snapshot and another. It <em>isn't</em> a plan file diff, as it must be matched on the leftmost column value which contains the customer's unique number. Lines could be inserted/removed etc. All fields are strings, including the reference number. </p> <p>I've written a solution with LINQ but it dies with larger datasets. For 10000 records, it takes 17 seconds. For 120000, it takes nearly 2 hours to compare the two files. Right now it uses the excellent and free 'filehelpers' <a href="http://www.filehelpers.com/" rel="noreferrer">http://www.filehelpers.com/</a> to load the data, this takes a few seconds only, then. But detecting which records have changed is more problematic. The below takes is the 2 hour query:</p> <pre><code> var changednames = from f in fffiltered from s in sffiltered where f.CustomerRef == s.CustomerRef &amp;&amp; f.Customer_Name != s.Customer_Name select new { f, s }; </code></pre> <p>What approach would you recommend? I'd like to immediately 'prune' the list to those with a change of some sort, then apply my more specific comparisons to that small subset. Some of my thoughts were:</p> <p>a) Use dictionaries or Hashsets- though early tests don't really show improvements</p> <p>b) Compartmentalise the operations - use the first character in the customer reference field and match only against those with the same one. This probably involves creating many separate collections though and seems pretty inelegant.</p> <p>c) move away from a typed data arrangement and do it with arrays. Again, benefit uncertain.</p> <p>Any thoughts?</p> <p>Thanks!</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload