List duplicate files inside a folder in C#: Leveraging LINQ.AsParallel
<p>I have implemented the following algorithm in C# to list duplicate files inside a folder recursively.</p>
<ol>
<li>Iterate through the files in the directory &amp; its subdirectories.</li>
<li>Store each file's name &amp; path in a list.</li>
<li>If the current file's name matches a file already in the list, mark both files as duplicates.</li>
<li>Fetch all files from the list that were marked as duplicates.</li>
<li>Group them by name &amp; return.</li>
</ol>
<p>The implementation is very slow on a folder containing 50,000 files and 12,000 subdirectories, since reading from disk is a time-consuming operation. Even <strong>AsParallel()</strong> doesn't help much.</p>
<p><strong>Implementation:</strong></p>
<blockquote>
<pre><code>class FileTuple
{
    public string FileName { set; get; }
    public string ContainingFolder { set; get; }
    public bool HasDuplicate { set; get; }

    // Two tuples are considered equal when their file names match.
    public override bool Equals(object obj)
    {
        var other = obj as FileTuple;
        return other != null &amp;&amp; this.FileName == other.FileName;
    }
}
</code></pre>
</blockquote>
<ol>
<li>The FileTuple class keeps track of the file name &amp; containing directory; the flag keeps track of the duplicate status.</li>
<li>I have overridden the Equals method to compare only file names in the collection of FileTuples.</li>
</ol>
<p>The following method finds the duplicate files and returns them as a list.</p>
<pre><code>private List&lt;FileTuple&gt; FindDuplicates()
{
    List&lt;FileTuple&gt; fileTuples = new List&lt;FileTuple&gt;();

    // Read all files from the given path
    List&lt;string&gt; enumeratedFiles = Directory
        .EnumerateFiles(txtFolderPath.Text, "*.*", SearchOption.AllDirectories)
        .Where(str =&gt; str.Contains(".exe") || str.Contains(".zip"))
        .AsParallel()
        .ToList();

    foreach (string filePath in enumeratedFiles)
    {
        var name = Path.GetFileName(filePath);
        var folder = Path.GetDirectoryName(filePath);

        var currentFile = new FileTuple
        {
            FileName = name,
            ContainingFolder = folder,
            HasDuplicate = false,
        };

        int foundIndex = fileTuples.IndexOf(currentFile);

        // Mark both files as duplicates, if found in the list
        // (assuming only two duplicate files)
        if (foundIndex != -1)
        {
            currentFile.HasDuplicate = true;
            fileTuples[foundIndex].HasDuplicate = true;
        }

        // Keep track of the files visited
        fileTuples.Add(currentFile);
    }

    List&lt;FileTuple&gt; duplicateFiles = fileTuples
        .Where(fileTuple =&gt; fileTuple.HasDuplicate)
        .Select(fileTuple =&gt; fileTuple)
        .OrderBy(fileTuple =&gt; fileTuple.FileName)
        .AsParallel()
        .ToList();

    return duplicateFiles;
}
</code></pre>
<p>Can you please suggest a way to improve the performance?</p>
<p>Thank you for your help.</p>
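One possible direction, as a minimal sketch and not a measured fix: List.IndexOf scans the whole list on every call, so the loop in FindDuplicates does work proportional to the square of the number of files. Grouping by file name in a single pass avoids that scan. The names DuplicateFinder, GroupDuplicates and FindDuplicates(rootFolder) below are hypothetical. Two behaviors differ from the code in the question by assumption: file names are compared case-insensitively, and the filter matches the file extension instead of any ".exe" or ".zip" substring in the path.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the original post.
static class DuplicateFinder
{
    // Groups paths by file name and keeps only names that occur more than once.
    // Assumption: names are compared case-insensitively.
    public static List<IGrouping<string, string>> GroupDuplicates(IEnumerable<string> filePaths)
    {
        return filePaths
            .GroupBy(path => Path.GetFileName(path), StringComparer.OrdinalIgnoreCase)
            .Where(group => group.Count() > 1)
            .OrderBy(group => group.Key, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    // Enumerates .exe and .zip files under rootFolder and returns the duplicate groups.
    // Assumption: matching on the extension is what the Contains(...) filter intended.
    public static List<IGrouping<string, string>> FindDuplicates(string rootFolder)
    {
        var candidates = Directory
            .EnumerateFiles(rootFolder, "*", SearchOption.AllDirectories)
            .Where(path =>
            {
                string extension = Path.GetExtension(path);
                return extension.Equals(".exe", StringComparison.OrdinalIgnoreCase)
                    || extension.Equals(".zip", StringComparison.OrdinalIgnoreCase);
            });

        return GroupDuplicates(candidates);
    }
}
```

Calling DuplicateFinder.FindDuplicates(txtFolderPath.Text) would return one group per duplicated name, each containing every path that shares that name, so more than two copies of a file are handled as well. This only removes the quadratic list scan; it does not make the directory enumeration itself any faster.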