Is there a faster way to scan through a directory recursively in .NET?

There is a long history of the .NET file enumeration methods being slow. The issue is that there is no instantaneous way of enumerating large directory structures. Even the accepted answer here has its issues with GC allocations.

The best I've been able to do is wrapped up in my library and exposed as the FindFile (source) class in the CSharpTest.Net.IO namespace. This class can enumerate files and folders without unneeded GC allocations and string marshaling.

The usage is simple enough, and setting the RaiseOnAccessDenied property to false will skip the directories and files the user does not have access to:

    private static long SizeOf(string directory)
    {
        var fcounter = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
        fcounter.RaiseOnAccessDenied = false;

        long size = 0, total = 0;
        fcounter.FileFound +=
            (o, e) =>
            {
                if (!e.IsDirectory)
                {
                    Interlocked.Increment(ref total);
                    size += e.Length;
                }
            };

        Stopwatch sw = Stopwatch.StartNew();
        fcounter.Find();
        Console.WriteLine("Enumerated {0:n0} files totaling {1:n0} bytes in {2:n3} seconds.",
                          total, size, sw.Elapsed.TotalSeconds);
        return size;
    }

For my local C:\ drive this outputs the following:

    Enumerated 810,046 files totaling 307,707,792,662 bytes in 232.876 seconds.

Your mileage may vary by drive speed, but this is the fastest method I've found of enumerating files in managed code. The event argument is a mutating class of type FindFile.FileFoundEventArgs, so be sure you do not keep a reference to it, as its values will change for each event raised.
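If you need to collect per-file data, a minimal sketch of the copy-out pattern might look like the following. It reuses only the members shown above (IsDirectory and Length); the FileRecord type is my own illustration, not part of the library:

    private sealed class FileRecord
    {
        public bool IsDirectory;
        public long Length;
    }

    private static List<FileRecord> CollectFiles(string directory)
    {
        var records = new List<FileRecord>();
        var finder = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
        finder.RaiseOnAccessDenied = false;

        finder.FileFound +=
            (o, e) =>
            {
                // Copy the values out; do not add 'e' itself to a collection,
                // because the same instance is reused for every entry found.
                records.Add(new FileRecord { IsDirectory = e.IsDirectory, Length = e.Length });
            };

        finder.Find();
        return records;
    }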

You might also note that the DateTime values exposed are UTC only. The reason is that the conversion to local time is semi-expensive, so consider working in UTC and converting to local time only when you actually need to display a value.
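If you do need local times for display, a minimal sketch of deferring that conversion until the value is shown (FormatForDisplay is just an illustrative helper, not part of the library):

    // Keep timestamps in UTC throughout the scan; convert only when displaying.
    private static string FormatForDisplay(DateTime utcTimestamp)
    {
        // ToLocalTime applies the time-zone offset once per displayed value,
        // which is far cheaper than converting every file during enumeration.
        return utcTimestamp.ToLocalTime().ToString("yyyy-MM-dd HH:mm:ss");
    }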


I just ran across this. Nice implementation of the native version.

This version, while still slower than the version that uses FindFirst and FindNext, is quite a bit faster than your original .NET version.

    static List<Info> RecursiveMovieFolderScan(string path)
    {
        var info = new List<Info>();
        var dirInfo = new DirectoryInfo(path);
        foreach (var entry in dirInfo.GetFileSystemInfos())
        {
            bool isDir = (entry.Attributes & FileAttributes.Directory) != 0;
            if (isDir)
            {
                info.AddRange(RecursiveMovieFolderScan(entry.FullName));
            }
            info.Add(new Info()
            {
                IsDirectory = isDir,
                CreatedDate = entry.CreationTimeUtc,
                ModifiedDate = entry.LastWriteTimeUtc,
                Path = entry.FullName
            });
        }
        return info;
    }

It should produce the same output as your native version. My testing shows that this version takes about 1.7 times as long as the version that uses FindFirst and FindNext. Timings were obtained in release mode, running without the debugger attached.

Curiously, changing GetFileSystemInfos to EnumerateFileSystemInfos adds about 5% to the running time in my tests. I rather expected it to run at the same speed or possibly faster, because it doesn't have to create the array of FileSystemInfo objects.

The following code is shorter still, because it lets the Framework take care of recursion. But it's a good 15% to 20% slower than the version above.

    static List<Info> RecursiveScan3(string path)
    {
        var info = new List<Info>();

        var dirInfo = new DirectoryInfo(path);
        foreach (var entry in dirInfo.EnumerateFileSystemInfos("*", SearchOption.AllDirectories))
        {
            info.Add(new Info()
            {
                IsDirectory = (entry.Attributes & FileAttributes.Directory) != 0,
                CreatedDate = entry.CreationTimeUtc,
                ModifiedDate = entry.LastWriteTimeUtc,
                Path = entry.FullName
            });
        }
        return info;
    }

Again, if you change that to GetFileSystemInfos, it will be slightly (but only slightly) faster.

For my purposes, the first solution above is quite fast enough. The native version runs in about 1.6 seconds. The version that uses DirectoryInfo runs in about 2.9 seconds. I suppose if I were running these scans very frequently, I'd change my mind.


This implementation, which needs a bit of tweaking, is 5-10x faster.

    static List<Info> RecursiveScan2(string directory) {
        IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
        WIN32_FIND_DATAW findData;
        IntPtr findHandle = INVALID_HANDLE_VALUE;

        var info = new List<Info>();
        try {
            findHandle = FindFirstFileW(directory + @"\*", out findData);
            if (findHandle != INVALID_HANDLE_VALUE) {

                do {
                    if (findData.cFileName == "." || findData.cFileName == "..") continue;

                    string fullpath = directory + (directory.EndsWith("\\") ? "" : "\\") + findData.cFileName;

                    bool isDir = false;

                    if ((findData.dwFileAttributes & FileAttributes.Directory) != 0) {
                        isDir = true;
                        info.AddRange(RecursiveScan2(fullpath));
                    }

                    info.Add(new Info()
                    {
                        CreatedDate = findData.ftCreationTime.ToDateTime(),
                        ModifiedDate = findData.ftLastWriteTime.ToDateTime(),
                        IsDirectory = isDir,
                        Path = fullpath
                    });
                }
                while (FindNextFile(findHandle, out findData));

            }
        } finally {
            if (findHandle != INVALID_HANDLE_VALUE) FindClose(findHandle);
        }
        return info;
    }

extension method:

    public static class FILETIMEExtensions {
        public static DateTime ToDateTime(this System.Runtime.InteropServices.ComTypes.FILETIME filetime) {
            long highBits = filetime.dwHighDateTime;
            highBits = highBits << 32;
            // Cast the low half through uint so it is not sign-extended when its high bit is set.
            return DateTime.FromFileTimeUtc(highBits | (uint)filetime.dwLowDateTime);
        }
    }

interop defs are:

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern IntPtr FindFirstFileW(string lpFileName, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    public static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll")]
    public static extern bool FindClose(IntPtr hFindFile);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    public struct WIN32_FIND_DATAW {
        public FileAttributes dwFileAttributes;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public int nFileSizeHigh;
        public int nFileSizeLow;
        public int dwReserved0;
        public int dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }
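
For reference, a minimal sketch of driving RecursiveScan2 and timing it, in the same Stopwatch style as the example further up (the scan root is arbitrary, and Info is the same class used by the other versions):

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        // Pick any root you like; this path is only an example.
        List<Info> entries = RecursiveScan2(@"C:\Users");
        sw.Stop();
        Console.WriteLine("Found {0:n0} entries in {1:n3} seconds.",
                          entries.Count, sw.Elapsed.TotalSeconds);
    }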

Depending on how much time you're trying to shave off the function, it may be worth your while to call the Win32 API functions directly, since the managed API does a lot of extra processing to check things you may not be interested in.

If you haven't done so already, and assuming you don't intend to contribute to the Mono project, I would strongly recommend downloading Reflector and having a look at how Microsoft implemented the API calls you're currently using. This will give you an idea of what you need to call and what you can leave out.

You might, for example, opt to create an iterator that yields directory names instead of a function that returns a list; that way you don't end up iterating over the same list of names two or three times through all the various levels of code.
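
A minimal sketch of that approach, using an explicit stack and yield return so each directory name is produced once and consumed lazily (WalkDirectories is a hypothetical name, and this uses the managed Directory.GetDirectories rather than the Win32 calls above):

    // Lazy, depth-first walk: each directory is yielded as it is found, so callers
    // can stream the results instead of building and re-iterating intermediate lists.
    static IEnumerable<string> WalkDirectories(string root)
    {
        var pending = new Stack<string>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            string current = pending.Pop();
            yield return current;

            string[] children;
            try
            {
                children = Directory.GetDirectories(current);
            }
            catch (UnauthorizedAccessException)
            {
                continue; // skip directories the user cannot read
            }
            foreach (string child in children)
                pending.Push(child);
        }
    }

Callers can then write foreach (string dir in WalkDirectories(root)) and process each name as it arrives, without any level of the code materializing a full list.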