Traverse huge directories instantly (iteration like in JDK7, now!)

This was so trivial, I wonder why Sun didn’t do it in the first place…

It launches a new process and listens for the stdout:

  • Windows: cmd /c dir /B $path
  • Linux/Unix: /bin/ls $path

public interface DirectoryVisitor
{
   public void visit(File parent, String name);
}

public class DirectoryIterator
{
   public static void main(String[] args) throws Exception
   {
      DirectoryVisitor visitor = new DirectoryVisitor()
      {
         @Override
         public void visit(File parent, String name)
         {
            System.out.println("file: " + name);
         }
      };

      iterate(new File(args[0]), visitor);
   }

   public static void iterate(File directory, DirectoryVisitor visitor)
   {
      String path = directory.getAbsolutePath();

      if (!directory.isDirectory())
      {
         throw new IllegalStateException("file not a directory: " + path);
      }

      if (System.getProperty("os.name").toLowerCase().contains("windows"))
      {
         try
         {
            String[] cmd = new String[5];
            cmd[0] = "cmd";
            cmd[1] = "/c";
            cmd[2] = "dir";
            cmd[3] = "/B"; // by: irreversible_kev
            cmd[4] = path;
            Process p = Runtime.getRuntime().exec(cmd);
            BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
            while (true)
            {
               String line = br.readLine();
               if (line == null)
                  break;
               visitor.visit(directory, line);
            }
            int ev = p.waitFor();
            if (ev != 0)
            {
               throw new IllegalStateException("cmd exit value: " + ev);
            }
         }
         catch (Exception exc)
         {
            throw new IllegalStateException(exc);
         }
      }
      else
      {
         try
         {
            String[] cmd = new String[2];
            cmd[0] = "/bin/ls";
            cmd[1] = path;
            Process p = Runtime.getRuntime().exec(cmd);
            BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
            while (true)
            {
               String line = br.readLine();
               if (line == null)
                  break;
               visitor.visit(directory, line);
            }
            int ev = p.waitFor();
            if (ev != 0)
            {
               throw new IllegalStateException("/bin/ls exit value: " + ev);
            }
         }
         catch (Exception exc)
         {
            throw new IllegalStateException(exc);
         }
      }
   }
}

I woulda thought their listFiles() method used the same underlying code as dir and ls but never underestimate Java for getting these little things wrong for 10 years… :slight_smile:

Cas :slight_smile:

These are the results for a directory with 65536 files:

[tr]
[td]strategy [/td]
[td]first result [/td]
[td]last result[/td]
[/tr]
[tr]
[td]File.list()[/td]
[td]7010ms[/td]
[td]7010ms[/td]
[/tr]
[tr]
[td]DirectoryIterator.iterate(visitor) [/td]
[td]210ms[/td]
[td]7780ms[/td]
[/tr]

Each test was performed after a reboot, as the 2nd invocation of File.list() takes only 110ms, due to OS caching.

Both versions seem to have problems with spaces in the path.
I tried the obvious stuff with quotes, but it didn’t work out.
Will fix soon, I hope.

Fixed!

Whats this?

[quote]line = line.substring(36); // offset / magic value
[/quote]
I actually have this stupid thing, even with the sort the file access is what kills it,
besides the directory / not directory comparation.


    /**
     * Returns a flat view of a file hierarchy.
     *
     * The file list returns the files descending from sum,
     * including any in the sum array. They come
     * sorted by natural ordering for their
     * filenames for each directory only.
     * The directory list returns the directories in the given hierarchy
     * including any in the sum array. The directories are not sorted.
     * @param levels
     * @param sum
     * @param files
     * @param directories
     */
    public static void getFiles(int levels, File[] sum, List<File> files, List<File> directories) {
        Comparator<File> orderFiles = new Comparator<File>() {

            @Override
            public int compare(File o1, File o2) {
                return Strings.compareNatural(o1.getName(), o2.getName());
            }
        };

        Arrays.sort(sum, orderFiles);

        getFiles(levels, sum, files, directories, orderFiles);
    }

    private static void getFiles(int levels, File[] sum, List<File> files, List<File> directories, Comparator<File> comp) {
        List<File[]> subFilesList = new ArrayList<File[]>(50);
        //O(n^2.log n)
        for (File f : sum) {
            File[] subFiles = f.listFiles();
            if (subFiles == null) {
                files.add(f);
            } else {
                directories.add(f);
                Arrays.sort(subFiles, comp);
                subFilesList.add(subFiles);
            }
        }

        if (levels > 0) {
            for (File[] subFiles : subFilesList) {
                getFiles(levels - 1, subFiles, files, directories, comp);
            }
        }
    }

[/quote]
Just open the command-prompt, type ‘dir’ and hit enter. You’ll see where that 36 comes from.

If you want to recursively iterate (lazily / non-blocking) over a directory-structure:


public class FileUtil
{
   public static Iterator<File> getFileHierachyIterator(File dir)
   {
      return FileUtil.getFileHierachyIterator(dir, null);
   }

   public static Iterator<File> getFileHierachyIterator(File dir, FileFilter filter)
   {
      return FileUtil.getFileHierachyIterator(dir, filter, null);
   }

   public static Iterator<File> getFileHierachyIterator(final File dir, final FileFilter filter, final Comparator<File> sorter)
   {
      List<File> files = new ArrayList<File>();

      File[] ff = dir.listFiles(filter);
      if (ff == null) // access denied, very unlikely
         return new ArrayList<File>().iterator();
      for (File file : ff)
         files.add(file);

      if (sorter != null)
         Collections.sort(files, sorter);

      final Iterator<File> it = files.iterator();

      return new Iterator<File>()
      {
         private Iterator<File> subs;

         public boolean hasNext()
         {
            if (subs != null && subs.hasNext())
               return true;
            return it.hasNext();
         }

         public File next()
         {
            if (subs == null || !subs.hasNext())
            {
               File file = it.next();

               if (file.isDirectory())
               {
                  subs = FileUtil.getFileHierachyIterator(file, filter, sorter);
               }
               else
               {
                  subs = null;
               }
               return file;
            }

            if (subs != null)
            {
               if (subs.hasNext())
                  return subs.next();
               subs = null;
            }
            return it.next();
         }

         public void remove()
         {
            throw new UnsupportedOperationException("hm... would you really want that??");
         }
      };
   }
}

Minor issue but

dir /B

should help you get rid of some code

:o Thanks.

Fixed.

BTW, do you have any idea why sorting many files based in lastModified() takes forever and ever? I already have the files in a map/list like structure, but when actually getting the file time it goes forever and ever.

Any trick for that?

Because you MUST NOT read the lastModified() again and again for each comparison. Cache it somewhere, as Map<File, Long> will solve it

So obvious. Thanks. No way to avoid that first hit though right?