Find and when not to use globbing
Thu, Dec 10, 2015

All the time I see people trying to handle large numbers of files in the shell, and any of you who have tried this before will know that it is not pretty. Try doing an `ls *` in a folder with a few hundred thousand files and you'll be lucky to have anything happen in a reasonable time frame.
There’s a few gotcha’s that apply to these sorts of situations. The first is that using ‘*’ in the command will use shell globbing, so before executing the ls
in a folder structure like this for example:
folder
|__file1
|__file2
|__file3
the shell will expand `ls folder/*` into `ls folder/file1 folder/file2 folder/file3`, which, as you get into the thousands of files, will quickly hit the maximum length of arguments you can pass to a command. For reference, you can check the limit with `getconf ARG_MAX`. My Arch and Debian systems default to 2097152.
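As a rough sketch of where that limit bites (the directory path and file count here are made up for illustration, and the exact error text depends on your system):

```
# Check the limit on the combined size of a command's arguments (in bytes)
getconf ARG_MAX

# Create a directory with a large number of files for testing;
# xargs batches the names, so this step does not hit the limit itself
mkdir /tmp/many && cd /tmp/many
seq 1 200000 | xargs touch

# The glob below expands to 200000 file names before ls even starts;
# on many systems this fails with "Argument list too long"
ls *
```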
The second gotcha is that `ls` will `stat()` each entry. You can avoid this with `ls -f`, which disables sorting and the other checks.
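A quick way to see the difference, assuming a large directory like the `/tmp/many` one sketched above (timings are obviously machine-dependent):

```
# Plain ls reads every entry into memory so it can sort them,
# and may stat() each one (e.g. when --color is enabled via an alias)
time ls /tmp/many > /dev/null

# -f disables sorting (and implies -a), so entries are printed
# in directory order as they are read
time ls -f /tmp/many > /dev/null
```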
But the best solution to this is to use `find`. find won't do `stat()` calls by default, and it lets you work on the output as a stream instead of pulling a big list of every file into memory before doing anything. You can also selectively match files using `-name` or `-iname`, delete them with `-delete`, and execute things on each file with `-exec`.
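For example (the paths and patterns here are just placeholders):

```
# Stream every file under the directory; output starts immediately
find /tmp/many

# Case-insensitive match on the file name
find /var/log -iname '*.log'

# Run a command on each match; {} is replaced with the file name
find /tmp/many -name 'file1*' -exec gzip {} \;

# Delete matches directly
find /tmp/many -name '*.tmp' -delete
```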
On the other hand, if you're using a `for` loop, globbing works great. You're not limited by the command length, and it won't make any calls other than to list the files. If you're using a modern shell you also don't have to worry about whitespace issues: when you do something like `for i in /tmp/*`, the glob will correctly match files with whitespace in their names. If you use the output of find instead, i.e. `for i in $(find /tmp)`, the shell will split the output on the `$IFS` variable, which by default is a variety of whitespace characters.
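A quick sketch of the difference (the file names are made up for illustration):

```
# Safe: each glob match becomes exactly one word, even "a file name.txt"
for i in /tmp/*; do
    printf 'got: %s\n' "$i"
done

# Unsafe: $(find ...) is split on $IFS, so "a file name.txt"
# becomes three separate iterations
for i in $(find /tmp); do
    printf 'got: %s\n' "$i"
done

# If you do need find, a null-delimited stream avoids the splitting problem
find /tmp -print0 | while IFS= read -r -d '' i; do
    printf 'got: %s\n' "$i"
done
```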