Monday, August 19, 2013

Recursive grep and performance comparison with find + grep

This is how you can grep recursively, starting from the current directory:

find . -name "pattern" -exec grep "search_pattern" '{}' \;

That has long been the known pattern for grepping recursively. But let me introduce you to a better approach, described in the second answer here: since grep 2.5.2, shipped in August 2006, we can use functionality built into grep itself. So, check this out:

00:47:50:~/Coding$ time greprepo "*.java" "flint" . 
./clojure/repos/my/jalint2/HelloWorld.java: Native.loadLibrary("flint", Flint.class); 
./clojure/repos/my/jalint2/HelloWorld.java: System.setProperty("jna.library.path", "../flint2"); 
./clojure/repos/rebcabin/jalint2/HelloWorld.java: Native.loadLibrary("flint", Flint.class); 
./clojure/repos/rebcabin/jalint2/HelloWorld.java: System.setProperty("jna.library.path", "../flint2");

real 0m0.134s 
user 0m0.120s 
sys 0m0.014s

00:48:25:~/Coding$ time find . -name "*.java" -exec grep flint '{}' \; 
Native.loadLibrary("flint", Flint.class); 
System.setProperty("jna.library.path", "../flint2"); 
Native.loadLibrary("flint", Flint.class); 
System.setProperty("jna.library.path", "../flint2"); 

real 0m0.545s 
user 0m0.289s 
sys 0m0.212s

Isn't it impressive? greprepo is my alias for:

alias greprepo='grep --exclude-dir ".{git,svn}" -R --mmap --include'
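If you want a variant that really skips the version-control directories, here is a minimal sketch as a shell function. It assumes GNU grep, and the name greprepo2 is made up for illustration. It passes one --exclude-dir per directory, because the shell does not brace-expand a quoted ".{git,svn}", and it leaves out --mmap so that it also runs on grep builds without that option.

```shell
# Hypothetical replacement for the greprepo alias (assumes GNU grep).
# Usage: greprepo2 '*.java' flint .
greprepo2() {
    include_glob=$1
    shift
    # One --exclude-dir per directory: grep's glob matching has no {a,b} syntax.
    grep -R --exclude-dir=.git --exclude-dir=.svn --include="$include_glob" "$@"
}
```

The remaining arguments are handed to grep untouched, so it is called the same way as the alias: file glob first, then the search pattern and the directory.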

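The performance boost is also partly due to --mmap, which is OK as long as you don't change the files you are currently grepping. Don't use mmap on a network share or on files that may change during the search. (Newer GNU grep releases have dropped mmap support and treat the option as obsolete, so check your version.) Anyway, without mmap the performance is still far more impressive than with find: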

01:02:46:~/Coding$ time grep --exclude-dir ".{git,svn}" -R --include "*.java" "flint" .
./clojure/repos/my/jalint2/HelloWorld.java:            Native.loadLibrary("flint", Flint.class);
./clojure/repos/my/jalint2/HelloWorld.java:        System.setProperty("jna.library.path", "../flint2");
./clojure/repos/rebcabin/jalint2/HelloWorld.java:            Native.loadLibrary("flint", Flint.class);
./clojure/repos/rebcabin/jalint2/HelloWorld.java:        System.setProperty("jna.library.path", "../flint2");

real 0m0.146s
user 0m0.130s
sys 0m0.015s

mmap works its magic on really huge files, and ~/Coding, as you can guess, is not where I keep huge files.

Of course, you may have noticed that I used --exclude-dir with grep, so here is one more run without it:

1:02:56:~/Coding$ time grep -R --include "*.java" "flint" .
./clojure/repos/my/jalint2/HelloWorld.java:            Native.loadLibrary("flint", Flint.class);
./clojure/repos/my/jalint2/HelloWorld.java:        System.setProperty("jna.library.path", "../flint2");
./clojure/repos/rebcabin/jalint2/HelloWorld.java:            Native.loadLibrary("flint", Flint.class);
./clojure/repos/rebcabin/jalint2/HelloWorld.java:        System.setProperty("jna.library.path", "../flint2");

real 0m0.135s
user 0m0.118s
sys 0m0.017s

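which makes me think of pre-caching :) The identical timing is also less surprising than it looks: the shell does not brace-expand the quoted ".{git,svn}", and grep's glob matching has no brace syntax, so that option excluded nothing in the runs above. Anyway, find doesn't offer an easy option to exclude directories. There is a "workaround", but I didn't figure out how to use it easily. And even then, find + grep is much slower than grep alone.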
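For the record, the find workaround is presumably the -prune action. Here is a minimal sketch of how it could be used; the function name find_grep_pruned is made up for illustration, and I have not timed it against the runs above.

```shell
# Hypothetical helper: grep files matching a name glob, skipping .git and .svn.
# Usage: find_grep_pruned '*.java' flint [dir]
find_grep_pruned() {
    name_glob=$1
    pattern=$2
    start_dir=${3:-.}
    # -prune stops find from descending into the matched directories;
    # -o ("or") applies the remaining tests to everything else.
    # "{} +" hands the files to grep in batches instead of one at a time.
    find "$start_dir" \( -name .git -o -name .svn \) -prune \
        -o -type f -name "$name_glob" -exec grep -H -- "$pattern" {} +
}
```

It works, but compared to a single --exclude-dir flag it is easy to see why this does not count as an easy option.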



UPDATE [Sep 17, 2013]



A friend pointed out that there is a slightly better way to combine find and grep than the one I used above: pairing find with xargs. So, here are all three approaches, listed from the slowest to the fastest:

0 (raspberry) 14:43:20:~/.vim$ time find . -name "*.vim" -exec grep -Hn --color=always tab '{}' \; > /dev/null

real    0m1.855s
user    0m0.370s
sys     0m1.070s

0 (raspberry) 14:43:26:~/.vim$ time find . -name "*.vim" -print0 | xargs -0 grep -Hn --color=always tab > /dev/null

real    0m0.213s
user    0m0.080s
sys     0m0.110s

0 (raspberry) 14:43:31:~/.vim$ time greprepo "*.vim" "tab" . > /dev/null

real    0m0.172s
user    0m0.080s
sys     0m0.080s

Once again, grep alone wins. However, it's worth noting that find + xargs is much faster than find -exec.
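The gap comes down to process count: -exec ... \; starts one grep per file, while xargs passes many file names to each grep. find can do the same batching by itself with the + terminator. A minimal sketch, with a made-up function name and no timings of my own:

```shell
# Hypothetical helper: like find + xargs, but find batches the file names itself.
# Usage: find_grep_batched '*.vim' tab [dir]
find_grep_batched() {
    name_glob=$1
    pattern=$2
    start_dir=${3:-.}
    # "{} +" appends as many paths as fit on one command line, so grep is
    # started once per batch rather than once per file (as with "\;").
    find "$start_dir" -type f -name "$name_glob" -exec grep -Hn -- "$pattern" {} +
}
```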

And don't forget that the grep-only approach also gives you the directory exclusions I talked about in the original post.