Are basic POSIX utilities parallelized?

You can get a first impression by checking whether the utility is linked with the pthread library. Any dynamically linked program that uses OS threads should use the pthread library.

ldd /bin/grep | grep -F libpthread.so

So for example on Ubuntu:

for x in $(dpkg -L coreutils grep findutils util-linux | grep /bin/); do if ldd "$x" | grep -q -F libpthread.so; then echo "$x"; fi; done

However, this produces a lot of false positives from programs that are linked with a library that is itself linked with pthread. For example, /bin/mkdir on my system is linked with PCRE (I don't know why…), which is itself linked with pthread. But mkdir is not parallelized in any way.

In practice, checking whether the executable itself contains the string pthread gives more reliable results. It could miss executables whose parallel behavior is entirely contained in a library, but basic utilities typically aren't designed that way.

dpkg -L coreutils grep findutils util-linux | grep /bin/ | xargs grep pthread
Binary file /usr/bin/timeout matches
Binary file /usr/bin/sort matches

So the only tool that actually has a chance of being parallelized is sort. (timeout only links to libpthread because it links to librt.) GNU sort does work in parallel: the number of threads can be configured with the --parallel option, and by default it uses one thread per processor up to 8. (Using more processors gives less and less benefit as the number of processors increases, tapering off at a rate that depends on how parallelizable the task is.)
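
As a quick illustration (bigfile.txt is a placeholder for any large input), you can time GNU sort with different thread counts:

time sort --parallel=1 bigfile.txt > /dev/null
time sort --parallel=4 bigfile.txt > /dev/null

On a multi-core machine, the second run should be noticeably faster, subject to the diminishing returns mentioned above.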

grep isn't parallelized at all. The PCRE library links to the pthread library only because it provides thread-safe functions that use locks, and the lock manipulation functions live in the pthread library.

The typical simple way to benefit from parallelization when processing a large amount of data is to split the data into pieces and process the pieces in parallel. In the case of grep, keep file sizes manageable (for example, if they're log files, rotate them often enough) and call separate instances of grep on each file (for example with GNU Parallel, as sketched below). Note that grepping is usually IO-bound (it's only CPU-bound if you have a very complicated regex, or if you hit some Unicode corner cases of GNU grep where it has bad performance), so you're unlikely to get much benefit from many parallel instances.
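
For instance, a minimal sketch with GNU Parallel, running one grep per file (the pattern and the *.log glob are placeholders):

parallel grep -H 'some pattern' ::: *.log

The -H option prints the file name with each match, so you can still tell which file each match came from.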


Another way to find an answer is to use something like sysdig to examine the system calls executed by a process. For example, if you want to see if rm creates any threads (via the clone system call), you could do:

# sysdig proc.name=rm and evt.type=clone and evt.dir='<'

With that running, I did:

$ mkdir foo
$ cd foo
$ touch {1..9999}
$ rm *

And saw no clones -- no threading there. You could repeat this experiment for other tools, but I don't think you'll find that they're threaded.

Note that clone() underpins fork() as well, so if a tool starts some other process (e.g., find ... -exec), you'd see that output too. The flags will differ from the "create a new thread" use case:

# sysdig proc.name=find and evt.type=clone and evt.dir='<'
...
1068339 18:55:59.702318832 2 find (2960545) < clone res=0 exe=find args=/tmp/foo.-type.f.-exec.rm.{}.;. tid=2960545(find) pid=2960545(find) ptid=2960332(find) cwd= fdlimit=1024 pgft_maj=0 pgft_min=1 vm_size=9100 vm_rss=436 vm_swap=0 comm=find cgroups=cpuset=/.cpu=/user.slice.cpuacct=/user.slice.io=/user.slice.memory=/user.slic... flags=25165824(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID) uid=1026 gid=1026 vtid=2960545(find) vpid=2960545(find)
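
If you don't have sysdig, strace can show the same thing. As a hedged sketch (bigfile.txt is a placeholder, and GNU sort only spawns threads when the input is large enough to be worth it; newer glibc may create threads via clone3, hence the extra syscall name):

strace -f -e trace=clone,clone3 sort --parallel=2 bigfile.txt > /dev/null

Thread-creating clones carry flags like CLONE_VM|CLONE_THREAD, in contrast to the fork-style flags in the output above.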

See xargs or GNU parallel for how to run multiple processes in parallel.
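
For example, with GNU xargs you can cap the number of concurrent processes with -P (the ERROR pattern and *.log files are placeholders):

printf '%s\0' *.log | xargs -0 -P 4 -n 1 grep -H ERROR

This runs up to four grep processes at a time, one file per process.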

However, the parallelisable part tends toward zero time as more processes are added. What remains is the non-parallelisable part, which will not get any faster. So there is a limit to how much a task can be sped up by adding more processes, and you can very quickly reach the point where adding processes makes very little difference.
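
This is Amdahl's law: if a fraction p of the work parallelises, then with n processes

speedup(n) = 1 / ((1 - p) + p/n)

As a worked example, with p = 0.9, 8 processes give a speedup of about 4.7, and even infinitely many processes can never exceed 1/(1 - p) = 10.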

Then there is communication overhead: coordinating processes costs time, so each added process carries a cost as well as a benefit. Once the cost of adding a process exceeds the benefit it brings, adding more processes makes the task slower overall.