Shell stuff I didn’t know

I generally stay away from doing anything more complicated in a shell script than making a directory and running an R script or a single binary, and especially avoid awk and sed as much as possible. However, sometimes the shell actually does offer a certain elegance and convenience (and sometimes deceitful traps).

Here are three things I only learned recently:

Stripping directory and suffix from file names

Imagine we have a project where files are named with the sample ID followed by some extension, like so:

project/data/sample1.g.vcf
project/data/sample2.g.vcf
project/data/sample3.g.vcf

Quite often, we will want to grab all the files in a directory and extract the base name of each, without the extension and without the whole path leading up to the file. There is a shell command for this called basename:

basename -s .g.vcf project/data/sample*.g.vcf
sample1
sample2
sample3

The -s flag gives the suffix to remove.
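A typical use is to collect the sample IDs once and then loop over them. What follows is only a sketch: it makes throwaway files in a temporary directory so it can be run anywhere, and the results/ output names are made up for illustration.

```shell
# Set up a throwaway copy of the example layout (hypothetical files).
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data"
touch "$workdir/project/data/sample1.g.vcf" \
      "$workdir/project/data/sample2.g.vcf" \
      "$workdir/project/data/sample3.g.vcf"

# Strip both the directory and the .g.vcf suffix from every file at once.
samples=$(basename -s .g.vcf "$workdir"/project/data/*.g.vcf)

# Use each sample ID to build a (made-up) output file name.
for sample in $samples; do
    echo "results/$sample.txt"
done

rm -rf "$workdir"
```

The unquoted $samples in the loop relies on the sample IDs not containing spaces, which holds for names like these but is an assumption worth checking in a real project.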

This is much nicer than trying to regexp it, for example with R:

library(stringr)

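# full.names = TRUE keeps the directory part that the regexp expects
files <- dir("project/data", full.names = TRUE)
# str_match returns a matrix; column 2 holds the first capture group
basename <- str_match(files, "^.*/(.+)\\.g\\.vcf")[, 2]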

Look at that second argument: "^.*/(.+)\\.g\\.vcf". What is this?! And let me tell you, that was not my first attempt at writing that regexp either. Those of us who can interpret this gibberish must acknowledge that we have learned to do so only through years of suffering.

For that matter, it’s also nicer than the bash suffix and prefix deletion syntax, which is one of those things I think one has to google every time.

for string in project/data/*.g.vcf; do
    # % removes the shortest matching suffix
    nosuffix=${string%.g.vcf}
    # # removes the shortest matching prefix
    noprefix=${nosuffix#project/data/}
    echo "$noprefix"
done
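For reference, the pattern after # can be a glob, and doubling the character to ## removes the longest match instead of the shortest, so the directory does not have to be spelled out. A minimal sketch, using one of the example paths as a plain string (no files need to exist):

```shell
string=project/data/sample1.g.vcf

# % removes the shortest matching suffix.
nosuffix=${string%.g.vcf}

# ## removes the longest matching prefix; */ matches everything
# up to and including the last slash.
noprefix=${nosuffix##*/}

echo "$noprefix"
```

This prints sample1 whatever the directory is called, which is roughly what basename does for us.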

Logging both standard out and standard error

When sending jobs off to a server to be run without you looking at them, it’s often convenient to save the output to a file. To redirect standard output to a file, use ”>”, like so:

./script_that_prints_output.sh > out_log.txt

However, there is also another output stream, standard error, used to record (among other things) error messages (in some programs; this isn’t very consistent). Therefore, we should probably log the standard error stream too. To redirect standard error to a file:

./script_that_prints_output.sh 2> error_log.txt

And to redirect both to the same file:

./script_that_prints_output.sh > combined_log.txt 2>&1

The last bit is telling the shell to redirect the standard error stream to standard out, and then both of them get captured in the file. I didn’t know until recently that one could do this.
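The order of the two redirections matters, because the shell processes them from left to right. Here is a small sketch of that, where the function noisy is a made-up stand-in for a script that writes to both streams, and the log files live in a temporary directory:

```shell
# Hypothetical stand-in for a script that writes to both streams.
noisy() {
    echo "result"
    echo "warning" >&2
}

logdir=$(mktemp -d)

# Standard output is sent to the file first, then standard error follows it there.
noisy > "$logdir/combined_log.txt" 2>&1

# Reversed order: standard error is pointed at wherever standard output
# goes at that moment (here /dev/null, via the surrounding braces), and
# only then is standard output sent to the file.
{ noisy 2>&1 > "$logdir/stdout_only.txt"; } > /dev/null

combined=$(cat "$logdir/combined_log.txt")
stdout_only=$(cat "$logdir/stdout_only.txt")
rm -rf "$logdir"

echo "$combined"
echo "$stdout_only"
```

In the first case the file holds both lines; in the second it holds only the line from standard output, and the warning never reaches it.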

The above code contained some dots, and speaking of that, here is a deceitful shell trap to trip up the novice:

The dot command (oh my, this is so bad)

When working on a certain computer system, there is a magic invocation that needs to be in the script to be able to use the module system. It should look like this:

. /etc/profile.d/modules.sh

That means ”source the script found at /etc/profile.d/modules.sh”, which will activate the module system for you.

It should not look like this:

./etc/profile.d/modules.sh
bash: ./etc/profile.d/modules.sh: No such file or directory

That means that bash tries to find a file called ”etc/profile.d/modules.sh” located in the current directory — which (probably) doesn’t exist.

If there is a space after the dot, it is a command that means the same as source, i.e. run a script from a file in the current shell. If there is no space after the dot, it is the start of a relative file path, which is also often used to run a script. I had never actually thought about it until someone took away the space after the dot, and got the above error message (plus something else more confusing, because a module was missing).
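The difference is easy to see with a toy script. In this sketch, set_greeting.sh is a made-up file in a temporary directory that does nothing but set a variable:

```shell
unset greeting
scriptdir=$(mktemp -d)

# Hypothetical script that only sets a variable.
printf '#!/bin/sh\ngreeting=hello\n' > "$scriptdir/set_greeting.sh"
chmod +x "$scriptdir/set_greeting.sh"

cd "$scriptdir"

# No space: run the file as a separate process; its variables vanish with it.
./set_greeting.sh
after_run=${greeting:-unset}

# Dot, space, path: read the file into the current shell, so the variable stays.
. ./set_greeting.sh
after_source=${greeting:-unset}

cd - > /dev/null
rm -rf "$scriptdir"

echo "$after_run $after_source"
```

The last line prints "unset hello": only the sourced version changes the shell it was called from, which is why the modules script has to be sourced rather than run.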