2022-10-24

1. Hello Shell!

Learning Objectives

  • Understand how the shell relates to the keyboard, the screen, the operating system, and users’ programs.
  • Understand when and why command-line interfaces should be used instead of graphical interfaces.

Telling the computer what to do

  • Human-computer interfaces (HCI)
    • Graphical User Interface (GUI)
    • Command-Line Interface (CLI)

What is the shell?

  • A program where users can type commands
    • simple commands (e.g. cd)
    • invoke complex programs (e.g. climate models)
  • Many variants
    • bash (Bourne Again SHell)
    • zsh (Z SHell)

Why use the shell?

  • GUIs are intuitive, but do not scale; the shell scales
  • Many bioinformatics tools can only be used via CLI
    • (Systems like Galaxy try to wrap these tools)
  • Automatable
    • package, repeat and reproduce analyses as scripts
    • combine tools into pipelines for large data volumes
    • high action-to-keystroke ratio
  • Many bioinformatics analyses require remote server access, through the shell
  • Disadvantages
    • primarily textual
    • sometimes cryptic

2. Getting started with the shell

Getting started

  • Start your terminal (if not already started)

LIVE DEMONSTRATION

$
$ ls

3. Story time

Nelle’s Pipeline

  • Nelle Nemo: Marine Biologist to the stars
    • sampling gelatinous marine life in the Great Pacific Garbage Patch (North Pacific Gyre)
    • 1520 samples assayed for relative abundance of 300 proteins
    • analyse these results with goostats.sh (a program): 30s per sample
  • navigate to file/directory
  • create file/directory
  • check file size
  • iterate over many files
  • chain commands into a pipeline or script

4. Navigating in the shell

Learning objectives

  • Understand the similarities and differences between a file and a directory.
  • Be able to translate an absolute path into a relative path and vice versa.
  • Be able to construct absolute and relative paths that identify specific files and directories.
  • Be able to use options and arguments to change the behaviour of a shell command.
  • Be able to demonstrate the use of tab completion and explain its advantages.

The File System

  • Part of the Operating System
    • organises data into
      • files (hold data)
      • directories/folders (hold files or directories)
  • The shell has commands to
    • create
    • inspect
    • rename
    • delete

LIVE DEMONSTRATION

Home Directory

  • Your account’s home
    • Location varies by operating system
      • Linux: /home/<USERNAME>
      • macOS: /Users/<USERNAME>
      • Windows: C:\Documents and Settings\<USERNAME> or C:\Users\<USERNAME>

Return to your home directory with either command:

cd
cd ~

Nelle’s Filesystem

Nelle's filesystem structure: The root directory contains bin, data, Users, and tmp directories.
  • Nelle’s home: /Users/nelle

Nelle’s Filesystem

Nelle's home directories structure: There are users imhotep, larry, and nelle.
  • Nelle’s home: /Users/nelle

Exploring your filesystem

LIVE DEMONSTRATION

ls
ls -F
ls --help
man ls
ls -j
ls -ltrh
ls -F Desktop
cd Desktop
pwd
cd ..
ls -Fa
cd
cd Desktop/shell-lesson-data/exercise-data
cd ~
cd /Users/<USERNAME>/Desktop/shell-lesson-data
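The difference between absolute and relative paths can be sketched in a throwaway directory. The directory names used here (project, data) are hypothetical and exist only for this illustration; they are not part of shell-lesson-data.

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/project/data"

# Absolute path: spelled out in full from the root, works from anywhere.
cd "$scratch/project/data"
here_absolute=$(pwd)

# Relative path: interpreted from the current directory.
cd ..        # up one level, into project
cd data      # back down into data
here_relative=$(pwd)

echo "$here_absolute"
echo "$here_relative"    # the same directory, reached two different ways
```

Both echo lines print the same location, because a relative path is just a shorter way of naming a place from where you currently are.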

Challenge 01 (2min)

Starting from /Users/amanda/data, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?

  1. cd .
  2. cd /
  3. cd /home/amanda
  4. cd ../..
  5. cd ~
  6. cd home
  7. cd ~/data/..
  8. cd
  9. cd ..

Challenge 02 (2min)

Using the filesystem diagram below, if pwd displays /Users/thing, what will ls -F ../backup display?

Filesystem diagram for challenge 02
  1. ../backup: No such file or directory
  2. 2012-12-01 2013-01-08 2013-01-27
  3. 2012-12-01/ 2013-01-08/ 2013-01-27/
  4. original/ pnas_final/ pnas_sub/

5. Syntax

Shell command syntax

Consider the command $ ls -F /

Parts of a shell command, highlighted to indicate command, prompt, option, and argument
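As a rough sketch of how the parts fit together, the same command shape can be tried on a scratch directory instead of /. The names notes.txt and results are made up for this example.

```shell
# Build a small scratch directory with one file and one subdirectory.
scratch=$(mktemp -d)
mkdir "$scratch/results"
touch "$scratch/notes.txt"

# command: ls    option: -F    argument: the directory to list
listing=$(ls -F "$scratch")
echo "$listing"
```

With -F, ls adds a trailing / to directory names, so results/ is marked as a directory while notes.txt is left plain.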

Upper and lower case matter

ls -s
ls -S

Nelle’s files

  • There is a parent directory: north-pacific-gyre
  • Each sample is labelled:
    • unique ten-character ID (e.g. “NENE01729A”) for location, time, depth, etc.
    • files named, e.g. NENE01729A.txt
    • all 1520 files in same directory

LIVE DEMONSTRATION

6. Working with files and directories

Learning Objectives

  • To be able to create a directory hierarchy that matches a given diagram.
  • To be able to create files in that hierarchy using an editor or by copying and renaming existing files.
  • To be able to delete, copy and move specified files and/or directories.

Creating directories

LIVE DEMONSTRATION

pwd
cd exercise-data/writing/
mkdir thesis
mkdir -p ../project/data ../project/results
ls -FR ../project

Good names for files and directories:

  • are meaningful
  • don’t have spaces
  • don’t begin with a - (dash or hyphen)
  • use letters, numbers, period, dash, and underscore

Create a text file

LIVE DEMONSTRATION

cd thesis
nano draft.txt

Text editor choices

  • nano: simple, basic, but extendable
  • emacs: extremely powerful, steep learning curve
  • vim: extremely powerful, steep learning curve
  • Gedit: GUI text editor on Linux
  • Notepad++: GUI text editor on Windows
  • Visual Studio Code (VSCode): cross-platform text editor

Control Keys

You may see instructions to hold the Control/Ctrl key written in several ways. E.g. to hold down Control and press X you might be instructed to press:

  • Control-X
  • Control+X
  • Ctrl-X
  • Ctrl+X
  • ^X
  • C-x

Filename extensions

Files often have two parts, separated by a dot/period, e.g. myfile.txt.

  • myfile is the filestem
  • txt is the extension

The extension is an indicator of the data the file contains, but it is only a convention. The extension does not determine the content of the file.

You can’t turn your project thesis into an audiobook by renaming thesis.txt to thesis.mp3.
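A minimal sketch of this point, using a made-up one-line file: renaming changes only the name, and a byte-for-byte comparison shows the contents are untouched.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "placeholder thesis text" > thesis.txt
cp thesis.txt thesis-copy.txt     # keep a copy to compare against
mv thesis.txt thesis.mp3          # only the name changes

# cmp -s prints nothing and succeeds only if the files are identical.
if cmp -s thesis-copy.txt thesis.mp3; then
    same=yes
else
    same=no
fi
echo "contents identical after renaming: $same"
```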

Moving/renaming files and directories

LIVE DEMONSTRATION

cd ~/Desktop/shell-lesson-data/exercise-data/writing
mv thesis/draft.txt thesis/quotes.txt
mv -i thesis/quotes.txt .
ls thesis/quotes.txt

Be careful: mv silently overwrites an existing destination file. Use mv -i to be asked for confirmation first.
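The overwriting behaviour can be seen safely in a scratch directory. The file names below are hypothetical, and the existence check at the end is just one possible guard for non-interactive use, not the only way to protect a file.

```shell
scratch=$(mktemp -d)
cd "$scratch"
echo "old notes" > notes.txt
echo "new draft" > draft.txt

mv draft.txt notes.txt          # no warning: the old notes.txt is replaced
after_move=$(cat notes.txt)
echo "$after_move"

# A cautious pattern: only move if the destination name is free.
echo "another draft" > draft2.txt
if [ -e notes.txt ]; then
    echo "notes.txt exists, not overwriting"
else
    mv draft2.txt notes.txt
fi
```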

Copying files and directories

LIVE DEMONSTRATION

cp quotes.txt thesis/quotations.txt
ls quotes.txt thesis/quotations.txt
cp -r thesis thesis_backup
ls thesis thesis_backup

Challenge 03 (2min)

Suppose that you created a plain-text file in your current directory to contain a list of the statistical tests you will need to do to analyze your data, and named it: statstics.txt

After creating and saving this file, you realize you misspelled the filename! Which of the following commands could you use to correct the mistake?

  1. cp statstics.txt statistics.txt
  2. mv statstics.txt statistics.txt
  3. mv statstics.txt .
  4. cp statstics.txt .

Removing files and directories

LIVE DEMONSTRATION

rm quotes.txt
ls quotes.txt

Deleting is forever

  • There is no recycle bin in the shell
rm -i thesis_backup/quotations.txt

Operations on multiple files and directories

LIVE DEMONSTRATION

mkdir backup
cp creatures/minotaur.dat creatures/unicorn.dat backup/
ls *.pdb
ls p*.pdb
ls ?ethane.pdb
ls *ethane.pdb
ls ???ane.pdb
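The wildcard patterns above can be explored without the lesson data. This sketch creates empty stand-in files (the set of names is an assumption chosen for the example) and uses echo to show what the shell expands each pattern to before any command runs.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

# The shell expands each pattern; echo just prints the resulting names.
p_star=$(echo p*.pdb)           # names starting with p
one_char=$(echo ?ethane.pdb)    # exactly one character before "ethane"
any_chars=$(echo *ethane.pdb)   # zero or more characters before "ethane"
three=$(echo ???ane.pdb)        # exactly three characters before "ane"

echo "$p_star"
echo "$one_char"
echo "$any_chars"
echo "$three"
```

Note that * can match nothing at all, while each ? must match exactly one character.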

Challenge 04 (2min)

When run in the proteins directory, which ls command(s) will produce the output:

ethane.pdb methane.pdb
  1. ls *t*ane.pdb
  2. ls *t?ne.*
  3. ls *t??ne.pdb
  4. ls ethane.*

7. Pipes and Filters

The wc command

  • Starting in shell-lesson-data/exercise-data/proteins

The .pdb extension indicates that these are Protein DataBank structure files

LIVE DEMONSTRATION

ls proteins
cd proteins
wc cubane.pdb
wc *.pdb
wc -l *.pdb

Redirection

  • By default the output from wc is written to the terminal, but we can redirect it to a file with the > (greater-than sign, or right angle bracket) character

LIVE DEMONSTRATION

wc -l *.pdb > lengths.txt
ls lengths.txt
cat lengths.txt
less lengths.txt
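A small sketch of redirection with made-up files (one.txt, three.txt and log.txt are hypothetical names). It also shows one thing worth knowing early: redirecting with > into an existing file replaces that file's contents.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'a\nb\nc\n' > three.txt
printf 'a\n' > one.txt

# The counts go into lengths.txt instead of appearing on the terminal.
wc -l one.txt three.txt > lengths.txt
cat lengths.txt

# Redirecting to an existing file replaces what was there before.
echo "first" > log.txt
echo "second" > log.txt
cat log.txt
```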

The sort command

LIVE DEMONSTRATION

sort ../numbers.txt
sort -n ../numbers.txt
sort -n lengths.txt
sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt
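The difference between sort and sort -n is easiest to see on a handful of numbers. The values below are sample data written for this sketch.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '10\n2\n19\n22\n6\n' > numbers.txt

as_text=$(sort numbers.txt)         # compared character by character
as_numbers=$(sort -n numbers.txt)   # compared by numerical value

echo "$as_text"
echo "$as_numbers"

# The smallest value is the first line of the numerically sorted file.
sort -n numbers.txt > sorted.txt
smallest=$(head -n 1 sorted.txt)
echo "$smallest"
```

Without -n, 10 and 19 sort before 2 because the character 1 comes before the character 2.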

Piping output to another command

  • The pipe (|) symbol is like the redirect, except it passes the output to another command, rather than to a file.

LIVE DEMONSTRATION

sort -n lengths.txt | head -n 1
wc -l *.pdb | sort -n
wc -l *.pdb | sort -n | head -n 1

Tools and commands in Unix are designed to be linked together in this way, as pipelines. This power is one of the reasons that Unix has been so successful.
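To make the comparison concrete, here is a sketch that finds the shortest file twice: once with intermediate files and once as a single pipeline. The three .pdb files are empty-ish stand-ins created for the example, not real structure files.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '1\n2\n3\n' > long.pdb
printf '1\n2\n' > medium.pdb
printf '1\n' > short.pdb

# Step by step, with intermediate files:
wc -l *.pdb > lengths.txt
sort -n lengths.txt > sorted-lengths.txt
via_files=$(head -n 1 sorted-lengths.txt)

# The same steps as one pipeline, with no intermediate files:
via_pipe=$(wc -l *.pdb | sort -n | head -n 1)

echo "$via_files"
echo "$via_pipe"
```

Both approaches report short.pdb; the pipeline simply avoids leaving lengths.txt and sorted-lengths.txt behind.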

Piping output to another command

Three images, visualising the actions in the three pipe examples used above.

Challenge 05 (2min)

In our current directory, we want to find the 3 files that have the fewest lines. Which of the commands listed below would work?

  1. wc -l * > sort -n > head -n 3
  2. wc -l * | sort -n | head -n 1-3
  3. wc -l * | head -n 3 | sort -n
  4. wc -l * | sort -n | head -n 3

Nelle’s pipeline

  • Nelle has created 17 files in the north-pacific-gyre directory
  • She checks the output
cd north-pacific-gyre
wc -l *.txt
wc -l *.txt | sort -n | head -n 5

One of the files is too short! Do any of the files have too much data?

wc -l *.txt | sort -n | tail -n 5

There’s a convention that files with missing data end with Z

ls *Z.txt

8. Loops

Learning objectives

  • To be able to write a loop that applies one or more commands separately to each file in a set of files.
  • To be able to trace the values taken on by a loop variable during execution of the loop.
  • Understand the difference between a variable’s name and its value.
  • Understand why spaces and some punctuation characters shouldn’t be used in file names.
  • To be able to see what commands have recently been executed.
  • To be able to re-run recently executed commands without retyping them.

Loops

  • Loops are a flow control mechanism in programming
  • Loops allow us to repeat a command or set of commands for each item in a list
    • Apply to any number of files or items
  • Reduce the amount of typing needed (reduces mistakes!)

Loop motivation

Let’s look in exercise-data/creatures

head -n 5 basilisk.dat minotaur.dat unicorn.dat
  • common name
  • classification
  • date
  • DNA sequence

We want the classification for each species…

Loop example

for thing in list_of_things
do
    operation_using $thing    # Indentation not required, but readable
done

We need to apply this structure to our creature files

$ for filename in basilisk.dat minotaur.dat unicorn.dat
> do
>     head -n 2 $filename | tail -n 1
> done

The shell prompt changes from $ to > to let us know that we haven’t finished typing our command yet.
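The same loop can be written in a script file, where no prompts appear. This sketch uses two made-up files (a.dat, b.dat) whose second line stands in for the classification; the real creature files are not reproduced here.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'name-A\nclass-A\ndate-A\n' > a.dat
printf 'name-B\nclass-B\ndate-B\n' > b.dat

for filename in a.dat b.dat
do
    # head keeps the first two lines; tail keeps the last of those two
    head -n 2 "$filename" | tail -n 1
done > classes.txt     # collect everything the loop prints in one file

cat classes.txt
```

On each pass through the loop, $filename holds the next name in the list, so the body runs once for a.dat and once for b.dat.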

Challenge 06 (5min)

How would you write a loop that echoes all 10 numbers from 0 to 9?

A more complicated loop

We want to see some of the middle of the sequence for each creature.

$ for filename in *.dat
> do
>     echo $filename
>     head -n 100 $filename | tail -n 20
> done

Note that

  • we can use the wildcard to indicate a list of files
  • filenames with spaces would look like they were multiple files (e.g. for filename in red dragon.txt)
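A short sketch of the problem with spaces, using two hypothetical files. Typing the names out unquoted gives the loop too many items, while a wildcard combined with a quoted "$filename" keeps each name in one piece.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch "red dragon.dat" unicorn.dat    # two files, one with a space in its name

# Names typed out unquoted: the shell splits on the space, giving three items.
typed=0
for filename in red dragon.dat unicorn.dat
do
    typed=$((typed + 1))
done

# Wildcard plus quotes: each matching name stays whole.
matched=0
for filename in *.dat
do
    if [ -e "$filename" ]; then
        matched=$((matched + 1))
    fi
done

echo "typed list: $typed items, wildcard: $matched files"
```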

Modifying files in a loop

We want to modify each file as we loop over it, but keep the original data, e.g. copy the original basilisk.dat file to original-basilisk.dat and then put modified data into basilisk.dat.

We can’t use the following:

cp *.dat original-*.dat

because this is equivalent to

cp basilisk.dat minotaur.dat unicorn.dat original-*.dat

Instead we use a loop:

$ for filename in *.dat
> do
>     cp $filename original-$filename
> done

Nelle’s pipeline

Nelle can now process her data files with goostats.sh (a shell script her supervisor wrote).

This script takes two arguments: an input file (the raw data), and the output file (calculated statistics).

LIVE DEMONSTRATION

  • The up arrow allows you to recover previous commands
  • Alt/Option on macOS allows you to place the cursor with the mouse/trackpad
  • Ctrl-a takes you to the start of a line
  • Ctrl-e takes you to the end of a line

9. Shell Scripts

Learning Objectives

  • To be able to write a shell script that runs a command or series of commands for a fixed set of files.
  • To be able to run a shell script from the command line.
  • To be able to write a shell script that operates on a set of files defined by the user on the command line.
  • To be able to create pipelines that include shell scripts you, and others, have written.

Shell scripts

We can take commands we repeat frequently and save them in files so they can be re-run with a single command. Such files are called shell scripts (these are programs).

  • faster (less retyping)
  • fewer errors (less retyping)
  • more reproducible (others can run your scripts)
  • build more work on top of it

My first script

LIVE DEMONSTRATION

cd proteins
nano middle.sh
head -n 15 octane.pdb | tail -n 5
bash middle.sh

Script files must be in plain text

Word files, etc. are not plain text

Let’s have an argument

LIVE DEMONSTRATION

nano middle.sh
head -n 15 "$1" | tail -n 5
bash middle.sh octane.pdb
bash middle.sh pentane.pdb

We put double quotes around "$1" so that the script still works if the filename contains spaces or other special characters.
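The effect of the quotes can be checked with two tiny scripts that differ only in quoting. This is a sketch: quoted.sh, unquoted.sh and the data file name are hypothetical, and the scripts are written with a here-document (cat > file <<'EOF') only so the example is self-contained; in the lesson you would type them in nano.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'line1\nline2\nline3\n' > "my data.txt"

cat > quoted.sh <<'EOF'
head -n 1 "$1"
EOF

cat > unquoted.sh <<'EOF'
head -n 1 $1
EOF

# With quotes, the whole name reaches head as one argument.
quoted_out=$(bash quoted.sh "my data.txt")
echo "$quoted_out"

# Without quotes, head is asked for two files, "my" and "data.txt", and fails.
if bash unquoted.sh "my data.txt" 2>/dev/null; then
    unquoted_ok=yes
else
    unquoted_ok=no
fi
echo "unquoted version succeeded: $unquoted_ok"
```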

Accepting parameters

LIVE DEMONSTRATION

nano middle.sh
head -n "$2" "$1" | tail -n "$3"
bash middle.sh pentane.pdb 15 5
bash middle.sh pentane.pdb 20 5

Adding documentation

LIVE DEMONSTRATION

# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
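One possible extension, offered as a sketch rather than as part of the lesson's script: check that exactly three arguments were given and print the usage line otherwise. The argument check and the sample file numbers.txt (the numbers 1 to 30) are assumptions added for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
seq 1 30 > numbers.txt      # sample file: one number per line

cat > middle.sh <<'EOF'
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
if [ "$#" -ne 3 ]; then
    echo "Usage: bash middle.sh filename end_line num_lines" >&2
    exit 1
fi
head -n "$2" "$1" | tail -n "$3"
EOF

bash middle.sh numbers.txt 15 5     # lines 11 to 15
bash middle.sh numbers.txt 20 5     # lines 16 to 20
```

Here $# is the number of arguments passed to the script, so forgetting one produces a reminder instead of a confusing error from head.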

Processing many files

LIVE DEMONSTRATION

wc -l *.pdb | sort -n

We use the special variable $@, which means “all command-line arguments”

# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
bash sorted.sh *.pdb ../creatures/*.dat
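A sketch of why "$@" is written with quotes: each argument is passed on as a single item, even if it contains a space. The two files here are made up for the example, and the script is created with a here-document only to keep the example self-contained.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '1\n2\n3\n' > long.pdb
printf '1\n' > "short file.pdb"

cat > sorted.sh <<'EOF'
# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
EOF

result=$(bash sorted.sh *.pdb)
echo "$result"
```

The shortest file is listed first and the total last, and the name containing a space is handled like any other.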

Nelle’s Pipeline

nano do-stats.sh
# Calculate stats for data files.
for datafile in "$@"
do
    echo "$datafile"
    bash goostats.sh "$datafile" "stats-$datafile"
done
bash do-stats.sh NENE*A.txt NENE*B.txt
bash do-stats.sh NENE*A.txt NENE*B.txt | wc -l

10. Finding Things

Learning Objectives

  • To be able to use grep to select lines from text files that match simple patterns.
  • To be able to use find to find files and directories whose names match simple patterns.
  • To be able to use the output of one command as the command-line argument(s) to another command.
  • Understand what is meant by ‘text’ and ‘binary’ files, and why many common tools don’t handle the latter well.

grep

  • grep is a very useful command-line tool that lets us find lines in files that match a pattern we want to look for

grep is short for global/regular expression/print

LIVE DEMONSTRATION

cd
cd Desktop/shell-lesson-data/exercise-data/writing
cat haiku.txt
grep not haiku.txt
grep The haiku.txt
grep -w The haiku.txt
grep -w "is not" haiku.txt
grep -n "it" haiku.txt
grep -n -w "the" haiku.txt
grep -nwi "the" haiku.txt
grep -nwv "the" haiku.txt
grep -r Yesterday .
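The main grep options can be tried on a three-line sample. The text below is made up for this sketch and is not the lesson's haiku.txt.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'The shell is not a GUI\nthe pipe is thin\nThen there is output\n' > sample.txt

substring=$(grep The sample.txt)           # also matches "Then"
whole_word=$(grep -w The sample.txt)       # only the standalone word "The"
numbered=$(grep -n -w -i the sample.txt)   # line numbers, ignoring case
inverted=$(grep -v -w -i the sample.txt)   # lines that do NOT contain the word

echo "$substring"
echo "$whole_word"
echo "$numbered"
echo "$inverted"
```

The options can be combined in one group, so grep -nwi is the same as grep -n -w -i.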

find

  • The find command finds files in the filesystem

LIVE DEMONSTRATION

cd shell-lesson-data/exercise-data
find .
find . -type d
find . -type f
find . -name *.txt
find . -name "*.txt"
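The last two commands look alike but behave differently, which this sketch reproduces in a scratch directory (the layout is a simplified assumption: one .txt file at the top level and one in a subdirectory).

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir writing
touch numbers.txt writing/haiku.txt

# Unquoted: the shell expands *.txt to numbers.txt before find runs,
# so find only looks for files named exactly numbers.txt.
unquoted=$(find . -name *.txt)

# Quoted: find receives the pattern itself and applies it in every directory.
quoted=$(find . -name "*.txt" | sort)

echo "$unquoted"
echo "$quoted"
```

Quoting the pattern is the safer habit, because the unquoted form depends on which files happen to be in the current directory.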

Using command output as input

  • We can use the list of files produced by find as the input to other commands

LIVE DEMONSTRATION

wc -l $(find . -name "*.txt")
wc -l ./writing/LittleWomen.txt ./writing/haiku.txt ./numbers.txt
grep "searching" $(find . -name "*.txt")
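A self-contained sketch of command substitution, with made-up files. The $(...) part runs first, and its output is pasted into the outer command as arguments. One caveat, stated as an assumption of this example: the file names contain no spaces, since the pasted output is split on whitespace.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir writing
printf 'a\nb\n' > numbers.txt
printf 'searching for meaning\nnothing here\n' > writing/notes.txt

# Count lines in every .txt file that find reports.
wc -l $(find . -name "*.txt")

# Search the same set of files for a word.
matches=$(grep "searching" $(find . -name "*.txt"))
echo "$matches"
```

Because grep is given more than one file, it prefixes each matching line with the name of the file it came from.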