Andrea Telatin
Andrea Telatin Senior bioinformatician at the Quadram Institute Bioscience, Norwich.

Text files and the command line

Text files and the command line

Here we introduce some new commands and concepts to work with the command line, in particular for text file parsing and manipulation.

Before you start ensure you downloaded the learn_bash directory (see the first part of the tutorial.

:warning: In most of the examples we will see the use of absolute paths like ~/learn_bash/README.md, because we assume you have a learn_bash directory in your home folder. Try to practice using relative paths as well!

find

The find command is used to search for files in a directory. Unlike ls, it can search for files matching a pattern, and it can search in subdirectories.

In the Linux flavour of find, if you omit a path to scan, it will assume its the current directory. In the version of find installed with MacOS, you need to specify the path to scan, like find ..

Some examples:

  • List all the files and subdirectories or a directory:
1
2
# Remember that [tab] just means to press Tab, to autocomplete
find ~/learn_b[tab]
  • List only the subdirectories:
1
2
# If you type `-type f` you will print only the files instead
find ~/learn_bash -type d
  • List only items ending by “.faa”:
1
find ~/learn_bash -name "*.faa"

:bulb: Did you notice that find will report the absolute paths of the files?

wild characters and shell expansion

We introduced with the last command a special character, the asterisk *, that meant “any string”. In find, we must surround patterns with double quotes.

We will now try to use ls to exemplify the power of shell expansion.

  • * matches any string, including the empty string. For example, ls *.faa will match all files ending by .faa, while ls *gene* will list all files containing the string gene.
  • ? matches any single character. For example, ls file?.* will match all files starting with file followed by exactly one more character, then a dot and ending with any string. Examples are “file1.txt”, “fileZ.fasta” but not “file12.txt”.
  • [A-Z], [a-z] or [0-9] matches any character in the range. For example, ls file[0-9].* will match all files starting with file followed by a digit, then a dot and ending with any string. Examples are “file1.txt”, “file2.fasta” but not “fileB.txt”.
  • {this,that} matches any of the strings separated by commas. For example, ls {file1,file2}.txt will match both “file1.txt” and “file2.txt”, or *_R{1,2}.fastq.gz will match all files ending by “_R1.fastq.gz” and “_R2.fastq.gz”.

When we use this syntax, the command will receive the expanded list of matching files. Try with “echo” to see what happens:

1
2
3
cd ~/learn_bash/files/
echo LIST: *.png
cd -

and try to use the ls command with the same pattern.

1
ls -lh ~/learn_bash/files/*.png

Text files

wc

The wc command is used to count the number of lines, words and characters in a file. This only applies to simple text files.

1
wc ~/learn_bash/README.md

If we want, we can print only the lines (with -l), words (with -w) or characters (with -c).

:question: How many lines are there in ~/learn_bash/files/README.md?

head and tail

The head and tail commands are used to print the first or last lines of a file. By default, they print the first 10 lines. In both commands we can change the number of lines with the -n option followed by the number of lines we want.

1
2
3
4
# Example:
head -n 3 ~/learn_bash/README.md

tail -n 2  ~/learn_bash/files/cars.csv 

less

In this file we introduced wc, head and tail: they are all non-interactive commands. To interactively browse a file, we can use the less command, which has the same interface as the man command.

1
less ~/learn_bash/files/origin.txt 

A useful option is -S, which will not wrap long lines, meaning that long lines will not be split.

grep

The grep command is used to search for a pattern in a file, and it prints the lines matching the pattern. The general syntax is “grep PATTERN FILE(s)”.

1
grep "Darwin" ~/learn_bash/files/origin.txt

The pattern can be more complicated than a simple string, but we will not cover this here. To match a pattern insensitive to case, we can use the -i option.

1
grep -i "darwin" ~/learn_bash/files/origin.txt

Some useful options:

  • -c prints the number of matching lines, instead of the lines themselves.
  • -v prints the lines that do not match the pattern (inVert match)
  • -n prints the line number before the line itself.
  • -w matches only whole words (not substrings)

cut

The cut command is used to extract columns from a file. The general syntax is “cut -d DELIMITER -f FIELDS FILE(s)”. By default the delimiter is a tab, so you don’t need to specify

1
2
3
4
5
# Check the first two lines of the file first
head -n 2 ~/learn_bash/files/cars.csv

# Print only the columns model and cyl (1 and 3)
cut -d "," -f 1,3 ~/learn_bash/files/cars.csv

:question: Extract the same columns, but from the file ~/learn_bash/files/cars.tsv (tab-separated).

sed

The sed command is used to search and replace text in a file. The general syntax is “sed -e ‘s/OLD/NEW/g’ FILE(s)”.

1
2
# We can replace the "," with a "|"
sed 's/,/|/g' ~/learn_bash/files/cars.csv

Here we used an obscure syntax, which will become familiar with practice, as it’s borrowed by other tools and even programming languages: s/SOMETHING/OTHER/g. The s means “substitute”, the g means “global” (i.e. replace all the occurrences.

Try to replace only the first occurrence, removing the g:

1
sed 's/,/---/' ~/learn_bash/files/cars.csv

sort

The sort command is used to sort the lines of a file. By default, it sorts alphabetically.

1
sort ~/learn_bash/files/cars.csv

Some options:

  • -t, to specify the delimiter (in this case, a comma)
  • -k FIELDS to sort by the given fields (fields being a number or a set of numbers separated by commas)
  • -n to sort numerically (this is a switch)
  • -r to sort in reverse order (this is a switch)
1
sort -n -t, -k 2 ~/learn_bash/files/cars.csv

Typing multi-line commands

Sometimes our terminal can become too crammed with text, and we would like to type a command in multiple lines. This is possible with the \ character, which tells the shell that the command continues on the next line. This is useful in our website, to make commands clearer to read:

1
2
conda install --yes -c conda-forge \
  -c bioconda "seqfu>1.10"

:warning: Note that when you type a backslash at the end of a line and then press enter, the shell will print a different promt (usually a >), which means that the command is not finished yet. The greater-than is not to be confused with the redirection operator.