Text files and the command line

Here we introduce some new commands and concepts to work with the command line, in particular for text file parsing and manipulation.
Before you start, make sure you have downloaded the learn_bash directory (see the first part of the tutorial).
In most of the examples we will see the use of absolute paths like ~/learn_bash/README.md, because we assume you have a learn_bash directory in your home folder. Try to practice using relative paths as well!
find
The find command is used to search for files in a directory. Unlike ls, it can search for files matching a pattern, and it searches subdirectories as well.
In the Linux flavour of find, if you omit the path to scan, it will assume it's the current directory. In the version of find installed with macOS, you need to specify the path to scan, for example find . (where the dot means the current directory).
Some examples:
- List all the files and subdirectories of a directory:
# Remember that [tab] just means to press Tab, to autocomplete
find ~/learn_b[tab]
- List only the subdirectories:
# If you type `-type f` you will print only the files instead
find ~/learn_bash -type d
- List only the items whose name ends with “.faa”:
find ~/learn_bash -name "*.faa"
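Did you notice that find reports the absolute paths of the files? It builds every result from the starting path you gave it, and ~ expands to the absolute path of your home directory. With a relative starting path (like find .) the results would be relative too.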
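The options above can also be combined. Here is a minimal sketch that does not touch the learn_bash directory: it builds a throwaway directory with mktemp, and all the file and directory names in it are made up for illustration.

```shell
# Build a small throwaway directory tree (all names are hypothetical)
demo=$(mktemp -d)
mkdir -p "$demo/genomes/old"
touch "$demo/notes.txt" "$demo/genomes/a.faa" "$demo/genomes/old/b.faa"

# Only regular files whose name ends with .faa, at any depth
faa_files=$(find "$demo" -type f -name "*.faa" | sort)
echo "$faa_files"

# Only directories (the starting directory itself is included)
dir_count=$(find "$demo" -type d | wc -l)
echo "directories: $dir_count"

# Clean up the throwaway directory
rm -r "$demo"
```

The first command should list the two .faa files, and the second should count three directories: the starting one, genomes and genomes/old.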
Wildcard characters and shell expansion
With the last command we introduced a special character, the asterisk *, which means “any string”.
In find, we must surround the pattern with quotes, so that the shell passes it to find unchanged instead of expanding it first.
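To see why the quotes matter, here is a small sketch in a throwaway directory (file names are made up). It compares the quoted pattern with the unquoted one when a matching file exists in the current directory.

```shell
# Throwaway directory with one .faa file at the top and one in a subdirectory
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/top.faa" "$demo/sub/deep.faa"
cd "$demo"

# Quoted: find receives the pattern *.faa and applies it in every directory
quoted=$(find . -name "*.faa" | sort)
echo "$quoted"

# Unquoted: the shell expands *.faa to top.faa before find starts,
# so find only looks for files named exactly top.faa
unquoted=$(find . -name *.faa | sort)
echo "$unquoted"

cd - > /dev/null
rm -r "$demo"
```

The quoted version should report both files, while the unquoted one misses sub/deep.faa. With several matching files in the current directory, the unquoted form usually fails with an error instead.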
We will now try to use ls to exemplify the power of shell expansion.
- * matches any string, including the empty string. For example, ls *.faa will match all files ending with .faa, while ls *gene* will list all files containing the string gene.
- ? matches any single character. For example, ls file?.* will match all files starting with file followed by exactly one more character, then a dot, and ending with any string. Examples are “file1.txt”, “fileZ.fasta” but not “file12.txt”.
- [A-Z], [a-z] or [0-9] matches any single character in the range. For example, ls file[0-9].* will match all files starting with file followed by a digit, then a dot, and ending with any string. Examples are “file1.txt”, “file2.fasta” but not “fileB.txt”.
- {this,that} expands to each of the strings separated by commas, whether or not a matching file exists. For example, ls {file1,file2}.txt will expand to both “file1.txt” and “file2.txt”, and *_R{1,2}.fastq.gz will match all files ending with “_R1.fastq.gz” or “_R2.fastq.gz”.
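The patterns above can be tried safely in a throwaway directory. This is a minimal sketch assuming the bash shell; the file names are the illustrative ones from the list, created empty with touch.

```shell
# Create empty files to match against (names are only for illustration)
demo=$(mktemp -d)
cd "$demo"
touch file1.txt file2.fasta fileB.txt file12.txt

# echo shows what the command actually receives after expansion
star=$(echo *.fasta)          # any string followed by .fasta
question=$(echo file?.*)      # exactly one character between "file" and the dot
range=$(echo file[0-9].*)     # that one character must be a digit
echo "star:     $star"
echo "question: $question"
echo "range:    $range"

# Braces generate names even when no such file exists (bash feature)
echo sample_R{1,2}.fastq.gz

cd - > /dev/null
rm -r "$demo"
```

Note that file12.txt is not matched by the ? pattern or by the range pattern, because both require exactly one character after file.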
When we use this syntax, the command will receive the expanded list of matching files. Try with “echo” to see what happens:
cd ~/learn_bash/files/
echo LIST: *.png
cd -
Then try the ls command with the same pattern:
ls -lh ~/learn_bash/files/*.png
Text files
wc
The wc command is used to count the number of lines, words and characters (more precisely, bytes) in a file. This only makes sense for plain text files.
wc ~/learn_bash/README.md
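If we want, we can print only the lines (with -l), words (with -w) or bytes (with -c). In plain ASCII text each character is one byte; use -m to count characters when the file contains multi-byte characters.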
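As a quick check of these options, here is a sketch on a tiny temporary file whose content is made up for the example. Reading the file through < makes wc print only the number, without the file name.

```shell
# A two-line sample file: "one two" and "three"
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

lines=$(wc -l < "$sample")   # 2 lines
words=$(wc -w < "$sample")   # 3 words
bytes=$(wc -c < "$sample")   # 14 bytes, newlines included
echo "lines=$lines words=$words bytes=$bytes"

rm "$sample"
```

Some versions of wc pad the numbers with leading spaces, so the exact spacing of the output may differ between Linux and macOS.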
How many lines are there in ~/learn_bash/files/README.md?
head and tail
The head and tail commands are used to print the first or the last lines of a file, respectively. By default, they print 10 lines.
In both commands we can change the number of lines with the -n option followed by the number of lines we want.
# Example:
head -n 3 ~/learn_bash/README.md
tail -n 2 ~/learn_bash/files/cars.csv
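To see exactly which lines are selected, it can help to use a file where each line is just its own line number. The sketch below generates one with seq (assumed to be available, as it is on most Linux and macOS systems) and also feeds the output of head into tail with a pipe to isolate a single line.

```shell
# A 20-line file where line N contains the number N
numbers=$(mktemp)
seq 1 20 > "$numbers"

first3=$(head -n 3 "$numbers")   # lines 1 to 3
last2=$(tail -n 2 "$numbers")    # lines 19 and 20
# Take the first 5 lines, then keep only the last of them: line 5
line5=$(head -n 5 "$numbers" | tail -n 1)

echo "$first3"
echo "$last2"
echo "line 5 is: $line5"

rm "$numbers"
```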
less
So far on this page we introduced wc, head and tail: they are all non-interactive commands. To interactively browse a file, we can use the less command, which has the same interface as the man command.
less ~/learn_bash/files/origin.txt
A useful option is -S, which disables line wrapping: long lines are cut at the edge of the screen instead of being split over several rows, and you can scroll sideways with the arrow keys.
grep
The grep command is used to search for a pattern in a file, and it prints the lines matching the pattern.
The general syntax is “grep PATTERN FILE(s)”.
grep "Darwin" ~/learn_bash/files/origin.txt
The pattern can be more complicated than a simple string (a so-called regular expression), but we will not cover this here.
To match a pattern regardless of upper or lower case, we can use the -i option.
grep -i "darwin" ~/learn_bash/files/origin.txt
Some useful options:
- -c prints the number of matching lines, instead of the lines themselves.
- -v prints the lines that do not match the pattern (inVert match).
- -n prints the line number before the line itself.
- -w matches only whole words (not substrings).
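The options listed above can be compared on a small sample. This sketch uses a four-line temporary file with made-up content rather than origin.txt, so that the expected counts are easy to verify by eye.

```shell
# Four sample lines: two contain "Darwin", one contains "darwinian"
text=$(mktemp)
printf 'Darwin sailed\nthe darwinian view\nno match here\nDarwin again\n' > "$text"

exact=$(grep -c "Darwin" "$text")        # 2: case-sensitive match
anycase=$(grep -c -i "darwin" "$text")   # 3: "darwinian" now matches too
words=$(grep -c -i -w "darwin" "$text")  # 2: "darwinian" is not the whole word
inverted=$(grep -c -v "Darwin" "$text")  # 2: lines without "Darwin"
echo "exact=$exact anycase=$anycase words=$words inverted=$inverted"

# -n prefixes each matching line with its line number
grep -n "Darwin" "$text"

rm "$text"
```

The last command should print the two matching lines prefixed with 1: and 4:.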
cut
The cut command is used to extract columns from a file.
The general syntax is “cut -d DELIMITER -f FIELDS FILE(s)”. By default the delimiter is a tab, so you don’t need to specify -d for tab-separated files.
# Check the first two lines of the file first
head -n 2 ~/learn_bash/files/cars.csv
# Print only the columns model and cyl (1 and 3)
cut -d "," -f 1,3 ~/learn_bash/files/cars.csv
Extract the same columns, but from the file ~/learn_bash/files/cars.tsv (tab-separated).
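As a hint for how the default delimiter behaves, here is a sketch on a small made-up table (a hypothetical fruit list, not the cars file). The first two commands rely on the default tab delimiter, the last one sets a different delimiter with -d.

```shell
# A tiny tab-separated table with a header line (content is made up)
table=$(mktemp)
printf 'name\tcolor\tweight\napple\tred\t120\nkiwi\tgreen\t75\n' > "$table"

# No -d needed: tab is the default delimiter
cut -f 1,3 "$table"
header=$(cut -f 1,3 "$table" | head -n 1)
last_color=$(cut -f 2 "$table" | tail -n 1)
echo "last color: $last_color"

# Any other single character can be set with -d (quoted, as ; is special)
second=$(echo "apple;red;120" | cut -d ";" -f 2)
echo "second field: $second"

rm "$table"
```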
sed
The sed command is used to search and replace text in a file. It prints the result to the terminal and leaves the original file unchanged.
The general syntax is “sed -e 's/OLD/NEW/g' FILE(s)”.
# We can replace the "," with a "|"
sed 's/,/|/g' ~/learn_bash/files/cars.csv
Here we used an obscure syntax, which will become familiar with practice, as it has been borrowed by other tools and even programming languages: s/SOMETHING/OTHER/g. The s means “substitute”, the g means “global” (i.e. replace all the occurrences on each line).
Try to replace only the first occurrence on each line, by removing the g:
sed 's/,/---/' ~/learn_bash/files/cars.csv
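The difference is easier to see on a very short input. The sketch below uses a two-line temporary file with made-up content, and also checks that sed did not modify the file itself.

```shell
# Two comma-separated lines (content is made up)
sample=$(mktemp)
printf 'a,b,c\nd,e,f\n' > "$sample"

# Without g: only the first comma of each line is replaced
first_only=$(sed 's/,/|/' "$sample" | head -n 1)
# With g: every comma of each line is replaced
all=$(sed 's/,/|/g' "$sample" | head -n 1)
echo "first only: $first_only"
echo "global:     $all"

# sed wrote to the terminal only: the file still contains commas
unchanged=$(head -n 1 "$sample")
echo "file still has: $unchanged"

rm "$sample"
```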
sort
The sort command is used to sort the lines of a file. By default, it sorts alphabetically.
sort ~/learn_bash/files/cars.csv
Some options:
- -t, to specify the delimiter (in this case, a comma).
- -k FIELD to sort by the given field. The key starts at that field and runs to the end of the line; to use a single field only, give the start and the end separated by a comma, as in -k 2,2.
- -n to sort numerically (this is a switch).
- -r to sort in reverse order (this is a switch).
sort -n -t, -k 2 ~/learn_bash/files/cars.csv
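To see what -n changes, here is a sketch on a three-line made-up file where the second column holds the numbers 10, 9 and 100. Sorted as text, 100 comes before 9, because the comparison goes character by character.

```shell
# Three comma-separated lines with a numeric second column (made-up data)
data=$(mktemp)
printf 'pear,10\napple,9\nfig,100\n' > "$data"

# Alphabetical on column 2: "10" < "100" < "9"
alpha_first=$(sort -t, -k 2 "$data" | head -n 1)
# Numerical on column 2: 9 < 10 < 100
num_first=$(sort -n -t, -k 2 "$data" | head -n 1)
# Numerical and reversed: largest value first
rev_first=$(sort -n -r -t, -k 2 "$data" | head -n 1)

echo "alphabetical first: $alpha_first"
echo "numerical first:    $num_first"
echo "reversed first:     $rev_first"

rm "$data"
```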
Typing multi-line commands
Sometimes our terminal can become too crammed with text, and we would like to type a command across multiple lines.
This is possible with the \ character at the end of a line, which tells the shell that the command continues on the next line.
This is useful on our website, to make commands clearer to read:
conda install --yes -c conda-forge \
-c bioconda "seqfu>1.10"
Note that when you type a backslash at the end of a line and then press Enter, the shell will print a different prompt (usually a >), which means that the command is not finished yet. This greater-than sign is not to be confused with the redirection operator.
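The same trick works with any of the commands seen on this page. As a small sketch, here is a find command split over three lines, run in a throwaway directory with made-up file names.

```shell
# Throwaway directory with two .txt files and one .png file
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt" "$demo/c.png"

# One logical command split over three lines: each backslash must be
# the very last character on its line (no trailing spaces)
found=$(find "$demo" \
  -type f \
  -name "*.txt" | sort)
echo "$found"

rm -r "$demo"
```

The shell joins the three lines before running find, so the result is the same as typing everything on one line.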