seqfu check
EXPERIMENTAL: Introduced in SeqFu 1.15, updated with --deep in 1.18 (see below). In SeqFu 2.0 a strict behaviour will be used by default; see fu-pecheck and fu-secheck.
Evaluates the integrity of DNA FASTQ files.
Usage: seqfu check [options] <FQFILE> [<REV>]
       seqfu check [options] --dir <FQDIR>

  Check the integrity of FASTQ files, returns non zero
  if an error occurs. Will print a table with a report.

  Input is a single dataset:
    <FQFILE>                   the forward read file
    <REV>                      the reverse read file
  or a directory of FASTQ files (--dir):
    <FQDIR>                    the directory containing the FASTQ files

  Options:
    -d, --deep                 Perform a deep check of the file and will not
                               support multiline Sanger FASTQ [default: false]
    -n, --no-paired            Disable autodetection of second pair
    -s, --safe-exit            Exit with 0 even if errors are found
    -q, --quiet                Do not print infos, just exit status
    -v, --verbose              Verbose output
    -t, --thousands            Print numbers with thousands separator
    --debug                    Debug output
    -h, --help                 Show this help
Integrity check
If --deep is not used, a file is considered valid if seqfu cat $INPUT > $OUTPUT would produce a valid file. For example, if an error is detected at the 100th sequence, the file is still considered valid and 99 is reported as the total number of sequences.
A single FASTQ file is considered valid if:
- each record has the same sequence and quality length
- only A,C,G,T,N characters are present in the sequence
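The two rules above can be sketched in Python as follows. This is only an illustration of the stated rules, not SeqFu's implementation: the helper name `record_errors` and the error messages are hypothetical, and matching uppercase bases only is an assumption (the text does not say whether lowercase bases are accepted).

```python
# Illustrative sketch of the two single-file rules; not SeqFu's actual code.
ALLOWED_BASES = set("ACGTN")  # assumption: uppercase characters only


def record_errors(name, sequence, quality):
    """Return the rule violations found in one FASTQ record (hypothetical helper)."""
    errors = []
    # Rule 1: sequence and quality strings must have the same length
    if len(sequence) != len(quality):
        errors.append(
            f"{name}: sequence length {len(sequence)} != quality length {len(quality)}"
        )
    # Rule 2: only A, C, G, T, N may appear in the sequence
    invalid = sorted(set(sequence) - ALLOWED_BASES)
    if invalid:
        errors.append(f"{name}: invalid characters {invalid}")
    return errors


if __name__ == "__main__":
    print(record_errors("read1", "ACGTN", "IIIII"))  # a record that passes both rules
    print(record_errors("read2", "ACG T", "IIII"))   # a record that violates both rules
```

A record is valid under these two rules when the returned list is empty.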
A paired-end set of FASTQ files is considered valid if:
- each file is individually valid
- the two files have the same number of sequences
- the first and last sequences of both files have the same name (the last three characters are ignored if the remaining sequence name is longer than 4 characters)
- the first and last sequence of the two files are not identical (R1 != R2)
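The name comparison described in the list above could look like the sketch below. It reflects one reading of the rule (drop the last three characters, then use the trimmed name only if more than 4 characters remain); the function names `comparable_name` and `names_match` are hypothetical and SeqFu's own code may differ in detail.

```python
# Illustrative sketch of the paired-end name rule; not SeqFu's actual code.
def comparable_name(name):
    """Drop the last three characters when more than 4 characters remain."""
    trimmed = name[:-3]
    return trimmed if len(trimmed) > 4 else name


def names_match(r1_name, r2_name):
    """Compare two read names after trimming the pair-specific suffix."""
    return comparable_name(r1_name) == comparable_name(r2_name)


if __name__ == "__main__":
    # Names differing only in the trailing mate marker are treated as equal
    print(names_match("sample_read_001/1", "sample_read_001/2"))
    # Names from the error example in this page differ beyond the suffix
    print(names_match("R1.startFOR+", "R2.startREV+"))
```

Short names are compared in full, so "ab/1" and "ab/2" would not match under this reading.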
Deep check
If you are parsing NGS files, i.e. FASTQ files with four lines per record, and you expect them to be accepted by any program, use --deep.
Usage
To test a single file:
seqfu check test_file.fq.gz
To test a pair of files:
seqfu check test_R1.fq.gz [test_R2.fq.gz]
Note that if a single file is supplied but a matching pair is detected (e.g. test_R1.fq.gz is supplied and test_R2.fq.gz is found), the check will be performed on both files.
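The exact autodetection rules are not described here. As a rough illustration of the idea only, the sketch below derives a candidate reverse file name by swapping the last `_R1` tag for `_R2`, matching the test_R1.fq.gz / test_R2.fq.gz example. The function `guess_mate` and the `_R1`/`_R2` convention are assumptions, not SeqFu's documented behaviour.

```python
# Illustrative sketch of mate-file guessing; the real detection rules may differ.
from pathlib import Path


def guess_mate(forward):
    """Return a candidate R2 path for a forward file, or None (hypothetical rule)."""
    path = Path(forward)
    # Split on the last "_R1" in the file name, leaving directories untouched
    head, tag, tail = path.name.rpartition("_R1")
    if not tag:
        return None
    return path.with_name(f"{head}_R2{tail}")


if __name__ == "__main__":
    print(guess_mate("test_R1.fq.gz"))
    print(guess_mate("se.fq.gz"))
```

A caller would still need to check that the candidate file exists before treating the dataset as paired-end.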
To test all files in a directory:
seqfu check --dir test_dir
Other options
- --no-paired: disables the autodetection of the second pair (i.e. forces a single-end check)
- --thousands: adds a thousands separator to the numbers in the output
- --quiet: prints no report; only the exit status is returned
- --verbose: prints more information (including processing speed)
- --debug: prints debug information
- --safe-exit: always exits with 0, even if errors are found (useful in pipelines)
Exit status
If an error is identified in at least one file, the program will exit with a non-zero status, unless the --safe-exit option is used.
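In a script, the exit status can be used directly. The following Python sketch assumes `seqfu` is available on the PATH and uses only options listed on this page; the helper names `build_check_command` and `dataset_is_valid` are hypothetical.

```python
# Illustrative wrapper around the documented command line; helper names are hypothetical.
import subprocess


def build_check_command(path, deep=False, safe_exit=False, is_dir=False):
    """Assemble a seqfu check command line from options documented on this page."""
    cmd = ["seqfu", "check"]
    if deep:
        cmd.append("--deep")
    if safe_exit:
        cmd.append("--safe-exit")
    if is_dir:
        cmd.append("--dir")  # the directory path follows the flag
    cmd.append(str(path))
    return cmd


def dataset_is_valid(path, **options):
    """Run the check and map the exit status to a boolean (requires seqfu on PATH)."""
    result = subprocess.run(
        build_check_command(path, **options), capture_output=True, text=True
    )
    # Exit status 0 means no error was reported
    return result.returncode == 0


if __name__ == "__main__":
    print(build_check_command("test_dir", is_dir=True))
```

With `safe_exit=True` the exit status is always 0, so `dataset_is_valid` would no longer be informative and the report table would have to be inspected instead.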
Output
The output is a table with the following columns:
- Status (OK or ERR)
- Library type (SE or PE)
- Filename (the path to the first pair, if PE)
- Number of sequences counted (if PE: number of sequences in both files) or - if the dataset is not valid
- Number of bases (if PE: total number of bases in both files) or - if the dataset is not valid
- Number of errors
- List of detected errors (if any)
Example
Example of output for a directory containing three paired-end datasets:
OK PE /tmp/data/16S_R1.fq.gz 12274 3694474 0
OK PE /tmp/data/16Snano_R1.fq.gz 468 140868 0
OK PE /tmp/data/illumina_1.fq.gz 14 1260 0
Example of errors (can be reproduced using the data directory of the repository):
seqfu check --dir data/primers
OK SE data/primers/16S_merge.fq.gz 6137 2596981 0
OK SE data/primers/16S_vsearch_merge.fq.gz 3935 1818111 0
ERR SE data/primers/artificial.fq.gz - - 2 Invalid character in sequence: < > in R2.REV+.middle;
OK SE data/primers/its-merge.fq.gz 7299 1504898 0
OK SE data/primers/se.fq.gz 234 70434 0
OK SE data/primers/small.fq 4 360 0
OK PE data/primers/16S_R1.fq.gz 12274 3694474 0
OK PE data/primers/16Snano_R1.fq.gz 468 140868 0
ERR PE data/primers/art_R1.fq.gz 7 - 5 R2=Invalid character in sequence: < > in R2.REV+.middle;;First sequence names do not match (R1.startFOR+, R2.startREV+);Last sequence names do not match (R1.FOR1+.start-middle, );
OK PE data/primers/its_R1.fq.gz 16000 3387804 0
OK PE data/primers/itsfilt_R1.fq.gz 15618 3272396 0
OK PE data/primers/pico_R1.fq.gz 24 7224 0
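Report rows like the ones above can be turned into structured records with a few lines of Python. This is a minimal sketch under stated assumptions: columns are separated by whitespace, file paths contain no spaces, --thousands is not in use, and individual error messages are separated by semicolons as in the rows shown. The names `CheckRow` and `parse_row` are hypothetical.

```python
# Illustrative parser for one report row; names and assumptions are not part of SeqFu.
from typing import List, NamedTuple, Optional


class CheckRow(NamedTuple):
    status: str
    library: str
    filename: str
    sequences: Optional[int]
    bases: Optional[int]
    error_count: int
    errors: List[str]


def _count(field):
    # "-" marks a count that is not reported for an invalid dataset
    return None if field == "-" else int(field)


def parse_row(line):
    """Split a report row into its seven documented columns."""
    # The last column (error list) may contain spaces, so split at most 6 times
    fields = line.strip().split(None, 6)
    status, library, filename, seqs, bases, nerr = fields[:6]
    messages = fields[6] if len(fields) > 6 else ""
    errors = [m.strip() for m in messages.split(";") if m.strip()]
    return CheckRow(status, library, filename, _count(seqs), _count(bases), int(nerr), errors)


if __name__ == "__main__":
    print(parse_row("OK PE /tmp/data/16S_R1.fq.gz 12274 3694474 0"))
```

Note that the number of messages in the last column need not equal the reported error count, so the two are kept as separate fields.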