fu-orf
Extracts ORFs from raw read datasets (and other sequence files). An open reading frame is defined here as a stretch of amino acids not interrupted by stop codons. The program does not perform any gene finding: it merely extracts ORFs, on the assumption that raw reads are expected to yield fragments rather than complete genes.
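The definition above can be illustrated with a minimal Python sketch. It is not the fu-orf implementation: the function names (translate, orfs_in_frame) are hypothetical, only the standard genetic code is used, and a single forward frame is scanned. It shows the idea of translating a sequence and keeping every stop-free stretch of at least a minimum length.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate one forward frame; codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def orfs_in_frame(dna, frame=0, min_size=25):
    """Return the stop-free peptide stretches of at least min_size residues."""
    return [p for p in translate(dna, frame).split("*") if len(p) >= min_size]


# Hypothetical toy sequence: ATG AAA TAA GGG CCC
print(translate("ATGAAATAAGGGCCC"))
print(orfs_in_frame("ATGAAATAAGGGCCC", min_size=2))
```

No start codon is required in this sketch, which mirrors the "stretch not interrupted by stop codons" definition rather than a gene model.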
A major update was introduced with version 1.8.4: improved reporting (strand and reading frame in the output), improved tests, and the --scan-reverse option (reverse-strand scanning was previously enabled by default).
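As a rough illustration of what strand and reading frame mean here, the Python sketch below enumerates the three forward frames of a sequence and, optionally, the three frames of its reverse complement. The function names and the label convention (sign for the strand, 0 to 2 for the offset) are assumptions inferred from the frame= values shown in the example output, not a description of fu-orf's internals.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def reading_frames(dna, scan_reverse=False):
    """Yield (label, codon-aligned nucleotides) for each frame to scan."""
    strands = [("+", dna.upper())]
    if scan_reverse:
        # Assumed meaning of --scan-reverse: also scan the reverse complement.
        strands.append(("-", reverse_complement(dna)))
    for sign, seq in strands:
        for offset in range(3):
            usable = (len(seq) - offset) // 3 * 3  # drop trailing partial codon
            yield f"{sign}{offset}", seq[offset:offset + usable]


for label, nucleotides in reading_frames("ATGAAATAAGG", scan_reverse=True):
    print(label, nucleotides)
```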
fu-orf
Extract ORFs from Paired-End reads.
Usage:
fu-orf [options] <InputFile>
fu-orf [options] -1 File_R1.fq
fu-orf [options] -1 File_R1.fq -2 File_R2.fq
fu-orf --help | --codes
Input files:
-1, --R1 FILE First paired end file
-2, --R2 FILE Second paired end file
ORF Finding and Output options:
-m, --min-size INT Minimum ORF size (aa) [default: 25]
-p, --prefix STRING Rename reads using this prefix
-r, --scan-reverse Also scan reverse complemented sequences
-c, --code INT NCBI Genetic code to use [default: 1]
-l, --min-read-len INT Minimum read length to process [default: 25]
Paired-end options:
-j, --join Attempt Paired-End joining
--min-overlap INT Minimum PE overlap [default: 12]
--max-overlap INT Maximum PE overlap [default: 200]
--min-identity FLOAT Minimum sequence identity in overlap [default: 0.80]
Other options:
--codes Print NCBI genetic codes and exit
--pool-size INT Size of the sequences array to be processed
by each working thread [default: 250]
--verbose Print verbose log
--debug Print debug log
--help Show help
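The paired-end options above (--min-overlap, --max-overlap, --min-identity) suggest an overlap-based merge of the two mates. The Python sketch below shows one simple way such a merge could work; it is an assumption for illustration, not fu-orf's actual algorithm. In particular, trying the longest overlap first and taking R1 bases inside the overlap are choices made here, and join_pair is a hypothetical name.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def join_pair(r1, r2, min_overlap=12, max_overlap=200, min_identity=0.80):
    """Merge R1 with the reverse complement of R2, or return None.

    The end of R1 is compared with the start of reverse-complemented R2
    for every overlap size between max_overlap and min_overlap; the first
    size whose fraction of matching bases reaches min_identity is accepted.
    """
    r1 = r1.upper()
    r2_rc = reverse_complement(r2)
    longest = min(max_overlap, len(r1), len(r2_rc))
    for size in range(longest, min_overlap - 1, -1):
        tail, head = r1[-size:], r2_rc[:size]
        matches = sum(a == b for a, b in zip(tail, head))
        if matches / size >= min_identity:
            return r1 + r2_rc[size:]  # keep R1 bases in the overlap
    return None


# Hypothetical fragment sequenced from both ends with a 14 bp overlap.
fragment = "ACGTTGCAAGGCTTAACCGGTACGATCGATTGCA"
r1 = fragment[:24]
r2 = reverse_complement(fragment[10:])
print(join_pair(r1, r2, min_identity=1.0) == fragment)
```

Pairs that cannot be joined return None in this sketch; how fu-orf treats such pairs is not stated in the help text.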
Example usage
Single input file (FASTA or FASTQ):
fu-orf --min-size 500 data/orf.fa.gz
Paired-end Illumina reads:
fu-orf --min-size 29 -1 data/illumina_1.fq.gz -2 data/illumina_2.fq.gz
The command produces FASTA output in which each header comment reports the frame of the ORF and the total number of ORFs printed for that sequence:
>filt.1_1 frame=+0 tot=5
RNLIILKMDFFFENFALVGLLYGACQRLNSTKFYLMSTDYLIVKTFNNGSLGSRIDEERS
>filt.1_2 frame=+2 tot=5
WSFRGSKSRNKVSVGEPAEGSLKKFNNFENGFFF
>filt.1_3 frame=+2 tot=5
KLCFGRPSIWGLPEVKLNQILFNVNRLFNSQNFQQRISWFSHR
>filt.1_4 frame=-1 tot=5
NLVEFNLWQAPYRRPTKAKFSKKKSIFKIIKFL
>filt.1_5 frame=-2 tot=5
LLNNRLTLNKIWLSLTSGRPHIEGLPKQSFQKKNPFSKLLNFFNDPSAGSPTETLLRLLLPLNDQ
>filt.2_1 frame=+0 tot=5
TYNQFFINLSHQIITNSQNFQQRISWFSHRNA
>filt.2_2 frame=+1 tot=5
NKLALAVGPACRQRSKLTTNFLSTCHTRLLLIVKTFNNGSLGSRIETQ
>filt.2_3 frame=+2 tot=5
WSFRGSKSRNKVSVGEPAEGSLLICLIAPHVFFFETNLLWRWAQPAARGLNLQPIFYQLVTPDYY
>filt.2_4 frame=-1 tot=5
KIGCKFRPLAAGWAHRQSKFVSKKNTCGAIKQISNDPSAGSPTETLLRLLLPLNDQ
>filt.2_5 frame=-2 tot=5
QVDKKLVVSLDLWRQAGPTAKASLFQRKTHVVQLSKSVMILPQVHLRKPCYDFYFL
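For downstream processing, the header comments shown above can be parsed with a few lines of Python. This is a sketch based only on the example output: the assumption that each record name is the read name followed by an underscore and the ORF index, and the parse_header helper itself, are not part of fu-orf.

```python
import re

# Matches headers such as ">filt.1_4 frame=-1 tot=5"
HEADER = re.compile(
    r"^>(?P<name>\S+)\s+frame=(?P<frame>[+-]\d+)\s+tot=(?P<tot>\d+)"
)


def parse_header(line):
    """Split a fu-orf style FASTA header into its fields."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError(f"unexpected header: {line!r}")
    # Assumption: the record name is "<read name>_<ORF index>".
    read, _, index = match["name"].rpartition("_")
    return {
        "read": read,
        "orf": int(index),
        "strand": match["frame"][0],
        "frame": int(match["frame"][1:]),
        "tot": int(match["tot"]),
    }


print(parse_header(">filt.1_4 frame=-1 tot=5"))
```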
Genetic codes
The genetic code used for translation can be changed with -c/--code, using the NCBI genetic code numbers.
Run fu-orf --codes to print the following list.
1: The Standard Code
2: The Vertebrate Mitochondrial Code
3: The Yeast Mitochondrial Code
4: The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
5: The Invertebrate Mitochondrial Code
6: The Ciliate, Dasycladacean and Hexamita Nuclear Code
9: The Echinoderm and Flatworm Mitochondrial Code
10: The Euplotid Nuclear Code
11: The Bacterial, Archaeal and Plant Plastid Code
12: The Alternative Yeast Nuclear Code
13: The Ascidian Mitochondrial Code
14: The Alternative Flatworm Mitochondrial Code
16: Chlorophycean Mitochondrial Code
21: Trematode Mitochondrial Code
22: Scenedesmus obliquus Mitochondrial Code
23: Thraustochytrium Mitochondrial Code
24: Rhabdopleuridae Mitochondrial Code
25: Candidate Division SR1 and Gracilibacteria Code
26: Pachysolen tannophilus Nuclear Code
27: Karyorelict Nuclear Code
28: Condylostoma Nuclear Code
29: Mesodinium Nuclear Code
30: Peritrich Nuclear Code
31: Blastocrithidia Nuclear Code
33: Cephalodiscidae Mitochondrial UAA-Tyr Code
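The choice of genetic code matters because it changes which codons are stops, and therefore where an ORF ends. The Python sketch below compares NCBI codes 1 and 4, which differ in that TGA is a stop in code 1 and tryptophan in code 4. Only these two tables are included, and the translate helper is hypothetical rather than part of fu-orf.

```python
from itertools import product

BASES = "TCAG"
# Amino acid strings as published by NCBI, codons enumerated in TCAG order.
NCBI_TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}
CODONS = ["".join(c) for c in product(BASES, repeat=3)]


def translate(dna, code=1):
    """Translate frame +0 of dna with the selected NCBI genetic code."""
    table = dict(zip(CODONS, NCBI_TABLES[code]))
    dna = dna.upper()
    return "".join(
        table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


sequence = "ATGTGAAAA"  # hypothetical toy sequence: ATG TGA AAA
print(translate(sequence, code=1))  # TGA read as a stop
print(translate(sequence, code=4))  # TGA read as tryptophan
```

With code 1 the toy sequence is split into two stop-free stretches, while with code 4 it forms a single one.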