fu-orf
Extraction of ORFs from raw reads datasets (and other sequence files). Open reading frames are defined as a stretch of aminoacids not interrupted by stop codons: this program does not perform any gene finding procedure and merely extracts ORFs (under the assumption that, running on raw reads, fragments are expected).
A major update was introduced with version 1.8.4, with improved reporting (strand and reding frame in the output), improved tests and --scan-reverse option (previously enabled by default).
fu-orf
Extract ORFs from Paired-End reads.
Usage:
fu-orf [options] <InputFile>
fu-orf [options] -1 File_R1.fq
fu-orf [options] -1 File_R1.fq -2 File_R2.fq
fu-orf --help | --codes
Input files:
-1, --R1 FILE First paired end file
-2, --R2 FILE Second paired end file
ORF Finding and Output options:
-m, --min-size INT Minimum ORF size (aa) [default: 25]
-p, --prefix STRING Rename reads using this prefix
-r, --scan-reverse Also scan reverse complemented sequences
-c, --code INT NCBI Genetic code to use [default: 1]
-l, --min-read-len INT Minimum read length to process [default: 25]
Paired-end optoins:
-j, --join Attempt Paired-End joining
--min-overlap INT Minimum PE overlap [default: 12]
--max-overlap INT Maximum PE overlap [default: 200]
--min-identity FLOAT Minimum sequence identity in overlap [default: 0.80]
Other options:
--codes Print NCBI genetic codes and exit
--pool-size INT Size of the sequences array to be processed
by each working thread [default: 250]
--verbose Print verbose log
--debug Print debug log
--help Show help
Example usage
Single input file (FASTA or FASTQ):
fu-orf --min-size 500 data/orf.fa.gz
Paired-end Illumina reads:
fu-orf --min-size 29 -1 data/illumina_1.fq.gz -2 data/illumina_2.fq.gz
will produce a FASTA output reporting, as comment, the frame and the total ORFs printed for each sequence:
>filt.1_1 frame=+0 tot=5
RNLIILKMDFFFENFALVGLLYGACQRLNSTKFYLMSTDYLIVKTFNNGSLGSRIDEERS
>filt.1_2 frame=+2 tot=5
WSFRGSKSRNKVSVGEPAEGSLKKFNNFENGFFF
>filt.1_3 frame=+2 tot=5
KLCFGRPSIWGLPEVKLNQILFNVNRLFNSQNFQQRISWFSHR
>filt.1_4 frame=-1 tot=5
NLVEFNLWQAPYRRPTKAKFSKKKSIFKIIKFL
>filt.1_5 frame=-2 tot=5
LLNNRLTLNKIWLSLTSGRPHIEGLPKQSFQKKNPFSKLLNFFNDPSAGSPTETLLRLLLPLNDQ
>filt.2_1 frame=+0 tot=5
TYNQFFINLSHQIITNSQNFQQRISWFSHRNA
>filt.2_2 frame=+1 tot=5
NKLALAVGPACRQRSKLTTNFLSTCHTRLLLIVKTFNNGSLGSRIETQ
>filt.2_3 frame=+2 tot=5
WSFRGSKSRNKVSVGEPAEGSLLICLIAPHVFFFETNLLWRWAQPAARGLNLQPIFYQLVTPDYY
>filt.2_4 frame=-1 tot=5
KIGCKFRPLAAGWAHRQSKFVSKKNTCGAIKQISNDPSAGSPTETLLRLLLPLNDQ
>filt.2_5 frame=-2 tot=5
QVDKKLVVSLDLWRQAGPTAKASLFQRKTHVVQLSKSVMILPQVHLRKPCYDFYFL
Genetic codes
Genetic codes can be changed using NCBI Genetic Codes.
Type fu-orf --codes to print the following list.
1: The Standard Code2: The Vertebrate Mitochondrial Code3: The Yeast Mitochondrial Code4: The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code5: The Invertebrate Mitochondrial Code6: The Ciliate, Dasycladacean and Hexamita Nuclear Code9: The Echinoderm and Flatworm Mitochondrial Code10: The Euplotid Nuclear Code11: The Bacterial, Archaeal and Plant Plastid Code12: The Alternative Yeast Nuclear Code13: The Ascidian Mitochondrial Code14: The Alternative Flatworm Mitochondrial Code16: Chlorophycean Mitochondrial Code21: Trematode Mitochondrial Code22: Scenedesmus obliquus Mitochondrial Code23: Thraustochytrium Mitochondrial Code24: Rhabdopleuridae Mitochondrial Code25: Candidate Division SR1 and Gracilibacteria Code26: Pachysolen tannophilus Nuclear Code27: Karyorelict Nuclear Code28: Condylostoma Nuclear Code29: Mesodinium Nuclear Code30: Peritrich Nuclear Code31: Blastocrithidia Nuclear Code33: Cephalodiscidae Mitochondrial UAA-Tyr Code