seqfu metadata

Given one or more directories containing sequencing reads, this tool produces a metadata file by extracting the sample ID from each filename and, optionally, adding file paths or read counts.
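
As a rough illustration of how an ID can be derived from a filename, the sketch below splits the basename on a separator and keeps one field, mirroring the defaults of --split (_) and --pos (1) listed under Options. The sample_id function is a hypothetical helper, not part of SeqFu, and the tool's actual implementation may differ.

```shell
# Hypothetical helper (not part of SeqFu): derive a sample ID from a filename
# by splitting the basename on a separator and keeping one field.
# Arguments: FILE [SEPARATOR, default "_"] [POSITION, default 1]
sample_id() {
  name=$(basename "$1")
  printf '%s\n' "$name" | cut -d "${2:-_}" -f "${3:-1}"
}

sample_id /Users/telatin/MiSeq_SOP/F3D0_S188_L001_R1_001.fastq.gz
# prints: F3D0
```

Passing a different separator or position (for example, sample_id FILE . 1) corresponds to changing --split or --pos.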

Usage

Usage: 
  metadata [options] [<dir>...]
  metadata formats

Prepare mapping files from directory containing FASTQ files

Options:
  -1, --for-tag STR      String found in filename of forward reads [default: _R1]
  -2, --rev-tag STR      String found in filename of reverse reads [default: _R2]
  -s, --split STR        Separator used in filename to identify the sample ID [default: _]
  --pos INT...           Which part of the filename is the Sample ID [default: 1]

  -f, --format TYPE      Output format: dadaist, irida, manifest,... list to list [default: manifest]
  -p, --add-path         Add the reads absolute path as column 
  -c, --counts           Add the number of reads as a property column (experimental)
  -t, --threads INT      Number of simultaneously opened files (legacy: ignored) 
  --pe                   Enforce paired-end reads (not supported)
  --ont                  Long reads (Oxford Nanopore) [default: false]

  GLOBAL OPTIONS
  --abs                  Force absolute path
  --basename             Use basename instead of full path
  --force-tsv            Force '\t' separator, otherwise selected by the format
  --force-csv            Force ',' separator, otherwise selected by the format
  -R, --rand-meta INT    Add a random metadata column with INT categories

  FORMAT SPECIFIC OPTIONS
  -P, --project INT      Project ID (only for irida)
  --meta-split STR       Separator in the SampleID to extract metadata, used in MetaPhage [default: _]
  --meta-part INT        Which part of the SampleID to extract metadata, used in MetaPhage [default: 1]
  --meta-default STR     Default value for metadata, used in MetaPhage [default: Cond]

  -v, --verbose          Verbose output
  --debug                Debug output
  -h, --help             Show this help

Output formats

SeqFu metadata now supports the following output formats:

manifest: Used as import manifest for Qiime2 artifacts.
qiime1: Forward-compatible Qiime1 mapping file.
qiime2: Qiime2 metadata file.
dadaist: Dadaist2 compatible metadata.
lotus: Lotus mapping file (tested with Lotus1).
irida: IRIDA uploader sample sheet. Requires -P PROJECTID.
metaphage: MetaPhage metadata file. Use --meta-split, --meta-part, and --meta-default to customize a Treatment column.
ampliseq: nf-core/ampliseq metadata file.
rnaseq: nf-core/rnaseq metadata file.
bactopia: Bactopia FOFN (File of File Names) file.
mag: nf-core/mag metadata file.

New Features

Support for --format bactopia to generate Bactopia FOFN files.
Added --ont option for long reads (Oxford Nanopore Technologies).
Enhanced support for various bioinformatics pipelines (ampliseq, rnaseq, mag).

Examples

Manifest (default)

seqfu metadata ./MiSeq_SOP/

Output:

sample-id	forward-absolute-filepath	reverse-absolute-filepath
F3D0	/Users/telatin/MiSeq_SOP/F3D0_S188_L001_R1_001.fastq.gz	/Users/telatin/MiSeq_SOP/F3D0_S188_L001_R2_001.fastq.gz
F3D1	/Users/telatin/MiSeq_SOP/F3D1_S189_L001_R1_001.fastq.gz	/Users/telatin/MiSeq_SOP/F3D1_S189_L001_R2_001.fastq.gz
...
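
To make the layout above concrete, the following sketch assembles a manifest-like table by hand. It assumes the default tags (_R1 and _R2) and the default separator (_); the function name manifest_sketch is hypothetical, and this only approximates the output format rather than reproducing SeqFu's own logic.

```shell
# Hypothetical sketch (not SeqFu's implementation): print a manifest-like
# table for every forward file in a directory that has a matching reverse file.
manifest_sketch() {
  dir=$(cd "$1" && pwd) || return 1   # resolve the directory to an absolute path
  printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
  for fwd in "$dir"/*_R1*; do
    [ -e "$fwd" ] || continue          # no match: the glob pattern stays literal
    name=$(basename "$fwd")
    # Reverse file: same name with the forward tag swapped for the reverse tag
    rev="$dir/$(printf '%s\n' "$name" | sed 's/_R1/_R2/')"
    [ -e "$rev" ] || continue          # skip forward files without a mate
    id=$(printf '%s\n' "$name" | cut -d _ -f 1)
    printf '%s\t%s\t%s\n' "$id" "$fwd" "$rev"
  done
}
# Usage: manifest_sketch ./MiSeq_SOP/
```

The sketch silently skips unpaired files; how SeqFu treats them is not covered here.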

Qiime1 mapping file

seqfu metadata MiSeq_SOP -f qiime1 --add-path --counts

Output:

#SampleID	Counts	Paths
F3D0	7793	F3D0_S188_L001_R1_001.fastq.gz,F3D0_S188_L001_R2_001.fastq.gz
F3D1	5869	F3D1_S189_L001_R1_001.fastq.gz,F3D1_S189_L001_R2_001.fastq.gz
...
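
For context on the Counts column: in a FASTQ file with four lines per record, the number of reads is the line count divided by four. The sketch below shows that calculation for a gzipped file. The count_reads function is a hypothetical helper, and whether SeqFu counts reads this way, or reports the count per file or per pair, is not stated here.

```shell
# Hypothetical helper (not part of SeqFu): count reads in a gzipped FASTQ file,
# assuming exactly four lines per record (no wrapped sequence lines).
count_reads() {
  gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
# Usage: count_reads MiSeq_SOP/F3D0_S188_L001_R1_001.fastq.gz
```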

IRIDA uploader

seqfu metadata -f irida -P 123 data/pe/

Output:

Sample_Name,Project_ID,File_Forward,File_Reverse
sample1,123,sample1_R1.fq.gz,sample1_R2.fq.gz
sample2,123,sample2_R1.fq.gz,sample2_R2.fq.gz

Bactopia FOFN

seqfu metadata -f bactopia data/pe/

For Oxford Nanopore (ONT) data, add the --ont option.

Output:

sample	runtype	r1	r2
sample1	paired-end	/path/to/data/pe/sample1_R1.fq.gz	/path/to/data/pe/sample1_R2.fq.gz
sample2	paired-end	/path/to/data/pe/sample2_R1.fq.gz	/path/to/data/pe/sample2_R2.fq.gz

Notes

Use --add-path to include full file paths in the output (when supported by the format).
The --counts option adds read counts to the output (experimental feature, not supported by all formats).
Format-specific options (like --project for IRIDA) are required for certain output types.
Use --verbose for detailed processing information and --debug for troubleshooting.

For more information on each format and its specific options, please refer to the respective tool's documentation.