vs_fastx_subsample
Subsamples sequences in a FASTA/FASTQ file or object by randomly extracting sequences based on number or percentage, using VSEARCH.
Usage
vs_fastx_subsample(
  fastx_input,
  output_format = "fastq",
  fastx_output = NULL,
  sample_pct = NULL,
  sample_size = NULL,
  sizein = TRUE,
  sizeout = TRUE,
  relabel = NULL,
  relabel_sha1 = FALSE,
  randseed = NULL,
  fasta_width = 0,
  sample = NULL,
  threads = 1,
  vsearch_options = NULL,
  tmpdir = NULL
)
Arguments
- fastx_input
(Required). A FASTA/FASTQ file path or FASTA/FASTQ object. See Details.
- output_format
(Optional). Desired output format of the file or tibble: "fasta" or "fastq" (default). If fastx_input is a FASTA file path or a FASTA object, output_format cannot be "fastq".
- fastx_output
(Optional). Name of the output file for subsampled reads from fastx_input. The file can be in either FASTA or FASTQ format, depending on output_format. If NULL (default), no sequences are written to file. See Details.
- sample_pct
(Optional). Percentage of the input sequences to be subsampled. Numeric value ranging from 0.0 to 100.0. Defaults to NULL.
- sample_size
(Optional). Number of sequences to extract. Must be a positive integer if specified. Defaults to NULL.
- sizein
(Optional). If TRUE (default), abundance annotations present in sequence headers are taken into account.
- sizeout
(Optional). If TRUE (default), abundance annotations are added to FASTA headers.
- relabel
(Optional). Relabel sequences using the given prefix and a ticker (1, 2, 3, etc.) to construct new headers. Defaults to NULL.
- relabel_sha1
(Optional). If TRUE, relabel sequences using the SHA1 message digest algorithm. Defaults to FALSE.
- randseed
(Optional). Random seed. Must be a positive integer. A given seed always produces the same output, which is useful for replicability. Defaults to NULL.
- fasta_width
(Optional). Number of characters per line in the output FASTA file. Defaults to 0, which disables wrapping.
- sample
(Optional). Add the given sample identifier string to sequence headers. For instance, if the given string is "ABC", the text ";sample=ABC" will be added to the header. If NULL (default), no identifier is added.
- threads
(Optional). Number of computational threads to be used by VSEARCH. Defaults to 1.
- vsearch_options
(Optional). Additional arguments to pass to VSEARCH. Defaults to NULL. See Details.
- tmpdir
(Optional). Path to the directory where temporary files should be written when tables are used as input or output. Defaults to NULL, which resolves to the session-specific temporary directory (tempdir()).
Value
A tibble or NULL.
If fastx_output is specified, the subsampled sequences are written to the specified output file, and no tibble is returned.
If fastx_output is NULL, a tibble containing the subsampled reads in the format specified by output_format is returned.
Details
Sequences in the input file/object (fastx_input) are subsampled by randomly extracting a specified number or percentage of sequences. Extraction is performed as random sampling with a uniform distribution among the input sequences and without replacement.
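The idea of uniform sampling without replacement can be illustrated in base R. This is only a conceptual sketch: it does not call VSEARCH, the identifiers in headers are made up, and set.seed() is used here as a loose analogue of randseed rather than a description of how VSEARCH seeds its generator.

```r
# Conceptual illustration only: base R sampling, not a call to VSEARCH.
headers <- paste0("seq", 1:20)   # hypothetical sequence identifiers

set.seed(42)                     # fixed seed, analogous in spirit to randseed
picked <- sample(headers, size = 5, replace = FALSE)

set.seed(42)                     # resetting the seed reproduces the draw
picked_again <- sample(headers, size = 5, replace = FALSE)

# Without replacement, no sequence is drawn twice
no_duplicates <- anyDuplicated(picked) == 0

# The same seed gives the same draw
reproducible <- identical(picked, picked_again)
```

Both no_duplicates and reproducible evaluate to TRUE, which mirrors the two properties stated above: each input sequence is selected at most once, and a fixed seed yields a repeatable subsample.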
fastx_input can either be a FASTA/FASTQ file or a FASTA/FASTQ object. FASTA objects are tibbles that contain the columns Header and Sequence, see readFasta. FASTQ objects are tibbles that contain the columns Header, Sequence, and Quality, see readFastq.
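As a rough sketch of the object layout described above, the following builds a small table with the expected column names. The reads are invented, and a base data.frame is used only to keep the sketch free of dependencies; the function is documented as working with tibbles, so in practice such a table would normally come from readFastq or be converted with tibble::as_tibble().

```r
# Hypothetical two-read FASTQ object; column names follow the Details text.
fastq_obj <- data.frame(
  Header   = c("read1", "read2"),
  Sequence = c("ACGTACGT", "TTGCAAGC"),
  Quality  = c("IIIIIIII", "IIHHGGFF"),
  stringsAsFactors = FALSE
)

# A FASTA object has the same shape minus the Quality column
fasta_obj <- fastq_obj[, c("Header", "Sequence")]

# Each quality string should be as long as its sequence
lengths_match <- all(nchar(fastq_obj$Sequence) == nchar(fastq_obj$Quality))
```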
Specify either sample_size or sample_pct to determine the number or percentage of sequences to subsample. Only one of these parameters can be specified at a time. If neither is specified, an error is thrown.
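The rule above can be sketched as a small validation helper. check_sample_args is a hypothetical name and is not part of Rsearch; the sketch also assumes that supplying both arguments is treated as an error, which the text implies but does not state outright.

```r
# Hypothetical helper, not part of Rsearch: mirrors the rule described above.
check_sample_args <- function(sample_size = NULL, sample_pct = NULL) {
  # Count how many of the two arguments were supplied
  n_given <- sum(!is.null(sample_size), !is.null(sample_pct))
  if (n_given != 1) {
    stop("Specify exactly one of sample_size or sample_pct.")
  }
  if (!is.null(sample_size)) {
    # sample_size must be a single positive whole number
    stopifnot(is.numeric(sample_size), length(sample_size) == 1,
              sample_size > 0, sample_size == round(sample_size))
  } else {
    # sample_pct must be a single number between 0 and 100
    stopifnot(is.numeric(sample_pct), length(sample_pct) == 1,
              sample_pct >= 0, sample_pct <= 100)
  }
  invisible(TRUE)
}

check_sample_args(sample_size = 10)    # passes
check_sample_args(sample_pct = 12.5)   # passes
```

Calling check_sample_args() with no arguments, or with both arguments set, stops with an error in this sketch.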
If fastx_output is specified, the sampled sequences are written to this file in the format given by output_format.
If fastx_output is NULL, the sampled sequences are returned as a FASTA or FASTQ object, depending on output_format.
vsearch_options allows users to pass additional command-line arguments to VSEARCH that are not directly supported by this function. Refer to the VSEARCH manual for more details.
Examples
if (FALSE) { # \dontrun{
# Define arguments
fastx_input <- file.path(file.path(path.package("Rsearch"), "extdata"),
                         "small_R1.fq")
fastx_output <- NULL
output_format <- "fastq"
sample_size <- 10

# Subsample sequences and return a FASTQ tibble
subsample_R1 <- vs_fastx_subsample(fastx_input = fastx_input,
                                   fastx_output = fastx_output,
                                   output_format = output_format,
                                   sample_size = sample_size)

# Subsample sequences and write subsampled sequences to a file
vs_fastx_subsample(fastx_input = fastx_input,
                   fastx_output = "subsample.fq",
                   output_format = output_format,
                   sample_size = sample_size)
} # }