vs_fastx_uniques dereplicates sequences in a FASTA/FASTQ file or object by
merging identical sequences using VSEARCH.
Usage
vs_fastx_uniques(
  fastx_input,
  output_format = "fastq",
  fastx_output = NULL,
  minuniquesize = 1,
  strand = "plus",
  sizein = TRUE,
  sizeout = TRUE,
  relabel = NULL,
  relabel_sha1 = FALSE,
  fastq_qout_max = FALSE,
  fasta_width = 0,
  sample = NULL,
  vsearch_options = NULL,
  tmpdir = NULL
)
Arguments
- fastx_input
(Required). A FASTA/FASTQ file path or FASTA/FASTQ object. See Details.
- output_format
(Optional). Desired output format of the file or tibble: "fasta" or "fastq" (default). If fastx_input is a FASTA file path or a FASTA object, output_format cannot be "fastq".
- fastx_output
(Optional). Name of the output file for dereplicated reads from fastx_input. The file can be in either FASTA or FASTQ format, depending on output_format. If NULL (default), no sequences are written to file. See Details.
- minuniquesize
(Optional). Minimum abundance a sequence must have after dereplication to be kept; sequences below this value are discarded. Defaults to 1.
- strand
(Optional). Specifies which strand to consider when comparing sequences. Can be either "plus" (default) or "both".
- sizein
(Optional). If TRUE (default), abundance annotations present in sequence headers are taken into account.
- sizeout
(Optional). If TRUE (default), abundance annotations are added to FASTA headers.
- relabel
(Optional). Relabel sequences using the given prefix and a ticker to construct new headers. Defaults to NULL.
- relabel_sha1
(Optional). If TRUE, relabel sequences using the SHA1 message digest algorithm. Defaults to FALSE.
- fastq_qout_max
(Optional). If TRUE, the quality score will be the highest (best) quality score observed in each position. Defaults to FALSE.
- fasta_width
(Optional). Number of characters per line in the output FASTA file. Defaults to 0, which eliminates wrapping.
- sample
(Optional). Add the given sample identifier string to sequence headers. For instance, if the given string is "ABC", the text ";sample=ABC" will be added to the header. If NULL (default), no identifier is added.
- vsearch_options
(Optional). A character string of additional arguments to pass to VSEARCH. Defaults to NULL. See Details.
- tmpdir
(Optional). Path to the directory where temporary files should be written when tables are used as input or output. Defaults to NULL, which resolves to the session-specific temporary directory (tempdir()).
Value
A tibble or NULL.
If fastx_output is specified, the dereplicated sequences are written
to the specified output file, and no tibble is returned.
If fastx_output is NULL, a tibble containing the dereplicated
reads in the format specified by output_format is returned.
Details
Sequences in the input file/object (fastx_input) are dereplicated by
merging identical sequences. Identical sequences are defined as sequences
with the same length and the same string of nucleotides (case insensitive, T
and U are considered the same).
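The identity rule above can be illustrated with a small base R sketch. This is only an illustration of the rule on made-up sequences (the vector seqs is hypothetical); it is not how VSEARCH implements dereplication internally.

```r
# Illustrative sketch only: mimics the identity rule in base R,
# not how VSEARCH implements dereplication internally.
seqs <- c("ACGT", "acgt", "ACGU", "ACGTA", "TTGA")

# Normalize: ignore case and treat U as T
normalized <- chartr("U", "T", toupper(seqs))

# Count how many input sequences collapse onto each unique sequence
abundance <- table(normalized)

# Sort by decreasing abundance
abundance <- sort(abundance, decreasing = TRUE)
print(abundance)
```

Here the first three sequences collapse into a single unique sequence with abundance 3, while "ACGTA" stays separate because its length differs.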
fastx_input can either be a FASTA/FASTQ file or a FASTA/FASTQ object.
FASTA objects are tibbles that contain the columns Header and Sequence,
see readFasta. FASTQ objects are tibbles that contain the columns Header,
Sequence, and Quality, see readFastq.
By default, the quality scores in FASTQ output files correspond to the
average error probability of the nucleotides in each position.
If fastq_qout_max = TRUE, the quality score will be the highest (best)
quality score observed in each position.
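As a worked illustration of the two options, the sketch below converts Phred scores to error probabilities, averages them, and converts back, then compares the result with the maximum score. The helper functions phred_to_prob and prob_to_phred and the scores in q_observed are hypothetical, and the exact rounding or capping that VSEARCH applies is not reproduced here.

```r
# Illustrative sketch only: the exact rounding and capping VSEARCH
# applies are not reproduced here.
phred_to_prob <- function(q) 10^(-q / 10)
prob_to_phred <- function(p) -10 * log10(p)

# Phred scores observed at one position across three identical reads
q_observed <- c(20, 30, 40)

# Default behaviour: average the error probabilities, then convert back
q_mean <- prob_to_phred(mean(phred_to_prob(q_observed)))

# fastq_qout_max = TRUE: keep the highest (best) score
q_max <- max(q_observed)

round(q_mean, 2)  # about 24.32
q_max             # 40
```

Because error probabilities are averaged rather than the scores themselves, the averaged quality is pulled toward the lowest observed score.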
If fastx_output is specified, the dereplicated sequences are output to
this file in the format given by output_format.
If fastx_output is NULL, the dereplicated sequences are
returned as a FASTA or FASTQ object, depending on output_format.
vsearch_options allows users to pass additional command-line arguments
to VSEARCH that are not directly supported by this function. Refer to
the VSEARCH manual for more details.
Examples
if (FALSE) { # \dontrun{
# Define arguments
fastx_input <- file.path(path.package("Rsearch"), "extdata", "small_R1.fq")
fastx_output <- NULL
output_format <- "fastq"

# Dereplicate sequences and return a FASTQ tibble
derep_R1 <- vs_fastx_uniques(fastx_input = fastx_input,
                             fastx_output = fastx_output,
                             output_format = output_format)

# Dereplicate sequences and write dereplicated sequences to a file
vs_fastx_uniques(fastx_input = fastx_input,
                 fastx_output = "dereplicated_sequences.fq",
                 output_format = output_format)
} # }