vs_cluster_unoise performs denoising of FASTA sequences from a given file or object using VSEARCH's cluster_unoise method.
Usage
vs_cluster_unoise(
fasta_input,
centroids = NULL,
otutabout = NULL,
size_column = FALSE,
id = 0.97,
minsize = 8,
unoise_alpha = 2,
strand = "plus",
sizein = TRUE,
sizeout = TRUE,
relabel = NULL,
relabel_sha1 = FALSE,
fasta_width = 0,
sample = NULL,
log_file = NULL,
threads = 1,
vsearch_options = NULL,
tmpdir = NULL
)
Arguments
- fasta_input
(Required). A FASTA file path or a FASTA object containing reads to denoise. See Details.
- centroids
(Optional). A character string specifying the name of the FASTA output file for the cluster centroid sequences. If NULL (default), no output is written to a file and the centroid sequences are returned as a FASTA object. See Details.
- otutabout
(Optional). A character string specifying the name of the output file in OTU table format. If NULL (default), no output is written to a file. If TRUE, the output is returned as a tibble. See Details.
- size_column
(Optional). If TRUE, a column with the size of each centroid is added to the centroid output tibble. Defaults to FALSE.
- id
(Optional). Pairwise identity threshold for a sequence to be added to a cluster. Defaults to 0.97. See Details.
- minsize
(Optional). Minimum abundance of sequences for denoising. Defaults to 8.
- unoise_alpha
(Optional). Alpha value for the UNOISE algorithm. Defaults to 2.
- strand
(Optional). Specifies which strand to consider when comparing sequences. Can be either "plus" (default) or "both".
- sizein
(Optional). If TRUE (default), abundance annotations present in sequence headers are taken into account.
- sizeout
(Optional). If TRUE (default), abundance annotations are added to FASTA headers.
- relabel
(Optional). Relabel sequences using the given prefix and a ticker to construct new headers. Defaults to NULL.
- relabel_sha1
(Optional). If TRUE, relabel sequences using the SHA1 message digest algorithm. Defaults to FALSE.
- fasta_width
(Optional). Number of characters per line in the output FASTA file. Defaults to 0, which eliminates wrapping.
- sample
(Optional). Add the given sample identifier string to sequence headers. For instance, if the given string is "ABC", the text ";sample=ABC" will be added to the header. This option is only applicable when the output format is FASTA (centroids). If NULL (default), no identifier is added.
- log_file
(Optional). Name of the log file to capture messages from VSEARCH. If NULL (default), no log file is created.
- threads
(Optional). Number of computational threads to be used by VSEARCH. Defaults to 1.
- vsearch_options
(Optional). Additional arguments to pass to VSEARCH. Defaults to NULL. See Details.
- tmpdir
(Optional). Path to the directory where temporary files should be written when tables are used as input or output. Defaults to NULL, which resolves to the session-specific temporary directory (tempdir()).
Value
A tibble or NULL.
If centroids is specified, the centroid sequences are written to the specified file, and no tibble is returned.
If otutabout is TRUE, an OTU table is returned as a tibble.
If otutabout is a character string, the output is written to the file, and no tibble is returned.
If neither centroids nor otutabout is specified, a FASTA object with the centroid sequences and an additional column otu_id is returned. The clustering statistics are included as an attribute named "statistics".
The "statistics" attribute of the returned tibble (when centroids is NULL) is a tibble with the following columns:
- num_nucleotides: Total number of nucleotides used as input for clustering.
- min_length_input_seq: Length of the shortest sequence used as input for clustering.
- max_length_input_seq: Length of the longest sequence used as input for clustering.
- avg_length_input_seq: Average length of the sequences used as input for clustering.
- num_clusters: Number of clusters generated.
- min_size_cluster: Size of the smallest cluster.
- max_size_cluster: Size of the largest cluster.
- avg_size_cluster: Average size of the clusters.
- num_singletons: Number of singletons after clustering.
- input: Name of the input file/object for the clustering.
Details
Sequences are denoised according to the UNOISE version 3 algorithm by Robert Edgar, but without the de novo chimera removal step. In this algorithm, clustering of sequences depends on both the sequence distance and the abundance ratio. The abundance ratio (skew) is the abundance of a new sequence divided by the abundance of the centroid sequence. For two sequences to be clustered together, the skew must not be larger than beta, which is calculated as 2 raised to the power of (-1 - alpha * d), where alpha is unoise_alpha and d is the sequence distance. The sequence distance used is the number of mismatches in the alignment, ignoring gaps. This means that, for a new sequence to be included in a cluster, its abundance must be exponentially lower as its distance from the centroid increases. Nearer sequences with higher abundances will form their own new clusters.
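The skew rule described above can be written out in a few lines of R. This is only an illustrative sketch of the formula as stated in this section, not VSEARCH's implementation; the helper names unoise_beta and joins_cluster are hypothetical and are not part of Rsearch or VSEARCH.

```r
# Hypothetical helpers illustrating the rule described in the text;
# they are not part of Rsearch or VSEARCH.

# beta = 2^(-1 - alpha * d), where d is the number of mismatches
unoise_beta <- function(d, alpha = 2) {
  2^(-1 - alpha * d)
}

# A sequence can join the centroid's cluster only if its skew
# (its abundance divided by the centroid abundance) does not exceed beta
joins_cluster <- function(query_size, centroid_size, d, alpha = 2) {
  skew <- query_size / centroid_size
  skew <= unoise_beta(d, alpha)
}

# Largest allowed skew for 1, 2 and 3 mismatches with alpha = 2:
# 1/8, 1/32 and 1/128
unoise_beta(1:3)

# One mismatch, skew 0.01: below 1/8, so the sequence joins the cluster
joins_cluster(query_size = 10, centroid_size = 1000, d = 1)

# One mismatch, skew 0.2: above 1/8, so the sequence starts its own cluster
joins_cluster(query_size = 200, centroid_size = 1000, d = 1)
```

With the default unoise_alpha of 2, each additional mismatch divides the allowed skew by 4, which is why only low-abundance variants are absorbed into an abundant centroid.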
fasta_input can either be a file path to a FASTA file or a FASTA object. FASTA objects are tibbles that contain the columns Header and Sequence, see readFasta. The Header column must contain the size of each sequence in the format ";size=X", where X is the read count for the given sequence. These annotations can be obtained by dereplicating the reads with vs_fastx_uniques using the sizeout = TRUE argument.
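Before denoising, it can be useful to confirm that every header carries a ";size=X" annotation. The following is a minimal base R sketch that assumes the headers are available as a character vector (for example the Header column of a FASTA object); the header values are made up for illustration.

```r
# Made-up headers in the ";size=X" format described above
headers <- c("seq1;size=120", "seq2;size=8", "seq3;size=1")

# TRUE for every header that carries an abundance annotation
has_size <- grepl(";size=[0-9]+", headers)

# Extract the read count X from each header
sizes <- as.integer(sub(".*;size=([0-9]+).*", "\\1", headers))

all(has_size)
sizes
```

If all(has_size) is FALSE, the input most likely has not been dereplicated with size annotations yet.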
If neither centroids nor otutabout is specified (default), the function returns the centroid sequences as a FASTA object with an additional column otu_id. This column contains the identifier extracted from each sequence header.
If centroids is specified, centroid sequences are written to the specified file in FASTA format.
otutabout gives the option to output the results in an OTU table format with tab-separated columns. When writing to a file, the first line starts with the string "#OTU ID", followed by a tab-separated list of all sample identifiers (formatted as "sample=X"). Each subsequent line, corresponding to an OTU, begins with the OTU identifier and is followed by tab-separated abundances for that OTU in each sample. If otutabout is a character string, the output is written to the specified file. If otutabout is TRUE, the function returns the OTU table as a tibble, where the first column is named otu_id instead of "#OTU ID".
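An OTU table that was written to a file can be read back into R with base functions. The sketch below writes a small file with made-up contents that follow the layout described above (the OTU and sample names are hypothetical), reads it with read.delim, and renames the first column to otu_id to match the tibble output.

```r
# Write a small example file with made-up contents in the described layout
otu_file <- tempfile(fileext = ".tsv")
writeLines(c("#OTU ID\tsampleA\tsampleB",
             "otu1\t120\t30",
             "otu2\t8\t0"),
           otu_file)

# check.names = FALSE keeps "#OTU ID" as is; comment.char = "" makes sure
# the leading "#" is not treated as the start of a comment
otu_tab <- read.delim(otu_file, check.names = FALSE, comment.char = "")

# Use the same first column name as the tibble returned by otutabout = TRUE
names(otu_tab)[1] <- "otu_id"
otu_tab
```

The result is a plain data frame; it can be converted with tibble::as_tibble() if a tibble is preferred.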
id is a value between 0 and 1 that defines the minimum pairwise identity required for a sequence to be added to a cluster. A sequence is not added to a cluster if its pairwise identity with the centroid is below the id threshold.
Pairwise identity is calculated as the number of matching columns divided by the alignment length minus terminal gaps.
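To make the identity definition concrete, here is a small sketch that applies it to two already aligned sequences, with "-" marking gaps. The function pairwise_identity is a hypothetical helper written for this illustration; it follows the definition as stated above and may differ in detail from how VSEARCH computes identity internally.

```r
# Hypothetical helper: identity of two aligned sequences of equal length,
# where "-" marks a gap
pairwise_identity <- function(aln_a, aln_b) {
  a <- strsplit(aln_a, "")[[1]]
  b <- strsplit(aln_b, "")[[1]]
  stopifnot(length(a) == length(b))

  # Keep only the columns where both sequences have started and neither
  # has ended; everything outside this range is a terminal gap
  first <- max(min(which(a != "-")), min(which(b != "-")))
  last  <- min(max(which(a != "-")), max(which(b != "-")))
  cols  <- first:last

  # Matching columns divided by alignment length minus terminal gaps
  matches <- sum(a[cols] == b[cols] & a[cols] != "-")
  matches / length(cols)
}

# 10 alignment columns, 2 terminal gap columns, 7 matches: 7 / 8 = 0.875
pairwise_identity("ACGTACGT--", "ACGAACGTTT")
```

With the default id of 0.97, a pair with an identity of 0.875 would not be placed in the same cluster.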
If log_file is NULL and centroids is specified, clustering statistics from VSEARCH will not be captured.
vsearch_options allows users to pass additional command-line arguments to VSEARCH that are not directly supported by this function. Refer to the VSEARCH manual for more details.
Examples
if (FALSE) { # \dontrun{
  # Define arguments
  fasta_input <- file.path(path.package("Rsearch"), "extdata", "small.fasta")
  centroids <- NULL

  # Denoise sequences and return a FASTA tibble
  denoise_seqs <- vs_cluster_unoise(fasta_input = fasta_input,
                                    centroids = centroids)

  # Extract clustering statistics from the returned tibble
  statistics <- attr(denoise_seqs, "statistics")

  # Denoise sequences and write centroids to a file
  vs_cluster_unoise(fasta_input = fasta_input,
                    centroids = "centroids_sequences.fa")
} # }