vs_cluster_subseq
clusters FASTA sequences from a given
file or object using VSEARCH
´s cluster_fast
method and 100
identity. The function automatically sorts sequences by decreasing length
before clustering.
Usage
vs_cluster_subseq(
fasta_input,
centroids = NULL,
strand = "plus",
sizein = TRUE,
fasta_width = 0,
log_file = NULL,
threads = 1,
vsearch_options = NULL,
tmpdir = NULL
)
Arguments
- fasta_input
(Required). A FASTA file path or a FASTA object containing reads to cluster. See Details.
- centroids
(Optional). A character string specifying the name of the FASTA output file for the cluster centroid sequences. If
NULL
(default), no output is written to a file and the centroid sequences are returned as a FASTA object. See Details.- strand
(Optional). Specifies which strand to consider when comparing sequences. Can be either
"plus"
(default) or"both"
.- sizein
(Optional). If
TRUE
(default), abundance annotations present in sequence headers are taken into account.- fasta_width
(Optional). Number of characters per line in the output FASTA file. Defaults to
0
, which eliminates wrapping.- log_file
(Optional). Name of the log file to capture messages from
VSEARCH
. IfNULL
(default), no log file is created.- threads
(Optional). Number of computational threads to be used by
VSEARCH
. Defaults to1
.- vsearch_options
(Optional). Additional arguments to pass to
VSEARCH
. Defaults toNULL
. See Details.- tmpdir
(Optional). Path to the directory where temporary files should be written when tables are used as input or output. Defaults to
NULL
, which resolves to the session-specific temporary directory (tempdir()
).
Value
A tibble or NULL
.
If centroids
is specified the centroid sequences are written to the
specified file, and no tibble is returned.
If centroids
is not specified, a FASTA object
is returned. This is a tibble
with columns Header
and
Sequence
, and also the additional column(s) members
and, if
sizein = TRUE
, size
.
Details
After merging/dereplication some sequences may be sub-sequences of longer sequences. This function will cluster such sequences at 100 (terminal gaps ignored), and keep the longest in each cluster as the centroid.
fasta_input
can either be a file path to a FASTA file or a FASTA
object. FASTA objects are tibbles that contain the columns Header
and
Sequence
, see readFasta
.
If sizein = TRUE
(default) the FASTA headers must contain text
matching the regular expression "size=[0-9]+"
indicating the copy
number (=size) of each input sequence. This is then summed for each cluster
and added to the output. This text is typically added by de-replication, see
vs_fastx_uniques
.
The number of distinct sequences in each cluster is output as members
.
vsearch_options
allows users to pass additional command-line arguments
to VSEARCH
that are not directly supported by this function. Refer to
the VSEARCH
manual for more details.
Examples
if (FALSE) { # \dontrun{
# Define arguments
fasta_input <- file.path(file.path(path.package("Rsearch"), "extdata"),
"small.fasta")
# De-replicating
derep.tbl <- vs_fastx_uniques(fasta_input, output_format = "fasta")
# Clustering subsequences
cluster.tbl <- vs_cluster_subseq(fasta_input = derep.tbl)
# Cluster sequences and write centroids to a file
vs_cluster_subseq(fasta_input = derep.tbl,
centroids = "distinct.fa")
} # }