vs_cluster_subseq clusters FASTA sequences from a given
file or object using VSEARCH´s cluster_fast method and 100
identity. The function automatically sorts sequences by decreasing length
before clustering.
Usage
vs_cluster_subseq(
fasta_input,
centroids = NULL,
strand = "plus",
sizein = TRUE,
fasta_width = 0,
log_file = NULL,
threads = 1,
vsearch_options = NULL,
tmpdir = NULL
)Arguments
- fasta_input
(Required). A FASTA file path or a FASTA object containing reads to cluster. See Details.
- centroids
(Optional). A character string specifying the name of the FASTA output file for the cluster centroid sequences. If
NULL(default), no output is written to a file and the centroid sequences are returned as a FASTA object. See Details.- strand
(Optional). Specifies which strand to consider when comparing sequences. Can be either
"plus"(default) or"both".- sizein
(Optional). If
TRUE(default), abundance annotations present in sequence headers are taken into account.- fasta_width
(Optional). Number of characters per line in the output FASTA file. Defaults to
0, which eliminates wrapping.- log_file
(Optional). Name of the log file to capture messages from
VSEARCH. IfNULL(default), no log file is created.- threads
(Optional). Number of computational threads to be used by
VSEARCH. Defaults to1.- vsearch_options
(Optional). Additional arguments to pass to
VSEARCH. Defaults toNULL. See Details.- tmpdir
(Optional). Path to the directory where temporary files should be written when tables are used as input or output. Defaults to
NULL, which resolves to the session-specific temporary directory (tempdir()).
Value
A tibble or NULL.
If centroids is specified the centroid sequences are written to the
specified file, and no tibble is returned.
If centroids is not specified, a FASTA object
is returned. This is a tibble with columns Header and
Sequence, and also the additional column(s) members and, if
sizein = TRUE, size.
Details
After merging/dereplication some sequences may be sub-sequences of longer sequences. This function will cluster such sequences at 100 (terminal gaps ignored), and keep the longest in each cluster as the centroid.
fasta_input can either be a file path to a FASTA file or a FASTA
object. FASTA objects are tibbles that contain the columns Header and
Sequence, see readFasta.
If sizein = TRUE (default) the FASTA headers must contain text
matching the regular expression "size=[0-9]+" indicating the copy
number (=size) of each input sequence. This is then summed for each cluster
and added to the output. This text is typically added by de-replication, see
vs_fastx_uniques.
The number of distinct sequences in each cluster is output as members.
vsearch_options allows users to pass additional command-line arguments
to VSEARCH that are not directly supported by this function. Refer to
the VSEARCH manual for more details.
Examples
if (FALSE) { # \dontrun{
# Define arguments
fasta_input <- file.path(file.path(path.package("Rsearch"), "extdata"),
"small.fasta")
# De-replicating
derep.tbl <- vs_fastx_uniques(fasta_input, output_format = "fasta")
# Clustering subsequences
cluster.tbl <- vs_cluster_subseq(fasta_input = derep.tbl)
# Cluster sequences and write centroids to a file
vs_cluster_subseq(fasta_input = derep.tbl,
centroids = "distinct.fa")
} # }