Taxonomic classification with LCA
Source:R/vs_alignment_classification.R
vs_alignment_classification.Rdvs_alignment_classification assigns taxonomy by global
alignment and Last Common Ancestor (LCA) consensus of database hits using
VSEARCH.
Usage
vs_alignment_classification(
fastx_input,
database,
lcaout = NULL,
lca_cutoff = 1,
top_hits_only = FALSE,
gapopen = "20I/2E",
gapext = "2I/1E",
id = 0.7,
strand = "plus",
maxaccepts = 2,
maxrejects = 32,
threads = 1,
vsearch_options = NULL,
tmpdir = NULL
)Arguments
- fastx_input
(Required). A FASTA/FASTQ file path or FASTA/FASTQ object. See Details.
- database
(Required). A FASTA/FASTQ file path or FASTA/FASTQ tibble object containing the target sequences.
- lcaout
(Optional). A character string specifying the name of the output file. If
NULL(default), no output is written to a file and the results are returned as a tibble with the columnsquery_idandtaxonomy.- lca_cutoff
(Optional). Adjust the fraction of matching hits required for the last common ancestor (LCA). Defaults to
1.0, which requires all hits to match at each taxonomic rank for that rank to be included. If a lower cutoff value is used, e.g. 0.95, a small fraction of non-matching hits are allowed while that rank will still be reported. The argument to this option must be between0.5and1.0.- top_hits_only
(Optional). If
TRUE, only the top hits with an equally high percentage of identity between the query and database sequence sets are written to the output. Defaults toFALSE.- gapopen
(Optional). Penalties for gap opening. Defaults to
"20I/2E". See Details.- gapext
(Optional). Penalties for gap extension. Defaults to
"2I/1E". See Details.- id
(Optional). Pairwise identity threshold. Defines the minimum identity required for matches. Defaults to
0.7.- strand
(Optional). Specifies which strand to consider when comparing sequences. Can be either
"plus"(default) or"both".- maxaccepts
(Optional). Maximum number of matching target sequences to accept before stopping the search for a given query. Defaults to
2. Must be larger than1for information to be useful.- maxrejects
(Optional). Maximum number of non-matching target sequences to consider before stopping the search for a given query. Defaults to 32. If
maxacceptsandmaxrejectsare both set to 0, the complete database is searched.- threads
(Optional). Number of computational threads to be used by
VSEARCH. Defaults to1.- vsearch_options
(Optional). Additional arguments to pass to
VSEARCH. Defaults toNULL. See Details.- tmpdir
(Optional). Path to the directory where temporary files should be written when tables are used as input or output. Defaults to
NULL, which resolves to the session-specific temporary directory (tempdir()).
Value
A tibble or NULL.
If lcaout is specified the results are written to the specified file.
If lcaout is NULL a data.frame is returned.
The data.frame contains the classification results for each query sequence.
Both the Header and Sequence columns of fasta_input are
copied into this table, and in addition are also the columns for each rank.
The ranks depend on the database file used, but are typically domain, phylum,
class, order,family, genus and species.
Details
Performs global sequence alignment against a reference database and assigns taxonomy using the Last Common Ancestor (LCA) approach, reporting the deepest taxonomic level consistently supported by the majority of hits.
fastx_input and database can either be file paths to a
FASTA/FASTQ files or FASTA/FASTQ objects. FASTA objects are tibbles that
contain the columns Header and Sequence, see
readFasta. FASTQ objects are tibbles that contain the
columns Header, Sequence, and Quality, see
readFastq.
Pairwise identity (id) is calculated as the number of matching columns
divided by the alignment length minus terminal gaps.
vsearch_options allows users to pass additional command-line arguments
to VSEARCH that are not directly supported by this function. Refer to
the VSEARCH manual for more details.
Visit the VSEARCH
documentation
for information about defining gapopen and gapext.
Examples
if (FALSE) { # \dontrun{
# Example files
db.file <- file.path(file.path(path.package("Rsearch"), "extdata"),
"sintax_db.fasta")
fasta.file <- file.path(file.path(path.package("Rsearch"), "extdata"),
"small.fasta")
tax.tbl <- vs_alignment_classification(fastx_input = fasta.file,
database = db.file)
View(tax.tbl)
} # }