Introduction
Rsearch
is an R package designed for handling and
analyzing targeted sequencing data. The package provides a user-friendly
interface for core VSEARCH
functions in addition to tools
for visualization and parameter optimization.
In this tutorial, we will walk through an example workflow using the
Rsearch
package to process a small multi-sample dataset.
For this tutorial, we will use Illumina-sequenced paired-end reads,
stored in FASTQ files and separated by sample.
Using Rsearch
, we will demonstrate how to perform key
processing steps, including:
- Inspecting raw data
- Trimming and filtering
- Merging read pairs
- Dereplicating reads
- Chimera detection
- Clustering reads into OTUs
- Taxonomic assignment
Finally, we will show how the data generated in Rsearch
can be imported into the phyloseq
R package for downstream
microbiome analysis. You can jump directly to that section here.
For the impatient
If you prefer to run the entire example workflow without inspecting intermediate steps, a streamlined version is provided at the end of the tutorial, using the pipe operator.
Installing Rsearch
To install Rsearch
from GitHub, you will first need the
devtools
package. If it’s not installed yet, you can
install it by running:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
If you don’t already have it, you will need to install the
phyloseq
package before installing
Rsearch
:
# Install phyloseq (via BiocManager) if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("phyloseq", quietly = TRUE)) {
  BiocManager::install("phyloseq")
}
Now you are ready to install Rsearch
:
# Install Rsearch from GitHub
devtools::install_github("CassandraHjo/Rsearch")
After installation, it is a good idea to restart your R session (in RStudio: Session > Restart R) to make sure everything is properly loaded.
Rsearch
requires that you have VSEARCH
installed. VSEARCH
performs the heavy computational tasks
quickly, while Rsearch
allows you to run these operations
conveniently from within R. Without Rsearch
, you would need
to run VSEARCH
commands manually from the command line.
Installation instructions for VSEARCH are available on the Rsearch GitHub page.
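If you are unsure whether the vsearch executable is already available on your system, you can check its location from within R using base R alone (this is independent of Rsearch):
# Returns the full path to the vsearch executable, or "" if it is not on your PATH
Sys.which("vsearch")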
You can verify the installation of both Rsearch
and
VSEARCH
by running:
library(Rsearch)
packageVersion("Rsearch")
[1] '1.0.0'
vsearch()
The VSEARCH executable is: vsearch
This is a valid command to invoke VSEARCH on this computer!
Getting started
Example data
The example data set in this tutorial is a subset of the samples from the publicly available project Sequencing marine seafloor samples.
Download a zipped archive with example data here
The data set contains samples with ZymoBIOMICS mock communities.
After unpacking the zipped archive you should have:
- A text file
metadata.txt
- A folder named
fastq
with 16 pairs of FASTQ files.
First, define the paths to your FASTQ files and metadata file:
fastq_path <- "../tutorial_data/fastq" # Change to the directory containing the FASTQ files
metadata_file <- "../tutorial_data/metadata.txt" # Change to the file path of your metadata file
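To confirm that the paths are correct before proceeding, you can list the FASTQ files and check that the metadata file exists (plain base R):
# List the FASTQ files found in the fastq directory
list.files(fastq_path)
# Check that the metadata file exists
file.exists(metadata_file)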
If you see the expected files listed and no errors, you are ready to proceed.
Inspect the raw data
First, we read the metadata file:
library(tidyverse) # provides read_tsv() and the other tidyverse functions used in this tutorial
sample.tbl <- read_tsv(metadata_file)
cat("Number of samples in the metadata file: ", nrow(sample.tbl), "\n")
Number of samples in the metadata file: 16
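The code below assumes that the metadata table contains (at least) the columns sample_id, R1_file and R2_file, holding the sample identifier and the file names of the R1 and R2 FASTQ files for each sample. You can quickly check this:
# Show the column names and the first few rows of the metadata
colnames(sample.tbl)
head(sample.tbl)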
Read quality
To help you run the tutorial interactively, we have included a
break
statement inside some for
-loops. This
causes the for
-loop to stop after processing the first
sample, allowing you to quickly preview results. You can simply remove
the break line to process all samples.
We assess the read quality using plot_base_quality
,
which plots the mean and median base quality scores across all bases for
each read. Error bars show the interquartile range (25th to 75th
percentiles). Note that the x-axis for the R2 panel has been flipped to
remind us that R1 and R2 reads are sequenced in opposite directions.
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
gg.obj <- plot_base_quality(R1.tbl, R2.tbl, show_overlap_box = TRUE)
print(gg.obj)
break # Remove this line to loop through all samples
}
In the first sample, we see that the R1 reads have generally higher and more stable quality compared to the R2 reads, which is a common feature of Illumina paired-end sequencing. The shaded regions at the ends of the curves represent the average overlap length when merging the paired-end reads, offering insight into the quality within those overlapping segments. As expected, read quality declines toward the ends of the reads, especially for the R2 reads, suggesting that trimming and filtering may be necessary before proceeding with downstream analyses.
Expected Error (EE) rate distribution
We can also explore the distribution of expected error (EE) rates across reads, which provides a measure of read quality (lower EE rates indicate higher quality).
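The expected error of a read is computed from its Phred quality scores: a score Q corresponds to an error probability of 10^(-Q/10), and summing these probabilities over all bases gives the expected number of errors in the read; dividing by the read length gives the EE rate. A minimal illustration (the quality scores below are made up):
# Hypothetical per-base Phred quality scores for one read
phred <- c(35, 35, 33, 30, 25, 20, 12)
# Expected number of errors in the read
ee <- sum(10^(-phred / 10))
# Expected error rate = expected number of errors per base
ee / length(phred)
We now plot the EE rate distributions for the first sample: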
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
R1.gg.obj <- plot_ee_rate_dist(R1.tbl,
plot_title = "Distribution of the expected error (EE) rate of R1 reads")
R2.gg.obj <- plot_ee_rate_dist(R2.tbl,
plot_title = "Distribution of the expected error (EE) rate of R2 reads")
print(R1.gg.obj)
print(R2.gg.obj)
break # Remove this line to loop through all samples
}
In the first sample, most reads have low expected error rates, as indicated by distributions concentrated at low EE values. However, a subset exhibits higher EE rates, suggesting the presence of low-quality reads that require further inspection and may be candidates for removal.
Trimming and filtering
Choosing the best trimming and filtering settings is important for obtaining good merging results.
- Trimming: Removing poor-quality bases from the ends of the reads.
- Filtering: Removing whole reads with poor quality.
Rsearch
provides two new helper functions,
vs_optimize_truncqual
and
vs_optimize_truncee_rate
, to optimize the trimming. These
functions search parameter combinations to find the settings that yield
the best merging results. For details, see the section on optimizing trimming
settings.
Merging raw reads
As an optional but informative step, merging the raw reads before trimming and filtering can help establish a baseline understanding of read overlap and merging success.
Here we try to merge all the reads we have for the first sample,
without any trimming or filtering. The vs_merging_lengths
function can be used to plot some merging statistics:
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
merging_raw.tbl <- vs_merging_lengths(fastq_input = R1.tbl,
reverse = R2.tbl)
print(attr(merging_raw.tbl, "plot"))
print(data.frame(attr(merging_raw.tbl, "statistics")))
break # Remove this line to loop through all samples
}
Tot_num_pairs | Merged | Mean_Read_Length_before_merging | Mean_Read_Length_after_merging | StdDev_Read_Length | R1 | R2 |
---|---|---|---|---|---|---|
67857 | 55564 | 279.5 | 428.98 | 2.08 | R1.tbl | R2.tbl |
From the first sample, approximately 82% (55 564 out of 67 857) of the read pairs merged successfully. The two top panels show the length distribution of the R1 and R2 reads. In this case all R1 reads have the same length (278 bases) and all R2 reads have the same length (281 bases).
The two bottom panels show the distribution of the length of the merged reads, and the distribution of the overlap length. In this example, almost all merged reads have a length of around 430 bases, and the length of the overlap is around 130 bases.
The results from merging the raw reads above include singletons,
i.e. merged read pairs with a copy number of 1. If we assume that sequencing
errors are more likely to occur among singletons and remove them, we are
left with approximately 49% (33 259 out of 67 857) successfully merged
read pairs when merging the raw read pairs. This experiment can easily
be performed using the vs_fastq_mergepairs
and
vs_fastx_uniques
functions, as shown below.
library(stringr)
library(tidyverse)
# Merge the raw read pairs without any trimming or filtering
m_raw.tbl <- vs_fastq_mergepairs(R1.tbl,
                                 R2.tbl,
                                 minovlen = 10,
                                 minlen = 0,
                                 threads = 1)
# Dereplicate and keep only merged reads with a copy number above 1 (i.e. remove singletons)
d_raw.tbl <- vs_fastx_uniques(m_raw.tbl) %>%
  mutate(centroid_size = str_extract(Header, "(?<=;size=)\\d+")) %>%
  mutate(centroid_size = as.numeric(centroid_size)) %>%
  filter(centroid_size > 1)
# Total number of merged read pairs remaining after removing singletons
sum(d_raw.tbl$centroid_size)
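To reproduce the approximately 49% figure quoted above, relate the number of merged, non-singleton read pairs to the total number of raw read pairs for this sample (assuming, as elsewhere in this tutorial, that R1.tbl has one row per read):
# Fraction of raw read pairs that merged and have a copy number above 1
sum(d_raw.tbl$centroid_size) / nrow(R1.tbl)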
Together this tells us:
- We should discard the very short reads, and we may set a threshold at 50 bases.
- The rather long overlap means we may trim the 3’ ends of the reads quite a lot and still maintain a long enough overlap to merge.
Optimizing trimming settings
At this stage, Rsearch
introduces a new step in the
pipeline: optimizing the trimming of reads to improve merging results.
The aim is not only to maximize the number of merged reads, but also to
ensure that the merged reads are of high quality. For this purpose,
Rsearch introduces two new functions that systematically explore
ranges of truncqual
and truncee_rate
values
and report the settings that give the best result. Although the
optimization is run on a single sample, we find that samples from the
same sequencing run usually behave similarly. Since we assume that
sequencing errors are more frequent among singletons, the evaluation
focuses on merged reads with a copy number of 2 or more; a larger number
of such reads is taken as evidence of better merging.
During optimization the reads are also filtered to remove reads with an
average expected error greater than 0.01 (the default). This threshold can
be adjusted via the maxee_rate
parameter when working with
other datasets.
Optimize truncqual value
The vs_optimize_truncqual
function finds the optimal
truncqual
value for truncating reads, i.e. trimming at the
3’ end. The truncqual
value is the quality score threshold
at which reads are truncated: each read is cut at the first base whose quality score is at or below this value. The
function tests a range of truncqual
values and visualizes
the effect on merging success. The optimal truncqual
value
is measured by the number of merged read pairs with a copy number above
the number specified by min_size
, after dereplication.
The function may take some time to run, but you can view the progress through a progress bar in the console.
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
optimize_truncqual.tbl <- vs_optimize_truncqual(fastq_input = R1.tbl,
reverse = R2.tbl
)
print(attr(optimize_truncqual.tbl, "plot"))
break # Remove this line to loop through all samples
}
For the first sample, the best truncqual
value is 7. At
this setting, approximately 50% (33 740 out of 67 857) of the read pairs are
successfully merged.
Although this number is only slightly higher than the results
from the raw merging, the important difference is that only high-quality
merged reads with a copy number of 2 or more and an average expected error
below 0.01 are considered here. Thus, trimming with
truncqual = 7
effectively retains higher-confidence reads
at the cost of discarding lower-quality ones, which is exactly what we
want.
Why not use all merged reads here? Because singleton reads often contain sequencing errors, which would bias the evaluation. We therefore base the optimization entirely on multi-copy, high-quality reads.
Optimize truncee_rate value
Similarly, the vs_optimize_truncee_rate
function
optimizes the trimming with the truncee_rate
parameter
rather than truncqual
. The truncee_rate
value
truncates the sequences so that their average expected error per base is
not higher than the specified value. The function tests a range of
truncee_rate
values and visualizes the effect on merging
success. The optimal truncee_rate
value is measured by the
number of merged read pairs with a copy number above the number
specified by min_size
, after dereplication.
The function may take some time to run, but you can view the progress through a progress bar in the console.
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
optimize_truncee_rate.tbl <- vs_optimize_truncee_rate(fastq_input = R1.tbl,
reverse = R2.tbl
)
print(attr(optimize_truncee_rate.tbl, "plot"))
break # Remove this line to loop through all samples
}
The optimal truncee_rate
for this sample is 0.01, with
approximately 51% (34 668 out of 67 857) successfully merged read pairs.
Again, the number of merged read pairs is only slightly higher than
when merging the raw reads, but the reads retained here are of higher quality
and reliability, because singletons and low-quality reads are removed,
which is critical for downstream clustering and analysis.
Based on the results from the optimization plots, the
truncqual
and truncee_rate
values can be set
in the trimming and filtering function:
vs_fastx_trim_filt
.
Trimming and filtering reads
Now we apply trimming and filtering using reasonable parameters. We
will use the same for
-loop as above to loop through all
samples, but this time we will add the vs_fastx_trim_filt
function to trim and filter all the reads.
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
# Trim and filter
R1_filt.tbl <- vs_fastx_trim_filt(R1.tbl, R2.tbl,
minlen = 50,
truncee_rate = 0.01,
stripleft = 12)
R2_filt.tbl <- attr(R1_filt.tbl, "reverse")
print(data.frame(attr(R1_filt.tbl, "statistics")))
break # Remove this line to loop through all samples
}
Kept_Sequences | Discarded_Sequences | fastx_source | reverse_source |
---|---|---|---|
60977 | 6880 | R1.tbl | R2.tbl |
We also trim 12 bases from the 5’ end (stripleft = 12
)
of each read, based on the earlier quality assessment.
We can again assess the read quality after trimming and filtering
using plot_base_quality
:
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
# Trim and filter
R1_filt.tbl <- vs_fastx_trim_filt(R1.tbl, R2.tbl,
minlen = 50,
truncee_rate = 0.01,
stripleft = 12)
R2_filt.tbl <- attr(R1_filt.tbl, "reverse")
gg.obj <- plot_base_quality(R1_filt.tbl, R2_filt.tbl, show_overlap_box = TRUE)
print(gg.obj)
break # Remove this line to loop through all samples
}
The plot shows that the read quality has improved after trimming and filtering: the base quality scores are now consistently higher across all positions than in the plot of the raw reads.
Merging trimmed and filtered reads
After trimming and filtering, we merge the read pairs using the
vs_fastq_mergepairs
function. We then expand our previous
for
-loop to include merging like this:
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
# Trim and filter
R1_filt.tbl <- vs_fastx_trim_filt(R1.tbl, R2.tbl,
minlen = 50,
truncee_rate = 0.01,
stripleft = 12)
R2_filt.tbl <- attr(R1_filt.tbl, "reverse")
# Merge reads
merged.tbl <- vs_fastq_mergepairs(R1_filt.tbl, R2_filt.tbl)
print(data.frame(attr(merged.tbl, "statistics")))
break # Remove this line to loop through all samples
}
Tot_num_pairs | Merged | Mean_Read_Length_before_merging | Mean_Read_Length_after_merging | StdDev_Read_Length | R1 | R2 |
---|---|---|---|---|---|---|
60977 | 58323 | 254.92 | 404.97 | 2.33 | R1_filt.tbl | R2_filt.tbl |
In the statistics table from the merging one can see the total number of read pairs, the number of merged reads, and length statistics about the reads before and after merging. For this sample 58 323 out of 60 977 read pairs were successfully merged. The mean read length after merging is approximately 405 bases, with a standard deviation of 2.3 bases.
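If you want to compute the merging rate programmatically, the same numbers can be pulled from the statistics attribute (a small sketch using the column names shown in the table above):
merge_stats <- data.frame(attr(merged.tbl, "statistics"))
# Fraction of trimmed and filtered read pairs that merged successfully (about 0.96 here)
merge_stats$Merged / merge_stats$Tot_num_pairs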
Dereplicating the merged reads
The merged reads are dereplicated to remove redundancy using the
vs_fastx_uniques
function. We expand our
for
-loop to include dereplication, and because the next
step, chimera detection, requires a FASTA object as input we will set
the output_format
to "fasta"
. We also want to
add the sample identifier to the header of each sequence, which is done
by using the sample
parameter.
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
# Trim and filter
R1_filt.tbl <- vs_fastx_trim_filt(R1.tbl, R2.tbl,
minlen = 50,
truncee_rate = 0.01,
stripleft = 12)
R2_filt.tbl <- attr(R1_filt.tbl, "reverse")
# Merge reads
merged.tbl <- vs_fastq_mergepairs(R1_filt.tbl, R2_filt.tbl)
# Dereplicate
derep.tbl <- vs_fastx_uniques(fastx_input = merged.tbl,
output_format = "fasta",
sample = sample.tbl$sample_id[i])
break # Remove this line to loop through all samples
}
cat("Number of unique reads after dereplication: ", nrow(derep.tbl), "\n")
Number of unique reads after dereplication: 25645
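As a quick sanity check, the copy numbers stored in the headers of the dereplicated reads should sum to the number of merged reads they represent. The size can be extracted from the Header column in the same way as in the raw-merging experiment earlier:
# Sum of copy numbers across the unique reads = total number of merged reads represented
derep.tbl %>%
  mutate(size = as.numeric(str_extract(Header, "(?<=;size=)\\d+"))) %>%
  summarise(total_merged_reads = sum(size))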
Plot size distribution
An optional step is to plot the size (copy number) distribution of
the dereplicated reads, using the plot_size_dist
function:
plot_size_dist(fastx_input = derep.tbl)
In this example, we can see that most of the dereplicated reads have a low copy number, and only a few reads have high copy numbers.
Chimera detection
Chimera detection and removal can be performed using the
vs_uchime_denovo
function. Again, we expand our
for
-loop to include chimera detection.
This is the last step in the preprocessing of the reads, and we will have to store our results somewhere for downstream analysis.
- Option 1: Store the results from each sample in a FASTA table.
- Option 2: Write the results to file.
Both options are demonstrated in the code below, but we will use the first option in this tutorial.
# Create empty table to store results
all.tbl <- tibble()
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
# Trim and filter
R1_filt.tbl <- vs_fastx_trim_filt(R1.tbl, R2.tbl,
minlen = 50,
truncee_rate = 0.01,
stripleft = 12)
R2_filt.tbl <- attr(R1_filt.tbl, "reverse")
# Merge reads
merged.tbl <- vs_fastq_mergepairs(R1_filt.tbl, R2_filt.tbl)
# Dereplicate
derep.tbl <- vs_fastx_uniques(fastx_input = merged.tbl,
output_format = "fasta",
sample = sample.tbl$sample_id[i])
# Detect chimeras
derep_wo_chimera.tbl <- vs_uchime_denovo(derep.tbl)
# Store results for downstream analysis
# Option 1: Append results to all.tbl
all.tbl <- bind_rows(all.tbl, derep_wo_chimera.tbl)
# Option 2: Write results to file
# writeFasta(derep_wo_chimera.tbl,
# out.file = file.path("fasta", # Change to the directory where you want to save the FASTA files
# str_c(sample.tbl$sample_id[i], ".fasta")))
# break # The break is commented out here so that all samples are processed and stored
}
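After the loop has run over all samples, all.tbl contains the dereplicated, chimera-filtered reads from every sample, with the sample identifier embedded in each header (of the form ...;sample=<id>;size=<n>, as shown in the table further down). A quick way to count how many unique sequences each sample contributed:
all.tbl %>%
  mutate(sample_id = str_extract(Header, "(?<=;sample=)[^;]+")) %>%
  count(sample_id, name = "n_unique_sequences")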
Clustering
Clustering sequences at a given identity threshold to create OTUs can
be performed using the vs_cluster_size
function. We will
now use the table with all the reads from all samples from earlier
(all.tbl
).
If you wrote the results to file, you need to read the files into R again:
files <- list.files("../tutorial_data/fasta", # Change to the directory where you saved your FASTA files
full.names = TRUE)
all.tbl <- lapply(files, readFasta) %>%
bind_rows()
Alternatively, if you don’t want to run the entire pipeline yourself,
you can use the all_tbl.txt.zip
file that is included in
the example data.
all.tbl <- read_tsv("path/to/all_tbl.txt.zip") # Change to the file path of your all_tbl.txt.zip file
Next we dereplicate the reads again, but this time we will use the
vs_fastx_uniques
function with the
output_format
set to "fasta"
and
minuniquesize
set to 2
. This will create a
FASTA table with all the unique sequences from all samples with a copy
number of at least 2. The relabel
parameter could also be used here to rename the headers of the unique
sequences; in this tutorial we instead relabel them to OTU identifiers in the clustering step below.
all_derep.tbl <- vs_fastx_uniques(all.tbl,
minuniquesize = 2,
output_format = "fasta")
Header | Sequence |
---|---|
M07166:204:000000000-KN9DC:1:1101:22227:1435;sample=P10_ZB;size=74637 | CGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGGGAAGAACAAGTACCGTTCGAATAGGGCGGTACCTTGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTCCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGG |
M07166:204:000000000-KN9DC:1:1101:21037:1765;sample=P10_ZB;size=60912 | CACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGTGTTGTGGTTAATAACCGCAGCAATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTTGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGG |
M07166:204:000000000-KN9DC:1:1101:10597:1525;sample=P10_ZB;size=59016 | CACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGG |
M07166:204:000000000-KN9DC:1:1101:18309:1743;sample=P10_ZB;size=45432 | GGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTAGAGAAGAACAAGGACGTTAGTAACTGAACGTCCCCTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGG |
M07166:204:000000000-KN9DC:1:1101:12703:1536;sample=P10_ZB;size=39963 | GACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTAAGTTAATACCTTGCTGTTTTGACGTTACCAACAGAATAAGCACCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTACTGAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGG |
M07166:204:000000000-KN9DC:1:1101:22294:1864;sample=P10_ZB;size=37536 | CGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGTTATTAGGGAAGAACATATGTGTAAGTAACTGTGCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGG |
Next we cluster the unique sequences at 95% identity using the
vs_cluster_size
function:
sequence.tbl <- vs_cluster_size(all_derep.tbl,
id = 0.95,
relabel = "OTU")
head(sequence.tbl)
print(attr(sequence.tbl, "statistics"))
Header | Sequence | otu_id |
---|---|---|
OTU1;size=151721 | CGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGGGAAGAACAAGTACCGTTCGAATAGGGCGGTACCTTGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTCCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGG | OTU1 |
OTU2;size=142184 | CACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGTGTTGTGGTTAATAACCGCAGCAATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTTGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGG | OTU2 |
OTU3;size=138095 | CACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGG | OTU3 |
OTU4;size=93098 | GGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTAGAGAAGAACAAGGACGTTAGTAACTGAACGTCCCCTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGG | OTU4 |
OTU5;size=79738 | GACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTAAGTTAATACCTTGCTGTTTTGACGTTACCAACAGAATAAGCACCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTACTGAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGG | OTU5 |
OTU6;size=91320 | CGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGTTATTAGGGAAGAACATATGTGTAAGTAACTGTGCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGG | OTU6 |
num_nucleotides | min_length_input_seq | max_length_input_seq | avg_length_input_seq | num_clusters | min_size_cluster | max_size_cluster | avg_size_cluster | num_singletons | input |
---|---|---|---|---|---|---|---|---|---|
12277056 | 238 | 408 | 404.8093 | 119 | 2 | 151721 | 6867.1 | 0 | all_derep.tbl |
To plot the distribution of the copy number for the different
centroids, you can use the plot_size_dist
function:
plot_size_dist(sequence.tbl)
Creating the OTU table
Finally, we create the read count table. This may take some time, as
we have to search all reads against all OTU centroids using the
vs_usearch_global
function. To increase speed, you can
run the function with multiple threads, depending on how many cores your
computer has.
readcount.tbl <- vs_usearch_global(fastx_input = all.tbl,
database = sequence.tbl,
otutabout = TRUE,
id = 0.95,
threads = 4 # Change to the number of threads you want to use
)
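The resulting readcount.tbl should contain one row per OTU and one column per sample, in addition to the OTU identifier column (see the description of the read count table in the phyloseq handoff section below). A quick inspection:
# One row per OTU; one read count column per sample
dim(readcount.tbl)
head(readcount.tbl)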
Assign taxonomy
Taxonomy can be assigned using either:
- vs_sintax
- vs_usearch_global
Both require a reference database with taxonomy, for example the SILVA database. You can read more about the requirements for the reference database in the VSEARCH documentation.
Alternative 1: vs_sintax
You can assign taxonomy to OTUs using the vs_sintax
function as follows:
tax.tbl <- vs_sintax(fasta_input = sequence.tbl,
database = "SOME_DATABASE_FILE.fa.gz" # Change to the file path of your SINTAX reference database
)
Alternative 2: vs_usearch_global
The vs_usearch_global
function can also be used to assign
taxonomy to the OTUs. The format of the output table is a little
different, so we do some data wrangling to convert it into the
correct format for an Rsearch
object.
tax_v2.tbl <- vs_usearch_global(fastx_input = sequence.tbl,
database = "SOME_DATABASE_FILE.fa.gz" # Change to the file path of your reference database
)
# Make correct format for Rsearch object
tax_v2.tbl <- tax_v2.tbl |>
mutate(domain = str_extract(target, "d:([^,]+)")) |>
mutate(domain = str_remove_all(domain, "d:|\\(.+")) |>
mutate(phylum = str_extract(target, "p:([^,]+)")) |>
mutate(phylum = str_remove_all(phylum, "p:|\\(.+")) |>
mutate(class = str_extract(target, "c:([^,]+)")) |>
mutate(class = str_remove_all(class, "c:|\\(.+")) |>
mutate(order = str_extract(target, "o:([^,]+)")) |>
mutate(order = str_remove_all(order, "o:|\\(.+")) |>
mutate(family = str_extract(target, "f:([^,]+)")) |>
mutate(family = str_remove_all(family, "f:|\\(.+")) |>
mutate(genus = str_extract(target, "g:([^,]+)")) |>
mutate(genus = str_remove_all(genus, "g:|\\(.+")) |>
mutate(species = str_extract(target, "s:([^,]+)")) |>
mutate(species = str_remove_all(species, "s:|\\(.+")) |>
mutate(Header = query) |>
right_join(sequence.tbl, by = "Header")
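To see what these regular expressions do, here is the phylum pattern applied to a single made-up target string; the exact annotation format depends on your reference database, but the taxon-prefix:name(confidence) style is what the code above expects:
x <- "d:Bacteria(1.00),p:Firmicutes(0.98),c:Bacilli(0.95)"
# Extract the phylum field, then strip the prefix and the confidence value
str_remove_all(str_extract(x, "p:([^,]+)"), "p:|\\(.+")
# [1] "Firmicutes"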
Saving processed data
To avoid having to run this pipeline again in the future, it can be a
good idea to save the current tables as an RData
object,
but this is optional:
save(sample.tbl,
sequence.tbl,
readcount.tbl,
tax.tbl,
file = "tutorial_data_processed.RData" # Change the filename to your preferred name for the saved processed data
)
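In a later R session, the saved tables can be restored with load():
# Restores sample.tbl, sequence.tbl, readcount.tbl and tax.tbl into the workspace
load("tutorial_data_processed.RData") # Change to the filename you used above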
Full pipeline (short version)
If you want to run the pipeline without all the intermediate steps, you can use the pipe operator in R:
all.tbl <- tibble()
for(i in 1:nrow(sample.tbl)){
R1.tbl <- readFastq(file.path(fastq_path, sample.tbl$R1_file[i]))
R2.tbl <- readFastq(file.path(fastq_path, sample.tbl$R2_file[i]))
derep_wo_chimera.tbl <-
vs_fastx_trim_filt(R1.tbl, R2.tbl,
minlen = 50,
truncee_rate = 0.01, # same value as in the step-by-step workflow above
stripleft = 12) %>%
vs_fastq_mergepairs() %>%
vs_fastx_uniques(output_format = "fasta",
sample = sample.tbl$sample_id[i]) %>%
vs_uchime_denovo()
all.tbl <- bind_rows(all.tbl, derep_wo_chimera.tbl)
}
all_derep.tbl <- vs_fastx_uniques(all.tbl,
minuniquesize = 2,
relabel = "OTU",
output_format = "fasta")
sequence.tbl <- vs_cluster_size(all_derep.tbl)
readcount.tbl <- vs_usearch_global(fastx_input = all.tbl,
database = sequence.tbl,
otutabout = TRUE,
id = 0.95,
threads = 4
)
tax.tbl <- vs_sintax(fasta_input = sequence.tbl,
database = "SOME_DATABASE_FILE.fa.gz" # Change to the file path of your SINTAX reference database
)
Handoff to phyloseq
The phyloseq R package is a powerful tool for analyzing
microbiome data. Below, we demonstrate how to import the tables
generated by Rsearch
into a phyloseq
object.
First, use the rsearch_obj
function to standardize and
organize your data into an Rsearch
object. An
Rsearch
object is a list containing three elements with
data structures that can be used as input to build a
phyloseq
object:
- Read count table
A data frame with one row per OTU and one column per sample:
  - The first column must list OTU identifiers (matching those in the sequence table).
  - Remaining columns correspond to sample names, with each cell containing the read count for that OTU in that sample.
- Sequence table
A data frame of centroid sequences representing each OTU:
  - The first column must be named Header and contain the OTU identifiers.
  - One column must be named Sequence (containing the DNA sequences).
  - Additional columns (e.g., taxonomic classifications) are optional.
- Sample metadata table
A data frame with one row per sample:
  - One column must contain unique sample identifiers that match the column names in the read count table.
  - Other columns can include any metadata (e.g., treatment group, collection date, etc.).
The sample_id_col
parameter specifies the name of the
column containing the unique sample identifiers in the sample metadata
table.
The Rsearch
object can then be converted to a
phyloseq
object using the rsearch2phyloseq
function:
# Create Rsearch object
rsearch_obj <- rsearch_obj(readcount_data = readcount.tbl,
sequence_data = tax.tbl,
sample_data = sample.tbl,
sample_id_col = "sample_id")
# Convert Rsearch object to phyloseq object
phy_obj <- rsearch2phyloseq(rsearch_obj,
sample_id_col = "sample_id")
Now that you’ve completed these steps, you’re all set to explore and analyze your data.
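As a first sanity check of the handoff, you can inspect the new object with some of phyloseq's standard accessor functions (these come from phyloseq, not Rsearch):
library(phyloseq)
phy_obj # print a summary of the phyloseq object
ntaxa(phy_obj) # number of OTUs
nsamples(phy_obj) # number of samples
sample_names(phy_obj) # sample identifiers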