6 Inspect Read Quality
Using good quality input data improves our chance of getting a good quality output assembly.
We call the set of sequenced reads we obtain from the sequencer the raw reads. Modern sequencing technologies usually give us high quality data, but it is still possible for some low-quality read sequences to be produced among the raw reads. So that these do not detrimentally affect the quality of our assembly, we first assess the quality of our read data, and then remove low-quality reads and low-quality parts of read sequences. The resulting dataset is often called the cleaned reads or processed reads.
It is good practice always to inspect the quality of your sequencing read data, and remove low-quality reads, before assembly.
When describing your sequencing experiment in a manuscript or dissertation, you should always state:
- The sequencing technology used, and the sequencing platform
- The number of raw reads obtained, and a measure of average read length
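As a rough illustration of how the read count and average read length could be obtained, here is a minimal Python sketch. It assumes the common four-line FASTQ record layout (header, sequence, `+`, qualities); the function names and the file name in the usage comment are placeholders invented for this example, not part of the workshop materials.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file for text reading, whether or not it is gzip-compressed."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def read_length_summary(lines):
    """Return (number of reads, mean read length) from an iterable of FASTQ lines.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    n_reads = 0
    total_bases = 0
    for line_number, line in enumerate(lines):
        if line_number % 4 == 1:  # the sequence is the second line of each record
            n_reads += 1
            total_bases += len(line.strip())
    mean_length = total_bases / n_reads if n_reads else 0.0
    return n_reads, mean_length


# Hypothetical usage (the file name is a placeholder):
# with open_fastq("reads_1.fastq.gz") as handle:
#     print(read_length_summary(handle))
```

The mean is only one possible measure of average read length; for datasets with very variable read lengths, a median or N50 may be more informative.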
The main kinds of low-quality data we might find in our raw reads are:
- Contamination: reads that derive from an organism we didn’t intend to sequence; these are usually identified by querying against a database and removed
- Low-confidence base calls: reads containing base calls that the basecaller is unsure about; these are usually found towards the end of the read sequence and are removed by trimming the read
- Low-confidence reads: reads where a large proportion of the base calls are low quality; these are usually removed
- Adapter sequence: these are sections of sequence left over from library preparation and do not derive from the sequenced organism; they are removed by trimming
This part of the workshop will cover the use of FastQC to inspect the quality of your sequence reads.
6.1 Using FastQC
- Navigate to the FastQC tool using the Tools sidebar in Galaxy
  - You can use the search tools field to find FastQC
  - Alternatively, you will find it under GENOMIC FILE MANIPULATION / FASTA/FASTQ in the sidebar
- Select the FastQC tool
- Run the FastQC tool on each of your FASTQ input files
  - These are the trimmed_pe_aln.qsorted.mapped.fixed.1.fastq.gz and trimmed_pe_aln.qsorted.mapped.fixed.2.fastq.gz files.
- With the FastQC tool options in the main Workspace window, select the file you want to run FastQC on.
- Click on the Run Tool button.
Make sure you run FastQC on both the forward and reverse read sets.
FastQC to assess the quality of your sequencing reads
FastQC takes a few moments to run.
When the run is complete, click on the eye icon of the Webpage result to inspect the FastQC output.
You can press and hold CTRL (on a PC) or CMD ⌘ (macOS) when clicking the eye icon to open the result in its own window.
6.2 FastQC output
The output of FastQC tells us a lot about the quality of our sequencing data, and we would typically use it to identify problems with that data. Some important sections are described below, and you can read more about how to diagnose quality issues in your dataset at the links below:
FastQC output in Galaxy
6.2.1 FastQC Summary
The FastQC summary (Figure 6.1) provides an overview that indicates where there may be areas of concern in your data. Each section of the report is flagged as a pass (green, with a tick), a warning (amber, with an exclamation mark), or a fail (red, with a cross).
6.2.2 Per base sequence quality
The per base sequence quality plot (Figure 6.2) presents the ranges of individual base quality calls across the lengths of the input sequence reads. Higher scores (near 40) indicate better quality.
For good-quality sequence read data, we are looking for a graph where the mean quality (see Figure 6.2) is always in the green area, with a score greater than 29.
With Illumina and other sequencing technologies, it is not unusual for the per-base quality scores to fall quite steeply towards the end of the reads.
It is also typical for the quality score to be slightly lower in the first 5-7 bases of the read.
If mean quality per base falls below 29, there may be a problem with the sequencing data.
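To make the quality scores more concrete: a Phred score Q corresponds to an estimated base call error probability of 10^(-Q/10), so a score of 30 means roughly one error in 1000 calls. The sketch below decodes FASTQ quality strings and computes the mean score at each read position, which is similar in spirit to the mean line in the per base sequence quality plot. It assumes the Phred+33 encoding used by most current FASTQ files; the function names are invented for this example, and this is not how FastQC itself is implemented.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the estimated probability of a wrong call."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(character) - offset for character in quality_string]


def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position across all reads."""
    totals = []  # sum of scores seen at each read position
    counts = []  # number of reads long enough to reach each position
    for quality_string in quality_strings:
        for position, score in enumerate(decode_qualities(quality_string, offset)):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Positions where the returned mean drops below your chosen threshold are the ones that would appear outside the green area of the plot.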
6.2.3 Per sequence quality scores
The per sequence quality score (Figure 6.3) summarises the distribution of high quality (higher scores, near 40) and lower quality reads in the complete dataset.
In good quality sequence data, we tend to see a sharp peak towards the right hand side of the plot.
If the per sequence quality score appears flat, or plateaus towards the left hand side of the plot, there may be a problem with the sequencing data.
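The distribution described above can be sketched by computing one mean quality score per read and counting how many reads fall at each score. This is only an illustration under stated assumptions: Phred+33 encoding, means binned by truncating to a whole number (an arbitrary choice for this sketch), and function names invented here rather than taken from FastQC.

```python
from collections import Counter


def mean_read_quality(quality_string, offset=33):
    """Return the mean Phred score of a single read (assumes Phred+33)."""
    scores = [ord(character) - offset for character in quality_string]
    return sum(scores) / len(scores)


def per_sequence_quality_counts(quality_strings, offset=33):
    """Count reads by their mean quality, truncated to a whole number."""
    counts = Counter()
    for quality_string in quality_strings:
        if quality_string:  # skip empty reads, which have no mean quality
            counts[int(mean_read_quality(quality_string, offset))] += 1
    return dict(sorted(counts.items()))
```

In a good dataset, almost all of the counts would sit at the high-scoring end of the returned dictionary, which corresponds to the sharp peak on the right hand side of the plot.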
- The FastQC Sequence Duplication Levels section is marked as a fail in one of the output result files. Why do you think this is?
- Read the description of the Sequence Duplication Levels result section in the MSU Tutorial.
- Read the description of the provenance of your read data in Chapter 1
Your sequence reads were obtained using RNA sequencing, which sequenced the virus genome very deeply. The high abundance of virus sequence reads (and the low abundance of any reads that do not derive from the virus) looks unusual compared to the results of a standard DNA genome sequencing experiment, so is marked as “fail” by FastQC.
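To see why very deep sequencing of a small genome produces this result, it can help to count duplicates directly. The sketch below counts exact duplicate read sequences and reports what percentage of reads would remain if every duplicate were collapsed to a single copy. It is a simplification with an invented function name: FastQC estimates duplication from a subset of the reads, so its numbers will not match this exactly.

```python
from collections import Counter


def duplication_summary(sequences):
    """Summarise exact-duplicate levels in an iterable of read sequences.

    Returns (percent of reads remaining after deduplication,
             {copy number: number of distinct sequences seen that many times}).
    """
    copies = Counter(sequences)  # how many times each distinct sequence occurs
    total_reads = sum(copies.values())
    distinct_sequences = len(copies)
    level_counts = Counter(copies.values())
    percent_remaining = 100 * distinct_sequences / total_reads if total_reads else 0.0
    return percent_remaining, dict(sorted(level_counts.items()))
```

When a short genome is sequenced to great depth, many reads start at the same position by chance, so the percentage remaining after deduplication is low even though nothing has gone wrong with the experiment.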
6.3 What do we do about sequencing data problems?
FastQC does a great job of alerting us to problems with our sequencing data sets. But, by itself, it cannot remedy these problems.
Modern sequencing methods produce so much read data that the usual approach to poor data is either to drop the read entirely, or to trim (throw away) the section of the read sequence that is low quality.
A popular tool for removing poor quality read data is trimmomatic, and you will meet this in Chapter 7.
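The two operations described in this section, trimming and dropping, can be illustrated with a deliberately simple sketch. It removes low-quality bases from the end of a read and then discards the read if too little remains. This is not trimmomatic's algorithm; the function names and the default thresholds (`min_quality=20`, `min_length=36`) are arbitrary choices made for this example, and Phred+33 encoding is assumed.

```python
def trim_trailing_low_quality(sequence, quality_string, min_quality=20, offset=33):
    """Remove bases from the end of a read while their quality is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality_string[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def clean_read(sequence, quality_string, min_quality=20, min_length=36, offset=33):
    """Trim a read, then drop it (return None) if it is shorter than min_length."""
    trimmed_sequence, trimmed_quality = trim_trailing_low_quality(
        sequence, quality_string, min_quality, offset
    )
    if len(trimmed_sequence) < min_length:
        return None  # too little good-quality sequence left: drop the read
    return trimmed_sequence, trimmed_quality


# Example: the last two bases have Phred score 2 ('#') and are trimmed away
print(clean_read("ACGTAC", "IIII##", min_length=4))
```

Dedicated tools apply more careful strategies than this, and for paired-end data they also keep the forward and reverse read files in step when a read is dropped.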