6  Inspect Read Quality

Using good quality input data improves our chance of getting a good quality output assembly.

We call the set of sequenced reads we obtain from the sequencer the raw reads. Modern sequencing technologies usually give us high quality data, but it is still possible for some low-quality read sequences to be produced among the raw reads. So that these do not detrimentally affect the quality of our assembly, we first assess the quality of our read data, and then remove low-quality reads and low-quality parts of read sequences. The resulting dataset is often called the cleaned reads or processed reads.

Good practice

It is good practice always to inspect the quality of your sequencing read data, and remove low-quality reads, before assembly.

When describing your sequencing experiment in a manuscript or dissertation, you should always state:

  1. The sequencing technology used, and the sequencing platform
  2. The number of raw reads obtained, and a measure of average read length
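
If you want to check these two numbers outside Galaxy, they can be computed directly from a FASTQ file. The sketch below is a minimal illustration, assuming the common four-lines-per-record FASTQ layout (no line-wrapped sequences); the function names and the file path in the final comment are hypothetical.

```python
import io


def read_lengths(handle):
    """Yield the length of each read in a four-line-per-record FASTQ stream."""
    for line_number, line in enumerate(handle):
        # The sequence is the second line of every four-line record.
        if line_number % 4 == 1:
            yield len(line.strip())


def summarise(handle):
    """Return (number of reads, mean read length) for a FASTQ stream."""
    lengths = list(read_lengths(handle))
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return len(lengths), mean_length


# Tiny in-memory example with two made-up reads.
example = io.StringIO("@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nIIII\n")
print(summarise(example))  # (2, 5.0)

# For a real gzipped file (hypothetical path), something like:
# import gzip
# with gzip.open("reads.1.fastq.gz", "rt") as handle:
#     print(summarise(handle))
```
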
Tip

The main kinds of low-quality data we might find in our raw reads are:

  1. Contamination: reads that derive from an organism we didn’t intend to sequence; these are usually identified by querying against a database and removed
  2. Low-confidence base calls: reads containing base calls that the basecaller is unsure about; these are usually found towards the end of the read sequence and are removed by trimming the read
  3. Low-confidence reads: reads where a large proportion of the base calls are low quality; these are usually removed
  4. Adapter sequence: these are sections of sequence left over from library preparation and do not derive from the sequenced organism; they are removed by trimming

This part of the workshop will cover the use of FastQC to inspect the quality of your sequence reads.

6.1 Using FastQC

  1. Navigate to the FastQC tool using the Tools sidebar in Galaxy
     • You can use the search tools field to find FastQC
     • Alternatively, you will find it under GENOMIC FILE MANIPULATION/FASTA/FASTQ in the sidebar
  2. Select the FastQC tool
  3. Run the FastQC tool on each of your FASTQ input files
     • These are the trimmed_pe_aln.qsorted.mapped.fixed.1.fastq.gz and trimmed_pe_aln.qsorted.mapped.fixed.2.fastq.gz files.
     • With the FastQC tool options in the main Workspace window, select the file you want to run FastQC on.
     • Click on the Run Tool button.
Important

Make sure you run FastQC on both the forward and reverse read sets.

Video: Using FastQC to assess the quality of your sequencing reads
Important

FastQC takes a few moments to run.

When the run is complete, click on the eye icon of the Webpage result to inspect the FastQC output.

Tip

You can press and hold CTRL (on a PC) or CMD ⌘ (macOS) when clicking the eye icon to open the result in its own window.

6.2 FastQC output

The output of FastQC tells us a lot about the quality of our sequencing data, and we typically use it to identify problems with a dataset. Some important sections are described below, and you can read more about how to diagnose quality issues in your dataset at the links below:

Video: Inspecting FastQC output in Galaxy

6.2.1 FastQC Summary

The FastQC summary (Figure 6.1) provides an overview that indicates where there may be areas of concern in your data. Each section of the report is flagged as a pass (green, with a tick), a warning (amber, with an exclamation mark), or a fail (red, with a cross).

Figure 6.1: Example FastQC summary showing passes for most sections of the report including “Basic Statistics,” warnings for some sections including “Overrepresented sequences,” and a fail for “Sequence Duplication Levels”

6.2.2 Per base sequence quality

The per base sequence quality plot (Figure 6.2) presents the ranges of individual base quality calls across the lengths of the input sequence reads. Higher scores (near 40) indicate better quality.
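
To make these scores concrete: a Phred quality score Q corresponds to an estimated base-calling error probability of 10^(-Q/10). The sketch below converts between the two, and decodes a quality character from a FASTQ file, assuming the Phred+33 encoding used by current Illumina data. The function names are illustrative only.

```python
def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def quality_char_to_phred(char, offset=33):
    """Decode one FASTQ quality character (assumes Phred+33 encoding)."""
    return ord(char) - offset


# A score of 40 means roughly 1 error in 10,000 calls; 20 means 1 in 100.
for q in (20, 29, 40):
    print(q, phred_to_error_probability(q))

# The character 'I' encodes a score of 40 under Phred+33.
print(quality_char_to_phred("I"))  # 40
```
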

Tip

For good-quality sequence read data, we are looking for a graph where the mean quality (see Figure 6.2) is always in the green area, with a score greater than 29.

Figure 6.2: Per base sequence quality score plot from FastQC indicating high quality sequencing reads. Quality score ranges are presented as a boxplot, with the mean drawn as a line connecting each boxplot.
Caution

With Illumina and other sequencing technologies, it is not unusual for the per-base quality scores to fall quite steeply towards the end of the reads.

It is also typical for the quality score to be slightly lower in the first 5-7 bases of the read.

If mean quality per base falls below 29, there may be a problem with the sequencing data.
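
The mean line in the per base plot can be approximated with a few lines of code. The following is a simplified sketch, not FastQC's own implementation: it assumes Phred+33 quality strings, treats every position separately, and uses the threshold of 29 from this chapter. The function names are hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position (index 0 = first base)."""
    totals, counts = [], []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            # Grow the running totals when a longer read is encountered.
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


def positions_below(means, threshold=29):
    """Return the 1-based read positions whose mean quality is below threshold."""
    return [index + 1 for index, mean in enumerate(means) if mean < threshold]


# Two made-up quality strings: 'I' encodes 40 and '5' encodes 20.
means = mean_quality_per_position(["IIII5", "III5"])
print(means)                   # [40.0, 40.0, 40.0, 30.0, 20.0]
print(positions_below(means))  # [5]
```
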

6.2.3 Per sequence quality scores

The per sequence quality scores plot (Figure 6.3) shows how mean read quality is distributed across the complete dataset, from lower-quality reads to high-quality reads (mean scores near 40).

Tip

In good quality sequence data, we tend to see a sharp peak towards the right hand side of the plot.

Figure 6.3: Per sequence quality score plot from FastQC indicating overall high quality sequencing reads.
Caution

If the distribution of per sequence quality scores appears flat, or plateaus towards the left hand side of the plot, there may be a problem with the sequencing data.
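
The distribution behind this plot can be sketched as a count of reads at each mean quality score. This is a minimal illustration under the assumption of Phred+33 quality strings; the function name is hypothetical, and means are truncated to whole numbers for simplicity.

```python
from collections import Counter


def per_sequence_quality_counts(quality_strings, offset=33):
    """Count reads by their mean Phred score, truncated to an integer."""
    counts = Counter()
    for quality in quality_strings:
        if not quality:
            continue  # skip empty reads
        mean = sum(ord(char) - offset for char in quality) / len(quality)
        counts[int(mean)] += 1
    return dict(sorted(counts.items()))


# Made-up example: most reads have a high mean score, so the counts
# pile up at the high-quality (right hand) end of the distribution.
qualities = ["IIII", "IIII", "5555", "II55"]
print(per_sequence_quality_counts(qualities))  # {20: 1, 30: 1, 40: 2}
```
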

Question
  1. The FastQC Sequence Duplication Levels section is marked as a fail in one of the output result files. Why do you think this is?
     • Read the description of the Sequence Duplication Levels result section in the MSU Tutorial.
     • Read the description of the provenance of your read data in Chapter 1.

Your sequence reads were obtained by RNA sequencing, and the virus genome was sequenced very deeply. The resulting high abundance of virus reads (and low abundance of reads that do not derive from the virus) means that many reads are duplicates of one another. This looks unusual compared to the results of a standard DNA genome sequencing experiment, so FastQC marks the section as a “fail”.
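
To see why deep sequencing of a small genome produces this result, consider a simplified measure of duplication: the percentage of reads that would remain if exact duplicates were collapsed. The sketch below is an illustration only and is not FastQC's exact calculation; the function name and the example reads are made up.

```python
from collections import Counter


def percent_remaining_if_deduplicated(sequences):
    """Percentage of reads left after collapsing exact duplicate sequences."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    return 100 * distinct / total if total else 0.0


# Made-up example: one sequence dominates the dataset, as happens when a
# small genome is sequenced very deeply.
reads = ["ACGT"] * 8 + ["TTTT", "GGGG"]
print(percent_remaining_if_deduplicated(reads))  # 30.0
```

A low percentage here means that most reads are copies of other reads, which is the pattern that the Sequence Duplication Levels section draws attention to.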

6.3 What do we do about sequencing data problems?

FastQC does a great job of alerting us to problems with our sequencing data sets. But, by itself, it cannot remedy these problems.

Modern sequencing methods produce so much read data that the usual approach to excluding poor data is either to exclude (or drop) the read itself, or to trim (throw away) the section of the read sequence that is low quality.
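
The two options described above, trimming and dropping, can be sketched as follows. This is a deliberately simple illustration and not the algorithm of any particular tool: it assumes Phred+33 quality strings, and the function names and the thresholds (min_quality=20, min_length=4) are hypothetical values chosen for the example.

```python
def trim_trailing_low_quality(sequence, quality, min_quality=20, offset=33):
    """Remove bases from the end of a read while their score is below min_quality."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def clean_read(sequence, quality, min_quality=20, min_length=4, offset=33):
    """Trim the read, then drop it (return None) if it has become too short."""
    sequence, quality = trim_trailing_low_quality(
        sequence, quality, min_quality, offset
    )
    if len(sequence) < min_length:
        return None
    return sequence, quality


# 'I' encodes a score of 40 and '#' encodes a score of 2.
print(clean_read("ACGTACGT", "IIIIII##"))  # ('ACGTAC', 'IIIIII')
print(clean_read("ACGT", "I###"))          # None
```
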

A popular tool for removing poor-quality read data is Trimmomatic, which you will meet in Chapter 7.