diff --git a/topics/microbiome/tutorials/metatranscriptomics/content.md b/topics/microbiome/tutorials/metatranscriptomics/content.md index 61d2dee28d1518..84fa587294ec36 100644 --- a/topics/microbiome/tutorials/metatranscriptomics/content.md +++ b/topics/microbiome/tutorials/metatranscriptomics/content.md @@ -187,7 +187,7 @@ In this tutorial we use similar tools as described in the tutorial ["Quality con {% endunless %} -For more information about how to interpret the plots generated by FastQC and MultiQC, please see [this section]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}#assess-quality-with-fastqc---short--long-reads) in our dedicated Quality Control Tutorial. +For more information about how to interpret the plots generated by FastQC and MultiQC, please see [this section]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}#assess-quality-with-falcofastqc---short--long-reads) in our dedicated Quality Control Tutorial. > > diff --git a/topics/sequence-analysis/images/quality-control/adapter_content-after.png b/topics/sequence-analysis/images/quality-control/adapter_content-after.png index a1f973993258ce..01e393e37d9543 100644 Binary files a/topics/sequence-analysis/images/quality-control/adapter_content-after.png and b/topics/sequence-analysis/images/quality-control/adapter_content-after.png differ diff --git a/topics/sequence-analysis/images/quality-control/adapter_content-before.png b/topics/sequence-analysis/images/quality-control/adapter_content-before.png index a7a544382c66d8..5af9bed664c494 100644 Binary files a/topics/sequence-analysis/images/quality-control/adapter_content-before.png and b/topics/sequence-analysis/images/quality-control/adapter_content-before.png differ diff --git a/topics/sequence-analysis/images/quality-control/kmer_content.png b/topics/sequence-analysis/images/quality-control/kmer_content.png index 87b7142ff93555..88db92cdc66e71 100644 Binary files 
a/topics/sequence-analysis/images/quality-control/kmer_content.png and b/topics/sequence-analysis/images/quality-control/kmer_content.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_base_n_content-after.png b/topics/sequence-analysis/images/quality-control/per_base_n_content-after.png index a4cb43ae53c704..799ff68e2af84c 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_base_n_content-after.png and b/topics/sequence-analysis/images/quality-control/per_base_n_content-after.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_base_n_content-before.png b/topics/sequence-analysis/images/quality-control/per_base_n_content-before.png index 977ce0a6dd3dd6..0765b6e0af0ddf 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_base_n_content-before.png and b/topics/sequence-analysis/images/quality-control/per_base_n_content-before.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_base_sequence_content-after.png b/topics/sequence-analysis/images/quality-control/per_base_sequence_content-after.png index 46b9c873d68121..93f3ba33d5054f 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_base_sequence_content-after.png and b/topics/sequence-analysis/images/quality-control/per_base_sequence_content-after.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_base_sequence_content-before.png b/topics/sequence-analysis/images/quality-control/per_base_sequence_content-before.png index ad2f3cf19719dd..67e4971984dd60 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_base_sequence_content-before.png and b/topics/sequence-analysis/images/quality-control/per_base_sequence_content-before.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_base_sequence_quality-after.png b/topics/sequence-analysis/images/quality-control/per_base_sequence_quality-after.png index 
5cb2f2ccf1c78a..6cf6b28b4083f0 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_base_sequence_quality-after.png and b/topics/sequence-analysis/images/quality-control/per_base_sequence_quality-after.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_base_sequence_quality-before.png b/topics/sequence-analysis/images/quality-control/per_base_sequence_quality-before.png index 1e069ef2cc2804..8342347aacac66 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_base_sequence_quality-before.png and b/topics/sequence-analysis/images/quality-control/per_base_sequence_quality-before.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_sequence_gc_content-after.png b/topics/sequence-analysis/images/quality-control/per_sequence_gc_content-after.png index e59e0be9d53809..0cbb10db9dd3ec 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_sequence_gc_content-after.png and b/topics/sequence-analysis/images/quality-control/per_sequence_gc_content-after.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_sequence_gc_content-before.png b/topics/sequence-analysis/images/quality-control/per_sequence_gc_content-before.png index 7281b5562aff3b..c7767fb7c5a9dc 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_sequence_gc_content-before.png and b/topics/sequence-analysis/images/quality-control/per_sequence_gc_content-before.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_sequence_quality_scores-after.png b/topics/sequence-analysis/images/quality-control/per_sequence_quality_scores-after.png index dfea70d026af74..d30ad27c5a4a80 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_sequence_quality_scores-after.png and b/topics/sequence-analysis/images/quality-control/per_sequence_quality_scores-after.png differ diff --git 
a/topics/sequence-analysis/images/quality-control/per_sequence_quality_scores-before.png b/topics/sequence-analysis/images/quality-control/per_sequence_quality_scores-before.png index 299f435cef4187..cd573e2702e92c 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_sequence_quality_scores-before.png and b/topics/sequence-analysis/images/quality-control/per_sequence_quality_scores-before.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_tile_sequence_quality-after.png b/topics/sequence-analysis/images/quality-control/per_tile_sequence_quality-after.png index 4dd7a9797c2be4..f73586bc67f7f8 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_tile_sequence_quality-after.png and b/topics/sequence-analysis/images/quality-control/per_tile_sequence_quality-after.png differ diff --git a/topics/sequence-analysis/images/quality-control/per_tile_sequence_quality-before.png b/topics/sequence-analysis/images/quality-control/per_tile_sequence_quality-before.png index 5f3358754c519a..08292ad64c5a86 100644 Binary files a/topics/sequence-analysis/images/quality-control/per_tile_sequence_quality-before.png and b/topics/sequence-analysis/images/quality-control/per_tile_sequence_quality-before.png differ diff --git a/topics/sequence-analysis/images/quality-control/sequence_duplication_levels-after.png b/topics/sequence-analysis/images/quality-control/sequence_duplication_levels-after.png index 72caf919499767..21e4f54f452adb 100644 Binary files a/topics/sequence-analysis/images/quality-control/sequence_duplication_levels-after.png and b/topics/sequence-analysis/images/quality-control/sequence_duplication_levels-after.png differ diff --git a/topics/sequence-analysis/images/quality-control/sequence_duplication_levels-before.png b/topics/sequence-analysis/images/quality-control/sequence_duplication_levels-before.png index ba92a37d64946a..dd6e1d57be3092 100644 Binary files 
a/topics/sequence-analysis/images/quality-control/sequence_duplication_levels-before.png and b/topics/sequence-analysis/images/quality-control/sequence_duplication_levels-before.png differ diff --git a/topics/sequence-analysis/images/quality-control/sequence_length_distribution-after.png b/topics/sequence-analysis/images/quality-control/sequence_length_distribution-after.png index ef197719757890..7b621a6e760836 100644 Binary files a/topics/sequence-analysis/images/quality-control/sequence_length_distribution-after.png and b/topics/sequence-analysis/images/quality-control/sequence_length_distribution-after.png differ diff --git a/topics/sequence-analysis/images/quality-control/sequence_length_distribution-before.png b/topics/sequence-analysis/images/quality-control/sequence_length_distribution-before.png index c25a32b8d76d5d..5ced07a76d2936 100644 Binary files a/topics/sequence-analysis/images/quality-control/sequence_length_distribution-before.png and b/topics/sequence-analysis/images/quality-control/sequence_length_distribution-before.png differ diff --git a/topics/sequence-analysis/tutorials/quality-control/tutorial.md b/topics/sequence-analysis/tutorials/quality-control/tutorial.md index 9f6807881cd186..b5ac4fdd714076 100644 --- a/topics/sequence-analysis/tutorials/quality-control/tutorial.md +++ b/topics/sequence-analysis/tutorials/quality-control/tutorial.md @@ -7,7 +7,7 @@ questions: - What are the quality parameters to check for a dataset? - How to improve the quality of a dataset? 
objectives: -- "Assess short reads FASTQ quality using FASTQE \U0001F9EC\U0001F60E and FastQC" +- "Assess short reads FASTQ quality using FASTQE \U0001F9EC\U0001F60E and FastQC/Falco" - Assess long reads FASTQ quality using Nanoplot and PycoQC - Perform quality correction with Cutadapt (short reads) - Summarise quality metrics MultiQC @@ -39,6 +39,7 @@ contributions: - neoformit editing: - Swathi266 + - VerenaMoo funding: - gallantries @@ -169,7 +170,7 @@ It means that the fragment named `@M00970` corresponds to the DNA sequence `GTGC {: .question} > -> The current lllumina (1.8+) uses Sanger format (Phred+33). If you are working with older datasets you may encounter the older scoring schemes. **FastQC** {% icon tool %}, a tool we will use later in this tutorial, can be used to try to determine what type of quality encoding is used (through assessing the range of Phred values seen in the FASTQ). +> The current Illumina (1.8+) uses Sanger format (Phred+33). If you are working with older datasets you may encounter the older scoring schemes. **Falco** {% icon tool %}, a tool we will use later in this tutorial, can be used to try to determine what type of quality encoding is used (through assessing the range of Phred values seen in the FASTQ). {: .comment} When looking at the file in Galaxy, it looks like most the nucleotides have a high score (`G` corresponding to a score 38). Is it true for all sequences? And along the full sequence length? @@ -229,13 +230,15 @@ Phred Quality Score | ASCII code | Emoji {: .question} -# Assess quality with FastQC - short & long reads +# Assess quality with Falco/FastQC - short & long reads -An additional or alternative way we can check sequence quality is with [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). It provides a modular set of analyses which you can use to check whether your data has any problems of which you should be aware before doing any further analysis. 
We can use it, for example, to assess whether there are known adapters present in the data. We'll run it on the FASTQ file. +Additional or alternative ways to check sequence quality are [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [Falco](https://falco.readthedocs.io/en/latest/). Falco is a high-speed emulation of FastQC. Both tools offer a modular set of analyses which you can use to check whether your data has any problems of which you should be aware before doing any further analysis. We can use them, for example, to assess whether there are known adapters present in the data. + +In this case, we'll use Falco to analyze the FASTQ file, as it runs three times faster than FastQC while delivering equivalent results. Additionally, Falco provides greater flexibility in visualizing the HTML report. We will present both Falco and FastQC plots, both because many readers are familiar with FastQC and to show how the two tools differ. > Quality check > -> 1. {% tool [FASTQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters +> 1. {% tool [Falco](toolshed.g2.bx.psu.edu/repos/iuc/falco/falco/1.2.4+galaxy0) %} with the following parameters > - {% icon param-files %} *"Raw read data from your current history"*: `Reads` > > 2. Inspect the generated HTML file @@ -253,27 +256,33 @@ An additional or alternative way we can check sequence quality is with [FastQC]( ## Per base sequence quality -With FastQC we can use the per base sequence quality plot to check the base quality of the reads, similar to what we did with FASTQE. +With Falco we can use the per base sequence quality plot to check the base quality of the reads, similar to what we did with FASTQE. ![Per base sequence quality](../../images/quality-control/per_base_sequence_quality-before.png "Per base sequence quality") -On the x-axis are the base position in the read. In this example, the sample contains reads that are up to 296 bp long. 
+The x-axis shows the base position in the read. In this example, the sample contains reads that are up to 296 bp long. > Non uniform x-axis > > The x-axis is not always uniform. When you have long reads, some binning is applied to keep things compact. We can see that in our sample. It starts out with individual 1-10 bases. After that, bases are binned across a window a certain number of bases wide. Data binning means grouping and is a data pre-processing technique used to reduce the effects of minor observation errors. The number of base positions binned together depends on the length of the read. With reads >50bp, the latter part of the plot will report aggregate statistics for 5bp windows. Shorter reads will have smaller windows and longer reads larger windows. Binning can be removed when running FastQC by setting the paramter "Disable grouping of bases for reads >50bp" to Yes. {: .details} -For each position, a boxplot is drawn with: +For each position, a boxplot by Falco is drawn with: -- the median value, represented by the central red line -- the inter-quartile range (25-75%), represented by the yellow box +- the median value, represented by the central intensely coloured line +- the inter-quartile range (25-75%), represented by the coloured box - the 10% and 90% values in the upper and lower whiskers -- the mean quality, represented by the blue line -The y-axis shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y-axis into very good quality scores (green), scores of reasonable quality (orange), and reads of poor quality (red). +The y-axis shows the quality scores. It ranges from around 5 to 40 for this data with higher scores indicating better base calls. The beam colours correspond to different quality levels: green for very good quality, yellow for reasonable quality, and red for reads of poor quality. 
The switch from green to yellow occurs when the median quality score falls below 25 or the lower limit of the inter-quartile range drops below 10. Beams are coloured in red when the median falls below 20 or the lower limit of the inter-quartile range is below 5. + +It is normal with all Illumina sequencers for the median quality score to start out lower over the first 5-7 bases (not visible for our dataset) and to then rise. The quality of reads on most platforms will drop at the end of the read. This is often due to signal decay or phasing during the sequencing run. The recent developments in chemistry applied to sequencing have improved this somewhat, but reads are now longer than ever. + +> Differences FastQC and Falco +> +> The plots are displayed slightly differently. In FastQC, the quality level is indicated by background colour coding, whereas Falco uses coloured beams. Additionally, FastQC includes the mean value, which Falco does not display. Another key difference is that FastQC's y-axis always ranges from 0 to 38, while Falco adjusts the y-axis to show only the relevant range of values. In our plot, this range is approximately 3 to 38. This difference in y-axis scaling is consistent across all plots generated by the two tools. +{: .details} -It is normal with all Illumina sequencers for the median quality score to start out lower over the first 5-7 bases and to then rise. The quality of reads on most platforms will drop at the end of the read. This is often due to signal decay or phasing during the sequencing run. The recent developments in chemistry applied to sequencing has improved this somewhat, but reads are now longer than ever. +For further information on this plot you can have a look at the [official documentation of FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html). 
> Signal decay and phasing @@ -307,15 +316,15 @@ It is normal with all Illumina sequencers for the median quality score to start > > - Manifold burst > -> ![Manifold burst](../../images/quality-control/per_base_sequence_quality_manifold_burst.png) +> ![Manifold burst](../../images/quality-control/per_base_sequence_quality_manifold_burst.png "Per base sequence quality with manifold burst") > > - Cycles loss > -> ![Cycles loss](../../images/quality-control/per_base_sequence_quality_cycle_loss.png) +> ![Cycles loss](../../images/quality-control/per_base_sequence_quality_cycle_loss.png "Per base sequence quality with cycles loss") > > - Read 2 failure > -> ![Cycles loss](../../images/quality-control/per_base_sequence_quality_read2_failure.png) +> ![Read 2 failure](../../images/quality-control/per_base_sequence_quality_read2_failure.png "Per base sequence quality with read 2 failure") > > With such data, the sequencing facility should be contacted for discussion. Often, a resequencing then is needed (and from our experience also offered by the company). > @@ -323,11 +332,11 @@ It is normal with all Illumina sequencers for the median quality score to start > > -> 1. How does the mean quality score change along the sequence? +> 1. How does the median quality score change along the sequence? > 2. Is this tendency seen in all sequences? > > > -> > 1. The mean quality score (blue line) drops about midway though these sequences. It is common for the mean quality to drop towards the end of the sequences, as the sequencers are incorporating more incorrect nucleotides at the end. +> > 1. The median quality score (line in darker shade of green, yellow and red in each beam) drops about midway through these sequences. It is common for the median quality to drop towards the end of the sequences, as the sequencers are incorporating more incorrect nucleotides at the end. 
However, in this sample there is a very large drop in quality from the middle onwards. > > 2. The box plots are getting wider from position ~100. It means a lot of sequences have their score dropping from the middle of the sequence. After 100 nucleotides, more than 10% of the sequences have scores below 20. > > > {: .solution } @@ -335,47 +344,61 @@ It is normal with all Illumina sequencers for the median quality score to start When the median quality is below a Phred score of ~20, we should consider trimming away bad quality bases from the sequence. We will explain that process in the Trim and filter section. -### Adapter Content +## Adapter Content ![Adapter Content](../../images/quality-control/adapter_content-before.png "Adapter Content") -The plot shows the cumulative percentage of reads with the different adapter sequences at each position. Once an adapter sequence is seen in a read it is counted as being present right through to the end of the read so the percentage increases with the read length. FastQC can detect some adapters by default (e.g. Illumina, Nextera), for others we could provide a contaminants file as an input to the FastQC tool. +The plot shows the cumulative percentage of reads with the different adapter sequences at each position. Once an adapter sequence is seen in a read it is counted as being present right through to the end of the read so the percentage increases with the read length. Falco can detect some adapters by default (e.g. Illumina, Nextera), for others we could provide a contaminants file as an input to the Falco tool. -Ideally Illumina sequence data should not have any adapter sequence present. But with long reads, some of the library inserts are shorter than the read length resulting in read-through to the adapter at the 3' end of the read. This microbiome sample has relatively long reads and we can see Nextera dapater has been detected. +Ideally, Illumina sequence data should not have any adapter sequences present. 
But with long reads, some of the library inserts are shorter than the read length resulting in read-through to the adapter at the 3' end of the read. This microbiome sample has relatively long reads and we can see that the Nextera adapter has been detected. + +> Differences FastQC and Falco +> +> The two plots appear quite similar. The biggest difference is that Falco includes a second line indicating the presence of a PolyA sequence, which FastQC doesn't search for. Again, the y-axis of the Falco plot is adjusted to display only the relevant range. +> +> Here's the [official FastQC documentation for this plot](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Adapter%20Content.html). +{: .details} > Other adapter content profiles > > Adapter content may also be detected with RNA-Seq libraries where the distribution of library insert sizes is varied and likely to include some short inserts. > -> ![Adapter Content](../../images/quality-control/adapter_content_rna_seq.png) +> ![Adapter Content](../../images/quality-control/adapter_content_rna_seq.png "Adapter Content with RNA-Seq library") > {: .details} -We can run an trimming tool such as Cutadapt to remove this adapter. We will explain that process in the filter and trim section. +We can run a trimming tool such as Cutadapt to remove this adapter. We will explain that process in the filter and trim section. > Take a shortcut > -> The following sections go into detail about some of the other plots generated by FastQC. Note that some plots/modules may give warnings but be normal -> for the type of data you're working with, as discussed below and [in the FASTQC FAQ](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/). +> The following sections go into detail about some of the other plots generated by Falco/FastQC. 
Note that some plots/modules may give warnings but be normal +> for the type of data you're working with, as discussed below and [in the documentation of FASTQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/). > The other plots give us information to more deeply understand the quality of the data, and to see if changes could be made in the lab to get higher-quality data in the future. > These sections are **optional**, and if you would like to skip these you can: > - Jump straight to the [next section](#trim-and-filter---short-reads) to learn about trimming paired-end data {: .tip} -### Per tile sequence quality +## Per tile sequence quality -This plot enables you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell. The plot shows the deviation from the average quality for each flowcell tile. The hotter colours indicate that reads in the given tile have worse qualities for that position than reads in other tiles. With this sample, you can see that certain tiles show consistently poor quality, especially from ~100bp onwards. A good plot should be blue all over. +This plot enables you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell. The plot shows the deviation from the average quality for each flowcell tile. For Falco red colours indicate that reads in the given tile have worse qualities for that position than reads in other tiles. Dark blue indicates a better quality than reads in other tiles. With this sample, you can see that certain tiles show consistently poor quality, especially from ~100bp onwards. A good plot should be equally blue all over. 
![Per tile sequence quality](../../images/quality-control/per_tile_sequence_quality-before.png "Per tile sequence quality") -This plot will only appear for Illumina library which retains its original sequence identifiers. Encoded in these is the flowcell tile from which each read came. +This plot will only appear for an Illumina library which retains its original sequence identifiers. Encoded in these is the flowcell tile from which each read came. + +> Differences FastQC and Falco +> +> In these plots, the different colours are immediately noticeable. For FastQC, hotter colours indicate worse quality while colder colours represent quality at or above the average. Upon closer inspection, slight differences in the results can be observed. For example, the red box on the far left of the FastQC plot has no corresponding box in the Falco plot. Overall, however, the results are very similar and effectively highlight the key quality differences between the tiles. +> +> Here you can have a look at the [official documentation of FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html). +{: .details} > Other tile quality profiles > > In some cases, the chemicals used during sequencing becoming a bit exhausted over the time and the last tiles got worst chemicals which makes the sequencing reactions a bit error-prone. 
The "Per tile sequence quality" graph will then have some horizontal lines like this: > -> ![Per tile sequence quality with horizontal lines](../../images/quality-control/per_tile_sequence_quality_horizontal_lines.png) +> ![Per tile sequence quality with horizontal lines](../../images/quality-control/per_tile_sequence_quality_horizontal_lines.png "Per tile sequence quality with horizontal lines") > {: .details} @@ -385,7 +408,12 @@ It plots the average quality score over the full length of all reads on the x-ax ![Per sequence quality scores](../../images/quality-control/per_sequence_quality_scores-before.png "Per sequence quality scores") -The distribution of average read quality should be tight peak in the upper range of the plot. It can also report if a subset of the sequences have universally low quality values: it can happen because some sequences are poorly imaged (on the edge of the field of view etc), however these should represent only a small percentage of the total sequences. +The distribution of average read quality should form a tight peak in the upper range of the plot. It can also report if a subset of the sequences have universally low quality values: it can happen because some sequences are poorly imaged (on the edge of the field of view etc). However, these should represent only a small percentage of the total sequences. Here's the [official documentation of FastQC for this plot](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/3%20Per%20Sequence%20Quality%20Scores.html). + +> Differences FastQC and Falco +> +> Except for the slightly different design of the two plots, there are no big differences. +{: .details} ## Per base sequence content @@ -395,15 +423,22 @@ The distribution of average read quality should be tight peak in the upper range In a random library we would expect that there would be little to no difference between the four bases. 
The proportion of each of the four bases should remain relatively constant over the length of the read with `%A=%T` and `%G=%C`, and the lines in this plot should run parallel with each other. This is amplicon data, where 16S DNA is PCR amplified and sequenced, so we'd expect this plot to have some bias and not show a random distribution. +For further information on this plot you can have a look at the [official documentation of FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html). + +> Differences FastQC and Falco +> +> These plots have identical colour coding and no differences can be observed. +{: .details} + > Biases by library type > > It's worth noting that some library types will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries), and those which were fragmented using transposases, will contain an intrinsic bias in the positions at which reads start (the first 10-12 bases). This bias does not involve a specific sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will, however, produce a warning or error in this module. > -> ![Per base sequence content for RNA-seq data](../../images/quality-control/per_base_sequence_content_rnaseq.png) +> ![Per base sequence content for RNA-Seq data](../../images/quality-control/per_base_sequence_content_rnaseq.png "Per base sequence content with RNA-Seq library") > > ChIP-seq data can also encounter read start sequence biases in this plot if fragmenting with transposases. With bisulphite converted data, e.g. 
HiC data, a separation of G from C and A from T is expected: > -> ![Per base sequence content for Bisulphite data](../../images/quality-control/per_base_sequence_content_bisulphite.png) +> ![Per base sequence content for Bisulphite data](../../images/quality-control/per_base_sequence_content_bisulphite.png "Per base sequence content with Bisulphite data") > > At the end, there is an overall shift in the sequence composition. If the shift correlates with a loss of sequencing quality, it can be suspected that miscalls are made with a more even sequence bias than bisulphite converted libraries. Trimming the sequences fixed this problem, but if this hadn't been done it would have had a dramatic effect on the methylation calls which were made. {: .details} @@ -425,9 +460,14 @@ In a random library we would expect that there would be little to no difference This plot displays the number of reads vs. percentage of bases G and C per read. It is compared to a theoretical distribution assuming an uniform GC content for all reads, expected for whole genome shotgun sequencing, where the central peak corresponds to the overall GC content of the underlying genome. Since the GC content of the genome is not known, the modal GC content is calculated from the observed data and used to build a reference distribution. +> Differences FastQC and Falco +> +> Again these plots show no differences. +{: .details} + An unusually-shaped distribution could indicate a contaminated library or some other kind of biased subset. A shifted normal distribution indicates some systematic bias, which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what your genome's GC content should be. -But there are also other situations in which an unusually-shaped distribution may occur. 
For example, with RNA sequencing there may be a greater or lesser distribution of mean GC content among transcripts causing the observed plot to be wider or narrower than an ideal normal distribution. +But there are also other situations in which an unusually-shaped distribution may occur. For example, with RNA sequencing there may be a greater or lesser distribution of mean GC content among transcripts causing the observed plot to be wider or narrower than an ideal normal distribution. You can find the [official documentation of FastQC regarding this plot](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/5%20Per%20Sequence%20GC%20Content.html) here. > > @@ -438,20 +478,32 @@ But there are also other situations in which an unusually-shaped distribution ma > {: .solution } {: .question} -### Sequence length distribution +## Sequence length distribution This plot shows the distribution of fragment sizes in the file which was analysed. In many cases this will produce a simple plot showing a peak only at one size, but for variable length FASTQ files this will show the relative amounts of each different size of sequence fragment. Our plot shows variable length as we trimmed the data. The biggest peak is at 296bp but there is a second large peak at ~100bp. So even though our sequences range up to 296bp in length, a lot of the good-quality sequences are shorter. This corresponds with the drop we saw in the sequence quality at ~100bp and the red stripes starting at this position in the per tile sequence quality plot. -![Sequence length distribution](../../images/quality-control/sequence_length_distribution-before.png "Sequence length distribution") +![Sequence Length Distribution](../../images/quality-control/sequence_length_distribution-before.png "Sequence Length Distribution") Some high-throughput sequencers generate sequence fragments of uniform length, but others can contain reads of widely varying lengths. 
Even within uniform length libraries some pipelines will trim sequences to remove poor quality base calls from the end or the first $$n$$ bases if they match the first $$n$$ bases of the adapter up to 90% (by default), with sometimes $$n = 1$$. +> Differences FastQC and Falco +> +> At first glance, these two plots appear to show completely different results. However, if you look at the x-axis, you'll notice that Falco shows a bar at position 296, while FastQC shows a line with a peak at this position and 0 for the adjacent positions 295 and 297. The difference is therefore only in how the data is visualized. +{: .details} + +For further information on this plot you can have a look at the [official documentation of FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/7%20Sequence%20Length%20Distribution.html). + ## Sequence Duplication Levels -The graph shows in blue the percentage of reads of a given sequence in the file which are present a given number of times in the file: +The graph shows in blue the percentage of reads of a given sequence in the file which are present a given number of times in the file. For the red line, the reads are de-duplicated before plotting, so this line shows the percentage of different sequences contained in the file at each duplication rate. The percentage indicates the number of different sequences with that duplication rate as part of the total number of different sequences. ![Sequence Duplication Levels](../../images/quality-control/sequence_duplication_levels-before.png "Sequence Duplication Levels") +> Differences FastQC and Falco +> +> The sequence duplication level plots only differ in minor design choices and the adjusted y-axis in the Falco plot. You can find more information about the [FastQC plot](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html) here.
+{: .details} + In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias. Two sources of duplicate reads can be found: @@ -465,7 +517,7 @@ Two sources of duplicate reads can be found: > More details about duplication > -> FastQC counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication. There are two lines on the plot: +> Falco counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication. There are two lines on the plot: > - Blue line: distribution of the duplication levels for the full sequence set > - Red line: distribution for the de-duplicated sequences with the proportions of the deduplicated set which come from different duplication levels in the original data. > @@ -473,19 +525,16 @@ Two sources of duplicate reads can be found: > > More specific enrichments of subsets, or the presence of low complexity contaminants will tend to produce spikes towards the right of the plot. These high duplication peaks will most often appear in the blue trace as they make up a high proportion of the original library, but usually disappear in the red trace as they make up an insignificant proportion of the deduplicated set. If peaks persist in the red trace then this suggests that there are a large number of different highly duplicated sequences which might indicate either a contaminant set or a very severe technical duplication. > -> It is usually the case for RNA sequencing where there is some very highly abundant transcripts and some lowly abundant. 
It is expected that duplicate reads will be observed for high abundance transcripts: -> -> ![Sequence Duplication Levels for RNA-seq](../../images/quality-control/sequence_duplication_levels_rna_seq.png) -> +> This is usually the case for RNA sequencing, where there are some very highly abundant transcripts and some lowly abundant ones. It is expected that duplicate reads will be observed for high abundance transcripts. {: .details} ## Over-represented sequences A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up a tiny fraction of the whole. Finding that a single sequence is very over-represented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as expected. -FastQC lists all of the sequence which make up more than 0.1% of the total. For each over-represented sequence FastQC will look for matches in a database of common contaminants and will report the best hit it finds. Hits must be at least 20bp in length and have no more than 1 mismatch. Finding a hit doesn't necessarily mean that this is the source of the contamination, but may point you in the right direction. It's also worth pointing out that many adapter sequences are very similar to each other so you may get a hit reported which isn't technically correct, but which has a very similar sequence to the actual match. +Falco lists all of the sequences which make up more than 0.1% of the total. For each over-represented sequence Falco will look for matches in a database of common contaminants and will report the best hit it finds. Hits must be at least 20bp in length and have no more than 1 mismatch. Finding a hit doesn't necessarily mean that this is the source of the contamination, but may point you in the right direction.
It's also worth pointing out that many adapter sequences are very similar to each other so you may get a hit reported which isn't technically correct, but which has a very similar sequence to the actual match. -RNA sequencing data may have some transcripts that are so abundant that they register as over-represented sequence. With DNA sequencing data no single sequence should be present at a high enough frequency to be listed, but we can sometimes see a small percentage of adapter reads. +RNA sequencing data may have some transcripts that are so abundant that they register as over-represented sequence. With DNA sequencing data no single sequence should be present at a high enough frequency to be listed, but we can sometimes see a small percentage of adapter reads. Here you can find the [official documentation of FastQC regarding this output](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Overrepresented%20Sequences.html). > > @@ -503,7 +552,7 @@ RNA sequencing data may have some transcripts that are so abundant that they reg {: .question} -> More details about other FastQC plots +> More details about other Falco/FastQC plots > > > #### Per base N content @@ -512,18 +561,24 @@ RNA sequencing data may have some transcripts that are so abundant that they reg > > If a sequencer is unable to make a base call with sufficient confidence, it will write an "N" instead of a conventional base call. This plot displays the percentage of base calls at each position or bin for which an N was called. > -> It's not unusual to see a very high proportion of Ns appearing in a sequence, especially near the end of a sequence. But this curve should never rises noticeably above zero. If it does this indicates a problem occurred during the sequencing run. 
In the example below, an error caused the instrument to be unable to call a base for approximately 20% of the reads at position 29: +>> Differences FastQC and Falco +>> +>> These plots may seem very different at first, but the key difference lies in the y-axis ranges. In the Falco plot, the y-axis is scaled differently, making the peaks appear larger. If the y-axis in the Falco plot extended to 100%, as it does in the FastQC plot, the peaks would appear much smaller, and the two plots would look the same. +>{: .details} > -> ![Per base N content](../../images/quality-control/per_base_n_content_error.png) +> It's not unusual to see a very low proportion of Ns appearing in a sequence, especially near the end of a sequence. But this curve should never rise noticeably above zero. If it does, this indicates a problem occurred during the sequencing run. In the example below, an error caused the instrument to be unable to call a base for approximately 20% of the reads at position 29: > +> ![Per base N content](../../images/quality-control/per_base_n_content_error.png "Per base N content with error") +> +> For further information on this plot you can have a look at the [official documentation of FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/6%20Per%20Base%20N%20Content.html). > > #### Kmer Content > -> This plot not output by default. As stated in the tool form, if you want this module it needs to be enabled using a custom Submodule and limits file. With this module, FastQC does a generic analysis of all of the short nucleotide sequences of length k (kmer, with k = 7 by default) starting at each position along the read in the library to find those which do not have an even coverage through the length of your reads. Any given kmer should be evenly represented across the length of the read. +> This plot is not output by default.
As stated in the tool form, if you want this module it needs to be enabled using a custom Submodule and limits file. With this module, FastQC does a generic analysis of all of the short nucleotide sequences of length k (kmer, with k = 7 by default) starting at each position along the read in the library to find those which do not have an even coverage through the length of your reads. Any given kmer should be evenly represented across the length of the read. > > FastQC will report the list of kmers which appear at specific positions with a greater frequency than expected. This can be due to different sources of bias in the library, including the presence of read-through adapter sequences building up on the end of the sequences. The presence of any overrepresented sequences in the library (such as adapter dimers) causes the kmer plot to be dominated by the kmer from these sequences. Any biased kmer due to other interesting biases may be then diluted and not easy to see. > -> The following example is from a high-quality DNA-Seq library. The biased kmers nearby the start of the read likely are due to slight sequence dependent efficiency of DNA shearing or a result of random priming: +> In the following only the plot generated by Falco is shown. The biased kmers nearby the start of the read likely are due to slight sequence dependent efficiency of DNA shearing or a result of random priming: > > ![Kmer Content](../../images/quality-control/kmer_content.png "Kmer content") > @@ -531,7 +586,7 @@ RNA sequencing data may have some transcripts that are so abundant that they reg > {: .details} -We tried to explain here there different FastQC reports and some use cases. More about this and also some common next-generation sequencing problems can be found on [QCFAIL.com](https://sequencing.qcfail.com/) +We tried to explain here the different Falco/FastQC reports and some use cases. 
More about this and also some common next-generation sequencing problems can be found on [QCFAIL.com](https://sequencing.qcfail.com/) > Specific problem for alternate library types > @@ -563,7 +618,7 @@ We tried to explain here there different FastQC reports and some use cases. More > > #### Adapter dimer contamination > -> Any library type may contain a very small percentage of adapter dimer (i.e. no insert) fragments. They are more likely to be found in amplicon libraries constructed entirely by PCR (by formation of PCR primer-dimers) than in DNA-Seq or RNA-Seq libraries constructed by adapter ligation. If a sufficient fraction of the library is adapter dimer it will become noticeable in the FastQC report: +> Any library type may contain a very small percentage of adapter dimer (i.e. no insert) fragments. They are more likely to be found in amplicon libraries constructed entirely by PCR (by formation of PCR primer-dimers) than in DNA-Seq or RNA-Seq libraries constructed by adapter ligation. If a sufficient fraction of the library is adapter dimer it will become noticeable in the Falco report: > > - Drop in per base sequence quality after base 60 > - Possible bi-modal distribution of per sequence quality scores @@ -686,7 +741,7 @@ To accomplish this task we will use [Cutadapt](https://cutadapt.readthedocs.io/e {: .details} -We can examine our trimmed data with FASTQE and/or FastQC. +We can examine our trimmed data with FASTQE and/or Falco. > Checking quality after trimming > @@ -715,12 +770,12 @@ We can examine our trimmed data with FASTQE and/or FastQC. With FASTQE we can see we improved the quality of the bases in the dataset. -We can also, or instead, check the quality-controlled data with FastQC. +We can also, or instead, check the quality-controlled data with Falco. > Checking quality after trimming > -> 1. {% tool [FASTQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters +> 1. 
{% tool [Falco](toolshed.g2.bx.psu.edu/repos/iuc/falco/falco/1.2.4+galaxy0) %} with the following parameters > - {% icon param-files %} *"Short read data from your current history"*: `Cutadapt Read 1 Output` > > 2. Inspect the generated HTML file @@ -733,37 +788,36 @@ We can also, or instead, check the quality-controlled data with FastQC. > > > > > 1. Yes. The vast majority of the bases have a quality score above 20 now. -> > ![Per base sequence quality](../../images/quality-control/per_base_sequence_quality-after.png "Per base sequence quality") -> > +> > ![Per base sequence quality](../../images/quality-control/per_base_sequence_quality-after.png "Per base sequence quality") > > 2. Yes. No adapter is detected now. -> > ![Adapter Content](../../images/quality-control/adapter_content-after.png) +> > ![Adapter Content](../../images/quality-control/adapter_content-after.png "Adapter content") > > > {: .solution } {: .question} -With FastQC we can see we improved the quality of the bases in the dataset and removed the adapter. +With Falco we can see we improved the quality of the bases in the dataset and removed the adapter. -> Other FastQC plots after trimming +> Other Falco plots after trimming > -> ![Per tile sequence quality](../../images/quality-control/per_tile_sequence_quality-after.png) +> ![Per tile sequence quality](../../images/quality-control/per_tile_sequence_quality-after.png "Per tile sequence quality") > We have some red stripes as we've trimmed those regions from the reads. > -> ![Per sequence quality scores](../../images/quality-control/per_sequence_quality_scores-after.png) +> ![Per sequence quality scores](../../images/quality-control/per_sequence_quality_scores-after.png "Per sequence quality scores") > We now have one peak of high quality instead of one high and one lower quality that we had previously. 
> -> ![Per base sequence content](../../images/quality-control/per_base_sequence_content-after.png) +> ![Per base sequence content](../../images/quality-control/per_base_sequence_content-after.png "Per base sequence content") > We don't have equal representation of the bases as before as this is amplicon data. > -> ![Per sequence GC content](../../images/quality-control/per_sequence_gc_content-after.png) +> ![Per sequence GC content](../../images/quality-control/per_sequence_gc_content-after.png "Per sequence GC content") > We now have a single main GC peak due to removing the adapter. > -> ![Per base N content](../../images/quality-control/per_base_n_content-after.png) -> This is the same as before as we don't have any Ns in these reads. +> ![Per base N content](../../images/quality-control/per_base_n_content-after.png "Per base N content") +> This is roughly the same as before since we have hardly any Ns in these reads. > -> ![Sequence length distribution](../../images/quality-control/sequence_length_distribution-after.png) -> We now have multiple peaks and a range of lengths, instead of the single peak with had before trimming when all sequences were the same length. +> ![Sequence Length Distribution](../../images/quality-control/sequence_length_distribution-after.png "Sequence Length Distribution") +> We now have multiple peaks and a range of lengths, instead of the single peak we had before trimming when all sequences were the same length. > -> ![Sequence Duplication Levels](../../images/quality-control/sequence_duplication_levels-after.png) +> ![Sequence Duplication Levels](../../images/quality-control/sequence_duplication_levels-after.png "Sequence Duplication Levels") > > > > > > What does the top overrepresented sequence `GTGTCAGCCGCCGCGGTAGTCCGACGTGG` correspond to? 
@@ -801,7 +855,7 @@ Paired-end sequencing generates 2 FASTQ files: Usually we recognize these two files which belong to one sample by the name which has the same identifier for the reads but a different extension, e.g. `sampleA_R1.fastq` for the forward reads and `sampleA_R2.fastq` for the reverse reads. It can also be `_f` or `_1` for the forward reads and `_r` or `_2` for the reverse reads. -The data we analyzed in the previous step was single-end data so we will import a paired-end RNA-seq dataset to use. We will run FastQC and aggregate the two reports with MultiQC {% cite ewels2016multiqc %}. +The data we analyzed in the previous step was single-end data so we will import a paired-end RNA-seq dataset to use. We will run Falco and aggregate the two reports with MultiQC {% cite ewels2016multiqc %}. > Assessing the quality of paired-end reads > @@ -812,7 +866,7 @@ The data we analyzed in the previous step was single-end data so we will import > https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_2.fastq > ``` > -> 2. {% tool [FASTQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with both datasets: +> 2. {% tool [Falco](toolshed.g2.bx.psu.edu/repos/iuc/falco/falco/1.2.4+galaxy0) %} with both datasets: > - {% icon param-files %} *"Raw read data from your current history"*: both the uploaded datasets. > > {% snippet faqs/galaxy/tools_select_multiple_datasets.md %} @@ -822,7 +876,7 @@ The data we analyzed in the previous step was single-end data so we will import > - *"Which tool was used generate logs?"*: `FastQC` > - In *"FastQC output"* > - *"Type of FastQC output?"*: `Raw data` -> - {% icon param-files %} *"FastQC output"*: `Raw data` files (output of both **FastQC** {% icon tool %}) +> - {% icon param-files %} *"FastQC output"*: `Raw data` files (output of both **Falco** {% icon tool %}) > > 4. Inspect the webpage output from MultiQC. 
> @@ -975,12 +1029,12 @@ In runs with a lot of short reads the shorter reads are sometimes of lower quali {: .question} > Try it on! -> Do the quality control with **FastQC** {% icon tool %} on `m64011_190830_220126.Q20.subsample.fastq.gz` and compare the results! +> Do the quality control with **Falco** {% icon tool %} on `m64011_190830_220126.Q20.subsample.fastq.gz` and compare the results! {: .comment} # Assess quality with PycoQC - Nanopore only -[PycoQC](https://github.com/tleonardi/pycoQC) ({% cite Leger2019 %}) is a data visualisation and quality control tool for nanopore data. In contrast to FastQC/Nanoplot it needs a specific sequencing_summary.txt file generated by Oxford nanopore basecallers such as Guppy or the older albacore basecaller. +[PycoQC](https://github.com/tleonardi/pycoQC) ({% cite Leger2019 %}) is a data visualisation and quality control tool for nanopore data. In contrast to Falco/Nanoplot it needs a specific sequencing_summary.txt file generated by Oxford nanopore basecallers such as Guppy or the older albacore basecaller. One of the strengths of PycoQC is that it is interactive and highly customizable, e.g., plots can be cropped, you can zoom in and out, sub-select areas and export figures. @@ -1019,7 +1073,7 @@ One of the strengths of PycoQC is that it is interactive and highly customizable ## Basecalled reads length -As for FastQC and Nanoplot, this plot shows the distribution of fragment sizes in the file that was analyzed. +As for Falco and Nanoplot, this plot shows the distribution of fragment sizes in the file that was analyzed. As for PacBio CLR/HiFi, long reads have a variable length and this will show the relative amounts of each different size of sequence fragment. In this example, the distribution of read length is quite dispersed with a minimum read length for the passed reads around 200bp and a maximum length ~150,000bp. 
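As a rough illustration of what a read length distribution summarizes, the sketch below tallies read lengths from a FASTQ stream and reports the shortest and longest read. It is only a minimal example under stated assumptions: the helper names `read_lengths`, `length_summary` and `open_fastq` are hypothetical and are not part of Falco, Nanoplot or PycoQC, the input is assumed to use exactly four lines per record, and the three reads in the example are made up.

```python
import gzip
import io
from collections import Counter


def read_lengths(handle):
    """Yield the length of each read in a FASTQ stream.

    Assumption: every record spans exactly four lines
    (header, sequence, '+', qualities).
    """
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            yield len(line.strip())


def length_summary(lengths):
    """Return (shortest, longest, Counter of lengths) for the given read lengths."""
    counts = Counter(lengths)
    if not counts:
        raise ValueError("no reads found")
    return min(counts), max(counts), counts


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


# Tiny in-memory example with made-up reads (not real sequencing data).
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
    "@r3\nACGT\n+\nIIII\n"
)
shortest, longest, counts = length_summary(read_lengths(example))
print(shortest, longest, dict(counts))
```

For a real file, `open_fastq` could be passed a path such as the `nanopore_basecalled-guppy.fastq.gz` dataset used in this tutorial, and the resulting counts compared with the length distribution plots produced by the QC tools.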
@@ -1073,7 +1127,7 @@ Although it is normal that yield decreases over time a decrease like this is not > This absence of a decreasing curve at the end of the run indicate that there is still biological material on the flow cell. The run was ended before all was sequenced. > It's an excellent run, even can be considered as exceptional. > -> ![Output over experiment time good profile](../../images/quality-control/output_over_experiment_time-pycoqc-good.png) +> ![Output over experiment time good profile](../../images/quality-control/output_over_experiment_time-pycoqc-good.png "Output over experiment time") > {: .details} @@ -1107,13 +1161,13 @@ Depending if you chose “Reads” or “Bases” on the left the colour indicat > > In this example, almost all pores are active all along the run (yellow/red profile) which indicate an excellent run. > -> ![Channel activity over time good profile](../../images/quality-control/channel_activity_over_time-pycoqc-good.png) +> ![Channel activity over time good profile](../../images/quality-control/channel_activity_over_time-pycoqc-good.png "Channel activity over time") > {: .details} > Try it out! -> Do the quality control with **FastQC** {% icon tool %} and/or **Nanoplot** {% icon tool %} on `nanopore_basecalled-guppy.fastq.gz` and compare the results! +> Do the quality control with **Falco** {% icon tool %} and/or **Nanoplot** {% icon tool %} on `nanopore_basecalled-guppy.fastq.gz` and compare the results! 
{: .comment} # Conclusion @@ -1125,7 +1179,8 @@ Quality control steps are similar for any type of sequencing data: - Quality assessment with tools like: - *Short Reads*: {% tool [FASTQE](toolshed.g2.bx.psu.edu/repos/iuc/fastqe/fastqe/0.3.1+galaxy0) %} - - *Short+Long*: {% tool [FASTQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} + - *Short+Long*: {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} + - *Short+Long*: {% tool [Falco](toolshed.g2.bx.psu.edu/repos/iuc/falco/falco/1.2.4+galaxy0) %} - *Long Reads*: {% tool [Nanoplot](toolshed.g2.bx.psu.edu/repos/iuc/nanoplot/nanoplot/1.41.0+galaxy0) %} - *Nanopore only*: {% tool [PycoQC](toolshed.g2.bx.psu.edu/repos/iuc/pycoqc/pycoqc/2.5.2+galaxy0) %} - Trimming and filtering for **short reads** with a tool like **Cutadapt** {% icon tool %} diff --git a/topics/variant-analysis/tutorials/dunovo/tutorial.md b/topics/variant-analysis/tutorials/dunovo/tutorial.md index 66e0239e33ce0e..edd26e2e49321e 100644 --- a/topics/variant-analysis/tutorials/dunovo/tutorial.md +++ b/topics/variant-analysis/tutorials/dunovo/tutorial.md @@ -141,7 +141,7 @@ We then evaluated the quality of the data by running FastQC on both datasets (fo {: .hands_on} -This created two datasets in our galaxy history: one for forward reads and one for reverse. We then evaluated the quality of the data by running FastQC on both datasets (forward and reverse). You can read about using {% icon tool %} **FastQC** in the dedicated [quality-control tutorial]({{ site.baseurl }}/topics/sequence-analysis/tutorials/quality-control/tutorial.html#assess-quality-with-fastqc---short--long-reads). +This created two datasets in our galaxy history: one for forward reads and one for reverse. We then evaluated the quality of the data by running FastQC on both datasets (forward and reverse). 
You can read about using {% icon tool %} **FastQC** in the dedicated [quality-control tutorial]({{ site.baseurl }}/topics/sequence-analysis/tutorials/quality-control/tutorial.html#assess-quality-with-falcofastqc---short--long-reads). This gave us the following plots: @@ -408,5 +408,5 @@ You can use the variant calling workflow to call variants using the SSCS instead You should now understand duplex sequencing, rare variants, and be able to process the former to find the latter. -### If things don't work... +# If things don't work... ...you need to complain. Use [Galaxy's Help Forum](https://help.galaxyproject.org/) to do this.