Skip to content

Commit e5b5f83

Browse files
authored
Merge pull request #319 from shbrief/master
Example analysis using harmonized metadata
2 parents cade2aa + f383cef commit e5b5f83

File tree

2 files changed

+137
-60
lines changed

2 files changed

+137
-60
lines changed

DESCRIPTION

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -25,13 +25,15 @@ Authors@R:
2525
person(given = "Marcel", family = "Ramos", role = "ctb"),
2626
person(given = "Valerie", family = "Obenchain", role = "ctb"),
2727
person(given = "Kelly", family = "Eckenrode", role = "ctb"),
28-
person(given = "Nicola", family = "Segata", role = "ctb"))
28+
person(given = "Nicola", family = "Segata", role = "ctb"),
29+
person(given = "Sehyun", family = "Oh", role = "ctb"),
30+
person(given = "Yoon-Ji", family = "Jung", role = "ctb"))
2931
biocViews:
3032
ExperimentHub,
3133
Homo_sapiens_Data,
3234
MicrobiomeData,
3335
ReproducibleResearch
34-
Version: 3.17.1
36+
Version: 3.17.2
3537
License: Artistic-2.0
3638
Depends:
3739
R (>= 4.1.0),

vignettes/curatedMetagenomicData.Rmd

Lines changed: 133 additions & 58 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,16 @@ author:
1111
- Graduate School of Public Health and Health Policy, City University of New
1212
York, New York, NY, U.S.A.
1313
email: levi.waldron@sph.cuny.edu
14+
- name: Yoon Ji Jung
15+
affiliation:
16+
- Graduate School of Public Health and Health Policy, City University of New
17+
York, New York, NY, U.S.A.
18+
email: YOONJI.JUNG49@sphmail.cuny.edu
19+
- name: Sehyun Oh
20+
affiliation:
21+
- Graduate School of Public Health and Health Policy, City University of New
22+
York, New York, NY, U.S.A.
23+
email: Sehyun.Oh@sph.cuny.edu
1424
package: curatedMetagenomicData
1525
abstract: >
1626
The curatedMetagenomicData package provides standardized, curated human
@@ -186,110 +196,175 @@ The `counts` and `rownames` arguments apply to `returnSamples()` as well, and ca
186196

187197
To demonstrate the utility of `r Biocpkg("curatedMetagenomicData")`, an example analysis is presented below. However, readers should know analysis is generally beyond the scope of `r Biocpkg("curatedMetagenomicData")` and the analysis presented here is for demonstration alone. It is best to consider the output of `r Biocpkg("curatedMetagenomicData")` as the input of analysis more than anything else.
188198

189-
## R Packages
190-
191-
To demonstrate the utility of `r Biocpkg("curatedMetagenomicData")`, the `r CRANpkg("stringr")`, `r Biocpkg("mia")`, `r Biocpkg("scater")`, and `r CRANpkg("vegan")` packages are needed.
192-
193-
```{r, message = FALSE}
194-
library(stringr)
199+
## Load R packages
200+
```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
201+
library(OmicsMLRepoR)
195202
library(mia)
196203
library(scater)
197204
library(vegan)
205+
library(stringr)
206+
library(lefser)
207+
```
208+
209+
## Retrieve harmonized metadata
210+
`r Biocpkg("curatedMetagenomicData")` loads the metadata table, `sampleMetadata`. For further harmonized version of the sample-level metadata will be available in the future release of this package, and currently available through `r Biocpkg("OmicsMLRepoR")` package.
211+
212+
```{r, message = FALSE}
213+
cmd <- OmicsMLRepoR::getMetadata("cMD")
198214
```
199215

200216
## Prepare Data
201217

202-
In our hypothetical study, let's examine the association of alcohol consumption and stool microbial composition across all annotated samples in `r Biocpkg("curatedMetagenomicData")`. We will examine the alpha diversity (within subject diversity), beta diversity (between subject diversity), and conclude with a few notes on differential abundance analysis.
218+
In this example, we will examine the association between current smoking status and fecal microbial composition across all relevant samples from `r Biocpkg("curatedMetagenomicData")`. We will examine the alpha diversity (within subject diversity), beta diversity (between subject diversity), and differential abundant taxa.
203219

204220
### Return Samples
205221

206-
First, as above, we use the `returnSamples()` function to return the relevant samples across all studies available in `r Biocpkg("curatedMetagenomicData")`. We want adults over the age of 18, for whom alcohol consumption status is known, and we want only stool samples. The `select(where...` line below simply removes metadata columns which are all `NA` values – they exist in another study but are all `NA` once subsetting has been done. Lastly, the `"relative_abundance"` `dataType` is requested because it contains the relevant information about microbial composition.
222+
First, as above, we use the `returnSamples()` function to return the relevant samples across all studies available in `r Biocpkg("curatedMetagenomicData")`. We want healthy adults, whose smoking history is known, and only fecal samples. The `select(where...` line below removes metadata columns which are all `NA` values – they exist in another study but are all `NA` once subsetting has been done.
207223

208-
```{r, collapse = TRUE, message = FALSE}
209-
alcoholStudy <-
210-
filter(sampleMetadata, age >= 18) |>
211-
filter(!is.na(alcohol)) |>
212-
filter(body_site == "stool") |>
213-
select(where(~ !all(is.na(.x)))) |>
214-
returnSamples("relative_abundance", rownames = "short")
224+
```{r}
225+
smoke <- cmd |>
226+
filter(disease == "Healthy") |>
227+
filter(age_group == "Adult") |>
228+
filter(!is.na(smoker)) |>
229+
filter(body_site == "feces") |>
230+
select(where(~ !all(is.na(.x)))) # remove metadata columns which are all `NA` values
215231
```
216232

217-
### Mutate colData
233+
```{r}
234+
table(smoke$smoker, useNA = "ifany")
235+
```
218236

219-
Most of the values in the `sampleMetadata` `data.frame` (which becomes `colData`) are in snake case (e.g. `snake_case`) and don't look nice in plots. Here, the values of the `alcohol` variable are made into title case using `r CRANpkg("stringr")` so they will look nice in plots.
237+
A new binary variable for smoking status, with levels `Smoker` and `Never Smoker`, is created to facilitate downstream analysis. The names of attributes are updated, so they will look nice in plots.
220238

221-
```{r, collapse = TRUE, message = FALSE}
222-
colData(alcoholStudy) <-
223-
colData(alcoholStudy) |>
224-
as.data.frame() |>
225-
mutate(alcohol = str_replace_all(alcohol, "no", "No")) |>
226-
mutate(alcohol = str_replace_all(alcohol, "yes", "Yes")) |>
227-
DataFrame()
239+
```{r}
240+
smoke <- smoke %>%
241+
mutate(
242+
smoker_bin = as.factor(
243+
case_when(smoker == "Smoker (finding)" ~ "Smoker",
244+
smoker != "Non-smoker (finding);Never smoked tobacco (finding)" ~ "Never Smoker",
245+
)))
246+
247+
table(smoke$smoker_bin, useNA = "ifany")
228248
```
229249

230-
### Agglomerate Ranks
250+
Lastly, the `"relative_abundance"` `dataType` is requested because it contains the relevant information about microbial composition.
231251

232-
Next, the `splitByRanks` function from `r Biocpkg("mia")` is used to create alternative experiments for each level of the taxonomic tree (e.g. Genus). This allows for diversity and differential abundance analysis at specific taxonomic levels; with this step complete, our data is ready to analyze.
252+
```{r message = FALSE}
253+
smoke_tse <- smoke %>% returnSamples("relative_abundance", rownames = "short")
233254
234-
```{r, collapse = TRUE, message = FALSE}
235-
altExps(alcoholStudy) <-
236-
splitByRanks(alcoholStudy)
255+
## Removing samples with NA values for smoker_bin
256+
smoke_tse <- smoke_tse[,!is.na(smoke_tse$smoker_bin)]
237257
```
238258

239-
## Alpha Diversity
259+
### Agglomerate By Taxonomic Rank
260+
The `agglomerateByRank` unction from `r Biocpkg("mia")` is used to sum up data based on associations with certain taxonomic ranks, as defined in `rowData`.
240261

241-
Alpha diversity is a measure of the within sample diversity of features (relative abundance proportions here) and seeks to quantify the evenness (i.e. are the amounts of different microbes the same) and richness (i.e. are they are large variety of microbial taxa present). The Shannon index (H') is a commonly used measure of alpha diversity, it's estimated here using the `estimateDiversity()` function from the `r Biocpkg("mia")` package.
262+
```{r}
263+
smoke_tse_genus <- agglomerateByRank(smoke_tse, rank = "genus")
264+
```
242265

243-
To quickly plot the results of alpha diversity estimation, the `plotColData()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax.
244266

245-
```{r, collapse = TRUE, fig.cap = "Alpha Diversity – Shannon Index (H')"}
246-
alcoholStudy |>
247-
estimateDiversity(assay.type = "relative_abundance", index = "shannon") |>
248-
plotColData(x = "alcohol", y = "shannon", colour_by = "alcohol", shape_by = "alcohol") +
249-
labs(x = "Alcohol", y = "Alpha Diversity (H')") +
250-
guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
267+
## Alpha Diversity
268+
Alpha diversity is a measure of the within sample diversity of features
269+
(relative abundance proportions here) and seeks to quantify the evenness
270+
(i.e. are the amounts of different microbes the same) and richness (i.e.
271+
are they are large variety of microbial taxa present). The Shannon index
272+
(H’) is a commonly used measure of alpha diversity, it’s estimated here
273+
using the `addAlpha()` function from the `r Biocpkg("mia")` package.
274+
275+
```{r fig.cap = "Alpha Diversity - Shannon Index (H')"}
276+
## Adding Shannon diversity values to colData
277+
smoke_shannon <- smoke_tse_genus |>
278+
addAlpha(assay.type = "relative_abundance", index = "shannon_diversity")
279+
280+
## Violin plots
281+
title <- "Alpha Diversity by Smoking Status"
282+
smoke_shannon |>
283+
plotColData(x = "smoker_bin", y = "shannon_diversity", colour_by = "smoker_bin", shape_by = "smoker_bin") +
284+
labs(x = "Smoking Status", y = "Alpha Diversity (H')") +
285+
guides(colour = guide_legend(title = "Smoking Status"), shape = guide_legend(title = title)) +
251286
theme(legend.position = "none")
252287
```
253288

254-
The figure suggest that those who consume alcohol have higher Shannon alpha diversity than those who do not consume alcohol; however, the difference does not appear to be significant, at least qualitatively.
289+
A p-value < 0.01 and a W value > 0 indicate that the `Never Smoker` group
290+
has higher alpha diversity compared to the `Smoker` group. This may serve
291+
as basis for further investigation as to whether smoking can lead to gut
292+
microbiome dysbiosis.
255293

256-
## Beta Diversity
294+
```{r}
295+
## Test if alpha diversity between smokers and non-smokers is significantly different
296+
wilcox.test(shannon_diversity ~ smoker_bin, data = colData(smoke_shannon))
297+
```
257298

258-
Beta diversity is a measure of the between sample diversity of features (relative abundance proportions here) and seeks to quantify the magnitude of differences (or similarity) between every given pair of samples. Below it is assessed by Bray–Curtis Principal Coordinates Analysis (PCoA) and Uniform Manifold Approximation and Projection (UMAP).
299+
## Beta Diversity
300+
Beta diversity is a measure of the between sample diversity of features
301+
(relative abundance proportions here) and seeks to quantify the magnitude
302+
of differences (or similarity) between every given pair of samples. Below
303+
it is assessed by Bray–Curtis Principal Coordinates Analysis (PCoA) and
304+
Uniform Manifold Approximation and Projection (UMAP).
259305

260306
### Bray–Curtis PCoA
307+
To calculate pairwise Bray–Curtis distance for every sample in our study
308+
we will use the `runMDS()` function from the `r Biocpkg("scater")` package
309+
along with the `vegdist()` function from the `r CRANpkg("vegan")` package.
261310

262-
To calculate pairwise Bray–Curtis distance for every sample in our study we will use the `runMDS()` function from the `r Biocpkg("scater")` package along with the `vegdist()` function from the `r CRANpkg("vegan")` package.
311+
To quickly plot the results of beta diversity analysis,
312+
the `plotReducedDim()` function from the `r Biocpkg("scater")` package is
313+
used along with `r CRANpkg("ggplot2")` syntax.
263314

264-
To quickly plot the results of beta diversity analysis, the `plotReducedDim()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax.
265-
266-
```{r, collapse = TRUE, fig.cap = "Beta Diversity – Bray–Curtis PCoA"}
267-
alcoholStudy |>
315+
```{r fig.cap = "Beta Diversity – Bray–Curtis PCoA"}
316+
smoke_tse %>%
317+
agglomerateByRanks() |>
268318
runMDS(FUN = vegdist, method = "bray", exprs_values = "relative_abundance", altexp = "genus", name = "BrayCurtis") |>
269-
plotReducedDim("BrayCurtis", colour_by = "alcohol", shape_by = "alcohol") +
319+
plotReducedDim("BrayCurtis", colour_by = "smoker_bin", shape_by = "smoker_bin") +
270320
labs(x = "PCo 1", y = "PCo 2") +
271-
guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
272-
theme(legend.position = c(0.90, 0.85))
321+
guides(colour = guide_legend(title = "Smoking Status"), shape = guide_legend(title = "Smoking Status")) +
322+
theme(legend.position = c(0.80, 0.25))
273323
```
274324

275-
### UMAP
276325

277-
To calculate the UMAP coordinates of every sample in our study we will use the `runUMAP()` function from the `r Biocpkg("scater")` package package, as it handles the task in a single line.
326+
### UMAP
327+
To calculate the UMAP coordinates of every sample in our study we will use
328+
the` runUMAP()` function from the `r Biocpkg("scater")` package package, as
329+
it handles the task in a single line.
278330

279-
To quickly plot the results of beta diversity analysis, the `plotReducedDim()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax again.
331+
To quickly plot the results of beta diversity analysis, the `plotReducedDim()`
332+
function from the `r Biocpkg("scater")` package is used along with
333+
`r CRANpkg("ggplot2")` syntax again.
280334

281-
```{r, collapse = TRUE, fig.cap = "Beta Diversity – UMAP (Uniform Manifold Approximation and Projection)"}
282-
alcoholStudy |>
335+
```{r}
336+
smoke_tse %>%
337+
agglomerateByRanks() |>
283338
runUMAP(exprs_values = "relative_abundance", altexp = "genus", name = "UMAP") |>
284-
plotReducedDim("UMAP", colour_by = "alcohol", shape_by = "alcohol") +
339+
plotReducedDim("UMAP", colour_by = "smoker_bin", shape_by = "smoker_bin") +
285340
labs(x = "UMAP 1", y = "UMAP 2") +
286-
guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
287-
theme(legend.position = c(0.90, 0.85))
341+
guides(colour = guide_legend(title = "Smoking Status"), shape = guide_legend(title = "Smoking Status")) +
342+
theme(legend.position = c(0.80, 0.55))
288343
```
289344

345+
290346
## Differential Abundance
347+
Next, we can identify taxa enriched in either the *Smoker* or *Never Smoker*
348+
groups. An example approach for differential abundance is the LEfSe analysis,
349+
which can be accomplished using `lefser()` and `lefserPlot()` from the
350+
`r Biocpkg("lefser")` package.
351+
352+
```{r}
353+
lefser(
354+
relativeAb(smoke_tse_genus),
355+
kruskal.threshold = 0.05,
356+
wilcox.threshold = 0.05,
357+
lda.threshold = 2,
358+
classCol = "smoker_bin",
359+
subclassCol = NULL,
360+
assay = 1L,
361+
trim.names = FALSE,
362+
checkAbundances = TRUE
363+
) %>%
364+
lefserPlot()
365+
```
366+
291367

292-
Next, it would be desirable to establish which microbes are differentially abundant between the two groups (those who consume alcohol, and those who do not). The `r Biocpkg("lefser")` and `r Biocpkg("ANCOMBC")` packages are excellent resources for this tasks; however, code is not included here to avoid including excessive `Suggests` packages – `r Biocpkg("curatedMetagenomicData")` had far too many of these in the the past and is now very lean. There is a repository of analyses, [curatedMetagenomicAnalyses](https://github.com/waldronlab/curatedMetagenomicAnalyses), on GitHub and a forthcoming paper that will feature extensive demonstrations of analyses – but for now, the suggestions above will have to suffice.
293368

294369
# Type Conversion
295370

0 commit comments

Comments
 (0)