- Graduate School of Public Health and Health Policy, City University of New
12
12
York, New York, NY, U.S.A.
13
13
email: levi.waldron@sph.cuny.edu
14
+
- name: Yoon Ji Jung
15
+
affiliation:
16
+
- Graduate School of Public Health and Health Policy, City University of New
17
+
York, New York, NY, U.S.A.
18
+
email: YOONJI.JUNG49@sphmail.cuny.edu
19
+
- name: Sehyun Oh
20
+
affiliation:
21
+
- Graduate School of Public Health and Health Policy, City University of New
22
+
York, New York, NY, U.S.A.
23
+
email: Sehyun.Oh@sph.cuny.edu
14
24
package: curatedMetagenomicData
15
25
abstract: >
16
26
The curatedMetagenomicData package provides standardized, curated human
@@ -186,110 +196,175 @@ The `counts` and `rownames` arguments apply to `returnSamples()` as well, and ca
186
196
187
197
To demonstrate the utility of `r Biocpkg("curatedMetagenomicData")`, an example analysis is presented below. However, readers should know analysis is generally beyond the scope of `r Biocpkg("curatedMetagenomicData")` and the analysis presented here is for demonstration alone. It is best to consider the output of `r Biocpkg("curatedMetagenomicData")` as the input of analysis more than anything else.
188
198
189
-
## R Packages
190
-
191
-
To demonstrate the utility of `r Biocpkg("curatedMetagenomicData")`, the `r CRANpkg("stringr")`, `r Biocpkg("mia")`, `r Biocpkg("scater")`, and `r CRANpkg("vegan")` packages are needed.
`r Biocpkg("curatedMetagenomicData")` loads the metadata table, `sampleMetadata`. A further harmonized version of the sample-level metadata will be available in a future release of this package and is currently available through the `r Biocpkg("OmicsMLRepoR")` package.
211
+
212
+
```{r, message = FALSE}
213
+
cmd <- OmicsMLRepoR::getMetadata("cMD")
198
214
```
199
215
200
216
## Prepare Data
201
217
202
-
In our hypothetical study, let's examine the association of alcohol consumption and stool microbial composition across all annotated samples in `r Biocpkg("curatedMetagenomicData")`. We will examine the alpha diversity (within subject diversity), beta diversity (between subject diversity), and conclude with a few notes on differential abundance analysis.
218
+
In this example, we will examine the association between current smoking status and fecal microbial composition across all relevant samples from `r Biocpkg("curatedMetagenomicData")`. We will examine the alpha diversity (within subject diversity), beta diversity (between subject diversity), and differentially abundant taxa.
203
219
204
220
### Return Samples
205
221
206
-
First, as above, we use the `returnSamples()` function to return the relevant samples across all studies available in `r Biocpkg("curatedMetagenomicData")`. We want adults over the age of 18, for whom alcohol consumption status is known, and we want only stool samples. The `select(where...` line below simply removes metadata columns which are all `NA` values – they exist in another study but are all `NA` once subsetting has been done. Lastly, the `"relative_abundance"` `dataType` is requested because it contains the relevant information about microbial composition.
222
+
First, as above, we use the `returnSamples()` function to return the relevant samples across all studies available in `r Biocpkg("curatedMetagenomicData")`. We want healthy adults, whose smoking history is known, and only fecal samples. The `select(where...` line below removes metadata columns which are all `NA` values – they exist in another study but are all `NA` once subsetting has been done.
select(where(~ !all(is.na(.x)))) # remove metadata columns which are all `NA` values
215
231
```
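As an illustration of what the `select(where(...))` step does, the following is a minimal base R sketch on a toy `data.frame`. The column names and values are hypothetical and are not taken from `sampleMetadata`; the point is only that columns containing nothing but `NA` are dropped.

```{r}
# Toy metadata table (hypothetical values); `bmi` is NA for every row
toy_meta <- data.frame(
    sample_id = c("s1", "s2", "s3"),
    smoker    = c("yes", "no", NA),
    bmi       = c(NA_real_, NA_real_, NA_real_)
)

# Flag the columns that contain at least one non-NA value
keep <- vapply(toy_meta, function(x) !all(is.na(x)), logical(1))

# Keep only the flagged columns; a column with some NA values is retained
toy_clean <- toy_meta[, keep, drop = FALSE]

colnames(toy_clean)
```

Note that `smoker` survives even though it has one missing value; only columns that are entirely `NA` are removed.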
216
232
217
-
### Mutate colData
233
+
```{r}
234
+
table(smoke$smoker, useNA = "ifany")
235
+
```
218
236
219
-
Most of the values in the `sampleMetadata` `data.frame` (which becomes `colData`) are in snake case (e.g. `snake_case`) and don't look nice in plots. Here, the values of the `alcohol` variable are made into title case using `r CRANpkg("stringr")` so they will look nice in plots.
237
+
A new binary variable for smoking status, with levels `Smoker` and `Never Smoker`, is created to facilitate downstream analysis. The names of attributes are updated, so they will look nice in plots.
Lastly, the `"relative_abundance"` `dataType` is requested because it contains the relevant information about microbial composition.
231
251
232
-
Next, the `splitByRanks` function from `r Biocpkg("mia")` is used to create alternative experiments for each level of the taxonomic tree (e.g. Genus). This allows for diversity and differential abundance analysis at specific taxonomic levels; with this step complete, our data is ready to analyze.
The `agglomerateByRank` function from `r Biocpkg("mia")` is used to sum up data based on associations with certain taxonomic ranks, as defined in `rowData`.
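To make the idea of agglomeration concrete, here is a minimal base R sketch that sums species-level rows sharing a genus. It is not how `r Biocpkg("mia")` implements the step; the matrix, the taxon names, and the `genus` vector (standing in for a column of `rowData`) are all hypothetical.

```{r}
# Toy abundance matrix: 4 species (rows) by 2 samples (columns)
abund <- matrix(
    c(10, 20, 30, 40,    # sample_A
       5, 15, 25, 55),   # sample_B
    nrow = 4,
    dimnames = list(paste0("species_", 1:4), c("sample_A", "sample_B"))
)

# Hypothetical genus assignment for each species (plays the role of `rowData`)
genus <- c("Bacteroides", "Bacteroides", "Prevotella", "Prevotella")

# Sum the rows that share a genus, giving one row per genus
genus_abund <- rowsum(abund, group = genus)
genus_abund
```

Each sample's column total is unchanged by the operation; only the row resolution becomes coarser.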
240
261
241
-
Alpha diversity is a measure of the within sample diversity of features (relative abundance proportions here) and seeks to quantify the evenness (i.e. are the amounts of different microbes the same) and richness (i.e. is there a large variety of microbial taxa present). The Shannon index (H') is a commonly used measure of alpha diversity; it is estimated here using the `estimateDiversity()` function from the `r Biocpkg("mia")` package.
To quickly plot the results of alpha diversity estimation, the `plotColData()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax.
The figure suggests that those who consume alcohol have higher Shannon alpha diversity than those who do not consume alcohol; however, the difference does not appear to be significant, at least qualitatively.
289
+
A p-value < 0.01, together with the plotted group distributions, indicates that the `Never Smoker` group
290
+
has higher alpha diversity compared to the `Smoker` group. This may serve
291
+
as a basis for further investigation as to whether smoking can lead to gut
292
+
microbiome dysbiosis.
255
293
256
-
## Beta Diversity
294
+
```{r}
295
+
## Test if alpha diversity between smokers and non-smokers is significantly different
296
+
wilcox.test(shannon_diversity ~ smoker_bin, data = colData(smoke_shannon))
297
+
```
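For readers who want to see the pieces in isolation, the sketch below computes the Shannon index by hand and runs the same kind of Wilcoxon rank-sum test on a toy `data.frame`. All values are hypothetical, and `shannon()` is a helper defined here for illustration, not a function from `r Biocpkg("mia")`.

```{r}
# Shannon index H' = -sum(p * log(p)), taken over taxa with non-zero proportion
shannon <- function(p) {
    p <- p[p > 0] / sum(p)
    -sum(p * log(p))
}

even   <- shannon(c(0.25, 0.25, 0.25, 0.25))  # perfectly even community, equals log(4)
skewed <- shannon(c(0.97, 0.01, 0.01, 0.01))  # community dominated by one taxon

# Hypothetical per-sample Shannon values for two groups
toy <- data.frame(
    shannon_diversity = c(2.1, 2.4, 2.6, 2.8, 3.0, 3.1, 3.3, 3.5),
    smoker_bin = factor(rep(c("Smoker", "Never Smoker"), each = 4))
)

res <- wilcox.test(shannon_diversity ~ smoker_bin, data = toy)
res$statistic  # the W statistic, which cannot be negative
res$p.value

# Group medians show which group is higher
tapply(toy$shannon_diversity, toy$smoker_bin, median)
```

Because W is a count of favourable pairs, the direction of a difference is easiest to read from the group medians or from a plot.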
257
298
258
-
Beta diversity is a measure of the between sample diversity of features (relative abundance proportions here) and seeks to quantify the magnitude of differences (or similarity) between every given pair of samples. Below it is assessed by Bray–Curtis Principal Coordinates Analysis (PCoA) and Uniform Manifold Approximation and Projection (UMAP).
299
+
## Beta Diversity
300
+
Beta diversity is a measure of the between sample diversity of features
301
+
(relative abundance proportions here) and seeks to quantify the magnitude
302
+
of differences (or similarity) between every given pair of samples. Below
303
+
it is assessed by Bray–Curtis Principal Coordinates Analysis (PCoA) and
304
+
Uniform Manifold Approximation and Projection (UMAP).
259
305
260
306
### Bray–Curtis PCoA
307
+
To calculate pairwise Bray–Curtis distance for every sample in our study
308
+
we will use the `runMDS()` function from the `r Biocpkg("scater")` package
309
+
along with the `vegdist()` function from the `r CRANpkg("vegan")` package.
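As a rough illustration of what this step computes, the sketch below builds a Bray–Curtis dissimilarity matrix by hand and passes it to base R's `cmdscale()` for classical scaling (PCoA). It does not use `runMDS()` or `vegdist()`; the sample names, abundances, and the `bray_curtis()` helper are hypothetical.

```{r}
# Toy relative abundances: 4 samples (rows) by 3 taxa (columns)
x <- rbind(
    s1 = c(0.5, 0.3, 0.2),
    s2 = c(0.4, 0.4, 0.2),
    s3 = c(0.1, 0.1, 0.8),
    s4 = c(0.2, 0.6, 0.2)
)

# Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Fill the pairwise dissimilarity matrix
n <- nrow(x)
d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        d[i, j] <- bray_curtis(x[i, ], x[j, ])
    }
}

# Classical multidimensional scaling on the dissimilarities
pcoa <- cmdscale(as.dist(d), k = 2)
pcoa
```

Samples with similar composition (here `s1` and `s2`) have a small dissimilarity and therefore land close together in the reduced coordinates.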
261
310
262
-
To calculate pairwise Bray–Curtis distance for every sample in our study we will use the `runMDS()` function from the `r Biocpkg("scater")` package along with the `vegdist()` function from the `r CRANpkg("vegan")` package.
311
+
To quickly plot the results of beta diversity analysis,
312
+
the `plotReducedDim()` function from the `r Biocpkg("scater")` package is
313
+
used along with `r CRANpkg("ggplot2")` syntax.
263
314
264
-
To quickly plot the results of beta diversity analysis, the `plotReducedDim()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax.
To calculate the UMAP coordinates of every sample in our study we will use the `runUMAP()` function from the `r Biocpkg("scater")` package, as it handles the task in a single line.
326
+
### UMAP
327
+
To calculate the UMAP coordinates of every sample in our study we will use
328
+
the `runUMAP()` function from the `r Biocpkg("scater")` package, as
329
+
it handles the task in a single line.
278
330
279
-
To quickly plot the results of beta diversity analysis, the `plotReducedDim()` function from the `r Biocpkg("scater")` package is used along with `r CRANpkg("ggplot2")` syntax again.
331
+
To quickly plot the results of beta diversity analysis, the `plotReducedDim()`
332
+
function from the `r Biocpkg("scater")` package is used along with
Next, we can identify taxa enriched in either the *Smoker* or *Never Smoker*
348
+
groups. An example approach for differential abundance is the LEfSe analysis,
349
+
which can be accomplished using `lefser()` and `lefserPlot()` from the
350
+
`r Biocpkg("lefser")` package.
351
+
352
+
```{r}
353
+
lefser(
354
+
relativeAb(smoke_tse_genus),
355
+
kruskal.threshold = 0.05,
356
+
wilcox.threshold = 0.05,
357
+
lda.threshold = 2,
358
+
classCol = "smoker_bin",
359
+
subclassCol = NULL,
360
+
assay = 1L,
361
+
trim.names = FALSE,
362
+
checkAbundances = TRUE
363
+
) %>%
364
+
lefserPlot()
365
+
```
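LEfSe begins by screening each feature with a Kruskal–Wallis test across classes before any effect size is estimated. The sketch below shows only that screening idea in base R on hypothetical data; it is not the `r Biocpkg("lefser")` implementation and omits the later Wilcoxon and LDA steps.

```{r}
# Hypothetical class labels for 10 samples
group <- factor(rep(c("Smoker", "Never Smoker"), each = 5))

# Hypothetical genus-level abundances (rows are taxa, columns are samples)
abund <- rbind(
    genus_A = c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15),  # groups clearly separated
    genus_B = c(5, 9, 6, 10, 7, 6, 10, 5, 9, 7)      # groups overlap completely
)

# Kruskal-Wallis p-value for each taxon
p_values <- apply(abund, 1, function(x) kruskal.test(x, group)$p.value)

# Taxa passing a 0.05 threshold, analogous to `kruskal.threshold` above
candidates <- names(p_values)[p_values < 0.05]
candidates
```

Only taxa that pass this screen would move on to effect size estimation, which is what `lda.threshold` then filters.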
366
+
291
367
292
-
Next, it would be desirable to establish which microbes are differentially abundant between the two groups (those who consume alcohol, and those who do not). The `r Biocpkg("lefser")` and `r Biocpkg("ANCOMBC")` packages are excellent resources for this task; however, code is not included here to avoid including excessive `Suggests` packages – `r Biocpkg("curatedMetagenomicData")` had far too many of these in the past and is now very lean. There is a repository of analyses, [curatedMetagenomicAnalyses](https://github.com/waldronlab/curatedMetagenomicAnalyses), on GitHub and a forthcoming paper that will feature extensive demonstrations of analyses – but for now, the suggestions above will have to suffice.