Skip to content

Commit 914f383

Browse files
committed
Reorganize analysis: Move R analysis to dedicated directory and deprecate Python analysis
1 parent d119578 commit 914f383

25 files changed

+8382
-5
lines changed
Lines changed: 146 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,146 @@
1+
---
2+
title: "Example_analysis_Hierarchical"
3+
output: html_document
4+
date: "2024-06-03"
5+
---
6+
7+
# *NOTE: this script and analysis is still preliminary and should only be used if one is comfortable with both hierarchical and bayesian models!*
8+
9+
```{r setup, include=FALSE}
10+
knitr::opts_chunk$set(echo = TRUE)
11+
12+
pacman::p_load(tidyverse,ggdist,psycho,caret,patchwork, gt, cowplot,
13+
grid,reticulate,cmdstanr,posterior,rstan,bayesplot,here,rmarkdown,pracma,
14+
brms)
15+
16+
set.seed(111)
17+
18+
source(here("docs","source","examples","R","src","firstlevelanalysis.R"))
19+
20+
```
21+
22+
## **Here we show how to perform a Hierarchical Bayesian post-hoc analysis on the data.**
23+
24+
This type of analysis is similar to the post-hoc single fit bayesian analysis, however here we fit all our subjects together.
25+
26+
This is analogous to the difference between running many separate linear regression and then aggregating the results, compared to running a linear mixed effects model. (This script is what a linear mixed effects model is).
27+
28+
29+
30+
## *Running this model*
31+
32+
For the moment, running this type of hierarchical model, opens the black-box of the sampling algorithm from the single fit analysis.
33+
34+
Firstly, we need to specify the model:
35+
36+
```{r}
37+
mod = cmdstanr::cmdstan_model(here::here("docs","source","examples","R","src","Hierarchical Cummulative Normal.stan"))
38+
```
39+
40+
and load the data
41+
42+
```{r}
43+
data = read.csv(here::here("docs","source","examples","R","data","Hierarchical data","Hierarchical_data.csv"))
44+
```
45+
46+
47+
This data set is simulated and represents a within subject design with two sessions. We can visualize the aggregated (population level) effects for each session:
48+
49+
```{r}
50+
data %>% ggplot(aes(x = Alpha, y = resp, col = as.factor(sessions)))+
51+
geom_smooth(method = "glm",
52+
method.args = list(family = "binomial"),
53+
se = FALSE)+theme_classic()
54+
```
55+
56+
To run the model one needs to run the following:
57+
58+
This just aggregates the data for the stan model to run a bit faster. Note that you will need a participant_id column that is a unique identifier for each subjects and a sessions column that is a unique session identifier.
59+
60+
```{r}
61+
data = data %>% mutate(sessions = ifelse(sessions == 1, 0 ,1))
62+
63+
data = transform_data_to_stan(data) %>% arrange(participant_id,sessions)
64+
```
65+
66+
67+
Now one just needs to put it all into one big list as below and run the model!
68+
69+
70+
Note! Here 3 matrices are specified "X_lapse", "X_alpha", "X_beta". These represent parameterizations of these three parameters of the model.
71+
72+
What this means is that you can specify you desired constast of interrest by adding a column to this matrix, the model will then give the estimate (as in a linear model), of how much this "covariate" explains. Here we are interrested in the difference in slope (beta) and threshold (alpha) between sessions.
73+
74+
```{r}
75+
datastan = list(T = nrow(data),
76+
S = length(unique(data$participant_id)),
77+
S_id = as.numeric(data$participant_id),
78+
X = data %>% .$Alpha,
79+
X_lapse = as.matrix(data.frame(int = rep(1,nrow(data)))),
80+
X_alpha = as.matrix(data.frame(int = rep(1,nrow(data)),
81+
sessions = data %>% .$sessions)),
82+
X_beta = as.matrix(data.frame(int = rep(1,nrow(data)),
83+
sessions = data %>% .$sessions)),
84+
N_alpha = 2,
85+
N_beta = 2,
86+
N_lapse = 1,
87+
Y = data %>% .$resp,
88+
npx = data %>% .$npx
89+
90+
)
91+
92+
```
93+
94+
95+
Running the model is the easy bit. The following makes the stan model sample from the joint posterior. See the comments for each line!
96+
97+
```{r}
98+
fit <- mod$sample(
99+
data = datastan, #The List specified above containing the data
100+
iter_sampling = 1000, #The number of sampling draws
101+
iter_warmup = 1000, #The number of warmup draws
102+
chains = 4, #The number of chains
103+
init = 0, #Initial values for the sampler
104+
parallel_chains = 4, #the number of chains to run in parallel
105+
refresh = 500, #when do you want a update from stan? Here 500 means that after 500 draws you get an "update"
106+
adapt_delta = 0.90, #control parameter for the sampler (sometimes useful for getting rid of divergences)
107+
max_treedepth = 10 #control parameter for the sampler (useful if you get a warning about max_treedepth)
108+
)
109+
```
110+
111+
112+
## *Output*
113+
114+
115+
The output of this process is an object that has both summary of the marginal posteriors, but also all the draws from it.
116+
117+
Firstly, calling the summary of the object gives a summary of the model.
118+
119+
```{r}
120+
fit$summary()
121+
```
122+
123+
Here is a description of what these variables mean
124+
125+
gm[1] is the "population" mean of the slope for the reference level of sessions (i.e. session 0).
126+
127+
gm[2] is the "population" mean difference of the slope between the session levels (this was specified in the data).
128+
129+
gm[3] is the "population" mean of the lapse rate. This was not specified to vary between sessions.
130+
131+
gm[4] is the "population" mean of the threshold for the reference level of sessions (i.e. session 0).
132+
133+
gm[5] is the "population" mean difference of the threshold between the session levels (this was specified in the data).
134+
135+
136+
The same indexing is present for tau_u which displays the standard deviation of the population level distribution i.e.
137+
138+
gm[1] is the "population" standard deviation of the slope for the reference level of sessions (i.e. session 0).
139+
140+
gm[2] is the "population" standard deviation difference of the slope between the session levels (this was specified in the data).
141+
142+
gm[3] is the "population" standard deviation of the lapse rate. This was not specified to vary between sessions.
143+
144+
gm[4] is the "population" standard deviation of the threshold for the reference level of sessions (i.e. session 0).
145+
146+
gm[5] is the "population" standard deviation difference of the threshold between the session levels (this was specified in the data).
Lines changed: 212 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,212 @@
1+
---
2+
title: "Example_analysis_bayesian"
3+
author: "Jesper Fischer Ehmsen"
4+
date: "`r Sys.Date()`"
5+
output: html_document
6+
---
7+
8+
9+
# **Here we show how to perform a Bayesian post-hoc analysis on the data, which is very similar to the "simple" analysis. Please see the "simple analysis before this!**
10+
11+
```{r message=FALSE}
12+
pacman::p_load(tidyverse,ggdist,psycho,caret,patchwork, gt, cowplot,
13+
grid,reticulate,cmdstanr,posterior,rstan,bayesplot,here,rmarkdown,pracma,
14+
brms)
15+
np <- import("numpy")
16+
17+
set.seed(111)
18+
```
19+
20+
21+
# **Reading in the data**
22+
23+
```{r message=FALSE}
24+
#Here we read the same file as in the python notebook:
25+
psychophysics_df = read_csv('https://github.com/embodied-computation-group/CardioceptionPaper/raw/main/data/Del2_merged.txt')
26+
df = psychophysics_df %>% filter(Subject == "sub_0042")
27+
28+
```
29+
30+
31+
```{r message=FALSE, results='hide',warning=FALSE}
32+
#loading the functions to do the analysis:
33+
source(here("docs","source","examples","R","src","firstlevelanalysis.R"))
34+
```
35+
36+
37+
The difference between this example and the simple analysis is that here we set Bayesian equal to T (TRUE) and specify a model.
38+
39+
The models can be found in the src directory inside the .stan files. These are probabilistic models written in stan.
40+
41+
42+
There are two options at the moment for re-fitting the data using this Bayesian model, there is the standard cumulative normal as well as a cumulative normal that incorporates a lapse rate, that specifies the minimum and maximum of the tails of the psychometric.
43+
44+
45+
A lapse rate of 5% (0.05) means that the psychometric function (the cumulative normal) on the lower end is 5% and on the upper end is 95%.
46+
47+
The reason to include a lapse rate is that if responses are made that are attentional slips or miss-clicks, then this is greatly influence the slope of the function if not accounted for.
48+
49+
In order to run a Bayesian probabilistic model, one needs to specify priors on the free parameters. Here there are 2 or 3 depending on the model.
50+
51+
The priors of the Bayesian model is as follows:
52+
53+
alpha ~ normal(0,20);
54+
beta ~ normal(0,3);
55+
lambda ~ normal(-4,2);
56+
57+
Note that all the parameters are specified in the unconstrained space, this means that the slope of the psychometric i.e. beta is going to be constrained to be strictly positive and is therefore exponentially transformed.
58+
The lapse is constrained between 0 and 0.5 meaning its inverse logit transformed and then divided by 2:
59+
60+
61+
We now visualize these priors
62+
63+
```{r}
64+
data.frame(alpha = rnorm(1000,0,20), beta = exp(rnorm(10000,0,3)), lambda = brms::inv_logit_scaled(rnorm(1000,-4,2)) / 2) %>%
65+
pivot_longer(everything(), values_to = "value",names_to = "parameter") %>%
66+
ggplot(aes(x = value, fill = parameter))+geom_histogram(col = "black")+facet_wrap(~parameter, scales = "free")+theme_classic()
67+
```
68+
69+
Below there is a visualization of what this extra lapse rate does, as well as what the priors of the model means when looking at the psychometric function itself.
70+
71+
```{r}
72+
n_sim = 25
73+
74+
alpha = rnorm(n_sim,0,20)
75+
beta = rnorm(n_sim,0,3)
76+
lambda = rnorm(n_sim,-4,2)
77+
78+
data.frame(alpha = alpha, beta = exp(beta), lambda = brms::inv_logit_scaled(lambda) / 2) %>%
79+
rowwise() %>%
80+
mutate(x = list(seq(-80,80,0.1)),
81+
y = list(psychometric(seq(-80,80,0.1), alpha, beta, lambda))
82+
) %>%
83+
ungroup %>%
84+
mutate(id = 1:n()) %>%
85+
unnest(cols = c(x, y)) %>% mutate(lapse = T) %>%
86+
ggplot(aes(x = x, y = y, group = id))+
87+
geom_line(alpha = 0.5)+theme_classic()+ggtitle("With Lapse rate")
88+
89+
90+
91+
data.frame(alpha = alpha, beta = exp(beta), lambda = NA) %>%
92+
rowwise() %>%
93+
mutate(x = list(seq(-80,80,0.1)),
94+
y = list(psychometric_nolapse(seq(-80,80,0.1), alpha, beta))
95+
) %>%
96+
ungroup %>%
97+
mutate(id = 1:n()) %>%
98+
unnest(cols = c(x, y)) %>% mutate(lapse = F) %>%
99+
ggplot(aes(x = x, y = y, group = id))+
100+
geom_line(alpha = 0.5)+theme_classic()+ggtitle("Without Lapse rate")
101+
102+
103+
```
104+
105+
The addition of the lapse rate is visually obvious when looking at the very steep psychometric functions, which do not asymtote at 0 and 1, but at the lambda value and 1-lambda. This is what is called prior predictive checks.
106+
107+
### **Changing the priors**
108+
109+
If you want to change the priors of the Bayesian model, this has to be done inside the Stan scripts.
110+
111+
Open the .stan file and then navigate to the last couple of lines of code where the syntax is the same as above i.e.
112+
113+
alpha ~ normal(0,20);
114+
beta ~ normal(0,3);
115+
lambda ~ normal(-4,2);
116+
117+
These lines of code are the priors of the model and it is therefore possible to first visualize what the prior distributions for the parameters entail (prior predictive checks) for the shape of the psychometric and then changing them inside the Stan scripts themselves.
118+
119+
120+
**Running the analysis**
121+
122+
123+
Using this bayesian fit invovles the same as for the simple analysis with two addition arguments.
124+
125+
Firstly, the flag Bayesian needs to be set to T (TRUE), and a model has to be specified.
126+
127+
There are at the moment two different models to choose from, one with the lapse rate and one without. These are called
128+
129+
"Standard Cummulative normal.stan"
130+
"Lapse Cummulative normal.stan"
131+
132+
Respectively
133+
134+
```{r message=FALSE, results='hide',warning=FALSE}
135+
# No lapse rate model:
136+
model = cmdstan_model(here("docs","source","examples","R","src","Standard Cummulative normal.stan"))
137+
# Lapse rate model:
138+
model = cmdstan_model(here("docs","source","examples","R","src","Lapse Cummulative normal.stan"))
139+
140+
results = single_sub_analysis(df,
141+
interoPost = NA,
142+
exteroPost = NA,
143+
bayesian = T,
144+
model = model,
145+
out = here::here("docs","source","examples","R"))
146+
```
147+
148+
The results list now also contains a new index called bayesian_plot.
149+
150+
This is a list of either 1 or 3 plots. There will be 1 if you only have one Modality and 2 if you have two i.e. both Extero and Intero.
151+
152+
Lets look at them individually:
153+
154+
```{r}
155+
results$bayesian_plot[[1]]
156+
```
157+
158+
159+
```{r}
160+
results$bayesian_plot[[2]]
161+
```
162+
# **Convergence and trust in the model**
163+
164+
NOTE:
165+
166+
When running a Bayesian model like this convergence is not a given. It is therefore important to check good model covergence!
167+
168+
In-order to check for good model convergence watch the upper plots in the above two plots:
169+
170+
Here we see that all the 4 chains (to the left) seem to capture the same posterior distribution . It is also clear from the trace-plots to the upper right that the chains mix well (hairy catterpillars), indicating good convergence.
171+
172+
Lastly, one needs to investigate whether there are divergences in the sampling process.
173+
174+
This information is stored in the stats file under divergences, if this column is not 0, then trusting the estimates even with good looking chains is not advised.
175+
176+
Dealing with divergences for single subjects fits, involves changing priors and or the model itself (i.e. leaving out or including the lapse rate).
177+
178+
Other approaches for dealing with these divergences exist, but is out of the scope of this tutorial [see](https://discourse.mc-stan.org/t/divergent-transitions-a-primer/17099)
179+
180+
181+
182+
## **Here is the number of mean in both conditions divergences:**
183+
```{r}
184+
results$stats$divergences
185+
```
186+
Indicating that there are divergences here so perhaps running without the Lapse rate would be preferable, or changing the priors.
187+
188+
189+
190+
191+
**Of cause this can be run through several subjects like the "simple" analysis**
192+
193+
```{r message=FALSE, fig.show='hide', results='hide', warning=FALSE}
194+
path_to_data = here("docs","source","examples","R","data")
195+
196+
out = here::here("docs","source","examples","R")
197+
198+
data = study_analysis(path = path_to_data,
199+
bayesian = T,
200+
model = model,
201+
folder = T,
202+
out = out)
203+
```
204+
205+
206+
207+
```{r}
208+
read.csv(here("docs","source","examples","R","resulting_dataframe.csv")) %>% select(-X)%>% head(4)
209+
```
210+
211+
### Here the Bayesian alpha is the threshold and the beta is the slope
212+

0 commit comments

Comments
 (0)