Informing the Design of Experiments
The design submodule provides utilities for running comparative experiments. For example, to compare two RBITs, instantiate two population models with different parameters and call run_comparative_experiment:
data_A, data_B = run_comparative_experiment(
    population_model_one,
    population_model_two,
    schedule,
    test_blocks=test_blocks,
    replications=1,
)
You can also compute block recall percentages directly, as the following worked example shows:
# example with two equivalent population groups and RBITs
import numpy

population_model_one = GaussianPopulation(
    ExponentialForgetting,
    mu=[10 ** (-2.5), 0.75],
    sigma=1e-7 * numpy.array([[0.01, 0], [0, 1]]),
    population_size=24 * 16,
    n_items=1,
    seed=None,
)
population_model_two = GaussianPopulation(
    ExponentialForgetting,
    mu=[10 ** (-2.5), 0.75],
    sigma=1e-7 * numpy.array([[0.01, 0], [0, 1]]),
    population_size=24 * 16,
    n_items=1,
    seed=None,
)
# blocks:     L1    R1    L2     R2    L3     R3      R4    L4     R5
# gaps (s):      200   200   1000   200   2000   86400   200   2000
schedule = BlockBasedSchedule(
    1,
    15,
    [200, 200, 1000, 200, 2000, 86400, 200, 2000],
    repet_trials=1,
    seed=123,
    sigma_t=1,
)
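As a side note, the interblock intervals passed to the schedule can be summed cumulatively to make the session structure explicit; this small sketch is independent of the library and only reuses the values above:

```python
import numpy

# Interblock intervals (seconds) from the schedule above: 8 gaps between 9 blocks
interblock = [200, 200, 1000, 200, 2000, 86400, 200, 2000]

# Onset of each block relative to the first one
onsets = numpy.concatenate(([0], numpy.cumsum(interblock)))
# The 86400 s gap is a 24-hour break, so the last three blocks take place one day later
```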
# get the mean recall rates as two arrays, and optionally as a long dataframe
df, mA, mB = diff_eval_block_percentages(
    population_model_one,
    population_model_two,
    schedule,
    test_blocks=[1, 3, 5, 6, 8],
    output_df=True,
)
# plot the evaluation
import seaborn
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=1, ncols=1)
seaborn.barplot(
    data=df,
    x="block",
    y="block recall %",
    hue="Condition",
    errorbar="se",
    ax=ax,
)
plt.show()
Power Analysis
These evaluations can be used to perform an empirical power analysis. The following code shows how to compute Type I and Type II error rates for various experimental conditions and ways of combining p values; you can also define your own method of computing p values. We suggest considering the Type II error of the combination method whose Type I error is closest to the nominal level (5%); in the following example, that is the Bonferroni-like method.
# Empirical statistical power analysis
REPETITIONS = 100
# Case 1: equal RBITs; at significance level alpha=0.05, the test should reject in about 5% of replications
population_model_one = [
    GaussianPopulation(
        ExponentialForgetting,
        mu=[10 ** (-2), 0.75],
        sigma=1e-7 * numpy.array([[0.01, 0], [0, 1]]),
        population_size=24,
        n_items=1,
        seed=None,
    )
    for i in range(REPETITIONS)
]
population_model_two = [
    GaussianPopulation(
        ExponentialForgetting,
        mu=[10 ** (-2), 0.75],
        sigma=1e-7 * numpy.array([[0.01, 0], [0, 1]]),
        population_size=24,
        n_items=1,
        seed=None,
    )
    for i in range(REPETITIONS)
]
# Define your own methods for computing p values from mean recalls here
import scipy.stats

# average of the p values
def mean_pvalue(mr_A, mr_B, **kwargs):
    pvalues = scipy.stats.ttest_rel(mr_A, mr_B, axis=1).pvalue[1:]
    return numpy.mean(pvalues)

# Bonferroni-like correction
def bonf(mr_A, mr_B, **kwargs):
    pvalues = scipy.stats.ttest_rel(mr_A, mr_B, axis=1).pvalue[1:]
    return numpy.min(pvalues) * len(pvalues)
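The combiners above can be sanity-checked outside the library on a synthetic vector of p values. A minimal sketch (the variable names are illustrative), which also shows scipy's built-in Fisher and Stouffer combinations that the library invokes when you pass those names:

```python
import numpy
import scipy.stats

# Three synthetic per-block p values
pvalues = numpy.array([0.04, 0.20, 0.50])

# Bonferroni-like combination: smallest p value scaled by the number of tests
bonf_p = numpy.min(pvalues) * len(pvalues)

# Mean of the p values
mean_p = numpy.mean(pvalues)

# Fisher's and Stouffer's methods, as implemented by scipy
_, fisher_p = scipy.stats.combine_pvalues(pvalues, method="fisher")
_, stouffer_p = scipy.stats.combine_pvalues(pvalues, method="stouffer")
```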
# The Fisher and Stouffer methods can be selected by name, which triggers their scipy implementations
pos_one, neg_one, p_container = get_p_values_frequency(
    population_model_one,
    population_model_two,
    schedule,
    combine_pvalues=["stouffer", "fisher", bonf, mean_pvalue],
    test_blocks=None,
    significance_level=0.05,
)
# Type I errors: the test rejects the null hypothesis of no difference between RBITs even though the two populations are identical
type1error = pos_one
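To build intuition for what this rate measures, here is a self-contained simulation, independent of the library and using made-up recall distributions, that estimates the Type I error of a paired t-test on identical populations; it should land near the nominal 5%:

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(42)
alpha, repetitions, n = 0.05, 2000, 24

false_positives = 0
for _ in range(repetitions):
    # Identical recall distributions: any rejection is a false positive
    recall_A = rng.normal(0.75, 0.1, size=n)
    recall_B = rng.normal(0.75, 0.1, size=n)
    if scipy.stats.ttest_rel(recall_A, recall_B).pvalue < alpha:
        false_positives += 1

type1_rate = false_positives / repetitions  # empirical Type I error, close to 0.05
```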
# Case 2: unequal RBITs (alpha = 10**-2.3 versus alpha = 10**-2). The rate at which the test finds significant differences estimates its statistical power
population_model_one = [
    GaussianPopulation(
        ExponentialForgetting,
        mu=[10 ** (-2.3), 0.75],
        sigma=1e-5 * numpy.array([[0.01, 0], [0, 1]]),
        population_size=24,
        n_items=1,
        seed=None,
    )
    for i in range(REPETITIONS)
]
population_model_two = [
    GaussianPopulation(
        ExponentialForgetting,
        mu=[10 ** (-2), 0.75],
        sigma=1e-5 * numpy.array([[0.01, 0], [0, 1]]),
        population_size=24,
        n_items=1,
        seed=None,
    )
    for i in range(REPETITIONS)
]
pos_two, neg_two, p_container = get_p_values_frequency(
    population_model_one,
    population_model_two,
    schedule,
    combine_pvalues=["stouffer", "fisher", bonf, mean_pvalue],
    test_blocks=None,
    significance_level=0.05,
)
# Type II errors: the test fails to reject the null hypothesis of no difference between RBITs even though the populations actually differ
type2error = neg_two
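For intuition about the Type II error, here is a standalone sketch (again independent of the library, with a made-up effect size) estimating the power of the same paired t-test when the two conditions truly differ; the Type II error is one minus the power:

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(7)
alpha, repetitions, n = 0.05, 2000, 24

rejections = 0
for _ in range(repetitions):
    # True difference of 0.08 in mean recall between the two conditions
    recall_A = rng.normal(0.70, 0.1, size=n)
    recall_B = rng.normal(0.78, 0.1, size=n)
    if scipy.stats.ttest_rel(recall_A, recall_B).pvalue < alpha:
        rejections += 1

power = rejections / repetitions
type2_rate = 1 - power  # empirical Type II error
```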