verona.evaluation.stattests package

verona.evaluation.stattests.correlated_t_test module

class verona.evaluation.stattests.correlated_t_test.BayesianTTestResult(posterior_probs, approximated, parameters, posterior, additional)

Bases: object

Represents the results of a Bayesian Correlated t-test.

posterior_probabilities

A dictionary containing the probabilities for the left, rope, and right regions of the posterior distribution.

Type:: dict

approximated

Indicates if the posterior distribution is approximated (True if approximated, e.g., by MCMC sampling, and False if exact).

Type:: bool

parameters

The parameters used for running the Bayesian t-test, specifically ‘rho’ and ‘rope’.

Type:: dict

posterior

A dictionary containing the density, cumulative, and quantile functions for the posterior distribution.

Type:: dict

additional

Additional details about the posterior distribution, such as degrees of freedom, mean, and standard deviation.

Type:: dict

class verona.evaluation.stattests.correlated_t_test.CorrelatedBayesianTTest(x, y, approaches: List[str])

Bases: object

Bayesian equivalent to the correlated t-test.

This class offers a Bayesian alternative to the traditional frequentist correlated t-test, often used for comparing the means of two paired samples to determine if they come from populations with equal means. It extends the paired Student’s t-test to a Bayesian framework, offering a richer set of inferences that can be drawn from the data.

In particular, this implementation follows the Bayesian correlated t-test as described by Benavoli et al., 2017, which provides not just point estimates, but also credible intervals and posterior probabilities that can more informatively capture the uncertainty around the true parameter values.

x

First sample.

Type:: array-like

y

Second sample. If not provided, x is assumed to be the difference.

Type:: array-like

approaches

Names of the two methods or approaches to be compared.

Type:: array-like

run(): Executes the Bayesian t-test.

Example

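>>> import random
>>> from verona.evaluation.stattests.correlated_t_test import CorrelatedBayesianTTest
>>> sample1 = [random.gauss(1, 1) for _ in range(25)]
>>> sample2 = [random.gauss(1.2, 1) for _ in range(25)]
>>> test = CorrelatedBayesianTTest(sample1, sample2, ["Method1", "Method2"])
>>> result = test.run(rho=0.1, rope=[-1, 1])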

References

[1] A. Benavoli, G. Corani, J. Demsar, M. Zaffalon (2017) Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis. Journal of Machine Learning Research, 18, 1-36.
[2] Borja Calvo and Guzmán Santafé (2016) scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems. The R Journal, 8(1), 248-256.

run(rho=0.2, rope=(-1, 1)) → BayesianTTestResult

Executes the Bayesian t-test.

Parameters:

rho (float, optional) –
Correlation factor between the paired samples. Default is 0.2.
- A rho of 0 implies that the paired samples are entirely independent, essentially converting the test into a standard Bayesian t-test.
- A rho of 1 implies that the paired samples are perfectly correlated, making the test trivial.
- Values between 0 and 1 adjust the test to account for the degree of correlation between the paired samples. For instance, in the context of machine learning, rho could be set to the proportion of the test set size to the total dataset size to account for data reuse across different folds in k-fold cross-validation.
rope (list or tuple, optional) – Interval for the difference considered as “irrelevant” or “equivalent”. Default is (-1, 1).

Returns:

An instance of the BayesianTTestResult class that contains the following:

posterior_probabilities: Probabilities for the left, rope, and right regions.
approximated: Whether the posterior distribution is approximated.
parameters: Parameters used in the Bayesian t-test.
posterior: Functions related to the posterior distribution.
additional: Additional details about the posterior distribution.

Return type:

BayesianTTestResult
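For orientation, the posterior that Benavoli et al. (2017) derive for the mean difference is a Student's t distribution with n - 1 degrees of freedom, centered on the sample mean, with squared scale (1/n + rho/(1 - rho)) times the sample variance. The sketch below shows how the left, rope, and right probabilities follow from that distribution using only the standard library. It is not the verona implementation: the helper names `student_t_cdf` and `correlated_t_probs` are hypothetical, the input is assumed to be the list of paired differences x - y, and the numerical integration is only meant for moderate arguments.

```python
import math
import statistics


def student_t_cdf(t, df, steps=2000):
    """Standard Student's t CDF via Simpson's rule (illustrative accuracy only)."""
    if t == 0:
        return 0.5
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

    # Integrate the density from 0 to |t|, then use symmetry around 0.
    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def correlated_t_probs(diffs, rho=0.2, rope=(-1, 1)):
    """Left/rope/right posterior probabilities for paired differences (assumed x - y)."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    var = statistics.variance(diffs)  # unbiased sample variance, must be > 0
    scale = math.sqrt((1 / n + rho / (1 - rho)) * var)
    df = n - 1
    left = student_t_cdf((rope[0] - mean) / scale, df)
    right = 1 - student_t_cdf((rope[1] - mean) / scale, df)
    return {"left": left, "rope": 1 - left - right, "right": right}


probs = correlated_t_probs([0.5, 1.5, 2.0, 1.0, 2.5, 1.8, 0.9, 1.2], rho=0.2)
print(probs)
```

Larger values of rho inflate the scale of the posterior, so the same data yield less extreme probabilities when the paired results are assumed to be more strongly correlated.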

Note

The default value for rho is 0.2, which corresponds to a test set containing 20% of the data, as in 5-fold cross-validation.
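The relation between rho and the evaluation setup can be written down directly. The two helpers below are hypothetical (they are not part of verona) and simply encode the rule described above, where rho is the fraction of the data used for testing.

```python
def rho_from_split(n_test, n_total):
    """Fraction of the data held out for testing, used as the correlation rho."""
    if not 0 < n_test < n_total:
        raise ValueError("n_test must be positive and smaller than n_total")
    return n_test / n_total


def rho_from_kfold(k):
    """In k-fold cross-validation each test fold holds 1/k of the data."""
    if k < 2:
        raise ValueError("k must be at least 2")
    return 1 / k


print(rho_from_split(20, 100))  # 0.2
print(rho_from_kfold(10))       # 0.1
```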

verona.evaluation.stattests.hierarchical module

class verona.evaluation.stattests.hierarchical.BayesianHierarchicalResults(approximated, global_wins, posterior_distribution, per_dataset, global_sign, raw_results)

Bases: object

class verona.evaluation.stattests.hierarchical.HierarchicalBayesianTest(x_result: DataFrame, y_result: DataFrame, approaches: List[str], datasets: List[str])

Bases: object

static pt_scaled(q, df, mean=0, sd=1)

This function emulates the function pt.scaled from the “metRology” R package. That function computes the cumulative distribution function of a Student's t distribution with a given number of degrees of freedom (df), shifted by mean and scaled by sd.

Parameters:

q – quantile
df – degrees of freedom
mean – location (shift) of the distribution. Default is 0.
sd – scale of the distribution. Default is 1.

Returns:

Cumulative distribution function shifted and scaled.
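The shift-and-scale step amounts to standardizing the quantile as (q - mean) / sd and evaluating the standard Student's t CDF at that point. The following is a minimal standard-library sketch of that idea, not the verona source; `student_t_cdf` is a hypothetical helper that integrates the density numerically and is only accurate for moderate arguments. Where SciPy is available, `scipy.stats.t.cdf(q, df, loc=mean, scale=sd)` computes the same quantity.

```python
import math


def student_t_cdf(t, df, steps=2000):
    """Standard Student's t CDF via Simpson's rule (illustrative accuracy only)."""
    if t == 0:
        return 0.5
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

    # Integrate the density from 0 to |t|, then use symmetry around 0.
    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def pt_scaled(q, df, mean=0.0, sd=1.0):
    """CDF of a Student's t distribution shifted by `mean` and scaled by `sd`."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return student_t_cdf((q - mean) / sd, df)


# The shifted distribution is symmetric around `mean`.
print(pt_scaled(1.0, df=7, mean=1.0, sd=3.0))  # 0.5
```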

run(rope=[-1, 1], rho=0.2, n_chains=4, num_samples=300000, std_upper=1000, alpha_lower=0.5, alpha_upper=5, beta_lower=0.05, beta_upper=0.15, d0_lower=None, d0_upper=None) → BayesianHierarchicalResults

Executes a Bayesian hierarchical model tailored for comparing the performance of two machine learning algorithms across multiple datasets based on cross-validation results.

This model employs a two-level hierarchical structure to account for both dataset-specific variations and global trends in the performance differences between the two algorithms. Specifically, it treats the performance metrics from each dataset as arising from a dataset-specific distribution, which in turn is governed by global hyperparameters.

The model can work directly on a variety of metrics obtained through cross-validation, thereby providing a comprehensive statistical insight into the comparative evaluation. This is a significant advantage over frequentist methods which may not fully capture the uncertainty in the metrics.

This implementation is ported from the Bayesian hierarchical model implemented in [2], which provides posterior probabilities for a richer interpretation of the comparison.

Parameters:

rope (List, optional) – Region of Practical Equivalence, defines the interval within which performance differences are considered “irrelevant” or “insignificant”. Default is [-1, 1].
rho (float, optional) – Correlation factor among the cross-validation results obtained on the same dataset, as in the correlated t-test. Default is 0.2.
n_chains (int, optional) – The number of Markov Chains to be used in the simulation. Half of the simulations are used for warm-up. Default is 4.
num_samples (int, optional) – The total number of samples (per chain) used for estimating the posterior distribution. Default is 300000.
std_upper (int, optional) – A scaling factor that sets the upper bounds for the hyperparameters sigma_i and sigma_0, which represent dataset-specific and global variability, respectively. Default is 1000.
alpha_lower (float, optional) – Lower bound for the uniform prior of the alpha hyperparameter, which models the global variance. Default is 0.5.
alpha_upper (float, optional) – Upper bound for the uniform prior of the alpha hyperparameter. Default is 5.
beta_lower (float, optional) – Lower bound for the uniform prior of the beta hyperparameter, which models dataset-specific variances. Default is 0.05.
beta_upper (float, optional) – Upper bound for the uniform prior of the beta hyperparameter. Default is 0.15.
d0_lower (float, optional) – Lower bound for the prior distribution of mu_0, the grand mean of performance differences. If not provided, the smallest observed difference is used as the lower bound.
d0_upper (float, optional) – Upper bound for the prior distribution of mu_0. If not provided, the largest observed difference is used as the upper bound.
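To illustrate the role of the rope parameter, the sketch below splits a set of sampled differences into the left, rope, and right regions and reports the fraction falling in each. It is a generic illustration rather than verona code: the function name `rope_probabilities` and the sample values are hypothetical, and treating the interval bounds as part of the rope is an assumption.

```python
def rope_probabilities(samples, rope=(-1, 1)):
    """Fraction of sampled differences below, inside, and above the rope."""
    lower, upper = rope
    n = len(samples)
    left = sum(s < lower for s in samples) / n
    right = sum(s > upper for s in samples) / n
    # Values equal to a bound count as inside the rope (an assumption).
    return {"left": left, "rope": 1 - left - right, "right": right}


# Hypothetical posterior samples of the performance difference.
samples = [-2.0, -0.5, 0.0, 0.5, 3.0, 4.0]
print(rope_probabilities(samples))
```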

Note

The results include the usual information about the three areas of the posterior density (left, rope, and right probabilities), both globally and per dataset. The raw simulation results are also included.

The prior parameters are set to the default values indicated in [1, 2], except for the bounds of the prior distribution of mu_0, which are set to the minimum and maximum values observed in the sample. You should not modify them unless you know what you are doing.

[1] A. Benavoli, G. Corani, J. Demsar, M. Zaffalon (2017) Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis. Journal of Machine Learning Research, 18, 1-36.
[2] Borja Calvo and Guzmán Santafé (2016) scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems. The R Journal, 8(1), 248-256. [DOI: 10.32614/RJ-2016-017](https://doi.org/10.32614/RJ-2016-017)

Returns:

BayesianHierarchicalResults

An object containing the following attributes:

approximated: Boolean value that indicates whether the posterior distribution is approximated (True in this case).
global_wins: DataFrame containing the global winning probabilities for each condition (left, right, rope).
posterior_distribution: DataFrame with the posterior distribution probabilities (left, rope, right).
per_dataset: DataFrame with per-dataset statistics.
global_sign: DataFrame containing the global sign probabilities (positive, negative).
raw_results: DataFrame containing the raw results from the sampling.

Examples

>>> import pandas as pd
>>> from verona.evaluation.stattests.hierarchical import HierarchicalBayesianTest
>>> x_data = pd.DataFrame([[75.3, 78.3, 60.4], [68.5, 77.5, 76.9], [77.9, 74.5, 80.9], [90, 90, 90]])
>>> y_data = pd.DataFrame([[74.3, 75.3, 61.4], [65.5, 70.5, 80.9], [79.9, 76.2, 81.9], [90, 90, 90]])
>>> results = HierarchicalBayesianTest(x_data, y_data, approaches=["approach 1", "approach 2"],
...                                    datasets=["d1", "d2", "d3", "d4"]).run([-1, 1])
>>> print("Global wins: ", results.global_wins)
Global wins:     left (approach 1 < approach 2)  rope (approach 1 = approach 2)  right (approach 1 > approach 2)
                        0.809272                             0.0                         0.190728
>>> print("Per dataset: ", results.per_dataset.iloc[:, 1:4])
Per dataset:      left (approach 1 < approach 2)  rope (approach 1 = approach 2)  right (approach 1 > approach 2)
d1                        0.207137                        0.503125                         0.289738
d2                        0.279955                        0.419848                         0.300197
d3                        0.619778                        0.337055                         0.043167
d4                        0.039968                        0.937468                         0.022563

verona.evaluation.stattests.plackettluce module

class verona.evaluation.stattests.plackettluce.PlackettLuceRanking(result_matrix: DataFrame, approaches: List[str])

Bases: object

run(n_chains=8, num_samples=300000, mode='max') → PlackettLuceResults

Execute the Plackett-Luce ranking model to estimate the rank and probabilities of each algorithm based on their performance metrics.

The method employs Markov Chain Monte Carlo (MCMC) sampling, leveraging the Stan backend, to estimate the posterior distribution of the rank and probabilities.

Parameters:

n_chains (int, optional) – Number of chains used to perform the sampling. Default is 8.
num_samples (int, optional) – Number of samples to draw in the MCMC. Default is 300000.
mode (Literal['max', 'min'], optional) – If 'max' the higher the value the better the algorithm. If 'min' the lower the value the better the algorithm. Default is 'max'.
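The effect of mode can be illustrated by turning one row of results into ranks, where rank 1 is the best algorithm. This is a hypothetical helper, not the verona implementation. It assumes that each row of the result matrix holds the scores of all algorithms on one problem and that ties receive average ranks; neither assumption is confirmed by the documentation above.

```python
def rank_row(values, mode="max"):
    """Rank one row of results; rank 1 is best, ties get average ranks."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    # Sort indices from best to worst according to the mode.
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=(mode == "max"))
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for i in order[pos:end + 1]:
            ranks[i] = average
        pos = end + 1
    return ranks


print(rank_row([0.75, 0.6, 0.8], mode="max"))  # [2.0, 3.0, 1.0]
print(rank_row([0.75, 0.6, 0.8], mode="min"))  # [2.0, 1.0, 3.0]
```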

Returns:

PlackettLuceResults instance containing:

expected_prob: Expected probability of each algorithm having the best ranking
expected_rank: Expected rank of each algorithm
posterior: Posterior distributions of the rank and probabilities of each algorithm

Return type:

PlackettLuceResults

Examples

>>> import pandas as pd
>>> from verona.evaluation.stattests.plackettluce import PlackettLuceRanking
>>> result_matrix = pd.DataFrame([[0.75, 0.6, 0.8], [0.8, 0.7, 0.9], [0.9, 0.8, 0.7]])
>>> plackett_ranking = PlackettLuceRanking(result_matrix, ["a1", "a2", "a3"])
>>> results = plackett_ranking.run(n_chains=10, num_samples=300000, mode="max")
>>> print("Expected prob: ", results.expected_prob)
Expected prob:  a1    0.432793
                a2    0.179620
                a3    0.387587
>>> print("Expected rank: ", results.expected_rank)
Expected rank:  a1    1.580505
                a2    2.667531
                a3    1.751964

class verona.evaluation.stattests.plackettluce.PlackettLuceResults(expected_prob: Series, expected_rank: Series, posterior: DataFrame)

Bases: object

Encapsulates the results from running the Plackett-Luce ranking model.

This class serves as a container for the results obtained after fitting the Plackett-Luce model using MCMC sampling. It provides structured access to important quantities such as the expected probabilities, expected ranks, and the posterior distributions of these metrics.

expected_prob

A pandas Series object representing the expected probabilities for each algorithm. It quantifies the estimated likelihood that each algorithm is the best among the ones compared.

Type:: pd.Series

expected_rank

A pandas Series object representing the expected ranks for each algorithm. The rank is a numerical ordering where lower values indicate better performance.

Type:: pd.Series

posterior

A pandas DataFrame that holds the posterior distributions for the rank and probabilities of each algorithm. These distributions capture the uncertainties in the point estimates and are essential for Bayesian inference.

Type:: pd.DataFrame
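As a rough illustration of how such summaries can be derived, the sketch below assumes that posterior samples of the normalized Plackett-Luce weights are available as rows of numbers, one column per algorithm. Under the Plackett-Luce model the probability that an algorithm is ranked first equals its normalized weight, so averaging the weights gives an expected probability of being best, and averaging the per-sample ranks gives an expected rank. The function name and the sample values are hypothetical, and the actual layout of the posterior attribute may differ.

```python
def summarize_weight_samples(samples, names):
    """Expected win probability and expected rank from sampled weights."""
    n = len(samples)
    k = len(names)
    mean_weight = [sum(row[j] for row in samples) / n for j in range(k)]
    rank_sums = [0.0] * k
    for row in samples:
        # Rank 1 goes to the largest weight in this sample.
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for position, j in enumerate(order, start=1):
            rank_sums[j] += position
    expected_rank = [total / n for total in rank_sums]
    return dict(zip(names, mean_weight)), dict(zip(names, expected_rank))


# Two hypothetical posterior samples of normalized weights.
samples = [[0.5, 0.2, 0.3], [0.4, 0.1, 0.5]]
prob, rank = summarize_weight_samples(samples, ["a1", "a2", "a3"])
print(prob)
print(rank)
```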

verona.evaluation.stattests.signed_rank module

class verona.evaluation.stattests.signed_rank.BayesianSignedRankTest(x, y, approaches: List[str])

Bases: object

Bayesian equivalent to Wilcoxon’s signed-rank test.

This class implements the Bayesian version of the signed-rank test as presented in Benavoli et al., 2017. This Bayesian test aims to evaluate the difference between two related samples (or one sample against a zero null hypothesis) and provides probabilities for three regions: left, rope, and right.

x

First sample.

Type:: array-like

y

Second sample. If not provided, x is assumed to be the difference.

Type:: array-like, optional

approaches

Names of the two methods or approaches to be compared.

Type:: array-like

run(): Executes the Bayesian test.

References

[1] A. Benavoli, G. Corani, J. Demsar, M. Zaffalon (2017) Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis. Journal of Machine Learning Research, 18, 1-36.
[2] Borja Calvo and Guzmán Santafé (2016) scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems. The R Journal, 8(1), 248-256.

run(s=0.5, z0=0, rope=(-1, 1), nsim=100000, seed=None) → BayesianSignedRankTestResult

Parameters:

s (float, optional) – Scale parameter of the prior Dirichlet Process. Defaults to 0.5.
z0 (float, optional) – Position of the pseudo-observation associated to the prior Dirichlet Process. Defaults to 0.
rope (tuple, optional) – Interval for the difference considered as “irrelevant”. Defaults to (-1, 1).
nsim (int, optional) – Number of samples used to estimate the posterior distribution. Defaults to 100000.
seed (int, optional) – Optional parameter used to fix the random seed.
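The parameters above map onto the sampling procedure commonly described for this test: weights over the pseudo-observation z0 and the observed differences are drawn from a Dirichlet distribution with parameters (s, 1, ..., 1), each pair of observations is assigned to a region according to its average, and the posterior probability of a region is the fraction of draws in which that region carries the largest weight. The following standard-library sketch follows that description. It is not the verona source, the function name `signed_rank_probs` is hypothetical, and a small nsim is used to keep it fast.

```python
import random


def signed_rank_probs(diffs, s=0.5, z0=0.0, rope=(-1, 1), nsim=2000, seed=None):
    """Monte Carlo sketch of the Bayesian signed-rank test on differences."""
    rng = random.Random(seed)
    z = [z0] + list(diffs)             # pseudo-observation goes first
    alphas = [s] + [1.0] * len(diffs)  # Dirichlet parameters
    n = len(z)
    # Classify every pair (i, j) once by the pairwise average (z_i + z_j) / 2.
    below = [[z[i] + z[j] < 2 * rope[0] for j in range(n)] for i in range(n)]
    above = [[z[i] + z[j] > 2 * rope[1] for j in range(n)] for i in range(n)]
    wins = {"left": 0, "rope": 0, "right": 0}
    for _ in range(nsim):
        # A Dirichlet draw is a vector of normalized Gamma variates.
        gammas = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(gammas)
        w = [g / total for g in gammas]
        left = sum(w[i] * w[j] for i in range(n) for j in range(n) if below[i][j])
        right = sum(w[i] * w[j] for i in range(n) for j in range(n) if above[i][j])
        rope_mass = 1.0 - left - right
        # Exact ties are broken by label order, which is an arbitrary choice.
        best = max((left, "left"), (rope_mass, "rope"), (right, "right"))[1]
        wins[best] += 1
    return {region: count / nsim for region, count in wins.items()}


diffs = [1.8, 2.4, 0.3, 3.1, 1.2, 2.7, 0.9, 2.0]
print(signed_rank_probs(diffs, seed=42))
```

Fixing the seed makes the Monte Carlo estimate reproducible, which is the purpose of the seed parameter described above.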

Returns:

An instance of the BayesianSignedRankTestResult class that contains the following:

method: A string with the name of the method used.
posterior_probabilities: A dictionary with the left, rope and right probabilities.
approximated: A boolean, True if the posterior distribution is approximated (sampled) and False if it is exact.
parameters: A dictionary of parameters used by the method.
posterior: The sampled probabilities for the left, rope, and right areas.

Return type:

BayesianSignedRankTestResult

class verona.evaluation.stattests.signed_rank.BayesianSignedRankTestResult(posterior_probs, approximated, posterior)

Bases: object

Represents the results of a Bayesian Signed Rank Test.

method

The name of the statistical method used.

Type:: str

posterior_probabilities

Probabilities for the left, rope, and right regions.

Type:: dict

approximated

Whether the posterior distribution is approximated.

Type:: bool

parameters

Parameters used in the Bayesian Signed Rank Test.

Type:: dict

posterior

Sampled probabilities for left, rope, and right areas.

Type:: pd.DataFrame