verona.evaluation.stattests package
verona.evaluation.stattests.hierarchical module
- class verona.evaluation.stattests.hierarchical.BayesianHierarchicalResults(approximated, global_wins, posterior_distribution, per_dataset, global_sign, raw_results)
Bases:
object
- class verona.evaluation.stattests.hierarchical.HierarchicalBayesianTest(x_result: DataFrame, y_result: DataFrame, approaches: List[str], datasets: List[str])
Bases:
object
- static pt_scaled(q, df, mean=0, sd=1)
This function emulates the pt.scaled function from the “metRology” R package, which computes the cumulative distribution function of a Student's t distribution with a given number of degrees of freedom (df). The distribution is shifted by mean and scaled by sd, so that, as in metRology, the result is the standard t CDF evaluated at (q - mean) / sd.
- Parameters:
q – quantile
df – degrees of freedom
mean – location (shift) of the distribution. Default is 0.
sd – scale of the distribution. Default is 1.
- Returns:
Cumulative distribution function shifted and scaled.
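As a rough illustration of the shift-and-scale relationship described above, the following sketch reproduces the computation with SciPy. It assumes the same semantics as metRology's pt.scaled, that is, a standard t CDF evaluated at (q - mean) / sd. The name pt_scaled_sketch is hypothetical, and this is not necessarily how the library implements pt_scaled.

```python
from scipy import stats


def pt_scaled_sketch(q, df, mean=0.0, sd=1.0):
    """CDF of a shifted and scaled Student's t distribution (illustrative only).

    Assumption: same semantics as metRology's pt.scaled, i.e. the standard
    t CDF evaluated at the standardised quantile (q - mean) / sd.
    """
    return stats.t.cdf((q - mean) / sd, df)


# Probability mass that such a distribution places inside a rope of [-1, 1].
# The values of df, mean and sd below are arbitrary illustration values.
rope_mass = (
    pt_scaled_sketch(1.0, df=9, mean=0.3, sd=0.8)
    - pt_scaled_sketch(-1.0, df=9, mean=0.3, sd=0.8)
)
```

Taking the difference of the CDF at the two rope bounds, as in the last lines, is one way to obtain the mass inside a Region of Practical Equivalence from such a distribution.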
- run(rope=[-1, 1], rho=0.2, n_chains=4, num_samples=300000, std_upper=1000, alpha_lower=0.5, alpha_upper=5, beta_lower=0.05, beta_upper=0.15, d0_lower=None, d0_upper=None) → BayesianHierarchicalResults
Executes a Bayesian hierarchical model tailored for comparing the performance of two machine learning algorithms across multiple datasets based on cross-validation results.
This model employs a two-level hierarchical structure to account for both dataset-specific variations and global trends in the performance differences between the two algorithms. Specifically, it treats the performance metrics from each dataset as arising from a dataset-specific distribution, which in turn is governed by global hyperparameters.
The model can work directly on a variety of metrics obtained through cross-validation, thereby providing a comprehensive statistical insight into the comparative evaluation. This is a significant advantage over frequentist methods which may not fully capture the uncertainty in the metrics.
This implementation is ported from the Bayesian hierarchical model implemented in [2], which provides posterior probabilities for a richer interpretation of the comparison.
- Parameters:
rope (List, optional) – Region of Practical Equivalence, defines the interval within which performance differences are considered “irrelevant” or “insignificant”. Default is [-1, 1].
rho (float, optional) – A hyperparameter representing the correlation factor across datasets. A higher value indicates stronger correlation between datasets in terms of algorithm performance. Default is 0.2.
n_chains (int, optional) – The number of Markov Chains to be used in the simulation. Half of the simulations are used for warm-up. Default is 4.
num_samples (int, optional) – The total number of samples (per chain) used for estimating the posterior distribution. Default is 300000.
std_upper (int, optional) – A scaling factor that sets the upper bounds for the hyperparameters sigma_i and sigma_0, which represent dataset-specific and global variability, respectively. Default is 1000.
alpha_lower (float, optional) – Lower bound for the uniform prior of the alpha hyperparameter, which models the global variance. Default is 0.5.
alpha_upper (float, optional) – Upper bound for the uniform prior of the alpha hyperparameter. Default is 5.
beta_lower (float, optional) – Lower bound for the uniform prior of the beta hyperparameter, which models dataset-specific variances. Default is 0.05.
beta_upper (float, optional) – Upper bound for the uniform prior of the beta hyperparameter. Default is 0.15.
d0_lower (float, optional) – Lower bound for the prior distribution of mu_0, the grand mean of performance differences. If not provided, the smallest observed difference is used as the lower bound.
d0_upper (float, optional) – Upper bound for the prior distribution of mu_0. If not provided, the largest observed difference is used as the upper bound.
Note
The results include the usual information on the three areas of the posterior density (left, right and rope probabilities), both globally and per dataset (in the additional information). The simulation results are also included.
The prior parameters are set to the default values indicated in [1, 2], except for the bounds of the prior distribution of mu_0, which are set to the minimum and maximum values observed in the sample. You should not modify them unless you know what you are doing.
[1] A. Benavoli, G. Corani, J. Demsar, M. Zaffalon (2017) Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis. Journal of Machine Learning Research, 18, 1-36.
[2] Borja Calvo and Guzmán Santafé (2016) scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems. The R Journal, 8(1), 248-256. DOI: 10.32614/RJ-2016-017 (https://doi.org/10.32614/RJ-2016-017)
- Returns:
- BayesianHierarchicalResults
- An object containing the following attributes:
approximated : Boolean value that indicates whether the posterior distribution is approximated (True in this case).
global_wins: DataFrame containing the global winning probabilities for each condition (left, right, rope).
posterior_distribution: DataFrame with the posterior distribution probabilities (left, rope, right).
per_dataset: DataFrame with per-dataset statistics.
global_sign: DataFrame containing the global sign probabilities (positive, negative).
raw_results: DataFrame containing the raw results from the sampling.
Examples
>>> import pandas as pd
>>> from verona.evaluation.stattests.hierarchical import HierarchicalBayesianTest
>>> x_data = pd.DataFrame([[75.3, 78.3, 60.4], [68.5, 77.5, 76.9], [77.9, 74.5, 80.9], [90, 90, 90]])
>>> y_data = pd.DataFrame([[74.3, 75.3, 61.4], [65.5, 70.5, 80.9], [79.9, 76.2, 81.9], [90, 90, 90]])
>>> results = HierarchicalBayesianTest(x_data, y_data, approaches=["approach 1", "approach 2"], datasets=["d1", "d2", "d3", "d4"]).run([-1, 1])
>>> print("Global wins: ", results.global_wins)
Global wins:
left (approach 1 < approach 2)  rope (approach 1 = approach 2)  right (approach 1 > approach 2)
                      0.809272                             0.0                         0.190728
>>> print("Per dataset: ", results.per_dataset.iloc[:, 1:4])
Per dataset:
    left (approach 1 < approach 2)  rope (approach 1 = approach 2)  right (approach 1 > approach 2)
d1                        0.207137                        0.503125                         0.289738
d2                        0.279955                        0.419848                         0.300197
d3                        0.619778                        0.337055                         0.043167
d4                        0.039968                        0.937468                         0.022563
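To make the left, rope and right probabilities more concrete, the sketch below shows one way such probabilities can be derived from samples of a performance difference. It is a simplified illustration, not the library's code: the function name region_probabilities and the toy samples are hypothetical, and the hierarchical model itself obtains its samples through MCMC.

```python
import numpy as np


def region_probabilities(samples, rope=(-1.0, 1.0)):
    """Share of samples falling left of, inside, and right of the rope.

    Hypothetical helper for illustration; `samples` stands for draws of a
    performance difference between two approaches.
    """
    samples = np.asarray(samples, dtype=float)
    left = float(np.mean(samples < rope[0]))    # difference below the rope
    right = float(np.mean(samples > rope[1]))   # difference above the rope
    return {"left": left, "rope": 1.0 - left - right, "right": right}


# Toy draws: one below the rope, two inside it, one above it.
probs = region_probabilities([-2.0, 0.0, 0.5, 3.0], rope=(-1.0, 1.0))
```

Differences that land exactly on a rope bound are counted as inside the rope in this sketch; that boundary convention is an assumption.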
verona.evaluation.stattests.plackettluce module
- class verona.evaluation.stattests.plackettluce.PlackettLuceRanking(result_matrix: DataFrame, approaches: List[str])
Bases:
object
- run(n_chains=8, num_samples=300000, mode='max') → PlackettLuceResults
Execute the Plackett-Luce ranking model to estimate the rank and probabilities of each algorithm based on their performance metrics.
The method employs Markov Chain Monte Carlo (MCMC) sampling, leveraging the STAN backend, to estimate the posterior distribution of the rank and probabilities.
- Parameters:
n_chains (int, optional) – Number of chains used to perform the sampling. Default is 8.
num_samples (int, optional) – Number of samples to consider in the MCMC. Default is 300000.
mode (Literal['max', 'min'], optional) – If 'max', the higher the value, the better the algorithm. If 'min', the lower the value, the better the algorithm. Default is 'max'.
- Returns:
PlackettLuceResults instance containing:
expected_prob: Expected probability of each algorithm having the best ranking
expected_rank: Expected rank of each algorithm
posterior: Posterior distributions of the ranks and probabilities
- Return type:
PlackettLuceResults
Examples
>>> import pandas as pd
>>> from verona.evaluation.stattests.plackettluce import PlackettLuceRanking
>>> result_matrix = pd.DataFrame([[0.75, 0.6, 0.8], [0.8, 0.7, 0.9], [0.9, 0.8, 0.7]])
>>> plackett_ranking = PlackettLuceRanking(result_matrix, ["a1", "a2", "a3"])
>>> results = plackett_ranking.run(n_chains=10, num_samples=300000, mode="max")
>>> print("Expected prob: ", results.expected_prob)
Expected prob:
a1    0.432793
a2    0.179620
a3    0.387587
>>> print("Expected rank: ", results.expected_rank)
Expected rank:
a1    1.580505
a2    2.667531
a3    1.751964
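The effect of the mode parameter can be pictured as a per-row ranking of the result matrix before the model is fitted. The sketch below is only an illustration with pandas: the helper to_rank_matrix is hypothetical, and whether the library ranks rows in exactly this way (including how it breaks ties) is an assumption.

```python
import pandas as pd


def to_rank_matrix(result_matrix, approaches, mode="max"):
    """Rank the approaches within each row (1 = best). Illustrative only."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    scores = pd.DataFrame(result_matrix.to_numpy(), columns=approaches)
    # mode='max': the highest value gets rank 1, so rank in descending order.
    # mode='min': the lowest value gets rank 1, so rank in ascending order.
    return scores.rank(axis=1, ascending=(mode == "min"), method="average")


result_matrix = pd.DataFrame([[0.75, 0.6, 0.8], [0.8, 0.7, 0.9], [0.9, 0.8, 0.7]])
ranks_max = to_rank_matrix(result_matrix, ["a1", "a2", "a3"], mode="max")
ranks_min = to_rank_matrix(result_matrix, ["a1", "a2", "a3"], mode="min")
```

With mode="max", the first row of the example matrix ranks a3 first, a1 second and a2 third; with mode="min" the order is reversed.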
- class verona.evaluation.stattests.plackettluce.PlackettLuceResults(expected_prob: Series, expected_rank: Series, posterior: DataFrame)
Bases:
object
Encapsulates the results from running the Plackett-Luce ranking model.
This class serves as a container for the results obtained after fitting the Plackett-Luce model using MCMC sampling. It provides structured access to important quantities such as the expected probabilities, expected ranks, and the posterior distributions of these metrics.
- expected_prob
A pandas Series object representing the expected probabilities for each algorithm. It quantifies the estimated likelihood that each algorithm is the best among the ones compared.
- Type:
pd.Series
- expected_rank
A pandas Series object representing the expected ranks for each algorithm. The rank is a numerical ordering where lower values indicate better performance.
- Type:
pd.Series
- posterior
A container (e.g., dictionary) that holds the posterior distributions for the rank and probabilities of each algorithm. These distributions capture the uncertainties in the point estimates and are essential for Bayesian inference.
- Type:
dict or similar container
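The relationship between the posterior and the two summary Series can be sketched as follows. This assumes that the posterior is available as samples of normalised Plackett-Luce weights (one row per sample, one column per algorithm, each row summing to 1); that layout and the helper name summarise_weights are assumptions made for illustration, not the library's documented format.

```python
import numpy as np
import pandas as pd


def summarise_weights(weight_samples, approaches):
    """Summarise posterior weight samples (illustrative sketch).

    Under a Plackett-Luce model, a normalised weight is the probability
    that the corresponding item is ranked first, so its posterior mean
    estimates the probability of being the best.
    """
    w = np.asarray(weight_samples, dtype=float)
    expected_prob = pd.Series(w.mean(axis=0), index=approaches)
    # Rank within each sample: the largest weight gets rank 1.
    ranks = (-w).argsort(axis=1).argsort(axis=1) + 1
    expected_rank = pd.Series(ranks.mean(axis=0), index=approaches)
    return expected_prob, expected_rank


# Two toy posterior samples for three algorithms.
toy_samples = [[0.5, 0.2, 0.3], [0.4, 0.1, 0.5]]
expected_prob, expected_rank = summarise_weights(toy_samples, ["a1", "a2", "a3"])
```

In this sketch the expected probabilities sum to 1 and the expected ranks sum to n(n + 1) / 2 for n algorithms, which matches the totals in the example output of run above.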
verona.evaluation.stattests.signed_rank module
- class verona.evaluation.stattests.signed_rank.BayesianSignedRankTest(x, y, approaches: List[str])
Bases:
object
Bayesian equivalent to Wilcoxon’s signed-rank test.
This class implements the Bayesian version of the signed-rank test as presented in Benavoli et al., 2017. This Bayesian test aims to evaluate the difference between two related samples (or one sample against a zero null hypothesis) and provides probabilities for three regions: left, rope, and right.
- x
First sample.
- Type:
array-like
- y
Second sample. If not provided, x is assumed to be the difference.
- Type:
array-like, optional
- approaches
Names of the two methods or approaches to be compared.
- Type:
array-like
References
A. Benavoli, G. Corani, J. Demsar, M. Zaffalon (2017) Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis. Journal of Machine Learning Research, 18, 1-36.
Borja Calvo and Guzmán Santafé (2016) scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems. The R Journal, 8(1), 248-256.
- run(s=0.5, z0=0, rope=(-1, 1), nsim=100000, seed=None) → BayesianSignedRankTestResult
- Parameters:
s (float, optional) – Scale parameter of the prior Dirichlet Process. Defaults to 0.5.
z0 (float, optional) – Position of the pseudo-observation associated with the prior Dirichlet Process. Defaults to 0.
rope (tuple, optional) – Interval within which the difference is considered “irrelevant”. Defaults to (-1, 1).
nsim (int, optional) – Number of samples used to estimate the posterior distribution. Defaults to 100000.
seed (int, optional) – Optional parameter used to fix the random seed.
- Returns:
- A result object containing:
method: A string with the name of the method used.
posterior_probabilities: A dictionary with the left, rope and right probabilities.
approximated: A boolean, True if the posterior distribution is approximated (sampled) and False if it is exact.
parameters: A dictionary of parameters used by the method.
posterior: The sampled probabilities for the left, rope and right regions.
- Return type:
BayesianSignedRankTestResult
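The roles of s, z0, rope and nsim can be illustrated with a compact NumPy sketch of the Dirichlet-process sampling scheme described by Benavoli et al. This is a simplified illustration under stated assumptions, not the library's code: the function name signed_rank_sketch is hypothetical, and details such as tie handling between regions may differ from the actual implementation.

```python
import numpy as np


def signed_rank_sketch(x, y, s=0.5, z0=0.0, rope=(-1.0, 1.0), nsim=2000, seed=None):
    """Monte Carlo sketch of a Bayesian signed-rank test (illustrative only)."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # The pseudo-observation z0 carries prior weight s; each observed
    # difference carries weight 1.
    z = np.concatenate(([z0], diff))
    alpha = np.concatenate(([s], np.ones(diff.size)))
    weights = rng.dirichlet(alpha, size=nsim)            # shape (nsim, n + 1)
    # Classify every pair (i, j) by where z_i + z_j falls relative to the rope.
    pair_sums = z[:, None] + z[None, :]
    in_left = (pair_sums < 2 * rope[0]).astype(float)
    in_right = (pair_sums > 2 * rope[1]).astype(float)
    in_rope = 1.0 - in_left - in_right

    def region_mass(indicator):
        # sum_ij w_i * w_j * indicator_ij for every sampled weight vector
        return np.einsum("si,ij,sj->s", weights, indicator, weights)

    masses = np.column_stack(
        [region_mass(in_left), region_mass(in_rope), region_mass(in_right)]
    )
    # Each sample votes for the region with the largest mass.
    winners = masses.argmax(axis=1)
    probs = np.bincount(winners, minlength=3) / nsim
    return dict(zip(("left", "rope", "right"), probs)), masses


# Toy data in which the first approach is clearly better on every dataset.
probs, masses = signed_rank_sketch([10.0] * 8, [0.0] * 8, seed=0)
```

Here masses plays the role of the sampled posterior (one row of left, rope and right masses per draw), and probs corresponds to the three posterior probabilities reported by the test.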
- class verona.evaluation.stattests.signed_rank.BayesianSignedRankTestResult(posterior_probs, approximated, posterior)
Bases:
object
Represents the results of a Bayesian Signed Rank Test.
- method
The name of the statistical method used.
- Type:
str
- posterior_probabilities
Probabilities for the left, rope, and right regions.
- Type:
dict
- approximated
Whether the posterior distribution is approximated.
- Type:
bool
- parameters
Parameters used in the Bayesian Signed Rank Test.
- Type:
dict
- posterior
Sampled probabilities for left, rope, and right areas.
- Type:
pd.DataFrame