Bayesian bi-clustering of categorical data


Cluster analysis is a common statistical technique for partitioning observed data into disjoint, homogeneous groups. With multivariate data, it is often useful to identify which features are the best predictors of cluster membership. We formalize this problem as a bidirectional Bayesian cluster analysis, carried out in both the unit space and the feature space. The aim is to cluster the observed sample and, at the same time, to classify the variables according to prespecified levels of discriminating power. Split-merge and Gibbs-sampler-type MCMC algorithms are used to traverse the posterior distribution over partitions of samples and variables simultaneously. We show how the model can be used to cluster genetic data and to highlight sites under selective pressure. A software implementation for clustering categorical data matrices is freely available at

Helsinki, Finland