Jointly Learning to Embed and Predict with Multiple Languages

Abstract

We present a tool for generating multidimensional synthetic datasets for testing, evaluating, and benchmarking unsupervised classification algorithms. Our proposal fills a gap observed in previous approaches with regard to underlying distributions for the creation of multidimensional clusters. As a novelty, normal and non-normal distributions can be combined for either independently defining values feature by feature (i.e., multivariate distributions) or establishing overall intra-cluster distances. Being highly flexible, parameterizable, and randomizable, MDCGen also implements classic pursued features: (a) customization of cluster-separation, (b) overlap control, © addition of outliers and noise, (d) definition of correlated variables and rotations, (e) flexibility for allowing or avoiding isolation constraints per dimension, (f) creation of subspace clusters and subspace outliers, (g) importing arbitrary distributions for the value generation, and (h) dataset quality evaluations, among others. As a result, the proposed tool offers an improved range of potential datasets to perform a more comprehensive testing of clustering algorithms.

Publication
Springer Journal of Classification