Clustering

“Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some specific sense defined by the analyst) to each other than to those in other groups (clusters).” (from Wikipedia). In our study, we consider the multivariate case, where clustering is done considering more than one variable. First, we apply the Quantum Genetic Algorithm, then a classical GA, and compare their results.

Prepare data

We make use of the “iris” dataset:

data(iris)
vars <- colnames(iris)[1:4]
vars
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"

The above four variables will be used in clustering.

Quantum GA application

Fitness evaluation

The following fitness evaluation function is defined:

clustering <- function(solution, eval_func_inputs) {
  maxvalue <- 5
  penalfactor <- 2
  df <- eval_func_inputs[[1]]
  vars <- eval_func_inputs[[2]]
  # Fitness function
  fitness <- 0
  for (v in vars) {
    cv <- tapply(df[,v],solution,FUN=sd) / tapply(df[,v],solution,FUN=mean)
    cv <- ifelse(is.na(cv),maxvalue,cv)
    fitness <- fitness + sum(cv)
  }
  # Penalization on unbalanced clusters
  b <- table(solution)/nrow(df)
  fitness <- fitness + penalfactor * (sum(abs(b - c(rep(1/(length(b)),length(b))))))
  return(-fitness)
}

This function receives as input parameters:

  • solution (the current solution to be evaluated)
  • eval_func_inputs (a list including the dataframe to be analyzed, and the set of variables to be considered in the clustering)

Optimization

We define the Genome parameter as the number of entries in the “iris” dataset. The number of values is set equal to the 3. This means that each solution is a vector of 150 elements, each one with an assigned value from 1 to 3: each couple element/value indicates which entry (element) is assigned to which cluster (value).

nclust = 3
popsize = 20
Genome = nrow(iris)
set.seed(1234)
CLUSTsolution <- QGA(popsize,
                generation_max = 1500,
                nvalues_sol = nclust,
                Genome,
                thetainit = 3.1415926535 * 0.1,
                thetaend = 3.1415926535 * 0.05,
                pop_mutation_rate_init = 1/(popsize + 1),
                pop_mutation_rate_end = 1/(popsize + 1),
                mutation_rate_init = 1/(Genome + 1),
                mutation_rate_end = 1/(Genome + 1),
                mutation_flag = TRUE,
                plotting = FALSE,
                verbose = FALSE,
                progress = FALSE,
                eval_fitness = clustering,
                eval_func_inputs = list(iris, vars),
                stop_iters = 300)

Analysis of the solution

QGA:::plot_Output(CLUSTsolution[[2]])

The plot indicates that the number of iterations was enough to obtain a solution that is likely to be no further improved.

solution <- CLUSTsolution[[1]]
fitnessQGA <- 0
df <- iris
for (v in vars) {
  cv <- tapply(df[,v],solution,FUN=sd) / tapply(df[,v],solution,FUN=mean)
  cv <- ifelse(is.na(cv),maxvalue,cv)
  fitnessQGA <- fitnessQGA + sum(cv)
}
fitnessQGA
#> [1] 1.62336

We can try to understand the quality of these clusters by comparing them to the “Species” variable in the dataset:

iris$cluster <- solution
xtabs( ~ Species + cluster, data=iris)
#>             cluster
#> Species       1  2  3
#>   setosa      0  0 50
#>   versicolor 45  5  0
#>   virginica   5 45  0

It seems that the cluster values can predict quite well the Species of the iris flowers, with only 10 misclassifications out of 150.

Classical GA application

We can now compare these results with those obtained by a classical genetic algorithm, the one implemented in the R package “genalg”.

Fitness evaluation

The fitness function:

evaluate <- function(solution) {
  solution <- round(solution)
  maxvalue <- 5
  penalfactor <- 2
  fitness <- 0
  for (v in vars) {
    cv <- tapply(iris[,v],solution,FUN=sd) / tapply(iris[,v],solution,FUN=mean)
    cv <- ifelse(is.na(cv),maxvalue,cv)
    fitness <- fitness + sum(cv)
  }
  b <- table(solution)/nrow(iris)
  # Penalisation on unbalanced clusters
  fitness <- fitness + penalfactor * (sum(abs(b - c(rep(1/(length(b)),length(b))))))
  return(fitness)
}

Optimization

To ensure good results to the GA we give a number of iterations (2500) that is much higher than the one given to the QGA (1500).

set.seed(1234)
solution_genalg <- rbga(stringMin=c(rep(1,nrow(iris))), 
                   stringMax=c(rep(nclust,nrow(iris))),
                   popSize=20, 
                   iters=3000, 
                   elitism=NA, 
                   mutationChance = 5/(nrow(iris)+1),
                   evalFunc=evaluate)
plot(solution_genalg)

Analysis of the solution

filter = solution_genalg$evaluations == min(solution_genalg$evaluations)
bestObjectCount = sum(rep(1, solution_genalg$popSize)[filter])
if (bestObjectCount > 1) {
  bestSolution = solution_genalg$population[filter, ][1, 
  ]
} else {
  bestSolution = solution_genalg$population[filter, ]
}
bestSolution <- round(bestSolution)
fitnessGA <- 0
for (v in vars) {
  cv <- tapply(iris[,v],bestSolution,FUN=sd) / tapply(iris[,v],bestSolution,FUN=mean)
  cv <- ifelse(is.na(cv),maxvalue,cv)
  fitnessGA <- fitnessGA + sum(cv)
}
fitnessGA
#> [1] 1.857548
fitnessQGA
#> [1] 1.62336

Also considering the obtained stratification:

iris$stratum <- bestSolution
table(iris$Species, iris$stratum)
#>             
#>               1  2  3
#>   setosa      0 50  0
#>   versicolor 12  0 38
#>   virginica  38  0 12

the classic GA one is less in line with the “natural” one.

Conclusions

In this case, the Quantum GA is more efficient than the classical GA.

In general, when the number of values that each element in the genome can assume is not high (in this case, there are only 3 different values), then a relative efficiency of the QGA is to be expected. When the number of these values is high, the classical GA tends to be more efficient.

This is because the QGA needs to define a number of qubits that is the integer ceiling value of log(n) (in our case n=3, so the number of qubits is 2). The length of the solution is given by the product of the length of the genome (in our case 150) by the number of qubits necessary for each element of the genome, so 150x2=300.

On the contrary, the classical GA can work directly on an integer representation of the values for each element of the genome, so the length of the solution does not increase.

When the n is high, the length of the solution treated by the QGA is much higher than that of the solution treated by the classical GA, so the latter can be more efficient than the former.