HG-means -- Optimization performance matters in clustering

HG-means illustration

Minimum sum-of-squares clustering (MSSC) is a widely used clustering model, of which the popular K-means algorithm constitutes a local minimizer. It is well known that the solutions of K-means can be arbitrarily distant from the true MSSC global optimum, however, no other algorithm has been commonly adopted in the literature. This may be related to differences of effort, or to the assumption that a better solution of the MSSC has only a minor impact on classification or generalization capabilities. We dispute this belief. We introduce an efficient genetic algorithm, called HG-means, that combines K-means with problem-tailored crossover, mutation, and diversification operators. This algorithm can be interpreted as a multi-start K-means where the initial center positions are sampled based on the search history. The approach is scalable and accurate, outperforming all other algorithms in terms of solution quality, measured by the depth of local minima. This leads to classification results which are significantly closer to the ground truth for Gaussian-mixture datasets with a large number of features and clusters.

Open-source C++ and Python code libraries for HG-means are available in the Github repository located HERE.
Installing the Python package is done in a breeze:
"python -m pip install hgmeans".
The associated paper, published in Pattern Recognition, is freely available HERE.

Happy clustering !

Operations Research -- feeding the debate on hours of service regulations ?

Government regulations restricting hours of service of truck drivers have been debated for more than a decade by the trucking industry, safety advocates and the Federal Motor Carrier Safety Administration. Last week the United States Court of Appeals for the District of Columbia denied the petitions of the trucking industry and safety advocates challenging the new regulations which entered into force in July 2013. The court acknowledges that it cannot judge the rules from a scientific point of view and must focus on the rationality of the rulemaking process as such.

In a recent study, Asvin Goel and Thibaut Vidal developed a very efficient approach to solve combined vehicle routing and scheduling problems with various hours of service (HOS) regulations. The algorithm was used to simulate the carriers' decision making process subject to different rules. By comparing the resulting vehicle routes, using a driver-fatigue and risk simulator from the Health and Safety Executive, we observed that the new rule change can be expected to have a 1.2% reduction in accident risk while having a 0.8% increase in costs, to deliver the same set of customers within required time windows. We also conducted comparisons with other regulations. The impact of the new regulations is now comparable to the regulations in the European Union. While Canadian truck drivers can operate more efficiently, they also suffer from a much higher accident risk. Australian truck drivers have the highest accident risk without being able to operate more efficiently.

Thus it appears that, with some good optimization, the impact of the 2013 HOS regulations change can be largely mitigated.

This news has been since relayed in some specialized journals and blogs (see Fleet Owner, Commercial Carrier Journal, and Trucking Info). By the way I was quite surprised to see the words "population-based metaheuristic" mentioned here. Some practitioners welcome the new insights, and several concerns have been expressed regarding the applicability of the optimized solution. These concerns are intimately linked with our daily challenges as OR researchers, analysts and practitioners. Extracting the benefits out of this new set of rules requires good data, software and a decision chain in place. This can be often be achieved by major transportation companies, but, for smaller-scale actors, these HOS rules are even harder to address, and may directly translate into competiveness losses.

Hence, some open questions (some of those already well-known) remain in mind. Is optimization only a luxury that only large companies can afford? How to move forward to make this technology available to a broader range of end-users? Finally, from your point of view, what is the expected impact of the new hours of service regulations change?

Multi-attribute Vehicle routing Problems

VRP picture

Vehicle routing problems seek to design a set of delivery routes at minimum cost to deliver a geographically-dispersed set of customers. As humans, we can solve these problems manually, however even an expert usually ends-up with solutions that are more than 5% away from the best ones. Yet, even a slight improvement of 1-2% on operational costs can have a tremendous impact in large-scale operations. Furthermore, practical problems settings involve many complicating constraints, objectives and decisions (that we call attributes), aiming to capture a higher level of system detail or decision choices, including but not limited to richer system structures (e.g., several depots, vehicle fleets, and commodities), customer requirements (e.g., multi-period visits and within-period time windows), vehicle operation rules (e.g., load placement, route restrictions on total distance or time, and driver work rules), and decision context (e.g., traffic congestion and planning over extended time horizons). This is especially true when aiming to provide a high quality of service to passengers, according to a large variety of criteria and constraints. Our research is specifically focused on addressing such real-life applications.

Unified Hybrid Genetic Search

Genetic Algorithm

To address vehicle routing problem variants, we have introduced in the past years a new unified solution algorithm called Unified Hybrid Genetic Search. This algorithm uses evolutionary principles. It considers a set of problem solutions as a population, and applies genetic operators such as selection, crossover and mutation on the solutions. By simulating evolution and survival of the fittest, the population quality increases progressively and better solutions are discovered until one of the best ones is attained. Our method also revisits the traditional survival-of-the-fittest idea, by considering a bi-criteria evaluation of individuals based on solution cost and diversity (distance-to-the-others) measures. This diversity management procedures contributes to lead the search further towards higher quality solutions and eliminates risks of premature method convergence.

The method has also been designed in a unified manner with "polymorphic" components able to adapt themselves to the problem at hand. Up to date it has been tested on 60 categories of problems and overall more than 3000 data sets with a wide variety of additional constraints and objectives, e.g. multiple periods of service, multiple depots, vehicle-site dependencies, soft, multiple, and general time windows for service-time to customers, backhauls, cumulative or load-dependent costs, simultaneous pickup and delivery, fleet mix and heterogeneous fleet, time dependency, route congestion, service site choice, clustered problems, driving and working hour regulations, oil-well maintenance problems, asymmetric problems, and many of their combinations.

For all of these problems, the proposed method outperforms all other algorithms in the literature, originated from more than 200 past research articles, and yields solutions that are less than 0.5% from the best solutions ever known in all past research. Several of the data sets used in our experiments are directly issued from real-life application cases. Hence, it appears that generality does not necessarily impede efficiency. From a practical application standpoint, UHGS stands out as the first general-purpose solver to be state-of-the-art in the academic literature, validated on numerous problem benchmarks, and quickly applicable to new industrial settings.

Combinatorial Optimization

Searching for a shortest path, building itineraries or schedules, packing objects in a 2D space are all tasks which we solve on a daily basis, in a manner that is sufficient for our needs. Yet, these problems are called "combinatorial problems", since the number of possible solutions becomes rapidly huge when the problem size -- number of locations to visit, tasks to operate, or objects to fit -- grows. For example, the number of possible sequences (permutations) for visiting 80 places is 1 x 2 x 3 x 4 x ... x 80, a number with more than 118 zeroes, much bigger than the estimated number of atoms in the visible universe.

QM picture

The fast development of algorithmic complexity theory, since the 1970s, led to a new insights on a wide family of combinatorial optimization problems called "Nondeterministic Polynomial time" (NP). A simplistic but telling description of these problems is that it is rather easy to verify a solution, but possibly very difficult to find it. Such NP problems arise everywhere, from logistic applications, portfolio selection in finance, production scheduling, design of airplanes trajectories, emergency relief, beam optimization in cancer therapy, design of CPUs, among other. Up to this date, the existence of efficient polynomial algorithms (for which the computational time is smaller than a polynomial function of the problem size) for NP-problems is a famous open research question, listed as one of the millennium prize problems and worth 1.000.000$ for the individual that can solve it. A positive answer is rarely envisaged. Yet, this does not means that any effort focused on better solving this problem is in vain, but rather that exact methods for these problems are bound to be inadequate for some problems of large size.

In the industrial world, in presence of complex and huge system, solving these large NP problems in an efficient manner is often critical task. To that extent, my colleagues and I have developed a large array of heuristic methods, which can produce possibly non-optimal but usually high-quality solutions for many important combinatorial optimization problems, and especially vehicle routing problems.