Most data in genome-wide phylogenetic analysis (phylogenomics) is essentially
multidimensional, posing a major challenge to human comprehension and
computational analysis. Moreover, statistical learning models from data science
cannot be applied directly to a set of phylogenetic trees because the space of
phylogenetic trees is not Euclidean. In fact, the space of phylogenetic trees is a
tropical Grassmannian in
terms of the max-plus algebra. Therefore, to classify multilocus data sets for
phylogenetic analysis, we propose tropical
support vector machines (SVMs).
Linear SVMs are supervised learning models that can be formulated in terms
of quadratic optimization problems and that classify using hyperplanes in
a high-dimensional Euclidean space. Here we study hard margin tropical
SVMs introduced by Gärtner and Jaggi and define soft margin tropical
SVMs in the setting of tropical geometry. Then we show that both hard
margin tropical SVMs and soft margin tropical SVMs can be formulated
as linear optimization problems. For hard margin tropical SVMs, we give
necessary and sufficient conditions for the feasibility of the linear optimization
problem and, whenever it is feasible, an explicit formula for its optimal value.
For soft margin tropical SVMs, we give necessary conditions for the feasibility
of the linear optimization problem. Computational experiments show that our
methods work well with data sets generated under the multispecies coalescent
model.
Keywords
linear programming, phylogenetic tree, supervised learning,
non-Euclidean data, tropical geometry
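
To make the role of a tropical hyperplane as a classifier concrete, the following
minimal Python sketch may help. It assumes the max-plus convention, in which the
tropical hyperplane defined by a vector omega consists of the points x where
max_i (omega_i + x_i) is attained at least twice, and it takes the tropical
distance from x to that hyperplane to be the gap between the largest and second
largest values of omega_i + x_i. The function name and the example data are
hypothetical and serve only as an illustration; this is not the linear
optimization formulation studied in the paper.

```python
def tropical_sector_and_margin(omega, x):
    """Locate x relative to the max-plus tropical hyperplane given by omega.

    Returns (sector, margin):
      sector: index i at which omega[i] + x[i] is largest
              (ties go to the smallest such index)
      margin: largest minus second-largest value of omega[i] + x[i],
              assumed here to be the tropical distance from x to the hyperplane
    """
    if len(omega) != len(x) or len(x) < 2:
        raise ValueError("omega and x must have the same length, at least 2")
    shifted = [w + xi for w, xi in zip(omega, x)]
    # Sort coordinate indices by shifted value, largest first (stable on ties).
    order = sorted(range(len(shifted)), key=lambda i: shifted[i], reverse=True)
    top, second = order[0], order[1]
    return top, shifted[top] - shifted[second]


# Hypothetical toy data: two points in different open sectors, one on the hyperplane.
omega = [0.0, 0.0, 0.0]
print(tropical_sector_and_margin(omega, [3.0, 1.0, 0.0]))
print(tropical_sector_and_margin(omega, [0.0, 5.0, 1.0]))
print(tropical_sector_and_margin(omega, [1.0, 1.0, 0.0]))
```

A point with margin zero lies on the hyperplane itself. A classifier of this kind
assigns each class to a set of sectors, and a hard margin formulation would seek
an omega that keeps the smallest margin over the training points as large as
possible.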