{{short description|Set of machine learning methods}}
{{Machine learning bar}}
'''Multiple kernel learning''' refers to a set of machine learning methods that use a predefined set of [[Kernel method|kernels]] and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources (e.g. sound and images from a video) that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.
Multiple kernel learning approaches have been used in many applications, such as event recognition in video, object recognition in images, and biomedical data fusion.
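For illustration, a linear kernel combination can be sketched as follows; the feature dimensions, base kernels, and weights are placeholder assumptions rather than part of any particular published algorithm.
<syntaxhighlight lang="python">
import numpy as np

# Two hypothetical data sources describing the same 5 examples
# (e.g. audio features and image features from a video).
rng = np.random.default_rng(0)
X_audio = rng.normal(size=(5, 8))
X_image = rng.normal(size=(5, 16))

# One kernel per source: a linear kernel and an RBF kernel.
K_audio = X_audio @ X_audio.T
K_image = np.exp(-0.5 * np.sum((X_image[:, None] - X_image[None, :]) ** 2, axis=-1))

# Linear MKL: combine the Gram matrices with non-negative weights summing to one.
beta = np.array([0.3, 0.7])
K_combined = beta[0] * K_audio + beta[1] * K_image  # usable by any kernel method, e.g. an SVM
</syntaxhighlight>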
==Algorithms==
====Fixed rules approaches====
Fixed rules approaches such as the linear combination algorithm described above use rules to set the combination of the kernels. These do not require parameterization and use rules like summation and multiplication to combine the kernels; because the rule is fixed, no weighting needs to be learned. Other examples of fixed rules include pairwise kernels, which are of the form
:<math>k((x_{1i}, x_{1j}),(x_{2i},x_{2j}))=k(x_{1i},x_{2i})k(x_{1j},x_{2j})+k(x_{1i},x_{2j})k(x_{1j},x_{2i})</math>.
These pairwise approaches have been used in predicting protein-protein interactions.<ref>Ben-Hur, A. and Noble, W.S. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21 (suppl. 1), 2005, pp. i38–i46.</ref>
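The pairwise rule can be sketched in a few lines of code; the RBF base kernel and the toy feature vectors below are illustrative assumptions, not those of the cited work.
<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Base kernel on single examples (illustrative choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def pairwise_kernel(pair1, pair2, k=rbf_kernel):
    """Fixed-rule pairwise kernel:
    k((x1i, x1j), (x2i, x2j)) = k(x1i, x2i) k(x1j, x2j) + k(x1i, x2j) k(x1j, x2i)."""
    x1i, x1j = pair1
    x2i, x2j = pair2
    return k(x1i, x2i) * k(x1j, x2j) + k(x1i, x2j) * k(x1j, x2i)

# Example: similarity between two candidate interacting pairs, each element
# represented by a (hypothetical) feature vector.
p1 = (np.array([0.1, 0.3]), np.array([0.5, 0.2]))
p2 = (np.array([0.2, 0.3]), np.array([0.4, 0.1]))
print(pairwise_kernel(p1, p2))
</syntaxhighlight>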
====Heuristic approaches====
Other approaches use a definition of kernel similarity, such as
:<math>A(K_1,K_2)=\frac{\langle K_1,K_2\rangle}{\sqrt{\langle K_1,K_1\rangle\langle K_2,K_2\rangle}}</math>
Using this measure, Qiu and Lane (2009)<ref>Shibin Qiu and Terran Lane. A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(2):190–199, 2009</ref> used kernel alignment as a heuristic to select the kernel weights.
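A minimal sketch of the alignment measure, assuming the inner products above are Frobenius inner products between Gram matrices; the toy data and the "ideal" target kernel <math>yy^T</math> are illustrative and not taken from the cited work.
<syntaxhighlight lang="python">
import numpy as np

def alignment(K1, K2):
    """Kernel alignment A(K1, K2) between two Gram matrices, using the
    Frobenius inner product <K1, K2> = sum_ij K1_ij * K2_ij."""
    num = np.sum(K1 * K2)
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den

# Toy example: alignment of a candidate kernel with the "ideal" kernel y y^T
# built from labels y in {-1, +1}, as used by alignment-based heuristics.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X[:, 0])
K = X @ X.T               # linear kernel on the data
K_ideal = np.outer(y, y)  # target kernel derived from the labels
print(alignment(K, K_ideal))
</syntaxhighlight>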
:<math>f(x)=\sum_{i=1}^N\sum_{m=1}^P\alpha_i^mK_m(x_i^m,x^m)+b</math>
The parameters <math>\alpha_i^m</math> and <math>b</math> are learned by [[gradient descent]] on a coordinate basis. In this way, each iteration of the descent algorithm identifies the best kernel column to choose at each particular iteration and adds that to the combined kernel. The model is then rerun to generate the optimal weights <math>\alpha_i</math> and <math>b</math>.
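The column-selection idea can be sketched as follows. This is a simplified illustration with squared loss, a fixed step size, and no bias update, not a faithful implementation of the method described above.
<syntaxhighlight lang="python">
import numpy as np

def greedy_kernel_columns(kernels, y, n_iter=10, step=0.5):
    """Simplified coordinate-wise sketch: at each iteration, pick the kernel
    column K_m[:, i] most correlated with the current residual and update its
    coefficient alpha_i^m."""
    P, N = len(kernels), len(y)
    alpha = np.zeros((P, N))   # one coefficient per (kernel, training point)
    b = float(np.mean(y))      # simple fixed bias
    f = np.full(N, b)          # current predictions on the training set
    for _ in range(n_iter):
        residual = y - f
        # score every kernel column by |inner product with the residual|
        scores = np.array([np.abs(K.T @ residual) for K in kernels])  # shape (P, N)
        m, i = np.unravel_index(np.argmax(scores), scores.shape)
        col = kernels[m][:, i]                           # selected column K_m(x_i, .)
        delta = step * (col @ residual) / (col @ col)    # coordinate update
        alpha[m, i] += delta
        f = f + delta * col
    return alpha, b

# Toy usage with a linear and an RBF Gram matrix on the same 30 training points.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X[:, 0] + 0.1 * rng.normal(size=30)
K_lin = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
alpha, b = greedy_kernel_columns([K_lin, K_rbf], y)
</syntaxhighlight>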
===Semisupervised learning===
where <math>L</math> is the loss function (weighted negative log-likelihood in this case), <math>R</math> is the regularization term ([[Proximal gradient methods for learning#Exploiting group structure|Group LASSO]] in this case), and <math>\Theta</math> is the conditional expectation consensus (CEC) penalty on unlabeled data. The CEC penalty is defined as follows. Let the marginal kernel density for all the data be
:<math>g^{\pi}_m(x)=\phi^{\pi T}_m\psi_m(x)</math>
where <math>\psi_m(x)=[K_m(x_1,x),\ldots,K_m(x_L,x)]^T</math> (the kernel distance between the labeled data and all of the labeled and unlabeled data) and <math>\phi^{\pi}_m</math> is a non-negative random vector with a 2-norm of 1. The value of <math>\Pi</math> is the number of times each kernel is projected. Expectation regularization is then performed on the marginal kernel density, resulting in a reference expectation <math>q^{\pi}_m(y|g^{\pi}_m(x))</math> and a model expectation <math>p^{\pi}_m(f(x)|g^{\pi}_m(x))</math>. Then, we define
:<math>\Theta=\frac{1}{\Pi} \sum^{\Pi}_{\pi=1}\sum^{M}_{m=1} D(q^{\pi}_m(y|g^{\pi}_m(x))||p^{\pi}_m(f(x)|g^{\pi}_m(x)))</math>
where <math>D(Q||P)=\sum_iQ(i)\ln\frac{Q(i)}{P(i)}</math> is the [[Kullback–Leibler divergence]]. The combined minimization problem is optimized using a modified block gradient descent algorithm. For more information, see Wang et al.<ref>Wang, Shuhui et al. [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6177671 S3MKL: Scalable Semi-Supervised Multiple Kernel Learning for Real-World Image Applications]. IEEE Transactions on Multimedia, Vol. 14, No. 4, August 2012</ref>
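The KL-based penalty can be sketched directly from the definitions above. In this sketch the reference and model expectations are supplied as plain arrays, since computing them requires the full expectation-regularization machinery of the cited method.
<syntaxhighlight lang="python">
import numpy as np

def kl_divergence(Q, P, eps=1e-12):
    """D(Q || P) = sum_i Q(i) log(Q(i) / P(i)) for discrete distributions."""
    Q = np.asarray(Q, dtype=float) + eps
    P = np.asarray(P, dtype=float) + eps
    return float(np.sum(Q * np.log(Q / P)))

def cec_penalty(q, p):
    """Theta = (1/Pi) * sum_pi sum_m D(q_m^pi || p_m^pi), where q[pi][m] and
    p[pi][m] are the reference and model expectations for projection pi and
    kernel m (supplied here as plain arrays for illustration)."""
    Pi = len(q)
    return sum(kl_divergence(q[pi][m], p[pi][m])
               for pi in range(Pi) for m in range(len(q[pi]))) / Pi

# Hypothetical expectations for Pi = 2 projections and M = 2 kernels.
q = [[[0.7, 0.3], [0.6, 0.4]], [[0.8, 0.2], [0.5, 0.5]]]
p = [[[0.6, 0.4], [0.6, 0.4]], [[0.7, 0.3], [0.4, 0.6]]]
print(cec_penalty(q, p))
</syntaxhighlight>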
===Unsupervised learning===
[[Unsupervised learning|Unsupervised]] multiple kernel learning algorithms have also been proposed by Zhuang et al. The problem is defined as follows. Let <math>U=\{x_i\}</math> be a set of unlabeled data. The kernel definition is the linear combined kernel <math>K'=\sum_{m=1}^M\beta_mK_m</math>. In this problem, the data needs to be "clustered" into groups based on the kernel distances. Let <math>B_i</math> be the group (cluster) to which <math>x_i</math> belongs. The kernel weights and the groups are then learned jointly by solving
:<math>\min_{\beta,B}\sum^n_{i=1}\left\Vert x_i - \sum_{x_j\in B_i} K(x_i,x_j)x_j\right\Vert^2 + \gamma_1\sum_{i=1}^n\sum_{x_j\in B_i}K(x_i,x_j)\left\Vert x_i - x_j \right\Vert^2 + \gamma_2\sum_i |B_i|</math>
where <math>\gamma_1</math> and <math>\gamma_2</math> are regularization parameters. One formulation of the groups is defined as follows. Let <math>D\in \{0,1\}^{n\times n}</math> be a matrix such that <math>D_{ij}=1</math> means that <math>x_i</math> and <math>x_j</math> are neighbors; then <math>B_i=\{x_j:D_{ij}=1\}</math>. Note that these groups must be learned as well. Zhuang et al. solve this problem by an alternating minimization method over <math>K</math> and the groups <math>B_i</math>. For more information, see Zhuang et al.<ref>J. Zhuang, J. Wang, S.C.H. Hoi & X. Lan. [http://jmlr.csail.mit.edu/proceedings/papers/v20/zhuang11/zhuang11.pdf Unsupervised Multiple Kernel Learning]. Jour. Mach. Learn. Res. 20:129–144, 2011</ref>
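For a given weight vector <math>\beta</math> and neighborhood matrix <math>D</math>, the objective above can be evaluated as in the following sketch; the alternating minimization over <math>\beta</math> and <math>B_i</math> is not shown, and the toy kernels and neighborhood sizes are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def umkl_objective(X, kernels, beta, D, gamma1=1.0, gamma2=0.1):
    """Evaluate the unsupervised MKL objective for fixed kernel weights beta
    and a fixed 0/1 neighborhood matrix D (D[i, j] = 1 iff x_j is in B_i):
    reconstruction error + gamma1 * local distortion + gamma2 * group sizes."""
    K = sum(b * Km for b, Km in zip(beta, kernels))  # combined kernel K'
    n = X.shape[0]
    recon, distort, size = 0.0, 0.0, 0.0
    for i in range(n):
        nbrs = np.flatnonzero(D[i])                  # indices of the group B_i
        recon += np.sum((X[i] - (K[i, nbrs] @ X[nbrs])) ** 2)
        distort += np.sum(K[i, nbrs] * np.sum((X[i] - X[nbrs]) ** 2, axis=1))
        size += len(nbrs)
    return recon + gamma1 * distort + gamma2 * size

# Toy usage: 10 points, two kernels, each point grouped with its 3 nearest points.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 2))
K_lin = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
D = np.zeros((10, 10), dtype=int)
for i in range(10):
    D[i, np.argsort(dists[i])[1:4]] = 1              # 3 nearest neighbors
beta = np.array([0.5, 0.5])
print(umkl_objective(X, [K_lin, K_rbf], beta, D))
</syntaxhighlight>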
==MKL Libraries==
Available MKL libraries include
* SPG-GMKL: A scalable C++ MKL SVM library that can handle a million kernels.<ref>Ashesh Jain, S. V. N. Vishwanathan and Manik Varma. SPG-GMKL: Generalized multiple kernel learning with a million kernels. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Beijing, China, August 2012</ref>
* [http://research.microsoft.com/en-us/um/people/manik/code/GMKL/download.html GMKL]: Generalized Multiple Kernel Learning code in [[MATLAB]], does <math>\ell_1</math> and <math>\ell_2</math> regularization for supervised learning.<ref>M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In Proceedings of the International Conference on Machine Learning, Montreal, Canada, June 2009</ref>
* [https://archive.today/20141208195618/http://appsrv.cse.cuhk.edu.hk/~hqyang/doku.php?id=gmkl (Another) GMKL]: A different MATLAB MKL code that can also perform elastic net regularization.<ref>Yang, H., Xu, Z., Ye, J., King, I., & Lyu, M. R. (2011). Efficient Sparse Generalized Multiple Kernel Learning. IEEE Transactions on Neural Networks, 22(3), 433-446</ref>
* [http://research.microsoft.com/en-us/um/people/manik/code/smo-mkl/download.html SMO-MKL]: C++ source code for a Sequential Minimal Optimization MKL algorithm. Does <math>p</math>-norm regularization.<ref>S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt and M. Varma. Multiple kernel learning and the SMO algorithm. In Advances in Neural Information Processing Systems, Vancouver, B. C., Canada, December 2010.</ref>
* [http://asi.insa-rouen.fr/enseignants/~arakoto/code/mklindex.html SimpleMKL]: A MATLAB code based on the SimpleMKL algorithm for MKL SVM.<ref>Alain Rakotomamonjy, Francis Bach, Stephane Canu, Yves Grandvalet. SimpleMKL. Journal of Machine Learning Research, Microtome Publishing, 2008, 9, pp.2491-2521.</ref>
* [https://pypi.python.org/pypi/MKLpy MKLPy]: A scikit-compliant Python framework for MKL and kernel machines, implementing different algorithms, e.g. EasyMKL<ref>Fabio Aiolli, Michele Donini. [https://www.sciencedirect.com/science/article/pii/S0925231215003653 EasyMKL: a scalable multiple kernel learning algorithm]. Neurocomputing, 169, pp. 215-224.</ref> and others.
==References==
{{Reflist}}