{{short description|Set of machine learning methods}}
{{Machine learning bar}}
'''Multiple kernel learning''' refers to a set of machine learning methods that use a predefined set of [[Kernel method|kernels]] and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources (e.g. sound and images from a video) that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.
Multiple kernel learning approaches have been used in many applications, such as event recognition in video, object recognition in images, and biomedical data fusion.
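For illustration, a linear kernel combination can be sketched as follows; the feature dimensions, base kernels, and weights are placeholder assumptions rather than part of any particular published algorithm.
<syntaxhighlight lang="python">
import numpy as np

# Two hypothetical data sources describing the same 5 examples
# (e.g. audio features and image features from a video).
rng = np.random.default_rng(0)
X_audio = rng.normal(size=(5, 8))
X_image = rng.normal(size=(5, 16))

# One kernel per source: a linear kernel and an RBF kernel.
K_audio = X_audio @ X_audio.T
K_image = np.exp(-0.5 * np.sum((X_image[:, None] - X_image[None, :]) ** 2, axis=-1))

# Linear MKL: combine the Gram matrices with non-negative weights summing to one.
beta = np.array([0.3, 0.7])
K_combined = beta[0] * K_audio + beta[1] * K_image  # usable by any kernel method, e.g. an SVM
</syntaxhighlight>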
==Algorithms==
====Fixed rules approaches====
Fixed rules approaches such as the linear combination algorithm described above use rules to set the combination of the kernels. These do not require parameterization and use rules like summation and multiplication to combine the kernels; because the rule is fixed, no weighting needs to be learned. Other examples of fixed rules include pairwise kernels, which are of the form
:<math>k((x_{1i}, x_{1j}),(x_{2i},x_{2j}))=k(x_{1i},x_{2i})k(x_{1j},x_{2j})+k(x_{1i},x_{2j})k(x_{1j},x_{2i})</math>.
These pairwise approaches have been used in predicting protein-protein interactions.<ref>Ben-Hur, A. and Noble, W.S. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21 (suppl. 1), 2005, pp. i38–i46.</ref>
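The pairwise rule can be sketched in a few lines of code; the RBF base kernel and the toy feature vectors below are illustrative assumptions, not those of the cited work.
<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Base kernel on single examples (illustrative choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def pairwise_kernel(pair1, pair2, k=rbf_kernel):
    """Fixed-rule pairwise kernel:
    k((x1i, x1j), (x2i, x2j)) = k(x1i, x2i) k(x1j, x2j) + k(x1i, x2j) k(x1j, x2i)."""
    x1i, x1j = pair1
    x2i, x2j = pair2
    return k(x1i, x2i) * k(x1j, x2j) + k(x1i, x2j) * k(x1j, x2i)

# Example: similarity between two candidate interacting pairs, each element
# represented by a (hypothetical) feature vector.
p1 = (np.array([0.1, 0.3]), np.array([0.5, 0.2]))
p2 = (np.array([0.2, 0.3]), np.array([0.4, 0.1]))
print(pairwise_kernel(p1, p2))
</syntaxhighlight>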
====Heuristic approaches====
Other approaches use a definition of kernel similarity, such as
:<math>A(K_1,K_2)=\frac{\langle K_1,K_2\rangle}{\sqrt{\langle K_1,K_1\rangle\langle K_2,K_2\rangle}}</math>
Using this measure, Qiu and Lane (2009)<ref>Shibin Qiu and Terran Lane. A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(2):190–199, 2009</ref> used kernel alignment as a heuristic to select the kernel weights.
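A minimal sketch of the alignment measure, assuming the inner products above are Frobenius inner products between Gram matrices; the toy data and the "ideal" target kernel <math>yy^T</math> are illustrative and not taken from the cited work.
<syntaxhighlight lang="python">
import numpy as np

def alignment(K1, K2):
    """Kernel alignment A(K1, K2) between two Gram matrices, using the
    Frobenius inner product <K1, K2> = sum_ij K1_ij * K2_ij."""
    num = np.sum(K1 * K2)
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den

# Toy example: alignment of a candidate kernel with the "ideal" kernel y y^T
# built from labels y in {-1, +1}, as used by alignment-based heuristics.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X[:, 0])
K = X @ X.T               # linear kernel on the data
K_ideal = np.outer(y, y)  # target kernel derived from the labels
print(alignment(K, K_ideal))
</syntaxhighlight>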
:<math>f(x)=\sum_{i=1}^N\sum_{m=1}^P\alpha_i^mK_m(x_i^m,x^m)+b</math>
The parameters <math>\alpha_i^m</math> and <math>b</math> are learned by [[gradient descent]] on a coordinate basis. In this way, each iteration of the descent algorithm identifies the best kernel column to choose at each particular iteration and adds that to the combined kernel. The model is then rerun to generate the optimal weights <math>\alpha_i</math> and <math>b</math>.
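The column-selection idea can be sketched as follows. This is a simplified illustration with squared loss, a fixed step size, and no bias update, not a faithful implementation of the method described above.
<syntaxhighlight lang="python">
import numpy as np

def greedy_kernel_columns(kernels, y, n_iter=10, step=0.5):
    """Simplified coordinate-wise sketch: at each iteration, pick the kernel
    column K_m[:, i] most correlated with the current residual and update its
    coefficient alpha_i^m."""
    P, N = len(kernels), len(y)
    alpha = np.zeros((P, N))   # one coefficient per (kernel, training point)
    b = float(np.mean(y))      # simple fixed bias
    f = np.full(N, b)          # current predictions on the training set
    for _ in range(n_iter):
        residual = y - f
        # score every kernel column by |inner product with the residual|
        scores = np.array([np.abs(K.T @ residual) for K in kernels])  # shape (P, N)
        m, i = np.unravel_index(np.argmax(scores), scores.shape)
        col = kernels[m][:, i]                           # selected column K_m(x_i, .)
        delta = step * (col @ residual) / (col @ col)    # coordinate update
        alpha[m, i] += delta
        f = f + delta * col
    return alpha, b

# Toy usage with a linear and an RBF Gram matrix on the same 30 training points.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X[:, 0] + 0.1 * rng.normal(size=30)
K_lin = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
alpha, b = greedy_kernel_columns([K_lin, K_rbf], y)
</syntaxhighlight>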
===Semisupervised learning===
where <math>L</math> is the loss function (weighted negative log-likelihood in this case), <math>R</math> is the regularization term ([[Proximal gradient methods for learning#Exploiting group structure|Group LASSO]] in this case), and <math>\Theta</math> is the conditional expectation consensus (CEC) penalty on unlabeled data. The CEC penalty is defined as follows. Let the marginal kernel density for all the data be
:<math>g^{\pi}_m(x)=\phi^{\pi T}_m\psi_m(x)</math>
where <math>\psi_m(x)=[K_m(x_1,x),\ldots,K_m(x_L,x)]^T</math> (the kernel distance between the labeled data and all of the labeled and unlabeled data) and <math>\phi^{\pi}_m</math> is a non-negative random vector with a 2-norm of 1. The value of <math>\Pi</math> is the number of times each kernel is projected. Expectation regularization is then performed on the marginal kernel density, resulting in a reference expectation <math>q^{\pi}_m(y|g^{\pi}_m(x))</math> and a model expectation <math>p^{\pi}_m(f(x)|g^{\pi}_m(x))</math>. Then, we define
:<math>\Theta=\frac{1}{\Pi} \sum^{\Pi}_{\pi=1}\sum^{M}_{m=1} D(q^{\pi}_m(y|g^{\pi}_m(x))||p^{\pi}_m(f(x)|g^{\pi}_m(x)))</math>
where <math>D(Q||P)=\sum_iQ(i)\ln\frac{Q(i)}{P(i)}</math> is the [[Kullback–Leibler divergence]]. The combined minimization problem is optimized using a modified block gradient descent algorithm. For more information, see Wang et al.<ref>Wang, Shuhui et al. [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6177671 S3MKL: Scalable Semi-Supervised Multiple Kernel Learning for Real-World Image Applications]. IEEE Transactions on Multimedia, Vol. 14, No. 4, August 2012</ref>
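The KL-based penalty can be sketched directly from the definitions above. In this sketch the reference and model expectations are supplied as plain arrays, since computing them requires the full expectation-regularization machinery of the cited method.
<syntaxhighlight lang="python">
import numpy as np

def kl_divergence(Q, P, eps=1e-12):
    """D(Q || P) = sum_i Q(i) log(Q(i) / P(i)) for discrete distributions."""
    Q = np.asarray(Q, dtype=float) + eps
    P = np.asarray(P, dtype=float) + eps
    return float(np.sum(Q * np.log(Q / P)))

def cec_penalty(q, p):
    """Theta = (1/Pi) * sum_pi sum_m D(q_m^pi || p_m^pi), where q[pi][m] and
    p[pi][m] are the reference and model expectations for projection pi and
    kernel m (supplied here as plain arrays for illustration)."""
    Pi = len(q)
    return sum(kl_divergence(q[pi][m], p[pi][m])
               for pi in range(Pi) for m in range(len(q[pi]))) / Pi

# Hypothetical expectations for Pi = 2 projections and M = 2 kernels.
q = [[[0.7, 0.3], [0.6, 0.4]], [[0.8, 0.2], [0.5, 0.5]]]
p = [[[0.6, 0.4], [0.6, 0.4]], [[0.7, 0.3], [0.4, 0.6]]]
print(cec_penalty(q, p))
</syntaxhighlight>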
===Unsupervised learning===
[[Unsupervised learning|Unsupervised]] multiple kernel learning algorithms have also been proposed by Zhuang et al. The problem is defined as follows. Let <math>U=\{x_i\}</math> be a set of unlabeled data. The kernel definition is the linear combined kernel <math>K'=\sum_{m=1}^M\beta_mK_m</math>. In this problem, the data needs to be "clustered" into groups based on the kernel distances. Let <math>B_i</math> be the group (cluster) to which <math>x_i</math> belongs. The kernel weights and the groups are then learned jointly by solving
:<math>\min_{\beta,B}\sum^n_{i=1}\left\Vert x_i - \sum_{x_j\in B_i} K(x_i,x_j)x_j\right\Vert^2 + \gamma_1\sum_{i=1}^n\sum_{x_j\in B_i}K(x_i,x_j)\left\Vert x_i - x_j \right\Vert^2 + \gamma_2\sum_i |B_i|</math>
where <math>\gamma_1</math> and <math>\gamma_2</math> are regularization parameters. One formulation of the groups is defined as follows. Let <math>D\in \{0,1\}^{n\times n}</math> be a matrix such that <math>D_{ij}=1</math> means that <math>x_i</math> and <math>x_j</math> are neighbors; then <math>B_i=\{x_j:D_{ij}=1\}</math>. Note that these groups must be learned as well. Zhuang et al. solve this problem by an alternating minimization method over <math>K</math> and the groups <math>B_i</math>. For more information, see Zhuang et al.<ref>J. Zhuang, J. Wang, S.C.H. Hoi & X. Lan. [http://jmlr.csail.mit.edu/proceedings/papers/v20/zhuang11/zhuang11.pdf Unsupervised Multiple Kernel Learning]. Jour. Mach. Learn. Res. 20:129–144, 2011</ref>
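For a given weight vector <math>\beta</math> and neighborhood matrix <math>D</math>, the objective above can be evaluated as in the following sketch; the alternating minimization over <math>\beta</math> and <math>B_i</math> is not shown, and the toy kernels and neighborhood sizes are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def umkl_objective(X, kernels, beta, D, gamma1=1.0, gamma2=0.1):
    """Evaluate the unsupervised MKL objective for fixed kernel weights beta
    and a fixed 0/1 neighborhood matrix D (D[i, j] = 1 iff x_j is in B_i):
    reconstruction error + gamma1 * local distortion + gamma2 * group sizes."""
    K = sum(b * Km for b, Km in zip(beta, kernels))  # combined kernel K'
    n = X.shape[0]
    recon, distort, size = 0.0, 0.0, 0.0
    for i in range(n):
        nbrs = np.flatnonzero(D[i])                  # indices of the group B_i
        recon += np.sum((X[i] - (K[i, nbrs] @ X[nbrs])) ** 2)
        distort += np.sum(K[i, nbrs] * np.sum((X[i] - X[nbrs]) ** 2, axis=1))
        size += len(nbrs)
    return recon + gamma1 * distort + gamma2 * size

# Toy usage: 10 points, two kernels, each point grouped with its 3 nearest points.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 2))
K_lin = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
D = np.zeros((10, 10), dtype=int)
for i in range(10):
    D[i, np.argsort(dists[i])[1:4]] = 1              # 3 nearest neighbors
beta = np.array([0.5, 0.5])
print(umkl_objective(X, [K_lin, K_rbf], beta, D))
</syntaxhighlight>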
==MKL Libraries==
Available MKL libraries include
* SPG-GMKL: A scalable C++ MKL SVM library that can handle a million kernels.<ref>Ashesh Jain, S. V. N. Vishwanathan and Manik Varma. SPG-GMKL: Generalized multiple kernel learning with a million kernels. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Beijing, China, August 2012</ref>
* [http://research.microsoft.com/en-us/um/people/manik/code/GMKL/download.html GMKL]: Generalized Multiple Kernel Learning code in [[MATLAB]], does <math>\ell_1</math> and <math>\ell_2</math> regularization for supervised learning.<ref>M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In Proceedings of the International Conference on Machine Learning, Montreal, Canada, June 2009</ref>
* [https://archive.today/20141208195618/http://appsrv.cse.cuhk.edu.hk/~hqyang/doku.php?id=gmkl (Another) GMKL]: A different MATLAB MKL code that can also perform elastic net regularization.<ref>Yang, H., Xu, Z., Ye, J., King, I., & Lyu, M. R. (2011). Efficient Sparse Generalized Multiple Kernel Learning. IEEE Transactions on Neural Networks, 22(3), 433-446</ref>
* [http://research.microsoft.com/en-us/um/people/manik/code/smo-mkl/download.html SMO-MKL]: C++ source code for a Sequential Minimal Optimization MKL algorithm. Does <math>p</math>-norm regularization.<ref>S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt and M. Varma. Multiple kernel learning and the SMO algorithm. In Advances in Neural Information Processing Systems, Vancouver, B. C., Canada, December 2010.</ref>
* [http://asi.insa-rouen.fr/enseignants/~arakoto/code/mklindex.html SimpleMKL]: A MATLAB code based on the SimpleMKL algorithm for MKL SVM.<ref>Alain Rakotomamonjy, Francis Bach, Stephane Canu, Yves Grandvalet. SimpleMKL. Journal of Machine Learning Research, Microtome Publishing, 2008, 9, pp.2491-2521.</ref>
* [https://pypi.python.org/pypi/MKLpy MKLPy]: A scikit-compliant Python framework for MKL and kernel machines, implementing different algorithms, e.g. EasyMKL<ref>Fabio Aiolli, Michele Donini. [https://www.sciencedirect.com/science/article/pii/S0925231215003653 EasyMKL: a scalable multiple kernel learning algorithm]. Neurocomputing, 169, pp. 215-224.</ref> and others.
==References==
{{Reflist}}