{{short description|Solving multiple machine learning tasks at the same time}}
'''Multi-task learning''' (MTL) is a subfield of [[machine learning]] in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately.<ref>Baxter, J. (2000). "A model of inductive bias learning." ''Journal of Artificial Intelligence Research'' 12:149–198, [http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume12/baxter00a.pdf On-line paper]</ref><ref>[[Sebastian Thrun|Thrun, S.]] (1996). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems 8, pp. 640–646. MIT Press. [http://citeseer.ist.psu.edu/thrun96is.html Paper at Citeseer]</ref><ref name=":2">{{Cite journal|url = http://www.cs.cornell.edu/~caruana/mlj97.pdf|title = Multi-task learning|last = Caruana|first = R.|date = 1997|journal = Machine Learning|doi = 10.1023/A:1007379606734|volume=28|pages=41–75}}</ref> Early versions of MTL were called "hints".<ref>Suddarth, S., Kergosien, Y. (1990). Rule-injection hints as a means of improving network performance and learning time. EURASIP Workshop. Neural Networks pp. 120–129. Lecture Notes in Computer Science. Springer.</ref><ref>{{cite journal | last1 = Abu-Mostafa | first1 = Y. S. | year = 1990 | title = Learning from hints in neural networks | journal = Journal of Complexity | volume = 6 | issue = 2| pages = 192–198 | doi=10.1016/0885-064x(90)90006-y}}</ref>
 
In a widely cited 1997 paper, Rich Caruana gave the following characterization:<blockquote>Multitask Learning is an approach to [[inductive transfer]] that improves [[Generalization error|generalization]] by using the ___domain information contained in the training signals of related tasks as an [[inductive bias]]. It does this by learning tasks in parallel while using a shared [[Representation learning|representation]]; what is learned for each task can help other tasks be learned better.<ref name=":2" /></blockquote>
 
===Task grouping and overlap===
Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a [[linear combination]] of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with [[Sparse array|sparsity]], overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases.<ref>Kumar, A., & Daume III, H., (2012) Learning Task Grouping and Overlap in Multi-Task Learning. http://icml.cc/2012/papers/690.pdf</ref> Task relatedness can be imposed a priori or learned from the data.<ref name=":1"/><ref>Jawanpuria, P., & Saketha Nath, J., (2012) A Convex Feature Learning Formulation for Latent Task Structure Discovery. http://icml.cc/2012/papers/90.pdf</ref> Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly.<ref name=":bmdl">Hajiramezanali, E. & Dadaneh, S. Z. & Karbalayghareh, A. & Zhou, Z. & Qian, X. Bayesian multi-___domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada. {{ArXiv|1810.09433}}</ref><ref>Zweig, A. & Weinshall, D. Hierarchical Regularization Cascade for Joint Learning. Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta GA, June 2013. http://www.cs.huji.ac.il/~daphna/papers/Zweig_ICML2013.pdf</ref>
For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains.<ref name=":bmdl"/>
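The sparse-basis view above can be illustrated with a small numerical sketch (an illustration only, not the algorithm of Kumar & Daume III; the basis <code>L</code> and sparse codes <code>S</code> below are invented for the example): each task's parameter vector is a linear combination of columns of a shared basis, and two tasks fall in the same group when the supports of their coefficient vectors overlap.

```python
import numpy as np

# Hypothetical shared basis: 4 latent basis vectors in a 5-dimensional
# parameter space (the columns of L). Values are arbitrary.
rng = np.random.default_rng(0)
L = rng.standard_normal((5, 4))

# Sparse codes: column t gives task t's coefficients over the basis.
# Tasks 0 and 1 share basis elements {0, 1}; task 2 uses only element 3,
# so it lies in a disjoint subspace (its own group).
S = np.array([[1.0, 0.5, 0.0],
              [0.3, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

# Each task's parameter vector is a linear combination of basis columns.
W = L @ S  # columns of W are the per-task parameter vectors

def share_basis(s, t):
    """Tasks overlap if their sets of nonzero coefficients intersect."""
    return bool(set(np.flatnonzero(S[:, s])) & set(np.flatnonzero(S[:, t])))

print(share_basis(0, 1))  # tasks 0 and 1 share basis elements -> True
print(share_basis(0, 2))  # task 2 is in a disjoint group -> False
```

Methods such as that of Kumar & Daume III learn both the basis and the sparse codes from data; here both are fixed by hand purely to show how support overlap encodes grouping.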
 
===Exploiting unrelated tasks===
* Letting <math display="inline">A^\dagger = \gamma I_T + ( \gamma - \lambda)\frac {1} T \mathbf{1}\mathbf{1}^\top </math> (where <math>I_T </math> is the ''T''×''T'' identity matrix, and <math display="inline">\mathbf{1}\mathbf{1}^\top </math> is the ''T''×''T'' matrix of ones) is equivalent to letting {{math|&Gamma;}} control the variance <math display="inline">\sum_t || f_t - \bar f|| _{\mathcal H_k} </math> of tasks from their mean <math display="inline">\frac 1 T \sum_t f_t </math>. For example, blood levels of some biomarker may be taken on {{mvar|T}} patients at <math>n_t</math> time points during the course of a day and interest may lie in regularizing the variance of the predictions across patients.
* Letting <math> A^\dagger = \alpha I_T +(\alpha - \lambda )M </math>, where <math> M_{t,s} = \frac 1 {|G_r|} \mathbb I(t,s\in G_r) </math>, is equivalent to letting <math> \alpha </math> control the variance measured with respect to a group mean: <math> \sum _{r} \sum _{t \in G_r } ||f_t - \frac 1 {|G_r|} \sum _{s\in G_r} f_s|| </math>. (Here <math> |G_r| </math> is the cardinality of group ''r'', and <math> \mathbb I </math> is the indicator function.) For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.
* Letting <math> A^\dagger = \delta I_T + (\delta -\lambda)L </math>, where <math> L=D-M</math> is the [[Laplacian matrix|Laplacian]] for the graph with adjacency matrix ''M'' giving pairwise similarities of tasks. This is equivalent to giving a larger penalty to the distance separating tasks ''t'' and ''s'' when they are more similar (according to the weight <math> M_{t,s} </math>), i.e. <math>\delta </math> regularizes <math> \sum _{t,s}||f_t - f_s ||_{\mathcal H _k }^2 M_{t,s} </math>.
* All of the above choices of ''A'' also induce the additional regularization term <math display="inline">\lambda \sum_t ||f_t|| _{\mathcal H_k} ^2 </math>, which penalizes the complexity of each task's function individually.
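The first and third penalties above can be checked numerically in a small sketch for linear task models (the task weight matrix <code>W</code> and similarity matrix <code>M</code> below are invented for illustration): the mean-variance penalty is a direct sum of squared deviations, and the pairwise graph penalty <math display="inline">\sum_{t,s} M_{t,s}\,\|w_t - w_s\|^2</math> equals the quadratic Laplacian form <math display="inline">2\,\operatorname{tr}(W L W^\top)</math> when ''M'' is symmetric.

```python
import numpy as np

# Hypothetical task parameters: columns w_t of W are T = 3 linear tasks
# in a 4-dimensional feature space.
W = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# Variance penalty: sum_t ||w_t - w_bar||^2, with w_bar the mean task.
w_bar = W.mean(axis=1, keepdims=True)
variance_penalty = np.sum((W - w_bar) ** 2)

# Graph penalty: sum_{t,s} M[t,s] * ||w_t - w_s||^2 for a symmetric
# similarity matrix M (chosen arbitrarily here).
M = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
pairwise = sum(M[t, s] * np.sum((W[:, t] - W[:, s]) ** 2)
               for t in range(3) for s in range(3))

# Equivalent quadratic form 2 * tr(W L W^T) with Laplacian L = D - M.
Lap = np.diag(M.sum(axis=1)) - M
trace_form = 2.0 * np.trace(W @ Lap @ W.T)

print(np.isclose(pairwise, trace_form))  # the two forms agree -> True
```

This equivalence is why the graph penalty can be written as a single quadratic form in the task parameters, which is what makes the regularized objective tractable.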