Active learning (machine learning): Difference between revisions

Content deleted Content added
WikiCleanerBot (talk | contribs)
m v2.05b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation)
Line 64:
== Scenarios ==
*'''Pool-Based Sampling''': In this approach, which is the most well known scenario,<ref>{{cite web |last1=DataRobot |title=Active learning machine learning: What it is and how it works |url=https://www.datarobot.com/blog/active-learning-machine-learning |website=DataRobot Blog |publisher=DataRobot Inc. |access-date=30 January 2024}}</ref> the learning algorithm attempts to evaluate ''the entire dataset'' before selecting data points (instances) for labeling. It is often initially trained on a fully labeled subset of the data using a machine-learning method such as logistic regression or SVM that yields class-membership probabilities for individual data instances. The candidate instances are those for which the prediction is most ambiguous.instances are drawn from the entire data pool and assigned a confidence score, a measurement of how well the learner "understands" the data. The system then selects the instances for which it is the least confident and queries the teacher for the labels. <br />The theoretical drawback of pool-based samplilng is that it is memory-intensive and is therefore limited in its capacity to handle enormous datasets, but in practice, the rate-limiting factor is that the teacher is typically a (fatiguable) human expert who must be paid for their effort, rather than computer memory.
*'''Stream-Based Selective Sampling''': Here, each consective unlabeled dinstanceinstance is examined ''one at a time'' with the machine evaluating the informativeness of each item against its query parameters. The learner decides for itself whether to assign a label or query the teacher for each datapoint. As contrasted with Pool-based sampling, the obvious drawback of stream-based methods is that the learning algorithm does not have sufficient information, early in the process, to make a sound assign-label-vs ask-teacher decision, and it does not capitalize as efficiently on the presence of already labeled data. Therefore, the teacher is likely to spend more effort in supplying labels than with the pool-based approach.
*'''Membership Query Synthesis''': This is where the learner generates synthetic data from an underlying natural distribution. For example, if the dataset are pictures of humans and animals, the learner could send a clipped image of a leg to the teacher and query if this appendage belongs to an animal or human. This is particularly useful if the dataset is small.<ref>{{Cite journal|last1=Wang|first1=Liantao|last2=Hu|first2=Xuelei|last3=Yuan|first3=Bo|last4=Lu|first4=Jianfeng|date=2015-01-05|title=Active learning via query synthesis and nearest neighbour search|url=http://espace.library.uq.edu.au/view/UQ:344582/UQ344582_OA.pdf|journal=Neurocomputing|volume=147|pages=426–434|doi=10.1016/j.neucom.2014.06.042|s2cid=3027214 }}</ref> <br />The challenge here, as with all synthetic-data-generation efforts, is in ensuring that the synthetic data is consistent in terms of meeting the constraints on real data. As the number of variables/features in the input data increase, and strong dependencies between variables exist, it becomes increasingly difficult to generate synthetic data with sufficient fidelity. <br />For example, to create a synthetic data set for human laboratory-test values, the sum of the various [[white blood cell]] (WBC) components in a [[White_blood_cell_differential|White Blood Cell differential]] must equal 100, since the component numbers are really percentages. Similarly, the enzymes [[Alanine_transaminase|Alanine Transaminase]] (ALT) and [[Aspartate_transaminase|Aspartate Transaminase]] (AST) measure liver function (though AST is also produced by other tissues, e.g., lung, pancreas) A synthetic data point with AST at the lower limit of normal range (8-33 Units/L) with an ALT several times above normal range (4-35 Units/L) in a simulated chronically ill patient would be physiologically impossible.