== Scenarios ==
*'''Pool-Based Sampling''': In this approach, which is the most well-known scenario,<ref>{{cite web |last1=DataRobot |title=Active learning machine learning: What it is and how it works |url=https://www.datarobot.com/blog/active-learning-machine-learning |website=DataRobot Blog |publisher=DataRobot Inc. |access-date=30 January 2024}}</ref> the learning algorithm attempts to evaluate ''the entire dataset'' before selecting data points (instances) for labeling. It is often initially trained on a fully labeled subset of the data using a machine-learning method such as logistic regression or SVM that yields class-membership probabilities for individual data instances. Candidate instances are drawn from the entire data pool and assigned a confidence score, a measure of how well the learner "understands" the data; the candidates are those for which the prediction is most ambiguous. The system then selects the instances for which it is least confident and queries the teacher for their labels (see the first sketch after this list). <br />The theoretical drawback of pool-based sampling is that it is memory-intensive and therefore limited in its capacity to handle enormous datasets; in practice, however, the rate-limiting factor is that the teacher is typically a (fatiguable) human expert who must be paid for their effort, rather than computer memory.
*'''Stream-Based Selective Sampling''': Here, each consecutive unlabeled instance is examined one at a time, with the learner evaluating its informativeness against its query parameters and deciding on the spot whether to assign a label itself or to query the teacher (see the second sketch after this list). Because instances are never banked in a pool, memory demands stay low, but each decision must be made without knowledge of the data yet to arrive.
*'''Membership Query Synthesis''': Here, the learner generates synthetic data from an underlying natural distribution. For example, if the dataset consists of pictures of humans and animals, the learner could send a clipped image of a leg to the teacher and ask whether this appendage belongs to an animal or a human. This is particularly useful if the dataset is small.<ref>{{Cite journal|last1=Wang|first1=Liantao|last2=Hu|first2=Xuelei|last3=Yuan|first3=Bo|last4=Lu|first4=Jianfeng|date=2015-01-05|title=Active learning via query synthesis and nearest neighbour search|url=http://espace.library.uq.edu.au/view/UQ:344582/UQ344582_OA.pdf|journal=Neurocomputing|volume=147|pages=426–434|doi=10.1016/j.neucom.2014.06.042|s2cid=3027214 }}</ref> <br />The challenge here, as with all synthetic-data-generation efforts, is ensuring that the synthetic data meets the constraints that hold for real data. As the number of variables/features in the input data increases, and strong dependencies between variables exist, it becomes increasingly difficult to generate synthetic data with sufficient fidelity. <br />For example, to create a synthetic data set of human laboratory-test values, the sum of the various [[white blood cell]] (WBC) components in a [[White_blood_cell_differential|White Blood Cell differential]] must equal 100, since the component numbers are really percentages. Similarly, the enzymes [[Alanine_transaminase|Alanine Transaminase]] (ALT) and [[Aspartate_transaminase|Aspartate Transaminase]] (AST) both measure liver function (though AST is also produced by other tissues, e.g., lung and pancreas). A synthetic data point with AST at the lower limit of the normal range (8-33 Units/L) and an ALT several times above the normal range (4-35 Units/L) in a simulated chronically ill patient would be physiologically impossible (see the constraint-checking sketch after this list).
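The pool-based procedure can be illustrated with a minimal, non-authoritative sketch, assuming Python with scikit-learn and a least-confidence query criterion; the function name, model choice, and parameters are illustrative assumptions, not part of the scenario's definition:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confident_queries(X_labeled, y_labeled, X_pool, n_queries=10):
    """Return indices of the pool instances the current model is least sure about."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)            # initial training on the labeled subset
    probs = model.predict_proba(X_pool)        # class-membership probabilities per instance
    confidence = probs.max(axis=1)             # confidence = probability of the top class
    return np.argsort(confidence)[:n_queries]  # the n_queries least confident instances
</syntaxhighlight>

In a full loop, the returned instances would be labeled by the teacher, moved from the pool into the training set, and the model retrained before the next round of queries.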
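Stream-based selective sampling admits a similar sketch; the fixed confidence threshold below is an assumed query rule (real systems may adapt it over time), and the model is any classifier exposing class-membership probabilities:

<syntaxhighlight lang="python">
def stream_filter(model, stream, threshold=0.6):
    """Examine each consecutive unlabeled instance once; query the teacher
    only when the model's confidence falls below the threshold."""
    for x in stream:                                 # instances arrive one at a time
        confidence = model.predict_proba([x]).max() # how sure is the learner?
        if confidence < threshold:
            yield x                                  # forward to the teacher for a label
        # otherwise the instance is self-labeled or discarded, never revisited
</syntaxhighlight>

Each instance is seen exactly once; unlike pool-based sampling, nothing is banked, which keeps memory use constant.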
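For the laboratory-value example, a synthesis pipeline would typically include an explicit plausibility check before a generated instance is shown to the teacher. The following sketch encodes the two constraints described above; the record layout, field names, and rejection thresholds are illustrative assumptions:

<syntaxhighlight lang="python">
def is_plausible(record):
    """Reject synthetic lab records that violate known physiological constraints."""
    # WBC differential components are percentages, so they must sum to 100
    wbc_sum = sum(record["wbc_differential"].values())
    if abs(wbc_sum - 100.0) > 0.5:  # small tolerance for rounding
        return False
    # ALT far above its normal range (4-35 U/L) while AST sits at the bottom
    # of its normal range (8-33 U/L) is physiologically implausible
    if record["alt"] > 3 * 35 and record["ast"] <= 8:
        return False
    return True

candidate = {
    "wbc_differential": {"neutrophils": 60, "lymphocytes": 30,
                         "monocytes": 6, "eosinophils": 3, "basophils": 1},
    "alt": 120, "ast": 8,
}
print(is_plausible(candidate))  # False: ALT grossly elevated while AST is at the floor
</syntaxhighlight>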