Data preprocessing: Difference between revisions

OAbot (talk | contribs)
m Open access bot: doi added to citation with #oabot.
Citation bot (talk | contribs)
Add: doi-access. Removed proxy/dead URL that duplicated identifier. | Use this bot. Report bugs. | Suggested by Abductive | Category:Articles needing cleanup from August 2023 | #UCB_Category 54/263
Line 13:
| pmid = 29234465
| doi = 10.1186/s13040-017-0155-3
| pmc= 5721660
| pmc= 5721660}}</ref> If there is a high proportion of irrelevant and redundant information present or noisy and unreliable data, then [[knowledge discovery]] during the training phase may be more difficult. [[Data preparation]] and filtering steps can take a considerable amount of processing time. Examples of methods used in data preprocessing include [[Data cleaning|cleaning]], [[instance selection]], [[data normalization|normalization]], [[One-hot|one-hot encoding]], [[data transformation]], [[feature extraction]] and [[feature selection]].
| doi-access = free
| pmc= 5721660}}</ref> If there is a high proportion of irrelevant and redundant information present or noisy and unreliable data, then [[knowledge discovery]] during the training phase may be more difficult. [[Data preparation]] and filtering steps can take a considerable amount of processing time. Examples of methods used in data preprocessing include [[Data cleaning|cleaning]], [[instance selection]], [[data normalization|normalization]], [[One-hot|one-hot encoding]], [[data transformation]], [[feature extraction]] and [[feature selection]].
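
Three of the methods listed above (cleaning, normalization and one-hot encoding) can be illustrated with a minimal sketch. The function names, the sample records and the choice of min-max scaling are assumptions made for illustration only; they are not taken from the cited sources, and real pipelines usually rely on a dedicated library.

```python
# Illustrative sketch only: helper names and sample data are hypothetical.

def drop_incomplete(rows):
    """Cleaning: discard records that contain a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r)]

def min_max_normalize(values):
    """Normalization: rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_encode(labels):
    """One-hot encoding: map each category to a 0/1 indicator vector."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        encoded.append(row)
    return categories, encoded

# Hypothetical raw records: (numeric measurement, categorical label).
raw = [(20.0, "red"), (None, "blue"), (30.0, "green"), (40.0, "red")]

clean = drop_incomplete(raw)
scaled = min_max_normalize([r[0] for r in clean])
categories, indicators = one_hot_encode([r[1] for r in clean])
```

Here the record with the missing measurement is dropped first, so the scaling and the set of categories are computed only from complete records.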
 
==Applications==
Line 29 ⟶ 31:
Increasingly complex problems call for more elaborate techniques to better analyze existing information.{{Fact or opinion|date=August 2023}} Instead of creating a simple script for aggregating different numerical values into a single value, it makes sense to focus on semantic-based data preprocessing.<ref>{{cite conference |title=An ontology-based framework for semantic data preprocessing aimed at human activity recognition |author=Culmone, Rosario and Falcioni, Marco and Quadrini, Michela |s2cid=196091422 |conference=SEMAPRO 2014: The Eighth International Conference on Advances in Semantic Processing. Alexey Cheptsov, High Performance Computing Center Stuttgart (HLRS) |year=2014 }}</ref> The idea is to build a dedicated [[Ontology (information science)|ontology]], which describes at a higher level what the problem is about.<ref>{{cite conference |doi=10.1007/11946465_24 |year=2006 |publisher=Springer Berlin Heidelberg |pages=262–272 |author=David Perez-Rey and Alberto Anguita and Jose Crespo |title=OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data |conference=Biological and Medical Data Analysis }}</ref> With regard to semantic data mining and semantic pre-processing, ontologies are a way to conceptualize and formally define semantic knowledge and data. [[Protégé (software)|Protégé]] is a widely used tool for constructing ontologies.{{cn|date=July 2022}} In general, the use of ontologies bridges the gaps between data, applications, algorithms, and results that arise from semantic mismatches.
As a result, semantic data mining combined with ontology has many applications where semantic ambiguity can impact the usefulness and efficiency of data systems.{{cn|date=August 2023}} Applications include the medical field, language processing, banking,<ref>{{cite book |chapter=Semantic Data Pre-Processing for Machine Learning Based Bankruptcy Prediction Computational Model |author=Yerashenia, Natalia and Bolotov, Alexander and Chan, David and Pierantoni, Gabriele |title=2020 IEEE 22nd Conference on Business Informatics (CBI) |year=2020 |pages=66–75 |publisher=IEEE |doi=10.1109/CBI49978.2020.00015 |isbn=978-1-7281-9926-9 |s2cid=219499599 |url=https://westminsterresearch.westminster.ac.uk/download/6b3387bc3e53e8c935cb4267be3c7b04fe410b5e5019edbc692a53d0b6ae4d65/3538863/CBI_2020_Yereashenia_et_al.pdf |chapter-url=https://ieeexplore.ieee.org/document/9140238}}</ref> and even tutoring,<ref>{{cite journal |title=Building Ontology-Driven Tutoring Models for Intelligent Tutoring Systems Using Data Mining |author=Chang, Maiga and D'Aniello, Giuseppe and Gaeta, Matteo and Orciuoli, Franceso and Sampson, Demetrois and Simonelli, Carmine |journal=IEEE Access |year=2020 |volume=8 |pages=48151–48162 |publisher=IEEE |doi=10.1109/ACCESS.2020.2979281 |s2cid=214594754 |doi-access=free }}</ref> among many more.
 
There are various strengths to using a semantic data mining and ontology-based approach. As previously mentioned, these tools can help during the pre-processing phase by filtering out undesirable data from the data set. Additionally, well-structured formal semantics integrated into well-designed ontologies can return powerful data that can be easily read and processed by machines.<ref>{{cite web |title=Semantic Data Mining: A Survey of Ontology-based Approaches |author=Dou, Deijing and Wang, Hao and Liu, Haishan |publisher=University of Oregon |url=http://ix.cs.uoregon.edu/~dou/research/papers/icsc15_invited.pdf |language=en-US}}</ref> A particularly useful example arises in the medical use of semantic data processing. Suppose a patient is having a medical emergency and is being rushed to hospital. The emergency responders are trying to determine the best medicine to administer. Under normal data processing, scouring all the patient’s medical data to ensure they are getting the best treatment could take too long and risk the patient’s health or even life. However, using semantically processed ontologies, the first responders could reach a decision quickly enough to save the patient’s life. Tools like a semantic reasoner can use an [[ontology (information science)|ontology]] to infer which medicine is best to administer based on the patient’s medical history, such as whether they have a certain cancer or other conditions, simply by examining the natural language used in the patient's medical records.<ref>{{cite web |title=AN ONTOLOGICAL APPROACH TO DATA MINING FOR EMERGENCY MEDICINE |author =Kahn, Atif and Doucette, John A. and Jin, Changjiu and Fu Lijie and Cohen, Robin |publisher=University of Waterloo |url=https://cs.uwaterloo.ca/~j3doucet/papers/OntApproachToDataMining.pdf}}</ref> This would allow the first responders to search for medicine quickly and efficiently without having to worry about the patient’s medical history themselves, as the semantic reasoner would already have analyzed this data and found solutions. In general, this illustrates the strength of using semantic data mining and ontologies. They allow for quicker and more efficient data extraction on the user side, as the user has fewer variables to account for, since the semantically pre-processed data and the ontology built for the data have already accounted for many of these variables. However, there are some drawbacks to this approach. Namely, it requires substantial computational power and adds complexity, even with relatively small data sets.<ref>{{cite journal|title=Semantic data mining in the information age: A systematic review |author=Sirichanya, Chanmee and Kraisak Kesorn |year=2021 |journal=International Journal of Intelligent Systems|volume=36 |issue=8 |pages=3880–3916 |doi=10.1002/int.22443 |s2cid=235506360 | url=https://onlinelibrary.wiley.com/doi/10.1002/int.22443 |language=en|doi-access=free }}</ref> This could result in higher costs and increased difficulties in building and maintaining semantic data processing systems. This can be mitigated somewhat if the data set is already well organized and formatted, but even then, the complexity is still higher when compared to standard data processing.{{tone inline|date=August 2023}}
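
The kind of inference a semantic reasoner performs in the scenario above can be sketched with a toy class hierarchy. Everything in this sketch is hypothetical: the condition and drug names are placeholders, the subclass and contraindication tables are invented for illustration, and the step of extracting conditions from natural-language records is skipped. It is not the method of the cited works, and a real system would use an ontology language such as OWL with a dedicated reasoner.

```python
# Toy ontology (hypothetical placeholders, not real medical knowledge).
# Each class points to its direct superclass.
SUBCLASS_OF = {
    "ConditionA1": "ConditionA",
    "ConditionA": "Condition",
    "ConditionB": "Condition",
}

# Hypothetical rule table: classes of condition for which a drug is unsuitable.
CONTRAINDICATED_FOR = {
    "DrugX": {"ConditionA"},
    "DrugY": {"ConditionB"},
    "DrugZ": set(),
}

def ancestors(cls):
    """Return cls together with all of its superclasses."""
    seen = []
    while cls is not None and cls not in seen:
        seen.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return set(seen)

def safe_drugs(patient_conditions):
    """Infer which drugs are not ruled out by the patient's conditions.

    A condition recorded as a subclass (e.g. ConditionA1) also counts as
    every superclass (ConditionA, Condition), which is the inference step.
    """
    inferred = set()
    for condition in patient_conditions:
        inferred |= ancestors(condition)
    return sorted(
        drug
        for drug, unsuitable in CONTRAINDICATED_FOR.items()
        if not (unsuitable & inferred)
    )

# The record only mentions the specific subtype ConditionA1, but the
# hierarchy lets the rule stated for ConditionA apply to it as well.
candidates = safe_drugs(["ConditionA1"])
```

The user queries only the final list, while the hierarchy lookup accounts for the intermediate variables, which mirrors the division of labour described above.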
 
Below is a simple diagram combining some of these processes, in particular semantic data mining and its use with ontologies.