Scikit-learn: Difference between revisions

Changing short description from "Machine learning library for the Python programming language" to "Python library for machine learning"
Added the "Applications" section which outlines the real-life use cases. All applications were taken from the official Testimonial page (https://scikit-learn.org/stable/testimonials/testimonials.html). It was inspired by the "Applications" section on TensorFlow wiki page.
 
(40 intermediate revisions by 31 users not shown)
Line 8:
| collapsible =
| author = [[David Cournapeau]]
| developer = [[Google Summer of Code]] project
| released = {{Start date and age|2007|06|df=yes}}
| latest release version = {{wikidata|property|reference|P348}}
Line 23:
| website = {{URL|https://scikit-learn.org/}}
}}
{{Portal|Free and open-source software}}
'''scikit-learn''' (formerly '''scikits.learn''' and also known as '''sklearn''') is a [[free and open-source]] [[machine learning]] [[Library (computing)|library]] for the [[Python (programming language)|Python]] [[programming language]].<ref name="jmlr">{{cite journal
|author1=Fabian Pedregosa
|author2=Gaël Varoquaux
Line 39 ⟶ 40:
|author14=Matthieu Perrot
|author15=Édouard Duchesnay
|title=scikit-learn: Machine Learning in Python
|journal=Journal of Machine Learning Research
|year=2011
|volume=12
|pages=2825–2830
|arxiv=1201.0490
|bibcode=2011JMLR...12.2825P
|url=http://jmlr.org/papers/v12/pedregosa11a.html
}}</ref>
It features various [[statistical classification|classification]], [[regression analysis|regression]] and [[Cluster analysis|clustering]] [[Algorithm|algorithms]] including [[support vector machine|support-vector machine]]s, [[random forests]], [[gradient boosting]], [[k-means clustering|''k''-means]] and [[DBSCAN]], and is designed to interoperate with the [[Python (programming language)|Python]] numerical and scientific libraries [[NumPy]] and [[SciPy]]. Scikit-learn is a [[NumFOCUS]] fiscally sponsored project.<ref>{{cite web|title=NumFOCUS Sponsored Projects|url=https://numfocus.org/sponsored-projects|publisher=NumFOCUS|access-date=2021-10-25}}</ref>
 
==Overview==
The scikit-learn project started as scikits.learn, a [[Google Summer of Code]] project by French [[data scientist]] [[David Cournapeau]]. The name of the project derives from its role as a "scientific toolkit for machine learning", originally developed and distributed as a third-party extension to [[SciPy]].<ref>{{cite web
|url=https://scikits.appspot.com/scikit-learn
|title=scikit-learn
|last1=Dreijer
|first1=Janto
}}</ref> The original [[codebase]] was later rewritten by other [[Programmer|developers]].{{Who|date=March 2025}} In 2010, contributors Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from the [[French Institute for Research in Computer Science and Automation]] in [[Plateau de Saclay|Saclay]], [[France]], took leadership of the project and released the first public version of the library on February 1, 2010.<ref>{{cite web|url=https://scikit-learn.org/stable/about.html#history|title=About us — scikit-learn 0.20.1 documentation|website=scikit-learn.org}}</ref> In November 2012, scikit-learn as well as [[scikit-image]] were described as two of the "well-maintained and popular" {{As of|2012|11|alt=scikits libraries}}.<ref>{{cite book
|author=Eli Bressert
|title=SciPy and NumPy: an overview for developers
Line 62 ⟶ 64:
|url=https://books.google.com/books?id=fLKTuJqQLVEC&pg=PA43
|page=43
|isbn=978-1-4493-6162-4
}}</ref> In 2019, it was noted that scikit-learn is one of the most popular machine learning libraries on [[GitHub]].<ref>{{Cite web|url=https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/|title=The State of the Octoverse: machine learning|date=2019-01-24|website=The GitHub Blog|publisher=[[GitHub]]|language=en-US|access-date=2019-10-17}}</ref>
 
== Features ==
 
* Large catalogue of well-established machine learning algorithms and data pre-processing methods (i.e. [[feature engineering]])
* Utility methods for common data-science tasks, such as splitting data into [[Training, validation, and test data sets|train and test sets]], [[Cross-validation (statistics)|cross-validation]] and [[grid search]]
* Consistent way of running machine learning models ({{code|estimator.fit()|python}} and {{code|estimator.predict()|python}}), which libraries can implement
* Declarative way of structuring a data science process (the {{Code|Pipeline|Python}}), including data pre-processing and model fitting
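
As a minimal sketch of how the features listed above can fit together, the following combines a train/test split, a {{Code|Pipeline|Python}}, cross-validated grid search and the {{code|fit()|python}}/{{code|predict()|python}} interface. The dataset (the bundled iris data), the step names <code>scale</code> and <code>model</code>, and the parameter grid are illustrative assumptions, not taken from the sources cited in this article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load a small bundled dataset and hold out a quarter of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pre-processing and model fitting declared as one pipeline.
# The step names "scale" and "model" are arbitrary labels chosen here.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid search over the regularisation strength with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid={"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# The fitted search object exposes the same predict/score interface.
predictions = search.predict(X_test)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Because the scaler sits inside the pipeline, it is refitted on each cross-validation training fold, so the held-out fold does not influence the scaling.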
 
== Examples ==
Fitting a [[Random forest|random forest classifier]]:<syntaxhighlight lang="python3" line="1">
>>> from sklearn.ensemble import RandomForestClassifier
>>> classifier = RandomForestClassifier(random_state=0)
>>> X = [[ 1, 2, 3], # 2 samples, 3 features
... [11, 12, 13]]
>>> y = [0, 1] # classes of each sample
>>> classifier.fit(X, y)
RandomForestClassifier(random_state=0)
</syntaxhighlight>
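
Once fitted, the same estimator can be used to predict classes for unseen samples. The sketch below repeats the fit as a plain script and then calls {{code|predict()|python}} and {{code|predict_proba()|python}}; the new sample values are arbitrary and chosen only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Same toy data as in the example above: 2 samples, 3 features.
X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]

classifier = RandomForestClassifier(random_state=0)
classifier.fit(X, y)

# Hypothetical unseen samples with the same number of features.
new_samples = [[4, 5, 6], [14, 15, 16]]

predictions = classifier.predict(new_samples)          # one class label per sample
probabilities = classifier.predict_proba(new_samples)  # one row per sample, one column per class
print(predictions)
print(probabilities)
```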
 
==Implementation==
scikit-learn is largely written in Python and uses [[NumPy]] extensively for high-performance linear algebra and array operations. Furthermore, some core algorithms are written in [[Cython]] to improve performance. Support vector machines are implemented by a Cython wrapper around [[LIBSVM]]; logistic regression and linear support vector machines by a similar wrapper around [[LIBLINEAR]]. In such cases, extending these methods with Python may not be possible.
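
From the user's side the wrapped libraries are reached through ordinary estimator classes. The following is a rough sketch, assuming that {{code|SVC|python}} is the LIBSVM-backed class and that {{code|LinearSVC|python}} and {{code|LogisticRegression(solver="liblinear")|python}} are the LIBLINEAR-backed ones, as the scikit-learn documentation describes them; the synthetic binary dataset is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

# Small synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Estimators that delegate the numerical work to the wrapped C/C++ libraries.
estimators = {
    "SVC (LIBSVM)": SVC(kernel="rbf"),
    "LinearSVC (LIBLINEAR)": LinearSVC(),
    "LogisticRegression (LIBLINEAR solver)": LogisticRegression(solver="liblinear"),
}

# All three share the same fit/score interface despite the different backends.
scores = {}
for name, estimator in estimators.items():
    estimator.fit(X, y)
    scores[name] = estimator.score(X, y)
    print(f"{name}: training accuracy {scores[name]:.2f}")
```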
 
scikit-learn integrates well with many other Python libraries, such as [[Matplotlib]] and [[plotly]] for plotting, [[NumPy]] for array vectorization, [[Pandas (software)|Pandas]] dataframes, [[SciPy]], and many more.
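
As one small illustration of this interoperability, many estimators accept either a NumPy array or a SciPy sparse matrix as input. The sketch below fits the same {{code|Ridge|python}} model on both representations of randomly generated data; the data, the coefficients and the choice of estimator are assumptions made purely for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

# Randomly generated regression data held in a dense NumPy array.
rng = np.random.default_rng(0)
X_dense = rng.normal(size=(50, 4))
y = X_dense @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)

# The same data stored as a SciPy compressed sparse row matrix.
X_sparse = sparse.csr_matrix(X_dense)

# The estimator is fitted the same way on either input type.
dense_model = Ridge(alpha=1.0).fit(X_dense, y)
sparse_model = Ridge(alpha=1.0).fit(X_sparse, y)

dense_predictions = dense_model.predict(X_dense)
sparse_predictions = sparse_model.predict(X_sparse)

# The two fits may use different solvers internally, so compare approximately.
print(np.max(np.abs(dense_predictions - sparse_predictions)))
```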
 
== History ==
scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later that year, Matthieu Brucher joined the project and started to use it as a part of his thesis work. In 2010, [[French Institute for Research in Computer Science and Automation|INRIA]], the [[French Institute for Research in Computer Science and Automation]], got involved and the first public release (v0.1 beta) was published in late January 2010.
 
== Applications ==
Scikit-learn is widely used across industries for a variety of machine learning tasks such as classification, regression, clustering, and model selection. The following are real-world applications of the library:
 
=== Finance and Insurance ===
 
* '''AXA''' uses scikit-learn to speed up the compensation process for car accidents and to detect insurance fraud.<ref name="sklearn-testimonials">{{Cite web |title=Testimonials |url=https://scikit-learn.org/stable/testimonials/testimonials.html |website=scikit-learn.org |access-date=2025-08-06}}</ref>
* '''Zopa''', a peer-to-peer lending platform, employs scikit-learn for credit risk modelling, fraud detection, marketing segmentation, and loan pricing.<ref name="sklearn-testimonials"/>
* '''BNP Paribas Cardif''' uses scikit-learn to improve the dispatching of incoming mail and manage internal model risk governance through pipelines that reduce operational and overfitting risks.<ref name="sklearn-testimonials"/>
 
* '''J.P. Morgan''' reports broad usage of scikit-learn across the bank for classification tasks and predictive analytics in financial decision-making.<ref name="sklearn-testimonials"/>
 
=== Retail and E-Commerce ===
 
* '''Booking.com''' uses scikit-learn for hotel and destination recommendation systems, fraudulent reservation detection, and workforce scheduling for customer support agents.<ref name="sklearn-testimonials"/>
* '''HowAboutWe''' uses it to predict user engagement and preferences on a dating platform.<ref name="sklearn-testimonials"/>
* '''Lovely''' leverages the library to understand user behaviour and detect fraudulent activity on its platform.<ref name="sklearn-testimonials"/>
* '''Data Publica''' uses it for customer segmentation based on the success of past partnerships.<ref name="sklearn-testimonials"/>
 
* '''Otto Group''' integrates scikit-learn throughout its data science stack, particularly in logistics optimization and product recommendations.<ref name="sklearn-testimonials"/>
 
=== Media, Marketing, and Social Platforms ===
 
* '''Spotify''' applies scikit-learn in its recommendation systems.<ref name="sklearn-testimonials"/>
* '''Betaworks''' uses the library for both recommendation systems (e.g., for Digg) and dynamic subspace clustering applied to weather forecasting data.<ref name="sklearn-testimonials"/>
* '''PeerIndex''' used scikit-learn for missing data imputation, tweet classification, and community clustering in social media analytics.<ref name="sklearn-testimonials"/>
* '''Bestofmedia Group''' employs it for spam detection and ad click prediction.<ref name="sklearn-testimonials"/>
 
* '''Machinalis''' utilizes scikit-learn for click-through rate prediction and relational information extraction for content classification and advertising optimization.<ref name="sklearn-testimonials"/>
* '''Change.org''' applies scikit-learn for targeted email outreach based on user behaviour.<ref name="sklearn-testimonials"/>
 
=== Technology ===
 
* '''AWeber''' uses scikit-learn to extract features from emails and build pipelines for managing large-scale email campaigns.<ref name="sklearn-testimonials"/>
* '''Solido''' applies it to semiconductor design tasks such as rare-event estimation and worst-case verification using statistical learning.<ref name="sklearn-testimonials"/>
 
* '''Evernote''', '''Dataiku''', and other tech companies employ scikit-learn in prototyping and production workflows due to its consistent API and integration with the Python ecosystem.<ref name="sklearn-testimonials"/>
 
=== Academia ===
 
* '''Télécom ParisTech''' integrates scikit-learn in hands-on coursework and assignments as part of its machine learning curriculum.<ref name="sklearn-testimonials"/>
 
== Version history ==
* August 2013. scikit-learn 0.14<ref name=":0" />
* July 2014. scikit-learn 0.15.0<ref name=":0" />
* March 2015. scikit-learn 0.16.0<ref name=":0" />
* November 2015. scikit-learn 0.17.0<ref name=":0">{{Cite web|url=https://scikit-learn.org/dev/whats_new.html|title=Release history — scikit-learn 0.19.dev0 documentation|website=scikit-learn.org|access-date=2017-02-27}}</ref>
* September 2016. scikit-learn 0.18.0
* July 2017. scikit-learn 0.19.0
* September 2018. scikit-learn 0.20.0<ref>{{cite web |title=Release History - 0.20.0 documentation |url=https://scikit-learn.org/stable/whats_new.html#version-0-20 |website=scikit-learn |access-date=6 November 2018}}</ref>
* May 2019. scikit-learn 0.21.0<ref>{{cite web |title=Release History - 0.21.0 documentation |url=https://scikit-learn.org/stable/whats_new.html#version-0-21-0 |website=scikit-learn |access-date=5 May 2019}}</ref>
* December 2019. scikit-learn 0.22.0<ref>{{cite web |title=Release History - 0.22.0 documentation |url=https://scikit-learn.org/dev/whats_new/v0.22.html#version-0-22-0 |website=scikit-learn |access-date=7 June 2020}}</ref>
* May 2020. scikit-learn 0.23.0<ref>{{cite web |title=Release History - 0.23.0 documentation |url=https://scikit-learn.org/dev/whats_new/v0.23.html#version-0-23-0 |website=scikit-learn |access-date=7 June 2020}}</ref>
* January 2021. scikit-learn 0.24<ref>{{Citation|title=scikit-learn: A set of python modules for machine learning and data mining|url=http://scikit-learn.org/|access-date=2021-02-08}}</ref>
* September 2021. scikit-learn 1.0<ref>{{Citation|title=scikit-learn: A set of python modules for machine learning and data mining|url=http://scikit-learn.org/|access-date=2021-09-24}}</ref>

== Awards ==
* 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize<ref>{{Cite web |title=The 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize : scikit-learn , a success story for machine learning free software {{!}} Inria |url=https://www.inria.fr/en/2019-inria-french-academy-sciences-dassault-systemes-innovation-prize-scikit-learn-success-story |access-date=2025-03-19 |website=www.inria.fr}}</ref>
* 2022 Open Science Award for Open Source Research Software<ref>{{Cite web |last=Badolato |first=Anne-Marie |date=2022-02-07 |title=Open Science Awards for Open Source Research Software |url=https://www.ouvrirlascience.fr/open-science-free-software-award-ceremony/ |access-date=2025-03-19 |website=Ouvrir la Science |language=en}}</ref>

==See also==
* [[mlpy]]
* [[SpaCy]]
* [[Natural Language Toolkit|NLTK]]
* [[Orange (software)|Orange]]
* [[PyTorch]]
* [[TensorFlow]]
* [[Infer.NET]]
* [[List of numerical analysis software]]
 
==References==
Line 99 ⟶ 144:
==External links==
* {{Official website|https://scikit-learn.org/}}
* {{GitHub|https://github.com/scikit-learn}}
 
{{SciPy ecosystem}}
{{differentiable computing}}
 
[[Category:Data mining and machine learning software]]
Line 106 ⟶ 153:
[[Category:Python (programming language) scientific libraries]]
[[Category:Software using the BSD license]]
[[Category:2010 in artificial intelligence]]
[[Category:2010 software]]