| collapsible =
| author = [[David Cournapeau]]
| developer = [[Google Summer of Code]] project
| released = {{Start date and age|2007|06|df=yes}}
| latest release version = {{wikidata|property|reference|P348}}
| website = {{URL|https://scikit-learn.org/}}
}}
{{Portal|Free and open-source software}}
'''scikit-learn''' (formerly '''scikits.learn''' and also known as '''sklearn''') is a [[free and open-source]] [[machine learning]] [[Library (computing)|library]] for the [[Python (programming language)|Python]] [[programming language]].<ref name="jmlr">{{cite journal
|author1=Fabian Pedregosa
|author2=Gaël Varoquaux
|author14=Matthieu Perrot
|author15=Édouard Duchesnay
|title=
|journal=Journal of Machine Learning Research
|year=2011
|volume=12
|pages=2825–2830
|arxiv=1201.0490
|bibcode=2011JMLR...12.2825P
|url=http://jmlr.org/papers/v12/pedregosa11a.html
}}</ref>
It features various [[statistical classification|classification]], [[regression analysis|regression]] and [[Cluster analysis|clustering]] [[Algorithm|algorithms]] including [[support vector machine|support-vector machine]]s, [[random forests]], [[gradient boosting]], [[k-means clustering|''k''-means]] and [[DBSCAN]], and is designed to interoperate with the [[Python (programming language)|Python]] numerical and scientific libraries [[NumPy]] and [[SciPy]]. Scikit-learn is a [[NumFOCUS]] fiscally sponsored project.<ref>{{cite web|title=NumFOCUS Sponsored Projects|url=https://numfocus.org/sponsored-projects|publisher=NumFOCUS|access-date=2021-10-25}}</ref>
==Overview==
The scikit-learn project started as scikits.learn, a [[Google Summer of Code]] project by French [[data scientist]] [[David Cournapeau]].<ref>{{cite web
|url=https://scikits.appspot.com/scikit-learn
|title=scikit-learn
|last1=Dreijer
|first1=Janto
}}</ref> The original [[codebase]] was later rewritten by other [[Programmer|developers]].{{Who|date=March 2025}} In 2010, contributors Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from the [[French Institute for Research in Computer Science and Automation]] in [[Plateau de Saclay|Saclay]], [[France]], took leadership of the project and released the first public version of the library on February 1, 2010.<ref>{{cite web|url=https://scikit-learn.org/stable/about.html#history|title=About us — scikit-learn 0.20.1 documentation|website=scikit-learn.org}}</ref> In November 2012, scikit-learn as well as [[scikit-image]] were described as two of the "well-maintained and popular" {{As of|2012|11|alt=scikits libraries}}.<ref>{{cite book
|author=Eli Bressert
|title=SciPy and NumPy: an overview for developers
|url=https://books.google.com/books?id=fLKTuJqQLVEC&pg=PA43
|page=43
|isbn=978-1-4493-6162-4
}}</ref> In 2019, it was noted that scikit-learn is one of the most popular machine learning libraries on [[GitHub]].<ref>{{Cite web|url=https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/|title=The State of the Octoverse: machine learning|date=2019-01-24|website=The GitHub Blog|publisher=[[GitHub]]|language=en-US|access-date=2019-10-17}}</ref>
== Features ==
* Large catalogue of well-established machine learning algorithms and data pre-processing methods (for example, for [[feature engineering]])
* Utility methods for common data-science tasks, such as splitting data into [[Training, validation, and test data sets|train and test sets]], [[Cross-validation (statistics)|cross-validation]] and [[grid search]]
* Consistent interface for running machine learning models ({{code|estimator.fit()|python}} and {{code|estimator.predict()|python}}), which other libraries can also implement
* Declarative way of structuring a data science process (the {{Code|Pipeline|Python}}), including data pre-processing and model fitting
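The features above can be combined in a few lines. The following is a minimal sketch rather than a canonical recipe: the bundled iris dataset, the choice of {{code|StandardScaler|python}} and {{code|SVC|python}}, the step names {{code|"scale"|python}} and {{code|"svc"|python}}, and the candidate values for {{code|C|python}} are all assumptions made for illustration.

<syntaxhighlight lang="python3">
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load a small bundled dataset and hold out a quarter of it as a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Declare pre-processing and model fitting as a single pipeline.
# The step names "scale" and "svc" are illustrative choices.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC()),
])

# Grid search with 5-fold cross-validation over the SVC's C parameter.
search = GridSearchCV(pipeline, param_grid={"svc__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# The fitted search object exposes the same predict() and score() methods.
predictions = search.predict(X_test)
test_accuracy = search.score(X_test, y_test)
</syntaxhighlight>

Because the scaler sits inside the pipeline, it is refitted on the training portion of each cross-validation fold, so no statistics from the held-out fold leak into pre-processing.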
== Examples ==
Fitting a [[Random forest|random forest classifier]]:<syntaxhighlight lang="python3" line="1">
>>> from sklearn.ensemble import RandomForestClassifier
>>> classifier = RandomForestClassifier(random_state=0)
>>> X = [[ 1, 2, 3], # 2 samples, 3 features
... [11, 12, 13]]
>>> y = [0, 1] # classes of each sample
>>> classifier.fit(X, y)
RandomForestClassifier(random_state=0)
</syntaxhighlight>
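Once fitted, the same estimator can be used for prediction through {{code|predict()|python}}. The sketch below repeats the fit as a plain script so that it is self-contained; the two new samples passed to {{code|predict()|python}} are hypothetical values chosen for illustration.

<syntaxhighlight lang="python3">
from sklearn.ensemble import RandomForestClassifier

# Same toy data as above: 2 samples, 3 features, one class label per sample.
X = [[1, 2, 3],
     [11, 12, 13]]
y = [0, 1]

classifier = RandomForestClassifier(random_state=0)
classifier.fit(X, y)

# Predict the classes of the training samples.
train_predictions = classifier.predict(X)

# Predict the classes of new, unseen samples (hypothetical values).
new_predictions = classifier.predict([[4, 5, 6], [14, 15, 16]])
</syntaxhighlight>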
==Implementation==
scikit-learn integrates with many other Python libraries, such as [[Matplotlib]] and [[plotly]] for plotting, [[NumPy]] for array vectorization, [[Pandas (software)|Pandas]] for dataframes, and [[SciPy]].
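As a minimal sketch of this interoperability, an estimator can be fitted either on a Pandas dataframe or on the equivalent NumPy arrays, and fitted attributes and predictions come back as NumPy arrays. The toy data, the column names {{code|x1|python}} and {{code|x2|python}}, and the choice of {{code|LinearRegression|python}} are assumptions made for illustration.

<syntaxhighlight lang="python3">
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data in which the target is exactly 2*x1 + 3*x2.
frame = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0],
                      "x2": [1.0, 0.0, 1.0, 0.0]})
target = 2.0 * frame["x1"] + 3.0 * frame["x2"]

# Fit one model on the dataframe and one on the underlying NumPy arrays.
model_df = LinearRegression().fit(frame, target)
model_np = LinearRegression().fit(frame.to_numpy(), target.to_numpy())

# Fitted coefficients and predictions are returned as NumPy arrays.
coefficients = model_df.coef_
predictions = model_df.predict(frame)
</syntaxhighlight>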
== History ==
scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later that year, Matthieu Brucher joined the project and started to use it as a part of his thesis work. In 2010, the [[French Institute for Research in Computer Science and Automation]] (INRIA) got involved, and the first public release (v0.1 beta) was published in late January 2010.
== Applications ==
Scikit-learn is used across industries for machine learning tasks such as classification, regression, clustering, and model selection. The following uses are reported on the project's testimonials page:
=== Finance and Insurance ===
* '''AXA''' uses scikit-learn to speed up the compensation process for car accidents and to detect insurance fraud.<ref name="sklearn-testimonials">{{Cite web |title=Testimonials |url=https://scikit-learn.org/stable/testimonials/testimonials.html |website=scikit-learn.org |access-date=2025-08-06}}</ref>
* '''Zopa''', a peer-to-peer lending platform, employs scikit-learn for credit risk modelling, fraud detection, marketing segmentation, and loan pricing.<ref name="sklearn-testimonials"/>
* '''BNP Paribas Cardif''' uses scikit-learn to improve the dispatching of incoming mail and manage internal model risk governance through pipelines that reduce operational and overfitting risks.<ref name="sklearn-testimonials"/>
* '''J.P. Morgan''' reports broad usage of scikit-learn across the bank for classification tasks and predictive analytics in financial decision-making.<ref name="sklearn-testimonials"/>
=== Retail and E-Commerce ===
* '''Booking.com''' uses scikit-learn for hotel and destination recommendation systems, fraudulent reservation detection, and workforce scheduling for customer support agents.<ref name="sklearn-testimonials"/>
* '''HowAboutWe''' uses it to predict user engagement and preferences on a dating platform.<ref name="sklearn-testimonials"/>
* '''Lovely''' leverages the library to understand user behaviour and detect fraudulent activity on its platform.<ref name="sklearn-testimonials"/>
* '''Data Publica''' uses it for customer segmentation based on the success of past partnerships.<ref name="sklearn-testimonials"/>
* '''Otto Group''' integrates scikit-learn throughout its data science stack, particularly in logistics optimization and product recommendations.<ref name="sklearn-testimonials"/>
=== Media, Marketing, and Social Platforms ===
* '''Spotify''' applies scikit-learn in its recommendation systems.<ref name="sklearn-testimonials"/>
* '''Betaworks''' uses the library for both recommendation systems (e.g., for Digg) and dynamic subspace clustering applied to weather forecasting data.<ref name="sklearn-testimonials"/>
* '''PeerIndex''' used scikit-learn for missing data imputation, tweet classification, and community clustering in social media analytics.<ref name="sklearn-testimonials"/>
* '''Bestofmedia Group''' employs it for spam detection and ad click prediction.<ref name="sklearn-testimonials"/>
* '''Machinalis''' utilizes scikit-learn for click-through rate prediction and relational information extraction for content classification and advertising optimization.<ref name="sklearn-testimonials"/>
* '''Change.org''' applies scikit-learn for targeted email outreach based on user behaviour.<ref name="sklearn-testimonials"/>
=== Technology ===
* '''AWeber''' uses scikit-learn to extract features from emails and build pipelines for managing large-scale email campaigns.<ref name="sklearn-testimonials"/>
* '''Solido''' applies it to semiconductor design tasks such as rare-event estimation and worst-case verification using statistical learning.<ref name="sklearn-testimonials"/>
* '''Evernote''', '''Dataiku''', and other tech companies employ scikit-learn in prototyping and production workflows due to its consistent API and integration with the Python ecosystem.<ref name="sklearn-testimonials"/>
=== Academia ===
* '''Télécom ParisTech''' integrates scikit-learn in hands-on coursework and assignments as part of its machine learning curriculum.<ref name="sklearn-testimonials"/>
== Awards ==
* 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize<ref>{{Cite web |title=The 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize : scikit-learn , a success story for machine learning free software {{!}} Inria |url=https://www.inria.fr/en/2019-inria-french-academy-sciences-dassault-systemes-innovation-prize-scikit-learn-success-story |access-date=2025-03-19 |website=www.inria.fr}}</ref>
* 2022 Open Science Award for Open Source Research Software<ref>{{Cite web |last=Badolato |first=Anne-Marie |date=2022-02-07 |title=Open Science Awards for Open Source Research Software |url=https://www.ouvrirlascience.fr/open-science-free-software-award-ceremony/ |access-date=2025-03-19 |website=Ouvrir la Science |language=en}}</ref>
==References==
{{SciPy ecosystem}}
{{differentiable computing}}
[[Category:Data mining and machine learning software]]
[[Category:Python (programming language) scientific libraries]]
[[Category:Software using the BSD license]]
[[Category:2010 in artificial intelligence]]
[[Category:2010 software]]