Scikit-learn: Difference between revisions


Revision as of 20:46, 15 December 2022 edit Laurburke (talk \| contribs) 51 edits m Add link to Github Tag: Visual edit: Switched ← Previous edit		Latest revision as of 19:06, 6 August 2025 edit undo Ishanov A (talk \| contribs) 2 edits Added the "Applications" section which outlines the real-life use cases. All applications were taken from the official Testimonial page (https://scikit-learn.org/stable/testimonials/testimonials.html). It was inspired by the "Applications" section on TenserFlow wiki page. Tag: Visual edit: Switched
(37 intermediate revisions by 29 users not shown)
Line 8:
| collapsible =
| author = [[David Cournapeau]]
| developer = [[Google Summer of Code]] project
| released = {{Start date and age|2007|06|df=yes}}
| latest release version = {{wikidata|property|reference|P348}}
Line 23:
| website = {{URL|https://scikit-learn.org/}}
}}
{{Portal|Free and open-source software}}
'''scikit-learn''' (formerly '''scikits.learn''' and also known as '''sklearn''') is a [[free and open-source]] [[machine learning]] [[Library (computing)|library]] for the [[Python (programming language)|Python]] [[programming language]].<ref name="jmlr">{{cite journal
|author1=Fabian Pedregosa
|author2=Gaël Varoquaux
Line 39 ⟶ 40:
|author14=Matthieu Perrot
|author15=Édouard Duchesnay
|title=scikit-learn: Machine Learning in Python
|journal=Journal of Machine Learning Research
|year=2011
|volume=12
|pages=2825–2830
|arxiv=1201.0490
|bibcode=2011JMLR...12.2825P
|url=http://jmlr.org/papers/v12/pedregosa11a.html
}}</ref> It features various [[statistical classification|classification]], [[regression analysis|regression]] and [[Cluster analysis|clustering]] [[Algorithm|algorithms]] including [[support vector machine|support-vector machine]]s, [[random forests]], [[gradient boosting]], [[k-means clustering|''k''-means]] and [[DBSCAN]], and is designed to interoperate with the [[Python (programming language)|Python]] numerical and scientific libraries [[NumPy]] and [[SciPy]].
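As a minimal sketch of what this interoperation can look like in practice, the snippet below fits one supervised estimator (a support-vector classifier) and one unsupervised estimator (''k''-means) directly on NumPy arrays. The toy data and the parameter choices (linear kernel, <code>n_init=10</code>, <code>random_state=0</code>) are assumptions made for illustration and do not come from the article's sources.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Toy data, made up for illustration: two well-separated groups in 2-D.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised: a support-vector classifier trained on NumPy arrays.
clf = SVC(kernel="linear").fit(X, y)
predicted = clf.predict(np.array([[0.1, 0.1], [5.1, 5.0]]))

# Unsupervised: k-means on the same array; no labels are passed.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_
```

Both estimators accept and return plain NumPy arrays, so their results can be passed on to other NumPy or SciPy routines without conversion.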
Scikit-learn is a [[NumFOCUS]] fiscally sponsored project.<ref>{{cite web|title=NumFOCUS Sponsored Projects|url=https://numfocus.org/sponsored-projects|publisher=NumFOCUS|access-date=2021-10-25}}</ref>

==Overview==
The scikit-learn project started as scikits.learn, a [[Google Summer of Code]] project by French [[data scientist]] [[David Cournapeau]]. The name of the project derives from its role as a "scientific toolkit for machine learning", originally developed and distributed as a third-party extension to [[SciPy]].<ref>{{cite web
|url=https://scikits.appspot.com/scikit-learn
|title=scikit-learn
|last1=Dreijer
|first1=Janto
}}</ref> The original [[codebase]] was later rewritten by other [[Programmer|developers]].{{Who|date=March 2025}} In 2010, contributors Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from the [[French Institute for Research in Computer Science and Automation]] in [[Plateau de Saclay|Saclay]], [[France]], took leadership of the project and released the first public version of the library on February 1, 2010.<ref>{{cite web|url=https://scikit-learn.org/stable/about.html#history|title=About us — scikit-learn 0.20.1 documentation|website=scikit-learn.org}}</ref> In November 2012, scikit-learn as well as [[scikit-image]] were described as two of the "well-maintained and popular" {{As of|2012|11|alt=scikits libraries}}.<ref>{{cite book
|author=Eli Bressert
|title=SciPy and NumPy: an overview for developers
Line 62 ⟶ 64:
|url=https://books.google.com/books?id=fLKTuJqQLVEC&pg=PA43
|page=43
|isbn=978-1-4493-6162-4
}}</ref> In 2019, it was noted that scikit-learn is one of the most popular machine learning libraries on [[GitHub]].<ref>{{Cite web|url=https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/|title=The State of the Octoverse: machine learning|date=2019-01-24|website=The GitHub Blog|publisher=[[GitHub]]|language=en-US|access-date=2019-10-17}}</ref>

== Features ==
* Large catalogue of well-established machine learning algorithms and data pre-processing methods (i.e.
[[feature engineering]])
* Utility methods for common data-science tasks, such as splitting data into [[Training, validation, and test data sets|train and test sets]], [[Cross-validation (statistics)|cross-validation]] and [[grid search]]
* Consistent way of running machine learning models ({{code|estimator.fit()|python}} and {{code|estimator.predict()|python}}), which libraries can implement
* Declarative way of structuring a data science process (the {{Code|Pipeline|Python}}), including data pre-processing and model fitting

== Examples ==
Fitting a [[Random forest|random forest classifier]]:<syntaxhighlight lang="python3" line="1">
>>> from sklearn.ensemble import RandomForestClassifier
>>> classifier = RandomForestClassifier(random_state=0)
>>> X = [[ 1,  2,  3],  # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]  # classes of each sample
>>> classifier.fit(X, y)
RandomForestClassifier(random_state=0)
</syntaxhighlight>

==Implementation==
scikit-learn is largely written in Python, and uses [[NumPy]] extensively for high-performance linear algebra and array operations. Furthermore, some core algorithms are written in [[Cython]] to improve performance. Support vector machines are implemented by a Cython wrapper around [[LIBSVM]]; logistic regression and linear support vector machines by a similar wrapper around [[LIBLINEAR]]. In such cases, extending these methods with Python may not be possible.

scikit-learn integrates well with many other Python libraries, such as [[Matplotlib]] and [[plotly]] for plotting, [[NumPy]] for array vectorization, [[Pandas (software)|Pandas]] dataframes, [[SciPy]], and many more.

== History ==
scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later that year, Matthieu Brucher joined the project and started to use it as a part of his thesis work.
In 2010, [[French Institute for Research in Computer Science and Automation|INRIA]], the [[French Institute for Research in Computer Science and Automation]], got involved and the first public release (v0.1 beta) was published in late January 2010.

== Applications ==
Scikit-learn is widely used across industries for a variety of machine learning tasks such as classification, regression, clustering, and model selection. The following are real-world applications of the library:

=== Finance and Insurance ===
* '''AXA''' uses scikit-learn to speed up the compensation process for car accidents and to detect insurance fraud.<ref name="sklearn-testimonials">{{Cite web |title=Testimonials |url=https://scikit-learn.org/stable/testimonials/testimonials.html |website=scikit-learn.org |access-date=2025-08-06}}</ref>
* '''Zopa''', a peer-to-peer lending platform, employs scikit-learn for credit risk modelling, fraud detection, marketing segmentation, and loan pricing.<ref name="sklearn-testimonials"/>
* '''BNP Paribas Cardif''' uses scikit-learn to improve the dispatching of incoming mail and manage internal model risk governance through pipelines that reduce operational and overfitting risks.<ref name="sklearn-testimonials"/>
* '''J.P.
Morgan''' reports broad usage of scikit-learn across the bank for classification tasks and predictive analytics in financial decision-making.<ref name="sklearn-testimonials"/>

=== Retail and E-Commerce ===
* '''Booking.com''' uses scikit-learn for hotel and destination recommendation systems, fraudulent reservation detection, and workforce scheduling for customer support agents.<ref name="sklearn-testimonials"/>
* '''HowAboutWe''' uses it to predict user engagement and preferences on a dating platform.<ref name="sklearn-testimonials"/>
* '''Lovely''' leverages the library to understand user behaviour and detect fraudulent activity on its platform.<ref name="sklearn-testimonials"/>
* '''Data Publica''' uses it for customer segmentation based on the success of past partnerships.<ref name="sklearn-testimonials"/>
* '''Otto Group''' integrates scikit-learn throughout its data science stack, particularly in logistics optimization and product recommendations.<ref name="sklearn-testimonials"/>

=== Media, Marketing, and Social Platforms ===
* '''Spotify''' applies scikit-learn in its recommendation systems.<ref name="sklearn-testimonials"/>
* '''Betaworks''' uses the library for both recommendation systems (e.g., for Digg) and dynamic subspace clustering applied to weather forecasting data.<ref name="sklearn-testimonials"/>
* '''PeerIndex''' used scikit-learn for missing data imputation, tweet classification, and community clustering in social media analytics.<ref name="sklearn-testimonials"/>
* '''Bestofmedia Group''' employs it for spam detection and ad click prediction.<ref name="sklearn-testimonials"/>
* '''Machinalis''' utilizes scikit-learn for click-through rate prediction and relational information extraction for content classification and advertising optimization.<ref name="sklearn-testimonials"/>
* '''Change.org''' applies scikit-learn for targeted email outreach based on user behaviour.<ref name="sklearn-testimonials"/>

=== Technology ===
* '''AWeber''' uses
scikit-learn to extract features from emails and build pipelines for managing large-scale email campaigns.<ref name="sklearn-testimonials"/>
* '''Solido''' applies it to semiconductor design tasks such as rare-event estimation and worst-case verification using statistical learning.<ref name="sklearn-testimonials"/>
* '''Evernote''', '''Dataiku''', and other tech companies employ scikit-learn in prototyping and production workflows due to its consistent API and integration with the Python ecosystem.<ref name="sklearn-testimonials"/>

=== Academia ===
* '''Télécom ParisTech''' integrates scikit-learn in hands-on coursework and assignments as part of its machine learning curriculum.<ref name="sklearn-testimonials"/>

== Version history ==
* August 2013. scikit-learn 0.14<ref name=":0" />
* July 2014. scikit-learn 0.15.0<ref name=":0" />
* March 2015. scikit-learn 0.16.0<ref name=":0" />
* November 2015. scikit-learn 0.17.0<ref name=":0">{{Cite web|url=https://scikit-learn.org/dev/whats_new.html|title=Release history — scikit-learn 0.19.dev0 documentation|website=scikit-learn.org|access-date=2017-02-27}}</ref>
* September 2016. scikit-learn 0.18.0
* July 2017. scikit-learn 0.19.0
* September 2018. scikit-learn 0.20.0<ref>{{cite web |title=Release History - 0.20.0 documentation |url=https://scikit-learn.org/stable/whats_new.html#version-0-20 |website=scikit-learn |access-date=6 November 2018}}</ref>
* May 2019. scikit-learn 0.21.0<ref>{{cite web |title=Release History - 0.21.0 documentation |url=https://scikit-learn.org/stable/whats_new.html#version-0-21-0 |website=scikit-learn |access-date=5 May 2019}}</ref>
* December 2019. scikit-learn 0.22.0<ref>{{cite web |title=Release History - 0.22.0 documentation |url=https://scikit-learn.org/dev/whats_new/v0.22.html#version-0-22-0 |website=scikit-learn |access-date=7 June 2020}}</ref>
* May 2020. scikit-learn 0.23.0<ref>{{cite web |title=Release History - 0.23.0 documentation |url=https://scikit-learn.org/dev/whats_new/v0.23.html#version-0-23-0 |website=scikit-learn |access-date=7 June 2020}}</ref>
* January 2021. scikit-learn 0.24<ref>{{Citation|title=scikit-learn: A set of python modules for machine learning and data mining|url=http://scikit-learn.org/|access-date=2021-02-08}}</ref>
* September 2021. scikit-learn 1.0<ref>{{Citation|title=scikit-learn: A set of python modules for machine learning and data mining|url=http://scikit-learn.org/|access-date=2021-09-24}}</ref>

== Awards ==
* 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize<ref>{{Cite web |title=The 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize : scikit-learn , a success story for machine learning free software {{!}} Inria |url=https://www.inria.fr/en/2019-inria-french-academy-sciences-dassault-systemes-innovation-prize-scikit-learn-success-story |access-date=2025-03-19 |website=www.inria.fr}}</ref>
* 2022 Open Science Award for Open Source Research Software<ref>{{Cite web |last=Badolato |first=Anne-Marie |date=2022-02-07 |title=Open Science Awards for Open Source Research Software |url=https://www.ouvrirlascience.fr/open-science-free-software-award-ceremony/ |access-date=2025-03-19 |website=Ouvrir la Science |language=en}}</ref>

* [[mlpy]]
* [[SpaCy]]
* [[Natural Language Toolkit|NLTK]]
* [[Orange (software)|Orange]]
* [[PyTorch]]
* [[TensorFlow]]
* [[Infer.NET]]
* [[List of numerical analysis software]]

==References==
Line 102 ⟶ 147:
{{SciPy ecosystem}}
{{differentiable computing}}
[[Category:Data mining and machine learning software]]
Line 107 ⟶ 153:
[[Category:Python (programming language) scientific libraries]]
[[Category:Software using the BSD license]]
[[Category:2010 in artificial intelligence]]
[[Category:2010 software]]