Revision as of 02:05, 1 March 2024 by WikiEd303 (m →Web scraping; Tag: Visual edit)
Revision as of 14:39, 17 March 2024 by Citation bot (Alter: title, template type. Add: chapter-url, chapter, pmid, doi, authors 1-1. Removed or converted URL. Removed parameters. Some additions/deletions were parameter name changes. Suggested by Headbomb; linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox2; #UCB_webform_linked 133/686)
Line 12:
Data scraping is generally considered an ''[[ad hoc]]'', inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher [[computer programming\|programming]] and processing overhead, output displays intended for human consumption often change structure. Humans can cope with this easily, but a computer program may fail. Depending on the quality and extent of the [[error handling]] logic in the program, this failure can result in error messages, corrupted output or even [[program crash]]es. However, setting up a data scraping pipeline is now straightforward, requiring minimal programming effort to meet practical needs (especially in biomedical data integration).<ref>{{Cite journal \|last=Glez-Peña \|first=Daniel \|date=April 30, 2013 \|title=Web scraping technologies in an API world \|url=https://academic.oup.com/bib/article/15/5/788/2422275 \|journal=Briefings in Bioinformatics \|volume=15 \|issue=5 \|pages=788–797\|doi=10.1093/bib/bbt026 \|pmid=23632294 }}</ref>

==Technical variants<!--'Screen scraping' redirects here-->==
Line 39:
===Web scraping===
{{main\|Web scraping}}
[[Web page]]s are built using text-based mark-up languages ([[HTML]] and [[XHTML]]), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human [[End-user (computer science)\|end-users]] and not for ease of automated use. Because of this, toolkits that scrape web content were created.
A [[Web scraping\|web scraper]] is an [[API]] or tool that extracts data from a website.<ref>{{Cite journal \|last1=Thapelo \|first1=Tsaone Swaabow \|last2=Namoshe \|first2=Molaletsa \|last3=Matsebe \|first3=Oduetse \|last4=Motshegwa \|first4=Tshiamo \|last5=Bopape \|first5=Mary-Jane Morongwa \|date=2021-07-28 \|title=SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data \|journal=Data Science Journal \|language=en \|volume=20 \|pages=24 \|doi=10.5334/dsj-2021-024 \|s2cid=237719804 \|issn=1683-1470\|doi-access=free }}</ref> Companies like [[Amazon AWS]] and [[Google]] provide '''web scraping''' tools, services, and public data free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, [[JSON]] is commonly used as a transport mechanism between the client and the web server. A web scraper uses a website's [[URL]] to extract data, and stores this data for subsequent analysis.
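The feed-based approach described above can be illustrated with a minimal Python sketch. It assumes the server returns a JSON object with an <code>items</code> list whose entries carry <code>name</code> and <code>price</code> fields; the payload, the field names and the helper functions <code>extract_rows</code> and <code>rows_to_csv</code> are hypothetical and chosen only for illustration. A hard-coded string stands in for the server response so that the example runs without network access.

```python
import csv
import io
import json

# Hypothetical payload standing in for a JSON response from a web server.
SAMPLE_FEED = (
    '{"items": [{"name": "Widget", "price": 9.5},'
    ' {"name": "Gadget", "price": 12.0},'
    ' {"name": "Incomplete"}]}'
)


def extract_rows(feed_text):
    """Parse a JSON feed and return (name, price) tuples."""
    data = json.loads(feed_text)
    rows = []
    for item in data.get("items", []):
        # Skip records missing a required field instead of crashing,
        # since feeds can change shape without notice.
        if "name" in item and "price" in item:
            rows.append((item["name"], float(item["price"])))
    return rows


def rows_to_csv(rows):
    """Serialise the extracted rows as CSV text for subsequent analysis."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_FEED)
csv_text = rows_to_csv(rows)
print(csv_text)
```

In practice the text passed to <code>extract_rows</code> would come from an HTTP response body rather than a constant, and the stored CSV would be written to a file or database.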
This method of web scraping enables efficient and accurate extraction of data.<ref>{{Cite book \|last1=Singrodia \|first1=Vidhi \|last2=Mitra \|first2=Anirban \|last3=Paul \|first3=Subrata \|chapter=A Review on Web Scrapping and its Applications \|date=2019-01-23 \|title=2019 International Conference on Computer Communication and Informatics (ICCCI) \|chapter-url=https://ieeexplore.ieee.org/document/8821809 \|publisher=IEEE \|pages=1–6 \|doi=10.1109/ICCCI.2019.8821809 \|isbn=978-1-5386-8260-9}}</ref> Recently, companies have developed web scraping systems that rely on DOM parsing, [[computer vision]] and [[natural language processing]] to simulate the human processing that occurs when viewing a webpage, so that useful information can be extracted automatically.<ref>{{cite web\|title=A Startup Hopes to Help Computers Understand Web Pages \|date=June 1, 2012 \|first1=Rachel \|last1=Metz \|url=https://www.technologyreview.com/2012/06/01/85817/a-startup-hopes-to-help-computers-understand-web-pages/\|website=MIT Technology Review\|access-date=1 December 2014}}</ref><ref>{{cite magazine\|title=This Simple Data-Scraping Tool Could Change How Apps Are Made\|url=https://www.wired.com/2014/03/kimono/\|magazine=WIRED \|date=Mar 4, 2014 \|first1=Kyle \|last1=VanHemert \|access-date=8 May 2015\|url-status=dead\|archive-url=https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono\|archive-date=11 May 2015}} <!-- ?! syntax error --></ref>
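As a rough illustration of markup-based extraction, the following Python sketch pulls table cells out of an HTML fragment using only the standard library. It is a simplified stand-in for DOM parsing: <code>html.parser</code> is event-based and does not build a full document tree. The sample page, the <code>name</code> and <code>price</code> class names and the <code>CellScraper</code> class are assumptions made for the example, not part of any particular scraping product.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real pages are fetched by URL.
SAMPLE_PAGE = """
<html><body>
  <h1>Catalogue</h1>
  <table>
    <tr><td class="name">Widget</td><td class="price">9.50</td></tr>
    <tr><td class="name">Gadget</td><td class="price">12.00</td></tr>
  </table>
</body></html>
"""


class CellScraper(HTMLParser):
    """Collect one dict per table row from cells tagged with known classes."""

    WANTED = ("name", "price")

    def __init__(self):
        super().__init__()
        self.records = []
        self._row = {}
        self._current_class = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td":
            # attrs is a list of (name, value) pairs.
            self._current_class = dict(attrs).get("class")

    def handle_data(self, data):
        # Only keep text that sits inside a cell we care about.
        if self._current_class in self.WANTED:
            self._row[self._current_class] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._current_class = None
        elif tag == "tr" and self._row:
            self.records.append(self._row)
            self._row = {}


scraper = CellScraper()
scraper.feed(SAMPLE_PAGE)
records = scraper.records
print(records)
```

Because the scraper keys on class names, a change to the page's markup would silently yield an empty result, which is the kind of fragility attributed to scraping elsewhere in the article.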

Data scraping: Difference between revisions