Revision as of 14:10, 17 November 2022 by Brokerskine (talk | contribs), 3 edits: m Ad a scraping service tools like google or others. Tags: Reverted, Visual edit
Revision as of 14:11, 17 November 2022 by Kleuske (talk | contribs), Extended confirmed users, Pending changes reviewers, Rollbackers, 45,460 edits: m Reverted edits by Brokerskine (talk) to last version by InternetArchiveBot. Tag: Rollback
Line 38:
===Web scraping===
{{main|Web scraping}}
[[Web page]]s are built using text-based mark-up languages ([[HTML]] and [[XHTML]]), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human [[End-user (computer science)|end-users]] and not for ease of automated use. Because of this, tool kits that scrape web content were created. A [[Web scraping|web scraper]] is an [[API]] or tool to extract data from a website.<ref>{{Cite journal |last=Thapelo |first=Tsaone Swaabow |last2=Namoshe |first2=Molaletsa |last3=Matsebe |first3=Oduetse |last4=Motshegwa |first4=Tshiamo |last5=Bopape |first5=Mary-Jane Morongwa |date=2021-07-28 |title=SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL’s Weather Data |url=http://datascience.codata.org/articles/10.5334/dsj-2021-024/ |journal=Data Science Journal |language=en |volume=20 |pages=24 |doi=10.5334/dsj-2021-024 |issn=1683-1470}}</ref> Companies like [[Amazon AWS]] and [[Google]]~~ or [https://Wandalytics.com Wandalytics]~~ provide '''web scraping''' tools, services, and public data available free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, [[JSON]] is commonly used as a transport storage mechanism between the client and the webserver.
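The two approaches described in this paragraph, extracting data from mark-up and reading a JSON data feed, can be illustrated with a minimal sketch. It assumes Python and uses only the standard-library `html.parser` and `json` modules. The helper names (`LinkExtractor`, `scrape_links`, `read_feed_titles`) and the sample inputs are hypothetical and are not taken from any tool mentioned in the article.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def scrape_links(html_text):
    """Scrape link targets out of mark-up written for human readers."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


def read_feed_titles(json_text):
    """Read the same kind of data from a JSON feed (assumed: a list of objects with a "title" key)."""
    records = json.loads(json_text)
    return [record["title"] for record in records]


# Hypothetical sample inputs, standing in for a fetched page and a fetched feed.
SAMPLE_HTML = '<html><body><p>See <a href="/a">A</a> and <a href="/b">B</a>.</p></body></html>'
SAMPLE_JSON = '[{"title": "A"}, {"title": "B"}]'

if __name__ == "__main__":
    print(scrape_links(SAMPLE_HTML))
    print(read_feed_titles(SAMPLE_JSON))
```

The JSON path needs no knowledge of page layout, which is one reason a structured feed is usually less fragile to consume than scraped mark-up.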
Recently, companies have developed web scraping systems that rely on using techniques in DOM parsing, [[computer vision]] and [[natural language processing]] to simulate the human processing that occurs when viewing a webpage to automatically extract useful information.<ref>{{cite web|title=Diffbot aims to make it easier for apps to read Web pages the way humans do|url=http://www.technologyreview.com/news/428056/a-startup-hopes-to-help-computers-understand-web-pages/|website=MIT Technology Review|access-date=1 December 2014}}</ref><ref>{{cite magazine|title=This Simple Data-Scraping Tool Could Change How Apps Are Made|url=https://www.wired.com/2014/03/kimono/|magazine=WIRED|access-date=8 May 2015|url-status=dead|archive-url=https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono|archive-date=11 May 2015}} <!-- ?! syntax error --></ref>

Line 64:
==Further reading==
* Hemenway, Kevin and Calishain, Tara. ''Spidering Hacks''. Cambridge, Massachusetts: O'Reilly, 2003. {{ISBN|0-596-00577-6}}.
* Growth Hacking, Víctor Nicolau. ''How to extract emails and links from PDF''. [https://www.escueladebrokers.com/como-extraer-e-mails-y-links-de-un-pdf/#page-content Escuela de Brokers]
{{data}}

Data scraping: Difference between revisions