Data scraping: Difference between revisions

Tag: Reverted
Reverted 3 edits by 136.185.189.153 (talk): The user has added a name that has nothing to do with the article. It's definitely not in good faith, though.
Line 37:
 
===Web scraping===
{{main | Web scraping}}
[[Web page]]s are built using text-based mark-up languages ([[HTML]] and [[XHTML]]), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human [[End-user (computer science)|end-users]] and not for ease of automated use. As a result, toolkits that scrape web content were created. A [[Web scraping | web scraper]] is an [[API]] or tool that extracts data from a website.<ref>{{Cite journal |last1=Thapelo |first1 = Tsaone Swaabow |last2 = Namoshe |first2 = Molaletsa |last3 = Matsebe |first3 = Oduetse | last4 = Motshegwa |first4 = Tshiamo |last5 = Bopape |first5=Mary-Jane Morongwa |date=2021-07-28 |title=SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data |journal=Data Science Journal |language = en |volume=20 |pages=24 |doi = 10.5334/dsj-2021-024 |s2cid = 237719804 |issn = 1683-1470|doi-access=free }}</ref> Companies like [[Amazon AWS]] and [[Google]] provide '''web scraping''' tools, services, and public data free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, [[JSON]] is commonly used as a data transport format between the client and the web server.
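
The two approaches described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not taken from any particular scraping toolkit: the class <code>LinkTextScraper</code>, the functions <code>scrape_links</code> and <code>read_feed</code>, the sample markup, and the shape of the JSON feed (an <code>items</code> list with <code>title</code> and <code>price</code> fields) are all hypothetical and chosen for illustration.

```python
import json
from html.parser import HTMLParser


class LinkTextScraper(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_links(html):
    """Scrape link targets and their visible text out of an HTML string."""
    parser = LinkTextScraper()
    parser.feed(html)
    parser.close()
    return parser.links


def read_feed(payload):
    """Read the same kind of data from a JSON feed (hypothetical schema)."""
    return [(item["title"], item["price"]) for item in json.loads(payload)["items"]]


# Illustrative inputs only; a real scraper would fetch these over HTTP.
SAMPLE_HTML = '<ul><li><a href="/a">Item A</a></li><li><a href="/b">Item <b>B</b></a></li></ul>'
SAMPLE_FEED = '{"items": [{"title": "Item A", "price": 3.5}]}'

if __name__ == "__main__":
    print(scrape_links(SAMPLE_HTML))
    print(read_feed(SAMPLE_FEED))
```

The HTML path has to recover structure from markup intended for display, whereas the JSON path reads fields that are already structured, which is one reason data feeds can be easier to consume when a site exposes them.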
 
Recently, companies have developed web scraping systems that use DOM parsing, [[computer vision]] and [[natural language processing]] to simulate how a human processes a webpage, so that useful information can be extracted automatically.<ref>{{cite web|title = Diffbot aims to make it easier for apps to read Web pages the way humans do|url=http://www.technologyreview.com/news/428056/a-startup-hopes-to-help-computers-understand-web-pages/|website=MIT Technology Review | access-date=1 December 2014}}</ref><ref>{{cite magazine | title=This Simple Data-Scraping Tool Could Change How Apps Are Made|url=https://www.wired.com/2014/03/kimono/|magazine=WIRED|access-date=8 May 2015|url-status=dead|archive-url=https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono|archive-date=11 May 2015}} <!-- ?! syntax error --></ref>
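
Of the techniques named above, DOM parsing is the simplest to show in a few lines. The following is a rough sketch only and does not describe how any of the cited systems work. It assumes well-formed XHTML, so that the standard-library XML parser can build the tree, and it covers neither computer vision nor natural language processing. The sample markup, the <code>product</code> and <code>price</code> class names, and the function <code>extract_products</code> are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed XHTML. Real-world HTML is often not well-formed
# and would need a more lenient parser than the one used here.
SAMPLE_XHTML = """<html><body>
<div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
<div class="product"><h2>Gadget</h2><span class="price">24.50</span></div>
</body></html>"""


def extract_products(xhtml):
    """Build an element tree and pull out name/price pairs by structure."""
    root = ET.fromstring(xhtml)
    products = []
    # Select every <div class="product"> anywhere in the tree.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("h2")
        price = node.findtext("span[@class='price']")
        products.append({"name": name, "price": float(price)})
    return products


if __name__ == "__main__":
    for product in extract_products(SAMPLE_XHTML):
        print(product["name"], product["price"])
```

Because the extraction relies on the position and attributes of elements in the tree, it breaks when a site changes its markup. Approaches that interpret a page as a human would are intended to reduce that dependence.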
 
Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests that an IP address or IP network may send. This has led to an ongoing battle between website developers and scraper developers.<ref>{{Cite web|url=https://support.google.com/websearch/answer/86640?hl=en|title="Unusual traffic from your computer network" - Search Help|website=support.google.com|language=en|access-date=2017-04-04}}</ref>
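
The text above does not say which defensive algorithms websites use. As one plausible illustration, a per-address request limit can be sketched with a sliding time window. The class <code>RequestLimiter</code>, its parameters, and the example addresses are assumptions made for this sketch and do not represent any specific site's defence.

```python
from collections import defaultdict, deque


class RequestLimiter:
    """Allows at most max_requests per client address in any sliding window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # address -> timestamps of accepted requests

    def allow(self, ip, now):
        """Return True if a request from ip at time now (seconds) is accepted."""
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject without recording the request
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = RequestLimiter(max_requests=2, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 10.0):
        print(t, limiter.allow("203.0.113.5", t))
```

Timestamps are passed in explicitly so that the behaviour is easy to test. A deployed limiter would typically read a monotonic clock and might key on an IP network rather than a single address, as the paragraph above notes.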