Line 6:
Normally, [[Data transmission|data transfer]] between programs is accomplished using [[data structures]] suited for [[Automation|automated]] processing by [[computers]], not people. Such interchange [[File format|formats]] and [[Protocol (computing)|protocols]] are typically rigidly structured, well-documented, easily [[parsing|parsed]], and designed to minimize ambiguity. Very often, these transmissions are not human-readable at all.
Thus, the key element that distinguishes data scraping from regular [[parsing]] is that the output being scraped is intended for display to an end-user, rather than as input to another program.
Data scraping is most often done either to [[Interface (computing)|interface]] to a [[legacy system]] that has no other mechanism compatible with current [[computer hardware|hardware]], or to interface to a third-party system that does not provide a more convenient [[Application programming interface|API]]. In the second case, the operator of the third-party system will often see [[screen scraping]] as unwanted, for reasons such as increased system [[load (computing)|load]], the loss of [[advertisement]] [[revenue]], or the loss of control of the information content.
Line 26:
As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized [[data processing]]. Computer-to-[[user interface]]s from that era were often simply text-based [[dumb terminal]]s which were not much more than virtual [[teleprinter]]s (such systems are still in use {{As of|2007|alt=today}}, for various reasons). The desire to interface such a system to more modern systems is common. A [[Robustness (computer science)|robust]] solution will often require things no longer available, such as [[source code]], system [[documentation]], [[Application programming interface|API]]s, or [[programmers]] with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via [[Telnet]], [[emulator|emulate]] the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise (for example change control, security, user management, data protection, operational audit, load balancing, and queue management), could be said to be an example of [[robotic process automation]] (RPA) software, or RPAAI in the case of self-guided RPA 2.0 based on [[artificial intelligence]].
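The navigate, read, and extract loop described above can be sketched in a few lines of Python. This is only an illustration: <code>FakeLegacyTerminal</code>, its menu layout, and the <code>BALANCE:</code> field are hypothetical stand-ins for a real Telnet session and a real legacy screen, which would differ from system to system.

```python
import re


class FakeLegacyTerminal:
    """Hypothetical stand-in for a Telnet session to a legacy system."""

    def __init__(self, accounts):
        self._accounts = accounts
        self._state = "menu"
        self._screen = "MAIN MENU\n1) ACCOUNT INQUIRY\nSELECT: "

    def send_keys(self, keys):
        # Emulate how the legacy application reacts to typed input.
        if self._state == "menu" and keys == "1\r":
            self._state = "inquiry"
            self._screen = "ACCOUNT INQUIRY\nACCOUNT NO: "
        elif self._state == "inquiry":
            account = keys.rstrip("\r")
            balance = self._accounts.get(account)
            if balance is None:
                self._screen = "ACCOUNT NOT FOUND\nACCOUNT NO: "
            else:
                self._screen = (
                    "ACCOUNT INQUIRY\n"
                    f"ACCOUNT NO: {account}\n"
                    f"BALANCE:    {balance:>12,.2f}\n"
                )

    def read_screen(self):
        return self._screen


def scrape_balance(terminal, account):
    """Navigate the menus like a user and pull the balance off the screen."""
    terminal.send_keys("1\r")           # choose "ACCOUNT INQUIRY"
    terminal.send_keys(account + "\r")  # type the account number
    screen = terminal.read_screen()     # text meant for a human reader
    match = re.search(r"BALANCE:\s+([\d,]+\.\d{2})", screen)
    if match is None:
        raise LookupError(f"no balance on screen for account {account!r}")
    # Convert the displayed characters into a number for the modern system.
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    term = FakeLegacyTerminal({"10042": 1234.5})
    print(scrape_balance(term, "10042"))
```

The scraper depends only on what is displayed, so any change to the screen layout (for example a renamed label) would break the regular expression. That fragility is typical of screen scraping.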
In the 1980s, financial data providers such as [[Reuters]], [[Dow Jones & Company|Telerate]], and [[Quotron]] displayed data in 24×80 format intended for a human reader. Users of this data, particularly [[Investment banking|investment banks]], wrote applications to capture this character data and convert it into numeric data for inclusion in calculations for trading decisions without [[data entry clerk|re-keying]] the data. The common term for this practice, especially in the [[United Kingdom]], was ''page shredding'', since the results could be imagined to have passed through a [[paper shredder]]. Internally, Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on [[VAX/VMS]] called the Logicizer.<ref>[
More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an [[Optical character recognition|OCR]] engine or, for some specialised automated testing systems, matching the screen's bitmap data against expected results.<ref>{{Cite journal|url = http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf|title = Sikuli: Using GUI Screenshots for Search and Automation|last = Yeh|first = Tom|date = 2009|journal = UIST|access-date = 2015-02-16|archive-date = 2010-02-14|archive-url = https://web.archive.org/web/20100214184939/http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf|url-status = dead}}</ref> In the case of [[GUI]] applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying [[Object-oriented programming|programming objects]]. A sequence of screens is automatically captured and converted into a database.
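As a rough illustration of matching screen bitmap data against an expected result, the sketch below searches a toy bitmap (a list of pixel rows) for an exact copy of a smaller pattern. The function name <code>find_pattern</code> and the pixel representation are assumptions made for this example. Practical tools generally need to tolerate small pixel differences, which exact matching does not.

```python
def find_pattern(screen, pattern):
    """Return (row, col) of the first exact match of pattern in screen, or None.

    Both arguments are lists of equal-length rows of pixel values.
    """
    rows, cols = len(screen), len(screen[0])
    p_rows, p_cols = len(pattern), len(pattern[0])
    for r in range(rows - p_rows + 1):
        for c in range(cols - p_cols + 1):
            # Compare each pattern row with the same-sized slice of the screen.
            if all(
                screen[r + i][c:c + p_cols] == pattern[i]
                for i in range(p_rows)
            ):
                return (r, c)
    return None


SCREEN = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
EXPECTED_ICON = [
    [1, 1],
    [1, 0],
]

if __name__ == "__main__":
    print(find_pattern(SCREEN, EXPECTED_ICON))
```

A test harness could use the returned position to decide where to click, or treat <code>None</code> as a failed check.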
Line 39:
===Web scraping===
{{main|Web scraping}}
[[Web page]]s are built using text-based mark-up languages ([[HTML]] and [[XHTML]]), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human [[End-user (computer science)|end-users]] and not for ease of automated use. Because of this, tool kits that scrape web content were created. A [[Web scraping|web scraper]] is an [[API]] or tool to extract data from a website.<ref>{{Cite journal |last1=Thapelo |first1=Tsaone Swaabow |last2=Namoshe |first2=Molaletsa |last3=Matsebe |first3=Oduetse |last4=Motshegwa |first4=Tshiamo |last5=Bopape |first5=Mary-Jane Morongwa |date=2021-07-28 |title=SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data |journal=Data Science Journal |language=en |volume=20 |
Recently, companies have developed web scraping systems that use DOM parsing, [[computer vision]] and [[natural language processing]] techniques to simulate the human processing that occurs when viewing a webpage, in order to automatically extract useful information.<ref>{{cite web|title=A Startup Hopes to Help Computers Understand Web Pages |date=June 1, 2012 |first1=Rachel |last1=Metz |url=https://www.technologyreview.com/2012/06/01/85817/a-startup-hopes-to-help-computers-understand-web-pages/|website=MIT Technology Review|access-date=1 December 2014}}</ref><ref>{{cite magazine|title=This Simple Data-Scraping Tool Could Change How Apps Are Made|url=https://www.wired.com/2014/03/kimono/|magazine=WIRED |date=Mar 4, 2014 |first1=Kyle |last1=VanHemert |access-date=8 May 2015|url-status=dead|archive-url=https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono|archive-date=11 May 2015}} <!-- ?! syntax error --></ref>
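A minimal sketch of markup-based extraction, using only Python's standard <code>html.parser</code> module, is shown below. The sample page, the class name <code>TableScraper</code>, and the helper <code>scrape_table</code> are invented for illustration. A real scraper would first download the page and would have to cope with much messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content, made up for this example.
PAGE = """
<html><body>
<h1>Weather</h1>
<table>
<tr><th>Station</th><th>Temp (C)</th></tr>
<tr><td>Gaborone</td><td><b>31.5</b></td></tr>
<tr><td>Windhoek</td><td>28.0</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    for row in scrape_table(PAGE):
        print(row)
```

Everything outside the table (the heading and layout markup) is ignored, which is the sense in which a scraper separates the useful data from a page designed for human readers.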
Line 52:
The legality and ethics of data scraping are often debated. Scraping publicly accessible data is generally legal; however, scraping in a manner that infringes a website's terms of service, breaches security measures, or invades user privacy can lead to legal action. Moreover, some websites specifically prohibit data scraping in their robots.txt file.
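A scraper can check a site's robots.txt rules before fetching a page. The sketch below uses Python's standard <code>urllib.robotparser</code> module. The rules, the user-agent name <code>ExampleBot</code>, and the URLs are made up for illustration, and passing this check says nothing about whether a site's terms of service permit scraping.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download the
# file from the target site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/data.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```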
==See also==
{{div col|colwidth=30}}
Line 73 ⟶ 68:
==References==
{{reflist}}