Normally, [[Data transmission|data transfer]] between programs is accomplished using [[data structures]] suited for [[Automation|automated]] processing by [[computers]], not people. Such interchange [[File format|formats]] and [[Protocol (computing)|protocols]] are typically rigidly structured, well-documented, easily [[parsing|parsed]], and minimize ambiguity. Very often, these transmissions are not human-readable at all.
Thus, the key element that distinguishes data scraping from regular [[parsing]] is that the
Data scraping is most often done either to [[Interface (computing)|interface]] to a [[legacy system]] that has no other mechanism compatible with current [[computer hardware|hardware]], or to interface to a third-party system that does not provide a more convenient [[Application programming interface|API]]. In the second case, the operator of the third-party system will often see [[screen scraping]] as unwanted, for reasons such as increased system [[load (computing)|load]], the loss of [[advertisement]] [[revenue]], or the loss of control of the information content.
As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized [[data processing]]. Computer-to-[[user interface|user interfaces]] from that era were often simply text-based [[dumb terminal]]s which were not much more than virtual [[teleprinter]]s (such systems are still in use {{As of|2007|alt=today}}, for various reasons). The desire to interface such a system to more modern systems is common. A [[Robustness (computer science)|robust]] solution will often require things no longer available, such as [[source code]], system [[documentation]], [[Application programming interface|API]]s, or [[programmers]] with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via [[Telnet]], [[emulator|emulate]] the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise (e.g. change control, security, user management, data protection, operational audit, load balancing, and queue management), could be said to be an example of [[robotic process automation]] software, called RPA, or RPAAI for self-guided RPA 2.0 based on [[artificial intelligence]].
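The navigate-and-extract loop described above can be sketched in a few lines. This is a minimal illustration, not a real implementation: <code>FakeLegacyTerminal</code> is a hypothetical in-memory stand-in for a Telnet session, and the menu layout, field labels, and account data are invented for the example.

```python
import re


class FakeLegacyTerminal:
    """Hypothetical stand-in for a Telnet session to a legacy host."""

    def __init__(self):
        self._screen = "MAIN MENU\n1) ACCOUNT INQUIRY\n2) EXIT\nSELECT: "

    def send_keys(self, keys):
        # Simulate how the (invented) host reacts to each keystroke sequence.
        if keys == "1\r":
            self._screen = "ACCOUNT NO: "
        elif keys == "10042\r":
            self._screen = (
                "ACCOUNT INQUIRY\n"
                "ACCOUNT NO: 10042\n"
                "NAME......: J SMITH\n"
                "BALANCE...:    1,250.75\n"
            )

    def read_screen(self):
        return self._screen


def scrape_account(terminal, account_no):
    """Drive the menu like a human user, then parse the resulting screen."""
    if "SELECT:" not in terminal.read_screen():
        raise RuntimeError("unexpected screen: main menu not found")
    terminal.send_keys("1\r")  # choose "ACCOUNT INQUIRY"
    if "ACCOUNT NO:" not in terminal.read_screen():
        raise RuntimeError("unexpected screen: account prompt not found")
    terminal.send_keys(account_no + "\r")

    # Extract "LABEL....: value" pairs from the display output.
    fields = {}
    for line in terminal.read_screen().splitlines():
        match = re.match(r"^([A-Z ]+?)\.*:\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2)

    # Convert display text into typed data for the modern system.
    return {
        "account_no": fields["ACCOUNT NO"],
        "name": fields["NAME"],
        "balance": float(fields["BALANCE"].replace(",", "")),
    }
```

Calling <code>scrape_account(FakeLegacyTerminal(), "10042")</code> returns a dictionary with the account number, name, and a numeric balance. The checks before each keystroke reflect a common design choice: a scraper that verifies it is on the expected screen fails loudly when the legacy layout changes, rather than silently extracting wrong data.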
In the 1980s, financial data providers such as [[Reuters]], [[Dow Jones & Company|Telerate]], and [[Quotron]] displayed data in 24×80 format intended for a human reader. Users of this data, particularly [[Investment banking|investment banks]], wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without [[data entry clerk|re-keying]] the data. The common term for this practice, especially in the [[United Kingdom]], was ''page shredding'', since the results could be imagined to have passed through a [[paper shredder]]. Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on [[VAX/VMS]] called the Logicizer.<ref>[
More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an [[Optical character recognition|OCR]] engine, or for some specialised automated testing systems, matching the screen's bitmap data against expected results.<ref>{{Cite journal|url = http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf|title = Sikuli: Using GUI Screenshots for Search and Automation|last = Yeh|first = Tom|date = 2009|journal = UIST|access-date = 2015-02-16|archive-date = 2010-02-14|archive-url = https://web.archive.org/web/20100214184939/http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf|url-status = dead}}</ref> In the case of [[GUI]] applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying [[Object-oriented programming|programming objects]]. A sequence of screens is automatically captured and converted into a database.
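Matching bitmap data against an expected result can be illustrated with a toy example. The sketch below assumes the screen and the expected pattern are small grids of pixel values and looks for an exact match; the function name and sample data are hypothetical, and practical tools generally tolerate small pixel differences rather than requiring exact equality.

```python
def find_pattern(screen, pattern):
    """Return (row, col) of the first exact match of pattern in screen, or None.

    Both arguments are lists of equal-length rows of pixel values.
    """
    rows, cols = len(screen), len(screen[0])
    p_rows, p_cols = len(pattern), len(pattern[0])
    for r in range(rows - p_rows + 1):
        for c in range(cols - p_cols + 1):
            # Compare every row of the pattern against the screen window.
            if all(
                screen[r + i][c:c + p_cols] == pattern[i]
                for i in range(p_rows)
            ):
                return (r, c)
    return None


# Invented 4x5 monochrome "screen" containing a 2x2 widget.
SCREEN = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
WIDGET = [
    [1, 1],
    [1, 0],
]
```

Here <code>find_pattern(SCREEN, WIDGET)</code> returns <code>(1, 1)</code>, the top-left corner of the widget, and a pattern that does not appear on the screen yields <code>None</code>.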
===Web scraping===
{{main|Web scraping}}
[[Web page]]s are built using text-based mark-up languages ([[HTML]] and [[XHTML]]), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human [[End-user (computer science)|end-users]] and not for ease of automated use. Because of this, tool kits that scrape web content were created. A [[Web scraping|web scraper]] is an [[API]] or tool to extract data from a website.<ref>{{Cite journal |last1=Thapelo |first1=Tsaone Swaabow |last2=Namoshe |first2=Molaletsa |last3=Matsebe |first3=Oduetse |last4=Motshegwa |first4=Tshiamo |last5=Bopape |first5=Mary-Jane Morongwa |date=2021-07-28 |title=SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data |journal=Data Science Journal |language=en |volume=20 |
Recently, companies have developed web scraping systems that rely on using techniques in DOM parsing, [[computer vision]] and [[natural language processing]] to simulate the human processing that occurs when viewing a webpage to automatically extract useful information.<ref>{{cite web|title=A Startup Hopes to Help Computers Understand Web Pages |date=June 1, 2012 |first1=Rachel |last1=Metz |url=https://www.technologyreview.com/2012/06/01/85817/a-startup-hopes-to-help-computers-understand-web-pages/|website=MIT Technology Review|access-date=1 December 2014}}</ref><ref>{{cite magazine|title=This Simple Data-Scraping Tool Could Change How Apps Are Made|url=https://www.wired.com/2014/03/kimono/|magazine=WIRED |date=Mar 4, 2014 |first1=Kyle |last1=VanHemert |access-date=8 May 2015|url-status=dead|archive-url=https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono|archive-date=11 May 2015}} <!-- ?! syntax error --></ref>
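Because web pages are text-based markup, the simplest form of web scraping is to parse the markup and collect the text of selected elements. The following is a minimal sketch using Python's standard <code>html.parser</code> module; the class name, helper function, and sample table are invented for illustration, and real pages are usually far less regular.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._cell_text = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row.append("".join(self._cell_text).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Temp (C)</th></tr>
  <tr><td>Gaborone</td><td>31</td></tr>
  <tr><td>Windhoek</td><td>28</td></tr>
</table>
"""


def extract_rows(html):
    """Parse an HTML string and return its table rows as lists of strings."""
    parser = TableCellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

<code>extract_rows(SAMPLE_HTML)</code> returns three rows, the header followed by the two data rows. A scraper of this kind depends on the page keeping its structure, which is one reason the more elaborate techniques described above were developed.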
The legality and ethics of data scraping are often debated. Scraping publicly accessible data is generally legal; however, scraping in a manner that infringes a website's terms of service, breaches security measures, or invades user privacy can lead to legal action. Moreover, some websites specifically prohibit data scraping in their robots.
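Many websites publish crawler rules in a <code>robots.txt</code> file, and a scraper can consult those rules before requesting a page. The sketch below uses Python's standard <code>urllib.robotparser</code> module on an in-memory rule set; the rules, user-agent string, and URLs are invented for illustration, and passing this check does not by itself establish that scraping is permitted by a site's terms of service or by law.

```python
from urllib.robotparser import RobotFileParser

# Invented rules standing in for a site's robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, <code>is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/public/page")</code> is true, while the same call for a URL under <code>/private/</code> is false.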
In 2025, international institutions linked data scraping more directly with artificial intelligence development and compliance. The OECD Regulatory Policy Outlook 2025 described scraping as both a constraint and an enabler of growth, emphasizing extraterritorial enforcement and the need for adaptive regulation. The OECD’s Intellectual Property Issues in AI Training report defined scraping as automated, large-scale, and uncoordinated, warning that copyright, database, and trademark protections may apply when scraped data is reused for AI model training.
Industry surveys also reported that a majority of large enterprises integrate scraping pipelines into AI workflows, using safeguards such as geo-targeting, audit logs, and selector-level controls to remain compliant with the European Union’s General Data Protection Regulation (GDPR) and sector-specific laws. Analysts note that as AI adoption expands, regulators increasingly treat scraping capacity as a strategic infrastructure issue rather than a marginal technical practice.[16]
==See also==
{{div col|colwidth=30}}
==References==
{{reflist
12. Multilogin (n.d.). [https://multilogin.com/blog/how-to-scrape-data-on-google/ "How to Scrape Data on Google: 2024 Step-by-Step Guide."]
13. Mitchell, R. (2022). "The Ethics of Data Scraping." Journal of Information Ethics, 31(2), 45-61.
15. Walker, J. (2020). "Legal Implications of Data Scraping." Tech Law Journal, 22(3), 109-126.
16. GroupBWT (2025). [https://groupbwt.com/glossary/data-scraping "Data Scraping | Definitions, Markets, Compliance, Infrastructure."]
==Further reading==