Data scraping: Difference between revisions

Revision as of 20:14, 21 August 2024 by 4Sophianer (→References; tag: references removed)
Latest revision as of 10:34, 29 August 2025 by 176.37.123.177 (→References)
(15 intermediate revisions by 8 users not shown)
Line 6:
Normally, [[Data transmission|data transfer]] between programs is accomplished using [[data structures]] suited for [[Automation|automated]] processing by [[computers]], not people. Such interchange [[File format|formats]] and [[Protocol (computing)|protocols]] are typically rigidly structured, well-documented, easily [[parsing|parsed]], and minimize ambiguity. Very often, these transmissions are not human-readable at all. Thus, the key element that distinguishes data scraping from regular [[parsing]] is that the ~~output~~data being ~~scraped~~consumed is intended for display to an [[End-user (computer science)|end-user]], rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring [[binary data]] (usually images or multimedia data), [[Display device|display]] formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.

Data scraping is most often done either to [[Interface (computing)|interface]] to a [[legacy system]], which has no other mechanism which is compatible with current [[computer hardware|hardware]], or to interface to a third-party system which does not provide a more convenient [[Application programming interface|API]]. In the second case, the operator of the third-party system will often see [[screen scraping]] as unwanted, due to reasons such as increased system [[load (computing)|load]], the loss of [[advertisement]] [[revenue]], or the loss of control of the information content.

Line 26:
As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized [[data processing]].
Computer to [[user interface]]s from that era were often simply text-based [[dumb terminal]]s which were not much more than virtual [[teleprinter]]s (such systems are still in use {{As of|2007|alt=today}}, for various reasons). The desire to interface such a system to more modern systems is common. A [[Robustness (computer science)|robust]] solution will often require things no longer available, such as [[source code]], system [[documentation]], [[Application programming interface|API]]s, or [[programmers]] with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via [[Telnet]], [[emulator|emulate]] the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise (e.g. change control, security, user management, data protection, operational audit, load balancing, and queue management), could be said to be an example of [[robotic process automation]] software, called RPA, or RPAAI for self-guided RPA 2.0 based on [[artificial intelligence]].

In the 1980s, financial data providers such as [[Reuters]], [[Dow Jones & Company|Telerate]], and [[Quotron]] displayed data in 24×80 format intended for a human reader. Users of this data, particularly [[Investment banking|investment banks]], wrote applications to capture this character data and convert it to numeric data for inclusion in trading calculations without [[data entry clerk|re-keying]] the data. The common term for this practice, especially in the [[United Kingdom]], was ''page shredding'', since the results could be imagined to have passed through a [[paper shredder]].
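As a rough illustration of the "page shredding" described above, the following Python sketch cuts fields out of a captured 24×80 text screen by position and converts them to numbers. The screen contents, the field names, and the row and column positions are all invented for the example; they do not reproduce any real Reuters, Telerate, or Quotron page layout.

```python
from decimal import Decimal

# Hypothetical layout: (row, start column, end column) for each field.
# A real page would need positions worked out from the actual display.
FIELDS = {
    "pair": (1, 0, 8),
    "bid": (1, 10, 18),
    "ask": (1, 20, 28),
}


def make_screen(rows, width=80, height=24):
    """Pad captured text lines into a fixed height x width character grid."""
    padded = [line.ljust(width)[:width] for line in rows[:height]]
    padded += [" " * width] * (height - len(padded))
    return padded


def shred(screen, fields=FIELDS):
    """Cut each field out of the grid by position and strip the padding."""
    return {
        name: screen[row][start:end].strip()
        for name, (row, start, end) in fields.items()
    }


def to_numbers(record, numeric=("bid", "ask")):
    """Convert the named text fields to Decimal for use in calculations."""
    out = dict(record)
    for key in numeric:
        out[key] = Decimal(record[key])
    return out


# Invented sample page: a title row followed by one quote row.
sample = make_screen([
    "SAMPLE FX PAGE",
    "GBP/USD   1.6125    1.6135",
])
quote = to_numbers(shred(sample))
spread = quote["ask"] - quote["bid"]
```

Position-based slicing of this kind is brittle by design: if the provider moves a column, the field table has to be updated, which is one reason such scrapers need ongoing maintenance.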
Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on [[VAX/VMS]] called the Logicizer.<ref>[~~http~~https://www.fxweek.com/fx-week/news/1539599/contributors-fret-about-reuters-plan-to-switch-from-monitor-network-to-idn Contributors Fret About Reuters' Plan To Switch From Monitor Network To IDN], ''FX Week'', 02 Nov 1990</ref>

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an [[Optical character recognition|OCR]] engine, or for some specialised automated testing systems, matching the screen's bitmap data against expected results.<ref>{{Cite journal|url = http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf|title = Sikuli: Using GUI Screenshots for Search and Automation|last = Yeh|first = Tom|date = 2009|journal = UIST|access-date = 2015-02-16|archive-date = 2010-02-14|archive-url = https://web.archive.org/web/20100214184939/http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf|url-status = dead}}</ref> In the case of [[GUI]] applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying [[Object-oriented programming|programming objects]]. A sequence of screens is automatically captured and converted into a database.

Line 39:
===Web scraping===
{{main|Web scraping}}
[[Web page]]s are built using text-based mark-up languages ([[HTML]] and [[XHTML]]), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human [[End-user (computer science)|end-users]] and not for ease of automated use. Because of this, tool kits that scrape web content were created.
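To make this concrete, here is a minimal sketch of pulling tabular text out of HTML markup with Python's standard-library `html.parser` module. The sample page, the `CellCollector` class name, and the table contents are made up for illustration; real pages are usually messier and may need a more forgiving parser.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (headings, whitespace) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample page: display markup wrapped around a small table.
PAGE = """
<html><body>
<h1>Weather</h1>
<table>
  <tr><th>City</th><th>Temp (C)</th></tr>
  <tr><td>Gaborone</td><td><b>31</b></td></tr>
  <tr><td>Windhoek</td><td>28</td></tr>
</table>
</body></html>
"""

parser = CellCollector()
parser.feed(PAGE)
parser.close()
header, *records = parser.rows
```

The parser keeps only cell text and discards the heading and the bold formatting, which mirrors the idea of ignoring display formatting that is irrelevant to automated processing.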
A [[Web scraping|web scraper]] is an [[API]] or tool to extract data from a website.<ref>{{Cite journal |last1=Thapelo |first1=Tsaone Swaabow |last2=Namoshe |first2=Molaletsa |last3=Matsebe |first3=Oduetse |last4=Motshegwa |first4=Tshiamo |last5=Bopape |first5=Mary-Jane Morongwa |date=2021-07-28 |title=SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data |journal=Data Science Journal |language=en |volume=20 |~~pages~~article-number=24 |doi=10.5334/dsj-2021-024 |s2cid=237719804 |issn=1683-1470|doi-access=free }}</ref> Companies like [[Amazon AWS]] and [[Google]] provide '''web scraping''' tools, services, and public data available free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, [[JSON]] is commonly used as a transport mechanism between the client and the web server. A web scraper uses a website's [[URL]] to extract data, and stores this data for subsequent analysis.
This method of web scraping enables the extraction of data in an efficient and accurate manner.<ref>{{Cite book |last1=Singrodia |first1=Vidhi |last2=Mitra |first2=Anirban |last3=Paul |first3=Subrata |chapter=A Review on Web Scrapping and its Applications |date=2019-01-23 |title=2019 International Conference on Computer Communication and Informatics (ICCCI) ~~|chapter-url=https://ieeexplore.ieee.org/document/8821809~~ |publisher=IEEE |pages=1–6 |doi=10.1109/ICCCI.2019.8821809 |isbn=978-1-5386-8260-9}}</ref>

Recently, companies have developed web scraping systems that rely on techniques from DOM parsing, [[computer vision]] and [[natural language processing]] to simulate the human processing that occurs when viewing a webpage, in order to automatically extract useful information.<ref>{{cite web|title=A Startup Hopes to Help Computers Understand Web Pages |date=June 1, 2012 |first1=Rachel |last1=Metz |url=https://www.technologyreview.com/2012/06/01/85817/a-startup-hopes-to-help-computers-understand-web-pages/|website=MIT Technology Review|access-date=1 December 2014}}</ref><ref>{{cite magazine|title=This Simple Data-Scraping Tool Could Change How Apps Are Made|url=https://www.wired.com/2014/03/kimono/|magazine=WIRED |date=Mar 4, 2014 |first1=Kyle |last1=VanHemert |access-date=8 May 2015|url-status=dead|archive-url=https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono|archive-date=11 May 2015}} <!-- ?! syntax error --></ref>

Line 52:
The legality and ethics of data scraping are often debated. Scraping publicly accessible data is generally legal; however, scraping in a manner that infringes a website's terms of service, breaches security measures, or invades user privacy can lead to legal action. Moreover, some websites specifically prohibit data scraping in their robots.
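Prohibitions of this kind are commonly published in a site's `robots.txt` file, and a scraper can consult those rules before fetching a page. The sketch below assumes that such a file is what the text above refers to, which the source does not confirm. The rules, the `example.com` URLs, and the `ExampleScraper` user-agent name are invented for illustration. It uses Python's standard `urllib.robotparser` and parses the rules from a string, so it makes no network requests.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real scraper would load them from
# the target site's own /robots.txt instead.
RULES = """\
User-agent: *
Disallow: /private/
Allow: /
"""

robots = RobotFileParser()
robots.parse(RULES.splitlines())


def may_fetch(url, user_agent="ExampleScraper"):
    """Return True if the parsed rules permit this user agent to fetch url."""
    return robots.can_fetch(user_agent, url)


allowed = may_fetch("https://example.com/articles/1")
blocked = may_fetch("https://example.com/private/report")
```

Passing such a check only shows that the published rules do not exclude the request; it does not settle the terms-of-service or privacy questions raised above.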
'''~~Data Scraping~~Regulation and ~~Anti-detect~~AI ~~browsers~~integration'''

In 2025, international institutions linked data scraping more directly with artificial intelligence development and compliance. The OECD Regulatory Policy Outlook 2025 described scraping as both a constraint and an enabler of growth, emphasizing extraterritorial enforcement and the need for adaptive regulation. The OECD’s Intellectual Property Issues in AI Training report defined scraping as automated, large-scale, and uncoordinated, warning that copyright, database, and trademark protections may apply when scraped data is reused for AI model training. Anti-detect browsers have emerged as a tool closely associated with data scraping, especially for users who need to manage multiple scraping actions simultaneously or avoid detection. This kind of browser allows users to create multiple virtual browser accounts and mimic different devices and locations, which reduces the likelihood of being blocked or flagged by websites.

Industry surveys also reported that a majority of large enterprises integrate scraping pipelines into AI workflows, using safeguards such as geo-targeting, audit logs, and selector-level controls to remain compliant with the European Union’s General Data Protection Regulation (GDPR) and sector-specific laws. Analysts note that as AI adoption expands, regulators increasingly treat scraping capacity as a strategic infrastructure issue rather than a marginal technical practice.[16] Anti-detect browsers can be effective in web scraping because many websites use detection mechanisms to identify and block scraping activities. By using an [https://multilogin.com/ anti-detect browser], data scrapers can avoid these restrictions and keep their operations undetected.

==See also==
{{div col|colwidth=30}}

Line 72 ⟶ 73:
==References==
{{reflist }}

12. Multilogin. (n.d.). Multilogin | Prevent account bans and enables scaling.
[https://multilogin.com/blog/how-to-scrape-data-on-google/ How to Scrape Data on Google: 2024 Step-by-Step Guide]

13. Mitchell, R. (2022). "The Ethics of Data Scraping." Journal of Information Ethics, 31(2), 45-61.

Line 79 ⟶ 81:
15. Walker, J. (2020). "Legal Implications of Data Scraping." Tech Law Journal, 22(3), 109-126.
}}

16. GroupBWT (2025). [https://groupbwt.com/glossary/data-scraping "Data Scraping | Definitions, Markets, Compliance, Infrastructure."]

==Further reading==