Data scraping: Difference between revisions

Revision as of 10:43, 6 November 2021 by Vt320 (edit summary: remove link to redirect; tag: Reverted). Latest revision as of 10:34, 29 August 2025 by 176.37.123.177 (section: References).
(51 intermediate revisions by 29 users not shown)
{{short description|Data extraction technique}}
{{more citations needed|date=February 2011}}

'''Data scraping''' is a technique where a [[computer program]] extracts [[data]] from [[Human-readable medium|human-readable]] output coming from another program.

Normally, [[Data transmission|data transfer]] between programs is accomplished using [[data structures]] suited for [[Automation|automated]] processing by [[computers]], not people. Such interchange [[File format|formats]] and [[Protocol (computing)|protocols]] are typically rigidly structured, well-documented, easily [[parsing|parsed]], and minimize ambiguity. Very often, these transmissions are not human-readable at all.

Thus, the key element that distinguishes data scraping from regular [[parsing]] is that the data being consumed is intended for display to an [[End-user (computer science)|end-user]], rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring [[binary data]] (usually images or multimedia data), [[Display device|display]] formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.

Data scraping is most often done either to [[Interface (computing)|interface]] to a [[legacy system]] that has no other mechanism compatible with current [[computer hardware|hardware]], or to interface to a third-party system which does not provide a more convenient [[Application programming interface|API]]. In the second case, the operator of the third-party system will often see [[screen scraping]] as unwanted, due to reasons such as increased system [[load (computing)|load]], the loss of [[advertisement]] [[revenue]], or the loss of control of the information content.

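
The distinction drawn above between regular parsing and data scraping can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the JSON payload, the account display, the field labels, and the function names are invented and do not come from any real system. The same record is read once from a format meant for programs and once from output meant for a person, where banners, dot leaders, and currency formatting have to be ignored or undone.

```python
import json
import re

# Hypothetical machine-oriented payload: structured, documented, trivially parsed.
STRUCTURED = '{"account": "10042", "balance": 1532.75}'

# Hypothetical human-oriented display of the same record, with decoration,
# redundant labels, and commentary that a scraper has to skip.
DISPLAY = """\
==============================
   ACCOUNT SUMMARY  (page 1)
==============================
Account No. ....... 10042
Balance ........... $1,532.75
Thank you for banking with us!
"""

# A display line of the form "<label> ..... <value>"; the dot leaders are layout only.
LABELLED_VALUE = re.compile(r"^(?P<label>.+?)\s+\.{3,}\s+(?P<value>.+)$")


def parse_structured(payload: str) -> dict:
    """Regular parsing: the format is designed for programs."""
    return json.loads(payload)


def scrape_display(text: str) -> dict:
    """Data scraping: recover the same fields from output meant for a person."""
    fields = {}
    for line in text.splitlines():
        match = LABELLED_VALUE.match(line)
        if match:  # banners, rulers, and commentary simply do not match
            fields[match.group("label").strip()] = match.group("value").strip()
    # Undo display formatting (currency symbol, thousands separator).
    balance = float(fields["Balance"].replace("$", "").replace(",", ""))
    return {"account": fields["Account No."], "balance": balance}


record = scrape_display(DISPLAY)
print(record == parse_structured(STRUCTURED))
```

The scraper depends on the exact wording of the labels, so a cosmetic change to the display (for example renaming "Balance") would raise a `KeyError`, which is the kind of fragility that makes scraping a last resort.
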
Data scraping is generally considered an ''[[ad hoc]]'', inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher [[computer programming|programming]] and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program may fail. Depending on the quality and the extent of the [[error handling]] logic present in the program, this failure can result in error messages, corrupted output or even [[program crash]]es. However, setting up a data scraping pipeline nowadays is straightforward, requiring minimal programming effort to meet practical needs (especially in biomedical data integration).<ref>{{Cite journal |last=Glez-Peña |first=Daniel |date=April 30, 2013 |title=Web scraping technologies in an API world |url=https://academic.oup.com/bib/article/15/5/788/2422275 |journal=Briefings in Bioinformatics |volume=15 |issue=5 |pages=788–797|doi=10.1093/bib/bbt026 |pmid=23632294 |hdl=1822/32460 |hdl-access=free }}</ref>

==Technical variants<!--'Screen scraping' redirects here-->==
===Screen scraping===
[[File:Screen-Scraping-OCRget.jpg|thumb|380px|A screen fragment and a screen-scraping interface (blue box with red arrow) to customize data capture process.]]
Although the use of physical "[[dumb terminal]]" IBM 3270s is slowly diminishing, as more and more mainframe applications acquire [[World Wide Web|Web]] interfaces, some Web applications merely continue to use the technique of '''screen scraping'''<!--boldface per WP:R#PLA--> to capture old screens and transfer the data to modern front-ends.<ref>"Back in the 1990s.. 2002 ... 2016 ... still, according to [[Chase Bank]], a major issue.
{{cite news |newspaper=[[The New York Times]] |url=https://www.nytimes.com/2016/05/07/your-money/jamie-dimon-wants-to-protect-you-from-innovative-start-ups.html |title=Jamie Dimon Wants to Protect You From Innovative Start-Ups

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized [[data processing]]. Computer-to-user [[user interface|interfaces]] from that era were often simply text-based [[dumb terminal]]s which were not much more than virtual [[teleprinter]]s (such systems are still in use {{As of|2007|alt=today}}, for various reasons). The desire to interface such a system to more modern systems is common. A [[Robustness (computer science)|robust]] solution will often require things no longer available, such as [[source code]], system [[documentation]], [[Application programming interface|API]]s, or [[programmers]] with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via [[Telnet]], [[emulator|emulate]] the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise (for example change control, security, user management, data protection, operational audit, load balancing, and queue management), could be said to be an example of [[robotic process automation]] software, called RPA, or RPAAI for self-guided RPA 2.0 based on [[artificial intelligence]].

In the 1980s, financial data providers such as [[Reuters]], [[Dow Jones & Company|Telerate]], and [[Quotron]] displayed data in 24×80 format intended for a human reader.
Users of this data, particularly [[Investment banking|investment banks]], wrote applications to capture and convert this character data into numeric data for inclusion in calculations for trading decisions without [[data entry clerk|re-keying]] the data. The common term for this practice, especially in the [[United Kingdom]], was ''page shredding'', since the results could be imagined to have passed through a [[paper shredder]]. Internally, Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on [[VAX/VMS]] called the Logicizer.<ref>[https://www.fxweek.com/fx-week/news/1539599/contributors-fret-about-reuters-plan-to-switch-from-monitor-network-to-idn Contributors Fret About Reuters' Plan To Switch From Monitor Network To IDN], ''FX Week'', 02 Nov 1990</ref>

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an [[Optical character recognition|OCR]] engine, or for some specialised automated testing systems, matching the screen's bitmap data against expected results.<ref>{{Cite journal|url = http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf|title = Sikuli: Using GUI Screenshots for Search and Automation|last = Yeh|first = Tom|date = 2009|journal = UIST|access-date = 2015-02-16|archive-date = 2010-02-14|archive-url = https://web.archive.org/web/20100214184939/http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf|url-status = dead}}</ref> In the case of [[GUI]] applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying [[Object-oriented programming|programming objects]]. A sequence of screens is automatically captured and converted into a database.

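
The "page shredding" of 24×80 character screens described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reconstruction of any vendor's system: the page layout, the field positions in `QUOTE_FIELDS`, the rates, and all function names are invented. The idea shown is that values are cut out of the screen buffer by row and column, then converted from character data into numbers.

```python
from dataclasses import dataclass
from typing import Dict, List

ROWS, COLS = 24, 80  # the 24x80 character screen format


@dataclass(frozen=True)
class Field:
    """Position of one value on the screen (hypothetical layout)."""
    name: str
    row: int    # zero-based screen row
    col: int    # zero-based starting column
    width: int  # number of character cells


# Assumed positions of the quote fields on an invented rates page.
QUOTE_FIELDS = [
    Field("symbol", row=2, col=0, width=10),
    Field("bid", row=2, col=10, width=12),
    Field("ask", row=2, col=22, width=12),
]


def make_screen(lines: List[str]) -> List[str]:
    """Pad captured text to a full ROWS x COLS character buffer."""
    if len(lines) > ROWS:
        raise ValueError("more lines than the screen has rows")
    padded = [line[:COLS].ljust(COLS) for line in lines]
    padded.extend(" " * COLS for _ in range(ROWS - len(padded)))
    return padded


def shred(screen: List[str], fields: List[Field]) -> Dict[str, str]:
    """Cut each field out of the screen by position and trim the padding."""
    return {
        f.name: screen[f.row][f.col:f.col + f.width].strip()
        for f in fields
    }


def to_quote(raw: Dict[str, str]) -> dict:
    """Convert the character data into numbers usable in calculations."""
    return {"symbol": raw["symbol"], "bid": float(raw["bid"]), "ask": float(raw["ask"])}


# An invented page as it might appear on the terminal.
captured = [
    f"{'FX RATES':<70}{'PAGE 1'}",
    f"{'SYMBOL':<10}{'BID':<12}{'ASK'}",
    f"{'GBP/USD':<10}{'1.6125':<12}{'1.6130'}",
]
screen = make_screen(captured)
quote = to_quote(shred(screen, QUOTE_FIELDS))
print(quote)
```

Positional extraction is simple but brittle: if the provider shifts a column by one character, `float()` may raise `ValueError` or, worse, a truncated number may be accepted silently, so a practical scraper would also validate the surrounding header text.
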
Another modern adaptation to these techniques is to use, instead of a sequence of screens as input, a set of images or PDF files, so there is some overlap with generic "document scraping" and [[#Report mining|report mining]] techniques.

===Web scraping===
{{main|Web scraping}}
[[Web page]]s are built using text-based mark-up languages ([[HTML]] and [[XHTML]]), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human [[End-user (computer science)|end-users]] and not for ease of automated use. Because of this, tool kits that scrape web content were created. A [[Web scraping|web scraper]] is an [[API]] or tool to extract data from a website.<ref>{{Cite journal |last1=Thapelo |first1=Tsaone Swaabow |last2=Namoshe |first2=Molaletsa |last3=Matsebe |first3=Oduetse |last4=Motshegwa |first4=Tshiamo |last5=Bopape |first5=Mary-Jane Morongwa |date=2021-07-28 |title=SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data |journal=Data Science Journal |language=en |volume=20 |article-number=24 |doi=10.5334/dsj-2021-024 |s2cid=237719804 |issn=1683-1470|doi-access=free }}</ref> Companies like [[Amazon AWS]] and [[Google]] provide '''web scraping''' tools, services, and public data available free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, [[JSON]] is commonly used as a data interchange format between the client and the web server. A web scraper uses a website's [[URL]] to extract data, and stores this data for subsequent analysis.
This method of web scraping enables the extraction of data in an efficient and accurate manner.<ref>{{Cite book |last1=Singrodia |first1=Vidhi |last2=Mitra |first2=Anirban |last3=Paul |first3=Subrata |chapter=A Review on Web Scrapping and its Applications |date=2019-01-23 |title=2019 International Conference on Computer Communication and Informatics (ICCCI) |publisher=IEEE |pages=1–6 |doi=10.1109/ICCCI.2019.8821809 |isbn=978-1-5386-8260-9}}</ref>

Recently, companies have developed web scraping systems that use techniques from [[DOM parsing]], [[computer vision]] and [[natural language processing]] to simulate the human processing that occurs when viewing a webpage, in order to automatically extract useful information.<ref>{{cite web|title=A Startup Hopes to Help Computers Understand Web Pages|date=June 1, 2012 |first1=Rachel |last1=Metz |url=https://www.technologyreview.com/2012/06/01/85817/a-startup-hopes-to-help-computers-understand-web-pages/|website=MIT Technology Review|access-date=1 December 2014}}</ref><ref>{{cite magazine|title=This Simple Data-Scraping Tool Could Change How Apps Are Made|url=https://www.wired.com/2014/03/kimono/|magazine=WIRED |date=Mar 4, 2014 |first1=Kyle |last1=VanHemert |access-date=8 May 2015|url-status=dead|archive-url=https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono|archive-date=11 May 2015}} <!-- ?! syntax error --></ref>

Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests an IP or IP network may send.
This has caused an ongoing battle between website developers and scraping developers.<ref>{{Cite web|url=https://support.google.com/websearch/answer/86640?hl=en|title="Unusual traffic from your computer network" |website=Google Search Help |language=en|access-date=2017-04-04}}</ref>

==={{Visible anchor|Report mining}}<!--'Report mining' redirects here-->===
'''Report mining'''<!--boldface per WP:R#PLA--> is the extraction of data from human-readable computer reports. Conventional [[data extraction]] requires a connection to a working source system, suitable [[Database connection|connectivity]] standards or an [[Application programming interface|API]], and usually complex querying. By using the source system's standard reporting options, and directing the output to a [[Spooling|spool file]] instead of to a [[printer (computing)|printer]], static reports can be generated suitable for offline analysis via report mining.<ref>Scott Steinacher, [https://web.archive.org/web/20160304205109/http://connection.ebscohost.com/c/product-reviews/2235513/data-pump-transforms-host-data "Data Pump transforms host data"], ''[[InfoWorld]]'', 30 August 1999, p55</ref> This approach can avoid intensive [[Central processing unit|CPU]] usage during business hours, can minimise [[end-user]] licence costs for [[Enterprise resource planning|ERP]] customers, and can offer very rapid prototyping and development of custom reports. Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as [[HTML]], PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system.

'''Legal and Ethical Considerations'''

The legality and ethics of data scraping are often debated.
Scraping publicly accessible data is generally legal; however, scraping in a manner that infringes a website's terms of service, breaches security measures, or invades user privacy can lead to legal action. Moreover, some websites explicitly prohibit data scraping in their robots.txt file.

'''Regulation and AI integration'''

In 2025, international institutions linked data scraping more directly with artificial intelligence development and compliance. The OECD Regulatory Policy Outlook 2025 described scraping as both a constraint and an enabler of growth, emphasizing extraterritorial enforcement and the need for adaptive regulation. The OECD’s Intellectual Property Issues in AI Training report defined scraping as automated, large-scale, and uncoordinated, warning that copyright, database, and trademark protections may apply when scraped data is reused for AI model training. Industry surveys also reported that a majority of large enterprises integrate scraping pipelines into AI workflows, using safeguards such as geo-targeting, audit logs, and selector-level controls to remain compliant with the European Union’s General Data Protection Regulation (GDPR) and sector-specific laws. Analysts note that as AI adoption expands, regulators increasingly treat scraping capacity as a strategic infrastructure issue rather than a marginal technical practice.[16]

==See also==
{{div col|colwidth=30}}
* [[Comparison of feed aggregators]]
* [[Data cleansing]]
* [[Importer (computing)]]
* [[Information extraction]]
* [[Mashup (web application hybrid)]]
* [[Metadata]]
* [[Open data]]
* [[Search engine scraping]]
* [[Web scraping]]
{{div col end}}

==References==
{{reflist}}

12. Multilogin. (n.d.). Multilogin | Prevent account bans and enables scaling. [https://multilogin.com/blog/how-to-scrape-data-on-google/ How to Scrape Data on Google: 2024 Step-by-Step Guide]

13. Mitchell, R. (2022). "The Ethics of Data Scraping." Journal of Information Ethics, 31(2), 45-61.

14. Kavanagh, D. (2021). "Anti-Detect Browsers: The Next Frontier in Web Scraping." Web Security Review, 19(4), 33-48.

15. Walker, J. (2020). "Legal Implications of Data Scraping." Tech Law Journal, 22(3), 109-126.

16. GroupBWT (2025). [https://groupbwt.com/glossary/data-scraping "Data Scraping | Definitions, Markets, Compliance, Infrastructure."]

==Further reading==

{{data}}
{{Information security}}
{{DEFAULTSORT:Data Scraping}}