{{short description|Data extraction technique}}
{{refimprove|date=February 2011}}
{{Information security}}
'''Data scraping''' is a technique in which a [[computer program]] extracts [[data]] from [[Human-readable medium|human-readable]] output coming from another program.
 
==Description==
Normally, [[Data transmission|data transfer]] between programs is accomplished using [[data structures]] suited for [[Automation|automated]] processing by [[computers]], not people. Such interchange [[File format|formats]] and [[Protocol (computing)|protocols]] are typically rigidly structured, well-documented, easily [[parsing|parsed]], and keep ambiguity to a minimum. Very often, these transmissions are not [[human-readable]] at all.
Thus, the key element that distinguishes data scraping from regular [[parsing]] is that the output being scraped is intended for display to an [[End-user (computer science)|end-user]], rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), [[Display device|display]] formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.
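The point above — that scraped output is formatted for people, so the program must discard labels, decoration, and commentary to recover the underlying data — can be illustrated with a minimal sketch. The report text and field names here are hypothetical, not taken from any particular system:

```python
import re

# Hypothetical human-readable output from another program: decorative
# formatting, labels and commentary surround the one value of interest.
report = """
==== ACCOUNT SUMMARY (generated 2024-01-15) ====
  Customer name .......... Jane Doe
  Current balance ........ $1,204.56
  * Thank you for your business! *
"""

def scrape_balance(text):
    """Extract the balance, ignoring labels, dot leaders and formatting."""
    match = re.search(r"Current balance\s*\.*\s*\$([\d,]+\.\d{2})", text)
    if match is None:
        raise ValueError("layout changed: balance not found")
    return float(match.group(1).replace(",", ""))
```

Note that the scraper depends entirely on the display layout: if the label or dot leaders change, the pattern no longer matches, which is exactly the fragility described below.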
 
Data scraping is most often done either to interface to a [[legacy system]] that has no other mechanism compatible with current [[computer hardware|hardware]], or to interface to a third-party system that does not provide a more convenient [[Application programming interface|API]]. In the second case, the operator of the third-party system will often see [[screen scraping]] as unwanted, for reasons such as increased system [[load (computing)|load]], the loss of [[advertisement]] [[revenue]], or the loss of control over the information content.
 
Data scraping is generally considered an ''[[ad hoc]]'', inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher [[computer programming|programming]] and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program may fail. Depending on the quality and the extent of [[error handling]] logic present in the program, this failure can result in error messages, corrupted output or even [[program crash]]es.
 
==Technical variants<!--'Screen scraping' redirects here-->==
==={{Visible anchor|Screen scraping}}===
[[File:Screen-Scraping-OCRget.jpg|thumb|380px|A screen fragment and a screen-scraping interface (blue box with red arrow) to customize data capture process.]]
Although the use of physical "[[dumb terminal]]" IBM 3270s is slowly diminishing, as more and more mainframe applications acquire [[World Wide Web|Web]] interfaces, some Web applications merely continue to use the technique of "'''screen scraping'''"<!--boldface per WP:R#PLA--> to capture old screens and transfer the data to modern front-ends.<ref>"Back in the 1990s ... 2002 ... 2016 ... still, according to [[Chase Bank]], a major issue." {{cite news |newspaper=[[The New York Times]]
|url=https://www.nytimes.com/2016/05/07/your-money/jamie-dimon-wants-to-protect-you-from-innovative-start-ups.html
|title=Jamie Dimon Wants to Protect You From Innovative Start-Ups
}}</ref>
Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests an IP or IP network may send. This has caused an ongoing battle between website developers and scraping developers.<ref>{{Cite web|url=https://support.google.com/websearch/answer/86640?hl=en|title="Unusual traffic from your computer network" - Search Help|website=support.google.com|language=en|access-date=2017-04-04}}</ref>
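A scraper on the receiving end of such defences typically has to slow down and retry rather than hammer the site. The following is a minimal sketch of that client-side behaviour, assuming a site that signals rate limiting with HTTP status 429; the `fetch` callable, URL, and delay values are illustrative, not a real API:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0):
    """Call fetch(url) -> (status, body), retrying with exponential
    backoff whenever the site's defences answer HTTP 429."""
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status != 429:
            return body
        # Wait 1, 2, 4, ... seconds plus random jitter before retrying,
        # so repeated clients do not retry in lockstep.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("still rate-limited after %d attempts" % max_retries)

# Simulated server: rejects the first two requests, then answers.
_responses = iter([(429, ""), (429, ""), (200, "<html>page</html>")])
page = fetch_with_backoff(lambda url: next(_responses),
                          "https://example.com", base_delay=0.01)
```

Exponential backoff with jitter is a common politeness strategy, but it only mitigates the arms race described above; sites may still block persistent scrapers outright.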
 
==={{Visible anchor|Report mining}}<!--'Report mining' redirects here-->===
'''Report mining'''<!--boldface per WP:R#PLA--> is the extraction of data from human-readable computer reports. Conventional [[data extraction]] requires a connection to a working source system, suitable [[Database connection|connectivity]] standards or an [[Application programming interface|API]], and usually complex querying. By using the source system's standard reporting options, and directing the output to a [[Spooling|spool file]] instead of to a [[printer (computing)|printer]], static reports can be generated suitable for offline analysis via report mining.<ref>Scott Steinacher, [https://web.archive.org/web/20160304205109/http://connection.ebscohost.com/c/product-reviews/2235513/data-pump-transforms-host-data "Data Pump transforms host data"], ''[[InfoWorld]]'', 30 August 1999, p55</ref> This approach can avoid intensive [[Central processing unit|CPU]] usage during business hours, can minimise [[end-user]] licence costs for [[Enterprise resource planning|ERP]] customers, and can offer very rapid prototyping and development of custom reports. Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as [[HTML]], PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system.
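The spool-file approach described above can be sketched as follows. The report layout, column names, and parsing rules here are hypothetical, standing in for whatever fixed layout a real source system's reporting option produces:

```python
# Hypothetical spool-file content: a printer-bound report redirected to disk.
spool = """\
INVENTORY REPORT                       PAGE 1
ITEM        QTY   UNIT PRICE
Widget       12         3.50
Gadget        7        10.00
"""

def mine_report(text):
    """Turn the human-readable report into structured records,
    skipping the page header and column headings."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data lines have exactly three fields, the second being a quantity.
        if len(parts) == 3 and parts[1].isdigit():
            rows.append({"item": parts[0],
                         "qty": int(parts[1]),
                         "price": float(parts[2])})
    return rows

records = mine_report(spool)
```

Because the report is static, this parsing can run offline against archived spool files, which is what allows report mining to avoid load on the live source system during business hours.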