Data scraping: Difference between revisions

|author=Ron Lieber |date=May 7, 2016}}</ref>
 
Screen scraping is normally associated with the programmatic collection of visual data from a source, rather than the parsing of data as in web scraping. Originally, ''screen scraping'' referred to the practice of reading text data from a computer display [[Computer terminal|terminal]]'s [[Display device|screen]]. This was generally done by reading the terminal's [[memory (computers)|memory]] through its auxiliary [[Computer port (hardware)|port]], or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This can be a simple case in which the controlling program navigates through the user interface, or a more complex scenario in which the controlling program enters data into an interface meant to be used by a human.
 
As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized [[data processing]]. Computer-to-[[user interface|user interfaces]] from that era were often simply text-based [[dumb terminal]]s, which were not much more than virtual [[teleprinter]]s (such systems are still in use {{As of|2007|alt=today}}, for various reasons). The desire to interface such a system with more modern systems is common. A [[Robustness (computer science)|robust]] solution will often require things that are no longer available, such as [[source code]], system [[documentation]], [[Application programming interface|API]]s, or [[programmers]] with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via [[Telnet]], [[emulator|emulate]] the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise (such as change control, security, user management, data protection, operational audit, load balancing, and queue management), could be said to be an example of [[robotic process automation]] software, known as RPA, or as RPAAI for self-guided RPA 2.0 based on [[artificial intelligence]].
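The "process the resulting display output, extract the desired data" step can be illustrated with a minimal sketch. It assumes the screen has already been captured as fixed-position text; the screen contents, the field names, and the column layout below are all hypothetical and do not describe any particular legacy system.

```python
# Hypothetical terminal screen captured from a legacy system (illustrative only).
SCREEN = "\n".join([
    "ACCT INQUIRY                      PAGE 01",
    "ACCOUNT NO: 00123456   STATUS: ACTIVE   ",
    "NAME: DOE, JANE                         ",
    "BALANCE:      1,234.56                  ",
])

# Hypothetical layout: field name -> (row, start column, end column), zero-based,
# with the end column exclusive. A real scraper would take this from the
# screen design of the system being scraped.
LAYOUT = {
    "account": (1, 12, 20),
    "status": (1, 31, 40),
    "name": (2, 6, 40),
    "balance": (3, 8, 40),
}


def scrape_screen(screen, layout):
    """Cut each field out of the screen text by its fixed position."""
    rows = screen.split("\n")
    record = {}
    for field, (row, start, end) in layout.items():
        # Slice the fixed-width region and drop the padding around the value.
        record[field] = rows[row][start:end].strip()
    return record


if __name__ == "__main__":
    print(scrape_screen(SCREEN, LAYOUT))
```

Because fields are located purely by position, a scraper of this kind breaks whenever the screen layout changes, which is one reason such solutions are regarded as fragile.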
===Web scraping===
{{main|Web scraping}}
[[Web page]]s are built using text-based mark-up languages ([[HTML]] and [[XHTML]]), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human [[End-user (computer science)|end-users]] and not for ease of automated use. Because of this, toolkits that scrape web content were created. A [[Web scraping|web scraper]] is an [[API]] or tool used to extract data from a website. Companies like [[Amazon AWS]] and [[Google]] provide '''web scraping''' tools, services, and public data available free of cost to end-users.
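As a rough illustration of extracting data from mark-up intended for human readers, the following sketch uses Python's standard <code>html.parser</code> module to collect the cell text of an HTML table. The class name, function name, and sample page are hypothetical, and real pages are usually far less regular than this one.

```python
from html.parser import HTMLParser


class CellScraper(HTMLParser):
    """Collects the text of table cells, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:
                # Tolerate a cell that appears without an enclosing row.
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; everything else is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


def scrape_table(html):
    parser = CellScraper()
    parser.feed(html)
    return [[cell.strip() for cell in row] for row in parser.rows]


# Hypothetical page fragment used only for illustration.
SAMPLE = (
    "<table>"
    "<tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Widget</td><td>9.99</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    print(scrape_table(SAMPLE))
```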
Newer forms of web scraping involve listening to data feeds from web servers. For example, [[JSON]] is commonly used as a data transport mechanism between the client and the web server.
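When the data arrives as JSON rather than as mark-up, extraction reduces to decoding the payload and selecting the fields of interest. The sketch below assumes a payload that has already been received; its structure and the function name are hypothetical.

```python
import json

# Hypothetical JSON payload, as a web server might send it to a client.
PAYLOAD = (
    '{"items": ['
    '{"name": "Widget", "price": 9.99}, '
    '{"name": "Gadget", "price": 19.5}'
    "]}"
)


def extract_prices(payload):
    """Decode the payload and map each item name to its price."""
    data = json.loads(payload)
    return {item["name"]: item["price"] for item in data["items"]}


if __name__ == "__main__":
    print(extract_prices(PAYLOAD))
```

Compared with parsing HTML, this approach does not depend on page layout, although it still breaks if the structure of the feed changes.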