Search engine scraping: Difference between revisions

Content deleted Content added
ChabbD (talk | contribs)
removed extra word
Hedja (talk | contribs)
Removed some excessive mentions of specific search engines (seems to be spam).
Line 4:
{{Original research|date=March 2021}}
}}
'''Search engine scraping''' is the process of harvesting [[URL]]s, descriptions, or other information from [[search engine]]s such as [[Google Search|Google]], [[Microsoft Bing|Bing]], [[Yahoo! Search|Yahoo]], or [[Yandex]]. This is a specific form of [[screen scraping]] or [[web scraping]] dedicated to search engines only.
 
Most commonly, larger [[search engine optimization]] (SEO) providers depend on regularly scraping keywords from search engines, especially Google and [[Sogou]], to monitor the competitive position of their customers' websites for relevant keywords or their [[search engine indexing|indexing]] status.
 
Search engines like Google have implemented various forms of human detection to block any sort of automated access to their service,<ref>{{Cite web|url=https://support.google.com/webmasters/answer/66357?hl=en|title=Automated queries – Search Console Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> with the intent of driving the users of scrapers towards buying their official [[API]]s instead.
 
The process of entering a website and extracting data in an automated fashion is also often called "[[Web crawler|crawling]]". Search engines like Google, Bing, Yahoo or [[Sogou]] get almost all their data from automated crawling bots.
 
Search engines are an integral part of the modern online ecosystem. They provide a way for people to find information, products, and services online quickly and easily. In fact, more than 90% of online experiences begin with a search engine, and the top search results receive the majority of clicks. This is why SEO is critical for businesses and organizations that want to succeed in the digital world.
Line 38 ⟶ 36:
All these forms of detection may also affect a normal user, especially users sharing the same IP address or network class (IPv4 ranges as well as IPv6 ranges).
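Why users on a shared range can be affected together can be illustrated with a small sketch. It assumes, purely for illustration, that a detector groups clients by a /24 prefix for IPv4 and a /64 prefix for IPv6; search engines do not publish their actual grouping rules, and the function name <code>network_bucket</code> is hypothetical.

```python
import ipaddress

# Assumed prefix lengths; real detection systems do not document theirs.
IPV4_PREFIX = 24
IPV6_PREFIX = 64


def network_bucket(address: str) -> str:
    """Return the enclosing network, used here as a hypothetical grouping key."""
    ip = ipaddress.ip_address(address)
    prefix = IPV4_PREFIX if ip.version == 4 else IPV6_PREFIX
    # strict=False masks the host bits instead of raising ValueError.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network)


if __name__ == "__main__":
    # Two different users behind the same /24 fall into the same bucket.
    print(network_bucket("203.0.113.7"))
    print(network_bucket("203.0.113.200"))
    print(network_bucket("2001:db8::1"))
```

Under these assumptions, any counter keyed on the bucket would be shared by every address in the range, which is one way a normal user could be caught by limits triggered by a neighbour.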
 
== Methods of scraping Google, Bing, Yahoo or [[Sogou]] ==
To scrape a search engine successfully, the two major factors are time and amount.
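The two factors, time and amount, can be made concrete with a minimal pacing sketch. The function name <code>plan_delays</code>, the hourly cap and the jitter fraction are all illustrative assumptions, not values documented by any search engine; the sketch only computes waiting times and sends no requests.

```python
import random


def plan_delays(n_queries, max_per_hour, jitter=0.3, seed=None):
    """Return one waiting time in seconds per planned query.

    Every gap is at least 3600 / max_per_hour seconds, so the hourly
    amount never exceeds max_per_hour; jitter lengthens each gap by a
    random fraction so the timing is not perfectly regular.
    """
    if max_per_hour <= 0:
        raise ValueError("max_per_hour must be positive")
    if jitter < 0:
        raise ValueError("jitter must not be negative")
    rng = random.Random(seed)  # seeded generator for reproducible plans
    base_gap = 3600.0 / max_per_hour
    # Each gap lies in [base_gap, base_gap * (1 + jitter)].
    return [base_gap * (1 + rng.uniform(0, jitter)) for _ in range(n_queries)]


if __name__ == "__main__":
    for delay in plan_delays(5, max_per_hour=60, seed=1):
        print(f"wait {delay:.1f} s before the next query")
```

A caller would sleep for each returned delay before issuing the next query; lowering <code>max_per_hour</code> trades speed for a smaller request volume.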
 
Line 65 ⟶ 63:
* [[cURL]] – a command-line tool for automated HTTP requests and testing, together with libcurl, an open-source transfer library with bindings for a large range of programming languages.<ref>{{cite web|url=https://curl.haxx.se/libcurl/|title=libcurl - the multiprotocol file transfer library|website=curl.haxx.se}}</ref>
* Google-search – a Go package to scrape Google.<ref>{{cite web|url=https://github.com/rocketlaunchr/google-search|title=A Go package to scrape Google.|via=GitHub}}</ref>
* [https://seotoolskit.co/ SEO Tools Kit] – free online tools that scrape search engines (like DuckDuckGo, Baidu, [[Sogou]]) by using proxies (SOCKS4/5, HTTP proxy). The tool includes asynchronous networking support and is able to control real browsers to mitigate detection.<ref>{{cite web|url=https://seotoolskit.co/|title=Free online SEO Tools (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.: NikolaiT/SEO Tools Kit|date=15 January 2019|publisher=|via=GitHub}}</ref>
* se-scraper – successor of SEO Tools Kit. Scrapes search engines concurrently with different proxies.<ref>{{Citation|last=Tschacher|first=Nikolai|title=NikolaiT/se-scraper|date=2020-11-17|url=https://github.com/NikolaiT/se-scraper|access-date=2020-11-19}}</ref>
 
Line 74 ⟶ 72:
The largest publicly known incident of a search engine being scraped happened in 2011, when Microsoft was caught scraping unknown keywords from Google for its own, then rather new, Bing service,<ref>{{cite magazine|url=https://www.wired.com/2011/02/bing-copies-google/|title=Google Catches Bing Copying; Microsoft Says 'So What?'|first=Ryan|last=Singel|magazine=Wired}}</ref> but even this incident did not result in a court case.
 
One possible reason is that search engines such as Google and [[Sogou]] themselves obtain almost all their data by scraping millions of publicly reachable websites, likewise without reading and accepting those terms.
 
==See also==