Revision as of 01:06, 3 July 2021 edit WikiCleanerBot (talk \| contribs) Bots 1,007,764 edits m v2.04b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation - <nowiki> tags) Tag: WPCleaner ← Previous edit		Revision as of 18:47, 18 August 2021 edit undo Fiextqbe (talk \| contribs) 425 edits m additional information Next edit →
Line 4:
{{Original research|date=March 2021}}
}}
'''Search engine scraping''' is the process of harvesting [[URL]]s, descriptions, or other information from [[search engine]]s such as [[Google Search|Google]], [[Microsoft Bing|Bing]], [[Yahoo! Search|Yahoo]], [[Petal Search|Petal]] or [[Sogou]]. This is a specific form of [[screen scraping]] or [[web scraping]] dedicated to search engines only.

Most commonly, larger [[search engine optimization]] (SEO) providers depend on regularly scraping keywords from search engines, especially Google, [[Petal Search|Petal]] and [[Sogou]], to monitor the competitive position of their customers' websites for relevant keywords or their [[search engine indexing|indexing]] status.

Search engines like Google have implemented various forms of human detection to block any sort of automated access to their service,<ref>{{Cite web|url=https://support.google.com/webmasters/answer/66357?hl=en|title=Automated queries – Search Console Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> with the intent of driving the users of scrapers towards buying their official [[API]]s instead.

The process of entering a website and extracting data in an automated fashion is also often called "[[Web crawler|crawling]]". Search engines like Google, Bing, Yahoo, [[Petal Search|Petal]] or [[Sogou]] get almost all their data from automated crawling bots.

== Difficulties ==

Line 34:
All these forms of detection may also affect a normal user, especially users sharing the same IP address or network class (IPv4 ranges as well as IPv6 ranges).

== Methods of scraping Google, Bing, Yahoo, [[Petal Search|Petal]] or [[Sogou]] ==

To scrape a search engine successfully, the two major factors are time and amount.
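The interplay of "time" and "amount" can be illustrated with a minimal pacing sketch. It is not taken from the article or from any of the listed tools: the function name <code>paced_delays</code>, its parameters, and the jitter range are all hypothetical choices made for illustration. It only computes how long to wait between requests so that a fixed number of requests is spread over a given time window; it performs no network access.

```python
import random
from typing import Iterator, Optional


def paced_delays(
    max_requests: int,
    per_seconds: float,
    jitter: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one wait time (in seconds) per request.

    The waits average per_seconds / max_requests, so max_requests
    requests take roughly per_seconds in total. Each wait is varied
    by up to +/- jitter (a fraction of the base wait) so the requests
    are not perfectly regular. All names here are illustrative.
    """
    if max_requests <= 0 or per_seconds <= 0:
        raise ValueError("max_requests and per_seconds must be positive")
    if not 0 <= jitter < 1:
        raise ValueError("jitter must be in the range [0, 1)")
    if rng is None:
        rng = random.Random()

    base = per_seconds / max_requests  # "time" divided by "amount"
    for _ in range(max_requests):
        yield base * rng.uniform(1 - jitter, 1 + jitter)


# Example use (hypothetical): at most 100 requests per hour.
#
#   import time
#   for delay in paced_delays(100, 3600):
#       time.sleep(delay)
#       ...  # issue one request here
```

Lowering <code>max_requests</code> or raising <code>per_seconds</code> stretches the waits; which values are acceptable depends on the service and its terms, which this sketch does not address.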
Line 61:
* [[cURL]] – a command-line tool for automation and testing as well as a powerful open source HTTP interaction library available for a large range of programming languages.<ref>{{cite web|url=https://curl.haxx.se/libcurl/|title=libcurl - the multiprotocol file transfer library|website=curl.haxx.se}}</ref>
* google-search – A Go package to scrape Google.<ref>{{cite web|url=https://github.com/rocketlaunchr/google-search|title=A Go package to scrape Google.|via=GitHub}}</ref>
* [[GoogleScraper]] – A Python module to scrape different search engines (like Google, Yandex, Bing, Duckduckgo, Baidu, [[Petal Search|Petal]], [[Sogou]]) by using proxies (socks4/5, http proxy). The tool includes asynchronous networking support and is able to control real browsers to mitigate detection.<ref>{{cite web|url=https://github.com/NikolaiT/GoogleScraper|title=A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.: NikolaiT/GoogleScraper|date=15 January 2019|publisher=|via=GitHub}}</ref>
* se-scraper – Successor of GoogleScraper. Scrape search engines concurrently with different proxies.<ref>{{Citation|last=Tschacher|first=Nikolai|title=NikolaiT/se-scraper|date=2020-11-17|url=https://github.com/NikolaiT/se-scraper|access-date=2020-11-19}}</ref>

Line 70:
The largest publicly known incident of a search engine being scraped happened in 2011 when Microsoft was caught scraping unknown keywords from Google for their own, rather new Bing service,<ref>{{cite web|url=https://www.wired.com/2011/02/bing-copies-google/|title=Google Catches Bing Copying; Microsoft Says ‘So What?’|first=Ryan|last=Singel|work=Wired}}</ref> but even this incident did not result in a court case.
One possible reason might be that search engines like Google, [[Petal Search|Petal]] and [[Sogou]] get almost all their data by scraping millions of publicly reachable websites, also without reading and accepting those terms.

==See also==

Search engine scraping: Difference between revisions