Content deleted Content added
m v2.04b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation - <nowiki> tags) |
m additional information |
Line 4:
{{Original research|date=March 2021}}
}}
'''Search engine scraping''' is the process of harvesting [[URL]]s, descriptions, or other information from [[search engine]]s such as [[Google Search|Google]], [[Microsoft Bing|Bing]]
Most commonly, larger [[search engine optimization]] (SEO) providers depend on regularly scraping keywords from search engines, especially Google, [[Petal Search|Petal]], and [[Sogou]], to monitor the competitive position of their customers' websites for relevant keywords or their [[search engine indexing|indexing]] status.
Search engines like Google have implemented various forms of human detection to block any sort of automated access to their service,<ref>{{Cite web|url=https://support.google.com/webmasters/answer/66357?hl=en|title=Automated queries – Search Console Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> with the intent of driving the users of scrapers towards buying their official [[API]]s instead.
The process of entering a website and extracting data in an automated fashion is also often called "[[Web crawler|crawling]]". Search engines like Google, Bing, Yahoo, [[Petal Search|Petal]] or
== Difficulties ==
Line 34:
All these forms of detection may also affect normal users, especially users sharing the same IP address or network class (IPv4 ranges as well as IPv6 ranges).
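As a rough illustration of why unrelated users can be caught together, the following sketch checks whether two addresses fall into the same network range. The function name <code>same_network</code> and the prefix lengths (/24 for IPv4, /64 for IPv6) are assumptions chosen for the example; the granularity that any particular search engine uses is not documented here.

```python
import ipaddress


def same_network(addr_a: str, addr_b: str,
                 v4_prefix: int = 24, v6_prefix: int = 64) -> bool:
    """Return True if both addresses lie in the same assumed network range.

    The default prefix lengths are illustrative assumptions, not values
    known to be used by any search engine.
    """
    a = ipaddress.ip_address(addr_a)
    b = ipaddress.ip_address(addr_b)
    # An IPv4 address and an IPv6 address never share a range.
    if a.version != b.version:
        return False
    prefix = v4_prefix if a.version == 4 else v6_prefix
    # strict=False lets us derive the enclosing network from a host address.
    network = ipaddress.ip_network(f"{a}/{prefix}", strict=False)
    return b in network
```

Under these assumptions, two users behind the same provider range would be treated as one source, so a block triggered by one of them could affect the other.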
== Methods of scraping Google, Bing, Yahoo, [[Petal Search|Petal]] or
To scrape a search engine successfully, the two major factors are time and amount.
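The interplay of time and amount can be sketched as a small pacing helper that enforces a minimum delay between requests and a cap on the total number sent. The class name <code>RequestPacer</code>, its parameters, and the example values are hypothetical; suitable limits depend on the search engine and are not given in this article.

```python
import time
from typing import Callable, Optional


class RequestPacer:
    """Spaces out requests (time) and caps how many are sent (amount)."""

    def __init__(self, min_interval: float, max_requests: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval   # seconds between requests
        self.max_requests = max_requests   # total request budget
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request
        self.sent = 0

    def wait_for_slot(self) -> bool:
        """Block until the next request may go out.

        Returns False once the request budget is exhausted.
        """
        if self.sent >= self.max_requests:
            return False
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
        self.sent += 1
        return True
```

A caller would invoke <code>wait_for_slot()</code> before each request and stop as soon as it returns <code>False</code>. The clock and sleep functions are injectable only so that the pacing logic can be exercised without real delays.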
Line 61:
* [[cURL]] – a command-line tool for automation and testing, as well as a powerful open-source HTTP interaction library (libcurl) available for a large range of programming languages.<ref>{{cite web|url=https://curl.haxx.se/libcurl/|title=libcurl - the multiprotocol file transfer library|website=curl.haxx.se}}</ref>
* google-search – a Go package to scrape Google.<ref>{{cite web|url=https://github.com/rocketlaunchr/google-search|title=A Go package to scrape Google.|via=GitHub}}</ref>
* [[GoogleScraper]] – a Python module to scrape different search engines (like Google, Yandex, Bing, DuckDuckGo, Baidu,
* se-scraper – the successor of GoogleScraper; scrapes search engines concurrently with different proxies.<ref>{{Citation|last=Tschacher|first=Nikolai|title=NikolaiT/se-scraper|date=2020-11-17|url=https://github.com/NikolaiT/se-scraper|access-date=2020-11-19}}</ref>
Line 70:
The largest publicly known incident of a search engine being scraped happened in 2011, when Microsoft was caught scraping unknown keywords from Google for its own, then rather new, Bing service,<ref>{{cite web|url=https://www.wired.com/2011/02/bing-copies-google/|title=Google Catches Bing Copying; Microsoft Says ‘So What?’|first=Ryan|last=Singel|work=Wired}}</ref> but even this incident did not result in a court case.
One possible reason might be that search engines like Google, [[Petal Search|Petal]], and [[Sogou]] get almost all of their data by scraping millions of publicly reachable websites, also without reading and accepting those terms.
==See also==