Search engine scraping

Most commonly, larger [[search engine optimization]] (SEO) providers depend on regularly scraping keywords from search engines, especially Google, to monitor the competitive position of their customers' websites for relevant keywords or their [[search engine indexing|indexing]] status.
 
Search engines like Google have implemented various forms of human detection to block automated access to their service,<ref>{{Cite web|url=https://support.google.com/webmasters/answer/66357?hl=en|title=Automated queries – Search Console Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> with the intent of driving the users of scrapers towards buying their official [[API]]<nowiki/>s instead.
 
The process of entering a website and extracting data in an automated fashion is also often called "[[Web crawler|crawling]]". Search engines like Google, Bing or Yahoo get almost all their data from automated crawling bots.
Google is by far the largest search engine, with the most users as well as the most advertising revenue, which makes it the most important search engine to scrape for SEO-related companies.<ref>{{cite web|url=http://searchengineland.com/google-worlds-most-popular-search-engine-148089|title=Google Still World's Most Popular Search Engine By Far, But Share Of Unique Searchers Dips Slightly|date=11 February 2013|website=searchengineland.com}}</ref>
 
Google does not take legal action against scraping, likely for self-protective reasons. However, it uses a range of defensive methods that make scraping its results a challenging task, even when the scraping tool is realistically [[Spoofing attack|spoofing]] a normal web browser:
* Google uses a complex system of request rate limitation which can vary by language, country, User-Agent, as well as by the keywords or search parameters. The rate limitation can make automated access to a search engine unpredictable, as the behaviour patterns are not known to the outside developer or user.
* Google checks the [[User agent|User-Agent]] (browser type) of [[Hypertext Transfer Protocol|HTTP]] requests and serves a different page depending on the User-Agent, automatically rejecting User-Agents that seem to originate from a possible automated bot. (Part of the Google error page: ''Please see Google's Terms of Service posted at <nowiki>http://www.google.com/terms_of_service.html</nowiki>'') A typical example is the command line client [[cURL]]: Google will outright refuse to serve any pages to it, while Bing is a bit more forgiving and does not seem to care about User-Agents (a spoofing sketch follows this list).<ref>{{cite web|url=http://unix.stackexchange.com/questions/139698/why-would-curl-and-wget-result-in-a-403-forbidden|title=why would curl and wget result in a 403 forbidden?|website=unix.stackexchange.com}}</ref>
* Network and [[IP address|IP]] limitations are also part of the scraping defense systems. Search engines cannot easily be tricked by simply switching to another IP, so using proxies is a very important part of successful scraping. The diversity and abuse history of an IP matter as well.
* Offending IPs and offending IP networks can easily be stored in a blacklist database to detect offenders much faster. The fact that most ISPs give [[dynamic IP addresses]] to customers requires that such automated bans be only temporary, to not block innocent users.
* Behaviour-based detection is the most difficult defense system. Search engines serve their pages to millions of users every day, which provides a large amount of behaviour information. A scraping script or bot does not behave like a real user: aside from non-typical access times, delays and session times, the keywords being harvested might be related to each other or include unusual parameters. Google, for example, has a very sophisticated behaviour analysis system, possibly using [[deep learning]] software to detect unusual patterns of access, and can detect unusual activity much faster than other search engines.<ref>{{cite web|url=http://tor.stackexchange.com/questions/313/does-google-know-that-i-am-using-tor-browser|title=Does Google know that I am using Tor Browser?|website=tor.stackexchange.com}}</ref>
* [[HTML]] markup changes: depending on the methods used to harvest the content of a website, even a small change in HTML data can render a scraping tool broken until it is updated.
* General changes in detection systems. In recent years search engines have tightened their detection systems nearly month by month, making it more and more difficult to scrape reliably, as developers need to experiment and adapt their code regularly.<ref>{{cite web|url=https://productforums.google.com/forum/#!topic/websearch/MAju1QDF6_8|title=Google Groups|website=google.com}}</ref>
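
As an illustration of the first two points, a minimal Python sketch (using the third-party <code>requests</code> library; the User-Agent string, delay range and proxy handling are illustrative assumptions, not values known to avoid detection) could spoof a browser User-Agent, optionally route the request through a proxy, and pace its requests unpredictably:

<syntaxhighlight lang="python">
import random
import time

import requests  # third-party HTTP client library

# An example desktop browser User-Agent string (an assumption for
# illustration, not a value known to bypass any detection).
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def fetch_results_page(query: str, proxy: str | None = None) -> str:
    """Fetch one results page, spoofing a browser and pacing requests."""
    response = requests.get(
        "https://www.google.com/search",
        params={"q": query, "hl": "en"},
        headers={"User-Agent": BROWSER_USER_AGENT},
        # Proxies are commonly rotated because IP reputation matters;
        # the proxy URL here is supplied by the caller.
        proxies={"https": proxy} if proxy else None,
        timeout=10,
    )
    response.raise_for_status()
    # Randomised, generous delay between requests; the real rate limits
    # are not published, so this merely reduces the chance of a block.
    time.sleep(random.uniform(5.0, 15.0))
    return response.text
</syntaxhighlight>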
 
When the search engine's defense system suspects that an access might be automated, it can react in several ways.
 
The first layer of defense is a captcha page<ref>{{Cite web|url=https://support.google.com/recaptcha/answer/6081888?hl=en|title=My computer is sending automated queries – reCAPTCHA Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> where the user is prompted to verify they are a real person and not a bot or tool. Solving the [[CAPTCHA|captcha]] will create a [[HTTP cookie|cookie]] that permits access to the search engine again for a while. After about one day the captcha page is removed again.
 
The second layer of defense is a similar error page but without a captcha; in such a case the user is completely blocked from using the search engine until the temporary block is lifted or the user changes their IP.
 
The third layer of defense is a long-term block of the entire network segment. Google has blocked large network blocks for months. This sort of block is likely triggered by an administrator and only happens if a scraping tool is sending a very high number of requests.
 
All these forms of detection may also affect a normal user, especially users sharing the same IP address or network class (IPv4 ranges as well as IPv6 ranges).
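
A scraper can at least try to detect that it has hit one of these defense layers. The following is a minimal Python sketch, assuming the block is signalled by an HTTP 429 status code or a redirect to a "/sorry/" captcha page; both are commonly reported behaviours, not guaranteed ones:

<syntaxhighlight lang="python">
import requests


def is_blocked(response: requests.Response) -> bool:
    """Heuristically decide whether a response is a captcha or block page.

    The HTTP 429 status code and the "/sorry/" redirect path are
    assumptions based on commonly reported behaviour and may change.
    """
    if response.status_code == 429:  # "Too Many Requests"
        return True
    if "/sorry/" in response.url:  # captcha interstitial page
        return True
    return False
</syntaxhighlight>
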
When developing a search engine scraper, there are several existing tools and libraries available that can either be used, extended or simply analyzed to learn from.
* [[iMacros]] – A free browser automation toolkit that can be used for very small volume scraping from within a user's browser.<ref>{{Cite web|url=https://stackoverflow.com/q/32171929 |title=iMacros to extract google results|website=stackoverflow.com|access-date=2017-04-04}}</ref>
* [[cURL]] – a command line tool for automation and testing, as well as a powerful open source HTTP interaction library (libcurl) available for a large range of programming languages.<ref>{{cite web|url=https://curl.haxx.se/libcurl/|title=libcurl - the multiprotocol file transfer library|website=curl.haxx.se}}</ref>
* google-search – A Go package to scrape Google.<ref>{{cite web|url=https://github.com/rocketlaunchr/google-search|title=A Go package to scrape Google.|via=GitHub}}</ref>
* [[GoogleScraper]] – A Python module to scrape different search engines (like Google, Yandex, Bing, Duckduckgo, Baidu and others) by using proxies (socks4/5, http proxy). The tool includes asynchronous networking support and is able to control real browsers to mitigate detection.<ref>{{cite web|url=https://github.com/NikolaiT/GoogleScraper|title=A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.: NikolaiT/GoogleScraper|date=15 January 2019|publisher=|via=GitHub}}</ref>
* se-scraper – Successor of GoogleScraper; scrapes search engines concurrently with different proxies.<ref>{{Citation|last=Tschacher|first=Nikolai|title=NikolaiT/se-scraper|date=2020-11-17|url=https://github.com/NikolaiT/se-scraper|access-date=2020-11-19}}</ref>
* [https://serpapi.com SerpApi] – A real-time API to access extracted search engine results from Google, Bing, Baidu, Yandex, Yahoo and other search engines.<ref>{{Cite web|title=SerpApi: Google Search API|url=https://serpapi.com/|access-date=2021-05-02|website=SerpApi|language=en}}</ref>
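
For comparison with the tools listed above, a bare-bones Python sketch of extracting result titles from an already fetched results page with BeautifulSoup; the <code>h3</code> selector is an assumption about the current markup and, as noted earlier, any change in the HTML can break it:

<syntaxhighlight lang="python">
from bs4 import BeautifulSoup  # third-party HTML parser (beautifulsoup4)


def extract_titles(html: str) -> list[str]:
    """Return the visible result titles found in a results page.

    Assumes titles are rendered as <h3> elements; this is an assumption
    about the current markup and will break if the HTML changes.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.find_all("h3")]
</syntaxhighlight>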
 
== Legal ==
However, when it comes to scraping search engines the situation is different: search engines usually do not list intellectual property, as they just repeat or summarize information they have scraped from other websites.
 
The largest publicly known incident of a search engine being scraped happened in 2011, when Microsoft was caught scraping unknown keywords from Google for their own, rather new Bing service,<ref>{{cite web|url=https://www.wired.com/2011/02/bing-copies-google/|title=Google Catches Bing Copying; Microsoft Says ‘So What?’|first=Ryan|last=Singel|work=Wired}}</ref> but even this incident did not result in a court case.
 
One possible reason might be that search engines like Google get almost all their data by scraping millions of publicly reachable websites, also without reading and accepting those terms. A legal case won by Google against Microsoft would possibly put their whole business at risk.
 
==See also==