{{Original research|date=March 2021}}
'''Search engine scraping''' is the automated extraction of [[URL]]s, descriptions, and other data from [[search engine]] results, such as those of [[Google Search|Google]], [[Microsoft Bing|Bing]], [[Yahoo! Search|Yahoo]], [[Petal Search|Petal]] or [[Sogou]]. It is a specialized subset of [[web scraping]] focused exclusively on search engine content.
 
Larger [[search engine optimization]] (SEO) providers most commonly depend on regularly scraping keyword results from search engines, especially Google, to monitor the competitive position of their customers' websites for relevant keywords or their [[search engine indexing|indexing]] status.
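
For rank monitoring, the scraper ultimately needs the position of a customer's domain in the ordered result list. A minimal sketch of that final step in Python (the parsed result list and the domain are hypothetical placeholders; obtaining the parsed list is the hard part discussed below):
<syntaxhighlight lang="python">
from urllib.parse import urlparse

def rank_of(domain, result_urls):
    """Return the 1-based position of `domain` in a parsed result list, or None."""
    for position, url in enumerate(result_urls, start=1):
        host = urlparse(url).netloc.lower()
        if host == domain or host.endswith("." + domain):
            return position
    return None

# Hypothetical parsed results for one keyword:
results = [
    "https://www.example.org/page",
    "https://shop.customer-site.com/products",
    "https://en.wikipedia.org/wiki/Example",
]
print(rank_of("customer-site.com", results))  # -> 2
</syntaxhighlight>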
 
The process of entering a website and extracting data in an automated fashion is also often called "[[Web crawler|crawling]]". Search engines themselves get almost all of their data from automated crawling bots.
Search engines like Google have implemented various forms of human detection to block any sort of automated access to their service,<ref>{{Cite web|url=https://support.google.com/webmasters/answer/66357?hl=en|title=Automated queries – Search Console Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> with the intent of driving the users of scrapers towards buying their official [[API]]s instead.
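
As an illustration of that official route, the following is a hedged sketch of querying the Google Custom Search JSON API with Python's <code>requests</code> library; the API key and engine ID are placeholders that a real caller must supply:
<syntaxhighlight lang="python">
import requests

API_KEY = "YOUR_API_KEY"      # placeholder credentials
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder Programmable Search Engine ID

def official_search(query):
    """Query the Custom Search JSON API instead of scraping result pages."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query},
        timeout=10,
    )
    response.raise_for_status()
    # Each returned item carries a title, link and snippet, comparable to a scraped result.
    return response.json().get("items", [])

for item in official_search("web scraping"):
    print(item["link"], "-", item["title"])
</syntaxhighlight>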
 
== Difficulties ==
Google is by far the largest search engine, in number of users as well as in advertising revenue, which makes Google the most important search engine to scrape for SEO-related companies.<ref>{{cite web|url=http://searchengineland.com/google-worlds-most-popular-search-engine-148089|title=Google Still World's Most Popular Search Engine By Far, But Share Of Unique Searchers Dips Slightly|date=11 February 2013|website=searchengineland.com}}</ref>
 
Although Google does not take legal action against scraping, it uses a range of defensive methods that make scraping its results a challenging task, even when the scraping tool is realistically [[Spoofing attack|spoofing]] a normal [[web browser]]:
* Google uses a complex system of request rate limiting which can vary for each language, country and User-Agent, as well as depending on the keywords or search parameters. The rate limitation can make automated access unpredictable, as the limiting patterns are not known to the outside developer or user (see the pacing sketch after this list).
* Network and [[IP address|IP]] limitations are also part of the scraping defense systems. Search engines cannot easily be tricked by simply changing to another IP, and using proxies is a very important part of successful scraping. The diversity and abuse history of an IP are important as well.
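
Because the exact limits are unknown to outsiders, scrapers usually pace themselves conservatively with randomized delays per IP. A minimal pacing sketch follows; the rate and jitter values are illustrative assumptions, not known thresholds of any search engine:
<syntaxhighlight lang="python">
import random
import time

MIN_DELAY_SECONDS = 180  # assumed floor: at most one request every 3 minutes per IP
JITTER_SECONDS = 120     # random extra wait to avoid an easily detected fixed pattern

def paced_fetch(keywords, fetch):
    """Fetch one keyword at a time, sleeping a randomized interval in between."""
    for keyword in keywords:
        yield keyword, fetch(keyword)
        time.sleep(MIN_DELAY_SECONDS + random.uniform(0, JITTER_SECONDS))
</syntaxhighlight>
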
When a search engine's defense system suspects that an access might be automated, it can react in a number of ways.
 
The first layer of defense is a captcha page<ref>{{Cite web|url=https://support.google.com/recaptcha/answer/6081888?hl=en|title=My computer is sending automated queries – reCAPTCHA Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> where the user is prompted to verify they are a real person and not a bot or tool. Solving the [[CAPTCHA|captcha]] will create a [[HTTP cookie|cookie]] that permits access to the search engine again for a while. After about one day, the captcha page is displayed again.
 
The second layer of defense is a similar error page but without a captcha; in this case the user is completely blocked from using the search engine until the temporary block is lifted, or the user changes their IP.
All these forms of detection may also happen to a normal user, especially users sharing the same IP address or network class (IPv4 ranges as well as IPv6 ranges).
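
A scraper therefore has to recognize which kind of response it received before parsing anything. A rough classification sketch follows; the marker strings and status codes are illustrative assumptions, since real block pages vary by engine and change over time:
<syntaxhighlight lang="python">
def classify_response(status_code, html):
    """Crudely classify a search engine response; the markers are assumptions."""
    text = html.lower()
    if "captcha" in text or "unusual traffic" in text:
        return "captcha"   # first layer: solve or rotate IP, then slow down
    if status_code in (403, 429):
        return "blocked"   # second layer: hard block, back off and change IP
    if status_code == 200:
        return "ok"        # parse the result page normally
    return "unknown"       # log the response and retry cautiously
</syntaxhighlight>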
 
== Methods of scraping ==
{{unref-section|date=July 2024}}
To scrape a search engine successfully, the two major factors are time and volume.

The more keywords a user needs to scrape and the smaller the time window for the job, the more difficult scraping will be and the more sophisticated a scraping script or tool needs to be.
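
The relationship between volume, deadline and infrastructure can be made concrete with a back-of-the-envelope calculation; a sketch, assuming an illustrative "safe" per-IP rate:
<syntaxhighlight lang="python">
import math

def proxies_needed(keywords, hours, safe_rate_per_ip=10.0):
    """Number of IPs/proxies required if each IP safely sustains
    safe_rate_per_ip requests per hour (an assumed figure)."""
    total_rate_needed = keywords / hours  # requests per hour overall
    return math.ceil(total_rate_needed / safe_rate_per_ip)

# Example: 50,000 keywords within 24 hours at ~10 requests/hour per IP
print(proxies_needed(50_000, 24))  # -> 209 proxies
</syntaxhighlight>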
 
Scraping scripts need to overcome a few technical challenges:<ref>{{cite web|url=http://google-rank-checker.squabbel.com|title=Scraping Google Ranks for Fun and Profit|website=google-rank-checker.squabbel.com}}</ref>
* IP rotation using proxies (proxies should be unshared and not listed in blacklists)
* Proper time management: time between keyword changes and pagination, as well as correctly placed delays. Effective long-term scraping rates can vary from only 3–5 requests (keywords or pages) per hour up to 100 and more per hour for each IP address/proxy in use. The quality of IPs, methods of scraping, keywords requested and language/country requested can greatly affect the possible maximum rate.
* Correct handling of URL parameters, cookies as well as HTTP headers to emulate a user with a typical browser<ref name=":0" />
* HTML [[Document Object Model|DOM]] parsing (extracting URLs, descriptions, ranking position, sitelinks and other relevant data from the HTML code; see the parsing sketch below)
* Error handling, automated reaction on captcha or block pages and other unusual responses{{cn|date=July 2024}}

An example of an open source scraping software which makes use of the above-mentioned techniques is GoogleScraper.<ref name=":0">{{cite web|url=https://github.com/NikolaiT/GoogleScraper|title=Python3 framework GoogleScraper|website=scrapeulous}}</ref> This framework controls browsers over the DevTools Protocol and makes it hard for Google to detect that the browser is automated.
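
To illustrate the DOM parsing step, here is a minimal sketch using <code>requests</code> and BeautifulSoup. The CSS selectors are hypothetical placeholders: every search engine uses its own, frequently changing markup, so real selectors have to be determined by inspecting the result page:
<syntaxhighlight lang="python">
import requests
from bs4 import BeautifulSoup

# Emulate a typical browser; a real scraper would also manage cookies and proxies.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def parse_results(url):
    """Download one result page and extract rank, URL, title and description."""
    html = requests.get(url, headers=HEADERS, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # "div.result", "a.title" and "span.snippet" are placeholder selectors.
    for rank, block in enumerate(soup.select("div.result"), start=1):
        link = block.select_one("a.title")
        snippet = block.select_one("span.snippet")
        if link is None:
            continue
        results.append({
            "rank": rank,
            "url": link.get("href"),
            "title": link.get_text(strip=True),
            "description": snippet.get_text(strip=True) if snippet else "",
        })
    return results
</syntaxhighlight>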
 
== Programming languages ==
{{unref-section|date=July 2024}}
When developing a scraper for a search engine, almost any programming language can be used, although, depending on performance requirements, some languages will be preferable.
 
[[PHP]] is a commonly used language to write scraping scripts for websites or backend services, since it has powerful capabilities built in (DOM parsers, libcURL); however, its memory usage is typically around ten times that of comparable C/[[C++]] code. [[Ruby on Rails]] as well as [[Python (programming language)|Python]] are also frequently used for automated scraping jobs. For the highest performance, C++ DOM parsers should be considered.
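
In Python, for example, the performance-sensitive parsing step can be delegated to a C-backed library such as lxml; a small sketch (the HTML snippet is a made-up stand-in for a downloaded result page):
<syntaxhighlight lang="python">
from lxml import html  # lxml wraps the C library libxml2 for fast DOM parsing

# Made-up stand-in for a downloaded result page.
page = """
<div class="result"><a href="https://example.org/a">First hit</a></div>
<div class="result"><a href="https://example.org/b">Second hit</a></div>
"""

tree = html.fromstring(page)
# XPath evaluation happens in compiled C code, not in the Python interpreter.
for link in tree.xpath('//div[@class="result"]/a'):
    print(link.get("href"), link.text_content())
</syntaxhighlight>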
 
Additionally, [[Bash scripting language|bash scripting]] can be used together with cURL as a command line tool to scrape a search engine.
 
== Tools and scripts==
When developing a search engine scraper there are several existing tools and libraries available that can either be used, extended or just analyzed to learn from.
* [[iMacros]] – A free browser automation toolkit that can be used for very small volume scraping from within a user's browser<ref>{{Cite web|url=https://stackoverflow.com/q/32171929 |title=iMacros to extract google results|website=stackoverflow.com|access-date=2017-04-04}}</ref>
* [[cURL]] – a command line browser for automation and testing, as well as a powerful open source HTTP interaction library available for a large range of programming languages.<ref>{{cite web|url=https://curl.haxx.se/libcurl/|title=libcurl - the multiprotocol file transfer library|website=curl.haxx.se}}</ref>
* Google-search – A Go package to scrape Google.<ref>{{cite web|url=https://github.com/rocketlaunchr/google-search|title=A Go package to scrape Google.|via=GitHub}}</ref>
* GoogleScraper – A Python module to scrape different search engines (like Google, Yandex, Bing, Duckduckgo, Baidu) by using proxies (socks4/5, http proxy). The tool includes asynchronous networking support and is able to control real browsers to mitigate detection.<ref name=":0" />
* se-scraper – Successor of GoogleScraper. Scrapes search engines concurrently with different proxies.<ref>{{Citation|last=Tschacher|first=Nikolai|title=NikolaiT/se-scraper|date=2020-11-17|url=https://github.com/NikolaiT/se-scraper|access-date=2020-11-19}}</ref>
 
== Legal ==
When scraping websites and services, the legal side is often a big concern for companies; for web scraping, it greatly depends on the country the scraping user or company is based in, as well as on which data or website is being scraped, with many different court rulings all over the world.<ref>{{cite web|url=http://blog.icreon.us/advise/web-scraping-legality|title=Is Web Scraping Legal?|publisher=Icreon (blog)}}</ref><ref>{{cite web|url=https://arstechnica.com/tech-policy/2014/04/appeals-court-reverses-hackertroll-weev-conviction-and-sentence/|title=Appeals court reverses hacker/troll "weev" conviction and sentence [Updated]|website=arstechnica.com|date=11 April 2014}}</ref><ref>{{cite web|url=https://www.techdirt.com/articles/20090605/2228205147.shtml|title=Can Scraping Non-Infringing Content Become Copyright Infringement... Because Of How Scrapers Work?|website=www.techdirt.com|date=10 June 2009}}</ref>
 
However, when it comes to scraping search engines the situation is different: search engines usually do not list intellectual property, as they just repeat or summarize information they scraped from other websites.
 
The largest publicly known incident of a search engine being scraped happened in 2011, when Microsoft was caught scraping unknown keywords from Google for its own, rather new Bing service,<ref>{{cite magazine|url=https://www.wired.com/2011/02/bing-copies-google/|title=Google Catches Bing Copying; Microsoft Says 'So What?'|first=Ryan|last=Singel|magazine=Wired}}</ref> but even this incident did not result in a court case.
 
One possible reason might be that search engines like Google are getting almost all their data by scraping millions of publicly reachable websites, themselves without reading and accepting those terms.
 
==See also==
==References==
{{Reflist}}
 
==External links==
* [https://scrapy.org/ Scrapy] – Open source Python framework, not dedicated to search engine scraping but regularly used as a base and with a large number of users.
* [http://scraping.compunect.com Compunect scraping sourcecode] – A range of well known open source PHP scraping scripts, including a regularly maintained Google Search scraper for scraping advertisements and organic result pages.
* [http://google-rank-checker.squabbel.com/ Justone free scraping scripts] – Information about Google scraping as well as open source PHP scripts (last updated mid 2016).
* [http://scraping.services/?api&chapter=Source%20Code Scraping.Services source code] – Python and PHP open source classes for a third-party scraping API (updated January 2017, free for private use).
* [http://simplehtmldom.sourceforge.net/ PHP Simpledom] – A widespread open source PHP DOM parser to interpret HTML code into variables.
* [https://serpapi.com/ SerpApi] – Third party service based in the United States allowing you to scrape search engines legally.
 
[[Category:Search engine software]]
[[Category:Web crawlers| ]]
[[Category:Internet search algorithms]]
[[Category:Search engine optimization]]
[[Category:Web scraping]]