Search engines like Google have implemented various forms of human detection to block automated access to their services,<ref>{{Cite web|url=https://support.google.com/webmasters/answer/66357?hl=en|title=Automated queries – Search Console Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> with the intent of driving users of scrapers towards buying their official [[API]]s instead.
The process of entering a website and extracting data in an automated fashion is also often called "[[Web crawler|crawling]]".
== Difficulties ==
Google is by far the largest search engine, with the most users as well as the most advertising revenue, which makes it the most important search engine to scrape for SEO-related companies.<ref>{{cite web|url=http://searchengineland.com/google-worlds-most-popular-search-engine-148089|title=Google Still World's Most Popular Search Engine By Far, But Share Of Unique Searchers Dips Slightly|date=11 February 2013|website=searchengineland.com}}</ref>
Although Google does not take legal action against scraping, it uses a range of defensive methods that make scraping its results a challenging task, even when the scraping tool is realistically [[Spoofing attack|spoofing]] a normal web browser:
* Google uses a complex system of request rate limitation which can vary by language, country and User-Agent, as well as by the keywords or search parameters. The rate limitation makes automated access unpredictable, as the behaviour patterns are not known to the outside developer or user (a timing sketch is shown after this list).
* Network and [[IP address|IP]] limitations are also part of the scraping defense systems. Search engines cannot easily be tricked by simply switching to another IP, so using proxies is a very important part of successful scraping. The diversity and abusive history of an IP matter as well.
* Offending IPs and offending IP networks can easily be stored in a blacklist database to detect offenders much faster. The fact that most ISPs assign [[dynamic IP addresses]] to customers requires that such automated bans be only temporary.
* Behaviour-based detection is the most difficult defense system. Search engines serve their pages to millions of users every day, which provides a large amount of behaviour information. A scraping script or bot does not behave like a real user: aside from non-typical access times, delays and session times, the keywords being harvested might be related to each other or include unusual parameters. Google, for example, has a very sophisticated behaviour analysis system, possibly using [[deep learning]] software to detect unusual patterns of access. It can detect unusual activity much faster than other search engines.<ref>{{cite web|url=http://tor.stackexchange.com/questions/313/does-google-know-that-i-am-using-tor-browser|title=Does Google know that I am using Tor Browser?|website=tor.stackexchange.com}}</ref>
* [[HTML]] markup changes. Depending on the methods used to harvest the content of a website, even a small change in the HTML data can render a scraping tool broken until it is updated.
* General changes in detection systems. In recent years search engines have tightened their detection systems nearly month by month, making it more and more difficult to scrape reliably, as developers need to experiment and adapt their code regularly.<ref>{{cite web|url=https://productforums.google.com/forum/#!topic/websearch/MAju1QDF6_8|title=Google Groups|website=google.com}}</ref>
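As a rough illustration of the timing problem, a scraper typically avoids machine-regular request intervals and a fixed browser fingerprint. The following Python sketch shows the general idea; the delay bounds and User-Agent strings are illustrative assumptions, since the actual rate limits are not published:

<syntaxhighlight lang="python">
import random
import time

# Illustrative User-Agent strings; real tools rotate through many more.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_delay(min_s=10.0, max_s=60.0):
    """Sleep for a randomized interval so requests do not arrive at
    machine-regular intervals, one of the behaviour signals described
    above. The bounds are assumptions, not known safe limits."""
    time.sleep(random.uniform(min_s, max_s))

def next_headers():
    """Build browser-like headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
</syntaxhighlight>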
== Detection ==
When a search engine's defense system suspects that an access might be automated, it can react in several ways.
The first layer of defense is a captcha page<ref>{{Cite web|url=https://support.google.com/recaptcha/answer/6081888?hl=en|title=My computer is sending automated queries – reCAPTCHA Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> where the user is prompted to verify they are a real person and not a bot or tool. Solving the [[CAPTCHA|captcha]] creates a [[HTTP cookie|cookie]] that permits access to the search engine again for a while. After about one day, the captcha page is removed.
The second layer of defense is a similar error page, but without a captcha. In this case the user is completely blocked from using the search engine until the temporary block is lifted or the user changes their IP.
The third layer of defense is a long-term block of the entire network segment. Google has blocked large network segments for months. This sort of block is likely triggered by an administrator and only happens if a scraping tool is sending a very high number of requests.
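In practice, a scraping tool has to recognize which of these layers it has hit. A minimal Python sketch using the requests library is shown below; the status codes and the "/sorry/" URL marker are assumptions based on commonly observed behaviour, not a documented interface:

<syntaxhighlight lang="python">
import requests

def classify_response(resp: requests.Response) -> str:
    """Heuristically map a response to one of the defense layers
    described above. The markers used here are assumptions and may
    change at any time."""
    if resp.status_code == 429 or "/sorry/" in resp.url:
        return "captcha"   # first layer: solvable captcha page
    if resp.status_code == 403:
        return "blocked"   # second layer: temporary hard block
    return "ok"            # normal result page

resp = requests.get(
    "https://www.google.com/search",
    params={"q": "example"},
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
print(classify_response(resp))
</syntaxhighlight>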
== Methods of scraping Google, Bing, Yahoo, [[Petal Search|Petal]] or [[Sogou]] ==
To scrape a search engine successfully, the two major factors are time and volume.
The more keywords a user needs to scrape and the smaller the time window for the job, the more difficult scraping will be and the more developed a scraping script or tool needs to be.
Scraping scripts need to overcome a few technical challenges:<ref>{{cite web|url=http://google-rank-checker.squabbel.com|title=Scraping Google Ranks for Fun and Profit|website=google-rank-checker.squabbel.com}}</ref>
* IP rotation using proxies (proxies should be unshared and not listed in blacklists)
* Proper time management: time between keyword changes, pagination, as well as correctly placed delays
* Correct handling of URL parameters, cookies as well as HTTP headers to emulate a user with a typical browser<ref name=":0" />
* HTML [[Document Object Model|DOM]] parsing (extracting URLs, descriptions, ranking position, sitelinks and other relevant data from the HTML code)
* Captcha handling, as described above<ref>{{cite news |last1=Jan Janssen |title=Online Marketing Services van SEO SNEL |url=https://www.seo-snel.nl/captcha/ |accessdate=26 September 2019 |work=SEO SNEL |agency=Services |date=26 September 2019 |language=nl}}</ref>
Examples of open-source scraping software that make use of the above techniques are listed under [[#Tools and scripts|Tools and scripts]] below.
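A minimal sketch in Python of how these elements fit together is shown below. The proxy addresses and the "div.g"/"h3" selectors are assumptions for illustration; the selectors in particular break whenever the search engine changes its HTML markup, as noted under Difficulties:

<syntaxhighlight lang="python">
import itertools
import random
import time

import requests
from bs4 import BeautifulSoup

# Placeholder proxy URLs; real scrapers need unshared, non-blacklisted
# proxies as noted above.
PROXIES = itertools.cycle([
    "socks5://127.0.0.1:9050",
    "http://127.0.0.1:8080",
])

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def scrape_keyword(keyword, proxy=None):
    """Fetch one result page (optionally through a proxy) and parse it."""
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": keyword, "num": 10},    # URL parameter handling
        headers=HEADERS,                     # browser-like headers
        proxies={"http": proxy, "https": proxy} if proxy else None,
        timeout=15,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    # DOM parsing: extract URL, title and ranking position. The
    # "div.g"/"h3" selectors are assumptions and break whenever the
    # search engine changes its markup (see Difficulties).
    for rank, hit in enumerate(soup.select("div.g"), start=1):
        link = hit.find("a", href=True)
        title = hit.find("h3")
        if link and title:
            results.append({"rank": rank, "url": link["href"],
                            "title": title.get_text()})
    return results

for kw in ["first example keyword", "second example keyword"]:
    print(scrape_keyword(kw, proxy=next(PROXIES)))   # IP rotation
    time.sleep(random.uniform(20, 60))               # randomized delay
</syntaxhighlight>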
== Programming languages ==
When developing a scraper for a search engine, almost any programming language can be used, although, depending on performance requirements, some languages are preferable.
[[PHP]] is a commonly used language for writing scraping scripts for websites or backend services, since it has powerful capabilities built in (DOM parsers, libcURL); however, its memory usage is typically about ten times that of similar C/[[C++]] code. [[Ruby on Rails]] as well as [[Python (programming language)|Python]] are also frequently used for automated scraping jobs. For the highest performance, C++ DOM parsers should be considered.
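For instance, Python scrapers often obtain near-native parsing performance by delegating the DOM work to a C-based parser such as lxml, which wraps libxml2. A minimal sketch, using a stand-in HTML snippet instead of a fetched result page:

<syntaxhighlight lang="python">
from lxml import html  # lxml wraps the libxml2 C parser

# Stand-in HTML; a real scraper would parse a fetched result page.
page = html.fromstring(
    '<div class="g"><a href="https://example.org"><h3>Example</h3></a></div>'
)
for link in page.xpath('//div[@class="g"]//a'):
    print(link.get("href"), link.text_content())
</syntaxhighlight>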
== Tools and scripts ==
When developing a search engine scraper, there are several existing tools and libraries available that can be used, extended, or simply analyzed to learn from.
* [[iMacros]] – A free browser automation toolkit that can be used for very small volume scraping from within a user's browser.
* [[cURL]] – a command-line tool for automation and testing, as well as a powerful open source HTTP interaction library available for a large range of programming languages (a usage sketch follows this list).<ref>{{cite web|url=https://curl.haxx.se/libcurl/|title=libcurl - the multiprotocol file transfer library|website=curl.haxx.se}}</ref>
* [https://seotoolskit.co/ SEO Tools Kit] – Free online tools for scraping search engines (Google, Yandex, Bing, DuckDuckGo, Baidu, [[Petal Search|Petal]], [[Sogou]]) by using proxies (SOCKS4/5, HTTP proxy). The tool includes asynchronous networking support and is able to control real browsers to mitigate detection.<ref>{{cite web|url=https://seotoolskit.co/|title=Free online SEO Tools (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.: NikolaiT/SEO Tools Kit|date=15 January 2019|publisher=|via=GitHub}}</ref>
* se-scraper – Successor of SEO Tools Kit. Scrapes search engines concurrently with different proxies.<ref>{{Citation|last=Tschacher|first=Nikolai|title=NikolaiT/se-scraper|date=2020-11-17|url=https://github.com/NikolaiT/se-scraper|access-date=2020-11-19}}</ref>
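As an illustration of how libcurl is typically driven from a scripting language, the following Python sketch uses the pycurl binding; the commented-out proxy address is a placeholder assumption:

<syntaxhighlight lang="python">
from io import BytesIO

import pycurl

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://example.org/")
c.setopt(pycurl.WRITEDATA, buffer)         # collect the response body
c.setopt(pycurl.USERAGENT, "Mozilla/5.0")  # browser-like User-Agent
c.setopt(pycurl.FOLLOWLOCATION, True)      # follow HTTP redirects
# c.setopt(pycurl.PROXY, "socks5://127.0.0.1:9050")  # placeholder proxy
c.perform()
status = c.getinfo(pycurl.RESPONSE_CODE)
c.close()
print(status, len(buffer.getvalue()))
</syntaxhighlight>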