{{Original research|date=March 2021}}
}}
'''Search engine scraping''' is the automated extraction of [[URL]]s, descriptions, and other data from [[search engine]] results. It is a specialized subset of [[web scraping]] focused exclusively on search engine content.
 
Most commonly larger [[search engine optimization]] (SEO) providers depend on regularly scraping keywords from search engines to monitor the competitive position of their customers' websites for relevant keywords or their [[search engine indexing|indexing]] status.
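The rank monitoring described above boils down to locating a customer's domain in a parsed results page. The following minimal sketch assumes the SERP has already been fetched and parsed into a plain list of result URLs; the function name and sample URLs are hypothetical:

```python
from urllib.parse import urlparse

def rank_of(domain, serp_urls):
    """Return the 1-based position of `domain` in a list of result URLs,
    or None if the domain does not appear."""
    for position, url in enumerate(serp_urls, start=1):
        host = urlparse(url).netloc.lower()
        if host == domain or host.endswith("." + domain):
            return position
    return None

# Hypothetical parsed results for one keyword
results = [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://www.example.com/seo-guide",
    "https://blog.example.org/post",
]
print(rank_of("example.com", results))  # → 2
```

Comparing the host suffix rather than the full URL lets the check match both the bare domain and its subdomains.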
Google is by far the largest search engine, with the most users and the most advertising revenue, which makes it the most important search engine to scrape for SEO-related companies.<ref>{{cite web|url=http://searchengineland.com/google-worlds-most-popular-search-engine-148089|title=Google Still World's Most Popular Search Engine By Far, But Share Of Unique Searchers Dips Slightly|date=11 February 2013|website=searchengineland.com}}</ref>
 
Although Google does not take legal action against scraping, it uses a range of defensive methods that make scraping its results a challenging task, even when the scraping tool is realistically [[Spoofing attack|spoofing]] a normal [[web browser]]:
* Google uses a complex system of request rate limitation, which can vary by language, country, and User-Agent, as well as by the keywords or search parameters. This rate limitation can make automated access to a search engine unpredictable, as the behaviour patterns are not known to outside developers or users.
* Network and [[IP address|IP]] limitations are also part of the scraping defense systems. Search engines cannot easily be tricked simply by switching to another IP, which makes the use of proxies an important part of successful scraping. The diversity and abuse history of an IP address matter as well.
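Scraping tools typically respond to the rate limiting described above by pacing their requests with randomized delays, so the timing does not form an obvious machine-like pattern. This is a minimal sketch; the delay figures are illustrative guesses, not published limits:

```python
import random
import time

def polite_delay(base_seconds=10.0, jitter=0.5):
    """Sleep for a randomized interval around `base_seconds`, varying by
    up to +/- `jitter` as a fraction of the base. Returns the actual
    delay used so callers can log it."""
    delay = base_seconds * (1.0 + random.uniform(-jitter, jitter))
    time.sleep(delay)
    return delay
```

In practice such a delay would be combined with per-IP and per-User-Agent bookkeeping, since the rate limits are believed to apply per identity.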
 
The first layer of defense is a captcha page<ref>{{Cite web|url=https://support.google.com/recaptcha/answer/6081888?hl=en|title=My computer is sending automated queries – reCAPTCHA Help|website=support.google.com|language=en|accessdate=2017-04-02}}</ref> where the user is prompted to verify they are a real person and not a bot or tool. Solving the [[CAPTCHA|captcha]] will create a [[HTTP cookie|cookie]] that permits access to the search engine again for a while. After about one day, the captcha page is displayed again.
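A scraper dealing with this first layer typically persists the cookie issued after the captcha is solved and checks each response for the interstitial page. The sketch below uses the standard-library `http.cookiejar`; the file name and the marker strings used to detect the captcha page are assumptions and may change at any time:

```python
import http.cookiejar
import urllib.request

# Assumption: once a captcha has been solved (e.g. manually in a browser),
# the resulting cookie can be exported to a Mozilla-format file and
# replayed here for subsequent requests.
jar = http.cookiejar.MozillaCookieJar("search_cookies.txt")
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def looks_like_captcha(html):
    """Heuristic check for the captcha interstitial; the marker strings
    are guesses, not documented identifiers."""
    return "unusual traffic" in html or "/sorry/" in html
```

The cookie jar would be saved after a successful solve and reloaded on startup, since the source notes the cookie only grants access "for a while".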
 
The second layer of defense is a similar error page, but without a captcha; in this case the user is completely blocked from using the search engine until the temporary block is lifted or the user changes their IP address.
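A common client-side reaction to this second layer is to back off with increasing waits before retrying (or switching to another IP). The following sketch models a temporary block as a fetch callable that returns `None`; the wait lengths and retry counts are illustrative, since real block durations are not published:

```python
import time

def fetch_with_backoff(fetch, max_attempts=4, initial_wait=60.0):
    """Call `fetch` until it returns a page, doubling the wait after each
    blocked attempt. Returns None if every attempt was blocked."""
    wait = initial_wait
    for attempt in range(max_attempts):
        page = fetch()
        if page is not None:
            return page
        time.sleep(wait)  # back off before retrying
        wait *= 2
    return None
```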