{{Original research|date=March 2021}}
'''Search engine scraping''' is the process of harvesting URLs, descriptions, or other information from search engines. It is a specific form of [[web scraping]] dedicated to search engines only.
Most commonly, larger [[search engine optimization]] (SEO) providers depend on regularly scraping keywords from search engines to monitor the competitive position of their customers' websites for relevant keywords or their [[search engine indexing|indexing]] status.
The process of entering a website and extracting data in an automated fashion is also often called "[[Web crawler|crawling]]". Search engines get almost all their data from automated crawling bots.
== Difficulties ==
Google is by far the largest search engine, both in number of users and in advertising revenue, which makes it the most important search engine to scrape for SEO-related companies.<ref>{{cite web|url=http://searchengineland.com/google-worlds-most-popular-search-engine-148089|title=Google Still World's Most Popular Search Engine By Far, But Share Of Unique Searchers Dips Slightly|date=11 February 2013|website=searchengineland.com}}</ref>
Although Google does not take legal action against scraping, it uses a range of defensive methods that make scraping its results a challenging task, even when the scraping tool is realistically [[Spoofing attack|spoofing]] a normal [[web browser]]:
* Google uses a complex system of request rate limitation which can vary for each language, country, and User-Agent, as well as for the keywords or search parameters requested. The rate limitation can make automated access unpredictable, as the behaviour patterns are not known to the outside developer or user.
* Network and [[IP address|IP]] limitations are also part of the scraping defence systems. Search engines cannot easily be tricked simply by switching to another IP, yet using proxies remains a very important part of successful scraping; the diversity and abuse history of an IP matter as well (a pacing and proxy-rotation sketch follows this list).
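The following is a minimal sketch of randomised request pacing and proxy rotation in Python, assuming the third-party <code>requests</code> library is installed; the proxy addresses, delay bounds, and the <code>fetch</code> helper are illustrative placeholders rather than a recommended configuration.

<syntaxhighlight lang="python">
import itertools
import random
import time

import requests  # third-party HTTP client, assumed to be installed

# Hypothetical pool of exclusive proxies; these addresses are placeholders.
PROXIES = [
    "http://198.51.100.10:8080",
    "http://198.51.100.11:8080",
    "http://198.51.100.12:8080",
]
proxy_cycle = itertools.cycle(PROXIES)


def fetch(url):
    """Fetch a URL through the next proxy after a randomised pause."""
    # A long, randomised delay keeps the request rate low and makes the
    # access pattern harder to distinguish from sporadic human use.
    time.sleep(random.uniform(60, 180))
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
</syntaxhighlight>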
== Methods of scraping ==
{{unref-section|date=July 2024}}
To scrape a search engine successfully, the two major factors are time and volume.
The more keywords a user needs to scrape and the shorter the time available for the job, the more difficult scraping will be and the more developed a scraping script or tool needs to be.
Scraping scripts need to overcome a few technical challenges:
* Utilizing IP rotation with proxies. These proxies should be exclusive (unshared) and not flagged on any blacklists.
* Proper time management: time between keyword changes, pagination, and correctly placed delays. Effective long-term scraping rates can vary from only 3–5 requests (keywords or pages) per hour up to 100 or more per hour for each IP address/proxy in use. The quality of IPs, methods of scraping, keywords requested, and language/country requested can greatly affect the possible maximum rate.
* Correct handling of URL parameters, cookies as well as HTTP headers to emulate a user with a typical browser
* HTML [[Document Object Model|DOM]] parsing (extracting URLs, descriptions, ranking position, sitelinks and other relevant data from the HTML code)
* Error handling, automated reaction to captcha or block pages, and other unusual responses (a combined sketch follows this list)
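These points can be combined into a single scraping step. The following Python sketch assumes the third-party <code>requests</code> and <code>beautifulsoup4</code> libraries; the header values, the captcha check, and the selector are illustrative assumptions, not a description of any particular search engine's markup.

<syntaxhighlight lang="python">
import requests                  # third-party HTTP client, assumed installed
from bs4 import BeautifulSoup    # third-party HTML parser (beautifulsoup4)

# Headers that emulate a typical desktop browser; the values are illustrative.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def scrape_results(url):
    """Fetch a results page, detect blocks, and extract links and titles."""
    response = requests.get(url, headers=BROWSER_HEADERS, timeout=30)

    # Crude block/captcha detection: unusual status codes or captcha markers
    # in the body are treated as a signal to back off and rotate the proxy.
    if response.status_code in (403, 429) or "captcha" in response.text.lower():
        raise RuntimeError("Blocked or served a captcha - slow down / rotate IP")

    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    # The selector below is a placeholder; real result markup differs per
    # engine and changes over time, so selectors need regular maintenance.
    for anchor in soup.select("a[href]"):
        title = anchor.get_text(strip=True)
        if title:
            results.append({"title": title, "url": anchor["href"]})
    return results
</syntaxhighlight>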
== Programming languages ==
{{unref-section|date=July 2024}}
When developing a scraper for a search engine, almost any programming language can be used, although, depending on performance requirements, some languages will be preferable.
[[PHP]] is a commonly used language to write scraping scripts for websites or backend services, since it has powerful capabilities built in (DOM parsers, libcURL); however, its memory usage is typically around 10 times that of similar C/[[C++]] code. [[Ruby on Rails]] as well as [[Python (programming language)|Python]] are also frequently used for automated scraping jobs.
Additionally, [[Bash scripting language|bash scripting]] can be used together with cURL as a command line tool to scrape a search engine.
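In the same spirit, a page can be fetched with nothing more than the Python standard library. In the sketch below, the URL and the User-Agent string are placeholders; a custom User-Agent is set only because the default <code>Python-urllib</code> value is an obvious sign of automated access.

<syntaxhighlight lang="python">
import urllib.request

# Placeholder target and browser-like User-Agent string.
request = urllib.request.Request(
    "https://www.example.com/search?q=test",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"},
)

# Fetch the page and decode it, replacing any undecodable bytes.
with urllib.request.urlopen(request, timeout=30) as response:
    html = response.read().decode("utf-8", errors="replace")

print(len(html), "bytes fetched")
</syntaxhighlight>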
== Legal ==
When scraping websites and services, legality is often a major concern for companies. For web scraping, it depends greatly on the country the scraping user or company is based in, as well as which data or website is being scraped, and there have been many different court rulings all over the world.
However, when it comes to scraping search engines the situation is different: search engines usually do not list intellectual property of their own, as they merely repeat or summarize information they have scraped from other websites.
The largest publicly known incident of a search engine being scraped happened in 2011, when Microsoft was caught scraping unknown keywords from Google for its own, then-new Bing service,<ref>{{cite magazine|url=https://www.wired.com/2011/02/bing-copies-google/|title=Google Catches Bing Copying; Microsoft Says 'So What?'|first=Ryan|last=Singel|magazine=Wired}}</ref> but even this incident did not result in a court case.
==See also==
==References==
{{Reflist}}
[[Category:Search engine software]]
[[Category:Search engine optimization]]
[[Category:Web scraping]]