Search engine scraping: Difference between revisions

Hedja (talk | contribs)
Removed some excessive mentions of specific search engines (seems to be spam).
m Utilizing IP rotation with proxies. These proxies should be exclusive (unshared) and not flagged on any blacklists.
Line 42:
 
Scraping scripts need to overcome a few technical challenges:<ref>{{cite web|url=http://google-rank-checker.squabbel.com|title=Scraping Google Ranks for Fun and Profit|website=google-rank-checker.squabbel.com}}</ref>
* Utilizing IP rotation with proxies. These proxies should be exclusive (unshared) and not flagged on any blacklists.
* Proper time management: time between keyword changes, pagination, and correctly placed delays. Effective long-term scraping rates can vary from only 3–5 requests (keywords or pages) per hour up to 100 or more per hour for each IP address or proxy in use. The quality of the IPs, the scraping method, the keywords requested, and the language or country requested can greatly affect the maximum possible rate.
* Correct handling of URL parameters, cookies, and HTTP headers to emulate a user with a typical browser<ref name=":0" />
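
The relationship between a per-IP hourly rate and the delay between requests, combined with round-robin proxy rotation, can be sketched as follows. This is a minimal illustration only and performs no network requests: the function names, the proxy labels, and the 30% jitter value are assumptions made for the example and do not come from the cited source. A rate of 5 requests per hour corresponds to an average gap of 720 seconds on each IP, and 100 per hour to 36 seconds.

```python
import itertools
import random


def mean_delay_seconds(requests_per_hour: float) -> float:
    """Average gap between requests on one IP for a given hourly rate."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return 3600.0 / requests_per_hour


def jittered_delay(requests_per_hour: float, jitter: float = 0.3, rng=random) -> float:
    """Delay drawn uniformly within +/- jitter of the mean gap.

    The jitter fraction is an illustrative assumption, not a recommended value.
    """
    base = mean_delay_seconds(requests_per_hour)
    return rng.uniform(base * (1 - jitter), base * (1 + jitter))


def build_schedule(keywords, proxies, requests_per_hour, jitter=0.3, rng=random):
    """Assign keywords to proxies round-robin and compute start offsets.

    Returns a list of (start_offset_seconds, proxy, keyword) tuples sorted by
    start time. Each proxy keeps roughly the target hourly rate on its own.
    """
    next_free = {proxy: 0.0 for proxy in proxies}
    schedule = []
    for keyword, proxy in zip(keywords, itertools.cycle(proxies)):
        start = next_free[proxy]
        schedule.append((start, proxy, keyword))
        # The next request on this proxy waits one jittered gap.
        next_free[proxy] = start + jittered_delay(requests_per_hour, jitter, rng)
    return sorted(schedule)


if __name__ == "__main__":
    # "proxy-a" and "proxy-b" are placeholder labels, not real endpoints.
    plan = build_schedule(
        ["keyword 1", "keyword 2", "keyword 3", "keyword 4"],
        ["proxy-a", "proxy-b"],
        requests_per_hour=5,
    )
    for start, proxy, keyword in plan:
        print(f"{start:8.1f}s  {proxy}  {keyword}")
```

Because each proxy is paced independently, adding proxies raises the total throughput without raising the rate seen from any single IP address, which is the reason the rates above are quoted per IP.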