Search engine scraping: Difference between revisions

Content deleted Content added
AnomieBOT (talk | contribs)
m Dating maintenance tags: {{Toomanylinks}}
ChabbD (talk | contribs)
fixed spelling
Line 45:
Scraping scripts need to overcome a few technical challenges:<ref>{{cite web|url=http://google-rank-checker.squabbel.com|title=Scraping Google Ranks for Fun and Profit|website=google-rank-checker.squabbel.com}}</ref>
* IP rotation using Proxies (proxies should be unshared and not listed in blacklists)
* Proper time management, time between keyword changes, pagination as well as correctly placed delays Effective longtermlong-term scraping rates can vary from only 3–5 requests (keywords or pages) per hour up to 100 and more per hour for each IP address / Proxy in use. The quality of IPs, methods of scraping, keywords requested and language/country requested can greatly affect the possible maximum rate.
* Correct handling of URL parameters, cookies as well as HTTP headers to emulate a user with a typical browser<ref name=":0" />
*HTML [[Document Object Model|DOM]] parsing (extracting URLs, descriptions, ranking position, sitelinks and other relevant data from the HTML code)
Line 65:
* [[cURL]] – a command line browser for automation and testing, as well as a powerful open source HTTP interaction library available for a large range of programming languages.<ref>{{cite web|url=https://curl.haxx.se/libcurl/|title=libcurl - the multiprotocol file transfer library|website=curl.haxx.se}}</ref>
* Google-search - A Go package to scrape Google.<ref>{{cite web|url=https://github.com/rocketlaunchr/google-search|title=A Go package to scrape Google.|via=GitHub}}</ref>
* [https://seotoolskit.co/ SEO Tools Kit] – Free Online Tools, DuckduckgoDuckDuckGo, Baidu, [[Sogou]]) by using proxies (socks4/5, http proxy). The tool includes asynchronous networking support and is able to control real browsers to mitigate detection.<ref>{{cite web|url=https://seotoolskit.co/|title=Free online SEO Tools (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.: NikolaiT/SEO Tools Kit|date=15 January 2019|publisher=|via=GitHub}}</ref>
*se-scraper - Successor of SEO Tools Kit. Scrape search engines concurrently with different proxies.<ref>{{Citation|last=Tschacher|first=Nikolai|title=NikolaiT/se-scraper|date=2020-11-17|url=https://github.com/NikolaiT/se-scraper|access-date=2020-11-19}}</ref>