I have a Python script to scrape Alibaba products. When I run it and visit several Alibaba product pages, the Alibaba website starts blocking my IP because I hit their site every 3 seconds.
I found a way to avoid getting blocked: add a 20-second delay before each product page. But then the scraping is too slow, and Alibaba has millions of products per category.
Can you suggest how to improve scraping speed without getting my IP blocked? Is a proxy IP the only solution?
You can buy proxy IPs cheaply; they usually come as a .txt file with one proxy per line. You can even get a premium version from some proxy provider if you Google hard enough.
You can then pair random headers with random IPs and send each request through that combination. Rotating the User-Agent header automatically and picking a random proxy for every request achieves what you want.
There is no readily available implementation of this, but the code will look something like:
from fake_useragent import UserAgent
import requests

query = "example search term"  # whatever you are searching for
url = "https://www.google.com/search?tbm=bks&q=" + query

headers = {
    'User-Agent': UserAgent().random  # a fresh random User-Agent each run
}

# Placeholder proxy; pick one from your bought list instead (see below)
proxies = {"https": "https://1.2.3.4:8080"}
response = requests.get(url, headers=headers, proxies=proxies)
r = response.content
The code will look something along the lines above; the remaining piece is randomizing which IP gets picked from the proxy list and plugging it into proxies.
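A minimal sketch of that random pick, assuming the provider's list is saved as proxies.txt with one proxy URL per line (the file name and format are assumptions, not something the provider guarantees):

import random
import requests
from fake_useragent import UserAgent

# Assumed format: proxies.txt holds one proxy URL per line,
# e.g. https://1.2.3.4:8080
with open("proxies.txt") as f:
    proxy_list = [line.strip() for line in f if line.strip()]

def fetch(url):
    # A fresh random proxy and User-Agent for every request
    proxy = random.choice(proxy_list)
    headers = {'User-Agent': UserAgent().random}
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy},
                        timeout=10)

With the load spread across many proxies, each individual IP only hits Alibaba occasionally, so you can drop the per-request delay well below 20 seconds.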
Also keep in mind: the type of proxy you use matters. Plain HTTP proxies are easy for the website to detect; HTTPS proxies are what you need. They are harder to get, though, and they tend to stop working quickly!
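Because proxies die so fast, it can be worth filtering the list before each run. A rough sketch, reusing the proxy_list from above and httpbin.org as a neutral test endpoint (both are assumptions, not part of the original setup):

import requests

def is_alive(proxy, timeout=5):
    # Send a tiny request through the proxy; any exception means it is dead
    try:
        r = requests.get("https://httpbin.org/ip",
                         proxies={"http": proxy, "https": proxy},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

proxy_list = [p for p in proxy_list if is_alive(p)]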
Hope this helps. :)