Amazon2csv: Amazon products scraper to CSV (no API token required)

(github.com)

129 points | tducret | 7y ago | 56 comments

yoaviram 7y ago

Why not use the API? Disclaimer: I'm the author of python-amazon-simple-product-api [1]

[1] https://github.com/yoavaviram/python-amazon-simple-product-a...

k__ 7y ago

Sometimes this isn't possible.

I wrote an app that is basically a new UI for the Amazon products. It runs entirely on the client. The Amazon API simply didn't work in that setup.

AznHisoka 7y ago

Are you referring to the Product Advertising API?

Doesn't that require a quota of affiliate sales to keep using it? I can't find where they state this requirement, but I remember they were very sneaky about disclosing it. If you don't have any affiliate sales after X months, your API key stops working.

ZoomStop 7y ago

Currently you have to be a member of their affiliate program to get API access. To become a full "member" you have to be a prospect who generates three referral sales (iirc) within a 30 day period. So once in you have the API, but getting in isn't as easy as filling out a form. From there you can get your API rate limits increased from the default 1x call per second up to 10 based on your prior 30 day affiliate sales.
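A minimal sketch of how a client might stay under a per-second limit like the one described above. The `Throttle` class and its parameters are hypothetical names for illustration, not part of any Amazon SDK, and the 1-per-second figure is taken from the comment rather than from official documentation.

```python
import time


class Throttle:
    """Space calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock          # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._next_allowed = None    # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `throttle.wait()` before each request keeps a single-threaded client at or below the configured rate; raising `rate` to 10 would model the higher tier mentioned in the comment.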

raitucarp 7y ago

Man, looks great. I also built something similar in Node.js. I implemented everything the documentation describes (a complete implementation). ICYMI:

https://github.com/Ribhnux/piranhax

wdr1 7y ago

The API comes with a TOS that severely restricts what you can do with the data.

amingilani 7y ago

Scraping Amazon is fun and all, but when you start overdoing it they rate-limit your IP and show you my worst nightmare: the Dogs of Amazon (a 500 page with pictures)

Why do I know this? Because I'm the CTO at Nazdeeq.com where we let users buy Amazon products from countries where they don't ship easily, like Pakistan.

Edit: totally open to partnerships in more countries

jeanlucas 7y ago

I'm from Brazil and what you said made me curious. I'm not sure why, but Amazon didn't catch on here. How did you solve problems like logistics and interest from the public?

amingilani 7y ago

I'm sorry, I have trouble understanding your question, but if you mean how we ship from Amazon to Pakistan and how we got people to use our service: we worked out a pipeline to get products from Amazon in the US to Pakistan, and relied on advertising plus word of mouth. Also:

+ There's no direct way to buy 90% of products from Amazon since they don't ship to Pakistan

+ Our service is the only one in the country that gives a fixed price at checkout in PKR

+ Our customer service is excellent

+ We're one of the cheapest options available, as long as the competition imports products legally.

yasoob 7y ago

Hi Amin, your platform seems nice. Just wanted to give you a heads-up that your website is being classified as ["phishing" by Avast](https://i.imgur.com/SmuuRfD.png). I think if you replace "Amazon" in the url with something else it should work fine. Best of luck!

always_good 7y ago

Reminds me of how nobody could see one of my users' avatars because the URL (a hash) started with an "ad" segment (for bucketing), as in "/avatars/ad/ad3adb33f". So adblockers blocked it.

My protest against such a ridiculous heuristic was to not fix it.
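A rough sketch of how that collision can arise. Everything here is an assumption for illustration: the SHA-1 hash, the two-character bucket, and the segment-matching rule are stand-ins, since the comment does not say which hash or which adblock rule was involved.

```python
import hashlib


def avatar_path(user_id: str) -> str:
    """Bucket an avatar under the first two hex characters of its hash."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    return f"/avatars/{digest[:2]}/{digest}"


def naive_filter_blocks(path: str) -> bool:
    """Hypothetical blocker rule: any path segment equal to 'ad' or 'ads'."""
    return any(segment in {"ad", "ads"} for segment in path.split("/"))
```

With a two-hex-character bucket, roughly one hash in 256 starts with `ad`, so a rule this broad would hide a small but steady share of avatars.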

amingilani 7y ago

Thank you Yasoob! Dammit, again? I already had them whitelist our site once, but I'll look into this again. Thank you!

jploh 7y ago

In the Philippines there's something quite similar called Galleon. They were recently acquired, but I think they might be open to partnering. They've expanded to Thailand, if I'm not mistaken.

dewey 7y ago

Are you using the API or web scraping? We never really had problems with IP banning if the traffic looks like a real user.

amingilani 7y ago

Neither, actually, we're using a heavily configured reverse-proxy.

This means that, unfortunately, all the traffic has to go through our own servers.

Jdam 7y ago

The issue with those tools is that Amazon changes the product layout very often and heavily conducts A/B tests. I’ve once even heard that computer vision is the most stable way to scrape Amazon. I guess this library will stop working rather soon.

RhodesianHunter 7y ago

>I’ve once even heard that computer vision is the most stable way to scrape Amazon

At a former employer we scraped Amazon many millions of times per day with very simple old tools that rarely needed updating.

mxvzr 7y ago

Are you able to share some details? How often did you have to get new IP addresses? What about the user agent? Were the scrapers "straight to the point" like amazon2csv (i.e. make a request directly to the search page), or did they have randomized behavior (e.g. re-use sessions from time to time, click a random link on the page, start from the homepage...)? Did the scrapers ever go against amz's robots.txt directives (e.g. interacting with the cart page)? Did you ever hear from amz itself about your employer's activities on their site?

AznHisoka 7y ago

Same here. Scraping their search results page is easy if you have a bunch of IPs. No manipulation or workarounds needed (i.e. headless browsers, ensuring your HTTP headers look like a real user's).

I have not scraped a ton of actual individual product pages though, so I can't speak to scraping those. I do remember it might have been harder.

mygo 7y ago

> I guess this library will stop working rather soon.

Don’t really see that as a dealbreaker. So the library will need maintenance. Normal for libraries to need updates in order to keep up with changes. It works today, and it will work whenever it’s updated. Better than nothing and for many use cases that’s good enough.

hobofan 7y ago

Search results scraping on Amazon is fairly stable.

What's more difficult is product page scraping, because there you have hundreds of different variations. Some from A/B testing and a lot just being specific things that show up for certain product categories (e.g. video games).

bufferoverflow 7y ago

I remember trying to build a scraper for Amazon. I quickly discovered that there are many types of item pages, and they change over time too. A/B testing probably. Just to get the price of the product out of their HTML markup reliably was a nightmare, I had to build a huge tree of if-this-then-maybe-that logic.
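One common way to tame that kind of if-this-then-maybe-that tree is an ordered list of extractors tried in turn. The sketch below assumes made-up markup: the `price-main`, `deal-price`, and `data-price` patterns are hypothetical and do not describe Amazon's real pages.

```python
import re
from typing import Optional

# Hypothetical patterns, one per page layout, tried in order.
PRICE_PATTERNS = [
    re.compile(r'id="price-main"[^>]*>\s*\$([\d,]+\.\d{2})'),
    re.compile(r'class="deal-price"[^>]*>\s*\$([\d,]+\.\d{2})'),
    re.compile(r'data-price="([\d,]+\.\d{2})"'),
]


def extract_price(html: str) -> Optional[float]:
    """Return the first price found by any known layout, or None if none matches."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(html)
        if match:
            # Drop thousands separators before converting.
            return float(match.group(1).replace(",", ""))
    return None
```

Supporting a new layout then means appending one pattern rather than adding another branch, and a `None` result is an explicit signal that the page no longer matches anything known.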

AdamRoberts 7y ago

The company I work for (zinc.io) has this: https://zincapi.com/

We brand it as an ordering API, but we also offer retrieving the product data (item details/pricing.) We put a LOT of engineering resources into data quality and maintenance, as the API is core to our flagship product, PriceYak. If you have questions or want a token, email adam@zinc.io and mention this post.

ikeboy 7y ago

If you're using this for anything serious, it's probably better to sign up for the keepa API at about $50/month and they scrape Amazon for you. Worth it to not need to deal with the complexities.

AdamM12 7y ago

Nice. From my experience I've found Parsel [1] (used by scrapy) to be an easier to use HTML parsing library than Beautiful Soup. That's just imo.

[1] https://github.com/scrapy/parsel

microdrum 7y ago

Hm, another no-API option (at least if you are on WordPress) is: https://wpcommission.com

alex_sp 7y ago

So how many calls is one allowed before getting banned? Any guidelines on how to use this without breaching T&Cs?

staticautomatic 7y ago

Am I the only one who thinks this is rather weird, or at least unconventional code for a scraper in Python?

dec0dedab0de 7y ago

I just took a glance, but nothing seemed too off. Do you care to elaborate?

staticautomatic 7y ago

Sure. I'm not really trying to criticize the code, it's just that a lot of this looks foreign and unconventional to me.

1. requests.Session() is a class. IDK what requests.session() invokes (see https://github.com/tducret/amazon-scraper-python/blob/master...).

2. Isn't one of the points of using Session() that it'll persist stuff like cookies and headers? So why is it re-defining the headers multiple times? (e.g. both GET and POST in the same session have their own respective but identical headers).

3. Is the use of `arg=""` idiomatic? For example in https://github.com/tducret/amazon-scraper-python/blob/master...

4. Using raw list indices without some kind of helper function to catch index and other errors when parsing is not really a good idea in scraping (e.g. `selection[0].text.strip()`).
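To illustrate point 4, here is a minimal sketch of the kind of helper the comment has in mind. `first_text` is a hypothetical name, not something from the linked repository, and `SimpleNamespace` only stands in for whatever node objects the HTML parser returns.

```python
from types import SimpleNamespace  # fakes parsed nodes for the usage lines below


def first_text(selection, default=None):
    """Return the stripped text of the first node, or `default` if it is missing."""
    try:
        text = selection[0].text
    except (IndexError, TypeError, AttributeError):
        # Empty result, None instead of a list, or a node without `.text`.
        return default
    return text.strip() if text is not None else default


nodes = [SimpleNamespace(text="  $19.99 \n")]
print(first_text(nodes))        # $19.99
print(first_text([], "n/a"))    # n/a
```

A page variant that lacks the element then yields a default value instead of an `IndexError` that kills the whole run.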

RobLach 7y ago

If it works...

kull 7y ago

It is also illegal to scrape Amazon: if you scrape it, you don't own the content, and you are just stealing product data added to the site by the products' proper owners.

zeusk 7y ago

Why aren't Larry and Sergey behind bars, then? Scraping publicly available information is far from illegal.

Also, interestingly, only Alibaba's bots are completely blocked from crawling: https://www.amazon.com/robots.txt
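For anyone who wants to check such rules programmatically, Python's standard library ships `urllib.robotparser`. The sketch below parses an in-memory rule set; the rules and the `BlockedBot` and `my-scraper` agent names are hypothetical and are not a copy of Amazon's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cart
Disallow: /account

User-agent: BlockedBot
Disallow: /
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched or loaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)
print(parser.can_fetch("my-scraper", "https://example.com/search"))
print(parser.can_fetch("my-scraper", "https://example.com/cart"))
print(parser.can_fetch("BlockedBot", "https://example.com/search"))
```

Under these sample rules the generic agent may fetch the search path but not the cart, while the agent that is disallowed at `/` is refused everywhere.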

stef25 7y ago

> Scraping publicly available information is far from illegal.

The scraping itself may not be (although I'm pretty sure here in Belgium there is a law against collecting other people's data), but what you do with it may not be legal.

You could make a case for making any kind of profit generated from scraping data illegal. Don't get me wrong, I love scraping things myself.

Also find it amazing there are companies out there like Crawlera that can do serious scraping work and openly flaunt deploying tech to get around whatever scraping blockers are out there.

kull 7y ago

Check the Amazon API T&C. Also try to do the same with Craigslist and see how long they let you do it. Scraping data is always a shady business if you do it without the permission of the content owner.

smt88 7y ago

Why would the owner of a product want to keep their product info a secret?

kull 7y ago

For example, people take product data and copy it to eBay, then try to dropship, getting the products from your FBA. People pay big money for nice product photos, and then somebody just comes and takes them as their own.
