QVC Sues Shopping App for Web Scraping That Allegedly Triggered Site Outage

(newmedialaw.proskauer.com)

27 points · domdip · 11y ago · 33 comments


korzun · 11y ago · 6 in thread

Agree with the suit but QVC (by this time) should have rate limiting / throttling per IP.

(waits for somebody to claim that each request came from a different proxy)

greglindahl · 11y ago

From TFA:

  The complaint alleges that the defendant disguised its 
  web crawler to mask its source IP address and thus 
  prevented QVC technicians from identifying the source of 
  the requests and quickly repairing the problem.

korzun · 11y ago

Yes. Did you read/understand what I posted? Especially the last part, where I predicted that somebody would crawl out and post what you just did?

atwebb · 11y ago

> (waits for somebody to claim that each request came from a different proxy)

Seems they were using proxies, but you don't want to go blocking IPs just for making a number of requests, right? They could be legitimate shoppers opening lots of tabs, refreshing, hitting image/resource-heavy pages, analytics, etc. I'm not a network or server admin, though, so maybe there are some tools or methods that help identify the bad traffic.

korzun · 11y ago

Rate limiting is not blocking.

Multiple tabs do not translate to concurrent requests. A shop like QVC should have a high limit and frankly an automatic system that flags stuff like this and notifies proper parties.
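To make the "rate limiting is not blocking" distinction concrete, here is a minimal sketch of a per-IP token bucket. It is an illustration only, not what QVC ran or should run: the class names `TokenBucket` and `PerIPLimiter` and the default limits are assumptions chosen for the example. A shopper opening a handful of tabs stays inside the burst allowance, while a sustained flood from one address gets delayed or rejected once the bucket drains.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would answer 429 or queue the request


class PerIPLimiter:
    """Keeps one bucket per client IP (hypothetical defaults)."""

    def __init__(self, rate=20.0, capacity=60):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, ip, now=None):
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = self.buckets[ip] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

A real deployment would also need to expire idle buckets and share state across front-end servers; both are left out here.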


chadgeidel · 11y ago

This happened to us recently (however not at this scale), and I assure you this wouldn't solve anything. We only got a few requests per IP.

korzun · 11y ago

I assure you that if you had a proper system in place it would limit the exposure. We are talking about 600 r/s (QVC should be able to eat that anyways, but whatever).

At this level you should know this will happen and have placeholders in place.

If every single IP is unique, an additional layer would be pattern-matching traffic by different network classes and flagging it actively.

Another way to rate limit is by number of pages per session / time.

Then you can have another fail-safe method where, if some metric X exceeds a threshold Y, you start throttling the top offenders (dynamically scaling up until X returns to normal) while notifying DevOps of the issue.

Etc.
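One way to read the "pattern matching by network classes" and "throttle the top offenders" ideas above is to aggregate request counts by subnet instead of by single IP, so that a crawler spreading a few requests over many neighbouring addresses still stands out. The sketch below is an assumption-laden illustration: the helper names `subnet_of` and `top_offenders`, the /24 and /48 prefix lengths, and the threshold are all hypothetical. It does not help when the proxies are scattered across unrelated networks.

```python
import ipaddress
from collections import Counter


def subnet_of(ip, prefix_v4=24, prefix_v6=48):
    """Map an address to its enclosing network, e.g. 203.0.113.5 -> 203.0.113.0/24."""
    addr = ipaddress.ip_address(ip)
    prefix = prefix_v4 if addr.version == 4 else prefix_v6
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def top_offenders(ips, threshold, top_n=10):
    """Return (subnet, request_count) pairs above `threshold`, busiest first.

    `ips` is the list of client addresses seen in one time window.
    """
    counts = Counter(subnet_of(ip) for ip in ips)
    return [(net, n) for net, n in counts.most_common(top_n) if n > threshold]
```

The returned subnets could feed a throttle or an alert; which one, and at what threshold, is a policy decision the thread leaves open.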


johngd · 11y ago · 4 in thread

My main focus for the entirety of my career has been on internet-facing consumer web applications. I have seen many, many DoS attacks, from IRC bots to Ukrainian web scrapers to Chinese get-lucky WordPress exploit scanners. Most of these can be ignored and blocked with little effort.

By FAR the most annoying of any of these is when Google, Bing and/or Yahoo decide to wake up and crawl your infrastructure with little regard to your robots.txt or webmaster settings, if available. I think they have got better in recent years, but they used to be the absolute worst. It came down to: Let us DOS you, or have your ranking suffer. Suing Google, Bing, Yahoo isn't exactly an option.

Some context: I was the lead architect/engineer combo for a CMS that hosted ~500k domains for a fairly large international company. Some days I could log in and see them crawling every domain from A-Z. Some days I would get caught by Google and Bing at the same time. They were the largest consumers of data on this system.

acdha · 11y ago

FWIW, every time I've seen what looked like a major search engine ignoring rate-limits (either Crawl-Delay or webmaster tools settings) a check of the actual IPs being used showed that it was someone spoofing a well-known User-Agent, which left you needing some other form of rate-limiting either way.
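A common way to do the "check of the actual IPs" described above is forward-confirmed reverse DNS: look up the hostname for the client IP, check that it belongs to the search engine's domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below assumes a small, hand-picked list of crawler domains (`CRAWLER_DOMAINS`) and a hypothetical function name `is_verified_crawler`; the resolver arguments exist only so the logic can be exercised without network access. `socket.gethostbyname_ex` covers IPv4 only.

```python
import socket

# Assumed suffixes for illustration; check each search engine's own docs.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse=None, forward=None, domains=CRAWLER_DOMAINS):
    """True only if reverse DNS matches a known domain AND resolves back to `ip`."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
        if not host.lower().endswith(domains):
            return False  # User-Agent may claim Googlebot, DNS says otherwise
        return ip in forward(host)  # forward confirmation defeats fake PTR records
    except OSError:
        return False  # lookup failed: treat as unverified
```

Anything that fails this check but sends a search engine User-Agent can be handled by the ordinary rate limits.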

johngd · 11y ago

Very true; however, in the incidents I'm referencing this was definitely not the case.

In fact, for a while we would get Bing (MSN bot back then) crawling us every day at the same time, almost on the dot.

Let me plug Project Honey Pot (which I am in no way affiliated with). This is a truly awesome, surprisingly accurate, free service that does an amazing job at collecting heuristics on suspicious IP activity and exposing it in an easy-to-interpret way.

http://www.projecthoneypot.org/index.php


greglindahl · 11y ago

One thing that is often a surprise to webmasters is that if you have 100+ websites on a single IP, robots.txt specifies a crawl-delay per website and not per IP. So you can end up with 100x the crawling you thought you were specifying in robots.txt.
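The 100x figure above follows from simple arithmetic, sketched here under the assumption that a crawler treats `Crawl-delay: N` as "at most one request every N seconds per hostname" (not every crawler honors the directive at all). The function name is made up for the example.

```python
def requests_per_second_at_ip(sites_on_ip, crawl_delay_seconds):
    """Aggregate crawl rate hitting one IP when each hosted site is crawled
    independently at one request per `crawl_delay_seconds`."""
    return sites_on_ip / crawl_delay_seconds


# One site with a 10-second delay: 0.1 requests/second.
# 100 sites on the same IP, same delay: 10 requests/second, i.e. 100x.
single = requests_per_second_at_ip(1, 10)
shared = requests_per_second_at_ip(100, 10)
```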

johngd · 11y ago

This is very true. Logistically (not technically) this wasn't an option - read: business decision. Before the days where everything had an API, and things were done with spreadsheets, and billing codes, and product codes, and typos... managing DNS for a large number of domains was not fun and rather expensive. Want to move a bunch of domains to a new IP address? $$$+Time + more time and money. We weren't able to trust our domain broker and had to double check EVERYTHING they did.

To your point, if you are a large provider, especially one that passively and actively sends a lot of money towards the search engines, there are some additional options at your disposal. We (the business units) had contacts with AdSense etc., which would come in handy.

birken · 11y ago · 2 in thread

Result.ly are really a bunch of jerks. One of the most common sense things you can possibly do while crawling a website is monitor the response time and/or error rates from the sites you are crawling. If those are going up, your crawl rate should go down or go to 0.

There is one form of internet justice, which is QVC should file abuse complaints to the ISPs that host those IPs. I've found abuse complaints are the best way to stop people from using IPs for bad activities (excessive scraping, spamming, etc).
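The crawler-side politeness described above (slow down when the target's response times or error rates go up) can be sketched as a delay that doubles on signs of overload and relaxes slowly once the site looks healthy. This is a hypothetical illustration, not Resultly's crawler or any particular library: the class name `AdaptiveDelay`, the 2-second "slow" threshold, and the back-off factors are all assumptions.

```python
class AdaptiveDelay:
    """Tracks how long a crawler should wait between requests to one site."""

    def __init__(self, base_delay=1.0, max_delay=300.0, slow_threshold=2.0):
        self.base_delay = base_delay          # normal pause between requests (s)
        self.max_delay = max_delay            # ceiling; effectively "near zero" crawl rate
        self.slow_threshold = slow_threshold  # response time treated as distress (s)
        self.delay = base_delay

    def record(self, status_code, response_seconds):
        """Update the delay from the last response and return the new value."""
        overloaded = (
            status_code == 429
            or status_code >= 500
            or response_seconds > self.slow_threshold
        )
        if overloaded:
            # Back off quickly: double the pause, up to the ceiling.
            self.delay = min(self.max_delay, self.delay * 2)
        else:
            # Recover slowly: shave 10% off, never below the base delay.
            self.delay = max(self.base_delay, self.delay * 0.9)
        return self.delay
```

A crawl loop would call `record()` after each fetch and sleep for the returned number of seconds before the next request to that site.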

greglindahl · 11y ago

From the complaint, it seems that Result.ly was crawling through proxies... which means QVC doesn't know whom to complain to:

  The complaint alleges that the defendant disguised its
  web crawler to mask its source IP address and thus 
  prevented QVC technicians from identifying the source of 
  the requests and quickly repairing the problem.

Your comment about crawler politeness is spot-on.

birken · 11y ago

My reading of that is they were splitting their crawling amongst a large block of IPs they had. If they are using proxies, that is much easier because you can just block them all without any worry of accidentally blocking real consumers (in addition to the fact that you can also file abuse complaints). I think at Thumbtack we blocked all AWS IPs, a good deal of foreign IPs, in addition to the specific IPs of people who were abusively crawling.

At the same time, the onus shouldn't be on QVC to have to block this, result.ly should either be a good citizen or have to face a lawsuit. Granted, QVC's tech team should be able to deal with this because next time the person who is DOSing them might not be a US entity which can be sued, but that isn't entirely relevant in this situation.


Someone1234 · 11y ago · 2 in thread

> Of these and other causes of action typically alleged in these situations, the breach of contract claim is often the clearest source of a remedy.

That's a strange claim given that we're talking about a "contract" which QVC has no proof that the other party read or agreed to, and for which there has been no explicit exchange ("offer" and "acceptance").

Are web-site contracts/terms even enforceable at all? According to this article [0] and case law, likely not. Strange thing for a lawyer to say, but this article makes a lot of strange claims that seem inconsistent with US case law.

[0] http://www.forbes.com/sites/oliverherzfeld/2013/01/22/are-we...

greglindahl · 11y ago

Several cases have allowed contract claims where (1) the crawler created accounts and (2) the account creation flow has a checkbox for the user to indicate that they agree to the contract.

Trying to enforce a contract on a crawler that's just fetching pages without ever checking a box is much more difficult... many failures in the past.

domdip (OP) · 11y ago

IANAL. The linked cases are probably more relevant than yours; courts will have different standards if the other party is a (presumably unsophisticated) consumer vs. a corporation. It's also unclear whether there was other communication between QVC and Resultly, which may be important.

Also, "clearest source of a remedy" refers to ability to actually get compensation, which may be limited or difficult to argue for under DMCA/CFAA.

Spoom · 11y ago · 2 in thread

Honestly, you really shouldn't have to hit "36,000 requests per minute" scraping a website for price updates. Can someone explain if there is any scenario in which this is reasonable? Do QVC's prices change that often?

tomjen3 · 11y ago

Just a guess, assuming they have a lot of different things they sell? Also it may go through a partial checkout flow for each item (to find the shipping rates, etc).

Still yeah, that is too much.

xur17 · 11y ago

Yeah, this really stood out to me. 600 requests per second against one website seems pretty insane. I can see why QVC is suing.

swalsh · 11y ago · 1 in thread

I have mixed feelings about this. On the one hand, the bot seems to have been a really bad netizen. On the other hand, I hate the idea of there being a precedent that you can be sued for automating GET requests.

akama · 11y ago

I don't think they are getting sued for automating get requests. Most of the problem here seems to be the excessive number of requests that made the scraping effectively be a DOS attack.

Xorlev · 11y ago

Having been on both sides of the coin, once you hit 600 reqs/s without a prior arrangement, that almost qualifies as a DoS attack. If they'd maintained 200-300 req/min, that would have been pretty acceptable.
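The thread quotes rates in both per-minute and per-second units, so a quick conversion helps when comparing them. The figures are the ones cited above (36,000 requests per minute alleged, 200-300 per minute suggested as acceptable); the helper name is just for this example.

```python
def per_second(requests_per_minute):
    """Convert a per-minute request rate to requests per second."""
    return requests_per_minute / 60


alleged = per_second(36_000)          # the "600 r/s" figure used in the thread
polite_high = per_second(300)         # upper end of the suggested polite range
ratio = alleged / polite_high         # how far above that range the alleged rate sits
```

On these numbers the alleged rate works out to 120 times the upper end of the suggested range.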
