The Problem With Client-Side Analytics

(spider.io)

37 points · blahpro · 14y ago · 27 comments

lmkg · 14y ago

...wut?

That's ridiculous. What evidence is there that there are groups of nefarious hackers out there spoofing analytics data on people's websites? I don't think there is a need for this solution because the problem doesn't exist. If I wanted to mess with someone's websites, there are much better ways than injecting some false data into their Google Analytics.

lamnk · 14y ago

It's a problem of information asymmetry in the online (display) advertising industry, not a problem with hackers. Because advertisers don't know how many pageviews/visitors a website has, advertising agencies often have to make purchasing decisions based on numbers from Comscore, Quantcast, Google Analytics etc. Clearly, if the website owner can spoof his analytic data, he can sell his inventory at higher rates.

mpclark · 14y ago

...but not for long, because his CTR will be down on the floor.

This feels to me like a clever solution for a problem that doesn't exist.

gyardley · 14y ago

While I haven't seen any real, meaningful efforts at spoofing analytics data (I vaguely recall some grad student project), I've certainly heard of companies spoofing their own analytics data to appear bigger than they are to interested parties.

More than one unscrupulous publisher has gamed comScore and Nielsen to pump up their reach figures and get access to more attractive advertisers.

Of course, scammers aren't going to use this service. There's a market selling verification services to advertisers, but there's a ton of companies in this area already - from my limited vantage point, DoubleVerify looks like the market leader.

angelbob · 14y ago

You don't think that, say, the same people who resent ads being shown and turn them off might think it was hilarious to send back bad analytics data to companies who are quietly profiting from same?

The same people, who, say, use a housemate's phone number for their Safeway card so that Safeway can't determine their shopping habits?

If there was an easy way to do it, many people would want to.

It's clear that there could be an easy way to do it if you put, like, two good hours of work into it.

gyardley · 14y ago

I co-founded and ran Pinch Media, a mobile application analytics company. We operated independently for around two years before selling. During those two years, I believe we got more bad PR than a typical analytics company. I certainly got my fair share of anonymous hate email.

During that time, we received exactly two easily-filterable attempts to spoof analytics traffic. Historically, anyway, this doesn't seem to be a real problem.

storborg · 14y ago

I've often seen a different, but related effect of client side analytics where content thieves will "accidentally" spoof analytics data by simply copying a site verbatim, with the analytics tag included.

In this case it's usually relatively easy to filter out, because the analytics host will identify the fake requests as coming from pages served on a different domain. However, it is annoying, and a combination of content theft plus hijacked DNS could result in more sinister influences.
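A minimal sketch of the domain check described above, assuming each beacon reports the URL of the page that fired it. The `page` field and the `ALLOWED_HOSTS` set are hypothetical names, not anything from the article. Because the value is supplied by the client, this only catches accidental copies, not deliberate spoofing.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: the hosts that legitimately serve our pages.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_from_own_site(page_url: str) -> bool:
    """Return True if the page URL reported by a beacon is on an allowed host."""
    host = urlsplit(page_url).hostname  # None when the URL has no host part
    return host is not None and host in ALLOWED_HOSTS


# Hypothetical beacons; the second comes from a verbatim copy on another domain.
beacons = [
    {"page": "http://www.example.com/pricing", "event": "pageview"},
    {"page": "http://copycat.example.net/pricing", "event": "pageview"},
    {"page": "", "event": "pageview"},
]

# Keep only beacons fired from pages on our own hosts.
genuine = [b for b in beacons if is_from_own_site(b["page"])]
```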

eps · 14y ago

What evidence is there that there are groups of nefarious hackers intercepting my shopping session at ToysRUs online store? Why the hell should I be spending my hard earned milliamps on this SSL thing? Certainly if someone wanted to mess with me, they would just whack me on the head in the dark corner of the street.

ceejayoz · 14y ago

Intercepting a shopping session gets you a credit card number. Spoofing analytics adds a pageview to a stat you don't even get to see.

badclient · 14y ago

1. Considering most client-side analytics are based on IP address, you will require a large number of IPs.

2. It should not be terribly hard to filter out known open proxies or sessions with a specific nefarious pattern.

Overall, I think this post addresses a problem that doesn't quite exist yet; and if/when it does, it can be addressed in many ways.

angelbob · 14y ago

You wouldn't spoof it with an open proxy. This wouldn't be organized crime trying to screw up your A/B testing. This would be individual consumer advocates and reactionaries with a GreaseMonkey script that intentionally sent back wrong numbers or dupes.

And they'd be doing it because of a principle like "these companies don't tell us that they gather and make money off this consumer data. If they won't admit it up front, let's just not give the data to them."

Go ahead, tell me that won't happen at least a few times in the next 5-10 years.

webjunkie · 14y ago

I also don't think this guy actually implemented the spoof he talks about..

ismarc · 14y ago

Our company has its own internal analytics system and, while their approach could technically work to prevent spoofing, there are other, simpler ways.

The first is simple deduplication of received events. This will carve out a large portion of invalid requests, particularly if you set a time threshold for how frequently a received event is considered valid.

The second is to calculate the quartiles and outliers. This lets you remove all but the most sophisticated spoofing, and it is good practice anyway: it removes ill-behaved browsers and filters out things like malware detection tools that duplicate browser requests if they haven't seen the site before.

There are many operations you can do to determine the validity of data received; who knows how much of this is actually done by analytics providers, though. We built our own internal analytics system (and expose it to customers) because existing solutions weren't robust enough for our needs. The biggest lesson has been that trying to get higher than about 98% accuracy on delivered events actually lowered the accuracy of events; using calculations on the backend was more reliable, but requires specific knowledge of the type of events.
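The two backend filters mentioned in this comment could look roughly like the sketch below. It is not the commenter's system: the event tuple layout, the two-second window, and the conventional 1.5 × IQR fence are all assumptions chosen for illustration.

```python
import statistics


def deduplicate(events, window_seconds=2.0):
    """Drop events that repeat the same (client_id, name) within the window.

    `events` is an iterable of (timestamp, client_id, name) tuples (assumed layout).
    The window is measured from the most recent sighting, so a steady flood of
    repeats stays suppressed instead of leaking one event per window.
    """
    last_seen = {}
    kept = []
    for ts, client_id, name in sorted(events):
        key = (client_id, name)
        previous = last_seen.get(key)
        if previous is None or ts - previous >= window_seconds:
            kept.append((ts, client_id, name))
        last_seen[key] = ts
    return kept


def drop_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (needs at least two values)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical data: one duplicated event and one absurd load time (a week).
events = [(0.0, "a", "view"), (0.5, "a", "view"), (5.0, "a", "view"), (0.2, "b", "view")]
load_times = [1.0, 1.1, 0.9, 1.2, 1.0, 0.95, 1.05, 604800.0]

unique_events = deduplicate(events)
sane_load_times = drop_outliers(load_times)
```

As the comment notes, sensible thresholds depend on the type of event, so the window and fence width would need tuning per metric.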

tptacek · 14y ago

First, that's not a "digital signature", it's a MAC. It's the secret-suffix SHA1 MAC, to be precise.

Second, the secret-suffix SHA1 MAC isn't secure. Its insecurity is the reason we have HMAC.

This seems to me to be the kind of thing you'd want to get right if the whole value proposition of your solution was "verifying URLs with cryptography".
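For contrast, here is a sketch of the secret-suffix construction next to HMAC, using Python's standard `hmac` module. The key, the message layout (the `ts` and `r` parameters from the article's URL), and the choice of SHA-256 for the HMAC are assumptions for illustration, not the article's code.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key


def secret_suffix_mac(message: bytes, key: bytes = SECRET_KEY) -> str:
    """The construction criticised above: SHA1(message || secret). Shown for contrast only."""
    return hashlib.sha1(message + key).hexdigest()


def sign(message: bytes, key: bytes = SECRET_KEY) -> str:
    """HMAC over the message; SHA-256 is an assumed choice of hash."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, tag: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(message, key), tag)


message = b"ts=1300000000&r=abc123"  # hypothetical query parameters
tag = sign(message)
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not depend on how many leading characters match.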

bluesmoon · 14y ago

We noticed this problem at Yahoo! (I worked on the web performance analytics). Approximately 2% (note, that's 2% of 200 million daily) of our beacons were "fake". Now there are two reasons for fake beacons.

1. (Most common) many small sites seem to really like the design of various Yahoo! pages, so they copy the code verbatim, and change the content, but they leave the beaconing code in there, so you end up with fake beacons.

2. (Less common) individuals trying to break the system. We would see various patterns including XSS attempts in the beacon variables, and also in the user agent string. We'd see absurd values (e.g. load time of 1 week, or 20ms or -3s, or bandwidth of 4Tbps).
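Absurd values like the ones listed above can be rejected with a simple plausibility check before any statistics are run. This is a sketch only: the field names and every threshold below are hypothetical and would need tuning against real traffic.

```python
# Hypothetical plausibility limits; tune against real traffic.
MIN_LOAD_S = 0.05          # faster than this is treated as suspicious
MAX_LOAD_S = 600.0         # ten minutes
MAX_BANDWIDTH_BPS = 10e9   # 10 Gbps


def _as_float(value):
    """Parse a beacon variable as a number; None if missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def is_plausible(beacon: dict) -> bool:
    """Reject beacons whose load time or bandwidth is missing, non-numeric, or out of range."""
    load = _as_float(beacon.get("load_time_s"))
    bandwidth = _as_float(beacon.get("bandwidth_bps"))
    if load is None or bandwidth is None:
        return False
    return MIN_LOAD_S <= load <= MAX_LOAD_S and 0 < bandwidth <= MAX_BANDWIDTH_BPS
```

A string such as an XSS payload in a numeric field fails the `float()` parse and is rejected along with the out-of-range numbers.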

It's completely possible to stop all fake requests, provided you have control over the web servers that serve pages as well as the servers that receive beacons. It's costly though, requiring you to not just sign part of the request, but also add a nonce to ensure that the request came from a server you control (to avoid replays). Also throw in rate limiting for added effect (hey, if you're random sampling, then randomly dropping beacons works in your favour ;)).

It doesn't stop there though, post processing and statistical analysis of the data can take you further.

It gets harder when you're a service provider providing an analytics service to customers where you do not have access or control over their web servers.

At my new startup (lognormal.com) we try to mitigate the effect of fake beacons the best that we can.
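The sign-plus-nonce scheme from this comment might be sketched as follows, assuming the page server and the beacon server share a secret. The function and class names, the five-minute lifetime, and the in-memory nonce store are all hypothetical; a multi-server deployment would need a shared store for seen nonces.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical, shared by both servers
MAX_AGE_S = 300                             # assumed token lifetime


def _signature(ts: int, nonce: str) -> str:
    """HMAC-SHA256 over the timestamp and nonce."""
    return hmac.new(SECRET_KEY, f"{ts}:{nonce}".encode(), hashlib.sha256).hexdigest()


def issue_token(now=None):
    """Run on the page server: returns (ts, nonce, sig) to embed in the served page."""
    ts = int(time.time() if now is None else now)
    nonce = secrets.token_hex(16)
    return ts, nonce, _signature(ts, nonce)


class BeaconVerifier:
    """Run on the beacon server: accepts each valid, fresh token exactly once."""

    def __init__(self):
        self._seen = {}  # nonce -> issue timestamp

    def accept(self, ts: int, nonce: str, sig: str, now=None) -> bool:
        now = time.time() if now is None else now
        # Forget nonces whose tokens have expired anyway, to bound memory use.
        self._seen = {n: t for n, t in self._seen.items() if now - t <= MAX_AGE_S}
        if not hmac.compare_digest(_signature(ts, nonce), sig):
            return False  # not issued by a server holding the key
        if not 0 <= now - ts <= MAX_AGE_S:
            return False  # expired, or timestamp in the future
        if nonce in self._seen:
            return False  # replay of an already-used token
        self._seen[nonce] = ts
        return True
```

Because every page view needs a freshly issued token, the page (or at least the fragment carrying the token) cannot be served from a shared cache, which is the cost the comment refers to.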

posabsolute · 14y ago

Well.. I can see that being a problem for 0.5% of businesses... maybe... I think he is overthinking this; most businesses do not need that kind of protection.

There are better ways to "hack" a company than spoofing their website's analytics lol, people that have that large a number of IPs have better (worse) things to do than that..

Also how the f would you know they are A/B testing something..

mdda · 14y ago

Rather than signing requests for the (largish) javascript file (which would benefit most from being cached), it would make more sense for the signed-timestamp key to be passed as one parameter via the image grab. Or am I missing something?

krisneuharth · 14y ago

Totally off topic but there is a bug in the PHP code example:

echo "<script src=\"http://example.com/analytics.js?ts=$ts&r=$r&ds=$ts\&...;

should be:

echo "<script src=\"http://example.com/analytics.js?ts=$ts&r=$r&ds=$ds\&...;

skeltoac · 14y ago

In before solution waiting for a... oh, too late. It is a problem. However, signing resources means no HTTP caching of the most expensive resource we generate. That is not practical where I work. Guess the cache can be programmed to do the signing.

There are trade-offs just like every other CAPTCHA-class problem out there. Isn't that what you are after: an automated human detector?

ROFISH · 14y ago

This is a solution looking for a problem.

I know my Google Analytics aren't 100% correct, but I don't think people are spoofing them. The differences lie more in people who click through faster than GA can load (which can easily happen for those still on 56k), or who have "privacy blockers" in their ad block to remove GA altogether.

youngtaff · 14y ago

One of the problems with client-side analytics is that they don't give you the whole picture, i.e. 4xx and 5xx errors are missing from them.
