That's ridiculous. What evidence is there that there are groups of nefarious hackers out there spoofing analytics data on people's websites? I don't think there is a need for this solution because the problem doesn't exist. If I wanted to mess with someone's website, there are much better ways than injecting some false data into their Google Analytics.
This feels to me like a clever solution for a problem that doesn't exist.
More than one unscrupulous publisher has gamed comScore and Nielsen to pump up their reach figures and get access to more attractive advertisers.
Of course, scammers aren't going to use this service. There's a market for selling verification services to advertisers, but there are a ton of companies in this area already - from my limited vantage point, DoubleVerify looks like the market leader.
The same people, who, say, use a housemate's phone number for their Safeway card so that Safeway can't determine their shopping habits?
If there was an easy way to do it, many people would want to.
It's clear that there could be an easy way to do it if you put, like, two good hours of work into it.
During that time, we received exactly two easily-filterable attempts to spoof analytics traffic. Historically, anyway, this doesn't seem to be a real problem.
In this case it's usually relatively easy to filter out, because the analytics host will identify the fake requests as coming from pages served on a different domain. However, it is annoying, and a combination of content theft plus hijacked DNS could result in more sinister influences.
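For what it's worth, the domain filter described above could look roughly like the sketch below. The allowlist, the beacon field names (`page`, `load_ms`) and the sample data are all made up for illustration, and it assumes the beacon reports the URL of the page it fired from. It only catches copied beaconing code; it does nothing against someone who forges the page URL as well.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: hostnames we actually serve pages from.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_from_known_host(page_url: str) -> bool:
    """Return True if the page URL reported by the beacon is on an allowed host."""
    host = urlsplit(page_url).hostname  # lowercased; None if there is no host part
    return host is not None and host in ALLOWED_HOSTS


# Made-up beacons: one genuine, one from a copied page, one malformed.
beacons = [
    {"page": "http://example.com/pricing", "load_ms": 820},
    {"page": "http://copycat.example.net/pricing", "load_ms": 640},
    {"page": "not a url", "load_ms": 100},
]

# Keep only beacons whose reporting page is on one of our own hosts.
kept = [b for b in beacons if is_from_known_host(b["page"])]
```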
2. It should not be terribly hard to filter out known open proxies or sessions with a specific nefarious pattern.
Overall, I think this post addresses a problem that doesn't quite exist yet; and if/when it does, it can be addressed in many ways.
And they'd be doing it because of a principle like "these companies don't tell us that they gather and make money off this consumer data. If they won't admit it up front, let's just not give the data to them."
Go ahead, tell me that won't happen at least a few times in the next 5-10 years.
Second, the secret-suffix SHA1 MAC isn't secure. Its insecurity is the reason we have HMAC.
This seems to me to be the kind of thing you'd want to get right if the whole value proposition of your solution was "verifying URLs with cryptography".
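To make the contrast concrete, here is a minimal sketch of the ad hoc secret-suffix construction next to a standard HMAC, using only Python's standard library. The key and the function names are placeholders of mine, and HMAC-SHA256 is my choice here, not something the post specifies.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, for illustration only


def secret_suffix_mac(message: bytes) -> str:
    """The ad hoc construction under discussion: SHA1(message || key). Shown for contrast only."""
    return hashlib.sha1(message + SECRET_KEY).hexdigest()


def sign(message: bytes) -> str:
    """Standard HMAC over the message (HMAC-SHA256 here)."""
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def verify(message: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(message).encode(), tag.encode())
```

Usage would be something like `tag = sign(b"ts=1300000000&r=42")` on the page server and `verify(query_bytes, tag)` on the analytics server; the query string here is invented.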
1. (Most common) many small sites seem to really like the design of various Yahoo! pages, so they copy the code verbatim, and change the content, but they leave the beaconing code in there, so you end up with fake beacons.
2. (Less common) individuals trying to break the system. We would see various patterns including XSS attempts in the beacon variables, and also in the user agent string. We'd see absurd values (eg: load time of 1 week, or 20ms or -3s, or bandwidth of 4Tbps).
It's completely possible to stop all fake requests, provided you have control over the web servers that serve pages as well as the servers that receive beacons. It's costly though: you have to not just sign part of the request, but also add a nonce to ensure that the request came from a server you control (and to avoid replays). Also throw in rate limiting for added effect (hey, if you're random sampling, then randomly dropping beacons works in your favour ;)).
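A rough sketch of the sign-plus-nonce idea, assuming the page servers and beacon servers share a secret. Every name here (`issue_token`, `accept_beacon`, the token layout, the five-minute window) is hypothetical, and the in-memory set stands in for whatever shared store with expiry a real deployment would need.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; shared by page and beacon servers
MAX_AGE_SECONDS = 300  # assumed validity window for a token
_seen_nonces = set()  # illustration only; real use needs shared storage with expiry


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(now=None) -> str:
    """Page server: embed this token in the beacon URL when rendering a page."""
    ts = str(int(time.time() if now is None else now))
    nonce = secrets.token_hex(16)
    payload = f"{ts}.{nonce}"
    return f"{payload}.{_sign(payload)}"


def accept_beacon(token: str, now=None) -> bool:
    """Beacon server: accept only correctly signed, fresh, never-seen tokens."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    ts, nonce, sig = parts
    # Check the signature first, so ts and nonce are known to be values we issued.
    if not hmac.compare_digest(_sign(f"{ts}.{nonce}").encode(), sig.encode()):
        return False
    current = time.time() if now is None else now
    if abs(current - int(ts)) > MAX_AGE_SECONDS:
        return False  # too old (or clock skew too large)
    if nonce in _seen_nonces:
        return False  # replay
    _seen_nonces.add(nonce)
    return True
```

The cost mentioned above shows up in the nonce store: every beacon server has to consult it, and it has to remember nonces for at least the validity window.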
It doesn't stop there though, post processing and statistical analysis of the data can take you further.
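As a small example of that kind of post-processing, the absurd values mentioned earlier (a load time of 1 week, 20ms or -3s, a bandwidth of 4Tbps) can be screened with a plain range check before any real statistics. The bounds and the field names below are assumptions of mine, not what any particular system uses; they would need tuning against real traffic.

```python
# Assumed plausibility bounds; tune them against your own traffic.
MIN_LOAD_MS = 50
MAX_LOAD_MS = 10 * 60 * 1000  # ten minutes
MAX_BANDWIDTH_BPS = 10 * 10**9  # 10 Gbps


def is_plausible(beacon: dict) -> bool:
    """Return True if load time and bandwidth fall inside the assumed bounds."""
    load_ms = beacon.get("load_ms")
    bandwidth = beacon.get("bandwidth_bps")
    # Reject beacons with missing or non-numeric values outright.
    if not isinstance(load_ms, (int, float)) or not isinstance(bandwidth, (int, float)):
        return False
    return MIN_LOAD_MS <= load_ms <= MAX_LOAD_MS and 0 < bandwidth <= MAX_BANDWIDTH_BPS
```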
It gets harder when you're a service provider providing an analytics service to customers and you do not have access to, or control over, their web servers.
At my new startup (lognormal.com) we try to mitigate the effect of fake beacons the best that we can.
There are better ways to "hack" a company than spoofing their website's analytics lol, people that have that large a number of IPs have better (worse) things to do than that..
Also how the f would you know they are A/B testing something..
echo "<script src=\"http://example.com/analytics.js?ts=$ts&r=$r&ds=$ts\&...;
should be:
echo "<script src=\"http://example.com/analytics.js?ts=$ts&r=$r&ds=$ds\&...;
There are trade-offs just like every other CAPTCHA-class problem out there. Isn't that what you are after: an automated human detector?
I know my Google Analytics aren't 100% correct, but I don't think people are spoofing them. The differences lie more in people who click through faster than GA can load (which is easily possible for those still on 56k), or who have "privacy blockers" in their ad block to remove GA altogether.