"[..] write serverless code which runs in the fabric of the Internet itself" - https://blog.cloudflare.com/building-with-workers-kv/
It’s just a distraction from someone sharing free and functional code with the community.
Cloud I gave up on quite early. It's a nonsense word, but not an incorrect one: sure, it's not literally water vapor, but it's not like you're saying "your computer" when the definition is "someone else's computer". Cloud is a new word for something that didn't really have a word. "Server" comes close, but "cloud" carries the additional implication that it's not yours (which makes quite a difference in many cases, so I guess it warrants having another word). Serverless... that's just shared hosting.
The client is the server, so it's serverless.
I built a prototype of something very similar, but using Google BigQuery to store and extract data[0]. I never took it beyond the concept phase, however. I'm still using and actively maintaining an open source lambda-based serverless A/B testing framework with a similar (but simpler) architecture[1].
[0] https://blog.gingerlime.com/2016/a-scalable-analytics-backen...
This is what most people don't get about GA.
Google Analytics does the heavy lifting of removing incoherent, corrupted, or maliciously inserted data.
Let's say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA purges this type of malicious attempt. It will probably look at my IP address, figure out that the traffic is actually coming from only one IP, and decide that "Netscape" is too rare to be considered an actual browser, so it would probably ignore it.
All the other "free Google Analytics alternatives" that exist today don't have this type of mechanism to protect against data corruption.
In general, they just receive an HTTP request and count it as a legitimate visit.
Logging an HTTP request from a browser is not even a tenth of the work GA does under the hood.
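To illustrate the gap, here's a toy sketch of the difference between a collector that just logs every request and one that applies a couple of basic sanity checks. The heuristics here (a browser-name allowlist and a per-IP cap) are entirely made up for illustration, not GA's actual logic:

```python
from collections import Counter

# Assumed, illustrative thresholds -- not anyone's real filtering rules.
KNOWN_BROWSERS = {"Chrome", "Firefox", "Safari", "Edge"}
MAX_HITS_PER_IP = 1000

def naive_collect(hits):
    """What a bare-bones tracker does: every HTTP request is a 'visit'."""
    return list(hits)

def filtered_collect(hits):
    """Drop hits with implausible user agents or absurd per-IP volume."""
    per_ip = Counter(h["ip"] for h in hits)
    kept = []
    for h in hits:
        if not any(b in h["ua"] for b in KNOWN_BROWSERS):
            continue  # a "Netscape 8.1" header would be rejected here
        if per_ip[h["ip"]] > MAX_HITS_PER_IP:
            continue  # a single IP hammering the endpoint
        kept.append(h)
    return kept

hits = [{"ip": "1.2.3.4", "ua": "Netscape 8.1"}] * 5 + \
       [{"ip": "5.6.7.8", "ua": "Mozilla/5.0 ... Chrome/80.0"}]
print(len(naive_collect(hits)), len(filtered_collect(hits)))  # 6 1
```

The naive collector happily reports six visits; the filtered one keeps only the plausible Chrome hit.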
Unless it's referrer spam... which somehow still sticks around (at least as of the last time I checked, several months ago).
> Let's say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA purges this type of malicious attempt. It will probably look at my IP address, figure out that the traffic is actually coming from only one IP, and decide that "Netscape" is too rare to be considered an actual browser, so it would probably ignore it.
I do a lot of work with GA, and have seen this misconception brought up a few times. When it comes to data processing, GA is not intelligent. If you haven't told it to do something explicitly, it isn't doing it. And if you do tell it to do something, it'll only apply that to new data and will make no attempt to reprocess historical data.
- GA is relatively robust against web scraping due to the fact that most scrapers don't render the page. So the GA-related code on the page is never executed and a hit is never made to Google's servers. If the scraper is using a headless browser, such as Puppeteer, and renders the page, then it will in fact send that hit to GA.
- If you've checked the "Exclude bots" view setting[1], it will apply the IAB Spiders and Bots list to traffic[2]. This is a deterministic list of user-agent based filters to apply[3], and anyone is capable of paying for it. Google just gives it to GA users for free via the Exclude bots filter.
- The Exclude Bots setting does nothing beyond that. Headless browsers like Puppeteer by default report their user agent as the version of Chromium they're driving. Those hits will show up just like any legitimate visitor browsing with that specific version of Chromium.
- GA has pretty robust filtering options[4]. But you have to manually create them, and they don't apply retroactively. You can filter IPs here, and only here. While you can apply reporting filters after the fact on a lot of fields, IP address isn't available as one of those fields. This makes it really frustrating to retroactively get rid of junk traffic, whether internal or automated/scraping; you can only approximate it by getting creative with fields that make a good proxy. The one exception is GA360/Google Marketing Cloud customers, since they can access their clickstream data via BigQuery as part of their subscription.
- GA's interface will give you really smart-looking notifications now, like "Filter internal traffic. Hits from your corporate network are showing up in property example.com". It's not doing anything clever like dynamically cross-referencing your IP address while you're in the admin area against the collected data in your GA property. It's literally just triggering that warning based on the fact that you haven't applied any IP-based filters yet.
There are quite a few other completely unintuitive aspects of GA that are rooted in the fact that its data processing model is incredibly straightforward: there are very few exceptions to it and virtually no edge cases taken into account. This leads to a lot of instances where people's expectations decouple from actual behavior. A good rule of thumb: if a particular feature or metric seems even remotely like it would require extra computation or complexity to work the way you're imagining, then it's highly likely it doesn't work that way.
[1] https://support.google.com/analytics/answer/1010249?hl=en
[2] https://www.iab.com/guidelines/iab-abc-international-spiders...
[3] https://www.iab.com/wp-content/uploads/2015/11/IAB_SpidersBo...
[4] https://support.google.com/analytics/topic/1032939?hl=en&ref...
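As a rough illustration of how deterministic that UA-based filtering is, here's a sketch with made-up patterns (the real IAB list is licensed and far larger; these regexes are purely illustrative):

```python
import re

# Hypothetical stand-in for a spiders-and-bots list: a flat set of
# user-agent patterns, applied as-is to each hit -- nothing adaptive.
BOT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r"googlebot", r"bingbot", r"crawler", r"spider", r"\bbot\b"]
]

def is_listed_bot(user_agent: str) -> bool:
    """Purely deterministic: match the UA string against the list."""
    return any(p.search(user_agent) for p in BOT_PATTERNS)

# A self-identifying crawler is caught...
print(is_listed_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
# ...but a headless browser presenting an ordinary Chrome UA is
# indistinguishable from a real visitor to a UA-only filter.
print(is_listed_bot("Mozilla/5.0 ... Chrome/80.0 Safari/537.36"))  # False
```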
I don't understand what Cube.js is doing in this app.
Once the data is inside Athena, it's just a matter of querying it right.
I would rather call them "zero-config servers" instead.
Google Analytics is much faster; it responds in a few hundred milliseconds.
What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?
Are you referring to their reporting API, or their collection endpoint? The collection endpoint is certainly fast to respond, but the actual reporting API can be quite slow depending on what you're trying to get from it.
> What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?
I'm not the parent, but I've created setups like what was mentioned. It sounds like they hosted the collection endpoint on AppEngine, then used DataFlow for streaming the data into BigQuery. Potentially using a Pub/Sub topic to queue up for DataFlow, since that has native integrations with DataFlow and even has a template available to support it[1].
[1] https://cloud.google.com/dataflow/docs/guides/templates/prov...
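A minimal sketch of the collection side of that kind of pipeline. The field names and schema here are assumptions for illustration, not anything from the actual setup:

```python
import json
import time

def hit_to_row(ip, ua, path, ts=None):
    """Shape one incoming request into a flat, BigQuery-friendly row."""
    return {
        "ts": ts if ts is not None else time.time(),
        "ip": ip,
        "user_agent": ua,
        "path": path,
    }

row = hit_to_row("1.2.3.4", "Mozilla/5.0", "/pricing", ts=1700000000.0)
payload = json.dumps(row).encode("utf-8")
# In the real pipeline, a payload like this would be published to a
# Pub/Sub topic (e.g. publisher.publish(topic_path, payload) with the
# google-cloud-pubsub client), and the Dataflow template would stream
# each message into a BigQuery table.
print(payload)
```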
GA stores summary tables for each day for the basic values. If you have a large site and request segments or anything that's not in the summary tables, it can be quite slow.
Also, BigQuery is multi-tenant. GA would have dedicated instances.
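A toy illustration of that split: the daily summary is precomputed once at ingest time, so basic lookups are cheap, while an ad-hoc segment has to rescan every raw hit:

```python
from collections import Counter

# Tiny stand-in for raw clickstream data.
raw_hits = [
    {"day": "2020-01-01", "page": "/", "country": "DE"},
    {"day": "2020-01-01", "page": "/pricing", "country": "US"},
    {"day": "2020-01-02", "page": "/", "country": "US"},
]

# Precomputed at ingest: afterwards, a daily total is a dict lookup.
daily_pageviews = Counter(h["day"] for h in raw_hits)

# Ad-hoc segment ("pageviews from US visitors"): full scan of raw data,
# which is what makes non-summary questions slow on a large site.
us_views = sum(1 for h in raw_hits if h["country"] == "US")

print(daily_pageviews["2020-01-01"], us_views)  # 2 2
```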
> What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?
cosmie pretty much got it. AppEngine collected; Dataflow sessionized and did some other processing (GeoIP lookup, filtering, etc.); BigQuery stored.
I actually had AppEngine dumping into Cloud Datastore, but I also experimented with PubSub and also using Cloud Storage access logs.
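The sessionize step could look roughly like this, assuming the classic 30-minute inactivity timeout as the session boundary (an assumption on my part, not a detail from the original setup):

```python
SESSION_GAP = 30 * 60  # seconds; assumed inactivity timeout

def sessionize(timestamps):
    """Group one visitor's hit timestamps into sessions.

    A new session starts whenever the gap since the previous hit
    exceeds SESSION_GAP. Returns a list of sessions, each a list
    of timestamps.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap too large: new session
    return sessions

# Three hits close together, then a ~65-minute gap, then two more:
hits = [0, 60, 120, 4000, 4100]
print(len(sessionize(hits)))  # 2
```

In a real Dataflow job this grouping would run per visitor key inside a windowed pipeline, but the core logic is the same.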
Basically, you select all rows and then Cube.js does something with those rows, when you could in fact run the queries directly in Athena.
Am I missing something?
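For comparison, the aggregation could be pushed into Athena itself rather than selecting all rows and post-processing them in the app. The table and column names below are made up for illustration:

```python
# An aggregation expressed directly in Athena SQL, so only the small
# aggregated result ever leaves the query engine.
query = """
SELECT url, COUNT(*) AS pageviews
FROM access_logs
WHERE day = DATE '2020-01-01'
GROUP BY url
ORDER BY pageviews DESC
LIMIT 10
"""

# With boto3, this would be submitted via
# athena.start_query_execution(QueryString=query, ...) and the small
# result set fetched with get_query_results -- no full-table transfer.
print(query.strip().startswith("SELECT"))  # True
```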