"[..] write serverless code which runs in the fabric of the Internet itself" - https://blog.cloudflare.com/building-with-workers-kv/
It’s just a distraction from someone sharing free and functional code with the community.
Cloud I gave up on quite early. It's a nonsense word, but not an incorrect one: sure, it's not literally water vapor, but it's not like you're saying "your computer" when the definition is "someone else's computer". Cloud is a new word for something that didn't really have a word. "Server" comes close, but "cloud" carries the additional implication that it's not yours (which makes quite a difference in many cases, so I guess it warrants having another word). Serverless... that's just shared hosting.
The client is the server, so it's serverless.
I built a prototype of something very similar, but using Google BigQuery to store and extract data[0]. I never took it beyond the concept phase, however. I'm still using and actively maintaining an open source lambda-based serverless A/B testing framework with a similar (but simpler) architecture[1].
[0] https://blog.gingerlime.com/2016/a-scalable-analytics-backen...
This is what most people don't get about GA.
Google Analytics does the heavy lifting of removing incoherent, corrupted, or maliciously inserted data.
Let's say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA purges this type of malicious attempt. It will probably look at my IP address, figure out that the traffic is actually coming from only one IP, and decide that "Netscape" is too rare to be considered an actual browser, so it would probably ignore it.
All the other "free Google Analytics alternatives" that exist today don't have this type of mechanism to protect against data corruption.
In general, they just receive an HTTP request and count it as a legitimate visit.
Logging an HTTP request from a browser is not even a tenth of the work GA does under the hood.
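To illustrate the gap, here's a toy sketch of the difference between a collector that just logs every request and one that applies a couple of basic sanity checks. The heuristics here (a browser-name allowlist and a per-IP cap) are entirely made up for illustration, not GA's actual logic:

```python
from collections import Counter

# Assumed, illustrative thresholds -- not anyone's real filtering rules.
KNOWN_BROWSERS = {"Chrome", "Firefox", "Safari", "Edge"}
MAX_HITS_PER_IP = 1000

def naive_collect(hits):
    """What a bare-bones tracker does: every HTTP request is a 'visit'."""
    return list(hits)

def filtered_collect(hits):
    """Drop hits with implausible user agents or absurd per-IP volume."""
    per_ip = Counter(h["ip"] for h in hits)
    kept = []
    for h in hits:
        if not any(b in h["ua"] for b in KNOWN_BROWSERS):
            continue  # a "Netscape 8.1" header would be rejected here
        if per_ip[h["ip"]] > MAX_HITS_PER_IP:
            continue  # a single IP hammering the endpoint
        kept.append(h)
    return kept

hits = [{"ip": "1.2.3.4", "ua": "Netscape 8.1"}] * 5 + \
       [{"ip": "5.6.7.8", "ua": "Mozilla/5.0 ... Chrome/80.0"}]
print(len(naive_collect(hits)), len(filtered_collect(hits)))  # 6 1
```

The naive collector happily reports six visits; the filtered one keeps only the plausible Chrome hit.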
Unless it's referrer spam... which somehow still sticks around (at least as of the last time I checked, several months ago).
> Let's say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA purges this type of malicious attempt. It will probably look at my IP address, figure out that the traffic is actually coming from only one IP, and decide that "Netscape" is too rare to be considered an actual browser, so it would probably ignore it.
I do a lot of work with GA, and have seen this misconception brought up a few times. When it comes to data processing, GA is not intelligent. If you haven't told it to do something explicitly, it isn't doing it. And if you do tell it to do something, it'll only apply that to new data and will make no attempt to reprocess historical data.
- GA is relatively robust against web scraping due to the fact that most scrapers don't render the page. So the GA-related code on the page is never executed and a hit is never made to Google's servers. If the scraper is using a headless browser, such as Puppeteer, and renders the page, then it will in fact send that hit to GA.
- If you've checked the "Exclude bots" view setting[1], it will apply the IAB Spiders and Bots list to traffic[2]. This is a deterministic list of user-agent based filters to apply[3], and anyone is capable of paying for it. Google just gives it to GA users for free via the Exclude bots filter.
- The Exclude Bots setting does nothing beyond that. Headless browsers like Puppeteer by default report their user agent as the version of Chromium they're driving. Those hits will show up just like any legitimate visitor browsing with that specific version of Chromium.
- GA has pretty robust filtering options[4]. But you have to manually create them, and they don't apply retroactively. You can filter IPs here, and only here. While you can apply reporting filters after the fact on a lot of fields, IP address isn't available as one of those fields. This makes it really frustrating to retroactively get rid of junk traffic, whether internal or automated/scraping; you can only approximate it by getting creative with fields that make a good proxy. The one exception is GA360/Google Marketing Cloud customers, since they can access their clickstream data via BigQuery as part of their subscription.
- GA's interface will give you really smart-looking notifications now, like "Filter internal traffic. Hits from your corporate network are showing up in property example.com". It's not doing anything clever like dynamically cross-referencing your IP address while you're in the admin area against the collected data in your GA property. It's literally just triggering that warning based on the fact that you haven't applied any IP-based filters yet.
There are quite a few other completely unintuitive aspects of GA that are rooted in the fact that its data processing model is incredibly straightforward: there are very few exceptions to it and virtually no edge cases taken into account. This leads to a lot of instances where people's expectations decouple from actual behavior. A good rule of thumb: if a particular feature or metric seems even remotely like it would require extra computation or complexity to work the way you're imagining, then it's highly likely it doesn't work that way.
[1] https://support.google.com/analytics/answer/1010249?hl=en
[2] https://www.iab.com/guidelines/iab-abc-international-spiders...
[3] https://www.iab.com/wp-content/uploads/2015/11/IAB_SpidersBo...
[4] https://support.google.com/analytics/topic/1032939?hl=en&ref...
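As a rough illustration of how deterministic that UA-based filtering is, here's a sketch with made-up patterns (the real IAB list is licensed and far larger; these regexes are purely illustrative):

```python
import re

# Hypothetical stand-in for a spiders-and-bots list: a flat set of
# user-agent patterns, applied as-is to each hit -- nothing adaptive.
BOT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r"googlebot", r"bingbot", r"crawler", r"spider", r"\bbot\b"]
]

def is_listed_bot(user_agent: str) -> bool:
    """Purely deterministic: match the UA string against the list."""
    return any(p.search(user_agent) for p in BOT_PATTERNS)

# A self-identifying crawler is caught...
print(is_listed_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
# ...but a headless browser presenting an ordinary Chrome UA is
# indistinguishable from a real visitor to a UA-only filter.
print(is_listed_bot("Mozilla/5.0 ... Chrome/80.0 Safari/537.36"))  # False
```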
I don't understand what Cube.js is doing in this app.
Once the data is inside Athena, it's just a matter of querying it right.
I would rather call them "zero-config servers" instead.
Google Analytics is much faster; it responds in a few hundred milliseconds.
What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?
Are you referring to their reporting API, or their collection endpoint? The collection endpoint is certainly fast to respond, but the actual reporting API can be quite slow depending on what you're trying to get from it.
> What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?
I'm not the parent, but I've created setups like what was mentioned. It sounds like they hosted the collection endpoint on AppEngine, then used DataFlow for streaming the data into BigQuery. Potentially using a Pub/Sub topic to queue up for DataFlow, since that has native integrations with DataFlow and even has a template available to support it[1].
[1] https://cloud.google.com/dataflow/docs/guides/templates/prov...
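A minimal sketch of the collection side of that kind of pipeline. The field names and schema here are assumptions for illustration, not anything from the actual setup:

```python
import json
import time

def hit_to_row(ip, ua, path, ts=None):
    """Shape one incoming request into a flat, BigQuery-friendly row."""
    return {
        "ts": ts if ts is not None else time.time(),
        "ip": ip,
        "user_agent": ua,
        "path": path,
    }

row = hit_to_row("1.2.3.4", "Mozilla/5.0", "/pricing", ts=1700000000.0)
payload = json.dumps(row).encode("utf-8")
# In the real pipeline, a payload like this would be published to a
# Pub/Sub topic (e.g. publisher.publish(topic_path, payload) with the
# google-cloud-pubsub client), and the Dataflow template would stream
# each message into a BigQuery table.
print(payload)
```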
GA stores summary tables for each day for the basic values. If you have a large site and request segments or anything that's not in the summary tables, it can be quite slow.
Also, BigQuery is multi-tenant. GA would have dedicated instances.
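A toy illustration of that split: the daily summary is precomputed once at ingest time, so basic lookups are cheap, while an ad-hoc segment has to rescan every raw hit:

```python
from collections import Counter

# Tiny stand-in for raw clickstream data.
raw_hits = [
    {"day": "2020-01-01", "page": "/", "country": "DE"},
    {"day": "2020-01-01", "page": "/pricing", "country": "US"},
    {"day": "2020-01-02", "page": "/", "country": "US"},
]

# Precomputed at ingest: afterwards, a daily total is a dict lookup.
daily_pageviews = Counter(h["day"] for h in raw_hits)

# Ad-hoc segment ("pageviews from US visitors"): full scan of raw data,
# which is what makes non-summary questions slow on a large site.
us_views = sum(1 for h in raw_hits if h["country"] == "US")

print(daily_pageviews["2020-01-01"], us_views)  # 2 2
```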
> What did you use dataflow for? How did you get data from end points and insert them into bigquery? Using streaming inserts?
cosmie pretty much got it. AppEngine collected; Dataflow sessionized and did some other processing (GeoIP lookup, filtering, etc.); BigQuery stored.
I actually had AppEngine dumping into Cloud Datastore, but I also experimented with PubSub and also using Cloud Storage access logs.
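The sessionize step could look roughly like this, assuming the classic 30-minute inactivity timeout as the session boundary (an assumption on my part, not a detail from the original setup):

```python
SESSION_GAP = 30 * 60  # seconds; assumed inactivity timeout

def sessionize(timestamps):
    """Group one visitor's hit timestamps into sessions.

    A new session starts whenever the gap since the previous hit
    exceeds SESSION_GAP. Returns a list of sessions, each a list
    of timestamps.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap too large: new session
    return sessions

# Three hits close together, then a ~65-minute gap, then two more:
hits = [0, 60, 120, 4000, 4100]
print(len(sessionize(hits)))  # 2
```

In a real Dataflow job this grouping would run per visitor key inside a windowed pipeline, but the core logic is the same.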
Basically, you select all rows and then Cube.js does something with those rows, when you could in fact run the queries directly in Athena.
Am I missing something?
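For comparison, the aggregation could be pushed into Athena itself rather than selecting all rows and post-processing them in the app. The table and column names below are made up for illustration:

```python
# An aggregation expressed directly in Athena SQL, so only the small
# aggregated result ever leaves the query engine.
query = """
SELECT url, COUNT(*) AS pageviews
FROM access_logs
WHERE day = DATE '2020-01-01'
GROUP BY url
ORDER BY pageviews DESC
LIMIT 10
"""

# With boto3, this would be submitted via
# athena.start_query_execution(QueryString=query, ...) and the small
# result set fetched with get_query_results -- no full-table transfer.
print(query.strip().startswith("SELECT"))  # True
```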