I could be missing something, but that seems to solve both the duplicate event firing (an upsert command based on the UUID makes duplicate event writing a non-issue) and the timing issues.
Though I’m still incredibly skeptical of “real-time analytics.” The number of business cases that require actual real-time analysis are pretty limited. High frequency trading and...?
However, this event ID is not enough to identify and then dedupe all types of duplicate events. This blog post provides more information:
https://snowplowanalytics.com/blog/2015/08/19/dealing-with-d...
Big thanks to pragmacoders for putting this tutorial together! It's awesome seeing what you are doing with the Snowplow platform :-)
If we leave these at an event level, in a database, we'd need to:
- Loop through all events
- Group them by user
- Figure out which ones represent the beginning of a page view
- Figure out which ones represent the end of a page view
- Subtract these two events
- Aggregate these for all users in the database
- Find the median
This can take a prohibitive amount of time for a pretty simple question. It'd be much faster if the data was already stored in a manner that is useful for such an exploration.
As questions become more complex - so do queries. They become hard do think about and the time it takes to process them explodes.
One related concepts is "Entity-Centric Indexing", which talks about this problem in terms of ElasticSearch.
Anywho - that's a long-winded reason why I'm taking this approach!
Here's 18 lines of SQL to save you a week of writing a data pipeline in Go.
WITH page_timings AS (
SELECT page_type
, event_timestamp_utc
, LEAD(event_timestamp_utc, 1)
OVER(PARTITION BY session_id
ORDER BY event_timestamp_utc)
AS next_pageview
FROM events
WHERE event_type = 'page_view'
)
SELECT page_type
, MEDIAN(
DATEDIFF(sec,
event_timestamp_utc,
next_pageview)
) AS median_sec_on_page
FROM page_timings
GROUP BY page_type
;> It is likely that one day, I'll need to shard the data and distribute the processing amongst multiple servers. But no company I've used this with currently has enough data flowing through its analytics system, or intense amounts of real-time processing, to warrant such a complexity.
This solution is so over-engineered for low data volume. You could capture all of the business value for 10% of the engineering effort by just dumping this data into a database meant for analytics. And then you'd at least have an answer for things like fixing broken data, making full-history business logic changes, merging events, etc.
If you're in AWS, sending your events from snowplow into S3 and then into Redshift/Athena/Presto/PostgreSQL is the way to go.
Like: - "What is the median time that users spend on a given page?"
- "How many events do users generally fire in their first 3 sessions?"
-"Who visited Page A, then Page B, THEN Page C?"
-"How many users who used this feature also used this feature within the same day?"
This is difficult to do with purely event-level metrics. It also takes a really long time and isn't suited to exploration.
We're already using a data warehouse (AWS Redshift) to analyze event-level metrics. But more complex queries are much easier, and much faster to do, when the data is stored in ways that are more suited to such calculations.
So, that's what this article is hopefully helping people to do!
For a related topic, "Entity-Centric Indexing" is the ElasticSearch equivalent of what I'm working on!
But my biggest question is WHY did this person feel it necessary to do this project? From first glance, there is no way Crystal is producing the traffic required to roll this solution. There are dozens of companies that can solve this problem for a few hundred dollars a month and have handled all the problems discussed in this article at serious scale for their customers.
The most irritating part is that the developer states in the beginning why he did this: because it’s fun.
Disclosure: I’m a founder of company whose core product is a real time analytics platform for web and mobile. Of course I’m going to recommend “Buy” for a small company like this in a Build vs Buy analysis. But when “professional engineers” say they’re building projects “because they are fun”, a lot of people suffer.
I'm confused why you're angry at me exploring a subject and giving my learnings away for free to anyone who is interested.
I am also confused why you assume knowledge of how I use the things created in this article or the technical needs of projects I work on, when I have not described those needs nor described the data I work with.
Ah - You own a company that sells the thing I'm trying to help people do for free and with open-source tools. Never mind. Not confused.
In all seriousness - If I'm leading people in the wrong direction, I definitely would like to know a better direction to lead them. But that direction won't be to deposit money into your wallet. It will be to help them learn (hopefully) useful techniques to deal with their problems.