Apache Kafka and GDPR compliance

(landoop.com)

84 points · Antwnis · 8y ago · 46 comments

29 comments · 6 top-level

Sir_Substance · 8y ago · 7 in thread

>The right to be forgotten, becomes one of the hardest challenges because of data immutability. Apache Kafka does not support deleting records, and although some eventual deletion is supported, it requires

This always seemed like an incredibly toxic decision to me. It's one that crops up in all sorts of systems, large and small. What, none of these people /ever/ foresaw the need to delete some data?

Antwnis (OP) · 8y ago

As with everything in life, to gain something you need to sacrifice something else. With an RDBMS you get mutability, but to go 10x or 100x faster/larger you need to make hard decisions.

HDFS, S3 and other systems have immutability built in. Immutability is not bad per se, as it gives (some) assurance that data has not been tampered with, and although mutability could be implemented, the system cost could be significant.

Striking the right balance is the challenge.

wiz21c · 8y ago

It's not that simple. For example, in my business we may give some money to help someone "once in its life" (the law says so). Therefore, if the person asks to be deleted, we might not be able to apply the law anymore, because we won't remember the decision... I think GDPR is a good thing, but at some point, in my business, those who write the laws will have to be aware of it (and the legal team is miles away from the IT stuff, sadly).

tscs37 · 8y ago

The GDPR offers exceptions to the right to erasure; these mostly cover legal compliance (e.g. banks), the interest of legal claims, and cases where data cannot easily be deleted as individual records. It also does not affect non-digital documents that aren't filed. This is all laid out very thoroughly in the legal documents relating to this.

1 more reply

mclarke · 8y ago

GDPR has an exemption related to the legal requirement to process data that might cover this (and related) scenarios.

> ...(unless) processing is necessary for compliance with a legal obligation to which the controller is subject;

1 more reply

closeparen · 8y ago

Where integrity matters, you never want data to be mutated with no trace. An audit trail is almost always needed - it’s not an extreme leap, then, to say “why don’t we just replay the audit trail to arrive at the current state?”
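A minimal sketch of the "replay the audit trail" idea, in Python. The `Event` shape, the account names, and the `replay` helper are all made up for illustration; nothing here comes from Kafka itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable entry in a hypothetical audit trail."""
    account: str
    kind: str    # "deposit" or "withdraw"
    amount: int


def replay(events):
    """Fold the append-only trail into the current balance per account."""
    balances = {}
    for e in events:
        if e.kind == "deposit":
            delta = e.amount
        elif e.kind == "withdraw":
            delta = -e.amount
        else:
            raise ValueError(f"unknown event kind: {e.kind}")
        balances[e.account] = balances.get(e.account, 0) + delta
    return balances


trail = [
    Event("alice", "deposit", 100),
    Event("bob", "deposit", 50),
    Event("alice", "withdraw", 30),
]
state = replay(trail)
```

The current state is never stored as the source of truth; it is derived from the trail, which is why deleting an entry from the middle of the trail is awkward.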

nemothekid · 8y ago

>What, none of these people /ever/ foresaw the need to delete some data?

It's a performance trade-off, and not a very surprising one. Hard disk drives have always been known to never actually delete data (if you want the data gone, you overwrite it with 0s). It's not unimaginable that this performance trade-off found its way up the stack.

And just like a regular HDD, you can forcibly delete the data, it's just a very expensive operation that isn't needed 95% of the time.

Sir_Substance · 8y ago

>And just like a regular HDD, you can forcibly delete the data

Except apparently not, because the linked article is literally saying it's not supported.

I get wanting an audit trail, and I get wanting to not delete data if you don't have to for performance reasons, but neither of those things is the same as saying "it's literally not possible to delete stuff".

3 more replies

theptip · 8y ago · 6 in thread

This "right to be forgotten" requirement is quite staggering in scope. Do I need to dig out all of my offsite tape backups and re-transcribe them to edit out my users' data every time a user requests to be forgotten?

Sibling comments mention a cunning scheme with encryption, but that doesn't really help an enterprise with an existing non-GDPR-compliant backup archive.

antoncohen · 8y ago

I asked a GDPR consultant we hired about backups. The answer was basically that the letter of the law requires data to be deleted from backups, but people aren't going to do that, and there is some language about "reasonable" effort or something like that.

This might be useful: http://www.davidfroud.com/does-right-to-erasure-include-back...

dannyw · 8y ago

Deletion doesn’t have to be immediately complete; if you rotate backups on a six month period then that’s okay.

At least, that’s what I heard about how a BigCo is doing GDPR.

numbsafari · 8y ago

My understanding is that the GDPR “right to be forgotten” does not cover backups. There may be some exceptions, but there are practical limits on its reach.

sulam · 8y ago

I believe your understanding is incorrect. GDPR certainly includes storage and processing, both of which backups probably trigger.

Anyway, think about the spirit of the law, and then think about how that interacts with backups. If someone asks to be deleted from your system, you do so, and then you restore a backup with their data, you have clearly violated the intent.

1 more reply

eledra · 8y ago

Given the fact that backup is a must for almost any system, it would be silly for GDPR to not "cover" backup files.

theptip · 8y ago

Interesting, if that's so, can we redefine the underlying Kafka topics as "backups" and achieve compliance by having the stream processors drop "forgotten" records when replaying a topic?
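A rough sketch of what "drop forgotten records when replaying" could look like, in Python. The record layout, the `user_id` field, and the `replay_visible` helper are assumptions made up for illustration. Note that the records stay in the underlying log; whether filtering at read time counts as erasure is a legal question this sketch does not answer.

```python
def replay_visible(records, forgotten_user_ids):
    """Yield replayed records, skipping any that belong to a forgotten user."""
    for record in records:
        if record["user_id"] not in forgotten_user_ids:
            yield record


# Hypothetical topic contents, oldest first.
topic = [
    {"user_id": "u1", "event": "signup"},
    {"user_id": "u2", "event": "signup"},
    {"user_id": "u1", "event": "purchase"},
]
forgotten = {"u1"}  # users who asked to be erased

visible = list(replay_visible(topic, forgotten))
```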

1 more reply

throwaway2016a · 8y ago · 4 in thread

I'm interested in the right to be forgotten section but I'm confused as to what this article is saying...

How exactly do you "forget" the data on the logs?

One interesting solution that kills two birds with one stone is to encrypt the personally identifiable information and then delete the private key when there is a request to be forgotten. This has the added benefit of effectively destroying the data in backup copies too.

Antwnis (OP) · 8y ago

> How exactly do you "forget" the data on the logs?

If we think through the options, you can have any of:

i) eventual deletion (log retention policy)
ii) compacted topics (and push null values)
iii) expensive re-processing of the entire log
iv) expensive segment re-write operation

with each option bringing in a new set of challenges.
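To make option ii) concrete, here is a small in-memory simulation of compaction with null values (tombstones), in Python. It is a sketch of the semantics only, not Kafka code: the `compact` function and the sample keys are made up for illustration. On a real topic this behaviour comes from `cleanup.policy=compact`, and tombstones themselves are kept for a while (governed by `delete.retention.ms`) before they disappear, so deletion is eventual rather than immediate.

```python
def compact(log):
    """Simulate log compaction.

    `log` is a list of (key, value) pairs in offset order. The result keeps
    only the latest value per key and drops keys whose latest value is None
    (a tombstone), preserving the original offset order of the survivors.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)

    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    survivors.sort()  # offsets are unique, so this orders by offset
    return [(key, value) for _, key, value in survivors]


log = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "new@example.com"),
    ("user-2", None),  # tombstone: user-2 asked to be forgotten
]
compacted = compact(log)
```

This only works if every record carrying a user's data is keyed by something that identifies that user, which is one of the "new challenges" this option brings.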

nerpderp83 · 8y ago

Encrypt with a user-specific key when the data enters the log. You can effectively delete all the user-specific data by throwing the key away. No tracking down files or reprocessing necessary.
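A minimal sketch of this per-user-key idea (often called crypto-shredding), in Python using only the standard library. Everything here is an assumption for illustration: the `KeyStore` class, the record layout, and especially the cipher, which is a toy HMAC-SHA256 keystream standing in for a vetted authenticated cipher such as AES-GCM. Do not use the toy cipher for real data.

```python
import hashlib
import hmac
import secrets


class KeyStore:
    """Hypothetical per-user key store; deleting a key 'forgets' the user."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Create a random 256-bit key the first time a user is seen.
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id):
        return self._keys.get(user_id)

    def forget(self, user_id):
        self._keys.pop(user_id, None)


def _keystream(key, nonce, length):
    # Toy counter-mode keystream built from HMAC-SHA256 (illustration only).
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key, plaintext):
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt(key, blob):
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


keys = KeyStore()
log = []  # stands in for the immutable log: entries are never removed
log.append(("user-42", encrypt(keys.key_for("user-42"), b"alice@example.com")))


def read(user_id, blob):
    """Return the plaintext, or None once the user's key has been deleted."""
    key = keys.get(user_id)
    return None if key is None else decrypt(key, blob)


before = read(*log[0])
keys.forget("user-42")
after = read(*log[0])
```

The log entry is still there after `forget`, but without the key it is just opaque bytes. The approach moves the deletion problem to the key store, which must itself be deletable and must not be copied into long-lived backups.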

2 more replies

MarkMc · 8y ago

Dear System Administrator,

We've just hacked your server and wiped the crypto keys for your users. As you know, all your backups are now useless.

Send us $1 million in Bitcoin to get your crypto keys back.

Sincerely,

Hacker McHackface

kbart · 8y ago

If somebody managed to hack into your servers deep enough to access private keys, you are f*cked anyway (they could just as well delete/encrypt all the data), so it's not an argument against user data encryption. Actually, storing private keys safely is easier than storing bulk data, because you can use dedicated hardware for that: an HSM.

polskibus · 8y ago · 3 in thread

I'm wondering if anyone has thought about a GDPR extension that would cover machine learning, i.e. being forgotten would mean the model "unlearning" my data (or being retrained on a dataset from which my data was removed).

hobofan · 8y ago

I would consider that already covered under the GDPR. Most machine learning approaches today make little to no guarantees about differential privacy and allow for (partial) extraction of the training dataset, which would mean that the request for deletion was never fully fulfilled.

polskibus · 8y ago

So do you mean that GDPR allows for a request for removal from a model, or that there is an exemption for data mining results?

1 more reply

lifeisstillgood · 8y ago

If your PII has been incorporated into a model, let's say giving a likelihood to buy red cars based on 100 data points, then it's fairly safe to assume your PII is anonymised - I cannot imagine a way back from model that each input

skyisblue · 8y ago · 2 in thread

With GDPR do we need to get consent from users before we can set any cookies?

kbart · 8y ago

GDPR itself doesn't specify cookie use. The "cookie law" is defined in the ePrivacy Directive (2002/58/EC), which is to be replaced by the ePrivacy Regulation, which is an addendum to GDPR. Actually, it's going to be a much saner approach than the joke the current "cookie law" is:

"Simpler rules on cookies: the cookie provision, which has resulted in an overload of consent requests for internet users, will be streamlined. The new rule will be more user-friendly as browser settings will provide for an easy way to accept or refuse tracking cookies and other identifiers. The proposal also clarifies that no consent is needed for non-privacy intrusive cookies improving internet experience (e.g. to remember shopping cart history) or cookies used by a website to count the number of visitors."[0]

To answer your question "do we need to get consent from users before we can set any cookies?"

It depends: yes for tracking cookies, no for others. How to tell them apart is another question..

0. https://en.wikipedia.org/wiki/EPrivacy_Regulation_(European_...

throwanem · 8y ago

Isn't that already an EU requirement?

alexatkeplar · 8y ago · 1 in thread

We've been doing a lot of thinking about how to support GDPR at Snowplow (Kafka and Kinesis but plenty of other logs and stores) - for our first phase we're just going to support irreversible pseudonymization of tagged PII:

https://github.com/snowplow/snowplow/issues/3472

For later phases, yes user-specific encryption of PII or hashing-with-lookup table are the way to go...
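A hedged sketch of the "hashing with lookup table" idea, in Python. This is not Snowplow's implementation; the `SECRET_KEY` value, the `tag_pii` and `forget` helpers, and the event layout are all made up for illustration. PII fields are replaced by a keyed hash, and a separate, mutable lookup table maps each token back to the original value until the user asks to be forgotten.

```python
import hashlib
import hmac

SECRET_KEY = b"example-secret-key"  # hypothetical; keep outside the log in practice

# Mutable side table: token -> original value. Unlike the log, this CAN be edited.
lookup = {}


def pseudonymize(value):
    """Return a deterministic keyed hash (HMAC-SHA256) of a PII value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tag_pii(event, pii_fields):
    """Return a copy of `event` with the named PII fields replaced by tokens."""
    out = dict(event)
    for field in pii_fields:
        if field in out:
            token = pseudonymize(out[field])
            lookup[token] = out[field]
            out[field] = token
    return out


def forget(value):
    """Drop the mapping for one PII value, so its token can no longer be looked up."""
    lookup.pop(pseudonymize(value), None)


event = {"email": "alice@example.com", "page": "/home"}
stored = tag_pii(event, ["email"])      # this is what would be written to the log
recovered = lookup.get(stored["email"])
forget("alice@example.com")
recovered_after = lookup.get(stored["email"])
```

Because the token is deterministic, all of a user's events still share the same token after `forget`, so they remain linkable to each other and potentially to outside datasets.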

brians · 8y ago

I wish you wouldn’t call it irreversible. Every large public claim of that sort has proven false. Consider the Netflix case, where the separate IMDB review dataset allowed reconstruction of pseudonymous movie watching records.

These approaches may help with compliance, but they’re the opposite of real safety.
