Fedora, UUIDs, and user tracking

(lwn.net)

132 points · tlburke · 7y ago · 83 comments

AdmiralAsshat 7y ago

Good on them for taking the time to think about how to do this properly. I use Fedora, I'm happy to let them know that I use Fedora, and even to make the check-in somewhat regular so that they can know if I stop using Fedora.

Provided that they figure out a way that absolutely nothing can be done with the information other than to say, "This non-identifiable machine reports that it uses Fedora," I'd be okay with that.

z3t4 7y ago

It's a slippery slope: first you want to know how many use your software, then you want to know what features they use, then you want to know other apps installed, then you want to know what web sites they visit and what they search for, and so forth.

Barrin92 7y ago

I can understand the slippery slope threat in a proprietary system, but Fedora is an open system. Should data collection at some point go beyond whatever it is the devs announce, someone is going to pick it up and everyone is going to drop the distro like a hot potato. So I'm not sure this is a big issue.

Karunamon 7y ago

The lack of serious consequences to Canonical regarding their spyware-like, opt-out search integrations in Ubuntu with Amazon way back when (and then doubling down with legal threats against fixubuntu.com) gives me doubt as to this being the case.

That goes double given Fedora's position as what boils down to RHEL upstream. Too much corporate support for any backlash to make a dent, even if it could get critical mass.

throwaway2048 7y ago

Telemetry has a nice creeping scope like that. Although not necessarily a problem with Fedora, this scope often expands to "how can we monetize this info".

paavoova 7y ago

It's interesting to see how this discussion steers towards privacy. Some distributions, like Ubuntu for example, are far less conservative. Besides the NTP tracking mentioned in the article, there's the Amazon fiasco of the past, and Ubuntu 18.04 installs quite a bit of telemetry [1], including tracking packages. It ships with a "dynamic" MOTD that runs a script periodically which downloads updates from Canonical. While this may be useful for server administrators who wish to be notified of products and updates, it has at one point shown ads for an HBO show [2].

Annoyingly, while installing the Xubuntu flavor, there appeared to be no option to opt out, nor was there even a mention of any such telemetry in the live installer interface. I had to track it down and disable it manually after installation, something the average user is not going to bother with and what Canonical is surely betting on. I appreciate how Poettering brings up trust and "red flags", knowing full well that the lower the transparency, the larger the reactionary incentive for users to opt out or disable such telemetry. Canonical could perhaps take note.

[1] https://askubuntu.com/questions/1027532/how-to-opt-out-of-sy...
[2] https://bugs.launchpad.net/ubuntu/+source/base-files/+bug/17...

jammygit 7y ago

"Essentially, users would need to trust that the project isn't doing the tracking because it says it isn't."

The cynic in me is recalling that Red Hat just got bought by IBM, and IBM is in the news for tracking people in a weather app in a sneaky way.

I don't know any better though, maybe Fedora is quite independent of Red Hat/IBM and it's 100% legit to trust their promises. I'm not sure how it works tbh.

Edit: added quote from article

TomasSedovic 7y ago

(disclaimer: I work at Red Hat, not on any OS/distro)

This is a nitpick, but Red Hat did not get bought by IBM. What happened was IBM announcing the intention to buy Red Hat.

It's maybe a subtle but possibly important distinction. Red Hat is still its own independent entity until the deal goes through (which means IIRC passing the board's approval, SEC and likely other stuff). This is expected to happen in late 2019 I believe, but it might still fall through.

This doesn't absolutely dispel any possibility of IBM's influence, but it should be very low/zero until the merger actually goes through. But I also don't know how all this works.

jammygit 7y ago

Thanks for clarifying for me, I had that mixed up

javajosh 7y ago

This post describes a bad way to track users, but the real utility of this post is in the email that describes a way to count Fedora users without tracking them:

https://lwn.net/ml/fedora-devel/20190108152239.GA24118@garde...

yjftsjthsd-h 7y ago

That actually sounds more reasonable, although it does run a tiny risk of being trivial to mess with if a malicious client wanted to skew numbers. But I don't think it's possible to defend against that without being horribly invasive from a privacy perspective.

I must say, it feels odd to support a Poettering proposal, but this actually does look like a good solution.

stordoff 7y ago

> although it does run a tiny risk of being trivial to mess with if a malicious client wanted to skew numbers

Is that not also the case with the UUID solution? Generating the UUIDs in virtual machines, or just replacing the UUIDs in the requests, doesn't seem out of the question.

msla 7y ago

> although it does run a tiny risk of being trivial to mess with if a malicious client wanted to skew numbers.

I don't doubt it will happen if this becomes well-known. Activism takes many forms.

stordoff 7y ago

My first thought when I saw they just wanted to do counting was: could you not do something like send a

> This_is_a_first_install

request the first time, then

> get_updates

in the future? So I'm glad to see the proposed solution is something vaguely similar.

v_lisivka 7y ago

Yep. Quote from the Fedora Wiki:

Options for "true" values

Rather than a simple boolean, we'd like the "countme" variable to act as an increment-counter. That is, it would be "1" the first week, "2" the second week, "3" the third week, and so on. This will let us sort out short-lived test or CI infrastructure machines and get a better picture of how systems are used over time, without tracking individual systems. Optionally, we could have a cap on the maximum value to mitigate risk of uniqueness for systems which have been running for a very long time (it may be that there are only a few systems running for exactly 327 weeks, for example). As the supported lifetime of a Fedora release is about 13 months, a logical cutoff would be around 60 weeks: the counter could go from "59" to "old".
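
To make the capped counter concrete, here is a minimal sketch of how a client might derive the value it reports. The function name `countme_value`, the use of the install date as the starting point, and the exact rounding are assumptions for illustration, not anything taken from DNF itself; only the "1, 2, 3, ... 59, old" sequence comes from the quoted text.

```python
from datetime import date

CAP_WEEKS = 60  # cutoff suggested in the quoted wiki text


def countme_value(installed_on: date, today: date, cap_weeks: int = CAP_WEEKS) -> str:
    """Return the value a client would report for the current week.

    The install week counts as week 1. Once the cap is reached the value
    collapses to "old", so long-lived machines stop standing out.
    """
    if today < installed_on:
        raise ValueError("today is before the install date")
    week = (today - installed_on).days // 7 + 1
    return str(week) if week < cap_weeks else "old"
```

The cap is the privacy-relevant part: without it, a machine reporting a rare value such as "327" could be followed from week to week even though no identifier is sent.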

Crontab 7y ago

I don't mind Fedora wanting to get counts of things, provided it is exclusively an opt-in feature. Debian's 'popcon' is an example of doing it right.

sliken 7y ago

Why not just track downloads from the mirrors? If you post a new version of a package for Fedora 29, just track how many downloads of that specific file are made. Write some scripts for log processing and require official mirrors to submit the logs to give you the package counts.

That way user info never makes it past the mirror (which has their IP anyways) and you don't need anything complex like UUIDs, playing tricks with NTP, or calling home.

This would give a reasonably accurate number. Use bash for measuring Linux installs (pretty rare to have Linux installed without bash). Then use more desktop apps like firefox, eog, and xpdf to measure desktop use. If interested in the server side, track mongodb, apache, mysql, and similar.

This would also help fedora decide which applications they should pay more attention to.
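
As a rough illustration of the log-processing idea, the sketch below counts successful package downloads per package name. It assumes the mirrors write Common/Combined Log Format access logs and that files follow the usual `name-version-release.arch.rpm` naming; the helper names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Matches the request and status fields of a Common Log Format line.
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')


def package_name(path: str) -> Optional[str]:
    """Extract the package name from an RPM path, or None if it is not an RPM."""
    filename = path.rsplit("/", 1)[-1]
    if not filename.endswith(".rpm"):
        return None
    # name-version-release.arch: the name is everything before the last two hyphens.
    parts = filename[: -len(".rpm")].rsplit("-", 2)
    return parts[0] if len(parts) == 3 else None


def count_downloads(log_lines: Iterable[str]) -> Counter:
    """Count HTTP 200 downloads per package name; client IPs are never read."""
    counts: Counter = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None or match.group("status") != "200":
            continue
        name = package_name(match.group("path"))
        if name is not None:
            counts[name] += 1
    return counts
```

Note that this counts downloads, not machines, so repeated downloads by one machine or downloads served from a private mirror still skew the result.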

jewelry 7y ago

But this would be either over-counting if some CI scripts download the version every once in a while, or under-counting if some organization puts the image on their own privately maintained mirror, which is quite common.

noobiemcfoob 7y ago

Telemetry of any type usually fails to measure precisely the thing you want and instead measures something adjacent that correlates strongly. What you mention are clear problems with inferring usage from downloads, but if you can infer the percentage of downloads that correlates to a machine running Fedora, you don't need much more precision.

notduncansmith 7y ago

Perhaps packages are reinstalled too frequently to give accurate numbers? Though I agree this sounds like a good solution.

JohnFen 7y ago

Any distro that phones home with a unique identifier is a distro I won't touch with a ten foot pole. I don't care what they claim they will or won't use that identifier for.

kbenson 7y ago

Maxims that act on the symptom rather than the problem rarely help in the end, as the problem just evolves to support its needs through other means.

For example, sending a unique identifier is not the problem. Tracking people through a unique identifier is. So, depending on your goals, you can design a unique identifier system that does not allow tracking (or at least makes the tracking period so small as to be unuseful for purposes other than designed) as outlined in the article through changing the identifier on the client side weekly.

If all you want to do is get a good estimate of how many users use what types of configurations of your software (major and minor version), a UUID that rotates weekly on the client side is perfectly acceptable to use for those statistics to a fair degree of accuracy.
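
A minimal sketch of client-side weekly rotation, assuming the identifier is a random UUID replaced whenever the ISO calendar week changes. The class and function names are hypothetical, and nothing here reflects what DNF actually shipped.

```python
import uuid
from dataclasses import dataclass
from datetime import date
from typing import Tuple


def week_key(day: date) -> Tuple[int, int]:
    """Return (ISO year, ISO week number) for the given day."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


@dataclass
class RotatingId:
    value: uuid.UUID
    week: Tuple[int, int]

    @classmethod
    def new(cls, today: date) -> "RotatingId":
        return cls(uuid.uuid4(), week_key(today))

    def current(self, today: date) -> uuid.UUID:
        """Return this week's identifier, discarding the old one on rotation."""
        key = week_key(today)
        if key != self.week:
            # A fresh random value: nothing links it to the previous week's ID.
            self.value = uuid.uuid4()
            self.week = key
        return self.value
```

The server can deduplicate requests within a week but cannot link one week's identifier to the next. That still requires trusting the client code, which is the objection raised in the replies.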

On the other end of the spectrum, people long ago started reducing their trackable footprint online, and the online tracking ecosystem just evolved to finding people through other, trickier methods, such as browser fingerprinting.

JohnFen 7y ago

You're right in general, of course. But here's the reason for my hardline stance on that: history shows that trusting promises or assertions made about things like unique identifiers is unwise, and so I have to take a strong defensive stance.

> you can design a unique identifier system that does not allow tracking

You can (sort of), but we run up against that trust issue again. If I'm giving a unique identifier to someone, I have no way of knowing if their assertions about its use are accurate. Even if they are, there's no guarantee that won't change in the future.

> If all you want to do is get a good estimate of how many users use what types of configurations of your software (major and minor version)

You're talking about the perspective of the publisher. I'm talking about my perspective as a user. A company's "need" to collect metrics is their problem, not mine. If their solution results in more information disclosure than I'm comfortable with (and a unique identifier absolutely is), then I will avoid their software or block communications to their home base.

msla 7y ago

There are reasons to draw a line in the sand, to say that even attempting to do some things is contrary to a strong norm that we will defend even if you promise that you're not using it for anything malicious, something which is hard to police.

Taking a strong stand against tracking and, therefore, in favor of privacy is perfectly reasonable for people who use Linux in part due to our hatred of the deep tracking closed-source OSes do.

tracker1 7y ago

I would think if it rotated on the first of each month, that would probably be sufficient... then you could get your counts for any given month (excluding first/last day), assuming most systems check every week or two at least, and it would be pretty consistent.

jefftk 7y ago

Later on in the article they describe a revised solution that doesn't do that:

> Poettering came up with a scheme that alleviated most of the problems that were identified. He proposed that a "countme" flag simply be added to a single mirror-list query each week. The sum of all such queries over a week's time should provide an accurate estimate of the number of Fedora systems. That way, UUIDs need not be stored, which removes much of the concern—data that is not stored cannot be misused.

JohnFen 7y ago

Yes, that's much better.

akerro 7y ago

> unique user ID (UUID) for each installed system that would be sent with DNF mirror-list requests. It explicitly calls out privacy concerns: "We don't want to track; just count."

If a Fedora server is compromised, the attacker can serve different packages to different users.

derefr 7y ago

Given that package servers serve packages over HTTP, you can already do this, identifying the user you want to serve different packages by their IP.

However, the packages need to be signed by Fedora for the package manager to accept them, so this has been considered a pretty weak excuse for an "attack" for a while now. "Getting access to code-signing keys allows you to attack the people consuming signed binaries"—wow, you don't say!

prolurker 7y ago

With control over the mirror list, you can prevent certain users from getting updates, which is a security problem, but without being able to sign packages the danger is limited.

rhn_mk1 7y ago

Looking at the wiki page [0], I can see the benefits of the move:

> Better metrics overall

> Public stats page updated automatically

> Better knowledge of relative use of different variants

> Insight into Fedora's use in short-lived test systems and temporary containers vs. longer-term installations

but nothing evaluating how and whether the proposed solutions will achieve those things.

With no method being perfect, I'm surprised that no one is calling for a quantitative evaluation of various ID collection schemes, and that there is no defined "good enough" value, other than

> We need better data than that.

I'm not a Fedora maintainer, and I'm not maintaining any other software of such popularity, so I have to ask: why? I assume it's to allocate work better. At which point do the downsides outweigh that benefit?

[0] https://fedoraproject.org/wiki/Changes/DNF_Better_Counting

z3t4 7y ago

If it's totally anonymous there's nothing stopping someone trolling the statistics.

tflink 7y ago

Disclaimer: I work for Red Hat on Fedora

True but we're already in that boat with the way that we gather statistics from mirror hits. I have a hard time seeing how a method like the one proposed would be any more vulnerable to tampering.

EDIT: spelling

whatshisface 7y ago

The idea is that it isn't less vulnerable to tampering, but you pay a privacy and public image cost.

v_lisivka 7y ago

This change proposal can be tracked here: https://fedoraproject.org/wiki/Changes/DNF_Better_Counting

In short:

Add a new "countme" variable. This variable will:

- Start as a "true" value,
- Reset to a "false" value the first time the client successfully makes a request to Fedora mirror servers, and
- Be reset to a "true" value after seven days.

This way, rather than filtering by unique IP addresses, we can count only the "true" requests, so we count each machine once — but no more than once.
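
The flag behaviour described above can be sketched as a small client-side state machine plus a server-side tally. The `countme=1` query parameter, the example mirror-list URL, and all names below are assumptions for illustration; the wire format is not specified in this thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional
from urllib.parse import parse_qs, urlsplit

WINDOW = timedelta(days=7)


@dataclass
class CountmeState:
    # None means the machine has never been counted, so the flag starts "true".
    last_counted: Optional[datetime] = None

    def flag(self, now: datetime) -> bool:
        """True if the next mirror-list request should ask to be counted."""
        return self.last_counted is None or now - self.last_counted >= WINDOW

    def mark_success(self, now: datetime) -> None:
        """Call only after a flagged request succeeded; resets the flag to "false"."""
        if self.flag(now):
            self.last_counted = now


def mirrorlist_url(base: str, state: CountmeState, now: datetime) -> str:
    """Append the (hypothetical) countme parameter when the flag is set."""
    if not state.flag(now):
        return base
    separator = "&" if "?" in base else "?"
    return f"{base}{separator}countme=1"


def tally(urls: Iterable[str]) -> int:
    """Server side: count the flagged requests seen in a period."""
    return sum(1 for u in urls if parse_qs(urlsplit(u).query).get("countme") == ["1"])
```

Because a failed request never calls `mark_success`, the flag stays "true" until a request actually gets through, which matches the "first time the client successfully makes a request" wording.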

Beldin 7y ago

I'm not sure what they want to count. It definitely isn't users, as they ignore multiple users per system. It seems to be something like "currently active and online machines". But then you should not ignore machines that will not be updated. Maybe they mean "machines that follow the weekly update schedule this week"?

That seems to be what Poettering's approach counts.

tflink 7y ago

Disclaimer: I work for Red Hat on Fedora. Take that for what you will

As far as I know, the desire is to get better numbers on how much the parts of Fedora are being used. There is always more work to do than there are folks to do all of it; having better numbers on how much different bits are being used helps us make better decisions on what to focus on.

Granted, I'm not Matt but I've heard him talk about similar things and have run into the issue myself - "Is anyone even using this? Is it worth putting this level of effort into this particular thing?"

EDIT: Phrasing of the last sentence

JohnFen 7y ago

But Fedora should remain wary of an over-reliance on telemetry. It's very, very easy to draw the wrong conclusions about things, leading to decisions that reduce the quality of the product.

As an example, there are very likely to be packages that aren't often needed, but are absolutely critical when they are.

nixpulvis 7y ago

Just count the number of bug reports. That seems like a more useful metric anyway. If the users aren't complaining, who cares?

(about 75% serious)

viraptor 7y ago

It's not a consistent metric. You'll get both spikes around new releases and changes that reflect the automated reporting/ease of reporting changes.

moosingin3space 7y ago

Also, since Fedora is primarily an integration project, many users report bugs upstream.

flowless 7y ago

Here you go https://retrace.fedoraproject.org/faf/summary/

The same problem arises though as you can't track senders - there's no way of knowing how many reports were produced by a single machine.

anonunt 7y ago

What's funny is I just started using Fedora (and I have actually been really enjoying it), but to help me remember it's not apt or rpm or even yum, I have been thinking to myself "Do Not Follow", for no reason at all other than that I first learned about it after installing a new machine and configuring Firefox etc. :)

dane-pgp 7y ago

> Lennart Poettering ... did suggest using an application-specific machine ID, like those calculated by sd_id128_get_machine_app_specific().

Yes, I'm sure he did.
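
For context on the function named in that quote: the systemd documentation describes the app-specific ID as an HMAC-SHA256 of the application ID, keyed by the machine ID. The sketch below only illustrates that idea. The truncation to 128 bits and the UUID version handling are assumptions, so the output is not guaranteed to match what systemd produces, and `app_specific_id` is a hypothetical name.

```python
import hashlib
import hmac
import uuid


def app_specific_id(machine_id: bytes, app_id: bytes) -> uuid.UUID:
    """Derive a per-application ID that does not reveal the machine ID.

    Both inputs are 128-bit values. HMAC is one-way, so an application that
    sees its own derived ID cannot recover the underlying machine ID.
    """
    if len(machine_id) != 16 or len(app_id) != 16:
        raise ValueError("both IDs must be 16 bytes (128 bits)")
    digest = hmac.new(machine_id, app_id, hashlib.sha256).digest()
    # Assumption: keep the first 128 bits and stamp them as a version 4 UUID.
    return uuid.UUID(bytes=digest[:16], version=4)
```

Two applications on the same machine get unrelated IDs, but the same application always gets the same one, so the value is still a stable identifier for that application.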

jjgreen 7y ago

Please don't use "UUID" for that, it's taken (and useful).

nixpulvis 7y ago

Well, they could just as easily use a "real" UUID [1] variant, and all the concerns of this topic would still remain the same.

[1]: https://en.wikipedia.org/wiki/Universally_unique_identifier

Tharkun 7y ago

Wonderful. I guess I'll now have to find a way to regenerate this UUID or to spoof it every time Fedora tries to phone home.

If you want to count users, ask for permission during firstboot. If that's too much to ask, then I'll be in the market for a new OS. Maybe I'll finally go back to my first love: FreeBSD.

MBCook 7y ago

Read the whole article. They seem to have decided against that and for a simple ‘countme’ flag on update requests to mirrors. Possibly by only a random subset of machines.

No tracking, just simple numeric data fit for purpose.
