But since this post is about the About page, let me share a couple of lessons I've learned from the blog, which has been more successful in communicating my research than I'd dared to hope for when I started it 3.5 years ago.
1. Those of us working on technical areas often struggle to explain our ideas to others not as technical, in a way that avoids oversimplification and losing essential meaning. Sometimes you'll discover an analogy or metaphor or phrase that does both. Seize those chances; they're powerful.
2. Coming up with a name is more important than you might think. If a good name will make your idea or product even 5% stickier, it follows that it may be worthwhile to spend 5% of your time just coming up with the name. One way to do it is to be constantly on the lookout for a good name while you're working on the product.
3. If you're writing about something that has policy implications, and want it to be read in Washington, it's hard but not impossible. Two important requirements are to network and build up an audience — they aren't going to read your blog just because it ranks high in Google searches — and to use language that non-technical people can understand.
Happy to answer any questions!
https://www.i2b2.org/NLP/DataSets/Main.php
De-identification of medical charts is a bottleneck in clinical research. It's impractical to ask for thousands of consent forms, yet smaller sample sizes are inconclusive, so much so that most of medicine is driven by inconclusive research findings. Moreover, full anonymization makes it impossible to follow patient records over time. This will kill any big patient outcome study, at least financially. What are your thoughts?
I submitted the about page because the two key claims that you make, (1) that you only need a few bits of information to identify a person uniquely in the whole world and (2) that this information is becoming easier and easier to obtain, both make a lot of sense to me. Your about page does an excellent job of communicating these two points and I thought it might be interesting food for thought for HN.
I'm wondering whether, much as we'd all like to have privacy and anonymity, these could be goals that might be impossible to achieve in the future. I'd like to hear your thoughts on where we as a society are heading in this context, and whether it's unrealistic to expect that conventional expectations of privacy will continue to be fulfilled in the future. Perhaps we should accept that the privacy battle is lost and try to find other solutions to the problems that privacy was solving?
http://33bits.org/2011/10/18/printer-dotspervasive-tracking-...
http://33bits.org/2011/06/08/the-many-ways-in-which-the-inte...
The synopses of the two posts are:
My opinion is that it is impossible to put the genie back into the bottle: the cost of tracking every person, object and activity will continue to drop exponentially. ... If we accept that we cannot stop the invention and use of tracking technologies, what are our choices? Our best hope, I believe, is a world in which the ability to conduct tracking and surveillance is symmetrically distributed, a society in which ordinary citizens can and do turn the spotlight on those in power, keeping that power in check. On the other hand, a world in which only the government, large corporations and the rich are able to utilize these technologies, but themselves hide under a veil of secrecy, would be a true dystopia.
and from the other side:
There are many, many things that digital technology allows us to do more privately today than we ever could.
[examples snipped, but I recommend taking a look at the post]
Of course, I’ve only presented one half of the story. The other half, that technology is also allowing us to expose ourselves in ways never before, has been told so many times by so many people, and so loudly, that it is drowning out meaningful conversation about privacy.
Although these two opinions might at first sight seem contradictory, they are not. Some day I will get around to putting the two sides of the argument together into a coherent narrative that explains the nuanced scenario that I think we're heading towards, but for now I will offer you the above articles.
Take this scenario (and also check out the disclaimer at the bottom!).
All users on the earth type the same paragraph, or perhaps some password (clearly, the longer the text, the more distinct the fingerprint, but bear with me on this).
From this sequence of keypresses, I capture the timestamp at which each key is pressed and the duration for which each key is held down.
Based on this information, how would you suggest a person go about extracting some unique fingerprint from these values?
I was thinking the best way would be to treat each keypress as a point in some space and the duration it is held down as a vector. Then, if each user is entering the same paragraph, the distance across all of the vectors could be used to calculate some identifying fingerprint.
Disclaimer: I'm absolutely not interested in the slightest in tracking users. Every weekend I try to research something that interests me. Last weekend it was user fingerprinting based on the typing speed and cadence of users.
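One way to make the idea above concrete, as a rough sketch rather than a tested method: turn each typing sample into a vector of hold durations plus the gaps between consecutive keys, then compare samples by Euclidean distance. The event layout, the function names and the nearest-profile rule are all illustrative assumptions; real keystroke-dynamics systems use more careful statistics.

```python
from math import sqrt

# Each event is (key, press_time, release_time), times in seconds.
# This tuple layout and all names below are illustrative assumptions.

def features(events):
    """Return hold durations followed by gaps between consecutive keys."""
    dwell = [release - press for _, press, release in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return dwell + flight

def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("both samples must come from the same text")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_user(sample, profiles):
    """Return the name of the enrolled profile nearest to the sample."""
    return min(profiles, key=lambda name: distance(sample, profiles[name]))

# Made-up timings for two users typing the same two keys.
alice = features([("h", 0.00, 0.09), ("i", 0.15, 0.22)])
bob = features([("h", 0.00, 0.20), ("i", 0.45, 0.60)])
attempt = features([("h", 0.00, 0.10), ("i", 0.17, 0.23)])

print(closest_user(attempt, {"alice": alice, "bob": bob}))  # alice
```

With only two keys the vectors are tiny; a full paragraph gives many more dimensions, which is why longer text should give a more distinctive fingerprint.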
In the research community it's a proven and accepted concept. There are products in the market that do two-factor authentication based on password + keystroke dynamics, but I don't know how well they work.
I'm a sucker for a great name, but this is misleading. If something will give you an x% better outcome, it doesn't follow that you should spend about x% of your resources on it.
It depends entirely on the opportunity costs, yeah? You should spend time on your name only when you believe that thinking about names for another hour will do more good than coding or talking to customers for another hour.
Sorry, does both of what? I don't mean to nitpick, but this is a problem I run into regularly and I can't seem to find a reliable approach, so I'm just trying to divide it into the factors involved...
According to… http://www.wolframalpha.com/input/?i=how+many+people+have+li... http://www.wolframalpha.com/input/?i=106+billion+in+binary
37 bits
However, 33 bits is a simplification. You can express about 8.6 billion different values with 33 bits, but unless those 33 bits map to well-distributed discriminators, the number is meaningless...
For instance, knowing that a person lives in the US reveals a little under 5 bits of information (there are a little over 2^28 people living in the US, according to Google). The entropy is just a measure of how much a piece of information narrows down a population; it is not an exact encoding.
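The arithmetic behind the US example can be checked in a few lines. This is a minimal sketch; the function name is made up for illustration, and the population figures are the rounded powers of two used in the comment above, not census data.

```python
from math import log2

def bits_revealed(population, matching):
    """Bits gained by learning that a person belongs to a group of
    size `matching` within a population of size `population`."""
    return log2(population / matching)

WORLD = 2 ** 33   # the population that 33 bits can distinguish
US = 2 ** 28      # rounded-down figure for US residents from the comment

print(bits_revealed(WORLD, US))        # 5.0
print(33 - bits_revealed(WORLD, US))   # 28.0
```

Because the real US population is a little over 2^28, the true figure comes out slightly below 5 bits, which matches the comment.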
*Edit - s/science/theory/i
People have been disputing how many people have ever lived on the earth for decades. Some anthropologists have thrown out numbers as high as 70-120 billion, although other scientists have admittedly said the number is probably around 7-10 billion.
How many people to calculate how many people have ever lived?
Only log2(X) bits would be needed. As the author of the blog says, 6000 billion people could be written in just 43 bits.
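The bit counts quoted in this thread are easy to verify. A small sketch, using a hypothetical helper name and integer arithmetic to avoid floating-point rounding:

```python
def bits_needed(n):
    """Smallest number of whole bits that can label n distinct people."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()

print(bits_needed(106 * 10**9))    # 37
print(bits_needed(6000 * 10**9))   # 43
print(2 ** 33)                     # 8589934592
```

So 33 bits cover about 8.6 billion people, 37 bits cover the 106 billion estimate of everyone who has ever lived, and 43 bits cover 6000 billion.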
An interesting debate, regardless. I had no idea it could be so hard to calculate. At any rate, as you say, 43 bits would have humanity covered for a long, long time.
[1] http://en.wikipedia.org/wiki/World_population#Number_of_huma...
As for the development of algorithms to gather those bits, that's what my entire Ph.D. is about and what my blog is mostly about. This is what I've been proving for the last 6 years.
*Just realized comment numbers are unstable. Bad wordpress.
http://en.wikipedia.org/wiki/Entropy_(information_theory)
As the stuff posted on 33bits regularly demonstrates, it is surprisingly easy to get this much information for a whole lot of people.
I suggest you peruse it.
For instance if you know "Frank" doesn't wear a Rolex, that would not rule out very many people. So statistically, it would probably be better to know if Frank has red hair, as that could rule out a lot more people.
Also, let's say you have it narrowed down to four people, but the last bit of information is common to all of them. You now have to get another bit, and possibly another, correct?
EDIT: Felt like I didn't express my main point well enough: while you can certainly narrow down people with "bits" of information, information is most of the time not just 1 or 0 and can be too fuzzy (or too common) to be useful in a binary search, although with the right bits of information it can of course be fruitful.
I'm really interested in this concept and also curious as to whether anyone is employing it on a mass scale.
The definition of a bit is something which removes half the possibilities. If you have 4 people and acquire a "bit" of information that breaks them into two categories, one with 4 people and one with 0 people, you, by definition, in fact have 0 bits.
Fractional bits are not only possible, they are by far the common case. With a lg2 in the definition of the bit, it's pretty uncommon to have integral bits.
Critical insight: What we call a "bit" in a computer and a "bit" in information theory are related but not the same thing. You can't have a fraction of a bit stored in your computer's RAM, the words are meaningless. It is best to simply flush your idea of what you think a bit is and start over again from scratch when studying information theory, then when you are comfortable with it the connections will become obvious. Starting from the RAM side is actively harmful.
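To see fractional bits in practice, here is a short sketch of the two standard information-theory quantities involved. The function names are my own, and the 1% figure in the last line is an invented example, not a statistic about any real attribute.

```python
from math import log2

def surprisal(p):
    """Bits gained on learning a fact that is true of a fraction p of people."""
    return -log2(p)

def binary_entropy(p):
    """Expected bits from a yes/no question answered 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))             # 1.0  (an even split is one full bit)
print(binary_entropy(1.0))             # 0.0  (the 4-versus-0 split tells you nothing)
print(surprisal(0.25))                 # 2.0  (a fact true of 1 in 4 people)
print(round(binary_entropy(0.01), 2))  # 0.08 (a question almost everyone answers the same way)
```

A yes/no question is worth a whole bit only when it splits the candidates evenly; the more lopsided the split, the smaller the fraction of a bit it yields on average.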
Birthday, gender and zip code are enough to identify someone uniquely approximately 85% of the time.
And a quickly googled source but the meme is older than that: http://godplaysdice.blogspot.com/2009/12/uniquely-identifyin...
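A back-of-the-envelope count suggests why those three fields go so far. This is only a sketch: the category counts are round numbers I picked for illustration, not census figures, and it assumes the attributes are independent and evenly spread, which real data is not.

```python
from math import log2

# Round illustrative counts (assumptions, not census figures):
BIRTHDATES = 365 * 80    # distinct dates of birth across roughly 80 birth years
GENDERS = 2
ZIP_CODES = 40_000       # order-of-magnitude count of US ZIP codes

def max_bits(*category_counts):
    """Upper bound on bits from independent, uniformly distributed attributes."""
    return sum(log2(count) for count in category_counts)

available = max_bits(BIRTHDATES, GENDERS, ZIP_CODES)
needed = 28              # the thread's figure: a little over 2^28 US residents

print(available > needed)
```

Under these assumptions the three fields carry a few more bits than are needed to single out one US resident. Real birthdates and zip codes are unevenly distributed, so the effective number of bits is lower, which is consistent with the match rate being high but not 100%.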
I think you should count the dead as well. But then, 33 bits ~= 8 billion, which should still be enough, I guess.