I think most languages these days are a bit smarter and avoid this beginner mistake (for various reasons).
The only way this is 'solved' is if some third party authority hands out top level names and refuses to register names that are similar to other names for some definition of similar. The number of levels between top level and package name is irrelevant.
There's another solution (the one Debian uses): auditing what the package itself does, so that you don't allow malicious code into the repository.
While attacking a single package would be possible, covering any interesting amount of "typo"-space would require registering huge amounts of namespaces.
If package manager developers are smart, the allocation of namespaces is also handled externally and associated with some cost (e.g. domain names).
Therefore these kinds of attacks become impractical.
That's something that can be flagged for manual review before it gets too far.
My hope is to be automating a large amount of the review in the next few months, however I think this is a good argument for never having it be fully automatic. Having a human sanity check submissions isn't a terrible idea if we can keep the workload down.
Certainly this doesn't prevent a malicious author from posting a legitimate package and then changing the contents to be malicious, but that can be somewhat solved by turning off automatic updates.
Thanks for keeping Package Control high quality, I know it's highly appreciated :-)
One step to mitigate things like this as well would be to have some sort of "crowd-sourcing" command in the package manager program... like "npm flag coffe-script" or something like that to alert repository maintainers of possible issues.
[1]: http://central.sonatype.org/pages/ossrh-guide.html [2]: http://central.sonatype.org/articles/2014/Feb/27/why-the-wai...
Perhaps you could make this safer by adding an automatic check for how much the package has changed since the last version? And at least warn the user when they want to update?
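A rough sketch of what such a check could look like, assuming the updater has both releases unpacked as a mapping from file path to text. The names `change_ratio` and `should_warn` and the 0.5 threshold are made up for illustration, not part of any real package manager:

```python
import difflib

def change_ratio(old_files, new_files):
    """Return the fraction of lines (0.0 to 1.0) that differ between two releases.

    Both arguments map a file path to that file's text content.
    """
    total = 0
    changed = 0
    for path in sorted(set(old_files) | set(new_files)):
        old_lines = old_files.get(path, "").splitlines()
        new_lines = new_files.get(path, "").splitlines()
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
        # Lines that appear unchanged, in order, in both versions of the file.
        matching = sum(block.size for block in matcher.get_matching_blocks())
        size = max(len(old_lines), len(new_lines))
        total += size
        changed += size - matching
    return changed / total if total else 0.0

def should_warn(old_files, new_files, threshold=0.5):
    """Hypothetical policy: warn when more than `threshold` of the lines changed."""
    return change_ratio(old_files, new_files) > threshold
```

A size-based check like this would only catch wholesale replacement of a package; a few malicious lines added to a large codebase would stay well under any sensible threshold.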
It does raise the barrier to entry, but it would prevent typosquatting and regular namesquatting.
EDIT: Does any major package manager provide a "did you mean" functionality, offering a list of actual package names similar to what you typed?
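For what it's worth, a basic version of this is cheap to build on the client side. A minimal sketch using Python's standard `difflib`, assuming the client has a list of known package names available; the function name `did_you_mean` and the 0.8 cutoff are my own choices:

```python
import difflib

def did_you_mean(name, known_packages, max_suggestions=3, cutoff=0.8):
    """Suggest existing package names that look like `name`.

    Returns an empty list when `name` is itself a known package.
    """
    if name in known_packages:
        return []
    return difflib.get_close_matches(
        name, list(known_packages), n=max_suggestions, cutoff=cutoff
    )

if __name__ == "__main__":
    print(did_you_mean("reqeusts", ["requests", "numpy", "flask"]))
```

This only helps when the mistyped name does not exist yet; once someone registers the typo, the client would also need popularity data to decide which of the two names to question.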
and then also have perfect memory of all packages and notice when a new package is too (for some value of "too") similarly named to an already existing one... even if e.g. both are correct dictionary words.
Which Debian has, because submitting a new package is a much more involved process than sudo apt-get publish.
Also, it doesn't point out that the bigger threat is that this is wormable.
The acknowledgements mention that 2 of the university advisers and a PyPI admin consented to the "notification program".
Still, people with good intentions have been prosecuted and convicted for less. I would be very concerned for this student.
That is a crime under the CFAA in the USA. Not sure what it is in Germany/EU.
Anyway, this is all part of why I always try to build inside a container, or at least in a virtualenv where I don't need to sudo the install.
>17000 computers were forced to execute [unauthorized] arbitrary code
Certainly a crime in the US, not sure about Germany.
Nice execution though!
If I intentionally leave an infected USB drive on the ground, and someone picks it up and sticks it into their computer, am I liable?
Seems like it could go either way.
I used the Ruby code at the beginning of http://stackoverflow.com/questions/16323571/measure-the-dist... to calculate the distance between the package names at page 60 of the thesis and their typos. The maximum is 2.
I checked some similar package names from a Gemfile.lock of a project of mine. Unfortunately the two gems hike and hirb are also at distance 2. Probably many short names are close with this metric.
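For anyone who wants to repeat the check without Ruby, here is a minimal Python sketch. It assumes the metric is plain Levenshtein distance (insertions, deletions and substitutions each cost 1); it is not the code from the linked answer:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]

if __name__ == "__main__":
    for left, right in [("hike", "hirb"), ("numpy", "lumpy"), ("bundle", "bundler")]:
        print(left, right, levenshtein(left, right))
```

Under this metric a swapped pair of adjacent letters counts as 2, which is one reason short, unrelated names end up as close as real typos.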
A combination of the two approaches could be ok: knowing that a name was blacklisted should be an indicator that it's not a good name, regardless of its distance from any other name, plus an approval of the maintainers for distance 2.
But a blacklist could generate another type of squatting, with people trying to pre-blacklist perfectly legit names. Only one thing is sure: there is more work to do for the maintainers and this extra friction is not good.
Edit: the distance suffers from the same problem.
I see what you did.
Possible explanations:
* Perhaps many of those are automated build systems, which would also explain the high number of systems with admin access (for example, if you use travis without docker, every build runs in a clean vm with admin access).
* People download one package and install it multiple times? Seems unlikely.
Any other ideas?
sudo pip install lumpy (instead of numpy)
Ran it again because it 'didn't work'
I think that this clearly falls under the heading 'naming issue.' People know what they want, but do not enter it properly.
I can't think of a 100% solution off-hand, which isn't surprising, because it's a hard problem.
pmontra's suggestion to use typo blacklisting ain't a bad idea. Maybe some sort of reputation-per-name could help?
I wonder if you could do something similar here - enter the name of the package and a code of some sort. I haven't thought this through in a lot of detail.
That doesn't work with arbitrary names because they are, well, arbitrary.
Maintainer/PackageName
It solves so many problems, this included.
It's pretty mind-blowing how big of a blind spot package installers are. I guess running everything inside e.g. a Docker container/VM would be a partial interim solution for the paranoid?
It's a bit better - there is only one possible source of compromise rather than everyone on the network path. Given that npm/pip likely keep archives of all packages uploaded, it would be much harder (perhaps impossible) to attack someone secretly this way, at least in the long term.
Good package managers require signing of uploads (e.g. maven central requires every package to have a GPG signature; Debian goes further, and requires your key to be signed by an existing member of the organization). If the client checks the signatures you end up with a system that's perhaps actually secure.
A signed package doesn't really tell you that much. In the best case scenario it tells you the package you're installing in fact came from developer X and contains code Y (which you kinda already know since you have the source code). This works as long as you know and trust developer X, or did your due diligence reading through the code (which you can already do today).
I can't think of an end solution that wouldn't have to rely on network effects and social proof, which strikes me as rather fragile. Maybe formal verification and AI can help, but that's a long way off (?)
I'm curious to hear your opinion about a combination of digital signing with e.g. keybase/blockchain + reputation system, a sandboxed development environment (mitigates the "short con" risk) and a sandboxed production environment, with the minimum set of permissions required to operate (as well as auditing of course).
Call me pessimistic but I don't see developers taking on the extra friction given the status quo. Though a major data breach or two might change things, as I'm sure we'll find out sooner or later.
That way authors can continue to use any name they want, and the emphasis is on letting installers know that they might be installing the wrong package.
That'll be fun to automate around in puppet or ansible.
Now that there's a strategy for finding fakers: 1) You have an attacker-defender arms race. The attacker will always be one step ahead of the defender. 2) You have the extra burden of keeping up in this race, otherwise your security feature is a facade. At best, this is useless. At worst, it lulls your users into a false sense of security.
https://rubygems.org/gems/bundle Total downloads 1,800,600
Source (empty) at https://github.com/will/bundle and interesting README.
https://rubygems.org/gems/bundler Total downloads 92,116,090
It's almost 2%.
In Python, "pytables" (should be "tables") and "skimage" (should be "scikit-image") come to mind.
I wonder what kind of steps we can take to prevent this risk.
But even this just sweeps the problem under the carpet. You could still, for example, have a requests package which just installs the request package and works as expected, but sends the request/response to your own server from time to time, e.g. only when HTTP basic auth is used.
You can also make this the default, with npm config set ignore-scripts true (and then --ignore-scripts false at install time if you wish to run them).
Consider the following:
requests - a Python package for making HTTP requests.
requestr - a Python package for a fictional startup that allows you to send requests to your nearest and dearest.
Given they both could be typos of each other:
1) How do we determine which one to use? What if someone accidentally also tries "requestd", somewhere between the two?
2) How do we apply the principle of least surprise - I asked to install requests, and everything installed just fine, but now I can't import it?!
$ pip install requestr
Package "requestr": did you mean "requests"? [Y/n]
(reason for this warning: similar spelling and requests is much more popular)
Pass --no-spell-warnings to disable this feature.
http://incolumitas.com/data/thesis.pdf section 5 "Practical implications". Just wanted to point out that in case you skipped it, it's worth a read: some interesting proposals there that are worth discussing with package manager maintainers.
I particularly like the preemptive approach of auto-blacklisting common typos by simply monitoring the number of times a specific nonexistent package is requested over time (5.10). So if a lot of people regularly attempt to install the nonexistent package "reqeusts", it could signal that it's a common typo and should be blacklisted to prevent malicious use in the future. False positives could always be sorted out manually by communicating with the package manager maintainers.
- The package name is something a lot of people regularly attempt to install, but it doesn't exist (per above)
- The package name is 1-2 chars off from the name of another package which has more than X downloads
- The package is frequently installed then uninstalled in a short time
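The three signals above could be combined into a simple review filter on the repository side. This is only a sketch: the `PackageStats` fields, the thresholds and the idea that the index tracks failed requests and quick uninstalls are all assumptions, and `popular_names` stands in for "packages with more than X downloads":

```python
from dataclasses import dataclass

@dataclass
class PackageStats:
    name: str
    failed_requests_before_upload: int  # install attempts while the name did not exist
    installs: int
    quick_uninstalls: int               # removed again shortly after being installed

def edit_distance(a, b):
    """Levenshtein distance, used here as the "1-2 chars off" measure."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def review_reasons(pkg, popular_names, min_failed=100, max_distance=2, min_quick_ratio=0.5):
    """Return human-readable reasons to send a package to manual review."""
    reasons = []
    if pkg.failed_requests_before_upload >= min_failed:
        reasons.append("name was often requested before it existed")
    close = sorted(n for n in popular_names
                   if n != pkg.name and edit_distance(pkg.name, n) <= max_distance)
    if close:
        reasons.append("name is close to popular package(s): " + ", ".join(close))
    if pkg.installs and pkg.quick_uninstalls / pkg.installs >= min_quick_ratio:
        reasons.append("frequently uninstalled shortly after install")
    return reasons
```

Returning reasons rather than a yes/no verdict keeps a human in the loop: the filter decides what gets looked at, not what gets rejected.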
The two solutions here are user-local packages (pip install --user, for example) and virtual environments.
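If you want to enforce that in your own tooling, a small guard at the top of a build or bootstrap script can refuse to install anything system-wide as root. A minimal sketch; the function names are made up, and it only detects environments created by the standard `venv` module or a recent virtualenv:

```python
import os
import sys

def in_virtualenv():
    """True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment and
    # sys.base_prefix at the interpreter it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

def running_as_root():
    """True on POSIX when the effective user is root; False where geteuid is missing."""
    geteuid = getattr(os, "geteuid", None)
    return geteuid is not None and geteuid() == 0

def install_is_contained():
    """Hypothetical policy: allow installs inside a venv or as an unprivileged user."""
    return in_virtualenv() or not running_as_root()

if __name__ == "__main__":
    if not install_is_contained():
        sys.exit("refusing to install packages system-wide as root")
```

This limits the damage to one user account or one environment; the installed code still runs with whatever access that user has.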
One's dev environment should be a place where remote code execution is a high probability and we need better tools to partition that from high value data.