Quoting the docstring on the `track_module` function:
"""This function executes the tracking of a single module by launching a
subprocess to execute this module against the target module. The
implementation of this tracking resides in the __main__ in order to
carefully control the import ecosystem.
Source: https://github.com/IBM/import-tracker/blob/67a1e84e5a609e52e...
Here's the actual subprocess call: https://github.com/IBM/import-tracker/blob/67a1e84e5a609e52e...
# Launch the process
proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, env=env)
I think this is clever, and maybe even necessary, but it feels risky to do on unaudited third-party Python libraries.
Maybe I'm misunderstanding something?
This is why my coworker built the project he called "dowsing"; it tries to understand as much as possible from the setup.py's AST, without actually executing it.
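To illustrate the general idea of reading setup.py without running it, here is a minimal sketch using the standard `ast` module. It is not dowsing's API or implementation; the function name `static_install_requires` is made up for this example, and it only handles the easy case where `install_requires` is a literal list or tuple passed directly to `setup()`.

```python
import ast


def static_install_requires(setup_py_source):
    """Return install_requires from a setup() call without executing the file.

    Returns None when no setup() call is found or when the value is computed
    dynamically (hypothetical helper, for illustration only).
    """
    tree = ast.parse(setup_py_source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both `setup(...)` and `setuptools.setup(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "setup":
            continue
        for kw in node.keywords:
            if kw.arg != "install_requires":
                continue
            try:
                # literal_eval only accepts literals, so nothing is executed.
                value = ast.literal_eval(kw.value)
            except (ValueError, TypeError):
                return None  # value is built at runtime; give up statically
            return list(value) if isinstance(value, (list, tuple)) else None
    return None
```

Anything built at runtime (reading requirements.txt, conditionals on the platform, and so on) comes back as `None` here, which is exactly where a real static tool has to work much harder.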
For context on why I'm using the subprocess here, this allows the tracking to correctly allocate dependencies that are imported more than once (think my_lib.submod1 and my_lib.submod2 both need tensorflow, but my_lib.submod3 doesn't).
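A rough sketch of why a fresh process per module helps: within one interpreter, a module that is already in `sys.modules` is not imported again, so the second submodule that needs it would look dependency-free. The code below is not import_tracker's implementation; `imports_in_fresh_interpreter` and the child script are assumptions made for illustration, and it simply diffs `sys.modules` before and after the import.

```python
import subprocess
import sys

# Script run in the child interpreter. It avoids importing anything beyond
# `sys` so that the child's own imports do not pollute the diff.
_CHILD = """
import sys
before = set(sys.modules)
__import__(sys.argv[1])
new = set(sys.modules) - before
print("\\n".join(sorted({name.partition(".")[0] for name in new})))
"""


def imports_in_fresh_interpreter(module_name):
    """Return top-level module names newly loaded by importing module_name.

    Each call starts a clean interpreter, so modules loaded while tracking
    an earlier module cannot hide this module's dependencies.
    """
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD, module_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return set(proc.stdout.split())
```

Note that this still executes the target module's import-time code in the child process, and anything the module prints on import would end up mixed into the parsed output.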
FYI, pylint does something similar for native-code extension modules (unless this changed in the past few years): it imports them dynamically!
EDIT: reading the code more closely and reading the rest of the comments, more precisely, it's not the subprocess call itself, but rather importing an arbitrary Python module, which could be a path for code execution. But this is the case generally with Python: importing a module executes code, and so even just importing (not otherwise executing) an untrusted module could be problematic.
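A small demonstration of that point, as a sketch: the module below is never called, only imported, yet its top-level code runs and writes a file. The module name `untrusted_demo` and the marker file are made up for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_has_side_effect():
    """Import a throwaway module and report whether its top-level code ran."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "marker.txt"
        # The "untrusted" module: its only content is a top-level side effect.
        (Path(tmp) / "untrusted_demo.py").write_text(
            "from pathlib import Path\n"
            f"Path({str(marker)!r}).write_text('ran at import time')\n"
        )
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("untrusted_demo")  # no function is called
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("untrusted_demo", None)
        return marker.exists()
```

Calling `import_has_side_effect()` returns `True`: the import alone was enough to touch the filesystem.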
It's a very interesting use case to consider how a similar solution could work as a sandbox for investigating supply chain concerns with third-party libraries that have transitive dependencies. I think some of the static analysis tools referenced in other comments would address this better since the real concern there is detecting the presence of transitive dependencies which may be malicious as opposed to identifying exactly where in the target library those dependencies are used.
When I read the title I was hoping for something else though, what I would love is a tool that logs and potentially blocks unexpected IO operations on a library basis. With the increasingly common supply chain attacks we are seeing (there was a PyPI one just the other day), having a way to at least report on unexpected activity if not help prevent it would be brilliant. Has anyone ever found a tool like that?
(Obviously the ultimate solution would be an outbound firewall, but it seems that although you can easily do this in a VM or on bare metal, I haven't seen any PaaS platforms with that sort of capability.)
You could do something close to that with Python's audit hooks, which were introduced in 3.8[1]. One massive caveat: audit hooks can be disabled by an attacker with the ability to control the interpreter, and they are not perfect (there are plenty of things they don't cover).
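As a minimal sketch of the audit-hook approach: `sys.addaudithook` registers a callback that receives every audit event raised by the interpreter. The set of watched events and the `events` log below are choices made for this example; attributing an event to a specific library would need extra work (for instance inspecting the call stack), which is not shown.

```python
import os
import sys

# Event names raised by CPython for file opens, outbound connections and
# process launches. Which events to watch is a choice for this sketch.
WATCHED = {"open", "socket.connect", "subprocess.Popen"}

events = []


def audit(event, args):
    """Record watched events; raising an exception here would block them."""
    if event in WATCHED:
        events.append((event, args))


# Hooks cannot be removed once added, so register only in a process
# that is meant to be monitored.
sys.addaudithook(audit)

# Usage: any file open after registration is reported to the hook.
with open(os.devnull):
    pass
```

After the `open(os.devnull)` call, `events` contains an `"open"` entry. The caveats above still apply: this is reporting, not a sandbox.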
(More generally: this kind of auditing/restriction falls under the umbrella of "capability management." OpenBSD's pledge[2] is another example.)
python3 -m import_tracker --name datasette --recursive | jq
{
  "datasette": [
    "aiofiles",
    "click",
    "markupsafe",
    "mergedeep",
    "pluggy",
    "yaml"
  ],
  "datasette.version": [],
  "datasette.utils.shutil_backport": [
    "click",
    "markupsafe",
    "mergedeep",
    "yaml"
  ],
  "datasette.utils.sqlite": [
    "click",
    "markupsafe",
    "mergedeep",
    "yaml"
  ],
  "datasette.utils": [
    "click",
    "markupsafe",
    "mergedeep",
    "yaml"
  ],
  "datasette.utils.asgi": [
    "aiofiles",
    "click",
    "markupsafe",
    "mergedeep",
    "yaml"
  ],
  "datasette.hookspecs": [
    "aiofiles",
    "click",
    "markupsafe",
    "mergedeep",
    "pluggy",
    "yaml"
  ]
}
Related tool: pipdeptree - here's the output from that against a project that installs a lot of extra stuff: https://github.com/simonw/latest-datasette-with-all-plugins/...
FD: My company, my work.
> This data allows deterministic dependency resolution which is required by mach-nix[1] to generate reproducible python environments.
It accepts dependencies in requirements.txt format (e.g. Django==3.1 or tensorflow) https://pydepchecker.z33.web.core.windows.net/
It's got a few shortcomings. Dependency resolution in Python is pretty difficult to work out when you've got a lot of libraries with common dependencies. And the license info on PyPI isn't always correct. But it's always been a quick, useful tool for me.
The point was that pretty much no matter what starting point one used, you pulled in much the same amount. There was a common core, but even so it was like a starfish: if you start at the tip of one limb, you pull in that limb and the core. Start on another limb, same thing.
But all the limbs are about the same size.
It's just anecdata, but it has been at the back of my mind as some kind of rule.