Skip to content

Conversation

@stevengj
Copy link
Member

@stevengj stevengj commented Apr 2, 2018

This PR adds an optional sha="...." keyword argument to download (via libmbedtls), to make it easier for people to add SHA-256 validation to downloads. Furthermore, a deprecation warning is emitted for non-validated download calls during Pkg.build or Pkg3.build.

Fixes #26680. (Does not affect #17958, which is about the Julia build itself.)

I intentionally kept the SHA hashing functionality here minimal; a more full-featured SHA interface is exposed by packages like MbedTLS.jl (update: or SHA.jl).

(It's all well and good to say that people should use BinaryProvider or whatever, but I think it would be better to build security directly into the download function to make sure that it has a wider reach.)

Question: are there other contexts in which we should deprecate non-validated download calls? Module loading, for instance? Would be very easy to add this constraint to precompiling or Pkg.test, for example, if that is desired.

@stevengj stevengj added packages Package management and loading security System security concerns and vulnerabilities triage This should be discussed on a triage call labels Apr 2, 2018
@KristofferC
Copy link
Member

KristofferC commented Apr 2, 2018

to make it easier for people to add SHA-256 validation to downloads

Is

import SHA
bytes2hex(SHA.sha256(open(destination)))

really too difficult?

a more full-featured SHA interface can be exposed by packages like MbedTLS.jl.

Or the SHA.jl package that we have as a standard library?

Furthermore, a deprecation warning is emitted for non-validated download calls during Pkg.build or Pkg3.build

Should be in a different PR imo.

@stevengj
Copy link
Member Author

stevengj commented Apr 2, 2018

Whoops, I didn't notice that we have an SHA package in the stdlib! I can switch to using this if that is preferred. Though I don't know if it is possible for Base to depend on a stdlib package without moving most of the package into Base? Calling mbedtls is < 30 lines of code here, which is a lot more lightweight than moving all of SHA.jl into Base.

Doing the SHA as a separate step makes it impossible to enforce that download calls in a build context perform an SHA validation, which to me is a huge disadvantage compared to a keyword argument in download.

@KristofferC
Copy link
Member

But doing the SHA as a separate step makes it impossible to enforce that download calls in a build context perform an SHA validation

Sure, but I don't think we should be doing that.

@stevengj
Copy link
Member Author

stevengj commented Apr 2, 2018

Sure, but I don't think we should be doing that.

Why not? The Base.download function very inviting (and widely used) but creates security holes during build. Obviously, we can't prevent all security holes in user code, but isn't this low-hanging fruit?

@KristofferC
Copy link
Member

It seems that using something like BinDeps is far more common then raw download commands and the action then would be to open an issue at the BinDeps repo that they are not following best practices.

Anyway, I think having the build-thing as a separate PR would make it easier to focus on the implementation of the check here, and whether it should warn during build time in the other PR:

@stevengj
Copy link
Member Author

stevengj commented Apr 2, 2018

The Pkg.build integration is only one line of this PR, and is kind of the main point of having an sha keyword in the download function itself, so it doesn't make sense to me to separate them.

Updating BinDeps is a separate issue. It looks like there are about 3 dozen registered packages that use Base.download during build.

@stevengj
Copy link
Member Author

stevengj commented Apr 2, 2018

@KristofferC, you still haven't explained your objection to enforcing hash validation for Base.download during build. What is the downside here? Is it just that you don't like having 30 lines of mbedtls calls in Base?

@KristofferC
Copy link
Member

The Pkg.build integration is only one line of this PR, and is kind of the main point of having an sha keyword in the download function itself.

The number of lines doesn't really matter, it is the effect it has and the type of discussion it will produce, no?

If you feel this should be enforced for registered packages then adding that as a rule to METADATA seems more appropriate.

@stevengj
Copy link
Member Author

stevengj commented Apr 2, 2018

Okay, so what is your objection to "the effect it has and the type of discussion it will produce"? Why would it be better to enforce this only for registered packages, rather than everywhere? I feel like you keep giving meta-objections.

My reason for combining that change into this PR is precisely for the discussion — it is impossible to evaluate the merits of including a sha directly into download separately from the desire to enforce it during Pkg.build, since the latter is a primary motivation for the former.

@KristofferC
Copy link
Member

KristofferC commented Apr 2, 2018

Being able to conveniently check a hash in the download function is convenient in itself, but it is true that it is very easy already.
Having this split means that the PR for the package manager could be opened in the place where it is developed (https://github.com/JuliaLang/Pkg3.jl).

Why would it be better to enforce this only for registered packages, rather than everywhere?

Because maybe I don't care about the hash in my own package. Maybe I have my own registry of packages that only download from my local server. Maybe I want to use some other hash than SHA512. A security choice like this has to be adopted (and enforced) on the community level. Saying that the combination of exactly build-time and exactly the Base download function and exactly the SHA512 hash is somehow special is not true and this does not bring any extra security to users to pretend that it is. As you have already established, very few packages would actually be affected by the warning here. A rule in METADATA about what a package should do when downloading dependencies would be much more effective.

@nalimilan
Copy link
Member

Whoops, I didn't notice that we have an SHA package in the stdlib! I can switch to using this if that is preferred. Though I don't know if it is possible for Base to depend on a stdlib package without moving most of the package into Base? Calling mbedtls is < 30 lines of code here, which is a lot more lightweight than moving all of SHA.jl into Base.

We don't actually require MbedTLS currently, we support using libgit2 with the system's SSL library. See JuliaLang/MbedTLS.jl#123 (comment).

@stevengj
Copy link
Member Author

stevengj commented Apr 2, 2018

this does not bring any extra security to users to pretend that it is.

You are going too far. Requiring some cryptographically strong hash validation (in the context where binaries/source are most often downloaded) is more secure than requiring no hash (in any context).

Additional hash types can always be added later just by adding additional Base.shacheck methods, because the sha keyword is untyped. But yes, for now it would be picking one good hash as the built-in one. I don't see a problem with having a good default that can be overridden.

As you have already established, very few packages would actually be affected by the warning here.

I don't know if three dozen is "very few". Also, the next step is to deprecate BinDeps.download_cmd to recommend Base.download, which would expose many more people to the warning.

@KristofferC
Copy link
Member

Requiring some cryptographically strong hash validation is more secure than requiring no hash.

Yes, and that is done by making that a rule for the community that provides those packages, e.g. METADATA or your own company's registry (which might have even stronger security requirements).
Without a community rule, this does absolutely zero for security since it is trivial to circumvent.

Also, the next step is to deprecate BinDeps.download_cmd to recommend Base.download

The next step is to move to BinaryBuilder.jl (which does do hash checking).

@stevengj
Copy link
Member Author

stevengj commented Apr 2, 2018

Without a community rule, this does absolutely zero for security since it is trivial to circumvent.

The goal here is not to prevent people from intentionally circumventing security measures. That is very very hard to stop, even with "community rules".

The goal is to provide automatic notification for a common security issue coupled with an easy fix (not "rewrite your package's build system from scratch"). It's not fair to describe this as doing "absolutely zero for security."

The next step is to move to BinaryBuilder.jl (which does do hash checking).

No.

Switching to BinaryBuilder may be a long-term goal, but for security concerns you need to have an easy upgrade path that requires minimal changes to packages. Otherwise it will be years (if ever) before some packages go to the trouble of upgrading.

@KristofferC
Copy link
Member

KristofferC commented Apr 2, 2018

If you want to convince the BinDeps maintainers that they should print a warning if they are downloading things without checking a hash and add a hash field then sure go ahead with that.

We should not be printing a warning like this from Base when we have no idea about the context download is called. BinDeps and BinaryProvider know that binary dependencies are being downloaded. download doesn't know anything. Pkg.build might call download in a way where a hash doesn't even make sense to provide.

We could make a bindep_download function in Pkg or something along those lines that would do the equivalent of what is done here, but download itself should be freely usable without getting warnings.

@stevengj
Copy link
Member Author

stevengj commented Apr 2, 2018

We should not be printing a warning like this from Base when we have no idea about the context download is called.

Being called from Pkg.build is not "no idea about the context".

Can you give any real examples of a build script that is calling download in a case where a hash is inappropriate?

@Petr-Hlavenka
Copy link
Contributor

In the context of JuliaPackaging/WinRPM.jl#144 I'd really like to see the most simple download_file_from_web_core or something similar that a user stuck on a silly restricted platform (aka Windows Server) could overload to overcome the weird rules enforced by his corporate network administrator. I'd like to see the Base.download function to do this. And I'd also like to use this very same function to download data (huge ~100GB scientific images) for processing without getting a deep warning or recalculating SHA on them. The safety should be the responsibility of "package providers" and not of a function for generic data transfer.

@stevengj
Copy link
Member Author

stevengj commented Apr 3, 2018

@phlavenk, why would it be an undue burden for WinRPM users to provide an SHA hash to check, even it happens to be on Windows Server?

Why would you be downloading huge scientific images for processing during Pkg.build? (Usage of download in a normal runtime context would not require an SHA in this PR.)

@StefanKarpinski
Copy link
Member

I was originally in favor of this but I've changed my mind. It's fairly arbitrary to support just one hash function for this and it feels like an ad hoc attempt to fix a larger problem. Instead, I feel that we should move all package and artifact delivery to a solution where we serve them ourselves from servers that we control that support content-addressed URLs, e.g.

https://get.julialang.org/package/7525ab40-00ff-43e9-965b-b3cd4310bcf6.git
https://get.julialang.org/package/7525ab40-00ff-43e9-965b-b3cd4310bcf6/sha256/9ae7d86b8eac797ed4f26fafb2359c4834a2e375cfc1cfe10d30c10cf5371c37.tgz
https://get.julialang.org/artifact/sha256/f1ef74c864c4e4fa15291f2c37027b861b7b7abdaf7bd244cc58166bcb28fb37.tgz

There are many benefits to this:

  1. Inherent verifiability. The correct hash of the content is baked into the very way you acquire it, so the mechanism for getting these is inherently verifiable.

  2. Autonomy and control. We know and control all the ways content is served. We can be sure that every artifact can be gotten in a consistent set of ways, e.g. HTTPS always works and we can continue to support protocols that versions of Julia we know are out there still need (c.f. the GitHub TLS business). Since the content is hash-verifiable, HTTPS is defense in depth anyway and we can support older protocols for longer without risk.

  3. Complete consistency. GitHub supports serving archives for releases as tarballs, but there are packages that are not developed on GitHub, and we need to handle those specially and differently. If instead of relying on GitHub for this, we serve the tarballs ourselves for every package that is registered, then we know that they will be available for every registered package, regardless of how it is developed or hosted.

  4. Stability. We can continue to serve all packages and artifacts that have ever been published regardless of whether the original author decides to take them down. This protects us from situations like the NPM left-pad fiasco.

  5. Statistics and tracking. We would know, as a community, what packages people are using and on what platforms. We can anonymize this data and make it available to analyze.

Having one particular hashing mechanism built into the download function seems not to fit in with this overall vision and seems likely to just get in the way. It seems better to just change the download in the original issue to use HTTPS until we have a more complete picture for solving this in general.

@Petr-Hlavenka
Copy link
Contributor

@stevengj My point was only to keep Base.download function as trivial as possible for the case one needs to redefine it for some edge case. It allows you to e.g. replace by a function that does ssh command to a different machine that will do the download and place the file to a shared storage and then copy it to the final destination (a hack I needed to do some months ago on a corporate network-connected windows server).
In this usecase the added SHA verification that the new function has to implement is the burden.

With the case of huge file download - I missed the idea and thought it is mandatory to always calculate SHA. Opt-in in this PR is the good behavior.

@stevengj
Copy link
Member Author

stevengj commented Apr 3, 2018

@StefanKarpinski, it still seems like you'd want to verify the SHA after download, for end-to-end security.

@StefanKarpinski
Copy link
Member

it still seems like you'd want to verify the SHA after download, for end-to-end security.

Yes, of course. But that would be part of the mechanism for acquiring a content-addressed artifact, not for downloading an arbitrary URL. Also, baking in support for one hashing function seems weird.

@stevengj
Copy link
Member Author

stevengj commented Apr 4, 2018

Also, baking in support for one hashing function seems weird.

It sounds better if you spell it "providing a reasonable default" rather than "baking in".

We can easily design the API to make it more hash-agnostic. e.g. would you be happier with download(url, hash=SHA256("..."))?

@StefanKarpinski
Copy link
Member

I think we should have sha{1,256,512}check(filename, hash) functions exported from SHA and have this as a recommended step right after downloading in build scripts. Baking one particular verification mechanism into the download function feels weird and messy. The mutable global state here to try to coerce people into doing this also doesn't seem right. In the future, all downloads should go through a secure server that we control and that we know how to verify content from.

@StefanKarpinski
Copy link
Member

Triage consensus is against this.

@StefanKarpinski StefanKarpinski removed the triage This should be discussed on a triage call label Apr 5, 2018
@stevengj
Copy link
Member Author

stevengj commented Apr 5, 2018

Okay, will close.

Without a deprecation, though, probably it will take a long time for the existing download calls to be updated to use a hash and/or a secure server, unless there is some femtocleaner-like automatic-update tool.

@stevengj stevengj closed this Apr 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

packages Package management and loading security System security concerns and vulnerabilities

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants