It turns out that this technology already exists in a much better form. It's called cache. The problem is that almost everyone hosts their own version of jQuery. If everyone simply linked the "canonical" version of jQuery (the CDN link is right on their site) then requiring jQuery will be effectively free because it will be in everyone's cache.
Also the cache is supported by all browsers with an elegant fallback. Instead of having to manually having to check if your user's browser has the resource you want preloaded you just like the URL and the best option will automatically be used.
TL;DR Rather then turning this into a political issue stop bundling resources, modern protocols and intelligent parallel loading allow using the cache to stove this problem.
<script src="jQuery-1.12.2.min.js" authoritative-cache-provider="https://ajax.googleapis.com/ajax/libs/jquery/1.12.2/jquery.m... sha-256="31be012d5df7152ae6495decff603040b3cfb949f1d5cf0bf5498e9fc117d546"></script>
Would this cause more problems than it would solve? I'm assuming disk access is faster than network access.
I'm concerned about people like me who use noscript selectively. How easy is it to create a malicious file that matches the checksum of a known file?
I'd say not easy at all, practically impossible.
SHA-256? Very, very, very, very hard. I don't believe there are any known attacks for collisions for SHA-256.
Of course this would need a proposal or something but it would be interesting to consider.
http://plan9.bell-labs.com/sys/doc/venti/venti.html
Also available on *nix
As others have pointed out, it's quite difficult. But here's another way to think about it: if hash collisions become easy in popular libraries, the whole internet will be broken and nobody will be thinking about this particular exploit.
Servers won't be able to reliably update. Keys won't be able to be checked against fingerprints. Trivial hash collisions will be chaos. Fortunately, we seem to have hit a stride of fairly sound hash methods in terms of collision freedom.
<script src="jQuery-1.12.2.min.js" sha-256="31be012d5df7152ae6495decff603040b3cfb949f1d5cf0bf5498e9fc117d546"></script>
? If you wanted to explicitly fetch from google if the client doesn't have a cached copy, then instead do <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.2/jquery.min.js" sha-256="31be012d5df7152ae6495decff603040b3cfb949f1d5cf0bf5498e9fc117d546"></script>
The first would seem preferable though, as loading from an external source would expose the user to cross-site tracking.You're right in that the first one you had with just the sha-256 would be pretty much equivalent as what I had especially given that hn readers have resoundingly given support to the idea that it is non-trivial to create a malicious file with the same hash as our script file. I was simply trying to be cautious and retain some control for the web application (even if the extra sense of security is misplaced).
This is the use case I'm trying to protect by adding a new "canonical" reference that the web application decides. As others in this thread have said, it is very unlikely that someone will be able to craft a malicious script with the same hash as what I already have. The reason I still stand by including both is firstly compatibility (I hope browsers can simply ignore the sha-256 hash and the authorized cache links if they don't know what to do with it).
As a noscript user, I do not want to trust y0l0swagg3r cdn (just giving an example, please forgive me if this is your company name). NoScript blocks everything other than a select whitelist. If the CDN happens to be blocked, my website should still continue to function loading the script from my server.
My motivation here was to allow perhaps even smaller companies to sort of pool their common files into their own cdn? <script src="jimaca.js" authoritative-cache-provider="https://cdn.jimacajs.example.com/v1/12/34/jimaca.js""></scri... I also want to avoid a situation where Microsoft can come to me and tell me that I can't name my js files microsoft.js or something. The chances of an accidental collision are apparently very close to zero so I agree with you that there is room for improvement. (:
This is definitely not an RFC or anything formal. I am just a student and in no position to actually effect any change or even make a formal proposal.
SHA + CDN url list (for whitelisting/reliability purposes - public/trusted, and then private for reliability) would be ideal.
You only gain performance if the browser already has a cached version of this specific version on this specific CDN. If you don't - you end up losing performance, because now an additional DNS lookup needs to be performed, and an additional TCP connection needs to be opened.
Here are a few reasons people choose to avoid CDNized versions of JS libraries. http://www.sitepoint.com/7-reasons-not-to-use-a-cdn/
This is a 6 year old post, but it raises some valid concerns: https://zoompf.com/blog/2010/01/should-you-use-javascript-li...
And really if you are using one of the major libraries and a major CDN (Google, JQuery, etc.) over time your users will end up having the stuff in the cache, either from you or from others having used the same library version and cdn.
I suppose someone has done a study on CDN spreading of libraries and CDNS among users, so that you could figure out what the chance is that a user coming to your site will have a specific library cached - there's this http://www.stevesouders.com/blog/2013/03/18/http-archive-jqu... but it is 3 years ago, really this information would need to be maintained at least annually to tell you what your top cdn would be for a library.
So you use the one that has both, but one is not canonical, which means more cache misses. That doesn't even count the fact that there are different versions of each library, each with it's own uses, and distribution, and the common CDN approach becomes far less valuable.
In the end, you're better off compositing micro-frameworks and building yourself. Though this takes effort... React + Redux with max compression in a simple webpack project seems to take about 65K for me, before actually adding much to the project. Which isn't bad at all... if I can keep the rest of the project under 250K, that's less than the CSS + webfonts. It's still half a mb though... just the same, it's way better than a lot of sites manage, even with CDNs
Since HTTPS needs an extra round trip to startup, it's now even more important to not CDN your libraries. The average bandwidth of a user is only going to go up, and their connection latency will remain the same.
If you are making a SaaS product that business want, using CDNs also make it hard to offer a enterprise on-site version as they want the software to have no external dependencies.
If the user making the request is in Australia, for example, and your web server is in the US, the user is going to be able to complete many round trip requests to the local CDN pop in Australia in the time it takes to make a single request to your server in the US.
Latency is one of the main reasons TO use a CDN. A CDN's entire business model depends on making sure they have reliable and low latency connections to end users. They peer with multiple providers in multiple regions, to make sure links aren't congested and requests are routed efficiently.
Unless you are going to run datacenters all around the world, you aren't going to beat a CDN in latency.
As for adoption, that is very much a chicken and egg problem.
DNS resolution time is a pretty significant impact for a lot of sites.
Also for my sites I have a fallback to a local copy of the script. This allows me to do completely local development and remain up if the public CDN goes down (or gets compromised). With small (usually) performance impact.
It's not, though. I ran this experiment when I tried to get Google Search to adopt JQuery (back in 2010). About 13% of visits (then) hit Google with a clean cache. This is Google Search, which at the time was the most visited website in the world, and it was using the Google CDN version of JQuery, which at the time was what the JQuery homepage recommended.
The situation is likely worse now, with the rise of mobile. When I did some testing on mobile browsing performance in early 2014, there were some instances where two pageviews was enough to make a page fall out of cache.
I'd encourage you to go to chrome://view-http-cache/ and take a look at what's actually in your cache. Mine has about 18 hours worth of pages. The vast majority is filled up with ad-tracking garbage and Facebook videos. It also doesn't help that every Wordpress blog has its own copy of JQuery (WordPress is a significant fraction of the web), or for that matter that DoubleClick has a cache-busting parameter on all their JS so they can include the referer. There's sort of a cache-poisoning effect where every site that chooses not to use a CDN for JQuery etc. makes the CDN less effective for sites that do choose to.
[On a side note, when I look at my cache entries I just wanna say "Doubleclick: Breaking the web since 2000". It was DoubleClick that finally got/forced me to switch from Netscape to Internet Explorer, because they served broken Javascript in an ad that hung Netscape for about 40% of the web. Grrrr....]
There is the problem then, and the solution? I for one don't make bloated sites nilly-willy, I suck at what I do but at least I do love to fiddle and tweak for the sake of it, not because anyone else might even notice; and I like that in websites and prefer to visit those, too. Clean, no-BS, no hype "actual websites". So I'd be rather annoyed if my browser brought some more stuff I don't need along just because the web is now a marketing machine and people need to deploy their oh so important landing pages with stock photos and stock text and stock product in 3 seconds. It was fine before that, and I think a web with hardly any money to be made in it would still work fine, it would still develop. The main difference is that it would be mostly developed by people who you'd have to pay to stay away, instead of the other way around. I genuinely feel we're cheating ourselves out of the information age we could have, that is, one with informed humans.
On top of that, while everyone uses jquery, everyone uses different version of it (say, 1.5.1, 1.5.2, ... hundreds of different versions in total probably).
So create one massive target that needs to be breached to access massive numbers of websites around the world?
Imagine if every Windows PC ran code from a single web page on startup every time they started up. Now imagine if anything could be put in that code and it would be ran. How big of a target would that be?
While there are cases where the performance is worth using a CDN, there are plenty of reasons to not want to run foreign code.
(Now maybe we could add some security, like generating a hash of the code on the CDN and matching it with a value provided by the website and only running the code if the hashes matched. But there are still business risks even with that.)
See https://developer.mozilla.org/en-US/docs/Web/Security/Subres... though it isn't universally supported yet.
Any site that expects users to trust it with sensitive data should not be pulling in any off-site JavaScript.
As for checksumming, browser vendors don't need to pre-load popular JavaScript libraries (though they might choose to do so, especially if they already ship those libraries for use by their extensions and UI). But checksum-based caching means the cost still only gets paid by the first site, after which the browser has the file cached no matter who references it.
- jquery.com becomes a central point of failure and attack;
- jquery gets to be the biggest tracker of all time;
- cache does not stay forever. Actually, with pages taking 3Mo everytime they load, after 100 click (not much), I invalidated 300 Mo of cache. If firefox allow 2go of cache (it's lot for only one app), then an the end of the day, all my cache has been busted.There's no reason to hit the webserver with an If-modified when the libraries already include their version in the path.