It's still fun that it's there, but it's not as big a deal as it sounds from the tweet.
The site-specific modules in youtube-dl take care to extract only the bare minimum necessary to solve whatever challenge is posed.
But then this could be turned into a command-line browser that interprets a whole web page and saves the resulting HTML structure, instead of the raw source as curl/wget would.
I bet someone's already started a YouTube downloader that uses a headless browser
But now you have another problem. Your script goes from being a small, simple, self-contained, elegant gem to requiring a full browser, specialized drivers, and/or daemons running just to work. And if you're using something like Python, you frankly just don't have very good packaging, so it's hard to string all of that together into a solution that magically works for everyone. What youtube-dl has done is good engineering. Even though it's not a full JS interpreter, they've kept their software lean, self-contained, and easy to use.
You probably have to emulate some of the DOM, but you can interact directly with whatever obfuscated/packed scripts in a more lightweight and secure way than driving an entire browser.
Do you guys use an extension to process it or something?
(Same issue with Reddit of course)
The currently obfuscated JavaScript media players will try to break yt-dlp by leveraging the complexity and size of those scripted web engines. That will put them out of reach of small teams or individuals, and even "better", it will force people to use the Apple or Google web engine, killing any attempt to provide a real alternative.
A standalone JavaScript interpreter is actually some work, but it seems to stay in the "reasonable" realm: look at QuickJS from M. Bellard and friends (the guy who created QEMU, FFmpeg, TinyCC, etc.): plain and simple C (no need for a C++ compiler), doing the job more than well enough.
That's why noscript/basic (X)HTML is so important.
> M. Bellard and friends
Choose one: that dude is a wizard, wielding C like a brain surgeon wields a scalpel.
I also agree with the idea that these sites will probably be able to/want to create JS that breaks these small/lightweight engines requiring constant work :-/
This final point I disagree with entirely. You can't point to Bellard doing something as evidence that it's reasonable. This is a guy who wrote a program that generated a TV signal via a VGA card. :D
Is the M key next to the F key on your particular keyboard by chance? Because I've always called him "Fabrice."
xhtml has been dead for a decade
E.g. it's got a hard-coded list of methods for String, and it doesn't respect prototypes. It only supports creating Date instances, and won't work if you override the global Date. It parses with regexes and implements all operators with Python's operator module (which has the wrong type semantics), etc. Nearly none of the semantics of JS are implemented.
It's sort of the sandwich categorization problem:
If I write a C# "interpreter" in Perl that's only 200 lines and just handles string.Join, string.Concat, and Console.WriteLine, and it doesn't actually try to implement C# syntax or semantics at all and just uses Perl semantics for those operations, is it actually C#? :P
I say "not a sandwich".
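The operator-module point above is easy to see concretely. Here's a minimal sketch (the js_add helper is hypothetical, not youtube-dl's code): Python's operator.add follows Python's type rules, while JS's + coerces mixed operands to strings, so a faithful interpreter has to implement the coercion itself.

```python
import operator

# In JS, + on a string and a number coerces: "1" + 1 === "11".
# Python's operator.add applies Python's semantics instead, and raises:
try:
    operator.add("1", 1)
    coerced = True
except TypeError:
    coerced = False

def js_add(a, b):
    """Toy sketch of JS '+' semantics (assumption: str/number operands only)."""
    if isinstance(a, str) or isinstance(b, str):
        return str(a) + str(b)
    return a + b

print(coerced)         # False: operator.add raised TypeError
print(js_add("1", 1))  # "11", matching JS
```

Multiply that mismatch across every operator and builtin and you get the "nearly none of the semantics" situation described above.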
Why do I need a full XML parser when I can just extract what I need with regex?
And:
All that RPC IDL stuff is overcomplicated, REST is so much easier because I can just write the client by hand.
Submitted title was "YouTube-dl has a JavaScript interpreter written in 870 lines of Python".
The amount of high-engagement, just plain wrong tweets there is just sad.
Edit: You misunderstood baobabKoodaa in the same way. Nobody is arguing about what constitutes an interpreter, except you. The question is what language is being interpreted.
Before accusing someone of pedantry, it would first be good not to completely misread them.
Interesting to see the diff between the two: https://www.diffchecker.com/8EJGN27K
https://github.com/kristopolous/tube-get
It too deals with this problem but does so in a way that'd be easy to maliciously sabotage
Look right about here https://github.com/kristopolous/tube-get/blob/master/tube-ge...
As to why this program exists: it was originally written between about 2010 and 2015, so it technically predates the yt-* ecosystem.
The tool still works fine, and it's not a strict subset of yt-dlp or youtube-dl because it takes a different approach. Although its overall site coverage is smaller, I've used it as a "second try" system when yt-* fails, and it succeeds maybe about half the time.
PS: I found it quite easy to contribute to yt-dlp and the reviewers are ultra-helpful and kind, you might want to migrate all of your extractors there.
2. They're fundamentally not compatible approaches. This is worthless to them
It is technically wrong - it isn't a sufficiently rich and powerful approach to handle all JS (HTML) that you might throw at it. It'll work for a while until it eventually barfs when you least expect it.
EXCEPT that if the inputs you are giving it come from some understood source(s) that aren't likely to change, then a simpler approach than the "all singing, all dancing" correct one may be appropriate and justified, e.g. because it might be easier to write, easier to maintain, and/or have less attack surface.
Does that apply to YouTube? Or any of the other hundreds of supported sites?
It's definitely neat, but not especially useful outside of the confines of its current application, and the security concerns of such a tiny subset will be minimal.
It's even very sensitive to white space.
yt-dlp also seems to support running JavaScript in a full JavaScript interpreter/headless browser called PhantomJS. Running JavaScript in a full interpreter like this is a lot scarier from a security standpoint. I'm not sure whether PhantomJS sandboxes the JavaScript evaluation from the rest of the system, and if it does, whether the sandbox actually works properly at all. It looks like the project is no longer maintained, which is another bad sign.
Big projects with lots of manpower behind them such as chromium have trouble keeping javascript evaluation safe, so I would really suggest not trusting phantomjs on untrusted input.
This isn't something YouTube particularly enjoys. They would rather you keep coming back -- every visit is more ad revenue for them. If you have an offline copy, you don't need to visit YouTube anymore.
YouTube has an incentive, therefore, to make it more difficult to download (or "scrape") their content.
I'm not particularly sure of the specific details, but apparently YouTube has added JavaScript (a programming language that executes in the browser) as a hurdle to jump over. A simple Python script doesn't have enough brains to execute JavaScript, only enough to realize that it exists. (Clearly, youtube-dl is sophisticated enough to have jumped over it.)
These are the conclusions I come to, having written software for about a decade.
1) Once you give information to someone, be it text, pictures, sound, or video -- they will do whatever they want with it, and you have no control. Oh, yes -- it may be illegal. Maybe unethical. But the fact of the matter is you do not have control over information once it leaves your hands.
2) Adding hurdles to make it harder to access the information does little to stop someone who is dedicated to accessing it.
3) Implementing a subset of JavaScript in such an elegant and tiny manner is quite impressive.
How you interpret these facts depends on your worldviews. If you are a media and content creator, you will view these facts differently than a politician, and a teenager.
As an engineer and amateur philosopher, I certainly support the rights of content creators to be paid for their work. And yet, I fear that more and more, content creators want to lease me the right to listen to their music, instead of selling me a copy of it.
I used to own CDs, DVDs, movies, and books. What happens if Amazon or YouTube decides to not serve me anymore? Anything I've "purchased" from them, I lose access to.
Furthermore, if I create a song, I used to be able to burn copies onto CDs and distribute them on street corners. Now you have to sign up to stream on Spotify. This is a double-edged sword: I get a wide audience, but Spotify will do whatever they want with me.
This troubles me.
Usually in a virtual machine.
The browser is client-facing and everything there is possible to reverse engineer and figure out. So if you design a web-based application, and are depending on client-side Javascript for any security or distribution enforcement, it can be helpful, but can ultimately be unwound and cracked even if obfuscated, etc.
> Be impressed at what was achieved here?
Yes. Try to download a YouTube video without it, or without an online service that is probably using it internally.
This video goes into some of the design and tradeoffs: https://www.youtube.com/watch?v=Jc_L6UffFOs
TL;DW: they optimized for fast creation/destruction of low-footprint VMs with no JIT or garbage collection.
Some of the stuff is kind of questionable to me in the sense that I could believe you could probably make some kind of sufficiently wonky JS that this would do the "wrong" thing.
But it's super cool that they are able to do this as I think it shows that claims of JS complexity based on the size of JS engines is overlooking just how much of that size/complexity comes from the "make it fast" drive vs. what the language requires. Here you have a <1000LoC implementation of the core of the JS language, removed from things like regex engines, GCs, etc.
Mad props to them for even attempting it as well - it simply would not have ever occurred to me to say "let's just write a small JS engine" and I would have spent stupid amounts of time attempting to use JSC* from python instead.
[* JSC appears to be the only JS engine with a pure C API, and the API and ABI are stable, so on iOS/macOS at least you can just use the system one, which reduces binary size and build annoyance. The downside is that C is terrible, and C++ (differently terrible? :D) APIs make for much more pleasant interfaces to the VM: constructors and destructors mean you get automatic lifetime management, so handles to objects aren't miserable, and you can have templates that let your API provide handles with real type information. JSC only has JSValueRef and JSObjectRef, and as a JSObjectRef is a JSValueRef it's actually just a typedef to const JSValueRef :D OTOH I do think JSC's partially conservative GC is superior to Handles for stack/temporary variables for the most part, but then it's also absolutely necessary in order to have an API that isn't absolutely wretched. The real problem with JSC's API is that it has not gotten any love for many, many, many ... many years, so it doesn't have any way to handle or interact with many modern features without some kludgy wrappers where you push your API objects into JS and have the JS code wrap them up. The API objects are also super slow, as they basically get treated as "oh ffs" objects that obey no rules. I really do wish it would get updated to something more pleasant and really usable.]
I also assume you mean mainstream JS engines, but Duktape, JerryScript, and QuickJS all have C APIs.
They probably could have used e.g. https://github.com/PetterS/quickjs instead of the hacks in the file linked in the OP.
You are correct though that I was only thinking of the big engines - bias on my part alas.
Of your suggested alternate engines, JerryScript and QuickJS seem more complete than Duktape, but I can't quite work out JerryScript's GC strategy. Bellard says QuickJS has a cycle detector, but I'm generally dubious of those based on prior experience.
If I were shipping software that actually had to include a JS engine, and perf was not an issue, I would probably use JerryScript or QuickJS, as binary size would, I think, be the more critical consideration.
Or is their goal just to make youtube-dl not 100% reliable? Or to be able to say "look, you are running our code in a way we did not intend, you can't do this because you are breaking the EULA"?
https://github.com/yt-dlp/yt-dlp/issues/4635#issuecomment-12...
Earlier this year I enrolled in an online class called "Building a Programming Language" taught by Roberto Ierusalimschy (creator of Lua) and Gustavo Pezzi (creator of pikuma.com). We created a toy language interpreter/VM, and the final code was around 1,800 lines of Lua. Keeping things as simple (and sometimes naive) as possible was definitely the right choice for me to really wrap my head around the basic theory and connect the dots.
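In miniature, the same tokenize / parse / evaluate pipeline fits on a page. A toy arithmetic evaluator (a sketch in Python, not the course's Lua code) shows the shape of it:

```python
import re

# One token per match: an integer or a single operator/paren.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(src):
    pos, out = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad input at position {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out

def evaluate(src):
    """Recursive-descent evaluation with the usual precedence levels."""
    toks, i = tokenize(src), 0

    def peek():
        return toks[i] if i < len(toks) else None

    def expr():  # expr := term (('+'|'-') term)*
        nonlocal i
        v = term()
        while peek() in ("+", "-"):
            op = toks[i]; i += 1
            v = v + term() if op == "+" else v - term()
        return v

    def term():  # term := atom (('*'|'/') atom)*
        nonlocal i
        v = atom()
        while peek() in ("*", "/"):
            op = toks[i]; i += 1
            v = v * atom() if op == "*" else v // atom()  # integer division
        return v

    def atom():  # atom := number | '(' expr ')'
        nonlocal i
        t = toks[i]; i += 1
        if t == "(":
            v = expr()
            i += 1  # consume ')'
            return v
        return int(t)

    return expr()

print(evaluate("2+3*(4-1)"))  # 11
```

Everything past this point (statements, variables, functions, a VM) is elaboration of the same loop, which is why keeping the first version naive pays off.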
Thanks for the link.
> Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. [1]
And here we have a complicated Python program with a partial JS implementation in it.
Edit: it's also required to download music, otherwise it will just fail
Source:
- https://github.com/ytdl-org/youtube-dl/issues/29326#issuecom...
- https://github.com/ytdl-org/youtube-dl/blob/d619dd712f63aab1...
- https://github.com/ytdl-org/youtube-dl/commit/cf001636600430...
Overview of the control flow (already known):
The YouTube API provides you with n, your video access token.
If their new changes apply to your client (they do for "web"), then your client is expected to modify n based on internal logic. That logic lives inside player...base.js.
n is modified by a cryptic function
The modified n is sent back to the server as proof that you're an official client. If you send n unmodified, the server will eventually throttle you.
So they can always change the function to keep you on your toes; hence you need to be able to run semi-arbitrary JS in order to keep using the API. A waste of human brainpower, but I guess that energy is better spent imagining a world where Google isn't in charge than kvetching about what they're doing with their influence.
We have a custom-made Discord music bot on our server which uses ytdl to stream songs so we can listen together, and at one point we were listening and suddenly got some obscure JavaScript error.
We began joking that there's some bug in the code which breaks it after 6 PM, but later found out that Google had changed some of the obfuscated JS, and this basically broke the part of the code that fetched the song information.
It's kinda annoying if you have a lot of youtube tabs open for a long time and come back to them.
I believe YouTube limits your bitrate if you don't pass a specific calculated value; it's possible youtube-dl has to parse and eval JS to get it.
It's starting to become Widevine bullshit all over again.
Since the calculation of the response is done in JS, and they occasionally change the formula, some download programs are moving towards running the JS rather than trying to keep up with the changes.
It’s really just bullshit to make people’s lives harder.