It's still fun that it's there, but it's not as big a deal as it sounds from the tweet.
The site-specific modules in youtube-dl take care to extract only the bare minimum necessary to solve whatever challenge is posed.
But then this could be turned into a command-line browser that interprets a whole web page and saves the resulting HTML structure, instead of the raw source as curl/wget would.
I bet someone's already started a YouTube downloader that uses a headless browser
But now you have another problem. Your script goes from being a small, simple, self-contained, elegant gem to requiring a full browser, specialized drivers, and/or daemons running just to work. And if you're using something like Python, you frankly just don't have very good packaging, so it's hard to string all of that together into a solution that magically works for everyone. What youtube-dl has done is good engineering. Even though it's not a full JS interpreter, they've kept their software lean, self-contained, and easy to use.
You probably have to emulate some of the DOM, but you can interact directly with whatever obfuscated/packed scripts in a more lightweight and secure way than driving an entire browser.
Do you guys use an extension to process it or something?
(Same issue with Reddit of course)
The currently obfuscated JavaScript media players will try to break yt-dlp by leveraging the complexity and size of those scripted web engines. That will put them out of reach of small teams or individuals, and even "better", it will force people to use the Apple or Google web engine, killing any attempt to provide a real alternative.
A standalone JavaScript interpreter is actually some work, but it seems to stay in the "reasonable" realm: look at QuickJS from M. Bellard and friends (the guy who created QEMU, FFmpeg, TinyCC, etc.): plain and simple C (no need for a C++ compiler), doing the job more than well enough.
That's why noscript/basic (X)HTML is so important.
> M. Bellard and friends
Choose one: that dude is a wizard, wielding C like a brain surgeon wields a scalpel.
I also agree with the idea that these sites will probably be able to/want to create JS that breaks these small/lightweight engines requiring constant work :-/
This final point I disagree with entirely. You can't point to Bellard doing something as evidence that it's reasonable. This is a guy who wrote a program that generated a TV signal via a VGA card. :D
Is the M key next to the F key on your particular keyboard by chance? Because I've always called him "Fabrice."
xhtml has been dead for a decade
E.g. it's got a hard-coded list of methods for String, and it doesn't respect prototypes. It only supports creating Date instances, and won't work if you override the global Date. It parses with regexes and implements all operators with Python's operator module (which has the wrong type semantics), etc. Nearly none of the semantics of JS are implemented.
It's sort of the sandwich categorization problem:
If I write a C# "interpreter" in Perl that's only 200 lines and just handles string.Join, string.Concat, and Console.WriteLine, and it doesn't actually try to implement C# syntax or semantics at all and just uses Perl semantics for those operations, is it actually C#? :P
I say "not a sandwich".
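The operator-module point above is easy to see concretely. Here's a minimal sketch (the js_add helper is hypothetical, not youtube-dl's code): Python's operator.add follows Python's type rules, while JS's + coerces mixed operands to strings, so a faithful interpreter has to implement the coercion itself.

```python
import operator

# In JS, + on a string and a number coerces: "1" + 1 === "11".
# Python's operator.add applies Python's semantics instead, and raises:
try:
    operator.add("1", 1)
    coerced = True
except TypeError:
    coerced = False

def js_add(a, b):
    """Toy sketch of JS '+' semantics (assumption: str/number operands only)."""
    if isinstance(a, str) or isinstance(b, str):
        return str(a) + str(b)
    return a + b

print(coerced)         # False: operator.add raised TypeError
print(js_add("1", 1))  # "11", matching JS
```

Multiply that mismatch across every operator and builtin and you get the "nearly none of the semantics" situation described above.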
Why do I need a full XML parser when I can just extract what I need with regex?
And:
All that RPC IDL stuff is overcomplicated, REST is so much easier because I can just write the client by hand.
Submitted title was "YouTube-dl has a JavaScript interpreter written in 870 lines of Python".
The amount of high-engagement, just plain wrong tweets there is just sad.
Edit: You misunderstood baobabKoodaa in the same way. Nobody is arguing about what constitutes an interpreter, except you. The question is what language is being interpreted.
Before accusing someone of pedantry, it would first be good not to completely misread them.
Interesting to see the diff between the two: https://www.diffchecker.com/8EJGN27K
https://github.com/kristopolous/tube-get
It too deals with this problem but does so in a way that'd be easy to maliciously sabotage
Look right about here https://github.com/kristopolous/tube-get/blob/master/tube-ge...
As to why this program exists: it was originally written between about 2010 and 2015, so it technically predates the yt-* ecosystem.
The tool still works fine, and it's not a strict subset of yt-dlp or youtube-dl because it takes a different approach. Although its overall site coverage is smaller, I've used it as a "second try" system when yt-* fails, and it succeeds maybe about half the time.
PS: I found it quite easy to contribute to yt-dlp and the reviewers are ultra-helpful and kind, you might want to migrate all of your extractors there.
2. They're fundamentally not compatible approaches. This is worthless to them
It is technically wrong - it isn't a sufficiently rich and powerful approach to handle all JS (HTML) that you might throw at it. It'll work for a while until it eventually barfs when you least expect it.
EXCEPT that if the inputs you are giving it come from some understood source(s) that aren't likely to change, then a simpler approach than the "all singing, all dancing" correct one may be appropriate and justified, e.g. because it might be easier to write, easier to maintain, and/or have less attack surface.
Does that apply to YouTube? Or any of the other hundreds of supported sites?
It's definitely neat, but not especially useful outside of the confines of its current application, and the security concerns of such a tiny subset will be minimal.
It's even very sensitive to white space.
yt-dlp also seems to support running JavaScript in a full JavaScript interpreter/headless browser called PhantomJS. Running JavaScript in a full interpreter like this is a lot scarier from a security standpoint. I'm not sure whether PhantomJS sandboxes the JavaScript evaluation from the rest of the system, and if it does, whether the sandbox actually works properly at all. It looks like the project is no longer maintained, which is another bad sign.
Big projects with lots of manpower behind them such as chromium have trouble keeping javascript evaluation safe, so I would really suggest not trusting phantomjs on untrusted input.
This isn't something YouTube particularly enjoys. They would rather you keep coming back -- every visit is more ad revenue for them. If you have an offline copy, you don't need to visit YouTube anymore.
YouTube has an incentive, therefore, to make it more difficult to download (or "scrape") their content.
I'm not particularly sure of the specific details, but apparently YouTube has added JavaScript (a programming language that executes in the browser) as a hurdle to jump over. A simple Python script doesn't have enough brains to execute JavaScript, only enough to realize that it exists. (Clearly, youtube-dl is sophisticated enough to have jumped over it.)
These are the conclusions I come to, having written software for about a decade.
1) Once you give information to someone, be it text, pictures, sound, or video -- they will do whatever they want with it, and you have no control. Oh, yes -- it may be illegal. Maybe unethical. But the fact of the matter is you do not have control over information once it leaves your hands.
2) Adding hurdles to make it harder to access the information does little to stop someone who is dedicated to accessing it.
3) Implementing a subset of JavaScript in such an elegant and tiny manner is quite impressive.
How you interpret these facts depends on your worldviews. If you are a media and content creator, you will view these facts differently than a politician, and a teenager.
As an engineer and amateur philosopher, I certainly support the rights of content creators to be paid for their work. And yet, I fear that more and more, content creators want to lease me the right to listen to their music, instead of selling me a copy of it.
I used to own CDs, DVDs, movies, and books. What happens if Amazon or YouTube decides to not serve me anymore? Anything I've "purchased" from them, I lose access to.
Furthermore, if I create a song, I used to be able to burn copies onto CDs and distribute them on street corners. Now you have to sign up to stream on Spotify. This is a double-edged sword: I get a wide audience, but Spotify will do whatever they want with me.
This troubles me.
Usually in a virtual machine.
The browser is client-facing and everything there is possible to reverse engineer and figure out. So if you design a web-based application, and are depending on client-side Javascript for any security or distribution enforcement, it can be helpful, but can ultimately be unwound and cracked even if obfuscated, etc.
> Be impressed at what was achieved here?
Yes. Try to download a YouTube video without it, or without an online service that is probably using it internally.
This video goes into some of the design and tradeoffs: https://www.youtube.com/watch?v=Jc_L6UffFOs
TL;DW: they optimized for fast creation/destruction of low-footprint VMs with no JIT or garbage collection.
Some of the stuff is kind of questionable to me in the sense that I could believe you could probably make some kind of sufficiently wonky JS that this would do the "wrong" thing.
But it's super cool that they are able to do this as I think it shows that claims of JS complexity based on the size of JS engines is overlooking just how much of that size/complexity comes from the "make it fast" drive vs. what the language requires. Here you have a <1000LoC implementation of the core of the JS language, removed from things like regex engines, GCs, etc.
Mad props to them for even attempting it as well - it simply would not have ever occurred to me to say "let's just write a small JS engine" and I would have spent stupid amounts of time attempting to use JSC* from python instead.
[* JSC appears to be the only JS engine with a pure C API, and the API and ABI are stable, so on iOS/macOS at least you can just use the system one, which reduces binary size and build annoyance. The downside is that C is terrible, and C++ (differently terrible? :D) APIs make for much more pleasant interfaces to the VM: constructors and destructors mean you get automatic lifetime management, so handles to objects aren't miserable, and you can have templates that let your API provide handles with real type information. JSC only has JSValueRef and JSObjectRef, and as a JSObjectRef is a JSValueRef it's actually just a typedef to const JSValueRef :D OTOH I do think JSC's partially conservative GC is superior to Handles for stack/temporary variables for the most part, but then it's also absolutely necessary in order to have an API that isn't absolutely wretched. The real problem with JSC's API is that it has not gotten any love for many, many, many ... many years, so it doesn't have any way to handle or interact with many modern features without some kludgy wrappers where you push your API objects into JS and have the JS code wrap them up. The API objects are also super slow, as they basically get treated as "oh ffs" objects that obey no rules. I really do wish it would get updated to something more pleasant and really usable.]
I also assume you mean mainstream JS engines, but Duktape, JerryScript, and QuickJS all have C APIs.
They probably could have used e.g. https://github.com/PetterS/quickjs instead of the hacks in the file linked in the OP.
You are correct though that I was only thinking of the big engines - bias on my part alas.
Of your suggested alternate engines, JerryScript and QuickJS seem more complete than Duktape, but I can't quite work out JerryScript's GC strategy. Bellard says QuickJS has a cycle detector, but I'm generally dubious of those based on prior experience.
If I were shipping software that actually had to include a JS engine, and perf was not an issue, I would probably use JerryScript or QuickJS, as binary size would, I think, be the more critical consideration.
Or is their goal just to make youtube-dl not 100% reliable? Or to be able to say "look, you are running our code in a way we did not intend, you can't do this because you are breaking the EULA"?
https://github.com/yt-dlp/yt-dlp/issues/4635#issuecomment-12...
Earlier this year I enrolled in an online class called "Building a Programming Language" taught by Roberto Ierusalimschy (creator of Lua) and Gustavo Pezzi (creator of pikuma.com). We created a toy language interpreter/VM, and the final code was around 1,800 lines of Lua. Keeping things as simple (and sometimes naive) as possible was definitely the right choice for me to really wrap my head around the basic theory and connect the dots.
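In miniature, the same tokenize / parse / evaluate pipeline fits on a page. A toy arithmetic evaluator (a sketch in Python, not the course's Lua code) shows the shape of it:

```python
import re

# One token per match: an integer or a single operator/paren.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(src):
    pos, out = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad input at position {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out

def evaluate(src):
    """Recursive-descent evaluation with the usual precedence levels."""
    toks, i = tokenize(src), 0

    def peek():
        return toks[i] if i < len(toks) else None

    def expr():  # expr := term (('+'|'-') term)*
        nonlocal i
        v = term()
        while peek() in ("+", "-"):
            op = toks[i]; i += 1
            v = v + term() if op == "+" else v - term()
        return v

    def term():  # term := atom (('*'|'/') atom)*
        nonlocal i
        v = atom()
        while peek() in ("*", "/"):
            op = toks[i]; i += 1
            v = v * atom() if op == "*" else v // atom()  # integer division
        return v

    def atom():  # atom := number | '(' expr ')'
        nonlocal i
        t = toks[i]; i += 1
        if t == "(":
            v = expr()
            i += 1  # consume ')'
            return v
        return int(t)

    return expr()

print(evaluate("2+3*(4-1)"))  # 11
```

Everything past this point (statements, variables, functions, a VM) is elaboration of the same loop, which is why keeping the first version naive pays off.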
Thanks for the link.
> Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. [1]
And here we have a complicated Python program with a partial JS implementation in it.
Edit: it's also required to download music, otherwise it will just fail
Source:
- https://github.com/ytdl-org/youtube-dl/issues/29326#issuecom...
- https://github.com/ytdl-org/youtube-dl/blob/d619dd712f63aab1...
- https://github.com/ytdl-org/youtube-dl/commit/cf001636600430...
Overview of the control flow (already known):
The YouTube API provides you with n, your video access token.
If their new changes apply to your client (they do for "web"), then your client is expected to modify n based on internal logic. That logic lives inside player...base.js.
n is modified by a cryptic function
The modified n is sent back to the server as proof that you're an official client. If you send n unmodified, the server will eventually throttle you.
So they can always change the function to keep you on your toes; hence you need to be able to run semi-arbitrary JS in order to keep using the API. A waste of human brainpower, but I guess that energy is better spent imagining a world where Google isn't in charge than kvetching about what they're doing with their influence.
We have a custom-made Discord music bot on our server which uses ytdl to stream songs so we can listen together, and at one point we were listening and suddenly got some obscure JavaScript error.
We began joking that there's some bug in the code which breaks it after 6 PM, but later found out that Google had changed some of the obfuscated JS, and this basically broke the part of the code that fetched the song information.
It's kinda annoying if you have a lot of youtube tabs open for a long time and come back to them.
I believe YouTube limits your bitrate if you don't pass a specific calculated value; it's possible youtube-dl has to parse and eval JS to get it.
It's starting to become Widevine bullshit all over again.
Since the calculation of the response is done in JS, and they occasionally change the formula, some download programs are moving towards running the JS rather than trying to keep up with the changes.
It’s really just bullshit to make people’s lives harder.