Timeline- and frame-based editing is the end result, but this is more about creating the individual elements and, from there, editing them into a time-based scene.
I've spent the last few years in the cinematography department, and most directors and directors of photography will write the scenes on flash cards and move them around and rearrange them, because not every story is linear, even though we have to find a way to present every story in a linear form. From there, each scene requires multiple angles, shots, and motivations, and things change so much on the fly that a screenplay becomes a document quite dense with information that is never presented.
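To make the flash-card idea concrete, here is a rough sketch of how a scene card and its shots might look as data. Every class and field name below (`Shot`, `SceneCard`, `arrange`, `motivation`, and so on) is made up for illustration, and the example scenes are invented. It is just one possible shape, not a proposed standard.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    # All field names are hypothetical, chosen only for illustration.
    angle: str            # e.g. "wide", "close-up"
    movement: str         # e.g. "static", "dolly", "steadicam"
    motivation: str = ""  # why the shot exists; rarely visible on the page


@dataclass
class SceneCard:
    scene_id: str
    summary: str
    shots: list[Shot] = field(default_factory=list)


def arrange(cards: list[SceneCard], order: list[str]) -> list[SceneCard]:
    """Return the cards in the given order, like shuffling flash cards on a table."""
    by_id = {card.scene_id: card for card in cards}
    return [by_id[scene_id] for scene_id in order]


# Invented example scenes.
cards = [
    SceneCard("A", "Detective finds the letter",
              [Shot("close-up", "static", "reveal the handwriting")]),
    SceneCard("B", "Flashback: the letter is written", [Shot("wide", "dolly")]),
    SceneCard("C", "Confrontation", [Shot("medium", "steadicam")]),
]

# The same cards in two linear orders: how the story happened vs. how it is shown.
story = arrange(cards, ["B", "A", "C"])
presentation = arrange(cards, ["A", "B", "C"])
```

The point of keeping the order separate from the cards is that rearranging never touches the scene content, which is roughly what moving flash cards around does.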
So I suppose the next step would be to parse a bunch of screenplays from different formats into a single readable format, train a text model on those screenplays, and train an image model on the frames of the same movies, to get a cross-reference of what is written down versus what is displayed visually. We could break down the visual shots by camera movement (Steadicam, dolly move, etc.), identify key props with the image model (maybe; sounds expensive), and compare them to the key props in the script. I don't know, I'm spitballing now. A multi-modal Hollywood film producer would be kind of fun, but this is really just starting as a way to standardize the script in a granular form, and a way for me to code since I'm not out on set.
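The prop comparison part could be as simple as a set difference once both sides are reduced to lists of names. A minimal sketch, assuming the script props and the detected props already exist as plain strings: the function name `compare_props`, the result keys, and the sample props are all hypothetical, and the "detected" set is typed by hand here as a stand-in for whatever an image model would actually return.

```python
def compare_props(script_props: set[str], detected_props: set[str]) -> dict[str, set[str]]:
    """Cross-reference props named in the script against props seen in the frames."""
    # Normalize case and whitespace so "Letter" and "letter " count as the same prop.
    script = {p.strip().lower() for p in script_props}
    seen = {p.strip().lower() for p in detected_props}
    return {
        "matched": script & seen,           # written and shown
        "written_not_seen": script - seen,  # in the script, never detected on screen
        "seen_not_written": seen - script,  # on screen, never mentioned in the script
    }


# Invented sample data; no model is called anywhere in this sketch.
script_props = {"Letter", "Revolver", "Whiskey glass"}
detected_props = {"letter", "whiskey glass", "desk lamp"}

result = compare_props(script_props, detected_props)
```

Exact string matching is obviously too naive for real use (a script says "gun", a detector says "revolver"), so this only shows the shape of the comparison.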