Are smaller sensors also faster to read, given the lower capacitance? I wonder if that might give them an advantage when it comes to stacking and averaging images to reduce noise.
That's another good question and (as ever) the devil is in the engineering detail. I work with noise professionally (well, with signal in a low-SNR environment, where the SNR per acquisition of interest is definitely << 1...) and those sorts of questions are good to ask, but the answer depends a lot on the exact noise statistics: your approach requires that the average value of the noise is the same in both cases (read out N times and digitally average, vs. read out once with "analogue" averaging on the sensor). In practice, because the different noise sources have different distributions and pixel values are non-negative, I doubt this is true. The advantage of stacking is that it helps with motion (a lot) and also helps with a finite-dynamic-range instrument (i.e. you don't blow out the sensor and can do gain compression).
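To make the "depends on the noise stats" point concrete, here's a toy numpy sketch. All numbers are made-up assumptions (Gaussian read noise per readout, Poisson shot noise, arbitrary signal rate); the point is just that a stack of N short frames pays the read noise N times, while a single long exposure pays it once:

```python
import numpy as np

rng = np.random.default_rng(0)

signal_rate = 100.0   # photoelectrons per unit exposure (assumed)
read_noise = 5.0      # electrons RMS per readout (assumed)
N = 16                # number of short frames in the stack
trials = 100_000

# Single long exposure: shot noise on the full signal, read noise paid once.
long_exp = rng.poisson(signal_rate * N, trials) + rng.normal(0, read_noise, trials)

# Stack of N short exposures: same total signal, but read noise paid N times.
short = rng.poisson(signal_rate, (trials, N)) + rng.normal(0, read_noise, (trials, N))
stacked = short.sum(axis=1)

print(f"single long exposure std: {long_exp.std():.1f}")  # ~sqrt(N*S + r^2)
print(f"stack of {N} short frames std: {stacked.std():.1f}")  # ~sqrt(N*S + N*r^2)
```

With these numbers the stacked version comes out noisier, as expected; whether that matters in practice depends on how read noise compares to shot noise at your exposure level.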
> Are smaller sensors also faster to read, given the lower capacitance?
This Stanford paper with a model of a CMOS sensor [1] is rather old but quite a good explanation of where the readout time comes from; the capacitance across the active area is C_{pd}, but the minimum readout time is dominated by the capacitance of the readout bias, C_T, across the ADC line. As a result it scales with transistor feature size (fig. 5), independently of the sensor area. Of course, as 'moar megapixelz' came along, this capacitance got higher and other designs were explored to mitigate it – a paper from Rochester [2] states that removing it buggers up the noise statistics unless you do "clever things" (which they describe in detail).
> That is a good point that I hadn't considered, thanks.
I should procrastinate more productively, but thank you!
[1] https://isl.stanford.edu/~abbas/group/papers_and_pub/tcas1.p... [2] https://sci-hub.se/10.1109/ISCAS.2008.4541803
Yeah, that was the assumption I didn't articulate very clearly. For the same total exposure time, you wouldn't intuitively expect an average of multiple short exposures to be much different from a single long exposure (it all depends on the mathematical details, as you say). But of course if you have a good alignment algorithm, in practice you can often use a much longer total exposure time. Then, to state the obvious, you can use a lower ISO and get less noise.
Is it the case that C_T is independent of C_{pd}? Looking at the (obviously rather idealized) circuit diagram, it seems odd that their values would be entirely independent in a realistic design. I guess I am basing my intuition on the simpler case of reading a single photodiode, where it is undoubtedly the case that the larger capacitance of a larger photodiode makes it more difficult to achieve a high bandwidth. Perhaps 'read time' as such is not the issue. But being able to scale up the individual photodiodes without sacrificing bandwidth seems a bit magical.
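For what it's worth, here's the single-photodiode intuition as a toy RC model: f_3dB = 1/(2*pi*R*C), with the photodiode capacitance assumed proportional to its area. The resistance and capacitance-per-area values are just ballpark assumptions for illustration:

```python
import math

R = 10e3             # readout resistance in ohms (assumed)
cap_per_area = 1e-9  # farads per cm^2, ballpark for a silicon photodiode (assumed)

for area_cm2 in (0.01, 1.0):
    C = cap_per_area * area_cm2
    f3db = 1 / (2 * math.pi * R * C)  # single-pole RC bandwidth
    print(f"area {area_cm2:5.2f} cm^2 -> C = {C:.1e} F, f_3dB = {f3db / 1e3:.1f} kHz")
```

In this simple model, a 100x larger photodiode has 100x less bandwidth, which is why the claim that readout time is set by C_T rather than C_{pd} in a pixel array is surprising at first glance.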