That is 8 terabytes of RAM (n1-standard-8) for 18.6 * RF gigabytes of data, or am I wrong? The entire thing should fit in the page cache, assuming compaction keeps up. There are no deletes, so no tombstones. On an overwrite workload I don't know what the space amplification will be, or whether you would actually run out of RAM for caching.
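To make the ratio concrete, here is a back-of-the-envelope sketch. The replication factor of 3 is purely an assumption for illustration (the RF is not stated above), and I am treating 8 terabytes as 8000 gigabytes.

```python
def ram_to_data_ratio(total_ram_gb, data_gb, replication_factor):
    """Return how many times the replicated data set fits into cluster RAM."""
    replicated_gb = data_gb * replication_factor
    return total_ram_gb / replicated_gb


TOTAL_RAM_GB = 8 * 1000   # 8 terabytes of RAM across the cluster
DATA_GB = 18.6            # unreplicated data size from the benchmark
ASSUMED_RF = 3            # ASSUMPTION: hypothetical replication factor

ratio = ram_to_data_ratio(TOTAL_RAM_GB, DATA_GB, ASSUMED_RF)
print(f"RAM is roughly {ratio:.0f}x the replicated data size")
```

Under that assumed RF the RAM is on the order of a hundred times the data, which is why I expect everything to sit in the page cache unless space amplification is extreme.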
90% of people who read a benchmark aren't going to look at it the way I do. They don't have a cost model that says how fast things should be and how much they should cost. I do, and when things fit in memory I have a different set of expectations. If I am looking at the wrong instance type, please let me know.
For the workload you described, Cassandra shouldn't be doing random IO. I would expect there to be three + N streams of sequential IO: the write-ahead log, memtable flushing, compaction output, and N streams reading tables for compaction.
All the write IO can be deferred and rescheduled heavily because fsyncs are infrequent. Reads for compaction are done by background tasks and shouldn't affect foreground latency.
If read ahead is not working for compaction (and that is killing disk throughput), that may be something that needs to be addressed. Compaction should be requesting IOs large enough to amortize the cost of seeking. Page faults in memory-mapped files don't stop read ahead, and I think the kernel will even detect sequential access and double the read ahead. For a workload like this with no random reads, you could configure the kernel to read ahead 2-4 megabytes and the page cache would probably absorb it fine.
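As a rough sketch of what 2-4 megabytes of read ahead means in the units Linux uses: `blockdev --setra` takes a count of 512-byte sectors, and `/sys/block/<dev>/queue/read_ahead_kb` takes kilobytes. The device name `/dev/sdX` below is a placeholder, not a device from the benchmark, and the script only prints the values rather than changing anything.

```python
SECTOR_BYTES = 512  # blockdev --setra counts read ahead in 512-byte sectors


def readahead_settings(megabytes):
    """Convert a read-ahead size in megabytes to sectors and kilobytes."""
    total_bytes = megabytes * 1024 * 1024
    return {
        "sectors": total_bytes // SECTOR_BYTES,   # value for blockdev --setra
        "read_ahead_kb": total_bytes // 1024,     # value for read_ahead_kb in sysfs
    }


for mb in (2, 4):
    s = readahead_settings(mb)
    # /dev/sdX is a hypothetical device name; substitute the real data disk.
    print(f"{mb} MB: blockdev --setra {s['sectors']} /dev/sdX "
          f"(read_ahead_kb={s['read_ahead_kb']})")
```

Whether the larger setting actually helps would need to be measured on the disks in question; this only shows the unit conversion.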
For the workload you described, the only fsyncs would be the log (every 10 seconds) and the ones when memtable flushes and compactions finish. Those would normally be so infrequent as to not move the needle on overall IO capacity (although obviously the writes themselves consume sequential IO). You set trickle fsync, so we are still talking about 10 megabyte writes.
Granted, there are many things I don't know about Cassandra. I've plumbed a lot of it, but it isn't my day job. Using a 24 gigabyte heap and a 600 megabyte young generation is questionable to me. I think Cassandra can do better, and I also think there are several tools that would do the same job with 10-20x fewer nodes by exploiting the fact that they never have to do random reads from disk.