https://github.com/neilk/misc/blob/master/randline
I've had this script (or versions of it) around for more than a decade. I didn't know the technique had a name.
[Edit: I was going to delete this, but since you replied I'll leave it. It does appear (too?) in the Camel book, on p. 246 of my copy, but like you say, it's for a single line. I hadn't opened that book in years, so it took me some time to find it...]
(So it presumably consumes the entire stream before giving any results; any alternative I can think of would not be "really random" unless you knew the length of the stream in advance.)
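For what it's worth, here is a minimal Python sketch of the one-pass approach being discussed (usually called Algorithm R). The names `reservoir_sample`, `stream`, `k`, and `rng` are my own and not taken from any script in this thread. It shows why nothing can be reported as final until the input ends: any later item can still evict an earlier one.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Pick k items uniformly at random from stream in a single pass."""
    if rng is None:
        rng = random.Random()
    sample: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            sample.append(item)
        else:
            # Item n is kept with probability k/n and, if kept,
            # overwrites a uniformly chosen slot.
            j = rng.randrange(n)  # uniform over 0 .. n-1
            if j < k:
                sample[j] = item
    # Only now is the sample final: every item had its chance.
    return sample
```

If the stream has fewer than k items you simply get all of them back. The length never needs to be known up front, but the result is only trustworthy once the stream is exhausted.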
dmesg | gawk -f reservoir-sample.awk k=5 record_separator='==='
#!/usr/bin/gawk -f
### reservoir-sample.awk
###
### Sample k random lines from a stream, without knowing the size of the stream.
###
### (Tomer Altman)
### Parameters: (set from command-line)
##
## k: number of lines to sample
## record_separator: string used to separate iterations of the sample in stdout.
## Define a function for returning a number between 1 and n:
function random_int (n) { return 1 + int(rand() * n) }
## Initialize the PRNG seed:
BEGIN { srand() }
## For the first k lines, initialize the sample array:
NR <= k { sample[NR] = $0 }
## If we've initialized the sample array, and we pick an integer between one and the current record number that is less than or equal to k, update the sample array and print it to stdout:
NR > k && (current_random_int = random_int(NR)) <= k {
    sample[current_random_int] = $0
    for (i in sample) print sample[i]
    print (record_separator != "") ? record_separator : "--"
}
It seems like this would be a really useful improvement, and I'm surprised that it doesn't already seem to have been requested on the coreutils issue tracker.
(This bit me once. Filed a bug about maybe clarifying the man page, but nope, apparently we're all supposed to recognize instantly the implication of a random hash.)
$ seq 15 | dimsum -n 10
14
12
3
4
5
6
7
8
9
10
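That output looks like what you'd get from a reservoir that overwrites slots in place: 3 through 10 are still sitting where they started, and only the first two slots hold later arrivals. I haven't read dimsum's source, so treat that as a guess. A quick Python check of the numbers above (variable names are mine) makes the pattern explicit:

```python
import random

observed = [14, 12, 3, 4, 5, 6, 7, 8, 9, 10]  # the output shown above
k = len(observed)

# Slots whose value equals their position were never overwritten.
kept = [v for pos, v in enumerate(observed, start=1) if v == pos]
# Every other slot holds an item that arrived after the first k.
replaced = [(pos, v) for pos, v in enumerate(observed, start=1) if v != pos]

print("kept in place:", kept)
print("replaced (slot, value):", replaced)

# If the order should be random too, shuffle once at the end.
random.shuffle(observed)
```

So the selection can be uniform while the ordering is not; a final shuffle is the cheap fix if the order matters.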