
taltman, 13y ago

As minimax stated, your awk code won't provide exactly k samples. Here's a bit of awk code that implements reservoir sampling (apologies in advance for any bugs) and prints the current sample to stdout, without needing to first process the entire stream. It simply prints the set of samples to stdout whenever the sample updates, with sets of samples separated by a distinct string. It is called as follows (of course, replace dmesg with any stream of your choosing):

  dmesg | gawk -f reservoir-sample.awk k=5 record_separator='==='
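Because every update prints a whole new block, the sample you usually want at the end of the stream is just the last block. A minimal sketch of a filter for that, assuming the separator is passed in unchanged and that no sampled line is identical to it; `last_sample` is a made-up name, and plain `awk` is assumed to be enough here:

```shell
# last_sample SEP: read the sampler's output on stdin and print only the
# final block, i.e. the lines just before the last separator line.
last_sample() {
  awk -v sep="$1" '
    $0 == sep {                 # a block just ended: remember it, start anew
      n = m
      for (i = 1; i <= m; i++) last[i] = cur[i]
      m = 0
      next
    }
    { cur[++m] = $0 }           # collect the lines of the block in progress
    END { for (i = 1; i <= n; i++) print last[i] }
  '
}
```

It would be appended to the pipeline, and prints nothing if the sampler never emitted a block:

  dmesg | gawk -f reservoir-sample.awk k=5 record_separator='===' | last_sample '==='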


  #!/usr/bin/gawk -f

  ### reservoir-sample.awk
  ###
  ### Sample k random lines from a stream, without knowing the size of the stream.
  ###
  ### (Tomer Altman)

  ### Parameters: (set from command-line)
  ##
  ## k: number of lines to sample
  ## record_separator: string used to separate iterations of the sample in stdout.

  ## Define a function for returning a random integer between 1 and n, inclusive:
  function random_int (n) { return 1 + int(rand() * n) }

  ## Initialize the PRNG seed:
  BEGIN { srand() }

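  ## Print the current sample in slot order, followed by the separator
  ## (a "for (i in sample)" loop visits the slots in an unspecified order):
  function print_sample (    i) {
    for (i = 1; i <= k; i++) print sample[i]
    print (record_separator != "" ? record_separator : "--")
  }

  ## For the first k lines, initialize the sample array:
  NR <= k { sample[NR] = $0 }

  ## Once the sample array is full, print the initial sample:
  NR == k { print_sample() }

  ## After that, pick an integer between one and the current record number. If it is less than or equal to k, replace that slot of the sample array with the current line and print the updated sample:
  NR > k && (current_random_int = random_int(NR)) <= k {
    sample[current_random_int] = $0
    print_sample()
  }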
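If you want to convince yourself that the replacement rule is fair, a quick simulation helps: line number i replaces a random slot with probability k/i, and after n lines each line should sit in the sample with probability k/n. Here is a rough sketch of such a check, using the same rule on the integers 1..n instead of real input; `reservoir_check` and its three arguments are names I made up, and it assumes any POSIX `awk`:

```shell
# reservoir_check K N TRIALS: run the sampling rule TRIALS times over the
# items 1..N and print, per item, the fraction of runs it ended up sampled in.
reservoir_check() {
  awk -v k="$1" -v n="$2" -v trials="$3" '
    function random_int(n) { return 1 + int(rand() * n) }
    BEGIN {
      srand()
      for (t = 1; t <= trials; t++) {
        for (i = 1; i <= n; i++) {
          if (i <= k) sample[i] = i                          # fill the sample
          else if ((r = random_int(i)) <= k) sample[r] = i   # replace w.p. k/i
        }
        for (j = 1; j <= k; j++) hits[sample[j]]++
      }
      for (i = 1; i <= n; i++) printf "%d %.3f\n", i, hits[i] / trials
    }'
}

reservoir_check 3 10 20000
```

With k=3 and n=10, every item's fraction should come out close to 3/10, with no drift toward early or late items.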


taltman (OP), 13y ago

In case anyone is interested, I've extended this code and put it up on GitHub as a script called 'samp':

https://github.com/taltman/scripts
