Python's random or NumPy's random?

I’m working on a stochastic model that relies heavily on random sampling from a list of items. Normally I turn to NumPy for this, with its numpy.random.choice() function. To my surprise, it turns out that random.sample() (and random.choice()) in Python’s standard library are quite a bit faster:

import numpy as np

x = np.arange(1000)
%timeit np.random.choice(x, replace=False)  # draw a single element without replacement

gives me 16.2 µs ± 264 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each), while:

import random

x = list(range(1000))
%timeit random.sample(x, 1)  # draw a sample of one element

gives me 2.16 µs ± 32.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each).

Does anyone have an idea of what might be going on?

Ah, it leaves a placeholder when I delete my own message. I thought I couldn’t reproduce the problem, but I had the wrong boolean for “replace”, so never mind.

Interesting observation, thanks!

random.sample() uses two different approaches, depending on the sample size. From the comments in CPython’s random.py:

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.
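
To make the two strategies concrete, here is a minimal sketch of both (my own simplification, not the actual CPython code; the function names are made up, and the real implementation also picks the cutover point between the two strategies based on k and the population size):

import random

def sample_by_tracking(population, k):
    # Strategy 1: remember the chosen indices in a set and redraw on
    # a collision; cheap when k is small relative to the population.
    # Assumes k <= len(population).
    n = len(population)
    selected = set()
    result = []
    for _ in range(k):
        j = random.randrange(n)
        while j in selected:  # occasional reselection
            j = random.randrange(n)
        selected.add(j)
        result.append(population[j])
    return result

def sample_by_pool(population, k):
    # Strategy 2: copy the population into a pool and swap each pick
    # out of the unpicked region; better when k is a large fraction
    # of the population, since the list takes less space than a set
    # and there are no reselections.
    pool = list(population)
    n = len(pool)
    result = []
    for i in range(k):
        j = random.randrange(n - i)
        result.append(pool[j])
        pool[j] = pool[n - i - 1]  # move an unpicked item into the hole
    return result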

The set-tracking approach seems to work really well for small sample sizes. With larger samples the difference in time starts to disappear, and for k = 200 I get:
numpy:

37.1 µs ± 942 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

random:

115 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
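
If you want to see where the crossover happens on your own machine, a quick sweep over k with timeit is enough (a rough sketch; the exact numbers will vary with hardware and NumPy version):

import random
import timeit

import numpy as np

x_np = np.arange(1000)
x_py = list(range(1000))

# Compare the per-call cost of both samplers for a range of sample sizes.
for k in (1, 10, 50, 200, 500):
    t_np = timeit.timeit(lambda: np.random.choice(x_np, k, replace=False), number=2000)
    t_py = timeit.timeit(lambda: random.sample(x_py, k), number=2000)
    print(f"k={k:4d}  numpy: {t_np / 2000 * 1e6:7.1f} µs  random: {t_py / 2000 * 1e6:7.1f} µs")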

If random.sample() is too fast for you, you can always use random.choice(), which is, for some unknown reason, 20 times slower 🙂.

I can confirm that with higher k values, NumPy catches up! Interesting. I should spend more time reading the docs 😉
