# What happens when prob argument in sample sums to less/greater than 1?

As already mentioned, the weights are normalized to sum to 1 as can be demonstrated:

```
> x/sum(x)
[1] 0.15384615 0.38461538 0.38461538 0.07692308
```

This matches your simulated tabulated data:

```
# 1 2 3 4
#0.1544 0.3839 0.3848 0.0768
```
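A small simulation along these lines reproduces the comparison. The weights `c(0.2, 0.5, 0.5, 0.1)` are an assumption, chosen because they are consistent with the normalised values shown above (they sum to 1.3); the seed and sample size are arbitrary.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
w <- c(0.2, 0.5, 0.5, 0.1)         # assumed weights; they sum to 1.3, not 1
n <- 1e5
draws <- sample(1:4, n, replace = TRUE, prob = w)

expected <- w / sum(w)                      # what normalisation predicts
observed <- tabulate(draws, nbins = 4) / n  # empirical frequencies

print(round(rbind(expected, observed), 4))
```

The two rows should agree to roughly two decimal places, with the remaining difference being sampling noise.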

Good question. The docs are unclear on this, but the question can be answered by reviewing the source code.

If you look at the R code, `sample` always calls another R function, `sample.int`. If you pass in a single number `x` to `sample`, it will use `sample.int` to create a vector of integers less than or equal to that number, whereas if `x` is a vector, it uses `sample.int` to generate a sample of integers less than or equal to `length(x)`, then uses that to subset `x`.
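The dispatch described above can be checked from the console. This is a minimal sketch: the vector `x` and the seed are made up for illustration, and it relies on both calls consuming the random number stream in the same way.

```r
x <- c("a", "b", "c", "d", "e")    # illustrative vector

# Vector input: sample(x, size) should behave like indexing x with sample.int().
set.seed(1)
via_sample <- sample(x, 3)
set.seed(1)
via_index <- x[sample.int(length(x), 3)]
print(identical(via_sample, via_index))

# Single-number input: sample(n, size) should behave like sample.int(n, size).
set.seed(1)
num_sample <- sample(5, 3)
set.seed(1)
num_direct <- sample.int(5, 3)
print(identical(num_sample, num_direct))
```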

Now, if you examine the function `sample.int`, it looks like this:

```
function (n, size = n, replace = FALSE, prob = NULL, useHash = (!replace &&
    is.null(prob) && size <= n/2 && n > 1e+07))
{
    if (useHash)
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}
```

The `.Internal` means any sampling is done by calling compiled code written in C: in this case, it's the function `do_sample`, defined in src/main/random.c.

If you look at this C code, `do_sample` checks whether it has been passed a `prob` vector. If not, it samples on the assumption of equal weights. If `prob` exists, the function ensures that it is numeric and not NA. If `prob` passes these checks, a pointer to the underlying array of doubles is generated and passed to another function in random.c called `FixUpProbs`.

This function examines each member of `prob` and throws an error if any element is negative or not a finite double (which includes `NA`). It then normalises the numbers by dividing each by the sum of all. There is therefore no preference at all for `prob` summing to 1 inherent in the code. That is, even if `prob` sums to 1 in your input, the function will still calculate the sum and divide each number by it.
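To make the arithmetic concrete, here is a rough R restatement of that normalisation step. It is not the C source: `normalise_weights` is a hypothetical helper written for illustration, and the weights are chosen so that the division is exact in floating point.

```r
# Hypothetical R restatement of the normalisation described above.
# Assumption: zeros are tolerated as long as at least one weight is positive.
normalise_weights <- function(prob) {
  stopifnot(is.numeric(prob), all(is.finite(prob)),
            all(prob >= 0), any(prob > 0))
  prob / sum(prob)
}

p1 <- normalise_weights(c(1, 2, 1))           # input sums to 4
p2 <- normalise_weights(c(0.25, 0.5, 0.25))   # input already sums to 1
print(identical(p1, p2))

# Same seed, differently scaled weights: the draws should coincide,
# because both inputs normalise to the same vector.
set.seed(7)
a <- sample(3, 20, replace = TRUE, prob = c(1, 2, 1))
set.seed(7)
b <- sample(3, 20, replace = TRUE, prob = c(0.25, 0.5, 0.25))
print(identical(a, b))
```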

Therefore, the parameter is poorly named. It should be "weights", as others here have pointed out. To be fair, the docs only say that `prob` should be a vector of weights, not absolute probabilities.

So the behaviour of the `prob` parameter from my reading of the code should be:

- `prob` can be absent altogether, in which case sampling defaults to equal weights.
- If any of `prob`'s numbers are less than zero, or are infinite, or NA, the function will throw.
- An error should be thrown if any of the `prob` values are non-numeric, as they will be interpreted as `NA` in the SEXP passed to the C code.
- `prob` must have the same length as `x` or the C code throws.
- You can pass a zero probability as one or more elements of `prob` if you have specified `replace=T`, as long as you have at least one non-zero probability.
- If you specify `replace=F`, the number of samples you request must be less than or equal to the number of non-zero elements in `prob`. Essentially, `FixUpProbs` will throw if you ask it to sample with a zero probability.
- A valid `prob` vector will be normalised to sum to 1 and used as sampling weights.
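Several of the rules above can be probed directly. The sketch below only records whether each call raises an error, without relying on the exact message text; `fails` is a hypothetical helper defined here for that purpose, and the inputs are illustrative.

```r
# Hypothetical helper: TRUE if evaluating the expression raises an error.
fails <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

checks <- c(
  negative     = fails(sample(3, 1, prob = c(-1, 1, 1))),
  na_value     = fails(sample(3, 1, prob = c(NA, 1, 1))),
  infinite     = fails(sample(3, 1, prob = c(Inf, 1, 1))),
  wrong_length = fails(sample(3, 1, prob = c(1, 1))),
  all_zero     = fails(sample(3, 1, replace = TRUE, prob = c(0, 0, 0))),
  too_few      = fails(sample(3, 3, replace = FALSE, prob = c(1, 1, 0)))
)
print(checks)

# A zero weight is accepted with replacement; that element is never drawn.
set.seed(3)
ok <- sample(3, 50, replace = TRUE, prob = c(0, 1, 1))
print(table(ok))
```

Each entry of `checks` should come back `TRUE`, and the table should contain no 1s.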

As an interesting side effect of this behaviour, you can use odds instead of probabilities when choosing between 2 alternatives, by setting `prob = c(1, odds)`.
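As a quick illustration of the odds trick, the sketch below uses odds of 3 to 1 in favour of the second alternative; the labels, the odds value, and the seed are all made up for the example. The implied probability is `odds / (1 + odds)`.

```r
set.seed(11)
odds <- 3     # illustrative: 3-to-1 in favour of the second alternative
n <- 1e5
draws <- sample(c("first", "second"), n, replace = TRUE, prob = c(1, odds))

implied  <- odds / (1 + odds)        # probability implied by the odds
observed <- mean(draws == "second")  # empirical share of "second"
print(c(implied = implied, observed = observed))
```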