Understanding the order() function

This seems to explain it.

The definition of order is that a[order(a)] is in increasing order. This works with your example, where the correct order is the fourth, second, first, then third element.

You may have been looking for rank, which returns the rank of the elements
R> a <- c(4.1, 3.2, 6.1, 3.1)
R> order(a)
[1] 4 2 1 3
R> rank(a)
[1] 3 2 4 1
so rank tells you what order the numbers are in, order tells you how to get them in ascending order.

plot(a, rank(a)/length(a)) will give a graph of the CDF. To see why order is useful, though, try plot(a, rank(a)/length(a),type="S") which gives a mess, because the data are not in increasing order

If you did
oo<-order(a)
plot(a[oo],rank(a[oo])/length(a),type="S")
or simply
oo<-order(a)
plot(a[oo],(1:length(a))/length(a)),type="S")
you get a line graph of the CDF.

I'll bet you're thinking of rank.


To sort a 1D vector or a single column of data, just call the sort function and pass in your sequence.

On the other hand, the order function is necessary to sort data two-dimensional data--i.e., multiple columns of data collected in a matrix or dataframe.

Stadium Home Week Qtr Away Off Def Result       Kicker Dist
751     Out  PHI   14   4  NYG PHI NYG   Good      D.Akers   50
491     Out   KC    9   1  OAK OAK  KC   Good S.Janikowski   32
702     Out  OAK   15   4  CLE CLE OAK   Good     P.Dawson   37
571     Out   NE    1   2  OAK OAK  NE Missed S.Janikowski   43
654     Out  NYG   11   2  PHI NYG PHI   Good      J.Feely   26
307     Out  DEN   14   2  BAL DEN BAL   Good       J.Elam   48
492     Out   KC   13   3  DEN  KC DEN   Good      L.Tynes   34
691     Out  NYJ   17   3  BUF NYJ BUF   Good     M.Nugent   25
164     Out  CHI   13   2   GB CHI  GB   Good      R.Gould   25
80      Out  BAL    1   2  IND IND BAL   Good M.Vanderjagt   20

Here is an excerpt of data for field goal attempts in the 2008 NFL season, a dataframe i've called 'fg'. suppose that these 10 data points represent all of the field goals attempted in 2008; further suppose you want to know the the distance of the longest field goal attempted that year, who kicked it, and whether it was good or not; you also want to know the second-longest, as well as the third-longest, etc.; and finally you want the shortest field goal attempt.

Well, you could just do this:

sort(fg$Dist, decreasing=T)

which returns: 50 48 43 37 34 32 26 25 25 20

That is correct, but not very useful--it does tell us the distance of the longest field goal attempt, the second-longest,...as well as the shortest; however, but that's all we know--eg, we don't know who the kicker was, whether the attempt was successful, etc. Of course, we need the entire dataframe sorted on the "Dist" column (put another way, we want to sort all of the data rows on the single attribute Dist. that would look like this:

Stadium Home Week Qtr Away Off Def Result       Kicker Dist
751     Out  PHI   14   4  NYG PHI NYG   Good      D.Akers   50
307     Out  DEN   14   2  BAL DEN BAL   Good       J.Elam   48
571     Out   NE    1   2  OAK OAK  NE Missed S.Janikowski   43
702     Out  OAK   15   4  CLE CLE OAK   Good     P.Dawson   37
492     Out   KC   13   3  DEN  KC DEN   Good      L.Tynes   34
491     Out   KC    9   1  OAK OAK  KC   Good S.Janikowski   32
654     Out  NYG   11   2  PHI NYG PHI   Good      J.Feely   26
691     Out  NYJ   17   3  BUF NYJ BUF   Good     M.Nugent   25
164     Out  CHI   13   2   GB CHI  GB   Good      R.Gould   25
80      Out  BAL    1   2  IND IND BAL   Good M.Vanderjagt   20

This is what order does. It is 'sort' for two-dimensional data; put another way, it returns a 1D integer index comprised of the row numbers such that sorting the rows according to that vector, would give you a correct row-oriented sort on the column, Dist

Here's how it works. Above, sort was used to sort the Dist column; to sort the entire dataframe on the Dist column, we use 'order' exactly the same way as 'sort' is used above:

ndx = order(fg$Dist, decreasing=T)

(i usually bind the array returned from 'order' to the variable 'ndx', which stands for 'index', because i am going to use it as an index array to sort.)

that was step 1, here's step 2:

'ndx', what is returned by 'sort' is then used as an index array to re-order the dataframe, 'fg':

fg_sorted = fg[ndx,]

fg_sorted is the re-ordered dataframe immediately above.

In sum, 'sort' is used to create an index array (which specifies the sort order of the column you want sorted), which then is used as an index array to re-order the dataframe (or matrix).


(I thought it might be helpful to lay out the ideas very simply here to summarize the good material posted by @doug, & linked by @duffymo; +1 to each,btw.)

?order tells you which element of the original vector needs to be put first, second, etc., so as to sort the original vector, whereas ?rank tell you which element has the lowest, second lowest, etc., value. For example:

> a <- c(45, 50, 10, 96)
> order(a)  
[1] 3 1 2 4  
> rank(a)  
[1] 2 3 1 4  

So order(a) is saying, 'put the third element first when you sort... ', whereas rank(a) is saying, 'the first element is the second lowest... '. (Note that they both agree on which element is lowest, etc.; they just present the information differently.) Thus we see that we can use order() to sort, but we can't use rank() that way:

> a[order(a)]  
[1] 10 45 50 96  
> sort(a)  
[1] 10 45 50 96  
> a[rank(a)]  
[1] 50 10 45 96  

In general, order() will not equal rank() unless the vector has been sorted already:

> b <- sort(a)  
> order(b)==rank(b)  
[1] TRUE TRUE TRUE TRUE  

Also, since order() is (essentially) operating over ranks of the data, you could compose them without affecting the information, but the other way around produces gibberish:

> order(rank(a))==order(a)  
[1] TRUE TRUE TRUE TRUE  
> rank(order(a))==rank(a)  
[1] FALSE FALSE FALSE  TRUE  

Tags:

Sorting

R

R Faq