Keys and fast binary search based subset

2016-12-02

This vignette is aimed at those who are already familiar with data.table syntax, its general form, how to subset rows in i, select and compute on columns, add/modify/delete columns by reference in j and group by using by. If you’re not familiar with these concepts, please read the “Introduction to data.table” and “Reference semantics” vignettes first.


Data

We will use the same flights data as in the “Introduction to data.table” vignette.

flights <- fread("flights14.csv")
head(flights)
#    year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014     1   1        14        13      AA    JFK  LAX      359     2475    9
# 2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475   11
# 3: 2014     1   1         2         9      AA    JFK  LAX      351     2475   19
# 4: 2014     1   1        -8       -26      AA    LGA  PBI      157     1035    7
# 5: 2014     1   1         2         1      AA    JFK  LAX      350     2475   13
# 6: 2014     1   1         4         0      AA    EWR  LAX      339     2454   18
dim(flights)
# [1] 253316     11

Introduction

In this vignette, we will

1. Keys

a) What is a key?

In the “Introduction to data.table” vignette, we saw how to subset rows in i using logical expressions, row numbers and using order(). In this section, we will look at another way of subsetting incredibly fast - using keys.

But first, let’s start by looking at data.frames. All data.frames have a row names attribute. Consider the data.frame DF below.

set.seed(1L)
DF = data.frame(ID1 = sample(letters[1:2], 10, TRUE),
                ID2 = sample(1:3, 10, TRUE),
                val = sample(10),
                stringsAsFactors = FALSE,
                row.names = sample(LETTERS[1:10]))
DF
#   ID1 ID2 val
# C   a   3   5
# D   a   1   6
# E   b   2   4
# G   a   1   2
# B   b   1  10
# H   a   2   8
# I   b   1   9
# F   b   2   1
# J   a   3   7
# A   b   2   3

rownames(DF)
#  [1] "C" "D" "E" "G" "B" "H" "I" "F" "J" "A"

We can subset a particular row using its row name as shown below:

DF["C", ]
#   ID1 ID2 val
# C   a   3   5

i.e., row names are more or less an index to rows of a data.frame. However,

  1. Each row is limited to exactly one row name.

    But, a person (for example) has at least two names - a first and a second name. It is useful to organise a telephone directory by surname then first name.

  2. And row names should be unique.

    rownames(DF) = sample(LETTERS[1:5], 10, TRUE)
    # Warning: non-unique values when setting 'row.names': 'C', 'D'
    # Error in `row.names<-.data.frame`(`*tmp*`, value = value): duplicate 'row.names' are not allowed

Now let’s convert it to a data.table.

DT = as.data.table(DF)
DT
#     ID1 ID2 val
#  1:   a   3   5
#  2:   a   1   6
#  3:   b   2   4
#  4:   a   1   2
#  5:   b   1  10
#  6:   a   2   8
#  7:   b   1   9
#  8:   b   2   1
#  9:   a   3   7
# 10:   b   2   3

rownames(DT)
#  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

Instead, in data.tables we set and use keys. Think of a key as supercharged rownames.

Keys and their properties

  1. We can set keys on multiple columns and the column can be of different typesinteger, numeric, character, factor, integer64 etc. list and complex types are not supported yet.

  2. Uniqueness is not enforced, i.e., duplicate key values are allowed. Since rows are sorted by key, any duplicates in the key columns will appear consecutively.

  3. Setting a key does two things:

    1. physically reorders the rows of the data.table by the column(s) provided by reference, always in increasing order.

    2. marks those columns as key columns by setting an attribute called sorted to the data.table.

    Since the rows are reordered, a data.table can have at most one key because it can not be sorted in more than one way.

For the rest of the vignette, we will work with flights data set.

b) Set, get and use keys on a data.table

– How can we set the column origin as key in the data.table flights?

setkey(flights, origin)
head(flights)
#    year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014     1   1         4         0      AA    EWR  LAX      339     2454   18
# 2: 2014     1   1        -5       -17      AA    EWR  MIA      161     1085   16
# 3: 2014     1   1       191       185      AA    EWR  DFW      214     1372   16
# 4: 2014     1   1        -1        -2      AA    EWR  DFW      214     1372   14
# 5: 2014     1   1        -3       -10      AA    EWR  MIA      154     1085    6
# 6: 2014     1   1         4       -17      AA    EWR  DFW      215     1372    9

## alternatively we can provide character vectors to the function 'setkeyv()'
# setkeyv(flights, "origin") # useful to program with

  • You can use the function setkey() and provide the column names (without quoting them). This is helpful during interactive use.

  • Alternatively you can pass a character vector of column names to the function setkeyv(). This is particularly useful while designing functions to pass columns to set key on as function arguments.

  • Note that we did not have to assign the result back to a variable. This is because like the := function we saw in the “Introduction to data.table” vignette, setkey() and setkeyv() modify the input data.table by reference. They return the result invisibly.

  • The data.table is now reordered (or sorted) by the column we provided - origin. Since we reorder by reference, we only require additional memory of one column of length equal to the number of rows in the data.table, and is therefore very memory efficient.

  • You can also set keys directly when creating data.tables using the data.table() function using key argument. It takes a character vector of column names.

set* and :=:

In data.table, the := operator and all the set* (e.g., setkey, setorder, setnames etc..) functions are the only ones which modify the input object by reference.

Once you key a data.table by certain columns, you can subset by querying those key columns using the .() notation in i. Recall that .() is an alias to list().

– Use the key column origin to subset all rows where the origin airport matches “JFK”

flights[.("JFK")]
#        year month day dep_delay arr_delay carrier origin dest air_time distance hour
#     1: 2014     1   1        14        13      AA    JFK  LAX      359     2475    9
#     2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475   11
#     3: 2014     1   1         2         9      AA    JFK  LAX      351     2475   19
#     4: 2014     1   1         2         1      AA    JFK  LAX      350     2475   13
#     5: 2014     1   1        -2       -18      AA    JFK  LAX      338     2475   21
#    ---                                                                              
# 81479: 2014    10  31        -4       -21      UA    JFK  SFO      337     2586   17
# 81480: 2014    10  31        -2       -37      UA    JFK  SFO      344     2586   18
# 81481: 2014    10  31         0       -33      UA    JFK  LAX      320     2475   17
# 81482: 2014    10  31        -6       -38      UA    JFK  SFO      343     2586    9
# 81483: 2014    10  31        -6       -38      UA    JFK  LAX      323     2475   11

## alternatively
# flights[J("JFK")] (or) 
# flights[list("JFK")]

– How can we get the column(s) a data.table is keyed by?

Using the function key().

key(flights)
# [1] "origin"

c) Keys and multiple columns

To refresh, keys are like supercharged row names. We can set key on multiple columns and they can be of multiple types.

– How can I set keys on both origin and dest columns?

setkey(flights, origin, dest)
head(flights)
#    year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014     1   2        -2       -25      EV    EWR  ALB       30      143    7
# 2: 2014     1   3        88        79      EV    EWR  ALB       29      143   23
# 3: 2014     1   4       220       211      EV    EWR  ALB       32      143   15
# 4: 2014     1   4        35        19      EV    EWR  ALB       32      143    7
# 5: 2014     1   5        47        42      EV    EWR  ALB       26      143    8
# 6: 2014     1   5        66        62      EV    EWR  ALB       31      143   23

## or alternatively
# setkeyv(flights, c("origin", "dest")) # provide a character vector of column names

key(flights)
# [1] "origin" "dest"

  • It sorts the data.table first by the column origin and then by dest by reference.

– Subset all rows using key columns where first key column origin matches “JFK” and second key column dest matches “MIA”

flights[.("JFK", "MIA")]
#       year month day dep_delay arr_delay carrier origin dest air_time distance hour
#    1: 2014     1   1        -1       -17      AA    JFK  MIA      161     1089   15
#    2: 2014     1   1         7        -8      AA    JFK  MIA      166     1089    9
#    3: 2014     1   1         2        -1      AA    JFK  MIA      164     1089   12
#    4: 2014     1   1         6         3      AA    JFK  MIA      157     1089    5
#    5: 2014     1   1         6       -12      AA    JFK  MIA      154     1089   17
#   ---                                                                              
# 2746: 2014    10  31        -1       -22      AA    JFK  MIA      148     1089   16
# 2747: 2014    10  31        -3       -20      AA    JFK  MIA      146     1089    8
# 2748: 2014    10  31         2       -17      AA    JFK  MIA      150     1089    6
# 2749: 2014    10  31        -3       -12      AA    JFK  MIA      150     1089    5
# 2750: 2014    10  31        29         4      AA    JFK  MIA      146     1089   19

How does the subset work here?

  • It is important to undertand how this works internally. “JFK” is first matched against the first key column origin. And within those matching rows, “MIA” is matched against the second key column dest to obtain row indices where both origin and dest match the given values.

  • Since no j is provided, we simply return all columns corresponding to those row indices.

– Subset all rows where just the first key column origin matches “JFK”

key(flights)
# [1] "origin" "dest"

flights[.("JFK")] ## or in this case simply flights["JFK"], for convenience
#        year month day dep_delay arr_delay carrier origin dest air_time distance hour
#     1: 2014     1   1        10         4      B6    JFK  ABQ      280     1826   20
#     2: 2014     1   2       134       161      B6    JFK  ABQ      252     1826   22
#     3: 2014     1   7         6         6      B6    JFK  ABQ      269     1826   20
#     4: 2014     1   8        15       -15      B6    JFK  ABQ      259     1826   20
#     5: 2014     1   9        45        32      B6    JFK  ABQ      267     1826   20
#    ---                                                                              
# 81479: 2014    10  31         0       -18      DL    JFK  TPA      142     1005    8
# 81480: 2014    10  31         1        -8      B6    JFK  TPA      149     1005   19
# 81481: 2014    10  31        -2       -22      B6    JFK  TPA      145     1005   14
# 81482: 2014    10  31        -8        -5      B6    JFK  TPA      149     1005    9
# 81483: 2014    10  31        -4       -18      B6    JFK  TPA      145     1005    8

– Subset all rows where just the second key column dest matches “MIA”

flights[.(unique(origin), "MIA")]
#       year month day dep_delay arr_delay carrier origin dest air_time distance hour
#    1: 2014     1   1        -5       -17      AA    EWR  MIA      161     1085   16
#    2: 2014     1   1        -3       -10      AA    EWR  MIA      154     1085    6
#    3: 2014     1   1        -5        -8      AA    EWR  MIA      157     1085   11
#    4: 2014     1   1        43        42      UA    EWR  MIA      155     1085   15
#    5: 2014     1   1        60        49      UA    EWR  MIA      162     1085   21
#   ---                                                                              
# 9924: 2014    10  31       -11        -8      AA    LGA  MIA      157     1096   13
# 9925: 2014    10  31        -5       -11      AA    LGA  MIA      150     1096    9
# 9926: 2014    10  31        -2        10      AA    LGA  MIA      156     1096    6
# 9927: 2014    10  31        -2       -16      AA    LGA  MIA      156     1096   19
# 9928: 2014    10  31         1       -11      US    LGA  MIA      164     1096   15

What’s happening here?

2) Combining keys with j and by

All we have seen so far is the same concept – obtaining row indices in i, but just using a different method – using keys. It shouldn’t be surprising that we can do exactly the same things in j and by as seen from the previous vignettes. We will highlight this with a few examples.

a) Select in j

– Return arr_delay column as a data.table corresponding to origin = "LGA" and dest = "TPA".

key(flights)
# [1] "origin" "dest"
flights[.("LGA", "TPA"), .(arr_delay)]
#       arr_delay
#    1:         1
#    2:        14
#    3:       -17
#    4:        -4
#    5:       -12
#   ---          
# 1848:        39
# 1849:       -24
# 1850:       -12
# 1851:        21
# 1852:       -11

  • The row indices corresponding to origin == "LGA" and dest == "TPA" are obtained using key based subset.

  • Once we have the row indices, we look at j which requires only the arr_delay column. So we simply select the column arr_delay for those row indices in the exact same way as we have seen in Introduction to data.table vignette.

  • We could have returned the result by using with = FALSE as well.

    flights[.("LGA", "TPA"), "arr_delay", with = FALSE]

b) Chaining

– On the result obtained above, use chaining to order the column in decreasing order.

flights[.("LGA", "TPA"), .(arr_delay)][order(-arr_delay)]
#       arr_delay
#    1:       486
#    2:       380
#    3:       351
#    4:       318
#    5:       300
#   ---          
# 1848:       -40
# 1849:       -43
# 1850:       -46
# 1851:       -48
# 1852:       -49

c) Compute or do in j

– Find the maximum arrival delay correspondong to origin = "LGA" and dest = "TPA".

flights[.("LGA", "TPA"), max(arr_delay)]
# [1] 486

  • We can verify that the result is identical to first value (486) from the previous example.

d) sub-assign by reference using := in j

We have seen this example already in the Reference semantics vignette. Let’s take a look at all the hours available in the flights data.table:

# get all 'hours' in flights
flights[, sort(unique(hour))]
#  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

We see that there are totally 25 unique values in the data. Both 0 and 24 hours seem to be present. Let’s go ahead and replace 24 with 0, but this time using key.

setkey(flights, hour)
key(flights)
# [1] "hour"
flights[.(24), hour := 0L]
#         year month day dep_delay arr_delay carrier origin dest air_time distance hour
#      1: 2014     4  15       598       602      DL    EWR  ATL      104      746    0
#      2: 2014     5  22       289       267      DL    EWR  ATL      102      746    0
#      3: 2014     7  14       277       253      DL    EWR  ATL      101      746    0
#      4: 2014     2  14       128       117      EV    EWR  BDL       27      116    0
#      5: 2014     6  17       127       119      EV    EWR  BDL       24      116    0
#     ---                                                                              
# 253312: 2014     8   3         1       -13      DL    JFK  SJU      196     1598    0
# 253313: 2014    10