An R mystery
Someone recently sent a message to our development team's internal chatroom with the subject line "[s]ome bizarre R code behavior." We had used a typical workflow in which we copied one data frame into another, i.e., df_new <- df_original. This lets us do some data munging on the new data frame while leaving the original intact. Since R behaves as if assignment creates a distinct copy in memory, any changes to the new data frame should have no impact on the original.
That is not what was happening.
Instead, modifying the column names of the new data frame also modified them in the original! This caused a crash further along in the script, in code that depended on the original column names. In effect, R was behaving like languages where assignment only creates a pointer or reference, so changing the value through one variable changes it for both.
A simplified version of the buggy code would look something like this:
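The original listing is not reproduced here, so what follows is a minimal reconstruction rather than the team's actual script. The column values, the name new_name, and the final lookup are assumptions made for illustration, and the sketch needs the data.table package installed.

```r
library(data.table)

# Hypothetical data; the real script worked on a different data frame.
df_original <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"))

# Intended as an independent working copy.
df_new <- df_original

# Rename a column in the "copy" with the data.table helper.
setnames(df_new, "col1", "new_name")

# Later code still expects the original column name.
# try() lets the script continue so the failure can be inspected.
lookup <- try(df_original[, "col1"])
```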
This gives the error:
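The exact text depends on which function performs the lookup, and the original output is not reproduced here. As an assumption, a base R lookup by name on a data frame that no longer has col1 reports an undefined column. This small sketch, using a hypothetical stand-in data frame, captures that message as a string:

```r
# Stand-in for df_original after the accidental rename (hypothetical values).
df_renamed <- data.frame(new_name = 1:3, col2 = c("a", "b", "c"))

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  df_renamed[, "col1"],
  error = function(e) conditionMessage(e)
)
print(msg)
# [1] "undefined columns selected"
```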
Why doesn't the column col1 exist in the data frame df_original? df_original should be unchanged, and we never deliberately removed the column.
The hint is in the packages that we used, specifically, data.table. data.table is a powerful package that includes many convenience functions. One huge advantage of data.table over base R objects like data.frame is that data.table has better performance on large datasets. But in this case, the catch is how it achieves this performance gain. According to the documentation, "all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column." [https://cran.r-project.org/web/packages/data.table/data.table.pdf]
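To make the quoted sentence concrete, here is a small sketch of by-reference behavior on an actual data.table. The object names and values are made up for illustration.

```r
library(data.table)

dt <- data.table(col1 = 1:3, col2 = c("a", "b", "c"))

# Plain assignment binds a second name to the same data.table.
dt_alias <- dt

# setnames() edits the column names in place, without copying.
setnames(dt_alias, "col1", "new_name")

# Both names see the change because there is only one object.
print(names(dt))
print(names(dt_alias))
```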
In other words, data.table extends the base R data.frame to behave more like objects in some other programming languages: it allows modification in place in memory, rather than inefficient copying. This sounds like exactly what is going on in our buggy code, where we don't seem to have a separate copy of the data frame.
In this case, a data.table function sneaked into our code without us realizing it -- setnames(). Because we were mixing functions from various packages, the issue was not obvious to us.
So what is really going on?
A deep dive into R memory handling
Let’s use the lobstr package [https://cran.r-project.org/web/packages/lobstr/index.html] to investigate the structure of how these data frames are actually stored in memory.
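The original output is not shown here, and which lobstr function produced it is an assumption on my part. As a sketch with the same hypothetical two-column data frame, lobstr::ref() prints the tree of addresses and lobstr::obj_addr() returns the address of a single object. The actual hex values will differ on every machine and in every session.

```r
library(lobstr)

df_original <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"))
df_new <- df_original

# Print the address of each data frame and of each of its columns.
print(ref(df_original, df_new))

# Compare the top-level addresses directly.
same_object <- obj_addr(df_original) == obj_addr(df_new)
print(same_object)
```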
The possibly-intimidating-looking hex codes (e.g., 0x7fb6f3302008) give the address in memory of that object and all of its properties. If you look closely, you’ll see that all of the addresses are identical between df_original and df_new. So both variables are pointing at the same object in memory. How can this be, if R always makes a new copy when you assign a variable?
The answer is that despite R’s frequent copying, it tries to be efficient in its memory handling. When the df_new variable is first assigned, it just points to the same address in memory. But if any data is modified, then R creates a copy and modifies that:
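A minimal sketch of that copy-on-modify step, assuming a hypothetical two-column data frame and a whole-column replacement as the modification. lobstr::obj_addrs() returns one address per column, which makes it easy to see what was copied and what was not.

```r
library(lobstr)

df_original <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"))
df_new <- df_original

# Modify one column through base R; this triggers copy-on-modify.
df_new$col2 <- c("x", "y", "z")

# The two data frames are now different objects.
frames_shared <- obj_addr(df_original) == obj_addr(df_new)
print(frames_shared)

# Only the modified column was copied; the untouched one is still shared.
cols_shared <- obj_addrs(df_original) == obj_addrs(df_new)
print(cols_shared)
```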
You can see how the memory address of the whole object has changed from 0x7fb6f3302008 to 0x7fb6f2f389c8. And more importantly, the address of col2 has changed, while col1 has stayed the same.
We’re narrowing in on the problem with using a data.table function. Notice that the data frames have an attribute called names. This is the vector that holds the names of the columns. Also, notice that the address of this attribute is identical between the two data frames (0x7fb6f3a4dc88). Recall from the data.table documentation that “set” functions, including setnames(), change their input by reference. So if you call setnames() on df_new, it will change the single names vector in memory that both variables point to:
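Here is a sketch of that sequence, assuming the same hypothetical data frame as before. It reflects the behavior described in this post; whether the names vector stays shared after the base R modification may depend on the R and data.table versions in use.

```r
library(data.table)

df_original <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"))
df_new <- df_original

# Base R modification: df_new becomes a separate (shallow) copy,
# but the names vector is still shared with df_original.
df_new$col2 <- c("x", "y", "z")

# setnames() overwrites the shared names vector in place.
setnames(df_new, "col1", "new_name")

print(names(df_new))
print(names(df_original))  # per the behavior described above, also renamed
```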
If we were to use a base R function to rename the columns, such as names() or colnames(), rather than the data.table function, then R would simply copy the names attribute, just like it did above when we set a value in col2. So the solution to our buggy code from the top is to simply replace the call to setnames() with names():
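As a sketch of the corrected version (the data and the name new_name are again hypothetical), the rename goes through base R's names<- and the original data frame keeps its column:

```r
df_original <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"))
df_new <- df_original

# names<- follows copy-on-modify, so only df_new gets the new name.
names(df_new)[names(df_new) == "col1"] <- "new_name"

print(names(df_new))       # "new_name" "col2"
print(names(df_original))  # "col1" "col2"

# Code that depends on the original column name keeps working.
print(df_original[, "col1"])
```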
Only the column in df_new is renamed, and no errors!
The conclusion I drew from this is that while data.table can be very useful and efficient, there are pitfalls that developers need to watch out for. You need to be intentional about using data.table functions, and especially careful about using them on plain data frames rather than actual data.table objects.
One suggestion is to explicitly write the package name before functions, e.g., data.table::setnames(), to avoid confusion about which package a function comes from.
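A short sketch of what that can look like in practice, with made-up object names. It also uses data.table::copy(), which is not mentioned above but is data.table's own function for getting an independent copy before calling set* functions.

```r
# No library(data.table) call: every data.table function is spelled out.
dt_original <- data.table::data.table(col1 = 1:3, col2 = c("a", "b", "c"))

# copy() makes a real copy, so set* functions cannot reach the original.
dt_new <- data.table::copy(dt_original)
data.table::setnames(dt_new, "col1", "new_name")

print(names(dt_new))
print(names(dt_original))
```

The prefix makes every by-reference call stand out when reading the script, at the cost of a little extra typing.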
Thanks to CANA’s Renee Carlucci, Rick Hanson, and Rocky Graciani for helping to solve this R mystery.