Pitfalls With Using the data.table Package

Aaron Luprek
Jan 10, 2022

An R mystery

Someone recently sent a message to our development team's internal chatroom with the subject line "[s]ome bizarre R code behavior." We had used a typical workflow in which we copied one data frame into another, i.e., df_new <- df_original. This lets us do some data munging on the new data frame while leaving the original intact. Since assignment in R is supposed to behave as though it creates a distinct copy in memory, any changes to the new data frame should have no impact on the original.

That is not what was happening.

Instead, modifying the column names of the new data frame also modified them in the original! This caused a crash further along in the script, in code that depended on those column names. In effect, R was behaving like programming languages in which assignment just creates a pointer or reference, so changing the value through one variable changes it for both.

A simplified version of the buggy code would look something like this:
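A minimal sketch of that pattern, assuming the data.table package is installed. The names df_original, df_new, col1, and col2 come from this article; the column values and the new name new_col1 are made up for illustration.

```r
library(data.table)

# Hypothetical example data; only the object and column names come from the article.
df_original <- data.frame(col1 = 1:3, col2 = c(10, 20, 30))

# Intended as an independent working copy.
df_new <- df_original

# Rename a column in the "copy" using data.table's setnames().
setnames(df_new, "col1", "new_col1")

# Later code still expects col1 in the original, but the name is gone.
names(df_original)

# try() lets the script continue after the failure ("undefined columns selected").
try(df_original[, "col1"])
```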

This gives the error:

The Problem

Why doesn't the column col1 exist in the data frame df_original? df_original should be unchanged, and we never deliberately removed the column.

The hint is in the packages that we used, specifically, data.table. data.table is a powerful package that includes many convenience functions. One huge advantage of data.table over base R objects like data.frame is that data.table has better performance on large datasets. But in this case, the catch is how it achieves this performance gain. According to the documentation, "all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column." [https://cran.r-project.org/web/packages/data.table/data.table.pdf]

In other words, data.table extends the base R data.frame to behave more like some other programming languages: it allows modification in place in memory rather than inefficient copying. That sounds like exactly what is going on in our buggy code, where we don't seem to have a separate copy of the data frame.

In this case, a data.table function, setnames(), had sneaked into our code without our realizing it. Because we were mixing functions from various packages, the issue was not obvious to us.

So what is really going on?

A deep dive into R memory handling

Let’s use the lobstr package [https://cran.r-project.org/web/packages/lobstr/index.html] to investigate the structure of how these data frames are actually stored in memory.
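One way to run this kind of inspection is sketched below. It assumes lobstr's obj_addr() and obj_addrs() helpers and the same made-up example data as before; it is not necessarily the call used to produce the output discussed here, and the addresses printed will differ from session to session.

```r
library(lobstr)

# Hypothetical example data.
df_original <- data.frame(col1 = 1:3, col2 = c(10, 20, 30))
df_new <- df_original

# Address of each data frame as a whole.
obj_addr(df_original)
obj_addr(df_new)

# Addresses of the individual columns.
obj_addrs(df_original)
obj_addrs(df_new)

# Address of the names attribute (the vector of column names).
obj_addr(names(df_original))
obj_addr(names(df_new))
```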

This returns:

The possibly-intimidating-looking hex codes (e.g., 0x7fb6f3302008) give the address in memory of that object and all of its properties. If you look closely, you’ll see that all of the addresses are identical between df_original and df_new. So both variables are pointing at the same object in memory. How can this be, if R always makes a new copy when you assign a variable?

The answer is that despite R’s frequent copying, it tries to be efficient in its memory handling. When the df_new variable is first assigned, it just points to the same address in memory. But if any data is modified, then R creates a copy and modifies that:
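A small sketch of this copy-on-modify behavior, again assuming lobstr and made-up example values. It changes one value in col2 with base R and then compares addresses before drawing any conclusion.

```r
library(lobstr)

# Hypothetical example data.
df_original <- data.frame(col1 = 1:3, col2 = c(10, 20, 30))
df_new <- df_original

# Modify a single value in col2 of the new data frame using base R.
df_new$col2[1] <- 99

# Do the two variables still point at the same data frame object?
obj_addr(df_original) == obj_addr(df_new)

# Column-by-column comparison: the untouched col1 is still shared,
# while the modified col2 now lives at a new address.
obj_addrs(df_original) == obj_addrs(df_new)

# The original keeps its old value.
df_original$col2[1]
```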

Result:

You can see how the memory address of the whole object has changed from 0x7fb6f3302008 to 0x7fb6f2f389c8. And more importantly, the address of col2 has changed, while col1 has stayed the same.

We’re narrowing in on the problem with using a data.table function. Notice that the data.frames have an attribute called names. This is the vector that holds the names of the columns. Also, notice that the address of this attribute is identical between the two data frames (0x7fb6f3a4dc88). Recall from the data.table documentation that “set” functions, including setnames(), change their input by reference. So if you call setnames() on df_new, it will change the single value in memory that both variables point to:
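The effect can be sketched as follows, assuming both data.table and lobstr are installed and reusing the made-up example data. The address comparison before and after the call is there to show that no copy of the names vector is made.

```r
library(data.table)
library(lobstr)

# Hypothetical example data.
df_original <- data.frame(col1 = 1:3, col2 = c(10, 20, 30))
df_new <- df_original

# Before renaming: do both variables share one names vector?
obj_addr(names(df_original)) == obj_addr(names(df_new))

# setnames() edits the names vector in place rather than copying it.
setnames(df_new, "col1", "new_col1")

# After renaming: compare the addresses again, then look at both sets of names.
obj_addr(names(df_original)) == obj_addr(names(df_new))
names(df_original)
names(df_new)
```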

Returns:

If we were to use a base R function to rename the columns, such as names() or colnames(), rather than the data.table function, then R would simply copy the names attribute, just like it did above when we set a value in col2. So the solution to our buggy code from the top is to simply replace the call to setnames() with names():
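A sketch of that fix using only base R, with the same made-up example data and the illustrative new name new_col1:

```r
# Hypothetical example data.
df_original <- data.frame(col1 = 1:3, col2 = c(10, 20, 30))
df_new <- df_original

# Base R replacement function: the names attribute is copied before it is changed.
names(df_new)[names(df_new) == "col1"] <- "new_col1"

names(df_original)  # "col1" "col2"
names(df_new)       # "new_col1" "col2"

# Code that depends on col1 in the original still works.
df_original[, "col1"]
```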

Result:

Only the column in df_new is renamed, and no errors!

Lesson learned?

The conclusion I drew from this is that while data.table can be very useful and efficient, there are pitfalls that developers need to watch out for. You need to be intentional about using data.table functions, and be especially careful when applying them to plain data frames rather than actual data.table objects.

One suggestion is to write the package name explicitly before functions, e.g., data.table::setnames(), to avoid confusion about which package a function comes from.
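As a sketch, the namespaced call might look like this. The use of data.table::copy() to take an explicit deep copy first is an additional assumption on my part, not something the team's fix relied on; it is one possible way to keep a by-reference function away from the original when you do want to use it.

```r
# Hypothetical example data.
df_original <- data.frame(col1 = 1:3, col2 = c(10, 20, 30))

# Assumption: take an explicit deep copy so in-place changes cannot reach the original.
df_new <- data.table::copy(df_original)

# The package prefix makes the by-reference call easy to spot when reading the code.
data.table::setnames(df_new, "col1", "new_col1")

names(df_original)
names(df_new)
```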

Thanks to CANA’s Renee Carlucci, Rick Hanson, and Rocky Graciani for helping to solve this R mystery.

Aaron Luprek is a Senior Software Developer here at CANA. You can contact Aaron at aluprek@canallc.com or LinkedIn.
