An Idea From UNIX for R Tidyverse Pipelines

Rick Hanson
Feb 13, 2023
Updated: Nov 28, 2023

R’s Tidyverse is a collection of R packages that work together with a common functional interface to accomplish data transformations and analysis in the R programming language.

To see it in action, let’s consider the following sequence of data transformations on some dataframe (called df) as an example.

library(tidyverse)

df <- read_csv("some_file.csv")
df <- group_by(df, name, a, b, c)
df <- summarize(df, qty = sum(qty))
df <- ungroup(df)
df <- mutate(df, version = "20230130")

Each of the functions group_by( ), summarize( ), ungroup( ), and mutate( ) comes from the Tidyverse collection (specifically, the dplyr package). Notice that the first argument of each of these functions is the data (usually a dataframe) to be operated upon, and that it is the data just updated on the previous line.

However, the standard practice is to code such a sequence, not as consecutive assignment statements, as above, but as ONE expression where each step is a “link” in a “chain” of transformations, as in the following code.

df <- read_csv("some_file.csv") %>%
      group_by(name, a, b, c)   %>%
      summarize(qty = sum(qty)) %>%
      ungroup()                 %>%
      mutate(version = "20230130")

Notice that each step of the transformation is now “linked” to the next with the infix operator %>% (sometimes called “pipe”). Also, note that the first argument of each Tidyverse function is now dropped. That is because the passing of the data from one transformation function to the next is handled by the pipe operator %>%. Eliminating the repeated “noise” on each line also makes the code more readable. The chain of such transformations is called a “pipeline.”
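To make the equivalence concrete, here is a minimal sketch that runs both styles on a small made-up dataframe and compares the results. The toy data and the names step and piped are assumptions for illustration only, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for the contents of some_file.csv
df <- tibble(
  name = c("x", "x", "y"),
  a    = c(1, 1, 2),
  qty  = c(10, 5, 7)
)

# Step-by-step style: the data is passed explicitly as the first argument
step <- group_by(df, name, a)
step <- summarize(step, qty = sum(qty), .groups = "drop")

# Pipeline style: %>% supplies the left-hand side as the first argument
piped <- df %>%
  group_by(name, a) %>%
  summarize(qty = sum(qty), .groups = "drop")

# Compare the two results
all.equal(step, piped)
```

Here .groups = "drop" is used in place of a separate ungroup( ) call to keep the sketch short.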

Code Usage Example

Now, here is a usage of the above idea that we see a lot in our code. If we want to create a dataframe A for output (and possibly use A for further processing), we like to write code like the following (where [step i] is not real R code, but represents some Tidyverse function call, as in the concrete example above).

A <- [step 1] %>%
     [step 2] %>%
     [step 3] %>%
     [step 4] %>%
     [step 5]


write_csv(A, "A.csv")

But in our codebase, we've seen a variant of this code too, where we want to peek at what the data looks like after step 3, but before going to step 4, as in the following:

A_intermediate <- [step 1] %>%
                  [step 2] %>%
                  [step 3]


write_csv(A_intermediate, "A_intermediate.csv")


A <- A_intermediate %>%
     [step 4] %>%
     [step 5]


write_csv(A, "A.csv")

In practice, the variable names often don't say "intermediate," and this fragmentation makes the pipeline harder to read, especially when the “peeking” occurs several times in a long pipeline.

To the rescue comes tee, an old UNIX utility. UNIX shell expressions have pipelines too -- I believe that R pipelines are inspired by them (or by something else that was in turn inspired by UNIX shell pipelines).

A UNIX pipeline looks like this:

[step 1] | [step 2] | [step 3]

where each [step i] is some UNIX command, and each command is “linked” to the next by the | character (analogous to %>% in Tidyverse pipelines). If a call to the utility tee is inserted into the pipeline, e.g., as follows:

[step 1] | [step 2] | tee foo.txt | [step 3]

then the file foo.txt has the intermediate result of the pipeline after step 2, but before step 3, and can be inspected for audit or debug purposes.

Using this idea, we write an R function called tee( ) so that our code looks more like the original (non-fragmented) version of the pipeline:

A <- [step 1] %>%
     [step 2] %>%
     [step 3] %>%
     tee("A_intermediate.csv") %>%
     [step 4] %>%
     [step 5]


write_csv(A, "A.csv")

The definition of tee( ) is simply: grab the given dataframe, write it to a file and then pass it along (for further processing down the rest of the pipeline).

tee <- function(df, filename)
{
  write_csv(df, filename)  # save a snapshot of the data at this point
  df                       # return the data unchanged for the next step
}
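As a quick check of how tee( ) behaves, the following sketch runs it inside a short pipeline on made-up data and then reads the snapshot back. The toy data, the temporary file path peek_file, and the names result and peek are all hypothetical; the sketch assumes dplyr and readr are installed.

```r
library(dplyr)
library(readr)

# tee( ) as defined in this article
tee <- function(df, filename)
{
  write_csv(df, filename)
  df
}

# Hypothetical snapshot location; a temp file keeps the sketch self-contained
peek_file <- tempfile(fileext = ".csv")

result <- tibble(name = c("x", "x", "y"), qty = c(10, 5, 7)) %>%
  group_by(name) %>%
  summarize(qty = sum(qty), .groups = "drop") %>%
  tee(peek_file) %>%                 # snapshot taken before the next step
  mutate(version = "20230130")

# The snapshot holds the data as it was before mutate( ) added a column
peek <- read_csv(peek_file, show_col_types = FALSE)
names(peek)
names(result)
```

The snapshot file should contain only the columns that existed at the point where tee( ) was called, while result carries the extra version column added afterwards.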

More Tee Fun

We can simplify our new code even further by reusing tee( ) for writing the final version of A also.

A <- [step 1] %>%
     [step 2] %>%
     [step 3] %>%
     tee("A_intermediate.csv") %>%
     [step 4] %>%
     [step 5] %>%
     tee("A.csv")

This puts all the processing for dataframe A in one expression, making it easy for the reader to see it as completely self-contained. This final version is the form of the data transformation found in our codebases.
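A related option, offered here only as a sketch and not as the approach used in our codebase, is the "tee pipe" %T>% from the magrittr package, which calls the function on its right for its side effect and then passes the left-hand side along. This can be handy for writers such as base R's write.csv( ), which returns nothing useful. The toy data and the names log_file and out below are assumptions for illustration.

```r
library(dplyr)
library(magrittr)  # provides the tee pipe %T>%

# Hypothetical snapshot location
log_file <- tempfile(fileext = ".csv")

out <- tibble(x = 1:3) %>%
  mutate(y = x * 2) %T>%                     # %T>% forwards the data, not the return value
  write.csv(log_file, row.names = FALSE) %>%
  mutate(z = y + 1)

# The snapshot has the columns present before the final mutate( )
snapshot <- read.csv(log_file)
names(snapshot)
names(out)
```

Compared with this operator, a named tee( ) function has the advantage that the reader sees the word "tee" and the filename together on one line of the pipeline.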

Summary

We can see how the Tidyverse package collection is an extremely useful toolkit for data analysis. Tidyverse functions have a common interface, which can be leveraged so that we can code our data transformations as easy-to-read pipeline expressions. Easier reading entails easier development, fewer coding mistakes, and easier debugging, if it should come to that. We have additionally leveraged an idea from UNIX to help us “peek” into the partially transformed data and understand what it looks like at a particular point in the transformation pipeline.

Rick Hanson is our Senior Operations Research Analyst here at CANA. You can reach him at rhanson@canallc.com or on LinkedIn.

