The Basics of Data Manipulation

The Basics

This tutorial shows you the five key functions from the dplyr package which enables you to solve most of the common data manipulation challenges:

  • Learn how to pick cases or observations depending on their values with filter ().
  • Arrange and rearrange the order of the dataset with arrange ().
  • Select variables by their names with select ().
  • Extend the dataset and create new variables with mutate ().
  • Summarize and collapse the data set with summarise ().

We use the mtcars data set, which is implemented in R and so you can easily run the code on your own device. In order to do so, you need to have access to the dyplr package. So, please check whether library(tidyverse) is installed at your own machine - which includes the dyplr package - by loading the tidyverse package. You can install the package by typing install.packages("tidyverse"), in case you are working with it for the very first time.

More importantly, this is an interactive tutorial, which means you can run R code as you see below, adjust examples of the command I provided for you and hopefully learn R much faster.

So, let’s pRactice! We can start by looking at the data frame and get familiar. The data set comprises fuel consumption and other aspects about cars (e.g mpg: Miles/(US) gallon, cyl: Number of cylinders). To see the first few observations from the data frame, we have to call the data set by typing head(mtcars) .

head(mtcars)
#Ask R to provide a data description from the help file by typing ?mtcars, if you are not familiar with the data set.

Filter ()

Use filter () to create a subset of your data based on specific values.

  1. First, you have to provide the name of your data frame. As I said, we are working with the mtcars data set.
  2. Use mathematical arguments to subset and filter your data. Here, I specify that the data set contains only cars with more than 100 hp (gross horsepower): hp > 100
filter(mtcars, hp > 100)
## Hint 1: Use comparison operators to select observations: >, >=, <, <=, != (not equal), and == in case you want to tell R that something should be equal.
##Hint 2: You need to provide the variable name if you combine several condition
filter(mtcars, hp > 100 & hp < 200)

By running this line of code, dplyr executes the filtering operation and returns a new data frame. dplyr never modify the existing data frame, so if you want to save the result, you’ll need to use the assignment operator: <-

And you can use the boolean operations to combine several condition: & (and), | (or), ! (not) are the most common operators.

Practice

  1. Now its your turn, adjust the code above so that only cars with an automatic transmission (stored in the variable am: 0 = automatic, 1 = manual) will be in the data frame.

  2. Can you adjust the code that two condition have to be fullfilled simultaneously. For example, choose cars with more than 100 hp and less then 200 hp.

Arrange ()

You can arrange the order of your data frame with arrange (). For example, we can sort the cars in the data set from the lowest to highest horsepower.

arrange (mtcars, hp)
#Use `desc(variablename)` to order the data frame in descending order :-)

In some instances you may wonder where you missing values have been gone. NA or missing values are always sorted at the end of the rows. Lets create a small data frame with 3 values only, and the last one is missing value. Now we can see how the data is sorted:

df <- tibble(x = c(5, 2, NA)) #.
arrange(df, x)

Select ()

In this tutorial we use nice and handy data sets which are implemented in R. In real life, you will encounter messy data or you have access to big data set from an official resources. Select () helps us to focus on the part or subset of the data set we are actually interesed in. Certainly, the mtcars data is not messy nor big, but we can learn the principals how to select columns anyways.

We can simply use select () to specifiy which variables (e.g. mpg, hp) we want to select from the mtcars data set. R provides us with a new data frame which contains the selected variables only.

select(mtcars, mpg, hp)

Select () comes with a few nice features and functions we can use:

#Selects all columns between var1 to var2, as a starting and endpoint
select(dataset, var1:var2)

#Select all columns except those from var1 to var2
select(dataset, -(var1:var2))

#Select all variables which start with text string such as "var_"
select(dataset, starts_with("var_"))
## The same applies for the end of the variable name: ends_with("xyz")

#And in some instances you have a running number (X1,.,X5) you can also select
select(dataset, num_range("x", 1:5))

Practice

Give it a try and get familiar with select ():

  1. Select all columns from the mtcars data between horsepower (hp) and automatic transmission (am).
  2. Select all variables which start with a “d”.
  3. Can you imagine how the everything() command works? With this command you can move a list of variables to the start of the data frame, everything else keeps its order. Give it a try or look at the hint.
head(mtcars)
select(dataset, var1, var2, everything())

Mutate ()

Often, the raw data does not contain the variables in a way we need them. In such instances we have to transform the data, create new variables depending on the values of the raw data. Here comes mutate () into play, which adds a generated variable (or column) at the end of your dataset. First, we need a narrower data set, otherwise you cannot see the added variables directly. Let’s select hp and gear from the mtcars data set. Then, we can create some nonsense variables for illustration purposes.

df <- select(mtcars, hp, gear) #a subset of the data set

mutate(df,
       some_variable = hp * gear,
       other_variable = some_variable/hp
)

Thus, with mutate () we can simply extend our data set and generate new variables. We can even use the first nonsense variable (some_variable) to create an other_variable which depends on the first one we created.

In general, use arithmetic operators (+, -, *, /, ^), logical comparisons (<, <=, ==, =>, >, !) and other functions (log, rankings, etc.) to create the needed variable.

Summarise ()

We can easily calcuate measurement of centrals tendicies (e.g. means, modus) with the summarise () function, which collapses a data into a single row. For instance, let’s calculate the mean value of horsepower for all observed cars.

summarise(mtcars, mean_hp = mean(hp))

Most of the time we are not interest in the overall mean, instead we want to compare an outcome for different groups (e.g. men vs. women). For instance, we believe that cars with different transmissions (am: 0 = automatic, 1 = manual) may differ in terms of horsepower. This is certainly a stupid example, but we can use it to learn how to calculate group-specific means. Therefore, we have to group and store our data first with compare_group <- group_by(mtcars, am) and then estimate the mean value.

compare_group <- group_by(mtcars, am)
summarise(compare_group, hp_mean = mean(hp))

This goes way easier with the pipe %>%. The pipe is a key element in the tidyverse package and it is implemented in many instances. It simply puts the steps of your analysis forward, so you can combine several lines of code in one command. Sure, in our example we have only two lines of code, but even in this case, it is easier to read if you get used to the pipe. Just say: “next or then” in case you see a %>%. Let’s see how it work.

mtcars %>% 
  group_by(am) %>% 
  summarise(
    mean_hp = mean(hp)
  )

Thus, first we call the data frame %>%, group the data %>% and call the summary function in which we can store different estimates for the groups. I guess the main idea is clear, let’s see how summarise () works in practice.

Practice

There are several useful summary functions: To get the number of observation per group, add count = n() to the summary function and make sure that your summary does not rely on an unlawful small number of observations. Calculate the median (median(x)) instead of the mean, add the standard deviation (sd(x)) and count the number of unique values (n_distinct(x)) .

#adjust me!
mtcars %>% 
  group_by(am) %>% 
  summarise(
    mean_hp = mean(hp)
  )
mtcars %>% 
  group_by(am) %>% 
  summarise(
    median_hp = median(hp),
    count = n(),
    sd = sd(hp),
    n_distinct(hp)
  )

The End

Congratualtions, first mission accomplished! You know now how to handle and manipulate data, which gives you the skills to jump right into your first analysis. Let my give you a final recommendation and a short outview for the next pRactice session, before you start on your own.

I know from my personal experience, that R can be awkward to learn and everyone makes a lot of mistakes in the beginning. It’s pointless to say, but you will only learn it by writing many lines of code which probably include an abundance of mistakes and error. So, don’t think an of error as something you did wrong, it’s simply something that you didn’t achieved yet :). Certainly, there is no need to make our lives even harder then it is. R has a big community and a lot of information is available online. Moreover, there are also several cheat sheet for different R packages available. For example, have a look at the dplyr cheat sheet provided by Garrett Grolemund, which gives a compact overview about the possibilities at your hand.

In the next session we dig a little bit deeper into the tidyverse to learn more about data management in R. Garrick Aden-Buie made some pretty nice animations to visualize how the tidyverse works. For example, we can make a full_join to combine every peace of information from two different data frames.

Source: Garrick Aden-Buie

Source: Garrick Aden-Buie