Data Wrangling

How do we do it? 🤔

This diagram is taken from the R for Data Science book by Garrett Grolemund and Hadley Wickham, which it is a great resource for learning R. There is a whole community built around it and you could join it and start learning together: R4DS online learning community.

Dataset

To learn and practise how to organise data we will use a gapminder data set available from the gapminder package in R. This dataset is put into R by Jennifer Bryan from a tank of data sets available from Gapminder.

Gapminder is an independent Swedish foundation that helps to promote sustainable global development by collecting and analysing relevant data and by developing and designing teaching/learning tools. Gapminder was founded in Sweden by Hans Rosling who was a mastermind for distinctive and insightful storytelling about global development using visual animation.

You can see Hans in action in this BBC documentary The joy of Stats available on YouTube.

Gapminder Data

For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.

Before you can take a look at this data set first run the folowing code

# install necessary packages:
install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("gapminder", repos = "http://cran.us.r-project.org")
# have a look at the data 
head(gapminder::gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

💡Note that there are 6 columns, each of which we call a variable.

Description: Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.

The main data frame gapminder has 1704 rows and 6 variables: - country: factor with 142 levels - continent: factor with 5 levels - year: ranges from 1952 to 2007 in increments of 5 years - lifeExp: life expectancy at birth, in years - pop: population - gdpPercap: GDP per capita

gapminder::gapminder[1:3,]
## # A tibble: 3 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.

1st look at the data using the following functions: dim() & head()

library(gapminder)
dim(gapminder)
## [1] 1704    6
head(gapminder, n=10)
## # A tibble: 10 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.

Can you tell what each of the two functions does?!

Do we have the information about the structure of the data? 🤔 We can examine the structure using str() function, but the output could look messy and hard to follow if the data set is big. 🤪

str(gapminder) 
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

The dplyr Package

The dplyr provides a “grammar” (the verbs) for data manipulation and for operating on data frames in a tidy way. The key operator and the essential verbs are:

%>%: the “pipe” operator used to connect multiple verb actions together into a pipeline.

select(): return a subset of the columns of a data frame.

mutate(): add new variables/columns or transform existing variables.

filter(): extract a subset of rows from a data frame based on logical conditions.

arrange(): reorder rows of a data frame according to single or multiple variables.

summarise() / summarize(): reduce each group to a single row by calculating aggregate measures.

We can have a look at the data and its structure by using the glimpse() function from the dplyr package.

suppressPackageStartupMessages(library(dplyr))
glimpse(gapminder) 
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Af…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 148803…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.…

🤓💡: Notice how we can prevent display of the messages that appear when uploading the packages by using, in this case, suppressPackageStartupMessages()!

Now we have the dplyr package uploaded, let us learn its verbs. 😇


The pipeline operater: %>%

Left Hand Side (LHS) %>% Right Hand Side (RHS)

x %>% f(…, y)

f(x,y)

The “pipe” **passes the result of the *LHS as the 1st operator argument of the function on the RHS**.

3 %>% sum(4)      <==>        sum(3, 4)

%>% is very practical for chaining together multiple dplyr functions in a sequence of operations.


Pick variables by their names: select(),

  • starts_with("X") every name that starts with “X”.

  • ends_with("X") every name that ends with “X”.

  • contains("X") every name that contains “X”.

  • matches("X") every name that matches “X”, where “X” can be a regular expression.

  • num_range("x", 1:5) the variables named x01, x02, x03, x04, x05.

  • one_of(x) => every name that appears in x, which should be a character vector.


👉 Practice ⏰💻: Select your variables

  1. that ends with letter p

  2. starts with letter co. Try to do this selection using base R.

😃🙌 Solutions:

gm_pop_gdp <- select(gapminder, ends_with("p"))
head(gm_pop_gdp, n = 1)
## # A tibble: 1 x 3
##   lifeExp     pop gdpPercap
##     <dbl>   <int>     <dbl>
## 1    28.8 8425333      779.
gm_cc <- select(gapminder, starts_with("co"))
head(gm_cc, n = 1)
## # A tibble: 1 x 2
##   country     continent
##   <fct>       <fct>    
## 1 Afghanistan Asia

of course all of this could be done using base R for example:

gm_cc <- gapminder[c("country", "continent")]

but it’s less intuitive and often requires more typing.


Create new variables of existing variables: mutate()

It will allow you to add to the data frame df a new column, z, which is the multiplication of the columns x and y: mutate(df, z = x * y). If we would like to observe lifeExp measured in months we could create a new column lifeExp_month:

gapminder2 <- mutate(gapminder, LifeExp_month = lifeExp * 12) 
head(gapminder2, n = 2)
## # A tibble: 2 x 7
##   country     continent  year lifeExp     pop gdpPercap LifeExp_month
##   <fct>       <fct>     <int>   <dbl>   <int>     <dbl>         <dbl>
## 1 Afghanistan Asia       1952    28.8 8425333      779.          346.
## 2 Afghanistan Asia       1957    30.3 9240934      821.          364.

Pick observations by their values: filter()

There is a set of logical operators in R that you can use inside filter():

  • x < y: TRUE if x is less than y
  • x <= y: TRUE if x is less than or equal to y
  • x == y: TRUE if x equals y
  • x != y: TRUE if x does not equal y
  • x >= y: TRUE if x is greater than or equal to y
  • x > y: TRUE if x is greater than y
  • x %in% c(a, b, c): TRUE if x is in the vector c(a, b, c)
  • is.na(x): Is NA
  • !is.na(x): Is not NA

👉 Practice ⏰💻: Filter your data

Use gapminder2 df to filter:

  1. only European countries and save it as gapmEU

  2. only European countries from 2000 onward and save it as gapmEU21c

  3. rows where the life expectancy is greater than 80

Don’t forget to use == instead of =! and Don’t forget the quotes ""

😃🙌 Solutions:

gapmEU <- filter(gapminder2, continent == "Europe") 
head(gapmEU, 2)
## # A tibble: 2 x 7
##   country continent  year lifeExp     pop gdpPercap LifeExp_month
##   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>         <dbl>
## 1 Albania Europe     1952    55.2 1282697     1601.          663.
## 2 Albania Europe     1957    59.3 1476505     1942.          711.
gapmEU21c <- filter(gapminder2, continent == "Europe" & year >= 2000)
head(gapmEU21c, 2)
## # A tibble: 2 x 7
##   country continent  year lifeExp     pop gdpPercap LifeExp_month
##   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>         <dbl>
## 1 Albania Europe     2002    75.7 3508512     4604.          908.
## 2 Albania Europe     2007    76.4 3600523     5937.          917.
filter(gapminder2, lifeExp > 80)

Reorder the rows: arrange() is used to reorder rows of a data frame (df) according to one of the variables/columns.

  • If you pass arrange() a character variable, R will rearrange the rows in alphabetical order according to values of the variable.

  • If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).


👉 Practice ⏰💻: Arranging your data 1) Arrange countries in gapmEU21c df by life expectancy in ascending and descending order.

  1. Using gapminder df
  • Find the records with the smallest population

  • Find the records with the largest life expectancy.

😃🙌 Solution 1):

gapmEU21c_h2l <- arrange(gapmEU21c, lifeExp)
head(gapmEU21c_h2l, 2)
## # A tibble: 2 x 7
##   country continent  year lifeExp      pop gdpPercap LifeExp_month
##   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
## 1 Turkey  Europe     2002    70.8 67308928     6508.          850.
## 2 Romania Europe     2002    71.3 22404337     7885.          856.
gapmEU21c_l2h <- arrange(gapmEU21c, desc(lifeExp)) # note the use of desc()
head(gapmEU21c_l2h, 2)
## # A tibble: 2 x 7
##   country     continent  year lifeExp     pop gdpPercap LifeExp_month
##   <fct>       <fct>     <int>   <dbl>   <int>     <dbl>         <dbl>
## 1 Iceland     Europe     2007    81.8  301931    36181.          981.
## 2 Switzerland Europe     2007    81.7 7554661    37506.          980.

😃🙌 Solution 2):

arrange(gapminder, pop)
## # A tibble: 1,704 x 6
##    country               continent  year lifeExp   pop gdpPercap
##    <fct>                 <fct>     <int>   <dbl> <int>     <dbl>
##  1 Sao Tome and Principe Africa     1952    46.5 60011      880.
##  2 Sao Tome and Principe Africa     1957    48.9 61325      861.
##  3 Djibouti              Africa     1952    34.8 63149     2670.
##  4 Sao Tome and Principe Africa     1962    51.9 65345     1072.
##  5 Sao Tome and Principe Africa     1967    54.4 70787     1385.
##  6 Djibouti              Africa     1957    37.3 71851     2865.
##  7 Sao Tome and Principe Africa     1972    56.5 76595     1533.
##  8 Sao Tome and Principe Africa     1977    58.6 86796     1738.
##  9 Djibouti              Africa     1962    39.7 89898     3021.
## 10 Sao Tome and Principe Africa     1982    60.4 98593     1890.
## # … with 1,694 more rows
arrange(gapminder, desc(lifeExp))
## # A tibble: 1,704 x 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # … with 1,694 more rows

Collapse many values down to a single summary: summarise()

The syntax of summarise() is simple and consistent with the other verbs included in the dplyr package.

  • uses the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().

  • builds a new dataset that contains only the summarising statistics.

Objective Function Description
basic sum(x) Sum of vector x
centre mean(x) Mean (average) of vector x
median(x) Median of vector x
spread sd(x) Standard deviation of vector x
quantile(x, probs) Quantile of vector x
range(x) Range of vector x
diff(range(x))) Width of the range of vector x
min(x) Min of vector x
max(x) Max of vector x
abs(x) Absolute value of a number x

👉 Practice ⏰💻: Use summarise():

  1. to print out a summary of gapminder containing two variables: max_lifeExp and max_gdpPercap.

  2. to print out a summary of gapminder containing two variables: mean_lifeExp and mean_gdpPercap.

😃🙌 Solution: Summarise your data

summarise(gapminder, max_lifeExp = max(lifeExp), max_gdpPercap = max(gdpPercap))
## # A tibble: 1 x 2
##   max_lifeExp max_gdpPercap
##         <dbl>         <dbl>
## 1        82.6       113523.
summarise(gapminder, mean_lifeExp = mean(lifeExp), mean_gdpPercap = mean(gdpPercap))
## # A tibble: 1 x 2
##   mean_lifeExp mean_gdpPercap
##          <dbl>          <dbl>
## 1         59.5          7215.

Subsetting: group_by()

dplyr’s group_by() function enables you to group your data. It allows you to create a separate df that splits the original df by a variable.

The function summarise() can be combined with group_by().

Objective Function Description
Position first() First observation of the group
last() Last observation of the group
nth() n-th observation of the group
Count n() Count the number of rows
n_distinct() Count the number of distinct observations

👉 Practice ⏰💻: Subset your data

  1. Identify how many countries are given in gapminder data for each continent.

😃🙌 Solution:

gapminder %>%
     group_by(continent) %>%
     summarise(n_distinct(country))
## # A tibble: 5 x 2
##   continent `n_distinct(country)`
##   <fct>                     <int>
## 1 Africa                       52
## 2 Americas                     25
## 3 Asia                         33
## 4 Europe                       30
## 5 Oceania                       2

Let’s %>% all up!

You can try to get into the habit of using a shortcut for the pipe operator

🗣👥 Confer with your neighbours: What relationship do you expect to see between population size (pop) and life expectancy (lifeExp)?

Look what this code produces

gapminder_pipe <- gapminder %>%
  filter(continent == "Europe" & year ==  2007) %>%
  mutate(pop_e6 = pop / 1000000)
plot(gapminder_pipe$pop_e6, gapminder_pipe$lifeExp, cex = 0.5, col = "red")


tidyr

The tidyr can help you to create tidy data. Tidy data is data where:

  • Every column is a variable
  • Every row is an observation
  • Every cell is a single value

The tidyr package embraces the principles of tidy data and provides the standard key functionalities to organise data values within a dataset.

Hadley Wickham the author of the tidyr package talks in his paper Tidy Data about the importance of the data cleaning process and structuring datasets to facilitate data analysis.

Real datasets, that you are most likely to download from https://data.gov.rs/ or any other open source data platform, would often violate the three precepts of tidy data in all kinds of different ways:

  • Variables would not have their names and column headers are values.
  • A number of variables are stored in one column
  • A single variable is stored in several columns
  • Same information stored multiple times as different variables

to name a few.

To illustrate this, let us go back onto https://data.gov.rs/ and access Kvalitet Vazduha 2017. In particular, we want to access 2017-NO2.csv data.

## If you don't have tidyr installed yet, uncomment and run the line below
#install.packages("tidyr")
library(tidyr)
# access 2017-NO2.csv data
no2 <- read.csv("http://data.sepa.gov.rs/dataset/ca463c44-fbfa-4de9-9a75-790995bf2830/resource/74516688-5fb5-47b2-becc-6b6e31a24d80/download/2017-no2.csv",
                stringsAsFactors = FALSE, 
                fileEncoding = "latin1")
# have a look at the data
glimpse(no2)
## Observations: 365
## Variables: 8
## $ Datum                   <chr> "01.01.2017", "02.01.2017", "03.01.2017"…
## $ Novi.Sad.SPENS.NO2      <dbl> 22.89, 32.94, 14.86, 22.73, 20.89, 10.47…
## $ Beograd.Mostar.NO2      <dbl> 28.83, 39.12, 25.20, 28.48, 27.82, 41.91…
## $ Beograd.Vraèar.NO2      <dbl> 182.84, 244.56, 147.49, 159.12, 124.16, …
## $ Beograd.Zeleno.brdo.NO2 <dbl> 36.32, 36.68, 27.00, 35.15, 24.92, 8.27,…
## $ Kragujevac..NO2         <dbl> 38.92, 43.50, 40.17, 45.44, 43.04, 34.55…
## $ U.ice..NO2              <dbl> 49.84, 50.34, 36.35, 49.07, 33.33, 39.88…
## $ Ni..IZJZ.Ni...NO2       <dbl> 18.61, 22.87, 15.68, 21.97, 21.68, 12.11…

It shows that this data set has 365 observations and 8 variables. Nonetheless, we need to consider what type of information we have here:

  • date NO2 measurement taken: given in a single column -> ✅ tidy🙂
  • places where NO2 measurement was taken: given in several columns -> ❎ tidy🙁
  • NO2 measurements: given in several columns -> ❎ tidy🙁

Hmmm… 🤔 This doesn’t look tidy at all 😳

This data is about measurement level of NO2(µg/m3) in several different towns/places, which means that NO2 is our main response variable. The way in which this variable is given in this data is certainly not tidy. It defeats the key principles of tidy data: Every column is a variable and furthermore, Every row is NOT an observation.

It appears that this data has 8 variables, but we have realised that there are only 3: date, place and no2. To tidy it, we need to stack it by turning columns into rows. We are happy with the variable date and it should remain as a single column, the other 7 columns we want to convert into two variables: place and no2.

To make wide format data into tall format we have to turn columns into rows using gather() function.

We will create variable place in which we will hold the headers as given in the columns 2:8. The values inside those columns will be saved in the new variable no2.

new_no2 <- no2 %>%
  gather("place", "no2", -Datum, factor_key = TRUE) # stack all columns apart from `Datum`
glimpse(new_no2)
## Observations: 2,555
## Variables: 3
## $ Datum <chr> "01.01.2017", "02.01.2017", "03.01.2017", "04.01.2017", "0…
## $ place <fct> Novi.Sad.SPENS.NO2, Novi.Sad.SPENS.NO2, Novi.Sad.SPENS.NO2…
## $ no2   <dbl> 22.89, 32.94, 14.86, 22.73, 20.89, 10.47, 9.58, 15.99, 14.…

Let us see the names of the places

new_no2 %>%
     group_by(place) %>%
     summarise(n())
## # A tibble: 7 x 2
##   place                   `n()`
##   <fct>                   <int>
## 1 Novi.Sad.SPENS.NO2        365
## 2 Beograd.Mostar.NO2        365
## 3 Beograd.Vraèar.NO2        365
## 4 Beograd.Zeleno.brdo.NO2   365
## 5 Kragujevac..NO2           365
## 6 U.ice..NO2                365
## 7 Ni..IZJZ.Ni...NO2         365

Those names look very messy. We could use function from stringr package str_sub(). To begin with let’s remove .NO2 at the end of each name.

## If you don't have stringr installed yet, uncomment and run the line below
#install.packages("stringr")
library(stringr)
new_no2$place <- new_no2$place %>% 
    str_sub(end = -5)
glimpse(new_no2)
## Observations: 2,555
## Variables: 3
## $ Datum <chr> "01.01.2017", "02.01.2017", "03.01.2017", "04.01.2017", "0…
## $ place <chr> "Novi.Sad.SPENS", "Novi.Sad.SPENS", "Novi.Sad.SPENS", "Nov…
## $ no2   <dbl> 22.89, 32.94, 14.86, 22.73, 20.89, 10.47, 9.58, 15.99, 14.…
new_no2 %>%
    group_by(place) %>%
    summarise(n())
## # A tibble: 7 x 2
##   place               `n()`
##   <chr>               <int>
## 1 Beograd.Mostar        365
## 2 Beograd.Vraèar        365
## 3 Beograd.Zeleno.brdo   365
## 4 Kragujevac.           365
## 5 Ni..IZJZ.Ni..         365
## 6 Novi.Sad.SPENS        365
## 7 U.ice.                365

It still doesn’t look right. 😟 This could be a tedious job. 😥 It is no wonder that many data analysts grumble about time spent on the process of cleaning and preparing the data. It could be a very long and time consuming process, but the more you do it and the more experience you gain the easier and less painful it gets.

Perhaps, you can try to explore other available packages in R that could help you with organising your data into your ideal format. To give you an idea we will show you how it can easily be done when using forcats::fct_recode() function.

## If you don't have forcats installed yet, uncomment and run the line below
#install.packages("forcats")
library(forcats)
new_no2 <- no2 %>%
  gather("place", "no2", -Datum, factor_key = TRUE) %>% # stack all columns apart from `Datum`
  mutate(place = fct_recode(place, 
                            "NS_Spens" = "Novi.Sad.SPENS.NO2",
                            "BG_Most" = "Beograd.Mostar.NO2",
                            "BG_Vracar" = "Beograd.Vraèar.NO2", 
                            "BG_ZelenoBrdo" = "Beograd.Zeleno.brdo.NO2", 
                            "KG" = "Kragujevac..NO2", 
                            "NI" = "Ni..IZJZ.Ni...NO2",
                            "UZ" = "U.ice..NO2"))
glimpse(new_no2)
## Observations: 2,555
## Variables: 3
## $ Datum <chr> "01.01.2017", "02.01.2017", "03.01.2017", "04.01.2017", "0…
## $ place <fct> NS_Spens, NS_Spens, NS_Spens, NS_Spens, NS_Spens, NS_Spens…
## $ no2   <dbl> 22.89, 32.94, 14.86, 22.73, 20.89, 10.47, 9.58, 15.99, 14.…

By now, you should have gained enough knowledge in using R to give you the necessary confidence to start exploring other functions of the tidyr package. You should not stop there, but go beyond and explore the whole of the tidyverse opinionated collection of R packages for data science. 😇🎶

To learn more about tidy data in r check Data Tidying section from the famous Data Science with R by Garrett Grolemund

Have you tried learning data science by posting your questions and discussing it with other people within the R community? 👥💻📊📈🗣 RStudio Community


YOUR TURN 👇

Practise by doing the following set of exercises:

  1. Install and upload the rattle package and see what it does.

  2. Create a new R script file to explore weatherAUS dataset.

  1. select() variable: MinTemp, MaxTemp, Rainfall and Sunshine by pipping the dataset into `dplyr::select() function.

  2. produce a summary using base::summary() function of these numeric values.

  3. within this selection filter only those observations where Rainfall >= 1 and save the results into the computer’s memory (ie. save the results as an object).

  4. Try to think of how else you can use other dplyr verbs on this weatherAUS dataset. Write your question first, before embarking on typing the code.

  5. Write a short report on what visualisation you think would be interesting to produce for this weatherAUS dataset and why?


Useful Links

Data Wrangling cheat sheet

Data Transformation with dplyr cheat sheet


© 2020 Tatjana Kecojevic