Chapter 2 Data Input, Output, and Exploration

2.1 Goals

By the end of this chapter, you will be able to:

Read a CSV file into your R environment
Explore different attributes of a dataframe
- column names
- row names
- more with attributes()
- Subset a dataframe by rows and columns
- Filter a dataframe by the values of a column
Use two data exploration and cleaning functions
- Frequency counts of unique values with table()
- Search for any duplicates within your data with duplicated()
Write from your R environment to a CSV file

2.2 Reading in Data

2.2.1 Read CSV Files

store the filepath within a character variable
the filepath could be relative to your working directory, or absolute
read.csv is a function that could read CSV files on your computer, and returns a dataframe for your R environment to use

# define relative filepath
pathToAdosFile <- './datasets/adosm1.csv'

# Use read.csv() and filepath variable as first argument to function
adosm1 <- read.csv(pathToAdosFile, 
                   stringsAsFactors = FALSE)
# stringsAsFactors is an optional argument that tells read.csv to
# not turn a string column into a R factor variable

# adosm1 is now considered a datraframe

2.3 Observing Dataframe attributes

2.3.1 `View`

To view your dataframe call the View() function.
Note the capital View().

View(adosm1)

# After Viewing, you can click on columns to sort

2.3.2 See Attributes

display column names names()
display rownames with row.names()
see all attributes with attributes(dataframe)

# display column names
# names(dataframe), we will use our adosm1 dataframe
# wrap function with head() to show first 6
head(names(adosm1))

## [1] "id"                "visit"             "cbe_36"           
## [4] "recruitment_group" "gender"            "ados_version"

# display rownames, wrap function with head() to show first 6
head(row.names(adosm1))

## [1] "1" "2" "3" "4" "5" "6"

2.4 Subsetting Data

In our case, adosm1 is within our R environment, so we would call
- adosm1[namesOfRows, namesOfColumns]

2.4.1 Subset by columns

Note: it is best practice to subset columns by their names, not their numbers
- This is because subsetting by column names is more specific and less error-prone
- For ex. this can cause a problem if the columns were reordered

# To access the id column, we would write
adosm1[, "id"]
# left of comma is empty to signify all rows to be accessed
View(adosm1[, "id"])

# To access all rows, for columns id and visit
adosm1[, c("id", "visit")]

# in order to make code less verbose, when accessing one column,
# we use the $ operator
# to access the id column
adosm1$id

Summary for accessing columns:
- For multiple columns
- adosm1[, c("nameOfColumn1" "nameOfColumn2")]
- For one column
- adosm1$nameOfColumn

2.4.2 Subset by rows

We could similarly access rows by row numbers
- adosm1[c(1, 2, 5), ]
We could also access rows by their row names
- adosm1[c("1", "2", "5")]
Lastly, we could access rows with a logical vector (explained later)

# access 1st, 2nd, and 5th rows
# right of comma is empty to signify all columns to be accessed
adosm1[c(1, 2, 5), ]

2.4.3 Subset by rows and columns

# access 1st, 2nd, and 5th rows, within id and visit columns
adosm1[c(1, 2, 5), c("id", "visit")]

2.5 Filtering a dataframe: Subsetting part 2

2.5.1 Subset with a logical vector

What is very powerful in this is we could subset a dataframe in order to answer specific questions
This is done by using a logical vector to subset or slice a dataframe

2.5.2 Subset logical vector example

# To Filter for all visits at 36 months:
# Create logical vector
visits36mo <- adosm1$visit == "36"
# This will return a vector that has a length equal to the 
# number of rows in our dataframe
head(visits36mo)  # TRUE are values = 36, FALSE not = 36

# use  our visits36mo  logical vector to filter dataframe
adosm1[visits36mo, ]  # returns  a dataframe of all visits at 36mo

adosm1[visits36mo, "id"] # returns all the ids with a visit at 36mo

# returns alg and version at 36 months
adosm1[visits36mo, c("ados_algorithm", "ados_version")]

2.6 Functions to explore data or find unclean data

2.6.1 `table()`

# To get a frequency count of  all unique values,
# we use the  table()

# table() takes in 1 or more vectors as input
# since every dataframe column could be considered a vector (or list)
# we could check for every unique ados version with freq counts with
table(adosm1$ados_version)

## 
## ADOS-1 ADOS-2 
##    986     43

# we could check  for every unique combination of vectors by 
# providing multiple vector inputs

# To see every unique combination of visit and ados algorithm
table(adosm1$visit, adosm1$ados_algorithm)

##     
##          few to no words no words some words
##   18   2             158       90        212
##   21   0               0        7          2
##   24   3              52       32        292
##   30   0               6        0         32
##   36   5               4       15        117

2.6.2 Checking for duplicates: `duplicated()`

We will first generate a dataframe that has duplicates, to show how to detect duplicates by different subgroups

dupDf <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                    visit = c('10 months', '20 months',  '10 months', '10 months', '20 months', '20 months'),
                    item1 = c(1, 2, 3, 3, 5, 6))

The duplicated function returns a logical vector of the duplicates that exist in a dataframe
The TRUE values represent the index locations of where a duplicate row was found
We could use this logical vector to see where a duplicate exists

duplicateIndices <- duplicated(dupDf)  # store dup vector in variable

duplicateIndices  # view logical vector of duplicates

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

dupDf[duplicateIndices, ]

##   id     visit item1
## 4  2 10 months     3

2.6.3 `Duplicated()` by key variables: id and visit

we could check for duplicates by columns that should represent a unique combination.
in this example: id and visit

# to check duplicates by groups
dupIndGrouped <- duplicated(dupDf[, c("id", "visit")])
dupDf[dupIndGrouped, ]

##   id     visit item1
## 4  2 10 months     3
## 6  3 20 months     6

View(dupDf[dupIndGrouped, ])  # View duplicatecases by id and visit

2.7 Assignment 1: adosm2 dataset

2.7.1 Instructions

Read in the adosm2.csv file
What is the frequency count for every unique visit?
What are the ids that exist for the 30 month visit?
Are there duplicates by id and visit?
What is the frequency count of duplicates vs non-duplicates?

2.7.2 Solution Assignment 1

# 1. Read in the adosm2.csv file
adosm2Path <- './datasets/adosm2.csv'
adosm2 <- read.csv(adosm2Path, stringsAsFactors = FALSE)

# 2. What is the frequency count for every unique visit?
table(adosm2$visit)

## 
##         24         30         36 School-Age 
##         53         13        473          8

# 3, What are the ids that exist for the 30 month visit?
adosm2[adosm2$visit == "30", "id"]

##  [1]   2  45  62 111 119 132 139 258 261 272 305 486 497

# This works too- adosm2$id[adosm2$visit == "30"]

# 4. Are there duplicates by id and visit?
table(duplicated(adosm2[, c("id", "visit")]))

## 
## FALSE 
##   547

# 5. What are the frequency count of duplicates vs non-duplicates
table(duplicated(adosm2))

## 
## FALSE 
##   547

# There are no duplicates, so no need to filter out duplicates

Chapter 2 Data Input, Output, and Exploration

2.1 Goals

2.2 Reading in Data

2.2.1 Read CSV Files

2.3 Observing Dataframe attributes

2.3.1 View

2.3.2 See Attributes

2.4 Subsetting Data

2.4.1 Subset by columns

2.4.2 Subset by rows

2.4.3 Subset by rows and columns

2.5 Filtering a dataframe: Subsetting part 2

2.5.1 Subset with a logical vector

2.5.2 Subset logical vector example

2.6 Functions to explore data or find unclean data

2.6.1 table()

2.6.2 Checking for duplicates: duplicated()

2.6.3 Duplicated() by key variables: id and visit

2.7 Assignment 1: adosm2 dataset

2.7.1 Instructions

2.7.2 Solution Assignment 1

2.3.1 `View`

2.6.1 `table()`

2.6.2 Checking for duplicates: `duplicated()`

2.6.3 `Duplicated()` by key variables: id and visit