Intro to R notes (from Rachael McBride)

Related event: Introduction to R (Sat 26 Sep 2015)




R - the basics

Interactive console

Unlike some languages, R has an interactive console. This allows you to try out your code before you execute it as a script. It is particularly useful for exploratory data analysis i.e. when you first get a data set and are trying to understand its properties and characteristics.

10 + 5
## [1] 15


Primitive data types and variables

Primitive data types are the basic building blocks of a programming language. We will examine numerical, textual and boolean types here. We also introduce variables. A variable is a symbolic name associated with a value; that value may be changed later.

Numerical:

a = 10
b = 5.1

Above, we have assigned the value 10 to a variable named a and the decimal 5.1 to a variable named b. R stores both as the numeric type (double-precision floating point, often called a float in other languages). A true integer is only created if you ask for one, e.g. by writing 10L.

We can now use the variables a and b to perform a number of operations or calculations:

a + b
## [1] 15.1
a * b
## [1] 51
a / b
## [1] 1.960784
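
To check how R actually stores a value, typeof() and class() can be used. The short sketch below re-creates a and b and adds one extra variable, whole (a name chosen here purely for illustration), to show the L suffix that requests an integer.

```r
a = 10
b = 5.1
whole = 10L          # hypothetical variable; the L suffix requests an integer

typeof(a)            # "double"
typeof(b)            # "double"
typeof(whole)        # "integer"
class(a)             # "numeric"
is.numeric(whole)    # TRUE: integers also count as numeric
```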


Textual:

Text-based values are stored in a primitive data type known as a string (R itself calls this type character).

"learning R"
## [1] "learning R"
test_string_1 = "learning R"
test_string_1
## [1] "learning R"

Here, we have assigned the text “learning R” to the string variable test_string_1.

Boolean:

The boolean primitive data type can have one of two values - TRUE or FALSE.

test_boolean_1 = TRUE
test_boolean_2 = T
test_boolean_3 = FALSE
test_boolean_4 = F

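In R, T and F can be used as shorthand for TRUE and FALSE. Note that T and F are ordinary variables that can be overwritten, whereas TRUE and FALSE are reserved words, so the full names are the safer choice in scripts.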

test_boolean_1
## [1] TRUE
test_boolean_2
## [1] TRUE
test_boolean_3
## [1] FALSE
test_boolean_4
## [1] FALSE
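
The difference between T and TRUE can be seen in a small experiment. This is only a sketch; shadowed and restored are names made up for the illustration.

```r
T = FALSE            # allowed, because T is only a variable
shadowed = T         # hypothetical name; now holds FALSE

# TRUE = FALSE       # not allowed: TRUE is a reserved word, so this line would fail

rm(T)                # delete our variable; T falls back to R's built-in value
restored = T         # hypothetical name; holds TRUE again

shadowed             # FALSE
restored             # TRUE
```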


Variable names

In R, variable names:

  1. Are case sensitive e.g. variable ‘a’ is not the same as variable ‘A’

  2. Cannot begin with a number e.g. a variable called ‘1a’ is not accepted by R, but a variable called ‘a1’ is.

Calling a built-in function

Let's try R's built-in function for calculating the square root of a number:

sqrt(9)
## [1] 3
sqrt(25)
## [1] 5


We can assign the results of the function to variables that we can use later

sqrt_of_9 = sqrt(9)
sqrt_of_25 = sqrt(25)
sqrt_of_9
## [1] 3
sqrt_of_25
## [1] 5
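
Other built-in functions are called in the same way, and many accept named arguments. A few examples, assuming nothing beyond base R (rounded_root is a name invented for this sketch):

```r
abs(-4)                                      # 4
round(3.14159, digits = 2)                   # 3.14
max(c(3, 9, 4))                              # 9

rounded_root = round(sqrt(2), digits = 3)    # hypothetical variable name
rounded_root                                 # 1.414
```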




Finding your way around R


Q. How do I find out more about my current session in R?

A. Try:

 sessionInfo()

This is useful to know when installing libraries, as not all libraries are available for every version of R.

Q. How do I see what libraries are already installed?

A. Try:

 library()



Q. How do I find out more about a library, a topic etc.?

A. Use R’s help system. If you know the specific name:

 help("mean")

or

 ?mean


If you have an idea of what you are looking for, but are not quite sure what it is called, try:

 help.search("cluster")

or

 ??clust

This is similar to running a fuzzy matching search.



Q. Ok, the ‘cluster’ library looks interesting. How do I use it?

A. Use library() to load the library of interest. For example:

library("cluster") 



Q. I would like to use a package or library that is currently not installed on my computer. How can I install it?

A. Use install.packages("name_of_package", dependencies = T). See ?install.packages for more.



Q. I want to play around with R more. Are there any test data sets I can use?

A. R comes with test data sets. Run data() to list them. For example, to load the data on the survival of passengers on the Titanic, use data(Titanic).
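
Once a data set is loaded, a few base functions give a first impression of it. The sketch below uses the Titanic data mentioned above; titanic_df is a name chosen for illustration.

```r
data(Titanic)                          # load the built-in data set

class(Titanic)                         # "table": a 4-way table of counts
dim(Titanic)                           # 4 2 2 2
names(dimnames(Titanic))               # "Class" "Sex" "Age" "Survived"
sum(Titanic)                           # 2201 people in total

titanic_df = as.data.frame(Titanic)    # hypothetical name; one row per combination
head(titanic_df)                       # first rows, with a Freq column of counts
```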


Primary data structures in R

The primary data structures are:

  • vector

  • dataframe

  • matrix

  • list

The vector

Create a vector

A vector allows you to store a collection of elements of the same type.

transport = c("car", "bus", "plane")
lotto = c(7, 22, 32, 34, 40, 42)



Add names to the elements of the vector

To add names to an existing vector:

names(transport) = c("road", "bus lane", "plane")


To check:

transport
##     road bus lane    plane 
##    "car"    "bus"  "plane"
names(transport)
## [1] "road"     "bus lane" "plane"


To add names as you create a vector:

ages = c("Ann" = 2, "Barry" = 4, "Bosco" = 7)
ages
##   Ann Barry Bosco 
##     2     4     7
names(ages)
## [1] "Ann"   "Barry" "Bosco"



Get the length of a vector (or other objects)

Use length(). Note: This can be used on a number of different objects.

length(transport)
## [1] 3
length(ages)
## [1] 3



Exercise:

Peter, Bob and Jill each have a number of jelly babies. Peter has 6, Bob has 8 and Jill has 10. Each person eats 2 of their jelly babies. They then discover another packet of jelly babies behind the couch with 6 jelly babies in it. They decide to share it: Peter gets 4, Bob gets 2 fewer than Peter and Jill gets 2 fewer than Bob. Use vectors in R to calculate how many each person has in the end.

  1. Create a vector, called ‘start’, that reflects how many each person starts out with.

  2. Create a vector called ‘eats’, that represents how many jelly babies each person eats. (Hint: Try ?rep)

  3. Create a vector called ‘gets’, that represents how many jelly babies from the new bag each person gets. (Hint: Try ?seq)

  4. Subtract ‘eats’ from ‘start’ and add ‘gets’
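
One possible solution is sketched below. It is only one way of following the four steps; the element names and the variable final are choices made for this illustration.

```r
start = c("Peter" = 6, "Bob" = 8, "Jill" = 10)   # step 1
eats  = rep(2, times = 3)                        # step 2: everyone eats 2
gets  = seq(from = 4, to = 0, by = -2)           # step 3: 4, 2, 0

final = start - eats + gets                      # step 4 (hypothetical name)
final                                            # Peter, Bob and Jill each end with 8
```

Arithmetic on vectors is carried out element by element, so no loop is needed.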


Access elements of the vector

By condition: Access elements by filtering on a particular condition.

lotto
## [1]  7 22 32 34 40 42
lotto > 33
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE
lotto[lotto > 33]
## [1] 34 40 42


By location: Access elements by location.

lotto[1]
## [1] 7
lotto[2:3]
## [1] 22 32
lotto[3:length(lotto)]
## [1] 32 34 40 42
lotto[length(lotto):4]
## [1] 42 40 34
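
A few more ways of picking out elements, shown as a self-contained sketch that re-creates lotto and ages from earlier:

```r
lotto = c(7, 22, 32, 34, 40, 42)
ages = c("Ann" = 2, "Barry" = 4, "Bosco" = 7)

lotto[c(1, 6)]             # several locations at once: 7 42
lotto[-1]                  # a negative index drops an element: 22 32 34 40 42
which(lotto > 33)          # locations where the condition holds: 4 5 6
ages["Barry"]              # by name: the element named Barry, value 4
ages[c("Ann", "Bosco")]    # several names at once
```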



Concatenate vectors

To join two or more vectors together:

a = c(1, 2, 3, 4)
b = c(5, 6, 7, 8)
together = c(a, b)
together
## [1] 1 2 3 4 5 6 7 8
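
A vector holds a single type, so c() converts its inputs to a common type when they differ. A short sketch (mixed and numbers_and_flags are names invented here):

```r
mixed = c(1, "two", TRUE)               # hypothetical name
mixed                                   # "1" "two" "TRUE"
typeof(mixed)                           # "character"

numbers_and_flags = c(1, TRUE, FALSE)   # hypothetical name
numbers_and_flags                       # 1 1 0
typeof(numbers_and_flags)               # "double"
```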




The list

A data container that can store different types of data structures at the same time.

Create a list

age = 3
allergies = TRUE
friends = c("Joe", "Tyler", "Nina")
child_1 = list("age" = age, "allergies" = allergies , "friends" = friends)
child_2 = list("age" = 2, "allergies" = FALSE, friends = "James", "note" = "Will not eat fish fingers")
child_1
## $age
## [1] 3
## 
## $allergies
## [1] TRUE
## 
## $friends
## [1] "Joe"   "Tyler" "Nina"
child_2
## $age
## [1] 2
## 
## $allergies
## [1] FALSE
## 
## $friends
## [1] "James"
## 
## $note
## [1] "Will not eat fish fingers"


Add to a list i.e. create a list of lists

children = list("Ann" = child_1, "Tomasz" = child_2)
children
## $Ann
## $Ann$age
## [1] 3
## 
## $Ann$allergies
## [1] TRUE
## 
## $Ann$friends
## [1] "Joe"   "Tyler" "Nina" 
## 
## 
## $Tomasz
## $Tomasz$age
## [1] 2
## 
## $Tomasz$allergies
## [1] FALSE
## 
## $Tomasz$friends
## [1] "James"
## 
## $Tomasz$note
## [1] "Will not eat fish fingers"


Accessing elements in a list

By name:

names(child_1)
## [1] "age"       "allergies" "friends"
child_1[["age"]]
## [1] 3
child_1$age
## [1] 3
names(children)
## [1] "Ann"    "Tomasz"
children[["Tomasz"]]
## $age
## [1] 2
## 
## $allergies
## [1] FALSE
## 
## $friends
## [1] "James"
## 
## $note
## [1] "Will not eat fish fingers"
children$"Tomasz"
## $age
## [1] 2
## 
## $allergies
## [1] FALSE
## 
## $friends
## [1] "James"
## 
## $note
## [1] "Will not eat fish fingers"
children$Tomasz$note
## [1] "Will not eat fish fingers"
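
Single brackets and double brackets behave differently on a list: [ ] returns a smaller list, while [[ ]] and $ return the element itself. The sketch below re-creates child_1; the note text is made up for the example.

```r
child_1 = list("age" = 3, "allergies" = TRUE, "friends" = c("Joe", "Tyler", "Nina"))

class(child_1["age"])           # "list": still a list, with one element
class(child_1[["age"]])         # "numeric": the element itself
child_1[["friends"]][2]         # "Tyler": index into the extracted vector

child_1$note = "Likes apples"   # hypothetical value; adds a new element
names(child_1)                  # "age" "allergies" "friends" "note"
```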




The dataframe

A ‘table’ of data.

Create a dataframe

Use data.frame()

dframe = data.frame(transport, ages)
dframe
##          transport ages
## road           car    2
## bus lane       bus    4
## plane        plane    7


Explore

Get summary statistics, row and column names, dimensions and the first few rows of the dataframe (head() shows the first 6 rows by default)

summary(dframe)
##  transport      ages      
##  bus  :1   Min.   :2.000  
##  car  :1   1st Qu.:3.000  
##  plane:1   Median :4.000  
##            Mean   :4.333  
##            3rd Qu.:5.500  
##            Max.   :7.000
rownames(dframe)
## [1] "road"     "bus lane" "plane"
colnames(dframe)
## [1] "transport" "ages"
dim(dframe)
## [1] 3 2
head(dframe)
##          transport ages
## road           car    2
## bus lane       bus    4
## plane        plane    7


Access particular entries in a dataframe

You can access particular entries in a dataframe by specifying the names and/or locations of the rows and columns of interest.

By location:

dframe[, 2]     # Access the 2nd column
## [1] 2 4 7
dframe[3, ]     # Access the 3rd row
##       transport ages
## plane     plane    7
dframe[3, 2]    # Access the value in the third row, second column
## [1] 7
dframe[1:2,]    # Access the first 2 rows
##          transport ages
## road           car    2
## bus lane       bus    4


By name:

dframe[,'ages']     # Access the 'ages' column
dframe['plane',]    # Access the 'plane' row


Or a mixture of both:

dframe[2:3, 'ages']    # Access the 2nd and 3rd values in the 'ages' column
## [1] 4 7

Note: To access values in a data frame, remember [row, column] or ‘RC’

Note that the row returned is itself a data frame, which R stores internally as a list:

result = dframe[3,]
result
##       transport ages
## plane     plane    7
typeof(result)
## [1] "list"
names(result)
## [1] "transport" "ages"
result$ages
## [1] 7
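
Columns can also be reached with $, and rows can be filtered by a condition in the same way as vector elements. A self-contained sketch, rebuilding dframe from named vectors (older is a name chosen for illustration):

```r
transport = c("road" = "car", "bus lane" = "bus", "plane" = "plane")
ages = c("Ann" = 2, "Barry" = 4, "Bosco" = 7)
dframe = data.frame(transport, ages)

dframe$ages                         # the 'ages' column: 2 4 7
older = dframe[dframe$ages > 3, ]   # hypothetical name; rows where ages exceeds 3
older                               # the 'bus lane' and 'plane' rows
nrow(older)                         # 2
```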


Add another column

The data frame dframe contains ages. Let's add the names of the people for these ages.

ages
##   Ann Barry Bosco 
##     2     4     7


Extract the names associated with each age:

names(ages)
## [1] "Ann"   "Barry" "Bosco"


Add the column to dframe using cbind():

dframe = cbind(dframe, names(ages))
head(dframe)
##          transport ages names(ages)
## road           car    2         Ann
## bus lane       bus    4       Barry
## plane        plane    7       Bosco
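
As an alternative sketch, a column can be added and named in one step by assigning to it with $. The example rebuilds dframe so that it runs on its own.

```r
transport = c("road" = "car", "bus lane" = "bus", "plane" = "plane")
ages = c("Ann" = 2, "Barry" = 4, "Bosco" = 7)
dframe = data.frame(transport, ages)

dframe$names = names(ages)   # creates a new column called 'names'
colnames(dframe)             # "transport" "ages" "names"
dframe$names                 # "Ann" "Barry" "Bosco"
```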


Re-name the columns

Use colnames(). Row names can be renamed in a similar fashion using rownames()

colnames(dframe) = c("modes of transport", "ages", "names")




The matrix

Create a matrix

m = matrix(data = seq(1, 8), nrow = 2, ncol = 4, byrow = T)

See ?matrix() for more details.
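
The byrow argument controls whether the values fill the matrix row by row or column by column. A small comparison (by_row and by_col are names invented for this sketch):

```r
by_row = matrix(data = seq(1, 8), nrow = 2, ncol = 4, byrow = TRUE)
by_col = matrix(data = seq(1, 8), nrow = 2, ncol = 4, byrow = FALSE)

by_row[1, ]    # first row when filled by row:    1 2 3 4
by_col[1, ]    # first row when filled by column: 1 3 5 7
dim(by_row)    # 2 4 (rows, then columns)
```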

Give it row names and column names

In two separate steps:

rownames(m) = c("row1", "row2")
colnames(m) = c("column1", "column2", "column3", "column4")


Or in one single step:

dimnames(m) = list(c("row1", "row2"),
                   c("column1", "column2", "column3", "column4"))



Accessing particular values in a matrix

Similar to accessing particular values in a data frame

Recall, m is:

m
##      column1 column2 column3 column4
## row1       1       2       3       4
## row2       5       6       7       8


Get the contents of the row named “row1”

m["row1",]
## column1 column2 column3 column4 
##       1       2       3       4


Get the contents of the column named “column2”

m[, "column2"]
## row1 row2 
##    2    6


Get the contents of the cell in row “row1”, column “column2”

m["row1", "column2"]
## [1] 2


Similarly, location can be used.

Get the contents of the first row:

m[1,]
## column1 column2 column3 column4 
##       1       2       3       4


Get the contents of the second column:

m[,2]
## row1 row2 
##    2    6


Get the contents of the cell in the first row and second column

m[1, 2]
## [1] 2




Data I/O

There are many ways to get data in and out of R.

R objects

To save:

To save one object:

save(dframe, file = "dframe_output.RData")

To save multiple R objects to a single file:

save(list = c("dframe", "children"), file = "dframe_and_children_output.RData")


To load:

First remove traces of existing versions from your R session:

rm(list = c("dframe", "children"))       # Remove the existing variables
ls()                                     # Verify that they have been successfully removed
##  [1] "a"              "age"            "ages"           "allergies"     
##  [5] "b"              "child_1"        "child_2"        "friends"       
##  [9] "lotto"          "m"              "result"         "sqrt_of_25"    
## [13] "sqrt_of_9"      "test_boolean_1" "test_boolean_2" "test_boolean_3"
## [17] "test_boolean_4" "test_string_1"  "together"       "transport"

Next read in the fresh variables

load("dframe_and_children_output.RData") # Load the variables back into R
ls()                                     # Verify that they have been loaded successfully 
##  [1] "a"              "age"            "ages"           "allergies"     
##  [5] "b"              "child_1"        "child_2"        "children"      
##  [9] "dframe"         "friends"        "lotto"          "m"             
## [13] "result"         "sqrt_of_25"     "sqrt_of_9"      "test_boolean_1"
## [17] "test_boolean_2" "test_boolean_3" "test_boolean_4" "test_string_1" 
## [21] "together"       "transport"
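
The same save, remove and load cycle can be tried without leaving files behind by writing to a temporary file. In this sketch, scores, path and restored_names are all names made up for the illustration.

```r
scores = c(10, 20, 30)                   # hypothetical object to save
path = tempfile(fileext = ".RData")      # temporary file, removed when R exits

save(scores, file = path)
rm(scores)                               # scores no longer exists in the session

restored_names = load(path)              # load() returns the names it restored
restored_names                           # "scores"
scores                                   # 10 20 30
```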



Text files

Vector and data frame contents can be stored and read in from text files. For this, R has a collection of built-in functions for text files of various formats.

Writing contents to a text file

The base function is write.table(). Its help file provides a detailed description of the different function arguments available to you. The other data output functions in this help file are variants of write.table() with different default argument values.

?write.table()


Tab-delimited files:
To write out to a tab-delimited file with column names and rownames:

write.table(x = dframe,
            file = "dframe-tab_delim.txt",
            quote = F,
            sep = "\t")


Comma-separated files: To write out to a comma-separated file with column names and row names:

write.csv(x = dframe,
          file = "dframe-comma_separated.csv",
          quote = F)



Reading in contents from a text file:

The base function is read.table(). Its help file provides a detailed description of the different function arguments available to you. The other data input functions in this help file are variants of read.table() with different default argument values.

?read.table()


Tab-delimited files:
To read in a tab-delimited file with column names and rownames:

file_contents = read.table(file = "dframe-tab_delim.txt",
                           header = T,
                           sep = "\t")


Comma-separated files: To read in a comma-separated file with column names and row names:

file_contents = read.csv(file = "dframe-comma_separated.csv",
                         row.names = 1)
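
A quick way to check that writing and reading agree is a round trip through a temporary file. The sketch below builds a small data frame like dframe directly; path and round_trip are names chosen for illustration.

```r
dframe = data.frame(transport = c("car", "bus", "plane"),
                    ages = c(2, 4, 7),
                    row.names = c("road", "bus lane", "plane"))

path = tempfile(fileext = ".csv")                    # temporary file
write.csv(x = dframe, file = path)                   # row names go in the first column
round_trip = read.csv(file = path, row.names = 1)    # read them back as row names

rownames(round_trip)    # "road" "bus lane" "plane"
colnames(round_trip)    # "transport" "ages"
round_trip$ages         # 2 4 7
```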



Reading from a database:

R has libraries that connect to a number of databases, allowing you to read data from and write data to them. Below is a list of types of databases and R libraries that can be used to connect to them.

  • MySQL: RMySQL

  • Microsoft SQL: RODBC

  • PostgreSQL: RPostgreSQL

  • MongoDB: RMongo, rmongodb

Please see individual libraries for more.




Control structures

Control structures allow you to execute different code depending on a given condition of a variable or parameter.

If…else:

Allows you to execute a piece of code if, and only if, a given condition is met. Otherwise, another piece of code is executed.

In other words,

  if (condition 1 is TRUE) {

      then execute this piece of code….

  } else {

      execute this code instead….

  }


For example,

current_bank_balance = 10

if(current_bank_balance > 0){        
  print(paste("You have E", current_bank_balance, " in your account", sep = ""))    # Execute if current_bank_balance is greater than zero
}else{
  print("Oh-oh! You are out of money") # Otherwise, execute this
}
## [1] "You have E10 in your account"

Re-run the above, varying the value of current_bank_balance
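
More than two cases can be handled by chaining conditions with else if. A minimal sketch, where the three status labels are made up for the example:

```r
current_bank_balance = -20    # try other values, e.g. 0 or 50

if(current_bank_balance > 0){
  status = "in credit"        # balance above zero
}else if(current_bank_balance == 0){
  status = "empty"            # balance exactly zero
}else{
  status = "overdrawn"        # balance below zero
}
print(status)                 # "overdrawn"
```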

ifelse:

ifelse() is a more concise, vectorised alternative to if...else: it tests each element of its first argument and returns one value per element. Its use may decrease the readability of your code.

ifelse(current_bank_balance > 0, 
       paste("You have E", current_bank_balance, " in your account", sep = ""),  
       "Oh-oh! You are out of money")
## [1] "You have E10 in your account"
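
Because ifelse() works element by element, it can label a whole vector in one call. In this sketch, balances and labels are hypothetical names.

```r
balances = c(10, 0, -5)    # hypothetical account balances

labels = ifelse(balances > 0, "in credit", "out of money")
labels                     # "in credit" "out of money" "out of money"
```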



The switch function:

if...else is useful when there are two scenarios or cases to consider. If there are more than two scenarios or cases, consider using ‘switch’.

In other words,

    switch (input,

            "case1" = return_value_for_case1,
            "case2" = return_value_for_case2,
            "case3" = return_value_for_case3,
            ...)


For example,

animal = "horse"
type = switch(animal, "horse" = "mammal", "snake" = "reptile", "trout" = "fish")
print(type)
## [1] "mammal"

Re-run for an animal type of snake and trout
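
If the input matches none of the cases, switch returns NULL invisibly. A final unnamed value acts as a default. The sketch below wraps the example in a function, get_type, a name invented for illustration.

```r
get_type = function(animal){
  switch(animal,
         "horse" = "mammal",
         "snake" = "reptile",
         "trout" = "fish",
         "unknown")          # unnamed last value: used when nothing matches
}

get_type("snake")            # "reptile"
get_type("cat")              # "unknown"
```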

Exercise

A bank has 3 different account types: current, savings_1 and savings_2 account. Each account has the following characteristics:

  • Current: current interest rate = 0.05%

  • Savings_1: current interest rate = 1.2%

  • Savings_2: current interest rate = 2%


Create a switch statement that will return the correct rate for each bank account type.


Loops

Loops are a way to re-run code until a given condition is met. Two types of loops in R are ‘for’ and ‘while’

For loop:

Repeats a portion of code for each element in a vector

In other words,

      for (each_element in a_vector) {
          execute this code
      }


For example,

# Print out each element of the transport vector
transport = c("car", "bus", "plane")
names(transport) = c("road", "bus lane", "plane")

for(each in transport){
  print("====================")   # Acts as a visual separator
  print(each)
}

## [1] "===================="
## [1] "car"
## [1] "===================="
## [1] "bus"
## [1] "===================="
## [1] "plane"


for loops can also be used to keep track of the index or location of each element in a vector, as in the following example:

# Print out each element of the transport vector

for(index in 1:length(transport)){
  print("====================")   # Acts as a visual separator
  print(paste("Element ", index, ": ", transport[index], sep = ""))
}

## [1] "===================="
## [1] "Element 1: car"
## [1] "===================="
## [1] "Element 2: bus"
## [1] "===================="
## [1] "Element 3: plane"
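
One caveat with 1:length(x) is that it counts backwards when x is empty, so the loop body runs on locations that do not exist. seq_along() avoids this. A sketch, with empty and iterations as made-up names:

```r
empty = character(0)         # a vector with no elements

1:length(empty)              # 1 0: two locations, neither of which exists
seq_along(empty)             # integer(0): nothing to loop over

iterations = 0
for(index in seq_along(empty)){
  iterations = iterations + 1
}
iterations                   # 0: the loop body never ran
```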



While loop:

Continues to execute a portion of code as long as a given condition is met. When the condition is no longer met, the loop is exited.

In other words,

      while (condition is met) {
           execute this code
     }


For example,

count = 0

while(count < 10){
  print("====================")   # Acts as a visual separator
  print(count)                    # Print the current value of count
  count = count + 2               # Increment count by 2
}

## [1] "===================="
## [1] 0
## [1] "===================="
## [1] 2
## [1] "===================="
## [1] 4
## [1] "===================="
## [1] 6
## [1] "===================="
## [1] 8


Another example:

remainder = TRUE      # Initialise the starting condition
count = 0             # Initialise a counter

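while(remainder){
  print("====================")   # Acts as a visual separator
  print(count)                    # Print the current value of count
  count = count + 1
  if(count %% 6 == 0){            # TRUE when count divides by 6 with no remainder
    remainder = FALSE
    print("No remainder. Exiting loop...")
  }
}

## [1] "===================="
## [1] 0
## [1] "===================="
## [1] 1
## [1] "===================="
## [1] 2
## [1] "===================="
## [1] 3
## [1] "===================="
## [1] 4
## [1] "===================="
## [1] 5
## [1] "No remainder. Exiting loop..."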
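
A loop that stops once a counter reaches a multiple of 6 can also be written with break, which exits the loop immediately. This is a sketch of an alternative style, without a separate flag variable.

```r
count = 0

while(TRUE){                  # loop until break is reached
  count = count + 1
  if(count %% 6 == 0){        # no remainder when dividing by 6
    break                     # exit the loop immediately
  }
}
count                         # 6
```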




Functions

Functions allow you to re-use code. They are useful when your code performs the same operations repeatedly.

Functions can take inputs and return outputs.
An example of a function without specified inputs or outputs:

if_finished = function(){        # No specified inputs
  print("Complete!")             # No specified outputs. Rather a message is printed to a console
}

if_finished()

## [1] "Complete!"

An example of a function with a specified input and output:

get_squared_root = function(value)
{
  # Return the square root of given value
  result = value^0.5
  return(result)
}
answer = get_squared_root(9) # input is 9. Function output captured by 'answer'
answer
## [1] 3


Note: If there is no return() in the function, the value of the last expression evaluated is returned instead.
This allows us to re-write get_squared_root() more succinctly:

get_squared_root = function(value)
{
  # Return the square root of given value
  value^0.5
}
answer = get_squared_root(9) # input is 9. Function output captured by 'answer'
answer
## [1] 3


Returning multiple results from a function: Combine multiple results into one variable such as a vector or list and return that one variable

get_squared_root_and_squared_values = function(value)
{
  # Return the square root and squared value of given value
  squared_root = value^0.5
  squared = value*value
  return(c("sqrt" = squared_root, "squared" = squared))
}
answer = get_squared_root_and_squared_values(9) # input is 9. Function output captured by 'answer'
answer
##    sqrt squared 
##       3      81
answer["sqrt"]
## sqrt 
##    3
answer["squared"]
## squared 
##      81
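
Function inputs can be given default values, which are used when the caller leaves them out. The sketch below uses get_power, a function invented for this illustration.

```r
get_power = function(value, exponent = 2)
{
  # Raise value to the given exponent (squares by default)
  value^exponent
}

get_power(3)                          # default exponent used: 9
get_power(9, 0.5)                     # inputs matched by position: 3
get_power(exponent = 3, value = 2)    # inputs matched by name: 8
```

Naming the inputs in the call makes the order irrelevant, which helps readability when a function takes several inputs.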




End of Session 1
