R is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years. As of May 2018, R ranked 11th in the TIOBE index, a measure of the popularity of programming languages.
R is a GNU package. The source code for the R software environment is written primarily in C, Fortran, and R itself. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. While R has a command line interface, there are several graphical front-ends and integrated development environments available.
The following blogs give information about R gathered from various sources
Introduction
R is an interpreted programming language and software environment for statistical analysis, graphics representation and reporting. R also allows integration with the procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency. R is named so after the creators Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
RStudio is the most commonly used environment for R development. It provides features to run individual commands as well as scripts. You can install R and RStudio from their respective websites.
R Command Prompt
Once you have the R environment set up, you can start the R command prompt by typing R at your command prompt. This launches the R interpreter and gives you a > prompt where you can type statements such as the following:
> myString <- "Hello, World!"
> print ( myString)
[1] "Hello, World!"
Here the first statement defines a string variable myString and assigns it the string "Hello, World!"; the next statement uses print() to print the value stored in myString.
R Script File
Usually, you write your programs in script files and then execute those scripts at your command prompt with the help of the R interpreter called Rscript. Start by writing the following code in a text file called test.R:
```r
# Hello World
myString <- "Hello, World!"
print(myString)
```
Save the file as test.R and execute it at the command prompt as given below. You can also run it from within RStudio if you want to avoid the hassle of setting up the path.
```r
$ Rscript test.R
[1] "Hello, World!"
```
Data Types
Like most scripting languages, R is dynamically typed: you do not declare a variable to be limited to a given data type. Variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are:
- Vectors
- Lists
- Matrices
- Arrays
- Factors
- Data Frames

The simplest of these objects is the vector. Atomic vectors come in six data types, also termed the six classes of vectors. The other R-objects are built upon the atomic vectors.
- Logical (TRUE / FALSE)
- Numeric (1, 2, 33.44)
- Integer (1L, -100L, 0L)
- Complex (4 + 10i)
- Character ("1", "examples", 'of', "characters")
- Raw (a raw sequence of bytes)
```r
v <- TRUE
print(class(v))
# [1] "logical"

v <- 23.5
print(class(v))
# [1] "numeric"

v <- 2L
print(class(v))
# [1] "integer"

v <- 2+5i
print(class(v))
# [1] "complex"

v <- "TRUE"
print(class(v))
# [1] "character"

v <- charToRaw("Convert Characters to RAW")
print(class(v))
# [1] "raw"
```
## Variable
A valid variable name consists of letters, numbers and the dot or underscore characters. The variable name starts with a letter, or with a dot that is not followed by a number.
```r
var_name2.   # Valid: has letters, numbers, dot and underscore
var_name%    # Invalid: has the character '%'; only dot (.) and underscore are allowed
2var_name    # Invalid: starts with a number
.var_name    # Valid: can start with a dot (.), but the dot must not be followed by a number
var.name     # Valid: a variable name can contain a dot (.)
.2var_name   # Invalid: the starting dot is followed by a number
_var_name    # Invalid: starts with _, which is not valid
```
The variables can be assigned values using leftward, rightward and equal to operator.
# Assignment using equal operator.
var.1 = c(0,1,2,3)
# Assignment using leftward operator.
var.2 <- c("learn","R")
# Assignment using rightward operator.
c(TRUE,1) -> var.3
cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")
print() / cat()
We can view the contents of a variable using the print or cat functions. Print takes a single parameter while cat takes multiple parameters and concatenates them all.
print("Hello World")
[1] "Hello World"
cat("Hello", "World")
Hello World
ls() / rm()
R does not give blocks their own scope; only functions and packages get a separate environment. For example, a variable assigned in an if block is still available after the block ends, so it is very easy to lose track of the variables available at a given point. R provides two very useful functions to deal with this.
ls() gives a list of the variables defined at a given time, and rm() deletes a variable from memory.
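As a rough illustration of the two functions, the sketch below runs inside a function so that it does not touch your own workspace. The names demo_vars, x and y are made up for this example.

```r
# Hypothetical helper: shows ls() before and after rm() in a local scope
demo_vars <- function() {
  x <- 1
  if (TRUE) {
    y <- 2            # assigned inside an if block...
  }
  before <- ls()      # ...but still listed once the block has ended
  rm(y)               # delete y from memory
  after <- ls()
  list(before = before, after = after)
}

result <- demo_vars()
print(result$before)
print(result$after)
```

At the interactive prompt you would normally call ls() and rm() directly on the variables in your workspace.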
Operators
R is rich in operators. It may not be as rich as Perl, but its operators are tailored towards handling chunks of data: by default, an operator applied to vectors performs the operation on the corresponding individual elements. Operators in R can be classified into 5 major types:
Arithmetic
R defines these arithmetic operators +, -, *, /, %%, %/%, ^
The meaning of +, -, *, / and ^ is the same as in most other languages. The %% and %/% operators are more interesting: both relate to integer division, with %/% giving the quotient and %% giving the remainder.
> # / performs the usual division
> c(4, 2, 5.5, 6.5) / c(2, 4, 2.5, 3)
[1] 2.000000 0.500000 2.200000 2.166667
> # %% gives the remainder. Note that both the operands could be non integers.
> # But the operator ensures integer division.
> c(4, 2, 5.5, 6.5) %% c(2, 4, 2.5, 3)
[1] 0.0 2.0 0.5 0.5
> # %/% gives the quotient. Note that both the operands could be non integers.
> # But the operator ensures integer division.
> c(4, 2, 5.5, 6.5) %/% c(2, 4, 2.5, 3)
[1] 2 0 2 2
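The quotient and remainder fit together: multiplying the quotient by the divisor and adding the remainder gives back the original value. A small sketch using the same vectors as above (the variable names are chosen for this example only):

```r
x <- c(4, 2, 5.5, 6.5)
y <- c(2, 4, 2.5, 3)

quotient  <- x %/% y   # integer quotient of each pair
remainder <- x %% y    # what is left over after the integer division

# Recombine: quotient * divisor + remainder should reproduce x
rebuilt <- quotient * y + remainder
print(all.equal(rebuilt, x))
```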
Relational
R defines the usual relational operators: <, >, <=, >=, ==, !=
They mean almost what they would mean in any other language. But, as mentioned above, the operators work on individual elements in the vector and produce another vector of boolean elements that stand for the result of each individual comparison. For example:
> c(4, 2, 5.5, 6.5) < c(2, 4, 2.5, 3)
[1] FALSE TRUE FALSE FALSE
Logical
R defines all the usual logical operators: &, |, !, && and ||
The operators &, | and ! do just what one would expect: they operate on the individual elements of the operand vectors and produce another boolean vector as the result. The && and || operators work differently: they expect single values and return a single boolean value. Older versions of R silently used only the first element of longer vectors; since R 4.3.0, passing a vector of length greater than one to && or || is an error.
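A minimal sketch of the difference, using two made-up logical vectors a and b. Because && and || work on single values, the vectors are first reduced with all() and any():

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise_and <- a & b   # compares element by element
elementwise_or  <- a | b
negated         <- !a

# && and || take single values, so collapse the vectors first
all_true <- all(a) && all(b)
any_true <- any(a) || any(b)

print(elementwise_and)
print(all_true)
print(any_true)
```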
Assignment
There are two types of assignments in R. Left assignment and Right assignment.
> a <- c(1,2,3)
> a
[1] 1 2 3
> c(3,4,5,6) -> a
> a
[1] 3 4 5 6
You can also use <<-, ->> and of course =. There are subtle differences between these, which we will check out down the line.
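One of those differences can be sketched already: inside a function, <- creates a local variable, while <<- assigns to a variable in an enclosing environment. The names counter, bump_local and bump_enclosing are invented for this illustration.

```r
counter <- 0

bump_local <- function() {
  counter <- counter + 1    # creates a local copy; the outer counter is untouched
  counter
}

bump_enclosing <- function() {
  counter <<- counter + 1   # modifies counter in the enclosing environment
  counter
}

local_result <- bump_local()
after_local  <- counter      # still the original value

enclosing_result <- bump_enclosing()
after_enclosing  <- counter  # now updated

print(c(after_local, after_enclosing))
```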
Miscellaneous
R also provides other operators :, %in% and %*%
print(2:8)
[1] 2 3 4 5 6 7 8
print(8 %in% 1:10)
[1] TRUE
print(12 %in% 1:10)
[1] FALSE
These are not limited to numbers. They work as well on other data types.
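For instance, %in% works on character vectors, and %*% multiplies matrices. The sketch below uses made-up values (langs and m) purely for illustration.

```r
langs <- c("R", "Python", "C")

has_r       <- "R" %in% langs              # single lookup
which_known <- c("R", "Perl") %in% langs   # one result per element on the left

m <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
product <- m %*% m           # matrix multiplication, not element-wise

print(which_known)
print(product)
```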
Code Flow
Like most programming languages, R provides code flow support using if/else blocks and for/while loops. Most developers will not need a detailed explanation of these. The code snippets below give an overview of how they are used in R code.
if / else / else if
report <- 'blank'
number <- 10
if(number > 10){
report <- "Greater than 10"
}else if (number < 10){
report <- "Less than 10"
}else{
report <- 'Equal to 10'
}
print(report)
[1] "Equal to 10"
for loops
We have versatile for loops in R. It provides ways of looping through the various data structures like vectors, lists, matrix, arrays.
vec <- c(1,3,4,6,9)
for (v in vec) {
print(v)
}
[1] 1
[1] 3
[1] 4
[1] 6
[1] 9
You can do the same with other data structures as well. Also the collection can be generated dynamically in the command:
for ( i in 1:10 ){
print (i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
While Loops
While loops provide a more generic and more powerful mechanism to loop. The While loops in R are quite similar to most other languages:
> x <- 0
> while(x < 10){
+ cat('Value of x: ',x)
+ print("X is still less than 10")
+ # add one to x
+ x <- x+1
+ }
Value of x: 0[1] "X is still less than 10"
Value of x: 1[1] "X is still less than 10"
Value of x: 2[1] "X is still less than 10"
Value of x: 3[1] "X is still less than 10"
Value of x: 4[1] "X is still less than 10"
Value of x: 5[1] "X is still less than 10"
Value of x: 6[1] "X is still less than 10"
Value of x: 7[1] "X is still less than 10"
Value of x: 8[1] "X is still less than 10"
Value of x: 9[1] "X is still less than 10"
>
While loops also provide for break and next if you want to cut short through the loop at any point.
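A small sketch of both keywords: next skips the rest of the current iteration and break leaves the loop entirely. The variable names x and seen are chosen for this example.

```r
seen <- c()
x <- 0
while (TRUE) {
  x <- x + 1
  if (x %% 2 == 0) next   # skip even numbers
  if (x > 7) break        # stop once x goes past 7
  seen <- c(seen, x)      # only odd numbers up to 7 reach this line
}
print(seen)
```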
Matrices
The matrix is a very useful data structure in R. A lot of data processing and machine learning computations involve matrices, so it is important to understand it well.
As one would expect, a matrix is a two-dimensional data structure consisting of rows and columns.
Creating a Matrix
A matrix can be created using the function matrix(). The first (and only required) argument to this function is the vector that should be cast into a matrix. The matrix() function splits the vector into a matrix based on the parameters passed in.
> matrix(1:20, nrow=4)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
By default, the values are split along the columns. But, we can make it flow along rows by setting the byrow parameter.
> matrix(1:20, byrow=TRUE, nrow=4)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
OR
> matrix(1:20, byrow=T, nrow=4)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
Another way to create a matrix from vectors is to bind two or more vectors of same size. This can be done using rbind() or cbind().
> rbind(1:5, 2:6)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 3 4 5 6
>
> cbind(1:5, 2:6)
[,1] [,2]
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
When the vectors being bound are named variables, the names of the individual vectors are assigned as the row or column labels of the matrix.
> v1 <- c(1:4)
> v2 <- c(4:1)
> rbind(v1,v2)
[,1] [,2] [,3] [,4]
v1 1 2 3 4
v2 4 3 2 1
> cbind(v1,v2)
v1 v2
[1,] 1 4
[2,] 2 3
[3,] 3 2
[4,] 4 1
Labels
It makes a lot of sense to label the rows and columns so that the code and graphs look a lot more meaningful. We can do that using the methods like colnames and rownames. Consider the below example that depicts the sale of bikes by different brands.
First define the two vectors
> honda <- c(10, 14, 12, 13, 11)
> honda
[1] 10 14 12 13 11
> yamaha <- c(12, 13, 14, 11, 10)
> yamaha
[1] 12 13 14 11 10
Now combine into a single vector that can be split to create a matrix.
> sales <- c(honda, yamaha)
> sales
[1] 10 14 12 13 11 12 13 14 11 10
Next, split the vector into a matrix.
> sales.matrix <- matrix(sales, byrow=T, nrow=2)
> sales.matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 10 14 12 13 11
[2,] 12 13 14 11 10
Finally, use the functions colnames and rownames to add labels to the matrix.
> colnames(sales.matrix) <- c("2013", "2014", "2015", "2016", "2017")
> rownames(sales.matrix) <- c("Honda", "Yamaha")
> sales.matrix
2013 2014 2015 2016 2017
Honda 10 14 12 13 11
Yamaha 12 13 14 11 10
Matrix Arithmetic
Just like vectors, the arithmetic operations on matrices work on the individual elements.
> mat <- matrix(1:20, byrow = T, nrow=5)
Scalar multiplication results in multiplication on each element.
> mat * 2
[,1] [,2] [,3] [,4]
[1,] 2 4 6 8
[2,] 10 12 14 16
[3,] 18 20 22 24
[4,] 26 28 30 32
[5,] 34 36 38 40
Scalar division results in division of each element
> mat / 2
[,1] [,2] [,3] [,4]
[1,] 0.5 1 1.5 2
[2,] 2.5 3 3.5 4
[3,] 4.5 5 5.5 6
[4,] 6.5 7 7.5 8
[5,] 8.5 9 9.5 10
Similarly, the exponent method works on each element.
> mat ^2
[,1] [,2] [,3] [,4]
[1,] 1 4 9 16
[2,] 25 36 49 64
[3,] 81 100 121 144
[4,] 169 196 225 256
[5,] 289 324 361 400
You can also take the reciprocal of each element with 1/mat. Note that this is an element-wise operation, not the matrix inverse.
> 1/mat
[,1] [,2] [,3] [,4]
[1,] 1.00000000 0.50000000 0.33333333 0.25000000
[2,] 0.20000000 0.16666667 0.14285714 0.12500000
[3,] 0.11111111 0.10000000 0.09090909 0.08333333
[4,] 0.07692308 0.07142857 0.06666667 0.06250000
[5,] 0.05882353 0.05555556 0.05263158 0.05000000
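To contrast the element-wise reciprocal with a true matrix inverse, here is a minimal sketch using base R's solve() on a small square matrix. The matrix m is made up for this example; solve() needs a square, non-singular matrix, so it would not apply to the 5x4 mat above.

```r
m <- matrix(c(2, 0, 0, 4), nrow = 2)   # diagonal matrix with 2 and 4

reciprocal <- 1 / m        # element-wise: the zeros become Inf
inverse    <- solve(m)     # matrix inverse

# Multiplying a matrix by its inverse gives the identity matrix
identity_check <- m %*% inverse

print(reciprocal)
print(inverse)
print(identity_check)
```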
Comparison operations result in a matrix of boolean values.
> mat > 15
[,1] [,2] [,3] [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE TRUE
[5,] TRUE TRUE TRUE TRUE
You can also filter the matrix elements to get a vector output.
> mat[mat > 15]
[1] 17 18 19 16 20
Just like scalar addition, you can use the operations on two matrices to get result on individual elements.
> mat + mat
[,1] [,2] [,3] [,4]
[1,] 2 4 6 8
[2,] 10 12 14 16
[3,] 18 20 22 24
[4,] 26 28 30 32
[5,] 34 36 38 40
> mat * mat
[,1] [,2] [,3] [,4]
[1,] 1 4 9 16
[2,] 25 36 49 64
[3,] 81 100 121 144
[4,] 169 196 225 256
[5,] 289 324 361 400
> mat / mat
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 1 1 1
[3,] 1 1 1 1
[4,] 1 1 1 1
[5,] 1 1 1 1
> mat ^ mat
[,1] [,2] [,3] [,4]
[1,] 1.000000e+00 4.000000e+00 2.700000e+01 2.560000e+02
[2,] 3.125000e+03 4.665600e+04 8.235430e+05 1.677722e+07
[3,] 3.874205e+08 1.000000e+10 2.853117e+11 8.916100e+12
[4,] 3.028751e+14 1.111201e+16 4.378939e+17 1.844674e+19
[5,] 8.272403e+20 3.934641e+22 1.978420e+24 1.048576e+26
>
Matrix multiplication is denoted by %*%. Here mat is multiplied by its transpose t(mat):
> mat %*% t(mat)
[,1] [,2] [,3] [,4] [,5]
[1,] 30 70 110 150 190
[2,] 70 174 278 382 486
[3,] 110 278 446 614 782
[4,] 150 382 614 846 1078
[5,] 190 486 782 1078 1374
The data operations like sum and mean are implemented by functions like colSums, colMeans, rowSums, rowMeans, sum
> colSums(sales.matrix)
2013 2014 2015 2016 2017
22 27 26 24 21
> colMeans(sales.matrix)
2013 2014 2015 2016 2017
11.0 13.5 13.0 12.0 10.5
> rowSums(sales.matrix)
Honda Yamaha
60 60
> rowMeans(sales.matrix)
Honda Yamaha
12 12
> sum(sales.matrix)
[1] 120
Data slicing and indexing are required for any data processing. They are implemented as follows
> mat[1,]
[1] 1 2 3 4
> mat[1,3:4]
[1] 3 4
> mat[1:3,1:3]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 5 6 7
[3,] 9 10 11
> mat[,3:4]
[,1] [,2]
[1,] 3 4
[2,] 7 8
[3,] 11 12
[4,] 15 16
[5,] 19 20
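Labelled matrices can also be indexed by name. The sketch below rebuilds the sales matrix from the Labels section so that it runs on its own; the drop = FALSE and negative-index forms are standard base R indexing options.

```r
sales.matrix <- matrix(
  c(10, 14, 12, 13, 11,
    12, 13, 14, 11, 10),
  byrow = TRUE, nrow = 2,
  dimnames = list(c("Honda", "Yamaha"),
                  c("2013", "2014", "2015", "2016", "2017"))
)

honda_2015 <- sales.matrix["Honda", "2015"]        # single value by row and column name
honda_row  <- sales.matrix["Honda", , drop = FALSE] # keep the result as a 1-row matrix
without_2013 <- sales.matrix[, -1]                  # negative index drops the first column

print(honda_2015)
print(dim(honda_row))
print(colnames(without_2013))
```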
Built-in Data Sets
R provides several built-in data sets. They have reasonable size and accuracy and help us in rapid prototyping and also in using standard values in regular code.
R provides these data sets as data frames, matrices, vectors and time series. Here are a few examples:
States:
> state.abb
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH"
[30] "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
> state.area
[1] 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876 6450 83557 56400 36291 56290 82264 40395 48523 33215 10577 8257
[22] 58216 84068 47716 69686 147138 77227 110540 9304 7836 121666 49576 52586 70665 41222 69919 96981 45333 1214 31055 77047 42244
[43] 267339 84916 9609 40815 68192 24181 56154 97914
>
> head(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
WorldPhones:
> WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 1815 1646 89 555
1956 60423 29990 4708 2568 2366 1411 733
1957 64721 32510 5230 2695 2526 1546 773
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
There are many other data sets.
Dataset Contents
Name | Details |
--- | --- |
AirPassengers | Monthly Airline Passenger Numbers 1949-1960 |
BJsales | Sales Data with Leading Indicator |
BJsales.lead (BJsales) | Sales Data with Leading Indicator |
BOD | Biochemical Oxygen Demand |
CO2 | Carbon Dioxide Uptake in Grass Plants |
ChickWeight | Weight versus age of chicks on different diets |
DNase | Elisa assay of DNase |
EuStockMarkets | Daily Closing Prices of Major European Stock Indices, 1991-1998 |
Formaldehyde | Determination of Formaldehyde |
HairEyeColor | Hair and Eye Color of Statistics Students |
Harman23.cor | Harman Example 2.3 |
Harman74.cor | Harman Example 7.4 |
Indometh | Pharmacokinetics of Indomethacin |
InsectSprays | Effectiveness of Insect Sprays |
JohnsonJohnson | Quarterly Earnings per Johnson & Johnson Share |
LakeHuron | Level of Lake Huron 1875-1972 |
LifeCycleSavings | Intercountry Life-Cycle Savings Data |
Loblolly | Growth of Loblolly pine trees |
Nile | Flow of the River Nile |
Orange | Growth of Orange Trees |
OrchardSprays | Potency of Orchard Sprays |
PlantGrowth | Results from an Experiment on Plant Growth |
Puromycin | Reaction Velocity of an Enzymatic Reaction |
Seatbelts | Road Casualties in Great Britain 1969-84 |
Theoph | Pharmacokinetics of Theophylline |
Titanic | Survival of passengers on the Titanic |
ToothGrowth | The Effect of Vitamin C on Tooth Growth in Guinea Pigs |
UCBAdmissions | Student Admissions at UC Berkeley |
UKDriverDeaths | Road Casualties in Great Britain 1969-84 |
UKgas | UK Quarterly Gas Consumption |
USAccDeaths | Accidental Deaths in the US 1973-1978 |
USArrests | Violent Crime Rates by US State |
USJudgeRatings | Lawyers' Ratings of State Judges in the US Superior Court |
USPersonalExpenditure | Personal Expenditure Data |
UScitiesD | Distances Between European Cities and Between US Cities |
VADeaths | Death Rates in Virginia (1940) |
WWWusage | Internet Usage per Minute |
WorldPhones | The World's Telephones |
ability.cov | Ability and Intelligence Tests |
airmiles | Passenger Miles on Commercial US Airlines, 1937-1960 |
airquality | New York Air Quality Measurements |
anscombe | Anscombe's Quartet of 'Identical' Simple Linear Regressions |
attenu | The Joyner-Boore Attenuation Data |
attitude | The Chatterjee-Price Attitude Data |
austres | Quarterly Time Series of the Number of Australian Residents |
beaver1 (beavers) | Body Temperature Series of Two Beavers |
beaver2 (beavers) | Body Temperature Series of Two Beavers |
cars | Speed and Stopping Distances of Cars |
chickwts | Chicken Weights by Feed Type |
co2 | Mauna Loa Atmospheric CO2 Concentration |
crimtab | Student's 3000 Criminals Data |
discoveries | Yearly Numbers of Important Discoveries |
esoph | Smoking, Alcohol and (O)esophageal Cancer |
euro | Conversion Rates of Euro Currencies |
euro.cross (euro) | Conversion Rates of Euro Currencies |
eurodist | Distances Between European Cities and Between US Cities |
faithful | Old Faithful Geyser Data |
fdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
freeny | Freeny's Revenue Data |
freeny.x (freeny) | Freeny's Revenue Data |
freeny.y (freeny) | Freeny's Revenue Data |
infert | Infertility after Spontaneous and Induced Abortion |
iris | Edgar Anderson's Iris Data |
iris3 | Edgar Anderson's Iris Data |
islands | Areas of the World's Major Landmasses |
ldeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
lh | Luteinizing Hormone in Blood Samples |
longley | Longley's Economic Regression Data |
lynx | Annual Canadian Lynx trappings 1821-1934 |
mdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
morley | Michelson Speed of Light Data |
mtcars | Motor Trend Car Road Tests |
nhtemp | Average Yearly Temperatures in New Haven |
nottem | Average Monthly Temperatures at Nottingham, 1920-1939 |
npk | Classical N, P, K Factorial Experiment |
occupationalStatus | Occupational Status of Fathers and their Sons |
precip | Annual Precipitation in US Cities |
presidents | Quarterly Approval Ratings of US Presidents |
pressure | Vapor Pressure of Mercury as a Function of Temperature |
quakes | Locations of Earthquakes off Fiji |
randu | Random Numbers from Congruential Generator RANDU |
rivers | Lengths of Major North American Rivers |
rock | Measurements on Petroleum Rock Samples |
sleep | Student's Sleep Data |
stack.loss (stackloss) | Brownlee's Stack Loss Plant Data |
stack.x (stackloss) | Brownlee's Stack Loss Plant Data |
stackloss | Brownlee's Stack Loss Plant Data |
state.abb (state) | US State Facts and Figures |
state.area (state) | US State Facts and Figures |
state.center (state) | US State Facts and Figures |
state.division (state) | US State Facts and Figures |
state.name (state) | US State Facts and Figures |
state.region (state) | US State Facts and Figures |
state.x77 (state) | US State Facts and Figures |
sunspot.month | Monthly Sunspot Data, from 1749 to "Present" |
sunspot.year | Yearly Sunspot Data, 1700-1988 |
sunspots | Monthly Sunspot Numbers, 1749-1983 |
swiss | Swiss Fertility and Socioeconomic Indicators (1888) Data |
treering | Yearly Treering Data, -6000-1979 |
trees | Girth, Height and Volume for Black Cherry Trees |
uspop | Populations Recorded by the US Census |
volcano | Topographic Information on Auckland's Maunga Whau Volcano |
warpbreaks | The Number of Breaks in Yarn during Weaving |
women | Average Heights and Weights for American Women |
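A listing like the one above can be produced from R itself, and it is worth checking how each data set is stored before using it. A brief sketch, assuming the standard datasets package that ships with R; the variable names are made up for this example.

```r
# Names of the data sets in the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
print(head(available))

# The data sets are not all stored the same way
abb_is_vector   <- is.character(state.abb)      # plain character vector
x77_is_matrix   <- is.matrix(state.x77)         # numeric matrix
cars_is_df      <- is.data.frame(mtcars)        # data frame
air_is_ts       <- inherits(AirPassengers, "ts") # time series

print(c(abb_is_vector, x77_is_matrix, cars_is_df, air_is_ts))
```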
Data Frames
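All data in vectors and matrices is coerced to a single data type. Data frames let you overcome this limitation: a data frame can contain several columns of different types. The built-in WorldPhones data set shows the row-and-column layout that data frames share (WorldPhones itself is stored as a matrix, since all of its values are numeric):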
> WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 1815 1646 89 555
1956 60423 29990 4708 2568 2366 1411 733
1957 64721 32510 5230 2695 2526 1546 773
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
Creating Data Frame
A new data frame object can be created using the function data.frame()
> empty <- data.frame() # empty data frame
> vector.1 <- 1:10 # vector of integers
> vector.2 <- letters[1:10] # vector of strings
> df <- data.frame(column.1=vector.1,column.2=vector.2)
>
> df
column.1 column.2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
Importing and Exporting Data
You can export and import the data frame to a CSV file. This is useful for saving the context of a data operation.
> write.csv(df, file='mydata.csv') # Save the data frame to CSV file
>
You can load the contents from the CSV file as below
> d2 <- read.csv('mydata.csv') # Load the data frame from CSV file
> d2
X column.1 column.2
1 1 1 a
2 2 2 b
3 3 3 c
4 4 4 d
5 5 5 e
6 6 6 f
7 7 7 g
8 8 8 h
9 9 9 i
10 10 10 j
Please note that there is a difference in what we saved and what we read from the file. The row numbers are also saved in the CSV and then loaded as an independent column when reading from the CSV.
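If you do not want the extra column, one option is to skip the row names when writing, using the row.names argument of write.csv. The sketch below writes to a temporary file (csv_path is a name made up for this example) so that it leaves nothing behind.

```r
df <- data.frame(column.1 = 1:3, column.2 = letters[1:3])

csv_path <- tempfile(fileext = ".csv")
write.csv(df, file = csv_path, row.names = FALSE)  # do not write row numbers

d2 <- read.csv(csv_path)
unlink(csv_path)                                   # clean up the temporary file

print(names(d2))
```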
Analyzing Data Frames
While analyzing data, it is very useful to have an initial idea about the kind of data present: the columns, the data types, the max/min/mean values for numbers, etc. R provides a good set of utilities to make this job simpler. Let us try to understand the built-in state.x77 data set. It is stored as a numeric matrix, but the utilities below work the same way on data frames.
Head / Tail
The data set is too big to be viewed in full. We can get a quick glimpse of it using head().
```r
> head(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
```
Or you can use tail() to get the last 6 rows:
```r
> tail(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Vermont 472 3907 0.6 71.64 5.5 57.1 168 9267
Virginia 4981 4701 1.4 70.08 9.5 47.8 85 39780
Washington 3559 4864 0.6 71.72 4.3 63.5 32 66570
West Virginia 1799 3617 1.4 69.48 6.7 41.6 100 24070
Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149 54464
Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203
```
Note that 6 is just the default number of rows for head and tail. You can always override it using the second parameter:
```r
> head(mtcars, 3)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
>
> tail(mtcars, 3)
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
```
### Summary and Structure
R also provides two more utility functions that help you understand the data
```r
> #Structure
> str(state.x77)
num [1:50, 1:8] 3615 365 2212 2110 21198 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
>
> summary(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Min. : 365 Min. :3098 Min. :0.500 Min. :67.96 Min. : 1.400 Min. :37.80 Min. : 0.00 Min. : 1049
1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12 1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985
Median : 2838 Median :4519 Median :0.950 Median :70.67 Median : 6.850 Median :53.25 Median :114.50 Median : 54277
Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88 Mean : 7.378 Mean :53.11 Mean :104.46 Mean : 70736
3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89 3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81163
Max. :21198 Max. :6315 Max. :2.800 Max. :73.60 Max. :15.100 Max. :67.30 Max. :188.00 Max. :566432
```
### Counts
There is another set of functions that report the dimensions and labels of the data:
```r
> ncol(state.x77)
[1] 8
> nrow(state.x77)
[1] 50
>
> colnames(state.x77)
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder" "HS Grad" "Frost" "Area"
> rownames(state.x77)
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut" "Delaware" "Florida"
[10] "Georgia" "Hawaii" "Idaho" "Illinois" "Indiana" "Iowa" "Kansas" "Kentucky" "Louisiana"
[19] "Maine" "Maryland" "Massachusetts" "Michigan" "Minnesota" "Mississippi" "Missouri" "Montana" "Nebraska"
[28] "Nevada" "New Hampshire" "New Jersey" "New Mexico" "New York" "North Carolina" "North Dakota" "Ohio" "Oklahoma"
[37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina" "South Dakota" "Tennessee" "Texas" "Utah" "Vermont"
[46] "Virginia" "Washington" "West Virginia" "Wisconsin" "Wyoming"
```
Filter Data
You can also filter the data to get a subset of what is available in the data frame. For example, if we want to pull out only those cars that have 5 gears:
> mtcars[mtcars$gear == 5, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
We can also use logical operators in the condition. For example, if we want the additional criterion that the car should also have more than 4 cylinders, we can do this:
> mtcars[mtcars$gear == 5 & mtcars$cyl > 4, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
We can also perform statistical operations on this data:
```r
> mean(mtcars[mtcars$hp > 100 & mtcars$wt > 2.5, ]$mpg)
[1] 16.86364
```
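The same filters can also be written with base R's subset(), which lets you refer to columns by name without repeating mtcars$. This is an alternative sketch, not the form used above; the variable names are made up for this example.

```r
# Rows with 5 gears (same rows as mtcars[mtcars$gear == 5, ])
five_gear <- subset(mtcars, gear == 5)

# Add a second condition and keep only a few columns
big_five_gear <- subset(mtcars, gear == 5 & cyl > 4, select = c(mpg, cyl, hp))

print(rownames(five_gear))
print(big_five_gear)
```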
## Indexing Data Frames
By indexing, we can obtain subsets of a given data frame. Often we also need to add new rows and columns to a data frame. R provides two functions for this: rbind and cbind.
### Add Row
Let's first check out the row bind functionality. To start with, pick two parts of the mtcars dataset.
```r
> df1 = mtcars[1:5, 1:5]
> df1
mpg cyl disp hp drat
Mazda RX4 21.0 6 160 110 3.90
Mazda RX4 Wag 21.0 6 160 110 3.90
Datsun 710 22.8 4 108 93 3.85
Hornet 4 Drive 21.4 6 258 110 3.08
Hornet Sportabout 18.7 8 360 175 3.15
>
> df2 = mtcars[6, 1:5]
> df2
mpg cyl disp hp drat
Valiant 18.1 6 225 105 2.76
```
Now, we can join these using rbind
```r
> df <- rbind(df1, df2)
> df
mpg cyl disp hp drat
Mazda RX4 21.0 6 160 110 3.90
Mazda RX4 Wag 21.0 6 160 110 3.90
Datsun 710 22.8 4 108 93 3.85
Hornet 4 Drive 21.4 6 258 110 3.08
Hornet Sportabout 18.7 8 360 175 3.15
Valiant 18.1 6 225 105 2.76
```
### Add Column
Similarly, we can also join columns using the cbind command.
```r
> df1 <- mtcars[1:5, 1:5]
> df2 <- mtcars[1:5, 6:7]
>
> df1
mpg cyl disp hp drat
Mazda RX4 21.0 6 160 110 3.90
Mazda RX4 Wag 21.0 6 160 110 3.90
Datsun 710 22.8 4 108 93 3.85
Hornet 4 Drive 21.4 6 258 110 3.08
Hornet Sportabout 18.7 8 360 175 3.15
> df2
wt qsec
Mazda RX4 2.620 16.46
Mazda RX4 Wag 2.875 17.02
Datsun 710 2.320 18.61
Hornet 4 Drive 3.215 19.44
Hornet Sportabout 3.440 17.02
```
Now, we can merge these using cbind
```r
> cbind(df1, df2)
mpg cyl disp hp drat wt qsec
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
Datsun 710 22.8 4 108 93 3.85 2.320 18.61
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
```
Of course, for cbind and rbind to work properly, the other dimension should match. For example, when appending rows using rbind, the columns should match, and vice versa.
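A short sketch of what happens when the dimensions do not match, along with the common shortcut of adding a single column with $. The error is caught with tryCatch so the example keeps running; the message text in the handler is written for this example and is not R's own wording.

```r
df1 <- mtcars[1:5, 1:5]

# Adding one column directly: its length must equal the number of rows
df1$wt <- mtcars$wt[1:5]

# rbind on data frames with different numbers of columns raises an error
mismatch <- tryCatch(
  rbind(mtcars[1:2, 1:5], mtcars[3, 1:4]),
  error = function(e) "rbind failed: the column counts differ"
)

print(ncol(df1))
print(mismatch)
```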
Lists
Lists are objects that can contain elements of different types, such as numbers, strings, vectors, data frames and even other lists. A list can also contain a matrix or a function as an element. Lists are typically used for organizing data rather than processing it.
Creating a List
Lists are created using the list() function. The following example creates a list containing a data frame, vectors, a logical value and numbers.
# Create a list containing a data frame, vectors, a logical value
# and numbers.
> list_data <- list(mtcars[1:5,], c('A', 'Sample', 'Vector'), c(21,32,11), TRUE, 51.23, 119.1)
> list_data
[[1]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
[[2]]
[1] "A" "Sample" "Vector"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
As you can see, each item in the list is associated with an index number that is shown as [[1]], [[2]]. We can also assign names to these elements.
Naming List Elements
The list elements can be given names and they can be accessed using these names.
> # Create a named list containing a data frame, vectors, a logical value and numbers.
> list_data <- list(df = mtcars[1:5,], vec1 = c('A', 'Sample', 'Vector'), vec2 = c(21,32,11), bln = TRUE, num1 = 51.23, num2 = 119.1)
> list_data
$df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
$vec1
[1] "A" "Sample" "Vector"
$vec2
[1] 21 32 11
$bln
[1] TRUE
$num1
[1] 51.23
$num2
[1] 119.1
You can check the names of a list using names():
> names(list_data)
[1] "df" "vec1" "vec2" "bln" "num1" "num2"
We can also rename the elements by assigning a new character vector to names():
> names(list_data) <- c("Data Frame", "Vector 1", "Vector 2", "Boolean", "Number 1", "Number 2")
This updates the names of the list elements:
> list_data
$`Data Frame`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
$`Vector 1`
[1] "A" "Sample" "Vector"
$`Vector 2`
[1] 21 32 11
$Boolean
[1] TRUE
$`Number 1`
[1] 51.23
$`Number 2`
[1] 119.1
Accessing List Elements
Elements of a list can be accessed by their index. In a named list, they can also be accessed by name.
> list_data[1]
$`Data Frame`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
> list_data["Data Frame"]
$`Data Frame`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
> list_data$`Vector 1`
[1] "A" "Sample" "Vector"
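The output above does not make the difference between single and double brackets obvious. As a minimal sketch (the list `x` and its element names are made up for this illustration), single brackets return a sub-list, while double brackets and `$` return the element itself:

```r
# Hypothetical list used only for this illustration.
x <- list(vec = c(21, 32, 11), flag = TRUE)

sub  <- x[1]      # single brackets: a list of length 1
elem <- x[[1]]    # double brackets: the numeric vector itself
same <- x$vec     # $ with a name: equivalent to x[["vec"]]

class(sub)        # "list"
class(elem)       # "numeric"
identical(elem, same)
```

This matters in practice: arithmetic such as `elem * 2` works, whereas `sub * 2` fails because `sub` is still a list.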
Manipulating List Elements
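We can add, delete and update list elements. Assigning to a new index or name adds an element at the end of the list, while any existing element can be updated by assigning a new value or deleted by assigning NULL. The example below deletes the fourth element.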
> list_data[4] <- NULL
> list_data
$`Data Frame`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
$`Vector 1`
[1] "A" "Sample" "Vector"
$`Vector 2`
[1] 21 32 11
$`Number 1`
[1] 51.23
$`Number 2`
[1] 119.1
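The transcript above only shows deletion. The following sketch covers adding and updating as well, using a small made-up list `x` (not part of the original example). It also uses base R's `append()` with its `after` argument, which is one way to place a new element somewhere other than the end:

```r
# Hypothetical list used only for this illustration.
x <- list(a = 1, b = "two", c = TRUE)

x[["d"]] <- c(4, 5)                      # add: a new name appends at the end
x[["b"]] <- "TWO"                        # update: assign to an existing name
x[["c"]] <- NULL                         # delete: assign NULL
x <- append(x, list(z = 0), after = 1)   # insert after the first element

names(x)   # "a" "z" "b" "d"
```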
Merging Lists
You can merge several lists into one by combining them with the c() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
# Merge the two lists.
merged.list <- c(list1,list2)
# Print the merged list.
merged.list
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
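Note that wrapping the two lists in list() instead of c() gives a different result. The sketch below reuses list1 and list2 from the example; the names `flat` and `nested` are introduced here only for illustration:

```r
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

flat   <- c(list1, list2)      # one flat list with six elements
nested <- list(list1, list2)   # a list holding the two original lists

length(flat)     # 6
length(nested)   # 2
```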
Converting List to Vector
A list can be converted to a vector so that the elements of the vector can be used for further manipulation. All the arithmetic operations on vectors can be applied after the list is converted into vectors. To do this conversion, we use the unlist() function. It takes the list as input and produces a vector.
> list1 <- list(1:5)
> list1
[[1]]
[1] 1 2 3 4 5
> list2 <-list(10:14)
> list2
[[1]]
[1] 10 11 12 13 14
> v1 <- unlist(list1)
> v2 <- unlist(list2)
>
> v1
[1] 1 2 3 4 5
> v2
[1] 10 11 12 13 14
>
> result <- v1+v2
> result
[1] 11 13 15 17 19
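One thing to watch for: a vector holds a single type, so unlist() coerces mixed elements to a common type. A small sketch, using made-up lists rather than the ones above:

```r
# Mixed types are coerced to the most general type (character here).
mixed <- list(1, "a", TRUE)
v <- unlist(mixed)
v          # "1" "a" "TRUE"
class(v)   # "character"

# Names are carried over, with a numeric suffix for multi-element entries.
named <- unlist(list(a = 1, b = 2:3))
names(named)   # "a" "b1" "b2"
```

Arithmetic on `v` would fail because it is a character vector, so it is worth checking the class of the result before computing with it.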
File IO
R is built for working with data, so it provides functions for many aspects of data processing. But how does it get this data? Reading from an input file is an essential step, and R provides simple functions for reading and writing data in various file formats.
File Dump
R allows you to dump objects into a file with save(). Such a file can only be read by R.
> df = mtcars
> save(file = "file.out", compress = T, list = c("df"))
This saves the contents of df into file.out. The same object can be loaded back from the file using the load() function:
> load(file = "file.out")
> head(df, 3)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Note the parameter compress=T, which produces a compressed output file. If you open the generated file, you will find unreadable binary content. Passing ascii=T instead of compress=T generates a file with ASCII content.
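To see that load() really restores an object under its original name, the following sketch removes the variable before loading it back. The temporary file path and the variable names `cars_head` and `original` are assumptions made for this illustration:

```r
path <- tempfile(fileext = ".RData")   # temporary file for the illustration
cars_head <- head(mtcars, 3)
original  <- cars_head                 # keep a copy for comparison

save(cars_head, file = path)           # same effect as list = c("cars_head")
rm(cars_head)                          # remove it so load() has to restore it

restored_names <- load(path)           # load() returns the names it restored
restored_names                         # "cars_head"
identical(cars_head, original)         # TRUE
```

Because load() silently overwrites any existing variable with the same name, checking the returned names can help avoid surprises.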
Dump Everything
An extension of this method, save.image(), lets you dump everything in the current workspace. You can specify a file name; otherwise the data is saved to ".RData".
> save.image()
You can also set ascii and compress here. The load function does not change: it reads the objects from the specified file and restores them as variables in the workspace.
> load(".RData")
CSV File
You can also save a data frame as a CSV file:
> head(df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
>
> write.csv(df, file="df.csv")
> df = read.csv("df.csv")
> head(df)
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note that on reading the CSV file back, the row names are treated as the first column (named X) of the data frame.
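If you want the row names back instead of an extra column, one option is the row.names argument of read.csv(). The sketch below assumes a temporary file and the illustrative names `cars` and `back`:

```r
path <- tempfile(fileext = ".csv")   # temporary file for the illustration
cars <- head(mtcars)

write.csv(cars, file = path)
back <- read.csv(path, row.names = 1)   # use the first column as row names

rownames(back)
names(back)
```

Alternatively, calling write.csv() with row.names = FALSE leaves the row names out of the file altogether, which may be preferable when they carry no information.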
Table
> head(df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
>
> write.table(df, "df.table")
>
> head(read.table("df.table"))
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
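Unlike the CSV round trip, the row names survive here. By default write.table() writes a header line with one field fewer than the data rows, and read.table() then takes the first column as row names. The sketch below repeats the round trip with an explicit tab separator; the temporary file and the names `cars` and `back` are assumptions for the illustration:

```r
path <- tempfile(fileext = ".tsv")   # temporary file for the illustration
cars <- head(mtcars)

write.table(cars, file = path, sep = "\t")
back <- read.table(path, header = TRUE, sep = "\t")

rownames(back)   # car names are restored as row names
ncol(back)       # 11, the same columns as the original
```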