How to Read a CSV Dataset Into RStudio
Reading and Writing CSV Files
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I read data from a CSV file into R?
How do I write data to a CSV file?
Objectives
Read in a .csv, and explore the arguments of the csv reader.
Write the altered data set to a new .csv, and explore the arguments.
The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets in comma-separated values files (CSV) and then use R's built in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the arguments that allow you to read and write the data correctly for your needs.
Read a .csv and Explore the Arguments
Let's start by opening a .csv file containing information on the speeds at which cars of different colors were clocked in 45 mph zones in the four-corners states (CarSpeeds.csv). We will use the built in read.csv(...) function call, which reads the data in as a data frame, and assign the data frame to a variable (using <-) so that it is stored in R's memory. Then we will explore some of the basic arguments that can be supplied to the function. First, open the RStudio project containing the scripts and data you were working on in episode 'Analyzing Patient Data'.
# Import the data and look at the first six rows
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
head(carSpeeds)
  Color Speed     State
1  Blue    32 NewMexico
2   Red    45   Arizona
3  Blue    35  Colorado
4 White    34   Arizona
5   Red    25   Arizona
6  Blue    41   Arizona
Changing Delimiters
The default delimiter of the read.csv() function is a comma, but you can use other delimiters by supplying the 'sep' argument to the function (e.g., typing sep = ';' allows a semi-colon separated file to be correctly imported - see ?read.csv() for more information on this and other options for working with different file types).
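As a minimal sketch of the sep argument, the snippet below writes a small semicolon-separated file to a temporary location and reads it back. The file contents and the names tmp and semi are made up for illustration and are not part of the lesson data.

```r
# Write a tiny semicolon-separated file to a temporary path (illustrative data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color;Speed;State",
             "Blue;32;NewMexico",
             "Red;45;Arizona"), tmp)

# With the default sep = ',' each line would be read as one single column;
# sep = ';' splits the fields into three columns
semi <- read.csv(file = tmp, sep = ";")
str(semi)
```

Base R also provides read.csv2(), which uses a semicolon separator and a comma decimal mark by default, and may be convenient for files exported with those conventions.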
The call above will import the data, but we have not taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let's explore some of these arguments.
The default for read.csv(...) is to set the header argument to TRUE. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the header argument to FALSE:
# The first row of the data without setting the header argument:
carSpeeds[1, ]
  Color Speed     State
1  Blue    32 NewMexico
# The first row of the data if the header argument is set to FALSE:
carSpeeds <- read.csv(file = 'data/car-speeds.csv', header = FALSE)
carSpeeds[1, ]
     V1    V2    V3
1 Color Speed State
Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.
The stringsAsFactors Argument
In older versions of R (prior to 4.0) this was perhaps the most important argument in read.csv(), particularly if you were working with categorical data. This is because the default behavior of R was to convert character strings into factors, which may make it difficult to do such things as replace values. It is important to be aware of this behavior, which we will demonstrate. For example, let's say we find out that the data collector was color blind, and accidentally recorded green cars as being blue. In order to correct the data set, let's replace 'Blue' with 'Green' in the $Color column:
# Here we will use R's `ifelse` function, in which we provide the test phrase,
# the outcome if the result of the test is 'TRUE', and the outcome if the
# result is 'FALSE'. We will also assign the results to the Color column,
# using '<-'
# First - reload the data with a header
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
  [1] "Green" "1"     "Green" "5"     "4"     "Green" "Green" "2"     "5"
 [10] "4"     "4"     "5"     "Green" "Green" "2"     "4"     "Green" "Green"
 [19] "5"     "Green" "Green" "Green" "4"     "Green" "4"     "4"     "4"
 [28] "4"     "5"     "Green" "4"     "5"     "2"     "4"     "2"     "2"
 [37] "Green" "4"     "2"     "4"     "2"     "2"     "4"     "4"     "5"
 [46] "2"     "Green" "4"     "4"     "2"     "2"     "4"     "5"     "4"
 [55] "Green" "Green" "2"     "Green" "5"     "2"     "4"     "Green" "Green"
 [64] "5"     "2"     "4"     "4"     "2"     "Green" "5"     "Green" "4"
 [73] "5"     "5"     "Green" "Green" "Green" "Green" "Green" "5"     "2"
 [82] "Green" "5"     "2"     "2"     "4"     "4"     "5"     "5"     "5"
 [91] "5"     "4"     "4"     "4"     "5"     "2"     "5"     "2"     "2"
[100] "5"
What happened?!? It looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the factor level was reported following replacement.
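The effect can be reproduced without the data file. The following sketch uses a small made-up factor (the name colors and its values are illustrative, not taken from the lesson data) to show that a factor stores integer codes alongside its labels, and that ifelse() hands back those codes for the elements it does not replace.

```r
# A small factor standing in for the Color column (illustrative values)
colors <- factor(c("Blue", "Red", "Blue", "White"))

# Factors store integer codes plus a table of labels
levels(colors)      # the labels, sorted alphabetically
as.integer(colors)  # the codes that index into those labels

# ifelse() drops the factor attributes, so the 'no' branch contributes
# the integer codes (coerced to character) rather than the labels
replaced <- ifelse(colors == "Blue", "Green", colors)
replaced
```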
To see the internal structure, we can use another function, str(). In this case, the dataframe's internal structure includes the format of each column, which is what we are interested in. str() will be reviewed a little more in the lesson Data Types and Structures.
# Reload the data with a header (the previous ifelse call modifies attributes)
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)
str(carSpeeds)
'data.frame': 100 obs. of  3 variables:
 $ Color: Factor w/ 5 levels " Red","Black",..: 3 1 3 5 4 3 3 2 5 4 ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...
We can see that the $Color and $State columns are factors and $Speed is a numeric column.
Now, let's load the dataset using stringsAsFactors=FALSE, and see what happens when we try to replace 'Blue' with 'Green' in the $Color column:
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE)
str(carSpeeds)
'data.frame': 100 obs. of  3 variables:
 $ Color: chr  "Blue" " Red" "Blue" "White" ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: chr  "NewMexico" "Arizona" "Colorado" "Arizona" ...
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
  [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
 [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
 [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
 [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
 [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
 [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
 [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
 [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
 [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
 [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
 [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
[100] "White"
That's better! And we can see how the data now is read as character instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE; this is the default behavior.
The as.is Argument
This is an extension of the stringsAsFactors argument, but gives you control over individual columns. For example, if we want the colors of cars imported as strings, but we want the names of the states imported as factors, we would load the data set as:
carSpeeds <- read.csv(file = 'data/car-speeds.csv', as.is = 1)
# Note, the 1 applies as.is to the first column only
Now we can see that if we try to replace 'Blue' with 'Green' in the $Color column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $State column returns the factor numbers for the names of states that we haven't replaced:
str(carSpeeds)
'data.frame': 100 obs. of  3 variables:
 $ Color: chr  "Blue" " Red" "Blue" "White" ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
  [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
 [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
 [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
 [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
 [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
 [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
 [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
 [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
 [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
 [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
 [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
[100] "White"
carSpeeds$State <- ifelse(carSpeeds$State == 'Arizona', 'Ohio', carSpeeds$State)
carSpeeds$State
  [1] "3"    "Ohio" "2"    "Ohio" "Ohio" "Ohio" "3"    "2"    "Ohio" "2"
 [11] "4"    "4"    "4"    "4"    "4"    "3"    "Ohio" "3"    "Ohio" "4"
 [21] "4"    "4"    "3"    "2"    "2"    "3"    "2"    "4"    "2"    "4"
 [31] "3"    "2"    "2"    "4"    "2"    "2"    "3"    "Ohio" "4"    "2"
 [41] "2"    "3"    "Ohio" "4"    "Ohio" "2"    "3"    "3"    "3"    "2"
 [51] "Ohio" "4"    "4"    "Ohio" "3"    "2"    "4"    "2"    "4"    "4"
 [61] "4"    "2"    "3"    "2"    "3"    "2"    "3"    "Ohio" "3"    "4"
 [71] "4"    "2"    "Ohio" "4"    "2"    "2"    "2"    "Ohio" "3"    "Ohio"
 [81] "4"    "2"    "2"    "Ohio" "Ohio" "Ohio" "4"    "Ohio" "4"    "4"
 [91] "4"    "Ohio" "Ohio" "3"    "2"    "2"    "4"    "3"    "Ohio" "4"
We can see that the $Color column is a character while $State is a factor.
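For readers without the lesson data at hand, here is a self-contained sketch of as.is using a temporary file. The three-line file and the name cars are illustrative assumptions; the point is only to show which columns end up as character and which as factor.

```r
# Write a tiny comma-separated file to a temporary path (illustrative data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color,Speed,State",
             "Blue,32,NewMexico",
             "Red,45,Arizona"), tmp)

# as.is = 1 keeps column 1 as character; the other character columns
# are converted to factors
cars <- read.csv(file = tmp, as.is = 1)
sapply(cars, class)
```

The as.is argument also accepts column names, so as.is = "Color" should select the same column here.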
Updating Values in a Factor
Suppose we want to keep the colors of cars as factors for some other operations we want to perform. Write code for replacing 'Blue' with 'Green' in the $Color column of the cars dataset without importing the data with stringsAsFactors=FALSE.
Solution
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
# Replace 'Blue' with 'Green' in cars$Color without using the stringsAsFactors
# or as.is arguments
carSpeeds$Color <- ifelse(as.character(carSpeeds$Color) == 'Blue', 'Green', as.character(carSpeeds$Color))
# Convert colors back to factors
carSpeeds$Color <- as.factor(carSpeeds$Color)
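An alternative that is not part of the lesson's solution is to rename the factor level directly, which avoids the round trip through character. The sketch below uses a small made-up factor named colors rather than the lesson data.

```r
# A small factor standing in for the Color column (illustrative values)
colors <- factor(c("Blue", "Red", "Blue", "White"))

# Rename the 'Blue' level in place; every element carrying that level
# is relabelled and the vector stays a factor
levels(colors)[levels(colors) == "Blue"] <- "Green"
colors
```

Because only the label table changes, this approach leaves the underlying integer codes untouched.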
The strip.white Argument
It is not uncommon for mistakes to have been made when the data were recorded, for example a space (whitespace) may have been inserted before a data value. By default this whitespace will be kept in the R environment, such that ' Red' will be recognized as a different value than 'Red'. In order to avoid this type of error, use the strip.white argument. Let's see how this works by checking for the unique values in the $Color column of our dataset:
Here, the data recorder added a space before the color of the car in one of the cells:
# We use the built-in unique() function to extract the unique colors in our dataset
unique(carSpeeds$Color)
[1] Green  Red  White Red   Black
Levels:  Red Black Green Red White
Oops, we see two values for red cars.
Let's try again, this time importing the data using the strip.white argument. NOTE - this argument must be accompanied by the sep argument, by which we indicate the type of delimiter in the file (the comma for most .csv files)
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE, strip.white = TRUE, sep = ',')
unique(carSpeeds$Color)
[1] "Blue"  "Red"   "White" "Black"
That's better!
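The same comparison can be tried on a throwaway file. In this sketch the file contents and the names raw and clean are assumptions made for illustration; one value is written with a leading space on purpose.

```r
# Write a tiny file in which one color has a stray leading space
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color,Speed",
             "Blue,32",
             " Red,45",
             "Red,25"), tmp)

# Default import keeps the leading space, so ' Red' and 'Red' differ
raw <- read.csv(file = tmp, stringsAsFactors = FALSE)
unique(raw$Color)

# strip.white = TRUE trims the whitespace while reading
clean <- read.csv(file = tmp, stringsAsFactors = FALSE,
                  strip.white = TRUE, sep = ",")
unique(clean$Color)
```

If the data have already been imported, base R's trimws() can be applied to a character column to the same effect.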
Specify Missing Data When Loading
It is common for data sets to have missing values, or mistakes. The convention for recording missing values often depends on the individual who collected the data and can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, but not some of the examples above. Let's say the inflammation scale in the data set we used earlier, inflammation-01.csv, actually starts at 1 for no inflammation and the zero values (0) were a missed observation. Looking at the ?read.csv help page, is there an argument we could use to ensure all zeros (0) are read in as NA? Perhaps the data in car-speeds.csv contains mistakes and the person measuring the car speeds could not accurately distinguish between "Black" or "Blue" cars. Is there a way to specify more than one 'string', such as "Black" and "Blue", to be replaced by NA?
Solution
read.csv(file = "data/inflammation-01.csv", na.strings = "0")
or, in car-speeds.csv, use a character vector for multiple values.
read.csv(file = 'data/car-speeds.csv', na.strings = c("Black", "Blue"))
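As a further sketch of na.strings, the snippet below handles the n.a. and -- conventions mentioned in the exercise. The file contents and the name cars are hypothetical and exist only to make the example runnable.

```r
# Write a tiny file that marks missing speeds in two different ways
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color,Speed",
             "Blue,32",
             "Black,n.a.",
             "Red,--",
             "White,41"), tmp)

# Supplying na.strings replaces the default of "NA", so "NA" is listed
# again here to keep recognising it
cars <- read.csv(file = tmp, na.strings = c("n.a.", "--", "NA"))
cars$Speed
sum(is.na(cars$Speed))
```

With the markers declared as missing, the Speed column is read as numeric rather than as text.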
Write a New .csv and Explore the Arguments
After altering our cars dataset by replacing 'Blue' with 'Green' in the $Color
column, we at present want to save the output. There are several arguments for the write.csv(...)
function phone call, a few of which are particularly important for how the data are exported. Permit's explore these now.
# Export the data. The write.csv() function requires a minimum of two
# arguments, the data to be saved and the name of the output file.
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')
If you open the file, you'll see that it has header names, because the data had headers within R, but that there are numbers in the first column.
The row.names Argument
This argument allows us to set the names of the rows in the output data file. R's default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set row.names to FALSE:
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)
Now the output file no longer has row numbers in its first column.
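To inspect the difference without opening the files by hand, a small sketch like the following can be used. The two-row data frame cars and the temporary file names are made up for illustration.

```r
# A two-row data frame standing in for the cars data (illustrative values)
cars <- data.frame(Color = c("Blue", "Red"), Speed = c(32L, 45L))

with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

# Default: row names are written as an unnamed first column
write.csv(cars, file = with_rows)
# row.names = FALSE: only the data columns are written
write.csv(cars, file = without_rows, row.names = FALSE)

# Print the raw text of each file to compare
readLines(with_rows)
readLines(without_rows)
```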
Setting Column Names
There is also a col.names argument, which can be used to set the column names for a data set without headers. If the data set already has headers (e.g., we used the header = TRUE argument when importing the data) then a col.names argument will be ignored.
The na Argument
There are times when we want to specify certain values for NAs in the data set (e.g., we are going to pass the data to a program that only accepts -9999 as a nodata value). In this case, we want to set the NA value of our output file to the desired value, using the na argument. Let's see how this works:
# First, replace the speed in the 3rd row with NA, by using an index (square
# brackets to indicate the position of the value we want to replace)
carSpeeds$Speed[3] <- NA
head(carSpeeds)
  Color Speed     State
1  Blue    32 NewMexico
2   Red    45   Arizona
3  Blue    NA  Colorado
4 White    34   Arizona
5   Red    25   Arizona
6  Blue    41   Arizona
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)
Now we'll set NA to -9999 when we write the new .csv file:
# Note - the na argument requires a string input
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE, na = '-9999')
And we see that the missing speed in the third row is written to the file as -9999.
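A round trip makes the effect of the na argument easy to check. In this sketch the data frame cars, the temporary file, and the name back are illustrative assumptions; the file is written with -9999 as the missing-value marker and then read back with the matching na.strings setting.

```r
# A small data frame with one missing speed (illustrative values)
cars <- data.frame(Color = c("Blue", "Red", "Blue"),
                   Speed = c(32L, NA, 35L))

tmp <- tempfile(fileext = ".csv")

# Write NA as the string '-9999'
write.csv(cars, file = tmp, row.names = FALSE, na = "-9999")
readLines(tmp)

# Reading the file back with the same marker restores the NA
back <- read.csv(file = tmp, na.strings = "-9999")
back$Speed
```

Pairing na on export with na.strings on import keeps the sentinel value from being mistaken for a real measurement.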
Key Points
Import data from a .csv file using the read.csv(...) function.
Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.
Write data to a new .csv file using the write.csv(...) function.
Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.
Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/