How to Read a Dataset in CSV Into RStudio

Reading and Writing CSV Files

Overview

Teaching: 30 min
Exercises: 0 min

Questions

  • How do I read data from a CSV file into R?

  • How do I write data to a CSV file?

Objectives

  • Read in a .csv, and explore the arguments of the csv reader.

  • Write the altered data set to a new .csv, and explore the arguments.

The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets in comma-separated values files (CSV) and then use R's built-in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the arguments that allow you to read and write the data correctly for your needs.

Read a .csv and Explore the Arguments

Let's start by opening a .csv file containing information on the speeds at which cars of different colors were clocked in 45 mph zones in the four-corners states (CarSpeeds.csv). We will use the built-in read.csv(...) function call, which reads the data in as a data frame, and assign the data frame to a variable (using <-) so that it is stored in R's memory. Then we will explore some of the basic arguments that can be supplied to the function. First, open the RStudio project containing the scripts and data you were working on in episode 'Analyzing Patient Data'.

    # Import the data and look at the first six rows
    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    35  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona

Changing Delimiters

The default delimiter of the read.csv() function is a comma, but you can use other delimiters by supplying the 'sep' argument to the function (e.g., typing sep = ';' allows a semi-colon separated file to be correctly imported - see ?read.csv() for more information on this and other options for working with different file types).
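As a minimal sketch of the 'sep' argument in action, the example below writes a small semicolon-separated file to a temporary location and reads it twice. The file contents and the names semicolon_path, wrong and right are made up for illustration; they are not part of the lesson data.

```r
# Hypothetical data: a tiny semicolon-separated file in a temporary location
semicolon_path <- tempfile(fileext = ".csv")
writeLines(c("Color;Speed;State",
             "Blue;32;NewMexico",
             "Red;45;Arizona"),
           semicolon_path)

# With the default sep = ",", each line is treated as a single field
wrong <- read.csv(file = semicolon_path)
ncol(wrong)

# Supplying sep = ";" splits the fields as intended
right <- read.csv(file = semicolon_path, sep = ";")
ncol(right)
names(right)
```

Comparing the two column counts is a quick way to spot a delimiter mismatch before going any further with the data.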

The call above will import the data, but we have not taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let's explore some of these arguments.

The default for read.csv(...) is to set the header argument to TRUE. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the header argument to FALSE:

    # The first row of the data without setting the header argument:
    carSpeeds[1, ]

      Color Speed     State
    1  Blue    32 NewMexico

    # The first row of the data if the header argument is set to FALSE:
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', header = FALSE)
    carSpeeds[1, ]

         V1    V2    V3
    1 Color Speed State

Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.
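For a file that really has no header row, a sketch like the following may help. It reads made-up inline data through the text argument (so no file is needed) and, as one option, supplies column names at read time with col.names. The data and the names raw_lines, no_header and named are assumptions for illustration.

```r
# Hypothetical headerless data, supplied inline via the `text` argument
raw_lines <- "Blue,32,NewMexico
Red,45,Arizona"

# header = FALSE keeps the first line as data and assigns default names
no_header <- read.csv(text = raw_lines, header = FALSE)
names(no_header)

# Column names can be supplied while reading with col.names
named <- read.csv(text = raw_lines, header = FALSE,
                  col.names = c("Color", "Speed", "State"))
names(named)
nrow(named)
```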

The stringsAsFactors Argument

In older versions of R (prior to 4.0) this was perhaps the most important argument in read.csv(), particularly if you were working with categorical data. This is because the default behavior of R was to convert character strings into factors, which may make it difficult to do such things as replace values. It is important to be aware of this behavior, which we will demonstrate. For example, let's say we find out that the data collector was color blind, and accidentally recorded green cars as being blue. In order to correct the data set, let's replace 'Blue' with 'Green' in the $Color column:

    # Here we will use R's `ifelse` function, in which we provide the test phrase,
    # the outcome if the result of the test is 'TRUE', and the outcome if the
    # result is 'FALSE'. We will also assign the results to the Color column,
    # using '<-'

    # First - reload the data with a header
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" "1"     "Green" "5"     "4"     "Green" "Green" "2"     "5"
     [10] "4"     "4"     "5"     "Green" "Green" "2"     "4"     "Green" "Green"
     [19] "5"     "Green" "Green" "Green" "4"     "Green" "4"     "4"     "4"
     [28] "4"     "5"     "Green" "4"     "5"     "2"     "4"     "2"     "2"
     [37] "Green" "4"     "2"     "4"     "2"     "2"     "4"     "4"     "5"
     [46] "2"     "Green" "4"     "4"     "2"     "2"     "4"     "5"     "4"
     [55] "Green" "Green" "2"     "Green" "5"     "2"     "4"     "Green" "Green"
     [64] "5"     "2"     "4"     "4"     "2"     "Green" "5"     "Green" "4"
     [73] "5"     "5"     "Green" "Green" "Green" "Green" "Green" "5"     "2"
     [82] "Green" "5"     "2"     "2"     "4"     "4"     "5"     "5"     "5"
     [91] "5"     "4"     "4"     "4"     "5"     "2"     "5"     "2"     "2"
    [100] "5"

What happened?!? It looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the factor level was reported following replacement.
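The same effect can be reproduced on a tiny factor without loading any file. This is only a sketch on made-up values (the vector colors is hypothetical), but it shows where the numbers come from: a factor stores integer codes that point at its levels, and ifelse() returns those codes rather than the labels.

```r
# Hypothetical four-element factor standing in for the Color column
colors <- factor(c("Blue", "Red", "White", "Blue"))

levels(colors)       # the labels, sorted alphabetically
as.integer(colors)   # the integer codes stored for each element

# ifelse() takes the 'no' values from the integer codes; combining them with
# the string "Green" coerces the whole result to character
replaced <- ifelse(colors == "Blue", "Green", colors)
replaced
```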

To see the internal structure, we can use another function, str(). In this case, the dataframe's internal structure includes the format of each column, which is what we are interested in. str() will be reviewed a little more in the lesson Data Types and Structures.

    # Reload the data with a header (the previous ifelse call modifies attributes)
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)
    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: Factor w/ 5 levels " Red","Black",..: 3 1 3 5 4 3 3 2 5 4 ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

We can see that the $Color and $State columns are factors and $Speed is a numeric column.

Now, let's load the dataset using stringsAsFactors=FALSE, and see what happens when we try to replace 'Blue' with 'Green' in the $Color column:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE)
    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: chr  "Blue" " Red" "Blue" "White" ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: chr  "NewMexico" "Arizona" "Colorado" "Arizona" ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
     [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
     [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
     [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
     [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
     [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
     [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
     [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
     [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
     [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
     [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
    [100] "White"

That's better! And we can see how the data now is read as character instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE, as this is the default behavior.
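Because the default changed between R versions, one cautious approach is to set stringsAsFactors explicitly so that a script behaves the same everywhere. The sketch below uses made-up inline data (csv_text, as_factors and as_strings are hypothetical names) and checks the resulting column classes with sapply().

```r
# Hypothetical inline data with the same columns as the lesson file
csv_text <- "Color,Speed,State
Blue,32,NewMexico
Red,45,Arizona"

as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Compare the class of every column under each setting
sapply(as_factors, class)
sapply(as_strings, class)

# TRUE when the running R version already defaults to stringsAsFactors = FALSE
getRversion() >= "4.0.0"
```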

The as.is Argument

This is an extension of the stringsAsFactors argument, but gives you control over individual columns. For example, if we want the colors of cars imported as strings, but we want the names of the states imported as factors, we would load the data set as:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', as.is = 1)
    # Note, the 1 applies as.is to the first column only

Now we can see that if we try to replace 'Blue' with 'Green' in the $Color column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $State column returns the factor numbers for the names of states that we haven't replaced:

    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: chr  "Blue" " Red" "Blue" "White" ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
     [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
     [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
     [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
     [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
     [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
     [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
     [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
     [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
     [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
     [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
    [100] "White"

    carSpeeds$State <- ifelse(carSpeeds$State == 'Arizona', 'Ohio', carSpeeds$State)
    carSpeeds$State

     [1] "3"    "Ohio" "2"    "Ohio" "Ohio" "Ohio" "3"    "2"    "Ohio" "2"
    [11] "4"    "4"    "4"    "4"    "4"    "3"    "Ohio" "3"    "Ohio" "4"
    [21] "4"    "4"    "3"    "2"    "2"    "3"    "2"    "4"    "2"    "4"
    [31] "3"    "2"    "2"    "4"    "2"    "2"    "3"    "Ohio" "4"    "2"
    [41] "2"    "3"    "Ohio" "4"    "Ohio" "2"    "3"    "3"    "3"    "2"
    [51] "Ohio" "4"    "4"    "Ohio" "3"    "2"    "4"    "2"    "4"    "4"
    [61] "4"    "2"    "3"    "2"    "3"    "2"    "3"    "Ohio" "3"    "4"
    [71] "4"    "2"    "Ohio" "4"    "2"    "2"    "2"    "Ohio" "3"    "Ohio"
    [81] "4"    "2"    "2"    "Ohio" "Ohio" "Ohio" "4"    "Ohio" "4"    "4"
    [91] "4"    "Ohio" "Ohio" "3"    "2"    "2"    "4"    "3"    "Ohio" "4"

We can see that the $Color column is a character while $State is a factor.
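The as.is argument accepts column positions or column names. The sketch below, on made-up inline data (csv_text, by_index and by_name are hypothetical names), reads the same text both ways and checks which columns stayed as character strings.

```r
# Hypothetical inline data with the same columns as the lesson file
csv_text <- "Color,Speed,State
Blue,32,NewMexico
Red,45,Arizona"

# Keep column 1 as character; other text columns are converted to factors
by_index <- read.csv(text = csv_text, as.is = 1)

# The same request, naming the column instead of giving its position
by_name <- read.csv(text = csv_text, as.is = "Color")

sapply(by_index, class)
sapply(by_name, class)
```

Naming the column may be the more robust choice if the column order in the file could change.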

Updating Values in a Factor

Suppose we want to keep the colors of cars as factors for some other operations we want to perform. Write code for replacing 'Blue' with 'Green' in the $Color column of the cars dataset without importing the data with stringsAsFactors=FALSE.

Solution

    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    # Replace 'Blue' with 'Green' in carSpeeds$Color without using the
    # stringsAsFactors or as.is arguments
    carSpeeds$Color <- ifelse(as.character(carSpeeds$Color) == 'Blue',
                              'Green',
                              as.character(carSpeeds$Color))
    # Convert colors back to factors
    carSpeeds$Color <- as.factor(carSpeeds$Color)
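A different technique from the solution above, shown here only as a sketch, is to rename the factor level directly with levels(). This avoids the round trip through character strings. The vector colors is a made-up stand-in for the $Color column.

```r
# Hypothetical factor standing in for carSpeeds$Color
colors <- factor(c("Blue", "Red", "White", "Blue"))

# Rename the level 'Blue' to 'Green'; every element with that level follows
levels(colors)[levels(colors) == "Blue"] <- "Green"

colors
levels(colors)
```

The result is still a factor, so no as.factor() conversion is needed afterwards.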

The strip.white Argument

It is not uncommon for mistakes to have been made when the data were recorded, for example a space (whitespace) may have been inserted before a data value. By default this whitespace will be kept in the R environment, such that ' Red' will be recognized as a different value than 'Red'. In order to avoid this type of error, use the strip.white argument. Let's see how this works by checking for the unique values in the $Color column of our dataset:

Here, the data recorder added a space before the color of the car in one of the cells:

    # We use the built-in unique() function to extract the unique colors in our dataset
    unique(carSpeeds$Color)

    [1] Green  Red  White Red   Black
    Levels:  Red Black Green Red White

Oops, we see two values for red cars.

Let's try again, this time importing the data using the strip.white argument. NOTE - this argument only takes effect when a delimiter has been set through the sep argument, by which we indicate the type of delimiter in the file (the comma for most .csv files). read.csv() already defaults to sep = ',', so passing it explicitly here simply makes the delimiter visible.

    carSpeeds <- read.csv(
      file = 'data/car-speeds.csv',
      stringsAsFactors = FALSE,
      strip.white = TRUE,
      sep = ','
    )

    unique(carSpeeds$Color)

    [1] "Blue"  "Red"   "White" "Black"

That's better!
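The effect of strip.white can also be checked on a few lines of made-up inline data, without touching the lesson file. In this sketch, csv_text, kept and stripped are hypothetical names, and the second data row deliberately starts with a space.

```r
# Hypothetical inline data; note the leading space before the first "Red"
csv_text <- "Color,Speed
Blue,32
 Red,45
Red,25"

kept     <- read.csv(text = csv_text, stringsAsFactors = FALSE)
stripped <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                     strip.white = TRUE)

unique(kept$Color)       # ' Red' and 'Red' are treated as different values
unique(stripped$Color)   # the leading space is gone, so only one 'Red'
```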

Specify Missing Data When Loading

It is common for data sets to have missing values, or mistakes. The convention for recording missing values often depends on the individual who collected the data and can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, but not some of the examples above. Let's say the inflammation scale in the data set we used earlier, inflammation-01.csv, actually starts at 1 for no inflammation and the zero values (0) were a missed observation. Looking at the ?read.csv help page, is there an argument we could use to ensure all zeros (0) are read in as NA? Perhaps the car-speeds.csv data contains mistakes, and the person measuring the car speeds could not accurately distinguish between "Black" and "Blue" cars. Is there a way to specify more than one 'string', such as "Black" and "Blue", to be replaced by NA?

Solution

    read.csv(file = "data/inflammation-01.csv", na.strings = "0")

or, in car-speeds.csv, use a character vector for multiple values.

    read.csv(
      file = 'data/car-speeds.csv',
      na.strings = c("Black", "Blue")
    )
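The other conventions mentioned in the exercise, such as n.a. and --, can be handled the same way. The sketch below uses made-up inline data (csv_text, raw and clean are hypothetical names) to show that, once these markers are declared in na.strings, the Speed column is read as numbers with NA rather than as text.

```r
# Hypothetical inline data using two different missing-value markers
csv_text <- "Color,Speed
Blue,32
Red,n.a.
White,--
Black,41"

# Without na.strings, the markers force Speed to be read as character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$Speed)

# Declaring the markers turns them into NA and lets Speed be numeric
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("n.a.", "--"))
class(clean$Speed)
clean$Speed
```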

Write a New .csv and Explore the Arguments

After altering our cars dataset by replacing 'Blue' with 'Green' in the $Color column, we now want to save the output. There are several arguments for the write.csv(...) function call, a few of which are particularly important for how the data are exported. Let's explore these now.

    # Export the data. The write.csv() function requires a minimum of two
    # arguments, the data to be saved and the name of the output file.
    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')

If you open the file, you'll see that it has header names, because the data had headers within R, but that there are numbers in the first column.

[Image: csv written without row.names argument]

The row.names Argument

This argument allows us to set the names of the rows in the output data file. R's default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set row.names to FALSE:

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we see:

[Image: csv written with row.names argument]
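The difference can also be inspected from within R by writing to temporary files and printing the raw lines. This is a sketch on a made-up two-row data frame; cars, with_rows and without_rows are hypothetical names, not objects from the lesson.

```r
# Hypothetical two-row data frame standing in for carSpeeds
cars <- data.frame(Color = c("Green", "Red"), Speed = c(32L, 45L))

with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(cars, file = with_rows)                        # default row.names = TRUE
write.csv(cars, file = without_rows, row.names = FALSE)

readLines(with_rows)      # extra first column holding the row numbers
readLines(without_rows)   # only the data columns

# Reading the first file back shows the row numbers as an extra column
ncol(read.csv(with_rows))
ncol(read.csv(without_rows))
```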

Setting Column Names

There is also a col.names argument, which can be used to set the column names for a data set without headers. If the data set already has headers (e.g., we used the header = TRUE argument when importing the data) then a col.names argument will be ignored. Note that write.csv() itself ignores any attempt to change col.names and issues a warning; the more general write.table() function accepts a vector of column names.
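If different column names are needed in the exported file, one possible route is write.table() with sep = ",", which accepts a character vector for col.names. This is a sketch under that assumption rather than part of the lesson; the data frame cars, the path out_path and the names "colour" and "mph" are made up for illustration.

```r
# Hypothetical two-row data frame standing in for carSpeeds
cars <- data.frame(Color = c("Green", "Red"), Speed = c(32L, 45L))
out_path <- tempfile(fileext = ".csv")

# write.table() with a comma separator and custom column names
write.table(cars, file = out_path, sep = ",", row.names = FALSE,
            col.names = c("colour", "mph"))

readLines(out_path)
names(read.csv(out_path))
```

Renaming the columns in R before exporting, with names(cars) <- ..., followed by a plain write.csv() call would be an equally reasonable choice.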

The na Argument

There are times when we want to specify certain values for NAs in the data set (e.g., we are going to pass the data to a program that only accepts -9999 as a nodata value). In this case, we want to set the NA value of our output file to the desired value, using the na argument. Let's see how this works:

    # First, replace the speed in the 3rd row with NA, by using an index (square
    # brackets to indicate the position of the value we want to replace)
    carSpeeds$Speed[3] <- NA
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    NA  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we'll set NA to -9999 when we write the new .csv file:

    # Note - the na argument requires a string input
    write.csv(carSpeeds,
              file = 'data/car-speeds-cleaned.csv',
              row.names = FALSE,
              na = '-9999')

And we see:

[Image: csv written with -9999 as NA]
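As a sketch of the full round trip, the example below writes a made-up data frame with one missing speed, prints the raw lines, and then reads the file back using na.strings so that -9999 becomes NA again. The names cars, out_path and restored are hypothetical.

```r
# Hypothetical data frame with one missing speed
cars <- data.frame(Color = c("Green", "Red", "Blue"),
                   Speed = c(32L, NA, 35L))
out_path <- tempfile(fileext = ".csv")

# Write NA as the string '-9999'
write.csv(cars, file = out_path, row.names = FALSE, na = "-9999")
readLines(out_path)

# Reading it back: na.strings converts -9999 to NA again
restored <- read.csv(out_path, na.strings = "-9999")
restored$Speed
```

Without the matching na.strings on import, -9999 would be read as an ordinary speed and could distort any summary of the column.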

Key Points

  • Import data from a .csv file using the read.csv(...) function.

  • Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.

  • Write data to a new .csv file using the write.csv(...) function.

  • Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.

Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/
