Category: 2. Data Interfaces

https://cdn-icons-png.flaticon.com/512/5145/5145268.png

  • Databases

    The data is Relational database systems are stored in a normalized format. So, to carry out statistical computing we will need very advanced and complex Sql queries. But R can connect easily to many relational databases like MySql, Oracle, Sql server etc. and fetch records from them as a data frame. Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using all the powerful packages and functions.

    In this tutorial we will be using MySql as our reference database for connecting to R.

    RMySQL Package

    R has a built-in package named “RMySQL” which provides native connectivity between with MySql database. You can install this package in the R environment using the following command.

    install.packages("RMySQL")

    Connecting R to MySql

    Once the package is installed we create a connection object in R to connect to the database. It takes the username, password, database name and host name as input.

    # Create a connection Object to MySQL database.
    # We will connect to the sampel database named "sakila" that comes with MySql installation.
    mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila',
       host = 'localhost')
    
    # List the tables available in this database.
     dbListTables(mysqlconnection)

    When we execute the above code, it produces the following result −

     [1] "actor"                      "actor_info"                
     [3] "address"                    "category"                  
     [5] "city"                       "country"                   
     [7] "customer"                   "customer_list"             
     [9] "film"                       "film_actor"                
    [11] "film_category"              "film_list"                 
    [13] "film_text"                  "inventory"                 
    [15] "language"                   "nicer_but_slower_film_list"
    [17] "payment"                    "rental"                    
    [19] "sales_by_film_category"     "sales_by_store"            
    [21] "staff"                      "staff_list"                
    [23] "store"                     
    

    Explore our latest online courses and learn new skills at your own pace. Enroll and become a certified expert to boost your career.

    Querying the Tables

    We can query the database tables in MySql using the function dbSendQuery(). The query gets executed in MySql and the result set is returned using the R fetch() function. Finally it is stored as a data frame in R.

    # Query the "actor" tables to get all the rows.
    result = dbSendQuery(mysqlconnection, "select * from actor")
    
    # Store the result in a R data frame object. n = 5 is used to fetch first 5 rows.
    data.frame = fetch(result, n = 5)
    print(data.fame)

    When we execute the above code, it produces the following result −

            actor_id   first_name    last_name         last_update
    1        1         PENELOPE      GUINESS           2006-02-15 04:34:33
    2        2         NICK          WAHLBERG          2006-02-15 04:34:33
    3        3         ED            CHASE             2006-02-15 04:34:33
    4        4         JENNIFER      DAVIS             2006-02-15 04:34:33
    5        5         JOHNNY        LOLLOBRIGIDA      2006-02-15 04:34:33
    

    Query with Filter Clause

    We can pass any valid select query to get the result.

    result = dbSendQuery(mysqlconnection, "select * from actor where last_name = 'TORN'")
    
    # Fetch all the records(with n = -1) and store it as a data frame.
    data.frame = fetch(result, n = -1)
    print(data)

    When we execute the above code, it produces the following result −

            actor_id    first_name     last_name         last_update
    1        18         DAN            TORN              2006-02-15 04:34:33
    2        94         KENNETH        TORN              2006-02-15 04:34:33
    3       102         WALTER         TORN              2006-02-15 04:34:33
    

    Updating Rows in the Tables

    We can update the rows in a Mysql table by passing the update query to the dbSendQuery() function.

    dbSendQuery(mysqlconnection, "update mtcars set disp = 168.5 where hp = 110")

    After executing the above code we can see the table updated in the MySql Environment.

    Inserting Data into the Tables

    dbSendQuery(mysqlconnection,
       "insert into mtcars(row_names, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb)
       values('New Mazda RX4 Wag', 21, 6, 168.5, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4)"
    )

    After executing the above code we can see the row inserted into the table in the MySql Environment.

    Creating Tables in MySql

    We can create tables in the MySql using the function dbWriteTable(). It overwrites the table if it already exists and takes a data frame as input.

    # Create the connection object to the database where we want to create the table.
    mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila', 
       host = 'localhost')
    
    # Use the R data frame "mtcars" to create the table in MySql.
    # All the rows of mtcars are taken inot MySql.
    dbWriteTable(mysqlconnection, "mtcars", mtcars[, ], overwrite = TRUE)

    After executing the above code we can see the table created in the MySql Environment.

    Dropping Tables in MySql

    We can drop the tables in MySql database passing the drop table statement into the dbSendQuery() in the same way we used it for querying data from tables.

    dbSendQuery(mysqlconnection, 'drop table if exists mtcars')

    After executing the above code we can see the table is dropped in the MySql Environment.

  • JSON Files

    JSON file stores data as text in human-readable format. Json stands for JavaScript Object Notation. R can read JSON files using the rjson package.

    Install rjson Package

    In the R console, you can issue the following command to install the rjson package.

    install.packages("rjson")
    

    Input Data

    Create a JSON file by copying the below data into a text editor like notepad. Save the file with a .json extension and choosing the file type as all files(*.*).

    { 
       "ID":["1","2","3","4","5","6","7","8" ],
       "Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
       "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
       
       "StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
    
      "7/30/2013","6/17/2014"],
    "Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"] }

    Explore our latest online courses and learn new skills at your own pace. Enroll and become a certified expert to boost your career.

    Read the JSON File

    The JSON file is read by R using the function from JSON(). It is stored as a list in R.

    # Load the package required to read JSON files.
    library("rjson")
    
    # Give the input file name to the function.
    result <- fromJSON(file = "input.json")
    
    # Print the result.
    print(result)

    When we execute the above code, it produces the following result −

    $ID
    [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"
    
    $Name
    [1] "Rick"     "Dan"      "Michelle" "Ryan"     "Gary"     "Nina"     "Simon"    "Guru"
    
    $Salary
    [1] "623.3"  "515.2"  "611"    "729"    "843.25" "578"    "632.8"  "722.5"
    
    $StartDate
    [1] "1/1/2012"   "9/23/2013"  "11/15/2014" "5/11/2014"  "3/27/2015"  "5/21/2013"
       "7/30/2013"  "6/17/2014"
    
    $Dept
    [1] "IT"         "Operations" "IT"         "HR"         "Finance"    "IT"
       "Operations" "Finance"
    

    Convert JSON to a Data Frame

    We can convert the extracted data above to a R data frame for further analysis using the as.data.frame() function.

    # Load the package required to read JSON files.
    library("rjson")
    
    # Give the input file name to the function.
    result <- fromJSON(file = "input.json")
    
    # Convert JSON file to a data frame.
    json_data_frame <- as.data.frame(result)
    
    print(json_data_frame)

    When we execute the above code, it produces the following result −

          id,   name,    salary,   start_date,     dept
    1      1    Rick     623.30    2012-01-01      IT
    2      2    Dan      515.20    2013-09-23      Operations
    3      3    Michelle 611.00    2014-11-15      IT
    4      4    Ryan     729.00    2014-05-11      HR
    5     NA    Gary     843.25    2015-03-27      Finance
    6      6    Nina     578.00    2013-05-21      IT
    7      7    Simon    632.80    2013-07-30      Operations
    8      8    Guru     722.50    2014-06-17      Finance
    
  • Web Data

    Many websites provide data for consumption by its users. For example the World Health Organization(WHO) provides reports on health and medical information in the form of CSV, txt and XML files. Using R programs, we can programmatically extract specific data from such websites. Some packages in R which are used to scrap data form the web are − “RCurl”,XML”, and “stringr”. They are used to connect to the URL’s, identify required links for the files and download them to the local environment.

    Install R Packages

    The following packages are required for processing the URL’s and links to the files. If they are not available in your R Environment, you can install them using following commands.

    install.packages("RCurl")
    install.packages("XML")
    install.packages("stringr")
    install.packages("plyr")
    

    Input Data

    We will visit the URL weather data and download the CSV files using R for the year 2015.

    Explore our latest online courses and learn new skills at your own pace. Enroll and become a certified expert to boost your career.

    Example

    We will use the function getHTMLLinks() to gather the URLs of the files. Then we will use the function download.file() to save the files to the local system. As we will be applying the same code again and again for multiple files, we will create a function to be called multiple times. The filenames are passed as parameters in form of a R list object to this function.

    # Read the URL.
    url <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"
    
    # Gather the html links present in the webpage.
    links <- getHTMLLinks(url)
    
    # Identify only the links which point to the JCMB 2015 files. 
    filenames <- links[str_detect(links, "JCMB_2015")]
    
    # Store the file names as a list.
    filenames_list <- as.list(filenames)
    
    # Create a function to download the files by passing the URL and filename list.
    downloadcsv <- function (mainurl,filename) {
       filedetails <- str_c(mainurl,filename)
       download.file(filedetails,filename)
    }
    
    # Now apply the l_ply function and save the files into the current R working directory.
    l_ply(filenames,downloadcsv,mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")

    Verify the File Download

    After running the above code, you can locate the following files in the current R working directory.

    "JCMB_2015.csv" "JCMB_2015_Apr.csv" "JCMB_2015_Feb.csv" "JCMB_2015_Jan.csv"
       "JCMB_2015_Mar.csv"
    
  • XML Files

    XML is a file format which shares both the file format and the data on the World Wide Web, intranets, and elsewhere using standard ASCII text. It stands for Extensible Markup Language (XML). Similar to HTML it contains markup tags. But unlike HTML where the markup tag describes structure of the page, in xml the markup tags describe the meaning of the data contained into he file.

    You can read a xml file in R using the “XML” package. This package can be installed using following command.

    install.packages("XML")
    

    Input Data

    Create a XMl file by copying the below data into a text editor like notepad. Save the file with a .xml extension and choosing the file type as all files(*.*).

    <RECORDS>
       <EMPLOYEE>
    
      &lt;ID&gt;1&lt;/ID&gt;
      &lt;NAME&gt;Rick&lt;/NAME&gt;
      &lt;SALARY&gt;623.3&lt;/SALARY&gt;
      &lt;STARTDATE&gt;1/1/2012&lt;/STARTDATE&gt;
      &lt;DEPT&gt;IT&lt;/DEPT&gt;
    </EMPLOYEE> <EMPLOYEE>
      &lt;ID&gt;2&lt;/ID&gt;
      &lt;NAME&gt;Dan&lt;/NAME&gt;
      &lt;SALARY&gt;515.2&lt;/SALARY&gt;
      &lt;STARTDATE&gt;9/23/2013&lt;/STARTDATE&gt;
      &lt;DEPT&gt;Operations&lt;/DEPT&gt;
    </EMPLOYEE> <EMPLOYEE>
      &lt;ID&gt;3&lt;/ID&gt;
      &lt;NAME&gt;Michelle&lt;/NAME&gt;
      &lt;SALARY&gt;611&lt;/SALARY&gt;
      &lt;STARTDATE&gt;11/15/2014&lt;/STARTDATE&gt;
      &lt;DEPT&gt;IT&lt;/DEPT&gt;
    </EMPLOYEE> <EMPLOYEE>
      &lt;ID&gt;4&lt;/ID&gt;
      &lt;NAME&gt;Ryan&lt;/NAME&gt;
      &lt;SALARY&gt;729&lt;/SALARY&gt;
      &lt;STARTDATE&gt;5/11/2014&lt;/STARTDATE&gt;
      &lt;DEPT&gt;HR&lt;/DEPT&gt;
    </EMPLOYEE> <EMPLOYEE>
      &lt;ID&gt;5&lt;/ID&gt;
      &lt;NAME&gt;Gary&lt;/NAME&gt;
      &lt;SALARY&gt;843.25&lt;/SALARY&gt;
      &lt;STARTDATE&gt;3/27/2015&lt;/STARTDATE&gt;
      &lt;DEPT&gt;Finance&lt;/DEPT&gt;
    </EMPLOYEE> <EMPLOYEE>
      &lt;ID&gt;6&lt;/ID&gt;
      &lt;NAME&gt;Nina&lt;/NAME&gt;
      &lt;SALARY&gt;578&lt;/SALARY&gt;
      &lt;STARTDATE&gt;5/21/2013&lt;/STARTDATE&gt;
      &lt;DEPT&gt;IT&lt;/DEPT&gt;
    </EMPLOYEE> <EMPLOYEE>
      &lt;ID&gt;7&lt;/ID&gt;
      &lt;NAME&gt;Simon&lt;/NAME&gt;
      &lt;SALARY&gt;632.8&lt;/SALARY&gt;
      &lt;STARTDATE&gt;7/30/2013&lt;/STARTDATE&gt;
      &lt;DEPT&gt;Operations&lt;/DEPT&gt;
    </EMPLOYEE> <EMPLOYEE>
      &lt;ID&gt;8&lt;/ID&gt;
      &lt;NAME&gt;Guru&lt;/NAME&gt;
      &lt;SALARY&gt;722.5&lt;/SALARY&gt;
      &lt;STARTDATE&gt;6/17/2014&lt;/STARTDATE&gt;
      &lt;DEPT&gt;Finance&lt;/DEPT&gt;
    </EMPLOYEE> </RECORDS>

    Reading XML File

    The xml file is read by R using the function xmlParse(). It is stored as a list in R.

    # Load the package required to read XML files.
    library("XML")
    
    # Also load the other required package.
    library("methods")
    
    # Give the input file name to the function.
    result <- xmlParse(file = "input.xml")
    
    # Print the result.
    print(result)

    When we execute the above code, it produces the following result −

    1
    Rick
    623.3
    1/1/2012
    IT
    
    2
    Dan
    515.2
    9/23/2013
    Operations
    
    3
    Michelle
    611
    11/15/2014
    IT
    
    4
    Ryan
    729
    5/11/2014
    HR
    
    5
    Gary
    843.25
    3/27/2015
    Finance
    
    6
    Nina
    578
    5/21/2013
    IT
    
    7
    Simon
    632.8
    7/30/2013
    Operations
    
    8
    Guru
    722.5
    6/17/2014
    Finance
    

    Get Number of Nodes Present in XML File

    # Load the packages required to read XML files.
    library("XML")
    library("methods")
    
    # Give the input file name to the function.
    result <- xmlParse(file = "input.xml")
    
    # Exract the root node form the xml file.
    rootnode <- xmlRoot(result)
    
    # Find number of nodes in the root.
    rootsize <- xmlSize(rootnode)
    
    # Print the result.
    print(rootsize)

    When we execute the above code, it produces the following result −

    output
    [1] 8
    

    Explore our latest online courses and learn new skills at your own pace. Enroll and become a certified expert to boost your career.

    Details of the First Node

    Let’s look at the first record of the parsed file. It will give us an idea of the various elements present in the top level node.

    # Load the packages required to read XML files.
    library("XML")
    library("methods")
    
    # Give the input file name to the function.
    result <- xmlParse(file = "input.xml")
    
    # Exract the root node form the xml file.
    rootnode <- xmlRoot(result)
    
    # Print the result.
    print(rootnode[1])

    When we execute the above code, it produces the following result −

    $EMPLOYEE
       1
       Rick
       623.3
       1/1/2012
       IT
     
    
    attr(,"class")
    [1] "XMLInternalNodeList" "XMLNodeList" 
    

    Get Different Elements of a Node

    # Load the packages required to read XML files.
    library("XML")
    library("methods")
    
    # Give the input file name to the function.
    result <- xmlParse(file = "input.xml")
    
    # Exract the root node form the xml file.
    rootnode <- xmlRoot(result)
    
    # Get the first element of the first node.
    print(rootnode[[1]][[1]])
    
    # Get the fifth element of the first node.
    print(rootnode[[1]][[5]])
    
    # Get the second element of the third node.
    print(rootnode[[3]][[2]])

    When we execute the above code, it produces the following result −

    1 
    IT 
    Michelle 
    

    XML to Data Frame

    To handle the data effectively in large files we read the data in the xml file as a data frame. Then process the data frame for data analysis.

    # Load the packages required to read XML files.
    library("XML")
    library("methods")
    
    # Convert the input xml file to a data frame.
    xmldataframe <- xmlToDataFrame("input.xml")
    print(xmldataframe)

    When we execute the above code, it produces the following result −

          ID    NAME     SALARY    STARTDATE       DEPT 
    1      1    Rick     623.30    2012-01-01      IT
    2      2    Dan      515.20    2013-09-23      Operations
    3      3    Michelle 611.00    2014-11-15      IT
    4      4    Ryan     729.00    2014-05-11      HR
    5     NA    Gary     843.25    2015-03-27      Finance
    6      6    Nina     578.00    2013-05-21      IT
    7      7    Simon    632.80    2013-07-30      Operations
    8      8    Guru     722.50    2014-06-17      Finance
    

    As the data is now available as a dataframe we can use data frame related function to read and manipulate the file.

  • Binary Files

    A binary file is a file that contains information stored only in form of bits and bytes.(0’s and 1’s). They are not human readable as the bytes in it translate to characters and symbols which contain many other non-printable characters. Attempting to read a binary file using any text editor will show characters like Ø and ð.

    The binary file has to be read by specific programs to be useable. For example, the binary file of a Microsoft Word program can be read to a human readable form only by the Word program. Which indicates that, besides the human readable text, there is a lot more information like formatting of characters and page numbers etc., which are also stored along with alphanumeric characters. And finally a binary file is a continuous sequence of bytes. The line break we see in a text file is a character joining first line to the next.

    Sometimes, the data generated by other programs are required to be processed by R as a binary file. Also R is required to create binary files which can be shared with other programs.

    R has two functions WriteBin() and readBin() to create and read binary files.

    Syntax

    writeBin(object, con)
    readBin(con, what, n )
    

    Following is the description of the parameters used −

    • con is the connection object to read or write the binary file.
    • object is the binary file which to be written.
    • what is the mode like character, integer etc. representing the bytes to be read.
    • n is the number of bytes to read from the binary file.

    Example

    We consider the R inbuilt data “mtcars”. First we create a csv file from it and convert it to a binary file and store it as a OS file. Next we read this binary file created into R.

    Writing the Binary File

    We read the data frame “mtcars” as a csv file and then write it as a binary file to the OS.

    # Read the "mtcars" data frame as a csv file and store only the columns 
       "cyl", "am" and "gear".
    write.table(mtcars, file = "mtcars.csv",row.names = FALSE, na = "", 
       col.names = TRUE, sep = ",")
    
    # Store 5 records from the csv file as a new data frame.
    new.mtcars <- read.table("mtcars.csv",sep = ",",header = TRUE,nrows = 5)
    
    # Create a connection object to write the binary file using mode "wb".
    write.filename = file("/web/com/binmtcars.dat", "wb")
    
    # Write the column names of the data frame to the connection object.
    writeBin(colnames(new.mtcars), write.filename)
    
    # Write the records in each of the column to the file.
    writeBin(c(new.mtcars$cyl,new.mtcars$am,new.mtcars$gear), write.filename)
    
    # Close the file for writing so that it can be read by other program.
    close(write.filename)

    Reading the Binary File

    The binary file created above stores all the data as continuous bytes. So we will read it by choosing appropriate values of column names as well as the column values.

    # Create a connection object to read the file in binary mode using "rb".
    read.filename <- file("/web/com/binmtcars.dat", "rb")
    
    # First read the column names. n = 3 as we have 3 columns.
    column.names <- readBin(read.filename, character(),  n = 3)
    
    # Next read the column values. n = 18 as we have 3 column names and 15 values.
    read.filename <- file("/web/com/binmtcars.dat", "rb")
    bindata <- readBin(read.filename, integer(),  n = 18)
    
    # Print the data.
    print(bindata)
    
    # Read the values from 4th byte to 8th byte which represents "cyl".
    cyldata = bindata[4:8]
    print(cyldata)
    
    # Read the values form 9th byte to 13th byte which represents "am".
    amdata = bindata[9:13]
    print(amdata)
    
    # Read the values form 9th byte to 13th byte which represents "gear".
    geardata = bindata[14:18]
    print(geardata)
    
    # Combine all the read values to a dat frame.
    finaldata = cbind(cyldata, amdata, geardata)
    colnames(finaldata) = column.names
    print(finaldata)

    When we execute the above code, it produces the following result and chart −

     [1]    7108963 1728081249    7496037          6          6          4
     [7]          6          8          1          1          1          0
    [13]          0          4          4          4          3          3
    
    [1] 6 6 4 6 8
    
    [1] 1 1 1 0 0
    
    [1] 4 4 4 3 3
    
    
     cyl am gear
    [1,] 6 1 4 [2,] 6 1 4 [3,] 4 1 4 [4,] 6 0 3 [5,] 8 0 3

    As we can see, we got the original data back by reading the binary file in R.

  • Excel File

    Microsoft Excel is the most widely used spreadsheet program which stores data in the .xls or .xlsx format. R can read directly from these files using some excel specific packages. Few such packages are – XLConnect, xlsx, gdata etc. We will be using xlsx package. R can also write into excel file using this package.

    Install xlsx Package

    You can use the following command in the R console to install the “xlsx” package. It may ask to install some additional packages on which this package is dependent. Follow the same command with required package name to install the additional packages.

    install.packages("xlsx")
    

    Verify and Load the “xlsx” Package

    Use the following command to verify and load the “xlsx” package.

    # Verify the package is installed.
    any(grepl("xlsx",installed.packages()))
    
    # Load the library into R workspace.
    library("xlsx")

    When the script is run we get the following output.

    [1] TRUE
    Loading required package: rJava
    Loading required package: methods
    Loading required package: xlsxjars
    

    Explore our latest online courses and learn new skills at your own pace. Enroll and become a certified expert to boost your career.

    Input as xlsx File

    Open Microsoft excel. Copy and paste the following data in the work sheet named as sheet1.

    id	name      salary    start_date	dept
    1	Rick	    623.3	  1/1/2012	   IT
    2	Dan       515.2     9/23/2013    Operations
    3	Michelle  611	     11/15/2014	IT
    4	Ryan	    729	     5/11/2014	   HR
    5	Gary	    43.25     3/27/2015  	Finance
    6	Nina	    578       5/21/2013	   IT
    7	Simon	    632.8	  7/30/2013	   Operations
    8	Guru	    722.5	  6/17/2014	   Finance
    

    Also copy and paste the following data to another worksheet and rename this worksheet to “city”.

    name	    city
    Rick	    Seattle
    Dan       Tampa
    Michelle  Chicago
    Ryan	    Seattle
    Gary	    Houston
    Nina	    Boston
    Simon	    Mumbai
    Guru	    Dallas
    

    Save the Excel file as “input.xlsx”. You should save it in the current working directory of the R workspace.

    Reading the Excel File

    The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as a data frame in the R environment.

    # Read the first worksheet in the file input.xlsx.
    data <- read.xlsx("input.xlsx", sheetIndex = 1)
    print(data)

    When we execute the above code, it produces the following result −

          id,   name,     salary,   start_date,   dept
    1      1    Rick      623.30    2012-01-01    IT
    2      2    Dan       515.20    2013-09-23    Operations
    3      3    Michelle  611.00    2014-11-15    IT
    4      4    Ryan      729.00    2014-05-11    HR
    5     NA    Gary      843.25    2015-03-27    Finance
    6      6    Nina      578.00    2013-05-21    IT
    7      7    Simon     632.80    2013-07-30    Operations
    8      8    Guru      722.50    2014-06-17    Finance
    
  • CSV Files

    In R, we can read data from files stored outside the R environment. We can also write data into files which will be stored and accessed by the operating system. R can read and write into various file formats like csv, excel, xml etc.

    In this chapter we will learn to read data from a csv file and then write data into a csv file. The file should be present in current working directory so that R can read it. Of course we can also set our own directory and read files from there.

    Getting and Setting the Working Directory

    You can check which directory the R workspace is pointing to using the getwd() function. You can also set a new working directory using setwd()function.

    # Get and print current working directory.
    print(getwd())
    
    # Set current working directory.
    setwd("/web/com")
    
    # Get and print current working directory.
    print(getwd())

    When we execute the above code, it produces the following result −

    [1] "/web/com/1441086124_2016"
    [1] "/web/com"
    

    This result depends on your OS and your current directory where you are working.

    Input as CSV File

    The csv file is a text file in which the values in the columns are separated by a comma. Let’s consider the following data present in the file named input.csv.

    You can create this file using windows notepad by copying and pasting this data. Save the file as input.csv using the save As All files(*.*) option in notepad.

    id,name,salary,start_date,dept
    1,Rick,623.3,2012-01-01,IT
    2,Dan,515.2,2013-09-23,Operations
    3,Michelle,611,2014-11-15,IT
    4,Ryan,729,2014-05-11,HR
    5,Gary,843.25,2015-03-27,Finance
    6,Nina,578,2013-05-21,IT
    7,Simon,632.8,2013-07-30,Operations
    8,Guru,722.5,2014-06-17,Finance

    Explore our latest online courses and learn new skills at your own pace. Enroll and become a certified expert to boost your career.

    Reading a CSV File

    Following is a simple example of read.csv() function to read a CSV file available in your current working directory −

    data <- read.csv("input.csv")
    print(data)

    When we execute the above code, it produces the following result −

          id,   name,    salary,   start_date,     dept
    1      1    Rick     623.30    2012-01-01      IT
    2      2    Dan      515.20    2013-09-23      Operations
    3      3    Michelle 611.00    2014-11-15      IT
    4      4    Ryan     729.00    2014-05-11      HR
    5     NA    Gary     843.25    2015-03-27      Finance
    6      6    Nina     578.00    2013-05-21      IT
    7      7    Simon    632.80    2013-07-30      Operations
    8      8    Guru     722.50    2014-06-17      Finance
    

    Analyzing the CSV File

    By default the read.csv() function gives the output as a data frame. This can be easily checked as follows. Also we can check the number of columns and rows.

    data <- read.csv("input.csv")
    
    print(is.data.frame(data))
    print(ncol(data))
    print(nrow(data))

    When we execute the above code, it produces the following result −

    [1] TRUE
    [1] 5
    [1] 8
    

    Once we read data in a data frame, we can apply all the functions applicable to data frames as explained in subsequent section.

    Get the maximum salary

    # Create a data frame.
    data <- read.csv("input.csv")
    
    # Get the max salary from data frame.
    sal <- max(data$salary)
    print(sal)

    When we execute the above code, it produces the following result −

    [1] 843.25
    

    Get the details of the person with max salary

    We can fetch rows meeting specific filter criteria similar to a SQL where clause.

    # Create a data frame.
    data <- read.csv("input.csv")
    
    # Get the max salary from data frame.
    sal <- max(data$salary)
    
    # Get the person detail having max salary.
    retval <- subset(data, salary == max(salary))
    print(retval)

    When we execute the above code, it produces the following result −

          id    name  salary  start_date    dept
    5     NA    Gary  843.25  2015-03-27    Finance
    

    Get all the people working in IT department

    # Create a data frame.
    data <- read.csv("input.csv")
    
    retval <- subset( data, dept == "IT")
    print(retval)

    When we execute the above code, it produces the following result −

           id   name      salary   start_date   dept
    1      1    Rick      623.3    2012-01-01   IT
    3      3    Michelle  611.0    2014-11-15   IT
    6      6    Nina      578.0    2013-05-21   IT
    

    Get the persons in IT department whose salary is greater than 600

    # Create a data frame.
    data <- read.csv("input.csv")
    
    info <- subset(data, salary > 600 & dept == "IT")
    print(info)

    When we execute the above code, it produces the following result −

           id   name      salary   start_date   dept
    1      1    Rick      623.3    2012-01-01   IT
    3      3    Michelle  611.0    2014-11-15   IT
    

    Get the people who joined on or after 2014

    # Create a data frame.
    data <- read.csv("input.csv")
    
    retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
    print(retval)

    When we execute the above code, it produces the following result −

           id   name     salary   start_date    dept
    3      3    Michelle 611.00   2014-11-15    IT
    4      4    Ryan     729.00   2014-05-11    HR
    5     NA    Gary     843.25   2015-03-27    Finance
    8      8    Guru     722.50   2014-06-17    Finance
    

    Writing into a CSV File

    R can create csv file form existing data frame. The write.csv() function is used to create the csv file. This file gets created in the working directory.

    # Create a data frame.
    data <- read.csv("input.csv")
    retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
    
    # Write filtered data into a new file.
    write.csv(retval,"output.csv")
    newdata <- read.csv("output.csv")
    print(newdata)

    When we execute the above code, it produces the following result −

      X      id   name      salary   start_date    dept
    1 3      3    Michelle  611.00   2014-11-15    IT
    2 4      4    Ryan      729.00   2014-05-11    HR
    3 5     NA    Gary      843.25   2015-03-27    Finance
    4 8      8    Guru      722.50   2014-06-17    Finance
    

    Here the column X comes from the data set newper. This can be dropped using additional parameters while writing the file.

    # Create a data frame.
    data <- read.csv("input.csv")
    retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
    
    # Write filtered data into a new file.
    write.csv(retval,"output.csv", row.names = FALSE)
    newdata <- read.csv("output.csv")
    print(newdata)

    When we execute the above code, it produces the following result −

          id    name      salary   start_date    dept
    1      3    Michelle  611.00   2014-11-15    IT
    2      4    Ryan      729.00   2014-05-11    HR
    3     NA    Gary      843.25   2015-03-27    Finance
    4      8    Guru      722.50   2014-06-17    Finance