Author: saqibkhan

  • Mean, Median and Mode

    Statistical analysis in R is performed using many built-in functions, most of which are part of the R base package. These functions take an R vector as input, along with other arguments, and return the result.

    The functions we are discussing in this chapter are mean, median and mode.

    Mean

    The mean is calculated by taking the sum of the values and dividing by the number of values in a data series.

    The function mean() is used to calculate this in R.

    Syntax

    The basic syntax for calculating mean in R is −

    mean(x, trim = 0, na.rm = FALSE, ...)
    

    Following is the description of the parameters used −

    • x is the input vector.
    • trim is the fraction (0 to 0.5) of observations dropped from each end of the sorted vector.
    • na.rm is a logical value indicating whether missing values (NA) are removed from the input vector before the calculation.

    Example

    # Create a vector. 
    x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
    
    # Find Mean.
    result.mean <- mean(x)
    print(result.mean)

    When we execute the above code, it produces the following result −

    [1] 8.22
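
    The definition above can be checked by hand. The short sketch below recomputes the mean as the sum divided by the count and compares it with mean(). The variable name manual.mean is ours and is not part of the original example.

    ```r
    # Same vector as in the example above.
    x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

    # Mean by definition: sum of the values divided by the number of values.
    manual.mean <- sum(x) / length(x)
    print(manual.mean)

    # Compare with the built-in function, allowing for floating point rounding.
    print(isTRUE(all.equal(manual.mean, mean(x))))
    ```

    Both approaches should agree, printing 8.22 and TRUE.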
    

    Applying Trim Option

    When the trim parameter is supplied, the values in the vector are sorted and the required number of observations is dropped from each end before the mean is calculated.

    When trim = 0.3, 3 values from each end are dropped from the calculation, because 30% of the 10 values is 3.

    In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18, 54) and the values removed from the vector for calculating the mean are (−21, −5, 2) from the left and (12, 18, 54) from the right.

    # Create a vector.
    x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
    
    # Find Mean.
    result.mean <-  mean(x,trim = 0.3)
    print(result.mean)

    When we execute the above code, it produces the following result −

    [1] 5.55
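
    The trimming steps described above can be reproduced by hand. This is only a sketch of the idea for a trim value between 0 and 0.5, not the source code of mean(); the names k, sorted.x and kept are hypothetical.

    ```r
    x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
    trim <- 0.3

    # Number of observations dropped from each end: floor(10 * 0.3) = 3.
    k <- floor(length(x) * trim)

    # Sort the vector, then keep only the middle values.
    sorted.x <- sort(x)
    kept <- sorted.x[(k + 1):(length(x) - k)]
    print(kept)

    # The mean of the kept values matches mean(x, trim = 0.3).
    print(mean(kept))
    ```

    The kept values are 3, 4.2, 7 and 8, whose mean is the 5.55 shown above.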
    

    Applying NA Option

    If the vector contains missing values, the mean function returns NA.

    To drop the missing values from the calculation, use na.rm = TRUE, which removes the NA values before the mean is computed.

    # Create a vector. 
    x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
    
    # Find mean.
    result.mean <-  mean(x)
    print(result.mean)
    
    # Find mean dropping NA values.
    result.mean <-  mean(x,na.rm = TRUE)
    print(result.mean)

    When we execute the above code, it produces the following result −

    [1] NA
    [1] 8.22
    

    Median

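    The middle value of a sorted data series is called the median. When the series has an even number of values, the median is the average of the two middle values. The median() function is used in R to calculate this value.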

    Syntax

    The basic syntax for calculating median in R is −

    median(x, na.rm = FALSE)
    

    Following is the description of the parameters used −

    • x is the input vector.
    • na.rm is used to remove the missing values from the input vector.

    Example

    # Create the vector.
    x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
    
    # Find the median.
    median.result <- median(x)
    print(median.result)

    When we execute the above code, it produces the following result −

    [1] 5.6
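
    Since the vector has ten values, the median is the average of the fifth and sixth sorted values. A minimal sketch that checks this by hand follows; the names sorted.x, n and manual.median are ours.

    ```r
    x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
    sorted.x <- sort(x)
    n <- length(sorted.x)

    if (n %% 2 == 0) {
       # Even count: average the two middle values (4.2 and 7 here).
       manual.median <- (sorted.x[n / 2] + sorted.x[n / 2 + 1]) / 2
    } else {
       # Odd count: take the single middle value.
       manual.median <- sorted.x[(n + 1) / 2]
    }
    print(manual.median)
    ```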
    

    Mode

    The mode is the value that has the highest number of occurrences in a set of data. Unlike mean and median, the mode can be found for both numeric and character data.

    R does not have a standard built-in function to calculate the statistical mode (the built-in mode() function returns the storage mode of an object instead). So we create a user-defined function to calculate the mode of a data set in R. This function takes the vector as input and gives the mode value as output.

    Example

    # Create the function.
    getmode <- function(v) {
       uniqv <- unique(v)
       uniqv[which.max(tabulate(match(v, uniqv)))]
    }
    
    # Create the vector with numbers.
    v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
    
    # Calculate the mode using the user function.
    result <- getmode(v)
    print(result)
    
    # Create the vector with characters.
    charv <- c("o","it","the","it","it")
    
    # Calculate the mode using the user function.
    result <- getmode(charv)
    print(result)

    When we execute the above code, it produces the following result −

    [1] 2
    [1] "it"
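
    In the numeric vector above, 2 and 3 both occur four times; getmode() returns 2 because which.max() picks the first maximum it finds. If all tied values are of interest, one possible approach (a sketch, with the hypothetical name all.modes) is to count with table() and keep every value that reaches the highest count.

    ```r
    v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

    # Count how often each distinct value occurs.
    counts <- table(v)
    print(counts)

    # Keep every value whose count equals the highest count.
    all.modes <- names(counts)[counts == max(counts)]
    print(all.modes)
    ```

    Note that names() returns character strings ("2" and "3" here); as.numeric() can convert them back for numeric data.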
    
  • Scatterplots

    Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two variables. One variable is plotted on the horizontal axis and the other on the vertical axis.

    The simple scatterplot is created using the plot() function.

    Syntax

    The basic syntax for creating a scatterplot in R is −

    plot(x, y, main, xlab, ylab, xlim, ylim, axes)
    

    Following is the description of the parameters used −

    • x is the data set whose values are the horizontal coordinates.
    • y is the data set whose values are the vertical coordinates.
    • main is the title of the graph.
    • xlab is the label on the horizontal axis.
    • ylab is the label on the vertical axis.
    • xlim gives the limits of the x values used for plotting.
    • ylim gives the limits of the y values used for plotting.
    • axes indicates whether both axes should be drawn on the plot.

    Example

    We use the data set “mtcars” available in the R environment to create a basic scatterplot. Let’s use the columns “wt” and “mpg” in mtcars.

    input <- mtcars[,c('wt','mpg')]
    print(head(input))

    When we execute the above code, it produces the following result −

                        wt      mpg
    Mazda RX4           2.620   21.0
    Mazda RX4 Wag       2.875   21.0
    Datsun 710          2.320   22.8
    Hornet 4 Drive      3.215   21.4
    Hornet Sportabout   3.440   18.7
    Valiant             3.460   18.1
    

    Creating the Scatterplot

    The below script will create a scatterplot graph for the relation between wt (weight) and mpg (miles per gallon).

    # Get the input values.
    input <- mtcars[,c('wt','mpg')]
    
    # Give the chart file a name.
    png(file = "scatterplot.png")
    
    # Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
    plot(x = input$wt,y = input$mpg,
       xlab = "Weight",
       ylab = "Mileage",
       xlim = c(2.5,5),
       ylim = c(15,30),
       main = "Weight vs Mileage"
    )
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Scatter Plot using R

    Scatterplot Matrices

    When we have more than two variables and we want to examine the relationship between each variable and the remaining ones, we use a scatterplot matrix. We use the pairs() function to create matrices of scatterplots.

    Syntax

    The basic syntax for creating scatterplot matrices in R is −

    pairs(formula, data)
    

    Following is the description of the parameters used −

    • formula represents the series of variables used in pairs.
    • data represents the data set from which the variables will be taken.

    Example

    Each variable is paired up with each of the remaining variables. A scatterplot is plotted for each pair.

    # Give the chart file a name.
    png(file = "scatterplot_matrices.png")
    
    # Plot the matrices between 4 variables giving 12 plots.
    
    # One variable with 3 others and total 4 variables.
    
    pairs(~wt+mpg+disp+cyl,data = mtcars,
       main = "Scatterplot Matrix")
    
    # Save the file.
    dev.off()

    When the above code is executed we get the following output.

    Scatter Plot Matrices using R
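
    A scatterplot matrix shows the relationships visually. If numbers are wanted alongside it, one option (our suggestion, not part of the original example) is the base R cor() function, which returns the pairwise correlation coefficients for the same four columns. The name cor.matrix is hypothetical.

    ```r
    # Pairwise Pearson correlations for the columns used in the scatterplot matrix.
    cor.matrix <- cor(mtcars[, c("wt", "mpg", "disp", "cyl")])
    print(round(cor.matrix, 2))
    ```

    A negative entry, such as the one for wt and mpg, means that one variable tends to fall as the other rises.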
  • Line Graphs

    A line chart is a graph that connects a series of points by drawing line segments between them. These points are ordered by one of their coordinates (usually the x-coordinate). Line charts are usually used to identify trends in data.

    The plot() function in R is used to create the line graph.

    Syntax

    The basic syntax to create a line chart in R is −

    plot(v,type,col,xlab,ylab,main)
    

    Following is the description of the parameters used −

    • v is a vector containing the numeric values.
    • type takes the value “p” to draw only the points, “l” to draw only the lines and “o” to draw both points and lines.
    • xlab is the label for x axis.
    • ylab is the label for y axis.
    • main is the Title of the chart.
    • col is used to give colors to both the points and lines.

    Example

    A simple line chart is created using the input vector and the type parameter as “o”. The below script will create and save a line chart in the current R working directory.

    # Create the data for the chart.
    v <- c(7,12,28,3,41)
    
    # Give the chart file a name.
    png(file = "line_chart.png")
    
    # Plot the line chart.
    plot(v,type = "o")
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Line Chart using R

    Line Chart Title, Color and Labels

    The features of the line chart can be expanded by using additional parameters. We add color to the points and lines, give a title to the chart and add labels to the axes.

    Example

    # Create the data for the chart.
    v <- c(7,12,28,3,41)
    
    # Give the chart file a name.
    png(file = "line_chart_label_colored.png")
    
    # Plot the line chart.
    plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
       main = "Rain fall chart")
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Line Chart Labeled with Title in R

    Multiple Lines in a Line Chart

    More than one line can be drawn on the same chart by using the lines() function.

    After the first line is plotted, the lines() function can use an additional vector as input to draw the second line in the chart.

    # Create the data for the chart.
    v <- c(7,12,28,3,41)
    t <- c(14,7,6,19,3)
    
    # Give the chart file a name.
    png(file = "line_chart_2_lines.png")
    
    # Plot the line chart.
    plot(v,type = "o",col = "red", xlab = "Month", ylab = "Rain fall",
       main = "Rain fall chart")
    
    lines(t, type = "o", col = "blue")
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Line Chart with multiple lines in R
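
    The lines() function draws into the axes that plot() has already set up, so values of the second series that lie outside the range of the first are not shown. A hedged sketch, assuming a hypothetical second series t2 with a larger value, sets ylim from both vectors before plotting. The output file name is also only an illustration.

    ```r
    v <- c(7,12,28,3,41)
    t2 <- c(14,7,60,19,3)

    # Compute a y-axis range that covers both series.
    y.range <- range(c(v, t2))

    # Give the chart file a name.
    png(file = "line_chart_2_lines_ylim.png")

    # Plot the first series with the shared range, then add the second.
    plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
       main = "Rain fall chart", ylim = y.range)
    lines(t2, type = "o", col = "blue")

    # Save the file.
    dev.off()
    ```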
  • Histograms

    A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges. The height of each bar represents the number of values present in that range.

    R creates a histogram using the hist() function. This function takes a vector as input and uses some more parameters to plot the histogram.

    Syntax

    The basic syntax for creating a histogram using R is −

    hist(v,main,xlab,xlim,ylim,breaks,col,border)
    

    Following is the description of the parameters used −

    • v is a vector containing numeric values used in histogram.
    • main indicates title of the chart.
    • col is used to set color of the bars.
    • border is used to set border color of each bar.
    • xlab is used to give description of x-axis.
    • xlim is used to specify the range of values on the x-axis.
    • ylim is used to specify the range of values on the y-axis.
    • breaks is used to control the bins: a single number suggests how many bins to use, and a vector gives the bin boundaries.

    Example

    A simple histogram is created using the input vector and the xlab, col and border parameters.

    The script given below will create and save the histogram in the current R working directory.

    # Create data for the graph.
    v <-  c(9,13,21,8,36,22,12,41,31,33,19)
    
    # Give the chart file a name.
    png(file = "histogram.png")
    
    # Create the histogram.
    hist(v,xlab = "Weight",col = "yellow",border = "blue")
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Histogram Of V

    Range of X and Y values

    To specify the range of values shown on the X axis and Y axis, we can use the xlim and ylim parameters.

    The number of bars, and hence the width of each bar, can be influenced by using breaks.

    # Create data for the graph.
    v <- c(9,13,21,8,36,22,12,41,31,33,19)
    
    # Give the chart file a name.
    png(file = "histogram_lim_breaks.png")
    
    # Create the histogram.
    hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim = c(0,5),
       breaks = 5)
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Histogram Line Breaks
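
    To see exactly how hist() buckets the values, it can be called with plot = FALSE, which returns the bins without drawing anything. The sketch below passes breaks as a vector of boundaries (0, 10, ..., 50), an assumption chosen for illustration; a single number such as breaks = 5 is only a suggestion for the number of bins.

    ```r
    v <- c(9,13,21,8,36,22,12,41,31,33,19)

    # Compute the histogram without plotting it.
    h <- hist(v, breaks = seq(0, 50, by = 10), plot = FALSE)

    # Bin boundaries and the number of values in each bin.
    print(h$breaks)
    print(h$counts)
    ```

    Here the counts are 2, 3, 2, 3 and 1; for example, 8 and 9 fall in the bin from 0 to 10 and only 41 falls in the bin from 40 to 50.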
  • Boxplots

    Boxplots show how the data in a data set is distributed. A boxplot uses three quartiles to divide the data set into four parts. The graph represents the minimum, maximum, median, first quartile and third quartile of the data set. It is also useful for comparing the distribution of data across data sets by drawing a boxplot for each of them.

    Boxplots are created in R by using the boxplot() function.

    Syntax

    The basic syntax to create a boxplot in R is −

    boxplot(x, data, notch, varwidth, names, main)
    

    Following is the description of the parameters used −

    • x is a vector or a formula.
    • data is the data frame.
    • notch is a logical value. Set as TRUE to draw a notch.
    • varwidth is a logical value. Set as TRUE to draw the width of each box proportionate to the sample size.
    • names are the group labels which will be printed under each boxplot.
    • main is used to give a title to the graph.

    Example

    We use the data set “mtcars” available in the R environment to create a basic boxplot. Let’s look at the columns “mpg” and “cyl” in mtcars.

    input <- mtcars[,c('mpg','cyl')]
    print(head(input))

    When we execute the above code, it produces the following result −

                       mpg  cyl
    Mazda RX4         21.0   6
    Mazda RX4 Wag     21.0   6
    Datsun 710        22.8   4
    Hornet 4 Drive    21.4   6
    Hornet Sportabout 18.7   8
    Valiant           18.1   6
    

    Creating the Boxplot

    The below script will create a boxplot graph for the relation between mpg (miles per gallon) and cyl (number of cylinders).

    # Give the chart file a name.
    png(file = "boxplot.png")
    
    # Plot the chart.
    boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
       ylab = "Miles Per Gallon", main = "Mileage Data")
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Box Plot using R
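
    The numbers behind each box can be inspected directly. As a sketch, the base R functions tapply() and fivenum() give the minimum, lower hinge, median, upper hinge and maximum of mpg for each cyl group. These are close to what boxplot() draws, though not always identical, since boxplot() may show extreme points separately as outliers. The name summary.by.cyl is hypothetical.

    ```r
    # Five-number summary of mpg for each number of cylinders.
    summary.by.cyl <- tapply(mtcars$mpg, mtcars$cyl, fivenum)
    print(summary.by.cyl)

    # Median mpg per group, which is the line drawn inside each box.
    print(tapply(mtcars$mpg, mtcars$cyl, median))
    ```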

    Boxplot with Notch

    We can draw a boxplot with notches to see how the medians of the different data groups compare with each other.

    The below script will create a boxplot graph with a notch for each of the data groups.

    # Give the chart file a name.
    png(file = "boxplot_with_notch.png")
    
    # Plot the chart.
    boxplot(mpg ~ cyl, data = mtcars, 
       xlab = "Number of Cylinders",
       ylab = "Miles Per Gallon", 
       main = "Mileage Data",
       notch = TRUE, 
       varwidth = TRUE, 
       col = c("green","yellow","purple"),
       names = c("High","Medium","Low")
    )
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Box Plot with notch using R
  • Bar Charts

    A bar chart represents data in rectangular bars, with the length of each bar proportional to the value of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars in the bar chart. Each of the bars can be given a different color.

    Syntax

    The basic syntax to create a bar-chart in R is −

    barplot(H,xlab,ylab,main, names.arg,col)
    

    Following is the description of the parameters used −

    • H is a vector or matrix containing numeric values used in bar chart.
    • xlab is the label for x axis.
    • ylab is the label for y axis.
    • main is the title of the bar chart.
    • names.arg is a vector of names appearing under each bar.
    • col is used to give colors to the bars in the graph.

    Example

    A simple bar chart is created using just the input vector and the name of each bar.

    The below script will create and save the bar chart in the current R working directory.

    # Create the data for the chart
    H <- c(7,12,28,3,41)
    
    # Give the chart file a name
    png(file = "barchart.png")
    
    # Plot the bar chart 
    barplot(H)
    
    # Save the file
    dev.off()

    When we execute the above code, it produces the following result −

    Bar Chart using R

    Bar Chart Labels, Title and Colors

    The features of the bar chart can be expanded by adding more parameters. The main parameter is used to add a title. The col parameter is used to add colors to the bars. The names.arg parameter is a vector with the same number of values as the input vector, describing the meaning of each bar.

    Example

    The below script will create and save the bar chart in the current R working directory.

    # Create the data for the chart
    H <- c(7,12,28,3,41)
    M <- c("Mar","Apr","May","Jun","Jul")
    
    # Give the chart file a name
    png(file = "barchart_months_revenue.png")
    
    # Plot the bar chart 
    barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
    main="Revenue chart",border="red")
    
    # Save the file
    dev.off()

    When we execute the above code, it produces the following result −

    Bar Chart with title using R

    Group Bar Chart and Stacked Bar Chart

    We can create a bar chart with groups of bars or with stacks in each bar by using a matrix as the input values.

    More than two variables are represented as a matrix, which is used to create the group bar chart and the stacked bar chart.

    # Create the input vectors.
    colors = c("green","orange","brown")
    months <- c("Mar","Apr","May","Jun","Jul")
    regions <- c("East","West","North")
    
    # Create the matrix of the values.
    Values <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11), nrow = 3, ncol = 5, byrow = TRUE)
    
    # Give the chart file a name
    png(file = "barchart_stacked.png")
    
    # Create the bar chart
    barplot(Values, main = "total revenue", names.arg = months, xlab = "month", ylab = "revenue", col = colors)
    
    # Add the legend to the chart
    legend("topleft", regions, cex = 1.3, fill = colors)
    
    # Save the file
    dev.off()

    When we execute the above code, it produces the following result −

    Stacked Bar Chart using R
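
    The script above draws stacked bars, which is the default when a matrix is supplied. For grouped bars, barplot() accepts beside = TRUE. A minimal sketch reusing the same data follows; the output file name is our own choice.

    ```r
    # Create the input vectors.
    colors <- c("green","orange","brown")
    months <- c("Mar","Apr","May","Jun","Jul")
    regions <- c("East","West","North")

    # Create the matrix of the values.
    Values <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11), nrow = 3, ncol = 5, byrow = TRUE)

    # Give the chart file a name.
    png(file = "barchart_grouped.png")

    # beside = TRUE places the three region bars next to each other for every month.
    barplot(Values, main = "total revenue", names.arg = months, xlab = "month",
       ylab = "revenue", col = colors, beside = TRUE)

    # Add the legend to the chart.
    legend("topleft", regions, cex = 1.3, fill = colors)

    # Save the file.
    dev.off()
    ```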
  • Pie Charts

    The R programming language has numerous libraries to create charts and graphs. A pie chart is a representation of values as slices of a circle with different colors. The slices are labeled, and the numbers corresponding to each slice are also represented in the chart.

    In R the pie chart is created using the pie() function, which takes positive numbers as a vector input. The additional parameters are used to control labels, color, title etc.

    Syntax

    The basic syntax for creating a pie-chart using the R is −

    pie(x, labels, radius, main, col, clockwise)
    

    Following is the description of the parameters used −

    • x is a vector containing the numeric values used in the pie chart.
    • labels is used to give description to the slices.
    • radius indicates the radius of the circle of the pie chart (value between −1 and +1).
    • main indicates the title of the chart.
    • col indicates the color palette.
    • clockwise is a logical value indicating if the slices are drawn clockwise or anticlockwise.

    Example

    A very simple pie-chart is created using just the input vector and labels. The below script will create and save the pie chart in the current R working directory.

    # Create data for the graph.
    x <- c(21, 62, 10, 53)
    labels <- c("London", "New York", "Singapore", "Mumbai")
    
    # Give the chart file a name.
    png(file = "city.png")
    
    # Plot the chart.
    pie(x,labels)
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Pie Chart using R

    Pie Chart Title and Colors

    We can expand the features of the chart by adding more parameters to the function. We will use the parameter main to add a title to the chart and the parameter col, which will make use of the rainbow colour palette while drawing the chart. The length of the palette should be the same as the number of values we have for the chart. Hence we use length(x).

    Example

    The below script will create and save the pie chart in the current R working directory.

    # Create data for the graph.
    x <- c(21, 62, 10, 53)
    labels <- c("London", "New York", "Singapore", "Mumbai")
    
    # Give the chart file a name.
    png(file = "city_title_colours.png")
    
    # Plot the chart with title and rainbow color palette.
    pie(x, labels, main = "City pie chart", col = rainbow(length(x)))
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    Pie-chart with title and colours

    Slice Percentages and Chart Legend

    We can add slice percentage and a chart legend by creating additional chart variables.

    # Create data for the graph.
    x <-  c(21, 62, 10,53)
    labels <-  c("London","New York","Singapore","Mumbai")
    
    piepercent<- round(100*x/sum(x), 1)
    
    # Give the chart file a name.
    png(file = "city_percentage_legends.png")
    
    # Plot the chart.
    pie(x, labels = piepercent, main = "City pie chart",col = rainbow(length(x)))
    legend("topright", c("London","New York","Singapore","Mumbai"), cex = 0.8,
       fill = rainbow(length(x)))
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    pie-chart with percentage and labels
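
    In the chart above the slices carry only the percentages, and the city names appear in the legend. As an alternative sketch, both can be combined into one label per slice with paste0(); the name slice.labels is hypothetical.

    ```r
    x <- c(21, 62, 10, 53)
    labels <- c("London", "New York", "Singapore", "Mumbai")
    piepercent <- round(100 * x / sum(x), 1)

    # Build labels such as "London (14.4%)".
    slice.labels <- paste0(labels, " (", piepercent, "%)")
    print(slice.labels)
    ```

    Passing slice.labels as the labels argument of pie() shows the city and its percentage on each slice, which makes the legend optional.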

    3D Pie Chart

    A pie chart with 3 dimensions can be drawn using additional packages. The package plotrix has a function called pie3D() that is used for this.

    # Get the library.
    library(plotrix)
    
    # Create data for the graph.
    x <-  c(21, 62, 10,53)
    lbl <-  c("London","New York","Singapore","Mumbai")
    
    # Give the chart file a name.
    png(file = "3d_pie_chart.png")
    
    # Plot the chart.
    pie3D(x,labels = lbl,explode = 0.1, main = "Pie Chart of Cities")
    
    # Save the file.
    dev.off()

    When we execute the above code, it produces the following result −

    3D pie-chart
  • Databases

    Data in relational database systems is stored in a normalized format. So, to carry out statistical computing, we would need very advanced and complex SQL queries. But R can connect easily to many relational databases such as MySQL, Oracle and SQL Server, and fetch records from them as a data frame. Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using all the powerful packages and functions.

    In this tutorial we will be using MySQL as our reference database for connecting to R.

    RMySQL Package

    “RMySQL” is a contributed R package (it is not part of base R) which provides native connectivity with the MySQL database. You can install this package in the R environment using the following command.

    install.packages("RMySQL")

    Connecting R to MySQL

    Once the package is installed we create a connection object in R to connect to the database. It takes the username, password, database name and host name as input.

    # Load the package.
    library(RMySQL)
    
    # Create a connection object to the MySQL database.
    # We will connect to the sample database named "sakila" that comes with the MySQL installation.
    mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila',
       host = 'localhost')
    
    # List the tables available in this database.
    dbListTables(mysqlconnection)

    When we execute the above code, it produces the following result −

     [1] "actor"                      "actor_info"                
     [3] "address"                    "category"                  
     [5] "city"                       "country"                   
     [7] "customer"                   "customer_list"             
     [9] "film"                       "film_actor"                
    [11] "film_category"              "film_list"                 
    [13] "film_text"                  "inventory"                 
    [15] "language"                   "nicer_but_slower_film_list"
    [17] "payment"                    "rental"                    
    [19] "sales_by_film_category"     "sales_by_store"            
    [21] "staff"                      "staff_list"                
    [23] "store"                     
    

    Querying the Tables

    We can query the database tables in MySQL using the function dbSendQuery(). The query is executed in MySQL and the result set is retrieved using the R fetch() function. Finally it is stored as a data frame in R.

    # Query the "actor" table to get all the rows.
    result = dbSendQuery(mysqlconnection, "select * from actor")
    
    # Store the result in an R data frame object. n = 5 is used to fetch the first 5 rows.
    actor.data = fetch(result, n = 5)
    print(actor.data)
    
    # Free the result set before sending another query.
    dbClearResult(result)

    When we execute the above code, it produces the following result −

            actor_id   first_name    last_name         last_update
    1        1         PENELOPE      GUINESS           2006-02-15 04:34:33
    2        2         NICK          WAHLBERG          2006-02-15 04:34:33
    3        3         ED            CHASE             2006-02-15 04:34:33
    4        4         JENNIFER      DAVIS             2006-02-15 04:34:33
    5        5         JOHNNY        LOLLOBRIGIDA      2006-02-15 04:34:33
    

    Query with Filter Clause

    We can pass any valid select query to get the result.

    result = dbSendQuery(mysqlconnection, "select * from actor where last_name = 'TORN'")
    
    # Fetch all the records (with n = -1) and store them as a data frame.
    actor.data = fetch(result, n = -1)
    print(actor.data)

    When we execute the above code, it produces the following result −

            actor_id    first_name     last_name         last_update
    1        18         DAN            TORN              2006-02-15 04:34:33
    2        94         KENNETH        TORN              2006-02-15 04:34:33
    3       102         WALTER         TORN              2006-02-15 04:34:33
    

    Updating Rows in the Tables

    We can update the rows in a MySQL table by passing the update query to the dbSendQuery() function. The example below assumes that the mtcars table has already been created with dbWriteTable(), as shown later in this chapter.

    dbSendQuery(mysqlconnection, "update mtcars set disp = 168.5 where hp = 110")

    After executing the above code we can see the table updated in the MySQL environment.

    Inserting Data into the Tables

    dbSendQuery(mysqlconnection,
       "insert into mtcars(row_names, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb)
       values('New Mazda RX4 Wag', 21, 6, 168.5, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4)"
    )

    After executing the above code we can see the row inserted into the table in the MySQL environment.

    Creating Tables in MySQL

    We can create tables in MySQL using the function dbWriteTable(). It takes a data frame as input and, when overwrite = TRUE is given, overwrites the table if it already exists.

    # Create the connection object to the database where we want to create the table.
    mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila',
       host = 'localhost')
    
    # Use the R data frame "mtcars" to create the table in MySQL.
    # All the rows of mtcars are taken into MySQL.
    dbWriteTable(mysqlconnection, "mtcars", mtcars[, ], overwrite = TRUE)

    After executing the above code we can see the table created in the MySQL environment.

    Dropping Tables in MySQL

    We can drop the tables in a MySQL database by passing the drop table statement to dbSendQuery(), in the same way we used it for querying data from tables.

    dbSendQuery(mysqlconnection, 'drop table if exists mtcars')

    After executing the above code we can see the table is dropped in the MySQL environment.

  • JSON Files

    A JSON file stores data as text in a human-readable format. JSON stands for JavaScript Object Notation. R can read JSON files using the rjson package.

    Install rjson Package

    In the R console, you can issue the following command to install the rjson package.

    install.packages("rjson")
    

    Input Data

    Create a JSON file by copying the data below into a text editor such as Notepad. Save the file as input.json, choosing the file type as all files (*.*).

    {
       "ID":["1","2","3","4","5","6","7","8"],
       "Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru"],
       "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5"],
       "StartDate":["1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
          "7/30/2013","6/17/2014"],
       "Dept":["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
    }

    Read the JSON File

    The JSON file is read by R using the fromJSON() function. It is stored as a list in R.

    # Load the package required to read JSON files.
    library("rjson")
    
    # Give the input file name to the function.
    result <- fromJSON(file = "input.json")
    
    # Print the result.
    print(result)

    When we execute the above code, it produces the following result −

    $ID
    [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"
    
    $Name
    [1] "Rick"     "Dan"      "Michelle" "Ryan"     "Gary"     "Nina"     "Simon"    "Guru"
    
    $Salary
    [1] "623.3"  "515.2"  "611"    "729"    "843.25" "578"    "632.8"  "722.5"
    
    $StartDate
    [1] "1/1/2012"   "9/23/2013"  "11/15/2014" "5/11/2014"  "3/27/2015"  "5/21/2013"
       "7/30/2013"  "6/17/2014"
    
    $Dept
    [1] "IT"         "Operations" "IT"         "HR"         "Finance"    "IT"
       "Operations" "Finance"
    

    Convert JSON to a Data Frame

    We can convert the extracted data above to an R data frame for further analysis using the as.data.frame() function.

    # Load the package required to read JSON files.
    library("rjson")
    
    # Give the input file name to the function.
    result <- fromJSON(file = "input.json")
    
    # Convert JSON file to a data frame.
    json_data_frame <- as.data.frame(result)
    
    print(json_data_frame)

    When we execute the above code, it produces the following result −

      ID     Name Salary  StartDate       Dept
    1  1     Rick  623.3   1/1/2012         IT
    2  2      Dan  515.2  9/23/2013 Operations
    3  3 Michelle    611 11/15/2014         IT
    4  4     Ryan    729  5/11/2014         HR
    5  5     Gary 843.25  3/27/2015    Finance
    6  6     Nina    578  5/21/2013         IT
    7  7    Simon  632.8  7/30/2013 Operations
    8  8     Guru  722.5  6/17/2014    Finance
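
    Because every value in the JSON file is quoted, all columns arrive in R as text. Before doing arithmetic or date calculations, the columns usually need converting. The sketch below rebuilds a shortened version of the list by hand (an assumption, so that it runs without the input file or the rjson package) and converts the types. The date format string assumes month/day/year, as in the input data.

    ```r
    # A shortened, hand-built stand-in for the list returned by fromJSON().
    result <- list(
       ID = c("1","2","3"),
       Name = c("Rick","Dan","Michelle"),
       Salary = c("623.3","515.2","611"),
       StartDate = c("1/1/2012","9/23/2013","11/15/2014"),
       Dept = c("IT","Operations","IT")
    )

    json_data_frame <- as.data.frame(result, stringsAsFactors = FALSE)

    # Convert the text columns to proper types.
    json_data_frame$ID <- as.integer(json_data_frame$ID)
    json_data_frame$Salary <- as.numeric(json_data_frame$Salary)
    json_data_frame$StartDate <- as.Date(json_data_frame$StartDate, format = "%m/%d/%Y")

    # Inspect the column types, then use a numeric column.
    str(json_data_frame)
    print(mean(json_data_frame$Salary))
    ```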
    
  • Web Data

    Many websites provide data for consumption by their users. For example, the World Health Organization (WHO) provides reports on health and medical information in the form of CSV, txt and XML files. Using R programs, we can programmatically extract specific data from such websites. Some packages in R which are used to scrape data from the web are “RCurl”, “XML” and “stringr”. They are used to connect to the URLs, identify the required links for the files and download them to the local environment.

    Install R Packages

    The following packages are required for processing the URLs and links to the files. If they are not available in your R environment, you can install them using the following commands.

    install.packages("RCurl")
    install.packages("XML")
    install.packages("stringr")
    install.packages("plyr")
    

    Input Data

    We will visit the weather data URL and download the CSV files for the year 2015 using R.

    Example

    We will use the function getHTMLLinks() to gather the URLs of the files. Then we will use the function download.file() to save the files to the local system. As we will be applying the same code again and again for multiple files, we will create a function to be called multiple times. The filenames are passed as parameters in the form of an R list object to this function.

    # Load the required packages.
    library(XML)
    library(stringr)
    library(plyr)
    
    # Read the URL.
    url <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"
    
    # Gather the html links present in the webpage.
    links <- getHTMLLinks(url)
    
    # Identify only the links which point to the JCMB 2015 files.
    filenames <- links[str_detect(links, "JCMB_2015")]
    
    # Store the file names as a list.
    filenames_list <- as.list(filenames)
    
    # Create a function to download one file from the main URL and the file name.
    downloadcsv <- function (mainurl,filename) {
       filedetails <- str_c(mainurl,filename)
       download.file(filedetails,filename)
    }
    
    # Now apply the l_ply function and save the files into the current R working directory.
    l_ply(filenames_list,downloadcsv,mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
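
    The filtering and URL-building steps can be tried without a network connection. This sketch uses the base R functions grepl() and paste0() in place of str_detect() and str_c(), and a made-up links vector standing in for the result of getHTMLLinks(). The names links and file.urls are assumptions for illustration only.

    ```r
    mainurl <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"

    # Hypothetical links, standing in for what getHTMLLinks(url) might return.
    links <- c("JCMB_2014.csv", "JCMB_2015.csv", "JCMB_2015_Jan.csv", "index.html")

    # Keep only the links that contain "JCMB_2015".
    filenames <- links[grepl("JCMB_2015", links, fixed = TRUE)]
    print(filenames)

    # Build the full download URL for each file.
    file.urls <- paste0(mainurl, filenames)
    print(file.urls)
    ```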

    Verify the File Download

    After running the above code, you can locate the following files in the current R working directory.

    "JCMB_2015.csv" "JCMB_2015_Apr.csv" "JCMB_2015_Feb.csv" "JCMB_2015_Jan.csv"
       "JCMB_2015_Mar.csv"