XML is a file format for sharing both the format and the data of a document on the World Wide Web, intranets, and elsewhere as plain text. XML stands for Extensible Markup Language. Like HTML, it contains markup tags. But unlike HTML, where the markup tags describe the structure of the page, in XML the markup tags describe the meaning of the data contained in the file.
You can read an XML file in R using the "XML" package, which is installed from the R console with install.packages("XML").
Input Data
Create an XML file by copying the input data into a text editor such as Notepad. Save the file with a .xml extension, choosing "All files (*.*)" as the file type.
Reading XML File
The XML file is read by R using the function xmlParse(). The result is stored in R as a parsed document object whose nodes can be traversed starting from the root.
Get Number of Nodes Present in XML File
The number of nodes is found by taking the root node with xmlRoot() and passing it to xmlSize().
Details of the First Node
Let's look at the first record of the parsed file. It gives us an idea of the various elements present in the top-level node.
Get Different Elements of a Node
Individual elements are reached by indexing the root node with [[ ]]: the first index selects a record, and a second index selects an element within that record.
XML to Data Frame
To handle the data in large files effectively, we read the data in the XML file as a data frame, using xmlToDataFrame(), and then process the data frame for analysis. As the data is then available as a data frame, we can use data frame functions to read and manipulate it.
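As a minimal sketch of the steps above, the following parses a small XML document and converts it to a data frame. The sample records, the variable names, and the choice to parse from a string (via asText = TRUE) rather than from a file are assumptions for illustration. The code runs only when the XML package is installed.

```r
# Hypothetical sample records; the tutorial's own input file is not reproduced here.
xml_text <- '<RECORDS>
  <EMPLOYEE><ID>1</ID><NAME>Rick</NAME><SALARY>623.3</SALARY></EMPLOYEE>
  <EMPLOYEE><ID>2</ID><NAME>Dan</NAME><SALARY>515.2</SALARY></EMPLOYEE>
</RECORDS>'

if (requireNamespace("XML", quietly = TRUE)) {
  # Parse the text and take the root node of the document.
  doc <- XML::xmlParse(xml_text, asText = TRUE)
  rootnode <- XML::xmlRoot(doc)

  # Number of child nodes (records) under the root.
  node_count <- XML::xmlSize(rootnode)
  print(node_count)

  # The first record, and one element inside it.
  print(rootnode[[1]])
  first_name <- XML::xmlValue(rootnode[[1]][["NAME"]])
  print(first_name)

  # Convert all records to a data frame, one row per record.
  xml_df <- XML::xmlToDataFrame(doc, stringsAsFactors = FALSE)
  print(xml_df)
}
```

To read from a file instead, pass the file path to xmlParse() and drop the asText argument.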
Binary Files
A binary file is a file that contains information stored only in the form of bits and bytes (0s and 1s). Binary files are not human readable, because their bytes translate to characters and symbols that include many non-printable characters. Attempting to read a binary file using a text editor will show characters like Ø and ð.
A binary file has to be read by a specific program to be usable. For example, the binary file of a Microsoft Word document can be rendered in human-readable form only by the Word program. This indicates that, besides the human-readable text, the file stores much more information, such as character formatting and page numbers, along with the alphanumeric characters. Finally, a binary file is a continuous sequence of bytes. The line break we see in a text file is itself just a character that joins one line to the next.
Sometimes data generated by other programs must be processed by R as a binary file. R may also need to create binary files that can be shared with other programs. R has two functions, writeBin() and readBin(), to create and read binary files.
Syntax
The basic calls are writeBin(object, con) and readBin(con, what, n). Following is the description of the parameters used: con is the connection object used to read or write the binary file, object is the vector to be written, what is the mode (such as character or integer) of the values to be read, and n is the maximum number of values to read.
Example
We consider the R built-in data set "mtcars". First we create a CSV file from it, convert it to a binary file, and store it as a file on the operating system. Next we read the binary file back into R.
Writing the Binary File
We read the data frame "mtcars" as a CSV file and then write it as a binary file to the operating system.
Reading the Binary File
The binary file created above stores all the data as continuous bytes, so we read it by choosing appropriate values for the column names as well as the column values. Reading the file this way gives back the original data.
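A simplified sketch of the round trip described above follows. It assumes only three integer-valued columns of mtcars are written, skips the intermediate CSV step, and keeps the column names in R rather than in the file. The temporary file path and variable names are hypothetical.

```r
# Take the first five rows of three integer-valued columns of mtcars.
cols <- c("cyl", "am", "gear")
values <- as.integer(unlist(mtcars[1:5, cols]))  # column by column: cyl, am, gear

bin_path <- tempfile(fileext = ".dat")  # hypothetical file location

# Write the integers to the file as raw bytes.
con <- file(bin_path, "wb")
writeBin(values, con)
close(con)

# Read the same number of integers back from the file.
con <- file(bin_path, "rb")
restored <- readBin(con, what = integer(), n = length(values))
close(con)

# Rebuild the table: five rows, one column per original variable.
restored_df <- as.data.frame(
  matrix(restored, nrow = 5, dimnames = list(NULL, cols))
)
print(restored_df)
```

Because the file holds nothing but bytes, the reader must already know the type of the values, how many there are, and how they map to columns.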
Excel File
Microsoft Excel is the most widely used spreadsheet program, and it stores data in the .xls or .xlsx format. R can read directly from these files using Excel-specific packages such as XLConnect, xlsx, and gdata. We will be using the xlsx package. R can also write to an Excel file using this package.
Install xlsx Package
You can run install.packages("xlsx") in the R console to install the "xlsx" package. It may ask to install additional packages on which this package depends. Use the same command with the required package name to install them.
Verify and Load the "xlsx" Package
Load the package with library("xlsx"); the call fails with an error if the package is not installed.
Input as xlsx File
Open Microsoft Excel. Copy and paste the first set of input data into the worksheet named sheet1, and paste the second set into another worksheet renamed to "city". Save the Excel file as "input.xlsx" in the current working directory of the R workspace.
Reading the Excel File
The file input.xlsx is read using the read.xlsx() function as shown below. The result is stored as a data frame in the R environment.
    # Load the xlsx package, which provides read.xlsx().
    library("xlsx")
    # Read the first worksheet in the file input.xlsx.
    data <- read.xlsx("input.xlsx", sheetIndex = 1)
    print(data)
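The same steps can be sketched without opening Excel by writing the workbook from R first. This is an illustration under assumptions: the sample rows and variable names are hypothetical, the file is written to a temporary location rather than the working directory, and the code runs only when the xlsx package (which needs Java) can be loaded.

```r
if (requireNamespace("xlsx", quietly = TRUE)) {
  xlsx_path <- tempfile(fileext = ".xlsx")  # hypothetical file location

  # Hypothetical data for the two worksheets.
  emp <- data.frame(id = 1:3,
                    name = c("Rick", "Dan", "Michelle"),
                    salary = c(623.3, 515.2, 611))
  city <- data.frame(name = c("Rick", "Dan", "Michelle"),
                     city = c("Seattle", "Tampa", "Chicago"))

  # Write both worksheets into one workbook.
  xlsx::write.xlsx(emp, xlsx_path, sheetName = "sheet1", row.names = FALSE)
  xlsx::write.xlsx(city, xlsx_path, sheetName = "city",
                   row.names = FALSE, append = TRUE)

  # Read a worksheet by position, and another by name.
  first_sheet <- xlsx::read.xlsx(xlsx_path, sheetIndex = 1)
  city_sheet <- xlsx::read.xlsx(xlsx_path, sheetName = "city")
  print(first_sheet)
  print(city_sheet)
}
```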
CSV Files
In R, we can read data from files stored outside the R environment. We can also write data into files that are stored and accessed by the operating system. R can read and write various file formats such as CSV, Excel, and XML.
In this chapter we will learn to read data from a CSV file and then write data into a CSV file. The file should be present in the current working directory so that R can read it. We can also set our own directory and read files from there.
Getting and Setting the Working Directory
You can check which directory the R workspace is pointing to using the getwd() function. You can also set a new working directory using the setwd() function. The result depends on your OS and the directory in which you are working.
Input as CSV File
A CSV file is a text file in which the values in the columns are separated by commas. Let's consider data stored in a file named input.csv. You can create this file in Windows Notepad by copying and pasting the data, then saving the file as input.csv using the "Save As" dialog with the "All files (*.*)" option.
Reading a CSV File
The read.csv() function reads a CSV file available in your current working directory.
Analyzing the CSV File
By default the read.csv() function returns its output as a data frame. This can be checked with is.data.frame(), and the number of columns and rows can be checked with ncol() and nrow().
Once we have read the data into a data frame, we can apply all the functions applicable to data frames, as explained in the subsequent section.
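The reading and checking steps above can be sketched as follows. The sample rows and column names are hypothetical, and the file is written to a temporary directory instead of the working directory so that the example is self-contained.

```r
# Hypothetical contents for input.csv.
csv_text <- "id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance"

csv_path <- file.path(tempdir(), "input.csv")
writeLines(csv_text, csv_path)

# Show the current working directory (the result depends on your system).
print(getwd())

# Read the file and confirm that the result is a data frame.
emp <- read.csv(csv_path, stringsAsFactors = FALSE)
print(is.data.frame(emp))
print(ncol(emp))
print(nrow(emp))
```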
Get the maximum salary
The maximum salary is obtained by applying max() to the column of salaries.
Get the details of the person with max salary
We can fetch rows meeting specific filter criteria, similar to a SQL WHERE clause.
Get all the people working in IT department
The rows are filtered on the department being IT.
Get the persons in IT department whose salary is greater than 600
Two conditions are combined: the department is IT and the salary exceeds 600.
Get the people who joined on or after 2014
The start date of each person is compared with the beginning of 2014.
Writing into a CSV File
R can create a CSV file from an existing data frame. The write.csv() function is used to create the CSV file, which is written to the working directory.
When the file is read back, it contains an extra column X that holds the row names written along with the data set newper. This column can be dropped by passing row.names = FALSE to write.csv() when writing the file.
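The queries and the write step above can be sketched on a small data frame. The rows, column names, and output path below are hypothetical, and subset() is one of several ways to filter rows.

```r
# Hypothetical employee data, standing in for the contents of input.csv.
emp <- data.frame(
  id = 1:5,
  name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611, 729, 843.25),
  start_date = c("2012-01-01", "2013-09-23", "2014-11-15",
                 "2014-05-11", "2015-03-27"),
  dept = c("IT", "Operations", "IT", "HR", "Finance"),
  stringsAsFactors = FALSE
)

# Maximum salary, and the row that holds it.
max_salary <- max(emp$salary)
top_earner <- subset(emp, salary == max(salary))

# Everyone in IT, then only those in IT earning more than 600.
it_staff <- subset(emp, dept == "IT")
it_high <- subset(emp, salary > 600 & dept == "IT")

# Everyone who joined on or after 1 January 2014.
recent <- subset(emp, as.Date(start_date) >= as.Date("2014-01-01"))

# Write the result without row names, so no extra X column appears on re-reading.
out_path <- file.path(tempdir(), "output.csv")
write.csv(recent, out_path, row.names = FALSE)
reread <- read.csv(out_path, stringsAsFactors = FALSE)
print(reread)
```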
Data Reshaping
Data reshaping in R is about changing the way data is organized into rows and columns. Most of the time, data processing in R is done by taking the input data as a data frame. It is easy to extract data from the rows and columns of a data frame, but there are situations when we need the data frame in a format different from the one in which we received it. R has many functions to split, merge, and change rows to columns and vice versa in a data frame.
Joining Columns and Rows in a Data Frame
We can join multiple vectors column-wise using the cbind() function; the result is a matrix unless one of the inputs is a data frame, so data.frame() is the direct way to build a data frame from vectors. We can also append the rows of one data frame to another using the rbind() function.
Merging Data Frames
We can merge two data frames by using the merge() function. By default, the merge happens on the columns whose names the two data frames share.
As an example, consider the data sets about diabetes in Pima Indian women available in the library named "MASS". We merge the two data sets based on the values of blood pressure ("bp") and body mass index ("bmi"). On choosing these two columns for merging, the records where the values of these two variables match in both data sets are combined to form a single data frame.
Melting and Casting
One of the most interesting aspects of R programming is changing the shape of the data in multiple steps to get a desired shape. The functions used to do this are called melt() and cast(), provided by the reshape package. We consider the data set called ships present in the library called "MASS".
Melt the Data
Now we melt the data to organize it, converting all columns other than type and year into multiple rows.
Cast the Molten Data
We can cast the molten data into a new form in which an aggregate is computed for each type of ship for each year. This is done using the cast() function.
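The reshaping steps of this section can be sketched end to end on small, hypothetical data frames instead of the MASS data sets. All names and values below are assumptions for illustration. The melt and cast part runs only when the reshape package is installed, and it aggregates by ship type alone to keep the output small.

```r
# Joining: build data frames, append rows with rbind(), add a column with cbind().
addresses <- data.frame(city = c("Tampa", "Seattle"),
                        state = c("FL", "WA"),
                        stringsAsFactors = FALSE)
new_addresses <- data.frame(city = c("Lowry", "Charlotte"),
                            state = c("CO", "FL"),
                            stringsAsFactors = FALSE)
all_addresses <- rbind(addresses, new_addresses)
all_addresses <- cbind(all_addresses, zipcode = c(33602, 98104, 80230, 33949))

# Merging: keep only records whose bp and bmi match in both data frames.
left <- data.frame(bp = c(70, 80, 90), bmi = c(25.1, 30.2, 28.4),
                   age_a = c(30, 41, 52))
right <- data.frame(bp = c(70, 80, 95), bmi = c(25.1, 31.0, 28.4),
                    age_b = c(33, 45, 58))
merged <- merge(left, right, by = c("bp", "bmi"))
print(merged)

# Melting and casting a small stand-in for the ships data.
ships_small <- data.frame(type = c("A", "A", "B", "B"),
                          year = c(60, 65, 60, 65),
                          service = c(127, 63, 44784, 17176),
                          incidents = c(0, 0, 39, 29))
if (requireNamespace("reshape", quietly = TRUE)) {
  # One row per (type, year, variable) combination.
  molten <- reshape::melt(ships_small, id.vars = c("type", "year"))
  # Sum each measured variable over the years, per ship type.
  by_type <- reshape::cast(molten, type ~ variable, sum)
  print(by_type)
}
```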
Packages
R packages are collections of R functions, compiled code, and sample data. They are stored under a directory called "library" in the R environment. By default, R installs a set of packages during installation. More packages are added later, when they are needed for some specific purpose. When we start the R console, only the default packages are available. Other packages that are already installed have to be loaded explicitly before the R program that needs them can use them.
All the packages available in the R language are listed at R Packages.
Below is a list of commands used to check, verify, and use R packages.
Check Available R Packages
Get library locations containing R packages
The .libPaths() function returns the library locations. The result may vary depending on the local settings of your PC.
Get the list of all the packages installed
Calling library() with no arguments lists the installed packages. The result may vary depending on the local settings of your PC.
Get all packages currently loaded in the R environment
The search() function lists the packages currently attached. The result may vary depending on the local settings of your PC.
Install a New Package
There are two ways to add new R packages. One is installing directly from the CRAN repository, and the other is downloading the package to your local system and installing it manually.
Install directly from CRAN
The install.packages() function, called with a package name, gets the package directly from CRAN and installs it in the R environment. You may be prompted to choose the nearest mirror. Choose the one appropriate to your location.
Install package manually
Go to the link R Packages to download the package needed. Save the package as a .zip file in a suitable location on the local system. You can then install it in the R environment by calling install.packages() with the path to the saved file and repos = NULL.
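The checks described above can be sketched as follows. The install calls are left commented out because they need network access or a local file, and the path shown in them is a placeholder. installed.packages() is used here in place of library() because it returns the package list as a value that can be inspected.

```r
# Library locations that R searches for packages.
lib_paths <- .libPaths()
print(lib_paths)

# Names of all installed packages.
installed <- rownames(installed.packages())
print(head(installed))

# Packages and environments currently attached to the session.
attached <- search()
print(attached)

# Installing from CRAN (needs network access):
# install.packages("XML")

# Installing from a downloaded file (placeholder path):
# install.packages("path/to/package_file.zip", repos = NULL)
```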
Load Package to Library
Before a package can be used in code, it must be loaded into the current R environment. This also applies to a package that was installed previously but is not available in the current environment. A package is loaded using the library() function.
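A small sketch of loading a package follows. It uses "tools", chosen here only because it ships with R; any installed package name works the same way.

```r
# Name of the package to load, held in a variable.
pkg <- "tools"

# character.only = TRUE tells library() to use the value of pkg, not the word "pkg".
library(pkg, character.only = TRUE)

# After loading, the package appears on the search path.
is_attached <- paste0("package:", pkg) %in% search()
print(is_attached)

# A single function can also be called without attaching, using the :: operator.
ext <- tools::file_ext("input.csv")
print(ext)
```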
Data Frames
A data frame is a table, or a two-dimensional array-like structure, in which each column contains values of one variable and each row contains one set of values from each column.
Following are the characteristics of a data frame: the column names should be non-empty, the row names should be unique, the data stored can be of numeric, factor, or character type, and each column should contain the same number of data items.
Create Data Frame
A data frame is created with the data.frame() function, passing one vector per column.
Get the Structure of the Data Frame
The structure of the data frame can be seen by using the str() function.
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying the summary() function.
Extract Data from Data Frame
Extract a specific column from a data frame using the column name.
Extract the first two rows and then all columns
Leaving the column index empty inside the square brackets selects all columns.
Extract 3rd and 5th row with 2nd and 4th column
Row positions are given first and column positions second inside the square brackets.
Expand Data Frame
A data frame can be expanded by adding columns and rows.
Add Column
Just add the column vector using a new column name.
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function. The new rows are created as a data frame of their own and combined with the existing data frame to create the final data frame.
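The operations in this section can be sketched on one small data frame. The column names and values are hypothetical sample data.

```r
# Create a data frame, one vector per column.
emp <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Structure and statistical summary.
str(emp)
print(summary(emp))

# Extract columns by name, rows by position, and a row/column selection.
name_salary <- data.frame(emp$emp_name, emp$salary)
first_two <- emp[1:2, ]
picked <- emp[c(3, 5), c(2, 4)]

# Add a column by assigning to a new column name.
emp$dept <- c("IT", "Operations", "IT", "HR", "Finance")

# Add rows: build them with the same structure, then combine with rbind().
new_rows <- data.frame(
  emp_id = 6:8,
  emp_name = c("Rasmi", "Pranab", "Tusar"),
  salary = c(578.0, 722.5, 632.8),
  start_date = as.Date(c("2013-05-21", "2013-07-30", "2014-06-17")),
  dept = c("IT", "Operations", "Finance"),
  stringsAsFactors = FALSE
)
emp_final <- rbind(emp, new_rows)
print(emp_final)
```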
Factors
Factors are data objects used to categorize data and store it as levels. They can store both strings and integers. They are useful in columns that have a limited number of unique values, such as "Male" and "Female", or True and False. They are useful in data analysis for statistical modeling.
Factors are created using the factor() function by taking a vector as input.
Factors in Data Frame
In versions of R before 4.0.0, creating a data frame with a column of text data caused R to treat the text column as categorical data and create a factor from it. From R 4.0.0 onward, this happens only when stringsAsFactors = TRUE is passed to data.frame().
Changing the Order of Levels
The order of the levels in a factor can be changed by applying the factor() function again with the new order of the levels. The first result below shows the default order of the levels, and the second shows the order after the change.
    [1] East  West  East  North North East  West  West  West  East  North
    Levels: East North West
    [1] East  West  East  North North East  West  West  West  East  North
    Levels: East West North
Generating Factor Levels
We can generate factor levels by using the gl() function. It takes two integers as input, which indicate how many levels there are and how many times each level is repeated.
Syntax
The basic call is gl(n, k, labels). Following is the description of the parameters used: n is an integer giving the number of levels, k is an integer giving the number of replications, and labels is a vector of labels for the resulting factor levels.
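The factor operations above can be sketched as follows. The direction values match the output shown in this section; the city labels and the gender column are hypothetical.

```r
# Create a factor from a character vector; levels default to sorted order.
directions <- c("East", "West", "East", "North", "North", "East",
                "West", "West", "West", "East", "North")
direction_factor <- factor(directions)
print(direction_factor)

# Change the order of the levels by applying factor() again.
reordered <- factor(direction_factor, levels = c("East", "West", "North"))
print(reordered)

# Generate a factor with 3 levels, each repeated 4 times.
cities <- gl(3, 4, labels = c("Tampa", "Seattle", "Boston"))
print(cities)

# Ask data.frame() explicitly to turn a text column into a factor.
people <- data.frame(gender = c("Male", "Female", "Male"),
                     stringsAsFactors = TRUE)
print(is.factor(people$gender))
```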
Arrays
Arrays are R data objects that can store data in more than two dimensions. For example, if we create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each with 2 rows and 3 columns. An array can store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array.
Example
The example for this section creates an array of two matrices, each with 3 rows and 3 columns.
Naming Columns and Rows
We can give names to the rows, columns, and matrices in the array by using the dimnames parameter.
Accessing Array Elements
Elements are accessed with square brackets, giving the row, the column, and the matrix in that order; leaving a position empty selects everything along that dimension.
Manipulating Array Elements
As an array is made up of matrices in multiple dimensions, operations on the elements of an array are carried out by accessing the elements of the matrices. Adding two of the matrices, for example, produces the following result:
         [,1] [,2] [,3]
    [1,]   10   20   26
    [2,]   18   22   28
    [3,]    6   24   30
Calculations Across Array Elements
We can do calculations across the elements in an array using the apply() function.
Syntax
The basic call is apply(x, margin, fun). Following is the description of the parameters used: x is an array, margin is the dimension over which the function is applied (1 for rows, 2 for columns), and fun is the function to be applied to the elements of the array.
Example
We use the apply() function to calculate the sum of the elements in the rows of an array across all the matrices.
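The array operations above can be sketched as follows. The input vectors and dimension names are assumptions, chosen so that adding the two matrices reproduces the result shown in this section.

```r
# Two input vectors: 9 values in total, recycled to fill both 3 x 3 matrices.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)

# Create a 3 x 3 x 2 array with names for rows, columns, and matrices.
result <- array(
  c(vector1, vector2),
  dim = c(3, 3, 2),
  dimnames = list(c("ROW1", "ROW2", "ROW3"),
                  c("COL1", "COL2", "COL3"),
                  c("Matrix1", "Matrix2"))
)
print(result)

# Access: third row of the second matrix, one element, and a whole matrix.
third_row <- result[3, , 2]
one_element <- result[1, 3, 1]
second_matrix <- result[, , 2]

# Manipulate: add the two matrices element by element.
summed <- result[, , 1] + result[, , 2]
print(unname(summed))

# Calculate: sum of each row across both matrices.
row_sums <- apply(result, 1, sum)
print(row_sums)
```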
Matrices
Matrices are R objects in which the elements are arranged in a two-dimensional rectangular layout. They contain elements of the same atomic type. Though we can create a matrix containing only characters or only logical values, such matrices are of limited use. We mostly use matrices containing numeric elements in mathematical calculations.
A matrix is created using the matrix() function.
Syntax
The basic syntax for creating a matrix in R is matrix(data, nrow, ncol, byrow, dimnames). Following is the description of the parameters used: data is the input vector that becomes the data elements of the matrix, nrow is the number of rows to be created, ncol is the number of columns to be created, byrow is a logical value that, if TRUE, arranges the input vector elements by row, and dimnames gives the names assigned to the rows and columns.
Example
Create a matrix taking a vector of numbers as input.
Accessing Elements of a Matrix
Elements of a matrix can be accessed by using the row and column index of the element. The matrix P created in the example is used to find specific elements.
Matrix Computations
Various mathematical operations are performed on matrices using the R operators. The result of the operation is also a matrix. The dimensions (number of rows and columns) should be the same for the matrices involved in the operation.
Matrix Addition & Subtraction
The + and - operators add and subtract the matrices element by element.
Matrix Multiplication & Division
The * and / operators multiply and divide the matrices element by element.
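The matrix operations above can be sketched as follows. The values and the row and column names are hypothetical.

```r
# Create a 4 x 3 matrix filled row by row, with named rows and columns.
P <- matrix(
  3:14,
  nrow = 4,
  byrow = TRUE,
  dimnames = list(c("row1", "row2", "row3", "row4"),
                  c("col1", "col2", "col3"))
)
print(P)

# Access a single element, a whole row, and a whole column.
element_1_3 <- P[1, 3]
second_row <- P[2, ]
third_col <- P[, 3]

# Two matrices of the same dimensions (filled column by column).
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)

# Element-wise addition, subtraction, multiplication, and division.
added <- matrix1 + matrix2
subtracted <- matrix1 - matrix2
multiplied <- matrix1 * matrix2
divided <- matrix1 / matrix2
print(added)
print(divided)
```

Note that * multiplies matching elements. The matrix product from linear algebra uses the %*% operator instead, and dividing by a zero element yields Inf or -Inf rather than an error.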