R Language can be used for powerful Data Analysis. However, it can be made even more powerful if we could add persistence to the data while using the power of R. I will discuss Data Analysis with R using data stored in Relational Databases in this article.
Relational Databases can manipulate large Data Sets. Also, Relational Databases provide concurrent access to the data. R programs are restricted to running single threaded and are restricted by the memory available in the machine that it is run on. In contract, Databases are deployed on large systems and support parallel processing and provide for very high performance. We can conduct analytics using In-Database Analytics which has many benefits. One of the most important advantage being that there will be no need for data movement between the database and the R environment.
I will use IBM DB2 on IBM Cloud (IBM Bluemix) as the Relational Database. I use this as this service is available to me from IBM Cloud account that I have. This makes it possible for me not to maintain any Database Engine in my computer. I can maintain my data on my computer and upload what I need into IBM DB2 Database using the services provided by IBM Cloud. Most importantly, DB2 provides facility for running R Programs in the DB2 Database Environment. This provides for In-Database Analytics using DB2. It provides the ibmDBR API which can be used to create Analytics programs.
I can store up to 1 GB of data on this instance without having to pay anything. I will show you how data can be loaded into IBM DB2 Database using Graphical User Interface (GUI) and Command Line Interface (CLI).
I will use the R Package ggplot2 for creating the visualisation.
The tools you need are as follows
- R Software. You can download this from CRAN website. Click this link for downloading R for MacOSX. You can search this site for the suitable download for your machine.
- IBM Bluemix Account. You can create a IBM Bluemix account by clicking this link. It does not cost anything to create the account. For experimentation, most of the services can be used for free. Only when you need to use professional versions, you will need to pay for the services. I have set up my account for Pay-As-You-Go. As I have done so recently, I am still not sure how much I will be billed at an average per month.
- IBM DB2 Service. Once you are in your IBM Bluemix Account, click on Catalog. From the list of services provided, select DB2. By default, the free service option is selected. Unless you want to change this, click Create. Otherwise, select the paid service for your need and then click create. Once your instance is created, keep a record the configuration string. It will be needed later on for most actions.
- R Studio (optional). R Studio is a very useful IDE for R Programming. However, it is choice whether you would like to use it or not.
- ggplot2 package. You can get this only after you have installed R Software (will also work if you have R Studio as well). Please see instructions below for how to install and use ggplot2.
- Lastly, you will definitely need an internet connection if you like to use the components as mentioned above. However, if you are using a database in your local machine, you do not need the Internet connection.
1. Loading Data in the Database
Before we can start with any programming, we need to upload the data into the database. We can do this using the GUI provided by IBM DB2 service. Or, we can use CLI.
1.a. Loading Data into DB2 using Graphical User Interface (GUI)
I have been maintaining the data of the Prices of six commodities – Potato, Tomato, Onion, Gold, SENSEX and NIFTY. This is available to me in an Excel File. However, you cannot upload data from Excel File into DB2. So, I saved this file as a Comma Separated File (CSV) file. The name of the file is PriceIndex.CSV (You will see the filename as PriceIndex_New.CSV as I had already the original file prior to capturing the screen shots). The following screen shots will guide you through the process of uploading the data.
When you invoke the LOAD function on the DB2 Dashboard, you will see the screen below. On this screen, drag and drop your file in the greyed area OR click on Browse Files, to select the file you want to upload.
Once you have provided the file to upload, you will see this screen.
You need to select a Schema. I selected DASH13434. Then click NEXT.
Click on New Table and provide a Table Name. I gave the name as PRICEINDEX. The screen will appear as shown below.
The column Datatypes will be chosen by default. You can change the data types. Uncheck Detect Data Types. Now you can provide the data types for the columns as shown below.
Now, click BEGIN LOAD. Once data is loaded, you will see the summary sheet as shown below.
1.b. Loading Data into DB2 using Command Line Interface (CLI)
I downloaded the utility Lift for using the CLI.
First action required is to upload the file in the DB2 instance. This can be accomplished using the command below.
lift put --file PriceIndex_New.csv --target-user=<userID> --target-password=<password> --target-host=<host>
You can get the USER ID, PASSWORD and HOST NAME from the Configuration String in the IBM Bluemix account for the service DB2.
Next, we need to load the data into the Table. It is required that the table definition be available in DB2 instance prior to this action. You can create the table definition by using the Run SQL feature and issue the CREATE TABLE Command. To load the data, use the following command.
lift load --filename PriceIndex_New.csv --target-schema DASH13434 --target-table PRICEINDEX --target-user=<userID> --target-password=<password> --target-host=<host> --file-origin user
2. Connecting to the Database from R
Now that the data has been uploaded in the DB2 database, we will connect to the DB2 Database instance and view the data before we conduct some analysis on the same.
To connect to the DB2 Database from R using JDBC, issue the following commands in the R Environment.
To check the connection, issue the following command.
Now, let us get the data from the table.
We can find out a lot of statistical information from this data. However, I will not deal with them here. We will look at plotting this data using ggplot2.
We can check the correlations between the different commodity prices.
Remember to close the connection at the end.
3. Plotting data with ggplot2
First we create a simple density plot for the GOLD Prices.
We add a column called MONTH To the Data Set.
df$MONTH = month(df$DATE)
We now make a Scatter Plot between the SENSEX Index and GOLD Price. We colour the dots by the months of the recording.
df %>% ggplot(aes(SENSEX, GOLD, color = MONTH)) + geom_point(aes(x = SENSEX, y = GOLD))
You can do a lot more with the data you have. This is just a sample of what is possible.