Visualising Database Data with R

R Language can be used for powerful Data Analysis. However, it can be made even more powerful if we could add persistence to the data while using the power of R. I will discuss Data Analysis with R using data stored in Relational Databases in this article.

Relational Databases can manipulate large Data Sets. Also, Relational Databases provide concurrent access to the data. R programs are restricted to running single threaded and are restricted by the memory available in the machine that it is run on. In contract, Databases are deployed on large systems and support parallel processing and provide for very high performance. We can conduct analytics using In-Database Analytics which has many benefits. One of the most important advantage being that there will be no need for data movement between the database and the R environment.

R Language provides 2 libraries containing methods for connecting to the Relational Databases. They are RJDBC and RODBC. I will use RJDBC for connecting to the Database using R in this article.

I will use IBM DB2 on IBM Cloud (IBM Bluemix) as the Relational Database. I use this as this service is available to me from IBM Cloud account that I have. This makes it possible for me not to maintain any Database Engine in my computer. I can maintain my data on my computer and upload what I need into IBM DB2 Database using the services provided by IBM Cloud. Most importantly, DB2 provides facility for running R Programs in the DB2 Database Environment. This provides for In-Database Analytics using DB2. It provides the ibmDBR API which can be used to create Analytics programs.

I can store up to 1 GB of data on this instance without having to pay anything. I will show you how data can be loaded into IBM DB2 Database using Graphical User Interface (GUI) and Command Line Interface (CLI).

I will use the R Package ggplot2 for creating the visualisation.

The tools you need are as follows

  1. R Software. You can download this from CRAN website. Click this link for downloading R for MacOSX. You can search this site for the suitable download for your machine.
  2. IBM Bluemix Account. You can create a IBM Bluemix account by clicking this link. It does not cost anything to create the account. For experimentation, most of the services can be used for free. Only when you need to use professional versions, you will need to pay for the services. I have set up my account for Pay-As-You-Go. As I have done so recently, I am still not sure how much I will be billed at an average per month.
  3. IBM DB2 Service. Once you are in your IBM Bluemix Account, click on Catalog. From the list of services provided, select DB2. By default, the free service option is selected. Unless you want to change this, click Create. Otherwise, select the paid service for your need and then click create. Once your instance is created, keep a record the configuration string. It will be needed later on for most actions.
  4. R Studio (optional). R Studio is a very useful IDE for R Programming. However, it is choice whether you would like to use it or not.
  5. ggplot2 package. You can get this only after you have installed R Software (will also  work if you have R Studio as well). Please see instructions below for how to install and use ggplot2.
  6. Lastly, you will definitely need an internet connection if you like to use the components as mentioned above. However, if you are using a database in your local machine, you do not need the Internet connection.

1. Loading Data in the Database

Before we can start with any programming, we need to upload the data into the database. We can do this using the GUI provided by IBM DB2 service. Or, we can use CLI.

1.a. Loading Data into DB2 using Graphical User Interface (GUI)

I have been maintaining the data of the Prices of six commodities – Potato, Tomato, Onion, Gold, SENSEX and NIFTY. This is available to me in an Excel File. However, you cannot upload data from Excel File into DB2. So, I saved this file as a Comma Separated File (CSV) file. The name of the file is PriceIndex.CSV (You will see the filename as PriceIndex_New.CSV as I had already the original file prior to capturing the screen shots). The following screen shots will guide you through the process of uploading the data.

When you invoke the LOAD function on the DB2 Dashboard, you will see the screen below. On this screen, drag and drop your file in the greyed area OR click on Browse Files, to select the file you want to upload.

DataLoadGUI1

Once you have provided the file to upload, you will see this screen.

DataLoadGUI2

You need to select a Schema. I selected DASH13434. Then click NEXT.

DataLoadGUI3

Click on New Table and provide a Table Name. I gave the name as PRICEINDEX. The screen will appear as shown below.

DataLoadGUI4

Click NEXT.

DataLoadGUI5

The column Datatypes will be chosen by default. You can change the data types. Uncheck Detect Data Types. Now you can provide the data types for the columns as shown below.

DataLoadGUI6

Click NEXT.

DataLoadGUI7

Now, click BEGIN LOAD. Once data is loaded, you will see the summary sheet as shown below.

DataLoadGUI8

1.b. Loading Data into DB2 using Command Line Interface (CLI)

I downloaded the utility Lift for using the CLI.

First action required is to upload the file in the DB2 instance. This can be accomplished using the command below.

lift put --file PriceIndex_New.csv --target-user=<userID> --target-password=<password> --target-host=<host>

You can get the USER ID, PASSWORD and HOST NAME from the Configuration String in the IBM Bluemix account for the service DB2.

Next, we need to load the data into the Table. It is required that the table definition be available in DB2 instance prior to this action. You can create the table definition by using the Run SQL feature and issue the CREATE TABLE Command. To load the data, use the following command.

lift load --filename PriceIndex_New.csv --target-schema DASH13434 --target-table PRICEINDEX --target-user=<userID> --target-password=<password> --target-host=<host> --file-origin user

2. Connecting to the Database from R

Now that the data has been uploaded in the DB2 database, we will connect to the DB2 Database instance and view the data before we conduct some analysis on the same.

To connect to the DB2 Database from R using JDBC, issue the following commands in the R Environment.

dyn.load('/Library/Java/JavaVirtualMachines/jdk-9.0.4.jdk/Contents/Home/lib/server/libjvm.dylib')

library(rJava)
library(RJDBC)
dsn_driver = "com.ibm.db2.jcc.DB2Driver"
dsn_database = "BLUDB" # e.g. "BLUDB"
dsn_hostname = "<host>" # e.g. check connection string
dsn_port = "50000" # e.g. "50000"
dsn_protocol = "TCPIP" # i.e. "TCPIP"
dsn_uid = "<userID>" # e.g. replace with your userid
dsn_pwd = "<password>" # e.g. replace with your password
jcc = JDBC("com.ibm.db2.jcc.DB2Driver", "db2jcc4.jar");
jdbc_path = paste("jdbc:db2://", dsn_hostname, ":", dsn_port, "/", dsn_database, sep="");
conn = dbConnect(jcc, jdbc_path, user=dsn_uid, password=dsn_pwd)

To check the connection, issue the following command.
attributes(conn)

R Output1

Now, let us get the data from the table.

query = "SELECT * FROM PRICEINDEX";
rs = dbSendQuery(conn, query);
df = fetch(rs, -1);
head(df)

R Output2
We can find out a lot of statistical information from this data. However, I will not deal with them here. We will look at plotting this data using ggplot2.

R Output3

We can check the correlations between the different commodity prices.

R Output4

Remember to close the connection at the end.

dbDisconnect(conn)

3. Plotting data with ggplot2

First we create a simple density plot for the GOLD Prices.

ggplot1

ggplot2

We add a column called MONTH To the Data Set.

df$MONTH = month(df$DATE)

We now make a Scatter Plot between the SENSEX Index and GOLD Price. We colour the dots by the months of the recording.

df %>% ggplot(aes(SENSEX, GOLD, color = MONTH)) + geom_point(aes(x = SENSEX, y = GOLD))

Rplot

You can do a lot more with the data you have. This is just a sample of what is possible.

Advertisements