Monday 23 October 2017

What is a Multitenant Container Database?

If we need to manage multiple applications' schemas with the same name, we can't do it in a single database. We have to go for multiple databases or multiple machines, which adds cost in terms of memory, CPU usage and administration. With a Multitenant Container Database (CDB), we can instead have multiple virtual databases on one physical database. This is similar to virtual machines; these virtual databases/containers are called Pluggable Databases (PDBs). The actual physical database that holds all these containers is the CDB.

To use this functionality, you first create a CDB and then create PDBs inside it. A CDB has a root container, a seed PDB named PDB$SEED (a template for creating new PDBs) and multiple PDBs. If you don't need this functionality, you can simply create a non-CDB, which is the kind of database we had prior to Oracle 12c.

The advantage of this feature is that we can host multiple PDBs for various applications on a single machine - on a SINGLE DB INSTANCE. These PDBs behave like independent databases, so we can set different administration settings for each PDB. We can also make some top-level changes to the CDB that all the PDBs benefit from - for example, adding extra resources or upgrading the database.

Also, all these PDBs are pluggable - that is, you can take a non-CDB and plug it in as a PDB, or unplug a PDB from CDB1 and plug it into CDB2. For example, to debug a production app in a testing environment, you can easily get production data by plugging the production PDB into the testing environment.

Saturday 7 October 2017

edX Analytics Edge Unit 3 Summary

Linear regression is useful for predicting continuous outcomes. But if we consider categorical outcomes like quality of health care (poor/good), election results (Republican/Democrat), etc., we would have to round the outcome to 0/1. We generally use expert assessment in such cases, but this is time consuming and cannot possibly cover huge amounts of data. Logistic (logit) regression is useful in these cases: it predicts the probability of a variable being true.

We use Logistic Response Function to predict the probability that y=1
P(y=1) = 1/(1 + e^-(b0+b1x1+b2x2+...+bkxk))

Positive coefficients b1,b2,...,bk increase the probability of y=1 and negative coefficients increase the probability of y=0.

Odds=P(y=1)/P(y=0)
if Odds > 1, then P(y=1) > P(y=0)

P(y=0)=1-P(y=1)

Based on this, log(Odds)=b0+b1x1+...+bkxk. This is called the logit function.
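To make these formulas concrete, here is a minimal sketch in R; the coefficient and predictor values (b0, b1, b2, x1, x2) are all made up for illustration:
b0 = -1.5; b1 = 0.8; b2 = -0.3   # hypothetical coefficients
x1 = 2; x2 = 1                   # hypothetical predictor values
logit = b0 + b1*x1 + b2*x2       # log(Odds) = b0 + b1x1 + b2x2
p = 1/(1 + exp(-logit))          # P(y=1) from the logistic response function
odds = p/(1 - p)                 # same as exp(logit)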

We need a baseline model, just like in linear regression. In classification problems, the baseline model predicts the most frequent outcome - whichever of y=1 and y=0 occurs more often. This can be easily obtained using table(DF1$y). You can calculate the baseline accuracy as highest frequency/total number of observations. Our goal is to beat this accuracy.
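As a quick sketch (assuming DF1$y is the 0/1 outcome), the baseline accuracy can be computed as:
table(DF1$y)                   # frequency of each outcome
max(table(DF1$y))/nrow(DF1)    # baseline accuracy = most frequent outcome / total observations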

Earlier we split training and test data using some condition on an independent variable in subset function. For example, DFTr= subset(DF1, V1==x).

Now, we will randomly split the data into training and test data. For this, we need to install caTools R package.
install.packages("caTools")
Select a CRAN  mirror and download the package.
library(caTools) -- loads the library.
set.seed(10)- initializes the random generator.
split= sample.split(DF$y, SplitRatio=0.75)
In this case, data will be split such that 75% of the data is training data and 25% is test data.
Though the data is randomly split, it is split such that y is balanced in both data sets. For example, if y=1 for 75% of the initial data, then data is split such that y=1 for 75% of training data and 75% of test data.

DFTr= subset(DF1, split==TRUE)
DFTe= subset(DF1, split==FALSE)

Just like lm(), we have glm() here which stands for generalized linear model.

m1=glm(y ~ v1+v2, data=DFTr, family=binomial)
family=binomial gives a logistic regression model.

summary(m1)
This also gives the AIC, which plays a role similar to adjusted R2 in the linear model - it penalizes the number of variables used relative to the number of observations. A lower AIC is better.

PTr=predict(m1,type="response")
type="response"  gives probabilities.

You can find the average predicted probability for each actual outcome (y=0 and y=1) using
tapply(PTr, DFTr$y,mean)

To convert probabilities to predictions, we can use a threshold.
If P(y=1)>=t, then y=1.
If P(y=1)<t, then y=0.

Select the threshold based on which type of error is less costly for you.

  • If the threshold is very high, then we predict y=1 very rarely. So the errors here would be cases where the actual outcome is y=1, but our prediction is y=0.
  • If the threshold is very small, then we predict y=1 very often. The errors here would be cases where the actual outcome is y=0, but our prediction is y=1.

If there's no preference, we can simply choose t=0.5.
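A minimal sketch of turning the predicted probabilities PTr into 0/1 predictions (t=0.5 here is just the no-preference choice):
t = 0.5
predClass = ifelse(PTr >= t, 1, 0)   # y=1 when P(y=1) >= t, else y=0
table(DFTr$y, predClass)             # actual vs predicted outcomes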

We can make this quantitative by using a confusion/classification matrix, which has actual outcomes as rows and predicted outcomes as columns.


             Predicted=0             Predicted=1
Actual=0     True Negatives (TN)     False Positives (FP)
Actual=1     False Negatives (FN)    True Positives (TP)

TN and TP are the ones we got right. FN and FP are the ones which we got wrong.

Sensitivity/True Positive Rate = TP/(TP+FN)
False Negative Error Rate = 1 - sensitivity
Specificity/True Negative Rate = TN/(TN+FP)
False Positive Error Rate = 1 - specificity
Accuracy = (TN+TP)/N
Error rate = (FP+FN)/N

If t is high, specificity will be high and sensitivity will be low and vice versa.

We can get the confusion matrix using table(DFTr$y, PTr >= t).
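Putting the formulas above together, here is a sketch of computing the rates from the confusion matrix (assuming y is coded 0/1 and both predicted classes appear in the table):
cm = table(DFTr$y, PTr >= t)
TN = cm[1,1]; FP = cm[1,2]
FN = cm[2,1]; TP = cm[2,2]
sensitivity = TP/(TP + FN)     # true positive rate
specificity = TN/(TN + FP)     # true negative rate
accuracy = (TN + TP)/sum(cm)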

We can get a visualization of threshold values using the Receiver Operating Characteristic (ROC) curve, which has sensitivity on the y-axis and 1-specificity (False Positive Rate) on the x-axis. The ROC curve starts at (0,0), i.e., t=1, and ends at (1,1), i.e., t=0.

To get ROC curve in R, we need to install ROCR package.
install.packages("ROCR")
library(ROCR)

ROCRPred = prediction(PTr, DFTr$y)
ROCRPerf = performance(ROCRPred, "tpr","fpr")
The second and third arguments ("tpr" = true positive rate, "fpr" = false positive rate) are the measures for the y-axis and x-axis respectively.
plot(ROCRPerf, colorize=TRUE,print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,1.7))
This plot will be colorized and will have threshold labels adjacent to the curve.
auc=as.numeric(performance(ROCRPred,"auc")@y.values)
This gives the Area Under the Curve (AUC), which is an absolute measure of the quality of the prediction.

We can then test our model on test data just like we did on training data.
PTe = predict(m1,type="response",newdata=DFTe)

In case of missing data, we can use multiple imputation. For this we need to install and load the "mice" package.
set.seed(144)
I1 = complete(mice(DF1))
We can then copy the imputed variables from I1 back into the columns of DF1 that have missing data.
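A minimal sketch of the full workflow; the variable names v1 and v2 are hypothetical:
library(mice)
set.seed(144)
vars = DF1[c("v1","v2")]      # the variables to impute (and to impute from)
I1 = complete(mice(vars))     # same variables with missing values filled in
DF1$v1 = I1$v1                # copy the imputed variables back into DF1
DF1$v2 = I1$v2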

We can improve on the baseline model by using the sign function, which returns 1 when passed a positive number, -1 when passed a negative number and 0 when passed 0.
table(DFTe$y,sign(DFTe$v1))


edX Analytics Edge Unit 2 Summary

Linear regression can be used for predictions in many fields like wine tasting and sports statistics. Linear regression is where we predict a dependent variable based on some independent variables.

Consider a one-variable linear regression first - which can be plotted as a simple x vs y graph. Here x is the independent variable and y is the dependent variable which needs to be predicted.  If we take the average of the dependent variable and draw a line, we will get a baseline model. The goal is to design a model better than the baseline by drawing a predictive line through the data.

Both baseline and the linear model which we will design will not give the perfect values for y.
The difference between the actual y value and the predicted y value is the error (residual). The sum of the squared residuals over all data points is the SSE (Sum of Squared Errors).
The difference between the actual y value and the average y value is the error of the baseline model. The sum of these squared errors over all data points is the SST (Total Sum of Squares).

Root Mean Squared Error: RMSE = sqrt(SSE/N)
where N is the number of data points.
R2= 1- SSE/SST

If SSE = SST, then R2 = 0. This is the baseline model case.
When SSE < SST, our model reduces the errors relative to the baseline - which is our goal - and R2 moves towards 1 (the smaller SSE is compared to SST, the closer R2 gets to 1). So when designing the model, we should look for a higher R2 value.

If we consider multiple variable linear regression, we can write y as follows:
y=b0+b1x1+b2x2+b3x3+...+bnxn+error
here b0 is the intercept
b1,b2..bn are coefficients
error is the residual

Our goal is to find the best coefficients.

Multiple variables can impact y. As we add more variables, the R2 value improves. But as we keep adding more and more variables, the amount by which R2 improves decreases. So we need to make sure we are not adding irrelevant variables that just complicate the model.

In R:

m1 = lm(y ~ x1+x2, data=DF1)
DF1 is a data frame read from a csv file.
x1,x2 are the independent variables and y is the dependent variable to be predicted.
m1 is the model output.

summary(m1) - gives the intercept, coefficients and residuals. It also shows which variables are significant for the model.
We can also see the R2 and adjusted R2 values. The adjusted R2 value is the R2 value adjusted for the number of independent variables used relative to the number of data points. Adding an irrelevant variable will make the adjusted R2 drop.
Std. Error is the standard error of the coefficient estimate.
t = coefficient/Std. Error. Its absolute value needs to be high.
Pr(>|t|) - the probability of seeing such a t value if the true coefficient were 0. We want this to be as small as possible. Based on these values, you see significance stars beside each variable.
*** : P < 0.001
** : 0.001 < P < 0.01
* : 0.01 < P < 0.05
. : 0.05 < P < 0.1


SSE = sum(m1$residuals^2)

cor(DF1$V1, DF1$V2) - gives the correlation between two variables.
cor(DF1) - gives table of correlation among all variables.

A high correlation between two independent variables (multicollinearity) may give misleading variable significance values. As a rule of thumb, if the correlation is > 0.7 or < -0.7, the variables should be chosen carefully by checking which performs better.

Training data- data used to build a model
Test data- new data to test how the model will perform

p1= predict(m1, newdata=DFT1)
DFT1 is the data frame from test data.
m1 is the model we have built using training data

SSE=sum((DFT1$y- p1)^2)
SST = sum((DFT1$y - mean(DF1$y))^2)   (the baseline prediction here is the mean of the training data's y)

R2=1-SSE/SST

The model on test data can even give negative R2 values indicating that it's not performing well.

m2 = step(m1) gives a simplified model but not necessarily the best model.


Friday 6 October 2017

edX Analytics Edge Unit 1 Summary

The amount of data we deal with currently is humongous - on the order of zettabytes (10^21 bytes). Analytics is our solution to deal with such huge data: finding patterns in the data, developing a data model using the pattern and predicting results using the model. A few examples are IBM Watson, which won the Jeopardy! quiz show; eHarmony, a dating site that predicts compatibility based on inputs from users; and heart and other health studies that can predict health issues based on a user's data.

For data analysis, SAS, Excel, MATLAB, pandas in Python, R, etc. can be used.

R is a programming language and an environment for statistical computing originally developed from S and made open source.

You can download RStudio, which is an IDE for R. A Google search will give you cheat sheets for R commands.

On R console in RStudio, you can type any command and get the output.

You can set the directory using "Change Directory" from File menu.

Basics:

getwd() get the current directory path
dir() get directory contents
?<command> help
ls() get variables stored
rm(<var>) removes the variable
# comments
= or <- assignment
print(<var>)/<var> prints the variable

R has atomic elements like character, numeric, integer, logical(T/F) just like other programming languages.

It has vectors that can combine series of elements in one single object.
x= c(1,2,3)
c combines the elements 1,2,3 to a single object stored as vector x.

You can also specify range as follows:
x=1:3
y=4:6

rbind(x,y) - binds the vectors as rows:
1 2 3
4 5 6
cbind(x,y) - binds the vectors as columns:
1 4
2 5
3 6

To get tabular form data, you can use data frames.
DF1 = data.frame(v1,v2)
v1 and v2 are vector variables.

To add new variable, DF1$v3 = c(10,11,12)
You can even convert a logical variable to numeric and add it as DF1$V1=as.numeric(c(F,T,F))
To add new data/observations, you can use rbind function.
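For example, a sketch of adding one new observation with rbind (the values are hypothetical, and the new row must have the same variables as DF1):
newObs = data.frame(v1=4, v2=7, v3=13)   # one new row with the same columns as DF1
DF1 = rbind(DF1, newObs)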

str(DF1) will give the structure of the data frame DF1 including number of variables, observations and variable names.
summary(DF1) will give values of many functions like min,max, mean, median, 1st quartile, 3rd quartile.
mean-average of values
median-middle value when the values are sorted
1st Quartile - say this value is x, then 25% of the data is < x
3rd Quartile - say this value is y, then 75% of the data is < y

nrow(DF1)- number of observations

sd(DF1$V1) - gives the standard deviation - square root of variance--which is the average of squared differences from mean.

You can find the index (row number) of the minimum/maximum value of a variable in a data frame using:
which.min(DF1$V1)
which.max(DF1$V1)
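Since these return the row index, a small sketch of getting the full observation or the value itself would be:
DF1[which.min(DF1$V1), ]    # the observation with the smallest V1
DF1$V1[which.max(DF1$V1)]   # the largest V1 value itself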

CSV(Comma-separated values) files can be read and stored into variables.
f1 = read.csv("f1.csv")
A subset of data can also be generated using the subset command.
f2 = subset(f1, v1=="val1")
To write a data frame into a csv, use write.csv.
write.csv(f2,"f2.csv")

To create a scatter plot graph, you can use plot(DF$V1, DF$V2).

You can limit the variables printed by selecting them by name, e.g. DF1[c("v1","v2")]

hist(DF1$V1) - gives a histogram, which shows the frequency distribution of V1, i.e., V1 values vs how often they occur in the data.

boxplot(DF1$V1 ~ DF1$V2) - boxplot of V1 grouped by V2 - it provides a statistical distribution of V1 for each value of V2: range (min, max), outliers, median, 1st and 3rd quartiles.

Inter-quartile range (IQR) = 3rd Q - 1st Q
Outliers - values that satisfy one of the below conditions (R's boxplot uses a 1.5*IQR rule by default):
> 3rd Q + 1.5*IQR
< 1st Q - 1.5*IQR
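A small sketch of computing these cutoffs for a variable (assuming the default 1.5*IQR rule):
q1 = quantile(DF1$V1, 0.25)   # 1st quartile
q3 = quantile(DF1$V1, 0.75)   # 3rd quartile
iqr = q3 - q1                 # inter-quartile range, same as IQR(DF1$V1)
upper = q3 + 1.5*iqr          # values above this are outliers
lower = q1 - 1.5*iqr          # values below this are outliers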

You can specify labels by using boxplot(DF1$V1 ~ DF1$V2, xlab="x", ylab="", main="x vs y")
Similarly, you can specify a colour as col="red".
For histograms, you can also limit the x-axis using xlim=c(0,100) and refine the bins using breaks=500. These help to see the data well over a refined range.

table(DF1$V1) - this will give a table of V1 similar to what we found in summary - with the number of observations for each value of V1. Using table() over 2 variables gives a cross-tabulation, which helps in seeing the relationship between those two variables.
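For example, a two-variable cross-tabulation is just:
table(DF1$V1, DF1$V2)   # counts of each V1 value for each value of V2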

tapply(DF1$V1, DF1$V2, mean) - splits the data by V2 and applies the mean function to V1. So this gives the mean of V1 for each value of V2.

If there are NA values in the data, these functions will just return NA. You can tell them to remove NA values as below:
tapply(DF1$V1, DF1$V2, min, na.rm=TRUE)

match("VAL1", DF1$V1)- gives the index of that value in V1