Prep

Monday, 25 May 2020

Finding maximum and minimum in union find algorithm

This post is a problem based on union find algorithm. For the algorithm, please see "Weighted Quick Union With Path Compression - Coursera Algorithms"

Problem: Given an element, find the max/min of it's connected components tree.

Following is my, probably brute force, solution to find max/min. For this, we use extra arrays to store the max and min values. We updated the max and min only for root nodes. For all other nodes, just find the root and it's max/min.

Sunday, 24 May 2020

Weighted Quick Union With Path Compression - Coursera Algorithms

Weighted Quick Union can be improved further by path compression. The idea here is to avoid flatten the tree. After computing the root of a node, we set the id of each examined node to point to that root.

Running Time:
For M union-find operations on N elements, running time ≤ c ( N + M lg* N ) array accesses.

lg*N = number of times we have to take log of N to get 1.

It's almost linear and union find doesn't have any linear algorithm.

Following is the program:

Weighted Quick Union Algorithm - Coursera Algorithms

Quick Find and Quick Union were slow in performing union and finding if two elements are connected respectively. However, Quick Union can be improvised by using weighted trees. The idea here is to avoid tall trees of Quick Union and always balance the trees such that smaller trees always link up to the larger trees.

We use array to initialize elements to indices itself. To weigh trees, we use an extra array to track the size of trees at each element.

Any node x's depth increases by one when it's tree T1 is merged into another T2.
As |T2|>=|T1| , the size of tree containing x at least doubles.
As N*lg N = N, the size of tree containing x can double at most lg N times.
So the depth of any node at x is at most lg N.

Running Time:
initialize - N
union(p,q) - lg N⁺ - including the root calculations
connected(p,q) - lg N - proportional to depth of p and q

Following is the program:

Saturday, 23 May 2020

Quick Union Algorithm - Coursera Algorithms

Quick Find was very slow with quadratic operations. Quick Union is a lazy union-find algorithm to unite sets and to see if an element is connected to another. We use array to initialize elements to indices itself. Later, we update the elements to point to it's parent. This way we create trees out of elements.

Array accesses:
initialize - N
union(p,q) - N⁺ including the cost of root calculations
connected(p,q) -N

The trees can get quite tall, making the connected operation quite slow.

Following is the program:

Quick Find Algorithm - Coursera Algorithms

Quick Find is an eager union-find algorithm.

Purpose: To find if two elements are connected and to perform union operation on two elements.

We will consider the elements as simple int array. We initialize array to index elements itself and then change the ids in the array when an union operation is performed.

Array accesses: 2N+1
initialize - N
union(p,q) - N
connected(p,q) - 1

If N unions are performed on N elements, union takes N² array access. Quadratic operations don't scale.

Following is the program:

Monday, 23 October 2017

What is Multi-tenant Container Database?

If we need to manage multiple applications' schemas with same name, we can't do it in a single database. We have to go for multiple databases or multiple machines. This will lead to additional costs in terms of memory, CPU usage and administration. In case of Multitenant Container Database(CDB), we can have multiple virtual databases on one physical database. This is similar to virtual machines and these virtual databases/containers are called Pluggable Databases(PDB). The actual physical database that holds all these containers is the CDB.

To use this functionality, you can first create a CDB and then create PDBs. A CDB will have a root container, seed$PDB(template for creating PDB) and multiple PDBs. If you don't need this functionality, you can simply create a non-CDB which we had earlier to Oracle 12c.

The advantage of this feature is that we can host multiple PDBs for various applications on a single machine - on a SINGLE DB INSTANCE. These PDBs behave like independent databases. So we can set different admin constraints for each PDB. And we can also make some top level changes to CDB which all the PDBs can use - for example, adding extra resources or upgrading DB.

Also, all these PDBs are pluggable - that is you can get a non-CDB and plug it as a PDB or plug a PDB from CDB1 to CDB2. For example, to debug a production app in testing environment, you can easily get production data by plugging in the production PDB to testing environment.

Saturday, 7 October 2017

edX Analytics Edge Unit 3 Summary

Linear regression is useful in predicting continuous outcomes. But if we consider categorical data like quality of health care-poor/good, election results-republican/democrat, etc, we have to round the outcome to 0/1. We generally use expert assessment in such cases. But this is time consuming and can not possible consider huge amounts of data. Logistic/Logit regression is useful in these cases where it predicts the probability of a variable being true.

We use Logistic Response Function to predict the probability that y=1
P(y=1)=1/1+e^-(b0+b1x1+b2x2+...+bkxk)

Positive coefficients b1,b2..bk will increase the probability of y=1 and negative coefficients increase the probability of y=0.

Odds=P(y=1)/P(y=0)
if Odds>1, P(y=1) is more

P(y=0)=1-P(y=1)

Based on this, log(Odds)=b0+b1x1+...+bkxk. This is called the logit function.

We need to have a baseline model just like linear regression. In classification problems, baseline model predicts the most frequent outcome - which means the highest frequency between y=1 and y=0. This can be easily obtained using table(DF1$y). You can calculate accuracy using highest frequency/total number of observations. Our goal is to be better than this accuracy.

Earlier we split training and test data using some condition on an independent variable in subset function. For example, DFTr= subset(DF1, V1==x).

Now, we will randomly split the data into training and test data. For this, we need to install caTools R package.
install.packages("caTools")
Select a CRAN mirror and download the package.
library(caTools) -- loads the library.
set.seed(10)- initializes the random generator.
split= sample.split(DF$y, SplitRatio=0.75)
In this case, data will be split such that 75% of the data is training data and 25% is test data.
Though the data is randomly split, it is split such that y is balanced in both data sets. For example, if y=1 for 75% of the initial data, then data is split such that y=1 for 75% of training data and 75% of test data.

DFTr= subset(DF1, split==TRUE)
DFTe= subset(DF1, split==FALSE)

Just like lm(), we have glm() here which stands for generalized linear model.

m1=glm(y ~ v1+v2, data=DFTr, family=binomial)
family=binomial gives a logistic regression model.

summary(m1)
This also gives AIC which is similar to adjusted R2 in linear model - number of variables used w.r.t number of observations. AIC should be less.

PTr=predict(m1,type="response")
type="response" gives probabilities.

You can find the average prediction for y=1 using
tapply(PTr, DFTr$y,mean)

To convert probabilities to predictions, we can use a threshold.
If P(y=1)>=t, then y=1.
If P(y=1)<t, then y=0.

Select threshold based on what type of errors are better for you.

If threshold is very high, then we predict y=1 very rarely. So the errors here would be actual outcome being y=1, but our predict is y=0.
If threshold is very small, then we predict y=1 very often. Errors here would be actual outcome being y=0, but our predict id y=1.

If there's no preference, we can simply choose t=0.5.

We can make this quantitative by using confusion/classification matrix. This has actual outcomes as rows and predicted outcomes as columns.

	Predicted=0	Predicted=1
Actual=0	True Negatives(TN)	False Positives(FP)
Actual=1	False Negatives(FN)	True Positives(TP)

TN and TP are the ones we got right. FN and FP are the ones which we got wrong.

Sensitivity/True Positive Rate = TP/TP+FN
False Negative Error Rate=1-sensitivity
Specificity/True Negative Rate = TN/TN+FP
False Positive Error Rate=1-specificity
Accuracy=TN+TP/N
Error rate=FP+FN/N

If t is high, specificity will be high and sensitivity will be low and vice versa.

We can get the confusion matrix using table(DFTr$y, PTr>t).

We can get a visualization of threshold values using Receiver Operator Characteristic(ROC) curve which has sensitivity on y-axis and 1-specificity(False Positive Rate) on x-axis. ROC curve starts at (0,0) or t=1 and ends at (1,1) or t=0.

To get ROC curve in R, we need to install ROCR package.
install.packages("ROCR")
library(ROCR)

ROCRPred = prediction(PTr, DFTr$y)
ROCRPerf = performance(ROCRPred, "tpr","fpr")
tpr and fpr are arguments for x and y axis.
plot(ROCRPerf, colorize=TRUE,print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,1.7))
This plot will be colorized and will have threshold labels adjacent to the curve.
auc=as.numeric(performance(ROCRPred,"auc")@y.values)
This gives the Area Under Curve AUC which is an absolute measure of quality of prediction.

We can then test our model on test data just like we did on training data.
PTe = predict(m1,type="response",newdata=DFTe)

In case of missing data, we can use multiple imputation. For this we need to install and load "mice" package.
set.seed(144)
I1= complete(mice(DF1)
We can add the variables from I1 to DF1 that has missing data.

We can improve baseline model by using a sign function that returns 1 if it is passed a positive number, -1 if it is passed a negative number and 0 if it is passed 0.
table(DFTe$y,sign(DFTe$v1))