Friday 6 October 2017

edX Analytics Edge Unit 1 Summary

The amount of data we deal with today is enormous, measured in zettabytes (1 zettabyte = 10^21 bytes). Analytics is how we make sense of data at that scale: finding patterns in the data, building a model from those patterns, and using the model to make predictions. A few examples are IBM Watson, which won the Jeopardy! quiz show; eHarmony, a dating site that predicts compatibility based on inputs from users; and heart and other health studies that predict health issues from patients' data.

For data analysis, tools such as SAS, Excel, MATLAB, pandas in Python, and R can be used.

R is an open-source programming language and environment for statistical computing, originally derived from the S language.

You can download RStudio, which is an IDE for R. A Google search will turn up cheat sheets for R commands.

In the R console in RStudio, you can type any command and see its output.

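You can set the working directory with setwd("<path>"), or from the menu: "Change dir..." under File in the R GUI, or "Set Working Directory" under Session in RStudio.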

Basics:

getwd() get the current directory path
dir() get directory contents
?<command> help
ls() get variables stored
rm(<var>) removes the variable
# comments
= or <- assignment
print(<var>) or just <var> prints the variable
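
The basics above can be tried in a short console session. This is only a sketch; the variable names a and b are made up for illustration.

```r
# Assign with <- (the conventional form) or =
a <- 5
b = "hello"

print(a)   # explicit print
b          # typing the name alone also prints at the console

vars_before <- ls()       # names of the objects currently stored
rm(a)                     # remove the variable a
a_exists <- exists("a")   # FALSE once a has been removed
```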

R has atomic types such as character, numeric, integer, and logical (TRUE/FALSE, abbreviated T/F), as in other programming languages.

Vectors combine a series of elements of the same type into a single object.
x = c(1,2,3)
c() combines the elements 1, 2, 3 into a single vector, stored here as x.

You can also specify a range as follows:
x=1:3
y=4:6
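
As a quick illustration of atomic types and vectors (a minimal sketch, not part of the course material), class() reports the type of a value, and arithmetic on vectors works element by element:

```r
x <- c(1, 2, 3)   # numeric vector built with c()
y <- 4:6          # integer sequence 4, 5, 6

class(x)          # "numeric"
class(y)          # "integer"
class("a")        # "character"
class(TRUE)       # "logical"

second <- x[2]    # indexing starts at 1, so this is 2
total <- x + y    # element-wise addition: 5 7 9
```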

rbind(x,y) - stacks the vectors as rows
1 2 3
4 5 6
cbind(x,y) - stacks the vectors as columns
1 4
2 5
3 6
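
To check the shapes, dim() can be called on the results. A small sketch using the same x and y:

```r
x <- 1:3
y <- 4:6

r <- rbind(x, y)   # x and y become the rows of a 2 x 3 matrix
k <- cbind(x, y)   # x and y become the columns of a 3 x 2 matrix

dim(r)             # 2 3
dim(k)             # 3 2
```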

To get tabular form data, you can use data frames.
DF1 = data.frame(v1,v2)
v1 and v2 are vector variables.

To add a new variable: DF1$v3 = c(10,11,12)
You can also convert a logical vector to numeric and add it: DF1$V1 = as.numeric(c(F,T,F)), which stores 0, 1, 0.
To add new observations (rows), use the rbind function.
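
Putting these steps together, here is a minimal sketch. The values, the column name v4, and the names new_row and DF2 are made up for illustration; rbind needs the new rows to have the same column names as the existing data frame.

```r
v1 <- c("a", "b", "c")
v2 <- c(1.5, 2.5, 3.5)
DF1 <- data.frame(v1, v2)                     # columns are named v1 and v2

DF1$v3 <- c(10, 11, 12)                       # add a new variable
DF1$v4 <- as.numeric(c(FALSE, TRUE, FALSE))   # logical to numeric: 0 1 0

# Add one observation; its column names must match those of DF1
new_row <- data.frame(v1 = "d", v2 = 4.5, v3 = 13, v4 = 1)
DF2 <- rbind(DF1, new_row)
```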

str(DF1) shows the structure of the data frame DF1: the number of observations and variables, plus each variable's name, type, and first few values.
summary(DF1) gives, for each numeric variable, the min, 1st quartile, median, mean, 3rd quartile, and max.
mean - average of the values
median - middle value when the values are sorted
1st Quartile - the value x such that about 25% of the data is below x
3rd Quartile - the value y such that about 75% of the data is below y

nrow(DF1)- number of observations
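
A small example of these inspection functions, using made-up data so the quartiles are easy to verify by hand:

```r
DF1 <- data.frame(V1 = c(1, 2, 3, 4, 5),
                  V2 = c("a", "b", "a", "b", "a"))

str(DF1)                 # 5 obs. of 2 variables, with names and types

s <- summary(DF1$V1)     # Min, 1st Qu., Median, Mean, 3rd Qu., Max
s

n_obs <- nrow(DF1)       # number of observations
```

For the values 1 to 5, the 1st quartile is 2, the median and mean are both 3, and the 3rd quartile is 4.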

sd(DF1$V1) - gives the standard deviation, which is the square root of the variance. R computes the sample variance: the sum of squared differences from the mean, divided by n-1 (rather than n).
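
The definition can be checked by computing the standard deviation by hand and comparing it with sd(). The vector v is made up for illustration; note the n-1 denominator that R uses.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)

# Sample variance: sum of squared differences from the mean, divided by n - 1
manual_var <- sum((v - mean(v))^2) / (n - 1)
manual_sd <- sqrt(manual_var)

same <- isTRUE(all.equal(manual_sd, sd(v)))   # TRUE: matches the built-in sd()
```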

You can find the index (position) of the smallest or largest value of a variable in a data frame, rather than the value itself, using:
which.min(DF1$V1)
which.max(DF1$V1)
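
A short sketch with a made-up vector, showing that which.min and which.max return positions, which can then be used to look up the value (or the whole row of a data frame):

```r
V1 <- c(12, 7, 30, 7)

i_min <- which.min(V1)   # 2: position of the first smallest value
i_max <- which.max(V1)   # 3: position of the largest value

smallest <- V1[i_min]    # 7, the same as min(V1)
largest <- V1[i_max]     # 30, the same as max(V1)
```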

CSV (comma-separated values) files can be read into data frames.
f1 = read.csv("f1.csv")
A subset of data can also be generated using the subset command.
f2 = subset(f1, v1=="val1")
To write a data frame into a csv, use write.csv.
write.csv(f2,"f2.csv")
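
The round trip can be sketched end to end as below. The data and the temporary file paths are made up so the example runs anywhere; row.names=FALSE is an optional argument that stops write.csv from adding a column of row numbers.

```r
# Create a small CSV file to read back (stand-in for a real f1.csv)
f1_data <- data.frame(v1 = c("val1", "val2", "val1"), v2 = c(10, 20, 30))
path_in <- tempfile(fileext = ".csv")
write.csv(f1_data, path_in, row.names = FALSE)

f1 <- read.csv(path_in)           # read the CSV into a data frame
f2 <- subset(f1, v1 == "val1")    # keep only rows where v1 equals "val1"

path_out <- tempfile(fileext = ".csv")
write.csv(f2, path_out, row.names = FALSE)   # write the subset to a new CSV
```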

To create a scatter plot, you can use plot(DF1$V1, DF1$V2).

You can limit the variables printed by selecting columns by name: DF1[c("V1","V2")]

hist(DF1$V1) - gives a histogram, i.e. the frequency distribution of V1: the values of V1 are grouped into bins along the x-axis, and the height of each bar shows how many observations fall in that bin.

boxplot(DF1$V1 ~ DF1$V2) - box plots of V1, one box per value of V2. Each box summarizes the distribution of V1 within that group: median, 1st and 3rd quartiles, whiskers extending to the most extreme non-outlier values, and outliers drawn as individual points.

Inter-quartile range (IQR) = 3rd Q - 1st Q
Outliers (by R's default boxplot rule) - values that satisfy one of the conditions below:
> 3rd Q + 1.5 * IQR
< 1st Q - 1.5 * IQR

You can specify axis labels and a title by using boxplot(DF1$V1 ~ DF1$V2, xlab="x", ylab="", main="x vs y")
Similarly, you can specify color as col="red".
For a histogram, you can also limit the x-axis range with xlim=c(0,100) and request finer bins with breaks=500, e.g. hist(DF1$V1, xlim=c(0,100), breaks=500). Together these make the distribution easier to read over the range of interest.
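
Returning to the boxplot outliers: by default R's boxplot draws a point as an outlier when it lies more than 1.5 times the inter-quartile range beyond a quartile. The sketch below applies that rule by hand to a made-up vector. boxplot itself computes the quartiles as "hinges", which can differ slightly from quantile() on some data, so treat this as an approximation of what the plot shows.

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1 <- quantile(v, 0.25)    # 1st quartile
q3 <- quantile(v, 0.75)    # 3rd quartile
iqr <- q3 - q1             # inter-quartile range, same as IQR(v)

lower <- q1 - 1.5 * iqr    # values below this are outliers
upper <- q3 + 1.5 * iqr    # values above this are outliers

outliers <- v[v < lower | v > upper]   # only 50 falls outside the fences

# The points boxplot() would draw as outliers for the same data
from_boxplot <- boxplot.stats(v)$out
```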

table(DF1$V1) - gives a table of counts: the number of observations for each distinct value of V1. Passing two variables, as in table(DF1$V1, DF1$V2), gives a cross-tabulation that helps show how the two variables relate.
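
A small sketch of both forms of table, on made-up data:

```r
DF1 <- data.frame(V1 = c("a", "b", "a", "a"),
                  V2 = c("x", "x", "y", "y"))

t1 <- table(DF1$V1)            # counts per value of V1: a = 3, b = 1
t2 <- table(DF1$V1, DF1$V2)    # counts per (V1, V2) combination

t2["a", "y"]                   # 2 observations have V1 = "a" and V2 = "y"
```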

tapply(DF1$V1, DF1$V2, mean) - splits the data into groups by V2 and applies the mean function to V1 within each group. So this gives the mean of V1 for each distinct value of V2.

If V1 contains NA values, the result for any group that contains an NA is itself NA. You can tell the function to drop NA values as below:
tapply(DF1$V1, DF1$V2, min, na.rm=TRUE)
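
The effect of na.rm can be seen on a small made-up data frame where one group contains an NA:

```r
DF1 <- data.frame(V1 = c(10, 20, NA, 40),
                  V2 = c("x", "y", "x", "y"))

m1 <- tapply(DF1$V1, DF1$V2, mean)                 # x: NA, y: 30
m2 <- tapply(DF1$V1, DF1$V2, mean, na.rm = TRUE)   # x: 10, y: 30
```

Extra arguments such as na.rm=TRUE are passed through by tapply to the function being applied (here mean).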

match("VAL1", DF1$V1) - gives the index of the first occurrence of that value in V1 (NA if it is not found)
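
A short sketch with a made-up vector. match returns only the first position; which() with a comparison returns every matching position.

```r
V1 <- c("VAL2", "VAL1", "VAL3", "VAL1")

first_pos <- match("VAL1", V1)    # 2: first occurrence only
missing <- match("VAL9", V1)      # NA: value not present
all_pos <- which(V1 == "VAL1")    # 2 4: every occurrence
```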


