Cheat Sheets For Machine Learning Frameworks
Traditional Machine Learning algorithms like decision trees were invented in the late 20th century. In fact, the backpropagation algorithm, which drives modern neural networks and is used to train almost every deep learning model, was invented in 1970. So what took so long for Machine Learning to boom? Two things - data and computational power. Now that modern-day GPUs and TPUs provide enormous computational power, and enormous quantities of data are generated every second, deep learning is the talk of the town. Every tech company - Facebook, Google, Amazon, Microsoft - has released its own framework for building deep learning applications, while libraries like Scikit-Learn have remained popular for traditional Machine Learning. All of these frameworks are well implemented in Python. The picture below shows the development timeline of these frameworks -
Image credits: http://www.popit.kr/wp-content/uploads/2017/05/deeplearning_fw_timeline1.png
From 2013 to 2017, 9 Deep Learning frameworks were developed of which 8 were in Python. So which one should you learn? Moreover, is it enough to learn just one of them?
The latter question is a bit tricky and closely connected to the former. Consider a task where you want to quickly prototype something. Building something simple for experimentation purposes is more difficult in Tensorflow than in high-level APIs like Keras. Keras enables you to build a neural network in just a few lines of code, whereas Tensorflow has, until now, required a lot of boilerplate code even for simple architectures. However, when it comes to deployment for large-scale applications, Tensorflow is battle-tested.
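To give a sense of how little code Keras needs, here is a minimal sketch of a small classifier; the layer sizes and the assumed 10 input features are illustrative placeholders, not taken from the article:
from tensorflow import keras
from tensorflow.keras import layers
# a tiny binary classifier; the input dimension of 10 is an assumed placeholder
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=10)  # train on your own data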
In this article, we summarize Machine Learning frameworks, their advantages and disadvantages, and provide code snippets to help you implement algorithms. We focus on Scikit-Learn, or sklearn. It is a powerful library that provides not only Machine Learning algorithms but also data pre-processing tools. We first explore data preprocessing steps, then move on to model implementation, and finally look at how sklearn helps tune model hyperparameters effectively.
Data Splitting:
It is a common practice to split your data into training and validation/test set. You train your model on the training set and then validate it on the validation set. To split your data into training and test set, you do the following:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)
X contains the features and y the target variable. test_size is the fraction of the total number of points to be included in the test set. Full documentation.
Scaling Data
Scaling is an important step if you wish to use distance-based algorithms like KNN, or even complex algorithms like Neural Networks. If you want to scale your features to have zero mean and unit standard deviation, you use standard scaling. Full documentation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
Another way to scale data is to map the values into the range zero to one. This is called min-max scaling. The implementation is as follows:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_transformed = scaler.transform(X_train)
Label Encoding:
LabelEncoder encodes the values of a feature as integer categories from 0 to n-1, where n is the number of distinct classes.
from sklearn import preprocessing
label_enc = preprocessing.LabelEncoder()
label_enc.fit([1,2,2,2,3])
label_enc.transform([2,1,3,3])
>> [1,0,2,2]
One hot encoding:
Another type of encoding that can be used is one hot encoding. For a feature with n classes, it creates a vector of length n in which the position corresponding to a data point's class value is one and the remaining n-1 positions are zero.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([[0,1],[1,2],[0,0]])
enc.transform([[1,1],[0,2]]).toarray()
>> [[0,1,0,1,0],[1,0,0,0,1]]
Fitting Machine Learning models:
Using Machine Learning algorithms is easy in terms of code; the general pattern is the same across models. What changes is the hyperparameter tuning part. The parameters to be tuned for Logistic Regression are different from those for Decision Trees. Consider the implementation of a Decision Tree:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='gini', max_depth=10) # set hyperparameters here
model.fit(X_train,y_train)
predictions = model.predict(X_test)
To implement another model like Logistic Regression, just import the model and replace the model variable with the appropriate class, as in the sketch below; links to other models are available in the sklearn documentation.
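For example, a rough sketch with Logistic Regression (the C and max_iter values here are illustrative, not prescribed):
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, max_iter=1000) # illustrative hyperparameters
model.fit(X_train, y_train)
predictions = model.predict(X_test)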
Note: Although sklearn provides a Neural Network implementation, it is advisable to use dedicated libraries like TensorFlow, Keras, or PyTorch.
Metrics
Once you have made the predictions for the validation data, you quantify the performance using metrics like Mean Squared Error (MSE), Accuracy, F1-Score, etc.
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, predictions) # y_test: test labels, predictions: model output
>> 0.758
For classification tasks,
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)
>> 0.63
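F1-Score works the same way; the average setting below is one common choice and should be adjusted for multi-class problems:
from sklearn.metrics import f1_score
f1_score(y_test, predictions, average='binary') # use 'macro' or 'weighted' for multi-class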
Cross-Validation
As mentioned earlier, it is a good practice to cross-validate in order to generalize the results and make the model more robust. In sklearn, it is implemented as follows
from sklearn.model_selection import cross_validate
model = DecisionTreeClassifier() # set hyperparameters here
cv_results = cross_validate(model,X,y,cv = 3) #cv= number of folds
cv_results['test_score']
>> array([0.7,0.66,0.84])
Note that you can also change the metric from the default to one of your choice, as shown below. Refer to the documentation.
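For instance, passing one of sklearn's built-in scorer names via the scoring argument switches the metric; 'f1_macro' below is just an example choice:
cv_results = cross_validate(model, X, y, cv=3, scoring='f1_macro') # any built-in scorer name works
cv_results['test_score']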
Hyper-parameter Tuning:
As mentioned earlier, hyperparameter tuning is an important part of model fitting. But how do you choose the best combination of hyperparameters? Grid search finds the best set within a defined range of parameter values. The implementation is as follows:
from sklearn.model_selection import GridSearchCV
parameter_dictionary = {'max_depth': [2,5,7,10], 'criterion': ['gini','entropy']}
#we define a dictionary of hyperparameters with keys as the parameter name and values as a list of parameter values
grid_search = GridSearchCV(model, parameter_dictionary, cv=5)
grid_search.fit(X_train, y_train) # run the search; required before querying results
#to get the best score of the k-fold cross-validation among the different combinations
#of parameters
grid_search.best_score_
>> 0.752
#to get the best model of the k-fold cross-validation among the different combinations
#of parameters
grid_search.best_estimator_
>> <model object>
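Since GridSearchCV refits the best parameter combination on the whole training set by default, the returned estimator can be used directly for prediction; a brief sketch:
best_model = grid_search.best_estimator_
test_predictions = best_model.predict(X_test) # predict with the tuned model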
Get the full documentation here.
The above examples are more than enough to get you started with end-to-end model building. Of course, there is a lot more to sklearn than these common uses. The best way to master any framework is to look up the documentation and implement things yourself.
While Scikit-Learn is a fast, robust, and well-supported framework, it has a few limitations. It does not support statistical analysis to the same extent as frameworks like statsmodels: if you want regression statistics such as t-tests and several others, you need to write your own code or switch to statsmodels. If you wish to get detailed statistical information from the model, statsmodels is a good choice; otherwise, it is advisable to use Scikit-Learn.
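As a rough sketch of what that looks like, assuming an ordinary least squares regression on the same training data (statsmodels is not covered in this article, so treat the snippet as illustrative):
import statsmodels.api as sm
X_train_const = sm.add_constant(X_train) # statsmodels does not add an intercept by default
ols_model = sm.OLS(y_train, X_train_const).fit()
print(ols_model.summary()) # coefficients, t-statistics, p-values, etc.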
If you are a Data Science or Machine Learning enthusiast and found the above article helpful, here’s something for you - Data Science is moving fast. If you wish to make your place in this exciting field, you need a planned program with quality teaching and mentorship. You might want to check out Manipal’s certification in AI, which will take you from the basics of Python to advanced Machine Learning, seamlessly.