How to solve data science Hackathon – PIMA Diabetes
How to solve data science Hackathon
We at The Data Monk always believe in learning while coding and practicing. In this process we come across multiple things and one such learning tool is Hackathon. There are different websites like www.analyticsvidhya.com or https://www.kaggle.com/
How to solve data science Hackathon?
The best way is to get your hands dirty and start with any problem, but I have personally seen people starting with a problem and never actually reaching the solution part. So, we suggest you to follow the paths of those who have created such models.
If you are still unsure about how to make a career in Data Science, then this is your Data Science Career Path
You can also go through our Titanice Kaggle Solution
The below post is contributed by Disha Gupta
Problem statement: Create a machine learning model to predict that the person has diabetes or not.
Context: The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
Content: The dataset consists of several features like number of pregnancies, BMI, Age, Blood Pressure, Insulin level, Glucose level, Diabetes Pedigree Function and the target variable the outcome i.e. is the person suffering from diabetes (1) or not (0).
Lets first understand what every feature in the dataset mean.
- Pregnancies: It gives the number of times the female was pregnant
- Glucose: Plasma glucose concentration 2 hours in an oral glucose tolerance test
- Blood Pressure (mm Hg): Diastolic blood pressure indicates how much pressure your blood is exerting against your artery walls while the heart is resting between beats.
- Skin Thickness: Triceps skin fold thickness (mm)
- Insulin Level: 2-Hour serum insulin (mu U/ml)
- Diabetes Pedigree Function: The likelihood of a person being diabetic depending on the family history
Let’s start with the model building. Here, I have used RandomForest algorithm of Machine Learning using Python.
Random forest, consists of a large number of individual decision trees that operate as an ensemble . Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
We will start with importing the libraries required. So, here pandas, numpy, seaborn and matplotlib.pyplot are imported
data = pd.read_csv("./diabetes.csv")
Here, we’ll read the our data which is in csv form using pandas
shape function helps us find number of rows and columns in the dataset. In this dataset, there are 768 rows and 9 columns
head(5) function lets us print first 5 entries in the dataset. Here, first 5 entries are printed, in which the number of people having diabetes(outcome = 1) is 3.
This line of code lets us find the number of nulls in the dataset. False as an o/p depicts that there are no nulls or Nan in the dataset
Its important to find the correlation between the features in the dataset, for finding correlation, I used corr() function, and plotting it in a heatmap gives me the result in the form of visualization. If in case, I don’t want to use any visualization, I can just use the below line of code to find the relationship between the features in the dataset.
Now I want to find the number of people/females who are diabetic and who are not diabetic, I’ll create true_count and false_count or any variable to get the count of data. So, in this dataset, 268 females are diagnosed with diabetes while 500 aren’t.
From scikit learn’s model selection, we will import train_test_split , to split the dataset into training set and test set. Here, I have separated feature columns and target column. The test size will be 30% of the entire dataset.
The above lines of code, is for checking all the values those are 0 in the feature columns. The below lines of code are for replacing the 0’s by the mean value using Imputer from scikit learn’s preprocessing module.
Now, lets create the ML model. I have used RandomForest algorithm which is a type of classification model. RandomForestClassifier is to be imported from ensemble module of scikit learn.
Classifier is the object created to fit the algorithm. classifier.fit will use X_train and Y_train to train the model.
After I create a model, I need to find how accurate it is in predicting the values of y or outcome. So, by using accuracy_score from metrics module of scikit learn, I’ll find the accuracy. Here the accuracy is 72.72%, which is not a bad result. But for improving accuracy, hyper parameters can used.
This was one simple approach to handle a Hackathon problem, but the secret behind high accuracy will always be feature engineering.
You can use this article as a way to understand how to start with sample data and solve Hackathons.
P.S. – There a multiple things which you can add in a problem statement to boost your accuracy but this article is not meant to solve that problem. It’s only meant to help you create your first submission to a Data Science Hackathon
If you want any guidance or want to make a career in Data Science, then you can take a look on our proposed Data science path
Connect with us on Linkedin for latest updates and Data Science e-books giveaways- Nitin Kamal (Co-Founder, The Data Monk)
Our Linkedin Page for latest updates and giveaways- https://www.linkedin.com/company/14573769