You always have to read and write files when working for a company or in a hackathon. So, it’s necessary to know how to read different types of files.
Let’s start the boring but important part
The most important command for opening a file in Python is the built-in open() function. It takes two main parameters: the name of the file and the access mode.
Python has 4 modes to access a file:-
1. "r" – Read – Opens a file for reading; raises an error if the file does not exist
2. "a" – Append – Adds to the end of a file, creating the file if it does not exist
3. "w" – Write – Writes to a file, creating it if it does not exist and overwriting its contents if it does
4. "x" – Create – Creates the specified file; raises an error if the file already exists
Apart from these, you can also specify the format in which you want to open the file:
1. t for Text (default)
2. b for Binary
Open a file
x = open("Analytics.txt", "rt")
This opens the file for reading in text mode.
Read the file
You can also read the file line by line, either by looping over the file object or by using the readline() method
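A minimal sketch of the reading options. The file name Analytics.txt comes from the example above; the temporary folder and the sample contents are assumptions made only so that the snippet runs on its own.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained (assumed contents).
path = os.path.join(tempfile.mkdtemp(), "Analytics.txt")
with open(path, "wt") as f:
    f.write("first line\nsecond line\n")

# Read the whole file at once.
with open(path, "rt") as f:
    content = f.read()

# Read the file line by line by looping over the file object.
lines = []
with open(path, "rt") as f:
    for line in f:
        lines.append(line.rstrip("\n"))

# readline() returns one line per call, including the newline character.
with open(path, "rt") as f:
    first = f.readline()

print(content)
print(lines)
print(first)
```

Using with closes the file automatically once the block ends, so there is no need to call close() yourself.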
Write something in a file
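Writing could look like the sketch below. The temporary location and the text being written are assumptions for illustration only.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "Analytics.txt")  # assumed location

# "w" creates the file, or overwrites it if it already exists.
with open(path, "w") as f:
    f.write("The Data Monk\n")

# "a" adds to the end of the file instead of overwriting it.
with open(path, "a") as f:
    f.write("Day 1\n")

with open(path, "r") as f:
    written = f.read()
print(written)
```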
Delete a file
Use the os module and call its remove() function:
import os
os.remove("file name")
God forbid, if you ever have to delete a folder and want to look cool in front of your friends, you can use the following command. Note that it only removes an empty directory:
os.rmdir("Name of directory")
Reading a CSV file
Comma Separated Values (CSV) is one of the most used file formats, and you will definitely come across it often. To read it, you should ideally use the pandas library.
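With pandas installed, reading a CSV is usually a single call to pandas.read_csv(path). The sketch below swaps in the standard library's csv module instead, so that it runs without any extra installation; the file name and the rows are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical CSV file created only for this example.
path = os.path.join(tempfile.mkdtemp(), "employees.csv")
with open(path, "w", newline="") as f:
    f.write("name,city\nAmit,Mumbai\nNeha,Delhi\n")

# DictReader uses the first row as the column names.
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    print(row["name"], row["city"])
```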
There are a lot of file formats, but we covered only those which are of utmost importance. In case you need more information, try this link from DataCamp or you can trust your best friend Stack Overflow 😛
If you need information about a specific file format, do comment below.
The reason why I put interview questions as the title of a lot of posts is because:– 1. It makes you click on the post 2. It makes you feel that these are very important questions and you can nail an interview with it 3. These are actual interview questions asked in companies like Myntra, Flipkart, BookMyShow, WNS, Sapient, etc. 4. You have to practice to become perfect. You can practice here or anywhere else. But make sure you know all the questions given below.
Let’s start with the questions without wasting any time 😛
1. Which data type is mutable and ordered in Python? List
2. Can a dictionary contain another dictionary? Yes, a dictionary can contain another dictionary. In fact, this nesting is one of the handy features of the dictionary data type.
3. When to use list, set or dictionaries in Python? A list keeps its items in order and allows duplicates, so when you care about position you should use a list. A set holds only unique items and has no order, which makes it a good fit for membership tests and for removing duplicates.
A dict associates a value with each key, while list and set just contain values. Since Python 3.7 a dict also remembers insertion order, but its items are looked up by key rather than by position. A set requires its items to be hashable and a list doesn’t: if you have non-hashable items, you cannot use a set and must use a list instead.
4. Write a program (WAP) where you first create an empty list and then add the elements.
basic_list = []
basic_list.append('Alpha')
basic_list.append('Beta')
basic_list.append('Gamma')
5. What does this mean: *args, **kwargs? And why would we use it? We use *args when we aren’t sure how many arguments are going to be passed to a function, or if we want to pass a stored list or tuple of arguments to a function. **kwargs is used when we don’t know how many keyword arguments will be passed to a function, or it can be used to pass the values of a dictionary as keyword arguments. The identifiers args and kwargs are a convention; you could also use *bob and **billy, but that would not be wise.
6. What are negative indexes and why are they used? Sequences in Python can be indexed with positive as well as negative numbers. Positive indexes start at 0 for the first element, 1 for the second, and so on. Negative indexes count from the end: -1 is the last element, -2 is the second last, and so on. They are used to access or slice elements from the end of a sequence without knowing its length.
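A small illustration of negative indexes, using a made-up list:

```python
letters = ["a", "b", "c", "d"]

last = letters[-1]         # last element
second_last = letters[-2]  # second last element
last_two = letters[-2:]    # slice covering the last two elements

print(last, second_last, last_two)
```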
7. Randomly shuffle the content of a list
8. Take a random sample of 20 elements and put it in a list
9. Take a list and sort it
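Questions 7 to 9 can be answered with the random module and the built-in sorting tools. This is one possible sketch; the list of numbers from 1 to 100 is an assumption chosen for illustration.

```python
import random

numbers = list(range(1, 101))

# 7. Shuffle the list in place.
random.shuffle(numbers)

# 8. Take a random sample of 20 elements; sample() returns a new list.
sample = random.sample(numbers, 20)

# 9. sorted() returns a new sorted list, while list.sort() sorts in place.
ordered = sorted(sample)
sample.sort(reverse=True)

print(ordered)
print(sample)
```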
10. Explain split() and sub() function from the “re” package
split() – uses a regex pattern to split a given string into a list
sub() – finds all substrings where the regex pattern matches and replaces them with a different string
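A short sketch of both functions; the input strings and patterns are made up for illustration.

```python
import re

# Split on a comma or semicolon followed by optional spaces.
parts = re.split(r"[,;]\s*", "red, green;blue")

# Replace every run of digits with "#".
cleaned = re.sub(r"\d+", "#", "Order 66 shipped in 3 days")

print(parts)    # ['red', 'green', 'blue']
print(cleaned)  # Order # shipped in # days
```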
11. What are the supported data types in Python? The most important data types include the following: 1. Number 2. String 3. List 4.Tuple 5. Dictionary 6. Set
12. What is the function to reverse a list? list.reverse()
13. How to remove the last object from the list? list.pop() removes and returns the last object from the list. To remove the object at another position, pass its index: list.pop(i).
14. What is a dictionary? A dictionary is one of the built-in data types in Python. It defines a mapping of unique keys to values (since Python 3.7 it keeps the keys in insertion order). Dictionaries are indexed by keys, and the values can be any valid Python data type (even a user-defined class). Notably, dictionaries are mutable, which means they can be modified. A dictionary is created with curly braces and indexed using the square bracket notation.
15. Python is an object oriented language. What are the features of an object oriented programming language? OOP is the programming paradigm based on classes and instances of those classes called objects. The features of OOP are: Encapsulation, Data Abstraction, Inheritance, Polymorphism.
16. What is the difference between append() and extend() method? Both append() and extend() are methods of list. These methods are used to add elements at the end of the list. append(element) – adds the given element at the end of the list on which the method is called. extend(another-list) – adds the elements of another-list, one by one, at the end of the list on which the method is called.
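The difference is easiest to see when the argument is itself a list, as in this small made-up example:

```python
a = [1, 2]
a.append([3, 4])   # the whole list is added as a single element

b = [1, 2]
b.extend([3, 4])   # each element of the other list is added

print(a)  # [1, 2, [3, 4]]
print(b)  # [1, 2, 3, 4]
```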
17. Write a program to check if a string is a palindrome. A palindrome is a string that reads the same forwards and backwards, like aba, nitin, nurses run, etc.
Below is the code, write it down yourself 😛
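One possible version is sketched here. The function name is_palindrome is an assumption, and ignoring spaces and letter case is a design choice made so that “nurses run” counts as a palindrome.

```python
def is_palindrome(text):
    # Keep only letters and digits, in lower case, so spaces do not matter.
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    # A palindrome reads the same in both directions.
    return cleaned == cleaned[::-1]

print(is_palindrome("nitin"))
print(is_palindrome("nurses run"))
print(is_palindrome("python"))
```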
18. Take a random list and plot a histogram with 3 bins.
19. What is the difference between the range() and xrange() functions in Python? In Python 2, range() returns a list whereas xrange() returns an object that generates numbers on demand. In Python 3, xrange() no longer exists and range() itself generates numbers on demand.
20. Guess the output of the following code
x = "Fox ate the pizza"
print(x[:7])
The slice takes the first 7 characters, so the output is: Fox ate
You can find Python interview questions on many websites, and we will keep on updating this list. Time for some marketing: if you want to get some more interview questions on Python, then click below:-
Welcome to the world of Functions. This is undoubtedly the most important topic of your Data Science career 😛 Functions will make your life easy and your peers’ lives easier!
Let’s start without wasting any time
Defining a function A function is a block of code which runs only when it is called. Let’s start with defining a basic function:
You can also define a simple function to add two numbers by passing values to the function
Information is passed into a function through parameters. In the above example, x and y are two parameters. Arguments are the values passed to these parameters. 4 and 5 are the arguments of the function sum()
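A sketch of both ideas: a basic function and a function that adds two numbers. The names greet and add are assumptions; the post calls the second function sum(), but that name would hide Python's built-in sum(), so a different name is used here.

```python
# A basic function: the body runs only when the function is called.
def greet():
    print("Hello from a function")

greet()

# x and y are parameters of the function.
def add(x, y):
    return x + y

# 4 and 5 are the arguments passed to those parameters.
total = add(4, 5)
print(total)  # 9
```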
Using a default parameter – You might need a default parameter in case no value is passed to the function. It is done in the following way
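A minimal sketch of a default parameter; the function name and the default value "Guest" are made up for illustration.

```python
def welcome(name="Guest"):
    # "Guest" is used only when no argument is passed.
    return "Hello " + name

print(welcome())        # Hello Guest
print(welcome("Monk"))  # Hello Monk
```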
Write a function to get the maximum of two numbers
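One way to write it by hand is sketched below (the name maximum is an assumption). In everyday code the built-in max() does the same job.

```python
def maximum(a, b):
    # Return whichever of the two numbers is larger.
    if a > b:
        return a
    return b

print(maximum(4, 9))
```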
You can also create a function without any name; it is called a lambda function. It is a small anonymous function which can take any number of arguments, but can have only one expression.
Let’s learn the basics of lambda functions. Below is the lambda function to add two numbers.
A lambda function to get the cube root of a number
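Both lambda functions might look like the sketch below. The variable names are assumptions, and the cube root version assumes a non-negative input, because raising a negative number to the power 1/3 gives a complex number in Python 3.

```python
# Lambda to add two numbers.
add = lambda x, y: x + y

# Lambda for the cube root (assumes n >= 0); the result is a float.
cube_root = lambda n: n ** (1 / 3)

print(add(4, 5))
print(cube_root(27))
```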
Why do we need a lambda function? A lambda function is a very convenient way to write small functions, but its real power lies in the fact that you can use it inside another function. Let’s see how a lambda function can be used in a better way:-
Look at the above function hello. It has a parameter n, which is passed as the string “Data”. The lambda function that hello returns is saved in x. Now if you pass a number to x, it is used as a and multiplied with n; passing 4 repeats “Data” 4 times.
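A sketch consistent with the description above, using the same names hello, n, x and a. The exact body of the original function is an assumption.

```python
def hello(n):
    # Return a lambda that remembers n and multiplies it by a.
    return lambda a: a * n

x = hello("Data")   # x is now a function
result = x(4)       # 4 is used as a
print(result)       # DataDataDataData
```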
When you don’t know the number of arguments to pass to a function, then you need to pass a variable parameter.
What *args allows you to do is take in more arguments than the number of formal arguments that you previously defined. With *args, any number of extra arguments can be tacked on to your current formal parameters (including zero extra arguments).
Below is how a variable parameter is passed to a pizza function.
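A possible pizza function is sketched here. The parameter names size and toppings are assumptions; only the *args mechanism itself comes from the text.

```python
def pizza(size, *toppings):
    # toppings is a tuple holding every extra positional argument.
    print("Size:", size)
    for topping in toppings:
        print("-", topping)
    return len(toppings)

pizza("large")                       # zero extra arguments is fine
pizza("medium", "cheese", "corn")    # two extra arguments
```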
**kwargs You can use **kwargs to let your functions take an arbitrary number of keyword arguments (“kwargs” means “keyword arguments”)
The special syntax **kwargs in function definitions in Python is used to pass a keyworded, variable-length argument list. One can think of kwargs as a dictionary that maps each keyword to the value that we pass alongside it. In Python 3.6 and later, iterating over kwargs gives the keywords in the order they were passed; in older versions the order was not guaranteed.
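A short sketch of **kwargs; the function name profile and the keyword values are made up for illustration.

```python
def profile(**kwargs):
    # kwargs is a dictionary of keyword -> value.
    for key, value in kwargs.items():
        print(key, "=", value)
    return kwargs

details = profile(name="Nitin", city="Bangalore")

# A stored dictionary can be unpacked into keyword arguments with **.
stored = {"language": "Python"}
unpacked = profile(**stored)
```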
A few questions which you should try from the previous exercises are:-
1. What is the difference between tuple and list?
2. How to store a dictionary in a list?
3. How to store a list in a dictionary?
4. Create a list using a loop and fill the list with the squares of the numbers from 1 to 10.
5. Write a program to sum all the elements of the list.
You can either go through the previous days’ sessions or google these. For more questions and answers like this, you can purchase our ebook from Amazon. Link below
One of the most important things which you need to learn in Python is the use of conditional statements. These are small code snippets which will help you solve multiple problems in a project or any hackathon.
Conditional statements help us to apply a particular constraint on the data set. Suppose you want to pull the data only for a particular employee or user; or you want to filter the data for a particular date; or you want to count how many males and females there are in the given data set. Everywhere you will be using these conditional statements.
Every programming language has almost the same conditional statements and Python is no exception.
We will try to keep it crisp in this post but it will keep on haunting you in the upcoming posts, so, try to learn the basics here before proceeding.
There are three types of conditional statement used in Python: 1. if 2. else 3. elif
Python, like other programming languages, supports the usual logical conditions:
1. Equals: x == y 2. Not Equals : x != y 3. Greater than : x>y 4. Greater than or equal to : x >= y 5. Less than : x < y 6. Less than or equal to : x <= y
1. if is a simple conditional statement where you put a condition and filter the data set or mould the data set in a particular manner.
P.S. – Python follows indentation religiously, so be very careful while writing code
2. else complements the if statement. Suppose the if condition is not satisfied, then the control will move to else
3. elif helps in putting as many conditions in your program as you like. Look at the example below
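The three statements can be combined as in this sketch. The function name grade and the thresholds are made up for illustration.

```python
def grade(marks):
    if marks >= 80:      # checked first
        return "A"
    elif marks >= 60:    # checked only if the if condition was false
        return "B"
    else:                # runs when no condition above was true
        return "C"

print(grade(85), grade(65), grade(40))
```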
Let’s try some more examples 1. Applying more than one condition using “and” keyword
2. Applying more than one condition using “or” operator
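A sketch of “and” and “or” with made-up values for age and salary:

```python
age = 30
salary = 50000

# "and" is true only when both conditions are true.
both = age > 25 and salary > 60000

# "or" is true when at least one condition is true.
either = age > 25 or salary > 60000

if both:
    print("Both conditions hold")
if either:
    print("At least one condition holds")
```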
3. Applying condition on a list
4. Apply condition on a string
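Conditions on a list and on a string often use the in keyword, as in this sketch with made-up data:

```python
cities = ["Mumbai", "Delhi", "Bangalore"]
name = "The Data Monk"

# 3. Condition on a list: check membership.
city_found = "Delhi" in cities
if city_found:
    print("Delhi is in the list")

# 4. Condition on a string: check a prefix and a substring.
name_matches = name.startswith("The") and "Monk" in name
if name_matches:
    print("The name matches")
```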
5. Multiple if statement
6. If the first “if” condition is true, the remaining branches are skipped: even if a later “elif” condition is also true, the control will not go to it. See the example below where both the if and the elif conditions are true
7. if True condition
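Examples 5 to 7 can be sketched as follows. The value of x and the variable names are assumptions; the elif branch is used to show that a second true condition in the same chain is skipped.

```python
x = 15
messages = []

# 5. Independent if statements: every true condition runs.
if x > 5:
    messages.append("greater than 5")
if x > 10:
    messages.append("greater than 10")

# 6. One if/elif chain: only the first true branch runs.
if x > 5:
    branch = "if"
elif x > 10:
    branch = "elif"
else:
    branch = "else"

# 7. A condition that is literally True always runs.
if True:
    always = "runs every time"

print(messages, branch, always)
```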
Summary of the day
1. You learned the basics of if, else and elif conditional statements
2. You can combine and nest multiple conditional statements
3. You have practiced a few examples of using the conditional statement in different ways
If you have time, make a small calculator using everything you have learnt today
We know that you already know a lot about Python and its capabilities in the Data Science domain.
I have deliberately put screenshots so that you people have to type these commands to practice the syntax of Python 😛
To make sure everyone is on the same page, we will quickly go through the basics of Python:-
1. print("Hello World") – print() function to print anything
2. print("Hello" + " World") – Plus (+) operator to join two strings
3. Python will throw an error if you do not follow indentation in your code
4. In Python you do not have to declare a variable with a data type
5. There are three numeric types supported in Python:- a. int b. float c. complex
Use the type() function to know the data type
6. On multiple occasions you will have to type cast a variable into another data type. Python provides 3 functions for the same: a. int() b. float() c. str()
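A quick sketch of type() and the three casting functions, with made-up values:

```python
a = 5
b = 5.5
c = 2 + 3j

# type() reports the data type of a value.
print(type(a), type(b), type(c))

as_int = int("42")       # string to int
as_float = float("3.14") # string to float
as_str = str(10)         # int to string
truncated = int(9.99)    # int() drops the decimal part, it does not round
```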
7. Some basic string operations:
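A few common string operations are sketched below; the choice of operations and the sample string are assumptions, since the original examples were not preserved.

```python
s = "The Data Monk"

upper = s.upper()                       # all upper case
lower = s.lower()                       # all lower case
length = len(s)                         # number of characters
first_word = s[0:3]                     # slicing
replaced = s.replace("Monk", "Science") # replace a substring
words = s.split(" ")                    # split into a list of words

print(upper, lower, length, first_word, replaced, words)
```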
8. Following are the operators used in Python:
a. Arithmetic Operators – +, -, *, /, %, ** (exponentiation), // (floor division)
b. Assignment Operators – =, +=, -=, etc.
c. Comparison Operators – ==, !=, >, <, >=, <=
d. Logical Operators – and, or, not
e. Identity Operators – is and is not
f. Bitwise Operators – &, |, ^, ~, << and >>
9. A list is an ordered, mutable collection. By mutable we mean that you can change the content of the list. A list can contain any data type.
Functions for List:-
a. len() – To get the length of the list
b. append() – To add an element to the end of a list
c. insert() – To add an element at a desired position
d. remove() – To remove the specified element
e. pop() – It removes the last element if no index is specified
f. del – The keyword del removes the element at the specified index
g. clear() – It empties the list
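The list functions above, applied one after another to a made-up list:

```python
fruits = ["apple", "banana"]

fruits.append("cherry")     # ['apple', 'banana', 'cherry']
fruits.insert(1, "mango")   # ['apple', 'mango', 'banana', 'cherry']
fruits.remove("banana")     # ['apple', 'mango', 'cherry']
last = fruits.pop()         # removes and returns 'cherry'
del fruits[0]               # ['mango']
count = len(fruits)         # 1
remaining = list(fruits)    # copy kept for printing
fruits.clear()              # []

print(last, count, remaining, fruits)
```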
10. A tuple is an unchangeable and ordered collection. A list uses square brackets, whereas a tuple uses round brackets. The value of an element of a tuple cannot be changed, thus it is called immutable.
You can completely delete the tuple, but you cannot add an element to it or delete an element from it.
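A small sketch showing that a tuple rejects changes; the tuple values are made up.

```python
point = (3, 4)
first = point[0]   # reading by index works like a list

# Trying to change an element raises a TypeError.
try:
    point[0] = 10
except TypeError as err:
    error = str(err)

print(first, error)
```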
11. Set is another data structure in Python which is unordered and unindexed. Sets are defined by curly brackets.
You can add new items. To add one item you can use add() method, and to add multiple items you can use update() method.
The len() function is used to get the length of the set. remove() and discard() are used to remove an item from the set; remove() raises an error if the item is missing, while discard() does not. Similarly, you can use pop() to remove an arbitrary item, clear() to empty the set, and del to delete the set completely.
We can use the set() constructor to make a set
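The set operations described above, sketched on made-up data:

```python
colors = {"red", "green"}

colors.add("blue")               # add one item
colors.update(["pink", "red"])   # add several items; duplicates are ignored
colors.remove("green")           # would raise KeyError if "green" were missing
colors.discard("yellow")         # no error even though "yellow" is missing
size = len(colors)

# The set() constructor builds a set from any iterable.
from_list = set([1, 2, 2, 3])

print(colors, size, from_list)
```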
12. Dictionary is one of the most important and most used data structures in Python. It stores values in key-value pairs. It is changeable and indexed by key, and since Python 3.7 it keeps its items in insertion order.
The first element is called the key and the second is the value. If you use the same key for two different values, only the last value is kept; the earlier one is overwritten. Refer to the example below:
Three important ways to access an element in a dictionary are:- 1. for x, y in name.items(): print(x, y)
2. for x in name.values(): print(x)
3. for x in name: print(x)
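A sketch that shows the duplicate-key behaviour and the three loops together. The dictionary called name follows the loops above; its contents are made up.

```python
# "city" appears twice, so the later value replaces the earlier one.
name = {"first": "Nitin", "city": "Delhi", "city": "Bangalore"}

pairs = []
for x, y in name.items():   # keys and values together
    pairs.append((x, y))

values = []
for x in name.values():     # values only
    values.append(x)

keys = []
for x in name:              # keys only
    keys.append(x)

print(pairs, values, keys)
```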
Keep creating tuple, dictionary, and set for the rest of your life 🙂
5-6 years back, Java was said to be everlasting. Everyone wanted a Java developer in their team. Looking at the current scenario, we can safely assume that Python is and will be one of the most used programming languages across multiple domains ranging from software development to web development and Data Science.
Talking particularly about Data Science, Python is blessed by a humongous community of Data Scientists who contribute a lot to the development and betterment of the language. Apart from the community, the libraries and packages which are regularly developed are making it easier for people to explore Data Science.
Python is not the only language which can be used for Data Science. A few other languages are:- 1. R 2. SAS 3. Java 4. C
We will try to cover everything in Python so that you get fluent in at least one language. In the current era, if you have to choose one language to better your career, then do give Python a shot.
At the time of writing this blog, two versions of Python are popular: Python 2.7 and Python 3.*
Download
Start by downloading Anaconda. Once you have the installer on your system, run it. It will take ~10 mins to get it done.
From the start itself, try to use Jupyter notebook for your Python programming.
How to launch Jupyter Notebook? Once you have installed Anaconda, you will get an Anaconda Navigator in your start menu or on your desktop. Double click to open it.
This is what Anaconda Navigator will look like. Click on the Launch button below the Jupyter Notebook icon.
The Jupyter notebook will look something like the one below:
Click on the New button and select Python 3 (if Python 3 has been installed)
Running your first Python program
A programmer is not a programmer if he does not start a new language with Hello World, and I ain’t a programmer no more, so I will start with printing The Data Monk 😛
Write the below simple code:
print("The Data Monk") and press Shift+Enter to run the line of code. The output will be shown just below the code.
In the next few days, we will import a lot of libraries, try out some good algorithms and visualizations, and will solve some case studies.
You can also install R or any other language and search for the implementation of the algorithms and make cool visualizations 🙂
A few of the libraries which will come in handy in this journey are:- 1. NumPy 2. SciPy 3. Matplotlib 4. Pandas
If you have already installed everything, then hop on to Day 21.
1. What is a population? A population is the complete target group of people/objects on which the analysis needs to be performed. If the target is the population of Mumbai, then the population will be all the people living in Mumbai.
2. What is a sample? A sample is a subset of the population. Most of the time you won’t be able to do your complete analysis on the population data set, as there may be hundreds of millions of rows and processing them will consume a lot of time. So, we take a sample of data from the population, which should be random and unbiased.
3. What is a nominal data set? Nominal data is recorded as categories in a data set. For example, rocks can be generally categorized as igneous, sedimentary and metamorphic.
4. What are the types of variables?
Discrete Variable – A variable with a limited number of values, e.g., gender (male/female), college class (freshman/sophomore/junior/senior).
Continuous Variable – A variable that can take on many different values, in theory, any value between the lowest and highest points on the measurement scale.
Independent Variable – A variable that is manipulated, measured, or selected by the researcher as an antecedent condition to an observed behavior. In a hypothesized cause-and-effect relationship, the independent variable is the cause and the dependent variable is the outcome or effect.
Dependent Variable – A variable that is not under the experimenter’s control: the data. It is the variable that is observed and measured in response to the independent variable.
Qualitative Variable – A variable based on categorical data.
Quantitative Variable – A variable based on quantitative data.
In general, statistics is a study of data: describing properties of the data, which is called descriptive statistics and drawing conclusions about a population of interest from information extracted from a sample, which is called inferential statistics.
5. What are the types of measurements in statistics?
1. Measures of Center – Mean, Median and Mode
2. Measures of Spread – Variance, Standard Deviation, Range and Inter Quartile Range
3. Measures of Shape – Symmetry, Skewness, Kurtosis
6. Define mean. The mean is the most common measure of central tendency and the one that can be mathematically manipulated. It is defined as the average of a distribution and is equal to ΣX / N. Simply, the mean is computed by summing all the scores in the distribution (ΣX) and dividing that sum by the total number of scores (N). Example: Heights of five people: 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8 inches. The sum is: 339 inches. Divide 339 by 5 people = 67.8 inches or 5 feet 7.8 inches. The mean (average) is 5 feet 7.8 inches.
7. Give an example of a median. Find the median of 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8 inches. Line up your numbers from smallest to largest: 5 feet 6 inches, 5 feet 7 inches, 5 feet 8 inches, 5 feet 8 inches, 5 feet 10 inches. The median is: 5 feet 8 inches (the number in the middle). Even amount of numbers: Find the median of 7, 2, 43, 16, 11, 5. Line up your numbers in order: 2, 5, 7, 11, 16, 43. Add the 2 middle numbers and divide by 2: (7 + 11) ÷ 2 = 9. The median is 9.
8. Give an example of mode. Find the mode of 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8 inches. Put the numbers in order to make it easier to visualize: 5 feet 6 inches, 5 feet 7 inches, 5 feet 8 inches, 5 feet 8 inches, 5 feet 10 inches. The mode is 5 feet 8 inches (it occurs the most, at 2 times).
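The three worked examples above can be checked with Python’s statistics module. The heights are converted to inches (5 feet 6 inches = 66 inches, and so on); the variable names are assumptions.

```python
import statistics

heights = [66, 67, 70, 68, 68]  # the five heights, in inches

mean = statistics.mean(heights)      # 339 / 5
median = statistics.median(heights)  # middle value after sorting
mode = statistics.mode(heights)      # most frequent value

# With an even count, the median is the average of the two middle values.
even_median = statistics.median([7, 2, 43, 16, 11, 5])

print(mean, median, mode, even_median)
```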
9. What is IQR? The interquartile range is a measure of where the “middle fifty” is in a data set. Where a range is a measure of where the beginning and end are in a set, an interquartile range is a measure of where the bulk of the values lie. That’s why it’s preferred over many other measures of spread (e.g. the range) when reporting things like school performance or SAT scores.
10. How to calculate IQR?
Step 1: Put the numbers in order. 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Step 2: Find the median. Here the median is 9.
Step 3: Place parentheses around the numbers above and below the median. Not necessary statistically, but it makes Q1 and Q3 easier to spot. (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
Step 4: Find Q1 and Q3. Think of Q1 as a median in the lower half of the data and think of Q3 as a median for the upper half of data. (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). Q1 = 5 and Q3 = 18.
Step 5: Subtract Q1 from Q3 to find the interquartile range. 18 – 5 = 13.
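The same steps can be followed in Python. This sketch uses the median-of-halves method described above, leaving the overall median out of both halves when the count is odd; other quartile conventions can give slightly different values.

```python
import statistics

data = sorted([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27])

half = len(data) // 2
lower = data[:half]
# Skip the middle value when the number of items is odd.
upper = data[half + 1:] if len(data) % 2 else data[half:]

q1 = statistics.median(lower)
q3 = statistics.median(upper)
iqr = q3 - q1

print(q1, q3, iqr)  # 5 18 13
```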
11. Define the measure of shape. Measure of Shape – For distributions summarizing data from continuous measurement scales, the shape of the graph can be used to describe how the distribution rises and drops.
Symmetric – Distributions that have the same shape on both sides of the center are called symmetric. The best-known symmetric distribution with only one peak is the normal distribution.
Skewness – Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined. The qualitative interpretation of the skew is complicated and unintuitive.
12. What is positive skewness and negative skewness? Positively skewed – A distribution is positively skewed when it has a tail extending out to the right (larger numbers). When a distribution is positively skewed, the mean is greater than the median, reflecting the fact that the mean is sensitive to each score in the distribution and is subject to large shifts when the sample is small and contains extreme scores.
Negatively skewed – A negatively skewed distribution has an extended tail pointing to the left (smaller numbers) and reflects bunching of numbers in the upper part of the distribution with fewer scores at the lower end of the measurement scale.
A quick way to estimate skewness manually is Pearson’s second skewness coefficient: skewness = (3 * (mean – median)) / standard deviation
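A sketch of the formula above applied to a made-up set of scores with a long right tail. The population standard deviation is used here, which is an assumption; the sample standard deviation would give a slightly different number.

```python
import statistics

scores = [2, 3, 3, 4, 4, 5, 20]  # hypothetical data with one large value

mean = statistics.mean(scores)
median = statistics.median(scores)
sd = statistics.pstdev(scores)   # population standard deviation

skewness = 3 * (mean - median) / sd
print(skewness)  # positive, because the tail extends to the right
```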
13. What is correlation? Correlation is one of the most basic and important concepts in data science. In layman’s terms, it measures the degree of relationship between 2 variables.
For example – Height and Weight are related i.e. taller people are generally heavier than the shorter one. But, the correlation between these might not be perfect. Consider the variables family income and family expenditure. It is well known that income and expenditure increase or decrease together. Thus they are related in the sense that change in any one variable is accompanied by the change in the other variable.
Correlation can tell you something about the relationship between variables. It is used to understand: 1. Whether the relationship is positive or negative 2. The strength of the relationship.
Correlation is a powerful tool that provides these vital pieces of information.
In the case of family income and family expenditure, it is easy to see that they both rise or fall together in the same direction. This is called a positive correlation.
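The strength and direction of the relationship is usually summarised by the Pearson correlation coefficient, which lies between -1 and 1. Below is a hand-written sketch; the function name pearson and the income and expenditure figures are made up for illustration.

```python
import math

def pearson(xs, ys):
    # Covariance of x and y divided by the product of their spreads.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

income = [20, 30, 40, 50, 60]        # hypothetical family incomes
expenditure = [15, 22, 31, 38, 47]   # hypothetical family expenditures

r = pearson(income, expenditure)
print(r)  # close to 1: a strong positive correlation
```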
14. What are the two types of regression? There are two types of regression analysis:- 1. Linear Regression Analysis 2. Multiple Regression Analysis
15. What is Linear Regression? Linear regression analysis is basically a technique used to determine/predict the unknown value of a variable by looking at the known values. If X and Y are two variables which are related, then linear regression helps you to predict the value of Y from the value of X.
A simple example can be the relationship between age of a person and his maturity level. So we can say that these 2 are related and we can guess the maturity level of the person.
By linear regression, we mean models with just one independent and one dependent variable. The variable whose value is to be predicted is known as the dependent variable and the one whose known value is used for prediction is known as the independent variable.
Y = a + bX
This is the linear regression of Y on X, where a and b are unknown constants: a is the intercept and b is the slope of the line.
Choosing which variable to regress on which is one of the most important parts of applying linear regression. For example, suppose you have 2 variables, crop yield (Y) and rainfall (X). Here the construction of the regression line of Y on X would make sense and would be able to demonstrate the dependence of crop yield on rainfall. We would then be able to estimate crop yield given rainfall.
Careless use of linear regression analysis could mean construction of regression line of X on Y which would demonstrate the laughable scenario that rainfall is dependent on crop yield; this would suggest that if you grow really big crops you will be guaranteed a heavy rainfall.
If the regression coefficient of Y on X is 0.53 units, it would indicate that Y will increase by 0.53 if X increased by 1 unit. A similar interpretation can be given for the regression coefficient of X on Y.
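A minimal least-squares sketch of Y = a + bX. The function name fit_line and the rainfall and crop yield numbers are made up; the data were chosen to lie exactly on the line Y = 5 + 2X so the result is easy to check.

```python
def fit_line(xs, ys):
    # Slope b = covariance(x, y) / variance(x); intercept a = mean_y - b * mean_x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

rainfall = [10, 20, 30, 40, 50]      # hypothetical X
crop_yield = [25, 45, 65, 85, 105]   # hypothetical Y

a, b = fit_line(rainfall, crop_yield)
predicted = a + b * 35               # estimated yield for rainfall of 35
print(a, b, predicted)
```

Here b = 2 means that the estimated crop yield rises by 2 units for every 1 unit increase in rainfall.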
16. What is multiple linear regression?
As the name suggests, multiple linear regression uses 2 or more variables as a predictor to get the value of the unknown variable.
For example, the yield of rice per acre depends upon the quality of seed, the fertility of soil, fertilizer used, temperature, rainfall. If one is interested to study the joint effect of all these variables on rice yield, one can use this technique.
An additional advantage of this technique is it also enables us to study the individual influence of these variables on yield.
Y = b0 + b1 X1 + b2 X2 + …………………… + bk Xk
Here b0 is the intercept and b1,b2,b3, etc. are analogous to the slope in the linear regression.
You need to know whether your regression is good or not. In order to judge your regression model, examine the coefficient of determination (R²), which lies between 0 and 1 for a least-squares fit with an intercept. The closer the value of R² is to 1, the better the model.
A related question is whether the independent variables individually influence the dependent variable significantly. Statistically, it is equivalent to testing the null hypothesis that the relevant regression coefficient is zero.
This can be done using a t-test. If the t-test of a regression coefficient is significant, it indicates that the variable in question influences Y significantly while controlling for other independent explanatory variables.
17. What are the major differences between Linear and Multiple linear regression? In simple linear regression a single independent variable is used to predict the value of a dependent variable. In multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables.
As an example, let’s say that the test score of a student in an exam will be dependent on various factors like his focus while attending the class, his intake of food before the exam and the amount of sleep he gets before the exam. Using multiple linear regression one can estimate the relationship between these factors and the test score.
18. What is Logistic Regression? Logistic regression is a class of regression where the independent variables are used to predict a categorical dependent variable. When the dependent variable has two categories, it is a binary logistic regression. When the dependent variable has more than two categories, it is a multinomial logistic regression. When the categories of the dependent variable are ranked, it is an ordinal logistic regression. The coefficients are obtained by maximum likelihood estimation after transforming the dependent variable with the logit function. The logit is the natural log of the odds that the event will occur. Logistic regression does not assume a linear relationship between the dependent and independent variables, and it does not assume homoscedasticity. The Wald statistic tests the significance of each individual independent variable.
19. Can Standard Deviation be negative? The formula for the (population) standard deviation is given below
σ = sqrt( Σ (x – μ)² / N )
Since the differences are squared, added and then rooted, negative standard deviations are not possible.
20. What is p-value and give an example? In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. The null hypothesis is rejected if the p-value is less than the chosen significance level, commonly 0.05 or 0.01, corresponding respectively to a 5% or 1% chance of rejecting the null hypothesis when it is true (Type I error).
Example: Suppose that the experimental results show the coin turning up heads 14 times out of 20 total flips
* null hypothesis (H0): fair coin;
* observation O: 14 heads out of 20 flips; and
* p-value of observation O given H0 = Prob(≥ 14 heads or ≥ 14 tails) = 0.115.
The calculated p-value exceeds 0.05, so the observation is consistent with the null hypothesis (that the observed result of 14 heads out of 20 flips can be ascribed to chance alone), as it falls within the range of what would happen 95% of the time were this in fact the case. In our example, we fail to reject the null hypothesis at the 5% level. Although the coin did not fall evenly, the deviation from expected outcome is small enough to be reported as being “not statistically significant at the 5% level”. <sites.google.com>
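The 0.115 in the coin example can be reproduced from the binomial distribution. This sketch assumes Python 3.8 or later for math.comb; the variable names are assumptions.

```python
from math import comb

n = 20         # total flips
observed = 14  # heads observed

# Probability of 14 or more heads with a fair coin.
one_tail = sum(comb(n, k) for k in range(observed, n + 1)) / 2 ** n

# Two-sided: 14 or more heads, or 14 or more tails.
p_value = 2 * one_tail

print(round(p_value, 3))  # 0.115
```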
Questions from Statistics are mostly around the following topics:- 1. Regression 2. Tests in Statistics 3. Hypothesis testing 4. Mean, Median, and Mode 5. Correlation, Standard Deviation, and Variance
This page will be updated every few days. Keep checking the page.