What is Stationarity in Time Series?
The first step in most time series analyses is to make the data set stationary. Stationarity means the series has a roughly constant mean and variance across time.
[Figure: a series with an increasing trend (red) and the same series after de-trending (blue)]
The red line above shows an increasing trend, and the blue line is the result of de-trending the series. De-trending means fitting a regression line and then subtracting it from the original data.
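As a rough sketch of what de-trending looks like in R (the series y and the linear trend here are simulated purely for illustration):

set.seed(42)
t <- 1:100
y <- 0.5 * t + rnorm(100, sd = 3)       # a series with a linear upward trend
fit <- lm(y ~ t)                        # fit a regression line on time
y_detrended <- y - fitted(fit)          # subtract the fitted trend from the original data
plot(t, y, type = "l", col = "red")     # the trending series (red)
lines(t, y_detrended, col = "blue")     # the de-trended series (blue), mean near 0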
Stationarity does not mean that the series does not change over time, only that the way it changes does not itself change over time.
The reason we need stationary data is simple: it is easier to analyze and predict. If a series is consistently increasing over time (like the one above), then the sample mean and variance grow with the size of the sample, and your model or proposed time series solution will systematically underestimate the mean and variance in future periods.
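To see this concretely, here is a small sketch in R with a simulated increasing series; the sample mean keeps growing as the sample grows:

set.seed(3)
x <- 1:300 + rnorm(300, sd = 5)   # a consistently increasing series
mean(x[1:100])                    # ~ 50
mean(x[1:200])                    # ~ 100
mean(x[1:300])                    # ~ 150: the 'mean' depends on how much data you have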
How do you check the stationarity of a series?
In general, we use the Augmented Dickey-Fuller (ADF) test or the KPSS test to check the stationarity of a series. Here we will discuss only the ADF test; KPSS can wait for another day.
The ADF test is a statistical significance test (a test involving a null and an alternative hypothesis), and it falls under the category of 'unit root tests'. Now, what is a unit root test?
The presence of a unit root means the time series is non-stationary. Moreover, the number of unit roots contained in the series corresponds to the number of differencing operations required to make the series stationary.
A time series is a process that can be written in terms of components which contain 'roots'. For example:
v(t) = c + a1 · v(t−1) + ε(t)
The coefficient a1 is a root. You can interpret this formula as 'the value of today depends on the value of yesterday plus some randomness we can't predict'. When |a1| < 1, we expect this process to keep converging back toward its long-run mean, c / (1 − a1).
Try this out with an example: suppose c = 0 and a1 = 0.5.
If yesterday's value v(t−1) was 100, then we expect today's value to be around 50. Tomorrow, we expect the value to be around 25, and so on.
You can see that this series will 'come home', in this case meaning it will converge back to the value of c.
When one of the roots is a unit, i.e. equal to 1 (in this example, when a1 = 1), the series will not recover to its origin: every shock is carried forward in full. You can verify this with the example given above.
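A short simulation sketch in R makes the contrast visible; the coefficients mirror the example above (c = 0, with a1 = 0.5 versus a1 = 1):

set.seed(1)
n <- 200
eps <- rnorm(n)
v_stat <- v_unit <- numeric(n)
v_stat[1] <- v_unit[1] <- 100                 # both series start at 100
for (t in 2:n) {
  v_stat[t] <- 0.5 * v_stat[t - 1] + eps[t]   # a1 = 0.5: decays back toward c = 0
  v_unit[t] <- 1.0 * v_unit[t - 1] + eps[t]   # a1 = 1: a random walk that never 'comes home'
}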
That is why the concepts of unit roots and unit root tests are useful: they give us insight into whether the time series will recover to its expected value. If it will not, the process is very susceptible to shocks and hard to predict and control.
What is the significance of p-value in ADF test?
A high p-value (say 0.87) means we cannot reject the ADF test's null hypothesis that a unit root is present, so we have no evidence that the series is stationary. Note that the p-value is not literally the probability that the series is non-stationary; it is the probability of observing data at least this extreme if the unit root null were true.
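As a sketch, you can see this behaviour with adf.test() from the tseries package on simulated data:

library(tseries)
set.seed(7)
rw <- cumsum(rnorm(200))   # a random walk: contains a unit root
wn <- rnorm(200)           # white noise: stationary
adf.test(rw)$p.value       # high p-value: we cannot reject the unit root null
adf.test(wn)$p.value       # small p-value: evidence against a unit root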
We difference the dataset, sometimes more than once, to make it stationary:
library(tseries)              # provides adf.test()
adf.test(diff(time_series))   # difference once, then test for stationarity
In the snippet above, we take one difference of the time series data and then test its stationarity using the ADF test in R.
You can also try a second difference, or a difference after taking the log, to check stationarity (the log helps when the variance grows with the level of the series):
adf.test(diff(log(time_series)))
A rule of thumb: don't over-difference, i.e., don't apply six or seven rounds of differencing just to push the p-value down and call the dataset stationary.
In the case of a first difference, we are literally taking the difference between each value and the one for the time period immediately before it. If you need a very high order of differencing, it usually means your data has too much noise to follow a time series pattern.
Differencing can help stabilize the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.
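For intuition, diff() literally computes the gap between consecutive observations. A minimal sketch (the trending series here is simulated just for illustration):

set.seed(11)
time_series <- cumsum(rnorm(50)) + 0.3 * (1:50)             # illustrative trending series
head(diff(time_series))                                     # x[2]-x[1], x[3]-x[2], ...
head(time_series[-1] - time_series[-length(time_series)])   # the same values, computed by hand
# the forecast package can also suggest how many differences are needed:
# forecast::ndiffs(time_series)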
Bottom line:
- The value of today depends on the value of yesterday plus some randomness we can't predict.
- Stationarity makes the underlying pattern easier to identify, and hence easier to predict.
- You take a difference of order one, two, three, etc. until the dataset becomes stationary.
- Run an ADF or a KPSS test to check whether the series is stationary.
- After that, chill 🙂
The Data Monk e-books
Tired of online courses costing 2 to 8 lakh and taking more than a year to complete?
Tired of going through 500+ hours of videos at a super slow pace?
We at The Data Monk believe that you should start and complete things as quickly as possible. We believe in target-based study, where we break a topic into 100 questions and make sure that if you cover these questions you will be able to crack the interview questions. The rest of the theory and practice can ONLY be learned while working in an organization.
Pick any of our books from our e-shop page, complete it in 6-8 hours, learn the 100 questions, and put it on your resume. We guarantee that you will nail 8 out of 10 interviews.
We also have 3 bundles at a price that is affordable to everyone. We are a group of people placed in the best product-based companies, and we take 100+ interviews per week, so we know what is being asked and what is not. Just grab any of the following book bundles and spend no more than 30 days LEARNING all the questions. We guarantee that you will become a very strong candidate in any analytics interview.
Set A – [3rd/4th year students and 0 to 3 years of experience]
Crack any analytics or data science interview with our 1400+ interview questions which focus on multiple domains i.e. SQL, R, Python, Machine Learning, Statistics, and Visualization. – https://thedatamonk.com/product/books-to-crack-analytics-interview/
Set B – [0-5 Years of Experience]
1200+ Interview Questions on all the important Machine Learning algorithms (including complete Python code) Ada Boost, CNN, ANN, Forecasting (ARIMA, SARIMA, ARIMAX), Clustering, LSTM, SVM, Linear Regression, Logistic Regression, Sentiment Analysis, NLP, K-Mean – https://thedatamonk.com/product/machine-learning-interview-questions/
Set C – [0-7 Years of Experience]
2000+ interview questions that include 100 questions each on the 12 most asked Machine Learning algorithms, Python, Numpy and Pandas (300 interview questions), plus Pandas, PCA, AWS, Data Preprocessing, Case Studies, and many more
https://thedatamonk.com/product/the-data-monk-e-book-bundle/
Note – Set C contains all the questions of Set B
Youtube Channel – The Data Monk
Unlike other YouTube channels, we do not teach the basics; we teach only the topics that are asked in interviews. If the interviewer asks about the p-value, we will have a video on that topic. If the interviewer is interested in the sequence of execution of SQL commands, we will give you an overview of all the commands but stress the question enough that you can answer it comfortably in the interview. We definitely recommend following our YouTube channel for any topic you are interested in or weak at.
If you wish to get all the study material and interview topics in one place, you can subscribe to our channel. We have covered the complete syllabus in the playlists below:
Get all the YouTube video playlists on our YouTube channel – The Data Monk
Code in Python for Data Science – Understand one algorithm at a time in 30 minutes (theory and python code)
Company-wise Data Science Interview Questions – 15 videos on how to crack analytics interview
Complete Numpy Tutorial – 14 videos on all the functions and questions on Numpy
Complete Python Pandas Tutorial – 15 videos to completely cover Pandas
SQL Complete Playlist – 20 highly recommended videos to cover all the interview questions
Case Study and Guesstimates Complete Playlist – Real-life interview case study asked in 2021
Statistics – 10 videos to completely cover statistics for interviews
Lastly,
If you are in dire need of any help, be it book-wise or guidance-wise, you can definitely connect with me on LinkedIn. We will try to help as much as possible.
Comment (1)
Even when we do regression, we don't want the predictors to be correlated; we don't want multicollinearity. Along similar lines, in a time series the predictors are nothing but past values of the series, hence we remove the correlation between the predictors.