Snapdeal Interview Question | Machine Learning
Question
If you have 3GB of RAM on your machine and you want to train your model on an 8GB dataset, how would you go about this problem?
Answers ( 3 )
1. Preprocess the data and do feature engineering to decrease its size.
2. Distribute the training with Apache Spark or Hadoop (see the sketch below).
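A minimal PySpark sketch of the second point, assuming the dataset is a CSV with numeric feature columns and a label column; the path data/train.csv and the column name "label" are assumptions, not part of the question:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("train-on-8gb").getOrCreate()

# Spark reads and processes the file partition by partition, so the full
# 8GB never has to sit in the 3GB of local RAM at once.
df = spark.read.csv("data/train.csv", header=True, inferSchema=True)

feature_cols = [c for c in df.columns if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(df).select("features", "label")

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)
model.save("models/lr_8gb")
```

Even on a single machine, Spark spills partitions to disk when they do not fit in memory, which is what makes the 8GB-on-3GB setup workable.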
Some of the techniques that can be applied to process large data:
1. Change the data format
2. Use smaller samples of the data
3. Stream the data or use progressive loading (see the sketch after this list)
4. Use a relational database
5. Use a big data platform
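A minimal sketch of point 3, streaming the data in chunks with progressive (incremental) learning. It assumes a hypothetical CSV data/train.csv with a binary "label" column; the file name, column name, and class labels are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # every class label must be declared up front

# chunksize streams the CSV in ~100k-row pieces instead of loading all 8GB.
for chunk in pd.read_csv("data/train.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    model.partial_fit(X, y, classes=classes)
```

Any estimator that exposes partial_fit (or an equivalent incremental API) can be trained this way; the peak memory use is bounded by the chunk size rather than the dataset size.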
Most datasets contain a large number of observations that add nothing to your model. Simply sampling your data will often give you an equally good model in a fraction of the time.
There are a few scenarios where you truly do need that extra data in the model. Many algorithms support processing training data in batches to achieve this.
If your data is non-sequential, just randomly pick an appropriately sized subset of your data and train on it for a short while. Then choose another subset from the rest of the unused data and repeat. After you exhaust all training samples, repeat the whole cycle from the start. In the case of sequential data, breaking it into multiple consecutive overlapping chunks should work. The best chunk/subset size, the number of training iterations per chunk, and possibly the overlap depend on the characteristics of the data and may require some fine-tuning. Do not forget to reserve an adequate portion of your data for a test set and a validation set, especially when you have an abundance of data.
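A rough sketch of this subset-cycling idea for non-sequential data, assuming the 8GB of features has already been written to a flat float32 binary file with a matching label file; the file names, shapes, and binary labels are all assumptions used only for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Assumed layout: features as a flat float32 binary file, labels as int8.
n_rows, n_cols = 20_000_000, 100            # 20M x 100 x 4 bytes ~ 8GB
X = np.memmap("features.f32", dtype=np.float32, mode="r", shape=(n_rows, n_cols))
y = np.memmap("labels.i8", dtype=np.int8, mode="r", shape=(n_rows,))

model = SGDClassifier()
classes = np.array([0, 1])                  # assumed binary labels
rng = np.random.default_rng(0)

for epoch in range(3):                      # repeat the whole cycle a few times
    order = rng.permutation(n_rows)         # fresh random order each cycle
    for start in range(0, n_rows, 500_000): # ~200MB of features per subset
        idx = np.sort(order[start:start + 500_000])  # sorted reads are cheaper
        model.partial_fit(X[idx], y[idx], classes=classes)
```

The memmap keeps the full array on disk and only the currently selected rows are pulled into RAM, so each training step needs a few hundred megabytes at most. Hold out some rows for validation and testing before running the loop.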