How do you create a sample of 1,000 rows from a population of 1 million rows and 100 columns?
Question
This could be asked as a SQL, Python, or theory question.
You need to know the basics of sampling.
Statistics
Answer ( 1 )
When creating a sample from a huge data set, we need to make sure the sample stays representative: the distribution of the sample should be similar to that of the population.
There are different approaches to achieve this, depending on how well you know the data:
1. If you don’t have any prior information about your population, do simple random sampling,
i.e. take a random sample in which every row has an equal chance of being selected, and check whether it is representative, for example by comparing the sample’s summary statistics with the population’s. If it is reasonably close, keep it. If not, take another sample and repeat the procedure until you get a good match.
2. If your data has subgroups, say male/female or different countries, use stratified sampling: split the data into homogeneous subgroups and randomly select rows from each one, usually in proportion to its size. Again, check that the resulting sample is representative.
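The two approaches above can be sketched in plain Python. This is only a minimal illustration, not a definitive implementation: the function names, the "group" and "value" columns, and the synthetic population (100,000 rows rather than 1 million, to keep it quick) are all made up for the example. The check at the end, which measures how far the sample mean is from the population mean in standard-error units, is just one possible way to read "check the sample"; comparing histograms or other summary statistics would work as well.

```python
import random
import statistics
from collections import defaultdict


def simple_random_sample(rows, n, seed=None):
    """Draw n rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def stratified_sample(rows, key, n, seed=None):
    """Draw about n rows, giving each subgroup a share proportional to its size."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # Proportional allocation; because of rounding the total can differ
        # slightly from n when group shares are not whole numbers.
        k = round(n * len(members) / len(rows))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


def standardized_mean_gap(population_values, sample_values):
    """Distance between sample mean and population mean, in standard errors.

    Ignores the finite population correction, which is negligible when the
    sample is a small fraction of the population.
    """
    pop_mean = statistics.fmean(population_values)
    pop_sd = statistics.pstdev(population_values)
    standard_error = pop_sd / len(sample_values) ** 0.5
    return abs(statistics.fmean(sample_values) - pop_mean) / standard_error


# Hypothetical population: 60% of rows in group "F", 40% in group "M".
data_rng = random.Random(0)
population = [
    {"id": i, "group": "F" if i % 5 < 3 else "M", "value": data_rng.gauss(50, 10)}
    for i in range(100_000)
]

srs = simple_random_sample(population, 1000, seed=42)
strat = stratified_sample(population, key=lambda r: r["group"], n=1000, seed=42)

# A gap of roughly 2 or less is what sampling noise alone would usually produce.
gap = standardized_mean_gap(
    [r["value"] for r in population],
    [r["value"] for r in srs],
)
print(len(srs), len(strat), round(gap, 2))
```

If the data already lives in a pandas DataFrame, `DataFrame.sample(n=1000, random_state=...)` covers the simple random case, and `DataFrame.groupby(...).sample(frac=...)` covers the stratified one.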
There are other, more advanced techniques as well, such as cluster sampling, multi-stage sampling, and non-probability sampling.