Zhu, Justin

Data Quantity vs Quality

There is too much hype over big data. We should focus more on the quality of data rather than its size.

Qualitative statement: data quality is more important than data quantity.

If we look at winning data teams, much of the work is deciding which variables to include in the model. External data and better variables mattered more than the latest machine learning algorithms.

Quality Check - Where did the data come from?

"N = all" claims we simply have the whole population, but nobody has all the data. There is also a causal inference question: for any data set we have, there are other variables we could have collected but didn't. This is not a new problem at all.

An example is the predictions in the 1936 election, Alf Landon vs. FDR. People who read the Literary Digest were not representative of the people voting in the election.

A smaller, well-chosen sample might give us better results than a huge biased one.

Map example

People were dying of cholera, and at the time nobody knew how cholera was spread. Contaminated water was the real cause.

Around 1850, the idea that germs existed had only just been proposed. The prevailing belief was that cholera spread through bad air.

How was contaminated water responsible for cholera truly proven?

John Snow collected data on what was going on and visualized it, trying to identify a correlation. Maybe the air was terrible around Broad Street, but that explanation was suspicious: nobody in the nearby brewery got cholera. What water were those workers drinking? They had their own source of water.

Meanwhile, people who lived far away from Broad Street still got cholera when they drank water collected from the Broad Street pump.

All these stories illustrate how individuals accumulate evidence and think about the data.

Natural Experiments

Nature is effectively performing an experiment for you. Two different companies were supplying water to the same area, and across some 26,000 houses there was a massive difference in cholera deaths between customers of Southwark & Vauxhall and customers of the Lambeth company. This was both an association and causal evidence: the companies had been competing for customers years before the outbreak, so households often didn't even know which water company served them.

Painstaking data collection: he went door to door to see which company's water was serving each household, since the water pipes were completely intermingled.


Sampling is tricky because the Bayesian vs. frequentist debate appears once more.

If you take a design-based approach, the likelihood function is uninformative. This is all about a finite population: we have a finite number of individuals, and we do not model their values as random variables.

Take a snapshot of the population values at one point in time:

$$y_1, y_2, y_3, \cdots, y_N$$

A census is not a sample but an attempt to enumerate the entire population. Yet even for the census it is, in practice, impossible to find every single person. Hot take: should we use statistics to correct for these errors?

We conduct surveys to correct the results of the census. Another brilliant statistician, George Box, would say that all models are wrong.

These are rather delicate assumptions. The approach is design-based: the only randomness comes from the sampling itself, not from a model for the population values. In this view, the likelihood function is completely uninformative.

The parameters are the $y$ values.

Simple Random Sample (SRS)

We will choose little $n$ out of capital $N$ without replacement. The reason it is called simple is that all $\binom{N}{n}$ possible samples are equally likely.
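As a minimal sketch of this design (the population values below are invented for illustration), `random.sample` draws without replacement, which makes every size-$n$ subset equally likely:

```python
# Sketch of a simple random sample (SRS); population values are made up.
import random

random.seed(0)
population = [random.gauss(50, 10) for _ in range(1000)]  # N = 1000
n = 50

# random.sample draws n items without replacement, so every size-n
# subset of the population is equally likely -- exactly the SRS design.
sample = random.sample(population, n)
y_bar = sum(sample) / n  # sample mean, our estimator of the population mean

print(round(y_bar, 2))
```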



Intuitively, $Cov(Y_1, Y_2)$ is negative because drawing a large value first leaves mostly smaller values behind.

Consider $Y_1 + Y_2 + \cdots + Y_N$, where we sample all of them. There's no harm in continuing to sample until you have the whole population.

Let's have a best-of-seven match: win four games and the match is over. Imagine the teams play all seven games anyway, just for fun. This doesn't change the probability of team A winning the match, and if A wins each individual game independently with probability $p$, then the number of games A wins is binomial.
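This "play all seven anyway" argument can be checked directly. A sketch, where $p = 0.6$ is an arbitrary choice: compute $P(\text{Binomial}(7, p) \geq 4)$ and compare it with a recursion on the match that stops as soon as one team reaches four wins.

```python
# Check: P(A wins a best-of-seven) == P(Binomial(7, p) >= 4).
from math import comb

p = 0.6  # arbitrary per-game win probability for team A

# Route 1: pretend all 7 games are played; count A's wins as Binomial(7, p).
binom_prob = sum(comb(7, k) * p**k * (1 - p)**(7 - k) for k in range(4, 8))

# Route 2: stop the match as soon as one team reaches 4 wins.
def series_win(a, b):
    """P(A reaches 4 wins first, given the current score is a-b)."""
    if a == 4:
        return 1.0
    if b == 4:
        return 0.0
    return p * series_win(a + 1, b) + (1 - p) * series_win(a, b + 1)

direct_prob = series_win(0, 0)
print(binom_prob, direct_prob)  # the two routes agree
```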

If we sample the entire population, the sum is a constant:

$$Y_1 + Y_2 + \cdots + Y_N = y_1 + y_2 + \cdots + y_N$$

Then

$$0 = Var(Y_1 + Y_2 + \cdots + Y_N) = N Var(Y_1) + N(N - 1) Cov(Y_1, Y_2)$$

so $Cov(Y_1, Y_2) = -\frac{\sigma^2}{N - 1}$, where $\sigma^2 = Var(Y_1)$.

$$Var(\bar{Y}) = \frac{1}{n^2}\left(n Var(Y_1) + n(n-1) Cov(Y_1, Y_2)\right) = \frac{\sigma^2}{n} - \frac{(n - 1)\sigma^2}{n(N-1)} = \frac{\sigma^2}{n} \cdot \frac{N-n}{N - 1}$$

The factor $\frac{N - n}{N - 1}$ is the finite population correction. If we observe every element of the population ($n = N$), the correction is zero and the variance vanishes; if $N$ is much larger than $n$, the correction is approximately one and we recover the usual $\sigma^2 / n$.
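A quick simulation sketch of this formula (population values invented): draw many simple random samples and compare the empirical variance of the sample mean with $\frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$, where $\sigma^2$ is the population variance (dividing by $N$, not $N - 1$).

```python
# Simulation check of the finite population correction for the SRS mean.
import random

random.seed(1)
population = [random.gauss(0, 1) for _ in range(200)]  # N = 200
N, n = len(population), 50

mu = sum(population) / N
sigma2 = sum((y - mu) ** 2 for y in population) / N  # population variance

theory = (sigma2 / n) * (N - n) / (N - 1)  # variance with FPC

reps = 20000
means = [sum(random.sample(population, n)) / n for _ in range(reps)]
m = sum(means) / reps
empirical = sum((x - m) ** 2 for x in means) / reps

print(theory, empirical)  # should be close
```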

Stratified Sampling

You break the population up into various strata: age groups, gender groups, location groups, and so on.

Let's assume we know how many individuals are in each group. We then decide how many to sample from each group and take a simple random sample within each group.

The benefit is limiting confounding variables and variability: you can get a lower variance for your overall estimator.

Let’s assume we have $L$ strata.

$N = N_1 + \cdots + N_L$, where $N_l$ is the size of stratum $l$.

Now each stratum $l$ has its own mean $\mu_l$ and variance $\sigma_l^2$.

$$\bar{Y}_{\text{stratified}} = \sum_{l = 1}^L \frac{N_l}{N}\bar{Y}_l$$

In each stratum $l$, we take a simple random sample of size $n_l$.

All of these $\bar{Y}_l$ are independent.

Minimizing the variance with a Lagrange multiplier gives the optimal (Neyman) allocation: $n_l \propto N_l \sigma_l$.

So we sample a stratum more heavily if its population size $N_l$ is large, but also if its variance $\sigma_l^2$ is high.

The theoretical optimum is not necessarily something we can use, since the $\sigma_l$ are usually unknown; in practice people just do proportional allocation, $n_l / n = N_l / N$. Next comes a general estimator which might look complicated but is something we can compute quickly.

$$\frac{n_l}{n} \propto N_l \sigma_l$$
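A sketch of the stratified estimator with proportional allocation (the strata names, sizes, and values below are all invented):

```python
# Stratified sampling sketch: SRS within each stratum, proportional
# allocation n_l / n = N_l / N, estimator weighted by stratum shares.
import random

random.seed(2)
strata = {
    "young": [random.gauss(20, 2) for _ in range(600)],
    "mid":   [random.gauss(40, 5) for _ in range(300)],
    "old":   [random.gauss(60, 8) for _ in range(100)],
}
N = sum(len(v) for v in strata.values())  # total population size
n = 100                                   # total sample size

y_bar_strat = 0.0
for values in strata.values():
    N_l = len(values)
    n_l = round(n * N_l / N)               # proportional allocation
    sample_l = random.sample(values, n_l)  # SRS within the stratum
    y_bar_l = sum(sample_l) / n_l
    y_bar_strat += (N_l / N) * y_bar_l     # weight by stratum share

print(round(y_bar_strat, 2))
```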

Horvitz-Thompson Estimator

Let $\tau = \sum_{j = 1}^N y_j$ be the population total we want to estimate.

The estimator is $$\hat{\tau} = \sum_{i \in S} \frac{Y_i}{\pi_i}$$

Here $S$ is the sample and $\pi_i$ is the probability that individual $i$ is in the sample at least once.

Probability of being in the sample is in the denominator.
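A sketch of the estimator (the $y$ values and inclusion probabilities below are made up, and the design is assumed to be Poisson sampling: each unit is included independently with probability $\pi_i$). Averaging $\hat{\tau}$ over many draws shows the unbiasedness, $E[\hat{\tau}] = \tau$:

```python
# Horvitz-Thompson estimator: tau_hat = sum over sampled i of y_i / pi_i.
import random

random.seed(3)
y = [12.0, 45.0, 7.5, 88.0, 30.0, 61.0]  # made-up population values
pi = [0.9, 0.5, 0.5, 0.3, 0.3, 0.2]      # unequal inclusion probabilities
tau = sum(y)                             # true population total

def ht_estimate():
    # Poisson sampling: include unit i independently with probability pi[i].
    sample = [i for i in range(len(y)) if random.random() < pi[i]]
    return sum(y[i] / pi[i] for i in sample)  # weight = 1 / pi_i

# Unbiasedness check: the average of tau_hat over many draws is near tau.
reps = 20000
avg = sum(ht_estimate() for _ in range(reps)) / reps
print(round(tau, 1), round(avg, 1))
```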

Basu's Elephant

There are 50 elephants. We pick one representative, average elephant, Stampy, weigh him, and multiply by 50. Suppose the sampling design selects Stampy with probability $99/100$.

$$\hat{\tau} = \frac{y_{\text{Stampy}}}{99/100} = \frac{100}{99}\, y_{\text{Stampy}} \approx 1.01\, y_{\text{Stampy}}$$
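A sketch of why this is absurd, assuming the classic setup (all weights invented): Stampy is selected with probability $99/100$, and each of the other 49 elephants with probability $(1/100)/49$. Horvitz-Thompson is unbiased, but either possible estimate is ridiculous: about one elephant's weight if Stampy is drawn, and a vast overestimate otherwise.

```python
# Basu's elephant: unbiased does not mean sensible.
stampy = 4000.0                    # Stampy's weight in kg (made up)
others = [5000.0] * 49             # the other 49 elephants (made up)
true_total = stampy + sum(others)  # total herd weight

p_stampy = 99 / 100
p_other = (1 / 100) / 49           # each other elephant's inclusion prob.

ht_if_stampy = stampy / p_stampy   # ~1.01 * one elephant's weight
ht_if_other = others[0] / p_other  # enormous: weight / tiny probability

# Unbiasedness: the expected value of the HT estimate equals the total.
expected = p_stampy * ht_if_stampy + 49 * p_other * ht_if_other
print(round(ht_if_stampy), round(ht_if_other), round(expected))
```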