# Boards

## statistics question

(I've tried Google, no luck. I find dis is normally more helpful, though probably not on a Sunday. Here goes.)

Does anyone know how to interpret a data set when you have the entire population (not in the normal sense, but in the sense of having everyone in the group you are looking at) rather than a sample? Say you do a crosstab of two variables and use chi-square to see if there is an association between them. Normally you get a significance that tells you the chance of this association occurring in a sample by chance if it doesn't really exist in the whole population. But how do you interpret it if you do have all of the population in the data set? I really want to know how it applies to logistic regression models. Normally you find out the effects of each variable and eliminate the insignificant ones, but eliminating something because there's too big a chance it might not be real in the population seems wrong if you have the entire population in the dataset. Do you just pretend it is a sample?

## can you explain the question using coloured balls and wildly obscure ethnic names?

I'll stand more of a chance.

## hmmm

I'm not an expert on stats, but I think maybe you're misunderstanding what the p value means. I believe it represents the probability of getting a result at least as strong as yours purely by chance, if there is really no relationship between the variables (i.e. if the null hypothesis is true).

So for the logistic regression example, it would be the chance of seeing that association if there were no real relationship between the variables...? So if there's a high p value, a pattern that strong could easily turn up by chance alone, and you wouldn't include that variable in the model.

I don't think samples and populations make a difference to interpreting the p value. You take a sample as a compromise when looking at the whole population isn't feasible. Because chance has a greater effect in small samples, you'll be more likely to get a higher p value and a non-significant result even when a relationship does exist (a type II error). If you're using the whole population, this is the ideal situation: you're more likely to get a low p value for any relationship that exists.
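To make the sample-size point concrete, here is a rough sketch in Python. The counts are made up for illustration, the function name `chi_square_2x2` is my own, and it only does the plain Pearson chi-square for a 2x2 crosstab (no continuity correction), so a stats package may give slightly different numbers.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: the same 60/40 vs 40/60 split, at two different sizes.
small_table = [[12, 8], [8, 12]]
large_table = [[120, 80], [80, 120]]

chi2_small, p_small = chi_square_2x2(small_table)
chi2_large, p_large = chi_square_2x2(large_table)
print(f"small: chi2={chi2_small:.2f}, p={p_small:.4f}")
print(f"large: chi2={chi2_large:.2f}, p={p_large:.6f}")
```

The proportions are identical in both tables, but the bigger one gives a much smaller p value, which is the "chance has a greater effect in small samples" idea.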

Hm, maybe someone else on here knows better than I do though.

Also this might help: http://en.wikipedia.org/wiki/Statistical_significance

## :(

it's not my fault the site doesn't work properly

## thanks

I always thought statistical significance meant the chance that the association you've found between variables in a sample doesn't exist in the whole population and just occurred due to really bad luck in random sampling, with anything more than 5% being rejected. But I'll read that^ and try and get my head around it, thanks.
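One way to get a feel for the "by chance" part without any sampling story is to shuffle one variable and see how often a gap as big as the real one turns up. This is only a sketch under my own assumptions: the data are made up, `shuffle_p_value` is a hypothetical helper name, and it handles just two 0/1 variables.

```python
import random

def shuffle_p_value(group, outcome, n_shuffles=5000, seed=0):
    """Share of random shuffles whose gap is at least as big as the observed gap."""
    rng = random.Random(seed)

    def gap(outcomes):
        # Absolute difference in the proportion of outcome == 1 between the two groups.
        in_g1 = [o for g, o in zip(group, outcomes) if g == 1]
        in_g0 = [o for g, o in zip(group, outcomes) if g == 0]
        return abs(sum(in_g1) / len(in_g1) - sum(in_g0) / len(in_g0))

    observed = gap(outcome)
    shuffled = list(outcome)
    hits = 0
    for _ in range(n_shuffles):
        # Shuffling breaks any real link between group and outcome.
        rng.shuffle(shuffled)
        if gap(shuffled) >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / n_shuffles

# Hypothetical data: 20 people per group, 12 "yes" in one group and 8 in the other.
group = [1] * 20 + [0] * 20
outcome = [1] * 12 + [0] * 8 + [1] * 8 + [0] * 12

p_shuffle = shuffle_p_value(group, outcome)
print(f"share of shuffles with a gap at least as big: {p_shuffle:.3f}")
```

If that share is large, a gap like the observed one is easy to get from random pairing alone, whether or not the 40 people are everyone in the group.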

## surely you won't have a need for a measure of significance

if you're doing regression analysis for the whole population... simply because the "sample" is significant by its definition.

chi-square is useful because it measures significance relating to a sample from a wider population.

you should just do a simpler kind of regression, some kind of 'least squares' analysis.

i'm no expert either, though.