2010 Biostatistics 08
Normal Distribution
ORIGIN := 0    < Mathcad directive: arrays in this worksheet are indexed from 0
The Normal Distribution
The Normal Distribution, also known as the "Gaussian Distribution" or "bell curve", is the most widely
employed function relating observations X to probability P(X) in statistics. Many natural populations
are approximately normally distributed, as are several important derived quantities even when the
original population is not normally distributed.
Properly speaking, the Normal Distribution is a continuous "probability density function", meaning that
values of a random variable X may take on any numerical value, not just discrete values. In addition,
because the possible values of X are infinite, the "exact" probability P(X) for any single X is zero. Thus, in
order to determine probabilities one typically looks at intervals of X, such as X > 2.3 or 1 < X < 2, and so
forth. It is interesting to note that because the point probability P(X) = 0, we don't have to worry about
correctly interpreting pesky boundaries, as seen in discrete distributions: X > 2 means the same thing
as X ≥ 2, and X < 2 is the same as X ≤ 2.
As described previously, the Normal distribution N(μ, σ²) consists of a family of curves that are specified by
supplying values for two parameters: μ = the mean of the Normal population, and σ² = the variance of
the same population.
Prototyping the Normal Function using the Gaussian formula:
Making the plot of N(50,100):
μ := 50      < specifying mean (μ)
σ² := 100    < specifying variance (σ²)
i := 0 .. 100    < defining a bunch of X's ranging in value from 0 to 100. Remember that the
                   range of X is infinite, but we'll plot 101 points here. That should give us
                   enough points to give us an idea of the Gaussian function shape!
Xi := i
Y1i := (1/√(2·π·σ²))·e^(−(Xi − μ)²/(2·σ²))    < formula for the Normal distribution. Here we
                                                have computed P(X) for each of our X's.
                                                Zar 2010 Eq. 6.1, p. 66.
Now, let's compare with Mathcad's built-in function:
Y2i := dnorm(Xi, μ, √σ²)    < Mathcad's function asks us to provide the
                              standard deviation rather than the variance...
Plotting the two sets of Y's:
[Plot: Y1i and Y2i versus Xi, for 0 ≤ Xi ≤ 100; the two curves coincide]
< The two approaches give the same probability function P(X) for X, so this prototype
confirms the built-in function.
Prototype in R:
dnorm(x,mu,sigma)
^ R has a nearly identical function, see Lecture Worksheet 07
What happens when μ or σ² is changed:
The location of the mode changes (translation by μ) and the width of the hump changes, showing
greater or lesser variance - see 2010 Biostatistics Lecture Worksheet 07.
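A brief R sketch of this behavior (the particular parameter values are my own illustrative choices, not from the worksheet):
# Overlaying Normal curves shows mu translating the mode and sigma^2 widening the hump.
x <- seq(0, 100, length.out = 201)
plot(x, dnorm(x, mean = 50, sd = 10), type = "l", ylab = "P(X)")   # N(50,100)
lines(x, dnorm(x, mean = 65, sd = 10), lty = 2)                    # translated: mu = 65
lines(x, dnorm(x, mean = 50, sd = 20), lty = 3)                    # wider: sigma^2 = 400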
Simulation of Normally Distributed Data:
μ := 65
σ := 25
σ² = 625
X := rnorm(1000, μ, σ)
Descriptive Statistics for X:
n := length(X)    n = 1000
mean(X) = 63.5061
var(X) = 606.3107    < Note: Mathcad has two functions:
Var(X) = 606.3107      var(X) = population variance
                       Var(X) = sample variance
[Plot: the 1000 simulated values Xi plotted against index i]
^ The mean and variance of this sample are close to, but not exactly equal to, N(65,625).
This is to be expected of a sample as opposed to the entire population.
Histogram of X:
plot := histogram(50, X)
[Plot: histogram of X in 50 classes, X ranging from about 0 to 150]
Prototype in R:
#CREATING A PSEUDORANDOM
#NORMAL DISTRIBUTION:
X=rnorm(1000,65,25)
hist(X,nclass=50,col="gray",border="red")
< R has a nearly identical function rnorm(n,mu,sigma), where n = number of points desired
Standardizing the Normal Distribution:
In many instances, we have a sample that we may wish to compare with a Normal Distribution. Using
computer-based functions, as above, one has little difficulty calculating probabilities P(X) and simulating
additional samples from a Normally Distributed population N(μ, σ²). When using published tables,
however, it is often useful to compare probabilities with the Standard Normal Distribution ~N(0,1).
This is done by Standardizing the Data:
Given your X's ~N(μ, σ²), you create a new variable Z ~N(0,1) by means of a Linear Transformation:
i := 0 .. 999
Zi := (Xi − μ)/σ     < Z's are now Standardized ~N(0,1)
mean(Z) = −0.0598    < sample estimates are close to, but not exactly equal to, N(0,1)
Var(Z) = 0.9701
Histogram of Z:
plot := histogram(50, Z)
[Plot: histogram of Z in 50 classes, Z ranging from about −3 to 4]
Prototype in R:
#STANDARDIZING DATA:
mu=65
sigma=25
Z=(X-mu)/sigma
hist(Z,nclass=50,col="gray",border="red")
Note: in both cases here, we had prior knowledge of μ and σ². With real-world data, we will
have to estimate these values, usually with Xbar & s².
Calculating Probabilities & Quantiles:
The above graphs display the relationship between X values, or observations (also called quantiles), and the
probability that a range (or bin) of X is expected to have given the assumption of Normal probability for X,
indicated as P(X). Most statistical software packages have standard "p" and "q" functions allowing conversion
from X to P(X) and vice versa. In the most useful form, the probability function is given as a Cumulative
Probability (X) starting from X values of minus infinity up to X. In each case a specific cumulative probability
function reqires that one provides specific parameter values for the curve (, ,), along with X OR (X).
Probabilities of the Normal Distribution and Cumulative Normal Distribution N(0,1):
i := 0 .. 100
Xi := (i − 50)/10    < scaling 101 X's to a reasonable scale...
μ := 0
σ := 1    < parameters of the Normal N(0,1) distribution...
Y3i := dnorm(Xi, μ, σ)    < interval estimate of probability P(X) for each X
Y4i := pnorm(Xi, μ, σ)    < cumulative probability Φ(X) for each X
Plots of the Normal Distribution and Cumulative Normal Distribution N(0,1):
[Plot: density Y3i and cumulative Y4i versus Xi, for −4 ≤ Xi ≤ 4; vertical axis 0 to 1]
Prototype in R:
#PQ FUNCTIONS FOR NORMAL DISTRIBUTION:
mu=0
sigma=1
X=1.6449
PHI=0.90
dnorm(X,mu,sigma) # interval estimate P(X) given X
pnorm(X,mu,sigma) # cumulative phi(X) given X
qnorm(PHI,mu,sigma) # X given cumulative phi(X)
Calculating Intervals of the Cumulative Normal Distribution:
μ := 0
σ := 1    < Normal distribution parameters (change these if desired)

Probability that X ranges between −1 and 1:
dnorm(−1, μ, σ) = 0.242     dnorm(1, μ, σ) = 0.242     < P(X)
pnorm(−1, μ, σ) = 0.1587    pnorm(1, μ, σ) = 0.8413    < Φ(X)
^ cumulative value at MIN of interval    ^ cumulative value at MAX of interval
pnorm(1, μ, σ) − pnorm(−1, μ, σ) = 0.6827 = 68.27%    < calculating MAX cut-off − MIN cut-off

Probability that X ranges between −2.576 and 2.576:
dnorm(−2.576, μ, σ) = 0.0145    dnorm(2.576, μ, σ) = 0.0145    < P(X)
pnorm(−2.576, μ, σ) = 0.005     pnorm(2.576, μ, σ) = 0.995     < Φ(X)
pnorm(2.576, μ, σ) − pnorm(−2.576, μ, σ) = 0.99 = 99%    < calculating MAX cut-off − MIN cut-off

Probability that X ranges between −1.96 and 1.96:
dnorm(−1.96, μ, σ) = 0.0584    dnorm(1.96, μ, σ) = 0.0584    < P(X)
pnorm(−1.96, μ, σ) = 0.025     pnorm(1.96, μ, σ) = 0.975      < Φ(X)
pnorm(1.96, μ, σ) − pnorm(−1.96, μ, σ) = 0.95 = 95%    < calculating MAX cut-off − MIN cut-off
Prototype in R:
#EXAMPLE INTERVAL CALCULATIONS:
mu=0
sigma=1
MIN=pnorm(-1,mu,sigma)
MAX=pnorm(1,mu,sigma)
MAX-MIN
MIN=pnorm(-2.576,mu,sigma)
MAX=pnorm(2.576,mu,sigma)
MAX-MIN
MIN=pnorm(-1.96,mu,sigma)
MAX=pnorm(1.96,mu,sigma)
MAX-MIN
2010 Biostatistics 09
Assessing Data Normality
ORIGIN := 1    < note: arrays in this worksheet are indexed from 1
Assessing Normality of sample data is an essential part of statistical analysis. Q-Q Plots are one easy way to
do this. They are also interesting at this point in our course since they demonstrate the use of the inverse
cumulative probability function for the Normal Distribution.
Q-Q Plots:
Reading Anderson's Iris data:
iris := READPRN("c:/2010BiostatsData/iris.txt")
SL := iris⟨2⟩    < assigning variable SL
n := length(SL)    n = 150    < n = number of observations X
i := 1 .. n    < constructing index variable i
XbarSL := mean(SL)    XbarSL = 5.8433    < mean of X
SDSL := √Var(SL)      SDSL = 0.8281      < sample standard deviation of X
SESL := SDSL/√n       SESL = 0.0676      < standard error of the sample mean of X
Calculating Cumulative Probability levels Φ(X):
We will look at variable SL here. First we sort SL:
SLsort := sort(SL)
SL (first 16 values):     5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7
SLsort (first 16 values): 4.3, 4.4, 4.4, 4.4, 4.5, 4.6, 4.6, 4.6, 4.6, 4.7, 4.7, 4.8, 4.8, 4.8, 4.8, 4.8
Now we treat each sorted observation SLsort_i as an observed quantile, and assign its index i a
normal cumulative probability Φ(X):
Φi := (i − 1/2)/n    < the 1/2 here is a correction factor
Φ (first 16 values): 0.0033, 0.01, 0.0167, 0.0233, 0.03, 0.0367, 0.0433, 0.05, 0.0567,
                     0.0633, 0.07, 0.0767, 0.0833, 0.09, 0.0967, 0.1033
From the values of Φ(X), we now convert back to X:
Qi := qnorm(Φi, 0, 1)
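A minimal R sketch of this manual construction (assuming SL has been read in as above; the intermediate variable names are mine):
# Manual Q-Q construction: sorted observations vs. standard Normal quantiles
SLsort <- sort(SL)
n <- length(SL)
Phi <- ((1:n) - 0.5)/n    # cumulative probability levels, with the 1/2 correction
Q <- qnorm(Phi, 0, 1)     # convert probabilities back to theoretical quantiles
plot(Q, SLsort)           # the Q-Q plot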
Plotting SLsort vs Q:
Q (first 16 values): −2.7131, −2.3263, −2.128, −1.9893, −1.8808, −1.7908, −1.7132, −1.6449,
                     −1.5834, −1.5274, −1.4758, −1.4279, −1.383, −1.3408, −1.3008, −1.2628
[Plot: SLsort (4 to 8) versus Q (−3 to 3)]
If the sample data are distributed close to the Normal distribution, the Q-Q plot should be mostly a
straight line in the center with an overall S-shaped curve towards each end.
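For comparison, base R automates exactly this construction; a short sketch (for n > 10, qqnorm() uses the same (i − 1/2)/n plotting positions, via ppoints(), as the manual version above):
qqnorm(SL)    # sorted data vs. theoretical Normal quantiles
qqline(SL)    # reference line through the quartiles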
Prototype in R:
#READ IRIS TABLE AND ASSIGN VARIABLE SL
K=read.table("c:/2010BiostatsData/iris.txt")
attach(K)
SL=Sepal.Length
#LOAD PACKAGE - choose "lattice" from pop-up list
library(lattice)  # loading lattice directly; the original interactive selection call was garbled in this copy
qqmath(~SL)       # lattice normal Q-Q plot of SL, assuming this was the plotting call intended
Kurtosis > 3 is leptokurtic - there is a more acute peak at the mean and fatter tails.
2010 Biostatistics 10
Repeated Sampling: Distribution of Means and Confidence Intervals
ORIGIN := 0
Given the general setup in statistics between random variable X and the probability P(X) governed by a
Probability Density Function such as the Normal Distribution, one typically uses a specific random sample to
estimate the population parameters. Estimation of this sort also involves considering what happens when a
population is repeatedly sampled. One is particularly interested in the sampling distribution of repeated
estimates, such as the mean, and how these estimates may be related to probability.
For the Normal Distribution, the population parameters are:
μ = population mean
σ² = population variance
From our sample, we have the analogous calculations, termed point estimates:
Xbar = sample mean
s² = sample variance
Different kinds of statistical theory underlie point estimates, generally allowing them to be categorized in
one of two ways:
- "minimum variance", also known as "least squares minimum", "unbiased", or "Normal theory"
estimators, and
- "maximum likelihood" estimators.
How to calculate estimators of these two types is beyond the scope of introductory statistics courses. The
important thing to remember is that the two methods of estimation often, but not always, yield the same
point estimators. The point estimators then feed into specific statistical techniques. Thus, it is sometimes
important to know which estimator is associated with a particular technique so as not to mix approaches.
Maximum likelihood estimators, based on newer theory, are often specifically indicated as such (often using
'hat' notation).
In the case of estimating parameters for the Normal Distribution, Xbar is the point estimate for μ under both
estimation theories. However, s², a sum of squares with (n−1) as divisor, is the point estimate using Normal
theory, whereas σ̂², with the same sum of squares but using (n) as divisor, is the point estimate using
"maximum likelihood" theory. Confusing, yes, but now that you know the difference, not all that bad...
Estimating error on point estimates of the mean:
Although Xbar is our Normal theory estimate of population parameter μ based on a single sample, one might
readily expect Xbar to differ from sample to sample, and it does. Thus, we need to estimate how much Xbar
will vary from sample to sample. Multiple sampled means differ from each other much less than individual
sample values of X do. The relationship is called the standard variance of the mean. The square
root of the variance of the mean is called the standard error of the mean, or simply standard error:
Standard Variance of the Mean = s²/n (sample variance/n)
Standard Error of the Mean (SEM) = s/√n (sample standard deviation/√n)
Central Limit Theorem:
This result is one of the reasons why Normal theory, and the Normal Distribution, underlie much of
"parametric" statistics. It says that although the populations from which random variable X is drawn
may not necessarily be normally distributed, the population of means derived by replicate sampling will be
normally distributed. This result allows us to use the Normal Distribution, with parameters μ and σ² estimated
respectively by Xbar and s² (or occasionally σ̂²), to estimate probabilities of means P(X) for various values
of X.
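A hedged R sketch of the theorem in action, using a deliberately non-Normal parent population (an exponential; the sample sizes are my own illustrative choices):
# Means of repeated samples from a skewed population are approximately Normal,
# with standard deviation close to sigma/sqrt(n).
n <- 30
means <- replicate(5000, mean(rexp(n, rate = 1)))   # 5000 sample means
hist(means, nclass = 50)   # bell-shaped despite the skewed parent population
sd(means)                  # compare with the theoretical SEM:
1/sqrt(n)                  # sigma/sqrt(n) = 1/sqrt(30), since sigma = 1 for a rate-1 exponential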
Statistics evaluating location of the mean:
Suppose we collect a sample from a population and calculate the mean Xbar. How reliable is Xbar as an
estimate of μ? The usual approach is to estimate a difference (also called a distance) between Xbar and μ,
scaled to the variability in Xbar encountered from one sample to the next:
Z := (Xbar − μ)/(σ/√n)    < distance divided by the Standard Error of the Mean
If somehow we know the population parameter σ, then we can resort directly to the standardized Normal
Distribution ~N(0,1) to calculate probabilities P(Z) or cumulative probabilities Φ(Z). However, in real-life
situations σ is not known, and we must estimate σ by s. When we do this, the analogous variable t:
t := (Xbar − μ)/(s/√n)    < same standardizing approach, but using s instead of σ
is no longer Normally distributed. Instead, we resort to a new probability density function, known as
"Student's t", to calculate P(t) or Φ(t) given t. Student's t is a commonly employed statistical function,
ranking high in importance along with the chi-square (χ²) distribution and the F distribution. The Student's
t distribution looks very much like the Normal distribution in shape, but is leptokurtic. Typically in
statistical software, both distributions are supplied with analogous functions. See Lecture Worksheet 07
and the Prototype in R below for them. Although Zar in Chapter 6 prefers to talk only about the Normal
distribution, by assuming he/we know σ, I think it may be clearer to talk about both together here. The
arguments are identical, with the difference between them related to whether we know σ or whether we
estimate σ by s.
Prototype in R:
#ANALOGOUS FUNCTIONS FOR
#NORMAL AND T DISTRIBUTIONS
#NORMAL DISTRIBUTION
mu=0 #parameter for mean
sigma=1 #parameter for standard deviation
n=1000 #number of randomly generated data points
X=1.96 #quantile X
P=0.95 #cumulative probability phi(X)
rnorm(n,mu,sigma) #to generate random data points
dnorm(X,mu,sigma) #P(X) from X
pnorm(X,mu,sigma) #phi(X) from X
qnorm(P,mu,sigma) #X from phi(X)
#t DISTRIBUTION
df=5 #degrees of freedom parameter
n=1000 #number of randomly generated data points
X=1.96 #quantile X
P=0.95 #cumulative probability phi(X)
rt(n,df) #to generate random data points
dt(X,df) #P(X) from X
pt(X,df) #phi(X) from X
qt(P,df) #X from phi(X)
Confidence Interval for the Mean:
A sample Confidence Interval (CI) for a sample mean of X (or equivalently in Z or t) is the estimated
range over which repeated samples of Xbar (or Zbar or tbar) are expected to fall (1−α)×100% of the time. If
a hypothesized value for the mean, say μ0, falls within a CI, then we say μ0 is "enclosed" or "captured" by the
CI with a confidence of (1−α). Equivalently, for repeated samples, μ0 will be enclosed within repeated CI's
(1−α)×100 percent of the time.
Let's calculate CI from a pseudo-random example:
X := rnorm(100, 50, 10)    < here in fact we know μ = 50 and σ² = 100
n := length(X)    n = 100
μ := 50    σ := 10    σ² = 100    < known population parameters
Xbar := mean(X)    Xbar = 48.4955    < we can also pretend that we don't know the population
s := √Var(X)       s = 9.8208          parameters and must use the sample mean and variance
s² = 96.4487                           instead, as one usually would with real data.
Calculation of Confidence Intervals:
α := 0.05       < we choose a limit probability α allowing sample means to differ from
1 − α = 0.95      μ α×100 percent of the time...
^ since both the Normal and the t probability distributions are symmetrical, there are
equal-sized tails above and below the hypothesized or known μ. Each tail therefore has α/2
probability. This is commonly known as the Two-Tail case...
If μ and σ are known - the Normal Distribution Case:
μ := 50    σ := 10    n = 100
L := qnorm(α/2, 0, 1)        L = −1.96    < lower limit of N(0,1) for α/2 = 0.025
U := qnorm(1 − α/2, 0, 1)    U = 1.96     < upper limit of N(0,1) for 1 − α/2 = 0.975
CI := (μ + L·σ/√n    μ + U·σ/√n)
CI = (48.04 51.96)    < calculating the Confidence Interval using population μ and σ.
                        Note here that I calculated each tail explicitly, so I added both L
                        and U to determine the CI. However, since the distribution is
                        symmetrical, one might alternatively use C = the absolute value of
                        L or U. In that case one subtracts C·σ/√n from the mean for the
                        lower limit and adds C·σ/√n to the mean for the upper limit.
                        Note here that the Error of the Mean is derived from known
                        population parameters.
If μ and σ are unknown - the t Distribution Case:
Parameters μ and σ must be estimated by the sample Xbar and s:
Xbar = 48.4955    s = 9.8208
df := n − 1    df = 99    < single parameter of Student's t distribution, called
                            "degrees of freedom": df = (n − 1), where n is sample size.
L := qt(α/2, df)        L = −1.9842    < α/2 = 0.025
U := qt(1 − α/2, df)    U = 1.9842     < 1 − α/2 = 0.975
CI := (Xbar + L·s/√n    Xbar + U·s/√n)
CI = (46.5468 50.4441)    < calculating the Confidence Interval. Note here that I calculated
                            each tail explicitly, so I added both L and U to determine the CI.
                            Note also that the SEM is measured by the sample quantity s/√n.
Prototype in R:
#CONFIDENCE INTERVALS
mu=50
sigma=10
n=100
X=rnorm(100,mu,sigma)
alpha=0.05
#NORMAL DISTRIBUTION
L=qnorm((alpha/2),0,1)
L
U=qnorm((1-alpha/2),0,1)
U
#confidence interval:
mu+L*(sigma/sqrt(n))
mu+U*(sigma/sqrt(n))
#t DISTRIBUTION
df=n-1
s=sqrt(var(X))
Xbar=mean(X)
L=qt((alpha/2),df)
L
U=qt((1-alpha/2),df)
U
#confidence interval (centered on the sample mean, as in the worksheet above):
Xbar+L*(s/sqrt(n))
Xbar+U*(s/sqrt(n))
#NOTE: These values don't match MathCad
#because they are based on a different sample!
2010 Biostatistics 11
Formal Statistical Tests
ORIGIN := 0
The Formal Logic of Statistical Tests
The biological literature is full of scientific research papers in which data that are presumably random
samples of larger populations are collected. From these, sample descriptive statistics are calculated and
summarized. The authors then proceed to advance one or more hypotheses concerning the problem under
study. From this, usually in the Results section or associated tables, these hypotheses or related derived
statistics are judged either to be statistically significant or insignificant, and often probability values and/or
confidence intervals are reported. All of this, regarding hypotheses, significance and confidence intervals,
falls under the rubric of Inferential Statistics.
As an associate editor of a major journal and frequent reviewer, I very often receive papers to appraise
that include inferential statistics. It is depressingly common to see results summarized in an incoherent
fashion. Usually, incompletely labeled tables are presented that are strikingly similar to the output of one
or another statistical "black box" with significance levels indicated by ** etc. However, it remains unclear
just what the author(s) had in mind, or just what conclusions they or the reader are supposed to draw from
the output. In reading the Material & Methods section of the paper, these authors are often very precise
about the software utilized (e.g., SPSS vers. xxx, such-and-such a procedure with whatever options chosen)
but frustratingly vague about WHY a particular technique was chosen given their data, or WHAT their
statistical hypotheses might have been, or even HOW the results derived from the "black box" relate to
the conclusions they are trying to draw. Sometimes I come to the conclusion that the authors know what
they are doing but are simply unclear in their presentation. In other instances, however, the authors are
clearly relying too much on the "black box" to do the thinking for them. (As an aside, I tend to have
fewer problems of this type with authors who use R. My guess is that in order to use R, one usually has to
spend a little more time learning proper statistical technique...)
In conducting inferential statistics in biological research, therefore, it is very important to consider
carefully, and be explicit about, the logic of what one is doing, and to provide readers of your papers with
sufficient information that they can fill in the gaps where necessary. Most textbooks in statistics present
this logic reasonably well, at least the first time it is encountered in the book. Many, including Zar, become a
little sloppy thereafter because they assume that they have already told you how the logic works (which
they have) and are subsequently trying to add new issues into the mix along with, perhaps, an intuitive
rationale. Also, in the case of Zar, the author is attempting to be comprehensive, necessitating brevity
within the extended narrative of the book.
In my opinion, biologists conducting statistical analysis have the following multi-part problem:
1). First, one must state clearly just what biological hypothesis, or hypotheses (one at a time), are the
subject of the study. Such hypotheses must be independent of, and preferably stated prior to, data collection.
2). Given the biological hypothesis, one must find an appropriate statistical procedure, or perhaps several,
with underlying assumptions that qualify them as most readily applicable.
3). Data must be collected and analyzed in a way that is consistent with all of the assumptions of the
chosen statistical procedure(s). The procedures typically follow a specific logic that must be
understood and strictly followed.
4). Results then need to be presented in a way that respects the logic of each statistical test and allows for
reconstruction of missing steps, when necessary, by potential readers.
5). Finally, and most importantly, there must be an explicit consideration of whether any of the statistical
results actually mean anything as far as the original biological hypotheses were concerned.
Logic of Statistical Tests:
Here's an excellent framework to follow in conducting a statistical test (the example comes from a One-Sample
t-test of the mean; we'll see this shortly):
Assumptions:
- Observed values X1, X2, X3, ... Xn are a random sample from ~N(μ, σ²).
- Variance σ² of the population is unknown.
^ Each statistical test is only applicable to specific kinds of samples drawn from a
population with specific properties. In this case, data values X are a properly drawn
random sample from a population that has a Normal Distribution with population
parameters mean = μ and variance = σ² that are unknown. The researcher needs to
verify whether the data at hand might be drawn from a Normal Distribution. If so, then
one can proceed. If not, the test is formally inapplicable. In many instances, however,
tests may be robust to violations of one or more assumptions. For example, the t-test is
reasonably robust to the assumption of population Normality, so usually one
can proceed as long as the sample isn't wildly non-Normal.
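One way to screen a sample before proceeding, sketched in R (this particular check is my illustration; the worksheet's own approach is the Q-Q plot of Lecture 09):
X <- rnorm(30, 50, 10)    # stand-in sample; substitute real data here
qqnorm(X); qqline(X)      # points near the line suggest approximate Normality
shapiro.test(X)           # Shapiro-Wilk test: a small p-value suggests non-Normality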
Hypotheses:
H0: μ = μ0    < μ0 is a specified value for μ
H1: μ ≠ μ0    < Two-sided test
or
H0: μ = μ0    < μ0 is a specified value for μ
H1: μ < μ0    < One-sided test
^ Biological hypotheses are restated formally in a statistical test as statistical hypotheses. Statistical
hypotheses consist of a matched pair of hypotheses that together comprise all possible events (i.e.,
outcomes) in the sample space (i.e., the set of all possible outcomes - see Lecture Worksheet 05). In
other words, the probability of the Union of the hypotheses is exactly 1.0. The pair of hypotheses
consists of:
the null hypothesis H0 - a biologically "uninteresting" hypothesis, often indicating no
effect for treatments, random behavior, or otherwise non-biological results
the alternative hypothesis H1 - a biologically "interesting" hypothesis, perhaps indicating a
value or difference for a biological treatment, etc.
The general strategy of a statistical test is to use a probability distribution to determine whether
H0 is likely or unlikely. If unlikely, we can reject H0 and in turn accept H1. Acceptance of H1
would then be a statistical decision based on the fact that H1 is the only alternative hypothesis
presented in the test. Consideration of the biological interpretation of the test, and multiple possible
alternative explanations, comes later.
In some instances, statistical hypotheses are termed "two-sided" if two distinct possibilities are
implicit in H1. For instance, in the two-sided statement of hypotheses above, H1 says that μ < μ0
or μ > μ0. By contrast, the one-sided statement of hypotheses above allows for only one
possibility: μ < μ0.
Test Statistic:
t := (Xbar − μ0)/(s/√n)    < t is the normalized distance between means Xbar and μ0
^ A test statistic is a number calculated from the sample used in making a statistical decision
between H0 and H1. Test statistics are usually calculated so that one may consult a well-known
statistical distribution. In this case the value of the test statistic t will be compared with the
t-distribution to find P(t). Note that statistic t and the t-distribution are different things.
Sampling Distribution:
If the Assumptions hold and H0 is true, then t ~ t(n−1)
^ Test statistics X are carefully chosen to have probabilities P(X) and cumulative probabilities Φ(X) that
are understood. In this case, if H0 is true, then the t statistic is distributed according to Student's
t-distribution with (n−1) degrees of freedom.
Critical Value of the Test:
α := 0.05    < probability of Type I error must be explicitly set
^ See below for the definition of "Type I" or "α" error. This is a criterion for how
stringent the test will be. Stringency, however, is a tradeoff with "Type II" or "β" error, as
described below. Both types of error are dependent on the number of observations n.
C := qt(1 − α, n − 1)    < (for the two-sided rule below, use qt(1 − α/2, n − 1))
^ A probability P(X) is set above by α. From this, one needs to find the quantile, that is, an
X value for which one has P(X) = α under some probability distribution. Standard statistical
tables provide a way to find X from P(X), as do explicit functions built into modern statistical
software. In both cases one typically works with the cumulative probability function Φ(X).
In the example above, since t ~ t(n−1), we use the inverse cumulative t function qt() to find
the Critical Value C. Note: C is a quantile - a cut-off value of the test statistic t.
Decision Rule:
IF t > C, THEN REJECT H0; OTHERWISE ACCEPT H0    < One-sided case
IF |t| > C, THEN REJECT H0; OTHERWISE ACCEPT H0    < Two-sided case
^ The decision rule compares the calculated test statistic of the sample with the
critical value C. If the rule is determined to be true, then H0 is rejected, and the
alternative H1 accepted for statistical purposes. Of course, upon rejecting H0,
deciding whether H1 is the only viable biological hypothesis comes later.
Probability Value:
P := pt(t, n − 1)    < probability of finding test statistic t given the Assumptions and if H0 is true.
^ Although not part of the formal statistical test, it is common practice to provide a
probability value P(X) for the test statistic X calculated in the test, assuming H0 to be
true. In the case above, since t ~ t(n−1) and we have statistic t, we use the cumulative
probability function pt() to find P(t).
Common attributions for P:
IF P < 0.001             the result is deemed "very highly significant"
IF 0.001 ≤ P < 0.01      the result is deemed "highly significant"
IF 0.01 ≤ P < 0.05       the result is deemed "significant"
IF P ≥ 0.05              the result is deemed "not significant"
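As a closing illustration, here is a hedged R sketch that walks the One-Sample t-test example through the framework above (the data and μ0 are invented for illustration; t.test() is included only as a cross-check):
# One-Sample t-test following the formal logic above (two-sided case)
X <- rnorm(25, 52, 10)                        # sample; Assumptions: random sample, ~Normal
mu0 <- 50                                     # H0: mu = mu0    H1: mu != mu0
n <- length(X)
t.stat <- (mean(X) - mu0)/(sd(X)/sqrt(n))     # Test Statistic
alpha <- 0.05                                 # probability of Type I error
C <- qt(1 - alpha/2, n - 1)                   # Critical Value (two-sided cut-off)
abs(t.stat) > C                               # Decision Rule: TRUE => reject H0
2*(1 - pt(abs(t.stat), n - 1))                # two-sided Probability Value
t.test(X, mu = mu0)                           # built-in cross-check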