statistics for management and economics study notes 4

14. Analysis of Variance

14.1 One-way Analysis of Variance

The analysis of variance is a procedure that tests whether differences exist between two or more population means. One-way analysis of variance is the procedure to apply when the samples are independently drawn.

$H_{0}$: $\mu_{1} = \mu_{2} = \cdots = \mu_{k}$
$H_{1}$: at least two means differ

The statistic that measures the proximity of the sample means to each other is called the between-treatments variation; it is denoted SST, which stands for sum of squares for treatments.

\[SST = \sum_{j=1}^k n_{j}(\bar x_{j} - \bar{\bar x})^2\]

\[\bar{\bar x} =\frac{\sum_{j=1}^k \sum_{i=1}^{n_{j}} x_{ij}}{n}\]

\[n = n_{1} + n_{2} + \cdots + n_{k}\]

\[\bar x_{j} = \frac{\sum_{i=1}^{n_{j}}x_{ij}}{n_{j}}\]

The variation within the samples is measured by the within-treatments variation, which is denoted by SSE (sum of squares for error). The within-treatments variation provides a measure of the amount of variation in the response variable that is not caused by the treatments.

\[SSE = \sum_{j=1}^k \sum_{i=1}^{n_{j}}(x_{ij} - \bar x_{j})^2\]

\[SSE = (n_{1}-1)s_{1}^2 + (n_{2}-1)s_{2}^2 + \cdots + (n_{k}-1)s_{k}^2\]

The mean square for treatments is computed by dividing SST by the number of treatments minus 1.

\[MST = \frac{SST}{k-1}\]

The mean square for error is determined by dividing SSE by the total sample size (labeled n) minus the number of treatments.

\[MSE = \frac{SSE}{n-k}\]

Finally, the test statistic is defined as the ratio of the two mean squares.

\[F = \frac{MST}{MSE}\]

The test statistic is F-distributed with k − 1 and n − k degrees of freedom, provided that the response variable is normally distributed. We reject the null hypothesis only if

\[F > F_{\alpha, k-1, n-k}\]

The total variation of all the data is denoted SS(Total):

\[SS(Total) = SST + SSE = \sum_{j=1}^k \sum_{i=1}^{n_{j}}(x_{ij} - \bar{\bar x})^2\]

ANOVA Table for the One-Way Analysis of Variance:

SOURCE OF VARIATION	DEGREES OF FREEDOM	SUMS OF SQUARES	MEAN SQUARES	F-STATISTIC
Treatments	k − 1	SST	MST = SST/ (k − 1)	F = MST/MSE
Error	n − k	SSE	MSE = SSE/ (n − k)
Total	n − 1	SS(Total)
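The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a statistics package: the function name `one_way_anova` and the three small samples are hypothetical, and the p-value is left out because the standard library has no F distribution.

```python
from statistics import mean

def one_way_anova(groups):
    """Return SST, SSE, MST, MSE and F for a list of samples (one list per treatment)."""
    k = len(groups)                      # number of treatments
    n = sum(len(g) for g in groups)      # total sample size
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-treatments variation
    sst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-treatments variation
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return {"SST": sst, "SSE": sse, "MST": mst, "MSE": mse, "F": mst / mse}

# Hypothetical data: three treatments with three observations each
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

For these made-up samples SST is 26 and SSE is 6, so SS(Total) is 32 and F = 13 with 2 and 6 degrees of freedom.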

Example: a financial analyst randomly sampled 366 American households and asked each to report the age category of the head of the household and the proportion of its financial assets that are invested in the stock market. The age categories are Young (less than 35), Early middle age (35 to 49), Late middle age (50 to 65), Senior (older than 65). The analyst was particularly interested in determining whether the ownership of stocks varied by age.

SOURCE OF VARIATION	DEGREES OF FREEDOM	SUMS OF SQUARES	MEAN SQUARES	F-STATISTIC	P
Treatments	3	3741.4	1247.12	2.79	0.0405
Error	362	161871.0	447.16
Total	365	165612.4

Interpret: The value of the test statistic is F = 2.79, and its p-value is .0405, which means there is evidence to infer that the percentage of total assets invested in stocks differs in at least two of the age categories.

14.1.1 Can We Use the t-Test of the Difference between Two Means Instead of the Analysis of Variance?

There are two reasons why we don’t use multiple t-tests instead of one F-test. First, we would have to perform many more calculations. Second, and more important, conducting multiple tests increases the probability of making Type I errors.
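To see how quickly the Type I error probability grows, the sketch below computes the chance of at least one false rejection when every pair of k means is tested separately. It assumes the individual tests are independent, which is a simplification (pairwise t-tests on the same samples are not), so the result is only a rough guide. The function name is hypothetical.

```python
from math import comb

def experimentwise_error_rate(k, alpha):
    """P(at least one Type I error) over all pairwise tests, assuming independent tests."""
    num_tests = comb(k, 2)               # number of pairs among k means
    return 1 - (1 - alpha) ** num_tests

# Four populations tested pairwise at the 5% level: six tests
print(round(experimentwise_error_rate(4, 0.05), 4))
```

With four populations and a 5% level for each test, the probability of at least one Type I error is roughly 26% under this assumption.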

14.1.2 Can We Use the Analysis of Variance Instead of the t-Test of $\mu_{1} − \mu_{2}$?

If we want to determine whether $\mu_{1}$ is greater than $\mu_{2}$ (or vice versa), we cannot use the analysis of variance because this technique allows us to test for a difference only. Thus, if we want to test to determine whether one population mean exceeds the other, we must use the t-test of $\mu_{1} − \mu_{2}$ (with $\sigma_{1}^2=\sigma_{2}^2$). Moreover, the analysis of variance requires that the population variances are equal. If they are not, we must use the unequal variances test statistic.

14.2 Multiple Comparisons

Bonferroni adjustment:

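\[\alpha = \frac{\alpha_{E}}{C}\]

$\alpha_{E}$, the probability of making at least one Type I error across all the comparisons, is called the experimentwise Type I error rate. $C$ is the number of pairwise comparisons, $C = k(k-1)/2$, and $\alpha$ is the significance level used for each individual comparison.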
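As a small worked example of the Bonferroni adjustment, the sketch below divides the experimentwise rate by the number of pairwise comparisons among k treatments. The function name and the choice of k = 4 are illustrative assumptions.

```python
from math import comb

def bonferroni_alpha(k, alpha_e):
    """Per-comparison significance level for all pairwise comparisons of k means."""
    num_comparisons = comb(k, 2)         # k(k-1)/2 pairwise comparisons
    return alpha_e / num_comparisons

# Four treatments and an experimentwise rate of 5%: six comparisons
print(round(bonferroni_alpha(4, 0.05), 4))
```

Each of the six comparisons would then be tested at a level of about 0.0083.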

14.3 Analysis of Variance Experimental Designs

14.3.1 Single-Factor and Multifactor Experimental Designs

A single-factor analysis of variance addresses the problem of comparing two or more populations defined on the basis of only one factor. A multifactor experiment is one in which two or more factors define the treatments.

The example in 14.1 is a single-factor design because we had one factor: age of the head of the household. Suppose that we can also look at the gender of the household head in another study. We would then develop a two-factor analysis of variance in which the first factor, age, has four levels, and the second factor, gender, has two levels.

14.3.2 Independent Samples and Blocks

When the problem objective is to compare more than two populations, the experimental design that is the counterpart of the matched pairs experiment is called the randomized block design. The term block refers to a matched group of observations from each population. The randomized block experiment is also called the two-way analysis of variance.

A second way to form blocks is to take repeated measurements on the same subjects. For example, we can determine whether sleeping pills are effective by giving three brands of pills to the same group of people and measuring the effects. Such experiments are called repeated measures designs.

The data are analyzed in the same way for both designs.

14.3.3 Fixed and Random Effects

If our analysis includes all possible levels of a factor, the technique is called a fixed effects analysis of variance. If the levels included in the study represent a random sample of all the levels that exist, the technique is called a random-effects analysis of variance.

14.4 Randomized Block (Two-Way) Analysis of Variance

The purpose of designing a randomized block experiment is to reduce the within-treatments variation to more easily detect differences between the treatment means. In the one-way analysis of variance, we partitioned the total variation into the between-treatments and the within-treatments variation; that is,

\[SS(Total) = SST + SSE\]

In the randomized block design of the analysis of variance, we partition the total variation into three sources of variation:

\[SS(Total) = SST + SSB + SSE\]

where SSB, the sum of squares for blocks, measures the variation between the blocks.

BLOCK	1	2	…	k	Block Mean
1	$x_{11}$	$x_{12}$	…	$x_{1k}$	$\bar x[B]_{1}$
2	$x_{21}$	$x_{22}$	…	$x_{2k}$	$\bar x[B]_{2}$
$\vdots$	$\vdots$	$\vdots$		$\vdots$	$\vdots$
b	$x_{b1}$	$x_{b2}$	…	$x_{bk}$	$\bar x[B]_{b}$
Treatment Mean	$\bar x[T]_{1}$	$\bar x[T]_{2}$	…	$\bar x[T]_{k}$

Sums of Squares in the Randomized Block Experiment:

\[SS(Total) = \sum_{j=1}^k \sum_{i=1}^b (x_{ij} - \bar{\bar x})^2\] \[SST = \sum_{j=1}^k b(\bar x[T]_{j} - \bar{\bar x})^2\] \[SSB = \sum_{i=1}^b k(\bar x[B]_{i} - \bar{\bar x})^2\] \[SSE = \sum_{j=1}^k \sum_{i=1}^b (x_{ij} - \bar x[T]_{j} - \bar x[B]_{i} + \bar{\bar x})^2\]

Mean Squares for the Randomized Block Experiment:

\[MST = \frac{SST}{k-1}\] \[MSB = \frac{SSB}{b-1}\] \[MSE = \frac{SSE}{n-k-b+1}\]

Test Statistic for the Randomized Block Experiment

\[F = \frac{MST}{MSE}\]

which is F-distributed with ν1 = k − 1 and ν2 = n − k − b + 1 degrees of freedom.

ANOVA Table for the Randomized Block Analysis of Variance

SOURCE OF VARIATION	DEGREES OF FREEDOM	SUMS OF SQUARES	MEAN SQUARES	F-STATISTIC
Treatments	k − 1	SST	MST = SST / (k − 1)	F = MST/MSE
Blocks	b - 1	SSB	MSB = SSB / (b - 1)	F = MSB/MSE
Error	n − k - b + 1	SSE	MSE = SSE / (n − k - b + 1)
Total	n − 1	SS(Total)
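The randomized block sums of squares can be sketched the same way. The function below is an illustration under stated assumptions: the name `randomized_block_anova` and the 3 × 3 data set are hypothetical, rows are blocks and columns are treatments, and the error degrees of freedom are n − k − b + 1 as in the ANOVA table above.

```python
def randomized_block_anova(x):
    """x[i][j] is the observation for block i and treatment j."""
    b, k = len(x), len(x[0])
    n = b * k
    grand = sum(sum(row) for row in x) / n
    block_means = [sum(row) / k for row in x]
    treat_means = [sum(x[i][j] for i in range(b)) / b for j in range(k)]
    sst = b * sum((m - grand) ** 2 for m in treat_means)
    ssb = k * sum((m - grand) ** 2 for m in block_means)
    sse = sum((x[i][j] - treat_means[j] - block_means[i] + grand) ** 2
              for i in range(b) for j in range(k))
    mse = sse / (n - k - b + 1)
    return {"SST": sst, "SSB": ssb, "SSE": sse,
            "F_treatments": (sst / (k - 1)) / mse,
            "F_blocks": (ssb / (b - 1)) / mse}

# Hypothetical data: three blocks (rows) by three treatments (columns)
table = randomized_block_anova([[1, 2, 3], [2, 4, 3], [3, 3, 6]])
print(table)
```

For this made-up table SST = 6, SSB = 6 and SSE = 4, which add up to SS(Total) = 16, and both F-statistics equal 3.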

Example: A company selected 25 groups of four men, each of whom had cholesterol levels in excess of 280. In each group, the men were matched according to age and weight. Four drugs were administered over a 2-month period, and the reduction in cholesterol was recorded. Do these results allow the company to conclude that differences exist between the four drugs?

SOURCE OF VARIATION	DEGREES OF FREEDOM	SUMS OF SQUARES	MEAN SQUARES	F-STATISTIC	P
Drug	3	196.0	65.3	4.12	0.009
Group	24	3848.7	160.4	10.11	0.000
Error	72	1142.6	15.9
Total	99	5187.2

Interpret: With F = 4.12 and a p-value of 0.009 for the drugs, we conclude that there is sufficient evidence to infer that at least two of the drugs differ.

14.5 Two-Factor Analysis of Variance

The general term for an experiment that features two or more factors is factorial experiment. In factorial experiments, we can examine the effect on the response variable of two or more factors. We will present the technique for fixed effects only. That means we will address problems where all the levels of the factors are included in the experiment.

Example: As part of a study on job tenure, a survey was conducted in which Americans aged between 37 and 45 were asked how many jobs they have held in their lifetimes. Also recorded were gender and educational attainment. The categories are E1, E2, E3 and E4. Can we infer that differences exist between genders and educational levels?

$H_{0}$: $\mu_{1} = \mu_{2} = \mu_{3} = \mu_{4} = \mu_{5} = \mu_{6} = \mu_{7} = \mu_{8}$
$H_{1}$: At least two means differ

Summary:

Groups	Count	Sum	Average	Variance
Male E1	10	126	12.60	8.27
Male E2	10	110	11.00	8.67
Male E3	10	106	10.60	11.60
Male E4	10	90	9.00	5.33
Female E1	10	115	11.50	8.28
Female E2	10	112	11.20	9.73
Female E3	10	94	9.40	16.49
Female E4	10	81	8.10	12.32

One-way ANOVA:

SOURCE OF VARIATION	DEGREES OF FREEDOM	SUMS OF SQUARES	MEAN SQUARES	F-STATISTIC	P
Between Groups	7	153.35	21.91	2.17	0.0467
Within Groups	72	726.20	10.09
Total	79	879.55

Interpret: The value of the test statistic is F = 2.17 with a p-value of .0467. We conclude that there are differences in the number of jobs between the eight treatments.

This statistical result raises more questions—namely, can we conclude that the differences in the mean number of jobs are caused by differences between males and females? Or are they caused by differences between educational levels? Or, perhaps, are there combinations, called interactions, of gender and education that result in especially high or low numbers?

A complete factorial experiment is an experiment in which the data for all possible combinations of the levels of the factors are gathered. That means that in the above example we measured the number of jobs for all eight combinations. This experiment is called a complete 2 × 4 factorial experiment. In general, we will refer to one of the factors as factor A (arbitrarily chosen). The number of levels of this factor will be denoted by a. The other factor is called factor B, and its number of levels is denoted by b. The number of observations for each combination is called a replicate. The number of replicates is denoted by r. We address only problems in which the number of replicates is the same for each treatment. Such a design is called balanced.

$x_{ijk}$ = $k$th observation in the $ij$th treatment
$\bar x[AB]_{ij}=$ mean of the treatment when the factor A level is i and the factor B level is j
$\bar x[A]_{i}=$ Mean of the observations when the factor A level is i
$\bar x[B]_{j}=$ Mean of the observations when the factor B level is j
$\bar{\bar x}=$ Mean of all the observations
a = Number of factor A levels
b = Number of factor B levels
r = Number of replicates

\[SS(Total) = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^r (x_{ijk} - \bar{\bar x})^2\] \[SS(A) = rb \sum_{i=1}^a (\bar x[A]_{i} - \bar{\bar x})^2\] \[SS(B) = ra \sum_{j=1}^b (\bar x[B]_{j} - \bar{\bar x})^2\] \[SS(AB) = r \sum_{i=1}^a \sum_{j=1}^b (\bar x[AB]_{ij} - \bar x[A]_{i} - \bar x[B]_{j} + \bar{\bar x})^2\] \[SSE = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^r (x_{ijk} - \bar x[AB]_{ij})^2\]

$\nu_{SS(A)} = a -1$
$\nu_{SS(B)} = b -1$
$\nu_{SS(AB)} = (a -1)(b-1)$
$\nu_{SSE} = n - ab$

F-Tests Conducted in Two-Factor Analysis of Variance
Test for Differences between the Levels of Factor A
$H_{0}$: The means of the a levels of factor A are equal
$H_{1}$: At least two means differ
Test for Differences between the Levels of Factor B
$H_{0}$: The means of the b levels of factor B are equal
$H_{1}$: At least two means differ
Test for Interaction between Factors A and B
$H_{0}$: Factors A and B do not interact to affect the mean responses
$H_{1}$: Factors A and B do interact to affect the mean responses

Required Conditions
* The response is normally distributed.
* The variance for each treatment is identical.
* The samples are independent.

ANOVA Table for the Two-Factor Experiment:

SOURCE OF VARIATION	DEGREES OF FREEDOM	SUMS OF SQUARES	MEAN SQUARES	F-STATISTIC
Factor A	a-1	SS(A)	MS(A)	MS(A)/MSE
Factor B	b-1	SS(B)	MS(B)	MS(B)/MSE
Interaction	(a-1)(b-1)	SS(AB)	MS(AB)	MS(AB)/MSE
Error	n - ab	SSE	MSE
Total	n -1	SS(Total)

Two-way ANOVA: Jobs versus Gender, Education

SOURCE OF VARIATION	DEGREES OF FREEDOM	SUMS OF SQUARES	MEAN SQUARES	F-STATISTIC	P
Gender	1	11.25	11.25	1.12	0.294
Education	3	135.85	45.28	4.49	0.006
Interaction	3	6.25	2.08	0.21	0.892
Error	72	726.20	10.09
Total	79	879.55

Interpret: There is no evidence at the 5% significance level to infer that differences in the number of jobs exist between men and women. There is sufficient evidence at the 5% significance level to infer that differences in the number of jobs exist between educational levels. There is not enough evidence to conclude that there is an interaction between gender and education.
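The sums of squares in the two-way table can be recomputed from the group summary (counts, sums and variances) listed earlier in this example. The sketch below does that in Python; the function name `two_factor_ss` is hypothetical, and SSE is rebuilt from the rounded variances, so it agrees with the table only up to rounding.

```python
# Cell sums and variances from the summary table
# (rows: Male, Female; columns: E1..E4), r = 10 observations per cell
cell_sums = [[126, 110, 106, 90], [115, 112, 94, 81]]
cell_vars = [[8.27, 8.67, 11.60, 5.33], [8.28, 9.73, 16.49, 12.32]]
r = 10

def two_factor_ss(cell_sums, cell_vars, r):
    a, b = len(cell_sums), len(cell_sums[0])
    cell_means = [[s / r for s in row] for row in cell_sums]
    grand = sum(map(sum, cell_sums)) / (a * b * r)
    a_means = [sum(row) / b for row in cell_means]
    b_means = [sum(cell_means[i][j] for i in range(a)) / a for j in range(b)]
    ss_a = r * b * sum((m - grand) ** 2 for m in a_means)
    ss_b = r * a * sum((m - grand) ** 2 for m in b_means)
    ss_ab = r * sum((cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
                    for i in range(a) for j in range(b))
    # Each cell contributes (r - 1) times its sample variance to SSE
    sse = (r - 1) * sum(map(sum, cell_vars))
    return ss_a, ss_b, ss_ab, sse

ss_a, ss_b, ss_ab, sse = two_factor_ss(cell_sums, cell_vars, r)
mse = sse / (2 * 4 * (r - 1))            # n - ab = 72 degrees of freedom
print(ss_a, ss_b, ss_ab, sse)
print(ss_a / 1 / mse, ss_b / 3 / mse, ss_ab / 3 / mse)
```

The results match the table: SS(Gender) = 11.25, SS(Education) = 135.85, SS(Interaction) = 6.25, and SSE is about 726.2. The three sums of squares for the factors and the interaction add up to the between-groups sum of squares of 153.35 from the one-way analysis.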

Order of Testing in the Two-Factor Analysis of Variance: Test for interaction first. If there is enough evidence to infer that there is interaction, do not conduct the other tests. If there is not enough evidence to conclude that there is interaction, proceed to conduct the F-tests for factors A and B.

statistics for management and economics study notes 3

9. Sampling Distributions

9.1 Sampling Distribution of the Mean

Central Limit Theorem: The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of $\bar X$ will resemble a normal distribution.

\[\mu_{\bar x} = \mu\]

\[\sigma_{\bar x}^2 = \frac{\sigma^2}{n}\]

If X is normal, then $\bar X$ is normal. If X is nonnormal, then $\bar X$ is approximately normal for sufficiently large sample sizes. The definition of “sufficiently large” depends on the extent of nonnormality of X.

Standardizing the sample mean:

\[Z = \frac{\bar X - \mu}{\sigma / \sqrt{n}}\]
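As a quick illustration of standardizing the sample mean, the sketch below computes the probability that $\bar X$ exceeds a given value. The population values (mean 100, standard deviation 15, sample size 36) and the function name are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_exceeds(value, mu, sigma, n):
    """Return (z, P(sample mean > value)) using the normal sampling distribution."""
    z = (value - mu) / (sigma / sqrt(n))
    return z, 1 - NormalDist().cdf(z)

# Hypothetical population: mu = 100, sigma = 15, samples of size 36
z, p = prob_sample_mean_exceeds(104, mu=100, sigma=15, n=36)
print(round(z, 2), round(p, 4))
```

Here the standard error is 15/6 = 2.5, so z = 1.6 and the probability is about 0.0548.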

9.2 Sampling Distribution of a Sample Proportion

$\hat P$ is approximately normally distributed provided that np and n(1 − p) are greater than or equal to 5.

\[E(\hat P) = p\]

\[V(\hat P) = \sigma_{\hat p}^2 = \frac{p(1-p)}{n}\]

Standardizing the sample proportion:

\[Z = \frac{\hat P - p}{\sqrt{p(1-p)/n}}\]

9.3 Sampling Distribution of the Difference between Two Means

\[E(\bar X_{1} - \bar X_{2}) = \mu_{\bar x_{1} - \bar x_{2}} = \mu_{1} - \mu_{2}\]

\[V(\bar X_{1} - \bar X_{2}) = \sigma_{\bar x_{1} - \bar x_{2}}^2 = \frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}\]

Standardizing the difference between two sample means:

\[Z = \frac{(\bar X_{1} - \bar X_{2}) - (\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}}}\]

10. Introduction to Estimation

An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter.
An unbiased estimator is said to be consistent if the difference between the estimator and the parameter grows smaller as the sample size grows larger.
If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to be relatively more efficient.

10.1 Estimating the Population Mean When the Population Standard Deviation is Known

\[\bar x \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\]

10.2 Determining the Sample Size to Estimate $\mu$

\[n = (\frac{z_{\alpha/2}\sigma}{B})^2\]

\[B = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\]

B stands for the bound on the error of estimation.
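The sample size formula can be checked with a short sketch. The inputs (σ = 20, B = 5, 95% confidence) are hypothetical, and the result is rounded up because a sample size must be a whole number.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, bound, confidence=0.95):
    """Smallest n such that z_{alpha/2} * sigma / sqrt(n) is at most the bound B."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / bound) ** 2)

# Hypothetical inputs: sigma = 20, bound B = 5, 95% confidence
print(sample_size_for_mean(20, 5))
```

With these inputs the formula gives about 61.5, so 62 observations are needed.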

11. Introduction to Hypothesis Testing

11.1 Concepts of Hypothesis Testing

The null hypothesis usually refers to a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. $H_{0}$
The alternative hypothesis (or maintained hypothesis or research hypothesis) refers to the hypothesis to be accepted if the null hypothesis is rejected. $H_{1}$
A Type I error occurs when we reject a true null hypothesis. $\alpha$
A Type II error is defined as not rejecting a false null hypothesis. $\beta$
The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed given that the null hypothesis is true.
If we reject the null hypothesis, we conclude that there is enough statistical evidence to infer that the alternative hypothesis is true.
If we do not reject the null hypothesis, we conclude that there is not enough statistical evidence to infer that the alternative hypothesis is true.

11.2 Testing the Population Mean When the Population Standard Deviation is Known

A two-tail test is conducted whenever the alternative hypothesis specifies that the mean is not equal to the value stated in the null hypothesis.
A one-tail test that focuses on the right tail of the sampling distribution is conducted whenever we want to know whether there is enough evidence to infer that the mean is greater than the quantity specified by the null hypothesis.
A one-tail test that focuses on the left tail of the sampling distribution is conducted whenever we want to know whether there is enough evidence to infer that the mean is less than the quantity specified by the null hypothesis.

11.2.1 Standardized Test Statistic

\[z = \frac{\bar x - \mu}{\sigma / \sqrt{n}}\]

The rejection region for a two-tail test is either of the following:

\[z > z_{\alpha / 2}\]

\[z < - z_{\alpha / 2}\]

11.2.2 Testing Hypotheses and Confidence Interval Estimators

\[\bar x \pm z_{\alpha / 2}\frac{\sigma}{\sqrt{n}}\]

For a two-tail test, we can compute the interval estimate and determine whether the hypothesized value of the mean falls into the interval. If it does not, we reject the null hypothesis.

11.3 Calculating the Probability of a Type II Error

Example: A random sample of 400 monthly accounts is drawn, for which the sample mean is $178. The accounts are approximately normally distributed with a standard deviation of $65. Can we conclude that the mean is greater than $170 at the $\alpha$ = 5% significance level?

$H_{0}$: $\mu \le 170$

$H_{1}$: $\mu \gt 170$

$\frac{\bar x_{L} - 170}{65/\sqrt{400}} = 1.645$

$\bar x_{L} = 175.34$

Therefore, the rejection region is:

$\bar x \gt 175.34$

The sample mean was computed to be 178. Because the test statistic (sample mean) is in the rejection region (it is greater than 175.34), we reject the null hypothesis. Thus, there is sufficient evidence to infer that the mean monthly account is greater than $170.

$\beta = P(\bar X \lt 175.34$, given that the null hypothesis is false )

Suppose that the true mean account is $180.

$\beta = P(\bar X \lt 175.34$, given that $\mu = 180)$

$\beta = P(\frac{\bar X - \mu}{\sigma / \sqrt{n}} < \frac{175.34-180}{65/\sqrt{400}}) = P(Z \lt - 1.43) = 0.0764$
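The calculation of β can be reproduced in Python using the numbers from this example. The function name is hypothetical. The code uses the unrounded critical value $z_{.05} \approx 1.6449$, so it gives about 0.0761 rather than the 0.0764 obtained above with z rounded to −1.43.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error_right_tail(mu0, mu_true, sigma, n, alpha):
    """Return (critical sample mean, beta) for a right-tail z-test of H0: mu <= mu0."""
    se = sigma / sqrt(n)
    x_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # beta = P(sample mean < critical value) when the true mean is mu_true
    beta = NormalDist(mu_true, se).cdf(x_crit)
    return x_crit, beta

x_crit, beta = type_ii_error_right_tail(170, 180, 65, 400, 0.05)
print(round(x_crit, 2), round(beta, 4), round(1 - beta, 4))
```

The last printed value, 1 − β, is the probability of correctly rejecting the null hypothesis when the true mean is 180.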

There is an inverse relationship between the probabilities of Type I and Type II errors: decreasing $\alpha$ increases $\beta$, and vice versa. Unfortunately, there is no simple formula to determine what the significance level should be.

11.4 Larger Sample Size Equals More Information Equals Better Decisions

11.5 Power of a Test

The power of a test is the probability that it leads us to reject the null hypothesis when it is false. Thus, the power of a test is 1 − β.

12. Inference About a Population

12.1 Inference about a Population Mean When the Population Standard Deviation is Unknown

When the population standard deviation is unknown and the population is normal, the test statistic for testing hypotheses about μ is

\[t = \frac{\bar x - \mu}{s/\sqrt{n}}\]

which is Student t-distributed with ν = n − 1 degrees of freedom.

Confidence Interval Estimator of μ When σ Is Unknown

\[\bar x \pm t_{\alpha/2}\frac{s}{\sqrt{n}}\]

12.2 Inference about a Population Variance

The test statistic used to test hypotheses about $\sigma^2$ is

\[\chi^2 = \frac{(n-1)s^2}{\sigma^2}\]

which is chi-squared distributed with ν = n − 1 degrees of freedom when the population random variable is normally distributed with variance equal to $\sigma^2$.

Confidence Interval Estimator of $\sigma^2$

Lower confidence limit (LCL) = $\frac{(n-1)s^2}{\chi_{\alpha /2}^2}$

Upper confidence limit (UCL) = $\frac{(n-1)s^2}{\chi_{1-\alpha /2}^2}$

12.3 Inference about a Population Proportion

\[\hat p = \frac{x}{n}\]

Test Statistic for p

\[z = \frac{\hat P - p}{\sqrt{p(1-p)/n}}\]

which is approximately normal when np and n(1 − p) are greater than 5.

Confidence Interval Estimator of p

\[\hat p \pm z_{\alpha /2} \sqrt{\hat p (1 - \hat p)/n}\]

Sample Size to Estimate a Proportion

\[n = (\frac{z_{\alpha /2}\sqrt{\hat p (1-\hat p)}}{B})^2\]

\[B = z_{\alpha /2} \sqrt{\frac{\hat p (1-\hat p)}{n}}\]
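A short sketch of the sample size formula for a proportion follows. The inputs are hypothetical: a planning value of $\hat p = 0.5$ (the most conservative choice, since it maximizes $\hat p(1-\hat p)$), a bound of 0.03 and 95% confidence.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_for_proportion(p_hat, bound, confidence=0.95):
    """Smallest n such that z_{alpha/2} * sqrt(p_hat(1 - p_hat)/n) is at most B."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sqrt(p_hat * (1 - p_hat)) / bound) ** 2)

# Hypothetical inputs: planning value 0.5, bound B = 0.03, 95% confidence
print(sample_size_for_proportion(0.5, 0.03))
```

The formula gives about 1067.1, which rounds up to 1068 observations.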

13. Inference about Comparing Two Populations

13.1 Inference about the Difference between two Means: Independent Samples

Sampling Distribution of $\bar x_{1} - \bar x_{2}$:

$\bar x_{1} - \bar x_{2}$ is normally distributed if the populations are normal and approximately normal if the populations are nonnormal and the sample sizes are large.
\[E( \bar x_{1} - \bar x_{2} ) = \mu_{1} - \mu_{2}\] \[V( \bar x_{1} - \bar x_{2} ) = \frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}\] \[Z = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}}}\]

13.1.1 Test Statistic for $\mu_{1} - \mu_{2}$ when $\sigma_{1}^2 = \sigma_{2}^2$

\[t = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{s_{p}^2(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}\]

where $s_{p}^2$ is called the pooled variance estimator:

\[s_{p}^2 = \frac{(n_{1} -1)s_{1}^2 + (n_{2} -1)s_{2}^2}{n_{1} + n_{2} - 2}\]

13.1.2 Confidence Interval Estimator of $\mu_{1} - \mu_{2}$ when $\sigma_{1}^2 = \sigma_{2}^2$

\[(\bar x_{1} - \bar x_{2}) \pm t_{\alpha /2}\sqrt{s_{p}^2(\frac{1}{n_{1}} + \frac{1}{n_{2}})}\]

13.1.3 Test Statistic for $\mu_{1} - \mu_{2}$ when $\sigma_{1}^2 \ne \sigma_{2}^2$

\[t = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}}\]

\[\nu = \frac{(s_{1}^2/n_{1} + s_{2}^2/n_{2})^2}{\frac{(s_{1}^2/n_{1})^2}{n_{1}-1} + \frac{(s_{2}^2/n_{2})^2}{n_{2}-1}}\]
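The two variance-related quantities in this section, the pooled variance estimator of 13.1.1 and the degrees of freedom for the unequal-variances test, are easy to compute directly. The sketch below uses hypothetical sample statistics ($s_{1}^2 = 4$, $n_{1} = 10$, $s_{2}^2 = 9$, $n_{2} = 15$), and the function names are illustrative.

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled variance estimator used when the population variances are equal."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def unequal_variances_df(s1_sq, n1, s2_sq, n2):
    """Degrees of freedom for the unequal-variances t-statistic."""
    a, b = s1_sq / n1, s2_sq / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Hypothetical sample statistics
print(round(pooled_variance(4, 10, 9, 15), 4))
print(round(unequal_variances_df(4, 10, 9, 15), 4))
```

For these values the pooled variance is about 7.04 and the degrees of freedom are about 22.99, slightly below the $n_{1} + n_{2} - 2 = 23$ used by the equal-variances test.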

13.1.4 Confidence Interval Estimator of $\mu_{1} - \mu_{2}$ when $\sigma_{1}^2 \ne \sigma_{2}^2$

\[(\bar x_{1} - \bar x_{2}) \pm t_{\alpha /2}\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}\]

13.1.5 Testing the Population Variances

$H_{0}$: $\frac{\sigma_{1}^2}{\sigma_{2}^2} = 1$
$H_{1}$: $\frac{\sigma_{1}^2}{\sigma_{2}^2} \ne 1$

\[F = \frac{s_{1}^2}{s_{2}^2}\]

$\nu_{1} = n_{1} - 1$ and $\nu_{2} = n_{2} - 1$. This is a two-tail test so that the rejection region is $F \gt F_{\alpha/2, \nu_{1},\nu_{2}}$ or $F \lt F_{1-\alpha/2, \nu_{1},\nu_{2}}$.

Confidence Interval Estimator of $\sigma_{1}^2/\sigma_{2}^2$

\[LCL = \frac{s_{1}^2}{s_{2}^2} \frac{1}{F_{\alpha/2,\nu_{1},\nu_{2}}}\] \[UCL = \frac{s_{1}^2}{s_{2}^2} F_{\alpha/2,\nu_{2},\nu_{1}}\]

13.2 Inference about the Difference between two Means: Matched Pairs Experiment

$\mu_{D}$ is the mean of the population of differences.

Test Statistic for $\mu_{D}$

\[t = \frac{\bar x_{D} - \mu_{D}}{s_{D}/\sqrt{n_{D}}}\]

which is Student t distributed with $\nu = n_{D} - 1$ degrees of freedom, provided that the differences are normally distributed.

Confidence Interval Estimator of $\mu_{D}$

\[\bar x_{D} \pm t_{\alpha/2}\frac{s_{D}}{\sqrt{n_{D}}}\]

13.3 Inference about the Difference between two Population Proportions

The statistic $\hat p_{1} − \hat p_{2}$ is approximately normally distributed provided that the sample sizes are large enough so that $n_{1}p_{1}$, $n_{1}(1-p_{1})$, $n_{2}p_{2}$, and $n_{2}(1-p_{2})$ are all greater than or equal to 5.

\[E(\hat p_{1} − \hat p_{2}) = p_{1} − p_{2}\]

\[V(\hat p_{1} − \hat p_{2}) = \frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}\]

\[Z = \frac{(\hat p_{1} − \hat p_{2}) - (p_{1} − p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}}}\]

\[\hat p_{1} = \frac{x_{1}}{n_{1}}\] \[\hat p_{2} = \frac{x_{2}}{n_{2}}\]

13.3.1 Test Statistic for $p_{1} − p_{2}$: Case 1

$H_{0}$: $p_{1} − p_{2} = 0$

\[z = \frac{\hat p_{1} − \hat p_{2}}{\sqrt{\hat p(1-\hat p)(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}\]

\[\hat p = \frac{x_{1} + x_{2}}{n_{1} + n_{2}}\]

13.3.2 Test Statistic for $p_{1} − p_{2}$: Case 2

$H_{0}$: $p_{1} − p_{2} = D, D\ne0$

\[z = \frac{(\hat p_{1} − \hat p_{2}) - D}{\sqrt{\frac{\hat p_{1}(1-\hat p_{1})}{n_{1}} + \frac{\hat p_{2}(1-\hat p_{2})}{n_{2}}}}\]
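Both test statistics can be sketched as follows. The counts (60 successes out of 200 versus 40 out of 200) and the value D = 0.05 are hypothetical, as are the function names.

```python
from math import sqrt

def z_equal_proportions(x1, n1, x2, n2):
    """Case 1: H0 is p1 - p2 = 0, so the pooled proportion estimates the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

def z_difference(x1, n1, x2, n2, d):
    """Case 2: H0 is p1 - p2 = D with D nonzero, so each sample proportion is used separately."""
    p1, p2 = x1 / n1, x2 / n2
    return ((p1 - p2) - d) / sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical counts: 60 of 200 versus 40 of 200
z1 = z_equal_proportions(60, 200, 40, 200)
z2 = z_difference(60, 200, 40, 200, d=0.05)
print(round(z1, 4), round(z2, 4))
```

The two cases differ only in how the standard error is estimated: Case 1 pools the samples because the null hypothesis says the proportions are equal, while Case 2 cannot pool.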

ImageJ introduction

ImageJ is a public domain Java image processing program inspired by NIH Image for the Macintosh. It runs, either as an online applet or as a downloadable application, on any computer with a Java 1.4 or later virtual machine. Downloadable distributions are available for Windows, Mac OS, Mac OS X and Linux.

DPI

By default, the DPI in the JPEG header is set to 72. For a higher value, use a unit of “inch” in the Analyze -> Set Scale dialog (requires v1.40 or later). For example, setting “Distance in Pixels” to 300, “Known Distance” to 1 and “Unit of Length” to “inch” will set the DPI to 300.

The DPI setting is kept only when the image is saved in TIFF format. Saving as PNG resets the DPI to 72.
Note that ImageJ does not read or write the resolution for JPG files, only that of TIFFs.

Color Spaces

Grayscale: The simplest color representation has no color at all, just black, white, and shades of gray.
RGB: red, green, and blue. RGB is an additive color model — the desired color is created by adding together different amounts of red, green, and blue light.
CMYK: Another way to add color to an image is to subtract it. In subtractive color models, each channel represents a pigment absorbing a certain color. CMYK color represents a common color printing process, with cyan, magenta, yellow, and black inks (the K stands for “key”).

Use the Image -> Type submenu to determine the type of the active image or to convert it to another type.

8-bit. Converts to 8-bit grayscale.
8-bit Color. Converts to 8-bit indexed color using Heckbert’s median-cut color quantization algorithm.
RGB Color. Converts to 32-bit RGB color.

Image -> Type -> 8-bit will convert color to grayscale.

Crop

Image -> Crop

Opening an email

Thank you for contacting us.
Thank you for your prompt reply. 
Thank you for your reply.
Thank you for getting back to me.
Thank you for providing the requested information.
Thank you for all your assistance.
I truly appreciate … your help in resolving the problem.
Thank you for raising your concerns.
Thank you for your feedback.

Closing an email

Thank you for your kind cooperation.
Thank you for your attention to this matter.
Thank you for your understanding.
Thank you for your consideration.
Thank you again for everything you've done.

Other situations

Hope you have a good trip back.
How is the project going? 
I suggest we have a call tonight at 9:30 PM. Please let me know if the time is okay for you.
I would like to hold a meeting in the afternoon about our development planning for project A.
We’d like to have the meeting on Thu Oct 30. Same time.
Let’s schedule a meeting next Monday at 5:30 PM.
I want to talk to you over the phone regarding issues about report development and the XXX project.
For the next step of platform implementation, I am proposing…
I suggest we have a weekly project meeting by phone in the near future.

Should you have any problem accessing the folders, please let me know. 
Thank you; I look forward to hearing your opinion on the estimation and schedule.
I look forward to your feedback and suggestions.
What is your opinion on the schedule and next steps we proposed? 
What do you think about this? 
Feel free to give your comments. 
If you have any questions, please don’t hesitate to let me know.
If you have any questions, please let me know.
Please contact me if you have any questions.
Please let me know if you have any questions about this.
Your comments and suggestions are welcome!
Please let me know what you think.
Do you have any idea about this? 
It would be nice if you could provide a bit more information on the user’s behavior. 
At your convenience, I would really appreciate you looking into this matter/issue. 
Please see comments below.
My answers are in blue below.
I have added some comments to the document for your reference.

We would like to finish the following tasks by the end of today:
Some known issues in this release:
Our team here reviewed the newest SCM policy and has the following concerns:
Here are some more questions/issues for your team:
The current status is as follows:
Some items need your attention:
I have some questions about the report:
For the assignment ABC, I have the following questions:

I enclose the evaluation report for your reference.
Attached please find today’s meeting notes.
Attached is the design document; please review it.
For other known issues related to individual features, please see attached release notes.

Thank you so much for the cooperation.
Thanks for the information.
I really appreciate the effort you all made for this sudden and tight project. 
Thank you for your attention! 
Your kind assistance on this is very much appreciated.
Really appreciate your help!

I sincerely apologize for this misunderstanding! 
I apologize for asking so late, but we want to confirm the correctness of our implementation ASAP.

To delete a Git tag both locally and on the remote, open up a terminal window and navigate to your local GitHub repository.

git tag -d tagName
git push origin :tagName

If your tag has the same name as one of your branches, use this instead:

git tag -d tagName
git push origin :refs/tags/tagName

You need to replace tagName with the tag name that you want to delete.

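1. Download SNP information

(1) Open the Table Browser of the UCSC Genome Browser.
(2) Select the assembly you need (for example, hg19).
(3) For group, select "Variation".
(4) For track, select the dataset you need (for example, Common SNPs(146)).
(5) For table, select the table you need (for example, snp146Common).
(6) For output format, select 'BED-browser extensible data'.
(7) Click 'get output' to download the data, and save it as 'hg19_commonSNPs146.txt'.

2. Download gene information

You can use this gene list: genes that are consistently annotated across Ensembl and Entrez-gene databases, and which have HUGO identifiers. The commands below read it from hugo.txt.

Note that this dataset uses hg19/GRCh37 coordinates.

3. Extract all SNPs within genes

(1) Install BEDTools

(2) Extract the SNPs on the autosomes and on the X and Y chromosomes, then sort them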

awk '$1!~"_" && $1!~"M" {printf("%s\t%d\t%d\t%s\n", $1,$2,$3,$4);}' hg19_commonSNPs146.txt | sort -k1,1 -k2,2n -k3,3n -k4,4  > hg19_snp146_auto_sorted.txt

(3) Extend each gene by 2000 bp upstream and downstream

awk '{printf("%s\t%d\t%d\t%s\n", "chr"$1,$2-2000,$3+2000,$4);}' hugo.txt > hugo_2kb.txt

(4) In the gene file, 23 and 24 denote the X and Y chromosomes; rename them, then sort

sed 's/chr23/chrX/' hugo_2kb.txt > hugo_2kb_v1.txt          
sed 's/chr24/chrY/' hugo_2kb_v1.txt > hugo_2kb_v2.txt          
sort -k1,1 -k2,2n -k3,3n hugo_2kb_v2.txt > hugo_2kb_v2_sorted.txt

(5) Map the SNPs to the genes (this step takes a while)

intersectBed -a hg19_snp146_auto_sorted.txt -b hugo_2kb_v2_sorted.txt -wa -wb | awk '{print $4, $8}' > geneSNPs_2kb.txt

Reference

Mapping SNPs to Genes for GWAS Enrichment Analysis

statistics for management and economics study notes 2

5. Data Collection And Sampling

5.1 Simple Random Sample

A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.

5.2 Stratified Random Sampling

A stratified random sample is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum.

5.3 Cluster Sampling

A cluster sample is a simple random sample of groups or clusters of elements.

5.4 Sampling Error

Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.

5.5 Nonsampling Error

Nonsampling errors result from mistakes made in the acquisition of data or from the sample observations being selected improperly.

Errors in data acquisition.
Nonresponse error refers to error (or bias) introduced when responses are not obtained from some members of the sample.
Selection bias occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample.

6 Probability

6.1 Intersection

The intersection of events A and B is the event that occurs when both A and B occur. The probability of the intersection is called the joint probability.

6.2 Marginal Probability

Marginal probabilities, computed by adding across rows or down columns, are so named because they are calculated in the margins of the table.

6.3 Conditional Probability

The probability of event A given event B is

\[p(A|B) = \frac{p(AB)}{p(B)}\]

The probability of event B given event A is

\[p(B|A) = \frac{p(AB)}{p(A)}\]

6.4 Independence

Two events A and B are said to be independent if

\[p(A|B) = p(A)\]

\[p(B|A) = p(B)\]
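
As a small numerical illustration, the sketch below computes marginal and conditional probabilities from a joint probability table and checks the independence condition. The joint probabilities are hypothetical and were chosen so that the two events come out independent.

```python
# Hypothetical joint probabilities for two events A and B.
joint = {
    ("A", "B"): 0.12,
    ("A", "not B"): 0.28,
    ("not A", "B"): 0.18,
    ("not A", "not B"): 0.42,
}

# Marginal probabilities: add across a row or down a column.
p_A = joint[("A", "B")] + joint[("A", "not B")]
p_B = joint[("A", "B")] + joint[("not A", "B")]

# Conditional probabilities: joint probability divided by the marginal.
p_A_given_B = joint[("A", "B")] / p_B
p_B_given_A = joint[("A", "B")] / p_A

# Independence check, with a tolerance for floating-point rounding.
independent = (abs(p_A_given_B - p_A) < 1e-9
               and abs(p_B_given_A - p_B) < 1e-9)

print(p_A, p_B, p_A_given_B, p_B_given_A, independent)
```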

6.5 Union

The union of events A and B is the event that occurs when either A or B or both occur. It is denoted as A or B.

6.6 Complement Rule

The complement of event A is the event that occurs when event A does not occur.

\[p(A^c) = 1 - p(A)\]

6.7 Multiplication Rule

\[p(AB) = p(A)p(B|A) = p(B)p(A|B)\]

6.8 Addition Rule

The probability that event A, or event B, or both occur is

\[p(A \text{ or } B) = p(A) + p(B) - p(AB)\]

6.9 Bayes’s Law Formula

\[p(A_{i}|B) = \frac{p(A_{i})p(B|A_{i})}{p(A_{1})p(B|A_{1}) + p(A_{2})p(B|A_{2}) + \cdots + p(A_{k})p(B|A_{k})}\]
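
A minimal sketch of the formula in code: the function takes the prior probabilities p(A_i) and the likelihoods p(B|A_i) and returns every posterior p(A_i|B). The function name and the screening-test numbers are assumptions made for illustration.

```python
def bayes_posterior(priors, likelihoods):
    """Return p(A_i | B) for each i from p(A_i) and p(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # the denominator of Bayes's law, which equals p(B)
    return [j / total for j in joint]

# Hypothetical example: A_1 = has the condition, A_2 = does not.
priors = [0.01, 0.99]
likelihoods = [0.95, 0.05]  # p(positive | A_1), p(positive | A_2)

posterior = bayes_posterior(priors, likelihoods)
print(posterior)
```

With these made-up numbers the posterior probability of A_1 given a positive result is roughly 0.16, far below the 0.95 likelihood, because the prior p(A_1) is small.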

7. Random Variables and Discrete Probability Distributions

7.1 Describing the Population Probability Distribution

\[E(x) = \mu = \sum xp(x)\]

\[V(x) = \sigma^2 = \sum (x-\mu)^2p(x)\]

7.2 Laws of Expected Value and Variance

\[E(c) = c\] \[E(x + c) = E(x) + c\] \[E(cx) = cE(x)\] \[V(c) = 0\] \[V(x + c) = V(x)\] \[V(cx) = c^2V(x)\]
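
These laws can be checked numerically on a small discrete distribution. The distribution and the constant c below are hypothetical; `mean` and `variance` are helper names introduced for this sketch.

```python
def mean(dist):
    """E(x) for a distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """V(x) for a distribution given as {value: probability}."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# Hypothetical distribution of x and an arbitrary constant c.
dist = {0: 0.2, 1: 0.5, 2: 0.3}
c = 3

shifted = {x + c: p for x, p in dist.items()}  # distribution of x + c
scaled = {c * x: p for x, p in dist.items()}   # distribution of cx

print(mean(dist), variance(dist))
print(mean(shifted), variance(shifted))  # mean shifts by c, variance unchanged
print(mean(scaled), variance(scaled))    # mean scales by c, variance by c^2
```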

7.3 Bivariate Distributions

The covariance of two discrete variables is defined as

\[COV(x, y) = \sigma_{xy} = \sum \sum (x - \mu_{x})(y-\mu_{y})p(x, y)\]

Coefficient of Correlation:

\[\rho = \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}\]

7.4 Laws of Expected Value and Variance of the Sum of Two Variables

\[E(x + y) = E(x) + E(y)\]

\[V(x + y) = V(x) + V(y) + 2COV(x, y)\]
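
The covariance, the coefficient of correlation, and the law for the variance of a sum can be verified on a small bivariate distribution. The joint probabilities below are hypothetical.

```python
from math import sqrt

# Hypothetical joint distribution p(x, y).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

mu_x = sum(x * p for (x, y), p in joint.items())
mu_y = sum(y * p for (x, y), p in joint.items())
var_x = sum((x - mu_x) ** 2 * p for (x, y), p in joint.items())
var_y = sum((y - mu_y) ** 2 * p for (x, y), p in joint.items())

# Covariance and coefficient of correlation.
cov_xy = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())
rho = cov_xy / sqrt(var_x * var_y)

# Variance of x + y computed directly from the joint distribution ...
mu_sum = sum((x + y) * p for (x, y), p in joint.items())
var_sum_direct = sum((x + y - mu_sum) ** 2 * p for (x, y), p in joint.items())
# ... and through the law V(x) + V(y) + 2 COV(x, y).
var_sum_law = var_x + var_y + 2 * cov_xy

print(cov_xy, rho, var_sum_direct, var_sum_law)
```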

7.5 Mean and Variance of a Portfolio of Two Stocks

\[E(R_{p}) = w_{1}E(R_{1}) + w_{2}E(R_{2})\]

\[V(R_{p}) = w_{1}^2V(R_{1}) + w_{2}^2V(R_{2}) + 2w_{1}w_{2}COV(R_{1}, R_{2}) = w_{1}^2V(R_{1}) + w_{2}^2V(R_{2}) + 2w_{1}w_{2}\rho\sigma_{1}\sigma_{2}\]
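
A short sketch of the two-stock formulas, using the form with the coefficient of correlation. The function name, weights, expected returns, and standard deviations are all hypothetical inputs.

```python
def portfolio_mean_variance(w1, w2, mean1, mean2, sd1, sd2, rho):
    """Expected return and variance of a two-stock portfolio."""
    expected = w1 * mean1 + w2 * mean2
    variance = (w1 ** 2 * sd1 ** 2
                + w2 ** 2 * sd2 ** 2
                + 2 * w1 * w2 * rho * sd1 * sd2)
    return expected, variance

# Hypothetical portfolio: 60% in stock 1 and 40% in stock 2.
expected, variance = portfolio_mean_variance(
    w1=0.6, w2=0.4, mean1=0.08, mean2=0.12, sd1=0.10, sd2=0.20, rho=0.5
)
print(expected, variance)

# The same weights with uncorrelated stocks give a smaller variance.
_, variance_uncorrelated = portfolio_mean_variance(
    w1=0.6, w2=0.4, mean1=0.08, mean2=0.12, sd1=0.10, sd2=0.20, rho=0.0
)
print(variance_uncorrelated)
```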

7.6 Portfolios with More Than Two Stocks

\[E(R_{p}) = \sum_{i=1}^k w_{i}E(R_{i})\]

\[V(R_{p}) = \sum_{i=1}^k w_{i}^2V(R_{i}) + 2\sum_{i=1}^k \sum_{j=i+1}^k w_{i}w_{j}COV(R_{i}, R_{j})\]

7.7 Binomial Distribution

The binomial experiment consists of a fixed number of trials (n).
Each trial has two possible outcomes: success or failure.
The probability of success is p. The probability of failure is 1 − p.
The trials are independent.

The probability of x successes in a binomial experiment with n trials and probability of success = p is

\[p(x) = \frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}\]

7.7.1 Cumulative Probability

\[p(X \le 4) = p(0) + p(1) + p(2) + p(3) + p(4)\]

7.7.2 Binomial Probability p(X ≥ x)

\[p(X \ge x) = 1 - p(X \le (x-1))\]

7.7.3 Binomial Probability P(X = x)

\[p(x) = p(X \le x) - p(X \le (x-1))\]

7.7.4 Mean and Variance of a Binomial Distribution

\[\mu = np\] \[\sigma^2 = np(1-p)\] \[\sigma = \sqrt{np(1-p)}\]
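
The binomial formulas above can be sketched with the standard library only. The values n = 10 and p = 0.2 are arbitrary, and `binom_pmf` and `binom_cdf` are helper names introduced for this example.

```python
from math import comb

def binom_pmf(x, n, p):
    """p(x): probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """p(X <= x): cumulative probability, summing p(0) through p(x)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 10, 0.2  # hypothetical experiment

p_at_most_4 = binom_cdf(4, n, p)                       # p(X <= 4)
p_at_least_3 = 1 - binom_cdf(2, n, p)                  # p(X >= 3)
p_exactly_3 = binom_cdf(3, n, p) - binom_cdf(2, n, p)  # p(X = 3)

mean = n * p
variance = n * p * (1 - p)

print(p_at_most_4, p_at_least_3, p_exactly_3, mean, variance)
```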

7.8 Poisson Distribution

Like the binomial random variable, the Poisson random variable is the number of occurrences of events, which we’ll continue to call successes. The difference between the two random variables is that a binomial random variable is the number of successes in a set number of trials, whereas a Poisson random variable is the number of successes in an interval of time or specific region of space.

The number of successes that occur in any interval is independent of the number of successes that occur in any other interval.
The probability of a success in an interval is the same for all equal-size intervals.
The probability of a success in an interval is proportional to the size of the interval.
The probability of more than one success in an interval approaches 0 as the interval becomes smaller.

The probability that a Poisson random variable assumes a value of x in a specific interval is

\[p(x) = \frac{e^{-\mu}\mu^x}{x!}\]

The variance of a Poisson random variable is equal to its mean; that is,

\[\sigma^2 = \mu\]

\[p(X \ge x) = 1 - p(X \le (x-1))\]

\[p(x) = p(X \le x) - p(X \le (x-1))\]
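
A minimal sketch of the Poisson formulas, assuming a mean of 2 successes per interval (an arbitrary value). The helper names are introduced for this example, and the variance is approximated by truncating the infinite sum at 60 terms.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """p(x): probability of exactly x successes in the interval."""
    return exp(-mu) * mu ** x / factorial(x)

def poisson_cdf(x, mu):
    """p(X <= x): cumulative probability."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

mu = 2.0  # hypothetical mean number of successes per interval

p_at_least_1 = 1 - poisson_cdf(0, mu)                    # p(X >= 1)
p_exactly_3 = poisson_cdf(3, mu) - poisson_cdf(2, mu)    # p(X = 3)

# Numerical check that the variance equals the mean (sum truncated at 60).
approx_mean = sum(k * poisson_pmf(k, mu) for k in range(60))
approx_var = sum((k - approx_mean) ** 2 * poisson_pmf(k, mu) for k in range(60))

print(p_at_least_1, p_exactly_3, approx_mean, approx_var)
```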

8. Continuous Probability Distributions

8.1 Uniform Distribution

\[f(x) = \frac{1}{b-a}, a \le x \le b\]

8.2 Normal Distribution

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\]

8.2.1 Calculating Normal Probabilities

We standardize a random variable by subtracting its mean and dividing by its standard deviation. When the variable is normal, the transformed variable is called a standard normal random variable and denoted by Z; that is,

\[Z = \frac{X - \mu}{\sigma}\]
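
Standardizing can be illustrated with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The mean, standard deviation, and x value below are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 50.0, 10.0  # hypothetical population mean and standard deviation
x = 65.0

# Standardize: subtract the mean and divide by the standard deviation.
z = (x - mu) / sigma

# p(X < x) computed directly and through the standard normal variable Z.
p_direct = NormalDist(mu, sigma).cdf(x)
p_standard = NormalDist().cdf(z)

print(z, p_direct, p_standard)
```

Both routes give the same probability, which is why a single standard normal table suffices for any normal distribution.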

8.3 Exponential Distribution

\[f(x) = \lambda e^{-\lambda x}, x \ge 0\]

\[\mu = \sigma = \frac{1}{\lambda}\]

\[p(X > x) = e^{-\lambda x}\]

\[p(X < x) = 1 - e^{-\lambda x}\]

\[p(x_{1} < X < x_{2}) = p(X < x_{2}) - p(X < x_{1}) = e^{-\lambda x_{1}} - e^{-\lambda x_{2}}\]
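
The exponential probability statements translate directly into code. The rate of 0.5 and the interval endpoints are hypothetical, and the function names are introduced for this sketch.

```python
from math import exp

def exp_greater(x, lam):
    """p(X > x) for an exponential random variable with rate lam."""
    return exp(-lam * x)

def exp_less(x, lam):
    """p(X < x) for an exponential random variable with rate lam."""
    return 1 - exp(-lam * x)

def exp_between(x1, x2, lam):
    """p(x1 < X < x2), computed as p(X < x2) - p(X < x1)."""
    return exp_less(x2, lam) - exp_less(x1, lam)

lam = 0.5            # hypothetical rate
mean = sd = 1 / lam  # mean and standard deviation are both 1 / lambda

p_between = exp_between(1.0, 3.0, lam)
print(mean, exp_greater(2.0, lam), exp_less(2.0, lam), p_between)
```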

8.4 Student t Distribution

\[f(t)=\frac{\Gamma[(\nu + 1)/2]}{\sqrt{\nu \pi} \Gamma (\nu /2)}[1 + \frac{t^2}{\nu}]^{-(\nu + 1)/2}\]

where $\nu$ (Greek letter nu) is the parameter of the Student t distribution called the degrees of freedom, and $\Gamma$ is the gamma function.

\[E(t) = 0\]

\[V(t) = \frac{\nu}{\nu - 2}, \nu \gt 2\]

The Student t distribution is similar to the standard normal distribution. Both are symmetrical about 0. We describe the Student t distribution as mound shaped, whereas the normal distribution is bell shaped. As $\nu$ grows larger, the Student t distribution approaches the standard normal distribution.

8.5 Chi-Squared Distribution

\[f(\chi^2) = \frac{1}{\Gamma(\nu/2)} \frac{1}{2^{\nu/2}}(\chi^2)^{(\nu/2)-1}e^{-\chi^2/2}\]

\[E(\chi^2) = \nu\]

\[V(\chi^2) = 2\nu\]

8.6 F Distribution

\[E(F) = \frac{\nu_{2}}{\nu_{2} - 2}, \nu_{2} \gt 2\]

\[V(F) = \frac{2\nu_{2}^2(\nu_{1} + \nu_{2} -2)}{\nu_{1}(\nu_{2}-2)^2(\nu_{2} -4)}, \nu_{2} \gt 4\]

where $\nu_{1}$ is the numerator degrees of freedom and $\nu_{2}$ is the denominator degrees of freedom.

BLOCK	1	2	…	k	Block Mean
1	\(x_{11}\)	\(x_{12}\)	…	\(x_{1k}\)	\(\bar x[B]_{1}\)
2	\(x_{21}\)	\(x_{22}\)	…	\(x_{2k}\)	\(\bar x[B]_{2}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)		\(\vdots\)	\(\vdots\)
b	\(x_{b1}\)	\(x_{b2}\)	…	\(x_{bk}\)	\(\bar x[B]_{b}\)
Treatment Mean	\(\bar x[T]_{1}\)	\(\bar x[T]_{2}\)	…	\(\bar x[T]_{k}\)


