http://felixfan.github.io/statistics-for-management-and-economics-study-notes-4 statistics for management and economics study notes 4

14. Analysis of Variance

14.1 One-way Analysis of Variance

The analysis of variance is a procedure that tests whether differences exist between two or more population means. One-way analysis of variance is the procedure to apply when the samples are independently drawn.

\(H_{0}\): \(\mu_{1} = \mu_{2} = \cdots = \mu_{k}\)
\(H_{1}\): at least two means differ

The statistic that measures the proximity of the sample means to each other is called the between-treatments variation; it is denoted SST, which stands for sum of squares for treatments.

\[SST = \sum_{j=1}^k n_{j}(\bar x_{j} - \bar{\bar x})^2\]

\[\bar{\bar x} =\frac{\sum_{j=1}^k \sum_{i=1}^{n_{j}} x_{ij}}{n}\]

\[n = n_{1} + n_{2} + \cdots + n_{k}\]

\[\bar x_{j} = \frac{\sum_{i=1}^{n_{j}}x_{ij}}{n_{j}}\]

The amount of variation within each sample is measured by the within-treatments variation, denoted SSE (sum of squares for error). The within-treatments variation provides a measure of the amount of variation in the response variable that is not caused by the treatments.

\[SSE = \sum_{j=1}^k \sum_{i=1}^{n_{j}}(x_{ij} - \bar x_{j})^2\]

\[SSE = (n_{1}-1)s_{1}^2 + (n_{2}-1)s_{2}^2 + \cdots + (n_{k}-1)s_{k}^2\]

The mean square for treatments is computed by dividing SST by the number of treatments minus 1.

\[MST = \frac{SST}{k-1}\]

The mean square for error is determined by dividing SSE by the total sample size (labeled n) minus the number of treatments.

\[MSE = \frac{SSE}{n-k}\]

Finally, the test statistic is defined as the ratio of the two mean squares.

\[F = \frac{MST}{MSE}\]

The test statistic is F-distributed with k − 1 and n − k degrees of freedom, provided that the response variable is normally distributed. We reject the null hypothesis only if

\[F > F_{\alpha, k-1, n-k}\]

The total variation of all the data is denoted SS(Total):

\[SS(Total) = SST + SSE = \sum_{j=1}^k \sum_{i=1}^{n_{j}}(x_{ij} - \bar{\bar x})^2\]

ANOVA Table for the One-Way Analysis of Variance:

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F-statistic |
|---|---|---|---|---|
| Treatments | k − 1 | SST | MST = SST / (k − 1) | F = MST / MSE |
| Error | n − k | SSE | MSE = SSE / (n − k) | |
| Total | n − 1 | SS(Total) | | |
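
As a minimal sketch of the formulas above, the following Python function computes SST, SSE, the mean squares and the F-statistic from raw samples. The function name `one_way_anova` and the three small samples are made up for illustration; they are not data from these notes.

```python
from statistics import mean

def one_way_anova(samples):
    """Return (SST, SSE, MST, MSE, F) for a list of independent samples."""
    k = len(samples)                          # number of treatments
    n = sum(len(s) for s in samples)          # total sample size
    grand_mean = sum(sum(s) for s in samples) / n
    # Between-treatments variation
    sst = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Within-treatments variation
    sse = sum((x - mean(s)) ** 2 for s in samples for x in s)
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return sst, sse, mst, mse, mst / mse

# Hypothetical data: three treatments, three observations each
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
sst, sse, mst, mse, f_stat = one_way_anova(samples)
print(sst, sse, mst, mse, f_stat)
```

The resulting F would then be compared with the critical value \(F_{\alpha, k-1, n-k}\) taken from an F table or a statistics package.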

Example: a financial analyst randomly sampled 366 American households and asked each to report the age category of the head of the household and the proportion of its financial assets that are invested in the stock market. The age categories are Young (less than 35), Early middle age (35 to 49), Late middle age (50 to 65), Senior (older than 65). The analyst was particularly interested in determining whether the ownership of stocks varied by age.

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F-statistic | P |
|---|---|---|---|---|---|
| Treatments | 3 | 3741.4 | 1247.12 | 2.79 | 0.0405 |
| Error | 362 | 161871.0 | 447.16 | | |
| Total | 365 | 165612.4 | | | |

Interpret: The value of the test statistic is F = 2.79, and its p-value is .0405. At the 5% significance level, there is evidence to infer that the percentage of total assets invested in stocks differs in at least two of the age categories.

14.1.1 Can We Use the t-Test of the Difference between Two Means Instead of the Analysis of Variance?

There are two reasons why we don’t use multiple t-tests instead of one F-test. First, we would have to perform many more calculations. Second, and more important, conducting multiple tests increases the probability of making Type I errors.

14.1.2 Can We Use the Analysis of Variance Instead of the t-Test of \(\mu_{1} − \mu_{2}\)?

If we want to determine whether \(\mu_{1}\) is greater than \(\mu_{2}\) (or vice versa), we cannot use the analysis of variance because this technique allows us to test for a difference only. Thus, if we want to test to determine whether one population mean exceeds the other, we must use the t-test of \(\mu_{1} − \mu_{2}\) (with \(\sigma_{1}^2=\sigma_{2}^2\)). Moreover, the analysis of variance requires that the population variances are equal. If they are not, we must use the unequal variances test statistic.

14.2 Multiple Comparisons

Bonferroni adjustment:

\[\alpha = \frac{\alpha_{E}}{C}\]

\(\alpha_{E}\), the probability of making at least one Type I error across all the comparisons, is called the experimentwise Type I error rate. \(C\) is the number of pairwise comparisons (not the sample size); for \(k\) treatments, \(C = k(k-1)/2\).
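
A short sketch of the adjustment, assuming all pairwise comparisons among k treatment means are made. The helper name `bonferroni_alpha` is hypothetical, and the unadjusted error rate is computed under the simplifying assumption that the individual tests are independent.

```python
from math import comb

def bonferroni_alpha(alpha_e, k):
    """Per-comparison significance level for all pairwise comparisons of k means."""
    num_comparisons = comb(k, 2)              # k(k-1)/2 pairwise comparisons
    return alpha_e / num_comparisons, num_comparisons

# Four treatments, as in the age-category example of 14.1
alpha, c = bonferroni_alpha(0.05, 4)

# Chance of at least one Type I error if every test used 0.05 instead
# (assumes independent tests, which is only an approximation)
unadjusted = 1 - (1 - 0.05) ** c
print(c, round(alpha, 4), round(unadjusted, 4))
```

With four treatments there are six comparisons, so each test is run at roughly 0.0083 to keep the experimentwise rate at 0.05.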

14.3 Analysis of Variance Experimental Designs

14.3.1 Single-Factor and Multifactor Experimental Designs

A single-factor analysis of variance addresses the problem of comparing two or more populations defined on the basis of only one factor. A multifactor experiment is one in which two or more factors define the treatments.

The example in 14.1 is a single-factor design because we had one factor: age of the head of the household. Suppose that we can also look at the gender of the household head in another study. We would then develop a two-factor analysis of variance in which the first factor, age, has four levels, and the second factor, gender, has two levels.

14.3.2 Independent Samples and Blocks

When the problem objective is to compare more than two populations, the experimental design that is the counterpart of the matched pairs experiment is called the randomized block design. The term block refers to a matched group of observations from each population. The randomized block experiment is also called the two-way analysis of variance.

We can determine whether sleeping pills are effective by giving three brands of pills to the same group of people to measure the effects. Such experiments are called repeated measures designs.

The data are analyzed in the same way for both designs.

14.3.3 Fixed and Random Effects

If our analysis includes all possible levels of a factor, the technique is called a fixed effects analysis of variance. If the levels included in the study represent a random sample of all the levels that exist, the technique is called a random-effects analysis of variance.

14.4 Randomized Block (Two-Way) Analysis of Variance

The purpose of designing a randomized block experiment is to reduce the within-treatments variation to more easily detect differences between the treatment means. In the one-way analysis of variance, we partitioned the total variation into the between-treatments and the within-treatments variation; that is,

\[SS(Total) = SST + SSE\]

In the randomized block design of the analysis of variance, we partition the total variation into three sources of variation:

\[SS(Total) = SST + SSB + SSE\]

where SSB, the sum of squares for blocks, measures the variation between the blocks.

| Block | Treatment 1 | Treatment 2 | Treatment k | Block mean |
|---|---|---|---|---|
| 1 | \(x_{11}\) | \(x_{12}\) | \(x_{1k}\) | \(\bar x[B]_{1}\) |
| 2 | \(x_{21}\) | \(x_{22}\) | \(x_{2k}\) | \(\bar x[B]_{2}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| b | \(x_{b1}\) | \(x_{b2}\) | \(x_{bk}\) | \(\bar x[B]_{b}\) |
| Treatment mean | \(\bar x[T]_{1}\) | \(\bar x[T]_{2}\) | \(\bar x[T]_{k}\) | |

Sums of Squares in the Randomized Block Experiment:

\[SS(Total) = \sum_{j=1}^k \sum_{i=1}^b (x_{ij} - \bar{\bar x})^2\]

\[SST = \sum_{j=1}^k b(\bar x[T]_{j} - \bar{\bar x})^2\]

\[SSB = \sum_{i=1}^b k(\bar x[B]_{i} - \bar{\bar x})^2\]

\[SSE = \sum_{j=1}^k \sum_{i=1}^b (x_{ij} - \bar x[T]_{j} - \bar x[B]_{i} + \bar{\bar x})^2\]

Mean Squares for the Randomized Block Experiment:

\[MST = \frac{SST}{k-1}\]

\[MSB = \frac{SSB}{b-1}\]

\[MSE = \frac{SSE}{n-k-b+1}\]

Test Statistic for the Randomized Block Experiment

\[F = \frac{MST}{MSE}\]

which is F-distributed with ν1 = k − 1 and ν2 = n − k − b + 1 degrees of freedom.

ANOVA Table for the Randomized Block Analysis of Variance

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F-statistic |
|---|---|---|---|---|
| Treatments | k − 1 | SST | MST = SST / (k − 1) | F = MST / MSE |
| Blocks | b − 1 | SSB | MSB = SSB / (b − 1) | F = MSB / MSE |
| Error | n − k − b + 1 | SSE | MSE = SSE / (n − k − b + 1) | |
| Total | n − 1 | SS(Total) | | |
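
The partition SS(Total) = SST + SSB + SSE can be sketched in Python as follows. The function name `randomized_block_anova` and the tiny 3-block by 2-treatment data set are hypothetical, chosen only so that the arithmetic is easy to check by hand.

```python
def randomized_block_anova(x):
    """x[i][j] is the observation in block i (row) under treatment j (column)."""
    b, k = len(x), len(x[0])
    n = b * k
    grand = sum(sum(row) for row in x) / n
    treat_means = [sum(x[i][j] for i in range(b)) / b for j in range(k)]
    block_means = [sum(row) / k for row in x]
    sst = b * sum((m - grand) ** 2 for m in treat_means)
    ssb = k * sum((m - grand) ** 2 for m in block_means)
    sse = sum((x[i][j] - treat_means[j] - block_means[i] + grand) ** 2
              for i in range(b) for j in range(k))
    mse = sse / (n - k - b + 1)               # error degrees of freedom
    f_treatments = (sst / (k - 1)) / mse
    f_blocks = (ssb / (b - 1)) / mse
    return sst, ssb, sse, f_treatments, f_blocks

# Hypothetical data: 3 blocks (rows), 2 treatments (columns)
x = [[1, 3], [2, 6], [6, 6]]
sst, ssb, sse, f_treatments, f_blocks = randomized_block_anova(x)
print(sst, ssb, sse, f_treatments, f_blocks)
```

For these data the three sums of squares add up to the total variation of 26, which is a quick check that the partition was computed correctly.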

Example: A company selected 25 groups of four men, each of whom had cholesterol levels in excess of 280. In each group, the men were matched according to age and weight. Four drugs were administered over a 2-month period, and the reduction in cholesterol was recorded. Do these results allow the company to conclude that differences exist between the four drugs?

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F-statistic | P |
|---|---|---|---|---|---|
| Drug | 3 | 196.0 | 65.3 | 4.12 | 0.009 |
| Group | 24 | 3848.7 | 160.4 | 10.11 | 0.000 |
| Error | 72 | 1142.6 | 15.9 | | |
| Total | 99 | 5187.2 | | | |

Interpret: we conclude that there is sufficient evidence to infer that at least two of the drugs differ.

14.5 Two-Factor Analysis of Variance

The general term for an experiment that features two factors is factorial experiment. In factorial experiments, we can examine the effect on the response variable of two or more factors. We will present the technique for fixed effects only. That means we will address problems where all the levels of the factors are included in the experiment.

Example: As part of a study on job tenure, a survey was conducted in which Americans aged between 37 and 45 were asked how many jobs they have held in their lifetimes. Also recorded were gender and educational attainment. The categories are E1, E2, E3 and E4. Can we infer that differences exist between genders and educational levels?

\(H_{0}\): \(\mu_{1} = \mu_{2} = \mu_{3} = \mu_{4} = \mu_{5} = \mu_{6} = \mu_{7} = \mu_{8}\)
\(H_{1}\): At least two means differ

Summary:

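| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| Male E1 | 10 | 126 | 12.60 | 8.27 |
| Male E2 | 10 | 110 | 11.00 | 8.67 |
| Male E3 | 10 | 106 | 10.60 | 11.60 |
| Male E4 | 10 | 90 | 9.00 | 5.33 |
| Female E1 | 10 | 115 | 11.50 | 8.28 |
| Female E2 | 10 | 112 | 11.20 | 9.73 |
| Female E3 | 10 | 94 | 9.40 | 16.49 |
| Female E4 | 10 | 81 | 8.10 | 12.32 |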

One-way ANOVA:

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F-statistic | P |
|---|---|---|---|---|---|
| Between Groups | 7 | 153.35 | 21.91 | 2.17 | 0.0467 |
| Within Groups | 72 | 726.20 | 10.09 | | |
| Total | 79 | 879.55 | | | |

Interpret: The value of the test statistic is F = 2.17 with a p-value of .0467. We conclude that there are differences in the number of jobs between the eight treatments.

This statistical result raises more questions—namely, can we conclude that the differences in the mean number of jobs are caused by differences between males and females? Or are they caused by differences between educational levels? Or, perhaps, are there combinations, called interactions, of gender and education that result in especially high or low numbers?

A complete factorial experiment is an experiment in which the data for all possible combinations of the levels of the factors are gathered. That means that in the above example we measured the number of jobs for all eight combinations. This experiment is called a complete 2 × 4 factorial experiment. In general, we will refer to one of the factors as factor A (arbitrarily chosen). The number of levels of this factor will be denoted by a. The other factor is called factor B, and its number of levels is denoted by b. The number of observations for each combination is called a replicate. The number of replicates is denoted by r. We address only problems in which the number of replicates is the same for each treatment. Such a design is called balanced.

\(x_{ijk}\) = \(k\)th observation in the \(ij\)th treatment
\(\bar x[AB]_{ij}=\) mean of the treatment when the factor A level is i and the factor B level is j
\(\bar x[A]_{i}=\) Mean of the observations when the factor A level is i
\(\bar x[B]_{j}=\) Mean of the observations when the factor B level is j
\(\bar{\bar x}=\) Mean of all the observations
a = Number of factor A levels
b = Number of factor B levels
r = Number of replicates

\[SS(Total) = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^r (x_{ijk} - \bar{\bar x})^2\]

\[SS(A) = rb \sum_{i=1}^a (\bar x[A]_{i} - \bar{\bar x})^2\]

\[SS(B) = ra \sum_{j=1}^b (\bar x[B]_{j} - \bar{\bar x})^2\]

\[SS(AB) = r \sum_{i=1}^a \sum_{j=1}^b (\bar x[AB]_{ij} - \bar x[A]_{i} - \bar x[B]_{j} + \bar{\bar x})^2\]

\[SSE = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^r (x_{ijk} - \bar x[AB]_{ij})^2\]

\(\nu_{SS(A)} = a -1\)
\(\nu_{SS(B)} = b -1\)
\(\nu_{SS(AB)} = (a -1)(b-1)\)
\(\nu_{SSE} = n - ab\)

F-Tests Conducted in Two-Factor Analysis of Variance
Test for Differences between the Levels of Factor A
\(H_{0}\): The means of the a levels of factor A are equal
\(H_{1}\): At least two means differ
Test for Differences between the Levels of Factor B
\(H_{0}\): The means of the b levels of factor B are equal
\(H_{1}\): At least two means differ
Test for Interaction between Factors A and B
\(H_{0}\): Factors A and B do not interact to affect the mean responses
\(H_{1}\): Factors A and B do interact to affect the mean responses

Required Conditions
* The response variable is normally distributed.
* The variance for each treatment is identical.
* The samples are independent.

ANOVA Table for the Two-Factor Experiment:

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F-statistic |
|---|---|---|---|---|
| Factor A | a − 1 | SS(A) | MS(A) | MS(A) / MSE |
| Factor B | b − 1 | SS(B) | MS(B) | MS(B) / MSE |
| Interaction | (a − 1)(b − 1) | SS(AB) | MS(AB) | MS(AB) / MSE |
| Error | n − ab | SSE | MSE | |
| Total | n − 1 | SS(Total) | | |

Two-way ANOVA: Jobs versus Gender, Education

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F-statistic | P |
|---|---|---|---|---|---|
| Gender | 1 | 11.25 | 11.25 | 1.12 | 0.294 |
| Education | 3 | 135.85 | 45.28 | 4.49 | 0.006 |
| Interaction | 3 | 6.25 | 2.08 | 0.21 | 0.892 |
| Error | 72 | 726.20 | 10.09 | | |
| Total | 79 | 879.55 | | | |

Interpret: There is no evidence at the 5% significance level to infer that differences in the number of jobs exist between men and women. There is sufficient evidence at the 5% significance level to infer that differences in the number of jobs exist between educational levels. There is not enough evidence to conclude that there is an interaction between gender and education.
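
Because the design is balanced, the sums of squares in the two-way table can be recovered from the cell averages and variances in the Summary table alone. The sketch below does that in Python; the variable names are illustrative, and the SSE it produces (about 726.21) differs from the tabulated 726.20 only because the summary variances are rounded to two decimals.

```python
# Cell means and variances from the Summary table (r = 10 observations per cell)
r = 10
means = {
    "Male":   [12.60, 11.00, 10.60, 9.00],
    "Female": [11.50, 11.20, 9.40, 8.10],
}
variances = {
    "Male":   [8.27, 8.67, 11.60, 5.33],
    "Female": [8.28, 9.73, 16.49, 12.32],
}

a = len(means)                  # levels of factor A (gender)
b = len(means["Male"])          # levels of factor B (education)
n = a * b * r

a_means = {g: sum(m) / b for g, m in means.items()}
b_means = [sum(means[g][j] for g in means) / a for j in range(b)]
grand = sum(a_means.values()) / a   # balanced design: mean of cell means

ss_a = r * b * sum((m - grand) ** 2 for m in a_means.values())
ss_b = r * a * sum((m - grand) ** 2 for m in b_means)
ss_ab = r * sum((means[g][j] - a_means[g] - b_means[j] + grand) ** 2
                for g in means for j in range(b))
# Within-cell variation: (r - 1) times each cell variance, summed over cells
sse = (r - 1) * sum(v for vs in variances.values() for v in vs)

mse = sse / (n - a * b)
f_a = ss_a / (a - 1) / mse
f_b = ss_b / (b - 1) / mse
f_ab = ss_ab / ((a - 1) * (b - 1)) / mse
print(round(ss_a, 2), round(ss_b, 2), round(ss_ab, 2), round(sse, 2))
print(round(f_a, 2), round(f_b, 2), round(f_ab, 2))
```

The p-values are not reproduced here because they require the F distribution, which is not part of the Python standard library.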

Order of Testing in the Two-Factor Analysis of Variance: Test for interaction first. If there is enough evidence to infer that there is interaction, do not conduct the other tests. If there is not enough evidence to conclude that there is interaction, proceed to conduct the F-tests for factors A and B.

http://felixfan.github.io/statistics-for-management-and-economics-study-notes-3 statistics for management and economics study notes 3

9. Sampling Distributions

9.1 Sampling Distribution of the Mean

Central Limit Theorem: The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of \(\bar X\) will resemble a normal distribution.

\[\mu_{\bar x} = \mu\]

\[\sigma_{\bar x}^2 = \frac{\sigma^2}{n}\]

If X is normal, then \(\bar X\) is normal. If X is nonnormal, then \(\bar X\) is approximately normal for sufficiently large sample sizes. The definition of “sufficiently large” depends on the extent of nonnormality of X.

Standardizing the sample mean:

\[Z = \frac{\bar X - \mu}{\sigma / \sqrt{n}}\]

9.2 Sampling Distribution of a Sample Proportion

\(\hat P\) is approximately normally distributed provided that np and n(1 − p) are greater than or equal to 5.

\[E(\hat P) = p\]

\[V(\hat P) = \sigma_{\hat p}^2 = \frac{p(1-p)}{n}\]

Standardizing the sample proportion:

\[Z = \frac{\hat P - p}{\sqrt{p(1-p)/n}}\]

9.3 Sampling Distribution of the Difference between Two Means

\[E(\bar X_{1} - \bar X_{2}) = \mu_{\bar x_{1} - \bar x_{2}} = \mu_{1} - \mu_{2}\]

\[V(\bar X_{1} - \bar X_{2}) = \sigma_{\bar x_{1} - \bar x_{2}}^2 = \frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}\]

Standardizing the difference between two sample means:

\[Z = \frac{(\bar X_{1} - \bar X_{2}) - (\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}}}\]

10. Introduction to Estimation

  • An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter.
  • An unbiased estimator is said to be consistent if the difference between the estimator and the parameter grows smaller as the sample size grows larger.
  • If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to be relatively more efficient.

10.1 Estimating the Population Mean When the Population Standard Deviation is Known

\[\bar x \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\]

10.2 Determining the Sample Size to Estimate \(\mu\)

\[n = \left(\frac{z_{\alpha/2}\sigma}{B}\right)^2\]

\[B = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\]

B stands for the bound on the error of estimation.
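
A minimal Python sketch of the interval estimator in 10.1 and the sample-size formula in 10.2, using only the standard library. The function names and the inputs (sample mean 178, σ = 65, n = 400, bound B = 5) are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

def sample_size_for_mean(sigma, bound, confidence=0.95):
    """Smallest n whose error bound does not exceed `bound`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / bound) ** 2)                # round up to a whole observation

lcl, ucl = z_interval(178, 65, 400)
n_needed = sample_size_for_mean(65, 5)
print(round(lcl, 2), round(ucl, 2), n_needed)
```

The sample size is rounded up rather than to the nearest integer so that the bound B is never exceeded.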

11. Introduction to Hypothesis Testing

11.1 Concepts of Hypothesis Testing

  • The null hypothesis (\(H_{0}\)) usually refers to a general statement or default position that there is no relationship between two measured phenomena, or no association among groups.
  • The alternative hypothesis (\(H_{1}\)), also called the maintained hypothesis or research hypothesis, is the hypothesis to be accepted if the null hypothesis is rejected.
  • A Type I error occurs when we reject a true null hypothesis. Its probability is denoted \(\alpha\).
  • A Type II error is defined as not rejecting a false null hypothesis. Its probability is denoted \(\beta\).
  • The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed given that the null hypothesis is true.
  • If we reject the null hypothesis, we conclude that there is enough statistical evidence to infer that the alternative hypothesis is true.
  • If we do not reject the null hypothesis, we conclude that there is not enough statistical evidence to infer that the alternative hypothesis is true.

11.2 Testing the Population Mean When the Population Standard Deviation is Known

  • A two-tail test is conducted whenever the alternative hypothesis specifies that the mean is not equal to the value stated in the null hypothesis.
  • A one-tail test that focuses on the right tail of the sampling distribution is conducted whenever we want to know whether there is enough evidence to infer that the mean is greater than the quantity specified by the null hypothesis.
  • A one-tail test that focuses on the left tail of the sampling distribution is conducted whenever we want to know whether there is enough evidence to infer that the mean is less than the quantity specified by the null hypothesis.

11.2.1 Standardized Test Statistic

\[z = \frac{\bar x - \mu}{\sigma / \sqrt{n}}\]

The rejection region for a two-tail test:

\[z > z_{\alpha / 2}\]

or

\[z < - z_{\alpha / 2}\]

11.2.2 Testing Hypotheses and Confidence Interval Estimators

\[\bar x \pm z_{\alpha / 2}\frac{\sigma}{\sqrt{n}}\]

We compute the interval estimate and determine whether the hypothesized value of the mean falls into the interval. If it does not, we reject the null hypothesis of the two-tail test.

11.3 Calculating the Probability of a Type II Error

Example: A random sample of 400 monthly accounts is drawn, for which the sample mean is $178. The accounts are approximately normally distributed with a standard deviation of $65. Can we conclude, with \(\alpha\) = 5%, that the mean is greater than $170?

\(H_{0}\): \(\mu \le 170\)

\(H_{1}\): \(\mu \gt 170\)

\(\frac{\bar x_{L} - 170}{65/\sqrt{400}} = 1.645\)

\(\bar x_{L} = 175.34\)

Therefore, the rejection region is:

\(\bar x \gt 175.34\)

The sample mean was computed to be 178. Because the test statistic (sample mean) is in the rejection region (it is greater than 175.34), we reject the null hypothesis. Thus, there is sufficient evidence to infer that the mean monthly account is greater than $170.

\(\beta = P(\bar X \lt 175.34\), given that the null hypothesis is false )

To compute \(\beta\) we must specify a value of the mean under the alternative hypothesis. Suppose that the true mean account is $180.

\(\beta = P(\bar X \lt 175.34\), given that \(\mu = 180)\)

\(\beta = P(\frac{\bar X - \mu}{\sigma / \sqrt{n}} < \frac{175.34-180}{65/\sqrt{400}}) = P(Z \lt - 1.43) = 0.0764\)
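
The same calculation can be sketched with Python's standard library. The variable names are illustrative; the result differs slightly from 0.0764 because the hand calculation rounds the critical value to 175.34 and z to −1.43, while the code keeps full precision.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu1 = 170, 180         # hypothesized mean and the alternative mean considered
sigma, n, alpha = 65, 400, 0.05
std_err = sigma / sqrt(n)

# Right-tail test: reject H0 when the sample mean exceeds the critical value
x_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * std_err
# beta = P(sample mean < x_crit) when the true mean is mu1
beta = NormalDist(mu1, std_err).cdf(x_crit)
power = 1 - beta
print(round(x_crit, 2), round(beta, 4), round(power, 4))
```

Rerunning the sketch with a smaller `alpha` or a smaller `n` shows how each choice changes the probability of a Type II error.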

There is an inverse relationship between the probabilities of Type I and Type II errors: for a fixed sample size, decreasing \(\alpha\) increases \(\beta\), and vice versa. Unfortunately, there is no simple formula to determine what the significance level should be.

11.4 Larger Sample Size Equals More Information Equals Better Decisions

11.5 Power of a Test

The power of a test is the probability that it leads us to reject the null hypothesis when it is false. Thus, the power of a test is 1 − β.

12. Inference About a Population

12.1 Inference about a Population Mean When the Population Standard Deviation is Unknown

When the population standard deviation is unknown and the population is normal, the test statistic for testing hypotheses about μ is

\[t = \frac{\bar x - \mu}{s/\sqrt{n}}\]

which is Student t-distributed with ν = n − 1 degrees of freedom.

Confidence Interval Estimator of μ When σ Is Unknown

\[\bar x \pm t_{\alpha/2}\frac{s}{\sqrt{n}}\]

12.2 Inference about a Population Variance

The test statistic used to test hypotheses about \(\sigma^2\) is

\[\chi^2 = \frac{(n-1)s^2}{\sigma^2}\]

which is chi-squared distributed with ν = n − 1 degrees of freedom when the population random variable is normally distributed with variance equal to \(\sigma^2\).

Confidence Interval Estimator of \(\sigma^2\)

Lower confidence limit (LCL) = \(\frac{(n-1)s^2}{\chi_{\alpha /2}^2}\)

Upper confidence limit (UCL) = \(\frac{(n-1)s^2}{\chi_{1-\alpha /2}^2}\)

12.3 Inference about a Population Proportion

\[\hat p = \frac{x}{n}\]

Test Statistic for p

\[z = \frac{\hat P - p}{\sqrt{p(1-p)/n}}\]

which is approximately normal when np and n(1 − p) are greater than 5.

Confidence Interval Estimator of p

\[\hat p \pm z_{\alpha /2} \sqrt{\hat p (1 - \hat p)/n}\]

Sample Size to Estimate a Proportion

\[n = (\frac{z_{\alpha /2}\sqrt{\hat p (1-\hat p)}}{B})^2\]

\[B = z_{\alpha /2} \sqrt{\frac{\hat p (1-\hat p)}{n}}\]
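
A brief sketch of the interval estimator and sample-size formula for a proportion, again with the standard library only. The function names and the inputs (120 successes in 400 trials, a planning value of 0.5 and a bound of 0.03) are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence=0.95):
    """Confidence interval for p based on x successes in n trials."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def sample_size_for_proportion(p_hat, bound, confidence=0.95):
    """Sample size needed to estimate p to within `bound`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sqrt(p_hat * (1 - p_hat)) / bound) ** 2)

lcl, ucl = proportion_interval(120, 400)
n_needed = sample_size_for_proportion(0.5, 0.03)
print(round(lcl, 4), round(ucl, 4), n_needed)
```

Using 0.5 as the planning value gives the largest possible sample size, which is the conservative choice when nothing is known about p in advance.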

13. Inference about Comparing Two Populations

13.1 Inference about the Difference between two Means: Independent Samples

Sampling Distribution of \(\bar x_{1} - \bar x_{2}\):

\(\bar x_{1} - \bar x_{2}\) is normally distributed if the populations are normal and approximately normal if the populations are nonnormal and the sample sizes are large.

\[E( \bar x_{1} - \bar x_{2} ) = \mu_{1} - \mu_{2}\]

\[V( \bar x_{1} - \bar x_{2} ) = \frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}\]

\[Z = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}}}\]

13.1.1 Test Statistic for \(\mu_{1} - \mu_{2}\) when \(\sigma_{1}^2 = \sigma_{2}^2\)

\[t = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{s_{p}^2(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}\]

where \(s_{p}^2\) is called the pooled variance estimator:

\[s_{p}^2 = \frac{(n_{1} -1)s_{1}^2 + (n_{2} -1)s_{2}^2}{n_{1} + n_{2} - 2}\]

13.1.2 Confidence Interval Estimator of \(\mu_{1} - \mu_{2}\) when \(\sigma_{1}^2 = \sigma_{2}^2\)

\[(\bar x_{1} - \bar x_{2}) \pm t_{\alpha /2}\sqrt{s_{p}^2(\frac{1}{n_{1}} + \frac{1}{n_{2}})}\]

13.1.3 Test Statistic for \(\mu_{1} - \mu_{2}\) when \(\sigma_{1}^2 \ne \sigma_{2}^2\)

\[t = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}}\]

\[\nu = \frac{(s_{1}^2/n_{1} + s_{2}^2/n_{2})^2}{\frac{(s_{1}^2/n_{1})^2}{n_{1}-1} + \frac{(s_{2}^2/n_{2})^2}{n_{2}-1}}\]
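
The pooled variance estimator of 13.1.1 and the degrees of freedom for the unequal-variances statistic are plain arithmetic, sketched below. The function names and the sample values (variances 4 and 9, sample sizes 10 and 15) are made up for illustration.

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled variance estimator, used when the population variances are equal."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def unequal_variances_df(s1_sq, n1, s2_sq, n2):
    """Degrees of freedom for the unequal-variances t statistic."""
    v1, v2 = s1_sq / n1, s2_sq / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

sp_sq = pooled_variance(4, 10, 9, 15)
nu = unequal_variances_df(4, 10, 9, 15)
print(round(sp_sq, 4), round(nu, 2))
```

The degrees of freedom are generally not an integer; when a t table is used, they are rounded to a whole number.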

13.1.4 Confidence Interval Estimator of \(\mu_{1} - \mu_{2}\) when \(\sigma_{1}^2 \ne \sigma_{2}^2\)

\[(\bar x_{1} - \bar x_{2}) \pm t_{\alpha /2}\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}\]

13.1.5 Testing the Population Variances

\(H_{0}\): \(\frac{\sigma_{1}^2}{\sigma_{2}^2} = 1\)
\(H_{1}\): \(\frac{\sigma_{1}^2}{\sigma_{2}^2} \ne 1\)

\[F = \frac{s_{1}^2}{s_{2}^2}\]

\(\nu_{1} = n_{1} - 1\) and \(\nu_{2} = n_{2} - 1\). This is a two-tail test so that the rejection region is \(F \gt F_{\alpha/2, \nu_{1},\nu_{2}}\) or \(F \lt F_{1-\alpha/2, \nu_{1},\nu_{2}}\).

Confidence Interval Estimator of \(\sigma_{1}^2/\sigma_{2}^2\)

\[LCL = \frac{s_{1}^2}{s_{2}^2} \frac{1}{F_{\alpha/2,\nu_{1},\nu_{2}}}\]

\[UCL = \frac{s_{1}^2}{s_{2}^2} F_{\alpha/2,\nu_{2},\nu_{1}}\]

Note that the degrees of freedom are reversed in the upper confidence limit.

13.2 Inference about the Difference between two Means: Matched Pairs Experiment

\(\mu_{D}\) is the mean of the population of differences.

Test Statistic for \(\mu_{D}\)

\[t = \frac{\bar x_{D} - \mu_{D}}{s_{D}/\sqrt{n_{D}}}\]

which is Student t distributed with \(\nu = n_{D} - 1\) degrees of freedom, provided that the differences are normally distributed.

Confidence Interval Estimator of \(\mu_{D}\)

\[\bar x_{D} \pm t_{\alpha/2}\frac{s_{D}}{\sqrt{n_{D}}}\]
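
As a small sketch, the matched pairs test statistic can be computed from the paired observations as follows. The function name and the four pairs of values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def matched_pairs_t(sample1, sample2, mu_d=0):
    """Return the t statistic and degrees of freedom for paired samples."""
    diffs = [x - y for x, y in zip(sample1, sample2)]
    n_d = len(diffs)
    # stdev() is the sample standard deviation (divisor n_d - 1)
    t = (mean(diffs) - mu_d) / (stdev(diffs) / sqrt(n_d))
    return t, n_d - 1

# Hypothetical paired observations; the differences are 2, 4, 6, 8
t_stat, df = matched_pairs_t([12, 15, 11, 18], [10, 11, 5, 10])
print(round(t_stat, 3), df)
```

Only the differences enter the calculation, which is why the required condition concerns the distribution of the differences rather than of the two original populations.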

13.3 Inference about the Difference between two Population Proportions

The statistic \(\hat p_{1} − \hat p_{2}\) is approximately normally distributed provided that the sample sizes are large enough so that \(n_{1}p_{1}\), \(n_{1}(1-p_{1})\), \(n_{2}p_{2}\), and \(n_{2}(1-p_{2})\) are all greater than or equal to 5.

\[E(\hat p_{1} − \hat p_{2}) = p_{1} − p_{2}\]

\[V(\hat p_{1} − \hat p_{2}) = \frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}\]

\[Z = \frac{(\hat p_{1} − \hat p_{2}) - (p_{1} − p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}}}\]

\[\hat p_{1} = \frac{x_{1}}{n_{1}}\] \[\hat p_{2} = \frac{x_{2}}{n_{2}}\]

13.3.1 Test Statistic for \(p_{1} − p_{2}\): Case 1

\(H_{0}\): \(p_{1} − p_{2} = 0\)

\[z = \frac{\hat p_{1} − \hat p_{2}}{\sqrt{\hat p(1-\hat p)(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}\]

\[\hat p = \frac{x_{1} + x_{2}}{n_{1} + n_{2}}\]
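
A sketch of the Case 1 statistic, which uses the pooled proportion because the null hypothesis states that the two proportions are equal. The function name and the counts (60 of 200 versus 40 of 200) are illustrative, and a two-tail alternative is assumed for the p-value.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z statistic and two-tail p-value for H0: p1 - p2 = 0."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    z = (p1 - p2) / sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z(60, 200, 40, 200)
print(round(z, 3), round(p_value, 4))
```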

13.3.2 Test Statistic for \(p_{1} − p_{2}\): Case 2

\(H_{0}\): \(p_{1} − p_{2} = D, D\ne0\)

\[z = \frac{(\hat p_{1} − \hat p_{2}) - D}{\sqrt{\frac{\hat p_{1}(1-\hat p_{1})}{n_{1}} + \frac{\hat p_{2}(1-\hat p_{2})}{n_{2}}}}\]

http://felixfan.github.io/ImageJ

ImageJ introduction

ImageJ is a public domain Java image processing program inspired by NIH Image for the Macintosh. It runs, either as an online applet or as a downloadable application, on any computer with a Java 1.4 or later virtual machine. Downloadable distributions are available for Windows, Mac OS, Mac OS X and Linux.

DPI

By default, the DPI in the JPEG header is set to 72. For a higher value, use a unit of “inch” in the Analyze -> Set Scale dialog (requires v1.40 or later). For example, setting “Distance in Pixels” to 300, “Known Distance” to 1 and “Unit of Length” to “inch” will set the DPI to 300.

The DPI setting is kept only when the image is saved in TIFF format. Saving as PNG resets the DPI to 72.
Note that ImageJ does not read or write the resolution for JPG files, only that of TIFFs.

Color Spaces

  • Grayscale: The simplest color representation has no color at all, just black, white, and shades of gray.
  • RGB: red, green, and blue. RGB is an additive color model — the desired color is created by adding together different amounts of red, green, and blue light.
  • CMYK: Another way to add color to an image is to subtract it. In subtractive color models, each channel represents a pigment absorbing a certain color. CMYK color represents a common color printing process, with cyan, magenta, yellow, and black inks (the K stands for “key”).

Use the Image -> Type submenu to determine the type of the active image or to convert it to another type.

  • 8-bit. Converts to 8-bit grayscale.
  • 8-bit Color. Converts to 8-bit indexed color using Heckbert’s median-cut color quantization algorithm.
  • RGB Color. Converts to 32-bit RGB color.

Image -> Type -> 8-bit will convert color to grayscale.

Crop

Image -> Crop

http://felixfan.github.io/english-email-2

Opening an email

Thank you for contacting us.
Thank you for your prompt reply. 
Thank you for your reply.
Thank you for getting back to me.
Thank you for providing the requested information.
Thank you for all your assistance.
I truly appreciate … your help in resolving the problem.
Thank you for raising your concerns.
Thank you for your feedback.

Closing an email

Thank you for your kind cooperation.
Thank you for your attention to this matter.
Thank you for your understanding.
Thank you for your consideration.
Thank you again for everything you've done.

Other situations

Hope you have a good trip back.
How is the project going? 
I suggest we have a call tonight at 9:30pm. Please let me know if the time is okay for you.
I would like to hold a meeting in the afternoon about our development plan for project A.
We’d like to have the meeting on Thu Oct 30. Same time.
Let’s schedule a meeting for next Monday at 5:30 PM.
I want to talk to you over the phone regarding issues about report development and the XXX project. 
For the next step of platform implementation, I am proposing… 
I suggest we have a weekly project meeting by phone in the near future.
Should you have any problem accessing the folders, please let me know. 
Thank you. I look forward to hearing your opinion on the estimate and schedule.
I look forward to your feedback and suggestions.
What is your opinion on the schedule and next steps we proposed? 
What do you think about this? 
Feel free to give your comments. 
If you have any questions, please don’t hesitate to let me know.
If you have any questions, please let me know.
Please contact me if you have any questions.
Please let me know if you have any questions about this.
Your comments and suggestions are welcome!
Please let me know what you think.
Do you have any idea about this? 
It would be nice if you could provide a bit more information on the user’s behavior. 
At your convenience, I would really appreciate you looking into this matter/issue. 
Please see comments below.
My answers are in blue below.
I have added some comments to the document for your reference.
We would like to finish the following tasks by the end of today:
Some known issues in this release:
Our team here reviewed the newest SCM policy and has the following concerns:
Here are some more questions/issues for your team:
The current status is as follows:
Some items need your attention:
I have some questions about the report 
For the assignment ABC, I have the following questions:
I enclose the evaluation report for your reference.
Attached please find today’s meeting notes.
Attached is the design document. Please review it.
For other known issues related to individual features, please see attached release notes.
Thank you so much for the cooperation.
Thanks for the information.
I really appreciate the effort you all made for this sudden and tight project. 
Thank you for your attention! 
Your kind assistance on this is very much appreciated.
Really appreciate your help! 
I sincerely apologize for this misunderstanding! 
I apologize for asking so late, but we want to confirm the correctness of our implementation ASAP.
http://felixfan.github.io/delete-github-tag

Open a terminal window and navigate to your local clone of the GitHub repository.

git tag -d tagName
git push origin :tagName

If your tag has the same name as one of your branches, use this instead:

git tag -d tagName
git push origin :refs/tags/tagName

You need to replace tagName with the tag name that you want to delete.

http://felixfan.github.io/mapping-snps-to-genes

1. Download the SNP information

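(1) Open the Table Browser of the UCSC Genome Browser.
(2) Select the assembly you need (for example, hg19).
(3) For group, select "Variation".
(4) For track, select the dataset you need (for example, Common SNPs(146)).
(5) For table, select the table you need (for example, snp146Common).
(6) For output format, select 'BED-browser extensible data'.
(7) Click 'get output' to download the data, and save it as 'hg19_commonSNPs146.txt'.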

2. Download the gene information

You can use this gene list: genes that are consistently annotated across Ensembl and Entrez-gene databases, and which have HUGO identifiers.

Note that this dataset uses hg19/GRCh37 coordinates.

3. Extract all SNPs located within genes

(1) Install BEDTools

(2) Extract the SNPs on the autosomes and the X and Y chromosomes, then sort them

awk '$1!~"_" && $1!~"M" {printf("%s\t%d\t%d\t%s\n", $1,$2,$3,$4);}' hg19_commonSNPs146.txt | sort -k1,1 -k2,2n -k3,3n -k4,4  > hg19_snp146_auto_sorted.txt

(3) Extend each gene region by 2000 bp upstream and downstream

awk '{s=$2-2000; if (s<0) s=0; printf("%s\t%d\t%d\t%s\n", "chr"$1,s,$3+2000,$4);}' hugo.txt > hugo_2kb.txt

(4) In the gene file, 23 and 24 denote the X and Y chromosomes. Rename them, then sort the file

sed 's/chr23/chrX/' hugo_2kb.txt > hugo_2kb_v1.txt          
sed 's/chr24/chrY/' hugo_2kb_v1.txt > hugo_2kb_v2.txt          
sort -k1,1 -k2,2n -k3,3n hugo_2kb_v2.txt > hugo_2kb_v2_sorted.txt          

(5) Map the SNPs to genes (this step takes a while)

intersectBed -a hg19_snp146_auto_sorted.txt -b hugo_2kb_v2_sorted.txt -wa -wb | awk '{print $4, $8}' > geneSNPs_2kb.txt


http://felixfan.github.io/statistics-for-management-and-economics-study-notes-2 statistics for management and economics study notes 2

5. Data Collection And Sampling

5.1 Simple Random Sample

A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.

5.2 Stratified Random Sampling

A stratified random sample is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum.

5.3 Cluster Sampling

A cluster sample is a simple random sample of groups or clusters of elements.

5.4 Sampling Error

Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.

5.5 Nonsampling Error

Nonsampling errors result from mistakes made in the acquisition of data or from the sample observations being selected improperly.

  • Errors in data acquisition.
  • Nonresponse error refers to error (or bias) introduced when responses are not obtained from some members of the sample.
  • Selection bias occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample.

6 Probability

6.1 Intersection

The intersection of events A and B is the event that occurs when both A and B occur. The probability of the intersection is called the joint probability.

6.2 Marginal Probability

Marginal probabilities, computed by adding across rows or down columns, are so named because they are calculated in the margins of the table.

6.3 Conditional Probability

The probability of event A given event B is

\[p(A|B) = \frac{p(AB)}{p(B)}\]

The probability of event B given event A is

\[p(B|A) = \frac{p(AB)}{p(A)}\]

6.4 Independence

Two events A and B are said to be independent if

\[p(A|B) = p(A)\]

or

\[p(B|A) = p(B)\]

6.5 Union

The union of events A and B is the event that occurs when either A or B or both occur. It is denoted as A or B.

6.6 Complement Rule

The complement of event A is the event that occurs when event A does not occur.

\[p(A^c) = 1 - p(A)\]

6.7 Multiplication Rule

\[p(AB) = p(A)p(B|A) = p(B)p(A|B)\]

6.8 Addition Rule

The probability that event A, or event B, or both occur is

\[p(A \text{ or } B) = p(A) + p(B) - p(AB)\]

6.9 Bayes’s Law Formula

\[p(A_{i}|B) = \frac{p(A_{i})p(B|A_{i})}{p(A_{1})p(B|A_{1}) + p(A_{2})p(B|A_{2}) + \cdots + p(A_{k})p(B|A_{k})}\]
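
As a rough illustration of Bayes's Law, the sketch below computes the posterior probabilities \(p(A_{i}|B)\) from priors and likelihoods. The function name `bayes_posterior` and the numbers are hypothetical, chosen only for the example.

```python
def bayes_posterior(priors, likelihoods):
    """Return p(A_i | B) for every i, given p(A_i) and p(B | A_i)."""
    # Numerator terms p(A_i) * p(B | A_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: p(B), the sum of all numerator terms
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical example: two events A_1 and A_2 with priors 0.3 and 0.7,
# and likelihoods p(B | A_1) = 0.9, p(B | A_2) = 0.2.
posterior = bayes_posterior([0.3, 0.7], [0.9, 0.2])
print(posterior)
```

The posteriors always sum to 1, which is a quick sanity check on the inputs.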

7. Random Variables and Discrete Probability Distributions

7.1 Describing the Population Probability Distribution

\[E(x) = \mu = \sum xp(x)\]

\[V(x) = \sigma^2 = \sum (x-\mu)^2p(x)\]

7.2 Laws of Expected Value and Variance

\[E(c) = c\] \[E(x + c) = E(x) + c\] \[E(cx) = cE(x)\] \[V(c) = 0\] \[V(x + c) = V(x)\] \[V(cx) = c^2V(x)\]

7.3 Bivariate Distributions

The covariance of two discrete variables is defined as

\[COV(x, y) = \sigma_{xy} = \sum \sum (x - \mu_{x})(y-\mu_{y})p(x, y)\]

Coefficient of Correlation:

\[\rho = \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}\]

7.4 Laws of Expected Value and Variance of the Sum of Two Variables

\[E(x + y) = E(x) + E(y)\]

\[V(x + y) = V(x) + V(y) + 2COV(x, y)\]
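
The law for the variance of a sum, V(x + y) = V(x) + V(y) + 2COV(x, y), can be checked numerically on a small discrete bivariate distribution. The joint probabilities below are made up for the illustration, and `expect` is a hypothetical helper.

```python
# Hypothetical joint distribution p(x, y) over x, y in {0, 1}
joint = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}


def expect(f):
    """Expected value of f(x, y) under the joint distribution."""
    return sum(f(x, y) * p for (x, y), p in joint.items())


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Variance of the sum computed directly from its definition
var_sum = expect(lambda x, y: (x + y - mu_x - mu_y) ** 2)

print(var_sum, var_x + var_y + 2 * cov_xy)
```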

7.5 Mean and Variance of a Portfolio of Two Stocks

\[E(R_{p}) = w_{1}E(R_{1}) + w_{2}E(R_{2})\]

\[V(R_{p}) = w_{1}^2V(R_{1}) + w_{2}^2V(R_{2}) + 2w_{1}w_{2}COV(R_{1}, R_{2}) = w_{1}^2V(R_{1}) + w_{2}^2V(R_{2}) + 2w_{1}w_{2}\rho\sigma_{1}\sigma_{2}\]
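
A minimal sketch of the two-stock formulas follows. The function name `portfolio_mean_var` and the input values are assumptions made for the example, not data from the notes.

```python
from math import sqrt


def portfolio_mean_var(w1, w2, mean1, mean2, var1, var2, rho):
    """Expected return and variance of a two-stock portfolio."""
    mean_p = w1 * mean1 + w2 * mean2
    # COV(R1, R2) = rho * sigma1 * sigma2
    cov = rho * sqrt(var1) * sqrt(var2)
    var_p = w1 ** 2 * var1 + w2 ** 2 * var2 + 2 * w1 * w2 * cov
    return mean_p, var_p


# Hypothetical inputs: equal weights, returns of 10% and 20%,
# variances 0.04 and 0.09, correlation 0.3.
mean_p, var_p = portfolio_mean_var(0.5, 0.5, 0.10, 0.20, 0.04, 0.09, 0.3)
print(mean_p, var_p)
```

Lowering `rho` lowers the portfolio variance while leaving the expected return unchanged, which is the usual argument for diversification.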

7.6 Portfolios with More Than Two Stocks

\[E(R_{p}) = \sum_{i=1}^k w_{i}E(R_{i})\]

\[V(R_{p}) = \sum_{i=1}^k w_{i}^2V(R_{i}) + 2\sum_{i=1}^k \sum_{j=i+1}^k w_{i}w_{j}COV(R_{i}, R_{j})\]

7.7 Binomial Distribution

  • The binomial experiment consists of a fixed number of trials (n).
  • Each trial has two possible outcomes: success or failure.
  • The probability of success is p. The probability of failure is 1 − p.
  • The trials are independent.

The probability of x successes in a binomial experiment with n trials and probability of success = p is

\[p(x) = \frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}\]

7.7.1 Cumulative Probability

\[p(X \le 4) = p(0) + p(1) + p(2) + p(3) + p(4)\]

7.7.2 Binomial Probability p(X ≥ x)

\[p(X \ge x) = 1 - p(X \le (x-1))\]

7.7.3 Binomial Probability P(X = x)

\[p(x) = p(X \le x) - p(X \le (x-1))\]

7.7.4 Mean and Variance of a Binomial Distribution

\[\mu = np\] \[\sigma^2 = np(1-p)\] \[\sigma = \sqrt{np(1-p)}\]
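
The binomial formulas above can be sketched in a few lines of standard-library Python. The helper names `binom_pmf` and `binom_cdf` and the values n = 10, p = 0.2 are chosen only for illustration.

```python
from math import comb


def binom_pmf(x, n, p):
    """p(x): probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """p(X <= x): cumulative probability."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


n, p = 10, 0.2
at_least_3 = 1 - binom_cdf(2, n, p)                  # p(X >= 3)
exactly_3 = binom_cdf(3, n, p) - binom_cdf(2, n, p)  # p(X = 3)
mean = n * p
variance = n * p * (1 - p)
print(at_least_3, exactly_3, mean, variance)
```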

7.8 Poisson Distribution

Like the binomial random variable, the Poisson random variable is the number of occurrences of events, which we’ll continue to call successes. The difference between the two random variables is that a binomial random variable is the number of successes in a set number of trials, whereas a Poisson random variable is the number of successes in an interval of time or specific region of space.

  • The number of successes that occur in any interval is independent of the number of successes that occur in any other interval.
  • The probability of a success in an interval is the same for all equal-size intervals.
  • The probability of a success in an interval is proportional to the size of the interval.
  • The probability of more than one success in an interval approaches 0 as the interval becomes smaller.

The probability that a Poisson random variable assumes a value of x in a specific interval is

\[p(x) = \frac{e^{-\mu}\mu^x}{x!}\]

The variance of a Poisson random variable is equal to its mean; that is,

\[\sigma^2 = \mu\]

\[p(X \ge x) = 1 - p(X \le (x-1))\]

\[p(x) = p(X \le x) - p(X \le (x-1))\]
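
The Poisson probabilities can be computed the same way. This is a small sketch with hypothetical helper names and an arbitrary mean of 2 successes per interval; it also checks numerically that the variance equals the mean.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """p(x): probability of exactly x successes in the interval."""
    return exp(-mu) * mu ** x / factorial(x)


def poisson_cdf(x, mu):
    """p(X <= x): cumulative probability."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))


mu = 2.0
at_least_3 = 1 - poisson_cdf(2, mu)  # p(X >= 3)

# Mean and variance from the definition, truncating the infinite sum at 60
approx_mean = sum(x * poisson_pmf(x, mu) for x in range(60))
approx_var = sum((x - mu) ** 2 * poisson_pmf(x, mu) for x in range(60))
print(at_least_3, approx_mean, approx_var)
```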

8. Continuous Probability Distributions

8.1 Uniform Distribution

\[f(x) = \frac{1}{b-a}, a \le x \le b\]

8.2 Normal Distribution

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\]

8.2.1 Calculating Normal Probabilities

We standardize a random variable by subtracting its mean and dividing by its standard deviation. When the variable is normal, the transformed variable is called a standard normal random variable and denoted by Z; that is,

\[Z = \frac{X - \mu}{\sigma}\]
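
As a sketch of the standardization step, the function below converts X to Z and looks up the cumulative probability with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name and the numbers are hypothetical.

```python
from statistics import NormalDist


def normal_prob_below(x, mu, sigma):
    """p(X < x) for a normal variable, via the standard normal Z."""
    z = (x - mu) / sigma        # standardize
    return NormalDist().cdf(z)  # standard normal: mean 0, sd 1


# Hypothetical example: X is normal with mean 10 and standard deviation 2
prob = normal_prob_below(13, 10, 2)  # same as p(Z < 1.5)
print(prob)
```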

8.3 Exponential Distribution

\[f(x) = \lambda e^{-\lambda x}, x \ge 0\]

\[\mu = \sigma = \frac{1}{\lambda}\]

\[p(X > x) = e^{-\lambda x}\]

\[p(X < x) = 1 - e^{-\lambda x}\]

\[p(x_{1} < X < x_{2}) = p(X < x_{2}) - p(X < x_{1}) = e^{-\lambda x_{1}} - e^{-\lambda x_{2}}\]
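
The exponential probabilities translate directly into code. In this sketch the function names and the rate of 2 are illustrative only.

```python
from math import exp


def expo_greater(x, lam):
    """p(X > x) for an exponential variable with rate lam."""
    return exp(-lam * x)


def expo_between(x1, x2, lam):
    """p(x1 < X < x2) = e^(-lam*x1) - e^(-lam*x2)."""
    return exp(-lam * x1) - exp(-lam * x2)


lam = 2.0
print(expo_greater(1.0, lam), expo_between(0.5, 1.0, lam))
```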

8.4 Student t Distribution

\[f(t)=\frac{\Gamma[(\nu + 1)/2]}{\sqrt{\nu \pi} \Gamma (\nu /2)}[1 + \frac{t^2}{\nu}]^{-(\nu + 1)/2}\]

where \(\nu\) (Greek letter nu) is the parameter of the Student t distribution called the degrees of freedom, and \(\Gamma\) is the gamma function.

\[E(t) = 0\]

\[V(t) = \frac{\nu}{\nu - 2}, \nu \gt 2\]

Student t distribution is similar to the standard normal distribution. Both are symmetrical about 0. We describe the Student t distribution as mound shaped, whereas the normal distribution is bell shaped. As \(\nu\) grows larger, the Student t distribution approaches the standard normal distribution.

8.5 Chi-Squared Distribution

\[f(\chi^2) = \frac{1}{\Gamma(\nu/2)} \frac{1}{2^{\nu/2}}(\chi^2)^{(\nu/2)-1}e^{-\chi^2/2}\]

\[E(\chi^2) = \nu\]

\[V(\chi^2) = 2\nu\]

8.6 F Distribution

\[E(F) = \frac{\nu_{2}}{\nu_{2} - 2}, \nu_{2} \gt 2\]

\[V(F) = \frac{2\nu_{2}^2(\nu_{1} + \nu_{2} -2)}{\nu_{1}(\nu_{2}-2)^2(\nu_{2} -4)}, \nu_{2} \gt 4\]

where \(\nu_{1}\) is the numerator degrees of freedom and \(\nu_{2}\) is the denominator degrees of freedom.

http://felixfan.github.io/numpy numpy
In [1]:
import numpy as np

1. Array Creation

In [2]:
alist = [1,2,3]
arr = np.array(alist) # converting list to ndarray
arr
Out[2]:
array([1, 2, 3])
In [3]:
arr.tolist() # Converting ndarray to list
Out[3]:
[1, 2, 3]
In [4]:
np.zeros(5) # Creating an array of zeros with five elements
Out[4]:
array([ 0.,  0.,  0.,  0.,  0.])
In [5]:
np.arange(10) # Create an ndarray with 10 elements from 0 to 9
Out[5]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [6]:
np.arange(3,8) # from 3 (inclusive) to 8 (exclusive)
Out[6]:
array([3, 4, 5, 6, 7])
In [7]:
np.linspace(0, 5, 9) # start, stop (inclusive), number of points
Out[7]:
array([ 0.   ,  0.625,  1.25 ,  1.875,  2.5  ,  3.125,  3.75 ,  4.375,  5.   ])
In [8]:
np.logspace(0, 5, 10, base=10.0)
Out[8]:
array([  1.00000000e+00,   3.59381366e+00,   1.29154967e+01,
         4.64158883e+01,   1.66810054e+02,   5.99484250e+02,
         2.15443469e+03,   7.74263683e+03,   2.78255940e+04,
         1.00000000e+05])
In [9]:
np.zeros((5,5)) # Creating a 5x5 array of zeros
Out[9]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
In [10]:
np.ones((3,3)) # Creating a 3x3 array of ones
Out[10]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
In [11]:
arr1d = np.arange(12)
arr2d = arr1d.reshape((3,4))
arr2d
Out[11]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [12]:
arr2d = np.reshape(arr1d,(4,3))
arr2d
Out[12]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

2. Indexing and Slicing

In [13]:
alist = [[1,2],[3,4]]
arr = np.array(alist)
In [14]:
arr[0,1]
Out[14]:
2
In [15]:
arr[:,1] # access the last column
Out[15]:
array([2, 4])
In [16]:
arr[1,:] ## access the bottom row.
Out[16]:
array([3, 4])
In [17]:
arr = np.arange(5)
index = np.where(arr > 2) # Creating the index array
new_arr = arr[index] # Creating the desired array
new_arr
Out[17]:
array([3, 4])
In [18]:
new_arr = arr[arr > 2]
new_arr
Out[18]:
array([3, 4])
In [19]:
new_arr = np.delete(arr, index)
new_arr
Out[19]:
array([0, 1, 2])

3. Boolean Statements

In [20]:
img1 = np.zeros((5, 5)) + 3
img1[1:3, 2:4] = 6
img1[3:5, 0:2] = 8
img1
Out[20]:
array([[ 3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  6.,  6.,  3.],
       [ 3.,  3.,  6.,  6.,  3.],
       [ 8.,  8.,  3.,  3.,  3.],
       [ 8.,  8.,  3.,  3.,  3.]])
In [21]:
# filter out all values larger than 3 and less than 7
index1 = img1 > 3
index2 = img1 < 7
compoundindex = index1 & index2
img2 = np.copy(img1)
img2[compoundindex] = 0
img2
Out[21]:
array([[ 3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  0.,  0.,  3.],
       [ 3.,  3.,  0.,  0.,  3.],
       [ 8.,  8.,  3.,  3.,  3.],
       [ 8.,  8.,  3.,  3.,  3.]])
In [22]:
index3 = (img1==8)
compoundindex2 = compoundindex | index3
img3 = np.copy(img1)
img3[compoundindex2] = 0
img3
Out[22]:
array([[ 3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  0.,  0.,  3.],
       [ 3.,  3.,  0.,  0.,  3.],
       [ 0.,  0.,  3.,  3.,  3.],
       [ 0.,  0.,  3.,  3.,  3.]])

4. Read and Write Data

In [23]:
arr = np.loadtxt('dat1.txt')
arr
Out[23]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])
In [24]:
np.savetxt('newdat1.txt', arr, delimiter=',', fmt='%.2f')
In [25]:
arr = np.loadtxt('dat2.txt', 
                 dtype={'names':('name', 'weight','unit'),
                'formats':('S5', 'f2','S2')})
arr
Out[25]:
array([('apple', 2.0, 'kg'), ('pear', 3.30078125, 'kg')], 
      dtype=[('name', 'S5'), ('weight', '<f2'), ('unit', 'S2')])

5. Linear Algebra

In [26]:
A = np.matrix([[3,6,-5],
              [1,-3,2],
              [5,-1,4]])
B = np.matrix([[12],
              [-2],
              [10]])
x = A**(-1)*B
x
Out[26]:
matrix([[ 1.75],
        [ 1.75],
        [ 0.75]])
In [27]:
a = np.array([[3,6,-5],
              [1,-3,2],
              [5,-1,4]])
b = np.array([12, -2, 10])
x = np.linalg.inv(a).dot(b)
x
# Although both methods work, use numpy.array whenever possible
Out[27]:
array([ 1.75,  1.75,  0.75])

6. Statistics

In [28]:
x = np.random.randn(1000)
In [29]:
x.mean()
Out[29]:
-0.012672228653646077
In [30]:
x.std()
Out[30]:
1.0374526578439356
In [31]:
x.var()
Out[31]:
1.0763080172674462
In [32]:
np.median(x)
Out[32]:
-0.027887390949930368
In [33]:
np.mean(x)
Out[33]:
-0.012672228653646077
http://felixfan.github.io/statistics-for-management-and-economics-study-notes-1 statistics for management and economics study notes 1

1. What is Statistics?

  • A population is the group of all items of interest to a statistics practitioner. A descriptive measure of a population is called a parameter.
  • A sample is a set of data drawn from the studied population. A descriptive measure of a sample is called a statistic.

2. Graphical Descriptive Techniques I

2.1 Types of Data and Information

  • Interval data are real numbers, such as heights, weights, incomes, and distances. We also refer to this type of data as quantitative or numerical.
  • The values of nominal data are categories. The values are not numbers but words that describe the categories. Nominal data are also called qualitative or categorical.
  • Ordinal data appear to be nominal, but the difference is that the order of their values has meaning.

2.2 Describing a Set of Nominal Data

  • A bar chart is often used to display frequencies;
  • A pie chart graphically shows relative frequencies.
  • The bar chart focuses on the frequencies and the pie chart focuses on the proportions.

3. Graphical Descriptive Techniques II

3.1 Describing a Set of Interval Data

3.1.1 Histogram

A histogram is created by drawing rectangles whose bases are the intervals and whose heights are the frequencies.

3.1.1.1 Determining the Number of Class Intervals

\[NumberOfClassIntervals = 1 + 3.3 \log_{10}(n)\]

\[ClassWidth = \frac{LargestObservation - SmallestObservation}{NumberOfClasses}\]

3.1.2 Stem-and-Leaf Display

The first step in developing a stem-and-leaf display is to split each observation into two parts, a stem and a leaf. There are several different ways of doing this. For example, the number 12.3 can be split so that the stem is 12 and the leaf is 3. Another method can define the stem as 1 and the leaf as 2 (ignoring the 3). After each stem, we list that stem’s leaves, usually in ascending order. The advantage of the stem-and-leaf display over the histogram is that we can see the actual observations.

Stem  Leaf
0     000000000111112222223333345555556666666778888999999
1     000001111233333334455555667889999
2     0000111112344666778999
3     001335589
4     124445589
5     022224556789

3.2 Describing time-series Data

Besides classifying data by type, we can classify them according to whether the observations are measured at the same time (cross-sectional data) or represent measurements at successive points in time (time-series data). Time-series data are often graphically depicted on a line chart, which plots the value of the variable on the vertical axis and the time periods on the horizontal axis.

3.3 Describing the Relationship between Two Interval Data

In applications where one variable depends to some degree on the other variable, we label the dependent variable Y and the other, called the independent variable, X. In interpreting the results of a scatter diagram it is important to understand that if two variables are linearly related it does not mean that one is causing the other. Correlation is not causation.

4. Numerical Descriptive Techniques

4.1 Measure of Central Location

4.1.1 Arithmetic Mean

\[\mu=\frac{\sum_{i=1}^Nx_{i}}{N}\]

\[\bar x=\frac{\sum_{i=1}^nx_{i}}{n}\]

4.1.2 Median

The median is calculated by placing all the observations in order (ascending or descending). The observation that falls in the middle is the median. When there is an even number of observations, the median is determined by averaging the two observations in the middle.

4.1.3 Mode

The mode is defined as the observation (or observations) that occurs with the greatest frequency.

4.1.4 Mean, Median, Mode: Which Is Best?

The mean is generally our first selection. One advantage the median holds is that it is not as sensitive to extreme values as is the mean. The mode is seldom the best measure of central location.

4.1.5 Geometric mean

\[(1+R_{g})^n=(1+R_{1})(1+R_{2})\cdots(1+R_{n})\]

4.2 Measures of Variability

4.2.1 Range

Range = Largest observation − Smallest observation

4.2.2 Variance

\[\sigma^2=\frac{\sum_{i=1}^N(x_{i}-\mu)^2}{N}\]

\[s^2=\frac{\sum_{i=1}^n(x_{i}-\bar x)^2}{n-1}\]

4.2.3 Standard Deviation

\[\sigma = \sqrt{\sigma ^2}\] \[s = \sqrt{s ^2}\]

4.2.4 Chebysheff’s Theorem

The proportion of observations in any sample or population that lie within k standard deviations of the mean is at least

\[1 - \frac{1}{k^2}, k>1\]
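
To make the bound concrete, the sketch below compares Chebysheff's lower bound with the actual proportion of observations within k standard deviations of the mean. The data set is a small illustrative sample, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def chebysheff_bound(k):
    """Minimum proportion within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2


data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
mu = mean(data)
sigma = pstdev(data)  # treats the data as a whole population

k = 2
within = sum(abs(x - mu) <= k * sigma for x in data) / len(data)
print(chebysheff_bound(k), within)
```

The observed proportion can be much higher than the bound, because the theorem holds for any distribution shape.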

4.2.5 Coefficient of Variation

\[CV = \frac{\sigma}{\mu}\]

\[cv = \frac{s}{\bar x}\]

4.3 Measures of Relative Standing and Box Plots

4.3.1 Percentile

The \(P_{th}\) percentile is the value for which P percent of the observations are less than that value and (100 – P)% are greater than that value.

4.3.2 Locating Percentiles

\[L_{p} = (n+1)\frac{P}{100}\]

where \(L_{p}\) is the location of the \(P_{th}\) percentile.

Placing the 10 observations in ascending order we get

0 0 5 7 8 9 12 14 22 33

The location of the 25th percentile is

\[L_{25} = (10+1)\frac{25}{100} = 2.75\]

The \(25_{th}\) percentile is three-quarters of the distance between the second (which is 0) and the third (which is 5) observations. Three-quarters of the distance is

(.75)(5 − 0) = 3.75

Because the second observation is 0, the \(25_{th}\) percentile is 0 + 3.75 = 3.75.
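
The worked example above can be reproduced with a short function. This is a sketch of the location rule \(L_{p} = (n+1)\frac{P}{100}\) with linear interpolation between neighboring observations; the function name and the handling of locations outside the data are assumptions of the sketch.

```python
def percentile(sorted_values, p):
    """Percentile by the (n + 1) * P / 100 location rule."""
    n = len(sorted_values)
    loc = (n + 1) * p / 100  # 1-based location, possibly fractional
    k = int(loc)             # position of the lower observation
    frac = loc - k           # fraction of the distance to the next one
    if k < 1:                # location before the first observation
        return sorted_values[0]
    if k >= n:               # location at or beyond the last observation
        return sorted_values[-1]
    lower = sorted_values[k - 1]
    upper = sorted_values[k]
    return lower + frac * (upper - lower)


data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
print(percentile(data, 25))  # location 2.75, between 0 and 5
```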

4.3.3 Interquartile Range

\[InterquartileRange = Q_{3} - Q_{1}\]

4.3.4 Box Plots

This technique graphs five statistics: the minimum and maximum observations, and the first, second, and third quartiles. The three vertical lines of the box are the first, second, and third quartiles. The lines extending to the left and right are called whiskers. Any points that lie outside the whiskers are called outliers. The whiskers extend outward to the smaller of 1.5 times the interquartile range or to the most extreme point that is not an outlier.

4.4 Measures of Linear Relationship

4.4.1 Covariance

\[\sigma_{xy} = \frac{\sum_{i=1}^N(x_{i}-\mu_{x})(y_{i}-\mu_{y})}{N}\]

\[s_{xy} = \frac{\sum_{i=1}^n(x_{i}-\bar x)(y_{i}-\bar y)}{n-1}\]

4.4.2 Coefficient of Correlation

\[\rho=\frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}\]

\[r=\frac{s_{xy}}{s_{x}s_{y}}\]

4.4.3 Least Squares Method

\[\hat y = b_{0} + b_{1}x\]

The coefficients \(b_{0}\) and \(b_{1}\) are derived using calculus so that we minimize the sum of squared deviations:

\[\sum_{i=1}^n(y_{i}-\hat{y_{i}})^2\]

Least Squares Line Coefficients:

\[b_{1} = \frac{s_{xy}}{s_{x}^2}\] \[b_{0} = \bar y - b_{1}\bar x\]
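
A minimal sketch of the least squares coefficients, computed from the sample covariance and the sample variance of x. The function name and the data points are hypothetical.

```python
def least_squares(xs, ys):
    """Return (b0, b1) of the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance s_xy and sample variance s_x^2 (divisor n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```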

4.4.4 Coefficient of Determination

Coefficient of determination \(R^2\) is calculated by squaring the coefficient of correlation. The coefficient of determination measures the amount of variation in the dependent variable that is explained by the variation in the independent variable.

4.4.5 Interpreting Correlation

Correlation is not Causation

http://felixfan.github.io/bash-cmp

1. Comparing integers

a=5
b=4
c=5
if [ $a -ne $b ]; then
    echo "$a is not equal to $b"
else
    echo "$a is equal to $b"
fi
5 is not equal to 4
if [ $a -lt $b ]; then
    echo "$a is less than $b"
else
    echo "$a is not less than $b"
fi
5 is not less than 4
if [ $a -gt $b ]; then
    echo "$a is greater than $b"
else
    echo "$a is not greater than $b"
fi
5 is greater than 4
if [ $a -ge $b ]; then
    echo "$a is greater than or equal to $b"
else
    echo "$a is less than $b"
fi
5 is greater than or equal to 4
if [ $a -le $c ]; then
    echo "$a is less than or equal to $c"
else
    echo "$a is greater than $c"
fi
5 is less than or equal to 5
if (($a != $b )); then
    echo "$a is not equal to $b"
else
    echo "$a is equal to $b"
fi
5 is not equal to 4
if (($a < $b)); then
    echo "$a is less than $b"
else
    echo "$a is not less than $b"
fi
5 is not less than 4
if (($a > $b)); then
    echo "$a is greater than $b"
else
    echo "$a is not greater than $b"
fi
5 is greater than 4
if (($a >= $b)); then
    echo "$a is greater than or equal to $b"
else
    echo "$a is less than $b"
fi
5 is greater than or equal to 4
if (($a <= $c)); then
    echo "$a is less than or equal to $c"
else
    echo "$a is greater than $c"
fi
5 is less than or equal to 5

2. Comparing decimal numbers

e=20.0
d=100.50
awk -v a=0.7 -v b=0.5 'BEGIN{print(a>b)?"a is big":"b is big"}'
a is big
c=`echo "$d > $e" | bc`
if [ $c -eq 1 ]; then
    echo "$d is greater than $e"
else
    echo "$d is less than or equal to $e"
fi
100.50 is greater than 20.0

3. Comparing strings

s1='a'
s2='b'
s3='ac'
if [ $s1 == $s2 ]; then
    echo "$s1 is equal to $s2"
else
    echo "$s1 is not equal to $s2"
fi
a is not equal to b
if [ $s1 != $s3 ]; then
    echo "$s1 is not equal to $s3"
else
    echo "$s1 is equal to $s3"
fi
a is not equal to ac
if [ $s1 \< $s2 ]; then
    echo "$s1 is less than $s2"
elif [[ $s1 > $s2 ]]; then
    echo "$s1 is greater than $s2"
else
    echo "$s1 is equal to $s2"
fi
a is less than b
if [[ $s1 < $s3 ]]; then
    echo "$s1 is less than $s3"
elif [ $s1 \> $s3 ]; then
    echo "$s1 is greater than $s3"
else
    echo "$s1 is equal to $s3"
fi
a is less than ac
http://felixfan.github.io/scrapy-simple-example

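0. Install scrapy

conda install scrapy # assumes Anaconda is already installed

1. Create a new project

scrapy startproject njupt # njupt is the project name; choose any name you like

After running the command above, a new directory named njupt appears in the directory where the command was run, with the following structure: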

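|---- njupt
| |---- njupt
|   |---- __init__.py
|   |---- items.py        # data structures for the scraped data (dict-like)
|   |---- pipelines.py    # post-processing of scraped items, e.g. storing them in a database
|   |---- settings.py     # spider settings
|   |---- spiders         # directory for the spider files you create (the crawler itself)
|     |---- __init__.py
| |---- scrapy.cfg        # project configuration file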

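The project is now created.

2. Set up items.py

This article scrapes the NJUPT (Nanjing University of Posts and Telecommunications) news site as an example. Three pieces of information need to be stored:

  • the news title
  • the news date
  • the link to the full news article

The contents of items.py are as follows: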

# -*- coding: utf-8 -*-

import scrapy

class NjuptItem(scrapy.Item):   # NjuptItem is the auto-generated class name
    news_title = scrapy.Field() # news title
    news_date = scrapy.Field()  # news date
    news_url = scrapy.Field()   # link to the full news article

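3. Write the spider

The spider is the core of the crawler. It handles requests, responses, and URLs, and then hands the results to the pipelines for further processing. After setting up the items, create a new file njuptSpider.py in the spiders directory with the following content: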

# -*- coding: utf-8 -*-

import scrapy
from njupt.items import NjuptItem
import logging

class njuptSpider(scrapy.Spider):
    name = "njupt"
    allowed_domains = ["njupt.edu.cn"]
    start_urls = [
        "http://news.njupt.edu.cn/s/222/t/1100/p/1/c/6866/i/1/list.htm",
        ]
    
    def parse(self, response):
        news_page_num = 14
        page_num = 386
        if response.status == 200:
            for i in range(2,page_num+1):
                for j in range(1,news_page_num+1):
                    item = NjuptItem() 
                    item['news_url'],item['news_title'],item['news_date'] = response.xpath(
                    "//div[@id='newslist']/table[1]/tr["+str(j)+"]//a/font/text()"
                    "|//div[@id='newslist']/table[1]/tr["+str(j)+"]//td[@class='postTime']/text()"
                    "|//div[@id='newslist']/table[1]/tr["+str(j)+"]//a/@href").extract()
                  
                    yield item
                    
                next_page_url = "http://news.njupt.edu.cn/s/222/t/1100/p/1/c/6866/i/"+str(i)+"/list.htm"
                yield scrapy.Request(next_page_url,callback=self.parse_news)
        
    def parse_news(self, response):
        news_page_num = 14
        if response.status == 200:
            for j in range(1,news_page_num+1):
                item = NjuptItem()
                item['news_url'],item['news_title'],item['news_date'] = response.xpath(
                "//div[@id='newslist']/table[1]/tr["+str(j)+"]//a/font/text()"
                "|//div[@id='newslist']/table[1]/tr["+str(j)+"]//td[@class='postTime']/text()"
                "|//div[@id='newslist']/table[1]/tr["+str(j)+"]//a/@href").extract()
                yield item

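Where:

  • name is the spider's name; it is used later in the command that starts the spider.
  • allowed_domains is the set of domains the spider may crawl (links outside them are not followed).
  • start_urls gives the first URL the spider visits on startup; the response is automatically passed to the parse function.
  • parse is a callback defined by the scrapy framework for handling the response to the start_urls request; we implement it ourselves.
  • news_page_num = 14 and page_num = 386 are the number of news items per page and the total number of pages. These could also be scraped with XPath, but the HTML of our university's site is such a jumble that I took the lazy route and entered them by hand.
  • Next, item = NjuptItem() creates an instance of the item defined earlier to store the news URL, title, and date. (A small trick here: joining XPath expressions with | returns several targets in a single query.)
  • yield item hands the stored item to the pipelines for further processing.
  • After that, next_page_url is built and passed to scrapy.Request to fetch the next page of news.
  • scrapy.Request takes two arguments here: the URL to request and a callback that handles the response to that request; our callback is parse_news.
  • parse_news does much the same as parse. You could also rework parse and use it as the callback directly, so that a single function is enough.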

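4. Write pipelines.py

To begin with, you can edit the pipelines.py file in the njupt directory directly. Pipelines are mainly used for further processing of the data, such as type conversion, storing it in a database, or writing it to a local file. A pipeline is called after every yield item in the spider and processes each item individually. The code below creates a local file njupt.txt to store the scraped content.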

import sys
import json

reload(sys) 
sys.setdefaultencoding('utf-8') # needed to read and write Chinese text (Python 2 only)

class NjuptPipeline(object):
    def __init__(self):
        self.file = open('njupt.txt','w')
    def process_item(self, item, spider):
        self.file.write(item['news_title'])
        self.file.write("\n")
        self.file.write(item['news_date'])
        self.file.write("\n")
        self.file.write(item['news_url'])
        self.file.write("\n")
        return item

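5. Write settings.py

The settings.py file stores the spider's configuration. Many options are available, but since this is an introductory tutorial we only need to register the pipeline we just wrote. The file contents are as follows.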

BOT_NAME = 'njupt'

SPIDER_MODULES = ['njupt.spiders']
NEWSPIDER_MODULE = 'njupt.spiders'


ITEM_PIPELINES = {
    'njupt.pipelines.NjuptPipeline':1,
}

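6. Start the spider and check the results

Once all the steps above are done, open a command line, change to the spiders directory of njupt, and start the spider with the following command:

scrapy crawl njupt

After a period of frantic crawling, the spider finishes.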

http://felixfan.github.io/python-magic

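1. Do not use mutable objects as default argument values

Objects such as dicts, sets, and lists are unsuitable as default argument values, because the default is created once, when the function is defined, and every call reuses that same "cached" object.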
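
The pitfall and the usual workaround can be sketched as follows. The function names `append_bad` and `append_good` are hypothetical, and using None as a sentinel is a common convention rather than something the original note prescribes.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, at function definition time,
    # so every call without `bucket` shares the same list.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # None is a safe sentinel: a fresh list is created on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_bad(1)
second = append_bad(2)  # returns the same list object as `first`
print(first, second)

print(append_good(1), append_good(2))
```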

2. Modifying a list while looping over it

b = [2, 4, 5, 6]

for i in b:
  if not i % 2:
    b.remove(i)

In: b
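Out: [4, 5] # the result I wanted was the list with the even numbers removed
# this happens because calling remove on the list shifts its indexes
# after 2 is removed the list is [4, 5, 6], so the loop moves on to index 1 and finds 5, skipping 4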
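
One way to avoid the problem is to build a new list instead of removing items during iteration. This is a sketch of one common approach, not the only fix.

```python
b = [2, 4, 5, 6]

# Build a new list that keeps only the odd numbers
odds = [i for i in b if i % 2]
print(odds)

# To update the original list object in place, assign to a full slice
b[:] = [i for i in b if i % 2]
print(b)
```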

3. IndexError: indexing past the end of a list

my_list = [1, 2, 3, 4, 5]

In: my_list[5] # there is no such element: IndexError

In: my_list[5:] # but slicing works; take care, because this is a handy trick when used well and a trap when used badly
Out: []

4. List + and +=, append and extend

>>> myList = [1,2,3,4]
>>> print myList
[1, 2, 3, 4]

>>> myList + [1]
[1, 2, 3, 4, 1]
>>> print myList
[1, 2, 3, 4]
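### the original list is unchanged

>>> myList += [1]
>>> print myList
[1, 2, 3, 4, 1]
### appended to the original list in place

>>> myList.append(2)
>>> print myList
[1, 2, 3, 4, 1, 2]
### appended to the original list in place

>>> myList.extend([9])
>>> print myList
[1, 2, 3, 4, 1, 2, 9]
### appended to the original list in place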

5. The difference between '==' and is

'is' compares the identity of two objects; '==' compares their values.

# But there is a special case
In: a = float('nan')

In: print('a is a,', a is a)
Out:('a is a,', True)

In: print('a == a,', a == a)
Out: ('a == a,', False) # quite a surprise

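6. Shallow copy and deep copy

For objects such as dicts and lists, plain assignment creates a reference. In practice we often want to modify a list without changing the original. A shallow copy copies only the parent object, whereas a deep copy also copies the child objects inside it.

In [65]: list1 = [1, 2]
In [66]: list2 = list1 # just a reference: modifying list2 also changes list1

In [67]: list3 = list1[:] # shallow copy

In [69]: import copy
In [70]: list4 = copy.copy(list1) # shallow copy; modifying list3 or list4 does not affect list1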

In [71]: id(list1), id(list2), id(list3), id(list4)
Out[71]: (4480620232, 4480620232, 4479667880, 4494894720)


# Now compare deep copy with shallow copy

In [88]: from copy import copy, deepcopy

In [89]: list1 = [[1], [2]]

In [90]: list2 = copy(list1) # still a shallow copy

In [91]: list3 = deepcopy(list1) # deep copy

In [92]: id(list1), id(list2), id(list3)
Out[92]: (4494896592, 4495349160, 4494896088)

In [93]: list2[0][0] = 3

In [94]: print('list1:', list1)
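('list1:', [[3], [2]]) # see: modifying a child object through the shallow copy still affects the original, just like a reference

In [95]: list3[0][0] = 5

In [96]: print('list1:', list1)
('list1:', [[3], [2]]) # the deep copy does not affect the original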

7. bool is actually a subclass of int

In [97]: isinstance(True, int)
Out[97]: True

In [98]: True + True
Out[98]: 2

In [99]: 3 * True + True
Out[99]: 4

In [100]: 3 * True - False
Out[100]: 3

8. Are tuples really immutable?

In [111]: tup = ([],)
In [112]: tup[0] += [1]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-112-d4f292cf35de> in <module>()
----> 1 tup[0] += [1]

TypeError: 'tuple' object does not support item assignment

In [113]: tup
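Out[113]: ([1],) # an exception was raised, yet the list was still modified?

In [114]: tup = ([],)
In [115]: tup[0].extend([1])
In [116]: tup[0]
Out[116]: [1] # I see: I cannot assign to the tuple's items directly, but nothing stops me from mutating a mutable child object (a list) inside it

In [117]: my_tup = (1,)
In [118]: my_tup += (4,)
In [119]: my_tup = my_tup + (5,)
In [120]: my_tup
Out[120]: (1, 4, 5) # hmm, weren't tuples supposed to be immutable?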
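
A short sketch of what happens in both cases: `tup[0] += [1]` first extends the list in place and only then fails when assigning the result back into the tuple, while `my_tup += (4,)` builds a new tuple and rebinds the name, leaving the old tuple untouched. The variable names `error` and `original` are introduced only for this illustration.

```python
tup = ([],)
try:
    tup[0] += [1]          # extends the list, then fails on item assignment
except TypeError as exc:
    error = exc
print(tup)                 # the list inside the tuple was already extended

my_tup = (1,)
original = my_tup          # keep a second reference to the old tuple
my_tup += (4,)             # creates a new tuple and rebinds the name
print(original, my_tup)    # the old tuple is unchanged
print(original is my_tup)  # False: two different objects
```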

9. Enumerate

>>> myList=['a','b','c']
>>> for i, item in enumerate(myList):
...  print i, item
... 
0 a
1 b
2 c
>>> list(enumerate('abc')) 
[(0, 'a'), (1, 'b'), (2, 'c')]
>>> list(enumerate('abc', 1)) 
[(1, 'a'), (2, 'b'), (3, 'c')]

10. List, dict, and set comprehensions

>>> my_list = [i for i in xrange(3)]
>>> print my_list
[0, 1, 2]
>>> my_dict = {i: i * i for i in xrange(3)} 
>>> print my_dict
{0: 0, 1: 1, 2: 4}
>>> my_set = {i * 15 for i in xrange(3)}
>>> print my_set
set([0, 30, 15])

11. Forcing floating-point division

a = 3
b = 4
result = 1.0 * a / b

12. A simple server

A quick and convenient way to share the files in a directory.

# Python2
python -m SimpleHTTPServer

# Python 3
python3 -m http.server

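If your IP address is 147.8.103.234, anyone who can reach that address can access the shared folder at http://147.8.103.234:8000/.

13. Simplifying if statements

To check a value against several candidates, you can write:

if n in [1,4,5,6]:

instead of:

if n==1 or n==4 or n==5 or n==6: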

14. Reversing a string or list

>>> a = [1,2,3,4]
>>> a[::-1]
[4, 3, 2, 1]

# This creates a new reversed list. 
# If you want to reverse a list in place you can do:

>>> a.reverse()
>>> a = 'hello world'
>>> a[::-1]
'dlrow olleh'

15. Ternary operator

The ternary operator is a shorthand for an if-else statement, also known as a conditional expression.

>>> x, y = 1, 2
>>> min = x if x < y else y
>>> max = x if x > y else y
>>> print min, max
1 2

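16. Optimizing loops

Do not do inside a loop what can be done outside it.

17. Ordering the conditions in a compound test

With and, put the condition that is least often true first; with or, put the condition that is most often true first.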
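
The reason is short-circuit evaluation: and stops at the first false operand, and or stops at the first true one. The sketch below counts how often each check runs; the two check functions are hypothetical stand-ins for a cheap, rarely true test and a more expensive one.

```python
calls = []


def cheap_rarely_true(n):
    calls.append("cheap")
    return n % 100 == 0


def costly_check(n):
    calls.append("costly")
    return sum(range(n)) % 2 == 0


# The rarely true check goes first, so the costly one is usually skipped
hits = [n for n in range(1, 201) if cheap_rarely_true(n) and costly_check(n)]

print(hits)
print(calls.count("cheap"), calls.count("costly"))
```

With the operands swapped, `costly_check` would run for all 200 values instead of only the two that pass the first test.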

18. Use join to concatenate the strings in an iterable
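
A minimal illustration of `str.join`; the word list is made up for the example. Every element must already be a string, so other types are converted first.

```python
words = ["statistics", "for", "management"]
sentence = " ".join(words)
print(sentence)

# Non-string items must be converted before joining
csv_line = ",".join(str(n) for n in range(5))
print(csv_line)
```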

19. Swapping two variables without a temporary variable

a, b = b, a

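20. Use if is

Using if is True is nearly twice as fast as if == True.

21. Use chained comparisons: x < y < z

x < y < z is slightly more efficient and also more readable.

22. while 1 is faster than while True

23. Use ** instead of pow

** is more than 10 times faster!

24. Counting with a Counter object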

>>> from collections import Counter
>>> c = Counter("hello world")
>>> c
Counter({"l": 3, "o": 2, " ": 1, "e": 1, "d": 1, "h": 1, "r": 1, "w": 1})
>>> c.most_common(2)
[("l", 3), ("o", 2)]

25. Using Python 3 style print and division in Python 2

from __future__ import print_function
from __future__ import division
http://felixfan.github.io/Kaplan-Meier-Curves

NCCTG Lung Cancer Data

Description: Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

library(survival)
data(lung)
head(lung)
  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0
inst:       Institution code
time:       Survival time in days
status:     censoring status 1=censored, 2=dead
age:        Age in years
sex:        Male=1 Female=2
ph.ecog:    ECOG performance score (0=good 5=dead)
ph.karno:   Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno:  Karnofsky performance score as rated by patient
meal.cal:   Calories consumed at meals
wt.loss:    Weight loss in last six months

# Kaplan-Meier Analysis

Estimate survival-function

Global Estimate

km.as.one <- survfit(Surv(time, status) ~ 1,  
                     type="kaplan-meier", 
                     conf.type="log", 
                     data=lung)

Separate estimates by sex

km.by.sex <- survfit(Surv(time, status) ~ sex,  
                     type="kaplan-meier", 
                     conf.type="log", data=lung)

Plot estimated survival function

plot(km.as.one, main="Kaplan-Meier estimate with CI", 
     xlab="Survival time in days", 
     ylab="Survival probability", lwd=2)

plot(km.by.sex, main="Kaplan-Meier estimate by sex", 
     xlab="Survival time in days", 
     ylab="Survival probability", 
     lwd=2, col = c("red","blue"))
legend(x="topright", col=c("red","blue"), lwd=2, 
       legend=c("male","female"))

Plot cumulative incidence function

plot(km.by.sex, main="Kaplan-Meier cumulative incidence by sex", 
     xlab="Survival time in days", ylab="Cumulative incidence", 
     lwd=2, col = c("red","blue"),
     fun = function(x){1-x})
legend(x="bottomright", col=c("red","blue"), 
       lwd=2, legend=c("male","female"))

Plot cumulative hazard

plot(km.as.one, main="Kaplan-Meier estimate", 
     xlab="Survival time in days", 
     ylab="Cumulative hazard", lwd=2,
     fun="cumhaz")

Log-rank-test for equal survival-functions

With rho = 0 (default) this is the log-rank or Mantel-Haenszel test, and with rho = 1 it is equivalent to the Peto & Peto modification of the Gehan-Wilcoxon test.

survdiff(Surv(time, status) ~ sex, data=lung)
Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 138      112     91.6      4.55      10.3
sex=2  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.00131 
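The (O-E)^2/V column is the actual test statistic: observed minus expected events in one group, squared and divided by the hypergeometric variance. A minimal two-group Python sketch of the rho = 0 case follows. The function name logrank_chisq, the 0/1 group and event coding, and the toy data are assumptions for illustration, not the survival package's implementation.

```python
def logrank_chisq(times, events, groups):
    """Two-group log-rank chi-square (1 degree of freedom).

    events : 1 = event observed, 0 = censored
    groups : 0 or 1 for each subject
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    observed = expected = variance = 0.0
    for t in event_times:
        # Risk set: everyone still under observation just before time t.
        n = sum(1 for ti in times if ti >= t)
        n0 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 0)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d0 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 0)
        observed += d0
        expected += d * n0 / n
        if n > 1:
            # Hypergeometric variance of the group-0 event count at time t.
            variance += d * (n0 / n) * (1 - n0 / n) * (n - d) / (n - 1)
    return (observed - expected) ** 2 / variance


# Toy data: group 0 fails early, group 1 fails late.
chisq = logrank_chisq([1, 2, 3, 4], [1, 1, 1, 1], [0, 0, 1, 1])
print(round(chisq, 4))
```

The sketch raises ZeroDivisionError when the variance is zero (for example, when only one group is present), a case a real implementation would handle explicitly.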

http://felixfan.github.io/gdp-animation

1. Simple Pie Chart

GDP data was downloaded from here.

setwd("/Users/alicefelix/Desktop/gdp")
dat <- read.table("GDP1970_2014.txt",header = TRUE)
for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  png(fn)
  pie(df3$GDP, labels = df3$Country, main=paste("GDP",i,sep=" "), col=rainbow(length(df3$Country)))
  dev.off()
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014.gif")
system("rm *png")
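The otherGDP line works because df still contains the World row: sum(df$GDP) counts the world total once plus every listed country, so World - (sum - World) = 2 * World - sum is the share left for unlisted countries. A small Python sketch of the same arithmetic, using made-up numbers and a hypothetical add_others helper:

```python
def add_others(rows):
    """rows: list of (country, gdp) pairs that includes one 'World' row.

    Returns the listed countries plus an 'Others' row holding the remainder.
    """
    world = dict(rows)["World"]
    total = sum(gdp for _, gdp in rows)   # world total + listed countries
    others = 2 * world - total            # same as world - (total - world)
    listed = [(country, gdp) for country, gdp in rows if country != "World"]
    return listed + [("Others", others)]


# Made-up figures, for illustration only.
print(add_others([("World", 100), ("A", 30), ("B", 20)]))
```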

2. Pie Chart with Annotated Percentages

for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  pct <- round(df3$GDP/sum(df3$GDP)*100)
  pct <- paste(pct,"%", sep="")
  lbls <- paste(df3$Country, pct, sep=" ")
  png(fn)
  pie(df3$GDP, labels = lbls, main=paste("GDP",i,sep=" "), col=rainbow(length(df3$Country)))
  dev.off()
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014v2.gif")
system("rm *png")
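The label construction in this loop (share of the total, rounded to a whole percent, pasted after the country name) can be sketched in Python as follows. The helper name pct_labels and the figures are illustrative assumptions.

```python
def pct_labels(rows):
    """rows: list of (country, gdp); returns labels such as 'A 30%'."""
    total = sum(gdp for _, gdp in rows)
    return [f"{country} {round(gdp / total * 100)}%" for country, gdp in rows]


print(pct_labels([("A", 30), ("B", 20), ("Others", 50)]))
```

Because each share is rounded independently, the displayed percentages need not add up to exactly 100.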

3. 3D Pie Chart

library(plotrix)
for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  pct <- round(df3$GDP/sum(df3$GDP)*100)
  pct <- paste(pct,"%", sep="")
  lbls <- paste(df3$Country, pct, sep=" ")
  png(fn)
  pie3D(df3$GDP, labels = lbls, main=paste("GDP",i,sep=" "), col=rainbow(length(df3$Country)), labelcex = 0.8)
  dev.off()
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014v3.gif")
system("rm *png")

4. Pie Chart with Percentages Inside the Slices (custom function)

The pie1() function below is code from 糗世界.

pie1 <- function (x, labels = names(x), edges = 200, radius = 0.8, clockwise = FALSE, 
                  init.angle = if (clockwise) 90 else 0, density = NULL, angle = 45, 
                  col = NULL, border = NULL, lty = NULL, main = NULL, percentage=T, 
                  rawNumber=F, digits=3, cutoff=0.01, legend=F, legendpos="topright", 
                  legendcol=2, ...)
{
    if (!is.numeric(x) || any(is.na(x) | x < 0)){
      stop("'x' values must be positive.")
    }
  
    if (is.null(labels)){
      labels <- as.character(seq_along(x))
    }else{
      labels <- as.graphicsAnnot(labels)
    }
  
    rawX <- x
    x <- c(0, cumsum(x)/sum(x))
    dx <- diff(x)
    nx <- length(dx)
    plot.new()
    pin <- par("pin")
    xlim <- ylim <- c(-1, 1)
    
    if (pin[1L] > pin[2L]){
      xlim <- (pin[1L]/pin[2L]) * xlim
    }else{
      ylim <- (pin[2L]/pin[1L]) * ylim
    }
    
    dev.hold()
    on.exit(dev.flush())
    plot.window(xlim, ylim, "", asp = 1)
    
    if (is.null(col)){
      col <- if (is.null(density)){
        c("white", "lightblue", "mistyrose", "lightcyan", 
                "lavender", "cornsilk", "pink")
      }else{
        par("fg")
      } 
    }
        
    if (!is.null(col)){
      col <- rep_len(col, nx)
    }
        
    if (!is.null(border)){
      border <- rep_len(border, nx)
    }
      
    if (!is.null(lty)) 
        lty <- rep_len(lty, nx)
    angle <- rep(angle, nx)
    if (!is.null(density)) 
        density <- rep_len(density, nx)
    twopi <- if (clockwise) 
        -2 * pi
    else 2 * pi
    t2xy <- function(t) {
        t2p <- twopi * t + init.angle * pi/180
        list(x = radius * cos(t2p), y = radius * sin(t2p))
    }
    for (i in 1L:nx) {
        n <- max(2, floor(edges * dx[i]))
        P <- t2xy(seq.int(x[i], x[i + 1], length.out = n))
        polygon(c(P$x, 0), c(P$y, 0), density = density[i], angle = angle[i], 
            border = border[i], col = col[i], lty = lty[i])
        if(!legend){
        	P <- t2xy(mean(x[i + 0:1]))
	        lab <- as.character(labels[i])
	        if (!is.na(lab) && nzchar(lab)) {
	            lines(c(1, 1.05) * P$x, c(1, 1.05) * P$y)
	            text(1.1 * P$x, 1.1 * P$y, labels[i], xpd = TRUE, 
	                adj = ifelse(P$x < 0, 1, 0), ...)
	        }
        }
    }
    if (percentage) {
    	for (i in 1L:nx){
    		if(dx[i]>cutoff){
    			P <- t2xy(mean(x[i + 0:1]))
            	text(.8 * P$x, .8 * P$y, paste(formatC(dx[i]*100, digits=digits), "%", sep=""), xpd = TRUE, 
                	adj = .5, ...)
    		}
        }
    }else{
        if(rawNumber){
		for (i in 1L:nx){
    			if(dx[i]>cutoff){
    				P <- t2xy(mean(x[i + 0:1]))
            		text(.8 * P$x, .8 * P$y, rawX[i], xpd = TRUE, 
                		adj = .5, ...)
    			}
        	}
        }
    }
    if(legend) legend(legendpos, legend=labels, fill=col, border="black", bty="n", ncol = legendcol)
    title(main = main, ...)
    invisible(NULL)
}
for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  png(fn)
  pie1(df3$GDP, labels = df3$Country, main=paste("GDP",i,sep=" "), col=rainbow(length(df3$Country)))
  dev.off()
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014v4.gif")
system("rm *png")

5. Pie Chart with ggplot2

library(ggplot2)
library(dplyr)

for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  #df3 = df3[order(df3$GDP, decreasing = TRUE),] # optional: use order() to sort the rows by GDP in decreasing order
  df3 <- df3 %>% group_by(Year) %>% mutate(pos = cumsum(GDP)- GDP/2)
  
  pct <- round(df3$GDP/sum(df3$GDP)*100, 2)
  pct <- paste(pct,"%", sep="")
  lbls <- paste(df3$Country, pct, sep=" ")
  
  p <- ggplot(df3, aes(x = "", y = GDP, fill = Country)) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar(theta = "y") +
    labs(x = "", y = "", title = paste("GDP", i)) +   ## set the axis labels to empty strings
    geom_text(aes(x="", y=pos, label = lbls), size=3) +  
    theme_bw() +
    theme(panel.border = element_blank(), panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(), axis.line = element_blank(),
          axis.ticks = element_blank(), axis.text.x = element_blank(),
          legend.position ="none") # remove the background, grid, axes, and legend
  ggsave(fn,p)
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014v5.gif")
system("rm *png")
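The pos column computed with cumsum(GDP) - GDP/2 places each label at the midpoint of its slice along the stacked axis. A minimal Python sketch of that arithmetic, with a hypothetical helper name and made-up values:

```python
from itertools import accumulate


def label_positions(values):
    """Midpoint of each stacked segment: running total minus half the segment."""
    return [running - v / 2 for running, v in zip(accumulate(values), values)]


print(label_positions([30, 20, 50]))
```

The midpoints only line up with the drawn slices if the segments are stacked in the same order as the rows, which is worth checking against the installed ggplot2 version.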

http://felixfan.github.io/gh-md-toc

gh-md-toc is a great cross-platform tool to generate TOC (Table of contents) for README.md or GitHub’s wiki page. Source code and examples are available in their GitHub repository.

http://felixfan.github.io/vi

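vi commands

The vi editor has an edit mode and a command mode. Edit mode is for editing text; command mode is for issuing commands that operate on the file. By default, vi starts in command mode. Press Esc to switch from edit mode to command mode, and press A, a, O, o, I, or i to switch from command mode to edit mode.

Command        Action
ZZ             In command mode, save the changes to the current file and quit vi
:line-number   Move the cursor to the start of the given line
:$             Move the cursor to the start of the last line
:wq            In command mode, save and quit
:w             In command mode, save
:q             In command mode, quit vi
:q!            In command mode, force quit vi without saving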
http://felixfan.github.io/Scatterplot-many-points
library(ggplot2)
dat <- data.frame(x=rnorm(10000), y=rnorm(10000))

point plot

ggplot(dat, aes(x=x, y=y)) + geom_point()

jittering

ggplot(dat, aes(x=x, y=y)) + geom_point(position = 'jitter')

alpha

ggplot(dat, aes(x=x, y=y)) + geom_point(alpha = 0.3)

ggplot(dat, aes(x=x, y=y)) + geom_point(alpha = 0.1)

contour lines

ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_density2d()

HexBins

ggplot(dat, aes(x=x, y=y)) + stat_binhex()
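The idea behind binning is to replace thousands of overlapping points with one count per cell. The Python sketch below uses square cells rather than the hexagons drawn by stat_binhex(), purely to keep the arithmetic short. The helper name grid_counts and the sample points are assumptions for illustration.

```python
import math
from collections import Counter


def grid_counts(points, bin_width):
    """Count points per square bin; keys are (column, row) bin indices."""
    counts = Counter()
    for x, y in points:
        key = (math.floor(x / bin_width), math.floor(y / bin_width))
        counts[key] += 1
    return counts


sample = [(0.1, 0.2), (0.3, 0.4), (-0.1, 0.2), (1.2, 0.9)]
for cell, n in sorted(grid_counts(sample, bin_width=0.5).items()):
    print(cell, n)
```

Each count would then be mapped to a fill colour, which is what makes dense regions visible where individual points overplot.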

combined

ggplot(dat, aes(x=x, y=y)) + geom_point(colour='black',alpha=0.3) + geom_density2d(colour='red')

http://felixfan.github.io/english-letter
I am writing to confirm/enquire/inform you...

I am writing to follow up on our earlier decision on the marketing campaign in Q2.

With reference to our telephone conversation today…

In my previous e-mail on October 5

As I mentioned earlier about...

as indicated in my previous e-mail...

As we discussed on the phone...

from our decision at the previous meeting…
as you requested/per your requirement...

In reply to your e-mail dated April 1, we decided...

This is in response to your e-mail today

As mentioned before, we deem this product has strong unique selling points in China.

As a follow-up to our phone conversation yesterday, I wanted to get back to you about the pending issues of our agreement.

I received your voice message regarding the subject. I'm wondering if you can elaborate i.e. provide more details.
Please be advised/informed that…

Please note that…

We would like to inform you that…
I am convinced that …

We agree with you on...
With effect from 4 Oct, 2008...

We will have a meeting scheduled as noted below…

Be assured that individual statistics are not disclosed and this is for internal use only.
I am delighted to tell you that…

We are pleased to learn that…

We wish to notify you that…

Congratulations on your…

I am fine with the proposal.

I am pleased to inform you that you have been accepted to join the workshop scheduled for 22-24 Nov, 2008.

We are sorry to inform you that…

I’m afraid I have some bad news.
There are a number of issues with our new system.

Due to circumstances beyond our control...

I don't feel too optimistic about...

It would be difficult for us to accept...

Unfortunately I have to say that, since receiving your enquiries on the subject, our view has not changed.
We would be grateful if you could...

I would appreciate it if you could...

Would you please send us...?

We need your help.

We seek your assistance to cascade/reply this message to your staff.

We look forward to your clarification.

Your prompt attention to this matter will be appreciated.

I would really appreciate meeting up if you can spare the time. Please let me know what suits you best.
Please give us your preliminary thoughts about this.

Would you please reply to this e-mail if you plan to attend?

Please advise if you agree with this approach.

Could you please let me know the status of this project?

If possible, I hope to receive a copy of your proposal when it is finished.

I would appreciate it very much if you would send me your reply by next Monday.

Hope this is OK with you. If not, let me know by e-mail ASAP

Could you please send me your replies to the above questions by the end of June?

May I have your reply by April 1, if possible?
If you wish, we would be happy to…

Please let me know if there's anything I can do to help.

If there's anything else I can do for you on/regarding this matter, please feel free to contact me at any time.

If you want additional recommendations on this, please let us know and we can try to see if this is possible.
I'm just writing to remind you of…

May we remind you that..?

I am enclosing…

Please find enclosed…

Attached hereto…

Attached please find the most up-to-date information on/regarding/concerning…

Attached please find the draft product plan for your review and comment.
If you have any further questions, please feel free to contact me.

I hope my clarification has been helpful.

Please feel free to call me at any time; I will continue to provide full support.

Please let me know if this is suitable.

Looking forward to seeing you soon.

We look forward to hearing from you soon.

Hope this is clear and we are happy to discuss this further if necessary.

I look forward to receiving your reply soon.

Looking forward to receiving your comments in due course.

I'll keep you posted.

Please keep me informed on the matter.

For any comments/suggestions, please contact Nadia at 2552-7482.
I would like to apologize for…

I apologize for the delay in...

We are sorry for any inconvenience caused.

I am sorry for any inconvenience this has caused you.

I'm sorry about last time.

We apologize for not replying to you earlier.

I’m really sorry about this.

Sorry, I'm late in replying to your e-mail dated Monday April 1.

We apologize for the delay and hope that it doesn’t inconvenience you too much.

Hoping that this will not cause you too much trouble.

Sorry if my voice message is not clear enough.
Thank you for your help.

I appreciate very much that you…

I truly appreciate it.

Thank you for your participation.

Thank you so much for inviting me.

Congratulations to all of you and thanks for your efforts.

Your understanding and cooperation is greatly/highly appreciated.

Your prompt response will be most appreciated.

Once again, thank you all for your commitment and support.

Thanks for your input/clarification/message

Any comments will be much appreciated.

Thank you very much for everything you've done for me.

I would appreciate your kindest understanding with/regarding this matter.

Please convey my thanks to all the staff involved; they certainly did an excellent job.
K.I.S.S. (Keep it simple, stupid!) Be as concise as possible. People receive a large volume of e-mail every day, so make sure yours is simple, easy to understand, and clearly organized.

For example, when requesting a meeting, skip the long preamble and simply state the key details such as the time and place.

Meeting Request: Let’s have our weekly meeting on Feb 18th at 10:30 AM in Meeting Room 2.

@Lisa, could you please take the meeting minutes this time?

@Chris, please buy some snacks and drinks for each person.
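When writing an English e-mail, remember to introduce yourself and say clearly what you want. Keep the self-introduction brief so that the recipient understands the purpose of the e-mail.

For example, Jane has joined Company A as Shanghai regional manager. After going through the client files, she sends John an e-mail:

Hi John, Hope this mail finds you well. I am Jane, SH regional manager from A; I’m contacting you today regarding our upcoming collaboration.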
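Use an eye-catching subject line. Make sure the recipient gets the point from the subject alone.

Example: Xiaomei from the marketing department went to a TV station for an interview, ran into an old classmate, and landed a partnership for the company along the way. Back at the office, she e-mailed her boss with the subject line "Potential marketing resource with xx TV".

Sometimes the subject line alone can convey the whole message, and the body can even be left out.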
Emphasize the key information, then spell out the specific request. If you need someone to finish something by a deadline, state the deadline and the task clearly.

For example, Xiaoqi from HR e-mails Bruce, who has just joined: It is VERY IMPORTANT that we go through the visa procedure as early as possible tomorrow. Please MEET with me at 13:00 on the 7th floor of the office for this to be done.
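Communicate top-down by importance: put the most critical information at the top of the e-mail. Foreign colleagues tend to prefer getting straight to the point, so avoid unnecessary pleasantries and make sure the most important information is what the recipient sees first.

Example: the boss e-mails the whole marketing team to explain why Jane, the marketing supervisor, will be joining late.

Jane will not be able to join us until March 15th. This is because of the rules surrounding her current contract. She will still be able to join us for the seminar on Feb. 8th.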
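Include complete signature information. Make sure your signature includes your business area, department, title, and contact details so that the recipient has a clearer picture of who you are.

For example:

Best Wishes,
Lucy Yang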
Sales Manager
XXXX Company
1/F XXX Plaza, XX Street,Shanghai, China
Tel: +86 133 XXXXXXX Skype: *****
www.XXXXXX.com

Sometimes by reading your signature people can know if they are talking to the right person.
A complete signature can also show your remit and areas of interest at a glance, saving a lot of unnecessary back-and-forth.

References

http://www.fortunechina.com/career/c/2016-02/14/content_255590.htm?id=mail

http://www.weixinyidu.com/n_1312384

http://felixfan.github.io/tophat-cufflinks

1. Data

Two raw data files were provided as the starting point, together with a reference genome and gene annotations:

* day8.fastq from the first biological condition
* day16.fastq from the second biological condition
* genome.fa the reference genome
* genes.gtf the reference gene annotations

2. Create reference index

bowtie2-build bwtIndex/genome.fa bwtIndex/genome

3. Run tophat using all default parameters

tophat -o output/tophat/day8/ bwtIndex/genome day8.fastq 
tophat -o output/tophat/day16/ bwtIndex/genome day16.fastq

These commands create the accepted_hits.bam files containing the alignments, the align_summary.txt files containing summary statistics on the mapped reads, and the unmapped.bam files containing the records of unmapped reads.

3.1 How many spliced alignments were reported for the ‘day8’ data set?

Spliced alignments contain an ‘N’ operation in the CIGAR string (column 6 of the SAM record).

samtools view output/tophat/day8/accepted_hits.bam | cut -f 6 | grep 'N' | wc -l
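For readers who want to see what the grep on column 6 is matching, here is a hedged Python sketch that parses CIGAR strings into (length, operation) pairs and counts records containing an N (skipped region) operation. The helper names and the toy SAM lines are made up for illustration; real input would come from the samtools view output above.

```python
import re

# One CIGAR element: a run length followed by an operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_ops(cigar):
    """Split a CIGAR string such as '30M500N20M' into (length, op) pairs."""
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def count_spliced(sam_lines):
    """Count alignment records whose CIGAR contains an N operation."""
    spliced = 0
    for line in sam_lines:
        if line.startswith("@"):      # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        if any(op == "N" for _, op in cigar_ops(fields[5])):
            spliced += 1
    return spliced


toy_sam = [
    "@SQ\tSN:Chr1\tLN:1000",
    "r1\t0\tChr1\t10\t50\t30M500N20M\t*\t0\t0\tACGT\tFFFF",
    "r2\t0\tChr1\t40\t50\t50M\t*\t0\t0\tACGT\tFFFF",
]
print(count_spliced(toy_sam))
```

Parsing the operations rather than searching the raw string also gives access to the intron lengths (the numbers preceding each N) if they are needed later.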

4. Run cufflinks

Run cufflinks using the specified labels as prefix for naming the assembled transcripts.

cufflinks -o output/cufflinks/day8 -L Day8 output/tophat/day8/accepted_hits.bam
cufflinks -o output/cufflinks/day16 -L Day16 output/tophat/day16/accepted_hits.bam

These will generate the files transcripts.gtf containing the assembled transcripts, as well as files *.fpkm_tracking containing expression (FPKM) estimates for genes and transcripts.

# number of genes (distinct gene_id values)
cut -f9 output/cufflinks/day8/transcripts.gtf | cut -d ' ' -f2 | sort -u | wc -l
# number of transcripts (distinct transcript_id values)
cut -f9 output/cufflinks/day8/transcripts.gtf | cut -d ' ' -f4 | sort -u | wc -l
# genes with a single transcript (transcript records have no exon_number attribute)
cut -f9 output/cufflinks/day8/transcripts.gtf | grep -v "exon_number" | cut -d ' ' -f2 | sort | uniq -c | awk '$1==1' | wc -l
# single-exon transcripts
cut -f9 output/cufflinks/day8/transcripts.gtf | grep "exon_number" | cut -d ' ' -f4 | sort | uniq -c | awk '$1==1' | wc -l
# multi-exon transcripts
cut -f9 output/cufflinks/day8/transcripts.gtf | grep "exon_number" | cut -d ' ' -f4 | sort | uniq -c | awk '$1>1' | wc -l

5. Run cuffcompare

Run cuffcompare on the resulting cufflinks transcripts, using the reference gene annotations provided and selecting the option ‘-R’ to consider only the reference transcripts that overlap some input transfrag.

cuffcompare -r genes.gtf -R output/cufflinks/day8/transcripts.gtf
cuffcompare -r genes.gtf -R output/cufflinks/day16/transcripts.gtf

Cuffcompare compares the assembled transcripts against the reference gene annotations provided by the user, exon by exon, to determine which genes and transcripts in the sample are known and which are likely novel. It then assigns each predicted (cufflinks) transcript a ‘class’ code according to how it relates to a reference transcript, for example: it is the same as a reference transcript (‘=’), it is only a portion of one (‘c’), or it is a new splice variant of a reference gene (‘j’).

6. Run cuffmerge

echo output/cufflinks/day8/transcripts.gtf > gtf.txt
echo output/cufflinks/day16/transcripts.gtf >> gtf.txt
cuffmerge -o output/cuffmerge -g genes.gtf gtf.txt

7. Run cuffdiff

Run cuffdiff with the merged.gtf file as reference annotation, taking the two alignment files as input.

cuffdiff -o output/cuffdiffs/ output/cuffmerge/merged.gtf output/tophat/day8/accepted_hits.bam output/tophat/day16/accepted_hits.bam

This will create the file gene_exp.diff containing test scores and results for the gene-level differential expression analysis, other *.diff files, as well as tracking files for genes, transcripts, splicing, CDS, TSS, etc.

7.1 How many genes were detected as differentially expressed?

grep -c "yes" output/cuffdiffs/gene_exp.diff

7.2 How many transcripts were differentially expressed between the two samples?

grep -c "yes" output/cuffdiffs/isoform_exp.diff
http://felixfan.github.io/hist-bin-width
set.seed(999)
dat <- rnorm(n=1000, mean=24, sd=5)
histInfor <- hist(dat)

histInfor
$breaks
 [1]  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

$counts
 [1]   1   7  13  37  64 100 124 160 156 136 105  55  24  11   3   4

$density
 [1] 0.0005 0.0035 0.0065 0.0185 0.0320 0.0500 0.0620 0.0800 0.0780 0.0680
[11] 0.0525 0.0275 0.0120 0.0055 0.0015 0.0020

$mids
 [1]  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

$xname
[1] "dat"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Approximate number of bins (=10)

When breaks is a single number, R treats it only as a suggestion: the break points are set to "pretty" values, so the number of bins you get may differ from the number you asked for, although it is usually close.

hist(dat, breaks = 10)

Exact number of bins (=10)

hist(dat, breaks = seq(min(dat), max(dat), length.out = 11))

Width of bins (=10)

summary(dat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.633  20.470  23.950  23.840  27.180  39.310 
hist(dat, breaks = seq(from=5, to=45, by=10))
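Both seq() calls simply build the vector of bin edges by hand: 10 bins need 11 equally spaced edges between the minimum and the maximum, while a fixed width needs edges stepping from a start value to an end value. A Python sketch of the two constructions, using the minimum and maximum from the summary above and hypothetical helper names:

```python
def edges_for_bin_count(lo, hi, n_bins):
    """n_bins equal-width bins need n_bins + 1 edges from lo to hi."""
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins)] + [hi]


def edges_for_bin_width(start, stop, width):
    """Edges start, start + width, ... up to and including stop if reached."""
    edges = []
    edge = start
    while edge <= stop:
        edges.append(edge)
        edge += width
    return edges


print(len(edges_for_bin_count(8.633, 39.31, 10)))
print(edges_for_bin_width(5, 45, 10))
```

As in the R call, the fixed-width edges must cover the full data range (5 to 45 spans 8.633 to 39.31 here); otherwise hist() stops with an error because some values fall outside the breaks.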

http://felixfan.github.io/bowtie-var

1. generate bowtie2 index

bowtie2-build ref.fasta indexDir/ref

2. read alignments

2.1 end-to-end read alignments

bowtie2 -x indexDir/ref -U seq.fastq -S out.full.sam

2.2 partial read alignments

bowtie2 -x indexDir/ref -U seq.fastq -S out.local.sam --local

3. statistics of the alignments

3.1 How many matches (alignments) were reported?

Check the SAM file to determine the number of alignment lines, excluding lines that refer to unmapped reads. A SAM line indicating an unmapped read can be recognized by a “*” in column 3 (chrom).

grep -v "^@" out.full.sam | cut -f3 | grep -v "*" | wc -l
grep -v "^@" out.local.sam | cut -f3 | grep -v "*" | wc -l

3.2 How many alignments contained insertions and/or deletions?

This information is captured in the CIGAR field (column 6), where insertions and deletions are marked with ‘I’ and ‘D’, respectively.

cut -f6 out.full.sam | grep -c "[ID]"
cut -f6 out.local.sam | grep -c "[ID]"

or

grep -v "^@" out.full.sam | awk '$6~"I" || $6~"D"' | wc -l
grep -v "^@" out.local.sam | awk '$6~"I" || $6~"D"' | wc -l
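The two counts from section 3 can also be obtained in one pass over the SAM file. The Python sketch below is an illustration on made-up records, with a hypothetical sam_summary helper. It restricts the I/D check to column 6, which matters because other columns (the base-quality string, for instance) may contain the same letters.

```python
def sam_summary(sam_lines):
    """Return (mapped_alignments, alignments_with_indels) for SAM text lines."""
    mapped = 0
    with_indel = 0
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        rname, cigar = fields[2], fields[5]
        if rname == "*":                  # unmapped read, as described in 3.1
            continue
        mapped += 1
        if "I" in cigar or "D" in cigar:  # insertion or deletion in the CIGAR
            with_indel += 1
    return mapped, with_indel


toy_records = [
    "@SQ\tSN:Chr1\tLN:1000",
    "r1\t0\tChr1\t10\t42\t50M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tChr1\t20\t42\t20M2I28M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tChr2\t5\t42\t10M1D40M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(sam_summary(toy_records))
```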

4. call variants

4.1 converting the SAM file to BAM format

samtools view -bT ref.fasta out.full.sam > out.full.bam

4.2 sort the bam file

samtools sort out.full.bam -o out.full.sorted.bam

4.3 Determine candidate sites

samtools mpileup -f ref.fasta -g out.full.sorted.bam > out.full.mpileup.bcf
bcftools call -m -v -O v -o out.full.mpileup.vcf out.full.mpileup.bcf

5. variants statistics

5.1 How many variants were reported for Chr1?

grep -v "^#" out.full.mpileup.vcf | awk '$1=="Chr1"' | wc -l

5.2 How many variants have ‘A’ as the reference allele?

grep -v "^#" out.full.mpileup.vcf | awk '$4=="A"' | wc -l

5.3 How many variants have exactly 20 supporting reads (read depth)?

grep -v "^#" out.full.mpileup.vcf | grep "DP=20;" | wc -l

5.4 How many variants represent indels?

grep -v "^#" out.full.mpileup.vcf | grep "INDEL" | wc -l
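The four questions in this section can be answered in a single pass by splitting each VCF record into columns and parsing the INFO field into key/value pairs. The Python sketch below uses made-up records and a hypothetical vcf_stats helper. Matching on parsed fields avoids two pitfalls of plain text matching: a "^Chr1" prefix also matches names such as Chr10, and "DP=20;" misses records where DP is the last INFO entry.

```python
def vcf_stats(vcf_lines, chrom="Chr1", ref="A", depth=20):
    """Count variants by chromosome, reference allele, read depth, and INDEL flag."""
    stats = {"total": 0, "on_chrom": 0, "ref_allele": 0, "depth": 0, "indel": 0}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        info = {}
        for item in fields[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value             # flags such as INDEL get an empty value
        stats["total"] += 1
        if fields[0] == chrom:
            stats["on_chrom"] += 1
        if fields[3] == ref:
            stats["ref_allele"] += 1
        if info.get("DP") == str(depth):
            stats["depth"] += 1
        if "INDEL" in info:
            stats["indel"] += 1
    return stats


toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr1\t100\t.\tA\tG\t50\t.\tDP=20;MQ=40",
    "Chr1\t200\t.\tAT\tA\t30\t.\tINDEL;DP=12",
    "Chr10\t300\t.\tC\tT\t60\t.\tDP=20",
]
print(vcf_stats(toy_vcf))
```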
http://felixfan.github.io/linux-grep
echo "Hello World" > test.txt
echo "hello python" >> test.txt
echo "big apple" >> test.txt
echo "key1" >> test.txt 
echo "code99" >> test.txt 

Case-sensitive match

grep "Hello" test.txt
Hello World

Case-insensitive match

grep -i "Hello" test.txt
Hello World
hello python

Show only lines that start with 'h'

grep "^h" test.txt
hello python

Find lines that end with 'e'

grep -i "e$" test.txt 
big apple

Search for empty lines

grep '^$' test.txt

Match 'Hello' or 'hello'

grep "[Hh]ello" test.txt 
Hello World
hello python

Match a digit

grep "y[0-9]" test.txt 
key1

Match two consecutive digits

grep "e[0-9][0-9]" test.txt 
code99

Match letters

grep '[A-Za-z]' test.txt
Hello World
hello python
big apple
key1
code99

Show all lines containing the letter 'p' or 'y'

grep '[py]' test.txt
hello python
big apple
key1

Match lines containing two consecutive 'p' characters

egrep "p{2}" test.txt 
big apple

Find lines containing 'p' or 'pp'

egrep "p{1,2}" test.txt 
hello python
big apple

Match lines with at least three consecutive 'p' characters

egrep "p{3,}" test.txt 

Read multiple patterns from a file

echo "y1" > p.txt
echo "d$" >> p.txt
grep -f p.txt test.txt
Hello World
key1
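The same patterns can be tried out with Python's re module, which is a convenient way to experiment before running grep on real files. This is only a sketch: the grep and grep_any helpers are hypothetical, and Python's regex syntax is close to, but not identical with, the basic and extended syntaxes used by grep and egrep.

```python
import re

LINES = ["Hello World", "hello python", "big apple", "key1", "code99"]


def grep(pattern, lines, ignore_case=False):
    """Return the lines in which the pattern matches anywhere (like grep)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [line for line in lines if regex.search(line)]


def grep_any(patterns, lines):
    """Return the lines matching at least one pattern (like grep -f)."""
    regexes = [re.compile(p) for p in patterns]
    return [line for line in lines if any(r.search(line) for r in regexes)]


print(grep("^h", LINES))
print(grep("p{1,2}", LINES))
print(grep_any(["y1", "d$"], LINES))
```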

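http://felixfan.github.io/RMarkdown-Chinese-PDF Generating PDF files that contain Chinese with R Markdown on Mac OS

Preparation

Besides R and RStudio, you also need to install pandoc and BasicTeX. If you have plenty of disk space, you can install MacTeX directly instead. Finally, install the R package rticles. After downloading pandoc and BasicTeX, double-click the installers to run them. Once BasicTeX is installed, update it first and then install the ctex package, as follows: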

sudo tlmgr update --self
sudo tlmgr update --all
sudo tlmgr install ctex

If a run reports that "package.sty" is missing, install it with "sudo tlmgr install package" (where package is the name of the missing package). Then open RStudio and install rticles:

install.packages("rticles")

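Once installation is complete, create a new R Markdown file; in the dialog that pops up, click "From Template" in the lower left and choose "CTeX Documents" in the right-hand pane.

What follows is an abridged version of the template's default content.

Introduction

Chinese LaTeX documents are not a hard problem. Of course, that can only be said while standing on the shoulders of the giant CTeX, which lets us write just one line,

\documentclass{ctexart} % or ctexrep/ctexbook

or

\usepackage{ctex}

to take care of Chinese LaTeX typesetting with ease. Cross-platform LaTeX compilation, however, is a small headache, mainly because there is no free Chinese font that works on every platform. Fine, you Windows users will always have SimSun and SimHei, and you Mac users have the Huawen (ST) fonts, while for us long-suffering Linux users compiling a LaTeX document is not so simple [1]. Yes, we have WenQuanYi, but if we use it and then send you the document, it will most likely fail to compile because you do not have WenQuanYi installed.

Fonts and options

The LaTeX package ctex supports several font options. If you are a long-time ctex user, note that the minimum version required here is 2.2, so you may need to upgrade your LaTeX packages. Since version 2.0, ctex can select Chinese fonts automatically according to the operating system, a great contribution to human progress: we no longer have to wear ourselves out explaining to users, "Oh, you use Windows, so you should use such-and-such fonts; oh, you use a Mac, then do this and that."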

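The YAML metadata below should meet most users' needs. It sets two main parameters: the document class is ctexart (other classes work too), and the output format is rticles::ctex, whose default LaTeX engine is XeLaTeX (really, stop clinging to your old flame PDFLaTeX).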

---
documentclass: ctexart
output: rticles::ctex
---

R code chunks

R code is embedded using R Markdown syntax: three backticks followed by {r} (```{r}) open a chunk, and three backticks (```) close it:

options(digits = 4)
fit = lm(dist ~ speed, data = cars)
coef(summary(fit))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  -17.579     6.7584  -2.601 1.232e-02
## speed          3.932     0.4155   9.464 1.490e-12
b = coef(fit)

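The slope in the regression above is 3.9324, and the full regression equation is: \[ Y = -17.5791 + 3.9324x\]

Plotting is no problem either, of course. If you want a plot, just say so; if you don't say so, how would I know you want one?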

par(mar = c(4, 4, .1, .1), las = 1)
plot(cars, pch = 19)
abline(fit, col = 'red')
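Scatter plot of the cars data with the regression line.

Please do not ask me why the figure floated to the next page; ask such an elementary LaTeX question and see if I don't smack you.

Summary

As it turns out, we can confidently convert a Chinese R Markdown document to PDF via XeLaTeX. Mom no longer has to worry about my thesis being a screen full of backslashes, and after tending the lab mice I no longer need to wrestle with LaTeX for three hours before starting my lab report: open RStudio, go to File > New File > R Markdown, choose CTeX Documents from the templates, and done.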


  1. Hmph, as if proud Linux users would ever stoop to copying fonts from you.

[Figure: preview of the first page of the PDF generated from R Markdown]
http://felixfan.github.io/circos

1. prepare data

options(stringsAsFactors = FALSE)
set.seed(999)
library("OmicCircos")
data("UCSC.hg19.chr")
data("TCGA.BC.gene.exp.2k.60")
dat <- UCSC.hg19.chr
dat$chrom <- gsub("chr", "",dat$chrom)


### initial values for simulation data
colors <- rainbow(10, alpha = 0.8)
lab.n <- 50
cnv.n <- 200
arc.n <- 30
fus.n <- 10

### make arc data

arc.d <- c()
for(i in 1:arc.n){
  chr <- sample(1:19, 1)
  chr.i <- which(dat$chrom == chr)
  chr.arc <- dat[chr.i,]
  arc.i <- sample(1:nrow(chr.arc), 2)
  arc.d <- rbind(arc.d, 
                 c(chr.arc[arc.i[1], c(1,2)], 
                   chr.arc[arc.i[2], c(2,4)]))
}
colnames(arc.d) <- c("chr", "start", "end", "value")


### make fusion

fus.d <- c()
for(i in 1:fus.n){
  chr1 <- sample(1:19, 1)
  chr2 <- sample(1:19, 1)
  chr1.i <- which(dat$chrom == chr1)
  chr2.i <- which(dat$chrom == chr2)
  chr1.f <- dat[chr1.i,]
  chr2.f <- dat[chr2.i,]
  fus1.i <- sample(1:nrow(chr1.f), 1)
  fus2.i <- sample(1:nrow(chr2.f), 1)
  n1 <- paste0("geneA", i)
  n2 <- paste0("geneB", i)
  fus.d <- rbind(fus.d, c(
    chr1.f[fus1.i, c(1,2)], n1,
    chr2.f[fus2.i, c(1,2)], n2
  ))
}
colnames(fus.d) <- c("chr1","po1","gene1","chr2","po2","gene2")

cnv.i <- sample(1:nrow(dat), cnv.n)
vale <- rnorm(cnv.n)
cnv.d <- data.frame(dat[cnv.i,c(1,2)], value=vale)

### gene pos
gene.pos <- TCGA.BC.gene.exp.2k.60[,1:3]

### gene expression
gene.exp <- TCGA.BC.gene.exp.2k.60

### p value
gene.pos$p <- rnorm(250,0.01,0.001)*
  sample(c(1,0.5,0.01,0.001,0.0001),250,replace=TRUE)

2. circos plot

2.1 plot of chromosome

type = "chr": plots of chromosomes or segments
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)

2.2 plot bar charts with the same height

type = "b3": bar charts with the same height
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])

2.3 plot dots with the fixed radius

type = "s2": dots with the fixed radius
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)

2.4 plot arcs with the fixed radius

type = "arc2": arcs with the fixed radius
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)

2.5 plot bar charts (opposite side of cutoff value)

type = "b2": bar charts (opposite side of cutoff value)
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)

2.6 plot arcs with variable radius

type = "arc": arcs with variable radius
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=240, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)

2.7 box plots

type = "box": box plots
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=240, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=200, cir="hg19",type="box",W=40,mapping=cnv.d,B=TRUE, col=colors[1],lwd=0.1,scale = TRUE,col.v = 3)

2.8 histograms

type = "h": histograms
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=240, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=200, cir="hg19",type="box",W=40,mapping=cnv.d,B=TRUE, col=colors[1],lwd=0.1,scale = TRUE,col.v=3)
circos(R=160, cir="hg19",type="h",W=40,mapping=cnv.d,B=FALSE, col=colors[3],lwd=0.1,col.v=3)

2.9 link lines

type = "link": link lines based on Bezier curve
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=240, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=200, cir="hg19",type="box",W=40,mapping=cnv.d,B=TRUE, col=colors[1],lwd=0.1,scale = TRUE,col.v=3)
circos(R=160, cir="hg19",type="h",W=40,mapping=cnv.d,B=FALSE, col=colors[3],lwd=0.1,col.v=3)
circos(R=120,cir="hg19",type="link",W=10,mapping=fus.d,col=colors[c(1,7,9)],lwd=2)

3 plot label

3.1 outside label

par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=300, cir="hg19",type="chr",W=10,scale=FALSE,print.chr.lab = FALSE)
circos(R=310, cir="hg19",type="label",W=40,mapping=gene.pos, col=c("black","blue","red"),cex=0.4,side="out")
circos(R=250, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=250, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=220, cir="hg19",type="arc2",W=30,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=190, cir="hg19",type="b2",W=30,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=160, cir="hg19",type="arc",W=30,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=130, cir="hg19",type="box",W=30,mapping=cnv.d,B=TRUE, col=colors[1],lwd=0.1,scale = TRUE,col.v=3)
circos(R=100, cir="hg19",type="h",W=30,mapping=cnv.d,B=FALSE, col=colors[3],lwd=0.1,col.v=3)
circos(R=90,cir="hg19",type="link",mapping=fus.d,col=colors[c(1,7,9)],lwd=2)

3.2 inside label

par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=FALSE,print.chr.lab = T)
circos(R=390, cir="hg19",type="label",W=50,mapping=gene.pos, col=c("black","blue","red"),cex=0.4,side="in")
circos(R=240, cir="hg19",type="b3",W=50,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=240, cir="hg19",type="s2",W=50,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=190, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=140, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=90, cir="hg19",type="h",W=40,mapping=cnv.d,B=FALSE, col=colors[3],lwd=0.1,col.v=3)
circos(R=80,cir="hg19",type="link",mapping=fus.d,col=colors[c(1,7,9)],lwd=2)

4 heatmap

par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=FALSE,print.chr.lab = T)
circos(R=300, cir="hg19",type="heatmap2",W=100,mapping=gene.exp, 
       col.v=4, cluster=FALSE,col.bar = FALSE,lwd=0.1,col="blue")
circos(R=200, cir="hg19",type="s",W=100,mapping = gene.pos,
       col.v=4,col=colors,scale = TRUE,B=TRUE)
sig.gene <- gene.pos[gene.pos$p<0.000001,]
circos(R=190, cir="hg19",type="label",W=40,mapping=sig.gene, col=c("black","blue","red"),cex=0.4,side="in")

par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=200, cir="hg19",type="chr",W=10,scale=FALSE,print.chr.lab = T)
circos(R=100, cir="hg19",type="heatmap2",W=100,mapping=gene.exp, 
       col.v=4, cluster=FALSE,col.bar = FALSE,lwd=0.1,col="blue")
circos(R=230, cir="hg19",type="s",W=100,mapping = gene.pos,
       col.v=4,col=colors,scale = TRUE,B=TRUE)
sig.gene <- gene.pos[gene.pos$p<0.000001,]
circos(R=330, cir="hg19",type="label",W=40,mapping=sig.gene, col=c("black","blue","red"),cex=0.4,side="out")

http://felixfan.github.io/birthday-paradox

Birthday paradox

The birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367, since there are only 366 possible birthdays (including February 29). However, 99.9% probability is reached with just 70 people, and 50% probability with 23 people, assuming that each of the 365 days of a non-leap year is equally probable for a birthday.

Calculating the probability

x <- rep(NA, 100)
y <- rep(NA, 100)
p <- rep(NA, 100)
x[1]=1
y[1]=1
p[1]=0
for(i in 2:100)
{
  x[i]=i
  y[i]=y[i-1]*(365-i+1)/365
  p[i]=1-y[i]
}
dat = data.frame(numOfIndiv=x, prob=p)
dat2370 = dat[dat$numOfIndiv==23 | dat$numOfIndiv==70,]
dat2370$prob <- round(dat2370$prob, digits=3)
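
The loop above is the recurrence form of the usual product formula. As a sketch, writing \(y_n\) for the probability that n people all have different birthdays (the y vector in the code) and assuming 365 equally likely days:

\[y_1 = 1, \qquad y_n = y_{n-1} \cdot \frac{365 - n + 1}{365}\]

\[p(n) = 1 - y_n = 1 - \prod_{k=1}^{n-1} \frac{365 - k}{365}\]

This gives p(23) ≈ 0.507 and p(70) ≈ 0.999, in line with the 50% and 99.9% figures quoted earlier.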

Plot the probability

library(ggplot2)
ggplot(dat, aes(x=numOfIndiv, y=prob)) + 
  geom_line() +
  xlab("Number of Individuals") +
  ylab("Probability That Two Individuals Share a Birthday") +
  ggtitle("Birthday Paradox") +
  geom_point(data=dat2370,aes(x=numOfIndiv, y=prob), colour = "red") +
  geom_label(data=dat2370,
             aes(x=numOfIndiv, y=prob, 
                 label=paste(numOfIndiv,prob,sep=" ")),
             hjust = 1,  vjust = -0.2)

http://felixfan.github.io/latex-markdown
Here is an in-line equation $\sqrt{3x-1}+(1+x)^2$ in the body of the text.

Here is an in-line equation \[ \sqrt{3x-1}+(1+x)^2 \] in the body of the text.

Here is an equation: $\left [ - \frac{\hbar^2}{2 m} \frac{\partial^2}{\partial x^2} + V \right ] \Psi
= i \hbar \frac{\partial}{\partial t} \Psi$

Here is an equation: \[\left [ - \frac{\hbar^2}{2 m} \frac{\partial^2}{\partial x^2} + V \right ] \Psi = i \hbar \frac{\partial}{\partial t} \Psi\]

symbols

1. &

used as separators in alignment environments

a &lt; b

a < b

2. ^, _, { and }

^ used to indicate exponents (superscripts);
_ used to indicate subscripts;
{} braces, used for grouping;

x^i_2

\[ x^i_2 \]

{x^i}_2

\[ {x^i}_2 \]

x^{i_2}

\[ x^{i_2} \]

x^{i^2}

\[ x^{i^2} \]

{x^i}^2

\[ {x^i}^2 \]

^ax^b

\[ ^ax^b \]

\sum_{n=1}^\infty

\[ \sum_{n=1}^\infty \]

3. Greek letters

\alpha, \beta, \chi, \Delta, \delta, \epsilon, \eta, \Gamma, \gamma, \iota, \kappa

\[ \alpha, \beta, \chi, \Delta, \delta, \epsilon, \eta, \Gamma, \gamma, \iota, \kappa \]

\Lambda, \lambda, \mu, \omega, \Omega, \phi, \Phi, \pi, \Pi, \psi, \Psi

\[ \Lambda, \lambda, \mu, \omega, \Omega, \phi, \Phi, \pi, \Pi, \psi, \Psi \]

\rho, \sigma, \Sigma, \tau, \theta, \Theta, \upsilon, \Upsilon, \varDelta, \varepsilon, \varGamma

\[ \rho, \sigma, \Sigma, \tau, \theta, \Theta, \upsilon, \Upsilon, \varDelta, \varepsilon, \varGamma \]

\varLambda, \varOmega, \varphi, \varPhi, \varpi, \varPi, \xi, \zeta

\[ \varLambda, \varOmega, \varphi, \varPhi, \varpi, \varPi, \xi, \zeta \]

4. \frac

\frac a b

\[ \frac a b \]

\frac{a-1}b-1

\[\frac{a-1}b-1 \]

\frac{a-1}{b-1}

\[ \frac{a-1}{b-1} \]

GitHub Pages: use the delimiters \\(, \\) and \\[, \\] for inline and displayed math, respectively.
RStudio: use the delimiters $, $ and $$, $$ for inline and displayed math, respectively.

Reference: TEX Commands available in MathJax

http://felixfan.github.io/bedtools

1. Introduction

As described on the UCSC Genome Browser website (see link below), the BED format is a concise and flexible way to represent genomic features and annotations. The BED format description supports up to 12 columns, but only the first 3 are required for the UCSC browser, the Galaxy browser and for bedtools.

bedtools allows one to use the “BED12” format (that is, all 12 fields listed below). However, only intersectBed, coverageBed, genomeCoverageBed, and bamToBed will obey the BED12 “blocks” when computing overlaps, etc., via the “-split” option. For all other tools, the last six columns are not used for any comparisons by the bedtools. Instead, they will use the entire span (start to end) of the BED12 entry to perform any relevant feature comparisons. The last six columns will be reported in the output of all comparisons.

chrom - The name of the chromosome on which the genome feature exists. 
	For example, “chr1”, “contig1112.23”. This column is required.
start - The zero-based starting position of the feature in the chromosome. 
	The first base in a chromosome is numbered 0. This column is required.
end - The one-based ending position of the feature in the chromosome.
	This column is required.
name - Defines the name of the BED feature. This column is optional.
	For example, “LINE”, “Exon3”.
score - The UCSC definition requires that a BED score range from 0 to 1000, inclusive. 
	This column is optional.
strand - Defines the strand - either ‘+’ or ‘-’. This column is optional.
thickStart - The starting position at which the feature is drawn thickly.
	Allowed yet ignored by bedtools.
thickEnd - The ending position at which the feature is drawn thickly.
	Allowed yet ignored by bedtools.
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
	Allowed yet ignored by bedtools.
blockCount - The number of blocks (exons) in the BED line.
	Allowed yet ignored by bedtools.
blockSizes - A comma-separated list of the block sizes.
	Allowed yet ignored by bedtools.
blockStarts - A comma-separated list of block starts.
	Allowed yet ignored by bedtools.
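
Because start is zero-based and end is one-based, the length of a feature is simply end minus start. A minimal sketch of that arithmetic with awk follows; the function name bed_lengths is made up for illustration, and the input is assumed to be tab-delimited with at least three columns.

```shell
# bed_lengths: read BED lines on stdin and append each feature's length
# as an extra column. Assumes tab-delimited input (hypothetical helper).
bed_lengths() {
  awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3, $3 - $2 }'
}

# A feature from 11699949 (0-based start) to 11700000 (1-based end) spans 51 bases.
printf 'Chr3\t11699949\t11700000\n' | bed_lengths
```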

2. bedtools Examples

2.0 intersect command

cat a.bed 
Chr3	11699949	11700000
Chr3	11699967	11700018
Chr3	11699972	11700023
cat b.bed 
Chr3	11699950	11699990
Chr3	11699970	11700020
Chr4	11699972	11700023
-wa Write the original entry in A for each overlap.
bedtools intersect -wa -a a.bed -b b.bed 
Chr3	11699949	11700000
Chr3	11699949	11700000
Chr3	11699967	11700018
Chr3	11699967	11700018
Chr3	11699972	11700023
Chr3	11699972	11700023
-wb Write the original entry in B for each overlap.
bedtools intersect -wb -a a.bed -b b.bed 
Chr3	11699950	11699990	Chr3	11699950	11699990
Chr3	11699970	11700000	Chr3	11699970	11700020
Chr3	11699967	11699990	Chr3	11699950	11699990
Chr3	11699970	11700018	Chr3	11699970	11700020
Chr3	11699972	11699990	Chr3	11699950	11699990
Chr3	11699972	11700020	Chr3	11699970	11700020
-wo Write the original A and B entries plus the number of 
base pairs of overlap between the two features.
bedtools intersect -wo -a a.bed -b b.bed 
Chr3	11699949	11700000	Chr3	11699950	11699990	40
Chr3	11699949	11700000	Chr3	11699970	11700020	30
Chr3	11699967	11700018	Chr3	11699950	11699990	23
Chr3	11699967	11700018	Chr3	11699970	11700020	48
Chr3	11699972	11700023	Chr3	11699950	11699990	18
Chr3	11699972	11700023	Chr3	11699970	11700020	48
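
The last column of the -wo output can be checked by hand: for two BED intervals on the same chromosome, the overlap is the smaller end minus the larger start. The sketch below only illustrates that arithmetic, it is not how bedtools is implemented, and overlap_bp is a hypothetical helper name.

```shell
# overlap_bp: print the overlap (in bp) of two BED intervals on the same
# chromosome. Arguments: startA endA startB endB. Prints 0 if they do not overlap.
overlap_bp() {
  awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" 'BEGIN {
    s = (s1 > s2) ? s1 : s2   # larger of the two starts
    e = (e1 < e2) ? e1 : e2   # smaller of the two ends
    n = e - s
    if (n < 0) n = 0          # disjoint intervals
    print n
  }'
}

overlap_bp 11699949 11700000 11699950 11699990   # first A entry vs first B entry
overlap_bp 11699967 11700018 11699970 11700020   # second A entry vs second B entry
```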

2.1 How many overlaps (each overlap is reported on one line) between the bam file and the gtf file are reported?

To allow the input to be read directly from the BAM file, we use the option ‘-abam’. The option ‘-bed’ writes the output in BED format when the input is BAM.

bedtools intersect -abam test.bam -b test.gtf -bed -wo > overlaps.bed

This will create a file with the following format:

Columns 1-12: alignment information, converted to BED format
Columns 13-21: annotation (exon) information, from the GTF file
Column 22: length of the overlap

Alternatively, we could first convert the BAM file to BED format using ‘bedtools bamtobed’ then use the resulting file in the ‘bedtools intersect’ command. To answer the question, the number of overlaps reported is precisely the number of lines in the file (because only entries in the first file that have overlaps in file B are reported, according to the option ‘-wo’):

wc -l overlaps.bed

2.2 How many alignments overlap the annotations?

Columns 1-12 define the alignments:

cut -f1-12 overlaps.bed | sort -u | wc -l

2.3 Conversely, how many exons have reads mapped to them?

Columns 13-21 define the exons:

cut -f13-21 overlaps.bed | sort -u | wc -l
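
Why 'sort -u' is needed in 2.2 and 2.3: one alignment can overlap several exons (and one exon can be hit by several alignments), so the same columns repeat across lines. The sketch below shows this on a toy stand-in file, not on real data; count_unique is a hypothetical helper, and the toy file uses columns 1-3 and 4-6 where the real overlaps.bed uses 1-12 and 13-21.

```shell
# count_unique: count distinct values of a tab-delimited column range.
# Arguments: file, column range in cut(1) syntax (e.g. 1-12).
count_unique() {
  cut -f"$2" "$1" | sort -u | wc -l | tr -d ' '
}

# Toy stand-in for overlaps.bed (made-up coordinates):
# columns 1-3 play the role of the alignment, columns 4-6 the role of the exon.
toy=$(mktemp)
printf 'Chr3\t100\t150\tChr3\t90\t120\n'  >  "$toy"   # read 1 overlaps exon A
printf 'Chr3\t100\t150\tChr3\t140\t200\n' >> "$toy"   # read 1 also overlaps exon B
printf 'Chr3\t160\t210\tChr3\t140\t200\n' >> "$toy"   # read 2 overlaps exon B

wc -l < "$toy" | tr -d ' '   # number of overlaps (lines)
count_unique "$toy" 1-3      # number of distinct alignments
count_unique "$toy" 4-6      # number of distinct exons
rm -f "$toy"
```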
http://felixfan.github.io/nba-heatmap

1. NBA players data in 2014-2015 season

1.1 columns of the data

Rk -- Rank
Pos -- Position
Age -- Age of Player at the start of February 1st of that season.
Tm -- Team
G -- Games
GS -- Games Started
MP -- Minutes Played
FG -- Field Goals
FGA -- Field Goal Attempts
FGR -- Field Goal Percentage
F3P -- 3-Point Field Goals
F3PA -- 3-Point Field Goal Attempts
F3PR -- FG% on 3-Pt FGAs.
F2P -- 2-Point Field Goals
F2PA -- 2-point Field Goal Attempts
F2PR -- FG% on 2-Pt FGAs.
eFGR -- Effective Field Goal Percentage
FT -- Free Throws
FTA -- Free Throw Attempts
FTR -- Free Throw Percentage
ORB -- Offensive Rebounds
DRB -- Defensive Rebounds
TRB -- Total Rebounds
AST -- Assists
STL -- Steals
BLK -- Blocks
TOV -- Turnovers
PF -- Personal Fouls
PTS -- Points

1.2 read data

#dat <- read.csv("nba20142015.csv")
library(RCurl)
myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/bioinfor/nba20142015.csv", 
                ssl.verifypeer = FALSE)
dat <- read.csv(textConnection(myCsv))

1.3 select columns for heatmap

Select only the 20 players with the most points.

keeps <- c('Player','G','FGR','F3PR','F2PR','FTR','ORB','DRB','AST','STL','BLK','TOV','PF','PTS')
subdat <- dat[,names(dat) %in% keeps]
plotdat <- subdat[order(-subdat[,"PTS"]),][1:20,]

Order the y-axis inside geom_tile by PTS. By default, the y-axis is ordered alphabetically.

plotdat$Player <- factor(plotdat$Player, levels=(plotdat$Player)[order(plotdat$PTS)])

1.4 prepare for ggplot2

transform data from wide-format to long-format.

library(reshape2)
plotdat.m <- melt(plotdat)

rescale each variable so that its values lie between 0 and 1.

library(plyr)
library(scales)
plotdat.m <- ddply(plotdat.m, .(variable), transform, rescale = rescale(value))

1.5 prepare for heatmap, heatmap.2 and d3heatmap

row.names(plotdat) <- plotdat$Player
plotdat.h <- plotdat[,2:14]
plotdat.h <- data.matrix(plotdat.h)

2 Heatmap

#my_col = colorRampPalette(c("yellow","red"))(256)
my_col = colorRampPalette(c("white","green","green4","violet","purple"))(256)

2.1 heatmap in stats package

heatmap(plotdat.h, col = my_col, scale="column",Rowv=NA, Colv=NA)

2.2 heatmap.2 in gplots package

Rowv=FALSE turns off row reorder.

library(gplots)
heatmap.2(plotdat.h, col = my_col, scale="column",dendrogram="none",margins = c(5, 10),Rowv=FALSE)

2.3 d3heatmap in d3heatmap package

library(d3heatmap)
d3heatmap(plotdat.h, scale = "column",dendrogram="none",col = my_col)

2.4 heatmap by ggplot2

library(ggplot2)
ggplot(plotdat.m, aes(variable, Player)) + 
  geom_tile(aes(fill = rescale), colour = "white")+
  scale_fill_gradient(low = "yellow", high = "red")+
  theme(axis.ticks = element_blank(), 
               axis.text.x = element_text(
                 angle = 330, hjust = 0),
               axis.title = element_blank(),
               legend.title = element_blank()
               )

http://felixfan.github.io/bam-sam

1. Sequence Alignment/Map Format Specification

The official definition is here.

“It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information.”

The SAM, VCF, GFF and Wiggle formats use the 1-based coordinate system.
The BAM, BCFv2, BED, and PSL formats use the 0-based coordinate system.
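
Moving an interval from a 1-based, inclusive format to BED means subtracting 1 from the start and leaving the end unchanged. A minimal sketch of that conversion; to_bed is a hypothetical helper name, not part of any tool.

```shell
# to_bed: convert a 1-based, inclusive interval to a tab-delimited BED line
# (0-based start, 1-based end). Arguments: chrom start end.
to_bed() {
  printf '%s\t%d\t%d\n' "$1" "$(( $2 - 1 ))" "$3"
}

# Positions 1,000,000 to 10,000,000 on Chr3, written as a BED interval.
to_bed Chr3 1000000 10000000
```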

1.1 The header section


1.2 The alignment section: mandatory fields

“In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line has 11 mandatory fields. These fields always appear in the same order and must be present, but their values can be ‘0’ or ‘*’ (depending on the field) if the corresponding information is unavailable. The following table gives an overview of the mandatory fields in the SAM format”
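
The 11 mandatory fields are, in order, QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL. As a rough sketch, the helper below (a made-up name, label_sam_fields) prints each of them next to its column number for SAM text read from stdin; the alignment line in the example is invented for illustration.

```shell
# label_sam_fields: print "column name value" for the 11 mandatory fields of
# each SAM alignment line on stdin. Header lines (starting with @) are skipped.
label_sam_fields() {
  awk 'BEGIN {
         FS = "\t"
         n = split("QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL", name, " ")
       }
       /^@/ { next }
       { for (i = 1; i <= n; i++) print i, name[i], $i }'
}

# Hypothetical alignment line, not taken from a real file.
printf 'read1\t99\tChr3\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII\n' | label_sam_fields
```

With a real file, the same helper could be fed from 'samtools view test.bam | head -n 1'.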

1.2.1 The FLAG field
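
The FLAG field (column 2) is a sum of bit values, each of which records one property of the alignment: 1 paired, 2 properly paired, 4 unmapped, 8 mate unmapped, 16 reverse strand, 32 mate on reverse strand, 64 first in pair, 128 second in pair, 256 secondary, 512 failed quality checks, 1024 duplicate, 2048 supplementary. A minimal sketch that decodes a FLAG value with shell arithmetic follows; decode_flag and the short labels it prints are made up for illustration.

```shell
# decode_flag: print one label per bit set in a decimal SAM FLAG value.
# The labels are informal names chosen for this sketch.
decode_flag() {
  flag=$1
  bit=1
  for name in paired proper_pair unmapped mate_unmapped reverse mate_reverse \
              first_in_pair second_in_pair secondary qc_fail duplicate supplementary; do
    if [ $(( flag & bit )) -ne 0 ]; then
      echo "$name"
    fi
    bit=$(( bit * 2 ))   # move on to the next bit: 1, 2, 4, 8, ...
  done
}

decode_flag 99   # 99 = 1 + 2 + 32 + 64
```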

1.2.2 The CIGAR field
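
The CIGAR field (column 6) is a run of length/operation pairs, where the operation is one of M, I, D, N, S, H, P, = or X. The sketch below splits a CIGAR string into those pairs and tests for a given operation; cigar_ops and cigar_has are hypothetical helper names, the example string is invented, and 'grep -o' is assumed to be available (it is in GNU and BSD grep).

```shell
# cigar_ops: split a CIGAR string into one "length operation" pair per line.
cigar_ops() {
  printf '%s\n' "$1" | grep -oE '[0-9]+[MIDNSHP=X]' | sed -E 's/^([0-9]+)(.)$/\1 \2/'
}

# cigar_has: succeed if the CIGAR string contains the given operation (e.g. D or N).
cigar_has() {
  cigar_ops "$1" | grep -q " $2\$"
}

# Made-up CIGAR: 20 matches, a 1000-base intron gap, 30 matches, a 2-base deletion, 8 matches.
cigar_ops 20M1000N30M2D8M
cigar_has 20M1000N30M2D8M N && echo "spliced"
```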

2. BAM format

SAM files and BAM files contain the same information, but in a different format. BAM is compressed in the BGZF format.

3. Practice

3.1 How many alignments does the BAM file contain?

A BAM file contains alignments for a set of input reads. Each read can have 0 (none), 1 or multiple alignments on the genome. The number of alignments is the number of entries, excluding the header, contained in the BAM file, or equivalently in its SAM conversion.

samtools flagstat test.bam

An alternate method would be to count the number of lines in the converted SAM file (header excluded):

samtools view test.bam | wc -l

If the BAM file was created with a tool that includes unmapped reads into the BAM file, we would need to exclude the lines representing unmapped reads, i.e. with a “*” in column 3 (chrom)

samtools view test.bam | cut -f 3 | grep -v '*' | wc -l

3.2 How many alignments show the read’s mate unmapped?

An alignment with an unmapped mate is marked with a ‘*’ in column 7.

samtools view test.bam | cut -f 7 | grep -c '*'

3.3 How many alignments contain a deletion (D)?

Deletions are marked with the letter ‘D’ in the CIGAR string for the alignment, shown in column 6.

samtools view test.bam | cut -f 6 | grep -c 'D'

3.4 How many alignments show the read’s mate mapped to the same chromosome?

An alignment with mate mapped to same chromosome is marked with a “=” in column 7.

samtools view test.bam | cut -f 7 | grep -c '='

3.5 How many alignments are spliced?

A spliced alignment will be marked with an “N” (intron gap) in the CIGAR field (column 6).

samtools view test.bam | cut -f 6 | grep -c 'N'

3.6 How many sequences are in the genome file?

This information can be found in the header of the BAM file: it is the number of lines describing the sequences in the reference genome (the @SQ lines, one per sequence).

samtools view -H test.bam | grep -c "SN:"

3.7 What is the length of the first sequence in the genome file?

The length information is stored alongside the sequence identifier in the header (pattern “LN:seq_length”).

samtools view -H test.bam | grep "SN:" | more
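
To print only the length rather than page through the header, the LN value can be pulled out of the first @SQ line. A small sketch with awk follows; first_sq_length is a hypothetical helper and the header lines in the example are made up.

```shell
# first_sq_length: print the LN value of the first @SQ line of a SAM header
# read from stdin.
first_sq_length() {
  awk -F '\t' '$1 == "@SQ" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^LN:/) { print substr($i, 4); exit }   # drop the "LN:" prefix
  }'
}

# Hypothetical header, for illustration only.
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:Chr1\tLN:30427671\n@SQ\tSN:Chr2\tLN:19698289\n' | first_sq_length
```

With a real file the input would come from 'samtools view -H test.bam'.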

3.8 What alignment tool was used?

The program name is listed in the @PG line in the BAM header (pattern “ID:program_name”).

samtools view -H test.bam | grep "^@PG"

3.9 Extract a subregion from the BAM file.

Extract 1,000,000 to 10,000,000 on chromosome 3.

# BED is tab-delimited, so write the fields with explicit tabs
printf "Chr3\t1000000\t10000000\n" > region.bed
samtools view -b -L region.bed test.bam > test_region.bam
http://felixfan.github.io/google-scholar-citation
library(scholar)
library(ggplot2)
cit <- get_citation_history('8fX1TSQAAAAJ')
# '8fX1TSQAAAAJ' is my google scholar id 
ggplot(cit,aes(x=year,y=cites)) + 
  geom_bar(stat='identity') +
  theme_bw() +
  xlab('Year') +
  ylab('Google Scholar Citations') + 
  annotate('text',
           label=format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"),
           x=-Inf, y=Inf, 
           vjust=1.5, hjust=-0.05,
           size=3,colour='gray') +
  geom_text(aes(label=cites), vjust=1.5, color="white", size=3) +
  scale_y_continuous(limits = c(0, 60)) +
  scale_x_continuous(breaks=2011:2016) +
  ggtitle("h-index = 6\ni10-index = 5")

http://felixfan.github.io/formattable