http://felixfan.github.io/statistics-for-management-and-economics-study-notes-4 statistics for management and economics study notes 4

14. Analysis of Variance

14.1 One-way Analysis of Variance

The analysis of variance is a procedure that tests to determine whether differences exist between two or more population means. One-way analysis of variance is the procedure to apply when the samples are drawn independently.

\(H_{0}\): \(\mu_{1} = \mu_{2} = \cdots = \mu_{k}\)
\(H_{1}\): at least two means differ

The statistic that measures the proximity of the sample means to each other is called the between-treatments variation; it is denoted SST, which stands for sum of squares for treatments.

\[SST = \sum_{j=1}^k n_{j}(\bar x_{j} - \bar{\bar x})^2\]

\[\bar{\bar x} =\frac{\sum_{j=1}^k \sum_{i=1}^{n_{j}} x_{ij}}{n}\]

\[n = n_{1} + n_{2} + \cdots + n_{k}\]

\[\bar x_{j} = \frac{\sum_{i=1}^{n_{j}}x_{ij}}{n_{j}}\]

The within-treatments variation, denoted SSE (sum of squares for error), measures how much variation exists within each sample. It provides a measure of the amount of variation in the response variable that is not caused by the treatments.

\[SSE = \sum_{j=1}^k \sum_{i=1}^{n_{j}}(x_{ij} - \bar x_{j})^2\]

\[SSE = (n_{1}-1)s_{1}^2 + (n_{2}-1)s_{2}^2 + \cdots + (n_{k}-1)s_{k}^2\]

The mean square for treatments is computed by dividing SST by the number of treatments minus 1.

\[MST = \frac{SST}{k-1}\]

The mean square for error is determined by dividing SSE by the total sample size (labeled n) minus the number of treatments.

\[MSE = \frac{SSE}{n-k}\]

Finally, the test statistic is defined as the ratio of the two mean squares.

\[F = \frac{MST}{MSE}\]

The test statistic is F-distributed with k − 1 and n − k degrees of freedom, provided that the response variable is normally distributed. We reject the null hypothesis only if

\[F > F_{\alpha, k-1, n-k}\]

The total variation of all the data is denoted SS(Total):

\[SS(Total) = SST + SSE = \sum_{j=1}^k \sum_{i=1}^{n_{j}}(x_{ij} - \bar{\bar x})^2\]

ANOVA Table for the One-Way Analysis of Variance:

SOURCE OF VARIATION | DEGREES OF FREEDOM | SUMS OF SQUARES | MEAN SQUARES | F-STATISTIC
Treatments | k − 1 | SST | MST = SST/(k − 1) | F = MST/MSE
Error | n − k | SSE | MSE = SSE/(n − k) |
Total | n − 1 | SS(Total) | |
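The computations above translate directly into code. Below is a minimal sketch in Python (NumPy and SciPy assumed available; the three samples are hypothetical, purely for illustration):

import numpy as np
from scipy import stats

# hypothetical data for k = 3 treatments (illustration only)
samples = [np.array([23.0, 25.0, 22.0, 27.0]),
           np.array([30.0, 28.0, 26.0, 29.0]),
           np.array([20.0, 24.0, 21.0, 23.0])]

k = len(samples)
n = sum(len(s) for s in samples)
grand_mean = np.concatenate(samples).mean()

sst = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)   # between-treatments variation
sse = sum(((s - s.mean()) ** 2).sum() for s in samples)             # within-treatments variation

mst = sst / (k - 1)
mse = sse / (n - k)
f_stat = mst / mse
p_value = stats.f.sf(f_stat, k - 1, n - k)   # P(F > f_stat)
print(f_stat, p_value)                       # scipy.stats.f_oneway(*samples) gives the same F and p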

Example: a financial analyst randomly sampled 366 American households and asked each to report the age category of the head of the household and the proportion of its financial assets that are invested in the stock market. The age categories are Young (less than 35), Early middle age (35 to 49), Late middle age (50 to 65), Senior (older than 65). The analyst was particularly interested in determining whether the ownership of stocks varied by age.

SOURCE OF VARIATION | DEGREES OF FREEDOM | SUMS OF SQUARES | MEAN SQUARES | F-STATISTIC | P
Treatments | 3 | 3741.4 | 1247.12 | 2.79 | 0.0405
Error | 362 | 161871.0 | 447.16 | |
Total | 365 | 165612.4 | | |

Interpret: The value of the test statistic is F = 2.79, and its p-value is .0405, which means there is evidence to infer that the percentage of total assets invested in stocks differs in at least two of the age categories.

14.1.1 Can We Use the t-Test of the Difference between Two Means Instead of the Analysis of Variance?

There are two reasons why we don’t use multiple t-tests instead of one F-test. First, we would have to perform many more calculations. Second, and more important, conducting multiple tests increases the probability of making Type I errors.

14.1.2 Can We Use the Analysis of Variance Instead of the t-Test of \(\mu_{1} − \mu_{2}\)?

If we want to determine whether \(\mu_{1}\) is greater than \(\mu_{2}\) (or vice versa), we cannot use the analysis of variance because this technique allows us to test for a difference only. Thus, if we want to test to determine whether one population mean exceeds the other, we must use the t-test of \(\mu_{1} − \mu_{2}\) (with \(\sigma_{1}^2=\sigma_{2}^2\)). Moreover, the analysis of variance requires that the population variances are equal. If they are not, we must use the unequal variances test statistic.

14.2 Multiple Comparisons

Bonferroni adjustment:

\[\alpha = \frac{\alpha_{E}}{n}\]

\(\alpha_{E}\), which denotes the true probability of making at least one Type I error, is called the experimentwise Type I error rate; n is the number of pairwise comparisons.
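As a small sketch (the value of k is hypothetical), the adjusted per-comparison significance level is just a division:

from math import comb

k = 4                        # number of treatment means being compared (hypothetical)
n_comparisons = comb(k, 2)   # number of pairwise comparisons when all pairs are tested: k(k-1)/2 = 6
alpha_E = 0.05               # desired experimentwise Type I error rate
alpha = alpha_E / n_comparisons
print(alpha)                 # 0.00833..., used for each individual pairwise test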

14.3 Analysis of Variance Experimental Designs

14.3.1 Single-Factor and Multifactor Experimental Designs

A single-factor analysis of variance addresses the problem of comparing two or more populations defined on the basis of only one factor. A multifactor experiment is one in which two or more factors define the treatments.

The example in 14.1 is a single-factor design because we had one treatment: age of the head of the household. Suppose that we can also look at the gender of the household head in another study. We would then develop a two-factor analysis of variance in which the first factor, age, has four levels, and the second factor, gender, has two levels.

14.3.2 Independent Samples and Blocks

When the problem objective is to compare more than two populations, the experimental design that is the counterpart of the matched pairs experiment is called the randomized block design. The term block refers to a matched group of observations from each population. The randomized block experiment is also called the two-way analysis of variance.

We can determine whether sleeping pills are effective by giving three brands of pills to the same group of people to measure the effects. Such experiments are called repeated measures designs.

The data are analyzed in the same way for both designs.

14.3.3 Fixed and Random Effects

If our analysis includes all possible levels of a factor, the technique is called a fixed effects analysis of variance. If the levels included in the study represent a random sample of all the levels that exist, the technique is called a random-effects analysis of variance.

14.4 Randomized Block (Two-Way) Analysis of Variance

The purpose of designing a randomized block experiment is to reduce the within-treatments variation to more easily detect differences between the treatment means. In the one-way analysis of variance, we partitioned the total variation into the between-treatments and the within-treatments variation; that is,

\[SS(Total) = SST + SSE\]

In the randomized block design of the analysis of variance, we partition the total variation into three sources of variation:

\[SS(Total) = SST + SSB + SSE\]

where SSB, the sum of squares for blocks, measures the variation between the blocks.

BLOCK | TREATMENT 1 | TREATMENT 2 | … | TREATMENT k | Block Mean
1 | \(x_{11}\) | \(x_{12}\) | … | \(x_{1k}\) | \(\bar x[B]_{1}\)
2 | \(x_{21}\) | \(x_{22}\) | … | \(x_{2k}\) | \(\bar x[B]_{2}\)
\(\vdots\) | \(\vdots\) | \(\vdots\) | | \(\vdots\) | \(\vdots\)
b | \(x_{b1}\) | \(x_{b2}\) | … | \(x_{bk}\) | \(\bar x[B]_{b}\)
Treatment Mean | \(\bar x[T]_{1}\) | \(\bar x[T]_{2}\) | … | \(\bar x[T]_{k}\) |

Sums of Squares in the Randomized Block Experiment:

\[SS(Total) = \sum_{j=1}^k \sum_{i=1}^b (x_{ij} - \bar{\bar x})^2\] \[SST = \sum_{j=1}^k b(\bar x[T]_{j} - \bar{\bar x})^2\] \[SSB = \sum_{i=1}^b k(\bar x[B]_{i} - \bar{\bar x})^2\] \[SSE = \sum_{j=1}^k \sum_{i=1}^b (x_{ij} - \bar x[T]_{j} - \bar x[B]_{i} + \bar{\bar x})^2\]

Mean Squares for the Randomized Block Experiment:

\[MST = \frac{SST}{k-1}\] \[MSB = \frac{SSB}{b-1}\] \[MSE = \frac{SSE}{n-k-b+1}\]

Test Statistic for the Randomized Block Experiment

\[F = \frac{MST}{MSE}\]

which is F-distributed with ν1 = k − 1 and ν2 = n − k − b + 1 degrees of freedom.

ANOVA Table for the Randomized Block Analysis of Variance

SOURCE OF VARIATION | DEGREES OF FREEDOM | SUMS OF SQUARES | MEAN SQUARES | F-STATISTIC
Treatments | k − 1 | SST | MST = SST/(k − 1) | F = MST/MSE
Blocks | b − 1 | SSB | MSB = SSB/(b − 1) | F = MSB/MSE
Error | n − k − b + 1 | SSE | MSE = SSE/(n − k − b + 1) |
Total | n − 1 | SS(Total) | |
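A minimal sketch of these computations, assuming the data are arranged in a b × k NumPy array (rows = blocks, columns = treatments; the numbers are hypothetical):

import numpy as np
from scipy import stats

# hypothetical data: b = 4 blocks (rows), k = 3 treatments (columns)
x = np.array([[10.0, 12.0, 11.0],
              [ 9.0, 13.0, 12.0],
              [11.0, 14.0, 12.0],
              [ 8.0, 11.0, 10.0]])
b, k = x.shape
n = b * k
grand_mean = x.mean()
treat_means = x.mean(axis=0)     # treatment means
block_means = x.mean(axis=1)     # block means

sst = b * ((treat_means - grand_mean) ** 2).sum()
ssb = k * ((block_means - grand_mean) ** 2).sum()
ss_total = ((x - grand_mean) ** 2).sum()
sse = ss_total - sst - ssb

mst, msb = sst / (k - 1), ssb / (b - 1)
mse = sse / (n - k - b + 1)
f_treat = mst / mse
print(f_treat, stats.f.sf(f_treat, k - 1, n - k - b + 1))   # test for treatment differences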

Example: A company selected 25 groups of four men, each of whom had cholesterol levels in excess of 280. In each group, the men were matched according to age and weight. Four drugs were administered over a 2-month period, and the reduction in cholesterol was recorded. Do these results allow the company to conclude that differences exist between the four drugs?

SOURCE OF VARIATION | DEGREES OF FREEDOM | SUMS OF SQUARES | MEAN SQUARES | F-STATISTIC | P
Drug | 3 | 196.0 | 65.3 | 4.12 | 0.009
Group | 24 | 3848.7 | 160.4 | 10.11 | 0.000
Error | 72 | 1142.6 | 15.9 | |
Total | 99 | 5187.2 | | |

Interpret: We conclude that there is sufficient evidence to infer that at least two of the drugs differ.

14.5 Two-Factor Analysis of Variance

The general term for an experiment that features two factors is a factorial experiment. In factorial experiments, we can examine the effect on the response variable of two or more factors. We will present the technique for fixed effects only; that is, we will address problems where all the levels of the factors are included in the experiment.

Example: As part of a study on job tenure, a survey was conducted in which Americans aged between 37 and 45 were asked how many jobs they have held in their lifetimes. Also recorded were gender and educational attainment. The categories are E1, E2, E3 and E4. Can we infer that differences exist between genders and educational levels?

\(H_{0}\): \(\mu_{1} = \mu_{2} = \mu_{3} = \mu_{4} = \mu_{5} = \mu_{6} = \mu_{7} = \mu_{8}\)
\(H_{1}\): At least two means differ

Summary:

Groups | Count | Sum | Average | Variance
Male E1 | 10 | 126 | 12.60 | 8.27
Male E2 | 10 | 110 | 11.00 | 8.67
Male E3 | 10 | 106 | 10.60 | 11.60
Male E4 | 10 | 90 | 9.00 | 5.33
Female E1 | 10 | 115 | 11.50 | 8.28
Female E2 | 10 | 112 | 11.20 | 9.73
Female E3 | 10 | 94 | 9.40 | 16.49
Female E4 | 10 | 81 | 8.10 | 12.32

One-way ANOVA:

SOURCE OF VARIATION | DEGREES OF FREEDOM | SUMS OF SQUARES | MEAN SQUARES | F-STATISTIC | P
Between Groups | 7 | 153.35 | 21.91 | 2.17 | 0.0467
Within Groups | 72 | 726.20 | 10.09 | |
Total | 79 | 879.55 | | |

Interpret: The value of the test statistic is F = 2.17 with a p-value of .0467. We conclude that differences in the mean number of jobs exist between at least two of the eight treatments.

This statistical result raises more questions—namely, can we conclude that the differences in the mean number of jobs are caused by differences between males and females? Or are they caused by differences between educational levels? Or, perhaps, are there combinations, called interactions, of gender and education that result in especially high or low numbers?

A complete factorial experiment is an experiment in which the data for all possible combinations of the levels of the factors are gathered. That means that in the above example we measured the number of jobs for all eight combinations. This experiment is called a complete 2 × 4 factorial experiment. In general, we will refer to one of the factors as factor A (arbitrarily chosen). The number of levels of this factor will be denoted by a. The other factor is called factor B, and its number of levels is denoted by b. The number of observations for each combination is called a replicate. The number of replicates is denoted by r. We address only problems in which the number of replicates is the same for each treatment. Such a design is called balanced.

\(x_{ijk}\) = \(k\)th observation in the \(ij\)th treatment
\(\bar x[AB]_{ij}=\) mean of the treatment when the factor A level is i and the factor B level is j
\(\bar x[A]_{i}=\) Mean of the observations when the factor A level is i
\(\bar x[B]_{j}=\) Mean of the observations when the factor B level is j
\(\bar{\bar x}=\) Mean of all the observations
a = Number of factor A levels
b = Number of factor B levels
r = Number of replicates

\[SS(Total) = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^r (x_{ijk} - \bar{\bar x})^2\] \[SS(A) = rb \sum_{i=1}^a (\bar x[A]_{i} - \bar{\bar x})^2\] \[SS(B) = ra \sum_{j=1}^b (\bar x[B]_{j} - \bar{\bar x})^2\] \[SS(AB) = r \sum_{i=1}^a \sum_{j=1}^b (\bar x[AB]_{ij} - \bar x[A]_{i} - \bar x[B]_{j} + \bar{\bar x})^2\] \[SSE = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^r (x_{ijk} - \bar x[AB]_{ij})^2\]

\(\nu_{SS(A)} = a -1\)
\(\nu_{SS(B)} = b -1\)
\(\nu_{SS(AB)} = (a -1)(b-1)\)
\(\nu_{SSE} = n - ab\)

F-Tests Conducted in Two-Factor Analysis of Variance
Test for Differences between the Levels of Factor A
\(H_{0}\): The means of the a levels of factor A are equal
\(H_{1}\): At least two means differ
Test for Differences between the Levels of Factor B
\(H_{0}\): The means of the b levels of factor B are equal
\(H_{1}\): At least two means differ
Test for Interaction between Factors A and B
\(H_{0}\): Factors A and B do not interact to affect the mean responses
\(H_{1}\): Factors A and B do interact to affect the mean responses

Required Conditions
* The response variable is normally distributed.
* The variance for each treatment is identical.
* The samples are independent.

ANOVA Table for the Two-Factor Experiment:

SOURCE OF VARIATION | DEGREES OF FREEDOM | SUMS OF SQUARES | MEAN SQUARES | F-STATISTIC
Factor A | a − 1 | SS(A) | MS(A) | MS(A)/MSE
Factor B | b − 1 | SS(B) | MS(B) | MS(B)/MSE
Interaction | (a − 1)(b − 1) | SS(AB) | MS(AB) | MS(AB)/MSE
Error | n − ab | SSE | MSE |
Total | n − 1 | SS(Total) | |
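A sketch of the balanced two-factor (fixed effects) sums of squares, assuming the observations are stored in an a × b × r NumPy array (the data below are randomly generated, purely for illustration):

import numpy as np

# hypothetical balanced data: a = 2 levels of factor A, b = 3 levels of factor B, r = 4 replicates
rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=(2, 3, 4))
a, b, r = x.shape
n = a * b * r

grand = x.mean()
mean_a = x.mean(axis=(1, 2))    # factor A level means
mean_b = x.mean(axis=(0, 2))    # factor B level means
mean_ab = x.mean(axis=2)        # treatment (cell) means

ss_a = r * b * ((mean_a - grand) ** 2).sum()
ss_b = r * a * ((mean_b - grand) ** 2).sum()
ss_ab = r * ((mean_ab - mean_a[:, None] - mean_b[None, :] + grand) ** 2).sum()
sse = ((x - mean_ab[:, :, None]) ** 2).sum()

ms_a, ms_b = ss_a / (a - 1), ss_b / (b - 1)
ms_ab, mse = ss_ab / ((a - 1) * (b - 1)), sse / (n - a * b)
print(ms_a / mse, ms_b / mse, ms_ab / mse)   # the three F-statistics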

Two-way ANOVA: Jobs versus Gender, Education

SOURCE OF VARIATION | DEGREES OF FREEDOM | SUMS OF SQUARES | MEAN SQUARES | F-STATISTIC | P
Gender | 1 | 11.25 | 11.25 | 1.12 | 0.294
Education | 3 | 135.85 | 45.28 | 4.49 | 0.006
Interaction | 3 | 6.25 | 2.08 | 0.21 | 0.892
Error | 72 | 726.20 | 10.09 | |
Total | 79 | 879.55 | | |

Interpret: There is no evidence at the 5% significance level to infer that differences in the number of jobs exist between men and women. There is sufficient evidence at the 5% significance level to infer that differences in the number of jobs exist between educational levels. There is not enough evidence to conclude that there is an interaction between gender and education.

Order of Testing in the Two-Factor Analysis of Variance: Test for interaction first. If there is enough evidence to infer that there is interaction, do not conduct the other tests. If there is not enough evidence to conclude that there is interaction, proceed to conduct the F-tests for factors A and B.

http://felixfan.github.io/statistics-for-management-and-economics-study-notes-3 statistics for management and economics study notes 3

9. Sampling Distributions

9.1 Sampling Distribution of the Mean

Central Limit Theorem: The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of \(\bar X\) will resemble a normal distribution.

\[\mu_{\bar x} = \mu\]

\[\sigma_{\bar x}^2 = \frac{\sigma^2}{n}\]

If X is normal, then \(\bar X\) is normal. If X is nonnormal, then \(\bar X\) is approximately normal for sufficiently large sample sizes. The definition of “sufficiently large” depends on the extent of nonnormality of X.

Standardizing the sample mean:

\[Z = \frac{\bar X - \mu}{\sigma / \sqrt{n}}\]
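For instance, the probability that a sample mean exceeds a given value follows directly from this standardization. A short sketch with hypothetical numbers (SciPy assumed):

from math import sqrt
from scipy.stats import norm

mu, sigma, n = 100.0, 15.0, 36        # hypothetical population mean, std dev, and sample size
x_bar = 104.0
z = (x_bar - mu) / (sigma / sqrt(n))  # standardize the sample mean
print(norm.sf(z))                     # P(X-bar > 104) = P(Z > 1.6), about 0.0548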

9.2 Sampling Distribution of a Sample Proportion

\(\hat P\) is approximately normally distributed provided that np and n(1 − p) are greater than or equal to 5.

\[E(\hat P) = p\]

\[V(\hat P) = \sigma_{\hat p}^2 = \frac{p(1-p)}{n}\]

Standardizing the sample proportion:

\[Z = \frac{\hat P - p}{\sqrt{p(1-p)/n}}\]

9.3 Sampling Distribution of the Difference between Two Means

\[E(\bar X_{1} - \bar X_{2}) = \mu_{\bar x_{1} - \bar x_{2}} = \mu_{1} - \mu_{2}\]

\[V(\bar X_{1} - \bar X_{2}) = \sigma_{\bar x_{1} - \bar x_{2}}^2 = \frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}\]

Standardizing the difference between two sample means:

\[Z = \frac{(\bar X_{1} - \bar X_{2}) - (\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}}}\]

10. Introduction to Estimation

  • An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter.
  • An unbiased estimator is said to be consistent if the difference between the estimator and the parameter grows smaller as the sample size grows larger.
  • If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to have relative efficiency.

10.1 Estimating the Population Mean When the Population Standard Deviation is Known

\[\bar x \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\]

10.2 Determining the Sample Size to Estimate \(\mu\)

\[n = (\frac{z_{\alpha/2}\sigma}{B})^2\]

\[B = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\]

B stands for the bound on the error of estimation.
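A quick sketch of the calculation, with hypothetical values for \(\sigma\) and the bound B:

from math import ceil
from scipy.stats import norm

sigma = 25.0   # assumed population standard deviation (hypothetical)
B = 5.0        # desired bound on the error of estimation
alpha = 0.05
z = norm.ppf(1 - alpha / 2)      # z_{alpha/2}, about 1.96
n = ceil((z * sigma / B) ** 2)   # round up to the next whole observation
print(n)                         # 97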

11. Introduction to Hypothesis Testing

11.1 Concepts of Hypothesis Testing

  • null hypothesis usually refers to a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. \(H_{0}\)
  • alternative hypothesis (or maintained hypothesis or research hypothesis) refers to the hypothesis to be accepted if the null hypothesis is rejected. \(H_{1}\)
  • A Type I error occurs when we reject a true null hypothesis. \(\alpha\)
  • A Type II error is defined as not rejecting a false null hypothesis. \(\beta\)
  • The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed given that the null hypothesis is true.
  • If we reject the null hypothesis, we conclude that there is enough statistical evidence to infer that the alternative hypothesis is true.
  • If we do not reject the null hypothesis, we conclude that there is not enough statistical evidence to infer that the alternative hypothesis is true.

11.2 Testing the Population Mean When the Population Standard Deviation is Known

  • A two-tail test is conducted whenever the alternative hypothesis specifies that the mean is not equal to the value stated in the null hypothesis.
  • A one-tail test that focuses on the right tail of the sampling distribution is conducted whenever we want to know whether there is enough evidence to infer that the mean is greater than the quantity specified by the null hypothesis.
  • A one-tail test that focuses on the left tail of the sampling distribution is conducted whenever we want to know whether there is enough evidence to infer that the mean is less than the quantity specified by the null hypothesis.

11.2.1 Standardized Test Statistic

\[z = \frac{\bar x - \mu}{\sigma / \sqrt{n}}\]

The rejection region:

\[z > z_{\alpha / 2}\]

or

\[z < - z_{\alpha / 2}\]

11.2.2 Testing Hypotheses and Confidence Interval Estimators

\[\bar x \pm z_{\alpha / 2}\frac{\sigma}{\sqrt{n}}\]

we compute the interval estimate and determine whether the hypothesized value of the mean falls into the interval.

11.3 Calculating the Probability of a Type II Error

Example: A random sample of 400 monthly accounts is drawn, for which the sample mean is $178. The accounts are approximately normally distributed with a standard deviation of $65. Can we conclude that the mean is greater than $170 at \(\alpha\) = 5%?

\(H_{0}\): \(\mu \le 170\)

\(H_{1}\): \(\mu \gt 170\)

\(\frac{\bar x_{L} - 170}{65/\sqrt{400}} = 1.645\)

\(\bar x_{L} = 175.34\)

Therefore, the rejection region is:

\(\bar x \gt 175.34\)

The sample mean was computed to be 178. Because the test statistic (sample mean) is in the rejection region (it is greater than 175.34), we reject the null hypothesis. Thus, there is sufficient evidence to infer that the mean monthly account is greater than $170.

\(\beta = P(\bar X \lt 175.34\), given that the null hypothesis is false )

Suppose we want the probability of a Type II error when the mean account is actually $180.

\(\beta = P(\bar X \lt 175.34\), given that \(\mu = 180)\)

\(\beta = P(\frac{\bar X - \mu}{\sigma / \sqrt{n}} < \frac{175.34-180}{65/\sqrt{400}}) = P(Z \lt - 1.43) = 0.0764\)

This calculation illustrates the inverse relationship between the probabilities of Type I and Type II errors: decreasing \(\alpha\) increases \(\beta\), and vice versa. Unfortunately, there is no simple formula to determine what the significance level should be.
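The numbers in this example can be reproduced with a short sketch (SciPy assumed):

from math import sqrt
from scipy.stats import norm

mu0, sigma, n, alpha = 170.0, 65.0, 400, 0.05
x_bar_L = mu0 + norm.ppf(1 - alpha) * sigma / sqrt(n)  # rejection boundary, about 175.34
mu1 = 180.0                                            # the specific alternative mean
beta = norm.cdf((x_bar_L - mu1) / (sigma / sqrt(n)))   # P(do not reject | mu = 180), about 0.076
print(x_bar_L, beta, 1 - beta)                         # the last value is the power of the test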

11.4 Larger Sample Size Equals More Information Equals Better Decisions

11.5 Power of a Test

Power: the probability that the test leads us to reject the null hypothesis when it is false. Thus, the power of a test is 1 − β.

12. Inference About a Population

12.1 Inference about a Population Mean When the Population Standard Deviation is Unknown

When the population standard deviation is unknown and the population is normal, the test statistic for testing hypotheses about μ is

\[t = \frac{\bar x - \mu}{s/\sqrt{n}}\]

which is Student t-distributed with ν = n − 1 degrees of freedom.

Confidence Interval Estimator of μ When σ Is Unknown

\[\bar x \pm t_{\alpha/2}\frac{s}{\sqrt{n}}\]
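A sketch of the t-based interval using a small hypothetical sample:

import numpy as np
from scipy.stats import t

x = np.array([12.1, 9.8, 11.3, 10.4, 12.9, 11.0])   # hypothetical sample
n = len(x)
x_bar, s = x.mean(), x.std(ddof=1)                   # ddof=1 gives the sample standard deviation
alpha = 0.05
half_width = t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
print(x_bar - half_width, x_bar + half_width)        # 95% confidence interval for mu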

12.2 Inference about a Population Variance

The test statistic used to test hypotheses about \(\sigma^2\) is

\[\chi^2 = \frac{(n-1)s^2}{\sigma^2}\]

which is chi-squared distributed with ν = n − 1 degrees of freedom when the population random variable is normally distributed with variance equal to \(\sigma^2\).

Confidence Interval Estimator of \(\sigma^2\)

Lower confidence limit (LCL) = \(\frac{(n-1)s^2}{\chi_{\alpha /2}^2}\)

Upper confidence limit (UCL) = \(\frac{(n-1)s^2}{\chi_{1-\alpha /2}^2}\)
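A sketch with hypothetical values of n and s²:

from scipy.stats import chi2

n, s2, alpha = 25, 8.0, 0.05   # hypothetical sample size and sample variance
lcl = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)   # divide by chi-squared_{alpha/2}
ucl = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)       # divide by chi-squared_{1-alpha/2}
print(lcl, ucl)                                          # 95% confidence interval for sigma^2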

12.3 Inference about a Population Proportion

\[\hat p = \frac{x}{n}\]

Test Statistic for p

\[z = \frac{\hat P - p}{\sqrt{p(1-p)/n}}\]

which is approximately normal when np and n(1 − p) are greater than 5.

Confidence Interval Estimator of p

\[\hat p \pm z_{\alpha /2} \sqrt{\hat p (1 - \hat p)/n}\]

Sample Size to Estimate a Proportion

\[n = (\frac{z_{\alpha /2}\sqrt{\hat p (1-\hat p)}}{B})^2\]

\[B = z_{\alpha /2} \sqrt{\frac{\hat p (1-\hat p)}{n}}\]

13. Inference about Comparing Two Populations

13.1 Inference about the Difference between two Means: Independent Samples

Sampling Distribution of \(\bar x_{1} - \bar x_{2}\):

\(\bar x_{1} - \bar x_{2}\) is normally distributed if the populations are normal and approximately normal if the populations are nonnormal and the sample sizes are large.
\[E( \bar x_{1} - \bar x_{2} ) = \mu_{1} - \mu_{2}\] \[V( \bar x_{1} - \bar x_{2} ) = \frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}\] \[Z = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^2}{n_{1}} + \frac{\sigma_{2}^2}{n_{2}}}}\]

13.1.1 Test Statistic for \(\mu_{1} - \mu_{2}\) when \(\sigma_{1}^2 = \sigma_{2}^2\)

\[t = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{s_{p}^2(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}\]

where \(s_{p}^2\) is called the pooled variance estimator:

\[s_{p}^2 = \frac{(n_{1} -1)s_{1}^2 + (n_{2} -1)s_{2}^2}{n_{1} + n_{2} - 2}\]

13.1.2 Confidence Interval Estimator of \(\mu_{1} - \mu_{2}\) when \(\sigma_{1}^2 = \sigma_{2}^2\)

\[(\bar x_{1} - \bar x_{2}) \pm t_{\alpha /2}\sqrt{s_{p}^2(\frac{1}{n_{1}} + \frac{1}{n_{2}})}\]
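A sketch of the equal-variances interval with two small hypothetical samples:

import numpy as np
from scipy.stats import t

x1 = np.array([5.2, 6.1, 5.8, 6.4, 5.5])   # hypothetical sample 1
x2 = np.array([4.9, 5.3, 5.1, 4.7, 5.6])   # hypothetical sample 2
n1, n2 = len(x1), len(x2)
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)   # pooled variance
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = t.ppf(0.975, df=n1 + n2 - 2)
diff = x1.mean() - x2.mean()
print(diff - t_crit * se, diff + t_crit * se)   # 95% CI for mu1 - mu2
# scipy.stats.ttest_ind(x1, x2) reproduces the corresponding equal-variances test statistic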

13.1.3 Test Statistic for \(\mu_{1} - \mu_{2}\) when \(\sigma_{1}^2 \ne \sigma_{2}^2\)

\[t = \frac{(\bar x_{1} - \bar x_{2}) -(\mu_{1} - \mu_{2})}{\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}}\]

\[\nu = \frac{(s_{1}^2/n_{1} + s_{2}^2/n_{2})^2}{\frac{(s_{1}^2/n_{1})^2}{n_{1}-1} + \frac{(s_{2}^2/n_{2})^2}{n_{2}-1}}\]

13.1.4 Confidence Interval Estimator of \(\mu_{1} - \mu_{2}\) when \(\sigma_{1}^2 \ne \sigma_{2}^2\)

\[(\bar x_{1} - \bar x_{2}) \pm t_{\alpha /2}\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}\]

13.1.5 Testing the Population Variances

\(H_{0}\): \(\frac{\sigma_{1}^2}{\sigma_{2}^2} = 1\)
\(H_{1}\): \(\frac{\sigma_{1}^2}{\sigma_{2}^2} \ne 1\)

\[F = \frac{s_{1}^2}{s_{2}^2}\]

\(\nu_{1} = n_{1} - 1\) and \(\nu_{2} = n_{2} - 1\). This is a two-tail test so that the rejection region is \(F \gt F_{\alpha/2, \nu_{1},\nu_{2}}\) or \(F \lt F_{1-\alpha/2, \nu_{1},\nu_{2}}\).

Confidence Interval Estimator of \(\sigma_{1}^2/\sigma_{2}^2\)

\[LCL = \frac{s_{1}^2}{s_{2}^2} \frac{1}{F_{\alpha/2,\nu_{1},\nu_{2}}}\] \[UCL = \frac{s_{1}^2}{s_{2}^2} F_{\alpha/2,\nu_{1},\nu_{2}}\]

13.2 Inference about the Difference between two Means: Matched Pairs Experiment

\(\mu_{D}\) is the mean of the population of differences.

Test Statistic for \(\mu_{D}\)

\[t = \frac{\bar x_{D} - \mu_{D}}{s_{D}/\sqrt{n_{D}}}\]

which is Student t distributed with \(\nu = n_{D} - 1\) degrees of freedom, provided that the differences are normally distributed.

Confidence Interval Estimator of \(\mu_{D}\)

\[\bar x_{D} \pm t_{\alpha/2}\frac{s_{D}}{\sqrt{n_{D}}}\]

13.3 Inference about the Difference between two Population Proportions

The statistic \(\hat p_{1} − \hat p_{2}\) is approximately normally distributed provided that the sample sizes are large enough so that \(n_{1}p_{1}\), \(n_{1}(1-p_{1})\), \(n_{2}p_{2}\), and \(n_{2}(1-p_{2})\) are all greater than or equal to 5.

\[E(\hat p_{1} − \hat p_{2}) = p_{1} − p_{2}\]

\[V(\hat p_{1} − \hat p_{2}) = \frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}\]

\[Z = \frac{(\hat p_{1} − \hat p_{2}) - (p_{1} − p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}}}\]

\[\hat p_{1} = \frac{x_{1}}{n_{1}}\] \[\hat p_{2} = \frac{x_{2}}{n_{2}}\]

13.3.1 Test Statistic for \(p_{1} − p_{2}\): Case 1

\(H_{0}\): \(p_{1} − p_{2} = 0\)

\[z = \frac{\hat p_{1} − \hat p_{2}}{\sqrt{\hat p(1-\hat p)(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}\]

\[\hat p = \frac{x_{1} + x_{2}}{n_{1} + n_{2}}\]
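A sketch of the Case 1 test with hypothetical counts:

from math import sqrt
from scipy.stats import norm

x1, n1 = 52, 200   # hypothetical successes and sample size, group 1
x2, n2 = 38, 200   # hypothetical successes and sample size, group 2
p1_hat, p2_hat = x1 / n1, x2 / n2
p_hat = (x1 + x2) / (n1 + n2)   # pooled proportion
z = (p1_hat - p2_hat) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
print(z, 2 * norm.sf(abs(z)))   # test statistic and two-tail p-value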

13.3.2 Test Statistic for \(p_{1} − p_{2}\): Case 2

\(H_{0}\): \(p_{1} − p_{2} = D, D\ne0\)

\[z = \frac{(\hat p_{1} − \hat p_{2}) - D}{\sqrt{\frac{\hat p_{1}(1-\hat p_{1})}{n_{1}} + \frac{\hat p_{2}(1-\hat p_{2})}{n_{2}}}}\]

http://felixfan.github.io/ImageJ

ImageJ introduction

ImageJ is a public domain Java image processing program inspired by NIH Image for the Macintosh. It runs, either as an online applet or as a downloadable application, on any computer with a Java 1.4 or later virtual machine. Downloadable distributions are available for Windows, Mac OS, Mac OS X and Linux.

DPI

By default, the DPI in the JPEG header is set to 72. For a higher value, use a unit of “inch” in the Analyze -> Set Scale dialog (requires v1.40 or later). For example, setting “Distance in Pixels” to 300, “Known Distance” to 1 and “Unit of Length” to “inch” will set the DPI to 300.

Only the TIFF format preserves this DPI setting; saving as PNG resets the DPI to 72.
Note that ImageJ does not read or write the resolution for JPG files, only that of TIFFs.

Color Spaces

  • Grayscale: The simplest color representation has no color at all, just black, white, and shades of gray.
  • RGB: red, green, and blue. RGB is an additive color model — the desired color is created by adding together different amounts of red, green, and blue light.
  • CMYK: Another way to add color to an image is to subtract it. In subtractive color models, each channel represents a pigment absorbing a certain color. CMYK color represents a common color printing process, with cyan, magenta, yellow, and black inks (the K stands for “key”).

Use the Image -> Type submenu to determine the type of the active image or to convert it to another type.

  • 8-bit. Converts to 8-bit grayscale.
  • 8-bit Color. Converts to 8-bit indexed color using Heckbert’s median-cut color quantization algorithm.
  • RGB Color. Converts to 32-bit RGB color.

Image -> Type -> 8-bit will convert color to grayscale.

Crop

Image -> Crop

http://felixfan.github.io/english-email-2

Opening an email

Thank you for contacting us.
Thank you for your prompt reply. 
Thank you for your reply.
Thank you for getting back to me.
Thank you for providing the requested information.
Thank you for all your assistance.
I truly appreciate … your help in resolving the problem.
Thank you for raising your concerns.
Thank you for your feedback.

Closing an email

Thank you for your kind cooperation.
Thank you for your attention to this matter.
Thank you for your understanding.
Thank you for your consideration.
Thank you again for everything you've done.

Other situations

Hope you have a good trip back.
How is the project going? 
I suggest we have a call tonight at 9:30pm. Please let me know if the time is okay for you.
I would like to hold a meeting in the afternoon about our development planning for the project A. 
We’d like to have the meeting on Thu Oct 30. Same time. 
Let’s make a meeting next Monday at 5:30 PM. 
I want to talk to you over the phone regarding issues about report development and the XXX project. 
For the next step of platform implementation, I am proposing… 
I suggest we can have a weekly project meeting over the phone call in the near future. 
Should you have any problem accessing the folders, please let me know. 
Thank you and look forward to having your opinion on the estimation and schedule. 
Look forward to your feedback and suggestions soon.
What is your opinion on the schedule and next steps we proposed? 
What do you think about this? 
Feel free to give your comments. 
Any question, please don’t hesitate to let me know.
Any question, please let me know.
Please contact me if you have any questions.
Please let me know if you have any question on this.
Your comments and suggestions are welcome!
Please let me know what you think.
Do you have any idea about this? 
It would be nice if you could provide a bit more information on the user’s behavior. 
At your convenience, I would really appreciate you looking into this matter/issue. 
Please see comments below.
My answers are in blue below.
I add some comments to the document for your reference.
We would like to finish the following tasks by the end of today:
Some known issues in this release:
Our team here reviewed the newest SCM policy and has the following concerns:
Here are some more questions/issues for your team:
The current status is as follows:
Some items need your attention:
I have some questions about the report 
For the assignment ABC, I have the following questions:
I enclose the evaluation report for your reference.
Attached please find today’s meeting notes.
Attached is the design document; please review it.
For other known issues related to individual features, please see attached release notes.
Thank you so much for the cooperation.
Thanks for the information.
I really appreciate the effort you all made for this sudden and tight project. 
Thank you for your attention! 
Your kind assistance on this is very much appreciated.
Really appreciate your help! 
I sincerely apologize for this misunderstanding! 
I apologize for the late asking but we want to make sure the correctness of our implementation ASAP. 
http://felixfan.github.io/delete-github-tag

Open a terminal window and navigate to your local GitHub repository.

git tag -d tagName
git push origin :tagName

If your tag has the same name as one of your branches, use this instead:

git tag -d tagName
git push origin :refs/tags/tagName

You need to replace tagName with the tag name that you want to delete.

http://felixfan.github.io/mapping-snps-to-genes

1. Download SNP information

(1) Go to the Table Browser of the UCSC Genome Browser.
(2) Choose the desired assembly (e.g., hg19).
(3) Set group to "Variation".
(4) Set track to the desired data set (e.g., Common SNPs(146)).
(5) Set table to the desired table (e.g., snp146Common).
(6) Set output format to 'BED-browser extensible data'.
(7) Click 'get output' to download the data and save it as 'hg19_commonSNPs146.txt'.

2. Download gene information

You can use this gene data set: genes that are consistently annotated across the Ensembl and Entrez Gene databases and that have HUGO identifiers.

Note that this data set uses hg19/GRCh37 coordinates.

3. Extract all SNPs within genes

(1) Install BEDTools

(2) Extract SNPs on the autosomes and the X and Y chromosomes, then sort

awk '$1!~"_" && $1!~"M" {printf("%s\t%d\t%d\t%s\n", $1,$2,$3,$4);}' hg19_commonSNPs146.txt | sort -k1,1 -k2,2n -k3,3n -k4,4  > hg19_snp146_auto_sorted.txt

(3) Extend each gene's coordinates by 2000 bp upstream and downstream

awk '{printf("%s\t%d\t%d\t%s\n", "chr"$1,$2-2000,$3+2000,$4);}' hugo.txt > hugo_2kb.txt

(4) In the gene file, chromosomes 23 and 24 denote X and Y; fix the names and sort

sed 's/chr23/chrX/' hugo_2kb.txt > hugo_2kb_v1.txt          
sed 's/chr24/chrY/' hugo_2kb_v1.txt > hugo_2kb_v2.txt          
sort -k1,1 -k2,2n -k3,3n hugo_2kb_v2.txt > hugo_2kb_v2_sorted.txt          

(5) Mapping (this step takes a while)

intersectBed -a hg19_snp146_auto_sorted.txt -b hugo_2kb_v2_sorted.txt -wa -wb | awk '{print $4, $8}' > geneSNPs_2kb.txt


http://felixfan.github.io/statistics-for-management-and-economics-study-notes-2 statistics for management and economics study notes 2

5. Data Collection And Sampling

5.1 Simple Random Sample

A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.

5.2 Stratified Random Sampling

A stratified random sample is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum.

5.3 Cluster Sampling

A cluster sample is a simple random sample of groups or clusters of elements.

5.4 Sampling Error

Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.

5.5 Nonsampling Error

Nonsampling errors result from mistakes made in the acquisition of data or from the sample observations being selected improperly.

  • Errors in data acquisition.
  • Nonresponse error refers to error (or bias) introduced when responses are not obtained from some members of the sample.
  • Selection bias occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample.

6 Probability

6.1 Intersection

The intersection of events A and B is the event that occurs when both A and B occur. The probability of the intersection is called the joint probability.

6.2 Marginal Probability

Marginal probabilities, computed by adding across rows or down columns, are so named because they are calculated in the margins of the table.

6.3 Conditional Probability

The probability of event A given event B is

\[p(A|B) = \frac{p(AB)}{p(B)}\]

The probability of event B given event A is

\[p(B|A) = \frac{p(AB)}{p(A)}\]

6.4 Independence

Two events A and B are said to be independent if

\[p(A|B) = p(A)\]

or

\[p(B|A) = p(B)\]

6.5 Union

The union of events A and B is the event that occurs when either A or B or both occur. It is denoted as A or B.

6.6 Complement Rule

The complement of event A is the event that occurs when event A does not occur.

\[p(A^c) = 1 - p(A)\]

6.7 Multiplication Rule

\[p(AB) = p(A)p(B|A) = p(B)p(A|B)\]

6.8 Addition Rule

The probability that event A, or event B, or both occur is

\[p(A \text{ or } B) = p(A) + p(B) - p(AB)\]

6.9 Bayes’s Law Formula

\[p(A_{i}|B) = \frac{p(A_{i})p(B|A_{i})}{p(A_{1})p(B|A_{1}) + p(A_{2})p(B|A_{2}) + \cdots + p(A_{k})p(B|A_{k})}\]
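A small sketch of the formula with hypothetical prior and conditional probabilities:

# hypothetical priors p(A_i) and likelihoods p(B|A_i) for k = 3 mutually exclusive events
priors = [0.5, 0.3, 0.2]
likelihoods = [0.02, 0.05, 0.10]

denominator = sum(p * l for p, l in zip(priors, likelihoods))            # p(B)
posteriors = [p * l / denominator for p, l in zip(priors, likelihoods)]  # p(A_i | B)
print(posteriors)   # the posterior probabilities sum to 1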

7. Random Variables and Discrete Probability Distributions

7.1 Describing the Population Probability Distribution

\[E(x) = \mu = \sum xp(x)\]

\[V(x) = \sigma^2 = \sum (x-\mu)^2p(x)\]

7.2 Laws of Expected Value and Variance

\[E(c) = c\] \[E(x + c) = E(x) + c\] \[E(cx) = cE(x)\] \[V(c) = 0\] \[V(x + c) = V(x)\] \[V(cx) = c^2V(x)\]

7.3 Bivariate Distributions

The covariance of two discrete variables is defined as

\[COV(x, y) = \sigma_{xy} = \sum \sum (x - \mu_{x})(y-\mu_{y})p(x, y)\]

Coefficient of Correlation:

\[\rho = \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}\]

7.4 Laws of Expected Value and Variance of the Sum of Two Variables

\[E(x + y) = E(x) + E(y)\]

\[V(x + y) = V(x) + V(y) + 2COV(x, y)\]

7.5 Mean and Variance of a Portfolio of Two Stocks

\[E(R_{p}) = w_{1}E(R_{1}) + w_{2}E(R_{2})\]

\[V(R_{p}) = w_{1}^2V(R_{1}) + w_{2}^2V(R_{2}) + 2w_{1}w_{2}COV(R_{1}, R_{2}) = w_{1}^2V(R_{1}) + w_{2}^2V(R_{2}) + 2w_{1}w_{2}\rho\sigma_{1}\sigma_{2}\]

7.6 Portfolios with More Than Two Stocks

\[E(R_{p}) = \sum_{i=1}^k w_{i}E(R_{i})\]

\[V(R_{p}) = \sum_{i=1}^k w_{i}^2V(R_{i}) + 2\sum_{i=1}^k \sum_{j=i+1}^k w_{i}w_{j}COV(R_{i}, R_{j})\]

7.7 Binomial Distribution

  • The binomial experiment consists of a fixed number of trials (n).
  • Each trial has two possible outcomes: success or failure.
  • The probability of success is p. The probability of failure is 1 − p.
  • The trials are independent.

The probability of x successes in a binomial experiment with n trials and probability of success = p is

\[p(x) = \frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}\]

7.7.1 Cumulative Probability

\[p(X \le 4) = p(0) + p(1) + p(2) + p(3) + p(4)\]

7.7.2 Binomial Probability p(X ≥ x)

\[p(X \ge x) = 1 - p(X \le (x-1))\]

7.7.3 Binomial Probability P(X = x)

\[p(x) = p(X \le x) - p(X \le (x-1))\]

7.7.4 Mean and Variance of a Binomial Distribution

\[\mu = np\] \[\sigma^2 = np(1-p)\] \[\sigma = \sqrt{np(1-p)}\]
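These binomial formulas map directly onto scipy.stats.binom; a sketch with hypothetical n and p:

from scipy.stats import binom

n, p = 10, 0.3                            # hypothetical number of trials and success probability
print(binom.pmf(4, n, p))                 # P(X = 4)
print(binom.cdf(4, n, p))                 # P(X <= 4)
print(1 - binom.cdf(3, n, p))             # P(X >= 4) = 1 - P(X <= 3)
print(binom.mean(n, p), binom.var(n, p))  # np and np(1 - p)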

7.8 Poisson Distribution

Like the binomial random variable, the Poisson random variable is the number of occurrences of events, which we’ll continue to call successes. The difference between the two random variables is that a binomial random variable is the number of successes in a set number of trials, whereas a Poisson random variable is the number of successes in an interval of time or specific region of space.

  • The number of successes that occur in any interval is independent of the number of successes that occur in any other interval.
  • The probability of a success in an interval is the same for all equal-size intervals.
  • The probability of a success in an interval is proportional to the size of the interval.
  • The probability of more than one success in an interval approaches 0 as the interval becomes smaller.

The probability that a Poisson random variable assumes a value of x in a specific interval is

\[p(x) = \frac{e^{-\mu}\mu^x}{x!}\]

The variance of a Poisson random variable is equal to its mean; that is,

\[\sigma^2 = \mu\]

\[p(X \ge x) = 1 - p(X \le (x-1))\]

\[p(x) = p(X \le x) - p(X \le (x-1))\]
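The same pattern works for the Poisson distribution via scipy.stats.poisson (the mean below is hypothetical):

from scipy.stats import poisson

mu = 3.2                        # hypothetical mean number of successes per interval
print(poisson.pmf(2, mu))       # P(X = 2)
print(poisson.cdf(4, mu))       # P(X <= 4)
print(1 - poisson.cdf(4, mu))   # P(X >= 5) = 1 - P(X <= 4)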

8. Continuous Probability Distributions

8.1 Uniform Distribution

\[f(x) = \frac{1}{b-a}, a \le x \le b\]

8.2 Normal Distribution

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\]

8.2.1 Calculating Normal Probabilities

We standardize a random variable by subtracting its mean and dividing by its standard deviation. When the variable is normal, the transformed variable is called a standard normal random variable and denoted by Z; that is,

\[Z = \frac{X - \mu}{\sigma}\]

8.3 Exponential Distribution

\[f(x) = \lambda e^{-\lambda x}, x \ge 0\]

\[\mu = \sigma = \frac{1}{\lambda}\]

\[p(X > x) = e^{-\lambda x}\]

\[p(X < x) = 1 - e^{-\lambda x}\]

\[p(x_{1} < X < x_{2}) = p(X < x_{2}) - p(X < x_{1}) = e^{-\lambda x_{1}} - e^{-\lambda x_{2}}\]

8.4 Student t Distribution

\[f(t)=\frac{\Gamma[(\nu + 1)/2]}{\sqrt{\nu \pi} \Gamma (\nu /2)}[1 + \frac{t^2}{\nu}]^{-(\nu + 1)/2}\]

where \(\nu\) (Greek letter nu) is the parameter of the Student t distribution called the degrees of freedom, and \(\Gamma\) is the gamma function.

\[E(t) = 0\]

\[V(t) = \frac{\nu}{\nu - 2}, \nu \gt 2\]

Student t distribution is similar to the standard normal distribution. Both are symmetrical about 0. We describe the Student t distribution as mound shaped, whereas the normal distribution is bell shaped. As \(\nu\) grows larger, the Student t distribution approaches the standard normal distribution.

8.5 Chi-Squared Distribution

\[f(\chi^2) = \frac{1}{\Gamma(\nu/2)} \frac{1}{2^{\nu/2}}(\chi^2)^{(\nu/2)-1}e^{-\chi^2/2}\]

\[E(\chi^2) = \nu\]

\[V(\chi^2) = 2\nu\]

8.6 F Distribution

\[E(F) = \frac{\nu_{2}}{\nu_{2} - 2}, \nu_{2} \gt 2\]

\[V(F) = \frac{2\nu_{2}^2(\nu_{1} + \nu_{2} -2)}{\nu_{1}(\nu_{2}-2)^2(\nu_{2} -4)}, \nu_{2} \gt 4\]

\(\nu_{1}\) the numerator degrees of freedom and \(\nu_{2}\) the denominator degrees of freedom.

http://felixfan.github.io/numpy numpy
In [1]:
import numpy as np

1. Array Creation

In [2]:
alist = [1,2,3]
arr = np.array(alist) # converting list to ndarray
arr
Out[2]:
array([1, 2, 3])
In [3]:
arr.tolist() # Converting ndarray to list
Out[3]:
[1, 2, 3]
In [4]:
np.zeros(5) # Creating an array of zeros with five elements
Out[4]:
array([ 0.,  0.,  0.,  0.,  0.])
In [5]:
np.arange(10) # Create an ndarray with 10 elements from 0 to 9
Out[5]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [6]:
np.arange(3,8) # start (inclusive) to stop (exclusive)
Out[6]:
array([3, 4, 5, 6, 7])
In [7]:
np.linspace(0, 5, 9) # start, stop, number of points
Out[7]:
array([ 0.   ,  0.625,  1.25 ,  1.875,  2.5  ,  3.125,  3.75 ,  4.375,  5.   ])
In [8]:
np.logspace(0, 5, 10, base=10.0)
Out[8]:
array([  1.00000000e+00,   3.59381366e+00,   1.29154967e+01,
         4.64158883e+01,   1.66810054e+02,   5.99484250e+02,
         2.15443469e+03,   7.74263683e+03,   2.78255940e+04,
         1.00000000e+05])
In [9]:
np.zeros((5,5)) # Creating a 5x5 array of zeros
Out[9]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
In [10]:
np.ones((3,3)) # Creating a 3x3 array of ones
Out[10]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
In [11]:
arr1d = np.arange(12)
arr2d = arr1d.reshape((3,4))
arr2d
Out[11]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [12]:
arr2d = np.reshape(arr1d,(4,3))
arr2d
Out[12]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

2. Indexing and Slicing

In [13]:
alist = [[1,2],[3,4]]
arr = np.array(alist)
In [14]:
arr[0,1]
Out[14]:
2
In [15]:
arr[:,1] # access the last column
Out[15]:
array([2, 4])
In [16]:
arr[1,:] ## access the bottom row.
Out[16]:
array([3, 4])
In [17]:
arr = np.arange(5)
index = np.where(arr > 2) # Creating the index array
new_arr = arr[index] # Creating the desired array
new_arr
Out[17]:
array([3, 4])
In [18]:
new_arr = arr[arr > 2]
new_arr
Out[18]:
array([3, 4])
In [19]:
new_arr = np.delete(arr, index)
new_arr
Out[19]:
array([0, 1, 2])

3. Boolean Statements

In [20]:
img1 = np.zeros((5, 5)) + 3
img1[1:3, 2:4] = 6
img1[3:5, 0:2] = 8
img1
Out[20]:
array([[ 3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  6.,  6.,  3.],
       [ 3.,  3.,  6.,  6.,  3.],
       [ 8.,  8.,  3.,  3.,  3.],
       [ 8.,  8.,  3.,  3.,  3.]])
In [21]:
# filter out all values larger than 3 and less than 7
index1 = img1 > 3
index2 = img1 < 7
compoundindex = index1 & index2
img2 = np.copy(img1)
img2[compoundindex] = 0
img2
Out[21]:
array([[ 3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  0.,  0.,  3.],
       [ 3.,  3.,  0.,  0.,  3.],
       [ 8.,  8.,  3.,  3.,  3.],
       [ 8.,  8.,  3.,  3.,  3.]])
In [22]:
index3 = (img1==8)
compoundindex2 = compoundindex | index3
img3 = np.copy(img1)
img3[compoundindex2] = 0
img3
Out[22]:
array([[ 3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  0.,  0.,  3.],
       [ 3.,  3.,  0.,  0.,  3.],
       [ 0.,  0.,  3.,  3.,  3.],
       [ 0.,  0.,  3.,  3.,  3.]])

4. Read and Write Data

In [23]:
arr = np.loadtxt('dat1.txt')
arr
Out[23]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])
In [24]:
np.savetxt('newdat1.txt', arr, delimiter=',', fmt='%.2f')
In [25]:
arr = np.loadtxt('dat2.txt', 
                 dtype={'names':('name', 'weight','unit'),
                'formats':('S5', 'f2','S2')})
arr
Out[25]:
array([('apple', 2.0, 'kg'), ('pear', 3.30078125, 'kg')], 
      dtype=[('name', 'S5'), ('weight', '<f2'), ('unit', 'S2')])

5. Linear Algebra

In [26]:
A = np.matrix([[3,6,-5],
              [1,-3,2],
              [5,-1,4]])
B = np.matrix([[12],
              [-2],
              [10]])
x = A**(-1)*B
x
Out[26]:
matrix([[ 1.75],
        [ 1.75],
        [ 0.75]])
In [27]:
a = np.array([[3,6,-5],
              [1,-3,2],
              [5,-1,4]])
b = np.array([12, -2, 10])
x = np.linalg.inv(a).dot(b)
x
# Although both methods works, use numpy.array whenever possible
Out[27]:
array([ 1.75,  1.75,  0.75])

6. Statistics

In [28]:
x = np.random.randn(1000)
In [29]:
x.mean()
Out[29]:
-0.012672228653646077
In [30]:
x.std()
Out[30]:
1.0374526578439356
In [31]:
x.var()
Out[31]:
1.0763080172674462
In [32]:
np.median(x)
Out[32]:
-0.027887390949930368
In [33]:
np.mean(x)
Out[33]:
-0.012672228653646077
http://felixfan.github.io/statistics-for-management-and-economics-study-notes-1 statistics for management and economics study notes 1

1. What is Statistics?

  • population is the group of all items of interest to a statistics practitioner. A descriptive measure of a population is called a parameter.
  • sample is a set of data drawn from the studied population. A descriptive measure of a sample is called a statistic.

2. Graphical Descriptive Techniques I

2.1 Types of Data and Information

  • Interval data are real numbers, such as heights, weights, incomes, and distances. We also refer to this type of data as quantitative or numerical.
  • The values of nominal data are categories; the values are not numbers but instead are words that describe the categories. Nominal data are also called qualitative or categorical.
  • Ordinal data appear to be nominal, but the difference is that the order of their values has meaning.

2.2 Describing a Set of Nominal Data

  • A bar chart is often used to display frequencies;
  • A pie chart graphically shows relative frequencies.
  • The bar chart focuses on the frequencies and the pie chart focuses on the proportions

3. Graphical Descriptive Techniques II

3.1 Describing a Set of Interval Data

3.1.1 Histogram

A histogram is created by drawing rectangles whose bases are the intervals and whose heights are the frequencies.

3.1.1.1 Determining the Number of Class Intervals

\[NumberOfClassIntervals = 1 + 3.3 log(n)\]

\[ClassWidth = \frac{LargestObservation - SmallestObservation}{NumberOfClasses}\]

3.1.2 Stem-and-Leaf Display

The first step in developing a stem-and-leaf display is to split each observation into two parts, a stem and a leaf. There are several different ways of doing this. For example, the number 12.3 can be split so that the stem is 12 and the leaf is 3. Another method can define the stem as 1 and the leaf as 2 (ignoring the 3). After each stem, we list that stem’s leaves, usually in ascending order. The advantage of the stem-and-leaf display over the histogram is that we can see the actual observations.

Stem  Leaf
0     000000000111112222223333345555556666666778888999999
1     000001111233333334455555667889999
2     0000111112344666778999
3     001335589
4     124445589
5     022224556789

3.2 Describing time-series Data

Cross-sectional data are observations measured at the same point in time, whereas time-series data represent measurements taken at successive points in time. Time-series data are often graphically depicted on a line chart, which plots the value of the variable on the vertical axis and the time periods on the horizontal axis.

3.3 Describing the Relationship between Two Interval Data

In applications where one variable depends to some degree on the other variable, we label the dependent variable Y and the other, called the independent variable, X. In interpreting the results of a scatter diagram it is important to understand that if two variables are linearly related it does not mean that one is causing the other. Correlation is not causation.

4. Numerical Descriptive Techniques

4.1 Measure of Central Location

4.1.1 Arithmetic Mean

\[\mu=\frac{\sum_{i=1}^Nx_{i}}{N}\]

\[\bar x=\frac{\sum_{i=1}^nx_{i}}{n}\]

4.1.2 Median

The median is calculated by placing all the observations in order (ascending or descending). The observation that falls in the middle is the median. When there is an even number of observations, the median is determined by averaging the two observations in the middle.

4.1.3 Mode

The mode is defined as the observation (or observations) that occurs with the greatest frequency.

4.1.4 Mean, Median, Mode: Which Is Best?

The mean is generally our first selection. One advantage the median holds is that it is not as sensitive to extreme values as is the mean. The mode is seldom the best measure of central location.

4.1.5 Geometric mean

\[(1+R_{g})^n=(1+R_{1})(1+R_{2})\cdots(1+R_{n})\]

4.2 Measures of Variability

4.2.1 Range

Range = Largest observation − Smallest observation

4.2.2 Variance

\[\sigma^2=\frac{\sum_{i=1}^N(x_{i}-\mu)^2}{N}\]

\[s^2=\frac{\sum_{i=1}^n(x_{i}-\bar x)^2}{n-1}\]

4.2.3 Standard Deviation

\[\sigma = \sqrt{\sigma ^2}\] \[s = \sqrt{s ^2}\]

4.2.4 Chebysheff’s Theorem

The proportion of observations in any sample or population that lie within k standard deviations of the mean is at least

\[1 - \frac{1}{k^2}, k>1\]

4.2.5 Coefficient of Variation

\[CV = \frac{\sigma}{\mu}\]

\[cv = \frac{s}{\bar x}\]

4.3 Measures of Relative Standing and Box Plots

4.3.1 Percentile

The \(P_{th}\) percentile is the value for which P percent are less than that value and (100 – P)% are greater than that value.

4.3.2 Locating Percentiles

\[L_{p} = (n+1)\frac{P}{100}\]

where \(L_{p}\) is the location of the \(P_{th}\) percentile.

Placing the 10 observations in ascending order we get

0 0 5 7 8 9 12 14 22 33

The location of the 25th percentile is

\[L_{25} = (10+1)\frac{25}{100} = 2.75\]

The \(25_{th}\) percentile is three-quarters of the distance between the second (which is 0) and the third (which is 5) observations. Three-quarters of the distance is

(.75)(5 − 0) = 3.75

Because the second observation is 0, the \(25_{th}\) percentile is 0 + 3.75 = 3.75.
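The hand calculation above can be checked in NumPy; the 'weibull' method implements the (n + 1)P/100 location rule (this keyword requires a reasonably recent NumPy):

import numpy as np

data = np.array([0, 0, 5, 7, 8, 9, 12, 14, 22, 33])
print(np.percentile(data, 25, method='weibull'))   # 3.75, matching the calculation above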

4.3.3 Interquartile Range

\[InterquartileRange = Q_{3} − Q_{1}\]

4.3.4 Box Plots

This technique graphs five statistics: the minimum and maximum observations, and the first, second, and third quartiles. The three vertical lines of the box are the first, second, and third quartiles. The lines extending to the left and right are called whiskers. Any points that lie outside the whiskers are called outliers. The whiskers extend outward to the smaller of 1.5 times the interquartile range or to the most extreme point that is not an outlier.

4.4 Measures of Linear Relationship

4.4.1 Covariance

\[\sigma_{xy} = \frac{\sum_{i=1}^N(x_{i}-\mu_{x})(y_{i}-\mu_{y})}{N}\]

\[s_{xy} = \frac{\sum_{i=1}^n(x_{i}-\bar x)(y_{i}-\bar y)}{n-1}\]

4.4.2 Coefficient of Correlation

\[\rho=\frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}\]

\[r=\frac{s_{xy}}{s_{x}s_{y}}\]

4.4.3 Least Squares Method

\[\hat y = b_{0} + b_{1}x\]

The coefficients \(b_{0}\) and \(b_{1}\) are derived using calculus so that we minimize the sum of squared deviations:

\[\sum_{i=1}^n(y_{i}-\hat{y_{i}})^2\]

Least Squares Line Coefficients:

\[b_{1} = \frac{s_{xy}}{s_{x}^2}\] \[b_{0} = \bar y - b_{1}\bar x\]
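A sketch computing the least squares coefficients directly from these formulas (hypothetical x and y):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical dependent variable

s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)   # sample covariance
b1 = s_xy / x.var(ddof=1)        # slope b1 = s_xy / s_x^2
b0 = y.mean() - b1 * x.mean()    # intercept b0 = y-bar - b1 * x-bar
print(b0, b1)                    # np.polyfit(x, y, 1) returns the same coefficients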

4.4.4 Coefficient of Determination

Coefficient of determination \(R^2\) is calculated by squaring the coefficient of correlation. The coefficient of determination measures the amount of variation in the dependent variable that is explained by the variation in the independent variable.

4.4.5 Interpreting Correlation

Correlation is not Causation

http://felixfan.github.io/bash-cmp

1. Comparing integers

a=5
b=4
c=5
if [ $a -ne $b ]; then
    echo "$a is not equal to $b"
else
    echo "$a is equal to $b"
fi
5 is not equal to 4
if [ $a -lt $b ]; then
    echo "$a is less than $b"
else
    echo "$a is not less than $b"
fi
5 is not less than 4
if [ $a -gt $b ]; then
    echo "$a is great than $b"
else
    echo "$a is not great than $b"
fi
5 is great than 4
if [ $a -ge $b ]; then
    echo "$a is great than or equal to $b"
else
    echo "$a is less than $b"
fi
5 is great than or equal to 4
if [ $a -le $c ]; then
    echo "$a is less than or equal to $c"
else
    echo "$a is great than $b"
fi
5 is less than or equal to 5
if (($a != $b )); then
    echo "$a is not equal to $b"
else
    echo "$a is equal to $b"
fi
5 is not equal to 4
if (($a < $b)); then
    echo "$a is less than $b"
else
    echo "$a is not less than $b"
fi
5 is not less than 4
if (($a > $b)); then
    echo "$a is great than $b"
else
    echo "$a is not great than $b"
fi
5 is great than 4
if (($a >= $b)); then
    echo "$a is great than or equal to $b"
else
    echo "$a is less than $b"
fi
5 is great than or equal to 4
if (($a <= $c)); then
    echo "$a is less than or equal to $c"
else
    echo "$a is great than $b"
fi
5 is less than or equal to 5

2. Comparing floating-point numbers

e=20.0
d=100.50
awk -v a=0.7 -v b=0.5 'BEGIN{print(a>b)?"a is big":"b is big"}'
a is big
c=`echo "$d > $e" | bc`
if [ $c -eq 1 ]; then
    echo "$d is great than $e"
else
    echo "$d is less than or equal to $e"
fi
100.50 is great than 20.0

3. Comparing strings

s1='a'
s2='b'
s3='ac'
if [ $s1 == $s2 ]; then
    echo "$s1 is equal to $s2"
else
    echo "$s1 is not equal to $s2"
fi
a is not equal to b
if [ $s1 != $s3 ]; then
    echo "$s1 is not equal to $s3"
else
    echo "$s1 is equal to $s3"
fi
a is not equal to ac
if [ $s1 \< $s2 ]; then
    echo "$s1 is less than $s2"
elif [[ $s1 > $s2 ]]; then
    echo "$s1 is great than $2"
else
    echo "$s1 is equal to $s2"
fi
a is less than b
if [[ $s1 < $s3 ]]; then
    echo "$s1 is less than $s3"
elif [ $s1 \> $s3 ]; then
    echo "$s1 is great than $3"
else
    echo "$s1 is equal to $s3"
fi
a is less than ac
http://felixfan.github.io/scrapy-simple-example

0. Install Scrapy

conda install scrapy # assuming Anaconda is already installed

1. Create a new project

scrapy startproject njupt # njupt is the project name; choose whatever name you like

After running the command above, a new directory named njupt appears under the current working directory, with the following structure:

|---- njupt
| |---- njupt
|   |---- __init__.py
|   |---- items.py        # defines the data structure (dictionary-like) that stores scraped items
|    |---- pipelines.py    # post-processes scraped items, e.g. saving them to a database
|    |---- settings.py    # crawler configuration file
|    |---- spiders        # directory for new spider files (the crawler body)
|     |---- __init__.py
| |---- scrapy.cfg        # project configuration file

The project has now been created.

2. Set up items.py

This post uses NJUPT news as an example; three pieces of information need to be stored:

  • the news title
  • the news date
  • the link to the full news article

The code in items.py is as follows:

# -*- coding: utf-8 -*-

import scrapy

class NjuptItem(scrapy.Item):   # NjuptItem is the auto-generated class name
    news_title = scrapy.Field() # news title
    news_date = scrapy.Field()  # news date
    news_url = scrapy.Field()   # link to the full article

3. Write the spider

The spider is the body of the crawler: it handles requests, responses, and URLs, and hands the results to the pipelines for further processing. After setting up the items, create a new file njuptSpider.py under the spiders directory with the following content:

# -*- coding: utf-8 -*-

import scrapy
from njupt.items import NjuptItem
import logging

class njuptSpider(scrapy.Spider):
    name = "njupt"
    allowed_domains = ["njupt.edu.cn"]
    start_urls = [
        "http://news.njupt.edu.cn/s/222/t/1100/p/1/c/6866/i/1/list.htm",
        ]
    
    def parse(self, response):
        news_page_num = 14
        page_num = 386
        if response.status == 200:
            for i in range(2,page_num+1):
                for j in range(1,news_page_num+1):
                    item = NjuptItem() 
                    item['news_url'],item['news_title'],item['news_date'] = response.xpath(
                    "//div[@id='newslist']/table[1]/tr["+str(j)+"]//a/font/text()"
                    "|//div[@id='newslist']/table[1]/tr["+str(j)+"]//td[@class='postTime']/text()"
                    "|//div[@id='newslist']/table[1]/tr["+str(j)+"]//a/@href").extract()
                  
                    yield item
                    
                next_page_url = "http://news.njupt.edu.cn/s/222/t/1100/p/1/c/6866/i/"+str(i)+"/list.htm"
                yield scrapy.Request(next_page_url,callback=self.parse_news)
        
    def parse_news(self, response):
        news_page_num = 14
        if response.status == 200:
            for j in range(1,news_page_num+1):
                item = NjuptItem()
                item['news_url'],item['news_title'],item['news_date'] = response.xpath(
                "//div[@id='newslist']/table[1]/tr["+str(j)+"]//a/font/text()"
                "|//div[@id='newslist']/table[1]/tr["+str(j)+"]//td[@class='postTime']/text()"
                "|//div[@id='newslist']/table[1]/tr["+str(j)+"]//a/@href").extract()
                yield item

Notes:

  • name is the spider's name; it is used later in the command that starts the crawler.
  • allowed_domains restricts the domains the spider is allowed to crawl (links outside this range are not followed).
  • start_urls is the first URL visited after the spider starts; the response is automatically passed to the parse function.
  • parse is a callback defined by the Scrapy framework for handling the response returned after requesting start_urls; we implement it ourselves.
  • news_page_num = 14 and page_num = 386 are the number of news items per page and the total number of pages. They could also have been scraped with XPath, but the school website's HTML is such a mess that I entered them by hand.
  • item = NjuptItem() instantiates the item defined earlier to store each news item's URL, title, and date. (A small trick: joining XPath expressions with | returns several target values in one call.)
  • yield item hands each stored item to the pipelines for subsequent processing.
  • next_page_url is then generated and fetched with scrapy.Request to crawl the next page of news.
  • scrapy.Request takes two arguments: the URL to request and a callback function that handles the response to this request; here the callback is parse_news.
  • parse_news follows roughly the same steps as parse; you could also refactor parse and use it directly as the callback, so that one function would suffice.

4. Write pipelines.py

For a first attempt, you can edit the pipelines.py file in the njupt directory directly. Pipelines are mainly used for further processing of the data, such as type conversion, saving to a database, or writing to a local file. A pipeline is called after every yield item in the spider and processes each individual item. The code below creates a local file njupt.txt to store the scraped content.

import sys
import json

reload(sys) 
sys.setdefaultencoding('utf-8') # so that Chinese text can be written (Python 2)

class NjuptPipeline(object):
    def __init__(self):
        self.file = open('njupt.txt','w')
    def process_item(self, item, spider):
        self.file.write(item['news_title'])
        self.file.write("\n")
        self.file.write(item['news_date'])
        self.file.write("\n")
        self.file.write(item['news_url'])
        self.file.write("\n")
        return item

5. Write settings.py

settings.py stores the crawler configuration. There are many possible settings, but for this introductory tutorial we only need to register the pipeline written above. The file content is as follows.

BOT_NAME = 'njupt'

SPIDER_MODULES = ['njupt.spiders']
NEWSPIDER_MODULE = 'njupt.spiders'


ITEM_PIPELINES = {
    'njupt.pipelines.NjuptPipeline':1,
}

6. Start the crawler and check the results

Once all the steps above are done, open a command line, change into the spiders directory of the njupt project, and start the crawler with the following command:

scrapy crawl njupt

After a period of frantic crawling, the spider finishes.

http://felixfan.github.io/python-magic

1. Don't use mutable objects as default argument values

Dictionaries, sets, lists and other mutable objects are not suitable as default argument values, because the default value is created once, when the function is defined, and every call reuses that same cached object.
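
A minimal illustration of my own (not from the original post):

def append_to(value, target=[]):      # the default list is created once, when the function is defined
    target.append(value)
    return target

append_to(1)    # returns [1]
append_to(2)    # returns [1, 2] -- the "fresh" default still remembers the previous call

def append_to_fixed(value, target=None):
    if target is None:                # create a new list on every call instead
        target = []
    target.append(value)
    return target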

2. Modifying list items inside a loop

b = [2, 4, 5, 6]

for i in b:
  if not i % 2:
    b.remove(i)

In: b
Out: [4, 5] # I expected a list with the even numbers removed
# remove() shifts the indices of the remaining elements while you iterate
# after 2 is removed the list is [4, 5, 6], so index 1 now points at 5 and 4 is skipped
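
A common fix (my own addition): iterate over a copy of the list, or build a new list, so that remove() no longer shifts the indices you are iterating over.

b = [2, 4, 5, 6]
for i in b[:]:                        # iterate over a copy of the list
    if not i % 2:
        b.remove(i)
# b is now [5]

b = [i for i in [2, 4, 5, 6] if i % 2]   # or simply build a new list
# b is now [5]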

3. IndexError – indexing past the end of a list

my_list = [1, 2, 3, 4, 5]

In: my_list[5] # there is no such element, so this raises IndexError

In: my_list[5:] # but slicing past the end is allowed -- handy when used deliberately, a trap when it hides a bug
Out: []

4. Lists: + and +=, append and extend

>>> myList = [1,2,3,4]
>>> print myList
[1, 2, 3, 4]

>>> myList + [1]
[1, 2, 3, 4, 1]
>>> print myList
[1, 2, 3, 4]
### does not change the original list

>>> myList += [1]
>>> print myList
[1, 2, 3, 4, 1]
### appends to the original list in place

>>> myList.append(2)
>>> print myList
[1, 2, 3, 4, 1, 2]
### appends to the original list in place

>>> myList.extend([9])
>>> print myList
[1, 2, 3, 4, 1, 2, 9]
### appends to the original list in place

5. The difference between ‘==’ and is

‘is’ compares the identity of two objects; ‘==’ compares their values.

# But there is a special case
In: a = float('nan')

In: print('a is a,', a is a)
Out:('a is a,', True)

In: print('a == a,', a == a)
Out: ('a == a,', False) # surprising: NaN is not equal to itself

6. Shallow copy and deep copy

For container objects such as dict and list, plain assignment only copies a reference. In practice we often want to modify a list without changing the original one. A shallow copy copies only the outer (parent) object, while a deep copy also copies the nested child objects.

In [65]: list1 = [1, 2]
In [66]: list2 = list1 # just a reference: changing list2 also changes list1

In [67]: list3 = list1[:] # shallow copy

In [69]: import copy
In [70]: list4 = copy.copy(list1) # shallow copy; changing list3 or list4 does not affect list1

In [71]: id(list1), id(list2), id(list3), id(list4)
Out[71]: (4480620232, 4480620232, 4479667880, 4494894720)


# now look at the difference between deep copy and shallow copy

In [88]: from copy import copy, deepcopy

In [89]: list1 = [[1], [2]]

In [90]: list2 = copy(list1) # still a shallow copy

In [91]: list3 = deepcopy(list1) # deep copy

In [92]: id(list1), id(list2), id(list3)
Out[92]: (4494896592, 4495349160, 4494896088)

In [93]: list2[0][0] = 3

In [94]: print('list1:', list1)
('list1:', [[3], [2]]) # modifying a nested child object through the shallow copy still changes the source, just like a reference

In [95]: list3[0][0] = 5

In [96]: print('list1:', list1)
('list1:', [[3], [2]]) # the deep copy does not affect the source

7. bool is actually a subclass of int

In [97]: isinstance(True, int)
Out[97]: True

In [98]: True + True
Out[98]: 2

In [99]: 3 * True + True
Out[99]: 4

In [100]: 3 * True - False
Out[100]: 3

8. Are tuples really immutable?

In [111]: tup = ([],)
In [112]: tup[0] += [1]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-112-d4f292cf35de> in <module>()
----> 1 tup[0] += [1]

TypeError: 'tuple' object does not support item assignment

In [113]: tup
Out[113]: ([1],) # an exception was raised, yet the tuple was modified?

In [114]: tup = ([],)
In [115]: tup[0].extend([1])
In [116]: tup[0]
Out[116]: [1] # OK, now I see: I cannot assign to the tuple itself, but nothing stops me from mutating a mutable child object (the list) inside it

In [117]: my_tup = (1,)
In [118]: my_tup += (4,)
In [119]: my_tup = my_tup + (5,)
In [120]: my_tup
Out[120]: (1, 4, 5) # weren't tuples immutable? here += and + build a new tuple and rebind the name; the original tuple is unchanged

9. enumerate

>>> myList=['a','b','c']
>>> for i, item in enumerate(myList):
...  print i, item
... 
0 a
1 b
2 c
>>> list(enumerate('abc')) 
[(0, 'a'), (1, 'b'), (2, 'c')]
>>> list(enumerate('abc', 1)) 
[(1, 'a'), (2, 'b'), (3, 'c')]

10. List / dict / set comprehensions

>>> my_list = [i for i in xrange(3)]
>>> print my_list
[0, 1, 2]
>>> my_dict = {i: i * i for i in xrange(3)} 
>>> print my_dict
{0: 0, 1: 1, 2: 4}
>>> my_set = {i * 15 for i in xrange(3)}
>>> print my_set
set([0, 30, 15])

11. Forcing float division

a = 3
b = 4
result = 1.0 * a / b

12. A simple server

A quick and convenient way to share the files under a directory.

# Python2
python -m SimpleHTTPServer

# Python 3
python3 -m http.server

Assuming your IP is 147.8.103.234, anyone can then access the shared directory at http://147.8.103.234:8000/.

13. Simplifying if conditions

If you need to check a value against several alternatives, you can write:

if n in [1,4,5,6]:

instead of:

if n==1 or n==4 or n==5 or n==6:

14. Reversing a string or list

>>> a = [1,2,3,4]
>>> a[::-1]
[4, 3, 2, 1]

# This creates a new reversed list. 
# If you want to reverse a list in place you can do:

a.reverse()
>>> a = 'hello world'
>>> a[::-1]
'dlrow olleh'

15. Ternary operator

The ternary (conditional) expression is a shortcut for an if-else statement.

>>> x, y = 1, 2
>>> min = x if x < y else y
>>> max = x if x > y else y
>>> print min, max
1 2

16. Optimizing loops

Anything that can be done outside a loop should not be done inside it.
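
A small sketch of my own to illustrate: hoist work that does not depend on the loop variable out of the loop.

import math

values = range(1000)

# slower: math.sqrt(2) is recomputed on every iteration
result = [math.sqrt(2) * v for v in values]

# faster: compute the constant once, outside the loop
factor = math.sqrt(2)
result = [factor * v for v in values]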

17. Ordering multiple conditions in a test

With and, put the condition that is satisfied least often first; with or, put the condition that is satisfied most often first.
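
Both rules rely on short-circuit evaluation. A made-up example (the function names are my own, purely for illustration):

def expensive_check(n):
    return sum(range(n)) % 7 == 0        # stand-in for some costly test

def is_candidate(n):
    # and: the cheap test that is rarely true goes first, so the costly test is usually skipped
    return n % 1000 == 0 and expensive_check(n)

def is_interesting(n):
    # or: the test that is true most often goes first, so the second one is usually skipped
    return n > 0 or expensive_check(n)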

18. Use join to concatenate strings from an iterable
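
For example (my own sketch): repeated += copies the growing string on every iteration, while join builds the result in a single pass.

words = ['hello', 'world', 'python']

s = ''
for w in words:        # slow: a new string object is created each time
    s += w

s = ''.join(words)     # better: one pass
s = ' '.join(words)    # 'hello world python'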

19. Swap two variables without a temporary variable

a, b = b, a

20. Use if ... is

Using if ... is True is nearly twice as fast as if ... == True, because the identity check avoids the full equality comparison.
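
You can check the claim yourself with timeit (a sketch of my own; the exact ratio depends on your interpreter). In everyday code, of course, a plain if x: is usually what you want.

import timeit

print(timeit.timeit('x is True', setup='x = True'))
print(timeit.timeit('x == True', setup='x = True'))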

21. Use chained comparisons: x < y < z

x < y < z is slightly more efficient and also more readable.
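
For example (my own snippet); the chained form also evaluates the middle expression only once.

x = 5

if 1 < x < 10:           # chained comparison
    pass

if 1 < x and x < 10:     # equivalent, but x is evaluated twice
    pass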

22. while 1 is faster than while True

(This holds for Python 2, where True is an ordinary name that must be looked up on every iteration; in Python 3, True is a keyword constant and there is no difference.)

23. Use ** instead of pow

The ** operator is more than 10 times faster, since it avoids the function-call overhead of pow!
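
This is easy to verify with timeit as well (my sketch); ** avoids the function-call overhead of pow, and a constant expression like 3 ** 10 can even be folded at compile time.

import timeit

print(timeit.timeit('3 ** 10'))
print(timeit.timeit('pow(3, 10)'))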

24. Use a Counter object for counting

>>> from collections import Counter
>>> c = Counter("hello world")
>>> c
Counter({"l": 3, "o": 2, " ": 1, "e": 1, "d": 1, "h": 1, "r": 1, "w": 1})
>>> c.most_common(2)
[("l", 3), ("o", 2)]

25. Use Python 3-style print and division in Python 2

from __future__ import print_function
from __future__ import division
http://felixfan.github.io/Kaplan-Meier-Curves

NCCTG Lung Cancer Data

Description: Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

library(survival)
data(lung)
head(lung)
  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0
inst:       Institution code
time:       Survival time in days
status:     censoring status 1=censored, 2=dead
age:        Age in years
sex:        Male=1 Female=2
ph.ecog:    ECOG performance score (0=good 5=dead)
ph.karno:   Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno:  Karnofsky performance score as rated by patient
meal.cal:   Calories consumed at meals
wt.loss:    Weight loss in last six months

# Kaplan-Meier Analysis

Estimate survival-function

Global Estimate

km.as.one <- survfit(Surv(time, status) ~ 1,  
                     type="kaplan-meier", 
                     conf.type="log", 
                     data=lung)

Separate estimates by sex

km.by.sex <- survfit(Surv(time, status) ~ sex,  
                     type="kaplan-meier", 
                     conf.type="log", data=lung)

Plot estimated survival function

plot(km.as.one, main="Kaplan-Meier estimate with CI", 
     xlab="Survival time in days", 
     ylab="Survival probability", lwd=2)

plot(km.by.sex, main="Kaplan-Meier estimate by sex", 
     xlab="Survival time in days", 
     ylab="Survival probability", 
     lwd=2, col = c("red","blue"))
legend(x="topright", col=c("red","blue"), lwd=2, 
       legend=c("male","female"))

Plot cumulative incidence function

plot(km.by.sex, main="Kaplan-Meier cumulative incidence by sex", 
     xlab="Survival time in days", ylab="Cumulative incidence", 
     lwd=2, col = c("red","blue"),
     fun = function(x){1-x})
legend(x="bottomright", col=c("red","blue"), 
       lwd=2, legend=c("male","female"))

Plot cumulative hazard

plot(km.as.one, main="Kaplan-Meier estimate", 
     xlab="Survival time in days", 
     ylab="Cumulative hazard", lwd=2,
     fun="cumhaz")

Log-rank-test for equal survival-functions

With rho = 0 (default) this is the log-rank or Mantel-Haenszel test, and with rho = 1 it is equivalent to the Peto & Peto modification of the Gehan-Wilcoxon test.

survdiff(Surv(time, status) ~ sex, data=lung)
Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 138      112     91.6      4.55      10.3
sex=2  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.00131 


http://felixfan.github.io/gdp-animation

1. Simple Pie Chart

GDP data was downloaded from here.

setwd("/Users/alicefelix/Desktop/gdp")
dat <- read.table("GDP1970_2014.txt",header = TRUE)
for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  png(fn)
  pie(df3$GDP, labels = df3$Country, main=paste("GDP",i,sep=" "), col=rainbow(length(df3$Country)))
  dev.off()
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014.gif")
system("rm *png")

2. Pie Chart with Annotated Percentages

for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  pct <- round(df3$GDP/sum(df3$GDP)*100)
  pct <- paste(pct,"%", sep="")
  lbls <- paste(df3$Country, pct, sep=" ")
  png(fn)
  pie(df3$GDP, labels = lbls, main=paste("GDP",i,sep=" "), col=rainbow(length(df3$Country)))
  dev.off()
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014v2.gif")
system("rm *png")

3. 3D Pie Chart

library(plotrix)
for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  pct <- round(df3$GDP/sum(df3$GDP)*100)
  pct <- paste(pct,"%", sep="")
  lbls <- paste(df3$Country, pct, sep=" ")
  png(fn)
  pie3D(df3$GDP, labels = lbls, main=paste("GDP",i,sep=" "), col=rainbow(length(df3$Country)), labelcex = 0.8)
  dev.off()
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014v3.gif")
system("rm *png")

4. Pie Chart with a Custom pie Function (pie1)

The pie1 function below is from 糗世界.

pie1 <- function (x, labels = names(x), edges = 200, radius = 0.8, clockwise = FALSE, 
                  init.angle = if (clockwise) 90 else 0, density = NULL, angle = 45, 
                  col = NULL, border = NULL, lty = NULL, main = NULL, percentage=T, 
                  rawNumber=F, digits=3, cutoff=0.01, legend=F, legendpos="topright", 
                  legendcol=2, ...)
{
    if (!is.numeric(x) || any(is.na(x) | x < 0)){
      stop("'x' values must be positive.")
    }
  
    if (is.null(labels)){
      labels <- as.character(seq_along(x))
    }else{
      labels <- as.graphicsAnnot(labels)
    }
  
    rawX <- x
    x <- c(0, cumsum(x)/sum(x))
    dx <- diff(x)
    nx <- length(dx)
    plot.new()
    pin <- par("pin")
    xlim <- ylim <- c(-1, 1)
    
    if (pin[1L] > pin[2L]){
      xlim <- (pin[1L]/pin[2L]) * xlim
    }else{
      ylim <- (pin[2L]/pin[1L]) * ylim
    }
    
    dev.hold()
    on.exit(dev.flush())
    plot.window(xlim, ylim, "", asp = 1)
    
    if (is.null(col)){
      col <- if (is.null(density)){
        c("white", "lightblue", "mistyrose", "lightcyan", 
                "lavender", "cornsilk", "pink")
      }else{
        par("fg")
      } 
    }
        
    if (!is.null(col)){
      col <- rep_len(col, nx)
    }
        
    if (!is.null(border)){
      border <- rep_len(border, nx)
    }
      
    if (!is.null(lty)) 
        lty <- rep_len(lty, nx)
    angle <- rep(angle, nx)
    if (!is.null(density)) 
        density <- rep_len(density, nx)
    twopi <- if (clockwise) 
        -2 * pi
    else 2 * pi
    t2xy <- function(t) {
        t2p <- twopi * t + init.angle * pi/180
        list(x = radius * cos(t2p), y = radius * sin(t2p))
    }
    for (i in 1L:nx) {
        n <- max(2, floor(edges * dx[i]))
        P <- t2xy(seq.int(x[i], x[i + 1], length.out = n))
        polygon(c(P$x, 0), c(P$y, 0), density = density[i], angle = angle[i], 
            border = border[i], col = col[i], lty = lty[i])
        if(!legend){
        	P <- t2xy(mean(x[i + 0:1]))
	        lab <- as.character(labels[i])
	        if (!is.na(lab) && nzchar(lab)) {
	            lines(c(1, 1.05) * P$x, c(1, 1.05) * P$y)
	            text(1.1 * P$x, 1.1 * P$y, labels[i], xpd = TRUE, 
	                adj = ifelse(P$x < 0, 1, 0), ...)
	        }
        }
    }
    if (percentage) {
    	for (i in 1L:nx){
    		if(dx[i]>cutoff){
    			P <- t2xy(mean(x[i + 0:1]))
            	text(.8 * P$x, .8 * P$y, paste(formatC(dx[i]*100, digits=digits), "%", sep=""), xpd = TRUE, 
                	adj = .5, ...)
    		}
        }
    }else{
        if(rawNumber){
		for (i in 1L:nx){
    			if(dx[i]>cutoff){
    				P <- t2xy(mean(x[i + 0:1]))
            		text(.8 * P$x, .8 * P$y, rawX[i], xpd = TRUE, 
                		adj = .5, ...)
    			}
        	}
        }
    }
    if(legend) legend(legendpos, legend=labels, fill=col, border="black", bty="n", ncol = legendcol)
    title(main = main, ...)
    invisible(NULL)
}
for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  png(fn)
  pie1(df3$GDP, labels = df3$Country, main=paste("GDP",i,sep=" "), col=rainbow(length(df3$Country)))
  dev.off()
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014v4.gif")
system("rm *png")

5. pie chart with ggplot2

library(ggplot2)
library(dplyr)

for(i in 1970:2014){
  fn <- paste(i,".png",sep="")
  df <- subset(dat,Year==i)
  otherGDP <- 2 * df[df$Country=="World",]$GDP - sum(df$GDP)
  df2 <- rbind(df,data.frame(Country="Others",Currency="US$", Year=i, GDP=otherGDP))
  df3 <- subset(df2, Country != "World")
  #df3 = df3[order(df3$GDP, decreasing = TRUE),] # use order() to sort the data frame by GDP in decreasing order
  df3 <- df3 %>% group_by(Year) %>% mutate(pos = cumsum(GDP)- GDP/2)
  
  pct <- round(df3$GDP/sum(df3$GDP)*100, 2)
  pct <- paste(pct,"%", sep="")
  lbls <- paste(df3$Country, pct, sep=" ")
  
  p <- ggplot(df3, aes(x = "", y = GDP, fill = Country)) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar(theta = "y") +
    labs(x = "", y = "", title = paste("GDP", i)) +   ## set the axis labels to empty
    geom_text(aes(x="", y=pos, label = lbls), size=3) +  
    theme_bw() +
    theme(panel.border = element_blank(), panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(), axis.line = element_blank(),
          axis.ticks = element_blank(), axis.text.x = element_blank(),
          legend.position ="none") # remove the background grid, axes and legend
  ggsave(fn,p)
}
system("convert -delay 50 -loop 0 $(ls -v *png) gdp1970_2014v5.gif")
system("rm *png")

http://felixfan.github.io/gh-md-toc

gh-md-toc is a great cross-platform tool to generate TOC (Table of contents) for README.md or GitHub’s wiki page. Source code and examples are available in their GitHub repository.

http://felixfan.github.io/vi

vi commands

The vi editor has an editing (insert) mode and a command mode: text is entered in editing mode, and file operations are carried out in command mode. By default vi starts in command mode. Press the "Esc" key to switch from editing mode to command mode, and press "A", "a", "O", "o", "I" or "i" to switch from command mode into editing mode.

Command  Action
ZZ       In command mode, save the changes to the current file and exit vi
:n       Move the cursor to the beginning of line n
:$       Move the cursor to the beginning of the last line
:wq      In command mode, save the file and exit vi
:w       In command mode, save the file
:q       In command mode, exit vi
:q!      In command mode, force-quit vi without saving
http://felixfan.github.io/Scatterplot-many-points
library(ggplot2)
dat <- data.frame(x=rnorm(10000), y=rnorm(10000))

point plot

ggplot(dat, aes(x=x, y=y)) + geom_point()

jittering

ggplot(dat, aes(x=x, y=y)) + geom_point(position = 'jitter')

alpha

ggplot(dat, aes(x=x, y=y)) + geom_point(alpha = 0.3)

ggplot(dat, aes(x=x, y=y)) + geom_point(alpha = 0.1)

contour lines

ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_density2d()

HexBins

ggplot(dat, aes(x=x, y=y)) + stat_binhex()

combined

ggplot(dat, aes(x=x, y=y)) + geom_point(colour='black',alpha=0.3) + geom_density2d(colour='red')

http://felixfan.github.io/english-letter
I am writing to confirm/enquire/inform you...

I am writing to follow up on our earlier decision on the marketing campaign in Q2.

With reference to our telephone conversation today…

In my previous e-mail on October 5

As I mentioned earlier about...

as indicated in my previous e-mail...

As we discussed on the phone...

from our decision at the previous meeting…
as you requested/per your requirement...

In reply to your e-mail dated April 1, we decided...

This is in response to your e-mail today

As mentioned before, we deem this product has strong unique selling points in China.

As a follow-up to our phone conversation yesterday, I wanted to get back to you about the pending issues of our agreement.

I received your voice message regarding the subject. I'm wondering if you can elaborate i.e. provide more details.
Please be advised/informed that…

Please note that…

We would like to inform you that…
I am convinced that …

We agree with you on...
With effect from 4 Oct, 2008...

We will have a meeting scheduled as noted below…

Be assured that individual statistics are not disclosed and this is for internal use only.
I am delighted to tell you that…

We are pleased to learn that…

We wish to notify you that…

Congratulations on your…

I am fine with the proposal.

I am pleased to inform you that you have been accepted to join the workshop scheduled for 22-24 Nov, 2008.

We are sorry to inform you that…

I’m afraid I have some bad news.
There are a number of issues with our new system.

Due to circumstances beyond our control...

I don't feel too optimistic about...

It would be difficult for us to accept...

Unfortunately I have to say that, since receiving your enquiries on the subject, our view has not changed.
We would be grateful if you could...

I could appreciate it if you could...

Would you please send us...?

We need your help.

We seek your assistance to cascade/reply this message to your staff.

We look forward to your clarification.

Your prompt attention to this matter will be appreciated.

I would really appreciate meeting up if you can spare the time. Please let me know what suits you best.
Please give us your preliminary thoughts about this.

Would you please reply to this e-mail if you plan to attend?

Please advise if you agree with this approach.

Could you please let me know the status of this project?

If possible, I hope to receive a copy of your proposal when it is finished.

I would appreciate it very much if you would send me your reply by next Monday.

Hope this is OK with you. If not, let me know by e-mail ASAP

Could you please send me your replies to the above questions by the end of June?

May I have your reply by April 1, if possible?
If you wish, we would be happy to…

Please let me know if there's anything I can do to help.

If there's anything else I can do for you on/regarding this matter, please feel free to contact me at any time.

If you want additional recommendations on this, please let us know and we can try to see if this is possible.
I'm just writing to remind you of…

May we remind you that..?

I am enclosing…

Please find enclosed…

Attached hereto…

Attached please find the most up-to-date information on/regarding/concerning…

Attached please find the draft product plan for your review and comment.
If you have any further questions, please feel free to contact me.

I hope my clarification has been helpful.

Please feel free to call me at any time, I will continually provide full support.

Please let me know if this is suitable.

Looking forward to seeing you soon.

We look forward to hearing from you soon.

Hope this is clear and we are happy to discuss this further if necessary.

I look forward to receiving your reply soon.

Looking forward to receiving your comments in due course.

I'll keep you posted.

Please keep me informed on the matter.

For any comments/suggestions, please contact Nadia at 2552-7482.
I would like to apologize for…

I apologize for the delay in...

We are sorry for any inconvenience caused.

I am sorry for any inconvenience this has caused you.

I'm sorry about last time.

We apologize for not replying to you earlier.

I’m really sorry about this.

Sorry I'm late in replying to your e-mail dated Monday, April 1.

We apologize for the delay and hope that it doesn’t inconvenience you too much.

Hoping that this will not cause you too much trouble.

Sorry if my voice message is not clear enough.
Thank you for your help.

I appreciate very much that you…

I truly appreciate it.

Thank you for your participation.

Thank you so much for inviting me.

Congratulations to all of you and thanks for your efforts.

Your understanding and cooperation is greatly/highly appreciated.

Your prompt response will be most appreciated.

Once again, thank you all for your commitment and support.

Thanks for your input/clarification/message

Any comments will be much appreciated.

Thank you very much for everything you've done for me.

I would appreciate your kindest understanding with/regarding this matter.

Please convey my thanks to all the staff involved, they certainly did an excellent job.
K.I.S.S. (Keep it simple, stupid!) Keep it as concise as possible. People receive a large number of e-mails every day, so make sure yours is easy to understand and clearly organized.

For example, when requesting a meeting there is no need for a long preamble; stating the essentials such as the time and the place is enough.

Meeting Request: Let’s have our weekly meeting on Feb 18th , at 10:30 AM in Meeting room.

@Lisa, could you please take the meeting minutes this time.

@Chris, please buy some snacks and drinks for each person.

When writing an English e-mail, remember to introduce yourself and state clearly what you want. A brief, to-the-point self-introduction helps the recipient understand the purpose of the message.

For example: Jane has joined company A as the Shanghai regional manager; after going through the client files she sends John an e-mail:

Hi John, Hope this mail finds you well. I am Jane, SH regional manager from A; I’m contacting you today regarding our upcoming collaboration.

Use an informative subject line. The recipient should grasp the point of the e-mail from the subject alone.

Example: Xiaomei from the marketing department goes to a TV station for an interview, runs into an old classmate and, while she is at it, negotiates a collaboration for the company. Back at the office she e-mails her boss with the subject "Potential marketing resource with xx TV".

Sometimes the subject line alone can convey the whole message, and the body can even be omitted.
Highlight the key information, then spell out the specific request. If you need someone to finish something within a deadline, state the deadline and the task clearly.

For example, Xiaoqi from HR e-mails the newly hired Bruce: It is VERY IMPORTANT that we go through the visa procedure soon tomorrow. Please MEET with me at 13:00 on the 7th floor of the office for this to be done.

Communicate top-down by order of importance: make sure the most critical information comes first in the e-mail. Western readers like you to get straight to the point, so avoid unnecessary pleasantries and make sure the key message is the first thing they read.

For example, the boss e-mails the whole marketing team to explain why the marketing supervisor Jane will join later than planned.

Jane will not be able to join us until March 15th. This is because of the rules surrounding her current contract. She will still be able to join us for the seminar on Feb. 8th.

Provide a complete signature: make sure it includes your business area, department, title and contact details, so the recipient knows clearly who you are.

For example:

Best Wishes,
Lucy. Yang
Sales Manager
XXXX Company
1/F XXX Plaza, XX Street,Shanghai, China
Tel: +86 133 XXXXXXX Skype: *****
www.XXXXXX.com

Sometimes by reading your signature people can know if they are talking to the right person.
A complete signature makes your role and responsibilities clear at a glance and saves a lot of unnecessary back-and-forth.

References

http://www.fortunechina.com/career/c/2016-02/14/content_255590.htm?id=mail

http://www.weixinyidu.com/n_1312384

http://felixfan.github.io/tophat-cufflinks

1. Data

The following files were provided as the starting point:

  • day8.fastq, raw reads from the first biological condition
  • day16.fastq, raw reads from the second biological condition
  • genome.fa, the reference genome
  • genes.gtf, the reference gene annotations

2. Create reference index

bowtie2-build bwtIndex/genome.fa bwtIndex/genome

3. Run tophat using all default parameters

tophat -o output/tophat/day8/ bwtIndex/genome day8.fastq 
tophat -o output/tophat/day16/ bwtIndex/genome day16.fastq

These will create the accepted_hits.bam files containing the alignments, the align_summary.txt files containing summary stats on the mapped reads, the unmapped.bam files containing the records of unmapped reads.

3.1 How many spliced alignments were reported for the ‘day8’ data set?

spliced alignments contain ‘N’ in cigar score.

samtools view output/tophat/day8/accepted_hits.bam | cut -f 6 | grep 'N' | wc -l

4. Run cufflinks

Run cufflinks using the specified labels as prefix for naming the assembled transcripts.

cufflinks -o output/cufflinks/day8 -L Day8 output/tophat/day8/accepted_hits.bam
cufflinks -o output/cufflinks/day16 -L Day16 output/tophat/day16/accepted_hits.bam

These will generate the files transcripts.gtf containing the assembled transcripts, as well as files *.fpkm_tracking containing expression (FPKM) estimates for genes and transcripts.

cut -f9 output/cufflinks/day8/transcripts.gtf | cut -d ' ' -f2 | sort -u | wc -l
cut -f9 output/cufflinks/day8/transcripts.gtf | cut -d ' ' -f4 | sort -u | wc -l
cut -f9 output/cufflinks/day8/transcripts.gtf | grep -v "exon_number" | cut -d ' ' -f2 | sort | uniq -c | awk '$1==1' | wc -l
cut -f9 output/cufflinks/day8/transcripts.gtf | grep "exon_number" | cut -d ' ' -f4 | sort | uniq -c | awk '$1==1' | wc -l
cut -f9 output/cufflinks/day8/transcripts.gtf | grep "exon_number" | cut -d ' ' -f4 | sort | uniq -c | awk '$1>1' | wc -l

5. Run cuffcompare

Run cuffcompare on the resulting cufflinks transcripts, using the reference gene annotations provided and selecting the option ‘-R’ to consider only the reference transcripts that overlap some input transfrag.

cuffcompare -r genes.gtf -R output/cufflinks/day8/transcripts.gtf
cuffcompare -r genes.gtf -R output/cufflinks/day16/transcripts.gtf

It compares the assembled transcripts against a set of reference gene annotations provided by the user, exon-by-exon, to determine which genes and transcripts in the sample are known, and which ones are likely novel. In the end, it assigns each predicted (cufflinks) transcript a ‘class’ code depending on how it relates to a reference transcript, for example: it is the same as a reference transcript (‘=’), it is only a portion of one (‘c’), a new splice variant of a reference gene (‘j’), etc. See details.

6. Run cuffmerge

echo output/cufflinks/day8/transcripts.gtf > gtf.txt
echo output/cufflinks/day16/transcripts.gtf >> gtf.txt
cuffmerge -g genes.gtf gtf.txt -o output/cuffmerge

7. Run cuffdiff

Run cuffdiff with the merged.gtf file as reference annotation, taking the two alignment files as input.

cuffdiff -o output/cuffdiffs/ output/cuffmerge/merged.gtf output/tophat/day8/accepted_hits.bam output/tophat/day16/accepted_hits.bam

This will create the file gene_exp.diff containing test scores and results for the gene-level differential expression analysis, other *.diff files, as well as tracking files for genes, transcripts, splicing, CDS, TSS, etc.

7.1 How many genes were detected as differentially expressed?

grep -c "yes" output/cuffdiffs/gene_exp.diff

7.2 How many transcripts were differentially expressed between the two samples?

grep -c yes output/cuffdiffs/isoform_exp.diff
http://felixfan.github.io/hist-bin-width
set.seed(999)
dat<-rnorm(n=1000, m=24, sd=5)
histInfor <- hist(dat)

histInfor
$breaks
 [1]  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

$counts
 [1]   1   7  13  37  64 100 124 160 156 136 105  55  24  11   3   4

$density
 [1] 0.0005 0.0035 0.0065 0.0185 0.0320 0.0500 0.0620 0.0800 0.0780 0.0680
[11] 0.0525 0.0275 0.0120 0.0055 0.0015 0.0020

$mids
 [1]  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

$xname
[1] "dat"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Number of bins (=10)

The bins don’t correspond to exactly the number you put in, because of the way R runs its algorithm to break up the data but it gives you generally what you want.

hist(dat, breaks = 10)

Exact number of bins (=10)

hist(dat, breaks = seq(min(dat), max(dat), length.out = 11))

width of bin (=10)

summary(dat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.633  20.470  23.950  23.840  27.180  39.310 
hist(dat, breaks = seq(from=5, to=45, by=10))

http://felixfan.github.io/bowtie-var

1. generate bowtie2 index

bowtie2-build ref.fasta indexDir/ref

2. read alignments

2.1 end-to-end read alignments

bowtie2 -x indexDir/ref -U seq.fastq -S out.full.sam

2.2 partial read alignments

bowtie2 -x indexDir/ref -U seq.fastq -S out.local.sam --local

3. statistics of the alignments

3.1 How many matches (alignments) were reported?

Check the SAM file to determine the number of alignment lines, excluding lines that refer to unmapped reads. A SAM line indicating an unmapped read can be recognized by a “*” in column 3 (chrom).

grep -v "^@" out.full.sam | cut -f3 | grep -v "*" | wc -l
grep -v "^@" out.local.sam | cut -f3 | grep -v "*" | wc -l

3.2 How many alignments contained insertions and/or deletions?

This information is captured in the CIGAR field (col 6), marked with ‘D’ and ‘I’, respectively.

cut -f6 out.full.sam | grep -c "[ID]"
cut -f6 out.local.sam | grep -c "[ID]"

or

grep -v "^@" out.full.sam | awk '$6~"I" || $6~"D"' | wc -l
grep -v "^@" out.local.sam | awk '$6~"I" || $6~"D"' | wc -l

4. call variants

4.1 converting the SAM file to BAM format

samtools view -bT ref.fasta out.full.sam > out.full.bam

4.2 sort the bam file

samtools sort out.full.bam -o out.full.sorted.bam

4.3 Determine candidate sites

samtools mpileup -f ref.fasta -g out.full.sorted.bam > out.full.mpileup.bcf
bcftools call -m -v -O v -o out.full.mpileup.vcf out.full.mpileup.bcf

5. variants statistics

5.1 How many variants were reported for Chr1?

grep -c "^Chr1" out.full.mpileup.vcf 

5.2 How many variants have ‘A’ as the reference allele?

grep -v "^#" out.full.mpileup.vcf | awk '$4=="A"' | wc -l

5.3 How many variants have exactly 20 supporting reads (read depth)?

grep -v "^#" out.full.mpileup.vcf | grep "DP=20;" | wc -l

5.4 How many variants represent indels?

grep -v "^#" out.full.mpileup.vcf | grep "INDEL" | wc -l
http://felixfan.github.io/linux-grep
echo "Hello World" > test.txt
echo "hello python" >> test.txt
echo "big apple" >> test.txt
echo "key1" >> test.txt 
echo "code99" >> test.txt 

Case-sensitive search

grep "Hello" test.txt
Hello World

Case-insensitive search

grep -i "Hello" test.txt
Hello World
hello python

Show only lines that start with 'h'

grep "^h" test.txt
hello python

Match lines that end with 'e'

grep -i "e$" test.txt 
big apple

Search for blank lines

grep '^$' test.txt

Match 'Hello' or 'hello'

grep "[Hh]ello" test.txt 
Hello World
hello python

Match a digit

grep "y[0-9]" test.txt 
key1

Match two digits

grep "e[0-9][0-9]" test.txt 
code99

Match letters

grep '[A-Za-z]' test.txt
Hello World
hello python
big apple
key1
code99

Show all lines that contain the letter 'p' or 'y'

grep '[py]' test.txt
hello python
big apple
key1

Match strings that contain two consecutive 'p' characters

egrep "p{2}" test.txt 
big apple

Match strings that contain one or two consecutive 'p' characters

egrep "p{1,2}" test.txt 
hello python
big apple

Match strings with at least three consecutive 'p' characters

egrep "p{3,}" test.txt 

Read multiple match patterns from a file

echo "y1" > p.txt
echo "d$" >> p.txt
grep -f p.txt test.txt
Hello World
key1

http://felixfan.github.io/RMarkdown-Chinese-PDF Using R Markdown on Mac OS to generate PDF documents that contain Chinese

Preparation

Besides R and RStudio, you also need to install pandoc and BasicTeX. If your disk has enough space, you can install MacTeX directly instead. Finally, install the R package rticles. Download pandoc and BasicTeX and double-click the installers to run them. After BasicTeX is installed, update it first and then install the ctex package, as follows:

sudo tlmgr update --self
sudo tlmgr update --all
sudo tlmgr install ctex

If at some point you are told that a "package.sty" file is missing, just run "sudo tlmgr install package" (where package is the name of the missing package). Then open RStudio and install rticles:

install.packages("rticles")

After installation, create a new R Markdown file, click "From Template" in the lower-left corner of the dialog, and choose "CTeX Documents" in the right-hand panel.

The content below is a trimmed-down version of the template's default content.

Introduction

Chinese LaTeX documents are no longer a hard problem. Of course, we can only say this while standing on the shoulders of the giant CTeX, which lets us settle Chinese LaTeX typesetting with a single line

\documentclass{ctexart} % or ctexrep/ctexbook

or

\usepackage{ctex}

and the Chinese LaTeX typesetting problem is solved. Cross-platform compilation, however, remains a small headache, mainly because there is no free Chinese font that is available on every platform. Windows users always have SimSun and SimHei, Mac users have the Huawen (ST) fonts, while for us poor Linux users compiling such a LaTeX document is not that simple[1]. Yes, we have WenQuanYi, but if we use WenQuanYi and then send you the document, odds are it will not compile on your machine, because you do not have WenQuanYi installed.

Fonts and options

The LaTeX package ctex supports a number of font options. If you are a long-time ctex user, note that the minimum version required here is 2.2, so you may need to upgrade the package. Starting with version 2.0, ctex can automatically pick Chinese fonts according to the operating system, which is a huge service to humanity: we no longer have to explain at length which fonts to use on Windows, or what to do on a Mac.

The YAML metadata below should meet most users' needs. Two settings matter: the document class is ctexart (other classes work too), and the output format is rticles::ctex, whose default LaTeX engine is XeLaTeX (really, stop clinging to your old flame PDFLaTeX).

---
documentclass: ctexart
output: rticles::ctex
---

R code chunks

R code is embedded using the usual R Markdown syntax: a chunk starts with three backticks ```{r} and ends with three backticks ```:

options(digits = 4)
fit = lm(dist ~ speed, data = cars)
coef(summary(fit))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  -17.579     6.7584  -2.601 1.232e-02
## speed          3.932     0.4155   9.464 1.490e-12
b = coef(fit)

The slope in the regression above is 3.9324, and the full regression equation is: \[ Y = -17.5791 + 3.9324x\]

Plotting is of course not a problem either; just say that you want a plot, otherwise how would I know?

par(mar = c(4, 4, .1, .1), las = 1)
plot(cars, pch = 19)
abline(fit, col = 'red')
Scatterplot of the cars data with the fitted regression line.

Please do not ask me why the figure floats to the next page; ask such an elementary LaTeX question and I might just smack you.

Summary

It turns out we can quite confidently convert Chinese R Markdown into a PDF document via XeLaTeX. Mom no longer needs to worry about my thesis being a screen full of backslashes, and after finishing with the lab mice I no longer have to wrestle with LaTeX for three hours before starting the report: open RStudio, go to File > New File > R Markdown, pick CTeX Documents from the templates, and you are done.


  1. Hmph, as if a proud Linux user would ever go asking you to copy fonts.

The PDF produced from the R Markdown looks like this (only the first page is shown)

preview
http://felixfan.github.io/circos

1. prepare data

options(stringsAsFactors = FALSE)
set.seed(999)
library("OmicCircos")
data("UCSC.hg19.chr")
data("TCGA.BC.gene.exp.2k.60")
dat <- UCSC.hg19.chr
dat$chrom <- gsub("chr", "",dat$chrom)


### initial values for simulation data
colors <- rainbow(10, alpha = 0.8)
lab.n <- 50
cnv.n <- 200
arc.n <- 30
fus.n <- 10

### make arc data

arc.d <- c()
for(i in 1:arc.n){
  chr <- sample(1:19, 1)
  chr.i <- which(dat$chrom == chr)
  chr.arc <- dat[chr.i,]
  arc.i <- sample(1:nrow(chr.arc), 2)
  arc.d <- rbind(arc.d, 
                 c(chr.arc[arc.i[1], c(1,2)], 
                   chr.arc[arc.i[2], c(2,4)]))
}
colnames(arc.d) <- c("chr", "start", "end", "value")


### make fusion

fus.d <- c()
for(i in 1:fus.n){
  chr1 <- sample(1:19, 1)
  chr2 <- sample(1:19, 1)
  chr1.i <- which(dat$chrom == chr1)
  chr2.i <- which(dat$chrom == chr2)
  chr1.f <- dat[chr1.i,]
  chr2.f <- dat[chr2.i,]
  fus1.i <- sample(1:nrow(chr1.f), 1)
  fus2.i <- sample(1:nrow(chr2.f), 1)
  n1 <- paste0("geneA", i)
  n2 <- paste0("geneB", i)
  fus.d <- rbind(fus.d, c(
    chr1.f[fus1.i, c(1,2)], n1,
    chr2.f[fus2.i, c(1,2)], n2
  ))
}
colnames(fus.d) <- c("chr1","po1","gene1","chr2","po2","gene2")

cnv.i <- sample(1:nrow(dat), cnv.n)
vale <- rnorm(cnv.n)
cnv.d <- data.frame(dat[cnv.i,c(1,2)], value=vale)

### gene pos
gene.pos <- TCGA.BC.gene.exp.2k.60[,1:3]

### gene expression
gene.exp <- TCGA.BC.gene.exp.2k.60

### p value
gene.pos$p <- rnorm(250,0.01,0.001)*
  sample(c(1,0.5,0.01,0.001,0.0001),250,replace=TRUE)

2. circos plot

2.1 plot of chromosome

type = "chr": plots of chromosomes or segments
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)

2.2 plot bar charts with the same height

type = "b3": bar charts with the same height
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])

2.3 plot dots with the fixed radius

type = "s2": dots with the fixed radius
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)

2.4 plot arcs with the fixed radius

type = "arc2": arcs with the fixed radius
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)

2.5 plot bar charts (opposite side of cutoff value)

type = "b2": bar charts (opposite side of cutoff value)
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)

2.6 plot arcs with variable radius

type = "arc": arcs with variable radius
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=240, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)

2.7 box plots

type = "box": box plots
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=240, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=200, cir="hg19",type="box",W=40,mapping=cnv.d,B=TRUE, col=colors[1],lwd=0.1,scale = TRUE,col.v = 3)

2.8 histograms

type = "h": histograms
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=240, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=200, cir="hg19",type="box",W=40,mapping=cnv.d,B=TRUE, col=colors[1],lwd=0.1,scale = TRUE,col.v=3)
circos(R=160, cir="hg19",type="h",W=40,mapping=cnv.d,B=FALSE, col=colors[3],lwd=0.1,col.v=3)

type = "link": link lines based on Bezier curve
par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=TRUE,print.chr.lab = TRUE)
circos(R=355, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=355, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=320, cir="hg19",type="arc2",W=35,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=280, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=240, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=200, cir="hg19",type="box",W=40,mapping=cnv.d,B=TRUE, col=colors[1],lwd=0.1,scale = TRUE,col.v=3)
circos(R=160, cir="hg19",type="h",W=40,mapping=cnv.d,B=FALSE, col=colors[3],lwd=0.1,col.v=3)
circos(R=120,cir="hg19",type="link",W=10,mapping=fus.d,col=colors[c(1,7,9)],lwd=2)

3 plot label

3.1 outside label

par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=300, cir="hg19",type="chr",W=10,scale=FALSE,print.chr.lab = FALSE)
circos(R=310, cir="hg19",type="label",W=40,mapping=gene.pos, col=c("black","blue","red"),cex=0.4,side="out")
circos(R=250, cir="hg19",type="b3",W=40,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=250, cir="hg19",type="s2",W=40,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=220, cir="hg19",type="arc2",W=30,mapping=arc.d,B=TRUE, col=colors,lwd=10,cutoff=0)
circos(R=190, cir="hg19",type="b2",W=30,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=160, cir="hg19",type="arc",W=30,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=130, cir="hg19",type="box",W=30,mapping=cnv.d,B=TRUE, col=colors[1],lwd=0.1,scale = TRUE,col.v=3)
circos(R=100, cir="hg19",type="h",W=30,mapping=cnv.d,B=FALSE, col=colors[3],lwd=0.1,col.v=3)
circos(R=90,cir="hg19",type="link",mapping=fus.d,col=colors[c(1,7,9)],lwd=2)

3.2 inside label

par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=FALSE,print.chr.lab = T)
circos(R=390, cir="hg19",type="label",W=50,mapping=gene.pos, col=c("black","blue","red"),cex=0.4,side="in")
circos(R=240, cir="hg19",type="b3",W=50,mapping=cnv.d,B=TRUE, col=colors[7])
circos(R=240, cir="hg19",type="s2",W=50,mapping=cnv.d,B=FALSE, col=colors[1],cex=0.5)
circos(R=190, cir="hg19",type="b2",W=40,mapping=cnv.d,B=TRUE, col=colors[c(7,9)],lwd=2,cutoff=-0.2, col.v=3)
circos(R=140, cir="hg19",type="arc",W=40,mapping=arc.d,B=TRUE, col=colors[c(1,7)],lwd=4,scale = TRUE,col.v=4)
circos(R=90, cir="hg19",type="h",W=40,mapping=cnv.d,B=FALSE, col=colors[3],lwd=0.1,col.v=3)
circos(R=80,cir="hg19",type="link",mapping=fus.d,col=colors[c(1,7,9)],lwd=2)

4 heatmap

par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=400, cir="hg19",type="chr",W=10,scale=FALSE,print.chr.lab = T)
circos(R=300, cir="hg19",type="heatmap2",W=100,mapping=gene.exp, 
       col.v=4, cluster=FALSE,col.bar = FALSE,lwd=0.1,col="blue")
circos(R=200, cir="hg19",type="s",W=100,mapping = gene.pos,
       col.v=4,col=colors,scale = TRUE,B=TRUE)
sig.gene <- gene.pos[gene.pos$p<0.000001,]
circos(R=190, cir="hg19",type="label",W=40,mapping=sig.gene, col=c("black","blue","red"),cex=0.4,side="in")

par(mar=c(1,1,1,1))
plot(c(1,800),c(1,800),type="n",axes=FALSE,xlab="",ylab="")
circos(R=200, cir="hg19",type="chr",W=10,scale=FALSE,print.chr.lab = T)
circos(R=100, cir="hg19",type="heatmap2",W=100,mapping=gene.exp, 
       col.v=4, cluster=FALSE,col.bar = FALSE,lwd=0.1,col="blue")
circos(R=230, cir="hg19",type="s",W=100,mapping = gene.pos,
       col.v=4,col=colors,scale = TRUE,B=TRUE)
sig.gene <- gene.pos[gene.pos$p<0.000001,]
circos(R=330, cir="hg19",type="label",W=40,mapping=sig.gene, col=c("black","blue","red"),cex=0.4,side="out")

http://felixfan.github.io/birthday-paradox

Birthday paradox

The birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367 (since there are only 366 possible birthdays, including February 29). However, 99.9% probability is already reached with just 70 people, and 50% probability with 23 people, under the assumption that each day of the year is equally likely to be a birthday.
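
Under this assumption, the probability that at least two of \(n\) people share a birthday has a standard closed form, which is what the loop below accumulates step by step:

\[P(n) = 1 - \prod_{i=1}^{n-1}\frac{365-i}{365} = 1 - \frac{365!}{365^n\,(365-n)!}\]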

Calculating the probability

x <- rep(NA, 100)
y <- rep(NA, 100)
p <- rep(NA, 100)
x[1]=1
y[1]=1
p[1]=0
for(i in 2:100)
{
  x[i]=i
  y[i]=y[i-1]*(365-i+1)/365
  p[i]=1-y[i]
}
dat = data.frame(numOfIndiv=x, prob=p)
dat2370 = dat[dat$numOfIndiv==23 | dat$numOfIndiv==70,]
dat2370$prob <- round(dat2370$prob, digits=3)

Plot the probability

library(ggplot2)
ggplot(dat, aes(x=numOfIndiv, y=prob)) + 
  geom_line() +
  xlab("Number of Individuals") +
  ylab("Probability of Have Two Individuals with the Same Birthday") +
  ggtitle("Birthday Paradox") +
  geom_point(data=dat2370,aes(x=numOfIndiv, y=prob), colour = "red") +
  geom_label(data=dat2370,
             aes(x=numOfIndiv, y=prob, 
                 label=paste(numOfIndiv,prob,sep=" ")),
             hjust = 1,  vjust = -0.2)

http://felixfan.github.io/latex-markdown
Here is an in-line equation $\sqrt{3x-1}+(1+x)^2$ in the body of the text.

Here is an in-line equation \[ \sqrt{3x-1}+(1+x)^2 \] in the body of the text.

Here is an equation: $\left [ - \frac{\hbar^2}{2 m} \frac{\partial^2}{\partial x^2} + V \right ] \Psi
= i \hbar \frac{\partial}{\partial t} \Psi$

Here is an equation: \[\left [ - \frac{\hbar^2}{2 m} \frac{\partial^2}{\partial x^2} + V \right ] \Psi = i \hbar \frac{\partial}{\partial t} \Psi\]

symbols

1. &

used as separators in alignment environments

a &lt; b

a < b

2. ^, _, { and }

^ used to indicate exponents;
^ used to indicate superscripts;
_ used to indicate subscripts;
{} braces, used for grouping;

x^i_2

\[ x^i_2 \]

{x^i}_2

\[ {x^i}_2 \]

x^{i_2}

\[ x^{i_2} \]

x^{i^2}

\[ x^{i^2} \]

{x^i}^2

\[ {x^i}^2 \]

^ax^b

\[ ^ax^b \]

\sum_{n=1}^\infty

\[ \sum_{n=1}^\infty \]

3. Greek letter

\alpha, \beta, \chi, \Delta, \delta, \epsilon, \eta, \Gamma, \gamma, \iota, \kappa

\[ \alpha, \beta, \chi, \Delta, \delta, \epsilon, \eta, \Gamma, \gamma, \iota, \kappa \]

\Lambda, \lambda, \mu, \omega, \Omega, \phi, \Phi, \pi, \Pi, \psi, \Psi

\[ \Lambda, \lambda, \mu, \omega, \Omega, \phi, \Phi, \pi, \Pi, \psi, \Psi \]

\rho, \sigma, \Sigma, \tau, \theta, \Theta, \upsilon, \Upsilon, \varDelta, \varepsilon, \varGamma

\[ \rho, \sigma, \Sigma, \tau, \theta, \Theta, \upsilon, \Upsilon, \varDelta, \varepsilon, \varGamma \]

\varLambda, \varOmega, \varphi, \varPhi, \varpi, \varPi, \xi, \zeta

\[ \varLambda, \varOmega, \varphi, \varPhi, \varpi, \varPi, \xi, \zeta \]

4. \frac

\frac a b

\[ \frac a b \]

\frac{a-1}b-1

\[\frac{a-1}b-1 \]

\frac{a-1}{b-1}

\[ \frac{a-1}{b-1} \]

github pages: delimiters \\(, \\) and \\[, \\] for inline and displayed math, respectively.
Rstudio: delimiters $, $ and $$, $$ for inline and displayed math, respectively.

Reference: TEX Commands available in MathJax

http://felixfan.github.io/bedtools

1. Introduction

As described on the UCSC Genome Browser website (see link below), the BED format is a concise and flexible way to represent genomic features and annotations. The BED format description supports up to 12 columns, but only the first 3 are required for the UCSC browser, the Galaxy browser and for bedtools.

bedtools allows one to use the “BED12” format (that is, all 12 fields listed below). However, only intersectBed, coverageBed, genomeCoverageBed, and bamToBed will obey the BED12 “blocks” when computing overlaps, etc., via the “-split” option. For all other tools, the last six columns are not used for any comparisons by the bedtools. Instead, they will use the entire span (start to end) of the BED12 entry to perform any relevant feature comparisons. The last six columns will be reported in the output of all comparisons.

chrom - The name of the chromosome on which the genome feature exists. 
	For example, “chr1”, “contig1112.23”. This column is required.
start - The zero-based starting position of the feature in the chromosome. 
	The first base in a chromosome is numbered 0. This column is required.
end - The one-based ending position of the feature in the chromosome.
	This column is required.
name - Defines the name of the BED feature. This column is optional.
	For example, “LINE”, “Exon3”.
score - The UCSC definition requires that a BED score range from 0 to 1000, inclusive. 
	This column is optional.
strand - Defines the strand - either ‘+’ or ‘-‘. This column is optional.
thickStart - The starting position at which the feature is drawn thickly.
	Allowed yet ignored by bedtools.
thickEnd - The ending position at which the feature is drawn thickly.
	Allowed yet ignored by bedtools.
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
	Allowed yet ignored by bedtools.
blockCount - The number of blocks (exons) in the BED line.
	Allowed yet ignored by bedtools.
blockSizes - A comma-separated list of the block sizes.
	Allowed yet ignored by bedtools.
blockStarts - A comma-separated list of block starts.
	Allowed yet ignored by bedtools.

2. bedtools Examples

2.0 intersect command

cat a.bed 
Chr3	11699949	11700000
Chr3	11699967	11700018
Chr3	11699972	11700023
cat b.bed 
Chr3	11699950	11699990
Chr3	11699970	11700020
Chr4	11699972	11700023
-wa Write the original entry in A for each overlap.
bedtools intersect -wa -a a.bed -b b.bed 
Chr3	11699949	11700000
Chr3	11699949	11700000
Chr3	11699967	11700018
Chr3	11699967	11700018
Chr3	11699972	11700023
Chr3	11699972	11700023
-wb Write the original entry in B for each overlap.
bedtools intersect -wb -a a.bed -b b.bed 
Chr3	11699950	11699990	Chr3	11699950	11699990
Chr3	11699970	11700000	Chr3	11699970	11700020
Chr3	11699967	11699990	Chr3	11699950	11699990
Chr3	11699970	11700018	Chr3	11699970	11700020
Chr3	11699972	11699990	Chr3	11699950	11699990
Chr3	11699972	11700020	Chr3	11699970	11700020
-wo Write the original A and B entries plus the number of 
base pairs of overlap between the two features.
bedtools intersect -wo -a a.bed -b b.bed 
Chr3	11699949	11700000	Chr3	11699950	11699990	40
Chr3	11699949	11700000	Chr3	11699970	11700020	30
Chr3	11699967	11700018	Chr3	11699950	11699990	23
Chr3	11699967	11700018	Chr3	11699970	11700020	48
Chr3	11699972	11700023	Chr3	11699950	11699990	18
Chr3	11699972	11700023	Chr3	11699970	11700020	48

2.1 How many overlaps (each overlap is reported on one line) between the bam file and the gtf file are reported?

To allow the input to be read directly from the BAM file, we use the option ‘-abam’. -bed If using BAM input, write output as BED.

bedtools intersect -abam test.bam -b test.gtf -bed -wo > overlaps.bed

This will create a file with the following format: Columns 1-12 : alignment information, converted to BED format Columns 13-21 : annotation (exon) information, from the GTF file Column 22 : length of the overlap

Alternatively, we could first convert the BAM file to BED format using ‘bedtools bamtobed’ then use the resulting file in the ‘bedtools intersect’ command. To answer the question, the number of overlaps reported is precisely the number of lines in the file (because only entries in the first file that have overlaps in file B are reported, according to the option ‘-wo’):

wc -l overlaps.bed

2.2 How many alignments overlap the annotations?

Columns 1-12 define the alignments:

cut -f1-12 overlaps.bed | sort -u | wc -l

2.3 Conversely, how many exons have reads mapped to them?

Columns 13-21 define the exons:

cut -f13-21 overlaps.bed | sort -u | wc -l
http://felixfan.github.io/nba-heatmap

1. NBA players data in 2014-2015 season

1.1 columns of the data

Rk -- Rank
Pos -- Position
Age -- Age of Player at the start of February 1st of that season.
Tm -- Team
G -- Games
GS -- Games Started
MP -- Minutes Played
FG -- Field Goals
FGA -- Field Goal Attempts
FGR -- Field Goal Percentage
F3P -- 3-Point Field Goals
F3PA -- 3-Point Field Goal Attempts
F3PR -- FG% on 3-Pt FGAs.
F2P -- 2-Point Field Goals
F2PA -- 2-point Field Goal Attempts
F2PR -- FG% on 2-Pt FGAs.
eFGR -- Effective Field Goal Percentage
FT -- Free Throws
FTA -- Free Throw Attempts
FTR -- Free Throw Percentage
ORB -- Offensive Rebounds
DRB -- Defensive Rebounds
TRB -- Total Rebounds
AST -- Assists
STL -- Steals
BLK -- Blocks
TOV -- Turnovers
PF -- Personal Fouls
PTS -- Points

1.2 read data

#dat <- read.csv("nba20142015.csv")
library(RCurl)
myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/bioinfor/nba20142015.csv", 
                ssl.verifypeer = FALSE)
dat <- read.csv(textConnection(myCsv))

1.3 select columns for heatmap

Only select the top 20 players with highest points.

keeps <- c('Player','G','FGR','F3PR','F2PR','FTR','ORB','DRB','AST','STL','BLK','TOV','PF','PTS')
subdat <- dat[,names(dat) %in% keeps]
plotdat <- subdat[order(-subdat[,"PTS"]),][1:20,]

Order y-axis inside a geom_tile by PTS. The y-axis is ordered alphabetically in default.

plotdat$Player <- factor(plotdat$Player, levels=(plotdat$Player)[order(plotdat$PTS)])

1.4 prepare for ggplot2

transform data from wide-format to long-format.

library(reshape2)
plotdat.m <- melt(plotdat)

rescale data so that they were between 0 and 1.

library(plyr)
library(scales)
plotdat.m <- ddply(plotdat.m, .(variable), transform, rescale = rescale(value))

1.5 prepare for heatmap, heatmap.2 and d3heatmap

row.names(plotdat) <- plotdat$Player
plotdat.h <- plotdat[,2:14]
plotdat.h <- data.matrix(plotdat.h)

2 Heatmap

#my_col = colorRampPalette(c("yellow","red"))(256)
my_col = colorRampPalette(c("white","green","green4","violet","purple"))(256)

2.1 heatmap in stats package

heatmap(plotdat.h, col = my_col, scale="column",Rowv=NA, Colv=NA)

2.2 heatmap.2 in gplots package

Rowv=FALSE turns off row reorder.

library(gplots)
heatmap.2(plotdat.h, col = my_col, scale="column",dendrogram="none",margins = c(5, 10),Rowv=FALSE)

2.3 d3heatmap in d3heatmap package

library(d3heatmap)
d3heatmap(plotdat.h, scale = "column",dendrogram="none",col = my_col)

2.4 heatmap by ggplot2

library(ggplot2)
ggplot(plotdat.m, aes(variable, Player)) + 
  geom_tile(aes(fill = rescale), colour = "white")+
  scale_fill_gradient(low = "yellow", high = "red")+
  theme(axis.ticks = element_blank(), 
               axis.text.x = element_text(
                 angle = 330, hjust = 0),
               axis.title = element_blank(),
               legend.title = element_blank()
               )

http://felixfan.github.io/bam-sam

1. Sequence Alignment/Map Format Specification

The official defination is here.

“It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information.”

The SAM, VCF, GFF and Wiggle formats are using the 1-based coordinate system.
The BAM, BCFv2, BED, and PSL formats are using the 0-based coordinate system.

1.1 The header section


1.2 The alignment section: mandatory fields

“In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line has 11 mandatory fields. These fields always appear in the same order and must be present, but their values can be ‘0’ or ‘*’ (depending on the field) if the corresponding information is unavailable. The following table gives an overview of the mandatory fields in the SAM format”

1.2.1 The FLAG field

1.2.2 The CIGAR field

2. BAM format

SAM files and BAM files contain the same information, but in a different format. BAM is compressed in the BGZF format.

3. Practice

3.1 How many alignments does the BAM file contain?

A BAM file contains alignments for a set of input reads. Each read can have 0 (none), 1 or multiple alignments on the genome. The number of alignments is the number of entries, excluding the header, contained in the BAM file, or equivalently in its SAM conversion.

samtools flagstat test.bam

An alternate method would be to count the number of lines in the converted SAM file (header excluded):

samtools view test.bam | wc -l

If the BAM file was created with a tool that includes unmapped reads into the BAM file, we would need to exclude the lines representing unmapped reads, i.e. with a “*” in column 3 (chrom)

samtools view test.bam | cut -f 3 | grep -v '*' | wc -l

3.2 How many alignments show the read’s mate unmapped?

An alignment with an unmapped mate is marked with a ‘*’ in column 7.

samtools view test.bam | cut -f 7 | grep -c '*'

3.3 How many alignments contain a deletion (D)?

Deletions are marked with the letter 'D' in the CIGAR string for the alignment, shown in column 6.

samtools view test.bam | cut -f 6 | grep -c 'D'

3.4 How many alignments show the read’s mate mapped to the same chromosome?

An alignment with mate mapped to same chromosome is marked with a “=” in column 7.

samtools view test.bam | cut -f 7 | grep -c '='

3.5 How many alignments are spliced?

A spliced alignment will be marked with an “N” (intron gap) in the CIGAR field (column 6).

samtools view test.bam | cut -f 6 | grep -c 'N'

3.6 How many sequences are in the genome file?

This information can be found in the header of the BAM file. The number of lines describing the sequences in the reference genome.

samtools view -H test.bam | grep -c "SN:"

3.7 What is the length of the first sequence in the genome file?

The length information is stored alongside the sequence identifier in the header (pattern “LN:seq_length”).

samtools view -H test.bam | grep "SN:" | more

3.8 What alignment tool was used?

The program name is listed in the @PG line in the BAM header (pattern “ID:program_name”).

samtools view -H test.bam | grep "^@PG"

3.9 Extract a subregion from the BAM file.

Extract 1,000,000 to 10,000,000 on chromsome 3.

echo "Chr3 1000000 10000000" > region.bed
samtools view -b -L region.bed test.bam > test_region.bam
http://felixfan.github.io/google-scholar-citation
library(scholar)
library(ggplot2)
cit <- get_citation_history('8fX1TSQAAAAJ')
# '8fX1TSQAAAAJ' is my google scholar id 
ggplot(cit,aes(x=year,y=cites)) + 
  geom_bar(stat='identity') +
  theme_bw() +
  xlab('Year') +
  ylab('Google Scholar Citations') + 
  annotate('text',
           label=format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"),
           x=-Inf, y=Inf, 
           vjust=1.5, hjust=-0.05,
           size=3,colour='gray') +
  geom_text(aes(label=cites), vjust=1.5, color="white", size=3) +
  scale_y_continuous(limits = c(0, 60)) +
  scale_x_continuous(breaks=2011:2016) +
  ggtitle("h-index = 6\ni10-index = 5")

http://felixfan.github.io/formattable
library(formattable)
df <- data.frame(
    Disease=c("BD","BD","CAD","CD","CD","RA","RA","T1D","T2D"),
    K=c(0.0045, 0.01, 0.056,0.0005,0.001,0.0075,0.011,0.0054,0.028),
    NO.SigSNP=c(0,0,13,26,26,3,3,43,10),
    VE=c(NA,NA,0.0091,0.0148, 0.0164,0.0205,0.0219,0.1265,0.0051),
    SampleSize = c(2000,2000,10000,1999,1999,3001,3001,4000,999),
    Tested = c(TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,TRUE),
    Grade=c("A","A","B","C","C","B","B","A","C"),
    Score=c(8.9,8.9,9.1,9.9,9.9,9.5,9.5,4.3,1.6)
)
formattable(df)
Disease K NO.SigSNP VE SampleSize Tested Grade Score
BD 0.0045 0 NA 2000 TRUE A 8.9
BD 0.0100 0 NA 2000 TRUE A 8.9
CAD 0.0560 13 0.0091 10000 FALSE B 9.1
CD 0.0005 26 0.0148 1999 FALSE C 9.9
CD 0.0010 26 0.0164 1999 FALSE C 9.9
RA 0.0075 3 0.0205 3001 TRUE B 9.5
RA 0.0110 3 0.0219 3001 TRUE B 9.5
T1D 0.0054 43 0.1265 4000 FALSE A 4.3
T2D 0.0280 10 0.0051 999 TRUE C 1.6
df2 <- df
df2$K <- percent(df2$K)
df2$SampleSize <- accounting(df2$SampleSize,format="d")
df2$Tested <- formattable(df2$Tested, "yes","no")
formattable(df2)
Disease K NO.SigSNP VE SampleSize Tested Grade Score
BD 0.45% 0 NA 2,000 yes A 8.9
BD 1.00% 0 NA 2,000 yes A 8.9
CAD 5.60% 13 0.0091 10,000 no B 9.1
CD 0.05% 26 0.0148 1,999 no C 9.9
CD 0.10% 26 0.0164 1,999 no C 9.9
RA 0.75% 3 0.0205 3,001 yes B 9.5
RA 1.10% 3 0.0219 3,001 yes B 9.5
T1D 0.54% 43 0.1265 4,000 no A 4.3
T2D 2.80% 10 0.0051 999 yes C 1.6
formattable(df, list(
  K = formatter("span",
                    style=x~style(color=ifelse(rank(-x)<=3,"red","gray")),
                    x~sprintf("%.4f (rank: %2d)", x, rank(-x))
                    ),
  Tested = formatter("span",
                     style=x~style(color=ifelse(x,"green","red")),
                     x~icontext(ifelse(x,"ok","remove"),ifelse(x,"yes","no"))
                     ),
  Grade = formatter("span", style=x~ifelse(x=="A",style(color="red",font.weight="bold"),NA)),
  Score = color_bar("pink", 0.5)
))
Disease K NO.SigSNP VE SampleSize Tested Grade Score
BD 0.0045 (rank: 7) 0 NA 2000 yes A 8.9
BD 0.0100 (rank: 4) 0 NA 2000 yes A 8.9
CAD 0.0560 (rank: 1) 13 0.0091 10000 no B 9.1
CD 0.0005 (rank: 9) 26 0.0148 1999 no C 9.9
CD 0.0010 (rank: 8) 26 0.0164 1999 no C 9.9
RA 0.0075 (rank: 5) 3 0.0205 3001 yes B 9.5
RA 0.0110 (rank: 3) 3 0.0219 3001 yes B 9.5
T1D 0.0054 (rank: 6) 43 0.1265 4000 no A 4.3
T2D 0.0280 (rank: 2) 10 0.0051 999 yes C 1.6
http://felixfan.github.io/linux-comm
echo -e "a\nb\nc\n" > 1.txt
echo -e "a\nc\nd\n" > 2.txt

Compare sorted files FILE1 and FILE2 line-by-line. With no options, comm produces three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files. Each of these columns can be suppressed individually with options.

-1     suppress column 1 (lines unique to FILE1)
-2     suppress column 2 (lines unique to FILE2)
-3     suppress column 3 (lines that appear in both files)

Three columns are printed: the first contains lines unique to 1.txt, the second lines unique to 2.txt, and the third lines common to both.

comm 1.txt 2.txt
		a
b
		c

	d

Print only the third column, i.e. the lines common to both files.

comm -12 1.txt 2.txt
a
c

Print only the second column, i.e. the lines unique to the second file.

comm -13 1.txt 2.txt
d

Print only the first column, i.e. the lines unique to the first file.

comm -23 1.txt 2.txt
b

Print only the second and third columns.

comm -1 1.txt 2.txt
	a
	c
d

Print only the first and third columns.

comm -2 1.txt 2.txt
	a
b
	c

Print only the first and second columns.

comm -3 1.txt 2.txt
b

	d
http://felixfan.github.io/linux-cut

The cut command extracts bytes, characters, and fields from each line of a file and writes them to standard output.

-d : specify the delimiter; the default is the tab character. cut only accepts a single-character delimiter.
-f : used together with -d to specify which field(s) to print.
echo -e "a b  c\td\n" > test.txt

Split on the default tab ("\t") delimiter

cut -f 1 test.txt 
a b  c

cut -f 2 test.txt 
d

Specify a space as the delimiter

cut -d ' ' -f 1 test.txt 
a

cut -d ' ' -f 2 test.txt 
b

cut -d ' ' -f 3 test.txt 


cut -d ' ' -f 4 test.txt 
c	d

Specify a space as the delimiter and extract multiple fields

cut -d ' ' -f 1 -f 2 test.txt 
a b

cut -d ' ' -f 1-2 test.txt 
a b

cut -d ' ' -f 1,2 test.txt 
a b

cut -d ' ' -f1,2 test.txt 
a b
http://felixfan.github.io/stacking-plots-same-x-revised

1 download data

library(FinCal)
require(gridExtra)
library(ggplot2)

dat <- get.ohlc.yahoo('AAPL', '2015-12-01', '2015-12-31')
dat$date <- as.Date(dat$date, "%Y-%m-%d")
dat$volume <- dat$volume/1000000

2 two different plots

Use these two options.

axis.text.y = element_text(angle = 90)
xlim(range(dat$date))
p1 <- ggplot(dat, aes(date, adjusted)) + geom_line() + 
      theme(
            axis.title.x = element_blank(), 
            axis.text.x = element_blank(),
            axis.text.y = element_text(angle = 90)
            ) +
    xlim(range(dat$date))
p2 <- ggplot(dat,aes(date, volume)) + geom_bar(stat="identity") + 
      theme(
            axis.text.x = element_text(angle=90),
            axis.text.y = element_text(angle = 90)
            ) +
    ylab("volume(millions)") +
    xlim(range(dat$date))

grid.arrange(p1, p2, ncol = 1, heights = c(2, 1))

p1 <- p1 + theme(plot.margin = unit(c(0, .5, -1.5, 0), "lines"))
p2 <- p2 + theme(plot.margin = unit(c(0, .5, 0, 0), "lines"))
grid.arrange(p1, p2, ncol = 1, heights = c(2, 1))

http://felixfan.github.io/china-map

Although only the .shp file is read directly, the .shx and .dbf files must also be in the same directory for the read to succeed.

library(maptools)
library(ggplot2)
library(plyr)

str.crs <- "+proj=longlat +ellps=clrk66"
china_map <- readShapePoly("bou2_4p.shp",proj4string=CRS(str.crs))

x <- china_map@data                          # read the administrative information
xs <- data.frame(x,id=seq(0:924)-1)          # 925 shapes in total, including islands
china_map1 <- fortify(china_map)             # convert to a data frame
china_map_data <- join(china_map1, xs, type = "full")       # merge the two data frames

p <- ggplot(china_map_data, aes(x = long, y = lat))
p <- p +
  theme(               
      panel.grid = element_blank(),
      panel.background = element_blank(),
      axis.text = element_blank(),
      axis.ticks = element_blank(),
      axis.title = element_blank(),
      legend.position = "none"
  )

Set the borders to grey and fill each province with white

p + geom_polygon(aes(group=group), fill= "white", colour="grey60")

Set the borders to grey and fill each province with a different colour

p1 <- p + geom_polygon(aes(group = group,fill=NAME),colour="grey60")
p1

Add the provincial capitals

capital <- readShapePoints("res1_4m.shp", proj4string = CRS(str.crs))
cap.name <- iconv(capital@data$NAME, "gbk", "utf8")
cap.coords <- coordinates(capital)
n <- length(cap.name)
pos <- rep(4, times = n)
spec.name <- c("呼和浩特", "成都", "西宁", "太原", "合肥"
               , "武汉","长沙", "银川", "南宁", "澳门")
pos[cap.name %in% spec.name] <- 2
spec.name <- c("北京", "南京", "上海", "广州")
pos[cap.name %in% spec.name] <- 3

cap_map_data <- data.frame(name=cap.name, 
                           long=as.data.frame(cap.coords)$coords.x1,
                           lat=as.data.frame(cap.coords)$coords.x2,
                           pos=pos)
p1 + geom_text(data=cap_map_data, aes(x=long, y= lat, label=name),family='STKaiti')

Add the coordinates of the provincial capitals (white open circles)

p1 + 
  geom_point(data=cap_map_data,aes(x=long, y= lat),shape=1,colour="white") +
  geom_text(data=cap_map_data,aes(x=long, y= lat,label=name),family='STKaiti')

Use ggrepel to avoid overlapping labels

library(ggrepel)
p1 + geom_text_repel(data=cap_map_data,aes(x=long, y= lat,label=name),family='STKaiti')

Use ggrepel only to fix the overlapping labels of Hong Kong and Macau; for the other cities show both the label and the coordinate point

spec.city <- c("香港","澳门")
cap_map_data1 <- cap_map_data[cap_map_data$name %in% spec.city,]
cap_map_data2 <- cap_map_data[!cap_map_data$name %in% spec.city,]
p1 + geom_point(data=cap_map_data2,aes(x=long, y= lat),shape=1,colour="white") +
  geom_text(data=cap_map_data2,aes(x=long, y= lat,label=name),family='STKaiti')+
  geom_text_repel(data=cap_map_data1,aes(x=long, y= lat,label=name),family='STKaiti')

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggrepel_0.4     plyr_1.8.3      ggplot2_2.0.0   maptools_0.8-37
[5] sp_1.2-1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3      lattice_0.20-33  digest_0.6.9     grid_3.2.3      
 [5] gtable_0.1.2     formatR_1.2.1    magrittr_1.5     scales_0.3.0    
 [9] evaluate_0.8     stringi_1.0-1    rmarkdown_0.9.2  labeling_0.3    
[13] tools_3.2.3      stringr_1.0.0    foreign_0.8-66   munsell_0.4.2   
[17] yaml_2.1.13      colorspace_1.2-6 htmltools_0.3    knitr_1.12      
http://felixfan.github.io/ssh-scp

SSH is used for remote login; SCP is used for remote file transfer. Suppose you want to log in to the remote host host as user user; a single command is all you need. host can be either a domain name or an IP address.

ssh user@host

If the local username is the same as the remote username, the username can be omitted when logging in.

ssh host

The default SSH port is 22; use the -p option to change it, e.g. when the remote host listens on port 99.

ssh -p 99 user@host

Log out.

exit

File transfer.
Copy test.txt from the local ldir directory to the rdir directory on the remote host:

scp /ldir/test.txt user@host:/rdir

Copy test.txt from the rdir directory on the remote host to the local ldir directory:

scp user@host:/rdir/test.txt /ldir/test.txt

When transferring directories, use the -r option; the wildcard * also works.

http://felixfan.github.io/aesthetic-mappings-ggplot2
library(ggplot2)

1. Setting vs. mapping

1.1 sets the colour of the points to a constant, using the colour parameter of the layer

p <- ggplot(mtcars, aes(mpg, wt))
p + geom_point(colour = "darkblue")

1.2 maps an aesthetic to a variable

p + geom_point(aes(colour = factor(cyl)))

1.3 maps the colour to the value “darkblue”

"This effiectively creates a new variable containing only the value "darkblue" and 
then maps colour to that new variable. Because this value is discrete, the default
colour scale uses evenly spaced colours on the colour wheel, and since there is 
only one value this colour is pinkish."
p + geom_point(aes(colour = "darkblue"))

http://felixfan.github.io/stacking-plots-same-x

1 download data

library(FinCal)
library(reshape2)

dat <- get.ohlc.yahoo('AAPL', '2015-12-01', '2015-12-31')
#dat$date <- strptime(dat$date, "%Y-%m-%d")
dat$date <- as.Date(dat$date, "%Y-%m-%d")
dat$times <- seq(nrow(dat))
mm <- melt(subset(dat, select=c(times,adjusted, volume)), id.var="times")

2 lattice package

library(lattice)
xyplot(value~times|variable,data=mm,type="l",
       scales=list(y=list(relation="free")),
       layout=c(1,2))

3 ggplot2 package

library(ggplot2)
ggplot(mm, aes(x = times, y = value)) + geom_line(aes(color = variable)) + 
      facet_grid(variable ~ ., scales = "free_y") + theme(legend.position = "none")

4 two different plots

p1 <- ggplot(dat, aes(date, adjusted)) + geom_line() + theme_minimal() + 
      theme(axis.title.x = element_blank(), axis.text.x = element_blank())
p2 <- ggplot(dat,aes(date, volume)) + geom_bar(stat="identity") + theme_minimal() + 
      theme(axis.title.x = element_blank(),axis.text.x = element_text(angle=90))

4.1 same widths, same heights

library(grid)
grid.newpage()
grid.draw(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "last"))

p3 <- p1 + scale_x_date(date_breaks = "1 day")
p4 <- p2 + scale_x_date(date_breaks = "1 day")
grid.newpage()
grid.draw(rbind(ggplotGrob(p3), ggplotGrob(p4), size = "last"))

4.2 different heights, different widths

require(gridExtra)
grid.arrange(p1, p2, ncol = 1, heights = c(2, 1))

4.3 different heights, same widths

gb1 <- ggplot_build(p1)
gb2 <- ggplot_build(p2)
n1 <- length(gb1$panel$ranges[[1]]$y.labels)
n2 <- length(gb2$panel$ranges[[1]]$y.labels)
gA <- ggplot_gtable(gb1)
gB <- ggplot_gtable(gb2)
g <- gtable:::rbind_gtable(gA, gB, "last")
panels <- g$layout$t[grep("panel", g$layout$name)]
g$heights[panels] <- list(unit(n1*5, "null"), unit(n2,"null")) # change 5 to other int
grid.newpage()
grid.draw(g)

Reference

  • http://stackoverflow.com/questions/11794436/stacking-multiple-plots-vertically-with-the-same-x-axis-but-different-y-axes-in

  • https://gist.github.com/tomhopper/faa24797bb44addeba79

  • http://stackoverflow.com/questions/30308658/arrange-multiple-ggplots-with-same-plot-width-but-different-plot-height

http://felixfan.github.io/linux-find

Usage

find -path pattern -type [d|f]

Find directories

Find all directories whose names start with “2” under the current directory.

find -path './2*' -type d

Find all directories whose names start with “2” inside any subdirectory of the current directory.

find -path './*/2*' -type d

Find files

Change the ‘-type d’ to ‘-type f’.

http://felixfan.github.io/linux-ftp

Logging into an FTP server and exit the FTP server

To connect to the FTP server

ftp ftpserver

or

ftp -i ftpserver

ftpserver is the full machine name of the remote machine or the net address of the remote machine. -i turns off interactive prompting during multiple file transfers.

to exit the FTP environment

quit

or

bye

to get a listing of files in your current local directory

!ls

to list the names of the files in the current remote directory

ls

to change directory on your local machine

lcd

to change directory on the remote machine

cd

Downloading and uploading files

to copy one file from the remote machine to the local machine

get filename

to copy multiple files from the remote machine to the local machine

mget *

to copy one file from the local machine to the remote machine

put filename

to copy multiple files from the local machine to the remote machine

mput *

Other ftp commands

to make a new directory within the current remote directory

mkdir dirname

to remove (delete) a directory in the current remote directory

rmdir dirname

to delete a file in the current remote directory

delete filename

to find out the pathname of the current directory on the remote machine

pwd

References

Basic FTP Commands
Beginner’s guide to using ftp

http://felixfan.github.io/install-pip

Starting from Python 3.4, pip is already included in the regular install; it lives in the Scripts directory, so all you need to do is add that directory (e.g., C:\Python34\Scripts) to your system’s PATH environment variable.

If pip is not installed, add the Scripts directory to your system’s PATH environment variable and then use easy_install to install pip.

easy_install pip

Just type pip to see the help information.

pip

Install package using the following command:

pip install pkgName

e.g.,

pip install matplotlib
http://felixfan.github.io/pair-wise-plot
id <- c("A","B","C","D","E","F","G","H")
df =data.frame(id1=rep(id,each=8),id2=rep(id,8),comm=rep(0,64),class=rep(0,64))
n=1
for(i in 1:length(id)){
  for(j in 1:length(id)){
    if(i==j){
      df$comm[n]=as.integer(rnorm(1,200,30))
      df$class[n]=1
    }else{
      df$comm[n]=as.integer(rnorm(1,100,30))
    }
    n=n+1
  }
}
library(ggplot2)
ggplot(df, aes(x=factor(id2),y=comm,fill=factor(class)))+
geom_bar(stat="identity")+
facet_grid(id1 ~ .)+
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"),
legend.title = element_blank(),
legend.position="none")+
labs(x="",y="")

plot of chunk pairwiseplot

http://felixfan.github.io/quantify-ld

If alleles A and B are the coupled alleles at two different loci and if \(p_{B} = p_{A} + v\), with \(v \ge 0\), and if \(r^2\) exceeds some threshold t, then

\[ v_{min} = \frac{p_{A}(1 - p_{A})(1 - t)}{(1 - p_{A}(1 - t))} \] \[ v_{max} = \frac{p_{A}(1 - p_{A})(1 - t)}{(p_{A}(1 - t) + t)} \]

library(plyr)
pA <- seq(0.001,0.999,0.001)
pa <- 1-pA
t2 <- rep(0.2,999)
t5 <- rep(0.5,999)
t8 <- rep(0.8,999)
df <- data.frame(pA=pA,pa=pa,t2=t2,t5=t5,t8=t8)

df2 <- transform(df,
vmax2=pa*pA*(1-t2)/(pA*(1-t2)+t2),
vmax5=pa*pA*(1-t5)/(pA*(1-t5)+t5),
vmax8=pa*pA*(1-t8)/(pA*(1-t8)+t8),
vmin2=pa*pA*(1-t2)/(1-pA*(1-t2)),
vmin5=pa*pA*(1-t5)/(1-pA*(1-t5)),
vmin8=pa*pA*(1-t8)/(1-pA*(1-t8))
)

Maximum difference \(v_{max}\) between allele frequency between loci given \(r^2\)

library(ggplot2)
library(reshape2)
df <- df2[,c("pA","vmax2","vmax5","vmax8")]
df <-melt(df, id="pA")
ggplot(df,aes(x=pA,y=value,colour=variable))+
geom_line()+
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"),
legend.title=element_blank())+
labs(x = expression("Allele frequency at locus 1 (p"[A]*")"),
y = expression(paste("Maximum difference in allele freqency between loci given\t", r^2)))+
scale_colour_discrete(breaks=c("vmax2", "vmax5", "vmax8"),
labels=c(expression(r^2 >= 0.2),
expression(r^2 >= 0.5),
expression(r^2 >= 0.8)))

plot of chunk vmax

Possible range of allele frequencies at two loci given the LD between the two loci

df3 <- transform(df2,pB2=pA+vmax2,pB5=pA+vmax5,pB8=pA+vmax8)
df4 <- transform(df2,pB2=pA-vmin2,pB5=pA-vmin5,pB8=pA-vmin8)
df <- rbind(df3,df4)
df <- df[,c("pA","pB2","pB5","pB8")]
df <-melt(df, id="pA")
ggplot(df,aes(x=pA,y=value,colour=variable))+
geom_line()+
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"),
legend.title=element_blank())+
labs(x = expression("Allele frequency at locus 1 (p"[A]*")"),
y = expression("Allele frequency at locus 2 (p"[B]*")"))+
scale_colour_discrete(breaks=c("pB2", "pB5", "pB8"),
labels=c(expression(r^2 >= 0.2),
expression(r^2 >= 0.5),
expression(r^2 >= 0.8)))

plot of chunk frange

reference

  • Allele frequencies and the r2 measure of linkage disequilibrium: impact on design and interpretation of association studies. Twin Res Hum Genet. 2005 Apr;8(2):87-94.
http://felixfan.github.io/coverage

naive estimate of coverage of all SNPs in the genome (G)

\[ \frac{T+L}{R} \]

R: a reference set of SNPs
T: tag SNPs
L: SNPs that are in LD with a SNP in T

corrected genomewide coverage

\[ \frac{\frac{L}{R-T}(G-T)+T}{G} \]

G: the number of common SNPs in the genome.
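
A minimal R sketch of both estimates; the counts below (nR, nT, nL, nG) are made-up, purely illustrative numbers:

nR <- 3000000   # SNPs in the reference set R (hypothetical)
nT <- 500000    # tag SNPs T (hypothetical)
nL <- 1800000   # reference SNPs in LD with at least one tag SNP (hypothetical)
nG <- 7000000   # common SNPs in the genome G (hypothetical)

naive     <- (nT + nL) / nR
corrected <- (nL / (nR - nT) * (nG - nT) + nT) / nG
c(naive = naive, corrected = corrected)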

Reference

  • Jeffrey C Barrett & Lon R Cardon. Evaluating coverage of genome-wide association studies. NATURE GENETICS, VOLUME 38, NUMBER 6, PAGE 656-662, JUNE 2006.
http://felixfan.github.io/shell-script

1 awk

1.1 Sum/average column 1 of file.txt

awk '{sum+=$1} END {print sum}' file.txt
awk '{sum+=$1} END {print sum/NR}' file.txt

1.2 Number each line in file.txt

sed = file.txt | sed 'N;s/\n/ /'

1.3 Get unique entries in file based on column 2 (takes only the first instance)

awk '!arr[$2]++' file.txt
awk '!($2 in arr){print}{arr[$2]++}' file.txt
awk '$5 == "abc123"' file.txt
awk '$5 != "abc123"' file.txt
awk '$1 >0 && $1 <23' file.txt
awk '$1=="yes" || $2=="yes"{print $1,$2}' file.txt
awk '$7  ~ /[a-f]/' file.txt
awk '$7  ~ /[1-4]/' file.txt
awk '$7  ~ /^[a-f]/' file.txt
awk '$7  !~ /[1-4]/' file.txt
awk '$7 ~/rs/' file.txt
awk '$7 !~/rs/' file.txt

2 sed

2.1 Trim leading and/or trailing whitespace in file.txt

sed 's/^[ \t]*//' file.txt
sed 's/[ \t]*$//' file.txt
sed 's/^[ \t]*//;s/[ \t]*$//' file.txt

2.2 Delete blank lines in file.txt

sed '/^$/d' file.txt

2.3 Replace all occurances of “foo” with “bar” in file.txt

sed 's/foo/bar/g' file.txt

# only replace the first instance in each line
sed 's/foo/bar/' file.txt

2.4 Extract every 4th line starting at the second line

## line 2, 6, 10, ...
sed -n '2~4p' file.txt

3 sort, uniq

3.1 Count the number of unique lines in file.txt

sort file.txt | uniq | wc -l

3.2 Find the lines shared by 2 files (pipe to wc -l to count them):

A \(\cap\) B

sort file1 file2 | uniq -d

3.3 Find the unique lines across 2 files (pipe to wc -l to count them):

A \(\cup\) B

sort file1 file2 | uniq

3.4 Sort file by column

file: 1.txt

1       2       3
2       2       4
1       1       1
3       4       1
3       1       2
10      2       9

sort -k1,1 1.txt # sort by first column

1       1       1
1       2       3
10      2       9
2       2       4
3       1       2
3       4       1

sort -k1,1 -n 1.txt # sort by first column, numeric sort

1       1       1
1       2       3
2       2       4
3       1       2
3       4       1
10      2       9

sort -k1,1 -k3,3 -n 1.txt # then use the third column as a tie breaker

1       1       1
1       2       3
2       2       4
3       4       1
3       1       2
10      2       9

sort -k1,1 -k3,3 -n -r 1.txt # reverse the order

10      2       9
3       1       2
3       4       1
2       2       4
1       2       3
1       1       1

4 cut

4.1 Find the most common strings in column 2

cut -f2 file.txt | sort | uniq -c | sort -k1nr | head

5 split

5.1 Customize Split File Size using -b option

split -b2000000 test.txt        # 2Mb per file

5.2 Customize the Number of Split Chunks using -n option

split -n20 test.txt             # create 20 chunks of split files

5.3 Customize Number of Lines using -l option

split -l5000 test.txt             # split files are created with 5000 lines

5.4 Do not generate empty output files with -e option

split -n20 -e test.txt

6 join

6.1 Join two files by matching the first fields

join FILE1 FILE2

6.2 Join two files by matching the first fields, ignore case using -i option

join -i FILE1 FILE2

6.3 Also print unpairable lines from file FILENUM using -a option

where FILENUM is 1 or 2, corresponding to FILE1 or FILE2

join -a1 FILE1 FILE2           ## also print unpairable lines from FILE1
join -v1 FILE1 FILE2          ## only print unpaired lines from FILE1

6.5 Join based on different columns from both files using -1 and -2 option

join -1 2 -2 1 FILE1 FILE2
##join based on the second column of FILE1 and the first column of FILE2

7 paste

paste is used to create columns of data with a user-specified delimiter.

a.txt

a
b
c

b.txt

1
2
3

paste a.txt b.txt

a       1
b       2
c       3

paste b.txt a.txt

1       a
2       b
3       c

paste -d ',' a.txt b.txt

a,1
b,2
c,3

8 cat

cat a.txt b.txt

a
b
c
1
2
3

cat b.txt a.txt

1
2
3
a
b
c

9 Conditionals

if [ expression ]
then
    code if 'expression' is true.
fi

if [ expression ]
then
    code if 'expression' is true.
else
    code if 'expression' is false.
fi

x=1
if [ $x -le 5 ] # a space after "[" and before "]", respectively
then
    echo "x is less or equal to 5"
else
    echo "x is larger than 5"
fi

10 Loops

for i in {1..22} # bash version 3.0+
do
    echo "chromosome $i"
done

for i in {1..10..2}      # {START..END..INCREMENT}, bash v4.0+
do
    echo "chromosome $i"
done

for((c=1; c<=5; c++ )) # you need (()) but not ()
do
    echo "Welcome $c times"
done

x=1
while [ $x -le 5 ] # a space after "[" and before "]", respectively
do
    echo "Welcome $x times"
    x=$(( $x + 1 ))
done

11 shuf

11.1 Pick 10 random lines from a file:

shuf file.txt | head -n 10

12 echo

echo {A,C,T,G}{A,C,T,G}{A,C,T,G}


http://felixfan.github.io/use-size-of-point-to-represent-the-number-of-it
set.seed(999)
x=sample(1:10,1000,replace=TRUE)
y=sample(1:10,1000,replace=TRUE)
plot(x,y)

plot of chunk point-size-1

f=as.data.frame(table(x,y))
f$x=as.numeric(as.vector(f$x))
f$y=as.numeric(as.vector(f$y))
f$Freq=as.numeric(as.vector(f$Freq))
plot(f$x,f$y,cex=2*f$Freq/mean(f$Freq))

plot of chunk point-size-2

cex can be any other value derived from f$Freq. I divided it by its mean so that the points do not get too big.
http://felixfan.github.io/R-Aggregate

Attach data

head(mtcars)
mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
attach(mtcars)

Aggregate data frame mtcars by cyl and vs, returning means for numeric variables

When using the aggregate() function, the by variables must be in a list (even if there is only one). The function can be built-in or user provided.

aggregate(mtcars, by = list(cyl, vs), FUN = mean, na.rm = TRUE)
Group.1 Group.2   mpg cyl  disp    hp  drat    wt  qsec vs     am  gear
1       4       0 26.00   4 120.3  91.0 4.430 2.140 16.70  0 1.0000 5.000
2       6       0 20.57   6 155.0 131.7 3.807 2.755 16.33  0 1.0000 4.333
3       8       0 15.10   8 353.1 209.2 3.229 3.999 16.77  0 0.1429 3.286
4       4       1 26.73   4 103.6  81.8 4.035 2.300 19.38  1 0.7000 4.000
5       6       1 19.12   6 204.6 115.2 3.420 3.389 19.21  1 0.0000 3.500
carb
1 2.000
2 4.667
3 3.500
4 1.500
5 2.500

or

aggregate(. ~ cyl + vs, data = mtcars, FUN = mean, na.rm = TRUE)
cyl vs   mpg  disp    hp  drat    wt  qsec     am  gear  carb
1   4  0 26.00 120.3  91.0 4.430 2.140 16.70 1.0000 5.000 2.000
2   6  0 20.57 155.0 131.7 3.807 2.755 16.33 1.0000 4.333 4.667
3   8  0 15.10 353.1 209.2 3.229 3.999 16.77 0.1429 3.286 3.500
4   4  1 26.73 103.6  81.8 4.035 2.300 19.38 0.7000 4.000 1.500
5   6  1 19.12 204.6 115.2 3.420 3.389 19.21 0.0000 3.500 2.500

Mean of mpg by cyl

aggregate(mpg, by = list(cyl), FUN = mean, na.rm = TRUE)
Group.1     x
1       4 26.66
2       6 19.74
3       8 15.10

or

aggregate(mpg ~ cyl, data = mtcars, FUN = mean, na.rm = TRUE)
cyl   mpg
1   4 26.66
2   6 19.74
3   8 15.10

The number of rows for that cyl & vs combination

aggregate(mtcars, by = list(cyl, vs), FUN = length)
Group.1 Group.2 mpg cyl disp hp drat wt qsec vs am gear carb
1       4       0   1   1    1  1    1  1    1  1  1    1    1
2       6       0   3   3    3  3    3  3    3  3  3    3    3
3       8       0  14  14   14 14   14 14   14 14 14   14   14
4       4       1  10  10   10 10   10 10   10 10 10   10   10
5       6       1   4   4    4  4    4  4    4  4  4    4    4

or

aggregate(. ~ cyl + vs, data = mtcars, FUN = length)
cyl vs mpg disp hp drat wt qsec am gear carb
1   4  0   1    1  1    1  1    1  1    1    1
2   6  0   3    3  3    3  3    3  3    3    3
3   8  0  14   14 14   14 14   14 14   14   14
4   4  1  10   10 10   10 10   10 10   10   10
5   6  1   4    4  4    4  4    4  4    4    4

Detach data

detach(mtcars)

Session information

sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] knitr_1.5

loaded via a namespace (and not attached):
[1] evaluate_0.5.3 formatR_0.10   stringr_0.6.2  tools_3.0.3


http://felixfan.github.io/FinCal-example-0.6

Updated on Tue Dec 29 10:47:56 2015

FinCal – Time Value of Money, time series analysis and Computational Finance

FinCal is available on CRAN (http://cran.r-project.org/web/packages/FinCal/) and GitHub

install.packages("FinCal",dependencies=TRUE) # from CRAN

or

library("devtools")
install_github("felixfan/FinCal") # from GitHub
library(FinCal)
Note: for all examples, cash inflows are positive and outflows are negative.

Examples for version 0.1 -0.6

Example 1 Computing effective annual rate (EAR)

Using a stated rate of 4.25%, compute EARs for semiannual, quarterly, monthly, daily and continuous compounding.

ear(0.0425,2)
[1] 0.04295156
ear(0.0425,4)
[1] 0.04318215
ear(0.0425,12)
[1] 0.04333772
ear(0.0425,365)
[1] 0.04341347
ear.continuous(0.0425)
[1] 0.04341606
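
As a check, these values follow from the standard compounding formulas, where r is the stated annual rate and m the number of compounding periods per year:

\[ EAR = \left(1 + \frac{r}{m}\right)^m - 1, \qquad EAR_{continuous} = e^{r} - 1 \]

\[ \left(1 + \frac{0.0425}{2}\right)^2 - 1 \approx 0.04295, \qquad e^{0.0425} - 1 \approx 0.04342 \]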

Example 2 Future value (FV) of a single sum

Calculate the FV of a $500 investment at the end of ten years if it earns an annually compounded rate of return of 6%.

fv.simple(r=0.06,n=10,pv=-500)
[1] 895.4238
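
The underlying calculation (pv is entered as -500 because outflows are negative):

\[ FV = PV(1 + r)^n = 500 \times 1.06^{10} \approx 895.42 \]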

Example 3 PV of a single sum

Given a discount rate of 3%, calculate the PV of a $1,000,000 cash flow that will be received in five years.

pv.simple(r=0.03,n=5,fv=1000000)
[1] -862608.8

Example 4 FV of an ordinary annuity and annuity due

What is the future value of an ordinary annuity that pays $15,000 per year at the end of each of the next 25 years, given the investment is expected to earn a 6% rate of return?

fv.annuity(r=0.06,n=25,pmt=-15000,type=0)
[1] 822967.7

What is the future value of an annuity that pays $10,000 per year at the beginning of each of the next three years, commencing today, if the cash flows can be invested at an annual rate of 5%?

fv.annuity(r=0.05,n=3,pmt=-10000,type=1)
[1] 33101.25

Example 5 PV of an ordinary annuity and annuity due

What is the PV of an annuity that pays $20,000 per year at the end of each of the next 25 years, given a 6% discount rate?

pv.annuity(r=0.06,n=25,pmt=-20000,type=0)
[1] 255667.1

Given a discount rate of 10%, what is the present value of a 10-year annuity that makes a series of $1,000 payments at the beginning of each of the next ten years, starting today?

pv.annuity(r=0.1,n=10,pmt=-1000,type=1)
[1] 6759.024

Example 6 PV of a perpetuity

A preferred stock will pay $2.50 per year in annual dividends beginning next year, and the company plans to follow this dividend policy forever. Given a 10% rate of return, what is the value of this preferred stock today?

pv.perpetuity(r=0.1,pmt=2.5,type=0)
[1] -25

Example 7 Rate of return for a perpetuity

Using the preferred stock described in the preceding example, determine the rate of return that an investor would realize if she paid $75 per share for the stock.

r.perpetuity(pmt=2.5,pv=-75)
[1] 0.03333333
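
This is the perpetuity relationship solved for the rate:

\[ r = \frac{PMT}{PV} = \frac{2.5}{75} \approx 0.0333 \]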

Example 8 PV of a bond’s cash flows

A bond will make coupon interest payments of 70 HK$ (7% of its face value) at the end of each year and will also pay its face value of 1,000 HK$ at maturity in 10 years. If the appropriate discount rate is 6%, what is the present value of the bond’s promised cash flows?

pv(r=0.06,n=10,fv=1000,pmt=70,type=0)
[1] -1073.601

Example 9 Computing the FV and PV of an uneven cash flow series

Using a rate of return of 6%, compute the future value of the 6-year uneven cash flow stream that occurs at the end of each year. (-10000 -5000 2000 4000 6000 8000)

fv.uneven(r=0.06, cf=c(-10000, -5000, 2000, 4000, 6000, 8000))
[1] -1541.791

Compute the present value of the 6-year uneven cash flow stream described above using a 10% rate of return.

pv.uneven(r=0.1, cf=c(-10000, -5000, 2000, 4000, 6000, 8000))
[1] 747.1377

Example 10 Loan payment calculation: Annual payments and Quarterly payments

A company plans to borrow $500,000 for five years. The company’s bank will lend the money at a rate of 6% and requires that the loan be paid off in five equal end-of-year payments. Calculate the amount of the payment that the company must make in order to fully amortize this loan in five years.

pmt(r=0.06,n=5,pv=500000,fv=0)
[1] -118698.2

Example 11 Computing the number of periods in an annuity

How many $1000 end-of-year payments are required to accumulate $10,000 if the discount rate is 9%?

n.period(r=0.09,pv=0,fv=10000,pmt=-1000,type=0)
[1] 7.448028

Example 12 Computing the rate of return for a period

Suppose you have the opportunity to invest $1000 at the end of each of the next five years in exchange for $6000 at the end of the fifth year. What is the annual rate of return on this investment?

discount.rate(n=5,fv=6000,pmt=-1000,pv=0,type=0)
[1] 0.09130091

Example 13 Computing NPV

Calculate the NPV of an investment project with an initial cost of $6 million and positive cash flows of $2.6 million at the end of Year 1, $2.4 million at the end of Year 2, and $3.8 million at the end of Year 3. Use 8% as the discount rate.

npv(r=0.08, cf=c(-6,2.6,2.4,3.8))
[1] 1.481583

Example 14 Computing IRR

What is the IRR for the investment described in example 13?

irr(cf=c(-6,2.6,2.4,3.8))
[1] 0.2032579

Example 15 Computing HPR

Suppose a stock is purchased for $3 and is sold for $4 six months later, during which time it paid $0.50 in dividends. What is the holding period return?

hpr(ev=4,bv=3,cfr=0.5)
[1] 0.5

Example 16 Computing time-weighted rate of return

An investor purchases a share of stock at t = 0 for $10. At the end of the year, t = 1 , the investor buys another share of the same stock for $12. At the end of Year 2, the investor sells both shares for $13 each. At the end of both years 1 and 2, the stock paid a $1 per share dividend. What is the annual time-weighted rate of return for this investment?

twrr(ev=c(12,26),bv=c(10,24),cfr=c(1,2))
[1] 0.2315302

Example 17 Computing Bank discount yield

Calculate the bank discount yield for a T-bill priced at $9,850, with a face value of $10,000 and 120 days until maturity.

bdy(d=150,f=10000,t=120)
[1] 0.045

Example 18 Convert holding period return to the effective annual rate

Compute the EAY using the 120-day HPY of 2.85%.

hpr2ear(hpr=0.0285,t=120)
[1] 0.08923453

Example 19 Computing money market yield

What is the money market yield for a 120-day T-bill that has a bank discount yield of 4.50%?

bdy2mmy(bdy=0.045,t=120)
[1] 0.04568528

Example 20 Converting among EAR, HPY, and MMY

Assume the price of a $10,000 T-bill that matures in 150 days is $9,800. The quoted money market yield is 4.898%. Compute the HPY and the EAR.

hpr(ev=10000,bv=9800)
[1] 0.02040816
mmy2hpr(mmy=0.04898,t=150)
[1] 0.02040833
hpr2ear(hpr=mmy2hpr(mmy=0.04898,t=150),t=150)
[1] 0.05038874
ear2hpr(ear=hpr2ear(hpr=mmy2hpr(mmy=0.04898,t=150),t=150),t=150)
[1] 0.02040833

Example 21 Bond-equivalent yield calculation

What is the yield on a bond-equivalent basis of a 3-month loan that has a holding period yield of 4%?

hpr2bey(hpr=0.04,t=3)
[1] 0.1632

What is the yield on a bond-equivalent basis of an investment with 6% effective annual yield?

ear2bey(ear=0.06)
[1] 0.05912603

Example 22 Weighted mean as a portfolio return

A portfolio consists of 40% common stocks, 50% bonds, and 10% cash. If the return on common stocks is 9%, the return on bonds is 6%, and the return on cash is 1%, what is the portfolio return?

wpr(r=c(0.09, 0.06, 0.01),w=c(0.4,0.5,0.1))
[1] 0.067

or

rs=c(0.09, 0.06, 0.01)
ws=c(0.4,0.5,0.1)
wpr(r=rs,w=ws)
[1] 0.067
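
The weighted mean behind this result:

\[ 0.4 \times 0.09 + 0.5 \times 0.06 + 0.1 \times 0.01 = 0.067 \]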

Example 23 Geometric mean return

For the last three years, the returns for Acme Corporation common stock have been -5%, 11%, and 9%. Compute the compound annual rate of return over the 3-year period.

geometric.mean(r=c(-0.05, 0.11, 0.09))
[1] 0.04750883

Example 24 Calculating average cost with the harmonic mean

An investor purchases $10,000 of stock each month, and over the last three months the prices paid per share were $4.5, $5.2, and $4.8. What is the average cost per share for the shares acquired?

harmonic.mean(p=c(4.5,5.2,4.8))
[1] 4.816467

Example 25 Download historical financial data from Yahoo finance and Google Finance

Download historical financial data from Yahoo finance

apple <- get.ohlc.yahoo(symbol="AAPL",start="2013-07-01",end="2013-08-01")
head(apple)
date   open   high    low  close    volume adjusted
23 2013-07-01 402.69 412.27 401.22 409.22  97763400 55.60200
22 2013-07-02 409.96 421.63 409.47 418.49 117466300 56.86155
21 2013-07-03 420.86 422.98 417.45 420.80  60232200 57.17542
20 2013-07-05 420.39 423.29 415.35 417.42  68506200 56.71616
19 2013-07-08 420.11 421.00 410.65 415.05  74534600 56.39415
18 2013-07-09 413.60 423.50 410.38 422.35  88146100 57.38602

Download historical financial data from Google Finance

apple <- get.ohlc.google(symbol="AAPL",start="2013-07-01",end="2013-08-01")
head(apple)
date  open  high   low close    volume
23 2013-07-01 57.53 58.90 57.32 58.46  97793045
22 2013-07-02 58.57 60.23 58.50 59.78 117521579
21 2013-07-03 60.12 60.43 59.64 60.11  60232158
20 2013-07-05 60.06 60.47 59.34 59.63  68520760
19 2013-07-08 60.02 60.14 58.66 59.29  74578420
18 2013-07-09 59.09 60.50 58.63 60.34  88172238

Download multiple historical financial data from Yahoo finance

applespy <- get.ohlcs.yahoo(symbols=c("AAPL","SPY"),start="2013-01-01",end="2013-07-31")
head(applespy$AAPL)
date   open   high    low  close    volume adjusted
146 2013-01-02 553.82 555.00 541.63 549.03 140129500 73.67851
145 2013-01-03 547.88 549.67 541.00 542.10  88241300 72.74852
144 2013-01-04 536.97 538.63 525.83 527.00 148583400 70.72214
143 2013-01-07 522.00 529.30 515.20 523.90 121039100 70.30613
142 2013-01-08 529.21 531.89 521.25 525.31 114676800 70.49535
141 2013-01-09 522.50 525.01 515.99 517.10 101901100 69.39359
head(applespy$SPY)
date   open   high    low  close    volume adjusted
146 2013-01-02 145.11 146.15 144.73 146.06 192059000 137.5859
145 2013-01-03 145.99 146.37 145.34 145.73 144761800 137.2750
144 2013-01-04 145.97 146.61 145.67 146.37 116817700 137.8779
143 2013-01-07 145.85 146.11 145.43 145.97 110002500 137.5011
142 2013-01-08 145.71 145.91 144.98 145.55 121265100 137.1055
141 2013-01-09 145.87 146.32 145.64 145.92  90745600 137.4540

Download multiple historical financial data from Google Finance

all <- get.ohlcs.google(symbols=c("YHOO","SPY", "SINA"),start="2013-01-01",end="2013-07-31")
head(all$YHOO)
date  open  high   low close   volume
146 2013-01-02 20.20 20.32 20.01 20.08 20463033
145 2013-01-03 20.05 20.10 19.72 19.78 19599094
144 2013-01-04 19.76 19.95 19.72 19.86 12489700
143 2013-01-07 19.56 19.58 19.28 19.40 23866609
142 2013-01-08 19.32 19.68 19.30 19.66 16932176
141 2013-01-09 19.73 19.75 19.22 19.33 21656278
head(all$SPY)
date   open   high    low  close    volume
146 2013-01-02 145.11 146.15 144.73 146.06 192058911
145 2013-01-03 145.99 146.37 145.34 145.73 144761781
144 2013-01-04 145.97 146.61 145.67 146.37 116817675
143 2013-01-07 145.85 146.44 145.43 145.97 110002444
142 2013-01-08 145.71 145.91 144.98 145.55 121265078
141 2013-01-09 145.87 146.32 145.64 145.92  90745581
head(all$SINA)
date  open  high   low close  volume
146 2013-01-02 52.24 55.19 51.75 52.27 3947513
145 2013-01-03 52.34 53.61 51.54 52.77 2200712
144 2013-01-04 52.70 52.94 51.70 52.76 1466652
143 2013-01-07 51.81 52.67 51.40 52.57 1368991
142 2013-01-08 52.20 53.00 51.41 51.81 1381026
141 2013-01-09 51.76 52.43 50.61 51.54 2098518

Example 26 Plots open-high-low-close chart of (financial) time series.

Line chart

apple <- get.ohlc.yahoo(symbol="AAPL",start="2013-07-01",end="2013-08-01")
lineChart(apple,y="adjusted")

Candlestick chart

apple <- get.ohlc.yahoo(symbol="AAPL",start="2013-07-01",end="2013-08-01")
candlestickChart(apple,y="adjusted")

Volume chart

apple <- get.ohlc.yahoo("AAPL",start="2014-04-01",end="2014-04-30")
volumeChart(apple,y="adjusted")

Multiple line chart

all <- get.ohlcs.yahoo(c("SBUX","AAPL","SPY"),start="2013-01-01",end="2013-06-30")
lineChartMult(all,y="adjusted")

Example 27 Inventory Expense Recognition – cost of goods sold

Three methods: first-in, first-out (FIFO) method; last-in, first-out (LIFO) method; weighted average cost method.

e.g., during one month:
beginning inventory: 20 units @ $2/unit
buy: 30 units @ $3/unit
buy: 50 units @ $5/unit
sold: 70 units
What is the cost of goods sold under each of the three methods?

# first-in, first-out (FIFO) method
cogs(uinv=20,pinv=2,units=c(30,50),price=c(3,5),sinv=70,method="FIFO")
$costOfGoods
[1] 230

$endingInventory
[1] 150
# last-in, first-out (LIFO) method
cogs(uinv=20,pinv=2,units=c(30,50),price=c(3,5),sinv=70,method="LIFO")
$costOfGoods
[1] 310

$endingInventory
[1] 70
# weighted average cost method
cogs(uinv=20,pinv=2,units=c(30,50),price=c(3,5),sinv=70,method="WAC")
$costOfGoods
[1] 266

$endingInventory
[1] 114
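
These figures can be checked by hand; 70 of the 100 available units (20 + 30 + 50) are sold:

\[ FIFO: 20 \times 2 + 30 \times 3 + 20 \times 5 = 230 \]

\[ LIFO: 50 \times 5 + 20 \times 3 = 310 \]

\[ WAC: \frac{20 \times 2 + 30 \times 3 + 50 \times 5}{100} \times 70 = 3.8 \times 70 = 266 \]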

Example 28 Depreciation Expense Recognition

Methods: straight-line depreciation method, double-declining balance (DDB). e.g., one company recently purchased a machine at a cost of $9,800. The machine is expected to have a residual value of $2,000 at the end of its useful life in five years. What is the depreciation expense for each of the five years under the two methods?

# straight line depreciation method
slde(cost=9800,rv=2000,t=5)
[1] 1560
# double-declining balance (DDB)
ddb(cost=9800,rv=2000,t=5)
t    ddb
[1,] 1 3920.0
[2,] 2 2352.0
[3,] 3 1411.2
[4,] 4  116.8
[5,] 5    0.0
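
By hand: the straight-line charge is (9800 − 2000)/5 = 1560 per year, and DDB applies a rate of 2/5 = 40% to the remaining book value until it reaches the $2,000 residual value:

\[ DDB_1 = 0.4 \times 9800 = 3920, \quad DDB_2 = 0.4 \times 5880 = 2352, \quad DDB_3 = 0.4 \times 3528 = 1411.2 \]

\[ DDB_4 = 2116.8 - 2000 = 116.8 \textrm{ (capped so the book value does not fall below the residual value)}, \quad DDB_5 = 0 \]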

Example 29 Weighted average shares and EPS

One company has net income of $100,000 and paid $10,000 cash dividends to its preferred shareholders and $10,750 cash dividends to its common shareholders. At the beginning of the year, there were 20,000 shares of common stock outstanding. 20,000 new shares were issued on July 1. What are the weighted average shares and the basic EPS?

# weighted average shares
was=was(ns=c(20000,20000),nm=c(12,6))
was
[1] 30000
# basic EPS
EPS(ni=100000,pd=10000,w=was)
[1] 3
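
By hand, the original shares are outstanding for 12 months and the new shares for 6 months:

\[ WAS = 20000 \times \frac{12}{12} + 20000 \times \frac{6}{12} = 30000 \]

\[ EPS = \frac{100000 - 10000}{30000} = 3 \]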

One company has 15,000 shares outstanding all year. It had 2,000 outstanding warrants all year, convertible into one share each at $20 per share. The year-end price of stock was $40, and the average stock price was $30. What effect will these warrants have on the weighted average number of shares?

iss(amp=30,ep=20,n=2000)
[1] 666.6667

During 2013, X reported net income of $231,200 and had 400,000 shares of common stock outstanding for the entire year. X had 2,000 shares of 10%, $100 par convertible preferred stock, convertible into 40 shares each, outstanding for the entire year. X also had 1200, 7%, $1,000 par value convertible bonds, convertible into 100 shares each, outstanding for the entire year. Finally, X had 20,000 stock options outstanding during the year. Each option is convertible into one share of stock at $15 per share. The average market price of the stock for the year was $20. The tax rate is 40%. What are X’s basic and diluted EPS?

EPS(ni=231200,pd=2000*0.1*100,w=400000)
[1] 0.528
iss=iss(amp=20,ep=15,n=20000)
diluted.EPS(ni=231200,pd=2000*0.1*100,cpd=2000*0.1*100,cdi=1200*0.07*1000,tax=0.4,w=400000,cps=2000*40,cds=1200*100,iss=iss)
[1] 0.4654545
http://felixfan.github.io/normality-test

Data

set.seed(999)
x1 <- rbinom(15, 5, 0.6)
x2 <- rbinom(30, 5, 0.6)
x3 <- rbinom(500, 5, 0.6)
x4 <- rlnorm(15)
x5 <- rlnorm(30)
x6 <- rlnorm(500)
x7 <- rnorm(15)
x8 <- rnorm(500)
x9 <- rnorm(5e+06)

Shapiro-Wilk Normality Test

shapiro.test() in package stats; the sample size must be between 3 and 5000.
shapiro.test(x1)

Shapiro-Wilk normality test

data:  x1
W = 0.9157, p-value = 0.1653
shapiro.test(x2)

Shapiro-Wilk normality test

data:  x2
W = 0.834, p-value = 0.0002918
shapiro.test(x3)

Shapiro-Wilk normality test

data:  x3
W = 0.919, p-value = 9.795e-16
shapiro.test(x4)

Shapiro-Wilk normality test

data:  x4
W = 0.6059, p-value = 2.884e-05
shapiro.test(x5)

Shapiro-Wilk normality test

data:  x5
W = 0.8451, p-value = 0.0004895
shapiro.test(x6)

Shapiro-Wilk normality test

data:  x6
W = 0.4848, p-value < 2.2e-16
shapiro.test(x7)

Shapiro-Wilk normality test

data:  x7
W = 0.975, p-value = 0.9235
shapiro.test(x8)

Shapiro-Wilk normality test

data:  x8
W = 0.9987, p-value = 0.9779
shapiro.test(x9)
Error: sample size must be between 3 and 5000

Anderson-Darling test for normality

ad.test() in package nortest
library(nortest)
ad.test(x1)

Anderson-Darling normality test

data:  x1
A = 0.7072, p-value = 0.05103
ad.test(x2)

Anderson-Darling normality test

data:  x2
A = 1.96, p-value = 4.006e-05
ad.test(x3)

Anderson-Darling normality test

data:  x3
A = 17.95, p-value < 2.2e-16
ad.test(x4)

Anderson-Darling normality test

data:  x4
A = 2.059, p-value = 1.542e-05
ad.test(x5)

Anderson-Darling normality test

data:  x5
A = 1.806, p-value = 9.715e-05
ad.test(x6)

Anderson-Darling normality test

data:  x6
A = Inf, p-value = NA
ad.test(x7)

Anderson-Darling normality test

data:  x7
A = 0.1743, p-value = 0.908
ad.test(x8)

Anderson-Darling normality test

data:  x8
A = 0.1436, p-value = 0.9704
ad.test(x9)

Anderson-Darling normality test

data:  x9
A = 0.2392, p-value = 0.7793

Summary

When the sample size is large, both tests reliably flag the non-normal (binomial and log-normal) samples; with small samples (e.g., the binomial sample of size 15), non-normal data may not be rejected.

http://felixfan.github.io/hist-bins

Default plot

head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
hist(iris$Sepal.Length)

plot of chunk hist-bins-1

Set the number of bins

## length.out = number of bins + 1; length.out = 11 sets the number of bins to 10
hist(iris$Sepal.Length, breaks = seq(min(iris$Sepal.Length), max(iris$Sepal.Length),
length.out = 11))

plot of chunk hist-bins-2

http://felixfan.github.io/or-vs-rr

“Odds ratios are not well understood as a measure of effect size, and conversion to relative risks by a simple calculation would improve understanding of findings”

relativeRisk = oddsRatio / (1 - baselineRisk + oddsRatio * baselineRisk)

plot of chunk orvsrr

The odds ratio is always further away from 1 than the relative risk, but they are more similar when the baseline risk is small.
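
A quick R check of the conversion formula; or2rr is just an ad hoc helper, and the odds ratio and baseline risks below are made-up values:

or2rr <- function(oddsRatio, baselineRisk) {
  oddsRatio / (1 - baselineRisk + oddsRatio * baselineRisk)
}
or2rr(oddsRatio = 2.5, baselineRisk = 0.05)  # 2.33: close to the OR when the baseline risk is small
or2rr(oddsRatio = 2.5, baselineRisk = 0.50)  # 1.43: much closer to 1 when the baseline risk is large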

Definitions of odds and risks

            present   absent   total
case        a         c        a+c
control     b         d        b+d
total       a+b       c+d      a+b+c+d

Outcome present for treatment: a
Outcome present for control: b
Outcome absent for treatment: c
Outcome absent for control: d

Total for treatment: a+c
Total for control: b+d
Total for present: a+b
Total for absent: c+d

Odds can range from 0 to infinity but are always positive.

Odds of the outcome in the treatment group: a/c
Odds of the outcome in the control group: b/d

Odds ratio

Odds ratio for the outcome comparing treatment with control: (a/c)/(b/d)=(a×d)/(b×c)

Risks can range from 0 to 1.

Risk of the outcome in the treatment group: a/(a+c)
Risk of the outcome in the control group: b/(b+d)

Relative risk

Relative risk comparing treatment with control: (a/(a+c))/(b/(b+d))=(a×(b+d))/(b×(a+c))
  • For both the odds ratio and relative risk, 1 represents no difference between the groups
  • The risk (and the odds) does not have to refer to an undesirable outcome
  • risk = odds/(1+odds) (see the worked example below)
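
A worked example in R; the counts a, b, c and d are hypothetical:

a <- 20; b <- 10   # outcome present: treatment (a), control (b)
c <- 80; d <- 90   # outcome absent:  treatment (c), control (d)

oddsTreat <- a / c                # odds of the outcome in the treatment group: 0.25
oddsCtrl  <- b / d                # odds of the outcome in the control group: 0.11
OR <- (a * d) / (b * c)           # odds ratio: 2.25
riskTreat <- a / (a + c)          # risk in the treatment group: 0.20
riskCtrl  <- b / (b + d)          # risk in the control group: 0.10
RR <- riskTreat / riskCtrl        # relative risk: 2.0
all.equal(riskTreat, oddsTreat / (1 + oddsTreat))   # risk = odds/(1 + odds)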

“Most published research providing an odds ratio as a measure of effect size should also provide sufficient information for the baseline risk, and hence the relative risk, to be calculated. If numbers in each group are given, the crude relative risk can be calculated directly.” – BMJ 2014;348:f7450 doi: 10.1136/bmj.f7450

“In the context of epidemiology, baseline risk is the incidence of the disease, or outcome of interest, in the population.”


http://felixfan.github.io/calculate-mean-for-selected-columns
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
mydata = iris
mydata$ID = rep(c(1, 2, 3, 4), length.out = nrow(mydata))
head(mydata)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
1          5.1         3.5          1.4         0.2  setosa  1
2          4.9         3.0          1.4         0.2  setosa  2
3          4.7         3.2          1.3         0.2  setosa  3
4          4.6         3.1          1.5         0.2  setosa  4
5          5.0         3.6          1.4         0.2  setosa  1
6          5.4         3.9          1.7         0.4  setosa  2

melt()

library(reshape2)
molten = melt(mydata, id = c("ID", "Species"), na.rm = TRUE)
# each row will represent one observation of one variable.
molten[c(1:5, 55, 140), ]
ID    Species     variable value
1    1     setosa Sepal.Length   5.1
2    2     setosa Sepal.Length   4.9
3    3     setosa Sepal.Length   4.7
4    4     setosa Sepal.Length   4.6
5    1     setosa Sepal.Length   5.0
55   3 versicolor Sepal.Length   6.5
140  4  virginica Sepal.Length   6.9

dcast()

One variable (column)

dcast(molten, formula = ID ~ variable)
ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1  1           38          38           38          38
2  2           38          38           38          38
3  3           37          37           37          37
4  4           37          37           37          37
dcast(molten, formula = ID ~ variable, mean)
ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1  1        5.853       3.118        3.800       1.258
2  2        5.787       3.026        3.689       1.163
3  3        5.827       3.008        3.751       1.178
4  4        5.908       3.076        3.792       1.197
dcast(molten, formula = ID ~ variable, median)
ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1  1         5.75         3.1         4.35        1.35
2  2         5.60         3.0         4.10        1.30
3  3         5.80         3.0         4.50        1.30
4  4         6.00         3.0         4.40        1.30
dcast(molten, formula = Species ~ variable)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa           50          50           50          50
2 versicolor           50          50           50          50
3  virginica           50          50           50          50
dcast(molten, formula = Species ~ variable, mean)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

two variables (columns)

dcast(molten, formula = ID + Species ~ variable)
ID    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1   1     setosa           13          13           13          13
2   1 versicolor           12          12           12          12
3   1  virginica           13          13           13          13
4   2     setosa           13          13           13          13
5   2 versicolor           12          12           12          12
6   2  virginica           13          13           13          13
7   3     setosa           12          12           12          12
8   3 versicolor           13          13           13          13
9   3  virginica           12          12           12          12
10  4     setosa           12          12           12          12
11  4 versicolor           13          13           13          13
12  4  virginica           12          12           12          12
dcast(molten, formula = ID + Species ~ variable, mean)
ID    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1   1     setosa        5.092       3.515        1.492      0.2231
2   1 versicolor        5.925       2.725        4.292      1.3500
3   1  virginica        6.546       3.085        5.654      2.2077
4   2     setosa        4.931       3.292        1.438      0.2308
5   2 versicolor        5.800       2.717        4.108      1.2833
6   2  virginica        6.631       3.046        5.554      1.9846
7   3     setosa        4.950       3.442        1.417      0.2333
8   3 versicolor        6.054       2.823        4.323      1.3538
9   3  virginica        6.458       2.775        5.467      1.9333
10  4     setosa        5.050       3.467        1.500      0.3000
11  4 versicolor        5.954       2.808        4.308      1.3154
12  4  virginica        6.717       2.975        5.525      1.9667
# the order of ID and Species was changed, then the result was also changed
dcast(molten, formula = Species + ID ~ variable)
Species ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1      setosa  1           13          13           13          13
2      setosa  2           13          13           13          13
3      setosa  3           12          12           12          12
4      setosa  4           12          12           12          12
5  versicolor  1           12          12           12          12
6  versicolor  2           12          12           12          12
7  versicolor  3           13          13           13          13
8  versicolor  4           13          13           13          13
9   virginica  1           13          13           13          13
10  virginica  2           13          13           13          13
11  virginica  3           12          12           12          12
12  virginica  4           12          12           12          12
dcast(molten, formula = Species + ID ~ variable, mean)
Species ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1      setosa  1        5.092       3.515        1.492      0.2231
2      setosa  2        4.931       3.292        1.438      0.2308
3      setosa  3        4.950       3.442        1.417      0.2333
4      setosa  4        5.050       3.467        1.500      0.3000
5  versicolor  1        5.925       2.725        4.292      1.3500
6  versicolor  2        5.800       2.717        4.108      1.2833
7  versicolor  3        6.054       2.823        4.323      1.3538
8  versicolor  4        5.954       2.808        4.308      1.3154
9   virginica  1        6.546       3.085        5.654      2.2077
10  virginica  2        6.631       3.046        5.554      1.9846
11  virginica  3        6.458       2.775        5.467      1.9333
12  virginica  4        6.717       2.975        5.525      1.9667

subset of data

library(plyr)  # needed to access . function
dcast(molten, formula = ID ~ variable, subset = .(Species == "setosa"))
ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1  1           13          13           13          13
2  2           13          13           13          13
3  3           12          12           12          12
4  4           12          12           12          12
dcast(molten, formula = ID ~ variable, mean, subset = .(Species == "setosa"))
ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1  1        5.092       3.515        1.492      0.2231
2  2        4.931       3.292        1.438      0.2308
3  3        4.950       3.442        1.417      0.2333
4  4        5.050       3.467        1.500      0.3000

include margins

dcast(molten, formula = ID ~ variable, mean, margins = TRUE)
ID Sepal.Length Sepal.Width Petal.Length Petal.Width (all)
1     1        5.853       3.118        3.800       1.258 3.507
2     2        5.787       3.026        3.689       1.163 3.416
3     3        5.827       3.008        3.751       1.178 3.441
4     4        5.908       3.076        3.792       1.197 3.493
5 (all)        5.843       3.057        3.758       1.199 3.465
dcast(molten, formula = ID ~ variable, mean, margins = "ID")
ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1     1        5.853       3.118        3.800       1.258
2     2        5.787       3.026        3.689       1.163
3     3        5.827       3.008        3.751       1.178
4     4        5.908       3.076        3.792       1.197
5 (all)        5.843       3.057        3.758       1.199


http://felixfan.github.io/gitbook

install node.js

node.js is available here

add the directory to your path

install GitBook

use the following command to install GitBook.

npm install gitbook -g

install GitBook PDF Generator

npm install gitbook-pdf -g

install GitBook R package

devtools::install_github("jbryer/gitbook")

verify that Gitbook is installed

library(gitbook)
checkForGitbook()

create a new Gitbook

newGitbook("test")

initialize the GitBook

The initGitbook function will create the files and folders for your book. This function will change the file extensions of all your files to .Rmd

initGitbook()

build the .md files

The buildRmd function will convert all .Rmd files in your project to .md using the knitr package.

buildRmd()

build the GitBooks

use format to specify the format of the built book. Options are gitbook (default website book), pdf, or ebook.

(The PDF output is not as polished and cannot include figures.)

# buildGitbook(format='pdf')
buildGitbook()

open your built book

The openGitbook function will open your built book using your system’s default web browser.

openGitbook()

publish your built Gitbook to GitHub

publishGitbook(repo = "test")


http://felixfan.github.io/excel-xml-json

What I have learned from the Coursera course Getting and Cleaning Data.

Reading Excel Files

read.xlsx {xlsx}

library(xlsx)
mydata = read.xlsx("data.xlsx", sheetIndex = 1, header = TRUE)

Reading specific rows and columns

mydata = read.xlsx("data.xlsx", sheetIndex = 1, header = TRUE, colIndex = 2:3,
rowIndex = 1:4)

Reading XML

library(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(fileUrl, useInternal = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
[1] "breakfast_menu"
names(rootNode)
food   food   food   food   food
"food" "food" "food" "food" "food"

Directly access parts of the XML document

rootNode[[1]]
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
rootNode[[1]][[1]]
<name>Belgian Waffles</name>

Programmatically extract parts of the file

xmlSApply(rootNode, xmlValue)
food
"Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650"
food
"Strawberry Belgian Waffles$7.95Light Belgian waffles covered with strawberries and whipped cream900"
food
"Berry-Berry Belgian Waffles$8.95Light Belgian waffles covered with an assortment of fresh berries and whipped cream900"
food
"French Toast$4.50Thick slices made from our homemade sourdough bread600"
food
"Homestyle Breakfast$6.95Two eggs, bacon or sausage, toast, and our ever-popular hash browns950"

Get the names and prices on the menu

xpathSApply(rootNode, "//name", xmlValue)
[1] "Belgian Waffles"             "Strawberry Belgian Waffles"
[3] "Berry-Berry Belgian Waffles" "French Toast"
[5] "Homestyle Breakfast"
xpathSApply(rootNode, "//price", xmlValue)
[1] "$5.95" "$7.95" "$8.95" "$4.50" "$6.95"

Extract content by attributes

Use htmlTreeParse when the content is known to be (potentially malformed) HTML.

fileUrl = "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
doc <- htmlTreeParse(fileUrl, useInternal = TRUE)
scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)
scores
[1] "49-27"    "14-6"     "30-9"     "23-20"    "26-23"    "19-17"
[7] "19-16"    "24-18"    "20-17 OT" "23-20 OT" "19-3"     "22-20"
[13] "29-26"    "18-16"    "41-7"     "34-17"
teams <- xpathSApply(doc, "//li[@class='team-name']", xmlValue)
teams
[1] "Denver"      "Cleveland"   "Houston"     "Buffalo"     "Miami"
[6] "Green Bay"   "Pittsburgh"  "Cleveland"   "Cincinnati"  "Chicago"
[11] "New York"    "Pittsburgh"  "Minnesota"   "Detroit"     "New England"
[16] "Cincinnati"

Reading JSON

library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/felixfan/repos")
names(jsonData)
[1] "id"                "name"              "full_name"
[4] "owner"             "private"           "html_url"
[7] "description"       "fork"              "url"
[10] "forks_url"         "keys_url"          "collaborators_url"
[13] "teams_url"         "hooks_url"         "issue_events_url"
[16] "events_url"        "assignees_url"     "branches_url"
[19] "tags_url"          "blobs_url"         "git_tags_url"
[22] "git_refs_url"      "trees_url"         "statuses_url"
[25] "languages_url"     "stargazers_url"    "contributors_url"
[28] "subscribers_url"   "subscription_url"  "commits_url"
[31] "git_commits_url"   "comments_url"      "issue_comment_url"
[34] "contents_url"      "compare_url"       "merges_url"
[37] "archive_url"       "downloads_url"     "issues_url"
[40] "pulls_url"         "milestones_url"    "notifications_url"
[43] "labels_url"        "releases_url"      "created_at"
[46] "updated_at"        "pushed_at"         "git_url"
[49] "ssh_url"           "clone_url"         "svn_url"
[52] "homepage"          "size"              "stargazers_count"
[55] "watchers_count"    "language"          "has_issues"
[58] "has_downloads"     "has_wiki"          "forks_count"
[61] "mirror_url"        "open_issues_count" "forks"
[64] "open_issues"       "watchers"          "default_branch"
names(jsonData$owner)
[1] "login"               "id"                  "avatar_url"
[4] "gravatar_id"         "url"                 "html_url"
[7] "followers_url"       "following_url"       "gists_url"
[10] "starred_url"         "subscriptions_url"   "organizations_url"
[13] "repos_url"           "events_url"          "received_events_url"
[16] "type"                "site_admin"
jsonData$owner$login
[1] "felixfan" "felixfan" "felixfan" "felixfan" "felixfan" "felixfan"
[7] "felixfan" "felixfan" "felixfan" "felixfan" "felixfan" "felixfan"
[13] "felixfan" "felixfan" "felixfan" "felixfan" "felixfan" "felixfan"
[19] "felixfan" "felixfan" "felixfan" "felixfan" "felixfan" "felixfan"
[25] "felixfan" "felixfan" "felixfan" "felixfan" "felixfan" "felixfan"
jsonData$name
[1] "biojava"                "Calculator"
[3] "ChiSquareCalculator"    "coursera-android"
[5] "CPPLearning"            "datasciencecoursera"
[7] "datasharing"            "devtools"
[9] "felixfan.github.io"     "FinCal"
[11] "FunPlots"               "GEO"
[13] "IntroRandLaTeXforIR"    "IPGWAS"
[15] "JavaSwing"              "jekyll"
[17] "labs"                   "libsvm"
[19] "markdown-here"          "OG-Platform"
[21] "oneliners"              "pages"
[23] "powerAnalysis"          "ProgrammingAssignment2"
[25] "PubMedWordcloud"        "R-Graphics"
[27] "random-forest"          "RGenetics"
[29] "slidify"                "styles"
# creation and update dates of the 'FinCal' project
jsonData[jsonData$name == "FinCal", c("created_at", "updated_at")]
created_at           updated_at
10 2013-09-24T06:25:02Z 2013-12-23T07:31:54Z
# creation and update dates of all projects
jsonData[, c("name", "created_at", "updated_at")]
name           created_at           updated_at
1                 biojava 2014-02-25T08:34:06Z 2014-02-25T08:34:07Z
2              Calculator 2014-01-20T06:14:33Z 2014-02-27T09:12:59Z
3     ChiSquareCalculator 2014-02-21T06:52:11Z 2014-02-21T07:32:04Z
4        coursera-android 2014-01-20T14:23:37Z 2014-01-20T14:23:38Z
5             CPPLearning 2013-10-23T06:11:50Z 2013-12-07T15:34:51Z
6     datasciencecoursera 2014-04-08T07:14:57Z 2014-04-08T07:18:50Z
7             datasharing 2014-04-08T07:21:12Z 2014-04-08T07:21:13Z
8                devtools 2013-12-11T05:48:57Z 2013-12-11T05:48:58Z
9      felixfan.github.io 2013-09-18T06:01:49Z 2014-04-15T05:59:25Z
10                 FinCal 2013-09-24T06:25:02Z 2013-12-23T07:31:54Z
11               FunPlots 2013-10-11T05:39:17Z 2013-12-07T15:39:42Z
12                    GEO 2013-11-12T05:52:33Z 2013-12-07T15:37:25Z
13    IntroRandLaTeXforIR 2013-11-14T02:34:24Z 2013-11-14T02:34:24Z
14                 IPGWAS 2014-01-24T03:35:16Z 2014-01-24T03:49:35Z
15              JavaSwing 2014-02-24T07:55:14Z 2014-02-24T08:15:20Z
16                 jekyll 2013-09-17T03:39:15Z 2013-09-17T03:39:16Z
17                   labs 2014-04-14T08:57:57Z 2014-04-14T08:57:58Z
18                 libsvm 2014-03-13T06:33:28Z 2014-03-13T06:33:28Z
19          markdown-here 2014-02-27T09:20:30Z 2014-02-27T09:20:31Z
20            OG-Platform 2013-12-04T03:19:15Z 2013-12-04T03:19:19Z
21              oneliners 2013-10-25T09:03:55Z 2013-10-25T09:03:55Z
22                  pages 2013-11-20T15:39:57Z 2013-11-20T15:39:57Z
23          powerAnalysis 2013-10-02T06:23:16Z 2013-10-02T06:24:13Z
24 ProgrammingAssignment2 2014-04-10T06:34:02Z 2014-04-10T06:34:03Z
25        PubMedWordcloud 2013-09-17T02:51:48Z 2014-01-15T04:00:33Z
26             R-Graphics 2014-02-06T06:15:53Z 2014-02-06T06:15:54Z
27          random-forest 2014-03-11T09:48:53Z 2014-03-11T09:48:54Z
28              RGenetics 2014-01-24T07:06:54Z 2014-01-24T07:07:48Z
29                slidify 2013-12-10T03:32:19Z 2013-12-10T03:32:19Z
30                 styles 2013-11-14T09:50:36Z 2013-11-14T09:50:37Z
http://felixfan.github.io/xyplot

scatterplot

~ x means display numeric variable x alone.

y ~ x | A means display the relationship between numeric variables y and x for each level of factor A.
y ~ x | A*B means display the relationship between numeric variables y and x separately for every combination of levels of factors A and B.
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
table(iris$Species)

setosa versicolor  virginica
50         50         50

library(lattice)
xyplot(Sepal.Length ~ Petal.Length | Species, data = iris)

plot of chunk xyplot-1

Add regression line

xyplot(Sepal.Length ~ Petal.Length | Species, data = iris, panel = function(x,
y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, ...)
})

plot of chunk xyplot-2

Add regression line using the type argument

xyplot(Sepal.Length ~ Petal.Length | Species, data = iris, type = c("p", "r"))

plot of chunk xyplot-3

The type argument can be used to enhance the figure with data-responsive elements. The default, type = "p", requests only points; type = "r" requests a simple linear regression line.
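
As a small extension (not from the original post), several type values can be combined in one call; for example, adding a background grid and a loess smoother to the points:

xyplot(Sepal.Length ~ Petal.Length | Species, data = iris,
       type = c("p", "g", "smooth"))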


http://felixfan.github.io/public-data
  • Tycho Project Tycho™ database includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.
http://felixfan.github.io/order-sort-rank


head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
attach(iris)

sort()

sort(Sepal.Length)
[1] 4.3 4.4 4.4 4.4 4.5 4.6 4.6 4.6 4.6 4.7 4.7 4.8 4.8 4.8 4.8 4.8 4.9
[18] 4.9 4.9 4.9 4.9 4.9 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.1 5.1
[35] 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.2 5.2 5.2 5.2 5.3 5.4 5.4 5.4 5.4 5.4
[52] 5.4 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.6 5.6 5.6 5.6 5.6 5.6 5.7 5.7 5.7
[69] 5.7 5.7 5.7 5.7 5.7 5.8 5.8 5.8 5.8 5.8 5.8 5.8 5.9 5.9 5.9 6.0 6.0
[86] 6.0 6.0 6.0 6.0 6.1 6.1 6.1 6.1 6.1 6.1 6.2 6.2 6.2 6.2 6.3 6.3 6.3
[103] 6.3 6.3 6.3 6.3 6.3 6.3 6.4 6.4 6.4 6.4 6.4 6.4 6.4 6.5 6.5 6.5 6.5
[120] 6.5 6.6 6.6 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.8 6.8 6.8 6.9 6.9 6.9
[137] 6.9 7.0 7.1 7.2 7.2 7.2 7.3 7.4 7.6 7.7 7.7 7.7 7.7 7.9
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

The data in the data frame are not sorted based on the value of Sepal.Length. Rather, only the variable Sepal.Length was sorted, independently of the data frame.

order()

ordered on one variable

sort1.iris <- iris[order(Sepal.Length), ]
head(sort1.iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14          4.3         3.0          1.1         0.1  setosa
9           4.4         2.9          1.4         0.2  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa
42          4.5         2.3          1.3         0.3  setosa
4           4.6         3.1          1.5         0.2  setosa

ordered on multiple variables

sort2.iris <- iris[order(Sepal.Length, Petal.Length), ]
head(sort2.iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14          4.3         3.0          1.1         0.1  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
42          4.5         2.3          1.3         0.3  setosa
23          4.6         3.6          1.0         0.2  setosa

ordered in reverse order

Put a minus sign ( - ) in front of the variable that should be sorted in reverse (descending) order; this works for numeric variables.

sort3.iris <- iris[order(Sepal.Length, -Petal.Length), ]
head(sort3.iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14          4.3         3.0          1.1         0.1  setosa
9           4.4         2.9          1.4         0.2  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa
42          4.5         2.3          1.3         0.3  setosa
4           4.6         3.1          1.5         0.2  setosa

rank

Returns the sample ranks of the values in a vector.

set.seed(9)
x = rnorm(10, 5, 1)
rank(x)
[1]  3  2  7  5  9  1 10  8  6  4
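
A quick sanity check (not in the original post) of how rank() relates to sort() and order(), assuming x has no ties:

all(x[order(x)] == sort(x))             # TRUE: order() gives the permutation that sorts x
all(rank(x)[order(x)] == seq_along(x))  # TRUE: ranks of the sorted values are 1..n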


http://felixfan.github.io/extract-r-code
library(knitr)

Extract R code only

purl("test.Rmd")


processing file: test.Rmd
output file: test.R
[1] "test.R"

Extract R code and also include documentation

purl("test.Rmd", output = "test2.R", documentation = 2)


processing file: test.Rmd
output file: test2.R
[1] "test2.R"
http://felixfan.github.io/reorder-relabel-boxplot

boxplot

head(airquality)
Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
boxplot(Ozone ~ Month, data = airquality, xlab = "Month", ylab = "Ozone")

plot of chunk reoder-relabel-1

Sort the months by their median

Ozone = airquality$Ozone
Month = airquality$Month
mymedians = tapply(Ozone, Month, median, na.rm = T)
mms = sort(mymedians)
temp = names(mms)
names(temp) = 1:length(temp)
st = sort(temp)
plotOrder = as.integer(names(st))
boxplot(Ozone ~ Month, at = plotOrder, xlab = "Month", ylab = "Ozone")

plot of chunk reoder-relabel-2

Alter the x-axis tick labels (include the median of Ozone in parentheses)

mylabels = vector()
for (i in 1:length(temp)) {
mylabels[i] = paste(temp[i], "(", mymedians[temp[i]], ")", sep = "")
}
boxplot(Ozone ~ Month, at = plotOrder, xaxt = "n")
axis(side = 1, labels = mylabels, at = 1:length(mylabels))

plot of chunk reoder-relabel-3

http://felixfan.github.io/confounding-factor

1. Definition

“In statistics, a confounding variable (also confounding factor, a confound, or confounder) is an extraneous variable in a statistical model that correlates (directly or inversely) with both the dependent variable and the independent variable. A perceived relationship between an independent variable and a dependent variable that has been misestimated due to the failure to account for a confounding factor is termed a spurious relationship, and the presence of misestimation for this reason is termed omitted-variable bias.” – wikipedia

2. Example

Suppose that there is a positive correlation between ice-cream consumption and number of drowning deaths for a given period. An evaluator might attempt to explain this correlation by inferring a causal relationship between the two variables. However, a more likely explanation is that, during the summer, warmer temperatures lead to increased ice-cream consumption as well as more people swimming and thus more drowning deaths. The relationship between ice-cream consumption and drowning is spurious.

3. Decreasing the potential for confounding to occur

3.1 Matched variables

“In case-control studies, matched variables most often are the age and sex.”

Matching assumes that the risk is evenly distributed within the matched factor, which is not always the case.

3.2 Stratification

Suppose that age is assumed to be a possible confounder between activity and infarct. Then the association between activity and infarct would be analyzed in each stratified age group. If the stratified age groups yield much different risk ratios, age must be viewed as a confounding variable.
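
A minimal sketch of the idea in R (the data frame and variable names below are hypothetical, not from the original post): compute the risk ratio of infarct for active vs. inactive subjects within each age stratum and compare the strata.

set.seed(1)
# hypothetical data: age stratum, activity (0/1), infarct (0/1)
df <- data.frame(age.group = sample(c("young", "old"), 200, replace = TRUE),
                 active = rbinom(200, 1, 0.5),
                 infarct = rbinom(200, 1, 0.2))
# risk ratio of infarct (active vs. inactive) within each age stratum
risk.ratio <- function(d) mean(d$infarct[d$active == 1]) / mean(d$infarct[d$active == 0])
sapply(split(df, df$age.group), risk.ratio)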

3.3 Randomization

The study population is divided randomly in order to mitigate the chances of self-selection by participants or bias by the study designers.
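
A one-line illustration (not in the original post) of random assignment in R:

set.seed(2)
group <- sample(rep(c("treatment", "control"), each = 50))  # random 50/50 assignment
table(group)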


http://felixfan.github.io/install-update-R-v2
OS:      WIN7
R:       3.03
RStudio: 0.98.501

updated on Tue Apr 08 14:57:09 2014; it was first posted here

1. Install R

R is available at http://www.r-project.org/.

2. Update R

if (!require(installr)) {
# load / install+load installr
install.packages("installr")
require(installr)
}

updateR()

3. Install R packages

3.1 Installing R packages from CRAN (http://cran.r-project.org/)

install.packages("FinCal", dependencies = TRUE)  # FinCal is the package name

or

RStudio -> Tools -> Install Packages -> Install from 'Repository (CRAN)'

3.2 Installing R packages from Package Archive File (offline)

install.packages("FinCal_0.5.zip")

or

RStudio -> Tools -> Install Packages -> Install from 'Package Archive File'

or

R CMD INSTALL FinCal_0.5.zip

3.3 Installing R packages from Bioconductor (http://www.bioconductor.org/)

source("http://bioconductor.org/biocLite.R")  # installs 'BiocInstaller'
biocLite()  # automatically installs 'Biobase', 'IRanges', 'AnnotationDbi', 'BiocGenerics', 'RSQLite'
all_group()  # get the list of all packages in Bioconductor
biocLite(c("GenomicFeatures", "AnnotationDbi"))  # install the GenomicFeatures & AnnotationDbi packages

3.4 Installing R packages from GitHub (https://github.com/)

install.packages("devtools")  # requires for downloading & installation of GitHub packages
require(devtools)
install_github(repo = "FinCal", username = "felixfan")  # installing FinCal package

4. Update all existing packages

4.1 Automated Re-Install of Packages (packages in the default library dir)

update.packages(ask = FALSE, repos = "http://cran.rstudio.com", checkBuilt = TRUE)

or just

update.packages(ask = FALSE)

4.2 Automated Re-Install of Packages (packages not in the default library dir)

.libPaths()  # gets the library trees within which packages are looked for
myPath = c("C:/Users/alice/Documents/R/win-library/3.0")  # change it to your own dir
package_df <- as.data.frame(installed.packages(myPath))  #Get currently installed packages
package_list <- as.character(package_df$Package)
install.packages(package_list)  #Re-install all installed packages

4.3 Automated Re-Install of Packages (packages from Bioconductor)

source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("BiocUpgrade")

5. Reference

Installing R packages from CRAN/Bioconductor/Omegahat/Github
R 3.0.0 is released! (what’s new, and how to upgrade)
Automated Re-Install of Packages for R 3.0

http://felixfan.github.io/boost-program-options

OS: Windows 7; C++ IDE: Microsoft Visual Studio Express 2013 for Windows Desktop

Install Boost

Download boost_1_55_0.zip and unpack it to a directory (e.g. D:\myprograms\boost_1_55_0). This directory is called the Boost root directory.

Open the command prompt and change your current directory to the Boost root directory.

Then, type the following command:

bootstrap.bat

and then, type the following command:

b2.exe

The first command prepares the Boost.Build system for use. The second command invokes Boost.Build to build the separately-compiled Boost libraries. The second command may take a very long time (e.g. ~1hr).

After the second command finishes, you will find a new directory stage generated under the Boost root directory. There is a subdirectory lib under the stage directory (e.g. D:\myprograms\boost_1_55_0\stage\lib).

  1. Right-click project titile in the Solution Explorer pane of Visual Studio IDE and select Properties from the resulting pop-up menu.

  2. In Configuration Properties select C/C++,then select General, then select Additional Include Directories, enter the path to the Boost root directory (e.g. D:\myprograms\boost_1_55_0)

  3. In Configuration Properties select Linker, then select Additional Library Directories, enter the path to the Boost binaries (e.g. D:\myprograms\boost_1_55_0\stage\lib).

  4. From the Build menu, select Build Solution.

Example code

#include "stdafx.h"
#include <iostream>
#include <boost/program_options.hpp>

using namespace std;


int main(int argc, char *argv[])
{
namespace po = boost::program_options;

po::options_description description("Usage:");

description.add_options()
("help,h", "Display this help message")
("version,v", "Display the version number")
("compression,c", po::value<int>(), "Compression level")
("score,s", po::value<int>()->default_value(60), "Final score");

po::variables_map vm;
po::store(po::command_line_parser(argc, argv).options(description).run(), vm);
po::notify(vm);

if (vm.count("help")){
cout << description;
}

if (vm.count("compression")){
cout << "Compression level " << vm["compression"].as<int>() << endl;
}

return 0;
}


http://felixfan.github.io/Generalized-Lambda-Distribution

Download data

library(FinCal)

# history data from start (1993) to today
SPY <- get.ohlc.yahoo("SPY")
head(SPY)
date  open  high   low close  volume adjusted
5308 1993-01-29 43.97 43.97 43.75 43.94 1003200    29.78
5307 1993-02-01 43.97 44.25 43.97 44.25  480500    29.99
5306 1993-02-02 44.22 44.38 44.13 44.34  201300    30.05
5305 1993-02-03 44.41 44.84 44.38 44.81  529400    30.37
5304 1993-02-04 44.97 45.09 44.47 45.00  531500    30.50
5303 1993-02-05 44.97 45.06 44.72 44.97  492100    30.48

# adjusted closing prices
SPY.Close <- SPY$adjusted
SPY.vector <- as.vector(SPY.Close)

# Calculate log returns
sp500 <- diff(log(SPY.vector), lag = 1)

# Remove the NA in the first position
sp500 <- sp500[-1]
head(sp500)
[1]  0.001999  0.010593  0.004271 -0.000656  0.000000 -0.006914

Compute the first four moments of the data

library(GLDEX)

# output is the set of estimated lambda parameters λ1 through λ4 for both
# the RS version and the FMKL version of the GLD.
spLambdaDist = fun.data.fit.mm(sp500)
spLambdaDist
RPRS     RMFMKL
[1,]  4.711e-04  4.038e-04
[2,] -4.381e+01  2.081e+02
[3,] -1.700e-01 -1.723e-01
[4,] -1.662e-01 -1.635e-01

# histogram plot of the data along with the density curves for the RS and
# FMKL fits
fun.plot.fit(fit.obj = spLambdaDist, data = sp500, nclass = 100, param.vec = c("rs",
"fmkl"), xlab = "Returns")

plot of chunk lambda

Simulate returns using parameters generated by last step

# RS parameters
lambda.params.rs <- spLambdaDist[, 1]
lambda1.rs <- lambda.params.rs[1]
lambda2.rs <- lambda.params.rs[2]
lambda3.rs <- lambda.params.rs[3]
lambda4.rs <- lambda.params.rs[4]

# RS simulation
set.seed(999)
rs.sample <- rgl(n = 1e+07, lambda1 = lambda1.rs, lambda2 = lambda2.rs, lambda3 = lambda3.rs,
lambda4 = lambda4.rs, param = "rs")

# FKML parameters
lambda.params.fmkl <- spLambdaDist[, 2]
lambda1.fmkl <- lambda.params.fmkl[1]
lambda2.fmkl <- lambda.params.fmkl[2]
lambda3.fmkl <- lambda.params.fmkl[3]
lambda4.fmkl <- lambda.params.fmkl[4]

# FKML simulation
set.seed(999)
fmkl.sample <- rgl(n = 1e+07, lambda1 = lambda1.fmkl, lambda2 = lambda2.fmkl,
lambda3 = lambda3.fmkl, lambda4 = lambda4.fmkl, param = "fmkl")

Comparing results of the RS and FMKL simulations with the S&P 500 market data

# Set normalise='Y' so that kurtosis is calculated with reference to
# kurtosis = 0 under Normal distribution

# Moments of simulated returns using RS method:
fun.moments.r(rs.sample, normalise = "Y")
mean   variance   skewness   kurtosis
3.418e-04  9.457e-05 -1.027e-01  9.947e+00

# Moments of simulated returns using FMKL method:
fun.moments.r(fmkl.sample, normalise = "Y")
mean   variance   skewness   kurtosis
0.0003415  0.0001481 -0.1029554  9.9199126

# Moments calculated from market data:
fun.moments.r(sp500, normalise = "Y")
mean   variance   skewness   kurtosis
0.0003428  0.0001477 -0.1056554  9.9116401

FMKL gives better results, especially for skewness and variance. Compared with the two-parameter normal distribution, the four-parameter generalized lambda distribution is able to preserve the skewness and kurtosis of the observed data.
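
To make the comparison with the normal distribution concrete (this check is not in the original post), simulate from a normal fit with the same mean and standard deviation as the market data and compute the same moments:

set.seed(999)
norm.sample <- rnorm(1e+07, mean = mean(sp500), sd = sd(sp500))
fun.moments.r(norm.sample, normalise = "Y")
# mean and variance match, but skewness and (excess) kurtosis are close to 0,
# so the normal fit cannot reproduce the negative skew and heavy tails above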

For more details, read the original post here

http://felixfan.github.io/ggplot2-cheatsheet

R for Public Health

library(ggplot2)
library(gridExtra)
mtc <- mtcars
head(mtc)
mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Scatterplots

Basic scatterplot
p1 <- ggplot(mtc, aes(x = hp, y = mpg))
# Print plot with default points
p1 + geom_point()

plot of chunk ggplot2-Cheatsheet-1

Change color of points
p2 <- p1 + geom_point(color = "red")  #set one color for all points
p3 <- p1 + geom_point(aes(color = wt))  #set color scale by a continuous variable
p4 <- p1 + geom_point(aes(color = factor(am)))  #set color scale by a factor variable
grid.arrange(p2, p3, p4, nrow = 1)

plot of chunk ggplot2-Cheatsheet-2


# Change default colors in color scale
p1 + geom_point(aes(color = factor(am))) + scale_color_manual(values = c("orange",
"purple"))

plot of chunk ggplot2-Cheatsheet-2

Change shape or size of points
p2 <- p1 + geom_point(size = 5)  #increase all points to size 5
p3 <- p1 + geom_point(aes(size = wt))  #set point size by continuous variable
p4 <- p1 + geom_point(aes(shape = factor(am)))  #set point shape by factor variable
grid.arrange(p2, p3, p4, nrow = 1)

plot of chunk ggplot2-Cheatsheet-3


# change the default shapes
p1 + geom_point(aes(shape = factor(am))) + scale_shape_manual(values = c(0,
2))

plot of chunk ggplot2-Cheatsheet-3

Add lines to scatterplot
p2 <- p1 + geom_point(color = "blue") + geom_line()  #connect points with line
p3 <- p1 + geom_point(color = "red") + geom_smooth(method = "lm", se = TRUE)  #add regression line
p4 <- p1 + geom_point() + geom_vline(xintercept = 100, color = "red")  #add vertical line
grid.arrange(p2, p3, p4, nrow = 1)

plot of chunk ggplot2-Cheatsheet-4


# take out the points, and just create a line plot, and change size and
# color as before
ggplot(mtc, aes(x = wt, y = qsec)) + geom_line(size = 2, aes(color = factor(vs)))

plot of chunk ggplot2-Cheatsheet-4

Change axis labels
p2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point()

# label all axes at once
p3 <- p2 + labs(x = "Horsepower", y = "Miles per Gallon")

# label and change font size
p4 <- p2 + theme(axis.title.x = element_text(face = "bold", size = 20)) + labs(x = "Horsepower")

# adjust axis limits and breaks
p5 <- p2 + scale_x_continuous("Horsepower", limits = c(0, 400), breaks = seq(0,
400, 50))

grid.arrange(p3, p4, p5, nrow = 1)

plot of chunk ggplot2-Cheatsheet-5

Change legend options
g1<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(aes(color=factor(vs)))
#move legend inside
g2 <- g1 + theme(legend.position=c(1,1),legend.justification=c(1,1))
#move legend bottom
g3 <- g1 + theme(legend.position = "bottom")
#change labels
g4 <- g1 + scale_color_discrete(name ="Engine", labels=c("V-engine", "Straight engine"))
grid.arrange(g2, g3, g4, nrow=1)

plot of chunk ggplot2-Cheatsheet-6


g5<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size=2, aes(color = wt))
g5 + scale_color_continuous(name="Weight", #name of legend
breaks = with(mtc, c(min(wt), mean(wt), max(wt))), #choose breaks of variable
labels = c("Light", "Medium", "Heavy"), #label
low = "pink",  #color of lowest value
high = "red"  #color of highest value
)

plot of chunk ggplot2-Cheatsheet-6

Change background color and style
g2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point()

# Completely clear all lines except axis lines and make background white
t1 <- theme(plot.background = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank(),
axis.line = element_line(size = 0.4))

# Use theme to change axis label style
t2 <- theme(axis.title.x = element_text(face = "bold", color = "black", size = 10),
axis.title.y = element_text(face = "bold", color = "black", size = 10),
plot.title = element_text(face = "bold", color = "black", size = 12))

g3 <- g2 + t1
g4 <- g2 + theme_bw()
g5 <- g2 + theme_bw() + t2 + labs(x = "Horsepower", y = "Miles per Gallon",
title = "MPG vs Horsepower")
grid.arrange(g2, g3, g4, g5, nrow = 1)

plot of chunk ggplot2-Cheatsheet-7

a nice graph using a combination of options
g2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size = 2, aes(color = factor(vs),
shape = factor(vs))) + geom_smooth(aes(color = factor(vs)), method = "lm",
se = TRUE) + scale_color_manual(name = "Engine", labels = c("V-engine",
"Straight engine"), values = c("red", "blue")) + scale_shape_manual(name = "Engine",
labels = c("V-engine", "Straight engine"), values = c(0, 2)) + theme_bw() +
theme(axis.title.x = element_text(face = "bold", color = "black", size = 12),
axis.title.y = element_text(face = "bold", color = "black", size = 12),
plot.title = element_text(face = "bold", color = "black", size = 12),
legend.position = c(1, 1), legend.justification = c(1, 1)) + labs(x = "Horsepower",
y = "Miles per Gallon", title = "Linear Regression (95% CI) of MPG vs Horsepower by Engine type")

g2

plot of chunk ggplot2-Cheatsheet-8

Barplots

Basic barplot
ggplot(mtc, aes(x = factor(gear))) + geom_bar(stat = "bin")

plot of chunk ggplot2-Cheatsheet-9

Horizontal bars, colors, width of bars
# 1. horizontal bars
p1 <- ggplot(mtc, aes(x = factor(gear), y = wt)) + stat_summary(fun.y = mean,
geom = "bar") + coord_flip()

# 2. change colors of bars
p2 <- ggplot(mtc, aes(x = factor(gear), y = wt, fill = factor(gear))) + stat_summary(fun.y = mean,
geom = "bar") + scale_fill_manual(values = c("purple", "blue", "darkgreen"))

# 3. change width of bars
p3 <- ggplot(mtc, aes(x = factor(gear), y = wt)) + stat_summary(fun.y = mean,
geom = "bar", aes(width = 0.5))

grid.arrange(p1, p2, p3, nrow = 1)

plot of chunk ggplot2-Cheatsheet-10

Split and color by another variable
# 1. next to each other
p1 <- ggplot(mtc, aes(x = factor(gear), y = wt, fill = factor(vs)), color = factor(vs)) +
stat_summary(fun.y = mean, position = position_dodge(), geom = "bar")

# 2. stacked
p2 <- ggplot(mtc, aes(x = factor(gear), y = wt, fill = factor(vs)), color = factor(vs)) +
stat_summary(fun.y = mean, position = "stack", geom = "bar")

# 3. with facets
p3 <- ggplot(mtc, aes(x = factor(gear), y = wt, fill = factor(vs)), color = factor(vs)) +
stat_summary(fun.y = mean, geom = "bar") + facet_wrap(~vs)

grid.arrange(p1, p2, p3, nrow = 1)

plot of chunk ggplot2-Cheatsheet-11

Add text to the bars, label axes, and label legend
ag.mtc <- aggregate(mtc$wt, by = list(mtc$gear, mtc$vs), FUN = mean)
colnames(ag.mtc) <- c("gear", "vs", "meanwt")
ag.mtc
gear vs meanwt
1    3  0  4.104
2    4  0  2.748
3    5  0  2.913
4    3  1  3.047
5    4  1  2.591
6    5  1  1.513

# 1. basic
g1 <- ggplot(ag.mtc, aes(x = factor(gear), y = meanwt, fill = factor(vs), color = factor(vs))) +
geom_bar(stat = "identity", position = position_dodge()) + geom_text(aes(y = meanwt,
ymax = meanwt, label = meanwt), position = position_dodge(width = 0.9),
vjust = -0.5)

# 2. fixing the yaxis problem, changing the color of text, legend labels,
# and rounding to 2 decimals
g2 <- ggplot(ag.mtc, aes(x = factor(gear), y = meanwt, fill = factor(vs))) +
geom_bar(stat = "identity", position = position_dodge()) + geom_text(aes(y = meanwt,
ymax = meanwt, label = round(meanwt, 2)), position = position_dodge(width = 0.9),
vjust = -0.5, color = "black") + scale_y_continuous("Mean Weight", limits = c(0,
4.5), breaks = seq(0, 4.5, 0.5)) + scale_x_discrete("Number of Gears") +
scale_fill_discrete(name = "Engine", labels = c("V-engine", "Straight engine"))

grid.arrange(g1, g2, nrow = 1)

plot of chunk ggplot2-Cheatsheet-12

Add error bars
summary.mtc2 <- data.frame(gear = levels(as.factor(mtc$gear)), meanwt = tapply(mtc$wt,
mtc$gear, mean), sd = tapply(mtc$wt, mtc$gear, sd))
summary.mtc2
gear meanwt     sd
3    3  3.893 0.8330
4    4  2.617 0.6327
5    5  2.633 0.8189
ggplot(summary.mtc2, aes(x = factor(gear), y = meanwt)) + geom_bar(stat = "identity",
position = "dodge", fill = "lightblue") + geom_errorbar(aes(ymin = meanwt -
sd, ymax = meanwt + sd), width = 0.3, color = "darkblue")

plot of chunk ggplot2-Cheatsheet-13

Add best fit line
# summarize data
summary.mtc3 <- data.frame(hp = levels(as.factor(mtc$hp)), meanmpg = tapply(mtc$mpg,
mtc$hp, mean))

# run a model
l <- summary(lm(meanmpg ~ as.numeric(hp), data = summary.mtc3))

# manually entering the intercept and slope
f1 <- ggplot(summary.mtc3, aes(x = factor(hp), y = meanmpg)) + geom_bar(stat = "identity",
fill = "darkblue") + geom_abline(aes(intercept = l$coef[1, 1], slope = l$coef[2,
1]), color = "red", size = 1.5)

# using stat_smooth to fit the line for you
f2 <- ggplot(summary.mtc3, aes(x = factor(hp), y = meanmpg)) + geom_bar(stat = "identity",
fill = "darkblue") + stat_smooth(aes(group = 1), method = "lm", se = FALSE,
color = "orange", size = 1.5)

grid.arrange(f1, f2, nrow = 1)

plot of chunk ggplot2-Cheatsheet-14

Histograms

set.seed(999)
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)
# counts on y-axis
g1 <- ggplot(xy, aes(xvar)) + geom_histogram()  #horribly ugly default
g2 <- ggplot(xy, aes(xvar)) + geom_histogram(binwidth = 1)  #change binwidth
g3 <- ggplot(xy, aes(xvar)) + geom_histogram(fill = NA, color = "black") + theme_bw()  #nicer looking

# density on y-axis
g4 <- ggplot(xy, aes(x = xvar)) + geom_histogram(aes(y = ..density..), color = "black",
fill = NA) + theme_bw()

grid.arrange(g1, g2, g3, g4, nrow = 1)

plot of chunk ggplot2-Cheatsheet-16

Density plots

# basic density
p1 <- ggplot(xy, aes(xvar)) + geom_density()

# histogram with density line overlaid
p2 <- ggplot(xy, aes(x = xvar)) + geom_histogram(aes(y = ..density..), color = "black",
fill = NA) + geom_density(color = "blue")

# split and color by third variable, alpha fades the color a bit
p3 <- ggplot(xy, aes(xvar, fill = zvar)) + geom_density(alpha = 0.2)

grid.arrange(p1, p2, p3, nrow = 1)

plot of chunk ggplot2-Cheatsheet-17

Boxplots

# boxplot
b1 <- ggplot(xy, aes(zvar, xvar)) + geom_boxplot(aes(fill = zvar)) + theme(legend.position = "none")

# jitter plot
b2 <- ggplot(xy, aes(zvar, xvar)) + geom_jitter(alpha = I(1/4), aes(color = zvar)) +
theme(legend.position = "none")

# violin plot
b3 <- ggplot(xy, aes(x = xvar)) + stat_density(aes(ymax = ..density.., ymin = -..density..,
fill = zvar, color = zvar), geom = "ribbon", position = "identity") + facet_grid(. ~
zvar) + coord_flip() + theme(legend.position = "none")

grid.arrange(b1, b2, b3, nrow = 1)

plot of chunk ggplot2-Cheatsheet-18

Putting multiple plots together

# rug plot
ggplot(xy, aes(xvar, yvar)) + geom_point() + geom_rug(col = "darkred", alpha = 0.1)

plot of chunk ggplot2-Cheatsheet-19

# placeholder plot - prints nothing at all
empty <- ggplot() + geom_point(aes(1, 1), colour = "white") + theme(plot.background = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.border = element_blank(), panel.background = element_blank(), axis.title.x = element_blank(),
axis.title.y = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(),
axis.ticks = element_blank())

# scatterplot of x and y variables
scatter <- ggplot(xy, aes(xvar, yvar)) + geom_point(aes(color = zvar)) + scale_color_manual(values = c("orange",
"purple")) + theme(legend.position = c(1, 1), legend.justification = c(1,
1))

# marginal density of x - plot on top
plot_top <- ggplot(xy, aes(xvar, fill = zvar)) + geom_density(alpha = 0.5) +
scale_fill_manual(values = c("orange", "purple")) + theme(legend.position = "none")

# marginal density of y - plot on the right
plot_right <- ggplot(xy, aes(yvar, fill = zvar)) + geom_density(alpha = 0.5) +
coord_flip() + scale_fill_manual(values = c("orange", "purple")) + theme(legend.position = "none")

# arrange the plots together, with appropriate height and width for each row
# and column
grid.arrange(plot_top, empty, scatter, plot_right, ncol = 2, nrow = 2, widths = c(4,
1), heights = c(1, 4))

plot of chunk ggplot2-Cheatsheet-20

Original post is available here

http://felixfan.github.io/Efficiency-of-Importing-Large-CSV-Files-in-R

{data.table} outperforms other methods

library(data.table)
fread("data.csv", header = T, sep = ",")
Data used

size of csv file: 689.4MB (7,009,728 rows * 29 columns)

Methods tested
  • read.csv
  • fread {data.table}
  • read.big.matrix {bigmemory}
  • read.csv.ffdf {ff}
  • read.csv.sql {sqldf}

Details are available at the original post
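
A minimal timing sketch (not from the original post; it assumes the same data.csv file) comparing base read.csv with fread:

library(data.table)
t.base <- system.time(d1 <- read.csv("data.csv", header = TRUE))        # base reader
t.dt <- system.time(d2 <- fread("data.csv", header = TRUE, sep = ","))  # data.table reader
rbind(read.csv = t.base, fread = t.dt)[, "elapsed"]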

http://felixfan.github.io/maths-tricks

Guess number

Ask a friend to write down a number with more than 3 digits.
Example: 12345

Ask them to add the digits.
Example: 1+2+3+4+5=15

Ask them to subtract this number from the original one.
Example: 12345 - 15 = 12330

Ask them to select any digit from this new number and strike it out,
without showing you. Example: 1330

Ask them to add the remaining digits and write down the answer they get.
Example: 1+3+3+0 =7

Ask them to tell you the number they get (7)
and you will tell them which number they struck out.

SOLUTION:
The way you do this is to subtract the number they give you from the next multiple of 9. The answer you get is the number they struck out. If the final number they give you is a multiple of 9, there are two possible answers: 0 or 9.
Example: The next multiple of 9 here is 9, and 9 - 7 = 2, so there you have your answer.
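
A short R check of why this works (not in the original post): any positive integer minus its digit sum is a multiple of 9, so the struck-out digit is whatever brings the reported digit sum up to the next multiple of 9.

digit.sum <- function(n) sum(as.integer(strsplit(as.character(n), "")[[1]]))
n <- 12345
m <- n - digit.sum(n)   # 12330
m %% 9                  # 0: m is a multiple of 9
digit.sum(1330) %% 9    # 7, so the struck-out digit is 9 - 7 = 2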

Amazing 1089

Take two pieces of paper and hand one to a friend. On yours, without letting them see, write the number 1089, then fold the paper to keep it hidden.

Ask them to think of a 3-digit number that is not the same backwards (not a palindrome). Then ask them to arrange its digits in order from greatest to smallest. Don't let them show what they've written.
Example: 931

Below their number, ask them to write the same digits, but in reverse order, from smallest to greatest.
Example: 139

Now, ask them to subtract the new lower number from the original one they wrote.
Example: 931-139=792

Next, ask them to reverse the order of that number.
Example: 297

Then, get them to add this latest number and the previous number together and show you the result.
Example: 297+ 792 = 1089

Finally, you can reveal your own number, which will be exactly what they have written.
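
A small verification sketch in R (not in the original post); the digits are handled as a vector so that a leading zero in the difference is kept:

check1089 <- function(digits) {
  big <- as.integer(paste(sort(digits, decreasing = TRUE), collapse = ""))
  small <- as.integer(paste(sort(digits, decreasing = FALSE), collapse = ""))
  d <- big - small
  d.rev <- as.integer(paste(rev(strsplit(sprintf("%03d", d), "")[[1]]), collapse = ""))
  d + d.rev
}
check1089(c(9, 3, 1))  # 1089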

Have fun!

http://felixfan.github.io/linear-regression-anova

When I tried to find out the difference between ANOVA and linear Regression, I got this interesting post: Why ANOVA and Linear Regression are the Same Analysis.

The author used data employment.sav to show why.

I tried both ANOVA and linear regression on the same data with the R lm and aov/anova functions. Not surprisingly, I got the same results.

Actually, the description of aov is “Fit an analysis of variance model by a call to lm for each stratum.”

employee.csv is the same data as employment.sav with csv format.

require(RCurl)
Loading required package: RCurl
Loading required package: bitops
myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/stat/employee.csv",
ssl.verifypeer = FALSE)
mydata <- read.csv(textConnection(myCsv))
head(mydata)
id gender     bdate educ jobcat salary salbegin jobtime prevexp minority
1  1      m  2/3/1952   15      3  57000    27000      98     144        0
2  2      m 5/23/1958   16      1  40200    18750      98      36        0
3  3      f 7/26/1929   12      1  21450    12000      98     381        0
4  4      f 4/15/1947    8      1  21900    13200      98     190        0
5  5      m  2/9/1955   15      1  45000    21000      98     138        0
6  6      m 8/22/1958   15      1  32100    13500      98      67        0
mydata$jobcat = gsub(1, "Clerical", mydata$jobcat)
mydata$jobcat = gsub(2, "Custodial", mydata$jobcat)
mydata$jobcat = gsub(3, "Manager", mydata$jobcat)
library(plyr)
ddply(mydata, .(jobcat), summarize, mean_prevexp = mean(prevexp))
jobcat mean_prevexp
1  Clerical        85.04
2 Custodial       298.11
3   Manager        77.62
lm0 <- lm(prevexp ~ jobcat, data = mydata)
lm0

Call:
lm(formula = prevexp ~ jobcat, data = mydata)

Coefficients:
(Intercept)  jobcatCustodial    jobcatManager
85.04           213.07            -7.42

aov0 <- aov(prevexp ~ jobcat, data = mydata)
aov0
Call:
aov(formula = prevexp ~ jobcat, data = mydata)

Terms:
jobcat Residuals
Sum of Squares  1174907   3998900
Deg. of Freedom       2       471

Residual standard error: 92.14
Estimated effects may be unbalanced
summary(lm0)

Call:
lm(formula = prevexp ~ jobcat, data = mydata)

Residuals:
Min     1Q Median     3Q    Max
-154.1  -65.0  -32.0   36.7  391.0

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        85.04       4.84   17.58   <2e-16 ***
jobcatCustodial   213.07      18.38   11.59   <2e-16 ***
jobcatManager      -7.42      11.16   -0.67     0.51
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92.1 on 471 degrees of freedom
Multiple R-squared:  0.227,	Adjusted R-squared:  0.224
F-statistic: 69.2 on 2 and 471 DF,  p-value: <2e-16
anova(lm0)
Analysis of Variance Table

Response: prevexp
Df  Sum Sq Mean Sq F value Pr(>F)
jobcat      2 1174907  587453    69.2 <2e-16 ***
Residuals 471 3998900    8490
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov0)
Df  Sum Sq Mean Sq F value Pr(>F)
jobcat        2 1174907  587453    69.2 <2e-16 ***
Residuals   471 3998900    8490
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

what is the difference between anova and aov in R?

  • input format: anova() takes a fitted model object (e.g., the result of lm()) as input, whereas aov() takes a formula (plus data) as input.
  • output: the output is formatted slightly differently, as the examples above demonstrate and the short side-by-side below shows.
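
A compact side-by-side of the two interfaces (reusing mydata from above):

anova(lm(prevexp ~ jobcat, data = mydata))     # anova() takes a fitted lm object
summary(aov(prevexp ~ jobcat, data = mydata))  # aov() takes the formula (and data) directly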
http://felixfan.github.io/venn-venndiagram
library(VennDiagram)
## Loading required package: grid

set.seed(99999)

## construct some fake gene names..
oneName <- function() paste(sample(LETTERS, 5, replace = TRUE), collapse = "")
geneNames <- replicate(1000, oneName())

##
GroupA <- sample(geneNames, 400, replace = FALSE)
GroupB <- sample(geneNames, 750, replace = FALSE)
GroupC <- sample(geneNames, 250, replace = FALSE)
GroupD <- sample(geneNames, 300, replace = FALSE)
input <- list(GA = GroupA, GB = GroupB, GC = GroupC, GD = GroupD)
filename = "Venn_4set_pretty.png"
venn.diagram(input, filename = filename, fill = c("cornflowerblue", "green",
"yellow", "red"), height = 900, width = 900, resolution = 500, units = "px",
cex = 0.8)

The output is a TIFF format figure (venn.diagram writes a TIFF file regardless of the extension given in filename).

http://felixfan.github.io/data-table-2
library(data.table)

Fast and friendly file finagler

fread("file.csv")  # This function is still under development.

Enhanced data.frame

  • DT[i, j, by]: Take DT, subset rows using i, then calculate j grouped by by.
library(car)
head(Prestige)
Prestige$jobs = row.names(Prestige)
DT = data.table(Prestige)
head(DT)
DT[, max(income), by = type]
  • Update by reference using :=
DT[, `:=`(max_income_by_type, max(income)), by = type]  # Add new column
head(DT)

DT[, `:=`(max_income_by_type, NULL)]  # Delete a column
head(DT)
DT[, `:=`(edu, round(education))]  # Add new column
head(DT)
DT[, max(income), by = "type,edu"]
DT[, max(income), by = edu%%3]
  • How to sort a datatable by column(s) in R
head(DT[order(income)])
head(DT[order(-edu, income)])
head(DT[order(edu, -income)])
:= and `:=`()
DT[edu < 9, `:=`(edu_level, "elementary")]
tail(DT)

# add several columns
DT[, `:=`(mean_income = mean(income), sd_income = sd(income)), by = type]
head(DT)
http://felixfan.github.io/Statistics-One

Data

library(RCurl)
myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/stat/stat_one.txt",
ssl.verifypeer = FALSE)
mydata <- read.csv(textConnection(myCsv), sep = "\t")
mydata$subject <- factor(mydata$subject)

Create two subsets, control and concussed

concussed <- subset(mydata, condition == "concussed")
control <- subset(mydata, condition == "control")

Summary statistics

library(psych)
describe(mydata)
var  n  mean    sd median trimmed   mad   min
subject*                      1 40 20.50 11.69  20.50   20.50 14.83  1.00
condition*                    2 40  1.50  0.51   1.50    1.50  0.74  1.00
verbal_memory_baseline        3 40 89.75  6.44  91.00   90.44  6.67 75.00
visual_memory_baseline        4 40 74.88  8.60  75.00   74.97  9.64 59.00
visual.motor_speed_baseline   5 40 34.03  3.90  33.50   34.02  3.62 26.29
reaction_time_baseline        6 40  0.67  0.15   0.65    0.66  0.13  0.42
impulse_control_baseline      7 40  8.28  2.05   8.50    8.38  2.22  2.00
total_symptom_baseline        8 40  0.05  0.22   0.00    0.00  0.00  0.00
verbal_memory_retest          9 40 82.00 11.02  85.00   82.97  9.64 59.00
visual_memory_retest         10 40 71.90  8.42  72.00   72.19 10.38 54.00
visual.motor_speed_retest    11 40 35.83  8.66  35.15   34.98  6.89 20.15
reaction_time_retest         12 40  0.67  0.22   0.65    0.65  0.13  0.19
impulse_control_retest       13 40  6.75  2.98   7.00    6.81  2.97  1.00
total_symptom_retest         14 40 13.88 15.32   7.00   12.38 10.38  0.00
max range  skew kurtosis   se
subject*                    40.00 39.00  0.00    -1.29 1.85
condition*                   2.00  1.00  0.00    -2.05 0.08
verbal_memory_baseline      98.00 23.00 -0.70    -0.51 1.02
visual_memory_baseline      91.00 32.00 -0.11    -0.96 1.36
visual.motor_speed_baseline 41.87 15.58  0.08    -0.75 0.62
reaction_time_baseline       1.20  0.78  1.14     2.21 0.02
impulse_control_baseline    12.00 10.00 -0.57     0.36 0.32
total_symptom_baseline       1.00  1.00  3.98    14.16 0.03
verbal_memory_retest        97.00 38.00 -0.65    -0.81 1.74
visual_memory_retest        86.00 32.00 -0.28    -0.87 1.33
visual.motor_speed_retest   60.77 40.62  0.86     0.65 1.37
reaction_time_retest         1.30  1.11  0.93     1.29 0.03
impulse_control_retest      12.00 11.00 -0.16    -1.06 0.47
total_symptom_retest        43.00 43.00  0.44    -1.47 2.42
describeBy(mydata, mydata$condition)
group: concussed
var  n  mean    sd median trimmed   mad   min
subject*                      1 20 30.50  5.92  30.50   30.50  7.41 21.00
condition*                    2 20  1.00  0.00   1.00    1.00  0.00  1.00
verbal_memory_baseline        3 20 89.65  7.17  92.50   90.56  5.93 75.00
visual_memory_baseline        4 20 74.75  8.03  74.00   74.25  8.15 63.00
visual.motor_speed_baseline   5 20 33.20  3.62  33.09   33.27  3.32 26.29
reaction_time_baseline        6 20  0.66  0.17   0.63    0.64  0.13  0.42
impulse_control_baseline      7 20  8.55  1.64   9.00    8.62  1.48  5.00
total_symptom_baseline        8 20  0.05  0.22   0.00    0.00  0.00  0.00
verbal_memory_retest          9 20 74.05  9.86  74.00   73.88 11.86 59.00
visual_memory_retest         10 20 69.20  8.38  69.50   69.62 10.38 54.00
visual.motor_speed_retest    11 20 38.27 10.01  35.15   37.32  7.73 25.70
reaction_time_retest         12 20  0.78  0.23   0.70    0.74  0.11  0.51
impulse_control_retest       13 20  5.00  2.53   5.00    4.88  2.97  1.00
total_symptom_retest         14 20 27.65  9.07  27.00   27.75 11.12 13.00
max range  skew kurtosis   se
subject*                    40.00 19.00  0.00    -1.38 1.32
condition*                   1.00  0.00   NaN      NaN 0.00
verbal_memory_baseline      97.00 22.00 -0.79    -0.70 1.60
visual_memory_baseline      91.00 28.00  0.45    -0.72 1.80
visual.motor_speed_baseline 39.18 12.89 -0.13    -0.78 0.81
reaction_time_baseline       1.20  0.78  1.38     2.41 0.04
impulse_control_baseline    11.00  6.00 -0.39    -0.81 0.37
total_symptom_baseline       1.00  1.00  3.82    13.29 0.05
verbal_memory_retest        91.00 32.00  0.07    -1.24 2.21
visual_memory_retest        80.00 26.00 -0.27    -1.26 1.87
visual.motor_speed_retest   60.77 35.07  0.77    -0.57 2.24
reaction_time_retest         1.30  0.79  1.09    -0.10 0.05
impulse_control_retest      11.00 10.00  0.39    -0.28 0.57
total_symptom_retest        43.00 30.00 -0.11    -1.25 2.03
--------------------------------------------------------
group: control
var  n  mean   sd median trimmed  mad   min
subject*                      1 20 10.50 5.92  10.50   10.50 7.41  1.00
condition*                    2 20  2.00 0.00   2.00    2.00 0.00  2.00
verbal_memory_baseline        3 20 89.85 5.82  90.00   90.31 7.41 78.00
visual_memory_baseline        4 20 75.00 9.34  77.00   75.50 9.64 59.00
visual.motor_speed_baseline   5 20 34.86 4.09  34.39   34.85 4.92 27.36
reaction_time_baseline        6 20  0.67 0.13   0.66    0.67 0.13  0.42
impulse_control_baseline      7 20  8.00 2.41   7.50    8.12 2.22  2.00
total_symptom_baseline        8 20  0.05 0.22   0.00    0.00 0.00  0.00
verbal_memory_retest          9 20 89.95 4.36  90.50   90.06 5.19 81.00
visual_memory_retest         10 20 74.60 7.76  74.50   75.00 8.15 60.00
visual.motor_speed_retest    11 20 33.40 6.44  34.54   33.52 6.30 20.15
reaction_time_retest         12 20  0.57 0.16   0.56    0.57 0.13  0.19
impulse_control_retest       13 20  8.50 2.31   9.00    8.69 1.48  3.00
total_symptom_retest         14 20  0.10 0.31   0.00    0.00 0.00  0.00
max range  skew kurtosis   se
subject*                    20.00 19.00  0.00    -1.38 1.32
condition*                   2.00  0.00   NaN      NaN 0.00
verbal_memory_baseline      98.00 20.00 -0.41    -0.87 1.30
visual_memory_baseline      88.00 29.00 -0.46    -1.27 2.09
visual.motor_speed_baseline 41.87 14.51  0.09    -1.19 0.91
reaction_time_baseline       1.00  0.58  0.47    -0.02 0.03
impulse_control_baseline    12.00 10.00 -0.41    -0.17 0.54
total_symptom_baseline       1.00  1.00  3.82    13.29 0.05
verbal_memory_retest        97.00 16.00 -0.25    -1.02 0.97
visual_memory_retest        86.00 26.00 -0.23    -1.11 1.73
visual.motor_speed_retest   44.28 24.13 -0.25    -0.77 1.44
reaction_time_retest         0.90  0.71 -0.16     0.06 0.04
impulse_control_retest      12.00  9.00 -0.73    -0.32 0.52
total_symptom_retest         1.00  1.00  2.47     4.32 0.07

Density plots

par(mfrow = c(1, 3))
hist(mydata$total_symptom_retest, xlab = "Total symptom score", main = "")
plot(density(mydata$total_symptom_retest), xlab = "Total sympton score", main = "")

# prob=TRUE for probabilities not counts
hist(mydata$total_symptom_retest, xlab = "Total symptom score", main = "", prob = TRUE)
lines(density(mydata$total_symptom_retest))

plot of chunk statone1

Compare density plots

library(sm)
par(mfrow = c(1, 1))
# This function allows a set of univariate density estimates to be compared,
# both graphically and formally in a permutation test of equality.
sm.density.compare(mydata$total_symptom_retest, mydata$condition, xlab = "Total symptom score")

plot of chunk statone2

Correlation analysis of baseline measures

cor(mydata[3:8])  # Columns 3 to 8 contain the 6 baseline measures
verbal_memory_baseline visual_memory_baseline
verbal_memory_baseline                     1.00000                0.37512
visual_memory_baseline                     0.37512                1.00000
visual.motor_speed_baseline               -0.04057               -0.23339
reaction_time_baseline                     0.14673                0.13615
impulse_control_baseline                   0.13147                0.23756
total_symptom_baseline                     0.13521                0.01689
visual.motor_speed_baseline
verbal_memory_baseline                        -0.040567
visual_memory_baseline                        -0.233391
visual.motor_speed_baseline                    1.000000
reaction_time_baseline                        -0.131955
impulse_control_baseline                       0.005221
total_symptom_baseline                        -0.051903
reaction_time_baseline
verbal_memory_baseline                      0.1467
visual_memory_baseline                      0.1361
visual.motor_speed_baseline                -0.1320
reaction_time_baseline                      1.0000
impulse_control_baseline                    0.1213
total_symptom_baseline                     -0.0339
impulse_control_baseline
verbal_memory_baseline                      0.131471
visual_memory_baseline                      0.237559
visual.motor_speed_baseline                 0.005221
reaction_time_baseline                      0.121334
impulse_control_baseline                    1.000000
total_symptom_baseline                      0.082149
total_symptom_baseline
verbal_memory_baseline                     0.13521
visual_memory_baseline                     0.01689
visual.motor_speed_baseline               -0.05190
reaction_time_baseline                    -0.03390
impulse_control_baseline                   0.08215
total_symptom_baseline                     1.00000
round(cor(mydata[3:8]), 2)  # Round to 2 decimal places
verbal_memory_baseline visual_memory_baseline
verbal_memory_baseline                        1.00                   0.38
visual_memory_baseline                        0.38                   1.00
visual.motor_speed_baseline                  -0.04                  -0.23
reaction_time_baseline                        0.15                   0.14
impulse_control_baseline                      0.13                   0.24
total_symptom_baseline                        0.14                   0.02
visual.motor_speed_baseline
verbal_memory_baseline                            -0.04
visual_memory_baseline                            -0.23
visual.motor_speed_baseline                        1.00
reaction_time_baseline                            -0.13
impulse_control_baseline                           0.01
total_symptom_baseline                            -0.05
reaction_time_baseline
verbal_memory_baseline                        0.15
visual_memory_baseline                        0.14
visual.motor_speed_baseline                  -0.13
reaction_time_baseline                        1.00
impulse_control_baseline                      0.12
total_symptom_baseline                       -0.03
impulse_control_baseline
verbal_memory_baseline                          0.13
visual_memory_baseline                          0.24
visual.motor_speed_baseline                     0.01
reaction_time_baseline                          0.12
impulse_control_baseline                        1.00
total_symptom_baseline                          0.08
total_symptom_baseline
verbal_memory_baseline                        0.14
visual_memory_baseline                        0.02
visual.motor_speed_baseline                  -0.05
reaction_time_baseline                       -0.03
impulse_control_baseline                      0.08
total_symptom_baseline                        1.00

Color scatterplot matrix, colored and ordered by magnitude of r

library(gclus)
base <- mydata[3:8]
base.r <- abs(cor(base))
base.color <- dmat.color(base.r)
base.order <- order.single(base.r)
# This function draws a scatterplot matrix of data.  Variables may be
# reordered and panels colored in the display
cpairs(base, base.order, panel.colors = base.color, gap = 0.5, main = "Variables Ordered and Colored by Correlation")

plot of chunk statone3

Regression analyses, unstandardized

model1 <- lm(mydata$visual_memory_retest ~ mydata$visual_memory_baseline)
summary(model1)

Call:
lm(formula = mydata$visual_memory_retest ~ mydata$visual_memory_baseline)

Residuals:
Min     1Q Median     3Q    Max
-8.137 -2.553 -0.358  2.803 12.152


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)                    10.3386     6.5090    1.59     0.12
mydata$visual_memory_baseline   0.8222     0.0864    9.52  1.3e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.64 on 38 degrees of freedom
Multiple R-squared:  0.705,	Adjusted R-squared:  0.697
F-statistic: 90.6 on 1 and 38 DF,  p-value: 1.32e-11

# Print 95% confidence interval for the regression coefficient
confint(model1)
2.5 %  97.5 %
(Intercept)                   -2.8381 23.5154
mydata$visual_memory_baseline  0.6473  0.9971

# Scatterplot with confidence interval around the regression line
library(ggplot2)
ggplot(mydata, aes(x = visual_memory_baseline, y = visual_memory_retest)) +
geom_smooth(method = "lm") + geom_point()

plot of chunk statone4


par(mfrow = c(1, 1))
plot(mydata$visual_memory_retest ~ mydata$visual_memory_baseline, main = "Scatterplot",
ylab = "retest", xlab = "baseline")
abline(model1, col = "blue")

plot of chunk statone4


# To visualize model1, save the predicted scores as a new variable and then
# plot them against the observed retest scores
mydata$predicted <- fitted(model1)
par(mfrow = c(1, 1))
plot(mydata$visual_memory_retest ~ mydata$predicted, main = "Scatterplot", ylab = "retest",
xlab = "predicted")
abline(lm(mydata$visual_memory_retest ~ mydata$predicted), col = "blue")

plot of chunk statone4


# The function fitted returns predicted scores whereas the function resid
# returns residuals
mydata$e <- resid(model1)
hist(mydata$e)

plot of chunk statone4

plot(mydata$predicted ~ mydata$e, main = "Scatterplot", ylab = "Predicted Scores",
xlab = "Residuals")
abline(lm(mydata$predicted ~ mydata$e), col = "blue")

plot of chunk statone4


# Conduct a model comparison NHST to compare the fit of model1 to the fit of
# model2
model2 <- lm(mydata$visual_memory_retest ~ mydata$visual_memory_baseline + mydata$verbal_memory_baseline)
anova(model1, model2)
Analysis of Variance Table

Model 1: mydata$visual_memory_retest ~ mydata$visual_memory_baseline
Model 2: mydata$visual_memory_retest ~ mydata$visual_memory_baseline +
mydata$verbal_memory_baseline
Res.Df RSS Df Sum of Sq   F Pr(>F)
1     38 818
2     37 790  1      27.8 1.3   0.26

Regression analyses, standardized

# In simple regression, the standardized regression coefficient will be the
# same as the correlation coefficient

round(cor(mydata[3:5]), 2)  # Round to 2 decimal places
verbal_memory_baseline visual_memory_baseline
verbal_memory_baseline                        1.00                   0.38
visual_memory_baseline                        0.38                   1.00
visual.motor_speed_baseline                  -0.04                  -0.23
visual.motor_speed_baseline
verbal_memory_baseline                            -0.04
visual_memory_baseline                            -0.23
visual.motor_speed_baseline                        1.00

model1.z <- lm(scale(mydata$verbal_memory_baseline) ~ scale(mydata$visual_memory_baseline))
summary(model1.z)

Call:
lm(formula = scale(mydata$verbal_memory_baseline) ~ scale(mydata$visual_memory_baseline))

Residuals:
Min      1Q  Median      3Q     Max
-1.9891 -0.5813  0.0866  0.7885  1.3377


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)                          -2.63e-17   1.48e-01    0.00    1.000
scale(mydata$visual_memory_baseline)  3.75e-01   1.50e-01    2.49    0.017

(Intercept)
scale(mydata$visual_memory_baseline) *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.939 on 38 degrees of freedom
Multiple R-squared:  0.141,	Adjusted R-squared:  0.118
F-statistic: 6.22 on 1 and 38 DF,  p-value: 0.0171
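
As a quick check of the comment above (a sketch, assuming mydata and model1.z are still in memory), the standardized slope should reproduce the correlation of about 0.38 shown in the matrix:

coef(model1.z)[2]  # standardized slope
cor(mydata$verbal_memory_baseline, mydata$visual_memory_baseline)  # Pearson r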

model2.z <- lm(scale(mydata$verbal_memory_baseline) ~ scale(mydata$visual.motor_speed_baseline))
summary(model2.z)

Call:
lm(formula = scale(mydata$verbal_memory_baseline) ~ scale(mydata$visual.motor_speed_baseline))

Residuals:
Min     1Q Median     3Q    Max
-2.302 -0.730  0.146  0.826  1.327


Coefficients:
Estimate Std. Error t value
(Intercept)                               -1.85e-17   1.60e-01    0.00
scale(mydata$visual.motor_speed_baseline) -4.06e-02   1.62e-01   -0.25
Pr(>|t|)
(Intercept)                                    1.0
scale(mydata$visual.motor_speed_baseline)      0.8

Residual standard error: 1.01 on 38 degrees of freedom
Multiple R-squared:  0.00165,	Adjusted R-squared:  -0.0246
F-statistic: 0.0626 on 1 and 38 DF,  p-value: 0.804

model3.z <- lm(scale(mydata$verbal_memory_baseline) ~ scale(mydata$visual_memory_baseline) +
scale(mydata$visual.motor_speed_baseline))
summary(model3.z)

Call:
lm(formula = scale(mydata$verbal_memory_baseline) ~ scale(mydata$visual_memory_baseline) +
scale(mydata$visual.motor_speed_baseline))

Residuals:
Min      1Q  Median      3Q     Max
-1.9657 -0.5620  0.0848  0.7847  1.3356

Coefficients:
Estimate Std. Error t value
(Intercept)                               -4.48e-18   1.50e-01    0.00
scale(mydata$visual_memory_baseline)       3.87e-01   1.57e-01    2.47
scale(mydata$visual.motor_speed_baseline)  4.97e-02   1.57e-01    0.32
Pr(>|t|)
(Intercept)                                  1.000
scale(mydata$visual_memory_baseline)         0.018 *
scale(mydata$visual.motor_speed_baseline)    0.753
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.95 on 37 degrees of freedom
Multiple R-squared:  0.143,	Adjusted R-squared:  0.0967
F-statistic: 3.09 on 2 and 37 DF,  p-value: 0.0575

NHST for each correlation coefficient

# Null Hypothesis Significance Testing (NHST) is a statistical method for
# testing whether the factor in question has an effect on the observed
# outcome.
cor.test(mydata$visual_memory_baseline, mydata$visual_memory_retest)

Pearson's product-moment correlation

data:  mydata$visual_memory_baseline and mydata$visual_memory_retest
t = 9.519, df = 38, p-value = 1.321e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7147 0.9123
sample estimates:
cor
0.8394

Moderation analysis

Moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. It affects the direction and/or strength of the relation between dependent and independent variables.
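
In regression terms, moderation shows up as an interaction between the predictor and the moderator. A minimal sketch with made-up variables x (predictor), m (moderator) and y (outcome), not the course data:

set.seed(1)
d <- data.frame(x = rnorm(100), m = rnorm(100))
d$y <- 2 + 0.5 * d$x + 0.3 * d$m + 0.4 * d$x * d$m + rnorm(100)
summary(lm(y ~ x * m, data = d))  # the x:m term carries the moderation (interaction) effect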

myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/stat/stat_one_mod.txt",
ssl.verifypeer = FALSE)
MOD <- read.table(textConnection(myCsv), header = TRUE)
head(MOD)
subject condition  IQ  WM WM.centered D1 D2
1       1   control 134  91       -8.08  0  0
2       2   control 121 145       45.92  0  0
3       3   control  86 118       18.92  0  0
4       4   control  74 105        5.92  0  0
5       5   control  80  96       -3.08  0  0
6       6   control 105 133       33.92  0  0
# First, is there an effect of stereotype threat?
model_mod0 <- lm(MOD$IQ ~ MOD$D1 + MOD$D2)
summary(model_mod0)

Call:
lm(formula = MOD$IQ ~ MOD$D1 + MOD$D2)

Residuals:
Min     1Q Median     3Q    Max
-51.88 -11.13  -0.45   8.77  43.12

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    97.88       2.29    42.8   <2e-16 ***
MOD$D1        -45.72       3.23   -14.2   <2e-16 ***
MOD$D2        -49.86       3.23   -15.4   <2e-16 ***
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.2 on 147 degrees of freedom
Multiple R-squared:  0.666,	Adjusted R-squared:  0.661
F-statistic:  147 on 2 and 147 DF,  p-value: <2e-16
confint(model_mod0)
2.5 % 97.5 %
(Intercept)  93.36 102.40
MOD$D1      -52.11 -39.33
MOD$D2      -56.25 -43.47

# We could also use the aov function (for analysis of variance) followed by
# the TukeyHSD function (Tukey's test of pairwise comparisons, which adjusts
# the p value to prevent inflation of the Type I error rate)
table(MOD$condition)

control threat1 threat2
50      50      50
model_mod0a <- aov(MOD$IQ ~ MOD$condition)
summary(model_mod0a)
Df Sum Sq Mean Sq F value Pr(>F)
MOD$condition   2  76558   38279     147 <2e-16 ***
Residuals     147  38393     261
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Create a set of confidence intervals on the differences between the means
# of the levels of a factor with the specified family-wise probability of
# coverage.  The intervals are based on the Studentized range statistic,
# Tukey's ‘Honest Significant Difference’ method.
TukeyHSD(model_mod0a)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = MOD$IQ ~ MOD$condition)

$`MOD$condition`
diff    lwr     upr  p adj
threat1-control -45.72 -53.37 -38.067 0.0000
threat2-control -49.86 -57.51 -42.207 0.0000
threat2-threat1  -4.14 -11.79   3.513 0.4082
# Moderation analysis (uncentered): model_mod1 tests for 'first-order
# effects'; model_mod2 tests for moderation
model_mod1 <- lm(MOD$IQ ~ MOD$WM + MOD$D1 + MOD$D2)
summary(model_mod1)

Call:
lm(formula = MOD$IQ ~ MOD$WM + MOD$D1 + MOD$D2)

Residuals:
Min     1Q Median     3Q    Max
-47.34  -7.29   0.74   7.61  42.42


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  59.7864     7.1436    8.37  4.3e-14 ***
MOD$WM        0.3728     0.0669    5.57  1.2e-07 ***
MOD$D1      -45.2055     2.9464  -15.34  < 2e-16 ***
MOD$D2      -46.9074     2.9922  -15.68  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.7 on 146 degrees of freedom
Multiple R-squared:  0.725,	Adjusted R-squared:  0.719
F-statistic:  128 on 3 and 146 DF,  p-value: <2e-16
ggplot(MOD, aes(x = WM, y = IQ)) + geom_smooth(method = "lm") + geom_point()

plot of chunk statone5

ggplot(MOD, aes(x = WM, y = IQ)) + stat_smooth(method = "lm", se = F) + geom_point(aes(color = condition))

plot of chunk statone5

ggplot(MOD, aes(x = WM, y = IQ)) + geom_smooth(aes(group = condition), method = "lm",
se = T, color = "black", fullrange = T) + geom_point(aes(color = condition))

plot of chunk statone5

# Create new predictor variables
MOD$WM.D1 <- (MOD$WM * MOD$D1)
MOD$WM.D2 <- (MOD$WM * MOD$D2)
model_mod2 <- lm(MOD$IQ ~ MOD$WM + MOD$D1 + MOD$D2 + MOD$WM.D1 + MOD$WM.D2)
summary(model_mod2)

Call:
lm(formula = MOD$IQ ~ MOD$WM + MOD$D1 + MOD$D2 + MOD$WM.D1 +
MOD$WM.D2)

Residuals:
Min     1Q Median     3Q    Max
-50.41  -7.18   0.42   8.20  40.86


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   85.585     11.358    7.54  5.0e-12 ***
MOD$WM         0.120      0.109    1.10   0.2730
MOD$D1       -93.095     16.857   -5.52  1.5e-07 ***
MOD$D2       -79.897     15.477   -5.16  8.0e-07 ***
MOD$WM.D1      0.472      0.164    2.88   0.0046 **
MOD$WM.D2      0.329      0.155    2.13   0.0353 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.4 on 144 degrees of freedom
Multiple R-squared:  0.741,	Adjusted R-squared:  0.732
F-statistic: 82.4 on 5 and 144 DF,  p-value: <2e-16

anova(model_mod1, model_mod2)
Analysis of Variance Table

Model 1: MOD$IQ ~ MOD$WM + MOD$D1 + MOD$D2
Model 2: MOD$IQ ~ MOD$WM + MOD$D1 + MOD$D2 + MOD$WM.D1 + MOD$WM.D2
Res.Df   RSS Df Sum of Sq    F Pr(>F)
1    146 31655
2    144 29784  2      1871 4.52  0.012 *
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
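
The head of MOD above also shows a WM.centered column. The same moderation test can be run on the centered working-memory score; centering leaves the interaction test unchanged but makes the lower-order WM coefficient interpretable at the mean WM. A sketch, assuming MOD is still in memory:

# Moderation analysis (centered): product terms built from the centered WM score
MOD$WMC.D1 <- MOD$WM.centered * MOD$D1
MOD$WMC.D2 <- MOD$WM.centered * MOD$D2
model_mod1c <- lm(MOD$IQ ~ MOD$WM.centered + MOD$D1 + MOD$D2)
model_mod2c <- lm(MOD$IQ ~ MOD$WM.centered + MOD$D1 + MOD$D2 + MOD$WMC.D1 + MOD$WMC.D2)
anova(model_mod1c, model_mod2c)  # should give the same F and p as the uncentered comparison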

Mediation analysis

Rather than hypothesizing a direct causal relationship between the independent variable and the dependent variable, a mediational model hypothesizes that the independent variable influences the mediator variable, which in turn influences the dependent variable… Mediation analyses are employed to understand a known relationship by exploring the underlying mechanism or process by which one variable influences another variable.

myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/stat/stat_one_med.txt",
ssl.verifypeer = FALSE)
MED <- read.table(textConnection(myCsv), header = TRUE)
head(MED)
subject condition  IQ WM
1       1   control  73 37
2       2   control 128 77
3       3   control  83 32
4       4   control  83 33
5       5   control  64 53
6       6   control  95 46
# The function sobel in the multilevel package executes the entire mediation
# analysis in one step but first we will do it with 3 lm models
model.YX <- lm(MED$IQ ~ MED$condition)
model.YXM <- lm(MED$IQ ~ MED$condition + MED$WM)
model.MX <- lm(MED$WM ~ MED$condition)

summary(model.YX)

Call:
lm(formula = MED$IQ ~ MED$condition)

Residuals:
Min     1Q Median     3Q    Max
-35.32  -9.57  -1.82  10.68  39.68

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)            97.32       2.07   47.00  < 2e-16 ***
MED$conditionthreat   -11.00       2.93   -3.76  0.00029 ***
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.6 on 98 degrees of freedom
Multiple R-squared:  0.126,	Adjusted R-squared:  0.117
F-statistic: 14.1 on 1 and 98 DF,  p-value: 0.000293
summary(model.YXM)

Call:
lm(formula = MED$IQ ~ MED$condition + MED$WM)

Residuals:
Min     1Q Median     3Q    Max
-31.88  -7.90   0.93   6.99  27.58

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)           55.998      4.644   12.06  < 2e-16 ***
MED$conditionthreat   -2.408      2.316   -1.04      0.3
MED$WM                 0.752      0.080    9.41  2.6e-15 ***
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.6 on 97 degrees of freedom
Multiple R-squared:  0.543,	Adjusted R-squared:  0.533
F-statistic: 57.6 on 2 and 97 DF,  p-value: <2e-16
summary(model.MX)

Call:
lm(formula = MED$WM ~ MED$condition)

Residuals:
Min     1Q Median     3Q    Max
-31.92  -7.75  -0.50  10.18  30.50

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)            54.92       1.90   28.89  < 2e-16 ***
MED$conditionthreat   -11.42       2.69   -4.25  4.9e-05 ***
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.4 on 98 degrees of freedom
Multiple R-squared:  0.156,	Adjusted R-squared:  0.147
F-statistic:   18 on 1 and 98 DF,  p-value: 4.91e-05

# Compare the results to the output of the sobel function
library(multilevel)
# Estimate Sobel's (1982) indirect test for mediation.  The function
# provides an estimate of the magnitude of the indirect effect, Sobel's
# first-order estimate of the standard error associated with the indirect
# effect, and the corresponding z-value.
model.ALL <- sobel(MED$condition, MED$WM, MED$IQ)
model.ALL
$`Mod1: Y~X`
Estimate Std. Error t value  Pr(>|t|)
(Intercept)    97.32      2.071  46.999 4.966e-69
predthreat    -11.00      2.928  -3.756 2.928e-04

$`Mod2: Y~X+M`
Estimate Std. Error t value  Pr(>|t|)
(Intercept)  55.9977      4.644  12.058 5.304e-21
predthreat   -2.4075      2.316  -1.039 3.012e-01
med           0.7524      0.080   9.406 2.577e-15

$`Mod3: M~X`
Estimate Std. Error t value  Pr(>|t|)
(Intercept)    54.92      1.901  28.895 9.487e-50
predthreat    -11.42      2.688  -4.249 4.906e-05

$Indirect.Effect
[1] -8.593

$SE
[1] 2.219

$z.value
[1] -3.872

$N
[1] 100
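
As a check on sobel(), the indirect effect is just the product of the two path coefficients, and the z-value divides that product by its first-order standard error. A sketch, assuming the three lm models above are still in memory:

a <- coef(model.MX)[2]                        # X -> M path (condition on WM), about -11.42
b <- coef(model.YXM)[3]                       # M -> Y path (WM on IQ, controlling for condition), about 0.75
se.a <- summary(model.MX)$coefficients[2, 2]
se.b <- summary(model.YXM)$coefficients[3, 2]
a * b                                         # indirect effect, about -8.59
(a * b)/sqrt(b^2 * se.a^2 + a^2 * se.b^2)     # Sobel z, about -3.87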

Conduct group comparisons with both parametric and non-parametric tests

myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/stat/stat_one_t_anova.txt",
ssl.verifypeer = FALSE)
wm <- read.table(textConnection(myCsv), header = TRUE)
head(wm)
cond pre post gain train
1  t08   8    9    1     1
2  t08   8   10    2     1
3  t08   8    8    0     1
4  t08   8    7   -1     1
5  t08   9   11    2     1
6  t08   9   10    1     1

# Create two subsets of data: One for the control group and another for the
# training groups
wm.c = subset(wm, wm$train == "0")
wm.t = subset(wm, wm$train == "1")
# Dependent t-tests and Wilcoxon

# First, compare pre and post scores in the control group
t.test(wm.c$pre, wm.c$post, paired = T)

Paired t-test

data:  wm.c$pre and wm.c$post
t = -9.009, df = 39, p-value = 4.511e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.418 -1.532
sample estimates:
mean of the differences
-1.975


# Wilcoxon Rank Sum and Signed Rank Tests: performs one- and two-sample
# Wilcoxon tests on vectors of data; the latter is also known as the
# ‘Mann-Whitney’ test.
wilcox.test(wm.c$pre, wm.c$post, paired = T)

Wilcoxon signed rank test with continuity correction

data:  wm.c$pre and wm.c$post
V = 10, p-value = 1.717e-07
alternative hypothesis: true location shift is not equal to 0

# Next, compare pre and post scores in the training groups
t.test(wm.t$pre, wm.t$post, paired = T)

Paired t-test

data:  wm.t$pre and wm.t$post
t = -14.49, df = 79, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.966 -3.009
sample estimates:
mean of the differences
-3.487


# Wilcoxon
wilcox.test(wm.t$pre, wm.t$post, paired = T)

Wilcoxon signed rank test with continuity correction

data:  wm.t$pre and wm.t$post
V = 10, p-value = 3.017e-14
alternative hypothesis: true location shift is not equal to 0

# Cohen's d for dependent t-tests
library(lsr)
cohensD(wm.c$post, wm.c$pre, method = "paired")
[1] 1.424
cohensD(wm.t$post, wm.t$pre, method = "paired")
[1] 1.62
# Independent t-test and Mann Whitney

# Compare the gain scores in the control and training groups
t.test(wm$gain ~ wm$train, var.equal = T)

Two Sample t-test

data:  wm$gain by wm$train
t = -4.04, df = 118, p-value = 9.539e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.2538 -0.7712
sample estimates:
mean in group 0 mean in group 1
1.975           3.487

# Mann-Whitney
wilcox.test(wm$gain ~ wm$train, paired = F)

Wilcoxon rank sum test with continuity correction

data:  wm$gain by wm$train
W = 916, p-value = 0.0001061
alternative hypothesis: true location shift is not equal to 0

# Cohen's d for independent t-tests
cohensD(wm$gain ~ wm$train, method = "pooled")
[1] 0.7824
# Analysis of Variance (ANOVA) and Kruskal-Wallis.  To compare the gain scores
# across all groups, use ANOVA.  First, check the homogeneity of variance
# assumption
library(car)
leveneTest(wm.t$gain, wm.t$cond, center = "mean")
Levene's Test for Homogeneity of Variance (center = "mean")
Df F value Pr(>F)
group  3    1.13   0.34
76
leveneTest(wm.t$gain, wm.t$cond)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group  3    1.31   0.28
76

aov.model = aov(wm.t$gain ~ wm.t$cond)
summary(aov.model)
Df Sum Sq Mean Sq F value  Pr(>F)
wm.t$cond    3    213      71    35.3 2.2e-14 ***
Residuals   76    153       2
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Kruskal Wallis Rank Sum Test
kruskal.test(wm.t$gain ~ wm.t$cond)

Kruskal-Wallis rank sum test

data:  wm.t$gain by wm.t$cond
Kruskal-Wallis chi-squared = 50.25, df = 3, p-value = 7.084e-11

# Effect size for ANOVA
etaSquared(aov.model, anova = T)
eta.sq eta.sq.part    SS df     MS     F         p
wm.t$cond 0.5821      0.5821 213.0  3 71.013 35.29 2.154e-14
Residuals 0.4179          NA 152.9 76  2.012    NA        NA

# Conduct post-hoc tests to evaluate all pairwise comparisons
TukeyHSD(aov.model)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = wm.t$gain ~ wm.t$cond)

$`wm.t$cond`
diff    lwr   upr  p adj
t12-t08 1.25 0.0716 2.428 0.0333
t17-t08 3.05 1.8716 4.228 0.0000
t19-t08 4.25 3.0716 5.428 0.0000
t17-t12 1.80 0.6216 2.978 0.0008
t19-t12 3.00 1.8216 4.178 0.0000
t19-t17 1.20 0.0216 2.378 0.0443

Conduct a binary logistic regression

myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/stat/stat_one_glm.txt",
ssl.verifypeer = FALSE)
BL <- read.table(textConnection(myCsv), header = TRUE)
head(BL)
subject verdict danger rehab punish gendet specdet incap
1       1       0      2     2      2      2       0     7
2       2       0      0     9      0      6       8     2
3       3       1      6     3      2     10      10     4
4       4       1      1     3      2      3       2     1
5       5       0      0     7      4      1       1    10
6       6       1     10     6      1      8       0     0
# Binary logistic regression
lrfit <- glm(BL$verdict ~ BL$danger + BL$rehab + BL$punish + BL$gendet + BL$specdet +
BL$incap, family = binomial)
summary(lrfit)

Call:
glm(formula = BL$verdict ~ BL$danger + BL$rehab + BL$punish +
BL$gendet + BL$specdet + BL$incap, family = binomial)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.969  -0.932  -0.463   0.891   1.957

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.74758    0.91728   -1.91   0.0568 .
BL$danger    0.29339    0.09292    3.16   0.0016 **
BL$rehab    -0.18784    0.08140   -2.31   0.0210 *
BL$punish    0.07012    0.07111    0.99   0.3241
BL$gendet    0.18574    0.07733    2.40   0.0163 *
BL$specdet   0.00590    0.07865    0.08   0.9402
BL$incap     0.00353    0.07587    0.05   0.9629
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 114.06  on 93  degrees of freedom
AIC: 128.1

Number of Fisher Scoring iterations: 3

confint(lrfit)  # CIs using profiled log-likelihood (default for logistic models)
2.5 %   97.5 %
(Intercept) -3.65201 -0.02217
BL$danger    0.11998  0.48756
BL$rehab    -0.35426 -0.03283
BL$punish   -0.06763  0.21352
BL$gendet    0.03858  0.34394
BL$specdet  -0.15006  0.16092
BL$incap    -0.14735  0.15284
confint.default(lrfit)  # CIs using standard errors
2.5 %   97.5 %
(Intercept) -3.54542  0.05027
BL$danger    0.11127  0.47550
BL$rehab    -0.34738 -0.02831
BL$punish   -0.06926  0.20950
BL$gendet    0.03418  0.33730
BL$specdet  -0.14824  0.16005
BL$incap    -0.14518  0.15224

# Model fit
with(lrfit, null.deviance - deviance)  #difference in deviance for the two models
[1] 24.41
with(lrfit, df.null - df.residual)  #df for the difference between the two models
[1] 6
with(lrfit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))  #p-value
[1] 0.0004397

# Wald tests for model coefficients: computes a Wald chi-squared test for 1
# or more coefficients, given their variance-covariance matrix.
library(aod)
wald.test(b = coef(lrfit), Sigma = vcov(lrfit), Terms = 2)  #danger
Wald test:
----------

Chi-squared test:
X2 = 10.0, df = 1, P(> X2) = 0.0016
wald.test(b = coef(lrfit), Sigma = vcov(lrfit), Terms = 3)  #rehab
Wald test:
----------


Chi-squared test:
X2 = 5.3, df = 1, P(> X2) = 0.021
wald.test(b = coef(lrfit), Sigma = vcov(lrfit), Terms = 4)  #punish
Wald test:
----------

Chi-squared test:
X2 = 0.97, df = 1, P(> X2) = 0.32
wald.test(b = coef(lrfit), Sigma = vcov(lrfit), Terms = 5)  #gendet
Wald test:
----------


Chi-squared test:
X2 = 5.8, df = 1, P(> X2) = 0.016
wald.test(b = coef(lrfit), Sigma = vcov(lrfit), Terms = 6)  #specdet
Wald test:
----------

Chi-squared test:
X2 = 0.0056, df = 1, P(> X2) = 0.94
wald.test(b = coef(lrfit), Sigma = vcov(lrfit), Terms = 7)  #incap
Wald test:
----------


Chi-squared test:
X2 = 0.0022, df = 1, P(> X2) = 0.96

# Odds ratios
exp(coef(lrfit))  #exponentiated coefficients
(Intercept)   BL$danger    BL$rehab   BL$punish   BL$gendet  BL$specdet
0.1742      1.3410      0.8287      1.0726      1.2041      1.0059
BL$incap
1.0035
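
Confidence intervals on the odds-ratio scale follow from exponentiating the profiled intervals as well; a short sketch, assuming lrfit is still in memory:

# Odds ratios with 95% profile-likelihood confidence intervals
exp(cbind(OR = coef(lrfit), confint(lrfit)))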

# Classification table
library(QuantPsyc)
# Provides a Classification analysis for a logistic regression model.  Also
# provides McFadden's Rsq.
ClassLog(lrfit, BL$verdict)
$rawtab
resp
0  1
FALSE 39 16
TRUE  13 32

$classtab
resp
0      1
FALSE 0.7500 0.3333
TRUE  0.2500 0.6667

$overall
[1] 0.71

$mcFadden
[1] 0.1763

Statistics One is available on coursera

http://felixfan.github.io/equal-size-plot

Artificial data

set.seed(99999)
mydata = data.frame(x = rep(1:5, times = 2), y = rnorm(10, 10), z = rep(c("a",
"b"), each = 5))
mydata
x      y z
1  1  9.574 a
2  2  9.717 a
3  3  9.101 a
4  4 10.707 a
5  5 12.092 a
6  1 11.636 b
7  2  9.460 b
8  3  9.396 b
9  4  9.797 b
10 5  9.688 b

Plot

Same x axis but different y axis.

par(mfrow = c(2, 1), mar = c(0, 4.1, 4, 2))
# mfrow: A vector of the form c(nr, nc). Subsequent figures will be drawn in
# an nr-by-nc array on the device by rows

# mar: A numerical vector of the form c(bottom, left, top, right) which
# gives the number of lines of margin to be specified on the four sides of
# the plot.  The default is c(5, 4, 4, 2) + 0.1.

# first plot
plot(y ~ x, data = mydata, axes = FALSE, frame.plot = TRUE, xlab = "", main = "Two plots with equal size",
col = c("red", "green")[mydata$z], ylim = c(9, 13))
# The option axes=FALSE suppresses both x and y axes.  xaxt='n' and yaxt='n'
# suppress the x and y axis respectively.

# frame.plot: a logical indicating whether a box should be drawn around the
# plot.

# add legend
legend(x = "topleft", legend = c("a", "b"), col = c("red", "green"), pch = 1)

# add y axis
axis(side = 2, las = 1)
# side: an integer indicating the side of the graph to draw the axis
# (1=bottom, 2=left, 3=top, 4=right)

# las: numeric in {0,1,2,3}; the style of axis labels.  0: always parallel
# to the axis [default], 1: always horizontal, 2: always perpendicular to
# the axis, 3: always vertical.

# second plot
par(mar = c(4, 4.1, 0, 2))

plot(log(y) ~ x, data = mydata, axes = FALSE, frame.plot = TRUE, xlab = "x",
ylab = "log(y)", col = c("red", "green")[mydata$z], ylim = c(2, 3))

# add y axis
axis(side = 2, las = 1)

# add x axis
axis(side = 1)

plot of chunk same-size-plot

http://felixfan.github.io/data-table
library(data.table)

Create data table

# it is similar with data frame
set.seed(999)
dt = data.table(x = sample(c("a", "b", "c", "d"), 10, replace = TRUE), y = rnorm(10))
dt
x       y
1: b -0.5660
2: c -1.8787
3: a -1.2668
4: d -0.9677
5: d -1.1210
6: a  1.3255
7: c  0.1340
8: a  0.9387
9: b  0.1725
10: c  0.9577
dt[2]  # 2nd row
x      y
1: c -1.879
dt[, y]  # y column (as vector)
[1] -0.5660 -1.8787 -1.2668 -0.9677 -1.1210  1.3255  0.1340  0.9387
[9]  0.1725  0.9577
dt[, list(y)]  # y column (as data.table)
y
1: -0.5660
2: -1.8787
3: -1.2668
4: -0.9677
5: -1.1210
6:  1.3255
7:  0.1340
8:  0.9387
9:  0.1725
10:  0.9577


# convert existing data.frame objects to data.table.
dt.cars = data.table(cars)
head(dt.cars)
speed dist
1:     4    2
2:     4   10
3:     7    4
4:     7   22
5:     8   16
6:     9   10

list out all data.tables in memory

tables()  # The result of tables() is itself a data.table
NAME    NROW MB COLS       KEY
[1,] dt        10 1  x,y
[2,] dt.cars   50 1  speed,dist
Total: 2MB

Keys

A key consists of one or more columns of rownames, which may be integer, factor, character or some other class, not simply character. The rows are sorted by the key. A data.table can have at most one key, but duplicate key values are allowed.

# use data.frame syntax in a data.table
dt[2, ]  # the second row of df
x      y
1: c -1.879
dt[dt$x == "a", ]  # all rows with first column is 'a'
x       y
1: a -1.2668
2: a  1.3255
3: a  0.9387

# data.table unique key
setkey(dt, x)  # set x column as key
dt["a", ]  # all rows with first column is 'a', The comma is optional.
x       y
1: a -1.2668
2: a  1.3255
3: a  0.9387
dt["a"]
x       y
1: a -1.2668
2: a  1.3255
3: a  0.9387

# By default all the rows in the group are returned The mult argument allows
# only the first or last row of the group to be returned
dt["a", mult = "first"]
x      y
1: a -1.267
dt["a", mult = "last"]
x      y
1: a 0.9387

binary search (faster)

The vector scan is linear, but the binary search is O(log n).

df2 <- data.frame(x = sample(LETTERS, 1e+07, replace = T), y = sample(letters,
1e+07, replace = T), z = rnorm(1e+07))
system.time(ans1 <- df2[df2$x == "R" & df2$y == "h", ])  # 'vector scan'
user  system elapsed
5.81    0.28    6.11

dt2 <- data.table(df2)
setkey(dt2, x, y)
system.time(ans2 <- dt2[J("R", "h")])  # binary search, faster
user  system elapsed
0.02    0.00    0.01

identical(ans1$z, ans2$z)
[1] TRUE

Fast grouping

system.time(sum1 <- dt2[, sum(z), by = x])
user  system elapsed
0.21    0.05    0.25
head(sum1)
x       V1
1: A  -211.43
2: B   -60.01
3: C  -723.30
4: D   392.90
5: E -1251.52
6: F    92.21

dt2[, sum(z), by = list(x == "A")]  # by expression
x      V1
1:  TRUE  -211.4
2: FALSE -6875.4

system.time(sum2 <- dt2[, sum(z), by = "x,y"])
user  system elapsed
0.14    0.04    0.19
head(sum2)
x y      V1
1: A a 147.617
2: A b   2.296
3: A c 203.767
4: A d  77.195
5: A e -16.550
6: A f  58.810

Fast time series join (or a rolling join)

set.seed(9999)
dt3 = data.table(x = sample(letters, 10, replace = T), y = sample(1:20, 10,
replace = T), z = sample(1:99, 10, replace = T))
setkey(dt3, x)
dt3["o"]  # join to 1st column of key
x  y  z
1: o NA NA
dt3[J("o")]  # same. J() stands for Join, an alias for list()
x  y  z
1: o NA NA
dt3[!"o"]  # all rows other than 'o'
x  y  z
1: f 16 55
2: f 19 34
3: r  1 98
4: r  9 34
5: t  3 35
6: u  4 85
7: v 19 39
8: v  2  7
9: w 17 33
10: z 17 45
dt3[!2:4]  # all rows other than 2:4
x  y  z
1: f 16 55
2: t  3 35
3: u  4 85
4: v 19 39
5: v  2  7
6: w 17 33
7: z 17 45

setkey(dt3, x, y)
dt3[J("o", 3)]  # join to 2 columns
x y  z
1: o 3 NA
dt3[J("o", 3:6)]  # join 4 rows (1 missing)
x y  z
1: o 3 NA
2: o 4 NA
3: o 5 NA
4: o 6 NA
dt3[!J("o", 3)]  # multiple join
x  y  z
1: f 16 55
2: f 19 34
3: r  1 98
4: r  9 34
5: t  3 35
6: u  4 85
7: v  2  7
8: v 19 39
9: w 17 33
10: z 17 45
dt3[, sum(z), by = x][order(-V1)]  # ordering results
x  V1
1: r 132
2: f  89
3: u  85
4: v  46
5: z  45
6: t  35
7: w  33
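
The rolling join in this section's title is driven by the roll argument: when the value of the last join column is not found, the prevailing observation is rolled forward instead of returning NA. A sketch, assuming dt3 is still keyed on (x, y):

dt3[J("f", 17)]               # exact join: no row with x == "f" and y == 17, so z is NA
dt3[J("f", 17), roll = TRUE]  # rolling join: should roll the y == 16 row forward (z = 55)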

Learn more

vignette("datatable-intro")
example(data.table)
http://felixfan.github.io/r-basic-functions

1. all, any, which

x = -5:5

all(x)  # Given a set of logical vectors, are all of the values true?
[1] FALSE
all(x > 3)
[1] FALSE

any(x)  # Given a set of logical vectors, is at least one of the values true?
[1] TRUE
any(x > 3)
[1] TRUE

which(x > 3)  # Give the TRUE indices of a logical object, allowing for array indices.
[1] 10 11
x[which(x > 3)]
[1] 4 5
x[x > 3]
[1] 4 5

which.max(x)  # Give the indices of max
[1] 11
x[which.max(x)]
[1] 5
max(x)
[1] 5

which.min(x)  # Give the indices of min
[1] 1
x[which.min(x)]
[1] -5
min(x)
[1] -5

2. summary, max, min, mean, median, range, quantile

options(digits = 4)
y = rnorm(10)
y
[1] -0.4460 -0.2693 -0.1460 -1.9396 -0.0134  0.1919  0.4205  0.5393
[9]  0.2638  0.9963
summary(y)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.9400 -0.2380  0.0893 -0.0402  0.3810  0.9960
max(y)
[1] 0.9963
min(y)
[1] -1.94
mean(y)
[1] -0.04025
median(y)
[1] 0.08926
range(y)
[1] -1.9396  0.9963
quantile(y, c(0.25, 0.75, 0.95))
25%     75%     95%
-0.2384  0.3813  0.7906

3. dim, nrow, ncol, colMeans, colSums, rowMeans, rowSums

df = data.frame(x = 1:5, y = 3:7, z = 23:27)
df
x y  z
1 1 3 23
2 2 4 24
3 3 5 25
4 4 6 26
5 5 7 27

dim(df)
[1] 5 3
nrow(df)
[1] 5
ncol(df)
[1] 3

colMeans(df)
x  y  z
3  5 25
colSums(df)
x   y   z
15  25 125
rowMeans(df)
[1]  9 10 11 12 13
rowSums(df)
[1] 27 30 33 36 39

4. nchar, strsplit, substr, toupper, tolower

z = "Hello World"
nchar(z)
[1] 11
strsplit(z, " ")
[[1]]
[1] "Hello" "World"
substr(z, 1, 1)
[1] "H"
substr(z, 1, 3)
[1] "Hel"
toupper(z)
[1] "HELLO WORLD"
tolower(z)
[1] "hello world"

5. cummax, cummin, cumprod, and cumsum

x = c(1, 2, -3, 4, -6, 9, 6, 7)
x
[1]  1  2 -3  4 -6  9  6  7
cumsum(x)
[1]  1  3  0  4 -2  7 13 20
cumprod(x)
[1]     1     2    -6   -24   144  1296  7776 54432
cummax(x)
[1] 1 2 2 4 4 9 9 9
cummin(x)
[1]  1  1 -3 -3 -6 -6 -6 -6

6. diff, duplicated, unique, order, and sort

x = c(1, 0, -2, 3, 6, 0, 9)
x
[1]  1  0 -2  3  6  0  9

diff(x, 1)  # x[i] - x[i-1]
[1] -1 -2  5  3 -6  9

duplicated(x)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
x[!duplicated(x)]
[1]  1  0 -2  3  6  9

unique(x)
[1]  1  0 -2  3  6  9

order(x)
[1] 3 2 6 1 4 5 7
x[order(x)]
[1] -2  0  0  1  3  6  9

sort(x)
[1] -2  0  0  1  3  6  9

ifelse

a = c(4, -4)
sqrt(ifelse(a >= 0, a, NA))
[1]  2 NA

do.call

# Execute a Function Call
do.call(paste, list("Hello", "World", sep = " "))
[1] "Hello World"

x = seq(1, 2, by = 0.2)
y = seq(3, 4, by = 0.2)
do.call(cbind, list(x, y))
[,1] [,2]
[1,]  1.0  3.0
[2,]  1.2  3.2
[3,]  1.4  3.4
[4,]  1.6  3.6
[5,]  1.8  3.8
[6,]  2.0  4.0
do.call(rbind, list(x, y))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1  1.2  1.4  1.6  1.8    2
[2,]    3  3.2  3.4  3.6  3.8    4

expression and eval

# Evaluate an (Unevaluated) Expression
x = 3
y = 5
z = expression(x + y)
z
expression(x + y)
eval(z)
[1] 8
http://felixfan.github.io/install-mySQL

Install MySQL on Windows 7

Download MySQL Installer

  • Windows 7
  • mysql-installer-web-community-5.6.15.0.msi

MySQL Installer

If you do not want to create a new account, just click “No thanks, just start my download.” (at the bottom of the page).

Note: MySQL Installer is 32 bit, but will install both 32 bit and 64 bit binaries.

Install MySQL

  • Double-click the downloaded file to install

Creating A New MySQL Connection

Creating A New MySQL Connection “MyFirstConnection”.

Load data from csv file

The example data used is a typical PLINK association output that contains 13 columns (CHR, SNP, BP, A1, F_A, F_U, A2, CHISQ, P, OR, SE, L95, U95).

First create tables.

CREATE TABLE `gwas`(
`CHR` INT NOT NULL,
`SNP` VARCHAR(45) NOT NULL,
`BP` INT NOT NULL,
`A1` VARCHAR(45) NOT NULL,
`F_A` DOUBLE NOT NULL,
`F_U` DOUBLE NOT NULL,
`A2` VARCHAR(45) NOT NULL,
`CHISQ` DOUBLE NOT NULL,
`P` DOUBLE NOT NULL,
`OR` DOUBLE NOT NULL,
`SE` DOUBLE NOT NULL,
`L95` DOUBLE NOT NULL,
`U95` DOUBLE NOT NULL,
PRIMARY KEY (`SNP`));
CREATE TABLE `replicate`(
`CHR` INT NOT NULL,
`SNP` VARCHAR(45) NOT NULL,
`BP` INT NOT NULL,
`A1` VARCHAR(45) NOT NULL,
`F_A` DOUBLE NOT NULL,
`F_U` DOUBLE NOT NULL,
`A2` VARCHAR(45) NOT NULL,
`CHISQ` DOUBLE NOT NULL,
`P` DOUBLE NOT NULL,
`OR` DOUBLE NOT NULL,
`SE` DOUBLE NOT NULL,
`L95` DOUBLE NOT NULL,
`U95` DOUBLE NOT NULL,
PRIMARY KEY (`SNP`));

Then load the data.

LOAD DATA LOCAL INFILE 'C:/Users/alice/Documents/MySQL/LoadFileExample/gwas.csv' INTO TABLE gwas FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'C:/Users/alice/Documents/MySQL/LoadFileExample/replicate.csv' INTO TABLE replicate FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

Note 1: The original PLINK output was imported into Excel and exported as a csv file. That’s why I used FIELDS TERMINATED BY ‘,’ here.

Note 2: I deleted the first line (the header); if the header line is kept, you need to add IGNORE 1 LINES.

SELECT from table

select all from table ‘gwas’, order the output by ‘CHR’ and ‘BP’ (both in ascending order)

SELECT * FROM gwas ORDER BY CHR ASC, BP ASC;

Equivalent to (ascending order is default)

SELECT * FROM gwas ORDER BY CHR, BP;

select by condition

SELECT * FROM gwas WHERE P<0.00001;

order the output by ‘CHR’ and ‘BP’

SELECT * FROM gwas WHERE P<0.00001 ORDER BY CHR, BP;
SELECT * FROM gwas WHERE P<0.00001 AND gwas.OR>2 ORDER BY CHR, BP;

Note: ‘OR’ is a reserved keyword in MySQL. When selecting rows with ‘OR>2’, you need to prefix the column with the table name ‘gwas’ here.

JOIN two tables

SELECT * FROM gwas INNER JOIN replicate ON gwas.snp = replicate.snp;
SELECT * FROM gwas INNER JOIN replicate ON gwas.P<0.00001 AND replicate.P<0.005;

Equivalent to

SELECT * FROM gwas a INNER JOIN replicate b ON a.P<0.00001 AND b.P<0.005;

Further reading

http://felixfan.github.io/ddply-without-sort

Example data

m = data.frame(x = c(3, 2, 3, 4, 5), y = c(2, 1, 2, 1, 1))
m
x y
1 3 2
2 2 1
3 3 2
4 4 1
5 5 1

Do not preserve order of m

library(plyr)
ddply(m, .(y), transform, num = length(which(y == y)), result = x/length(which(y == y)))
x y num result
1 2 1   3 0.6667
2 4 1   3 1.3333
3 5 1   3 1.6667
4 3 2   2 1.5000
5 3 2   2 1.5000

Preserves order of m

keeping.order <- function(data, fn, ...) {
col <- ".sortColumn"
data[, col] <- 1:nrow(data)
out <- fn(data, ...)
if (!col %in% colnames(out))
stop("Ordering column not preserved by function")
out <- out[order(out[, col]), ]
out[, col] <- NULL
out
}

keeping.order(m, ddply, .(y), transform, num = length(which(y == y)), result = x/length(which(y == y)))
x y num result
4 3 2   2 1.5000
1 2 1   3 0.6667
5 3 2   2 1.5000
2 4 1   3 1.3333
3 5 1   3 1.6667

Further reading

http://felixfan.github.io/ddply

Download some data (stock price)

library(FinCal)
ohlc = get.ohlcs.yahoo(symbol = c("AAPL", "GOOG", "MSFT"), start = "2013-11-01",
end = "2013-11-07")
mydata = data.frame(date = ohlc$AAPL$date, Apple = ohlc$AAPL$adjusted, Google = ohlc$GOOG$adjusted,
Microsoft = ohlc$MSFT$adjusted)
mydata
date Apple Google Microsoft
1 2013-11-01 517.0   1027     35.26
2 2013-11-04 523.7   1026     35.67
3 2013-11-05 522.4   1022     36.36
4 2013-11-06 520.9   1023     37.89
5 2013-11-07 512.5   1008     37.22

reshape a data frame from wide to long

longData <- melt(your original data frame, id.vars = a vector of your category variables)
library(reshape2)
mydata = melt(mydata, id.vars = c("date"), measure.vars = c("Apple", "Google",
"Microsoft"), variable.name = "company", value.name = "price")
mydata
date   company   price
1  2013-11-01     Apple  517.01
2  2013-11-04     Apple  523.69
3  2013-11-05     Apple  522.40
4  2013-11-06     Apple  520.92
5  2013-11-07     Apple  512.49
6  2013-11-01    Google 1027.04
7  2013-11-04    Google 1026.11
8  2013-11-05    Google 1021.52
9  2013-11-06    Google 1022.75
10 2013-11-07    Google 1007.95
11 2013-11-01 Microsoft   35.26
12 2013-11-04 Microsoft   35.67
13 2013-11-05 Microsoft   36.36
14 2013-11-06 Microsoft   37.89
15 2013-11-07 Microsoft   37.22

or just

mydata = melt(mydata, c("date"))
mydata

ddply()

myresult <- ddply(mydata, .(name of the factor I'm splitting by, name of the second factor I'm splitting by), summarize OR transform, newcolumn = myfunction(name of the column I want the function to act upon))
library(plyr)
# split by company
ddply(mydata, .(company), summarize, bestPrice = max(price))
company bestPrice
1     Apple    523.69
2    Google   1027.04
3 Microsoft     37.89
# highest price in the entire data set (all company)
ddply(mydata, NULL, summarize, bestPrice = max(price))
.id bestPrice
1 <NA>      1027

summarize doesn’t give any information from other columns in the original data frame. If you want all the other column data, too, change summarize to transform

ddply(mydata, .(company), transform, bestPrice = max(price))
date   company   price bestPrice
1  2013-11-01     Apple  517.01    523.69
2  2013-11-04     Apple  523.69    523.69
3  2013-11-05     Apple  522.40    523.69
4  2013-11-06     Apple  520.92    523.69
5  2013-11-07     Apple  512.49    523.69
6  2013-11-01    Google 1027.04   1027.04
7  2013-11-04    Google 1026.11   1027.04
8  2013-11-05    Google 1021.52   1027.04
9  2013-11-06    Google 1022.75   1027.04
10 2013-11-07    Google 1007.95   1027.04
11 2013-11-01 Microsoft   35.26     37.89
12 2013-11-04 Microsoft   35.67     37.89
13 2013-11-05 Microsoft   36.36     37.89
14 2013-11-06 Microsoft   37.89     37.89
15 2013-11-07 Microsoft   37.22     37.89

ddply() lets you apply more than one function at a time.

ddply(mydata, .(company), transform, bestPrice = max(price), worstPrice = min(price))
date   company   price bestPrice worstPrice
1  2013-11-01     Apple  517.01    523.69     512.49
2  2013-11-04     Apple  523.69    523.69     512.49
3  2013-11-05     Apple  522.40    523.69     512.49
4  2013-11-06     Apple  520.92    523.69     512.49
5  2013-11-07     Apple  512.49    523.69     512.49
6  2013-11-01    Google 1027.04   1027.04    1007.95
7  2013-11-04    Google 1026.11   1027.04    1007.95
8  2013-11-05    Google 1021.52   1027.04    1007.95
9  2013-11-06    Google 1022.75   1027.04    1007.95
10 2013-11-07    Google 1007.95   1027.04    1007.95
11 2013-11-01 Microsoft   35.26     37.89      35.26
12 2013-11-04 Microsoft   35.67     37.89      35.26
13 2013-11-05 Microsoft   36.36     37.89      35.26
14 2013-11-06 Microsoft   37.89     37.89      35.26
15 2013-11-07 Microsoft   37.22     37.89      35.26

ddply() lets you apply your own function, e.g. when what you want is a new data frame with just the rows that have the highest price.

ddply(mydata, .(company), function(x) x[x$price == max(x$price), ])
date   company   price
1 2013-11-04     Apple  523.69
2 2013-11-01    Google 1027.04
3 2013-11-06 Microsoft   37.89

order data by one column (price)

ddply(mydata, .(price), transform, rank(price))
date   company   price
1  2013-11-01 Microsoft   35.26
2  2013-11-04 Microsoft   35.67
3  2013-11-05 Microsoft   36.36
4  2013-11-07 Microsoft   37.22
5  2013-11-06 Microsoft   37.89
6  2013-11-07     Apple  512.49
7  2013-11-01     Apple  517.01
8  2013-11-06     Apple  520.92
9  2013-11-05     Apple  522.40
10 2013-11-04     Apple  523.69
11 2013-11-07    Google 1007.95
12 2013-11-05    Google 1021.52
13 2013-11-06    Google 1022.75
14 2013-11-04    Google 1026.11
15 2013-11-01    Google 1027.04

or

arrange(mydata, price)  # plyr's arrange() sorts a data frame by the given column(s)

or

mydata[order(mydata$price), ]

Further reading

4 data wrangling tasks in R for advanced beginners

http://felixfan.github.io/trig-differ

Trig Cheat Sheet

original post

Basic Differentiation Formulas

original post

http://felixfan.github.io/ggplot2-remove-grid-background-margin

Generate data

library(ggplot2)
a <- seq(1, 20)
b <- a^0.25
df <- as.data.frame(cbind(a, b))

basic plot

myplot = ggplot(df, aes(x = a, y = b)) + geom_point()
myplot

plot of chunk ggplot-2-1

theme_bw() will get rid of the background

myplot + theme_bw()

plot of chunk ggplot-2-2

remove grid (does not remove backgroud colour and border lines)

myplot + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot of chunk ggplot-2-3

remove border lines (does not remove backgroud colour and grid lines)

myplot + theme(panel.border = element_blank())

plot of chunk ggplot-2-4

remove background (remove backgroud colour and border lines, but does not remove grid lines)

myplot + theme(panel.background = element_blank())

plot of chunk ggplot-2-5

add axis line

myplot + theme(axis.line = element_line(colour = "black"))

plot of chunk ggplot-2-6

put all together - method 1

myplot + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))

plot of chunk ggplot-2-8

put all together - method 2

myplot + theme_bw() + theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))

plot of chunk ggplot-2-9

Further reading

remove grid, background color and top and right borders from ggplot2

http://felixfan.github.io/ggplot2-book-part-4

Basic plot types

  • geom_area() draws an area plot, which is a line plot filled to the y-axis.
  • geom_bar(stat = “identity”) makes a barchart.
  • geom_line() makes a line plot.
  • geom_path() is similar to a geom_line, but lines are connected in the order they appear in the data, not from left to right.
  • geom_point() produces a scatterplot.
  • geom_polygon() draws polygons, which are filled paths.
  • geom_text() adds labels at the specified points.
  • geom_tile() makes an image plot or level plot.
library(ggplot2)
library(effects)
library(plyr)
diamonds = na.omit(diamonds)

df <- data.frame(x = c(3, 1, 5), y = c(2, 4, 6), label = c("a", "b", "c"))
p <- ggplot(df, aes(x, y, label = label)) + xlab(NULL) + ylab(NULL)
p + geom_point() + labs(title = "geom_point")

plot of chunk ggplot-part4-1

# Equivalent to p + geom_point() + ggtitle('geom_point')

# Reduce line spacing and use bold text
p + geom_point() + ggtitle("geom_point") + theme(plot.title = element_text(lineheight = 0.8,
face = "bold"))

plot of chunk ggplot-part4-1


p + geom_bar(stat = "identity") + labs(title = "geom_bar(stat=\"identity\")")

plot of chunk ggplot-part4-1

p + geom_line() + labs(title = "geom_line")

plot of chunk ggplot-part4-1

p + geom_area() + labs(title = "geom_area")

plot of chunk ggplot-part4-1

p + geom_path() + labs(title = "geom_path")

plot of chunk ggplot-part4-1

p + geom_text() + labs(title = "geom_text")

plot of chunk ggplot-part4-1

p + geom_tile() + labs(title = "geom_tile")

plot of chunk ggplot-part4-1

p + geom_polygon() + labs(title = "geom_polygon")

plot of chunk ggplot-part4-1

Displaying distributions

# Never rely on the default parameters to get a revealing view of the
# distribution.  Zooming in on the x axis, and selecting a smaller bin
# width, reveals far more detail.  We can see that the distribution is
# slightly skew-right.  Don't forget to include information about important
# parameters (like bin width) in the caption.
qplot(depth, data = diamonds, geom = "histogram")

plot of chunk ggplot-part4-2

qplot(depth, data = diamonds, geom = "histogram", xlim = c(55, 70), binwidth = 0.1)

plot of chunk ggplot-part4-2


# Three views of the distribution of depth and cut.  faceted histogram, a
# conditional density plot, and frequency polygons.  All show an interesting
# pattern: as quality increases, the distribution shifts to the left and
# becomes more symmetric.
depth_dist <- ggplot(diamonds, aes(depth)) + xlim(58, 68)
depth_dist + geom_histogram(aes(y = ..density..), binwidth = 0.1) + facet_grid(cut ~
.)

plot of chunk ggplot-part4-2

depth_dist + geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill")

plot of chunk ggplot-part4-2

depth_dist + geom_freqpoly(aes(y = ..density.., colour = cut), binwidth = 0.1)

plot of chunk ggplot-part4-2


# The boxplot geom can be used to see the distribution of a continuous
# variable conditional on a discrete variable like cut, or a continuous
# variable like carat.  For continuous variables, the group aesthetic must
# be set to get multiple boxplots.
qplot(cut, depth, data = diamonds, geom = "boxplot")

plot of chunk ggplot-part4-2

qplot(carat, depth, data = diamonds, geom = "boxplot", group = round_any(carat,
0.1, floor), xlim = c(0, 3))

plot of chunk ggplot-part4-2


# The jitter geom can be used to give a crude visualisation of 2d
# distributions with a discrete component.  Generally this works better for
# smaller datasets.  Car class vs. continuous variable city mpg and discrete
# variable drive train.
qplot(class, cty, data = mpg, geom = "jitter")

plot of chunk ggplot-part4-2

qplot(class, drv, data = mpg, geom = "jitter")

plot of chunk ggplot-part4-2


# The density plot is a smoothed version of the histogram.  It has desirable
# theoretical properties, but is more difficult to relate back to the data.
# A density plot of depth, coloured by cut
qplot(depth, data = diamonds, geom = "density", xlim = c(54, 70))

plot of chunk ggplot-part4-2

qplot(depth, data = diamonds, geom = "density", xlim = c(54, 70), fill = cut,
alpha = I(0.2))

plot of chunk ggplot-part4-2

Dealing with overplotting

df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y))
# the default shape
norm + geom_point()

plot of chunk ggplot-part4-3

# hollow points
norm + geom_point(shape = 1)

plot of chunk ggplot-part4-3

# pixel points
norm + geom_point(shape = ".")

plot of chunk ggplot-part4-3


# Using alpha blending to alleviate overplotting in sample data from a
# bivariate normal.  Alpha values from left to right: 1/3, 1/5, 1/10.
norm + geom_point(colour = "black", alpha = 1/3)

plot of chunk ggplot-part4-3

norm + geom_point(colour = "black", alpha = 1/5)

plot of chunk ggplot-part4-3

norm + geom_point(colour = "black", alpha = 1/10)

plot of chunk ggplot-part4-3


# A plot of table vs. depth from the diamonds data, showing the use of
# jitter and alpha blending to alleviate overplotting in discrete data.
td <- ggplot(diamonds, aes(table, depth)) + xlim(50, 70) + ylim(50, 70)
# geom point
td + geom_point()

plot of chunk ggplot-part4-3

# geom jitter with default jitter
td + geom_jitter()

plot of chunk ggplot-part4-3

# geom jitter with horizontal jitter of 0.5 (half the gap between bands)
jit <- position_jitter(width = 0.5)
td + geom_jitter(position = jit)

plot of chunk ggplot-part4-3

td + geom_jitter(position = jit, colour = "black", alpha = 1/10)

plot of chunk ggplot-part4-3

td + geom_jitter(position = jit, colour = "black", alpha = 1/50)

plot of chunk ggplot-part4-3

td + geom_jitter(position = jit, colour = "black", alpha = 1/200)

plot of chunk ggplot-part4-3

Drawing maps

# Example using the borders function.
library(maps)
data(us.cities)
big_cities <- subset(us.cities, pop > 5e+05)
# All cities with population (as of January 2006) of greater than half a
# million
qplot(long, lat, data = big_cities) + borders("state", size = 0.5)

plot of chunk ggplot-part4-4

# cities in Texas.
tx_cities <- subset(us.cities, country.etc == "TX")
ggplot(tx_cities, aes(long, lat)) + borders("county", "texas", colour = "grey70") +
geom_point(colour = "black", alpha = 0.5)

plot of chunk ggplot-part4-4

Further reading

http://felixfan.github.io/ggplot2-book-part-3

Build a plot layer by layer

Layers

library(ggplot2)
data(Oxboys, package = "nlme")
diamonds = na.omit(diamonds)
msleep = na.omit(msleep)
mtcars = na.omit(mtcars)
Oxboys = na.omit(Oxboys)
p <- ggplot(diamonds, aes(carat, price, colour = cut))
# This plot object cannot be displayed until we add a layer
p <- p + layer(geom = "point")
p

plot of chunk ggplot-part3-1


# Here is what a more complicated call looks like.  It produces a histogram
# coloured “steelblue” with a bin width of 2.  A histogram is a combination of
# bars and binning
p <- ggplot(diamonds, aes(x = carat))
p <- p + layer(geom = "bar", geom_params = list(fill = "steelblue"), stat = "bin",
stat_params = list(binwidth = 2))
p

plot of chunk ggplot-part3-1

# same as the following command
ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 2, fill = "steelblue")

# The following example shows the equivalence between these two ways of
# making plots

ggplot(msleep, aes(sleep_rem/sleep_total, awake)) + geom_point()
# which is equivalent to
qplot(sleep_rem/sleep_total, awake, data = msleep)

# You can add layers to qplot too:
qplot(sleep_rem/sleep_total, awake, data = msleep) + geom_smooth()
# This is equivalent to
qplot(sleep_rem/sleep_total, awake, data = msleep, geom = c("point", "smooth"))
# or
ggplot(msleep, aes(sleep_rem/sleep_total, awake)) + geom_point() + geom_smooth()
# plot objects can be stored as variables. The summary function can be
# helpful for inspecting the structure of a plot without plotting it
p <- ggplot(msleep, aes(sleep_rem/sleep_total, awake))
summary(p)
data: name, genus, vore, order, conservation, sleep_total,
sleep_rem, sleep_cycle, awake, brainwt, bodywt [20x11]
mapping:  x = sleep_rem/sleep_total, y = awake
faceting: facet_null()
p <- p + geom_point()
summary(p)
data: name, genus, vore, order, conservation, sleep_total,
sleep_rem, sleep_cycle, awake, brainwt, bodywt [20x11]
mapping:  x = sleep_rem/sleep_total, y = awake
faceting: facet_null()
-----------------------------------
geom_point: na.rm = FALSE
stat_identity:
position_identity: (width = NULL, height = NULL)

# a set of plots can be initialised using different data then enhanced with
# the same layer

bestfit <- geom_smooth(method = "lm", se = F, colour = "steelblue", alpha = 0.5,
size = 2)

qplot(sleep_rem, sleep_total, data = msleep) + bestfit

plot of chunk ggplot-part3-2

qplot(awake, brainwt, data = msleep, log = "y") + bestfit

plot of chunk ggplot-part3-2

qplot(bodywt, brainwt, data = msleep, log = "xy") + bestfit

plot of chunk ggplot-part3-2

Data

# You can replace the old dataset with %+%
p <- ggplot(mtcars, aes(mpg, wt, colour = cyl)) + geom_point()
p

plot of chunk ggplot-part3-3

mtcars <- transform(mtcars, mpg = mpg^2)
p %+% mtcars

plot of chunk ggplot-part3-3

Aesthetic mappings

Plots and layers
# The **aes** function takes a list of aesthetic-variable pairs aes(x =
# weight, y = height, colour = age)
p <- ggplot(mtcars, aes(x = mpg, y = wt))
p + geom_point()

plot of chunk ggplot-part3-4


# Overriding aesthetics.
p + geom_point(aes(colour = factor(cyl)))

plot of chunk ggplot-part3-4

# overriding y-position (now y is 'disp', although the y label is still 'wt')
p + geom_point(aes(y = disp))

plot of chunk ggplot-part3-4

Setting vs. mapping
# The difference between setting colour to 'darkblue' and mapping colour to
# 'darkblue'.
p <- ggplot(mtcars, aes(mpg, wt))
p + geom_point(colour = "darkblue")  # setting

plot of chunk ggplot-part3-5

# This sets the point colour to be dark blue instead of black. This is quite
# different than
p + geom_point(aes(colour = "darkblue"))  # mapping

plot of chunk ggplot-part3-5

# qplot
qplot(mpg, wt, data = mtcars, colour = I("darkblue"))  # setting
qplot(mpg, wt, data = mtcars, colour = "darkblue")  # mapping
Grouping
Multiple groups, one aesthetic
# Correctly specifying produces one line per subject.
p <- ggplot(Oxboys, aes(age, height, group = Subject)) + geom_line()
p

plot of chunk ggplot-part3-6

qplot(age, height, data = Oxboys, group = Subject, geom = "line")
# A single line connects all observations.  This pattern is characteristic
# of an **incorrect** grouping aesthetic, and is what we see if the group
# aesthetic is omitted, which in this case is equivalent to group = 1
ggplot(Oxboys, aes(age, height, group = 1)) + geom_line()

plot of chunk ggplot-part3-7

qplot(age, height, data = Oxboys, geom = "line")
Different groups on different layers
# Adding smooths to the Oxboys data. Using the same grouping as the lines
# results in a line of best fit for each boy.
p + geom_smooth(aes(group = Subject), method = "lm", se = F)

plot of chunk ggplot-part3-8

# or
qplot(age, height, data = Oxboys, group = Subject, geom = "line") + geom_smooth(method = "lm",
se = F)
# Using aes(group = 1) in the smooth layer fits a single line of best fit
# across all boys.
p + geom_smooth(aes(group = 1), method = "lm", size = 2, se = F)

plot of chunk ggplot-part3-9

qplot(age, height, data = Oxboys, group = Subject, geom = "line") + geom_smooth(aes(group = 1),
method = "lm", size = 2, se = F)

Further reading

http://felixfan.github.io/ggplot2-book-part-2

Mastering the grammar

library(ggplot2)
mpg = na.omit(mpg)
# The fuel economy dataset, mpg, records make, model, class, engine size,
# transmission and fuel economy for a selection of US cars in 1999 and 2008

# A scatterplot of engine displacement in litres (displ) vs.  average
# highway miles per gallon (hwy).  # Points are coloured according to number
# of cylinders.  This plot summarises the most important factor governing
# fuel economy: engine size.
qplot(displ, hwy, data = mpg, colour = factor(cyl))

plot of chunk ggplot2-part2-1


# Instead of using points to represent the data, we could use other geoms
# like lines (left) or bars (right).  Neither of these geoms makes sense for
# this data, but they are still grammatically valid.
qplot(displ, hwy, data = mpg, colour = factor(cyl), geom = "line") + theme(legend.position = "none")

plot of chunk ggplot2-part2-1

qplot(displ, hwy, data = mpg, colour = factor(cyl), geom = "bar", stat = "identity",
position = "identity") + theme(legend.position = "none")

plot of chunk ggplot2-part2-1


# More complicated plots don't have their own names. This plot overlays a
# per group regression line on the existing plot.  What would you call this
# plot?
qplot(displ, hwy, data = mpg, colour = factor(cyl)) + geom_smooth(data = subset(mpg,
cyl != 5), method = "lm")

plot of chunk ggplot2-part2-1


# A more complex plot with facets and multiple layers.
qplot(displ, hwy, data = mpg, facets = . ~ year) + geom_smooth()

plot of chunk ggplot2-part2-1


# Examples of legends from four different scales: continuous variable
# mapped to size and to colour, discrete variable mapped to shape and to
# colour.  The ordering of scales seems upside-down, but this matches the
# labelling of the y-axis: small values occur at the bottom.
x <- 1:10
y <- factor(letters[1:5])
qplot(x, x, size = x)

plot of chunk ggplot2-part2-1

qplot(x, x, 1:10, colour = x)

plot of chunk ggplot2-part2-1

qplot(y, y, 1:10, shape = y)

plot of chunk ggplot2-part2-1

qplot(y, y, 1:10, colour = y)

plot of chunk ggplot2-part2-1


# Examples of axes and grid lines for three coordinate systems: Cartesian,
# semi-log and polar.  The polar coordinate system illustrates the
# difficulties associated with non-Cartesian coordinates: it is hard to draw
# the axes well.
x1 <- c(1, 10)
y1 <- c(1, 5)
p <- qplot(x1, y1, geom = "blank", xlab = NULL, ylab = NULL) + theme_bw()
p

plot of chunk ggplot2-part2-1

p + coord_trans(y = "log10")

plot of chunk ggplot2-part2-1

p + coord_polar()

plot of chunk ggplot2-part2-1


p <- qplot(displ, hwy, data = mpg, colour = factor(cyl))
summary(p)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy,
fl, class [234x11]
mapping:  colour = factor(cyl), x = displ, y = hwy
faceting: facet_null()
-----------------------------------
geom_point:
stat_identity:
position_identity: (width = NULL, height = NULL)
# Save plot object to disk
save(p, file = "plot.rdata")
# Load from disk
load("plot.rdata")
# Save png to disk
ggsave("plot.png", width = 5, height = 5)

Further reading

http://felixfan.github.io/ggplot2-book-part-1

Terminology

  • The data that you want to visualise
  • Geometric objects, geoms for short, represent what you actually see on the plot: points, lines, polygons, etc.

  • Statistical transformations, stats for short, summarise data in many useful ways. They are optional, but very useful.

  • The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape.

  • A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to make it possible to read the graph.

  • A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples.
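
As a sketch of how these components fit together in a single call (using the diamonds data that ships with ggplot2), every part of the grammar can be spelled out explicitly, even though the defaults usually hide most of them:

library(ggplot2)
ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +  # data + aesthetic mappings
geom_point() +                                               # geometric object (with its default stat)
scale_y_continuous() +                                       # scale
coord_cartesian() +                                          # coordinate system
facet_wrap(~clarity)                                         # faceting specification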

Getting started with qplot

library(ggplot2)
diamonds <- na.omit(diamonds)  # remove rows with NA
dim(diamonds)
[1] 53940    10
set.seed(9999)  # Make the sample reproducible
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

Basic use

# the first two arguments give the x- and y-coordinates
qplot(carat, price, data = diamonds)

plot of chunk ggplot2-part1-1


# The relationship looks exponential, we’d like to do is to transform the
# variables
qplot(log(carat), log(price), data = diamonds)

plot of chunk ggplot2-part1-1


# Arguments can also be combinations of existing variables
qplot(carat, x * y * z, data = diamonds)

plot of chunk ggplot2-part1-1

Colour, size, shape and other aesthetic attributes

# Mapping point colour to diamond colour
qplot(carat, price, data = dsmall, colour = color)

plot of chunk ggplot2-part1-2


# Mapping point shape to cut quality (right).
qplot(carat, price, data = dsmall, shape = cut)

plot of chunk ggplot2-part1-2


# Reducing the alpha value to 1/10 to makes it possible to see where the
# bulk of the points lie.  the denominator specifies the number of points
# that must overplot to get a completely opaque colour.
qplot(carat, price, data = diamonds, alpha = I(1/10))

plot of chunk ggplot2-part1-2


# Reducing the alpha value to 1/100,
qplot(carat, price, data = diamonds, alpha = I(1/100))

plot of chunk ggplot2-part1-2

Plot geoms

  • geom = “point” draws points to produce a scatterplot; this is the default when both x and y are given.
  • geom = “smooth” fits a smoother to the data and displays the smooth and its standard error
  • geom = “boxplot” produces a box-and-whisker plot
  • geom = “path” and geom = “line” draw lines between the data points.
  • geom = “histogram” draws a histogram. default
  • geom =”freqpoly” a frequency polygon.
  • geom = “density” creates a density plot
  • geom = “bar” makes a bar chart
Adding a smoother to a plot
library(splines)
library(mgcv)
# Smooth curves add to scatterplots The geoms will be overlaid in the order
# in which they appear.
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "gam",
formula = y ~ s(x, bs = "cs"))

plot of chunk ggplot2-part1-3


# The effect of the span parameter. The wiggliness of the line is controlled
# by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not
# so wiggly)
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), span = 0.2)

plot of chunk ggplot2-part1-3

qplot(carat, price, data = dsmall, geom = c("point", "smooth"), span = 1)

plot of chunk ggplot2-part1-3


# more methods are available: loess, gam, lm, rlm

# The effect of the formula parameter, using a generalised additive model as
# a smoother.
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "gam",
formula = y ~ s(x))

plot of chunk ggplot2-part1-3

# default when there are more than 1,000 points
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "gam",
formula = y ~ s(x, bs = "cs"))

plot of chunk ggplot2-part1-3


# The effect of the formula parameter using a linear model as a smoother.
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "lm")

plot of chunk ggplot2-part1-3

# natural splines instead of the default straight-line formula y ~ x
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "lm",
formula = y ~ ns(x, 5))

plot of chunk ggplot2-part1-3

Boxplots and jittered points
# Using jittering and boxplots to investigate the distribution of price per
# carat, conditional on colour.  As the colour improves (from left to right)
# the spread of values decreases, but there is little change in the centre
# of the distribution.
qplot(color, price/carat, data = diamonds, geom = "jitter")

plot of chunk ggplot2-part1-4

qplot(color, price/carat, data = diamonds, geom = "boxplot")

plot of chunk ggplot2-part1-4


# Varying the alpha level.  From left to right: $1/5$, $1/50$, $1/200$.  As
# the opacity decreases we begin to see where the bulk of the data lies.
# However, the boxplot still does much better.
qplot(color, price/carat, data = diamonds, geom = "jitter", alpha = I(1/5))

plot of chunk ggplot2-part1-4

qplot(color, price/carat, data = diamonds, geom = "jitter", alpha = I(1/50))

plot of chunk ggplot2-part1-4

qplot(color, price/carat, data = diamonds, geom = "jitter", alpha = I(1/200))

plot of chunk ggplot2-part1-4

Histogram and density plots
# Displaying the distribution of diamonds.
qplot(carat, data = diamonds, geom = "histogram")

plot of chunk ggplot2-part1-5

qplot(carat, data = diamonds, geom = "density")

plot of chunk ggplot2-part1-5


# For the density plot, the **adjust** argument controls the degree of
# smoothness (high values of adjust produce smoother plots). For the
# histogram, the **binwidth** argument controls the amount of smoothing by
# setting the bin size.
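# For example (a small sketch, not in the original post), adjust can be
# passed through qplot to the density geom:
qplot(carat, data = diamonds, geom = "density", adjust = 0.5)  # wigglier estimate
qplot(carat, data = diamonds, geom = "density", adjust = 2)  # smoother estimate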

# Varying the bin width on a histogram of carat reveals interesting
# patterns.  Binwidths from left to right: 1, 0.1 and 0.01 carats.  Only
# diamonds between 0 and 3 carats shown.
qplot(carat, data = diamonds, geom = "histogram", binwidth = 1, xlim = c(0,
3))

plot of chunk ggplot2-part1-5

qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.1, xlim = c(0,
3))

plot of chunk ggplot2-part1-5

qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01, xlim = c(0,
3))

plot of chunk ggplot2-part1-5


# Mapping a categorical variable to an aesthetic will automatically split up
# the geom by that variable.  Density plots are overlaid
qplot(carat, data = diamonds, geom = "density", colour = color)

plot of chunk ggplot2-part1-5

# histograms are stacked.
qplot(carat, data = diamonds, geom = "histogram", fill = color)

plot of chunk ggplot2-part1-5

Bar charts
# Bar charts of diamond colour.  The first plot is a simple bar chart of
# diamond colour, and the second is a bar chart of diamond colour weighted
# by carat.
qplot(color, data = diamonds, geom = "bar")

plot of chunk ggplot2-part1-6

qplot(color, data = diamonds, geom = "bar", weight = carat) + scale_y_continuous("carat")

plot of chunk ggplot2-part1-6

Time series with line and path plots
# Line plots join the points from left to right, while path plots join them
# in the order that they appear in the dataset

# Two time series measuring amount of unemployment.

# Percent of population that is unemployed
qplot(date, unemploy/pop, data = economics, geom = "line")

plot of chunk ggplot2-part1-7

# median number of weeks unemployed.
qplot(date, uempmed, data = economics, geom = "line")

plot of chunk ggplot2-part1-7


# Path plots illustrating the relationship between percent of people
# unemployed and median length of unemployment.
year <- function(x) as.POSIXlt(x)$year + 1900

# Scatterplot with overlaid path.
qplot(unemploy/pop, uempmed, data = economics, geom = c("point", "path"))

plot of chunk ggplot2-part1-7

# Pure path plot coloured by year.
qplot(unemploy/pop, uempmed, data = economics, geom = "path", colour = year(date)) +
scale_size_area()

plot of chunk ggplot2-part1-7

Faceting
# It creates tables of graphics by splitting the data into subsets and
# displaying the same graph for each subset in an arrangement that
# facilitates comparison

# The density plot makes it easier to compare distributions ignoring the
# relative abundance of diamonds within each colour. High-quality diamonds
# (colour D) are skewed towards small sizes, and as quality declines the
# distribution becomes more flat.

# Histograms showing the distribution of carat conditional on colour.  Bars
# show counts
qplot(carat, data = diamonds, facets = color ~ ., geom = "histogram", binwidth = 0.1,
xlim = c(0, 3))

plot of chunk ggplot2-part1-8

# bars show densities (proportions of the whole).
qplot(carat, ..density.., data = diamonds, facets = color ~ ., geom = "histogram",
binwidth = 0.1, xlim = c(0, 3))

plot of chunk ggplot2-part1-8

Other options
  • xlim, ylim: set limits for the x- and y-axes
  • log: a character vector indicating which (if any) axes should be logged. For example, log=”x” will log the x-axis, log=”xy” will log both.
  • main: main title for the plot, centered in large text at the top of the plot.
  • xlab, ylab: labels for the x- and y-axes.
qplot(carat, price, data = dsmall, xlab = "Price ($)", ylab = "Weight (carats)",
main = "Price-weight relationship")

plot of chunk ggplot2-part1-9

qplot(carat, price/carat, data = dsmall, ylab = expression(frac(price, carat)),
xlab = "Weight (carats)", main = "Small diamonds", xlim = c(0.2, 1))

plot of chunk ggplot2-part1-9

qplot(carat, price, data = dsmall, log = "xy")

plot of chunk ggplot2-part1-9

Further reading

http://felixfan.github.io/formatting-plots-for-pubs

First, make your plot.

library(ggplot2)
str(msleep)
'data.frame':	83 obs. of  11 variables:
$ name        : chr  "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
$ genus       : chr  "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
$ vore        : Factor w/ 4 levels "carni","herbi",..: 1 4 2 4 2 2 1 NA 1 2 ...
$ order       : chr  "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
$ conservation: Factor w/ 7 levels "","cd","domesticated",..: 5 NA 6 5 3 NA 7 NA 3 5 ...
$ sleep_total : num  12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
$ sleep_rem   : num  NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
$ sleep_cycle : num  NA NA NA 0.133 0.667 ...
$ awake       : num  11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
$ brainwt     : num  NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
$ bodywt      : num  50 0.48 1.35 0.019 600 ...
# remove rows with missing values
msleep <- na.omit(msleep)

Let’s say we have written a groundbreaking paper on the relationship between body size and sleep time. Therefore, we want to present a plot of the log of body weight by the total sleep time.

sleepplot = ggplot(data = msleep, aes(x = log(bodywt), y = sleep_total)) + geom_point(aes(color = vore))
sleepplot

plot of chunk ggplot-1

We made a beautiful model of this relationship

slp = lm(sleep_total ~ log(bodywt), data = msleep)
summary(slp)

Call:
lm(formula = sleep_total ~ log(bodywt), data = msleep)

Residuals:
Min     1Q Median     3Q    Max
-6.47  -2.20   0.44   1.29   7.10

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   11.327      0.825   13.72  5.6e-11 ***
log(bodywt)   -0.800      0.243   -3.29    0.004 **
---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.69 on 18 degrees of freedom
Multiple R-squared:  0.376,	Adjusted R-squared:  0.341
F-statistic: 10.8 on 1 and 18 DF,  p-value: 0.00404

Let’s put the model on the plot

sleepplot = sleepplot + geom_abline(intercept = coef(slp)[1], slope = coef(slp)[2])
sleepplot

plot of chunk ggplot-2

It’s beautiful! I love it! Unfortunately, you want to submit to Science (you might as well aim high), and this is what they say about figures

So we have several problems:

  • gray background
  • Poor labels (need units, capital letters, larger font on axes)
  • Poor legend
  • Poor color scheme (avoid red and green together, more contrast needed)
  • Not correct file format or resolution (want a PDF with at least 600 dpi)

First make the labels a little more useful.

sleepplot = sleepplot + labs(x = "Log body weight (Kg)", y = "Time asleep (hrs/day)")
sleepplot

plot of chunk ggplot-3

Now let’s fix the legend. You would think you do this with some sort of “legend” command, but no, what you are looking for is “scale”.

sleepplot + scale_color_discrete(name = "Functional\n feeding group", labels = c("carnivore",
"herbivore", "insectivore", "omnivore"))

plot of chunk ggplot-4

If you haven’t figured it out yet, putting “\n” in a text string gives you a line break. It took me WAY too long to discover that.

ggplot automatically gives you evenly spaced hues for color variations, but this is not necessarily the best way to get a good contrasting color scheme. You may want to try scale_color_brewer for better contrasts. See http://colorbrewer2.org for more information.

sleepplot + scale_color_brewer(name = "Functional \n feeding group", labels = c("carnivore",
"herbivore", "insectivore", "omnivore"), type = "qual", palette = 1)

plot of chunk ggplot-5

Color figures cost an extra $700 on top of the normal page charges! Let’s try something else. This time we will vary the feeding groups by shapes instead of colors.

sleepplot2 = ggplot(data = msleep, aes(x = log(bodywt), y = sleep_total)) +
geom_point(aes(shape = vore), size = 3) + geom_abline(intercept = coef(slp)[1],
slope = coef(slp)[2])
sleepplot2

plot of chunk ggplot-6

Now to fix the labels and legend again. We will use scale_shape_discrete instead of scale_color_discrete.

sleepplot2 = sleepplot2 + labs(x = "Log body weight (Kg)", y = "Time asleep (hrs/day)") +
scale_shape_discrete(name = "Functional \n feeding group", labels = c("carnivore",
"herbivore", "insectivore", "omnivore"))
sleepplot2

plot of chunk ggplot-7

Now, let’s work on how the plot looks overall. ggplot uses “themes” to adjust the plot’s appearance without changing the actual presentation of the data.

# sleepplot2 + theme_bw(base_size=12, base_family = 'Helvetica')
sleepplot2 + theme_bw()  # on windows

plot of chunk ggplot-8

theme_bw() will get rid of the background, and gives you options to change the font. Science recommends Helvetica, which happens to be R’s default, but we will specify it here anyway. Check out the other fonts with ??postscriptFonts. For even more fonts, see the {extrafont} package. Other pre-set themes can change the look of your plot.

sleepplot2 + theme_minimal()

plot of chunk ggplot-9

sleepplot2 + theme_classic()

plot of chunk ggplot-10

For more themes:

library(ggthemes)

If you want to publish in the Wall Street Journal

sleepplot2 + theme_wsj()

plot of chunk ggplot-11

But we want to publish in Science, not the Wall Street Journal, so let’s get back to our black and white theme.

sleepplot2 = sleepplot2 + theme_bw(base_size = 12, base_family = "Helvetica")
sleepplot2

plot of chunk ggplot-12

You can’t really see the gridlines with the bw theme, so we are going to tweak the pre-set theme using the theme() function. theme() controls nearly every aspect of how the plot looks; see ?theme.

sleepplot2 +
#increase size of gridlines
theme(panel.grid.major = element_line(size = .5, color = "grey"),
#increase size of axis lines
axis.line = element_line(size=.7, color = "black"),
#Adjust legend position to maximize space, use a vector of proportion
#across the plot and up the plot where you want the legend.
#You can also use "left", "right", "top", "bottom", for legends on the side of the plot
legend.position = c(.85,.7),
#increase the font size
text = element_text(size=14))

plot of chunk ggplot-13

You can save this theme for later use

science_theme = theme(panel.grid.major = element_line(size = 0.5, color = "grey"),
axis.line = element_line(size = 0.7, color = "black"), legend.position = c(0.85,
0.7), text = element_text(size = 14))
sleepplot2 = sleepplot2 + science_theme
sleepplot2

plot of chunk ggplot-14

That looks pretty good. Now we need to get it exported properly. The instructions say the figure should be sized to fit in one or two columns (2.3 or 4.6 inches), so we want them to look good at that resolution.

pdf(file = "sleepplot.pdf", width= 6, height = 4, #' see how it looks at this size
useDingbats=F) #I have had trouble when uploading figures with digbats before, so I don't use them
sleepplot2 #print our plot
dev.off() #stop making pdfs

A few other tricks to improve the look of your plots: Let’s say we are grouping things by categories instead of a regression.

sleepcat = ggplot(msleep, aes(x = vore, y = sleep_total, color = conservation))
sleepcat + geom_point()

plot of chunk ggplot-15

It’s hard to see what’s going on there, so we can jitter the points to make them more visible.

sleepcat + geom_point(position = position_jitter(w = 0.1))

plot of chunk ggplot-16

Maybe this would be better with averages and error bars instead of every point:

library(plyr)
msleepave = ddply(msleep, .(vore, conservation), summarize, meansleep = mean(sleep_total),
sdsleep = sd(sleep_total)/sqrt(22))
sleepmean = ggplot(msleepave, aes(x = vore, y = meansleep, color = conservation))
#' Plot it with means and error bars of +/- sdsleep (computed above as a standard error)
sleepmean + geom_point() + geom_errorbar(aes(ymax = meansleep + sdsleep, ymin = meansleep -
sdsleep), width = 0.2)

plot of chunk ggplot-17

#' Spread them out, but in an orderly fashion this time, with position_dodge rather than jitter
sleepmean + geom_point(position = position_dodge(width = 0.5, height = 0), size = 2) +
geom_errorbar(aes(ymax = meansleep + sdsleep, ymin = meansleep - sdsleep),
position = position_dodge(width = 0.5, height = 0), width = 0.5)

plot of chunk ggplot-18

Note that dodging the points gives the conservation status in the same order for each feeding type category. A little more organized. Some other things you might want to do with formatting:

Add annotation to the plot

sleepplot2 + annotate("text", label = "R2 = 0.999", x = -4, y = 17)

plot of chunk ggplot-19

Let’s put that annotation in italics

sleepplot2 + annotate("text", label = "R2 = 0.999", x = -4, y = 17, fontface = 3)

plot of chunk ggplot-20

NOW. Let’s put half that annotation in italics, the other half plain, then insert five greek characters and rotate it 90 degrees! OR we can beat our head against a wall until it explodes and export our plot into an actual graphics program.

Not everything has to be done in R. ‘SVG’ files are vector graphic files that can be easily edited in the FREE GUI-based program Inkscape. Make an SVG and you can edit it by hand for final tweaks. Inkscape can also edit and export PDFs.

svg(filename = "sleepplot.svg", width = 6, height = 4)
sleepplot2
dev.off()

original post

http://felixfan.github.io/download-files

Download a file

require(RCurl)
myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/test.txt", ssl.verifypeer = FALSE)
myData <- read.csv(textConnection(myCsv))
myData
test   class
1    1    case
2    2 control

Downloading multiple files from FTP server

url = "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1297/suppl/"
filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
filenames <- strsplit(filenames, "\r\n")
filenames = unlist(filenames)
filenames
[1] "filelist.txt"    "GSE1297_RAW.tar"
for (filename in filenames) {
download.file(paste(url, filename, sep = ""), paste(getwd(), "/", filename,
sep = ""))
}

Load a given URL into a WWW browser

browseURL("http://cran.r-project.org/web/packages/FinCal/index.html")
http://felixfan.github.io/Pandoc-LaTeX-Chinese

1. R Markdown

R Markdown “is a format that enables easy authoring of reproducible web reports from R. It combines the core syntax of Markdown (an easy-to-write plain text format for web content) with embedded R code chunks that are run so their output can be included in the final document.”
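
As a minimal sketch (the file name and contents here are hypothetical), a test.Rmd file is just plain Markdown with R code chunks between ```{r} and ``` fences:

# A small report

Some plain Markdown text describing the analysis.

```{r cars-example}
summary(cars)
plot(cars)
```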

2. Using R Markdown with RStudio

RStudio IDE is a powerful and productive user interface for R. An overview of how to use R Markdown within RStudio is available here

3. Convert R Markdown to Markdown

knitr can convert R Markdown (.Rmd) files into plain Markdown (.md) files. With RStudio, you only need to click the ‘Knit HTML’ button.

You can also use the following command:

library(knitr)
knit("test.Rmd")

4. Convert Markdown files into PDF via LaTeX using Pandoc

4.1 Install Pandoc and MiKTeX

We need to install the following tools first.

Pandoc can convert files from one markup format into another, see the webpage for more details.

MiKTeX is an up-to-date implementation of TeX/LaTeX and related programs for Windows.

4.2 Convert Markdown files into PDF

pandoc -o test.pdf test.md

Run Pandoc from Rstudio directly:

system("pandoc -o test.pdf test.md")

Note: you may need to add the path of ‘pandoc’ to your Environment Variables.
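
Alternatively, you can extend the PATH for the current R session only (a hedged sketch; the Pandoc install location below is hypothetical and will differ on your machine):

Sys.setenv(PATH = paste(Sys.getenv("PATH"), "C:/Program Files (x86)/Pandoc", sep = ";"))
system("pandoc -o test.pdf test.md")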

4.3 When Markdown files include Chinese

Run Pandoc from Rstudio directly:

system("pandoc -o test.pdf test.md  --latex-engine=xelatex --template=pm-template-felix.latex")

Note: MiKTeX needs to install several packages the first time you run this command; this may take several minutes.

Note: You need to put pm-template-felix.latex in the same directory as the .md files. ‘pm-template-felix.latex’ is a revised version of pm-template.latex

Note: the template’s default font is SimSun (宋体); to use a different font, simply replace SimSun in the template. More fonts are available.

http://felixfan.github.io/plyr-part2

Data Used

library(plyr)
myCars = cbind(vehicle = row.names(mtcars), mtcars)
row.names(myCars) = NULL
myCars
vehicle  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

arrange() Order a data frame by its columns

# sort myCars data by cylinder and displacement
myCars1 = arrange(myCars, cyl, disp)
myCars1
vehicle  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1       Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
2          Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
3             Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
4            Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
5         Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
6           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
7        Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
8        Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
9           Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
10            Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
11           Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
12        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
13           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
14       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
15            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
16           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
17             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
18      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
19          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
20          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
21         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
22       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
24    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
25          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
26      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
27   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
28          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
29    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
30   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
31 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
32  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
# Sort with displacement in descending order
myCars2 = arrange(myCars, cyl, desc(disp))
myCars2
vehicle  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
3           Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
4        Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
5        Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
6           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
7         Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
8            Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
9             Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
10         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
11      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
12      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
13             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
14            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
15           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
16           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
17       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
18        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
19  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
20 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
21   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
22    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
23   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
24          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
25      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
26          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
27    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
28         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
29       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
30          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
31          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
32         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3

colwise() Column-wise function

numcolwise(), catcolwise()

# Count number of missing values
nmissing <- function(x) sum(is.na(x))

# Apply to every column in a data frame
colwise(nmissing)(baseball)
id year stint team lg g ab r h X2b X3b hr rbi  sb   cs bb   so  ibb hbp
1  0    0     0    0  0 0  0 0 0   0   0  0  12 250 4525  0 1305 7528 377
sh   sf gidp
1 960 7390 5272

# To operate only on specified columns
baseball1 = ddply(baseball, .(year), colwise(nmissing, c("sb", "cs", "so")))
head(baseball1)
year sb cs so
1 1871  0  0  0
2 1872  0  0  0
3 1873  0  0  0
4 1874  0  0  0
5 1875  0  0  0
6 1876 15 15  0

# specify a boolean function that determines whether or not a column should
# be included
baseball2 = ddply(baseball, .(year), colwise(nmissing, is.character))
head(baseball2)
year id team lg
1 1871  0    0  0
2 1872  0    0  0
3 1873  0    0  0
4 1874  0    0  0
5 1875  0    0  0
6 1876  0    0  0

baseball3 = ddply(baseball, .(year), colwise(nmissing, is.numeric))
head(baseball3)
year stint g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
1 1871     0 0  0 0 0   0   0  0   0  0  0  0  0   7   7  7  7    7
2 1872     0 0  0 0 0   0   0  0   0  0  0  0  0  13  13 13 13   13
3 1873     0 0  0 0 0   0   0  0   0  0  0  0  0  13  13 13 13   13
4 1874     0 0  0 0 0   0   0  0   0  0  0  0  0  15  15 15 15   15
5 1875     0 0  0 0 0   0   0  0   0  0  0  0  0  17  17 17 17   17
6 1876     0 0  0 0 0   0   0  0   0 15 15  0  0  15  15 15 15   15
# or numcolwise()
baseball4 = ddply(baseball, .(year), numcolwise(nmissing))
head(baseball4)
year stint g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
1 1871     0 0  0 0 0   0   0  0   0  0  0  0  0   7   7  7  7    7
2 1872     0 0  0 0 0   0   0  0   0  0  0  0  0  13  13 13 13   13
3 1873     0 0  0 0 0   0   0  0   0  0  0  0  0  13  13 13 13   13
4 1874     0 0  0 0 0   0   0  0   0  0  0  0  0  15  15 15 15   15
5 1875     0 0  0 0 0   0   0  0   0  0  0  0  0  17  17 17 17   17
6 1876     0 0  0 0 0   0   0  0   0 15 15  0  0  15  15 15 15   15

baseball5 = ddply(baseball, .(year), colwise(nmissing, is.discrete))
head(baseball5)
year id team lg
1 1871  0    0  0
2 1872  0    0  0
3 1873  0    0  0
4 1874  0    0  0
5 1875  0    0  0
6 1876  0    0  0
# or catcolwise()
baseball6 = ddply(baseball, .(year), catcolwise(nmissing))
head(baseball6)
year id team lg
1 1871  0    0  0
2 1872  0    0  0
3 1873  0    0  0
4 1874  0    0  0
5 1875  0    0  0
6 1876  0    0  0

count() Count the number of occurrences.

# Count of each value of 'id' in the first 100 cases
count(baseball[1:100, ], vars = "id")
id freq
1  ansonca01    8
2  bennech01    1
3  burdoja01    7
4  forceda01    9
5  galvipu01    1
6  gerhajo01    5
7  hinespa01    6
8  jonesch01    6
9  mathebo01    7
10 morrijo01    2
11 nelsoca01    5
12 orourji01    6
13 shaffor01    4
14 snydepo01    5
15 startjo01    7
16 suttoez01    7
17 whitede01    7
18  yorkto01    7
# Count of ids, weighted by their 'g' loading
count(baseball[1:100, ], vars = "id", wt_var = "g")
id freq
1  ansonca01  432
2  bennech01   49
3  burdoja01  414
4  forceda01  380
5  galvipu01   13
6  gerhajo01  209
7  hinespa01  302
8  jonesch01  134
9  mathebo01  327
10 morrijo01  127
11 nelsoca01  193
12 orourji01  356
13 shaffor01   90
14 snydepo01  250
15 startjo01  389
16 suttoez01  344
17 whitede01  386
18  yorkto01  396
# Count of times each player appeared in each of the years they played
count(baseball[1:100, ], c("id", "year"))[17:20, ]
id year freq
17 forceda01 1871    1
18 forceda01 1872    2
19 forceda01 1873    1
20 forceda01 1874    1

desc() Descending order

desc(1:5)
[1] -1 -2 -3 -4 -5
desc(letters[1:5])
[1] -1 -2 -3 -4 -5

Session Information

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_1.8  knitr_1.5

loaded via a namespace (and not attached):
[1] evaluate_0.5.1 formatR_0.10   stringr_0.6.2  tools_3.0.2

Further reading

http://felixfan.github.io/plyr
library(plyr)
library(ggplot2)

Example data

library(RCurl)
myCsv <- getURL("https://dl.dropboxusercontent.com/u/8272421/bnames.csv", ssl.verifypeer = FALSE)
myData <- read.csv(textConnection(myCsv))

myData$name = as.character(myData$name)
myData$sex = as.character(myData$sex)
head(myData)
year    name percent sex
1 1880    John 0.08154 boy
2 1880 William 0.08051 boy
3 1880   James 0.05006 boy
4 1880 Charles 0.04517 boy
5 1880  George 0.04329 boy
6 1880   Frank 0.02738 boy

transform() modifies an existing data frame.

Define functions

# get the nth character
letter <- function(x, n = 1) {
if (n < 0) {
nc <- nchar(x)
n <- nc + n + 1
}
tolower(substr(x, n, n))
}

# get the number of vowels
vowels <- function(x) {
nchar(gsub("[^aeiou]", "", x))
}

Simple transformation

bnames = myData
bnames0 <- transform(bnames, first = letter(name, 1), last = letter(name, -1),
length = nchar(name), vowels = vowels(name))
head(bnames0)
year    name percent sex first last length vowels
1 1880    John 0.08154 boy     j    n      4      1
2 1880 William 0.08051 boy     w    m      7      3
3 1880   James 0.05006 boy     j    s      5      2
4 1880 Charles 0.04517 boy     c    s      7      2
5 1880  George 0.04329 boy     g    e      6      3
6 1880   Frank 0.02738 boy     f    k      5      1

Group-wise transformation

Compute the rank of a name within a sex and year (e.g., boys in 2008):
one <- subset(myData, sex == "boy" & year == 2008)
# ties.method = 'first' -- first occurrence wins
one <- transform(one, rank = rank(-percent, ties.method = "first"))
head(one)
year      name  percent sex rank
128001 2008     Jacob 0.010355 boy    1
128002 2008   Michael 0.009437 boy    2
128003 2008     Ethan 0.009301 boy    3
128004 2008    Joshua 0.008799 boy    4
128005 2008    Daniel 0.008702 boy    5
128006 2008 Alexander 0.008566 boy    6
Transform every sex and year
# subset the data to easily see the results
bnames = myData[c(1:2, 9999:10000, 129001:129002, 257998:258000), ]
bnames
year    name  percent  sex
1      1880    John 0.081541  boy
2      1880 William 0.080511  boy
9999   1889  Collie 0.000042  boy
10000  1889  Cooper 0.000042  boy
129001 1880    Mary 0.072381 girl
129002 1880    Anna 0.026678 girl
257998 2008  Kenley 0.000127 girl
257999 2008  Sloane 0.000127 girl
258000 2008 Elianna 0.000127 girl

# rank by sex and then by year
bnames1 <- ddply(bnames, c("sex", "year"), transform, rank = rank(-percent,
ties.method = "first"))
bnames1
year    name  percent  sex rank
1 1880    John 0.081541  boy    1
2 1880 William 0.080511  boy    2
3 1889  Collie 0.000042  boy    1
4 1889  Cooper 0.000042  boy    2
5 1880    Mary 0.072381 girl    1
6 1880    Anna 0.026678 girl    2
7 2008  Kenley 0.000127 girl    1
8 2008  Sloane 0.000127 girl    2
9 2008 Elianna 0.000127 girl    3

# rank by year and then by sex
bnames2 <- ddply(bnames, c("year", "sex"), transform, rank = rank(-percent,
ties.method = "first"))
bnames2
year    name  percent  sex rank
1 1880    John 0.081541  boy    1
2 1880 William 0.080511  boy    2
3 1880    Mary 0.072381 girl    1
4 1880    Anna 0.026678 girl    2
5 1889  Collie 0.000042  boy    1
6 1889  Cooper 0.000042  boy    2
7 2008  Kenley 0.000127 girl    1
8 2008  Sloane 0.000127 girl    2
9 2008 Elianna 0.000127 girl    3

summarise() creates a new data frame.

Simple summaries

Whole dataset summaries

summarise(bnames, max_perc = max(percent), min_perc = min(percent))
max_perc min_perc
1  0.08154  4.2e-05

Group-wise summaries

bnames = bnames0

sum.name = ddply(bnames, c("name"), summarise, tot = sum(percent))
head(sum.name)
name      tot
1   Aaden 0.000442
2 Aaliyah 0.019748
3   Aarav 0.000101
4   Aaron 0.293097
5      Ab 0.000218
6 Abagail 0.001326

sum.length = ddply(bnames, c("length"), summarise, tot = sum(percent))
head(sum.length)
length     tot
1      2  0.2315
2      3  7.2744
3      4 36.8475
4      5 57.7588
5      6 60.3609
6      7 44.3370

sum.year.sex = ddply(bnames, c("year", "sex"), summarise, tot = sum(percent))
head(sum.year.sex)
year  sex    tot
1 1880  boy 0.9307
2 1880 girl 0.9345
3 1881  boy 0.9304
4 1881 girl 0.9327
5 1882  boy 0.9275
6 1882 girl 0.9310

sum.sex.year = ddply(bnames, c("sex", "year"), summarise, tot = sum(percent))
head(sum.sex.year)
sex year    tot
1 boy 1880 0.9307
2 boy 1881 0.9304
3 boy 1882 0.9275
4 boy 1883 0.9288
5 boy 1884 0.9273
6 boy 1885 0.9255

The proportion of the first letter (by year)

sum.year.sex.first <- ddply(bnames, c("year", "sex", "first"), summarise, tot = sum(percent))
qplot(year, tot, data = sum.year.sex.first, geom = "line", colour = sex, facets = ~first)

plot of chunk plyr1

The proportion of US children who have a name in the top 100 (by year)

# rank by sex and then by year
bnames = myData
bnames = ddply(bnames, c("sex", "year"), transform, rank = rank(-percent, ties.method = "first"))
# top 100
top100 <- subset(bnames, rank <= 100)
top100s <- ddply(top100, c("sex", "year"), summarise, tot = sum(percent))
qplot(year, tot, data = top100s, colour = sex, geom = "line", ylim = c(0, 1))

plot of chunk plyr2

ddply

Simple example

top100 <- subset(bnames, rank <= 100)
top100s <- ddply(top100, c("sex", "year"), summarise, tot = sum(percent))
head(top100s)
sex year    tot
1 boy 1880 0.7478
2 boy 1881 0.7466
3 boy 1882 0.7416
4 boy 1883 0.7438
5 boy 1884 0.7387
6 boy 1885 0.7342

ddply with your own defined function

One simple function
top100m <- ddply(top100, c("sex", "year"), function(df) mean(df$percent))
head(top100m)
sex year       V1
1 boy 1880 0.007478
2 boy 1881 0.007466
3 boy 1882 0.007416
4 boy 1883 0.007438
5 boy 1884 0.007387
6 boy 1885 0.007342
More than one function
top100ms <- ddply(top100, c("sex", "year"), function(df) c(mean(df$percent),
sd(df$percent)))
head(top100ms)
sex year       V1      V2
1 boy 1880 0.007478 0.01381
2 boy 1881 0.007466 0.01361
3 boy 1882 0.007416 0.01320
4 boy 1883 0.007438 0.01313
5 boy 1884 0.007387 0.01268
6 boy 1885 0.007342 0.01241
Set the column names
top100ms2 <- ddply(top100, c("sex", "year"), function(df) data.frame(mean = mean(df$percent),
se = sd(df$percent)/sqrt(length(df$percent))))
head(top100ms2)
sex year     mean       se
1 boy 1880 0.007478 0.001381
2 boy 1881 0.007466 0.001361
3 boy 1882 0.007416 0.001320
4 boy 1883 0.007438 0.001313
5 boy 1884 0.007387 0.001268
6 boy 1885 0.007342 0.001241

xxply

input \ output    array    data frame   list
array             aaply    adply        alply
data frame        daply    ddply        dlply
list              laply    ldply        llply
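
The first letter gives the input type and the second letter the output type. A minimal sketch (using the built-in mtcars data; the model and grouping are only illustrative):

# dlply: data frame in, list out -- fit one linear model per number of cylinders
models <- dlply(mtcars, .(cyl), function(df) lm(mpg ~ wt, data = df))
# ldply: list in, data frame out -- collect the coefficients back into a data frame
ldply(models, coef)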

Further reading

http://felixfan.github.io/webpages

Developers

  • coderwall: a site for developers to profile what projects they are up to and learn from others.
  • geeklist: the first social network for developers and the tech community.
  • masterbranch: Show me the code, profiles based on real experience combined with your regular résumé/ CV.

Q & A

  • Stack Overflow: A language-independent collaboratively edited question and answer site for programmers.
  • Biostar: this site’s focus is bioinformatics, computational genomics and biological data analysis.
  • ROSALIND: Bioinformatics problem solving
  • SEQanswers: a discussion forum and information source for next generation sequencing.

NGS

MOOCs (massive open online courses)

Learning to code

Training

  • GOBLET: is the global organization for bioinformatics learning education and training
  • ELIXIR: is a European organization set up to provide an infrastructure, including training, for life sciences information

Blogs about computational biology courses

Others

  • screenr: web-based screen recorder.
http://felixfan.github.io/cdf

Method 1 Plotting the Cumulative Probabilities Against the Ordered Data

set.seed(9)
x <- rnorm(10)
n <- length(x)
x.sorted <- sort(x)
plot(x.sorted, 1:n/n, type = "s", ylim = c(0, 1), ylab = "")
# add the label on the y-axis
mtext(text = expression(hat(F)[n](x)), side = 2, line = 2, cex = 1.5)

plot of chunk CDF1

Method 2 Using the ecdf() and plot() functions

set.seed(9)
x <- rnorm(10)
plot(ecdf(x), ylim = c(0, 1), ylab = "", verticals = T)
# add the label on the y-axis
mtext(text = expression(hat(F)[n](x)), side = 2, line = 2, cex = 1.5)

plot of chunk CDF2
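
Both methods extend naturally to comparing two samples on one graph (a small sketch, not from the original post; the second sample is made up for illustration):

set.seed(9)
x <- rnorm(10)
y <- rnorm(10, mean = 1)
plot(ecdf(x), ylim = c(0, 1), ylab = "", verticals = T, main = "Two ECDFs")
plot(ecdf(y), verticals = T, add = TRUE, col = "red")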

Further reading

Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R
Plotting Two Empirical CDFs on the Same Graph

http://felixfan.github.io/basic-Plot

Bar plot

Simple Bar Plot

counts <- table(mtcars$gear)
barplot(counts, main = "Car Distribution", xlab = "Number of Gears")

plot of chunk basicplot1

Simple Horizontal Bar Plot with Added Labels

barplot(counts, main = "Car Distribution", horiz = TRUE, names.arg = c("3 Gears",
"4 Gears", "5 Gears"))

plot of chunk basicplot2

Stacked Bar Plot

counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main = "Car Distribution by Gears and VS", xlab = "Number of Gears",
col = c("darkblue", "red"), legend = rownames(counts))

plot of chunk basicplot3

Grouped Bar Plot

barplot(counts, main = "Car Distribution by Gears and VS", xlab = "Number of Gears",
col = c("darkblue", "red"), legend = rownames(counts), beside = TRUE)

plot of chunk basicplot4

Pie Charts

Simple Pie Chart

slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main = "Pie Chart of Countries")

plot of chunk basicplot5

Pie Chart with Annotated Percentages

pct <- round(slices/sum(slices) * 100)
lbls <- paste(lbls, pct)  # add percents to labels
lbls <- paste(lbls, "%", sep = "")  # ad % to labels
pie(slices, labels = lbls, col = rainbow(length(lbls)), main = "Pie Chart of Countries")

plot of chunk basicplot6

3D Pie Chart

library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie3D(slices, labels = lbls, explode = 0.1, main = "Pie Chart of Countries ")

plot of chunk basicplot7

Scatterplots

Simple Scatterplot

plot(mtcars$wt, mtcars$mpg, main = "Scatterplot Example", xlab = "Car Weight ",
ylab = "Miles Per Gallon ", pch = 19)

plot of chunk basicplot8

Add fit lines

attach(mtcars)
plot(wt, mpg, main = "Scatterplot Example", xlab = "Car Weight ", ylab = "Miles Per Gallon ",
pch = 19)
abline(lm(mpg ~ wt), col = "red")  # regression line (y~x)
lines(lowess(wt, mpg), col = "blue")  # lowess line (x,y)

plot of chunk basicplot9

Scatterplot Matrices

Basic Scatterplot Matrix

pairs(~mpg + disp + drat + wt, data = mtcars, main = "Simple Scatterplot Matrix")

plot of chunk basicplot10

Scatterplot Matrices from the lattice Package

The lattice package provides options to condition the scatterplot matrix on a factor.

library(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(mtcars[c(1, 3, 5, 6)], groups = cyl, data = mtcars, panel = panel.superpose,
key = list(title = "Three Cylinder Options", columns = 3, points = list(pch = super.sym$pch[1:3],
col = super.sym$col[1:3]), text = list(c("4 Cylinder", "6 Cylinder",
"8 Cylinder"))))

plot of chunk basicplot11

Scatterplot Matrices from the car Package

The car package can condition the scatterplot matrix on a factor, and optionally include lowess and linear best fit lines, and boxplot, densities, or histograms in the principal diagonal, as well as rug plots in the margins of the cells.

library(car)
scatterplot.matrix(~mpg + disp + drat + wt | cyl, data = mtcars, main = "Three Cylinder Options")

plot of chunk basicplot12

Scatterplot Matrices from the gclus Package

The gclus package provides options to rearrange the variables so that those with higher correlations are closer to the principal diagonal. It can also color code the cells to reflect the size of the correlations.

library(gclus)
dta <- mtcars[c(1, 3, 5, 6)]  # get data
dta.r <- abs(cor(dta))  # get correlations
dta.col <- dmat.color(dta.r)  # get colors
# reorder variables so those with highest correlation are closest to the
# diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors = dta.col, gap = 0.5, main = "Variables Ordered and Colored by Correlation")

plot of chunk basicplot13

Histograms

Simple Histogram

hist(mtcars$mpg)

plot of chunk basicplot14

Colored Histogram with Different Number of Bins

hist(mtcars$mpg, breaks = 12, col = "red")

plot of chunk basicplot15

Add a Normal Curve

x <- mtcars$mpg
h <- hist(x, breaks = 10, col = "red", xlab = "Miles Per Gallon", main = "Histogram with Normal Curve")
xfit <- seq(min(x), max(x), length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)

plot of chunk basicplot16

Kernel Density Plots

Kernel density plots are usually a much more effective way to view the distribution of a variable. Create the plot using plot(density(x)), where x is a numeric vector.

Simple Kernel Density Plot

plot(density(mtcars$mpg))

plot of chunk basicplot17

Filled Density Plot

d <- density(mtcars$mpg)
plot(d, main = "Kernel Density of Miles Per Gallon")
polygon(d, col = "red", border = "blue")

plot of chunk basicplot18

par()

xaxt = "n" suppresses plotting of the x-axis.
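
For example (a small sketch, not in the original post), suppress the default x-axis and then draw a custom one with axis():

plot(mtcars$wt, mtcars$mpg, xaxt = "n", xlab = "Car Weight", ylab = "Miles Per Gallon")
axis(side = 1, at = 2:5)  # put tick marks only at 2, 3, 4 and 5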

# par() # view current settings
opar <- par()  # make a copy of current settings
par(col.lab = "red")  # red x and y labels
hist(mtcars$mpg)  # create a plot with these new settings

plot of chunk basicplot20

par(opar)  # restore original settings

text() and mtext()

text() places text within the graph while mtext() places text in one of the four margins.

# attach(mtcars)
plot(wt, mpg, main = "Milage vs. Car Weight", xlab = "Weight", ylab = "Mileage",
pch = 18, col = "blue")
text(wt, mpg, row.names(mtcars), cex = 0.6, pos = 4, col = "red")

plot of chunk basicplot21

combine figures

par(mfrow = c(2, 2))
plot(wt, mpg, main = "Scatterplot of wt vs. mpg")
plot(wt, disp, main = "Scatterplot of wt vs disp")
hist(wt, main = "Histogram of wt")
boxplot(wt, main = "Boxplot of wt")

plot of chunk basicplot22

# One figure in row 1 and two figures in row 2; row 1 is 1/3 the height of
# row 2; column 2 is 1/4 the width of column 1
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE), widths = c(3, 1), heights = c(1,
2))
hist(wt)
hist(mpg)
hist(disp)

plot of chunk basicplot23

# Add boxplots to a scatterplot
par(fig = c(0, 0.8, 0, 0.8), new = TRUE)
plot(mtcars$wt, mtcars$mpg, xlab = "Miles Per Gallon", ylab = "Car Weight")
par(fig = c(0, 0.8, 0.55, 1), new = TRUE)
boxplot(mtcars$wt, horizontal = TRUE, axes = FALSE)
par(fig = c(0.65, 1, 0, 0.8), new = TRUE)
boxplot(mtcars$mpg, axes = FALSE)
mtext("Enhanced Scatterplot", side = 3, outer = TRUE, line = -3)

plot of chunk basicplot24

To understand this graph, think of the full graph area as going from (0,0) in the lower left corner to (1,1) in the upper right corner. The format of the fig= parameter is a numerical vector of the form c(x1, x2, y1, y2). The first fig= sets up the scatterplot going from 0 to 0.8 on the x axis and 0 to 0.8 on the y axis. The top boxplot goes from 0 to 0.8 on the x axis and 0.55 to 1 on the y axis. I chose 0.55 rather than 0.8 so that the top figure will be pulled closer to the scatter plot. The right hand boxplot goes from 0.65 to 1 on the x axis and 0 to 0.8 on the y axis. Again, I chose a value to pull the right hand boxplot closer to the scatterplot. You have to experiment to get it just right.

legend()

# attach(mtcars)
boxplot(mpg ~ cyl, main = "Milage by Car Weight", yaxt = "n", xlab = "Milage",
horizontal = TRUE, col = terrain.colors(3))
legend("topright", inset = 0.05, title = "Number of Cylinders", c("4", "6",
"8"), fill = terrain.colors(3), horiz = TRUE)

plot of chunk basicplot19

Further reading

Quick-R: Basic Graphs

http://felixfan.github.io/Venn

The gplots package provides Venn diagrams for up to five sets. Internally, one function builds a membership table and another draws it; the user-facing venn() function calls them one after the other, so it is the only one you need. venn() accepts either a list of sets as an argument, or a binary matrix, one column per set, indicating for every element (one per row) its membership in each set.

library(gplots)
venn(list(A = 1:5, B = 4:6, C = c(4, 8:10)))

plot of chunk venn1

The names of columns or the list elements are the set names. To squeeze extra circles in, those circles need to become ellipses. This works for four sets and maybe even more impressively also for five.

v.table <- venn(list(A = 1:5, B = 4:6, C = c(4, 8:10), D = c(4:12)))

plot of chunk venn2

print(v.table)
num A B C D
0000   0 0 0 0 0
0001   3 0 0 0 1
0010   0 0 0 1 0
0011   3 0 0 1 1
0100   0 0 1 0 0
0101   1 0 1 0 1
0110   0 0 1 1 0
0111   0 0 1 1 1
1000   3 1 0 0 0
1001   0 1 0 0 1
1010   0 1 0 1 0
1011   0 1 0 1 1
1100   0 1 1 0 0
1101   1 1 1 0 1
1110   0 1 1 1 0
1111   1 1 1 1 1
attr(,"class")
[1] "venn"
## construct some fake gene names..
oneName <- function() paste(sample(LETTERS, 5, replace = TRUE), collapse = "")
geneNames <- replicate(1000, oneName())

##
GroupA <- sample(geneNames, 400, replace = FALSE)
GroupB <- sample(geneNames, 750, replace = FALSE)
GroupC <- sample(geneNames, 250, replace = FALSE)
GroupD <- sample(geneNames, 300, replace = FALSE)
input <- list(GA = GroupA, GB = GroupB, GC = GroupC, GD = GroupD)
venn(input)

plot of chunk venn3

## Example using a list of item indexes belonging to the specified group.
GroupA2 <- which(geneNames %in% GroupA)
GroupB2 <- which(geneNames %in% GroupB)
GroupC2 <- which(geneNames %in% GroupC)
GroupD2 <- which(geneNames %in% GroupD)
input2 <- list(GA2 = GroupA2, GB2 = GroupB2, GC2 = GroupC2, GD2 = GroupD2)
venn(input2)

plot of chunk venn4

## Example using a data frame of indicator ('f'lag) columns
GroupA.f <- geneNames %in% GroupA
GroupB.f <- geneNames %in% GroupB
GroupC.f <- geneNames %in% GroupC
GroupD.f <- geneNames %in% GroupD
input.df <- data.frame(A = GroupA.f, B = GroupB.f, C = GroupC.f, D = GroupD.f)
head(input.df)
A     B     C     D
1  TRUE  TRUE  TRUE FALSE
2  TRUE  TRUE FALSE  TRUE
3 FALSE  TRUE FALSE FALSE
4 FALSE FALSE FALSE FALSE
5 FALSE  TRUE FALSE FALSE
6 FALSE  TRUE  TRUE FALSE
venn(input.df)

plot of chunk venn5

## Example to determine which elements are in A and B but not in C and D:
## first determine the universe, then form indicator columns and perform
## intersections on these.  R allows using the set operations directly, but
## some might find this approach more intuitive.

universe <- unique(c(GroupA, GroupB, GroupC, GroupD))
GroupA.l <- universe %in% GroupA
GroupB.l <- universe %in% GroupB
GroupC.l <- universe %in% GroupC
GroupD.l <- universe %in% GroupD

## Genes that are in GroupA and in GroupB but not in GroupD (the union of
## sets 1110 and 1100 in the venn diagram):
universe[GroupA.l & GroupB.l & !GroupD.l]
[1] "LJNGD" "UQVQH" "MMHSM" "DLJVX" "KUOPR" "QOOLP" "OVBIQ" "FFDFP"
[9] "ISMFE" "VTFGZ" "OCSSQ" "NUSGS" "TEZQK" "CBDPU" "HAKQH" "OOIZX"
[17] "YUQMS" "BQWZM" "VNXBM" "BXSWW" "YTDTJ" "DAQDB" "EKSTZ" "PQSCL"
[25] "MTZSP" "SFAGP" "BOTQZ" "LBWHV" "ASWAY" "IVFVS" "UJCZN" "RWOHB"
[33] "VVLWB" "AWNWU" "AIGKN" "PQCLN" "IGIFJ" "ZSWHS" "LAFWD" "ALPXH"
[41] "ZOACU" "FZJAB" "ZKUEG" "PBXGH" "EAKES" "HSHVF" "ILWER" "NKDYR"
[49] "NPICS" "ANFOE" "SETBU" "TQYRT" "QUXJT" "UUEAJ" "DXUKG" "ZGQLP"
[57] "FCZWQ" "OXYHX" "WIYEM" "RYYYT" "WVCGX" "NOLXE" "NOQBP" "KLWMC"
[65] "LSHHB" "QIHLG" "FVTFO" "JZNPN" "KFYAR" "HCYRD" "XRXLW" "ROALB"
[73] "FTHWY" "HKGVI" "YNTXC" "RHFCM" "RAXOG" "ZQOLJ" "PSJSV" "ARFWR"
[81] "JTGPX" "GQCLK" "RPUDD" "NAPEH" "DADCB" "JRTWN" "HCZKQ" "SNGZB"
[89] "TZUVO" "NXMZV" "DTFUR" "VAVAE" "QQMLZ" "JGNQQ" "FHCUD" "ZGVYH"
[97] "JXJFA" "FKQJQ" "LLREW" "IPCYI" "WBJYA" "JNLHV" "PRLIX" "SXRJP"
[105] "DMUVQ" "CVAZE" "YZLDT" "TTEVZ" "XGFXY" "IDVML" "LHXLT" "ADWRG"
[113] "RKMBY" "YBHBB" "LJJIT" "VLFCS" "BBBZQ" "KZJVR" "RNBOI" "NUVPE"
[121] "RFJVT" "VPDIH" "KSRZY" "XVUJP" "IGBTO" "BNLXS" "BPKAV" "LPERU"
[129] "DXHLN" "YMMJE" "IIUWM" "GLDOR" "SRSZN" "ZFXIY" "AXOJE" "HEJOW"
[137] "OVNMM" "KGHKC" "YARXA" "TJQHL" "TTGCQ" "QXJGM" "TQBBD" "IXCRE"
[145] "HLHFX" "IWHBS" "RRHBX" "AZXXQ" "KOSXX" "HCJKY" "YIEWX" "VGSJD"
[153] "MYGOI" "YZTCW" "WXIZF" "GMJRT" "GHFDP" "LINFA" "RANSM" "SDWXZ"
[161] "BWXMW" "WKSIK" "OIGKP" "RTDBJ" "RTXSQ" "ZBLTC" "XSJUA" "KEOID"
[169] "FZLBM" "EBYBZ" "KKZPT" "ECLUE" "FNEMV" "WHZGV" "DYKBX" "HWXKY"
[177] "AWCHJ" "RZTGO" "NVHUR" "EZJKS" "USKOB" "UQRTP" "ETLNG" "CSQLL"
[185] "BAWFQ" "XZWJR" "BYCKD" "DGGUJ" "ESWWX" "BGXYC" "NDRLP" "HWYGQ"
[193] "MQCUS" "IOFIY" "XZCJZ" "STFLH" "DCKQJ" "VSTJX" "ZKMAU" "LJNHB"
[201] "QVXGB" "BVKBT" "OUBPF" "ZQIRR" "NIJXI" "SLYHJ" "UCDXR" "UHFJL"
[209] "ZUHMY" "POVCR" "MLTSH" "HRCGY" "ZVRDU" "SJWXH" "LJASB"

## Alternatively: construct a function to test for the pattern you want.
test <- function(x) (x %in% GroupA) & (x %in% GroupB) & !(x %in% GroupC)
universe[test(universe)]
[1] "BDCSG" "UQVQH" "MMHSM" "THMNP" "JNHNT" "DLJVX" "KUOPR" "QOOLP"
[9] "FFDFP" "ISMFE" "CBDPU" "HAKQH" "OOIZX" "YUQMS" "BQWZM" "VNXBM"
[17] "NOBXM" "WYAFV" "YTDTJ" "DAQDB" "EKSTZ" "PQSCL" "ZRVVD" "WDKCI"
[25] "CBJTV" "SFAGP" "BOTQZ" "LBWHV" "ASWAY" "IYAVE" "IVFVS" "UJCZN"
[33] "RWOHB" "VVLWB" "PQCLN" "IGIFJ" "ZSWHS" "AXVHG" "LAFWD" "ALPXH"
[41] "ZOACU" "FZJAB" "JDKWV" "ZKUEG" "PBXGH" "EAKES" "HSHVF" "ILWER"
[49] "NKDYR" "NPICS" "SPETC" "JMGTR" "SETBU" "OXSHI" "TQYRT" "KUGAT"
[57] "XFKNV" "UUEAJ" "JYQUN" "DXUKG" "KLODK" "ZGQLP" "FCZWQ" "OXYHX"
[65] "WIYEM" "RYYYT" "WVCGX" "NOLXE" "LSHHB" "QIHLG" "FVTFO" "KFYAR"
[73] "HCYRD" "XRXLW" "ROALB" "FTHWY" "YNTXC" "RHFCM" "RAXOG" "ZQOLJ"
[81] "PSJSV" "ARFWR" "JTGPX" "NTUDS" "RPUDD" "SCXPA" "NAPEH" "DADCB"
[89] "JRTWN" "SNGZB" "TZUVO" "NXMZV" "DTFUR" "KFULO" "VAVAE" "QQMLZ"
[97] "FKQJQ" "UYOAU" "IPCYI" "QRHVN" "PRLIX" "DMUVQ" "TTEVZ" "TADDM"
[105] "XGFXY" "IDVML" "LHXLT" "ADWRG" "RKMBY" "LJJIT" "KZJVR" "RNBOI"
[113] "NUVPE" "RFJVT" "AXDPZ" "XVUJP" "LYNLU" "FVMQZ" "IMHAR" "TTTXI"
[121] "DXHLN" "YMMJE" "FFKFJ" "IIUWM" "GLDOR" "SRSZN" "ZFXIY" "MJAPP"
[129] "AXOJE" "HEJOW" "OVNMM" "KGHKC" "YARXA" "TJQHL" "TTGCQ" "AVKPZ"
[137] "QXJGM" "TQBBD" "IXCRE" "HLHFX" "IWHBS" "RRHBX" "AZXXQ" "KOSXX"
[145] "HCJKY" "YIEWX" "VGSJD" "MYGOI" "YZTCW" "OSDDV" "WXIZF" "GMJRT"
[153] "GHFDP" "LINFA" "RANSM" "SDWXZ" "BWXMW" "OIGKP" "HIJYT" "RTDBJ"
[161] "QLBWK" "MWTGP" "XSJUA" "KEOID" "FZLBM" "THTUK" "HCNEP" "WXESQ"
[169] "KKZPT" "GJWXU" "ECLUE" "FNEMV" "WHZGV" "DYKBX" "MYEEF" "JVDHM"
[177] "HWXKY" "AQTJY" "RZTGO" "ZNYWF" "WLEYH" "EZJKS" "XFSVL" "KUDNT"
[185] "ETLNG" "CSQLL" "BAWFQ" "XZWJR" "BYCKD" "DGGUJ" "LEUNY" "AODBU"
[193] "ESWWX" "RMVJQ" "BGXYC" "CAKIG" "NDRLP" "MQCUS" "YDXFQ" "BMBJE"
[201] "IOFIY" "MROCC" "DCKQJ" "IBYUS" "VSTJX" "ZKMAU" "LJNHB" "SPAJS"
[209] "VMWPA" "QVXGB" "BVKBT" "OUBPF" "ZQIRR" "RYQPM" "NIJXI" "WSCZW"
[217] "NYYLH" "RJKNV" "RUGDL" "SLYHJ" "ITQRZ" "UCDXR" "BWYLK" "UHFJL"
[225] "HRCGY" "ZVRDU" "SJWXH" "JMNMK" "LJASB"

Further reading
gplots

http://felixfan.github.io/Boxplot

Boxplots With Point Identification

library(car)
head(Prestige)
education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof
# if not specified, Boxplot will use the row names of the data argument
# identify all outliers: id.n=Inf
Boxplot(~income, data = Prestige, id.n = Inf)

plot of chunk boxplot1

[1] "general.managers"         "lawyers"
[3] "physicians"               "veterinarians"
[5] "osteopaths.chiropractors"
# default id.n=10
Boxplot(income ~ type, data = Prestige)

plot of chunk boxplot2

[1] "general.managers" "physicians"
# change plot order
Boxplot(income ~ type, data = Prestige, at = c(1, 3, 2))

plot of chunk boxplot3

[1] "general.managers" "physicians"
head(Mroz)
lfp k5 k618 age  wc hc     lwg    inc
1 yes  1    0  32  no no 1.21016 10.910
2 yes  0    2  30  no no 0.32850 19.500
3 yes  1    3  35  no no 1.51413 12.040
4 yes  0    3  34  no no 0.09212  6.800
5 yes  1    2  31 yes no 1.52428 20.100
6 yes  0    0  54  no no 1.55649  9.859
# produce parallel boxplots for k5+k618 within levels of the grouping
# variable ('lfp' and 'wc')
x = Boxplot(k5 + k618 ~ lfp * wc, data = Mroz)

plot of chunk boxplot4


# double check point labels:
temp = Mroz[Mroz$lfp == "no" & Mroz$wc == "no", ]
pattern <- (temp$k5 + temp$k618) == 8
rownames(temp)[which(pattern)]
[1] "746"

temp2 = Mroz[Mroz$lfp == "yes" & Mroz$wc == "no", ]
pattern2 <- (temp2$k5 + temp2$k618) == 8
rownames(temp2)[which(pattern2)]
[1] "53"
# specify the points lables
with(Prestige, Boxplot(income, labels = rownames(Prestige)))

plot of chunk boxplot5

[1] "general.managers"         "lawyers"
[3] "physicians"               "veterinarians"
[5] "osteopaths.chiropractors"
# plot 'income' within levels of the grouping variable 'type'
with(Prestige, Boxplot(income, type, labels = rownames(Prestige)))

plot of chunk boxplot6

[1] "general.managers" "physicians"

Basic boxplot

Add varwidth=TRUE to make the boxplot widths proportional to the square root of the sample sizes. Add horizontal=TRUE to reverse the axis orientation. (A variant using both options appears after the basic example below.)

boxplot(mpg ~ cyl, data = mtcars, main = "Car Mileage Data", xlab = "Number of Cylinders",
ylab = "Miles Per Gallon")

plot of chunk boxplot7
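
A variant using the options mentioned above (a sketch, not in the original post): box widths proportional to the square root of the sample sizes, with the axes flipped.

boxplot(mpg ~ cyl, data = mtcars, varwidth = TRUE, horizontal = TRUE,
main = "Car Mileage Data", ylab = "Number of Cylinders", xlab = "Miles Per Gallon")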

Notched Boxplot

Notched Boxplot of Tooth Growth Against 2 Crossed Factors.

  • boxes colored for ease of interpretation.
boxplot(len ~ supp * dose, data = ToothGrowth, notch = TRUE, col = (c("gold",
"darkgreen")), main = "Tooth Growth", xlab = "Supplement and Dose")

plot of chunk boxplot8

In the notched boxplot, if two boxes’ notches do not overlap this is ‘strong evidence’ their medians differ (Chambers et al., 1983, p. 62).
  • Colors recycle. In the example above, if I had listed 6 colors, each box would have its own color (see the sketch below).
Earl F. Glynn has created an easy-to-use [list of colors](http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf) in PDF format.
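
For instance (a sketch, not in the original post), listing six colors gives each of the 2 x 3 supplement-by-dose boxes its own color:

boxplot(len ~ supp * dose, data = ToothGrowth, notch = TRUE, col = rainbow(6),
main = "Tooth Growth", xlab = "Supplement and Dose")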

Other Options

The boxplot.matrix() function in the sfsmisc package draws a boxplot for each column (row) in a matrix.

The boxplot.n() function in the gplots package annotates each boxplot with its sample size.

The bplot() function in the Rlab package offers many more options for controlling the positioning and labeling of boxes in the output.

Violin Plots

A violin plot is a combination of a boxplot and a kernel density plot. Violin plots can be created with the vioplot() function from the vioplot package.

library(vioplot)
x1 <- mtcars$mpg[mtcars$cyl == 4]
x2 <- mtcars$mpg[mtcars$cyl == 6]
x3 <- mtcars$mpg[mtcars$cyl == 8]
vioplot(x1, x2, x3, names = c("4 cyl", "6 cyl", "8 cyl"), col = "gold")
title("Violin Plots of Miles Per Gallon")

plot of chunk boxplot9

Bagplot - A 2D Boxplot Extension

library(aplpack)
attach(mtcars)
bagplot(wt, mpg, xlab = "Car Weight", ylab = "Miles Per Gallon", main = "Bagplot Example")

plot of chunk boxplot10

Further reading

inside-R: Boxplot
Quick-R: Boxplot

http://felixfan.github.io/pkg-Using

plot of chunk unnamed-chunk-3

## $total_installations
## [1] 204

plot of chunk unnamed-chunk-4

## $total_installations
## [1] 164

plot of chunk unnamed-chunk-5

## $total_installations
## [1] 161

package {FinCal}

Start To total_installations
2013-07-16 2013-07-31 79
2013-08-01 2013-08-31 173
2013-09-01 2013-09-30 220

package {powerAnalysis}

Start To total_installations
2013-09-01 2013-09-30 131

package {PubMedWordcloud}

Start To total_installations
2013-09-16 2013-09-30 81
http://felixfan.github.io/RMA-Normalization-Microarray

Why does microarray data need to be normalized?

To account for technical variation between the arrays.

Methods

Robust Multichip Average (RMA) normalization.

Demo (GSE1297)

Step 1 Install packages

# install the core bioconductor packages, if not already installed
source("http://bioconductor.org/biocLite.R")
biocLite()

# install additional bioconductor libraries, if not already installed
biocLite("GEOquery")  # Get data from NCBI Gene Expression Omnibus (GEO)
biocLite("affy")  # Methods for Affymetrix Oligonucleotide Arrays
biocLite("hgu133a.db", type = "source")  # GSE1297: Platform_title = [HG-U133A]
biocLite("hgu133acdf")

Download the CEL file

library(GEOquery)

# Set working directory for download
setwd("D:/GEO/AD/GSE/GSE1297")

# Download the CEL file (by GSE - Geo series id), may take very long time
getGEOSuppFiles("GSE1297")
# this command does not work for this chip, so I downloaded the file directly
# Platform_title = [HG-U133A] Affymetrix Human Genome U133A Array
# Platform information can be found in the SOFT file

# Unpack the CEL files
setwd("D:/GEO/AD/GSE/GSE1297/GSE1297")
untar("GSE1297_RAW.tar", exdir = "data")
cels = list.files("data/", pattern = "cel")
# sometimes it is 'CEL'; check it first
sapply(paste("data", cels, sep = "/"), gunzip)
cels = list.files("data/", pattern = "cel")
# sometimes it is 'CEL'; check it first

Perform RMA normalization

library(affy)
library(hgu133a.db)
library(hgu133acdf)

# Set working directory for normalization
setwd("D:/GEO/AD/GSE/GSE1297/GSE1297/data")
raw.data = ReadAffy(verbose = FALSE, filenames = cels, cdfname = "hgu133acdf")

# perform RMA normalization (log2)
data.rma.norm = rma(raw.data)
Background correcting
Normalizing
Calculating Expression

# Get the expression estimates for each array
rma = exprs(data.rma.norm)

# Take a look at the result (first 5 rows and 5 columns)
rma[1:5, 1:5]
GSM21203.cel GSM21204.cel GSM21205.cel GSM21206.cel GSM21207.cel
1007_s_at       10.606       10.339       10.390       10.523       10.791
1053_at          4.586        4.463        4.572        4.531        4.546
117_at           5.937        6.017        6.234       10.114        6.253
121_at           8.763        8.951        9.026        8.804        9.130
1255_g_at        4.361        4.439        4.598        4.019        4.354

# Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t")

Annotation

tt = cbind(row.names(rma), rma)
colnames(tt) = c("ProbID", sub(".cel", "", colnames(rma), ignore.case = TRUE))
rownames(tt) = NULL
tt[1:5, 1:5]
ProbID      GSM21203           GSM21204           GSM21205
[1,] "1007_s_at" "10.6061947450577" "10.3389916272064" "10.390181163724"
[2,] "1053_at"   "4.58630979543013" "4.46270777736086" "4.57178078332995"
[3,] "117_at"    "5.93684658044002" "6.01691505298677" "6.23421446338666"
[4,] "121_at"    "8.76288033305696" "8.9513454849938"  "9.02618022186855"
[5,] "1255_g_at" "4.3610859654913"  "4.43906808970735" "4.59752976632791"
GSM21206
[1,] "10.5230672101482"
[2,] "4.53054651020166"
[3,] "10.1138768429987"
[4,] "8.80361343714862"
[5,] "4.01915640899688"

require(RCurl)
myURL <- getURL("https://dl.dropboxusercontent.com/u/8272421/geo/HGU133A.na33.txt",
ssl.verifypeer = FALSE)
annot <- read.table(textConnection(myURL), header = TRUE, sep = "\t")
head(annot)
ProbeSetID EntrezGene
1  1007_s_at  100616237
2  1007_s_at        780
3    1053_at       5982
4     117_at       3310
5     121_at       7849
6  1255_g_at       2978

# probe sets were mapped to Entrez Gene IDs.
# comb=merge(annot,tt,by.x='ProbeSetID',by.y='ProbID',all.y=TRUE)
comb = merge(annot, tt, by.x = "ProbeSetID", by.y = "ProbID")
comb[1:5, 1:5]
ProbeSetID EntrezGene         GSM21203         GSM21204         GSM21205
1  1007_s_at  100616237 10.6061947450577 10.3389916272064  10.390181163724
2  1007_s_at        780 10.6061947450577 10.3389916272064  10.390181163724
3    1053_at       5982 4.58630979543013 4.46270777736086 4.57178078332995
4     117_at       3310 5.93684658044002 6.01691505298677 6.23421446338666
5     121_at       7849 8.76288033305696  8.9513454849938 9.02618022186855
write.table(comb, file = "comb2.txt", quote = FALSE, sep = "\t", row.names = FALSE)

# If multiple probe sets corresponded to the same gene, then the expression
# values of these probe sets were averaged.
comb2 <- subset(comb, select = -c(ProbeSetID))
comb2 <- data.frame(lapply(comb2, as.character), stringsAsFactors = FALSE)
comb2 <- data.frame(lapply(comb2, as.numeric), stringsAsFactors = FALSE)
out <- aggregate(. ~ EntrezGene, data = comb2, mean)

# Format values to 5 decimal places
out = format(out, digits = 5)
out[1:5, 1:5]
EntrezGene GSM21203 GSM21204 GSM21205 GSM21206
1          2   9.0928   9.7931   9.0962   9.8969
2          9   4.5077   4.3635   4.5445   4.3781
3         10   6.2881   6.2111   7.0938   5.7314
4         12  11.5590   8.0948   8.2142  12.3144
5         13   4.5742   4.6783   5.0740   4.5521
write.table(out, file = "GSE1297.RMA.txt", quote = FALSE, sep = "\t", row.names = FALSE)

Session information

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] RCurl_1.95-4.1       bitops_1.0-6         hgu133acdf_2.12.0
[4] hgu133a.db_2.9.0     org.Hs.eg.db_2.9.0   RSQLite_0.11.4
[7] DBI_0.2-7            AnnotationDbi_1.22.6 affy_1.38.1
[10] Biobase_2.20.1       BiocGenerics_0.6.0   knitr_1.5

loaded via a namespace (and not attached):
[1] affyio_1.28.0         BiocInstaller_1.10.4  evaluate_0.5.1
[4] formatR_0.10          IRanges_1.18.4        preprocessCore_1.22.0
[7] stats4_3.0.2          stringr_0.6.2         tools_3.0.2
[10] zlibbioc_1.6.0

Further Reading

Tutorial: Analysing microarray data in BioConductor
Using Bioconductor for Microarray Analysis
Methods of RMA Normalization for Affymetrix GeneChip Arrays
A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2):185-193

http://felixfan.github.io/genetic-code

Basic functions

plotCircle <- function(diameter = 1, polygonCol = "white", plotCol = "black",
npoints = 1000) {
r = diameter/2
temp <- seq(0, 2 * pi, length.out = npoints)
x <- r * cos(temp)
y <- r * sin(temp)
dat <- (data.frame(x = x, y = y))
plot(dat$x, dat$y, type = "l", col = plotCol, xaxt = "n", yaxt = "n", xlab = "",
ylab = "", frame.plot = FALSE)
polygon(dat$x, dat$y, col = polygonCol)
}

addCircle <- function(diameter = 1, polygonCol = "white", plotCol = "black",
npoints = 1000) {
r = diameter/2
temp <- seq(0, 2 * pi, length.out = npoints)
x <- r * cos(temp)
y <- r * sin(temp)
dat <- (data.frame(x = x, y = y))
points(dat$x, dat$y, type = "l", col = plotCol)
polygon(dat$x, dat$y, col = polygonCol)
}

addLines <- function(diameter = 1, parts = 4, start.diameter = 0, lineCol = "black",
logic = rep(TRUE, parts)) {
r = diameter/2
y <- r * cos(seq(0, 2 * pi, by = 2 * pi/parts))
x <- r * sin(seq(0, 2 * pi, by = 2 * pi/parts))

r0 = start.diameter/2
y0 <- r0 * cos(seq(0, 2 * pi, by = 2 * pi/parts))
x0 <- r0 * sin(seq(0, 2 * pi, by = 2 * pi/parts))

dat <- (data.frame(x = x, y = y))
dat0 <- (data.frame(x = x0, y = y0))
i = 1
while (i < length(dat$x)) {
if (logic[i]) {
lines(c(dat0$x[i], dat$x[i]), c(dat0$y[i], dat$y[i]), col = lineCol)
}
i = i + 1
}
}

addTexts <- function(diameter = 1, labels = c("G", "C", "T", "A"), cex = 1,
logic = rep(TRUE, length(labels) + 1), start.diameter = 0, textCol = "black") {
parts = length(labels)

r = (start.diameter + diameter)/4
y <- r * cos(seq(0, 2 * pi, by = 2 * pi/parts) + pi/parts)
x <- r * sin(seq(0, 2 * pi, by = 2 * pi/parts) + pi/parts)
dat <- (data.frame(x = x, y = y))

i = 1
last.i = 1
while (i < length(dat$x)) {
j = i + 1
for (tt in j:length(dat$x)) {
if (logic[j]) {
text((dat$x[i] + dat$x[last.i])/2, (dat$y[i] + dat$y[last.i])/2,
labels[i], cex = cex, col = textCol)
last.i = j
break
}
}

i = i + 1
}
}

Demo 1

geneticCodeDemo1 <- function() {
plotCircle(diameter = 6)
addCircle(diameter = 5)
addCircle(diameter = 4)
addCircle(diameter = 3)
addCircle(diameter = 2)
addCircle(diameter = 1)

addLines(diameter = 6, parts = 4, start.diameter = 0)
addLines(diameter = 6, parts = 16, start.diameter = 1)
####
log1 = rep(c(FALSE, TRUE), 2)  #4
log2 = rep(c(FALSE, FALSE, FALSE, TRUE), 3)  #12
log3 = c(FALSE, FALSE, FALSE, TRUE)  #4
addornot = c(TRUE, log1, log2, log1, log2, log1, TRUE, TRUE, FALSE, TRUE,
log3, log1, log1, log1, log3, TRUE, FALSE, FALSE)  #64
addLines(diameter = 6, parts = 64, start.diameter = 2, logic = addornot)

text1 = c("G", "C", "T", "A")
addTexts(diameter = 1, labels = text1)
text2 = rep(c("A", "G", "C", "T"), 4)
addTexts(diameter = 2, labels = text2, cex = 0.8, start.diameter = 1)
text3 = rep(c("A", "G", "C", "T"), 16)
addTexts(diameter = 3, labels = text3, cex = 0.5, start.diameter = 2)
###
text4 = c("E", "E", "D", "D", "G", "G", "G", "G", "A", "A", "A", "A", "V",
"V", "V", "V", "Q", "Q", "H", "H", "R", "R", "R", "R", "P", "P", "P",
"P", "L", "L", "L", "L", "*", "*", "Y", "Y", "*", "W", "C", "C", "S",
"S", "S", "S", "L", "L", "F", "F", "K", "K", "N", "N", "R", "R", "S",
"S", "T", "T", "T", "T", "M", "I", "I", "I")
addTexts(diameter = 4, labels = text4, logic = c(addornot, TRUE), start.diameter = 3)
###
text5 = c("Glu", "Glu", "Asp", "Asp", "Gly", "Gly", "Gly", "Gly", "Ala",
"Ala", "Ala", "Ala", "Val", "Val", "Val", "Val", "Gln", "Gln", "His",
"His", "Arg", "Arg", "Arg", "Arg", "Pro", "Pro", "Pro", "Pro", "Leu",
"Leu", "Leu", "Leu", "*", "*", "Tyr", "Tyr", "*", "Trp", "Cys", "Cys",
"Ser", "Ser", "Ser", "Ser", "Leu", "Leu", "Phe", "Phe", "Lys", "Lys",
"Asn", "Asn", "Arg", "Arg", "Ser", "Ser", "Thr", "Thr", "Thr", "Thr",
"Met", "Ile", "Ile", "Ile")
addTexts(diameter = 5, labels = text5, logic = c(addornot, TRUE), start.diameter = 4)
###
text6 = c("glutamic acid", "glutamic acid", "aspartic acid", "aspartic acid",
"glycine", "glycine", "glycine", "glycine", "alanine", "alanine", "alanine",
"alanine", "valine", "valine", "valine", "valine", "glutamine", "glutamine",
"histidine", "histidine", "arginine", "arginine", "arginine", "arginine",
"proline", "proline", "proline", "proline", "leucine", "leucine", "leucine",
"leucine", "stop", "stop", "tyrosine", "tyrosine", "stop", "tryptophan",
"cysteine", "cysteine", "serine", "serine", "serine", "serine", "leucine",
"leucine", "phenylalanine", "phenylalanine", "lysine", "lysine", "asparagine",
"asparagine", "arginine", "arginine", "serine", "serine", "threonine",
"threonine", "threonine", "threonine", "methionine", "isoleucine", "isoleucine",
"isoleucine")
addTexts(diameter = 6, labels = text6, logic = c(addornot, TRUE), start.diameter = 5,
cex = 0.7)
}
geneticCodeDemo1()

plot of chunk genetic-code-demo1

Demo 2

geneticCodeDemo2 <- function() {
plotCircle(diameter = 4)
addCircle(diameter = 3)
addCircle(diameter = 2)
addCircle(diameter = 1)

addLines(diameter = 4, parts = 4, start.diameter = 0)
addLines(diameter = 4, parts = 16, start.diameter = 1)
####
log1 = rep(c(FALSE, TRUE), 2)  #4
log2 = rep(c(FALSE, FALSE, FALSE, TRUE), 3)  #12
log3 = c(FALSE, FALSE, FALSE, TRUE)  #4
addornot = c(TRUE, log1, log2, log1, log2, log1, TRUE, TRUE, FALSE, TRUE,
log3, log1, log1, log1, log3, TRUE, FALSE, FALSE)  #64
addLines(diameter = 4, parts = 64, start.diameter = 2, logic = addornot)

text1 = c("G", "C", "T", "A")
addTexts(diameter = 1, labels = text1)
text2 = rep(c("A", "G", "C", "T"), 4)
addTexts(diameter = 2, labels = text2, cex = 0.8, start.diameter = 1)
text3 = rep(c("A", "G", "C", "T"), 16)
addTexts(diameter = 3, labels = text3, cex = 0.5, start.diameter = 2)
###
text4 = c("E", "E", "D", "D", "G", "G", "G", "G", "A", "A", "A", "A", "V",
"V", "V", "V", "Q", "Q", "H", "H", "R", "R", "R", "R", "P", "P", "P",
"P", "L", "L", "L", "L", "*", "*", "Y", "Y", "*", "W", "C", "C", "S",
"S", "S", "S", "L", "L", "F", "F", "K", "K", "N", "N", "R", "R", "S",
"S", "T", "T", "T", "T", "M", "I", "I", "I")
addTexts(diameter = 4, labels = text4, logic = c(addornot, TRUE), start.diameter = 3)
}
geneticCodeDemo2()

plot of chunk genetic-code-demo2

Demo 3

geneticCodeDemo3 <- function() {
plotCircle(diameter = 4)
addCircle(diameter = 3)
addCircle(diameter = 2)
addCircle(diameter = 1)

addLines(diameter = 4, parts = 4, start.diameter = 0)
addLines(diameter = 4, parts = 16, start.diameter = 1)
####
log1 = rep(c(FALSE, TRUE), 2)  #4
log2 = rep(c(FALSE, FALSE, FALSE, TRUE), 3)  #12
log3 = c(FALSE, FALSE, FALSE, TRUE)  #4
addornot = c(TRUE, log1, log2, log1, log2, log1, TRUE, TRUE, FALSE, TRUE,
log3, log1, log1, log1, log3, TRUE, FALSE, FALSE)  #64
addLines(diameter = 4, parts = 64, start.diameter = 2, logic = addornot)

text1 = c("G", "C", "T", "A")
addTexts(diameter = 1, labels = text1)
text2 = rep(c("A", "G", "C", "T"), 4)
addTexts(diameter = 2, labels = text2, cex = 0.8, start.diameter = 1)
text3 = rep(c("A", "G", "C", "T"), 16)
addTexts(diameter = 3, labels = text3, cex = 0.5, start.diameter = 2)
###
text4 = c("Glu", "Glu", "Asp", "Asp", "Gly", "Gly", "Gly", "Gly", "Ala",
"Ala", "Ala", "Ala", "Val", "Val", "Val", "Val", "Gln", "Gln", "His",
"His", "Arg", "Arg", "Arg", "Arg", "Pro", "Pro", "Pro", "Pro", "Leu",
"Leu", "Leu", "Leu", "*", "*", "Tyr", "Tyr", "*", "Trp", "Cys", "Cys",
"Ser", "Ser", "Ser", "Ser", "Leu", "Leu", "Phe", "Phe", "Lys", "Lys",
"Asn", "Asn", "Arg", "Arg", "Ser", "Ser", "Thr", "Thr", "Thr", "Thr",
"Met", "Ile", "Ile", "Ile")
addTexts(diameter = 4, labels = text4, logic = c(addornot, TRUE), start.diameter = 3)
}
geneticCodeDemo3()

plot of chunk genetic-code-demo3

Demo 4

geneticCodeDemo4 <- function() {
plotCircle(diameter = 4, plotCol = "chartreuse4", polygonCol = "darkgoldenrod1")
addCircle(diameter = 3, plotCol = "chartreuse4", polygonCol = "cornsilk")
addCircle(diameter = 2, plotCol = "chartreuse4", polygonCol = "cyan")
addCircle(diameter = 1, plotCol = "chartreuse4", polygonCol = "aquamarine")

addLines(diameter = 4, parts = 4, start.diameter = 0, lineCol = "chartreuse4")
addLines(diameter = 4, parts = 16, start.diameter = 1, lineCol = "chartreuse4")
####
log1 = rep(c(FALSE, TRUE), 2)  #4
log2 = rep(c(FALSE, FALSE, FALSE, TRUE), 3)  #12
log3 = c(FALSE, FALSE, FALSE, TRUE)  #4
addornot = c(TRUE, log1, log2, log1, log2, log1, TRUE, TRUE, FALSE, TRUE,
log3, log1, log1, log1, log3, TRUE, FALSE, FALSE)  #64
addLines(diameter = 4, parts = 64, start.diameter = 2, logic = addornot,
lineCol = "chartreuse4")

text1 = c("G", "C", "T", "A")
addTexts(diameter = 1, labels = text1)
text2 = rep(c("A", "G", "C", "T"), 4)
addTexts(diameter = 2, labels = text2, cex = 0.8, start.diameter = 1)
text3 = rep(c("A", "G", "C", "T"), 16)
addTexts(diameter = 3, labels = text3, cex = 0.5, start.diameter = 2)
###
text4 = c("Glu", "Glu", "Asp", "Asp", "Gly", "Gly", "Gly", "Gly", "Ala",
"Ala", "Ala", "Ala", "Val", "Val", "Val", "Val", "Gln", "Gln", "His",
"His", "Arg", "Arg", "Arg", "Arg", "Pro", "Pro", "Pro", "Pro", "Leu",
"Leu", "Leu", "Leu", "*", "*", "Tyr", "Tyr", "*", "Trp", "Cys", "Cys",
"Ser", "Ser", "Ser", "Ser", "Leu", "Leu", "Phe", "Phe", "Lys", "Lys",
"Asn", "Asn", "Arg", "Arg", "Ser", "Ser", "Thr", "Thr", "Thr", "Thr",
"Met", "Ile", "Ile", "Ile")
addTexts(diameter = 4, labels = text4, logic = c(addornot, TRUE), start.diameter = 3)
}
geneticCodeDemo4()

plot of chunk genetic-code-demo4

http://felixfan.github.io/condition-VS-interaction

Conditioning on a SNP is done when you have two (or more) SNPs and you wish to ask the question “is the effect of SNP two independent of the effect of SNP one?”.

The conditioning SNPs are entered into the model as covariates, using a simple 0, 1, 2 allele dosage coding. That is, for two conditioning SNPs, say rs1001 and rs1002, plus a standard covariate, the model would be:

Y = b0 + b1.ADD + b2.rs1001 + b3.rs1002 + b4.COV1 + e

If the b1 coefficient for the test SNP is still significant after entering these covariates, this would suggest that it does indeed have an effect independent of rs1001, rs1002 and the other covariate. Here b1 is the coefficient of the test SNP, the one whose independence we would like to test. The other coefficients may still be highly significant, but these reflect the effects of the conditioning SNPs and covariates, not the test SNP.
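
A rough R sketch of this conditional model (an illustration only, not PLINK itself; snp.test, rs1001 and rs1002 stand for hypothetical 0/1/2 dosage vectors, cov1 for a covariate and y for a quantitative phenotype):

fit <- lm(y ~ snp.test + rs1001 + rs1002 + cov1)
summary(fit)$coefficients["snp.test", ]  # is the test SNP still significant after conditioning?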

Testing for interaction is done when you have a covariate (which may not be a SNP) and a SNP and you wish to ask the question, “if I add an interaction coefficient to a model that already has terms for the covariate and the SNP, is that interaction term significant?”

The PLINK documentation illustrates this with the following model:

Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV1 + b5.ADDxCOV2 + e

Here the test is not “is my SNP independent of the covariate”, but instead “if I have both my SNP and the covariate as terms, do they together exert a stronger effect on the phenotype than I would expect to see through the linear addition of their individual effects?”
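
A rough R sketch of the interaction model (again an illustration with hypothetical objects: snp is a 0/1/2 dosage vector, cov1 and cov2 are covariates, y is the phenotype):

fit <- lm(y ~ snp + cov1 + cov2 + snp:cov1 + snp:cov2)
summary(fit)  # only the snp:cov1 and snp:cov2 rows address the interaction question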

Note: When including the --interaction flag, you should probably only interpret the interaction p-value.

summary

  • Conditioning on a SNP is like removing its effect. As an example, let’s say you find that SNP1 is strongly associated with your trait. Then you find SNP2, nearby, is also strongly associated. You might then ask whether these are 2 separate signals. By conditioning on SNP1 (i.e., holding SNP1 constant for all subjects), you can examine whether SNP2 still has an effect.

  • Testing for interactions answers a different question: Does the association between SNP1 and my trait differ by SNP2 genotype? This is a gene-gene interaction. Or you might ask if SNP1 behaves differently among statin drug users versus non-users. This would be testing a gene-environment interaction.

Further Reading

Biostar
PLINK
Informed conditioning on clinical covariates increases power in case-control association studies. PLoS Genet. 2012;8(11):e1003032
The Covariate’s Dilemma. PLoS Genet. 2012 November; 8(11): e1003096.

http://felixfan.github.io/bar-plot

Bar plot with significant differences

Simulate raw data for t-test

set.seed(123456)
alpha = 0.05
a = rnorm(50, 10, 2)
b = rnorm(50, 15, 2)
df = data.frame(a = a, b = b)
test = t.test(df)
test

One Sample t-test

data:  df
t = 41.22, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
11.93 13.14
sample estimates:
mean of x
12.53

Calculate statistics (95% confidence interval)

means <- sapply(df, mean)
lowers <- sapply(df, function(v) t.test(v, conf.level = 1 - alpha)$conf.int[1])
uppers <- sapply(df, function(v) t.test(v, conf.level = 1 - alpha)$conf.int[2])

Or,

means <- sapply(df, mean)
sd <- sapply(df, sd)
n <- sapply(df, length)
se <- sd/sqrt(n)
lowers <- means - qt(1 - alpha/2, df = n - 1) * se  # t critical value uses n - 1 degrees of freedom
uppers <- means + qt(1 - alpha/2, df = n - 1) * se

Use gplots to draw the bar plot

library(gplots)
ymax = max(uppers)

# bar plot with 95% confidence interval
bp <- barplot2(means, plot.ci = TRUE, ci.l = lowers, ci.u = uppers, xpd = FALSE,
ylim = c(0, ymax * 1.1))

# Add connection lines
x.cord <- sapply(bp, function(x) rep(x, each = length(uppers)))
y.cord <- rbind(c(uppers * 1.01), rep(1.05 * ymax, length(uppers)))
lines(x.cord[, 1], y.cord[, 1])
lines(x.cord[, 2], y.cord[, 2])
lines(x.cord[2, ], y.cord[2, ])

# Add significant star
x.star <- mean(bp)
y.star <- 1.08 * ymax

# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
star = ""
if (test$p.value < 0.001) {
star = "***"
} else if (test$p.value < 0.01) {
star = "**"
} else if (test$p.value < 0.05) {
star = "*"
} else if (test$p.value < 0.1) {
star = "."
} else {
star = " "
}

text(star, x = x.star, y = y.star)

plot of chunk barplot

Read more

Barplot with significant differences and interactions?
Bar plot with error bars in R
Wily data analysis

http://felixfan.github.io/R-version
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] knitr_1.4.1

loaded via a namespace (and not attached):
[1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    stringr_0.6.2
[5] tools_3.0.2
packageVersion("FinCal")
[1] '0.5'
http://felixfan.github.io/association-VS-causal

Association

  • When two variables are related, we say that there is association between them.

When researchers find a correlation, which can also be called an association, what they are saying is that they found a relationship between two or more variables.

Causal

  • When one variable has a direct influence on the other, the relationship is called causal.
  • Causality can only be determined by reasoning about how the data were collected.
  • The data values themselves contain no information that can help you to decide.

If two variables are causally related, it is possible to conclude that changes to the explanatory variable, X, will have a direct impact on Y.

If one variable causally affects the other, then adjusting the value of that variable will cause the other to change.

Obviously, it is much more difficult to prove causation than it is to prove an association.

Non-causal

In non-causal relationships, the relationship that is evident between the two variables is not completely the result of one variable directly affecting the other.

  • Two variables can be related to each other without either variable directly affecting the values of the other.

If two variables are not causally related, it is impossible to tell whether changes to one variable, X, will result in changes to the other variable, Y.

Observational studies

  • In observational studies, there is usually the potential for a lurking variable to underlie any observed relationship, so it is difficult to interpret relationships.

Data are collected in an observational study if we passively record (observe) values from each unit.

Experiments

  • In a well designed experiment, there is little chance of lurking variables driving the observed relationships, so any relationship will be causal.

In an experiment, the researcher actively changes some characteristics of the units before the data are collected. The values of some variables are therefore under the control of the experimenter. In other words, the experimenter is able to choose each individual’s values for some variables.

References

http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/Hcausal/causal_c1.html
http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/Hcausal/causal_c2.html
http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/Hcausal/causal_c3.html
http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/Hcausal/causal_c4.html
http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/Hcausal/causal_c5.html
Correlation, causation, and association - What does it all mean???

http://felixfan.github.io/kinship
library(kinship2)
data(sample.ped)
sample.ped2 = subset(sample.ped, ped == 2)
sample.ped2
ped  id father mother sex affected avail
42   2 201      0      0   1        1     1
43   2 202      0      0   2       NA     0
44   2 203      0      0   1        1     1
45   2 204    201    202   2        0     1
46   2 205    201    202   1       NA     0
47   2 206    201    202   2        1     1
48   2 207    201    202   2        1     1
49   2 208    201    202   2        0     0
50   2 209      0      0   1        0     0
51   2 210    203    204   1        0     0
52   2 211    203    204   1        0     1
53   2 212    209    208   2        0     1
54   2 213    209    208   1        0     0
55   2 214    209    208   1        1     1

Simple plot

pedAll <- pedigree(id = sample.ped2$id, dadid = sample.ped2$father, momid = sample.ped2$mother,
sex = sample.ped2$sex, famid = sample.ped2$ped)
print(pedAll)
Pedigree list with 14 total subjects in 1 families
print(pedAll["2"])
Pedigree object with 14 subjects, family id= 2
Bit size= 16
plot(pedAll["2"])

plot of chunk kinship1

Since there is only one family, or

pedAll <- pedigree(id = sample.ped2$id, dadid = sample.ped2$father, momid = sample.ped2$mother,
sex = sample.ped2$sex)
print(pedAll)
Pedigree object with 14 subjects
Bit size= 16
plot(pedAll)

plot of chunk kinship2

  • Squares and circles indicate males and females, respectively.

Add affection status

pedAll <- pedigree(id = sample.ped2$id, dadid = sample.ped2$father, momid = sample.ped2$mother,
sex = sample.ped2$sex, affected = sample.ped2$affected)
print(pedAll)
Pedigree object with 14 subjects
Bit size= 16
plot(pedAll)

plot of chunk kinship3

  • Blackened squares and circles indicate the affected males and females, respectively.
  • Squares and circles with a central question mark indicate that the affection status is unknown.

Add status (0=”censored”, 1=”dead”)

Suppose the first individual (ID=201) is dead.

sample.ped2$status = c(1, rep(0, 13))
pedAll <- pedigree(id = sample.ped2$id, dadid = sample.ped2$father, momid = sample.ped2$mother,
sex = sample.ped2$sex, affected = sample.ped2$affected, status = sample.ped2$status)
print(pedAll)
Pedigree object with 14 subjects
Bit size= 16
plot(pedAll)

plot of chunk kinship4

  • Squares and circles with a slash represent deceased individuals.

Add relationship (1=Monozygotic twin, 2=Dizygotic twin, 3=Twin of unknown zygosity, 4=Spouse)

Suppose 210 and 211 are monozygotic twins and 212 and 213 are dizygotic twins. Create a matrix with three columns (id1, id2, code) specifying the special relationships between pairs of individuals.

id1 = c(210, 212)
id2 = c(211, 213)
r = c(1, 2)
rm = cbind(id1, id2, r)
pedAll <- pedigree(id = sample.ped2$id, dadid = sample.ped2$father, momid = sample.ped2$mother,
sex = sample.ped2$sex, affected = sample.ped2$affected, status = sample.ped2$status,
relation = rm)
print(pedAll)
Pedigree object with 14 subjects
Bit size= 16
plot(pedAll)

plot of chunk kinship5

http://felixfan.github.io/Genomic-Inflation-Factor

The genomic inflation factor \(\lambda\) is defined as the ratio of the median of the empirically observed distribution of the test statistic to the expected median, thus quantifying the extent of the bulk inflation and the excess false positive rate.

\[ \lambda = \frac{median(\chi^{2})}{0.456} \]

\[ \chi^{2}_{adjusted} = \frac{\chi^{2}}{\lambda} \]
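
A minimal R sketch of these two formulas (assuming pvals is a vector of p-values from 1-degree-of-freedom association tests):

chisq <- qchisq(pvals, df = 1, lower.tail = FALSE)  # convert p-values back to chi-square statistics
lambda <- median(chisq)/qchisq(0.5, df = 1)  # the expected median of a 1-df chi-square is about 0.456
chisq.adjusted <- chisq/lambda  # genomic-control adjusted statistics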

Genomic inflation factor \(\lambda\) and quantile–quantile (Q–Q) plots were used to compare the genome-wide distribution of the test statistic with the expected null distribution. The Q–Q plot is a useful visual tool to mark deviations of the observed distribution from the expected null distribution. Inflated \(\lambda\) values or residual deviations in the Q–Q plot may point to undetected sample duplications, unknown familial relationships, a poorly calibrated test statistic, systematic technical bias or gross population stratification.
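
A basic Q-Q plot of the same p-values takes only a few lines (a sketch, reusing the hypothetical pvals vector from above):

obs <- -log10(sort(pvals))
expected <- -log10(ppoints(length(pvals)))
plot(expected, obs, xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
abline(0, 1, col = "red")  # points far above this line indicate inflation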

Since \(\lambda\) scales with sample size, some have found it informative to report \(\lambda _{1000}\), the inflation factor for an equivalent study of 1000 cases and 1000 controls, which can be calculated by rescaling \(\lambda\):

\[ \lambda_{1000} = 1 + (\lambda_{obs}-1) \times \frac{\frac{1}{n_{cases}}+\frac{1}{n_{controls}}}{\frac{1}{n_{cases,1000}}+\frac{1}{n_{controls,1000}}} \]

where \(n_{cases}\) and \(n_{controls}\) are the study sample size for cases and controls, respectively, and \(n_{cases,1000}\) and \(n_{controls,1000}\) are the target sample size (1000).
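
In R this rescaling is a one-liner (a sketch; n.cases and n.controls stand for the study's actual sample sizes):

lambda1000 <- 1 + (lambda - 1) * (1/n.cases + 1/n.controls)/(1/1000 + 1/1000)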

References

  • de Bakker, P. I. et al. Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum. Mol. Genet. 2008; 17, R122–128

  • Devlin B., Roeder K. Genomic control for association studies. Biometrics 1999;55:997-1004.

  • Freedman M.L., Reich D., Penney K.L., McDonald G.J., Mignault A.A., Patterson N., Gabriel S.B., Topol E.J., Smoller J.W., Pato C.N., et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 2004;36:388-393.

  • Clayton D.G., Walker N.M., Smyth D.J., Pask R., Cooper J.D., Maier L.M., Smink L.J., Lam A.C., Ovington N.R., Stevens H.E., et al. Population structure, differential bias and genomic control in a large-scale, case–control association study. Nat. Genet. 2005;37:1243-1246.

  • http://en.wikipedia.org/wiki/Population_stratification

http://felixfan.github.io/R-Data-type

vectors and assignment

x <- c(10, 5, 3, 6, 21)

x
[1] 10  5  3  6 21

assign("y", c(10.4, 5.6, 3.1, 6.4, 21.7))

y
[1] 10.4  5.6  3.1  6.4 21.7

z <- c(4, 6, 1, 4, 7)

z
[1] 4 6 1 4 7

a <- c(x, y, z)

a
[1] 10.0  5.0  3.0  6.0 21.0 10.4  5.6  3.1  6.4 21.7  4.0  6.0  1.0  4.0
[15]  7.0

v <- 2 * x + y + 1

v
[1] 31.4 16.6 10.1 19.4 64.7

mean(v)
[1] 28.44

median(v)
[1] 19.4

var(v)
[1] 470.5

sd(v)
[1] 21.69

min(v)
[1] 10.1

max(v)
[1] 64.7

range(v)
[1] 10.1 64.7

cumsum(v)
[1]  31.4  48.0  58.1  77.5 142.2

cumprod(v)
[1]      31.4     521.2    5264.5  102131.8 6607925.2

cummax(v)
[1] 31.4 31.4 31.4 31.4 64.7

cummin(v)
[1] 31.4 16.6 10.1 10.1 10.1

sum(v)
[1] 142.2

summary(v)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
10.1    16.6    19.4    28.4    31.4    64.7

length(v)
[1] 5

sort(v)
[1] 10.1 16.6 19.4 31.4 64.7

sqrt(v)
[1] 5.604 4.074 3.178 4.405 8.044

log(v)
[1] 3.447 2.809 2.313 2.965 4.170

s3 <- seq(-5, 5, by = 0.2)

s3
[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4
[15] -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4
[29]  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0  3.2
[43]  3.4  3.6  3.8  4.0  4.2  4.4  4.6  4.8  5.0

s4 <- seq(length = 51, from = -5, by = 0.2)

s4
[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4
[15] -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4
[29]  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0  3.2
[43]  3.4  3.6  3.8  4.0  4.2  4.4  4.6  4.8  5.0

s5 <- rep(x, times = 5)

s5
[1] 10  5  3  6 21 10  5  3  6 21 10  5  3  6 21 10  5  3  6 21 10  5  3
[24]  6 21

s6 <- rep(x, each = 5)

s6
[1] 10 10 10 10 10  5  5  5  5  5  3  3  3  3  3  6  6  6  6  6 21 21 21
[24] 21 21

a
[1] 10.0  5.0  3.0  6.0 21.0 10.4  5.6  3.1  6.4 21.7  4.0  6.0  1.0  4.0
[15]  7.0

a[1:5]
[1] 10  5  3  6 21

a[-(1:5)]
[1] 10.4  5.6  3.1  6.4 21.7  4.0  6.0  1.0  4.0  7.0

a[c(1, 3, 5)]
[1] 10  3 21

Matrices

# generates 5 x 4 numeric matrix
x <- matrix(1:20, nrow = 5, ncol = 4)

# another example
cells <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(rnames,
cnames))

# Combining Matrices
B = matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)
C = matrix(c(7, 4, 2), nrow = 3, ncol = 1)
cbind(B, C)
[,1] [,2] [,3]
[1,]    2    1    7
[2,]    4    5    4
[3,]    3    7    2

D = matrix(c(6, 2), nrow = 1, ncol = 2)
rbind(B, D)
[,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
[4,]    6    2

# Deconstruction
c(B)
[1] 2 4 3 1 5 7

# Identify rows, columns or elements using subscripts.
x[, 4]  # 4th column of matrix
[1] 16 17 18 19 20
x[3, ]  # 3rd row of matrix
[1]  3  8 13 18
x[2:4, 1:3]  # rows 2,3,4 of columns 1,2,3
[,1] [,2] [,3]
[1,]    2    7   12
[2,]    3    8   13
[3,]    4    9   14

# Transpose
t(x)
[,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20

Data Frame

d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
mydata
d     e     f
1 1   red  TRUE
2 2 white  TRUE
3 3   red  TRUE
4 4  <NA> FALSE

# variable names
names(mydata)
[1] "d" "e" "f"
names(mydata) <- c("ID", "Color", "Passed")
names(mydata)
[1] "ID"     "Color"  "Passed"

# There are a variety of ways to identify the elements of a data frame .
mydata[, 1:2]  # columns 1,2 of data frame
ID Color
1  1   red
2  2 white
3  3   red
4  4  <NA>
mydata[c("ID", "Color")]  # columns ID and Color from data frame
ID Color
1  1   red
2  2 white
3  3   red
4  4  <NA>
mydata$Passed  # variable Passed in the data frame
[1]  TRUE  TRUE  TRUE FALSE
subset(mydata, Passed == "TRUE")
ID Color Passed
1  1   red   TRUE
2  2 white   TRUE
3  3   red   TRUE
subset(mydata, ID > 3)
ID Color Passed
4  4  <NA>  FALSE
subset(mydata, ID < 3, select = c(ID, Passed))
ID Passed
1  1   TRUE
2  2   TRUE
subset(mydata, ID < 3, select = -c(Color, Passed))
ID
1  1
2  2
subset(mydata, Color == "red" & Passed == "TRUE")
ID Color Passed
1  1   red   TRUE
3  3   red   TRUE
mydata[mydata$ID %in% c(1, 3), ]
ID Color Passed
1  1   red   TRUE
3  3   red   TRUE

# number of data rows and columns
nrow(mydata)
[1] 4
ncol(mydata)
[1] 3

List

# example of a list with 4 components: a string, a numeric vector, a matrix,
# and a scalar
a <- c(1, 2, 5.3, 6, -2, 4)  # numeric vector
y <- matrix(1:20, nrow = 5, ncol = 4)
w <- list(name = "Fred", mynumbers = a, mymatrix = y, age = 5.3)
w
$name
[1] "Fred"

$mynumbers
[1]  1.0  2.0  5.3  6.0 -2.0  4.0

$mymatrix
[,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

$age
[1] 5.3
# example of a list containing two lists
list1 = list(mynumbers = a)
list2 = list(mymatrix = y)
v <- c(list1, list2)
v
$mynumbers
[1]  1.0  2.0  5.3  6.0 -2.0  4.0

$mymatrix
[,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
# Identify elements of a list using the [[]] convention.
v[[2]]  # 2nd component of the list
[,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
v[["mynumbers"]]  # component named mynumbers in list
[1]  1.0  2.0  5.3  6.0 -2.0  4.0

Factor

# variable gender with 20 'male' entries and 30 'female' entries
gender <- c(rep("male", 20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 1s and 30 2s and associates 1=female, 2=male
# internally (alphabetically) R now treats gender as a nominal variable
summary(gender)
female   male
30     20

# variable rating coded as 'large', 'medium', 'small'
rating <- c(rep("large", 5), rep("small", 10), rep("medium", 5))
rating <- ordered(rating)
# recodes rating to 1,2,3 and associates 1=large, 2=medium, 3=small
# internally R now treats rating as ordinal
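
If alphabetical order is not the natural ordering of the levels, the order can be set explicitly (a small addition, not in the original post):

rating <- factor(rating, levels = c("small", "medium", "large"), ordered = TRUE)
# levels are now ordered small < medium < large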

References

Quick-R: Data type
R Tutorial: Data Frame

http://felixfan.github.io/update-SNP-id

Step 1: Create a file of genomic coordinates from your map file

The format is like this:

chr1 4158539 4158540 kgp499505

In Linux, you can use the following command:

awk '{print "chr"$1,$4-1,$4,$2}' yourdata.bim > genomicCoordinates.txt

or

awk '{print "chr"$1,$4-1,$4,$2}' yourdata.map > genomicCoordinates.txt

Note: yourdata.bim and yourdata.map are in PLINK format.

Step 2: Navigate to UCSC Table Browser

Step 3: Options selection

  • Clade: Mammal
  • Genome: Human
  • Assembly: Feb. 2009 (GRCh37/hg19)
  • Group: Variation and Repeats
  • Track: All SNPs(137)
  • Table: snp137
  • Output format: selected fields from primary and related tables

Note: Make sure your genomic coordinates are based on hg19. If you are not sure, look up one SNP with an rs ID to check whether the positions match hg19 or hg18.

Step 4: Upload input file

On the “region” line, click the “define regions” button. Upload “genomicCoordinates.txt” file generated in step 1 by clicking “Choose File” button. Then click the “submit” button.

Note: There is a limit of 1,000 defined regions. It should be fine if you only update the significant SNPs.

Step 5: Select output format

On the “output format” line, select “selected fields from primary and related tables”.

Step 6: Get output

You can enter a filename on the “output file” line or leave it blank to see the results on the screen. Enter “output.txt” and click the “get output” button.

In the newly opened page “Select Fields from hg19.snp137”, check the “chrom”, “chromStart”, “chromEnd” and “name” checkboxes. Then click the “get output” button.

Step 7: Delete deletions?

You may get multiple records for one query. For example:

Input:

chr12  72650313	72650314	rs201804413

And the output:

chr12  72650313	72650317	rs66927394
chr12	72650313	72650314	rs201804413

I am not sure whether the first record in the output is a ‘deletion’, but I only want the second record, so I will remove the first record.

In Linux, you can remove all records like the first one (spanning more than one base) with the following command:

awk '$3-$2==1{print}' output.txt > output2.txt

Note: “output.txt” is the output in step 6.

Step 8: Update IDs

In Linux,

sort -k1 genomicCoordinates.txt > genomicCoordinates.sort.txt
sort -k1 output2.txt > output.sort.txt
join genomicCoordinates.sort.txt output.sort.txt | awk '{if($2==$5 && $3==$6) print $4,$7}' > updateIDs.txt

Use PLINK's “--update-map updateIDs.txt --update-name” command to update the SNP IDs.

Reference

Reg. Updating SNP ids for SNPs on Illumina HumanOmni2.5M array
PLINK

http://felixfan.github.io/File-Manipulation

Create directories

dir.create("test")

Get working directory

getwd()
[1] "E:/360yunpanGmail/study/R/Rmarkdown"

Set working directory

setwd("./test")
getwd()
[1] "E:/360yunpanGmail/study/R/Rmarkdown/test"

Create file

file.create() creates files with the given names if they do not already exist and truncates them if they do.

file.create("test.csv")
[1] TRUE
file.create("test1.csv", "test2.csv")
[1] TRUE TRUE

Output data to file

cat("Hello world\n", file = "test1.csv")
cat("Hello R\n", file = "test2.csv")
data(iris)
write.table(iris, file = "test.csv", row.names = FALSE, sep = ",")

Test whether file exists

file.exists("test.csv")
[1] TRUE

Read data from file

mydata1 = read.table("test.csv", header = TRUE, sep = ",")
head(mydata1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

or

mydata2 = read.csv("test.csv", header = TRUE)
head(mydata2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Remove file

file.remove("test2.csv")
[1] TRUE
file.exists("test2.csv")
[1] FALSE

Delete files and directories

unlink("test1.csv")
file.exists("test1.csv")
[1] FALSE

Rename file

Rename “test.csv” to “test2.csv”.

file.rename("test.csv", "test2.csv")
[1] TRUE

Append file

Append “test2.csv” to “test1.csv”.

file.append("test1.csv", "test2.csv")
[1] TRUE
head(read.csv("test1.csv", header = FALSE))
V1          V2           V3          V4      V5
1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2          5.1         3.5          1.4         0.2  setosa
3          4.9           3          1.4         0.2  setosa
4          4.7         3.2          1.3         0.2  setosa
5          4.6         3.1          1.5         0.2  setosa
6            5         3.6          1.4         0.2  setosa

Copy file

Copy “test1.csv” to “test.csv”.

file.copy("test1.csv", "test.csv")
[1] TRUE
head(read.csv("test.csv", header = FALSE))
V1          V2           V3          V4      V5
1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2          5.1         3.5          1.4         0.2  setosa
3          4.9           3          1.4         0.2  setosa
4          4.7         3.2          1.3         0.2  setosa
5          4.6         3.1          1.5         0.2  setosa
6            5         3.6          1.4         0.2  setosa

Extract file information

file.info("test.csv")
size isdir mode               mtime               ctime
test.csv 4177 FALSE  666 2013-10-08 12:45:40 2013-10-08 12:45:05
atime exe
test.csv 2013-10-08 12:45:40  no

List the files in a directory/folder

list.files(path = ".", pattern = "*.csv")
[1] "test.csv"  "test1.csv" "test2.csv"
list.dirs(".")
[1] "."      "./test"

Sys.glob("*.csv")
[1] "test.csv"  "test1.csv" "test2.csv"

Construct path to file

file.path("test.csv")
[1] "test.csv"

Directory and file name

basename(file.path("test.csv"))
[1] "test.csv"
dirname(file.path("test.csv"))
[1] "."
http://felixfan.github.io/R-Tutorials

Q & A

Blogs

Online platform and courses

Tutorials

Finance

Machine Learning

Little Book of R

Rcpp

R Cookbook

IDE

R & Biology

R Users’ Group

CRAN and Bioconductor

Data mining

Analysis examples

Advanced R

Books

Fast-track publishing using knitr (from G-FORGE)

Tech-post

Jobs

Others

http://felixfan.github.io/install-update-R
OS:      WIN7
R:       3.0.2
RStudio: 0.98.312

1. Install R

R is available at http://www.r-project.org/.

2. Update R

if (!require(installr)) {
# load / install+load installr
install.packages("installr")
require(installr)
}

updateR()

3. Install R packages

3.1 Installing R packages from CRAN (http://cran.r-project.org/)

install.packages("FinCal", dependencies = TRUE)  # FinCal is the package name

or

RStudio -> Tools -> Install Packages -> Install from 'Repository (CRAN)'

3.2 Installing R packages from Package Archive File (offline)

install.packages("FinCal_0.5.zip")

or

RStudio -> Tools -> Install Packages -> Install from 'Package Archive File'

or

R CMD INSTALL FinCal_0.5.zip

3.3 Installing R packages from Bioconductor (http://www.bioconductor.org/)

source("http://bioconductor.org/biocLite.R")  # installs 'BiocInstaller'
biocLite()  # installs automatically 'Biobase' 'IRanges' 'AnnotationDbi' 'BiocGenerics' 'RSQLite'
all_group()  # Get list of all packages in BioConductor
biocLite(c("GenomicFeatures", "AnnotationDbi"))  #installing GenomicFeatures &AnnotationDbi packages

3.4 Installing R packages from GitHub (https://github.com/)

install.packages("devtools")  # requires for downloading & installation of GitHub packages
require(devtools)
install_github(repo = "FinCal", username = "felixfan")  # installing FinCal package

4. Update all existing packages

4.1 Automated Re-Install of Packages (packages in the default library dir)

update.packages(ask = FALSE, repos = "http://cran.rstudio.org", checkBuilt = TRUE)

or just

update.packages(ask = FALSE)

4.2 Automated Re-Install of Packages (packages do not in the default library dir)

.libPaths()  # gets the library trees within which packages are looked for
myPath = c("C:/Users/alice/Documents/R/win-library/3.0")  # change it to your own dir
package_df <- as.data.frame(installed.packages(myPath))  #Get currently installed packages
package_list <- as.character(package_df$Package)
install.packages(package_list)  #Re-install all installed packages

5. Reference

Installing R packages from CRAN/Bioconductor/Omegahat/Github
R 3.0.0 is released! (what’s new, and how to upgrade)
Automated Re-Install of Packages for R 3.0

http://felixfan.github.io/Regular-Expressions

created on 1 Oct 2013 updated on Wed Oct 02 21:39:27 2013

NOTE: The examples used with the ‘grep’ function can be extended to the other functions; to save space, I will only give one or two examples for each of the other functions.

1. Introduction

R supports two regular expression flavors: POSIX 1003.2 and Perl. By default R uses POSIX extended regular expressions.

  • fixed = TRUE: use exact matching.
  • perl = TRUE: use Perl-style regular expressions.
  • fixed = FALSE, perl = FALSE: use POSIX 1003.2 extended regular expressions.

Regular expressions are represented as strings. Metacharacters often need to be escaped. For example, the metacharacter "\n" must be entered as "\\n" to prevent R from interpreting the leading backslash before sending the string to the regular expression parser.
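
For example (a small illustration added here), a dot must be escaped as "\\." to match a literal dot, whereas an unescaped dot matches any character:

grep(".", c("3.14", "314"), value = TRUE)  # '.' matches any character
[1] "3.14" "314"
grep("\\.", c("3.14", "314"), value = TRUE)  # '\\.' matches a literal dot
[1] "3.14"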

2. Function ‘grep’

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)

The grep function requires two arguments. The first is a string containing a regular expression. The second is a vector of strings to search for matches. The grep function returns a vector of the indices of the elements of x that yielded a match.

2.1 Example with default parameters

grep("apple", c("I love apple and apple pie", "Apple ipad", "apple ipod"))
[1] 1 3

Note: grep is case-sensitive by default.

2.2 Example with case-insensitive match

To perform a case-insensitive match, add ignore.case = TRUE to the function call.

grep("apple", c("I love apple and apple pie", "Apple ipad", "apple ipod"), ignore.case = TRUE)
[1] 1 2 3

2.3 Example return the actual matches

To return the actual matches rather than their indices, add value = TRUE to the function call.

grep("apple", c("I love apple and apple pie", "Apple ipad", "apple ipod"), value = TRUE)
[1] "I love apple and apple pie" "apple ipod"

2.4 Examples with metacharacters

A list of metacharacters: $ * + . ? [ ] ^ { } | ( ) \

  • $: Force the regular expression to be at the end of the string
  • ^: Force the regular expression to be at the beginning of the string
  • *: The preceding item will be matched zero or more times.
  • +: The preceding item will be matched one or more times.
  • ?: The preceding item is optional and will be matched at most once.
  • {n}: The preceding item is matched exactly ‘n’ times.
  • {n,}: The preceding item is matched ‘n’ or more times.
  • {n,m}: The preceding item is matched at least ‘n’ times, but not more than ‘m’ times.
  • .: Stands for any character.
  • |: Alternation match.
  • [ABC]: means A, B or C.
  • [A-Z]: means any upper-case letter between A and Z.
  • [0-9]: means any digit between 0 and 9.
  • \\d: Digit, 0,1,2 … 9
  • \\D: Not Digit
  • \\s: Space
  • \\S: Not Space
  • \\w: Word
  • \\W: Not Word
  • \\t: Tab
  • \\n: New line

pattern: any character at the beginning of the string, followed by ‘pple’:

grep("^.p{2}le", c("I love apple and apple pie", "Apple ipad", "apple ipod"),
value = TRUE)
[1] "Apple ipad" "apple ipod"

pattern: ‘p’ before ‘d’, and any character in between

grep("p.d", c("I love apple and apple pie", "Apple ipad", "apple ipod"), value = TRUE)
[1] "Apple ipad" "apple ipod"

pattern: any upper letter

grep("[A-Z]", c("I love apple and apple pie", "Apple ipad", "apple ipod"), value = TRUE)
[1] "I love apple and apple pie" "Apple ipad"

pattern: any letter

grep("[A-Za-z]", c("I love apple and apple pie", "Apple ipad", "apple ipod"),
value = TRUE)
[1] "I love apple and apple pie" "Apple ipad"
[3] "apple ipod"

pattern: pad or pod

grep("pad|pod", c("I love apple and apple pie", "Apple ipad", "apple ipod"),
value = TRUE)
[1] "Apple ipad" "apple ipod"
grep("p(a|o)d", c("I love apple and apple pie", "Apple ipad", "apple ipod"),
value = TRUE)
[1] "Apple ipad" "apple ipod"

3. Function ‘grepl’

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

The grepl function takes the same arguments as the grep function, except for the value argument, which is not supported. grepl returns a logical vector with the same length as the input vector.

grepl("apple", c("I love apple and apple pie", "Apple ipad", "apple ipod"))
[1]  TRUE FALSE  TRUE

4. Function ‘regexpr’ and ‘gregexpr’

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

The functions regexpr and gregexpr require two arguments: a regular expression and a vector of text to process. regexpr returns the location of the first match in each component, while gregexpr returns the locations of all matches in each component.

regexpr("apple", c("I love apple and apple pie", "Apple ipad", "apple ipod"))
[1]  8 -1  1
attr(,"match.length")
[1]  5 -1  5
attr(,"useBytes")
[1] TRUE

regexpr returns an integer vector with the same length as the input vector. Each element in the returned vector indicates the character position in each corresponding string element in the input vector at which the (first) regex match was found. A match at the start of the string is indicated with character position 1. If the regex could not find a match in a certain string, its corresponding element in the result vector is -1. The returned vector also has a match.length attribute. This is another integer vector with the number of characters in the (first) regex match in each string, or -1 for strings that didn’t match.

gregexpr("apple", c("I love apple and apple pie", "Apple ipad", "apple ipod"))
[[1]]
[1]  8 18
attr(,"match.length")
[1] 5 5
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 1
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE

gregexpr is the same as regexpr, except that it finds all matches in each string.

5. Function ‘regmatches’

regmatches(x, m, invert = FALSE)
regmatches(x, m, invert = FALSE) <- value

Use regmatches to get the actual substrings matched by the regular expression or replace matched substrings.

x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")
m <- regexpr("apple", x)
regmatches(x, m)
[1] "apple" "apple"
x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")
m <- regexpr("apple", x)
regmatches(x, m) <- "orange"
x
[1] "I love orange and apple pie" "Apple ipad"
[3] "orange ipod"

6. Function ‘sub’ and ‘gsub’

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,fixed = FALSE, useBytes = FALSE)

The functions sub and gsub replace one pattern with another. They require three arguments: a regular expression, a replacement pattern, and a vector of strings to process. sub is analogous to s/// in Perl, replacing only the first instance of the pattern; gsub is analogous to s///g in Perl, replacing all instances.

x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")
x
[1] "I love apple and apple pie" "Apple ipad"
[3] "apple ipod"
x <- sub("apple", "orange", x)
x
[1] "I love orange and apple pie" "Apple ipad"
[3] "orange ipod"
x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")
x
[1] "I love apple and apple pie" "Apple ipad"
[3] "apple ipod"
x <- gsub("apple", "orange", x)
x
[1] "I love orange and orange pie" "Apple ipad"
[3] "orange ipod"

7. Function ‘strsplit’

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

Function strsplit splits its input according to a specified regular expression.

x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")
strsplit(x, split = "\\W")
[[1]]
[1] "I"     "love"  "apple" "and"   "apple" "pie"

[[2]]
[1] "Apple" "ipad"

[[3]]
[1] "apple" "ipod"
strsplit(x, split = "")
[[1]]
[1] "I" " " "l" "o" "v" "e" " " "a" "p" "p" "l" "e" " " "a" "n" "d" " "
[18] "a" "p" "p" "l" "e" " " "p" "i" "e"

[[2]]
[1] "A" "p" "p" "l" "e" " " "i" "p" "a" "d"

[[3]]
[1] "a" "p" "p" "l" "e" " " "i" "p" "o" "d"

8. Summary

Function Purpose
grep() returns a vector of indices where a pattern is matched
grepl() returns a logical vector (TRUE/FALSE) for each element of the data
regexpr() returns an integer vector giving the starting position of the first match, along with a match.length attribute giving the length of the matched text.
gregexpr() returns, for each element of the data, an integer vector giving the starting positions of all matches, along with a match.length attribute giving the lengths of the matched text.
regmatches() Extract or Replace Matched Substrings.
sub() replaces one pattern with another at first matching location
gsub() replaces one pattern with another at every matching location
strsplit() breaks apart strings at predefined points

9. References

Regular expressions in R
Regular Expressions with The R Language
http://en.wikibooks.org/wiki/R_Programming/Text_Processing
R Regular Expression

http://felixfan.github.io/Font-ts1-zi4r-at-540-not-found

When I used {devtools} to build and check a source package, I got the error: Font ts1-zi4r at 540 not found.

OS: Win7 R: 3.0.2 RStudio: 0.98.309 MikTeX: 2.9

I solved this problem using the following method:

  • Open a Command Prompt window. How?

  • type “initexmf --update-fndb” in the command window; this step may take several seconds

initexmf --update-fndb
  • type “initexmf --edit-config-file updmap” in the command window; this command will open updmap.cfg in your default editor. Since I use Notepad++, it was opened in Notepad++.
initexmf --edit-config-file updmap
  • add “Map zi4.map” to updmap.cfg (the opened file)
Map zi4.map
  • save and close updmap.cfg

  • type “initexmf --mkmaps” in the command window.

initexmf --mkmaps

It works now.
Have fun!

Note: You need to update R to 3.0.2 first, or check this post to fix the “inconsolata.sty is not available” problem first.

Reference

http://felixfan.github.io/pubmed-wordcloud

PubMedWordcloud is an R package for creating a word cloud from the abstracts of publications in PubMed. It makes it easy to create a word cloud from the abstracts of your own publications as a visual summary of your work.

More details and examples of using PubMedWordcloud are available here

PubMedWordcloud is available at CRAN and GitHub

http://felixfan.github.io/fincal

FinCal is an R package for time value of money calculation, time series analysis and computational finance. Currently, the functions cover the following:

  • Financial calculator for time value of money: PV, FV, N, PMT, I/Y
  • NPV, IRR
  • Methods for downloading historical stock prices from Yahoo finance and Google finance
  • Technical analysis
  • Cost of goods sold and ending inventory under three methods (FIFO, LIFO, weighted average)

More details and examples of using FinCal are available here

FinCal is available at CRAN and GitHub

http://felixfan.github.io/ipgwas

IPGWAS was developed to facilitate the identification of rational thresholds for the QC of GWAS datasets, association analysis, Manhattan and quantile-quantile (QQ) plots, and format conversion for genetic analyses such as meta-analysis, genotype phasing, and imputation. IPGWAS is a multiplatform application written in Perl with a graphical user interface (GUI).

Features
  • genome-wide association study (GWAS)
  • quality control (QC)
  • Manhattan plot and QQ plot
  • Cochran-Armitage trend test, Association test
  • combine GWAS dataset, Split GWAS Data by Chromosome
  • Convert EIGENSTRAT output chi-square to p value
  • Convert MACH imputation output to PED/MAP format and SNPTEST format
  • Convert PED/MAP files to the default input format of PHASE and BEAGLE
  • Convert PLINK and SNPTEST files to GWAMA format
  • P-Value Calculator (Chi-square test, Cochran-Armitage trend test, and Fisher’s exact test)
  • Change affection status of GWAS data
  • Filter subjects of GWAS data by affection status (case/control) and/or gender (male/female)
  • GUI for PLINK, SNP ratio Test
  • filter singleton significant SNPs
  • eigPlot (plot the eigenstrat result)
  • Convert PED/MAP files to the default input format of MACH and IMPUTE2
  • Convert IMPUTE2 imputation output to TPED/TFAM (PLINK) format

IPGWAS is available at sourceforge.