The figure above was drawn with the following R code:

require(ggplot2)   # ggplot(), geom_ribbon(), etc.
require(animation)

x <- seq(-4.2, 4.2, length=10000)

saveMovie(
          for (i in 1:10) {
            end.pin <- 1000*i
            dat1 <- data.frame(x=x[1:end.pin], y=pnorm(x[1:end.pin]),z=dnorm(x[1:end.pin]))
            h1 <- ggplot(dat1, aes(x=x, y=y))
            h1 <- h1 + xlim(min(x), max(x)) + ylim(0, 1)            
            if (pnorm(x[end.pin]) > dnorm(x[end.pin])) {
              h1 <- h1 + geom_ribbon(aes(y=y, ymin=0, ymax=y), fill='pink', alpha=0.75)
              h1 <- h1 + geom_ribbon(aes(y=z, ymin=0, ymax=z), fill='cyan', alpha=0.75)
            } else {
              h1 <- h1 + geom_ribbon(aes(y=z, ymin=0, ymax=z), fill='cyan', alpha=0.75)
              h1 <- h1 + geom_ribbon(aes(y=y, ymin=0, ymax=y), fill='pink', alpha=0.75)
            }
            h1 <- h1 + geom_text(aes(x=-2.4, y=0.7, label=paste('x=', round(x[end.pin],3))), size=8, colour='darkgreen')
            h1 <- h1 + geom_text(aes(x=0.2, y=0.2, label="φ(x)="), size=8, colour='blue')            
            h1 <- h1 + geom_text(aes(x=2.0, y=0.2, label=round(dnorm(x[end.pin]), 3)), size=8, colour='blue')            
            h1 <- h1 + geom_text(aes(x=-1.4, y=0.5, label="Φ(x)="), size=8, colour='red')
            h1 <- h1 + geom_text(aes(x=0.4, y=0.5, label=round(pnorm(x[end.pin]), 3)), size=8, colour='red')
            h1 <- h1 + geom_line(colour='red')
            h1 <- h1 + geom_line(aes(y=z), colour='blue')
            print(h1)            
          }, moviename="normal_dist_ggplot2.gif")


A problem with R's integrate

December 21, 2010

A quick test:
\[ \Phi(\color{red}x) = \int_{-\infty}^{\color{red}x} \phi(z)dz\]

> integrate(dnorm,0,4384)
0.5 with absolute error < 5.2e-05
> integrate(dnorm,0,4385)
6.124136e-05 with absolute error < 0.00012

In theory the value should be increasing, yet it suddenly drops to almost zero. Why?
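The help page for integrate() itself warns about exactly this: it is an adaptive quadrature that evaluates the integrand at finitely many points, and on a huge finite interval like (0, 4385) essentially all of those points land where dnorm is numerically zero, so the spike near 0 that carries all the mass can be missed entirely. A sketch of two workarounds:

```r
## use the known CDF instead of quadrature
pnorm(4385) - pnorm(0)        # 0.5

## infinite limits are handled by a change of variables and behave better
## than a huge finite interval
integrate(dnorm, 0, Inf)      # 0.5
```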

R::GSL

December 17, 2010

After using it for a while, I have just one impression: R::GSL is remarkably complete.

Almost every function can be found by typing gsl:: + TAB.

It looks like R is getting closer and closer to being a complete computational tool. A proper computational tool should provide the following:

  • Basic computation (at least no less than the C standard library), plus support for the IEEE extended specification (available through GSL)
  • The common functions of applied mathematics, say, every function mentioned in an engineering-mathematics textbook (available through GSL)
  • The common statistical functions (this is where R is currently among the most complete)
  • Solvers for differential equations, both ODE and PDE (deSolve is barely adequate, but usable)
  • Accurate and fast numerical integration (R's integrate is usable for one-dimensional integrals)
  • Data visualization (plotting, that is; I cannot think of any open-source software that does this better than R)
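As a small taste of the coverage (a sketch that assumes the gsl R package and the underlying GSL C library are installed):

```r
library(gsl)            # R binding to the GNU Scientific Library

## special functions straight from GSL
bessel_J0(0)            # Bessel function of the first kind: J_0(0) = 1
expint_E1(1)            # exponential integral E_1(1)
```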

Take a look at the following simple C function:

double triplesum(double a, double b, double c) {
  return a + b + c; 
}

Suppose we want to call it from Python (after compiling it into a shared library, here triplesum.so):

>>> from ctypes import *
>>> lib1 = cdll.LoadLibrary('triplesum.so')
>>> lib1.triplesum.argtypes = [c_double, c_double, c_double]
>>> lib1.triplesum.restype = c_double
>>> lib1.triplesum(1.2, 3.4, 2.3)
6.8999999999999995

In R:
C code:

void triplesum(double* a, double* b, double* c, double* d) {
  *d =  *a + *b + *c; 
}

Compile it by

 $ R CMD SHLIB forR.c

R code:

> dyn.load("forR.so")
> a <- 1.2; b <- 3.4; c <- 2.3; d <- 0
> .C("triplesum", a, b, c, d)[[4]]
[1] 6.9

It’s hard to say which one is more convenient, but Python seems better because you don’t need to modify the C code.

R's .C mechanism makes me feel like I am writing a Fortran subroutine, but in C syntax.

In a Bayesian hierarchical setting:
\[
\left.\boldsymbol x_i\right|\boldsymbol\mu, \mathbf\Sigma\sim\mathscr N(\boldsymbol\mu, \mathbf\Sigma)
\]
for $i=1,2,\ldots,n$, where
\[
\left.\boldsymbol\mu\right|\mathbf\Sigma\sim\mathscr N\left(\boldsymbol\mu_0, \frac{1}{\kappa_0}\mathbf\Sigma\right)
\]
and
\[
\mathbf\Sigma\sim\mathscr{IW}(\nu_0, \mathbf\Sigma_0)
\]
and, most statisticians can tell me that
\[
\left.\boldsymbol\mu\right|\boldsymbol x_1,\boldsymbol x_2,\ldots, \boldsymbol x_n,\mathbf\Sigma\sim\mathscr N\left(\frac{\kappa_0}{n+\kappa_0}\boldsymbol\mu_0+\frac{n}{n+\kappa_0}\overline{\boldsymbol x}, \frac{1}{n+\kappa_0}\mathbf\Sigma\right)
\]
where $\overline{\boldsymbol x}=\frac{1}{n}\sum_{i=1}^n\boldsymbol x_i$ and
\[
\left.\mathbf\Sigma\right|\boldsymbol x_1,\boldsymbol x_2,\ldots, \boldsymbol x_n\sim\mathscr{IW}\left(n+\nu_0, \mathbf\Sigma_0 + \sum_{i=1}^n(\boldsymbol x_i-\overline{\boldsymbol x})(\boldsymbol x_i-\overline{\boldsymbol x})^\top + \frac{\kappa_0 n}{\kappa_0+n}(\overline{\boldsymbol x}-\boldsymbol\mu_0)(\overline{\boldsymbol x}-\boldsymbol\mu_0)^\top\right)
\]
right away, but, if we set
\[
\left.\boldsymbol\mu\right|\mathbf\Sigma\sim\mathscr N\left(\boldsymbol\mu_0,\mathbf{A\Sigma A}^\top\right)
\]
can we get such a beautiful posterior form?
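For the standard $\kappa_0$ case above, one posterior draw can be sketched in base R (the data and hyperparameter values below are made up; rWishart lives in the stats package):

```r
## One draw of (mu, Sigma) from the Normal-Inverse-Wishart posterior.
set.seed(1)
n <- 50; p <- 2
X <- matrix(rnorm(n * p), n, p)                  # fake data
mu0 <- rep(0, p); kappa0 <- 1; nu0 <- p + 2; Sigma0 <- diag(p)

xbar <- colMeans(X)
S <- crossprod(sweep(X, 2, xbar))                # sum_i (x_i - xbar)(x_i - xbar)^T

## Sigma | x ~ IW(n + nu0, Sigma0 + S + kappa0*n/(kappa0+n) (xbar-mu0)(xbar-mu0)^T)
Sn <- Sigma0 + S + kappa0 * n / (kappa0 + n) * tcrossprod(xbar - mu0)
Sigma <- solve(rWishart(1, n + nu0, solve(Sn))[, , 1])  # inverse-Wishart draw

## mu | x, Sigma ~ N((kappa0*mu0 + n*xbar)/(kappa0+n), Sigma/(kappa0+n))
mun <- (kappa0 * mu0 + n * xbar) / (kappa0 + n)
mu  <- mun + drop(t(chol(Sigma / (kappa0 + n))) %*% rnorm(p))
```

The inverse-Wishart draw uses the fact that if $W\sim\mathscr W(\nu, S^{-1})$ then $W^{-1}\sim\mathscr{IW}(\nu, S)$.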

Does R beat Python?

November 14, 2010

For people who major in statistics and talk about R every day, this article is of course required reading for feeling good about ourselves.

As for those who speak more than one statistical computing language, another thing might surprise them even more: Python, one of the two languages the almighty Google anointed for App Engine, cannot get a single story in the world's smuggest New York Times.

Is that because New York's reporters are just not good enough, or, well, has R already (at least in statistical computing) left Python completely behind?

I tried Gelman’s famous example with OpenBUGS, JAGS, and the famous (or notorious) WinBUGS, all driven from R, to compare the RNG performance.

This is the testing code:

library(arm)
library(R2jags)

J <- 8
y <- c(28,8,-3,7,-1,1,18,12)
sigma.y <- c(15,10,16,11,9,11,10,18)
schools.data <- list ("J", "y", "sigma.y")
schools.inits <- function()
  list (theta=rnorm(J,0,1), mu.theta=rnorm(1,0,100),
        sigma.theta=runif(1,0,100))
schools.parameters <- c("theta", "mu.theta", "sigma.theta")

## using WINE+WinBUGS
print(system.time(schools.sim <- bugs (schools.data, schools.inits,
                                       schools.parameters, "schools.bug", n.chains=3, n.iter=10000,
                                       WINE='WINEDLLOVERRIDES="ole32,oleaut32,rpcrt=n" /usr/bin/wine')))


## using OpenBUGS
print(system.time(schools.sim <- bugs (schools.data, schools.inits,
                                       schools.parameters, "schools.bug", n.chains=3, n.iter=10000,
                                       program="OpenBUGS")))

## using JAGS
print(system.time(schools.sim <- jags(schools.data, schools.inits,
                                      schools.parameters, "schools.bug",
                                      n.chains=3, n.iter=10000)))

The results are quite interesting.

The first result is R+WINE+WinBUGS+R2WinBUGS, with all of these at their latest versions:

> source('schools.R')
fixme:keyboard:RegisterHotKey (0x10048,13,0x00000002,3): stub
   user  system elapsed 
  1.292   0.384   3.663 

The second result is R+OpenBUGS, where OpenBUGS
is the version provided on OpenBUGS’s web site, a 32-bit Linux command line executable:

   user  system elapsed 
  3.500   1.048   5.162 

And the last, fully native R+JAGS:

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 39

  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
  |**************************************************| 100%
   user  system elapsed 
  1.376   0.008   1.384 

You can see that even though OpenBUGS provides a native 32-bit Linux version, it is still beaten by WinBUGS+WINE. WINE usually imposes a significant performance penalty, and even so, WINE+WinBUGS outperforms OpenBUGS.

Does this mean OpenBUGS needs some more optimization work?

Let’s wait, participate, and see.

MathML stretch symbol Test

October 25, 2010

\[
E\left(X^k\right) = \int_{\mathbb R} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}}\, x^k \left(\frac{\nu}{\nu+x^2}\right)^{\frac{\nu+1}{2}} dx
\]

Suppose $\boldsymbol x\in\{0\cup\mathbb N\}^m$ is a random vector that follows a multinomial distribution with parameter vector $\boldsymbol\theta\in (0,1)^m$.

Assume that $\boldsymbol x$’s parameter $\boldsymbol\theta$ has a Dirichlet prior with parameter $\boldsymbol\alpha\in (2, \infty)^m$.

It is clear that
\[
\mathscr P(\left.\boldsymbol x, \boldsymbol\theta\right|\boldsymbol\alpha)
= \frac{\Gamma\left(\sum_{i=1}^mx_i+1\right)\Gamma\left(\sum_{i=1}^m\alpha_i\right)}{\prod_{i=1}^m\Gamma(x_i+1)\prod_{i=1}^m\Gamma(\alpha_i)}\prod_{i=1}^m\theta_i^{x_i+\alpha_i-1}
\]
And so
\[
\mathscr P(\left.\boldsymbol x\right|\boldsymbol\alpha) =
\frac{\Gamma\left(\sum_{i=1}^mx_i+1\right)\Gamma\left(\sum_{i=1}^m\alpha_i\right)\prod_{i=1}^m\Gamma(x_i+\alpha_i)}{\prod_{i=1}^m\Gamma(x_i+1)\prod_{i=1}^m\Gamma(\alpha_i)\Gamma\left(\sum_{i=1}^m(x_i+\alpha_i)\right)}
\]
Thus
\[
\mathscr P(\left.\boldsymbol\theta\right|\boldsymbol x, \boldsymbol\alpha) =
\frac{\Gamma\left(\sum_{i=1}^m(x_i+\alpha_i)\right)}{\prod_{i=1}^m\Gamma(x_i+\alpha_i)}
\prod_{i=1}^m\theta_i^{x_i+\alpha_i-1}
\]
which is a Dirichlet distribution with parameter $\boldsymbol x+\boldsymbol\alpha$.

For R, we have

> library(compositions)
> alpha <- c(2, 3, 4)   # some example Dirichlet parameters
> rmultinom(1, prob=rDirichlet.acomp(1, alpha), size=10)
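Without the compositions package, the same draw can be sketched in base R, using the fact that independent gamma variates, normalized by their sum, are Dirichlet (the alpha below is made up):

```r
set.seed(42)
alpha <- c(2, 3, 4)                       # made-up Dirichlet parameters
g <- rgamma(length(alpha), shape = alpha) # independent Gamma(alpha_i, 1)
theta <- g / sum(g)                       # theta ~ Dirichlet(alpha)
x <- rmultinom(1, size = 10, prob = theta)
```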

Consider a collection of random variables $\boldsymbol x = (x_1, x_2, \ldots, x_k)^T$. If $\boldsymbol x$ has a joint characteristic function
$\phi_{\boldsymbol x} (\boldsymbol t)$, where the coordinates of $\boldsymbol t = (t_1,t_2,\ldots, t_k)^T$ vary independently of one another, then the $x_i$ are mutually independent if and only if
\[
\phi_{\boldsymbol x} (\boldsymbol t) = \prod_{i=1}^k\phi_{X_i}(t_i)
\]

Note that the above statement fails if the $t_i$ are not allowed to vary separately; that is, the condition
\[
\phi_{\boldsymbol x} (t\boldsymbol 1) = \prod_{i=1}^k\phi_{X_i}(t)
\quad\text{for all } t
\]
does not imply that the $X_i$ are all independent. The reason is apparent: think about the case $t=0$.
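To spell out the $t=0$ case: for any joint distribution whatsoever,
\[
\phi_{\boldsymbol x}(0\cdot\boldsymbol 1) = E\left[e^{i\,0\sum_{j=1}^k x_j}\right] = 1 = \prod_{i=1}^k\phi_{X_i}(0),
\]
so the factorization at that single point carries no information about dependence.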