Planets, asteroids and comets are assembled inside mysterious gaseous structures called *protoplanetary disks*, which form around a newborn star (see Figure 1 below). The gas in these disks is turbulent due to a combination of magnetic, chemical and kinematic effects, and this turbulence is the driving agent behind a disk’s evolution. The stronger the turbulence, the faster the disk gas spirals into the central star.

One way in which astrophysicists can better understand these systems is through numerical simulations of small disk patches. One of the simulation outputs is a very important quantity, known as “α” (alpha) in the planet-formation field, which is a dimensionless measure of the intensity of the turbulence. If scientists can decipher how disk turbulence varies over the lifetime of a protoplanetary disk (on the order of 1 to 10 million years), we could potentially explain many fascinating features of our own solar system and of other planetary systems.

Up until now, astronomical observations of disks have provided only indirect, order-of-magnitude estimates of α. Simulations that include the relevant physics, on the other hand, allow us to glean the random variability of a disk’s α, and this provides theorists with a vital piece of information that can be used as input for their disk models.

However, thorough numerical descriptions of disks require large amounts of computational resources, particularly computing time, as well as knowledge of the complex codes that are used. What if it were possible to obtain α-values without having to carry out full 3-D simulations of disks every time someone needs them?

One way to do this is by performing a time series analysis of α-data from previously run disk simulations, and obtaining a stochastic model that can be used in subsequent disk calculations. The statistical language R contains many packages and functions ideally suited for this task, and the analysis below uses those. Excellent introductions to the use of R in time series analysis can be found in “Introductory Time Series with R”, by P. S. P. Cowpertwait and A. V. Metcalfe (Springer), and in “Time Series Analysis with Applications in R”, by J. D. Cryer and K. Chan (also published by Springer). What follows is not intended to be a rigorous or comprehensive study, and I would welcome comments and suggestions to improve these results.

Figure 2 shows a realization of α from a typical protoplanetary disk calculation, using an astrophysical-fluid-dynamics code. The unit of time is one orbit, which is one revolution of the disk patch being simulated around the central star (this can easily be converted to years, or any other desired unit). Data points are spaced by 0.5 orbits. Here we are discarding the first 70 orbits, because it takes the turbulence some time to set in after the beginning of the simulation, and we want to make sure that the data are truly capturing the turbulent state.
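As a sketch of this setup, the α history can be read into R as a `ts` object sampled twice per orbit and trimmed of its first 70 orbits. The file name is hypothetical (the actual format depends on the fluid code), so the snippet fabricates a stand-in series in order to be self-contained:

```r
# Hypothetical sketch: in practice, raw <- scan("alpha_history.dat") or similar.
# Here we fabricate a stand-in series so the snippet runs on its own.
set.seed(42)
raw <- exp(cumsum(rnorm(2000, sd = 0.05)) - 4)

# Two samples per orbit (data points are spaced by 0.5 orbits)
alpha <- ts(raw, start = 0, frequency = 2)

# Discard the first 70 orbits, before the turbulence is fully developed
alpha <- window(alpha, start = 70)

plot(alpha, xlab = "Time (orbits)", ylab = expression(alpha))
```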

The correlations between α-values at time *t* and values at times *t−k*, with *k* = 1, 10 and 200 representing the number of time steps, or “lag”, between data points, are shown in Figure 3. Values that are separated by one lag (half an orbit) are highly correlated, as evidenced by the black symbols. On the other hand, α-values that are 200 lags (100 orbits) apart show almost no correlation, as can be seen from the blue symbols. The corresponding correlation coefficients are indicated in the figure.
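A minimal way to reproduce this check in R is the following sketch (a stand-in series is generated in place of the simulation output so the snippet is self-contained):

```r
set.seed(1)
alpha <- exp(cumsum(rnorm(1500, sd = 0.05)) - 4)  # stand-in for the simulated series

# Correlation between alpha at time t and at time t - k
lagged_cor <- function(x, k) cor(head(x, -k), tail(x, -k))

sapply(c(1, 10, 200), function(k) lagged_cor(alpha, k))

# Scatter plot for a single lag, as in Figure 3
plot(head(alpha, -1), tail(alpha, -1),
     xlab = expression(alpha[t - 1]), ylab = expression(alpha[t]))
```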

Another way of examining correlations between pairs of data points is to plot the *sample autocorrelation function* (or sample ACF). A *correlogram* is a helpful visual tool that can be used for this purpose. Figure 4 presents the sample ACF of the data in Figure 2; for each *k*, the correlogram shows the correlation between α_{1} and α_{1+k}, α_{2} and α_{2+k}, α_{3} and α_{3+k}, etc.
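In R, the correlogram comes straight from the base `acf()` function (again with a stand-in series so the snippet runs on its own):

```r
set.seed(1)
alpha <- exp(cumsum(rnorm(1500, sd = 0.05)) - 4)  # stand-in series

# Sample ACF; the +/- 2/sqrt(N) significance band is drawn automatically
acf(alpha, lag.max = 400, main = "Sample ACF of alpha")
```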

The blue horizontal lines in Figure 4 are drawn at ±2/√*N*, where *N* is the number of data points. That is, the lines represent two approximate standard errors above and below zero. The sample ACF has a sampling distribution from which we can infer whether the data come from a completely random series. If the sample ACF falls outside the lines (as it does for most lags in Figure 4), we have evidence against the null hypothesis that the autocorrelation function equals zero, at the 5% level. In other words, the correlations are statistically significant. If the sample ACF remained within the 95% confidence band, it would be reasonable to infer that no correlations exist.

The text annotation in Figure 4 reports the result of an augmented Dickey-Fuller (ADF) test on the α-data of Figure 2. I won’t go into the details of the test (an exposition can be found in the books mentioned above), but generally speaking, it can be used to quantify the certainty about whether a time series is *non-stationary* (strictly speaking, whether it contains a unit root, which is the null hypothesis). A non-stationary series has a non-constant mean and/or variance, and shows significant correlations in its correlogram. Additionally, in that case the Dickey-Fuller statistic will be a negative number with a relatively small absolute value. That seems to be the case for our data, and Figure 4 supports the notion that the disk turbulence intensity from the simulation is non-stationary.
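The ADF test is available as `adf.test()` in the `tseries` package, which is not part of base R (`install.packages("tseries")` if needed). A sketch, using a stand-in series:

```r
library(tseries)  # provides adf.test()

set.seed(1)
alpha <- exp(cumsum(rnorm(1500, sd = 0.05)) - 4)  # stand-in series

# Null hypothesis: the series contains a unit root (is non-stationary).
# A small p-value would argue for stationarity.
adf.test(alpha)
```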

Now, to proceed further we need to transform the original time series into a stationary one, that is, into a process which is, in a sense, in statistical equilibrium. We can achieve that by taking the natural logarithm of, and then differencing, the series. This is shown in Figure 5, where the left panel corresponds to the transformed series, and the right panel to its sample ACF.
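The log-and-difference transformation is one line in R; the two panels of Figure 5 can be sketched as follows (stand-in series again):

```r
set.seed(1)
alpha <- ts(exp(cumsum(rnorm(1500, sd = 0.05)) - 4),
            start = 70, frequency = 2)  # stand-in series

# Natural logarithm, then first difference
y <- diff(log(alpha))

op <- par(mfrow = c(1, 2))
plot(y, xlab = "Time (orbits)", ylab = expression(paste(Delta, " ln ", alpha)))
acf(y, lag.max = 400, main = "Sample ACF")
par(op)
```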

The ACF shows a slow decay, but after some 350 lags no correlations remain. At this point, I would be tempted to say that even after this transformation (and some others that I’m not showing here) the series still remains non-stationary, given what we see in the correlogram. However, the augmented Dickey-Fuller test gives us reason to reject the null hypothesis of a unit root, or in other words, non-stationarity. Moreover, the transformed series shows a relatively stable mean compared to the original series in Figure 2, and the variance exhibits less change, except at a few times around 250 orbits. These larger changes would need to be incorporated into a more elaborate model, but for the time being I’m going with the result of the ADF test and take the transformed series to be stationary.

So, knowing that it was necessary to take the differences of the logarithms of α, we can use the “**arima**” function in R to obtain a model of that particular realization of α. “ARIMA” stands for Auto Regressive Integrated Moving Average, which describes the structure of a time series model consisting of recursive terms (the AR part) and terms of independent and identically distributed random variables (the MA part). The I in ARIMA refers to the fact that we had to difference the series to make it stationary. Exactly how many AR and MA terms are needed, and how many times we need to take differences, lies at the heart of the time series analysis problem. In this case, by trial and error, 5 AR terms and 5 MA terms are found to most accurately describe our stochastic process.
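A sketch of the fit in R follows; the order (5, 1, 5) is the one arrived at above, but the data here are a stand-in integrated ARMA process, since the original simulation output is not reproduced in this post:

```r
set.seed(1)
# Stand-in for ln(alpha): an integrated ARMA process with mild structure
y <- arima.sim(list(ar = c(0.5, -0.3), ma = 0.4), n = 600, sd = 0.05)
log_alpha <- cumsum(y) - 4

# ARIMA(p = 5, d = 1, q = 5) on ln(alpha); d = 1 encodes the single differencing.
# method = "ML" sidesteps occasional start-up problems of the default CSS stage.
fit <- arima(log_alpha, order = c(5, 1, 5), method = "ML")
fit  # prints the coefficients and their standard errors
```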

Using **arima** on the natural logarithm of the α-series, and specifying that we differenced it once, R computes the coefficients of the ARMA components and their associated standard errors:

| | AR1 | AR2 | AR3 | AR4 | AR5 | MA1 | MA2 | MA3 | MA4 | MA5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Coefficient | 0.295 | -0.058 | -0.058 | 0.942 | -0.352 | -0.555 | 0.017 | -0.019 | -0.982 | 0.570 |
| S.E. | 0.237 | 0.067 | 0.062 | 0.067 | 0.179 | 0.232 | 0.023 | 0.023 | 0.025 | 0.210 |

We can see that the coefficients for the AR2, AR3, MA2 and MA3 terms are not significant. Writing *y _{t}* = ∇ln α*_{t}* for the differenced log-series, the resulting model is then

*y _{t}* = 0.295 *y _{t−1}* − 0.058 *y _{t−2}* − 0.058 *y _{t−3}* + 0.942 *y _{t−4}* − 0.352 *y _{t−5}* + *w _{t}* − 0.555 *w _{t−1}* + 0.017 *w _{t−2}* − 0.019 *w _{t−3}* − 0.982 *w _{t−4}* + 0.570 *w _{t−5}*,

where the *w _{t}*‘s are white noise variables.

**This expression could, potentially, be used in relatively simple numerical models of disks that share physical characteristics with the one that produced α in Figure 2.** Before adopting the model as a statistical descriptor of our data, however, it is necessary to perform some diagnostics. To that end, we examine the standardized residuals obtained from our model fit. If the model is adequately specified and the estimates of the parameters are reasonably close to the true values, then the residuals should behave roughly like independent, identically distributed normal variables. If we plot their behavior over time, they should scatter approximately equally above and below zero, and not show any trends. This appears to be the case, as shown in Figure 6:
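One way to obtain and plot standardized residuals in R is sketched below, with the same stand-in fit as before (the built-in `tsdiag(fit)` produces a similar diagnostic display):

```r
set.seed(1)
y <- arima.sim(list(ar = c(0.5, -0.3), ma = 0.4), n = 600, sd = 0.05)
fit <- arima(cumsum(y), order = c(5, 1, 5), method = "ML")  # stand-in fit

# Residuals scaled by the estimated innovation standard deviation
std_res <- residuals(fit) / sqrt(fit$sigma2)

plot(std_res, ylab = "Standardized residuals", xlab = "Time")
abline(h = 0, lty = 2)
```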

We can assess the normality of these residuals by means of a quantile-quantile, or Q-Q, plot. This graphical tool can be used to display the quantiles of the data versus the quantiles of a normal distribution. If the data are normally distributed, they will form an approximately straight line. They are close to doing that in Figure 7, except at the edges:
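In R the Q-Q plot takes two lines, `qqnorm()` and `qqline()` (again with the stand-in fit so the snippet is self-contained):

```r
set.seed(1)
y <- arima.sim(list(ar = c(0.5, -0.3), ma = 0.4), n = 600, sd = 0.05)
fit <- arima(cumsum(y), order = c(5, 1, 5), method = "ML")  # stand-in fit
std_res <- residuals(fit) / sqrt(fit$sigma2)

# Quantiles of the residuals against quantiles of a standard normal
qqnorm(std_res)
qqline(std_res)  # reference line through the first and third quartiles
```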

A Shapiro-Wilk test for normality, which is based on the correlation between the residuals and the corresponding normal quantiles, can also be performed. A p-value of 0.69 is obtained, and we therefore cannot reject the null hypothesis that the residuals are normally distributed.
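The test is available in base R as `shapiro.test()`; a sketch with the stand-in fit (the p-value here refers to the fabricated data, not to the 0.69 quoted above):

```r
set.seed(1)
y <- arima.sim(list(ar = c(0.5, -0.3), ma = 0.4), n = 600, sd = 0.05)
fit <- arima(cumsum(y), order = c(5, 1, 5), method = "ML")  # stand-in fit
std_res <- residuals(fit) / sqrt(fit$sigma2)

# Null hypothesis: the residuals are normally distributed
shapiro.test(std_res)
```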

As a final diagnostic for our model in this exercise, we look at the ACF of the residuals to establish whether any autocorrelations remain:
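In R this is `acf()` applied to the residuals; a Ljung-Box portmanteau test (base R's `Box.test()`) is a natural complement, with `fitdf` accounting for the 5 AR + 5 MA estimated parameters. A sketch with the stand-in fit:

```r
set.seed(1)
y <- arima.sim(list(ar = c(0.5, -0.3), ma = 0.4), n = 600, sd = 0.05)
fit <- arima(cumsum(y), order = c(5, 1, 5), method = "ML")  # stand-in fit
std_res <- residuals(fit) / sqrt(fit$sigma2)

acf(std_res, main = "ACF of standardized residuals")

# Ljung-Box test: null hypothesis of no remaining autocorrelation
Box.test(residuals(fit), lag = 20, type = "Ljung-Box", fitdf = 10)
```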

Figure 8 shows that this is not the case, and our model could be a good candidate for use as a proxy of the turbulent activity in a protoplanetary disk, provided its physical features are similar to those of the 3-D numerical simulation.
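To close the loop, here is a sketch of how the fitted model could generate synthetic α histories for use in simpler disk models. It uses the same stand-in fit as above; `arima.sim` requires a stationary AR part, which `arima` enforces by default, and the overall level of the synthetic series is set by an arbitrary constant:

```r
set.seed(1)
y <- arima.sim(list(ar = c(0.5, -0.3), ma = 0.4), n = 600, sd = 0.05)
fit <- arima(cumsum(y), order = c(5, 1, 5), method = "ML")  # stand-in fit

# Simulate a new differenced-log series from the fitted ARMA coefficients,
# then integrate and exponentiate to recover an alpha-like history
dy_sim <- arima.sim(list(ar = coef(fit)[1:5], ma = coef(fit)[6:10]),
                    n = 600, sd = sqrt(fit$sigma2))
alpha_sim <- exp(cumsum(dy_sim) - 4)  # -4 sets an arbitrary overall level

plot(ts(alpha_sim, start = 70, frequency = 2),
     xlab = "Time (orbits)", ylab = expression(alpha))
```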