Suppose that you want to create a model that generates spatial locations of trees. You want the model to be stochastic, so that each time you generate a forest, you end up with a different configuration of trees. The relevant tool is a spatial point process; mathematical details are described in books such as **Daley and Vere-Jones**. Each forest that you generate is called a realization of the process.

Further suppose you mark out a particular region, and generate many different forests, each time counting how many trees fall within the region. If your model is realistic, then, for some of the forests, the selected region will be empty (you will count zero trees) while for other forests, a cluster of trees will fall in the region (and you will count a large number). A good model will produce a distribution of counts that is overdispersed.

So, should you use a negative binomial distribution for your model? No. The negative binomial distribution describes the distribution of counts of trees, but it does not tell you WHERE TO PUT THEM. The distribution is not itself a spatial point process. There are spatial point processes that are compatible with the negative binomial distribution; that is, there are models that tell you where to put the trees, and have a negative binomial distribution for the counts of trees in any selected region.

There are two simple examples:

1) A compound Poisson process. The locations of tree clusters are generated by a Poisson process. The number of trees per cluster is generated by a logarithmic distribution.

2) A mixed Poisson process. A parameter drawn from a gamma distribution is used as the intensity for a Poisson process that generates the tree locations.
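Both constructions can be checked by simulation. Here is a rough sketch in pure Python (not from the CDF module below; the parameter values are illustrative): counts from the compound and mixed processes should both match the negative binomial mean r(1-p)/p.

```python
import math
import random

random.seed(0)

def poisson_sample(lam):
    # Knuth's method: count uniform draws until their product falls below e^-lam
    threshold = math.exp(-lam)
    k, prod = 0, random.random()
    while prod >= threshold:
        prod *= random.random()
        k += 1
    return k

def logarithmic_sample(theta):
    # Inversion sampling for the logarithmic (log-series) distribution on {1, 2, ...}
    u = random.random()
    norm = -math.log(1.0 - theta)
    k, cdf = 1, 0.0
    while True:
        cdf += theta**k / (k * norm)
        if u <= cdf or k > 1000:   # guard against floating-point round-off
            return k
        k += 1

def compound_count(r, p):
    # Cluster centers ~ Poisson(-r ln p); trees per cluster ~ logarithmic(1 - p)
    n_clusters = poisson_sample(-r * math.log(p))
    return sum(logarithmic_sample(1.0 - p) for _ in range(n_clusters))

def mixed_count(r, p):
    # Intensity ~ Gamma(shape=r, scale=(1-p)/p); counts ~ Poisson(intensity)
    lam = random.gammavariate(r, (1.0 - p) / p)
    return poisson_sample(lam)

r, p, n = 3.0, 0.4, 20_000
target = r * (1.0 - p) / p   # negative binomial mean
compound_mean = sum(compound_count(r, p) for _ in range(n)) / n
mixed_mean = sum(mixed_count(r, p) for _ in range(n)) / n
```

With r = 3 and p = 0.4, both sample means land near the negative binomial mean of 4.5, even though the two processes scatter the trees very differently.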

The Wolfram CDF module below illustrates these two processes. The number of trees in a unit area follows a negative binomial with parameters r and p; you can manipulate these parameters with the slider bars. Hitting the “generate” button produces spatial patterns for trees according to the two processes. To use the module, you will need the free **CDF Player plug-in**.

[WolframCDF source="http://datavoreconsulting.com/blog/wp-content/uploads/2013/08/NegBin.cdf" width="880" height="450" altimage="http://datavoreconsulting.com/blog/wp-content/uploads/2013/08/yeah1.png" altimagewidth="512" altimageheight="432"]

There are two things to notice. The plot for the compound process has an extra dimension: the height of the points indicates the number of trees that are located at that point. This is a problem, because it’s not very realistic to have multiple trees in the EXACT SAME LOCATION.

The mixed Poisson process has a different problem. The forests it generates are NOT CLUMPY. Even though the tree counts in a region follow a negative binomial distribution over a large number of realizations, the spatial distribution within a single realization is Poisson.

Mathematically, the compound process described above is not “orderly” (more than one tree can occupy the same location), and the mixed process is not “ergodic” (averages of a single realization over a large spatial scale are not the same as averages of a small region over many realizations). To model clumpy forests, we want a spatial point process that is orderly and ergodic. We also want the process to be stationary (the counts should be translation invariant, so that the counts are independent of location).

I have bad news for you: there are no stationary, orderly, ergodic spatial point processes that have negative binomial count distributions. This issue was first examined in a great **paper** by **Peter Diggle**. Something has got to give in your model, so what should you sacrifice? The logical choice is the negative binomial property. Try a model like a Neyman-Scott process that is clumpy, stationary, orderly, and ergodic. Problem solved!

There is a new kid on the block for interactive visualization tools in R, **healthvis**. I have not yet taken healthvis for a spin, but the survival example in the **introductory blog post** inspired me to create a Shiny app to visualize the results of a survival analysis conducted for my dissertation.

As part of my PhD research, I used Cox proportional hazards models to analyze the behavior of ladybird larvae (see photo at bottom of post). The Cox models were used to determine if experimental factors (species of aphid eaten, duration of starvation period) affected the likelihood that a ladybird larva (*Hippodamia convergens*) ended a behavioral bout (e.g., stopped searching intensively, exited the patch).

The standard approach to presenting the results of a survival analysis is to plot the proportion of individuals that are still alive (or still engaging in a behavior) in each test group over time. Survival curves can provide a nice visualization of the effect of categorical predictors in a Cox model, but visualizing the effect of continuous covariates is trickier: it might require numerous lines in one figure (creating clutter) or a multi-panel figure (making it harder to see the combined effects of the continuous and categorical covariates).
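For readers curious about the mechanics behind a survival curve, here is a minimal Kaplan-Meier estimator sketched in Python (the durations and event indicators below are made up for illustration; the actual analysis used Cox models in R):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: observed durations; events: 1 if the bout ended (event), 0 if censored.
    Returns a list of (time, survival probability) steps.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths, n_at_t = 0, 0
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n_at_t += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t
    return curve

# Hypothetical bout durations (seconds) and event indicators
times  = [2, 3, 3, 5, 7, 7, 8, 10]
events = [1, 1, 0, 1, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Each step drops the curve by the fraction of at-risk individuals whose bout ended at that time; censored observations leave the curve flat but shrink the at-risk pool.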

An interactive visualization, however, allows the user to see the effect of changing continuous covariates on the survival curves. In the example below*, clicking play causes the slider to loop through the starvation period values and creates an animation of the effect of starvation period on the likelihood that a ladybird larva leaves the patch after eating either a pea aphid (*Acyrthosiphon pisum*) or black bean aphid (*Aphis fabae*).

The ‘Model Summary’ tab provides summary output from the Cox model. There is no significant effect of the type of aphid consumed, which is apparent from the plot at low values of starvation period. However, increasing the starvation period increases the likelihood that ladybird larvae leave the patch (significant starvation effect), but only for ladybird larvae that ate a pea aphid (significant interaction effect). The starvation and interaction effects are clear from the animation as the solid red line shifts towards the plot origin (increased patch-leaving tendency) while the solid black line moves very little.

For even this simple model, the interactive visualization is a big improvement over the static figures I had previously created for talks and my dissertation, and interactive visualizations should generally outperform their static counterparts as model complexity increases. The increasingly digital nature of scientific publishing suggests that interactive graphics **are the future**, and tools like Mathematica's CDF, Shiny, and healthvis are making the creation of interactive graphics more accessible to scientists.

*The data and code used to create the Shiny app are available from **GitHub**.

Convergent ladybird beetle larva (*Hippodamia convergens*)

Black bean aphid (*Aphis fabae*)

Pea aphid (*Acyrthosiphon pisum*)

**Jeffrey Bryant's post** on the Wolfram blog dealt with simulating galactic collisions. Galaxies have hundreds of billions of stars, so visualizing them as dynamically interacting clouds of points is computationally difficult. Volume rendering provides a solution.

In today’s post, we will use volume rendering for a more prosaic example: color analysis of a photo of puppies.

This image contains 589,824 pixels, each of which is represented by a three-dimensional vector corresponding to its RGB color (the levels of red, green, and blue for the pixel). We can plot the pixels as points in three-dimensional color space, similar to what we did in a **previous post** about stolen art. Our picture is not huge, so plotting points would work. With very large data sets, though, plotting individual points becomes unmanageable (particularly when the position and properties of the points change in time). An image created with volume rendering is less computationally intensive:

**Volume rendering** is a broad term for a variety of techniques used to display three-dimensional data on a screen. Many of these techniques are built into Mathematica’s core visualization functions, like **Image3D** and **Raster3D**. Before we can make use of these built-in capabilities, we need to process our data.

The space is divided into tiny grid cells, and the number of points in each grid cell is counted. In Mathematica, this can be accomplished with **BinCounts**. These counts are used to assign an opacity to each grid cell. The data is then represented by a three-dimensional array in which each entry is a four-dimensional vector: the first three entries give the RGB color levels, and the fourth is the opacity. This array can be displayed using **Raster3D**.
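In NumPy terms (an illustrative sketch, not the post's Mathematica code), the same pipeline — bin the points, convert counts to opacity, attach a color to each cell — looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for image pixels: 10,000 points in RGB color space, channel values in [0, 1)
pixels = rng.random((10_000, 3))

# Divide color space into an 8 x 8 x 8 grid and count points per cell (BinCounts analogue)
bins = 8
counts, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=[(0.0, 1.0)] * 3)

# Opacity for each grid cell: counts normalized to [0, 1]
opacity = counts / counts.max()

# Each cell gets a four-dimensional RGBA vector: cell-center color plus opacity
centers = (np.arange(bins) + 0.5) / bins
r, g, b = np.meshgrid(centers, centers, centers, indexing="ij")
rgba = np.stack([r, g, b, opacity], axis=-1)   # shape (8, 8, 8, 4), Raster3D-style data
```

The resulting array has exactly the structure described above: a 3-D grid of RGBA vectors, ready to hand to a volume renderer.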

This more efficient way of displaying data is helpful for creating interactive content. In the example below, the original puppy image is shown, along with its pixels plotted in color space. Below the original, you can select two colors by clicking on different locations in the color spectrum bars. The two colors you choose form a basis for a plane in RGB colorspace, and this is displayed to the right. The pixels in the puppy image have their colors projected on to the plane spanned by these selected colors. The resulting puppy image is shown. Essentially, you are looking at what the picture would look like if your color palette was limited to a two-dimensional color space instead of a three-dimensional one. You can drag the color space images to change perspective.

[WolframCDF source="http://datavoreconsulting.com/blog/wp-content/uploads/2013/03/PupCS.cdf" CDFwidth="700" CDFheight="800" altimage="http://datavoreconsulting.com/blog/wp-content/uploads/2013/03/pups1.png"]
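The projection step in the module can be sketched in Python with NumPy (the two basis colors below are arbitrary stand-ins for whatever you click in the spectrum bars):

```python
import numpy as np

# Two selected basis colors (assumed values, standing in for the spectrum-bar clicks)
c1 = np.array([1.0, 0.2, 0.1])   # reddish
c2 = np.array([0.1, 0.3, 1.0])   # bluish
B = np.stack([c1, c2], axis=1)   # 3x2 basis matrix

# Orthogonal projector onto the plane spanned by c1 and c2
P = B @ np.linalg.inv(B.T @ B) @ B.T

rng = np.random.default_rng(0)
pixels = rng.random((1000, 3))            # stand-in pixel colors
projected = pixels @ P.T                  # each color replaced by its closest in-plane color
projected = np.clip(projected, 0.0, 1.0)  # keep results in a valid RGB range
```

Orthogonal projection finds, for each pixel, the nearest color that lies in the two-dimensional palette, which is exactly the "limited color space" effect the module displays.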

If you have a very large (millions of points) data set in three dimensions, consider using tools like **Raster3D** and **Image3D** instead of simply plotting all the points!

[WolframCDF source="http://datavoreconsulting.com/blog/wp-content/uploads/2013/02/Macklemore.cdf" CDFwidth="595" CDFheight="608" altimage="http://datavoreconsulting.com/blog/wp-content/uploads/2013/02/Macklemore.png"]

*Mathematica* has a number of built-in control interfaces, including sliders, checkboxes, and pop-up menus. These control types are easily used through the handy **Manipulate** function. There is no reason to restrict yourself to these control types, though. Using the **DynamicModule** function, you can specify any object to be interactive and dynamically updated.

The **DynamicModule** function helpfully organizes a set of local dynamic objects. Within this function, different dynamic variables can be specified by enclosing them in the appropriately named **Dynamic** function. Placing the **Dynamic** function in the correct place can be a little tricky. For example, when graphics are generated, **Dynamic** must enclose the graphic object, not just the variables or parameters that create it.

If you are a new *Mathematica* user who has experimented with **Manipulate**, it might be time to broaden your horizons and give **DynamicModule** a try!

And did I mention you should buy *The Heist?*

The lyric, symphonic, and emotional range of *The Heist* is impressive. You’ve probably heard the playful number one song “Thrift Shop”, but the heavier songs like “Same Love” (about gay marriage and civil rights) and “Starting Over” (about getting off the mat following relapse) show the Seattle duo at their most virtuosic.

In the video below, they are joined by Ray Dalton for a live radio performance of “Can’t Hold Us” on KEXP. Stay tuned until the end, when Dalton opens up his full register. I suggest you sit down before hitting play.

This song provides a nice opportunity to demonstrate how easy signal processing is in *Mathematica.* Below is a plot of the song’s power spectrum, which shows the relative contribution of different frequencies to the overall sound. It is obtained using a discrete Fourier transform.

You can make a homemade frequency filter in just a few lines of code. The module below simply uses “part” and “pad” operations in Mathematica to extract parts of the signal in the frequency domain, and then shifts back to the time domain using an inverse discrete Fourier transform.
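The same extract-in-frequency, invert-back-to-time trick is easy to sketch in Python with NumPy (the toy signal below stands in for the song clip; the 63 to 78.75 Hz window matches the module's default):

```python
import numpy as np

fs = 44_100                       # assumed CD-quality sample rate (Hz)
t = np.arange(fs) / fs            # one second of audio
# Toy stand-in for the clip: a 70 Hz "bass" tone plus a quieter 1 kHz tone
signal = np.sin(2 * np.pi * 70 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

def bandpass(x, fs, lo, hi):
    """Zero every Fourier coefficient outside [lo, hi] Hz, then invert."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

bass_only = bandpass(signal, fs, 63.0, 78.75)   # the module's default window
```

After filtering, only the 70 Hz bass tone survives; the 1 kHz component is gone entirely.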

You can play with the sliders, setting upper and lower bounds for the window of frequencies that you want to hear. This is just a 31-second clip of “Can’t Hold Us”, starting about 15 seconds into the song. With the default settings (frequencies between 63 and 78.75 Hz), you can really hear when the bass kicks in about 10 seconds into the clip (this might not be audible if you are playing it on a laptop with lousy speakers... try a good pair of headphones). Setting the frequency window at 26,000 to 36,000 Hz highlights Dalton’s entrance a few seconds into the clip. The module takes several seconds to load. And, yes, you need the **free CDF player**.

[WolframCDF source="http://datavoreconsulting.com/blog/wp-content/uploads/2013/01/music2.cdf" CDFwidth="652" CDFheight="253" altimage="http://datavoreconsulting.com/blog/wp-content/uploads/2013/01/altpic.jpg"]

Cheers to good music and easy Fourier transforms in Mathematica! Oh, and go and buy *The Heist* if you haven’t already done so.

Shiny uses a reactive programming framework, i.e., outputs change when inputs change without needing to refresh the browser. The reactive framework provides the interactivity that makes Shiny apps really useful for exploring data (see example **here**). However, the reactive framework becomes problematic if your app includes computationally-expensive code that should only be executed on demand rather than reactively. If your app does not benefit from reactivity, then you can suppress reactivity entirely by using the **submitButton** feature. But mixing imperative and reactive styles is a challenging problem, as Joe Cheng describes in this mailing list **response**:

The pain you’re experiencing here is the impedance mismatch between the functional-reactive style that Shiny is designed for, and the imperative style (“on click, do this action”) that most other GUI frameworks use (certainly most of the ones written before the last 18 months). I believe the functional-reactive style leads to much simpler and concise code, and far fewer errors for 90% of cases, but when you start wanting to do things that are by nature imperative (such as saving to the database) then it starts to get… interesting.

I have some ideas about how to allow you to move between one style and the other, but would like to spend a while longer watching what people are trying to do before implementing anything.

Challenges notwithstanding, I am optimistic about Shiny’s development on this front. In the last 2 weeks, the **isolate feature** was released to simplify the handling of imperative elements in a reactive framework.

No knowledge of HTML, CSS, or JavaScript is required to build apps with Shiny, but experience with these languages should help you more fully exploit Shiny’s potential. For example, if you want to include **dynamic** elements in the interface of your app, then you will need to know (or learn) at least a little bit of JavaScript. The **hammer principle** suggests that you should choose the right tool for the job, and learning JavaScript is still on my to-do list, but I’m grateful that Shiny lowers the amount of new knowledge required for an R user to start building web apps.

JavaScript provides statistical capabilities through the **jStat** library; these are more **limited** than those provided by R, but have good growth **potential**. Shiny makes beautiful apps, but the **jStat** **demonstration app** is even more beautiful. It will be interesting to see whether learning JavaScript allows me to simply get more out of Shiny or leads me to abandon Shiny altogether. The latter seems unlikely because of the activity of JavaScript programmers on the Shiny mailing list. Perhaps the determining factor will be the rate of development of Shiny and jStat. Stay tuned!

A key limitation is that Shiny apps can’t **yet** be run over the open web. There are several deployment **options** to allow people to run your Shiny app locally on their machine. All of the current options require your app users to install R on their machines, install and load R packages, and run at least a couple lines of R code. The ability to deploy apps on the web with Shiny is right around the corner, though. Beta testing of **Shiny Server** is slated to begin at the end of January.

Shiny has an excellent **tutorial** that allows you to quickly build simple apps. When you are ready to move beyond the tutorial, though, you will find that other features are not as well documented (or, at least, not as easy to find). Many of the answers can be found on the mailing list, but here are a few tips to spare you some searching time.

- Descriptions of the main features/functions are found **here**. There are also a couple of features in **shiny-incubator**. If you install the devtools package, you can install shiny-incubator with this code:

devtools::install_github("shiny-incubator", "rstudio")

- The key feature from shiny-incubator is the **actionButton**, which allows you to mix imperative and reactive styles. The actionButton feature is most effectively used when paired with the isolate feature.

- The reactive style means that when the app is first loaded, Shiny will attempt to run all of the code. For example, if you want to have a user select the location of a file, then the following code will cause the file chooser window to open as soon as the app is launched, rather than after the user clicks a button to select the file location. [Note: There is a **fileInput** feature that provides an alternative approach for loading files.]

## Create button for interface (ui.R)
actionButton("select_file", "Select file location")

## R code for selecting file (server.R)
selectFile <- reactive(function(){
  file.choose()
})

The current solution to this problem is to add a line of code that keeps the file chooser code from running until the button is clicked. [The button is a counter that starts at zero and increments with each click.]

## Create button for interface (ui.R)
actionButton("select_file", "Select file location")

## R code for selecting file (server.R)
selectFile <- reactive(function(){
  if (input$select_file == 0) {return(NULL)}
  file.choose()
})

Similarly, you may want to wait to plot a results figure until after data is loaded and processed, which requires you to check if results data frame is null before plotting.

## server.R
output$plot <- reactivePlot(function(){
  if ( is.null(results()) ) {return(NULL)}
  plot(y ~ x)
})

- For computationally-intensive operations, I would love to see the addition of a progress bar widget. As a hack-y alternative, you can use **print()** or **cat()** within your R code to print messages to the R console window. You can also use **conditionalPanels** to display messages in the app, but this is trickier to get to work cleanly (see **here** for more on this).

- A Shiny app needs a ui.R file to specify the layout of the user interface and a server.R file to talk to R. You can also store global variables and R functions in a global.R file. It cleans up your server.R file considerably if you wrap your R code into functions and put those functions in the global.R file. If you need to use global variables (you probably won't), then you will need the "super assignment" operator (**<<-**) to make the variable available in the global environment.

Once you’ve downloaded the plug-in, you should be able to interact with .cdf files that are embedded in this blog.

A simple example is shown below. It is useful for teaching systems of linear equations in introductory algebra courses. By entering numbers into the input fields, students can experiment and try to find situations where there are no solutions, a unique solution, or an infinite number of solutions. The image can be rotated by dragging.
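The three cases students hunt for can also be checked numerically. Here is a Python/NumPy sketch using the rank test (the coefficient matrices are arbitrary examples, not taken from the module):

```python
import numpy as np

def classify(A, b):
    """Classify the linear system Ax = b by comparing matrix ranks."""
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
    if rank_A < rank_aug:
        return "no solutions"            # b is outside the column space of A
    if rank_A == A.shape[1]:
        return "unique solution"
    return "infinitely many solutions"   # consistent but rank-deficient

A = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0],
              [1.0, 2.0, 3.0]])          # third row = first row + second row
case_infinite = classify(A, np.array([1.0, 1.0, 2.0]))  # consistent right-hand side
case_none = classify(A, np.array([1.0, 1.0, 5.0]))      # inconsistent right-hand side
case_unique = classify(np.eye(3), np.array([1.0, 2.0, 3.0]))
```

Geometrically, these are the same three situations the students see when they rotate the planes: no common point, one common point, or a whole line (or plane) of common points.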

[WolframCDF source="http://datavoreconsulting.com/blog/wp-content/uploads/2013/01/LinearSys1.cdf" CDFwidth="504" CDFheight="642" altimage="http://datavoreconsulting.com/blog/wp-content/uploads/2013/01/LinearSys1.jpg"]

The notebook will implement a finite difference method on elliptic boundary value problems of the form:

The comments in the notebook will walk you through how to get a numerical solution. An example boundary value problem is solved, yielding a solution that looks like this:

Last week, Mathematica 9 was released. It has some awesome new features, including enhanced NDSolve capabilities. Sadly, there is no built-in finite difference method solver yet. We will have to continue to use workarounds like the notebook posted here for a while longer.
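For readers who want the flavor of the method without Mathematica, here is a minimal finite difference sketch in Python for a one-dimensional Poisson problem (the simplest elliptic boundary value problem; the grid size and forcing function are arbitrary choices, not from the notebook):

```python
import numpy as np

# Finite differences for -u'' = f on (0, 1) with u(0) = u(1) = 0.
# With f = pi^2 sin(pi x), the exact solution is u = sin(pi x).
n = 50
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
f = np.pi**2 * np.sin(np.pi * x)

# Interior equations: (-u[i-1] + 2 u[i] - u[i+1]) / h^2 = f[i]
A = (np.diag(2.0 * np.ones(n - 1))
     - np.diag(np.ones(n - 2), 1)
     - np.diag(np.ones(n - 2), -1)) / h**2
u = np.zeros(n + 1)                      # boundary values stay at zero
u[1:-1] = np.linalg.solve(A, f[1:-1])

error = np.max(np.abs(u - np.sin(np.pi * x)))   # O(h^2) discretization error
```

Halving h should cut the error by roughly a factor of four, which is the second-order convergence you expect from the central difference stencil.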

Last week, **burglars stole** seven paintings from the Kunsthal museum in Rotterdam. The paintings included works by Picasso, Monet, Gauguin, and Matisse. The loot is likely worth hundreds of millions of dollars, but the loss of these great pieces surpasses anything that can be calculated as a monetary figure. Art is **unquantifiable**, right? Yes, but it is still fun to do data analysis on paintings. In today’s post, we will explore how Mathematica’s image processing capabilities can be used to compare two of the paintings stolen from the Kunsthal.

Let’s consider Matisse’s “Reading Girl in White and Yellow” (1919) and Picasso’s “Harlequin Head” (1971).

Mathematica represents each pixel with a vector whose entries correspond to the relative intensity of red, green, and blue channels, respectively. We can plot each pixel of the painting as a point in three-dimensional color space.

**Matisse:**

**Picasso:**

You can see that the Matisse draws from a larger color palette, and hence the points span a broader area of the color space.

**Entropy** provides a way to measure the information contained in a painting. A full explanation of this topic is beyond the scope of this blog, but a few general concepts are worth noting. When dealing with a distribution of different colors of pixels, the concepts of information, uncertainty, and evenness are roughly equivalent.

Suppose that you have a totally random painting, so that the pixels are evenly distributed across the color spectrum. Further suppose that you have the ability to reach into the painting and pull out a pixel at random. Don’t look at the pixel. Just hold it in your hand. Clearly, you are very uncertain about the color of the pixel that lies in your palm. Now open your hand and look at the pixel. You had almost no idea what color the pixel would be; now you know. The act of observing that pixel conveys a large amount of information to you.

Conversely, suppose that you have a painting in which 95% of pixels are pure blue. If you pick a pixel at random, then, even before you open your hand to look at it, you are pretty certain about what color you are holding. Looking at the pixel will not convey much information to you (seeing that it is blue will not be terribly surprising). This is why high uncertainty means high information (high entropy), and low uncertainty means low information (low entropy).
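The two thought experiments translate directly into numbers. A quick Python sketch (the 16-color palette is chosen arbitrarily for the example):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform over 16 colors: maximal uncertainty
uniform = [1 / 16] * 16
# 95% pure blue, with the remaining 5% spread over 15 other colors
mostly_blue = [0.95] + [0.05 / 15] * 15

h_uniform = shannon_entropy(uniform)      # log2(16) = 4 bits
h_blue = shannon_entropy(mostly_blue)     # well under 1 bit
```

The random painting yields 4 bits per pixel, while the mostly-blue painting yields a fraction of a bit: observing a pixel there tells you almost nothing you did not already know.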

Things get more complicated when we consider the spatial structure of pixels in a picture. For example, if you were to reach into Matisse’s painting and take a pixel from a spot on the tablecloth, you’d be relatively certain about obtaining a white pixel. If you take a pixel from the region around the flowers, though, you have higher uncertainty about its color. Entropy takes different values in different locations.

We’d like to assign an entropy value to every point in a painting; this will allow us to see which areas of the painting are high entropy, and which are low. To define the entropy at a point, we take a small neighborhood of pixels around that point, and calculate the entropy of this collection of pixels. The size of the neighborhood matters: in general, we will use a (2r+1) × (2r+1) square, and will adjust r to examine different scales.

Here is what the entropy of the Matisse painting looks like for r=1 (black is low entropy, white is high).

Matisse, r=20:

Picasso, r=1:

And for r=20:

Let’s compare the information per pixel in the two paintings. Below is a plot of how the difference between the paintings’ average entropy per pixel (Matisse average entropy per pixel minus Picasso average entropy per pixel) changes with neighborhood size. Negative values indicate that the Picasso painting has higher entropy per pixel at that scale; positive values indicate that the Matisse painting does.

Interestingly, at very small scales the Picasso painting has higher entropy, while at larger scales the Matisse does. At very small scales, there is more structural complexity and heterogeneity in the Picasso than the Matisse. At larger scales, we see that the Matisse draws from a broader spectrum of color, and thus contains a wider distribution of pixel types.

In the calculations above, each pixel color is classified as distinct. With 3 color channels, and 256 possible values for each channel, there are over 16 million possible distinct colors. In some situations, this fine distinction between colors might be more than we want to consider (see code below).

If the thieves damage the paintings, let’s hope they damage the low entropy regions identified above. It will be easier to reconstruct those regions from surrounding pixels than the regions that have higher information content.

First, import the data:

matisse = ImageData[Import["file"]];
picasso = ImageData[Import["file"]];

This calculates the neighbor pixels in a (2r+1) × (2r+1) block around a pixel at location (i,j) in picture “data”:

neighbors[data_, {i_, j_}, r_] :=
 Flatten[
  Take[data,
   {Clip[i - r, {1, Dimensions[data][[1]]}], Clip[i + r, {1, Dimensions[data][[1]]}]},
   {Clip[j - r, {1, Dimensions[data][[2]]}], Clip[j + r, {1, Dimensions[data][[2]]}]}],
  1]

This calculates the entropy of a pixel at position (i,j) in painting “data” using neighborhood size r:

localEntropy[data_, {i_, j_}, r_] := Entropy[neighbors[data, {i, j}, r]] // N

This maps that entropy to each pixel:

entropyFilter[data_, r_] := MapIndexed[localEntropy[data, #2, r] &, data, {2}]

This calculates the average entropy per pixel:

meanEntropy[data_, r_] :=
 Total[Flatten[entropyFilter[data, r], 1]]/(Dimensions[data][[1]]*Dimensions[data][[2]])

Alternatively, Mathematica has a built-in **EntropyFilter** function that can be applied directly, but it does not allow customizing how colors are discretized.

To handle a “coarser” view of color (that is, to treat very similar colors as the same), you can use the “SameTest” option within Mathematica’s Entropy function. Simply replace the localEntropy function defined above with this smoothLocalEntropy function:

smoothLocalEntropy[data_, {i_, j_}, r_] :=
 N[Entropy[Flatten[neighbors[data, {i, j}, r], 1], SameTest -> (Norm[#1 - #2] < 10^-2 &)]]

**Baintha Brakk**, aka The Ogre, is a mountain in the Karakoram range in northern Pakistan. It is “only” the 87th highest mountain on the planet (23,901 feet), but it is among the most technically difficult to climb.

Chris Bonington and Doug Scott made the first ascent in 1977. Early in the descent, Scott broke both his legs in a rappelling accident. The **ensuing struggle** to get down alive was epic.

The Ogre was not climbed again until 2001 (although there were many unsuccessful attempts in the intervening years). To my knowledge, the 1977 and 2001 expeditions are still the only successful ones.

Note: For another intense account of high altitude survival, check out the most recent episode of Family Guy, ** Into Fat Air**.

The USGS’s **Earth Explorer** is a great resource for finding and downloading digital elevation data. Mathematica can import a variety of geospatial data formats. In the following example, I’ve used the GeoTIFF format.

In Mathematica, an array of elevation data can be conveniently displayed with the **ReliefPlot** function.

We can fit an interpolating polynomial to the data using Mathematica’s **Interpolation** function. This is useful, because it represents the terrain as a smooth function, so gradients can be calculated.
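As an illustrative sketch (Python/NumPy rather than Mathematica, with a made-up Gaussian peak standing in for The Ogre), finite-difference gradients of a terrain grid give slope directly:

```python
import numpy as np

# Toy digital elevation model: a smooth Gaussian peak on an assumed 100 m grid
spacing = 100.0
y, x = np.mgrid[0:50, 0:50] * spacing
elevation = 7000.0 * np.exp(-((x - 2500.0)**2 + (y - 2500.0)**2) / (2 * 1000.0**2))

# Finite-difference gradients of the terrain (rise per meter in each direction)
dz_dy, dz_dx = np.gradient(elevation, spacing, spacing)
slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))  # steepness in degrees

steepest = slope.max()
```

The slope is near zero at the summit and steepest partway down the flanks, which is exactly the pattern a relief plot of real terrain highlights.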

Here is what a plot of a (very high order) interpolating polynomial fit to the data looks like:

Of course, polynomials can never quite capture the real deal (photo credit: Doug Scott):
