Category Archives: Programming Tips

Limitations of the negative binomial distribution in spatial models

Galaxies, trees, and influenza cases have something in common: they tend to occur in clusters. The issue of how to model clustered spatial patterns is thus of interest to a variety of scientific disciplines. For many people, the mention of spatial clumping brings to mind the negative binomial distribution. This is appropriate, because the negative binomial distribution is overdispersed, meaning its variance is greater than its mean (in contrast to the Poisson distribution, which has equal variance and mean). Unfortunately, there are some difficulties in using the negative binomial distribution in spatial modeling, which we attempt to explain in this post.

Suppose that you want to create a model that generates spatial locations of trees. You want the model to be stochastic, so that each time you generate a forest, you end up with a different configuration of trees. The relevant tool is a spatial point process; mathematical details are described in books such as Daley and Veres-Jones. Each forest that you generate is called a realization of the process.

Continue reading “Limitations of the negative binomial distribution in spatial models” »

Interactive visualization of survival curves with Shiny

We have a growing interest in using our favorite tools (R and Mathematica) to build web interfaces to interactively explore and visualize data. Our last 5 posts have involved interactive tools, namely Mathematica’s computable document format and R’s new Shiny package.

There is a new kid on the block for interactive visualization tools in R, healthvis. I have not yet taken healthvis for a spin, but the survival example in the introductory blog post inspired me to create a Shiny app to visualize the results of a survival analysis conducted for my dissertation.

Continue reading “Interactive visualization of survival curves with Shiny” »

Custom interfaces in Mathematica using Dynamic Module

In our last post, we explored how Fourier transforms can be used in Mathematica to make a frequency filter for audio files. That post was primarily concerned with implementing the appropriate transforms (and, of course, paying homage to the amazing talent of Macklemore and Ryan Lewis). The accompanying interactive tool had an extremely cumbersome interface. Here is a slightly better attempt, featuring a more widely known piece of music:

Continue reading “Custom interfaces in Mathematica using Dynamic Module” »

Macklemore and Fourier

The Heist by Macklemore and Ryan Lewis is a masterpiece. It is clearly the best album of 2012, and I suspect it will go down alongside classics like IllmaticThe ChronicThe Marshall Mathers LP, and The College Dropout.

The lyric, symphonic, and emotional range of The Heist is impressive. You’ve probably heard the playful number one song “Thrift Shop”, but the heavier songs like “Same Love” (about gay marriage and civil rights) and “Starting Over” (about getting off the mat following relapse) show the Seattle duo at their most virtuosic.

In the video below, they are joined by Ray Dalton for a live radio performance of “Can’t Hold Us” on KEXP. Stay tuned until the end, when Dalton opens up his full register. I suggest you sit down before hitting play.

Continue reading “Macklemore and Fourier” »

Shiny = Happy People

The people behind the wonderful RStudio, which I gushed about in a previous post, have developed a new package, Shiny, that makes it easy to develop interactive web applications with R. Shiny is not the first package to facilitate building web apps with R (see here for comparison of Shiny and gWidgetsWWW2.rapache), but it is arguably the easiest to learn. Shiny has an enthusiastic and engaged user community and the people at RStudio are very responsive to questions posted to the mailing list.

Continue reading “Shiny = Happy People” »

Finite Difference Method (now with free code!)

A couple of months ago, we wrote a post on how to use finite difference methods to numerically solve partial differential equations in Mathematica. Several readers have asked for more details about the method. To help everyone out, we are posting a Mathematica notebook that contains explanations and code. Download the notebook by clicking here.

The notebook will implement a finite difference method on elliptic boundary value problems of the form:

The comments in the notebook will walk you through how to get a numerical solution. An example boundary value problem is solved, yielding a solution that looks like this:

Last week, Mathematica 9 was released. It has some awesome new features, including enhanced NDSolve capabilities. Sadly, there is no built-in finite difference method solver yet. We will have to continue to use workarounds like the notebook posted here for a while longer.

Comparing stolen paintings: Picasso and Matisse

Art Heist

Last week, burglars stole seven paintings from the Kunsthal museum in Rotterdam. The paintings included works by Picasso, Monet, Gauguin, and Matisse. The loot is likely worth hundreds of millions of dollars, but the loss of these great pieces surpasses anything that can be calculated as a monetary figure. Art is unquantifiable, right? Yes, but it is still fun to do data analysis on paintings. In today’s post, we will explore how Mathematica’s image processing capabilities can be used to compare two of the paintings stolen from the Kunsthal.

Let’s consider Matisse’s “Reading Girl in White and Yellow” (1919) and Picasso’s “Harlequin Head” (1971).

Continue reading “Comparing stolen paintings: Picasso and Matisse” »

Good programming practices in R

I write sloppy R scripts. It is a byproduct of working with a high-level language that allows you to quickly write functional code on the fly (see this post for a nice description of the problem in Python code) and the result of my limited formal training in computer programming. The lack of formal training makes scientists self-conscience of the bits of code that they cobble together to solve research problems, but a professional software engineer reassuringly points out that most software runs on messy code. Although sharing sloppy code is better for research progress than not sharing any code at all, you can make the code sharing experience better by picking up some good programming habits. Even if you don’t intend to share your code, which is arguably bad for science, adopting good programming habits should improve your workflow by making old bits of code more easily understandable and re-usable.

The Department of Biostatistics at Vanderbilt University provides a nice list of programming tips for statisticians. For R-specific recommendations, see Google’s R Style Guide. Hadley Wickham provides his own recommendations, which are generally—but not always—aligned with Google’s R Style Guide. I previously thought that the best way to improve your code was by adding comments, but I hadn’t thought about how copious comments may actually make your code less readable. I intend to adopt nearly all of the recommendations in Google’s R Style Guide. Well, I will immediately adopt the style guidelines for any code that makes a public appearance on this blog, but changes to my usual programming habits will likely be more gradual.

Zing! Mathematica one-liners

I love the elegant simplicity of programming in Mathematica. There is something undeniably beautiful about accomplishing something complex in a concise chunk of code. A famous Mathematica mantra is, “if you are using a For loop, you are probably doing it wrong.” I encourage Mathematica programmers to keep that in mind.

To illustrate the efficiency of Mathematica code, I’d like to draw your attention to the recent results of Wolfram’s “One-Liner Competition”.  Each program in the competition is 140 characters or less. Yes, that’s right… tweetable programs. You can watch screencasts of the winner and honorable mentions here and download a notebook of the programs here.

For those who work in biological modelling, I recommend checking Yves Klett’s entry. It gives a new meaning to the old modelling punchline “assume all cows are spheres”.

A skeptical look at The Economist’s Sinodependency Index

In a recent blog postThe Economist discusses its “Sinodependency Index”, which measures the world’s economic dependence on China. This index was originally proposed in 2010. In today’s post, we will take a closer look at this index, and in the process, we will explore some of Mathematica’s finance-related capabilities.

The Economist’s Sinodependency Index (SI) is composed of S&P 500 firms, weighted according to their revenues from China. You can see how the SI has outperformed the S&P 500 over the past couple of years, with the implication that firms with more exposure to China are performing better than their less “Sinodependent” peers.
Continue reading “A skeptical look at The Economist’s Sinodependency Index” »

RStudio is RStupendous

I am a sucker for beautiful applications (like the ggplot2 web tool mentioned here). The latest R-related application to catch my eye is RStudio.

RStudio™ is a free and open source integrated development environment (IDE) for R. You can run it on your desktop (Windows, Mac, or Linux) or even over the web using RStudio Server.

For a glimpse of the beauty of RStudio, take 2 minutes to watch the screencast found on the RStudio home page. If you watch the screencast, you will also get an overview of the functionality of RStudio.

Continue reading “RStudio is RStupendous” »

On Labor Day, make your computer’s job easier with Milstein’s method

In today’s post, we will explore numerical schemes for integrating stochastic differential equations in Mathematica. We will take an informal approach; for an in-depth treatment of stochastic differential equations, I recommend that you look at Stochastic Processes for Physicists by Kurt Jacobs and Modeling with Ito Stochastic Differential Equations by Edward Allen.

Continue reading “On Labor Day, make your computer’s job easier with Milstein’s method” »

Late to the ggplot2 party

I have resisted learning the popular R graphics package, ggplot2. I dismissed ggplot2 as primarily useful for exploratory graphics and rationalized my avoidance of ggplot2 by assuming that it would require just as many (or more) lines of code as the R base package to whip the default plots into publication-quality figures. The few times that I poked at ggplot2 I quickly retreated to the cozy confines of the base package (see here, here, and here for tips on creating figures with base graphics).

Well, the tipping point for me came from an unlikely source—a web interface for ggplot2. Watch the demo video below to get a taste of the power of ggplot2 through the web.

Continue reading “Late to the ggplot2 party” »

Count data and GLMs: choosing among Poisson, negative binomial, and zero-inflated models

Ecologists commonly collect data representing counts of organisms. Generalized linear models (GLMs) provide a powerful tool for analyzing count data. [As mentioned previously, you should generally not transform your data to fit a linear model and, particularly, do not log-transform count data.] The starting point for count data is a GLM with Poisson-distributed errors, but not all count data meet the assumptions of the Poisson distribution. Thus, we need to test if the variance>mean or if the number of zeros is greater than expected. Below, we will walk through the basic steps to determine which GLM to use to analyze your data.

Continue reading “Count data and GLMs: choosing among Poisson, negative binomial, and zero-inflated models” »

Numerically solving PDEs in Mathematica using finite difference methods

Mathematica’s NDSolve command is great for numerically solving ordinary differential equations, differential algebraic equations, and many partial differential equations. Most of the integration details are handled automatically, out of the user’s sight. NDSolve switches between integration schemes based on the problem at hand, adapting step sizes and monitoring stiffness as it goes. Advanced users can override these options, customizing NDSolve to their needs.

Sadly, some types of PDEs are beyond NDSolve’s capabilities. Confronted with one of these PDEs, a user must resort to a more “manual” procedure to find a numerical solution. In this post, we’ll examine a few tricks that can make this process easier.

Continue reading “Numerically solving PDEs in Mathematica using finite difference methods” »

Using paste( ) to read and write multiple files in R

This post is a quick tip on how to use the paste( ) function to read and write multiple files. First, let’s create some data.

The next step is not necessary, but makes the subsequent code more readable.

The following example is silly because you would rarely want to split your data as shown in this example, but (hopefully) it clearly illustrates the general idea of using paste( ) to create dynamic file names when writing files.

Continue reading “Using paste( ) to read and write multiple files in R” »

Transformation of axes in R

As a general rule, you should not transform your data to try to fit a linear model. But proportions can be tricky. If the proportion data do not arise from a binomial process (e.g., proportion of a leaf consumed by a caterpillar), then transformation is still the best option. In an excellent paper, David Warton* and Francis Hui propose that the conventional transformation for proportion data (i.e., arcsine square root) is asinine, particularly if you have binomial data, and that the logit transformation is preferable for non-binomial proportion data.

The objective of this post is simply to demonstrate how to transform the axes of plots in R, but the context of the example is the logit transformation of non-binomial proportion data. First, we need to generate some data.

Continue reading “Transformation of axes in R” »

Spacing of multi-panel figures in R

In a previous post, I showed how to keep text and symbols at the same size across figures that have different numbers of panels. The figures in that post were ugly because they used the default panel spacing associated with the mfrow argument of the par( ) function. Below I will walk through how to adjust the spacing of the panels when using mfrow.

Continue reading “Spacing of multi-panel figures in R” »

A Traveling Salesman on a Sphere: Pitbull’s Arctic Adventure

You’ve probably heard of the traveling salesman problem: given a set of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city? This problem pops up in a huge number of applications.

For the purposes of this post, let’s focus on a concrete example: a musician is planning a tour and wants to travel as efficiently as possible. Suppose that concerts are to be held in the following cities (with an encore concert in the origin city): Anchorage, Athens, Barrow, Berlin, Brussels, Cairo, Copenhagen, Denver, Detroit, Dublin, Helsinki, Istanbul, Jerusalem, Kabul, Lagos, Lisbon, London, Los Angeles, Miami, Moscow, Murmansk, New York, Prague, Reykjavik, Rome, Seattle, Stockholm, and Toronto.

Perhaps you’ll object that this is an unlikely list, because it includes several high latitude cities with limited concert venues. Musicians can get sent to some pretty obscure locations, though. Take the rapper Pitbull’s recent trip to Kodiak, for example. Bear with me.

Fortunately, Mathematica has a built-in function, FindShortestTour to help compute the solution to our problem. Before we get too excited, though, we should remember that the world is not flat, and hence we won’t be able to use our usual Euclidean distance metric.

Continue reading “A Traveling Salesman on a Sphere: Pitbull’s Arctic Adventure” »