My thoughts on recovering from poor results in HCI evaluation and salvaging your experiment.

Graceful Refocusing: Designing Fault-Tolerant HCI Evaluations

By Niklas Elmqvist, University of Maryland, College Park

I recently started writing “behind the scenes” posts on visualization and HCI research, and a meeting with a student prompted me to write this blog post about two specific examples of what I call “graceful refocusing” of an evaluation. Really, it should be called “dealing with oh-sh*t moments”, but we’ll get to that soon.

As it turns out, research is risky. It is risky for many reasons, but one of them is that sometimes even your best ideas and hunches don’t work out. No matter how well you tune your experiment, your factors, and your tasks, you may still end up with a technique that does not bear out in your evaluation. That can be quite disheartening, as negative results are generally not easy to publish. For example, no matter how brilliant your new idea for optimizing mouse pointing is, if it doesn’t outperform standard pointing, it is essentially worthless.

There are two possible explanations when no significant effect can be observed, or when your new technique even performs worse than the baseline: either your experiment was not designed correctly, or your new technique is actually not all that good (ouch). Both of these explanations are painful, but the second one is clearly the more painful one. Unfortunately, there is no remedy; regardless of how experienced you are and what your gut instinct tells you, sometimes your intuition is wrong about a new idea. This is an “oh-sh*t” moment, and heaven knows it has happened many times to me.

The correct approach in a situation like this is almost always to throw out the results and start over, either with a new experiment (properly piloted and balanced this time), or even with entirely new techniques. However, in the real world, this is not always practical: paper deadlines wait for no man or woman, and experiments are often costly in time and money. Discarding a large amount of data can be heartbreaking and expensive. Publishing negative results is possible, and sometimes even explicitly encouraged (after all, knowing what doesn’t work is almost as valuable as knowing what does work), but is much more of an uphill battle. It often feels like there is no easy way out of this situation.

At the same time, there is a very real risk in trying to “salvage” failed research results, because it may lead to questionable research practices such as p-hacking (repeatedly analyzing or slicing the data until some seemingly significant pattern emerges) or HARKing (Hypothesizing After the Results are Known; paper here). Examples of poor or dishonest salvage attempts abound, and have contributed to the so-called “replication crisis” in fields such as psychology and medicine, where many famous results have proven impossible to replicate, even by the original researchers themselves. Dealing with such issues is beyond the scope of this blog post, but some safeguards include pre-registering your hypotheses using sites such as https://aspredicted.org/, publishing your datasets with your papers, and describing your methods in enough detail that the results can be replicated.

My approach—developed over many similar experiences—has been to gracefully refocus the evaluation to be less about the new technique and more about studying competing techniques for the overall problem. In other words, instead of hitching your wagon to the wrong horse (i.e. the technique that ended up losing), just remove yourself from the race entirely and report on it as an objective observer. Let me illustrate this with two different projects from my past research. (If you want the dirt on some of my past research projects, keep reading.)

Back in 2009, my Ph.D. student Waqas Javed and I were working on a Google-funded research project on time-series visualization. We had seen the horizon graph work developed by Hannes Reijner at Panopticon Software (which in turn was inspired by Saito et al.’s “two-tone pseudocoloring” paper from InfoVis 2005), but noted that horizon graphs, for all their brilliance, do not easily support comparison because each graph is limited to a single time series. In other words, when visualizing two different stock prices, you need two horizon graphs side by side (or one above the other), one for each stock. Traditional line charts, on the other hand, allow adding several lines to the same chart, which promotes comparison, but thin lines can be hard to identify when using color coding. In response, we came up with a new time-series visualization technique that combined line charts with filled area charts (which are easier to distinguish) while still showing multiple time series in the same chart. We called the technique “braided graphs” in recognition of how the filled areas are braided together. Here’s an illustration:

As you can see, braided graphs weave together the filled areas under each curve so that the largest value is always at the back. In other words, the technique is reminiscent of stacked area charts, but instead of using the previous curve as a baseline, each time series uses a common baseline (the horizontal axis). This was, in our minds at least, an elegant and exciting new way to visualize time-series data.
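To make the idea concrete, here is a minimal sketch of how a braided graph could be rendered with matplotlib. This is my own simplified reconstruction for illustration, not the code we used in the study: it draws each interval between adjacent samples back-to-front, so that the series with the larger value ends up behind the smaller one, and it ignores crossings within an interval for simplicity.

```python
import numpy as np
import matplotlib.pyplot as plt

def braided_graph(ax, x, series, colors):
    """Draw several time series as 'braided' filled areas over a common baseline."""
    x = np.asarray(x, dtype=float)
    for i in range(len(x) - 1):
        xs = [x[i], x[i + 1]]
        # Order the series by their mid-interval value, largest first,
        # so the largest filled area is painted at the back.
        order = sorted(range(len(series)),
                       key=lambda k: -(series[k][i] + series[k][i + 1]) / 2)
        for k in order:
            ys = [series[k][i], series[k][i + 1]]
            ax.fill_between(xs, 0, ys, color=colors[k], linewidth=0)

# Example: two synthetic "stock price" series braided into one chart.
t = np.linspace(0, 10, 200)
a = 2 + np.sin(t)
b = 2 + np.cos(1.3 * t)

fig, ax = plt.subplots()
braided_graph(ax, t, [a, b], ["tab:blue", "tab:orange"])
plt.show()
```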

Except it didn’t work.

Admittedly—and before you object—in retrospect I can see several problems with the design that should have given me pause. However, disregarding any such instincts, we happily went ahead with the experiment design and ran a full study involving dozens of participants (with the corresponding number of hours of Waqas’ time) and hundreds of trials (and hundreds of dollars). Unfortunately, of course, the results were not at all in favor of braided graphs, and many of the participants complained that the technique was confusing and difficult to use.

This left us in a little bit of a pickle, because our initial plan to write a big technique paper announcing braided graphs to the world was now clearly out of the question. The results were just not there to support any such claims, and while I still think that braided graphs are an elegant solution to the problem of combining filled area charts with line charts, they are too visually complex to warrant widespread use. However, this left us with a bunch of user study data, a lot of time and money invested in the project, and a looming conference deadline. What were two poor visualization researchers to do?

In the end, the solution we came up with was to (more or less) gracefully refocus the evaluation to be less about braided graphs and more about understanding the performance of different time-series visualization techniques. The braided graphs technique was still part of the lineup, of course, but we casually dropped it in as a minor contribution rather than as the main feature of the paper. It meant swallowing our pride and diminishing our novelty claims (the data visualization field generally thrives on new visualization techniques), but the paper was eventually accepted and published at IEEE InfoVis 2010, and is now one of my most highly cited papers.

The second example of this idea of gracefully refocusing an evaluation came from a project that I worked on in 2010 with Pierre Dragicevic and Anastasia Bezerianos. Having seen Microsoft Live Labs’ PivotViewer (succeeded in spirit by Microsoft’s SandDance), we were intrigued by animated transitions of point clouds, and had the bright idea for a new animation technique we called “motion bundles”: inspired by Danny Holten’s edge bundling (example D3 implementation here), we thought that we could dynamically “bundle together” points that were in close proximity so that they would travel together as a tight-knit group, only to spread apart again as they reached their destination. This idea was supported by research in perceptual psychology, such as Cavanagh & Alvarez (2005) and Pylyshyn & Storm (1988), showing that the number of moving targets people can track is very small, but that many targets moving as a group can be perceived as a single entity (the latter is also known as the Gestalt Law of Common Fate).
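To give a flavor of what we had in mind, here is a hypothetical sketch of the motion bundling idea in Python with NumPy. This is my reconstruction of the concept described above rather than our actual implementation, and it assumes the points have already been assigned to proximity groups (for instance via a simple clustering of their start positions): each point follows a straight-line interpolation between layouts, but is pulled toward its group centroid, with the pull strongest at mid-transition and vanishing at both ends.

```python
import numpy as np

def motion_bundle_positions(start, end, groups, t, strength=0.7):
    """Point positions at normalized animation time t in [0, 1] for a bundled transition.

    start, end : (n, 2) arrays of point positions before and after the transition
    groups     : length-n array of group labels (e.g., from clustering start positions)
    """
    start, end = np.asarray(start, dtype=float), np.asarray(end, dtype=float)
    groups = np.asarray(groups)
    # Plain straight-line interpolation between the two layouts.
    linear = (1 - t) * start + t * end
    # Bundling strength peaks at mid-transition and is zero at the endpoints.
    pull = strength * np.sin(np.pi * t)
    bundled = linear.copy()
    for g in np.unique(groups):
        idx = groups == g
        centroid = linear[idx].mean(axis=0)
        # Pull every group member toward its group centroid.
        bundled[idx] = (1 - pull) * linear[idx] + pull * centroid
    return bundled

# Example: 100 points moving between two random layouts in five (random) groups.
rng = np.random.default_rng(0)
start = rng.random((100, 2))
end = rng.random((100, 2))
groups = rng.integers(0, 5, size=100)
halfway = motion_bundle_positions(start, end, groups, t=0.5)
```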

However, when we implemented our motion bundling technique, we again found—big surprise—that it didn’t work. Again, in hindsight, it is pretty easy to see why: bundling targets together into a small coherent group while they are moving actually hinders tracking individual points because it becomes easy to lose a specific target when it is in close proximity to other targets. In other words, similar to a shell game where the cups obscure the ball being tracked, the distractors that are clustered together with the target make it excessively hard to track the target itself. In reality, what you want is for the target being tracked to be separated as much as possible from all of the other objects in the point cloud.

As an aside, as with virtually any HCI technique, motion bundling performance also depends on the task: while it performs poorly for tracking a specific object, it performs much better for overview tasks involving multiple targets. Indeed, Fan Du and colleagues independently reinvented this idea (calling it trajectory bundling) in a CHI 2015 paper and found precisely this effect: it worked well for multiple targets or when there was a lot of occlusion during the transition.

Anyway, this was another “oh-sh*t” moment for us: the entire premise of our research project was gone. Fortunately, this time we had not actually run any experiment, just implemented a testing framework and conducted a few pilots. Regardless, this left us in a bit of limbo with no clear way forward. There was no point in continuing with the motion bundling technique (although Fan Du’s later work suggests otherwise), but we had already done the literature search and implemented a testing framework with dataset generation and carefully balanced factors. It seemed a shame to waste all that effort, as well as a brilliant team.

We ended up refocusing the evaluation (much more gracefully this time around) to study timing in point cloud animations (fortunately, at the last minute we had added a slow-in/slow-out condition to our animations, and those were the only interesting findings from our pilots). This allowed us to take advantage of the time investment already made. In other words, yet again the solution ended up being to relax some of our claims and not back a specific horse in the race; in fact, motion bundling didn’t become part of the experiment at all this time. Nevertheless, our experiment identified interesting results (basically, empirical evidence supporting the use of easing animations) and the paper was published at CHI 2011.
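For readers unfamiliar with the term, slow-in/slow-out (easing) simply remaps animation time so that motion starts and ends slowly and is fastest in the middle. A common cosine-based formulation, shown below as a sketch (not necessarily the exact easing used in the paper), is:

```python
import numpy as np

def slow_in_slow_out(t):
    """Remap linear animation time t in [0, 1] so motion starts and ends slowly."""
    return 0.5 - 0.5 * np.cos(np.pi * t)

# A point animated from p0 to p1 over normalized time t would be placed at
#   (1 - slow_in_slow_out(t)) * p0 + slow_in_slow_out(t) * p1
# instead of the constant-speed (1 - t) * p0 + t * p1.
```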

What does all this mean? Experience is clearly a hard teacher, but with this blog post, I am hoping that fellow HCI and visualization researchers can learn from these examples how to gracefully refocus failed evaluations in order to salvage their efforts. The easiest way to do this, in my experience, is simply to remove your stake in the competition and report on your findings in a dispassionate way. The larger lesson, I believe, is to avoid backing any particular horse altogether, and instead view each evaluation as an honest pursuit of truth rather than a competition you are trying to win. For some of us, designing new techniques is what we do, so this more detached approach can be tricky to adopt, but it is worth a try.

Another point is that, regardless of your approach, you should always have a Plan B. It is hard to predict the outcome of an experiment, and you should have a fallback in case your proposed new technique does not live up to expectations. Otherwise you may find yourself in the same situation as in the anecdotes above, scrambling to fix it.