# Whales, Politics, and Statisticians

"To produce a mighty book [or perhaps a mighty blog post], you must choose a mighty theme. No great and enduring volume can ever be written on the flea, though many there be who have tried it." So says Melville, in "Moby-Dick; or, The Whale". And in our statistical working lives, clashing with whales and international politics decidedly counts among the mightier experiences.

Johan Hjort, 1903

In general, Norway has a mostly undistinguished, but positive international reputation. Our country is perceived as rich, peaceful and stable, and with beautiful nature. From time to time, however, international media report on a considerable stain on our reputation: Norway's involvement in the hunting of marine mammals, specifically the killing of seals and whales (see here for example). In spite of anti-whaling efforts and activism, we remain one of the few countries in the world, along with Japan and Iceland, still involved in whaling on what may be called a significant scale.

During the spring of 2017, Céline and Nils from the FocuStat core group were offered a peek into the exclusive and partly politicised world of whaling research. We were invited to report to and attend the May 2017 meetings of the Scientific Committee of the International Whaling Commission (IWC). We were not there as representatives of Norwegian whaling interests, but as the proverbially unbiased and competent scientists, i.e. as experts in statistical methodology. To complicate this picture, however, we were hired by Japanese whaling authorities. Also, we teamed up with international heavyweight Lars Walløe (academic, chemist, physiologist, expert on matters ranging from plague to history and marine biology and indeed whales, and scientific adviser to the Norwegian government). Walløe has been a long-standing member of the Scientific Committee of the IWC, and also knows the ins and outs of the relevant political questions. He also knows more than most scientists about statistics. In this project his role was partly to facilitate research, and to assist with open communications with Japanese and Australian colleagues in particular.

The tasks set out for us, by the Japanese whale researchers (but with no pressure or bias regarding which conclusions to reach), were

• (a) to examine previous work and reports related to a particular question, which has been discussed for more than ten years in the Scientific Committee, and which to a large extent involves statistical modelling and methodology;
• (b) to write up a report, with new analyses of a large dataset from Japanese whaling (collected over a period of 18 years);
• (c) to communicate openly and clearly with a similar statistical team from Australia; and
• (d) to present this report and share our methods and views with the Scientific Committee.

While being biological in nature, this central question, returned to below, also involves several statistical aspects – along with political dimensions (as we shall see).

## Whaling, Whalers, Industry, Science, Politics

Throughout human history, coastal populations have regarded whales as a prized but hard-to-obtain source of meat and oil. Although possibly out of line with modern sensibilities, the history of whaling is a fascinating story of human ingenuity, technological development, nautical exploration – and, eventually, a tragedy of the commons (as well as for the whales). In pre-modern times, whaling was by necessity limited to slow species, with floating carcasses. Then, as sailing and ship technology improved, whaling operations moved further and further from the coast and over to steadily bigger species. The largest and fastest whale species, the rorquals (among them the blue and fin whales), remained largely unexploited until the second half of the 19th century when new technologies, like the harpoon cannon (invented by Norwegian Svend Foyn) and steam ships, launched the era of modern whaling.

Modern whaling is a textbook example of an unsustainable industry. Due to advances in processing techniques whale oil became valuable, for instance in soap and margarine production. Few countries were involved, but there was intense competition between them and the rorquals were "successfully" hunted to near extinction (in size order: first the blue, then the fin, then the sei, eventually also the humpback). Near its end the industry was not even profitable, because over-production lowered the prices of blubber products (see the Davis et al. book, 1997, for details).

From perhaps 1900 onwards scientists, qua individuals and also in national and international organisations, became more involved with the counting, measuring, and biological understanding of whales. One of the founding fathers of the International Council for the Exploration of the Sea (ICES) was Johan Hjort (seen above in a 1903 photo, the year after ICES was established). The ICES launched its own Whaling Committee, with Hjort among the key players. "During the 1930s, Hjort was the scientific brain behind the construction of the best international database ever assembled on whale fisheries", as Sidney Holt (2014) reports in his paper on Hjort's contributions to the theory of rational fishing. The Whaling Committee was the precursor of the present IWC, established in 1946. Hjort was also involved with national and international politics, and wrote extensively outside science as well.

This is not the place to provide much detail on the scientific progress regarding assessing and monitoring whale populations, from 1946 onwards; we recommend, though, the Sidney Holt papers (2002, 2014) and the relevant parts of Smith's (1994) fascinating historical account on fisheries and whaling research during the crucial first century of these scientific endeavours, 1855-1955. An important landmark is the general IWC moratorium of 1986. The moratorium is not absolute: some indigenous people are allowed small quotas, some countries registered objections or reservations to the moratorium (Norway did), and some countries conduct scientific whaling. Most of the large whale species are still endangered. These species are not hunted and they are carefully monitored by the IWC. Current whaling efforts mostly involve species which are not endangered and have large, sustainable populations, like the minke whale.

The IWC grants in particular certain quotas for specific research purposes. Japan has been an active contributor to whaling research and has in this way obtained special permits. The Antarctic minke whale has been the main species in the Japanese Whale Research Program under Special Permit in the Antarctic (hereafter JARPA) and several aspects of its ecology and population structure have been studied. Some research is non-invasive, so to speak, whereas other investigations have involved catching (i.e. killing) the whales. Both the amount of killing and the scientific importance and quality of the research have been objects of high-level controversies. Dramatically, in 2010 Australia took Japan to court, fighting it out over several years in the International Court of Justice in The Hague. Basically, the Japanese whaling authorities were accused of killing far too many whales during the JARPA I (1988 to 2005) and JARPA II (2006 onwards) regimes, and of not delivering enough high-quality substantive research. A clear account of several of these aspects is given in this 2008 Guardian article – "campaigners and politicians condemn the practice as unethical and unnecessary, and say Japan's 'scientific' whaling programme is commercial whaling by another name".

Lars Walløe was an expert witness for Japan, but his standing in the whale science community is high enough, regarding the science, the statistics, and the politics, that he is not seen as or judged to be "Japan's man". During the cross-examination he also criticised some aspects of the JARPA experimental regimes. We mundane statisticians are not quite used to such a dramatic background and attention for our modelling and analysis efforts – so it took us a little while to understand that the innocent-looking task points (a), (b), (c), (d) above had some drastic political dimensions tied to them. We were instructed to communicate openly with a corresponding Australian team (comprising our esteemed colleagues Bill de la Mare, John McKinley, Alan Welsh), exchanging analyses and code. Were we to be seen as Norwegians bought by Japan to fight against Australia?

## The Key Question: Are the Whales Becoming Slimmer?

Now back to FocuStat's involvement: The species in question is the Antarctic minke whale (Balaenoptera bonaerensis). The Antarctic minke whale lives in the Southern Hemisphere and is, like most baleen whales, migratory. The summers are spent eating krill near the Antarctic, while the winters are spent closer to the Equator where the whales reproduce. Due to their relatively small size, the minke whales were mostly ignored during the period of large-scale commercial whaling, and thus the species has large populations, in the hundreds of thousands.

Our analyses concerned the body condition of the whales and whether it has decreased during the 18 years of the JARPA I period (1988 to 2005). Body condition should be understood as a measure of the "health" of the whales. In practice, the variables measured were the girth, blubber thickness, and fat weight of each whale – reflecting that for a whale the more blubber and fat the better, because this is where the whales store energy to live off during the breeding season.

Biologists are interested in the potential changes in body condition of minke whales because such changes could herald deeper transformations in the ecosystem. The krill surplus hypothesis states that the massive decline in abundance of large whales due to commercial whaling led to more krill available to minke whales (see also Konishi et al., 2008). The increased food availability may have led to a growth in the minke whale populations. If the minke whales now experience a reduction in body condition, this could be due to several non-exclusive factors. First, the minke whale population may have become too large compared to available resources and is thus experiencing a period of food limitation, potentially leading to a decline in population and stabilisation at a lower level. Secondly, some large whale species, notably the humpback, may have become sufficiently numerous to again cause significant competition for krill with the minke whales. Thirdly, the krill production could have been reduced, potentially due to climate change.

So what is there disagreement on? We will come to the details below, but, in short: Some members of the committee consider the decline in body condition to be sufficiently well documented, while other members deem that it is not. The latter assert that there is too much uncertainty to make any clear claims one way or the other. Strangely (or not), the disagreement seems to follow the lines of pro- or anti-whaling countries. The members from pro-whaling countries tend to claim that the decline has been documented, while members from countries not in favour of whaling (like Australia) tend to counter-claim that there is still too much uncertainty. (One should note that the pro-whaling countries, Norway, Japan, Iceland, are only pro-whaling in a limited sense; they work with small quotas for some species only, and where these species are very clearly sustainable.)

Is there a natural explanation for why the members or scientists of different nationalities would come to these different conclusions? We are not claiming that any scientists in the IWC are partisans to a certain view. But we all have our biases (even statisticians, occasionally), and there is always uncertainty and elbow room for further discussion – especially in these kinds of observational studies. We may venture a perhaps simplified explanation, as follows. Critics of Japanese scientific whaling (and perhaps critics of whaling generally) often claim that the whaling has not been scientific enough; that the programmes have failed to produce interesting results, that the studies have been poorly designed, and that clear scientific hypotheses have been lacking. They claim the results have not been valuable enough to warrant the large number of whales killed, and that the programme thus should be shut down. They also often suggest that the true motivation behind the JARPA programme has not been science, but food: to provide the Japanese market with whale meat. In this light, critics of Japanese whaling have an incentive to discredit all science coming out of the JARPA programme, including arguing against the alleged decline in minke whale body condition. Japan, on the other hand, has incentives to argue that facts and scientific results have emerged from the JARPA data, thus demonstrating that the scientific programme has been successful. Thus, the disagreement concerning the minke whale analysis does not appear to concern the wider discussion around the krill surplus hypothesis, but rather the narrower question of whether a proper scientific finding has been convincingly documented or not.

## Mixed Models for Fatness

As mentioned above, several measures of body condition were collected, but in this blog post we focus on one of these, the fat weight – a whale is caught, its fat dissected, then weighed. Let's look at the measured fat weights for a flock of whales captured over the 18 JARPA I years:

There does not seem to be much going on – certainly no clear decline! But wait! The groups of whales measured in different years may differ from each other, just by chance. For example, it is possible that in some years many more females than males were caught and that in other years the opposite happened. Female and male whales have quite large size differences and different fat reserves. Also, it may happen that many more whales were caught at the beginning of the (whaling/feeding) season compared to the end of the season. Thus $Sex$ and $Date$ constitute variables that we need to control for in order to carry out a correct analysis. Everyone agrees that one has to control for something – but the specific choice of what to control for, i.e. the model used, is where the disagreement starts. The date of capture (i.e. the $Date$), for example, is crucial to include in any model of this system, because the whales are in the Antarctic to gain weight! So if the whales caught in each year are caught on widely different dates (as may well happen) this has to be taken into account in the analysis. By including $Date$ in the model (basically, ranging from 1 to 120, the summer season days) we account for the general effect of date-in-the-season on fat weight, so that the remaining variation in fat weight may be explained by other variables, including the crucially important $Year$ (here coded to be 1, 2, 3, up to 17, 18, for the calendar years 1988 to 2005).

In some of the first analyses of these data, before Nils and Céline came into the picture, the scientists (including Lars Walløe) used linear regression methodology – basic, but robust and powerful. For instance, they used the following model,

\begin{align} FatWeight \sim & Year + BLm + Sex + Diatom + Date + Latitude + Age + BWt. \nonumber \end{align}

The notation used indicates that the fat weight of the whales is considered to be statistically related to, or governed by, the variables on the right side of $\sim$. As $Date$ varies, for example, fat weight is assumed to vary as well. Specifically, the explanatory variables are here assumed to influence fat weight in a linear way. The explanatory variable of primary interest is $Year$. If the estimated parameter related to this variable, say $\hat\beta_{\rm Year}$, is negative and large in absolute value, it suggests that the fat weight of the whales has decreased a lot over the years, when all the other explanatory variables are taken into account. Analyses like this usually come with (the famous or infamous) p-values, and if the p-value related to a specific explanatory variable is below some small threshold value, that variable is said to be significant. The same goes for more complex models, like those we employ below.
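The role of $Date$ as a control variable can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic and noise-free, the coefficients ($-0.008$ tonnes per year, $+0.005$ tonnes per day) are invented, and the helper functions `solve` and `ols` are hypothetical names, not code from the actual analyses. The design lets later years be sampled a little later in the season, so a regression on $Year$ alone picks up the seasonal weight gain and gets the sign of the trend wrong, while including $Date$ recovers it.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations (X^t X) beta = X^t y."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)


# Synthetic, noise-free data (all numbers are made up for illustration):
# fat weight falls by 0.008 tonnes per year and rises by 0.005 tonnes per day,
# while each successive year is sampled two days later in the season.
years, dates, fat = [], [], []
for year in range(1, 19):
    for base_date in (10, 30, 50, 70):
        date = base_date + 2 * year
        years.append(year)
        dates.append(date)
        fat.append(2.0 - 0.008 * year + 0.005 * date)

# Model 1: FatWeight ~ Year (Date ignored)
beta_naive = ols([[1.0, yr] for yr in years], fat)
# Model 2: FatWeight ~ Year + Date
beta_adjusted = ols([[1.0, yr, d] for yr, d in zip(years, dates)], fat)

print(f"Year slope without Date: {beta_naive[1]:+.4f}")
print(f"Year slope with Date:    {beta_adjusted[1]:+.4f}")
```

Because the toy data contain no noise, there are no standard errors or p-values to report here; the sketch only shows how the estimated $Year$ coefficient depends on what else is in the model.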

In yet other words, ten years of research and statistical discussions (well, partly) have gone into deciding whether the underlying regression model parameter $\beta_{\rm Year}$ is significantly negative or not. This sounds perhaps easier than it turns out to be, as complex statistical issues related to

• (i) selecting a good model (from a broad class of plausible candidate models);
• (ii) estimating the parameters well enough, for the given model;
• (iii) having follow-up methods to assess the degree of significance;
• (iv) how to summarise and convey findings, both to fellow scientists and to a broader audience

come into play, with ample room for discussions (with half-agreements and half-disagreements), among and between statisticians, biologists, other scientists, and others taking part in scientific communication and interpretation (from Greenpeace and anti-whaling groups to readers of newspapers).

For the first serious analyses carried out with the JARPA I data, as with Konishi et al. (2008), the effect of year was found to be negative and significant, and the results were presented to the Scientific Committee. Here they were criticised, notably by Australian scientists, and some of the criticism concerned point (i) above – the class of models worked with, as well as the methods used to select "the best" amongst these models. The linear regression models assume that the observations are all independent of each other (or actually that the error terms are independent) and that the effect of each explanatory variable is exactly the same for all observations, i.e. that the effect of say $Date$ should be the same for all the whales in the dataset. But what if the effect of date is slightly different from year to year? This does not seem so unlikely: remember that the effect of date can be considered as a measure of the rate at which the whales gain weight during the season. Some years there may be lots of krill, other years there may be less – so it is conceivable that the rate of weight gain during the season could be slightly different from year to year. Moreover, the year-to-year variation in the effect of date may well be considered as random, since it could be due to random fluctuation in krill production from year to year (or at least random in the sense of "infinitely more complex than we care or have the chance to model at this point").

All this brings us to the world of random effect models and specifically the class of linear mixed effect models (with variations) – which was suggested by some members of the Scientific Committee as a better choice of model class than linear regressions. This is where Céline and Nils come in. We have worked with classes of plausible mixed effects models, where the theory has been well worked out for some of our purposes, though not for all. The journal article we're writing up, with Lars Walløe as co-author, is hence partly methodological, as we have put considerable effort into hammering out formulae and properties for new Focused Information Criteria for model selection (see below), and partly a report on the rather complex application of these methods to our whales.

One of these linear mixed models we've used, to analyse just precisely how $Year$ and $Date$ and various other relevant covariate informants influence the fatness of whales, can be written down, in the appropriate statistical model coding language, as follows:

\begin{align} FatWeight \sim & Year + BLm + Sex + Diatom + Ice + Date + Date^2 + Latitude \nonumber \\ & + Sex:FetusLength + Sex:Diatom + Diatom:Date \nonumber \\ & + Diatom:Date^2 + Latitude:Date + Latitude:Date^2 \nonumber \\ & + Region + Year:Region + Latitude:Region + Sex:Region \nonumber \\ & + Diatom:Region + (1 + Date + Date^2| Year) . \nonumber \end{align}

Explaining all the terms in this model is outside the scope of this post, but interested readers may consult our report. The model includes many explanatory variables and also interactions between them. For instance $Year:Region$ means that we allow the effect of year to be potentially different in three geographical regions where the whales were hunted. The term $(1 + Date + Date^2| Year)$ specifies the random effect structure. It enables random variations between the 18 years, influencing both the intercept ($1$), the linear effect of date and the quadratic effect of date. In practice this entails estimating a $3 \times 3$ covariance matrix as part of the general model estimation scheme. These types of models also allow observations within the same year to be dependent upon each other.
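Spelled out for whale $j$ caught in year $i$, the random-effects term can be read as follows. This is a sketch in generic textbook notation: the symbols $x_{ij}$, $b_{0i}, b_{1i}, b_{2i}$, $\Sigma_b$ and $\varepsilon_{ij}$ are assumptions introduced here for illustration, with $x_{ij}^{\rm t}\beta$ collecting all the fixed-effect terms and $\Sigma_b$ standing for the $3 \times 3$ covariance matrix mentioned above.

$$\begin{aligned}
FatWeight_{ij} &= x_{ij}^{\rm t}\beta + b_{0i} + b_{1i}\,Date_{ij} + b_{2i}\,Date_{ij}^2 + \varepsilon_{ij}, \\
(b_{0i}, b_{1i}, b_{2i})^{\rm t} &\sim {\rm N}_3(0, \Sigma_b), \qquad \varepsilon_{ij} \sim {\rm N}(0, \sigma^2) \ \hbox{independently}.
\end{aligned}$$

Two whales caught in the same year share the same $(b_{0i}, b_{1i}, b_{2i})$, which is what makes their fat weights dependent. Setting $\Sigma_b = 0$ removes the year-specific deviations and brings back an ordinary linear regression.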

Some of the effects found, via estimating and assessing all parameters of such a model, are visualised in the following two figures. In the first we see the effect of $Date$ for the 18 different years. As expected, there is a positive relationship; the whales gain fat during the season (as they ought to!). There is also a noticeable difference between the years in terms of the date effect.

The second figure shows the (estimated) effect of Year in three geographical regions; the fatter black curve is the appropriately weighted statistical average. The effect is negative and significant (as seen via the p-values and other statistical summaries we reach), but does perhaps not seem particularly large in absolute size. What constitutes "a large effect" or not is both a statistical and biology-context question. The scale is in tonnes, and the figure indicates that whales are losing on average 80 kg in pure fatness over 10 years.

## Focused Information Criteria and Confidence Curves for Whales

In addition to motivating from biology and context, constructing, fitting, and analysing the mixed effects model we give above, we have been paying attention to the model selection task; from a class of biologically plausible models, which is best, and precisely what is 'best' supposed to mean in such a context? The Scientific Committee has also in several past meetings discussed such questions, thus having on its table issues like whether AIC is better than BIC, in certain contexts – the Akaike Information Criterion and the Bayesian Information Criterion are largely the two most popular model selection criteria in the statistics literature. The Claeskens and Hjort (2008) book offers a broad treatment of these and other selection methods, also for different types of models. Their favourite, in various circumstances, is the FIC, the Focused Information Criterion developed by Hjort and Claeskens in two JASA discussion papers in 2003.

Basically, when there is a clearly posed primary question, as here for the $\beta_{\rm Year}$ coefficient, the FIC is programmed to do the best job. The FIC scheme, while having a general structure to aim for (estimating the mean squared error for each of the many estimators stemming from the list of candidate models, and then selecting the model with smallest such), has over the past 10-15 years been developed for several classes of models, but there is no FIC in the literature for linear (and nonlinear) mixed models. Hence part of our efforts, in connection with the Cunen, Walløe, Hjort (2018) journal article we're writing up, have been to hammer out the necessary formulae for biases and variances for all candidate models of the mixed effects type. We refrain from giving all details here (but see our 2017 reports to the IWC, and our 2018 paper). After rather a lot of statistical theory and algebra, where such mouthfuls as

$$\begin{aligned}
J_{11,i} &= \sigma^{-2} X_i^{\rm t} V_i^{-1} X_i, \\
J_{12,i} &= 2 \sigma^{-3} X_i^{\rm t} V_i^{-1} (\xi_i - X_i\beta), \\
J_{13,i} &= \sigma^{-2} [(\xi_i - X_i\beta)^{\rm t} V_i^{-1}Z_i \otimes X_i^{\rm t} V_i^{-1} Z_i], \\
J_{22,i} &= - m \sigma^{-2} + 3 \sigma^{-4} [{\rm Tr}(V_i^{-1} \Sigma_i) + (\xi_i - X_i\beta)^{\rm t} V_i^{-1} (\xi_i - X_i\beta)], \\
J_{23,i} &= \sigma^{-3} [{\rm vec}(Z_i^{\rm t} V_i^{-1} \Sigma_i V_i^{-1} Z_i) + (\xi_i - X_i\beta)^{\rm t} V_i^{-1} Z_i \otimes (\xi_i - X_i\beta)^{\rm t} V_i^{-1} Z_i], \\
J_{33,i} &= \tfrac{1}{2} [\sigma^{-2} (Z_i^{\rm t} V_i^{-1}\{\Sigma_i + (\xi_i - X_i\beta) (\xi_i - X_i\beta)^{\rm t}\}V_i^{-1} Z_i \otimes R_i \\
& \qquad + R_i \otimes Z_i^{\rm t} V_i^{-1} \{\Sigma_i + (\xi_i - X_i\beta) (\xi_i - X_i\beta)^{\rm t}\}V_i^{-1} Z_i - R_i \otimes R_i ) ].
\end{aligned}$$

are a mere fraction, we have indeed the required FIC for Whales formulae. In a couple of technical lines, for the statistical methodologists out there, the primary model class we work with is that with observation vectors $y_i$, along with covariate matrices $X_i$ and $Z_i$, of the form

$$y_i \sim {\rm N}_{m_i}(X_i \beta, \sigma^2(I + Z_i D Z_i^{\rm t}))$$

for $i=1,\ldots,n$, and with unknown regression parameter vector $\beta$ along with variance $\sigma^2$ and additional variance component parameters in the $D$ matrix. We then consider all such submodels of a pre-chosen Wide Model, with submodels corresponding typically to subsets of $X_i$ and $Z_i$ matrices, and for each we work out an estimated mean squared error, for the focus parameter under study.
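To make the covariance structure $\sigma^2(I + Z_i D Z_i^{\rm t})$ concrete, here is a minimal sketch for a single year. It rests on assumptions that are not taken from the report: the rows of $Z_i$ are taken to be $(1, Date, Date^2)$ with dates rescaled to $[0, 1]$, the matrix `D_example` holds invented values, and `marginal_covariance` is a hypothetical helper name.

```python
def marginal_covariance(scaled_dates, D, sigma2=1.0):
    """Return sigma^2 * (I + Z D Z^t) for one year.

    Assumption: each row of Z is (1, date, date^2), with dates rescaled to [0, 1].
    """
    Z = [[1.0, d, d * d] for d in scaled_dates]
    m = len(Z)
    V = []
    for j in range(m):
        row = []
        for k in range(m):
            # (Z D Z^t)[j][k] = sum over a, b of Z[j][a] * D[a][b] * Z[k][b]
            z_D_z = sum(Z[j][a] * D[a][b] * Z[k][b] for a in range(3) for b in range(3))
            identity = 1.0 if j == k else 0.0
            row.append(sigma2 * (identity + z_D_z))
        V.append(row)
    return V


# Invented variance components, for illustration only.
D_example = [[0.5, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.05]]
D_zero = [[0.0] * 3 for _ in range(3)]
scaled_dates = [0.1, 0.5, 0.9]  # three whales caught early, mid and late in the season

V_mixed = marginal_covariance(scaled_dates, D_example)
V_plain = marginal_covariance(scaled_dates, D_zero)

for row in V_mixed:
    print([round(v, 4) for v in row])
```

With `D_zero` the result is the identity matrix, i.e. the independent-errors setup of ordinary linear regression. With a non-zero $D$ the off-diagonal entries are non-zero, so whales caught in the same year are correlated.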

For our whales, the $y_i$ might be the list of all fat weights for whales captured for given year $i$. Our FIC for Whales then goes through all estimates of $\beta_{\rm Year}$, and ranks these from the best to the worst. For these fat weight data, and for a carefully listed pack of candidate models (all of them biologically plausible), the FIC machinery produces the following FIC plot. The estimates for $\beta_{\rm Year}$ are all seen, with root-FIC on the x-axis, and models landing to the left of the plot are better than the others (for the specific purpose of doing well for estimating $\beta_{\rm Year}$). We learn both about which models are better than others, and that the best models agree on a point estimate around $-0.008$. This is on the scale of tonnes per year, which means that the average whale is perhaps 80 kg slimmer, in pure fat weight, over 10 years. This is also in agreement with results pointed to above.
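The ranking step can be sketched in a few lines. This is a deliberately simplified illustration, not the FIC formulae for mixed models: it takes the score to be the square root of squared bias plus variance, and the three candidate models with their bias and variance values are invented for the example.

```python
import math

# Hypothetical (made-up) bias and variance estimates for the focus parameter
# beta_Year under three candidate models; not taken from the whale analyses.
candidates = {
    "wide model": {"bias": 0.000, "variance": 0.0035 ** 2},
    "medium model": {"bias": 0.001, "variance": 0.0025 ** 2},
    "narrow model": {"bias": 0.004, "variance": 0.0020 ** 2},
}


def root_fic(bias, variance):
    """Square root of the estimated mean squared error, sqrt(bias^2 + variance)."""
    return math.sqrt(bias ** 2 + variance)


def rank_by_fic(models):
    """Return (name, root-FIC) pairs sorted from best (smallest score) to worst."""
    scores = [(name, root_fic(m["bias"], m["variance"])) for name, m in models.items()]
    return sorted(scores, key=lambda pair: pair[1])


ranking = rank_by_fic(candidates)
for name, score in ranking:
    print(f"{name}: root-FIC = {score:.4f}")
```

In this toy setting the medium model wins: the wide model is unbiased but has a larger variance, while the narrow model has a small variance but pays for it in bias. This is the trade-off that a mean-squared-error criterion is meant to balance for the focus parameter.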

The IWC Scientific Committee delegates largely agreed with our choice of FIC as the most relevant model selection criterion; certain disagreements voiced by the Australian Team were related to other matters (like the choice of our Wide Model above). Importantly, at least for us, delegates also essentially accepted and appreciated our choice of format for presenting summaries. The traditional format for reporting summaries about focus parameters, in many sciences, is that of a point estimate with a 95% confidence interval. This is in many situations not satisfactory. We go rather for the confidence curve, associated with a broad theory for confidence distributions worked with and presented in the Schweder and Hjort (2016) book. Such a confidence curve provides information about both the estimate in question and about confidence intervals at all levels. This is particularly useful in cases with skewness and other irregularities, and for combining information across diverse sources; see Hjort and Schweder (2017) for more on these matters. Below we display ${\rm cc}(\beta_{\rm Year})$, indicating both the overall estimate of this primary interest parameter (so yes, the whales are becoming slimmer) and the confidence intervals at all levels. The 95% confidence interval is [-0.0137, -0.0023] (so yes, the finding is significant; the full interval is to the left of zero).
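How a confidence curve packs all confidence intervals into one function can be sketched under a normal approximation. The point estimate and the 95% interval are the values quoted in the text; the standard error backed out from that interval, the normal form ${\rm cc}(\beta) = |1 - 2\Phi((\beta - \hat\beta)/{\rm se})|$, and the function names are assumptions made for this illustration. The curve in the actual analysis may be constructed differently.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Values quoted in the text: point estimate and 95% interval, in tonnes per year.
beta_hat = -0.008
lower95, upper95 = -0.0137, -0.0023

# Assumption: back out a standard error from the interval via a normal approximation.
se = (upper95 - lower95) / (2 * STD_NORMAL.inv_cdf(0.975))


def confidence_curve(beta):
    """Normal-approximation confidence curve cc(beta) = |1 - 2 * Phi((beta - beta_hat) / se)|."""
    return abs(1.0 - 2.0 * STD_NORMAL.cdf((beta - beta_hat) / se))


def interval_at(level):
    """Confidence interval at the given level, i.e. the set where cc(beta) <= level."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    return beta_hat - z * se, beta_hat + z * se


def kg_per_decade(tonnes_per_year):
    """Convert a yearly change in tonnes to a ten-year change in kilograms."""
    return tonnes_per_year * 1000.0 * 10.0


for level in (0.50, 0.90, 0.95):
    low, high = interval_at(level)
    print(f"{level:.0%} interval: [{low:.4f}, {high:.4f}] tonnes per year")

print(f"cc(0) = {confidence_curve(0.0):.3f}")
print(f"Point estimate: {kg_per_decade(beta_hat):.0f} kg over 10 years")
```

The curve is zero at the point estimate and reaches 0.95 exactly at the two endpoints of the 95% interval. Since ${\rm cc}(0)$ lies above 0.95 here, zero falls outside that interval, which matches the statement that the interval is entirely to the left of zero.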

## Céline and Nils at the Meeting

When travelling to the IWC Scientific Committee Meeting in Bled, Slovenia, May 2017, we thought we were perhaps in for something like this (Céline in front, Nils rowing):

... but were in fact experiencing this instead:

There were between fifty and a hundred delegates and administration staff in a conference room, hour after hour, day after day. Those presenting reports did not go to a whiteboard or behind a projector, but rather pressed the microphone button in front of them and went through the salient points, with each delegate having a copy of the reports. Each presentation was then followed by a thorough and often detailed going-through, with delegates offering both criticism and positive comments, along with various follow-up questions.

This was in particular the case with the three reports Céline presented, since the Australian Team were well prepared. Part of the learning experience was to think under fire, knowing when to ask for permission to give additional comments, or to protest against what others have been saying ("Mrs Chairlady, may I respond to de la Mare's two last points?"). As such the daily proceedings might in part look and behave more like a courtroom and its cross examinations (complete with English-to-Japanese interpreters speaking rapidly but softly into their microphones) than a university seminar room. Occasionally we were asked questions of the type "but what if you included such-and-such but excluded that-and-that in the analyses, wouldn't that alter your conclusions?". Céline was luckily well prepared for such questions, via flexible well-working R programmes which could be put to use and give clear answers to our critics, almost in real time. This type of lawyer-like preparedness, and thinking-under-fire abilities, are outside what we tend to learn in Becoming a Statistician courses!

For us it was a rare privilege, albeit a slightly mixed one, to have so eager, clever, quick-minded, educated, and critical readers, as we had on this occasion. After sending 25-page preliminary reports to our Australian colleagues, 40-page reports would come back two weeks later – friendly, polite, well-written, detailed, but critical. Perhaps this intense and detailed degree of response is even above the usual level of PhD disputation opponents? We humbly appreciated these efforts of our colleagues-in-whales, both when we ended up agreeing with some of their points, and when we did not.

There is one more learning lesson for us to point to here, and where we were perhaps not quick-minded or clever or careful enough. We thought, more or less, that the work had been carried out when our reports had been delivered and discussed through-and-through, and, as we interpreted matters, essentially agreed upon by the delegates. There is however one more chapter in such stories, associated with the careful and minute writing down of all agreed-upon points, with a proper list of caveats, etc. This is carried out by the chairperson writing down a first draft, followed by one or two iterations where the individual delegates suggest certain changes, etc. Here we experienced that a few crucial sentences were changed (sometimes subtly, sometimes with a bit of substance), when we thought our job had been finished. These matters carry practical and political importance, since such summary reports land on the tables of government committees in a long list of countries, with potential consequences for fisheries and indeed whaling for the coming years.

So what was the end result – after all the work put in by both the Norwegian and Australian teams, the hundreds of report pages and lines of R-code, the hour-long discussions in Bled? The positions of both sides remained for the most part unchanged. The delegates agreed that progress had been made and acknowledged the contributions of both teams, but the main question concerning the body condition of Antarctic minke whales will again be discussed in the 2018 IWC Scientific Committee meeting, and quite likely in meetings for several years to come.

## Wider Perspectives

Our brief experience with the discussions in the IWC Scientific Committee meetings reflects a general pattern that emerges where politics intersects with science and statistics. In many such debates, the opposing sides do not necessarily defend genuinely opposing views; rather, one side will claim that "a fact has been proven", while the other side will hold the view that the evidence is still insufficient, that there is still uncertainty. A list of famous instances of this type includes the following.

• Climate change: There is strong consensus in the Intergovernmental Panel on Climate Change (IPCC) that the planet's average temperature is rising, that Homo sapiens is partly to blame, and with half a Nobel Prize to prove it. Nevertheless a segment of educated and intelligent people argue against aspects of these findings. For instance, one interesting statistical analysis, fitting long-memory but in principle stationary models to temperature time series, appears to indicate that the temperature swings are still within what a stationary explanation allows.
• Smoking and lung cancer: Here Sir Ronald A. Fisher, one of the towering figures of 20th-century statistical methodology, took the side of the uncertainty team; his arguments were along the lines of "correlation is not causation" and that observational studies were not good enough; also, he was cherry-picking his data.
• Salmon farming in Norway: The industry has grown rapidly, and perhaps too rapidly; researchers argue against other researchers (complete with exclamation marks and harsh words and threats), regarding the short- and long-term consequences for the salmon, their sicknesses, and the industry.
• Where do the Norwegian wolves come from: Wolf politics is a long-term high-emotion spectacle in Norway, also with biology and genetics to fight over. Again a bystander might be forgiven for thinking that political views influence how different groups interpret the same information.
• The Long Peace: Steven Pinker and other better angels of our nature interpret big sets of well-organised Tolstoyan war-and-peace data from over the past two hundred years as implying that we've now entered an era of fewer inter-nation wars and generally lower levels of conflict. Many scholars are critical of these interpretations and ensuing predictions, however (from Clauset to Taleb to Østerud and dozens of others, with their readers almost seeing the chalks and sponges flying through the air of the university seminar rooms).

Most such debates concern questions that lie outside the realm where it is practical or possible to conduct randomised trials, and where one hence has to rely on observational studies. This is also the case here – it is inconceivable to construct a randomised trial in order to investigate the health and body condition of the minke whales in the Antarctic Ocean. The interpretation of results from observational studies is perennially difficult – there is always a possibility that one has failed to control for some important variable, or that the correlation one observes is actually due to the response and predictor being controlled by a common unmeasured factor. One may also question the validity of the results when the model is not perfectly correct, or raise a number of other details of statistical practice.
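The common-factor problem is easy to demonstrate on simulated data. The following is a minimal Python sketch, not part of the whale analysis: all names (`pearson`, `residuals`, `simulate`) and all numbers are assumptions made up for illustration. A hidden factor `z` drives both a predictor `x` and a response `y`, while `x` has no effect on `y` at all. The raw correlation is nevertheless clearly positive, and it vanishes once `z` is adjusted for, which of course requires that `z` has been measured.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, zs):
    """Residuals from a least-squares regression of ys on zs (with intercept)."""
    n = len(ys)
    mz = sum(zs) / n
    my = sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [y - my - slope * (z - mz) for y, z in zip(ys, zs)]


def simulate(n=5000, seed=1):
    """Return (raw correlation, correlation after adjusting for z)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # common factor, unmeasured in practice
    x = [zi + rng.gauss(0, 1) for zi in z]    # predictor, driven by z
    y = [zi + rng.gauss(0, 1) for zi in z]    # response, driven by z but not by x
    raw = pearson(x, y)
    adjusted = pearson(residuals(x, z), residuals(y, z))
    return raw, adjusted


raw_corr, adjusted_corr = simulate()
print(f"raw correlation between x and y: {raw_corr:.3f}")
print(f"correlation after adjusting for z: {adjusted_corr:.3f}")
```

In this toy setup the theoretical raw correlation is 1/2 and the adjusted one is 0; the uncomfortable point for observational studies is that the second number is only available to those who knew to record `z`.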

So, in a way, the Uncertainty Teams are always right, but often trivially, boringly, unfruitfully right; there is uncertainty. Often enough, at least in complex situations that matter, and where easy conclusions are out of reach, fitting some adequate statistical models and carrying out decently clever inference procedures based on these is not enough. That is why one should attempt to complement statistical analyses with ... something more. What comprises "something more" varies with problem and context, cf. the five problem areas pointed to above, and with no easy general recipe. It could take the form of borrowing statistical strength from other data sources, finding supporting principles or evidence from the relevant sciences (which could involve searching for Laws of Nature, even when these are present only as weak signals accompanied by high noise), conveying potentially important preliminary results to the right parties, contributing to and fine-tuning the right narrative, etc. Statisticians should not leave such "something more" issues and complexities to other scientists, but should learn to play more active roles also in these endeavours.

Coming back to our slimming minke whales, to tentatively illustrate the latter point, establishing the significant negativity of the $\beta_{\rm Year}$ parameter counts as an important finding in its own right, along with assessment of its size. Such a finding becomes more important when it is set in a proper biological dynamic interplay context, however, which involves attempting to understand other components of the bigger picture. The change in fatness over time could be related to other dynamic phenomena (cf. the krill hypothesis), to the complexities of and changes in the food-web, and potentially to climate change. Statistics and statisticians are needed to help sort out these complexities, to build better understanding of the dynamics of the sea, and to model, analyse and predict the short- and long-term fates of the whales.
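To make "significant negativity, along with assessment of its size" concrete, here is a deliberately simplified Python sketch. It is an assumption-laden toy, far simpler than the models used in the reports: the data are synthetic, the slope of -0.03 per year, the 18 seasons, the 50 animals per season and the noise level are all invented, and an ordinary least-squares line stands in for the real model. The helper `ols_slope_with_ci` is hypothetical and returns the estimated slope with an approximate 95% interval based on the normal quantile 1.96.

```python
import math
import random


def ols_slope_with_ci(xs, ys, z=1.96):
    """Least-squares slope of ys on xs, with an approximate 95% interval."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance with n - 2 degrees of freedom, then the slope's standard error.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope - z * se, slope + z * se


# Synthetic data only: every number below is made up for illustration.
rng = random.Random(2017)
TRUE_SLOPE = -0.03
years, fatness = [], []
for year in range(18):            # hypothetical number of seasons
    for _ in range(50):           # hypothetical number of animals per season
        years.append(year)
        fatness.append(4.0 + TRUE_SLOPE * year + rng.gauss(0, 0.5))

beta_year, ci_lower, ci_upper = ols_slope_with_ci(years, fatness)
print(f"estimated slope: {beta_year:.4f}")
print(f"approximate 95% interval: ({ci_lower:.4f}, {ci_upper:.4f})")
```

An interval lying entirely below zero is the toy counterpart of a significantly negative $\beta_{\rm Year}$; its width speaks to the size of the decline. Neither says anything about why the decline occurs, which is where the biological context enters.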

## Thanks

We are grateful to our collaborator Lars Walløe for always interesting and inspiring conversations and for explaining intricacies to us about whales, politics, and more. We also appreciate discussions with Emil Aas Stoltenberg, on long lists of general issues, as well as on details of this blog post.

## References

Claeskens, G. and Hjort, N.L. (2008). Model Selection and Model Averaging. Cambridge University Press.

Clauset, A. (2017). The Enduring Threat of a Large Interstate War. OEF Research Report.

Cunen, C. and Hjort, N.L. (2016). Combining information across diverse sources: The II-CC-FF paradigm. Proceedings from the Joint Statistical Meeting 2016, the American Statistical Association, 138-153.

Cunen, C., Walløe, L., and Hjort, N.L. (2017). Decline in energy storage in Antarctic Minke whales during the JARPA period: Assessment via the Focused Information Criterion (FIC). Reports of the Scientific Committee of the International Whaling Commission SC/67A/EM/04.

Cunen, C., Walløe, L., and Hjort, N.L. (2018). Focused model selection for linear mixed models, with an application to whale ecology. Manuscript.

Daily News (newspaper article, by Deborah MacKenzie, 2014). Japan ordered to stop 'scientific' whaling. "Japan's scientific whaling programme in the Antarctic is not “for purposes of scientific research”, and therefore must stop. That is the ruling by the UN's International Court of Justice in The Hague, the Netherlands."

Dagsvik, J., Fortuna, M., and Moen, S.H. (2015). How does the temperature vary over time? Manuscript, Statistics Norway.

Davis, L., Gallman, R.E., and Gleiter, K. (1997). In Pursuit of Leviathan: Technology, Institutions, Productivity, and Profits in American Whaling, 1816-1906.

Demidenko, E. (2013). Mixed Models: Theory and Applications With R. Wiley.

Fisher, R.A. (1958). Cigarettes, Cancer, and Statistics. The Centennial Review of Arts & Science, 2, 151-166.

Guardian (newspaper article, by David Adam, August 2008). Whales losing blubber, claims controversial Japanese study. "Data from Japan's widely condemned whaling programme suggests a loss of fat over the past 20 years may be due to climate change, but some claim the study is unethical."

Hjort, J., Jahn, G., and Ottestad, P. (1933). The optimum catch. Hvalrådets Skrifter, 7, 92-127.

Hjort, J. (1937). The story of whaling. The Scientific Monthly, 45, 19-34.

Hjort, J. (1938). The Human Value of Biology. Harvard University Press.

Hjort, N.L. (2016). Recruitment Dynamics and Stock Variability: The Johan Hjort Symposium, some personal reflections. FocuStat blog post.

Hjort, N.L. and Schweder, T. (2017). Confidence distributions and related themes. Editorial overview, for a Special Issue of the Journal of Statistical Planning and Inference, with Hjort and Schweder as guest editors.

Holt, S. (2002). ICES involvement in whaling and whale conservation, and implications of IWC actions. ICES Marine Science Symposia, 215, 464-473.

Holt, S. (2014). The graceful sigmoid: Johan Hjort's contributions to rational fishing. ICES Journal of Marine Science, 71, 2008-2011.

Houde, E. (2008). Emerging from Hjort's shadow. Journal of Northwest Atlantic Fishery Science, 41, 53-70.

Konishi, K., Tamura, T., Zenitani, R., Bando, T., Kato, H., and Walløe, L. (2008). Decline in energy storage in the Antarctic minke whale (Balaenoptera bonaerensis) in the Southern Ocean. Polar Biology, 31, 1509-1520.

Konishi, K. and Walløe, L. (2015). Substantial decline in energy storage and stomach fullness in Antarctic minke whales during the 1990s. Journal of Cetacean Research and Management, 15, 77-92.

McCulloch, C.E., Searle, S.R., and Neuhaus, J.M. (2008). Generalized, Linear, and Mixed Models [2nd ed.]. Wiley.

Melville, H. (1851). Moby-Dick; or, The Whale.

NRK (2017). Slaget om kvalen (The Fight About the Whale). TV programme.

Pinker, S. (2011). The Better Angels of Our Nature: Why Violence Has Declined. Penguin Books.

Schwach, V. (2002). Internationalist and Norwegian at the same time: Johan Hjort and ICES. ICES Marine Science Symposia, 215, 39-44.

Schweder, T. and Hjort, N.L. (2016). Confidence, Likelihood, Probability. Statistical Inference With Confidence Distributions. Cambridge University Press.

Smith, T.D. (1994). Scaling Fisheries. The Science of Measuring the Effects of Fisheries, 1855-1955. Cambridge University Press.

Stolley, P.D. (1991). When genius errs: R.A. Fisher and the lung cancer controversy. American Journal of Epidemiology, 133, 416-425.

Vikingsson, G.A., Elvarsson, B., Ólafsdóttir, D., Sigurjónsson, J., Chosson, V., and Galan, A. (2012). Recent changes in the diet composition of common minke whales (Balaenoptera acutorostrata) in Icelandic waters. A consequence of climate change? Marine Biology Research, 10, 138-152.

Østerud, Ø. (2008). Towards a more peaceful world? A critical view. Conflict, Security & Development, 8, 223-240.

By Céline Cunen, Nils Lid Hjort
Published Jan. 6, 2018 7:23 PM - Last modified Nov. 5, 2018 9:26 AM