# New statistical methods shed light on medieval literary mystery

Full of adventures, battles and love stories, the chivalric romance Tirant lo Blanch is a masterpiece of medieval literature. The novel, sometimes claimed to be the world's first, is made even more fascinating by the fact that its original author died before its completion and another author had to take over. The mystery of where the change of author takes place constitutes a statistical change-point challenge.

Title page of the first Castilian translation of Tirant lo Blanch (Wikipedia commons).

Written in the 1460s, Tirant lo Blanch is famous for its satirical style, its realistic descriptions of military campaigns and its prominent sexual undertones. It is also famous for the debate around its authorship. Joanot Martorell and his friend Martí Joan de Galba, both real-life knights, are the two authors of Tirant. Martorell died before the book was completed, and de Galba finished the work. The mystery lies in where this change of author took place. Which chapters can be attributed to which author? This issue has intrigued literary scholars and statisticians alike, and here we provide our solution using a brand new change-point detection method.

Before we come to statistical details, a brief synopsis of Tirant is in order. The book is after all quite fantastic, and deserves wider fame outside its native Catalonia and the offices of scholars of literature. The story is presented as the true story of the valiant English knight Tirant the White. It starts in England, but Tirant quickly leaves and travels to different parts of Europe. He experiences many fights, duels and adventures, including arranging a marriage between a French prince and a Sicilian princess. However, the main part of the book concerns the attempted invasion of Constantinople by Ottoman Turks. Tirant is made a general of Constantinople, saves the city and is promised the hand of princess Carmesina. Here our attentive readers will have realised that the story in Tirant cannot in fact be true: Constantinople fell in 1453, never again to be held by Christian rulers. Some literary scholars see the whole story as a sort of alternate history book, where the authors wrote history as they wanted it to be. Besides Tirant and Carmesina, other important characters are Tirant's impertinent female friend and assistant Plaerdemavida (literally Pleasure-of-my-life), the emperor and empress of Constantinople and Tirant's Ethiopian friend King Escariano.

When analysing literary style, as we want to do here, we first need to decide what to measure. An author's style may be measured in many different ways, for example by looking at sentence lengths and word choice. Here we have chosen to look at word lengths; more precisely, we look at the proportion of words of each length in each chapter of the book. The book contains 487 chapters in total, and for each chapter we have collected 10 proportions (which sum to one within each chapter). See the table below. For example, for the first chapter of the book (the first row of the table), we see that 8% of the words were one-letter words (i.e. of length 1), 23% were two-letter words, and 7% had 10 letters or more (the last column of the table).
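As a rough illustration of how such proportions could be computed, here is a minimal Python sketch. The function name `word_length_proportions`, the tokenisation rule (any run of letters counts as a word, so apostrophes split words) and the toy English sentence are all assumptions made for this example; they are not the preprocessing actually used on the text of Tirant.

```python
import re
from collections import Counter

MAX_LEN = 10  # words with 10 or more letters share the last bin


def word_length_proportions(text, max_len=MAX_LEN):
    """Return max_len proportions: share of words of length 1, 2, ..., max_len+."""
    # Assumption: a "word" is any run of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        raise ValueError("chapter contains no words")
    # Cap each word length at max_len so long words fall in the last bin.
    counts = Counter(min(len(w), max_len) for w in words)
    total = len(words)
    return [counts.get(k, 0) / total for k in range(1, max_len + 1)]


# Toy sentence standing in for a chapter (not taken from the book).
chapter = "I am a knight of extraordinarily good fame"
props = word_length_proportions(chapter)
```

Applied to every chapter in turn, this would give one row of 10 proportions per chapter, each row summing to one.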

Of course we could have chosen several other aspects to monitor; check out the full article for some other possibilities (to be published in JSPI). There you can also find a much more detailed description of our two change-point methods, called method A and method B. In order to use our method B (which is the one we will use here), we need to assume a model for the data. An obvious choice for this kind of count data is a multinomial model, or a multinomial-Dirichlet in order to allow for more heterogeneity between chapters. The following multinormal distribution proved to be the most fruitful, however:

$$\begin{aligned} \boldsymbol{z}_i &\sim \mathrm{N}_9(\boldsymbol{\xi}_L, \Sigma_L/m_i) \quad \text{for } i \le \tau, \\ \boldsymbol{z}_i &\sim \mathrm{N}_9(\boldsymbol{\xi}_R, \Sigma_R/m_i) \quad \text{for } i \ge \tau+1. \end{aligned}$$

Here $\boldsymbol{z}_i$ is the 9-dimensional vector of word length proportions from chapter $i$ and $\tau$ is the parameter designating the unknown change-point. The vector has length 9, and not 10, since the proportions from each chapter necessarily sum to 1. The formula above indicates that we assume that the vector of proportions from each chapter comes from (possibly) different multinormal distributions on each side of $\tau$. We allow the change-point to cause a change both in the mean vector $\boldsymbol{\xi}$ and in the covariance matrix $\Sigma$. Given a model and the data, our method identifies the most likely change-point. Informally, this is achieved by finding the point where we get the largest possible difference between the models on each side of the change-point. Some details on the method follow for interested readers (others can skip them).
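To make the idea of the "most likely change-point" concrete, here is a small Python sketch that fits the multinormal model separately on each side of every candidate $\tau$ and keeps the candidate with the highest total log-likelihood. It rests on several assumptions that are not stated in the text: $m_i$ is taken to be the number of words in chapter $i$, the data are simulated in 3 dimensions rather than 9 to keep the example small, and the function names and the minimum segment length `min_seg` are hypothetical.

```python
import numpy as np


def segment_loglik(Z, m):
    """Maximised log-likelihood of one segment under z_i ~ N_d(xi, Sigma / m_i)."""
    n, d = Z.shape
    xi_hat = (m[:, None] * Z).sum(axis=0) / m.sum()   # weighted ML mean
    R = Z - xi_hat
    Sigma_hat = (m[:, None] * R).T @ R / n            # weighted ML covariance
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    if sign <= 0:
        raise ValueError("segment too short for a non-singular covariance")
    # Closed form of the log-likelihood evaluated at (xi_hat, Sigma_hat).
    return (-0.5 * n * d * (np.log(2 * np.pi) + 1)
            + 0.5 * d * np.log(m).sum()
            - 0.5 * n * logdet)


def most_likely_change_point(Z, m, min_seg):
    """Scan tau (number of observations on the left) and return the best one."""
    n = Z.shape[0]
    taus = range(min_seg, n - min_seg + 1)
    ll = [segment_loglik(Z[:t], m[:t]) + segment_loglik(Z[t:], m[t:])
          for t in taus]
    return taus[int(np.argmax(ll))]


# Simulated data (an assumption for illustration): 80 "chapters", change after 50.
rng = np.random.default_rng(0)
n, d, true_tau = 80, 3, 50
m = rng.integers(100, 1000, size=n).astype(float)   # hypothetical word counts
xi_L = np.array([0.10, 0.20, 0.30])
xi_R = np.array([0.15, 0.20, 0.25])
noise = rng.normal(size=(n, d)) * np.sqrt(0.01 / m)[:, None]
Z = np.where(np.arange(n)[:, None] < true_tau, xi_L, xi_R) + noise

tau_hat = most_likely_change_point(Z, m, min_seg=10)
```

Requiring at least `min_seg` observations on each side keeps the estimated covariance matrices non-singular; with 9 dimensions the minimum would need to be larger than in this toy example.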

Our method B uses the log-likelihood function, profiled over the other parameters (those other than the change-point), in order to construct a confidence curve for $\tau$. Assuming independent data $y_i$ that follow a parametric model $f(y,\theta_L)$ to the left of the change-point and $f(y,\theta_R)$ to the right, we can form the profile log-likelihood function

$$\begin{aligned} \ell_{\mathrm{prof}}(\tau) &= \max\{\ell(\tau,\theta_L,\theta_R)\colon \text{all } \theta_L,\theta_R\} \\ &= \sum_{i\le\tau}\log f(y_i,\hat\theta_L(\tau)) +\sum_{i\ge\tau+1}\log f(y_i,\hat\theta_R(\tau)). \end{aligned}$$

The maximiser $\hat\tau$ of this function is a good estimate of the change-point. From this we form the deviance $D(\tau,y)=2\{\ell_{\mathrm{prof}}(\hat\tau)-\ell_{\mathrm{prof}}(\tau)\}$. The full confidence curve can be computed via stochastic simulation: $\mathrm{cc}(\tau)=\Pr_\tau\{D(\tau,Y) < D(\tau,y_{\mathrm{obs}})\}$. This curve provides valid inference for $\tau$ and displays the confidence intervals at all levels.
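The recipe can be sketched in Python as follows. This is a simplified illustration, not the implementation behind the analysis: it swaps the 9-dimensional multinormal model for a univariate normal with its own mean and variance on each side, uses simulated data, and approximates $\Pr_\tau$ by simulating from the model fitted with the change-point fixed at $\tau$. The function names, `min_seg` and `n_sim` are hypothetical choices.

```python
import numpy as np


def profile_loglik(y, min_seg=5):
    """Profile log-likelihood for each admissible tau (points on the left).

    Assumed model: y_i ~ N(mu_L, s_L^2) for i <= tau, N(mu_R, s_R^2) after.
    """
    n = len(y)
    taus = np.arange(min_seg, n - min_seg + 1)
    c1, c2 = np.cumsum(y), np.cumsum(y ** 2)      # running sums of y and y^2
    nL = taus.astype(float)
    nR = n - nL
    sL, qL = c1[taus - 1], c2[taus - 1]
    sR, qR = c1[-1] - sL, c2[-1] - qL
    varL = qL / nL - (sL / nL) ** 2               # ML variance, left segment
    varR = qR / nR - (sR / nR) ** 2               # ML variance, right segment
    # Maximised normal log-likelihood of a segment: -(n/2)(log(2*pi*var) + 1)
    ll = (-0.5 * nL * (np.log(2 * np.pi * varL) + 1)
          - 0.5 * nR * (np.log(2 * np.pi * varR) + 1))
    return taus, ll


def confidence_curve(y, n_sim=200, min_seg=5, seed=0):
    """Return (taus, observed deviance, simulated confidence curve)."""
    rng = np.random.default_rng(seed)
    taus, ll = profile_loglik(y, min_seg)
    dev_obs = 2 * (ll.max() - ll)                 # D(tau, y_obs)
    n = len(y)
    cc = np.empty(len(taus))
    for k, tau in enumerate(taus):
        left, right = y[:tau], y[tau:]
        # Simulate from the fit with the change-point fixed at tau.
        muL, sdL, muR, sdR = left.mean(), left.std(), right.mean(), right.std()
        below = 0
        for _ in range(n_sim):
            y_sim = np.concatenate([rng.normal(muL, sdL, tau),
                                    rng.normal(muR, sdR, n - tau)])
            _, ll_sim = profile_loglik(y_sim, min_seg)
            dev_sim = 2 * (ll_sim.max() - ll_sim[k])   # D(tau, Y)
            below += dev_sim < dev_obs[k]
        cc[k] = below / n_sim
    return taus, dev_obs, cc


# Simulated series (an assumption for illustration): mean shift after point 40.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 40), rng.normal(4.0, 1.0, 20)])
taus, dev_obs, cc = confidence_curve(y)
tau_hat = taus[np.argmin(dev_obs)]
```

By construction the deviance is zero at $\hat\tau$, so the curve touches zero there; the set of $\tau$ values with $\mathrm{cc}(\tau) \le 0.95$ can then be read off as an approximate 95% confidence set.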

The confidence curve above summarises the analysis. It displays the change-of-author estimate and the uncertainty around it. On the horizontal axis we see a subset of the chapters of the book. We see that the figure "points" towards chapter 371, meaning that the second author, de Galba, most likely took over the writing around this chapter (according to our method). At the same time, the figure conveys that other chapters are also candidates for the change-of-author point, especially the chapters directly before 371 and those around chapter 345. This displays the uncertainty in the analysis, which is actually quite small (remember that there are 487 chapters in total).

Naturally, this analysis depends on the choice of quantitative measure of literary style and on the model. Also note that it is probably a bit too simplistic to assume that de Galba took over the writing at exactly one point in the book. Presumably, both authors may have contributed to the section around the change-of-author point. It may even be the case that Martorell wrote some of the last chapters before he died (but discovering such a pattern would require other types of statistical methods). On the whole, we find our results convincing, and they are also consistent with the general theory that Martorell wrote the greater part of the book.

Finally, we need to point out that the change-point methods presented in the aforementioned paper are not restricted to analyses of medieval literature! They can be applied to a large number of problems where the goal is to investigate where a model changes from one state to another. These kinds of changes can be called break-points, tipping points, regime shifts or structural changes, and appear in diverse applications, for example when studying British mining disasters, when analysing the number of skiing days per season near Oslo and when investigating the liver quality of cod and its influencing covariates (read about all this in the paper!).

By Nils Lid Hjort, Céline Cunen
Published Jan. 3, 2017 2:43 PM - Last modified Jan. 3, 2017 3:46 PM