NB: This paper was presented at Digital Humanities and Islamic & Middle Eastern Studies, Brown University, Providence, RI (October 24-25, 2013). A video recording of the presentation is available at www.islamichumanities.org > Day One (presentation at 2:48:00; Q&A at 3:51:30). The entire paper is also available as a PDF; comments are welcome at WorkingPapers.IslamicHistoryCommons.org
All models are false, but some are useful
George E. P. Box
The advent of digital humanities has brought the notion of “big data” into the purview of humanistic inquiry. Humanists now have access to huge corpora that open research possibilities that were unthinkable a decade or two ago. However, working with corpora requires a rather different approach, one more characteristic of the sciences than of the humanities. Namely, one has to be transparent and explicit about how data are extracted and how they are analyzed. Text-mining techniques rely on explicit algorithms because explicitness helps to trace mistakes, correct them and, ultimately, improve results. Analytical procedures for studying extracted data rest on explicit algorithms for the same reason. As a way of constructing algorithms, modeling is part and parcel of developing complex computational procedures.
Working with big data also requires a different kind of modeling. Opting for breadth of data, we have to give up richness of detail. Close reading, to which humanists are most accustomed, becomes impossible: working with big data, one cannot maintain the nuanced complexity of detail that has become the hallmark of close reading as an approach. Instead of relying on complex textual evidence and reading between the lines, one has to work with relatively simple textual markers (essentially, words or simple phrases) that are treated as indicators of large trends. Yet it is through such analysis that we can look into long-term and large-scale processes that will always remain beyond the scope of close reading. The literary historian Franco Moretti dubbed such an approach “distant reading,” explaining that “distance” is not an obstacle but a specific form of knowledge. By focusing on fewer elements, we gain a sharper sense of their overall interconnection and can distinguish shapes, relations, and structures. Most importantly, we can trace small changes over long periods of time.
Modeling is an important part of this approach. With models we simplify reality down to a limited number of factors, and through the analysis of these factors we hope to gain insights into complex processes. This simplification is the reason why all models are false. Yet models are a valuable and powerful tool. They pave the way to improving our understanding of the world. Unlike theories, models are experimental and driven by data. Good models offer invaluable glimpses into the subjects of our inquiry. With them we can explore, explain, and project. With them we can see the big picture. That is why some models are useful.
What follows is an attempt to model Islamic élites based on the data from al-Dhahabī’s (d. 748/1348 CE) Taʾrīkh al-islām in order to explore major social transformations that the Muslim community underwent in the course of almost seven centuries of its history. The main types of data used in the model are dates, toponyms, linguistic formulae (or wording patterns), synsets (lists of words that point to a specific concept or entity), and, most importantly, “descriptive names” (sing. nisba).
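To make these data types concrete, here is a minimal sketch in Python of how a single biography and a synset lookup might be represented. It is an illustration under stated assumptions, not the procedure actually used in this study: the `Biography` record, the `SYNSETS` table, the function `concepts_in`, and the simplified transliterations are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical synsets: each concept maps to a list of words that point to it.
# The (simplified, unvocalized) transliterations are illustrative placeholders,
# not the word lists used in the study.
SYNSETS: Dict[str, List[str]] = {
    "jurist": ["faqih", "mufti"],
    "preacher": ["khatib", "waiz"],
}

@dataclass
class Biography:
    """One biographical entry reduced to the data types named in the text."""
    death_year_ah: int                                   # date (hijri year)
    toponyms: List[str] = field(default_factory=list)    # places mentioned
    nisbas: List[str] = field(default_factory=list)      # descriptive names
    text: str = ""                                       # text of the entry

def concepts_in(text: str, synsets: Dict[str, List[str]] = SYNSETS) -> List[str]:
    """Return the concepts whose synset shares at least one word with the text."""
    tokens = set(text.lower().split())
    return sorted(c for c, words in synsets.items() if tokens & set(words))

# Made-up entry, for illustration only.
bio = Biography(death_year_ah=450, nisbas=["saffar"], text="kana faqih wa khatib")
found = concepts_in(bio.text)
```

Matching on whitespace-separated tokens is the simplest possible choice; real Arabic text would require normalization and handling of attached particles, which this sketch deliberately leaves out.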
A detailed discussion of the main assumptions regarding these types of data, as well as of general issues relevant to the study of Arabic biographical collections, can be found elsewhere. Here it is most important to dwell on our assumptions regarding “descriptive names,” which some scholars regard as the most valuable kind of data that literary sources offer to the social historian of the Islamic world, and others as highly problematic. The major problem with nisbas is that it is not always clear what they stand for. For example, if an individual is described in a biographical collection as ṣaffār, does this actually mean that he was involved in “copper smithing”? When our subject is just one particular individual, it is not so difficult to establish the more or less exact meaning of this descriptive name by cross-examining biographies of this individual in other biographical collections. This is particularly easy now that dozens of electronic texts of biographical collections are just a few mouse-clicks away. However, such an approach becomes problematic when this rather time-consuming procedure has to be repeated for dozens of individuals. It becomes particularly difficult if our goal is to study a biographical collection in its entirety, since Arabic biographical collections often contain thousands of biographies and most biographies offer multiple descriptive names for the same individual. Beyond a certain threshold it becomes impossible to apply this approach at all. Our source, Taʾrīkh al-islām, is well beyond this threshold. In the analysis that follows, we will deal with a dataset of almost 70,000 nisbas (about 700 of them unique) that represent about 26,000 individuals over the period 41–700 AH (661–1301 CE). Working with such a dataset, one cannot possibly know the exact meaning of each and every nisba.
At the same time, we do not have any solid foundation to argue that descriptive names are to be treated in a particular manner, or to be discarded altogether. Yet such a dataset is too rare a research opportunity to ignore simply because we are not entirely sure what all these data mean. This is where modeling offers an optimal solution: we need to start with assumptions and be upfront about them. In what follows, descriptive names will be taken at face value, if only because this is the most logical starting point.
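The face-value assumption can be stated as a short Python sketch: every nisba attached to an individual is counted exactly as given, with no attempt to verify what it means, and the counts are grouped by period so that change over time becomes visible. This is only an illustration under stated assumptions. The 50-year bin width, the function names, and the sample records are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict

def period_start(death_year_ah, width=50):
    """Return the first year of the `width`-year bin containing the death year."""
    return (death_year_ah // width) * width

def tally_nisbas(records, width=50):
    """Count each nisba per period, taking every nisba at face value."""
    counts = defaultdict(Counter)
    for death_year_ah, nisbas in records:
        counts[period_start(death_year_ah, width)].update(nisbas)
    return counts

# Made-up records, for illustration only: (death year AH, list of nisbas).
# Transliterations are simplified to plain ASCII.
records = [
    (212, ["saffar", "baghdadi"]),
    (240, ["baghdadi"]),
    (655, ["dimashqi", "saffar"]),
]
counts = tally_nisbas(records)
```

Because the assumption is explicit in the code (a nisba is counted whenever it occurs), it can later be replaced by a stricter rule without changing the rest of the analysis.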