my dossier

[f-s d] Cetus

2016-06-16T00:00:00+00:00

Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.

Quoting Liz Losh, Jacqueline Wernimont tweeted that behind every visualization is a spreadsheet.

@lizlosh “behind every visualization…shhhh…is a spreadsheet” #femdh #dhsi2016

— Jacqueline Wernimont (@profwernimont) June 14, 2016

But what, I wondered, is behind every spreadsheet?

But what’s behind every spreadsheet? [cue theme for “Full-Stack Dev”, hot new NPR series on the secret life of data] https://t.co/2HCLtySIwL

— Scott B. Weingart (@scott_bot) June 14, 2016

Space whales.

Okay, maybe space whales aren’t behind every spreadsheet, but they’re behind this one, dated 1662, notable for the gigantic nail it hammered into the coffin of our belief that heaven above is perfect and unchanging. The following post is the first in my new series full-stack dev (f-s d), where I explore the secret life of data.¹

Hevelius. Mercurius in Sole visus (1662).

The Princess Bride teaches us a good story involves “fencing, fighting, torture, revenge, giants, monsters, chases, escapes, true love, miracles”. In this story, Cetus, three of those play a prominent role: (red) giants, (sea) monsters, and (cosmic) miracles. Also Greek myths, interstellar explosions, beer-brewing astronomers, meticulous archivists, and top-secret digitization facilities. All together, they reveal how technologies, people, and stars aligned to stick this 350-year-old spreadsheet in your browser today.

The Sea

When Aethiopian queen Cassiopeia claimed herself more beautiful than all the sea nymphs, Poseidon was, let’s say, less than pleased. Mildly miffed. He maybe sent a sea monster named Cetus to destroy Aethiopia.

Because obviously the best way to stop a flood is to drown a princess, Queen Cassiopeia chained her daughter to the rocks as a sacrifice to Cetus. Thankfully the hero Perseus just happened to be passing through Aethiopia, returning home after beheading Medusa, that snake-haired woman whose eyes turned living creatures to stone. Perseus (depicted below as the world’s most boring 2-ball juggler) revealed Medusa’s severed head to Cetus, turning the sea monster to stone and saving the princess. And then they got married because traditional gender roles I guess?

Corinthian vase depicting Perseus, Andromeda and Ketos. [via]

Cetaceans, you may recall from grade school, are those giant carnivorous sea-mammals that Captain Ahab warned you about. Cetaceans, from Cetus. You may also remember we have a thing for naming star constellations and dividing the sky up into sections (see the Zodiac), and that we have a long history of comparing the sky to the ocean (see Carl Sagan or Star Trek IV).

It should come as no surprise, then, that we’ve designated a whole section of space as ‘The Sea’, home of Cetus (the whale), Aquarius (the God) and Eridanus (the water pouring from Aquarius’ vase, source of river floods), Pisces (two fish tied together by a rope, which makes total sense I promise), Delphinus (the dolphin), and Capricornus (the goat-fish. Listen, I didn’t make these up, okay?).

Jamieson’s Celestial Atlas, Plate 21 (1822). [via]

Jamieson’s Celestial Atlas, Plate 23 (1822). [via]

Ptolemy listed most of these constellations in his Almagest (ca. 150 A.D.), including Cetus, along with descriptions of over a thousand stars. Ptolemy’s model, with Earth at the center and the constellations just past Saturn, set the course of cosmology for over a thousand years.

Ptolemy’s Cosmos [by Robert A. Hatch]

In this cosmos, reigning in Western Europe for centuries past Copernicus’ death in 1543, the stars were fixed and motionless. There was no vacuum of space; every planet was embedded in a shell made of aether or quintessence (quint-essence, the fifth element), and each shell sat atop the next until reaching the celestial sphere. This last sphere held the stars, each one fixed to it as with a pushpin. Of course, all of it revolved around the earth.

The domain of heavenly spheres was assumed perfect in all sorts of ways. They slid across each other without friction, and the planets and stars were perfect spheres which could not change and were unmarred by inconsistencies. One reason it was so difficult for even “great thinkers” to believe the earth orbited the sun, rather than vice-versa, was because such a system would be at complete odds with how people knew physics to work. It would break gravity, break motion, and break the outer perfection of the cosmos, which was essential (…heh)² to our notions of, well, everything.

Which is why, when astronomers with their telescopes and their spreadsheets started systematically observing imperfections in planets and stars, lots of people didn’t believe them—even other astronomers. Over the course of centuries, though, these imperfections became impossible to ignore, and helped launch the earth in rotation ‘round the sun.

This is the story of one such imperfection.

A Star is Born (and then dies)

Around 1296 A.D., over the course of half a year, a red giant star some 2 quadrillion miles away grew from 300 to 400 times the size of our sun. Over the next half year, the star shrank back down to its previous size. Light from the star took 300 years to reach earth, eventually striking the retina of German pastor David Fabricius. It was very early Tuesday morning on August 13, 1596, and Pastor Fabricius was looking for Jupiter.³

At that time of year, Jupiter would have been near the constellation Cetus (remember our sea monster?), but Fabricius noticed a nearby bright star (labeled ‘Mira’ in the below figure) which he did not remember from Ptolemy or Tycho Brahe’s star charts.

Mira Ceti and Jupiter. [via]

Spotting an unrecognized star wasn’t unusual, but one so bright in so common a constellation was certainly worthy of note. He wrote down some observations of the star throughout September and October, after which it seemed to have disappeared as suddenly as it appeared. The disappearance prompted Fabricius to write a letter about it to famed astronomer Tycho Brahe, who had described a similar appearing-then-disappearing star between 1572 and 1574. Brahe jotted Fabricius’ observations down in his journal. This sort of behavior, after all, was a bit shocking for a supposedly fixed and unchanging celestial sphere.

More shocking, however, was what happened 13 years later, on February 15, 1609. Once again searching for Jupiter, pastor Fabricius spotted another new star in the same spot as the last one. Tycho Brahe having recently died, Fabricius wrote a letter to his astronomical successor, Johannes Kepler, describing the miracle. This was unprecedented. No star had ever vanished and returned, and nobody knew what to make of it.

Unfortunately for Fabricius, nobody did make anything of it. His observations were either ignored or, occasionally, dismissed as an error. To add injury to insult, a local goose thief killed Fabricius with a shovel blow, thus ending his place in this star’s story, among other stories.

Mira Ceti

Three decades passed. On the winter solstice, 1638, Johannes Phocylides Holwarda prepared to view a lunar eclipse. He reported with excitement the star’s appearance and, by August 1639, its disappearance. The new star, Holwarda claimed, should be considered of the same class as Brahe, Kepler, and Fabricius’ new stars. As much a surprise to him as Fabricius, Holwarda saw the star again on November 7, 1639. Although he was not aware of it, his new star was the same as the one Fabricius spotted 30 years prior.

Two more decades passed before the new star in the neck of Cetus would be systematically sought and observed, this time by Johannes Hevelius: local politician, astronomer, and brewer of fine beers. By that time many had seen the star, but it was difficult to know whether it was the same celestial body, or even what was going on.

Hevelius brought everything together. He found recorded observations from Holwarda, Fabricius, and others, from today’s Netherlands to Germany to Poland, and realized these disparate observations were of the same star. Befitting its puzzling and seemingly miraculous nature, Hevelius dubbed the star Mira (miraculous) Ceti. The image below, from Hevelius’ Firmamentum Sobiescianum sive Uranographia (1687), depicts Mira Ceti as the bright star in the sea monster’s neck.

Hevelius. Firmamentum Sobiescianum sive Uranographia (1687).

Going further, from 1659 to 1683, Hevelius observed Mira Ceti in a more consistent fashion than any before. There were eleven recorded observations in the 65 years between Fabricius’ first sighting of the star and Hevelius’ undertaking; in the following three years, he recorded 75 more such observations. Oddly, while Hevelius was a remarkably meticulous observer, he insisted the star was inherently unpredictable, with no regularity in its reappearances or variable brightness.

Beginning shortly after Hevelius, the astronomer Ismaël Boulliau also undertook a thirty-year search for Mira Ceti. He even published a prediction that the star would go through its vanishing cycle every 332 days, which turned out to be incredibly accurate. As today’s astronomers note, Mira Ceti’s brightness increases and decreases by several orders of magnitude every 331 days, caused by an interplay between radiation pressure and gravity in the star’s gaseous exterior.

Mira Ceti composite taken by NASA’s Galaxy Evolution Explorer. [via]

While of course Boulliau didn’t arrive at today’s explanation for Mira’s variability, his solution did require a rethinking of the fixity of stars, and eventually contributed to the notion that maybe the same physical laws that apply on Earth also rule the sun and stars.

Spreadsheet Errors

But we’re not here to talk about Boulliau, or Mira Ceti. We’re here to talk about this spreadsheet:

Hevelius. Mercurius in Sole visus (1662).

This snippet represents Hevelius’ attempt to systematically collect prior observations of Mira Ceti. Unreasonably meticulous readers of this post may note an inconsistency: I wrote that Johannes Phocylides Holwarda observed Mira Ceti on November 7th, 1639, yet Hevelius here shows Holwarda observing the star on December 7th, 1639, an entire month later. The little notes on the side are basically the observers saying: “wtf this star keeps reappearing???”

This mistake was not a simple printer’s error. It reappeared in Hevelius’ printed books three times: 1662, 1668, and 1685. This is an early example of what Raymond Panko and others call spreadsheet errors, which appear in nearly 90% of 21st-century spreadsheets. Hand-entry is difficult, and mistakes are bound to happen. In this case, a game of telephone also played a part: Hevelius may have pulled some observations not directly from the original astronomers, but from the notes of Tycho Brahe and Johannes Kepler, to which he had access.

Unfortunately, with so few observations, and many of the early ones so sloppy, mistakes compound themselves. It’s difficult to predict a variable star’s periodicity when you don’t have the right dates of observation, which may have contributed to Hevelius’ continued insistence that Mira Ceti kept no regular schedule. The other contributing factor, of course, is that Hevelius worked without a telescope and under cloudy skies, and stars are hard to measure under even the best circumstances.
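To get a rough feel for how much a one-month slip matters, here is a back-of-the-envelope sketch. It uses only the two dates mentioned above and the modern 331-day figure; treating both dates as proleptic Gregorian dates in Python's `datetime` is an assumption of convenience, and the variable names are made up for illustration.

```python
from datetime import date

# Modern period quoted earlier in the post, in days.
MIRA_PERIOD_DAYS = 331

# Holwarda's sighting as given in this post, and the date printed in Hevelius' table.
observed = date(1639, 11, 7)
recorded = date(1639, 12, 7)

# How far off is the recorded date, in days and as a fraction of one cycle?
error_days = (recorded - observed).days
phase_error = error_days / MIRA_PERIOD_DAYS

print(f"date error: {error_days} days")
print(f"fraction of one brightness cycle: {phase_error:.1%}")
```

A slip of that size moves the sighting by roughly a tenth of a cycle, which is a lot when there are only a handful of early observations to work from.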

To Be Continued

Here ends the first half of Cetus. The second half will cover how Hevelius’ book was preserved, the labor behind its digitization, and a bit about the technologies involved in creating the image you see.

Early modern astronomy is a particularly good pre-digital subject for full-stack dev (f-s d), since it required vast international correspondence networks and distributed labor in order to succeed. Hevelius could not have created this table, compiled from the observations of several others, without access to cutting-edge astronomical instruments and the contemporary scholarly network.

You may ask why I included that whole section on Greek myths and Ptolemy’s constellations. Would as many early modern astronomers have noticed Mira Ceti had it not sat in the center of a familiar constellation, I wonder?

I promised this series will be about the secret life of data, answering the question of what’s behind a spreadsheet. Cetus is only the first story (well, second, I guess), but the idea is to upturn the iceberg underlying seemingly mundane datasets to reveal the complicated stories of their creation and usage. Stay tuned for future installments.

Notes

¹ I’m retroactively adding my blog rant about data underlying an equality visualization to the f-s d series.
² This pun is only for historians of science.
³ Most of the historiography in this and the following section is summarized from Robert A. Hatch’s “Discovering Mira Ceti: Celestial Change and Cosmic Continuity”.

Who sits in the 41st chair?

2016-03-25T00:00:00+00:00

By scott b. weingart · 2016-03-25

tl;dr Rich-get-richer academic prestige in a scarce job market makes meritocracy impossible. Why some things get popular and others don’t. Also agent-based simulations.

Slightly longer tl;dr This post is about why academia isn’t a meritocracy, at no intentional fault of those in power who try to make it one. None of the presented ideas are novel on their own, but I do intend this as a novel conceptual contribution in its connection of disparate threads. Especially, I suggest the predictability of research success in a scarce academic economy as a theoretical framework for exploring successes and failures in the history of science.

But mostly I just beat a “musical chairs” metaphor to death.

Positive Feedback

To the victor go the spoils, and to the spoiled go the victories. Think about it: the Yankees; Alexander the Great; Stanford University. Why do the Yankees have twice as many World Series appearances as their nearest competitors, how was Alex’s empire so fucking vast, and why does Stanford get all the cool grants?

The rich get richer. Enough World Series victories, and the Yankees get the reputation and funding to entice the best players. Ol’ Allie-G inherited an amazing army, was taught by Aristotle, and pretty much every place he conquered increased his military’s numbers. Stanford’s known for amazing tech innovation, so they get the funding, which means they can afford even more innovation, which means even more people think they’re worthy of funding, and so on down the line until Stanford and its neighbors (Google, Apple, etc.) destroy the local real estate market and then accidentally blow up the world.

Alexander’s Empire [via]

Okay, maybe I exaggerated that last bit.

Point is, power begets power. Scientists call this a positive feedback loop: when a thing’s size is exactly what makes it grow larger.

You’ve heard it firsthand when a microphoned singer walks too close to her speaker. First the mic picks up what’s already coming out of the speaker. The mic, doing its job, sends what it hears to an amplifier, sending an even louder version to the very same speaker. The speaker replays a louder version of what it just produced, which is once again received by the microphone, until sound feeds back onto itself enough times to produce the ear-shattering squeal fans of live music have come to dread. This is a positive feedback loop.

Feedback loop. [via]

Positive feedback loops are everywhere. They’re why the universe counts logarithmically rather than linearly, or why income inequality is so common in free market economies. Left to their own devices, the rich tend to get richer, since it’s easier to make money when you’ve already got some.
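Here is a toy sketch of the difference, with invented numbers throughout: two parties start 100 units apart, and we compare a world where each round's gain is proportional to current size (feedback) against one where everyone gains the same fixed amount. The function `grow` and its parameters are hypothetical, chosen only for illustration.

```python
def grow(start, rounds, rate=0.1, fixed_gain=10.0, feedback=True):
    """Return the size after each round, starting from `start`.

    With feedback, each round's gain is proportional to the current size.
    Without it, each round adds the same fixed amount regardless of size.
    """
    size = float(start)
    history = [size]
    for _ in range(rounds):
        size += rate * size if feedback else fixed_gain
        history.append(size)
    return history


# Rich-get-richer world: gains scale with what you already have.
rich_small = grow(100, 50)
rich_large = grow(200, 50)

# Flat world: everyone gains the same amount each round.
flat_small = grow(100, 50, feedback=False)
flat_large = grow(200, 50, feedback=False)

print("starting gap:", rich_large[0] - rich_small[0])
print("gap after 50 rounds, flat gains:", flat_large[-1] - flat_small[-1])
print("gap after 50 rounds, feedback:", round(rich_large[-1] - rich_small[-1]))
```

With flat gains the gap never moves. With feedback the same head start turns into a gap more than a hundred times larger.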

Science and academia are equally susceptible to positive feedback loops. Top scientists, the most well-funded research institutes, and world-famous research all got to where they are, in part, because of something called the Matthew Effect.

Matthew Effect

The Matthew Effect isn’t the reality TV show it sounds like.

For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken even that which he hath. —Matthew 25:29, King James Bible.

It’s the Biblical idea that the rich get richer, and it’s become a popular party trick among sociologists (yes, sociologists go to parties) describing how society works. In academia, the phrase is brought up alongside evidence that shows previous grant-recipients are more likely to receive new grants than their peers, and the more money a researcher has been awarded, the more they’re likely to get going forward.

The Matthew Effect is also employed metaphorically, when it comes to citations. He who gets some citations will accrue more; she who has the most citations will accrue them exponentially faster. There are many correct explanations, but the simplest one will do here:

If Susan’s article on the danger of velociraptors is cited by 15 other articles, I am more likely to find it and cite her than another article on velociraptors containing the same information, that has never been cited. That’s because when I’m reading research, I look at who’s being cited. The more Susan is cited, the more likely I’ll eventually come across her article and cite it myself, which in turn increases the likelihood that much more that someone else will find her article through my own citations. Continue ad nauseam.
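A minimal sketch of that citation snowball, under made-up assumptions: a fixed pool of 20 equally good articles, 5,000 citations handed out one at a time, and a chance of being cited proportional to citations so far plus one. None of these numbers come from real data, and `simulate_citations` is a hypothetical helper, not anyone's published model.

```python
import random


def simulate_citations(n_articles=20, n_citations=5000, preferential=True, seed=0):
    """Hand out citations one at a time and return counts, largest first.

    preferential=True: an article's chance of being cited next is proportional
    to (citations so far + 1), so already-cited work is easier to find.
    preferential=False: every article is equally likely to be cited.
    """
    rng = random.Random(seed)
    counts = [0] * n_articles
    for _ in range(n_citations):
        if preferential:
            weights = [c + 1 for c in counts]
        else:
            weights = [1] * n_articles
        cited = rng.choices(range(n_articles), weights=weights, k=1)[0]
        counts[cited] += 1
    return sorted(counts, reverse=True)


rich = simulate_citations(preferential=True)
flat = simulate_citations(preferential=False)

print("most-cited article, rich-get-richer:", rich[0])
print("most-cited article, equal chances:  ", flat[0])
print("least-cited article, rich-get-richer:", rich[-1])
print("least-cited article, equal chances:  ", flat[-1])
```

Every article in this toy world is identical in quality, so any spread between the top and bottom of the rich-get-richer run comes from early luck being amplified.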

Some of you are thinking this is stupid. Maybe it’s trivially correct, but missing the bigger picture: quality. What if Susan’s velociraptor research is simply better than the competing research, and that’s why it’s getting cited more?

Yes, that’s also an issue. Noticeably awful research simply won’t get much traction.¹ Let’s disqualify it from the citation game. The point is there is lots of great research out there, waiting to be read and built upon, and its quality isn’t the sole predictor of its eventual citation success.

In fact, quality is a mostly-necessary but completely insufficient indicator of research success. Superstar popularity of research depends much more on the citation effects I mentioned above – more citations beget even more. Previous success is the best predictor of future success, mostly independent of the quality of research being shared.

Example of positive feedback loops pushing some articles to citation stardom. [via]

This is all pretty hand-wavy. How do we know success is more important than quality in predicting success? Uh, basically because of Napster.

Popular Music

If VH1 were to produce a retrospective on the first decade of the 21st century, perhaps its two biggest subjects would be illegal music sharing and VH1’s I Love the 19xx… TV series. Napster came and went, followed by LimeWire, eDonkey2000, AudioGalaxy, and other services sued by Metallica. Well-known early internet memes like Hamster Dance and All Your Base Are Belong To Us spread through the web like socially transmitted diseases, and researchers found this the perfect opportunity to explore how popularity worked. Experimentally.

In 2006, a group of Columbia University social scientists designed a clever experiment to test why some songs became popular and others did not, relying on the public interest in online music sharing. They created a music downloading site which gathered 14,341 users, each one to become a participant in their social experiment.

The cleverness arose out of their experimental design, which allowed them to get past the pesky problem of history only ever happening once. It’s usually hard to learn why something became popular, because you don’t know what aspects of its popularity were simply random chance, and what aspects were genuine quality. If you could, say, just rerun the 1960s, changing a few small aspects here or there, would the Beatles still have been as successful? We can’t know, because the 1960s are pretty much stuck having happened as they did, and there’s not much we can do to change it.²

But this music-sharing site could rerun history, or at least it could run a few histories simultaneously. When they signed up, each of the site’s 14,341 users was randomly sorted into one of several groups, and their group number determined how they were presented music. The musical variety was intentionally obscure, so users wouldn’t have heard the bands before.

A user from the first group, upon logging in, would be shown songs in random order, and was given the option to listen to a song, rate it 1-5, and download it. Users from group #2, instead, were shown the songs ranked in order of their popularity among other members of group #2. Group #3 users were shown a similar rank-order of popular songs, but this time determined by the song’s popularity within group #3. So too for groups #4-#9. Every user could listen to, rate, and download music.

Essentially, the researchers put the participants into 9 different self-contained petri dishes, and waited to see which music would become most popular in each. Group #1 was their control: its rankings and download counts reflected members judging music on its quality, without access to social influence. Members of groups #2-#9 could be influenced by what music was popular with their peers within the group. The same songs circulated in each petri dish, and each petri dish presented its own version of history.

Music sharing site from Columbia study.

No superstar songs emerged out of the control group. Positive feedback loops weren’t built into the system, since popularity couldn’t beget more popularity if nobody saw what their peers were listening to. The other 8 musical petri dishes told a different story, however. Superstars emerged in each, but each group’s population of popular music was very different. A song’s popularity in each group was slightly related to its quality (as judged by ranking in the control group), but mostly it was social-influence-produced chaos. The authors put it this way:

In general, the “best” songs never do very badly, and the “worst” songs never do extremely well, but almost any other result is possible. —Salganik, Dodds, & Watts, 2006

These results became even more pronounced when the researchers increased the visibility of social popularity in the system. The rich got even richer still. A lot of it has to do with timing. In each group, the first few good songs to become popular are the ones that eventually do the best, simply by an accident of circumstance. The first few popular songs appear at the top of the list, for others to see, so they in turn become even more popular, and so on ad infinitum. The authors go on:

experts fail to predict success not because they are incompetent judges or misinformed about the preferences of others, but because when individual decisions are subject to social influence, markets do not simply aggregate pre-existing individual preferences.

In short, quality is a necessary but insufficient criterion for ultimate success. Social influence, timing, randomness, and other non-qualitative features of music are what turn a good piece of music into an off-the-charts hit.
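The petri-dish setup can be roughly imitated in a few lines. To be clear, this is a toy model and not the researchers' method or data: the 48 songs, the hidden quality scores, the 2,000 listeners per world, and the rule that social worlds weight each song by quality times (downloads + 1) are all assumptions invented for illustration, as is the helper `run_world`.

```python
import random


def run_world(quality, n_listeners, social, seed):
    """Simulate one self-contained world; return download counts per song.

    Each listener downloads exactly one song. In the control world the choice
    depends only on hidden quality. In a social world, songs that are already
    popular in that world are proportionally more likely to be picked.
    """
    rng = random.Random(seed)
    downloads = [0] * len(quality)
    for _ in range(n_listeners):
        if social:
            weights = [q * (d + 1) for q, d in zip(quality, downloads)]
        else:
            weights = quality
        song = rng.choices(range(len(quality)), weights=weights, k=1)[0]
        downloads[song] += 1
    return downloads


# The same songs, with the same hidden quality, circulate in every world.
quality_rng = random.Random(42)
quality = [quality_rng.uniform(0.5, 1.5) for _ in range(48)]

control = run_world(quality, 2000, social=False, seed=1)
social_worlds = [run_world(quality, 2000, social=True, seed=s) for s in range(2, 10)]

print("top song downloads, control world:", max(control))
print("top song downloads, social worlds:", [max(w) for w in social_worlds])
print("winning song in each social world:", [w.index(max(w)) for w in social_worlds])
```

Rerunning with different seeds is the cheap equivalent of rerunning history: the songs never change, only who happened to download what first.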

Wait what about science?

Compare this to what makes a “well-respected” scientist: it ain’t all citations and social popularity, but they play a huge role. And as I described above, simply out of exposure-fueled-propagation, the more citations someone accrues, the more citations they are likely to accrue, until we get a situation like the Yankees (40 World Series appearances, versus 20 appearances by the Giants) on our hands. Superstars are born, who are miles beyond the majority of working researchers in terms of grants, awards, citations, etc. Social scientists call this preferential attachment.

Which is fine, I guess. Who cares if scientific popularity is so skewed as long as good research is happening? Even if we take the Columbia social music experiment at face-value, an exact analog for scientific success, we know that the most successful are always good scientists, and the least successful are always bad ones, so what does it matter if variability within the ranks of the successful is so detached from quality?

Except, as anyone studying their #OccupyWallstreet knows, it ain’t that simple in a scarce economy. When the rich get richer, that money’s gotta come from somewhere. Like everything else (cf. the law of conservation of mass), academia is a (mostly) zero-sum game, and to the victors go the spoils. To the losers? Meh.

So let’s talk scarcity.

The 41st Chair

The same guy who introduced the concept of the Matthew Effect to scientific grants and citations, Robert K. Merton (…of Columbia University), also brought up “the 41st chair” in the same 1968 article.

Merton’s pretty great, so I’ll let him do the talking:

In science as in other institutional realms, a special problem in the workings of the reward system turns up when individuals or organizations take on the job of gauging and suitably rewarding lofty performance on behalf of a large community. Thus, that ultimate accolade in 20th-century science, the Nobel prize, is often assumed to mark off its recipients from all the other scientists of the time. Yet this assumption is at odds with the well-known fact that a good number of scientists who have not received the prize and will not receive it have contributed as much to the advancement of science as some of the recipients, or more.

This can be described as the phenomenon of “the 41st chair.” The derivation of this tag is clear enough. The French Academy, it will be remembered, decided early that only a cohort of 40 could qualify as members and so emerge as immortals. This limitation of numbers made inevitable, of course, the exclusion through the centuries of many talented individuals who have won their own immortality. The familiar list of occupants of this 41st chair includes Descartes, Pascal, Moliere, Bayle, Rousseau, Saint-Simon, Diderot, Stendahl, Flaubert, Zola, and Proust

[…]

But in greater part, the phenomenon of the 41st chair is an artifact of having a fixed number of places available at the summit of recognition. Moreover, when a particular generation is rich in achievements of a high order, it follows from the rule of fixed numbers that some men whose accomplishments rank as high as those actually given the award will be excluded from the honorific ranks. Indeed, their accomplishments sometimes far outrank those which, in a time of less creativity, proved enough to qualify men for this high order of recognition.

The Nobel prize retains its luster because errors of the first kind—where scientific work of dubious or inferior worth has been mistakenly honored—are uncommonly few. Yet limitations of the second kind cannot be avoided. The small number of awards means that, particularly in times of great scientific advance, there will be many occupants of the 41st chair (and, since the terms governing the award of the prize do not provide for posthumous recognition, permanent occupants of that chair).

Basically, the French Academy allowed only 40 members (chairs) at a time. We can be reasonably certain those members were pretty great, but we can be just as certain that equally great (or greater) women existed who simply never got the opportunity to participate because none of the 40 members died in time.

These good-enough-to-be-members-but-weren’t were said to occupy the French Academy’s 41st chair, an inevitable outcome of a scarce economy (40 chairs) when the number of potential beneficiaries of this economy far outnumbers the goods available (40). The population occupying the 41st chair is huge, and growing, since the same number of chairs has existed since 1634, but the population of France has quadrupled in the intervening four centuries.

Returning to our question of “so what if rich-get-richer doesn’t stick the best people at the top, since at least we can assume the people at the top are all pretty good anyway?”, scarcity of chairs is the so-what.

Since faculty jobs are stagnating compared to adjunct work, yet new PhDs are being granted faster than new jobs become available, we are presented with the much-discussed crisis in higher education. Don’t worry, we’re told, academia is a meritocracy. With so few jobs, only the cream of the crop will get them. The best work will still be done, even in these hard times.

Recent Science PhD growth in the U.S. [via]

Unfortunately, as the Columbia social music study (among many other studies) showed, true meritocracies are impossible in complex social systems. Anyone who plays the academic game knows this already, and many are quick to point it out when they see people in much better jobs doing incredibly stupid things. What those who point out the falsity of meritocracy often get wrong, however, is intention: the idea that there is no meritocracy because those in power talk the meritocracy talk, but don’t then walk the walk. I’ll talk a bit later about how, even if everyone is above board in trying to push the best people forward, occupants of the 41st chair will still often wind up being more deserving than those sitting in chairs 1-40. But more on that later.

For now, let’s start building a metaphor that we’ll eventually over-extend well beyond its usefulness. Remember that kids’ game Musical Chairs, where everyone’s dancing around a bunch of chairs while the music is playing, but as soon as the music stops everyone’s got to find a chair and sit down? The catch, of course, is that there are fewer chairs than people, so someone always loses when the music stops.

The academic meritocracy works a bit like this. It is meritocratic, to a point: you can’t even play the game without proving some worth. The price of admission is a Ph.D. (which, granted, is more an endurance test than an intelligence test, but academic success ain’t all smarts, y’know?), a research area at least a few people find interesting and in which they believe you’d be able to do good work, etc. It’s a pretty low meritocratic bar, since it describes 50,000 people who graduated in the U.S. in 2008 alone, but it’s a bar nonetheless. And it’s your competition in Academic Musical Chairs.

Academic Musical Chairs

Time to invent a game! It’s called Academic Musical Chairs, the game where everything’s made up and the points don’t matter. It’s like Regular Musical Chairs, but more complicated (see Fig. 1). Also the game is fixed.

Figure 1: Academic Musical Chairs

See those 40 chairs in the middle green zone? People sitting in them are the winners. Once they’re seated they have what we call in the game “tenure”, and they don’t get up until they die or write something controversial on Twitter. Everyone bustling around them, the active players, are vying for seats while they wait for someone to die; they occupy the yellow zone we call “the 41st chair”. Those beyond that, in the red zone, can’t yet (or may never) afford the price of game admission; they don’t have a Ph.D., they already said something controversial on Twitter, etc. The unwashed masses, you know?

As the music plays, everyone in the 41st chair is walking around in a circle waiting for someone to die and the music to stop. When that happens, everyone rushes to the empty seat. A few invariably reach it simultaneously, until one out-muscles the others and sits down. The sitting winner gets tenure. The music starts again, and the line continues to orbit the circle.

If a player spends too long orbiting in the 41st chair, he is forced to resign. If a player runs out of money while orbiting, she is forced to resign. Other factors may force a player to resign, but they will never appear in the rulebook and will always be a surprise.

Now, some players are more talented than others, whether naturally or through intense training. The game calls this “academic merit”, but it translates here to increased speed and strength, which helps some players reach the empty chair when the music stops, even if they’re a bit further away. The strength certainly helps when competing with others who reach the chair at the same time.

A careful look at Figure 1 will reveal one other way players might increase their chances of success when the music stops. The 41st chair has certain internal shells, or rings, which act a bit like that fake model of an atom everyone learned in high-school chemistry. Players, of course, are the electrons.

Electron shells. [via]

You may remember that the further out the shell, the more electrons can occupy it(-ish): the first shell holds 2 electrons, the second holds 8; third holds 18; fourth holds 32; and so on. The same holds true for Academic Musical Chairs: the coveted interior ring only fits a handful of players; the second ring fits an order of magnitude more; the third ring an order of magnitude more than that, and so on.
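For the curious, the electron numbers above follow the simple 2n² rule from high-school chemistry. Here is a minimal sketch comparing it with the game's rings. The ring sizes are my own made-up assumption (a "handful" of 5 players in the innermost ring, ten times more in each ring after), since the game only says "an order of magnitude more".

```python
def shell_capacity(n: int) -> int:
    """Maximum electrons in the n-th shell under the simple 2 * n**2 rule."""
    if n < 1:
        raise ValueError("shell number must be 1 or greater")
    return 2 * n * n


def ring_capacity(n: int, innermost: int = 5) -> int:
    """Hypothetical ring size: each ring holds ten times the one inside it.

    The value innermost=5 is an assumption for illustration only.
    """
    if n < 1:
        raise ValueError("ring number must be 1 or greater")
    return innermost * 10 ** (n - 1)


electron_shells = [shell_capacity(n) for n in range(1, 5)]
game_rings = [ring_capacity(n) for n in range(1, 5)]

print(electron_shells)  # [2, 8, 18, 32]
print(game_rings)       # [5, 50, 500, 5000]
```

The point of the comparison is only that the game's outer rings balloon far faster than electron shells do.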

Getting closer to the center isn’t easy, and it has very little to do with your “academic rigor”! Also, of course, the closer you are to the center, the easier it is to reach either the chair, or the next level (remember positive feedback loops?). Contrariwise, the further you are from the center, the less chance you have of ever reaching the core.

Many factors affect whether a player can proceed to the next ring while the music plays, and some factors actively count against a player. Old age and being a woman, for example, take away 1 point. Getting published or cited adds points, as does already being friends with someone sitting in a chair (the details of how many points each adds can be found in your rulebook). Obviously the closer you are to the center, the easier you can make friends with people in the green core, which will contribute to your score even further. Once your score is high enough, you proceed to the next-closest shell.

Hooray, someone died! Let’s watch what happens.

The music stops. The people in the innermost ring who have the luckiest timing (thus are closest to the empty chair) scramble for it, and a few even reach it. Some very well-timed players from the 2nd & 3rd shells also reach it, because their “academic merit” has lent them speed and strength to reach past their position. A struggle ensues. Miraculously, a pregnant black woman sits down (this almost never happens), though not without some bodily harm, and the music begins again.

Oh, and new shells keep getting tacked on as more players can afford the cost of admission to the yellow zone, though the green core remains the same size.

Bizarrely, this is far from the first game of this nature. A Spanish boardgame from 1587 called the Courtly Philosophy had players move figures around a board, inching closer to living a luxurious life in the shadow of a rich patron. Random chance ruled their progression (a roll of the dice), and occasionally they’d reach a tile that said things like: “Your patron dies, go back 5 squares”.

The courtier’s philosophy. [via]

But I digress. Let’s temporarily table the scarcity/41st-chair discussion and get back to the Matthew Effect.

The View From Inside

A friend recently came to me, excited but nervous about how well they were being treated by their department at the expense of their fellow students. “Is this what the Matthew Effect feels like?” they asked. Their question is the reason I’m writing this post, because I spent the next 24 hours scratching my head over “what does the Matthew Effect feel like?”.

I don’t know if anyone’s looked at the psychological effects of the Matthew Effect (if you know of such work, please comment?), but my guess is it encompasses two feelings: 1) impostor syndrome, and 2) hard work finally paying off.

Since almost anyone who reaps the benefits of the Matthew Effect in academia will be an intelligent, hard-working academic, a windfall of accruing success should feel like finally reaping the benefits one deserves. You probably realize that luck played a part, and that many of your harder-working, smarter friends have been equally unlucky, but there’s no doubt in your mind that, at least, your hard work is finally paying off and the academic community is beginning to recognize that fact. No matter how unfair it is that your great colleagues aren’t seeing the same success.

But here’s the thing. You know how in physics, gravity and acceleration feel equivalent? How, if you’re in a windowless box, you wouldn’t be able to tell the difference between being stationary on Earth, or being pulled by a spaceship at 9.8 m/s² through deep space? Success from merit or from Matthew Effect probably acts similarly, such that it’s impossible to tell one from the other from the inside.

Gravity vs. Acceleration. [via]

Incidentally, that’s why the last advice you ever want to take is someone telling you how to succeed from their own experience.

Since we’ve seen explosive success requires but doesn’t rely on skill, quality, or intent, the most successful people are not necessarily in the best position to understand the reason for their own rise. Their strategies may have paid off, but so did timing, social network effects, and positive feedback loops. The question you should be asking is, why didn’t other people with the same strategies also succeed?

Keep this especially in mind if you’re a student, and your tenured professor advises you to seek an academic career. They may believe that giving you their strategies for success will help you succeed, when really they’re just giving you one of 50,000 admission tickets to Academic Musical Chairs.

Building a Meritocracy

I’m teetering well-past the edge of speculation here, but I assume the communities of entrenched academics encouraging undergraduates into a research career are the same communities assuming a meritocracy is at play, and are doing everything they can in hiring and tenure review to ensure a meritocratic playing field.

But even if gender bias did not exist, even if everyone responsible for decision-making genuinely wanted a meritocracy, even if the game weren’t rigged at many levels, the economy of scarcity (41st chair) combined with the Matthew Effect would ensure a true meritocracy would be impossible. There are only so many jobs, and hiring committees need to choose some selection criteria; those selection criteria will be subject to scarcity and rich-get-richer effects.

I won’t prove that point here, because original research is beyond the scope of this blog post, but I have a good idea of how to do it. In fact, after I finish writing this, I probably will go do just that. Instead, let me present very similar research, and explain how that method can be used to answer this question.

We want an answer to the question of whether positive feedback loops and a scarce economy are sufficient to prevent the possibility of a meritocracy. In 1971, Tom Schelling asked an unrelated question which he answered using a very relevant method: can racial segregation manifest in a community whose every actor is intent on not living a segregated life? Spoiler alert: yes.

He answered this question by simulating an artificial world. It was similar in spirit to the Columbia social music experiment, except that instead of using real participants, he experimented on very simple rule-abiding game creatures of his own invention. A bit like having a computer play checkers against itself.

The experiment is simple enough: a bunch of creatures occupy a checker board, and like checker pieces, they’re red or black. Every turn, one creature has the opportunity to move randomly to another empty space on the board, and their decision to move is based on their comfort with their neighbors. Red pieces want red neighbors, and black pieces want black neighbors, and they keep moving randomly ’till they’re all comfortable. Unsurprisingly, segregated creature communities appear in short order.

What if our checker-creatures were more relaxed in their comforts? They’d be comfortable as long as they were in the majority; say, at least 50% of their neighbors were the same color. Again, let the computer play itself for a while, and within a few cycles the checker board is once again almost completely segregated.

Schelling segregation. [via]

What if the checker pieces are excited about the prospect of a diverse neighborhood? We relax the criteria even more, so red checkers only move if fewer than a third of their neighbors are red (that is, they’re totally comfortable with 66% of their neighbors being black)? If we run the experiment again, we see, again, the checker board breaks up into segregated communities.

Schelling’s claim wasn’t about how the world worked, but about what the simplest conditions were that could still explain racism. In his fictional checkers-world, every piece could be generously interested in living in a diverse neighborhood, and yet the system still eventually resulted in segregation. This offered a powerful support for the theory that racism could operate subtly, even if every actor were well-intended.
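If you'd rather poke at the checker-creatures yourself, here is a minimal sketch of the most relaxed version described above. It is a toy written for this post's description, not Schelling's original procedure: the board size, the share of empty squares, the random seed, and the choice to count only occupied neighbors are all my assumptions.

```python
import random

EMPTY, RED, BLACK = 0, 1, 2


def make_board(size, empty_fraction, rng):
    """Shuffle equal numbers of red and black pieces onto a size x size board."""
    cells = size * size
    n_empty = int(cells * empty_fraction)
    n_red = (cells - n_empty) // 2
    n_black = cells - n_empty - n_red
    flat = [EMPTY] * n_empty + [RED] * n_red + [BLACK] * n_black
    rng.shuffle(flat)
    return [flat[r * size:(r + 1) * size] for r in range(size)]


def same_color_share(board, r, c):
    """Share of occupied neighboring squares matching the piece at (r, c).

    Returns None when the piece has no occupied neighbors at all.
    """
    size = len(board)
    same = occupied = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr, dc) == (0, 0) or not (0 <= rr < size and 0 <= cc < size):
                continue
            if board[rr][cc] != EMPTY:
                occupied += 1
                same += board[rr][cc] == board[r][c]
    return same / occupied if occupied else None


def pieces(board):
    """Coordinates of every occupied square."""
    return [(r, c) for r, row in enumerate(board)
            for c, v in enumerate(row) if v != EMPTY]


def mean_share(board):
    """Average same-color neighbor share over pieces that have neighbors."""
    shares = [same_color_share(board, r, c) for r, c in pieces(board)]
    shares = [s for s in shares if s is not None]
    return sum(shares) / len(shares)


def run(board, threshold, rng, max_moves=2000):
    """Move one random uncomfortable piece per turn; return the moves made."""
    size = len(board)
    for move in range(max_moves):
        unhappy = [(r, c) for r, c in pieces(board)
                   if (s := same_color_share(board, r, c)) is not None
                   and s < threshold]
        if not unhappy:
            return move  # everyone is comfortable
        r, c = rng.choice(unhappy)
        empties = [(i, j) for i in range(size) for j in range(size)
                   if board[i][j] == EMPTY]
        i, j = rng.choice(empties)
        board[i][j], board[r][c] = board[r][c], EMPTY
    return max_moves


rng = random.Random(42)                # fixed seed so reruns match
board = make_board(20, 0.10, rng)      # assumed: 20x20 board, 10% empty
before = mean_share(board)
moves = run(board, 1 / 3, rng)         # move only if under a third match
after = mean_share(board)
print(f"same-color neighbors: {before:.2f} before, {after:.2f} after {moves} moves")
```

Pieces with no neighbors are treated as comfortable, and uncomfortable pieces jump to a random empty square rather than a carefully chosen one, matching the "moving randomly" rule above. Try raising the threshold to 0.5 to see the stricter variant.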

Vi Hart and Nicky Case created an interactive visualization/game that teaches Schelling’s segregation model perfectly. Go play it. Then come back. I’ll wait.

Such an experiment can be devised for our 41st-chair/positive-feedback system as well. We can even build a simulation whose rules match the Academic Musical Chairs I described above. All we need to do is show that a system in which both effects operate (a fact empirically proven time and again in academia) produces fundamental challenges for meritocracy. Such a model would show that simple meritocratic intent is insufficient to produce a meritocracy. Hulk smashing the myth of the meritocracy seems fun; I think I’ll get started soon.
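To give a flavor of what such a simulation might look like, here is a deliberately crude sketch. Everything in it is an assumption invented for illustration: the number of players, the uniformly random "merit", the way recognition is handed out one unit at a time, and the `feedback` knob that switches rich-get-richer effects on or off. It proves nothing about real academia; it only shows the shape of the experiment.

```python
import random


def play(n_players=500, n_chairs=40, rounds=2000, feedback=1.0, seed=0):
    """Return how many of the n_chairs highest-merit players end up seated.

    Hypothetical rules: each round one unit of recognition (a citation, a
    grant) goes to a player drawn with weight (1 + merit) * score**feedback.
    feedback=0 ignores past success; feedback=1 is rich-get-richer.
    """
    rng = random.Random(seed)
    merit = [rng.random() for _ in range(n_players)]  # fixed "academic merit"
    score = [1.0] * n_players                         # everyone starts level
    players = range(n_players)
    for _ in range(rounds):
        weights = [(1.0 + merit[i]) * score[i] ** feedback for i in players]
        lucky = rng.choices(players, weights=weights)[0]
        score[lucky] += 1.0
    # Scarcity: only the n_chairs highest scores get a seat.
    # Ties fall to the lower player index, which is unrelated to merit.
    seated = sorted(players, key=lambda i: score[i], reverse=True)[:n_chairs]
    most_meritorious = sorted(players, key=lambda i: merit[i],
                              reverse=True)[:n_chairs]
    return len(set(seated) & set(most_meritorious))


merit_only = play(feedback=0.0)
with_feedback = play(feedback=1.0)
print(f"top-merit players seated, no feedback:   {merit_only} of 40")
print(f"top-merit players seated, with feedback: {with_feedback} of 40")
```

Because both runs share a seed, the same players have the same merit in each; only the feedback rule differs. A serious version would need many seeds, the ring structure from Figure 1, and far more defensible rules.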

Our world ain’t that simple. For one, as seen in Academic Musical Chairs, your place in the social network influences your chances of success. A heavy-hitting advisor, an old-boys cohort, etc., all improve your starting position when you begin the game.

To put it more operationally, let’s go back to the Columbia social music experiment. Part of a song’s success was due to quality, but the stuff that made stars was much more contingent on chance timing followed by positive feedback loops. Two of the authors from the 2006 study wrote another in 2007, echoing this claim that good timing was more important than individual influence:

models of information cascades, as well as human subjects experiments that have been designed to test the models (Anderson and Holt 1997; Kubler and Weizsacker 2004), are explicitly constructed such that there is nothing special about those individuals, either in terms of their personal characteristics or in their ability to influence others. Thus, whatever influence these individuals exert on the collective outcome is an accidental consequence of their randomly assigned position in the queue.

These articles are part of a large literature in predicting popularity, viral hits, success, and so forth. There’s The Pulse of News in Social Media: Forecasting Popularity by Bandari, Asur, & Huberman, which showed that a top predictor of newspaper shares was the source rather than the content of an article, and that a major chunk of articles that do get shared never really make it to viral status. There’s Can Cascades be Predicted? by Cheng, Adamic, Dow, Kleinberg, and Leskovec (all-star cast if ever I saw one), which shows the remarkable reliance on timing & first impressions in predicting success, and also the reliance on social connectivity. That is, success travels faster through those who are well-connected (shocking, right?), and structural properties of the social network are important. This study by Susarla et al. also shows the importance of location in the social network in helping push those positive feedback loops, affecting the magnitude of success in YouTube Video shares.

Twitter information cascade. [via]

Now, I know, social media success does not an academic career predict. The point here, instead, is to show that in each of these cases, before sharing occurs and not taking into account social media effects (that is, relying solely on the merit of the thing itself), success is predictable, but stardom is not.

Concluding, Finally

Relating it to Academic Musical Chairs, it’s not too difficult to say whether someone will end up in the 41st chair, but it’s impossible to tell whether they’ll end up in seats 1-40 until you keep an eye on how positive feedback loops are affecting their career.

In the academic world, there’s a fertile prediction market for Nobel Laureates. Social networks and Matthew Effect citation bursts are decent enough predictors, but what anyone who predicts any kind of success will tell you is that it’s much easier to predict the pool of recipients than it is to predict the winners.

Take Economics. How many working economists are there? Tens of thousands, at least. But there’s this Econometric Society, which began naming Fellows in 1933, naming 877 Fellows by 2011. And guess what, 60 of 69 Nobel Laureates in Economics before 2011 were Fellows of the society. The other 817 members are or were occupants of the 41st chair.

The point is (again, sorry), academic meritocracy is a myth. Merit is a price of admission to the game, but not a predictor of success in a scarce economy of jobs and resources. Once you pass the basic merit threshold and enter the 41st chair, forces having little to do with intellectual curiosity and rigor guide eventual success (ahem). Small positive biases like gender, well-connected advisors, early citations, lucky timing, etc. feed back into increasingly larger positive biases down the line. And since there are only so many faculty jobs out there, these feedback effects create a naturally imbalanced playing field. Sometimes Einsteins do make it into the middle ring, and sometimes they stay patent clerks. Or adjuncts, I guess. Those who do make it past the 41st chair are poorly-suited to tell you why, because by and large they employed the same strategies as everybody else.

Yep, Academic Musical Chairs

And if these six thousand words weren’t enough to convince you, I leave you with this article and this tweet. Have a nice day!

One of the only variables I’ve ever seen that truly predicts grant success … your application number pic.twitter.com/R7Q3k8PNck

— Adrian Barnett (@aidybarnett) March 19, 2016

Addendum for Historians

You thought I was done?

As a historian of science, this situation has some interesting repercussions for my research. Perhaps most importantly, it and related concepts from Complex Systems research offer a middle ground framework between environmental/contextual determinism (the world shapes us in fundamentally predictable ways) and individual historical agency (we possess the power to shape the world around us, making the world fundamentally unpredictable).

More concretely, it is historically fruitful to ask not simply what non-“scientific” strategies were employed by famous scientists to get ahead (see Biagioli’s Galileo, Courtier), but also what did or did not set those strategies apart from the masses of people we no longer remember. Galileo, Courtier provides a great example of what we historians can do on a larger scale: it traces Galileo’s machinations to wind up in the good graces of a wealthy patron, and how such a system affected his own research. Using recently-available data on early modern social and scholarly networks, as well as the beginnings of data on people’s activities, interests, practices, and productions, it should be possible to zoom out from Biagioli’s viewpoint and get a fairly sophisticated picture of trajectories and practices of people who weren’t Galileo.

This is all very preliminary, just publicly blogging whims, but I’d be fascinated by what a wide-angle (dare I say, macroscopic?) analysis of the 41st chair could tell us about how social and “scientific” practices shaped one another in the 16th and 17th centuries. I believe this would bear previously-impossible fruit, since a lone historian grasping ten thousand tertiary actors at once is a fool’s errand, but is a walk in the park for my laptop.

As this really is whim-blogging, I’d love to hear your thoughts.

Reader Comments

acrymble, 2016-03-26 09:10

I liked your post. As one of the people who managed to get one of the chairs, I appreciate your point that I’m not able to reflect on the process without considerable baggage. But I’d like to engage nonetheless.

I take the point that there are many great people not getting seats at the table. But I think what you’ve described looks at academic jobs the wrong way round. They aren’t prizes to be collected by the best and the brightest. They’re jobs that need doing. Jobs that involve specific teaching (eg, who can teach Early Modern British History to our first year students and the history of medicine to our final year students?), administration (we need someone to run academic quality assurance), and research (someone who does something no one else in our department does, and that looks decent enough to publish some interesting stuff). They’re also looking for someone who they think they can get along with for the next 30 years, who will engage the students, care about their work, etc.

To be competitive doesn’t just mean they have a PhD. Having a PhD is about as useful as breathing when it comes to applying for jobs. It’s such a fundamental requirement that it becomes meaningless. These non-competitive candidates produce job applications that probably emphasize their really great research (which to the rest of us may look very specific and obscure, and which quite frankly they will have very little time to do anyway). It probably didn’t occur to them to look into the specific teaching needs of the post so that they could highlight that in their application. They probably haven’t built up the ability to teach anything beyond their PhD specialisation (what ELSE can you teach?). They probably don’t know the difference between impact and engagement and how their work fulfills both, etc, etc.

These people just don’t have enough experience or awareness of the industry to be ‘appointable’ in their current state. Some people will learn it over time. Others will never get it. Usually we are too polite to tell those people to give up, which would probably be kinder. So your 50,000 people circling the chairs include a good proportion who just aren’t competitive for a variety of reasons, chiefly, because they thought having a PhD was the criteria and it is not a meaningful one if everyone else in the room has one too.

With that in mind, I think there is limited scope for merit. The person who ‘gets it’ and does the right digging into the department, pitches effectively for the specific job (and is qualified for that SPECIFIC job), rounds out their skill set and talks to lots of people about what it’s like to work as an academic or hire people, can improve their chances of getting an interview. You put yourself amongst the MANY very qualified people who can vie for the post. Not guarantee, but at least separate themselves from the people who didn’t understand they were applying for a job, not a prize.

If you get to the interview, it becomes a blind date rather than a game of musical chairs. They’re looking for someone who can do the very specific job that they need doing. But they’re also looking for that spark – the ‘je ne sais quoi’ of a long-term colleague. Just like in dating, sometimes you connect. And sometimes you don’t. You don’t want to end up in a bad marriage, so sometimes not getting the job is the best outcome, despite the frustration you may feel at the time.

I agree that there aren’t enough jobs for the people that want them (there aren’t enough acting gigs for actors either). And I appreciate luck and privilege (gender, ethnicity, age, where you went to school) are big elements in the equation. Often a candidate has the wrong skillset and experience for the specific jobs that are posted. That’s a lottery and entirely unfair if you guess wrong (eg, chose a PhD topic that becomes unsexy just as you’re finishing). But the people who get hired almost always deserve it. That doesn’t mean the people who don’t get hired aren’t amazing and brilliant people. But this isn’t about rewarding brilliance. It’s about a group of 80 first year students who need to be taught Early Modern British History and 25 final year students who need to learn about the history of medicine.

Teaching PhD students that academia is a job and not a prize is probably one of the first steps in addressing the frustration that you describe in this post. Whether we like to admit it or not, there are exactly the number of academic jobs that the market can bear. The conversation we should all be having is: what other fulfilling options are out there for people who are passionate about their subject knowledge? And how can we end this belief that academic jobs are prizes?

scottenderle, 2016-03-26 15:49

“These people just don’t have enough experience or awareness of the industry to be ‘appointable’ in their current state.” Certainly not. But what about the thousands of people living as adjuncts doing the very jobs you describe, year after year, as they struggle to find a tenure-track job? It almost sounds as if you think that the competitors in this system are all ABDs. But as I’m sure you must know, many of the competitors are university-level teachers with years of experience. Many of them have been teaching four or five classes a semester while also maintaining an active research profile. And many of them have been passed over by hiring committees in favor of an unexperienced ABD.

I know fantastic teachers, brilliant researchers, generous colleagues to whom this has happened multiple times. I used to worry that this was a sign that I was mistaken in my assessment of those people. It has taken me a very long time to adjust to the realization that their effort and their talent may simply never be recognized.

Before we can even begin to talk about “other fulfilling options” for these people, we need to acknowledge that our academic system has failed them.

acrymble, 2016-03-27 10:06

You won’t get me arguing with you about the problems of the adjuncts. I don’t like it either.

Lincoln Mullen, 2016-03-29 02:47

“Again I saw that under the sun the race is not to the swift, nor the battle to the strong, nor bread to the wise, nor riches to the intelligent, nor favor to the skillful; but time and chance happen to them all.”

Scott B. Weingart, 2016-03-29 10:52

“what has been said will be said again; there is nothing new under the sun.” Or, as Lenny Bruce said on October 4th, 1961 (2:00 minutes in), before being arrested for obscenity: “Believe me, I’m not profound, this is something that I assume someone must have laid on me, because I do not have an original thought. I am screwed. I speak English. That’s it. I was not born in a vacuum.”

Unless it’s really awful, but let’s avoid that discussion here. ↩
short of a TARDIS. ↩

Historians, Doctors, and their Absence

2013-10-20T00:00:00+00:00

Historians, Doctors, and their Absence

[Note: sorry for the lack of polish on the post compared to others. This was hastily written before a day of international travel. Take it with however many grains of salt seem appropriate under the circumstances.]

[Author’s note two: Whoops! Never included the link to the article. Here it is.]

Every once in a while,¹ a group of exceedingly clever mathematicians and physicists decide to do something exceedingly clever on something that has nothing to do with math or physics. This particular research project has to do with the 14th Century Black Death, resulting in such claims as the small-world network effect is a completely modern phenomenon, and “most social exchange among humans before the modern era took place via face-to-face interaction.”

The article itself is really cool. And really clever! I didn’t think of it, and I’m angry at myself for not thinking of it. They look at the empirical evidence of the spread of disease in the late middle ages, and note that the pattern of disease spread looked shockingly different than patterns of disease spread today. Epidemiologists have long known that today’s patterns of disease propagation are dependent on social networks, and so it’s not a huge leap to say that if earlier diseases spread differently, their networks must have been different too.

Don’t get me wrong, that’s really fantastic. I wish more people (read: me) would make observations like this. It’s the sort of observation that allows historians to infer facts about the past with reasonable certainty given tiny amounts of evidence. The problem is, the team had neither any doctors, nor any historians of the late middle ages, and it turned an otherwise great paper into a set of questionable conclusions.

Small world networks have a formal mathematical definition, which (essentially) states that no matter how big the population of the world gets, everyone is within a few degrees of separation from you. Everyone’s an acquaintance of an acquaintance of an acquaintance of an acquaintance. This non-intuitive fact is what drives the insane speeds of modern diseases; today, an epidemic can spread from Australia to every state in the U.S. in a matter of days. Due to this, disease spread maps are weirdly patchy, based more around how people travel than geographic features.

Patchy H5N1 outbreak map.

The map of the spread of black death in the 14th century looked very different. Instead of these patches, the disease appeared to spread in very deliberate waves, at a rate of about 2km/day.

Spread of the plague, via the original article.

How to reconcile these two maps? The solution, according to the network scientists, was to create a model of people interacting and spreading diseases across various distances and types of networks. Using the models, they show that in order to generate these wave patterns of disease spread, the physical contact network cannot be small world. From this, and because they make the (uncited) claim that physical contact networks had to be a subset of social contact networks (entirely ignoring, say, correspondence), they conclude that the 14th century did not have small world social networks.

There’s a lot to unpack here. First, their model does not take into account the fact that people, y’know, die after they get the plague. Their model assumes the infected have enough time and impetus to travel to get the disease as far as they could after becoming contagious. In the discussion, the authors do realize this is a stretch, but suggest that because people could, if they so chose, travel 40km/day, and the black death only spread 2km/day, this is not sufficient to explain the waves.

I am no plague historian, nor a doctor, but a brief trip on the google suggests that black death symptoms could manifest in hours, and a swift death comes only days after. It is, I think, unlikely that people would or could be traveling great distances after symptoms began to show.

More important to note, however, are the assumptions the authors make about social ties in the middle ages. They assume a social tie must be a physical one; they assume social ties are connected with mobility; and they assume social ties are constantly maintained. This is a bit before my period of research, but even for a hundred years later (still before the period the authors claim could have sustained small world networks), any early modern historian could tell you that communication was asynchronous and travel was ordered and infrequent.

Surprisingly, I actually believe the authors’ conclusions: that by the strict mathematical definition of small world networks, the “pre-modern” world might not have that feature. I do think distance and asynchronous communication prevented an entirely global 6-degree effect. That said, the assumptions they make about what a social tie is are entirely modern, which means their conclusion is essentially inevitable: historical figures did not maintain modern-style social connections, and thus metrics based on those types of connections should not apply. Taken in the social context of Europe in the late middle ages, however, I think the authors would find that the salient features of small world networks (short average path length and high clustering) exist in that world as well.
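For readers who want to see those two features in action, here is a small sketch that measures average path length and clustering on a toy network. It is not the model from the plague article. It uses the familiar textbook recipe of a ring of neighbors with a few randomly rewired long-distance ties, and the sizes and probabilities (200 people, 6 ties each, 10% rewired) are assumptions picked purely for illustration.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Each of n people knows the k/2 nearest people on either side."""
    adj = [set() for _ in range(n)]
    for u in range(n):
        for j in range(1, k // 2 + 1):
            v = (u + j) % n
            adj[u].add(v)
            adj[v].add(u)
    return adj


def rewire(adj, k, p, rng):
    """With probability p, swap each local tie for a tie to a random person."""
    n = len(adj)
    for u in range(n):
        for j in range(1, k // 2 + 1):
            v = (u + j) % n
            if v in adj[u] and rng.random() < p:
                candidates = [w for w in range(n)
                              if w != u and w not in adj[u]]
                if not candidates:
                    continue
                w = rng.choice(candidates)
                adj[u].discard(v)
                adj[v].discard(u)
                adj[u].add(w)
                adj[w].add(u)


def average_path_length(adj):
    """Mean shortest-path length over all reachable pairs (BFS per person)."""
    total = pairs = 0
    for source in range(len(adj)):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def average_clustering(adj):
    """Mean share of each person's acquaintances who also know each other."""
    coeffs = []
    for nbrs in adj:
        nb = list(nbrs)
        d = len(nb)
        if d < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for i in range(d) for j in range(i + 1, d)
                    if nb[j] in adj[nb[i]])
        coeffs.append(2 * links / (d * (d - 1)))
    return sum(coeffs) / len(coeffs)


N, K = 200, 6                       # assumed toy population and ties each
lattice = ring_lattice(N, K)
shortcut = ring_lattice(N, K)
rewire(shortcut, K, 0.1, random.Random(7))

lattice_path = average_path_length(lattice)
lattice_clustering = average_clustering(lattice)
shortcut_path = average_path_length(shortcut)
shortcut_clustering = average_clustering(shortcut)
print(f"local ties only: path {lattice_path:.1f}, clustering {lattice_clustering:.2f}")
print(f"with shortcuts:  path {shortcut_path:.1f}, clustering {shortcut_clustering:.2f}")
```

The comparison to watch is how far the path length falls once a handful of long-distance ties (correspondence, say) are added, and how much clustering survives.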

A second problem, and the reason I agree with the authors that there was not a global small world in the late 14th century, is that “global” is not an appropriate axis on which to measure “pre-modern” social networks. Today, we can reasonably say we all belong to a global population; at that point in time, before trade routes from Europe to the New World and because of other geographical and technological barriers, the world should instead have been seen as a set of smaller, overlapping populations. My guess is that, for more reasonable definitions of populations for the time period, small world properties would continue to hold.

Notes:

Reader Comments

Yannick Rochat, 2013-10-24 18:36

It reminds me of another Newman article : http://arxiv.org/abs/cond-mat/0305612

In the case of the plague article, I feel like you (an historian) were not supposed to find it. Like if it were storytelling for engineers. Let’s hope that such a work, made without the help of a researcher in (digital ?) humanities, and with quite no sources from work of historians, doesn’t become a reference on this subject, but remains at most one about that spread-with-jumps algorithm.

There should be a blog about such articles.

Thanks for your post.

Jack Rigby, 2014-03-05 07:33

Now this IS fascinating! The problem with trying to establish the actual truth, as distinct from the political truth, is that academically, it can mean professional death, or in the case of some industries, (Tobacco) real death. I wrote a long lost dissent 40 years ago about plague/disease characteristics and the key point was: “Hellooo?? Nobody with the Black Death infection travels far at all”

It was tied in to the nonsense about the origin of Man being in Africa, a totally geo-unstable place compared to Australia, where the locals have stories about the “DreamtimeS” ( two lots of 20,000 year memories) and “going out into the world – where everybody was dead from the cold.”

One of the big catches in disease research is the blatant lies told by the “Vested Interests” protecting their interests. But one can find valid information by looking in esoteric areas like the international battle to get rid of the literal health horrors of SODIUM Fluoride and actually track deterioration in the health of entire populations’ “with no DISCERNIBLE reason.” (Officially)

Every day? Every two days? ↩

Analyzing submissions to Digital Humanities 2013

2012-11-08T00:00:00+00:00

Analyzing submissions to Digital Humanities 2013

Digital Humanities 2013 is on its way; submissions are closed, peers will be reviewing them shortly, and (most importantly for this post) the people behind the conference are experimenting with a new method of matching submissions to reviewers. It’s a bidding process; reviewers take a look at the many submissions and state their reviewing preferences or, when necessary, conflicts of interest. It’s unclear the extent to which these preferences will be accommodated, as this is an experiment on their part. Bethany Nowviskie describes it here. As a potential reviewer, I just went through the process of listing my preferences, and managed to do some data scraping while I was there. How could I not? All 348 submission titles were available to me, as well as their authors, topic selections, and keywords, and given that my submission for this year is all about quantitatively analyzing DH, it was an opportunity I could not pass up. Given that these data are sensitive, and those who submitted did so under the assumption that rejected submissions would remain private, I’m opting not to release the data or any non-aggregated information. I’m also doing my best not to actually read the data in the interest of the privacy of my peers; I suppose you’ll all just have to trust me on that one, though.

So what are people submitting? According to the topics authors assigned to their 348 submissions, 65 submissions related to “literary studies,” trailed closely by 64 which pertained to “data mining/ text mining.” Work on archives and visualizations is also up near the top, and only about half as many authors submitted historical studies (37) as those who submitted literary ones (65). This confirms my long-held suspicion that our current wave of DH (that is, what’s trending and exciting) focuses quite a bit more on literature than history. This makes me sad. You can see the breakdown in Figure 1 below, and further analysis can be found after.

Figure 1: Number of documents with each topic authors assigned to submissions for DH2013.

The majority of authors attached fewer than five topics to their submissions; a small handful included over 15. Figure 2 shows the number of topics assigned to each document.

Figure 2: The number of topics attached to each document, in order of rank.

I was curious how strongly each topic coupled with other topics, and how topics tended to cluster together in general, so I extracted a topic co-occurrence network. That is, whenever two topics appear on the same document, they are connected by an edge (see Networks Demystified Pt. 1 for a brief introduction to this sort of network); the more times two topics co-occur, the stronger the weight of the edge between them.
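A minimal sketch of that tallying step in Python, assuming each submission is simply a list of topic labels. The toy submissions and the function name `cooccurrence_edges` are made up for illustration; this is not the real DH2013 data, nor necessarily how the network in this post was built.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents):
    """Count how many documents each unordered pair of topics shares."""
    edges = Counter()
    for topics in documents:
        # sorted(set(...)) drops repeated labels and gives each pair a stable order
        for pair in combinations(sorted(set(topics)), 2):
            edges[pair] += 1
    return edges


# Hypothetical submissions, each tagged with a few topics.
toy_submissions = [
    ["text analysis", "data mining", "literary studies"],
    ["text analysis", "data mining"],
    ["visualization", "data mining"],
]

edges = cooccurrence_edges(toy_submissions)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

Each key in `edges` is one edge of the network, and its count is the edge weight.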

Topping off the list at 34 co-occurrences were “Data Mining/ Text Mining” and “Text Analysis,” not terrifically surprising as the latter generally requires the former, followed by “Data Mining/ Text Mining” and “Content Analysis” at 23 co-occurrences, “Literary Studies” and “Text Analysis” at 22 co-occurrences, “Content Analysis” and “Text Analysis” at 20 co-occurrences, and “Data Mining/ Text Mining” and “Literary Studies” at 19 co-occurrences. Basically what I’m saying here is that Literary Studies, Mining, and Analysis seem to go hand-in-hand.

Knowing my readers, about half of you are already angry with me counting co-occurrences, and rightly so. That measurement is heavily biased by the sheer total number of times a topic is used; if “literary studies” is attached to 65 submissions, it’s much more likely that it will co-occur with any particular topic than topics (like “teaching and pedagogy”) which simply appear more infrequently. The highest frequency topics will co-occur with one another simply by an accident of magnitude.

To account for this, I measured the neighborhood overlap of each node on the topic network. This involves first finding the number of other topics a pair of two topics shares. For example, “teaching and pedagogy” and “digital humanities – pedagogy and curriculum” each co-occur with several of the same other topics, including “programming,” “interdisciplinary collaboration,” and “project design, organization, management.” I summed up the number of topical co-occurrences between each pair of topics, and then divided that total by the number of co-occurrences each node in the pair had individually. In short, I looked at which pairs of topics tended to share similar other topics, making sure to take into account that some topics which are used very frequently might need some normalization. There are better normalization algorithms out there, but I opt to use this one for its simplicity, for pedagogical reasons. The method does a great job leveling the playing field between pairs of infrequently-used topics compared to pairs of frequently-used topics, but doesn’t fare so well when looking at a pair where one topic is popular and the other is not. The algorithm is well-described in Figure 3, where the darker the edge, the higher the neighborhood overlap.

Figure 3: The neighborhood overlap between two nodes is how many neighbors (or connections) that pair of nodes shares. As such, A and B share very few connections, so their overlap is low, whereas D and E have quite a high overlap. Via Jaroslav Kuchar.
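The overlap idea can be sketched in a few lines of Python. One loud assumption: the sketch uses the common textbook form of neighborhood overlap (shared neighbors divided by all neighbors of either node, leaving out the pair itself), which may differ in detail from the normalization described above. The toy network and the function name are invented for the example.

```python
def neighborhood_overlap(graph, a, b):
    """Shared neighbors of a and b, divided by all neighbors of either one.

    The two nodes themselves are excluded from both counts.
    """
    neighbors_a = graph[a] - {a, b}
    neighbors_b = graph[b] - {a, b}
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


# Hypothetical topic network: each topic maps to the set of topics it co-occurs with.
toy_graph = {
    "pedagogy": {"curriculum", "programming", "collaboration"},
    "curriculum": {"pedagogy", "programming", "collaboration"},
    "programming": {"pedagogy", "curriculum", "text mining"},
    "collaboration": {"pedagogy", "curriculum"},
    "text mining": {"programming"},
}

print(neighborhood_overlap(toy_graph, "pedagogy", "curriculum"))
print(neighborhood_overlap(toy_graph, "pedagogy", "text mining"))
```

In this toy network the two pedagogy topics share all of their other neighbors, while "pedagogy" and "text mining" share only one of three, so the first pair scores much higher.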

Neighborhood overlap paints a slightly different picture of the network. The pair of topics with the largest overlap was “Internet / World Wide Web” and “Visualization,” with 90% of their neighbors overlapping. Unsurprisingly, the next-strongest pair was “Teaching and Pedagogy” and “Digital Humanities – Pedagogy and Curriculum.” The data might be used to suggest multiple topics that might be merged into one, and this pair seems to be a pretty good candidate. “Visualization” also closely overlaps “Data Mining/ Text Mining”, which itself (as we saw before) overlaps with “Cultural Studies” and “Literary Studies.” What we see in this close clustering, both in overlap and in connection strength, are the traces of a fairly coherent subfield within DH: quantitative literary studies. We see a similarly tight-knit cluster between topics concerning archives, databases, analysis, the web, visualizations, and interface design, which suggests another genre in the DH community: the (relatively) recent boom of user interfaces as workbenches for humanists exploring their archives. Figure 4 represents the pairs of topics which overlap to the highest degree; topics without high degrees of pair correspondence don’t appear on the network graph.

Figure 4: Network of topical neighborhood overlap. Edges between topics are weighted according to how structurally similar the two topics are. Topics that are structurally isolated are not represented in this network visualization.

The topics authors chose for each submission were from a controlled vocabulary. Authors also had the opportunity to attach their own keywords to submissions, which unsurprisingly yielded a much more diverse (and often redundant) network of co-occurrences. The resulting network revealed a few surprises: for example, “topic modeling” appears to be much more closely coupled with “visualization” than with “text analysis” or “text mining.” Of course some pairs are not terribly surprising, as with the close connection between “Interdisciplinary” and “Collaboration.” The graph also shows that the organizers have done a pretty good job putting the curated topic list together, as a significant chunk of the high thresholding keywords are also available in the topic list, with a few notable exceptions. “Scholarly Communication,” for example, is a frequently used keyword but not available as a topic – perhaps next year, this sort of analysis can be used to help augment the curated topic list. The keyword network appears in Figure 5. I’ve opted not to include a truly high resolution image to dissuade readers from trying to infer individual documents from the keyword associations.

Figure 5: Which keywords are used together on documents submitted to DH2013? Nodes are colored by cluster, and edges are weighted by number of co-occurrences.

There’s quite a bit of rich data here to be explored, and anyone who does have access to the bidding can easily see that the entire point of my group’s submission is exploring the landscape of DH, so there’s definitely more to come on the subject from this blog. I especially look forward to seeing what decisions wind up being made in the peer review process, and whether or how that skews the scholarly landscape at the conference.

On a more reflexive note, looking at the data makes it pretty clear that DH isn’t as fractured as some occasionally suggest (New Media vs. Archives vs. Analysis, etc.). Every document is related to a few others, and they are all of them together connected in a rich family, a network, of Digital Humanities. There are no islands or isolates. While there might be no “The” Digital Humanities, no unifying factor connecting all research, there are Wittgensteinian family resemblances connecting all of these submissions together, in a cohesive enough whole to suggest that yes, we can reasonably continue to call our confederation a single community. Certainly, there are many sub-communities, but there still exists an internal cohesiveness that allows us to differentiate ourselves from, say, geology or philosophy of mind, which themselves have their own internal cohesiveness.

Another Step in Keeping Pledges

2012-08-15T00:00:00+00:00

Another Step in Keeping Pledges

Long-time readers of this blog might remember that, a while ago, I pledged to do pretty much Open Everything. Last week, a friend in my department asked how I managed that without having people steal my ideas. It’s a tough question, and I’m still not certain whether my answer has more to do with idealist naïveté or actual forward-thought. Time will tell. As it is, the pool of people doing similar work to mine is small, and they pretty much all know about this blog, so I’m confident the crowd of rabid academics will keep each other in check. Still, I suppose we all have to be on guard for the occasional evil professor, wearing his white lab coat, twirling his startling mustachio, and just itching to steal the idle musings of a still-very-confused Ph.D. student.

In the interest of keeping up my pledge, I’ve decided to open up yet another document, this time for the purpose of student guidance. In 2010, I applied for the NSF Graduate Research Fellowship Program, a shockingly well-paying program that’ll surely help with the rising (and sometimes prohibitive) costs of graduate school. By several strokes of luck and (I hope) a decent project, the NSF sent the decision to fund me later that year, and I’ve had more time to focus on research ever since. In the interest of helping future applicants, I’ve posted my initial funding proposal on figshare. Over the next few weeks, there are a few other documents and datasets I plan on making public, and I’ll start a new page on this blog that consolidates all the material that I’ve opened, inspired by Ted Underwood’s similar page.

Click to get my NSF proposal.

Do you have grants or funding applications that’ve been accepted? Do you have publications out that are only accessible behind a drastic paywall? I urge you to post preprints, drafts, or whatever else you can to make scholarship a freer and more open endeavor for the benefit of all.

Reader Comments

Laurie N. Taylor, 2012-08-17

The University of Florida libraries have seen very clear benefits (increasing interest in projects, gaining new collaborators for projects, serving as PR/marketing) from posting grant applications in the UF Digital Collections. We have a full collection specifically for grant proposals: http://ufdc.ufl.edu/ufirgrants For large, collaborative projects, we’ve also found this to be specifically useful for ease of communication and project management as the grant projects proceed because it ensures everyone has ready access to the proposal. More recently, we’re trying to ensure that we also share all official press releases and all grant reports along with the funded proposals to help people better understand how grant projects normally proceed, best practices, and just to further develop a culture of grantsmanship for successful proposal writing, successful and easier grant project management, and successful next steps in terms of increasing impact from all projects. While the emphasis began on sharing proposals for larger grants, researchers have added individual fellowship proposals and the feedback has been similarly positive. Researchers for some projects have declined to share their proposals until after their project work and publication are complete for fear of being scooped, which seems like a valid concern in some instances. For many, it does not seem applicable and there do seem to be clear benefits from sharing the proposals.

I’m very interested to see how other people respond on this and for additional data (anecdotal or otherwise) on risks and benefits.

Anthony Salvagno, 2012-08-18

Personally I think the reason scientists won’t steal ideas is because we are putting them out there. As a fellow open scientist, I make all my research and data public domain. Others may attribute it with a share-alike license. Whatever the case is, generally speaking, you can’t steal ideas that are being offered to the world. That may have a lot to do with it. The current pool of participants being small may also have something to do with it (like you suggest). I recently wrote a bunch of thoughts on this here.

And that’s great that you published your funded proposal. I took the concept a step further and wrote an NSF IGERT proposal openly and published it here. Hopefully it gets funded but if not hopefully I or someone can build on it in the future. Whatever pushes science forward right?

Doing Bayesian Data Analysis

2012-01-10T00:00:00+00:00

Doing Bayesian Data Analysis

A few months ago, Science published a Thanksgiving article on what scientists can be grateful for. It’s got a lot of good points, like being thankful for family members who accept the crazy hours we work, or for those really useful research projects that make science cool enough for us to get funding for the merely really interesting. It does have one unfortunate reference to humanists:

We are thankful that Ph.D. programs in the sciences, as much as we complain about them, aren’t nearly as horrifying as, say, Ph.D. programs in the humanities. I just heard today from a friend in his ninth year of a comparative literature Ph.D. who thinks he might finish “in a year and a half.” At least the job market for comp lit Ph.D. awardees is thriving, right?

Ouch. I suppose the truth hurts. The particularly interesting point that inspired this post, however, was:

We are thankful for that one colleague who knows statistics. There’s always one.

A Scientist’s Thanksgiving. (Image from the above Science article)

The State of Things

The above quote about statisticians is so true it hurts, as (we just discovered) the truth is wont to do. It’s even more true in the humanities than it is in the more natural and quantitative sciences. When we talk about a colleague who knows statistics, we generally don’t mean someone down the hall; usually, we mean that one statistician whom we met in the pub that one night and who has a bizarre interest in the humanities. That’s not to say humanist statisticians don’t exist, but I doubt you’re likely to find one in any given humanities department.

This unfortunately is not only true of statistics, but also of GIS, network science, computer science, textual analysis, and many other disciplines we digital humanists love to borrow from. Thankfully, the NEH ODH’s Institutes for Advanced Topics in the Humanities, UVic’s Digital Humanities Summer Institutes, and other programs out there are improving our collective expertise, but a quick look for GIS/Stats/SNA/etc. courses in most humanities departments still produces slim pickings.

Math is scary. (I can’t find attribution, sorry. Anybody know who drew this?)

One of the best things to come out of the #hacker movement in the Digital Humanities has been the spirit to get our collective hands dirty and learn the techniques ourselves. It’s been a long time coming, and happier days are sure to follow, but one skill still seems underrepresented from the DH purview: statistics.

Why Statistics? Why Bayesian Statistics?

In a recent post, Elijah Meeks called Text Analysis, Spatial Analysis, and Network Analysis the “three pillars” of DH research, with a sneaking suspicion that Image Analysis should fit somewhere in there as well. This seems to be the converging sentiment in most DH circles, and although most would say, when asked, that statistics is also important, it still doesn’t seem to be among the first subjects named.

With another round of Digging Into Data winners chosen, and a bevy of panels and presentations dedicating themselves to Big Data in the Humanities, the first direction we should point is statistics. Statistics is a tool uniquely built for understanding lots of data, and it was developed with full knowledge that the data may be incomplete, biased, or otherwise imperfect, and has legitimate work-arounds for most such occasions. Of course, all the caveats in my first Networks Demystified post apply here: don’t use it without fully understanding it, and changing it where necessary.

http://vadlo.com/cartoons.php?id=71

Many Humanists, even digital ones, frequently seem to have a (justifiably) knee-jerk reaction to statistics. If you’ve been following the Twitter and blog conversations about AHA 2012, you probably caught a flurry of discussion over Google Ngrams. Conversation tended toward horrified screams of the dangers of correlation vs. causation (or at least references to xkcd), and the ease with which one might lie via statistics or omission. These are all valid cautions, especially where ngrams is concerned, but I sometimes fear we get so caught up in bad examples that we spend more time apologizing for them than fixing them. Ted Underwood has a great post about just this, which I will touch on again shortly. (And, to Ted and Allen specifically, I’m guessing you both will enjoy this post.)

In short: statistics is useful. To quote the above-linked xkcd comic:

Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.

So how do we go about using statistics? In a comment on Ted’s recent post about statistics, Trevor Owens wrote:

if you just start signing up for statistics courses you are going to end up getting a rundown on using t-tests and ANOVAs as tools for hypothesis testing. The entire hypothesis testing idea remains a core part of how a lot of folks in the social sciences think about things and it is deeply at odds with what humanists want to do.

The key is not appropriation but adaption. We must learn statistics, even the hypothesis testing, so that we might find what methods are useful, what might be changed, and how we can get it to work for us. We’re humanists. We’re really good at methodological critique.

One of the areas of statistics most likely to bear fruit for humanists is Bayesian statistics. Some of us already use it in our text mining algorithms, although the math involved remains occult to most. It basically builds uncertainty and belief directly into statistics. Instead of coming up with one correct answer, Bayesian analysis often yields a range of more or less probable answers depending on what seems to be the case from prior evidence, and can update and improve that range as more is learned.
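To make "a range that updates as more is learned" concrete, here is a small Python sketch of Bayesian updating on a grid of candidate answers. Everything in it is an assumption for illustration: the archive of letters, the counts, the flat prior, and the function names `update` and `credible_range` are all hypothetical, and a real analysis would pick its prior and model with far more care.

```python
def update(prior, grid, hits, misses):
    """Bayes' rule on a grid: posterior is proportional to prior times likelihood."""
    unnormalized = [
        p * theta ** hits * (1 - theta) ** misses
        for p, theta in zip(prior, grid)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def credible_range(posterior, grid, mass=0.95):
    """Central range of candidate values holding roughly `mass` of the belief."""
    tail = (1 - mass) / 2
    low, high = grid[0], grid[-1]
    cumulative, found_low = 0.0, False
    for p, theta in zip(posterior, grid):
        cumulative += p
        if not found_low and cumulative >= tail:
            low, found_low = theta, True
        if cumulative >= 1 - tail:
            high = theta
            break
    return low, high


# Candidate answers: the share of letters in a (hypothetical) archive written in Latin.
grid = [i / 100 for i in range(101)]
prior = [1 / len(grid)] * len(grid)  # assumed flat prior: no candidate favored going in

after_first = update(prior, grid, hits=3, misses=7)            # 3 of the first 10 letters
after_second = update(after_first, grid, hits=12, misses=28)   # 12 of the next 40

print(credible_range(after_first, grid))
print(credible_range(after_second, grid))
```

The first batch of evidence leaves a wide range of plausible answers; feeding that posterior back in as the prior for the second batch narrows it.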

The one XKCD comic nobody seems to have linked to. (http://xkcd.com/892/)

For humanists, this importance is (at least) two-fold. Ted Underwood sums up the first reason nicely:

[Bayesian inference] is amazingly, almost bizarrely willing to incorporate subjective belief into its definition of knowledge. It insists that definitions of probability have to depend not only on observed evidence, but on the “prior probabilities” that we expected before we saw the evidence. If humanists were more familiar with Bayesian statistics, I think it would blow a lot of minds.

The second and more specific reason worth mentioning here deals with the ranges I discussed above. If a historian, for example, is trying to understand how and why some historical event happened, Bayesian analysis could yield which set of occurrences were more or less likely, and which were so far off as to not be worth considering. By trying to find reasonable boundary conditions rather than exact explanations to answer our questions, humanists can retain that core knowledge that humans and human situations are not wholly deterministic machines, who all act the same and reproduce the same results in every situation.

We are intrinsically and inextricably inexact, and until we get computers that see and remember everything, and model it all perfectly, we should avoid looking for exact answers. Bayesian statistics, instead, can help us find a range of reasonable answers, with full awareness and use of the beliefs and evidence we have going in.

A Call to Arms

After I read that post about a scientist’s thanksgiving, I realized I didn’t want to have to rely on that one colleague who knows statistics. Nobody should. That’s why I decided to enroll in a Bayesian Data Analysis course this semester, taught by and using the book of John K. Kruschke. It’s a very readable book, directed toward people with no prior knowledge in statistics or programming, and takes you through the basics of both. Kruschke’s got a blog worth reading, as does Andrew Gelman, an author of the book Bayesian Data Analysis. I’m sure a basic Google search can point you to video lectures, if that’s your thing. I’ll also try to blog about it over the coming months as I learn more.

There are several (occasionally apocryphal) anecdotes about the great theoretical physicists of the early 20th century needing to go back to school to learn basic statistics. Some still weren’t terribly happy about it (“God does not play dice with the universe”), but in the end, pressures from the changing nature of their theories required a thorough understanding of statistics. As humanists begin to deal with a glut of information we never before had access to, it’s time we adapt in a similar fashion.

The wide angle, the distant reading, the longue durée will all benefit from a deeper understanding of statistics. That knowledge, in tandem with traditional close reading skills, will surely become one of the pillars of humanities research as Big Data becomes ever-more common.

Reader Comments (6)

Ryan Shaw, Jan 10, 2012 3:01 pm

You might be interested in Aviezer Tucker’s book Our Knowledge of the Past, which argues that historiographical practice is best understood from a Bayesian perspective.

Scott Weingart, Jan 10, 2012 6:29 pm

This is fantastic, thank you! I will definitely take a look at it.

Ted Underwood, Jan 11, 2012 1:46 pm

You’re right that I enjoyed the post! Also, Kruschke’s book looks a lot more accessible than the one I got out of our library. I’ve convinced myself that I mostly “understand” that one, but I might just read Kruschke’s to make sure that I actually do!

Allen Riddell, Jan 12, 2012 2:49 am

Great post. Thanks Scott.

Here’s my favorite quote on the subject:

The atmosphere of the Bayesian revival is captured in a comment by Rivett on [Dennis] Lindley’s move to University College London and the premier chair of statistics in Britain: “it was as though a Jehovah’s Witness had been elected Pope.”

Also worth mentioning might be a recent book from Yale UP that is addressed to a general audience: The Theory That Would Not Die by Sharon McGrayne https://www.powells.com/biblio/62-9780300169690-0

Ben, Jan 12, 2012 5:50 am

I agree that the publication of Kruschke’s book is probably going to be a watershed moment in making Bayesian statistics widely accessible. I’d also recommend Simon Jackman’s ‘Bayesian Analysis for the Social Sciences’ (2009) and his class notes here: http://jackman.stanford.edu/classes/BASS. Another decent one is Ntzoufras’ ‘Bayesian Modeling Using WinBUGS: An introduction’ (2009) http://stat-athens.aueb.gr/~jbn/winbugs_book

Scott Weingart, Jan 12, 2012 11:15 am

Thanks, Ben, those look like fantastic resources. It’s worth pointing out that both Jackman and Kruschke suggest using JAGS over BUGS for Markov chains.

Topic Modeling and Network Analysis

2011-11-15T00:00:00+00:00

Topic Modeling and Network Analysis

According to Google Scholar, David Blei’s first topic modeling paper has received 3,540 citations since 2003. Everybody’s talking about topic models. Seriously, I’m afraid of visiting my parents this Hanukkah and hearing them ask “Scott… what’s this topic modeling I keep hearing all about?” They’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination.

Since shortly after Blei’s first publication, researchers have been looking into the interplay between networks and topic models. This post will be about that interplay, looking at how they’ve been combined, what sorts of research those combinations can drive, and a few pitfalls to watch out for. I’ll bracket the big elephant in the room until a later discussion, whether these sorts of models capture the semantic meaning for which they’re often used. This post also attempts to introduce topic modeling to those not yet fully ~~converted~~ aware of its potential.

Citations to Blei (2003) from ISI Web of Science. There are even two citations already from 2012; where can I get my time machine?

A brief history of topic modeling

In my recent post on IU’s awesome alchemy project, I briefly mentioned Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) during the discussion of topic models. They’re intimately related, though LSA has been around for quite a bit longer. Without getting into too much technical detail, we should start with a brief history of LSA/LDA.

The story starts, more or less, with a tf-idf matrix. Basically, tf-idf ranks words based on how important they are to a document within a larger corpus. Let’s say we want a list of the most important words for each article in an encyclopedia.

Our first pass is obvious. For each article, just attach a list of words sorted by how frequently they’re used. The problem with this is immediately obvious to anyone who has looked at word frequencies; the top words in the entry on the History of Computing would be “the,” “and,” “is,” and so forth, rather than “turing,” “computer,” “machines,” etc. The problem is solved by tf-idf, which scores the words based on how special they are to a particular document within the larger corpus. Turing is rarely used elsewhere, but used exceptionally frequently in our computer history article, so it bubbles up to the top.
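Here is a toy version of that scoring in Python. There are several tf-idf variants; this sketch assumes one simple form (a word's share of the document, times the log of total documents over documents containing the word). The three-entry "encyclopedia" and the function name `tf_idf` are invented for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score each word in each document: term frequency times inverse document frequency."""
    n_docs = len(documents)
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    # Document frequency: in how many documents does each word appear at all?
    doc_freq = Counter(word for words in tokenized.values() for word in set(words))
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        scores[name] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical three-entry encyclopedia.
toy_encyclopedia = {
    "history of computing": "the turing machine and the computer and the turing test",
    "dairy farming": "the cow and the farm and the milk",
    "theatre": "the stage and the play and the actor",
}

scores = tf_idf(toy_encyclopedia)
ranked = sorted(scores["history of computing"].items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

Because "the" and "and" appear in every entry, their inverse document frequency is zero and they drop to the bottom, while "turing" rises to the top of the computing entry.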

LSA and pLSA

LSA utilizes these tf-idf scores¹ within a larger term-document matrix. Every word in the corpus is a different row in the matrix, each document has its own column, and the tf-idf score lies at the intersection of every document and word. Our computing history document will probably have a lot of zeroes next to words like “cow,” “shakespeare,” and “saucer,” and high marks next to words like “computation,” “artificial,” and “digital.” This is called a sparse matrix because it’s mostly filled with zeroes; most documents use very few of the words in the entire corpus.

With this matrix, LSA uses singular value decomposition to figure out how each word is related to every other word. Basically, the more often words are used together within a document, the more related they are to one another. [^2] It’s worth noting that a “document” is defined somewhat flexibly. For example, we can call every paragraph in a book its own “document,” and run LSA over the individual paragraphs.

To get an idea of the sort of fantastic outputs you can get with LSA, do check out the implementation over at The Chymistry of Isaac Newton.

Newton Project LSA

The method was significantly improved by Puzicha and Hofmann (1999), who did away with the linear algebra approach of LSA in favor of a more statistically sound probabilistic model, called probabilistic latent semantic analysis (pLSA). Now is the part of the blog post where I start getting hand-wavy, because explaining the math is more trouble than I care to take on in this introduction.

Essentially, pLSA imagines an additional layer between words and documents: topics. What if every document isn’t just a set of words, but a set of topics? In this model, our encyclopedia article about computing history might be drawn from several topics. It primarily draws from the big platonic computing topic in the sky, but it also draws from the topics of history, cryptography, lambda calculus, and all sorts of other topics to a greater or lesser degree.

Now, these topics don’t actually exist anywhere. Nobody sat down with the encyclopedia, read every entry, and decided to come up with the 200 topics from which every article draws. pLSA infers topics based on what will hereafter be referred to as black magic. Using the dark arts, pLSA “discovers” a bunch of topics, attaches them to a list of words, and classifies the documents based on those topics.

LDA

Blei et al. (2003) vastly improved upon this idea by turning it into a generative model of documents, calling the model Latent Dirichlet allocation (LDA). By this time, as well, some sounder assumptions were being made about the distribution of words and document length — but we won’t get into that. What’s important here is the generative model.

Imagine you wanted to write a new encyclopedia entry, let’s say about digital humanities. Well, we now know there are three elements that make up that process, right? Words, topics, and documents. Using these elements, how would you go about writing this new article on digital humanities?

First off, let’s figure out what topics our article will consist of. It probably draws heavily from topics about history, digitization, text analysis, and so forth. It also probably draws more weakly from a slew of other topics, concerning interdisciplinarity, the academy, and all sorts of other subjects. Let’s go a bit further and assign weights to these topics; 22% of the document will be about digitization, 19% about history, 5% about the academy, and so on. Okay, the first step is done!

Now it’s time to pull out the topics and start writing. It’s an easy process; each topic is a bag filled with words. Lots of words. All sorts of words. Let’s look in the “digitization” topic bag. It includes words like “israel” and “cheese” and “favoritism,” but they only appear once or twice, and mostly by accident. More importantly, the bag also contains 157 appearances of the word “TEI,” 210 of “OCR,” and 73 of “scanner.”

LDA Model from Blei (2011)

So here you are, you’ve dragged out your digitization bag and your history bag and your academy bag and all sorts of other bags as well. You start writing the digital humanities article by reaching into the digitization bag (remember, you’re going to reach into that bag for 22% of your words), and you pull out “OCR.” You put it on the page. You then reach for the academy bag and reach for a word in there (it happens to be “teaching,”) and you throw that on the page as well. Keep doing that. By the end, you’ve got a document that’s all about the digital humanities. It’s beautiful. Send it in for publication.
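The bag-reaching story translates almost directly into code. This Python sketch is a simplification under stated assumptions: it takes the topic weights as given (full LDA draws them from a Dirichlet distribution), it only includes three of the article's topics, and the "history" and "academy" word counts, like the function name `generate_document`, are made up. Only the digitization weights and the TEI/OCR/scanner counts come from the story above.

```python
import random


def generate_document(topic_weights, topic_bags, length, seed=0):
    """Write a 'document' by repeatedly picking a topic, then a word from its bag."""
    rng = random.Random(seed)
    topic_names = list(topic_weights)
    weights = [topic_weights[name] for name in topic_names]
    words = []
    for _ in range(length):
        # Step 1: choose which bag to reach into, in proportion to the topic weights.
        topic = rng.choices(topic_names, weights=weights, k=1)[0]
        # Step 2: pull one word out of that bag, in proportion to its count.
        bag = topic_bags[topic]
        word = rng.choices(list(bag), weights=list(bag.values()), k=1)[0]
        words.append(word)
    return words


# Relative weights; the remaining topics ("and so on") are left out of this toy.
topic_weights = {"digitization": 0.22, "history": 0.19, "academy": 0.05}
topic_bags = {
    "digitization": {"OCR": 210, "TEI": 157, "scanner": 73, "cheese": 1},
    "history": {"archive": 120, "century": 90, "war": 40},  # made-up counts
    "academy": {"teaching": 80, "tenure": 30},              # made-up counts
}

article = generate_document(topic_weights, topic_bags, length=20)
print(" ".join(article))
```

The output is word salad, which is the point: the model cares about which words show up and how often, not the order they arrive in.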

Alright, what now?

So why is the generative nature of the model so important? One of the key reasons is the ability to work backwards. If I can generate an (admittedly nonsensical) document using this model, I can also reverse the process and infer, given any new document and a topic model I’ve already generated, what the topics are that the new document draws from.

Another factor contributing to the success of LDA is the ability to extend the model. In this case, we assume there are only documents, topics, and words, but we could also make a model that assumes authors who like particular topics, or assumes that certain documents are influenced by previous documents, or that topics change over time. The possibilities are endless, as evidenced by the absurd number of topic modeling variations that have appeared in the past decade. David Mimno has compiled a wonderful bibliography of many such models.

While the generative model introduced by Blei might seem simplistic, it has been shown to be extremely powerful. When a newcomer sees the results of LDA for the first time, they are immediately taken by how intuitive they seem. People sometimes ask me “but didn’t it take forever to sit down and make all the topics?” thinking that some of the magic had to be done by hand. It wasn’t. Topic modeling yields intuitive results, generating what really feels like topics as we know them [^3], with virtually no effort on the human side. Perhaps it is the intuitive utility that appeals so much to humanists.

Topic Modeling and Networks

Topic models can interact with networks in multiple ways. While a lot of the recent interest in digital humanities has surrounded using networks to visualize how documents or topics relate to one another, the interfacing of networks and topic modeling initially worked in the other direction. Instead of inferring networks from topic models, many early (and recent) papers attempt to infer topic models from networks.

Topic Models from Networks

The first research I’m aware of in this niche was from McCallum et al. (2005). Their model is itself an extension of an earlier LDA-based model called the Author-Topic Model (Steyvers et al., 2004), which assumes topics are formed based on the mixtures of authors writing a paper. McCallum et al. extended that model for directed messages in their Author-Recipient-Topic (ART) Model. In ART, it is assumed that topics of letters, e-mails or direct messages between people can be inferred from knowledge of both the author and the recipient. Thus, ART takes into account the social structure of a communication network in order to generate topics. In a later paper (McCallum et al., 2007), they extend this model to one that infers the roles of authors within the social network.

Dietz et al. (2007) created a model that looks at citation networks, where documents are generated by topical innovation and topical inheritance via citations. Nallapati et al. (2008) similarly create a model that finds topical similarity in citing and cited documents, with the added ability of being able to predict citations that are not present. Blei himself joined the fray in 2009, creating the Relational Topic Model (RTM) with Jonathan Chang, which itself could summarize a network of documents, predict links between them, and predict words within them. Wang et al. (2011) created a model that allows for “the joint analysis of text and links between [people] in a time-evolving social network.” Their model is able to handle situations where links exist even when there is no similarity between the associated texts.

Networks from Topic Models

Some models have been made that infer networks from non-networked text. Broniatowski and Magee (2010 & 2011) extended the Author-Topic Model, building a model that would infer social networks from meeting transcripts. They later added temporal information, which allowed them to infer status hierarchies and individual influence within those social networks.

Many times, however, rather than creating new models, researchers create networks out of topic models that have already been run over a set of data. There are a lot of benefits to this approach, as exemplified by the Newton’s Chymistry project highlighted earlier. Using networks, we can see how documents relate to one another, how they relate to topics, how topics are related to each other, and how all of those are related to words.

Elijah Meeks created a wonderful example combining topic models with networks in Comprehending the Digital Humanities. Using fifty texts that discuss humanities computing, Elijah created a topic model of those documents and used networks to show how documents, topics, and words interacted with one another within the context of the digital humanities.

Network generated by Elijah Meeks to show how digital humanities documents relate to one another via the topics they share.

~~Elijah~~ Jeff Drouin has also created networks of topic models in Proust, as reported by Elijah.

Peter Leonard recently directed me to TopicNets, a project that combines topic modeling and network analysis in order to create an intuitive and informative navigation interface for documents and topics. This is a great example of an interface that turns topic modeling into a useful scholarly tool, even for those who know little-to-nothing about networks or topic models.

If you want to do something like this yourself, Shawn Graham recently posted a great tutorial on how to create networks using MALLET and Gephi quickly and easily. Prepare your corpus of text, get topics with MALLET, prune the CSV, make a network, visualize it! Easy as pie.
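To make the "prune the CSV, make a network" step concrete, here is a minimal sketch that is not taken from Shawn's tutorial. It assumes you have already read your document-topic proportions into a plain dictionary (the dictionary layout, the function names, and the 0.1 cutoff are all my own assumptions, not MALLET's output format), and it writes an edge list with `Source`, `Target`, and `Weight` columns, which is the layout I'd expect a tool like Gephi to accept for a spreadsheet import.

```python
import csv
import io


def doc_topic_edges(doc_topics, min_weight=0.1):
    """Turn document-topic proportions into (document, topic, weight) edges.

    doc_topics maps document -> {topic: proportion}. Links weaker than
    min_weight are pruned so the network stays readable.
    """
    edges = []
    for doc, proportions in doc_topics.items():
        for topic, weight in proportions.items():
            if weight >= min_weight:
                edges.append((doc, topic, weight))
    return edges


def write_edge_csv(edges, handle):
    """Write edges as CSV rows under a Source,Target,Weight header."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


if __name__ == "__main__":
    # Hypothetical proportions for two documents and three topics.
    example = {
        "doc_a": {"topic_0": 0.80, "topic_1": 0.15, "topic_2": 0.05},
        "doc_b": {"topic_0": 0.05, "topic_1": 0.25, "topic_2": 0.70},
    }
    buffer = io.StringIO()
    write_edge_csv(doc_topic_edges(example), buffer)
    print(buffer.getvalue())
```

The result is a two-mode network: documents connect to topics, never directly to each other. Swap `io.StringIO()` for an opened file to get something you can load into a network tool.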

Networks can be a great way to represent topic models. Beyond simple uses of navigation and relatedness as were just displayed, combining the two will put the whole battalion of network analysis tools at the researcher’s disposal. We can use them to find communities of similar documents, pinpoint those documents that were most influential to the rest, or perform any of a number of other workflows designed for network analysis.

As with anything, however, there are a few setbacks. Topic models are rich with data. Every document is related to every other document, if some only barely. Similarly, every topic is related to every other topic. By deciding to represent document similarity over a network, you must make the decision of precisely how similar you want a set of documents to be if they are to be linked. Having a network with every document connected to every other document is scarcely useful, so generally we’ll make our decision such that each document is linked to only a handful of others. This allows for easier visualization and analysis, but it also destroys much of the rich data that went into the topic model to begin with. This information can be more fully preserved using other techniques, such as multidimensional scaling.
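One common way to make that decision is to link each document only to its few nearest neighbors. The sketch below assumes each document is represented by its vector of topic proportions and uses cosine similarity, which is my choice for illustration rather than a method prescribed by any of the projects above; the names `cosine` and `top_k_links` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_links(doc_vectors, k=2):
    """Keep, for each document, only links to its k most similar documents.

    doc_vectors maps document -> list of topic proportions.
    Returns {(doc_a, doc_b): similarity} with each pair stored once.
    """
    links = {}
    for doc, vec in doc_vectors.items():
        scored = sorted(
            (
                (cosine(vec, other_vec), other)
                for other, other_vec in doc_vectors.items()
                if other != doc
            ),
            reverse=True,
        )
        for score, other in scored[:k]:
            pair = tuple(sorted((doc, other)))  # undirected: store a-b once
            links[pair] = score
    return links


if __name__ == "__main__":
    # Hypothetical topic proportions for four documents.
    vectors = {
        "a": [0.9, 0.1, 0.0],
        "b": [0.8, 0.2, 0.0],
        "c": [0.0, 0.1, 0.9],
        "d": [0.0, 0.2, 0.8],
    }
    for pair, score in sorted(top_k_links(vectors, k=1).items()):
        print(pair, round(score, 3))
```

With `k=1`, the four toy documents go from six possible pairs down to two links. The four discarded pairs had small but nonzero similarities, and that is precisely the information the network no longer carries.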

A somewhat more theoretical complication means these network representations are useful as tools for navigation, discovery, and exploration, but not necessarily as evidentiary support. Creating a network of a topic model of a set of documents piles on abstractions. Each of these systems comes with very different assumptions, and it is unclear what complications arise when combining these methods ad hoc.

Getting Started

Although there may be issues with the process, the combination of topic models and networks is sure to yield much fruitful research in the digital humanities. There are some fantastic tutorials out there for getting started with topic modeling in the humanities, such as Shawn Graham’s post on Getting Started with MALLET and Topic Modeling, as well as on combining them with networks, such as this post from the same blog. Shawn is right to point out MALLET, a great tool for starting out, but you can also find the code used for various models on many of the model-makers’ academic websites. One code package that stands out is Chang’s implementation of LDA and related models in R.

Airoldi, Edoardo M., David M. Blei, Stephen E. Fienberg, and Eric P. Xing. 2008. “Mixed Membership Stochastic Blockmodels.” The Journal of Machine Learning Research 9 (June): 1981–2014. http://dl.acm.org/citation.cfm?id=1390681.1442798.

AlSumait, Loulwah, Daniel Barbará, James Gentle, and Carlotta Domeniconi. 2009. “Topic Significance Ranking of LDA Generative Models.” In Machine Learning and Knowledge Discovery in Databases, edited by Wray Buntine, Marko Grobelnik, Dunja Mladenić, and John Shawe-Taylor, 5781:67–82. Berlin, Heidelberg: Springer Berlin Heidelberg. http://www.springerlink.com/content/v3jth868647716kg/.

Bamman, David, Brendan O’Connor, and Noah Smith. 2013. “Learning Latent Personas of Film Characters.” In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria.

Binder, Jeffrey M., and Collin Jennings. 2014. “Visibility and Meaning in Topic Models and 18th-century Subject Indexes.” Literary and Linguistic Computing (May 7): fqu017. doi:10.1093/llc/fqu017. http://llc.oxfordjournals.org/content/early/2014/05/06/llc.fqu017.

Blei, David M. 2012. “Probabilistic Topic Models.” Communications of the ACM 55 (4) (April 1): 77. doi:10.1145/2133806.2133826. http://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext.

Blei, David M. 2011. “Introduction to Probabilistic Topic Models.” Communications of the ACM.

Blei, David M., and John D. Lafferty. 2006. “Dynamic Topic Models.” In Proceedings of the 23rd International Conference on Machine Learning, 113–120. ICML ’06. New York, NY, USA: ACM. doi:10.1145/1143844.1143859. http://doi.acm.org/10.1145/1143844.1143859.

Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” The Annals of Applied Statistics 1 (1) (June 1): 17–35. http://www.jstor.org/stable/4537420.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (March): 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937.

Block, Sharon. 2006. “Doing More with Digitization.” Common-Place 6 (2) (January).

Boyd-Graber, Jordan, and David M. Blei. 2009. “Multilingual Topic Models for Unaligned Text.” In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 75–82. UAI ’09. Arlington, Virginia, United States: AUAI Press. http://dl.acm.org/citation.cfm?id=1795114.1795124.

Broniatowski, David A., and Christopher L. Magee. 2011. “Towards a Computational Analysis of Status and Leadership Styles on FDA Panels.” In Social Computing, Behavioral-Cultural Modeling and Prediction, edited by John Salerno, Shanchieh Jay Yang, Dana Nau, and Sun-Ki Chai, 6589:212–218. Berlin, Heidelberg: Springer Berlin Heidelberg. http://www.springerlink.com/content/w655v786lp583660/.

Broniatowski, David A., and Christopher L. Magee. 2010. “Analysis of Social Dynamics on FDA Panels Using Social Networks Extracted from Meeting Transcripts.” In 2010 IEEE Second International Conference on Social Computing (SocialCom), 329–334. IEEE. doi:10.1109/SocialCom.2010.54.

Chaney, Allison J.B., and David M. Blei. 2012. “Visualizing Topic Models.” In Dublin, Ireland.

Chang, Jonathan, and David M. Blei. 2010. “Hierarchical Relational Models for Document Networks.” The Annals of Applied Statistics 4 (1) (March): 124–150. doi:10.1214/09-AOAS309. http://projecteuclid.org/euclid.aoas/1273584450.

Chang, Jonathan, and David M. Blei. 2009. “Relational Topic Models for Document Networks.” In Proceedings of the 12th International Conference on AI and Statistics. Clearwater Beach, Florida. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.186.6279.

Dietz, Laura, Steffen Bickel, and Tobias Scheffer. 2007. “Unsupervised Prediction of Citation Influences.” In Proceedings of the 24th International Conference on Machine Learning, 233–240. ICML ’07. New York, NY, USA: ACM. doi:10.1145/1273496.1273526. http://doi.acm.org/10.1145/1273496.1273526.

Erosheva, Elena, Stephen E. Fienberg, and John D. Lafferty. 2004. “Mixed-membership Models of Scientific Publications.” Proceedings of the National Academy of Sciences 101 (January 23): 5220–5227. doi:10.1073/pnas.0307760101. http://www.pnas.org/content/101/suppl.1/5220.short.

Gardner, Matthew J., Joshua Lutes, Jeff Lund, Josh Hansen, Dan Walker, Eric Ringger, and Kevin Seppi. 2010. “The Topic Browser: An Interactive Tool for Browsing Topic Models.” In .

Gerrish, Sean, and David M. Blei. 2010. “A Language-based Approach to Measuring Scholarly Impact.” In Proceedings of the 26th International Conference on Machine Learning. Haifa, Israel. http://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf.

Gerrish, Sean, and David M. Blei. 2009. “Modeling Influence in Text Corpora” presented at the NIPS Workshop on Applications for Topic Models: Text and Beyond., Whistler, Canada.

Girolami, Mark, and Ata Kabán. 2003. “On an Equivalence Between PLSI and LDA.” In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, 433–434. SIGIR ’03. New York, NY, USA: ACM. doi:10.1145/860435.860537. http://doi.acm.org/10.1145/860435.860537.

Goldstone, Andrew, and Ted Underwood. 2012. “What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship?” Blog. ARCADE. 12–14. http://arcade.stanford.edu/blogs/what-can-topic-models-pmla-teach-us-about-history-literary-scholarship.

Gretarsson, Brynjar, John O’Donovan, Svetlin Bostandjiev, Tobias Hollerer, Arthur Asuncion, David Newman, and Padhraic Smyth. 2011. “TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling.” In ACM Transactions on Intelligent Systems and Technology, 5:1–26.

Hall, David, Daniel Jurafsky, and Christopher D. Manning. 2008. “Studying the History of Ideas Using Topic Models.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 363–371. EMNLP ’08. Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1613715.1613763.

Jockers, Matthew. 2013. Macroanalysis: Digital Methods and Literary History. UIUC Press.

Laudun, John, and Jonathan Goodwin. 2013. “Computing Folklore Studies: Mapping over a Century of Scholarly Production through Topics.” Journal of American Folklore 126 (502) (Autumn): 455–475. doi:10.1353/jaf.2013.0063. http://muse.jhu.edu/login?auth=0&type=summary&url=/journals/journal_of_american_folklore/v126/126.502.laudun.html.

McCallum, Andrew, Andrés Corrada-Emmanuel, and Xuerui Wang. 2005. “Topic and Role Discovery in Social Networks.” In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 786–791. IJCAI’05. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=1642293.1642419.

McCallum, Andrew, Xuerui Wang, and Andrés Corrada-Emmanuel. 2007. “Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email.” Journal of Artificial Intelligence Research 30 (1) (October): 249–272. http://dl.acm.org/citation.cfm?id=1622637.1622644.

Mei, Qiaozhu, Deng Cai, Duo Zhang, and ChengXiang Zhai. 2008. “Topic Modeling with Network Regularization.” In Proceeding of the 17th International Conference on World Wide Web, 101–110. WWW ’08. New York, NY, USA: ACM. doi:10.1145/1367497.1367512. http://doi.acm.org/10.1145/1367497.1367512.

Mimno, David. 2012. “Computational Historiography: Data Mining in a Century of Classics Journals.” J. Comput. Cult. Herit. 5 (1) (April): 3:1–3:19. doi:10.1145/2160165.2160168. http://doi.acm.org/10.1145/2160165.2160168.

Mimno, David, and Andrew McCallum. 2007. “Mining a Digital Library for Influential Authors.” In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, 105–106. JCDL ’07. New York, NY, USA: ACM. doi:10.1145/1255175.1255196. http://doi.acm.org/10.1145/1255175.1255196.

Nallapati, Ramesh M., Amr Ahmed, Eric P. Xing, and William W. Cohen. 2008. “Joint Latent Topic Models for Text and Citations.” In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 542–550. KDD ’08. New York, NY, USA: ACM. doi:10.1145/1401890.1401957. http://doi.acm.org/10.1145/1401890.1401957.

Newman, David J., and Sharon Block. 2006. “Probabilistic Topic Decomposition of an Eighteenth-century American Newspaper.” Journal of the American Society for Information Science and Technology 57 (6) (April): 753–767. doi:10.1002/asi.20342. http://doi.wiley.com/10.1002/asi.20342.

Riddell, Allen B. 2012. “How to Read 22,198 Journal Articles: Studying the History of German Studies with Topic Models.” In St. Louis, MO. http://ariddell.org/static/how-to-read-n-articles.pdf.

Rosen-Zvi, Michal, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. “The Author-topic Model for Authors and Documents.” In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 487–494. UAI ’04. Arlington, Virginia, United States: AUAI Press. http://dl.acm.org/citation.cfm?id=1036843.1036902.

Rusch, Thomas, Paul Hofmarcher, Reinhold Hatzinger, and Kurt Hornik. 2013. “Model Trees with Topic Model Pre-processing: An Approach for Data Journalism Illustrated with the WikiLeaks Afghanistan War Logs.” The Annals of Applied Statistics.

Steyvers, Mark, and Thomas Griffiths. 2006. “Probabilistic Topic Models.” In Latent Semantic Analysis: A Road to Meaning, edited by T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, 427:424–440.

Tangherlini, Timothy R., and Peter Leonard. 2014. “Trawling in the Sea of the Great Unread: Sub-corpus Topic Modeling and Humanities Research.” Poetics. doi:10.1016/j.poetic.2013.08.002. http://www.sciencedirect.com/science/article/pii/S0304422X13000648.

Wang, Eric, Jorge Silva, Rebecca Willett, and Lawrence Carin. 2011. “Dynamic Relational Topic Model for Social Network Analysis with Noisy Links.” In 2011 IEEE Statistical Signal Processing Workshop (SSP), 497–500. IEEE. doi:10.1109/SSP.2011.5967741.

Reader Comments

Ted Underwood, 2011-11-16 13:34

Excellent, clear post, and I really appreciate the links to the Isaac Newton Chymistry project and TopicNets. Very helpful.

I’m deeply into a variant of LSA at the moment, so I’m disproportionately interested in a couple of details that most people won’t care about. E.g., I’m not sure that most versions of LSA actually use tf-idf scores in the term-doc matrix. I think the more common version may use log-entropy weighting instead of tf-idf weighting.

I actually prefer a different weighting scheme that I haven’t seen used widely, which is basically Observed frequency – Expected frequency. I would also argue that literary scholars are better off skipping the Singular Value Decomposition step, for reasons explained here: http://tedunderwood.wordpress.com/2011/10/16/lsa-is-a-marvellous-tool-but-humanists-may-no-use-it-the-way-computer-scientists-do/

But to stop geeking out about LSA and return to the main point: very helpful post. I haven’t yet tried the generative methods (pLSA and LDA), because I’m so happy with LSA itself, but I know people are excited about them and I intend to compare results systematically at some point this winter.

scottbot, 2011-11-16 14:42

Thanks! I think you’re right about the tf-idf weighting, I just figured that people just approaching LSA would be more familiar with tf-idf. I’ve added a note referencing your comment, though, because the standard certainly ought to be mentioned.

That’s a great post, I’ve never thought about the issues of SVD for the purposes of the humanities. While SVD in LSA is still useful for most of my historical retrieval needs, you make a very good point about humanists needing to think very carefully about the nitty-gritty details of algorithms that were built with other purposes in mind.

Good luck on your generative model exploration – while pLSA can technically be mathematically equivalent to LDA, it’s a lot more bothersome and misses some of LDA’s functionality, so I’d definitely recommend the latter. LDA and LSA definitely serve two very different purposes; for yours, outlined in the anti-SVD post, LSA is probably more well-suited.

Thanks for the comments… I feel like I’ve come to the DH-Text Analysis-Blog party late in the game, and I’ve been trying to read through yours to catch up!

Matt Erlin, 2011-11-16 13:52

Great post, Scott! I found the historical section particularly helpful for getting a sense of how topic modeling has evolved.

scottbot, 2011-11-16 14:43

Thanks! Glad you found it useful.

Allen Riddell, 2011-11-21 15:37

Great post. I’ve been hoping that someone would explain how the extensibility of LDA makes it quite a different kind of beast (relative to LSA).

The original LDA paper is actually pretty good on this. Another good place is the 2010 Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. expanded write-up of the author-topic model. Once you get a bit beyond LDA it’s clear there’s something being done that can’t be done with LSA. Here’s the citation:

Learning author-topic models from text corpora. M Rosen-Zvi, C Chemudugunta, T Griffiths, P Smyth, M Steyvers, ACM Transactions on Information Systems (TOIS), ACM, 2010. http://www.datalab.uci.edu/papers/AT_tois.pdf

[^1]: Ted Underwood rightly points out in the comments that other scoring systems are often used in lieu of tf-idf, most frequently log entropy.

[^2]: Yes yes, this is a simplification of actual LSA, but it’s pretty much how it works. SVD reduces the size of the matrix to filter out noise, and then each word row is treated as a vector shooting off in some direction. The vector of each word is compared to every other word, so that every pair of words has a relatedness score between them. Ted Underwood has a great blog post about why humanists should avoid the SVD step.

[^3]: They’re not, of course. We’ll worry about that later.
