<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://scottbot.github.io/dossier/feed.xml" rel="self" type="application/atom+xml" /><link href="https://scottbot.github.io/dossier/" rel="alternate" type="text/html" /><updated>2026-05-20T00:53:41+00:00</updated><id>https://scottbot.github.io/dossier/feed.xml</id><title type="html">my dossier</title><subtitle></subtitle><entry><title type="html">[f-s d] Cetus</title><link href="https://scottbot.github.io/dossier/personal%20research/2016/06/16/fsd-cetus.html" rel="alternate" type="text/html" title="[f-s d] Cetus" /><published>2016-06-16T00:00:00+00:00</published><updated>2016-06-16T00:00:00+00:00</updated><id>https://scottbot.github.io/dossier/personal%20research/2016/06/16/fsd-cetus</id><content type="html" xml:base="https://scottbot.github.io/dossier/personal%20research/2016/06/16/fsd-cetus.html"><![CDATA[<p><em>Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.</em></p>

<h1 id="f-s-d-cetus">[f-s d] Cetus</h1>

<p>Quoting Liz Losh, Jacqueline Wernimont tweeted that behind every visualization is a spreadsheet.</p>

<blockquote>
  <p><a href="https://twitter.com/lizlosh">@lizlosh</a> “behind every visualization…shhhh…is a spreadsheet” <a href="https://twitter.com/hashtag/femdh?src=hash">#femdh</a> <a href="https://twitter.com/hashtag/dhsi2016?src=hash">#dhsi2016</a></p>

  <p>— Jacqueline Wernimont (@profwernimont) <a href="https://twitter.com/profwernimont/status/742819757250871296">June 14, 2016</a></p>
</blockquote>

<p>But what, I wondered, is behind every spreadsheet?</p>

<blockquote>
  <p>But what’s behind every spreadsheet? [cue theme for “Full-Stack Dev”, hot new NPR series on the secret life of data] <a href="https://t.co/2HCLtySIwL">https://t.co/2HCLtySIwL</a></p>

  <p>— Scott B. Weingart (@scott_bot) <a href="https://twitter.com/scott_bot/status/742837305430441984">June 14, 2016</a></p>
</blockquote>

<p>Space whales.</p>

<p>Okay, maybe space whales aren’t behind <em>every</em> spreadsheet, but they’re behind this one, dated 1662, notable for the gigantic nail it hammered into the coffin of our belief that heaven above is perfect and unchanging. The following post is the first in my new series <em><a href="http://scottbot.net/tag/full-stack-dev/">full-stack dev</a> (f-s d)</em>, where I explore the secret life of data.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p><img src="images/SBW-030-hevelius.png" alt="Hevelius. Mercurius in Sole visus (1662)." /></p>

<p><em>Hevelius. <a href="https://books.google.com/books?id=r19DAAAAcAAJ&amp;dq=hevelius%20Mercurius%20in%20Sole%20visus&amp;pg=PA152#v=onepage&amp;q&amp;f=false">Mercurius in Sole visus (1662)</a>.</em></p>

<p>The Princess Bride teaches us a good story involves “fencing, fighting, torture, revenge, giants, monsters, chases, escapes, true love, miracles”. In this story, <em>Cetus</em>, three of those play a prominent role: (red) giants, (sea) monsters, and (cosmic) miracles. Also Greek myths, interstellar explosions, beer-brewing astronomers, meticulous archivists, and top-secret digitization facilities. All together, they reveal how technologies, people, and stars aligned to stick this 350-year-old spreadsheet in your browser today.</p>

<h1 id="the-sea">The Sea</h1>

<p>When Aethiopian queen Cassiopeia claimed herself more beautiful than all the sea nymphs, Poseidon was, let’s say, less than pleased. Mildly miffed. He maybe sent a sea monster named Cetus to destroy Aethiopia.</p>

<p>Because obviously the best way to stop a flood is to drown a princess, Queen Cassiopeia chained her daughter to the rocks as a sacrifice to Cetus. Thankfully the hero Perseus just happened to be passing through Aethiopia, returning home after beheading Medusa, that snake-haired woman whose eyes turned living creatures to stone. Perseus (depicted below as the world’s most boring 2-ball juggler) revealed Medusa’s severed head to Cetus, turning the sea monster to stone and saving the princess. And then they got married because traditional gender roles I guess?</p>

<p><img src="images/SBW-030-corinthian-vase.webp" alt="Corinthian vase depicting Perseus, Andromeda and Ketos." /></p>

<p><em>Corinthian vase depicting Perseus, Andromeda and Ketos. [<a href="https://en.wikipedia.org/wiki/Cetus_(mythology)#/media/File:Corinthian_Vase_depicting_Perseus,_Andromeda_and_Ketos.jpg">via</a>]</em></p>

<p>Cetaceans, you may recall from grade school, are those giant carnivorous sea-mammals that Captain Ahab warned you about. <em>Cetaceans</em>, from <em>Cetus</em>. You may also remember we have a thing for naming star constellations and dividing the sky up into sections (see the Zodiac), and that we have a long history of comparing the sky to the ocean (see Carl Sagan or <a href="https://en.wikipedia.org/wiki/Star_Trek_IV:_The_Voyage_Home">Star Trek IV</a>).</p>

<p>It should come as no surprise, then, that we’ve designated a whole section of space as ‘<a href="https://en.wikipedia.org/wiki/Sea_(astronomy)">The Sea</a>’, home of Cetus (the whale), Aquarius (the God) and Eridanus (the water pouring from Aquarius’ vase, source of river floods), Pisces (two fish tied together by a rope, <a href="https://en.wikipedia.org/wiki/Pisces_(constellation)#History_and_mythology">which makes total sense I promise</a>), Delphinus (the dolphin), and Capricornus (the goat-fish. Listen, I didn’t make these up, okay?).</p>

<p><img src="images/SBW-030-jamieson-plate21.webp" alt="Jamieson's Celestial Atlas, Plate 21 (1822)." /></p>

<p><em>Jamieson’s Celestial Atlas, Plate 21 (1822). [<a href="http://aa.usno.navy.mil/library/artwork/jamieson.htm">via</a>]</em></p>

<p><img src="images/SBW-030-jamieson-plate23.webp" alt="Jamieson's Celestial Atlas, Plate 23 (1822)." /></p>

<p><em>Jamieson’s Celestial Atlas, Plate 23 (1822). [<a href="http://aa.usno.navy.mil/library/artwork/jamieson.htm">via</a>]</em></p>

<p>Ptolemy listed most of these constellations in his <em>Almagest</em> (ca. 150 A.D.), including Cetus, along with descriptions of over a thousand stars. Ptolemy’s model, with Earth at the center and the constellations just past Saturn, set the course of cosmology for over a thousand years.</p>

<p><img src="images/SBW-030-ptolemy-cosmos.webp" alt="Ptolemy's Cosmos [by Robert A. Hatch]" /></p>

<p><em>Ptolemy’s Cosmos [<a href="http://users.clas.ufl.edu/ufhatch/pages/03-Sci-Rev/SCI-REV-Home/resource-ref-read/chief-systems/08-0PTOL3-WSYS.html">by Robert A. Hatch</a>]</em></p>

<p>In this cosmos, reigning in Western Europe for centuries past Copernicus’ death in 1543, the stars were fixed and motionless. There was no vacuum of space; every planet was embedded in a shell made of aether or quintessence (<em>quint-essence</em>, the fifth element), and each shell sat atop the next until reaching the celestial sphere. This last sphere held the stars, each one fixed to it as with a pushpin. Of course, all of it revolved around the earth.</p>

<p>The domain of heavenly spheres was assumed perfect in all sorts of ways. They slid across each other without friction, and the planets and stars were perfect spheres which could not change and were unmarred by inconsistencies. One reason it was so difficult for even “great thinkers” to believe the earth orbited the sun, rather than vice-versa, was because such a system would be at complete odds with how people knew physics to work. It would break gravity, break motion, and break the outer perfection of the cosmos, which was essential (…<em>heh</em>)<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> to our notions of, well, everything.</p>

<p>Which is why, when astronomers with their telescopes and their spreadsheets started systematically observing imperfections in planets and stars, lots of people didn’t believe them—even other astronomers. Over the course of centuries, though, these imperfections became impossible to ignore, and helped launch the earth in rotation ‘round the sun.</p>

<p>This is the story of one such imperfection.</p>

<h1 id="a-star-is-born-and-then-dies">A Star is Born (and then dies)</h1>

<p>Around 1296 A.D., over the course of half a year, a red giant star some 2 quadrillion miles away grew from 300 to 400 times the size of our sun. Over the next half year, the star shrank back down to its previous size. Light from the star took 300 years to reach earth, eventually striking the retina of German pastor <a href="https://en.wikipedia.org/wiki/David_Fabricius">David Fabricius</a>. It was very early Tuesday morning on August 13, 1596, and Pastor Fabricius was looking for Jupiter.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

<p>At that time of year, Jupiter would have been near the constellation Cetus (remember our sea monster?), but Fabricius noticed a nearby bright star (labeled ‘Mira’ in the below figure) which he did not remember from Ptolemy or Tycho Brahe’s star charts.</p>

<p><img src="images/SBW-030-cetus-mira-jupiter.webp" alt="Mira Ceti and Jupiter. [via]" /></p>

<p><em>Mira Ceti and Jupiter. [<a href="http://www.universetoday.com/99091/five-planets-around-nearby-star-tau-ceti-one-in-habitable-zone/">via</a>]</em></p>

<p>Spotting an unrecognized star wasn’t unusual, but one so bright in so common a constellation was certainly worthy of note. He wrote down some observations of the star throughout September and October, after which it seemed to have disappeared as suddenly as it appeared. The disappearance prompted Fabricius to write a letter about it to famed astronomer <a href="https://en.wikipedia.org/wiki/Tycho_Brahe">Tycho Brahe</a>, who had described a similar appearing-then-disappearing star between 1572 and 1574. Brahe jotted Fabricius’ observations down in his journal. This sort of behavior, after all, was a bit shocking for a supposedly fixed and unchanging celestial sphere.</p>

<p>More shocking, however, was what happened 13 years later, on February 15, 1609. Once again searching for Jupiter, pastor Fabricius spotted another new star in the same spot as the last one. Tycho Brahe having recently died, Fabricius wrote a letter to his astronomical successor, <a href="https://en.wikipedia.org/wiki/Johannes_Kepler">Johannes Kepler</a>, describing the miracle. This was unprecedented. No star had ever vanished and returned, and nobody knew what to make of it.</p>

<p>Unfortunately for Fabricius, nobody did make anything of it. His observations were either ignored or, occasionally, dismissed as an error. To add injury to insult, a local goose thief killed Fabricius with a shovel blow, thus ending his place in this star’s story, among other stories.</p>

<h1 id="mira-ceti">Mira Ceti</h1>

<p>Three decades passed. On the winter solstice, 1638, Johannes Phocylides Holwarda prepared to view a lunar eclipse. He reported with excitement the star’s appearance and, by August 1639, its disappearance. The new star, Holwarda claimed, should be considered of the same class as Brahe, Kepler, and Fabricius’ new stars. As much a surprise to him as Fabricius, Holwarda saw the star again on November 7, 1639. Although he was not aware of it, his new star was the same as the one Fabricius spotted 30 years prior.</p>

<p>Two more decades passed before the new star in the neck of Cetus would be systematically sought and observed, this time by Johannes Hevelius: local politician, astronomer, and brewer of fine beers. By that time many had seen the star, but it was difficult to know whether it was the same celestial body, or even what was going on.</p>

<p>Hevelius brought everything together. He found recorded observations from Holwarda, Fabricius, and others, from today’s Netherlands to Germany to Poland, and realized these disparate observations were of the same star. Befitting its puzzling and seemingly miraculous nature, Hevelius dubbed the star <em>Mira</em> (miraculous) <em>Ceti.</em> The image below, from Hevelius’ <em>Firmamentum Sobiescianum sive Uranographia</em> (1687), depicts <em>Mira Ceti</em> as the bright star in the sea monster’s neck.</p>

<p><img src="images/SBW-030-hevelius-firmamentum.webp" alt="Hevelius. Firmamentum Sobiescianum sive Uranographia (1687)." /></p>

<p><em>Hevelius. Firmamentum Sobiescianum sive Uranographia (1687).</em></p>

<p>Going further, from 1659 to 1683, Hevelius observed <em>Mira Ceti</em> in a more consistent fashion than any before. There were eleven recorded observations in the 65 years between Fabricius’ first sighting of the star and Hevelius’ undertaking; in the following three years, he recorded 75 more such observations. Oddly, while Hevelius was a remarkably meticulous observer, he insisted the star was inherently unpredictable, with no regularity in its reappearances or variable brightness.</p>

<p>Beginning shortly after Hevelius, the astronomer Ismaël Boulliau also undertook a thirty year search for <em>Mira Ceti</em>. He even published a prediction, that the star would go through its vanishing cycle every 332 days, which turned out to be incredibly accurate. As today’s astronomers note, <em>Mira Ceti</em>’s brightness increases and decreases by several orders of magnitude every 331 days, caused by an interplay between radiation pressure and gravity in the star’s gaseous exterior.</p>

<p><img src="images/SBW-030-mira-galex-composite.webp" alt="Mira Ceti composite taken by NASA's Galaxy Evolution Explorer. [via]" /></p>

<p><em>Mira Ceti composite taken by NASA’s Galaxy Evolution Explorer. [<a href="https://en.wikipedia.org/wiki/Mira#/media/File:A-Mira-Full_down_sampled_and_cropped.jpg">via</a>]</em></p>

<p>While of course Boulliau didn’t arrive at today’s explanation for <em>Mira</em>’s variability, his solution did require a rethinking of the fixity of stars, and eventually contributed to the notion that maybe the same physical laws that apply on Earth also rule the sun and stars.</p>

<h1 id="spreadsheet-errors">Spreadsheet Errors</h1>

<p>But we’re not here to talk about Boulliau, or <em>Mira Ceti</em>. We’re here to talk about this spreadsheet:</p>

<p><img src="images/SBW-030-hevelius.png" alt="Hevelius. Mercurius in Sole visus (1662)." /></p>

<p><em>Hevelius. Mercurius in Sole visus (1662).</em></p>

<p>This snippet represents Hevelius’ attempt to systematically collect prior observations of <em>Mira Ceti</em>. Unreasonably meticulous readers of this post may note an inconsistency: I wrote that Johannes Phocylides Holwarda observed Mira Ceti on November 7th, 1639, yet Hevelius here shows Holwarda observing the star on <em>December</em> 7th, 1639, an entire month later. The little notes on the side are basically the observers saying: “wtf this star keeps reappearing???”</p>

<p>This mistake was not a simple printer’s error. It reappeared in Hevelius’ printed books three times: 1662, 1668, and 1685. This is an early example of what Raymond Panko and others call spreadsheet errors, <a href="http://panko.shidler.hawaii.edu/SSR/Mypapers/whatknow.htm">which appear in nearly 90% of 21st century spreadsheets</a>. Hand-entry is difficult, and mistakes are bound to happen. In this case, a game of telephone also played a part: Hevelius may have pulled some observations not directly from the original astronomers, but from the notes of Tycho Brahe and Johannes Kepler, to which he had access.</p>

<p>Unfortunately, with so few observations, and many of the early ones so sloppy, mistakes compound themselves. It’s difficult to predict a variable star’s periodicity when you don’t have the right dates of observation, which may have contributed to Hevelius’ continued insistence that <em>Mira Ceti</em> kept no regular schedule. The other contributing factor, of course, is that Hevelius worked without a telescope and under cloudy skies, and stars are hard to measure under even the best circumstances.</p>

<h1 id="to-be-continued">To Be Continued</h1>

<p>Here ends the first half of <em>Cetus</em>. The second half will cover how Hevelius’ book was preserved, the labor behind its digitization, and a bit about the technologies involved in creating the image you see.</p>

<p>Early modern astronomy is a particularly good pre-digital subject for <em>full-stack dev (f-s d)</em>, since it required vast international correspondence networks and distributed labor in order to succeed. Hevelius could not have created this table, compiled from the observations of several others, without access to cutting-edge astronomical instruments and the contemporary scholarly network.</p>

<p>You may ask why I included that whole section on Greek myths and Ptolemy’s constellations. Would as many early modern astronomers have noticed <em>Mira Ceti</em> had it not sat in the center of a familiar constellation, I wonder?</p>

<p>I promised this series would be about the secret life of data, answering the question of what’s behind a spreadsheet. <em>Cetus</em> is only the first story (well, <a href="http://scottbot.net/down-the-rabbit-hole/">second</a>, I guess), but the idea is to upturn the iceberg underlying seemingly mundane datasets to reveal the complicated stories of their creation and usage. Stay tuned for future installments.</p>

<h2 id="notes">Notes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I’m retroactively adding <a href="http://scottbot.net/down-the-rabbit-hole/">my blog rant about data underlying an equality visualization</a> to the <em>f-s d</em> series. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>this pun is only for historians of science <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Most of the historiography in this and the following section is summarized from Robert A. Hatch’s “<a href="http://link.springer.com/chapter/10.1007%2F978-94-007-0037-6_9#page-1">Discovering Mira Ceti: Celestial Change and Cosmic Continuity</a>” <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>{&quot;family&quot;=&gt;&quot;Weingart&quot;, &quot;given&quot;=&gt;&quot;Scott B.&quot;}</name></author><category term="personal research" /><category term="archives" /><category term="full-stack dev" /><category term="history of science" /><category term="republic of letters" /><category term="scholarly communication" /><summary type="html"><![CDATA[Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.]]></summary></entry><entry><title type="html">Who sits in the 41st chair?</title><link href="https://scottbot.github.io/dossier/personal%20research/2016/03/25/who-sits-in-the-41st-chair.html" rel="alternate" type="text/html" title="Who sits in the 41st chair?" /><published>2016-03-25T00:00:00+00:00</published><updated>2016-03-25T00:00:00+00:00</updated><id>https://scottbot.github.io/dossier/personal%20research/2016/03/25/who-sits-in-the-41st-chair</id><content type="html" xml:base="https://scottbot.github.io/dossier/personal%20research/2016/03/25/who-sits-in-the-41st-chair.html"><![CDATA[<p><em>Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.</em></p>

<h1 id="who-sits-in-the-41st-chair">Who sits in the 41st chair?</h1>

<p><em>By scott b. weingart · 2016-03-25</em></p>

<p><strong>tl;dr</strong> Rich-get-richer
academic prestige in a scarce job market makes meritocracy
impossible. Why some things get popular and others don’t. Also
agent-based simulations.</p>

<p><strong>Slightly longer tl;dr</strong> This post is about why
academia isn’t a meritocracy, at no intentional fault of those in power
who try to make it one. None of the ideas presented are novel on their own,
but I do intend this as a novel conceptual contribution in its
connection of disparate threads. Especially, I suggest the
predictability of research success in a scarce academic economy as a
theoretical framework for exploring successes and failures in the
history of science.</p>

<p>But mostly I just beat a “musical chairs” metaphor to death.</p>

<h1 id="positive-feedback">Positive Feedback</h1>

<p><strong>To the victor go the spoils, and to the spoiled go the victories.</strong>
Think about it: the Yankees; Alexander the Great;
Stanford University. Why do the Yankees have twice as many World
Series appearances as their nearest competitors, how was Alex’s
empire so fucking vast, and why does Stanford get all the cool grants?</p>

<p>The rich get richer. Enough World Series victories, and the Yankees
get the reputation and funding to entice the best players. Ol’ Allie-G
inherited an amazing army, was taught by Aristotle, and pretty much
every place he conquered increased his military’s numbers. Stanford’s
known for amazing tech innovation, so they get the funding, which means
they can afford even more innovation, which means <em>even more</em>
people think they’re worthy of funding, and so on down the line until
Stanford and its neighbors (Google, Apple, etc.) destroy the local real
estate market and then accidentally blow up the world.</p>

<p><img src="images/SBW-031-img-001.webp" alt="Alexander's Empire [via]" /></p>

<p><em>Alexander’s Empire [<a href="http://faculty.etsu.edu/kortumr/08hellenistic/htmdescriptionpages/01map.htm">via</a>]</em></p>

<p>Okay, maybe I exaggerated that last bit.</p>

<p>Point is, power begets power. Scientists call this a <em>positive feedback loop</em>: when a thing’s size is exactly what makes it grow larger.</p>

<p>You’ve heard it firsthand when a microphoned singer walks too
close to her speaker. First the mic picks up what’s already coming
out of the speaker. The mic, doing its job, sends what it hears to an
amplifier, sending an even louder version to the very same speaker. The
speaker replays a louder version of what it just produced, which is once
again received by the microphone, until sound <strong>feeds back</strong> onto
itself enough times to produce the ear-shattering squeal fans of
live music have come to dread. This is a positive feedback loop.</p>

<p><img src="images/SBW-031-img-002.jpg" alt="Feedback loop. [via]" /></p>

<p><em>Feedback loop. [<a href="http://tecnoiglesia.com/2013/02/como-eliminar-la-retroalimentacion-de-audio-feedback-con-ecualizacion/">via</a>]</em></p>

<p>Positive feedback loops are everywhere. They’re why <a href="http://scottbot.net/networks-demystified-3-the-power-law-rant/">the universe counts logarithmically rather than linearly</a>,
or why income inequality is so common in free market economies.
Left to their own devices, the rich tend to get richer,
since it’s easier to make money when you’ve already got some.</p>

<p>Science and academia are equally susceptible to positive feedback
loops. Top scientists, the most well-funded research institutes, and
world-famous research all got to where they are, in part, because of
something called the <em>Matthew Effect</em>.</p>

<h1 id="matthew-effect">Matthew Effect</h1>

<p>The <a href="https://en.wikipedia.org/wiki/Matthew_effect">Matthew Effect</a> isn’t the reality TV show it sounds like.</p>

<blockquote>
  <p>For unto every one that hath shall be given, and he shall
have abundance: but from him that hath not shall be taken even that
which he hath. —Matthew 25:29, King James Bible.</p>
</blockquote>

<p>It’s the Biblical idea that the rich get richer, and it’s become a
popular party trick among sociologists (yes, sociologists go to parties)
describing how society works. In academia, the phrase is brought up
alongside evidence that shows previous grant-recipients are more likely
to receive new grants than their peers, and the more money a researcher
has been awarded, the more they’re likely to get going forward.</p>

<p>The Matthew Effect is also employed metaphorically, when it
comes to citations. He who gets some citations will accrue more; she who
has the most citations will accrue them exponentially faster. There are many correct explanations, but the simplest one will do here:</p>

<p><em>If Susan’s article on the danger of
velociraptors is cited by 15 other articles, I am more likely to
find it and cite her than another article on velociraptors containing
the same information, that has never been cited</em>. <em>That’s
because when I’m reading research, I look at who’s being cited. The
more Susan is cited, the more likely I’ll eventually come across her
article and cite it myself, which in turn makes it that much more likely
that someone else will find her article through my own
citations. Continue ad nauseam.</em></p>
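<p>The Susan scenario can be sketched as a tiny agent-based simulation. This is only an illustration under assumptions of my own, not anything drawn from the citation literature: the function name <code>simulate_citations</code>, the article counts, and the rule that each new paper cites exactly one existing article with probability proportional to its citations so far (plus one, so uncited work still has a chance) are all hypothetical. Every article here is of identical quality, so any superstar that emerges owes its lead to early luck alone.</p>

```python
import random

def simulate_citations(num_articles=50, num_new_papers=2000, seed=0):
    """Hypothetical rich-get-richer model: every new paper cites one
    existing article, picked with probability proportional to
    (citations so far + 1). All articles are equally 'good'."""
    rng = random.Random(seed)
    citations = [0] * num_articles
    for _ in range(num_new_papers):
        # The +1 gives never-cited articles a small chance of discovery.
        weights = [c + 1 for c in citations]
        cited = rng.choices(range(num_articles), weights=weights, k=1)[0]
        citations[cited] += 1
    return citations

counts = sorted(simulate_citations(), reverse=True)
print("most cited:", counts[:5])
print("least cited:", counts[-5:])
```

<p>Changing the <code>seed</code> reruns history: a different article tends to end up on top each time, even though nothing about the articles themselves has changed.</p>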

<p>Some of you are thinking this is stupid. Maybe it’s trivially
correct, but missing the bigger picture: quality. What if Susan’s
velociraptor research is simply better than the competing research, and
that’s why it’s getting cited more?</p>

<p>Yes, that’s also an issue. Noticeably awful research simply won’t get much traction. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Let’s
disqualify it from the citation game. The point is there is lots
of great research out there, waiting to be read and built upon, and its
quality isn’t the sole predictor of its eventual citation success.</p>

<p>In fact, quality is a mostly-necessary but completely
insufficient indicator of research success. Superstar popularity of
research depends much more on the citation effects I mentioned above –
more citations begets even more. Previous success is the best predictor
of future success, mostly independent of the quality of research
being shared.</p>

<p><img src="images/SBW-031-img-003.webp" alt="Example of positive feedback loops pushing some articles to citation stardom." /></p>

<p><em>Example of positive feedback loops pushing some articles to citation stardom. [<a href="https://kieranhealy.org/blog/archives/2015/02/25/gender-and-citation-in-four-general-interest-philosophy-journals-1993-2013/">via</a>]</em></p>

<p>This
is all pretty hand-wavy. How do we know success is more
important than quality in predicting success? Uh, basically because
of Napster.</p>

<h1 id="popular-music">Popular Music</h1>

<p>If VH1 were to produce a retrospective on the first decade of
the 21st century, perhaps its two biggest subjects would be illegal
music sharing and VH1’s <em>I Love the 19xx…</em> TV
series. Napster came and went, followed by LimeWire, eDonkey2000,
AudioGalaxy, and other services sued by Metallica. Well-known
early internet memes like <em>Hamster Dance</em> and <em>All Your Base Are Belong To Us</em>
spread through the web like socially transmitted diseases, and
researchers found this the perfect opportunity to explore how
popularity worked. Experimentally.</p>

<p>In 2006, a group of Columbia University social scientists <a href="https://www.princeton.edu/~mjs3/salganik_dodds_watts06_full.pdf">designed a clever experiment</a>
to test why some songs became popular and others did not, relying
on the public interest in online music sharing. They created a music
downloading site which gathered 14,341 users, each one to
become a participant in their social experiment.</p>

<p>The cleverness arose out of their experimental design, which allowed
them to get past the pesky problem of history only ever happening once.
It’s usually hard to learn why something became popular, because
you don’t know what aspects of its popularity were simply random chance,
and what aspects were genuine quality. If you could, say, just rerun
the 1960s, changing a few small aspects here or there, would the Beatles
still have been as successful? We can’t know, because the 1960s are
pretty much stuck having happened as they did, and there’s not much we
can do to change it. <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>But this music-sharing site <em>could</em> rerun history—or at least,
it could run a few histories simultaneously. When they signed up, each
of the site’s 14,341 users was randomly sorted into one of several groups,
and their group number determined how they were presented music.
The musical variety was intentionally obscure, so users wouldn’t
have heard the bands before.</p>

<p>A user from the first group, upon logging in, would be shown
songs in random order, and was given the option to listen to a song,
rate it 1-5, and download it. Users from group #2, instead,
were shown the songs ranked in order of their popularity among
other members of group #2. Group #3 users were shown a similar
rank-order of popular songs, but this time determined by the song’s
popularity within group #3. So too for groups #4-#9. Every user
could listen to, rate, and download music.</p>

<p>Essentially, the researchers put the participants into 9 different
self-contained petri dishes, and waited to see which music would become
most popular in each. Group #1 was
their control: its members judged music on its quality alone,
without access to social influence. Members of groups #2-#9
could be influenced by what music was popular with their peers
within the group. The same songs circulated in each petri dish, and
each petri dish presented its own version of history.</p>
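<p>The petri-dish design can be mimicked in a few lines of simulation. What follows is a rough sketch and not the researchers’ actual method: the function names, the number of songs and listeners, the made-up “quality” scores, and the rule that a song’s appeal in a social group is its quality multiplied by its downloads so far (plus one) are all assumptions for illustration. The control group picks on quality alone; the eight social groups share the same songs but each builds its own history.</p>

```python
import random

def run_world(qualities, num_listeners, social, rng):
    """Return download counts for one self-contained group (hypothetical model)."""
    downloads = [0] * len(qualities)
    for _ in range(num_listeners):
        if social:
            # Assumed rule: appeal = intrinsic quality x visible popularity.
            weights = [q * (d + 1) for q, d in zip(qualities, downloads)]
        else:
            # Control group: listeners see no popularity information.
            weights = list(qualities)
        song = rng.choices(range(len(qualities)), weights=weights, k=1)[0]
        downloads[song] += 1
    return downloads

def run_experiment(num_songs=20, num_worlds=8, listeners_per_world=700, seed=0):
    """One control group plus several social groups over the same songs."""
    rng = random.Random(seed)
    # Invented quality scores; the real study had no such ground truth.
    qualities = [rng.uniform(0.5, 1.5) for _ in range(num_songs)]
    control = run_world(qualities, listeners_per_world, False, rng)
    worlds = [run_world(qualities, listeners_per_world, True, rng)
              for _ in range(num_worlds)]
    return qualities, control, worlds

qualities, control, worlds = run_experiment()
best = max(range(len(qualities)), key=lambda i: qualities[i])
print("highest-quality song:", best)
print("control favourite:", control.index(max(control)))
for n, world in enumerate(worlds, start=2):
    print("group", n, "favourite:", world.index(max(world)))
```

<p>Comparing each group’s favourite against the highest-quality song gives a feel for how much of a hit is quality and how much is the accident of who downloaded what first.</p>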

<p><img src="images/SBW-031-img-004.webp" alt="Music sharing site from Columbia study." /></p>

<p><em>Music sharing site from Columbia study.</em></p>

<p>No superstar songs emerged out of the control group. Positive
feedback loops weren’t built into the system, since popularity couldn’t
beget more popularity if nobody saw what their peers were listening to.
The other 8 musical petri dishes told a different story, however.
Superstars emerged in each, but each group’s population of popular music
was very different. A song’s popularity in each group was slightly
related to its quality (as judged by ranking in the control
group), but mostly it was social-influence-produced chaos. The authors
put it this way:</p>

<blockquote>
  <p>In general, the “best” songs never do very badly,
and the “worst” songs never do extremely well, but almost
any other result is possible. —Salganik, Dodds, &amp; Watts, 2006</p>
</blockquote>

<p>These results became even more pronounced when the researchers
increased the visibility of social popularity in the system. The rich
got even richer still. A lot of it has to do with timing. In each
group, the first few good songs to become popular are the ones that
eventually do the best, simply by an accident of circumstance.
The first few popular songs appear at the top of the list, for others
to see, so they in turn become even more popular, and so on <em>ad infinitum</em>. The authors go on:</p>

<blockquote>
  <p>experts fail to predict success not because they are
incompetent judges or misinformed about the preferences of others,
but because when individual decisions are subject to social
influence, markets do not simply aggregate pre-existing individual
preferences.</p>
</blockquote>

<p>In short, <strong>quality is a necessary but insufficient criterion for ultimate success</strong>. <strong>Social
influence, timing, randomness, and other non-qualitative features of
music are what turn a good piece of music into an off-the-charts hit.</strong></p>

<h1 id="wait-what-about-science">Wait, what about science?</h1>

<p>Compare this to what makes a “well-respected” scientist: it ain’t all
citations and social popularity, but they play a huge role. And as I
described above, simply out of exposure-fueled-propagation, the more
citations someone accrues, the more citations they are likely to
accrue, until we get a situation like the Yankees (<a href="https://en.wikipedia.org/wiki/List_of_World_Series_champions">40 World Series appearances, versus 20 appearances by the Giants</a>) on
our hands. Superstars are born, who are miles beyond the majority of
working researchers in terms of grants, awards, citations, etc. Social
scientists call this <em>preferential attachment</em>.</p>
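<p>A minimal sketch of that rich-get-richer dynamic, assuming a toy model that comes from neither Merton nor the Columbia study: every researcher starts with one citation, and each new citation goes to a researcher picked with probability proportional to the citations they already hold. The function names, parameters, and numbers are all made up for illustration.</p>

```python
import random


def simulate_citations(n_researchers=100, n_citations=10_000, seed=0):
    """Toy preferential attachment: each new citation picks a researcher
    with probability proportional to their current citation count."""
    rng = random.Random(seed)
    counts = [1] * n_researchers  # assumption: everyone starts out equal
    for _ in range(n_citations):
        winner = rng.choices(range(n_researchers), weights=counts, k=1)[0]
        counts[winner] += 1
    return counts


def top_share(counts, fraction=0.1):
    """Share of all citations held by the top `fraction` of researchers."""
    ranked = sorted(counts, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


counts = simulate_citations()
print(f"top 10% hold {top_share(counts):.0%} of citations")
```

<p>Nobody in this toy world is better than anybody else, so any concentration at the top comes from early luck compounding, not from merit.</p>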

<p>Which is fine, I guess. Who cares if scientific popularity is so
skewed as long as good research is happening? Even if we take the
Columbia social music experiment at face-value, an exact analog for
scientific success, we know that the most successful are always good
scientists, and the least successful are always bad ones, so what does
it matter if variability within the ranks of the successful is so
detached from quality?</p>

<p>Except, as anyone studying their <em>#OccupyWallstreet</em> knows,
it ain’t that simple in a scarce economy. When the rich get richer,
that money’s gotta come from somewhere. Like everything else (cf. <a href="https://en.wikipedia.org/wiki/Conservation_of_mass">the law of conservation of mass</a>), academia is a (mostly) zero-sum game, and to the victors go the spoils. To the losers? Meh.</p>

<p>So let’s talk scarcity.</p>

<h1 id="the-41st-chair">The 41st Chair</h1>

<p>The same guy who introduced the concept of the Matthew Effect to
scientific grants and citations, Robert K. Merton (…of Columbia
University), also brought up “the 41st chair” in <a href="http://www.garfield.library.upenn.edu/merton/matthew1.pdf">the same 1968 article</a>.</p>

<p>Merton’s pretty great, so I’ll let him do the talking:</p>

<blockquote>
  <p>In science as in other institutional realms, a special
problem in the workings of the reward system turns up when individuals
or organizations take on the job of gauging and suitably rewarding lofty
performance on behalf of a large community. Thus, that ultimate
accolade in 20th-century science, the Nobel prize, is often assumed to
mark off its recipients from all the other scientists of the time. Yet
this assumption is at odds with the well-known fact that a good number
of scientists who have not received the prize and will not receive it
have contributed as much to the advancement of science as some of the
recipients, or more.</p>

  <p>This can be described as the phenomenon of <strong>“the 41st chair.”</strong>
The derivation of this tag is clear enough. The French Academy, it will
be remembered, decided early that only a cohort of 40 could qualify as
members and so emerge as immortals. This limitation of numbers made
inevitable, of course, the exclusion through the centuries of many
talented individuals who have won their own immortality. The familiar
list of occupants of this 41st chair includes Descartes, Pascal,
Moliere, Bayle, Rousseau, Saint-Simon, Diderot, Stendahl, Flaubert,
Zola, and Proust</p>

  <p>[…]</p>

  <p>But in greater part, the phenomenon of the 41st chair is an artifact
of having a fixed number of places available at the summit of
recognition. Moreover, when a particular generation is rich in
achievements of a high order, it follows from the rule of fixed numbers
that some men whose accomplishments rank as high as those actually
given the award will be excluded from the honorific ranks. Indeed,
their accomplishments sometimes far outrank those which, in a time of
less creativity, proved enough to qualify men for this high order of
recognition.</p>

  <p>The Nobel prize retains its luster because errors of the first
kind—where scientific work of dubious or inferior worth has been
mistakenly honored—are uncommonly few. Yet limitations of the second
kind cannot be avoided. The small number of awards means that,
particularly in times of great scientific advance, there will be many
occupants of the 41st chair (and, since the terms governing the award of
the prize do not provide for posthumous recognition, permanent
occupants of that chair).</p>
</blockquote>

<p>Basically, the French Academy allowed only 40 members (chairs) at a
time. We can be reasonably certain those members were pretty great,
but we can be just as sure that equally great (or greater) women existed who
simply never got the opportunity to participate because none of the 40
members died in time.</p>

<p>These good-enough-to-be-members-but-weren’t were said to occupy the
French Academy’s 41st chair, an inevitable outcome of a scarce economy
(40 chairs) when the potential beneficiaries of this economy far
outnumber the goods available (40). The population occupying the 41st
chair is huge, and growing, since the same number of chairs has existed
since 1634, but the population of France has quadrupled in the
intervening four centuries.</p>

<p>Returning to our question of “so what if rich-get-richer doesn’t
stick the best people at the top, since at least we can assume the
people at the top are all pretty good anyway?”, scarcity of chairs is
the so-what.</p>

<p>Since <a href="https://www.higheredjobs.com/documents/HEJ_Employment_Report_2015_Q4.pdf">faculty jobs are stagnating compared to adjunct work</a>, yet <a href="http://www.nsf.gov/statistics/2016/nsb20161/uploads/1/12/fig02-21_1448906027169.png">new PhDs are being granted</a> faster
than new jobs become available, we are presented with the
much-discussed crisis in higher education. Don’t worry, we’re
told, academia is a meritocracy. With so few jobs, only the cream
of the crop will get them. The best work will still be done, even
in these hard times.</p>

<p><img src="images/SBW-031-img-005.webp" alt="Recent Science PhD growth in the U.S. [via]" /></p>

<p><em>Recent Science PhD growth in the U.S. [<a href="http://www.nsf.gov/statistics/2016/nsb20161/#/">via</a>]</em></p>

<p>Unfortunately,
as the Columbia social music study (among many other studies) showed,
true meritocracies are impossible in complex social systems. Anyone
who plays the academic game knows this already, and many are quick to
point it out when they see people in much better jobs doing incredibly
stupid things. What those who point out the falsity of meritocracy
often get wrong, however, is intention: the idea that there is no
meritocracy because those in power talk the meritocracy talk, but
don’t then walk the walk. I’ll talk a bit later about how, <em>even if everyone is above board in trying to push the best people forward</em>,
occupants of the 41st chair will still often wind up being more
deserving than those sitting in chairs 1-40.</p>

<p>For now, let’s start building a metaphor that we’ll eventually
over-extend well beyond its usefulness. Remember that kids’ game Musical
Chairs, where everyone’s dancing around a bunch of chairs while the
music is playing, but as soon as the music stops everyone’s got to find a
chair and sit down? The catch, of course, is that there are fewer
chairs than people, so someone always loses when the music stops.</p>

<p>The academic meritocracy works a bit like this. It is meritocratic,
to a point: you can’t even play the game without proving some worth. The
price of admission is a Ph.D. (which, granted, is more an endurance
test than an intelligence test, but academic success ain’t
all smarts, y’know?), a research area that at least a few people find
interesting and believe you could do good work in, etc.
It’s a pretty low meritocratic bar, <a href="http://www.nsf.gov/statistics/infbrief/nsf10308/">since it described 50,000 people who graduated in the U.S. in 2008 alone</a>, but it’s a bar nonetheless. And they’re your competition in Academic Musical Chairs.</p>

<h1 id="academic-musical-chairs">Academic Musical Chairs</h1>

<p>Time to invent a game! It’s called Academic Musical Chairs, the game
where everything’s made up and the points don’t matter. It’s like
Regular Musical Chairs, but more complicated (see Fig. 1). Also the
game is fixed.</p>

<p><img src="images/SBW-031-img-006.jpg" alt="Figure 1: Academic Musical Chairs" /></p>

<p><em>Figure 1: Academic Musical Chairs</em></p>

<p>See those 40 chairs in the middle green zone? People sitting in them
are the winners. Once they’re seated they have what we call in the
game “tenure”, and they don’t get up until they die or write
something controversial on Twitter. Everyone bustling around them, the
active players, are vying for seats while they wait for someone to die;
they occupy the yellow zone we call “the 41st chair”.
Those beyond that, in the red zone, can’t yet (or may never) afford
the price of game admission; they don’t have a Ph.D., they <em>already</em> said something controversial on Twitter, etc. The unwashed masses, you know?</p>

<p>As the music plays, everyone in the 41st chair is walking around in a
circle waiting for someone to die and the music to stop. When that
happens, everyone rushes to the empty seat. A few invariably reach it
simultaneously, until one out-muscles the others and sits down. The
sitting winner gets tenure. The music starts again, and the line
continues to orbit the circle.</p>

<p>If a player spends too long orbiting in the 41st chair, he
is forced to resign. If a player runs out of money while orbiting,
she is forced to resign. Other factors may force a player to resign, but
they will never appear in the rulebook and will always be a surprise.</p>

<p>Now, some players are more talented than others, whether
naturally or through intense training. The game calls this “academic
merit”, but it translates here to increased speed and strength, which
helps some players reach the empty chair when the music stops, even if
they’re a bit further away. The strength certainly helps when competing
with others who reach the chair at the same time.</p>

<p>A careful look at Figure 1 will reveal one other way players might
increase their chances of success when the music stops. The 41st chair
has certain internal shells, or rings, which act a bit like that fake
model of an atom everyone learned in high-school chemistry. Players, of
course, are the electrons.</p>

<p><img src="images/SBW-031-img-007.gif" alt="Electron shells. [via]" /></p>

<p><em>Electron shells. [<a href="http://www.tulane.edu/~sanelson/eens211/crystal_chemistry.htm">via</a>]</em></p>

<p>You
may remember that the further out the shell, the more electrons
can occupy it(-ish): the first shell holds 2 electrons, the
second holds 8, the third holds 18, the fourth holds 32, and so on. The
same holds true for Academic Musical Chairs: the coveted interior ring
only fits a handful of players; the second ring fits an order of
magnitude more; the third ring an order of magnitude more than that, and
so on.</p>

<p>Getting closer to the center isn’t easy, and it has very little to do
with your “academic rigor”! Also, of course, the closer you are to the
center, the easier it is to reach either the chair, or the next level
(remember <em>positive feedback loops</em>?). Contrariwise, the further you are from the center, the less chance you have of ever reaching the core.</p>

<p>Many factors affect whether a player can proceed to the next ring
while the music plays, and some factors actively count against a player.
Old age and being a woman, for example, take away 1 point. Getting
published or cited adds points, as does already being friends with
someone sitting in a chair (the details of how many points each adds can
be found in your rulebook). Obviously the closer you are to the
center, the easier you can make friends with people in the
green core, which will contribute to your score even further. Once
your score is high enough, you proceed to the next-closest shell.</p>

<p>Hooray, someone died! Let’s watch what happens.</p>

<p>The music stops. The people in the innermost ring who have the
luckiest timing (thus are closest to the empty chair) scramble for it,
and a few even reach it. Some very well-timed players from the 2nd &amp;
3rd shells also reach it, because their “academic merit” has lent them
speed and strength to reach past their position. A struggle ensues.
Miraculously, a pregnant black woman sits down (this almost <em>never</em> happens), though not without some bodily harm, and the music begins again.</p>

<p>Oh, and new shells keep getting tacked on as more players can afford
the cost of admission to the yellow zone, though the green core remains
the same size.</p>

<p>Bizarrely, this is far from the first game of this nature. A Spanish board game from 1587 called the <em><a href="https://fleurtyherald.files.wordpress.com/2013/07/filosofia-cortesana-class-apa.jpg">Courtly Philosophy</a></em> had
players move figures around a board, inching closer
to living a luxurious life in the shadow of a rich
patron. Random chance ruled their progression (a roll of the
dice), and occasionally they’d reach a tile that said things like: “Your
patron dies, go back 5 squares”.</p>

<p><img src="images/SBW-031-img-008.webp" alt="The courtier's philosophy. [via]" /></p>

<p><em>The courtier’s philosophy. [<a href="http://www.giochidelloca.it/scheda.php?id=1103">via</a>]</em></p>

<p>But I digress. Let’s temporarily table the scarcity/41st-chair discussion and get back to the Matthew Effect.</p>

<h1 id="the-view-from-inside">The View From Inside</h1>

<p>A friend recently came to me, excited but nervous about how well
they were being treated by their department at the expense of
their fellow students. “Is this what the Matthew Effect feels
like?” they asked. Their question is the reason I’m writing
this post, because I spent the next 24 hours scratching my head
over “what <em>does</em> the Matthew Effect feel like?”.</p>

<p>I don’t know if anyone’s looked at the psychological effects of
the Matthew Effect (if you do, please comment?), but my guess is
it encompasses two feelings: 1) impostor syndrome, and 2) hard
work finally paying off.</p>

<p>Since almost anyone who reaps the benefits of the Matthew Effect
in academia will be an intelligent, hard-working academic, a windfall
of accruing success should feel like finally reaping the benefits
one deserves. You probably realize that luck played a part, and that
many of your harder-working, smarter friends have been less lucky,
but there’s no doubt in your mind that, at least, your hard work is
finally paying off and the academic community is beginning to
recognize that fact. No matter how unfair it is that your great
colleagues aren’t seeing the same success.</p>

<p>But here’s the thing. You know how in physics, gravity and
acceleration feel equivalent? How, if you’re in a windowless box,
you wouldn’t be able to tell the difference between being
stationary on Earth, or being pulled by a spaceship accelerating at 9.8 m/s² through
deep space? Success from merit or from Matthew Effect probably acts
similarly, such that it’s impossible to tell one from the other from the
inside.</p>

<p><img src="images/SBW-031-img-009.webp" alt="Gravity vs. Acceleration. [via]" /></p>

<p><em>Gravity vs. Acceleration. [<a href="https://en.wikipedia.org/wiki/Introduction_to_general_relativity">via</a>]</em></p>

<p>Incidentally, that’s why the last advice you ever want to take is someone telling you how to succeed from their own experience.</p>

<p><img src="images/SBW-031-img-010.webp" alt="Success" /></p>

<p>Since we’ve seen explosive success requires but doesn’t rely
on skill, quality, or intent, the most successful people are not
necessarily in the best position to understand the reason for their own
rise. Their strategies may have paid off, but so did timing, social
network effects, and positive feedback loops. The question you should be
asking is, why didn’t other people with the same strategies also
succeed?</p>

<p>Keep this especially in mind if you’re a student, and your
tenured professor advises you to seek an academic career. They may
believe that giving you their strategies for success will help you
succeed, when really they’re just giving you one of 50,000 admission
tickets to Academic Musical Chairs.</p>

<h1 id="building-a-meritocracy">Building a Meritocracy</h1>

<p>I’m teetering well past the edge of speculation here, but I assume
the communities of entrenched
academics encouraging undergraduates into a research career
are the same communities assuming a meritocracy is at play, and are
doing everything they can in hiring and tenure review to ensure a
meritocratic playing field.</p>

<p>But <em>even if</em> gender bias did not exist, <em>even if</em> everyone responsible for decision-making genuinely wanted a meritocracy, <em>even if</em>
the game weren’t rigged at many levels, the economy of scarcity (41st
chair) combined with the Matthew Effect would ensure a true meritocracy
would be impossible. There are only so many jobs, and hiring committees
need to choose some selection criteria; those selection
criteria will be subject to scarcity and rich-get-richer effects.</p>

<p>I won’t prove that point here, because original research is beyond
the scope of this blog post, but I have a good idea of how to do
it. In fact, after I finish writing this, I probably will go do just
that. Instead, let me present very similar research, and explain
how that method can be used to answer this question.</p>

<p>We want an answer to the question of whether positive feedback loops
and a scarce economy are sufficient to prevent the possibility of a
meritocracy. In 1971, Tom Schelling asked an unrelated question
which he answered using a very relevant method: <a href="http://www.stat.berkeley.edu/~aldous/157/Papers/Schelling_Seg_Models.pdf">can racial segregation manifest in a community whose every actor is intent on not living a segregated life</a>? Spoiler alert: yes.</p>

<p>He answered this question by simulating an artificial
world, similar in spirit to the Columbia social music experiment, except
that instead of using real participants, he experimented on very simple
rule-abiding game creatures of his own invention. A bit like having a
computer play checkers against itself.</p>

<p>The experiment is simple enough: a bunch of creatures occupy a
checker board, and like checker pieces, they’re red or black. Every
turn, one creature has the opportunity to move randomly to another empty
space on the board, and their decision to move is based on their
comfort with their neighbors. Red pieces want red neighbors, and black
pieces want black neighbors, and they keep moving randomly until they’re
all comfortable. Unsurprisingly, segregated creature communities
appear in short order.</p>

<p>What if our checker-creatures were more relaxed in their comforts?
They’d be comfortable as long as they were in the majority; say, at
least 50% of their neighbors were the same color. Again, let the
computer play itself for a while, and within a few cycles the checker
board is once again almost completely segregated.</p>

<p><img src="images/SBW-031-img-011.png" alt="Schelling segregation. [via]" /></p>

<p><em>Schelling segregation. [<a href="http://nifty.stanford.edu/2014/mccown-schelling-model-segregation/">via</a>]</em></p>

<p>What
if the checker pieces are excited about the prospect of a diverse
neighborhood? We relax the criteria even more, so red checkers only move
if fewer than a third of their neighbors are red (that is, they’re
totally comfortable with 66% of their neighbors being black). If
we run the experiment again, we see, <em>again</em>, that the checker board breaks up into segregated communities.</p>
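<p>The checkerboard rules above can be sketched roughly as follows. This is a toy reading of the model, not Schelling’s own procedure: the wrap-around board, the share of empty squares, the sweep limit, and the name <code>run_schelling</code> are all assumptions made for illustration. A piece counts as uncomfortable when fewer than <code>threshold</code> of its occupied neighbors share its color, and uncomfortable pieces jump to a random empty square.</p>

```python
import random


def run_schelling(size=20, empty_frac=0.1, threshold=1 / 3, max_sweeps=100, seed=0):
    """Toy Schelling model on a size x size wrap-around board.

    Returns the average same-color neighbor share before and after
    the uncomfortable pieces have had their chances to move.
    """
    rng = random.Random(seed)
    n_cells = size * size
    n_each = (n_cells - int(n_cells * empty_frac)) // 2
    cells = ["R"] * n_each + ["B"] * n_each
    cells += [None] * (n_cells - len(cells))  # None marks an empty square
    rng.shuffle(cells)
    grid = [cells[r * size:(r + 1) * size] for r in range(size)]

    def same_share(r, c):
        """Fraction of occupied neighbors matching the piece at (r, c)."""
        same = total = 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                other = grid[(r + dr) % size][(c + dc) % size]
                if other is not None:
                    total += 1
                    if other == grid[r][c]:
                        same += 1
        return same / total if total else 1.0

    def squares(want_empty):
        """Coordinates of all empty squares, or of all occupied ones."""
        return [(r, c) for r in range(size) for c in range(size)
                if (grid[r][c] is None) == want_empty]

    def average_share():
        pieces = squares(False)
        return sum(same_share(r, c) for r, c in pieces) / len(pieces)

    before = average_share()
    for _ in range(max_sweeps):
        unhappy = [(r, c) for r, c in squares(False) if same_share(r, c) < threshold]
        if not unhappy:
            break  # everyone is comfortable
        rng.shuffle(unhappy)
        for r, c in unhappy:
            if same_share(r, c) < threshold:  # recheck: neighbors may have moved
                i, j = rng.choice(squares(True))
                grid[i][j], grid[r][c] = grid[r][c], None
    return before, average_share()


before, after = run_schelling()
print(f"same-color neighbor share: {before:.2f} before, {after:.2f} after")
```

<p>Calling <code>run_schelling(threshold=0.5)</code> corresponds to the majority-comfort version; the default of one third is the diversity-friendly version.</p>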

<p>Schelling’s claim wasn’t about how the world worked, but about
what the simplest conditions were that could still explain racism.
In his fictional checkers-world, every piece could be generously
interested in living in a diverse neighborhood, and yet the system
still eventually resulted in segregation. This offered
powerful support for the theory that racism could operate subtly,
even if every actor were well-intentioned.</p>

<p>Vi Hart and Nicky Case created an <a href="http://ncase.me/polygons/">interactive visualization/game that teaches Schelling’s segregation model</a> perfectly. Go play it. Then come back. I’ll wait.</p>

<hr />

<p>Such an experiment can be devised for our
41st-chair/positive-feedback system as well. We can even build a
simulation whose rules match the Academic Musical Chairs I described
above. All we need to do is show that a system in which both
effects operate (a fact empirically proven time and again in academia)
produces fundamental challenges for meritocracy. Such a model would
show that simple meritocratic intent is insufficient to produce a
meritocracy. Hulk smashing the myth of the meritocracy seems fun; I
think I’ll get started soon.</p>
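<p>As a rough idea of what such a simulation might look like, here is a deliberately tiny and entirely hypothetical version of Academic Musical Chairs. It is not the model promised above, and every rule in it is an assumption: merit is a random number above a minimum bar, visibility grows by attracting attention in proportion to merit times existing visibility, and each vacant chair goes to one waiting player drawn with that same weight. The function counts how many seated players also rank in the top 40 by merit.</p>

```python
import random


def play_academic_musical_chairs(n_players=1000, n_chairs=40,
                                 rounds_per_chair=50, seed=0):
    """Hypothetical toy game: returns how many of the seated players
    are also among the top `n_chairs` players by merit."""
    rng = random.Random(seed)
    # Assumption: everyone has paid the price of admission, so merit
    # varies only between 0.5 and 1.0.
    merit = [rng.uniform(0.5, 1.0) for _ in range(n_players)]
    visibility = [1.0] * n_players
    waiting = list(range(n_players))
    seated = []
    for _ in range(n_chairs):
        # Music plays: attention flows to the already visible.
        for _ in range(rounds_per_chair):
            weights = [merit[p] * visibility[p] for p in waiting]
            lucky = rng.choices(waiting, weights=weights, k=1)[0]
            visibility[lucky] += 1.0
        # Someone dies: one waiting player wins the scramble.
        weights = [merit[p] * visibility[p] for p in waiting]
        winner = rng.choices(waiting, weights=weights, k=1)[0]
        waiting.remove(winner)
        seated.append(winner)
    ranked = sorted(range(n_players), key=lambda p: merit[p], reverse=True)
    return len(set(ranked[:n_chairs]).intersection(seated))


seated_top = play_academic_musical_chairs()
print(f"{seated_top} of 40 chairs went to a top-40 player by merit")
```

<p>Merit helps at every step of this sketch, yet nothing forces the chairs to go to the most meritorious players, which is the pattern a fuller model would need to test properly.</p>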

<h1 id="the-social-network">The Social Network</h1>

<p>Our world ain’t that simple. For one, as seen in Academic Musical
Chairs, your place in the social network influences your chances of
success. A heavy-hitting advisor, an old-boys cohort, etc., all
improve your starting position when you begin the game.</p>

<p>To put it more operationally, let’s go back to the Columbia social
music experiment. Part of a song’s success was due to quality, but the
stuff that made stars was much more contingent on chance timing followed
by positive feedback loops. Two of the authors from the 2006 study
wrote <a href="https://www.gsb.stanford.edu/sites/default/files/documents/mktg_03_08_dodds_paper1.pdf">another in 2007</a>, echoing this claim that good timing was more important than individual influence:</p>

<blockquote>
  <p>models of information cascades, as well as human subjects
experiments that have been designed to test the models (Anderson and
Holt 1997; Kubler and Weizsacker 2004), are explicitly constructed such
that there is nothing special about those individuals, either in terms
of their personal characteristics or in their ability to influence
others. Thus, whatever influence these individuals exert on the
collective outcome is an accidental consequence of their randomly
assigned position in the queue.</p>
</blockquote>

<p>These articles are part of a large literature on predicting popularity, viral hits, success, and so forth. There’s <em><a href="http://arxiv.org/pdf/1202.0332.pdf">The Pulse of News in Social Media: Forecasting Popularity</a></em>
by Bandari, Asur, &amp; Huberman, which showed that a top
predictor of newspaper shares was the source rather than the content of
an article, and that a major chunk of articles that do get shared
never really make it to viral status. There’s <em><a href="http://arxiv.org/pdf/1403.4608.pdf">Can Cascades be Predicted?</a></em> by Cheng,
Adamic, Dow, Kleinberg, and Leskovec (all-star cast if ever I saw one),
which shows the remarkable reliance on timing &amp; first impressions
in predicting success, and also the reliance on social connectivity.
That is, success travels faster through those who are well-connected
(shocking, right?), and structural properties of the social network are
important. <a href="http://libtreasures.utdallas.edu/jspui/bitstream/10735.1/3218/1/SOM-SR-JHOh-310708.7.pdf">This study by Susarla et al.</a>
also shows the importance of location in the social network in
helping push those positive feedback loops, affecting the magnitude of
success in YouTube video shares.</p>

<p><img src="images/SBW-031-img-012.webp" alt="Twitter information cascade. [via]" /></p>

<p><em>Twitter information cascade. [<a href="http://www.mdpi.com/2078-2489/4/2/171?trendmd-shared=0">via</a>]</em></p>

<p>Now,
I know, social media success does not an academic career
predict. The point here, instead, is to show that in each of these
cases, before sharing occurs and not taking into account social
media effects (that is, <strong>relying solely on the merit of the thing itself</strong>), <strong>success is predictable, but stardom is not</strong>.</p>

<h1 id="concluding-finally">Concluding, Finally</h1>

<p>Relating it to Academic Musical Chairs, it’s not too difficult to say
whether someone will end up in the 41st chair, but it’s impossible to
tell whether they’ll end up in seats 1-40 until you keep an eye on how
positive feedback loops are affecting their career.</p>

<p>In the academic world, there’s a fertile prediction market for Nobel
Laureates. Social networks and Matthew Effect citation bursts are decent
enough predictors, but what anyone who predicts any kind of
success will tell you is that it’s much easier to predict the pool of
recipients than it is to predict the winners.</p>

<p>Take Economics. How many working economists are there? Tens of thousands, at least. But there’s this <em>Econometric Society</em> which
began naming Fellows in 1933, naming 877 Fellows by 2011. And guess
what, 60 of 69 Nobel Laureates in Economics before 2011 were Fellows of
the society. The other 817 members are or were occupants of the 41st
chair.</p>

<p>The point is (again, sorry), academic meritocracy is a myth. Merit is
a price of admission to the game, but not a predictor of success in a
scarce economy of jobs and resources. Once you pass the basic merit
threshold and enter the 41st chair, forces having little to do with
intellectual curiosity and rigor guide eventual success (<em><a href="https://www.timeshighereducation.com/news/twitter-creates-new-academic-hierarchies-suggests-study">ahem</a></em>).
Small positive biases like gender, well-connected advisors,
early citations, lucky timing, etc. feed back into increasingly
larger positive biases down the line. And since there are only
so many faculty jobs out there, these feedback effects create a
naturally imbalanced playing field. Sometimes Einsteins do make it
into the middle ring, and <a href="https://en.wikipedia.org/wiki/Albert_Einstein#Patent_office">sometimes they stay patent clerks</a>. Or adjuncts, I guess. Those who <em>do</em>
make it past the 41st chair are poorly-suited to tell you why,
because by and large they employed the same strategies as everybody
else.</p>

<p><img src="images/SBW-031-img-013.jpg" alt="Figure 1: Academic Musical Chairs" /></p>

<p><em>Yep, Academic Musical Chairs</em></p>

<p>And if these six thousand words weren’t enough to convince you, I leave you <a href="http://www.pnas.org/content/108/17/6889.full?_ga=1.155701947.1658306299.1400869853">with this article</a> and this tweet. Have a nice day!</p>

<blockquote>
  <p>One of the only variables I’ve ever seen that truly predicts grant success … your application number <a href="https://t.co/R7Q3k8PNck">pic.twitter.com/R7Q3k8PNck</a></p>

  <p>— Adrian Barnett (@aidybarnett) <a href="https://twitter.com/aidybarnett/status/711081456232038400">March 19, 2016</a></p>
</blockquote>

<h1 id="addendum-for-historians">Addendum for Historians</h1>

<p>You thought I was done?</p>

<p>As a historian of science, this situation has some interesting
repercussions for my research. Perhaps most importantly, it and
related concepts from Complex Systems research offer a middle ground
framework between environmental/contextual determinism (the world shapes
us in fundamentally predictable ways) and individual historical
agency (we possess the power to shape the world around us, making
the world fundamentally unpredictable).</p>

<p>More concretely, it is historically fruitful to ask not simply what
non-“scientific” strategies were employed by famous scientists to
get ahead (see Biagioli’s <em><a href="http://www.amazon.com/Galileo-Courtier-Absolutism-Conceptual-Foundations/dp/0226045609">Galileo, Courtier</a></em>), but also what did or did not set those strategies apart from the masses of people we no longer remember. <em>Galileo, Courtier</em> provides
a great example of what we historians can do on a larger scale: it
traces Galileo’s machinations to wind up in the good graces of a wealthy
patron, and how such a system affected his own research.
Using recently-available data on early modern social and scholarly
networks, as well as the beginnings of data on people’s activities,
interests, practices, and productions, it should be possible to zoom
out from Biagioli’s viewpoint and get a fairly sophisticated
picture of trajectories and practices of people who <em>weren’t</em> Galileo.</p>

<p>This is all very preliminary, just publicly blogging whims, but I’d be fascinated by what a wide-angle (dare I say, <a href="http://themacroscope.org/">macroscopic</a>?)
analysis of the 41st chair could tell us about how social and
“scientific” practices shaped one another in the 16th and 17th
centuries. I believe this would bear previously-impossible fruit,
since a lone historian grasping ten thousand tertiary actors at once is a
fool’s errand, but is a walk in the park for my laptop.</p>

<p>As this really is whim-blogging, I’d love to hear your thoughts.</p>

<hr />

<h2 id="reader-comments">Reader Comments</h2>

<blockquote>
  <p><strong>acrymble</strong>, 2016-03-26 09:10</p>

  <p>I liked your post. As one of the people who managed to get one
of the chairs, I appreciate your point that I’m not able to reflect on
the process without considerable baggage. But I’d like to engage
nonetheless.</p>

  <p>I take the point that there are many great people not getting seats
at the table. But I think what you’ve described looks at academic jobs
the wrong way round. They aren’t prizes to be collected by the best and
the brightest. They’re jobs that need doing. Jobs that involve specific
teaching (eg, who can teach Early Modern British History to our first
year students and the history of medicine to our final year students?),
administration (we need someone to run academic quality assurance), and
research (someone who does something no one else in our department does,
and that looks decent enough to publish some interesting stuff).
They’re also looking for someone who they think they can get along with
for the next 30 years, who will engage the students, care about their
work, etc.</p>

  <p>To be competitive doesn’t just mean they have a PhD. Having a PhD is
about as useful as breathing when it comes to applying for jobs. It’s
such a fundamental requirement that it becomes meaningless. These
non-competitive candidates produce job applications that probably
emphasize their really great research (which to the rest of us may look
very specific and obscure, and which quite frankly they will have very
little time to do anyway). It probably didn’t occur to them to look into
the specific teaching needs of the post so that they could highlight
that in their application. They probably haven’t built up the ability to
teach anything beyond their PhD specialisation (what ELSE can you
teach?). They probably don’t know the difference between impact and
engagement and how their work fulfills both, etc, etc.</p>

  <p>These people just don’t have enough experience or awareness of the
industry to be ‘appointable’ in their current state. Some people will
learn it over time. Others will never get it. Usually we are too polite
to tell those people to give up, which would probably be kinder. So your
50,000 people circling the chairs include a good proportion who just
aren’t competitive for a variety of reasons, chiefly, because they
thought having a PhD was the criteria and it is not a meaningful one if
everyone else in the room has one too.</p>

  <p>With that in mind, I think there is <em>limited</em> scope for merit. The
person who ‘gets it’ and does the right digging into the department,
pitches effectively for the specific job (and is qualified for that
SPECIFIC job), rounds out their skill set and talks to lots of people
about what it’s like to work as an academic or hire people, can improve
their chances of getting an interview. You put yourself amongst the MANY
very qualified people who can vie for the post. Not a guarantee, but they at
least separate themselves from the people who didn’t understand they
were applying for a job, not a prize.</p>

  <p>If you get to the interview, it becomes a blind date rather than a
game of musical chairs. They’re looking for someone who can do the very
specific job that they need doing. But they’re also looking for that
spark – the ‘je ne sais quoi’ of a long-term colleague. Just like in
dating, sometimes you connect. And sometimes you don’t. You don’t want
to end up in a bad marriage, so sometimes not getting the job is the
best outcome, despite the frustration you may feel at the time.</p>

  <p>I agree that there aren’t enough jobs for the people that want them
(there aren’t enough acting gigs for actors either). And I appreciate
luck and privilege (gender, ethnicity, age, where you went to school)
are big elements in the equation. Often a candidate has the wrong
skillset and experience for the specific jobs that are posted. That’s a
lottery and entirely unfair if you guess wrong (eg, chose a PhD topic
that becomes unsexy just as you’re finishing). But the people who get
hired almost always deserve it. That doesn’t mean the people who don’t
get hired aren’t amazing and brilliant people. But this isn’t about
rewarding brilliance. It’s about a group of 80 first year students who
need to be taught Early Modern British History and 25 final year
students who need to learn about the history of medicine.</p>

  <p>Teaching PhD students that academia is a job and not a prize is
probably one of the first steps in addressing the frustration that you
describe in this post. Whether we like to admit it or not, there are
exactly the number of academic jobs that the market can bear. The
conversation we should all be having is: what other fulfilling options
are out there for people who are passionate about their subject
knowledge? And how can we end this belief that academic jobs are prizes?</p>
</blockquote>

<blockquote>
  <blockquote>
    <p><strong>scottenderle</strong>, 2016-03-26 15:49</p>

    <p>“These people just don’t have enough experience or awareness of
the industry to be ‘appointable’ in their current state.” Certainly not.
But what about the thousands of people living as adjuncts <strong>doing the
very jobs you describe</strong>, year after year, as they struggle to find a
tenure-track job? It almost sounds as if you think that the competitors
in this system are all ABDs. But as I’m sure you must know, many of the
competitors are university-level teachers with years of experience. Many
of them have been teaching four or five classes a semester while also
maintaining an active research profile. And many of them have been
passed over by hiring committees in favor of an inexperienced ABD.</p>

    <p>I know fantastic teachers, brilliant researchers, generous colleagues
to whom this has happened multiple times. I used to worry that this was
a sign that I was mistaken in my assessment of those people. It has
taken me a very long time to adjust to the realization that their effort
and their talent may simply never be recognized.</p>

    <p>Before we can even begin to talk about “other fulfilling options” for
these people, we need to acknowledge that our academic system has
failed them.</p>
  </blockquote>
</blockquote>

<blockquote>
  <blockquote>
    <blockquote>
      <p><strong>acrymble</strong>, 2016-03-27 10:06</p>

      <p>You won’t get me arguing with you about the problems of the adjuncts. I don’t like it either.</p>
    </blockquote>
  </blockquote>
</blockquote>

<blockquote>
  <p><strong>Lincoln Mullen</strong>, 2016-03-29 02:47</p>

  <p>“Again I saw that under the sun the race is not to the swift,
nor the battle to the strong, nor bread to the wise, nor riches to the
intelligent, nor favor to the skillful; but time and chance happen to
them all.”</p>
</blockquote>

<blockquote>
  <blockquote>
    <p><strong>Scott B. Weingart</strong>, 2016-03-29 10:52</p>

    <p>“what has been said will be said again; there is nothing new under the sun.”
Or, as Lenny Bruce <a href="http://www.kpfahistory.info/dandl/lennie_bruce001.mp3">said on October 4th, 1961</a> (2:00 minutes in), before being arrested for obscenity:
“Believe me, I’m not profound, this is something that I assume someone
must have laid on me, because I do not have an original thought. I am
screwed. I speak English. That’s it. I was not born in a vacuum.”</p>
  </blockquote>
</blockquote>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Unless it’s <em>really</em> awful, but let’s avoid that discussion here. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>short of a TARDIS. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>{&quot;family&quot;=&gt;&quot;Weingart&quot;, &quot;given&quot;=&gt;&quot;Scott B.&quot;, &quot;display&quot;=&gt;&quot;scott b. weingart&quot;, &quot;url&quot;=&gt;&quot;http://scottbot.net/author/admin/&quot;}</name></author><category term="personal research" /><category term="ABMs" /><category term="academia" /><category term="bias" /><category term="complexity" /><category term="diffusion" /><category term="history of science" /><category term="human dynamics" /><category term="macroanalysis" /><category term="network analysis" /><category term="scholarly communication" /><category term="scientonomy" /><category term="social networks" /><summary type="html"><![CDATA[Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.]]></summary></entry><entry><title type="html">Historians, Doctors, and their Absence</title><link href="https://scottbot.github.io/dossier/reviews/2013/10/20/historians-doctors-and-their-absence.html" rel="alternate" type="text/html" title="Historians, Doctors, and their Absence" /><published>2013-10-20T00:00:00+00:00</published><updated>2013-10-20T00:00:00+00:00</updated><id>https://scottbot.github.io/dossier/reviews/2013/10/20/historians-doctors-and-their-absence</id><content type="html" xml:base="https://scottbot.github.io/dossier/reviews/2013/10/20/historians-doctors-and-their-absence.html"><![CDATA[<p><em>Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.</em></p>

<h1 id="historians-doctors-and-their-absence">Historians, Doctors, and their Absence</h1>

<p>[Note: sorry for the lack of polish on the post compared to others. This was hastily written before a day of international travel. Take it with however many grains of salt seem appropriate under the circumstances.]</p>

<p>[Author’s note two: Whoops! Never included the link to the article. <a href="http://arxiv.org/abs/1310.2636">Here it is</a>.]</p>

<p>Every once in a while, <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> a group of exceedingly clever mathematicians and physicists decides to do something exceedingly clever with a subject that has nothing to do with math or physics. This particular research project concerns the 14th-century Black Death, and it yields claims such as that the small-world network effect is a completely modern phenomenon and that “most social exchange among humans before the modern era took place via face-to-face interaction.”</p>

<p>The article itself is really cool. And really clever! I didn’t think of it, and I’m angry at myself for not thinking of it. They look at the empirical evidence of the spread of disease in the late middle ages, and note that the pattern of disease spread looked shockingly different than patterns of disease spread today. Epidemiologists have long known that today’s patterns of disease propagation are dependent on social networks, and so it’s not a huge leap to say that if earlier diseases spread differently, their networks must have been different too.</p>

<p>Don’t get me wrong, that’s <em>really fantastic</em>. I wish more people (read: me) would make observations like this. It’s the sort of observation that allows historians to infer facts about the past with reasonable certainty given tiny amounts of evidence. The problem is, the team included neither doctors nor historians of the late middle ages, and that absence turned an otherwise great paper into a set of questionable conclusions.</p>

<p>Small world networks have a formal mathematical definition, which (essentially) states that no matter how big the population of the world gets, everyone is within a few degrees of separation from <em>you</em>. Everyone’s an acquaintance of an acquaintance of an acquaintance of an acquaintance. This non-intuitive fact is what drives the insane speeds of modern diseases; today, an epidemic can spread from Australia to every state in the U.S. in a matter of days. Due to this, disease spread maps are weirdly patchy, based more around how people travel than geographic features.</p>

<p><img src="images/SBW-032-img-001.gif" alt="Patchy h5n1 outbreak map." /></p>

<p><em>Patchy h5n1 outbreak map.</em></p>

<p>The map of the spread of black death in the 14th century looked very different. Instead of these patches, the disease appeared to spread in very deliberate waves, at a rate of about 2km/day.</p>

<p><img src="images/SBW-032-img-002.png" alt="Spread of the plague, via the original article." /></p>

<p><em>Spread of the plague, via the original article.</em></p>

<p>How to reconcile these two maps? The solution, according to the network scientists, was to create a model of people interacting and spreading diseases across various distances and types of networks. Using the models, they show that in order to generate these wave patterns of disease spread, the physical contact network cannot be small world. From this, and from their (uncited) claim that physical contact networks had to be a subset of social contact networks (entirely ignoring, say, correspondence), they conclude that the 14th century did not have small world social networks.</p>
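
<p>To make the intuition concrete, here is a toy sketch in Python. It is not the authors’ model. It assumes a ring of 100 people who each touch only their two nearest neighbors, an infection that passes along every contact once per time step, and a hypothetical handful of long-distance shortcuts; the function names are invented for this illustration. Under those assumptions, the time the disease takes to reach a person is simply that person’s network distance from the first case.</p>

```python
from collections import deque


def ring_lattice(n):
    """Contact network where each person only meets their two nearest neighbors."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_shortcuts(graph, shortcuts):
    """Return a copy of the network with a few long-distance ties added."""
    new_graph = {node: set(neighbors) for node, neighbors in graph.items()}
    for a, b in shortcuts:
        new_graph[a].add(b)
        new_graph[b].add(a)
    return new_graph


def arrival_times(graph, seed):
    """Breadth-first search: the step at which each person is first infected."""
    times = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in times:
                times[neighbor] = times[node] + 1
                queue.append(neighbor)
    return times


# Hypothetical toy population: 100 people, first case at person 0.
lattice = ring_lattice(100)
small_world = add_shortcuts(lattice, [(0, 25), (0, 50), (0, 75)])

lattice_steps = max(arrival_times(lattice, 0).values())
small_world_steps = max(arrival_times(small_world, 0).values())
print("steps to reach everyone, local contact only:", lattice_steps)
print("steps to reach everyone, with three shortcuts:", small_world_steps)
```

<p>On the plain ring the infection rolls outward as a steady wave and needs 50 steps to reach the farthest person; with just three shortcuts it needs 13, and new outbreaks appear far from the origin long before the wave would have arrived. That contrast between waves and patches is the qualitative difference at stake, although a real epidemic model would also need recovery, death, and probabilistic transmission.</p>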

<p>There’s a lot to unpack here. First, their model does not take into account the fact that people, y’know, die after they get the plague. Their model assumes the infected have enough time and impetus to carry the disease as far as they could travel after becoming contagious. In the discussion, the authors do realize this is a stretch, but suggest that because people <em>could</em>, if they so chose, travel 40km/day, and the black death only spread 2km/day, this is not sufficient to explain the waves.</p>

<p>I am no plague historian, nor a doctor, but a brief trip on the google suggests that black death symptoms could manifest in hours, and a swift death comes only days after. It is, I think, unlikely that people would or could be traveling great distances after symptoms began to show.</p>

<p>More important to note, however, are the assumptions the authors make about social ties in the middle ages. They assume a social tie must be a physical one; they assume social ties are connected with mobility; and they assume social ties are constantly maintained. This is a bit before my period of research, which begins only a hundred years later (still before the period the authors claim could have sustained small world networks), but any early modern historian could tell you that communication was asynchronous and travel was ordered and infrequent.</p>

<p>Surprisingly, I actually believe the authors’ conclusions: that by the strict mathematical definition of small world networks, the “pre-modern” world might not have that feature. I <em>do</em> think distance and asynchronous communication prevented an entirely global 6-degree effect. That said, the assumptions they make about what a social tie is are entirely modern, which means their conclusion is essentially inevitable: historical figures did not maintain modern-style social connections, and thus metrics based on those types of connections should not apply. Taken in the social context of Europe in the late middle ages, however, I think the authors would find that the salient features of small world networks (short average path length and high clustering) exist in that world as well.</p>

<p>A second problem, and the reason I agree with the authors that there was not a global small world in the late 14th century, is that “global” is not an appropriate axis on which to measure “pre-modern” social networks. Today, we can reasonably say we all belong to a global population; at that point in time, before trade routes from Europe to the New World and because of other geographical and technological barriers, the world should instead have been seen as a set of smaller, overlapping populations. My guess is that, for more reasonable definitions of populations, small world properties would continue to hold in this time period.</p>


<hr />

<h2 id="reader-comments">Reader Comments</h2>

<blockquote>
  <p><strong>Yannick Rochat</strong>, 2013-10-24 18:36</p>

  <p>It reminds me of another Newman article : <a href="http://arxiv.org/abs/cond-mat/0305612">http://arxiv.org/abs/cond-mat/0305612</a></p>

  <p>In the case of the plague article, I feel like you (an historian) were not supposed to find it. Like if it were storytelling for engineers. Let’s hope that such a work, made without the help of a researcher in (digital ?) humanities, and with quite no sources from work of historians, doesn’t become a reference on this subject, but remains at most one about that spread-with-jumps algorithm.</p>

  <p>There should be a blog about such articles.</p>

  <p>Thanks for your post.</p>
</blockquote>

<blockquote>
  <p><strong>Jack Rigby</strong>, 2014-03-05 07:33</p>

  <p>Now this <em>IS</em> fascinating!
 The problem with trying to establish the actual truth, as distinct from the political truth, is that academically, it can mean professional death, or in the case of some industries, (Tobacco) real death.
 I wrote a long lost dissent 40 years ago about plague/disease characteristics and the key point was:
 “Hellooo?? Nobody with the Black Death infection travels far at all”</p>

  <p>It was tied in to the nonsense about the origin of Man being in Africa, a totally geo-unstable place compared to Australia, where the locals have stories about the “DreamtimeS” ( two lots of 20,000 year memories) and “going out into the world – where everybody was dead from the cold.”</p>

  <p>One of the big catches in disease research is the blatant lies told by the “Vested Interests” protecting their interests.
 But one can find valid information by looking in esoteric areas like the international battle to get rid of the literal health horrors of SODIUM Fluoride and actually track deterioration in the health of entire populations’ “with no DISCERNIBLE reason.” (Officially)</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Every day? Every two days? <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>{&quot;family&quot;=&gt;&quot;Weingart&quot;, &quot;given&quot;=&gt;&quot;Scott&quot;}</name></author><category term="reviews" /><category term="complexity" /><category term="diffusion" /><category term="history" /><category term="human-dynamics" /><category term="macroanalysis" /><category term="methodologies" /><category term="network-analysis" /><category term="social-networks" /><category term="social-science" /><summary type="html"><![CDATA[Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.]]></summary></entry><entry><title type="html">Analyzing submissions to Digital Humanities 2013</title><link href="https://scottbot.github.io/dossier/method/personal%20research/2012/11/08/analyzing-submissions-to-dh-2013.html" rel="alternate" type="text/html" title="Analyzing submissions to Digital Humanities 2013" /><published>2012-11-08T00:00:00+00:00</published><updated>2012-11-08T00:00:00+00:00</updated><id>https://scottbot.github.io/dossier/method/personal%20research/2012/11/08/analyzing-submissions-to-dh-2013</id><content type="html" xml:base="https://scottbot.github.io/dossier/method/personal%20research/2012/11/08/analyzing-submissions-to-dh-2013.html"><![CDATA[<p><em>Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.</em></p>

<h1 id="analyzing-submissions-to-digital-humanities-2013">Analyzing submissions to Digital Humanities 2013</h1>

<p><a href="http://dh2013.unl.edu/">Digital Humanities 2013 is on its way</a>;
submissions are closed, peers will be reviewing them shortly, and (most
importantly for this post) the people behind the conference are
experimenting with a new method of matching submissions to reviewers.
It’s a bidding process; reviewers take a look at the many submissions
and state their reviewing preferences or, when necessary, conflicts of
interest. It’s unclear the extent to which these preferences will be
accommodated, as this is an experiment on their part. <a href="http://nowviskie.org/2012/cats-and-ships/">Bethany Nowviskie describes it here</a>.
As a potential reviewer, I just went through the process of listing my
preferences, and managed to do some data scraping while I was there. How
could I not? All 348 submission titles were available to me, as well as
their authors, topic selections, and keywords, and given that my
submission for this year is <em>all about quantitatively analyzing DH</em>,
it was an opportunity I could not pass up. Given that these data are
sensitive, and those who submitted did so under the assumption that
rejected submissions would remain private, I’m opting not to release the
data or any non-aggregated information. I’m also doing my best not to
actually read the data in the interest of the privacy of my peers; I
suppose you’ll all just have to trust me on that one, though.</p>

<p>So what are people submitting? According to the topics authors
assigned to their 348 submissions, 65 submitted articles related to
“literary studies,” trailed closely by 64 submissions which pertained to
“data mining/ text mining.” Work on archives and visualizations is
also up near the top, and only about half as many authors submitted
historical studies (37) as those who submitted literary ones (65). This
confirms my long suspicion that our current wave of DH (that is, what’s <em>trending</em>
and exciting) focuses quite a bit more on literature than history. This
makes me sad.  You can see the breakdown in Figure 1 below, and
further analysis can be found after.</p>

<p><img src="images/SBW-035-img-001.png" alt="" title="Topic Counts" /></p>

<p><em>Figure 1: Number of documents with each topic authors assigned to submissions for DH2013 (click to enlarge).</em></p>

<p>The majority of authors attached fewer than five topics to their
submissions; a small handful included over 15.  Figure 2 shows the
number of topics assigned to each document.</p>

<p><img src="images/SBW-035-img-002.png" alt="" title="Topics per document" /></p>

<p><em>Figure 2: The number of topics attached to each document, in order of rank.</em></p>

<p>I was curious how strongly each topic coupled with other topics, and
how topics tended to cluster together in general, so I extracted a topic
co-occurrence network. That is, whenever two topics appear on the same
document, they are connected by an edge (see <a href="http://www.scottbot.net/HIAL/?p=6279">Networks Demystified Pt. 1</a>
for a brief introduction to this sort of network); the more times two
topics co-occur, the stronger the weight of the edge between them.</p>
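
<p>For readers who want to try this on their own data, here is a minimal sketch of that extraction in Python. The submissions themselves stay private, so the four documents below are invented, and the function name <code>cooccurrence_counts</code> is just a label for this illustration rather than anything from the tools I actually used.</p>

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents each unordered pair of topics shares.

    Each document is a list of topic labels. The returned Counter maps a
    sorted (topic_a, topic_b) tuple to the weight of the edge between them.
    """
    counts = Counter()
    for topics in documents:
        # set() ignores repeated labels; sorted() gives each pair one key.
        for a, b in combinations(sorted(set(topics)), 2):
            counts[(a, b)] += 1
    return counts


# Invented stand-ins for submissions; not the DH2013 data.
toy_documents = [
    ["literary studies", "text analysis", "data mining/ text mining"],
    ["text analysis", "data mining/ text mining"],
    ["literary studies", "text analysis"],
    ["teaching and pedagogy", "programming"],
]

edges = cooccurrence_counts(toy_documents)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

<p>Each key in <code>edges</code> is an edge of the network and each value is its weight, which is all a tool like Gephi or NetworkX needs as an edge list.</p>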

<p>Topping off the list at 34 co-occurrences were “Data Mining/ Text
Mining” and “Text Analysis,” not terrifically surprising as the
latter generally requires the former, followed by “Data Mining/
Text Mining” and “Content Analysis” at 23 co-occurrences, “Literary
Studies” and “Text Analysis” at 22 co-occurrences, “Content Analysis”
and “Text Analysis” at 20 co-occurrences, and “Data Mining/ Text
Mining” and “Literary Studies” at 19 co-occurrences. Basically what I’m
saying here is that Literary Studies, Mining, and Analysis seem to go
hand-in-hand.</p>

<p>Knowing my readers, about half of you are already angry with me for
counting co-occurrences, and rightly so. That measurement is heavily
biased by the sheer total number of times a topic is used; if “literary
studies” is attached to 65 submissions, it’s much more likely that it
will co-occur with any particular topic than topics (like “teaching and
pedagogy”) which simply appear more infrequently. The highest frequency
topics will co-occur with one another simply by an accident of
magnitude.</p>

<p>To account for this, I measured the <em>neighborhood overlap</em>
of each pair of nodes on the topic network. This involves first finding the
number of other topics that a pair of topics shares. For example,
“teaching and pedagogy” and “digital humanities – pedagogy and
curriculum” each co-occur with several of the same other topics,
including “programming,” “interdisciplinary collaboration,” and “project
design, organization, management.” I summed up the number of topical
co-occurrences between each pair of topics, and then divided that total
by the number of co-occurrences each node in the pair had individually.
In short, I looked at which pairs of topics tended to share similar
other topics, making sure to take into account that some topics which
are used very frequently might need some normalization. There are better
normalization algorithms out there, but I opted to use this one for its
simplicity, for pedagogical reasons. The method does a great job leveling
the playing field between pairs of infrequently-used topics compared to
pairs of frequently-used topics, but doesn’t fare so well when looking
at a pair where one topic is popular and the other is not. The algorithm
is well-described in Figure 3, where the darker the edge, the higher
the neighborhood overlap.</p>

<p><img src="images/SBW-035-img-003.png" alt="" title="Neighborhood Overlap" /></p>

<p><em>Figure 3: The neighborhood overlap between two nodes is how many neighbors (or connections) that pair of nodes shares. As such, A and B share very few connections, so their overlap is low, whereas D and E have quite a high overlap. Via Jaroslav Kuchar.</em></p>
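
<p>As a rough sketch, the calculation can be written in a few lines of Python. I am assuming the common textbook form of neighborhood overlap here (neighbors shared by both topics, divided by neighbors of either topic, not counting the pair itself), which may differ in detail from the normalization described above. The topic names and edges are invented, and the function names are hypothetical.</p>

```python
def build_adjacency(edges):
    """Turn a list of (topic, topic) edges into a topic -> set of neighbors map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def neighborhood_overlap(adjacency, a, b):
    """Shared neighbors of a and b, divided by all neighbors of either one.

    The two topics themselves are left out of the denominator, so a pair
    that shares every other neighbor scores 1.0.
    """
    shared = adjacency[a] & adjacency[b]
    combined = (adjacency[a] | adjacency[b]) - {a, b}
    if not combined:
        return 0.0
    return len(shared) / len(combined)


# Invented co-occurrence edges; not the DH2013 data.
toy_edges = [
    ("teaching", "programming"),
    ("teaching", "collaboration"),
    ("teaching", "curriculum"),
    ("curriculum", "programming"),
    ("curriculum", "collaboration"),
    ("curriculum", "project design"),
    ("visualization", "programming"),
]

adjacency = build_adjacency(toy_edges)
teaching_curriculum = neighborhood_overlap(adjacency, "teaching", "curriculum")
teaching_visualization = neighborhood_overlap(adjacency, "teaching", "visualization")
print(round(teaching_curriculum, 2), round(teaching_visualization, 2))
```

<p>In this toy network “teaching” and “curriculum” share two of their three possible neighbors, while “teaching” and “visualization” share only one of three, so the first pair is the better candidate for merging.</p>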

<p>Neighborhood overlap paints a slightly different picture of the
network. The pair of topics with the largest overlap was “Internet /
World Wide Web” and “Visualization,” with 90% of their neighbors
overlapping. Unsurprisingly, the next-strongest pair was “Teaching and
Pedagogy” and “Digital Humanities – Pedagogy and Curriculum.” The
data might be used to suggest multiple topics that might be merged into
one, and this pair seems to be a pretty good candidate. “Visualization”
also closely overlaps “Data Mining/ Text Mining”, which itself (as we
saw before) overlaps with “Cultural Studies” and “Literary Studies.”
What we see from this close clustering both in overlap and in connection
strength are the traces of a fairly coherent subfield within DH, that of
quantitative literary studies. We see a similarly tight-knit cluster
between topics concerning archives, databases, analysis, the web,
visualizations, and interface design, which suggests another genre in
the DH community: the (relatively) recent boom of user interfaces as
workbenches for humanists exploring their archives. Figure 4 represents
the pairs of topics which overlap to the highest degree; topics without
high degrees of pair correspondence don’t appear on the network graph.</p>

<p><img src="images/SBW-035-img-004.png" alt="" title="Topic Network" /></p>

<p><em>Figure 4: Network of topical neighborhood overlap. Edges between topics are weighted according to how structurally similar the two topics are. Topics that are structurally isolated are not represented in this network visualization.</em></p>

<p>The topics authors chose for each submission were from a controlled
vocabulary. Authors also had the opportunity to attach their own
keywords to submissions, which unsurprisingly yielded a much more
diverse (and often redundant) network of co-occurrences. The resulting
network revealed a few surprises: for example, “topic modeling” appears
to be much more closely coupled with “visualization” than with “text
analysis” or “text mining.” Of course some pairs are not terribly
surprising, as with the close connection between “Interdisciplinary” and
“Collaboration.” The graph also shows that the organizers have done a
pretty good job putting the curated topic list together, as a
significant chunk of the high thresholding keywords are also
available in the topic list, with a few notable exceptions. “Scholarly
Communication,” for example, is a frequently used keyword but not
available as a topic – perhaps next year, this sort of analysis can be
used to help augment the curated topic list. The keyword network appears
in Figure 5. I’ve opted not to include a truly high resolution image to
dissuade readers from trying to infer individual documents from the
keyword associations.</p>

<p><img src="images/SBW-035-img-005.png" alt="" title="DH2013 Keywords" /></p>

<p><em>Figure 5: Which keywords are used together on documents submitted to DH2013? Nodes are colored by cluster, and edges are weighted by number of co-occurrences. Click to enlarge.</em></p>

<p>There’s quite a bit of rich data here to be explored, and anyone who
does have access to the bidding can easily see that the entire point of
my group’s submission is exploring the landscape of DH, so there’s
definitely more to come on the subject from this blog. I especially look
forward to seeing what decisions wind up being made in the peer review
process, and whether or how that skews the scholarly landscape at the
conference.</p>

<p>On a more reflexive note, looking at the data makes it pretty clear
that DH isn’t as fractured as some occasionally suggest (New Media vs.
Archives vs. Analysis, etc.). Every document is related to a few others,
and they are all of them together connected in a rich family, a
network, of Digital Humanities. There are no islands or isolates. While
there might be no “The” Digital Humanities, no unifying factor
connecting all research, there are Wittgensteinian
family resemblances  connecting all of these submissions
together, in a cohesive enough whole to suggest that yes, we can
reasonably continue to call our confederation a single community.
Certainly, there are many sub-communities, but there still exists an
internal cohesiveness that allows us to differentiate ourselves from,
say, geology or philosophy of mind, which themselves have their own
internal cohesiveness.</p>]]></content><author><name>{&quot;family&quot;=&gt;&quot;Weingart&quot;, &quot;given&quot;=&gt;&quot;Scott B.&quot;, &quot;url&quot;=&gt;&quot;http://www.scottbot.net/HIAL/&quot;}</name></author><category term="method" /><category term="personal research" /><category term="data analysis" /><category term="dhconf" /><category term="digital humanities" /><category term="methodologies" /><category term="network analysis" /><category term="scholarly communication" /><category term="visualizations" /><summary type="html"><![CDATA[Digital Humanities 2013 is on its way; submissions are closed, peers will be reviewing them shortly, and (most importantly for this post) the people behind the conference are experimenting with a n...]]></summary></entry><entry><title type="html">Another Step in Keeping Pledges</title><link href="https://scottbot.github.io/dossier/personal%20research/2012/08/15/another-step-in-keeping-pledges.html" rel="alternate" type="text/html" title="Another Step in Keeping Pledges" /><published>2012-08-15T00:00:00+00:00</published><updated>2012-08-15T00:00:00+00:00</updated><id>https://scottbot.github.io/dossier/personal%20research/2012/08/15/another-step-in-keeping-pledges</id><content type="html" xml:base="https://scottbot.github.io/dossier/personal%20research/2012/08/15/another-step-in-keeping-pledges.html"><![CDATA[<p><em>Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.</em></p>

<h1 id="another-step-in-keeping-pledges">Another Step in Keeping Pledges</h1>

<p>Long-time readers of this blog might remember that, a while ago, I <a href="http://www.scottbot.net/HIAL/?page_id=3086">pledged to do pretty much Open Everything</a>.
Last week, a friend in my department asked how I managed that without
having people steal my ideas. It’s a tough question, and I’m still not
certain whether my answer has more to do with idealist naïveté or
actual forward-thought. Time will tell. As it is, the pool of people
doing similar work to mine is small, and they pretty much all know about
this blog, so I’m confident the crowd of rabid academics will keep each
other in check. Still, I suppose we all have to be on guard for the
occasional evil professor, wearing his white lab coat, twirling his
startling mustachio,  and just itching to steal the idle
musings of a still-very-confused Ph.D. student.</p>

<p>In the interest of keeping up my pledge, I’ve decided to open up yet
another document, this time for the purpose of student guidance. In
2010, I applied for the <a href="http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=6201">NSF Graduate Research Fellowship Program</a>,
a shockingly well-paying program that’ll surely help with the rising
(and sometimes prohibitive) costs of graduate school. By several strokes
of luck and (I hope) a decent project, the NSF sent the decision to
fund me later that year, and I’ve had more time to focus on research
ever since. In the interest of helping future applicants, I’ve <a href="http://figshare.com/articles/NSF_GRFP_Accepted_Proposal_2010/94220">posted my initial funding proposal on figshare</a>.
Over the next few weeks, there are a few other documents and datasets I
plan on making public, and I’ll start a new page on this blog that
consolidates all the material that I’ve opened, inspired by <a href="http://tedunderwood.wordpress.com/open-data/">Ted Underwood’s similar page</a>.</p>

<p><img src="images/SBW-036-figsharelogo1.png" alt="figshare logo" /></p>

<p><em>Click to get my NSF proposal.</em></p>

<p>Do you have grants or funding applications that’ve been accepted? Do
you have publications out that are only accessible behind a drastic
paywall? I urge you to post preprints, drafts, or whatever else you can
to make scholarship a freer and more open endeavor for the benefit of
all.</p>

<hr />

<h2 id="reader-comments">Reader Comments</h2>

<blockquote>
  <p><strong>Laurie N. Taylor</strong>, 2012-08-17</p>

  <p>The University of Florida libraries
have seen very clear benefits (increasing interest in projects, gaining
new collaborators for projects, serving as PR/marketing) from posting
grant applications in the UF Digital Collections. We have a full
collection specifically for grant proposals: <a href="http://ufdc.ufl.edu/ufirgrants">http://ufdc.ufl.edu/ufirgrants</a>
For large, collaborative projects, we’ve also found this to be
specifically useful for ease of communication and project management as
the grant projects proceed because it ensures everyone has ready access
to the proposal. More recently, we’re trying to ensure that we also
share all official press releases and all grant reports along with the
funded proposals to help people better understand how grant projects
normally proceed, best practices, and just to further develop a culture
of grantsmanship for successful proposal writing, successful and easier
grant project management, and successful next steps in terms of
increasing impact from all projects. While the emphasis began on sharing
proposals for larger grants, researchers have added individual
fellowship proposals and the feedback has been similarly positive.
Researchers for some projects have declined to share their proposals
until after their project work and publication are complete for fear of
being scooped, which seems like a valid concern in some instances. For
many, it does not seem applicable and there do seem to be clear benefits
from sharing the proposals.</p>

  <p>I’m very interested to see how other people respond on this and for
additional data (anecdotal or otherwise) on risks and benefits.</p>
</blockquote>

<blockquote>
  <p><strong>Anthony Salvagno</strong>, 2012-08-18</p>

  <p>Personally I think the reason
scientists won’t steal ideas is because we are putting them out there.
As a fellow open scientist, I make all my research and data public
domain. Others may attribute it with a share-alike license. Whatever the
case is, generally speaking, you can’t steal ideas that are being
offered to the world. That may have a lot to do with it. The current
pool of participants being small may also have something to do with it
(like you suggest). I recently wrote a bunch of thoughts on this <a href="http://research.iheartanthony.com/2012/08/10/open-notebook-science-thoughts-inspired-by-the-biomedical-research-symposium/">here</a>.</p>

  <p>And that’s great that you published your funded proposal. I took the
concept a step further and wrote an NSF IGERT proposal openly and
published it <a href="https://docs.google.com/document/d/1YaV8XFGVxQLod0OnYImwaqJ-XrvlAOcmUI1FOgbwuQI/edit">here</a>.
Hopefully it gets funded but if not hopefully I or someone can build on
it in the future. Whatever pushes science forward right?</p>
</blockquote>]]></content><author><name>{&quot;family&quot;=&gt;&quot;Weingart&quot;, &quot;given&quot;=&gt;&quot;Scott&quot;}</name></author><category term="personal research" /><category term="open access" /><summary type="html"><![CDATA[Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.]]></summary></entry><entry><title type="html">Doing Bayesian Data Analysis</title><link href="https://scottbot.github.io/dossier/method/2012/01/10/doing-bayesian-data-analysis.html" rel="alternate" type="text/html" title="Doing Bayesian Data Analysis" /><published>2012-01-10T00:00:00+00:00</published><updated>2012-01-10T00:00:00+00:00</updated><id>https://scottbot.github.io/dossier/method/2012/01/10/doing-bayesian-data-analysis</id><content type="html" xml:base="https://scottbot.github.io/dossier/method/2012/01/10/doing-bayesian-data-analysis.html"><![CDATA[<p><em>Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.</em></p>

<h1 id="doing-bayesian-data-analysis">Doing Bayesian Data Analysis</h1>

<p>A few months ago, <em>Science</em> published <a href="http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2011_11_25/caredit.a1100131">a Thanksgiving article on what scientists can be grateful for</a>. It’s got a lot of good points, like being thankful for family members who accept the crazy hours we work, or for those <em>really useful</em> research projects that make science cool enough for us to get funding for the merely <em>really interesting</em>. It does have one unfortunate reference to humanists:</p>

<blockquote>
  <p>We are thankful that Ph.D. programs in the sciences, as
much as we complain about them, aren’t nearly as horrifying as, say,
Ph.D. programs in the humanities. I just heard today from a friend in
his ninth year of a comparative literature Ph.D. who thinks he might
finish “in a year and a half.” At least the job market for comp lit
Ph.D. awardees is thriving, right?</p>
</blockquote>

<p>Ouch. I suppose the truth hurts. The particularly interesting point that inspired this post, however, was:</p>

<blockquote>
  <p>We are thankful for that one colleague who knows statistics. There’s always one.</p>
</blockquote>

<p><img src="images/SBW-037-img-001-science-thanksgiving.jpg" alt="Scientist Thanksgiving" /></p>

<p><em>A Scientist’s Thanksgiving. (Image from the above Science article)</em></p>

<h1 id="the-state-of-things">The State of Things</h1>

<p>The above quote about statisticians is so true it hurts, as (we just discovered) the truth is wont to do. It’s even <em>more</em> true
 in the humanities than it is in the more natural and quantitative
sciences. When we talk about a colleague who knows statistics, we
generally don’t mean someone down the hall; usually, we mean that one
statistician who we met in the pub that one night and has a bizarre
interest in the humanities. That’s not to say humanist statisticians
don’t exist, but I doubt you’re likely to find one in any given
humanities department.</p>

<p>This unfortunately is not only true of statistics, but also of GIS,
network science, computer science, textual analysis, and many other
disciplines we digital humanists love to borrow from. Thankfully, the
NEH ODH’s <a href="http://www.neh.gov/grants/guidelines/IATDH.html">Institutes for Advanced Topics in the Digital Humanities</a>, UVic’s <a href="http://www.dhsi.org/">Digital Humanities Summer Institute</a>,
 and other programs out there are improving our collective expertise,
but a quick look for GIS/Stats/SNA/etc. courses in most humanities
departments still produces slim pickings.</p>

<p><img src="images/SBW-037-img-002-astrology.gif" alt="astrology" /></p>

<p><em>Math is scary. (I can’t find attribution, sorry. Anybody know who drew this?)</em></p>

<p>One of the best things to come out of the #hacker movement in the
Digital Humanities has been the spirit to get our collective hands dirty
 and learn the techniques ourselves. It’s been a long time coming, and
happier days are sure to follow, but one skill still seems
underrepresented in the DH purview: statistics.</p>

<h1 id="why-statistics-why-bayesian-statistics">Why Statistics? Why Bayesian Statistics?</h1>

<p>In a recent post by <a href="https://dhs.stanford.edu/visualization/more-networks/">Elijah Meeks</a>,
 he called Text Analysis, Spatial Analysis, and Network Analysis the
“three pillars” of DH research, with a sneaking suspicion that Image
Analysis should fit somewhere in there as well. This seems to be the
converging sentiment in most DH circles, and although when asked most
would say statistics is also important, it still doesn’t seem to be
among the first subjects named.</p>

<p>With another round of <a href="http://www.diggingintodata.org/">Digging Into Data</a>
 winners chosen, and a bevy of panels and presentations dedicating
themselves to Big Data in the Humanities, the first direction we should
point is statistics. Statistics is a tool uniquely built for
understanding lots of data, and it was developed with full knowledge
that the data may be incomplete, biased, or otherwise imperfect, and has
 legitimate work-arounds for most such occasions. Of course, all
the caveats in my <a href="http://www.scottbot.net/HIAL/?p=6279">first Networks Demystified</a> post apply here: don’t use it without fully understanding it, and change it where necessary.</p>

<p><img src="images/SBW-037-img-003-last-line-of-defense-statistics.gif" alt="Statistics" /></p>

<p><em>http://vadlo.com/cartoons.php?id=71</em></p>

<p>Many Humanists, even digital ones, frequently seem to have a
(justifiably) knee-jerk reaction to statistics. If you’ve been following
 the Twitter and blog conversations about <a href="http://www.historians.org/annual/2012/index.cfm">AHA 2012</a>,  you probably caught a flurry of discussion over <a href="http://books.google.com/ngrams">Google Ngrams</a>. Conversation tended toward horrified screams of the dangers of correlation vs. causation (or at least references to <a href="http://xkcd.com/552/">xkcd</a>),
 and the ease with which one might lie via statistics or omission.
These are all valid cautions, especially where ngrams is concerned, but I
 sometimes fear we get so caught up in bad examples that we spend more
time apologizing for them than fixing them. Ted Underwood has <a href="http://tedunderwood.wordpress.com/2012/01/03/a-brief-outburst-about-numbers/">a great post about just this</a>, which I will touch on again shortly. (And, to Ted and <a href="http://ariddell.org/">Allen</a> specifically, I’m guessing you both will enjoy this post.)</p>

<p>In short: statistics is useful. To quote the above-linked xkcd comic:</p>

<blockquote>
  <p>Correlation doesn’t imply causation, but it does waggle
its eyebrows suggestively and gesture furtively while mouthing ‘look
over there’.</p>
</blockquote>

<p>So how do we go about using statistics? In a comment on Ted’s recent post about statistics, <a href="http://www.trevorowens.org/">Trevor Owens</a> wrote:</p>

<blockquote>
  <p>if you just start signing up for statistics courses you
are going to end up getting a rundown on using t-tests and ANOVAs as
tools for hypothesis testing. The entire hypothesis testing idea remains
 a core part of how a lot of folks in the social sciences think about
things and it is deeply at odds with what humanists want to do.</p>
</blockquote>

<p>The key is not appropriation but adaption. We must learn statistics,
even the hypothesis testing, so that we might find what methods are
useful, what might be changed, and how we can get it to work for us.
We’re humanists. We’re <em>really</em> <em>good</em> at methodological critique.</p>

<p>One of the areas of statistics most likely to bear fruit for humanists is <em><a href="http://en.wikipedia.org/wiki/Bayesian_statistics">Bayesian statistics</a></em>. Some
 of us already use it in our text mining algorithms, although the math
involved remains occult to most. It basically builds uncertainty and
belief directly into statistics. Instead of coming up with one <em>correct</em>
 answer, Bayesian analysis often yields a range of more or less probable
 answers depending on what seems to be the case from prior evidence, and
can update and improve that range as more is learned.</p>
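
<p>To make the idea of updating a range of beliefs concrete, here is a minimal sketch of a single Bayesian update. The scenario (three candidate authors for an anonymous letter), the names, and every number in it are invented for illustration; only the rule itself, multiplying prior belief by how well each hypothesis explains the evidence and then renormalizing, is standard.</p>

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses, given a prior and P(evidence | hypothesis)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical prior beliefs about who wrote an anonymous letter.
prior = {
    "author_a": Fraction(1, 2),
    "author_b": Fraction(3, 10),
    "author_c": Fraction(1, 5),
}

# Assumed chance that each author would use a spelling found in the letter.
spelling = {
    "author_a": Fraction(1, 10),
    "author_b": Fraction(3, 5),
    "author_c": Fraction(3, 10),
}

posterior = bayes_update(prior, spelling)
for hypothesis, belief in posterior.items():
    print(hypothesis, round(float(belief), 3))
```

<p>The result is not one correct answer but a re-weighted set of possibilities, and the posterior can serve as the prior the next time new evidence turns up.</p>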

<p><img src="images/SBW-037-img-004-null-hypothesis.png" alt="null_hypothesis" /></p>

<p><em>The one XKCD comic nobody seems to have linked to. (http://xkcd.com/892/)</em></p>

<p>For humanists, this importance is (at least) two-fold. Ted Underwood sums up the first reason nicely:</p>

<blockquote>
  <p>[Bayesian inference] is amazingly,
almost bizarrely willing to incorporate subjective belief into its
definition of knowledge. It insists that definitions of probability have
 to depend not only on observed evidence, but on the “prior
probabilities” that we expected before we saw the evidence. If humanists
 were more familiar with Bayesian statistics, I think it would blow a
lot of minds.</p>
</blockquote>

<p>The second and more specific reason worth mentioning here deals with
the ranges I discussed above. If a historian, for example, is trying to
understand how and why some historical event happened, Bayesian analysis
 could yield which set of occurrences were more or less likely, and
which were so far off as to not be worth considering. By trying to find
reasonable boundary conditions rather than exact explanations to answer
our questions, humanists can retain that core knowledge that humans and
human situations are not wholly deterministic machines, who all act the
same and reproduce the same results in every situation.</p>

<p>We are intrinsically and inextricably <em>inexact</em>, and until we get computers that see and remember <em>everything</em>,
 and model it all perfectly, we should avoid looking for exact answers.
Bayesian statistics, instead, can help us find a range of <em>reasonable</em> answers, with full awareness and use of the beliefs and evidence we have going in.</p>

<h1 id="a-call-to-arms">A Call to Arms</h1>

<p>After I read that post about a scientist’s thanksgiving, I realized I
 didn’t want to have to rely on that one colleague who knows statistics.
 <em>Nobody</em> should. That’s why I decided to enroll in a Bayesian Data Analysis course this semester, taught by and using <a href="http://www.indiana.edu/~kruschke/DoingBayesianDataAnalysis/">the book of John K. Kruschke</a>. It’s a <em>very</em> readable
 book, directed toward people with no prior knowledge in statistics or
programming, and takes you through the basics of both. Kruschke’s got a <a href="http://doingbayesiandataanalysis.blogspot.com/">blog</a> worth reading, as does <a href="http://en.wikipedia.org/wiki/Andrew_Gelman">Andrew Gelman</a>, an author of the book <a href="http://www.stat.columbia.edu/~gelman/book/">Bayesian Data Analysis</a>. I’m sure a <a href="https://www.google.com/search?gcx=c&amp;sourceid=chrome&amp;ie=UTF-8&amp;q=bayesian+statistics+lectures">basic Google search</a> can point you to video lectures, if that’s your thing. I’ll also try to blog about it over the coming months as I learn more.</p>

<p>There are several (occasionally apocryphal) anecdotes about
 the great theoretical physicists of the early 20th century needing to
go back to school to learn basic statistics. Some still weren’t terribly
 happy about it (“God does not play dice with the universe”), but in the
 end, pressures from the changing nature of their theories required a
thorough understanding of statistics. As humanists begin to deal with a
glut of information we never before had access to, it’s time we adapt in
 a similar fashion.</p>

<p>The wide angle, the distant reading, the longue durée will all
benefit from a deeper understanding of statistics. That knowledge, in
tandem with traditional close reading skills, will surely become one of
the pillars of humanities research as Big Data becomes ever-more common.</p>

<hr />

<h2 id="reader-comments-6">Reader Comments (6)</h2>

<blockquote>
  <p><strong>Ryan Shaw</strong>, Jan 10, 2012 3:01 pm</p>

  <p>You might be interested in Aviezer Tucker’s book <a href="http://books.google.com/books?id=siS5DK1HdwsC">Our Knowledge of the Past</a>, which argues that historiographical practice is best understood from a Bayesian perspective.</p>
</blockquote>

<blockquote>
  <blockquote>
    <p><strong>Scott Weingart</strong>, Jan 10, 2012 6:29 pm</p>

    <p>This is fantastic, thank you! I will definitely take a look at it.</p>
  </blockquote>
</blockquote>

<blockquote>
  <p><strong>Ted Underwood</strong>, Jan 11, 2012 1:46 pm</p>

  <p>You’re right that I enjoyed the post! Also, Kruschke’s book looks a 
lot more accessible than the one I got out of our library. I’ve 
convinced myself that I mostly “understand” that one, but I might just 
read Kruschke’s to make sure that I actually do!</p>
</blockquote>

<blockquote>
  <p><strong>Allen Riddell</strong>, Jan 12, 2012 2:49 am</p>

  <p>Great post. Thanks Scott.</p>

  <p>Here’s my favorite quote on the subject:</p>

  <p>The atmosphere of the Bayesian revival is captured in a comment by 
Rivett on [Dennis] Lindley’s move to University College London and the 
premier chair of statistics in Britain: “it was as though a Jehovah’s 
Witness had been elected Pope.”</p>

  <p>Also worth mentioning might be a recent book from Yale UP that is addressed to a general audience: <em>The Theory That Would Not Die</em> by Sharon Mcgrayne <a href="https://www.powells.com/biblio/62-9780300169690-0">https://www.powells.com/biblio/62-9780300169690-0</a></p>
</blockquote>

<blockquote>
  <p><strong>Ben</strong>, Jan 12, 2012 5:50 am</p>

  <p>I agree that the publication of Kruschke’s book is probably going to be a
 watershed moment in making Bayesian statistics widely accessible. I’d 
also recommend Simon Jackman’s ‘Bayesian Analysis for the Social 
Sciences’ (2009) and his class notes here: <a href="http://jackman.stanford.edu/classes/BASS/">http://jackman.stanford.edu/classes/BASS</a>. Another decent one is Ntzoufras’ ‘Bayesian Modeling Using WinBUGS: An introduction’ (2009) <a href="http://stat-athens.aueb.gr/~jbn/winbugs_book/">http://stat-athens.aueb.gr/~jbn/winbugs_book</a></p>
</blockquote>

<blockquote>
  <blockquote>
    <p><strong>Scott Weingart</strong>, Jan 12, 2012 11:15 am</p>

    <p>Thanks, Ben, those look like fantastic resources. It’s worth pointing
 out that both Jackman and Kruschke suggest using JAGS over BUGS for 
Markov chains.</p>
  </blockquote>
</blockquote>]]></content><author><name>{&quot;family&quot;=&gt;&quot;Weingart&quot;, &quot;given&quot;=&gt;&quot;Scott B.&quot;}</name></author><category term="method" /><category term="bayesian" /><category term="big data" /><category term="digital humanities" /><category term="methodologies" /><category term="statistics" /><summary type="html"><![CDATA[A few months ago, Science published a Thanksgiving article on what scientists can be grateful for. It’s got a lot of good points, like being thankful for family members who accept the crazy h...]]></summary></entry><entry><title type="html">Topic Modeling and Network Analysis</title><link href="https://scottbot.github.io/dossier/method/2011/11/15/topic-modeling-and-network-analysis.html" rel="alternate" type="text/html" title="Topic Modeling and Network Analysis" /><published>2011-11-15T00:00:00+00:00</published><updated>2011-11-15T00:00:00+00:00</updated><id>https://scottbot.github.io/dossier/method/2011/11/15/topic-modeling-and-network-analysis</id><content type="html" xml:base="https://scottbot.github.io/dossier/method/2011/11/15/topic-modeling-and-network-analysis.html"><![CDATA[<p><em>Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.</em></p>

<h1 id="topic-modeling-and-network-analysis">Topic Modeling and Network Analysis</h1>

<p>According
to Google Scholar, David Blei’s first topic modeling paper has received
3,540 citations since 2003. Everybody’s talking about topic
models. Seriously, I’m afraid of visiting my parents this Hanukkah
and hearing them ask “Scott… what’s this topic modeling I keep hearing
all about?” They’re powerful, widely applicable, easy to use, and
difficult to understand — a dangerous combination.</p>

<p>Since shortly after Blei’s first publication, researchers have been
looking into the interplay between networks and topic models. This post
will be about that interplay, looking at how they’ve been combined, what
sorts of research those combinations can drive, and a few pitfalls to
watch out for. I’ll bracket the elephant in the room, whether these sorts
of models capture the semantic meaning for which they’re often used, until
a later discussion. This post also attempts to introduce topic
modeling to those not yet fully <del>converted</del> aware of its potential.</p>

<p><a href="http://www.scottbot.net/HIAL/wp-content/uploads/2011/11/blei.png"><img src="images/SBW-019-blei.png" alt="" title="Citations to Blei" /></a></p>

<p>Citations
to Blei (2003) from ISI Web of Science. There are even two citations
already from 2012; where can I get my time machine?</p>

<h1 id="a-brief-history-of-topic-modeling">A brief history of topic modeling</h1>

<p>In my <a href="http://www.scottbot.net/HIAL/?p=129">recent post</a> on <a href="http://webapp1.dlib.indiana.edu/newton/">IU’s awesome alchemy project</a>,
I briefly mentioned Latent Semantic Analysis (LSA) and Latent Dirichlet
Allocation (LDA) during the discussion of topic models. They’re
intimately related, though LSA has been around for quite a bit longer.
Without getting into too much technical detail, we should start with a
brief history of LSA/LDA.</p>

<p>The story starts, more or less, with a <a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a>
matrix. Basically, tf-idf ranks words based on how important they are
to a document within a larger corpus. Let’s say we want a list of the
most important words for each article in an encyclopedia.</p>

<p>Our first pass is obvious. For each article, just attach a list of
words sorted by how frequently they’re used. The problem with this is
immediately obvious to anyone who has looked at word frequencies; the
top words in the entry on the History of Computing would be “the,”
“and,” “is,” and so forth, rather than “turing,” “computer,” “machines,”
etc. The problem is solved by tf-idf, which scores the words based on
how special they are to a particular document within the larger corpus.
Turing is rarely used elsewhere, but used exceptionally frequently in
our computer history article, so it bubbles up to the top.</p>
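
<p>The scoring described above can be sketched in a few lines. This is one common variant of tf-idf (raw term frequency multiplied by the log of the inverse document frequency); many others exist, and the three tiny “articles” are invented for illustration.</p>

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Score each word in each document by term frequency times inverse document frequency."""
    counts = {name: Counter(text.lower().split()) for name, text in corpus.items()}
    n_docs = len(corpus)

    # Count how many documents each word appears in.
    doc_freq = Counter()
    for words in counts.values():
        doc_freq.update(words.keys())

    scores = {}
    for name, words in counts.items():
        total = sum(words.values())
        scores[name] = {
            w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in words.items()
        }
    return scores


# Hypothetical miniature encyclopedia.
corpus = {
    "computing": "the turing machine is the model turing gave of the computer",
    "farming": "the cow is in the field and the farmer is near the cow",
    "theatre": "the play is the thing and shakespeare is the author",
}

scores = tf_idf(corpus)
top = max(scores["computing"], key=scores["computing"].get)
print(top, scores["computing"]["the"])
```

<p>Because “the” appears in every document, its inverse document frequency is log(1) = 0, so it drops to the bottom no matter how often it is used, while “turing” rises to the top of the computing article.</p>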

<h2 id="lsa-and-plsa">LSA and pLSA</h2>

<p>LSA utilizes these tf-idf scores <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
within a larger term-document matrix. Every word in the corpus is a
different row in the matrix, each document has its own column, and the
tf-idf score lies at the intersection of every document and word. Our
computing history document will probably have a lot of zeroes next to
words like “cow,” “shakespeare,” and “saucer,” and high marks next to
words like “computation,” “artificial,” and “digital.” This is called a
sparse matrix because it’s mostly filled with zeroes; most documents use
very few words relative to the entire corpus.</p>

<p>With this matrix, LSA uses <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">singular value decomposition</a>
to figure out how each word is related to every other word. Basically,
the more often words are used together within a document, the more
related they are to one another. [^2]
It’s worth noting that a “document” is defined somewhat flexibly. For
example, we can call every paragraph in a book its own “document,” and
run LSA over the individual paragraphs.</p>

<p>To get an idea of the sort of fantastic outputs you can get with LSA, do check out the implementation over at <a href="http://webapp1.dlib.indiana.edu/newton/lsa/index.php">The Chymistry of Isaac Newton</a>.</p>

<p><a href="http://www.scottbot.net/HIAL/wp-content/uploads/2011/11/newtonLSA.jpg"><img src="images/SBW-019-newtonLSA.jpg" alt="" title="Newton Project LSA" /></a></p>

<p>Newton Project LSA</p>

<p>The method was significantly improved by Puzicha and Hofmann (1999),
who did away with the linear algebra approach of LSA in favor of a more
statistically sound probabilistic model, called <a href="http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis">probabilistic latent semantic analysis</a>
(pLSA). Now is the part of the blog post where I start getting
hand-wavy, because explaining the math is more trouble than I care to
take on in this introduction.</p>

<p>Essentially, pLSA imagines an additional layer between words and
documents: topics. What if every document isn’t just a set of words, but
a set of <em>topics</em>? In this model, our encyclopedia article about
computing history might be drawn from several topics. It primarily
draws from the big platonic computing topic in the sky, but it also
draws from the topics of history, cryptography, lambda calculus, and all
sorts of other topics to a greater or lesser degree.</p>

<p>Now, these topics don’t actually exist anywhere. Nobody sat down with
the encyclopedia, read every entry, and decided to come up with the 200
topics from which every article draws. pLSA <em>infers</em> topics
based on what will hereafter be referred to as black magic. Using the
dark arts, pLSA “discovers” a bunch of topics, attaches them to a list
of words, and classifies the documents based on those topics.</p>

<h2 id="lda">LDA</h2>

<p>Blei et al. (<a href="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf">2003</a>)
vastly improved upon this idea by turning it into a generative model of
documents, calling the model Latent Dirichlet allocation (LDA). By this
time, as well, some sounder assumptions were being made about the
distribution of words and document length — but we won’t get into that.
What’s important here is the generative model.</p>

<p>Imagine you wanted to write a new encyclopedia entry, let’s say about
digital humanities. Well, we now know there are three elements that
make up that process, right? Words, topics, and documents. Using these
elements, how would you go about writing this new article on digital
humanities?</p>

<p>First off, let’s figure out what topics our article will consist of.
It probably draws heavily from topics about history, digitization, text
analysis, and so forth. It also probably draws more weakly from a slew
of other topics, concerning interdisciplinarity, the academy, and all
sorts of other subjects. Let’s go a bit further and assign weights to
these topics; 22% of the document will be about digitization, 19% about
history, 5% about the academy, and so on. Okay, the first step is done!</p>

<p>Now it’s time to pull out the topics and start writing. It’s an easy
process; each topic is a bag filled with words. Lots of words. All sorts
of words. Let’s look in the “digitization” topic bag. It includes words
like “israel” and “cheese” and “favoritism,” but they only appear once
or twice, and mostly by accident. More importantly, the bag also
contains 157 appearances of the word “TEI,” 210 of “OCR,” and 73 of
“scanner.”</p>

<p><a href="http://www.scottbot.net/HIAL/wp-content/uploads/2011/11/IntroToLDA.png"><img src="images/SBW-019-IntroToLDA.png" alt="" title="LDA Model" /></a></p>

<p>LDA Model from Blei (2011)</p>

<p>So here you are, you’ve dragged out your digitization bag and your
history bag and your academy bag and all sorts of other bags as well.
You start writing the digital humanities article by reaching into the
digitization bag (remember, you’re going to reach into that bag for 22%
of your words), and you pull out “OCR.” You put it on the page. You then
reach for the academy bag and reach for a word in there (it happens to
be “teaching,”) and you throw that on the page as well. Keep doing that.
By the end, you’ve got a document that’s all about the digital
humanities. It’s beautiful. Send it in for publication.</p>
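
<p>The bag-drawing story can be written out as a short program. This is only a sketch of the narrative above, not of LDA itself: the full model also draws the topic weights and the contents of each bag from Dirichlet distributions, whereas here they are fixed by hand. The weights 22, 19 and 5 and the digitization counts come from the example; the contents of the history and academy bags are made up, and the remaining topics are simply left out.</p>

```python
import random


def generate_document(topic_weights, topic_bags, length, seed=0):
    """Write a document by repeatedly picking a topic, then a word from that topic's bag."""
    rng = random.Random(seed)
    topics = list(topic_weights)
    weights = [topic_weights[t] for t in topics]
    words = []
    for _ in range(length):
        # First choose which bag to reach into, in proportion to the topic weights.
        topic = rng.choices(topics, weights=weights, k=1)[0]
        # Then pull one word out of that bag, in proportion to how often it appears there.
        bag = topic_bags[topic]
        word = rng.choices(list(bag), weights=list(bag.values()), k=1)[0]
        words.append(word)
    return words


topic_weights = {"digitization": 22, "history": 19, "academy": 5}
topic_bags = {
    "digitization": {"OCR": 210, "TEI": 157, "scanner": 73, "cheese": 1},
    "history": {"archive": 40, "century": 30, "war": 20},  # hypothetical bag
    "academy": {"teaching": 25, "tenure": 10},  # hypothetical bag
}

doc = generate_document(topic_weights, topic_bags, length=50)
print(" ".join(doc))
```

<p>The output is word salad, which is the point: the model cares only about which words show up and how often, not about their order.</p>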

<h2 id="alright-what-now">Alright, what now?</h2>

<p>So why is the generative nature of the model so important? One of the
key reasons is the ability to work backwards. If I can generate an
(admittedly nonsensical) document using this model, I can also reverse
the process and infer, given any new document and a topic model I’ve
already generated, what the topics are that the new document draws from.</p>
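
<p>Here is a rough sketch of what “working backwards” can look like. It holds the topics’ word probabilities fixed and estimates a new document’s topic mixture with a simple expectation-maximization loop. This is a stand-in chosen for brevity, not the inference procedure used in the LDA paper, and the two topics and their probabilities are invented.</p>

```python
def infer_topic_mix(words, topic_word_probs, iterations=50):
    """Estimate topic proportions for a document, holding the topics' word probabilities fixed."""
    topics = list(topic_word_probs)
    mix = {t: 1.0 / len(topics) for t in topics}
    for _ in range(iterations):
        totals = {t: 0.0 for t in topics}
        for w in words:
            # How plausible is it that this word came from each topic?
            # Words a topic never uses get a tiny floor probability.
            joint = {t: mix[t] * topic_word_probs[t].get(w, 1e-9) for t in topics}
            norm = sum(joint.values())
            for t in topics:
                totals[t] += joint[t] / norm
        # The new mixture is the average share of words credited to each topic.
        mix = {t: totals[t] / len(words) for t in topics}
    return mix


# Hypothetical topics from an already-trained model.
topic_word_probs = {
    "digitization": {"ocr": 0.5, "tei": 0.4, "archive": 0.1},
    "history": {"archive": 0.5, "war": 0.3, "empire": 0.2},
}

mix = infer_topic_mix(["ocr", "tei", "ocr", "archive", "war"], topic_word_probs)
print({topic: round(share, 2) for topic, share in mix.items()})
```

<p>A document heavy on “ocr” and “tei” ends up mostly in the digitization topic, with a smaller share credited to history.</p>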

<p>Another factor contributing to the success of LDA is the ability to
extend the model. In this case, we assume there are only documents,
topics, and words, but we could also make a model that assumes authors
who like particular topics, or assumes that certain documents are
influenced by previous documents, or that topics change over time. The
possibilities are endless, as evidenced by the absurd number of topic
modeling variations that have appeared in the past decade. David Mimno
has compiled a <a href="http://www.cs.princeton.edu/~mimno/topics.html">wonderful bibliography</a> of many such models.</p>

<p>While the generative model introduced by Blei might seem simplistic,
it has been shown to be extremely powerful. When a newcomer sees the
results of LDA for the first time, they are immediately taken by how
intuitive they seem. People sometimes ask me “but didn’t it take forever
to sit down and make all the topics?” thinking that some of the magic
had to be done by hand. It wasn’t. Topic modeling yields intuitive
results, generating what really <em>feels</em> like topics as we know them [^3], with virtually no effort on the human side. Perhaps it is the intuitive utility that appeals so much to humanists.</p>

<h1 id="topic-modeling-and-networks">Topic Modeling and Networks</h1>

<p>Topic models can interact with networks in multiple ways. While a lot
of the recent interest in digital humanities has surrounded using
networks to visualize how documents or topics relate to one another, the
interfacing of networks and topic modeling initially worked in the
other direction. Instead of inferring networks from topic models, many
early (and recent) papers attempt to infer topic models from networks.</p>

<h2 id="topic-models-from-networks">Topic Models from Networks</h2>

<p>The first research I’m aware of in this niche was from McCallum et al. (<a href="http://dl.acm.org/citation.cfm?id=1642419">2005</a>). Their model is itself an extension of an earlier LDA-based model called the Author-Topic Model (<a href="http://dl.acm.org/citation.cfm?id=1036902">Steyvers et al., 2004</a>),
which assumes topics are formed based on the mixtures of authors
writing a paper. McCallum et al. extended that model for directed
messages in their Author-Recipient-Topic (ART) Model. In ART, it is
assumed that topics of letters, e-mails or direct messages between
people can be inferred from knowledge of both the author and the
recipient. Thus, ART takes into account the social structure of a
communication network in order to generate topics. In a later paper (<a href="http://www.cs.umass.edu/~mccallum/papers/art-jair07.pdf">McCallum et al., 2007</a>), they extend this model to one that infers the roles of authors within the social network.</p>

<p>Dietz et al. (<a href="http://dl.acm.org/citation.cfm?id=1273526">2007</a>)
created a model that looks at citation networks, where documents are
generated by topical innovation and topical inheritance via citations.
Nallapati et al. (<a href="http://dl.acm.org/citation.cfm?id=1401957">2008</a>)
similarly created a model that finds topical similarity in citing and
cited documents, with the added ability of being able to predict
citations that are not present. Blei himself joined the fray in <a href="https://www.cs.princeton.edu/~blei/papers/ChangBlei2009.pdf">2009</a>,
creating the Relational Topic Model (RTM) with Jonathan Chang, which
itself could summarize a network of documents, predict links between
them, and predict words within them. Wang et al. (<a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5967741&amp;abstractAccess=no&amp;userType=">2011</a>)
created a model that allows for “the joint analysis of text and links
between [people] in a time-evolving social network.” Their model is able
to handle situations where links exist even when there is no similarity
between the associated texts.</p>

<h2 id="networks-from-topic-models">Networks from Topic Models</h2>

<p>Some models have been made that infer networks from non-networked text. Broniatowski and Magee (<a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5591237&amp;abstractAccess=no&amp;userType=">2010</a> &amp; <a href="http://www.springerlink.com/content/w655v786lp583660/">2011</a>)
extended the Author-Topic Model, building a model that would infer
social networks from meeting transcripts. They later added temporal
information, which allowed them to infer status hierarchies and
individual influence within those social networks.</p>

<p>Many times, however, rather than creating new models, researchers
create networks out of topic models that have already been run over a
set of data. There are a lot of benefits to this approach, as
exemplified by the Newton’s Chymistry project highlighted earlier. Using
networks, we can see how documents relate to one another, how they
relate to topics, how topics are related to each other, and how all of
those are related to words.</p>

<p>Elijah Meeks created a wonderful example combining topic models with networks in <a href="https://dhs.stanford.edu/comprehending-the-digital-humanities/">Comprehending the Digital Humanities</a>.
Using fifty texts that discuss humanities computing, Elijah created a
topic model of those documents and used networks to show how documents,
topics, and words interacted with one another within the context of the
digital humanities.</p>

<p><a href="http://www.scottbot.net/HIAL/wp-content/uploads/2011/11/weak_topic_paper.png"><img src="images/SBW-019-weak_topic_paper.png" alt="" title="Topic-Paper Similarity" /></a></p>

<p>Network generated by Elijah Meeks to show how digital humanities documents relate to one another via the topics they share.</p>

<p><del>Elijah</del> Jeff Drouin has also created networks of topic models in <a href="https://dhs.stanford.edu/algorithmic-literacy/topic-networks-in-proust/">Proust</a>, as reported by Elijah.</p>

<p><a href="http://home.uchicago.edu/psleonar/">Peter Leonard</a> recently directed me to <a href="http://www.ics.uci.edu/~asuncion/pubs/TIST_11.pdf">TopicNets</a>,
a project that combines topic modeling and network analysis in order to
create an intuitive and informative navigation interface for documents
and topics. This is a great example of an interface that turns topic
modeling into a useful scholarly tool, even for those who know
little-to-nothing about networks or topic models.</p>

<p>If you want to do something like this yourself, Shawn Graham recently posted <a href="http://electricarchaeologist.wordpress.com/2011/11/11/topic-modeling-with-the-java-gui-gephi/">a great tutorial</a>
on how to create networks using MALLET and Gephi quickly and easily.
Prepare your corpus of text, get topics with MALLET, prune the CSV, make
a network, visualize it! Easy as pie.</p>
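<p>The pruning and network-making steps can be sketched in a few lines of Python. This is only an illustration, not Shawn's workflow: the <code>doc_topics</code> table, the <code>min_weight</code> cutoff, and both function names are hypothetical, and the table stands in for whatever document-topic proportions your MALLET run produces. The output uses <code>Source</code>, <code>Target</code>, and <code>Weight</code> columns, which is the layout I would expect Gephi's spreadsheet import to read as an edge table.</p>

```python
import csv
import io

# Hypothetical document-topic proportions, standing in for MALLET's output.
# Each entry is one document; each position in the list is one topic.
doc_topics = {
    "doc_a": [0.70, 0.25, 0.05],
    "doc_b": [0.10, 0.80, 0.10],
    "doc_c": [0.34, 0.33, 0.33],
}


def prune_to_edges(doc_topics, min_weight=0.2):
    """Return (document, topic, weight) edges at or above min_weight."""
    edges = []
    for doc, proportions in doc_topics.items():
        for topic_index, weight in enumerate(proportions):
            if weight >= min_weight:
                edges.append((doc, f"topic_{topic_index}", weight))
    return edges


def edges_to_csv(edges):
    """Serialize edges as CSV text with Source, Target, Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


edges = prune_to_edges(doc_topics)
csv_text = edges_to_csv(edges)
print(csv_text)
```

<p>The result is a two-mode network of documents and topics. Raising <code>min_weight</code> thins the network; lowering it keeps more of the weak document-topic ties.</p>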

<p>Networks can be a great way to represent topic models. Beyond the
simple navigation and relatedness uses shown above, combining the
two puts the whole battalion of network analysis tools at
the researcher’s disposal. We can use them to find communities of
similar documents, pinpoint the documents that were most influential
to the rest, or perform any number of other analyses designed for
networks.</p>
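<p>As a rough sketch of what that might look like, the snippet below computes two very simple measures over a hypothetical document-similarity network. The edge list and function names are invented for illustration. Weighted degree is only a crude stand-in for "influence," and connected components are only a crude stand-in for proper community detection; a dedicated network analysis package would offer far better options for both.</p>

```python
from collections import defaultdict

# Hypothetical document-similarity edges: (document, document, weight).
edges = [
    ("doc_a", "doc_b", 0.9),
    ("doc_b", "doc_c", 0.4),
    ("doc_d", "doc_e", 0.7),
]


def weighted_degree(edges):
    """Sum the weights of all edges touching each document."""
    totals = defaultdict(float)
    for source, target, weight in edges:
        totals[source] += weight
        totals[target] += weight
    return dict(totals)


def connected_components(edges):
    """Group documents that can reach one another through any chain of edges."""
    neighbors = defaultdict(set)
    for source, target, _ in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    seen = set()
    components = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk outward from an unvisited document.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


degree = weighted_degree(edges)
most_connected = max(degree, key=degree.get)
groups = connected_components(edges)
print(most_connected, groups)
```

<p>With these made-up edges, <code>doc_b</code> has the largest weighted degree, and the documents fall into two separate groups.</p>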

<p>As with anything, however, there are a few drawbacks. Topic models are
rich with data. Every document is related to every other document, if
some only barely. Similarly, every topic is related to every other
topic. To represent document similarity as a network, you must decide
precisely <em>how similar</em> two documents need to be before they are
linked. A network with every document connected to every other document
is scarcely useful, so we generally set the cutoff such that each
document is linked to only a handful of others. This allows for easier
visualization and analysis, but it also destroys much of the rich data
that went into the topic model to begin with. This information can be
more fully preserved using other techniques, such as <a href="http://en.wikipedia.org/wiki/Multidimensional_scaling">multidimensional scaling</a>.</p>
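<p>To make the trade-off concrete, here is a minimal sketch of one possible linking rule: compare documents by the cosine similarity of their topic proportions, then keep only each document's <code>k</code> nearest neighbors. Cosine similarity and the top-<code>k</code> rule are my assumptions for illustration, as are the toy <code>doc_topics</code> values and function names; other similarity measures and cutoffs are equally defensible.</p>

```python
import math

# Hypothetical document-topic proportions for four documents.
doc_topics = {
    "doc_a": [0.70, 0.25, 0.05],
    "doc_b": [0.60, 0.30, 0.10],
    "doc_c": [0.05, 0.15, 0.80],
    "doc_d": [0.10, 0.10, 0.80],
}


def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_links(doc_topics, k=1):
    """Link each document to its k most similar documents (undirected)."""
    links = set()
    for doc, vector in doc_topics.items():
        scored = sorted(
            (
                (cosine(vector, other_vector), other)
                for other, other_vector in doc_topics.items()
                if other != doc
            ),
            reverse=True,
        )
        for _, other in scored[:k]:
            # Sort the pair so a-b and b-a count as the same link.
            links.add(tuple(sorted((doc, other))))
    return sorted(links)


all_pairs = len(doc_topics) * (len(doc_topics) - 1) // 2
links = top_k_links(doc_topics, k=1)
print(f"kept {len(links)} of {all_pairs} possible links: {links}")
```

<p>Every pair of these documents has some nonzero similarity, but the network keeps only the strongest ties. The discarded pairs are exactly the "rich data" that the thresholding step throws away.</p>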

<p>A somewhat more theoretical complication makes these network
representations useful as a tool for navigation, discovery, and
exploration, but not necessarily as evidentiary support. Creating a
network of a topic model of a set of documents piles on abstractions.
Each of these systems comes with very different assumptions, and it is
unclear what complications arise when combining these methods <em>ad hoc</em>.</p>

<h1 id="getting-started">Getting Started</h1>

<p>Although there may be issues with the process, the combination of
topic models and networks is sure to yield much fruitful research in the
digital humanities. There are some fantastic tutorials out there for
getting started with topic modeling in the humanities, such as Shawn
Graham’s post on <a href="http://electricarchaeologist.wordpress.com/2011/08/30/getting-started-with-mallet-and-topic-modeling/">Getting Started with MALLET and Topic Modeling</a>, as well as on combining them with networks, such as <a href="http://electricarchaeologist.wordpress.com/2011/11/11/topic-modeling-with-the-java-gui-gephi/">this post</a> from the same blog. Shawn is right to point out <a href="http://mallet.cs.umass.edu/">MALLET</a>,
a great tool for starting out, but you can also find the code used for
various models on many of the model-makers’ academic websites. One code
package that stands out is Chang’s <a href="http://cran.r-project.org/web/packages/lda/index.html">implementation of LDA and related models</a> in R.</p>

<p>Airoldi, Edoardo M., David M. Blei, Stephen E. Fienberg, and Eric P. Xing. 2008. “Mixed Membership Stochastic Blockmodels.” <em>The Journal of Machine Learning Research</em> 9 (June): 1981–2014. <a href="http://dl.acm.org/citation.cfm?id=1390681.1442798" title="Mixed Membership Stochastic Blockmodels">http://dl.acm.org/citation.cfm?id=1390681.1442798</a>.</p>

<p>AlSumait, Loulwah, Daniel Barbará, James Gentle,
and Carlotta Domeniconi. 2009. “Topic Significance Ranking of LDA
Generative Models.” In <em>Machine Learning and Knowledge Discovery in Databases</em>,
edited by Wray Buntine, Marko Grobelnik, Dunja Mladenić, and John
Shawe-Taylor, 5781:67–82. Berlin, Heidelberg: Springer Berlin
Heidelberg. <a href="http://www.springerlink.com/content/v3jth868647716kg/" title="Topic Significance Ranking of LDA Generative Models">http://www.springerlink.com/content/v3jth868647716kg/</a>.</p>

<p>Bamman, David, Brendan O’Connor, and Noah Smith. 2013. “Learning Latent Personas of Film Characters.” In <em>Proceedings of the Annual Meeting of the Association for Computational Linguistics</em>. Sofia, Bulgaria.</p>

<p>Binder, Jeffrey M., and Collin Jennings. 2014. “Visibility and Meaning in Topic Models and 18th-century Subject Indexes.” <em>Literary and Linguistic Computing</em> (May 7): fqu017. <a href="http://dx.doi.org/10.1093/llc/fqu017">doi:10.1093/llc/fqu017</a>. <a href="http://llc.oxfordjournals.org/content/early/2014/05/06/llc.fqu017" title="Visibility and meaning in topic models and 18th-century subject indexes">http://llc.oxfordjournals.org/content/early/2014/05/06/llc.fqu017</a>.</p>

<p>Blei, David M. 2012. “Probabilistic Topic Models.” <em>Communications of the ACM</em> 55 (4) (April 1): 77. <a href="http://dx.doi.org/10.1145/2133806.2133826">doi:10.1145/2133806.2133826</a>. <a href="http://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext" title="Probabilistic topic models">http://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext</a>.</p>

<p>Blei, David M. 2011. “Introduction to Probabilistic Topic Models.” <em>Communications of the ACM</em>.</p>

<p>Blei, David M., and John D. Lafferty. 2006. “Dynamic Topic Models.” In <em>Proceedings of the 23rd International Conference on Machine Learning</em>, 113–120. ICML  ’06. New York, NY, USA: ACM. <a href="http://dx.doi.org/10.1145/1143844.1143859">doi:10.1145/1143844.1143859</a>. <a href="http://doi.acm.org/10.1145/1143844.1143859" title="Dynamic topic models">http://doi.acm.org/10.1145/1143844.1143859</a>.</p>

<p>Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” <em>The Annals of Applied Statistics</em> 1 (1) (June 1): 17–35. <a href="http://www.jstor.org/stable/4537420" title="A Correlated Topic Model of Science">http://www.jstor.org/stable/4537420</a>.</p>

<p>Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” <em>J. Mach. Learn. Res.</em> 3 (March): 993–1022. <a href="http://dl.acm.org/citation.cfm?id=944919.944937" title="Latent dirichlet allocation">http://dl.acm.org/citation.cfm?id=944919.944937</a>.</p>

<p>Block, Sharon. 2006. “Doing More with Digitization.” <em>Common-Place</em> 6 (2) (January).</p>

<p>Boyd-Graber, Jordan, and David M. Blei. 2009. “Multilingual Topic Models for Unaligned Text.” In <em>Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence</em>, 75–82. UAI  ’09. Arlington, Virginia, United States: AUAI Press. <a href="http://dl.acm.org/citation.cfm?id=1795114.1795124" title="Multilingual topic models for unaligned text">http://dl.acm.org/citation.cfm?id=1795114.1795124</a>.</p>

<p>Broniatowski, David A., and Christopher L. Magee. 2011. “Towards a Computational Analysis of Status and Leadership Styles on FDA Panels.” In <em>Social Computing, Behavioral-Cultural Modeling and Prediction</em>, edited by John Salerno, Shanchieh Jay Yang, Dana Nau, and Sun-Ki Chai, 6589:212–218. Berlin, Heidelberg: Springer Berlin Heidelberg. <a href="http://www.springerlink.com/content/w655v786lp583660/" title="Towards a Computational Analysis of Status and Leadership Styles on FDA Panels">http://www.springerlink.com/content/w655v786lp583660/</a>.</p>

<p>Broniatowski, David A., and Christopher L. Magee. 2010. “Analysis of Social Dynamics on FDA Panels Using Social Networks Extracted from Meeting Transcripts.” In <em>2010 IEEE Second International Conference on Social Computing (SocialCom)</em>, 329–334. IEEE. <a href="http://dx.doi.org/10.1109/SocialCom.2010.54">doi:10.1109/SocialCom.2010.54</a>.</p>

<p>Chaney, Allison J.B., and David M. Blei. 2012. “Visualizing Topic Models.” In Dublin, Ireland.</p>

<p>Chang, Jonathan, and David M. Blei. 2010. “Hierarchical Relational Models for Document Networks.” <em>The Annals of Applied Statistics</em> 4 (1) (March): 124–150. <a href="http://dx.doi.org/10.1214/09-AOAS309">doi:10.1214/09-AOAS309</a>. <a href="http://projecteuclid.org/euclid.aoas/1273584450" title="Hierarchical relational models for document networks">http://projecteuclid.org/euclid.aoas/1273584450</a>.</p>

<p>Chang, Jonathan, and David M. Blei. 2009. “Relational Topic Models for Document Networks.” In <em>Proceedings of the 12th International Conference on AI and Statistics</em>. Clearwater Beach, Florida. <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.186.6279" title="Relational topic models for document networks">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.186.6279</a>.</p>

<p>Dietz, Laura, Steffen Bickel, and Tobias Scheffer. 2007. “Unsupervised Prediction of Citation Influences.” In <em>Proceedings of the 24th International Conference on Machine Learning</em>, 233–240. ICML  ’07. New York, NY, USA: ACM. <a href="http://dx.doi.org/10.1145/1273496.1273526">doi:10.1145/1273496.1273526</a>. <a href="http://doi.acm.org/10.1145/1273496.1273526" title="Unsupervised prediction of citation influences">http://doi.acm.org/10.1145/1273496.1273526</a>.</p>

<p>Erosheva, Elena, Stephen E. Fienberg, and John D. Lafferty. 2004. “Mixed-membership Models of Scientific Publications.” <em>Proceedings of the National Academy of Sciences</em> 101 (January 23): 5220–5227. <a href="http://dx.doi.org/10.1073/pnas.0307760101">doi:10.1073/pnas.0307760101</a>. <a href="http://www.pnas.org/content/101/suppl.1/5220.short" title="Mixed-membership models of scientific publications">http://www.pnas.org/content/101/suppl.1/5220.short</a>.</p>

<p>Gardner, Matthew J., Joshua Lutes, Jeff Lund,
Josh Hansen, Dan Walker, Eric Ringger, and Kevin Seppi. 2010. “The Topic
Browser: An Interactive Tool for Browsing Topic Models.” In .</p>

<p>Gerrish, Sean, and David M. Blei. 2010. “A Language-based Approach to Measuring Scholarly Impact.” In <em>Proceedings of the 26th International Conference on Machine Learning</em>. Haifa, Israel. <a href="http://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf" title="A language-based approach to measuring scholarly impact">http://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf</a>.</p>

<p>Gerrish, Sean, and David M. Blei. 2009. “Modeling
Influence in Text Corpora” presented at the NIPS Workshop on
Applications for Topic Models: Text and Beyond., Whistler, Canada.</p>

<p>Girolami, Mark, and Ata Kabán. 2003. “On an Equivalence Between PLSI and LDA.” In <em>Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</em>, 433–434. SIGIR  ’03. New York, NY, USA: ACM. <a href="http://dx.doi.org/10.1145/860435.860537">doi:10.1145/860435.860537</a>. <a href="http://doi.acm.org/10.1145/860435.860537" title="On an equivalence between PLSI and LDA">http://doi.acm.org/10.1145/860435.860537</a>.</p>

<p>Goldstone, Andrew, and Ted Underwood. 2012. “What
Can Topic Models of PMLA Teach Us About the History of Literary
Scholarship?” Blog. <em>ARCADE</em>. 12–14. <a href="http://arcade.stanford.edu/blogs/what-can-topic-models-pmla-teach-us-about-history-literary-scholarship" title="What can topic models of PMLA teach us about the history of literary scholarship?">http://arcade.stanford.edu/blogs/what-can-topic-models-pmla-teach-us-about-history-literary-scholarship</a>.</p>

<p>Gretarsson, Brynjar, John O’Donovan, Svetlin
Bostandjiev, Tobias Hollerer, Arthur Asuncion, David Newman, and
Padhraic Smyth. 2011. “TopicNets: Visual Analysis of Large Text Corpora
with Topic Modeling.” In <em>ACM Transactions on Intelligent Systems and Technology</em>, 5:1–26.</p>

<p>Hall, David, Daniel Jurafsky, and Christopher D. Manning. 2008. “Studying the History of Ideas Using Topic Models.” In <em>Proceedings of the Conference on Empirical Methods in Natural Language Processing</em>, 363–371. EMNLP  ’08. Stroudsburg, PA, USA: Association for Computational Linguistics. <a href="http://dl.acm.org/citation.cfm?id=1613715.1613763" title="Studying the history of ideas using topic models">http://dl.acm.org/citation.cfm?id=1613715.1613763</a>.</p>

<p>Jockers, Matthew. 2013. <em>Macroanalysis: Digital Methods and Literary History</em>. UIUC Press.</p>

<p>Laudun, John, and Jonathan Goodwin. 2013.
“Computing Folklore Studies: Mapping over a Century of Scholarly
Production through Topics.” <em>Journal of American Folklore</em> 126 (502) (Autumn): 455–475. <a href="http://dx.doi.org/10.1353/jaf.2013.0063">doi:10.1353/jaf.2013.0063</a>. <a href="http://muse.jhu.edu/login?auth=0&amp;type=summary&amp;url=/journals/journal_of_american_folklore/v126/126.502.laudun.html" title="Computing Folklore Studies: Mapping over a Century of Scholarly Production through Topics">http://muse.jhu.edu/login?auth=0&amp;type=summary&amp;url=/journals/journal_of_american_folklore/v126/126.502.laudun.html</a>.</p>

<p>McCallum, Andrew, Andrés Corrada-Emmanuel, and Xuerui Wang. 2005. “Topic and Role Discovery in Social Networks.” In <em>Proceedings of the 19th International Joint Conference on Artificial Intelligence</em>, 786–791. IJCAI’05. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. <a href="http://dl.acm.org/citation.cfm?id=1642293.1642419" title="Topic and role discovery in social networks">http://dl.acm.org/citation.cfm?id=1642293.1642419</a>.</p>

<p>McCallum, Andrew, Xuerui Wang, and Andrés
Corrada-Emmanuel. 2007. “Topic and Role Discovery in Social Networks
with Experiments on Enron and Academic Email.” <em>Journal of Artificial Intelligence Research</em> 30 (1) (October): 249–272. <a href="http://dl.acm.org/citation.cfm?id=1622637.1622644" title="Topic and role discovery in social networks with experiments on enron and academic email">http://dl.acm.org/citation.cfm?id=1622637.1622644</a>.</p>

<p>Mei, Qiaozhu, Deng Cai, Duo Zhang, and ChengXiang Zhai. 2008. “Topic Modeling with Network Regularization.” In <em>Proceeding of the 17th International Conference on World Wide Web</em>, 101–110. WWW  ’08. New York, NY, USA: ACM. <a href="http://dx.doi.org/10.1145/1367497.1367512">doi:10.1145/1367497.1367512</a>. <a href="http://doi.acm.org/10.1145/1367497.1367512" title="Topic modeling with network regularization">http://doi.acm.org/10.1145/1367497.1367512</a>.</p>

<p>Mimno, David. 2012. “Computational Historiography: Data Mining in a Century of Classics Journals.” <em>J. Comput. Cult. Herit.</em> 5 (1) (April): 3:1–3:19. <a href="http://dx.doi.org/10.1145/2160165.2160168">doi:10.1145/2160165.2160168</a>. <a href="http://doi.acm.org/10.1145/2160165.2160168" title="Computational historiography: Data mining in a century of classics journals">http://doi.acm.org/10.1145/2160165.2160168</a>.</p>

<p>Mimno, David, and Andrew McCallum. 2007. “Mining a Digital Library for Influential Authors.” In <em>Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries</em>, 105–106. JCDL  ’07. New York, NY, USA: ACM. <a href="http://dx.doi.org/10.1145/1255175.1255196">doi:10.1145/1255175.1255196</a>. <a href="http://doi.acm.org/10.1145/1255175.1255196" title="Mining a digital library for influential authors">http://doi.acm.org/10.1145/1255175.1255196</a>.</p>

<p>Nallapati, Ramesh M., Amr Ahmed, Eric P. Xing,
and William W. Cohen. 2008. “Joint Latent Topic Models for Text and
Citations.” In <em>Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</em>, 542–550. KDD  ’08. New York, NY, USA: ACM. <a href="http://dx.doi.org/10.1145/1401890.1401957">doi:10.1145/1401890.1401957</a>. <a href="http://doi.acm.org/10.1145/1401890.1401957" title="Joint latent topic models for text and citations">http://doi.acm.org/10.1145/1401890.1401957</a>.</p>

<p>Newman, David J., and Sharon Block. 2006. “Probabilistic Topic Decomposition of an Eighteenth-century American Newspaper.” <em>Journal of the American Society for Information Science and Technology</em> 57 (6) (April): 753–767. <a href="http://dx.doi.org/10.1002/asi.20342">doi:10.1002/asi.20342</a>. <a href="http://doi.wiley.com/10.1002/asi.20342" title="Probabilistic topic decomposition of an eighteenth-century American newspaper">http://doi.wiley.com/10.1002/asi.20342</a>.</p>

<p>Riddell, Allen B. 2012. “How to Read 22,198
Journal Articles: Studying the History of German Studies with Topic
Models.” In St. Louis, MO. <a href="http://ariddell.org/static/how-to-read-n-articles.pdf" title="How to Read 22,198 Journal Articles: Studying the History of German Studies with Topic Models">http://ariddell.org/static/how-to-read-n-articles.pdf</a>.</p>

<p>Rosen-Zvi, Michal, Thomas Griffiths, Mark
Steyvers, and Padhraic Smyth. 2004. “The Author-topic Model for Authors
and Documents.” In <em>Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence</em>, 487–494. UAI  ’04. Arlington, Virginia, United States: AUAI Press. <a href="http://dl.acm.org/citation.cfm?id=1036843.1036902" title="The author-topic model for authors and documents">http://dl.acm.org/citation.cfm?id=1036843.1036902</a>.</p>

<p>Rusch, Thomas, Paul Hofmarcher, Reinhold
Hatzinger, and Kurt Hornik. 2013. “Model Trees with Topic Model
Pre-processing: An Approach for Data Journalism Illustrated with the
WikiLeaks Afghanistan War Logs.” <em>The Annals of Applied Statistics</em>.</p>

<p>Steyvers, Mark, and Thomas Griffiths. 2006. “Probabilistic Topic Models.” In <em>Latent Semantic Analysis: A Road to Meaning</em>, edited by T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, 427:424–440.</p>

<p>Tangherlini, Timothy R., and Peter Leonard. 2014.
“Trawling in the Sea of the Great Unread: Sub-corpus Topic Modeling and
Humanities Research.” <em>Poetics</em>. <a href="http://dx.doi.org/10.1016/j.poetic.2013.08.002">doi:10.1016/j.poetic.2013.08.002</a>. <a href="http://www.sciencedirect.com/science/article/pii/S0304422X13000648" title="Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research">http://www.sciencedirect.com/science/article/pii/S0304422X13000648</a>.</p>

<p>Wang, Eric, Jorge Silva, Rebecca Willett, and
Lawrence Carin. 2011. “Dynamic Relational Topic Model for Social Network
Analysis with Noisy Links.” In <em>2011 IEEE Statistical Signal Processing Workshop (SSP)</em>, 497–500. IEEE. <a href="http://dx.doi.org/10.1109/SSP.2011.5967741">doi:10.1109/SSP.2011.5967741</a>.</p>

<hr />

<h2 id="reader-comments">Reader Comments</h2>

<blockquote>
  <p><strong>Ted Underwood</strong>, 2011-11-16 13:34</p>

  <p>Excellent, clear post, and I really appreciate the links to the Isaac Newton Chymistry project and TopicNets. Very helpful.</p>

  <p>I’m deeply into a variant of LSA at the moment, so I’m
disproportionately interested in a couple of details that most people
won’t care about. E.g., I’m not sure that most versions of LSA actually
use tf-idf scores in the term-doc matrix. I think the more common
version may use log-entropy weighting instead of tf-idf weighting.</p>

  <p>I actually prefer a different weighting scheme that I haven’t seen
used widely, which is basically Observed frequency – Expected frequency.
I would also argue that literary scholars are better off skipping the
Singular Value Decomposition step, for reasons explained here: <a href="http://tedunderwood.wordpress.com/2011/10/16/lsa-is-a-marvellous-tool-but-humanists-may-no-use-it-the-way-computer-scientists-do/">http://tedunderwood.wordpress.com/2011/10/16/lsa-is-a-marvellous-tool-but-humanists-may-no-use-it-the-way-computer-scientists-do/</a></p>

  <p>But to stop geeking out about LSA and return to the main point: very
helpful post. I haven’t yet tried the generative methods (pLSA and LDA),
because I’m so happy with LSA itself, but I know people are excited
about them and I intend to compare results systematically at some point
this winter.</p>
</blockquote>

<blockquote>
  <blockquote>
    <p><strong>scottbot</strong>, 2011-11-16 14:42</p>

    <p>Thanks! I think you’re right about the
tf-idf weighting, I just figured that people just approaching LSA would
be more familiar with tf-idf. I’ve added a note referencing your
comment, though, because the standard certainly ought to be mentioned.</p>

    <p>That’s a great post, I’ve never thought about the issues of SVD for
the purposes of the humanities. While SVD in LSA is still useful for
most of my historical retrieval needs, you make a very good point about
humanists needing to think very carefully about the nitty-gritty details
of algorithms that were built with other purposes in mind.</p>

    <p>Good luck on your generative model exploration – while pLSA can
technically be mathematically equivalent to LDA, it’s a lot more
bothersome and misses some of LDA’s functionality, so I’d definitely
recommend the latter. LDA and LSA definitely serve two very different
purposes; for yours, outlined in the anti-SVD post, LSA is probably more
well-suited.</p>

    <p>Thanks for the comments… I feel like I’ve come to the DH-Text
Analysis-Blog party late in the game, and I’ve been trying to read
through yours to catch up!</p>
  </blockquote>
</blockquote>

<blockquote>
  <p><strong>Matt Erlin</strong>, 2011-11-16 13:52</p>

  <p>Great post, Scott! I found the historical section particularly helpful for getting a sense of how topic modeling has evolved.</p>
</blockquote>

<blockquote>
  <blockquote>
    <p><strong>scottbot</strong>, 2011-11-16 14:43</p>

    <p>Thanks! Glad you found it useful.</p>
  </blockquote>
</blockquote>

<blockquote>
  <p><strong>Allen Riddell</strong>, 2011-11-21 15:37</p>

  <p>Great post. I’ve been hoping that
someone would explain how the extensibility of LDA makes it quite a
different kind of beast (relative to LSA).</p>

  <p>The original LDA paper is actually pretty good on this. Another good
place is the 2010 Rosen-Zvi, M., Griffiths, T., Steyvers, M., &amp;
Smyth, P. expanded write-up of the author-topic model. Once you get a
bit beyond LDA it’s clear there’s something being done that can’t be
done with LSA. Here’s the citation:</p>

  <p>Learning author-topic models from text corpora. M Rosen-Zvi, C
Chemudugunta, T Griffiths, P Smyth, M Steyvers, ACM Transactions on
Information Systems (TOIS), ACM, 2010. <a href="http://www.datalab.uci.edu/papers/AT_tois.pdf">http://www.datalab.uci.edu/papers/AT_tois.pdf</a></p>
</blockquote>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Ted Underwood rightly points out in the comments that other scoring systems
are often used in lieu of tf-idf, most frequently log entropy. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Yes yes, this is a simplification of actual LSA, but it’s pretty much how
it works. SVD reduces the size of the matrix to filter out noise, and
then each word row is treated as a vector shooting off in some
direction. The vector of each word is compared to every other word, so
that every pair of words has a relatedness score between them. Ted
Underwood has a <a href="http://tedunderwood.wordpress.com/2011/10/16/lsa-is-a-marvellous-tool-but-humanists-may-no-use-it-the-way-computer-scientists-do/">great blog post</a> about why humanists should avoid the SVD step.</p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>They’re not, of course. We’ll worry about that later.</p>
    </li>
  </ol>
</div>]]></content><author><name>Scott Weingart</name></author><category term="method" /><category term="data analysis" /><category term="digital humanities" /><category term="methodologies" /><category term="network analysis" /><category term="social networks" /><category term="text analysis" /><category term="topic modeling" /><summary type="html"><![CDATA[Note: The conversion of this scholarly blog post to a website (via markdown) was assisted with an LLM. Errors likely exist. To correct errors or to issue a copyright takedown request, please reach out to weingart.scott+dossier@gmail.com or create a pull request.]]></summary></entry></feed>