Digital Campus

digital humanities — sramsay @ 11:25 am

Continuing this week's media blitz . . .

The folks at the Center for History and New Media have, at their extreme peril, invited me to be an "irregular" on the Digital Campus podcast (think of a shirt that is discounted because it's missing a button). This week, I joined Dan Cohen, Mills Kelly, Tom Scheinfeldt, and fellow irregular Bryan Alexander (Research Director for NITLE) in an episode entitled "Theremin Dreams."

Critical Code Studies

digital humanities — sramsay @ 9:47 pm

I was very pleased to be invited, a month or so ago, to be a contributor to the Critical Code Studies blog (maintained by Mark C. Marino at USC). In fact, I was so pleased that I actually wrote something, which, although it probably diminishes the overall quality of the discussion considerably, nonetheless expresses my hope that just as literary studies began (according to one pataphysical genealogy) with belles-lettres, so critical code studies might have its own tradition of bit-lettristic writing.

I have a lot more to say on that subject, actually, but it will have to wait. I am so very, very far from inbox zero.

The essay is called, "Tim Toady Bicarbonate."

Humanities APIs

digital humanities — sramsay @ 7:08 am

I am very pleased to be attending the Workshop on Application Programming Interfaces for the Digital Humanities sponsored by SSHRC and hosted by the amazing Bill Turkel in his role as a member of NiCHE.

Here are a few things I'm thinking about going into Day 2:

  1. In talking about APIs, we're necessarily talking about access and the political and cultural issues that surround access to cultural heritage materials. It's one thing for a library (say) to make some data collection available and to allow you to browse, search, and display it in various ways. It's another thing to allow other people to come along and create their own ways of browsing, searching, viewing (which is what API access is really about). I think we need to insist on the necessity of this form of access as essential to the future of digital work in the humanities and social sciences. At the same time, we need to be respectful of those who are understandably nervous about it. How do we articulate the benefits of this kind of access? How do we persuade content providers that this kind of access is good for the institutions that provide it, and not just for the people who take advantage of the new entry point?
  2. There's a movable wall when it comes to APIs. I heard a lot of people yesterday describing elaborate ideas about data mining with textual resources (or something similarly ambitious), but in every case, I noticed that the idea was predicated not on access to a series of data points, but on access to the entire dataset. This raises a fundamental question (for designers) on where you put the "wall" between the resource and the user. You could imagine an API that had a single function called "get_all()". Call that, and you can mirror the entire dataset and do what you like. You could also have an API with dozens of highly granular hooks that return nicely formatted data structures, and so forth. The former is undoubtedly the most flexible, but it's also the hardest to work with (particularly if you're a novice programmer). But again, it's a kind of shifting wall. If it's data mining you're after, you could do all that mining back on the archive side and make the results available through the (highly granular) API. These aren't mutually exclusive, of course; Flickr, for example, offers both kinds. Still, I think that considering this helps to highlight some of the design challenges one encounters with APIs in general.
  3. I think we need to think more carefully about "impedance mismatches" between data sources. There was a lot of talk yesterday about mashing this humanities resource to that humanities resource, but I think there were also some hand-waving assumptions (I was guilty as much as anyone) about the degree to which that data is tractable from an interoperability standpoint. Some of the most successful web service APIs are successful, I think, because the data is simple and easy to work with (lat/longs, METAR data, stats arranged as key-value pairs, etc.). Humanities resources are often quite a bit more complicated, and there's far less agreement about how that data should be formatted. It's true that the TEI (for example) provides a degree of metadata standardization, but it's mostly silent about how the content itself should be formatted. That is, when you actually look at the content of the "tags" (whether it's XML or something else entirely), you find that people are defining things at radically different levels of granularity and with different ordering schemes. I don't want to declare that the sky is falling; I just want to point out that some of this might be quite a bit more difficult than it sounds. And it's a tough problem, because defining complicated interoperability standards in this space really does, in my opinion, run against the spirit of the thing.
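
To make the "movable wall" from the second point a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the Archive class, its toy records, and every method name other than get_all() are inventions for illustration, not any real archive's API.

```python
from collections import Counter


class Archive:
    """A toy in-memory archive; the records and method names are hypothetical."""

    def __init__(self, records):
        self._records = list(records)

    # Wall pushed all the way out: hand over everything and let the
    # caller mirror the dataset and do the analysis locally.
    def get_all(self):
        return [dict(record) for record in self._records]

    # Wall pulled in: a granular hook that returns a small, ready-made answer.
    def titles_by_author(self, author):
        return [r["title"] for r in self._records if r["author"] == author]

    # Mining done back on the archive side, exposed through a granular hook.
    def word_counts(self, title):
        for record in self._records:
            if record["title"] == title:
                return Counter(record["text"].lower().split())
        raise KeyError(title)


# Placeholder data, made up for this sketch.
archive = Archive([
    {"author": "Author A", "title": "Text 1", "text": "the cat sat on the mat"},
    {"author": "Author A", "title": "Text 2", "text": "a dog barked"},
    {"author": "Author B", "title": "Text 3", "text": "rain fell all night"},
])

everything = archive.get_all()                 # flexible, but the caller does all the work
titles = archive.titles_by_author("Author A")  # convenient, but only answers one question
counts = archive.word_counts("Text 1")         # analysis performed on the archive side
```

The same object can offer both kinds of access, which is the sense in which the two designs are not mutually exclusive.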

I've had a wonderful time at this gathering, which includes so many talented librarians, scholars, and hackers (many of whom manage to combine all three skill sets). I can't help but think that great things will come of this.

What Have We Gotten Ourselves Into?

digital humanities — sramsay @ 8:58 pm

The blog stats are simply astounding. Tens — nay, teens — of people finding their way to my blog. Do they seek enlightenment through cartoonish dialogues? Solemn meditations on cooking? Pronouncements on programming languages and aesthetics?

No, silly, they want the syllabus!

The course enrollment is bursting at the seams (as is my email with override requests from prospective students). So let me say right now: the subject is dull, the assignments are impossible, and the professor is a jerk.

Oh, who am I kidding? The subject is fascinating and the professor is a sweetheart! (Sorry, the assignments really are impossible). He looks forward to welcoming everyone on Monday.

Day in the Life

digital humanities — sramsay @ 10:17 am

Today, I'm blogging on a different site as part of the Day in the Life of the Digital Humanities. There's an RSS feed, if you'd like to drink from the fire hose.

I'm already doing it wrong, of course — my first post is way too long, and it won't help with any kind of auto-ethnographic anything. But then, I'm skeptical toward this whole thing. And I'm on Spring Break.

The No-Reading Seminar

digital humanities — sramsay @ 3:22 pm

In my digital humanities classes, I always try to combine the technical with the philosophical (which, I believe, is one of the things that characterizes DH as a discipline). So, we'll often study control structures on Monday and Wednesday, and then spend Friday talking about new media theory and digital humanities more generally. In the first semester, we read mostly excerpts and articles (McLuhan, Bush, Licklider, Turing, Hayles, Bolter, McCarty, Manovich, Kirschenbaum, and Rockwell show up pretty regularly). In the second semester, however, I usually suggest that we focus on one or two texts — preferably, some very difficult texts. Last semester we read a good bit of A Thousand Plateaus (we planned to read Badiou's Being and Event, but didn't get to it).

This semester, I had a bit of a brainstorm and suggested to the students that we might read Heidegger's "The Question Concerning Technology," but read it only in class with each other. In other words, no one is allowed to read the text outside of class. We all bring a copy of the essay, but then we put a version up on the screen for everyone to read, and we each take turns reading paragraphs. They liked this idea.

We've now done it twice and have made it all the way to the eighth paragraph of the essay. I'm not at all bothered by the slow pace, because I truly think that this is one of the most enlightening class discussions I've ever been a part of (either as a student or a teacher).

What do we talk about? Mostly, we try to make sure that we understand Heidegger (this is a very difficult essay even relative to Heidegger's already demanding corpus). But the real thrill is that we end up thinking deeply about whether we agree or disagree with him, about our own definitions of technology, about causality, definition, ontology, and the tradition in which we're reading. I walk out of the room thinking, "Now that's a discussion," while firmly believing that the professor is only a very small part of what's going on.

As far as I can tell, the students are also finding it enlightening. We may burn out as winter turns to spring, but for now, I am being reminded every Friday of what the classroom is all about.

Digital Campus

digital humanities — sramsay @ 1:26 pm

I was delighted to be a guest (along with Bill Turkel) on Digital Campus for their 25th episode. I haven't listened to it yet — and so I'm not sure to what degree I made a fool of myself — but it was great fun to hang out with Bill, Dan, and Tom.

Digital Campus, of course, is the fantastic podcast put out by The Center for History and New Media at George Mason University.

High Performance Computing for English Majors

digital humanities — sramsay @ 7:19 pm

[HPC has been coming up a lot lately in conversations I've been having with other DH specialists. Or it was, before I went in for sinus surgery a week ago. I'm still recovering from that, and I'm not really sure about my ability to blog coherently. So please accept this essay from the archives. It's from a talk I gave at MLA in 2006.]

There are people in this world who spend untold amounts of time tweaking and tuning their cars for some perceived need for high performance that seldom materializes on roadways intended for passenger automobiles. They spend hours "modding" their rides: changing the gas-to-air ratio, boring out the cylinders, fiddling with the feathers and springs on the shock absorbers, and injecting nitrous oxide into the fuel line in order to get "Dude, like 450 horsepower" out of a sedan principally designed to ferry children to and from school.

I have precisely this relationship with computers. The latest chip, the fastest disks, the most efficient bus architectures all fill me with a kind of atavistic frisson. And once I lay my hands on the geek equivalent of NOS, I start rebuilding the kernel, changing the shared memory footprint, altering the thread model, reconfiguring the drive geometry, and adding optimization flags to my C compiler. It is true that my machines are often on the verge of melting, but that's the price of perfection. There's even a special version of the Linux kernel for bleeding-edge speed freaks called the "Love Kernel." It's essentially the standard Linux kernel with hundreds of high-speed performance patches applied indiscriminately. Here's a quote from the README for the Love kernel:

IMPORTANT: steel300 and OneOfOne remind you that the patches here are sometimes experimental and could explode upon impact, make your [soda|pop] really bland, or other badness. We aren't responsible for that, but we will mention that these patches will also make your kernel ROCK LIKE NINJA.

And that's what I want to do. I want my computers to rock like ninja.

In a sense, ordinary training in software design is responsible for creating this insane desire for speed. The entire study of algorithms and data structures is framed by a concern with the trade-offs between time and space. If you undertake formal study of these matters, you find that much of what you're doing is calculating the best and worst case scenarios for storage and retrieval within a particular data structure or under the strictures of a certain algorithm. After a while, you can't help but equate faster, smaller, and more scalable with better.

But if you study software engineering and design methodology at any level of detail — or better yet, start writing production code — you quickly discover that this equation is downright dangerous. Code optimization is fine when you're talking about a fake implementation of a sorting algorithm. In a large, complex system intended for actual users, however, premature optimization is more than likely to result in brittle, unreadable code. And this assumes that you understand where the bottlenecks are in the first place. This is why even a brief foray as a computational test pilot will cause one to develop certain rational instincts about code efficiency. You begin to lower the bar to something like "fast enough" in order to create code that is more easily maintained and understood. You begin to distrust any optimization that isn't completely verifiable using profilers and benchmarking tools. You begin to realize that it might be safer and more efficient to drive the kids to school in a minivan. Or rather, you realize that this is the rational position, even as you irrationally try to break the sound barrier.

I have been writing software for use in the context of digital humanities for about ten years. During that time, I have written thousands of lines of code, but all of it has fallen neatly into one of two categories. Either it was intended to deliver data to the Web, or it was intended to perform some kind of data analysis operation offline. That covers a lot of different types of systems, of course. Sometimes the data being delivered to the Web consisted of reams of GIS data that had to be paired with text, styled, and delivered to a client framework that would render a real-time animated map. Sometimes the offline data analysis consisted of computing complex graph theoretical algorithms for the purpose of studying relationships within a corpus. But in the former case, network latency had the effect of making most of my shrewd optimizations seem futile. Why work for hours on some little speed hack when the processing that occurs prior to network delivery and rendering is only a small fraction of the total end-to-end userspace time? In the latter case, it really didn't matter how long the analysis took. I was the only one who needed the data, and there really wasn't any particular rush. Who cares if it takes fifteen minutes — or even fifteen hours — to crunch the numbers?

For the last few years, I have been giving talks in which I proclaim an "age of tools" in digital humanities, and the evangelium goes something like this: Over the last twenty years, we have spent millions digitizing texts and putting them online. The resulting digital full-text archives are among the greatest achievements in digital humanities. Yet for all their wonder, they remain committed to a vision of digital textuality firmly ensconced within the metaphor of the physical library. You can browse the text, read the text, search the text, and even download the text, but you can't really do much beyond that. It is time to start thinking of ways to exploit this data with analytical tools and visualizations. Ideally, such tools should be an integral part of the experience of working with Web-based text collections.

Several of my colleagues in the field are working on something like this, including my fellow panelists [Greg Crane and Geoff Rockwell - ed.]. My own contribution is as a member of the Nora Project, which endeavors to implement the credo outlined above with an emphasis on particular varieties of text analysis — including, most significantly, data mining and machine learning algorithms. I won't speak for Geoff and Greg, but I think I know why I'm here today talking about high-performance. It's because for the first time in my career, caffeine-addled speed optimizations seem not only warranted, but necessary.

They're necessary because when we talk about large, full-text archives empowered by text analytical tools and visualizations, we're really talking about taking procedures traditionally thought of as batch-processing jobs and importing them into a world in which, as Jakob Nielsen famously noted, you have eight seconds to do something interesting.

Our data mining operations rely on massive matrices of data drawn from text corpora. For example, we might have a giant table (consisting of millions of cells) where one column is filled with word frequency counts, another one is filled with markers indicating the presence or absence of a certain feature, another is filled with ratios between nouns and verbs, and so on. We start out not knowing what any of this data really means, but we do know that texts (or parts of texts) in the corpus cluster in certain ways. There are genre distinctions, years of composition, different authors, different countries of origin. So we add one more column of data indicating the "label" for the particular text or text section. Text classification is the process of using statistics to figure out what patterns of low-level features conspire to make a text fit a particular label. So the usual method involves having a domain expert label some of the texts, and then setting the data mining algorithms loose on the rest of the matrix, so it can generate a set of predictive rules. If the rules are robust (and this is the exciting part) you should have a system that can correctly assign labels for texts it has never seen before. And, of course, the labels can be anything at all.
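
For readers who would like to see the shape of such a matrix, here is a toy sketch in Python. It is only an illustration under loud assumptions: the numbers are made up, the three columns simply mirror the examples above, and the nearest-centroid classifier is a deliberately simple stand-in rather than the rule-generating algorithms the project actually uses.

```python
import math

# Each row: (word-frequency count, feature present?, noun/verb ratio),
# followed by the label supplied by a "domain expert".
# All values are invented for illustration.
labeled_rows = [
    ((12.0, 1.0, 0.8), "comedy"),
    ((10.0, 1.0, 0.9), "comedy"),
    ((3.0, 0.0, 1.6), "tragedy"),
    ((2.0, 0.0, 1.8), "tragedy"),
]


def train(rows):
    """Average the feature vectors for each label (one centroid per label)."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(centroids, features):
    """Assign the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


centroids = train(labeled_rows)

# A row the system has never seen before, with no label attached.
unseen = (11.0, 1.0, 1.0)
predicted = classify(centroids, unseen)
```

The labels here happen to be genres, but nothing in the code cares what they are; swapping in "erotic" and "not erotic" would change only the strings.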

We've used data mining to create things like systems that can detect eroticism and sentimentality in English poetry and prose. And as soon as we say that, two objections emerge immediately. First, "Do we really need a system that can tell us that a particular Shakespeare play is a history? Don't we already know that?" And second, "Who decides what passages are erotic or sentimental in the first place?" The first objection is an entirely sensible one, but what really intrigues us is the fact that the system often gets it "wrong" in some thoroughly thrilling way. The first time we ran a data mining operation on Shakespeare, it calmly informed us that both Romeo and Juliet and Othello are comedies. The computer scientists on the team were ready to go back to the drawing board, but the literary critics were more excited than ever, because, of course, a number of influential critics have noted that these two plays follow the basic dramatic structure of comedy, and all we wanted to do was look at the generated rules to see what low-level features are complicit in this subtle moment of generic ambiguity. The second objection — "who decides what the labels are" — is also a sensible objection, but we have an easy answer to that one. The user should decide. The user should be able to choose what vectors go into the matrix, and choose the labels.

And that brings me, at long last, to the main topic of this panel. Because until recently, no one has thought of data mining as a live, interactive process. To undertake meaningful data mining on full-text archives of literary texts, you need to parse the XML documents, tokenize them, run a series of natural language processing algorithms (to determine things like parts-of-speech), check them against a gazetteer (for named-entity resolution), and then crunch all the numbers. Then you need to assemble all of that data into a matrix. Then you need to run the actual data mining algorithm. Then you need to deliver it to the client and render it. This always takes hours, and it occasionally takes days. If you're offline, it doesn't matter (though even offline, you want to come to this problem fully armed with high-performance equipment). Online, it violates Nielsen's eight-second rule in a way that borders on the grotesque.

It's possible to approach the optimization of this process in a thoroughly rational manner. First, you look at the whole end-to-end system and try to divide the operation into things that bind early and things that bind late. There's no reason to parse the XML data and do the feature extraction live. All of that can be done at the pre-processing stage and loaded into a datastore of some kind. It might take days to do that, but if you're clever, you can get a ton of "canned" data ready to be loaded into a matrix for analysis. After you've done that, you can think about ways to minimize the amount of data the system has to analyze, perhaps by segmenting the data in such a way that the system has less material to sort through as it loads the matrix. You might then look for obvious inefficiencies in the analysis layer itself, and try to optimize those as much as you can (without creating brittle, difficult-to-understand code). Finally, you can figure out ways to distribute the analytical process across multiple processors.
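
The split between what binds early and what binds late can be sketched in a few lines of Python. This is a toy under stated assumptions, not our actual system: the two-document corpus, the genre attribute, and the function names are all hypothetical, and the "datastore" is just a dictionary in memory. Distribution across processors is left out entirely.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature corpus; a real archive would be XML files on disk.
CORPUS = {
    "play1": "<text genre='comedy'><p>Love and wit and love again.</p></text>",
    "play2": "<text genre='tragedy'><p>Death and grief and death.</p></text>",
}


def extract_features(xml_string):
    """Slow, early-binding step: parse the XML, tokenize, count words."""
    root = ET.fromstring(xml_string)
    words = re.findall(r"[a-z']+", "".join(root.itertext()).lower())
    return {"genre": root.get("genre"), "counts": dict(Counter(words))}


def preprocess(corpus):
    """Run once, offline; the result is the 'canned' datastore."""
    return {doc_id: extract_features(xml) for doc_id, xml in corpus.items()}


def load_matrix(datastore, vocabulary, genre=None):
    """Fast, late-binding step: build rows from canned data only,
    optionally restricted to one segment (here, a single genre)."""
    rows = []
    for doc_id, record in sorted(datastore.items()):
        if genre is None or record["genre"] == genre:
            rows.append([record["counts"].get(word, 0) for word in vocabulary])
    return rows


datastore = preprocess(CORPUS)                # might take days on a real corpus
full_matrix = load_matrix(datastore, ["love", "death"])
comedy_only = load_matrix(datastore, ["love", "death"], genre="comedy")
```

Nothing in load_matrix touches the XML, which is the whole point: the expensive parsing and feature extraction have already happened by the time a user asks for anything.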

We've done all of that. We've canned it, chunked it, speed-hacked it, and even figured out a way to multithread the process across any arbitrary number of processors. The resulting system is dazzlingly fast. It's just not fast enough for the Web. And so it is time, we think, to turn to some serious hardware.

And when we say serious, we're not talking about expensive servers (we've got those). We're talking about seriously expensive servers — distributed clusters of the sort that are used for things like particle physics, weather simulation, and the video rendering for Attack of the Clones. And that's a problem.

It's a problem, because in the context of a university, "high-performance computing" isn't a technical term at all. It's a financial act of faith made by very senior members of the administration, and a site of intense territorial protection by the "hard" scientists who help to make that act of faith seem less fraught with religious peril. A bunch of English professors who want to get into high-performance computing need to convince administrators that they should get a piece of the pie, and they need to convince the physicists that literary critics have just as much of a right to these resources as anyone else. Which should be an easy matter. All we need to do is talk to the people who are exploring the origins of the universe, and ask them to step aside for a moment while we look for dirty words in Dickinson.

And, of course, we won't be asking them to step aside "for a moment." Nearly everything done on these systems represents a batch job. The experiment (or the video rendering task) might take a long time, but it usually has a beginning and an end. We're talking about ongoing processes running on a kind of supercollider Web server. Perhaps we need our own high-performance cluster? But then, who pays for such a thing? Digital humanities can bring in grant dollars, but most of the funding agencies we deal with are loath to fund even moderate amounts of overhead. Perhaps we are in over our heads.

Now, I've already confessed to being a semi-delusional, speed-obsessed maniac. Perhaps all of this represents nothing more than the idle fantasy of someone who wants "Dude, like 450 million words per second." Surely, there's much that we can do to bring about the age of tools without pouring millions of dollars into hardware. Why be so ambitious at this early stage? Do we really need to be thinking about high-performance computing for English majors?

I think we do need to be thinking about it — not because it's a thing we need to have today, but because it's a battle we're going to need to fight tomorrow. To get where we are now in terms of text collections, we had to fight for resources that were unheard of among humanists. We were successful in that effort, not because we came up with outstanding technical arguments, but because we succeeded in effecting a cultural change at our institutions. We were able to convince Vice Presidents for Research that we could attract students and grant dollars. We were able to convince University Presidents that digital humanities was something of wide interest to the public (not to mention donors). We were able to convince library Deans that research efforts in this area could pay dividends in terms of prestige. And finally, we were able to convince our own professional societies (including the MLA) that scholarship in this area was essential to the future of the academy (witness, for example, that most astonishing of documents, the "Guidelines for Evaluating Work with Digital Media" put out by the MLA this year).

Of course, one need not act like a Ninja in order to rock like one. The best way to get into the high-stakes game of high-performance computing is to create compelling reasons to participate. I continue to believe that bringing analytical procedures to existing digital archives — particularly those that are as easy to use as search engines — is a worthy, if ambitious, goal. Shadetree mechanics might have little hope of building their own highways, but clever digital humanists, by remaining committed to broad visions of the power of full-text archives, might well create the conditions in which high-performance computing becomes an ordinary part of our work as a discipline.

Language and Dictatorship

digital humanities — sramsay @ 12:27 pm

The Arc forum has become my favorite list lately. There's the usual stuff going on: feature requests, requests for help with specific coding problems, code examples, polls, challenges, and so forth. In this sense, it's more or less like any other hacker forum. But there's something else going on that's very exciting.

Arc isn't the first attempt to re-imagine Lisp outside of the Scheme and Common Lisp standards, but because Paul Graham — a very important voice in the Lisp community — is behind it, I think there's an unusual amount of attention being paid to it. It's also not a finished product (far from it), and so I think everyone is having a lot of fun trying to imagine what it might be. Graham's writing the code, obviously, but the forum is full of bold thinking about where it might go, and I think that many of these ideas will come to influence the future of Arc.

One of the more interesting subjects to come up lately has been the issue of "dictatorship" versus what we might call "pluralism." Now, when hackers use the word "dictatorship," they don't mean that in a bad way. When we say that a language is run by a dictatorship, we mean that its canonical form is determined by an individual or by a core group of developers. Perl is run this way, as are Ruby and Python (by Larry Wall, Yukihiro "Matz" Matsumoto, and Guido van Rossum, respectively). There aren't dozens of implementations of these languages; there's really only one of each. Few such languages are formally standardized, because there isn't really a need to do so. If there were one relational database implementation in the world, I doubt very much that anyone would feel the need for an SQL standard. In a sense, standardization is a bit like radical Athenian democracy. We standardize not because everyone agrees, but because everyone does not agree. Various parties compromise in order to prevent any one faction from overwhelming the others. (This is in part why I find the impulse within the XML community toward creating "standards" for things that are not widely used or are brand new to be entirely baffling. Now that's dictatorship!)

Until I started writing Lisp, I hadn't really encountered anything other than dictatorship. One might argue C is run through pluralism, but the platforms I've worked on basically have one canonical implementation of the C compiler (gcc), so the effect is pretty much the same. But languages like Scheme and Common Lisp are quite different. There are standards for both languages, but there are dozens of implementations. And in the case of Scheme, the various implementations can differ wildly while still adhering to the (deliberately minimal) R5RS standard. Common Lisp has a much broader and more comprehensive standard, but even there, if you want to write code that can run under any implementation, you're probably going to have to write some code that will check to see how things work in each of the major implementations.

Some people find this quite intolerable. They'd really like to be able to write code in CL or Scheme and have it run on any compliant implementation. Such people find the situation of Scheme to be particularly onerous, since the diversity of implementations really means that we need to speak not of Scheme, but "PLT Scheme scheme," "Chicken Scheme scheme," and "Bigloo Scheme scheme." For these people, Arc feels like a chance to stop the chaos.

I'm personally not bothered by this, simply because I find that the kinds of things I write don't need to run on multiple implementations, and the implementations I use tend to have everything I need. Most of the coding I do is intended for my own use, and while it would be nice not to have to ask my users to install a particular implementation, I don't think many people find that qualification onerous. However, I can well understand why others would find this very frustrating. And everyone (including me) finds the situation at least mildly frustrating when it comes to third-party modules. If you write a module for doing, say, XML, shouldn't it run on any Scheme interpreter/compiler out there?

I honestly don't think Arc will achieve this, and I think the reasons involve an interesting mixture of the cultural and the technical.

The diversity of Lisp implementations is breathtaking. Some run as interpreters, some compile to C, some compile to native code, and some do all three. Some are designed to be embedded, others more or less ignore the OS. Some are designed to be "small and pure," while others try to build into the language everything that would normally be in third party libraries. Some emphasize the ability to work with C libraries, while others are focused on the Web. And of course, there are versions not only for the major operating systems, but for handheld devices, the Java Virtual Machine, and microcontrollers. Incompatibilities abound, and yet if you're, say, a Scheme programmer, you can probably find an implementation that seems highly optimized for whatever you're trying to do.

All of this is facilitated by the most salient aspect of Lisp itself. Lisp has been aptly called "a programmable programming language," because unlike the descendants of Algol, Lisp offers the programmer the ability to alter the syntax of the language itself with very little effort. In his book, Peter Seibel talks about how, with most languages, the implementation of a new language feature almost always involves a drawn-out process (which, in the case of a dictatorial language, can always be vetoed). And it's easy to see why. A new language feature is almost certainly going to involve changing the compiler or the interpreter itself (and perhaps the external libraries as well). With Lisp, you just add it. This puts Lisp in the enviable position of being able to tack on — in a kind of borg-like fashion — any new trick that comes along. If you like Ruby's for-each loop, and would like to add it to Scheme, you can do so in about five minutes. But the same goes for object-oriented programming, aspect-oriented programming, metaprogramming, annotations, or whatever else is the flavor of the month. You don't even have to settle on a particular style of, say, OO. You could write six different styles of OO and let the programmer just choose one that strikes his or her fancy. (One recalls Alan Kay's famous quip, "I invented the term Object-Oriented, and I can tell you I did not have C++ in mind"). In a sense, every non-trivial Lisp program is a fork of the language.

Because Lisp is this way, it's very hard to get people to settle on an implementation. If you don't like the Perl module system, you have a couple of choices. You can either re-implement Perl or you can just suffer with what you have. With Lisp, it's just too easy to change something like that, because it's easy to change pretty much anything. And Lisps are (comparatively) easy to implement, so the low barriers to change extend even to matters that relate to the compiler itself. In fact, one could argue that this is precisely what Graham is doing — using Lisp to create a new Lisp that he likes. I suspect I'll like it too. And so will lots of other people. But there will be many people who sorta like it. And they're going to turn it into something they love.

Graham's going to implement Arc in a way that makes sense to him. Being Paul Graham, he's probably going to implement Arc in Arc (this, after all, is the guy who named his company after the Y combinator). I predict that people will then complain that it's too slow, or too big, or too minimal, or too maximal, or too terse, or something else. But there just won't be any strong disincentive to go do something about it. And so a thousand flowers will bloom.

Of course, there are lots of things Graham and others could do to ensure that code that runs on one implementation runs on all the others, but I'm not sure that dictatorship, per se, is going to be a workable answer to the problem. Lisp programmers, perhaps more than any other kinds of hackers, really don't like being told what to do. And they have the language to back them up.

Language Games

digital humanities, programming — sramsay @ 10:17 am

Stéfan Sinclair, one of DH's most talented hackers, responded to the last post with a question about the "dominant language" in DH. I suggested it was Java, but (as I noted in the comment thread), I really have no basis for saying that. Here's Stéfan:

I’d say (as unempirically) that there are more DH projects developing (not just using) code with scripting languages like PHP, than there are with Java, especially outside of the larger centres. But maybe this is just speculation based on what I see as the norm for project cycles: a researcher (or small team) gets some funding which includes money for hiring some research assistants – often graduate students – who are more likely competent in PHP (or Ruby or Perl) than in Java. Moreover, I think DH projects tend to favour getting something up fairly quickly over design and robustness; another reason why scripting languages would be more prominent. There are centres with dedicated staff willing and able to work in Java, but that seems to me a relatively rare luxury.

I didn’t mean to nitpick regarding a minor part of the post, I was just curious. I don’t think it’s crowning a champion programming language that matters, it’s more about how the reality of the DH research environment should be influencing the curriculum (or perhaps there are too few of us teaching programming for it to matter that much).

I think Stéfan is probably right, but of course, both of us are relying on general impressions.

When you look at the development of computer science curricula over the last few decades, you can clearly see that discipline responding to industry pressures. Scheme might remain the language of Hal Abelson's venerable 6.001 course at MIT, but in general, we've watched CS go from Pascal, to C, to C++, to Java as the demand for engineers with particular "skills" has changed out in the real world. I don't know many CS professors who think C++ and Java are good teaching languages, though. Many of the professors I've talked to actually long for the days of Pascal, but they also know that students will object to learning a language that isn't in common use.

When I started out in programming, I asked a friend (a very skilled hacker who was pursuing a Ph.D. in CS at the time) what language I should learn first. I remember very clearly his answer, which was something like, "Oh, I don't know. Just pick one. It doesn't matter. How about a fake one? There are lots of cool pseudo-languages out there that will teach you what you need to know . . ." I thought he was 100% insane, and I was astonished that this guy, who was and is a master programmer in addition to being a highly skilled theorist, would suggest something so obviously out of touch with reality. A few years later, I realized that he was 100% right. What's important in programming is the concepts. If the goal is to learn programming and software engineering, the language literally doesn't matter. Abelson and Sussman's Structure and Interpretation of Computer Programs (the textbook for 6.001) not only uses Scheme, but avoids explicitly teaching the language, per se. It's perhaps the best book I've ever read on programming.

DH, thankfully, is not burdened with "industry pressures" in the way CS is. When I tell my students that it's the concepts that are important, they believe me. They believed me even when none of their programming friends knew what Ruby was. I think they'd still believe me if I suggested that we all learn Haskell or Miranda. And that's a good thing.

Still, it would be very useful to know what languages people regularly work with in DH. If we knew the answer to that question, those of us who teach programming for the Humanities (and I know there are only a few of us out there) could perhaps structure the teaching of the concepts in such a way as to make it easy to transfer that knowledge into other kinds of languages. I do this a little bit already, by occasionally pointing out the ways in which different languages (like C or JavaScript) approach some concept that we're studying in Ruby. I do that in part to emphasize that the fundamental concepts of programming don't change drastically from language to language.

It would also be good to know what languages people use in DH, because it might help us to focus development where it's most needed. We've been talking quite intensively about "tools" over the last few years in the DH community, but I've always felt that "tools" might best be thought of as an alias for libraries and APIs.

It's not possible to do a scientific poll on a blog like this, but I suspect my readership has enough hackers in the ranks to get some good anecdotal information.

So how about it? What languages do you use? What languages do the people around you use? What languages are people telling you you should know? I'd love to hear about it!

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2009 Stephen Ramsay