MLA Stylin’

Like most of you, I’ve been perched on my front porch every day for two weeks waiting for the new (3rd) edition of the MLA Style Manual and Guide to Scholarly Publishing to arrive. At long last, it came. Naturally, I sat down and read it cover to cover.

I don’t want to spoil it for anyone. I will note, however, that my lifelong dream of having one of my articles cited as an example of the use of italicized titles was not realized.

However, that insufferable slight was almost ameliorated by a new section on Fair Use that, in comparison to the cold informational tone of previous editions, almost rises to the level of protest. I particularly welcome the addition of such detailed explanations, which occur amidst frequent mention of the case law governing the Fair Use provision:

Congress intended the statutory provision . . . to restate the fair use doctrine that existed before the passage of the act, not to change, narrow, or enlarge it in any way, as the reports of the House and Senate committees make clear. Accordingly, all decisions of the courts before and after the 1976 Copyright Act are relevant to the determination of copyright law. [. . .] Furthermore, the Copyright Act makes no statement about the relative importance of the [four] factors, and the Supreme Court clarified in Campbell v. Acuff-Rose Music, Inc. (1994) that no one factor is more important than the others, nor must the use be supported by all four factors to be fair. (51)

And my favorite . . .

Although one occasionally hears that it is acceptable to use some percentage of the work or some specified number of words, neither the statute nor any regulation nor case law sanctions such guidelines on the quantity of material protected by copyright that may be taken without permission, and authors should not rely on them.

Actually, one doesn’t “occasionally hear” that it is acceptable. I have yet to encounter a library or department that doesn’t hand out a sheet describing exactly how many pages (or lines, or words) one can copy from a text before it violates what is widely understood to be the most important of the four factors (“The effect of the use on the potential market for or value of the copyrighted work”). “Very few,” is the message communicated by these guidelines, and yet most people I know understand it as an articulation of the Fair Use provision. Few seem to be aware that these guidelines were written by the Association of American Publishers — a trade association primarily concerned with protecting the industry — and have no basis in law.

It’s refreshing to see an articulation of Fair Use (put forth by a major scholarly society) that does not attempt to frighten authors into complying with the industry’s reading of the statute, but instead subtly urges American authors to assert their Fair Use rights as citizens engaged in “criticism, comment, news reporting, teaching . . ., scholarship, or research” (51). Perhaps we could excerpt this fine section (2.2.13) of the MLA Style Manual and hand it out in department copy centers as a replacement for the AAP’s manifesto?


Digital Campus

I was delighted to be a guest (along with Bill Turkel) on Digital Campus for their 25th episode. I haven’t listened to it yet — and so I’m not sure to what degree I made a fool of myself — but it was great fun to hang out with Bill, Dan, and Tom.

Digital Campus, of course, is the fantastic podcast put out by The Center for History and New Media at George Mason University.


The Race Car Bed

Some time ago, I posted an essay on craftsmanship that featured a piece of furniture that my father built. That essay turned out to be the most popular blog post I’ve written, and it also led to a number of emails expressing admiration for my father’s skills as a woodworker. It is with great pleasure, then, that I post this most recent example of my father’s work — a “race car” bed for my three-year-old nephew Angus:

[Photos: race_car1, race_car2, race_car3]

Now, I’m going to guess that most of my readers are over the age of thirty. But I know what you’re thinking: Can I have a race car bed?


High Performance Computing for English Majors

[HPC has been coming up a lot lately in conversations I've been having with other DH specialists. Or it was, before I went in for sinus surgery a week ago. I'm still recovering from that, and I'm not really sure about my ability to blog coherently. So please accept this essay from the archives. It's from a talk I gave at MLA in 2006.]

There are people in this world who spend untold amounts of time tweaking and tuning their cars for some perceived need for high performance that seldom materializes on roadways intended for passenger automobiles. They spend hours “modding” their rides: changing the gas-to-air ratio, boring out the cylinders, fiddling with the feathers and springs on the shock absorbers, and injecting nitrous oxide into the fuel line in order to get “Dude, like 450 horsepower” out of a sedan principally designed to ferry children to and from school.

I have precisely this relationship with computers. The latest chip, the fastest disks, the most efficient bus architectures all fill me with a kind of atavistic frisson. And once I lay my hands on the geek equivalent of NOS, I start rebuilding the kernel, changing the shared memory footprint, altering the thread model, reconfiguring the drive geometry, and adding optimization flags to my C compiler. It is true that my machines are often on the verge of melting, but that’s the price of perfection. There’s even a special version of the Linux kernel for bleeding-edge speed freaks called the “Love Kernel.” It’s essentially the standard Linux kernel with hundreds of high-speed performance patches applied indiscriminately. Here’s a quote from the README for the Love kernel:

IMPORTANT: steel300 and OneOfOne remind you that the patches here are sometimes experimental and could explode upon impact, make your [soda|pop] really bland, or other badness. We aren’t responsible for that, but we will mention that these patches will also make your kernel ROCK LIKE NINJA.

And that’s what I want to do. I want my computers to rock like ninja.

In a sense, ordinary training in software design is responsible for creating this insane desire for speed. The entire study of algorithms and data structures is framed by a concern with the trade-offs between time and space. If you undertake formal study of these matters, you find that much of what you’re doing is calculating the best and worst case scenarios for storage and retrieval within a particular data structure or under the strictures of a certain algorithm. After a while, you can’t help but equate faster, smaller, and more scalable with better.

But if you study software engineering and design methodology at any level of detail — or better yet, start writing production code — you quickly discover that this equation is downright dangerous. Code optimization is fine when you’re talking about a fake implementation of a sorting algorithm. In a large, complex system intended for actual users, however, premature optimization is more than likely to result in brittle, unreadable code. And this assumes that you understand where the bottlenecks are in the first place. This is why even a brief foray as a computational test pilot will cause one to develop certain rational instincts about code efficiency. You begin to lower the bar to something like “fast enough” in order to create code that is more easily maintained and understood. You begin to distrust any optimization that isn’t completely verifiable using profilers and benchmarking tools. You begin to realize that it might be safer and more efficient to drive the kids to school in a minivan. Or rather, you realize that this is the rational position, even as you irrationally try to break the sound barrier.

I have been writing software for use in the context of digital humanities for about ten years. During that time, I have written thousands of lines of code, but all of it has fallen neatly into one of two categories. Either it was intended to deliver data to the Web, or it was intended to perform some kind of data analysis operation offline. That covers a lot of different types of systems, of course. Sometimes the data being delivered to the Web consisted of reams of GIS data that had to be paired with text, styled, and delivered to a client framework that would render a real-time animated map. Sometimes the offline data analysis consisted of computing complex graph theoretical algorithms for the purpose of studying relationships within a corpus. But in the former case, network latency had the effect of making most of my shrewd optimizations seem futile. Why work for hours on some little speed hack when the processing that occurs prior to network delivery and rendering is only a small fraction of the total end-to-end userspace time? In the latter case, it really didn’t matter how long the analysis took. I was the only one who needed the data, and there really wasn’t any particular rush. Who cares if it takes fifteen minutes — or even fifteen hours — to crunch the numbers?

For the last few years, I have been giving talks in which I proclaim an “age of tools” in digital humanities, and the evangelium goes something like this: Over the last twenty years, we have spent millions digitizing texts and putting them online. The resulting digital full-text archives are among the greatest achievements in digital humanities. Yet for all their wonder, they remain committed to a vision of digital textuality firmly ensconced within the metaphor of the physical library. You can browse the text, read the text, search the text, and even download the text, but you can’t really do much beyond that. It is time to start thinking of ways to exploit this data with analytical tools and visualizations. Ideally, such tools should be an integral part of the experience of working with Web-based text collections.

Several of my colleagues in the field are working on something like this, including my fellow panelists [Greg Crane and Geoff Rockwell - ed.]. My own contribution is as a member of the Nora Project, which endeavors to implement the credo outlined above with an emphasis on particular varieties of text analysis — including, most significantly, data mining and machine learning algorithms. I won’t speak for Geoff and Greg, but I think I know why I’m here today talking about high performance. It’s because for the first time in my career, caffeine-addled speed optimizations seem not only warranted, but necessary.

They’re necessary, because when we talk about large, full-text archives empowered by text analytical tools and visualizations, we’re really talking about taking procedures traditionally thought of as batch-processing jobs and importing them into a world in which, as Jakob Nielsen famously noted, you have eight seconds to do something interesting.

Our data mining operations rely on massive matrices of data drawn from text corpora. For example, we might have a giant table (consisting of millions of cells) where one column is filled with word frequency counts, another one is filled with markers indicating the presence or absence of a certain feature, another is filled with ratios between nouns and verbs, and so on. We start out not knowing what any of this data really means, but we do know that texts (or parts of texts) in the corpus cluster in certain ways. There are genre distinctions, years of composition, different authors, different countries of origin. So we add one more column of data indicating the “label” for the particular text or text section. Text classification is the process of using statistics to figure out what patterns of low-level features conspire to make a text fit a particular label. So the usual method involves having a domain expert label some of the texts, and then setting the data mining algorithms loose on the rest of the matrix, so it can generate a set of predictive rules. If the rules are robust (and this is the exciting part) you should have a system that can correctly assign labels for texts it has never seen before. And, of course, the labels can be anything at all.
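To make the labeled-matrix idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the feature values are invented, the function names are hypothetical, and the nearest-centroid rule simply stands in for whatever data mining algorithm one would really use.

```python
from collections import defaultdict
from math import dist

# Hypothetical feature rows: (frequency of some word per 1,000 words,
# feature present? 1/0, noun-to-verb ratio). All values are invented.
labeled = [
    ((9.0, 1, 1.1), "comedy"),
    ((8.0, 1, 1.3), "comedy"),
    ((2.0, 0, 2.4), "history"),
    ((3.0, 0, 2.0), "history"),
]

def train(rows):
    """Average the feature vectors for each label (one centroid per label)."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*vectors))
        for label, vectors in grouped.items()
    }

def classify(centroids, features):
    """Assign the label whose centroid lies closest to the new text."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = train(labeled)
unseen = (7.5, 1, 1.2)  # a text the domain expert never labeled
predicted = classify(centroids, unseen)
```

The "rules" here are just two averaged rows, which is far cruder than a real classifier, but the shape is the same: labeled rows go in, and something that can label an unseen row comes out.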

We’ve used data mining to create things like systems that can detect eroticism and sentimentality in English poetry and prose. And as soon as we say that, two objections emerge immediately. First, “Do we really need a system that can tell us that a particular Shakespeare play is a history? Don’t we already know that?” And second, “Who decides what passages are erotic or sentimental in the first place?” The first objection is an entirely sensible one, but what really intrigues us is the fact that the system often gets it “wrong” in some thoroughly thrilling way. The first time we ran a data mining operation on Shakespeare, it calmly informed us that both Romeo and Juliet and Othello are comedies. The computer scientists on the team were ready to go back to the drawing board, but the literary critics were more excited than ever, because, of course, a number of influential critics have noted that these two plays follow the basic dramatic structure of comedy, and all we wanted to do was look at the generated rules to see what low-level features are complicit in this subtle moment of generic ambiguity. The second objection–“who decides what the labels are”–is also a sensible objection, but we have an easy answer to that one. The user should decide. The user should be able to choose what vectors go into the matrix, and choose the labels.

And that brings me, at long last, to the main topic of this panel. Because until recently, no one has thought of data mining as a live, interactive process. To undertake meaningful data mining on full-text archives of literary texts, you need to parse the XML documents, tokenize them, run a series of natural language processing algorithms (to determine things like parts of speech), check them against a gazetteer (for named-entity resolution), and then crunch all the numbers. Then you need to assemble all of that data into a matrix. Then you need to do the actual data mining algorithm. Then you need to deliver it to the client and render it. This always takes hours, and it occasionally takes days. If you’re offline, it doesn’t matter (though even offline, you want to come to this problem fully armed with high-performance equipment). Online, it violates Nielsen’s eight-second rule in a way that borders on the grotesque.
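A toy version of the first few stages might look like the sketch below. It is only an illustration under stated assumptions: the XML sample, the two-entry gazetteer, the vocabulary, and the function names are all invented, and the part-of-speech step is left out entirely because it would need a real NLP library.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical inputs: a tiny XML document and a two-entry gazetteer.
SAMPLE_XML = (
    "<text><title>Sonnet</title>"
    "<l>Verona loves Venice</l><l>and Verona sleeps</l></text>"
)
GAZETTEER = {"verona", "venice"}

def tokenize(xml_string):
    """Parse the XML, drop the markup, and split the text into lowercase words."""
    root = ET.fromstring(xml_string)
    text = " ".join(root.itertext())
    return re.findall(r"[a-z']+", text.lower())

def extract_features(tokens, vocabulary):
    """Build one matrix row: a count per vocabulary word plus a place-name count."""
    counts = Counter(tokens)
    row = [counts[word] for word in vocabulary]
    row.append(sum(1 for token in tokens if token in GAZETTEER))
    return row

vocabulary = ["verona", "loves", "sleeps"]
tokens = tokenize(SAMPLE_XML)
matrix = [extract_features(tokens, vocabulary)]  # one row per document
```

Even at this scale the stages are visibly sequential: nothing can be counted until the markup is parsed, and no row exists until everything has been counted, which is why the full process behaves like a batch job.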

It’s possible to approach the optimization of this process in a thoroughly rational manner. First, you look at the whole end-to-end system and try to divide the operation into things that bind early and things that bind late. There’s no reason to parse the XML data and do the feature extraction live. All of that can be done at the pre-processing stage and loaded into a datastore of some kind. It might take days to do that, but if you’re clever, you can get a ton of “canned” data ready to be loaded into a matrix for analysis. After you’ve done that, you can think about ways to minimize the amount of data the system has to analyze, perhaps by segmenting the data in such a way that the system has less material to sort through as it loads the matrix. You might then look for obvious inefficiencies in the analysis layer itself, and try to optimize those as much as you can (without creating brittle, difficult-to-understand code). Finally, you can figure out ways to distribute the analytical process across multiple processors.
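The sketch below walks through three of those steps in miniature: canning features ahead of time, loading only one segment, and splitting the arithmetic across workers. It is a hedged illustration, not a description of our system; the corpus, the table layout, and the function names are hypothetical, and an in-memory SQLite database stands in for "a datastore of some kind."

```python
import json
import sqlite3
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical corpus: (document id, genre segment, text). All invented.
CORPUS = [
    ("doc1", "poetry", "the rose the thorn"),
    ("doc2", "poetry", "the rose again"),
    ("doc3", "prose", "the long road home"),
]

def can_features(conn, corpus):
    """Bind early: compute word counts once and store them as JSON."""
    conn.execute("CREATE TABLE features (doc TEXT, segment TEXT, counts TEXT)")
    rows = [
        (doc, segment, json.dumps(Counter(text.split())))
        for doc, segment, text in corpus
    ]
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", rows)

def load_segment(conn, segment):
    """Bind late: pull only the canned rows for the requested segment."""
    cursor = conn.execute(
        "SELECT counts FROM features WHERE segment = ? ORDER BY doc", (segment,)
    )
    return [Counter(json.loads(counts)) for (counts,) in cursor]

def total_counts(counters, workers=2):
    """Split the rows into chunks, sum each chunk in a worker, then merge."""
    chunks = [counters[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum(chunk, Counter()), chunks)
    return sum(partials, Counter())

conn = sqlite3.connect(":memory:")
can_features(conn, CORPUS)
poetry_totals = total_counts(load_segment(conn, "poetry"))
```

One caveat about the sketch itself: threads in CPython will not speed up CPU-bound counting, so a process pool or separate machines would be needed for real gains. The thread pool is used here only to show the split-and-merge shape.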

We’ve done all of that. We’ve canned it, chunked it, speed-hacked it, and even figured out a way to multithread the process across any arbitrary number of processors. The resulting system is dazzlingly fast. It’s just not fast enough for the Web. And so it is time, we think, to turn to some serious hardware.

And when we say serious, we’re not talking about expensive servers (we’ve got those). We’re talking about seriously expensive servers–distributed clusters of the sort that are used for things like particle physics, weather simulation, and the video rendering for Attack of the Clones. And that’s a problem.

It’s a problem, because in the context of a university, “high-performance computing” isn’t a technical term at all. It’s a financial act of faith made by very senior members of the administration, and a site of intense territorial protection by the “hard” scientists who help to make that act of faith seem less fraught with religious peril. A bunch of English professors who want to get into high-performance computing need to convince administrators that they should get a piece of the pie, and they need to convince the physicists that literary critics have just as much of a right to these resources as anyone else. Which should be an easy matter. All we need to do is talk to the people who are exploring the origins of the universe, and ask them to step aside for a moment while we look for dirty words in Dickinson.

And, of course, we won’t be asking them to step aside “for a moment.” Nearly everything done on these systems represents a batch job. The experiment (or the video rendering task) might take a long time, but it usually has a beginning and an end. We’re talking about ongoing processes running on a kind of supercollider Web server. Perhaps we need our own high-performance cluster? But then, who pays for such a thing? Digital humanities can bring in grant dollars, but most of the funding agencies we deal with are loath to fund even moderate amounts of overhead. Perhaps we are in over our heads.

Now, I’ve already confessed to being a semi-delusional, speed-obsessed maniac. Perhaps all of this represents nothing more than the idle fantasy of someone who wants “Dude, like 450 million words per second.” Surely, there’s much that we can do to bring about the age of tools without pouring millions of dollars into hardware. Why be so ambitious at this early stage? Do we really need to be thinking about high-performance computing for English majors?

I think we do need to be thinking about it–not because it’s a thing we need to have today, but because it’s a battle we’re going to need to fight tomorrow. To get where we are now in terms of text collections, we had to fight for resources that were unheard of among humanists. We were successful in that effort, not because we came up with outstanding technical arguments, but because we succeeded in effecting a cultural change at our institutions. We were able to convince Vice Presidents for Research that we could attract students and grant dollars. We were able to convince University Presidents that digital humanities was something of wide interest to the public (not to mention donors). We were able to convince library Deans that research efforts in this area could pay dividends in terms of prestige. And finally, we were able to convince our own professional societies (including the MLA) that scholarship in this area was essential to the future of the academy (witness, for example, that most astonishing of documents, the “Guidelines for Evaluating Work with Digital Media” put out by the MLA this year).

Of course, one need not act like a Ninja in order to rock like one. The best way to get into the high-stakes game of high-performance computing is to create compelling reasons to participate. I continue to believe that bringing analytical procedures to existing digital archives–particularly those that are as easy to use as search engines–is a worthy, if ambitious goal. Shadetree mechanics might have little hope of building their own highways, but clever digital humanists, by remaining committed to broad visions of the power of full-text archives, might well create the conditions in which high-performance becomes an ordinary part of our work as a discipline.


Being Trendy

The Nation has a piece by William Deresiewicz called “Professing Literature in 2008,” which presents a review of the Twentieth Anniversary edition of Gerald Graff’s Professing Literature. Mostly, it’s the usual screed, though it does make a few points that even this pre-tenured radical might have to concede. I mention it here only to draw attention to the review’s most exhilarating moment — an “analysis” of the MLA job list:

There have always been trends in literary criticism, but the major trend now is trendiness itself, trendism, the desperate search for anything sexy. Contemporary lit, global lit, ethnic American lit; creative writing, film, ecocriticism–whatever. There are postings here for positions in science fiction, in fantasy literature, in children’s literature, even in something called “digital humanities.”


Language and Dictatorship

The Arc forum has become my favorite list lately. There’s the usual stuff going on: feature requests, requests for help with specific coding problems, code examples, polls, challenges, and so forth. In this sense, it’s more or less like any other hacker forum. But there’s something else going on that’s very exciting.

Arc isn’t the first attempt to re-imagine Lisp outside of the Scheme and Common Lisp standards, but because Paul Graham — a very important voice in the Lisp community — is behind it, I think there’s an unusual amount of attention being paid to it. It’s also not a finished product (far from it), and so I think everyone is having a lot of fun trying to imagine what it might be. Graham’s writing the code, obviously, but the forum is full of bold thinking about where it might go, and I think that many of these ideas will come to influence the future of Arc.

One of the more interesting subjects to come up lately has been the issue of “dictatorship” versus what we might call “pluralism.” Now, when hackers use the word “dictatorship,” they don’t mean that in a bad way. When we say that a language is run by a dictatorship, we mean that its canonical form is determined by an individual or by a core group of developers. Perl is run this way, as are Ruby and Python (by Larry Wall, Yukihiro “Matz” Matsumoto, and Guido van Rossum, respectively). There aren’t dozens of implementations of these languages; there’s really only one of each. Few such languages are formally standardized, because there isn’t really a need to do so. If there were one relational database implementation in the world, I doubt very much that anyone would feel the need for an SQL standard. In a sense, standardization is a bit like radical Athenian democracy. We standardize not because everyone agrees, but because everyone does not agree. Various parties compromise in order to prevent any one faction from overwhelming the others. (This is in part why I find the impulse within the XML community toward creating “standards” for things that are not widely used or are brand new to be entirely baffling. Now that’s dictatorship!)

Until I started writing Lisp, I hadn’t really encountered anything other than dictatorship. One might argue C is run through pluralism, but the platforms I’ve worked on basically have one canonical implementation of the C compiler (gcc), so the effect is pretty much the same. But languages like Scheme and Common Lisp are quite different. There are standards for both languages, but there are dozens of implementations. And in the case of Scheme, the various implementations can differ wildly while still adhering to the (deliberately minimal) R5RS standard. Common Lisp has a much broader and more comprehensive standard, but even there, if you want to write code that can run under any implementation, you’re probably going to have to write some code that will check to see how things work in each of the major implementations.

Some people find this quite intolerable. They’d really like to be able to write code in CL or Scheme and have it run on any compliant implementation. Such people find the situation of Scheme to be particularly onerous, since the diversity of implementations really means that we need to speak not of Scheme, but “PLT Scheme scheme,” “Chicken Scheme scheme,” and “Bigloo Scheme scheme.” For these people, Arc feels like a chance to stop the chaos.

I’m personally not bothered by this, simply because I find that the kinds of things I write don’t need to run on multiple implementations, and the implementations I use tend to have everything I need. Most of the coding I do is intended for my own use, and while it would be nice not to have to ask my users to install a particular implementation, I don’t think many people find that qualification onerous. However, I can well understand why others would find this very frustrating. And everyone (including me) finds the situation at least mildly frustrating when it comes to third-party modules. If you write a module for doing, say, XML, shouldn’t it run on any Scheme interpreter/compiler out there?

I honestly don’t think Arc will achieve this, and I think the reasons involve an interesting mixture of the cultural and the technical.

The diversity of Lisp implementations is breathtaking. Some run as interpreters, some compile to C, some compile to native code, and some do all three. Some are designed to be embedded, others more or less ignore the OS. Some are designed to be “small and pure,” while others try to build into the language everything that would normally be in third party libraries. Some emphasize the ability to work with C libraries, while others are focused on the Web. And of course, there are versions not only for the major operating systems, but for handheld devices, the Java Virtual Machine, and microcontrollers. Incompatibilities abound, and yet if you’re, say, a Scheme programmer, you can probably find an implementation that seems highly optimized for whatever you’re trying to do.

All of this is facilitated by the most salient aspect of Lisp itself. Lisp has been aptly called “a programmable programming language,” because unlike the descendants of Algol, Lisp offers the programmer the ability to alter the syntax of the language itself with very little effort. In his book, Peter Seibel talks about how with most languages, the implementation of a new language feature almost always involves a drawn-out process (which, in the case of a dictatorial language, can always be vetoed). And it’s easy to see why. A new language feature is almost certainly going to involve changing the compiler or the interpreter itself (and perhaps the external libraries as well). With Lisp, you just add it. This puts Lisp in the enviable position of being able to tack on — in a kind of borg-like fashion — any new trick that comes along. If you like Ruby’s for-each loop, and would like to add it to Scheme, you can do so in about five minutes. But the same goes for object-oriented programming, aspect-oriented programming, meta programming, annotations, or whatever else is the flavor of the month. You don’t even have to settle on a particular style of, say, OO. You could write six different styles of OO and let the programmer just choose one that strikes his or her fancy. (One recalls Alan Kay’s famous quip, “I invented the term Object-Oriented, and I can tell you I did not have C++ in mind”). In a sense, every non-trivial Lisp program is a fork of the language.

Because Lisp is this way, it’s very hard to get people to settle on an implementation. If you don’t like the Perl module system, you have a couple of choices. You can either re-implement Perl or you can just suffer with what you have. With Lisp, it’s just too easy to change something like that, because it’s easy to change pretty much anything. And Lisps are (comparatively) easy to implement, so the low barriers to change extend even to matters that relate to the compiler itself. In fact, one could argue that this is precisely what Graham is doing — using Lisp to create a new Lisp that he likes. I suspect I’ll like it too. And so will lots of other people. But there will be many people who sorta like it. And they’re going to turn it into something they love.

Graham’s going to implement Arc in a way that makes sense to him. Being Paul Graham, he’s probably going to implement Arc in Arc (this, after all, is the guy who named his company after the Y combinator). I predict that people will then complain that it’s too slow — or too big, or too minimal, or too maximal, or too terse, or something else. But there just won’t be any strong disincentive to go do something about it. And so a thousand flowers will bloom.

Of course, there are lots of things Graham and others could do to ensure that code that runs on one implementation runs on all the others, but I’m not sure that dictatorship, per se, is going to be a workable answer to the problem. Lisp programmers, perhaps more than any other kinds of hackers, really don’t like being told what to do. And they have the language to back them up.


Graham’s Arc

I will admit to being a huge (read, fawning) fan of Paul Graham. I’ve never met him, but I think his essays are terrific (especially the ones on Lisp). He clearly has a ninth-degree black belt in programming, and yet he’s one of the most clear-headed prose writers I’ve read when it comes to dilating on complicated technical subjects. I, personally, would have no idea how Lisp macros — the single most powerful and mind-bending concept I’ve encountered in programming — work without the benefit of his book On Lisp, which has been circulating freely as digital samizdat on the Web for many years. Since I aspire to be both a black belt programmer and a clear-headed writer, I tend to take what he says and does very seriously indeed.

I’m not alone, of course, and so when Graham announced that he was creating a new dialect of Lisp (called “Arc”), I and many others got very excited. We were all particularly intrigued by the statement of design philosophy that appeared on his home page a couple of years ago, and have been waiting with bated breath ever since.

Well, Arc is out. And with its release (more of a stable pre-release) came a torrent of criticism. In fact, the criticism began years ago (Arc is vaporware, etc.). I won’t rehearse the criticism here; I’ll just make the general point that people have a lot of damn gall.

As far as I can tell, Paul Graham’s reasons for writing Arc are the same as for any voluntary development project. He’d like to give himself the tool he wishes he had, he’d like to amuse himself intellectually, he’d like to explore various aspects of design, he’d like to play around with some ideas he’s had for years, and he’d also like to contribute something useful to the world. These seem to me excellent reasons for doing just about anything, and it’s nice that he’s chosen to throw in the last one.

There are several legitimate reasons to feel miffed about software. You might be frustrated with the release schedule or the number of bugs or the fact that such-and-such a feature isn’t implemented yet. You might find the community that supports it unhelpful or the authors arrogant and imperious toward their users. You might think the whole thing is wrong from the start.

If it’s a commercial piece of software for which you’ve paid money, I can understand angry charges and criticisms. But FREE (as in beer, as in freedom, as in range, whatever) software? What gives anyone the right to shoot their mouth off like this?

It’s not that people can’t have these opinions. It’s that they air them so freely and with so little charity. Graham would be well within his rights to take his toys and go home, after all. He wouldn’t be the first FOSS developer to do so. It seems like once a month some talented hacker writes a farewell letter in which he or she admits that the constant battering is starting to wear them down.

I’m not suggesting that people stop critiquing software and design philosophies. Nor am I suggesting that everyone suppress their natural frustrations. I’m merely suggesting that all such critiques and frustrations be firmly wound up in a gentle cloak of charity, good will, and constructiveness. It might be true that the developer can tell a jackass from a contributor, but he or she might not. And the consequence of beating up on people in this way is the suppression of the motives I outlined above. Do we really want a world in which people don’t play for love of the game, but only because they’re compelled to by some other force?

In that piece on design philosophy, Graham wrote, “The great languages have been the ones that good programmers designed for their own use– C, Smalltalk, Lisp.” I think he’s right about that, and I think we could probably generalize that sentiment to lots of areas of human endeavor. Fortunately for us, lots of people with that idea have drawn the perfectly humane conclusion that what’s useful to them might be useful to others. If drawing that conclusion brings you nothing but grief, there’s no loss in really making it something for your own use. After all, couldn’t you just work on the project for love of the game and keep it to yourself?

If I were Graham, I might be asking myself why I’m even bothering. And that bothers me.

Language Games

Stéfan Sinclair, one of DH’s most talented hackers, responded to the last post with a question about the “dominant language” in DH. I suggested it was Java, but (as I noted in the comment thread), I really have no basis for saying that. Here’s Stéfan:

I’d say (as unempirically) that there are more DH projects developing (not just using) code with scripting languages like PHP, than there are with Java, especially outside of the larger centres. But maybe this is just speculation based on what I see as the norm for project cycles: a researcher (or small team) gets some funding which includes money for hiring some research assistants – often graduate students – who are more likely competent in PHP (or Ruby or Perl) than in Java. Moreover, I think DH projects tend to favour getting something up fairly quickly over design and robustness; another reason why scripting languages would be more prominent. There are centres with dedicated staff willing and able to work in Java, but that seems to me a relatively rare luxury.

I didn’t mean to nitpick regarding a minor part of the post, I was just curious. I don’t think it’s crowning a champion programming language that matters, it’s more about how the reality of the DH research environment should be influencing the curriculum (or perhaps there are too few of us teaching programming for it to matter that much).

I think Stéfan is probably right, but of course, both of us are relying on general impressions.

When you look at the development of computer science curricula over the last few decades, you can clearly see that discipline responding to industry pressures. Scheme might remain the language of Hal Abelson’s venerable 6.001 course at MIT, but in general, we’ve watched CS go from Pascal, to C, to C++, to Java as the demand for engineers with particular “skills” has changed out in the real world. I don’t know many CS professors who think C++ and Java are good teaching languages, though. Many of the professors I’ve talked to actually long for the days of Pascal, but they also know that students will object to learning a language that isn’t in common use.

When I started out in programming, I asked a friend (a very skilled hacker who was pursuing a Ph.D. in CS at the time) what language I should learn first. I remember very clearly his answer, which was something like, “Oh, I don’t know. Just pick one. It doesn’t matter. How about a fake one? There are lots of cool pseudo-languages out there that will teach you what you need to know . . .” I thought he was 100% insane, and I was astonished that this guy (who was and is a master programmer, in addition to being a highly skilled theorist) would suggest something so obviously out of touch with reality. A few years later, I realized that he was 100% right. What’s important in programming is the concepts. If the goal is to learn programming and software engineering, the language literally doesn’t matter. Abelson and Sussman’s Structure and Interpretation of Computer Programs (the textbook for 6.001) not only uses Scheme but avoids explicitly teaching the language per se. It’s perhaps the best book I’ve ever read on programming.

DH, thankfully, is not burdened with “industry pressures” in the way CS is. When I tell my students that it’s the concepts that are important, they believe me. They believed me even when none of their programming friends knew what Ruby was. I think they’d still believe me if I suggested that we all learn Haskell or Miranda. And that’s a good thing.

Still, it would be very useful to know what languages people regularly work with in DH. If we knew the answer to that question, those of us who teach programming for the Humanities (and I know there are only a few of us out there) could perhaps structure the teaching of the concepts in such a way as to make it easy to transfer that knowledge into other kinds of languages. I do this a little bit already, by occasionally pointing out the ways in which different languages — like C or Javascript — approach some concept that we’re studying in Ruby. I do that in part to emphasize that the fundamental concepts of programming don’t change drastically from language to language.
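A rough sketch of that kind of comparison, written in Python rather than Ruby purely for illustration (the function names and the sample sentence are made up for this example). The same concept, tallying the words in a line of text, is expressed three ways: as an explicit index loop of the sort C encourages, as a fold that passes a function around much as a Ruby block would, and as a call into the standard library.

```python
from collections import Counter
from functools import reduce


def count_words_loop(text):
    """Explicit index-based loop, close to how C would walk an array."""
    words = text.lower().split()
    counts = {}
    i = 0
    while i < len(words):
        word = words[i]
        counts[word] = counts.get(word, 0) + 1
        i += 1
    return counts


def count_words_fold(text):
    """A fold driven by a helper function, similar in spirit to a Ruby block."""
    def add_word(counts, word):
        counts[word] = counts.get(word, 0) + 1
        return counts

    return reduce(add_word, text.lower().split(), {})


def count_words_library(text):
    """Let the standard library's Counter do the bookkeeping."""
    return dict(Counter(text.lower().split()))


if __name__ == "__main__":
    # Hypothetical sample input, chosen only to exercise the functions.
    line = "the quick fox and the lazy dog and the cat"
    for counter in (count_words_loop, count_words_fold, count_words_library):
        print(counter.__name__, counter(line))
```

All three functions return the same dictionary for the same input. Only the surface syntax differs; the underlying idea of accumulating a result while walking a collection stays the same.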

It would also be good to know what languages people use in DH, because it might help us to focus development where it’s most needed. We’ve been talking quite intensively about “tools” over the last few years in the DH community, but I’ve always felt that “tools” might best be thought of as an alias for libraries and APIs.

It’s not possible to do a scientific poll on a blog like this, but I suspect my readership has enough hackers in the ranks to get some good anecdotal information.

So how about it? What languages do you use? What languages do the people around you use? What languages are people telling you you should know? I’d love to hear about it!

Feelin’ Groovy

Well, it’s time to pick up another programming language.

I’ve been doing this about once a year since the late nineties, and in that time, my motivations for programming glossolalia have changed a few times. At first, I think I mostly wanted to jump on the latest bandwagon. I was also highly susceptible to arguments from more experienced hackers about how such-and-such a language was the Greatest Thing Ever. But after a while, I found myself wanting to study languages because I find them completely fascinating. Nowadays, I think it’s just a way to expand my thinking about what languages are and what they can do.

But honestly, after doing this ten or twelve times, I find that I’m beginning to stray into the outer rim. What’s next? Haskell? Erlang? I had more or less decided that I’d do one of those over the summer, but a couple of days ago I stumbled on Groovy.

What’s Groovy? A language with an absurd name, for one. But beyond that, Groovy (according to the home page):

  • is an agile and dynamic language for the Java Virtual Machine
  • builds upon the strengths of Java but has additional power features inspired by languages like Python, Ruby and Smalltalk
  • makes modern programming features available to Java developers with almost-zero learning curve
  • supports Domain-Specific Languages and other compact syntax so your code becomes easy to read and maintain
  • makes writing shell and build scripts easy with its powerful processing primitives, OO abilities and an Ant DSL
  • increases developer productivity by reducing scaffolding code when developing web, GUI, database or console applications
  • simplifies testing by supporting unit testing and mocking out-of-the-box
  • seamlessly integrates with all existing Java objects and libraries
  • compiles straight to Java bytecode so you can use it anywhere you can use Java

I’ve used a few alternative languages for the JVM, including JRuby, Jython, and Kawa. I mean, what’s not to like? The syntax of Ruby, Python, or (be still my beating heart) Scheme with all the might and magic of the Java class library? What could be better?

In my experience, the original languages and runtime environments are better. I don’t know what it is. I always feel like the Javish stuff just doesn’t belong there somehow, like it’s some kind of crude hack. And before I get hate mail, let me say that these implementations are not crude hacks. They’re very skillfully done. There’s just something about the mixture that doesn’t sit well with me.

I haven’t started looking closely at Groovy yet, but I can see the general plan. They’ve taken some of the best ideas from languages like Ruby, Perl, Python, and Smalltalk, kept the general contours of Java’s C-ish syntax, and built a scripting language that has all the soul-stirring power and ubiquity of the Java class library in a nice little package. And, of course, it’s trivially easy to embed Groovy in Java or to integrate them in some other way.

Now, I might find it annoying (that is: slow, or poorly documented, or whatever). But if it really does have the ease of use and expressive power of a good scripting language, I may start teaching it in my Digital Humanities courses. I’ve been teaching Ruby to English majors for a number of years now, and I still think it’s a magnificent teaching language. But if Groovy were similarly good for teaching and allowed an easy migration path for students interested in Java, I just might be sold. Java is the dominant language in DH, and while a few of my students have been able to pick up Java after learning Ruby, it would be nice to have a language that was semantically closer to Java. I could imagine a class that goes through Groovy, and then ends by starting off with Java. But then, Groovy really would have to be groovy.

They will have to change the name, though. Can I really have a course description that says we’ll be learning Groovy?

I’ll report back when I’ve played with it some more.

Zenware

As an addendum to last week’s discussion of writing workflows, I offer a quote, cribbed directly from Matt Wood at 43 Folders, which is in turn taken from Jeffrey MacIntyre’s The Tao of Screen over at Slate. How’s that for connectivity? To wit:

There’s an emerging market for programs that introduce much-needed traffic calming to our massively expanding desktops. The name for this genre of clutter-management software: zenware.

The philosophy behind zenware is to force the desktop back to its Platonic essence. There are several strategies for achieving this, but most rely on suppressing the visual elements you’re used to: windows, icons, and toolbars. The applications themselves eschew pull-down menus or hide off-screen while you work. Even if you consider yourself inured to their presence, the theory goes, you’ll benefit most from their absence.

This explains, perhaps, the sudden interest in stripped-down word processing environments for professional writers (like Scrivener, which is gradually becoming the easel of my intellectual life). It also explains why I was so happy (and, dare I say, productive) using the BlackBox window manager on Linux for the better part of ten years. Something about its spartan landscape made me want to sink into the eremitic confines of words and code. When I see “iconistan” (MacIntyre’s phrase) on someone else’s desktop, I sometimes wonder how they manage to get through the day.

It strikes me that the Zen interface does not necessarily mean the simple interface. The Linux command-line (or is that “koan-line?”) is hardly simple, and yet it continues to fill me with pleasure (or is that “satori?”) — in part because there’s nothing to do there but work and think.

I’m hiding everything from now on.
