Web-based IDE’s and AST Formats

I noticed yesterday that GitHub is using ACE to edit files. Apparently ACE is a descendant of the Bespin project.

In a previous post I described a script I had written that parses python into an AST represented using JSON. I’m looking for budding attempts at standardizing such representations. Does ACE have something appropriate?

As I mentioned in that previous post, this was just a small piece of a project on source code search algorithms. It’s a fun project, but when I think about why more progress hasn’t been made by others, the lack of a good standard for representing AST’s seems to be the biggest barrier.

As much as I respect the work of Xtext, MPS and Spoofax, it seems to me that the key to unlocking all this potential is creating AST interchange formats that Javascript can easily manipulate.

Five years ago the strategies with the most buzz were 1) ATerm and 2) MOF. I don’t imagine that projects like ACE will ever give the time of day to MOF. And I’m not aware of any plans to build a Javascript library for ATerm.

Ometa/JS was a cool attempt at creating language-oriented tools in a browser, but as far as I know, it never gained a big following.

The hardest part about the vast majority of software these days is not performance optimization or correctness or proving properties about some fancy type system. It’s decomposing the problem into components that the world is ready to digest.

Many of the existing attempts at multi-language IDE’s have just been too big. I’m thinking of the “libraries not frameworks” meme that has been making the rounds on Twitter lately. I’d like to be able to embed text-editing panes for language mash-ups and DSL’s in my web applications. The big GUI tools out there at the moment can’t be carved up to suit that purpose.

On the other hand, some attempts like the syntax highlighter I’m now using on this blog don’t seem ambitious enough. Will this project ever evolve into a more powerful tool, or will it be content to simply highlight text? The problem is that, as anyone who has used syntax highlighting in text editors like emacs knows, there are frequently edge cases that stymie anything other than a full description of the language.

Projects like ACE seem to be in the right position to take the lead. I’ll update this post if I find any answers.

Reporting with MongoDB

I gave a talk at the 2011 Iowa Code Camp on Saturday, April 30th.

Given a lot of events like this in MongoDB:

db.events.insert({
    report: "file1",
    time: { dateHour: "2011042412",
            date: "20110424",
            month: "201104" },
    data: { product: "coffee",
            age: "young",
            height: "tall",
            gender: "m",
            mood: "happy" }
})

And “pivots” defined like this:

db.pivots.insert({
  company: "JavaCo",
  dimensions: ["age", "gender"]
})

And some scala that more or less does this:

val buffer = ... // local filesystem file buffer
val aggregator = ... // a wrapper for MongoDB
while( true ) {
  for( report <- buffer.unprocessedFiles ) {
    aggregator.loadFile(report)
    for( pivot <- pivots ) {
      events.???(pivot, report, "dateHour")
    }
    buffer.remove(report)
    aggregator.purge(report)
  }
  pause
}
aggregator.close

We’re left to define what “???” does.

My current thought is that something like this is appropriate:

m = function() {
  emit(
    { "pivot" : ObjectId("4dba..."),
      "time" : { "dateHour" : this.time.dateHour },
      "data" : { "age" : this.data.age,
                 "gender" : this.data.gender } },
    1 );
};

r = function(key, values) {
    var total=0;
    for ( var i=0; i < values.length; i++ ) {
        total += values[i];
    }
    return total;
};

db.events.mapReduce(m, r, {
    out  : { reduce: "aggregates" },
    query: {
        "report": "file1",
        "data.product": "coffee" }
})
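
To make the intent of the map and reduce functions concrete, here is a minimal in-memory sketch in Python of the same aggregation. It is not MongoDB code: the helper names (make_event, map_event, aggregate), the "pivot-1" identifier and the sample events are all assumptions made up for illustration. It filters on the source file and product, groups by the pivot's dimensions plus the chosen time field, and sums a count of 1 per event.

```python
from collections import Counter


# Hypothetical in-memory stand-in for documents in the "events" collection.
def make_event(report, date_hour, product, age, gender):
    return {
        "report": report,
        "time": {"dateHour": date_hour},
        "data": {"product": product, "age": age, "gender": gender},
    }


events = [
    make_event("file1", "2011042412", "coffee", "young", "m"),
    make_event("file1", "2011042412", "coffee", "young", "m"),
    make_event("file1", "2011042413", "coffee", "old", "f"),
    make_event("file1", "2011042412", "tea", "young", "m"),     # filtered out: wrong product
    make_event("file2", "2011042412", "coffee", "young", "m"),  # filtered out: wrong file
]


def map_event(event, pivot_id, dimensions, time_field):
    """Map step: build the grouping key for one event and emit a count of 1."""
    key = (
        pivot_id,
        event["time"][time_field],
        tuple((dim, event["data"][dim]) for dim in dimensions),
    )
    return key, 1


def aggregate(events, pivot_id, dimensions, time_field, report, product):
    """Apply the query filter, then map, then a summing reduce, in one pass."""
    totals = Counter()
    for event in events:
        if event["report"] != report or event["data"]["product"] != product:
            continue
        key, count = map_event(event, pivot_id, dimensions, time_field)
        totals[key] += count  # reduce step: sum the emitted 1s per key
    return totals


totals = aggregate(events, "pivot-1", ["age", "gender"], "dateHour",
                   report="file1", product="coffee")
for key, total in sorted(totals.items()):
    print(key, total)
```

Because the reduce is a plain sum, partial results from different files can be merged by adding them, which is the property the reduce output mode relies on.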

My sample dataset is 2 million events spread across 114 files.

I’m using the SAFE WriteConcern.

Earlier today I turned off atime updates on the EBS volume that I had run the load test on. This test was purely loads: no pivots were defined, ergo no mapreduce was performed. With atime on, my dataset took 425 seconds to load; with atime off, it took 414 seconds. I don’t have enough data to know if this 11 second speedup is real, but in any case it’s not a significant speedup, as expected. I can now check off that little optimization item.
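
For a sense of scale, here is the arithmetic behind those two timings as a small Python sketch. The only inputs are the figures quoted in this post (2 million events, 425 and 414 seconds); the variable names are mine. The saving works out to roughly 2.6% of the load time.

```python
events = 2_000_000   # size of the sample dataset
atime_on_s = 425     # load time with atime updates enabled
atime_off_s = 414    # load time with atime updates disabled

saved_s = atime_on_s - atime_off_s
relative_saving = saved_s / atime_on_s   # fraction of the original load time
rate_on = events / atime_on_s            # events loaded per second, atime on
rate_off = events / atime_off_s          # events loaded per second, atime off

print(f"saved {saved_s} s ({relative_saving:.1%} of the load time)")
print(f"{rate_on:.0f} vs {rate_off:.0f} events/s")
```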

The real way to speed up this phase is likely running 8-10 EBS volumes together as a RAID 10 volume using LVM.

The other phase — mapreduce — is CPU bound. And due to the single-threadedness of mapreduce, I can only bring to bear the power of one core without sharding (or doing the computation in scala). This is the more serious bottleneck. I hope that MongoDB 2.0 addresses the concurrency problems with mapreduce or provides some new features that solve the aggregation problem.

Here are some of the resources I drew upon for the talk.

Convert python AST to JSON Document

I’ve added my python2json.py script to github.

This is a small piece of the source code search algorithm project that I’ve been working on. I think this piece is useful in its own right, and that releasing it doesn’t impinge too much on the larger project.

As an example, let’s say we have the following code in example.py:

x = 1 + 2
print x

The script can be invoked with the -f option (or it can parse stdin) like so. You can pipe the output through json.tool like I do here to pretty-print the result:

./python2json.py -f example.py | python -mjson.tool

This will output:

{
    "_lineno": null,
    "node": {
        "_lineno": null,
        "spread": [
            {
                "_lineno": 2,
                "expr": {
                    "_lineno": 2,
                    "left": {
                        "_lineno": 2,
                        "type": "Const",
                        "value": "1"
                    },
                    "right": {
                        "_lineno": 2,
                        "type": "Const",
                        "value": "2"
                    },
                    "type": "Add"
                },
                "nodes": [
                    {
                        "_lineno": 2,
                        "name": "x",
                        "type": "AssName"
                    }
                ],
                "type": "Assign"
            },
            {
                "_lineno": 3,
                "nodes": [
                    {
                        "_lineno": 3,
                        "name": "x",
                        "type": "Name"
                    }
                ],
                "type": "Printnl"
            }
        ],
        "type": "Stmt"
    },
    "type": "Module"
}
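
For readers who want to experiment with the idea, here is a minimal sketch of the same kind of conversion. It is not the python2json.py script: it assumes Python 3 and the standard library's ast module, so the node names differ from the output above (BinOp and Constant rather than Add and Const, for example). The node_to_dict helper is a name I made up for this sketch.

```python
import ast
import json


def node_to_dict(node):
    """Recursively convert an ast node into JSON-serializable data."""
    if isinstance(node, ast.AST):
        result = {}
        for field, value in ast.iter_fields(node):
            result[field] = node_to_dict(value)
        # Set these last so the node type always wins. Limitation: a node
        # whose own field is called "type" (ExceptHandler) gets it shadowed.
        result["type"] = type(node).__name__
        result["_lineno"] = getattr(node, "lineno", None)
        return result
    if isinstance(node, list):
        return [node_to_dict(item) for item in node]
    if node is None or isinstance(node, (str, int, float, bool)):
        return node  # leaves: identifiers, numbers, strings
    return repr(node)  # bytes, complex, Ellipsis: fall back to their repr


source = "x = 1 + 2\nprint(x)\n"
tree = node_to_dict(ast.parse(source))
print(json.dumps(tree, indent=4, sort_keys=True))
```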

YAML Schema with Moose

A friend asked me recently about parsing YAML with perl. He needed to impose some additional structure on a set of YAML documents. His initial approach was to define a new grammar for the language, but this was turning out to be non-trivial, due to significant whitespace and other complexities.

I told him about how I had used Moose types together with YAML.pm to achieve the effect of a “schema” for document types defined on top of YAML. But my solution relied on the “!!perl/hash:Bar” syntax to tell the deserializer how to bless the parsed perl data structure. That’s OK for a purely internal document that I was using, but not appropriate for anything that might be more public.

Stackoverflow led me to a cleaner solution that uses Moose’s type coercion. When passing a raw deserialized perl data structure to a Moose constructor, Moose will look for matching coercions when a type constraint is not initially met.

I’ve been looking for a schema language for YAML and JSON for a while. Wikipedia suggests that Kwalify, Rx, and Doctrine can all fulfill that role, but in the absence of consensus about a schema language, I’d prefer to use something that I have more control over.

Here’s an example. Let’s say we have the following yaml file:

---
name: Extreme Foo
id: 10
alias: FooX
bars:
  - id: 1
    name: bar1
  - id: 2
    name: bar2

The obviously implied Moose types that would define Foo and Bar are:

class Bar {
    has 'name' => (isa => 'Str', is => 'ro', required => 1);
    has 'id' => (isa => 'Int', is => 'ro', required => 1);
}

class Foo {
    has 'name' => (isa => 'Str', is => 'ro', required => 1);
    has 'id' => (isa => 'Int', is => 'ro', required => 1);
    has 'alias' => (isa => 'Str', is => 'ro', required => 0);

    has 'bars' => (isa => 'ArrayRef[Bar]',
                   is => 'ro',
                   required => 1,
                   default => sub { [] } );

    method print() {
        print $self->name . "\n"; # etc
    }
}

Unfortunately, Moose will complain that “bars” is not of the correct type, because the deserialized YAML supplies an array of plain hashes rather than Bar objects. To fix this, we set the coerce flag on the “bars” attribute, so that Moose will know to look for a coercion during object construction:

    has 'bars' => (isa => 'ArrayOfBars',
                   is => 'ro',
                   coerce => 1, # This tells Moose to look for matching type coercions
                   required => 1,
                   default => sub { [] } );

In this case — because the “Bar” is embedded in the parameterized ArrayRef type — we also need a new type called ArrayOfBars, and a coercion from ArrayRef[HashRef] to ArrayOfBars.

subtype 'ArrayOfBars'
    => as 'ArrayRef[Bar]';

coerce 'ArrayOfBars'
    => from 'ArrayRef[HashRef]'
    => via { [ map { Bar->new($_) } @{$_} ] };

Meaning that we can now do this with a yaml file that does not contain the “!!” syntax:

my $foo = Foo->new(LoadFile('example2.yaml'));
$foo->print();
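
For readers who don’t use perl, the same coerce-on-construction idea can be sketched in Python. This is only an analogy under my own assumptions, not Moose: it uses standard library dataclasses, and the raw dictionary stands in for what a YAML loader would return for the file above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Bar:
    name: str
    id: int


@dataclass
class Foo:
    name: str
    id: int
    alias: Optional[str] = None
    bars: List[Bar] = field(default_factory=list)

    def __post_init__(self):
        # The "coercion": turn plain dicts from the deserializer into Bar
        # objects, leaving anything that is already a Bar untouched.
        self.bars = [b if isinstance(b, Bar) else Bar(**b) for b in self.bars]


# Stand-in for the structure a YAML loader would produce for the example file.
raw = {
    "name": "Extreme Foo",
    "id": 10,
    "alias": "FooX",
    "bars": [{"id": 1, "name": "bar1"}, {"id": 2, "name": "bar2"}],
}

foo = Foo(**raw)
print(foo.name, [bar.name for bar in foo.bars])
```

As with the Moose version, the document itself carries no class tags; the target types decide how nested data is converted.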

The complete code is available on GitHub.

Scala Intro

David Pollack recently gave an overview of the Scala language at a BASE (Bay Area Scala Enthusiasts) meetup. A video has been posted at http://blip.tv/file/4243180. I learned a few things.

It’s always different to see a presentation vs reading text. If you’re curious about scala, or have even been using it for a while, I recommend it.

Tools

I’ve been updating the libraries and tools that I use for my project lately.

Initially this started with upgrades to Scala 2.8. I haven’t explored it much other than the use of default values in method parameters, but it’s nice to be on the latest and greatest, especially since I’m told that 2.8 .class files are not compatible with 2.7.

I had Lift 2.0 working briefly, but am now on Lift 2.1. I’m not using Lift extensively enough yet to notice much difference between 1.x and 2.x (with one exception noted below), but again it’s nice just to be using the latest versions, as it will hopefully mean less difficulty keeping up with this family of technology down the road.

The upgrade to Lift 2.0 co-occurred with the move to SBT 0.7.4 from maven. The continuous ~jetty-run and ~compile frequently result in out of memory exceptions, but triggering those actions on demand is quick enough. I much prefer sbt’s LiftProject.scala to maven’s pom.xml.

Following the best practices I found in other sbt projects, I’ve adopted TDD / BDD via the specs 1.6.5 library. I like the English-like syntax for making assertions, though I don’t fully grok the setup/teardown flow of the test harness. I’ll figure this out eventually. I also don’t understand how to log during testing without resorting to printing to stdout or some other file that I manage myself.

I’ve had difficulty getting the scala Eclipse plugin to work as well as I want, so I actually had gone back to emacs 22.3.1 (carbon version 1.6.0) on mac os x as my editor. At least it was fast if not fully language aware. I’ve seen a lot of references to emacs in scala-related discussions, so I figured I was in good company.

The biggest change of all, as far as I’m concerned, is a discovery from last weekend, which I originally noticed from a David Pollack tweet: ensime 2.8.1-SNAPSHOT-0.3.2. There’s a video demo available:

The author is maintaining a blog on the subject as well. I wasn’t aware that Scala’s “presentation compiler” was this powerful. I haven’t taken a good hard look into it, but it makes me wonder what’s left for a GUI IDE to implement. Ensime pretty much does everything I need an IDE to do. Some features aren’t as “ambiently findable” as they would be with a GUI tool, but this is par for the course for emacs. But given the zippiness of the editor, I think this is a reasonable tradeoff. I hope Ensime will have many years of continued evolution.

Note: I have encountered a mysterious “missing dependency” problem when using Ensime to edit my Lift project that started as a Lift 1.x site. I narrowed the problem down to my snippet’s toList(redraw: () => JsCmd)(html: NodeSeq): NodeSeq methods (which I had originally modelled after some sample code on the lift site) but have not yet determined what it is about them that gives Ensime problems.

2010 update

For my part, I’ve been working on a project that is tangentially related to language workbenches.

In short, it’s a language-parametric source code index and search algorithm. I put together a prototype in python early last year, and have been working on a port to scala in my free time since then. The scala version will begin to be demo-able in the next couple of weeks. This will be followed by several months of deeper research into scalability and performance.

Language Workbench Competition

Check out this Language Workbench Competition.

I’ve met a few of the founders: Eelco Visser at SLE 2008/09 (and his lab at TU Delft), Markus Voelter at EclipseCon 2008, and Steven Kelley, who has commented on this blog in the past. I hadn’t seen much conversation between them until recently, though, and I take this as some confirmation that the line of thought I’ve been pursuing for several years does in fact have some cohesion.

I’ll not likely have time to prepare a submission to the contest, but I’ll certainly be following it and if it does become a part of some conference, I’d love to attend.

Of all the topics in academic or professional software engineering, I think this one cuts right to the heart of the remaining productivity bottlenecks in software engineering. I’m happy to see this talented community gaining momentum.

Category Theory

Last night I stopped by a meeting of the Bay Area Categories and Types group at Noisebridge in the Mission District of San Francisco. They’re using a text from Barr and Wells (which will be arriving soon). It’s nice to have the ability to continue exploring abstract concepts with a group.

I’ve been bumping into category theory repeatedly for many years, and it seems like something I’m going to have to master eventually.

A few years ago I recognized that one thing missing from all the computer science education I had received was a mathematical tool that allowed me to treat languages as objects. It only occurred to me after taking several classes on the syntax of natural languages. I had seen deep mathematical treatment of various aspects of languages, but none that allowed me to refer to a language with a symbol (except maybe as a set, as was the case in my undergrad formal languages class).

After dancing around the subject for years, it’s pretty clear that Category Theory is the tool that I’ve been looking for. I’m looking forward to exploring its concepts.

A colleague of mine sent me a couple of related links: A series of youtube posts from the “Catsters”

And a paper on physics & topology

Web Based IDE’s

Just noticed via a Slashdot article that Bespin, a web-based IDE from Mozilla Labs, and Heroku, which appears to be a Ruby on Rails web-based IDE, are generating a lot of interest. There’s also mention of an EclipseCon talk introducing a web-based Eclipse workbench.

It had to happen eventually. I’m still planning on playing a little in this space — starting with some web-based python tools — but it looks like it may be crowded soon.