Reflections on CSE 344

I’ll just dive right into it: while I definitely learned some useful material, parts of my Intro to Data Management class (CSE 344) were disappointing.

On the positive side, I got a lot out of the SQL assignments. They were fun, and some of them required careful thought and construction. I have done some SQL work in the past, but it was relatively simple, so I appreciated the added complexity. I feel considerably more comfortable with SQL’s subtleties now.

I learned useful techniques for designing and modelling databases and expressing query logic, among other things! At times, though, the class’s emphasis seemed to sacrifice depth for breadth. Parts of the course felt so cursory that I question the value of the time we did spend on them.

Cram a complex technology into a lesson, complete an assignment, repeat. There comes a point where students aren’t going to retain much, if anything.

In particular, the professor kept reminding us to add new technology X to our resumes, and I question that advice. If you add something to your resume, you’d better be able to answer interview questions on the topic. Right? CS is one of those fields where it’s pretty easy to detect resume padding, assuming the interviewer has the requisite knowledge (if so, you’re going to have a bad time).

Heck, I spent three years building apps using the LAMP stack before I went back to school, and I don’t have PHP on my resume. At one point I knew it well, for sure. It wouldn’t take much time for me to pick it back up again. (Something something dollar sign :D) But I couldn’t walk up to a whiteboard right now and write PHP.

344 is the only course in the program that deals with databases from the client perspective (the other undergrad database course, CSE 444, is a course on building DBMSs). Given how crowded the course material feels, perhaps it’s time for an additional course on databases? It could take an in-depth look at the various NoSQL approaches to data storage, and maybe include a distributed database implementation project.

But even if a new class isn’t a viable option, there are clear places to trim material. For example, for one of the NoSQL DBMSs we studied, the professor introduced its query language by telling us that:

  • it’s a hot mess.
  • it will likely be replaced in the next year by a query language currently under development at another university.

In that case, why dedicate one of our eight homework assignments to it? It’s clear why the query language is being supplanted: the documentation is incomplete and inconsistent, and some of the functions simply weren’t documented at all (particularly some of the aggregate functions). Many of the support threads were from professional developers complaining about just that.

Who knows, maybe some of the functionality wasn’t even implemented, like that time I was using a D3-based graph visualization framework, couldn’t get a UI component to respond to mouse events… and then looked under the hood and the function called was just empty. XD

Regardless, there are better ways we could’ve spent our time. And, since I can’t really provide such a blunt critique without offering something constructive (oh wait, I’m on the internet ;)), here are some ideas for how 344 might be restructured:

I wouldn’t change much about the SQL coverage. It was covered pretty thoroughly, and relational databases still dominate the market.

Teaching from the perspective of MS-SQL seems like a fine choice, given its commercial popularity and the university’s proximity to Microsoft.

Okay, so here’s the potentially controversial part: NoSQL. This winter, for the first time, JSON replaced XML in the course, and we learned CouchDB (a document store released about a decade ago). The lecture on CouchDB was very basic; the instruction staff essentially just set us loose on the assignment (which is great, I’m always happy to get practice learning new things independently). The professor acknowledged that his knowledge of it was limited. If it’s mostly going to be an independent learning assignment, why not select a DBMS that is more likely to be relevant on our resumes?

For instance, for document stores, MongoDB has very clear advantages over CouchDB: it’s significantly more popular, it’s better documented, and it has a more active community. In Seattle, there are currently 31 job listings that include MongoDB, and 2 job listings that include CouchDB.

It wouldn’t even necessarily be a difficult swap. The instruction staff could’ve used the same JSON dataset and assignment questions. It might take an hour to learn enough about MongoDB to give an equivalent lecture to the one we received for CouchDB. And then just have the TAs complete the assignment alongside us (there’s a precedent for that in other courses).

It boils down, I think, to why a given technology is selected. Is it really to make us more employable, or is it simply to get us into the habit of learning new technologies? I think that some might argue the latter, in which case the actual DBMS used is arbitrary. I agree that learning to teach ourselves is important. But there’s no reason why, in a world-class CS program like UW’s, we can’t have both.

Or, better yet (in my admittedly biased opinion), select a graph DBMS. First off, so many companies are using them. Big companies that recruit from UW certainly have their own custom graph solutions: Twitter, Facebook, Google. I’m not sure what they’re using in-house, but Microsoft recently released a cloud-based distributed graph engine.

Another good reason to include a graph database in the curriculum: querying or traversing one requires a different logical mindset than querying a relational database, which would add further variety to the course.
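To make that difference in mindset a bit more concrete, here’s a rough sketch in plain Python (not a real DBMS, and not anything from the course). The `FOLLOWS` data and both function names are made up for illustration. The first function thinks relationally, joining an edge table against itself to get a fixed two hops; the second thinks in terms of a graph, starting at a node and walking outward.

```python
from collections import deque

# Hypothetical toy data: (follower, followee) pairs.
FOLLOWS = [
    ("ana", "ben"),
    ("ben", "cal"),
    ("cal", "dee"),
    ("ana", "eli"),
]

def two_hops_relational(edges, start):
    """Relational mindset: self-join the edge table to get exactly two hops."""
    return sorted({
        e2_dst
        for (e1_src, e1_dst) in edges
        for (e2_src, e2_dst) in edges
        if e1_src == start and e1_dst == e2_src and e2_dst != start
    })

def reachable_graph(edges, start, max_hops):
    """Graph mindset: breadth-first walk outward from a node, hop by hop."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, hops + 1))
    return sorted(found)

if __name__ == "__main__":
    print(two_hops_relational(FOLLOWS, "ana"))
    print(reachable_graph(FOLLOWS, "ana", 3))
```

Going one hop deeper in the relational version means writing another join, while in the traversal version it’s just a bigger `max_hops`.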

Should a graph database be integrated into the curriculum, Neo4j would be an obvious choice. The Neo4j folks literally wrote the O’Reilly book on graph databases, which they give away for free as a PDF (and which might be a little biased towards their product). They have a large, active community and some of the best documentation I’ve encountered. Cypher, Neo4j’s query language, is easy for humans to read, flattening the learning curve for non-developers and increasing the product’s likelihood of success.

Of course, Apache’s Spark engine is already part of the curriculum and has graph processing capabilities. Perhaps there’s room for the material here? It would’ve been neat to interact with Spark using something other than SQL (which might have happened, if we’d had 11 rather than 10 weeks in Winter quarter).

Wow, that turned into a rant! Clearly I had some feelings I needed to get out. XD

Until next time!
