Robin ([info]zanfur) wrote,
@ 2008-02-16 17:58:00
Entry tags:programming, rants, tech

Today, I decided to learn XSLT
I have to shred XML movie data from IMDb into a relational structure, for a project at work. I whipped up something using Perl's XML::Simple, because it's a simple problem, but then I figured it would be nicer if I could use standards to translate from the XML to the insert statements, especially if I could use a stream-based parser to keep memory requirements lower...as you might imagine, IMDb has a lot of movie data. So, I decided to look into XSLT, which I hear is the de facto XML transformation standard, and really awesome, if you can wrap your head around it.
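A minimal sketch of that stream-based shredding idea, in Python rather than the Perl mentioned above. Everything specific here is an assumption for illustration: the element names (`movie`, `title`, `year`), the `movies` table, and the helper names are made up, and IMDb's real format will differ. It uses the standard library's `xml.etree.ElementTree.iterparse`, which hands back elements as they are completed instead of loading the whole document first.

```python
import io
import xml.etree.ElementTree as ET


def sql_quote(value):
    """Quote a string as a SQL literal by doubling single quotes (illustration only)."""
    return "'" + value.replace("'", "''") + "'"


def movie_inserts(source):
    """Yield one INSERT statement per <movie> element, parsing incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "movie":
            continue
        title = elem.findtext("title", default="")
        year = elem.findtext("year", default="")
        # Emit the year as a bare number when it looks like one, else NULL.
        year_sql = year if year.isdigit() else "NULL"
        yield "INSERT INTO movies (title, year) VALUES ({}, {});".format(
            sql_quote(title), year_sql)
        # Free this element's text and children; the emptied element itself
        # stays attached to the root, so memory still grows, just slowly.
        elem.clear()


# Hypothetical sample data, not real IMDb output.
SAMPLE = (b"<movies>"
          b"<movie><title>Alien</title><year>1979</year></movie>"
          b"<movie><title>Ocean's Eleven</title><year>2001</year></movie>"
          b"</movies>")

statements = list(movie_inserts(io.BytesIO(SAMPLE)))
for statement in statements:
    print(statement)
```

Building SQL by string formatting is only for show here; real code would hand the values to the database driver as query parameters.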

Having been told, by a number of people, that it's actually a fairly difficult idea to wrap your head around, I set aside a large chunk of time to go and learn it. (I'm using the rest of that time to write this rant.) It took me 20 minutes to realize it was just a gimped version of LISP macros, and I'm embarrassed it took me that long. There's nothing conceptually innovative there; it's just a case of looking up the syntax when you need it.

I hate XML. For years and years, I saw the hype, and everyone was "learning" XML. I saw XML listed under "programming" sections in bookstores, and even on resumes. It's just a file format, people! Ever look at HTML? Now imagine that you can specify anything you want for element names between the angle brackets. Throw in a few optional headers at the top, and you've got XML. Want to specify which element names are allowed, inside of which other elements? Make a DTD, describing what elements can contain what other elements. This is not rocket science, and it's not even innovative -- LISP had the same type of hierarchical data structures, complete with a similar syntax, in 1959.

My main beef with it, I think, is that it's so godawful hard to read. Why oh why did anyone think that <name>content</name> was a good set of delimiters? Wouldn't it be clearer -- and more consistent with the underlying structure -- to use simple parentheses, like (name (content)), or even (name content)? That would be much easier to read. Less redundant. Oh noes, we have to count parentheses, instead of searching for a specific end tag! Err...except for the times when we have to count the tags too, because they're nested. Okay. It's a shame there's no hierarchical data syntax that uses that. Oh wait. Never mind. LISP data syntax. In 1959.
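To make the comparison concrete, here is a rough sketch that rewrites an XML element tree in the parenthesized (name content) form. The `to_sexpr` helper and the sample document are hypothetical, and the sketch deliberately ignores attributes and any text that follows a child element, so it is not a full XML-to-s-expression mapping.

```python
import xml.etree.ElementTree as ET


def to_sexpr(elem):
    """Render an element as (name "text" child ...), recursing into children."""
    parts = [elem.tag]
    text = (elem.text or "").strip()
    if text:
        # Escape backslashes first, then double quotes, so the string stays readable.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        parts.append('"' + escaped + '"')
    parts.extend(to_sexpr(child) for child in elem)
    return "(" + " ".join(parts) + ")"


doc = ET.fromstring("<movie><title>Alien</title><year>1979</year></movie>")
print(to_sexpr(doc))  # prints: (movie (title "Alien") (year "1979"))
```

The nesting is the same either way; the parenthesized form just drops the repeated closing name.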

And now, there's XSLT. Well, since 1999 or so. We can embed control flow into our data! Now that control flow is in the same syntax as our data, imagine the possibilities for templating: we can intersperse data and code! Surely, this is innovative. Oh, wait. No. LISP has treated code as data since 1959, and its partial-execution macro system followed in 1963. (Granted, in this instance, XSL may be easier to read than the LISP macro syntax.)

I admit that it's easier to specify a tree structure with an XML DTD than it is in LISP, or actually anything else I can think of. You can do it, though. Since 1959. Because data and code are the exact same thing in LISP (wow, what an innovation!), you can just "evaluate" the data as code: If it parses, it's legit.

I'm mentioning LISP a lot because it was first. All of these things have been around, and exist in a number of other languages. Perl, Ruby, Python, ML and lots of other languages have hierarchical data syntax. SAX parsing? Every compiler known to mankind uses a similar technology, since nearly the dawn of compilers. XPath? You have to index the heck out of your XML to make that fast, then you use -- surprise! -- relational databases to do it. The worst of both worlds: Hard to parse by humans, and hard to parse by machines!

XPath, XSLT, SAX ... they're all just libraries implemented for manipulating an arbitrarily decided "standard" syntax. There are better tools for getting each of those jobs done. It's (now) universally supported, so I suppose I'm stuck with it. That's really the only reason to use it, in my opinion. It just happens to be a very compelling reason.

So, yeah. XML is another stupid file format, amid a plethora of equally useful formats. The only thing making it special is organizational backing. Go ahead and use it, but stop thinking it's innately special somehow. Please? It's getting really old.





[info]flagmantho
2008-02-17 02:14 am UTC (link)
that always bothered me about XML. when i first encountered it, it seemed incredibly simple; but people made such a huge fuss about it that i assumed there was something about it i just didn't "get". little did i know, it's the world that is crazy, not me.



[info]zanfur
2008-02-18 12:58 am UTC (link)
Thing is, there's *so much* built up around XML, that it's hard to learn all the ins and outs of it. Not that you'd ever have a need to, unless you wanted to be an "XML Expert". To me, that's kinda like being an "AMD Automounter Config File Expert". Sure, it's complex. It's a file format. Look up what you need.



[info]jonah
2008-02-17 04:33 am UTC (link)
Agreed. Unfortunately, so many companies/people/apps have invested in XML that it is here to stay. With the AJAX work I've done, I prefer JSON much more over XML. Instead of <name>content</name>, it's { "Name" : "content" }. JSON is ideal for transferring data... XML should just be kept as a markup language, oh well.



[info]zanfur
2008-02-18 12:52 am UTC (link)
So, we seem to know people in common, but I have no idea who you are. Have we met?



[info]jonah
2008-02-18 07:42 pm UTC (link)
I don't believe so... I remember adding you a while back after reading a comment you made in someone's journal.



[info]lumiere
2008-02-17 06:08 am UTC (link)
But LISP doesn't have DNS-based namespaces! That's totally worth the switch to XML! *cough*



[info]zanfur
2008-02-18 12:52 am UTC (link)
That's actually one of the things I like about the entire XML camp way of thought. Delegation of authority is really useful. It's not innovative, really -- at least, it wasn't the XML folks who innovated it. That would have been DNS, in 1987.



(Anonymous)
2008-02-17 11:40 pm UTC (link)
Interesting article . . . I too used to be a Perl developer, and found myself in exactly the same position as you a few years back. Then I thought I really needed to work out what exactly all the fuss was about, and guess what? I got completely hooked, so much so that I have literally done nothing else but XSL for at least the past 5 years. When you use the right tools with debugging capabilities, there's nothing better for me.

I just noticed you've missed one recent interesting development in the world of XML: XQuery. It has all the fundamentals of XSLT, but it uses XPath 2.0 extensively, and is much cleaner than the traditional verbose XSLT.

I see you had a go at SAX, and I also see you have large XML documents to process. If the documents are really big, SAX can sometimes be used to fragment the data into smaller yet relationally referenced documents, so your XSLT transformations become lighter and can access data from the file system based on your needs.

Miguel de Melo
====================
XSLT by Example (http://migueldemelo.blogspot.com/)



[info]zanfur
2008-02-18 12:49 am UTC (link)
I've used XQuery in the past as well. If you have it shredded into some format that's fully indexed, it's a good solution. The problem is, you still have to shred it first.

Regarding splitting the files into smaller ones so XPath doesn't hoark due to memory requirements -- that's actually what we're doing. It's an incredibly inelegant solution, I think.

I looked into STX for a bit, the SAX-based XSLT equivalent (it doesn't use XPath). But, it's not a widely accepted standard, and as that's the only reason I'd be using XSLT anyway...it didn't make sense to implement.

I recommend getting yourself un-hooked. It's good at some things, bad at others.



[info]zanfur
2008-02-18 12:55 am UTC (link)
Another point: XSLT, XPath, and XQuery are all just library APIs implemented toward the purpose of manipulating XML. There's nothing really special about them; they're just code. Sure, it's great that I don't have to write the code myself, but there are a lot of libraries around to achieve similar results, with differently-formatted data.


