By the end of part one of this series we had a rudimentary document object model for representing only a tiny subset of the universe of possible HTML documents. This installment, modelled after the second post in Matt Brubeck’s series, implements an equally modest HTML parser that will build instances of the model we defined last time.
The HTML Parser
Our parser, just like our model, directly mirrors the design of Matt’s implementation in Rust. The primary challenge is just one of translating between two languages. Open Matt’s post in another browser window to compare the Swift code below to the original Rust.
The first thing we need is a type to represent parsing state: the input string and the index of the
next unprocessed character. Immediately this gets us into the differences between how Rust and Swift
implement strings. Both languages need to handle the complexity of Unicode while being as efficient and
safe as possible. In Rust our index was a uint
that indexes into an &str
slice created from the String
. In Swift, we
can index directly into the String, but our index type is a String.Index
struct with a succesor()
method that handles
the complexity of stepping from character to character without accidentally stepping into the middle of one that has a multi-byte representation.
1 2 3 4 |
|
Notice that since the input string doesn’t change, we can use a let
binding for it. The index, on the other hand, changes as we parse the document, which requires us to use var
. Now that
we have this type defined, we can implement a few foundational methods.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
These methods look very much like the Rust equivalents. The most obvious difference is that the references to self
are explicitly declared in Rust but are implicit in Swift.
Next we come to the place where the successor()
method of String.Index
allows us to safely advance to the next Unicode Character
. This is the first method that mutates parsing state. Notice how the &mut self
in the Rust implementation is replaced by the mutating
keyword in Swift.
1 2 3 4 5 6 |
|
The next method is the heart of the parser. It consumes characters one by one for as long as the predicate function evaluates to true
.
1 2 3 4 5 6 7 8 |
|
At this point in Matt’s post, we would be defining a pair of methods using the one above. Both of these methods, however, want to pass predicates that test character set membership. This is something that required a little gymnastics due to an apparent impedence mismatch between the native Swift Character
type and the NSCharacterSet
class, whose characterIsMember
method requires a unichar
. I was unable to find an easy way to convert one to another directly. Fortunately, NSString
was a useful stepping stone between the two.
1 2 3 4 5 6 7 8 |
|
Using this workaround we can implement the desired methods:
1 2 3 4 5 6 7 8 9 |
|
Now we finally get to finally write some functions that produce nodes in our object model.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
The function that produces a Node
that represents an HTML element looks like the Rust equivalent. I like the assertions, so I kept them even though I worry that this code will fail to consume the characters in release builds. I have yet to test that. (Edit: this does turn out to be a problem. Thanks to /u/teequ for confirming that.)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
This is also a possible concern in two of the three methods for parsing HTML element attributes and values.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
|
Before we can write the top-level method that parses an entire document, we need the method that parses a set of sibling nodes.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Using the above, we can write the final method to parse a document.
1 2 3 4 5 6 7 8 9 10 11 |
|
The Crow project also contains some unit tests to exercise this, which means we have finally arrived at a place where we can pull the trigger and see something happen.
If you have any code review to offer, please leave a comment in /r/swift.