150 and Ruby's Flip-Flop Operator
This is my 150th post on this blog. I started posting here a little over two-and-a-half years ago (first post was Jan. 16, 2013), and have since been keeping up a weekly posting schedule with a couple of short holiday/vacation breaks.
I thought I’d take this opportunity to show some statistics. Since this probably interests very few people other than me, I’ll spend most of the post talking about the Ruby code I’m using to generate the statistics. This involves some relatively simple text processing, but also one of the few legitimate use cases for Ruby’s flip-flop operator.
First, I’d like to thank you for reading. I appreciate you taking the time to read what I have to say, and for interacting via the comments and on Twitter.
Now, on to the code.
The Code
I use Octopress 2 for this blog. I write the posts in Markdown and use a few plugins for things like code blocks, images, and video links. All of these plugins use Jekyll-style tags ({% ... %}).
Here are the statistics I’d like to compute:
- Number of posts (150, obviously)
- Number of words, not counting source code
- Number of lines of source code
- Number of images
- Number of links
I wrote a little Ruby program to help me compute these stats. I used test-driven development, largely following the process I outlined in my recent Getting Testy series of posts.
I’ll only highlight a few parts of the code here. If you’d like to see the entire program and test suite, I’ve made it available on GitHub.
The main script simply forwards to a CLI object. Here’s what CLI looks like:
The core part is the collection pipeline in the run method. Starting with the children of the specified directory, I eliminate all sub-directories; Octopress stores all of its posts as single files, so there’s no directory structure. I then use a private method to collect stats for each file, and combine the resulting single-post stats with a merge operation.
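As a rough illustration of the pipeline described above, here is a minimal sketch. The PostStats struct, its two counters, and the body of stats_for are hypothetical stand-ins rather than the classes from the actual program; only the reject/map/reduce shape follows the description.

```ruby
require "pathname"

# Hypothetical stand-in for the real Stats class: just two counters.
PostStats = Struct.new(:posts, :words) do
  # Combine two stats objects into a new one by adding their counters.
  def merge(other)
    PostStats.new(posts + other.posts, words + other.words)
  end
end

# Hypothetical per-file collector: one post, whitespace-separated words.
def stats_for(file)
  PostStats.new(1, file.read.split.size)
end

# Collection pipeline: children -> files only -> per-file stats -> total.
# Returns nil when the directory contains no files.
def run(directory)
  Pathname.new(directory)
    .children
    .reject(&:directory?)
    .map(&method(:stats_for))
    .reduce(&:merge)
end
```

Calling run("source/_posts") would return a single PostStats for the whole directory, assuming that path exists and contains only post files.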
I use the Symbol#to_proc idiom for the reject and reduce calls; I find it makes the code clearer once you know the idiom. You may not have seen the map(&method(:stats_for)) pattern before. It is equivalent to map { |file| stats_for(file) }.
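The two idioms can be compared side by side in a small sketch. The shout helper and the sample words are made up purely for illustration.

```ruby
# Hypothetical helper, used only to illustrate the idiom.
def shout(word)
  word.upcase + "!"
end

words = %w[flip flop]

# Symbol#to_proc: &:length calls the named method ON each element.
p words.map(&:length)               # => [4, 4]
p words.map { |word| word.length }  # => [4, 4]

# Object#method: &method(:shout) passes each element TO the named method.
p words.map(&method(:shout))            # => ["FLIP!", "FLOP!"]
p words.map { |word| shout(word) }      # => ["FLIP!", "FLOP!"]
```

The difference is the direction of the call: &:name sends the message to the element, while &method(:name) hands the element to a method on the current object as an argument.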
I considered creating a single Stats object here and passing it into the collector along with each file. Using that approach, the Stats object would have been a Collecting Parameter. I decided to go with the more functional map-reduce style instead. Either way would be fine, though.
I formatted the collection pipeline this way after attending Dave Thomas’ keynote at Lone Star Ruby this past weekend. He showed that this form is much more readable than the one-liner I’d originally written, and also showed how this would translate almost directly to Elixir using its |> (pipe) operator.
Here’s the first part of the Collector class that’s used by CLI:
I have a simple class method to make it easier to use the class from the outside. I chose to put the input and stats into instance variables in order to avoid having to pass them around to every method in the class.
Notice that I’m taking the input and calling each_line on it. each_line is defined on many different classes, including String, IO (and its subclasses like File), and Pathname. When called without a block, it returns an Enumerator that yields each line of the input in turn. By taking advantage of Ruby’s duck-typing, I can pass anything that responds to each_line into the Collector and it will just work. The CLI passes in a Pathname, the tests all pass in Strings, and it all works like it should.
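A small sketch of the duck-typing idea, using a hypothetical count_lines helper rather than the real Collector. It only relies on the argument responding to each_line.

```ruby
require "stringio"

# Hypothetical helper: counts the lines in anything that responds to each_line.
def count_lines(input)
  lines = input.each_line  # no block given, so this is an Enumerator
  lines.count
end

p count_lines("one\ntwo\nthree\n")         # a String
p count_lines(StringIO.new("one\ntwo\n"))  # an IO-like object
```

A Pathname pointing at a real file could be passed in exactly the same way, which is what makes it easy to test with plain strings.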
Here’s the collect_stats method:
Each post has a bit of YAML at the top (delimited by lines containing only ---). I don’t want to include the YAML in the stats, so I need to skip it. I can then process the post body itself to collect stats.
Because I’m using an Enumerator as an external iterator (as we’ll see below), I need to rescue the StopIteration exception that is raised when I reach the end of the Enumerator. The rescue is a bit out of place here, because there’s nothing in the code indicating that I’m using the enumerator, but I like it better here than duplicating it in the two lower-level methods.
collect_stats is a Composed Method. The two main things it does are at the same level of abstraction, and I can drill down into the details if I’m curious. But at the top level, I know that I’m skipping YAML front matter and then processing the body of the post.
Here’s the skip_yaml_front_matter method:
I first look to see if the next element in the enumerator is the --- delimiter using peek. If it is, I use next to move past it and then skip lines until I encounter the trailing --- delimiter.
As I mentioned above, I’m using the enumerator as an external iterator. I want to keep the YAML-skipping code separate from the post processing code, but I need to keep track of where I am in the file. Using peek, next, and StopIteration allows me to do that. I’m treating the lines of each file as a stream of data that I move through a line at a time.
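The external-iterator approach can be sketched roughly as follows. This is not the real Collector: the FrontMatterSkipper class, the body_lines method, and the DELIMITER pattern are hypothetical names chosen for the example. Only the use of peek, next, and a single rescue of StopIteration follows the description above.

```ruby
# A sketch only; class, method, and constant names are hypothetical.
class FrontMatterSkipper
  DELIMITER = /\A---\s*\z/

  def initialize(input)
    @lines = input.each_line  # Enumerator used as an external iterator
  end

  # Returns the lines that follow the YAML front matter (if any).
  def body_lines
    body = []
    skip_yaml_front_matter
    body << @lines.next while true  # runs until next raises StopIteration
  rescue StopIteration
    body  # the one rescue covers both the skipping and the reading
  end

  private

  def skip_yaml_front_matter
    return unless @lines.peek =~ DELIMITER
    @lines.next                                 # move past the opening ---
    @lines.next until @lines.peek =~ DELIMITER  # skip the YAML itself
    @lines.next                                 # move past the closing ---
  end
end

post = "---\ntitle: Test\n---\nHello\nWorld\n"
p FrontMatterSkipper.new(post).body_lines  # => ["Hello\n", "World\n"]
```

A plain while loop is used instead of Kernel#loop because loop rescues StopIteration on its own, which would hide the behavior being illustrated.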
Here’s process_post_body and its helpers:
This code uses a number of constants containing regular expressions that match the various patterns I’m interested in (not shown here).
process_post_body tells Stats to add a post. It then iterates through each line of the post body and processes it. If the line is part of a code block, then it calls process_code_block; otherwise it calls process_line.
The line if line =~ BEGIN_CODE_BLOCK .. line =~ END_CODE_BLOCK is quite special and somewhat controversial. This is known as the flip-flop operator. It works like this:
- The operator initially returns false.
- When the first condition is satisfied (when the line matches the BEGIN_CODE_BLOCK pattern), it changes to true.
- The operator remains true until the second condition is satisfied (when the line matches the END_CODE_BLOCK pattern). It then becomes false.
- Repeat
There are people who think this operator should be removed from Ruby. I’ll admit that the need for it isn’t common, but it’s perfect for a case like this where we’re looking for a number of lines delimited by beginning and ending patterns.
Note that the condition will be true for both the beginning and ending lines of the code block as well as each line between them, so process_code_block has to adjust for that. I don’t want to count the two delimiter lines as lines of code.
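To make the behavior concrete, here is a minimal sketch of the flip-flop over a handful of lines. The two patterns are assumptions modeled on Octopress codeblock tags (the real constants aren't shown in this post), and both helper methods are hypothetical.

```ruby
# Assumed patterns; the program's real constants are not shown in the post.
BEGIN_CODE_BLOCK = /\A\{% codeblock/
END_CODE_BLOCK   = /\A\{% endcodeblock/

# Returns every line for which the flip-flop is true, delimiters included.
def lines_in_code_blocks(lines)
  selected = []
  lines.each do |line|
    if line =~ BEGIN_CODE_BLOCK .. line =~ END_CODE_BLOCK
      selected << line
    end
  end
  selected
end

# One possible adjustment: drop the delimiter lines before counting.
def code_line_count(lines)
  lines_in_code_blocks(lines)
    .reject { |line| line =~ BEGIN_CODE_BLOCK || line =~ END_CODE_BLOCK }
    .size
end

post = [
  "Some prose",
  "{% codeblock lang:ruby %}",
  "puts 'hi'",
  "{% endcodeblock %}",
  "More prose"
]

p lines_in_code_blocks(post)
# => ["{% codeblock lang:ruby %}", "puts 'hi'", "{% endcodeblock %}"]
p code_line_count(post)  # => 1
```

The flip-flop's state belongs to the method in which it appears, so each call to lines_in_code_blocks starts with the operator off.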
I wanted to move the flip-flop operator down into the process_line method, but it seems to maintain some internal state that gets reset on every method call; thus, it has to live in the same method as the while loop. It makes sense when I think about it - something has to keep track of the state of the flip-flop - but I wish I could extract it to a separate method.
I tried to use the flip-flop operator for the YAML front-matter as well, but it doesn’t work there because the beginning and ending patterns are the same (---). The flip-flop turns on and right back off again.
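The on-and-off behavior with identical patterns can be reproduced in a short sketch. The helper names and the delimiter pattern are hypothetical. For comparison, the sketch also shows Ruby's three-dot form of the flip-flop, which does not test the second condition on the same evaluation that turned it on; this is only an illustration, not a claim that it would suit the real program, since a later --- line in the post body would switch it on again.

```ruby
# Hypothetical pattern for a line containing only ---
FRONT_MATTER_DELIMITER = /\A---\s*\z/

# Two-dot flip-flop: both conditions are tested on the same line,
# so a delimiter line turns it on and immediately back off.
def two_dot_matches(lines)
  lines.select do |line|
    true if line =~ FRONT_MATTER_DELIMITER .. line =~ FRONT_MATTER_DELIMITER
  end
end

# Three-dot flip-flop: the end condition is first tested on the NEXT line.
def three_dot_matches(lines)
  lines.select do |line|
    true if line =~ FRONT_MATTER_DELIMITER ... line =~ FRONT_MATTER_DELIMITER
  end
end

post = ["---\n", "title: Test\n", "---\n", "Body\n"]

p two_dot_matches(post)    # => ["---\n", "---\n"]
p three_dot_matches(post)  # => ["---\n", "title: Test\n", "---\n"]
```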
process_line uses a case statement to look for various patterns of interest and calls appropriate methods on Stats when it finds them. I skip any Jekyll tags that I don’t care about, and also skip any Markdown reference-style links while I’m at it.
To count words, I’m simply splitting the line on spaces and counting how many pieces there are. I could do something more sophisticated, but this works just fine for my purposes.
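The word counting described above amounts to something like the following sketch; word_count is a hypothetical name for the example.

```ruby
# Hypothetical helper: counts words by splitting a line on spaces.
# split(" ") collapses runs of whitespace and ignores leading whitespace.
def word_count(line)
  line.split(" ").size
end

p word_count("The quick  brown fox\n")  # => 4
```

Anything separated by spaces counts as a word here, including stray punctuation such as a lone list marker.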
The Stats
Now that I’ve got a little program to help me, what does it report for this blog?
In the style of one of my favorite XKCD comics and T-shirts, these stats include the post you’re reading right now.