This is my 150th post on this blog. I started posting here a little over two-and-a-half years ago (first post was Jan. 16, 2013), and have since been keeping up a weekly posting schedule with a couple of short holiday/vacation breaks.
I thought I’d take this opportunity to show some statistics. Since this probably interests very few people other than me, I’ll spend most of the post talking about the Ruby code I’m using to generate the statistics. This involves some relatively simple text processing, but also one of the few legitimate use cases for Ruby’s flip-flop operator.
First, I’d like to thank you for reading. I appreciate you taking the time to read what I have to say and to interact via the comments and on Twitter.
Now, on to the code.
Here are the statistics I’d like to compute:
- Number of posts (150, obviously)
- Number of words, not counting source code
- Number of lines of source code
- Number of images
- Number of links
I wrote a little Ruby program to help me compute these stats. I used test-driven development, largely following the process I outlined in my recent Getting Testy series of posts.
I’ll only highlight a few parts of the code here. If you’d like to see the entire program and test suite, I’ve made it available on GitHub.
The main script simply forwards to a
CLI object. Here’s what
The core part is the
run method. For each child of the specified directory, I
eliminate all sub-directories. Octopress stores all of its posts as
single files; there’s no directory structure. I then use a private
method to collect stats for each file, and combine the resulting
single-post stats with a
reduce calls; I find it makes the code clearer
once you know the idiom. You may not have seen the
`map(&method(:stats_for))` pattern before. It is equivalent to
`map { |file| stats_for(file) }`.
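To make the idiom concrete, here's a minimal sketch of the map-then-reduce shape. It is not the code from the program: the `Stats` struct, its two fields, and the string inputs are hypothetical stand-ins, and this `stats_for` only counts words.

```ruby
# Hypothetical stand-in for the real Stats class: just enough to be summed.
Stats = Struct.new(:posts, :words) do
  # reduce(:+) calls this to combine two single-post stats.
  def +(other)
    Stats.new(posts + other.posts, words + other.words)
  end
end

# Hypothetical per-post collector; it takes text here rather than a file.
def stats_for(text)
  Stats.new(1, text.split.size)
end

posts = ["one two three", "four five"]

# These two pipelines build the same result.
with_method_object = posts.map(&method(:stats_for)).reduce(:+)
with_block         = posts.map { |text| stats_for(text) }.reduce(:+)

puts with_method_object == with_block
```

`method(:stats_for)` returns a `Method` object, and the `&` converts it to a block, so `map` ends up calling `stats_for` once per element.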
I considered creating a single
Stats object here and passing it into
the collector along with each file. Using that approach, the
object would have been a
Collecting Parameter. I
decided to go with the more functional map-reduce style instead.
Either way would be fine, though.
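For comparison, here's roughly what the Collecting Parameter alternative could look like. This is a sketch under assumptions, not the approach the program uses: `Tally`, `collect_into`, and `tally_all` are hypothetical names, and the inputs are plain strings.

```ruby
# Hypothetical mutable accumulator that plays the Collecting Parameter role.
class Tally
  attr_reader :posts, :words

  def initialize
    @posts = 0
    @words = 0
  end

  def add_post(text)
    @posts += 1
    @words += text.split.size
  end
end

# The collector receives the shared accumulator and mutates it in place.
def collect_into(tally, text)
  tally.add_post(text)
end

# One Tally is created up front and threaded through every call.
def tally_all(texts)
  texts.each_with_object(Tally.new) { |text, tally| collect_into(tally, text) }
end

result = tally_all(["one two three", "four five"])
puts "#{result.posts} posts, #{result.words} words"
```

The trade-off is the usual one: the collecting parameter avoids creating intermediate objects, while the map-reduce style keeps each step free of shared mutable state.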
I formatted the collection pipeline this way after attending Dave
Thomas’ keynote at Lone Star Ruby this
past weekend. He showed that this form is much more readable than the
one-liner I’d originally written, and also showed how this would
translate almost directly to Elixir using the
`|>` (pipe) operator.
Here’s the first part of the
Collector class that’s used by
I have a simple class method to make it easier to use the class from the outside. I chose to put the input and stats into instance variables in order to avoid having to pass them around to every method in the class.
Notice that I’m taking the input and calling
each_line on it.
`each_line` is defined on many different classes, including
`IO` (and its subclasses) and
`Pathname`. It returns an
Enumerator that yields each line of the input in turn. By taking
advantage of Ruby’s duck-typing, I can pass anything that responds to
each_line into the
Collector and it will just work. The
passes in a
Pathname, and the tests all pass in
Strings and it all
works like it should.
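Here's a small sketch of that duck-typing in action. The `line_count` helper is hypothetical and exists only for illustration; the point is that the same method body works for a `String`, a `StringIO`, and a `Pathname` because all three respond to `each_line`.

```ruby
require "stringio"
require "pathname"
require "tempfile"

# Hypothetical helper: works with anything that responds to each_line.
def line_count(input)
  input.each_line.count
end

text = "first\nsecond\nthird\n"

puts line_count(text)                # a String
puts line_count(StringIO.new(text))  # an IO-like object

# A Pathname pointing at a temporary file with the same contents.
Tempfile.create("post") do |file|
  file.write(text)
  file.flush
  puts line_count(Pathname.new(file.path))
end
```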
Each post has a bit of YAML at the top (delimited by lines containing
---). I don’t want to include the YAML in the stats, so I need
to skip it. I can then process the post body itself to collect the
stats.
Because I’m using an
Enumerator as an
external iterator as we’ll
see below, I need to rescue the
StopIteration exception that is
raised when I reach the end of the
Enumerator. The rescue is a bit
out of place here, because there’s nothing in the code indicating that
I’m using the enumerator, but I like it better here than duplicating
it in the two lower-level methods.
collect_stats is a
The two main things it does are at the same level of abstraction, and
I can drill down into the details if I’m curious. But at the top
level, I know that I’m skipping YAML front matter and then processing
the body of the post.
I first use `peek` to see if the next element in the enumerator is the
opening `---` line. If it is, I use
`next` to move past it and
then skip lines until I encounter the trailing `---`.
As I mentioned above, I’m using the enumerator as an external
iterator. I want to keep the YAML-skipping code separate from the
post processing code, but I need to keep track of where I am in the
input. `peek`, `next`, and `StopIteration` allow me to do that.
I’m treating the lines of each file as a stream of data that I move
through a line at a time.
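The external-iterator approach can be sketched as follows. This is a simplified illustration, not the actual `Collector`: the method names, the `YAML_DELIMITER` pattern, and returning the body lines as an array are all assumptions made for the example.

```ruby
# Assumed pattern for a front-matter delimiter line.
YAML_DELIMITER = /\A---\s*\z/

# Returns the lines that follow the YAML front matter (if any).
def body_lines(input)
  lines = input.each_line   # an Enumerator, used as an external iterator
  body = []
  skip_front_matter(lines)
  while true
    body << lines.next      # raises StopIteration at the end of the input
  end
rescue StopIteration
  body
end

# Advances the shared enumerator past the front matter.
def skip_front_matter(lines)
  return unless lines.peek =~ YAML_DELIMITER    # peek does not consume
  lines.next                                    # opening delimiter
  lines.next until lines.peek =~ YAML_DELIMITER # front-matter content
  lines.next                                    # trailing delimiter
end

post = "---\ntitle: Example\n---\nHello\nWorld\n"
p body_lines(post)
```

Because both methods share one enumerator, the second step picks up exactly where the first left off. Running off the end in either step raises `StopIteration`, which is why a single rescue at the top level is enough.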
process_post_body and its helpers:
This code uses a number of constants containing regular expressions that match the various patterns I’m interested in (not shown here).
`process_post_body` first tells `Stats` to add a post. It then iterates
through each line of the post body and processes it.
If the line is part of a code block, then it calls
`process_code_block`; otherwise it calls `process_line`.
The condition
`if line =~ BEGIN_CODE_BLOCK .. line =~ END_CODE_BLOCK` is
quite special and somewhat controversial. This is known as the
flip-flop operator. It works like this:
- The operator initially returns `false`.
- When the first condition is satisfied (when the line matches the `BEGIN_CODE_BLOCK` pattern), it changes to `true`.
- The operator remains `true` until the second condition is satisfied (when the line matches the `END_CODE_BLOCK` pattern). It then becomes `false` again.
There are people who think this operator should be removed from Ruby. I’ll admit that the need for it isn’t common, but it’s perfect for a case like this where we’re looking for a number of lines delimited by beginning and ending patterns.
Note that the condition will be
true for both the beginning and
ending lines of the condition as well as each line between them, so
process_code_block has to adjust for that. I don’t want to count
the two delimiter lines as lines of code.
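Here's a minimal sketch of the flip-flop in a loop like this one, including the adjustment for the delimiter lines. The two patterns below are assumptions (Octopress-style `codeblock` tags), and `count_code_lines` is a hypothetical method rather than the program's real code.

```ruby
# Assumed delimiter patterns; the real constants may differ.
BEGIN_CODE_BLOCK = /\A\{%\s*codeblock/
END_CODE_BLOCK   = /\A\{%\s*endcodeblock/

def count_code_lines(lines)
  count = 0
  lines.each do |line|
    # Flip-flop: true from the opening line through the closing line.
    if (line =~ BEGIN_CODE_BLOCK)..(line =~ END_CODE_BLOCK)
      # The delimiters themselves are inside the range, so skip them.
      next if line =~ BEGIN_CODE_BLOCK || line =~ END_CODE_BLOCK
      count += 1
    end
  end
  count
end

post = [
  "Some prose\n",
  "{% codeblock %}\n",
  "a = 1\n",
  "b = 2\n",
  "{% endcodeblock %}\n",
  "More prose\n"
]
puts count_code_lines(post)
```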
I wanted to move the flip-flop operator down into the
method, but it seems to maintain some internal state that gets reset
on every method call; thus, it has to live in the same method as the
while loop. It makes sense when I think about it - something has to
keep track of the state of the flip-flop - but I wish I could extract
it to a separate method.
I tried to use the flip-flop operator for the YAML front-matter as
well, but it doesn’t work there because the beginning and ending
patterns are the same (`---`). The flip-flop turns on and right back
off on the same line.
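A small experiment shows the problem. This is an illustrative sketch; `DELIMITER` and `flip_flop_flags` are hypothetical names. With the two-dot form, the ending condition is checked on the same line that satisfied the beginning condition, so the front-matter content is never seen as "inside".

```ruby
# Assumed pattern for the front-matter delimiter.
DELIMITER = /\A---\s*\z/

# Records what the flip-flop evaluates to for each line.
def flip_flop_flags(lines)
  lines.map do |line|
    if (line =~ DELIMITER)..(line =~ DELIMITER)
      true
    else
      false
    end
  end
end

post = ["---\n", "title: Example\n", "---\n", "Body text\n"]
p flip_flop_flags(post)
```

Only the two delimiter lines come back as `true`; the `title:` line between them is `false`, which is the opposite of what skipping the front matter needs.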
process_line uses a
case statement to look for various patterns of
interest and calls appropriate methods on
Stats when it finds them.
I skip any Jekyll tags that I don’t care about, and also skip any
Markdown reference-style links while I’m at it.
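The overall shape of such a `case` statement might look like the sketch below. Everything specific here is an assumption for illustration: the four patterns, the hash of counters standing in for `Stats`, and the exact rules for what counts as a link.

```ruby
# Assumed patterns; the real program's constants are not shown in the post.
IMAGE_TAG      = /\A\{%\s*img\b/          # e.g. {% img /images/a.png %}
JEKYLL_TAG     = /\A\{%.*%\}\s*\z/        # any other tag on its own line
REFERENCE_LINK = /\A\[[^\]]+\]:\s/        # e.g. [label]: http://example.com
INLINE_LINK    = /\[[^\]]*\]\([^)]*\)/    # e.g. [text](http://example.com)

# counts is a Hash with a default of 0, standing in for the Stats object.
def process_line(line, counts)
  case line
  when IMAGE_TAG
    counts[:images] += 1
  when JEKYLL_TAG, REFERENCE_LINK
    # Not interesting for the stats; skip the line entirely.
  else
    counts[:links] += line.scan(INLINE_LINK).size
    counts[:words] += line.split.size
  end
  counts
end

counts = Hash.new(0)
process_line("{% img /images/a.png %}\n", counts)
process_line("See [my post](http://example.com) now\n", counts)
p counts
```

`case` compares each `when` clause using `===`, which for a `Regexp` means "does this line match?". The order of the clauses matters: the image pattern has to come before the general tag pattern, or images would be skipped as ordinary tags.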
To count words, I’m simply splitting the line on spaces and counting how many pieces there are. I could do something more sophisticated, but this works just fine for my purposes.
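As a quick illustration of how simple that is, here's a one-method sketch. The `word_count` name is hypothetical. Splitting on a single space uses Ruby's whitespace-splitting mode, which ignores leading whitespace and treats runs of whitespace as one separator.

```ruby
# Hypothetical helper: counts whitespace-separated pieces of a line.
def word_count(line)
  line.split(" ").size
end

puts word_count("To count words, I split the line\n")
puts word_count("  extra   spaces  \n")
```

Punctuation stays attached to its word and Markdown syntax counts as text, so the totals are approximate, which is fine for rough blog statistics.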
Now that I’ve got a little program to help me, what does it report for this blog?