This is my 150th post on this blog. I started posting here a little over two-and-a-half years ago (first post was Jan. 16, 2013), and have since been keeping up a weekly posting schedule with a couple of short holiday/vacation breaks.

I thought I’d take this opportunity to show some statistics. Since this probably interests very few people other than me, I’ll spend most of the post talking about the Ruby code I’m using to generate the statistics. This involves some relatively simple text processing, but also one of the few legitimate use cases for Ruby’s flip-flop operator.

First, I’d like to thank you for reading. I appreciate you taking the time to read what I have to say, and for interacting via the comments and on Twitter.

Now, on to the code.

The Code

I use Octopress 2 for this blog. I write the posts in Markdown and use a few plugins for things like code blocks, images, and video links. All of these plugins use Jekyll-style tags ({% ... %}).

Here are the statistics I’d like to compute:

  • Number of posts (150, obviously)
  • Number of words, not counting source code
  • Number of lines of source code
  • Number of images
  • Number of video links

I wrote a little Ruby program to help me compute these stats. I used test-driven development, largely following the process I outlined in my recent Getting Testy series of posts.

I’ll only highlight a few parts of the code here. If you’d like to see the entire program and test suite, I’ve made it available on GitHub.

The main script simply forwards to a CLI object. Here’s what CLI looks like:

CLI Object
require "pathname"

module Blogstats
  class CLI
    def self.run(args = ARGV)
      new.run(args)
    end

    # Print the merged stats for every file directly inside the directory.
    def run(args = ARGV)
      directory = Pathname.new(args.empty? ? Dir.getwd : args.first)
      puts directory.each_child
        .reject(&:directory?)
        .map(&method(:stats_for))
        .reduce(&:merge)
    end

    private

    def stats_for(file)
      Collector.stats_for(file)
    end
  end
end

The core part is the collection pipeline in the run method. For each child of the specified directory, I eliminate all sub-directories. Octopress stores all of its posts as single files; there’s no directory structure. I then use a private method to collect stats for each file, and combine the resulting single-post stats with a merge operation.

I use the Symbol#to_proc idiom for the reject and reduce calls; I find it makes the code clearer once you know the idiom. You may not have seen the map(&method(:stats_for)) pattern before. It is equivalent to map { |file| stats_for(file) }.
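As a small illustration of that equivalence, here is a self-contained sketch. The stats_for below is a hypothetical stand-in that just measures a string; it is not the method from the program.

```ruby
# Hypothetical stand-in for CLI#stats_for; it only returns the name's length.
def stats_for(file)
  file.length
end

files = ["a.markdown", "bb.markdown"]

# An explicit block and a Method object converted with & do the same work.
with_block  = files.map { |file| stats_for(file) }
with_method = files.map(&method(:stats_for))

# Symbol#to_proc is the same idea for a method called on each element.
upcased = files.map(&:upcase)

puts with_block == with_method # => true
```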

I considered creating a single Stats object here and passing it into the collector along with each file. Using that approach, the Stats object would have been a Collecting Parameter. I decided to go with the more functional map-reduce style instead. Either way would be fine, though.
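To make the two options concrete, here is a rough sketch of both styles. The Stats struct, its two fields, and the collect_into helper are assumptions made up for this illustration; the real Stats class is not shown in this post.

```ruby
# Hypothetical, simplified Stats; the real class tracks more than this.
Stats = Struct.new(:posts, :words) do
  # Return a new Stats holding the sum of both sets of counts.
  def merge(other)
    Stats.new(posts + other.posts, words + other.words)
  end
end

bodies = ["one two three", "four five"]

# Map-reduce style: build one Stats per post, then fold them together.
merged = bodies
  .map { |body| Stats.new(1, body.split.count) }
  .reduce(&:merge)

# Collecting Parameter style: a single Stats is passed along and mutated.
def collect_into(stats, body)
  stats.posts += 1
  stats.words += body.split.count
end

collected = Stats.new(0, 0)
bodies.each { |body| collect_into(collected, body) }

puts merged == collected # => true
```

Both versions end up with the same totals; the difference is whether each step returns a fresh value or updates a shared one.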

I formatted the collection pipeline this way after attending Dave Thomas’ keynote at Lone Star Ruby this past weekend. He showed that this form is much more readable than the one-liner I’d originally written, and also showed how this would translate almost directly to Elixir using its |> (pipe) operator.

Here’s the first part of the Collector class that’s used by CLI:

Collector Part 1
module Blogstats
  class Collector
    def self.stats_for(input)
      new(input).stats
    end

    def initialize(input)
      @input = input.each_line
      @stats = Stats.new
      collect_stats
    end

    attr_reader :stats

    # ...
  end
end

I have a simple class method to make it easier to use the class from the outside. I chose to put the input and stats into instance variables in order to avoid having to pass them around to every method in the class.

Notice that I’m taking the input and calling each_line on it. each_line is defined on many different classes, including String, IO (and its subclasses like File), and Pathname. Called without a block, it returns an Enumerator that yields each line of the input in turn. By taking advantage of Ruby’s duck-typing, I can pass anything that responds to each_line into the Collector and it will just work. The CLI passes in a Pathname, the tests all pass in Strings, and both work the way they should.
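A quick sketch of that duck-typing, using a String and a StringIO as two unrelated classes that both respond to each_line. The line_count helper is hypothetical and only exists for this example.

```ruby
require "stringio"

# Works with anything that responds to each_line; the class does not matter.
def line_count(input)
  input.each_line.count
end

text = "first\nsecond\nthird\n"

from_string = line_count(text)               # a String
from_io     = line_count(StringIO.new(text)) # an IO-like object

puts from_string == from_io # => true
```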

Here’s the collect_stats method:

Collector#collect_stats
module Blogstats
  class Collector
    # ...
    def collect_stats
      skip_yaml_front_matter
      process_post_body
    rescue StopIteration
    end
    # ...
  end
end

Each post has a bit of YAML at the top (delimited by lines containing only ---). I don’t want to include the YAML in the stats, so I need to skip it. I can then process the post body itself to collect stats.

Because I’m using an Enumerator as an external iterator, as we’ll see below, I need to rescue the StopIteration exception that is raised when I reach the end of the Enumerator. The rescue is a bit out of place here, because there’s nothing in this method indicating that I’m using the enumerator, but I like it better here than duplicating it in the two lower-level methods.

collect_stats is a Composed Method. The two main things it does are at the same level of abstraction, and I can drill down into the details if I’m curious. But at the top level, I know that I’m skipping YAML front matter and then processing the body of the post.

Here’s the skip_yaml_front_matter method:

Collector#skip_yaml_front_matter
module Blogstats
  class Collector
    # ...
    def skip_yaml_front_matter
      return unless input.peek =~ /^---/
      input.next
      while input.next !~ /^---/
        # Skip
      end
    end
    # ...
  end
end

I first look to see if the next element in the enumerator is the --- delimiter using peek. If it is, I use next to move past it and then skip lines until I encounter the trailing --- delimiter.

As I mentioned above, I’m using the enumerator as an external iterator. I want to keep the YAML-skipping code separate from the post processing code, but I need to keep track of where I am in the file. Using peek, next, and StopIteration allows me to do that. I’m treating the lines of each file as a stream of data that I move through a line at a time.
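Here is a standalone sketch of the same external-iteration idea, outside of the Collector class. The sample post text is made up for illustration.

```ruby
post = <<~POST
  ---
  title: Example
  ---
  First line of the body.
  Second line of the body.
POST

lines = post.each_line # an Enumerator, driven by hand with peek and next

# Skip the front matter, if there is any.
if lines.peek =~ /^---/
  lines.next # consume the opening delimiter
  until lines.next =~ /^---/
    # keep consuming front-matter lines
  end
end

# The next consumer picks up exactly where the skipping stopped.
body = []
begin
  while (line = lines.next)
    body << line.chomp
  end
rescue StopIteration
  # Raised by next once the input is exhausted.
end

puts body.length # => 2
```

Because the position lives in the enumerator rather than in a loop variable, the skipping code and the body-processing code can sit in separate methods and still share it.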

Here’s process_post_body and its helpers:

Collector#process_post_body
module Blogstats
  class Collector
    # ...
    def process_post_body
      stats.add_post
      while line = input.next.strip
        if line =~ BEGIN_CODE_BLOCK .. line =~ END_CODE_BLOCK
          process_code_block(line)
        else
          process_line(line)
        end
      end
    end

    def process_code_block(line)
      unless line =~ BEGIN_CODE_BLOCK || line =~ END_CODE_BLOCK
        stats.add_loc
      end
    end

    def process_line(line)
      case line
      when VIDEO_TAG
        stats.add_video
      when IMAGE_TAG
        stats.add_image
      when OTHER_JEKYLL_TAG
        return
      when REFERENCE_STYLE_LINK
        return
      else
        stats.add_words(line.split.count)
      end
    end
    # ...
  end
end

This code uses a number of constants containing regular expressions that match the various patterns I’m interested in (not shown here).

process_post_body tells Stats to add a post. It then iterates through each line of the post body and processes it.

If the line is part of a code block, then it calls process_code_block; otherwise it calls process_line.

The line if line =~ BEGIN_CODE_BLOCK .. line =~ END_CODE_BLOCK is quite special and somewhat controversial. This is known as the flip-flop operator. It works like this:

  • The operator initially returns false.
  • When the first condition is satisfied (when the line matches the BEGIN_CODE_BLOCK pattern), it changes to true.
  • The operator remains true until the second condition is satisfied (when the line matches the END_CODE_BLOCK pattern). It then becomes false.
  • Repeat

There are people who think this operator should be removed from Ruby. I’ll admit that the need for it isn’t common, but it’s perfect for a case like this where we’re looking for a number of lines delimited by beginning and ending patterns.

Note that the condition will be true for both the beginning and ending lines of the code block as well as each line between them, so process_code_block has to adjust for that. I don’t want to count the two delimiter lines as lines of code.
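The behavior described above can be seen in a small standalone sketch. The two patterns here are assumptions chosen for the example (Octopress-style codeblock tags); the program's actual constants are not shown in this post.

```ruby
# Assumed delimiter patterns, for illustration only.
BEGIN_CODE_BLOCK = /^\{% codeblock/
END_CODE_BLOCK   = /^\{% endcodeblock/

lines = [
  "Some prose.",
  "{% codeblock %}",
  "puts 1",
  "puts 2",
  "{% endcodeblock %}",
  "More prose."
]

flagged = []
lines.each do |line|
  # True from the line matching the first pattern through the line
  # matching the second pattern, inclusive.
  if line =~ BEGIN_CODE_BLOCK .. line =~ END_CODE_BLOCK
    flagged << line
  end
end

# Drop the two delimiter lines to get the lines of code.
loc = flagged.reject { |l| l =~ BEGIN_CODE_BLOCK || l =~ END_CODE_BLOCK }

puts flagged.length # => 4
puts loc.length     # => 2
```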

I wanted to move the flip-flop operator down into the process_line method, but it seems to maintain some internal state that gets reset on every method call; thus, it has to live in the same method as the while loop. It makes sense when I think about it, since something has to keep track of the state of the flip-flop, but I wish I could extract it to a separate method.

I tried to use the flip-flop operator for the YAML front-matter as well, but it doesn’t work there because the beginning and ending patterns are the same (---). The flip-flop turns on and right back off again.
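As an aside, Ruby also has a three-dot form of the flip-flop (...) that does not test the second condition on the same evaluation that switched it on. The sketch below compares the two forms on identical delimiters. It is only an illustration, not what the program does, and a later --- line in the post body (a Markdown horizontal rule, say) would switch either form on again.

```ruby
lines = ["---", "title: Example", "---", "Body text."]

two_dot   = []
three_dot = []

lines.each do |line|
  # Two dots: the end condition is checked on the line that turned it on.
  two_dot << line if line =~ /^---/ .. line =~ /^---/
  # Three dots: the end condition is first checked on the following line.
  three_dot << line if line =~ /^---/ ... line =~ /^---/
end

puts two_dot.inspect   # => ["---", "---"]
puts three_dot.inspect # => ["---", "title: Example", "---"]
```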

process_line uses a case statement to look for various patterns of interest and calls appropriate methods on Stats when it finds them. I skip any Jekyll tags that I don’t care about, and also skip any Markdown reference-style links while I’m at it.
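Since the regular expressions themselves are not shown, here is a sketch of how a case statement dispatches on patterns like these. All four patterns and the classify helper are guesses made up for this example, not the ones in the program. Each when clause is tested with Regexp#===, in order, so the more specific tags have to come before the catch-all Jekyll tag.

```ruby
# Hypothetical patterns, for illustration only.
VIDEO_TAG            = /\A\{% (?:video|youtube)\b/
IMAGE_TAG            = /\A\{% img\b/
OTHER_JEKYLL_TAG     = /\A\{%/
REFERENCE_STYLE_LINK = /\A\[[^\]]+\]:\s/

# Return a symbol naming what kind of line this is.
def classify(line)
  case line
  when VIDEO_TAG then :video
  when IMAGE_TAG then :image
  when OTHER_JEKYLL_TAG, REFERENCE_STYLE_LINK then :skip
  else :words
  end
end

puts classify("{% img /images/chart.png %}") # => image
puts classify("Just some prose.")            # => words
```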

To count words, I’m simply splitting the line on whitespace and counting how many pieces there are. I could do something more sophisticated, but this works just fine for my purposes.
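For reference, String#split with no argument splits on runs of whitespace, so repeated spaces and tabs don't inflate the count. The sample line is made up; the comparison with a literal single-space pattern is only there to show the difference.

```ruby
line = "To count   words,\tI'm simply splitting."

# No argument: runs of spaces and tabs each act as a single separator.
by_whitespace = line.split.count

# A literal single-space pattern yields empty strings for repeated spaces
# and does not split on the tab.
by_single_space = line.split(/ /).count

puts by_whitespace   # => 6
puts by_single_space # => 7
```

Punctuation stays attached to its word, which is fine for a rough count.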

The Stats

Now that I’ve got a little program to help me, what does it report for this blog?

In the style of one of my favorite XKCD comics and T-shirts, these stats include the post you’re reading right now.

The Stats
$ bundle exec exe/blogstats ~/src/blog/source/_posts
Posts: 150
Words: 82938
LOC: 3144
Images: 3
Videos: 3