The hunt for interested users browsing a Ghost website.

or "Ghost access logs revisited"

At EducationWarehouse we use SimpleAnalytics these days to monitor our website usage. They provide a very intuitive analytics page which our customers appreciate. It's privacy-focused from the ground up and therefore doesn't require a cookie banner. They discard bots, they actually adhere to the do-not-track header, and to keep the stats a little realistic they disregard browsing activity of less than 5 seconds to avoid counting page bounces. Very importantly, they don't show an average but give the median.
Furthermore, because they host it, I don't have to build and host all of that goodness myself.

I really appreciate what they have built, but it lacks one feature which they purposefully will not support because of their privacy-first principle: how long a user has browsed a website. This is a metric for which there is no standard in log-parsing software.
Since these are Ghost websites, there's a little catch to logging. Ghost doesn't log in the ordinary access.log format. For each request the software appends a single line containing a complete JSON document to the logfile. (While writing this article I discovered it's probably the Caddy log format mentioned below.)
In earlier attempts to find a good web log analyzer I came across GoAccess, which I really like, but which didn't have support for Caddy logfiles at the time, if I remember correctly. As a log parser, GoAccess defines visits in the following way:

A hit is a request (line in the access log), e.g., 10 requests = 10 hits. HTTP requests with the same IP, date, and user agent are considered a unique visit.

Given this definition you end up with two visits for one actual session if it crosses a date boundary. Furthermore, people can visit the same site at different times of the day, which GoAccess counts as one visit but which to me might be different visits (or sessions).

So I wanted to fine-tune the log parsing a little. I want SimpleAnalytics-compatible statistics, but with a visit that takes actual user behavior into consideration. To differentiate, I used the term "session".

A session is a bunch of consecutive hits from the same IP and user-agent, bounded by inactivity of at least 4 hours.

That means that if you open the browser in the morning and read some pages on your cell phone, and continue reading during lunch, you would probably have 2 sessions. If you then continue reading at 23:45 until 00:03 you end up with three sessions, even though that last one crosses midnight. Depending on the content, the 4 hours might need tuning, so I'm still figuring out what that number should be exactly. But for now, 4 hours will do.

Because we have a few editors working constantly on these sites, we want to remove them from the statistics. Because of floating IP addresses I can't tell these users apart up front. But as soon as someone logs in, I can blacklist that IP and user-agent combination and disregard their traffic. More on that later.
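
A heavily simplified sketch of what that blacklisting could look like. The idea that a request under /ghost/ (Ghost's admin area) marks a logged-in editor is my own assumption here, not necessarily how the actual script detects logins:

editor_signatures = set()   # ip/user-agent combinations of logged-in editors

def handle_hit(signature, path):
    if path.startswith("/ghost/"):   # assumption: only editors hit the admin area
        editor_signatures.add(signature)
    if signature in editor_signatures:
        return                       # disregard editor traffic from here on
    # ... normal statistics bookkeeping ...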

To parse the logfiles and produce this goodness I wrote a little Python script. It uses a bunch of wonderful libraries:

These are standard library modules:

  • fileinput - This module implements a helper class and functions to quickly write a loop over standard input or a list of files.
  • collections - This module implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers: dict, list, set, and tuple. The Counter type is really useful for these situations.
  • statistics - This module provides functions for calculating mathematical statistics of numeric (Real-valued) data. Used in this script for finding the median.

Non-standard libraries:

  • orjson - orjson is a fast, correct JSON library for Python
  • pyhash - pyhash is a python non-cryptographic hash library. It provides several common hash algorithms with C/C++ implementation for performance and compatibility
  • humanize - This modest package contains various common humanization utilities, like turning a number into a fuzzy human-readable duration ("3 minutes ago") or into a human-readable size or throughput.
  • more-itertools - In more-itertools we collect additional building blocks, recipes, and routines for working with Python iterables.
  • text-histogram3 - Histograms are great for exploring data, but numpy and matplotlib are heavy and overkill for quick analysis. They also can’t be easily used on remote servers over ssh.
  • pydal - pyDAL is a pure Python Database Abstraction Layer.
  • tabulate - Pretty-print tabular data in Python, a library and a command-line utility.

The script's parsing logic is built around a single for loop:

import fileinput

for line in fileinput.input(files=['normal.log']):
    # parse line
    ...

Which in this case requires a normal.log file to be present. Remove the files argument here and you'll get fileinput.input's automagic behavior of figuring out which files to read. Read the docs for details.
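
For example, running the script as python parse_logs.py access.log access.log.1 (the script name here is just an example) would iterate over both files in order:

import fileinput

# with no files= argument, fileinput falls back to the filenames given on the
# command line, or to stdin when there are none
for line in fileinput.input():
    # parse line
    ...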

Each line might map to a hit as defined by GoAccess, but sometimes it's other info. Those lines are filtered out, as is every line that doesn't include a user-agent, or whose user-agent contains "bot". More filtering is done further along in the processing.
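
A sketch of that filtering step could look like this. The exact JSON field names are assumptions on my part; they depend on how your Ghost/Caddy setup writes its log lines:

import orjson

def parse_hit(line):
    """Return the interesting fields of a hit, or None if the line should be skipped."""
    try:
        doc = orjson.loads(line)
    except orjson.JSONDecodeError:
        return None                                           # not a JSON log line
    headers = (doc.get("req") or {}).get("headers") or {}     # assumed field names
    user_agent = headers.get("user-agent", "")
    if not user_agent or "bot" in user_agent.lower():
        return None                                           # no user-agent, or a bot
    return {
        "ip": headers.get("x-real-ip", ""),
        "user_agent": user_agent,
        "timestamp": doc.get("time"),                         # assumed field name
    }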

Next a 'signature' is calculated based on the x-real-ip header as well as the user-agent. pyhash's xx_64() provides a fast hasher that quickly produces an integer, for fast lookups and low memory requirements.
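
In code that boils down to something like this (a sketch; the separator character is my own choice):

import pyhash

hasher = pyhash.xx_64()   # fast, non-cryptographic 64-bit hash

def signature(ip, user_agent):
    # one integer per ip/user-agent combination: cheap to store and to compare
    return hasher(f"{ip}|{user_agent}")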

After the fingerprint is known, basic statistics can be gathered. collections.Counter is perfect for that: simply add 1 to a given key of a Counter instance and it will be recorded. Every key implicitly starts at 0, so it's basically built for this common purpose.
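
For example:

from collections import Counter

hits = Counter()

# for every hit that passes the filters:
sig = 1234567890            # signature of this hit (illustrative value)
hits[sig] += 1              # missing keys start at 0, so no setup needed

print(hits.most_common(5))  # the five busiest ip/user-agent combinations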

The script keeps track of the last-seen timestamp per signature and compares it with the next activity to decide whether a new session has started. With some bookkeeping in dictionaries, and later on some work in sqlite, it's quite easy to add unique session ids to each hit. Furthermore, each hit is coupled to a session that has its own record in an sqlite database.
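
A simplified sketch of that boundary check (the real script also hands out unique session ids and records them in sqlite):

from collections import Counter
from datetime import timedelta

SESSION_GAP = timedelta(hours=4)   # inactivity that closes a session

last_seen = {}                     # signature -> timestamp of the previous hit
sessions_per_signature = Counter()

def register_hit(sig, timestamp):
    previous = last_seen.get(sig)
    if previous is None or timestamp - previous >= SESSION_GAP:
        # enough silence since the last hit: this hit starts a new session
        sessions_per_signature[sig] += 1
    last_seen[sig] = timestamp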

Eventually the script finishes the loop and has a bunch of statistics in Counter and dictionary instances. Moving some of these over to sqlite really makes things a little easier. While the report might eventually only produce a histogram, the sqlite file is produced to allow for more follow-up or drill-down statistics without having to write Python code. It feels somewhat like select * from accesslog.
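
With pyDAL, getting those results into sqlite takes only a few lines. The file, table and field names below are illustrative, not necessarily the ones the script uses:

from datetime import datetime
from pydal import DAL, Field

db = DAL("sqlite://accesslog.sqlite")
db.define_table("session",
    Field("signature", "bigint"),
    Field("started", "datetime"),
    Field("duration", "double"),   # seconds between first and last hit
)

db.session.insert(signature=1234567890,
                  started=datetime(2022, 1, 1, 9, 30),
                  duration=312.0)
db.commit()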

One of the fun parts of this script, I think, is the introduction of a new function for sqlite. SQLite is an embedded, in-process SQL engine. Because it runs as part of your program, it's very easy to define a function in Python and use it from a SQL statement.

import statistics

class SqliteMedian:
    """Aggregate that collects all values and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # called once per input row; skip NULLs
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # called once when the aggregation is done
        return statistics.median(self.values) if self.values else None

# register the aggregate on the sqlite3 connection that pyDAL uses under the hood
db._adapter.connection.create_aggregate("median", 1, SqliteMedian)

Here you see it being used to map statistics.median to SQLite.

Usage from a query is like:

select median(duration) from session;
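
The same query can also be fired from Python through pyDAL's raw-SQL escape hatch, once the aggregate has been registered on the connection:

# rows is a list of tuples; the query has a single row with a single column
rows = db.executesql("select median(duration) from session;")
print(rows[0][0])   # the median session duration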

The script was written with pure Python parsing in mind at first. Later I added the SQL part, which you can see in the structure of the script. It's a little messy, but it works. If you have any additions, patches or questions: please feel free to drop a line or open an issue on GitHub.

Thanks for reading.