A brief introduction.
Course Trakr collects course data from a university in order to generate notifications when any class opens or closes. These notifications are sent to subscribers of those particular classes. The largest challenge of this project is sending notifications of an opening or closing event as quickly as possible. This includes reducing the time taken to collect and clean data as well as detecting and discerning changes in the entire dataset. Throughout the time I've spent working on this project I've encountered some problems which I have outlined in my previous article. This article aims to explore some of the core features and functions that have reduced or eliminated my problems.
First and foremost is Docker. I turned to Docker to automate the building and deployment of Course Trakr's services into software containers.
What is Docker?
Docker containers wrap a piece of software in a complete filesystem that contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server. This guarantees that the software will always run the same, regardless of its environment. source
From top to bottom, my development process has improved significantly since I started making use of containers. I've spent less time aimlessly configuring servers and more time writing code and pushing changes. It has helped to draw a clear line between each service across all of my environments.
Changes pushed to a "develop" or "release" branch are now automatically tested and built into Docker images. All of Course Trakr's services are defined in a docker-compose.yml file where I can control everything from startup order to CPU and memory usage limits.
This is a simplistic view of each service's interactions. In reality they're a bit more connected.
The Go applications in blue form the core of Course Trakr: some scrapers; a program to clean, validate and push the scraped data to a database; one to serve data from the database to Android and iOS devices; and lastly, a program to publish notifications to users.
$ hermes --help
usage: hermes [<flags>]

An application that publishes notifications to Firebase Cloud Messaging

Flags:
  -d, --dry-run        enable dry-run
  -c, --config=CONFIG  Configuration file for the application.
After notifications are queued into Redis by Julia, Hermes dequeues them and dispatches them through Firebase Cloud Messaging (FCM), which allows me to unite iOS and Android users under one platform. There really isn't any other messaging platform that competes with FCM: it's completely free for both iOS and Android, there is no limit on the number of notifications you publish, and there's no limit on the number of users or subscriptions.
I utilize what FCM calls topic messaging. FCM topic messaging allows you to send a message to multiple devices that have opted in to a particular topic. You compose topic messages as needed, and Firebase handles routing and delivering the message reliably to the right devices.
There's not much that could go wrong with this program. It solely depends on Firebase and in the 6 months I've used their service, I haven't seen any error that didn't resolve itself after a retry or two.
It's quite cliché, but I named this after the messenger of the Greek gods, Hermes.
$ julia --help
usage: julia [<flags>]

An application that queues and processes messages from PostgreSQL.

Flags:
  -c, --config=CONFIG  configuration file for the application
As the usage information says, Julia is an application that queues and processes messages from PostgreSQL. It makes use of Postgres's asynchronous notification feature to listen for data sent on a channel. In Postgres I have a trigger that fires when a class opens or closes. Postgres packages relevant information about the class into a JSON object and sends the notification down a channel.
-- Build notification
notification = json_build_object(
    'notification_id', id,
    'status', NEW.status,
    'topic_name', NEW.topic_name,
    'university', _temp);

-- Execute pg_notify(channel, notification)
PERFORM pg_notify('status_change', notification::text);
Julia receives this notification object, processes it, then queues it for Hermes to send to client devices. Much of the functionality of this program was built into Hermes before I decided it was better to separate the problems it was trying to solve.
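As a rough illustration of the receiving side, here is a minimal sketch of decoding one payload from the status_change channel in Go. The struct name, the helper name and the Go field types are assumptions made for this example; only the four JSON keys come from the trigger. The university value is kept as raw JSON because its shape is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StatusChange mirrors the keys built by json_build_object in the trigger.
// The Go types are assumptions; the column types are not stated in the article.
type StatusChange struct {
	NotificationID int64           `json:"notification_id"`
	Status         string          `json:"status"`
	TopicName      string          `json:"topic_name"`
	University     json.RawMessage `json:"university"` // left undecoded: its shape is not shown
}

// parseStatusChange decodes one payload received on the status_change channel.
func parseStatusChange(payload string) (StatusChange, error) {
	var sc StatusChange
	if err := json.Unmarshal([]byte(payload), &sc); err != nil {
		return StatusChange{}, fmt.Errorf("decode status_change payload: %w", err)
	}
	return sc, nil
}

func main() {
	// A made-up payload, for illustration only.
	payload := `{"notification_id": 42, "status": "Open", "topic_name": "example.topic", "university": {"name": "Example University"}}`

	sc, err := parseStatusChange(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sc.NotificationID, sc.Status, sc.TopicName)
}
```

Listening on the channel itself would need a PostgreSQL driver with LISTEN support, which is left out here to keep the sketch limited to the standard library.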
Julia was born out of the necessity to delay notifications from Rutgers University. Rutgers has a cache invalidation issue: when a class moves from open to closed, successive scrapes report the class flipping between the open and closed states for a period of approximately 4 minutes. Scraping at 30-second intervals would therefore result in a class "opening" and "closing" 8 times within a 4-minute period. Julia solves this by maintaining a separate queue for Rutgers notifications and collapsing these events so that users are not bombarded by false positives.
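One way to picture the collapsing step is the batch sketch below. It is not Julia's actual implementation: the Event type, the collapse function and the quiet-period parameter are hypothetical, and a live service would more likely use timers on a queue. The idea is that status changes for a topic that arrive close together form one burst, and a burst only produces a notification if it ends in the same status it started with. Since the first event is a change to that status, ending anywhere else means the class is back where it began.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical status change for one topic.
type Event struct {
	Topic  string
	Status string // "Open" or "Closed"
	AtUnix int64  // time of the status change, in Unix seconds
}

// collapse expects events sorted by time. Events for the same topic that are at
// most quietSeconds apart are grouped into a burst. A burst yields its last event
// only when the last status equals the first one, i.e. there is a net change.
func collapse(events []Event, quietSeconds int64) []Event {
	type burst struct{ first, last Event }
	bursts := map[string]*burst{}
	var out []Event

	settle := func(b *burst) {
		if b.first.Status == b.last.Status {
			out = append(out, b.last)
		}
	}

	for _, e := range events {
		b, ok := bursts[e.Topic]
		if ok && e.AtUnix-b.last.AtUnix <= quietSeconds {
			b.last = e // still flapping: extend the current burst
			continue
		}
		if ok {
			settle(b) // the previous burst went quiet before this event
		}
		bursts[e.Topic] = &burst{first: e, last: e}
	}
	for _, b := range bursts {
		settle(b)
	}

	// Map iteration order is random, so sort the result to keep it deterministic.
	sort.Slice(out, func(i, j int) bool {
		if out[i].AtUnix != out[j].AtUnix {
			return out[i].AtUnix < out[j].AtUnix
		}
		return out[i].Topic < out[j].Topic
	})
	return out
}

func main() {
	events := []Event{
		{"class-a", "Closed", 0},
		{"class-a", "Open", 30},
		{"class-a", "Closed", 60},
	}
	for _, e := range collapse(events, 60) {
		fmt.Println(e.Topic, e.Status, e.AtUnix)
	}
}
```

The trade-off in any approach like this is latency: a notification cannot go out until the burst has been quiet for the chosen period.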
Julia is instrumented to visualize the number of notifications that are collapsed and for how long.
$ spike --help
usage: spike [<flags>]

An application to serve university course information.

Flags:
  -p, --port=9876      Port to start server on.
  -c, --config=CONFIG  Configuration file for the application.
As the usage information suggests, Spike is a typical web API for serving data to clients. It serves data in JSON or Protocol Buffer format depending on client support, though the JSON route is mainly used for debugging since both the Android and iOS clients support the Protobuf format. Connections to the database are pooled, queries are prepared, HTTP requests are logged and responses are cached using Redis.
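How Spike decides between the two formats is not described, so the following is only a sketch of one common approach: inspecting the request's Accept header. The media type application/x-protobuf is a widely used convention rather than something taken from this project, and chooseFormat is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// protobufMediaType is an assumed convention, not a value taken from Spike.
const protobufMediaType = "application/x-protobuf"

// chooseFormat returns "protobuf" when the Accept header lists the Protobuf
// media type, and falls back to "json" otherwise.
func chooseFormat(accept string) string {
	for _, part := range strings.Split(accept, ",") {
		// Drop parameters such as ";q=0.9" before comparing.
		mediaType := strings.TrimSpace(strings.SplitN(part, ";", 2)[0])
		if strings.EqualFold(mediaType, protobufMediaType) {
			return "protobuf"
		}
	}
	return "json"
}

func main() {
	fmt.Println(chooseFormat("application/x-protobuf"))
	fmt.Println(chooseFormat("application/json"))
}
```

In an HTTP handler this would be called with r.Header.Get("Accept") before encoding the response.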
Spike gets its name from Spike Spiegel, the protagonist of the anime Cowboy Bebop.
Spike is instrumented to visualize the latency of responses to HTTP requests.
$ ein --help
usage: ein --format=[protobuf, json] [<flags>]

A command-line application for inserting and updated university information

Flags:
  -a, --insert-all               Disables optimizations when updating relations.
  -f, --format=[protobuf, json]  Choose input format.
  -c, --config=CONFIG            Configuration file for the application.
Ein, the data daemon, is named after Ein, the data dog from Cowboy Bebop. It is not to be confused with Datadog, the cloud monitoring service.
I'm hesitant to call this a microservice since it's quite fat and has many responsibilities.
Scrapers queue university course data into Redis to be processed. Ein dequeues the data as a unit of work, then processes and primes it to be updated in the database. It performs a number of optimizations that reduce the number of transactions needed to update a university's data by ~90%. Data is stored in a normalized form for referential integrity and a denormalized form for better read performance.
PostgreSQL itself was tuned to perform well under the high write volume. Connections to the database are pooled and reused. Queries are meticulously planned. The majority of the engineering effort went into this service.
When I started this project, processing all of the courses took 3 mins/MB of course data. Over the past few months I've incrementally taken that down to a point where the largest university dataset would take 12 seconds. Recently I deployed a change that brought the processing time down to, on average, less than a second, thanks mostly to the excellent tooling that supports Go.
$ jet --help
usage: jet --output-format=[protobuf, json] --input-format=[protobuf, json] --scraper-name=SCRAPER-NAME --scraper=SCRAPER [<scraper-flags>]

A program the wraps a scraper and collect it's output

Flags:
      --output-format=[protobuf, json]  Choose output format
      --input-format=[protobuf, json]   Choose input format
      --daemon=DAEMON                   Run as a daemon with a refresh interval
      --daemon-dir=DAEMON-DIR           If supplied the deamon will write files to this directory
  -c, --config=CONFIG                   configuration file for the application
      --scraper-name=SCRAPER-NAME       The scraper name, used in logging
      --scraper=SCRAPER                 The scraper this program wraps, the name of the executable
$ jet -c config.toml --scraper-name myscraper /go/bin/scraper
A typical invocation looks like the one above, with the default formats set to protobuf and a default refresh interval of 1m (1 minute). Jet executes the provided scraper and collects the program's stdout. When the scraper exits, the output is checked for integrity and is then queued in Redis for further processing by Ein. The graphic below illustrates the typical behavior of a single instance of a wrapped scraper.
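The core of that loop, running an executable and collecting its stdout, can be sketched with Go's os/exec package. This is an illustration rather than Jet's code: runScraper is a hypothetical helper, and the "integrity check" here is only a stand-in that rejects empty output, since the real check is not described.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
)

// runScraper executes the scraper once and returns everything it wrote to stdout.
// Stderr is captured separately so it can be reported when the scraper fails.
func runScraper(path string, args ...string) ([]byte, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(path, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("scraper %s failed: %w (stderr: %q)", path, err, stderr.String())
	}
	// Stand-in integrity check: the real validation is not shown in the article.
	if stdout.Len() == 0 {
		return nil, errors.New("scraper produced no output")
	}
	return stdout.Bytes(), nil
}

func main() {
	// "echo" stands in for a real scraper executable.
	out, err := runScraper("echo", "pretend this is scraper output")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("collected %d bytes\n", len(out))
}
```

Because the wrapper only depends on an exit code and stdout, the wrapped program can be written in any language.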
Here Jet executes the scraper at a 1-minute interval, ensuring the data in the database is only 1 minute stale. A minute is a long time for a popular class to be open, especially at the start of the semester when demand is the greatest. Users may be competing with another student manually refreshing their browser. Reducing Jet's interval from 1m to 30s is one way to increase the database's freshness.
I have found a number of problems with this approach. With the reduced interval, separate goroutines could be scraping simultaneously: one nearing completion and one just starting. This produces more logs, which makes debugging a little more difficult. You now potentially have 2x the TCP connections. Reasoning about the program's internal state also becomes painful because the complexity has increased. In a way, reducing the interval on a single instance in response to higher demand will only scale vertically.
Scaling Jet is not as trivial as starting a new instance. You may end up in a situation where 2 instances start at nearly the same time. For example, with an interval of 60 seconds, rutgers-1 starts at t=0 and another instance, rutgers-2, starts at t=5. After the first scrape, they start again at t=60 and t=65 respectively. There is no benefit to this configuration.
Each new instance will have to synchronize with other instances to divide the interval and synchronize their scraping.
It follows that an instance rutgers-1 would start at t=0 and a replica of that instance, rutgers-2, will discover rutgers-1's start time then resolve itself to begin scraping at t=30. This way, the data freshness will be evenly distributed across the 60-second interval.
Again, it follows that scaling up to 3 instances will shave the data freshness down to 20 seconds. Looking at a 3-minute timespan, rutgers-1 will scrape at t=0,60,120, rutgers-2 at t=20,80,140 and rutgers-3 at t=40,100,160. The beauty of this approach is that the instances do not need to be on the same machine. They synchronize with and discover each sibling instance through a remote key-value store, and they automatically reconfigure themselves when instances are added or removed. The greater advantage of this approach is that this program could wrap any other scraper written in any language. It could be built into a container to be deployed and scaled anywhere.
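The arithmetic behind dividing the interval can be worked through in a few lines of Go. This sketch covers only the schedule, not the discovery through the key-value store, and scrapeTimes is a hypothetical helper: given an instance's position among its siblings, it offsets that instance by an equal share of the interval.

```go
package main

import "fmt"

// scrapeTimes lists the start times, in seconds, at which the instance at
// position index (0-based) out of `instances` siblings would scrape during the
// first spanSec seconds, when every instance repeats every intervalSec seconds.
func scrapeTimes(intervalSec, instances, index, spanSec int) []int {
	if intervalSec <= 0 || instances <= 0 || index < 0 || index >= instances {
		return nil
	}
	// Each instance is shifted by an equal share of the interval.
	offset := intervalSec * index / instances

	var times []int
	for t := offset; t < spanSec; t += intervalSec {
		times = append(times, t)
	}
	return times
}

func main() {
	// Three instances sharing a 60-second interval over a 3-minute timespan.
	for i := 0; i < 3; i++ {
		fmt.Printf("rutgers-%d: %v\n", i+1, scrapeTimes(60, 3, i, 180))
	}
}
```

When an instance joins or leaves, every sibling only needs the new instance count and its own position to recompute its offset.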
Jet is instrumented to report how long it takes for a scraper to complete, as well as how many bytes were scraped. Alerting rules are set up to fire when either of these metrics is outside its acceptable range.
Course Trakr currently has one stable scraper. The system as a whole is designed to be independent of the programming language used by the scraper. Go was my first choice here because Rutgers University has a JSON API for course information. Go and its third-party libraries don't provide the necessary tools for scraping HTML/JS websites, and I fully expect to have to use a different language for certain university websites.
The decision to make it language independent would also allow for greater community contributions.
The requirements for a Course Trakr scraper are as follows:
- Based on the system date, it must intelligently resolve current, next and last semesters of the university for which it scrapes.
- If nothing changed in the data source the output must be the same every time. Determinism is a must.
- It must reuse connections and rate limit itself.
- It must produce an output that fits the defined Protobuf schema.
- It must write output to STDOUT.
Once a scraper fulfills these requirements, it can be wrapped by Jet to facilitate containerizing and scaling.
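To make the determinism and STDOUT requirements concrete, here is a minimal sketch in Go. It uses JSON as a stand-in for the Protobuf schema, which is not shown, and the Course type and encodeCourses helper are hypothetical. The point is that the output is sorted by a stable key before encoding, so the same data always produces the same bytes regardless of the order in which it was scraped.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// Course is a hypothetical record; the real schema is defined in Protobuf.
type Course struct {
	Number string `json:"number"` // assumed to be unique per course
	Name   string `json:"name"`
	Status string `json:"status"`
}

// encodeCourses returns the same bytes for the same set of courses, whatever
// order they arrive in. It sorts a copy so the caller's slice is left untouched.
func encodeCourses(courses []Course) ([]byte, error) {
	sorted := make([]Course, len(courses))
	copy(sorted, courses)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Number < sorted[j].Number })
	return json.Marshal(sorted)
}

func main() {
	courses := []Course{
		{Number: "02", Name: "Example Course B", Status: "Open"},
		{Number: "01", Name: "Example Course A", Status: "Closed"},
	}
	out, err := encodeCourses(courses)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	// Data goes to STDOUT only; diagnostics belong on STDERR.
	os.Stdout.Write(out)
	fmt.Println()
}
```

Keeping diagnostics off STDOUT matters here because the wrapper treats everything on STDOUT as scraped data.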
Monitoring it all
Log management is not easy and there are tons of services created solely to make this part of application monitoring simple. I often get the feeling that I'm simultaneously logging too much and too little, but I've found that this means I'm not logging the correct information about my application's runtime. My advice is to make judicious use of log levels. Try to draw a line between logs necessary for debugging an application and those necessary for informing you about the internal state.
Docker has a built-in feature where it will collect container logs written to STDOUT and, by default, write them to a file. Building on that, you can specify a logging driver, an intermediary service to which Docker will forward logs. After careful evaluation I chose the Fluentd driver to collect, filter and parse logs to be sent to AWS S3 for archival, AWS CloudWatch for searching and InfluxDB for metrics storage.
The more novel part of my approach is that I do not have to do any configuring for new services. I automatically generate a Fluentd configuration for any container based on a template I created. When a container starts or dies, a configuration file is generated to define the log sources and the outputs, filling in any secrets needed for AWS or InfluxDB.
I also use Telegraf as an agent to collect metrics from all services. This includes metrics from Postgres, Redis, NGINX, Docker and even the host machine itself. All of the metrics are inserted into InfluxDB where they can then be queried and visualized in Grafana.
Finding a program's baseline can give you a sense of what is normal, which you can then use to determine what is abnormal. This is where you can set up alerting rules for when something goes wrong. Grafana has an alerting feature where you can, on top of existing graphs, set up alerts against certain rules. Being able to visualize your logs and metrics can give you invaluable insight into your service's behavior at runtime.
Building all my services needed to be fast, and that's second only to it needing to be cheap. There are TONS of CI services out there and they are all very expensive, for good reason. But I'm not willing to shell out $50 to $150 a month for something that can be replaced by a webhook and a bash script. Searching for an alternative has led me to set up a self-hosted CI server with Drone.
It was pretty straightforward (unlike Jenkins) to set up and get running. Their documentation is lacking, but there's nothing you can't figure out when you have the source code.
My previous solution of using Docker Hub's automated builds took almost 15 minutes to build all my images. With Drone, building all 14 of my images takes a measly 2 minutes.
I'd advise users of Drone to use as many of its security features as possible. Sign repositories with drone sign my/repo to prevent a malicious attacker from gaining elevated privileges. Try not to leak sensitive data to logs; Drone currently has no feature to clear the logs of a build. Using their CLI tool, inject secrets with drone secret add to replace variables in your build config file.
Overall it has been very stable. I'm running it on an AWS free tier box with 1GB RAM. It has crashed a few times, but those crashes were only due to the enormous resource usage of the Go compiler. I definitely recommend keeping an eye out for it.