All posts by Jack Schultz

Rebooting the blog!

After a considerable amount of time off, I figured it was time to reboot the old blog. Unfortunately, in the time since, I ended up shutting down the server that was hosting it without saving my previous posts. At that time, I had probably accumulated around 20 to 30 posts, a considerable loss. Luckily, I was able to search web.archive.org and found that it had done two archives of the old site! So all the posts below are copied and pasted from the archive, and they were able to retain most of their formatting.

Unfortunately, I know there are some posts that web archive missed, both some from the beginning of when I started writing and a few from the end. (I know I did one analysis about home court advantage in the NBA which was unfortunately deleted. Maybe someday I’ll re-write that, but I do remember it was about 3 points). Not much to do about the missing posts, but at least I could recover some.

Another note: there’s a good chance that some of the links in the older posts no longer work. Streakflow and Yttogether are both offline as of now, but the code (as naive as it is) is still up on my GitHub account.

With that, welcome back, and hopefully I can provide some new insights for the future.

The Importance of sentences in articles

(Originally posted August 28, 2013)

I’ve always been interested in Natural Language Processing (NLP), so I wanted to try my hand at a simple article summarizer. The basic idea is that we want to boil down the article to only its most important sentences. Disregard the fluff, and return the ones with the most information. Sounds simple, but how to pick those sentences wasn’t exactly obvious off the bat. Even trying to rank sentences on my own was tough. After a few hours of research, I came across this post which had a very clever way of determining the important sentences. The important sentences in the article should be those that share the most words with other sentences. To get an idea of this, we realize that an important sentence should have information, and supporting sentences should explain the parts of the main one.

To calculate this, we create a connected graph between sentences where each link is the number of words in common between the sentences, normalized by length. We represent this graph as a matrix and simply loop through the sentences and compare the words. This is the naive approach that the post’s author takes, but he also gives a few suggestions for improvement, such as stemming the words and removing stopwords. Stemming deals with removing pluralizations and other non-root endings from words. For example, stemming turns “roots” into “root”. Stopwords are just common words such as ‘and’ or ‘or’ which shouldn’t provide much information about the topic of the sentence. For these techniques, Python’s Natural Language Toolkit is fantastic and provides this out of the box.

After ranking all the sentences, the final step is to determine how to display the shortened article. The way the post’s author did this was by picking the best sentences from each paragraph. I wanted to be able to shorten the length arbitrarily, so I decided to, at least at the moment, display the most informative X sentences in the order they were written, where X is arbitrary.
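The steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code behind the post: the tiny stopword set stands in for NLTK's, stemming is left out, and dividing the shared-word count by the average sentence length is an assumption about what "normalized by length" means. The names tokenize, similarity_matrix and summarize are made up for the example.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK ships a fuller one).
STOPWORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is", "it"}


def tokenize(sentence):
    """Lowercase a sentence and return its set of non-stopword words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}


def similarity_matrix(sentences):
    """Entry [i][j] is the shared-word count divided by the average length."""
    bags = [tokenize(s) for s in sentences]
    n = len(bags)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a sentence is not compared with itself
            avg_len = (len(bags[i]) + len(bags[j])) / 2
            if avg_len > 0:
                matrix[i][j] = len(bags[i] & bags[j]) / avg_len
    return matrix


def summarize(sentences, x):
    """Return the x highest-scoring sentences in their original order."""
    matrix = similarity_matrix(sentences)
    scores = [sum(row) for row in matrix]  # total overlap with all others
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:x]
    return [sentences[i] for i in sorted(top)]
```

Calling summarize(["Cats chase mice.", "Mice fear cats.", "Bananas are yellow."], 2) keeps the two sentences about cats and mice, since they share words with each other and the third shares none.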

At the moment, there are still many improvements to be made. The algorithm does well for those “stock” articles with just information. Opinion pieces are a little tougher to boil down to just the main points. By modifying some of the pieces or the ranking algorithm, it should be able to perform well no matter what the content.

Edit:

After running the above article through the algorithm, I got the following 8 sentences:

Sent 1: The basic idea is that we want to boil down the article to only its most important sentences.
Sent 5: After a few hours of research, I came across this post which had a very clever way of determining the important sentences.
Sent 6: The important sentences in the article should be those who share the most words with other sentences.
Sent 7: To get an idea of this, we realize that an important sentences should have information, and supporting sentences should explain the parts of the main one.
Sent 8: To calculate this, we create a connected graph between sentences where each link is the number of words in common between the sentences, normalized by length.
Sent 9: We represent this graph as a matrix and simply loop through the sentences and compare the words.
Sent 15: After ranking all the sentences, the final step is to determine how to determine how to display the shortened article.
Sent 16: The way the post s author did this was by picking the best sentences from each paragraph.

Not bad, but could probably do a little better. We’ll see how it goes.

Angular.js Show and simplicity

(Originally posted August 22, 2013)

When mapping Streakflow to a mobile app, I decided to use Trigger.io and their forge toolchain. Considering I’ve never done mobile development, that I wanted to deploy on iOS and Android out of the box, and that there have been many successful apps using Trigger.io, it felt like a great fit. Doing this meant that I needed to learn a JavaScript MVC framework. There were a few choices, but I went with Angular.js because I’d been hearing so much buzz about it, and it turned out to be a great decision.

Among other niceties, Angular’s ng-show directive is particularly simple to work with. In the app, I’ve been using it as a makeshift if statement in the templates, though it turns out to work nicer. I’ll go through two different ways I’ve used ng-show in the app.

The first example is the simplest, and will show how easy ng-show is. As a side note, I didn’t think this method would work when I saw the example online.

<div ng-show="var_in_scope">
  <h1>Variable is true!</h1>
</div>

Now change this variable in your controller to either value:

//h1 will be visible
$scope.var_in_scope = true;
//h1 invisible
$scope.var_in_scope = false;

The div in the HTML will hide and show accordingly. That’s all there is to it. Note that the variable in the HTML does not need the double curly braces since it is inside an Angular directive.

The other way I’ve used ng-show is by calling a function. The syntax is exactly the same, but it just calls the function from the directive.

<div ng-show="var_in_scope()">
  <h1>Variable is true!</h1>
</div>

and the javascript this time is

// Returns true about half the time, so the div shows at random.
$scope.var_in_scope = function() {
  return Math.random() > 0.5;
};

Since ng-show evaluates the expression and shows the HTML if the result is “truthy”, this works as well. Very simple and clear to anyone reading the code.

Django and Celery

(Originally posted August 19, 2013)

While working on Streakflow, it became evident quickly that email reminders would be important. Not only because they would hopefully bring back users, but also because a little nagging is a good way to get people to finish goals. This email system should email users daily, at a time selected by them. This is exactly the type of thing that is perfect for Celery. In this case, we’re going to be using the scheduling portion of Celery. Jumping in to something new and unknown may be overwhelming, but I’ll show you how simple it is to configure Celery to use in Django, and run in production.

The first step is to install django-celery,

$ pip install django-celery

Then add djcelery to your installed apps,

INSTALLED_APPS += ("djcelery", )

And then migrate your database (assuming you’re using South, which you should):

$ python manage.py migrate djcelery

Finally, add the following two lines to your settings.py file.

import djcelery
djcelery.setup_loader()

That’s as far as the django-celery docs front page gets you, and it really is (almost) that simple. From here, we want to add functionality for scheduled tasks. We’ll deal with the code first, and the production setup at the end.

Go back to your settings.py file and add the following.

BROKER_URL = 'amqp://guest:guest@localhost:5672//'

CELERY_TIMEZONE = 'UTC'
from celery.schedules import crontab
CELERYBEAT_SCHEDULE = {
    'check-goals-complete': {
        'task': 'streakflow.apps.members.tasks.reminder_emails',
        'schedule': crontab(minute='*/30'),
    },
}

BROKER_URL is what celery uses to connect to the broker (something like Redis or RabbitMQ). We’ll get to that later.

CELERY_TIMEZONE is used so celery can keep track of the time correctly. There’s no reason here not to have it be UTC.

We then import crontab for use in the schedule. Here we do two things: we give the scheduled task an arbitrary name, and then tell celery where it should look for the task and on what schedule it should run the code. The ‘task’ parameter is written in the same dotted style as an import path, and the crontab definition here is set up for every 30 minutes because some timezones are at half-hour offsets. This code is very much straight from the docs.
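To see why the 30-minute schedule matters, here is a minimal sketch of how a single run could decide whether a user is due for a reminder. It is not the actual Streakflow task: the is_due helper, its parameters, and the assumption that users pick reminder times on half-hour boundaries are all hypothetical.

```python
from datetime import datetime, time, timedelta, timezone


def is_due(now_utc, utc_offset, reminder_time):
    """True if the user's local reminder time starts the current 30-minute slot.

    now_utc: timezone-aware datetime of the current run (hypothetical input).
    utc_offset: the user's offset from UTC as a timedelta.
    reminder_time: local time of day the user picked, on a half-hour boundary.
    """
    local = now_utc.astimezone(timezone(utc_offset))
    # Round the local time down to the start of its 30-minute slot.
    slot_start = local.replace(
        minute=(local.minute // 30) * 30, second=0, microsecond=0
    )
    return slot_start.time() == reminder_time
```

A user at UTC+5:30 who wants a 9:00 reminder is due on the 03:30 UTC run. A schedule that only fired on the hour would see that user's local clock at 8:30 and 9:30 and never at 9:00.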

In our tasks.py file where we define the task, we just need to make sure that we have the following pieces.

from celery import task

@task
def reminder_emails():
  ...
  ...
  send_email(.....)

The logic would replace the …s. As long as this code matches the path from settings, you’ll be fine.

At this point, that’s all the code/setup you need, so now we’re going to shift over to getting this thing to run in production. For me, I used the smallest instance at Digital Ocean. They’re very cheap, and also come with many tutorials to get you started, which is fantastic. Highly recommended. After firing up a default Ubuntu box and installing everything else, which isn’t the topic of this post, we want to install the queue.

$ sudo apt-get install rabbitmq-server

All we want to do with this is make sure that it is running.

$ sudo rabbitmqctl status

It should be running, but if it isn’t, you can always start the service:

$ sudo service rabbitmq-server start

To monitor the services, I like to use supervisor. It’s intuitive and simple to hook up to the pieces of a Django deployment, like gunicorn. For celery, we’re going to use two supervisor programs. This is because we need to use celery beat for the scheduler, as well as have a celery worker to execute the code. The code for both will be almost identical. For each, we need two pieces of code. One is a bash script that contains the code that we want to run when we start the celery process. This looks like the following for the worker.

#!/bin/bash

source /path/to/env/bin/activate
cd /path/to/django/proj/
exec python manage.py celery worker --loglevel=INFO

This activates the virtualenv, changes directories to where manage.py is, and runs the celery command through manage.py like it says in the docs. The only difference for the beat script is that we want to execute “celerybeat” instead of “celery worker”. Make sure to chmod u+x this script!

Finally, we want to create a supervisor conf file for each. The location for this should be in /etc/supervisor/conf.d/ along with your supervisor conf file for gunicorn if you’re using that too.

[program:celery_worker]
command=/path/to/bin/celery_worker_start
stdout_logfile=/path/to/logs/celery_worker.log
redirect_stderr=true

Again, the difference for celery beat is just changing the worker to beat and making sure they match up.

To load the celery things into supervisor,

$ sudo supervisorctl reread
$ sudo supervisorctl update

You can then check the status of both of them by running

$ sudo supervisorctl status

At this point you should see both of them running! If you check the logs, you can see that the celery beat program is waking up and checking how things are, and then sleeping until it is time to run the mailer. The celery worker just sits there until it gets a task. Anything that you print from the task will show up in the log. Also, you can change the log level from INFO to DEBUG if you want more information.

Unexpected outcomes

(Originally posted August 16, 2013)

Just a quick note on unexpected outcomes. The NFL’s collective bargaining agreement with the players’ union changed so that there are fewer offseason requirements for workouts. This was meant to give the players more rest and time off. Seems like a great idea, especially considering how violent the NFL is and how long-lasting many of the injuries are.

Now obviously correlation is not causation, and I wasn’t able to find statistics, but right at the beginning of training camp there were, according to NFL reporters, an inordinate number of ACL injuries. The reporters, many of whom are former players, felt that part of this may have come from the players’ tendons not being warmed up enough to handle the stress of football since they had been away from it for longer than normal. In the second week, these injuries seem to have wound down, which fits the theory because the players have now had time to adjust.

Now there’s no real way to check the veracity of this since it is only one year and statistical outliers can occur at any time, but it is an interesting reminder to realize that good intentions can sometimes have unintended consequences.

Landing pages — Show don’t Tell

(Originally posted August 15, 2013)

Effective landing pages should perform two functions. The first, and obvious one, is to convince the visitor that the service is worthwhile enough to go through the process of signing up. The other function that it should perform is to show the user that the interface is simple and manageable. This is slightly less obvious, but just as important. No matter how interesting and awesome the service might be, if it is complicated, no one will want to use it.

I’m going to assume that you can all write the amazing copy filled with buzzwords and flashy text that convinces the user that what you made is important. As for showing that the interface is simple, this is a lot tougher.

The most common method is using images that show off the interface. Whether it’s a large image at the top of the page, or little images that coincide with specific pieces of text, images do a fine job of showing the app off. But static is boring, and you can do better.

Another method that is seen is a video that walks through the important points of the app. This is an improvement over static because it is less work for the user since all they have to do is sit and watch, and it allows you to curate exactly what you want the user to see. But getting a visitor to click on the video, especially with sound, is not easy.

A combination of the first two options is to use a series of images that show the usage of the app, but without the sound. This is becoming more popular recently as it still captures the visitor’s attention, and is already playing when they scroll to that location. With this, you can show off the flow of the app without being as intrusive as a video would be, and it doesn’t require the user to fire up the video player.

These are all good options, but they still only tell the visitor about the service. You want to show them.

When I was building Streakflow, I needed another way to convince new visitors that the service is worthwhile. At first, I tried all of the methods I mentioned above. But none of those could really capture the essence of how to use the site. I realized that instead of just trying to explain the app, I could let them do the discovery themselves. It seemed simple, especially considering that the main functionality was on one page. The result of that is the following demo.

The demo does two things. It shows the simplicity of the interface, and it acts as a slight teaser that makes the visitor want to sign up and use it for real. Plus, it has a game-like interface, which is always a plus for the user.

The entire thing is a state machine built in client-side JavaScript. There’s no backend, and all actions are kept locally. And with that code, it brings the app to life, without the user having to do anything initially.

It’s pretty simple to envision a demo for one-page apps, but multipage or more complicated apps may have a harder time. Twitter could have a split screen where you have two users. You would be able to follow and unfollow and tweet fake things. If you’re following the other guy, then you’ll see their tweets in your feed. Unfollow and they won’t show up. It’s simple, shows the functionality, and would make you want to do it for real.

You could even get more complicated, but still have the coolness factor. Something like Foursquare could have a mini city where you move your stick avatar to different places and check in there. This shows the functionality, and would be kind of fun, and again, makes you want to do it for real.

When you don’t have traction, doing whatever you can to get the visitor over the hill and signed up is important. The standard way of doing this on the landing page is with writing and images. Telling doesn’t do anything. Showing, by having an interactive aspect, can be very beneficial.

Django tests and app naming

(Originally posted August 9, 2013)

I was working on a little messaging app for a Django site and I had a little error when performing tests. I had named the app “messages” considering that was the function performed. Well, Django also has a built-in app named “messages” as well. This app allows for messages to be sent to users notifying them of site updates and things like that. Not the same functionality as what I was building. So when I wanted to run the tests on my messaging app,

$ python manage.py test messages

Django picked their built-in app to run the tests against, all of which took way longer than my app’s tests should have.

The fix for this is really easy: just make sure that Django’s messaging app is listed below the custom app in INSTALLED_APPS, since Django’s manage.py module stops when it finds a name match. This actually brings up a better point, which is that you should probably just avoid duplicate names in the first place.
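As a sketch of that ordering, a settings.py fragment might look like the following. The path myproject.apps.messages is hypothetical, since the real project layout isn't given here, and the other entries are just typical Django contrib apps included for context.

```python
# settings.py (fragment). "myproject.apps.messages" is a hypothetical path
# standing in for the custom app; "django.contrib.messages" is Django's own.
INSTALLED_APPS = (
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "myproject.apps.messages",   # custom app listed first, so it matches first
    "django.contrib.messages",   # built-in app with the same final name
)
```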