Introduction to RiTaJS

by Allison Parrish

In this tutorial, I’m going to show you some of the features of RiTaJS, a p5.js-compatible Javascript library by Daniel Howe and a team of contributors. Along the way, we’re going to learn a few more things about objects and data!

RiTaJS is a great library for manipulating and generating text. This tutorial covers just a small portion of what’s possible with RiTaJS; be sure to consult the tutorials on the RiTaJS site, various example sketches and the RiTaJS reference.

If you need help installing RiTaJS, please see this tutorial. It’ll take you through the process. All of the examples in this tutorial assume that you’ve installed the RiTaJS library for the sketch in question. (You need to do this individually for each sketch.)

Word clouds… the simple way

One of the classic text visualization techniques is the “word cloud,” as sometimes made by tools like Wordle. Word clouds are not a very good way to visualize text, but they do look cool and are fairly easy to implement.

Here’s how a word cloud works: you take a source text and count how many times each word in the text occurs. Then you draw each word to the screen, changing the font size in accordance with the frequency of the word. RiTaJS makes the first part of this task easy: it includes a concordance() function that takes a big string as a parameter, and returns a Javascript object that maps each word to the number of times the word occurs in the text.
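
To make the counting step concrete, here’s a minimal plain-Javascript sketch of the idea. This is not how RiTaJS implements concordance(); the countWords() function is a hypothetical helper of my own, and it does nothing clever about case or punctuation.

```javascript
// Hypothetical helper (NOT RiTaJS's implementation): counts how many
// times each whitespace-separated word occurs in a string.
function countWords(text) {
  var counts = {};
  var words = text.split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    var w = words[i];
    // splitting can produce empty strings at the edges; skip them
    if (w.length === 0) {
      continue;
    }
    // have we already stored a count for this word?
    if (Object.prototype.hasOwnProperty.call(counts, w)) {
      counts[w] += 1;
    } else {
      counts[w] = 1;
    }
  }
  return counts;
}

var example = countWords("the cat and the hat");
console.log(example.the); // 2
console.log(example.cat); // 1
```

The object that comes back has one key per unique word, which is the same shape of data that the rest of this tutorial works with.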

First steps: concordance and word counts

Here’s an example that demonstrates the concordance() function of RiTaJS. You can use whatever text file you please; I’m using this plain-text version of the first chapter of Genesis from the King James Version of the Bible. Make sure that you copy this file to your sketch folder!

var lines;
var counts;
var field;
var button;
function preload() {
  lines = loadStrings('genesis.txt');
}
function setup() {
  createCanvas(400, 300);
  // join lines so we have a string, not an array
  // of strings!
  counts = RiTa.concordance(lines.join(" ")); 
  console.log(counts);

  // create UI
  field = createInput();
  button = createButton("Get word count");
  button.mousePressed(displayCount);

  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}
function draw() {
}
function displayCount() {
  background(50);
  var word = field.value();
  var wordCount = 0;
  if (counts.hasOwnProperty(word)) {
    wordCount = counts[word];
  }
  text(wordCount, width/2, height/2);
}

This sketch displays a text field and a button. When you click on the button, the sketch shows the number of times that the word you typed occurred in the text. Pretty nifty! Note that the search is case-sensitive (i.e., God returns 32, while god returns 0).

There are a few new things in this sketch. The first is the call to RiTa.concordance(). This function takes a single string as a parameter, and returns an object that has words as keys, and numbers as values, where the value for each key shows how many times that word occurs in the text. The data structure looks something like this:

{
    ...
    earth: 20,
    evening: 6,
    every: 12,
    face: 3,
    female: 1,
    fifth: 1,
    fill: 1,
    firmament: 9,
    first: 1,
    fish: 2,
    fly: 1,
    for: 6,
    form: 1,
    forth: 5,
    fourth: 1
    ...
}

(The ... above indicate that I’m only showing an excerpt of the object.) You can get the count for a particular word by evaluating the expression counts[x] where x is the word you want to look up.

The real “meat” of the sketch happens in the displayCount() function, which p5.js calls when the user clicks the button. This function checks to see if the counts object contains the given word, and if so, it displays the number of times the word occurs. If not, it displays 0 (which is accurate: if the word isn’t found in the object, then it occurs zero times in the text).

There’s a method here that we haven’t seen before, which is the .hasOwnProperty() method. This method returns true if the given value is present as a key in the object, and false otherwise. In the above code, we use .hasOwnProperty() to check to see if the word the user input is present in the concordance object. If so, we get the value for that word. If not, we fall back to the default value of 0.
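
Why bother with .hasOwnProperty() instead of just looking the word up directly? Here’s a small sketch that shows the difference. The lookupCount() function is a hypothetical helper (it isn’t part of RiTaJS or p5.js), and the sample object is just a tiny hand-made excerpt.

```javascript
var sample = {earth: 20, evening: 6};

console.log(sample.hasOwnProperty("earth"));    // true
console.log(sample.hasOwnProperty("toString")); // false
// ...even though a plain lookup finds a function that every
// object inherits:
console.log(typeof sample["toString"]);         // function

// Hypothetical helper: get the count for a word, or 0 if the
// word isn't a key stored in the object itself
function lookupCount(obj, word) {
  if (obj.hasOwnProperty(word)) {
    return obj[word];
  }
  return 0;
}

console.log(lookupCount(sample, "earth"));    // 20
console.log(lookupCount(sample, "toString")); // 0
```

Without the check, a user who typed toString into the text field would get something other than a number back.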

Iterating over the key/value pairs in an object

That sketch is nice, but it isn’t a word cloud. Not yet, at least. In order to make a word cloud, we want to display not just one or two but all of the words in the object returned from RiTa.concordance(). We can’t just hard-code these values, because we don’t know ahead of time what all of the unique words in a text are going to be. (That’s part of the reason to use the concordance() function to begin with!) The easiest way to do this is to find some way to write code that executes on every key/value pair stored inside an object. But how?

In Javascript, the idiomatic way to perform this task is with a for loop that uses a special syntax. It looks like this:

for (var key in obj) {
  if (obj.hasOwnProperty(key)) {
    // your code here!
    console.log(key, obj[key]);
  }
}

Let’s test this out by emptying out the p5.js editor window and pasting in this code:

var obj = {'alpha': 1, 'beta': 2, 'gamma': 3};
for (var key in obj) {
  if (obj.hasOwnProperty(key)) {
    // your code here!
    console.log(key, obj[key]);
  }
}

This code will display the following to the console output area:

alpha 1
beta 2
gamma 3

You can use this code as a kind of template. Just make sure that you replace obj in all three places with the variable name of the object you want to iterate over.
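
As one example of filling in the template, here’s a sketch that finds the key with the largest value. The mostFrequent() function and the excerpt object are my own illustrations, not something that RiTaJS provides.

```javascript
// Hypothetical helper built from the for...in template: returns the
// key whose value is largest, or null if the object has no keys.
function mostFrequent(obj) {
  var bestKey = null;
  for (var key in obj) {
    if (obj.hasOwnProperty(key)) {
      if (bestKey === null || obj[key] > obj[bestKey]) {
        bestKey = key;
      }
    }
  }
  return bestKey;
}

var excerpt = {earth: 20, evening: 6, firmament: 9};
console.log(mostFrequent(excerpt)); // earth
```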

A working word cloud

Okay, so what we want the code to do in order to draw the word cloud is something like this:

  1. Build the word count object with RiTa.concordance().
  2. Iterate over every key/value pair in the object returned from the concordance() function, drawing the word at a random place on the screen.
  3. Change the text size according to the frequency of the word (i.e., its value in the object returned from concordance()).

Here’s a sketch that does just that!

var lines;
var counts;
function preload() {
  lines = loadStrings('genesis.txt');
}
function setup() {
  createCanvas(400, 300);
  // join lines so we have a string, not an array
  // of strings!
  counts = RiTa.concordance(lines.join(" ")); 

  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}
function draw() {
  for (var k in counts) {
    if (counts.hasOwnProperty(k)) {
      fill(random(255));
      textSize(counts[k]);
      text(k, random(width), random(height));
    }
  }
}

The new part here is the loop in draw(). This loop executes the textSize() and text() functions for each key/value pair in the counts object. (Inside the loop, k evaluates to the current word; counts[k] evaluates to the word count of that word.)

Stop words

Most people would look at the word cloud we’ve made and say that it looks a little strange: our mental model of important words in a text doesn’t include words like “and” and “the.” Small, common words like this are often thrown out in text analysis applications, with the assumption that they occur broadly in similar distributions in many different kinds of text, and that they don’t really “mean anything” (but see this essay for an important dissenting viewpoint).

Small words like this are often called “stop words” and RiTaJS has a feature to automatically exclude them from the concordance. To enable this feature, we need to pass a parameter to the concordance() function. At this point, we might consult the documentation for the concordance() function, which tells us that in order to prevent stop words from being included in the concordance analysis, we need to pass an object as the second parameter, which has a key/value pair of ignoreStopWords: true in it. While we’re at it, let’s include the parameters to ignore case and punctuation as well:

var lines;
var counts;
function preload() {
  lines = loadStrings('genesis.txt');
}
function setup() {
  createCanvas(400, 300);
  var params = {
    ignoreStopWords: true,
    ignoreCase: true,
    ignorePunctuation: true
  };
  // join lines so we have a string, not an array
  // of strings!
  counts = RiTa.concordance(lines.join(" "), params);

  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}
function draw() {
  for (var k in counts) {
    if (counts.hasOwnProperty(k)) {
      fill(random(255));
      textSize(counts[k]);
      text(k, random(width), random(height));
    }
  }
}

This idiom—creating objects and then passing those objects to functions in order to specify parameters—is fairly common in Javascript. You’ll see it appear over and over again in libraries that you use.
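
If you’re curious what this looks like from the other side, here’s a minimal sketch of a function that accepts an options object. The normalizeWord() function is hypothetical: it borrows two option names from the sketch above for familiarity, but it isn’t part of RiTaJS and doesn’t claim to match what concordance() does internally.

```javascript
// Hypothetical function (not part of RiTaJS) that reads its settings
// from an optional options object.
function normalizeWord(word, options) {
  // if the caller passed no options, behave as if they passed {}
  options = options || {};
  var result = word;
  if (options.ignoreCase) {
    result = result.toLowerCase();
  }
  if (options.ignorePunctuation) {
    // strip a handful of common punctuation marks
    result = result.replace(/[.,;:!?]/g, "");
  }
  return result;
}

console.log(normalizeWord("Earth,"));                     // Earth,
console.log(normalizeWord("Earth,", {ignoreCase: true})); // earth,
var params = {ignoreCase: true, ignorePunctuation: true};
console.log(normalizeWord("Earth,", params));             // earth
```

The nice thing about this design is that callers only name the settings they care about, in any order, and everything else keeps its default.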

Proportional word sizing

So we’ve made some great strides: we’re displaying the word cloud and cutting out stop words and punctuation. But there’s a problem with the code: the text size is tied directly to the raw word count. For a short text like Genesis 1, that makes the words kind of small. We could multiply the size to make the words bigger, but think about what would happen with a much longer text. Here’s the same sketch, for example, using the entire text of Pride and Prejudice (this may take a while to load):

var lines;
var counts;
function preload() {
  lines = loadStrings('austen.txt');
}
function setup() {
  createCanvas(400, 300);
  var params = {
    ignoreStopWords: true,
    ignoreCase: true,
    ignorePunctuation: true
  };
  counts = RiTa.concordance(lines.join(" "),
    params); 

  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}
function draw() {
  for (var k in counts) {
    if (counts.hasOwnProperty(k)) {
      fill(random(255));
      textSize(counts[k]);
      text(k, random(width), random(height));
    }
  }
}

Clearly a mess: some of the words happen so frequently that they’re unreadably huge! A better approach would be to use not the absolute count of the word frequency to size the text, but a proportional count. That way we control the size of the most frequent word in the text, and other words are sized proportionally.

To calculate the proportion for each word, we first need to know the total number of word occurrences in the object. For that, we’ll write a function that looks like this:

function totalValues(obj) {
  var total = 0;
  for (var k in obj) {
    if (obj.hasOwnProperty(k)) {
      total += obj[k];
    }
  }
  return total;
}

This function returns the sum of all values in a given object. The count for one word divided by this sum gives us a proportion (a number between 0 and 1) that expresses how much of the total concordance is made up of the given word. We’ll use this to size the words as appropriate.

Here’s the function in use:

var lines;
var counts;
var total;
function preload() {
  lines = loadStrings('austen.txt');
}
function setup() {
  createCanvas(400, 300);
  var params = {
    ignoreStopWords: true,
    ignoreCase: true,
    ignorePunctuation: true
  };
  counts = RiTa.concordance(lines.join(" "),
    params); 
  total = totalValues(counts);

  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}
function draw() {
  for (var k in counts) {
    if (counts.hasOwnProperty(k)) {
      if (counts[k]/total > 0.001) {
        fill(random(255));
        textSize((counts[k]/total) * 10000);
        text(k, random(width), random(height));
      }
    }
  }
}
function totalValues(obj) {
  var total = 0;
  for (var k in obj) {
    if (obj.hasOwnProperty(k)) {
      total += obj[k];
    }
  }
  return total;
}

The line reading textSize((counts[k]/total) * 10000) essentially says that a word that takes up 1% of the text should be sized at 100 pixels tall. (In this example, I additionally included an if statement so that only the words that constitute more than 0.1% of the text are displayed. You can adjust this if you’d like to get richer or sparser results.)
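
To check the arithmetic, here’s the same sizing rule pulled out into a small function. The sizeForWord() name and its minShare and scale parameters are my own; they simply stand in for the 0.001 and 10000 hard-coded in the sketch.

```javascript
// Hypothetical helper: returns the text size for a word, or 0 if the
// word makes up too small a share of the text to be drawn.
function sizeForWord(count, total, minShare, scale) {
  var share = count / total;
  if (share > minShare) {
    return share * scale;
  }
  return 0;
}

// 50 occurrences out of 5000 is 1% of the text
console.log(sizeForWord(50, 5000, 0.001, 10000)); // 100
// 1 occurrence out of 5000 is 0.02%, below the 0.1% cutoff
console.log(sizeForWord(1, 5000, 0.001, 10000));  // 0
```

Pulling the two numbers out into parameters makes it easier to experiment: raise minShare for a sparser cloud, or lower scale if the biggest words still crowd the canvas.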

Tagging parts of speech

In English, we often analyze words as having a “part of speech,” such as noun, verb, adverb, etc. A word’s “part of speech” is an indicator of that word’s syntactic properties in a sentence. RiTaJS comes with a function .getPosTags() which takes a string and returns an array of parts of speech for each word in the text. (Determining the parts of speech for words in a text is a process often called “tagging” the text.) Here’s a quick example that takes some user-input text and displays the corresponding tags:

var field;
var button;
function setup() {
  createCanvas(400, 300);
  field = createInput();
  button = createButton("Tag, you're it!");
  button.mousePressed(tagText);
  background(50);
  textSize(24);
  fill(255);
  noStroke();
}
function draw() {
}
function tagText() {
  background(50);
  // getPosTags returns an array of tags
  var tags = RiTa.getPosTags(field.value());
  var tagStr = tags.join(" ");
  text(tagStr, 10, 10, width-20, height-20);
}

The part-of-speech tags that RiTaJS uses by default are the Penn Treebank part-of-speech tags, which can be confusing: JJ means “adjective,” for example. (You can force it to use a simpler scheme by passing a second parameter of true to the getPosTags() function.)
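
Here’s a small lookup table for the handful of Penn tags that show up in this tutorial, written in lowercase to match the sketches. It’s only a partial list, and the describeTag() helper is hypothetical (RiTaJS doesn’t include it).

```javascript
// A partial table of Penn Treebank tags and what they stand for
var tagNames = {
  nn: "noun, singular",
  nns: "noun, plural",
  jj: "adjective",
  rb: "adverb",
  vb: "verb, base form",
  vbd: "verb, past tense",
  vbg: "verb, gerund or present participle"
};

// Hypothetical helper: turn a tag into a readable description
function describeTag(tag) {
  if (tagNames.hasOwnProperty(tag)) {
    return tagNames[tag];
  }
  return "unknown tag: " + tag;
}

console.log(describeTag("jj"));  // adjective
console.log(describeTag("vbg")); // verb, gerund or present participle
console.log(describeTag("zz"));  // unknown tag: zz
```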

Here’s an example that loads Genesis 1 and extracts only the nouns:

var lines;
var counts;
var nouns = [];
function preload() {
  lines = loadStrings('genesis.txt');
}
function setup() {
  createCanvas(400, 400);
  var params = {
    ignoreStopWords: true,
    ignoreCase: true,
    ignorePunctuation: true
  };
  counts = RiTa.concordance(lines.join(" "),
    params); 
  for (var k in counts) {
    if (counts.hasOwnProperty(k)) {
      var tags = RiTa.getPosTags(k);
      if (tags[0] == 'nn') {
        nouns.push(k);
      }
    }
  }
  noLoop();
}
function draw() {
  background(50);
  textSize(24);
  fill(255);
  noStroke();
  text(nouns.join(' '), 10, 10, width-20,
    height-20);
}

EXERCISE: Use getPosTags() to make a modified version of the word cloud sketches above that displays only adjectives, nouns, or verbs from a given source text.
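
A hint for the exercise: checking for exactly 'nn' matches only singular nouns. In the Penn scheme, related tags share a prefix (noun tags begin with nn, verb tags with vb, adjective tags with jj), so a prefix test catches the whole family. The tagStartsWith() function below is a hypothetical helper, not part of RiTaJS.

```javascript
// Hypothetical helper: true if a tag begins with the given prefix
function tagStartsWith(tag, prefix) {
  return tag.indexOf(prefix) === 0;
}

console.log(tagStartsWith("nn", "nn"));  // true
console.log(tagStartsWith("nns", "nn")); // true
console.log(tagStartsWith("vbd", "vb")); // true
console.log(tagStartsWith("jj", "nn"));  // false
```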

RiLexicon

RiTaJS comes with a special kind of object, RiLexicon, which contains various interesting functions for fetching words and finding out information about them.

To create a RiLexicon object, include the following in your code:

var lexicon = new RiLexicon();

You can then call the methods discussed below on the object you created.

Getting random words

The RiLexicon object’s .randomWord() method returns a random word from the word list, potentially matching certain criteria. Here’s a simple example that displays a random word whenever you click:

var lexicon;
function setup() {
  createCanvas(400, 400);
  lexicon = new RiLexicon();
  background(50);
  fill(255);
  noStroke();
  textSize(24);
  textAlign(CENTER, CENTER);
  text("Click for word", width/2, height/2);
}
function draw() {
}
function mousePressed() {
  background(50);
  text(lexicon.randomWord(), width/2, height/2);  
}

If you include a parameter, the method returns words matching only a particular part of speech (using the Penn tags). Here’s an example that writes a simple Mad Lib-style text:

var lexicon;
function setup() {
  createCanvas(400, 400);
  lexicon = new RiLexicon();
  background(50);
  fill(255);
  noStroke();
  textSize(32);
  textAlign(CENTER, CENTER);
  text("Click for fun", width/2, height/2);
}
function draw() {
}
function mousePressed() {
  background(50);
  textAlign(LEFT, TOP);
  var output = "April is the " +
    lexicon.randomWord("jj") + " " + 
    lexicon.randomWord("nn") + ", " +
    lexicon.randomWord("vbg") + " " +
    lexicon.randomWord("nns") + 
    " out of the " +
    lexicon.randomWord("jj") + " " +
    lexicon.randomWord("nn");
  text(output, 10, 10, width-20, height-20);
}
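
Gluing strings together with + gets unwieldy as the text grows. One alternative, sketched here under my own assumptions, is to write the text as a template with {tag} placeholders and fill them in with a function. Both fillTemplate() and the stand-in word list are hypothetical; the stand-in is there only so that the example runs without RiTaJS.

```javascript
// Hypothetical helper: replaces each {tag} placeholder in a template
// with whatever pickWord(tag) returns.
function fillTemplate(template, pickWord) {
  return template.replace(/\{(\w+)\}/g, function(match, tag) {
    return pickWord(tag);
  });
}

// Stand-in picker with one fixed word per tag. In a real sketch you
// might instead pass a function that calls lexicon.randomWord(tag).
var standIn = {jj: "cruel", nn: "month", vbg: "breeding", nns: "lilacs"};
function pickStandIn(tag) {
  return standIn[tag];
}

console.log(fillTemplate("April is the {jj} {nn}, {vbg} {nns}",
  pickStandIn));
// April is the cruel month, breeding lilacs
```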

With a second parameter, randomWord() returns words with a given syllable count. You can use this to make a passable Haiku generator:

var lexicon;
function setup() {
  createCanvas(400, 400);
  lexicon = new RiLexicon();
  background(50);
  fill(255);
  noStroke();
  textSize(24);
  textAlign(CENTER, CENTER);
  text("Click for haiku", width/2, height/2);
}
function draw() {
}
function mousePressed() {
  background(50);
  var firstLine  = "the " + 
    lexicon.randomWord("jj", 2) + " " +
    lexicon.randomWord("nn", 2);
  var secondLine = lexicon.randomWord("vbg", 2) +
    " in the " +
    lexicon.randomWord("jj", 2) + " " +
    lexicon.randomWord("nn", 1);
  var thirdLine = "I " +
    lexicon.randomWord("vbd", 2) + " " + 
    lexicon.randomWord("rb", 2);
  text(firstLine, width/2, 150);
  text(secondLine, width/2, 200);
  text(thirdLine, width/2, 250);
}
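
To see why this template comes out as a 5-7-5 haiku, you can add up the syllables of each part: the fixed words (“the,” “in,” “I”) count for one syllable apiece, and each random word contributes the count requested from randomWord(). Here’s that arithmetic as a short sketch; the haikuPlan array and sumLine() helper are my own names.

```javascript
// Syllable counts of the parts of each line, in order
var haikuPlan = [
  [1, 2, 2],       // "the" + jj(2) + nn(2)
  [2, 1, 1, 2, 1], // vbg(2) + "in" + "the" + jj(2) + nn(1)
  [1, 2, 2]        // "I" + vbd(2) + rb(2)
];

// Hypothetical helper: adds up the numbers in an array
function sumLine(parts) {
  var total = 0;
  for (var i = 0; i < parts.length; i++) {
    total += parts[i];
  }
  return total;
}

var lineTotals = [];
for (var i = 0; i < haikuPlan.length; i++) {
  lineTotals.push(sumLine(haikuPlan[i]));
}
console.log(lineTotals.join("-")); // 5-7-5
```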

Getting words by sound

The RiLexicon object has another method, .rhymes(), which returns a list of words that rhyme with a given word, and a method .similarBySound(), which returns words that sound like a given word. You can use this to generate text with a particular mellifluence.

The following sketch demonstrates both methods:

var field;
var button1;
var button2;
var lexicon;
function setup() {
  createCanvas(400, 300);
  lexicon = new RiLexicon();
  field = createInput();
  button1 = createButton("Rhymes!");
  button1.mousePressed(getRhymes);
  button2 = createButton("Similar!");
  button2.mousePressed(getSimilar);
  background(50);
  textSize(24);
  fill(255);
  noStroke();
}
function draw() {
}
function getRhymes() {
  background(50);
  // rhymes returns an array of words
  var rhymes = lexicon.rhymes(field.value());
  var rhymesStr = rhymes.join(" ");
  text(rhymesStr, 10, 10, width-20, height-20);
}
function getSimilar() {
  background(50);
  // similarBySound returns an array of words
  var similar = lexicon.similarBySound(field.value());
  var similarStr = similar.join(" ");
  text(similarStr, 10, 10, width-20, height-20);
}

Further reading

We’ve only scratched the surface here! RiTaJS has much more to offer.