Introduction to RiTaJS
In this tutorial, I’m going to show you some of the features of RiTaJS, a p5.js-compatible Javascript library by Daniel Howe and a team of contributors. Along the way, we’re going to learn a few more things about objects and data!
RiTaJS is a great library for manipulating and generating text. This tutorial covers just a small portion of what’s possible with RiTaJS; be sure to consult the tutorials on the RiTaJS site, various example sketches and the RiTaJS reference.
If you need help installing RiTaJS, please see this tutorial. It’ll take you through the process. All of the examples in this tutorial assume that you’ve installed the RiTaJS library for the sketch in question. (You need to do this individually for each sketch.)
Word clouds… the simple way
One of the classic text visualization techniques is the “word cloud,” as sometimes made by tools like Wordle. Word clouds are not a very good way to visualize text, but they do look cool and are fairly easy to implement.
Here’s how a word cloud works: you take a source text and count how many times
each word in the text occurs. Then you draw each word to the screen, changing
the font size in accordance with the frequency of the word. RiTaJS makes the
first part of this task easy: it includes a concordance() function that takes a
big string as a parameter and returns a Javascript object that maps each word
to the number of times the word occurs in the text.
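To make the idea concrete, here is a minimal plain-JavaScript sketch of the kind of counting a concordance performs. The name simpleConcordance is made up for illustration, and splitting on whitespace is an assumption; RiTaJS does its own, more careful tokenizing, so treat this as a picture of the resulting data structure rather than a description of how the library works.

```javascript
// Hypothetical helper, not part of RiTaJS: count how often each
// whitespace-separated word appears in a string.
function simpleConcordance(text) {
  const counts = {};
  // split on runs of whitespace and drop any empty strings
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  for (const word of words) {
    if (Object.prototype.hasOwnProperty.call(counts, word)) {
      counts[word] += 1; // seen before: add one
    } else {
      counts[word] = 1; // first occurrence
    }
  }
  return counts;
}

const example = simpleConcordance("the cat and the hat");
console.log(example.the); // 2
console.log(example.cat); // 1
```

Each distinct word becomes a key, and the value stored under that key is the number of times the word appeared.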
First steps: concordance and word counts
Here’s an example that demonstrates the concordance() function of RiTaJS.
You can use whatever text file you please; I’m using this plain-text version
of the first chapter of Genesis from the King James Version of the
Bible. Make sure that you copy this file to your sketch folder!
let lines;
let counts;
let field;
let button;

function preload() {
  lines = loadStrings('genesis.txt');
}

function setup() {
  createCanvas(400, 300);
  // join lines so we have a string, not an array
  // of strings!
  counts = RiTa.concordance(lines.join(" "));
  console.log(counts);
  // create UI
  field = createInput();
  button = createButton("Get word count");
  button.mousePressed(displayCount);
  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}

function draw() {
}

function displayCount() {
  background(50);
  let word = field.value();
  let wordCount = 0;
  if (counts.hasOwnProperty(word)) {
    wordCount = counts[word];
  }
  text(wordCount, width/2, height/2);
}
This sketch displays a text field and a button. When you click on the button,
the sketch shows the number of times that the word you typed occurred in the
text. Pretty nifty! Note that the search is case-sensitive (i.e., God returns
32, while god returns 0).
There are a few new things in this sketch. The first is the call to
RiTa.concordance(). This function takes a single string as a parameter, and
returns an object that has words as keys, and numbers as values, where the
value for each key shows how many times that word occurs in the text. The data
structure looks something like this:
{
...
earth: 20,
evening: 6,
every: 12,
face: 3,
female: 1,
fifth: 1,
fill: 1,
firmament: 9,
first: 1,
fish: 2,
fly: 1,
for: 6,
form: 1,
forth: 5,
fourth: 1
...
}
(The ... above indicates that I’m only showing an excerpt of the object.) You
can get the count for a particular word by evaluating the expression counts[x],
where x is the word you want to look up.
The real “meat” of the sketch happens in the displayCount() function, which
p5.js calls when the user clicks the button. This function checks to see if the
counts object contains the given word, and if so, it displays the number of
times the word occurs. If not, it displays 0 (which is accurate, because if
the word isn’t found in the object, then it occurs zero times in the text).
There’s a method here that we haven’t seen before, which is the
.hasOwnProperty() method. This method returns true if the given value is
present as a key in the object, and false otherwise. In the above code, we
use .hasOwnProperty() to check to see if the word the user input is present
in the concordance object. If so, we get the value for that word. If not, we
fall back to the default value of 0.
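Why not skip the check and read the value directly? One reason is that every ordinary object also inherits a few built-in properties, so a word like “toString” would appear to be present even if it never occurs in the text. The sketch below uses a small hand-written object in place of a real concordance, and lookupCount is a name I’ve made up for illustration.

```javascript
// A hand-written stand-in for the object returned by RiTa.concordance()
const sampleCounts = { earth: 20, evening: 6, every: 12 };

// Hypothetical helper: return the count for a word, or 0 if the
// word is not one of the object's own keys.
function lookupCount(counts, word) {
  if (counts.hasOwnProperty(word)) {
    return counts[word];
  }
  return 0;
}

console.log(lookupCount(sampleCounts, "earth"));    // 20
console.log(lookupCount(sampleCounts, "heaven"));   // 0
console.log(lookupCount(sampleCounts, "toString")); // 0

// By contrast, the "in" operator also sees inherited properties:
console.log("toString" in sampleCounts);            // true
```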
Iterating over the key/value pairs in an object
That sketch is nice, but it isn’t a word cloud. Not yet, at least. In order to
make a word cloud, we want to display not just one or two but all of the
words in the object returned from RiTa.concordance(). We can’t just hard-code
these values, because we don’t know ahead of time what all of the unique words
in a text are going to be. (That’s part of the reason to use the concordance()
function to begin with!) The easiest way to do this is to find some way to
write code that executes on every key/value pair stored inside an object. But
how?
In Javascript, the idiomatic way to perform this task is with a for...in loop,
a variant of the for loop with its own syntax. It looks like this:
for (let key in obj) {
  if (obj.hasOwnProperty(key)) {
    // your code here!
    console.log(key, obj[key]);
  }
}
Let’s test this out by emptying out the p5.js editor window and pasting in this code:
let obj = {'alpha': 1, 'beta': 2, 'gamma': 3};
for (let key in obj) {
  if (obj.hasOwnProperty(key)) {
    // your code here!
    console.log(key, obj[key]);
  }
}
This code will display the following to the console output area:
alpha 1
beta 2
gamma 3
You can use this code as a kind of template. Just make sure that you replace
obj in all three places with the variable name of the object you want to
iterate over.
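If you prefer, newer versions of JavaScript offer another way to write the same loop: Object.entries() returns an array of [key, value] pairs for an object’s own properties, so the hasOwnProperty() check isn’t needed. This is an alternative to the template above, not a replacement for it; the rest of this tutorial sticks with the original form. (The seen array is only there to collect the results for inspection.)

```javascript
let obj = {'alpha': 1, 'beta': 2, 'gamma': 3};
let seen = [];

// Object.entries(obj) is [['alpha', 1], ['beta', 2], ['gamma', 3]];
// the [key, value] syntax unpacks each pair as we go.
for (let [key, value] of Object.entries(obj)) {
  console.log(key, value);
  seen.push(key + "=" + value);
}
```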
A working word cloud
Okay, so what we want the code to do in order to draw the word cloud is something like this:
- Build the word count object with RiTa.concordance().
- Iterate over every key/value pair in the object returned from the concordance() function, drawing the word at a random place on the screen.
- Change the text size according to the frequency of the word (i.e., its value in the object returned from concordance()).
Here’s a sketch that does just that!
let lines;
let counts;

function preload() {
  lines = loadStrings('genesis.txt');
}

function setup() {
  createCanvas(400, 300);
  // join lines so we have a string, not an array
  // of strings!
  counts = RiTa.concordance(lines.join(" "));
  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}

function draw() {
  for (let k in counts) {
    if (counts.hasOwnProperty(k)) {
      fill(random(255));
      textSize(counts[k]);
      text(k, random(width), random(height));
    }
  }
}
The new part here is the loop in draw(). This loop executes the textSize()
and text() functions for each key/value pair in the counts object.
(Inside the loop, k evaluates to the current word; counts[k] evaluates to
the word count of that word.)
Stop words
Most people would look at the word cloud we’ve made and say that it looks a little strange: our mental model of important words in a text doesn’t include words like “and” and “the.” Small, common words like this are often thrown out in text analysis applications, with the assumption that they occur broadly in similar distributions in many different kinds of text, and that they don’t really “mean anything” (but see this essay for an important dissenting viewpoint).
Small words like this are often called “stop words” and RiTaJS has a feature to
automatically exclude them from the concordance. To enable this feature, we
need to pass a parameter to the concordance() function. At this point, we
might consult the documentation for the concordance() function,
which tells us that in order to prevent stop words from being included in the
concordance analysis, we need to pass an object as the second parameter, which
has a key/value pair of ignoreStopWords: true in it. While we’re at it, let’s
include the parameters to ignore case and punctuation as well:
let lines;
let counts;

function preload() {
  lines = loadStrings('genesis.txt');
}

function setup() {
  createCanvas(400, 300);
  // join lines so we have a string, not an array
  // of strings!
  let params = {
    ignoreStopWords: true,
    ignoreCase: true,
    ignorePunctuation: true
  };
  counts = RiTa.concordance(lines.join(" "), params);
  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}

function draw() {
  for (let k in counts) {
    if (counts.hasOwnProperty(k)) {
      fill(random(255));
      textSize(counts[k]);
      text(k, random(width), random(height));
    }
  }
}
This idiom—creating objects and then passing those objects to functions in order to specify parameters—is fairly common in Javascript. You’ll see it appear over and over again in libraries that you use.
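To see the idiom from the other side, here is a rough sketch of how a function might accept such an options object. Everything here is hypothetical: countWords is not a RiTaJS function, the option names are borrowed from the sketch above only for familiarity, and the punctuation handling is deliberately crude. The point is the pattern of merging the caller’s options over a set of defaults.

```javascript
// Hypothetical function (not RiTaJS): counts words, with optional
// behavior switched on through an options object.
function countWords(text, options = {}) {
  // start from the defaults, then overwrite with whatever the caller passed
  const settings = Object.assign(
    { ignoreCase: false, ignorePunctuation: false },
    options
  );
  let cleaned = text;
  if (settings.ignorePunctuation) {
    cleaned = cleaned.replace(/[.,;:!?]/g, ""); // strip a few common marks
  }
  if (settings.ignoreCase) {
    cleaned = cleaned.toLowerCase();
  }
  const counts = {};
  for (const word of cleaned.split(/\s+/)) {
    if (word.length === 0) continue; // skip empty strings
    if (counts.hasOwnProperty(word)) {
      counts[word] += 1;
    } else {
      counts[word] = 1;
    }
  }
  return counts;
}

// no options: "God", "said," and "god." are each counted once
const plain = countWords("God said, god.");
// with options: "god" is counted twice and "said" once
const folded = countWords("God said, god.", {
  ignoreCase: true,
  ignorePunctuation: true
});
console.log(plain, folded);
```

Because every option has a default, callers only need to mention the settings they want to change, and new options can be added later without breaking existing calls.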
Proportional word sizing
So we’ve made some great strides: we’re displaying the word cloud and cutting out stop words and punctuation. But there’s a problem with the code, and it’s this: the words are kind of small. We could make the words bigger, but think about what would happen if we used a text that isn’t as short as Genesis 1. Here’s the same sketch, for example, using the entire text of Pride and Prejudice (this may take a while to load):
let lines;
let counts;

function preload() {
  lines = loadStrings('austen.txt');
}

function setup() {
  createCanvas(400, 300);
  let params = {
    ignoreStopWords: true,
    ignoreCase: true,
    ignorePunctuation: true
  };
  counts = RiTa.concordance(lines.join(" "), params);
  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}

function draw() {
  for (let k in counts) {
    if (counts.hasOwnProperty(k)) {
      fill(random(255));
      textSize(counts[k]);
      text(k, random(width), random(height));
    }
  }
}
Clearly a mess: some of the words happen so frequently that they’re unreadably huge! A better approach would be to use not the absolute count of the word frequency to size the text, but a proportional count. That way we control the size of the most frequent word in the text, and other words are sized proportionally.
To calculate the proportion for each word, we first need to know the total number of word occurrences in the object. For that, we’ll write a function that looks like this:
function totalValues(obj) {
  let total = 0;
  for (let k in obj) {
    if (obj.hasOwnProperty(k)) {
      total += obj[k];
    }
  }
  return total;
}
This function returns the sum of all values in a given object. The count for one word divided by this sum gives us a proportion (a number between 0 and 1) that expresses how much of the total concordance is made up of the given word. We’ll use this to size the words as appropriate.
Here’s the function in use:
let lines;
let counts;
let total;

function preload() {
  lines = loadStrings('austen.txt');
}

function setup() {
  createCanvas(400, 300);
  let params = {
    ignoreStopWords: true,
    ignoreCase: true,
    ignorePunctuation: true
  };
  counts = RiTa.concordance(lines.join(" "), params);
  total = totalValues(counts);
  // set drawing parameters
  background(50);
  textAlign(CENTER, CENTER);
  textSize(24);
  noStroke();
  fill(255);
  noLoop();
}

function draw() {
  for (let k in counts) {
    if (counts.hasOwnProperty(k)) {
      if (counts[k]/total > 0.001) {
        fill(random(255));
        textSize((counts[k]/total) * 10000);
        text(k, random(width), random(height));
      }
    }
  }
}

function totalValues(obj) {
  let total = 0;
  for (let k in obj) {
    if (obj.hasOwnProperty(k)) {
      total += obj[k];
    }
  }
  return total;
}
The line reading textSize((counts[k]/total) * 10000) essentially says that a
word that takes up 1% of the text should be sized at 100 pixels tall. (In this
example, I additionally included an if statement so that only the words that
constitute more than 0.1% of the text are displayed. You can adjust this if
you’d like to get richer or sparser results.)
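The arithmetic is easy to check outside of p5.js. The sketch below uses a made-up toy object (the words and numbers are invented purely for illustration, not taken from any text), and the helper names sumOfValues and proportionalSize are my own. sumOfValues does the same job as totalValues(), written with Object.values() instead of a loop.

```javascript
// Toy data, invented for illustration only
const toyCounts = { red: 4, green: 2, blue: 2 };

// Hypothetical helper: same result as totalValues(), using Object.values()
function sumOfValues(obj) {
  return Object.values(obj).reduce(function (total, n) {
    return total + n;
  }, 0);
}

// Hypothetical helper: a word's share of the total, multiplied by a scale factor
function proportionalSize(count, total, scale) {
  return (count / total) * scale;
}

const toyTotal = sumOfValues(toyCounts);                     // 4 + 2 + 2 = 8
console.log(proportionalSize(toyCounts.red, toyTotal, 100)); // 50
console.log(proportionalSize(toyCounts.blue, toyTotal, 100)); // 25

// With the tutorial's scale of 10000, a word making up 1% of the
// text (say, 1 occurrence out of 100) comes out at about 100.
console.log(proportionalSize(1, 100, 10000));
```

Changing the scale factor changes the size of every word at once while keeping their sizes in proportion to one another.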
Tagging parts of speech
In English, we often analyze words as having a “part of speech,” such as noun,
verb, adverb, etc. A word’s “part of speech” is an indicator of that word’s
syntactic properties in a sentence. RiTaJS comes with a function, .pos(),
which takes a string and returns an array of parts of speech
for each word in the text. (Determining the parts of speech for words in a text
is a process often called “tagging” the text.) Here’s a quick example that
takes some user-input text and displays the corresponding tags:
let field;
let button;

function setup() {
  createCanvas(400, 300);
  field = createInput();
  button = createButton("Tag, you're it!");
  button.mousePressed(tagText);
  background(50);
  textSize(24);
  fill(255);
  noStroke();
}

function draw() {
}

function tagText() {
  background(50);
  // returns an array of tags
  let tags = RiTa.pos(field.value());
  let tagStr = tags.join(" ");
  text(tagStr, 10, 10, width-20, height-20);
}
The part-of-speech tags that RiTaJS uses by default are the Penn Treebank
part-of-speech tags, which can be confusing: JJ means “adjective,” for example.
(You can force it to use a simpler scheme by passing a second parameter of true
to the pos() function.)
Here’s an example that loads Genesis 1 and extracts only the nouns:
let lines;
let counts;
let nouns = [];

function preload() {
  lines = loadStrings('genesis.txt');
}

function setup() {
  createCanvas(400, 400);
  let params = {
    ignoreStopWords: true,
    ignoreCase: true,
    ignorePunctuation: true
  };
  counts = RiTa.concordance(lines.join(" "), params);
  for (let k in counts) {
    if (counts.hasOwnProperty(k)) {
      // tag each word on its own; keep it if the tag is 'nn'
      let tags = RiTa.pos(k);
      if (tags[0] == 'nn') {
        nouns.push(k);
      }
    }
  }
  noLoop();
}

function draw() {
  background(50);
  textSize(24);
  fill(255);
  noStroke();
  text(nouns.join(' '), 10, 10, width-20, height-20);
}
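Note that the sketch above keeps only words tagged nn, which in the Penn Treebank scheme means a singular common noun. Plural nouns (nns) and proper nouns (nnp, nnps) have their own tags, and verbs, adjectives and adverbs are likewise split across several tags. Here is one possible way to group them, assuming lowercase tags like the ones in this tutorial’s sketches; tagFamily is a made-up helper name, not part of RiTaJS.

```javascript
// Hypothetical helper: collapse a lowercase Penn-style tag into a
// broad word class, based on the first two letters of the tag.
function tagFamily(tag) {
  const prefix = tag.slice(0, 2);
  if (prefix === "nn") return "noun";      // nn, nns, nnp, nnps
  if (prefix === "vb") return "verb";      // vb, vbd, vbg, vbn, vbp, vbz
  if (prefix === "jj") return "adjective"; // jj, jjr, jjs
  if (prefix === "rb") return "adverb";    // rb, rbr, rbs
  return "other";
}

console.log(tagFamily("nns")); // noun
console.log(tagFamily("vbd")); // verb
console.log(tagFamily("jj"));  // adjective
console.log(tagFamily("dt"));  // other
```

With a helper like this, the test in the sketch could be written as tagFamily(tags[0]) == 'noun' to catch plural and proper nouns as well.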
EXERCISE: Use pos() to make a modified version of the word cloud sketches above that displays only adjectives, nouns, or verbs from a given source text.
RiTa’s Lexicon
RiTaJS comes with a number of interesting functions for fetching words and finding out information about them. Let’s discuss some of these below!
Getting random words
The RiTa object’s .randomWord() method returns a random word from the
word list, potentially matching certain criteria. Here’s a simple example
that displays a random word whenever you click:
function setup() {
  createCanvas(400, 400);
  background(50);
  fill(255);
  noStroke();
  textSize(24);
  textAlign(CENTER, CENTER);
  text("Click for word", width/2, height/2);
}

function draw() {
}

function mousePressed() {
  background(50);
  text(RiTa.randomWord(), width/2, height/2);
}
The .randomWord() function takes a single parameter, which is an object that
specifies constraints for the word that gets selected. If the object includes a
key called pos, the value of that key will be used to restrict the random
selection by part of speech (using the Penn tags). Here’s an example that
writes a simple Mad Lib-style text:
function setup() {
  createCanvas(400, 400);
  background(50);
  fill(255);
  noStroke();
  textSize(32);
  textAlign(CENTER, CENTER);
  text("Click for fun", width/2, height/2);
}

function draw() {
}

function mousePressed() {
  background(50);
  textAlign(LEFT, TOP);
  let output = "April is the " +
    RiTa.randomWord({pos: "jj"}) + " " +
    RiTa.randomWord({pos: "nn"}) + ", " +
    RiTa.randomWord({pos: "vbg"}) + " " +
    RiTa.randomWord({pos: "nns"}) + " out of the " +
    RiTa.randomWord({pos: "jj"}) + " " +
    RiTa.randomWord({pos: "nn"});
  text(output, 10, 10, width-20, height-20);
}
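As the sentence gets longer, all that string concatenation becomes hard to read. One possible alternative, sketched here in plain JavaScript, is to keep the sentence in a template string with placeholders and fill them in afterwards. The fillTemplate helper and the {tag} placeholder syntax are inventions for this example, and fakePicker is a stand-in so that the code runs without RiTaJS.

```javascript
// Hypothetical helper: replace each {tag} placeholder in a template
// with a word chosen by pickWord, which is called with an object
// like {pos: "jj"}.
function fillTemplate(template, pickWord) {
  return template.replace(/\{(\w+)\}/g, function (match, tag) {
    return pickWord({ pos: tag });
  });
}

const template = "April is the {jj} {nn}, {vbg} {nns} out of the {jj} {nn}";

// Stand-in picker so this runs without RiTaJS: it just echoes the tag.
function fakePicker(options) {
  return "<" + options.pos + ">";
}

console.log(fillTemplate(template, fakePicker));
// April is the <jj> <nn>, <vbg> <nns> out of the <jj> <nn>
```

In a sketch with RiTaJS loaded, you could pass function (options) { return RiTa.randomWord(options); } in place of fakePicker.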
Including a value for the numSyllables key will add another constraint,
such that randomWord() returns only words with a given syllable
count. You can use this to make a passable haiku generator:
function setup() {
  createCanvas(400, 400);
  background(50);
  fill(255);
  noStroke();
  textSize(24);
  textAlign(CENTER, CENTER);
  text("Click for haiku", width/2, height/2);
}

function draw() {
}

function mousePressed() {
  background(50);
  let firstLine = "the " +
    RiTa.randomWord({pos: "jj", numSyllables: 2}) + " " +
    RiTa.randomWord({pos: "nn", numSyllables: 2});
  let secondLine = RiTa.randomWord({pos: "vbg", numSyllables: 2}) +
    " in the " +
    RiTa.randomWord({pos: "jj", numSyllables: 2}) + " " +
    RiTa.randomWord({pos: "nn", numSyllables: 1});
  let thirdLine = "I " +
    RiTa.randomWord({pos: "vbd", numSyllables: 2}) + " " +
    RiTa.randomWord({pos: "rb", numSyllables: 2});
  text(firstLine, width/2, 150);
  text(secondLine, width/2, 200);
  text(thirdLine, width/2, 250);
}
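The syllable counts in that sketch are chosen so that each line adds up to the traditional 5-7-5 pattern, once you include the fixed words (“the,” “in” and “I,” at one syllable each). Here is a quick way to check the arithmetic; the variable and function names are just for this example.

```javascript
// Syllables per word for each line of the haiku sketch above.
// Fixed words ("the", "in", "I") count as one syllable each.
const haikuPattern = [
  [1, 2, 2],       // "the" + adjective + noun
  [2, 1, 1, 2, 1], // gerund + "in" + "the" + adjective + noun
  [1, 2, 2]        // "I" + past-tense verb + adverb
];

// Add up an array of numbers
function sumSyllables(numbers) {
  return numbers.reduce(function (total, n) {
    return total + n;
  }, 0);
}

const syllablesPerLine = haikuPattern.map(sumSyllables);
console.log(syllablesPerLine.join("-")); // 5-7-5
```

If you change the fixed words or the numSyllables values in the sketch, updating a table like this is an easy way to confirm that the lines still scan.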
Getting words by sound
RiTa has another function, .rhymes(), which returns a list of
words that rhyme with a given word, and a method .soundsLike(), which
returns words that sound like a given word. You can use this to generate text
with a particular mellifluence.
The following sketch demonstrates both methods:
let field;
let button1;
let button2;

function setup() {
  createCanvas(400, 300);
  field = createInput();
  button1 = createButton("Rhymes!");
  button1.mousePressed(getRhymes);
  button2 = createButton("Similar!");
  button2.mousePressed(getSimilar);
  background(50);
  textSize(24);
  fill(255);
  noStroke();
}

function draw() {
}

function getRhymes() {
  background(50);
  let rhymes = RiTa.rhymes(field.value());
  let rhymesStr = rhymes.join(" ");
  text(rhymesStr, 10, 10, width-20, height-20);
}

function getSimilar() {
  background(50);
  let similar = RiTa.soundsLike(field.value());
  let similarStr = similar.join(" ");
  text(similarStr, 10, 10, width-20, height-20);
}
Further reading
We’ve only scratched the surface here! RiTaJS has much more to offer.
- Read the official tutorials and the reference.
- Dan Shiffman’s Programming from A to Z
- Notes on Sorting Bot by Darius Kazemi. Darius used RiTaJS to generate the rhymes for popular Twitter curiosity @SortingBot.