The previous Simple Site Search post resulted in a large JSON file containing all the texts of the posts from across the site. This is good raw material for … something.
The tyres a big flapping of wings and the heat from inside its tummy.
Cue n-grams and Markov chains. A fairly common toy challenge with a collection of texts is to ingest them all and auto-generate new text ‘in the style of’ the originals: words and phrases from the texts are re-arranged randomly, but the result still retains something of the essence of the originals. Among the many posts returned by Google, this one did an OK job of explaining the main steps.
Ah yes she said in surprise before dressing the bed with sheets of red card and lay them.
in the style of The Moose and Goose Stories
in the style of The Grey Parrot Stories
in the style of Predicting the Present
in the style of Emus All The Way Down
in the style of Fragments of writing
in the style of Overgeneralisations
in the style of Jekyll Notes
How it works
- the site-wide JSON file is too broad, so set up a collection-specific one, e.g. /_emus_all_the_way_down/search.json
- and a collection-specific page to display the auto-generated text, e.g. /_emus_all_the_way_down/generated.html
- which in turn uses a new layout, /_layouts/auto-generated.html, which sets up a couple of placeholder HTML elements (for the title and body of the auto-generated text)
Too long to make someone there a very strong french accent is that moose.
The code in /assets/js/ngramMarkov.js was written to explore and play with the idea rather than to be particularly efficient. It first builds up counts from the texts:
- for each text
  - do a bit of pre-processing to ‘rescue’ common phrases such as ‘i.e.’
  - split it into sentences on ‘.’
  - for each sentence
    - split it into words (allowing them to contain apostrophes)
    - record (and count)
      - the start and end words explicitly
      - all the individual words
      - the pairs of words
      - the triples of words
      - (you could just keep going for the larger tuples, but they result in ‘locking’ the generated texts into almost exact replicas of the originals)
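As a rough illustration of the counting steps above, here is a minimal sketch. It is not the actual contents of ngramMarkov.js: the names `bump` and `buildCounts`, the shape of the `counts` object, and the way ‘i.e.’ and ‘e.g.’ are rescued (by dropping their dots) are all assumptions made for the example.

```javascript
// Hypothetical sketch; names and data shapes are assumptions, not the real ngramMarkov.js.

// Increment the count stored under `key` in a Map.
function bump(map, key) {
  map.set(key, (map.get(key) || 0) + 1);
}

function buildCounts(texts) {
  const counts = {
    starts: new Map(),  // word -> times it started a sentence
    ends: new Map(),    // word -> times it ended a sentence
    words: new Map(),   // word -> times seen
    pairs: new Map(),   // "w1 w2" -> times seen
    triples: new Map(), // "w1 w2 w3" -> times seen
  };
  for (const text of texts) {
    // 'Rescue' common abbreviations so their dots do not split sentences
    // (assumed approach: simply drop the dots).
    const rescued = text
      .replace(/\bi\.e\./gi, 'ie')
      .replace(/\be\.g\./gi, 'eg');
    for (const sentence of rescued.split('.')) {
      // Words are runs of letters, optionally containing apostrophes.
      const words = sentence.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g);
      if (!words) continue; // empty fragment, e.g. after the final '.'
      bump(counts.starts, words[0]);
      bump(counts.ends, words[words.length - 1]);
      words.forEach((word, i) => {
        bump(counts.words, word);
        if (i + 1 < words.length) {
          bump(counts.pairs, word + ' ' + words[i + 1]);
        }
        if (i + 2 < words.length) {
          bump(counts.triples, word + ' ' + words[i + 1] + ' ' + words[i + 2]);
        }
      });
    }
  }
  return counts;
}
```

Keying the pairs and triples by space-joined strings keeps the sketch short; a nested map keyed by the leading words would make later lookups faster, at the cost of a little more code.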
This data structure is then used to construct random sentences.
Then cut open the still warm chips to insert some of the tower very clearly but the children loved it.
- starting with a word from the list of words known to have started sentences in the original texts
- look for the known pairs of words which have that word as the first of the pair, choosing randomly if there is a choice, biased towards the more common choices. This gives us our second word.
- choose one of the known triples of words which have those first two words. This gives us our third word.
- for subsequent generated words, choose one of the known triples of words which have the last two generated words, and so on.
- every so often, look in the pairs of words rather than the triples, even if there is a triple which matches
- grab one of the individual words if no pair or triple fits
- when nearing the target length of the sentence, keep an eye out for a word known to have ended a sentence in the original texts. If you happen to generate one, end the generated sentence with it. If you reach the maximum sentence length without finding a known ending word, start the sentence again from scratch, retrying up to a few times.
  - (this particular heuristic has resulted in a significant increase in the ‘rightness’ of the generated sentences)
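The generation steps above might be sketched roughly as follows. This is an illustration under assumptions rather than the real ngramMarkov.js: it assumes a `counts` object holding five Maps (`starts`, `ends`, `words`, `pairs`, `triples`) from word or space-joined n-gram to count, and the function names, option names and default values are all made up for the example.

```javascript
// Hypothetical sketch; names, options and defaults are assumptions.

// Pick a value from [value, count] entries, biased towards higher counts.
function weightedPick(entries, rng) {
  const total = entries.reduce((sum, [, n]) => sum + n, 0);
  let r = rng() * total;
  for (const [value, n] of entries) {
    r -= n;
    if (r < 0) return value;
  }
  return entries[entries.length - 1][0]; // guard against rounding
}

// Find n-grams starting with `prefixWords`; return [nextWord, count] entries.
function candidates(map, prefixWords) {
  const prefix = prefixWords.join(' ') + ' ';
  const found = [];
  for (const [key, n] of map) {
    if (key.startsWith(prefix)) found.push([key.slice(prefix.length), n]);
  }
  return found;
}

function generateSentence(counts, options = {}) {
  const {
    targetLength = 8,  // start looking for an ending word from here
    maxLength = 15,    // abandon the attempt at this length
    pairChance = 0.2,  // how often to use pairs even if a triple fits
    retries = 5,       // how many times to start again from scratch
    rng = Math.random, // injectable for repeatable runs
  } = options;

  let words = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    // Start with a word known to have started a sentence.
    words = [weightedPick([...counts.starts], rng)];
    while (words.length < maxLength) {
      let choices = [];
      // Prefer a triple matching the last two words, most of the time.
      if (words.length >= 2 && rng() >= pairChance) {
        choices = candidates(counts.triples, words.slice(-2));
      }
      // Otherwise (or if no triple fits) try a pair on the last word.
      if (choices.length === 0) {
        choices = candidates(counts.pairs, words.slice(-1));
      }
      // If nothing fits, grab one of the individual words.
      if (choices.length === 0) choices = [...counts.words];

      const next = weightedPick(choices, rng);
      words.push(next);
      // Near the target length, stop on a known sentence-ending word.
      if (words.length >= targetLength && counts.ends.has(next)) {
        return words.join(' ');
      }
    }
  }
  return words.join(' '); // no clean ending found; return the last attempt
}
```

Passing `rng` in as an option is purely a convenience for experimenting: a fixed function such as `() => 0.5` makes a run repeatable, while the default `Math.random` gives a different sentence each time.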
They crush grapes with their lack of work on this particular monday morning moose.