In this post I want to share some fun data sources for NLG and NLP. In an earlier post, I mentioned why a suitable corpus is important, especially when you’re doing natural language generation. Since I’m working on natural language generation for video games, I try to find good datasets for that domain. However, finding datasets with text data from games is hard, which means that often I need to scrape and compile my own corpora.
I’m also collecting examples of games that already use NLG, to see which techniques are used by developers and how these affect the quality of the game.
Both games feature NLG techniques, which is an added bonus. Cookie Clicker has a news ticker that shows weird headlines depending on your progress in the game. Epitaph is text-based and procedurally generated, which means it runs completely on NLG. The game generates fictional societies, languages and Civilization-like tech trees.
Popular games such as the RPGs Diablo, Deus Ex and BioShock have crowd-sourced FANDOM wikis, where you can find collections of game text. The Deus Ex wiki contains the text of in-game ebooks and newspapers. The BioShock wiki has transcriptions of all audio diaries that the player can find in the game. Warning: crawling these wikis might lead to spoilers. ;)
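For anyone who wants to compile a corpus from such a wiki, here is a minimal sketch of how the raw text of a single page could be fetched. It assumes the wiki runs MediaWiki and exposes the standard `api.php` endpoint with the `action=parse` module; the base URL and page title in the usage note are placeholders, and the helper names (`build_wikitext_url`, `extract_wikitext`, `fetch_wikitext`) are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_wikitext_url(api_base, page_title):
    """Build a MediaWiki API URL that asks for the wikitext of one page."""
    params = {
        "action": "parse",     # standard MediaWiki parse module
        "page": page_title,
        "prop": "wikitext",    # return the raw wikitext only
        "format": "json",
        "formatversion": "2",  # newer, flatter JSON layout
    }
    return api_base + "?" + urlencode(params)


def extract_wikitext(response):
    """Pull the wikitext string out of a decoded API response."""
    wikitext = response["parse"]["wikitext"]
    # Older response layouts wrap the text in a {"*": "..."} dict.
    if isinstance(wikitext, dict):
        wikitext = wikitext["*"]
    return wikitext


def fetch_wikitext(api_base, page_title):
    """Download one page and return its wikitext (needs network access)."""
    with urlopen(build_wikitext_url(api_base, page_title)) as resp:
        return extract_wikitext(json.load(resp))
```

A call would look like `fetch_wikitext("https://<wiki-name>.fandom.com/api.php", "<page title>")`, with both values filled in for the wiki at hand. The result is still wikitext, so templates and link markup need to be stripped before it is usable as training text.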
Film scripts do not contain video game text, but are still fun for my application domain, especially for generating dialogue.
I’m really impressed by the fan community of the D&D web show Critical Role, who have transcribed all episodes of the show.
There are also transcripts of Star Trek: Voyager episodes available online. I’m secretly planning on turning these into a Vulcan IRC bot.
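As a rough idea of how transcripts could feed a dialogue bot, the sketch below splits a transcript into speaker turns and collects (prompt, reply) pairs for one character. It assumes each spoken line has the form `SPEAKER: utterance` with the speaker name in capitals, which may not match any particular transcript site. The sample text and the speaker names are invented for illustration and are not taken from a real episode.

```python
import re

# Assumed line format: "SPEAKER: utterance", speaker in capitals.
LINE_PATTERN = re.compile(r"^([A-Z][A-Z0-9 .'\-]*):\s*(.+)$")

# Invented sample, not a real transcript.
SAMPLE = """\
[Bridge]
CAPTAIN: Report.
OFFICER: All systems are nominal.
CAPTAIN: Is that your logical conclusion?
OFFICER: It is the only one the data supports.
"""


def parse_transcript(text):
    """Return a list of (speaker, utterance) turns, skipping other lines."""
    turns = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            turns.append((match.group(1).strip(), match.group(2).strip()))
    return turns


def reply_pairs(turns, speaker):
    """Collect (previous utterance, reply) pairs where `speaker` answers someone else."""
    pairs = []
    for previous, current in zip(turns, turns[1:]):
        if current[0] == speaker and previous[0] != speaker:
            pairs.append((previous[1], current[1]))
    return pairs


if __name__ == "__main__":
    for prompt, reply in reply_pairs(parse_transcript(SAMPLE), "OFFICER"):
        print(f"{prompt!r} -> {reply!r}")
```

Pairs like these could serve as a lookup table for a simple retrieval bot, or as training examples for a generative model. Stage directions and scene headings are dropped here because they do not match the assumed line format.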
I hope you liked this list of text resources. If you know other fun datasets that could be of use to NLG/NLP research in the context of video games, let me know!