Video game corpora
Last updated: October 2020
In this post I want to share some fun data sources for NLG and NLP. In an earlier post, I mentioned why a suitable corpus is important, especially when you're doing natural language generation. Since I'm working on natural language generation for video games, I try to find good datasets for that domain. However, finding datasets with text data from games is hard, which means that often I need to scrape and compile my own corpora.
I'm also collecting examples of games that use NLG, to see which techniques are used by developers and how these affect the quality of the game.
Both games feature NLG techniques, which is an added bonus. CookieClicker has a news ticker that shows weird headlines depending on your progress in the game. Epitaph is text-based and procedurally generated, which means it runs completely on NLG. The game generates fictional societies, languages and Civilization-like tech trees.
Popular games such as the RPGs Diablo, Deus Ex and Bioshock have crowd-sourced FANDOM wikis, where you can find collections of game text. The Deus Ex wiki contains the text of in-game ebooks and newspapers. The Bioshock wiki has transcriptions of all audio diaries that the player can find in the game. Warning: crawling these wikis might lead to spoilers. ;)
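If you want to pull page text from one of these wikis programmatically, here is a minimal sketch. It assumes the wiki exposes the standard MediaWiki API at an `api.php` endpoint (FANDOM wikis run on MediaWiki, but the exact URL below is a placeholder, not a real wiki). The function names are my own and purely illustrative. Please check the wiki's terms of use and be gentle with request rates before crawling.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(api_url, title):
    """Build a MediaWiki API URL that asks for the raw wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",  # version 2 returns pages as a list
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response):
    """Map page title -> wikitext for every page in a parsed API response."""
    texts = {}
    for page in response.get("query", {}).get("pages", []):
        revisions = page.get("revisions")
        if not revisions:
            continue  # missing or empty pages have no revisions
        texts[page["title"]] = revisions[0]["slots"]["main"]["content"]
    return texts


def fetch_wikitext(api_url, title):
    """Download one page; api_url is an assumption, e.g. https://<wiki>.fandom.com/api.php."""
    with urllib.request.urlopen(build_query_url(api_url, title)) as resp:
        return extract_wikitext(json.load(resp))
```

The result is raw wikitext, so you will still need to strip templates and markup before using it as a corpus.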
The Imperial Library has a database of all books (and other booklike items) from the Elder Scrolls games. Some of my students used it to generate new lore for Skyrim. I imagine this would be a nice technique for Skyrim modders who want to introduce new books into the game.
Thuum.org is a website about the Dragon language from Skyrim. Besides numerous language-learning resources, the Thuum Library includes a text-dump of Skyrim dialogue, which (interestingly enough) includes notes for voice actors about the prosody of some of the dialogue lines.
GitHub user efonte has extracted all texts from the text-heavy RPG Disco Elysium. (If their link is broken, you can find a fork at my own GitHub profile.)
Javier Torres has written a manual for extracting dialogue from Morrowind using the Elder Scrolls toolkit.
Film scripts do not contain video game text, but are still fun for my application domain, especially for generating dialogue.
I'm really impressed by the fan community of D&D webshow Critical Role, who have transcribed every episode of the show.
There are also transcripts of Star Trek: Voyager episodes available online. I'm secretly planning on turning these into a Vulcan IRC bot.
I hope you liked this list of text resources. If you know other fun datasets that could be of use to NLG/NLP research in the context of video games, let me know!
Update October 2020
I've started to mine game files directly, because that's the best way to get high-quality video game text. I've published a paper about it, which can be read here. I've also released three datasets with text from video games, sourced from Star Wars: Knights of the Old Republic, The Elder Scrolls and Torchlight II.
Jakub Myśliwiec, whom I supervised for his bachelor thesis project, has also scraped a dataset of World of Warcraft quests, which can be found here in plaintext format. If you want to use the WoW quests for your own research, you need to do some preprocessing as this plaintext file is meant for fine-tuning a neural generation system.