Today, I introduce Word Galaxy, an interactive map of the 20,000 most common words in English. To explore the map for yourself, click here.
Word Galaxy plots words so that words with similar meanings appear closer together. To obtain the meaning of a word, the algorithm relies on Firth’s law: “You shall know a word by the company it keeps.” It’s based on the latest and greatest research in natural language processing, specifically Google’s neural network model Word2Vec, trained on 200-300 million words of Wikipedia, and Laurens van der Maaten’s t-SNE dimensionality reduction. The data is rendered with a great 2D rendering engine, Pixi.js. For more about the algorithms and technologies behind Word Galaxy, you can view the source code and references on the Word Galaxy GitHub page.
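To make the “company it keeps” idea concrete, here is a minimal sketch in plain Python. It is not the Word Galaxy pipeline: instead of Word2Vec it simply counts which words appear near each other and compares those counts with cosine similarity, and it skips the t-SNE step entirely. The toy corpus, the function names, and the window size of 2 are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus; the real map was trained on Wikipedia text.
corpus = [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the cat ate the fish",
    "the dog ate the bone",
    "stocks fell as markets closed",
    "stocks rose as markets opened",
]

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])

print("cat vs dog:   ", sim_cat_dog)
print("cat vs stocks:", sim_cat_stocks)
```

In this toy corpus, “cat” and “dog” keep the same company (“the”, “chased”, “ate”), so they score as more similar than “cat” and “stocks”, which share no neighbors. Word2Vec learns dense vectors that capture the same kind of signal at a much larger scale, and t-SNE then squeezes those vectors down to two dimensions for plotting.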
Technology aside, the point of Word Galaxy is to let you explore meaning and culture spatially. Please explore away, and feel free to leave your comments and insights below or on the Reddit post. If you are more into the technical side, I also posted it to /r/MachineLearning here.
I noticed a few things about the structure of the data that I will share with you because hey, you’ve read this far! Generic words tend to sit near the center, while domain-specific words tend to sit towards the periphery. Science and technology words (e.g., “logarithm”, “variable”) are in the east, while historical and social words (e.g., “crusades”, “Ptolemy”) are in the west. If science is in the east and the humanities are in the west, what’s in the middle? Along the north, it goes mathematics -> music -> sports -> famous people’s names -> place names -> history. Along the south, it goes mathematics -> physics -> optics -> electrical engineering -> software -> finance -> law -> religion -> history. Kewl. Another fun fact: “spreadsheet” is almost exactly opposite “vegas”.
For example, here is a plot which links calculus to Jesus:
If you find any patterns in the generic core of the map, let me know! I struggled to find large-scale patterns there, but I suspect they exist.