Did you know that you can download the text of all US patents since 1976, and that the data set is updated weekly? If not, *knowledge conferred*. The government releases about 4000 patents, or about 400 MB of patent text, every week. It’s a treasure trove of information about patterns of innovation in America, and I’ve been wanting to play around with it for some time. After a few long trips on the L, here’s what I got: The Innovation Cloud. It’s a word cloud of the most frequent words in US patents from 2005-present that appeared zero times in any patent from 1976-1980.
As you can see, software terms (“internet”, “metadata”, “browser”, “website”) dominate the cloud. In second place, there are some hardware terms (“rfid”, “wirelessly”, “nanotube”, “gesture”). In a distant third, there are some biology terms (“transgenic”, “chimeric”).
I was a little surprised I didn’t see more biology terms, considering all the hype. One reason for that may be that biotech today has its foundation in terms used during the 1970s, and that new ideas in biology are much more fragmented / jargon-y than in software and hardware, often relying on specific chemicals that are not referred to much after that patent. Or, maybe there’s just a lot more hype than granted patents.
The analysis to create the word cloud above is pretty simple:
- Download & unzip one week of data
- Extract patent text, ignoring markup and metadata
- Count all of the words in that week, and form a dictionary mapping each term to its count. Save it.
- Repeat from the first step until all weeks are downloaded and all dictionaries are created. This can be done in parallel. Get a pot of coffee – this takes about 90 minutes on a quad-core machine with cloud-level internet speeds.
- Merge those dictionaries into two dictionaries for comparison, one from the years 1976-1980, and the other from the years 2005-present.
- Filter the 2005-present dictionary down to the words that do not appear in the 1976-1980 dictionary.
- Sort the filtered 2005-present dictionary by word frequency. Format & paste the top 150 into Wordle. Voila! You have just quantified the hottest new words appearing in patents.
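The counting, merging, and filtering steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual source code: the function names (`count_words`, `merge_counts`, `new_words`), the simple letters-only tokenizer, and the toy sentences are all assumptions made for the example, and the download and markup-stripping steps are left out.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of lowercase letters. The real pipeline may differ.
WORD_RE = re.compile(r"[a-z]+")

def count_words(text):
    """Count the words in one week's worth of extracted patent text."""
    return Counter(WORD_RE.findall(text.lower()))

def merge_counts(weekly_counts):
    """Sum per-week word counts into a single dictionary."""
    total = Counter()
    for counts in weekly_counts:
        total.update(counts)
    return total

def new_words(recent, baseline, top_n=150):
    """Most frequent words in `recent` that never appear in `baseline`."""
    filtered = Counter({w: n for w, n in recent.items() if w not in baseline})
    return filtered.most_common(top_n)

# Toy stand-ins for the 1976-1980 and 2005-present weekly texts.
early_weeks = [
    "A valve assembly with a spring.",
    "The spring biases the valve.",
]
recent_weeks = [
    "A browser renders the website.",
    "The website stores metadata; the browser reads metadata and the valve.",
]

early = merge_counts(count_words(t) for t in early_weeks)
recent = merge_counts(count_words(t) for t in recent_weeks)
top = new_words(recent, early, top_n=3)
```

With the toy data, `top` holds “browser”, “website”, and “metadata” with a count of 2 each, while words like “valve” and “the” drop out because they already appear in the early dictionary. Since each week is counted independently, the `count_words` calls are the part that can be run in parallel.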
If you are interested in technical details, check out my source code on GitHub here. I got the raw data from Google’s Patent Data page. Thanks to Wordle for making a great word cloud generator.
Do you have more ideas for cool stuff to do with this data? If so, let me know. I’m just getting started 🙂
Bot or Not was a verbal strategy web game I made, inspired by (but not identical to) the Turing test. Players who entered the game were paired with a real, human partner. Both players began chatting in one of two game modes: either the players were chatting with each other, or they were both chatting with a learning chatbot. The first player to correctly guess which game mode they were in won.
Here’s how it’s different from (and easier than) the Turing test: you are simultaneously the judge and the subject of judgement. So, if your partner seems unhuman, you cannot be sure whether it is just bad AI or a person pretending to be a bot. These game dynamics are much more forgiving for the programmer than the actual Turing test, where the human has no incentive to be anything but human. Still, I think you’ll be surprised at how tricky the bot is, given its simplicity.
The bot stores every conversation you have in its database (MongoDB for those interested). If the conversation ends with you or your partner guessing “Not” (i.e., they think it’s a human-like conversation), then all exchanges in that conversation become fair game for the bot to use from then on. To come up with a response, the bot does a full text search of your prompt through all messages in its database (about 175,000 of them at the time of writing), to find the archived message that is most like the one you just typed. After doing some filtering and randomization, it narrows its search down to a single message, and it responds to you with the response to that similar message from its database. The bot has no knowledge of grammar, vocabulary, or semantics – it has the sole, powerful ability to recognize similarities between its stimulus and experience.
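To make the idea concrete, here is a rough Python sketch of that retrieve-and-reply loop. It is not the bot's actual code: it uses a plain in-memory list in place of MongoDB, a simple word-overlap (Jaccard) score as a stand-in for full text search, and a random pick among the top few matches as a stand-in for the filtering and randomization. The names `similarity`, `respond`, and `top_k` are made up for the example.

```python
import random
import re

def tokens(message):
    """Lowercased set of words in a message."""
    return set(re.findall(r"[a-z']+", message.lower()))

def similarity(a, b):
    """Word-overlap (Jaccard) score; an assumed stand-in for text search scoring."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def respond(prompt, archive, top_k=3, rng=random):
    """Reply with the archived response to a message similar to `prompt`.

    `archive` is a list of (message, reply) pairs taken from conversations
    that were judged human-like.
    """
    scored = [(similarity(prompt, msg), reply) for msg, reply in archive]
    scored = [(score, reply) for score, reply in scored if score > 0]
    if not scored:
        return None  # nothing in the archive resembles the prompt
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Pick at random among the best matches so replies are less repetitive.
    return rng.choice(scored[:top_k])[1]

archive = [
    ("hi how are you", "good, you?"),
    ("are you a bot", "no, are you?"),
    ("what is your favorite color", "blue"),
]
reply = respond("how are you today", archive, top_k=1)
```

With `top_k=1` the pick is deterministic, and the prompt above matches “hi how are you” most closely, so `reply` is “good, you?”. Nothing here parses grammar or meaning; the reply is simply whatever a human once said in a similar situation.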
Bot or Not was released on May 23, 2014. I posted the site to Reddit and got a pretty good reaction – it was the #1 web game for two days and on the front page for six days. You can check out the Reddit thread here. I kept it active until April 2, 2015.
Here are the results after ~24k conversations:
Note that the bot convinces players it’s human almost 44% of the time – pretty good!