Did you know that you can download the text of all US patents since 1976, and that the data set is updated weekly? If not, *knowledge conferred*. The government releases about 4000 patents, or about 400 MB of patent text, every week. It’s a treasure trove of information about patterns of innovation in America, and I’ve been wanting to play around with it for some time. After a few long trips on the L, here’s what I got: The Innovation Cloud. It’s a word cloud of the most frequent words in US patents from 2005-present that appeared zero times in any patent from 1976-1980.
As you can see, software terms (“internet”, “metadata”, “browser”, “website”) dominate the cloud. In second place, there are some hardware terms (“rfid”, “wirelessly”, “nanotube”, “gesture”). In a distant third, there are some biology terms (“transgenic”, “chimeric”).
I was a little surprised I didn’t see more biology terms, considering all the hype. One reason for that may be that biotech today has its foundation in terms used during the 1970s, and that new ideas in biology are much more fragmented / jargon-y than in software and hardware, often relying on specific chemicals that are not referred to much after that patent. Or, maybe there’s just a lot more hype than granted patents.
The analysis to create the word cloud above is pretty simple:
- Download & unzip a week’s data
- Extract patent text, ignoring markup and metadata
- Count all of the words in that week, and form a dictionary mapping each term to its count. Save it.
- Repeat from the first step until all weeks are downloaded and all dictionaries are created. This can be done in parallel. Get a pot of coffee – this takes about 90 minutes with a quad core machine and cloud-level internet speeds.
- Merge those dictionaries into two dictionaries for comparison, one from the years 1976-1980, and the other from the years 2005-present.
- Filter the 2005-present dictionary, keeping only the words that do not appear in the 1976-1980 dictionary.
- Sort the filtered 2005-present dictionary by word frequency. Format & paste the top 150 into Wordle. Voila! You have just quantified the hottest new words appearing in patents.
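
The counting, merging, filtering and sorting steps can be sketched in a few lines of Python. This is a minimal sketch, not the code I actually ran: the function names (`count_words`, `merge_counts`, `new_words`), the letters-only tokenizer, and the toy sentences standing in for extracted patent text are all assumptions made for illustration. Downloading and markup stripping are left out.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, compared case-insensitively.
WORD_RE = re.compile(r"[a-z]+")

def count_words(text):
    """Count the words in one week's worth of extracted patent text."""
    return Counter(WORD_RE.findall(text.lower()))

def merge_counts(weekly_counts):
    """Merge per-week Counters into a single Counter for a whole period."""
    total = Counter()
    for counts in weekly_counts:
        total.update(counts)  # Counter.update adds counts together
    return total

def new_words(recent, baseline, top_n=150):
    """Return the top_n most frequent words in `recent` that never occur in `baseline`."""
    filtered = Counter({w: c for w, c in recent.items() if w not in baseline})
    return filtered.most_common(top_n)

# Toy stand-ins for the extracted text of each week (hypothetical data).
old_weeks = ["A valve assembly with a spring.", "The spring biases the valve."]
new_weeks = [
    "A browser renders the website metadata.",
    "The browser sends metadata to a valve controller.",
]

baseline = merge_counts(count_words(t) for t in old_weeks)
recent = merge_counts(count_words(t) for t in new_weeks)
top = new_words(recent, baseline, top_n=3)

for word, count in top:
    print(word, count)
```

Since each week’s `Counter` is independent of the others, the per-week counting is the part that parallelizes; the merge at the end is cheap by comparison.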
Do you have more ideas for cool stuff to do with this data? If so, let me know. I’m just getting started 🙂