Read Part 1 of this post here.
So now that we know where the GitHub users are located, let's find out what they are building. What are the most popular programming languages for projects on GitHub?
You might be wondering, “What about Golang?” This dataset is mostly complete from about 2008 through the 2016/2017 time frame, which is just before a large number of Golang projects were likely to have been created. I hope to track down a complete dump of the dataset so that we can see how things have changed over the last few years.
The rate at which new projects are created is impressive as well, with tremendous growth starting around 2012.
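For readers who want to try the same kind of "new projects per year" rollup outside of CHAOSSEARCH, here is a minimal Python sketch. It is not the query used for this post. The function name `projects_per_year` and the timestamps are made up for illustration, and it assumes each project record carries an ISO 8601 creation timestamp.

```python
from collections import Counter
from datetime import datetime


def projects_per_year(created_at_values):
    """Count projects by the year found in each ISO 8601 creation timestamp."""
    years = (datetime.fromisoformat(ts).year for ts in created_at_values)
    # Sort by year so the result reads like a timeline.
    return dict(sorted(Counter(years).items()))


# Toy timestamps, invented for illustration only (not real GitHub data).
sample = [
    "2011-03-01T12:00:00",
    "2012-06-15T08:30:00",
    "2012-11-02T17:45:00",
    "2013-01-20T09:10:00",
]

print(projects_per_year(sample))
```

On a real dataset the same bucketing would run inside the search or analytics engine as a date histogram rather than in application code.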
Now that we can see the rate of projects as well as the most popular languages used to create these projects, I wanted to find out what type of open source licenses people are using for their projects. Unfortunately, this data doesn’t exist in the GitHub Projects dataset, but the fantastic team over at Tidelift publish a very detailed list of GitHub projects, licenses in use, and other details about the state of open source software at Libraries.io. Ingesting that dataset into CHAOSSEARCH takes just minutes, and I’m now able to find out what the most popular open source software licenses are.
MIT and Apache 2.0 licensing by far outweigh most of the other software licenses for projects, while various BSD and GPL licenses follow far behind. I can’t say I’m surprised to see these results given GitHub’s open model. I would guess that users, not companies, create most software projects and use the MIT License to make it very simple for other people to use, share, and contribute. Apache 2.0 licensing being right behind also makes sense given just how many companies want to ensure their trademarks are being respected and have an open source component to their businesses.
Now that we know what the most popular license is, I was curious to see if I could find the least used open source software license. By adjusting my last query, I was able to reverse the “top 10” into a “bottom 10” query and was able to find just TWO projects that were using the University of Illinois – NCSA Open Source License. I had never heard of this license before, but it appears to be pretty close to the Apache 2.0 License. It is very interesting to see just how many different software licenses are in use across all the GitHub Projects.
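The "top 10" versus "bottom 10" flip is just a change of sort order on the same count-per-license aggregation. Here is a minimal Python sketch of that idea, not the actual CHAOSSEARCH query. The helper `rank_licenses` and the sample license values are hypothetical and only show the shape of the computation.

```python
from collections import Counter


def rank_licenses(licenses, n=10):
    """Return (top_n, bottom_n) lists of (license, count) pairs."""
    counts = Counter(licenses)
    # Most used first; ties are broken alphabetically for stable output.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    # Least used first: the same aggregation with the sort order reversed.
    bottom = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]
    return top, bottom


# Toy license values, invented for illustration only.
sample = ["MIT", "MIT", "MIT", "Apache-2.0", "Apache-2.0", "ISC", "NCSA"]

top, bottom = rank_licenses(sample, n=2)
print("top:", top)
print("bottom:", bottom)
```

With millions of projects the long tail matters: a bottom-N list is only exact if every license is counted, so an engine that approximates term counts may need a larger bucket size to get it right.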
Even though the default license for NPM modules when created with “npm init” is ISC, you can see a considerable number of projects are using MIT as well as Apache 2.0 for their open source license.
Since the Libraries.io dataset is very rich in content for all these open source projects, and since the GHTorrent data was missing the last few years (and thus missing any details about Golang projects), I decided to run a similar query to see how Golang projects license their code.
As we learned above, many of these companies want to enforce their trademarks, hence the move to the Apache 2.0 open source license. In the end, diving into the GitHub users and projects produced some interesting results: some things that I definitely would have guessed, a few things that were a surprise to me, and a few outliers.
All in all, you can see how quickly and easily we can get answers to complicated questions using the CHAOSSEARCH platform. I was able to dive into this dataset and get deep analytics without having to run any databases, and all of this data lives on Amazon S3, which is a low-cost, low-maintenance way of storing everything and asking anything.