If you haven’t heard about the recent controversies regarding GitHub Copilot, an artificial intelligence developed by GitHub and OpenAI that assists programmers by autocompleting code, I’d highly recommend checking out the GitHub Copilot investigation site as well as the GitHub Copilot litigation site, as they do a great job summarizing the lawsuit. Both sites are written by Matthew Butterick, a programmer, typographer, and lawyer who is serving as co-counsel in the lawsuit.
On October 17th, 2022, Matthew Butterick published a blog post titled “GitHub Copilot investigation.” In this post, he claims that GitHub Copilot taking open-source code without permission poses a threat that could cause open-source communities to collapse entirely.
In order to combat this threat, Butterick reactivated his California bar membership and teamed up with the class-action lawyers at the Joseph Saveri Law Firm to investigate GitHub Copilot for potentially violating its legal obligations to open-source software contributors and end users.
But how could an autocompleting feature in an IDE create grounds for a lawsuit? After all, modern IDEs or text editors already have features like that, allowing you to just hit ENTER and autocomplete certain phrases or bits of code.
But GitHub Copilot is a little bit different. Not only can it perform small autocompletions in your code, but it will also suggest entire blocks of code, large enough to fill out entire functions. This function could be as simple as returning whether or not a given integer input is odd or even, or as complex as autocompleting a fast inverse square root function.
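To give a sense of the range described above, here is a minimal sketch of both kinds of function, written by hand in Python rather than taken from Copilot’s output. The names `is_even` and `fast_inverse_sqrt` are my own, and the second function is only an approximation of the well-known bit-level trick (using the commonly cited constant 0x5F3759DF), not a copy of any particular project’s code.

```python
import struct


def is_even(n: int) -> bool:
    """The 'simple' end of the range: parity check on an integer."""
    return n % 2 == 0


def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x using the classic bit trick.

    The input is treated as a 32-bit float, its bits are reinterpreted as
    an unsigned integer, and one Newton-Raphson step refines the guess.
    """
    # Reinterpret the float32 bit pattern as a 32-bit unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Initial guess via the commonly cited magic constant.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits back as a float32 value.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration to improve accuracy.
    return y * (1.5 - 0.5 * x * y * y)
```

The first function is a one-liner anyone would write the same way; the second is distinctive enough that an identical suggestion would raise the question of where it came from.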
GitHub Copilot was made generally available in June of 2022 by GitHub, a Microsoft subsidiary, and is powered by Codex, an artificial intelligence system created by OpenAI (ChatGPT anyone?). The datasets that were used to train Codex are drawn from tens of millions of public repositories, including but not limited to public code repositories found on GitHub. The source of Copilot’s training data was confirmed by Copilot researcher Eddie Aftandilian, who mentioned in a podcast (at timestamp 36:40) that Copilot is trained on public repos from GitHub.
So it’s clear that Copilot is trained on billions of lines of code from public repositories on GitHub, but what’s wrong with that? If the repositories are public, then isn’t it fair use?
Here’s where it gets a little convoluted. Most open-source software packages are released under licenses that grant users certain rights and also impose certain obligations, most notably giving proper credit to the authors of the source code. You can usually tell that code is licensed when a copyright and license notice is included with it. Here’s an example, the text of the MIT License:
Copyright (c) [year] [fullname]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
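A notice like the one above can also be spotted mechanically. The following is a rough sketch, assuming notices follow the common “Copyright (c) 2020 Jane Doe” shape and sit near the top of a file; the helper name `find_copyright_notices` and the 40-line window are assumptions of mine, and real license detection tools are far more thorough.

```python
import re
from typing import List

# Assumed pattern: the word "copyright", an optional "(c)" or "©",
# then a four-digit year. Template placeholders like "[year]" won't match.
COPYRIGHT_RE = re.compile(r"copyright\s+(\(c\)|©)?\s*\d{4}", re.IGNORECASE)


def find_copyright_notices(source_text: str, max_lines: int = 40) -> List[str]:
    """Return lines near the top of a file that look like copyright notices."""
    header = source_text.splitlines()[:max_lines]
    return [line.strip() for line in header if COPYRIGHT_RE.search(line)]
```

For example, calling it on a file that begins with `# Copyright (c) 2020 Jane Doe` returns that line, while a file with no notice in its header returns an empty list.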
So those who want to use open-source software are presented with a decision. They can either comply with the obligations that come with a given license, or they can use the code subject to a license exception (fair use). Both Microsoft and OpenAI have admitted that Copilot and Codex are trained on open-source code from public repositories on GitHub, and neither organization has published any attribution for that source code, so they are falling back on an argument for fair use.
Butterick confirms this with a quote from former GitHub CEO Nat Friedman, who claimed that training machine-learning systems on public data is fair use.
The Software Freedom Conservancy (SFC) disagreed with Microsoft’s fair use argument and asked Microsoft for evidence to support its stance. According to SFC director Bradley Kuhn, “They [Microsoft and GitHub representatives] provided none.”
Butterick affirms the SFC’s position, stating that Microsoft was unable to provide legal authority for its position simply because there isn’t any. Though you might expect this type of case to have occurred in the past, there has actually never been a case in the US that addressed the fair use implications of training AI systems.
Microsoft distances itself even further from responsibility when it comes to GitHub Copilot’s autocompletions, describing Copilot’s output as “suggestions” that Microsoft does not claim any rights to. However, Microsoft also makes no guarantees regarding how correct or secure that code is. More importantly, Microsoft does not guarantee that using such code “suggestions” will keep you safe from legal trouble of your own (as Copilot could just spit out someone’s original source code, without attributing its creator, as a “suggestion” for you). In layman’s terms, accepting a Copilot autocompletion means that all of the responsibility is put on you; Microsoft isn’t even involved. Just take a look at this statement that comes with Copilot:
“You are responsible for ensuring the security and quality of your code. We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself. These precautions include rigorous testing, IP [(= intellectual property)] scanning, and tracking for security vulnerabilities.”
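To make the “IP scanning” precaution a little more concrete, here is a toy sketch of what such a check could look like: it measures how much of a suggested snippet overlaps, token for token, with a piece of known licensed code. This is purely illustrative and assumes you already have the licensed code to compare against; the function names and the five-token window are my own choices, and this is not how GitHub or any real scanning product works.

```python
import re
from typing import List, Set, Tuple


def tokens(code: str) -> List[str]:
    """Split code into identifiers/numbers and single punctuation marks,
    ignoring whitespace so that formatting differences don't matter."""
    return re.findall(r"\w+|[^\w\s]", code)


def shingles(code: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of all runs of n consecutive tokens."""
    toks = tokens(code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlap_ratio(suggestion: str, licensed: str, n: int = 5) -> float:
    """Fraction of the suggestion's n-token runs that also appear in the
    licensed code. Returns 0.0 if the suggestion is shorter than n tokens."""
    suggested = shingles(suggestion, n)
    if not suggested:
        return 0.0
    return len(suggested & shingles(licensed, n)) / len(suggested)
```

A ratio close to 1.0 would mean the suggestion is nearly identical to the licensed code apart from whitespace, which is the kind of result that should send you looking for the original license and its attribution requirements.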
Perhaps the most compelling part of Butterick’s investigation is the set of examples with clear-cut evidence that expose Copilot for making “suggestions” of licensed code without any proper attributions.
All of this sounds pretty interesting, and some parts of it sound particularly unethical (mainly when Copilot “suggests” code without attribution), but how does this connect to open-source communities collapsing altogether?
Butterick predicts that the advent of GitHub Copilot will essentially replace the traditional means of finding and interacting with open-source projects. Since Copilot can just suggest code right from public repositories, it will become a much more convenient method of utilizing the code that one might need for their own project. But this leads to a disconnect between the programmer and the open-source project that the code is being “taken” from.
Without Copilot, a programmer might typically search for a specific function or problem that needs to be solved, find some code on an open-source repository that helps answer their problem, and then get engaged with that project, perhaps subscribing to its issue tracker, participating in its discussion boards, etc.
With Copilot, a programmer might start typing code for a function, and then just hit ENTER to fix their problem before it even existed. It’s not difficult to see how GitHub Copilot could be violating the terms of open-source licenses, and how devastating that could be to open-source communities.
On November 3rd of 2022, Butterick published an update on the GitHub Copilot suit at https://githubcopilotlitigation.com/. Butterick and his team of lawyers ended up filing a class-action lawsuit in US federal court in San Francisco, challenging the legality of GitHub Copilot. The defendants in this lawsuit include GitHub; its owner, Microsoft; and OpenAI.
Butterick and his team argue that GitHub Copilot has violated the legal rights of various GitHub contributors under 11 different popular open-source licenses, which all require proper attribution of code authors. You may have heard of some of these types of licenses, including the MIT license and the Apache license.
Additionally, on November 10th, Butterick and his team filed a second class-action lawsuit on behalf of two more plaintiffs. Most of the details of this lawsuit are similar to those of the lawsuit filed on November 3rd. We have yet to receive further updates beyond this point.
Opinions can change
Before I conclude, I wanted to note something interesting I found while digging around in Butterick’s blogs. On this page, which is linked from the initial GitHub Copilot investigation site, the very first line of Butterick’s blog post is this:
Still, I’m not worried about its effects on open source.
It seems like Butterick did quite the 180 on us. To provide complete context, Butterick stated that organizations would have to create software assets to forbid the use of Copilot or any other similar AI-assisted tools to avoid license violations. This is, of course, a very optimistic outlook on what the future of open-source communities would look like in the face of GitHub Copilot. This prediction stands in stark contrast to Butterick’s statements in his investigation post, in which he claims that open-source communities will collapse, as the interactions and incentive to discover open-source projects and contribute to them will all but disappear.
As far as we can tell, this change of opinion occurred within just four months (roughly the amount of time that passed between the two blog posts), but what caused his mind to change so quickly? I’m not entirely sure, and after some digging, I couldn’t find any concrete info. Besides, the information that I’m looking for here isn’t really rooted in discrete timeframes. Whether Butterick truly changed his mind, and when that really happened, is a somewhat vague question. I just wanted to bring this up because it seemed like a complete paradox: Butterick saying that he isn’t worried about open source one moment, and then spelling out the inevitable disaster that will befall it the next. Just some food for thought.
You can help too
The GitHub Copilot lawsuit may seem like a massive undertaking completely out of any one person’s hands (except for maybe Matthew Butterick), but just as in the open-source community, many individuals coming together can make something amazing happen.
At the end of the GitHub Copilot investigation blog, Butterick claims that you can actually help out with the process, stating that: “We’d like to talk to you if…” and then listing several qualifications, such as:
- You have stored open-source code on GitHub
- You have used or do use GitHub Copilot
- You have other information regarding GitHub Copilot you would like to let Butterick and the Joseph Saveri Law Firm know about
On the blog post you can find his email as well as a link to contact the Joseph Saveri Law Firm. I’m sure that there are email filters in place (and at this point in the lawsuit, it might be a little too late for any elementary pieces of evidence or accounts), but I find it fitting that, like the open-source community, individuals can come together to make a huge impact on the world.