GitHub Copilot, the text-to-code AI tool, has been largely revolutionary in changing how people code. Twitter has been erupting with praise for the tool, with organisation heads and developers alike hailing it for saving much of their time.
However, the latest discussion surrounding it suggests that things are murky.
Tim Davis, professor of computer science at Texas A&M University, took to Twitter to express his displeasure at Copilot reproducing his copyrighted code for a particular prompt.
Chris Rackauckas, lead developer of SciML, also shared a July 2021 thread by Armin Ronacher, adding, “Github Copilot spits out the Quake source code. It just repeats its training data often, even without OSS licenses”.
But beyond this, the latest news making the rounds concerns Matthew Butterick. In June 2022, Butterick cautioned organisations creating software products against the use of Copilot, as they would be using someone else’s intellectual property, albeit unintentionally.
Butterick cites a passage from GitHub’s website showing how Microsoft plays it safe by pushing the blame onto the end user:
“You are responsible for ensuring the security and quality of your code. We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself. These precautions include rigorous testing, IP [intellectual property] scanning, and tracking for security vulnerabilities.”
In a recent statement, OpenAI claimed that the training material from public repositories is not meant to be included in the output generated by Copilot. Additionally, its analysis has shown that the vast majority of the output (>90%) does not match the training data.
Opinion is divided (a grey area, if you will) over which of the two parties legally stands in the right. GitHub has made it clear that users need to check that the code they use is free of copyright infringement, but at the same time, open-source communities see the claim that “AI training is fair use” as a disregard for their rights over their copyrighted code. See, for example, this statement by Butterick: “By claiming that AI training is fair use, Microsoft is constructing a justification for training on public code anywhere on the internet, not just GitHub.”
Hence, there is little clarity over who is to be held accountable: is it Copilot or the end users employing the AI-generated code in their products?
GitHub’s claim that AI training comes under fair use needs closer inspection. This is not the first time questions of copyright have sprung up around AI applications; it has been a persistent issue throughout the recent surge in generative AI models.
In a 2017 interview with IPW, Ben Sobel described the problem as a “fair use dilemma”. His argument goes like this:
(i) If machine learning does not come under fair use, then organisations have to pay remedies to the millions of people whose work forms the training data on which machines learn. This would hinder any progress in the field.
(ii) But if it does come under fair use, organisations will likely take liberties in using people’s intellectual labour for their own profit.
Therefore, it would not be a stretch to say that the legal aspects of AI occupy difficult terrain. If Butterick has a case to take the makers of Copilot to court, the outcome of the lawsuit will have a huge impact on the future of open-source communities and generative AI models.