I’ve used spicy auto-complete, as well as agents running in my IDE, in my CLI, or on GitHub’s server-side. I’ve been experimenting enough with LLM/AI-driven programming to have an opinion on it. And it kind of sucks.
I just hate that they stole all that licensed code.
It feels so wrong that people are paying to get access to code… that others put out there as open source. You can see the GPL violations sometimes when it outputs code from Doom or other such projects: some function written expressly for that library, only to be used to make Microsoft shareholders richer, and to eventually remove the developer from the development. It’s really sad and makes me not want to code on GitHub, and I’ve been on the platform for 15+ years.
And there’s been an uptick in malware libraries propagating via Claude. One such example: https://www.greenbot.com/ai-malware-hunt-github-accounts/
At least with the open source models, you are helping propagate actual free (as in freedom) LLMs and info.
Stealing is when the owner of a thing doesn’t have it anymore, because it was stolen.
LLMs aren’t “stealing” anything… yet! Soon we’ll have them hooked up to robots, then they’ll be stealing¹ 👍
I think I get what you’re saying. LOL, LLM bots stealing all the things.
You may note I’m not arguing the ethical concerns of LLMs, just the way the data was pulled. It’s why open source models that pull data and let others have full access to said data could be argued to be more ethical. For practical purposes, it means we can just pull them off Hugging Face, use them on our home setups, and reproduce them with the “correct” datasets. As always: garbage in, garbage out. I wish my work would allow me to put all the SQL from a 30(?) year period into a custom LLM just for our proprietary BS. That’s something I would have NO ethical concerns about at all.
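To give a rough idea of what “use them on our home setups” means in practice, here is a minimal sketch with the `transformers` library; the model name is just a placeholder, not a recommendation of any particular model:

```python
# Minimal sketch: pull an open-weights model from Hugging Face and run it
# locally. "some-org/some-open-model" is a placeholder; substitute any
# openly licensed text-generation model you trust.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="some-org/some-open-model",  # placeholder repo id
)

result = generator(
    "-- total sales per customer\nSELECT customer_id,",
    max_new_tokens=64,
)
print(result[0]["generated_text"])
```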
For reference, every AI image model uses ImageNet (as far as I know), which is just a big database of publicly accessible URLs and metadata (classification info like “bird” plus coordinates in the image).
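Just to illustrate the shape of it (a made-up record, not the actual ImageNet file format):

```python
# Illustrative only: the general shape of an ImageNet-style entry, i.e. a
# publicly accessible URL plus classification metadata. Not the real schema.
record = {
    "url": "http://example.com/images/bird_0001.jpg",  # hypothetical image URL
    "label": "bird",                                    # class label
    "bbox": [34, 50, 210, 180],                         # x_min, y_min, x_max, y_max
}
print(record)
```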
The “big AI” companies like Meta, Google, and OpenAI/Microsoft have access to additional image datasets that are 100% proprietary. But what’s interesting is that the image models constructed from just ImageNet (and other open sources) are better! They’re superior in just about every way!
Compare what you get from, say, ChatGPT (DALL-E 3) with a FLUX model you can download from civit.ai… you’ll get such superior results it’s night and day! Not only that, but you have an enormous selection of LoRAs to choose from to get exactly the type of image you want.
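If you haven’t tried it, running a FLUX checkpoint plus a LoRA locally is roughly this simple. A sketch with the `diffusers` library, assuming a local GPU; the LoRA filename is a placeholder for whatever you downloaded:

```python
# Rough sketch: load a FLUX checkpoint and apply a LoRA with diffusers.
# The LoRA filename is a placeholder for a file downloaded from civit.ai.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # or point at a locally downloaded checkpoint
    torch_dtype=torch.bfloat16,
)
pipe.load_lora_weights("my_style_lora.safetensors")  # placeholder LoRA file
pipe.to("cuda")

image = pipe(
    "a bird on a branch at golden hour, detailed feathers",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("bird.png")
```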
What we’re missing is the same sort of open data sets for LLMs. Universities have access to some stuff but even that is licensed.
We pay for access to a high performance magic pattern machine. Not for direct access to code, which we could search ourselves if we wanted.
I disagree.
There’s nothing magical about copying code, throwing it into a database, and creating an LLM based on mass data. Moreover, it’s not ethical given the amount of data they had to pull and the licenses Microsoft had to ignore to make this work. Heck, AI web crawlers hit my little server a while back and DDoSed my tiny site. You can look up their IP addresses: some of them looked at robots.txt, but a VAST majority did not.
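If you want to check your own access logs, a rough sketch like this does it (it assumes a standard combined-format nginx/Apache log; the path is just an example):

```python
# Rough sketch: tally requests per user agent from a combined-format access
# log and flag which agents ever fetched /robots.txt. The log path and format
# are assumptions about a typical nginx/Apache setup; adjust for your server.
import re
from collections import defaultdict

LOG_PATH = "/var/log/nginx/access.log"  # example path
LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

hits = defaultdict(int)
fetched_robots = defaultdict(bool)

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        _ip, _method, path, agent = m.groups()
        hits[agent] += 1
        if path.startswith("/robots.txt"):
            fetched_robots[agent] = True

# Top 20 user agents by request count, with whether they ever asked for robots.txt
for agent, count in sorted(hits.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{count:8d}  robots.txt={'yes' if fetched_robots[agent] else 'NO '}  {agent}")
```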
There is a metric ton of lawsuits hitting the AI companies, and they are not winning in all countries: https://sustainabletechpartner.com/topics/ai/generative-ai-lawsuit-timeline/
I’m simply saying that I’m not paying for access to the code. I’m paying for access to the high performance magic pattern machine.
I can and have browsed code all day for 35 years. Magic pattern machine is worth paying for to save time.
To be clear, stackoverflow and similar sites have also been worth paying for. Now this is the latest thing worth paying for.
I understand you have ethical concerns. But that doesn’t negate the usefulness of magic pattern machine.