He / They

  • 4 Posts
  • 543 Comments
Joined 2 years ago
Cake day: June 16th, 2023

  • Minchin said the total cost “includes the previously stated $4.1m required to redesign the front end of the websites”.

    “The remaining cost ($92.4m) reflects the significant investment required to fully rebuild and test the systems and technology that underpin the website, making sure it is secure and stable and can draw in the huge amounts of data gathered from our observing network and weather models,” Minchin said.

    So 92 MILLION dollars on SQA (software quality assurance) and maybe some pentesting? Bullshit. Pentests run $50k-$400k for single-domain websites like this, and $400k is on the very expensive end.

    Even if you paid 30 people $200k a year apiece for 4 years to work on this, which is both more people and higher salaries than would have actually been involved, that would still only come to $24m, less than a third of the cited cost (quick sanity check below).
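    A back-of-envelope check of that staffing scenario (the figures are my generous assumptions above, not actual project numbers):

    ```python
    # Generous staffing scenario: 30 engineers at $200k/year for 4 years
    engineers = 30
    salary = 200_000          # per person, per year
    years = 4

    total = engineers * salary * years
    print(f"${total:,}")                                    # $24,000,000
    print(f"{total / 92_400_000:.0%} of the cited $92.4m")  # 26%
    ```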

    There is no possible way for this to have legitimately cost this much. There was corruption of some kind involved.

  • But my question is, are these only “hacked” passwords? For those who were not hacked, you don’t know what passwords they have. So there’s a bit of selection bias here, right?

    No, that’s not how these are obtained. Password dumps come from attackers breaching a site’s user database and dumping its credentials, usually after phishing an administrator’s login. Attackers aren’t brute-forcing passwords anymore except on a one-off, very rare basis. Here’s a list of publicly-known password dumps, with details about where each came from: https://haveibeenpwned.com/PwnedWebsites
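    If you want to check a password against those same public dumps yourself, here’s a minimal sketch using the free Pwned Passwords range API (the k-anonymity endpoint: only the first five characters of the password’s SHA-1 hash ever leave your machine). Assumes Python with the requests package installed:

    ```python
    import hashlib
    import requests

    def times_pwned(password: str) -> int:
        """Return how many times a password appears in known dumps."""
        sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = sha1[:5], sha1[5:]
        # Only the 5-char hash prefix is sent; matching happens locally
        resp = requests.get(f"https://api.pwnedpasswords.com/range/{prefix}")
        resp.raise_for_status()
        for line in resp.text.splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
        return 0

    print(times_pwned("password123"))  # a big number: it's in many dumps
    ```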

  • Might have to break this into a couple replies, because this is a LOT to work through.

    Anthropic is the only company to have admitted publicly to doing this. They were sued and settled out of court. Google and OpenAI have made no such admissions as far as I’m aware.

    Meta is being sued by several groups over this, including porn companies who caught them torrenting. Their defense has been to claim that the 2,400 videos downloaded to their corporate IP space were for “personal use”.

    OpenAI is also being accused of pirating books (not scraping them), and it has been unable to show that it procured them legally.

    There is no such legal distinction [scraping for summary use vs scraping for supplanting the original content]. Scraping content is legal no matter WTF you plan to do with it.

    Interestingly, it’s actually Meta’s most recent partial win that explicitly helps disprove this. Apart from generally ripping into Meta for clearly infringing copyright, the judge wrote (page 3):

    There is certainly no rule that when your use of a protected work is “transformative,” this automatically inoculates you from a claim of copyright infringement. And here, copying the protected works, however transformative, involves the creation of a product with the ability to severely harm the market for the works being copied, and thus severely undermine the incentive for human beings to create. Under the fair use doctrine, harm to the market for the copyrighted work is more important than the purpose for which the copies are made.

    So yes, Fair Use absolutely does take into account market harms.

    What an AI model does isn’t copyright infringement (usually).

    I never asserted this, and I am well aware of the distinction between the copyright infringement involved in illegally obtaining the material and the AI training itself. You seem to be taking a whole host of objections you get from others and applying them to me.

    I think it’s perfectly reasonable to require that AI companies legally acquire a copy of any copyrighted material. Just as it would not be legal for me to torrent a movie even if I wanted to do something transformative with it, AI companies should not be able to do so either.

  • Because the same rules that allow Google to train their search with everyone’s copyrighted websites are what allow the AI companies to train their models.

    This is false by omission. Many of the AI companies have been downloading content through means other than scraping, such as BitTorrent, to access and compile copyrighted data that is not publicly scrapeable. That includes Meta, OpenAI, and Google.

    The day we ban ingress of copyrighted works into whatever TF people want is the day the Internet stops working.

    That is also false. Just because you don’t understand the legal distinction between scraping content to summarize it and direct people to a site (a lawsuit against Google already established this, along with its boundaries) and scraping content to generate a replacement that obviates the original, doesn’t mean the law doesn’t recognize it.

    My comment right here is copyrighted. So is yours! I didn’t ask your permission before my Lemmy client downloaded it. I don’t need to ask your permission to use your comment however TF I want until I distribute it. That’s how the law works. That’s how it’s always worked.

    The DMCA also protects the sites that host Lemmy instances from copyright lawsuits. Because without that, they’d be guilty of distribution of copyrighted works without the owner’s permission every damned day.

    And none of this matters, because AI companies aren’t just reading content, they’re taking it and using it for commercial purposes.

    Perhaps you are unaware, but (at least in the US) while it is legal for you to view a video on YouTube, if you download it for offline use that would constitute copyright infringement if the owner objects. The video being public does not grant anyone and everyone the right to use it however they wish. Ditto for something like making an mp3 of a song on Spotify using Audacity.

    People who hate AI are supporting an argument that the movie and music studios made in the 90s: That “downloading is theft.” It is not! In fact, because that is not theft, we’re all able to enjoy the Internet every day.

    First off, I do not hate AI; I use it myself (locally-run). My issue is with AI companies using it to generate profit at the expense of the actual creators whose art they are trying to replace (i.e. supplanting the work rather than directing people to it, as search results do).

    Secondly, no one is arguing that it is theft, they are arguing that it is copyright infringement, which is what all of us are also subject to under the DMCA. So we’re actually arguing that AI companies should be held to the same standard that we are.

    Also, note that AI companies have argued in court (in the case brought by Stephen King et al.) that their use of copyrighted material shouldn’t fall under the DMCA at all (i.e. that it isn’t a Fair Use question), because their argument is that AI training is not the ‘intended use’ of the source material, so it doesn’t eat into that commercial use. That argument leaves copyright infringement liability intact for the rest of us while solely exempting them. No thanks.

    Luckily, their arguing that they stand apart and separate from Fair Use also means this claim can be rejected without affecting Fair Use! Double win!

  • Maybe I have been lucky, but I’ve never seen a company (including “cloud-native” ones) use serverless (code compute) like this. Lambdas only ever get used for tiny, atomic functions in my experience. I’ve never heard of or seen anyone try to do a 20-minute video conversion in Lambda, which would blow past Lambda’s 15-minute execution cap anyway.

    Any tool misused is a handicap, and I feel like the architecture presented here isn’t making it clear that serverless code compute was ever right for this task.

    For instance, if video processing is being done on a constant or consistent, non-ad-hoc basis, why is serverless being used at all? Who made that decision? If it’s infrequent enough that a whole dedicated instance doesn’t make sense cost-wise, why not have a Fargate node (container) or something that encapsulates the “background worker” portion of the process (see the sketch below)? That would give you an equivalent “simple” flow to the one in the original process diagram, but without the limitations of doing everything in Lambda.
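    As a rough sketch of what that worker could look like (the queue URL and convert_video function are hypothetical placeholders, not anything from the article): a containerized process just polls a job queue and runs the long conversion, with no Lambda timeout in play:

    ```python
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/video-jobs"  # hypothetical

    def convert_video(s3_key: str) -> None:
        ...  # run ffmpeg or similar; can take 20+ minutes in a container

    # Long-running worker loop: the whole point is that this process
    # lives in a container (e.g. on Fargate), not inside a Lambda.
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)  # long polling
        for msg in resp.get("Messages", []):
            convert_video(msg["Body"])
            # Delete only after success so failed jobs get redelivered
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
    ```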

    Lambda is great for things like building SOAR flows, e.g. disabling network access to a compromised instance, taking a snapshot, pulling logs, etc. Infrequent, able to combine cloud infra and host-internal actions, and fast. That’s a perfect use-case for Lambdas.
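    To make that concrete, here’s a hedged sketch of such a SOAR-style Lambda (the quarantine security group ID and event shape are my assumptions, not from any real playbook): quarantine a compromised EC2 instance and snapshot its volumes for forensics:

    ```python
    import boto3

    ec2 = boto3.client("ec2")
    QUARANTINE_SG = "sg-0123quarantine"  # hypothetical deny-all security group

    def handler(event, context):
        instance_id = event["instance_id"]  # assumed event shape
        # Cut network access by swapping all security groups for a deny-all one
        ec2.modify_instance_attribute(InstanceId=instance_id,
                                      Groups=[QUARANTINE_SG])
        # Snapshot every attached EBS volume for forensics
        volumes = ec2.describe_volumes(
            Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
        )["Volumes"]
        for vol in volumes:
            ec2.create_snapshot(VolumeId=vol["VolumeId"],
                                Description=f"forensics snapshot of {instance_id}")
    ```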