Some Quick and Dirty Thoughts on Sabotaging AI Scrapers

BlueMonday1984@awful.systems · 5 months ago

imadabouzu@awful.systems · 5 months ago

It absolutely is effective – but there’s economics at play. You can’t 100% close the hole on anything. Scrapers can themselves employ expensive techniques to try to sort or clean content pre-training.

But altering the economics is meaningful, even if it won’t give you strong guarantees. Big, maximalist systems fall from a million paper cuts. They live or die on the economics of the smaller parts.

V0ldek@awful.systems · 5 months ago

How about honeypotting? What’s the chance the crawlers are written smart enough to avoid a neverending HTTP stream?

So this is an idea from SSH: you make a server that listens at port 22 and responds to any connections with a valid, but extremely long message slowly fed to the source byte by byte. Automated bots that look for open SSH ports or vulns get trapped there, and they have to keep consuming resources to service the connection.
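A minimal sketch of that SSH tarpit in Python, assuming the trick rests on the SSH spec (RFC 4253) allowing a server to send arbitrary lines before its version string, provided none of them begin with "SSH-". The names `junk_banner_line`, `tarpit` and `main`, the port 2222 and the 10-second delay are all made up for illustration.

```python
import asyncio
import random


def junk_banner_line(rng: random.Random) -> bytes:
    """Return one random pre-banner line that never starts with 'SSH-'."""
    # Hex digits only, so the line cannot begin with "SSH-".
    body = "%x" % rng.getrandbits(32)
    return body.encode("ascii") + b"\r\n"


async def tarpit(reader, writer, delay: float = 10.0):
    """Feed the client an endless banner, one byte every `delay` seconds."""
    rng = random.Random()
    try:
        while True:
            for byte in junk_banner_line(rng):
                writer.write(bytes([byte]))
                await writer.drain()
                await asyncio.sleep(delay)
    except ConnectionError:
        # The client gave up; nothing to clean up beyond closing the socket.
        pass
    finally:
        writer.close()


async def main(host: str = "0.0.0.0", port: int = 2222):
    # Hypothetical port; a real deployment would sit on 22 with sshd moved elsewhere.
    server = await asyncio.start_server(tarpit, host, port)
    async with server:
        await server.serve_forever()


# To run it: asyncio.run(main())
```

Each trapped connection costs the server one idle coroutine and a few bytes per minute, while the client holds a socket open waiting for a version string that never arrives.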

Also what happens if you try to feed it an infinite HTML file very quickly? Like just spam the stream with <div><div><div>...?
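The fast version is even less code. A rough sketch, assuming a plain WSGI setup: `endless_divs` and `app` are hypothetical names, and how a given crawler reacts to a response body that never ends is exactly the open question.

```python
from itertools import islice
from typing import Iterator


def endless_divs(chunk_tags: int = 1024) -> Iterator[bytes]:
    """Yield an HTML prologue, then chunks of unclosed <div> tags forever."""
    chunk = b"<div>" * chunk_tags
    yield b"<!DOCTYPE html><html><body>"
    while True:
        yield chunk


def app(environ, start_response):
    """WSGI app whose response body never ends (no Content-Length is sent)."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return endless_divs()


# Peek at the first few chunks without looping forever.
preview = list(islice(endless_divs(chunk_tags=2), 3))
```

Unlike the slow tarpit, this burns your own bandwidth as fast as the client will accept it, so it only pays off if the crawler chokes on the nesting depth or the size before you do.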

BlueMonday1984@awful.systems · 5 months ago

How about honeypotting? What’s the chance the crawlers are written smart enough to avoid a neverending HTTP stream?

Given the security record I mentioned earlier, their generally indiscriminate scraping and that one time John Levine tripped up OpenAI’s crawler, I suspect it’s pretty high.

David Gerard@awful.systems · 5 months ago

feed them LLM output, obviously

mountainriver@awful.systems · 5 months ago

LLMs just train on which words follow which, right?

So if the version of the text changes every other word, it should mess with them. And if you change every other word to “communism” it should learn that the word “communism” follows logically after most words.
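A toy sketch of the idea in Python. The bigram counter here is only a stand-in for "which words follow which" and is nowhere near a real training pipeline; `poison` and `follower_counts` are made-up names.

```python
from collections import Counter


def poison(text: str, word: str = "communism") -> str:
    """Replace every second whitespace-separated token with `word`."""
    tokens = text.split()
    # Slice assignment over positions 1, 3, 5, ...
    tokens[1::2] = [word] * len(tokens[1::2])
    return " ".join(tokens)


def follower_counts(text: str) -> Counter:
    """Count how often each (word, next_word) pair occurs."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))


poisoned = poison("the cat sat on the mat")
counts = follower_counts(poisoned)
```

In the poisoned text every surviving word is followed by "communism" (or by nothing, if it is the last token), which is the skew being described.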

Just spitballing here, but I would find making the robots they intend to replace workers with into communist agitators rather funny.

YourNetworkIsHaunted@awful.systems · 5 months ago

Or you identify which company is scraping you and feed their GET request into their own model to make the resulting training data as incestuous as possible.