I became interested in scraper bots when I heard about the catastrophic mistake the developers of the Rabbit r1 made with their code.
If you are unaware, YouTuber Fireship, a channel I've praised before, gives a great summary. Long story short, the codebase of the Rabbit r1, an AI-assisted device that mimics the capabilities of an Android phone, was found to hardcode its API keys for multiple services, such as ElevenLabs, Azure, Google Maps and Yelp. API keys are essentially the passwords you need to access these services, and if you expose them, anyone can access that data and do whatever they want with it: create, edit or delete it. Hence you want to protect and hide your API keys at all costs, so that neither your data nor your users' data is compromised.
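As a rough sketch of the safer alternative, here's how you might keep a key out of your source code in Node.js by reading it from an environment variable instead. The variable name ELEVENLABS_API_KEY is just an illustrative placeholder, not anything taken from the Rabbit r1 codebase.

```js
// config.js - a minimal sketch of loading a secret from the environment
// instead of hardcoding it in the source (and therefore in your Git history).
// ELEVENLABS_API_KEY is an illustrative placeholder name.
const apiKey = process.env.ELEVENLABS_API_KEY;

if (!apiKey) {
  // Fail fast so a missing key is caught at startup, not halfway through a request.
  throw new Error('ELEVENLABS_API_KEY is not set - add it to your environment, not your code.');
}

module.exports = { apiKey };
```

Locally you might load the variable from a .env file with the dotenv package (making sure .env is listed in your .gitignore); in production, your hosting platform's secret manager does the same job.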
Hardcoding your keys like this is a big no-no, and it's one of the first things you are taught not to do when creating an application. One danger of hardcoding your API keys and uploading your code to GitHub is that there are numerous scraper bots waiting to take advantage of those keys. But what are scraper bots?
They are automated programs, similar to web crawlers, designed to trawl websites and collect specific data. Such data can include product prices, news headlines, or social media content. However, this also means scraper bots can be programmed to collect sensitive information, such as API keys. GitHub is of course a perfect target, considering the sheer volume of codebases that are made public there.
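To give a sense of how trivially an exposed key can be spotted, here's a hedged sketch of the kind of pattern matching such a bot might run over a file, which is also the sort of check you could point at your own repo before pushing. The patterns are illustrative only; real secret scanners use far larger rule sets.

```js
// find-keys.js - a rough sketch of the pattern matching a scraper bot
// (or your own pre-push check) might run over a source file.
// Usage: node find-keys.js path/to/file.js
const fs = require('fs');

// Illustrative patterns only - real secret scanners maintain far larger rule sets.
const patterns = [
  /AIza[0-9A-Za-z_-]{35}/g,      // the shape of a Google API key
  /sk_live_[0-9a-zA-Z]{24,}/g,   // the shape of a Stripe live secret key
];

const source = fs.readFileSync(process.argv[2], 'utf8');

for (const pattern of patterns) {
  for (const match of source.match(pattern) || []) {
    console.log(`Possible hardcoded key: ${match}`);
  }
}
```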
Data theft is ever prevalent on the internet, and scraper bots are just one of the methods criminals use to harvest sensitive information. Any such data should therefore be treated with the utmost care, and the developer should take every step necessary to encrypt and protect it.
Not all scraper bots are used for malicious purposes, though. Price comparison websites, for example, use them to compare flight, hotel and car rental prices. In that sense they are simply a way of automating data collection for a database.
In addition, creating a scraper script is no secret. For instance, you can easily build a basic one with Node.js. Of course, it's strongly advised that you only scrape websites that allow it, and that you use such a script within ethical and legal limits. I had a go myself, and you can see what information I was easily able to scrape from the Giraffe Wikipedia page. Should I feel the need to build a Giraffe database, I'm all set.
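My exact script isn't reproduced here, but a minimal sketch of the general shape looks something like this, assuming Node.js 18 or later (for the built-in fetch) and the cheerio package for parsing the HTML.

```js
// scrape-giraffe.js - a minimal scraping sketch.
// Assumes Node.js 18+ (global fetch) and cheerio (npm install cheerio).
const cheerio = require('cheerio');

async function scrapeGiraffePage() {
  const response = await fetch('https://en.wikipedia.org/wiki/Giraffe');
  const html = await response.text();

  // Load the HTML and query it with a jQuery-like API.
  const $ = cheerio.load(html);

  // Pull out the article title and the section headings.
  // Selectors may need adjusting if Wikipedia's markup changes.
  const title = $('#firstHeading').text().trim();
  const headings = $('#mw-content-text h2')
    .map((i, el) => $(el).text().trim())
    .get();

  console.log(title);
  console.log(headings);
}

scrapeGiraffePage().catch(console.error);
```

Wikipedia happens to be fairly permissive about this sort of thing (and offers a proper API that's the better choice for anything serious), which is exactly the kind of check you should make before pointing a scraper at a site.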
But when you discover the power of scraper bots, the question of plagiarism arises. OpenAI, for example, uses its own crawler, GPTBot, to scrape websites and collect the vast amounts of data it needs. This data is then used to train its AI models without credit to the authors. I mentioned in one of my earlier articles that some AI models provide their sources, but this practice is not universal.
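If you run a website and would rather not be part of that training data, OpenAI does document that GPTBot respects robots.txt, so a couple of lines like these at the root of your site should tell it to stay away (assuming the crawler honours them, which is ultimately down to OpenAI).

```
# robots.txt - ask OpenAI's GPTBot not to crawl this site
User-agent: GPTBot
Disallow: /
```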
With AI there's also the problem of "giraffing". This is the problem whereby an AI is trained on tonnes of labelled photos scraped from the internet, but because of the sheer quantity of Giraffe images online, it is falsely led to believe that Giraffes are everywhere. If the AI were sentient, for instance, it might think it would be near impossible to walk down Oxford Street in London without bumping into a Giraffe. This somewhat highlights the limits of scraping for data collection, and how the internet is not a faithful representation of the real world.
Scraper bots are not necessarily a force for evil, but they are a reminder that the internet can be a dangerous place. Knowing how they work, and how to develop one, can further assist you in protecting your own data.
They can be a helpful time-saver when automating data collection, though it's important to recognise what type of data you are scraping and which sources you are taking it from. This is not only to show that you are more ethical than AI, but also because you may have inadvertently scraped a website full of Giraffe pictures.