With personal data such as spending habits, demographic details, and behavioral information routinely stored on mobile devices, users are already concerned about how this potentially valuable (and potentially damaging) material is observed and gathered by governments, law enforcement agencies, software developers, and advertising networks.
What many users may not be aware of is that AI has been part of this information monitoring ecosystem for some time now, through a process known as data scraping.
AI stands for Artificial Intelligence, a blanket term for machines and computer systems designed to adapt the way they operate in response to past observations and to the absorption of new knowledge and techniques – in short, the ability to “learn by experience.”
In the computing realm, machines are able to do this through special mathematical formulas or algorithms, which are written into their programming. The process whereby “intelligent” machines record information and adjust their performance in response to it is often referred to as machine learning.
Scraping is a data harvesting process which gathers specific information from websites. At a very simple level, this could involve an individual jotting down notes from a web page using a pen and paper, taking a screenshot from their computer or device desktop, or selecting some or all of the content on the page to copy and paste.
More complex scraping procedures might use algorithms or templates to sift online content for information which meets certain parameters.
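To make the idea concrete, here is a minimal sketch of a parameter-driven scraper using only Python’s standard library. The page content, link paths, and class name are all hypothetical; a real scraper would fetch live HTML over HTTP rather than parse a hard-coded string.

```python
from html.parser import HTMLParser

# Toy HTML standing in for a fetched web page (a real scraper would
# download this with urllib or a similar HTTP client).
PAGE = """
<html><body>
  <a href="/profile/alice">Alice - Data Analyst</a>
  <a href="/profile/bob">Bob - Sales Manager</a>
  <a href="/about">About this site</a>
</body></html>
"""

class ProfileLinkScraper(HTMLParser):
    """Collects links whose URL matches a filter parameter."""

    def __init__(self, path_prefix):
        super().__init__()
        self.path_prefix = path_prefix   # the "parameter" the scraper filters on
        self.matches = []                # (href, link text) pairs harvested
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith(self.path_prefix):
                self._current_href = href

    def handle_data(self, data):
        if self._current_href is not None:
            self.matches.append((self._current_href, data.strip()))
            self._current_href = None

scraper = ProfileLinkScraper("/profile/")
scraper.feed(PAGE)
print(scraper.matches)
# [('/profile/alice', 'Alice - Data Analyst'), ('/profile/bob', 'Bob - Sales Manager')]
```

The “About” link is ignored because it doesn’t match the filter – exactly the kind of selective sifting described above, just expressed in a few lines of code.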
What’s AI Scraping?
AI scraping could be described as the “industrial-strength” version of the data scraping process. Such a technique might be used to filter relevant information from a massive database, complex multimedia web resource, or a full-blown corporate network.
Because sifting through such large and complex data repositories by hand would be nearly impossible, specially programmed “bots” are given the Artificial Intelligence to analyze the information without human intervention, and to perform automated tasks (save, copy, print, etc.) depending on the data they observe, or in response to specified conditions they encounter.
These AI scraping agents are variously referred to as scripts, ‘bots’, ‘webots’, ‘crawlers’, ‘harvesters’, or ‘spiders’.
AI Scraping – The Good and The Bad
In and of itself, data scraping need not be illegal, or a bad thing. Extracting (copying, downloading, etc.) information from websites for personal or academic research would be an example of a benevolent, small-scale use of the technique. In fact, many online platforms make such resources available to their occasional visitors through subscription-based libraries, download lists, or similar.
Search engines like Google, Bing, and Yahoo routinely deploy spiders and crawlers for the searching and indexing functions which make them such indispensable tools for all of us. Data scraping itself is variously described as ‘web-scraping’, ‘web-harvesting’, ‘screen scraping’, or ‘rate-raping’ – that last one being a term used in the insurance industry for scraping activities used to plunder information for business purposes.
The problem arises when AI scraping is conducted under ambiguous (and often morally ambivalent) circumstances, without the consent or knowledge of the parties being observed – or with criminal or malicious intent from the outset. The fact that scraping tools habitually use the Tor network and anonymous proxies to avoid detection only compounds the problem.
The actual and potential damage caused by AI scraping may be considerable. Applications for industrial espionage or sabotage exist, whereby the crawlers may extract trade secrets and protected processes to be funneled back to business rivals, sold on the open market, or used as leverage in extortion schemes. Similar dangers exist with intellectual property, with bots being used to strip entire original works from their source sites for publication or reproduction elsewhere.
And the potential victims aren’t necessarily limited to secret labs, production companies, or huge corporations. AI scraping activities are routinely conducted on individuals, either directly (e.g., during criminal or civil investigations) or indirectly (e.g., the data harvested from mobile app users on behalf of advertisers and marketing agencies). As we’ve seen in recent months, even user data that’s been “anonymized” can quite easily be reconstructed to create viable profiles for use by fraudsters or identity thieves.
Moves for Legal Protection
There’s been a backlash against data scraping practices from several notable organizations.
The professional social networking platform LinkedIn recently filed suit in a California court against 100 unnamed individuals, alleging that they had been using bots to harvest user profiles from its website. LinkedIn cited the Computer Fraud and Abuse Act (CFAA) in its court action, claiming that the unauthorized extraction of user profiles from its site constituted hacking.
Though a ruling in mid-August 2017 upheld the right of one data gatherer (the analytics company hiQ Labs) to continue processing publicly available data from LinkedIn for use in training AI models, other cases based on the CFAA have met with success for the plaintiffs. Examples include Craigslist’s suit against 3Taps and Facebook’s action against Power Ventures.
Traditional network defenses like firewalls or intrusion detection and prevention systems have met with little success in detecting or blocking AI scrapers. The more sophisticated scraping tools are able to imitate the search patterns of authorized network users, so even application layer firewalls encounter difficulties in countering them.
The LinkedIn lawsuit revealed the company’s use of several automated tools designed to prevent data harvesting, including products codenamed FUSE, Quicksand and Sentinel. These utilities work by monitoring the web traffic from LinkedIn users and limiting how many other profiles a user can view, and how quickly they can view them. Defenses like this are intended to prevent data scrapers from signing up for bogus profiles on a site, then using them as a launch pad for data harvesting.
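The rate-limiting idea behind tools like these can be sketched with a simple sliding-window limiter. The class name, thresholds, and account labels below are illustrative assumptions, not LinkedIn’s actual implementation or values.

```python
import time
from collections import defaultdict, deque

class ProfileViewLimiter:
    """Sliding-window limiter: each account may view at most
    `max_views` profiles in any `window_seconds` span.
    (Illustrative sketch only - not a real product's logic.)"""

    def __init__(self, max_views=5, window_seconds=60.0):
        self.max_views = max_views
        self.window = window_seconds
        self._views = defaultdict(deque)  # account -> timestamps of recent views

    def allow(self, account, now=None):
        now = time.monotonic() if now is None else now
        recent = self._views[account]
        # Drop view timestamps that have aged out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.max_views:
            return False          # throttle: viewing pace looks automated
        recent.append(now)
        return True

limiter = ProfileViewLimiter(max_views=3, window_seconds=60.0)
results = [limiter.allow("suspect-account", now=t) for t in (0, 1, 2, 3)]
print(results)   # [True, True, True, False] - fourth rapid view is blocked
```

A human browsing normally rarely hits such a ceiling, but a bot churning through profiles at machine speed trips it almost immediately – which is why pacing limits are a common first line of defense against harvesting.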
Other tools revealed in the LinkedIn arsenal included Org Block – a tool for blocking IP addresses suspected of scraping – and behavioral monitoring to track the page requests made by subscribers and guests on the site.
For the individual mobile device user, the surest defense against scraping remains encryption. This may extend from device encryption (using the on-board facilities of a device’s operating system, or a dedicated app) to the use of encrypted messaging applications, and secure connections to the internet via a Virtual Private Network (VPN). The intention is to make it extremely difficult for any data harvested from a device to be decrypted and used by the data scraper.
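The underlying principle can be illustrated with a deliberately simple toy cipher. This one-time-pad sketch is purely educational – real devices and VPNs use vetted ciphers such as AES-GCM or ChaCha20-Poly1305, never hand-rolled code like this – but it shows why intercepted ciphertext is useless without the key.

```python
import secrets

# Toy one-time-pad illustration (NOT production cryptography).
def xor_bytes(data, key):
    return bytes(a ^ b for a, b in zip(data, key))

message = b"contacts: alice@example.com"   # hypothetical data on a device
key = secrets.token_bytes(len(message))    # random key, kept only on the device

ciphertext = xor_bytes(message, key)       # what a scraper could intercept

# The scraper sees statistically random bytes; only the key holder
# can invert the transformation and recover the original data.
print(xor_bytes(ciphertext, key) == message)   # True
```

Without the key, the harvested bytes carry no recoverable information – which is exactly what makes encryption the end user’s strongest countermeasure.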
Check out the new InvinciBull™ VPN now, available for iOS, Android, Windows and Mac.