Data Scraping for AI: When Terms of Service Become a Legal Firewall

High-quality, large-scale data is the foundation of artificial intelligence development. To acquire it, developers often resort to automated data collection, or ‘scraping.’ A recent example, the Terms of Service (ToS) of the TipRanks service, illustrates why this practice carries significant legal and business risks. A ToS is not just tedious text to scroll through; it is a binding contract, and breaching it can have severe consequences.

The Contract as the First Line of Defense

TipRanks clearly states that the content made available on its website and through other channels (the “Service”) was created, developed, and arranged through the expenditure of substantial time and effort, using its own methods and judgment. This content is proprietary and is protected by the Terms of Service, by copyright laws (including U.S. copyright law), and by international treaties.

The service is provided for personal, non-commercial use only. The terms explicitly prohibit any use of the service or its content that:

  • acts as a source of or substitute for the Service or the content,
  • affects the company’s ability to earn money in connection with the Service,
  • or competes with the service they provide.

These restrictions apply specifically to any “robot, spider, scraper, web crawler, or other automated means,” and even to similar manual processes. Under the terms, users further agree not to violate the restrictions in any `robot exclusion headers` (commonly known as `robots.txt`) and not to bypass other measures employed to prevent or limit access.

AIQ Analysis: The Convergence of Technical and Legal Risks

In a corporate context, this means that technical and legal barriers are tightly interwoven. The `robots.txt` file is no longer just a polite request from the website owner but an obligation enforced by the Terms of Service. Violating it is not merely a breach of ‘good internet etiquette’ but a breach of contract.
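To make the two layers concrete, here is a minimal Python sketch of a pre-collection check that requires both the technical and the contractual condition to hold. It uses the standard library's `urllib.robotparser`; the inlined rules, the user agent name, and the `tos_permits_automation` flag (assumed to come from a human legal review) are all hypothetical and not taken from the TipRanks terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a standard-library rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def collection_permitted(rules: RobotFileParser, user_agent: str, url: str,
                         tos_permits_automation: bool) -> bool:
    """Both layers must agree: the technical one (robots.txt) and the
    contractual one (a ToS review outcome, assumed to be recorded by a person)."""
    return tos_permits_automation and rules.can_fetch(user_agent, url)


rules = load_rules(EXAMPLE_ROBOTS_TXT)
AGENT = "example-research-bot"  # hypothetical user agent name

print(collection_permitted(rules, AGENT, "https://example.com/private/data", True))
print(collection_permitted(rules, AGENT, "https://example.com/public/page", True))
print(collection_permitted(rules, AGENT, "https://example.com/public/page", False))
```

Note the design choice: a permissive `robots.txt` alone never returns `True`. If the terms have not been reviewed, or the review found that automation is prohibited, the check fails regardless of what the file allows.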

From an AIQ standpoint, this dual-layered defense will become increasingly common. Companies are reinforcing their technical restrictions with legal instruments to protect their valuable, high-investment data. Training an AI model on data collected in breach of such a contract is extremely risky.

Connection to the OWASP LLM Top 10

This issue touches on several entries in the OWASP Top 10 for LLM Applications. It is particularly relevant to LLM05: Supply Chain Vulnerabilities (the 2023 numbering; LLM03: Supply Chain in the 2025 edition). If a model is trained on data from a legally questionable source, everything built on that model inherits the risk. Legal action could force a company to destroy or retrain the model, resulting in significant financial and reputational losses.

Furthermore, it increases the risk of training data poisoning (LLM03: Training Data Poisoning in the 2023 list, renumbered LLM04: Data and Model Poisoning in the 2025 edition). Service providers, upon detecting scraping that violates their terms, could intentionally serve manipulated or useless data to the scrapers, thereby compromising the trained model.

Compliance in Light of the EU AI Act and GDPR

From AIQ’s perspective, such data collection practices raise serious compliance issues within the European regulatory framework.

Under the GDPR, processing personal data always requires a valid legal basis. Where scraped content includes personal data, automated collection and further use (e.g., for model training) is difficult to justify even if the data is publicly available, especially when the service’s ToS explicitly forbids it. The data subjects have typically not consented to this processing, and any other legal basis would have to be established and documented.

The EU AI Act places a strong emphasis on data governance and data quality. For high-risk AI systems, it requires that training data be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete, and that data governance practices cover the origin of the data. The origin of data collected in violation of Terms of Service is difficult to justify legally, which could call into question the compliance of the entire system during a regulatory audit.
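One way to keep provenance demonstrable is to store a small record next to every ingested file. The Python sketch below is illustrative only: the field names, the `legal_basis` vocabulary, and the example source are assumptions, not a schema prescribed by the AI Act or the GDPR.

```python
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    source_url: str       # where the data came from
    retrieved_at: str     # ISO 8601 timestamp in UTC
    legal_basis: str      # free-text label such as "licensed" (hypothetical vocabulary)
    terms_reference: str  # which version of the terms or license was reviewed
    sha256: str           # fingerprint of the exact bytes that were stored


def make_record(source_url: str, content: bytes, legal_basis: str,
                terms_reference: str) -> ProvenanceRecord:
    """Build an immutable provenance record for one ingested file."""
    return ProvenanceRecord(
        source_url=source_url,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        legal_basis=legal_basis,
        terms_reference=terms_reference,
        sha256=hashlib.sha256(content).hexdigest(),
    )


record = make_record(
    "https://example.com/dataset.csv",  # hypothetical source
    b"ticker,rating\nABC,buy\n",
    legal_basis="licensed",
    terms_reference="data license v1 (hypothetical)",
)
print(asdict(record))
```

The hash ties the legal paperwork to the exact bytes used for training, so an auditor can verify that the reviewed source and the stored data are the same thing.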

Audit Takeaways and Corporate Actions

The TipRanks case is a clear warning to all companies involved in AI development. Data collection is not just a technical issue; it is primarily a legal and ethical one.

AIQ’s recommendations are as follows:

  • Data Acquisition Policy: Every company must have a clear, legally vetted data acquisition policy that prohibits data collection in violation of terms of service.
  • Developer Education: Developers must be aware of the legal significance of `robots.txt` and the ToS. The ‘it’s just a script’ mentality is unacceptable.
  • LLM Security Audits: A comprehensive audit must include an examination of the origin and legality of training data. This serves not only the security of the model but also the legal protection of the company.
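As a rough illustration of the audit recommendation, the following Python sketch walks through an inventory of training data sources and reports which ones lack a documented legal footing. The `DataSource` fields and the finding messages are hypothetical; a real audit would draw these values from legal review records rather than hand-set flags.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    tos_reviewed: bool          # has a legal review of the terms taken place?
    automation_permitted: bool  # do the reviewed terms allow automated collection?
    robots_txt_honored: bool    # did the collector respect robots.txt?


def audit(sources: list[DataSource]) -> dict[str, list[str]]:
    """Return findings per source; sources without findings are omitted."""
    findings: dict[str, list[str]] = {}
    for src in sources:
        issues = []
        if not src.tos_reviewed:
            issues.append("terms of service not reviewed")
        elif not src.automation_permitted:
            issues.append("terms prohibit automated collection")
        if not src.robots_txt_honored:
            issues.append("robots.txt not honored")
        if issues:
            findings[src.name] = issues
    return findings


inventory = [
    DataSource("licensed-feed", True, True, True),
    DataSource("scraped-site", True, False, False),
    DataSource("unknown-dump", False, False, True),
]
report = audit(inventory)
for name, issues in report.items():
    print(f"{name}: {'; '.join(issues)}")
```

An unreviewed source is reported as a finding in its own right, which reflects the policy stance that missing documentation is itself a risk.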

The lesson is clear: data is the most valuable asset, and its owners will protect it with increasingly sophisticated legal and technical tools. Development practices that ignore this reality are unsustainable and dangerous in the long run.

Attila Rácz-Akácosi

Independent AI Security Specialist

Two decades of analytical and systems-oriented experience. I have been working with artificial intelligence since 2017. In recent years, I have specialized in AI/LLM security and AI Red Teaming. Systems-level thinking instead of endless vulnerability checklists.