Czech online publishers gained on March 19 a more detailed technical framework to protect their content from artificial intelligence systems – one that extends well beyond training data to cover the real-time AI responses increasingly siphoning traffic from news sites across Europe.

The Sdružení pro internetový rozvoj (SPIR), the Association for Internet Development in the Czech Republic, published on March 19, 2026, an updated unified standard enabling website operators to declare an opt-out from the text and data mining (TDM) exception under EU copyright law. The update, developed in collaboration with the Asociace online vydavatelů (AOV), the Česká unie vydavatelů (ČUV) and the Správce licenčních práv vydavatelů (SLPV), replaces a draft standard SPIR first issued on July 7, 2023.

The revision reflects how fundamentally the AI landscape has shifted in less than three years. Back in 2023, the concern was primarily training data – the vast corpora of text fed to large language models during their development. Today, according to SPIR, the scope of automated content extraction has grown to include data used for so-called real-time responses: AI assistants, online summarisation tools, and retrieval-augmented generation (RAG) systems that pull live content from the web to answer user queries on the fly.

Two directives, two distinct use cases

At the heart of the update lies a two-tier technical structure, each tier expressed through specific directives in the robots.txt file – the protocol that website operators use to communicate instructions to automated crawlers. The standard draws a precise line between two scenarios that, until now, many publishers may have struggled to manage separately.

The first tier targets AI training. According to the document, operators who do not want their copyright-protected content used for training general AI models – including large language models – or for building datasets toward that end can add the following to their robots.txt file:

User-agent: MachineLearning
Disallow: /

The second tier is broader. It addresses both training and real-time usage simultaneously. Publishers who also wish to prevent AI systems from using their content for live inference – such as AI assistants generating answers from crawled content in real time – can instead apply:

User-agent: AI
Disallow: /

Both directives cover all content on a given domain. Crucially, according to SPIR, neither setting affects standard web search indexing by conventional search engine crawlers – unless those crawlers are operating in an AI mode, such as Google's AI Overviews. That distinction matters considerably. Publishers blocking AI crawlers via robots.txt saw total traffic drop by 23% and human traffic fall by 14%, according to research published December 31, 2025, by Rutgers Business School and The Wharton School.
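How a compliant crawler would evaluate these directives can be sketched with Python's standard-library robots.txt parser. The rules below mirror the standard's broader tier; the domain and paths are placeholders, not part of SPIR's document:

```python
# Minimal sketch: evaluating SPIR-style robots.txt rules the way a
# compliant crawler would, using Python's built-in parser.
from urllib import robotparser

rules = """\
User-agent: AI
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A crawler honouring the "AI" token is blocked from the whole domain...
print(rp.can_fetch("AI", "https://example.cz/article"))         # False
# ...while an ordinary search crawler matches only the "*" entry.
print(rp.can_fetch("Googlebot", "https://example.cz/article"))  # True
```

This also illustrates the article's point about search indexing: under a plain robots.txt reading, a conventional search crawler is unaffected by the AI-specific entries unless it identifies itself with one of those tokens.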

Priority hierarchy and platform-specific controls

The standard also establishes a clear priority hierarchy among the directives. According to SPIR, directives for specific user agents take precedence over those set for the MachineLearning user agent, and MachineLearning directives in turn take precedence over those set for the AI user agent. This means operators can layer their settings: a broad AI-level block can be refined or overridden by platform-specific entries.

SPIR explicitly notes that operators can therefore allow or restrict specific AI platforms using their individual user agent identifiers. Examples given in the standard include Apple-Extended, Google-Extended, Perplexity-User, Seznam-Extended, and OpenAI's crawlers, among others. That granularity is significant. Google-Extended, for instance, allows publishers to control whether their content feeds into future Gemini model training – though questions persist about whether such controls adequately address AI Overviews participation.
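Under that hierarchy, a layered configuration might block all AI use by default while explicitly permitting a single platform. This is an illustrative sketch, not an example taken from SPIR's document – per the standard's precedence rules, the specific Google-Extended entry would override the broad AI-level block:

```
User-agent: AI
Disallow: /

User-agent: Google-Extended
Allow: /
```

The same pattern works in the other direction: a site could leave the general tiers open and disallow only the individual user agents it objects to.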

To reduce ambiguity, SPIR also recommends adding a plain-text comment to the robots.txt file explaining the directives, together with a contact address for licensing negotiations. The recommended comment, translated from Czech, reads roughly: “The ‘User-agent: MachineLearning’ and ‘User-agent: AI’ settings are tools of the SPIR unified standard for automated text and data mining from this website, particularly within the meaning of Article 4 of Directive 2019/790/EU.”
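Put together, a minimal file following the recommendation might look like the sketch below. The comment is an English rendering of SPIR's Czech original, and the contact address is a hypothetical placeholder, not part of the standard:

```
# The "User-agent: MachineLearning" and "User-agent: AI" settings are tools
# of the SPIR unified standard for automated text and data mining from this
# website, particularly within the meaning of Article 4 of Directive
# 2019/790/EU. Licensing contact: licensing@example.cz (placeholder)

User-agent: AI
Disallow: /
```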

The legal grounding: Article 4 of the EU Copyright Directive

The standard is explicitly grounded in Article 4, paragraph 3 of Directive 2019/790 of the European Parliament and of the Council – the EU's 2019 Copyright in the Digital Single Market directive. That provision allows rights holders to reserve their content from TDM by machine-readable means, effectively making the robots.txt opt-out legally meaningful under European law.

SPIR's earlier 2023 recommendation was conceived in direct response to the emergence of AI as a major commercial force. Its stated goal at the time was to create a more transparent and predictable business environment for AI developers, website operators, and authors alike. That framing, notably, places AI developers within the framework as potential counterparties – not adversaries – in future licensing negotiations. When an operator applies the standard, SPIR says they are clearly declaring an opt-out from the TDM exception and signalling willingness to negotiate compensation with AI platforms for the use of their copyright-protected content.

The robots.txt standard: older than it looks

The Robots Exclusion Protocol, which underpins the entire framework, predates the AI era by decades. According to SPIR, the protocol originated before the year 2000. It was formally proposed as an IETF standard in 2019 and approved in 2022 as RFC 9309. Despite its age, it has become the de facto mechanism by which publishers worldwide attempt to manage crawler access – a role it was never specifically designed for.

That improvised role is a source of ongoing tension. The standard operates on voluntary compliance. Crawlers can choose to ignore robots.txt directives without legal consequence in most jurisdictions, which is precisely why enforcement tools and legal frameworks have grown alongside the technical standard. Cloudflare launched Robotcop in December 2024 specifically to convert robots.txt declarations into active Web Application Firewall rules, enforcing them at the network level rather than relying on crawler goodwill. Research from Kim et al. (2025) further showed that compliance falls with stricter robots.txt directives, and that some AI-related crawlers rarely check these files at all.

Anthropic, for its part, clarified in February 2026 how its three crawlers – ClaudeBot for model training, Claude-User for real-time queries, and Claude-SearchBot for search quality – respond to robots.txt directives, and committed to not bypassing CAPTCHAs. Whether documented commitments translate reliably into crawler behaviour has remained a point of contention. Reddit's lawsuit against Anthropic, filed June 4, 2025, alleged the company continued accessing its platform more than 100,000 times after publicly claiming it had stopped.

The extension of SPIR's standard to cover real-time AI responses is arguably its most significant technical update. Retrieval-augmented generation (RAG) systems – those that pull current information from the open web to produce answers – have grown rapidly in prominence since 2023. Unlike static training, RAG involves live crawling at the moment a user asks a question. A news article published this morning can be ingested and summarised by an AI assistant within hours, generating responses for users who never visit the original publication.

This dynamic sits at the centre of the commercial battle between publishers and AI platforms. Over 80 media executives gathered under the IAB Tech Lab banner in late July 2025 to address systematic content extraction by AI platforms – with OpenAI, Anthropic, and Perplexity notably absent from the room. The gathering aimed to develop an LLM Content Ingest API that would formalise publisher consent, attribution, and compensation into a binding technical framework.

The scale of the economic harm being alleged is substantial. According to IAB Tech Lab analysis, AI-driven search summaries reduce publisher traffic by 20% to 60% on average, with niche websites experiencing losses as high as 90%. The organisation estimates publishers collectively face $2 billion in annual revenue losses from AI-driven search features.

Czech context within the European landscape

SPIR's update does not exist in isolation. Czech media has been navigating AI-related pressures alongside broader European regulatory developments. The same association coordinated a separate self-regulatory initiative in August 2025, when ten major Czech media organisations announced a ten-point framework aligned with the European Media Freedom Act, which entered into force on August 8, 2025. SPIR has also been active on the political advertising front, seeking a developer for a centralised political advertising transparency system in February 2026.

The European Commission launched a formal antitrust investigation on December 9, 2025, into whether Google violated EU competition rules by using publisher and YouTube content for AI purposes without compensation or viable opt-out mechanisms. Meanwhile, the UK's Competition and Markets Authority proposed in January 2026 that Google give publishers the ability to opt out of AI Overviews without losing search visibility – a change Google's own executives described in February 2026 as a "big engineering challenge."

In that broader context, SPIR's updated standard represents one national association's concrete response to a problem that regulators in Brussels and London are still working to resolve through formal legal mechanisms. The Czech framework offers something regulators cannot yet guarantee: an immediately implementable, machine-readable signal that a publisher does not consent to AI extraction of their content.

What the standard doesn't cover

SPIR is explicit that the new standard applies specifically to extraction via web crawlers, not to other forms of data mining. Publishers seeking to restrict content reuse by different technical means – such as textual notices in page footers or use of the TDM Reservation Protocol – are not precluded from doing so. However, SPIR cautions that any alternative methods should be applied consistently with the robots.txt settings to avoid ambiguity or contradictions.
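For context, the TDM Reservation Protocol mentioned above is a W3C Community Group specification that expresses the same Article 4 reservation outside robots.txt – via an HTTP response header, an HTML meta tag, or a well-known JSON file. A sketch of the file-based form, assuming the protocol's /.well-known/tdmrep.json location (the policy URL is a hypothetical placeholder):

```
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.cz/tdm-policy.json"
  }
]
```

A site using both mechanisms would, per SPIR's caution, want this reservation to say the same thing as its robots.txt AI directives rather than contradict them.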

The standard also notes that top websites have increasingly moved to block AI crawlers through their own configurations in recent years. By July 2024, 35.7% of the top 1,000 global websites were blocking OpenAI's GPTBot – a sevenfold increase from the 5% blocking rate when that crawler launched in August 2023. CCBot was blocked by 22.1% of top sites; Google-Extended by 13.6%.

Why this matters for marketing and advertising professionals

Publishers are not the only stakeholders watching these developments. For the marketing and advertising community, the growing fragmentation of content access policies across territories and platforms introduces significant complexity. A Czech news website applying SPIR's User-agent: AI / Disallow: / directive is, in effect, opting out of AI-powered ad environments that rely on real-time content context. Contextual targeting systems powered by RAG or live content APIs would find access restricted where publisher consent has not been negotiated.

OpenAI revised its ChatGPT crawler documentation in December 2025, separating the roles of its training and search crawlers. Anthropic's February 2026 clarification similarly distinguished between its model training and user-query bots. These distinctions map directly onto the two tiers SPIR has now formalised: a training-only opt-out versus a full opt-out covering live inference as well. Publishers using SPIR's framework can now align their robots.txt configuration with the specific crawlers they wish to block, using the platform-specific user agent identifiers that major AI companies have begun publishing.

Whether AI platforms honour these declarations consistently remains an open question. The SPIR standard does not create an enforcement mechanism – that remains the province of regulation and litigation. What it does create is a clear, standardised, and legally grounded signal that Czech publishers can use to assert their rights under the EU Copyright Directive, and a foundation for future licensing negotiations with AI platforms that increasingly depend on publisher content to function.

Timeline

  • Pre-2000: Robots Exclusion Protocol (robots.txt) first emerges as an informal standard for communicating with web crawlers.
  • 2019: The robots.txt protocol is formally proposed as an IETF standard, the basis for what became RFC 9309.
  • 2019: EU Directive 2019/790 on copyright in the Digital Single Market adopted, including Article 4's TDM opt-out provision.
  • 2022: RFC 9309 officially approved as the formal robots.txt standard.
  • July 7, 2023: SPIR publishes its first standardised draft for Czech publishers to opt out of AI content extraction via robots.txt.
  • August 7, 2023: OpenAI announces GPTBot; major sites including Amazon, The New York Times, and CNN begin blocking it within two weeks.
  • June 29, 2024: Cloudflare introduces a feature to block AI scrapers and crawlers.
  • August 2024: 35.7% of the top 1,000 global websites block GPTBot, a sevenfold increase from August 2023.
  • August 2024: Publisher traffic begins declining measurably, 13.2 months after ChatGPT's launch, according to the Rutgers/Wharton research.
  • December 10, 2024: Cloudflare launches Robotcop to enforce robots.txt policies at the network level.
  • March 9, 2025: Google updates robots meta tag documentation to include AI Mode.
  • June 30, 2025: Independent Publishers Alliance files antitrust complaint with the European Commission targeting Google's AI Overviews.
  • July 1, 2025: Cloudflare launches pay-per-crawl service for content creators.
  • July 30–August 3, 2025: Over 80 media executives gather at IAB Tech Lab in New York to address AI content scraping.
  • August 8, 2025: Ten Czech media organisations announce a ten-point self-regulatory framework aligned with the European Media Freedom Act.
  • August 21–30, 2025: Amazon updates its robots.txt to block AI crawlers from Meta, Google, Huawei, and others.
  • September 30, 2025: UK CMA designates Google with Strategic Market Status after a nine-month investigation.
  • October 10, 2025: Google VP Robby Stein confronted publicly over publisher opt-out gaps for AI Overviews.
  • December 9, 2025: European Commission launches formal antitrust investigation into Google's AI content practices.
  • December 9, 2025: OpenAI revises ChatGPT crawler documentation, separating training and search bots.
  • December 31, 2025: Rutgers Business School and The Wharton School publish research showing publishers who blocked AI crawlers lost 23% of total traffic.
  • February 10, 2026: SPIR announces public tender for a political advertising transparency system.
  • February 11, 2026: Google executive calls letting publishers skip AI Overviews without losing search a "big engineering challenge."
  • February 25, 2026: Anthropic clarifies the roles and blocking mechanisms for its three web crawlers.
  • March 19, 2026: SPIR, in collaboration with AOV, ČUV, and SLPV, publishes updated unified standard for Czech online publishers to opt out of AI text and data mining via robots.txt, extending its scope to real-time AI responses.

Summary

Who: The Sdružení pro internetový rozvoj (SPIR) – Association for Internet Development in the Czech Republic – acting jointly with the Asociace online vydavatelů (AOV), the Česká unie vydavatelů (ČUV), and the Správce licenčních práv vydavatelů (SLPV), representing Czech online publishers.

What: An updated unified technical standard enabling Czech website operators to declare an opt-out from the EU's text and data mining (TDM) exception via the robots.txt file. The update introduces two specific directives – User-agent: MachineLearning for training-only opt-outs, and User-agent: AI for a broader opt-out covering both training and real-time AI inference. It replaces SPIR's first draft, published July 7, 2023, and extends the scope to cover retrieval-augmented generation and similar real-time AI response systems.

When: The updated standard was published on March 19, 2026. The original draft it replaces dated from July 7, 2023.

Where: The standard applies to website operators in the Czech Republic, expressed through the machine-readable robots.txt file – a globally understood protocol that AI crawlers access at the root of any domain. It is grounded in EU Directive 2019/790 and is designed for pan-European and international applicability where equivalent national laws exist.

Why: Since 2023, the scope of AI content extraction has expanded from model training data to real-time content usage by AI assistants and summarisation tools. Czech publishers, alongside counterparts across Europe, face significant traffic and revenue losses from AI systems using their content without compensation. The update provides a standardised, legally grounded mechanism for publishers to assert their rights under EU copyright law and signals willingness to negotiate licensing terms – at a moment when enforcement through regulation remains incomplete and litigation is ongoing.
