About OpenRobotsTXT
Majestic has over twenty years' experience in processing robots.txt files. OpenRobotsTXT is an attempt to build on that knowledge to archive and analyse all the web's robots.txt files, creating an archive where we can report on things like trends in which bots are being blocked, the appearance of new user agents out in the wild, and other facets that will be of interest to webmasters, bot creators and other third parties.
Why do we need OpenRobotsTXT?
Robots.txt files are easy to find and simple to download. Whether you're browsing a site manually or using an automated script, you can almost always check the root directory for the robots.txt file. Even the Internet Archive keeps a collection of them.
So, what's new?
The twist is in the purpose. While robots.txt files are designed for bots, no major archive (that we know of) has focused on analysing them at scale, until now. OpenRobotsTXT isn’t just about collecting these files; it’s about unlocking their value.
When we actively crawl and store robots.txt files, we enable deeper insights into how sites manage bot traffic. It gives us a way to understand crawler behavior, detect new bots, and help webmasters verify that their directives are actually reaching the bots, unaltered by servers, ISPs, or other intermediaries.
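As a rough illustration of that kind of verification, a webmaster could spot-check their own site with a short script along these lines. This is a minimal sketch only; the domain and both user-agent strings are placeholders, not the user agents OpenRobotsTXT actually uses.

```python
# Minimal sketch: fetch /robots.txt with two different User-Agent headers
# and check whether the server returns the same content to both.
# The domain and user-agent strings below are illustrative placeholders.
import urllib.request

def fetch_robots(domain: str, user_agent: str) -> str:
    req = urllib.request.Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

domain = "example.com"
as_browser = fetch_robots(domain, "Mozilla/5.0 (compatible; manual check)")
as_bot = fetch_robots(domain, "ExampleBot/1.0 (+https://example.org/bot)")

if as_browser == as_bot:
    print("Same robots.txt served to both user agents.")
else:
    print("Different content served - an intermediary may be altering it.")
```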
What’s the Motivation Behind OpenRobotsTXT?
The driving force is simple but powerful. We would like to build an open and global archive of robots.txt files. This dataset is valuable to a wide range of researchers, webmasters, and crawler developers, because it provides a definitive snapshot of how websites wish to interact with automated agents.
By collecting and analyzing robots.txt files, OpenRobotsTXT helps answer key questions:
- Which user agents are blocked where?
- What new crawlers are emerging?
- Which domains are actively serving web content?
- Does every crawler see the same robots.txt?
- How often are crawlers blocked from even seeing robots.txt?
With this kind of insight, we can help the web ecosystem function more transparently and efficiently.
For crawler developers, this is an opportunity to optimise. If you know ahead of time that a site specifies a 20-second crawl delay, you could reduce its crawl budget and focus on the most important pages in the link graph. This sort of pre-optimisation offers a smarter, more respectful way to crawl.
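To make the arithmetic concrete, here is a hedged sketch of how a crawler might turn a Crawl-delay into a daily budget, using Python's standard urllib.robotparser; the domain, bot name, and importance scores are purely illustrative.

```python
# Minimal sketch: read a site's Crawl-delay and cap the number of URLs
# scheduled per day, so the budget goes to the most important pages first.
# The domain, user-agent name, and importance scores are placeholders.
from urllib.robotparser import RobotFileParser

def daily_budget(robots_url: str, user_agent: str, default_delay: float = 1.0) -> int:
    rp = RobotFileParser(robots_url)
    rp.read()
    delay = rp.crawl_delay(user_agent) or default_delay
    return int(86_400 / delay)  # polite requests possible in 24 hours

budget = daily_budget("https://example.com/robots.txt", "ExampleBot")

# Hypothetical importance scores (e.g. from a link graph); keep only the
# top-ranked pages that fit within today's budget.
scored_urls = {
    "https://example.com/": 0.92,
    "https://example.com/products": 0.71,
    "https://example.com/blog/old-post": 0.08,
}
todays_queue = sorted(scored_urls, key=scored_urls.get, reverse=True)[:budget]
print(f"Budget: {budget} URLs/day; scheduling {len(todays_queue)} today")
```

With a 20-second delay the budget works out to at most 4,320 polite requests per day, which is why ranking pages before crawling matters.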
It also helps with discovery and prioritization. Sites that explicitly block all bots? Maybe they don’t belong at the top of your list. But sites with open, high-volume structures? Those might be worth your attention.
Why Now? Why OpenRobotsTXT?
We have noticed a growing spotlight on web crawlers, driven by the explosive rise of AI bots and large language models (LLMs). These systems harvest massive amounts of web data, often piggybacking on the same lower-level crawling techniques that have existed for years.
While new proposals are emerging to help websites opt out of AI training or manage LLM access, one of the most immediate and widely recognized defenses is still the humble robots.txt file.
OpenRobotsTXT is our response. By building an open, transparent archive of robots.txt files, and by reporting on crawler activity across the web, we hope to support a more informed and constructive conversation around the role of bots in the age of AI.
Who is OpenRobotsTXT for?
OpenRobotsTXT is for everyone who cares about how bots interact with the web. That includes researchers, webmasters, ISPs, journalists, and crawler developers. For Majestic, it helps us understand block rates, identify trends in bot behavior, and benchmark our own crawlers against broader industry patterns. To do this, we’ve introduced a dedicated user-agent that only fetches robots.txt files, making sure that we have a clean and focused dataset.
By combining data from this crawler, our partners, and data available at majestic.com, we aim to build a community-driven resource that reports on new bots, common misconfigurations, and emerging standards. This benefits webmasters by offering insights into how bots see their sites and highlighting potential issues like malformed directives, incorrect user-agent names, or outdated rules.
Researchers can dig into trends, such as the practical effects of newer standards (like RFC 9309), and investigate whether they have been widely adopted.
Bots will be able to consult OpenRobotsTXT data before hitting a server, and this may be particularly useful when they have not been able to retrieve a site's robots.txt file for themselves.
Will I Be Able to Use the OpenRobotsTXT Website to Explore Crawler Data?
Yes!
The platform will soon offer tools to search and analyze data about how websites would like bots to interact with them.
We compile regular statistics on the most active bots from among the tens of thousands online. For webmasters, one standout feature will be the robots.txt timeline, showing historical changes, server errors, or unexpected modifications (such as when a developer temporarily altered a robots.txt file).
There's growing interest in tracking these files, so the platform will provide a cached archive of your past robots.txt versions. We're exploring ways to let you share this with third parties, which we hope will be useful in discussions or disputes over which bots were allowed or denied on specific dates. It adds transparency, especially when bots behave in ways that contradict site rules.
By offering a neutral, third-party view of these changes, OpenRobotsTXT helps both site owners and crawler developers understand what was actually delivered to bots, versus what was intended. That clarity reduces ambiguity and improves communication when diagnosing issues caused by misconfigurations or unexpected changes.
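As a small illustration of the kind of comparison such a timeline enables, the sketch below diffs two hypothetical archived snapshots; the contents and dates are invented for the example, not real OpenRobotsTXT data.

```python
# Minimal sketch: compare two archived robots.txt snapshots to see what
# changed between two dates. The snapshot contents below are made up.
import difflib

snapshot_may = """User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

snapshot_june = """User-agent: *
Disallow: /private/
Disallow: /search
Crawl-delay: 20
""".splitlines()

diff = difflib.unified_diff(
    snapshot_may, snapshot_june,
    fromfile="robots.txt@2025-05-01", tofile="robots.txt@2025-06-01",
    lineterm="",
)
print("\n".join(diff))
```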
Do You Recommend Which Bots I Should Block or Allow?
The goal of the project is to stay neutral and foster open discussion. We don’t recommend specific bots to block or allow, as recommending actions would shift us from a data-focused platform into a lobbying role, which doesn’t align with our mission.
Instead, we provide insights and trends to help webmasters make informed choices. If you're setting bot policies, we encourage you to fully understand their impact (both technically and ethically). With better visibility into bot identities, you can shape your approach based on your specific goals, whether that's inclusion or restriction.
How Can People Contribute to the Project?
There are several meaningful ways people can get involved and support this project. A simple way is to raise awareness: look out for visits from the OpenRobotsTXT crawler and give it a shout on social media if you see it visit your /robots.txt.
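If you want to check your own server logs for such visits, a sketch like the one below may help; the log path and user-agent substring are placeholders (consult this website for the crawler's actual user-agent string), and a typical combined access log format is assumed.

```python
# Minimal sketch: scan a web server access log for robots.txt fetches by a
# particular crawler. The log path and user-agent substring are placeholders;
# check the OpenRobotsTXT website for the crawler's real user-agent string.
LOG_PATH = "/var/log/nginx/access.log"   # placeholder path
UA_SUBSTRING = "openrobotstxt"           # placeholder match string (lowercase)

def find_visits(log_path: str, ua_substring: str):
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if "/robots.txt" in line and ua_substring in line.lower():
                yield line.rstrip()

for hit in find_visits(LOG_PATH, UA_SUBSTRING):
    print(hit)
```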
Another great way to help is to keep an eye on this OpenRobotsTXT website. We will soon be introducing a free login to access a range of free tools. We have no plans for paid tools or subscriptions on this site. Your feedback and ideas will help us build a better experience for everyone.
We’re also working on ways to provide broader access to our growing data archive. While the data is manageable in size, it's still quite substantial, so sharing it responsibly and sustainably is key. That’s why we’re initially focusing on forming partnerships with trusted, well-established organizations.
In the near future, we also plan to collaborate with recognized researchers from academic institutions around the world. Our long-term goal is simple: make the data as widely available as possible. But we’re committed to doing so in a way that ensures both reliability and sustainability.
How Can I Partner With OpenRobotsTXT?
For organizations interested in deeper involvement, we’re developing a partners program. Details will be available on our site soon, and in the meantime, you're welcome to reach out via the contact form. We’re prioritizing partnerships with experienced, trusted players in the space to ensure the project grows responsibly and sustainably.
By carefully establishing clear protocols for collaboration, we hope to uphold the integrity of the project and share valuable insights in a way that benefits the whole ecosystem.
© Majestic-12 Ltd, 2025