<!DOCTYPE html>
<html lang="en" data-content_root="../">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta property="og:title" content="urllib.robotparser — Parser for robots.txt" />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://docs.python.org/3/library/urllib.robotparser.html" />
<meta property="og:site_name" content="Python documentation" />
<meta property="og:description" content="Source code: Lib/urllib/robotparser.py This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the web site tha..." />
<meta property="og:image" content="https://docs.python.org/3/_static/og-image.png" />
<meta property="og:image:alt" content="Python documentation" />
<meta name="description" content="Source code: Lib/urllib/robotparser.py This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the web site tha..." />
<meta property="og:image:width" content="200">
<meta property="og:image:height" content="200">
<meta name="theme-color" content="#3776ab">
<title>urllib.robotparser — Parser for robots.txt &#8212; Python 3.13.3 documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=b86133f3" />
<link rel="stylesheet" type="text/css" href="../_static/pydoctheme.css?v=23252803" />
<link id="pygments_dark_css" media="(prefers-color-scheme: dark)" rel="stylesheet" type="text/css" href="../_static/pygments_dark.css?v=5349f25f" />
<script src="../_static/documentation_options.js?v=5d57ca2d"></script>
<script src="../_static/doctools.js?v=9bcbadda"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/sidebar.js"></script>
<link rel="search" type="application/opensearchdescription+xml"
title="Search within Python 3.13.3 documentation"
href="../_static/opensearch.xml"/>
<link rel="author" title="About these documents" href="../about.html" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="copyright" title="Copyright" href="../copyright.html" />
<link rel="next" title="http — HTTP modules" href="http.html" />
<link rel="prev" title="urllib.error — Exception classes raised by urllib.request" href="urllib.error.html" />
<link rel="canonical" href="https://docs.python.org/3/library/urllib.robotparser.html">
<style>
@media only screen {
table.full-width-table {
width: 100%;
}
}
</style>
<link rel="stylesheet" href="../_static/pydoctheme_dark.css" media="(prefers-color-scheme: dark)" id="pydoctheme_dark_css">
<link rel="shortcut icon" type="image/png" href="../_static/py.svg" />
<script type="text/javascript" src="../_static/copybutton.js"></script>
<script type="text/javascript" src="../_static/menu.js"></script>
<script type="text/javascript" src="../_static/search-focus.js"></script>
<script type="text/javascript" src="../_static/themetoggle.js"></script>
<script type="text/javascript" src="../_static/rtd_switcher.js"></script>
<meta name="readthedocs-addons-api-version" content="1">
</head>
<body>
<div class="mobile-nav">
<input type="checkbox" id="menuToggler" class="toggler__input" aria-controls="navigation"
aria-pressed="false" aria-expanded="false" role="button" aria-label="Menu" />
<nav class="nav-content" role="navigation">
<label for="menuToggler" class="toggler__label">
<span></span>
</label>
<span class="nav-items-wrapper">
<a href="https://www.python.org/" class="nav-logo">
<img src="../_static/py.svg" alt="Python logo"/>
</a>
<span class="version_switcher_placeholder"></span>
<form role="search" class="search" action="../search.html" method="get">
<svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" class="search-icon">
<path fill-rule="nonzero" fill="currentColor" d="M15.5 14h-.79l-.28-.27a6.5 6.5 0 001.48-5.34c-.47-2.78-2.79-5-5.59-5.34a6.505 6.505 0 00-7.27 7.27c.34 2.8 2.56 5.12 5.34 5.59a6.5 6.5 0 005.34-1.48l.27.28v.79l4.25 4.25c.41.41 1.08.41 1.49 0 .41-.41.41-1.08 0-1.49L15.5 14zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z"></path>
</svg>
<input placeholder="Quick search" aria-label="Quick search" type="search" name="q" />
<input type="submit" value="Go"/>
</form>
</span>
</nav>
<div class="menu-wrapper">
<nav class="menu" role="navigation" aria-label="main navigation">
<div class="language_switcher_placeholder"></div>
<label class="theme-selector-label">
Theme
<select class="theme-selector" oninput="activateTheme(this.value)">
<option value="auto" selected>Auto</option>
<option value="light">Light</option>
<option value="dark">Dark</option>
</select>
</label>
<div>
<h4>Previous topic</h4>
<p class="topless"><a href="urllib.error.html"
title="previous chapter"><code class="xref py py-mod docutils literal notranslate"><span class="pre">urllib.error</span></code> — Exception classes raised by urllib.request</a></p>
</div>
<div>
<h4>Next topic</h4>
<p class="topless"><a href="http.html"
title="next chapter"><code class="xref py py-mod docutils literal notranslate"><span class="pre">http</span></code> — HTTP modules</a></p>
</div>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="../bugs.html">Report a Bug</a></li>
<li>
<a href="https://github.com/python/cpython/blob/main/Doc/library/urllib.robotparser.rst"
rel="nofollow">Show Source
</a>
</li>
</ul>
</div>
</nav>
</div>
</div>
<div class="related" role="navigation" aria-label="Related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="../py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="http.html" title="http — HTTP modules"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="urllib.error.html" title="urllib.error — Exception classes raised by urllib.request"
accesskey="P">previous</a> |</li>
<li><img src="../_static/py.svg" alt="Python logo" style="vertical-align: middle; margin-top: -1px"/></li>
<li><a href="https://www.python.org/">Python</a> &#187;</li>
<li class="switchers">
<div class="language_switcher_placeholder"></div>
<div class="version_switcher_placeholder"></div>
</li>
<li>
</li>
<li id="cpython-language-and-version">
<a href="../index.html">3.13.3 Documentation</a> &#187;
</li>
<li class="nav-item nav-item-1"><a href="index.html" >The Python Standard Library</a> &#187;</li>
<li class="nav-item nav-item-2"><a href="internet.html" accesskey="U">Internet Protocols and Support</a> &#187;</li>
<li class="nav-item nav-item-this"><a href=""><code class="xref py py-mod docutils literal notranslate"><span class="pre">urllib.robotparser</span></code> — Parser for robots.txt</a></li>
<li class="right">
<div class="inline-search" role="search">
<form class="inline-search" action="../search.html" method="get">
<input placeholder="Quick search" aria-label="Quick search" type="search" name="q" id="search-box" />
<input type="submit" value="Go" />
</form>
</div>
|
</li>
<li class="right">
<label class="theme-selector-label">
Theme
<select class="theme-selector" oninput="activateTheme(this.value)">
<option value="auto" selected>Auto</option>
<option value="light">Light</option>
<option value="dark">Dark</option>
</select>
</label> |</li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<section id="module-urllib.robotparser">
<span id="urllib-robotparser-parser-for-robots-txt"></span><h1><code class="xref py py-mod docutils literal notranslate"><span class="pre">urllib.robotparser</span></code> — Parser for robots.txt<a class="headerlink" href="#module-urllib.robotparser" title="Link to this heading"></a></h1>
<p><strong>Source code:</strong> <a class="extlink-source reference external" href="https://github.com/python/cpython/tree/3.13/Lib/urllib/robotparser.py">Lib/urllib/robotparser.py</a></p>
<hr class="docutils" id="index-0" />
<p>This module provides a single class, <a class="reference internal" href="#urllib.robotparser.RobotFileParser" title="urllib.robotparser.RobotFileParser"><code class="xref py py-class docutils literal notranslate"><span class="pre">RobotFileParser</span></code></a>, which answers
questions about whether or not a particular user agent can fetch a URL on the
web site that published the <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file. For more details on the
structure of <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> files, see <a class="reference external" href="http://www.robotstxt.org/orig.html">http://www.robotstxt.org/orig.html</a>.</p>
<dl class="py class">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser">
<em class="property"><span class="k"><span class="pre">class</span></span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">urllib.robotparser.</span></span><span class="sig-name descname"><span class="pre">RobotFileParser</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">url</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">''</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser" title="Link to this definition"></a></dt>
<dd><p>This class provides methods to read, parse and answer questions about the
<code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file at <em>url</em>.</p>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.set_url">
<span class="sig-name descname"><span class="pre">set_url</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">url</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.set_url" title="Link to this definition"></a></dt>
<dd><p>Sets the URL referring to a <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file.</p>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.read">
<span class="sig-name descname"><span class="pre">read</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.read" title="Link to this definition"></a></dt>
<dd><p>Reads the <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> URL and feeds it to the parser.</p>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.parse">
<span class="sig-name descname"><span class="pre">parse</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">lines</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.parse" title="Link to this definition"></a></dt>
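<dd><p>Parses the <em>lines</em> argument, an iterable of the lines of a <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file.</p>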
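<p>As a minimal sketch, <code class="docutils literal notranslate"><span class="pre">parse()</span></code> can be fed rules held in memory instead of rules fetched over the network. The rules, the <code class="docutils literal notranslate"><span class="pre">ExampleBot</span></code> agent name, and the <code class="docutils literal notranslate"><span class="pre">example.com</span></code> URLs below are assumptions made up for illustration.</p>

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied as a list of lines.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)  # no network access is needed

# ExampleBot and example.com are illustrative names only.
allowed_home = rp.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = rp.can_fetch("ExampleBot", "https://example.com/private/data.html")
print(allowed_home, allowed_private)
```

<p>Parsing from memory like this can be handy in unit tests, where fetching a live <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> file is undesirable.</p>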
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.can_fetch">
<span class="sig-name descname"><span class="pre">can_fetch</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">useragent</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">url</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.can_fetch" title="Link to this definition"></a></dt>
<dd><p>Returns <code class="docutils literal notranslate"><span class="pre">True</span></code> if the <em>useragent</em> is allowed to fetch the <em>url</em>
according to the rules contained in the parsed <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code>
file.</p>
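<p>The sketch below illustrates how <code class="docutils literal notranslate"><span class="pre">can_fetch()</span></code> might answer when the rules contain both a group for a named agent and a catch-all <code class="docutils literal notranslate"><span class="pre">*</span></code> group. The rules and the agent names <code class="docutils literal notranslate"><span class="pre">ExampleBot</span></code> and <code class="docutils literal notranslate"><span class="pre">OtherBot</span></code> are hypothetical.</p>

```python
import urllib.robotparser

# Hypothetical rules: one group for ExampleBot, one catch-all group.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: ExampleBot",
    "Disallow: /drafts/",
    "",
    "User-agent: *",
    "Disallow: /",
])

# The group naming ExampleBot is consulted before the "*" group.
named_ok = rp.can_fetch("ExampleBot/1.0", "https://example.com/news/")
named_blocked = rp.can_fetch("ExampleBot/1.0", "https://example.com/drafts/a.html")

# An agent with no group of its own falls back to the "*" group.
other_blocked = rp.can_fetch("OtherBot", "https://example.com/news/")
print(named_ok, named_blocked, other_blocked)
```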
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.mtime">
<span class="sig-name descname"><span class="pre">mtime</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.mtime" title="Link to this definition"></a></dt>
<dd><p>Returns the time the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> file was last fetched. This is
useful for long-running web spiders that need to check for new
<code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> files periodically.</p>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.modified">
<span class="sig-name descname"><span class="pre">modified</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.modified" title="Link to this definition"></a></dt>
<dd><p>Sets the time the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> file was last fetched to the current
time.</p>
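<p>One way a long-running spider could use <code class="docutils literal notranslate"><span class="pre">mtime()</span></code> and <code class="docutils literal notranslate"><span class="pre">modified()</span></code> is sketched below. The helper <code class="docutils literal notranslate"><span class="pre">is_stale()</span></code> and the one-hour limit are assumptions, not part of the module, and the check for <code class="docutils literal notranslate"><span class="pre">0</span></code> relies on CPython's implementation, where <code class="docutils literal notranslate"><span class="pre">mtime()</span></code> is <code class="docutils literal notranslate"><span class="pre">0</span></code> until the rules have been parsed or <code class="docutils literal notranslate"><span class="pre">modified()</span></code> has been called.</p>

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 3600  # hypothetical refresh interval

def is_stale(parser, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the parser has no rules yet or they are too old."""
    now = time.time() if now is None else now
    last_checked = parser.mtime()
    # In CPython, mtime() is 0 until modified() or parse() has been called.
    return last_checked == 0 or now - last_checked > max_age

rp = urllib.robotparser.RobotFileParser()
stale_before = is_stale(rp)
rp.modified()  # normally parse() records this timestamp itself
stale_after = is_stale(rp)
print(stale_before, stale_after)
```

<p>A crawler could call <code class="docutils literal notranslate"><span class="pre">read()</span></code> again whenever such a check reports that the rules are stale.</p>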
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.crawl_delay">
<span class="sig-name descname"><span class="pre">crawl_delay</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">useragent</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.crawl_delay" title="Link to this definition"></a></dt>
<dd><p>Returns the value of the <code class="docutils literal notranslate"><span class="pre">Crawl-delay</span></code> parameter from <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code>
for the <em>useragent</em> in question. Returns <code class="docutils literal notranslate"><span class="pre">None</span></code> if there is no such parameter, if it
doesn't apply to the specified <em>useragent</em>, or if the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> entry
for this parameter has invalid syntax.</p>
<div class="versionadded">
<p><span class="versionmodified added">Added in version 3.6.</span></p>
</div>
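<p>Because <code class="docutils literal notranslate"><span class="pre">crawl_delay()</span></code> may return <code class="docutils literal notranslate"><span class="pre">None</span></code>, callers usually need a fallback. The sketch below shows one possible approach; the helper <code class="docutils literal notranslate"><span class="pre">effective_delay()</span></code>, the one-second default, and the rules are all hypothetical.</p>

```python
import urllib.robotparser

DEFAULT_DELAY = 1.0  # hypothetical fallback, in seconds

def effective_delay(parser, useragent, default=DEFAULT_DELAY):
    """Use the site's Crawl-delay if present, else a local default."""
    delay = parser.crawl_delay(useragent)
    return default if delay is None else delay

# Hypothetical rules that declare a crawl delay.
with_delay = urllib.robotparser.RobotFileParser()
with_delay.parse(["User-agent: *", "Crawl-delay: 5", "Disallow: /tmp/"])

# Hypothetical rules that do not declare one.
without_delay = urllib.robotparser.RobotFileParser()
without_delay.parse(["User-agent: *", "Disallow: /tmp/"])

delay_a = effective_delay(with_delay, "ExampleBot")
delay_b = effective_delay(without_delay, "ExampleBot")
print(delay_a, delay_b)
```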
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.request_rate">
<span class="sig-name descname"><span class="pre">request_rate</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">useragent</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.request_rate" title="Link to this definition"></a></dt>
<dd><p>Returns the contents of the <code class="docutils literal notranslate"><span class="pre">Request-rate</span></code> parameter from
<code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> as a <a class="reference internal" href="../glossary.html#term-named-tuple"><span class="xref std std-term">named tuple</span></a> <code class="docutils literal notranslate"><span class="pre">RequestRate(requests,</span> <span class="pre">seconds)</span></code>.
Returns <code class="docutils literal notranslate"><span class="pre">None</span></code> if there is no such parameter, if it doesn't apply to the
specified <em>useragent</em>, or if the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> entry for this parameter has invalid
syntax.</p>
<div class="versionadded">
<p><span class="versionmodified added">Added in version 3.6.</span></p>
</div>
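<p>A <code class="docutils literal notranslate"><span class="pre">RequestRate</span></code> value can be turned into an average pause between requests by dividing <code class="docutils literal notranslate"><span class="pre">seconds</span></code> by <code class="docutils literal notranslate"><span class="pre">requests</span></code>. The following is only a sketch of that arithmetic; the helper name, the fallback value, and the rules are assumptions.</p>

```python
import urllib.robotparser

def seconds_between_requests(parser, useragent, fallback=1.0):
    """Average spacing implied by Request-rate, or a fallback if unset."""
    rate = parser.request_rate(useragent)
    if rate is None or rate.requests == 0:
        return fallback
    return rate.seconds / rate.requests

# Hypothetical rules: at most 3 requests every 20 seconds.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Request-rate: 3/20", "Disallow: /tmp/"])

rate = rp.request_rate("ExampleBot")
interval = seconds_between_requests(rp, "ExampleBot")
print(rate.requests, rate.seconds, interval)
```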
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="urllib.robotparser.RobotFileParser.site_maps">
<span class="sig-name descname"><span class="pre">site_maps</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#urllib.robotparser.RobotFileParser.site_maps" title="Link to this definition"></a></dt>
<dd><p>Returns the contents of the <code class="docutils literal notranslate"><span class="pre">Sitemap</span></code> parameter from
<code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> as a <a class="reference internal" href="stdtypes.html#list" title="list"><code class="xref py py-func docutils literal notranslate"><span class="pre">list()</span></code></a>. Returns <code class="docutils literal notranslate"><span class="pre">None</span></code> if there is no such
parameter or if the <code class="docutils literal notranslate"><span class="pre">robots.txt</span></code> entry for this parameter has
invalid syntax.</p>
<div class="versionadded">
<p><span class="versionmodified added">Added in version 3.8.</span></p>
</div>
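<p>Since <code class="docutils literal notranslate"><span class="pre">site_maps()</span></code> returns <code class="docutils literal notranslate"><span class="pre">None</span></code> rather than an empty list when nothing is declared, code that iterates over the result may want to normalize it first. A minimal sketch, using hypothetical rules and sitemap URLs:</p>

```python
import urllib.robotparser

# Hypothetical rules that declare two sitemaps.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Sitemap: https://example.com/sitemap.xml",
    "Sitemap: https://example.com/news-sitemap.xml",
])

# Normalize None to an empty list so the loop below is always safe.
sitemaps = rp.site_maps() or []
for sitemap_url in sitemaps:
    print(sitemap_url)

# A parser that has seen no Sitemap lines reports None.
no_sitemaps = urllib.robotparser.RobotFileParser().site_maps()
```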
</dd></dl>
</dd></dl>
<p>The following example demonstrates basic use of the <a class="reference internal" href="#urllib.robotparser.RobotFileParser" title="urllib.robotparser.RobotFileParser"><code class="xref py py-class docutils literal notranslate"><span class="pre">RobotFileParser</span></code></a>
class:</p>
<div class="highlight-python3 notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span><span class="w"> </span><span class="nn">urllib.robotparser</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">robotparser</span><span class="o">.</span><span class="n">RobotFileParser</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">set_url</span><span class="p">(</span><span class="s2">&quot;http://www.musi-cal.com/robots.txt&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rrate</span> <span class="o">=</span> <span class="n">rp</span><span class="o">.</span><span class="n">request_rate</span><span class="p">(</span><span class="s2">&quot;*&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rrate</span><span class="o">.</span><span class="n">requests</span>
<span class="go">3</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rrate</span><span class="o">.</span><span class="n">seconds</span>
<span class="go">20</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">crawl_delay</span><span class="p">(</span><span class="s2">&quot;*&quot;</span><span class="p">)</span>
<span class="go">6</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">can_fetch</span><span class="p">(</span><span class="s2">&quot;*&quot;</span><span class="p">,</span> <span class="s2">&quot;http://www.musi-cal.com/cgi-bin/search?city=San+Francisco&quot;</span><span class="p">)</span>
<span class="go">False</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rp</span><span class="o">.</span><span class="n">can_fetch</span><span class="p">(</span><span class="s2">&quot;*&quot;</span><span class="p">,</span> <span class="s2">&quot;http://www.musi-cal.com/&quot;</span><span class="p">)</span>
<span class="go">True</span>
</pre></div>
</div>
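<p>The example above passes the <code class="file docutils literal notranslate"><span class="pre">robots.txt</span></code> URL by hand. As a rough sketch, that URL could instead be derived from any page URL on the same site, and the fetch could be guarded against network failures. The helpers <code class="docutils literal notranslate"><span class="pre">robots_url_for()</span></code> and <code class="docutils literal notranslate"><span class="pre">load_parser()</span></code> are hypothetical, and <code class="docutils literal notranslate"><span class="pre">read()</span></code> may raise errors other than the one caught here.</p>

```python
import urllib.error
import urllib.robotparser
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url):
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def load_parser(page_url):
    """Return a parser for page_url's site, or None if the fetch fails."""
    rp = urllib.robotparser.RobotFileParser(robots_url_for(page_url))
    try:
        rp.read()  # performs a network request
    except urllib.error.URLError:
        # For example, DNS failures or refused connections.
        return None
    return rp

# Deriving the URL needs no network access.
robots_url = robots_url_for(
    "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco"
)
print(robots_url)
```

<p>Returning <code class="docutils literal notranslate"><span class="pre">None</span></code> on failure is a design choice of this sketch; a crawler might equally decide to retry later or to treat the site as off limits.</p>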
</section>
<div class="clearer"></div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="Main">
<div class="sphinxsidebarwrapper">
<div>
<h4>Previous topic</h4>
<p class="topless"><a href="urllib.error.html"
title="previous chapter"><code class="xref py py-mod docutils literal notranslate"><span class="pre">urllib.error</span></code> — Exception classes raised by urllib.request</a></p>
</div>
<div>
<h4>Next topic</h4>
<p class="topless"><a href="http.html"
title="next chapter"><code class="xref py py-mod docutils literal notranslate"><span class="pre">http</span></code> — HTTP modules</a></p>
</div>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="../bugs.html">Report a Bug</a></li>
<li>
<a href="https://github.com/python/cpython/blob/main/Doc/library/urllib.robotparser.rst"
rel="nofollow">Show Source
</a>
</li>
</ul>
</div>
</div>
<div id="sidebarbutton" title="Collapse sidebar">
<span>«</span>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related" role="navigation" aria-label="Related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="../py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="http.html" title="http — HTTP modules"
>next</a> |</li>
<li class="right" >
<a href="urllib.error.html" title="urllib.error — Exception classes raised by urllib.request"
>previous</a> |</li>
<li><img src="../_static/py.svg" alt="Python logo" style="vertical-align: middle; margin-top: -1px"/></li>
<li><a href="https://www.python.org/">Python</a> &#187;</li>
<li class="switchers">
<div class="language_switcher_placeholder"></div>
<div class="version_switcher_placeholder"></div>
</li>
<li>
</li>
<li id="cpython-language-and-version">
<a href="../index.html">3.13.3 Documentation</a> &#187;
</li>
<li class="nav-item nav-item-1"><a href="index.html" >The Python Standard Library</a> &#187;</li>
<li class="nav-item nav-item-2"><a href="internet.html" >Internet Protocols and Support</a> &#187;</li>
<li class="nav-item nav-item-this"><a href=""><code class="xref py py-mod docutils literal notranslate"><span class="pre">urllib.robotparser</span></code> — Parser for robots.txt</a></li>
<li class="right">
<div class="inline-search" role="search">
<form class="inline-search" action="../search.html" method="get">
<input placeholder="Quick search" aria-label="Quick search" type="search" name="q" id="search-box" />
<input type="submit" value="Go" />
</form>
</div>
|
</li>
<li class="right">
<label class="theme-selector-label">
Theme
<select class="theme-selector" oninput="activateTheme(this.value)">
<option value="auto" selected>Auto</option>
<option value="light">Light</option>
<option value="dark">Dark</option>
</select>
</label> |</li>
</ul>
</div>
<div class="footer">
&copy;
<a href="../copyright.html">
Copyright
</a>
2001-2025, Python Software Foundation.
<br />
This page is licensed under the Python Software Foundation License Version 2.
<br />
Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License.
<br />
See <a href="/license.html">History and License</a> for more information.<br />
<br />
The Python Software Foundation is a non-profit corporation.
<a href="https://www.python.org/psf/donations/">Please donate.</a>
<br />
<br />
Last updated on Apr 08, 2025 (14:33 UTC).
<a href="/bugs.html">Found a bug</a>?
<br />
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 8.2.3.
</div>
</body>
</html>