.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <https://agileabstractions.com/>`_


Introduction
============

.. sidebar:: Related Articles

    You may also find the following article on fetching web resources with
    Python useful:

    * `Basic Authentication <https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

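As a minimal sketch of what a "URL scheme" is in practice, the standard
``urllib.parse.urlsplit`` function can pull the scheme out of a URL without
making any network request. The ``http`` and ``file`` URLs below are
illustrative assumptions, not taken from the text above:

```python
from urllib.parse import urlsplit

# The scheme is the part of the URL before the first ":".
urls = ['ftp://python.org/', 'http://python.org/', 'file:///tmp/example.txt']
schemes = [urlsplit(url).scheme for url in urls]
print(schemes)
```
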
For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.).  However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

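To illustrate the point about other schemes, here is a small sketch that
fetches a ``file:`` URL with the same ``urlopen`` call. The temporary file and
its contents are assumptions made purely so the example runs without network
access:

```python
import os
import pathlib
import tempfile
import urllib.request

# Create a small local file so the example needs no network access.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp_file:
    tmp_file.write('hello from a local file')

# Path.as_uri() turns an absolute path into a 'file:' URL.
file_url = pathlib.Path(tmp_file.name).resolve().as_uri()

with urllib.request.urlopen(file_url) as response:
    content = response.read()   # bytes, just as with an HTTP response

os.unlink(tmp_file.name)        # remove the temporary file again
print(content)
```
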
HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://python.org/')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL
schemes.  For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

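As a rough illustration of that shared interface, the sketch below builds two
``Request`` objects and inspects their ``type``, ``host`` and ``full_url``
attributes. Nothing is sent over the network here; the requests are only
constructed:

```python
import urllib.request

http_req = urllib.request.Request('http://python.org/')
ftp_req = urllib.request.Request('ftp://example.com/')

# Constructing a Request does not touch the network; it only parses the URL.
summary = [(req.type, req.host, req.full_url) for req in (http_req, ftp_req)]
print(summary)
```
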
In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server.  Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers".  Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

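As a sketch of a POST that does not come from a form, the example below
prepares a JSON body instead of form-encoded data. The endpoint URL and the
choice of JSON are assumptions for illustration; the request is only built and
inspected, not sent:

```python
import json
import urllib.request

url = 'http://www.example.com/api/register'   # hypothetical endpoint
values = {'name': 'Michael Foord', 'language': 'Python'}

# The body must be bytes; here it is JSON rather than form-encoded data.
body = json.dumps(values).encode('utf-8')
req = urllib.request.Request(
    url, data=body, headers={'Content-Type': 'application/json'})

# Nothing has been sent yet; the Request only records what would be sent.
method = req.get_method()   # 'POST', because data was supplied
print(method, req.data)
```

Passing ``req`` to ``urlopen`` would then send the request in the usual way.
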
If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door).  Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&location=Northampton&language=Python
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> response = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

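The following sketch builds such a URL and then checks, without sending
anything, that the resulting request is a GET and that the query string decodes
back to the original values. The use of ``parse_qs`` for the round trip is an
addition for illustration:

```python
from urllib.parse import parse_qs, urlencode, urlsplit
import urllib.request

data = {'name': 'Somebody Here', 'location': 'Northampton'}
full_url = 'http://www.example.com/example.cgi' + '?' + urlencode(data)

req = urllib.request.Request(full_url)
method = req.get_method()            # 'GET', because no data was passed

# parse_qs reverses urlencode; each value comes back as a list of strings.
decoded = parse_qs(urlsplit(full_url).query)
print(full_url, method, decoded)
```
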
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release, e.g. ``Python-urllib/2.5``), which may confuse
the site, or just plain not work. The way a browser identifies itself is
through the ``User-Agent`` header [#]_. When you create a Request object you
can pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.

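Headers can also be added after a ``Request`` has been created. The sketch
below uses ``add_header`` and then reads the header back; the client name is a
made-up value. One detail worth knowing is that ``Request`` stores header names
in capitalized form, so the lookup uses ``'User-agent'``:

```python
import urllib.request

req = urllib.request.Request('http://www.example.com/')
req.add_header('User-Agent', 'my-client/1.0')   # hypothetical client name

# Request stores header names capitalized ('User-agent'), so lookups
# with get_header() and has_header() need that spelling.
stored = req.header_items()
user_agent = req.get_header('User-agent')
print(stored, user_agent)
```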

Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist.  In this case, the
exception raised will have a 'reason' attribute describing the failure. It is
either a message string or another exception instance, such as an
:exc:`OSError` carrying an error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')

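A :exc:`URLError` can also be provoked without any network at all, which makes
it easier to experiment with. The sketch below uses two deliberately bad URLs
(both made up for this example): one with a scheme no handler knows about, and
one pointing at a local file that is assumed not to exist:

```python
import urllib.error
import urllib.request

reasons = []
for url in ('nosuchscheme://example.com/',             # no handler for scheme
            'file:///no/such/directory/missing.txt'):  # file does not exist
    try:
        urllib.request.urlopen(url)
    except urllib.error.URLError as e:
        # e.reason may be a plain string or another exception instance.
        reasons.append(e.reason)

print([type(reason).__name__ for reason in reasons])
```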

HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

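Rather than copying the table, a program can look codes up at run time. The
sketch below uses the standard ``http.HTTPStatus`` enum and the
``http.client.responses`` mapping; the ``is_error`` helper is a hypothetical
convenience that encodes the 400--599 range mentioned above:

```python
from http import HTTPStatus
import http.client

# HTTPStatus is an enum of status codes; each member carries its phrase.
not_found = HTTPStatus(404)
print(not_found.value, not_found.phrase)

# http.client.responses maps integer codes to the same short phrases.
phrase = http.client.responses[503]


def is_error(code):
    """Return True for client (4xx) and server (5xx) error codes."""
    return 400 <= code <= 599
```
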
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response
object for the page returned. This means that as well as the code attribute, it
also has read, geturl, and info methods, as returned by the ``urllib.response``
module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

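The same behaviour can be tried out without depending on a remote site. The
sketch below is only one possible setup: it starts a throwaway server on the
local machine whose handler (the hypothetical ``AlwaysMissing`` class) answers
every request with a 404, then reads the code and the error page from the
:exc:`HTTPError`. It assumes that loopback connections are allowed:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlwaysMissing(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a 404 page."""

    def do_GET(self):
        self.send_error(404, 'Not Found')

    def log_message(self, format, *args):
        pass  # keep the example quiet


server = HTTPServer(('127.0.0.1', 0), AlwaysMissing)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/fish.html' % server.server_port

# An empty ProxyHandler keeps proxy environment variables out of the way.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
status = body = None
try:
    opener.open(url, timeout=5)
except urllib.error.HTTPError as e:
    status = e.code          # integer status code sent by the server
    body = e.read()          # the error page, as bytes
    e.close()
finally:
    server.shutdown()
    server.server_close()

print(status, len(body))
```
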
Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::


    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)  # someurl: the URL you want to fetch
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine
        pass


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)  # someurl: the URL you want to fetch
    try:
        response = urlopen(req)
    except URLError as e:
        # Check 'code' first: an HTTPError has both 'code' and 'reason'.
        if hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        elif hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
    else:
        # everything is fine
        pass

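Either approach can be wrapped in a small helper. The sketch below follows the
first approach; ``fetch`` is a hypothetical function name, and the two URLs are
chosen only so that the example runs without a network (``data:`` URLs are
decoded locally, and the made-up scheme has no handler):

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return (ok, payload): the body on success, a message on failure."""
    try:
        with urlopen(url) as response:
            return True, response.read()
    except HTTPError as e:      # must come before URLError, its base class
        return False, 'HTTP error %d' % e.code
    except URLError as e:
        return False, 'Failed to reach a server: %s' % e.reason


# 'data:' URLs are handled locally, so this runs without a network.
ok_result = fetch('data:text/plain,hello')
bad_result = fetch('nosuchscheme://example.com/')
print(ok_result, bad_result)
```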

info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <https://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.

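A quick way to try both methods is with a ``data:`` URL, which ``urlopen``
decodes locally. This is only a sketch: for a ``data:`` URL the headers are
synthesised by urllib rather than sent by a server, and the object returned by
``info()`` is a plain email-style message rather than an ``HTTPMessage``, but
it is queried in the same way:

```python
import urllib.request

url = 'data:text/plain;charset=utf-8,hello'   # handled locally, no network
with urllib.request.urlopen(url) as response:
    real_url = response.geturl()      # URL actually fetched
    info = response.info()            # message-like object of headers
    content_type = info['Content-Type']        # lookup is case-insensitive
    content_length = info['content-length']

print(real_url, content_type, content_length)
```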

Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call.  ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.

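The two ways of creating an opener can be sketched as follows. The example
lists the handlers that ``build_opener`` installs by default, then builds a
second opener by hand with a single handler. ``data:`` URLs are used, as an
assumption of this example, so that it runs without a network:

```python
import urllib.request

# build_opener() returns an OpenerDirector with the default handlers.
opener = urllib.request.build_opener()
handler_names = [type(handler).__name__ for handler in opener.handlers]
print(handler_names)

# open() works like urlopen(); no install_opener() call is needed.
with opener.open('data:text/plain,hi') as response:
    payload = response.read()

# The manual route: start empty and add handlers one at a time.
manual = urllib.request.OpenerDirector()
manual.add_handler(urllib.request.DataHandler())
with manual.open('data:text/plain,hi') as response:
    manual_payload = response.read()
```

The hand-built opener here only knows about ``data:`` URLs, which is why
``build_opener`` is usually the more convenient starting point.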

Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`__.

4567db96d56Sopenharmony_ciWhen authentication is required, the server sends a header (as well as the 401
4577db96d56Sopenharmony_cierror code) requesting authentication.  This specifies the authentication scheme
4587db96d56Sopenharmony_ciand a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
4597db96d56Sopenharmony_cirealm="REALM"``.
4607db96d56Sopenharmony_ci
4617db96d56Sopenharmony_cie.g.
4627db96d56Sopenharmony_ci
4637db96d56Sopenharmony_ci.. code-block:: none
4647db96d56Sopenharmony_ci
4657db96d56Sopenharmony_ci    WWW-Authenticate: Basic realm="cPanel Users"
4667db96d56Sopenharmony_ci
4677db96d56Sopenharmony_ci
4687db96d56Sopenharmony_ciThe client should then retry the request with the appropriate name and password
4697db96d56Sopenharmony_cifor the realm included as a header in the request. This is 'basic
4707db96d56Sopenharmony_ciauthentication'. In order to simplify this process we can create an instance of
4717db96d56Sopenharmony_ci``HTTPBasicAuthHandler`` and an opener to use this handler.
4727db96d56Sopenharmony_ci
The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
unless you provide an alternative combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the ``add_password``
method.

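The fallback behaviour can be checked without any network access by asking the
password manager directly. This is a small sketch; the credentials, realm name
and URLs are made-up placeholders.

```python
import urllib.request

# Placeholder credentials and URLs, for illustration only.
mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, "http://example.com/foo/", "alice", "secret")

# No entry exists for "Some Realm", so the default (None) entry is used
# for URLs at or below the registered one.
deeper = mgr.find_user_password("Some Realm", "http://example.com/foo/bar.html")

# URLs outside the registered prefix have no credentials at all.
outside = mgr.find_user_password("Some Realm", "http://example.com/other/")

print(deeper)
print(outside)
```
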
The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``add_password()`` will also match. ::

    import urllib.request

    # username, password and a_url are placeholders for your own values

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number).  The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
not correct.

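A short sketch of the "authority" form, again querying the password manager
directly rather than making a request. The host, port and credentials are
placeholders.

```python
import urllib.request

# Register credentials for an authority (host plus port) instead of a full URL.
# Host, port and credentials are placeholders for illustration.
mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, "example.com:8080", "bob", "hunter2")

# Any path on that host and port matches.
on_port = mgr.find_user_password(None, "http://example.com:8080/any/page")

# The same host on the default HTTP port is a different authority.
other_port = mgr.find_user_password(None, "http://example.com/any/page")

print(on_port)
print(other_port)
```
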

Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected.  Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to bypass the detected proxy settings
is to set up our own ``ProxyHandler``, with no proxies defined. This is done
using similar steps to setting up a `Basic Authentication`_ handler: ::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

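If you would rather not change the behaviour of ``urlopen`` globally, a
possible variation is to keep separate openers and call their ``open`` method
directly. The sketch below also prints the proxy settings urllib has detected;
the proxy address is a made-up placeholder.

```python
import urllib.request

# Proxy settings that urllib detects from the environment or the system.
detected = urllib.request.getproxies()
print("detected proxies:", detected)

# An opener that ignores any detected proxy settings.
direct_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# An opener that uses an explicit proxy; the address is a placeholder.
proxied_opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": "http://proxy.example.com:3128/"})
)

# Neither opener is installed globally, so urlopen() is unaffected.
# Use direct_opener.open(url) or proxied_opener.open(url) as needed.
```
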
.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy.  However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered.  urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. ``urlopen``
accepts an optional ``timeout`` argument (in seconds) for an individual
request. Alternatively, you can set the default timeout globally for all
sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)

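Because ``socket.setdefaulttimeout`` changes a process-wide setting, it can be
useful to restore the previous value afterwards. The following is a minimal
sketch; ``default_socket_timeout`` is a hypothetical helper, not something the
standard library provides.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the global default socket timeout (hypothetical helper)."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore whatever was configured before, including None (no timeout).
        socket.setdefaulttimeout(previous)


before = socket.getdefaulttimeout()
with default_socket_timeout(10):
    # Sockets created in this block, for example by urllib.request.urlopen,
    # start with a 10 second timeout.
    inside = socket.getdefaulttimeout()
after = socket.getdefaulttimeout()

print(before, inside, after)
```

The setting is still global while the block runs, so other threads creating
sockets at the same time will see it too.
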

-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.