.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <https://agileabstractions.com/>`_


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in a URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.
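
To make the idea of a "URL scheme" concrete, the scheme can be read off a URL
with ``urllib.parse.urlsplit``. This is only a small sketch: it reuses the
example URLs from this HOWTO and does not fetch anything.

```python
from urllib.parse import urlsplit

# The scheme is the part of the URL before the first ":".
for url in ("ftp://python.org/", "http://python.org/"):
    parts = urlsplit(url)
    print(parts.scheme, parts.netloc)

ftp_scheme = urlsplit("ftp://python.org/").scheme
http_scheme = urlsplit("http://python.org/").scheme
```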


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making.
In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://python.org/')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request.
This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii') # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib uses a **GET** request.
One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
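
The difference between the two request types can be inspected without sending
anything. The following is a minimal sketch: it builds one ``Request`` with the
values encoded in the URL and one with the same values passed as ``data``, then
asks each which HTTP method it would use. The URL and values are placeholders.

```python
import urllib.parse
import urllib.request

url = 'http://www.example.com/example.cgi'
values = {'name': 'Somebody Here', 'language': 'Python'}
query = urllib.parse.urlencode(values)

# No data argument: the request would be sent as a GET.
get_req = urllib.request.Request(url + '?' + query)

# Bytes passed as data: the request would be sent as a POST.
post_req = urllib.request.Request(url, data=query.encode('ascii'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Neither request is opened here; passing either object to ``urlopen`` would
perform the actual fetch.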

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_.
::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python' }
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.
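
The relationship between the two exception classes can be checked directly.
A minimal sketch, using only the classes named above:

```python
import urllib.error

# HTTPError derives from URLError, so "except URLError" also catches it.
is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)
print(is_subclass)

# URLError is itself a subclass of the built-in OSError.
is_os_error = issubclass(urllib.error.URLError, OSError)
print(is_os_error)
```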

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on the
page returned. This means that as well as the code attribute, it also has the read,
geturl, and info methods, as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...
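
The same behaviour can be tried without a network connection. The sketch below
builds an :exc:`HTTPError` by hand as a stand-in for the one ``urlopen`` would
raise; the URL and the page body are made up for illustration. It then reads
the ``code`` attribute, looks the code up in
``BaseHTTPRequestHandler.responses``, and reads the error page from the
exception itself.

```python
import email.message
import io
import urllib.error
from http.server import BaseHTTPRequestHandler

# Hand-built error standing in for a real server response (assumption:
# the URL and body below are invented, nothing is fetched).
body = io.BytesIO(b'<html><title>Page Not Found</title></html>')
error = urllib.error.HTTPError(
    'http://www.example.com/fish.html',  # hypothetical URL
    404,
    'Not Found',
    email.message.Message(),  # empty set of headers
    body,
)

try:
    raise error
except urllib.error.HTTPError as e:
    code = e.code
    # Each entry has the form (shortmessage, longmessage).
    short_message, long_message = BaseHTTPRequestHandler.responses[code]
    page = e.read()

print(code, short_message)
print(page)
```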

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::


    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine
        pass


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        # Check for 'code' first: an HTTPError has both 'code' and 'reason'.
        if hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        elif hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
    else:
        # everything is fine
        pass


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods :meth:`info` and :meth:`geturl` and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.
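
Both methods can be tried offline. The sketch below opens a ``data:`` URL, an
assumption made purely so that no server is needed. In that case the headers
are put together by urllib itself rather than sent by a server, so the object
returned by ``info`` is dictionary-like but not necessarily an
``HTTPMessage``.

```python
import urllib.request

# A data: URL keeps the example offline; nothing goes over the network.
url = 'data:text/plain;charset=utf-8,hello'

with urllib.request.urlopen(url) as response:
    final_url = response.geturl()   # URL actually opened
    headers = response.info()       # dictionary-like header object
    content_type = headers['Content-type']
    body = response.read()

print(final_url)
print(content_type)
print(body)
```

With an HTTP URL the calls are the same; ``geturl`` would then report the URL
reached after any redirects.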

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <https://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call.
``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.


Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject -
including an explanation of how Basic Authentication works - see the `Basic
Authentication Tutorial
<https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`__.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME realm="REALM"``.

e.g.

.. code-block:: none

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
in the absence of you providing an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match.
::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.
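
The matching rules and the default handler set can both be inspected without
contacting a server. The following is a minimal sketch; the credentials
``'joe'`` and ``'secret'``, the realm name, and the two lookup URLs are
placeholders chosen for illustration.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Placeholder credentials, registered for any realm (None).
password_mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

# A URL "deeper" than the registered one matches, whatever the realm.
deeper = password_mgr.find_user_password(
    'Any Realm', 'http://example.com/foo/bar.html')
# A URL outside the registered prefix does not match.
outside = password_mgr.find_user_password(
    'Any Realm', 'http://example.com/other/')

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(handler)
# Names of the handlers the opener ended up with (defaults plus ours).
handler_names = sorted(type(h).__name__ for h in opener.handlers)

print(deeper, outside)
print(handler_names)
```

The printed handler list will vary with the environment, for instance
depending on whether a proxy setting is detected.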
``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number). The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
not correct.


Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to bypass the proxy is to set up our own
``ProxyHandler``, with no proxies defined. This is done using similar steps to
setting up a `Basic Authentication`_ handler: ::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

.. note::

   Currently ``urllib.request`` *does not* support fetching of ``https`` locations
   through a proxy.  However, this can be enabled by extending urllib.request as
   shown in the recipe [#]_.

.. note::

   ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
   the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang.
``urllib.request.urlopen`` accepts an optional ``timeout`` argument (in
seconds) that applies to an individual request.
However, you can set the default timeout globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.