Programmer's Python Async - Asyncio Web Client
Written by Mike James
Wednesday, 21 December 2022
When the server receives the GET request it finds the specified file and sends it to the client using the same socket connection. The first part of the message sent to the client is a set of headers which we need to read and process. The first line of any response is always of the form:

    HTTP/1.1 200 OK\r\n

which gives the HTTP version and the status code, which we can assume is going to be 200, i.e. no error. If you want to write a complete client you need to extract the status code and react to it. In our simple demonstration we can read it and ignore it:

    headers = ""
    line = await reader.readline()

Next we need to read the headers that the server has sent. These arrive one to a line and the end is marked by a blank line, just like the headers we sent to the server:

    while True:
        line = await reader.readline()
        line = line.decode('ascii')
        if line == "\r\n":
            break
        headers += line

This loop reads each line in turn, converts it to a Python string using ASCII encoding and builds up a complete string of headers. The loop ends when we read a blank line.

We need to process the headers because the Content-Length header tells us how many bytes to read to get the content, i.e. the HTML that makes up the page. We need this because we cannot simply read until an EOF signal, as there isn't one. The socket stays open in case you have another request to send to the server. If you do wait for an EOF then you will usually wait a long time before the server times out.

We could use some simple string manipulation to extract the Content-Length header, but there is a standard way to parse HTTP headers, even if it is obscure because it is part of the email module. It turns out that HTTP headers use the same format as email headers and hence you can use email.message_from_string to parse them:

    def parseHeaders(headers):
        message = email.message_from_string(headers)
        return dict(message.items())

This utility function returns all of the headers as a dictionary keyed on the header names, with the strings they are set to as values. Now we can use this to get the Content-Length header:

    headers = parseHeaders(headers)
    length = int(headers["Content-Length"])

As we now know the number of bytes to read, the rest of the procedure is simple:

    line = await reader.read(length)
    line = line.decode('utf8')
    writer.close()
    await writer.wait_closed()
    return line

This time we decode the content using UTF-8 because this is what most modern web pages use for their content. To check, we can look at the Content-Type header, which in this case reads:

    Content-Type: text/html; charset=UTF-8

So the content is HTML and it is UTF-8 encoded.

To demonstrate all of this we need a coroutine to start things off:

    async def main():
        start = time.perf_counter()
        results = await asyncio.gather(
            download('http://www.example.com/'),
            download('http://www.example.com/')
        )
        print(time.perf_counter() - start)

This creates two tasks to download the same page, starts them both off asynchronously and waits for them to complete. Whenever one of the tasks has to wait for data to be available it releases the main thread and the other gets a chance to run, and so on. As a result main mostly has little to do and you can increase the number of downloads without increasing the time taken by much. For example, adding an additional download to the asynchronous program on a test machine increases the time it takes by about 30 ms, whereas for a synchronous program it adds 220 ms. This means that downloading 100 pages takes about 3 seconds asynchronously, but 21 seconds doing the job synchronously.
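If you want to try the timing for yourself, something along the following lines will do it. This is just a sketch, not the book's listing, and it assumes the download coroutine developed above is in scope:

    # A sketch of scaling the demonstration up to 100 concurrent downloads.
    # Assumes the download coroutine developed above is in scope.
    async def main():
        start = time.perf_counter()
        results = await asyncio.gather(
            *(download('http://www.example.com/') for _ in range(100))
        )
        print(len(results), "pages in", time.perf_counter() - start, "seconds")

Because gather accepts any number of awaitables, the only change needed to go from two downloads to 100 is unpacking a generator of download calls.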
The complete program uses the following imports:

    import asyncio
    import urllib.parse
    import time
    import email

The full listing is in the chapter, but not in this extract.
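As the full listing isn't included here, the following is a rough sketch of how the fragments above fit together. The opening of download, which connects to the server and sends the GET request, is covered on earlier pages not included in this extract, so the connection and request code shown here is an assumption rather than the book's exact code:

    import asyncio
    import urllib.parse
    import time
    import email

    def parseHeaders(headers):
        message = email.message_from_string(headers)
        return dict(message.items())

    async def download(url):
        url = urllib.parse.urlsplit(url)
        # Assumed: open a socket to the server and send a GET request.
        # The article's own version of these lines is on an earlier page.
        reader, writer = await asyncio.open_connection(url.hostname, 80)
        request = f"GET {url.path} HTTP/1.1\r\nHost: {url.hostname}\r\n\r\n"
        writer.write(request.encode('ascii'))
        await writer.drain()
        # Read and ignore the status line, e.g. HTTP/1.1 200 OK
        line = await reader.readline()
        # Read headers up to the blank line that ends them
        headers = ""
        while True:
            line = await reader.readline()
            line = line.decode('ascii')
            if line == "\r\n":
                break
            headers += line
        headers = parseHeaders(headers)
        length = int(headers["Content-Length"])
        # Read exactly Content-Length bytes of content
        line = await reader.read(length)
        line = line.decode('utf8')
        writer.close()
        await writer.wait_closed()
        return line

    async def main():
        start = time.perf_counter()
        results = await asyncio.gather(
            download('http://www.example.com/'),
            download('http://www.example.com/')
        )
        print(time.perf_counter() - start)

    asyncio.run(main())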