This tutorial demonstrates how to make web requests using Python. Several Python modules make it easier to create web requests and parse the responses (httplib, Mechanize, Beautiful Soup, and urllib/urllib2). Install these modules and check out their functionality.
Making a Web Request:
Below is a screen shot illustrating the syntax for creating a web request against a local web server running with Python’s SimpleHTTPServer:
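As a rough sketch of what such a request looks like in code: the modules named in this tutorial (urllib2, SimpleHTTPServer) are Python 2 modules, so this example swaps in their Python 3 standard-library equivalents, urllib.request and http.server. The helper names fetch and start_local_server are made up for illustration, and the server simply serves the current directory on a free local port.

```python
import http.server
import threading
import urllib.request


def fetch(url):
    """Send a GET request and return (status code, headers, body text)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, response.headers, body


def start_local_server():
    """Serve the current directory from a background thread (hypothetical helper)."""
    handler = http.server.SimpleHTTPRequestHandler
    # Port 0 asks the operating system for any free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_local_server()
    port = server.server_address[1]
    status, headers, body = fetch("http://127.0.0.1:%d/" % port)
    print(status)                       # HTTP status code of the response
    print(headers.get("Content-Type"))  # header lookup is case-insensitive
    print(body[:200])                   # first part of the response body
    server.shutdown()
    server.server_close()
```

On Python 2 the same request would go through urllib2.urlopen instead; the response object there exposes getcode() and read() rather than the status attribute used here.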
Now that we are able to make web requests with Python, let’s look at a module to help parse HTML. BeautifulSoup is a very useful module for parsing HTML based on its tags. Below are some examples that might be helpful for some of your HTML parsing needs:
The power of BeautifulSoup comes from the ability to parse HTML based on tags. You can call the “find_all” function on a BeautifulSoup instance, for example “iframes = parsed.find_all(‘iframe’)”, which returns every iframe tag in the parsed document.
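A minimal sketch of the find_all call described above, assuming the bs4 package (BeautifulSoup 4) is installed. The sample HTML and the helper name iframe_sources are invented for illustration; in practice the HTML would come from a web response.

```python
from bs4 import BeautifulSoup

# Made-up page content standing in for a real response body.
SAMPLE_HTML = """
<html><body>
  <h1>Demo page</h1>
  <iframe src="http://example.com/a"></iframe>
  <iframe src="http://example.com/b" width="0" height="0"></iframe>
  <iframe></iframe>
</body></html>
"""


def iframe_sources(html):
    """Return the src attribute of every iframe tag that has one."""
    parsed = BeautifulSoup(html, "html.parser")
    iframes = parsed.find_all("iframe")
    # tag.get() returns None when the attribute is missing, so those are skipped.
    return [tag.get("src") for tag in iframes if tag.get("src")]


if __name__ == "__main__":
    for src in iframe_sources(SAMPLE_HTML):
        print(src)
```

The same pattern works for any tag name: pass a different string to find_all and read whichever attribute or text you need from each result.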
Very often you find a web resource against which you want to make a lot of queries. This is where Python scripting comes into play to help you automate the task. One web resource I find myself using often is iplist.net, which shows the various domain names pointing to a given IP address.
When starting your script you’ll want to consider two things:
- Structure of the URL with the request.
- What portion of the response is interesting to you – You might be able to pull the interesting part out by an HTML tag, or you may have to lean more towards regular expressions.
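When no convenient tag-based approach fits, the regular-expression route mentioned in the list above might look something like this sketch. It uses only the standard re module; the choice of the title tag and the helper name extract_title are purely illustrative assumptions.

```python
import re

# Non-greedy match of whatever sits between the opening and closing title tags.
# IGNORECASE handles <TITLE>, DOTALL lets the text span several lines.
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title text, or None if no title tag is found."""
    match = TITLE_PATTERN.search(html)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(extract_title("<html><head><title> Demo page </title></head></html>"))
```

Regular expressions are brittle against real-world HTML, so they tend to work best for small, predictable fragments; a tag-based parser is usually the safer first choice.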
The structure of iplist.net is fairly simple, “http://iplist.net/<ip>/”, so we can quite easily read IPs in from a file and loop through them. Next, make a request and examine the source code to see what portion is interesting to you. In this example we can examine the source code and see that each domain name sits inside an HTML heading tag, “<h2>domain_name</h2>”, so we can use BeautifulSoup to extract just this portion from the page. The script below gets you started; from here you could extract just the domains and print them to STDOUT:
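One way such a script could be put together is sketched here, assuming Python 3 with the bs4 package installed. The URL pattern and the h2 tag come from the description above; the function names, the timeout, and the one-IP-per-line input file are assumptions made for illustration, and the site's current layout has not been verified.

```python
import urllib.request

from bs4 import BeautifulSoup

# URL structure as described in the text: http://iplist.net/<ip>/
BASE_URL = "http://iplist.net/%s/"


def build_url(ip):
    """Build the lookup URL for a single IP address."""
    return BASE_URL % ip.strip()


def extract_domains(html):
    """Return the text of every h2 tag, where the text says the domains live."""
    parsed = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in parsed.find_all("h2")]


def lookup(ip):
    """Request the page for one IP and return the domains found on it."""
    with urllib.request.urlopen(build_url(ip), timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_domains(html)


def main(path):
    """Read one IP per line from a file and print ip<TAB>domain to STDOUT."""
    with open(path) as handle:
        for line in handle:
            ip = line.strip()
            if not ip:
                continue  # skip blank lines
            for domain in lookup(ip):
                print("%s\t%s" % (ip, domain))
```

Calling main("ips.txt"), where ips.txt is a hypothetical file of IP addresses, would run the whole loop. Keeping extract_domains separate from the network call makes it easy to try the parsing step against a saved copy of a page first.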
Firebug is a useful tool when analyzing the source code for a web application. Below you can see that it will highlight on the screen what the source code corresponds to:
This is the type of process you’ll go through to parse the responses. Look through the response and see what information you’d like to extract to have printed to STDOUT. Here is a link to a more complex script Primal Security worked up to parse iplist.net