spiderman

The ability to script web requests is very useful for automating tasks and saving yourself time.  Several Python modules make it easier to create and parse web requests (httplib, Mechanize, Spider, Beautiful Soup, and urllib).  This blog post will explore some basics of making web requests with Python.

Making a Web Request:

Below is a screenshot illustrating the syntax for making a web request against a local web server running Python's SimpleHTTPServer:

[Screenshot: make_web_request]
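
If you want to try this without the screenshot, a minimal equivalent might look like the following (a sketch that assumes Python 2's urllib2 and a local SimpleHTTPServer on port 8000; the port and URL are illustrative):

#!/usr/bin/python
# Make a simple GET request against a local web server started with:
#   python -m SimpleHTTPServer 8000
import urllib2

url = "http://127.0.0.1:8000/"
response = urllib2.urlopen(url)   # send the GET request
print response.getcode()          # HTTP status code (200 if it worked)
print response.read()             # response body (directory listing or HTML)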

Parsing HTML:

Now that we are able to make web requests with Python, let's look at a module to help parse HTML. BeautifulSoup is a very useful module for parsing HTML based on its tags. Below are some examples that might be helpful for your HTML parsing needs:

[Screenshot: beautifulSoup2]

The power of BeautifulSoup comes from its ability to parse HTML based on tags.  For example, you can grab every iframe on a page with the "find_all" function of a BeautifulSoup instance: iframes = parsed.find_all('iframe').
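
As a rough sketch of that kind of usage (the sample HTML below is made up for illustration and assumes the bs4 package is installed):

#!/usr/bin/python
# Parse a small chunk of HTML and pull content out by tag name.
from bs4 import BeautifulSoup

html = """
<html><body>
<h2>example.com</h2>
<iframe src="http://example.com/frame.html"></iframe>
</body></html>
"""

parsed = BeautifulSoup(html, 'html.parser')
print parsed.h2.string                  # text inside the first <h2> tag
iframes = parsed.find_all('iframe')     # every <iframe> tag in the page
for tag in iframes:
    print tag.get('src')                # the src attribute of each iframe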

Practical Application:

Very often you will find a web resource that you want to query many times. This is where Python scripting comes into play to help you automate the task. One web resource I find myself using often is iplist.net, which can show the various domain names pointing to a given IP address.

When starting your script you’ll want to consider two things:

  1. The structure of the URL for the request.
  2. What portion of the response is interesting to you: you might be able to pull the interesting part out by an HTML tag, or you may have to lean more toward regular expressions.

The structure of an iplist.net URL is fairly simple, "http://iplist.net/<ip>/", so we can quite easily read IPs in from a file and loop through them, making a request for each one. Next, examine the source code of the response to see which portion is interesting to you. In this example the domain names sit inside HTML header tags ("<h2>domain_name</h2>"), so we can use BeautifulSoup to extract just that portion from the page. The example below gets you started with this script; from here you could extract just the domains and print them to STDOUT:

[Screenshot: iplist]
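
A hedged sketch of such a script might look like the following (the input file name ips.txt is hypothetical, and the <h2> extraction assumes the page layout described above):

#!/usr/bin/python
# Look up each IP in a file against iplist.net and print the domain names
# found in the <h2> tags of the response.
import urllib2
from bs4 import BeautifulSoup

def lookup(ip_file):
    for line in open(ip_file, 'r'):
        ip = line.strip()
        url = "http://iplist.net/" + ip + "/"        # iplist.net URL structure
        page = urllib2.urlopen(url).read()           # fetch the page source
        parsed = BeautifulSoup(page, 'html.parser')
        print "[+] Domains for " + ip
        for header in parsed.find_all('h2'):         # domains live in <h2> tags
            print header.string

if __name__ == "__main__":
    lookup('ips.txt')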

Firebug is a useful tool when analyzing the source code of a web application. Below you can see how it highlights on the page the element that a given portion of the source code corresponds to:

[Screenshot: firebug]

Spidering:

Spidering a web application is the process of enumerating linked content by following the links within the application to help build a site map.  Spidering is a good use case for a quick Python script.  You could create a crawler script yourself by parsing the href tags in each response and then issuing additional requests for the links you find, as shown in the rough sketch below.
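
This is only a minimal illustration of that manual approach (it assumes urllib2 and BeautifulSoup, only follows links one level deep, and leaves out de-duplication and scope checks):

#!/usr/bin/python
# Minimal "manual" crawler: fetch a page, pull out every href, and request
# each absolute link once.
import urllib2
from bs4 import BeautifulSoup

def crawl(url):
    page = urllib2.urlopen(url).read()
    parsed = BeautifulSoup(page, 'html.parser')
    for tag in parsed.find_all('a'):
        link = tag.get('href')
        if link and link.startswith('http'):
            print "[+] Found link: " + link
            urllib2.urlopen(link).read()        # request the linked content

crawl('http://www.example.com')

You can also leverage a Python module called "Spider" to do it in fewer lines of code: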

[Screenshot: spider_crawler]

There are several options you can configure for how the spider will work, passed as arguments to the myspider function, e.g. myspider(b=URL.strip(), w=200, d=5, t=5).  The function returns two lists: child URLs and paths.  The arguments are:
b — base web URL (default: None)
w — amount of resources to crawl (default: 200)
d — depth in hierarchy to crawl (default: 5)
t — number of threads (default: None)

This blog post was just a quick look at interacting with web resources by leveraging Python.  Many more advanced use cases exist for scripting web resource interaction.  Future blog posts will demonstrate some of these more advanced use cases by scripting attacks against web servers.

Code snippet leveraging the Python spider module:

#!/usr/bin/python
from spider import webspider as myspider
import sys, optparse

def crawler(URLs):
    # Read one URL per line from the input file and crawl each one.
    for line in open(URLs, 'r'):
        URL = line.strip()
        # Crawl up to 200 resources, 5 levels deep, with 5 threads.
        # myspider returns two lists: child URLs and paths.
        links = myspider(b=URL, w=200, d=5, t=5)
        link_count = len(links[0])
        print "[+] Web Crawl Results for: " + URL
        print URL + ": has a link count of " + str(link_count)
        for item in links[1]:
            print item

def main():
    parser = optparse.OptionParser(sys.argv[0] + ' -r <file_with_URLs>')
    parser.add_option('-r', dest='URLs', type='string',
                      help='specify target file with URLs')
    (options, args) = parser.parse_args()
    URLs = options.URLs

    if URLs is None:
        # No input file supplied; show usage and exit.
        print parser.usage
        sys.exit(0)
    else:
        crawler(URLs)

if __name__ == "__main__":
    main()