
Spidering:

This Python tutorial introduces two new modules (optparse and spider) to accomplish the task of spidering a web application.  Spidering a web application is the process of enumerating linked content by following links within the application in order to build a site map.  It is a good use case for a quick Python script.

You could create a crawler script by parsing the href tags in each response and then issuing additional requests for the links you find.  You can also leverage a Python module called "spider" to do the same thing in fewer lines of code; the full snippet appears at the end of this post.

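As a rough illustration of the first approach, the sketch below (my own example, not taken from the spider module or the original script) uses only the standard library's urllib2, HTMLParser, and urlparse to pull href values out of each response and queue new requests against the same site:

#!/usr/bin/python
# Rough sketch of a hand-rolled crawler: parse href attributes out of each
# response and request any new links found on the same site.
import urllib2
from HTMLParser import HTMLParser
from urlparse import urljoin

class LinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every anchor tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

def crawl(base_url, max_pages=50):
    # Simple breadth-first crawl: fetch a page, pull out its links,
    # and queue any new same-site links for later requests
    seen = set([base_url])
    queue = [base_url]
    while queue and len(seen) <= max_pages:
        url = queue.pop(0)
        try:
            html = urllib2.urlopen(url, timeout=5).read()
            parser = LinkParser()
            parser.feed(html)
        except Exception:
            continue
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith(base_url) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)

if __name__ == '__main__':
    # http://example.com/ is just a placeholder target
    for link in crawl('http://example.com/'):
        print link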

There are several options that control how the spider works, set through the arguments passed to the myspider function, for example "myspider(b=URL.strip(), w=200, d=5, t=5)".  The function returns two lists: the child URLs and the paths discovered during the crawl.  The arguments are:
b — base web URL (default: None)
w — number of resources to crawl (default: 200)
d — depth in hierarchy to crawl (default: 5)
t — number of threads (default: None)
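
For example, a minimal standalone call might look like the sketch below (an illustrative snippet, not from the spider documentation; it assumes the first returned list holds the child URLs and the second holds the paths, as described above, and uses http://example.com/ as a placeholder target):

from spider import webspider as myspider

# Illustrative sketch: crawl a single site with the arguments described above
child_urls, paths = myspider(b='http://example.com/', w=200, d=5, t=5)

print "[+] Crawled "+str(len(child_urls))+" child URLs"
for path in paths:
    print path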

This blog post was just a quick look at interacting with web resources by leveraging Python.  Many more advanced use cases exist for scripting interaction with web resources.  Future blog posts will demonstrate some of these by scripting attacks against web servers.

Code snippet leveraging the Python spider module:

#!/usr/bin/python
from spider import webspider as myspider
import sys, optparse

def crawler(URLs):
    # Read the target file line by line; each line should hold one base URL
    for line in open(URLs, 'r'):
        URL = line.strip()
        # Crawl up to 200 resources, 5 levels deep, using 5 threads
        links = myspider(b=URL.strip(), w=200, d=5, t=5)
        link_count = len(links[0])
        out = URL+": has a link count of "+str(link_count)
        print "[+] Web Crawl Results for: "+URL
        print out
        # The second list holds the paths discovered during the crawl
        for item in links[1]:
            print item

def main():
    # The optparse module allows you to build command line switches in your scripts.
    # This sets up a '-r' switch whose value is stored in the variable URLs;
    # the file supplied with -r is then opened and each URL in it is spidered.
    parser = optparse.OptionParser(sys.argv[0]+' -r <file_with_URLs>')
    parser.add_option('-r', dest='URLs', type='string',
        help='specify target file with URLs')
    (options, args) = parser.parse_args()
    URLs = options.URLs

    if URLs is None:
        print parser.usage
        sys.exit(0)
    else:
        crawler(URLs)

if __name__ == "__main__":
    main()
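
To try the script, save it under any name (for example spider_crawl.py), create a text file with one base URL per line, and pass that file with the -r switch, e.g. "./spider_crawl.py -r urls.txt".  For each URL, the script prints the link count followed by the paths discovered during the crawl.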