
Elasticsearch is a great way to store large amounts of data and retrieve it quickly via Lucene-based searches. Many different tools use Elasticsearch as a backend to speed up lookups and provide a scalable, cloud-like environment for their data.
In this tutorial we will demonstrate a few different ways to query Elasticsearch from Python using the elasticsearch client library.
To start, we need to import the Elasticsearch module and its helpers:
from elasticsearch import Elasticsearch, helpers
Depending on your use case, you will then need to establish the connection to Elasticsearch; storing the connection object in a variable is the easiest way to reuse it:
es = Elasticsearch(['http://192.168.x.x:9200'])
The keyword argument use_ssl=True can also be passed if the connection to your Elasticsearch instance is encrypted.
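As a minimal sketch, an encrypted, authenticated connection might look like the following; the host, certificate behavior, and credentials are placeholders, and the exact keyword arguments vary between client versions:
es = Elasticsearch(
    ['https://192.168.x.x:9200'],
    use_ssl=True,                        # encrypt traffic to the cluster
    verify_certs=True,                   # validate the server certificate
    http_auth=('user', 'password'),      # placeholder credentials
)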
In Elasticsearch, data is organized by index, so getting a list of the indices is useful; the names can be stored as a list in a variable:
indices = list(es.indices.get_aliases().keys())
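Note that get_aliases() only exists on older versions of the Python client; on newer clients the equivalent call is get_alias(), where the '*' pattern simply asks for every index:
indices = list(es.indices.get_alias(index='*').keys())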
Once we’ve collected our indices, we can begin querying the data from each index using a simple for loop:
for each in indices:
    results = es.search(index=each)
    print(results)
Executing this command will return a TON of data! However, you’ll probably notice that this isn’t even close to the full amount of data in your Elasticsearch index, and it includes more fields than you actually care about. By default search() returns only ten hits; to increase that, we can add the size argument to the search, with a maximum value of 10000 (the default index.max_result_window). Even when setting size=10000, it is very likely we will not get all of the data inside the index if it exceeds that window.
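As a quick sketch, a sized search and a peek at the hit count might look like this (note that the shape of hits.total changed in Elasticsearch 7, where it became an object with a value key):
results = es.search(index=each, size=10000)
print(results['hits']['total'])        # total matching documents
for hit in results['hits']['hits']:
    print(hit['_source'])              # the original document body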
To ensure that we can query all of the data, we need to use the scroll API. However, using scroll directly isn’t always pretty, so the developers of Elasticsearch have built a ‘helper’ which automates much of the scroll plumbing and makes it more ‘pythonic’.
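To see why the helper is welcome, here is roughly what a hand-rolled scroll loop looks like; the 2m keep-alive and the page size of 1000 are arbitrary choices:
page = es.search(index=each, scroll='2m', size=1000)
sid = page['_scroll_id']
while page['hits']['hits']:
    for hit in page['hits']['hits']:
        print(hit['_source'])
    page = es.scroll(scroll_id=sid, scroll='2m')   # fetch the next batch
    sid = page['_scroll_id']
es.clear_scroll(scroll_id=sid)                     # free the scroll context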
To perform the same scan as above with the helper, we use two loops: the outer loop calls the helper for each index, and the inner loop consumes the results:
for each in indices:
    temp = helpers.scan(es, index=each)
    for value in temp:
        print(value)
Since helpers.scan() returns a generator object, printing temp directly would only display the generator itself rather than its contents, hence the need for the second loop to iterate and print each value.
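Each value the generator yields is a full hit dictionary. If you only need the document bodies, a sketch like this pulls out _source, though materializing everything into a list can use a lot of memory on a large index:
docs = [hit['_source'] for hit in helpers.scan(es, index=each)]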
Once this is done, a number of things can be achieved to further restrict the data that the request returns. For example, results can be restricted based on the doc_type, and narrowed further to specific fields inside the index and doc type using ‘_source_include’. An example call would look like this:
temp = helpers.scan(es, index=each, doc_type='malware', _source_include=('filename', 'filesize'))
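With that restriction in place, each hit’s _source should contain only the two requested fields; note that newer versions of Elasticsearch renamed this parameter to _source_includes:
for hit in temp:
    src = hit['_source']
    print(src.get('filename'), src.get('filesize'))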
Finally, further restrictions on the data can be applied using range queries; the full range syntax is documented in the Elasticsearch reference. For our purposes, we will query all data within roughly a 30-day window (now-1M is one month in Elasticsearch’s date math):
temp = helpers.scan(es, index=each, doc_type='malware',
                    _source_include=('filename', 'filesize'),
                    query={'query': {'filtered': {'filter': {'range': {
                        'RealTime': {'gte': 'now-1M', 'lte': 'now'}}}}}})
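One caveat: the filtered query shown here was removed in Elasticsearch 5.0. On a 5.x or later cluster, the equivalent wraps the filter in a bool query; RealTime is just this example’s timestamp field:
temp = helpers.scan(es, index=each, doc_type='malware',
                    _source_include=('filename', 'filesize'),
                    query={'query': {'bool': {'filter': {'range': {
                        'RealTime': {'gte': 'now-1M', 'lte': 'now'}}}}}})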
Now that we’ve slimmed down the results, we can dig deeper into the data and ensure that we’re accurately hitting everything we need.