by Antonello Calamea

Elasticsearch, search power at your fingertips!

It’s deeply satisfying to be able to drill down into tons of data, tame its complexity and extract valuable information.


So let me introduce a very good friend of mine: Elasticsearch, one of the most amazing tools I have ever used!


This is just an introduction to see how it works (at a very high level), show what it is capable of and explain how to run it yourself. I’m sure you’ll understand my passion for it once you get to know it a little.


But what exactly is ES?

From Wikipedia: Elasticsearch is a search engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.


In simpler words, you can collect, parse and transform data from many different sources and then perform complex searches on it.
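
To give a tiny taste of how that feels (assuming a local instance listening on port 9200, which we’ll set up later in this post; the index and field names here are made up purely for illustration), you can index a JSON document and search it back with two HTTP calls:

curl -X PUT -H 'Content-Type: application/json' 'localhost:9200/myindex/doc/1?pretty' -d '{"user": "antonello", "message": "Elasticsearch is awesome"}'

curl 'localhost:9200/myindex/_search?q=message:awesome&pretty'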


It already has quite a long history, so it’s not just the latest hot tech buzz.


Sounds good, but what can you do with it?

The most obvious use is to implement a search engine inside an application (a website, a mobile app and so on), and it works very well.


But, for me, the most awesome use is to set up a complete monitoring and reporting system, performing complex data analysis on information gathered from different, heterogeneous sources.


Some examples:

  • Log monitoring

  • Metrics monitoring

  • Business data KPI monitoring

  • Applications troubleshooting

  • Automatic alerting

  • Data analysis to gain insights


For example, let’s consider an Apache access log as input. You can build a graph showing the 404 responses and see whether they are happening on a target server.


But you can do the same over a past period, seeing for example the number of 404s that happened between 09:05 and 23:07 on a specific date. Moreover, you can correlate it with other data (for example an application log), picking the events registered in the surrounding time frame.
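
Just to make this concrete, here is a sketch of what such a query could look like through the REST API, assuming the access-log events live in an index called apache-logs with response and @timestamp fields (names that depend entirely on how you configure your pipeline):

curl -H 'Content-Type: application/json' 'localhost:9200/apache-logs/_count?pretty' -d '
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "response": 404 } },
        { "range": { "@timestamp": { "gte": "2018-09-01T09:05:00", "lte": "2018-09-01T23:07:00" } } }
      ]
    }
  }
}'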


This is an incredible help for diagnosing problems and responding faster when something is broken.


But there is more: you can perform analysis on all the stored events, using not only the tools ES offers but also using ES as a data source for further processing (querying the data through its APIs).


And with the latest versions comes the possibility to use ML algorithms for predictive analysis and anomaly detection!


How does it work?

Essentially, every time something changes in a data source (a new line is added to a log file, a record is changed in a database table and so on), an event with all the chosen metadata attached is generated and stored in the ES database.
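
To give an idea, a single Apache access-log line could end up in ES as a JSON event looking roughly like this (the exact fields depend on how you configure the collection pipeline; these names are only illustrative):

{
  "@timestamp": "2018-09-29T19:43:25Z",
  "host": "web-01",
  "clientip": "203.0.113.42",
  "verb": "GET",
  "request": "/index.html",
  "response": 404,
  "source": "/var/log/apache2/access.log"
}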


In the previous example, you can perform the analysis even if the Apache logs are no longer available, because the generated events are stored in ES.


Let’s take a look under the hood…


There are three main components (known as the ELK stack):

  • Elasticsearch: the search engine

  • Logstash: the data collector (working in synergy with another component, Beats)

  • Kibana: the data visualizer (query results, graphs and dashboards)


And this is an example of how they interact with each other.

Data is collected by Metricbeat (gathering info such as CPU load, disk space and processes) and Filebeat, and can be passed to Logstash for further transformations; Logstash itself can also collect data (for example by monitoring a database with a specific query). Everything is then sent to Elasticsearch using its REST APIs.
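
For reference, a minimal Logstash pipeline that accepts events from Beats and forwards them to Elasticsearch looks roughly like this (a sketch assuming the default Beats port and a local ES instance):

input {
  beats {
    port => 5044
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}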


With Kibana, you get the data from ES and visualize it in different ways.


But you are not forced to use them all: you can use Grafana instead of Kibana to build dashboards, or maybe build your own reporting tool.


It’s great having a decoupled architecture, right? :)


Awesome, I want to try it!

There are two main options if you want to use ES:

  1. Install it on your own (ranging from a simple single-node local machine to multi-cluster setups on cloud servers, with or without Docker)

  2. Use a hosted service

The first option is the most time-consuming and you have to manage all the parts (machines/containers, clusters and so on), but you have total control and can always run the latest available version, with all the included features.


The second differs depending on the chosen provider: the “purest” form is using the service hosted by the creator of ES (Elastic.co); otherwise you can choose a provider (AWS Elasticsearch Service, Logz.io, Bonsai, etc.) that hosts and “wraps” ES, meaning you use it “as is”, with the versions and features that the provider chooses to expose.


There is no absolutely right choice; it always depends on your specific needs and the available time/skills/budget.


For this guide, we’ll use a local instance running on Docker, so there is no need to register or have accounts anywhere.


Let’s spin some containers!


Run ES on Docker (Linux)

There are several possibilities; the simplest is launching it with a single command:

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.4.0

Basically, it will download the Docker image of version 6.4.0 (containing everything needed to run), expose ports 9200 and 9300 between the host and the container, and run Elasticsearch as a single node.


If everything is fine, just point a browser to localhost:9200 and you should see the ES response, something like this:

{
  "name": "1LoBYJR",
  "cluster_name": "docker-cluster",
  "cluster_uuid": "DxChiMm5Rku9KVqjPz5JDQ",
  "version": {
    "number": "6.4.0",
    "build_flavor": "default",
    "build_type": "tar",
    "build_hash": "595516e",
    "build_date": "2018-08-17T23:18:47.308994Z",
    "build_snapshot": false,
    "lucene_version": "7.4.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

Hooray! You just took the first step toward having more control over your data!

But let’s see something more graphical: let’s run Kibana too.


In this case, we’ll use Docker Compose to spin up both containers, creating a network to allow communication between them and storing the data in the folder ~/esdata (you have to create it manually), so when you restart the containers, all the data will still be there.


So stop the previous container, create a file called docker-compose.yml and paste this code into it:

version: '2'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.4.0
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
      - ~/esdata:/usr/share/elasticsearch/data
    environment:
      ES_JAVA_OPTS: "-Xmx256m -Xms256m"
    networks:
      - elk

  kibana:
    image: docker.elastic.co/kibana/kibana:6.4.0
    ports:
      - "5601:5601"
    networks:
      - elk
    depends_on:
      - elasticsearch

networks:
  elk:
    driver: bridge

Then run

docker-compose up

and if everything works, you should have both containers running


Now point your browser to http://localhost:5601 and you should see Kibana!

Let’s add some sample data (all of Shakespeare!) to have something to work with: the shakespeare_6.0.json dataset used in the official Kibana getting-started tutorial.

Steps:

First, create the index with its mapping: copy this command and execute it (for example in Kibana’s Dev Tools console):

PUT /shakespeare
{
 "mappings": {
  "doc": {
   "properties": {
    "speaker": {"type": "keyword"},
    "play_name": {"type": "keyword"},
    "line_id": {"type": "integer"},
    "speech_number": {"type": "integer"}
   }
  }
 }
}

Then, open a shell in the folder where the JSON file is located and run:

curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/doc/_bulk?pretty' --data-binary @shakespeare_6.0.json

After a few moments, the data will be loaded.
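
You can quickly verify the load from the same shell, for example by asking ES how many documents the index now contains:

curl 'localhost:9200/shakespeare/_count?pretty'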


Last step: go to Kibana Management > Index Patterns and create an index pattern (mapping the data from ES so it can be used in Kibana), for example called “shake*”.


You should see something like this

And now the moment you’ve been waiting for: seeing some data. Go to Kibana -> Discover and you will see the Shakespeare data: one event for every line of every work, totaling 111,396 hits.

Let’s build a simple graph showing the characters with the most quotes. Go to Visualize and configure it like this:

Awesome! Gloucester is the winner!
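
If you want to double-check the same ranking from the command line, the visualization corresponds roughly to a terms aggregation on the speaker field (a sketch, not the exact request Kibana generates):

curl -H 'Content-Type: application/json' 'localhost:9200/shakespeare/_search?size=0&pretty' -d '
{
  "aggs": {
    "top_speakers": {
      "terms": { "field": "speaker", "size": 10 }
    }
  }
}'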


Now you can play a little with this data and, above all, start to think about what you can do with the data you’re really interested in.


This is just the tip of the iceberg, but every journey has a first step…


Enjoy the ride!
