Elasticsearch: How to Add Full-Text Search to Your Database
Businesses often need to build applications that not only collect and save data, but also analyze it and make it searchable. That requires building powerful search capabilities into the project. Increasingly, teams are turning to Elasticsearch to build fast, full-text search functionality.
What Is Elasticsearch: An Overview
Elasticsearch is a robust and platform-independent search engine that can provide a rapid full-text search over millions of documents.
It’s a document store based on RESTful communication. By default, it indexes all fields in a document, and they become searchable in near real time. Elasticsearch stores documents in JSON format and has client support for many programming languages. As an agnostic search engine, it can be used with the language and platform of your choice.
At the time this post was written, Elasticsearch was ranked eleventh for databases, and there’s good reason why.
Why Teams Choose Elasticsearch
The following companies are no longer asking, “What is Elasticsearch?” They rely on it to search and query data.
Wikipedia. The online encyclopedia relies on Elasticsearch for full-text search and for suggested text.
The Guardian. The media site uses Elasticsearch to give editors current feedback about public opinion on published media through social and visitor data.
Stack Overflow. The knowledge-sharing site uses Elasticsearch to run full-text search and geolocation queries and to source related questions and answers.
GitHub. The project host queries billions of lines of code with the search engine.
Still wondering, “What is Elasticsearch, and why does it matter?” The following benefits summarize the Elasticsearch value proposition for users.
Teams favor Elasticsearch because it is a distributed system by nature and scales horizontally with ease, which makes it possible to add resources and balance the load between the nodes in a cluster. It also replicates data automatically to prevent data loss if a server node fails.
Elasticsearch is capable of scaling to hundreds of servers and petabytes of information.
Elasticsearch is able to execute complex queries extremely fast. It also caches almost all of the structured queries commonly used as filters for the result set, so each one is executed only once. For every other request that contains a cached filter, it reads the result from the cache. This saves the time spent parsing and executing the query and improves speed.
Query Fine Tuning
Elasticsearch has a powerful JSON-based DSL (domain-specific language), which lets development teams construct complex queries and fine tune them to receive the most precise results from a search. It also provides a way of ranking and grouping results.
Teams with a deep understanding of business needs and the user perspective can hone queries so the most relevant results always appear at the top of the result set. This way, build teams can ensure users always find what they’re looking for among the first results displayed on page 1.
Elasticsearch provides support for all commonly-used data types such as:
Text: string (can be of both structured and unstructured data)
Numbers: long, integer, short, byte, double, float
Elasticsearch also provides support for complex types such as arrays, objects, nested types, geo data types, IPv4 and others. See the Elasticsearch documentation for a complete list.
Elasticsearch offers a variety of useful plugins to boost capabilities. Plugins provide richer analysis to understand your data and explore it. They also provide additional security functionality.
Preparing to Add a Search Engine in Your .NET Core Project
Elasticsearch comes with great support for .NET Core and older versions of the platform. To get started, use Nest, the official high-level client library for working with Elasticsearch. It comes with a strongly-typed API that maps one-to-one with the Elasticsearch query DSL and takes advantage of specific .NET features like covariant results and auto mapping of POCOs.
Nest is a high-level client that internally uses the low-level Elasticsearch.Net. It is available for installation via NuGet.
To move beyond asking, “What is Elasticsearch” and to illustrate its value, I created a sample dev blog project using Elasticsearch for indexing and searching the site content. The code examples in this article are part of that project.
We will need to declare a type. A type is a definition of the structure of the information that will be held in the search engine. You can think of a type as an SQL table. You can have many types in your index.
We will need an index. An index is a logical namespace where all of our types will live. It’s similar to a database in the SQL world.
Our actual records are documents. A document is stored as JSON. We’re going to insert only data that corresponds to the type’s mapping.
Elasticsearch internally creates an Inverted Index, which is a table where all unique terms collected from our documents are populated. It indicates the specific documents where the term exists. When a search for some value is performed, it will compare against the Inverted Index. This identifies whether or not the value exists in our documents. An Inverted Index looks similar to the image below.
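To make the idea concrete, here is a minimal sketch of an inverted index in plain Python. It is only an illustration of the data structure, not how Elasticsearch implements it internally; the sample documents and the function name build_inverted_index are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, not taken from the blog project.
docs = {
    1: "elasticsearch is fast",
    2: "full text search is fast",
    3: "elasticsearch full text search",
}

inverted = build_inverted_index(docs)
for term in sorted(inverted):
    print(f"{term:15} -> {inverted[term]}")
```

Looking up a term is then a single dictionary access: inverted["fast"] lists the documents containing that word, and a term that is absent from the index appears in no document.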
.NET Core Project Elasticsearch Demo Tutorial
Creating the Data Structure for Your Project
Because Nest is able to auto map POCOs to Elasticsearch types, it becomes really easy to set up your project data structure. First, we will need to figure out what kind of data we’re going to store. This is the first step when building your index and types.
Structured vs. Unstructured Search
It’s really important to understand the difference between the structured (exact value) and unstructured (full-text) search.
Direct execution. One of the main differences is that structured queries are executed directly, without passing through a special text analysis phase. They are used as filters to narrow the result set before the actual full-text search is executed. (Most likely, developers will first want to filter the thousands or millions of documents down to a set of interest. After that, a full-text query can be executed on that filtered set.) This will greatly improve your query performance. Remember, filters are cached, which means filter execution is fast.
Yes/no results. Structured data is also called exact value data. With structured search, the result of your query is always a yes or no. Something either belongs in the set, or it does not. Structured search does not worry about document relevance or scoring, it simply includes or excludes documents. There is no concept of “more similar” as in full-text search.
Structured data includes numbers, booleans, dates and sometimes, when it makes sense, text. All types are treated as exact values. Elasticsearch compares the data passed in the query to your data. The result is a boolean “yes” (it matches) or “no” (it does not), similar to how SQL compares values. If a document matches the structured query, it is included in the result set.
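The filter-then-search pattern described above can be sketched as a bool query in the Elasticsearch query DSL, built here as a plain Python dictionary. The field names createdById, createdOn and content, and the sample values, are assumptions made for illustration and may not match the blog project's actual mapping.

```python
import json


def filtered_search(text, author_id, published_after):
    """Build a bool query: exact-value filters plus one scored full-text clause."""
    return {
        "query": {
            "bool": {
                # Filter clauses answer yes/no and do not contribute to the score.
                "filter": [
                    {"term": {"createdById": author_id}},
                    {"range": {"createdOn": {"gte": published_after}}},
                ],
                # The must clause is analyzed and scored for relevance.
                "must": [{"match": {"content": text}}],
            }
        }
    }


# Hypothetical values, for illustration only.
body = filtered_search("full-text search", "hypothetical-author-id", "2018-01-01")
print(json.dumps(body, indent=2))
```

The resulting JSON is what would be sent as the body of a _search request; the structured clauses sit under filter, and only the match clause influences ranking.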
Relevancy. On the other hand, unstructured data is any human text. Human text is hard for computers to handle. Within the full-text search world, there is a concept of how relevant the returned document is to your search. This question is answered by calculating the score of the document.
Human language has many rules to follow when constructing a phrase or text. It is treated as full-text, and special analyzers are applied in order to simplify it and make it searchable. This means that the original text is modified following special rules before being stored in the Inverted Index. This process is called the “analysis phase,” and it is applied to all full-text fields. Most of the analyzers needed are already defined in Elasticsearch and come “out of the box.”
When the “English” analyzer is applied to the text “The dogs are running fast.”, only the words “dog”, “run” and “fast” are indexed. Stop words such as “the”, “and”, “are” and “is” are removed, along with commas and periods, because they add no special meaning to the phrase. The remaining words are lowercased and transformed (stemmed) to their root form, so the word “running” becomes “run.”
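The steps of the analysis phase can be imitated with a toy analyzer in Python. This is a deliberately simplified stand-in, not the real English analyzer: the stop-word list is tiny and the stemming rules are naive suffix stripping chosen just to handle this sentence.

```python
import re

STOP_WORDS = {"the", "and", "are", "is"}  # tiny illustrative list


def toy_stem(word):
    """Very naive suffix stripping; real analyzers use a proper stemmer."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def toy_analyze(text):
    """Lowercase, drop punctuation, remove stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(toy_analyze("The dogs are running fast."))  # ['dog', 'run', 'fast']
```

The same function must be applied to the search text at query time; otherwise a search for “running” would never match the stored term “run”.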
Text, by default, is treated as full-text, but in some cases, it makes sense to treat it as an exact value and not apply any transformations to it. This applies when using a GUID in your model. If the field “CreatedById” is of type GUID, Elasticsearch doesn’t have a corresponding type to map it to, so it must be indexed as text. The GUID has a specific structure, and analyzers should not be applied because they would transform the original text.
To treat a text as an exact value, you need to explicitly specify it. This might be done by using the Nest attributes in your model class. See the “CreatedById” property.
Specifying that a string or other type is a Keyword means that the analysis phase will not happen, and the keyword will be indexed and stored in the Inverted Index in its original form.
Otherwise, if the keyword were treated as full-text, analysis would be applied, and your GUID value would be split at the dashes, lowercased and otherwise transformed depending on which analyzer was used.
This is what will be indexed if treated as a full-text field with the standard analyzer.
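As a rough illustration of the difference, the sketch below contrasts the two treatments in Python. The function as_analyzed_text only approximates the behavior described above (lowercasing and splitting at the dashes); it is not the actual standard analyzer, and the GUID is a made-up example value.

```python
import re


def as_keyword(value):
    """A keyword field is indexed as one unmodified term."""
    return [value]


def as_analyzed_text(value):
    """Approximation of analysis: lowercase, then split on anything
    that is not a letter or a digit."""
    return re.findall(r"[a-z0-9]+", value.lower())


guid = "3F2504E0-4F89-11D3-9A0C-0305E82C3301"  # hypothetical example value

print(as_keyword(guid))        # one term, exactly as stored
print(as_analyzed_text(guid))  # five lowercase fragments
```

An exact-value lookup for the full GUID can only match the keyword form, because the analyzed form no longer contains the original string as a single term.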
In a real application, you may find yourself using more structured data and less full-text data. In this sample blog application, we have only one kind of data: unstructured (full-text) human text in the form of blog posts.
Creating the type
We need a place to store our posts and their tags. This type will be called “post”. Nest makes things really easy for us. After we have identified the data types we need, we must create the model that will be used by Elasticsearch. It will be a POCO decorated with attributes to have better control over the creation of the type’s mapping. (If mappings are not provided, Elasticsearch will try to guess the data type.)
The attribute [ElasticsearchType(Name = “”)] decorating the class indicates the name of the type that this object represents. Now that we have declared the type, it’s time to create the index and to set up Elasticsearch.
Creating the index
Creating the index is really easy. Our type fields will be automapped. The next section of code is part of the custom ElasticClient.
Here, _indexName is “blog”, and the URI “http://localhost:9200” is the default address of Elasticsearch. We’re returning an instance of Nest.ElasticClient, passing that address as a parameter.
The default settings can be used when creating the index. But because we’re running Elasticsearch on a single node, and it’s not in a distributed environment, I recommend setting the number_of_shards to 1 to increase the precision and number_of_replicas to 0 because we can’t replicate the data on another node.
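Expressed as a raw create-index request, these single-node settings might look like the sketch below. It builds the request body as a Python dictionary rather than through Nest; the function name is hypothetical, and the index name and address are the ones mentioned above.

```python
import json


def single_node_index_body():
    """Settings for an index that lives on one node only."""
    return {
        "settings": {
            "number_of_shards": 1,    # a single shard holds all documents
            "number_of_replicas": 0,  # no second node to hold a replica
        }
    }


# The body would be sent with a PUT request to the index URL.
print("PUT http://localhost:9200/blog")
print(json.dumps(single_node_index_body(), indent=2))
```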
To ensure the application is always aware of Elasticsearch, we’re going to add a call to ElasticClient.Initialize() in the project’s Startup.cs file. That way, every time the application starts, it checks whether the index exists. If not, the index is created. Otherwise, the existing data is not overwritten. After running the application, we can check what has been created by using the Elasticsearch address + index name and the _mappings endpoint.
The index has been successfully created and it is ready for use.
Indexing, Reindexing and Deleting a Document
Indexing in this context means storing a document in an index. We must index a document in Elasticsearch every time a post is created or edited. The edit operation simply reindexes an existing document, which overwrites the previous version.
After we have saved or edited the post, we’re going to index it in Elasticsearch to reflect the changes.
_elasticService.IndexData() maps the PostModel to PostType which is a type Elasticsearch is aware of. Then it uses ElasticClient to index the changes by calling the InsertUpdate(document) method.
You can explore the stored data by executing the GET query “http://localhost:9200/blog/post/_search” in a REST client of your choice. Here, blog is your index name, post is the type name and _search is the endpoint that allows you to execute DSL queries. The result will be similar to this one.
As we can see, the response provides metadata that gives information about how much time the query execution took (in milliseconds), how many documents have been returned, the actual documents in hits field and other useful information.
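A short sketch of reading those fields from a search response, using only Python's standard library. The response below is a hypothetical, trimmed example rather than output from the sample project, and the exact shape of some fields (for instance hits.total) differs between Elasticsearch versions.

```python
import json

# Hypothetical response, trimmed to the fields discussed above.
raw = """
{
  "took": 4,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {"_index": "blog", "_type": "post", "_id": "1", "_score": 1.0,
       "_source": {"title": "Hello Elasticsearch"}}
    ]
  }
}
"""


def summarize(response):
    """Return the query time in ms, the hit count and the titles of the hits."""
    hits = response["hits"]
    titles = [hit["_source"]["title"] for hit in hits["hits"]]
    return response["took"], hits["total"], titles


took_ms, total, titles = summarize(json.loads(raw))
print(f"{total} hit(s) in {took_ms} ms: {titles}")
```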
Building your DSL query to obtain the stored data
The DSL query is built dynamically, depending on how many words are provided in the search field. Nest gives you an API that maps one-to-one with the Elasticsearch DSL query.
The match query used in the example is a high-level query that understands the field mappings. It’s commonly used. The multi_match query is the way to execute a match query on multiple fields.
A search query is constructed as follows:
We’re performing full-text search on all three fields of the blog post with every word provided from the search field. The type of the multi_match query is set to most_fields to sum the score of the matched documents and to increase the precision of the response.
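One way such a query could be assembled is sketched below as raw query DSL in Python. This is an assumption about the shape of the query, not the project's Nest code: the field names title, tags and content are guesses at the three post fields, and combining the per-word clauses in a bool/should is one possible choice.

```python
import json

FIELDS = ["title", "tags", "content"]  # assumed field names


def build_query(search_text):
    """One multi_match clause per word, combined in a bool/should."""
    clauses = [
        {"multi_match": {"query": word, "fields": FIELDS, "type": "most_fields"}}
        for word in search_text.split()
    ]
    return {"query": {"bool": {"should": clauses}}}


print(json.dumps(build_query("elasticsearch tutorial"), indent=2))
```

With most_fields, a document that matches a word in several fields accumulates the score from each of them, so it ranks above a document that matches in only one field.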
This query will work fine, but what if I want to execute a more precise query and say that the title of the blog post is more important than its content? I need to boost my query.
Boosting Your Queries
Boosting your queries allows you to give more importance to some of your fields. I recommend constructing the query by providing different boost values to the fields. By default, every field has a boost of 1, meaning that all fields are equally important. You can change that by providing a number. The boost parameter is of type double.
The title field now is 3 times more important than the content of the blog post. Next by importance, is the tag field with a boost of 2.
This will give the most weight to documents that match the searched word in their title field, less weight to matches in the tag field and the least to matches in the content. The result set of the query is constructed by sorting the documents by their score. In this way, the more relevant documents are sorted to the top of the result set.
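In the raw query DSL, per-field boosts can be written with the caret syntax on the field name. The sketch below builds such a clause in Python; the field names are assumptions, and the boost values follow the ones described above (3 for the title, 2 for the tags, the default of 1 for the content).

```python
BOOSTS = {"title": 3, "tags": 2, "content": 1}  # assumed field names


def boosted_fields(boosts):
    """Render field names with the caret boost syntax, e.g. 'title^3'."""
    return [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]


def boosted_multi_match(word):
    """A multi_match clause whose fields carry different boosts."""
    return {
        "multi_match": {
            "query": word,
            "fields": boosted_fields(BOOSTS),
            "type": "most_fields",
        }
    }


print(boosted_multi_match("elasticsearch"))
```

A field listed without a caret keeps the default boost of 1, which is why content appears unadorned.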
The final goal is to display the most relevant documents in the top 5 results on the first page and ideally to never need to go on the second page.
We can now execute the query and map the response to the model that will be returned to the view.
The variable dslQuery is present for debugging purposes and gives you the raw DSL query that has been executed against Elasticsearch.
The response is of type ISearchResponse<T> and has the property Hits that contains all the matched documents sorted by their score. The actual document’s data is found in the Source field.
The final step is to display the result of the executed search query in the view.
Efficiently Index Data
In this article, we explored how to incorporate Elasticsearch into your projects. It provides the ability to fine tune queries according to your business needs and cover complex scenarios to achieve your goals.
In the software development world, there are a few search engines and SQL engines that provide support for full-text search. Below find additional linked resources to help you identify which will best fit your needs.
Now you have all you need to start indexing your data and be able to deep search in a very efficient way. Start using Elasticsearch today.