Java Interview Questions 2.0: Elasticsearch

What is Elasticsearch?

Elasticsearch (ES) is an open-source, distributed full-text search engine built on Lucene and exposed through a RESTful interface. It is also a distributed document database in which every field is indexed and searchable, and it can scale out to hundreds of servers to store and process petabytes of data. It can store, search, and analyze large amounts of data in a very short time, and it is often used as the core engine in applications with complex search requirements.

Elasticsearch is built for high availability and scalability. Capacity can be increased by buying a more powerful server (vertical scaling) or by adding more servers to the cluster (horizontal scaling).

Pros and cons of Elasticsearch:

Advantages:

Horizontal scalability: just add a server, do some configuration, and start Elasticsearch; the new node joins the cluster.

Sharding provides better distribution: one index is divided into multiple shards, similar to the block mechanism of HDFS, and this divide-and-conquer approach improves processing efficiency.

High availability: ES provides a replica mechanism in which each shard can have multiple replicas. When a server goes down, the cluster keeps running as usual and re-creates the copies that were lost with that server on other available nodes.

Elasticsearch application scenarios

Large distributed log analysis systems: the ELK stack, i.e. Elasticsearch (stores logs) + Logstash (collects logs) + Kibana (displays data).

Large-scale e-commerce product search system, website search, network disk search engine, etc.

What is version control in Elasticsearch, and why is it needed?

1. Why version control is needed (lock-free CAS)

To keep data correct when several threads or clients write concurrently.

2. Pessimistic lock and optimistic lock

Pessimistic lock: assumes that concurrent conflicts will occur, and blocks every operation that might violate data correctness.

Optimistic lock: assumes that no concurrent conflict will occur, and only checks at commit time whether data integrity has been violated.

3. Internal version control and external version control

Internal version control: _version is maintained automatically. Each time the document is modified, _version is incremented by 1.

External version control: keeps the value of _version consistent with a version number maintained in an external system.

Use version_type=external. ES checks that the version currently stored for the document is smaller than the version in the request, and accepts the write only if it is.

ES uses optimistic locking: the version is checked when a write is submitted (for example with external version control), which keeps data consistent under concurrent writes.
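
The two version checks can be sketched with a small in-memory store. This is only an illustration of the optimistic-locking idea, not how ES is implemented; the `VersionedStore` class, its method names, and the sample document are all hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a write carries a stale version."""


class VersionedStore:
    """Toy in-memory store that mimics version checks (not the ES implementation)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index_internal(self, doc_id, source, expected_version=None):
        """Internal versioning: the store increments the version itself."""
        current = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        self._docs[doc_id] = (current + 1, source)
        return current + 1

    def index_external(self, doc_id, source, version):
        """External versioning: accept only a strictly greater version."""
        current = self._docs.get(doc_id, (0, None))[0]
        if version <= current:
            raise VersionConflict(f"version {version} is not greater than {current}")
        self._docs[doc_id] = (version, source)
        return version


store = VersionedStore()
v1 = store.index_internal("1", {"stock": 10})      # first write
v2 = store.index_internal("1", {"stock": 9}, v1)   # writer saw v1, so it succeeds

try:
    # A second writer still holds v1, so its update is rejected.
    store.index_internal("1", {"stock": 8}, expected_version=v1)
    stale_write_rejected = False
except VersionConflict:
    stale_write_rejected = True

# The external system supplies its own, larger version number.
v10 = store.index_external("1", {"stock": 5}, version=10)
```

Nothing is locked while a client holds a document. The conflict is only detected when the stale write is submitted, and the client is expected to re-read and retry.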

Chinese word segmenter

The default standard analyzer in Elasticsearch is not well suited to Chinese: it does not recognize Chinese words and splits the text into individual characters. For this reason the IK Chinese word segmentation plug-in (elasticsearch-analysis-ik) is usually installed.

The default standard word segmenter splits each Chinese character into a separate token.

The lexicon of the Chinese word segmenter is limited, and it is declared in a configuration file. We can define a custom extension lexicon and register it in that configuration file.
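
The effect of a lexicon on segmentation can be illustrated with a greedy longest-match segmenter. This is a deliberately simple stand-in and not the algorithm used by the IK plug-in; the lexicon entries and function names below are assumptions made for the example.

```python
def split_per_character(text):
    """Mimic the per-character behaviour described above for Chinese text."""
    return [ch for ch in text if not ch.isspace()]


def forward_max_match(text, lexicon):
    """Greedy longest-match segmentation against a lexicon (illustrative only)."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


base_lexicon = {"中国", "人民"}
extended_lexicon = base_lexicon | {"中国人民"}  # hypothetical custom entry

text = "中国人民"
per_char = split_per_character(text)
with_base = forward_max_match(text, base_lexicon)
with_extension = forward_max_match(text, extended_lexicon)
```

With the base lexicon the text is split into two words. Once the extended entry is added, the same text becomes a single token, which is the kind of change a custom extension lexicon is meant to produce.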

What data types does ES support?

Basic Type

String: the string types are text and keyword.

text: used to index long text. Before the index is built, the text is segmented into terms and the index is built from those terms, so ES can search on the individual words. By default, the text type cannot be used for sorting or aggregation.

keyword: not segmented; it can be used for filtering, sorting, and aggregation. A keyword field can only be matched by its whole value (searching on segmented fragments, as with text, is not available).

A keyword field is never segmented: whether you use a match query or a term query, the value is compared as a whole, with no word segmentation or partial matching.
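
As a rough sketch, a mapping that uses both string types might look like the following. The field names `title` and `status` are hypothetical, and the typeless layout shown assumes a recent ES version (older versions nest `properties` under a type name).

```python
import json

# Hypothetical index with one analysed field and one exact-value field.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},      # segmented into terms at index time
            "status": {"type": "keyword"},  # stored as a single exact value
        }
    }
}

# JSON body that would be sent when creating the index.
request_body = json.dumps(mapping, indent=2)
```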

What is the difference between 9300 and 9200?

Port 9300: the TCP transport port, used for communication between ES nodes inside a cluster.

Port 9200: the HTTP port, which exposes the ES RESTful interface to external clients.

What is the DSL?

There are two ways to send a query to ES. One is a simple query passed as parameters in the URL; the other is a complete request body written in JSON, called the structured query language (Query DSL).

Because DSL queries are more expressive and easier to read, most people use this method.

A DSL query is a JSON document sent in the body of a POST request. Because the body is JSON, it is very flexible and can take many forms.
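
The two styles can be compared side by side. The sketch below only builds the request strings and does not contact a server; the index name `articles` and the field names are assumptions.

```python
import json
from urllib.parse import urlencode

# Simple version: the query is passed as a URL parameter.
uri_search = "/articles/_search?" + urlencode({"q": "title:elasticsearch"})

# DSL version: the query is a JSON request body.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "elasticsearch cluster"}}],
            "filter": [{"term": {"status": "published"}}],
        }
    },
    "from": 0,
    "size": 10,
}
body = json.dumps(query)  # would be sent as the body of POST /articles/_search
```

The JSON form makes it easy to combine clauses (here a full-text `match` with an exact `term` filter), which is awkward to express in a URL parameter.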

What is the difference between Term and Match?

A term query does not segment the query text; it looks for an exact match with the terms in the index.

A match query first segments the query text using the analyzer configured for the field, and then searches for the resulting terms.
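
The difference can be simulated with a toy index. The `analyze` function below (lowercase, then split on whitespace) is a stand-in for a real analyzer, and `term_query` and `match_query` are hypothetical helpers, not ES client calls.

```python
def analyze(text):
    """Stand-in analyzer: lowercase and split on whitespace."""
    return text.lower().split()


docs = {1: "Quick Brown Fox", 2: "Lazy Dog"}

# Index time: a text field is analysed into terms.
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)


def term_query(value):
    """Look the value up as-is, with no analysis."""
    return index.get(value, set())


def match_query(value):
    """Analyse the query text, then match documents containing any term."""
    hits = set()
    for term in analyze(value):
        hits |= index.get(term, set())
    return hits


exact_miss = term_query("Quick Brown Fox")  # no such single term in the index
exact_hit = term_query("quick")             # matches the indexed term
analysed_hit = match_query("Quick Dog")     # analysed into "quick" and "dog"
```

This is why a term query against a text field often returns nothing: the index holds the analysed terms, not the original string.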

What is an inverted index?

An inverted index uses characters or words as its keys. The entry for each key records every document in which that character or word appears. Each entry is a postings list whose items record the document ID and the positions at which the term occurs in that document.

Because the number of documents associated with each term changes dynamically, building and maintaining an inverted index is relatively complicated. At query time, however, all documents for a keyword can be retrieved in one lookup, so it is more efficient than a forward index. In full-text search, fast query response is the most critical requirement; index building runs in the background, so although it is relatively slow, it does not hurt the efficiency of the search engine as a whole.
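
A minimal sketch of such a structure, assuming whitespace tokenization and two made-up documents, could look like this:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}."""
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            inverted[term].setdefault(doc_id, []).append(position)
    return inverted


docs = {
    1: "elasticsearch is a search engine",
    2: "lucene is a search library",
}
inverted = build_inverted_index(docs)

# One dictionary lookup returns every document containing the term.
search_postings = inverted["search"]
engine_postings = inverted["engine"]
```

A forward index would have to scan every document to answer the same question, which is the efficiency difference described above.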

Why does ES need clusters?

On a single ES server node, the index files keep growing as the business grows, which hurts efficiency and causes memory and storage problems.

With an ES cluster, the shards of a single index are stored on several different physical machines, which provides high availability and fault tolerance.

The core idea of building a cluster:

Configure a different node name (ID) for each node

Configure the same cluster name on every node

If 3 servers form the cluster, three different configurations are needed, one per server.
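
The idea can be sketched by generating the two relevant settings for each node. `cluster.name` and `node.name` are the settings found in elasticsearch.yml; the cluster name, the node names, and the helper function are hypothetical, and a real configuration needs further settings (network, discovery) that are omitted here.

```python
CLUSTER_NAME = "interview-cluster"  # hypothetical name, shared by all nodes


def node_settings(node_names):
    """Return one settings dict per node: same cluster.name, unique node.name."""
    if len(set(node_names)) != len(node_names):
        raise ValueError("node names must be unique")
    return [{"cluster.name": CLUSTER_NAME, "node.name": name} for name in node_names]


settings = node_settings(["node-1", "node-2", "node-3"])

# Render each dict as the lines that would go into that node's config file.
rendered = ["\n".join(f"{key}: {value}" for key, value in s.items()) for s in settings]
```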

How ES solves high concurrency

ES is a distributed full-text search framework that hides its complex processing from the user. Internally it relies on sharding, cluster discovery, shard load balancing, and request routing.

Shards: index shards. ES can divide a complete index into multiple shards, so that a large index is split up and distributed across different nodes, which together form a distributed search. The number of primary shards is specified when the index is created and cannot be changed afterwards.

Replicas: index replicas. ES can keep multiple replicas of each shard. Replicas serve two purposes. First, they improve fault tolerance: when a node is damaged or lost, its data can be recovered from a replica. Second, they improve query efficiency, because ES automatically load-balances search requests. When data is added or modified, the write goes to the primary shard only, which then synchronizes the change to its replica shards in real time; when querying, requests are load-balanced across primary and replica shards.

When expanding capacity by adding servers, you only need to change the number of replicas. With as many primary shards as servers, raising the replica count to the number of servers minus one gives every server a copy of every shard, so the total number of primary and replica shards equals the square of the number of servers. Because the number of primary shards cannot be modified after the index is created, the only value that can be adjusted is the number of replica shards.

A single ES server has no replica shards, because a replica is never placed on the same node as its primary.

Analysis of core principles of ES cluster

1. Each index is divided into multiple shards for storage. By default an index is created with 5 shards (the default before version 7.0; since 7.0 it is 1), and the shards are distributed across different nodes. These shards are called primary shards.

2. To achieve high availability, each primary shard has its own replica shards. A replica cannot be stored on the same server as its primary shard, but a primary shard can share a node with the replicas of other primary shards.

3. Queries can be served by both primary and replica shards. Writes (adding or modifying data) go to the primary shard only, which then synchronizes the change to its replica shards in real time.

4. When scaling horizontally by adding machines, shards should be spread evenly across all servers. The number of primary shards cannot be changed, so the distribution can only be controlled through the number of replica shards: choose it so that the total number of primary and replica shards is the square of the number of nodes.
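
The arithmetic in points 2 and 4 can be checked with a small sketch. It assumes the number of primary shards equals the number of nodes, and it uses a simple round-robin placement that is not the real ES allocation algorithm; all function names are hypothetical.

```python
def total_shards(primaries, replicas_per_primary):
    """Every primary plus all of its replicas."""
    return primaries * (1 + replicas_per_primary)


def replicas_for_full_copy(nodes):
    """Replicas per primary so that every node can hold a copy of every shard."""
    return nodes - 1


def place(primaries, replicas_per_primary, nodes):
    """Round-robin placement; copies of the same shard never share a node."""
    if replicas_per_primary >= nodes:
        raise ValueError("need more nodes than replicas per primary")
    layout = {n: [] for n in range(nodes)}
    for shard in range(primaries):
        for copy in range(1 + replicas_per_primary):
            role = "P" if copy == 0 else "R"
            layout[(shard + copy) % nodes].append(f"{role}{shard}")
    return layout


nodes = 3
replicas = replicas_for_full_copy(nodes)   # 2 replicas per primary
total = total_shards(nodes, replicas)      # 3 primaries * (1 + 2) copies
layout = place(nodes, replicas, nodes)
```

With 3 nodes and 3 primaries this yields 9 shard copies, 3 per node, and each node ends up with one copy of every shard. The square rule is the rule of thumb given in this guide rather than a requirement enforced by ES.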

Why can't the number of primary shards be modified after creation?

Document routing

When a client creates a document, ES has to decide which shard of the index the document is stored on. This process is called data routing.

Routing algorithm: shard = hash(routing) % number_of_primary_shards

If number_of_primary_shards changed, the remainder computed at query time would differ from the one computed when the document was indexed, so the document could no longer be found.
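
The formula can be tried out directly. The sketch below uses CRC32 as a stand-in hash purely for illustration (ES uses its own hash function), and the document IDs are made up.

```python
import zlib


def route(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards, with CRC32 as the hash."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(100)]

# Shard chosen when each document was indexed with 5 primary shards.
placed_with_5 = {doc_id: route(doc_id, 5) for doc_id in doc_ids}

# Shard the same formula would point to if the count were changed to 6.
looked_up_with_6 = {doc_id: route(doc_id, 6) for doc_id in doc_ids}

# Documents that would be looked up on the wrong shard after the change.
moved = [d for d in doc_ids if placed_with_5[d] != looked_up_with_6[d]]
```

Every document in `moved` was stored on one shard but would be searched for on another, which is why the primary shard count is fixed once the index has been created.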

ES cluster terminology

Cluster: a cluster consists of multiple nodes, one of which is the master node. The master is chosen by election, and the distinction between master and non-master nodes only matters inside the cluster. ES is described as decentralized, which literally means there is no central node. This is the view from outside the cluster: seen from outside, the ES cluster is logically a single whole, and communicating with any one node is equivalent to communicating with the entire cluster.

Shards: index shards. ES can divide a complete index into multiple shards, so that a large index is split up and distributed across different nodes, which together form a distributed search. The number of primary shards is specified when the index is created and cannot be changed afterwards.

Replicas: index replicas. ES can keep multiple replicas of each shard. Replicas serve two purposes. First, they improve fault tolerance: when a node is damaged or lost, its data can be recovered from a replica. Second, they improve query efficiency, because ES automatically load-balances search requests.

Recovery: data recovery, or data redistribution. When a node joins or leaves the cluster, ES redistributes index shards according to machine load, and when a failed node restarts, its data is recovered.