【Java Advanced Camp】 Must-ask questions in Java architect interviews: a summary of Elasticsearch interview questions

1. Why use Elasticsearch?

When we built the search function for our online mall, the amount of data kept growing as the project went on. If we had kept using database fuzzy queries as before, queries over millions of rows would have been very slow. ES is a framework that supports full-text retrieval. It is a search engine that stores and analyzes data in near real time, and when the data volume is large it can be deployed as a cluster. So for the search function we store the commonly used product information, such as name, price, description and id, in the index library, which speeds up queries.

2. How well do you know Elasticsearch? Describe your company's ES cluster architecture, the size of the index data, the number of shards, and some tuning methods

For the ES production cluster we deployed 2 machines, each with 6 cores and 64G of memory, so the total memory of the cluster is 128G.

Our ES cluster grows by about 10 million records (about 200MB) per day, and by about 300 million records (about 8G) per month.

Currently there are 5 indexes online, each holding about 10G of data. At this data volume we use the default of 5 shards per index.

Tuning methods are generally considered from three aspects:

1. Design stage: manage indexes through aliases, run force_merge on the indexes in the early morning every day, and configure word segmentation only on the fields that need it.

2. Writing: before a large write, set the number of replicas to 0 and disable the refresh mechanism; during the write, use bulk requests; after the write, restore the number of replicas and the refresh interval.

3. Querying: rely on the inverted index and use the keyword type wherever possible.
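The write-side tuning in point 2 can be sketched as a pair of index settings bodies plus a bulk request body. This is only an illustration: the helper names and the default values used for restoring (1 replica, "1s" refresh) are assumptions, and the code only builds the request bodies, it does not send them to a cluster.

```python
import json


def bulk_load_settings(replicas_after=1, refresh_after="1s"):
    """Return (before, after) settings bodies for a large bulk load.

    The values used for restoring are assumptions; use your own index's values.
    """
    # Before the load: no replicas, refresh disabled ("-1").
    before = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}
    # After the load: restore replicas and the refresh interval.
    after = {"index": {"number_of_replicas": replicas_after,
                       "refresh_interval": refresh_after}}
    return before, after


def bulk_body(index_name, docs):
    """Build the newline-delimited JSON body used by bulk requests.

    docs is an iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in docs:
        # One action line, then one source line, per document.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # A bulk body must end with a newline.
    return "\n".join(lines) + "\n"


before, after = bulk_load_settings()
body = bulk_body("products", [("1", {"name": "phone"}), ("2", {"name": "case"})])
print(before)
print(body)
```

Sending many documents in one bulk body avoids one network round trip per document, which is why it pairs well with temporarily disabled refresh and replicas.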

3. What is the inverted index in Elasticsearch?

The traditional way to search is to traverse the whole article and compare word by word to find where the keyword occurs. An inverted index instead uses a word segmentation strategy to build a mapping table between words and the articles that contain them. This dictionary + mapping table structure is the inverted index, somewhat like the Xinhua dictionary we used at school. An inverted index can greatly improve query efficiency.
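As a rough illustration of the dictionary + mapping table idea, here is a minimal sketch. It assumes whitespace splitting as the word segmentation strategy, which is far simpler than a real analyzer, and the function names are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed segmentation strategy: lowercase and split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, word):
    """Look the word up in the dictionary instead of scanning every document."""
    return sorted(index.get(word.lower(), set()))


docs = {1: "red phone case", 2: "blue phone", 3: "red shoes"}
index = build_inverted_index(docs)
print(search(index, "phone"))
print(search(index, "red"))
```

The lookup cost depends on the dictionary, not on the total amount of text, which is where the speedup over a full scan comes from.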

4. What do you do when there is too much index data? How do you tune and deploy?

1. At design time, create indexes from a template and roll them over by time, so that each day's incremental data goes into its own index and no single index grows too large.

2. For storage, separate hot and cold data, for example treating the data from the past 3 days as hot and the rest as cold. Since cold data is no longer written to, consider periodic force_merge and shrink (compression) operations to save space and improve retrieval efficiency.

3. Since ES supports dynamic scaling, you can add more machines to relieve the pressure on the cluster.
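Points 1 and 2 can be sketched as two small helpers: one that derives a per-day index name, and one that classifies an index as hot or cold by age. The naming pattern (prefix-YYYY.MM.DD) and the function names are assumptions for illustration; the 3-day threshold comes from the example above.

```python
from datetime import date


def daily_index_name(prefix, day):
    """Build a per-day index name such as 'orders-2024.01.05' (assumed pattern)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def tier_for(index_day, today, hot_days=3):
    """Indexes from the past `hot_days` days are hot; everything older is cold."""
    age_in_days = (today - index_day).days
    return "hot" if age_in_days < hot_days else "cold"


today = date(2024, 1, 5)
print(daily_index_name("orders", today))
print(tier_for(date(2024, 1, 3), today))
print(tier_for(date(2024, 1, 1), today))
```

A scheduled job could use a classification like this to decide which indexes are candidates for force_merge and shrink.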

5. How does Elasticsearch implement master election?

Prerequisites:

(1) Only master candidate nodes (master: true) can become the master node.

(2) The minimum number of master nodes (min_master_nodes) exists to prevent split brain.

Implementation steps

Step 1: Confirm that the number of master candidate nodes reaches the required minimum, that is, the min_master_nodes value described above.

Step 2: Compare the nodes. First check whether each node is qualified to be master; nodes that are master candidates take priority.

If both nodes are master candidates, the one with the smaller id becomes the master node. Note that the id here is a string, so the comparison is a string comparison.

Supplement: The master node is mainly responsible for managing the cluster, nodes and indexes, and is not responsible for document-level management; data nodes can turn off the http function.
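The two steps above can be sketched as follows. This is a simplified model of the described rules, not Elasticsearch's actual election code; the node dictionaries and the function name are hypothetical.

```python
def elect_master(nodes, min_master_nodes):
    """Pick a master from `nodes`, or return None if there are too few candidates.

    Each node is a dict with an 'id' (string) and a 'master_eligible' flag.
    """
    # Only master candidate nodes take part in the election.
    candidates = [n for n in nodes if n["master_eligible"]]
    # Step 1: the number of candidates must reach the configured minimum.
    if len(candidates) < min_master_nodes:
        return None
    # Step 2: the smallest id wins; ids are strings, so this compares as strings.
    return min(candidates, key=lambda n: n["id"])["id"]


nodes = [
    {"id": "node-b", "master_eligible": True},
    {"id": "node-a", "master_eligible": True},
    {"id": "node-0", "master_eligible": False},
]
print(elect_master(nodes, min_master_nodes=2))
```

Because the ids are compared as strings, "10" sorts before "9", which is worth remembering if node ids look numeric.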

6. Describe in detail how Elasticsearch indexes a document

Indexing a document here means the process of writing the document to ES and creating the index.

Step 1: The client sends a write request to a node in the cluster. (If no routing/coordinating node is specified, the node that receives the request acts as the routing node.)

Step 2: After Node 1 receives the request, it uses the document id to determine that the document belongs to shard 0. The primary shard of shard 0 is allocated on Node 3, so the request is forwarded to Node 3;

Step 3: Node 3 performs the write on the primary shard. If it succeeds, Node 3 forwards the request in parallel to the replica shards on Node 1 and Node 2. Once all replica shards report success, Node 3 reports success to the coordinating node (Node 1), and Node 1 reports success to the requesting client.

If the interviewer follows up: how is the document's shard determined in Step 2?

Answer: Through the routing algorithm, which computes the target shard id from the routing value and the document id.
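A minimal sketch of such a routing calculation, in the commonly documented form shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. The hash here is CRC32 as a stand-in (Elasticsearch uses its own Murmur3-based hash), so the shard numbers it produces will not match a real cluster.

```python
import zlib


def shard_for(doc_id, num_primary_shards, routing=None):
    """Return the shard number for a document.

    If no explicit routing value is given, the document id is used.
    zlib.crc32 is only a stand-in for the real hash function.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(str(key).encode("utf-8")) % num_primary_shards


print(shard_for("product-42", 5))
print(shard_for("product-42", 5, routing="user-7"))
```

The formula also shows why the number of primary shards cannot simply be changed afterwards: a different modulus would send existing ids to different shards.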

7. What Linux settings should be optimized when deploying Elasticsearch?

1. Turn off swap;

2. Set the heap memory to min(node memory / 2, 32GB);

3. Set the maximum number of file handles;
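The heap rule in point 2 is simple arithmetic, sketched below. The function names are made up for this example; rendering the result as matching -Xms/-Xmx JVM options (same minimum and maximum heap) is an assumption about how the value would be applied.

```python
def recommended_heap_gb(node_memory_gb, cap_gb=32):
    """Apply the rule min(node memory / 2, 32GB), in whole gigabytes."""
    return min(node_memory_gb // 2, cap_gb)


def jvm_heap_options(node_memory_gb):
    """Render the heap size as JVM options with equal min and max heap."""
    heap = recommended_heap_gb(node_memory_gb)
    return f"-Xms{heap}g -Xmx{heap}g"


print(jvm_heap_options(64))
print(jvm_heap_options(16))
```

For the 64G machines mentioned earlier, half of the memory is already at the 32GB cap; the rest stays available to the operating system's file cache.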

8. What is the internal structure of Lucene? (to be supplemented)

The core of Lucene is divided into two parts: index creation and index search;

9. Suppose an Elasticsearch cluster has 20 nodes, and 10 of them elect one master while the other 10 elect another master. What should be done?

This mainly involves the issue of split brain.

1. When the cluster has 3 or more master candidates, the split-brain problem can be solved by setting the minimum number of votes required (minimum_master_nodes) to more than half of all candidate nodes;

2. When there are only two candidates, configure only one of them as a master candidate and use the other as a data node, which avoids the split-brain problem.
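The "more than half" rule can be checked with a little arithmetic. The sketch below assumes all 20 nodes in the question are master candidates; the function names are made up for this example.

```python
def minimum_master_nodes(master_candidates):
    """Smallest number that is more than half of the candidates."""
    return master_candidates // 2 + 1


def can_elect(partition_size, master_candidates):
    """A group of nodes may elect a master only if it reaches the minimum."""
    return partition_size >= minimum_master_nodes(master_candidates)


print(minimum_master_nodes(20))
print(can_elect(10, 20))
print(can_elect(11, 20))
```

With 20 candidates the minimum is 11, so neither group of 10 can elect a master and the cluster cannot end up with two masters.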

10. When a client connects to an ES cluster, how is a specific node chosen to execute the request?
The client connects to the Elasticsearch cluster remotely through the transport module. It does not join the cluster; it simply obtains one or more initial transport addresses and communicates with them in round-robin fashion.
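Round-robin selection over a fixed list of addresses can be sketched like this. The class name and the addresses are hypothetical, and the sketch ignores node failures and address sniffing, which a real client would have to handle.

```python
from itertools import cycle


class RoundRobinAddresses:
    """Hand out the configured transport addresses one after another."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one transport address is required")
        # cycle() repeats the list endlessly in its original order.
        self._addresses = cycle(list(addresses))

    def next_address(self):
        return next(self._addresses)


picker = RoundRobinAddresses(["10.0.0.1:9300", "10.0.0.2:9300"])
for _ in range(3):
    print(picker.next_address())
```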

11. Describe in detail how Elasticsearch updates and deletes documents
Deletes and updates are also write operations, but documents in Elasticsearch are immutable, so they cannot be deleted or modified in place. A delete does not really remove the document: every segment on disk has a .del file, and the document is marked as deleted in that file. It can still match a query, but it is filtered out of the results. Updates work the same way: the old document is marked as deleted in the .del file and the new version is written as a new document. The old version can still match a query, but documents marked as deleted are filtered out of the results.
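The mark-as-deleted behaviour can be modelled with an append-only list plus a set of deleted positions. This is a toy model under stated assumptions: a single in-memory "segment", a Python set standing in for the .del file, and made-up class and method names.

```python
class TinyIndex:
    """Append-only document store with a set that plays the role of the .del file."""

    def __init__(self):
        self.docs = []        # (doc_id, text) entries; never edited in place
        self.deleted = set()  # positions in self.docs marked as deleted

    def _live_position(self, doc_id):
        for pos, (stored_id, _) in enumerate(self.docs):
            if stored_id == doc_id and pos not in self.deleted:
                return pos
        return None

    def index(self, doc_id, text):
        # An update marks the old copy as deleted and appends the new version.
        old = self._live_position(doc_id)
        if old is not None:
            self.deleted.add(old)
        self.docs.append((doc_id, text))

    def delete(self, doc_id):
        # A delete only marks the entry; the data itself stays where it is.
        pos = self._live_position(doc_id)
        if pos is not None:
            self.deleted.add(pos)

    def search(self, word):
        # Deleted entries still match here, then get filtered out of the results.
        matches = [pos for pos, (_, text) in enumerate(self.docs)
                   if word in text.split()]
        return [self.docs[pos] for pos in matches if pos not in self.deleted]


idx = TinyIndex()
idx.index("1", "red phone")
idx.index("1", "blue phone")   # update: the old version is only marked deleted
print(idx.search("phone"))
print(len(idx.docs))
```

In this model the space used by deleted entries is never reclaimed; the force_merge operation mentioned earlier is what addresses that in practice.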

12. Describe in detail how Elasticsearch executes a search
When searching, we do not know in advance which documents a query will hit, so all shards in the index are queried. Since the ordering of data within one shard is not the same as the ordering across the whole index, the search runs in two phases: a query phase followed by a fetch phase (query then fetch). The client sends the request to some node, and this node forwards it to every shard of the index. Each shard then runs the retrieval. Once all shards have been queried, the results are returned to this node, which merges and sorts them and finally returns the query results to the client.
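The two phases can be sketched as a scatter-gather over in-memory shards. Everything here is a simplification for illustration: shards are plain dictionaries, the score is just the number of occurrences of the word, and the function names are made up.

```python
import heapq


def query_phase(shard, word, size):
    """Run the query on one shard and return its local top (score, doc_id) pairs."""
    hits = [(text.split().count(word), doc_id)
            for doc_id, text in shard.items()
            if word in text.split()]
    return heapq.nlargest(size, hits)


def search(shards, word, size):
    # Query phase: ask every shard, collecting only scores and ids.
    partial = [hit for shard in shards for hit in query_phase(shard, word, size)]
    # The coordinating node merges the per-shard results into a global order.
    top = heapq.nlargest(size, partial)
    # Fetch phase: load full documents only for the global top hits.
    by_id = {doc_id: text for shard in shards for doc_id, text in shard.items()}
    return [(doc_id, by_id[doc_id]) for _, doc_id in top]


shards = [
    {"a": "phone phone case", "b": "red shoes"},
    {"c": "phone", "d": "phone phone phone"},
]
print(search(shards, "phone", size=2))
```

Each shard has to return its own top `size` hits because the coordinating node cannot know in advance which shard holds the globally best documents.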

13. In Elasticsearch, how do you find the corresponding inverted index entry based on a word?
First of all, we should understand what an inverted index is. An inverted index uses a word segmentation strategy to build a mapping table between words and articles. This dictionary + mapping table structure is the inverted index. So when we search for a word, the word is looked up in the dictionary, the mapping table gives the ids of the documents that contain it, and the matching documents are returned to the client.

14. What should you pay attention to regarding GC when using Elasticsearch?
To be answered after a closer look at the JVM.

15. How does Elasticsearch aggregate over large data volumes (on the order of hundreds of millions)?
The first approximate aggregation Elasticsearch provides is the cardinality metric. It returns the cardinality of a field, that is, the number of distinct (unique) values of that field. It is based on the HLL algorithm: HLL first hashes the input, then makes a probabilistic estimate from the bits of the hash to obtain the cardinality.
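To make the hash-then-estimate idea concrete, here is a small textbook-style HyperLogLog sketch. It is not Elasticsearch's implementation; the choice of SHA-1 as the hash, 4096 registers, and the class name are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Hash the input to 64 bits (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The first p bits select a register.
        index = h >> (64 - self.p)
        # The remaining bits give the rank: position of the leftmost 1-bit.
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))
```

The memory used is fixed by the number of registers, not by the number of distinct values, which is what makes this kind of estimate practical at very large data volumes. Adding a value that was already seen never changes a register, so duplicates do not affect the result.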

16. How does Elasticsearch guarantee read and write consistency under concurrency?
1. Optimistic concurrency control through version numbers ensures that a newer version is not overwritten by an older one; the application layer handles the specific conflicts;

2. In addition, for write operations, the default consistency level allows a write only when a majority of shards are available. Even then, if writing to a replica fails, for example because of the network, that replica is considered faulty and the shard is rebuilt on a different node.

3. For read operations, you can set replication so that an operation returns only after both the primary shard and the replica shards have completed it; you can also set a search request parameter to query the primary shard, which ensures the document is the latest version.
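The version-based optimistic concurrency control from point 1 can be sketched with an in-memory store. The class names, the exception, and the expected_version parameter are hypothetical and only model the idea; they are not Elasticsearch's API.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if the caller read an older version than the current one.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}")
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1


store = VersionedStore()
store.put("1", {"stock": 10})                       # creates version 1
store.put("1", {"stock": 9}, expected_version=1)    # succeeds, now version 2
try:
    store.put("1", {"stock": 8}, expected_version=1)  # stale version
except VersionConflict as error:
    print("conflict:", error)
```

No lock is held between the read and the write; the conflict is detected at write time and the application decides whether to re-read and retry or to give up.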