1. List the steps for installing hadoop
a) Create a hadoop user account
b) Configure the IP address
c) Install Java and configure the environment variables in /etc/profile
d) Modify the hosts file to map the host names
e) Install ssh and configure passwordless login
f) Unzip the hadoop package
g) Edit the configuration files under hadoop's conf directory
h) Run hadoop namenode -format
i) Start the cluster
2. List all processes and functions in the hadoop cluster startup
a) NameNode: manages the cluster and records the file metadata
b) SecondaryNameNode: serves as a helper/backup and takes periodic snapshots (checkpoints) of the metadata
c) DataNode: stores the data blocks
d) JobTracker: manages jobs and assigns tasks
e) TaskTracker: executes tasks
3. How do you troubleshoot a NameNode-related startup error?
a) Check whether hdfs has been started successfully
b) Check whether the input file exists
4. Write out the following execution command
Kill a job:
hadoop job -list    (find the job id)
hadoop job -kill <job_id>
Delete the /temp/aa directory on HDFS:
hadoop fs -rmr /temp/aa
Add a new node or remove a node, then refresh the cluster status:
hadoop-daemon.sh start datanode    (start the DataNode on the new node)
hadoop dfsadmin -refreshNodes
5. List the Hadoop job schedulers you know and explain how they work
a) FIFO scheduler: the default scheduler; jobs are executed first in, first out
b) Capacity scheduler (computing-capacity scheduler): prefers the job that occupies less memory and has a higher priority
c) Fair scheduler: all jobs running at the same time receive an equal share of the resources
6. List development map/reduce metadata storage
a)
7. Use the language you are most familiar with to write a MapReduce job, e.g. counting word occurrences
a) Wordcount
8. What are the pros and cons of developing MapReduce in Java versus Hive SQL?
a) Java: writing MapReduce in Java can implement complex logic, but it is overly cumbersome when the requirements are simple.
b) Hive SQL: basically written against table data in Hive; simple requirements are easy to express, but complex logic is difficult to implement.
9. What are the ways Hive can store its metadata? What are their characteristics?
a) Embedded Derby database
b) Local MySQL (commonly used)
c) Remote MySQL
10. Briefly describe how Hadoop implements secondary cache
11. Briefly describe the various ways Hadoop can implement a join.
2.1 reduce side join
reduce side join is the simplest way to join. Its main idea is as follows:
In the map stage, the map function reads two files, File1 and File2, at the same time. To distinguish key/value pairs from the two sources, a tag is attached to each record; for example, tag=0 means the record comes from File1 and tag=2 means it comes from File2. In other words, the main task of the map stage is to tag the data from the different files.
In the reduce stage, the reduce function receives, for each key, the value list gathered from both File1 and File2, and then joins (takes the Cartesian product of) the File1 and File2 records for that key. That is, the actual join is performed in the reduce stage.
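A minimal sketch of the tagging step described above, written against Hadoop's mapreduce API; the file names, tag values, delimiter, and the assumption that the join key is the first tab-separated field are illustrative, not part of the original text:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tag each record with its source file so the reducer can tell File1 from File2.
public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Which input file did this record come from?
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String tag = fileName.contains("File1") ? "0" : "2";   // illustrative tag values

        // Assume the join key is the first tab-separated field (an assumption).
        String[] fields = line.toString().split("\t", 2);
        if (fields.length < 2) return;
        context.write(new Text(fields[0]), new Text(tag + "\u0001" + fields[1]));
    }
}

The matching reducer would separate the received values by tag and emit the Cartesian product of the File1 group and the File2 group for each key.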
2.2 map side join
The reason why reduce side join exists is that all required join fields cannot be obtained in the map stage, that is, the fields corresponding to the same key may be in different maps. Reduce side join is very inefficient because a large amount of data transmission is required in the shuffle stage.
Map side join is an optimization for the following scenarios: among the two tables to be joined, one table is very large, while the other table is so small that the small table can be stored directly in memory. In this way, we can copy multiple copies of the small table, so that one copy exists in each map task memory (such as stored in a hash table), and then only scan the large table: for each record key/value in the large table, find out whether there is a record of the same key in the hash table. If so, then connect and output it.
In order to support file copying, Hadoop provides a class DistributedCache, the method using this class is as follows:
(1) The user calls the static method DistributedCache.addCacheFile() to specify the file to be copied. Its parameter is the URI of the file (for a file on HDFS this looks like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port you configured). Before the job starts, the JobTracker takes this URI list and copies the corresponding files to the local disk of each TaskTracker. (2) The user calls DistributedCache.getLocalCacheFiles() to obtain the local file paths and then reads the files with the standard file I/O API.
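A minimal sketch of the map side join built on DistributedCache, under the assumption that both tables are tab-delimited and the join key is the first field; the paths, layouts and class name are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> smallTable = new HashMap<String, String>();

    // In the driver, before submitting the job:
    // DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/home/XXX/file"),
    //                               job.getConfiguration());

    @Override
    protected void setup(Context context) throws IOException {
        // Read the copy of the small table that was shipped to this TaskTracker.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] kv = line.split("\t", 2);           // assumed tab-delimited small table
            if (kv.length == 2) smallTable.put(kv[0], kv[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] kv = line.toString().split("\t", 2);    // assumed layout of the big table
        if (kv.length < 2) return;
        String match = smallTable.get(kv[0]);
        if (match != null) {                             // join and emit only matching keys
            context.write(new Text(kv[0]), new Text(kv[1] + "\u0001" + match));
        }
    }
}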
2.3 SemiJoin
SemiJoin, also known as semi-connection, is a method borrowed from distributed databases. Its motivation is: for reduce side join, the data transmission volume across machines is very large, which has become a bottleneck in join operations. If data that will not participate in join operations can be filtered out on the map side, it can greatly save network IO.
The implementation method is very simple: select a small table, assume it is File1, extract the key participating in the join, and save it to file File3. The File3 file is generally very small and can be placed in memory. In the map stage, use DistributedCache to copy File3 to each TaskTracker, and then filter out the records corresponding to the keys in File2 that are not in File3. The remaining reduce stage work is the same as reduce side join.
12.
Non-recursive binary search
public class BinarySearchClass {

    public static int binary_search(int[] array, int value) {
        int beginIndex = 0;               // low index
        int endIndex = array.length - 1;  // high index
        int midIndex = -1;
        while (beginIndex <= endIndex) {
            midIndex = beginIndex + (endIndex - beginIndex) / 2; // prevents overflow
            if (value == array[midIndex]) {
                return midIndex;
            } else if (value < array[midIndex]) {
                endIndex = midIndex - 1;
            } else {
                beginIndex = midIndex + 1;
            }
        }
        // Found: return the index of the value; not found: return -1
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Start...");
        int[] myArray = new int[] { 1, 2, 3, 5, 6, 7, 8, 9 };
        System.out.println("Find the index of the number 8:");
        System.out.println(binary_search(myArray, 8));
    }
}
14.
The role of partition
16. combine runs on the map side (and again during the merge on the reduce side); it merges key/value pairs that share the same key and can be customized.
The combine function merges the <key,value> pairs produced by a map function (many values for one key) into a new <key2,value2>, and this new <key2,value2> becomes the input of the reduce function.
This value2 can also be thought of as values because there are several of them. The purpose of the merge is to reduce network transmission.
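A minimal sketch of a combiner, assuming a word-count style job whose map output is <word, 1>; the class name is illustrative. Because summing is associative and commutative, the same class can be registered as both combiner and reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Registered with job.setCombinerClass(IntSumCombiner.class) in the driver.
public class IntSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();         // merge the many <key, 1> pairs produced on the map side
        }
        result.set(sum);
        context.write(key, result); // emit one <key, partial-sum> pair, shrinking shuffle traffic
    }
}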
The partition maps the output of each map node to the different reducers according to the key, and it can also be customized. It is easy to understand with an analogy:
we are categorizing a jumble of data. For example, a zoo has cows, sheep, chickens, ducks and geese. During the day they are all mixed together, but at night the cows return to the cowshed, the sheep to the sheep pen, and the chickens to the chicken coop. The purpose of partitioning is to classify the data in the same way. When we write MapReduce programs, the default HashPartitioner does this classification for us, but we can also customize it.
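A minimal sketch of a custom partitioner in the spirit of the zoo analogy; the key prefixes and partition assignments are illustrative assumptions:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route records to reducers by a category prefix of the key instead of the default
// HashPartitioner. Registered with job.setPartitionerClass(AnimalPartitioner.class)
// together with job.setNumReduceTasks(...).
public class AnimalPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        int p;
        if (k.startsWith("cow_")) {
            p = 0;                                   // all cows go to the same "cowshed" reducer
        } else if (k.startsWith("sheep_")) {
            p = 1;                                   // all sheep go to the "sheep pen" reducer
        } else {
            p = k.hashCode() & Integer.MAX_VALUE;    // everything else, like HashPartitioner
        }
        return p % numPartitions;
    }
}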
shuffle is the process between map and reduce, which includes the combination and partition at both ends.
The results of the map will be distributed to the Reducer through partition. After the Reducer completes the Reduce operation, it will output through OutputFormat.
The main function of the reduce-side shuffle is fetchOutputs(), which copies the output of the map stage to the local disk of the reduce node.
17. Use the Linux command to complete the following work
a) Count the number of IPs in each of the two files and summarize them
b) The IPs that appear in file b but not in file a
c) For each name, the number of times it appears and its IP
Difference between internal and external tables
1. When data is imported into an external table, the data is not moved into Hive's own warehouse directory; in other words, the data of an external table is not managed by Hive itself. This is the first way external tables differ from internal (managed) tables.
2. When an internal table is dropped, Hive deletes both the metadata and the data belonging to the table; when an external table is dropped, Hive deletes only the metadata, and the data itself is kept.
So, which kind of table should you choose? In most cases there is little difference, so it is largely a matter of personal preference. As a rule of thumb, if all processing is to be done by Hive, create an internal table; otherwise use an external table.
How should the rowkey be designed? And how should the column families be created?
HBase is a distributed, column-oriented database. The biggest difference between it and general relational databases is that HBase is very suitable for storing unstructured data, and it is based on columns rather than row-based patterns.
Since HBase stores data as KeyValues in column-oriented fashion, the Rowkey is the Key of a KeyValue and identifies a unique row. A Rowkey is a binary stream with a maximum length of 64KB whose content can be decided by the user. When data is loaded, rows are generally stored in ascending binary order of the Rowkey.
HBase is retrieved based on Rowkey. The system finds the Region where a Rowkey (or a Rowkey range) is located, and then routes the query data request to the Region to obtain the data. HBase search supports 3 ways:
(1) Access through a single Rowkey, that is, get operations according to a certain Rowkey key value, so as to obtain a unique record;
(2) Scan through Rowkey's range, that is, scan within this range by setting startRowKey and endRowKey. This allows you to obtain a batch of records according to specified conditions;
(3) Full table scan, that is, directly scan all row records in the entire table.
Retrieval by a single Rowkey is very efficient, taking less than 1 millisecond, and HBase can serve 1,000 to 2,000 such reads per second; queries on non-key columns, however, are very slow.
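A minimal sketch of the three access patterns using the classic HTable client API (HBase 0.9x era); the table name, rowkeys and configuration are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RowkeyAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo_table");   // assumed table name

        // (1) Single-rowkey access: a get for one exact row.
        Result one = table.get(new Get(Bytes.toBytes("row-0001")));
        System.out.println("single get: " + one);

        // (2) Range access: scan between a start and a stop rowkey.
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("row-0001"));
        scan.setStopRow(Bytes.toBytes("row-0100"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println("scanned: " + r);
        }
        scanner.close();

        // (3) Full-table scan: a Scan with no start/stop row touches every row.
        table.close();
    }
}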
2 HBase's RowKey design
2.1 Design Principles
2.1.1 Rowkey length principle
A Rowkey is a binary stream. Many developers suggest designing it to be between 10 and 100 bytes, but the shorter the better, ideally no more than 16 bytes.
The reasons are as follows:
(1) Data is persisted in HFiles as KeyValues. If the Rowkey is too long, say 100 bytes, then 10 million rows spend 100 * 10 million = 1 billion bytes, nearly 1GB, on Rowkeys alone, which greatly hurts the storage efficiency of HFiles;
(2) MemStore will cache part of the data to memory. If the Rowkey field is too long, the effective utilization rate of memory will be reduced, and the system will not be able to cache more data, which will reduce the retrieval efficiency. Therefore, the shorter the byte length of the Rowkey, the better.
(3) Current operating systems are 64-bit and align memory on 8-byte boundaries. Keeping the Rowkey at 16 bytes, an integer multiple of 8 bytes, takes advantage of the operating system's optimal behavior.
2.1.2 Rowkey hashing principle
If Rowkey is incremented by timestamp, do not place time in front of the binary code. It is recommended to use the high bit of Rowkey as a hash field, which is generated by the program loop, and put the time field at the low bit, which will increase the chance of data balancing being distributed in each Regionserver to achieve load balancing. If there is no hash field, the first field is directly the time information, which will generate a hot spot phenomenon where all new data is accumulated on a RegionServer. In this way, when data retrieval is done, the load will be concentrated on individual RegionServers, reducing query efficiency.
2.1.3 Rowkey's unique principle
Its uniqueness must be guaranteed in design.
2.2 Application scenarios
Based on the above three principles of Rowkey, different Rowkey design suggestions should be provided for different application scenarios.
2.2.1 Rowkey design for transaction data
Transaction data is time-based. It is recommended to store the time information in the Rowkey, which helps improve query speed. For transaction data it is recommended to build one table per day; this design has many benefits. With per-day tables, the date part of the time can be dropped and only the milliseconds-within-the-day part kept, which fits in 4 bytes. Adding a 2-byte hash field gives a unique 6-byte Rowkey, as shown below:
Transaction data Rowkey design:
Bytes 0-1: Hash field, 0~65535 (0x0000~0xFFFF)
Bytes 2-5: Time field (milliseconds within the day), 0~86399999 (0x00000000~0x05265BFF)
Bytes 6-...: Extended fields
Such a design cannot save overhead at the operating-system memory-management level, because a 64-bit OS aligns on 8 bytes; however, in persistent storage the Rowkey portion saves 25% of the overhead. Some may ask why the time field is not stored in reversed byte order so that it could double as the hash field. The reason is that data within the same time range should stay as contiguous as possible: searches within one time range are very common, and contiguity benefits query and retrieval, so an independent hash field works better. For some applications the hash field can carry all or part of certain data fields, as long as the same hash value is unique within the same time (millisecond).
2.2.2 Rowkey design for statistics
Statistics also have a time attribute, and the smallest statistical unit is the minute (pre-aggregating by second would be meaningless). As with transaction data, we default to one table per day, with the same benefits. With per-day tables, the time information only needs to keep hours and minutes, so 0~1439 fits in two bytes. Since some statistical dimensions have very many members, a 4-byte sequence field is needed; that sequence field serves as the hash field, and together they form a unique 6-byte Rowkey. As shown below:
Statistics Rowkey design:
Bytes 0-3: Hash field (sequence field), 0x00000000~0xFFFFFFFF
Bytes 4-5: Time field (minutes within the day), 0~1439 (0x0000~0x059F)
Bytes 6-...: Extended fields
The same design cannot save overhead from the operating system memory management level, because the 64-bit operating system must be 8 bytes aligned. However, the Rowkey part in persistent storage can save 25% overhead. Pre-statistical data may involve repeated recalculation requirements, and it is necessary to ensure that the invalid data can be effectively deleted, and at the same time, it cannot affect the balance effect of the hash, so special treatment is required.
2.2.3 Rowkey design for general data
General data uses an auto-incrementing sequence as its unique primary key; users can choose per-day tables or a single table. This pattern requires that the hash field (sequence field) remain unique when multiple loading modules run at the same time; one option is to assign a distinct unique factor to each loading module. The design is shown below.
General data Rowkey design:
Bytes 0-3: Hash field (sequence field), 0x00000000~0xFFFFFFFF
Bytes 4-...: Extended fields (kept within 12 bytes), which can be composed of multiple user fields
2.2.4 RowKey design that supports multi-condition query
When HBase fetches a batch of records according to specified conditions, it uses the scan method. The scan method has the following characteristics:
(1) scan can be sped up with the setCaching and setBatch methods (trading space for time);
(2) scan can limit its range with setStartRow and setStopRow; the smaller the range, the better the performance.
With a clever RowKey design, the records we want to read together can be stored next to each other (ideally in the same Region), which gives good performance when traversing the results.
(3) scan can attach filters through the setFilter method, which is the basis for paging and multi-condition queries.
After satisfying the three principles of length, hashing and uniqueness, we need to consider how to use scan's range feature to speed up fetching a batch of records. The following example shows how to combine multiple columns into a RowKey and use scan's range to achieve faster queries.
example:
What we store in the table is file information. Each file has 5 attributes: file id (long, globally unique), creation time (long), file name (String), classification name (String), and owner (User).
The query conditions we can enter: file creation time interval (such as files created from 20120901 to 20120914), file name ("The Voice of China"), category ("Variety Show"), owner ("Zhejiang Satellite TV").
Suppose we currently have the following files:
ID | CreateTime | Name | Category | UserID
1 | 20120902 | China's Voice Issue 1 | Variety show | 1
2 | 20120904 | China's Voice Issue 2 | Variety show | 1
3 | 20120906 | China's Voice Wild Card Competition | Variety show | 1
4 | 20120908 | China's Voice Issue 3 | Variety show | 1
5 | 20120910 | China's Voice Issue 4 | Variety show | 1
6 | 20120912 | Interview with China's Voice contestants | Variety show highlights | 2
7 | 20120914 | China's Voice Issue 5 | Variety show | 1
8 | 20120916 | China's Voice Recording Highlights | Variety show highlights | 2
9 | 20120918 | Exclusive interview with Zhang Wei | Highlights | 3
10 | 20120920 | Jiaduobao herbal tea advertisement | Variety Advertising | 4
Here, UserID should correspond to another User table and will not be listed for the time being. We just need to know what UserID means:
1 represents Zhejiang Satellite TV; 2 represents the Voice of China crew; 3 represents XX Weibo; 4 represents the sponsor. When the query interface is called with all five conditions at once (20120901, 20121001, "China's Voice", "Variety Show", "Zhejiang Satellite TV"), we should get records 1, 2, 3, 4, 5 and 7. Record 6 does not belong to "Zhejiang Satellite TV", so it should not be selected. When designing the RowKey we can therefore use UserID + CreateTime + FileID as the RowKey, which both satisfies the multi-condition query and keeps the query fast.
The following points need to be noted:
(1) Each field of the RowKey of each record needs to be filled to the same length. If we expect to have up to 100,000 users, the userID should be populated to 6 digits, such as 000001, 000002…
(2) The purpose of adding a globally unique FileID at the end is to make the records corresponding to each file globally unique. Avoid overwriting two different file records with the same UserID as CreateTime.
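A minimal sketch of building the composite RowKey described above (zero-padded 6-digit UserID + 8-digit CreateTime + zero-padded 6-digit FileID); the class and method names are illustrative:

public class FileRowkeyBuilder {
    // UserID padded to 6 digits, CreateTime as yyyyMMdd, FileID padded to 6 digits.
    public static String buildRowKey(long userId, String createTime, long fileId) {
        return String.format("%06d", userId) + createTime + String.format("%06d", fileId);
    }

    public static void main(String[] args) {
        // e.g. userID=1, created 20120902, fileID=1  ->  00000120120902000001
        System.out.println(buildRowKey(1, "20120902", 1));
    }
}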
According to this RowKey, the above file records are stored, and the following structure is in the HBase table:
rowKey(userID 6 + time 8 + fileID 6) name category ….
00000120120902000001
00000120120904000002
00000120120906000003
00000120120908000004
00000120120910000005
00000120120914000007
00000220120912000006
00000220120916000008
00000320120918000009
00000420120920000010
How to use this table?
After creating a scan object, we call setStartRow(00000120120901) and setStopRow(00000120120914).
In this way, only the data with userID=1 is scanned when scanning, and the time range is limited to this specified time period, which satisfies the filtering of results by users and by time range. And because records are stored centrally, the performance is very good.
Then use SingleColumnValueFilter; four of them constrain the lower and upper bounds of name and the lower and upper bounds of category, so that matching by file name and by category name is satisfied at the same time.
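A minimal sketch of this query with the classic HBase client API; the table name, the column family "cf" and the column qualifiers are assumptions not given in the text:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiConditionQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "file_info");              // assumed table name

        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("00000120120901"));          // userID=1, from 20120901
        scan.setStopRow(Bytes.toBytes("00000120120914" + "~"));     // through 20120914 (inclusive trick)

        // Four SingleColumnValueFilters: lower/upper bound on name, lower/upper bound on category.
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("The Voice of China")));
        filters.addFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                CompareOp.LESS_OR_EQUAL, Bytes.toBytes("The Voice of China~")));
        filters.addFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("category"),
                CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("Variety Show")));
        filters.addFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("category"),
                CompareOp.LESS_OR_EQUAL, Bytes.toBytes("Variety Show~")));
        scan.setFilter(filters);

        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(r);
        }
        scanner.close();
        table.close();
    }
}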
4.
How to deal with data skew
6. When a map/reduce program runs, most reduce tasks finish but one or a few reduce tasks run very slowly, so the whole job takes a long time. This happens because some key occurs far more often than the others (sometimes a hundred or a thousand times more); the reduce task that receives that key has to process far more data than the other tasks, so a few tasks keep running long after the rest have finished. This is called data skew.
Data skew often appears when joining data sets with a Hadoop program; one solution is described below.
(1) Choose a hash number N to break up the keys that occur a huge number of times.
(2) For the data set in which the key is heavily duplicated: append a number from 1 to N to the key as the new key. If this data must be joined with another data set, the comparator and partitioner classes have to be overridden (as described in the earlier article "A Method for Hadoop job to Solve Big Data Relationship"). This spreads the once-heavy keys evenly.
int iNum = iNum % iHashNum;   // iNum: a running record counter, reduced modulo the hash number
String strKey = key + CTRLC + iNum + CTRLB + "B";
(3) After the previous step the heavy keys are spread evenly over many reduce nodes. If the data must be joined with another data set, then, to make sure the join partner reaches every one of those reduce nodes, the records of the other (single-key) data set are replicated: the key is emitted once for each of the N suffixes as the new key:
for (int i = 0; i < iHashNum; ++i) {
    String strKey = key + CTRLC + i;
    output.collect(new Text(strKey), new Text(strValues));
}
This solves the data skew problem, and in experiments the job's running time dropped dramatically. However, the method multiplies the volume of one of the data sets, at the cost of more shuffle data, so you need to experiment a few times to find the best hash number.
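A minimal sketch of the salting step on the skewed side, using the mapreduce API; the delimiter, hash number and class name are illustrative:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Salt the skewed side: append a random 0..N-1 suffix so one hot key is spread
// over N reducers. The other (smaller) side must emit every key N times
// (suffixes 0..N-1) so each salted partition still sees its join partner.
public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int HASH_NUM = 10;       // the hash number N; tune experimentally
    private final Random random = new Random();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] kv = line.toString().split("\t", 2);   // assumed tab-delimited input
        if (kv.length < 2) return;
        int salt = random.nextInt(HASH_NUM);
        context.write(new Text(kv[0] + "\u0001" + salt), new Text(kv[1]));
    }
}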
======================================
Although the above method can solve the data skew, when the amount of associated data is huge, if a certain piece of data is increased exponentially, the amount of data of reduce shuffle will become huge, which will not be worth the cost, thus unable to solve the problem of slow running time.
There is a new way to solve the shortcomings of exponential growth data:
Find something the two data sets have in common. For example, besides the join field, both data sets may contain another field with the same meaning; if that field has a fairly low repetition rate across the logs, it can serve as the basis of the hash: if it is numeric, take it modulo the hash number; if it is a string, hash it first and then take the modulo (to avoid too much data landing on the same reduce). If the values of that field are evenly distributed, this solves the problem above without multiplying any data.
How to optimize in the framework
Storm is a distributed real-time computing system for processing high-speed, large-scale data streams; it adds reliable real-time data-processing capability to Hadoop.
Spark relies on in-memory computing. Starting from iterative batch processing, it allows data to be loaded into memory and queried repeatedly, and it combines computing paradigms such as data warehousing, stream processing and graph computing. Spark is built on top of HDFS and integrates well with Hadoop; its RDD abstraction is a major feature.
Hadoop is currently one of the standards for big data management and is used in many commercial application systems. Structured, semi-structured and even unstructured data sets can be integrated easily.
8.
What is HBase's internal mechanism?
In HBase, whether a new row is added or an existing row is modified, the internal process is the same. HBase persists the change after receiving the command, or throws an exception if the write fails. By default a write goes to two places: the write-ahead log (WAL, also known as the HLog) and the MemStore (see Figure 2-1). Writing to both places is how HBase guarantees data durability; the write is considered complete only when the change has been written and confirmed in both.
MemStore is a write buffer in memory, and data in HBase is accumulated here before it is permanently written to the hard disk. When MemStore is filled, the data in it will be flushed to the hard disk to generate an HFile. HFile is the underlying storage format used by HBase. HFile corresponds to a column family. A column family can have multiple HFiles, but an HFile cannot store data from multiple column families. On each node of the cluster, each column family has a MemStore.
Hardware failures are common in large distributed systems, and HBase is no exception. Imagine that the server crashes before the MemStore has been flushed: the data in memory that was never written to disk would be lost. HBase's answer is to write to the WAL before the write is acknowledged. Every server in an HBase cluster maintains a WAL to record changes; the WAL is a file on the underlying file system. A write is not considered successful until the new WAL record has been written, which guarantees that HBase and the file system beneath it are durable. In most cases that underlying file system is the Hadoop Distributed File System (HDFS).
If an HBase server goes down, the data that had not yet been flushed from the MemStore to an HFile is recovered by replaying the WAL. This is not done manually; it is part of HBase's internal recovery process. Each HBase server has a single WAL that is shared by all the tables (and their column families) on that server.
You may have thought that skipping WAL during writing should improve writing performance. But we do not recommend disabling WAL unless you are willing to lose data if something goes wrong. If you want to test it, you can disable WAL:
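A minimal sketch of skipping the WAL with the older HBase 0.9x client API (Put.setWriteToWAL); newer versions express the same thing with setDurability. The table and column names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalPutDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo_table");           // assumed table name

        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        put.setWriteToWAL(false);   // skip the WAL: faster, but this edit is lost if the
                                    // RegionServer dies before the MemStore is flushed
        table.put(put);
        table.close();
    }
}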
Note: Not writing to WAL will increase the risk of data loss in the event of RegionServer failure. Turn off WAL, HBase may not be able to recover data in the event of a failure, and all written data that has not been flushed to the hard disk will be lost.
10.
11. When developing a distributed computing job, can we omit the reduce phase?
Since MapReduce's inputs and outputs are HDFS files, most companies import MySQL or SQL Server data into HDFS and export the results back to a regular database after the computation; this is one respect in which MapReduce is not very flexible. MapReduce's advantage is that it offers a relatively simple distributed programming model, which makes developing this kind of program very easy, whereas earlier MPI programming was quite complex.
In a narrow sense, MapReduce standardizes the computing task; it roughly corresponds to the business-logic part of the Worker in the earlier "little monk" example. MapReduce splits the business logic into two parts, Map and Reduce: the Map part can compute half of the task and hand it to the Reduce part for the rest, though it is also fine to finish the whole computation in the Map part. The implementation details of MapReduce are not covered here; interested readers can look up the material or read the author's earlier post on simulating it in C# [Explore C#'s Mini MapReduce].
If Xiao Ming’s product manager needs are put into Hadoop, the processing process is roughly as follows:
1. Import 100G data into HDFS
2. The processing logic is written according to the Mapreduce interface, divided into two parts: Map and Reduce.
3. Submit the package to the Mapreduce platform and store it in HDFS.
4. The platform has a role called the JobTracker process that distributes tasks. This master handles load scheduling and management, much like the "little monk" master.
5. If five machines take part in the computation, five slave processes called TaskTrackers are started in advance, similar to the standalone version of the "little monk" Worker. The platform separates the program from the business logic: simply put, an independent process runs on each machine and can dynamically load and execute the business-logic code in a jar or dll.
6. After Jobtracker distributes the task to TaskTracker, TaskTracker starts dynamically loading the jar package, creates an independent process to execute the Map part, and then writes the result to HDFS.
7. If there is a Reduce part, TaskTracker will create an independent process to pull the HDFS file output by the Map remotely to the local area through RPC. After the pull is successful, Reduce will start to calculate the subsequent tasks.
8. Reduce then writes the result to HDFS
9. Export the results from HDFS.
This seems to be a simple calculation task
12.
Data compression algorithm
1. Compress the data first, then store it to HDFS
2. Supports data compression within HDFS, and here it can be divided into several methods:
2.1. The compression work is completed on DataNode. Here are two methods:
2.1.1. After the data is received, compress it again
This approach requires the fewest changes to HDFS, but its benefit is also the lowest: simply call a compression tool after a block file is closed to compress it, and decompress the block file when it is opened. A few lines of code are enough.
2.1.2. Compress while receiving data, using a third-party compression library
A compromise between benefit and complexity: hook the system's write and read paths and compress data before it is written to disk, while keeping the external behavior of write and read unchanged. For example, if the original data is 100KB and compresses to 10KB, a write of 100KB still reports 100KB to the caller, not 10KB.
2.2. Leave the compression to the DFSClient; the DataNode only receives and stores
This approach brings the biggest benefit, since the compression work is spread out across the HDFS clients, but the DataNode must be able to tell when a complete block has been received.
The 2.2 approach is recommended for the final implementation: the HDFS changes required are small, and the benefit is the highest.
Scheduling mode
MapReduce is the distributed computing framework (platform) that Hadoop provides. Obviously this platform has many users, and every legitimate user can submit jobs to it, which raises the problem of job scheduling.
Any scheduling strategy weighs several dimensions. Take process scheduling in an operating system: it must balance maximum resource utilization (throughput) against responsiveness (CPU). Operating systems care a lot about responsiveness, so they usually adopt priority-based, preemptive scheduling and favor IO-bound processes (as opposed to compute-bound ones).
Back on the Hadoop platform, MapReduce job scheduling has no strong real-time requirement and was designed for maximum throughput, so the default scheduling strategy of MapReduce is FIFO (implemented on a priority queue, so not a pure FIFO). This strategy is obviously non-preemptive, so a high-priority job can be blocked for a long time by a low-priority, long-running job that was submitted earlier.
16.
The principle of how Hive interacts with the underlying database (HBase)
Hive and HBase have different strengths: Hive is high-latency, structured and analysis-oriented, while HBase is low-latency, unstructured and programmable. Hive is a high-latency data warehouse on Hadoop; by integrating HBase, Hive can use some of HBase's features. The integrated architecture of Hive and HBase is shown below:
Figure 1 Hive and Hbase architecture diagram
Integrating HBase lets Hive make use of HBase's storage features, such as row updates and column indexes. Keep the HBase jar versions consistent during integration. Hive-HBase integration requires a mapping between the Hive table and the HBase table: the columns and column types of the Hive table are associated with the column families and column qualifiers of the HBase table. Every field of the Hive table must exist in HBase, but the Hive table does not have to include every column of the HBase table. The HBase RowKey is mapped by choosing one Hive column and binding it with :key; a column family (cf:) maps to other Hive fields, and a single column is written as cf:cq. Figure 2 below shows a Hive table mapped to an HBase table:
Figure 2 Hive table mapping HBase table
18.
Filter implementation principles
So far you have seen that HBase has a flexible logical schema and a simple physical model that let applications work closer to the disk and the network and optimize at that level. Designing an effective schema is one part of using HBase, and you now have the concepts to do it: you can design row keys so that data accessed together is stored together on disk, saving seek time during reads and writes. When reading data you often need to operate on it against some criterion, and you can optimize data access further; filters are a powerful feature for that case.
We have not yet discussed realistic scenarios for using filters. Usually, adjusting the table design can optimize the access pattern, but sometimes you have already tuned the table design as well as it can be for the different access patterns. When you still need to reduce the data returned to the client, that is when filters come in. Filters are sometimes called push-down predicates: they let you push filtering criteria from the client down to the server (as shown in Figure 4.16). That filter logic is applied during the read and affects what is returned to the client, which saves network IO by reducing the data sent over the wire. The data still has to be read from disk into the RegionServer, and the filter runs on the RegionServer. Because HBase tables can hold a huge amount of data, the network-IO savings matter: reading everything, sending it to the client and filtering there would be very expensive.
Figure 4.16 Filtering data on the client means reading the data from the RegionServer to the client and applying the filter logic there; filtering on the server side means pushing the filter logic down to the RegionServer, reducing the amount of data sent over the network to the client. In essence, filters save network IO overhead, and sometimes even disk IO overhead.
HBase provides an API for implementing custom filters, and multiple filters can be combined. Filtering can happen at the start of the read, based on the row key; after that, filtering can also be applied to the KeyValues read from HFiles. A filter must implement the Filter interface in the HBase jar or extend an abstract class that implements it. We recommend extending the FilterBase abstract class so that you do not have to write boilerplate code. Extending other classes such as CompareFilter is also an option and works fine. When a row is read, the interface exposes several methods that are invoked at different points (in the order shown in Figure 4.17). They are always executed in the order described below:
1. The first method called filters on the row key:
boolean filterRowKey(byte[] buffer, int offset, int length)
Based on your logic here, return true if the row should be filtered out (excluded from the result set), or false if it should be sent to the client.
2. If the row is not filtered out in the previous step, this method is invoked for every KeyValue object of the current row:
ReturnCode filterKeyValue(KeyValue v)
This method returns a ReturnCode, an enum defined in the Filter interface, which decides what happens to that KeyValue object.
3. After the KeyValue objects have been filtered in step 2, this method is called:
void filterRow(List kvs)
It is passed the list of KeyValue objects that survived the filtering; if the method has access to that list, you can apply any transformation or operation to its elements.
4. If you choose to filter out certain rows, this method provides one more chance to do so:
boolean filterRow()
Return true to filter out the row under consideration.
5. You can build logic into the filter to stop a scan early, and that logic goes into this method:
boolean filterAllRemaining()
When you scan many rows looking for something specific in the row key, column qualifier or cell value, you stop caring about the remaining rows once the target is found; that is when this method is very handy. It is the last method called in the filtering process.
Figure 4.17 The steps of the filtering process, executed for every row within the scan range of the scanner object.
Another useful method is reset(). It resets the filter and is called by the server after the filter has been applied to an entire row.
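A minimal sketch of a custom filter against the older, Writable-based filter API (HBase 0.9x); it keeps only rows whose row key starts with a given prefix. The prefix and class name are illustrative, and newer HBase versions serialize filters with protobuf instead:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Rows whose row key does not start with the prefix are dropped on the RegionServer,
// before any of their data crosses the network.
public class PrefixOnlyFilter extends FilterBase {

    private String prefix = "000001";   // illustrative prefix

    public PrefixOnlyFilter() { }                        // needed for deserialization
    public PrefixOnlyFilter(String prefix) { this.prefix = prefix; }

    @Override
    public boolean filterRowKey(byte[] buffer, int offset, int length) {
        String rowKey = Bytes.toString(buffer, offset, length);
        return !rowKey.startsWith(prefix);   // true = filter the row out of the result set
    }

    // Pre-protobuf HBase ships filters to the server as Writables.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(prefix);
    }

    public void readFields(DataInput in) throws IOException {
        prefix = in.readUTF();
    }
}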
Note that this API is very powerful, but we don’t think it is widely used in the application system. In many cases, if the pattern design changes, use the filter
20.
How much output is
22. Map tasks are run on the machines closest to the data they process. The number of reduces is determined by how the keys you produce are partitioned; it is hard to say in general which machine will finish faster, and the goal is to keep all machines balanced.
The 20 maps mentioned above are determined by the file sizes and the number of input splits/records.
The key and value depend on the input format you choose; by default the key is the byte offset of the line. You then write the map function to decide what the key and value are. Values for the same key are grouped together and passed to the reduce, the reduce step runs next, and the result is finally written to the specified location.
Your familiarity with Hive and Hive SQL (HQL)
Hive is a data warehouse analysis system built on Hadoop. It offers rich SQL-style querying over data stored in the Hadoop distributed file system: structured data files can be mapped onto database tables and queried with full SQL functionality, and the SQL statements are converted into MapReduce jobs for execution. Its query language is called Hive SQL (HQL), which makes it very convenient for users unfamiliar with MapReduce to query, summarize and analyze data in SQL. MapReduce developers can also plug in their own mappers and reducers to handle complex analysis that the built-in capabilities cannot express.
Hive's SQL differs slightly from that of relational databases, but it supports most statements, such as DDL and DML, as well as common aggregate functions, joins and conditional queries. Hive is not suited to online transaction processing and does not provide real-time queries; it fits best with batch jobs over large amounts of immutable data.
Features of HIVE: scalable (dynamic addition of devices on Hadoop cluster), scalable, fault-tolerant, loose coupling of input formats.
The official document of Hive has a very detailed description of the query language. Please refer to: /hadoop/Hive/LanguageManual. Most of the content of this article is translated from this page, and some things that need to be noticed during use have been added during this period.
1. DDL Operation
DDL
•Create table
•Delete table
•Modify the table structure
•Create/delete views
•Create a database
•Show commands
Create table:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
•CREATE TABLE creates a table with the given name. If a table with the same name already exists, an exception is thrown; the user can ignore the exception with the IF NOT EXISTS option
•EXTERNAL keyword allows users to create an external table, and while creating the table, specify a path to the actual data (LOCATION)
•LIKE allows users to copy existing table structures, but not data
•COMMENT can add descriptions to tables and fields
•ROW FORMAT
DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
Users can customize SerDe or use the built-in SerDe when creating tables. If ROW FORMAT or ROW FORMAT DELIMITED is not specified, the SerDe that comes with it will be used. When creating a table, the user also needs to specify a column for the table. While specifying the columns of the table, the user will also specify a custom SerDe. Hive uses SerDe to determine the data of the specific columns of the table.
•STORED AS
SEQUENCEFILE
| TEXTFILE
| RCFILE
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.
Create a simple table:
hive> CREATE TABLE pokes (foo INT, bar STRING);
Create an external table:
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
Create partition table
CREATE TABLE par_table(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(date STRING, pos STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
Bucket table
CREATE TABLE par_table(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(date STRING, pos STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
Create table and create index fields ds
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Copy an empty table
CREATE TABLE empty_key_value_store
LIKE key_value_store;
example
create table user_info (user_id int, cid string, ckid string, username string)
row format delimited
fields terminated by '\t'
lines terminated by '\n';
The data format of the table to be imported: fields are separated by tabs and rows by newlines.
The file content looks like this:
100636 100890 c5c86f4cddc15eb7 yyyvybtvt
100612 100865 97cc70d411c18b6f gyvcycy
100078 100087 ecd6026a15ffddf5 qa000100
Show all tables:
hive> SHOW TABLES;
List the tables that match a regular expression:
hive> SHOW TABLES '.*s';
Modify the table structure
•Add partitions and delete partitions
•Rename table
•Modify the column name, type, location, and comment
•Add/Update columns
•Add table metadata information
Add a column to the table:
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
Add a column and add column field comments
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
Change the table name:
hive> ALTER TABLE events RENAME TO 3koobecaf;
Delete a table:
hive> DROP TABLE pokes;
Add and delete partitions
•Increase
ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
: PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
•delete
ALTER TABLE table_name DROP partition_spec, partition_spec,...
Rename table
•ALTER TABLE table_name RENAME TO new_table_name
Modify the column name, type, location, and comment:
•ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
•This command can allow changing column names, data types, comments, column positions, or any combination of them
Add a column to the table:
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
Add a column and add column field comments
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
Increase/Update columns
•ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
• ADD represents a new field, and the field position is after all columns (before the partition column)
REPLACE means replacing all fields in the table.
Add metadata information of the table
•ALTER TABLE table_name SET TBLPROPERTIES table_properties table_properties:
:[property_name = property_value…..]
• Users can use this command to add metadata to the table
Change the format and organization of table files
•ALTER TABLE table_name SET FILEFORMAT file_format
•ALTER TABLE table_name CLUSTERED BY(userid) SORTED BY(viewTime) INTO num_buckets BUCKETS
•This command modifies the physical storage properties of the table
Create/delete views
•CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ][COMMENT view_comment][TBLPROPERTIES (property_name = property_value, ...)] AS SELECT
•Add view
•If no column names are provided, the view's column names are derived automatically from the defined SELECT expression.
•If the underlying table changes, the change is not reflected in the view, and queries against an invalidated view will fail.
•View is read-only and cannot be used with LOAD/INSERT/ALTER
•DROP VIEW view_name
•Delete view
Create a database
•CREATE DATABASE name
Show commands
•show tables;
•show databases;
•show partitions table_name;
•show functions;
•describe extended table_name.col_name;
2. DML Operation: Metadata Storage
Hive does not support inserting rows one at a time with INSERT statements, nor does it support UPDATE. Data is loaded into an existing table with LOAD; once imported, it cannot be modified.
DML includes: INSERT, UPDATE, DELETE
•Load files into the data table
•Insert query results into the Hive table
•0.8 new features insert into
Load files into the data table
•LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
•Load operation is just a simple copy/move operation, moving the data file to the corresponding location of the Hive table.
•filepath
•Relative path, for example: project/data1
•Absolute path, for example: /user/hive/project/data1
• Complete URI containing the pattern, for example: hdfs://namenode:9000/user/hive/project/data1
For example:
hive> LOAD DATA LOCAL INPATH './examples/files/' OVERWRITE INTO TABLE pokes;
Load local data and give partition information at the same time
•The loading target can be a table or partition. If the table contains partitions, the partition name of each partition must be specified
•filepath can refer to a file (in this case, Hive will move the file to the corresponding directory of the table) or a directory (in this case, Hive will move all files in the directory to the corresponding directory of the table)
LOCAL keywords
•LOCAL is specified, i.e. local
•load command will search for filepath in the local file system. If it is found to be a relative path, the path is interpreted as the current path relative to the current user. Users can also specify a complete URI for the local file, such as: file:///user/hive/project/data1.
The •load command will copy the files in filepath to the target file system. The target file system is determined by the location properties of the table. The copied data file is moved to the corresponding location of the data in the table.
For example: load local data and give partition information at the same time:
hive> LOAD DATA LOCAL INPATH './examples/files/' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
• No LOCAL specified
If filepath points to a complete URI, hive will use this URI directly. Otherwise
•If schema or authority is not specified, Hive will use the schema and authority defined in the hadoop configuration file and specify the URI of Namenode
• If the path is not absolute, Hive is explained relative to /user/. Hive will move the file content specified in filepath to the path specified by table (or partition).
Load DFS data and give partition information at the same time:
hive> LOAD DATA INPATH '/user/myname/' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
OVERWRITE
•OVERWRITE is specified
• The content (if any) in the target table (or partition) will be deleted, and then the content in the file/directory pointed to by filepath will be added to the table/partition.
•If the target table (partition) already has a file and the file name conflicts with the file name in filepath, the existing file will be replaced by the new file.
Insert query results into the Hive table
• Insert query results into the Hive table
•Write query results to HDFS file system
•Basic Mode
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement
•Multiple Insert Mode
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...
•Auto partition mode
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement
Write the query result toHDFSFile system
•INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2]
•
•When data is written to the file system it is serialized as text, with columns separated by ^A and rows by \n
INSERT INTO
•INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement
3. DQL Operation: Data Query SQL
SQL Operations
•Basic Select operation
• Partition-based query
•Join
3.1 Basic Select operation
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[ CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]
]
[LIMIT number]
• Use the ALL and DISTINCT options to distinguish the processing of duplicate records. The default is ALL, which means querying all records. DISTINCT means to remove duplicate records
•
•Where conditions
•Similar to where conditions of our traditional SQL
•Supports AND and OR; since version 0.9, BETWEEN is also supported
•IN, NOT IN
•EXISTS and NOT EXISTS are not supported
The difference between ORDER BY and SORT BY
•ORDER BY Global sorting, there is only one Reduce task
•SORT BY sorts only within each reducer (a local sort)
Limit
•LIMIT restricts the number of records returned by a query
SELECT * FROM t1 LIMIT 5
•Implement Top k query
•The following query statement queries the 5 sales representatives with the largest sales record.
SET mapred.reduce.tasks = 1
SELECT * FROM test SORT BY amount DESC LIMIT 5
•REGEX Column Specification
The SELECT statement can use regular expressions to make column selections. The following statements query all columns except ds and hr:
SELECT `(ds|hr)?+.+` FROM test
For example
Query by partition:
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
Export query data to the directory:
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
Output the query results to the local directory:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
Select all columns to the local directory:
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select , FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE ='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT , FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM() FROM pc1 a;
Insert statistics from one table into another table:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT , count(1) WHERE > 0 GROUP BY ;
hive> INSERT OVERWRITE TABLE events SELECT , count(1) FROM invites a WHERE > 0 GROUP BY ;
JOIN
hive> FROM pokes t1 JOIN invites t2 ON ( = ) INSERT OVERWRITE TABLE events SELECT , , ;
Insert multi-table data into the same table:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE < 100
INSERT OVERWRITE TABLE dest2 SELECT , WHERE >= 100 and < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT WHERE >= 200 and < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/' SELECT WHERE >= 300;
Insert the file stream directly into the file:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(, ) AS (oof, rab) USING '/bin/cat' WHERE > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like hadoop streaming). Similarly - streaming can be used on the reduce side (please see the Hive Tutorial or examples)
3.2 Partition-based queries
•In general a SELECT query scans the entire table; tables created with the PARTITIONED BY clause let queries take advantage of partition pruning.
•Hive's current implementation enables partition pruning only when the partition predicate appears in the WHERE clause closest to the FROM clause
3.3 Join
Syntax
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
•Hive only supports equality joins, external joins and (left semi joins). Hive does not support all non-equal connections, because non-equal connections are very difficult to convert to map/reduce tasks
•The LEFT, RIGHT and FULL OUTER keywords are used to handle rows with no match in a join (NULL-extended rows)
•LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery
•In each map/reduce task of a join, the reducer caches the records of every table in the join sequence except the last one, and then streams the result to the file system through the last table
•In practice, the largest table should be written at the end
When join, you need to pay attention to several key points
• Only equal value joins are supported
•SELECT a.* FROM a JOIN b ON (a.id = b.id)
•SELECT a.* FROM a JOIN b
ON (a.id = b.id AND a.department = b.department)
•Can join more than 2 tables, for example
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
•If the join key of multiple tables in the join is the same, the join will be converted into a single map/reduce task
LEFT, RIGHT and FULL OUTER
•example
•SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key)
•If you want to restrict the output of a join, write the filter condition in the WHERE clause, or in the ON clause of the join.
•
•The problem that is easy to confuse is the situation of table partitioning
• SELECT , FROM c LEFT OUTER JOIN d ON (=)
WHERE ='2010-07-07' AND ='2010-07-07‘
•If the record corresponding to table c is not found in table d, all columns of table d will be listed with NULL, including column ds. In other words, join will filter all records that match the join key in the d table. In this way, LEFT OUTER makes the query result irrelevant to the WHERE clause
• Solution
•SELECT , FROM c LEFT OUTER JOIN d
ON (= AND ='2009-07-07' AND ='2009-07-07')
LEFT SEMI JOIN
• The restriction of LEFT SEMI JOIN is that the right-hand table may only be referenced in the ON clause; it cannot be filtered or referenced in the WHERE clause, the SELECT clause, or anywhere else.
• SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM b);
can be rewritten as:
SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key)
UNION ALL
• Used to merge the results of multiple SELECT queries; the SELECT lists must be consistent (same number and types of columns).
•select_statement UNION ALL select_statement UNION ALL select_statement ...
4. Habits to change when moving from SQL to HiveQL
1. Hive does not support the implicit (comma-separated) join syntax
• An inner join of two tables, which in SQL can be written as:
• select * from dual a, dual b where a.key = b.key;
• should be written in Hive as:
• select * from dual a join dual b on a.key = b.key;
instead of the traditional format:
SELECT t1.a1 AS c1, t2.b1 AS c2 FROM t1, t2 WHERE t1.a2 = t2.b2
2. The semicolon character
• The semicolon is the statement terminator in SQL, and the same is true in HiveQL, but HiveQL's recognition of semicolons inside a statement is not as smart. For example:
• select concat(key, concat(';', key)) from dual;
• HiveQL fails while parsing this statement:
FAILED: Parse Error: line 0:-1 mismatched input '<EOF>' expecting ) in function specification
• The workaround is to escape the semicolon with its octal ASCII code, so the statement above should be written as:
• select concat(key, concat('\073', key)) from dual;
3. IS [NOT] NULL
• In SQL, NULL represents a null value. Note that in HiveQL, if a field of type String holds an empty string (length 0), IS NULL evaluates to False for it.
4. Hive does not support inserting data into an existing table or partition;
only overwriting (rewriting the entire table or partition) is supported, for example:
INSERT OVERWRITE TABLE t1
SELECT * FROM t2;
Hive does not support the INSERT INTO, UPDATE, or DELETE operations,
which means it does not need a complicated lock mechanism for reading and writing data.
The INSERT INTO syntax is only available starting with version 0.8; INSERT INTO appends data to a table or partition.
5. Hive supports embedding map/reduce programs to handle complex logic
For example:
FROM (
MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
-- doctext: the input column
-- word, cnt: the output of the map program
-- CLUSTER BY: hash by word and use it as the input to the reduce program
In addition, the map program and the reduce program can also be used separately, for example:
FROM (
FROM session_table
SELECT sessionid, tstamp, data
DISTRIBUTE BY sessionid SORT BY tstamp
) a
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
-- DISTRIBUTE BY: used to decide which reducer each row is sent to
6. Hive supports writing transformed data directly into multiple different tables, and also into partitions, HDFS, and local directories.
This eliminates the overhead of scanning the input table multiple times.
FROM t1

INSERT OVERWRITE TABLE t2
SELECT t3.c2, count(1)
FROM t3
WHERE t3.c1 <= 20
GROUP BY t3.c2

INSERT OVERWRITE DIRECTORY '/output_dir'
SELECT t3.c2, avg(t3.c1)
FROM t3
WHERE t3.c1 > 20 AND t3.c1 <= 30
GROUP BY t3.c2

INSERT OVERWRITE LOCAL DIRECTORY '/home/dir'
SELECT t3.c2, sum(t3.c1)
FROM t3
WHERE t3.c1 > 30
GROUP BY t3.c2;
5. Practical Example
Create a table
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Download the sample data file and unzip it
wget /system/files/ml-data.tar__0.gz
tar xvzf ml-data.tar__0.gz
Loading data into the table:
LOAD DATA LOCAL INPATH 'ml-data/'
OVERWRITE INTO TABLE u_data;
Total statistics:
SELECT COUNT(1) FROM u_data;
Now do some complex data analysis:
Create a file weekday_mapper.py that splits the data by weekday:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # generate the weekday for each record
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
Using the mapping script
-- Create a table whose fields in each row are separated by the delimiter
CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
-- Add the Python file to Hive's distributed cache
add FILE weekday_mapper.py;
Split data by week
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
SELECT weekday, COUNT(1)
FROM u_data_new
GROUP BY weekday;
Process Apache Weblog data
First parse the web logs with a regular expression, then load the extracted fields into the table according to the required conditions.
add jar ../build/contrib/hive_contrib.jar;
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
24. Under what circumstances does HDFS not back up (replicate) data?
When the number of replicas is set to 1 in the configuration file.
26. In which process does it appear?
HDFS Architecture
We first introduce the architecture of HDFS. HDFS adopts the master/slave structural model: an HDFS cluster consists of one NameNode and several DataNodes. The NameNode is the master server; it manages the file system namespace and client access to files, while the DataNodes in the cluster manage the stored data. HDFS lets users store data in the form of files. Internally, a file is divided into several data blocks, and these blocks are stored on a group of DataNodes. The NameNode performs namespace operations of the file system, such as opening, closing, and renaming files or directories, and is also responsible for mapping data blocks to specific DataNodes. The DataNodes handle read and write requests from file system clients, and create, delete, and replicate data blocks under the unified scheduling of the NameNode. Figure 1-3 shows the HDFS architecture.
Both the NameNode and the DataNodes are designed to run on ordinary commodity machines, which usually run GNU/Linux. HDFS is developed in Java, so any machine that supports Java can run a NameNode or DataNode. A typical deployment has one machine in the cluster running the NameNode instance while each of the other machines runs a DataNode instance; it is also possible, though uncommon, for one machine to run multiple DataNode instances. The single-NameNode design greatly simplifies the system architecture. The NameNode is the manager of all HDFS metadata, and user data never flows through the NameNode.
Figure 1-3 HDFS architecture diagram
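To make the read/write path described above concrete, the following is a minimal, hedged sketch using the standard Hadoop FileSystem API; the path /tmp/hello.txt and the default cluster configuration are assumptions for illustration only. The client asks the NameNode to resolve the namespace and block locations, while the file bytes themselves flow to and from DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                // namespace lookups go to the NameNode
        Path p = new Path("/tmp/hello.txt");                 // hypothetical path for illustration
        try (FSDataOutputStream out = fs.create(p, true)) {  // blocks are written to DataNodes
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(p)) {            // NameNode returns block locations,
            System.out.println(in.readUTF());                // data is streamed from DataNodes
        }
        fs.close();
    }
}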
29. Architecture
31. What is a queue? LinkedList can be used to implement a queue.
Deduplicate a table using a Set.
33. Three major database paradigms
34. The first normal form (1NF) means that the data can be organized into a table of rows and columns in which every cell (the intersection of a row and a column) is atomic and cannot be further divided into rows and columns; in fact, any table satisfies 1NF. The second normal form (2NF) means that, on the basis of 1NF, every non-key field in the table depends on the entire primary key, and no non-key field depends on only part of the primary key; that is, a record is uniquely determined by the whole primary key. For example, the composite key of student number + course number uniquely determines which student's grade in which course a record represents; if either the student number or the course number is missing, the meaning of the grade cannot be determined. The third normal form (3NF) means that, on the basis of 2NF, there are no functional dependencies between non-key fields; every non-key field depends only on the primary key. For example, it is poor design to put the student's name and the class name in the same table, because the class information depends on the class rather than the student; storing student information and class information separately satisfies 3NF.
35. There are three DataNodes; what happens when one of them fails?
36. When importing data into MySQL with Sqoop, how do you prevent data from being imported repeatedly? What happens if Sqoop encounters a data problem during the import?
37.
38.
Basic knowledge and problem analysis ability
3.1 Describe where the cache mechanism is used in hadoop, and what are its functions?
3.2 Please describe the problem reported in /jira/browse/HDFS-2379 and the idea behind the final solution.
4. MapReduce development capability
Referring to wordcount, implement your own map/reduce program with the following requirements:
a Input file format:
xxx,xxx,xxx,xxx,xxx,xxx,xxx
b Output file format:
xxx,20
xxx,30
xxx,40
c Function: count, according to the command line parameters, the number of times each specified keyword appears in the input file and output the result (a hedged Java sketch is given below)
For example: hadoop jar keywordcount xxx,xxx,xxx,xxx,xxx (four keywords)
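One possible sketch of such a keyword-count job follows; it is not the official answer. The class names, the "keywords" configuration key, and the argument order (input path, output path, comma-separated keywords) are assumptions made for illustration.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeywordCount {
    public static class KwMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Set<String> keywords = new HashSet<>();
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void setup(Context ctx) {
            // keywords are passed as a comma-separated job parameter named "keywords"
            for (String k : ctx.getConfiguration().get("keywords", "").split(",")) {
                if (!k.isEmpty()) keywords.add(k);
            }
        }
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // input lines are comma-separated fields: xxx,xxx,xxx,...
            for (String field : value.toString().split(",")) {
                if (keywords.contains(field)) ctx.write(new Text(field), ONE);
            }
        }
    }
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            ctx.write(key, new LongWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("keywords", args[2]);                 // e.g. "xxx,yyy,zzz,www"
        Job job = Job.getInstance(conf, "keyword count");
        job.setJarByClass(KeywordCount.class);
        job.setMapperClass(KwMapper.class);
        job.setCombinerClass(SumReducer.class);        // counting is associative, so a combiner is safe
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The exact "hadoop jar ..." invocation depends on how the jar is packaged; the default output separator is a tab, so a small change to the output format would be needed to emit "xxx,20" style lines.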
5. MapReduce Optimization
Please propose ideas on how to optimize the running speed of the MR program from the previous question.
6. Linux operating system knowledge inspection
Please list the configuration files under /etc that you have modified and explain what problem each modification was meant to solve.
7. Java development capabilities
7.1 Write code to count the total number of lines in a 1 GB text file whose line separator is \x01\x02.
Pay attention to handling the boundary cases (a sketch follows question 7.2 below).
7.2 Please describe how to analyze the performance of the above programs and optimize the performance in development.
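For question 7.1, a rough Java sketch is shown below; it assumes the separator is the two bytes 0x01 0x02 and reads the file in fixed-size chunks, carrying the last byte of each chunk forward so a separator split across two reads is not missed.

import java.io.FileInputStream;
import java.io.IOException;

public class LineCounter {
    public static long countLines(String path) throws IOException {
        byte[] buf = new byte[8 * 1024 * 1024];    // 8 MB read buffer
        long count = 0;
        int prev = -1;                             // last byte of the previous chunk
        try (FileInputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                for (int i = 0; i < n; i++) {
                    int cur = buf[i] & 0xFF;
                    if (prev == 0x01 && cur == 0x02) count++;   // found \x01\x02
                    prev = cur;
                }
            }
        }
        // Depending on the convention, a trailing segment without a final separator
        // may count as one more line; adjust to the exact requirement.
        return count;
    }
    public static void main(String[] args) throws IOException {
        System.out.println(countLines(args[0]));
    }
}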
5. Hadoop interview questions provided by ***** (21 questions):
1. Design a system that can extract data in a specified format from ever-growing, heterogeneous data sources.
Requirements:
1) The results of a run must make the extraction quality roughly visible, so that the extraction method can be improved accordingly;
2) Because the data sources differ, provide a flexible, configurable program framework;
3) The data sources may include MySQL, SQL Server, etc.;
4) The system must be able to keep mining, i.e. it can repeatedly extract more information.
2. A classic question:
There are currently 100 million integers, roughly evenly distributed. Find the largest 1K of them with the optimal algorithm.
(Ignore memory limits and external-memory I/O for now; the algorithm with the lowest time complexity is considered optimal.)
My idea: split the data into blocks, for example 10,000 blocks of 10,000 numbers each, and find the maximum of each block. Take the largest 1K of those 10,000 maxima;
the 9K blocks whose maxima were not among them can be thrown away, and the top 1K is then searched for among the remaining 1K blocks.
This reduces the problem size to about 1/10 of the original.
Questions:
(1) What is the optimal time complexity of this blocking method?
(2) How should the blocks be chosen optimally? For example, the data could also be split into 100,000 blocks of 1,000 numbers each, reducing the problem size to
1/100 of the original, yet the overall complexity does not actually decrease.
(3) Is there a better way to solve this problem? (A heap-based alternative is sketched below.)
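For comparison with the blocking idea, here is a hedged sketch of the standard heap-based approach: a min-heap of size K scanned over the data yields the largest K elements in O(N log K) time and O(K) extra memory. The random test data in main is only for illustration.

import java.util.PriorityQueue;
import java.util.Random;

public class TopK {
    public static int[] topK(int[] data, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(k);   // min-heap of the current top k
        for (int x : data) {
            if (heap.size() < k) heap.offer(x);
            else if (x > heap.peek()) { heap.poll(); heap.offer(x); }  // replace the smallest of the top k
        }
        int[] out = new int[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll(); // fill backwards: out[0] is the largest
        return out;
    }
    public static void main(String[] args) {
        Random r = new Random(42);
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = r.nextInt();
        int[] top = topK(data, 1000);
        System.out.println("largest = " + top[0]);
    }
}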
3. MapReduce rough process?
4. What are the functions of the combiner and the partitioner?
5. Use mapreduce to implement sql statements select count(x) from a group by b?
6. How to use mapreduce to achieve two table connections and what are the methods?
7. Know the general process of MapReduce, map, shuffle, reduce
8. Know the functions of the combiner and the partitioner, and how to enable compression
9. When building a Hadoop cluster, which services run on the master and on the slaves?
10. How are blocks and replicas located?
11. Version 0.20.2->0.20.203->0.205, 0.21, 0.23, 1.0.1
What is the difference between the old and new APIs?
12. Parameter tuning. Cluster level: JVM, map/reduce slots; job level: number of reducers,
memory, whether to use a combiner, whether to use compression.
13. Pig Latin and Hive: what are the differences in syntax?
14. Describe the process of setting up HBase and ZooKeeper.
The principle of operation?
The principle?
Storage mechanism?
18. Give a simple example to illustrate how mapreduce runs?
19. Use mapreduce to implement the following example
Example: there are 10 folders, and each folder contains 1,000,000 URLs. Find the
top 1,000,000 URLs.
20. What is the role of the Combiner?
21. How to confirm the health status of the Hadoop cluster.
6. Hadoop interview questions from **** provided by **** (9 questions):
1. What are the versions of hadoop used?
What is the principle?
3. For a job that does not use reduce for its output, what can be used to replace the function of reduce?
How to tune?
How to control permissions?
What is the principle of writing data?
Can you build multiple libraries like a relational database?
How to deal with downtime?
9. Suppose the company wants to build a data center, how would you plan?
7. Hadoop multiple-choice questions (33 questions)
Single-choice questions:
1. Which of the following programs is responsible for HDFS data storage.
a)NameNode b)Jobtracker c)Datanode d)secondaryNameNode e)tasktracker
2. How many copies of a block does HDFS save by default?
a) 3 copies b) 2 copies c) 1 copy d) Cannot be determined
3. Which of the following programs is usually started on a node with NameNode?
a)SecondaryNameNode b)DataNode c)TaskTracker d)Jobtracker
4. Who is the author of Hadoop?
a) Martin Fowler b) Kent Beck c) Doug Cutting
5. HDFS default Block Size
a)32MB b)64MB c)128MB
6. Which of the following is usually the main bottleneck of the cluster?
a)CPU b)Network c)Disk d)Memory
7. Which statement about the SecondaryNameNode is correct?
a) It is a hot standby for NameNode b) It has no memory requirements
c) Its purpose is to help NameNode merge edit logs and reduce NameNode startup time
d) SecondaryNameNode should be deployed to a node with NameNode
Multiple-choice questions:
8. Which of the following can be used as cluster management tools?
a) Puppet b) Pdsh c) Cloudera Manager d) Zookeeper
9. Which of the following statements about configuring rack awareness is correct?
a) If there is a problem with a rack, it will not affect the reading and writing of data.
b) When writing data, it will be written to DataNodes in different racks.
c) MapReduce will obtain network data closer to you based on the rack.
10. Which of the following is correct when uploading files on the Client side?
a) The data is passed to DataNode through NameNode
b) Client side divides the file into Blocks and uploads it in turn
c) Client only uploads data to one DataNode, and then NameNode is responsible for Block Copy
11. Which of the following is the mode of Hadoop running?
a) stand-alone version b) pseudo-distributed c) distributed
12. What methods can Cloudera provide to install CDH?
a)Cloudera manager b)Tar ball c)Yum d)Rpm
True/false questions:
13. Ganglia can not only monitor, but also alert. ( )
14. Block Size cannot be modified. ( )
15. Nagios cannot monitor Hadoop clusters because it does not provide Hadoop support. ( )
16. If NameNode terminates unexpectedly, SecondaryNameNode will replace it to keep the cluster working. ( )
17. Cloudera CDH requires a paid use. ( )
18. Hadoop is developed in Java, so MapReduce only supports Java language writing. ( )
19. Hadoop supports random read and write data. ( )
20. The NameNode is responsible for managing metadata. For every client read or write request, it reads the metadata
information from disk or writes it to disk, and then returns the result to the client. ( )
21. NameNode The local disk saves the location information of Block. ( )
22. DataNode maintains communication with NameNode through long connection. ( )
23. Hadoop has strict authority management and security measures to ensure the normal operation of the cluster. ( )
24. The Slave node needs to store data, so the larger its disk, the better. ( )
25. hadoop dfsadmin –report command is used to detect HDFS corrupted blocks. ( )
26. Hadoop’s default scheduler strategy is FIFO ( )
27. Each node in the cluster should be equipped with RAID to avoid single disk corruption and affecting the operation of the entire node. ( )
28. Because HDFS has multiple copies, there is no single point problem with NameNode. ( )
29. Each map slot is a thread. ( )
30. Mapreduce’s input split is a block. ( )
31. The Web UI port of NameNode is 50030, which is a Web service initiated through jetty. ( )
32. The Hadoop environment variable HADOOP_HEAPSIZE is used to set the heap size of all Hadoop daemons.
Its default is 200 GB. ( )
33. When a DataNode joins the cluster for the first time, if an incompatible file version is reported in the log, the
NameNode needs to perform the "hadoop namenode -format" operation to reformat the disk. ( )
8. Interview questions on implementing mobile phone traffic statistics with MR and Hive (6 questions):
1. What are the query statements that implement the statistics?
2. Why is it recommended to use external tables in production environment?
3. In MapReduce, what is the purpose of creating the class DataWritable?
4. Why create a class DataWritable?
5. How to count mobile phone traffic?
6. Compare the differences between computing mobile phone traffic statistics with Hive and with MapReduce.
9. Interview question from aboutyun
I went to an interview recently and was given this question; if you are interested, give it a try.
Use Hadoop to analyze a large number of log files, and each line of log records the following data:
TableName, Time, User, TimeSpan.
Require:
Write a MapReduce program to calculate which table is accessed most frequently during the peak period (such as 10 am).
10. Interview questions from aboutyun
I interviewed with Alibaba for a cloud computing position some time ago; here are the questions, shared with everyone.
1. The principle of hadoop operation?
2. The principle of mapreduce?
3. HDFS storage mechanism?
4. Give a simple example to explain how mapreduce works?
5. The interviewer gives you a problem and asks you to solve it with mapreduce?
For example: there are 10 folders, and each folder contains 1,000,000 URLs. Find the
top 1,000,000 URLs.
6. What is the effect of Combiner in hadoop?
Reply from a netizen on the forum:
1-3: That is the MapReduce process. A name (master) node on the server side plus multiple data nodes; the program is
distributed to each node, and the computation is performed on the nodes.
That is, the data is stored on different nodes, each node processes its data with the map method,
and the results are finally merged by reduce.
The program cooperates with the NameNode to store data on different DataNodes.
4. The best way to show how it operates is with a diagram; I cannot draw it here, so please Google it.
5. Ignoring skew and custom functions: use two jobs. The first job reads the 10 folders directly from the FileSystem as
the map input, uses the URL as the key, and the reducer computes the count per URL.
The second job's map uses the URL as the key with -sum for secondary sorting, and the reducer takes the top 1,000,000.
The second method is to build a Hive table A partitioned by channel, with each folder as one partition.
select url, c from (select url, count(1) as c from A where channel = '' group by
url) x order by c desc limit 1000000;
6. A combiner is also a reducer; it can reduce the amount of data transferred from map to reduce and speed up the shuffle.
Remember not to use it for averages; it requires that the combiner's input equals the map output and its output equals the reduce input.
11. Written test and interview questions from Xiaoluo (Hadoop, 13k monthly salary) (11 questions):
1. Written Examination
1. Java Basic Class:
1) Inheritance: a piece of code was given and you had to write its output;
2) Reference objects and value objects;
I cannot remember the Java basics questions very clearly; many of them were quite basic.
2. Linux Basics:
1) Find Usage
2) Given a text file, for example,
write a shell script to count the occurrences, with the final output like: aaa 1
ccc 2
bbb 3
The results are required to be sorted.
There were other questions that were relatively basic.
3. Database class: oracle query statement
2. Interview
Talking about project experience: they asked in great detail, handed me paper and pen, had me draw the structure of the company's Hadoop project,
and finally asked me to describe what a few pieces of data look like after passing through the platform.
Java: which classes are commonly used for IO input/output streams, plus webService and thread-related knowledge
linux: asked about the jps command, the kill command, what awk and sed are used for, and some common Hadoop
commands
hadoop: talk about the map, shuffle, and reduce process in hadoop1; they also asked about the details of how map and reduce
are written (fortunately I had studied this before)
Project deployment: asked how the project is deployed and how the code is managed
Hive: also asked some questions about external tables, and the difference between Hive's data model and traditional databases.
3. Interview with an Internet company:
Asked about algorithms for analyzing human behavior: I thought of the anti-money-laundering project we had done and gave an example:
how do we screen out suspicious money-laundering activity?
12. Interview questions provided by flash guest, "find yourself", big data, etc. (26 questions):
**** Letter Hadoop written interview questions (14 questions in total; one more could not be remembered)
1. Hadoop cluster construction process, write out the steps.
2. What are the functions of those threads that start during the cluster operation?
3. The error "/tmp/hadoop-root/dfs/name the path is not exists or is not accessable" appears in NameNode main;
how do you resolve it? (The wording is approximate; the point is the exception.)
4. Write the language used for mapreduce at work and write a mapreduce program.
5. hadoop command
1) Kill a job task (just kill the process of port 50030)
2) Delete /tmp/aaa file directory
3) hadoop The command to refresh the cluster status when the cluster adds or deletes nodes.
6. Fixed format of logs:
a,b,c,d
a,a,f,e
b,b,d,f
Use one language to write mapreduce tasks to count the number of last letters in each column.
7. What are the schedulers for hadoop and their working principles.
8. What are the join methods of mapreduce?
9. What are the methods of Hive metadata storage and what are their characteristics?
10. Implement a non-recursive binary search algorithm in Java (see the sketch after this list).
11. The role of Combiner and Partition in mapreduce.
12. Use linux to implement the following requirements:
ip username
210.121.123.12 zhangsan
34.23.56.78 lisi
11.56.56.72 wanger
.....
58.23.53.132 liuqi
34.23.56.78 liba
.....
Each file has at least 1 million rows.
1) Count the number of IPs in each file and the total number of IPs.
2) Find the IPs that exist in one file but not in the other.
3) Count the total number of occurrences of each username and the number of IPs corresponding to each username.
13. Describe the respective characteristics of processing data with Java, Streaming, and Pipes in Hadoop (the question was roughly along these lines).
14. How do you implement secondary sorting in mapreduce?
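A sketch for question 10 above (non-recursive binary search); it assumes the input array is already sorted in ascending order.

public class BinarySearch {
    public static int search(int[] a, int target) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;       // avoids overflow of (lo + hi)
            if (a[mid] == target) return mid;
            if (a[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;                               // not found
    }
    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7, 9, 11};
        System.out.println(search(a, 7));        // prints 3
        System.out.println(search(a, 4));        // prints -1
    }
}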
Interview questions encountered by Dashu:
15. The interviewer comes up and asks about the scheduling mechanism of hadoop;
16. Rack perception;
17. MR data tilt reasons and solutions;
18. Cluster HA.
@Find yourself. Interview questions provided:
19. If you were asked to design a distributed file system, how do you think it should be designed and what aspects should be considered?
10 billion records are written into HBase every day; how do you ensure that the data is stored correctly and that all writes finish within the specified time,
with no data left over?
20. For hive, what UDF functions have you written and what is their function?
21. hdfs’ data compression algorithm
22. Mapreduce’s scheduling mode
23. Hive’s underlying interaction principle with database
24. Hbase filter implementation principles
25. For Mahout, how do you do secondary development for recommendation, classification, and clustering, and which interfaces do you implement for each?
26. If you use the timestamp directly as the row key, there will be a hotspot problem when writing to a single region. Why?
13. Interview questions provided by Fei Ge (Hadoop, 13k monthly salary):
1. The principle of hdfs and the responsibilities of each module
2. The working principle of mr
3. How does the map method call the reduce method
4. How to determine whether the file exists in shell, and what should be done if it does not exist?
5. What is the difference between fsimage and edit?
6. What is the difference between hadoop1 and hadoop2?
Written test:
1. How many copies of a block does HDFS keep by default?
2. Which program is usually started on the same node as the NameNode? Explain why.
3. List a few configuration file optimizations?
4. Write down your understanding of zookeeper
5. When a DataNode joins the cluster for the first time and the log reports an incompatible file version, the NameNode is required
to perform a format operation. What is the reason?
6. Talk about data tilt, how it happens, and give an optimization plan.
7. Introduction to hbase filter
8. Mapreduce Basic execution process
9. Talk about the difference between hadoop1 and hadoop2
10. Hbase cluster installation precautions
11. Records contain a field F and a field G. For each value of G, count the number of distinct F values among the records with that G value. Briefly write out
the procedure.
14. Interview questions provided by Fei Ge (Hadoop, 13k monthly salary) (3 questions):
1. Algorithm question: there are two buckets with capacities of 3 liters and 5 liters. How do you measure out exactly 4 liters of water? Assuming an unlimited water supply, write
out the steps.
2. Java written test: I forgot to take photos; there was a lot of very basic knowledge, followed by many SQL questions about commonly used
query SQL, and answering took an hour.
3. In an Oracle database there is a table with a field name varchar2(10). Without changing the table's data,
how do you change the length of this field to varchar2(2)?
15. Interview questions on massive data processing algorithms (10 questions):
Part 1: Ten massive data processing interview questions
1. From massive log data, extract the IP that visited Baidu the most times on a given day.
First, take the IPs that accessed Baidu in the logs for that day and write them one by one into a large file. Note that
an IP is 32 bits, so there are at most 2^32 IPs. A mapping method can be used, for example taking the IP modulo 1000, to map the whole
large file into 1000 small files, and then the most frequent IP in each small file is found (a hash_map can be used
for the frequency counting, then take the one with the largest count) together with its frequency. Finally, among these 1000 most frequent
IPs, find the one with the largest frequency; that is the answer.
Or, as explained as follows (by Snowland Eagle):
Algorithm idea: divide and conquer + hash
(1) There are at most 2^32 = 4G distinct IP addresses, so they cannot all be loaded into memory for processing;
(2) Consider the "divide and conquer" idea: according to the value of Hash(IP) % 1024, store the massive number of IP
logs into 1024 small files, so that each small file contains at most about 4M IP addresses;
(3) For each small file, build a hash_map with the IP as the key and the number of occurrences as the value, and
record the IP that appears most frequently in that file;
(4) From the most frequent IPs of the 1024 small files, use an ordinary sorting algorithm to obtain the IP
with the most occurrences overall.
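A single-machine sketch of the "divide and conquer + hash" idea above: hash each IP into one of 1024 bucket files, then count each bucket with an in-memory HashMap and keep the overall winner. The bucket paths are made up for illustration, and keeping 1024 writers open at once may require raising the open-file limit on a real system.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class TopIp {
    public static void main(String[] args) throws IOException {
        int buckets = 1024;
        // Phase 1: split the big log (one IP per line) into 1024 small files by hash(ip) % 1024
        BufferedWriter[] out = new BufferedWriter[buckets];
        for (int i = 0; i < buckets; i++) {
            out[i] = new BufferedWriter(new FileWriter("/tmp/ip_bucket_" + i));
        }
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String ip;
            while ((ip = in.readLine()) != null) {
                int b = (ip.hashCode() & Integer.MAX_VALUE) % buckets;
                out[b].write(ip);
                out[b].newLine();
            }
        }
        for (BufferedWriter w : out) w.close();

        // Phase 2: count each bucket in memory and keep the globally most frequent IP
        String bestIp = null;
        long bestCount = 0;
        for (int i = 0; i < buckets; i++) {
            Map<String, Long> counts = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("/tmp/ip_bucket_" + i))) {
                String ip;
                while ((ip = in.readLine()) != null) counts.merge(ip, 1L, Long::sum);
            }
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() > bestCount) { bestCount = e.getValue(); bestIp = e.getKey(); }
            }
        }
        System.out.println(bestIp + " " + bestCount);
    }
}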
2. A search engine records, through log files, every query string a user enters when searching; each query
string is 1 to 255 bytes long.
Suppose there are currently 10 million records (the query strings are highly repetitive: although the total is 10 million, there are
no more than 3 million distinct ones after deduplication. The more often a query string repeats, the more users are querying it, i.e.
the more popular it is). Count the 10 most popular query strings; the memory used must not exceed 1 GB.
This is a typical Top-K problem; for details, refer to the article "Eleven: a thorough analysis of the hash table algorithm from start to finish".
The algorithm finally given in that article is:
Step 1: preprocess this batch of massive data and complete the counting with a hash table (an earlier version said
sorting; corrected by July, 2011.04.27).
Step 2: use the heap data structure to find the Top K; the time complexity is N'*logK.
That is, with the help of the heap we can find and adjust/move elements in O(log K) time. So maintain a small root heap of size K (10 in this
problem), then iterate over the 3 million queries and compare each with the root element. The
final time complexity is O(N) + N'*O(logK), where N is 10 million and N' is 3 million. OK,
for more, please refer to the original article.
Or: use a trie tree, where the keyword field of a node stores the number of times that query string appears (0 if it does not appear). Finally, a
min-heap of 10 elements is used to sort the query strings by frequency of occurrence.
3. There is a 1 GB file in which each line is a word; each word is at most 16 bytes, and the memory limit
is 1 MB. Return the 100 most frequent words.
Plan: read the file sequentially; for each word x, take hash(x) % 5000 and store it into one of 5000 small files
(denoted x0, x1, ..., x4999) according to the value. Each file is then about 200 KB.
If some files exceed 1 MB, keep splitting them in the same way until no
small file is larger than 1 MB.
For each small file, count the words appearing in it and their frequencies (a trie tree or hash_map can be used),
take out the 100 most frequent words (a min-heap of 100 nodes can be used), and store these 100 words and their
frequencies to a file, which yields another 5000 files. The last step is to merge these 5000 files (similar
to a merge sort).
4. There are 10 files of 1 GB each; every line of every file stores a user's query, and the queries of each file
may be repeated. Sort the queries by their frequency.
This is still a typical Top-K style problem. Solutions:
Plan 1:
Read the 10 files sequentially and, according to hash(query) % 10, write the queries into another 10 files.
Each newly generated file is then about 1 GB (assuming the hash function is random).
Find a machine with about 2 GB of memory and use hash_map(query, query_count) to count how many times each
query appears. Sort by the counts with quick/heap/merge sort, and output the sorted queries and
their query_count to a file. This gives 10 sorted files.
Finally, merge-sort these 10 files together (a combination of internal and external sorting).
Plan 2:
In general the total number of distinct queries is limited; it is just that the repetition count is high, so it may be possible to fit all the queries
into memory at once. Then we can count the occurrences of each query directly with a trie tree or hash_map,
and then sort by the counts with quick/heap/merge sort.
Plan 3:
Similar to Plan 1, but after hashing into multiple files, they can be handed to multiple machines for processing with a distributed
architecture (such as MapReduce), and the results merged at the end.
5. Given two files a and b, each storing 5 billion URLs where each URL takes 64 bytes, and a memory limit of 4 GB,
how do you find the URLs common to files a and b?
Solution 1: the size of each file can be estimated as 5G x 64 = 320 GB, far larger than the 4 GB memory limit, so they
cannot be fully loaded into memory for processing. Consider a divide and conquer approach.
Traverse file a; for each URL compute hash(url) % 1000, and according to the value store the URL into one of 1000
small files (denoted a0, a1, ..., a999). Each small file is then about 300 MB.
Traverse file b and store its URLs into 1000 small files (denoted b0, b1, ..., b999) in the same way as for a.
After this processing, all possibly identical URLs are in corresponding pairs of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999);
non-corresponding small files cannot contain the same URL. So we only need to find the common URLs in each of the 1000
pairs of small files.
To find the common URLs in a pair of small files, store the URLs of one of them in a hash_set, then
iterate over each URL of the other small file and check whether it is in the hash_set just built. If it is, it is a common
URL; just write it to a file.
Solution 2: if a certain error rate is allowed, a Bloom filter can be used; 4 GB of memory can represent roughly 34 billion
bits. Map the URLs of one of the files onto these 34 billion bits with the Bloom filter, then read the URLs of the other
file one by one and check whether each is in the Bloom filter. If it is, the URL should be a common one (note that there will be
a certain error rate).
Bloom filters will be explained in detail on this blog in the future.
6. Find the non-repeated integers among 250 million integers. Note that the memory is not enough to hold the 250 million integers.
Solution 1: use a 2-Bitmap (allocate 2 bits per number: 00 means not present, 01 means seen once, 10 means seen multiple
times, 11 is meaningless). The total memory required is 2^32 * 2 bits = 1 GB, which is acceptable. Then scan the 250
million integers and check the corresponding bits in the bitmap: if they are 00 change them to 01, if 01 change them to 10, and if 10 leave them unchanged. When the scan is
done, walk the bitmap and output the integers whose corresponding bits are 01.
Solution 2: a method similar to question 1 can also be used, splitting into small files. Then find the non-repeated
integers in each small file and sort them. Then merge, taking care to remove duplicate elements.
7. Tencent interview question: given 4 billion non-repeated unsigned int integers, unsorted, and then another number,
how do you quickly determine whether that number is among the 4 billion numbers?
Similar to question 6 above; my first reaction was quick sort + binary search. Here are other, better methods:
Plan 1: allocate 512 MB of memory, where one bit represents one unsigned int value. Read the 4 billion numbers and
set the corresponding bits; then read the number to query and check whether its bit is 1: 1 means it exists, 0 means
it does not.
dizengrong:
Plan 2: this problem is described well in "Programming Pearls"; you can refer to the following idea:
Since 2^32 is more than 4 billion, a given number may or may not be among them.
Here we represent each of the 4 billion numbers in 32-bit binary,
and suppose the 4 billion numbers are initially placed in one file.
Then split the 4 billion numbers into two categories:
1. the highest bit is 0
2. the highest bit is 1
and write the two categories into two files, one of which contains <= 2 billion numbers and the other >= 2 billion
(this is equivalent to cutting the problem in half);
compare the highest bit of the number to be found and go into the corresponding file to continue searching.
Then split that file into two categories by the next-highest bit:
1. the next-highest bit is 0
2. the next-highest bit is 1
and write the two categories into two files, one of which contains <= 1 billion numbers and the other >= 1 billion
(this halves the problem again);
compare the next-highest bit of the number to be found and go into the corresponding file to continue searching.
.......
Continuing by analogy, the number can be found, with time complexity O(log n). That concludes Plan 2.
Appendix: here is a brief introduction to the bitmap method:
Use the bitmap method to determine whether an integer array contains duplicates.
Determining whether a collection contains duplicates is a common programming task. When the amount of data is large, we usually want to do as few
scans as possible, so a double loop is not advisable.
The bitmap method is well suited to this situation. It creates a new array of length max+1 according to the largest element in the collection,
then scans the original array again: when a number is encountered, set the corresponding position of the new array to 1. For example, when 5 is encountered, set the
sixth element of the new array to 1; the next time 5 is encountered and we try to set it, we find the sixth element of the new array is already 1, which
means this number must be a duplicate of an earlier one. This method of initializing the new array with zeros and then setting ones
resembles the way a bitmap is handled, hence the name bitmap method. Its worst-case number of operations is 2N. If the maximum value of the
array is known, fixing the length of the new array in advance can double the efficiency.
Better ideas or methods are welcome; let's discuss.
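A direct illustration of the bitmap method just described, using java.util.BitSet; values are assumed non-negative and bounded by maxValue.

import java.util.BitSet;

public class BitmapDuplicateCheck {
    // Returns true if the array contains any duplicate value.
    public static boolean hasDuplicate(int[] data, int maxValue) {
        BitSet seen = new BitSet(maxValue + 1);   // one bit per possible value
        for (int x : data) {
            if (seen.get(x)) return true;         // bit already set: x appeared before
            seen.set(x);
        }
        return false;
    }
    public static void main(String[] args) {
        System.out.println(hasDuplicate(new int[]{3, 7, 5, 7}, 10));  // true
        System.out.println(hasDuplicate(new int[]{3, 7, 5, 9}, 10));  // false
    }
}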
8. How do you find the item that is repeated the most times in a massive amount of data?
Solution 1: first hash, then map into small files by the modulus; find the most frequently repeated item in each small file and
record the repetition count. Then find, among the results of the previous step, the one with the largest count; that is what is needed (see the previous
question for details).
9. Tens of millions or hundreds of millions of data items (with duplicates); count the N items that appear most frequently.
Solution 1: tens of millions or hundreds of millions of items should fit in the memory of a current machine. So consider using a hash_map,
a binary search tree, or a red-black tree to count the occurrences, and then take out the top N most frequent items,
which can be done with the heap mechanism mentioned in question 2.
10. A text file has about 10,000 lines, one word per line; count the 10 words that appear most frequently,
and give your idea and a time complexity analysis.
Plan 1: this question is about time efficiency. Use a trie tree to count the number of occurrences of each word; the time complexity is O(n*le) (le
denotes the average length of a word). Then find the 10 most frequent words, which can be done with a heap as described in the previous
question, with time complexity O(n*lg10). So the total time complexity is whichever of O(n*le) and O(n*lg10)
is larger.
Appendix: find the largest 100 numbers among 1,000,000 numbers.
Solution 1: as mentioned in the previous question, this can be done with a min-heap of 100 elements. The complexity is
O(100w*lg100).
Solution 2: use the idea of quick sort: after each partition, only consider the part larger than the pivot; when the part larger than the pivot
has just over 100 elements, sort it with a traditional sorting algorithm and take the first 100. The complexity is O(100w*100).
Solution 3: local elimination. Take the first 100 elements and sort them, calling the result sequence L. Then scan the remaining
elements one at a time: compare each element x with the smallest of the 100 sorted elements; if x is larger than this smallest one,
delete the smallest element and insert x into sequence L using the idea of insertion sort. Repeat until all elements have been
scanned. The complexity is O(100w*100).
Acknowledgments: /youwang/.
Part 2: A summary of ten massive data processing methods
OK, after reading so many interview questions above, are you a little dizzy? Yes, a summary is needed. Next, this article briefly
summarizes some common methods for dealing with massive data problems; these methods will be explained in detail on this blog in the future.
The methods below all come from the blog /yanxionglu/blog, which gives a general summary of methods for
processing massive data. Of course, these methods may not completely cover all problems, but they
can basically deal with most of the problems encountered. Some of the questions below come directly from companies' interview or
written test questions, and the methods are not necessarily optimal; if you have a better way of handling them, feel free to discuss.
1. Bloom filter
Scope of application: can be used to implement a data dictionary, to check data for duplicates, or to compute the intersection of sets.
Basic principles and key points:
The principle is very simple: a bit array plus k independent hash functions. To insert an element, set the bits at the positions given by the hash functions to 1.
If, when searching, all the bits given by the hash functions are 1, the element is considered present. Obviously this process does not guarantee that the search
result is 100% correct. Deleting an inserted keyword is also not supported, because the bits of that keyword may be
shared by other keywords. A simple improvement is the counting Bloom filter, which uses an array of counters
instead of a bit array and can then support deletion.
Another important question is how to determine the size m of the bit array and the number of hash functions k from the number of input elements
n. The error rate is minimized when the number of hash functions is k = (ln2)*(m/n). When the error rate must not exceed E, m must at least
be n*lg(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit
array should remain 0; then m should be >= n*lg(1/E)*lge, which is about 1.44 times n*lg(1/E) (lg denotes the logarithm
with base 2).
For example, if we assume an error rate of 0.01, m should be about 13 times n, and k is then about 8.
Note that the units of m and n are different: m is in bits, while n is in number of elements (more precisely, the number of
distinct elements). The length of a single element is usually many bits, so using a Bloom filter usually
saves memory.
Extension:
A Bloom filter maps the elements of a set into a bit array and uses k bits (k is the number of hash functions) all being
1 to indicate whether an element is in the set. A counting Bloom filter (CBF) expands each bit of the bit array into
a counter, thereby supporting element deletion. A spectral Bloom filter (SBF) associates the counters with the
number of occurrences of the set elements; SBF uses the minimum value among an element's counters to approximate its occurrence frequency.
Example problem: you are given two files A and B, each storing 5 billion URLs where each URL takes 64 bytes, and a memory limit of
4 GB; find the URLs common to files A and B. What if there are three or even n files?
For this problem, let's calculate the memory usage. 4 GB = 2^32 bytes, which is about 4 billion * 8, i.e. about 34 billion bits. n = 5
billion; if calculated with an error rate of 0.01, about 65 billion bits are needed. What is available is 34 billion, which is not too far off;
this may raise the error rate somewhat. In addition, if these URLs correspond one-to-one with IPs, they can be converted to IPs, and then
it becomes very simple.
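A toy Bloom filter sketch matching the description above: a bit array plus k hash positions per element, giving possible false positives but never false negatives. Deriving the k hashes by double hashing (h1 + i*h2) is an implementation choice made here for illustration, not something prescribed by the text.

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;      // number of bits
    private final int k;      // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }
    private int hash(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1;                  // a second, odd hash derived from the first
        return ((h1 + i * h2) & Integer.MAX_VALUE) % m;
    }
    public void add(String s) {
        for (int i = 0; i < k; i++) bits.set(hash(s, i));
    }
    public boolean mightContain(String s) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(s, i))) return false;   // definitely not present
        }
        return true;                                   // present, or a false positive
    }
    public static void main(String[] args) {
        // For n elements and error rate E, m is roughly 1.44*n*lg(1/E) and k is (m/n)*ln2, as described above
        SimpleBloomFilter bf = new SimpleBloomFilter(1 << 20, 7);
        bf.add("http://example.com/a");
        System.out.println(bf.mightContain("http://example.com/a"));  // true
        System.out.println(bf.mightContain("http://example.com/b"));  // very likely false
    }
}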
2. Hashing
Scope of application: a basic data structure for fast lookup and deletion; usually the total amount of data can be put into memory.
Basic principles and key points:
Choice of hash function: for strings, integers, permutations, there are corresponding concrete hashing methods.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open
addressing.
Extension:
The d in d-left hashing means multiple. Let's simplify the problem first and look at 2-left hashing. 2-left
hashing means splitting a hash table into two halves of equal length, called T1 and T2, and equipping T1 and T2
with separate hash functions h1 and h2. When storing a new key, both hash functions are computed,
giving two addresses h1[key] and h2[key]. Then check the position h1[key] in T1 and the position h2[key] in
T2, see which position already stores more keys (has had more collisions), and store the new key in the position with the lighter
load. If the loads are equal, for example both positions are empty or both already store one key, store the new key in the left
subtable T1, which is where the name 2-left comes from. When looking up a key, both hashes must be computed and both
positions searched.
Problem example:
1) From massive log data, extract the IP that visited Baidu the most times on a given day.
The number of distinct IPs is limited, at most 2^32, so consider using a hash to store the IPs directly in memory and then
do the counting.
3. bit-map
Scope of application: data can be quickly looked up, checked for duplicates, and deleted; generally the data range is within about 10 times the range of int.
Basic principles and key points: use a bit array to indicate whether certain elements exist, for example 8-digit phone numbers.
Extension: bloom filter can be regarded as an extension to bit-map
Problem example:
1) A file is known to contain some phone numbers, each 8 digits long; count the number of distinct numbers.
8 digits go up to 99,999,999, which needs about 99M bits, or roughly 10 MB of memory.
2) Find the number of non-repeated integers among 250 million integers, where memory is not enough to hold the 250 million integers.
Extend the bit-map by using 2 bits per number: 0 means not seen, 1 means seen once, 2 means
seen 2 or more times. Alternatively, instead of 2 bits, two ordinary bit-maps can be used together to simulate this
2-bit-map.
4. Heap
Scope of application: finding the top n of massive data, where n is relatively small so that the heap fits in memory.
Basic principles and key points: a max-heap is used to find the smallest n, and a min-heap to find the largest n. For example, to find the smallest n, compare
the current element with the largest element in the max-heap; if it is smaller than the largest element, replace the largest element. The
n elements left at the end are the smallest n. This is suitable for large data volumes where n is relatively small, because
all of the top n elements can be obtained in one scan, which is very efficient.
Extension: Double heap, a maximum heap combined with a minimum heap, can be used to maintain medians.
Problem example:
1) Find the largest 100 numbers among 1,000,000 numbers.
Just use a min-heap of 100 elements.
5. Double-layer bucket partitioning: in essence this is the idea of divide and conquer, with the emphasis on the technique of "dividing"!
Scope of application: the k-th largest element, the median, non-repeated or repeated numbers.
Basic principles and key points: because the range of the elements is too large to use a direct addressing table, the range is determined gradually through multiple rounds of
partitioning, and the final step is then performed within an acceptable range. The range can be narrowed multiple times; two layers is just an example.
Extension:
Problem example:
1) Find the number of non-repeated integers among 250 million integers, where memory space is not enough to hold the 250 million integers.
This is a bit like the pigeonhole principle: the number of possible integers is 2^32, so we can divide these 2^32 numbers into 2^8
regions (for example, one file per region), separate the data into the different regions, and then the different regions can be
solved directly with a bitmap. In other words, as long as there is enough disk space, it can be solved easily.
2) Find the median of 50 million ints.
This example is even more obvious than the one above. First divide the ints into 2^16 regions, then read the data once and count how many numbers fall
into each region. From the counts we can determine which region the median falls into and
which number within that region happens to be the median. In the second scan we then only count the numbers
that fall into that region.
In fact, if the values are int64 rather than int, the range can be reduced to an acceptable level after 3 such divisions:
first divide the int64s into 2^24 regions and determine which region contains the target number, then divide that region into 2^20
sub-regions and determine which sub-region contains it; the number of values in that sub-region is then only about 2^20, so a
direct addressing table can be used for the counting.
6. Database Index
Scope of application: insertion, deletion, update and query of large volumes of data.
Basic principles and key points: use the design and implementation methods of databases to handle the insertion, deletion, update and query of massive data.
7. Inverted index (Inverted index)
Scope of application: search engine, keyword query
Basic principles and key points: why is it called an inverted index? It is an indexing method used in full-text search to store, for each word,
a mapping to its storage locations in a document or a group of documents.
Taking English as an example, the following is the text to be indexed:
T0 = “it is what it is”
T1 = “what is it”
T2 = “it is a banana”
We can get the following reverse file index:
“a”: {2}
“banana”: {2}
“is”: {0, 1, 2}
“it”: {0, 1, 2}
“what”: {0, 1}
The search conditions "what", "is" and "it" will correspond to the intersection of the set.
A forward index is built to store the list of words of each document. A query on a forward index has to scan the documents in order and,
for each document, check the words it contains to verify whether it matches the query. In a forward index, the document occupies the central
position: each document points to the sequence of index entries (words) it contains. In an inverted index, the word points to the documents
containing it, and this inverted relationship is easy to see.
Extension:
Example problem: a document retrieval system that finds the files containing a given word, such as keyword search over common academic papers.
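A tiny in-memory sketch of an inverted index over the T0/T1/T2 example above: each word maps to the set of document ids containing it, and an AND query intersects those posting lists.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndex {
    private final Map<String, TreeSet<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
    }
    public Set<Integer> query(String... words) {
        Set<Integer> result = null;
        for (String w : words) {
            Set<Integer> docs = index.getOrDefault(w.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);        // intersection of the posting lists
        }
        return result == null ? Collections.emptySet() : result;
    }
    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(0, "it is what it is");
        idx.addDocument(1, "what is it");
        idx.addDocument(2, "it is a banana");
        System.out.println(idx.query("what", "is", "it"));   // [0, 1]
    }
}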
8. External sorting
Scope of application: sorting and de-duplicating big data.
Basic principles and key points: the merge method of external sorting, the principle of replacement selection with a loser tree, the optimal merge tree.
Extension:
Problem example:
1) There is a 1 GB file in which each line is a word; each word is at most 16 bytes, and the memory limit
is 1 MB. Return the 100 most frequent words.
This data has an obvious characteristic: a word may be up to 16 bytes, but the memory is only 1 MB, which is a bit too little for a hash,
so it can be used for sorting instead. The memory can serve as the input buffer.
9. Trie Tree
Scope of application: a large amount of data with many repetitions, but the number of distinct items is small enough to fit in memory.
Basic principles and key points: the implementation method, and the representation of a node's children.
Extension: compressed implementations.
Problem example:
1) There are 10 files of 1 GB each; every line of every file stores a user's query, and the queries of each file
may be repeated. Sort the queries by their frequency.
2) 10 million strings, some of which are identical (duplicates); all the duplicates need to be removed, keeping only the strings with no
duplicates. How would you design and implement this?
3) Finding popular queries: the query strings are highly repetitive; although the total is 10 million, there are no more
than 3 million after deduplication, each no longer than 255 bytes.
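A minimal trie sketch for counting string frequencies, matching the idea above that each node's keyword field stores the occurrence count; it assumes plain strings and keeps children in a HashMap.

import java.util.HashMap;
import java.util.Map;

public class TrieCounter {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count = 0;                         // times the word ending here was inserted
    }
    private final Node root = new Node();

    public void add(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.count++;
    }
    public int count(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return 0;
        }
        return cur.count;
    }
    public static void main(String[] args) {
        TrieCounter t = new TrieCounter();
        for (String w : new String[]{"hadoop", "hive", "hadoop"}) t.add(w);
        System.out.println(t.count("hadoop"));   // 2
        System.out.println(t.count("hbase"));    // 0
    }
}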
10. Distributed processing mapreduce
Scope of application: a large amount of data, but the number of distinct data kinds is small enough to fit into memory.
Basic principles and key points: hand the data to different machines for processing, partition the data, and reduce the results.
Extension:
Problem examples:
1) The canonical example application of MapReduce is a process to count the appearances of
each different word in a set of documents.
2) Massive data is distributed among 100 computers; find a way to efficiently compute the TOP 10 of this batch of data.
3) There are N machines in total, with N numbers on each machine, and each machine can store and process at most O(N) numbers.
How do you find the median of the N^2 numbers?
Classic problem analysis
Tens of millions or billions of data items (with duplicates); count the N items that appear most frequently. There are two cases: the data can be read
into memory at once, or it cannot.
Available ideas: trie tree + heap, database index, partitioned subset statistics, hashing, distributed computing, approximate statistics,
external sorting.
Whether the data can be read into memory at once actually refers to the amount of data after removing duplicates. If the deduplicated data
fits into memory, we can build a dictionary for the data, for example a map, hashmap, or trie, and then simply do the
counting. Of course, while updating the count of each item, we can also use a heap to maintain the N items with the most occurrences;
this increases the maintenance cost, though, and is less efficient than finding the top N after the counting is complete.
If the data cannot fit into memory, on the one hand we can consider whether the dictionary method above can be improved to adapt to this situation;
the change to make is to store the dictionary on the hard disk instead of in memory, which can follow the way databases store data.
Of course there is a better method, namely distributed computing, which is basically the map-reduce process: first,
the data can be partitioned onto different machines according to the data values, or the values after hashing the data (e.g. with md5),
so that each partition can be read into memory at once; different machines are then responsible for handling different value ranges.
This is essentially the map step. After obtaining the results, each machine only needs to take out its N items with the most occurrences, and then
aggregate them and select the N items with the most occurrences among all the data, which is essentially the reduce process.
Note that simply splitting the data evenly across machines does not give the correct answer, because one value may be spread evenly over the machines while another is concentrated entirely on one machine, even if both occur the same number of times. For example, suppose we want the 100 most frequent items and distribute 10 million records over 10 machines, finding the 100 most frequent items on each machine. After merging, this does not guarantee finding the true 100th item: the 100th most frequent item might occur 10,000 times in total but be split across the 10 machines, leaving only 1,000 copies on each, while items that rank ahead of it on a single machine, say with 1,001 occurrences all gathered on that machine, would crowd it out. Even if we let each machine select its top 1,000 and then merge, errors can still occur, because a large number of items with 1,001 occurrences may be clustered together. Therefore the data cannot be divided across machines arbitrarily; instead, the hashed values must be used to map items to machines, so that each machine handles a distinct range of values.
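To make the partition-then-merge idea concrete, here is a minimal single-process Java sketch that simulates the approach: records are routed to "machines" by the hash of their value, each partition counts occurrences and keeps its local top N with a min-heap, and the partial results are merged. The partition count, the value of N, the sample data, and the in-memory lists standing in for machines are illustrative assumptions, not part of the original text.

import java.util.*;

public class DistributedTopN {
    // Route every record to a partition by the hash of its value,
    // so all occurrences of the same value land on the same "machine".
    static int partitionOf(String value, int partitions) {
        return Math.floorMod(value.hashCode(), partitions);
    }

    // Local top-N on one partition: count occurrences, keep a size-N min-heap.
    static List<Map.Entry<String, Long>> localTopN(List<String> records, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String r : records) counts.merge(r, 1L, Long::sum);
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll();   // evict the current minimum
        }
        return new ArrayList<>(heap);
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "a", "c", "a", "b", "d", "a", "b");
        int partitions = 3, n = 2;

        // "Map": distribute records by hash so each value is confined to one machine.
        List<List<String>> machines = new ArrayList<>();
        for (int i = 0; i < partitions; i++) machines.add(new ArrayList<>());
        for (String r : data) machines.get(partitionOf(r, partitions)).add(r);

        // "Reduce": merge the per-partition candidates and take the global top N.
        List<Map.Entry<String, Long>> candidates = new ArrayList<>();
        for (List<String> m : machines) candidates.addAll(localTopN(m, n));

        candidates.stream()
                  .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                  .limit(n)
                  .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}

Because hashing confines each value to a single partition, the per-partition counts are exact, which is exactly why random even splitting fails and hash partitioning does not.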
External sorting consumes a lot of I/O and is not very efficient. The distributed method above can also be used on a single machine: split the full data set into several sub-files by value range, process the sub-files one by one, and finally merge the words and their frequencies, which can be done with an external-sort-style merge.
Approximate calculation is another option: exploiting properties of natural language, we can use only the words that are genuinely the most common as the dictionary, so that this dictionary fits in memory.
16. Interview questions from aboutyun
1. Explain the difference between a value object and a reference object.
2. Explain your understanding of the reflection mechanism and its uses.
What are the differences between Vector and LinkedList? What are the differences between HashMap and Hashtable, and their respective pros and cons?
3. List the ways to implement threads. How is synchronization achieved?
4. It was a chart-based question; I forgot the details.
5. List at least five design patterns, and use code or a UML class diagram to describe the principles of two of them.
6. Talk about the technology you are currently studying, and describe the technical difficulties you encountered in recent projects and how you solved them.
17. Algorithm interview question provided by Batu:
Given records of a user's mobile number, the location where it appeared, and the time of appearance:
111111111 2 2014-02-18 19:03:56.123445 133
222222222 1 2013-03-14 03:18:45.263536 241
333333333 3 2014-10-23 17:14:23.176345 68
222222222 1 2013-03-14 03:20:47.123445 145
333333333 3 2014-09-15 15:24:56.222222 345
222222222 2 2011-08-30 18:13:58.111111 145
222222222 2 2011-08-30 18:18:24.222222 130
Sort by time (grouped by mobile number, as in the expected result; a sketch follows the expected result).
The expected result is:
222222222 2 2011-08-30 18:13:58.111111 145
222222222 2 2011-08-30 18:18:24.222222 130
222222222 1 2013-03-14 03:18:45.263536 24
111111111 ~~~~~~~~
333333333 ~~~~~~~
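A minimal Java sketch of the required ordering, assuming each record is whitespace-separated as mobile, location, date, time, value: records are grouped by mobile number and each group is sorted by its timestamp. In a real MapReduce job this would typically be done with a composite key and secondary sort; the class and field positions here are illustrative assumptions, and the group ordering simply follows first appearance since the expected output does not fully specify it.

import java.util.*;
import java.util.stream.*;

public class SortByTime {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "111111111 2 2014-02-18 19:03:56.123445 133",
            "222222222 1 2013-03-14 03:18:45.263536 241",
            "333333333 3 2014-10-23 17:14:23.176345 68",
            "222222222 1 2013-03-14 03:20:47.123445 145",
            "333333333 3 2014-09-15 15:24:56.222222 345",
            "222222222 2 2011-08-30 18:13:58.111111 145",
            "222222222 2 2011-08-30 18:18:24.222222 130");

        // Group by the mobile number (first field), preserving first-appearance order of groups.
        Map<String, List<String>> byMobile = lines.stream()
            .collect(Collectors.groupingBy(l -> l.split("\\s+")[0],
                                           LinkedHashMap::new,
                                           Collectors.toList()));

        // Within each group, sort by "date time" (3rd and 4th fields);
        // ISO-like timestamps sort correctly as plain strings.
        byMobile.values().forEach(group -> {
            group.sort(Comparator.comparing((String l) -> {
                String[] f = l.split("\\s+");
                return f[2] + " " + f[3];
            }));
            group.forEach(System.out::println);
        });
    }
}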
18. Seven interview questions provided by Xiangfu:
HDFS:
1. The default block size is 64 MB; what is the impact of changing it to 128 MB?
2. What is the underlying principle?
3. What are the differences and connections between the NameNode and the SecondaryNameNode?
MapReduce:
4. Describe the entire MapReduce process, using the details of the WordCount example so that it is clear (focus on explaining the shuffle); a minimal WordCount sketch follows this list.
5. Do you have any experience tuning Hadoop? (Tuning usually starts with parameter tuning.)
6. How large is the load on a single point, and how do you balance the load? (A Partitioner can be used.)
7. How would you implement Top 10?
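As a reference point for question 4, here is a minimal Hadoop WordCount sketch in the standard new-API style: the mapper emits (word, 1), the framework shuffles and groups by word, and the reducer sums the counts. The job wiring and command-line paths are generic assumptions, not taken from the original answers.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);   // the shuffle will group these by word
            }
        }
    }

    // Reduce: the shuffle delivers (word, [1, 1, ...]); sum the list.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}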
19. Thirteen interview questions provided by mo•mo•ring:
xxxx Software Company
1. What advantages do you have that qualify you for this position?
2. Advantages and reasons (at least 3).
3. Optimization.
4. Write a bubble sort program.
5. Underlying storage design.
6. Career planning.
xxx Network Company
1. Database
1.1 The first, second, and third normal forms.
1.2 Given two data tables, optimize the tables (the specific fields were not remembered; it was about product orders and suppliers).
1.3 Based on your practical experience, explain how to prevent full table scans.
2. The seven-layer network protocol model.
3. Multi-threading.
4. The difference between the HashTable and HashMap collections.
5. Operating system fragmentation.
6. What are its advantages, and in which situations is it used?
7. What is the metastore used for?
20. Eighteen interview questions provided by Clouds:
1. What are the commands to install ssh online and to decompress files?
2. What is the command to append all the public keys to the authorized_keys file? Is this command executed as the root user?
3. What is the order for starting and stopping the individual services in a Hadoop HA cluster?
4. How many replicas of a block does HDFS keep by default? What is the default block size?
5. Is the metadata stored on the NameNode itself or on other nodes such as the DataNodes? Do the DataNodes hold metadata?
6. Which of the following programs is usually started on the same node as the NameNode?
7. Which of the following programs is responsible for HDFS data storage?
8. What is the main role of ZooKeeper in a Hadoop HA cluster, and what are the commands to start it and check its status?
9. What should be the focus when designing an HBase data model? How many column families is it most appropriate to define, and why?
10. How can the read and write performance of the HBase client be improved? Please give an example.
11. When developing MapReduce against a Hadoop HA cluster, how do you set the value of the quorum property in the Configuration?
12. What algorithms did you use during Hadoop development, and what are their application scenarios?
13. How do you release (deploy) a MapReduce program? If a third-party jar package is involved, how do you handle it?
14. What cluster operation and maintenance tools have you used in actual work? Please explain the function of each.
15. What is the role of the combiner in Hadoop?
16. What is the principle of I/O, and how many I/O models are there?
17. Which I/O model does Windows use, and which does Linux use?
18. How does a machine cope with so many incoming requests? How is high concurrency achieved? How does a request arise, how is it handled on the server side, how is the response finally returned to the user, and how does the operating system control the whole chain?
21. Eleven interview questions provided by ****:
1. If the client goes down while the third replica is being copied, how does HDFS recover to ensure that the third replica is eventually written? When writing a block, is the DataNode written first or the NameNode?
2. Implement quicksort on the spot (written on site).
3. How is memory allocated?
4. The poisoned wine problem: there are 1,000 barrels of wine, of which 1 barrel is poisoned, and the poison takes effect one week after ingestion. At least how many mice are needed to find the poisoned barrel within one week?
5. Implement a queue using stacks.
6. Reverse a linked list.
7. How does the multi-threaded producer-consumer model work? What approaches does concurrent multithreading use: synchronized pessimistic locks, mutual exclusion? How do you write the synchronization to improve efficiency?
8. Given 940 million numbers, find the duplicates with the minimum number of comparisons, and write a program to implement it.
9. Is it passed by value or by address?
10. When handling multithreading, what happens when another thread is waiting?
22. Eighteen interview questions provided by ****:
1. How many GB of logs does an online mall produce in one day?
2. How many log records are there (before cleaning)?
3. How many daily visits are there?
4. What is the approximate number of registered users?
5. Are there any other logs in addition to the Apache access log?
6. Suppose we have other logs; can we perform other business analysis on them? What would these analyses be?
7. Question: how many servers do you have?
8. Question: how much memory do your servers have?
9. Question: how are your servers distributed? (This is about geographical distribution; it is best to also talk about it from the rack perspective.)
10. Question: what do you usually do in the company? (some suggestions)
Here are some questions I did not understand very well:
11. How do you pre-split (pre-partition) regions in HBase?
12. How do you provide an interface for the web front end to access the data? (HTable can provide access, but how do you query multiple versions of the same record?)
13. Does the HTable API have thread-safety issues? Should it be a singleton or multiple instances in a program?
14. In the company's business (mainly an online mall), roughly how many tables does HBase have, how are the clusters set up, and what kind of data is stored?
15. What about HBase concurrency issues?
The following are Storm questions:
16. Can a metaq message queue, a zookeeper cluster, and a storm cluster (including zeromq, jzmq, and storm itself) implement the recommendation system for the mall? Is any other middleware needed?
17. How do you count words in Storm? (From my own reading, Storm is always streaming and seems to lack the ability to accumulate data; after processing, data is distributed directly to the next component.)
18. What are some other frequently asked Storm interview questions?
23. Interview questions provided by Fei Ge (hadoop, monthly salary 13k):
1. What is your cluster size?
Development cluster: 10 machines (8 usable), 8-core CPUs.
2. How is your data imported and exported, and into which database?
Import before processing: the data is imported into the HDFS file system with the hadoop command.
Export after processing: the results are exported from Hive to a MySQL database through Sqoop, for use by the reporting layer. (A sketch of the import step follows.)
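A hedged Java sketch of the "import into HDFS" step using the standard Hadoop FileSystem API; the NameNode address and the local and HDFS paths are placeholders. In practice this step is often just "hadoop fs -put", and the export step would be done with Sqoop rather than Java code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Equivalent to: hadoop fs -put /data/logs/access.log /warehouse/raw/access.log
            fs.copyFromLocalFile(new Path("/data/logs/access.log"),
                                 new Path("/warehouse/raw/access.log"));
        }
    }
}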
3. How much data does your business have? How many rows of data are there? (All three companies I interviewed with asked this question.)
Development used only part of the data, not the full amount: nearly 100 million rows (roughly 80,000,000; nobody will care particularly about this figure during the interview).
4. When processing data, do you read directly from a database or from text data?
The log data is imported into HDFS and then processed.
5. Roughly how many Hive HQL statements have you written?
Not sure; I did not keep count while writing them.
6. How many jobs do you submit, and how long do these jobs take? (All three companies I interviewed with asked this question.)
No statistics were kept; including tests, there were quite a lot.
7. What is the difference between hive and hbase?
8. What are your main tasks in the project?
Using hive to analyze data
9. What difficulties did you encounter in the project, and how did you solve them?
Some tasks ran for too long and the failure rate was too high. Checking the logs showed that they had not finished executing; the reason was that Hadoop's job/task timeout was too short (relative to the cluster's capability), so setting it longer fixed the problem (see the sketch below).
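A minimal sketch of that fix, assuming the Hadoop 2.x property name mapreduce.task.timeout in milliseconds (the Hadoop 1.x name was mapred.task.timeout); the 30-minute value is only an example, not the value used in the original project.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the per-task timeout from the 10-minute default to 30 minutes (milliseconds).
        conf.setLong("mapreduce.task.timeout", 30 * 60 * 1000L);
        Job job = Job.getInstance(conf, "long-running job");
        // ... set mapper, reducer, input and output as usual, then submit the job.
    }
}

The same value can also be set per job on the command line or in mapred-site.xml for the whole cluster.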
10. Have you written UDF functions yourself? Which ones have you written?
I have not written any.
11. How much data is there when your project submits a job? (All three companies I interviewed with asked this question.)
Not sure exactly what was being asked.
12. How much data is output after the reduce?
13. How many GB of logs does the online mall generate in one day? About 4 TB.
14. How many log records are there (before cleaning)? 7-8 million.
15. How many daily visits are there? On the order of millions.
16. What is the approximate number of registered users? Not sure; hundreds of thousands.
17. Are there any other logs in addition to the Apache access log? Yes, user-follow (attention) information.
18. Suppose we have other logs; can we perform other business analysis on them? What would these analyses be?
24. One interview question provided by aboutyun:
There are tens of millions of text messages, with duplicates, saved as text files, one message per line. In 5 minutes, find the 10 messages that are repeated the most.
Analysis:
The conventional method is to sort first and then traverse once to find the 10 most repeated messages, but sorting has complexity of at least O(n log n).
Instead, design a hash table, hash_map<string, int>: read the 10 million text messages one by one, count the repetitions in the hash table, and maintain a table of the 10 most frequent messages. This finds the top 10 in a single traversal, with complexity O(n). A sketch follows.
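A minimal Java sketch of the hash-map-plus-min-heap idea described above; reading the messages from standard input and the tab-separated output format are illustrative assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.*;

public class TopSms {
    public static void main(String[] args) throws Exception {
        Map<String, Long> counts = new HashMap<>();

        // One pass: count each distinct message (one message per line).
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                counts.merge(line, 1L, Long::sum);
            }
        }

        // Keep only the 10 most frequent messages with a size-10 min-heap.
        PriorityQueue<Map.Entry<String, Long>> top10 =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            top10.offer(e);
            if (top10.size() > 10) top10.poll();   // drop the least frequent candidate
        }

        // Print from most to least frequent.
        top10.stream()
             .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
             .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}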
25. Five interview questions provided by Beijing-Nansang (hadoop, monthly salary 12k):
1. The running process of a job (the process of submitting a job)?
2. What are the application scenarios of the various frameworks in the Hadoop ecosystem?
3. There were also many multiple-choice questions.
4. Written questions:
What are the differences between the file formats in Hive: RCFile, TextFile, and SequenceFile? For the same data, which of the three formats takes up the most space? ...and so on.
There was also a question about HA and compression in Hadoop.
5. Scenario: Flume collects many small files, and I need to write an MR job to merge these files during processing (this is an MR optimization; you cannot let every small file get its own map task).
Their company mainly works on China Telecom's traffic billing and specializes in writing MR jobs.
26. Two interview questions provided by Yan Emperor Initialization:
You do not have to finish all of the following questions; just choose the one you are good at.
Question 1: RTB Advertising DSP Algorithm Competition
Please carry out the modeling and analysis required by the competition, and record the entire analysis and processing workflow and the result of each step in detail.
Algorithm Competition Home Page: /cn/
Algorithm Competition Data Download Address:
/share/link?shareid=1069189720&uk=3090262723#dir
Question 2: CookieID Recognition
We have the Internet access logs of M users over N days; see the sample data for details.
The field structure is as follows:
ip
ad_id
time_stamp
url (string, URL)
ref (string, referer)
ua (string, User Agent)
dest_ip
cookie (string, cookie)
day_id
The value of the cookie is as follows:
bangbigtip2=1; bdshare_firstime=1374654651270;
CNZZDATA30017898=cnzz_eid%3D2077433986-1374654656-http%253A%252F%252Fsh.
%26ntime%3D1400928250%26cnzz_a%3D0%26ltime%3D1400928244483%26rtime%3D63;
Hm_lvt_f5127c6793d40d199f68042b8a63e725=1395547468,1395547513,1395758399,13957594
68; id58=05dvZ1HvkL0TNy7GBv7gAg==;
Hm_lvt_3bb04d7a4ca3846dcc66a99c3e861511=1385294705;
__utma=253535702.2042339925.1400424865.1400424865.1400928244.2;
__utmz=253535702.1400424865.1.=(direct)|utmccn=(direct)|utmcmd=(none); city=sh;
pup_bubble=1; __ag_cm_=1400424864286; myfeet_tooltip=end; ipcity=sh%7C%u4E0A%u6D77
One of the attributes in the cookie can identify a user; we call it the cookieID.
Please identify the cookieID from the sample data.
A detailed description of the analysis process is required.
27. Seven interview questions provided by aboutyun:
1. Explain the two concepts of "hadoop" and "hadoop ecosystem".
2. Explain the basic composition of Hadoop 2.0.
3. Compared with HDFS1.0, what are the main improvements in HDFS2.0?
4. Try to use "Step 1, Step 2, Step 3..." to explain the basic process of running applications in YARN.
5. Explain whether "MapReduce 2.0" and "YARN" are the same thing.
6. In MapReduce 2.0, what are the main functions of the MRAppMaster, and how is task fault tolerance implemented?
7. Why did YARN come about? What problems does it solve, and what advantages does it have?
28. Six interview questions provided by Nature Yuezhen Liujun:
1. How many machines are in the cluster, how much data is there, and how much is processed every day?
2. Do you know about automated operation and maintenance? Have you done automated O&M management?
3. How many replicas of the data do you keep? If the data exceeds the storage capacity, how do you handle it?
4. How do you relieve the pressure caused by multiple jobs running at the same time? How do you optimize? Explain your thinking.
5. What data do you store in HBase?
6. What performance indicators can your Hive achieve when processing data?
29. One interview question provided by Xiatian:
1. Please describe how HA is implemented for Hadoop 1.
30. Eighteen interview questions provided by Feng Linmuyu: